General Purpose Computing using Graphics Hardware
Speaker Notes

  • Fluid flow, level set segmentation, DTI image
  • One of the major debates you’ll see in graphics in the coming years is whether the scheduling and work distribution logic should be provided as highly optimized hardware, or be implemented as a software program on the programmable cores.
  • Pack the core full of ALUs. We are not going to increase our core’s ability to decode instructions: we will decode 1 instruction and execute it on all 8 ALUs.
  • How can we make use of all these ALUs?
  • Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
  • So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that I’ve also replicated part of the context to store execution state for the 8 fragments. For example, I’d replicate the registers.
  • We continue this process, moving to a new group each time we encounter a stall. If we have enough groups there will always be some work to do, and the processing core’s ALUs never go idle.
  • Described adding contexts. In reality there’s a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using that storage as a traditional data cache, GPUs choose to use it to hold contexts.
  • Shading performance relies on large-scale interleaving. Number of interleaved groups per core: ~20-30. Could be separate hardware-managed contexts or software-managed using techniques.
  • Fewer contexts fit on chip, so the chip can hide less latency and there is a higher likelihood of stalls.
  • Lose performance when shaders use a lot of registers.
  • 128 simultaneous threads on each core
  • Drive these ALUs using explicit SIMD instructions, or implicitly via HW-determined sharing.
  • Numbers are the relative cost of communication.
  • Runs on each thread, i.e., in parallel
  • G = grid size, B = block size

Presentation Transcript

  • General Purpose Computing using Graphics Hardware
    Hanspeter Pfister
    Harvard University
  • Acknowledgements
    Won-Ki Jeong, Harvard University
    Kayvon Fatahalian, Stanford University
    2
  • GPU (Graphics Processing Unit)
    PC hardware dedicated for 3D graphics
    Massively parallel SIMD processor
    Performance pushed by game industry
    3
    NVIDIA SLI System
  • GPGPU
    General Purpose computation on the GPU
    Started in computer graphics research community
    Mapping computational problems to graphics rendering pipeline
    4
    Image Courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong
  • Why GPU for computing?
    GPU is fast
    Massively parallel
    CPU : ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core)
    GPU : ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
    High memory bandwidth
    Programmable
    NVIDIA CUDA, DirectX Compute Shader, OpenCL
    High precision floating point support
    64-bit floating point (IEEE 754)
    Inexpensive desktop supercomputer
    NVIDIA Tesla C1060 : ~1 TFLOPS @ $1000
    5
  • FLOPS
    6
    Image Courtesy NVIDIA
  • Memory Bandwidth
    7
    Image Courtesy NVIDIA
  • GPGPU Biomedical Examples
    8
    Level-Set Segmentation (Lefohn et al.)
    CT/MRI Reconstruction (Sumanaweera et al.)
    Image Registration (Strzodka et al.)
    EM Image Processing (Jeong et al.)
  • Overview
    GPU Architecture Overview
    GPU Programming Overview
    Programming Model
    NVIDIA CUDA
    OpenCL
    Application Example
    CUDA ITK
    9
  • 1. GPU Architecture Overview
    Kayvon Fatahalian
    Stanford University
    10
  • What’s in a GPU?
    11
    [Block diagram: eight Compute Cores with Tex units, plus fixed-function Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor (HW or SW?)]
    Heterogeneous chip multi-processor (highly tuned for graphics)
  • CPU-“style” cores
    12
    [Diagram: a single CPU-style core: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache]
  • Slimming down
    13
    Idea #1: Remove components that help a single instruction stream run fast
    [Diagram: the slimmed-down core keeps only Fetch/Decode, ALU (Execute), and Execution Context]
  • Two cores (two threads in parallel)
    14
    [Diagram: two slimmed-down cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context; thread 1 and thread 2 each run the compiled fragment shader:]
    <diffuseShader>:
    sample r0, v4, t0, s0
    mul r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul o0, r0, r3
    mul o1, r1, r3
    mul o2, r2, r3
    mov o3, l(1.0)
  • Four cores (four threads in parallel)
    15
    [Diagram: four cores, each with its own Fetch/Decode, ALU (Execute), and Execution Context]
  • Sixteen cores (sixteen threads in parallel)
    16
    [Diagram: sixteen cores, each with its own ALU and context]
    16 cores = 16 simultaneous instruction streams
  • Instruction stream sharing
    17
    But… many threads should be able to share an instruction stream!
    <diffuseShader>:
    sample r0, v4, t0, s0
    mul r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul o0, r0, r3
    mul o1, r1, r3
    mul o2, r2, r3
    mov o3, l(1.0)
  • Recall: simple processing core
    18
    Fetch/
    Decode
    ALU
    (Execute)
    Execution
    Context
  • Add ALUs
    19
    Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs
    [Diagram: one Fetch/Decode unit feeding ALUs 1 through 8, with eight per-fragment contexts (Ctx) and Shared Ctx Data: SIMD processing]
  • Modifying the code
    20
    Original compiled shader: processes one thread using scalar ops on scalar registers
    <diffuseShader>:
    sample r0, v4, t0, s0
    mul r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul o0, r0, r3
    mul o1, r1, r3
    mul o2, r2, r3
    mov o3, l(1.0)
  • Modifying the code
    21
    New compiled shader: processes 8 threads using vector ops on vector registers
    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul vec_r3, vec_v0, cb0[0]
    VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul vec_o0, vec_r0, vec_r3
    VEC8_mul vec_o1, vec_r1, vec_r3
    VEC8_mul vec_o2, vec_r2, vec_r3
    VEC8_mov vec_o3, l(1.0)
  • Modifying the code
    22
    [Diagram: fragments 1 through 8 mapped onto ALUs 1 through 8 of one core, all driven by the same <VEC8_diffuseShader> instruction stream shown on the previous slide]
  • 128 threads in parallel
    23
    16 cores = 128 ALUs
    = 16 simultaneous instruction streams
  • But what about branches?
    24
    [Diagram: ALUs 1 through 8 over time (clocks), all executing the same instruction stream:]
    <unconditional shader code>
    if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
    } else {
    x = 0;
    refl = Ka;
    }
    <resume unconditional shader code>
  • But what about branches?
    25
    [Diagram: the same code, with a per-ALU predicate mask; ALUs 1 through 3 take the true (T) branch, ALUs 4 through 8 the false (F) branch]
  • But what about branches?
    26
    [Diagram: both sides of the branch are executed one after the other; ALUs whose predicate does not match the side being executed are masked off]
    Not all ALUs do useful work!
    Worst case: 1/8 performance
  • But what about branches?
    27
    [Diagram: execution of the unconditional code resumes in lockstep once both branch paths have been processed]
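    The same penalty appears in CUDA code: threads of one 32-wide warp that take different sides of a branch are serialized, with inactive lanes masked off. A minimal illustrative kernel (not from the slides) that mirrors the shader above:
    __global__ void shade(const float* x, float* refl, float e, float Ks, float Ka)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float xi = x[i];
        if (xi > 0.0f) {
            float y = powf(xi, e);    // executed only by lanes where xi > 0
            y *= Ks;
            refl[i] = y + Ka;
        } else {
            refl[i] = Ka;             // executed only by the remaining lanes
        }
    }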
  • Clarification
    28
    SIMD processing does not imply SIMD instructions
    • Option 1: Explicit vector instructions
    • Intel/AMD x86 SSE, Intel Larrabee
    • Option 2: Scalar instructions, implicit HW vectorization
    • HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
    • NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures
    In practice: 16 to 64 threads share an instruction stream
  • Stalls!
    Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
    Texture access latency = 100’s to 1000’s of cycles
    We’ve removed the fancy caches and logic that helps avoid stalls.
    29
  • But we have LOTS of independent threads.
    Idea #3:
    Interleave processing of many threads on a single core to avoid stalls caused by high latency operations.
    30
  • Hiding stalls
    31
    [Diagram: one core (Fetch/Decode, 8 ALUs, 8 Ctx slots, Shared Ctx Data) running threads 1 … 8 over time (clocks)]
  • Hiding stalls
    32
    [Diagram: the context storage is split into four groups, threads 1 … 8, 9 … 16, 17 … 24, and 25 … 32, that the core can switch between]
  • Hiding stalls
    33
    [Diagram: when group 1 (threads 1 … 8) stalls, the core switches to group 2, which is runnable]
  • Hiding stalls
    34
    [Diagram: as each group stalls in turn, the core moves on to the next runnable group]
  • Hiding stalls
    35
    [Diagram: with four groups, the stall in each group is hidden by running the other three; by the time group 4 stalls, group 1 is runnable again]
  • Throughput!
    36
    [Diagram: each group starts, runs until it stalls, and finishes later than it would have running alone, but the core is never idle]
    Increase the run time of one group to maximize the throughput of many groups
  • Storing contexts
    37
    [Diagram: one core (Fetch/Decode, 8 ALUs) with a 32KB pool of on-chip context storage]
  • Twenty small contexts
    38
    (maximal latency hiding ability)
    [Diagram: the context pool divided into 20 small contexts]
  • Twelve medium contexts
    39
    [Diagram: the context pool divided into 12 medium contexts]
  • Four large contexts
    40
    (low latency hiding ability)
    [Diagram: the context pool divided into 4 large contexts]
  • GPU block diagram key
    = single “physical” instruction stream fetch/decode
    (functional unit control)
    = SIMD programmable functional unit (FU), control shared with other
    functional units. This functional unit may contain multiple 32-bit “ALUs”
    = 32-bit mul-add unit
    = 32-bit multiply unit
    = execution context storage
    = fixed function unit
    41
  • Example: NVIDIA GeForce GTX 280
    NVIDIA-speak:
    240 stream processors
    “SIMT execution” (automatic HW-managed sharing of instruction stream)
    Generic speak:
    30 processing cores
    8 SIMD functional units per core
    1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock)
    Best case: 240 mul-adds + 240 muls per clock
    1.3 GHz clock
    30 * 8 * (2 + 1) * 1.3 = 933 GFLOPS
    Mapping data-parallelism to chip:
    Instruction stream shared across 32 threads
    8 threads run on 8 SIMD functional units in one clock
    42
  • GTX 280 core
    43
    [Die diagram: 30 processing cores grouped with Tex units, plus fixed-function Zcull/Clip/Rast, Output Blend, and the Work Distributor]
  • Example: ATI Radeon 4870
    AMD/ATI-speak:
    800 stream processors
    Automatic HW-managed sharing of scalar instruction stream (like “SIMT”)
    Generic speak:
    10 processing cores
    16 SIMD functional units per core
    5 mul-adds per functional unit (5 * 2 =10 flops/clock)
    Best case: 800 mul-adds per clock
    750 MHz clock
    10 * 16 * 5 * 2 * .75 = 1.2 TFLOPS
    Mapping data-parallelism to chip:
    Instruction stream shared across 64 threads
    16 threads run on 16 SIMD functional units in one clock
    44
  • ATI Radeon 4870 core
    [Die diagram: 10 processing cores grouped with Tex units, plus fixed-function Zcull/Clip/Rast, Output Blend, and the Work Distributor]
    45
  • Summary: three key ideas
    Use many “slimmed down cores” to run in parallel
    Pack cores full of ALUs (by sharing instruction stream across groups of threads)
    Option 1: Explicit SIMD vector instructions
    Option 2: Implicit sharing managed by hardware
    Avoid latency stalls by interleaving execution of many groups of threads
    When one group stalls, work on another group
    46
  • 2. GPU Programming Models
    Programming Model
    NVIDIA CUDA
    OpenCL
    47
  • Task parallelism
    Distribute the tasks across processors based on dependency
    Coarse-grain parallelism
    48
    [Diagram: a task dependency graph of nine tasks, and the task assignment across 3 processors (P1, P2, P3) over time]
  • Data parallelism
    Run a single kernel over many elements
    Each element is independently updated
    Same operation is applied on each element
    Fine-grain parallelism
    Many lightweight threads, easy to switch context
    Maps well to ALU heavy architecture : GPU
    49
    [Diagram: the same kernel applied to every element of the data array, one element per processing unit P1 … Pn]
  • GPU-friendly Problems
    Data-parallel processing
    High arithmetic intensity
    Keep GPU busy all the time
    Computation offsets memory latency
    Coherent data access
    Access large chunk of contiguous memory
    Exploit fast on-chip shared memory
    50
  • The Algorithm Matters
    • Jacobi: Parallelizable (reads only the previous iterate, writes a new one; see the CUDA sketch below)
    for(int i=0; i<num; i++)
    {
    v_new[i] = (v_old[i-1] + v_old[i+1])/2.0;
    }
    • Gauss-Seidel: Difficult to parallelize (updates in place, so each element depends on the element just written)
    for(int i=0; i<num; i++)
    {
    v[i] = (v[i-1] + v[i+1])/2.0;
    }
    51
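    A minimal CUDA sketch of one Jacobi sweep (illustrative kernel, not from the slides): every interior element reads only the old buffer, so each can be updated by an independent thread.
    __global__ void jacobiStep(const float* v_old, float* v_new, int num)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < num - 1)
            v_new[i] = 0.5f * (v_old[i-1] + v_old[i+1]);
    }
    // Host side: swap the v_old and v_new pointers between iterations.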
  • Example: Reduction
    Serial version (O(N))
    for(int i=1; i<N; i++)
    {
    v[0] += v[i];
    }
    Parallel version (O(log N)), assuming N is a power of two; a CUDA version is sketched below
    width = N/2;
    while(width >= 1)
    {
    for(int i=0; i<width; i++)
    {
    v[i] += v[i+width]; // computed in parallel
    }
    width /= 2;
    }
    52
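    A hedged CUDA sketch of the same tree reduction using shared memory (kernel name and the block size of 256 are illustrative, not from the slides); each block produces one partial sum, and a second launch or a short CPU loop combines the partial sums.
    __global__ void reduceSum(const float* g_in, float* g_out)
    {
        __shared__ float s[256];
        int tid = threadIdx.x;
        s[tid] = g_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        // same width-halving loop as above, but per block and in fast shared memory
        for (int width = blockDim.x / 2; width >= 1; width /= 2)
        {
            if (tid < width)
                s[tid] += s[tid + width];
            __syncthreads();
        }
        if (tid == 0)
            g_out[blockIdx.x] = s[0];
    }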
  • GPU programming languages
    Using graphics APIs
    GLSL, Cg, HLSL
    Computing-specific APIs
    DX 11 Compute Shaders
    NVIDIA CUDA
    OpenCL
    53
  • NVIDIA CUDA
    C-extension programming language
    No graphics API
    Supports debugging tools
    Extensions / API
    Function type : __global__, __device__, __host__
    Variable type : __shared__, __constant__
    Low-level functions
    cudaMalloc(), cudaFree(), cudaMemcpy(),…
    __syncthreads(), atomicAdd(), …
    Program types
    Device program (kernel) : runs on the GPU
    Host program : runs on the CPU to call device programs
    54
  • CUDA Programming Model
    Kernel
    GPU program that runs on a thread grid
    Thread hierarchy
    Grid : a set of blocks
    Block : a set of threads
    Grid size * block size = total # of threads
    55
    [Diagram: a kernel is launched on a grid; the grid consists of Block 1, Block 2, …, Block n, and each block consists of many threads]
  • CUDA Memory Structure
    56
    [Diagram: ALUs on the GPU core, on-chip GPU shared memory, GPU global memory (DRAM) on the graphics card, and PC memory (DRAM); relative cost of communication roughly 1 (shared), 200 (global), 4000 (PC memory)]
    Memory hierarchy
    PC memory : off-card
    GPU Global : off-chip / on-card
    Shared/register/cache : on-chip
    The host can read/write global memory
    Threads within a block communicate using shared memory
  • Synchronization
    Threads in the same block can communicate using shared memory
    No HW global synchronization function yet
    __syncthreads()
    Barrier for threads only within the current block
    __threadfence()
    Flushes global memory writes to make them visible to all threads
    57
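    For example (an illustrative sketch, not from the slides), shared memory plus __syncthreads() lets the threads of one block exchange data; anything that must be visible across blocks goes through global memory, typically by splitting the work into separate kernel launches.
    __global__ void reverseTile(const float* in, float* out)
    {
        __shared__ float tile[256];          // one tile per block, blockDim.x == 256 assumed
        int tid  = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[tid] = in[base + tid];
        __syncthreads();                     // barrier: all loads done before any read
        out[base + tid] = tile[blockDim.x - 1 - tid];
    }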
  • Example: CPU Vector Addition
    58
    // Pair-wise addition of vector elements
    // CPU version : serial add
    void vectorAdd(float* iA, float* iB, float* oC, int num)
    {
    for(int i=0; i<num; i++)
    {
    oC[i] = iA[i] + iB[i];
    }
    }
  • Example: CUDA Vector Addition
    59
    // Pair-wise addition of vector elements
    // CUDA version : one thread per addition
    __global__ void
    vectorAdd(float* iA, float* iB, float* oC)
    {
    int idx = threadIdx.x
    + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
    }
  • Example: CUDA Host Code
    60
    float* h_A = (float*) malloc(N * sizeof(float));
    float* h_B = (float*) malloc(N * sizeof(float));
    // … initialize h_A and h_B
    // allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));
    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
    // execute the kernel on N/256 blocks of 256 threads each
    vectorAdd<<< N/256, 256>>>( d_A, d_B, d_C );
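    The slide stops at the kernel launch; completing the round trip uses the standard calls below (not shown on the slide), where h_C is assumed to be a host array of N floats allocated like h_A.
    // copy the result back to the host and release device memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);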
  • OpenCL (Open Computing Language)
    First industry-standard language for heterogeneous computing
    Based on C language
    Platform independent
    NVIDIA, ATI, Intel, ….
    Data and task parallel compute model
    Use all computational resources in system
    CPU, GPU, …
    Work-item : same as thread / fragment / etc..
    Work-group : a group of work-items
    Work-items in the same work-group can communicate
    Execute multiple work-groups in parallel
    61
  • OpenCL program structure
    Host program (CPU)
    Platform layer
    Query compute devices
    Create context
    Runtime
    Create memory objects
    Compile and create kernel program objects
    Issue commands (i.e., kernel launching) to command-queue
    Synchronization of commands
    Clean up OpenCL resources
    Kernel (CPU, GPU)
    C-like code with some extensions
    Runs on compute device
    62
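    A rough host-side sketch of the steps above for a hypothetical vecAdd kernel (error checking omitted; the kernel source string src and the function name are assumptions, not from the slides):
    #include <CL/cl.h>

    void run_vecadd(const char* src, size_t n, float* h_a, float* h_b, float* h_c)
    {
        cl_platform_id platform; cl_device_id device; cl_int err;

        // platform layer: query a compute device and create a context
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // runtime: create memory objects
        size_t bytes = n * sizeof(float);
        cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

        // runtime: compile and create the kernel program object
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vecAdd", &err);

        // runtime: issue commands to the command-queue and synchronize
        clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, bytes, h_a, 0, NULL, NULL);
        clEnqueueWriteBuffer(q, d_b, CL_TRUE, 0, bytes, h_b, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &d_a);
        clSetKernelArg(k, 1, sizeof(cl_mem), &d_b);
        clSetKernelArg(k, 2, sizeof(cl_mem), &d_c);
        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);

        // clean up OpenCL resources
        clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_c);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
    }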
  • CUDA vs. OpenCL comparison
    Conceptually almost identical
    Work-item == thread
    Work-group == block
    Similar memory model
    Global, local, shared memory
    Kernel, host program
    CUDA is highly optimized only for NVIDIA GPUs
    OpenCL can be widely used for any GPUs/CPUs
    63
  • Implementation status of OpenCL
    Specification 1.0 released by Khronos
    NVIDIA released Beta 1.2 driver and SDK
    Available for registered GPU computing developers
    Apple will include in Mac OS X Snow Leopard
    Q3 2009
    NVIDIA and ATI GPUs, Intel CPU for Mac
    More companies will join
    64
  • GPU optimization tips: configuration
    Identify bottleneck
    Computing / bandwidth bound (use profiler)
    Focus on most expensive but parallelizable parts (Amdahl’s law)
    Maximize parallel execution
    Use large input (many threads)
    Avoid divergent execution
    Efficient use of limited resource
    Minimize shared memory / register use
    65
  • GPU optimization tips: memory
    Memory access: the most important optimization
    Minimize device to host memory overhead
    Overlap kernel with memory copy (asynchronous copy)
    Avoid shared memory bank conflict
    Coalesced global memory access
    Texture or constant memory can be helpful (cache)
    [Diagram: the memory hierarchy again; relative cost of communication roughly 1 (on-chip shared), 200 (GPU global DRAM on the card), 4000 (PC memory)]
    66
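    A minimal sketch of the "overlap kernel with memory copy" tip using CUDA streams (myKernel, d_in/d_out, and h_in/h_out are illustrative names; the host buffers must be page-locked via cudaMallocHost for the copies to overlap):
    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);
    int half = N / 2;
    for (int s = 0; s < 2; s++)
    {
        int offset = s * half;
        cudaMemcpyAsync(d_in + offset, h_in + offset, half * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<< half/256, 256, 0, stream[s] >>>(d_in + offset, d_out + offset);
        cudaMemcpyAsync(h_out + offset, d_out + offset, half * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();   // wait for both streams to finish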
  • GPU optimization tips: instructions
    Use less expensive operators
    division: 32 cycles, multiplication: 4 cycles
    *0.5 instead of /2.0
    Atomic operator is expensive
    Possible race condition
    Double precision is much slower than float
    Use less accurate floating point instructions when possible
    __sinf(), __expf(), __powf()
    Save unnecessary instructions
    Loop unrolling
    67
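    A small illustrative kernel (not from the slides) that applies these tips: multiply by 0.5f instead of dividing, use the fast single-precision intrinsics, and unroll a fixed-count loop.
    __global__ void instructionTips(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = in[i] * 0.5f;            // *0.5f instead of /2.0f
        float y = __expf(x) + __sinf(x);   // fast, less accurate intrinsics
        #pragma unroll                     // loop unrolling
        for (int k = 0; k < 4; k++)
            y += __powf(y, 0.25f);
        out[i] = y;
    }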
  • 3. Application Example
    CUDA ITK
    68
  • ITK image filters implemented using CUDA
    Convolution filters
    Mean filter
    Gaussian filter
    Derivative filter
    Hessian of Gaussian filter
    Statistical filter
    Median filter
    PDE-based filter
    Anisotropic diffusion filter
    69
  • CUDA ITK
    CUDA code is integrated into ITK
    Transparent to the ITK users
    No need to modify current code using ITK library
    Check environment variable ITK_CUDA
    Entry point
    GenerateData() or ThreadedGenerateData()
    If ITK_CUDA == 0
    Execute original ITK code
    If ITK_CUDA == 1
    Execute CUDA code
    70
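    A hypothetical sketch of this dispatch (the actual CUDA ITK entry points are not shown in the slides; the filter class and the cudaMeanFilter call below are illustrative only):
    #include <cstdlib>   // std::getenv, std::atoi

    template <class TImage>
    void MyMeanImageFilter<TImage>::GenerateData()
    {
        const char* flag = std::getenv("ITK_CUDA");
        if (flag != 0 && std::atoi(flag) == 1)
        {
            // hypothetical CUDA path: allocate device memory, launch kernel, copy back
            cudaMeanFilter(this->GetInput(), this->GetOutput(), m_Radius);
        }
        else
        {
            // original CPU code path provided by ITK
            this->Superclass::GenerateData();
        }
    }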
  • Convolution filters
    • Weighted sum of neighbors
    For size n filter, each pixel is reused n times
    Non-separable filter (Anisotropic)
    Reusing data using shared memory
    Separable filter (Gaussian)
    N-dimensional convolution = N*1D convolution
    71
    [Diagram: separable N-dimensional convolution performed as successive 1D convolutions (kernel * image) along each axis]
  • Read from input image whenever needed
    Naïve C/CUDA implementation
    72
    int xdim, ydim; // size of input image
    float *in, *out; // input/output image of size xdim*ydim
    float w[][]; // convolution kernel of size n*m
    for(x=0; x<xdim; x++) // outer loops: xdim*ydim pixels
    {
    for(y=0; y<ydim; y++)
    {
    // compute convolution: inner loops run n*m times per pixel
    for(sx=x-n/2; sx<=x+n/2; sx++)
    {
    for(sy=y-m/2; sy<=y+m/2; sy++)
    {
    wx = sx - x + n/2;
    wy = sy - y + m/2;
    out[x][y] += w[wx][wy]*in[sx][sy]; // load from global memory, n*m times (out assumed zero-initialized)
    }
    }
    }
    }
  • For size n*m filter, each pixel is reused n*m times
    Save n*m-1 global memory loads by using shared memory
    Improved CUDA convolution filter
    73
    __global__ void cudaConvolutionFilter2DKernel(in, out, w)
    {
    // copy the block's input pixels from global to shared memory: slow load, but only once
    sharedmem[...] = in[...];
    __syncthreads();
    // sum neighbor pixel values: n*m loads, now from fast shared memory
    float _sum = 0;
    for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
    {
    for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
    {
    wx = i - threadIdx.x;
    wy = j - threadIdx.y;
    _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
    }
    }
    }
  • CUDA Gaussian filter
    Apply 1D convolution filter along each axis
    Use temporary buffers: ping-pong rendering
    74
    // temp[0], temp[1] : temporary buffers to store intermediate results
    void cudaDiscreteGaussianImageFilter(in, out, stddev)
    {
    // create Gaussian weight
    w = ComputeGaussKernel(stddev);
    temp[0] = in;
    // call the 1D convolution CUDA kernel with the Gaussian weight, once per axis
    dim3 G, B;
    for(i=0; i<dimension; i++)
    {
    cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
    }
    out = temp[i%2];
    }
  • Median filter
    [Figure: a pixel block and its intensity histogram (bins 0 through 7); each bisection step halves the range of candidate intensities]
    Viola et al. [VIS 03]
    Finding median by bisection of histogram bins
    log(# bins) iterations
    8-bit pixel : log(256) = 8 iterations
    Copy the current block from global to shared memory, then bisect:
    min = 0;
    max = 255;
    pivot = (min+max)/2.0f;
    for(i=0; i<8; i++)
    {
    count = 0;
    for(j=0; j<kernelsize; j++)
    {
    if(kernel[j] > pivot) count++;
    }
    if(count < kernelsize/2) max = floor(pivot);
    else min = ceil(pivot);
    pivot = (min + max)/2.0f;
    }
    return floor(pivot);
    75
  • Perona & Malik anisotropic diffusion
    Nonlinear diffusion
    Adaptive smoothing based on magnitude of gradient
    Preserves edges (high gradient)
    Numerical solution
    Euler explicit integration (iterative method)
    Finite difference for derivative computation
    76
    [Figure: input image compared with linear diffusion and Perona & Malik diffusion results]
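    A rough CUDA sketch of one explicit Euler step of Perona & Malik diffusion on a 2D image (illustrative kernel, not the CUDA ITK implementation; the conductance g = 1/(1 + (|grad I|/K)^2) is one common choice, with finite differences on the 4-neighborhood):
    __device__ float g(float d, float K) { return 1.0f / (1.0f + (d * d) / (K * K)); }

    // out = in + dt * div( g(|grad in|) * grad in ), interior pixels only
    __global__ void pmStep(const float* in, float* out, int w, int h, float K, float dt)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;
        int i = y * w + x;
        float c  = in[i];
        float dN = in[i - w] - c, dS = in[i + w] - c;   // finite differences
        float dW = in[i - 1] - c, dE = in[i + 1] - c;
        out[i] = c + dt * (g(dN, K) * dN + g(dS, K) * dS +
                           g(dW, K) * dW + g(dE, K) * dE);
    }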
  • Performance
    Convolution filters
    Mean filter : ~140x
    Gaussian filter : ~60x
    Derivative filter
    Hessian of Gaussian filter
    Statistical filter
    Median filter : ~25x
    PDE-based filter
    Anisotropic diffusion filter : ~70x
    77
  • CUDA ITK
    Source code available at
    http://sourceforge.net/projects/cudaitk/
    78
  • CUDA ITK Future Work
    ITK GPU image class
    Reduce CPU to GPU memory I/O
    Pipelining support
    Native interface for GPU code
    Similar to ThreadedGenerateData() for GPU threads
    Numerical library (vnl)
    Out-of-GPU-core / GPU-cluster
    Processing large images (10~100 Terabytes)
    GPU Platform independent implementation
    OpenCL could be a solution
    79
  • Conclusions
    GPU computing delivers high performance
    Many scientific computing problems are parallelizable
    More consistency/stability in HW/SW
    Main GPU architecture is mature
    Industry-wide programming standard now exists (OpenCL)
    Better support/tools available
    C-based language, compiler, and debugger
    Issues
    Not every problem is suitable for GPUs
    Re-engineering of algorithms/software required
    Unclear future performance growth of GPU hardware
    Intel’s Larrabee
    80
  • thrust
    thrust: a CUDA library of data-parallel algorithms & data structures with an interface similar to the C++ Standard Template Library
    C++ template metaprogramming automatically chooses the fastest code path at compile time
  • thrust::sort
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <cstdlib>
    int main(void)
    {
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);
    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;
    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());
    return 0;
    }
    http://thrust.googlecode.com
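    The same library also covers the reduction pattern discussed earlier; for example (a usage sketch, with d_vec the device_vector from the code above):
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    // sum all elements on the device
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());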