General Purpose Computing using Graphics Hardware

Slide notes:

  • Fluid flow, level set segmentation, DTI image.
  • One of the major debates you'll see in graphics in the coming years is whether the scheduling and work-distribution logic should be provided as highly optimized hardware, or be implemented as a software program on the programmable cores.
  • Pack the core full of ALUs. We are not going to increase the core's ability to decode instructions; we will decode 1 instruction and execute it on all 8 ALUs.
  • How can we make use of all these ALUs?
  • Just have the shader program work on 8 fragments at a time. Replace the scalar operations with 8-wide vector ones.
  • So the program processes 8 fragments at a time, and all the work for each fragment is carried out by 1 of the 8 ALUs. Notice that part of the context is also replicated to store execution state for the 8 fragments; for example, the registers are replicated.
  • We continue this process, moving to a new group each time we encounter a stall. If we have enough groups, there will always be some work to do, and the processing core's ALUs never go idle.
  • We described adding contexts. In reality there is a fixed pool of on-chip storage that is partitioned to hold contexts. Instead of using that on-chip storage as a traditional data cache, GPUs use it to hold contexts.
  • Shading performance relies on large-scale interleaving. The number of interleaved groups per core is roughly 20-30. These could be separate hardware-managed contexts, or software-managed using other techniques.
  • Fewer contexts fit on chip, so the chip can hide less latency and stalls become more likely.
  • We lose performance when shaders use a lot of registers.
  • 128 simultaneous threads on each core.
  • Drive these ALUs using explicit SIMD instructions, or implicitly via HW-determined sharing.
  • The numbers are the relative cost of communication.
  • Runs on each thread; it is parallel.
  • G = grid size, B = block size.
  • Transcript of "General Purpose Computing using Graphics Hardware"

1. General Purpose Computing using Graphics Hardware
   Hanspeter Pfister, Harvard University

2. Acknowledgements
   - Won-Ki Jeong, Harvard University
   - Kayvon Fatahalian, Stanford University

3. GPU (Graphics Processing Unit)
   - PC hardware dedicated to 3D graphics
   - Massively parallel SIMD processor
   - Performance pushed by the game industry
   [Photo: NVIDIA SLI system]

4. GPGPU
   - General-purpose computation on the GPU
   - Started in the computer graphics research community
   - Maps computational problems onto the graphics rendering pipeline
   [Images courtesy Jens Krueger, Aaron Lefohn, and Won-Ki Jeong]

5. Why GPU for computing?
   - The GPU is fast
     - Massively parallel
       - CPU: ~4 cores (16 SIMD lanes) @ 3.2 GHz (Intel Quad Core)
       - GPU: ~30 cores (240 SIMD lanes) @ 1.3 GHz (NVIDIA GT200)
     - High memory bandwidth
   - Programmable
     - NVIDIA CUDA, DirectX Compute Shader, OpenCL
   - High-precision floating-point support
     - 64-bit floating point (IEEE 754)
   - Inexpensive desktop supercomputer
     - NVIDIA Tesla C1060: ~1 TFLOPS @ $1000
6. FLOPS
   [Chart: peak GFLOPS of GPUs vs. CPUs over time; image courtesy NVIDIA]

7. Memory Bandwidth
   [Chart: memory bandwidth of GPUs vs. CPUs over time; image courtesy NVIDIA]

8. GPGPU Biomedical Examples
   - Level-set segmentation (Lefohn et al.)
   - CT/MRI reconstruction (Sumanaweera et al.)
   - Image registration (Strzodka et al.)
   - EM image processing (Jeong et al.)

9. Overview
   - GPU Architecture Overview
   - GPU Programming Overview
     - Programming Model
     - NVIDIA CUDA
     - OpenCL
   - Application Example
     - CUDA ITK
10. 1. GPU Architecture Overview
    Kayvon Fatahalian, Stanford University

11. What's in a GPU?
    A heterogeneous chip multi-processor, highly tuned for graphics.
    [Diagram: several compute cores alongside fixed-function units (input assembly, rasterizer, output blend, video decode, texture units) and a work distributor (HW or SW?)]

12. CPU-"style" cores
    [Diagram: one core with fetch/decode, an ALU (execute), and an execution context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a big data cache]

13. Slimming down
    Idea #1: Remove the components that help a single instruction stream run fast.
    [Diagram: the core reduced to fetch/decode, ALU (execute), and execution context]
14. Two cores (two threads in parallel)
    [Diagram: two slimmed-down cores, each running its own copy of the fragment shader]
    <diffuseShader>:
      sample r0, v4, t0, s0
      mul  r3, v0, cb0[0]
      madd r3, v1, cb0[1], r3
      madd r3, v2, cb0[2], r3
      clmp r3, r3, l(0.0), l(1.0)
      mul  o0, r0, r3
      mul  o1, r1, r3
      mul  o2, r2, r3
      mov  o3, l(1.0)

15. Four cores (four threads in parallel)
    [Diagram: four slimmed-down cores]

16. Sixteen cores (sixteen threads in parallel)
    16 cores = 16 simultaneous instruction streams
    [Diagram: sixteen slimmed-down cores]

17. Instruction stream sharing
    But… many threads should be able to share an instruction stream!
    [The same <diffuseShader> program is run for every fragment]

18. Recall: simple processing core
    [Diagram: fetch/decode, ALU (execute), execution context]
19. Add ALUs
    Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing).
    [Diagram: one fetch/decode unit feeding ALUs 1-8, each with its own small context, plus shared context data]

20. Modifying the code
    Original compiled shader: processes one thread using scalar ops on scalar registers.
    <diffuseShader>:
      sample r0, v4, t0, s0
      mul  r3, v0, cb0[0]
      madd r3, v1, cb0[1], r3
      madd r3, v2, cb0[2], r3
      clmp r3, r3, l(0.0), l(1.0)
      mul  o0, r0, r3
      mul  o1, r1, r3
      mul  o2, r2, r3
      mov  o3, l(1.0)

21. Modifying the code
    New compiled shader: processes 8 threads using vector ops on vector registers.
    <VEC8_diffuseShader>:
      VEC8_sample vec_r0, vec_v4, t0, vec_s0
      VEC8_mul  vec_r3, vec_v0, cb0[0]
      VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
      VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
      VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
      VEC8_mul  vec_o0, vec_r0, vec_r3
      VEC8_mul  vec_o1, vec_r1, vec_r3
      VEC8_mul  vec_o2, vec_r2, vec_r3
      VEC8_mov  vec_o3, l(1.0)

22. Modifying the code
    [Diagram: fragments 1-8 mapped onto the eight ALUs, all driven by the single VEC8 instruction stream]

23. 128 threads in parallel
    16 cores = 128 ALUs = 16 simultaneous instruction streams
24. But what about branches?
    [Diagram: ALUs 1-8 over time, all executing the shader below in lockstep]
    <unconditional shader code>
    if (x > 0) {
      y = pow(x, exp);
      y *= Ks;
      refl = y + Ka;
    } else {
      x = 0;
      refl = Ka;
    }
    <resume unconditional shader code>

25. But what about branches?
    [Diagram: the branch evaluates differently per lane (e.g., T T T F F F F F), so the lanes want different paths]

26. But what about branches?
    [Diagram: both sides of the branch are executed; lanes on the untaken side are masked off]
    Not all ALUs do useful work!
    Worst case: 1/8 performance.

27. But what about branches?
    [Diagram: after both paths complete, all lanes resume the unconditional shader code together]
28. Clarification
    SIMD processing does not imply SIMD instructions.
    - Option 1: Explicit vector instructions
      - Intel/AMD x86 SSE, Intel Larrabee
    - Option 2: Scalar instructions, implicit HW vectorization
      - HW determines instruction stream sharing across ALUs (the amount of sharing is hidden from software)
      - NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures
    In practice: 16 to 64 threads share an instruction stream.
29. Stalls!
    Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation.
    Texture access latency = hundreds to thousands of cycles.
    We have removed the fancy caches and logic that help avoid stalls.

30. But we have LOTS of independent threads.
    Idea #3: Interleave processing of many threads on a single core to avoid stalls caused by high-latency operations.
31. Hiding stalls
    [Diagram: one core running threads 1-8 on its eight ALUs, with one context per thread and shared context data]

32. Hiding stalls
    [Diagram: the context storage split into four groups holding threads 1-8, 9-16, 17-24, and 25-32]

33. Hiding stalls
    [Diagram: when group 1 stalls, the core switches to a runnable group]

34. Hiding stalls
    [Diagram: group 2 runs until it stalls; the core keeps moving to the next runnable group]

35. Hiding stalls
    [Diagram: groups 1-3 stalled, group 4 running; by the time group 4 stalls, group 1 is runnable again]

36. Throughput!
    [Diagram: the four thread groups interleaved over time; each group starts, stalls while waiting on memory, resumes, and finishes]
    The run time of any one group increases, but the throughput of many groups is maximized.
37. Storing contexts
    [Diagram: a core with eight ALUs and a 32 KB pool of on-chip context storage]

38. Twenty small contexts (maximal latency-hiding ability)
    [Diagram: the context pool partitioned into 20 small contexts]

39. Twelve medium contexts
    [Diagram: the context pool partitioned into 12 medium contexts]

40. Four large contexts (low latency-hiding ability)
    [Diagram: the context pool partitioned into 4 large contexts]

41. GPU block diagram key
    - Single "physical" instruction stream fetch/decode (functional unit control)
    - SIMD programmable functional unit (FU), control shared with other functional units; a functional unit may contain multiple 32-bit "ALUs"
    - 32-bit mul-add unit
    - 32-bit multiply unit
    - Execution context storage
    - Fixed-function unit
42. Example: NVIDIA GeForce GTX 280
    NVIDIA-speak:
    - 240 stream processors
    - "SIMT execution" (automatic HW-managed sharing of the instruction stream)
    Generic speak:
    - 30 processing cores
    - 8 SIMD functional units per core
    - 1 mul-add (2 flops) + 1 mul per functional unit (3 flops/clock)
    - Best case: 240 mul-adds + 240 muls per clock
    - 1.3 GHz clock
    - 30 * 8 * (2 + 1) * 1.3 = 933 GFLOPS
    Mapping data parallelism to the chip:
    - Instruction stream shared across 32 threads
    - 8 threads run on the 8 SIMD functional units in one clock

43. GTX 280 core
    [Diagram: the 30 cores with their texture units, Z-cull/clip/rasterizer, output blend, and work distributor]

44. Example: ATI Radeon 4870
    AMD/ATI-speak:
    - 800 stream processors
    - Automatic HW-managed sharing of a scalar instruction stream (like "SIMT")
    Generic speak:
    - 10 processing cores
    - 16 SIMD functional units per core
    - 5 mul-adds per functional unit (5 * 2 = 10 flops/clock)
    - Best case: 800 mul-adds per clock
    - 750 MHz clock
    - 10 * 16 * 5 * 2 * 0.75 = 1.2 TFLOPS
    Mapping data parallelism to the chip:
    - Instruction stream shared across 64 threads
    - 16 threads run on the 16 SIMD functional units in one clock

45. ATI Radeon 4870 core
    [Diagram: the 10 cores with their texture units, Z-cull/clip/rasterizer, output blend, and work distributor]

46. Summary: three key ideas
    1. Use many "slimmed down" cores running in parallel.
    2. Pack cores full of ALUs by sharing an instruction stream across groups of threads.
       - Option 1: Explicit SIMD vector instructions
       - Option 2: Implicit sharing managed by hardware
    3. Avoid latency stalls by interleaving the execution of many groups of threads: when one group stalls, work on another group.
47. 2. GPU Programming Models
    - Programming Model
    - NVIDIA CUDA
    - OpenCL

48. Task parallelism
    - Distribute tasks across processors based on their dependencies
    - Coarse-grain parallelism
    [Diagram: a task dependency graph of tasks 1-9 and their assignment across processors P1-P3]

49. Data parallelism
    - Run a single kernel over many elements
      - Each element is updated independently
      - The same operation is applied to each element
    - Fine-grain parallelism
      - Many lightweight threads; context switching is easy
    - Maps well to an ALU-heavy architecture: the GPU
    [Diagram: one kernel applied to the data elements on processors P1 ... Pn]

50. GPU-friendly Problems
    - Data-parallel processing
    - High arithmetic intensity
      - Keeps the GPU busy all the time
      - Computation offsets memory latency
    - Coherent data access
      - Access large chunks of contiguous memory
      - Exploit fast on-chip shared memory
51. The Algorithm Matters
    Jacobi: parallelizable (each new value depends only on values from the previous iteration).

      for(int i=0; i<num; i++)
      {
          v_new[i] = (v_old[i-1] + v_old[i+1]) / 2.0;
      }

    Gauss-Seidel: difficult to parallelize (each update depends on values already written in the same sweep).

      for(int i=0; i<num; i++)
      {
          v[i] = (v[i-1] + v[i+1]) / 2.0;
      }
52. Example: Reduction
    Serial version (O(N)):

      for(int i=1; i<N; i++)
      {
          v[0] += v[i];
      }

    Parallel version (O(log N)):

      width = N/2;
      while(width >= 1)
      {
          for(int i=0; i<width; i++)
          {
              v[i] += v[i+width];   // computed in parallel
          }
          width /= 2;
      }
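    The parallel pseudocode above maps directly onto one CUDA thread block that stages the data in shared memory. A minimal sketch, assuming N is a power of two and small enough to fit in a single block (the kernel name and launch configuration are illustrative, not from the slides):

      __global__ void blockReduce(float* v, int N)
      {
          extern __shared__ float s[];          // one float per thread
          int tid = threadIdx.x;

          // each thread loads one element into shared memory
          s[tid] = (tid < N) ? v[tid] : 0.0f;
          __syncthreads();

          // halve the active width each step, as in the pseudocode above
          for (int width = blockDim.x / 2; width >= 1; width /= 2)
          {
              if (tid < width)
                  s[tid] += s[tid + width];
              __syncthreads();                  // wait for all adds in this step
          }

          if (tid == 0) v[0] = s[0];            // thread 0 writes the final sum
      }

      // launch: blockReduce<<<1, N, N * sizeof(float)>>>(d_v, N);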
53. GPU programming languages
    - Using graphics APIs: GLSL, Cg, HLSL
    - Computing-specific APIs: DX11 Compute Shaders, NVIDIA CUDA, OpenCL

54. NVIDIA CUDA
    - C-extension programming language
      - No graphics API
      - Supports debugging tools
    - Extensions / API
      - Function qualifiers: __global__, __device__, __host__
      - Variable qualifiers: __shared__, __constant__
      - Low-level functions: cudaMalloc(), cudaFree(), cudaMemcpy(), ..., __syncthreads(), atomicAdd(), ...
    - Program types
      - Device program (kernel): runs on the GPU
      - Host program: runs on the CPU and calls device programs
55. CUDA Programming Model
    - Kernel: a GPU program that runs on a grid of threads
    - Thread hierarchy
      - Grid: a set of blocks
      - Block: a set of threads
      - Grid size * block size = total number of threads
    [Diagram: a kernel launched onto a grid of blocks 1..n, each block containing many threads]

56. CUDA Memory Structure
    - Memory hierarchy
      - PC memory (DRAM): off-card
      - GPU global memory (DRAM): off-chip, on-card
      - Shared memory / registers / cache: on-chip
    - The host can read/write global memory
    - Threads communicate through shared memory
    [Diagram: PC memory, GPU global memory, on-chip shared memory, and the ALUs, with relative communication costs of 4000, 200, and 1]
57. Synchronization
    - Threads in the same block can communicate using shared memory
    - No HW global synchronization function yet
    - __syncthreads()
      - Barrier, but only for the threads within the current block
    - __threadfence()
      - Flushes global memory writes to make them visible to all threads
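    A small sketch of the barrier in use: every thread stages one value in shared memory, and __syncthreads() guarantees all stores have completed before any thread reads a neighbor's slot (the array names and the 256-thread block size are illustrative assumptions):

      __global__ void shiftLeft(const float* in, float* out)
      {
          __shared__ float tile[256];               // one slot per thread in the block
          int tid = threadIdx.x;
          int gid = blockIdx.x * blockDim.x + tid;

          tile[tid] = in[gid];                      // stage this thread's element
          __syncthreads();                          // all stores finish before any read

          // read the right-hand neighbor's element (wrapping within the block)
          out[gid] = tile[(tid + 1) % blockDim.x];
      }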
58. Example: CPU Vector Addition

      // Pair-wise addition of vector elements
      // CPU version : serial add
      void vectorAdd(float* iA, float* iB, float* oC, int num)
      {
          for(int i=0; i<num; i++)
          {
              oC[i] = iA[i] + iB[i];
          }
      }

59. Example: CUDA Vector Addition

      // Pair-wise addition of vector elements
      // CUDA version : one thread per addition
      __global__ void
      vectorAdd(float* iA, float* iB, float* oC)
      {
          int idx = threadIdx.x + blockDim.x * blockIdx.x;
          oC[idx] = iA[idx] + iB[idx];
      }
60. Example: CUDA Host Code

      float* h_A = (float*) malloc(N * sizeof(float));
      float* h_B = (float*) malloc(N * sizeof(float));
      // ... initialize h_A and h_B

      // allocate device memory
      float *d_A, *d_B, *d_C;
      cudaMalloc( (void**) &d_A, N * sizeof(float));
      cudaMalloc( (void**) &d_B, N * sizeof(float));
      cudaMalloc( (void**) &d_C, N * sizeof(float));

      // copy host memory to device
      cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
      cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

      // execute the kernel on N/256 blocks of 256 threads each
      vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );
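    The slide stops at the kernel launch and assumes N is a multiple of 256. A hedged sketch of the remaining steps (not on the slide): rounding the grid size up for arbitrary N, copying the result back, and releasing memory.

      // round the grid size up so a final partial block covers the tail of the array
      int threads = 256;
      int blocks  = (N + threads - 1) / threads;
      vectorAdd<<< blocks, threads >>>( d_A, d_B, d_C );
      // note: with a rounded-up grid the kernel should also check idx < N before writing

      // copy the result back to the host
      float* h_C = (float*) malloc(N * sizeof(float));
      cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

      // clean up
      cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
      free(h_A); free(h_B); free(h_C);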
61. OpenCL (Open Computing Language)
    - First industry standard for a computing language
    - Based on the C language
    - Platform independent
      - NVIDIA, ATI, Intel, ...
    - Data- and task-parallel compute model
      - Uses all computational resources in the system: CPU, GPU, ...
      - Work-item: same as a thread / fragment / etc.
      - Work-group: a group of work-items
      - Work-items in the same work-group can communicate
      - Multiple work-groups execute in parallel

62. OpenCL program structure
    - Host program (CPU)
      - Platform layer: query compute devices, create a context
      - Runtime: create memory objects; compile and create kernel program objects; issue commands (e.g., kernel launches) to a command queue; synchronize commands; clean up OpenCL resources
    - Kernel (CPU, GPU)
      - C-like code with some extensions
      - Runs on the compute device
63. CUDA vs. OpenCL comparison
    - Conceptually almost identical
      - Work-item == thread
      - Work-group == block
      - Similar memory model: global, local, shared memory
      - Kernel and host program
    - CUDA is highly optimized, but only for NVIDIA GPUs
    - OpenCL can be used across many GPUs and CPUs
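    To make the mapping concrete, here is a hedged sketch of the earlier CUDA vectorAdd written as an OpenCL kernel: the index computed from threadIdx/blockIdx/blockDim becomes a single get_global_id(0) call (the bounds check is an addition, since the global work size is usually rounded up):

      __kernel void vectorAdd(__global const float* iA,
                              __global const float* iB,
                              __global float* oC,
                              const int num)
      {
          int idx = get_global_id(0);   // global work-item index, like the CUDA idx
          if (idx < num)
              oC[idx] = iA[idx] + iB[idx];
      }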
64. Implementation status of OpenCL
    - Specification 1.0 released by Khronos
    - NVIDIA released a Beta 1.2 driver and SDK
      - Available to registered GPU computing developers
    - Apple will include it in Mac OS X Snow Leopard (Q3 2009)
      - NVIDIA and ATI GPUs, Intel CPUs for the Mac
    - More companies will join

65. GPU optimization tips: configuration
    - Identify the bottleneck
      - Compute bound or bandwidth bound (use the profiler)
      - Focus on the most expensive but parallelizable parts (Amdahl's law)
    - Maximize parallel execution
      - Use large inputs (many threads)
      - Avoid divergent execution
    - Use the limited resources efficiently
      - Minimize shared memory and register use
66. GPU optimization tips: memory
    - Memory access is the most important optimization
      - Minimize device-to-host memory overhead
      - Overlap kernels with memory copies (asynchronous copy)
      - Avoid shared memory bank conflicts
      - Coalesce global memory accesses
      - Texture or constant memory can help (cached)
    [Diagram: PC memory, GPU global memory, on-chip shared memory, and the ALUs, with relative communication costs of 4000, 200, and 1]
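    A small sketch of what "coalesced" means in practice: consecutive threads should touch consecutive addresses. Both kernels below copy the same data, but in the first a warp's loads can be merged into a few wide memory transactions, while the second strides across memory and cannot (the names and layout are illustrative):

      // coalesced: thread k of a block reads element (block offset + k)
      __global__ void copyCoalesced(const float* in, float* out, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i];
      }

      // strided: neighboring threads read addresses far apart (uncoalesced, slow)
      __global__ void copyStrided(const float* in, float* out, int n)
      {
          int i = threadIdx.x * gridDim.x + blockIdx.x;
          if (i < n) out[i] = in[i];
      }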
67. GPU optimization tips: instructions
    - Use less expensive operators
      - Division: 32 cycles, multiplication: 4 cycles
      - Use *0.5 instead of /2.0
    - Atomic operators are expensive
      - Possible race conditions
    - Double precision is much slower than single-precision float
    - Use less accurate floating-point instructions when possible
      - __sinf(), __expf(), __powf()
    - Save unnecessary instructions
      - Loop unrolling
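    A minimal sketch combining two of these tips; the multiply-by-0.5f and the fast __sinf intrinsic are the only changes from the straightforward version (the kernel and array names are illustrative):

      __global__ void attenuate(float* x, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
          {
              // x[i] / 2.0f would cost a full divide; multiply by 0.5f instead
              float half = x[i] * 0.5f;
              // __sinf is the fast, less accurate intrinsic; sinf is the precise version
              x[i] = __sinf(half);
          }
      }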
68. 3. Application Example
    CUDA ITK

69. ITK image filters implemented using CUDA
    - Convolution filters
      - Mean filter
      - Gaussian filter
      - Derivative filter
      - Hessian of Gaussian filter
    - Statistical filter
      - Median filter
    - PDE-based filter
      - Anisotropic diffusion filter
70. CUDA ITK
    - CUDA code is integrated into ITK
      - Transparent to ITK users
      - No need to modify existing code that uses the ITK library
    - Checks the environment variable ITK_CUDA
      - Entry point: GenerateData() or ThreadedGenerateData()
      - If ITK_CUDA == 0, execute the original ITK code
      - If ITK_CUDA == 1, execute the CUDA code
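    A hypothetical sketch of that dispatch; the helper names CudaGenerateData and CpuGenerateData are illustrative placeholders, not the actual CUDA ITK symbols:

      #include <cstdlib>   // getenv, atoi

      // Returns true only when the user has opted in with ITK_CUDA=1.
      bool UseCudaPath()
      {
          const char* flag = std::getenv("ITK_CUDA");
          return (flag != NULL) && (std::atoi(flag) == 1);
      }

      // Inside a filter's GenerateData():
      //   if (UseCudaPath())  CudaGenerateData();  // CUDA implementation (hypothetical name)
      //   else                CpuGenerateData();   // original ITK implementation (hypothetical name)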
71. Convolution filters
    - Weighted sum of neighbors
      - For a size-n filter, each pixel is reused n times
    - Non-separable filter (anisotropic)
      - Reuse data through shared memory
    - Separable filter (Gaussian)
      - N-dimensional convolution = N 1D convolutions
    [Diagram: the kernel convolved (*) with the image along each axis]
72. Naïve C/CUDA implementation
    Read from the input image whenever needed: for every one of the xdim*ydim pixels, the inner loops issue n*m loads from global memory.

      int xdim, ydim;   // size of input image
      float *in, *out;  // input/output image of size xdim*ydim
      float w[][];      // convolution kernel of size n*m

      for(x=0; x<xdim; x++)
      {
          for(y=0; y<ydim; y++)
          {
              // compute convolution
              for(sx=x-n/2; sx<=x+n/2; sx++)
              {
                  for(sy=y-m/2; sy<=y+m/2; sy++)
                  {
                      wx = sx - x + n/2;
                      wy = sy - y + m/2;
                      // load from global memory, n*m times per output pixel
                      out[x][y] += w[wx][wy]*in[sx][sy];
                  }
              }
          }
      }
73. Improved CUDA convolution filter
    For a size n*m filter each pixel is reused n*m times, so staging the tile in shared memory saves n*m-1 global memory loads per pixel.

      __global__ void cudaConvolutionFilter2DKernel(in, out, w)
      {
          // copy global to shared memory: slow global load, but only once per pixel
          sharedmem[] = in[][];
          __syncthreads();

          // sum neighbor pixel values
          float _sum = 0;
          for(uint j=threadIdx.y; j<=threadIdx.y + m; j++)
          {
              for(uint i=threadIdx.x; i<=threadIdx.x + n; i++)
              {
                  wx = i - threadIdx.x;
                  wy = j - threadIdx.y;
                  // load from shared memory (fast), n*m times
                  _sum += w[wx][wy]*sharedmem[j*sharedmemdim.x + i];
              }
          }
      }
74. CUDA Gaussian filter
    Apply a 1D convolution filter along each axis, ping-ponging between temporary buffers.

      // temp[0], temp[1] : temporary buffers for intermediate results
      void cudaDiscreteGaussianImageFilter(in, out, stddev)
      {
          // create Gaussian weight
          w = ComputeGaussKernel(stddev);
          temp[0] = in;

          // call the 1D convolution CUDA kernel with the Gaussian weight along each axis
          dim3 G, B;
          for(i=0; i<dimension; i++)
          {
              cudaConvolutionFilter1DKernel<<<G,B>>>(temp[i%2], temp[(i+1)%2], w);
          }
          out = temp[i%2];
      }
75. Median filter
    Viola et al. [VIS 03]: find the median by bisecting the histogram bins.
    - log(# bins) iterations; for an 8-bit pixel, log(256) = 8 iterations
    - Step 1: copy the current block from global to shared memory
    - Steps 2-4: repeatedly bisect the intensity range, counting how many neighborhood pixels lie above the pivot
    [Diagram: an intensity histogram over bins 0-7, split around the pivot each iteration]

      min = 0;
      max = 255;
      pivot = (min+max)/2.0f;
      for(i=0; i<8; i++)
      {
          count = 0;
          for(j=0; j<kernelsize; j++)
          {
              if(kernel[j] > pivot) count++;
          }
          if(count < kernelsize/2) max = floor(pivot);
          else                     min = ceil(pivot);
          pivot = (min + max)/2.0f;
      }
      return floor(pivot);
76. Perona & Malik anisotropic diffusion
    - Nonlinear diffusion
      - Adaptive smoothing based on the magnitude of the gradient
      - Preserves edges (high gradient)
    - Numerical solution
      - Euler explicit integration (iterative method)
      - Finite differences for derivative computation
    [Images: input image, linear diffusion, P & M diffusion]

77. Performance
    - Convolution filters
      - Mean filter: ~140x
      - Gaussian filter: ~60x
      - Derivative filter
      - Hessian of Gaussian filter
    - Statistical filter
      - Median filter: ~25x
    - PDE-based filter
      - Anisotropic diffusion filter: ~70x
78. CUDA ITK
    Source code available at http://sourceforge.net/projects/cudaitk/

79. CUDA ITK Future Work
    - ITK GPU image class
      - Reduce CPU-to-GPU memory I/O
      - Pipelining support
    - Native interface for GPU code
      - Similar to ThreadedGenerateData(), but for GPU threads
    - Numerical library (vnl)
    - Out-of-GPU-core / GPU-cluster
      - Processing large images (10-100 terabytes)
    - GPU platform-independent implementation
      - OpenCL could be a solution

80. Conclusions
    - GPU computing delivers high performance
      - Many scientific computing problems are parallelizable
    - More consistency/stability in HW/SW
      - The main GPU architecture is mature
      - An industry-wide programming standard now exists (OpenCL)
    - Better support/tools available
      - C-based language, compiler, and debugger
    - Issues
      - Not every problem is suitable for GPUs
      - Re-engineering of algorithms/software is required
      - Unclear future performance growth of GPU hardware (Intel's Larrabee)
81. thrust
    - thrust: a library of data-parallel algorithms and data structures with an interface similar to the C++ Standard Template Library, for CUDA
    - C++ template metaprogramming automatically chooses the fastest code path at compile time

82. thrust::sort

      #include <thrust/host_vector.h>
      #include <thrust/device_vector.h>
      #include <thrust/generate.h>
      #include <thrust/sort.h>
      #include <cstdlib>

      int main(void)
      {
          // generate random data on the host
          thrust::host_vector<int> h_vec(1000000);
          thrust::generate(h_vec.begin(), h_vec.end(), rand);

          // transfer to device and sort
          thrust::device_vector<int> d_vec = h_vec;

          // sort 140M 32b keys/sec on GT200
          thrust::sort(d_vec.begin(), d_vec.end());

          return 0;
      }

    http://thrust.googlecode.com