Transpose Summary      Coalescing and shared memory bank conflicts are      small-scale phenomena           Deal with memo...
tmem
Textures in CUDA       Texture is an object for reading data       Benefits:                 Data is cached (optimized for...
Other goodiesOptional “format conversion”• {char, short, int, half} (16bit)   to   float (32bit)• “for free”• useful for *m...
Texture Addressing                                    0   1   2   3    4                                0                 ...
Two CUDA Texture Types       Bound to linear memory                 Global memory address is bound to a texture           ...
CUDA Texturing Steps      Host (CPU) code:               Allocate/obtain memory (global linear, or CUDA array)            ...
cmem
Constant Memory • Ideal for coefficients and other data that is read uniformly by warps • Data is stored in ...
Constant Memory (continued; slide text not recoverable from the extraction)
*mem compression
Optimizing with compression • When all else has been optimized and performance is limited by the number of ...
Accelerating GPU computation through mixed-precision methods
... too much? (bank conflicts, ...)
Parallel Programming is Hard   (but you'll pick it up)
(you are not alone)
3. Threading/Execution     Optimizations
3.1 Exec. Configuration        Optimizations
Occupancy       Thread instructions are executed sequentially, so       executing other warps is the only way to hide     ...
Grid/Block Size Heuristics       # of blocks > # of multiprocessors                 So all multiprocessors have at least o...
Register Dependency       Read-after-write register dependency                 Instruction’s result can be read ~24 cycles...
Register Pressure       Hide latency by using more threads per SM       Limiting Factors:                 Number of regist...
Occupancy Calculator© NVIDIA Corporation 2008   64
Optimizing threads per block       Choose threads per block as a multiple of warp size                 Avoid wasting compu...
Occupancy != Performance       Increasing occupancy does not necessarily increase       performance                       ...
Occupancy != Performance       Increasing occupancy does not necessarily increase       performance                       ...
3.2 Instruction    Optimizations
CUDA Instruction Performance       Instruction cycles (per warp) = sum of                 Operand read cycles             ...
Maximizing Instruction Throughput        Maximize use of high-bandwidth memory                 Maximize use of shared memo...
Arithmetic Instruction Throughput       int and float add, shift, min, max and float mul, mad:       4 cycles per warp    ...
Runtime Math Library       There are two types of runtime math operations in       single-precision                 __func...
GPU results may not match CPU       Many variables: hardware, compiler, optimization       settings       CPU operations a...
FP Math is Not Associative!       In symbolic math, (x+y)+z == x+(y+z)       This is not necessarily true for floating-poi...
Control Flow Instructions       Main performance concern with branching is       divergence                 Threads within...
Scared ?
Scared ?           Howwwwww?!            (do I start)
Profiler
Profiler counters, e.g. instructions issued / instructions executed, ... (remaining slide text garbled in the extraction)
CUDA Visual Profiler                     data for memory transfers             Memory transfer type and direction         ...
CUDA Visual Profiler   data for kernels© NVIDIA Corporation 2010
CUDA Visual Profiler                    computed data for kernels            Instruction throughput: Ratio of achieved ins...
CUDA Visual Profiler                data analysis views             Views:                 Summary table                 K...
CUDA Visual Profiler                      Misc.             Multiple sessions             Compare views for different sess...
Scared ?      meh!!!! I don’t like to              profile
Modified source code
Analysis with modified source code • Time memory-only and math-only versions of the kernel ...
Scared ?           I want to believe...
Source-modification analysis: per-kernel timing bars (time, mem, math, full) for four example kernels (chart labels garbled in the extraction)
Argn&%#$... too many optimizations !!!
Parameterize Your Application       Parameterization helps adaptation to different GPUs       GPUs vary in many ways      ...
More ?•   Next week:    GPU “Scripting”, Meta-programming, Auto-tuning•   Thu 3/31/11:    PyOpenCL (A. Klöckner, NYU), ahh ...
one more thing           or two...
Life/Code Hacking #2.x: Speed {listen,read,writ}ing (accelerated e-learning (c) / massively parallel {learn,pr...
Life/Code Hacking #2.2: Speed writing (accelerated e-learning (c) / massively...
Life/Code Hacking #2.2: Speed writing: http://steve-yegge.blogspot.com/...
Life/Code Hacking #2.2: Speed writing: typing tutors (gtypist, ktouch, typi...
Life/Code Hacking #2.2: Speed writing: Kinesis Advanta...
Demo
[Harvard CS264] 05 - Advanced-level CUDA Programming
Published on

http://cs264.org

Published in: Education, Technology
  1. 1. Massively Parallel Computing CS 264 / CSCI E-292, Lecture #5: Advanced CUDA | February 22nd, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  2. 2. Administrivia• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)• Projects: think about it, consult the staff (*), proposals due ~ Fri 3/25/11• Guest lectures: • schedule coming soon • on Fridays 7.35-9.35pm (March, April) ?
  3. 3. During this course, we’ll try to “adapt for CS264” and use existing material ;-)
  4. 4. Today... yey!!
  5. 5. Outline1. Hardware Review2. Memory/Communication Optimizations3. Threading/Execution Optimizations
  6. 6. 1. Hardware Review
  7. 7. 10-Series Architecture 240 thread processors execute kernel threads 30 multiprocessors, each contains 8 thread processors One double-precision unit Shared memory enables thread cooperation Multiprocessor Thread Processors Double Shared Memory© NVIDIA Corporation 2008 8
  8. 8. Threading HierarchyExecution ModelSoftware Hardware Threads are executed by thread Thread processors Processor Thread Thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can Thread reside on one multiprocessor - limited Block Multiprocessor by multiprocessor resources (shared memory and register file) A kernel is launched as a grid of thread blocks ... Only one kernel can execute on a Grid device at one time Device © 2008 NVIDIA Corporation.
  9. 9. Warps and Half Warps 32 Threads A thread block consists of 32- thread warps ... = 32 Threads A warp is executed physically in 32 Threads parallel (SIMD) on a Thread Block Warps Multiprocessor multiprocessor DRAM A half-warp of 16 threads can coordinate global memory 16 16 Global accesses into a single transaction Half Warps Local Device Memory© NVIDIA Corporation 2008 10
  10. 10. Memory Architecture Host Device GPU CPU DRAM Multiprocessor Registers Local Multiprocessor Shared Memory Registers Multiprocessor Chipset Shared Memory Registers Global Shared Memory DRAM Constant Constant and Texture Caches Texture© NVIDIA Corporation 2008 11
  11. 11. Kernel Memory Access Kernel Memory Access Per-thread Registers On-chip Thread Local Memory Off-chip, uncached Per-block Shared • On-chip, small Block • Fast Memory Per-device Kernel 0 ... • Off-chip, large • Uncached Global • Persistent acrossTime Memory kernel launches Kernel 1 ... • Kernel I/O
  12. 12. Global Memory Kernel Memory Access • Different types of “global memory” Per-thread Registers On-chip • Linear Memory Thread Local Memory Off-chip, uncached • Texture Per-block Memory • Constant Memory Block • • Shared Memory On-chip, small Fast Per-device Kernel 0 ... • Off-chip, large • Uncached Global • Persistent acrossTime Memory kernel launches Kernel 1 ... • Kernel I/O
  13. 13. Memory Architecture Memory Location Cached Access Scope Lifetime Register On-chip N/A R/W One thread Thread Local Off-chip No R/W One thread Thread Shared On-chip N/A R/W All threads in a block Block Global Off-chip No R/W All threads + host Application Constant Off-chip Yes R All threads + host Application Texture Off-chip Yes R All threads + host Application© NVIDIA Corporation 2008 12
  14. 14. 2. Memory/Communication Optimizations
  15. 15. Review: 2.1 Host/Device Transfer Optimizations
  16. 16. Review: PC Architecture. CPU, Northbridge, Southbridge, DRAM, GPU; approximate bus bandwidths: PCIe 8 GB/s, SATA / Ethernet 3+ Gb/s, 160+ GB/s GPU to VRAM, 25+ GB/s to system memory. modified from Matthew Bolitho
  17. 17. Review: The PCI-“not-so”-e Bus • PCIe bus is slow • Try to minimize/group transfers • Use pinned memory on host whenever possible • Try to perform copies asynchronously (e.g. Streams) • Use “Zero-Copy” when appropriate • Examples in the SDK (e.g. bandwidthTest)
  18. 18. 2.2 Device Memory Optimizations
  19. 19. Definitions• gmem: global memory• smem: shared memory• tmem: texture memory• cmem: constant memory• bmem: binary code (cubin) memory ?!? (covered next week)
  20. 20. Performance Analysise.g. Matrix Transpose
  21. 21. Matrix Transpose Transpose 2048x2048 matrix of floats Performed out-of-place Separate input and output matrices Use tile of 32x32 elements, block of 32x8 threads Each thread processes 4 matrix elements In general tile and block size are fair game for optimization Process Get the right answer Measure effective bandwidth (relative to theoretical or reference case) Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating above steps© NVIDIA Corporation 2008 22
  22. 22. Theoretical Bandwidth Device Bandwidth of GTX 280 DDR 1107 * 10^6 * (512 / 8) * 2 / 1024^3 = 131.9 GB/s Memory Memory clock (Hz) interface (bytes) Specs report 141 GB/s Use 10^9 B/GB conversion rather than 1024^3 Whichever you use, be consistent© NVIDIA Corporation 2008 23
  23. 23. Effective Bandwidth Transpose Effective Bandwidth 2048^2 * 4 B/element / 1024^3 * 2 / (time in secs) = GB/s Matrix size Read and (bytes) write Reference Case - Matrix Copy Transpose operates on tiles - need better comparison than raw device bandwidth Look at effective bandwidth of copy that uses tiles© NVIDIA Corporation 2008 24
  24. 24. Matrix Copy Kernel__global__ void copy(float *odata, float *idata, int width, int height){ int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index = xIndex + width*yIndex; for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) { odata[index+i*width] = idata[index+i*width]; } TILE_DIM = 32} BLOCK_ROWS = 8 idata odata 32x32 tile 32x8 thread block idata and odata in global memory Elements copied by a half-warp of threads© NVIDIA Corporation 2008 25
  25. 25. Matrix Copy Kernel Timing Measure elapsed time over loop Looping/timing done in two ways: Over kernel launches (nreps = 1) Includes launch/indexing overhead Within the kernel over loads/stores (nreps > 1) Amortizes launch/indexing overhead __global__ void copy(float *odata, float* idata, int width, int height, int nreps) { int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index = xIndex + width*yIndex; for (int r = 0; r < nreps; r++) { for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) { odata[index+i*width] = idata[index+i*width]; } } }© NVIDIA Corporation 2008 26
  26. 26. Naïve Transpose Similar to copy Input and output matrices have different indices __global__ void transposeNaive(float *odata, float* idata, int width, int height, int nreps) { int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index_in = xIndex + width * yIndex; int index_out = yIndex + height * xIndex; for (int r=0; r < nreps; r++) { for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index_out+i] = idata[index_in+i*width]; } } idata odata }© NVIDIA Corporation 2008 27
  27. 27. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over Loop in kernel kernel Simple Copy 96.9 81.6 Naïve 2.2 2.2 Transpose© NVIDIA Corporation 2008 28
  28. 28. gmem coalescing
  29. 29. Memory CoalescingGPU memory controller granularity is 64 or 128 bytes Must also be 64 or 128 byte alignedSuppose thread loads a float (4 bytes) Controller loads 64 bytes, throws 60 bytes away
  30. 30. Memory Coalescing Memory controller actually more intelligent Consider half-warp (16 threads) Suppose each thread reads consecutive float Memory controller will perform one 64 byte load This is known as coalescingMake threads read consecutive locations
  31. 31. Coalescing Global memory access of 32, 64, or 128-bit words by a half- warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met Depends on compute capability 1.0 and 1.1 have stricter access requirements Examples – float (32-bit) data Global Memory } 64B aligned segment (16 floats) } 128B aligned segment (32 floats) Half-warp of threads© NVIDIA Corporation 2008 30
  32. 32. CoalescingCompute capability 1.0 and 1.1 K-th thread must access k-th word in the segment (or k-th word in 2 contiguous 128B segments for 128-bit words), not all threads need to participate Coalesces – 1 transactionOut of sequence – 16 transactions Misaligned – 16 transactions© NVIDIA Corporation 2008 31
  33. 33. Memory CoalescingGT200 has hardware coalescerInspects memory requests from each half-warpDetermines minimum set of transactions which are 64 or 128 bytes long 64 or 128 byte aligned
  34. 34. Coalescing Compute capability 1.2 and higher (e.g. GT200 like the C1060) Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words Smaller transactions may be issued to avoid wasted bandwidth due to unused words 1 transaction - 64B segment 1 transaction - 128B segment2 transactions - 64B and 32B segments © NVIDIA Corporation 2008 32
  35. 35. Coalescing Compute capability 2.0 (Fermi, Tesla C2050) Memory transactions handled per warp (32 threads) L1 cache ON: Issues always 128B segment transactions caches them in 16kB or 48kB L1 cache per multiprocessor 2 transactions - 2 x 128B segment - but next warp probably only 1 extra transaction, due to L1 cache. L1 cache OFF: Issues always 32B segment transactions E.g. advantage for widely scattered thread accesses 32 transactions - 32 x 32B segments, instead of 32 x 128B segments.© NVIDIA Corporation 2010
  36. 36. Coalescing SummaryCoalescing dramatically speeds global memory accessStrive for perfect coalescing: Align starting address (may require padding) A warp should access within a contiguous region
  37. 37. Coalescing in Transpose Naïve transpose coalesces reads, but not writes idata odata Elements transposed by a half-warp of threads Q: How to coalesce writes ?© NVIDIA Corporation 2008 33
  38. 38. smem as a cache
  39. 39. Shared MemorySMs can access gmem at 80+ GiB/secbut have hundreds of cycles of latencyEach SM has 16 kiB ‘shared’ memory Essentially user-managed cache Speed comparable to registers Accessible to all threads in a blockReduces load/stores to device memory
  40. 40. Shared Memory ~Hundred times faster than global memory Cache data to reduce global memory accesses Threads can cooperate via shared memory Use it to avoid non-coalesced access Stage loads and stores in shared memory to re-order non- coalesceable addressing© NVIDIA Corporation 2008 34
  41. 41. Coalescing in Transpose Naïve transpose coalesces reads, but not writes idata odata Elements transposed by a half-warp of threads Q: How to coalesce writes ?© NVIDIA Corporation 2008 33
  42. 42. Shared Memory ~Hundred times faster than global memory Cache data to reduce global memory accesses Threads can cooperate via shared memory Use it to avoid non-coalesced access Stage loads and stores in shared memory to re-order non- coalesceable addressing© NVIDIA Corporation 2008 34
  43. 43. A Common Programming Strategy!   Partition data into subsets that fit into shared memory © 2008 NVIDIA Corporation
  44. 44. A Common Programming Strategy!   Handle each data subset with one thread block © 2008 NVIDIA Corporation
  45. 45. A Common Programming Strategy!   Load the subset from global memory to shared memory, using multiple threads to exploit memory- level parallelism © 2008 NVIDIA Corporation
  46. 46. A Common Programming Strategy!   Perform the computation on the subset from shared memory © 2008 NVIDIA Corporation
  47. 47. A Common Programming Strategy!   Copy the result from shared memory back to global memory © 2008 NVIDIA Corporation
  48. 48. Coalescing through shared memory Access columns of a tile in shared memory to write contiguous data to global memory Requires __syncthreads() since threads write data read by other threads idata odata tile Elements transposed by a half-warp of threads© NVIDIA Corporation 2008 35
  49. 49. Coalescing through shared memory __global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps) { __shared__ float tile[TILE_DIM][TILE_DIM]; int xIndex = blockIdx.x * TILE_DIM + threadIdx.x; int yIndex = blockIdx.y * TILE_DIM + threadIdx.y; int index_in = xIndex + (yIndex)*width; xIndex = blockIdx.y * TILE_DIM + threadIdx.x; yIndex = blockIdx.x * TILE_DIM + threadIdx.y; int index_out = xIndex + (yIndex)*height; for (int r=0; r < nreps; r++) { for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width]; } __syncthreads(); for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) { odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i]; } } }© NVIDIA Corporation 2008 36
  50. 50. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over kernel Loop in kernel Simple Copy 96.9 81.6 Uses shared memory tileShared Memory Copy 80.9 81.1 and Naïve Transpose 2.2 2.2 __syncthreads()Coalesced Transpose 16.5 17.1 © NVIDIA Corporation 2008 37
  51. 51. smem bank conflicts
  52. 52. Shared Memory Architecture Many threads accessing memory Therefore, memory is divided into banks Successive 32-bit words assigned to successive banks Each bank can service one address per cycle Bank 0 A memory can service as many simultaneous Bank 1 accesses as it has banks Bank 2 Bank 3 Bank 4 Multiple simultaneous accesses to a bank Bank 5 result in a bank conflict Bank 6 Conflicting accesses are serialized Bank 7 Bank 15© NVIDIA Corporation 2008 39
  53. 53. Shared Memory Banks Bank 0 0 16 Bank 1 1 17 Bank 2 2 18Shared memory divided Bank 3 Bank 4 3 4 19 20into 16 ‘banks’ Bank 5 5 21 Bank 6 6 22Shared memory is (almost) Bank 7 7 23as fast as registers (...) Bank 8 8 24 Bank 9 9 25Exception is in case of Bank 10 Bank 11 10 11 26 27bank conflicts Bank 12 12 28 Bank 13 13 29 Bank 14 14 30 Bank 15 15 31 4 bytes
  54. 54. Bank Addressing Examples No Bank Conflicts No Bank Conflicts Linear addressing Random 1:1 Permutation stride == 1 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Bank 3 Thread 4 Bank 4 Thread 4 Bank 4 Thread 5 Bank 5 Thread 5 Bank 5 Thread 6 Bank 6 Thread 6 Bank 6 Thread 7 Bank 7 Thread 7 Bank 7 Thread 15 Bank 15 Thread 15 Bank 15© NVIDIA Corporation 2008 40
  55. 55. Bank Addressing Examples 2-way Bank Conflicts 8-way Bank Conflicts Linear addressing Linear addressing stride == 2 stride == 8 x8 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Thread 4 Bank 4 Thread 4 Bank 5 Thread 5 Bank 7 Bank 6 Thread 6 Bank 8 Bank 7 Thread 7 Bank 9 Thread 8 x8 Thread 9 Thread 10 Thread 11 Bank 15 Thread 15 Bank 15© NVIDIA Corporation 2008 41
  56. 56. Shared memory bank conflicts Shared memory is ~ as fast as registers if there are no bank conflicts warp_serialize profiler signal reflects conflicts The fast case: If all threads of a half-warp access different banks, there is no bank conflict If all threads of a half-warp read the identical address, there is no bank conflict (broadcast) The slow case: Bank Conflict: multiple threads in the same half-warp access the same bank Must serialize the accesses Cost = max # of simultaneous accesses to a single bank© NVIDIA Corporation 2008 42
  57. 57. Bank Conflicts in Transpose 32x32 shared memory tile of floats Data in columns k and k+16 are in same bank 16-way bank conflict reading half columns in tile Solution - pad shared memory array __shared__ float tile[TILE_DIM][TILE_DIM+1]; Q: How to avoid bank conflicts ? Data in anti-diagonals are in same bank idata odata tile© NVIDIA Corporation 2008 43
  58. 58. Bank Conflicts in Transpose 32x32 shared memory tile of floats Data in columns k and k+16 are in same bank 16-way bank conflict reading half columns in tile Solution - pad shared memory array __shared__ float tile[TILE_DIM][TILE_DIM+1]; Data in anti-diagonals are in same bank idata odata tile© NVIDIA Corporation 2008 43
  59. 59. Illustration: shared memory bank conflicts • 32x32 SMEM array • A warp accesses a column, so every thread hits the same bank (diagram: warps 0 1 2 ... 31 against Bank 0 ... Bank 31). © NVIDIA 2010
  60. 60. Illustration: add a column of padding • 32x33 SMEM array • A warp accessing a column now touches 32 different banks, so there are no bank conflicts (diagram: warps 0 1 2 ... 31, padding column, Bank 0 ... Bank 31). © NVIDIA 2010
  61. 61. Effective Bandwidth Effective Bandwidth (GB/s) 2048x2048, GTX 280 Loop over Loop in kernel kernel Simple Copy 96.9 81.6 Shared Memory Copy 80.9 81.1 Naïve Transpose 2.2 2.2 Coalesced Transpose 16.5 17.1 Bank Conflict Free Transpose 16.6 17.2© NVIDIA Corporation 2008 44
  62. 62. Need a pause?
  63. 63. Unrelated: Thatcher Illusion
  64. 64. Unrelated: Thatcher Illusion
  65. 65. gmem partition camping
  66. 66. Partition Camping Global memory accesses go through partitions 6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs Successive 256-byte regions of global memory are assigned to successive partitions For best performance: Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions Partition Camping occurs when global memory accesses at an instant use a subset of partitions Directly analogous to shared memory bank conflicts, but on a larger scale© NVIDIA Corporation 2008 46
  67. 67. Partition Camping in Transpose Partition width = 256 bytes = 64 floats Twice width of tile On GTX280 (8 partitions), data 2KB apart map to same partition 2048 floats divides evenly by 2KB => columns of matrices map to same partition idata odata tiles in matrices 0 1 2 3 4 5 0 64 128 colors = partitions 64 65 66 67 68 69 1 65 129 128 129 130 ... 2 66 130 3 67 ... 4 68 5 69 blockId = gridDim.x * blockIdx.y + blockIdx.x© NVIDIA Corporation 2008 47
  68. 68. Partition Camping Solutions Pad matrices (by two tiles) In general might be expensive/prohibitive memory-wise Diagonally reorder blocks Interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal idata odata 0 64 128 0 1 65 129 64 1 2 66 130 128 65 2 3 67 ... 129 66 3 4 68 130 67 4 5 ... 68 5 blockId = gridDim.x * blockIdx.y + blockIdx.x© NVIDIA Corporation 2008 48
  69. 69. Diagonal Transpose
  __global__ void transposeDiagonal(float *odata, float *idata,
                                    int width, int height, int nreps)
  {
    __shared__ float tile[TILE_DIM][TILE_DIM+1];

    // Add lines to map diagonal to Cartesian coordinates
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;

    // Replace blockIdx.x with blockIdx_x, blockIdx.y with blockIdx_y
    int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int r=0; r < nreps; r++) {
      for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
      }
      __syncthreads();
      for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
        odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
      }
    }
  }
  © NVIDIA Corporation 2008 49
  70. 70. Diagonal Transpose
  Previous slide for square matrices (width == height). More generally:
  if (width == height) {
    blockIdx_y = blockIdx.x;
    blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
  } else {
    int bid = blockIdx.x + gridDim.x*blockIdx.y;
    blockIdx_y = bid%gridDim.y;
    blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
  }
  © NVIDIA Corporation 2008 50
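Since the diagonal reordering is pure index arithmetic, it can be verified off-device. A host-side C++ sketch of the general mapping above (helper names are mine) checks that the remap still visits every tile exactly once:

```cpp
#include <set>
#include <utility>

// Diagonal reordering of block indices (general case from the slide):
// returns the Cartesian (blockIdx_x, blockIdx_y) for a launched (bx, by)
// on a gx-by-gy grid.
std::pair<int,int> diag_remap(int bx, int by, int gx, int gy) {
    int blockIdx_x, blockIdx_y;
    if (gx == gy) {                       // square grid
        blockIdx_y = bx;
        blockIdx_x = (bx + by) % gx;
    } else {
        int bid = bx + gx * by;
        blockIdx_y = bid % gy;
        blockIdx_x = ((bid / gy) + blockIdx_y) % gx;
    }
    return {blockIdx_x, blockIdx_y};
}

// The remap must be a bijection: every tile is still processed exactly once.
bool is_permutation_grid(int gx, int gy) {
    std::set<std::pair<int,int>> seen;
    for (int by = 0; by < gy; ++by)
        for (int bx = 0; bx < gx; ++bx)
            seen.insert(diag_remap(bx, by, gx, gy));
    return (int)seen.size() == gx * gy;
}
```

The bijection check is the important property: blocks are only *reordered*, so correctness is untouched while successive blocks now walk different partitions.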
  71. 71. Effective Bandwidth (GB/s), 2048x2048, GTX 280
                                  Loop over kernel   Loop in kernel
  Simple Copy                           96.9              81.6
  Shared Memory Copy                    80.9              81.1
  Naïve Transpose                        2.2               2.2
  Coalesced Transpose                   16.5              17.1
  Bank Conflict Free Transpose          16.6              17.2
  Diagonal                              69.5              78.3
  © NVIDIA Corporation 2008 51
  72. 72. Order of Optimizations
  Larger optimization issues can mask smaller ones. The proper order of some optimization techniques is not known a priori; e.g. partition camping is problem-size dependent. Don’t dismiss an optimization technique as ineffective until you know it was applied at the right time.
  Naïve 2.2 GB/s → Coalescing 16.5 GB/s → Bank Conflicts 16.6 GB/s → Partition Camping 69.5 GB/s
  Naïve 2.2 GB/s → Coalescing 16.5 GB/s → Partition Camping 48.8 GB/s → Bank Conflicts 69.5 GB/s
  © NVIDIA Corporation 2008 52
  73. 73. Transpose Summary
  Coalescing and shared memory bank conflicts are small-scale phenomena: they deal with memory access within a half-warp and are problem-size independent.
  Partition camping is a large-scale phenomenon: it deals with simultaneous memory accesses by warps on different multiprocessors and is problem-size dependent (wouldn’t see it in a (2048+32)^2 matrix).
  Coalescing is generally the most critical.
  SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
  © NVIDIA Corporation 2008 53
  74. 74. tmem
  75. 75. Textures in CUDA Texture is an object for reading data Benefits: Data is cached (optimized for 2D locality) Helpful when coalescing is a problem Filtering Linear / bilinear / trilinear Dedicated hardware Wrap modes (for “out-of-bounds” addresses) Clamp to edge / repeat Addressable in 1D, 2D, or 3D Using integer or normalized coordinates Usage: CPU code binds data to a texture object Kernel reads data by calling a fetch function© NVIDIA Corporation 2008 55
  76. 76. Other goodiesOptional “format conversion”• {char, short, int, half} (16bit) to float (32bit)• “for free”• useful for *mem compression (see later)
  77. 77. Texture Addressing 0 1 2 3 4 0 (2.5, 0.5) 1 (1.0, 1.0) 2 3Wrap Clamp Out-of-bounds coordinate is Out-of-bounds coordinate is wrapped (modulo arithmetic) replaced with the closest boundary 0 1 2 3 4 0 1 2 3 4 0 0 (5.5, 1.5) (5.5, 1.5) 1 1 2 2 3 3© NVIDIA Corporation 2008 56
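Wrap and clamp are simple integer rules once a coordinate is truncated to a texel index. A host-side C++ sketch (function names are mine):

```cpp
#include <algorithm>

// Wrap addressing: out-of-bounds texel index wraps around (modulo),
// as in the texture "repeat" mode above.
int wrap(int i, int n) {
    int m = i % n;
    return m < 0 ? m + n : m;   // keep result in [0, n)
}

// Clamp addressing: out-of-bounds texel index snaps to the nearest edge.
int clamp(int i, int n) {
    return std::min(std::max(i, 0), n - 1);
}
```

For the slide’s 5-texel-wide example, a coordinate of 5.5 truncates to texel 5: wrap fetches texel 0, clamp fetches texel 4.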
  78. 78. Two CUDA Texture Types Bound to linear memory Global memory address is bound to a texture Only 1D Integer addressing No filtering, no addressing modes Bound to CUDA arrays CUDA array is bound to a texture 1D, 2D, or 3D Float addressing (size-based or normalized) Filtering Addressing modes (clamping, repeat) Both: Return either element type or normalized float© NVIDIA Corporation 2008 57
  79. 79. CUDA Texturing Steps Host (CPU) code: Allocate/obtain memory (global linear, or CUDA array) Create a texture reference object Currently must be at file-scope Bind the texture reference to memory/array When done: Unbind the texture reference, free resources Device (kernel) code: Fetch using texture reference Linear memory textures: tex1Dfetch() Array textures: tex1D() or tex2D() or tex3D()© NVIDIA Corporation 2008 58
  80. 80. cmem
  81. 81. Constant Memory
  • Ideal for coefficients and other data that is read uniformly by warps
  • Data is stored in global memory, read through a constant-cache
  – __constant__ qualifier in declarations
  – Can only be read by GPU kernels
  – Limited to 64KB
  • Fermi adds uniform accesses:
  – Kernel pointer argument qualified with const
  – Compiler must determine that all threads in a threadblock will dereference the same address
  – No limit on array size, can use any global memory pointer
  • Constant cache throughput:
  – 32 bits per warp per 2 clocks per multiprocessor
  – To be used when all threads in a warp read the same address
  • Serializes otherwise
  © NVIDIA 2010
  82. 82. Constant Memory
  • Ideal for coefficients and other data read uniformly by warps; stored in global memory, read through the constant cache (__constant__ qualifier, 64KB limit, kernel-read-only)
  • Fermi uniform accesses: kernel pointer argument qualified with const; compiler must determine that all threads in a threadblock will dereference the same address
  __global__ void kernel( const float *g_a )
  {
    ...
    float x = g_a[15];              // uniform
    float y = g_a[blockIdx.x+5];    // uniform
    float z = g_a[threadIdx.x];     // non-uniform
    ...
  }
  © NVIDIA 2010
  83. 83. Constant Memory
  • Kernel executes 10K threads (320 warps) per SM during its lifetime
  • All threads access the same 4B word
  • Using GMEM:
  – Each warp fetches 128 B → 40 KB of bus traffic
  – Caching loads potentially worse: 128-byte lines, likely to be evicted multiple times
  addresses from a warp: ... 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448
  © NVIDIA 2010
  84. 84. Constant Memory
  • Kernel executes 10K threads (320 warps) per SM during its lifetime
  • All threads access the same 4B word
  • Using constant/uniform access:
  – First warp fetches 32 bytes
  – All others hit in constant cache → 32 bytes of bus traffic
  • Unlikely to be evicted over kernel lifetime – other loads do not go through this cache
  addresses from a warp: ... 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448
  © NVIDIA 2010
  85. 85. *mem compression
  86. 86. Optimizing with Compression
  • When all else has been optimized and the kernel is limited by the number of bytes needed: consider compression
  • Approaches:
  – Int: conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple)
  – FP: conversion between fp16, fp32, fp64 is one instruction
  • fp16 (half) is storage only, no math instructions
  – Range-based:
  • Lower and upper limits are kernel arguments
  • Data is an index for interpolation
  • Application in practice:
  – Clark et al. “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs”
  – http://arxiv.org/abs/0911.3191
  © NVIDIA 2010
  87. 87. Accelerating GPU computation through mixed-precision methods. Michael Clark, Harvard-Smithsonian Center for Astrophysics, Harvard University. SC’10
  88. 88. ... too much ? bank conflicts, coalescing, precision, caching, mixed, clamping, partition camping, broadcasting, zero-copy, streams
  89. 89. Parallel Programming (Parking) is Hard (but you’ll pick it up)
  90. 90. (you are not alone)
  91. 91. 3. Threading/Execution Optimizations
  92. 92. 3.1 Exec. Configuration Optimizations
  93. 93. Occupancy Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently Limited by resource usage: Registers Shared memory© NVIDIA Corporation 2008 60
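The definition above can be turned into a quick host-side estimate. A C++ sketch (assumes 10-series limits: 32 resident warps, 8 resident blocks, 16K registers and 16KB shared memory per SM; the real hardware also applies allocation-granularity rules this ignores):

```cpp
#include <algorithm>

// Occupancy estimate from the definition on the slide: warps that fit on
// an SM (limited by registers and shared memory) over max resident warps.
double occupancy(int threads_per_block, int regs_per_thread, int smem_per_block) {
    const int max_warps = 32, reg_file = 16384, smem_bytes = 16384;
    int warps_per_block = (threads_per_block + 31) / 32;
    int blocks_by_regs  = reg_file / (regs_per_thread * threads_per_block);
    int blocks_by_smem  = smem_per_block ? smem_bytes / smem_per_block : 8;
    int blocks = std::min({blocks_by_regs, blocks_by_smem, 8}); // 8 blocks/SM max
    int warps  = std::min(blocks * warps_per_block, max_warps);
    return (double)warps / max_warps;
}
```

For example, 256-thread blocks at 16 registers/thread fit 4 blocks (32 warps, 100% occupancy); doubling register use to 32/thread halves the resident warps and the occupancy.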
  94. 94. Grid/Block Size Heuristics # of blocks > # of multiprocessors So all multiprocessors have at least one block to execute # of blocks / # of multiprocessors > 2 Multiple blocks can run concurrently in a multiprocessor Blocks that aren’t waiting at a __syncthreads() keep the hardware busy Subject to resource availability – registers, shared memory # of blocks > 100 to scale to future devices Blocks executed in pipeline fashion 1000 blocks per grid will scale across multiple generations© NVIDIA Corporation 2008 61
  95. 95. Register Dependency Read-after-write register dependency Instruction’s result can be read ~24 cycles later Scenarios: CUDA: PTX: x = y + 5; add.f32 $f3, $f1, $f2 z = x + 3; add.f32 $f5, $f3, $f4 s_data[0] += 3; ld.shared.f32 $f3, [$r31+0] add.f32 $f3, $f3, $f4 To completely hide the latency: Run at least 192 threads (6 warps) per multiprocessor At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3) Threads do not have to belong to the same thread block© NVIDIA Corporation 2008 62
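The 192-thread figure is just arithmetic: a ~24-cycle read-after-write latency divided by 4 cycles per warp instruction gives 6 warps. As a C++ sketch (function names are mine):

```cpp
// Warps needed to hide a read-after-write latency: each warp instruction
// occupies the pipeline for `issue_cycles`, so latency / issue_cycles
// warps keep the multiprocessor busy while the result is in flight.
int warps_to_hide(int latency_cycles, int issue_cycles) {
    return (latency_cycles + issue_cycles - 1) / issue_cycles;
}

int threads_to_hide(int latency_cycles, int issue_cycles) {
    return 32 * warps_to_hide(latency_cycles, issue_cycles);  // 32 threads/warp
}
```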
  96. 96. Register Pressure Hide latency by using more threads per SM Limiting Factors: Number of registers per kernel 8K/16K per SM, partitioned among concurrent threads Amount of shared memory 16KB per SM, partitioned among concurrent threadblocks Compile with –ptxas-options=-v flag Use –maxrregcount=N flag to NVCC N = desired maximum registers / kernel At some point “spilling” into local memory may occur Reduces performance – local memory is slow© NVIDIA Corporation 2008 63
  97. 97. Occupancy Calculator© NVIDIA Corporation 2008 64
  98. 98. Optimizing threads per block Choose threads per block as a multiple of warp size Avoid wasting computation on under-populated warps Want to run as many warps as possible per multiprocessor (hide latency) Multiprocessor can run up to 8 blocks at a time Heuristics Minimum: 64 threads per block Only if multiple concurrent blocks 192 or 256 threads a better choice Usually still enough regs to compile and invoke successfully This all depends on your computation, so experiment!© NVIDIA Corporation 2008 65
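“Multiple of warp size” can be quantified as the idle lanes in a block’s last, under-populated warp; a small C++ sketch (function name is mine):

```cpp
// Wasted lanes when block size is not a multiple of the 32-thread warp:
// a block of n threads occupies ceil(n/32) warps, i.e. 32*ceil(n/32) lanes.
int wasted_lanes(int threads_per_block) {
    int warps = (threads_per_block + 31) / 32;
    return warps * 32 - threads_per_block;
}
```

A 100-thread block occupies 4 warps but leaves 28 lanes idle every instruction; 96 or 128 threads wastes none.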
  99. 99. Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT … Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels (It all comes down to arithmetic intensity and available parallelism)© NVIDIA Corporation 2008 66
  100. 100. (slide garbled in extraction; cites a GTC’10 talk)
  101. 101. Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT … Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels (It all comes down to arithmetic intensity and available parallelism)© NVIDIA Corporation 2008 66
  102. 102. 3.2 Instruction Optimizations
  103. 103. CUDA Instruction Performance Instruction cycles (per warp) = sum of Operand read cycles Instruction execution cycles Result update cycles Therefore instruction throughput depends on Nominal instruction throughput Memory latency Memory bandwidth “Cycle” refers to the multiprocessor clock rate 1.3 GHz on the Tesla C1060, for example© NVIDIA Corporation 2008 69
  104. 104. Maximizing Instruction Throughput Maximize use of high-bandwidth memory Maximize use of shared memory Minimize accesses to global memory Maximize coalescing of global memory accesses Optimize performance by overlapping memory accesses with HW computation High arithmetic intensity programs i.e. high ratio of math to memory transactions Many concurrent threads© NVIDIA Corporation 2008 70
  105. 105. Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 4 cycles per warp int multiply (*) is by default 32-bit requires multiple cycles / warp Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply Integer divide and modulo are more expensive Compiler will convert literal power-of-2 divides to shifts But we have seen it miss some cases Be explicit in cases where compiler can’t tell that divisor is a power of 2! Useful trick: foo % n == foo & (n-1) if n is a power of 2© NVIDIA Corporation 2008 71
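The modulo trick on the slide is easy to check on the host; a C++ sketch:

```cpp
// The slide's trick: for unsigned foo and n a power of 2,
// foo % n equals foo & (n - 1). It does NOT hold for other n.
bool trick_holds(unsigned foo, unsigned n) {
    return (foo % n) == (foo & (n - 1));
}

// Standard power-of-two test using the same bit pattern.
bool is_pow2(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}
```

E.g. 37 % 8 == 37 & 7 == 5, but 37 % 6 is 1 while 37 & 5 is 5 — hence the slide’s warning to be explicit when the compiler can’t tell the divisor is a power of 2.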
  106. 106. Runtime Math Library There are two types of runtime math operations in single-precision __funcf(): direct mapping to hardware ISA Fast but lower accuracy (see prog. guide for details) Examples: __sinf(x), __expf(x), __powf(x,y) funcf() : compile to multiple instructions Slower but higher accuracy (5 ulp or less) Examples: sinf(x), expf(x), powf(x,y) The -use_fast_math compiler option forces every funcf() to compile to __funcf()© NVIDIA Corporation 2008 72
  107. 107. GPU results may not match CPU Many variables: hardware, compiler, optimization settings CPU operations aren’t strictly limited to 0.5 ulp Sequences of operations can be more accurate due to 80-bit extended precision ALUs Floating-point arithmetic is not associative!© NVIDIA Corporation 2008 73
  108. 108. FP Math is Not Associative! In symbolic math, (x+y)+z == x+(y+z) This is not necessarily true for floating-point addition Try x = 1030, y = -1030 and z = 1 in the above equation When you parallelize computations, you potentially change the order of operations Parallel results may not exactly match sequential results This is not specific to GPU or CUDA – inherent part of parallel execution© NVIDIA Corporation 2008 74
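The slide’s example runs as-is on the host, since CPU float follows the same IEEE rounding here:

```cpp
// Floating-point addition is not associative: with x = 1e30f, y = -1e30f,
// z = 1.0f, (x+y)+z gives 1 but x+(y+z) gives 0, because adding 1 to
// -1e30 is absorbed by rounding before x cancels y.
float sum_left(float x, float y, float z)  { return (x + y) + z; }
float sum_right(float x, float y, float z) { return x + (y + z); }
```

This is why a parallel reduction, which regroups the additions, can legitimately differ from a sequential sum.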
  109. 109. Control Flow Instructions Main performance concern with branching is divergence Threads within a single warp take different paths Different execution paths must be serialized Avoid divergence when branch condition is a function of thread ID Example with divergence: if (threadIdx.x > 2) { } Branch granularity < warp size Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { } Branch granularity is a whole multiple of warp size© NVIDIA Corporation 2008 75
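Divergence can be predicted on the host by evaluating the branch predicate across a warp’s 32 lanes; a C++ sketch (helper name is mine):

```cpp
#include <set>

// Number of distinct branch outcomes within one 32-thread warp for a
// predicate on the linear thread index; 1 means no divergence,
// 2 means both paths must be serialized for that warp.
template <typename Pred>
int paths_in_warp(int warp_id, Pred p) {
    std::set<bool> outcomes;
    for (int lane = 0; lane < 32; ++lane)
        outcomes.insert(p(warp_id * 32 + lane));
    return (int)outcomes.size();
}
```

With the slide’s examples: `threadIdx.x > 2` splits warp 0 into two paths (lanes 0–2 vs the rest), while `threadIdx.x / WARP_SIZE > 2` gives every warp a single path.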
  110. 110. Scared ?
  111. 111. Scared ? Howwwwww?! (do I start)
  112. 112. Profiler
  113. 113. Analysis with Profiler
  • Useful counters:
  – instructions_issued, instructions_executed
  • Incremented by 1 per warp
  • “issued” includes replays, “executed” does not
  – gld_request, gst_request
  • Incremented by 1 per warp for each load/store instruction
  • Instruction may be counted if it is “predicated out”
  – l1_global_load_miss, l1_global_load_hit, global_store_transaction
  • Incremented by 1 per L1 line (line is 128B)
  – uncached_global_load_transaction
  • Incremented by 1 per group of 1, 2, 4, or 8 transactions
  • Example:
  – 32 × instructions_issued (32 = warp size)
  – 128B × (global_store_transaction + l1_global_load_miss)
  © NVIDIA 2010
  114. 114. CUDA Visual Profiler data for memory transfers Memory transfer type and direction (D=Device, H=Host, A=cuArray) e.g. H to D: Host to Device Synchronous / Asynchronous Memory transfer size, in bytes Stream ID© NVIDIA Corporation 2010
  115. 115. CUDA Visual Profiler data for kernels© NVIDIA Corporation 2010
  116. 116. CUDA Visual Profiler computed data for kernels Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate Global memory read throughput (Gigabytes/second) Global memory write throughput (Gigabytes/second) Overall global memory access throughput (Gigabytes/second) Global memory load efficiency Global memory store efficiency© NVIDIA Corporation 2010
  117. 117. CUDA Visual Profiler data analysis views Views: Summary table Kernel table Memcopy table Summary plot GPU Time Height plot GPU Time Width plot Profiler counter plot Profiler table column plot Multi-device plot Multi-stream plot Analyze profiler counters Analyze kernel occupancy© NVIDIA Corporation 2010
  118. 118. CUDA Visual Profiler Misc. Multiple sessions Compare views for different sessions Comparison Summary plot Profiler projects save & load Import/Export profiler data (.CSV format)© NVIDIA Corporation 2010
  119. 119. Scared ? meh!!!! I don’t like to profile
  120. 120. Modified source code
  121. 121. Analysis with Modified Source Code
  • Time memory-only and math-only versions of the kernel
  – Easier for codes that don’t have data-dependent control-flow or addressing
  – Gives estimates of:
  • Time spent accessing memory
  • Time spent in executing instructions
  • Comparing the times for modified kernels
  – Helps decide whether the kernel is mem or math bound
  – Shows how well memory operations are overlapped with arithmetic
  • Compare the sum of mem-only and math-only times to full-kernel time
  © NVIDIA 2010
  122. 122. Scared ? I want to believe...
  123. 123. Analysis with Modified Source Code: Scenarios
  time: mem | math | full, for each case
  – Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory)
  – Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory)
  – Balanced: good mem-math overlap, latency not a problem (assuming memory/instr throughput is not low compared to HW theory)
  – Memory and latency bound: poor mem-math overlap, latency is a problem
  Memory bound? Math bound? Latency bound?
  © NVIDIA 2010
  124. 124. Analysis with Modified Source Code: Scenarios
  time: mem | math | full, for each case
  – Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory)
  – Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory)
  – Balanced: good mem-math overlap, latency not a problem (assuming memory/instr throughput is not low compared to HW theory)
  – Memory and latency bound: poor mem-math overlap, latency is a problem
  © NVIDIA 2010
  126. 126. Argn&%#$... too many optimizations !!!
  127. 127. Parameterize Your Application Parameterization helps adaptation to different GPUs GPUs vary in many ways # of multiprocessors Memory bandwidth Shared memory size Register file size Max. threads per block You can even make apps self-tuning (like FFTW and ATLAS) “Experiment” mode discovers and saves optimal configuration© NVIDIA Corporation 2008 67
  128. 128. More ?
  • Next week: GPU “Scripting”, Meta-programming, Auto-tuning
  • Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
  • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)
  • Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA)
  • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UCDavis)
  • Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirginia)
  • ...
  129. 129. one more thing or two...
  130. 130. Life/Code Hacking #2.x Speed {listen,read,writ}ing
  accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  131. 131. Life/Code Hacking #2.2 Speed writing
  132. 132. Life/Code Hacking #2.2 Speed writing
  http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html
  133. 133. Life/Code Hacking #2.2 Speed writing
  Typing tutors: gtypist, ktouch, typingweb.com, etc.
  134. 134. Life/Code Hacking #2.2 Speed writing
  Kinesis Advantage (QWERTY/DVORAK)
  135. 135. Demo
  136. 136. COME