[Harvard CS264] 05 - Advanced-level CUDA Programming
http://cs264.org

Presentation Transcript

  • Massively Parallel Computing — CS 264 / CSCI E-292. Lecture #5: Advanced CUDA | February 22nd, 2011. Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • Administrivia • HW2: out, due Mon 3/14/11 (not Fri 3/11/11) • Projects: think about it, consult the staff (*); proposals due ~Fri 3/25/11 • Guest lectures: schedule coming soon; on Fridays 7:35–9:35pm (March, April)?
  • During this course, we’ll try to “borrow” and re-use existing material adapted for CS264 ;-)
  • Today... yey!!
  • Outline: 1. Hardware Review 2. Memory/Communication Optimizations 3. Threading/Execution Optimizations
  • 1. Hardware Review
  • 10-Series Architecture: 240 thread processors execute kernel threads; 30 multiprocessors, each containing 8 thread processors, one double-precision unit, and shared memory that enables thread cooperation. © NVIDIA Corporation 2008
  • Threading Hierarchy / Execution Model (software → hardware): threads are executed by thread processors. Thread blocks are executed on multiprocessors; thread blocks do not migrate. Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file). A kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time. © 2008 NVIDIA Corporation
  • Warps and Half Warps: a thread block consists of 32-thread warps. A warp is executed physically in parallel (SIMD) on a multiprocessor. A half-warp of 16 threads can coordinate global memory accesses into a single transaction. © NVIDIA Corporation 2008
  • Memory Architecture: host (CPU, chipset, DRAM) and device (GPU with multiprocessors — registers and shared memory on chip — plus device DRAM holding local, global, constant, and texture memory, with constant and texture caches on chip). © NVIDIA Corporation 2008
  • Kernel Memory Access: per-thread — registers (on-chip) and local memory (off-chip, uncached); per-block — shared memory (on-chip, small, fast); per-device — global memory (off-chip, large, uncached, persistent across kernel launches, kernel I/O).
  • Global Memory: different types of “global memory” — linear memory, texture memory, constant memory (plus on-chip shared memory). Same per-thread/per-block/per-device access hierarchy as the previous slide.
  • Memory Architecture:
      Memory    Location  Cached  Access  Scope                   Lifetime
      Register  On-chip   N/A     R/W     One thread              Thread
      Local     Off-chip  No      R/W     One thread              Thread
      Shared    On-chip   N/A     R/W     All threads in a block  Block
      Global    Off-chip  No      R/W     All threads + host      Application
      Constant  Off-chip  Yes     R       All threads + host      Application
      Texture   Off-chip  Yes     R       All threads + host      Application
    © NVIDIA Corporation 2008
  • 2. Memory/Communication Optimizations
  • Review: 2.1 Host/Device Transfer Optimizations
  • Review: PC Architecture — CPU and chipset linked by the front-side bus; chipset to GPU over PCIe (8 GB/s); SATA and Ethernet peripherals (3+ Gb/s); 25+ GB/s to host memory; 160+ GB/s from the GPU to VRAM. (modified from Matthew Bolitho)
  • Review: The PCI-“not-so”-express Bus • The PCIe bus is slow • Try to minimize/group transfers • Use pinned memory on the host whenever possible • Try to perform copies asynchronously (e.g. streams) • Use “zero-copy” when appropriate • Examples in the SDK (e.g. bandwidthTest)
  • 2.2 Device Memory Optimizations
  • Definitions• gmem: global memory• smem: shared memory• tmem: texture memory• cmem: constant memory• bmem: binary code (cubin) memory ?!? (covered next week)
  • Performance Analysis — e.g. Matrix Transpose
  • Matrix Transpose: transpose a 2048x2048 matrix of floats, performed out-of-place (separate input and output matrices). Use a tile of 32x32 elements and a block of 32x8 threads, so each thread processes 4 matrix elements; in general, tile and block size are fair game for optimization. Process: get the right answer; measure effective bandwidth (relative to theoretical or a reference case); address global memory coalescing, shared memory bank conflicts, and partition camping while repeating the above steps. © NVIDIA Corporation 2008
  • Theoretical Bandwidth: device bandwidth of the GTX 280 = 1107 × 10^6 (memory clock, Hz) × (512 / 8) (memory interface, bytes) × 2 (DDR) / 1024^3 = 131.9 GB/s. Specs report 141 GB/s because they use the 10^9 B/GB conversion rather than 1024^3; whichever you use, be consistent. © NVIDIA Corporation 2008
  • Effective Bandwidth: transpose effective bandwidth = 2048^2 × 4 B/element (matrix size in bytes) × 2 (read and write) / 1024^3 / (time in secs) GB/s. Reference case — matrix copy: transpose operates on tiles, so we need a better comparison than raw device bandwidth; look at the effective bandwidth of a copy that uses tiles. © NVIDIA Corporation 2008
  • Matrix Copy Kernel (TILE_DIM = 32, BLOCK_ROWS = 8; a 32x32 tile, a 32x8 thread block; idata and odata in global memory; elements copied by a half-warp of threads):

    __global__ void copy(float *odata, float *idata, int width, int height)
    {
      int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
      int index  = xIndex + width * yIndex;
      for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        odata[index + i*width] = idata[index + i*width];
      }
    }
    © NVIDIA Corporation 2008
  • Matrix Copy Kernel Timing: measure elapsed time over a loop. Looping/timing is done in two ways: over kernel launches (nreps = 1), which includes launch/indexing overhead; or within the kernel over loads/stores (nreps > 1), which amortizes launch/indexing overhead.

    __global__ void copy(float *odata, float *idata, int width, int height, int nreps)
    {
      int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
      int index  = xIndex + width * yIndex;
      for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          odata[index + i*width] = idata[index + i*width];
        }
      }
    }
    © NVIDIA Corporation 2008
  • Naïve Transpose: similar to copy, but the input and output matrices have different indices.

    __global__ void transposeNaive(float *odata, float *idata, int width, int height, int nreps)
    {
      int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
      int index_in  = xIndex + width  * yIndex;
      int index_out = yIndex + height * xIndex;
      for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          odata[index_out + i] = idata[index_in + i*width];
        }
      }
    }
    © NVIDIA Corporation 2008
  • Effective Bandwidth (GB/s), 2048x2048, GTX 280:
                        Loop over kernel   Loop in kernel
      Simple Copy             96.9              81.6
      Naïve Transpose          2.2               2.2
    © NVIDIA Corporation 2008
  • gmem coalescing
  • Memory Coalescing: the GPU memory controller granularity is 64 or 128 bytes, and accesses must also be 64- or 128-byte aligned. Suppose a thread loads a float (4 bytes): the controller loads 64 bytes and throws 60 bytes away.
  • Memory Coalescing: the memory controller is actually more intelligent. Consider a half-warp (16 threads) where each thread reads a consecutive float: the controller performs one 64-byte load. This is known as coalescing. Make threads read consecutive locations.
  • Coalescing: global memory access of 32-, 64-, or 128-bit words by a half-warp of threads can result in as few as one (or two) transaction(s) if certain access requirements are met. This depends on compute capability: 1.0 and 1.1 have stricter access requirements. Examples with float (32-bit) data: a 64B aligned segment holds 16 floats; a 128B aligned segment holds 32 floats. © NVIDIA Corporation 2008
  • Coalescing, compute capability 1.0 and 1.1: the k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate. In sequence: 1 transaction. Out of sequence: 16 transactions. Misaligned: 16 transactions. © NVIDIA Corporation 2008
  • Memory Coalescing: GT200 has a hardware coalescer. It inspects the memory requests from each half-warp and determines the minimum set of transactions that are 64 or 128 bytes long and 64- or 128-byte aligned.
  • Coalescing, compute capability 1.2 and higher (e.g. GT200 like the C1060): coalescing is achieved for any pattern of addresses that fits into a segment of size 32B for 8-bit words, 64B for 16-bit words, or 128B for 32- and 64-bit words. Smaller transactions may be issued to avoid wasted bandwidth due to unused words. Examples: 1 transaction (64B segment); 1 transaction (128B segment); 2 transactions (64B and 32B segments). © NVIDIA Corporation 2008
  • Coalescing, compute capability 2.0 (Fermi, Tesla C2050): memory transactions are handled per warp (32 threads). With the L1 cache ON, the hardware always issues 128B segment transactions and caches them in the 16kB or 48kB L1 cache per multiprocessor — e.g. 2 transactions of 128B each, but the next warp probably costs only 1 extra transaction thanks to the L1 cache. With the L1 cache OFF, it always issues 32B segment transactions — an advantage for widely scattered thread accesses, e.g. 32 transactions of 32B instead of 32 x 128B segments. © NVIDIA Corporation 2010
  • Coalescing Summary: coalescing dramatically speeds global memory access. Strive for perfect coalescing: align the starting address (may require padding), and a warp should access within a contiguous region.
  • Coalescing in Transpose: the naïve transpose coalesces reads, but not writes (elements transposed by a half-warp of threads). Q: how to coalesce writes? © NVIDIA Corporation 2008
  • smem as a cache
  • Shared Memory: SMs can access gmem at 80+ GiB/sec but with hundreds of cycles of latency. Each SM has 16 kiB of ‘shared’ memory: essentially a user-managed cache, with speed comparable to registers, accessible to all threads in a block. It reduces loads/stores to device memory.
  • Shared Memory: ~a hundred times faster than global memory. Cache data in it to reduce global memory accesses; threads can cooperate via shared memory. Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing. © NVIDIA Corporation 2008
  • A Common Programming Strategy: partition data into subsets that fit into shared memory. Handle each data subset with one thread block. Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism. Perform the computation on the subset from shared memory. Copy the result from shared memory back to global memory. © 2008 NVIDIA Corporation
  • Coalescing through shared memory: access columns of a tile in shared memory to write contiguous data to global memory. Requires __syncthreads(), since threads write data read by other threads. (Elements transposed by a half-warp of threads.) © NVIDIA Corporation 2008
  • Coalescing through shared memory:

    __global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps)
    {
      __shared__ float tile[TILE_DIM][TILE_DIM];
      int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
      int index_in = xIndex + yIndex * width;
      xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
      yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
      int index_out = xIndex + yIndex * height;
      for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
        }
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
        }
      }
    }
    © NVIDIA Corporation 2008
  • Effective Bandwidth (GB/s), 2048x2048, GTX 280 (the Shared Memory Copy uses a shared memory tile and __syncthreads()):
                            Loop over kernel   Loop in kernel
      Simple Copy                 96.9              81.6
      Shared Memory Copy          80.9              81.1
      Naïve Transpose              2.2               2.2
      Coalesced Transpose         16.5              17.1
    © NVIDIA Corporation 2008
  • smem bank conflicts
  • Shared Memory Architecture: many threads access memory, therefore memory is divided into banks (Bank 0 ... Bank 15). Successive 32-bit words are assigned to successive banks. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized. © NVIDIA Corporation 2008
  • Shared Memory Banks: shared memory is divided into 16 ‘banks’, each 4 bytes wide (word i lives in bank i mod 16). Shared memory is (almost) as fast as registers; the exception is in case of bank conflicts.
  • Bank Addressing Examples — no bank conflicts: linear addressing with stride == 1 (thread i → bank i), or any random 1:1 permutation. © NVIDIA Corporation 2008
  • Bank Addressing Examples — 2-way bank conflicts: linear addressing with stride == 2. 8-way bank conflicts: linear addressing with stride == 8. © NVIDIA Corporation 2008
  • Shared memory bank conflicts: shared memory is ~as fast as registers if there are no bank conflicts (the warp_serialize profiler signal reflects conflicts). The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp read the identical address, there is no bank conflict (broadcast). The slow case — a bank conflict, where multiple threads in the same half-warp access the same bank — must serialize the accesses; cost = max # of simultaneous accesses to a single bank. © NVIDIA Corporation 2008
  • Bank Conflicts in Transpose: in a 32x32 shared memory tile of floats, data in columns k and k+16 are in the same bank, so reading half-columns of the tile causes a 16-way bank conflict (data in anti-diagonals are in the same bank). Q: how to avoid bank conflicts? Solution: pad the shared memory array — __shared__ float tile[TILE_DIM][TILE_DIM+1]; © NVIDIA Corporation 2008
  • Illustration — shared memory without padding (Fermi, 32 banks): a 32x32 SMEM array; a warp accessing a column hits 32-way bank conflicts (all threads in the warp access the same bank). © NVIDIA 2010
  • Illustration — adding a padding column: a 32x33 SMEM array; a warp accessing a column now touches 32 different banks — no bank conflicts. © NVIDIA 2010
  • Effective Bandwidth (GB/s), 2048x2048, GTX 280:
                                   Loop over kernel   Loop in kernel
      Simple Copy                        96.9              81.6
      Shared Memory Copy                 80.9              81.1
      Naïve Transpose                     2.2               2.2
      Coalesced Transpose                16.5              17.1
      Bank Conflict Free Transpose       16.6              17.2
    © NVIDIA Corporation 2008
  • Need a pause?
  • Unrelated: the Thatcher Illusion
  • gmem partition camping
  • Partition Camping: global memory accesses go through partitions — 6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs; successive 256-byte regions of global memory are assigned to successive partitions. For best performance, simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions. Partition camping occurs when the global memory accesses at an instant use only a subset of partitions. It is directly analogous to shared memory bank conflicts, but on a larger scale. © NVIDIA Corporation 2008
  • Partition Camping in Transpose: partition width = 256 bytes = 64 floats — twice the width of a tile. On a GTX 280 (8 partitions), data 2KB apart map to the same partition; 2048 floats divides evenly by 2KB, so columns of the matrices map to the same partition (blockId = gridDim.x * blockIdx.y + blockIdx.x). © NVIDIA Corporation 2008
  • Partition Camping Solutions: pad the matrices (by two tiles) — in general this might be expensive or prohibitive memory-wise — or diagonally reorder the blocks: interpret blockIdx.y as different diagonal slices and blockIdx.x as the distance along a diagonal. © NVIDIA Corporation 2008
  • Diagonal Transpose (add two lines to map diagonal to Cartesian coordinates, then replace blockIdx.x with blockIdx_x and blockIdx.y with blockIdx_y):

    __global__ void transposeDiagonal(float *odata, float *idata, int width, int height, int nreps)
    {
      __shared__ float tile[TILE_DIM][TILE_DIM+1];
      int blockIdx_y = blockIdx.x;
      int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
      int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
      int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
      int index_in = xIndex + yIndex * width;
      xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
      yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
      int index_out = xIndex + yIndex * height;
      for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i*width];
        }
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
          odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y + i];
        }
      }
    }
    © NVIDIA Corporation 2008
  • Diagonal Transpose: the previous slide is for square matrices (width == height). More generally:

    if (width == height) {
      blockIdx_y = blockIdx.x;
      blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
    } else {
      int bid = blockIdx.x + gridDim.x * blockIdx.y;
      blockIdx_y = bid % gridDim.y;
      blockIdx_x = ((bid / gridDim.y) + blockIdx_y) % gridDim.x;
    }
    © NVIDIA Corporation 2008
  • Effective Bandwidth (GB/s), 2048x2048, GTX 280:
                                   Loop over kernel   Loop in kernel
      Simple Copy                        96.9              81.6
      Shared Memory Copy                 80.9              81.1
      Naïve Transpose                     2.2               2.2
      Coalesced Transpose                16.5              17.1
      Bank Conflict Free Transpose       16.6              17.2
      Diagonal Transpose                 69.5              78.3
    © NVIDIA Corporation 2008
  • Order of Optimizations: larger optimization issues can mask smaller ones, and the proper order of some optimization techniques is not known a priori (e.g. partition camping is problem-size dependent). Don’t dismiss an optimization technique as ineffective until you know it was applied at the right time. (In the transpose: naïve 2.2 GB/s → coalescing 16.5 GB/s → bank conflicts 16.6 GB/s → partition camping 69.5 GB/s; tackling partition camping before bank conflicts passes through 48.8 GB/s instead.) © NVIDIA Corporation 2008
  • Transpose Summary: coalescing and shared memory bank conflicts are small-scale phenomena — they deal with memory access within a half-warp and are problem-size independent. Partition camping is a large-scale phenomenon — it deals with simultaneous memory accesses by warps on different multiprocessors and is problem-size dependent (you wouldn’t see it in a (2048+32)^2 matrix). Coalescing is generally the most critical. SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html © NVIDIA Corporation 2008
  • tmem
  • Textures in CUDA: a texture is an object for reading data. Benefits: data is cached (optimized for 2D locality) — helpful when coalescing is a problem; filtering (linear / bilinear / trilinear, in dedicated hardware); wrap modes for “out-of-bounds” addresses (clamp to edge / repeat); addressable in 1D, 2D, or 3D, using integer or normalized coordinates. Usage: CPU code binds data to a texture object; kernels read the data by calling a fetch function. © NVIDIA Corporation 2008
  • Other goodies: optional “format conversion” — {char, short, int, half (16-bit)} to float (32-bit), “for free”; useful for *mem compression (see later).
  • Texture Addressing (e.g. coordinates (2.5, 0.5), (1.0, 1.0), (5.5, 1.5) on a 5x4 texture): wrap — an out-of-bounds coordinate is wrapped (modulo arithmetic); clamp — an out-of-bounds coordinate is replaced with the closest boundary. © NVIDIA Corporation 2008
  • Two CUDA Texture Types: (1) bound to linear memory — a global memory address is bound to a texture; 1D only; integer addressing; no filtering, no addressing modes. (2) bound to CUDA arrays — a CUDA array is bound to a texture; 1D, 2D, or 3D; float addressing (size-based or normalized); filtering; addressing modes (clamping, repeat). Both return either the element type or a normalized float. © NVIDIA Corporation 2008
  • CUDA Texturing Steps — host (CPU) code: allocate/obtain memory (global linear, or a CUDA array); create a texture reference object (currently must be at file scope); bind the texture reference to the memory/array; when done, unbind the texture reference and free resources. Device (kernel) code: fetch using the texture reference — linear memory textures with tex1Dfetch(), array textures with tex1D(), tex2D(), or tex3D(). © NVIDIA Corporation 2008
  • cmem
  • Constant Memory: ideal for coefficients and other data read uniformly by warps. Data is stored in global memory but read through a constant cache (__constant__ qualifier in declarations; can only be read by GPU kernels; limited to 64 KB). Kernel-pointer uniform accesses: a kernel pointer argument qualified with const — the compiler must determine that all threads in a threadblock will dereference the same address; no limit on array size, can use any global memory pointer. Constant cache throughput: 32 bits per warp per 2 clocks per multiprocessor; to be used when all threads in a warp read the same address — serializes otherwise. © NVIDIA 2010
  • Constant Memory (cont’d) — uniform vs. non-uniform accesses through a const pointer:

    __global__ void kernel(const float *g_a)
    {
      ...
      float x = g_a[15];              // uniform
      float y = g_a[blockIdx.x + 5];  // uniform
      float z = g_a[threadIdx.x];     // non-uniform
      ...
    }
    © NVIDIA 2010
  • Constant Memory: suppose a kernel executes 10K threads (312 warps) per SM during its lifetime and all threads access the same 4-byte word. Using gmem: each warp fetches 32 bytes — roughly 10 KB of bus traffic — and the line is likely to be evicted multiple times. © NVIDIA 2010
  • Constant Memory: the same scenario with constant (uniform) accesses — the first warp fetches the 32-byte line into the constant cache; all other warps hit in the cache, for 32 bytes of bus traffic total. The line is unlikely to be evicted over the kernel lifetime, since other loads do not go through this cache. © NVIDIA 2010
  • *mem compression
  • Optimization by Compression: when all else has been optimized and speed is limited by the number of bytes needed, consider compression. Approaches: cast conversion between 8-, 16-, and 32-bit integers is a single instruction; FP conversion between fp16, fp32, and fp64 is similarly cheap (fp16 (half) is storage-only — there are no fp16 math instructions); range-based — lower and upper limits are kernel arguments and the data is an index for interpolation. Application in practice: Clark et al., “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs” — http://arxiv.org/abs/0911.3191 © NVIDIA 2010
  • Accelerating GPU computation through mixed-precision methods — Michael Clark, Harvard-Smithsonian Center for Astrophysics, Harvard University (SC’10)
  • ... too much? (coalescing, bank conflicts, caching, mixed precision, partition camping, broadcasting, zero-copy, streams)
  • Parallel Programming is Hard (but you’ll pick it up)
  • (you are not alone)
  • 3. Threading/Execution Optimizations
  • 3.1 Exec. Configuration Optimizations
  • Occupancy Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently Limited by resource usage: Registers Shared memory© NVIDIA Corporation 2008 60
  • Grid/Block Size Heuristics # of blocks > # of multiprocessors So all multiprocessors have at least one block to execute # of blocks / # of multiprocessors > 2 Multiple blocks can run concurrently in a multiprocessor Blocks that aren’t waiting at a __syncthreads() keep the hardware busy Subject to resource availability – registers, shared memory # of blocks > 100 to scale to future devices Blocks executed in pipeline fashion 1000 blocks per grid will scale across multiple generations© NVIDIA Corporation 2008 61
  • Register Dependency Read-after-write register dependency Instruction’s result can be read ~24 cycles later Scenarios: CUDA: PTX: x = y + 5; add.f32 $f3, $f1, $f2 z = x + 3; add.f32 $f5, $f3, $f4 s_data[0] += 3; ld.shared.f32 $f3, [$r31+0] add.f32 $f3, $f3, $f4 To completely hide the latency: Run at least 192 threads (6 warps) per multiprocessor At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3) Threads do not have to belong to the same thread block© NVIDIA Corporation 2008 62
  • Register Pressure Hide latency by using more threads per SM Limiting Factors: Number of registers per kernel 8K/16K per SM, partitioned among concurrent threads Amount of shared memory 16KB per SM, partitioned among concurrent threadblocks Compile with –ptxas-options=-v flag Use –maxrregcount=N flag to NVCC N = desired maximum registers / kernel At some point “spilling” into local memory may occur Reduces performance – local memory is slow© NVIDIA Corporation 2008 63
  • Occupancy Calculator© NVIDIA Corporation 2008 64
  • Optimizing threads per block Choose threads per block as a multiple of warp size Avoid wasting computation on under-populated warps Want to run as many warps as possible per multiprocessor (hide latency) Multiprocessor can run up to 8 blocks at a time Heuristics Minimum: 64 threads per block Only if multiple concurrent blocks 192 or 256 threads a better choice Usually still enough regs to compile and invoke successfully This all depends on your computation, so experiment!© NVIDIA Corporation 2008 65
  • Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT … Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels (It all comes down to arithmetic intensity and available parallelism)© NVIDIA Corporation 2008 66
  • [GTC’10 slide on performance vs. occupancy]
  • 3.2 Instruction Optimizations
  • CUDA Instruction Performance Instruction cycles (per warp) = sum of Operand read cycles Instruction execution cycles Result update cycles Therefore instruction throughput depends on Nominal instruction throughput Memory latency Memory bandwidth “Cycle” refers to the multiprocessor clock rate 1.3 GHz on the Tesla C1060, for example© NVIDIA Corporation 2008 69
  • Maximizing Instruction Throughput Maximize use of high-bandwidth memory Maximize use of shared memory Minimize accesses to global memory Maximize coalescing of global memory accesses Optimize performance by overlapping memory accesses with HW computation High arithmetic intensity programs i.e. high ratio of math to memory transactions Many concurrent threads© NVIDIA Corporation 2008 70
  • Arithmetic Instruction Throughput: int and float add, shift, min, max, and float mul, mad: 4 cycles per warp. int multiply (*) is by default 32-bit and requires multiple cycles per warp — use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply. Integer divide and modulo are more expensive: the compiler will convert literal power-of-2 divides to shifts, but we have seen it miss some cases — be explicit in cases where the compiler can’t tell that the divisor is a power of 2! Useful trick: foo % n == foo & (n-1) if n is a power of 2. © NVIDIA Corporation 2008
  • Runtime Math Library There are two types of runtime math operations in single-precision __funcf(): direct mapping to hardware ISA Fast but lower accuracy (see prog. guide for details) Examples: __sinf(x), __expf(x), __powf(x,y) funcf() : compile to multiple instructions Slower but higher accuracy (5 ulp or less) Examples: sinf(x), expf(x), powf(x,y) The -use_fast_math compiler option forces every funcf() to compile to __funcf()© NVIDIA Corporation 2008 72
  • GPU results may not match CPU Many variables: hardware, compiler, optimization settings CPU operations aren’t strictly limited to 0.5 ulp Sequences of operations can be more accurate due to 80- bit extended precision ALUs Floating-point arithmetic is not associative!© NVIDIA Corporation 2008 73
  • FP Math is Not Associative! In symbolic math, (x+y)+z == x+(y+z). This is not necessarily true for floating-point addition: try x = 10^30, y = -10^30 and z = 1 in the above equation. When you parallelize computations, you potentially change the order of operations, so parallel results may not exactly match sequential results. This is not specific to GPUs or CUDA — it is an inherent part of parallel execution. © NVIDIA Corporation 2008
  • Control Flow Instructions Main performance concern with branching is divergence Threads within a single warp take different paths Different execution paths must be serialized Avoid divergence when branch condition is a function of thread ID Example with divergence: if (threadIdx.x > 2) { } Branch granularity < warp size Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { } Branch granularity is a whole multiple of warp size© NVIDIA Corporation 2008 75
  • Scared ?
  • Scared ? Howwwwww?! (do I start)
  • Profiler
  • Profiler Counters
    instructions_issued / instructions_executed
      Incremented by 1 per warp
      "Issued" includes replays, "executed" does not
    gld_request / gst_request
      Incremented by 1 per warp for each load/store instruction
      Instructions may be counted if predicated out
    l1_global_load_miss / l1_global_load_hit / global_store_transaction
      Incremented by 1 per line (line is 128B)
    uncached_global_load_transaction
      Incremented by 1 per group of 1, 2, 4, or 8 transactions
    Derived metrics:
      32 * instructions_issued  (32 = warp size)
      128B * (global_store_transaction + l1_global_load_miss)
© NVIDIA 2010
  • CUDA Visual Profiler data for memory transfers
    Memory transfer type and direction (D=Device, H=Host, A=cuArray), e.g. H to D: Host to Device
    Synchronous / Asynchronous
    Memory transfer size, in bytes
    Stream ID
© NVIDIA Corporation 2010
  • CUDA Visual Profiler data for kernels
© NVIDIA Corporation 2010
  • CUDA Visual Profiler computed data for kernels
    Instruction throughput: ratio of achieved instruction rate to peak single-issue instruction rate
    Global memory read throughput (Gigabytes/second)
    Global memory write throughput (Gigabytes/second)
    Overall global memory access throughput (Gigabytes/second)
    Global memory load efficiency
    Global memory store efficiency
© NVIDIA Corporation 2010
  • CUDA Visual Profiler data analysis views
    Views: Summary table, Kernel table, Memcopy table, Summary plot, GPU Time Height plot, GPU Time Width plot, Profiler counter plot, Profiler table column plot, Multi-device plot, Multi-stream plot
    Analyze profiler counters
    Analyze kernel occupancy
© NVIDIA Corporation 2010
  • CUDA Visual Profiler misc.
    Multiple sessions
    Compare views for different sessions: Comparison Summary plot
    Profiler projects: save & load
    Import/Export profiler data (.CSV format)
© NVIDIA Corporation 2010
  • Scared ? meh!!!! I don’t like to profile
  • Modified source code
  • Modified Source Code
    Time the memory-only and math-only versions of the kernel
      Easier for codes that don't have data-dependent control-flow or addressing
      Gives you good estimates for:
        Time spent accessing memory
        Time spent executing instructions
    Comparing the times for modified kernels
      Helps decide whether the kernel is mem or math bound
      Shows how well memory operations are overlapped with arithmetic
        Compare the sum of mem-only and math-only times to full kernel time
© NVIDIA 2010
  • Scared ? I want to believe...
  • Timing scenarios (mem-only, math-only, and full kernel times)
    Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory)
    Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory)
    Balanced: good mem-math overlap, latency not a problem (assuming memory/instruction throughput is not low compared to HW theory)
    Memory and latency bound: poor mem-math overlap, latency is a problem
    Questions to ask: Memory bound? Math bound? Latency bound?
© NVIDIA 2010
  • Argn&%#$... too many optimizations !!!
  • Parameterize Your Application
    Parameterization helps adaptation to different GPUs
    GPUs vary in many ways:
      # of multiprocessors
      Memory bandwidth
      Shared memory size
      Register file size
      Max. threads per block
    You can even make apps self-tuning (like FFTW and ATLAS)
      "Experiment" mode discovers and saves optimal configuration
© NVIDIA Corporation 2008
  • More ?
    • Next week: GPU “Scripting”, Meta-programming, Auto-tuning
    • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)
    • Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
    • Tue 4/5/11: Analysis-driven Optimization (C. Wooley, NVIDIA)
    • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)
    • Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirg)
    • ...
  • one more thing or two...
  • Life/Code Hacking #2.x Speed {listen,read,writ}ing
    accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.2 Speed writing
    accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.2 Speed writing
    http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html
    accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.2 Speed writing
    Typing tutors: gtypist, ktouch, typingweb.com, etc.
    accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Life/Code Hacking #2.2 Speed writing
    Kinesis Advantage (QWERTY/DVORAK)
    accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • Demo
  • CO ME