IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

See http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

  1. IAP09 CUDA@MIT/6.963
     David Luebke, NVIDIA Research
  2. The "New" Moore's Law
     - Computers no longer get faster, just wider
     - You must re-think your algorithms to be parallel!
     - Data-parallel computing is the most scalable solution
  3. Enter the GPU
     - Massive economies of scale
     - Massively parallel
  4. Enter CUDA
     - Scalable parallel programming model
     - Minimal extensions to familiar C/C++ environment
     - Heterogeneous serial-parallel computing
  5. Sound Bite
     GPUs + CUDA = The Democratization of Parallel Computing
     Massively parallel computing has become a commodity technology
  6. MOTIVATION
  7. MOTIVATION
     [Figure: reported application speedups over CPU implementations: 146x, 36x, 19x, 17x, 100x, 149x, 47x, 20x, 24x, 30x]
  8. CUDA: 'C' FOR PARALLELISM

     Standard C code:

        void saxpy_serial(int n, float a, float *x, float *y)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }

        // Invoke serial SAXPY kernel
        saxpy_serial(n, 2.0, x, y);

     Parallel C code:

        __global__ void saxpy_parallel(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n) y[i] = a*x[i] + y[i];
        }

        // Invoke parallel SAXPY kernel with 256 threads/block
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
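     A minimal host-side sketch (not from the slides) of how the parallel SAXPY kernel
     above might be driven end to end with the CUDA runtime API; the function name
     run_saxpy and the host pointers h_x, h_y are hypothetical.

        #include <cuda_runtime.h>

        void run_saxpy(int n, float a, float *h_x, float *h_y)
        {
            float *d_x, *d_y;
            size_t bytes = n * sizeof(float);

            // Allocate device arrays and copy the inputs from the host
            cudaMalloc((void**)&d_x, bytes);
            cudaMalloc((void**)&d_y, bytes);
            cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
            cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

            // Launch enough 256-thread blocks to cover all n elements
            int nblocks = (n + 255) / 256;
            saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

            // Copy the result back and release device memory
            cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
            cudaFree(d_x);
            cudaFree(d_y);
        }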
  9. Hierarchy of concurrent threads
     - Parallel kernels composed of many threads
       - all threads execute the same sequential program
     - Threads are grouped into thread blocks
       - threads in the same block can cooperate
     - Threads/blocks have unique IDs
     [Figure: thread t; block b of threads t0, t1, ..., tB; kernel foo() made up of many blocks]
  10. Hierarchical organization
     - Thread: per-thread local memory
     - Block: per-block shared memory; local barrier within the block
     - Kernel 0, Kernel 1, ...: per-device global memory; global barrier between kernels
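     A hypothetical kernel sketch (not from the slides) marking where each memory space
     from this hierarchy appears in CUDA C; the kernel name and the fixed 256-thread
     block size are assumptions.

        __global__ void memory_spaces(int *global_data)        // global_data: per-device global memory
        {
            int idx = blockIdx.x*blockDim.x + threadIdx.x;     // idx: per-thread local variable
            __shared__ int block_scratch[256];                 // block_scratch: per-block shared memory

            block_scratch[threadIdx.x] = global_data[idx];
            __syncthreads();                                   // local barrier: only this block's threads

            global_data[idx] = block_scratch[threadIdx.x] + 1;
            // A global barrier separates this kernel from the next dependent kernel launch
        }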
  11. Heterogeneous Programming
     - CUDA = serial program with parallel kernels, all in C
     - Serial C code executes in a CPU thread
     - Parallel kernel C code executes in thread blocks across multiple processing elements

        Serial Code
        Parallel Kernel:  foo<<< nBlk, nTid >>>(args);
        Serial Code
        Parallel Kernel:  bar<<< nBlk, nTid >>>(args);
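     A hypothetical sketch (not from the slides) of that serial/parallel structure, with
     placeholder bodies for foo and bar and an assumed device-pointer argument:

        __global__ void foo(float *data) { data[blockIdx.x*blockDim.x + threadIdx.x] *= 2.0f; }
        __global__ void bar(float *data) { data[blockIdx.x*blockDim.x + threadIdx.x] += 1.0f; }

        void host_code(float *d_data, int nBlk, int nTid)
        {
            // ... serial C code running in a CPU thread ...
            foo<<<nBlk, nTid>>>(d_data);    // parallel kernel: nBlk blocks of nTid threads
            // ... more serial C code ...
            bar<<<nBlk, nTid>>>(d_data);    // second parallel kernel
            cudaDeviceSynchronize();        // wait for the GPU before the CPU reads results
        }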
  12. Thread = virtualized scalar processor
     - Independent thread of execution
       - has its own PC, variables (registers), processor state, etc.
       - no implication about how threads are scheduled
     - CUDA threads might be physical threads
       - as on NVIDIA GPUs
     - CUDA threads might be virtual threads
       - might pick 1 block = 1 physical thread on a multicore CPU
  13. Block = virtualized multiprocessor
     - Provides programmer flexibility
       - freely choose processors to fit data
       - freely customize for each kernel launch
     - Thread block = a (data-)parallel task
       - all blocks in a kernel have the same entry point
       - but may execute any code they want
     - Thread blocks of a kernel must be independent tasks
       - program valid for any interleaving of block executions
  14. Blocks must be independent
     - Any possible interleaving of blocks should be valid
       - presumed to run to completion without pre-emption
       - can run in any order
       - can run concurrently OR sequentially
     - Blocks may coordinate but not synchronize
       - shared queue pointer: OK
       - shared lock: BAD ... can easily deadlock
     - Independence requirement gives scalability
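     A hypothetical sketch (not from the slides) of the "shared queue pointer: OK" case:
     blocks append to a global output queue through an atomic counter, so the result is
     correct under any ordering of blocks. The kernel and parameter names are assumptions.

        __global__ void append_block_result(int *queue, int *queue_tail, const int *block_results)
        {
            if (threadIdx.x == 0) {
                // atomicAdd hands each block a unique slot regardless of execution order
                int slot = atomicAdd(queue_tail, 1);
                queue[slot] = block_results[blockIdx.x];
            }
        }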
  15. Scalable Execution Model
     [Figure: a kernel launched by the host; its blocks run on multiprocessors, each with a
     multithreaded instruction unit (MT IU), scalar processors (SP), and per-multiprocessor
     shared memory, all backed by device memory]
  16. Synchronization & Cooperation
     - Threads within a block may synchronize with barriers:

          ... Step 1 ...
          __syncthreads();
          ... Step 2 ...

     - Blocks coordinate via atomic memory operations
       - e.g., increment shared queue pointer with atomicInc()
     - Implicit barrier between dependent kernels:

          vec_minus<<<nblocks, blksize>>>(a, b, c);
          vec_dot<<<nblocks, blksize>>>(c, c);
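     A hypothetical body for vec_minus (not from the slides), to make the implicit barrier
     concrete: kernels issued to the same (default) stream execute in launch order, so
     vec_dot only starts after vec_minus has finished writing c. The bounds parameter n is
     an added assumption.

        __global__ void vec_minus(int n, const float *a, const float *b, float *c)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n)
                c[i] = a[i] - b[i];     // element-wise difference, one thread per element
        }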
  17. Using per-block shared memory
     - Variables shared across a block:

          __shared__ int *begin, *end;

     - Scratchpad memory:

          __shared__ int scratch[blocksize];
          scratch[threadIdx.x] = begin[threadIdx.x];
          // ... compute on scratch values ...
          begin[threadIdx.x] = scratch[threadIdx.x];

     - Communicating values between threads:

          scratch[threadIdx.x] = begin[threadIdx.x];
          __syncthreads();
          int left = scratch[threadIdx.x - 1];
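     A hypothetical complete kernel (not from the slides) assembled from the fragments
     above: each thread stages one element in shared memory, synchronizes, then reads its
     left neighbor. The kernel name, the fixed 256-thread block size, and the guard for
     thread 0 (which has no left neighbor within its block) are assumptions.

        __global__ void read_left_neighbor(const int *begin, int *out)
        {
            __shared__ int scratch[256];                     // assumes blockDim.x == 256
            int idx = blockIdx.x*blockDim.x + threadIdx.x;

            scratch[threadIdx.x] = begin[idx];
            __syncthreads();                                 // make every load visible to the block

            int left = (threadIdx.x > 0) ? scratch[threadIdx.x - 1] : begin[idx];
            out[idx] = left;
        }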
  18. Example: Parallel Reduction
     - Summing up a sequence with 1 thread:

          int sum = 0;
          for (int i = 0; i < N; ++i) sum += x[i];

     - Parallel reduction builds a summation tree
       - each thread holds 1 element
       - stepwise partial sums
       - N threads need log N steps
       - one possible approach: butterfly pattern
  19. Example: Parallel Reduction (repeat of slide 18)
  20. Parallel Reduction for 1 Block

        // INPUT: Thread i holds value x_i
        int i = threadIdx.x;
        __shared__ int sum[blocksize];

        // One thread per element
        sum[i] = x_i;
        __syncthreads();

        for (int bit = blocksize/2; bit > 0; bit /= 2) {
            int t = sum[i] + sum[i^bit];
            __syncthreads();
            sum[i] = t;
            __syncthreads();
        }
        // OUTPUT: Every thread now holds the full sum in sum[i]
  21. Summing Up CUDA
     - CUDA = C + a few simple extensions
       - makes it easy to start writing basic parallel programs
     - Three key abstractions:
       1. hierarchy of parallel threads
       2. corresponding levels of synchronization
       3. corresponding memory spaces
     - Supports massive parallelism of manycore GPUs
  22. Parallel Reduction for 1 Block (repeat of the code from slide 20)
  23. More efficient reduction tree

        template<class T>
        __device__ T reduce(T *x)
        {
            unsigned int i = threadIdx.x;
            unsigned int n = blockDim.x;

            for (unsigned int offset = n/2; offset > 0; offset /= 2) {
                if (i < offset)
                    x[i] += x[i + offset];
                __syncthreads();
            }
            // Note that only thread 0 has the full sum
            return x[i];
        }
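     A hypothetical kernel (not from the slides) showing how reduce() might be used to
     produce one partial sum per block; the kernel name block_sums, the dynamically sized
     shared array, and the assumption of a power-of-two block size are all additions here.

        __global__ void block_sums(const float *input, float *partial_sums)
        {
            extern __shared__ float scratch[];   // sized to blockDim.x floats at launch time

            unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
            scratch[threadIdx.x] = input[idx];
            __syncthreads();                     // all elements loaded before reducing

            float sum = reduce<float>(scratch);  // only thread 0 holds the full block sum
            if (threadIdx.x == 0)
                partial_sums[blockIdx.x] = sum;
        }

        // Launch: block_sums<<<nblocks, blocksize, blocksize*sizeof(float)>>>(d_in, d_partial);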
  24. Reduction tree execution example

     Input (shared memory):          10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
     x[i] += x[i+8]  (threads 0-7):   8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2
     x[i] += x[i+4]  (threads 0-3):   8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
     x[i] += x[i+2]  (threads 0-1):  21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
     x[i] += x[i+1]  (thread 0):     41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
     Final result: x[0] = 41
  25. Pattern 1: Fit kernel to the data
     - Code lets a B-thread block reduce a B-element array
     - For larger sequences:
       - launch N/B blocks to reduce each B-element subsequence
       - write the N/B partial sums to a temporary array
       - repeat until done
     - We'll be done in log_B N steps
  26. Pattern 2: Fit kernel to the machine
     - For a block size of B:
       1. Launch B blocks to sum N/B elements each
       2. Launch 1 block to combine the B partial sums
     [Figure: Stage 1, many blocks each reduce their own subsequence (e.g. 3 1 7 0 4 1 6 3 ->
     4 7 5 9 -> 11 14 -> 25); Stage 2, one block combines the per-block results]
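     A hypothetical two-stage launch sketch (not from the slides), reusing the block_sums
     kernel sketched after slide 23 and assuming N = B*B so each of the B blocks handles
     exactly B elements; d_input, d_partial, and d_result are hypothetical device arrays.

        const int B = 256;                                           // block size and block count
        block_sums<<<B, B, B*sizeof(float)>>>(d_input, d_partial);   // Stage 1: B partial sums
        block_sums<<<1, B, B*sizeof(float)>>>(d_partial, d_result);  // Stage 2: 1 block combines them
        // d_result[0] now holds the sum of all N = B*B input elements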
