IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

See http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

1. IAP09 CUDA@MIT/6.963
   David Luebke, NVIDIA Research

2. The “New” Moore’s Law
   - Computers no longer get faster, just wider
   - You must re-think your algorithms to be parallel!
   - Data-parallel computing is the most scalable solution

3. Enter the GPU
   - Massive economies of scale
   - Massively parallel

4. Enter CUDA
   - Scalable parallel programming model
   - Minimal extensions to familiar C/C++ environment
   - Heterogeneous serial-parallel computing

5. Sound Bite

   GPUs + CUDA = The Democratization of Parallel Computing

   Massively parallel computing has become a commodity technology

6. MOTIVATION

7. MOTIVATION
   [Speedup figures for ten example applications: 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X]

8. CUDA: ‘C’ FOR PARALLELISM

   // Standard C code
   void saxpy_serial(int n, float a, float *x, float *y)
   {
       for (int i = 0; i < n; ++i)
           y[i] = a*x[i] + y[i];
   }
   // Invoke serial SAXPY kernel
   saxpy_serial(n, 2.0, x, y);

   // Parallel C code
   __global__ void saxpy_parallel(int n, float a, float *x, float *y)
   {
       int i = blockIdx.x*blockDim.x + threadIdx.x;
       if (i < n) y[i] = a*x[i] + y[i];
   }
   // Invoke parallel SAXPY kernel with 256 threads/block
   int nblocks = (n + 255) / 256;
   saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

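The slide invokes saxpy_parallel on x and y directly; in a real CUDA program the kernel must receive device pointers. A minimal host-side sketch, with the hypothetical names d_x and d_y (not in the slides):

    // Hypothetical host-side setup for the SAXPY kernel above
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *x = (float*)malloc(bytes);   // host arrays, assumed initialized below
    float *y = (float*)malloc(bytes);
    // ... fill x and y ...

    float *d_x, *d_y;                   // device copies
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);   // result back on host
    cudaFree(d_x); cudaFree(d_y);
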
9. Hierarchy of concurrent threads
   - Parallel kernels composed of many threads
     - all threads execute the same sequential program
   - Threads are grouped into thread blocks
     - threads in the same block can cooperate
   - Threads/blocks have unique IDs
   [Diagram: Thread t; Block b containing threads t0 t1 … tB; Kernel foo() containing many blocks]

10. Hierarchical organization
    [Diagram: each Thread has per-thread local memory; each Block has per-block shared memory and a local barrier; Kernel 0 and Kernel 1 share per-device global memory, with a global barrier between kernels]

11. Heterogeneous Programming
    - CUDA = serial program with parallel kernels, all in C
    - Serial C code executes in a CPU thread
    - Parallel kernel C code executes in thread blocks across multiple processing elements
    [Execution timeline: Serial Code → Parallel Kernel foo<<< nBlk, nTid >>>(args); → Serial Code → Parallel Kernel bar<<< nBlk, nTid >>>(args);]

12. Thread = virtualized scalar processor
    - Independent thread of execution
      - has its own PC, variables (registers), processor state, etc.
      - no implication about how threads are scheduled
    - CUDA threads might be physical threads
      - as on NVIDIA GPUs
    - CUDA threads might be virtual threads
      - might pick 1 block = 1 physical thread on multicore CPU

13. Block = virtualized multiprocessor
    - Provides programmer flexibility
      - freely choose processors to fit data
      - freely customize for each kernel launch
    - Thread block = a (data) parallel task
      - all blocks in kernel have the same entry point
      - but may execute any code they want
    - Thread blocks of kernel must be independent tasks
      - program valid for any interleaving of block executions

14. Blocks must be independent
    - Any possible interleaving of blocks should be valid
      - presumed to run to completion without pre-emption
      - can run in any order
      - can run concurrently OR sequentially
    - Blocks may coordinate but not synchronize
      - shared queue pointer: OK
      - shared lock: BAD … can easily deadlock
    - Independence requirement gives scalability

15. Scalable Execution Model
    [Diagram: a kernel launched by the host distributes its blocks across multiprocessors; each multiprocessor pairs a multithreaded instruction unit (MT IU) with scalar processors (SP) and shared memory, all backed by device memory]

16. Synchronization & Cooperation
    - Threads within block may synchronize with barriers:
          … Step 1 …
          __syncthreads();
          … Step 2 …
    - Blocks coordinate via atomic memory operations
      - e.g., increment shared queue pointer with atomicInc()
    - Implicit barrier between dependent kernels:
          vec_minus<<<nblocks, blksize>>>(a, b, c);
          vec_dot<<<nblocks, blksize>>>(c, c);

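The slide names atomicInc() as the queue-pointer primitive; the sketch below uses atomicAdd(), which has simpler semantics for this purpose. The kernel name, the out array, and the counter q are assumptions for illustration, not from the slides:

    // Hypothetical example: blocks coordinate through an atomic queue pointer
    __global__ void claim_slots(int *out, unsigned int *q)
    {
        if (threadIdx.x == 0) {
            // atomicAdd returns the old value, so each block gets a unique slot
            unsigned int slot = atomicAdd(q, 1u);
            out[slot] = blockIdx.x;      // blocks may finish in any order
        }
    }
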
17. Using per-block shared memory
    - Variables shared across block:
          __shared__ int *begin, *end;
    - Scratchpad memory:
          __shared__ int scratch[blocksize];
          scratch[threadIdx.x] = begin[threadIdx.x];
          // … compute on scratch values …
          begin[threadIdx.x] = scratch[threadIdx.x];
    - Communicating values between threads:
          scratch[threadIdx.x] = begin[threadIdx.x];
          __syncthreads();
          int left = scratch[threadIdx.x - 1];

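Note that the last fragment reads scratch[threadIdx.x - 1], which is out of bounds for thread 0, so a complete kernel needs a guard. A minimal single-block sketch; the kernel name shift_left and the fixed 256-thread block are assumptions:

    // Hypothetical kernel: each thread picks up its left neighbor's value
    __global__ void shift_left(int *begin, int *out)
    {
        __shared__ int scratch[256];   // assumes blockDim.x == 256
        int i = threadIdx.x;
        scratch[i] = begin[i];
        __syncthreads();               // make all writes visible before reading
        if (i > 0)                     // guard: thread 0 has no left neighbor
            out[i] = scratch[i - 1];
    }
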
18. Example: Parallel Reduction
    - Summing up a sequence with 1 thread:
          int sum = 0;
          for (int i = 0; i < N; ++i) sum += x[i];
    - Parallel reduction builds a summation tree
      - each thread holds 1 element
      - stepwise partial sums
      - N threads need log N steps
      - one possible approach: Butterfly pattern

19. Example: Parallel Reduction
    [Duplicate of slide 18]

20. Parallel Reduction for 1 Block

    // INPUT: Thread i holds value x_i
    int i = threadIdx.x;
    __shared__ int sum[blocksize];

    // One thread per element
    sum[i] = x_i;
    __syncthreads();

    for (int bit = blocksize/2; bit > 0; bit /= 2)
    {
        int t = sum[i] + sum[i^bit];
        __syncthreads();
        sum[i] = t;
        __syncthreads();
    }
    // OUTPUT: Every thread now holds sum in sum[i]

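The fragment above leaves x_i and blocksize as placeholders. One way to flesh it out into a launchable kernel; the name reduce_butterfly and the fixed 256-thread block are assumptions:

    // Hypothetical complete kernel around the butterfly reduction above
    #define BLOCKSIZE 256
    __global__ void reduce_butterfly(int *x, int *result)
    {
        int i = threadIdx.x;
        __shared__ int sum[BLOCKSIZE];

        sum[i] = x[i];                 // thread i holds value x_i
        __syncthreads();

        for (int bit = BLOCKSIZE/2; bit > 0; bit /= 2)
        {
            int t = sum[i] + sum[i^bit];
            __syncthreads();
            sum[i] = t;
            __syncthreads();
        }
        if (i == 0) *result = sum[0];  // all threads hold the sum; one writes it
    }
    // Launch (assumes a 256-element input): reduce_butterfly<<<1, BLOCKSIZE>>>(d_x, d_result);
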
21. Summing Up CUDA
    - CUDA = C + a few simple extensions
      - makes it easy to start writing basic parallel programs
    - Three key abstractions:
      1. hierarchy of parallel threads
      2. corresponding levels of synchronization
      3. corresponding memory spaces
    - Supports massive parallelism of manycore GPUs

22. Parallel
    [Repeats the 1-block reduction code from slide 20]

23. More efficient reduction tree

    template<class T> __device__ T reduce(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;
        for (unsigned int offset = n/2; offset > 0; offset /= 2)
        {
            if (i < offset)
                x[i] += x[i + offset];
            __syncthreads();
        }
        // Note that only thread 0 has the full sum
        return x[i];
    }

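A hedged sketch of how a kernel might use this device function; the wrapper name reduce_blocks, the partial-sum array, and the dynamic shared-memory launch are assumptions (x must point to shared memory holding one value per thread):

    // Hypothetical wrapper: each block reduces its tile to one partial sum
    __global__ void reduce_blocks(int *in, int *partial)
    {
        extern __shared__ int tile[];   // sized at launch: blockDim.x ints
        unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[gid];
        __syncthreads();

        int s = reduce(tile);           // device function from the slide
        if (threadIdx.x == 0)
            partial[blockIdx.x] = s;    // only thread 0 has the full sum
    }
    // Launch: reduce_blocks<<<nblocks, B, B * sizeof(int)>>>(d_in, d_partial);
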
24. Reduction tree execution example

    Input (shared memory):             10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
    x[i] += x[i+8]  (threads 0–7):      8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2
    x[i] += x[i+4]  (threads 0–3):      8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
    x[i] += x[i+2]  (threads 0–1):     21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
    x[i] += x[i+1]  (thread 0):        41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
    Final result: 41 (in x[0])

25. Pattern 1: Fit kernel to the data
    - Code lets a B-thread block reduce a B-element array
    - For larger sequences:
      - launch N/B blocks to reduce each B-element subsequence
      - write N/B partial sums to a temporary array
      - repeat until done
    - We’ll be done in log_B N steps

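A hedged host-side sketch of this multi-pass pattern, reusing the hypothetical reduce_blocks wrapper from slide 23; d_in, d_tmp, n, and B are assumed names:

    // Hypothetical multi-pass driver: reduce until one value remains
    // (a real version would zero-pad or guard reads when remaining
    //  is not a multiple of B)
    int remaining = n;
    int *d_src = d_in, *d_dst = d_tmp;
    while (remaining > 1) {
        int nblocks = (remaining + B - 1) / B;
        reduce_blocks<<<nblocks, B, B * sizeof(int)>>>(d_src, d_dst);
        remaining = nblocks;                         // nblocks partial sums remain
        int *t = d_src; d_src = d_dst; d_dst = t;    // ping-pong the buffers
    }
    // final sum is now in d_src[0]
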
26. Pattern 2: Fit kernel to the machine
    - For a block size of B:
      1) Launch B blocks to sum N/B elements
      2) Launch 1 block to combine B partial sums
    [Diagram, Stage 1 (many blocks): each block reduces its subsequence, e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25]

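Under the simplifying assumption N = B*B, so each Stage 1 tile holds exactly B elements, the two launches might look like this sketch, again using the hypothetical reduce_blocks wrapper:

    // Hypothetical two-stage driver: exactly two launches, sized to the machine
    // Stage 1: B blocks, each reducing a B-element tile to one partial sum
    reduce_blocks<<<B, B, B * sizeof(int)>>>(d_in, d_partial);
    // Stage 2: one block combines the B partial sums into the final result
    reduce_blocks<<<1, B, B * sizeof(int)>>>(d_partial, d_result);
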