IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

    IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA) - Presentation Transcript

    1. IAP09 CUDA@MIT/6.963 David Luebke NVIDIA Research
    2. The “New” Moore’s Law
        • Computers no longer get faster, just wider
        • You must re-think your algorithms to be parallel!
        • Data-parallel computing is the most scalable solution
        © NVIDIA Corporation 2008
    3. Enter the GPU
        • Massive economies of scale
        • Massively parallel
    4. Enter CUDA
        • Scalable parallel programming model
        • Minimal extensions to familiar C/C++ environment
        • Heterogeneous serial-parallel computing
    5. Sound Bite
        GPUs + CUDA = The Democratization of Parallel Computing
        Massively parallel computing has become a commodity technology
    6. MOTIVATION
    7. MOTIVATION
        [Figure: speedup labels 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X]
    8. CUDA: ‘C’ FOR PARALLELISM
        Standard C Code:
            void saxpy_serial(int n, float a, float *x, float *y)
            {
                for (int i = 0; i < n; ++i)
                    y[i] = a*x[i] + y[i];
            }
            // Invoke serial SAXPY kernel
            saxpy_serial(n, 2.0, x, y);
        Parallel C Code:
            __global__ void saxpy_parallel(int n, float a, float *x, float *y)
            {
                int i = blockIdx.x*blockDim.x + threadIdx.x;
                if (i < n) y[i] = a*x[i] + y[i];
            }
            // Invoke parallel SAXPY kernel with 256 threads/block
            int nblocks = (n + 255) / 256;
            saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
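The slide shows only the kernel and its launch. A minimal sketch of the host side that would surround it is given here, assuming the data starts in host memory; the wrapper name `saxpy_on_gpu` is hypothetical, and error handling is reduced to a single success flag.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel from the slide: one thread per element.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper (not from the slides): copies x and y to the
// device, launches the kernel, and copies y back. Returns false on error.
bool saxpy_on_gpu(float a, const std::vector<float> &x, std::vector<float> &y)
{
    if (x.size() != y.size()) return false;
    if (x.empty()) return true;

    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }

    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    // Same launch configuration as the slide: 256 threads per block.
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);
    cudaError_t launch_err = cudaGetLastError();

    // This copy waits for the kernel to finish before reading d_y.
    cudaError_t copy_err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return launch_err == cudaSuccess && copy_err == cudaSuccess;
}
```

Calling `saxpy_on_gpu(2.0f, x, y)` on two equal-length vectors overwrites `y` with `2*x + y`, which is what `saxpy_serial(n, 2.0, x, y)` computes on the CPU.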
    9. Hierarchy of concurrent threads
        • Parallel kernels composed of many threads
          - all threads execute the same sequential program
        • Threads are grouped into thread blocks
          - threads in the same block can cooperate
        • Threads/blocks have unique IDs
        [Diagram labels: Thread t; Block b (t0 t1 … tB); Kernel foo() ...]
    10. Hierarchical organization
        • Thread: per-thread local memory
        • Block: per-block shared memory, local barrier
        • Kernel 0, Kernel 1, ...: per-device global memory, global barrier
    11. Heterogeneous Programming
        • CUDA = serial program with parallel kernels, all in C
          - Serial C code executes in a CPU thread
          - Parallel kernel C code executes in thread blocks across multiple processing elements
        Serial Code
        Parallel Kernel:  foo<<< nBlk, nTid >>>(args);
        Serial Code
        Parallel Kernel:  bar<<< nBlk, nTid >>>(args);
    12. Thread = virtualized scalar processor
        • Independent thread of execution
          - has its own PC, variables (registers), processor state, etc.
          - no implication about how threads are scheduled
        • CUDA threads might be physical threads
          - as on NVIDIA GPUs
        • CUDA threads might be virtual threads
          - might pick 1 block = 1 physical thread on multicore CPU
    13. Block = virtualized multiprocessor
        • Provides programmer flexibility
          - freely choose processors to fit data
          - freely customize for each kernel launch
        • Thread block = a (data) parallel task
          - all blocks in kernel have the same entry point
          - but may execute any code they want
        • Thread blocks of kernel must be independent tasks
          - program valid for any interleaving of block executions
    14. Blocks must be independent
        • Any possible interleaving of blocks should be valid
          - presumed to run to completion without pre-emption
          - can run in any order
          - can run concurrently OR sequentially
        • Blocks may coordinate but not synchronize
          - shared queue pointer: OK
          - shared lock: BAD … can easily deadlock
        • Independence requirement gives scalability
    15. Scalable Execution Model
        • Kernel launched by host
        • Blocks run on multiprocessors
        [Diagram: a row of multiprocessors, each with an MT IU, SPs, and its own shared memory, all above a common device memory]
    16. Synchronization & Cooperation
        • Threads within block may synchronize with barriers
              … Step 1 …
              __syncthreads();
              … Step 2 …
        • Blocks coordinate via atomic memory operations
          - e.g., increment shared queue pointer with atomicInc()
        • Implicit barrier between dependent kernels
              vec_minus<<<nblocks, blksize>>>(a, b, c);
              vec_dot<<<nblocks, blksize>>>(c, c);
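The "shared queue pointer" idea can be illustrated with a small sketch: threads from any block reserve a slot in an output queue by atomically incrementing one counter in global memory, without ever waiting on each other. The kernel and wrapper names (`enqueue_positive`, `positive_indices`) are hypothetical, the filter condition is an arbitrary choice for illustration, and CUDA error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Every thread that sees a positive value reserves one queue slot by
// atomically bumping a counter shared by all blocks, then writes its index.
// atomicInc returns the old counter value; with this bound it only wraps
// after 2^32 - 1 increments, so it behaves as a plain increment here.
__global__ void enqueue_positive(const int *in, int n, int *queue, unsigned int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        unsigned int slot = atomicInc(count, 0xFFFFFFFFu);
        queue[slot] = i;
    }
}

// Hypothetical host wrapper: returns the indices of the positive elements.
// The order of the returned indices is not deterministic.
std::vector<int> positive_indices(const std::vector<int> &in)
{
    std::vector<int> result;
    int n = static_cast<int>(in.size());
    if (n == 0) return result;

    int *d_in = nullptr, *d_queue = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_queue, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    enqueue_positive<<<(n + 255) / 256, 256>>>(d_in, n, d_queue, d_count);

    // Blocking copies: they complete only after the kernel has finished.
    unsigned int count = 0;
    cudaMemcpy(&count, d_count, sizeof(count), cudaMemcpyDeviceToHost);
    result.resize(count);
    if (count > 0)
        cudaMemcpy(result.data(), d_queue, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_queue);
    cudaFree(d_count);
    return result;
}
```

Because no block waits for another, the kernel is valid for any interleaving of block executions. The price is that the queue order depends on scheduling, so callers that need a fixed order have to sort the result.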
    17. Using per-block shared memory
        • Variables shared across block
              __shared__ int *begin, *end;
        • Scratchpad memory
              __shared__ int scratch[blocksize];
              scratch[threadIdx.x] = begin[threadIdx.x];
              // … compute on scratch values …
              begin[threadIdx.x] = scratch[threadIdx.x];
        • Communicating values between threads
              scratch[threadIdx.x] = begin[threadIdx.x];
              __syncthreads();
              int left = scratch[threadIdx.x - 1];  // valid only for threadIdx.x > 0
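A minimal sketch of the "communicating values between threads" pattern, written as a complete kernel: each thread stages one element in shared memory, waits at the barrier, then reads its left neighbour's element. The adjacent-difference task, the names `adjacent_difference` and `adjacent_difference_gpu`, and the treatment of element 0 (left neighbour taken as 0) are assumptions made for illustration; error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCKSIZE = 256;

// out[i] = in[i] - in[i-1], with in[-1] treated as 0.
__global__ void adjacent_difference(const int *in, int *out, int n)
{
    __shared__ int scratch[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) scratch[threadIdx.x] = in[i];
    // Barrier is outside the branch so every thread in the block reaches it.
    __syncthreads();

    if (i < n) {
        int left;
        if (threadIdx.x > 0)
            left = scratch[threadIdx.x - 1];  // neighbour staged by another thread
        else if (i > 0)
            left = in[i - 1];                 // neighbour belongs to the previous block
        else
            left = 0;                         // first element has no neighbour
        out[i] = scratch[threadIdx.x] - left;
    }
}

// Hypothetical host wrapper around the kernel.
std::vector<int> adjacent_difference_gpu(const std::vector<int> &in)
{
    std::vector<int> out(in.size());
    int n = static_cast<int>(in.size());
    if (n == 0) return out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int nblocks = (n + BLOCKSIZE - 1) / BLOCKSIZE;
    adjacent_difference<<<nblocks, BLOCKSIZE>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Shared memory is visible only within one block, which is why thread 0 of each block falls back to global memory for its neighbour.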
    18. Example: Parallel Reduction
        • Summing up a sequence with 1 thread:
              int sum = 0;
              for(int i=0; i<N; ++i) sum += x[i];
        • Parallel reduction builds a summation tree
          - each thread holds 1 element
          - stepwise partial sums
          - N threads need log N steps
          - one possible approach: Butterfly pattern
    20. Parallel Reduction for 1 Block
            // INPUT: Thread i holds value x_i
            int i = threadIdx.x;
            __shared__ int sum[blocksize];
            // One thread per element
            sum[i] = x_i;
            __syncthreads();
            for(int bit=blocksize/2; bit>0; bit/=2)
            {
                int t=sum[i]+sum[i^bit];
                __syncthreads();
                sum[i]=t;
                __syncthreads();
            }
            // OUTPUT: Every thread now holds sum in sum[i]
    21. Summing Up CUDA
        • CUDA = C + a few simple extensions
          - makes it easy to start writing basic parallel programs
        • Three key abstractions:
          1. hierarchy of parallel threads
          2. corresponding levels of synchronization
          3. corresponding memory spaces
        • Supports massive parallelism of manycore GPUs
    23. More efficient reduction tree
            template<class T> __device__ T reduce(T *x)
            {
                unsigned int i = threadIdx.x;
                unsigned int n = blockDim.x;
                for(unsigned int offset=n/2; offset>0; offset/=2)
                {
                    // only the lower half of the remaining threads stays active
                    if(i < offset) x[i] += x[i + offset];
                    __syncthreads();
                }
                // Note that only thread 0 has full sum
                return x[i];
            }
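The `reduce` function on this slide is a device function, so it needs a kernel around it. A minimal sketch of such a kernel follows, assuming a single block of 256 threads and an input of at most 256 values padded with zeros; the names `block_sum` and `block_sum_gpu` are hypothetical, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCKSIZE = 256;  // must be a power of two for this tree

// Reduction tree from the slide; x points to per-block shared memory.
template<class T>
__device__ T reduce(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;
    for (unsigned int offset = n / 2; offset > 0; offset /= 2) {
        if (i < offset) x[i] += x[i + offset];
        __syncthreads();
    }
    return x[i];  // only thread 0 holds the full sum
}

// Sums up to BLOCKSIZE values with one block; missing values count as 0.
__global__ void block_sum(const int *in, int n, int *out)
{
    __shared__ int scratch[BLOCKSIZE];
    scratch[threadIdx.x] = (threadIdx.x < n) ? in[threadIdx.x] : 0;
    __syncthreads();
    int total = reduce(scratch);
    if (threadIdx.x == 0) *out = total;
}

// Hypothetical host wrapper; assumes v.size() <= BLOCKSIZE.
int block_sum_gpu(const std::vector<int> &v)
{
    int n = static_cast<int>(v.size());
    if (n == 0) return 0;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<1, BLOCKSIZE>>>(d_in, n, d_out);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Passing the sixteen input values from the execution example on the next slide to `block_sum_gpu` should give 41, the final result shown there.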
    24. Reduction tree execution example
        Input (shared memory):             10   1   8  -1   0  -2   3   5  -2  -3   2   7   0  11   0   2
        x[i] += x[i+8];  (threads 0-7)      8  -2  10   6   0   9   3   7  -2  -3   2   7   0  11   0   2
        x[i] += x[i+4];  (threads 0-3)      8   7  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2
        x[i] += x[i+2];  (threads 0-1)     21  20  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2
        x[i] += x[i+1];  (thread 0)        41  20  13  13   0   9   3   7  -2  -3   2   7   0  11   0   2
        Final result: x[0] = 41
    25. Pattern 1: Fit kernel to the data
        • Code lets B-thread block reduce B-element array
        • For larger sequences:
          - launch N/B blocks to reduce each B-element subsequence
          - write N/B partial sums to temporary array
          - repeat until done
        • We’ll be done in log_B N steps
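Pattern 1 can be sketched as a host loop that keeps launching the same per-block reduction until one value remains. This is an illustration under stated assumptions, not the lecture's code: the names `reduce_blocks` and `sum_pattern1` are hypothetical, the block size B is fixed at 256, the last block is padded with zeros, and error checking is omitted.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int B = 256;  // threads per block, a power of two

// Each block reduces one B-element subsequence to a single partial sum.
__global__ void reduce_blocks(const int *in, int *out, unsigned int n)
{
    __shared__ int x[B];
    unsigned int t = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + t;
    x[t] = (i < n) ? in[i] : 0;  // pad the last block with zeros
    __syncthreads();
    for (unsigned int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (t < offset) x[t] += x[t + offset];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = x[0];
}

// Hypothetical host wrapper: repeats the launch until one value is left.
int sum_pattern1(const std::vector<int> &v)
{
    unsigned int n = static_cast<unsigned int>(v.size());
    if (n == 0) return 0;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, ((n + B - 1) / B) * sizeof(int));
    cudaMemcpy(d_in, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    while (n > 1) {
        unsigned int nblocks = (n + B - 1) / B;
        // Launches in the same stream run in order, so each pass sees the
        // complete output of the previous one.
        reduce_blocks<<<nblocks, B>>>(d_in, d_out, n);
        std::swap(d_in, d_out);  // partial sums become the next input
        n = nblocks;
    }

    int result = 0;
    cudaMemcpy(&result, d_in, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With B = 256, an input of 100000 elements takes three launches (391 blocks, then 2, then 1), in line with the log_B N estimate on the slide.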
    26. Pattern 2: Fit kernel to the machine
        • For a block size of B:
          1) Launch B blocks to sum N/B elements
          2) Launch 1 block to combine B partial sums
        [Diagram, Stage 1 (many blocks): each block reduces 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25]
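A minimal sketch of Pattern 2, assuming B = 256: a fixed grid of B blocks is launched regardless of N, each thread first accumulates many elements serially, and a second one-block launch combines the B partial sums. Assigning elements to threads with a grid-sized stride is a choice made for this sketch; the slide only says each block sums N/B elements. The names `sum_strided` and `sum_pattern2` are hypothetical, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int B = 256;  // block size and number of stage-1 blocks

// Each thread sums a strided subset of the input, then the block combines
// its B per-thread sums with the reduction tree.
__global__ void sum_strided(const int *in, int *out, unsigned int n)
{
    __shared__ int x[B];
    unsigned int t = threadIdx.x;
    unsigned int stride = gridDim.x * blockDim.x;

    int acc = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + t; i < n; i += stride)
        acc += in[i];
    x[t] = acc;
    __syncthreads();

    for (unsigned int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (t < offset) x[t] += x[t + offset];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = x[0];
}

// Hypothetical host wrapper: two launches, whatever the input size.
// Assumes n is small enough that i + stride does not overflow 32 bits.
int sum_pattern2(const std::vector<int> &v)
{
    unsigned int n = static_cast<unsigned int>(v.size());
    if (n == 0) return 0;

    int *d_in = nullptr, *d_partial = nullptr, *d_result = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, B * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_in, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sum_strided<<<B, B>>>(d_in, d_partial, n);      // stage 1: B partial sums
    sum_strided<<<1, B>>>(d_partial, d_result, B);  // stage 2: combine them

    int result = 0;
    cudaMemcpy(&result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    return result;
}
```

The launch configuration depends only on B, so the number of kernel launches stays at two however large the input grows.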
    See http://sites.google.com/site/cudaiap2009 and ht more
