Intro to GPGPU with CUDA (DevLink)


Slides from my talk on CUDA programming at DevLink 2011

Published in: Technology

Speaker notes:
  • Sparse linear algebra is interesting both because many science and engineering codes rely on it, and because it was traditionally assumed to be something GPUs would not be good at (due to irregular data access patterns). We have shown that GPUs are in fact extremely good at sparse matrix-vector multiply (SpMV), the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums. Our implementation is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper at Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • Compared to highly optimized Fortran code from an oceanography researcher at UCLA
  • The current implementation uses a short-stack approach: the top elements of the stack are cached in registers.
  • RTAPI enables implementation of many different ray-tracing flavors. Left to right, top to bottom: procedural materials, ambient occlusion, Whitted ray tracer (thin-shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution ray tracing. It could also do non-rendering work, e.g. GIS (line of sight, say) or physics (collision/proximity detection).

    1. 1. Intro to GPGPU Programming With CUDA<br />Rob Gillen<br /><br />@argodev<br />
    2. 2. Intro to GPGPU Programming with CUDA<br />Rob Gillen<br />
    3. 3. Welcome!<br />Goals:<br />Overview of GPGPU with CUDA<br />“Vision Casting” for how you can use GPUs to improve your application<br />Introduction to CUDA C<br />Outline<br />Why GPGPUs?<br />Applications<br />Tooling<br />Hands-On: Matrix Multiplication<br />
    4. 4. Context Setting<br />Level of the Talk<br />Introductory/Overview<br />Perspective of the Speaker<br />12+ years as professional developer<br />4+ years at Oak Ridge National Laboratory<br />Disclaimer:<br />Many (most) of these slides are courtesy of NVIDIA corporation although they bear no responsibility for inaccuracies I introduce during this presentation.<br />
    5. 5. Why Use GPUs?<br />Motivation<br />
    6. 6. CPU vs. GPU<br />GPU devotes more transistors to data processing<br />Specialized (purpose-designed) Silicon<br />
    7. 7. NVIDIA Fermi<br />~1.5TFLOPS (SP)/~800GFLOPS (DP)<br />230 GB/s DRAM Bandwidth<br />
    8. 8. Motivation<br />Floating-point operations per second (FLOPS) and <br />memory bandwidth for the CPU and GPU <br />
    9. 9. Example: Sparse Matrix-Vector<br />CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007<br />
    10. 10. Rayleigh-Bénard Results<br />Double precision<br />384 x 384 x 192 grid (max that fits in 4GB)<br />Vertical slice of temperature at y=0<br />Transition from stratified (left) to turbulent (right)<br />Regime depends on Rayleigh number: Ra = gαΔT/κν<br />8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon<br />
    11. 11. G80 Characteristics<br />367 GFLOPS peak performance (25-50 times of current high-end microprocessors)<br />265 GFLOPS sustained for apps such as VMD<br />Massively parallel, 128 cores, 90W<br />Massively threaded, sustains 1000s of threads per app<br />30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics<br />
    12. 12. Supercomputer Comparison<br />
    13. 13. Applications<br />Exciting applications in the future mass-computing market have traditionally been considered “supercomputing applications”<br />Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products <br />These “super-apps” represent and model a physical, concurrent world<br />Various granularities of parallelism exist, but…<br />the programming model must not hinder parallel implementation<br />data delivery needs careful management<br />
    14. 14. *Not* for all applications<br />SPMD (Single Program, Multiple Data) applications are the best fit (data parallel)<br />Operations need to be of sufficient size to overcome overhead<br />Think millions of operations.<br />
    15. 15. Raytracing<br />
    16. 16. NVIRT: CUDA Ray Tracing API<br />
    17. 17. Tooling<br />VS 2010 C++ (Express is OK… sort-of.)<br />NVIDIA CUDA-Capable GPU<br />NVIDIA CUDA Toolkit (v4+)<br />NVIDIA CUDA Tools (v4+)<br />GPU Computing SDK<br />NVIDIA Parallel Nsight<br />
    18. 18. Parallel Debugging<br />
    19. 19. Parallel Analysis<br />
    20. 20. VS Project Templates<br />
    21. 21. VS Project Templates<br />
    22. 22. Outline of CUDA Basics<br />Basic Memory Management<br />Basic Kernels and Execution on GPU<br />Development Resources<br />See the Programming Guide for the full API<br />See the Getting Started Guide for installation and compilation instructions<br />Both guides are included in the toolkit<br />
    23. 23. Memory Spaces<br />CPU and GPU have separate memory spaces<br />Data is moved across PCIe bus<br />Use functions to allocate/set/copy memory on GPU<br />Very similar to corresponding C functions<br />Pointers are just addresses<br />Can’t tell from the pointer value whether the address is on CPU or GPU<br />Must exercise care when dereferencing:<br />Dereferencing CPU pointer on GPU will likely crash<br />Converse is also true<br />
    24. 24. GPU Memory Allocation / Release<br />Host (CPU) manages device (GPU) memory:<br />cudaMalloc(void** pointer, size_t nbytes)<br />cudaMemset(void* pointer, int value, size_t count)<br />cudaFree(void* pointer)<br />int n = 1024;<br />int nbytes = 1024 * sizeof(int);<br />int* d_a = 0;<br />cudaMalloc((void**)&d_a, nbytes);<br />cudaMemset(d_a, 0, nbytes);<br />cudaFree(d_a);<br />
    25. 25. Data Copies<br />cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction);<br />Returns after the copy is complete<br />Blocks the CPU thread until all bytes have been copied<br />Doesn’t start copying until previous CUDA calls complete<br />enum cudaMemcpyKind:<br />cudaMemcpyHostToDevice<br />cudaMemcpyDeviceToHost<br />cudaMemcpyDeviceToDevice<br />Non-blocking memcopies are provided<br />
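The slide notes these calls are “very similar to corresponding C functions”; a host-only sketch (plain malloc/memset/memcpy, no GPU involved) makes the correspondence explicit, with the CUDA analogue named in each comment:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Host-only mirror of the allocate/set/copy/free pattern from the slides.
// Each plain C call corresponds to the CUDA function named in its comment.
int copiedValueAt(int n, int fillByte, int index) {
    size_t nbytes = n * sizeof(int);
    int* a = (int*)std::malloc(nbytes);   // cudaMalloc((void**)&d_a, nbytes)
    std::memset(a, fillByte, nbytes);     // cudaMemset(d_a, 0, nbytes)
    int* b = (int*)std::malloc(nbytes);
    std::memcpy(b, a, nbytes);            // cudaMemcpy(dst, src, nbytes, direction)
    int v = b[index];
    std::free(a);                         // cudaFree(d_a)
    std::free(b);
    return v;
}
```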
    26. 26. Demo<br />Code Walkthrough 1<br />
    27. 27. CUDA Programming Model<br />Parallel code (kernel) is launched and executed on a device by many threads<br />Threads are grouped into thread blocks<br />Parallel code is written for a thread<br />Each thread is free to execute a unique code path<br />Built-in thread and block ID variables<br />
    28. 28. Thread Hierarchy<br />Threads launched for a parallel section are partitioned into thread blocks<br />Grid == all blocks for a given launch<br />Thread block is a group of threads that can:<br />Synchronize their execution<br />Communicate via shared memory<br />Threads → Thread Blocks → Grid<br />
    29. 29. Block IDs and Threads<br />Threads:<br />3D IDs, Unique within a block<br />Blocks:<br />2D IDs, unique within a grid<br />Dimensions set at launch time<br />Can be unique for each grid<br />Built-in variables:<br />threadIdx, blockIdx<br />blockDim, gridDim<br />
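The built-in variables combine into a unique global index per thread; a host-side sketch of that arithmetic (ordinary C++, no CUDA required) for the common 1D case:

```cpp
#include <cassert>

// Host-side sketch of the index arithmetic CUDA exposes through
// threadIdx/blockIdx/blockDim: in 1D, each thread's unique global index is
// blockIdx.x * blockDim.x + threadIdx.x.
int globalIndex(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}
```

Thread 3 of block 0 and thread 3 of block 1 get distinct global indices, which is what lets each thread own its own array element.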
    30. 30. Code executed on GPU<br />C function with some restrictions<br />Return void<br />Can only dereference GPU pointers<br />No static variables<br />Some additional restrictions for older GPUs<br />Must be declared with a qualifier:<br />__global__ : launched by CPU, cannot be called from GPU, must return void<br />__device__ : called from other GPU functions, cannot be launched by the CPU<br />__host__ : can be executed only by the CPU<br />__host__ and __device__ qualifiers can be combined<br />
    31. 31. Demo<br />Code Walkthrough 2<br />
    32. 32. Launching kernels on GPU<br />Launch Parameters:<br />Grid dimensions (up to 2D), dim3 type<br />Thread-block dimensions (up to 3D), dim3 type<br />Shared memory: number of bytes per block<br />For extern smem variables declared without size<br />Optional, 0 by default<br />Stream ID<br />Optional, 0 by default<br />dim3 grid(16, 16);<br />dim3 block(16, 16);<br />kernel<<<grid, block, 0, 0>>>(…);<br />kernel<<<32, 512>>>(…);<br />
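The configuration in <<<grid, block>>> determines the total thread count; a small host-side check of that arithmetic for the two launches shown above (the Dim3 struct here is a stand-in for CUDA's dim3, with missing dimensions defaulting to 1):

```cpp
#include <cassert>

// Stand-in for CUDA's dim3: unspecified dimensions default to 1,
// so kernel<<<32, 512>>> means a 32x1 grid of 512x1x1 blocks.
struct Dim3 {
    int x, y, z;
    Dim3(int x_ = 1, int y_ = 1, int z_ = 1) : x(x_), y(y_), z(z_) {}
};

// A launch <<<grid, block>>> creates grid.x*grid.y blocks (grids are up to
// 2D here) of block.x*block.y*block.z threads each.
long totalThreads(Dim3 grid, Dim3 block) {
    long blocks = (long)grid.x * grid.y;
    long perBlock = (long)block.x * block.y * block.z;
    return blocks * perBlock;
}
```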
    33. 33. Kernel Variations and Output<br />__global__ void kernel(int* a)<br />{<br />int idx = blockIdx.x * blockDim.x + threadIdx.x;<br /> a[idx] = 7;<br />} Output: 7777777777777777<br />__global__ void kernel(int* a)<br />{<br />int idx = blockIdx.x * blockDim.x + threadIdx.x;<br /> a[idx] = blockIdx.x;<br />} Output: 0000111122223333<br />__global__ void kernel(int* a)<br />{<br />int idx = blockIdx.x * blockDim.x + threadIdx.x;<br /> a[idx] = threadIdx.x;<br />} Output: 0123012301230123<br />
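These three kernels differ only in what each thread writes; a host-side emulation, with serial loops standing in for the 4-block × 4-thread launch the outputs imply, reproduces all three result arrays:

```cpp
#include <cassert>
#include <string>

// Host emulation of the three kernel variants with gridDim.x = 4 and
// blockDim.x = 4 (16 threads total). The nested loops stand in for the
// parallel threads; `which` selects what each thread writes.
std::string run(int which) {
    const int gridDim_x = 4, blockDim_x = 4;
    int a[16];
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int idx = blockIdx_x * blockDim_x + threadIdx_x;
            if (which == 0)      a[idx] = 7;            // constant
            else if (which == 1) a[idx] = blockIdx_x;   // block ID
            else                 a[idx] = threadIdx_x;  // thread ID
        }
    std::string out;
    for (int v : a) out += std::to_string(v);
    return out;
}
```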
    34. 34. Code Walkthrough 3<br />Build on Walkthrough 2<br />Write kernel to increment n×m integers<br />Copy the result back to CPU<br />Print the values<br />
    35. 35. Demo<br />Code Walkthrough 3<br />
    36. 36. Blocks must be independent<br />Any possible interleaving of blocks should be valid<br />Presumed to run to completion without pre-emption<br />Can run in any order<br />Can run concurrently OR sequentially<br />Blocks may coordinate but not synchronize<br />Shared queue pointer: OK<br />Shared lock: BAD … can easily deadlock<br />Independence requirement gives scalability<br />
    37. 37. Transparent Scalability<br />Hardware is free to assign blocks to any processor at any time<br />A kernel scales across any number of parallel processors<br />[Diagram: the same 8-block kernel grid scheduled over time on devices with different numbers of processors, two blocks at a time on one and four at a time on another]<br />Each block can execute in any order relative to other blocks. <br />
    38. 38. Extended Example<br />Matrix Multiplication<br />
    39. 39. A Simple Running Example: Matrix Multiplication<br />A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs<br />Leave shared memory usage until later<br />Local, register usage<br />Thread ID usage<br />Memory data transfer API between host and device<br />Assume square matrices for simplicity<br />
    40. 40. Programming Model: Square Matrix Multiplication Example<br />P = M * N of size WIDTH x WIDTH<br />Without tiling:<br />One thread calculates one element of P<br />M and N are loaded WIDTH times from global memory<br />[Diagram: matrices M, N, and P, each WIDTH x WIDTH]<br />
    41. 41. Memory Layout of Matrix in C<br />[Diagram: a 4x4 matrix M (elements M0,0 … M3,3) shown as a 2D grid and as its row-major linearization in memory]<br />
    42. 42. Simple Matrix Multiplication (CPU)<br />void MatrixMulOnHost(float* M, float* N, float* P, int Width)<br />{ <br /> for (int i = 0; i < Width; ++i) {<br /> for (int j = 0; j < Width; ++j) { <br /> float sum = 0;<br /> for (int k = 0; k < Width; ++k) {<br /> float a = M[i * Width + k];<br /> float b = N[k * Width + j];<br /> sum += a * b;<br /> }<br /> P[i * Width + j] = sum;<br /> }<br /> }<br />}<br />[Diagram: element (i, j) of P computed from row i of M and column j of N, all WIDTH x WIDTH]<br />
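Made self-contained, the host reference from this slide can be sanity-checked on a 2×2 case:

```cpp
#include <cassert>

// The host-side reference multiply from the slide: P = M * N,
// all matrices Width x Width in row-major layout.
void MatrixMulOnHost(const float* M, const float* N, float* P, int Width) {
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += M[i * Width + k] * N[k * Width + j];
            P[i * Width + j] = sum;
        }
}
```

For M = [[1,2],[3,4]] and N = [[5,6],[7,8]], the product is [[19,22],[43,50]].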
    43. 43. Simple Matrix Multiplication (GPU)<br />void MatrixMulOnDevice(float* M, float* N, float* P, int Width)<br />{<br />int size = Width * Width * sizeof(float); <br /> float *Md, *Nd, *Pd;<br /> …<br /> // 1. Allocate and load M, N to device memory <br />cudaMalloc(&Md, size);<br />cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);<br />cudaMalloc(&Nd, size);<br />cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);<br />// Allocate P on the device<br />cudaMalloc(&Pd, size);<br />
    44. 44. Simple Matrix Multiplication (GPU)<br /> // 2. Kernel invocation code – to be shown later<br /> …<br /> // 3. Read P from the device<br />cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);<br /> // Free device matrices<br />cudaFree(Md); <br />cudaFree(Nd); <br />cudaFree(Pd);<br />}<br />
    45. 45. Kernel Function<br />// Matrix multiplication kernel – per thread code<br />__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏<br />{<br /> // Pvalue is used to store the element of the matrix<br /> // that is computed by the thread<br /> float Pvalue = 0;<br />
    46. 46. Kernel Function (contd.)<br />for (int k = 0; k < Width; ++k) {<br />float Melement = Md[threadIdx.y*Width+k];<br />float Nelement = Nd[k*Width+threadIdx.x];<br />Pvalue += Melement * Nelement;<br /> }<br />Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;<br />}<br />[Diagram: thread (tx, ty) reads row ty of Md and column tx of Nd to produce one element of Pd]<br />
    47. 47. Kernel Function (full)<br />// Matrix multiplication kernel – per thread code<br />__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏<br />{ <br /> // Pvalue is used to store the element of the matrix<br />// that is computed by the thread<br />float Pvalue = 0;<br /> for (int k = 0; k < Width; ++k)‏ {<br /> float Melement = Md[threadIdx.y*Width+k];<br /> float Nelement = Nd[k*Width+threadIdx.x];<br />Pvalue += Melement * Nelement;<br /> }<br />Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;<br />}<br />
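A host-side emulation of this kernel, with serial loops over ty/tx standing in for the Width × Width threads of a single-block launch, matches the CPU reference on a small case:

```cpp
#include <cassert>

// Host emulation of MatrixMulKernel: the loops over ty/tx stand in for
// the Width x Width threads of one block, each computing one element of
// Pd exactly as the kernel body does with threadIdx.y/threadIdx.x.
void EmulateMatrixMulKernel(const float* Md, const float* Nd, float* Pd,
                            int Width) {
    for (int ty = 0; ty < Width; ++ty)        // threadIdx.y
        for (int tx = 0; tx < Width; ++tx) {  // threadIdx.x
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];
            Pd[ty * Width + tx] = Pvalue;
        }
}
```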
    48. 48. Kernel Invocation (Host Side)<br /> // Setup the execution configuration<br />dim3 dimGrid(1, 1);<br />dim3 dimBlock(Width, Width);<br />// Launch the device computation threads!<br />MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);<br />
    49. 49. Only One Thread Block Used<br />One block of threads computes matrix Pd<br />Each thread computes one element of Pd<br />Each thread:<br />Loads a row of matrix Md<br />Loads a column of matrix Nd<br />Performs one multiply and one addition for each pair of Md and Nd elements<br />Compute to off-chip memory access ratio close to 1:1 (not very high)<br />Size of matrix limited by the number of threads allowed in a thread block<br />[Diagram: Grid 1 with a single Block 1; thread (2, 2) computes one element of the WIDTH x WIDTH Pd from Md and Nd]<br />
    50. 50. Handling Arbitrary Sized Square Matrices<br />Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix<br />Each has (TILE_WIDTH)² threads<br />Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks<br />You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!<br />[Diagram: block (bx, by), thread (tx, ty) computes one element of a TILE_WIDTH x TILE_WIDTH tile of Pd]<br />
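The tile-to-element mapping can be checked with ordinary host code: block (bx, by), thread (tx, ty) owns element (by·TILE_WIDTH+ty, bx·TILE_WIDTH+tx) of the result, and when WIDTH is a multiple of TILE_WIDTH the blocks cover the result exactly once:

```cpp
#include <cassert>
#include <vector>

// Index arithmetic for the tiled scheme: block (bx, by), thread (tx, ty)
// owns element (tileRow, tileCol) of Pd.
int tileRow(int by, int ty, int TILE_WIDTH) { return by * TILE_WIDTH + ty; }
int tileCol(int bx, int tx, int TILE_WIDTH) { return bx * TILE_WIDTH + tx; }

// Verify that a (WIDTH/TILE_WIDTH)^2 grid of TILE_WIDTH x TILE_WIDTH
// blocks touches every element of the WIDTH x WIDTH result exactly once
// (WIDTH assumed to be a multiple of TILE_WIDTH).
bool tilesCoverExactlyOnce(int WIDTH, int TILE_WIDTH) {
    std::vector<int> hits(WIDTH * WIDTH, 0);
    int grid = WIDTH / TILE_WIDTH;
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < TILE_WIDTH; ++ty)
                for (int tx = 0; tx < TILE_WIDTH; ++tx)
                    ++hits[tileRow(by, ty, TILE_WIDTH) * WIDTH +
                           tileCol(bx, tx, TILE_WIDTH)];
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```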
    51. 51. Small Example<br />TILE_WIDTH = 2<br />[Diagram: 4x4 matrices Md, Nd, Pd; Pd partitioned into four 2x2 tiles computed by blocks (0,0), (1,0), (0,1), and (1,1)]<br />
    52. 52. Cleanup Topics<br />Memory Management<br />Pinned Memory (Zero-Transfer)<br />Portable Pinned Memory<br />Multi-GPU<br />Wrappers (Python, Java, .NET)<br />Kernels<br />Atomics<br />Thread Synchronization (staged reductions)<br />NVCC<br />
    53. 53. Questions?<br /><br /><br />