Intro to GPGPU Programming with CUDA

Speaker notes:
  • Sparse linear algebra is interesting both because many science and engineering codes rely on it, and because it was traditionally assumed to be something GPUs would not be good at (due to irregular data access patterns). We have shown that GPUs are in fact extremely good at sparse matrix-vector multiply (SpMV), the basic building block of sparse linear algebra. The code and an accompanying white paper are available on the CUDA forums and are also posted on research.nvidia.com. This is compared to an extremely well-studied, well-optimized SpMV implementation from a widely respected paper at Supercomputing 2007. That paper only reported double-precision results for CPUs; our single-precision results are even more impressive in comparison.
  • Compared to highly optimized Fortran code from an oceanography researcher at UCLA.
  • The current implementation uses a short-stack approach. Top elements of the stack are cached in registers.
  • RTAPI enables implementation of many different ray tracing flavors. Left-right, top-bottom: procedural materials, ambient occlusion, Whitted ray tracer (thin-shell glass and metallic spheres), path tracer (Cornell box), refractions, Cook-style distribution ray tracing. It could also do non-rendering work, e.g. GIS (line of sight, say) and physics (collision/proximity detection).
  • Transcript

    • 1. Rob Gillen
Intro to GPGPU Programming with CUDA
    • 2. CodeStock is proudly partnered with:
      RecruitWise and Staff with Excellence - www.recruitwise.jobs
      Send instant feedback on this session via Twitter:
      Send a direct message with the room number to @CodeStock
      d codestock 411 This guy is Amazing!
      For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.
    • 3.
    • 4. Intro to GPGPU Programming with CUDA
      Rob Gillen
    • 5. Welcome!
      Goals:
      Overview of GPGPU with CUDA
      “Vision Casting” for how you can use GPUs to improve your application
      Outline
      Why GPGPUs?
      Applications
      Tooling
      Hands-On: Matrix Multiplication
      Rating: http://spkr8.com/t/7714
    • 6. CPU vs. GPU
      GPU devotes more transistors to data processing
    • 7. NVIDIA Fermi
~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
      230 GB/s DRAM Bandwidth
    • 8. Motivation
Floating-point operations per second (FLOPS) and
memory bandwidth for the CPU and GPU
    • 9. Example: Sparse Matrix-Vector
CPU results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms,” Williams et al., Supercomputing 2007
    • 10. Rayleigh-Bénard Results
      Double precision
      384 x 384 x 192 grid (max that fits in 4GB)
      Vertical slice of temperature at y=0
      Transition from stratified (left) to turbulent (right)
Regime depends on the Rayleigh number: Ra = gαΔTH³/(κν), where H is the layer height
      8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon
    • 11. G80 Characteristics
367 GFLOPS peak performance (25-50x current high-end microprocessors)
      265 GFLOPS sustained for apps such as VMD
      Massively parallel, 128 cores, 90W
      Massively threaded, sustains 1000s of threads per app
      30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
    • 12. Supercomputer Comparison
    • 13. Applications
      Exciting applications in future mass computing market have been traditionally considered “supercomputing applications”
Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
These “Super-apps” represent and model the physical, concurrent world
      Various granularities of parallelism exist, but…
      programming model must not hinder parallel implementation
      data delivery needs careful management
    • 14. *Not* for all applications
SPMD (Single Program, Multiple Data) workloads are best (data parallel)
Operations need to be of sufficient size to overcome the overhead
Think millions of operations (a minimal sketch follows this list)
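      A minimal sketch (an assumption, not from the deck) of the canonical SPMD pattern named above: every thread runs the same program on its own element. The kernel name and launch sizes are illustrative.
      // Each thread adds one element pair; the guard handles the final,
      // partially filled block.
      __global__ void VecAdd(const float* a, const float* b, float* c, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
          if (i < n)
              c[i] = a[i] + b[i];
      }
      // Host launch: enough 256-thread blocks to cover all n elements
      // VecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);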
    • 15. Raytracing
    • 16. NVIRT: CUDA Ray Tracing API
    • 17. Tooling
VS 2010 C++ (Express is OK… sort of.)
      NVIDIA CUDA-Capable GPU
      NVIDIA CUDA Toolkit (v4+)
      NVIDIA CUDA Tools (v4+)
      GPU Computing SDK
NVIDIA Parallel Nsight
    • 18. Parallel Debugging
    • 19. Parallel Analysis
    • 20. VS Project Templates
    • 21. VS Project Templates
    • 22. Before we get too excited…
      Host vs Device
      Kernels
      __global__ __device__ __host__
      Thread/Block Control
      <<<x, y>>>
      Multi-dimensioned coordinate objects
      Memory Management/Movement
Thread Management – think thousands or millions of threads (a minimal sketch follows this list)
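      A minimal sketch (an assumption, not from the deck) tying these pieces together: the __device__ and __global__ qualifiers, and a launch using dim3 coordinate objects via the <<<grid, block>>> syntax. Names are illustrative.
      // __device__ function: runs on the GPU, callable from device code only
      __device__ float square(float x) { return x * x; }

      // __global__ kernel: runs on the GPU, launched from the host
      __global__ void SquareAll(float* d, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              d[i] = square(d[i]);
      }

      // Host-side launch configuration with dim3 coordinate objects:
      // dim3 grid((n + 127) / 128);  // blocks in the grid
      // dim3 block(128);             // threads per block
      // SquareAll<<<grid, block>>>(d, n);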
    • 23. Block IDs and Threads
      Each thread uses IDs to decide what data to work on
      Block ID: 1D or 2D
      Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data (a sketch follows this list)
      Image processing
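      A minimal sketch (an assumption, not from the deck) of the image-processing case: block and thread IDs combine into 2D pixel coordinates.
      __global__ void Invert(unsigned char* img, int width, int height)
      {
          int x = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
          int y = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
          if (x < width && y < height)                    // skip threads off the image
              img[y * width + x] = 255 - img[y * width + x];
      }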
    • 24. CUDA Thread Block
      All threads in a block execute the same kernel program (SPMD)
      Programmer declares block:
      Block size 1 to 512 concurrent threads
      Block shape 1D, 2D, or 3D
      Block dimensions in threads
      Threads have thread id numbers within block
      Thread program uses thread id to select work and address shared data
      Threads in the same block share data and synchronize while doing their share of the work
      Threads in different blocks cannot cooperate
Each block can execute in any order relative to other blocks!
[Diagram: a CUDA thread block; threads with IDs 0, 1, 2, 3, …, m all run the same thread program.]
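      A minimal sketch (an assumption, not from the deck) of block-level cooperation: threads share data through __shared__ memory and synchronize with __syncthreads(). Assumes blocks of at most 256 threads.
      __global__ void ReverseInBlock(const float* in, float* out)
      {
          __shared__ float buf[256];   // visible only to this block's threads
          int base = blockIdx.x * blockDim.x;
          buf[threadIdx.x] = in[base + threadIdx.x];
          __syncthreads();             // wait until every thread has written
          out[base + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
      }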
    • 25. Transparent Scalability
Hardware is free to assign blocks to any processor at any time
      A kernel scales across any number of parallel processors
[Diagram: the same kernel grid of Block 0 through Block 7 runs on a small device two blocks at a time and on a larger device four blocks at a time, with time advancing downward.]
      Each block can execute in any order relative to other blocks.
26. A Simple Running Example: Matrix Multiplication
      A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
      Leave shared memory usage until later
      Local, register usage
      Thread ID usage
      Memory data transfer API between host and device
      Assume square matrix for simplicity
27. Programming Model: Square Matrix Multiplication Example
      P = M * N of size WIDTH x WIDTH
      Without tiling:
      One thread calculates one element of P
M and N are loaded WIDTH times from global memory
    • 28. Memory Layout of Matrix in C
[Diagram: a 4 x 4 matrix M shown as a 2D grid and as its flattened row-major layout in memory: the four elements of each row are contiguous, and rows are stored one after another. A sketch of the corresponding indexing follows.]
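      A minimal sketch (an assumption, not from the deck) of that row-major addressing in C:
      // Element (row, col) of a Width x Width matrix stored in a flat array
      float at(const float* M, int row, int col, int Width)
      {
          return M[row * Width + col];  // rows are contiguous in memory
      }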
    • 29. Simple Matrix Multiplication (CPU)
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}
    • 30. Simple Matrix Multiplication (GPU)
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N into device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
    • 31. Simple Matrix Multiplication (GPU)
    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
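      Not shown on the slide: every CUDA runtime call returns a cudaError_t worth checking in real code. A minimal sketch (an assumption; printf requires <stdio.h>):
      cudaError_t err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
      if (err != cudaSuccess)
          printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));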
    • 32. Kernel Function
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    • 33. Kernel Function (contd.)
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
    • 34. Kernel Function (full)
// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
    • 35. Kernel Invocation (Host Side)
      // Setup the execution configuration
      dim3 dimGrid(1, 1);
      dim3 dimBlock(Width, Width);
      // Launch the device computation threads!
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    • 36. Only One Thread Block Used
One block of threads computes matrix Pd
      Each thread computes one element of Pd
      Each thread
      Loads a row of matrix Md
      Loads a column of matrix Nd
Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio is close to 1:1 (not very high)
      Size of matrix limited by the number of threads allowed in a thread block
    • 37. Handling Arbitrary Sized Square Matrices
Have each 2D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
Each block has (TILE_WIDTH)² threads
Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the maximum grid size (64K)! A sketch of the tiled kernel follows.
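      A minimal sketch (an assumption, not shown verbatim in the deck) of the tiled kernel and its launch; it assumes Width is a multiple of TILE_WIDTH.
      #define TILE_WIDTH 16

      __global__ void MatrixMulTiled(float* Md, float* Nd, float* Pd, int Width)
      {
          // Block IDs pick the tile; thread IDs pick the element within it
          int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
          int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
          float Pvalue = 0;
          for (int k = 0; k < Width; ++k)
              Pvalue += Md[row * Width + k] * Nd[k * Width + col];
          Pd[row * Width + col] = Pvalue;
      }

      // dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
      // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
      // MatrixMulTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);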
    • 38. Small Example
[Diagram: a worked layout with TILE_WIDTH = 2. The 4 x 4 result Pd (Pd0,0 … Pd3,3) is divided among four thread blocks, Block(0,0), Block(1,0), Block(0,1), and Block(1,1), each computing one 2 x 2 tile of Pd from the corresponding elements of Md and Nd.]
    • 39. Cleanup Topics
      Memory Management
      Pinned Memory (Zero-Transfer)
      Portable Pinned Memory
      Multi-GPU
      Wrappers (Python, Java, .NET)
      Kernels
      Atomics
      Thread Synchronization (staged reductions)
      NVCC
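      A minimal sketch (an assumption, not from the deck) of the pinned-memory item above: page-locked host memory from cudaMallocHost transfers faster and allows asynchronous copies. The buffer size is illustrative.
      size_t size = 1 << 20;                  // 1 MB, for illustration
      float *h_buf, *d_buf;
      cudaMallocHost((void**)&h_buf, size);   // pinned (page-locked) host memory
      cudaMalloc((void**)&d_buf, size);
      cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, 0);
      cudaDeviceSynchronize();                // ensure the copy is done
      cudaFreeHost(h_buf);                    // pinned memory needs cudaFreeHost
      cudaFree(d_buf);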
    • 40. Questions?
rob@gillenfamily.net | @argodev
http://rob.gillenfamily.net
Rate: http://spkr8.com/t/7714