
Introduction to parallel computing using CUDA

Supercomputers in our lab
CUDA: history, API, GPU vs. CPU, etc.
Practical examples

Thanks to Nvidia for the pictures.


Slide 2:
- Supercomputers
  - Performance, connectivity, iCub and rules
- Massively parallel processing
  - General overview
- CUDA
  - Processor architecture
  - Memory model overview
  - Programming API, tools and techniques
  - Principles and patterns of parallel programming
- Example applications and case studies
  - SDK, performance and practical implications
- Practical example
  - Main functions explained on a simple application

Slide 4:
- Programming Massively Parallel Processors
  - World's first textbook on programming GPUs
- http://courses.ece.illinois.edu/ece498/al/Syllabus.html
  - Useful lecture slides
  - Textbook, documentation and software resources
- http://www.nvidia.com/object/cuda_home_new.html
  - Research applications from various fields using CUDA
  - Drivers, SDK and CUDA programming guides
- http://dl.dropbox.com/u/81820/Supercomputers/CUDA.pptx
  - This presentation
- http://www.ddj.com/cpp/207200659
  - Tutorials
- http://code.google.com/p/thrust/
  - C++ template library for CUDA

Slide 5:
“If you build it, they will come.” “And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department … and they haven’t come. Programmers haven’t come to program these wonderful machines. … The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?” - Mattson, Sanders, Massingill (2005)

Slide 6:
- Six Linux servers, each with:
  - 8 x 2.67GHz hyper-threaded processors
  - 16GB RAM
  - Either a GeForce GTX 285 or a GTX 470 graphics card
- gpu-node2 and gpu-node6 have:
  - 3 x Tesla C1060 and 1 x GeForce GTX 470
- gpu-node3 has:
  - 2 x Tesla C1060 and 1 x GeForce GTX 470

Slide 9:
- Overall performance
  - 48 CPUs (2.67GHz each)
  - 96 GB RAM
  - 4192 CUDA cores
  - 1.67TB/sec peak CUDA memory bandwidth
  - 39GB total CUDA global memory
  - 10.02 TFLOPS (trillion floating-point operations per second) on the CUDA cards

Slide 10:
- All computers are accessible locally or remotely via the main gateway (gpu-node1) in the iCub's room
  - For remote control, use any VNC client to connect to 141.163.186.64, from where you can control any other machine
  - A VPN is required outside the university network
- gpu-node1 has both an external and a local IP address, through which it relays internet access to all other computers
- The iCub is connected to gpu-node3

Slide 11:
(Network diagram; caption: 10Mbit Ethernet.)

Slide 12:
- New libraries are placed in:
  - /gpu-node*/home/Dev/Lib
- Your applications go in:
  - /gpu-node*/home/Dev/Project
- System updates remain disabled
  - Ubuntu updates often cause more trouble than good, so they are not normally applied

Slide 13:
- A quiet revolution and potential build-up
  - A GPU in every PC: massive volume and potential impact
  - Speed increase: some applications run 2,500x faster than on the CPU alone

Slide 15:
- GPUs deliver 500+ GFLOPS on parallel applications
  - Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- The programming model scales transparently
- Programmable in C with the CUDA tools
- The multithreaded SPMD model exploits both data parallelism and thread parallelism

Slide 16:
- Compute Unified Device Architecture
  - The API is an extension to the ANSI C programming language (see the sketch below)
  - Threads run on the GPU, a super-threaded, massively data-parallel co-processor
  - Standalone driver: optimized for computation
  - Interface designed for compute: a graphics-free API
- Started as GPGPU
  - General-Purpose computation using the GPU and the graphics API in applications other than 3D graphics
  - The GPU accelerates the critical path of the application
  - Applications could be sped up, but usability was very restricted:
    - Dealing with the graphics API: working with its corner cases
    - Addressing modes: limited texture size
    - Shader capabilities: limited outputs
    - Instruction sets: lack of integer and bit operations
    - Communication: limited to between pixels

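To make "extension to ANSI C" concrete, here is a minimal sketch; the kernel name, array size and launch configuration are illustrative, and the host-device copies are omitted for brevity (they are covered on slides 27-30):

    #include <cuda_runtime.h>

    // __global__ marks a kernel: a C function that runs on the GPU,
    // launched with the <<<blocks, threads>>> language extension.
    __global__ void scaleArray(float* data, float factor)
    {
        data[threadIdx.x] *= factor;   // each thread handles one element
    }

    int main()
    {
        float* d_data;
        cudaMalloc((void**)&d_data, 256 * sizeof(float));
        scaleArray<<<1, 256>>>(d_data, 2.0f);   // 1 block of 256 threads
        cudaDeviceSynchronize();                // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }
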
Slide 17:
- Differences in the fundamental design philosophies
  - The CPU is optimized for sequential code performance
    - Sophisticated control logic: instructions from a single thread can execute in parallel while maintaining the appearance of sequential execution
    - Large cache memories: reduce the instruction and data access latencies of large, complex applications
    - Neither the control logic nor the cache memories contribute to the peak calculation speed
    - General-purpose multi-core microprocessors typically have four large processor cores to increase calculation speed
    - Performance improvement of general-purpose microprocessors has slowed significantly

Slide 18:
- Differences in the fundamental design philosophies
  - GPU development was driven by the fast-growing video game industry, to satisfy the massive demand for floating-point calculations per video frame
    - Chip area dedicated to floating-point calculations is maximised
    - Optimised for executing a massive number of threads
    - Small cache memories help control bandwidth and minimise thread accesses to DRAM

Slide 19:
- Discrepancy in floating-point capability between the CPU and the GPU
  - The GPU is specialized for compute-intensive, highly parallel computation
  - Transistors are devoted to data processing rather than data caching and flow control

Slide 22:
- A compute device
  - Is a coprocessor to the CPU, or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels, which run on many threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight
    - Very little creation overhead
  - A GPU needs 1000s of threads for full efficiency
    - A multi-core CPU needs only a few

Slide 23:
- A CUDA kernel is executed by an array of threads
  - All threads run the same code (SPMD)
  - Each thread has an ID that it uses to compute memory addresses and make control decisions

(Diagram: threads 0-7, each executing the same fragment with its own threadID:)

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

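Completed into a runnable kernel, the fragment might look like this sketch; func, input and output come from the slide, while the __device__ helper and its body are assumptions:

    __device__ float func(float x) { return x * x; }   // stand-in for any per-element function

    __global__ void applyFunc(const float* input, float* output)
    {
        int threadID = threadIdx.x;   // the thread's ID within its block (0-7 in the diagram)
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
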
Slide 24:
- Divide the monolithic thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
  - Threads in different blocks cannot cooperate

(Diagram: Thread Block 0, Thread Block 1, …, Thread Block N-1, each running the same fragment as above over its own threads 0-7.)

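As an illustrative sketch of in-block cooperation (not from the slides): threads stage data in __shared__ memory and wait at a __syncthreads() barrier before reading one another's values. Here each block, launched with 256 threads, reverses its own chunk of the array:

    __global__ void reverseChunk(float* data)
    {
        __shared__ float tile[256];              // visible to every thread in this block
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;      // start of this block's chunk

        tile[t] = data[base + t];                // each thread loads one element
        __syncthreads();                         // barrier: the whole tile must be loaded first
        data[base + t] = tile[blockDim.x - 1 - t];   // read an element loaded by another thread
    }
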
Slide 25:
- Each thread uses its IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- This simplifies memory addressing when processing multidimensional data, e.g.:
  - Image processing
  - Solving PDEs on volumes

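For the image-processing case, 2D block and thread IDs map directly onto pixel coordinates, as in this sketch (the kernel name and the brightening operation are illustrative):

    __global__ void brighten(unsigned char* image, int width, int height)
    {
        // combine 2D block and thread IDs into pixel coordinates
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)   // the image need not be a multiple of the block size
            image[y * width + x] = min(image[y * width + x] + 10, 255);
    }
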
Slide 26:
- Global memory
  - The main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long latency access
- We will focus on global memory for now

(Diagram: CUDA memory model: the host; a grid of blocks, each with its own shared memory; threads with per-thread registers; and global memory shared by all.)

Slide 27:
- cudaMalloc()
  - Allocates an object in device global memory
  - Requires two parameters
    - Address of a pointer to the allocated object
    - Size of the allocated object
- cudaFree()
  - Frees an object from device global memory
    - Takes a pointer to the freed object

(Diagram: the same memory-model figure as the previous slide.)

Slide 28:
- Code example:
  - Allocate a 64 * 64 single-precision float array
  - Attach the allocated storage to Md
  - "d" is often used to indicate a device data structure

    const int TILE_WIDTH = 64;   // the slide omits the type
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    // ... use Md on the device ...
    cudaFree(Md);

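In real code it is worth checking the result; every CUDA runtime call returns a cudaError_t. This error-handling pattern is an addition, not from the slides:

    cudaError_t err = cudaMalloc((void**)&Md, size);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
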
Slide 29:
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer
      - Host to Host
      - Host to Device
      - Device to Host
      - Device to Device
- Asynchronous transfers are also possible

(Diagram: the same memory-model figure as the previous slides.)

Slide 30:
- Code example:
  - Transfer a 64 * 64 single-precision float array
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

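Putting slides 28 and 30 together, a typical host-side round trip looks like this sketch (the kernel launch is elided; M, Md and size are as above):

    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
    float* M = (float*)malloc(size);            // host array
    float* Md;
    cudaMalloc((void**)&Md, size);              // device array

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
    // ... launch a kernel that reads and writes Md ...
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(Md);
    free(M);
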
Slide 31:
- P = M * N, all of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - M and N are loaded WIDTH times from global memory

Device kernel:

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Pvalue stores the element of the matrix computed by this thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Melement = Md[threadIdx.y * Width + k];
            float Nelement = Nd[k * Width + threadIdx.x];
            Pvalue += Melement * Nelement;
        }
        Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
    }

Host reference version (note the double-precision accumulator):

    // Matrix multiplication on the (CPU) host
    void MatrixMulOnHost(float* M, float* N, float* P, int Width)
    {
        for (int i = 0; i < Width; ++i)
            for (int j = 0; j < Width; ++j) {
                double sum = 0;
                for (int k = 0; k < Width; ++k) {
                    double a = M[i * Width + k];   // the slide's lowercase "width" was a typo
                    double b = N[k * Width + j];
                    sum += a * b;
                }
                P[i * Width + j] = sum;
            }
    }

(Diagram: M, N and P as WIDTH x WIDTH grids; row i of M and column j of N combine over index k to produce element (i, j) of P.)

Slide 32:
- One block of threads computes matrix Pd (see the launch sketch below)
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block
  - Unless we use tiling

(Diagrams: Grid 1 contains Block 1, in which thread (2, 2) computes one element of Pd from Md and Nd; in the tiled scheme, each block computes a TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub from TILE_WIDTH-wide strips of Md and Nd.)

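A launch configuration matching this single-block scheme might look like the following sketch (the dim3 values are assumptions; Width * Width must not exceed the per-block thread limit, which is exactly the restriction the slide mentions):

    dim3 dimGrid(1, 1);              // a single block covers the whole matrix
    dim3 dimBlock(Width, Width);     // one thread per element of Pd
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
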
Slide 34:
- Any source file containing CUDA language extensions must be compiled with NVCC
- NVCC is a compiler driver
  - Works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, …
- NVCC outputs C code (host CPU code)
  - This must then be compiled with the rest of the application using another tool
- Any executable with CUDA code requires two dynamic libraries:
  - The CUDA runtime library (cudart)
  - The CUDA core library (cuda)

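In the simplest case the whole build is a single nvcc invocation, as in this sketch (file and program names are illustrative):

    nvcc matrixmul.cu -o matrixmul   # nvcc separates device from host code and drives the host compiler
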
Slide 35:
- An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  - No device or CUDA driver is needed
  - Each device thread is emulated with a host thread
- Running in device emulation mode, one can:
  - Use native host debugging support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code, and vice versa
  - Call any host function from device code (e.g. printf), and vice versa
  - Detect deadlock situations caused by improper use of __syncthreads
- CUDA-GDB
  - An all-in-one debugging environment capable of debugging native host code as well as CUDA code

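The corresponding build commands, as a sketch (program names are illustrative; -g and -G are the usual host and device debug-info flags for CUDA-GDB):

    nvcc -deviceemu matrixmul.cu -o matrixmul_emu   # device code emulated on the host CPU
    nvcc -g -G matrixmul.cu -o matrixmul_dbg        # debuggable under cuda-gdb
    cuda-gdb ./matrixmul_dbg
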
