NVIDIA CUDA
The Compute Unified Device Architecture

Jens Rühmkorf
Tech Talk, DLR Köln-Porz, July 22nd 2009

References

  University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk,
  course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/
  Website about General-Purpose Computation on Graphics Hardware:
  http://gpgpu.org/developer/cuda
  ACM Queue, Vol. 6 No. 2, March/April 2008 (issue on GPGPU):
  http://mags.acm.org/queue/20080304/
  Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13:
  http://www.ddj.com/architect/207200659
  NVIDIA CUDA Best Practices Guide:
  http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
  Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online:
  http://developer.nvidia.com/object/gpu-gems-3.html

Multi- and Manycore Architectures
A Difficult Road Lies Ahead

  Don Knuth on multicore architectures:
  “[…] my personal unhappiness with the current trend toward multicore
  architecture. To me, it looks more or less like the hardware designers have
  run out of ideas, and that they’re trying to pass the blame for the future
  demise of Moore’s Law to the software writers by giving us machines that work
  faster only on a few key benchmarks! I won’t be surprised at all if the whole
  multithreading idea turns out to be a flop.”
  In: InformIT, April 25th 2008,
  http://www.informit.com/articles/article.aspx?p=1193856

Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

Multicore and Manycore (1)
Structural Differences

  Multicore: yoke of oxen
    Each core optimized for executing a single thread
  Manycore: flock of chickens
    Cores optimized for aggregate throughput, deemphasizing individual performance

Multicore and Manycore (2)
Technical Characteristics

  Specification             Core i7 960                      GTX285
  Processing elements       4 cores, 4-way SIMD @ 3.2 GHz    30 cores, 8-way SIMD @ 1.5 GHz
  Resident threads (max)    4 cores, 2 threads, 4-wide       30 cores, 32 SIMD vectors, 32-wide
                            SIMD: 32 strands                 SIMD: 30720 strands
  SP GFLOP/s                102                              1080
  Memory bandwidth          25.6 GB/s                        159 GB/s
  Register file             -                                1.875 MB
  Local store               -                                480 kB

Multicore and Manycore (3)
Performance Comparison: CPU vs. GPU

  [Chart: CPU vs. GPU; y-axis: floating-point operations per second, single precision]

An Example of the Physical Reality Behind CUDA

  [Figure: CPU (host) next to a GPU with local DRAM (device)]

CUDA Processing Flow

  [Figure: CUDA processing flow between host and device]

CUDA in a Nutshell
Key Characteristics

  CUDA is designed for wide SIMD/SPMD parallelism & scalability
  CUDA provides 3 key abstractions, i.e. a hierarchy:
    of thread groups,
    of shared memories, and
    of barrier synchronization
  CUDA programs are written in C + extensions
  OpenCL is inspired by CUDA, but HW & SW vendor neutral
    Programming model essentially identical

Hello World

  // hello-world.cu
  // Compute vector sum C = A + B
  // Each thread performs one pair-wise addition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x + blockDim.x * blockIdx.x;
      C[i] = A[i] + B[i];
  }

  int main()
  {
      // Run N/256 blocks of 256 threads each
      vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
  }

Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA Programming Model
Structure of a CUDA Application

  Integrated host + device application C program
    Serial or modestly parallel parts in host C code
    Highly parallel parts in device SPMD kernel C code

  Serial code (host)
  Parallel kernel (device):  KernelA<<< nBlk, nTid >>>(args);
  Serial code (host)
  Parallel kernel (device):  KernelB<<< nBlk, nTid >>>(args);

CUDA Programming Model
CUDA Devices and Threads

  A compute device
    Is a coprocessor to the CPU or host
    Has its own DRAM (device memory)
    Runs many threads in parallel
    Is typically a GPU but can also be another type of parallel processing device
  Express data-parallel portions as device kernels (which run on many threads)

CUDA Programming Model
Arrays of Parallel Threads

  Execute a kernel by specifying arrays of threads
  All threads run the same code (SPMD)
  Use the thread ID to compute memory addresses & make control decisions

  threadID:  0 1 2 3 4 5 6 7 …

      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;

CUDA Programming Model
Use Thread Blocks for (Scalable) Cooperation

  Divide the monolithic thread array into multiple blocks
    Threads within a block can cooperate via shared memory, atomic operations,
    and barrier synchronization
    Threads in different blocks cannot cooperate

  Thread Block 0 … Thread Block N-1, each with threadIDs 0 1 2 3 4 5 6 7 …,
  each running:

      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;

CUDA Programming Model
Organisation of Thread Blocks

  Thread blocks can be one-, two- or three-dimensional arrays
  The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device
  Each kernel is executed as a batch of threads
  This batch is organized as a grid of thread blocks
  (figure shows 2-dimensional thread blocks)

CUDA Programming Model
Block IDs and Thread IDs

  Each thread uses IDs to decide what data to work on
    Block ID: 1D or 2D
    Thread ID: 1D, 2D, or 3D
  Simplifies memory addressing when processing multidimensional data

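As a concrete illustration (not from the deck), a minimal sketch of how block and thread IDs combine to index a row-major 2D array; the kernel name and layout are assumptions:

    // Sketch: mapping 2D block/thread IDs onto a row-major 2D array.
    // scaleKernel and the flat layout are illustrative assumptions.
    __global__ void scaleKernel(float* data, int width, int height, float factor)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // x position in the grid
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // y position in the grid
        if (col < width && row < height)                  // guard against partial blocks
            data[row * width + col] *= factor;
    }
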
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA Memory Model
Overview

  Global memory
    Main means of communicating R/W data between host and device
    Contents visible to all threads
    Long latency access
  We will focus on global memory for now
    Constant and texture memory will not be covered here

  [Figure: grid with blocks (0,0) and (1,0), each with shared memory, registers
  and threads; the host accesses global, constant and texture memory]

CUDA Memory Model
CUDA Device Memory Allocation

  cudaMalloc()
    Allocates object in the device global memory
    Requires two parameters:
      Address of a pointer to the allocated object
      Size of allocated object
  cudaFree()
    Frees object from the device global memory
    Pointer to freed object

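A minimal sketch of this pair in use (error handling elided; buffer name and size are illustrative):

    // Sketch: allocate and release 256 floats of device global memory.
    float* d_buf = 0;
    size_t size = 256 * sizeof(float);
    cudaMalloc((void**)&d_buf, size);   // parameter 1: address of the pointer, parameter 2: size in bytes
    // ... use d_buf in kernels ...
    cudaFree(d_buf);                    // release the allocation
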
CUDA Memory Model
CUDA Host-Device Data Transfer

  cudaMemcpy()
    Memory data transfer
    Requires four parameters:
      Pointer to destination
      Pointer to source
      Number of bytes to copy
      Type of transfer
  Type of transfer is one of:
    Host to Host
    Host to Device
    Device to Host
    Device to Device
  Asynchronous transfer

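A short sketch of the round trip described above (array name and size are illustrative):

    // Sketch: copy a host array to the device and back again.
    float h_data[256];
    float* d_data = 0;
    size_t size = sizeof(h_data);
    cudaMalloc((void**)&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // dest, src, bytes, direction
    // ... launch kernels that read/write d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // bring results back
    cudaFree(d_data);
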
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

CUDA API
Extended C

  [Diagram: the integrated source (foo.cu) is processed by cudacc (EDG C/C++
  frontend, Open64 global optimizer); the resulting GPU assembly (foo.s) is
  compiled by OCG into G80 SASS (foo.sass), while the CPU host code (foo.cpp)
  is compiled with gcc / cl]

  Mark Murphy, “NVIDIA’s Experience with Open64”,
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA API
C for CUDA

  Function type specifiers: __global__, __device__, __host__
  Variable type specifiers: __device__, __shared__, __constant__
  Keywords: threadIdx, blockIdx
  Intrinsics / builtin functions: __syncthreads()
  Runtime API: memory, symbol, execution management
  Function launch

  // image-convolution.cu
  __device__ float filter[N];

  __global__ void convolve(float *image)
  {
      __shared__ float region[M];
      ...
      region[threadIdx] = image[i];
      __syncthreads();
      ...
      image[j] = result;
  }

  // Allocate GPU memory
  void *myimage = cudaMalloc(bytes)

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>>(myimage);

CUDA API
CUDA Function Type Qualifiers (1)

                                      executed on:   only callable from:
  __device__ float deviceFunc()       device         device
  __global__ void  kernelFunc()       device         host
  __host__   float hostFunc()         host           host

  __global__: defines a kernel function
    Must return void
  __device__ and __host__ can be used together

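A hedged sketch showing the three qualifiers side by side (function names are illustrative, not from the deck):

    __device__ float square(float x)          // runs on the device, callable from device code only
    {
        return x * x;
    }

    __global__ void squareAll(float* data, int n)   // kernel: runs on the device, launched from the host
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ __device__ float twice(float x)  // compiled for both host and device
    {
        return 2.0f * x;
    }
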
CUDA API
CUDA Function Type Qualifiers (2)

  __device__ functions cannot have their address taken
  For functions executed on the device:
    No recursion
    No static variable declarations inside the function
    No variable number of arguments

CUDA API
CUDA Variable Type Qualifiers

  __device__
    Resides in global memory space,
    Has the lifetime of an application,
    Is accessible from all the threads within the grid and from the host
    through the runtime library.
  __shared__ (optionally used together with __device__)
    Resides in the shared memory space of a thread block,
    Has the lifetime of the block,
    Is only accessible from all the threads within the block.
  Not covered here: __constant__ (optionally used together with __device__)
    Resides in constant memory space,
    Has the lifetime of an application,
    Is accessible from all the threads within the grid and from the host
    through the runtime library.

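A small sketch (not from the deck) of where each qualifier typically appears; the names, sizes, and a 128-thread block are assumptions:

    __device__   float globalTable[256];     // global memory, lives for the whole application
    __constant__ float coefficients[16];     // constant memory, read-only from kernels

    // Assumes a launch with 128 threads per block so threadIdx.x stays within tile[].
    __global__ void useQualifiers(float* out)
    {
        __shared__ float tile[128];          // shared memory, one copy per thread block
        tile[threadIdx.x] = coefficients[threadIdx.x % 16] * globalTable[threadIdx.x];
        __syncthreads();                     // make the block's writes to tile visible
        out[threadIdx.x + blockIdx.x * blockDim.x] = tile[threadIdx.x];
    }
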
CUDA API
Calling a Kernel Function – Execution Configuration

  A kernel function (== __global__ function) must be called with an execution
  configuration:

    __global__ void kernelFunc(...) {...};

    dim3 dimGrid(100, 50);        // 5000 thread blocks
    dim3 dimBlock(4, 8, 8);       // 256 threads per block
    size_t sharedMemBytes = 64;   // 64 bytes of shared memory
    kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);

  Any call to a kernel function is asynchronous from CUDA 1.0 on
  Explicit synchronization needed for blocking

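A hedged sketch of the explicit synchronization mentioned above, reusing the names from the Hello World slide; cudaThreadSynchronize() is the CUDA 2.x-era runtime call and is an addition, not part of the deck:

    // Sketch: blocking on an asynchronous kernel launch.
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);  // returns to the host immediately
    cudaThreadSynchronize();                  // block until the kernel has finished
    // Note: a cudaMemcpy() to/from the device also waits for preceding kernels.
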
Overview

  A high-level view on CUDA
  CUDA programming model
  CUDA memory model
  CUDA application programming interface
  Simple CUDA example

A Simple CUDA Example
Matrix Multiplication

  A simple matrix multiplication example that illustrates the basic features of
  memory and thread management in CUDA programs
    Leave shared memory usage until later
    Local, register usage
    Thread ID usage
    Memory data transfer API between host and device
    Assume square matrix for simplicity

Simple CUDA Example
Square Matrix Multiplication

  P = M * N of size WIDTH x WIDTH
  Here: without tiling!
    One thread calculates one element of P
    M and N are loaded WIDTH times from global memory

  [Figure: matrices M, N and P, each WIDTH x WIDTH]

Memory Layout of a Matrix in C

  Logical 2D view:

      M0,0 M1,0 M2,0 M3,0
      M0,1 M1,1 M2,1 M3,1
      M0,2 M1,2 M2,2 M3,2
      M0,3 M1,3 M2,3 M3,3

  Linear (row-major) layout of M in memory:

      M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

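With this row-major layout, element (x, y) of a WIDTH x WIDTH matrix stored in a flat array sits at index y * WIDTH + x. A tiny helper capturing that formula might look as follows (hypothetical, not used in the deck's code, which spells the index out inline):

    // Sketch: row-major index helper (name is illustrative).
    __host__ __device__ inline int idx(int x, int y, int width)
    {
        return y * width + x;   // row y, column x
    }
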
Step 1: Matrix Multiplication
A Simple Host Version in C

  // Matrix multiplication on the (CPU) host in double precision
  void matrixMulOnHost(float* M, float* N, float* P, int width)
  {
      for (int i = 0; i < width; ++i)
          for (int j = 0; j < width; ++j) {
              double sum = 0;
              for (int k = 0; k < width; ++k) {
                  double a = M[i * width + k];
                  double b = N[k * width + j];
                  sum += a * b;
              }
              P[i * width + j] = sum;
          }
  }

Step 2: Input Matrix Data Transfer
(Host-sided Code)

  void matrixMulOnDevice(float* M, float* N, float* P, int width)
  {
      int size = width * width * sizeof(float);
      float *Md, *Nd, *Pd;
      …

  1.  // Allocate and load M, N to device memory
      cudaMalloc(&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc(&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

      // Allocate P on the device
      cudaMalloc(&Pd, size);

Step 3: Output Matrix Data Transfer
(Host-sided Code)

  2.  // Kernel invocation code – to be shown later
      …

  3.  // Read P from the device
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

      // Free device matrices
      cudaFree(Md);
      cudaFree(Nd);
      cudaFree(Pd);
  }

Step 4: Kernel Function (1)

  // Matrix multiplication kernel – per thread code
  __global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width)
  {
      // Pvalue is used to store the element of the matrix
      // that is computed by the thread
      float Pvalue = 0;

      // see next page…

Step 4: Kernel Function (2)

      for (int k = 0; k < width; ++k) {
          float Melement = Md[k + threadIdx.y * width];
          float Nelement = Nd[threadIdx.x + k * width];
          Pvalue += Melement * Nelement;
      }

      int i = threadIdx.x + threadIdx.y * width;
      Pd[i] = Pvalue;
  }

  [Figure: Md, Nd and Pd; the element at (tx, ty) is computed by this thread]

Step 5: Kernel Invocation
(Host-sided Code)

  // Insert into step 2. from before
  // Setup the execution configuration
  dim3 dimGrid(1, 1);
  dim3 dimBlock(width, width);

  // Launch the device computation threads!
  matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

Example Far from Ideal
Only One Thread Block Used

  One block of threads computes matrix Pd
    Each thread computes one element of Pd
  Each thread
    Loads a row of matrix Md
    Loads a column of matrix Nd
    Performs one multiply and addition for each pair of Md and Nd elements
  Compute to off-chip memory access ratio close to 1:1 (not very high)
  Size of matrix limited by the number of threads allowed in a thread block

  [Figure: Grid 1 with a single Block 1; thread (2, 2) computes one element of Pd]

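The deck stops at the single-block version. As a hedged sketch (not from the slides), here is how the kernel and launch might change so that each block covers a TILE_WIDTH x TILE_WIDTH patch of Pd, lifting the one-block size limit; width is assumed to be a multiple of TILE_WIDTH:

    // Sketch: many blocks, one tile of Pd per block (no shared-memory tiling yet).
    #define TILE_WIDTH 16

    __global__ void matrixMulMultiBlock(float* Md, float* Nd, float* Pd, int width)
    {
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;

        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row * width + k] * Nd[k * width + col];
        Pd[row * width + col] = Pvalue;
    }

    // Launch: one block per tile instead of a single block for the whole matrix.
    // dim3 dimGrid(width / TILE_WIDTH, width / TILE_WIDTH);
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    // matrixMulMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
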
CUDA: A Bright Future?

  “Let your word be ‘Yes, yes’ or ‘No, no’; anything more than this comes from
  evil.” (Matthew 5:37)

NVIDIA CUDA: Appendix
Best Practices & Things to Watch Out For

Appendix
Best Practices & Things to Watch Out For

  Obtain relevant hardware data
  Compiling a CUDA program
  Linking
  Debugging
  C for CUDA vs. CUDA Driver API
  Watch out: floating point computations
  Unsupported C language elements
  Branching of code
  Coalesced access to device global memory
  Access patterns to avoid bank conflicts

Obtain Relevant Hardware Data

  Make sure to obtain relevant additional hardware data
  Call cudaGetDeviceProperties()

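A minimal sketch of such a query; the fields printed are a small subset of cudaDeviceProp:

    // Sketch: query the first device's properties and print a few fields.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // device 0
        printf("Name:                 %s\n",   prop.name);
        printf("Compute capability:   %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors:      %d\n",   prop.multiProcessorCount);
        printf("Shared mem per block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("Max threads / block:  %d\n",   prop.maxThreadsPerBlock);
        return 0;
    }
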
Compiling a CUDA Program (1)

  Parallel Thread eXecution (PTX)
    Virtual machine and ISA (Instruction Set Architecture)
    Programming model
    Execution resources and state

  [Diagram: NVCC compiles the C/C++ CUDA application (e.g. "float4 me = gx[gtid];
  me.x += me.y * me.z;") into CPU code plus virtual PTX code; a PTX-to-target
  compiler then emits physical target code for the GPU (G80, …), e.g.
  "ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;"]

Compiling a CUDA Program (2)
NVCC as a Compiler Driver

  Any source file containing CUDA language extensions must be compiled with NVCC
  NVCC is a compiler driver
    Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
  NVCC outputs:
    C code (host CPU code)
      Must then be compiled with the rest of the application using another tool
    PTX
      Object code directly
      Or, PTX source, interpreted at runtime

Linking

  Any executable with CUDA code requires two dynamic libraries:
    The CUDA runtime library (cudart)
    The CUDA core library (cuda)

Debugging Using the Device Emulation Mode

  An executable compiled in device emulation mode (enabled via nvcc -deviceemu)
  runs completely on the host using the CUDA runtime
    No need of any device and CUDA driver
    Each device thread is emulated with a host thread
  Running in device emulation mode, one can:
    Use host native debug support (breakpoints, inspection, etc.)
    Access any device-specific data from host code and vice versa
    Call any host function from device code (e.g. printf) and vice versa
    Detect deadlock situations caused by improper usage of __syncthreads()

Device Emulation Mode Pitfalls

  Emulated device threads execute sequentially, so simultaneous accesses of the
  same memory location by multiple threads could produce different results
  Dereferencing device pointers on the host or host pointers on the device can
  produce correct results in device emulation mode, but will generate an error
  in device execution mode

CUDA Driver API vs. C for CUDA (1)
Extended C

  [Diagram: same compilation flow as on the earlier "CUDA API / Extended C"
  slide: integrated source (foo.cu), cudacc (EDG C/C++ frontend, Open64 global
  optimizer), GPU assembly (foo.s) compiled by OCG into G80 SASS (foo.sass),
  CPU host code (foo.cpp) compiled with gcc / cl]

  Mark Murphy, “NVIDIA’s Experience with Open64”,
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA Driver API vs. C for CUDA (2)
Mutually Exclusive: Choose One or the Other

  The C runtime for CUDA handles kernel loading and setting kernels before they
  are launched. The implicit code initialization, CUDA context management, CUDA
  module management (cubin and function mapping), kernel configuration, and
  parameter passing are all performed by the C runtime for CUDA.
  It comprises two principal parts:
    The low-level functions (cuda_runtime_api.h) have a C-style interface that
    does not require compilation with nvcc.
    The high-level functions (cuda_runtime.h) have a C++-style interface built
    on top of the low-level functions.
  Of these, the high-level functions are the most commonly used. They wrap some
  of the low-level functions, using overloading, references, and default
  arguments. These wrappers can be used from C++ code and can be compiled with
  any C++ compiler.

  The driver API is a lower-level API than the runtime API. When compared with
  the runtime API, the driver API has these advantages:
    No dependency on the runtime library
    More control over devices (for example, only the driver API enables one CPU
    thread to control multiple GPUs)
    No C extensions in the host code, so compilers other than the default CPU
    compiler can be used
  Its primary disadvantages:
    Verbose code
    Greater difficulty in debugging
    No device emulation
  A key point is that for every runtime API function, there is an equivalent
  driver API function. The driver API, however, includes other functions missing
  in the runtime API, such as those for migrating a context from one host thread
  to another.

CUDA Driver API vs. C for CUDA (3)
Example: Vector Addition Using C for CUDA

  const unsigned int cnBlockSize = 512;
  const unsigned int cnBlocks    = 3;
  const unsigned int cnDimension = cnBlocks * cnBlockSize;

  // create CUDA device & context
  cudaSetDevice( 0 ); // pick first device

  // allocate host vectors
  float * pA = new float[cnDimension];
  float * pB = new float[cnDimension];
  float * pC = new float[cnDimension];

  // initialize host memory
  randomInit(pA, cnDimension);
  randomInit(pB, cnDimension);

  // allocate device memory
  float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
  cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
  cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
  cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));

  // copy host vectors to device
  cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

  vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

  // copy result from device to host
  cudaMemcpy((void *) pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

  delete[] pA;
  delete[] pB;
  delete[] pC;
  cudaFree(pDeviceMemA);
  cudaFree(pDeviceMemB);
  cudaFree(pDeviceMemC);

CUDA Driver API vs. C for CUDA (4)
Example: Vector Addition Using CUDA Driver API

  const unsigned int cnBlocks    = 3;
  const unsigned int cnDimension = cnBlocks * cnBlockSize;

  CUdevice   hDevice;
  CUcontext  hContext;
  CUmodule   hModule;
  CUfunction hFunction;

  // create CUDA device & context
  cuInit(0);
  cuDeviceGet(&hDevice, 0); // pick first device
  cuCtxCreate(&hContext, 0, hDevice);
  cuModuleLoad(&hModule, "vectorAdd.cubin");
  cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

  // allocate host vectors
  float * pA = new float[cnDimension];
  float * pB = new float[cnDimension];
  float * pC = new float[cnDimension];

  // initialize host memory
  randomInit(pA, cnDimension);
  randomInit(pB, cnDimension);

  // allocate memory on the device
  CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
  cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
  cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
  cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

  // copy host vectors to device
  cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
  cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

  // set up parameter values
  cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

  #define ALIGN_UP(offset, alignment) \
      (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

  int offset = 0;
  void* ptr;

  ptr = (void*)(size_t)pDeviceMemA;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  ptr = (void*)(size_t)pDeviceMemB;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  ptr = (void*)(size_t)pDeviceMemC;
  ALIGN_UP(offset, __alignof(ptr));
  cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
  offset += sizeof(ptr);

  cuParamSetSize(hFunction, offset);

  // execute kernel
  cuLaunchGrid(hFunction, cnBlocks, 1);

  // copy the result from device back to host
  cuMemcpyDtoH((void *) pC, pDeviceMemC, cnDimension * sizeof(float));

  delete[] pA;
  delete[] pB;
  delete[] pC;
  cuMemFree(pDeviceMemA);
  cuMemFree(pDeviceMemB);
  cuMemFree(pDeviceMemC);

Watch Out: Floating Point Computations
Differing Results of FP Computations

  Results of floating-point computations will slightly differ because of:
    Different compiler outputs, instruction sets
    Use of extended precision for intermediate results
  There are various options to force strict single precision on the host

Watch Out: Floating Point Computations
Single and Double Precision Operations

  Double precision
    No deviations from the IEEE 754 standard
  Single precision
    Denormals and signalling NaNs are not supported;
    Only two IEEE rounding modes are supported (chop and round-to-nearest even); and
    The precision of division/square root is slightly lower than single precision.

Limitations (1)
Only a Subset of C Available

  C for CUDA offers only a subset of the C language:
    Recursion-free
    Function-pointer-free
  Functions reside in the global device memory, therefore we cannot obtain
  their addresses

Limitations (2)
Branching in Program Code

  For best performance
    Threads should be running in groups of 32 threads (32 threads = 1 warp)
    All threads of a warp should take the same execution path
  Otherwise, branching will probably hurt

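A hedged sketch (not from the deck) contrasting a warp-divergent branch with one that keeps every warp on a single path; kernel names are illustrative:

    __global__ void divergentKernel(float* data)
    {
        // Threads within one warp take different paths: the warp executes
        // both branches one after the other (divergence).
        if (threadIdx.x % 2 == 0)
            data[threadIdx.x] *= 2.0f;
        else
            data[threadIdx.x] += 1.0f;
    }

    __global__ void uniformKernel(float* data)
    {
        // All 32 threads of a warp fall on the same side of the condition,
        // so no divergence occurs (the branch granularity is a whole warp).
        if ((threadIdx.x / 32) % 2 == 0)
            data[threadIdx.x] *= 2.0f;
        else
            data[threadIdx.x] += 1.0f;
    }
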
Coalesced Access to Device Global Memory

  High priority: ensure global memory accesses are coalesced whenever possible
  Global memory loads and stores by threads of a half warp (16 threads) are
  coalesced by the device in as few as one transaction (or two transactions in
  the case of 128-bit words)
  But: certain access requirements have to be met

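A hedged sketch of an access pattern that coalesces and one that typically does not; kernels and names are illustrative, one float element per thread:

    __global__ void coalescedCopy(float* out, const float* in)
    {
        // Consecutive threads of a half warp read consecutive 32-bit words:
        // the hardware can merge these into very few memory transactions.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    __global__ void stridedCopy(float* out, const float* in, int stride)
    {
        // Consecutive threads touch addresses that are 'stride' elements apart,
        // so the half warp's accesses fall into separate transactions.
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];
    }
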
Coalesced Access (Reading Floats)

  [Figure: coalesced access pattern when a half warp reads floats]

Uncoalesced Access (Reading Floats)

  [Figure: uncoalesced access patterns when a half warp reads floats]

Shared Memory – Bank Conflicts

  Shared memory
    16 KB
    Organized in 16 banks, 1 KB each
  Shared memory is as fast as a register …
    … if no bank conflicts occur!
  Bank conflict:
    More than one thread in the same half-warp accesses the same bank
    Access needs to be serialized
    Cost = max (# of concurrent accesses)

Shared Memory – No Bank Conflicts

  [Figure: conflict-free access patterns]
    Linear addressing, step size = 1 word
    Random permutation
    Linear addressing, step size = 3 words
    Broadcast

Shared Memory – Bank Conflicts

  [Figure: conflicting access patterns]
    Linear addressing, step size = 2 words
    Linear addressing, step size = 8 words
    No conflict or 5-way conflict

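A hedged sketch of the stride patterns behind these figures, for 32-bit data and 16 banks; the kernel and array names are illustrative:

    __global__ void strideExamples(float* out)
    {
        __shared__ float shared[512];

        // Stride 1: thread i of a half warp hits bank (i % 16) -> no conflict.
        float a = shared[threadIdx.x];

        // Stride 2: threads i and i+8 of the same half warp hit the same bank
        // -> 2-way bank conflict, the two accesses are serialized.
        float b = shared[2 * threadIdx.x];

        out[threadIdx.x] = a + b;
    }
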
