NVIDIA CUDA
The Compute Unified Device Architecture

Jens Rühmkorf
Tech Talk, DLR Köln-Porz, July 22nd 2009

References


  University of Illinois at Urbana-Champaign,
  Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009:
  http://courses.ece.illinois.edu/ece498/al/

  Website about General-Purpose Computation on Graphics Hardware:
  http://gpgpu.org/developer/cuda
  ACM Queue, Vol. 6 No. 2, March/April 2008 (Issue on GPGPU):
  http://mags.acm.org/queue/20080304/
  Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13:
  http://www.ddj.com/architect/207200659
  NVIDIA CUDA Best Practices Guide:
  http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
  Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online:
  http://developer.nvidia.com/object/gpu-gems-3.html

Multi- and Manycore Architectures
A Difficult Road Lies Ahead

  Don Knuth on Multicore Architectures
  “[…] my personal unhappiness with the current trend toward multicore
  architecture. To me, it looks more or less like the hardware designers
  have run out of ideas, and that they’re trying to pass the blame for the
  future demise of Moore’s Law to the software writers by giving us
  machines that work faster only on a few key benchmarks! I won’t be
  surprised at all if the whole multithreading idea turns out to be a flop”
  In: InformIT, April 25th 2008
  http://www.informit.com/articles/article.aspx?p=1193856




Overview


  A high level view on CUDA


  CUDA programming model


  CUDA memory model


  CUDA application programming interface


  Simple CUDA example




Multicore and Manycore (1)
Structural Differences




  Multicore: "yoke of oxen"
        Each core optimized for executing a single thread
  Manycore: "flock of chickens"
        Cores optimized for aggregate throughput,
        deemphasizing individual performance


Multicore and Manycore (2)
Technical Characteristics

 Specifications         Core i7 960              GTX285

 Processing             4 cores, 4-way SIMD      30 cores, 8-way SIMD
 Elements               @ 3.2 GHz                @ 1.5 GHz

 Resident Threads       4 cores, 2 threads,      30 cores, 32 SIMD vectors,
 (max)                  4-wide SIMD:             32-wide SIMD:
                        32 strands               30720 strands

 SP GFLOP/s             102                      1080

 Memory Bandwidth       25.6 GB/s                159 GB/s

 Register File          -                        1.875 MB

 Local Store            -                        480 kB
Multicore and Manycore (3)
Performance Comparison: CPU vs. GPU




                              Chart: CPU vs. GPU
        (y-axis: single-precision floating point operations per second)


An Example of the Physical Reality Behind CUDA

  Diagram: a GPU with its own local DRAM (the device), connected to the CPU (the host)

CUDA Processing Flow




CUDA in a Nutshell
Key Characteristics

  CUDA is designed for wide SIMD/SPMD parallelism & scalability
  CUDA provides 3 key abstractions, i.e. a hierarchy:
      of thread groups,
      of shared memories, and
      of barrier synchronization
  CUDA programs are written in C + extensions
  OpenCL is inspired by CUDA, but HW & SW vendor neutral
      Programming model essentially identical




Hello World


// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}


int main() {
  // Run N/256 blocks of 256 threads each
  vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
                                       hello-world.cu

Overview


  A high level view on CUDA


  CUDA programming model


  CUDA memory model


  CUDA application programming interface


  Simple CUDA example




CUDA Programming Model
  Structure of a CUDA application

      Integrated host + device application C program
            Serial or modestly parallel parts in host C code
            Highly parallel parts in device SPMD kernel C code

       Serial Code (host)


    Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);                                 ...


       Serial Code (host)


    Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);                                 ...

CUDA Programming Model
CUDA Devices and Threads

  A compute device
       Is a coprocessor to the CPU or host
       Has its own DRAM (device memory)
       Runs many threads in parallel
       Is typically a GPU but can also be another type of parallel
       processing device
  Express data-parallel portions as device kernels
  (which run on many threads)




CUDA Programming Model
Arrays of Parallel Threads

  Execute a Kernel by specifying arrays of threads
  All threads run the same code (SPMD)
  Use thread-ID to compute memory addresses & make control decisions

            threadID     0 1 2 3 4 5 6 7




                       …
                       float x = input[threadID];
                       float y = func(x);
                       output[threadID] = y;
                       …




CUDA Programming Model
  Use Thread Blocks for (Scalable) Cooperation
      Divide monolithic thread array into multiple blocks
      Threads within a block can cooperate via
               shared memory,
               atomic operations, and
               barrier synchronization (see the sketch below)
      Threads in different blocks cannot cooperate

            Diagram: the thread array is split into Thread Block 0, Thread Block 1, ..., Thread Block N - 1;
            each block holds threads with threadID 0..7, and every thread runs the same body:

                       float x = input[threadID];
                       float y = func(x);
                       output[threadID] = y;
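  A minimal sketch of intra-block cooperation through shared memory and barrier
  synchronization (the kernel and its task, reversing the elements handled by each
  block, are illustrative assumptions and not taken from the slides):

      __global__ void reverseBlock(float* data) {
          __shared__ float buf[256];                   // visible to all threads of this block
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          buf[threadIdx.x] = data[i];                  // stage the block's elements
          __syncthreads();                             // barrier: wait until every thread has written
          data[i] = buf[blockDim.x - 1 - threadIdx.x]; // read a value written by another thread
      }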




CUDA Programming Model
Organisation of Thread Blocks

   Thread Blocks can be one-, two- or
   three-dimensional arrays
   The host issues a sequence of
   kernel invocations (kernel 1,
   kernel 2) to the device
   Each kernel is executed as a batch
   of threads
   This batch is organized as a grid of
   thread blocks




                                          2-dimensional thread blocks

CUDA Programming Model
Block IDs and Thread IDs
  Each thread uses IDs to decide what
  data to work on
        Block ID: 1D, 2D, or 3D
        Thread ID: 1D, 2D, or 3D
  Simplifies memory
  addressing when processing
  multidimensional data
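
  For example, a kernel working on a 2D array typically combines block and thread IDs
  into a global (row, column) index. A minimal sketch (the kernel name and parameters
  are illustrative, not from the slides):

      __global__ void scale2D(float* data, int width, int height, float factor) {
          int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x index
          int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y index
          if (col < width && row < height)                   // guard against partial blocks
              data[row * width + col] *= factor;
      }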




Overview


  A high level view on CUDA


  CUDA programming model


  CUDA memory model


  CUDA application programming interface


  Simple CUDA example




CUDA Memory Model
Overview
  Global memory
        Main means of communicating R/W data between host and device
        Contents visible to all threads
        Long latency access
  We will focus on global memory for now
  Constant and texture memory will not be covered here

  Diagram: a grid of thread blocks; each block has its own shared memory and per-thread
  registers, and all threads of the grid access the device's global, constant and texture
  memory, which the host can also read and write

CUDA Memory Model
CUDA Device Memory Allocation
  cudaMalloc()
        Allocates an object in the device global memory
        Requires two parameters:
              Address of a pointer to the allocated object
              Size of the allocated object
  cudaFree()
        Frees an object from the device global memory
        Requires the pointer to the freed object
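
  A minimal sketch of the allocate/free pattern (array size and name are illustrative,
  not from the slides):

      float* Md = NULL;
      int size = 256 * 256 * sizeof(float);    // a 256 x 256 matrix of floats
      cudaMalloc((void**)&Md, size);           // pass the *address* of the pointer
      // ... use Md in kernel launches ...
      cudaFree(Md);                            // release the device allocation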




CUDA Memory Model
CUDA Host-Device Data Transfer

  cudaMemcpy()
        Memory data transfer
        Requires four parameters:
              Pointer to destination
              Pointer to source
              Number of bytes to copy
              Type of transfer
  Type of transfer is one of:
        Host to Host
        Host to Device
        Device to Host
        Device to Device
  Asynchronous transfer
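
  A minimal sketch of a host-device round trip (buffer name and size are illustrative,
  not from the slides):

      float h_data[256];                                   // host buffer
      float* d_data = NULL;
      cudaMalloc((void**)&d_data, sizeof(h_data));
      cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);  // host -> device
      // ... launch kernels that operate on d_data ...
      cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);  // device -> host
      cudaFree(d_data);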


Overview


  A high level view on CUDA


  CUDA programming model


  CUDA memory model


  CUDA application programming interface


  Simple CUDA example




CUDA API
Extended C
  Compilation flow for an integrated source file (foo.cu):
        cudacc (EDG C/C++ frontend, Open64 Global Optimizer) splits it into
              GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass)
              CPU host code (foo.cpp), compiled by gcc / cl

  Mark Murphy, "NVIDIA's Experience with Open64",
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
CUDA API
C for CUDA
  Function type specifiers
          __global__, __device__, __host__
  Variable type specifiers
          __device__, __shared__, __constant__
  Keywords
          threadIdx, blockIdx
  Intrinsics / builtin functions
          __syncthreads()
  Runtime API
          Memory, symbol, execution management
  Function launch

                                      image-convolution.cu
      __device__ float filter[N];

      __global__ void convolve(float *image) {
          __shared__ float region[M];
          ...
          region[threadIdx.x] = image[i];
          __syncthreads();
          ...
          image[j] = result;
      }

      // Allocate GPU memory
      float *myimage;
      cudaMalloc((void**)&myimage, bytes);

      // 100 blocks, 10 threads per block
      convolve<<<100, 10>>>(myimage);
CUDA API
CUDA Function Type Qualifiers (1)

                                         executed on:    only callable from:
 __device__ float deviceFunc()              device             device
 __global__ void  kernelFunc()              device             host
 __host__   float hostFunc()                host               host


  __global__:
       defines a kernel function
       must return void
  __device__ and __host__ can be used together
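
  A minimal sketch of the three qualifiers side by side (function names are
  illustrative, not from the slides):

      __device__ float square(float x) { return x * x; }              // device-only helper

      __host__ __device__ float twice(float x) { return 2.0f * x; }   // compiled for host and device

      __global__ void kernelFunc(float* out) {                        // kernel, launched from the host
          out[threadIdx.x] = twice(square((float)threadIdx.x));
      }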




CUDA API
CUDA Function Type Qualifiers (2)
   __device__ functions cannot have their address taken
  For functions executed on the device:
        no recursion
        no static variable declarations inside the function
        no variable number of arguments




CUDA API
CUDA Variable Type Qualifiers
  __device__
        Resides in global memory space,
        Has the lifetime of an application,
        Is accessible from all the threads within the grid and from the host through
        the runtime library.
  __shared__ (optionally used together with __device__)
        Resides in the shared memory space of a thread block,
        Has the lifetime of the block,
        Is only accessible from all the threads within the block.
  Not covered here:
  __constant__ (optionally used together with __device__)
        Resides in constant memory space,
        Has the lifetime of an application,
        Is accessible from all the threads within the grid and from the host through
        the runtime library.
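
  A minimal sketch of these variable qualifiers in context (names and sizes are
  illustrative, not from the slides):

      __device__   float scaleFactor;          // global memory, lifetime of the application
      __constant__ float coefficients[16];     // constant memory, readable by all threads

      __global__ void blur(float* image) {
          __shared__ float tile[256];          // shared memory, lifetime of the block
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          tile[threadIdx.x] = image[i];        // assumes blockDim.x <= 256
          __syncthreads();
          // ... operate on the tile ...
      }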


CUDA API
Calling a Kernel Function – Execution Configuration
  A kernel function ( == __global__ function) must be called with an
  execution configuration:

__global__ void kernelFunc(...) {...};
dim3   dimGrid(100, 50);    // 5000 thread blocks
dim3   dimBlock(4, 8, 8);   // 256 threads per block
size_t sharedMemBytes = 64; // 64 bytes of shared memory
kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);


  Any call to a kernel function is asynchronous from CUDA 1.0 on
  Explicit synchronization needed for blocking
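
  A minimal sketch of blocking until a launch has finished (the kernel is illustrative;
  cudaThreadSynchronize is the CUDA 2.x runtime call, later superseded by
  cudaDeviceSynchronize):

      __global__ void incr(float* data) { data[threadIdx.x] += 1.0f; }

      void launchAndWait(float* d_data) {
          incr<<<1, 256>>>(d_data);       // launch returns immediately (asynchronous)
          cudaThreadSynchronize();        // block the host until all device work is complete
      }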



Overview


  A high level view on CUDA


  CUDA programming model


  CUDA memory model


  CUDA application programming interface


  Simple CUDA example




A Simple CUDA Example
Matrix Multiplication

  A simple matrix multiplication example that illustrates the basic features
  of memory and thread management in CUDA programs
      Leave shared memory usage until later
      Local, register usage
      Thread ID usage
      Memory data transfer API between host and device
  Assume square matrix for simplicity




Simple CUDA Example
Square Matrix Multiplication

    P = M * N of size WIDTH x WIDTH
    Here: without tiling!
    One thread calculates one element of P
    M and N are loaded WIDTH times from global memory

    Diagram: matrices M, N and the result P, each of size WIDTH x WIDTH




Memory Layout of a Matrix in C


                              M0,0 M1,0 M2,0 M3,0

                              M0,1 M1,1 M2,1 M3,1

                              M0,2 M1,2 M2,2 M3,2

                              M0,3 M1,3 M2,3 M3,3

M

M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3




Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void matrixMulOnHost(float* M, float* N, float* P, int width) {
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            double sum = 0;
            for (int k = 0; k < width; ++k) {
                double a = M[i * width + k];
                double b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Diagram: element P[i][j] is the dot product of row i of M and column j of N




Step 2: Input Matrix Data Transfer
(Host-sided Code)
void matrixMulOnDevice(float* M, float* N, float* P, int width) {
     int size = width * width * sizeof(float);
     float *Md, *Nd, *Pd;
     …
1.   // Allocate and load M, N to device memory
     cudaMalloc((void**)&Md, size);
     cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

     cudaMalloc((void**)&Nd, size);
     cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

     // Allocate P on the device
     cudaMalloc((void**)&Pd, size);



Step 3: Output Matrix Data Transfer
(Host-sided Code)
2.   // Kernel invocation code – to be shown later
     …


3.   // Read P from the device
     cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);


     // Free device matrices
     cudaFree(Md);
     cudaFree(Nd);
     cudaFree(Pd);
}




Step 4: Kernel Function (1)


// Matrix multiplication kernel – per thread code
__global__ void matrixMulKernel(float* Md, float* Nd,
   float* Pd, int width) {


    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;


    // see next page…




Step 4: Kernel Function (2)

    for (int k = 0; k < width; ++k) {
        float Melement = Md[k + threadIdx.y * width];
        float Nelement = Nd[threadIdx.x + k * width];
        Pvalue += Melement * Nelement;
    }

    // Write the computed element of Pd back to global memory
    {
        int i = threadIdx.x + threadIdx.y * width;
        Pd[i] = Pvalue;
    }
}

Diagram: thread (tx, ty) reads row ty of Md and column tx of Nd to produce element (tx, ty) of Pd




Step 5: Kernel Invocation
(Host-sided Code)
 // Insert into step 2. from before
 // Setup the execution configuration
 dim3 dimGrid(1, 1);
 dim3 dimBlock(width, width);


 // Launch the device computation threads!
 matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd,
   width);




Example Far from Ideal
Only One Thread Block Used
  One block of threads computes matrix Pd
        Each thread computes one element of Pd
  Each thread
        Loads a row of matrix Md
        Loads a column of matrix Nd
        Performs one multiply and one addition for each pair of Md and Nd elements
        Compute to off-chip memory access ratio close to 1:1 (not very high)
  Size of the matrix is limited by the number of threads allowed in a thread block

  Diagram: Grid 1 contains the single Block 1; thread (2, 2) combines one row of Md with
  one column of Nd to produce one element of Pd
CUDA: A Bright Future?


                         "Let your word be 'Yes, yes; no, no';
                         anything more than this comes from evil."
                                                 Matthew 5:37




NVIDIA CUDA: Appendix
Best Practices & Things to Watch Out For




Appendix
Best Practices & Things to Watch Out For

  Obtain relevant hardware data
  Compiling a CUDA program
  Linking
  Debugging
  C for CUDA vs. CUDA Driver API
  Watch out: floating point computations
  Unsupported C language elements
  Branching of code
  Coalesced access to device global memory
  Access patterns to avoid bank conflicts




Obtain Relevant Hardware Data


  Make sure to obtain relevant additional hardware data
  Call cudaGetDeviceProperties()
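
  A minimal sketch of such a query (the printed fields are a small, illustrative
  selection of the cudaDeviceProp structure):

      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);             // properties of device 0
          printf("%s: %d multiprocessors, %lu bytes of shared memory per block\n",
                 prop.name, prop.multiProcessorCount,
                 (unsigned long)prop.sharedMemPerBlock);
          return 0;
      }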




Compiling a CUDA Program (1)
  Parallel Thread eXecution (PTX)
        Virtual machine and ISA (Instruction Set Architecture)
        Programming model
        Execution resources and state

  Compilation flow:
        C/C++ CUDA application --NVCC--> CPU code + PTX code                 (virtual)
        PTX code --PTX-to-target compiler--> target code for G80, ... GPUs   (physical)

  Example: the C for CUDA statements
        float4 me = gx[gtid];
        me.x += me.y * me.z;
  compile to the PTX instructions
        ld.global.v4.f32   {$f1,$f3,$f5,$f7}, [$r9+0];
        mad.f32            $f1, $f5, $f3, $f1;

Compiling a CUDA Program (2)
NVCC as a Compiler Driver

  Any source file containing CUDA language extensions must be compiled
  with NVCC
  NVCC is a compiler driver
      Works by invoking all the necessary tools and compilers like
      cudacc, g++, cl, ...
  NVCC outputs:
      C code (host CPU Code)
           Must then be compiled with the rest of the application using
           another tool
       PTX
             Object code directly
             Or, PTX source, interpreted at runtime


Linking


  Any executable with CUDA code requires two dynamic libraries:
       The CUDA runtime library (cudart)
       The CUDA core library (cuda)




Debugging Using the Device Emulation Mode


  An executable compiled in device emulation mode (enabled via nvcc
  -deviceemu) runs completely on the host using the CUDA runtime
       No need for any device or the CUDA driver
      Each device thread is emulated with a host thread
  Running in device emulation mode, one can:
      Use host native debug support (breakpoints, inspection, etc.)
      Access any device-specific data from host code and vice-versa
      Call any host function from device code (e.g. printf) and vice-
      versa
      Detect deadlock situations caused by improper usage of
      __syncthreads()




Device Emulation Mode Pitfalls


  Emulated device threads execute sequentially, so simultaneous
  accesses of the same memory location by multiple threads could
  produce different results.
  Dereferencing device pointers on the host or host pointers on the device
  can produce correct results in device emulation mode, but will generate
  an error in device execution mode




CUDA Driver API vs. C for CUDA (1)
Extended C
  Compilation flow for an integrated source file (foo.cu):
        cudacc (EDG C/C++ frontend, Open64 Global Optimizer) splits it into
              GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass)
              CPU host code (foo.cpp), compiled by gcc / cl

  Mark Murphy, "NVIDIA's Experience with Open64",
  www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
CUDA Driver API vs. C for CUDA (2)
Mutually Exclusive: Choose One or the Other
  The C runtime for CUDA handles kernel loading and setting up kernels before they are
  launched. The implicit code initialization, CUDA context management, CUDA module
  management (cubin and function mapping), kernel configuration, and parameter passing
  are all performed by the C runtime for CUDA.
  It comprises two principal parts:
        The low-level functions (cuda_runtime_api.h) have a C-style interface that does
        not require compilation with nvcc.
        The high-level functions (cuda_runtime.h) have a C++-style interface built on
        top of the low-level functions.
  Of these, the high-level functions are the most commonly used. They wrap some of the
  low-level functions, using overloading, references, and default arguments. These
  wrappers can be used from C++ code and can be compiled with any C++ compiler.

  The driver API is a lower-level API than the runtime API. When compared with the
  runtime API, the driver API has these advantages:
        No dependency on the runtime library
        More control over devices (for example, only the driver API enables one CPU
        thread to control multiple GPUs)
        No C extensions in the host code, so compilers other than the default CPU
        compiler can be used
  Its primary disadvantages:
        Verbose code
        Greater difficulty in debugging
        No device emulation
  A key point is that for every runtime API function, there is an equivalent driver API
  function. The driver API, however, includes other functions missing in the runtime
  API, such as those for migrating a context from one host thread to another.




CUDA Driver API vs. C for CUDA (3)
Example: Vector Addition Using C for CUDA
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;

// create CUDA device & context
cudaSetDevice( 0 ); // pick first device

// allocate host vectors
float * pA = new float[cnDimension];
float * pB = new float[cnDimension];
float * pC = new float[cnDimension];

// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate device memory
float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

// copy result from device to host
cudaMemcpy((void *) pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

delete[] pA;
delete[] pB;
delete[] pC;
cudaFree(pDeviceMemA);
cudaFree(pDeviceMemB);
cudaFree(pDeviceMemC);


CUDA Driver API vs. C for CUDA (4)
Example: Vector Addition Using CUDA Driver API
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;

// create CUDA device & context
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

// allocate host vectors
float * pA = new float[cnDimension];
float * pB = new float[cnDimension];
float * pC = new float[cnDimension];

// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate memory on the device
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

// set up parameter values
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

int offset = 0;
void* ptr;
ptr = (void*)(size_t)pDeviceMemA;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemB;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemC;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
cuParamSetSize(hFunction, offset);

// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);

// copy the result from device back to host
cuMemcpyDtoH((void *) pC, pDeviceMemC, cnDimension * sizeof(float));

delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);




Watch Out: Floating Point Computations
Differing Results of FP Computations

  Results of floating-point computations will slightly differ because of:
       Different compiler outputs, instruction sets
       Use of extended precision for intermediate results
              There are various options to force strict single precision on
              the host




Watch Out: Floating Point Computations
Single and Double Precision Operations

  Double precision
       No deviations from the IEEE 754 standard
  Single precision
       Denormals and signalling NaNs are not supported;
       Only two IEEE rounding modes are supported (chop and round-to-
       nearest even); and
        Division and square root are slightly less accurate than IEEE-compliant
        single precision.




Limitations (1)
Only a Subset of C Available

  C for CUDA offers only a subset of the C language:
        Recursion-free
        Function-pointer-free
  Functions reside in the global device memory, therefore we cannot
  obtain their addresses




Limitations (2)
Branching in Programm Code

  For best performance
         Threads should run in groups of 32 threads
         32 threads = 1 warp
  All threads of a warp should take the same execution path
  Otherwise, branching (warp divergence) will probably hurt performance
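
  A minimal sketch of the difference (kernel names are illustrative, not from the
  slides):

      __global__ void divergent(float* data) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (threadIdx.x % 2 == 0)      // neighbouring threads of one warp take different
              data[i] *= 2.0f;           // paths: the two halves of the warp run serially
          else
              data[i] += 1.0f;
      }

      __global__ void uniform(float* data) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (blockIdx.x % 2 == 0)       // whole blocks (and hence whole warps) branch
              data[i] *= 2.0f;           // the same way: no divergence penalty
          else
              data[i] += 1.0f;
      }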




Coalesced Access to Device Global Memory


  High Priority: Ensure global memory accesses are coalesced whenever
  possible
  Global memory loads and stores by threads of a half warp (16 threads)
  are coalesced by the device in as few as one transaction (or two
  transactions in the case of 128-bit words)
  But: certain access requirements have to be met
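
  A minimal sketch of an access pattern that coalesces and one that does not (kernel
  names are illustrative, not from the slides):

      __global__ void coalesced(float* in, float* out) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          out[i] = in[i];                // consecutive threads touch consecutive 32-bit words
      }

      __global__ void strided(float* in, float* out, int stride) {
          int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
          out[i] = in[i];                // stride > 1 scatters the half-warp across segments
      }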




Coalesced Access (Reading Floats)




Uncoalesced Access (Reading Floats)




Shared Memory – Bank Conflicts


  Shared Memory
       16 KB
       Organized in 16 Banks, 1 KB each
  Shared Memory
       As fast as a register …
       … if no bank conflicts occur!
  Bank conflict:
       More than one thread in the same half-warp accesses the same bank
             Access needs to be serialized
             Cost = max (# of concurrent accesses to one bank)
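
  A minimal sketch of a conflict-free and a conflicting access pattern (with 16 banks of
  32-bit words; the kernel is illustrative, not from the slides):

      __global__ void sharedAccess(float* out) {
          __shared__ float buf[256];
          buf[threadIdx.x] = (float)threadIdx.x;     // stride 1: each thread of a half-warp
          __syncthreads();                           // hits its own bank (conflict-free)

          float a = buf[threadIdx.x];                // linear addressing, step size 1
          float b = buf[(threadIdx.x * 2) % 256];    // step size 2: threads t and t+8 share
          out[threadIdx.x] = a + b;                  // a bank -> 2-way conflict, serialized
      }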




Shared Memory – No Bank Conflicts




 Linear addressing   Random        Linear addressing   Broadcast
 Step size = 1       Permutation   Step size = 3
 Word                              Words


Shared Memory – Bank Conflicts




        Linear addressing   Linear addressing   No conflict
        Step size = 2       Step size = 8       or 5-way
        words               words               conflict




 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 

Recently uploaded (20)

Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptxMan or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
Man or Manufactured_ Redefining Humanity Through Biopunk Narratives.pptx
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Mental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young mindsMental Health Awareness - a toolkit for supporting young minds
Mental Health Awareness - a toolkit for supporting young minds
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17How to Fix XML SyntaxError in Odoo the 17
How to Fix XML SyntaxError in Odoo the 17
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 

Tech Talk NVIDIA CUDA

  • 1. NVIDIA CUDA The Compute Unified Device Architecture Jens Rühmkorf Tech Talk, DLR Köln-Porz, July 22nd 2009 Slide 1 CUDA > J. Rühmkorf > July 22nd 2009
  • 2. References University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/ Website about General-Purpose Computation on Graphics Hardware: http://gpgpu.org/developer/cuda ACM Queue, Vol. 6 No. 2, March/April 2008 (Issue on GPGPU): http://mags.acm.org/queue/20080304/ Dr. Dobb‘s: CUDA, Supercomputing for the Masses, Part 1-13: http://www.ddj.com/architect/207200659 NVIDIA CUDA Best Practices Guide http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVI DIA_CUDA_BestPracticesGuide_2.3.pdf Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online: http://developer.nvidia.com/object/gpu-gems-3.html Slide 2 CUDA > J. Rühmkorf > July 22nd 2009
  • 3. Multi- and Manycore Architectures A Difficult Road Lies Ahead Don Knuth on Multicore Architectures “[…] my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop” In: InformIT, April 25th 2008 http://www.informit.com/articles/article.aspx?p=1193856 Slide 3 CUDA > J. Rühmkorf > July 22nd 2009
  • 4. Overview A high level view on CUDA CUDA programming model CUDA memory model CUDA application programming interface Simple CUDA example Slide 4 CUDA > J. Rühmkorf > July 22nd 2009
  • 5. Multicore and Manycore (1) Structural Differences Multicore Manycore Multicore: yoke of oxen Each core optimized for executing a single thread Manycore: flock of chickens Cores optimized for aggregate throughput, deemphasizing individual performance Slide 5 CUDA > J. Rühmkorf > July 22nd 2009
  • 6. Multicore and Manycore (2) Technical Characteristics
      Specification            | Core i7 960                                    | GTX285
      Processing elements      | 4 cores, 4-way SIMD @ 3.2 GHz                  | 30 cores, 8-way SIMD @ 1.5 GHz
      Resident threads (max)   | 4 cores, 2 threads, 4-width SIMD: 32 strands   | 30 cores, 32 SIMD vectors, 32-width SIMD: 30720 strands
      SP GFLOP/s               | 102                                            | 1080
      Memory bandwidth         | 25.6 GB/s                                      | 159 GB/s
      Register file            | -                                              | 1.875 MB
      Local store              | -                                              | 480 kB
  • 7. Multicore and Manycore (3) Performance Comparison: CPU vs. GPU (chart: floating-point operations per second, single precision)
  • 8. An Example of the Physical Reality Behind CUDA: CPU (host) and GPU with local DRAM (device)
  • 9. CUDA Processing Flow
  • 10. CUDA in a Nutshell (Key Characteristics): CUDA is designed for wide SIMD/SPMD parallelism and scalability. CUDA provides three key abstractions, organized as a hierarchy: thread groups, shared memories, and barrier synchronization. CUDA programs are written in C plus extensions. OpenCL is inspired by CUDA but is HW & SW vendor neutral; its programming model is essentially identical.
  • 11. Hello World (hello-world.cu):
        // Compute vector sum C = A+B
        // Each thread performs one pair-wise addition
        __global__ void vecAdd(float* A, float* B, float* C) {
            int i = threadIdx.x + blockDim.x * blockIdx.x;
            C[i] = A[i] + B[i];
        }
        int main() {
            // Run N/256 blocks of 256 threads each
            vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
        }
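    The slide's main() assumes d_A, d_B and d_C already live in device memory. A minimal host-side sketch around this kernel (my addition, not from the slides; the host arrays h_A, h_B, h_C and the fixed size N are assumptions for illustration):
        #include <cuda_runtime.h>
        const int N = 1024;
        int main() {
            float h_A[N], h_B[N], h_C[N];              // host arrays (initialized elsewhere)
            float *d_A, *d_B, *d_C;                    // device pointers
            cudaMalloc((void**)&d_A, N * sizeof(float));
            cudaMalloc((void**)&d_B, N * sizeof(float));
            cudaMalloc((void**)&d_C, N * sizeof(float));
            cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);
            vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);   // N/256 blocks of 256 threads
            cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
            return 0;
        }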
  • 12. Overview: A high level view on CUDA; CUDA programming model; CUDA memory model; CUDA application programming interface; Simple CUDA example
  • 13. CUDA Programming Model: Structure of a CUDA application. An integrated host + device application C program: serial or modestly parallel parts in host C code, highly parallel parts in device SPMD kernel C code. Execution alternates: Serial Code (host), then Parallel Kernel (device) KernelA<<< nBlk, nTid >>>(args);, then Serial Code (host), then Parallel Kernel (device) KernelB<<< nBlk, nTid >>>(args);
  • 14. CUDA Programming Model: CUDA Devices and Threads. A compute device is a coprocessor to the CPU or host, has its own DRAM (device memory), runs many threads in parallel, and is typically a GPU but can also be another type of parallel processing device. Express data-parallel portions as device kernels (which run on many threads).
  • 15. CUDA Programming Model: Arrays of Parallel Threads. Execute a kernel by specifying arrays of threads; all threads run the same code (SPMD); use the thread ID to compute memory addresses and make control decisions. Per-thread code (thread IDs 0, 1, 2, ... 7):
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
  • 16. CUDA Programming Model: Use Thread Blocks for (Scalable) Cooperation. Divide the monolithic thread array into multiple blocks. Threads within a block can cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate. Each block (Thread Block 0 ... Thread Block N-1, threads 0..7 each) runs the same per-thread code:
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
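    A minimal sketch of block-level cooperation (my addition, not from the slides): each block stages its chunk of the input in shared memory, waits at a barrier, then writes the chunk back reversed. BLOCK_SIZE is an assumed compile-time constant equal to the launch's threads per block.
        #define BLOCK_SIZE 256
        __global__ void reverseChunks(float* in, float* out) {
            __shared__ float buf[BLOCK_SIZE];              // one copy per thread block
            int gid = blockIdx.x * blockDim.x + threadIdx.x;
            buf[threadIdx.x] = in[gid];                    // stage this block's chunk
            __syncthreads();                               // wait until the whole block has written
            out[gid] = buf[blockDim.x - 1 - threadIdx.x];  // read it back reversed
        }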
  • 17. CUDA Programming Model: Organisation of Thread Blocks. Thread blocks can be one-, two- or three-dimensional arrays. The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device; each kernel is executed as a batch of threads, and this batch is organized as a grid of thread blocks (figure: 2-dimensional thread blocks).
  • 18. CUDA Programming Model: Block IDs and Thread IDs. Each thread uses its IDs to decide what data to work on (Block ID: 1D, 2D, or 3D; Thread ID: 1D, 2D, or 3D). This simplifies memory addressing when processing multidimensional data.
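    A sketch of the usual 2D indexing idiom for image-like data (my addition; the kernel name, width, height and scale factor are illustrative assumptions):
        __global__ void scalePixels(float* img, int width, int height, float s) {
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            if (col < width && row < height)   // guard against partially filled blocks
                img[row * width + col] *= s;
        }
        // launched e.g. with dim3 block(16, 16); dim3 grid((width+15)/16, (height+15)/16);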
  • 19. Overview: A high level view on CUDA; CUDA programming model; CUDA memory model; CUDA application programming interface; Simple CUDA example
  • 20. CUDA Memory Model: Overview. Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads; access has long latency. We will focus on global memory for now; constant and texture memory will not be covered here. (Diagram: a grid of blocks, each block with shared memory and per-thread registers; the host accesses global, constant, and texture memory.)
  • 21. CUDA Memory Model: CUDA Device Memory Allocation. cudaMalloc() allocates an object in device global memory; it requires two parameters: the address of a pointer to the allocated object and the size of the allocated object. cudaFree() frees an object from device global memory; its parameter is the pointer to the freed object.
  • 22. CUDA Memory Model: CUDA Host-Device Data Transfer. cudaMemcpy() performs memory data transfer and requires four parameters: pointer to destination, pointer to source, number of bytes to copy, and type of transfer. The type of transfer is one of: host to host, host to device, device to host, device to device. Asynchronous transfer is also available.
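    The slide only mentions asynchronous transfer in passing; a minimal sketch of what it looks like with the runtime API (my addition; h_data is assumed to be page-locked host memory from cudaMallocHost, d_data a device allocation, nbytes their size):
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // the asynchronous copy returns immediately; the host may overlap other work
        cudaMemcpyAsync(d_data, h_data, nbytes, cudaMemcpyHostToDevice, stream);
        // ... launch kernels into the same stream, do independent CPU work ...
        cudaStreamSynchronize(stream);   // block until all work queued in the stream is done
        cudaStreamDestroy(stream);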
  • 23. Overview: A high level view on CUDA; CUDA programming model; CUDA memory model; CUDA application programming interface; Simple CUDA example
  • 24. CUDA API: Extended C. Compilation flow for an integrated source file (foo.cu): cudacc (EDG C/C++ frontend, Open64 global optimizer) splits it into GPU assembly (foo.s) and CPU host code (foo.cpp); OCG produces G80 SASS (foo.sass), and gcc / cl compiles the host code. (See Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc)
  • 25. CUDA API: C for CUDA. Function type specifiers: __global__, __device__, __host__. Variable type specifiers: __device__, __shared__, __constant__. Keywords: threadIdx, blockIdx. Intrinsics / builtin functions: __syncthreads(). Runtime API: memory, symbol, and execution management; function launch. Example (image-convolution.cu):
        __device__ float filter[N];
        __global__ void convolve(float *image) {
            __shared__ float region[M];
            ...
            region[threadIdx.x] = image[i];
            __syncthreads();
            ...
            image[j] = result;
        }
        // Allocate GPU memory
        void *myimage; cudaMalloc(&myimage, bytes);
        // 100 blocks, 10 threads per block
        convolve<<<100, 10>>>((float*)myimage);
  • 26. CUDA API: CUDA Function Type Qualifiers (1).
        __device__ float deviceFunc()   executed on: device   only callable from: device
        __global__ void  kernelFunc()   executed on: device   only callable from: host
        __host__   float hostFunc()     executed on: host     only callable from: host
    __global__ defines a kernel function and must return void. __device__ and __host__ can be used together.
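    A small sketch showing the three qualifiers together (my addition; function names are illustrative). The __host__ __device__ combination compiles the same function for both sides:
        __host__ __device__ float square(float x) { return x * x; }             // usable on host and device
        __device__ float deviceOnlyScale(float x) { return 2.0f * square(x); }  // device only
        __global__ void kernelFunc(float* data) {                               // kernel: returns void, launched from host
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            data[i] = deviceOnlyScale(data[i]);
        }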
  • 27. CUDA API: CUDA Function Type Qualifiers (2). __device__ functions cannot have their address taken. For functions executed on the device: no recursion, no static variable declarations inside the function, no variable number of arguments.
  • 28. CUDA API: CUDA Variable Type Qualifiers. __device__: resides in global memory space, has the lifetime of an application, and is accessible from all threads within the grid and from the host through the runtime library. __shared__ (optionally used together with __device__): resides in the shared memory space of a thread block, has the lifetime of the block, and is only accessible from the threads within the block. Not covered here: __constant__ (optionally used together with __device__): resides in constant memory space, has the lifetime of an application, and is accessible from all threads within the grid and from the host through the runtime library.
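    For illustration (my addition, not from the slides), the three variable qualifiers side by side; names and sizes are arbitrary, and the block is assumed to have at most 256 threads:
        __device__   float g_scale = 2.0f;    // global memory, lives as long as the application
        __constant__ float c_coeffs[16];      // constant memory, filled from the host, e.g. via cudaMemcpyToSymbol
        __global__ void apply(float* data) {
            __shared__ float tile[256];       // shared memory, one copy per thread block
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            tile[threadIdx.x] = data[i] * g_scale + c_coeffs[threadIdx.x % 16];
            __syncthreads();
            data[i] = tile[threadIdx.x];
        }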
  • 29. CUDA API: Calling a Kernel Function – Execution Configuration. A kernel function (== __global__ function) must be called with an execution configuration:
        __global__ void kernelFunc(...) {...};
        dim3 dimGrid(100, 50);       // 5000 thread blocks
        dim3 dimBlock(4, 8, 8);      // 256 threads per block
        size_t sharedMemBytes = 64;  // 64 bytes of shared memory
        kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);
    Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
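    The launch above returns control to the host immediately. To make the host wait for the kernel (for timing, or before using its results), a minimal sketch using the runtime call of this CUDA generation (my addition):
        kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);
        cudaThreadSynchronize();   // block the host thread until the device has finished
        // a blocking cudaMemcpy from device to host would also wait implicitly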
  • 30. Overview: A high level view on CUDA; CUDA programming model; CUDA memory model; CUDA application programming interface; Simple CUDA example
  • 31. A Simple CUDA Example: Matrix Multiplication. A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs: local, register usage; thread ID usage; memory data transfer API between host and device. Shared memory usage is left until later. Assume square matrices for simplicity.
  • 32. Simple CUDA Example: Square Matrix Multiplication. P = M * N, all of size WIDTH x WIDTH. Here: without tiling! One thread calculates one element of P; M and N are loaded WIDTH times from global memory.
  • 33. Memory Layout of a Matrix in C. The matrix is stored linearly in row-major order, so element Mx,y lives at M[y * WIDTH + x]. (Figure: the 4x4 matrix laid out as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3.)
  • 34. Step 1: Matrix Multiplication, A Simple Host Version in C.
        // Matrix multiplication on the (CPU) host in double precision
        void matrixMulOnHost(float* M, float* N, float* P, int width) {
            for (int i = 0; i < width; ++i)
                for (int j = 0; j < width; ++j) {
                    double sum = 0;
                    for (int k = 0; k < width; ++k) {
                        double a = M[i * width + k];
                        double b = N[k * width + j];
                        sum += a * b;
                    }
                    P[i * width + j] = sum;
                }
        }
  • 35. Step 2: Input Matrix Data Transfer (Host-sided Code).
        void matrixMulOnDevice(float* M, float* N, float* P, int width) {
            int size = width * width * sizeof(float);
            float *Md, *Nd, *Pd;
            ...
            // 1. Allocate and load M, N to device memory
            cudaMalloc(&Md, size);
            cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
            cudaMalloc(&Nd, size);
            cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
            // Allocate P on the device
            cudaMalloc(&Pd, size);
  • 36. Step 3: Output Matrix Data Transfer (Host-sided Code).
            // 2. Kernel invocation code – to be shown later
            ...
            // 3. Read P from the device
            cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
            // Free device matrices
            cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
        }
  • 37. Step 4: Kernel Function (1).
        // Matrix multiplication kernel – per thread code
        __global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width) {
            // Pvalue is used to store the element of the matrix
            // that is computed by the thread
            float Pvalue = 0;
            // see next page...
  • 38. Step 4: Kernel Function (2).
            for (int k = 0; k < width; ++k) {
                float Melement = Md[k + threadIdx.y * width];
                float Nelement = Nd[threadIdx.x + k * width];
                Pvalue += Melement * Nelement;
            }
            int i = threadIdx.x + threadIdx.y * width;
            Pd[i] = Pvalue;
        }
  • 39. Step 5: Kernel Invocation (Host-sided Code).
            // Insert into step 2. from before
            // Setup the execution configuration
            dim3 dimGrid(1, 1);
            dim3 dimBlock(width, width);
            // Launch the device computation threads!
            matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
  • 40. Example Far from Ideal: Only One Thread Block Used. One block of threads computes matrix Pd; each thread computes one element of Pd. Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and addition for each pair of Md and Nd elements. The compute to off-chip memory access ratio is close to 1:1 (not very high). The size of the matrix is limited by the number of threads allowed in a thread block. (Figure: grid with a single block; thread (2, 2) computing one element of Pd.)
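    The one-block limitation disappears once the grid is sized from the matrix dimension. A sketch of a multi-block variant and the matching launch (my addition, still without tiling; the kernel name and TILE constant are illustrative):
        #define TILE 16
        __global__ void matrixMulKernelMB(float* Md, float* Nd, float* Pd, int width) {
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            if (row < width && col < width) {         // guard for partial blocks
                float Pvalue = 0;
                for (int k = 0; k < width; ++k)
                    Pvalue += Md[row * width + k] * Nd[k * width + col];
                Pd[row * width + col] = Pvalue;
            }
        }
        // host side: one thread per element, many blocks
        dim3 dimBlock(TILE, TILE);
        dim3 dimGrid((width + TILE - 1) / TILE, (width + TILE - 1) / TILE);
        matrixMulKernelMB<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);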
  • 41. CUDA: A Bright Future? "But let your word be 'Yes, yes,' 'No, no'; whatever is more than these is of evil." (Matthew 5:37; original German: "Eure Rede aber sei: Ja, ja; nein, nein. Was darüber ist, das ist vom Übel.")
  • 42. NVIDIA CUDA: Appendix Best Practices & Things to Watch Out For
  • 43. Appendix: Best Practices & Things to Watch Out For. Obtain relevant hardware data; compiling a CUDA program; linking; debugging; C for CUDA vs. CUDA Driver API; watch out: floating point computations; unsupported C language elements; branching of code; coalesced access to device global memory; access patterns to avoid bank conflicts.
  • 44. Obtain Relevant Hardware Data. Make sure to obtain relevant additional hardware data: call cudaGetDeviceProperties().
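    A short sketch of such a query with the runtime API (my addition; the printed fields are a small, arbitrary selection of the cudaDeviceProp structure):
        #include <cstdio>
        #include <cuda_runtime.h>
        int main() {
            int count = 0;
            cudaGetDeviceCount(&count);
            for (int d = 0; d < count; ++d) {
                cudaDeviceProp prop;
                cudaGetDeviceProperties(&prop, d);
                printf("device %d: %s, %d multiprocessors, %zu bytes shared mem/block, max %d threads/block\n",
                       d, prop.name, prop.multiProcessorCount,
                       prop.sharedMemPerBlock, prop.maxThreadsPerBlock);
            }
            return 0;
        }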
  • 45. Compiling a CUDA Program (1). A C/C++ CUDA application (e.g. float4 me = gx[gtid]; me.x += me.y * me.z;) is compiled by NVCC into CPU code plus virtual PTX code. PTX (Parallel Thread eXecution) is a virtual machine and ISA (Instruction Set Architecture): a programming model plus execution resources and state. A PTX-to-target compiler then translates the PTX (e.g. ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; mad.f32 $f1, $f5, $f3, $f1;) into physical target code for G80 and other GPUs.
  • 46. Compiling a CUDA Program (2): NVCC as a Compiler Driver. Any source file containing CUDA language extensions must be compiled with NVCC. NVCC is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, ... NVCC outputs: C code (host CPU code), which must then be compiled with the rest of the application using another tool, and PTX (either object code directly, or PTX source interpreted at runtime).
  • 47. Linking. Any executable with CUDA code requires two dynamic libraries: the CUDA runtime library (cudart) and the CUDA core library (cuda).
  • 48. Debugging Using the Device Emulation Mode. An executable compiled in device emulation mode (enabled via nvcc -deviceemu) runs completely on the host using the CUDA runtime: no device or CUDA driver is needed, and each device thread is emulated with a host thread. Running in device emulation mode, one can: use host native debug support (breakpoints, inspection, etc.); access any device-specific data from host code and vice versa; call any host function from device code (e.g. printf) and vice versa; detect deadlock situations caused by improper usage of __syncthreads().
  • 49. Device Emulation Mode Pitfalls. Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results. Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
  • 50. CUDA Driver API vs. C for CUDA (1). (Same compilation-flow diagram as slide 24: the integrated source foo.cu is split by cudacc / the EDG C/C++ frontend / the Open64 global optimizer into GPU assembly foo.s and CPU host code foo.cpp; OCG produces G80 SASS foo.sass, and gcc / cl compiles the host code. Source: Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc)
  • 51. CUDA Driver API vs. C for CUDA (2): Mutually Exclusive, Choose One or the Other.
    C runtime for CUDA: handles kernel loading and setting up kernels before they are launched. The implicit code initialization, CUDA context management, CUDA module management (cubin and function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA. It comprises two principal parts: the low-level functions (cuda_runtime_api.h) have a C-style interface that does not require compilation with nvcc; the high-level functions (cuda_runtime.h) have a C++-style interface built on top of the low-level functions. Of these, the high-level functions are the most commonly used. They wrap some of the low-level functions, using overloading, references, and default arguments; these wrappers can be used from C++ code and can be compiled with any C++ compiler.
    Driver API: a lower-level API than the runtime API. Compared with the runtime API, the driver API has these advantages: no dependency on the runtime library; more control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs); no C extensions in the host code, so compilers other than the default CPU compiler can be used. Its primary disadvantages: verbose code, greater difficulty in debugging, and no device emulation. A key point is that for every runtime API function there is an equivalent driver API function; the driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.
  • 52. CUDA Driver API vs. C for CUDA (3): Example: Vector Addition Using C for CUDA.
        const unsigned int cnBlockSize = 512;
        const unsigned int cnBlocks = 3;
        const unsigned int cnDimension = cnBlocks * cnBlockSize;
        // create CUDA device & context
        cudaSetDevice(0); // pick first device
        // allocate host vectors
        float* pA = new float[cnDimension];
        float* pB = new float[cnDimension];
        float* pC = new float[cnDimension];
        // initialize host memory
        randomInit(pA, cnDimension);
        randomInit(pB, cnDimension);
        // allocate device memory
        float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
        cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
        cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
        cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));
        // copy host vectors to device
        cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
        vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);
        // copy result from device to host
        cudaMemcpy((void *)pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);
        delete[] pA;
        delete[] pB;
        delete[] pC;
        cudaFree(pDeviceMemA);
        cudaFree(pDeviceMemB);
        cudaFree(pDeviceMemC);
  • 53. CUDA Driver API vs. C for CUDA (4): Example: Vector Addition Using CUDA Driver API.
        const unsigned int cnBlockSize = 512;
        const unsigned int cnBlocks = 3;
        const unsigned int cnDimension = cnBlocks * cnBlockSize;
        CUdevice hDevice;
        CUcontext hContext;
        CUmodule hModule;
        CUfunction hFunction;
        // create CUDA device & context
        cuInit(0);
        cuDeviceGet(&hDevice, 0); // pick first device
        cuCtxCreate(&hContext, 0, hDevice);
        cuModuleLoad(&hModule, "vectorAdd.cubin");
        cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
        // allocate host vectors
        float* pA = new float[cnDimension];
        float* pB = new float[cnDimension];
        float* pC = new float[cnDimension];
        // initialize host memory
        randomInit(pA, cnDimension);
        randomInit(pB, cnDimension);
        // allocate memory on the device
        CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
        cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
        cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
        cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
        // copy host vectors to device
        cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
        cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
        // set up parameter values
        cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);
        #define ALIGN_UP(offset, alignment) \
            (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
        int offset = 0;
        void* ptr;
        ptr = (void*)(size_t)pDeviceMemA;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);
        ptr = (void*)(size_t)pDeviceMemB;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);
        ptr = (void*)(size_t)pDeviceMemC;
        ALIGN_UP(offset, __alignof(ptr));
        cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
        offset += sizeof(ptr);
        cuParamSetSize(hFunction, offset);
        // execute kernel
        cuLaunchGrid(hFunction, cnBlocks, 1);
        // copy the result from device back to host
        cuMemcpyDtoH((void *)pC, pDeviceMemC, cnDimension * sizeof(float));
        delete[] pA;
        delete[] pB;
        delete[] pC;
        cuMemFree(pDeviceMemA);
        cuMemFree(pDeviceMemB);
        cuMemFree(pDeviceMemC);
  • 54. Watch Out: Floating Point Computations. Differing Results of FP Computations. Results of floating-point computations will differ slightly because of: different compiler outputs and instruction sets, and the use of extended precision for intermediate results. There are various options to force strict single precision on the host.
  • 55. Watch Out: Floating Point Computations. Single and Double Precision Operations. Double precision: no deviations from the IEEE 754 standard. Single precision: denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even); and the precision of division/square root is slightly lower than single precision.
  • 56. Limitations (1): Only a Subset of C Available. C for CUDA offers only a subset of the C language: it is recursion-free and function-pointer-free; functions reside in the global device memory, therefore we cannot obtain their addresses.
  • 57. Limitations (2): Branching in Program Code. For best performance, threads should be running in groups of 32 threads (32 threads = 1 warp). All threads of a warp should take the same execution path; otherwise, branching will probably hurt.
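    A small sketch of the difference (my addition; kernel names are illustrative): in the first kernel the branch condition alternates with the thread index, so threads of one warp take different paths and both sides are executed serially; in the second, the condition is constant within each 32-thread warp, so there is no divergence.
        __global__ void divergent(float* data) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i % 2 == 0)         // even/odd threads diverge inside every warp
                data[i] *= 2.0f;
            else
                data[i] += 1.0f;
        }
        __global__ void uniform(float* data) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if ((i / 32) % 2 == 0)  // condition is constant within each 32-thread warp
                data[i] *= 2.0f;
            else
                data[i] += 1.0f;
        }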
  • 58. Coalesced Access to Device Global Memory. High priority: ensure global memory accesses are coalesced whenever possible. Global memory loads and stores by threads of a half warp (16 threads) are coalesced by the device in as few as one transaction (or two transactions in the case of 128-bit words). But: certain access requirements have to be met.
  • 59. Coalesced Access (Reading Floats) (figure)
  • 60. Uncoalesced Access (Reading Floats) (figure)
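    To make the two figures concrete, a sketch of one access pattern that coalesces and one that does not (my addition): consecutive threads reading consecutive 32-bit words are combined into few transactions, while a strided read spreads a half warp over many memory segments.
        __global__ void coalescedRead(float* in, float* out) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i];              // thread k reads word k: accesses coalesce
        }
        __global__ void stridedRead(float* in, float* out, int stride) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i * stride];     // stride > 1 scatters the half warp: many transactions
        }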
  • 61. Shared Memory – Bank Conflicts. Shared memory: 16 KB, organized in 16 banks of 1 KB each. It is as fast as a register ... if no bank conflicts occur! A bank conflict occurs when more than one thread in the same half-warp accesses the same bank; the access then needs to be serialized. Cost = max (# of concurrent accesses).
  • 62. Shared Memory – No Bank Conflicts (figure): linear addressing with step size = 1 word, random permutation, linear addressing with step size = 3 words, broadcast.
  • 63. Shared Memory – Bank Conflicts (figure): linear addressing with step size = 2 words; linear addressing with step size = 8 words; no conflict or 5-way conflict.
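    As a closing sketch (my addition, corresponding to the bank-conflict figures): with 16 banks and 32-bit words, a stride-1 shared-memory access maps each thread of a half warp to a different bank, while a stride-2 access makes pairs of threads hit the same bank and is serialized. The kernel assumes at most 256 threads per block.
        __global__ void bankAccess(float* out) {
            __shared__ float s[512];
            int tid = threadIdx.x;
            s[tid] = (float)tid;
            __syncthreads();
            float a = s[tid];          // stride 1: conflict-free, each thread in its own bank
            float b = s[tid * 2];      // stride 2: e.g. threads 0 and 8 share bank 0 -> 2-way conflict
            out[blockIdx.x * blockDim.x + tid] = a + b;
        }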