GPUs and CUDA
CPU Architecture
-  good for serial programs
-  do many different things well
-  many transistors for purposes other than ALUs (e.g. flow control and caching)
-  memory access is slow (~1 GB/s)
-  switching threads is slow




Image from Alex Moore, “Introduction to Programming in CUDA”,
http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html
GPU Architecture
-  many processors perform similar operations on a large data set in parallel (single-instruction, multiple-data parallelism)
-  recent GPUs have around 30 multiprocessors, each containing 8 stream processors
-  GPUs devote most (~80%) of their transistors to ALUs
-  fast memory (~80 GB/s)

(Figure: GPU die layout showing Cache, Control, and ALUs.)
Thread Hierarchy
   -  a block of threads runs on a single multiprocessor

   -  a grid of blocks makes up the entire set

   -  each thread in a block can access the same shared memory (see the sketch after this slide)

   -  many more threads than processors

   Image from Johan Seland, “CUDA Programming”,
   http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
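
Below is a minimal sketch (not from the slides; the kernel name, data names, and the block size of 256 are illustrative) of the indexing and shared memory described above: each thread computes its global index from its block and thread IDs, and all threads in a block read and write the same __shared__ buffer.

// Minimal sketch: reverse each block-sized segment of an array using shared memory.
__global__ void reverse_within_block(float* data) {
    __shared__ float buf[256];              // one buffer per block; assumes blockDim.x == 256
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;    // global index: the grid of blocks covers the whole array
    buf[t] = data[g];
    __syncthreads();                        // wait until every thread in the block has written
    data[g] = buf[blockDim.x - 1 - t];      // read a value stored by another thread in the same block
}

Launched, for example, as reverse_within_block<<<N/256, 256>>>(d_data); assuming N is a multiple of 256.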
Memory Hierarchy




Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
CUDA
-  a set of C extensions for running programs on a GPU
-  Windows, Linux, Mac… NVIDIA cards only
-  single instruction, multiple data
-  gives you direct access to the memory architecture
-  don’t need to know graphics APIs
-  built-in memory management (cudaMalloc, cudaFree, cudaMemcpy) and libraries (cuBLAS, cuFFT); a short sketch follows below
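
A minimal sketch (not from the slides) of how these built-ins fit together: allocate device memory, copy data over, run a 1-D FFT with cuFFT, and copy the result back. The function name, host_signal, and N are illustrative, and error checking is omitted.

#include <cuda_runtime.h>
#include <cufft.h>

void fft_on_gpu(cufftComplex* host_signal, int N) {
    // allocate device memory and copy the host signal over
    cufftComplex* d_signal;
    cudaMalloc((void**)&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, host_signal, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // 1-D complex-to-complex FFT, executed in place on the device
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // copy the result back and release device resources
    cudaMemcpy(host_signal, d_signal, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_signal);
}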
Elementwise Matrix Addition
CPU Program
void add_matrix(float* a, float* b, float* c, int N) { 
    int index; 
    for ( int i = 0; i < N; ++i ) {
        for ( int j = 0; j < N; ++j ) { 
            index = i + j*N;   // flat index into an N x N matrix
            c[index] = a[index] + b[index]; 
        }
    }
}

int main() { 
    // a, b, c and N are assumed to be allocated and initialized elsewhere
    add_matrix( a, b, c, N ); 
}
Elementwise Matrix Addition
CUDA Program
__global__ void add_matrix(float* a, float* b, float* c, int N ) { 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    int j = blockIdx.y * blockDim.y + threadIdx.y; 
    int index = i + j*N; 

    if ( i < N && j < N )    // guard against threads that fall outside the matrix
        c[index] = a[index] + b[index]; 
}

int main() { 
    // blocksize, N and the device pointers a, b, c are assumed to be set up elsewhere
    dim3 dimBlock( blocksize, blocksize ); 
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );   // assumes N is a multiple of blocksize
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}
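
The slide's main() leaves out the host-side setup. Below is a minimal sketch of what it would typically look like, assuming host arrays h_a, h_b, h_c and illustrative values of N and blocksize; error checking is omitted.

#include <cuda_runtime.h>
#include <stdlib.h>

// add_matrix kernel as above ...

int main() {
    const int N = 512;            // illustrative matrix dimension
    const int blocksize = 16;     // illustrative block size
    size_t bytes = N * N * sizeof(float);

    // host matrices (assumed to be filled with real data in practice)
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // device matrices
    float *a, *b, *c;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);

    // copy the inputs to the GPU
    cudaMemcpy(a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, bytes, cudaMemcpyHostToDevice);

    // launch the kernel and copy the result back
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
    cudaMemcpy(h_c, c, bytes, cudaMemcpyDeviceToHost);

    // clean up
    cudaFree(a); cudaFree(b); cudaFree(c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}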
Results
(for tasks relevant to the MWA Real Time System)

(Bar chart: speedup, on an axis from 0 to 70, for each RTS module: Imaging, Gridding *, UnpeelTileResponse, PeelTileResponse, ReRotateVisibilities, MeasureTileResponse, MeasureIonosphericOffset, RotateAndAccumulateVisibilities.)

Image from Kevin Dale, “A Graphics Hardware-Accelerated Real-Time Processing Pipeline for Radio Astronomy”,
presented at AstroGPU, Nov 2007.
