GPUs and CUDA
CPU Architecture
-  good for serial programs
-  do many different things well
-  many transistors for purposes other than ALUs (e.g. flow control and caching)
-  memory access is slow (~1 GB/s)
-  switching threads is slow




Image from Alex Moore, “Introduction to Programming in CUDA”,
http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html
GPU Architecture
-  many processors perform similar operations on a large data set in parallel (single-instruction, multiple-data parallelism)
-  recent GPUs have around 30 multiprocessors, each containing 8 stream processors
-  GPUs devote most (~80%) of their transistors to ALUs
-  fast memory (~80 GB/s)

(Figure: GPU die layout showing Cache, Control, and ALUs.)
Thread Hierarchy
   -  a block of threads runs on a single multiprocessor

   -  a grid of blocks makes up the entire set

   -  each thread in a block can access the same shared memory (see the sketch after this slide)

   -  many more threads than processors

   Image from Johan Seland, “CUDA Programming”,
   http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
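
Below is a minimal sketch (not from the slides; the kernel name, data names, and the block size of 256 are illustrative) of the indexing and shared memory described above: each thread computes its global index from its block and thread IDs, and all threads in a block read and write the same __shared__ buffer.

// Minimal sketch: reverse each block-sized segment of an array using shared memory.
__global__ void reverse_within_block(float* data) {
    __shared__ float buf[256];              // one buffer per block; assumes blockDim.x == 256
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;    // global index: the grid of blocks covers the whole array
    buf[t] = data[g];
    __syncthreads();                        // wait until every thread in the block has written
    data[g] = buf[blockDim.x - 1 - t];      // read a value stored by another thread in the same block
}

Launched, for example, as reverse_within_block<<<N/256, 256>>>(d_data); assuming N is a multiple of 256.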
Memory Hierarchy




Image from Johan Seland, “CUDA Programming”, http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf
CUDA
-  a set of C extensions for running programs on a GPU
-  Windows, Linux, Mac… NVIDIA cards only
-  single instruction, multiple data
-  gives you direct access to the memory architecture
-  don’t need to know graphics APIs
-  built-in memory management (cudaMalloc, cudaFree, cudaMemcpy) and libraries (cuBLAS, cuFFT); a short sketch follows below
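
A minimal sketch (not from the slides) of how these built-ins fit together: allocate device memory, copy data over, run a 1-D FFT with cuFFT, and copy the result back. The function name, host_signal, and N are illustrative, and error checking is omitted.

#include <cuda_runtime.h>
#include <cufft.h>

void fft_on_gpu(cufftComplex* host_signal, int N) {
    // allocate device memory and copy the host signal over
    cufftComplex* d_signal;
    cudaMalloc((void**)&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, host_signal, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    // 1-D complex-to-complex FFT, executed in place on the device
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    // copy the result back and release device resources
    cudaMemcpy(host_signal, d_signal, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_signal);
}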
Elementwise Matrix Addition
CPU Program
void add_matrix(float* a, float* b, float* c, int N) { 
    int index; 
    for ( int i = 0; i < N; ++i ) {
        for ( int j = 0; j < N; ++j ) { 
            index = i + j*N;   // flat index into an N x N matrix
            c[index] = a[index] + b[index]; 
        }
    }
}

int main() { 
    // a, b, c and N are assumed to be allocated and initialized elsewhere
    add_matrix( a, b, c, N ); 
}
Elementwise Matrix Addition
CUDA Program
__global__ void add_matrix(float* a, float* b, float* c, int N ) { 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    int j = blockIdx.y * blockDim.y + threadIdx.y; 
    int index = i + j*N; 

    if ( i < N && j < N )    // guard against threads that fall outside the matrix
        c[index] = a[index] + b[index]; 
}

int main() { 
    // blocksize, N and the device pointers a, b, c are assumed to be set up elsewhere
    dim3 dimBlock( blocksize, blocksize ); 
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );   // assumes N is a multiple of blocksize
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}
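
The slide's main() leaves out the host-side setup. Below is a minimal sketch of what it would typically look like, assuming host arrays h_a, h_b, h_c and illustrative values of N and blocksize; error checking is omitted.

#include <cuda_runtime.h>
#include <stdlib.h>

// add_matrix kernel as above ...

int main() {
    const int N = 512;            // illustrative matrix dimension
    const int blocksize = 16;     // illustrative block size
    size_t bytes = N * N * sizeof(float);

    // host matrices (assumed to be filled with real data in practice)
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // device matrices
    float *a, *b, *c;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMalloc((void**)&c, bytes);

    // copy the inputs to the GPU
    cudaMemcpy(a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b, bytes, cudaMemcpyHostToDevice);

    // launch the kernel and copy the result back
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
    cudaMemcpy(h_c, c, bytes, cudaMemcpyDeviceToHost);

    // clean up
    cudaFree(a); cudaFree(b); cudaFree(c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}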
Results
(for tasks relevant to the MWA Real Time System)

(Bar chart: speedup, on an axis from 0 to 70, for each RTS module: Imaging, Gridding *, UnpeelTileResponse, PeelTileResponse, ReRotateVisibilities, MeasureTileResponse, MeasureIonosphericOffset, RotateAndAccumulateVisibilities.)

Image from Kevin Dale, “A Graphics Hardware-Accelerated Real-Time Processing Pipeline for Radio Astronomy”,
presented at AstroGPU, Nov 2007.
