Kato Mivule: An Overview of CUDA for High Performance Computing

HPC GPU Programming with CUDA

An Overview of CUDA for High Performance Computing

By Kato Mivule
Computer Science Department
Bowie State University
COSC887 Fall 2013



Agenda
• CUDA – Introduction.
• CUDA – Process flow.
• CUDA – Hello world program.
• CUDA – Compiling and running a program.
• CUDA – Basic structure.
• CUDA – Example program on vector addition.
• CUDA – Conclusion.
• CUDA – References and sources.



CUDA – Introduction

•CUDA – Compute Unified Device Architecture.
•Developed by NVIDIA.
•A parallel computing platform and programming model.
•Implemented on NVIDIA graphics processing units (GPUs).
•Gives direct access to the virtual instruction set and memory of GPUs.
•Allows general-purpose processing on GPUs (GPGPU), beyond graphics.
•Allows for increased computing performance using GPUs.

Plymouth Cuda – Image Source: betterparts.org



CUDA – Process flow in three steps
1. Copy input data from CPU memory to GPU memory.
2. Load GPU program and execute.
3. Copy results from GPU memory to CPU memory.

Image Source: http://en.wikipedia.org/wiki/CUDA
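
The three steps above can be sketched in code as follows. This is a minimal illustration, not part of the original slides: the kernel doubleElements and the host helper doubleOnGpu are hypothetical names, and the block size of 256 is an arbitrary choice. The sketch assumes n > 0.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void doubleElements(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        data[id] = 2.0f * data[id];
}

// Hypothetical host helper: runs the three-step flow on h_data (n floats, n > 0).
// Returns cudaSuccess, or the first error encountered.
cudaError_t doubleOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    cudaError_t err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess)
        return err;

    // Step 1: copy input data from CPU memory to GPU memory.
    err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Step 2: load the GPU program (the kernel) and execute it.
    if (err == cudaSuccess) {
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        doubleElements<<<gridSize, blockSize>>>(d_data, n);
        err = cudaGetLastError();  // reports launch errors
    }

    // Step 3: copy results from GPU memory back to CPU memory.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err;
}
```

Calling doubleOnGpu(values, 4) from main on an array of four floats should leave each element doubled, provided a CUDA-capable device is available.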



CUDA – Hello world program
#include <stdio.h>

// __global__ denotes that this is device (GPU) code:
// the function runs on the device (GPU) and gets called from host code.
__global__ void mykernel(void) {
}

// Host (CPU) code: runs on the host.
int main(void) {
    printf("Hello, world!\n");
    // <<< >>> denotes a call from host code to device code.
    mykernel<<<1,1>>>();
    return 0;
}


CUDA – Compiling and Running a Program on GWU’s Cray
1. Log into the Cray: ssh cray
2. Change to the ‘work’ directory: cd work
3. Create your program with the file extension .cu: vim hello1.cu
4. Load the CUDA module: module load cudatoolkit
5. Compile using NVCC: nvcc hello1.cu -o hello1
6. Execute the program: ./hello1



CUDA – Basic structure
•The kernel – this is the GPU program.
•The kernel is executed on a grid.
•The grid – is a group of thread blocks.
•The thread block – is a group of threads.
•A thread block is executed on a single multiprocessor.
•Threads within a block can communicate and synchronize.
•Threads are grouped into blocks, and blocks into a grid.
Image Source: CUDA Overview Tutorial, Cliff Woolley, NVIDIA
http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf


Declaring functions and variables
• __global__ Denotes a kernel function called from the host and executed on the device.
• __device__ Denotes a device function called and executed on the device.
• __host__ Denotes a host function called and executed on the host.
• __constant__ Denotes a constant device variable available to all threads.
• __shared__ Denotes a shared device variable available to all threads in a block.
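
As a rough sketch of how these qualifiers fit together, the fragment below uses __constant__, __device__, __global__ and __host__ in one small example (__shared__ is left out to keep it short). All names here (scale, scaled, scaleKernel, scaleOnGpu) are hypothetical and chosen only for illustration. The sketch assumes n > 0.

```cuda
#include <cuda_runtime.h>

// __constant__: device variable readable by all threads; set from the host.
__constant__ float scale;

// __device__: called from device code and executed on the device.
__device__ float scaled(float x)
{
    return scale * x;
}

// __global__: kernel, called from the host and executed on the device.
__global__ void scaleKernel(float *data, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        data[id] = scaled(data[id]);
}

// __host__: called and executed on the host (the default without a qualifier).
// Hypothetical helper: multiplies n floats (n > 0) by factor on the GPU.
__host__ cudaError_t scaleOnGpu(float *h_data, int n, float factor)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    // Write the host value into the __constant__ variable on the device.
    cudaError_t err = cudaMemcpyToSymbol(scale, &factor, sizeof(float));
    if (err != cudaSuccess)
        return err;

    err = cudaMalloc(&d_data, bytes);
    if (err != cudaSuccess)
        return err;

    err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        scaleKernel<<<gridSize, blockSize>>>(d_data, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err;
}
```

The constant is written with cudaMemcpyToSymbol rather than assigned directly, because host code cannot write a __constant__ variable through ordinary assignment.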



Some of the supported data types (base names of the built-in vector types, e.g. int2, float4)
• char and uchar
• short and ushort
• int and uint
• long and ulong
• longlong and ulonglong
• float and double



• Accessing components – the kernel launch specifies the number of threads; inside the kernel, built-in variables give the dimensions and indices.
• dim3 gridDim – denotes the dimensions of the grid in blocks.
  • Example: dim3 DimGrid(8,4) – 32 thread blocks
• dim3 blockDim – denotes the dimensions of a block in threads.
  • Example: dim3 DimBlock(2, 2, 2) – 8 threads per block
• uint3 blockIdx – denotes a block index within the grid.
• uint3 threadIdx – denotes a thread index within the block.
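
To make the built-in variables concrete, the sketch below launches a kernel with the DimGrid(8,4) and DimBlock(2, 2, 2) examples above and has every thread compute a unique linear index from gridDim, blockDim, blockIdx and threadIdx. The kernel recordIds and the helper countLaunchedThreads are hypothetical names used only for illustration; the particular linearization is one common convention, not the only possible one.

```cuda
#include <cuda_runtime.h>

#define TOTAL_THREADS 256  // 32 blocks * 8 threads per block

// Each thread writes its own global linear index into out.
__global__ void recordIds(int *out)
{
    // Linear index of this block within the 2D grid (0..31).
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    // Number of threads in one 3D block (8).
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    // Linear index of this thread within its block (0..7).
    int threadId = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                   + threadIdx.x;
    int globalId = blockId * threadsPerBlock + threadId;
    out[globalId] = globalId;
}

// Hypothetical helper: returns how many slots were filled with their own
// index, or -1 if a CUDA call failed.
int countLaunchedThreads(void)
{
    int h_out[TOTAL_THREADS];
    int *d_out = NULL;
    size_t bytes = TOTAL_THREADS * sizeof(int);

    if (cudaMalloc(&d_out, bytes) != cudaSuccess)
        return -1;
    cudaMemset(d_out, 0xFF, bytes);  // every int starts as -1

    dim3 DimGrid(8, 4);      // 32 thread blocks
    dim3 DimBlock(2, 2, 2);  // 8 threads per block
    recordIds<<<DimGrid, DimBlock>>>(d_out);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    int count = 0;
    for (int i = 0; i < TOTAL_THREADS; i++)
        if (h_out[i] == i)
            count++;
    return count;
}
```

If every one of the 32 × 8 threads computes a distinct index, the helper returns 256.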



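Thread management
• __threadfence_block() – memory fence: the calling thread's writes before the fence become visible to all threads in its block before its writes after the fence.
• __threadfence() – the same ordering guarantee, extended to all threads on the device.
• __threadfence_system() – the same ordering guarantee, extended to all threads on the device, host threads, and peer devices.
• __syncthreads() – barrier: waits until all threads in the block have reached this point.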
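
A typical use of __syncthreads() is a per-block sum in shared memory, sketched below. This is an illustrative example rather than material from the slides: blockSum, sumOnGpu and BLOCK_SIZE are hypothetical names, and the tree reduction assumes the block size is a power of two and n > 0.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // must be a power of two for this reduction

// Each block sums its slice of in and writes one partial sum.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float cache[BLOCK_SIZE];  // visible to all threads in the block
    int tid = threadIdx.x;
    int id = blockIdx.x * blockDim.x + tid;

    cache[tid] = (id < n) ? in[id] : 0.0f;
    __syncthreads();  // all loads into cache are done before anyone reads

    // Tree reduction: halve the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();  // finish this round before starting the next
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical helper: sums n floats (n > 0). Returns 0 on success, -1 on failure.
int sumOnGpu(const float *h_in, int n, float *result)
{
    int gridSize = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    size_t inBytes = n * sizeof(float);
    size_t outBytes = gridSize * sizeof(float);
    float *d_in = NULL, *d_partial = NULL;
    float *h_partial = (float *)malloc(outBytes);
    int status = -1;

    if (h_partial != NULL &&
        cudaMalloc(&d_in, inBytes) == cudaSuccess &&
        cudaMalloc(&d_partial, outBytes) == cudaSuccess &&
        cudaMemcpy(d_in, h_in, inBytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        blockSum<<<gridSize, BLOCK_SIZE>>>(d_in, d_partial, n);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h_partial, d_partial, outBytes,
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Add up the per-block partial sums on the host.
            *result = 0.0f;
            for (int b = 0; b < gridSize; b++)
                *result += h_partial[b];
            status = 0;
        }
    }
    cudaFree(d_in);
    cudaFree(d_partial);
    free(h_partial);
    return status;
}
```

Both barriers matter: without them, a thread could read a cache slot that another thread in the same block has not yet written.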



Memory management
• cudaMalloc( ) – allocates device (GPU) memory.
• cudaFree( ) – frees allocated device memory.
• cudaMemcpy( ) – copies memory between host and device; with cudaMemcpyDeviceToHost it copies device (GPU) results back to host (CPU) memory.



Atomic functions – executed without interference from other threads
• atomicAdd( )
• atomicSub( )
• atomicExch( )
• atomicMin( )
• atomicMax( )
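
As a small illustration of why atomics are needed, the sketch below lets many threads increment one shared counter with atomicAdd( ). The kernel countPositive and the helper countPositiveOnGpu are hypothetical names for this example only. The sketch assumes n > 0.

```cuda
#include <cuda_runtime.h>

// Each thread that sees a positive value adds 1 to a single counter.
// atomicAdd makes the read-modify-write indivisible, so no update is lost.
__global__ void countPositive(const int *in, int n, int *count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n && in[id] > 0)
        atomicAdd(count, 1);
}

// Hypothetical helper: returns the number of positive values among
// n ints (n > 0), or -1 if a CUDA call failed.
int countPositiveOnGpu(const int *h_in, int n)
{
    int *d_in = NULL, *d_count = NULL;
    int h_count = -1;
    size_t bytes = n * sizeof(int);
    int ok = 0;

    if (cudaMalloc(&d_in, bytes) == cudaSuccess &&
        cudaMalloc(&d_count, sizeof(int)) == cudaSuccess &&
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_count, 0, sizeof(int)) == cudaSuccess) {
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        countPositive<<<gridSize, blockSize>>>(d_in, n, d_count);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&h_count, d_count, sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            ok = 1;
    }
    cudaFree(d_in);
    cudaFree(d_count);
    return ok ? h_count : -1;
}
```

With a plain `*count += 1` in place of atomicAdd, concurrent threads could overwrite each other's updates and the final count could come out too low.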



CUDA – Example code for vector addition
//=============================================================
//Vector addition
//Oak Ridge National Laboratory example
//https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
//=============================================================
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
// Runs on the device (GPU) and is called by the host (CPU)
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
    int blockSize, gridSize;
    // Number of threads in each thread block
    blockSize = 1024;
    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);
    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);
    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}



Sometimes CUDA code that looks correct will output wrong results.
• Check the machine for errors – access to the device (GPU) might not be granted.
• The computation might produce correct results only on the host (CPU).
//============================
//ERROR CHECKING
//============================
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
//place in memory allocation section
cudaCheckErrors("cudaMalloc fail");
//place in memory copy section
cudaCheckErrors("cudaMemcpy fail");
//place after the kernel launch and the copy back to host
cudaCheckErrors("cudaMemcpy or CUDA kernel fail");


Conclusion
• CUDA’s access to GPU computational power is outstanding.
• CUDA is easy to learn.

• CUDA programs can be written in C.
• However, translating code from host to device and from device to host is a challenge.



References and Sources
[1] CUDA Programming Blog Tutorial
http://cuda-programming.blogspot.com/2013/03/cuda-complete-complete-reference-on-cuda.html
[2] Dr. Kenrick Mock CUDA Tutorial
http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf
[3] Parallel Programming Lecture Notes, Spring 2008, Johns Hopkins University
http://hssl.cs.jhu.edu/wiki/lib/exe/fetch.php?media=randal:teach:cs420:cudatools.pdf
[4] CUDA Super Computing Blog Tutorials
http://supercomputingblog.com/cuda-tutorials/
[5] Introduction to CUDA C Tutorial, Jason Sanders
http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf
[6] CUDA Overview Tutorial, Cliff Woolley, NVIDIA
http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
[7] Oak Ridge National Laboratory CUDA Vector Addition Example
https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
[8] CUDA – Wikipedia
http://en.wikipedia.org/wiki/CUDA

