Kato Mivule: An Overview of CUDA for High Performance Computing

  1. HPC GPU Programming with CUDA
     An Overview of CUDA for High Performance Computing
     By Kato Mivule
     Computer Science Department, Bowie State University
     COSC887, Fall 2013
  2. Agenda
     • CUDA – Introduction.
     • CUDA – Process flow.
     • CUDA – Hello world program.
     • CUDA – Compiling and running a program.
     • CUDA – Basic structure.
     • CUDA – Example program on vector addition.
     • CUDA – The conclusion.
     • CUDA – References and sources.
  3. CUDA – Introduction
     • CUDA – Compute Unified Device Architecture.
     • Developed by NVIDIA.
     • A parallel computing platform and programming model.
     • Implemented by NVIDIA graphics processing units (GPUs).
  4. CUDA – Introduction
     • Grants direct access to the virtual instruction set and memory of GPUs.
     • Allows for general-purpose processing on GPUs (GPGPU) beyond graphics.
     • Allows for increased computing performance using GPUs.
     [Image: Plymouth Cuda. Source: betterparts.org]
  5. CUDA – Process flow in three steps
     1. Copy input data from CPU memory to GPU memory.
     2. Load GPU program and execute.
     3. Copy results from GPU memory to CPU memory.
     [Image source: http://en.wikipedia.org/wiki/CUDA]
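
The three steps above can be sketched in a few lines of CUDA C. This is a minimal illustration rather than code from the slides: the kernel `doubleElements` and the host helper `doubleOnGpu` are hypothetical names invented for this example, and error handling is kept to a minimum.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: doubles each element.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Hypothetical host helper: runs the three-step flow on a host array.
// Returns 0 on success, 1 on failure.
int doubleOnGpu(float *h_data, int n)
{
    float *d_data = NULL;
    size_t bytes = n * sizeof(float);

    if (cudaMalloc(&d_data, bytes) != cudaSuccess)
        return 1;

    // Step 1: copy input data from CPU memory to GPU memory.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Step 2: load the GPU program (the kernel) and execute it.
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    doubleElements<<<gridSize, blockSize>>>(d_data, n);

    // Step 3: copy results from GPU memory back to CPU memory.
    // This cudaMemcpy waits for the kernel above to finish.
    cudaError_t err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    float v[8];
    for (int i = 0; i < 8; i++)
        v[i] = (float)i;

    if (doubleOnGpu(v, 8) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 8; i++)
        printf("%.1f ", v[i]);
    printf("\n");
    return 0;
}
```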
  6. CUDA – Hello world program

     #include <stdio.h>

     // __global__ denotes that this is device (GPU) code:
     // the function runs on the device and gets called from host code.
     __global__ void mykernel(void) {
     }

     // Host (CPU) code; runs on the host.
     int main(void) {
         printf("Hello, world!\n");
         mykernel<<<1,1>>>();   // <<< >>> denotes a call from host to device code
         return 0;
     }
  7. CUDA – Compiling and running a program on GWU’s Cray
     1. Log into the Cray: ssh cray
     2. Change to the ‘work’ directory: cd work
     3. Create your program with the file extension .cu: vim hello1.cu
     4. Load the CUDA module: module load cudatoolkit
     5. Compile using NVCC: nvcc hello1.cu -o hello1
     6. Execute the program: ./hello1
  8. CUDA – Basic structure
     • The kernel – this is the GPU program.
     • The kernel is executed on a grid.
     • The grid – a group of thread blocks.
     • The thread block – a group of threads.
       • Executed on a single multiprocessor.
       • Threads within a block can communicate and synchronize.
     • Threads are grouped into blocks, and blocks into a grid.
     [Image source: CUDA Overview Tutorial, Cliff Woolley, NVIDIA, http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf]
  9. CUDA – Basic structure
     Declaring functions and variables
     • __global__ denotes a kernel function called from the host and executed on the device.
     • __device__ denotes a device function called and executed on the device.
     • __host__ denotes a host function called and executed on the host.
     • __constant__ denotes a constant device variable available to all threads.
     • __shared__ denotes a shared device variable available to all threads in a block.
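
A short sketch may help show how the function qualifiers and `__constant__` fit together (`__shared__` is left out to keep it small). All names here (`offset`, `addOffset`, `shiftAll`, `shiftOnGpu`) are hypothetical and chosen only for illustration; the one runtime call beyond those on the slides is `cudaMemcpyToSymbol`, which copies a host value into a `__constant__` variable.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// __constant__: device variable readable by all threads (name is hypothetical).
__constant__ float offset;

// __device__: called from device code and executed on the device.
__device__ float addOffset(float x)
{
    return x + offset;
}

// __global__: kernel, launched from the host and executed on the device.
__global__ void shiftAll(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = addOffset(v[i]);
}

// __host__: ordinary CPU function (also the default with no qualifier).
// Returns 0 on success, 1 on failure.
__host__ int shiftOnGpu(float *h_v, int n, float h_offset)
{
    float *d_v = NULL;
    size_t bytes = n * sizeof(float);

    if (cudaMalloc(&d_v, bytes) != cudaSuccess)
        return 1;

    // Copy the host value into the __constant__ device variable.
    cudaMemcpyToSymbol(offset, &h_offset, sizeof(float));
    cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice);

    shiftAll<<<(n + 127) / 128, 128>>>(d_v, n);

    cudaError_t err = cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_v);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (shiftOnGpu(v, 4, 10.0f) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    for (int i = 0; i < 4; i++)
        printf("%.1f ", v[i]);
    printf("\n");
    return 0;
}
```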
  10. CUDA – Basic structure
      Some of the supported data types (built-in vector types; each takes a component-count suffix from 1 to 4, e.g. char4, float3)
      • char and uchar
      • short and ushort
      • int and uint
      • long and ulong
      • float (there is no unsigned float type)
      • longlong and ulonglong
  11. CUDA – Basic structure
      Accessing components – the kernel launch specifies the number of blocks and threads; inside the kernel, built-in variables give each thread its position.
      • dim3 gridDim – denotes the dimensions of the grid in blocks.
        Example: dim3 DimGrid(8, 4) – 32 thread blocks.
      • dim3 blockDim – denotes the dimensions of a block in threads.
        Example: dim3 DimBlock(2, 2, 2) – 8 threads per block.
      • uint3 blockIdx – denotes a block index within the grid.
      • uint3 threadIdx – denotes a thread index within a block.
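
As a rough sketch of how these built-in variables combine, the following example launches a two-dimensional grid of 8 x 4 blocks (32 blocks, as in the slide's DimGrid example) with 2 x 2 threads per block, and has each thread write its own global index. The kernel `fillIndex` and the helper `fillOnGpu` are hypothetical names used only for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes its global (x, y) position
// from blockIdx, blockDim and threadIdx, then stores its linear index.
__global__ void fillIndex(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = y * width + x;
}

// Hypothetical host helper. Returns 0 on success, 1 on failure.
int fillOnGpu(int *h_out, int width, int height)
{
    int *d_out = NULL;
    size_t bytes = (size_t)width * height * sizeof(int);

    if (cudaMalloc(&d_out, bytes) != cudaSuccess)
        return 1;

    dim3 dimBlock(2, 2);  // 4 threads per block
    // Round up so the grid covers the whole array.
    dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
                 (height + dimBlock.y - 1) / dimBlock.y);

    fillIndex<<<dimGrid, dimBlock>>>(d_out, width, height);

    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    // 16 x 8 elements with 2 x 2 blocks gives an 8 x 4 grid (32 blocks).
    int out[16 * 8];
    if (fillOnGpu(out, 16, 8) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("last element holds index %d\n", out[16 * 8 - 1]);
    return 0;
}
```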
  12. CUDA – Basic structure
      Thread management
      • __threadfence_block() – waits until the calling thread's prior memory writes are visible to all threads in its block.
      • __threadfence() – waits until those writes are visible to all threads in the block and on the device.
      • __threadfence_system() – waits until those writes are visible to the block, the device and the host.
      • __syncthreads() – waits until all threads in the block have reached this point.
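
A small sketch of `__syncthreads()` in use, assuming a single block of 64 threads that reverses an array through shared memory. The names `reverseBlock`, `reverseOnGpu` and the constant `N` are hypothetical. Without the barrier, a thread could read a slot of `tmp` before the thread responsible for it had written it.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 64  // array length and threads per block (assumed for this sketch)

// Hypothetical kernel: reverses an N-element array within one block.
__global__ void reverseBlock(int *d)
{
    __shared__ int tmp[N];   // one copy per block, visible to all its threads
    int t = threadIdx.x;

    tmp[t] = d[t];           // each thread stages one element
    __syncthreads();         // wait until every thread has written tmp
    d[t] = tmp[N - 1 - t];   // now safe to read another thread's element
}

// Hypothetical host helper. Returns 0 on success, 1 on failure.
int reverseOnGpu(int *h)
{
    int *d = NULL;
    size_t bytes = N * sizeof(int);

    if (cudaMalloc(&d, bytes) != cudaSuccess)
        return 1;

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    reverseBlock<<<1, N>>>(d);
    cudaError_t err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    int v[N];
    for (int i = 0; i < N; i++)
        v[i] = i;

    if (reverseOnGpu(v) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("first = %d, last = %d\n", v[0], v[N - 1]);
    return 0;
}
```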
  13. CUDA – Basic structure
      Memory management
      • cudaMalloc() – allocates device memory.
      • cudaFree() – frees allocated device memory.
      • cudaMemcpy() – copies memory between host and device; with cudaMemcpyDeviceToHost it copies device (GPU) results back to host (CPU) memory.
  14. CUDA – Basic structure
      Atomic functions – executed without interference from other threads
      • atomicAdd()
      • atomicSub()
      • atomicExch()
      • atomicMin()
      • atomicMax()
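
To illustrate why atomics matter, here is a minimal sketch in which many threads increment one shared counter. The kernel `countEven` and the helper `countEvenOnGpu` are hypothetical names for this example. A plain `*count += 1` would be a race between threads; `atomicAdd` performs the read-modify-write as one indivisible operation.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: counts the even values in v.
__global__ void countEven(const int *v, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] % 2 == 0)
        atomicAdd(count, 1);   // safe concurrent increment
}

// Hypothetical host helper. Returns the count, or -1 on failure.
int countEvenOnGpu(const int *h_v, int n)
{
    int *d_v = NULL;
    int *d_count = NULL;
    int h_count = -1;
    size_t bytes = n * sizeof(int);

    if (cudaMalloc(&d_v, bytes) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_count, sizeof(int)) != cudaSuccess) {
        cudaFree(d_v);
        return -1;
    }

    cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));   // counter starts at zero

    int blockSize = 128;
    countEven<<<(n + blockSize - 1) / blockSize, blockSize>>>(d_v, n, d_count);

    if (cudaMemcpy(&h_count, d_count, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_count = -1;

    cudaFree(d_v);
    cudaFree(d_count);
    return h_count;
}

int main(void)
{
    int v[1000];
    for (int i = 0; i < 1000; i++)
        v[i] = i;
    printf("even values: %d\n", countEvenOnGpu(v, 1000));
    return 0;
}
```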
  16. CUDA – Example code for vector addition

      //=============================================================
      // Vector addition
      // Oak Ridge National Lab example
      // https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
      //=============================================================
      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>

      // CUDA kernel. Each thread takes care of one element of c.
      // Runs on the device (GPU) and is called by the host (CPU).
      __global__ void vecAdd(double *a, double *b, double *c, int n)
      {
          // Get our global thread ID
          int id = blockIdx.x*blockDim.x+threadIdx.x;

          // Make sure we do not go out of bounds
          if (id < n)
              c[id] = a[id] + b[id];
      }
  17. CUDA – Example code for vector addition

      int main( int argc, char* argv[] )
      {
          // Size of vectors
          int n = 100000;

          // Host input vectors
          double *h_a;
          double *h_b;
          // Host output vector
          double *h_c;

          // Device input vectors
          double *d_a;
          double *d_b;
          // Device output vector
          double *d_c;

          // Size, in bytes, of each vector
          size_t bytes = n*sizeof(double);
  18. CUDA – Example code for vector addition

          // Allocate memory for each vector on host
          h_a = (double*)malloc(bytes);
          h_b = (double*)malloc(bytes);
          h_c = (double*)malloc(bytes);

          // Allocate memory for each vector on GPU
          cudaMalloc(&d_a, bytes);
          cudaMalloc(&d_b, bytes);
          cudaMalloc(&d_c, bytes);

          int i;
          // Initialize vectors on host
          for( i = 0; i < n; i++ ) {
              h_a[i] = sin(i)*sin(i);
              h_b[i] = cos(i)*cos(i);
          }
  19. CUDA – Example code for vector addition

          // Copy host vectors to device
          cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
          cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

          int blockSize, gridSize;

          // Number of threads in each thread block
          blockSize = 1024;

          // Number of thread blocks in grid
          gridSize = (int)ceil((float)n/blockSize);

          // Execute the kernel
          vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

          // Copy array back to host
          cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
  20. CUDA – Example code for vector addition

          // Sum up vector c and print result divided by n;
          // this should equal 1 within error
          double sum = 0;
          for(i=0; i<n; i++)
              sum += h_c[i];
          printf("final result: %f\n", sum/n);

          // Release device memory
          cudaFree(d_a);
          cudaFree(d_b);
          cudaFree(d_c);

          // Release host memory
          free(h_a);
          free(h_b);
          free(h_c);

          return 0;
      }
  21. CUDA – Example code for vector addition
      Sometimes CUDA code that looks correct will output wrong results.
      • Check the machine for errors – access to the device (GPU) might not be granted.
      • Computation might only produce correct results on the host (CPU).

      //============================
      // ERROR CHECKING
      //============================
      #define cudaCheckErrors(msg) \
          do { \
              cudaError_t __err = cudaGetLastError(); \
              if (__err != cudaSuccess) { \
                  fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                          msg, cudaGetErrorString(__err), \
                          __FILE__, __LINE__); \
                  fprintf(stderr, "*** FAILED - ABORTING\n"); \
                  exit(1); \
              } \
          } while (0)

      // Place in memory allocation section
      cudaCheckErrors("cudamalloc fail");

      // Place in memory copy section
      cudaCheckErrors("cuda memcpy fail");

      // Place after the kernel launch and the copy back to host
      cudaCheckErrors("cudamemcpy or cuda kernel fail");
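
An alternative to polling `cudaGetLastError()` after each section is to check the status code that every CUDA runtime call returns. The following is a sketch under that assumption, not the approach used on the slide: `checkCuda`, `setValue` and `roundTrip` are hypothetical names. Kernel launches return no status directly, so the launch is followed by `cudaGetLastError()` for launch errors and `cudaDeviceSynchronize()` for errors raised while the kernel runs.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a call failed.
static void checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical kernel: a single thread stores one value.
__global__ void setValue(int *out, int value)
{
    *out = value;
}

// Sends a value to the GPU and back, checking every step.
int roundTrip(int value)
{
    int *d_out = NULL;
    int h_out = 0;

    checkCuda(cudaMalloc(&d_out, sizeof(int)), "cudaMalloc");

    setValue<<<1, 1>>>(d_out, value);
    checkCuda(cudaGetLastError(), "kernel launch");          // bad launch config, etc.
    checkCuda(cudaDeviceSynchronize(), "kernel execution");  // errors during the run

    checkCuda(cudaMemcpy(&h_out, d_out, sizeof(int),
                         cudaMemcpyDeviceToHost), "cudaMemcpy");
    checkCuda(cudaFree(d_out), "cudaFree");
    return h_out;
}

int main(void)
{
    printf("round trip returned %d\n", roundTrip(42));
    return 0;
}
```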
  22. Conclusion
      • CUDA provides outstanding access to GPU computational power.
      • CUDA is easy to learn.
      • CUDA lets you get the work done by coding in C.
      • However, translating code between host and device is a challenge.
  23. References and Sources
      [1] CUDA Programming Blog Tutorial
          http://cuda-programming.blogspot.com/2013/03/cuda-complete-complete-reference-on-cuda.html
      [2] Dr. Kenrick Mock, CUDA Tutorial
          http://www.math.uaa.alaska.edu/~afkjm/cs448/handouts/cuda-firstprograms.pdf
      [3] Parallel Programming Lecture Notes, Spring 2008, Johns Hopkins University
          http://hssl.cs.jhu.edu/wiki/lib/exe/fetch.php?media=randal:teach:cs420:cudatools.pdf
      [4] CUDA Super Computing Blog Tutorials
          http://supercomputingblog.com/cuda-tutorials/
      [5] Introduction to CUDA C Tutorial, Jason Sanders
          http://www.nvidia.com/content/GTC-2010/pdfs/2131_GTC2010.pdf
      [6] CUDA Overview Tutorial, Cliff Woolley, NVIDIA
          http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
      [7] Oak Ridge National Lab CUDA Vector Addition Example
          https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/
      [8] CUDA – Wikipedia
          http://en.wikipedia.org/wiki/CUDA
