Cuda intro



  1. Emergence of GPU systems and clusters for general purpose High Performance Computing
     ITCS 4145/5145, Nov 8, 2010 © Barry Wilkinson
  2. Graphics Processing Units (GPUs)
     In the last few years, GPUs have developed from graphics cards into a platform for HPC.
     There is now great interest in using GPUs for scientific high performance computing, and GPUs are being designed with that application in mind.
     CUDA programming: C/C++ with a few additional features and routines to support GPU programming. Uses the data-parallel paradigm.
  4. Graphics Processing Units (GPUs): Brief History
     Timeline, 1970 to 2010:
     • Atari 8-bit computer text/graphics chip
     • PC Professional Graphics Controller card
     • S3 graphics cards: single-chip 2D accelerator
     • OpenGL graphics API
     • Hardware-accelerated 3D graphics
     • DirectX graphics API
     • Playstation
     • GPUs with programmable shading: NVIDIA GeForce 3 (2001)
     • General-purpose computing on graphics processing units (GPGPU)
     • GPU computing
  5. NVIDIA products
     NVIDIA Corp. is the leader in GPUs for high performance computing.
     Timeline, 1993 to 2010:
     • Established by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem
     • NV1
     • GeForce 1
     • GeForce 2 series
     • GeForce FX series
     • GeForce 8 series
     • GeForce 8800 (G80): NVIDIA's first GPU with general purpose processors
     • GeForce 200 series: GTX 260/275/280/285/295
     • GeForce 400 series: GTX 460/465/470/475/480/485
     • Quadro
     • Tesla: C870, S870, C1060, S1070, C2050, …
     • Tesla 2050 GPU has 448 thread processors
     • Fermi
     • Kepler (2011)
     • Maxwell (2013)
  6. GPU performance gains over CPUs
     [Chart: GFLOPs (0 to 1400) against date (9/22/2002 to 7/27/2009). NVIDIA GPU series: NV30, NV40, G70, G80, GT200, T12. Intel CPU series: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]
     Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
  7. CPU-GPU architecture evolution
     Co-processors are a very old idea that appeared in the 1970s and 1980s, with floating point co-processors attached to microprocessors that did not then have floating point capability.
     These co-processors simply executed floating point instructions that were fetched from memory.
     Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games.
     This led to graphics processing units (GPUs) attached to the CPU to create the video display.
     [Diagram, early design: CPU, memory, graphics card, display]
  8. Birth of general purpose programmable GPU
     Dedicated pipeline (late 1990s to early 2000s)
     By the late 1990s, graphics chips needed to support 3-D graphics, especially for games, and graphics APIs such as DirectX and OpenGL.
     Graphics chips generally had a pipeline structure, with individual stages performing specialized operations, finally leading to loading the frame buffer for display.
     Individual stages may have access to graphics memory for storing intermediate computed data.
     [Diagram: input stage, vertex shader stage, geometry shader stage, rasterizer stage, pixel shading stage, frame buffer; stages connect to graphics memory]
  9. GeForce 6 Series Architecture (2004-5)
     From GPU Gems 2, Copyright 2005 by NVIDIA Corporation
  10. General-Purpose GPU designs
      High performance pipelines call for high-speed (IEEE) floating point operations.
      People had been trying to use GPU cards to speed up scientific computations, known as GPGPU (general-purpose computing on graphics processing units). This was difficult to do with specialized graphics pipelines, but possible.
      By the mid 2000s, it was recognized that individual stages of the graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradigm).
  11. GPU design for general high performance computing
      2006: first GPU for general high performance computing as well as graphics processing, the NVIDIA G80 chip/GeForce 8800 card.
      Unified processors that could perform vertex, geometry, pixel, and general computing operations.
      Could now write programs in C rather than through graphics APIs.
      Single-instruction multiple-thread (SIMT) programming model.
  13. Evolving GPU design
      NVIDIA Fermi architecture (announced Sept 2009)
      • 512 CUDA cores (stream processors)
      • Organized as 16 streaming multiprocessors (SMs), each having 32 cores
      • 3 GB or 6 GB GDDR5 memory
      • Many innovations, including L1/L2 caches, unified device memory addressing, ECC memory, …
      First implementation: Tesla 20 series (single-chip C2050/2070, 4-chip S2050/2070)
      3 billion transistor chip?
      New Fermi chips planned (GT 300, GeForce 400 series)
  14. Fermi Streaming Multiprocessor (SM)*
      * Whitepaper, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
  15. CUDA (Compute Unified Device Architecture)
      Architecture and programming model, introduced by NVIDIA in 2007.
      Enables GPUs to execute programs written in C.
      Within C programs, call SIMT “kernel” routines that are executed on the GPU.
      A CUDA syntax extension to C identifies a routine as a kernel.
      Very easy to learn, although getting the highest possible execution performance requires understanding of the hardware architecture.
  16. Programming Model
      • Once compiled, a program has code executed on the CPU and code executed on the GPU
      • Separate memories on CPU and GPU
      Need to
      • Explicitly transfer data from CPU to GPU for GPU computation, and
      • Explicitly transfer results in GPU memory back to CPU memory
      [Diagram: CPU with CPU main memory, GPU with GPU global memory; copy from CPU to GPU, copy from GPU to CPU]
  17. Basic CUDA program structure
           int main (int argc, char **argv ) {
              1. Allocate memory space in device (GPU) for data
              2. Allocate memory space in host (CPU) for data
              3. Copy data to GPU
              4. Call “kernel” routine to execute on GPU
                 (with CUDA syntax that defines number of threads and their physical structure)
              5. Transfer results from GPU to CPU
              6. Free memory space in device (GPU)
              7. Free memory space in host (CPU)
              return 0;
           }
  18. 1. Allocating memory space in “device” (GPU) for data
      Use CUDA malloc routines:
           int size = N * sizeof(int);      // space for N integers
           int *devA, *devB, *devC;         // devA, devB, devC ptrs
           cudaMalloc( (void**)&devA, size );
           cudaMalloc( (void**)&devB, size );
           cudaMalloc( (void**)&devC, size );
      Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
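The allocation calls above ignore the status that cudaMalloc returns. A minimal sketch of checking it follows; the helper name allocDeviceInts is hypothetical and not part of the slides, while cudaMalloc, cudaGetErrorString and cudaFree are CUDA runtime routines.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // same element count the slides use

// Hypothetical helper (not from the slides): allocate n ints on the device.
// Prints the CUDA error text and returns NULL if the allocation fails.
int *allocDeviceInts(size_t n) {
    int *ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&ptr, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return ptr;
}

int main(void) {
    int *devA = allocDeviceInts(N);
    if (devA == NULL) return 1;
    printf("allocated %zu bytes on the device\n", N * sizeof(int));
    cudaFree(devA);  // release the device memory again
    return 0;
}
```

Wrapping the check in one helper keeps the main program close to the seven-step structure while still reporting a missing or exhausted device.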
  19. 2. Allocating memory space in “host” (CPU) for data
      Use regular C malloc routines:
           int *a, *b, *c;
           …
           a = (int*)malloc(size);
           b = (int*)malloc(size);
           c = (int*)malloc(size);
      or statically declare variables:
           #define N 256
           …
           int a[N], b[N], c[N];
  20. 3. Transferring data from host (CPU) to device (GPU)
      Use CUDA routine cudaMemcpy:
           cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
           cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
      where devA and devB are pointers to the destination in the device and A and B are pointers to host data.
  21. 4. Declaring “kernel” routine to execute on device (GPU)
      CUDA introduces a syntax addition to C: triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:
           myKernel<<< n, m >>>(arg1, … );
      n and m define the organization of thread blocks and of threads in a block.
      For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
      arg1, … are the arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
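A kernel call returns no status of its own, so a bad launch configuration can go unnoticed. The sketch below shows one way to check a launch with n = 1 and m = N; the kernel fillIndex and the helper launchAndCheck are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

// Stand-in kernel: each thread stores its own index.
__global__ void fillIndex(int *out) {
    out[threadIdx.x] = threadIdx.x;
}

// Hypothetical helper: launch one block of N threads and report any error.
// Returns 0 on success, 1 on failure.
int launchAndCheck(int *devOut) {
    fillIndex<<<1, N>>>(devOut);            // n = 1 block, m = N threads
    cudaError_t err = cudaGetLastError();   // reports an invalid launch configuration
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // waits for the kernel; reports run-time faults
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main(void) {
    int *devOut = NULL;
    if (cudaMalloc((void **)&devOut, N * sizeof(int)) != cudaSuccess) return 1;
    int status = launchAndCheck(devOut);
    cudaFree(devOut);
    printf("%s\n", status == 0 ? "kernel ran" : "kernel failed");
    return status;
}
```

Requesting more threads per block than the device supports is the kind of mistake that cudaGetLastError would report here.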
  22. Declaring a Kernel Routine
      A kernel is defined using the CUDA specifier __global__.
      Example: adding two vectors, A and B
           #define N 256
           __global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
              int i = threadIdx.x;      // CUDA structure that provides thread ID in block
              C[i] = A[i] + B[i];
           }
           int main() {
              // allocate device memory &
              // copy data to device
              // device mem. ptrs devA, devB, devC
              vecAdd<<<1, N>>>(devA, devB, devC);   // Grid of one block, block has N threads
              …
           }
      Each of the N threads performs one pair-wise addition:
           Thread 0:   devC[0] = devA[0] + devB[0];
           Thread 1:   devC[1] = devA[1] + devB[1];
           …
           Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
      Loosely derived from CUDA C programming guide, v 3.2, 2010, NVIDIA
  23. 5. Transferring data from device (GPU) to host (CPU)
      Use CUDA routine cudaMemcpy:
           cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
      where devC is a pointer in the device and C is a pointer in the host.
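The two transfer directions can be tried without any kernel by copying a buffer to the device and straight back. This is a minimal sketch; the helper roundTripMismatches and its test data are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 256

// Hypothetical helper: copy n ints host -> device -> host.
// Returns the number of elements that differ, or -1 on an allocation or CUDA error.
int roundTripMismatches(int n) {
    size_t size = n * sizeof(int);
    int *src = (int *)malloc(size);
    int *dst = (int *)malloc(size);
    int *dev = NULL;
    int bad = -1;
    if (src && dst && cudaMalloc((void **)&dev, size) == cudaSuccess) {
        for (int i = 0; i < n; i++) { src[i] = i * i; dst[i] = -1; }
        if (cudaMemcpy(dev, src, size, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(dst, dev, size, cudaMemcpyDeviceToHost) == cudaSuccess) {
            bad = 0;
            for (int i = 0; i < n; i++)
                if (dst[i] != src[i]) bad++;
        }
        cudaFree(dev);
    }
    free(src);
    free(dst);
    return bad;
}

int main(void) {
    int bad = roundTripMismatches(N);
    printf("mismatches after round trip: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Note that both host arguments are plain pointers, passed without the address-of operator, and that the last argument selects the direction of the copy.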
  24. 6. Free memory space in “device” (GPU)
      Use CUDA cudaFree routine:
           cudaFree( devA );
           cudaFree( devB );
           cudaFree( devC );
  25. 7. Free memory space in (CPU) host (if CPU memory allocated with malloc)
      Use regular C free routine to deallocate memory if previously allocated with malloc:
           free( a );
           free( b );
           free( c );
  26. Complete CUDA program
      Adding two vectors, A and B. N elements in A and B, and N threads (without code to load arrays with data).
           #define N 256
           __global__ void vecAdd(int *A, int *B, int *C) {
              int i = threadIdx.x;
              C[i] = A[i] + B[i];
           }
           int main (int argc, char **argv ) {
              int size = N * sizeof(int);
              int *a, *b, *c, *devA, *devB, *devC;
              cudaMalloc( (void**)&devA, size );
              cudaMalloc( (void**)&devB, size );
              cudaMalloc( (void**)&devC, size );
              a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size);
              cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
              cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
              vecAdd<<<1, N>>>(devA, devB, devC);
              cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
              cudaFree( devA ); cudaFree( devB ); cudaFree( devC );
              free( a ); free( b ); free( c );
              return (0);
           }
      Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20,
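The slide program leaves out loading the arrays and checking the result. The sketch below fills in both under stated assumptions: the test data (a[i] = i, b[i] = 2i) and the driver name runVecAdd are invented for illustration, and the kernel is the same as the slide's apart from const qualifiers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 256

__global__ void vecAdd(const int *A, const int *B, int *C) {
    int i = threadIdx.x;   // one block, so the thread ID is the element index
    C[i] = A[i] + B[i];
}

// Hypothetical driver: returns the number of wrong elements, or -1 on error.
int runVecAdd(void) {
    size_t size = N * sizeof(int);
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    int *devA = NULL, *devB = NULL, *devC = NULL;
    int wrong = -1;
    if (!a || !b || !c) { free(a); free(b); free(c); return -1; }

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // assumed test data

    if (cudaMalloc((void **)&devA, size) == cudaSuccess &&
        cudaMalloc((void **)&devB, size) == cudaSuccess &&
        cudaMalloc((void **)&devC, size) == cudaSuccess) {
        cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);
        vecAdd<<<1, N>>>(devA, devB, devC);
        // This blocking copy waits for the kernel to finish before reading devC.
        if (cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost) == cudaSuccess) {
            wrong = 0;
            for (int i = 0; i < N; i++)
                if (c[i] != 3 * i) wrong++;
        }
    }
    cudaFree(devA); cudaFree(devB); cudaFree(devC);   // freeing a NULL pointer is a no-op
    free(a); free(b); free(c);
    return wrong;
}

int main(void) {
    int wrong = runVecAdd();
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

With nvcc installed this could be built with a command such as nvcc -o vecadd vecadd.cu, where the file name is an assumption.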
  27. CUDA SIMT Thread Structure
      Grid of blocks: can be 1 or 2 dimensions. Block of threads: can be 1, 2 or 3 dimensions.
      Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
      Linked to internal organization. Threads in one block execute together.
      CUDA C programming guide, v 3.2, 2010, NVIDIA
  28. Defining Grid/Block Structure
      Need to provide each kernel call with values for two key structures:
      • Number of blocks in each dimension
      • Threads per block in each dimension
           myKernel<<< numBlocks, threadsperBlock >>>(arg1, … );
      numBlocks: number of blocks in the grid in each dimension (1D or 2D). An integer would define a 1D grid of that size; otherwise use a CUDA structure, see next.
      threadsperBlock: number of threads in a block in each dimension (1D, 2D, or 3D). An integer would define a 1D block of that size; otherwise use a CUDA structure, see next.
      Notes: Number of blocks is not limited by the specific GPU. Number of threads per block is limited by the specific GPU.
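The per-GPU limit on threads per block can be read at run time instead of being hard-coded. A minimal sketch using the runtime routine cudaGetDeviceProperties follows; the helper name maxThreadsPerBlock and the choice of device 0 are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: largest block size the given device accepts, or -1 on error.
int maxThreadsPerBlock(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.maxThreadsPerBlock;
}

int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 assumed
        fprintf(stderr, "no usable CUDA device found\n");
        return 1;
    }
    printf("device:                %s\n", prop.name);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("warp size:             %d\n", prop.warpSize);
    return 0;
}
```

The reported maximum grid dimensions show that the block count also has an upper bound per dimension, although it is far larger than the limit on threads per block.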
  29. Built-in CUDA data types and structures
      CUDA provides built-in variables and structures to define the number of blocks of threads in a grid in each dimension and the number of threads in a block in each dimension.
      CUDA Vector Types/Structures
      uint3 and dim3 can be considered essentially as CUDA-defined structures of unsigned integers x, y, z, i.e.
           struct uint3 { unsigned int x, y, z; };
           struct dim3 { unsigned int x, y, z; };
      Used to define grid of blocks and threads, see next.
      Unassigned dim3 components are automatically set to 1.
      There are other CUDA vector types.
  30. Built-in Variables for Grid/Block Sizes
      dim3 gridDim: size of grid, gridDim.x * gridDim.y (z not used)
      dim3 blockDim: size of block, blockDim.x * blockDim.y * blockDim.z
      Example
           dim3 grid(16, 16);     // Grid -- 16 x 16 blocks
           dim3 block(32, 32);    // Block -- 32 x 32 threads
           myKernel<<<grid, block>>>(...);
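The default of 1 for unassigned dim3 components can be seen from host code alone, with no kernel launch. In this sketch the helper totalThreads is a hypothetical name; the grid and block sizes are the ones from the example above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of threads a launch <<<grid, block>>> would create.
unsigned long long totalThreads(dim3 grid, dim3 block) {
    return (unsigned long long)grid.x * grid.y * grid.z *
           block.x * block.y * block.z;
}

int main(void) {
    dim3 grid(16, 16);    // z is not given, so it defaults to 1
    dim3 block(32, 32);   // z is not given, so it defaults to 1
    printf("grid  = %u x %u x %u\n", grid.x, grid.y, grid.z);      // 16 x 16 x 1
    printf("block = %u x %u x %u\n", block.x, block.y, block.z);   // 32 x 32 x 1
    printf("total threads = %llu\n", totalThreads(grid, block));   // 262144
    return 0;
}
```

A 32 x 32 block is 1024 threads, which exceeds the per-block limit of some older GPUs, so a real launch with these sizes is worth checking against the device's limit.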
  31. Built-in Variables for Grid/Block Indices
      uint3 blockIdx: block index within grid, blockIdx.x, blockIdx.y (z not used)
      uint3 threadIdx: thread index within block, threadIdx.x, threadIdx.y, threadIdx.z
      Full global thread ID in x and y dimensions can be computed by:
           x = blockIdx.x * blockDim.x + threadIdx.x;
           y = blockIdx.y * blockDim.y + threadIdx.y;
  32. Example -- x direction
      A 1-D grid and 1-D block: 4 blocks, each having 8 threads
           blockIdx.x = 0     threadIdx.x: 0 1 2 3 4 5 6 7
           blockIdx.x = 1     threadIdx.x: 0 1 2 3 4 5 6 7
           blockIdx.x = 2     threadIdx.x: 0 1 2 3 4 5 6 7
           blockIdx.x = 3     threadIdx.x: 0 1 2 3 4 5 6 7
      gridDim = 4 x 1
      blockDim = 8 x 1
      Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
                       = 3 * 8 + 2 = thread 26 with linear global addressing (threadIdx.x = 2 in block 3)
      Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.
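The worked example can be reproduced on a device by letting every thread record which block and thread it is. This is a sketch only; the kernel recordOwner and the helper ownerOf are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKS 4
#define THREADS 8

// Each thread writes its block and thread index at its global ID.
__global__ void recordOwner(int *blockOf, int *threadOf) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    blockOf[gid] = blockIdx.x;
    threadOf[gid] = threadIdx.x;
}

// Hypothetical helper: find which block/thread owns global ID gid.
// Returns 0 on success, 1 on failure.
int ownerOf(int gid, int *block, int *thread) {
    const int total = BLOCKS * THREADS;
    int hostBlock[BLOCKS * THREADS], hostThread[BLOCKS * THREADS];
    int *devBlock = NULL, *devThread = NULL;
    int status = 1;
    if (gid < 0 || gid >= total) return 1;
    if (cudaMalloc((void **)&devBlock, total * sizeof(int)) == cudaSuccess &&
        cudaMalloc((void **)&devThread, total * sizeof(int)) == cudaSuccess) {
        recordOwner<<<BLOCKS, THREADS>>>(devBlock, devThread);
        if (cudaMemcpy(hostBlock, devBlock, total * sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess &&
            cudaMemcpy(hostThread, devThread, total * sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            *block = hostBlock[gid];
            *thread = hostThread[gid];
            status = 0;
        }
    }
    cudaFree(devBlock);
    cudaFree(devThread);
    return status;
}

int main(void) {
    int block = -1, thread = -1;
    if (ownerOf(26, &block, &thread) != 0) return 1;
    printf("global ID 26 -> block %d, thread %d\n", block, thread);   // block 3, thread 2
    return 0;
}
```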
  33. Code example with a 1-D grid and 1-D blocks
           #define N 2048   // size of vectors
           #define T 256    // number of threads per block
           __global__ void vecAdd(int *A, int *B, int *C) {
              int i = blockIdx.x*blockDim.x + threadIdx.x;
              C[i] = A[i] + B[i];
           }
           int main (int argc, char **argv ) {
              …
              vecAdd<<<N/T, T>>>(devA, devB, devC);   // assumes N/T is an integer
              …
              return (0);
           }
      N/T is the number of blocks needed to map each vector across the grid, one element of each vector per thread.
  34. If N/T is not necessarily an integer:
           #define N 2048   // size of vectors
           #define T 240    // number of threads per block
           __global__ void vecAdd(int *A, int *B, int *C) {
              int i = blockIdx.x*blockDim.x + threadIdx.x;
              if (i < N) C[i] = A[i] + B[i];   // allows for more threads than vector elements, some unused
           }
           int main (int argc, char **argv ) {
              int blocks = (N + T - 1) / T;    // efficient way of rounding up to next integer
              …
              vecAdd<<<blocks, T>>>(devA, devB, devC);
              …
              return (0);
           }
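The effect of rounding the block count up can be made visible by counting how many threads pass the i < N guard and how many do not. The sketch below uses the slide's N = 2048 and T = 240; the kernel countThreads and the helper blocksFor are hypothetical, and atomicAdd is the CUDA atomic-increment routine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 2048   // size of vectors
#define T 240    // number of threads per block

// counts[0]: threads that own an element; counts[1]: surplus threads.
__global__ void countThreads(int *counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&counts[0], 1);
    else       atomicAdd(&counts[1], 1);
}

// Hypothetical helper: integer ceiling of n / t.
int blocksFor(int n, int t) {
    return (n + t - 1) / t;
}

int main(void) {
    int counts[2] = {0, 0};
    int *devCounts = NULL;
    int blocks = blocksFor(N, T);   // (2048 + 239) / 240 = 9
    if (cudaMalloc((void **)&devCounts, sizeof(counts)) != cudaSuccess) return 1;
    cudaMemcpy(devCounts, counts, sizeof(counts), cudaMemcpyHostToDevice);
    countThreads<<<blocks, T>>>(devCounts, N);
    if (cudaMemcpy(counts, devCounts, sizeof(counts),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        cudaFree(devCounts);
        return 1;
    }
    cudaFree(devCounts);
    printf("blocks = %d, working threads = %d, idle threads = %d\n",
           blocks, counts[0], counts[1]);
    return 0;
}
```

Nine blocks of 240 threads give 2160 threads in total, so 2048 do work and the remaining 112 in the last block fail the guard and do nothing.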
  35. Example using 2-D grid and 2-D blocks
      Adding two arrays
           #define N 2048   // size of arrays
           __global__ void addMatrix (int *a, int *b, int *c) {
              int i = blockIdx.x*blockDim.x + threadIdx.x;
              int j = blockIdx.y*blockDim.y + threadIdx.y;
              int index = i + j * N;
              if ( i < N && j < N) c[index] = a[index] + b[index];
           }
           int main() {
              ...
              dim3 dimBlock (16,16);
              dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
              addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
              …
           }
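A runnable version of the array addition might look like the sketch below. It is an illustration under assumptions: N is reduced to 512 to keep the run short, the grid size is rounded up so that N need not be a multiple of 16, and the driver name runAddMatrix and the test data are invented.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 512   // assumed smaller than the slide's 2048 to keep the run short

__global__ void addMatrix(const int *a, const int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int index = i + j * N;                           // row-major flattening
    if (i < N && j < N) c[index] = a[index] + b[index];
}

// Hypothetical driver: returns the number of wrong elements, or -1 on error.
int runAddMatrix(void) {
    size_t count = (size_t)N * N, size = count * sizeof(int);
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    int *devA = NULL, *devB = NULL, *devC = NULL;
    int wrong = -1;
    if (!a || !b || !c) { free(a); free(b); free(c); return -1; }
    for (size_t k = 0; k < count; k++) { a[k] = (int)k; b[k] = 2 * (int)k; }

    if (cudaMalloc((void **)&devA, size) == cudaSuccess &&
        cudaMalloc((void **)&devB, size) == cudaSuccess &&
        cudaMalloc((void **)&devC, size) == cudaSuccess) {
        cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, b, size, cudaMemcpyHostToDevice);
        dim3 dimBlock(16, 16);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
        if (cudaMemcpy(c, devC, size, cudaMemcpyDeviceToHost) == cudaSuccess) {
            wrong = 0;
            for (size_t k = 0; k < count; k++)
                if (c[k] != 3 * (int)k) wrong++;
        }
    }
    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    free(a); free(b); free(c);
    return wrong;
}

int main(void) {
    int wrong = runAddMatrix();
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Both the grid and the blocks are two-dimensional here, so blockIdx.y and threadIdx.y take part in the index calculation.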
  36. Memory Structure within GPU
      • Local private memory: per thread
      • Shared memory: per block
      • Global memory: per application
      GPU executes one or more kernel grids.
      Streaming multiprocessor (SM) executes one or more thread blocks.
      CUDA cores and other execution units in the SM execute threads.
      SM executes threads in groups of 32 threads called a warp.*
      * Whitepaper, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
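The three memory levels can be seen together in one small kernel. In this sketch, which is an illustration and not taken from the slides, each block reverses its own section of an array through per-block shared memory; the names reverseInBlock and runReverse, and the sizes, are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256   // total elements (assumed)
#define T 64    // threads per block (assumed)

__global__ void reverseInBlock(int *data) {
    __shared__ int tile[T];                 // shared memory: one copy per block
    int t = threadIdx.x;                    // private to each thread
    int base = blockIdx.x * blockDim.x;     // start of this block's section
    tile[t] = data[base + t];               // global memory -> shared memory
    __syncthreads();                        // wait until the whole tile is loaded
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical driver: returns the number of wrong elements, or -1 on error.
int runReverse(void) {
    int host[N];
    int *dev = NULL;
    int wrong = -1;
    for (int i = 0; i < N; i++) host[i] = i;
    if (cudaMalloc((void **)&dev, sizeof(host)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    reverseInBlock<<<N / T, T>>>(dev);
    if (cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) == cudaSuccess) {
        wrong = 0;
        for (int b = 0; b < N / T; b++)
            for (int t = 0; t < T; t++)
                if (host[b * T + t] != b * T + (T - 1 - t)) wrong++;
    }
    cudaFree(dev);
    return wrong;
}

int main(void) {
    int wrong = runReverse();
    printf("wrong elements: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

The __syncthreads() barrier matters because a thread reads a tile entry written by a different thread of the same block; threads in other blocks have their own tile and are not involved.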
  37. Compiling code
      Linux
      Command line. CUDA provides nvcc (an NVIDIA “compiler-driver”). Use instead of gcc:
           nvcc -O3 -o <exe> <input> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart
      Separates the code for the CPU from the code for the GPU and compiles both. Needs a regular C compiler installed for the CPU.
      Make files also provided.
      Windows
      NVIDIA suggests using Microsoft Visual Studio.
  38. Debugging
      NVIDIA has recently developed a debugging tool called Parallel Nsight.
      Available for use with Visual Studio.
  39. GPU Clusters
      GPU systems for HPC
      • GPU clusters
      • GPU grids
      • GPU clouds
      With the advent of GPUs for scientific high performance computing, compute clusters can now incorporate GPUs, greatly increasing their compute capability.
  41. Maryland CPU-GPU Cluster Infrastructure
  42. Hybrid Programming Model for Clusters having Multicore Shared Memory Processors
      Combine MPI between nodes and OpenMP within nodes (or other thread libraries such as Pthreads).
      MPI/OpenMP compilation:
           mpicc -o mpi_out mpi_test.c -fopenmp
  43. Hybrid Programming Model for Clusters having Multicore Shared Memory Processors and GPUs
      Combine OpenMP and CUDA on one node for CPU and GPU respectively, with MPI between nodes.
      Note: all three are C-based, so can be compiled together.
      NVIDIA does provide a sample OpenMP/CUDA
  45. Intel’s response to Nvidia and GPUs
  46. Questions