Intro to GPGPU Programming with Cuda

Rob Gillen Intro to GPGPU Programing With CUDA

CodeStock is proudly partnered with: RecruitWise and Staff with Excellence - www.recruitwise.jobs Send instant feedback on this session via Twitter: Send a direct message with the room number to @CodeStock d codestock 411 This guy is Amazing! For more information on sending feedback using Twitter while at CodeStock, please see the “CodeStock README” in your CodeStock guide.

Intro to GPGPU Programming with CUDA Rob Gillen

Welcome! Goals: Overview of GPGPU with CUDA “Vision Casting” for how you can use GPUs to improve your application Outline Why GPGPUs? Applications Tooling Hands-On: Matrix Multiplication Rating: http://spkr8.com/t/7714

CPU vs. GPU GPU devotes more transistors to data processing

NVIDIA Fermi ~1.5TFLOPS (SP)/~800GFLOPS (DP) 230 GB/s DRAM Bandwidth

Motivation FLoating-Point Operations per Second (FLOPS) and memory bandwidth For the CPU and GPU

Example: Sparse Matrix-Vector CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007

Rayleigh-Bénard Results Double precision 384 x 384 x 192 grid (max that fits in 4GB) Vertical slice of temperature at y=0 Transition from stratified (left) to turbulent (right) Regime depends on Rayleigh number: Ra = gαΔT/κν 8.5x speedup versus Fortran code running on 8-core 2.5 GHz Xeon

G80 Characteristics 367 GFLOPS peak performance (25-50 times of current high-end microprocessors) 265 GFLOPS sustained for apps such as VMD Massively parallel, 128 cores, 90W Massively threaded, sustains 1000s of threads per app 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

Applications Exciting applications in future mass computing market have been traditionally considered “supercomputing applications” Molecular dynamics simulation, Video and audio codingand manipulation, 3D imaging and visualization, Consumer game physics, and virtual reality products These “Super-apps” represent and model physical, concurrent world Various granularities of parallelism exist, but… programming model must not hinder parallel implementation data delivery needs careful management

*Not* for all applications SPMD (Single Program, Multiple Data) are best (data parallel) Operations need to be of sufficient size to overcome overhead Think Millions of operations.

Tooling VS 2010 C++ (Express is OK… sortof.) NVIDIA CUDA-Capable GPU NVIDIA CUDA Toolkit (v4+) NVIDIA CUDA Tools (v4+) GPU Computing SDK NVIDIA Parallel Insight

Before we get too excited… Host vs Device Kernels __global__ __device__ __host__ Thread/Block Control <<<x, y>>> Multi-dimensioned coordinate objects Memory Management/Movement Thread Management – think 1000’s or 1,000,000’s

Block IDs and Threads Each thread uses IDs to decide what data to work on Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D Simplifies memoryaddressing when processingmultidimensional data Image processing

CUDA Thread Block All threads in a block execute the same kernel program (SPMD) Programmer declares block: Block size 1 to 512 concurrent threads Block shape 1D, 2D, or 3D Block dimensions in threads Threads have thread id numbers within block Thread program uses thread id to select work and address shared data Threads in the same block share data and synchronize while doing their share of the work Threads in different blocks cannot cooperate Each block can execute in any order relative to other blocs! CUDA Thread Block Thread Id #:0 1 2 3 … m Thread program

Transparent Scalability Hardware is free to assigns blocks to any processor at any time A kernel scales across any number of parallel processors Kernel grid Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 time Each block can execute in any order relative to other blocks.

A Simple Running ExampleMatrix Multiplication A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs Leave shared memory usage until later Local, register usage Thread ID usage Memory data transfer API between host and device Assume square matrix for simplicity

Programming Model:Square Matrix Multiplication Example P = M * N of size WIDTH x WIDTH Without tiling: One thread calculates one element of P M and N are loaded WIDTH timesfrom global memory N WIDTH M P WIDTH WIDTH WIDTH 27

Memory Layout of Matrix in C M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3 M M0,2 M0,1 M0,0 M0,3 M1,1 M1,0 M1,2 M1,3 M2,1 M2,0 M2,2 M2,3 M3,1 M3,0 M3,2 M3,3

Simple Matrix Multiplication (CPU) void MatrixMulOnHost(float* M, float* N, float* P, int Width)‏ { for (int i = 0; i < Width; ++i) {‏ for (int j = 0; j < Width; ++j) { float sum = 0; for (int k = 0; k < Width; ++k) { float a = M[i * width + k]; float b = N[k * width + j]; sum += a * b; } P[i * Width + j] = sum; } } } N k j WIDTH M P i WIDTH k 29 WIDTH WIDTH

Simple Matrix Multiplication (GPU) void MatrixMulOnDevice(float* M, float* N, float* P, int Width)‏ { intsize = Width * Width * sizeof(float); float* Md, Nd, Pd; … // 1. Allocate and Load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice); cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice); // Allocate P on the device cudaMalloc(&Pd, size);

Simple Matrix Multiplication (GPU) // 2. Kernel invocation code – to be shown later … // 3. Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost); // Free device matrices cudaFree(Md); cudaFree(Nd); cudaFree(Pd); }

Kernel Function // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;

Kernel Function (contd.) for (int k = 0; k < Width; ++k)‏ { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue+= Melement * Nelement; } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; } Nd k WIDTH tx Md Pd ty ty WIDTH tx k 33 WIDTH WIDTH

Kernel Function (full) // Matrix multiplication kernel – per thread code __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)‏ { // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; for (int k = 0; k < Width; ++k)‏ { float Melement = Md[threadIdx.y*Width+k]; float Nelement = Nd[k*Width+threadIdx.x]; Pvalue += Melement * Nelement; } Pd[threadIdx.y*Width+threadIdx.x] = Pvalue; }

Kernel Invocation (Host Side) // Setup the execution configuration dim3 dimGrid(1, 1); dim3 dimBlock(Width, Width); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used Nd Grid 1 One Block of threads compute matrix Pd Each thread computes one element of Pd Each thread Loads a row of matrix Md Loads a column of matrix Nd Perform one multiply and addition for each pair of Md and Nd elements Compute to off-chip memory access ratio close to 1:1 (not very high)‏ Size of matrix limited by the number of threads allowed in a thread block Block 1 Thread (2, 2)‏ 48 WIDTH Pd Md

Handling Arbitrary Sized Square Matrices Have each 2D thread block to compute a (TILE_WIDTH)2 sub-matrix (tile) of the result matrix Each has (TILE_WIDTH)2 threads Generate a 2D Grid of (WIDTH/TILE_WIDTH)2 blocks Nd WIDTH Md Pd by You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than max grid size (64K)! TILE_WIDTH ty WIDTH bx tx 37 WIDTH WIDTH

Small Example Nd1,0 Nd0,0 Block(0,0) Block(1,0) Nd1,1 Nd0,1 P1,0 P0,0 P2,0 P3,0 Nd1,2 Nd0,2 TILE_WIDTH = 2 P0,1 P1,1 P3,1 P2,1 Nd0,3 Nd1,3 P0,2 P2,2 P3,2 P1,2 P0,3 P2,3 P3,3 P1,3 Pd1,0 Md2,0 Md1,0 Md0,0 Md3,0 Pd0,0 Pd2,0 Pd3,0 Md1,1 Md0,1 Md2,1 Md3,1 Pd0,1 Pd1,1 Pd3,1 Pd2,1 Block(1,1) Block(0,1) Pd0,2 Pd2,2 Pd3,2 Pd1,2 Pd0,3 Pd2,3 Pd3,3 Pd1,3

Cleanup Topics Memory Management Pinned Memory (Zero-Transfer) Portable Pinned Memory Multi-GPU Wrappers (Python, Java, .NET) Kernels Atomics Thread Synchronization (staged reductions) NVCC

Questions? rob@gillenfamily.net@argodev http://rob.gillenfamily.netRate: http://spkr8.com/t/7714

Intro to GPGPU Programming with Cuda

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Intro to GPGPU Programming with Cuda

Similar to Intro to GPGPU Programming with Cuda (20)

More from Rob Gillen

More from Rob Gillen (20)

Recently uploaded

Recently uploaded (20)

Intro to GPGPU Programming with Cuda

Editor's Notes