C for Cuda - Small Introduction to GPU computing


This talk is a short introduction to CUDA and GPU computing, meant to help anyone who reads it get started with this technology.
First, we introduce the GPU from the hardware point of view: what is it? How is it built? Why use it for general-purpose computing (GPGPU)? How does it differ from the CPU?
The second part of the presentation deals with the software abstraction and the use of CUDA to implement parallel computing. The software architecture, the kernels and the different types of memory are covered in this part.
Finally, to illustrate what was presented previously, code examples are given. These examples also highlight the issues that may occur when using parallel computing.


  1. C for CUDA - Small introduction to GPU computing
     Patrick Jamet, François Regnoult, Agathe Valette
     May 6th 2013 - www.ipal.cnrs.fr
  2. Summary
     ‣ Introduction
     ‣ GPUs
       - Hardware
       - Software abstraction: Grids, Blocks, Threads
     ‣ Kernels
     ‣ Memory
       - Global, Constant, Texture, Shared, Local, Register
     ‣ Program Example
     ‣ Conclusion
  3. PRESENTATION OF GPUS
     General and hardware considerations
  4. What are GPUs?
     ‣ Processors designed to handle graphic computations and scene generation
       - Optimized for parallel computation
     ‣ GPGPU: the use of GPUs for general-purpose computing instead of graphic operations like shading, texture mapping, etc.
  5. Why use GPUs for general purposes?
     ‣ CPUs are suffering from:
       - Performance growth slow-down
       - Limits to exploiting instruction-level parallelism
       - Power and thermal limitations
     ‣ GPUs are found in all PCs
     ‣ GPUs are energy efficient
       - Performance per watt
  6. Why use GPUs for general purposes?
     ‣ Modern GPUs provide extensive resources:
       - Massive parallelism and many processing cores
       - Flexible and increased programmability
       - High floating-point precision
       - High arithmetic intensity
       - High memory bandwidth
       - Inherent parallelism
  7. CPU architecture
  8. How do CPUs and GPUs differ?
     ‣ Latency: delay between a request and the first data return
       - e.g. delay between a request for a texture read and the texture data returning
     ‣ Throughput: amount of work / amount of time
     ‣ CPU: low-latency, low-throughput
     ‣ GPU: high-latency, high-throughput
       - Processing millions of pixels in a single frame
       - Little cache: more transistors dedicated to horsepower
  9. How do CPUs and GPUs differ?
     Task Parallelism (CPU)
     ‣ Multiple tasks map to multiple threads
     ‣ Tasks run different instructions
     ‣ 10s of heavyweight threads on 10s of cores
     ‣ Each thread managed and scheduled explicitly
     Data Parallelism (GPU)
     ‣ SIMD model
     ‣ Same instruction on different data
     ‣ 10,000s of lightweight threads working on 100s of cores
     ‣ Threads managed and scheduled by hardware
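The data-parallel model on the GPU side can be illustrated with the classic vector-addition kernel: every thread runs the same instruction on a different element. This is a minimal sketch; the names (`vecAdd`, `d_a`, `d_b`, `d_c`) and the block size of 256 are illustrative, not from the slides.

```cuda
#include <cuda_runtime.h>

// Each thread computes one output element: same instruction, different data.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may have surplus threads
        c[i] = a[i] + b[i];
}

// Host-side launch: enough 256-thread blocks to cover n elements, e.g.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```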
  10. SOFTWARE ABSTRACTION
     Grids, blocks and threads
  11. Host and Device
     ‣ CUDA assumes a distinction between Host and Device
     ‣ Terminology:
       - Host: the CPU and its memory (host memory)
       - Device: the GPU and its memory (device memory)
  12. Threads, blocks and grid
     ‣ Threads are independent sequences of program that run concurrently
     ‣ Threads are organized in blocks, which are organized in a grid
     ‣ Blocks and threads can be accessed using 3D coordinates
     ‣ Threads in the same block share fast memory with each other
  13. Blocks
     ‣ The number of threads in a block is limited and depends on the graphics card
     ‣ Threads in a block are divided into groups of 32 threads called warps
       - Threads in the same warp are executed in parallel
     ‣ Blocks give automatic scalability
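The per-card limits mentioned above can be queried at runtime with the standard CUDA runtime API; a short sketch (device 0 is assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```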
  14. Kernels
     ‣ A kernel is the code each thread is supposed to execute
     ‣ Threads can be thought of as entities mapping to the elements of a certain data structure
     ‣ Kernels are launched by the Host, and in recent CUDA versions can also be launched by other kernels
  15. How to use kernels?
     ‣ A kernel can only be a void function
     ‣ The CUDA __global__ qualifier means the kernel is callable from both the Host and the Device, but it is always run on the Device
     ‣ Each kernel can access its thread and block position to get a unique identifier
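A sketch of how a thread derives a unique identifier from its block and thread coordinates, here in 2D to show the multi-dimensional indexing mentioned on slide 12 (the kernel name and `width` parameter are illustrative):

```cuda
// Each thread maps to one element of a width x height array.
__global__ void fillCoords(float *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    out[y * width + x] = (float)(x + y);            // unique 2D position
}
```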
  16. How to use kernels?
     ‣ Kernel call
     ‣ If you want to call a normal function in your kernel, you must declare it with the CUDA __device__ qualifier
     ‣ A __device__ function can only be called from the Device and is automatically defined as inline
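A sketch of a __device__ helper called from a kernel, together with the <<<grid, block>>> launch syntax (the names `square` and `squareAll` are illustrative):

```cuda
__device__ float square(float x)   // callable only from device code
{
    return x * x;
}

__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]); // device function used inside the kernel
}

// Host-side call: <<<grid, block>>> gives the launch configuration, e.g.
// squareAll<<<(n + 127) / 128, 128>>>(d_data, n);
```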
  18. Memory Management
     Each thread can:
     ‣ Read/write per-thread registers
     ‣ Read/write per-thread local memory
     ‣ Read/write per-block shared memory
     ‣ Read/write per-grid global memory
     ‣ Read per-grid constant memory
     ‣ Read per-grid texture memory
  19. Global Memory
     ‣ Host and Device global memory are separate entities
       - Device pointers point to GPU memory; they may not be dereferenced in Host code
       - Host pointers point to CPU memory; they may not be dereferenced in Device code
     ‣ Slowest memory
     ‣ Easy to use
     ‣ ~1.5 GB on the GPU
     ‣ C / C for CUDA equivalents:
       - int *h_T   /  int *d_T
       - malloc()   /  cudaMalloc()
       - free()     /  cudaFree()
       - memcpy()   /  cudaMemcpy()
  20. Global Memory example
     ‣ C and C for CUDA versions (code shown on the slide)
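The original slide shows this example as an image; a reconstruction of the usual pattern it illustrates might look as follows (the h_T/d_T names follow the previous slide's convention; the array size is illustrative):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;

    // C: allocate on the host
    int *h_T = (int *)malloc(N * sizeof(int));

    // C for CUDA: allocate on the device
    int *d_T;
    cudaMalloc(&d_T, N * sizeof(int));

    // Copy host -> device, run kernels on d_T, then copy device -> host
    cudaMemcpy(d_T, h_T, N * sizeof(int), cudaMemcpyHostToDevice);
    // ... kernel launches operating on d_T ...
    cudaMemcpy(h_T, d_T, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_T);   // device counterpart of free()
    free(h_T);
    return 0;
}
```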
  21. Constant Memory
     ‣ Constant memory is a read-only memory located in global memory that can be accessed by every thread
     ‣ Two reasons to use constant memory:
       - A single read can be broadcast to up to 15 other threads (a half-warp)
       - Constant memory is cached on the GPU
     ‣ Drawback:
       - The half-warp broadcast feature can degrade performance when all 16 threads read different addresses
  22. How to use constant memory?
     ‣ The qualifier to define constant memory is __constant__
     ‣ It must be declared outside the main body, and cudaMemcpyToSymbol is used to copy values from the Host to the Device
     ‣ Constant memory variables don't need to be passed as arguments in the kernel invocation to be accessed
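A sketch of the pattern described above; the array name `coeffs`, its size, and the kernel are illustrative:

```cuda
#include <cuda_runtime.h>

// Declared at file scope, outside any function body.
__constant__ float coeffs[16];

__global__ void applyCoeffs(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[i % 16];   // read directly, no kernel parameter needed
}

// Host side: copy values into constant memory before launching the kernel.
// float h_coeffs[16] = { /* ... */ };
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
```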
  23. Texture memory
     ‣ Texture memory is located in global memory and can be accessed by every thread
     ‣ Accessed through a dedicated read-only cache
     ‣ The cache includes hardware filtering which can perform linear floating-point interpolation as part of the read process
     ‣ The cache is optimised for spatial locality in the coordinate system of the texture, not in memory
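A sketch of reading linear memory through the texture cache, using the texture reference API of the CUDA versions contemporary with this deck (newer code would use cudaTextureObject_t instead); all names are illustrative:

```cuda
#include <cuda_runtime.h>

// Texture reference bound to a 1D buffer of floats, declared at file scope.
texture<float, cudaTextureType1D, cudaReadModeElementType> texRef;

__global__ void readThroughTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // read goes through the texture cache
}

// Host side (sketch): bind a linear device buffer d_in to the reference.
// cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
```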
  24. Shared Memory
     ‣ [16-64] KB of memory per block
     ‣ Extremely fast on-chip memory, user managed
     ‣ Declared using __shared__, allocated per block
     ‣ Data is not visible to threads in other blocks
     ‣ Beware of bank conflicts!
     ‣ When to use? When threads would otherwise access global memory many times
  25. Shared Memory - Example
     ‣ 1D stencil: each output element is the sum of the input elements within RADIUS
     ‣ How many times is each input element read from global memory? 7 times (for a radius of 3)
  26. Shared Memory - Example

     __global__ void stencil_1d(int *in, int *out)
     {
         __shared__ int temp[BLOCK_SIZE];
         int lindex = threadIdx.x;

         // Read input elements into shared memory
         temp[lindex] = in[lindex];
         __syncthreads();   // wait until the whole block has filled temp

         if (lindex >= RADIUS && lindex < BLOCK_SIZE - RADIUS)
         {
             int res = 0;
             for (int offset = -RADIUS; offset <= RADIUS; offset++)  // loop for calculating the sum
                 res += temp[lindex + offset];
             out[lindex] = res;
         }
     }
  27. Shared Problem
  28. PROGRAM EXAMPLE
     1D stencil
  29. Global Memory
  30. CONCLUSION
  31. Conclusion
     ‣ GPUs are designed for parallel computing
     ‣ CUDA's software abstraction is adapted to the GPU architecture with grids, blocks and threads
     ‣ Managing which functions access what type of memory is very important
       - Be careful of bank conflicts!
     ‣ Data transfer between host and device is slow (~5 GB/s device-to-host and host-to-device, versus ~16 GB/s for device-to-device and host-to-host copies)
  32. Resources
     ‣ We skipped some details; you can learn more with:
       - The CUDA programming guide
       - CUDA Zone - tools, training, webinars and more
       - http://developer.nvidia.com/cuda
     ‣ Install from https://developer.nvidia.com/category/zone/cuda-zone and learn from the provided examples