GPU: Understanding CUDA


  1. GPU: UNDERSTANDING CUDA (J.A.R., J.C.G., T.R.G.B.)
  2. TALK STRUCTURE
     • What is CUDA?
     • History of GPU
     • Hardware Presentation
     • How does it work?
     • Code Example
     • Examples & Videos
     • Results & Conclusion
  3. WHAT IS CUDA
     • Compute Unified Device Architecture
     • A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
     • CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
  4. HISTORY
     • 1981 – Monochrome Display Adapter
     • 1988 – VGA Standard (VGA Controller) – VESA founded
     • 1989 – SVGA
     • 1993 – PCI – NVIDIA founded
     • 1996 – AGP – Voodoo Graphics – Pentium
     • 1999 – NVIDIA GeForce 256 – Pentium III
     • 2004 – PCI Express – GeForce 6600 – Pentium 4
     • 2006 – GeForce 8800
     • 2008 – GeForce GTX 280 / Core 2
  5. HISTORICAL PC (diagram: CPU connected over the system bus to a North Bridge with memory; South Bridge, VGA controller with memory buffer and screen, LAN, and UART on the PCI bus)
  6. INTEL PC STRUCTURE
  7. NEW INTEL PC STRUCTURE
  8. VOODOO GRAPHICS SYSTEM ARCHITECTURE (diagram: CPU with core logic and system memory; GPU pipeline of geometry gather, geometry processing, triangle processing, pixel processing, and Z/blend, with FBI frame-buffer memory and TMU texture memory)
  9. GEFORCE GTX 280 SYSTEM ARCHITECTURE (diagram: the same pipeline stages now inside the GPU with its own GPU memory; physics, AI, and scene management also run on the GPU)
  10. CUDA ARCHITECTURE ROADMAP
  11. SOUL OF NVIDIA'S GPU ROADMAP
     • Increase performance / watt
     • Make parallel programming easier
     • Run more of the application on the GPU
  12. MYTHS ABOUT CUDA
     • You have to port your entire application to the GPU
     • It is really hard to accelerate your application
     • There is a PCI-e bottleneck
  13. CUDA MODELS
     • Device Model
     • Execution Model
  14. DEVICE MODEL (diagram: a scalar processor; many scalar processors + register file + shared memory)
  15. DEVICE MODEL (diagram: multiprocessors grouped into the device)
  16. DEVICE MODEL (diagram: host → input assembler → thread execution manager; multiprocessors with parallel data caches and texture units; load/store paths to global memory)
  17. HARDWARE PRESENTATION: GeForce GTS 450
  18. HARDWARE PRESENTATION: GeForce GTS 450
  19. HARDWARE PRESENTATION: GeForce GTS 450 specifications
  20. HARDWARE PRESENTATION: GeForce GTX 470
  21. HARDWARE PRESENTATION: GeForce GTX 470 specifications
  22. HARDWARE PRESENTATION
  23. HARDWARE PRESENTATION: GeForce 8600 GT/GTS specifications
  24. EXECUTION MODEL
     Vocabulary:
     • Host: the CPU.
     • Device: the GPU.
     • Kernel: a piece of code executed on the GPU (a function or program).
     • SIMT: Single Instruction, Multiple Threads.
     • Warp: a set of 32 threads; the minimum unit of data processed in SIMT.
  25. EXECUTION MODEL
     • A CUDA kernel is executed by an array of threads (SIMT)
     • All threads execute the same code
     • Each thread has a unique identifier (threadID (x, y, z))
  26. EXECUTION MODEL - SOFTWARE
     • Thread: the smallest logical unit
     • Block: a set of threads (max 512)
        – Private shared memory
        – Barrier (thread synchronization within a block)
     • Grid: a set of blocks
        – No grid-wide barrier: no synchronization between blocks
  27. EXECUTION MODEL
     Specified by the programmer at runtime:
     • Number of blocks (gridDim)
     • Block size (blockDim)
     CUDA kernel invocation: f<<<G, B>>>(a, b, c)
  28. EXECUTION MODEL - MEMORY ARCHITECTURE
  29. EXECUTION MODEL
     • Each thread runs on a scalar processor
     • Thread blocks run on a multiprocessor
     • A grid runs a single CUDA kernel
  30. SCHEDULING (diagram: over time, the scheduler interleaves warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96, with warps drawn from blocks 1 through n)
     • Threads are grouped into blocks
     • IDs are assigned to blocks and threads
     • Blocks of threads are distributed among the multiprocessors
     • Threads of a block are grouped into warps
     • A warp is the smallest unit of scheduling and consists of 32 threads
     • Several warps reside on each multiprocessor, but only one is running at a time
  31. CODE EXAMPLE
     The following program calculates and prints the squares of the first 100 integers.

     // 1) Include header files
     #include <stdio.h>
     #include <cuda.h>

     // 2) Kernel that executes on the CUDA device
     __global__ void square_array(float *a, int N) {
         int idx = blockIdx.x * blockDim.x + threadIdx.x;
         if (idx < N)
             a[idx] = a[idx] * a[idx];
     }

     // 3) main() routine, executed on the CPU (the host)
     int main(void) {
  32. CODE EXAMPLE
     // 3.1: Define pointers to the host and device arrays
     float *a_h, *a_d;
     // 3.2: Define other variables used in the program
     const int N = 100;
     size_t size = N * sizeof(float);
     // 3.3: Allocate the array on the host
     a_h = (float *)malloc(size);
     // 3.4: Allocate the array on the device (DRAM of the GPU)
     cudaMalloc((void **)&a_d, size);
     // Initialize the host array
     for (int i = 0; i < N; i++)
         a_h[i] = (float)i;
  33. CODE EXAMPLE
     // 3.5: Copy the data from the host array to the device array
     cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
     // 3.6: Kernel call, execution configuration
     int block_size = 4;
     int n_blocks = N / block_size + (N % block_size != 0);  // ceiling division
     square_array<<<n_blocks, block_size>>>(a_d, N);
     // 3.7: Retrieve the result from the device into host memory
     cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
  34. CODE EXAMPLE
     // 3.8: Print the result
     for (int i = 0; i < N; i++)
         printf("%d\t%f\n", i, a_h[i]);
     // 3.9: Free the allocated memory on the host and device
     free(a_h);
     cudaFree(a_d);
     return 0;
     }
  35. CUDA LIBRARIES
  36. TESTING
  37. TESTING
  38. TESTING
  39. EXAMPLES
     • Video example with an NVIDIA Tesla
     • Development environment
  40. RADIX SORT RESULTS (chart: sorting time, on a 0 to 1.6 scale, for array sizes of 1,000,000, 10,000,000, 51,000,000, and 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600, and GTX 560M)
  41. CONCLUSION
     • Easy to use and powerful, so it is worth it!
     • GPU computing is the future. The results confirm our theory, and the industry is giving it more and more importance.
     • In the coming years we will see more applications that use parallel computing.
  42. DOCUMENTATION & LINKS
     • http://www.nvidia.es/object/cuda_home_new_es.html
     • http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf
     • http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf
     • http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf
     • http://www.geforce.com/hardware/technology/cuda/supported-gpus
     • http://en.wikipedia.org/wiki/GeForce_256
     • http://en.wikipedia.org/wiki/CUDA
     • https://developer.nvidia.com/technologies/Libraries
     • https://www.udacity.com/wiki/cs344/troubleshoot_gcc47
     • http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
  43. QUESTIONS?
