GPU: Understanding CUDA
    Presentation Transcript

    • J.A.R.J.C.G.T.R.G.B. GPU: UNDERSTANDING CUDA
    • TALK STRUCTURE
      • What is CUDA?
      • History of GPU
      • Hardware Presentation
      • How does it work?
      • Code Example
      • Examples & Videos
      • Results & Conclusion
    • WHAT IS CUDA
      • Compute Unified Device Architecture
      • A parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce
      • CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs
    • HISTORY
      • 1981 – Monochrome Display Adapter
      • 1988 – VGA Standard (VGA Controller) – VESA founded
      • 1989 – SVGA
      • 1993 – PCI – NVidia founded
      • 1996 – AGP – Voodoo Graphics – Pentium
      • 1999 – NVidia GeForce 256 – P3
      • 2004 – PCI Express – GeForce 6600 – P4
      • 2006 – GeForce 8800
      • 2008 – GeForce GTX 280 / Core 2
    • HISTORICAL PC
      [Block diagram: CPU on the system bus with a North Bridge (memory) and South Bridge; the VGA controller and its memory buffer sit on the PCI bus alongside LAN and UART, driving the screen]
    • INTEL PC STRUCTURE
    • NEW INTEL PC STRUCTURE
    • VOODOO GRAPHICS SYSTEM ARCHITECTURE
      [Block diagram: CPU and system memory connect through core logic to the graphics pipeline – geometry gather, geometry processing, triangle processing, pixel processing, Z/blend – with FBI and TMU units backed by frame-buffer and texture memory; only part of the pipeline runs on the GPU]
    • GEFORCE GTX280 SYSTEM ARCHITECTURE
      [Block diagram: the full pipeline – geometry gather, geometry processing, triangle processing, pixel processing, Z/blend, plus physics, AI and scene management – now runs on the GPU with its own GPU memory; the CPU and system memory connect through core logic]
    • CUDA ARCHITECTURE ROADMAP
    • SOUL OF NVIDIA’S GPU ROADMAP
      • Increase performance / watt
      • Make parallel programming easier
      • Run more of the application on the GPU
    • MYTHS ABOUT CUDA
      • You have to port your entire application to the GPU
      • It is really hard to accelerate your application
      • There is a PCI-e bottleneck
    • CUDA MODELS
      • Device Model
      • Execution Model
    • DEVICE MODEL
      [Diagram: a scalar processor; many scalar processors + a register file + shared memory form a multiprocessor]
    • DEVICE MODEL
      [Diagram: multiple multiprocessors form the device]
    • DEVICE MODEL
      [Diagram: the host feeds an input assembler and thread execution manager; multiprocessors with parallel data caches and texture units reach global memory through load/store paths]
    • HARDWARE PRESENTATION – GeForce GTS 450
    • HARDWARE PRESENTATION – GeForce GTS 450
    • HARDWARE PRESENTATION – GeForce GTS 450 Specifications
    • HARDWARE PRESENTATION – GeForce GTX 470
    • HARDWARE PRESENTATION – GeForce GTX 470 Specifications
    • HARDWARE PRESENTATION
    • HARDWARE PRESENTATION – GeForce 8600 GT/GTS Specifications
    • EXECUTION MODEL
      Vocabulary:
      • Host: CPU.
      • Device: GPU.
      • Kernel: A piece of code executed on the GPU (a function, a program, ...).
      • SIMT: Single Instruction, Multiple Threads.
      • Warp: A set of 32 threads; the minimum unit of data processed in SIMT.
    • EXECUTION MODEL
      A CUDA kernel is executed by an array of threads (SIMT). All threads execute the same code, and each thread has a unique identifier (threadID (x,y,z)).
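      As a minimal sketch of this idea (the kernel name and its arguments are hypothetical, not from the slides), each thread builds its unique global index from its block and thread IDs and processes one element:

      ```cuda
      // Sketch: every thread runs the same code; the global index
      // built from blockIdx, blockDim and threadIdx makes it unique.
      __global__ void scale(float *data, float factor, int n)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
          if (idx < n)              // guard: the grid may cover more than n elements
              data[idx] *= factor;
      }
      ```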
    • EXECUTION MODEL - SOFTWARE
      • Thread: Smallest logical unit.
      • Block: A set of threads (max 512).
        • Private shared memory
        • Barrier (thread synchronization)
      • Grid: A set of blocks.
        • Barrier (grid synchronization at kernel boundaries)
        • No synchronization between blocks
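      A sketch of the per-block features listed above – shared memory plus the intra-block barrier – using a hypothetical kernel that reverses one block's worth of data (block size assumed to be at most 256):

      ```cuda
      // Sketch: threads of ONE block cooperate through shared memory.
      // __syncthreads() is the barrier inside a block; there is no
      // equivalent barrier between different blocks.
      __global__ void reverse_block(int *data)
      {
          __shared__ int tmp[256];           // private to each block
          int t = threadIdx.x;

          tmp[t] = data[t];                  // each thread stages one element
          __syncthreads();                   // wait until the whole block has written
          data[t] = tmp[blockDim.x - 1 - t]; // now safe to read another thread's slot
      }
      ```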
    • EXECUTION MODEL
      Specified by the programmer at runtime:
      • Number of blocks (gridDim)
      • Block size (blockDim)
      CUDA kernel invocation: f <<<G, B>>> (a, b, c)
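      For example, a common way to pick G so that G × B threads cover N elements is the round-up division idiom (the kernel f and its arguments here stand in for any real kernel):

      ```cuda
      int N = 100000;
      int B = 256;               // block size (blockDim)
      int G = (N + B - 1) / B;   // number of blocks (gridDim), rounded up
      f<<<G, B>>>(a, b, c);      // launch G blocks of B threads each
      ```

      Threads whose index falls beyond N simply do nothing, thanks to the usual `if (idx < N)` guard inside the kernel.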
    • EXECUTION MODEL - MEMORY ARCHITECTURE
    • EXECUTION MODEL
      • Each thread runs on a scalar processor.
      • Thread blocks run on a multiprocessor.
      • A grid runs only one CUDA kernel.
    • SCHEDULE
      [Figure: timeline of warp instructions interleaved across blocks, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96]
      • Threads are grouped into blocks
      • IDs are assigned to blocks and threads
      • Block threads are distributed among the multiprocessors
      • Threads of a block are grouped into warps
      • A warp is the smallest unit of scheduling and consists of 32 threads
      • Several warps reside on each multiprocessor, but only one is running at a time
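      The warp size is exposed through the CUDA runtime API rather than being something you must hard-code; a small host-side sketch that queries it for device 0:

      ```cuda
      #include <stdio.h>
      #include <cuda_runtime.h>

      int main(void)
      {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);         // properties of device 0
          printf("warp size: %d\n", prop.warpSize);  // 32 on the GPUs discussed here
          return 0;
      }
      ```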
    • CODE EXAMPLE
      The following program calculates and prints the squares of the first 100 integers.

      // 1) Include header files
      #include <stdio.h>
      #include <stdlib.h>
      #include <cuda.h>

      // 2) Kernel that executes on the CUDA device
      __global__ void square_array(float *a, int N)
      {
          int idx = blockIdx.x * blockDim.x + threadIdx.x;
          if (idx < N)
              a[idx] = a[idx] * a[idx];
      }

      // 3) main() routine that runs on the CPU (the host)
      int main(void)
      {
    • CODE EXAMPLE
          // 3.1: Define pointers to the host and device arrays
          float *a_h, *a_d;
          // 3.2: Define other variables used in the program
          const int N = 100;
          size_t size = N * sizeof(float);
          // 3.3: Allocate the array on the host
          a_h = (float *)malloc(size);
          // 3.4: Allocate the array on the device (the GPU's DRAM)
          cudaMalloc((void **)&a_d, size);
          for (int i = 0; i < N; i++)
              a_h[i] = (float)i;
    • CODE EXAMPLE
          // 3.5: Copy the data from the host array to the device array
          cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
          // 3.6: Kernel call with its execution configuration
          int block_size = 4;
          int n_blocks = N / block_size + (N % block_size != 0);
          square_array<<<n_blocks, block_size>>>(a_d, N);
          // 3.7: Retrieve the result from the device into host memory
          cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    • CODE EXAMPLE
          // 3.8: Print the result
          for (int i = 0; i < N; i++)
              printf("%d\t%f\n", i, a_h[i]);
          // 3.9: Free the allocated memory on the device and the host
          free(a_h);
          cudaFree(a_d);
          return 0;
      }
    • CUDA LIBRARIES
    • TESTING
    • TESTING
    • TESTING
    • EXAMPLES
      • Video example with an NVidia Tesla
      • Development environment
    • RADIX SORT RESULTS
      [Chart: radix sort running times (scale 0 to 1.6) for input sizes of 1,000,000 / 10,000,000 / 51,000,000 / 100,000,000 elements on the GTS 450, GTX 470, GeForce 8600 and GTX 560M]
    • CONCLUSION
      • Easy to use and powerful, so it is worth it!
      • GPU computing is the future. Our results support this, and the industry is giving it more and more importance.
      • In the coming years we will see more applications that use parallel computing.
    • DOCUMENTATION & LINKS
      • http://www.nvidia.es/object/cuda_home_new_es.html
      • http://www.nvidia.com/docs/IO/113297/ISC-Briefing-Sumit-June11-Final.pdf
      • http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf
      • http://www.hpca.ual.es/~jmartine/CUDA/SESION3_CUDA_GPU_EMG_JAM.pdf
      • http://www.geforce.com/hardware/technology/cuda/supported-gpus
      • http://en.wikipedia.org/wiki/GeForce_256
      • http://en.wikipedia.org/wiki/CUDA
      • https://developer.nvidia.com/technologies/Libraries
      • https://www.udacity.com/wiki/cs344/troubleshoot_gcc47
      • http://stackoverflow.com/questions/12986701/installing-cuda-5-samples-in-ubuntu-12-10
    • QUESTIONS?