Introduction to CUDA Programming

The aim of this seminar is to provide students with basic knowledge of developing applications for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it can complete more than 64 arithmetic operations per clock cycle. Graphics processing units (GPUs) fall into this category, but other massively parallel architectures are emerging. Effectively programming these processors requires in-depth knowledge of parallel programming principles, as well as of the parallelism models, communication models, memory hierarchy, and resource limitations of these processors. We also give an overview of some tools that reduce the initial difficulties of CUDA programming.

Transcript

  • 1. Introduction to CUDA Programming
    Guest Lecture
    Peter Wittek, University of Borås & Tsinghua University
    May 28, 2013
  • 2. Outline
    1. Disclaimer
    2. Accelerator Architectures
    3. Introduction to Streaming Architecture
    4. Writing Kernels: Vector Multiplication
    5. Using Libraries
    6. Prospects
    7. Conclusions
  • 3. The Sources Are...
    PATC Course: Introduction to CUDA Programming, 2012.
    PUMPS Summer School, 2012.
    Wen-mei Hwu, Isaac Gelado.
  • 4. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 5. What It Is About
    Many-core hardware limitations
    Single instruction, multiple data (SIMD)
    Streaming architecture and programming paradigm
    Desirable computation patterns
    Easy to achieve good results if you ignore correctness
    Equally easy to obtain correct code that is slow
  • 6. What We Are Not Going to Talk About
    ARM, AMD, OpenCL
    Field-programmable gate arrays (FPGAs): CUDA works on FPGAs as a proof of concept
    Intel Xeon Phi: runs Linux in the firmware; SSH, MPI, and OpenMP
    Adiabatic quantum computing: too expensive at this point
    [Figure: an adiabatic quantum computer. Image courtesy of The New York Times.]
  • 7. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 8. Why SIMD?
    [Figure: theoretical peak performance of some CPUs and GPUs. Image courtesy of John Owens.]
    Some observations are not apparent
    Notice the word "theoretical"
    2012 Nvidia GPUs: 1.5 TFLOPS
  • 9. Why Is It Hard to Program GPUs?
    [Figure: CPUs versus GPUs; the CPU die is dominated by control logic and cache with a few ALUs and DRAM, the GPU die by many ALUs and DRAM]
    Streaming hardware
    Explicit memory management
  • 10. CPUs: Latency-Oriented Design
    Large caches: multiple layers, reduce latency, coherent
    Branch prediction
    Data forwarding
    Out-of-order execution
    [Figure: CPU block diagram with control logic, cache, ALUs, and DRAM]
  • 11. CPUs: Achieving High Performance
    Design is changing: SIMD registers, vector instructions (MMX, SSE-SSE4, AVX)
    Cache hierarchy and cache coherence
    Easy to reach 70% of the peak performance of a CPU; 100% is harder than on a GPU
    Compilers will not help, especially not vendor-provided, commercial ones (icc, we are looking at you)
    Hand-tuned code will be more complex than GPU code:

        __m128 r1q = _mm_load_ps(&r1[idx1]);
        __m128 i2q = _mm_load_ps(&i2[idx2]);
        __m128 next_r1q = _mm_sub_ps(_mm_mul_ps(r1q, aq),
                                     _mm_mul_ps(i2q, bq));
  • 12. GPUs: Throughput-Oriented Design
    Small caches: 768 KB on a Tesla M2050 with 448 streaming cores; no coherence protocol
    No branch prediction; in fact, you have strong incentives to avoid branching (see the sketch below)
    No data forwarding
    Pre-Fermi architectures: accuracy was secondary
    [Figure: GPU block diagram with many ALUs and DRAM]
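    A minimal sketch (assumed example, not from the slides) of what "avoid branching" means in
    practice: threads of a warp that take different branches are serialized, so both paths run;
    branching on a value that is uniform across the block (or warp) avoids this.

        // Divergent: even and odd threads of every warp take different paths.
        __global__ void divergent(float *out, const float *in, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                if (i % 2 == 0)
                    out[i] = in[i] * 2.0f;
                else
                    out[i] = in[i] + 1.0f;
            }
        }

        // Uniform: the condition is identical for all threads of a block, so no warp diverges.
        __global__ void uniform(float *out, const float *in, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                if (blockIdx.x % 2 == 0)
                    out[i] = in[i] * 2.0f;
                else
                    out[i] = in[i] + 1.0f;
            }
        }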
  • 13. Execution Model
    Serial code in host C code
    Highly parallel parts in SIMD C kernels
    Host and device memories are separate: memory allocations in both memory spaces, transfers between host and device
    Define the structure of the SIMD execution grid
    Call the SIMD kernel
    [Figure: basic execution model timeline: transfer A, transfer B, kernel, transfer back; during each transfer only one PCIe direction is used and the GPU is idle]
  • 14. Streaming Threads
    1D, 2D, or 3D grid of blocks
    A block is a 1D, 2D, or 3D array of threads
    Why is it not just one big array of threads?

        i = blockIdx.x * blockDim.x + threadIdx.x;
        C_d[i] = A_d[i] + B_d[i];

    [Figure: threads 0, 1, 2, ..., 254, 255 within a block]
  • 15. Streaming Threads (continued)
    Why is it not just one big array of threads?
      Only threads in a block cooperate: shared memory, registers, synchronization
      Physical limits on the maximum block size
      Actual execution is in warps
    (A 2D variant of the indexing on the previous slide is sketched below.)
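    A 2D indexing sketch with assumed names (scale2D, M, width, height are illustration only,
    not from the slides):

        __global__ void scale2D(float *M, int width, int height, float alpha)
        {
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            if (row < height && col < width)
                M[row * width + col] *= alpha;   // row-major layout
        }

        // Host side: a 2D grid of 16x16 blocks covering the whole matrix.
        // dim3 block(16, 16);
        // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        // scale2D<<<grid, block>>>(M_d, width, height, 2.0f);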
  • 16. CUDA Memory Model
    Registers, per thread: ≈1 cycle to access
    Shared memory, per block: ≈5 cycles to access
    Global memory, per grid: ≈500 cycles to access
    Constant/texture memory, per grid: ≈5 cycles if cached
    Threads cannot access host memory
    [Figure: grid with global and constant memory; each block has its own shared memory; each thread has its own registers; the host accesses global and constant memory]
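    To illustrate how threads of one block cooperate through shared memory, a sketch (assumed
    example, not from the slides) that stages partial products in shared memory and reduces them
    per block; it assumes blockDim.x == 256, a power of two:

        __global__ void dotPartial(const float *A, const float *B, float *blockSums, int n)
        {
            __shared__ float partial[256];                 // per-block, ~5-cycle access
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            partial[threadIdx.x] = (i < n) ? A[i] * B[i] : 0.0f;
            __syncthreads();                               // wait until the whole block has written
            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (threadIdx.x < stride)
                    partial[threadIdx.x] += partial[threadIdx.x + stride];
                __syncthreads();
            }
            if (threadIdx.x == 0)
                blockSums[blockIdx.x] = partial[0];        // one value per block, summed on the host
        }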
  • 17. Global Memory Bandwidth
    Fermi-core Tesla: 1 TFLOPS single precision
    144 GB/s global memory bandwidth, i.e. 36 giga single-precision floats per second
    At least 1000/36 ≈ 28 arithmetic operations per loaded float are necessary to achieve peak throughput (worked out below)
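    The arithmetic, spelled out (4 bytes per single-precision float):

        \frac{144~\text{GB/s}}{4~\text{B/float}} = 36 \times 10^9~\text{floats/s}, \qquad
        \frac{10^{12}~\text{FLOP/s}}{36 \times 10^9~\text{floats/s}} \approx 28~\text{operations per loaded float}.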
  • 18. Compilation
    [Figure: integrated C programs with CUDA extensions go through the NVCC compiler wrapper; host code goes to the host C compiler/linker, device code (PTX) to the device just-in-time compiler, targeting a heterogeneous computing platform with CPUs and GPUs]
    The situation gets messy when other compiler wrappers enter (mpicc)
    It is a good idea to separate C++ code from CUDA C
  • 19. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 20. CPU Version

        void vecAdd(float *A_h, float *B_h, float *C_h, int n)
        {
            for (int i = 0; i < n; i++)
                C_h[i] = A_h[i] + B_h[i];
        }

        int main()
        {
            float *A_h = (float *) malloc(n * sizeof(float));
            float *B_h = (float *) malloc(n * sizeof(float));
            float *C_h = (float *) malloc(n * sizeof(float));
            // Fill the arrays with data
            vecAdd(A_h, B_h, C_h, n);
            // Use the result vector C_h
            // Deallocate memory
        }
  • 21. GPU Version: Outline

        void vecAdd(float *A_h, float *B_h, float *C_h, int n)
        {
            float *A_d, *B_d, *C_d;
            // 1. Allocate device memory for A, B, and C;
            //    copy A and B to device memory
            // 2. Kernel launch code: make the device
            //    perform the actual vector addition
            // 3. Copy C from the device memory;
            //    free device vectors
        }

    [Figure: timeline as before: transfer A, transfer B, kernel, transfer back; during each transfer only one PCIe direction is used and the GPU is idle]
  • 22. GPU Memory Allocation and Transfer

        float *A_d, *B_d, *C_d;
        size_t size = n * sizeof(float);
        // Device memory allocations
        cudaMalloc((void **) &A_d, size);
        cudaMalloc((void **) &B_d, size);
        cudaMalloc((void **) &C_d, size);
        // Copying from host to device
        cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
        cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    [Figure: CUDA memory model diagram as on slide 16]
  • 23. Memory Transfer Back and Freeing Device Memory

        cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
        cudaFree(A_d);
        cudaFree(B_d);
        cudaFree(C_d);

    [Figure: CUDA memory model diagram as on slide 16]
  • 24. Grid Configuration and Kernel Invocation

        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    The number of blocks is easy: given the number of threads per block, have enough blocks to cover all the elements.
    Why 256 threads in a block?
    Kernel calls are always asynchronous (see the error-checking sketch below).
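    Because the launch is asynchronous, errors surface only later; a minimal sketch (assumed,
    not from the slides) of checking the launch and waiting for the kernel to finish:

        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

        cudaError_t err = cudaGetLastError();       // errors from the launch itself
        if (err != cudaSuccess)
            fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));

        cudaDeviceSynchronize();                    // block the host until the kernel has finished;
                                                    // the later cudaMemcpy would also synchronize implicitly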
  • 25. The Kernel

        __global__ void
        vectorAdd(float *A, float *B, float *C, int n)
        {
            int i = blockDim.x * blockIdx.x + threadIdx.x;
            if (i < n)
            {
                C[i] = A[i] + B[i];
            }
        }
  • 26. Silent About
    2D/3D tiling
    Shared memory, registers, texture memory, constant memory
    Coalesced access, caching, bank conflicts
    Streams, parallel workflows
    Overlapping host-device memory transfers
  • 27. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 28. Computational Thinking
    Wen-mei Hwu: the ability to translate/formulate domain problems into computational models
    that can be solved efficiently by available computing resources:
      - understanding the relationship between the domain problem and the computational models
      - understanding the strength and limitations of the computing devices
      - defining problems and models to enable efficient computational solutions
  • 29. Calculating a Euclidean Distance Matrix: The Problem
    You are given two sets of d-dimensional vectors:
      x_1, x_2, ..., x_m
      w_1, w_2, ..., w_n
    The goal is to calculate every pairwise Euclidean distance:
      d(x_i, w_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - w_{jk})^2}.
    Other distance functions are also important.
    If the second set of vectors is identical to the first, the pairwise distances define the Gram matrix.
    Gram matrices are extensively used in machine learning.
  • 30. Calculating the Distances with the CPU

        float get_distance(float *x, float *w, int d)
        {
            float distance = 0.0f;
            for (int k = 0; k < d; k++)
                distance += (x[k] - w[k]) * (x[k] - w[k]);
            return sqrt(distance);
        }

        int main()
        {
            // X is a 2D array containing m vectors
            // W is the other 2D array containing n vectors
            // D is an m*n 2D array storing the pairwise distances
            for (int i = 0; i < m; i++)
                for (int j = 0; j < n; j++)
                    D[i][j] = get_distance(X[i], W[j], d);
        }

    (A direct GPU port is sketched below.)
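    An assumed sketch (not in the slides) of the most direct GPU port: one thread per (i, j) entry
    of the m x n distance matrix, with X_d (m x d) and W_d (n x d) stored row-major on the device:

        __global__ void distanceMatrix(const float *X_d, const float *W_d,
                                       float *D_d, int m, int n, int d)
        {
            int j = blockIdx.x * blockDim.x + threadIdx.x;   // index into W
            int i = blockIdx.y * blockDim.y + threadIdx.y;   // index into X
            if (i < m && j < n) {
                float dist = 0.0f;
                for (int k = 0; k < d; ++k) {
                    float diff = X_d[i * d + k] - W_d[j * d + k];
                    dist += diff * diff;
                }
                D_d[i * n + j] = sqrtf(dist);
            }
        }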
  • 31. Calculating the Distances with the CPU: Observations
    Memory access patterns are regular
    A total of three nested for loops
    Screams of inherent yet unexploited parallelism
    The present variant easily adapts to other distance functions
  • 32. A Not Entirely Naïve GPU Variant
    Let us use libraries!
    The distance function is nothing but a vector addition and a dot product: (x_i − w_j, x_i − w_j).
    This translates to two level-1 cuBLAS calls (sketched below).
    Keep the outer loops:

        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                D[i][j] = get_distance_with_cublas_level1(X[i], W[j], d);

    This is a fantastic way to achieve a 10x slowdown.
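    A rough sketch of the per-pair level-1 variant (the handle, the scratch vector diff_d, and the
    exact signature are assumptions, not the author's code); x_d and w_d are single rows already
    resident on the device:

        #include <cublas_v2.h>
        #include <math.h>

        float get_distance_with_cublas_level1(cublasHandle_t handle, const float *x_d,
                                              const float *w_d, float *diff_d, int d)
        {
            const float minus_one = -1.0f;
            float dot = 0.0f;
            cudaMemcpy(diff_d, x_d, d * sizeof(float), cudaMemcpyDeviceToDevice);
            cublasSaxpy(handle, d, &minus_one, w_d, 1, diff_d, 1);   // diff = x - w
            cublasSdot(handle, d, diff_d, 1, diff_d, 1, &dot);       // (diff, diff)
            return sqrtf(dot);
        }

    Each of the m*n calls launches tiny kernels and synchronizes on the dot product, which is
    exactly why this version ends up slower than the CPU loop.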
  • 33. Reformulating the Problem
    This is where computational thinking begins
    How do you incorporate the two outer loops?
    Use linear algebra to rewrite the problem in a different form:
      1: v1 = (X ∘ X)[1, 1, ..., 1]
      2: v2 = (W ∘ W)[1, 1, ..., 1]
      3: P1 = [v1 v1 ... v1]
      4: P2 = [v2 v2 ... v2]
      5: P3 = X W        // This is BLAS Level 3 (a GEMM sketch follows below)
      6: D = (P1 + P2 − 2 P3)
    The raw performance is about 20-50x faster than a single-core CPU.
    http://peterwittek.github.io/somoclu/
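    A hedged sketch of step 5 alone, assuming X is m x d and W is n x d, both row-major, so the
    cross term is X W^T. cuBLAS is column-major, so the row-major inputs are passed as their
    column-major transposes and the result buffer comes back holding P3 in row-major order.
    Names and buffers are assumptions, not the author's code.

        #include <cublas_v2.h>

        void cross_terms(cublasHandle_t handle, const float *X_d, const float *W_d,
                         float *P3_d, int m, int n, int d)
        {
            const float alpha = 1.0f, beta = 0.0f;
            // Column-major view: W_d is d x n, so op(W_d) = W (n x d); X_d is d x m = X^T.
            // W * X^T is n x m column-major, i.e. P3 = X W^T stored row-major as m x n.
            cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                        n, m, d,
                        &alpha, W_d, d,
                        X_d, d,
                        &beta, P3_d, n);
        }

    Steps 1-4 and 6 are element-wise and each fits a trivial kernel, so the whole distance matrix
    takes only a handful of launches instead of m*n library calls.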
  • 34. Almost the Right Tool: Thrust
    Thrust is a C++ template library for 1D arrays in CUDA.
    It saves you from writing kernels, allocating device memory, and doing the memory transfers.
    The problem: Thrust is limited to 1D.
    Example: reduction (a minimal example follows below)
      CUDA: efficient reduction kernels are hard to write
      Thrust: reduction operators are readily available, but the 2D data structure and extending the operator are hard
    The cool thing: Thrust works both on the CPU and the GPU.
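    A minimal Thrust reduction sketch (assumed example, not from the slides); the same call works
    on the CPU with a thrust::host_vector:

        #include <thrust/device_vector.h>
        #include <thrust/reduce.h>
        #include <thrust/functional.h>

        float sum_on_device(const float *data_h, int n)
        {
            thrust::device_vector<float> d(data_h, data_h + n);   // allocation + transfer in one line
            return thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
        }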
  • 35. Sum Elements in a Matrix According to a Stencil
    This is a reduction problem:
    the dot product of a binary stencil and a submatrix at a given offset.

        thrust::inner_product(tile.begin(), tile.end(), stencil.begin(), 0);

    The hard part is extracting a 2D sub-array (the tile above) from the matrix.
  • 36. Sum Elements in a Matrix According to a Stencil, Part 2
    We need to define a new class with appropriate iterators:

        typedef typename thrust::counting_iterator<data_type> CountingIterator;
        typedef typename thrust::transform_iterator<tile_functor,
                                                    CountingIterator> TransformIterator;
        typedef typename thrust::permutation_iterator<Iterator,
                                                      TransformIterator> PermutationIterator;

    The "permutation" will be a new transform iterator that ensures only the tile elements are returned.
  • 37. Sum Elements in a Matrix According to a Stencil, Part 3
    Extending the unary functor of the transform iterator (a usage sketch follows below):

        struct tile_functor : public thrust::unary_function<data_type, data_type>
        {
            data_type tile_size_x;
            data_type leading_dimension;

            tile_functor(data_type tile_size_x, data_type leading_dimension)
                : tile_size_x(tile_size_x),
                  leading_dimension(leading_dimension) {}

            __host__ __device__
            data_type operator()(const data_type &i) const
            {
                int x = i % tile_size_x;
                int y = i / tile_size_x;
                return leading_dimension * y + x;
            }
        };
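    Putting the three slides together, a sketch with assumed names (data_type taken to be int;
    stencil_sum, offset, tile_size_x, and tile_size_y are illustration only):

        #include <thrust/device_vector.h>
        #include <thrust/inner_product.h>
        #include <thrust/iterator/counting_iterator.h>
        #include <thrust/iterator/transform_iterator.h>
        #include <thrust/iterator/permutation_iterator.h>

        float stencil_sum(const thrust::device_vector<float> &matrix,
                          const thrust::device_vector<float> &stencil,
                          int offset, int tile_size_x, int tile_size_y,
                          int leading_dimension)
        {
            tile_functor f(tile_size_x, leading_dimension);
            // Map the linear tile index 0 .. tile_size_x*tile_size_y-1 onto matrix offsets.
            auto indices = thrust::make_transform_iterator(thrust::counting_iterator<int>(0), f);
            auto tile = thrust::make_permutation_iterator(matrix.begin() + offset, indices);
            int n = tile_size_x * tile_size_y;
            return thrust::inner_product(tile, tile + n, stencil.begin(), 0.0f);
        }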
  • 38. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 39. Trends in Supercomputing: Top500
    [Figures: Top500 snapshots from (a) November 2010, (b) November 2011, (c) November 2012]
  • 40. Towards Exascale
    With current technology, exascale computing would require several power plants.
    Mont Blanc project: build a supercomputer from cell-phone components
      ARM processor with an on-chip GPU
      Many heterogeneous, small cores, with a limited memory address space
    The prototype has been operational for over a year.
  • 41. OmpSs: OpenMP SuperScalar
    OpenCL/CUDA coding is complex and error prone:
      memory allocation and transfers,
      manual work scheduling,
      heterogeneous workloads are hard.
    Instead, use directives:
      detect hardware at runtime,
      schedule work according to the available resources.
    https://pm.bsc.es/ompss
  • 42. Outline: 1 Disclaimer · 2 Accelerator Architectures · 3 Introduction to Streaming Architecture · 4 Writing Kernels: Vector Multiplication · 5 Using Libraries · 6 Prospects · 7 Conclusions
  • 43. Summary
    General-purpose programming on GPUs is still rapidly evolving
    Compiler technology is slowly getting better
    Supporting tools are getting better much faster
    The HPC ecosystem clusters around CUDA and Nvidia hardware; OpenCL is far less common
    CUDA is not easy, but it is easier than it was two years ago
    Remember computational thinking
