Introduction to CUDA Programming

The aim of this seminar is to provide students with basic knowledge of developing applications for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it can complete more than 64 arithmetic operations per clock cycle. Graphics processing units (GPUs) fall into this category, but other massively parallel architectures are emerging. Programming these processors effectively requires in-depth knowledge of parallel programming principles, as well as of the parallelism models, communication models, memory hierarchy, and resource limitations of these processors. We also give an overview of some tools that reduce the initial difficulties of CUDA programming.


  1. Introduction to CUDA Programming
     Guest Lecture
     Peter Wittek, University of Borås & Tsinghua University
     May 28, 2013
  2. Outline
     1 Disclaimer
     2 Accelerator Architectures
     3 Introduction to Streaming Architecture
     4 Writing Kernels: Vector Addition
     5 Using Libraries
     6 Prospects
     7 Conclusions
  3. The Sources Are...
     PATC Course: Introduction to CUDA Programming, 2012
     PUMPS Summer School, 2012
     Wen-mei Hwu, Isaac Gelado
  4. Outline
     1 Disclaimer
     2 Accelerator Architectures
     3 Introduction to Streaming Architecture
     4 Writing Kernels: Vector Addition
     5 Using Libraries
     6 Prospects
     7 Conclusions
  5. What It Is About
     Many-core hardware limitations
     Single instruction, multiple data (SIMD)
     Streaming architecture and programming paradigm
     Desirable computation patterns
     Easy to achieve good results if you ignore correctness
     Equally easy to obtain correct code that is slow
  6. What We Are Not Going to Talk About
     ARM, AMD, OpenCL
     Field-programmable gate arrays (FPGAs): CUDA works on FPGAs as a proof of concept
     Intel Xeon Phi: runs Linux in the firmware; SSH, MPI, and OpenMP
     Adiabatic quantum computing: too expensive at this point
     (Image: an adiabatic quantum computer. Courtesy of The New York Times.)
  7. Outline
     1 Disclaimer
     2 Accelerator Architectures
     3 Introduction to Streaming Architecture
     4 Writing Kernels: Vector Addition
     5 Using Libraries
     6 Prospects
     7 Conclusions
  8. Why SIMD?
     (Figure: theoretical peak performance of some CPUs and GPUs. Courtesy of John Owens.)
     Some observations are not apparent:
     Notice the word "theoretical"
     2012 Nvidia GPUs: 1.5 TFLOPS
  9. Why Is It Hard to Program GPUs?
     CPUs versus GPUs
     (Diagram: a CPU spends its die area on control logic, a few ALUs, and cache; a GPU spends it on many ALUs. Both attach to DRAM.)
     Streaming hardware
     Explicit memory management
  10. CPUs: Latency-Oriented Design
      Large caches: multiple layers, coherent, reduce latency
      Branch prediction
      Data forwarding
      Out-of-order execution
      (Diagram: CPU layout with control logic, ALUs, cache, and DRAM.)
  11. CPUs: Achieving High Performance
      Design is changing: SIMD registers, vector instructions
      MMX, SSE-SSE4, AVX
      Cache hierarchy and cache coherence
      Easy to reach 70% of the peak performance of a CPU
      100% is harder than on a GPU
      Compilers will not help, especially not vendor-provided, commercial ones (icc, we are looking at you)
      Hand-tuned code will be more complex than GPU code:

      __m128 r1q = _mm_load_ps(&r1[idx1]);
      __m128 i2q = _mm_load_ps(&i2[idx2]);
      __m128 next_r1q = _mm_sub_ps(_mm_mul_ps(r1q, aq),
                                   _mm_mul_ps(i2q, bq));
  12. GPUs: Throughput-Oriented Design
      Small caches: 768 KB on a Tesla M2050 with 448 streaming cores
      No coherence protocol
      No branch prediction; in fact, you have strong incentives to avoid branching
      No data forwarding
      Pre-Fermi architectures: accuracy was secondary
      (Diagram: GPU layout with many ALUs and DRAM.)
  13. Execution Model
      Serial code runs in host C code
      Highly parallel parts run in SIMD C kernels
      Host and device memories are separate:
      Memory allocations on both memory spaces
      Transfers between host and device
      Define the structure of the SIMD execution grid
      Call the SIMD kernel
      (Timeline: transfer A, transfer B, kernel, transfer back; during each transfer only one PCIe direction is used and the GPU is idle, and during the kernel the PCIe bus is idle.)
  14. Streaming Threads
      1D, 2D, or 3D grid of blocks
      A block is a 1D, 2D, or 3D array of threads
      Why is it not just one big array of threads?

      i = blockIdx.x * blockDim.x + threadIdx.x;
      C_d[i] = A_d[i] + B_d[i];

      (Diagram: a block of threads indexed 0 ... 255.)
  15. Streaming Threads (continued)
      Why is it not just one big array of threads?
      Only threads in a block cooperate: shared memory, registers, synchronization
      Physical limits on the maximum block size
      Actual execution is in warps
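
     A minimal sketch of a multi-dimensional launch configuration (not on the slides; myKernel, width, and height are hypothetical names): a 16 x 16 block gives the common 256 threads, and the grid supplies enough blocks to cover a width x height array.

      // 256 threads per block, arranged 16 x 16; round the grid up so that
      // every element is covered even when the sizes do not divide evenly.
      dim3 threadsPerBlock(16, 16);
      dim3 blocksPerGrid((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                         (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
      myKernel<<<blocksPerGrid, threadsPerBlock>>>(/* arguments */);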
  16. CUDA Memory Model
      Registers, per thread: ≈1 cycle to access
      Shared memory, per block: ≈5 cycles to access
      Global memory, per grid: ≈500 cycles to access
      Constant/texture memory, per grid: ≈5 cycles if cached
      Threads cannot access the host memory
      (Diagram: the host talks to global and constant memory; within the grid, each block has its own shared memory, and each thread has its own registers.)
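
     To make the block-level tier concrete, a minimal sketch (not from the slides) of a kernel that reverses each 256-element tile of an array through shared memory; it assumes the array length is an exact multiple of the block size.

      __global__ void reverseEachBlock(float *d)
      {
          __shared__ float s[256];             // visible to the whole block
          int t = threadIdx.x;
          int base = blockIdx.x * blockDim.x;
          s[t] = d[base + t];                  // each thread loads one element
          __syncthreads();                     // wait until the tile is loaded
          d[base + t] = s[blockDim.x - 1 - t]; // write the tile back reversed
      }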
  17. Global Memory Bandwidth
      Fermi-core Tesla: 1 TFLOPS single precision
      144 GB/s global memory bandwidth
      That is 36 giga single-precision floats per second
      At least 1000/36 ≈ 28 arithmetic operations per loaded float are necessary to achieve peak throughput
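
     The arithmetic behind the 28-operation figure, written out:

      \frac{144\ \text{GB/s}}{4\ \text{bytes/float}} = 36\ \text{Gfloats/s},
      \qquad
      \frac{1000\ \text{GFLOP/s}}{36\ \text{Gfloats/s}} \approx 28\ \text{FLOP per loaded float}.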
  18. Compilation
      Integrated C programs with CUDA extensions
      (Diagram: the NVCC compiler wrapper splits the source; host code goes to the host C compiler/linker, device code is compiled to PTX and then just-in-time compiled on the device, yielding a program for a heterogeneous platform with CPUs and GPUs.)
      The situation gets messy when other compiler wrappers enter (mpicc)
      It is a good idea to separate C++ code from CUDA C
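
     One way to keep the two apart, sketched below under assumptions (the file names and the wrapper launch_vectorAdd are hypothetical): compile the kernels plus a plain C wrapper with nvcc, and let the C++ or MPI side see only the wrapper's declaration.

      // kernels.cu, compiled by nvcc; the C++ translation units never
      // see the <<< >>> launch syntax, only this wrapper.
      extern "C" void launch_vectorAdd(float *A_d, float *B_d,
                                       float *C_d, int n)
      {
          int threadsPerBlock = 256;
          int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
          vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);
      }

      // main.cpp, compiled by the host C++ compiler or mpicc, declares:
      // extern "C" void launch_vectorAdd(float *A_d, float *B_d,
      //                                  float *C_d, int n);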
  19. Outline
      1 Disclaimer
      2 Accelerator Architectures
      3 Introduction to Streaming Architecture
      4 Writing Kernels: Vector Addition
      5 Using Libraries
      6 Prospects
      7 Conclusions
  20. CPU Version

      void vecAdd(float *A_h, float *B_h, float *C_h, int n)
      {
          for (int i = 0; i < n; i++)
              C_h[i] = A_h[i] + B_h[i];
      }

      int main()
      {
          float *A_h = (float *) malloc(n * sizeof(float));
          float *B_h = (float *) malloc(n * sizeof(float));
          float *C_h = (float *) malloc(n * sizeof(float));
          // Fill the arrays with data
          vecAdd(A_h, B_h, C_h, n);
          // Use the result vector C_h
          // Deallocate memory
      }
  21. GPU Version – Outline

      void vecAdd(float *A_h, float *B_h, float *C_h, int n)
      {
          float *A_d, *B_d, *C_d;
          // 1. Allocate device memory for A, B, and C;
          //    copy A and B to device memory
          // 2. Kernel launch code: make the device
          //    perform the actual vector addition
          // 3. Copy C from the device memory;
          //    free device vectors
      }

      (Timeline: as on slide 13, the transfers and the kernel serialize on the PCIe bus.)
  22. GPU Memory Allocation and Transfer

      float *A_d, *B_d, *C_d;
      size_t size = n * sizeof(float);
      // Device memory allocations
      cudaMalloc((void **) &A_d, size);
      cudaMalloc((void **) &B_d, size);
      cudaMalloc((void **) &C_d, size);
      // Copying from host to device
      cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
      cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

      (Diagram: the CUDA memory model, as on slide 16.)
  23. Memory Transfer Back and Freeing Device Memory

      cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
      cudaFree(A_d);
      cudaFree(B_d);
      cudaFree(C_d);

      (Diagram: the CUDA memory model, as on slide 16.)
  24. Grid Configuration and Kernel Invocation

      int threadsPerBlock = 256;
      int blocksPerGrid =
          (n + threadsPerBlock - 1) / threadsPerBlock;
      vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);

      The number of blocks is easy: given the number of threads per block, launch enough blocks to cover all the elements.
      Why 256 threads in a block?
      Kernel calls are always asynchronous
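
     Because the launch returns immediately, host code that needs the result has to synchronize; a minimal sketch of the usual pattern (a cudaMemcpy from the device also synchronizes implicitly):

      vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);
      cudaError_t err = cudaGetLastError(); // catches bad launch configurations
      cudaDeviceSynchronize();              // block until the kernel finishes
      if (err != cudaSuccess)
          fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));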
  25. The Kernel

      __global__ void
      vectorAdd(float *A, float *B, float *C, int n)
      {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n)
          {
              C[i] = A[i] + B[i];
          }
      }
  26. Silent About
      2D/3D tiling
      Shared memory, registers, texture memory, constant memory
      Coalesced access, caching, bank conflicts
      Streams, parallel workflows
      Overlapping host-device memory transfers
  27. Outline
      1 Disclaimer
      2 Accelerator Architectures
      3 Introduction to Streaming Architecture
      4 Writing Kernels: Vector Addition
      5 Using Libraries
      6 Prospects
      7 Conclusions
  28. Computational Thinking
      Wen-mei Hwu:
      "The ability to translate/formulate domain problems into computational models that can be solved efficiently by available computing resources:
      – understanding the relationship between the domain problem and the computational models,
      – understanding the strengths and limitations of the computing devices,
      – defining problems and models to enable efficient computational solutions."
  29. Calculating a Euclidean Distance Matrix: The Problem
      You are given two sets of d-dimensional vectors:
      x_1, x_2, ..., x_m and w_1, w_2, ..., w_n
      The goal is to calculate every pairwise Euclidean distance:

      d(x_i, w_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - w_{jk})^2}

      Other distance functions are also important.
      If the second set of vectors is identical to the first, the pairwise distances are closely related to the Gram matrix of inner products.
      Gram matrices are extensively used in machine learning.
  30. Calculating the Distances with the CPU

      float get_distance(float *x, float *w, int d)
      {
          float distance = 0.0f;
          for (int k = 0; k < d; k++)
              distance += (x[k] - w[k]) * (x[k] - w[k]);
          return sqrt(distance);
      }

      int main()
      {
          // X is a 2D array containing m vectors
          // W is the other 2D array containing n vectors
          // D is an m*n 2D array storing the pairwise distances
          for (int i = 0; i < m; i++)
              for (int j = 0; j < n; j++)
                  D[i][j] = get_distance(X[i], W[j], d);
      }
  31. Calculating the Distances with the CPU – Observations
      Memory access patterns are regular
      A total of three nested for loops
      Screams of inherent yet unexploited parallelism
      The present variant easily adapts to other distance functions
  32. A Not Entirely Naïve GPU Variant
      Let us use libraries!
      The distance function is nothing but a vector subtraction and a dot product:
      (x_i - w_j, x_i - w_j)
      This translates to two level-1 CUBLAS calls
      Keep the outer loops:

      for (int i = 0; i < m; i++)
          for (int j = 0; j < n; j++)
              D[i][j] = get_distance_with_cublas_level1(X[i], W[j], d);

      This is a fantastic way to achieve a 10x slowdown
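
     A hedged sketch of what such a per-pair helper could look like with the cuBLAS v2 API (only the helper's name comes from the slide; the handle argument and the scratch device buffer are assumptions): copy x_i, subtract w_j with axpy, then take the dot product of the difference with itself. Several tiny library calls per (i, j) pair are exactly why the outer loops kill performance.

      #include <cublas_v2.h>
      #include <math.h>

      float get_distance_with_cublas_level1(cublasHandle_t handle,
                                            const float *x_d, const float *w_d,
                                            float *scratch_d, int d)
      {
          const float minus_one = -1.0f;
          float sq = 0.0f;
          cublasScopy(handle, d, x_d, 1, scratch_d, 1);             // scratch = x_i
          cublasSaxpy(handle, d, &minus_one, w_d, 1, scratch_d, 1); // scratch -= w_j
          cublasSdot(handle, d, scratch_d, 1, scratch_d, 1, &sq);   // (v, v)
          return sqrtf(sq);
      }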
  33. Reformulating the Problem
      This is where computational thinking begins
      How do you incorporate the two outer loops?
      Use linear algebra to rewrite the problem in a different form:
      1: v1 = (X ∘ X)[1, 1, ..., 1]
      2: v2 = (W ∘ W)[1, 1, ..., 1]
      3: P1 = [v1 v1 ... v1]
      4: P2 = [v2 v2 ... v2]
      5: P3 = XWᵀ  // This is BLAS Level 3
      6: D = √(P1 + P2 − 2P3), element-wise
      The raw performance is about 20-50x faster than a single-core CPU.
      http://peterwittek.github.io/somoclu/
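
     Step 5 dominates the cost and maps to a single GEMM. A sketch with cuBLAS (X_d, W_d, and P3_d are hypothetical device pointers; X is assumed stored row-major as m x d and W as n x d, which the slide does not specify). cuBLAS is column-major, so the row-major m x n result is obtained by computing its transpose, P3ᵀ = W Xᵀ:

      #include <cublas_v2.h>

      // Fills P3_d with X * W^T (m x n, row-major) in one level-3 call.
      void pairwise_dot_products(cublasHandle_t handle,
                                 const float *X_d, const float *W_d,
                                 float *P3_d, int m, int n, int d)
      {
          const float alpha = 1.0f, beta = 0.0f;
          // Column-major view: C (n x m) = W (n x d, transposed view)
          //                              * X (d x m, plain view)
          cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                      n, m, d,
                      &alpha, W_d, d,   // W buffer, leading dimension d
                              X_d, d,   // X buffer, leading dimension d
                      &beta,  P3_d, n); // result buffer, leading dimension n
      }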
  34. Almost the Right Tool: Thrust
      Thrust is a C++ template library for 1D arrays in CUDA
      It saves you from writing kernels, allocating device memory, and doing the memory transfers
      The problem: Thrust is limited to 1D
      Example: reduction
      CUDA: efficient reduction kernels are hard to write
      Thrust: reduction operators are readily available, but the 2D data structure and extending the operator are hard
      The cool thing: Thrust works both on the CPU and the GPU
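
     For contrast, the easy 1D case really is short; a minimal sketch (not from the slides) that sums a device vector with a single call:

      #include <thrust/device_vector.h>
      #include <thrust/reduce.h>

      int main()
      {
          thrust::device_vector<float> v(1 << 20, 1.0f); // 2^20 ones on the GPU
          float sum = thrust::reduce(v.begin(), v.end(), 0.0f); // 1048576.0f
          return 0;
      }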
  35. Sum Elements in a Matrix According to a Stencil
      This is a reduction problem
      Dot product of a binary stencil and a submatrix at a given offset:

      thrust::inner_product(tile.begin(), tile.end(),
                            stencil.begin(), 0);

      The hard part is extracting a 2D sub-array (the tile above) from the matrix
  36. Sum Elements in a Matrix According to a Stencil – Part 2
      We need to define a new class with appropriate iterators:

      typedef typename thrust::counting_iterator<data_type>
          CountingIterator;
      typedef typename thrust::transform_iterator<tile_functor,
          CountingIterator> TransformIterator;
      typedef typename thrust::permutation_iterator<Iterator,
          TransformIterator> PermutationIterator;

      The "permutation" will be a new transform iterator that ensures only the tile elements are returned.
  37. Sum Elements in a Matrix According to a Stencil – Part 3
      Extending the unary functor of the transform iterator:

      struct tile_functor : public
          thrust::unary_function<data_type, data_type>
      {
          data_type tile_size_x;
          data_type leading_dimension;

          tile_functor(data_type _tile_size_x, data_type _leading_dimension)
              : tile_size_x(_tile_size_x),
                leading_dimension(_leading_dimension) {}

          __host__ __device__
          data_type operator()(const data_type& i) const
          {
              int x = i % tile_size_x;
              int y = i / tile_size_x;
              return leading_dimension * y + x;
          }
      };
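
     A hedged sketch of how the pieces could be assembled (assuming data_type = int, Iterator = thrust::device_vector<float>::iterator, and a tile_offset index for the tile's first element; none of these names appear on the slides):

      CountingIterator first(0);
      // Map tile-local indices 0 .. tile_size_x * tile_size_y - 1 to offsets
      // within the full matrix, then gather and reduce through them.
      TransformIterator mapped(first,
          tile_functor(tile_size_x, leading_dimension));
      PermutationIterator tile_begin(matrix.begin() + tile_offset, mapped);
      float tile_sum = thrust::reduce(tile_begin,
                                      tile_begin + tile_size_x * tile_size_y);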
  38. Outline
      1 Disclaimer
      2 Accelerator Architectures
      3 Introduction to Streaming Architecture
      4 Writing Kernels: Vector Addition
      5 Using Libraries
      6 Prospects
      7 Conclusions
  39. Trends in Supercomputing: Top500
      (Figure: three Top500 snapshots, (a) November 2010, (b) November 2011, (c) November 2012.)
  40. Towards Exascale
      With current technology, exascale computing would require several power plants
      Mont Blanc project: build a supercomputer from cell phone components
      ARM processor with an on-chip GPU
      Many heterogeneous, small cores, with a limited address space
      The prototype has been operational for over a year
  41. OmpSs: OpenMP SuperScalar
      OpenCL/CUDA coding is complex and error-prone:
      Memory allocation and transfers
      Manual work scheduling
      Heterogeneous workloads are hard
      Instead, use directives:
      Detect hardware at runtime
      Schedule work according to the available resources
      https://pm.bsc.es/ompss
  42. Outline
      1 Disclaimer
      2 Accelerator Architectures
      3 Introduction to Streaming Architecture
      4 Writing Kernels: Vector Addition
      5 Using Libraries
      6 Prospects
      7 Conclusions
  43. Summary
      General-purpose programming on GPUs is still rapidly evolving
      Compiler technology is slowly getting better
      Supporting tools are getting better much faster
      The HPC ecosystem clusters around CUDA and Nvidia hardware; OpenCL is far less common
      CUDA is not easy, but it is easier than it was two years ago
      Remember computational thinking