IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.

Transcript

  • 1. IAP09 CUDA@MIT / 6.963. Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA. Lecture 04: CUDA - Advanced #1. Nicolas Pinto (MIT)
  • 2. During this course, we’ll try to “ ” and use existing material ;-) (adapted for 6.963)
  • 3. warp != wrap
  • 4. Today yey!!
  • 5. Textures & OpenGL; Async API; Libraries; Interfacing CUDA; Performance
  • 6. CUDA Textures and OpenGL
  • 7. Textures in CUDA. Different hardware path to memory. Benefits of CUDA textures: texture fetches are cached (optimized for 2D locality); textures are addressable in 2D using integer or normalized coordinates, which means fewer addressing calculations in code; filtering is provided for free; wrap modes (boundary conditions) are free: clamp to edge / repeat. Limitations of CUDA textures: read-only; currently either 1D or 2D (3D will be added); 9-bit accuracy of filter weights. © NVIDIA Corporation 2008
  • 8. Two CUDA Texture Types. Bound to linear memory: global memory is bound to a texture; only 1D; integer addressing; no filtering, no addressing modes. Bound to CUDA arrays: a CUDA array is bound to a texture; 1D or 2D; float addressing (size-based or normalized); filtering; addressing modes (clamping, repeat). Both return either the element type or a normalized float.
  • 9. CUDA Texturing Steps. Host (CPU) code: allocate/obtain memory (global linear, or CUDA array); create a texture reference object (currently must be at file scope); bind the texture reference to memory/array; when done, unbind the texture reference and free resources. Device (kernel) code: fetch using the texture reference; linear memory textures use tex1Dfetch(); array textures use tex1D() or tex2D().
  • 10. Texture Reference. Immutable parameters (compile-time): Type, the type returned when fetching (basic int and float types; CUDA 1-, 2-, 4-element vectors); Dimensionality, currently 1 or 2 (3 will be supported in the future); Read Mode, cudaReadModeElementType or cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints; returns [-1,1] for signed, [0,1] for unsigned). Mutable parameters (run-time, only for array textures): Normalized, non-zero = addressing range [0, 1]; Filter Mode, cudaFilterModePoint or cudaFilterModeLinear; Address Mode, cudaAddressModeClamp or cudaAddressModeWrap.
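The address modes and linear filtering listed on the texture slides can be illustrated with a small host-side reference in plain C. This is only a sketch under stated assumptions: the function names are hypothetical, texel i is assumed to be centred at coordinate i + 0.5 (unnormalized coordinates), and the interpolation weight is kept in full float precision, whereas the slides note that the hardware uses 9-bit filter weights.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp addressing: out-of-range indices stick to the nearest edge texel. */
int clamp_index(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Wrap (repeat) addressing: indices are taken modulo the texture size. */
int wrap_index(int i, int n)
{
    int r = i % n;
    return r < 0 ? r + n : r;
}

/* 1D linear filtering with clamp addressing (hypothetical CPU reference).
 * Assumption: texel i is centred at x = i + 0.5. */
float tex1d_linear_clamp(const float *texels, int n, float x)
{
    assert(texels != NULL && n > 0);
    float xb = x - 0.5f;
    int i0 = (int)xb;                 /* truncates toward zero ...        */
    if (xb < (float)i0) i0 -= 1;      /* ... so step down for negatives   */
    float alpha = xb - (float)i0;     /* fractional part, in [0, 1)       */
    float t0 = texels[clamp_index(i0, n)];
    float t1 = texels[clamp_index(i0 + 1, n)];
    return (1.0f - alpha) * t0 + alpha * t1;
}
```

With texels {0, 10, 20, 30}, a fetch at x = 1.0 falls halfway between the first two texel centres and returns 5, while fetches far outside the range return the edge values. Swapping clamp_index for wrap_index would give the repeat behaviour instead.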
  • 11. Example: Host code for linear mem
      // declare texture reference (must be at file-scope)
      texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef;
      ...
      // set up linear memory
      unsigned short *dA = 0;
      cudaMalloc((void**)&dA, numBytes);
      cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);
      // bind texture reference to linear memory
      cudaBindTexture(NULL, texRef, dA);
  • 12. cudaArray Type. Channel format, width, height. cudaChannelFormatDesc structure: int x, y, z, w (bits for each component); enum cudaChannelFormatKind, one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat; some predefined constructors: cudaCreateChannelDesc<float>(void); cudaCreateChannelDesc<float4>(void). Management functions: cudaMallocArray, cudaFreeArray, cudaMemcpyToArray, cudaMemcpyFromArray, ...
  • 13. Example: Host code for 2D array tex
      // declare texture reference (must be at file-scope)
      texture<float, 2, cudaReadModeElementType> texRef;
      ...
      // set up the CUDA array
      cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
      cudaArray *texArray = 0;
      cudaMallocArray(&texArray, &cf, dimX, dimY);
      cudaMemcpyToArray(texArray, 0,0, hA, numBytes, cudaMemcpyHostToDevice);
      // specify mutable texture reference parameters
      texRef.normalized = 0;
      texRef.filterMode = cudaFilterModeLinear;
      // addressMode is an array with one entry per dimension
      texRef.addressMode[0] = cudaAddressModeClamp;
      texRef.addressMode[1] = cudaAddressModeClamp;
      // bind texture reference to array
      cudaBindTextureToArray(texRef, texArray);
  • 14. OpenGL Interoperability. OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory: vertex buffer objects, pixel buffer objects. Direct3D9 vertex objects can be mapped. Data can be accessed like any other global data in the device code. Image data can be displayed from pixel buffer objects using glDrawPixels / glTexImage2D; this requires a copy in video memory, but is still fast.
  • 15. OpenGL Interop Steps. Register a buffer object with CUDA: cudaGLRegisterBufferObject(GLuint buffObj); OpenGL can use a registered buffer only as a source; unregister the buffer prior to rendering to it by OpenGL. Map the buffer object to CUDA memory: cudaGLMapBufferObject(void **devPtr, GLuint buffObj); returns an address in global memory; the buffer must be registered prior to mapping. Launch a CUDA kernel to process the buffer. Unmap the buffer object prior to use by OpenGL: cudaGLUnmapBufferObject(GLuint buffObj). Unregister the buffer object: cudaGLUnregisterBufferObject(GLuint buffObj); optional, needed if the buffer is a render target. Use the buffer object in OpenGL code.
  • 16. Interop Scenario: Dynamic CUDA-generated texture. Register the texture PBO with CUDA. For each frame: map the buffer; generate the texture in a CUDA kernel; unmap the buffer; update the texture; render the textured object.
      unsigned char *p_d=0;
      cudaGLMapBufferObject((void**)&p_d, pbo);
      prepTexture<<<height,width>>>(p_d, time);
      cudaGLUnmapBufferObject(pbo);
      glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
      glBindTexture(GL_TEXTURE_2D, texID);
      glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256, GL_BGRA, GL_UNSIGNED_BYTE, 0);
  • 17. Interop Scenario: Frame Post-processing by CUDA. For each frame: render to PBO with OpenGL; register the PBO with CUDA; map the buffer; process the buffer with a CUDA kernel; unmap the buffer; unregister the PBO from CUDA.
      unsigned char *p_d=0;
      cudaGLRegisterBufferObject(pbo);
      cudaGLMapBufferObject((void**)&p_d, pbo);
      postProcess<<<blocks,threads>>>(p_d);
      cudaGLUnmapBufferObject(pbo);
      cudaGLUnregisterBufferObject(pbo);
      ...
  • 18. CUDA Async API
  • 19. Asynchronous memory copy. Asynchronous host <-> device memory copy for page-locked memory frees up the CPU on all CUDA capable devices. Overlap is implemented by using a CUDA stream. CUDA Stream = sequence of CUDA operations that execute in order. Stream API: each stream has an ID: 0 = default stream.
      cudaMemcpyAsync(dst, src, size, 0);
  • 20. Overlap kernel and memory copy. Concurrent execution of a kernel and a host <-> device memory copy for page-locked memory. Compute capability >= 1.1 (G84 and up). Available as a preview feature in CUDA 1.1. Overlaps kernel execution in one stream with a memory copy from another stream. Stream API:
      cudaStreamCreate(&stream1);
      cudaStreamCreate(&stream2);
      cudaMemcpyAsync(dst, src, size, stream1);   // overlapped with the kernel below
      kernel<<<grid, block, 0, stream2>>>(…);
      cudaStreamQuery(stream2);
  • 21. CUDA Event API. Events are inserted (recorded) into CUDA call streams. Usage scenarios: measure elapsed time for CUDA calls (clock cycle precision); query the status of an asynchronous CUDA call; block the CPU until CUDA calls prior to the event are completed. See the asyncAPI sample in the CUDA SDK.
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord(start, 0);
      kernel<<<grid, block>>>(...);
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);
      float et;
      cudaEventElapsedTime(&et, start, stop);
      cudaEventDestroy(start);
      cudaEventDestroy(stop);
  • 22. CUDA Libraries
  • 23. CUDA libraries. CUDA includes 2 widely used libraries: CUBLAS (BLAS implementation) and CUFFT (FFT implementation). CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/ : reduction, scan, sort. M02: High Performance Computing with CUDA
  • 24. Closely Coupled CPU-GPU. [Figure: timeline with labels Init, Alloc, Function, Lib, Operation 1, Operation 2, Operation 3, CPU, GPU.] Integrated programming model. High-speed data transfer, up to 5.5 GB/sec. Asynchronous data transfer. Large GPU memory systems.
  • 25. CUBLAS. Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver. Self-contained at the API level, no direct interaction with the CUDA driver. Basic model for use: create matrix and vector objects in GPU memory space; fill objects with data; call a sequence of CUBLAS functions; retrieve data from the GPU. The CUBLAS library contains helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from objects.
  • 26. Using CUBLAS. The interface to the CUBLAS library is in cublas.h. Function naming convention: cublas + BLAS name, e.g., cublasSGEMM. Error handling: CUBLAS core functions do not return an error; CUBLAS provides a function to retrieve the last error recorded; CUBLAS helper functions do return an error. Helper functions: memory allocation, data transfer. Implemented using the C-based CUDA tool chain, so interfacing to C/C++ applications is trivial.
  • 27. Supported Features (columns: single precision real | single precision complex | double precision* real | double precision* complex)
      Level 1: ✓ | ✓ | ✓ |
      Level 2: ✓ | | dgemv, dger, dsyr, dtrsv |
      Level 3: ✓ | cgemm | ✓ | zgemm
    *Double-precision functions are only supported on GPUs with double-precision hardware.
  • 28. CUBLAS Helper Functions. cublasInit(): initializes the CUBLAS library. cublasShutdown(): releases resources used by the CUBLAS library. cublasGetError(): returns the last error from a CUBLAS core function (and resets it). cublasAlloc(): wrapper around cudaMalloc() to allocate space for an array. cublasFree(): destroys an object in GPU memory. cublas[Set|Get][Vector|Matrix](): copies array elements between CPU and GPU memory; accommodates non-unit strides.
  • 29. sgemmExample.c
      #include <stdio.h>
      #include <stdlib.h>
      #include "cublas.h"

      int main(void)
      {
        float *a_h, *b_h, *c_h;
        float *a_d, *b_d, *c_d;
        float alpha = 1.0f, beta = 0.0f;
        int N = 2048, n2 = N*N;
        int nBytes, i;

        nBytes = n2*sizeof(float);
        a_h = (float *)malloc(nBytes);
        b_h = (float *)malloc(nBytes);
        c_h = (float *)malloc(nBytes);
        for (i=0; i < n2; i++) {
          a_h[i] = rand() / (float) RAND_MAX;
          b_h[i] = rand() / (float) RAND_MAX;
        }

        cublasInit();
        cublasAlloc(n2, sizeof(float), (void **)&a_d);
        cublasAlloc(n2, sizeof(float), (void **)&b_d);
        cublasAlloc(n2, sizeof(float), (void **)&c_d);
        cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
        cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);
        cublasSgemm('n', 'n', N, N, N, alpha, a_d, N, b_d, N, beta, c_d, N);
        cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);

        free(a_h); free(b_h); free(c_h);
        cublasFree(a_d); cublasFree(b_d); cublasFree(c_d);
        cublasShutdown();
        return 0;
      }
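One way to sanity-check the result copied back by the example above is to compare it against a plain host implementation. The following is a minimal sketch, not part of the original slides: sgemm_ref is a hypothetical name, it covers only the no-transpose case used in the example, and it assumes the column-major storage that BLAS uses.

```c
#include <assert.h>
#include <stddef.h>

/* Host reference for C = alpha*A*B + beta*C, no transposes, column-major.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc). */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *A, int lda,
               const float *B, int ldb,
               float beta, float *C, int ldc)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            float *c = &C[i + (size_t)j * ldc];
            /* As in BLAS, C is not read when beta is zero. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```

For N = 2048 this triple loop is slow, so a check on a smaller N, or on a handful of sampled entries, is probably more practical. Results will differ from the GPU in the last bits because the summation order is different, so a comparison should use a tolerance.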
  • 30. Calling CUBLAS from FORTRAN. Two interfaces. Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c): allows interfacing to existing applications without any changes; during each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy back the results to CPU memory space and deallocate the GPGPU memory; intended for light testing due to call overhead. Non-Thunking (default): intended for production code; substitute device pointers for vector and matrix arguments in all BLAS functions; existing applications need to be modified slightly to allocate and deallocate data structures in GPGPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX).
  • 31. SGEMM example (THUNKING)
      ! Define 3 single precision matrices A, B, C
      real , dimension(m1,m1):: A, B, C
      ……
      ! Initialize
      ……
      #ifdef CUBLAS
      ! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of
      ! memory allocation on device and data movement)
      call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
      #else
      ! Call SGEMM in host BLAS library
      call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
      #endif
    To use the host BLAS routine:
      g95 -O3 code.f90 -L/usr/local/lib -lblas
    To use the CUBLAS routine (fortran.c is provided by NVIDIA):
      gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
      g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas
  • 32. SGEMM example (NON-THUNKING)
      ! Define 3 single precision matrices A, B, C
      real , dimension(m1,m1):: A, B, C
      integer:: devPtrA, devPtrB, devPtrC, size_of_real=4
      ……
      ! Initialize A, B, C
      ………
      ! Allocate matrices on GPU
      cublasAlloc(m1*m1, size_of_real, devPtrA)
      cublasAlloc(m1*m1, size_of_real, devPtrB)
      cublasAlloc(m1*m1, size_of_real, devPtrC)
      ! Copy data from CPU to GPU
      cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1)
      cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1)
      cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1)
      ! Call SGEMM in CUBLAS library using NON-THUNKING interface
      ! (library is expecting data in GPU memory, so pass the device pointers)
      call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
      ! Copy data from GPU to CPU
      cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1)
      ! Free memory on device
      cublasFree(devPtrA)
      ……
    Build:
      g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
  • 33. [Slide shows the first page of a paper.] Benchmarking GPUs to Tune Dense Linear Algebra. Vasily Volkov and James W. Demmel, University of California at Berkeley. Abstract (excerpt): We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. Table 1 of the paper lists the GPUs used in the study: GTX280, 9800GTX, 8800GTX, 8600GTS. Volkov and Demmel (SC08)
  • 34. [Same paper page as slide 33.] Volkov and Demmel (SC08)
  • 35. DGEMM Performance
  • 36. Additional Resources. CUDA SDK example: simpleCUBLAS. CUBLAS Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone.
  • 37. CUFFT. The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing the discrete Fourier transform of complex or real-valued data sets. CUFFT is the CUDA FFT library. It provides a simple interface for computing parallel FFTs on an NVIDIA GPU, and allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation.
  • 38. Supported Features. 1D, 2D and 3D transforms of complex and real-valued data. Batched execution for doing multiple 1D transforms in parallel. 1D transform size up to 8M elements. 2D and 3D transform sizes in the range [2,16384]. In-place and out-of-place transforms for real and complex data.
  • 39. Transform Types. The library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C. Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1), according to the sign of the complex exponential term. Real and imaginary parts of complex input and output arrays are interleaved; the cufftComplex type is defined for this. For real-to-complex FFTs, the output array holds only the nonredundant coefficients: N -> N/2+1; N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1). For in-place transforms the input/output arrays need to be padded.
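The size rule for real-to-complex transforms can be written down as a short host-side helper, which is handy when sizing buffers. This is a sketch with hypothetical function names; it only encodes the N0 x N1 x … x (Nn/2+1) rule from the slide, using integer division, and the observation that an in-place buffer must hold two floats for every complex output element.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex output elements of a real-to-complex transform:
 * the last dimension shrinks from Nn to Nn/2 + 1, the others are unchanged. */
size_t r2c_output_elems(const int *dims, int rank)
{
    assert(dims != NULL && rank >= 1);
    size_t count = 1;
    for (int d = 0; d < rank; ++d) {
        assert(dims[d] > 0);
        int extent = (d == rank - 1) ? dims[d] / 2 + 1 : dims[d];
        count *= (size_t)extent;
    }
    return count;
}

/* Number of real (float) slots an in-place R2C buffer needs:
 * two per complex output element, which is why the input must be padded. */
size_t r2c_inplace_real_elems(const int *dims, int rank)
{
    return 2 * r2c_output_elems(dims, rank);
}
```

For a 1D transform of 1024 real samples this gives 513 complex outputs, so an in-place buffer needs room for 1026 floats rather than 1024.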
  • 40. More on Transforms. For 2D and 3D transforms, CUFFT performs transforms in row-major (C) order; if calling from FORTRAN or MATLAB, remember to change the order of the size parameters during plan creation. CUFFT performs un-normalized transforms: IFFT(FFT(A)) = length(A)*A. The CUFFT API is modeled after FFTW. It is based on plans, which completely specify the optimal configuration to execute a particular size of FFT. Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration. This works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources.
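The un-normalized convention IFFT(FFT(A)) = length(A)*A can be checked on the host with a tiny direct DFT. The sketch below is illustrative only: it is a naive 4-point transform with a hard-coded twiddle table (so it needs no math library), the names cplx and dft4 are hypothetical, and it does not reflect how CUFFT computes anything internally. The sign argument follows the slide's convention of -1 for forward and +1 for inverse.

```c
#include <assert.h>

/* Interleaved real/imaginary pair, laid out like cufftComplex. */
typedef struct { float x, y; } cplx;

/* Naive un-normalized 4-point DFT.
 * sign = -1: forward transform, sign = +1: inverse transform. */
void dft4(const cplx in[4], cplx out[4], int sign)
{
    assert(sign == -1 || sign == 1);
    /* w[k] = exp(sign * 2*pi*i * k / 4) = 1, sign*i, -1, -sign*i */
    const cplx w[4] = {
        { 1.0f, 0.0f }, { 0.0f, (float)sign },
        { -1.0f, 0.0f }, { 0.0f, (float)-sign }
    };
    for (int k = 0; k < 4; ++k) {
        cplx acc = { 0.0f, 0.0f };
        for (int j = 0; j < 4; ++j) {
            cplx t = w[(j * k) % 4];
            /* complex multiply-accumulate: acc += in[j] * t */
            acc.x += in[j].x * t.x - in[j].y * t.y;
            acc.y += in[j].x * t.y + in[j].y * t.x;
        }
        out[k] = acc;
    }
}
```

Running the forward transform followed by the inverse returns every element multiplied by 4, the transform length, so the caller has to divide by the length to recover the input.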
  • 41. CUFFT Types and Definitions. cufftHandle: type used to store and access CUFFT plans. cufftResult: enumeration of API function return values. cufftReal: single-precision, real datatype. cufftComplex: single-precision, complex datatype. Real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C. Directions: CUFFT_FORWARD, CUFFT_INVERSE.
  • 42. CUFFT Example
      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>
      #include "cufft.h"

      int main(int argc, char *argv[])
      {
        cufftComplex *a_h, *a_d;
        cufftHandle plan;
        int N = 1024, batchSize = 10;
        int i, nBytes;
        double maxError;

        nBytes = sizeof(cufftComplex)*N*batchSize;
        a_h = (cufftComplex *)malloc(nBytes);
        for (i=0; i < N*batchSize; i++) {
          a_h[i].x = sinf(i);
          a_h[i].y = cosf(i);
        }
        cudaMalloc((void **)&a_d, nBytes);
        cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

        cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
        cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
        cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);
        cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

        // check error - normalize
        for (maxError = 0.0, i=0; i < N*batchSize; i++) {
          maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError);
          maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError);
        }
        printf("Max fft error = %g\n", maxError);

        cufftDestroy(plan);
        free(a_h); cudaFree(a_d);
        return 0;
      }
  • 43. Additional CUFFT Resources. CUDA SDK examples: simpleCUFFT, convolutionFFT2D, oceanFFT. CUFFT Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone.
  • 44. Glue?
  • 45. Interfacing CUDA
  • 46. Interfacing CUDA with other languages. CUDA kernels from FORTRAN; allocate pinned memory from FORTRAN. Calling CUDA from MATLAB with MEX files. Several packages (open source and commercial) interface CUDA with Python, IDL, .NET, FORTRAN (Flagon). Browse CUDA Zone to find all the packages.
  • 47. Pinned memory from FORTRAN. Pinned memory provides a fast PCI-e transfer speed and enables use of streams. Allocation needs to be done with cudaMallocHost. Use the new Fortran 2003 features for interoperability with C.
      use iso_c_binding
      ! The allocation is performed by C function calls. Define the C pointer as type (C_PTR)
      type(C_PTR) :: cptr_A, cptr_B, cptr_C
      ! Define Fortran arrays as pointer.
      real, dimension(:,:), pointer :: A, B, C
      ! Allocating memory with cudaMallocHost.
      ! The Fortran arrays, now defined as pointers, are then associated with the C pointers using the
      ! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1))
      res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
      call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
      ! Use A as usual.
      ! See example code for cudaMallocHost interface code
    http://www.nvidia.com/object/cuda_programming_tools.html
  • 48. Calling CUDA kernels from FORTRAN. From Fortran, call a C function that will call the CUDA kernel.
      ! Fortran -> C -> CUDA -> C -> Fortran
      call cudafunction(c,c2,N)

      /* NB: Fortran subroutine arguments are passed by reference. */
      extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
      {
        ...
        int N=*Np;
        cudaMalloc ((void **) &a_d , sizeof(cuComplex)*N);
        cudaMemcpy( a_d, a, sizeof(cuComplex)*N ,cudaMemcpyHostToDevice);
        dim3 dimBlock(block_size);
        dim3 dimGrid (N/dimBlock.x);
        if( N % block_size != 0 ) dimGrid.x+=1;
        square_complex<<<dimGrid,dimBlock>>>(a_d,a_d,N);
        cudaMemcpy( b, a_d, sizeof(cuComplex)*N,cudaMemcpyDeviceToHost);
        cudaFree(a_d);
      }

    Makefile rules:
      complex_mul: main.f90 cuda_function.o
        $(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart
      cuda_function.o: cuda_function.cu
        nvcc -c -O3 cuda_function.cu
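The grid-size computation in the wrapper above (divide, then add one block if there is a remainder) is an integer ceiling division. A minimal host-side sketch, with the hypothetical name grid_size:

```c
#include <assert.h>

/* Number of thread blocks needed to cover n elements with block_size
 * threads per block: n / block_size, rounded up. */
int grid_size(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    int blocks = n / block_size;
    if (n % block_size != 0)
        blocks += 1;   /* one extra, partially filled block */
    return blocks;
}
```

When n is not a multiple of the block size, the last block contains threads whose global index is n or larger. This is presumably why N is passed to the kernel in the slide's code: the kernel can compare each thread's index against N and skip the out-of-range ones.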
  • 49. CUDA & MATLAB. Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C and Fortran). MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files. MEX files can be used to exploit multi-core processors with OpenMP or threaded codes or, as in this case, to offload functions to the GPU.
  • 50. NVMEX. The native MATLAB script cannot parse CUDA code. The new MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files. Syntax similar to the original mex script:
      >> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
    Available for Windows and Linux from: http://developer.nvidia.com/object/matlab_cuda.html
  • 51. Mex files for CUDA. A typical mex file will perform the following steps:
      1. Convert from double to single precision
      2. Rearrange the data layout for complex data
      3. Allocate memory on the GPU
      4. Transfer the data from the host to the GPU
      5. Perform computation on GPU (library, custom code)
      6. Transfer results from the GPU to the host
      7. Rearrange the data layout for complex data
      8. Convert from single to double
      9. Clean up memory and return results to MATLAB
    Some of these steps will go away with new versions of the library (2,7) and new hardware (1,8).
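Steps 1, 2, 7 and 8 amount to converting between a split double-precision layout and the interleaved single-precision layout that cufftComplex uses. The sketch below is a plain C illustration with hypothetical names; it assumes the split layout consists of two separate arrays for real and imaginary parts, and that the imaginary array may be absent (NULL) for purely real input.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved single-precision pair, laid out like cufftComplex. */
typedef struct { float x, y; } cplx32;

/* Steps 1-2: split double arrays -> interleaved single precision.
 * im may be NULL, in which case the imaginary parts are set to zero. */
void split_to_interleaved(const double *re, const double *im,
                          cplx32 *out, size_t n)
{
    assert(re != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i].x = (float)re[i];
        out[i].y = (im != NULL) ? (float)im[i] : 0.0f;
    }
}

/* Steps 7-8: interleaved single precision -> split double arrays. */
void interleaved_to_split(const cplx32 *in, double *re, double *im, size_t n)
{
    assert(in != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; ++i) {
        re[i] = (double)in[i].x;
        im[i] = (double)in[i].y;
    }
}
```

The narrowing to float in the first function loses precision for values that are not exactly representable in single precision, which is the cost of steps 1 and 8 that the slide expects newer hardware to remove.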
  • 52. CUDA MEX example. Additional code in the MEX file to handle CUDA:
      /* Parse input, convert to single precision and to interleaved complex format */
      …..
      /* Allocate array on the GPU */
      cufftComplex *rhs_complex_d;
      cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M);
      /* Copy input array in interleaved format to the GPU */
      cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);
      /* Create plan for CUDA FFT. NB: transposing dimensions */
      cufftPlan2d(&plan, N, M, CUFFT_C2C) ;
      /* Execute FFT on GPU */
      cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ;
      /* Copy result back to host */
      cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);
      /* Clean up memory and plan on the GPU */
      cufftDestroy(plan); cudaFree(rhs_complex_d);
      /* Convert back to double precision and to split complex format */
      ….
  • 53. Timing details. 1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence.
      Columns: Opteron 250 (runtime, speed-up) | Opteron 2210 (runtime, speed-up)
      PCI-e bandwidth, host to device: 1135 MB/s | 1483 MB/s
      PCI-e bandwidth, host from device: 1003 MB/s | 1223 MB/s
      Standard MATLAB: 8098 s | 9525 s
      Overload FFT2 and IFFT2: 4425 s, 1.8x | 4937 s, 1.9x
      Overload Szeta: 735 s, 11.x | 789 s, 12.x
      Overload Szeta, FFT2 and IFFT2: 577 s, 14.x | 605 s, 15.7x
  • 59. Wanna Play with The Big Guys?
  • 60. 6.963 CUDA@MIT IAP09: CUDA Performance Strategies
  • 61. Threading: Programming Model. A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by: sharing data through shared memory; synchronizing their execution. Threads from different blocks cannot cooperate. [Diagram: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; a grid is a 3x2 array of blocks (0,0) to (2,1), and Block (1,1) is expanded into a 5x3 array of threads (0,0) to (4,2).] © NVIDIA Corporation 2006
  • 62. Memory: Data Movement in a CUDA Program. Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory. © NVIDIA Corporation 2008
  • 63. Perf: Optimize Algorithms for the GPU. Maximize independent parallelism. Maximize arithmetic intensity (math/bandwidth). Sometimes it's better to recompute than to cache: the GPU spends its transistors on ALUs, not memory. Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to host.
  • 64. Perf: Optimize Memory Coherence. Coalesced vs. non-coalesced = order of magnitude (global/local device memory). Optimize for spatial locality in cached texture memory. In shared memory, avoid high-degree bank conflicts.
  • 65. Perf: Take Advantage of Shared Memory. Hundreds of times faster than global memory. Threads can cooperate via shared memory: use one / a few threads to load / compute data shared by all threads. Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing. Matrix transpose example later.
  • 66. Perf: Use Parallelism Efficiently. Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks. Keep resource usage low enough to support multiple active thread blocks per multiprocessor: registers, shared memory.
  • 67. Perf: Memory optimizations. Optimizing memory transfers. Coalescing global memory accesses. Using shared memory effectively.
  • 68. Perf: Data Transfers. Device memory to host memory bandwidth is much lower than device memory to device bandwidth: 4GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600); 8GB/s for PCI-e 2.0. Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory. Group transfers: one large transfer is much better than many small ones.
  • 69. Perf: Page-Locked Memory Transfers. cudaMallocHost() allows allocation of page-locked host memory. Enables highest cudaMemcpy performance: 3.2 GB/s+ common on PCI-express (x16); ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e). See the "bandwidthTest" CUDA SDK sample. Use with caution: allocating too much page-locked memory can reduce overall system performance. Test your systems and apps to learn their limits.
  • 70. Perf: Global Memory Reads/Writes. Highest latency instructions: 400-600 clock cycles. Likely to be the performance bottleneck. Optimizations can greatly increase performance: coalescing, up to 10x speedup; latency hiding, up to 2.5x speedup.
  • 71. Perf: Coalescing. A coordinated read by a half-warp (16 threads). A contiguous region of global memory: 64 bytes - each thread reads a word: int, float, ...; 128 bytes - each thread reads a double-word: int2, float2, ...; 256 bytes - each thread reads a quad-word: int4, float4, ... Additional restrictions on G8X/G9X architecture: the starting address for a region must be a multiple of the region size; the kth thread in a half-warp must access the kth element in a block being read. Exception: not all threads must be participating (predicated access, divergence within a half-warp).
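The G8X/G9X rules above can be turned into a small host-side checker. This is a sketch in plain C++ (no GPU needed), assuming byte addresses and a word size of 4, 8 or 16 bytes. `is_coalesced_g80` is a hypothetical helper written for this illustration, and it models only the rules quoted on the slide.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if a half-warp's accesses satisfy the rules quoted above.
// addr[k] is the byte address used by thread k, or -1 if thread k does not
// participate (which the rules allow). word_bytes is 4, 8 or 16.
bool is_coalesced_g80(const std::vector<std::int64_t>& addr, int word_bytes) {
    // Region size: 16 threads x word size = 64, 128 or 256 bytes.
    const std::int64_t region = 16 * static_cast<std::int64_t>(word_bytes);
    bool have_base = false;
    std::int64_t base = 0;
    for (std::size_t k = 0; k < addr.size(); ++k) {
        if (addr[k] < 0) continue;  // inactive thread
        // The k-th thread must touch the k-th word, so every active thread
        // must imply the same region start address.
        const std::int64_t b =
            addr[k] - static_cast<std::int64_t>(k) * word_bytes;
        if (!have_base) {
            base = b;
            have_base = true;
        } else if (b != base) {
            return false;
        }
    }
    // The region must start on a multiple of its own size.
    return !have_base || (base >= 0 && base % region == 0);
}
```

For example, floats read at addresses 128, 132, ..., 188 pass the check, while the same pattern shifted to start at 132, a permuted pattern, or a 12-byte structure stride (0, 12, 24, ...) all fail it.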
  • 72. Perf: Coalesced Access: Reading floats. [Diagram: threads t0, t1, t2, t3, ..., t14, t15 read consecutive floats at addresses 128, 132, 136, 140, 144, ..., 184, 188, 192.] Case 1: all threads participate. Case 2: some threads do not participate.
  • 73. Perf: Uncoalesced Access: Reading floats. [Diagram: threads t0 ... t15 against addresses 128 ... 192.] Case 1: permuted access by threads. Case 2: misaligned starting address (not a multiple of 64).
  • 74. Perf: Coalescing: Timing Results. Experiment on G80: kernel: read a float, increment, write back; 3M floats (12MB); times averaged over 10K runs. 12K blocks x 256 threads: 356µs - coalesced; 357µs - coalesced, some threads don't participate; 3,494µs - permuted/misaligned thread access.
  • 75. Perf: Coalescing: Structures of size ≠ 4, 8, 16 Bytes. Use a Structure of Arrays (SoA) instead of Array of Structures (AoS). If SoA is not viable: force structure alignment with __align(X), where X = 4, 8, or 16; or use SMEM to achieve coalescing. [Diagram: Point structure (x y z); AoS layout: x y z x y z x y z; SoA layout: x x x y y y z z z.]
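The AoS and SoA layouts from the diagram can be written down directly. This is a host-side C++ sketch under the assumption of a three-float point; `PointAoS`, `PointsSoA` and `aos_to_soa` are illustrative names, not taken from the slides or the SDK.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: one record per point, x, y, z interleaved in memory.
struct PointAoS {
    float x, y, z;
};

// Structure of Arrays: each coordinate stored contiguously.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Repack AoS data into SoA form (done once on the host before upload).
PointsSoA aos_to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    out.z.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}

// Byte distance between the x fields that consecutive threads would read.
constexpr std::size_t kAosStride = sizeof(PointAoS);  // 12 on typical ABIs
constexpr std::size_t kSoaStride = sizeof(float);     // 4
```

With the AoS layout, thread k reading `pts[k].x` strides through memory 12 bytes at a time, so the half-warp does not cover one contiguous 64-byte region; with SoA, thread k reading `x[k]` gives the 4-byte stride that coalescing requires.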
  • 76. Perf: Coalescing: Summary. Coalescing greatly improves throughput. Critical to memory-bound kernels. Reading structures of size other than 4, 8, or 16 bytes will break coalescing: prefer Structures of Arrays over AoS; if SoA is not viable, read/write through SMEM. Additional resources: Aligned Types SDK Sample.
  • 77. Perf: Parallel Memory Architecture. In a parallel machine, many threads access memory. Therefore, memory is divided into banks. Essential to achieve high bandwidth. Each bank can service one address per cycle. A memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict. Conflicting accesses are serialized. [Diagram: Bank 0, Bank 1, ..., Bank 7, ..., Bank 15.]
  • 78. Perf: Bank Addressing Examples. No bank conflicts: linear addressing, stride == 1. No bank conflicts: random 1:1 permutation. [Diagrams: threads 0-15 each mapped to a distinct bank 0-15.]
  • 79. Perf: Bank Addressing Examples. 2-way bank conflicts: linear addressing, stride == 2. 8-way bank conflicts: linear addressing, stride == 8 (x8 accesses on each of two banks). [Diagrams: threads 0-15 mapped onto banks 0-15.]
  • 80. Perf: How addresses map to banks on G80. Bandwidth of each bank is 32 bits per 2 clock cycles. Successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = address % 16. Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp.
  • 81. Perf: Shared memory bank conflicts. Shared memory is as fast as registers if there are no bank conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp read the identical address, there is no bank conflict (broadcast). The slow case: bank conflict, multiple threads in the same half-warp access the same bank; must serialize the accesses; cost = max # of simultaneous accesses to a single bank.
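The mapping bank = address % 16 and the cost rule above can be checked with a few lines of host-side C++. This sketch assumes 16 banks, a 16-thread half-warp and addresses counted in 32-bit words. `bank_conflict_degree` is a hypothetical helper, and it does not model the broadcast case, so a stride of 0 is reported as a 16-way conflict even though the hardware would broadcast.

```cpp
#include <algorithm>
#include <array>

// Worst-case number of threads in one half-warp that hit the same bank when
// thread k accesses the 32-bit word with index k * stride_words.
// Assumes stride_words >= 0.
int bank_conflict_degree(int stride_words) {
    constexpr int kBanks = 16;     // from the slides: G80 has 16 banks
    constexpr int kHalfWarp = 16;  // threads that access shared memory together
    std::array<int, kBanks> hits{};
    for (int k = 0; k < kHalfWarp; ++k) {
        const int word = k * stride_words;
        ++hits[word % kBanks];     // bank = (word address) % 16
    }
    // Cost = max number of simultaneous accesses to a single bank.
    return *std::max_element(hits.begin(), hits.end());
}
```

Strides 1 and 17 give degree 1 (no conflict), stride 2 gives 2, stride 8 gives 8 and stride 16 gives 16, which is why padding a 16-wide shared-memory tile to 17 columns removes the conflicts in a column-wise read.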
  • 82. Perf: Conflicts, Coalescing, Warps... I hate growing up.
  • 83. Perf: Optimization Example: Matrix Transpose
  • 84. Perf: Matrix Transpose. SDK Sample ("transpose"). Illustrates: coalescing; avoiding SMEM bank conflicts; speedups for even small matrices. [Diagram: the 4x4 matrix with rows (1 2 3 4), (5 6 7 8), (9 10 11 12), (13 14 15 16) and its transpose with rows (1 5 9 13), (2 6 10 14), (3 7 11 15), (4 8 12 16).]
  • 85. Perf: Uncoalesced Transpose
        __global__ void transpose_naive(float *odata, float *idata, int width, int height)
        {
            // Global (x, y) coordinates of the element handled by this thread
            unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
            unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
            if (xIndex < width && yIndex < height) {
                // Row-major read: consecutive threads read consecutive addresses
                unsigned int index_in = xIndex + width * yIndex;
                // Transposed write: consecutive threads are `height` elements apart
                unsigned int index_out = yIndex + height * xIndex;
                odata[index_out] = idata[index_in];
            }
        }
  • 86. Perf: Uncoalesced Transpose. Reads input from GMEM: [diagram: input rows (0,0) (0,1) (0,2) ... (0,15); (1,0) (1,1) (1,2) ... (1,15); ... (15,0) (15,1) (15,2) ... (15,15)]; stride = 1, coalesced. Write output to GMEM: [diagram: output rows (0,0) (1,0) (2,0) ... (15,0); (0,1) (1,1) (2,1) ... (15,1); ... (0,15) (1,15) (2,15) ... (15,15)]; stride = 16, uncoalesced.
  • 87. Perf: Coalesced Transpose. Assumption: matrix is partitioned into square tiles. Threadblock (bx, by): read the (bx,by) input tile, store into SMEM; write the SMEM data to the (by,bx) output tile; transpose the indexing into SMEM. Thread (tx,ty): reads element (tx,ty) from the input tile; writes element (tx,ty) into the output tile. Coalescing is achieved if: block/tile dimensions are multiples of 16.
  • 88. Perf: Coalesced Transpose. [Diagrams: reads from GMEM and writes to SMEM both proceed along rows (0,0) (0,1) (0,2) ... (0,15); reads from SMEM proceed along columns (0,0) (1,0) (2,0) ... (15,0), while writes to GMEM proceed along rows (0,0) (0,1) (0,2) ... (0,15).]
  • 89. Perf: SMEM Optimization. Reads from SMEM: threads read SMEM with stride = 16, which causes bank conflicts. Solution: allocate an "extra" column; read stride = 17; threads read from consecutive banks.
  • 91. Perf: Coalesced Transpose
        __global__ void transpose(float *odata, float *idata, int width, int height)
        {
            // Tile with one extra column of padding to avoid bank conflicts
            __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];
            unsigned int xBlock = blockDim.x * blockIdx.x;
            unsigned int yBlock = blockDim.y * blockIdx.y;
            unsigned int xIndex = xBlock + threadIdx.x;
            unsigned int yIndex = yBlock + threadIdx.y;
            unsigned int index_out, index_transpose;

            if (xIndex < width && yIndex < height) {
                // Coalesced read of one input row segment into the tile
                unsigned int index_in = width * yIndex + xIndex;
                unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
                block[index_block] = idata[index_in];
                // Transposed position inside the tile, and coalesced output index
                index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
                index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
            }
            // Wait until the whole tile has been loaded
            __syncthreads();
            if (xIndex < width && yIndex < height)
                odata[index_out] = block[index_transpose];
        }
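The index arithmetic of the tiled kernel can be checked without a GPU. This is a host-side C++ emulation, assuming a tile size of 16 (`kTile` plays the role of BLOCK_DIM) and matrix dimensions that are multiples of the tile size. `transpose_tiled` is a hypothetical function written for this sketch; it uses the same padded-tile indices as the kernel, with the two loop nests standing in for the code before and after `__syncthreads()`.

```cpp
#include <vector>

constexpr int kTile = 16;  // plays the role of BLOCK_DIM (assumed to be 16)

// Emulates the tiled transpose on the host. `in` is a row-major matrix with
// `height` rows and `width` columns; the result has `width` rows and
// `height` columns. Assumes width and height are multiples of kTile.
std::vector<float> transpose_tiled(const std::vector<float>& in,
                                   int width, int height) {
    std::vector<float> out(in.size());
    float tile[kTile * (kTile + 1)];  // one extra column, as in the kernel
    for (int by = 0; by < height / kTile; ++by) {
        for (int bx = 0; bx < width / kTile; ++bx) {
            // Phase 1: "thread" (tx, ty) copies one input element into the tile.
            for (int ty = 0; ty < kTile; ++ty)
                for (int tx = 0; tx < kTile; ++tx)
                    tile[ty * (kTile + 1) + tx] =
                        in[(by * kTile + ty) * width + bx * kTile + tx];
            // Phase 2 (after the barrier): write rows of output tile (by, bx),
            // reading the staged tile by column.
            for (int ty = 0; ty < kTile; ++ty)
                for (int tx = 0; tx < kTile; ++tx)
                    out[(bx * kTile + ty) * height + by * kTile + tx] =
                        tile[tx * (kTile + 1) + ty];
        }
    }
    return out;
}
```

In both phases the innermost index `tx` moves along a row of global memory, which is what makes the reads and the writes coalesced on the GPU; only the shared-memory tile is accessed by column.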
  • 92. Perf: Transpose Timings. Speedups with coalescing and SMEM optimization: 128x128: 0.011ms vs. 0.022ms (2.0X speedup); 512x512: 0.07ms vs. 0.33ms (4.5X speedup); 1024x1024: 0.30ms vs. 1.92ms (6.4X speedup); 1024x2048: 0.79ms vs. 6.6ms (8.4X speedup). Coalescing without SMEM optimization: 128x128: 0.014ms; 512x512: 0.101ms; 1024x1024: 0.412ms; 1024x2048: 0.869ms.
  • 93. Perf: Execution Configuration Optimizations
  • 94. Perf: Occupancy. Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently. Limited by resource usage: registers; shared memory.
  • 95. Perf: Grid/Block Size Heuristics. # of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute. # of blocks / # of multiprocessors > 2: multiple blocks can run concurrently in a multiprocessor; blocks that aren't waiting at a __syncthreads() keep the hardware busy; subject to resource availability - registers, shared memory. # of blocks > 100 to scale to future devices: blocks are executed in pipeline fashion; 1000 blocks per grid will scale across multiple generations.
  • 96. Perf: Register Dependency. Read-after-write register dependency: an instruction's result can be read ~22 cycles later. Scenarios (CUDA: PTX): "x = y + 5; z = x + 3;" becomes "add.f32 $f3, $f1, $f2" followed by "add.f32 $f5, $f3, $f4"; "s_data[0] += 3;" becomes "ld.shared.f32 $f3, [$r31+0]" followed by "add.f32 $f3, $f3, $f4". To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor; at least 25% occupancy; threads do not have to belong to the same thread block.
  • 97. Perf: Register Pressure. Hide latency by using more threads per SM. Limiting factors: number of registers per kernel (8192 per SM, partitioned among concurrent threads); amount of shared memory (16KB per SM, partitioned among concurrent threadblocks). Check the .cubin file for # registers / kernel. Use the -maxrregcount=N flag to NVCC (N = desired maximum registers / kernel). At some point "spilling" into LMEM may occur: this reduces performance, since LMEM is slow. Check the .cubin file for LMEM usage.
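The occupancy definition and the per-SM limits quoted above combine into a back-of-the-envelope estimate. This is a host-side C++ sketch: the 8192 registers and 16KB of shared memory come from the slides, while the limits of 24 resident warps and 8 resident blocks per multiprocessor are assumptions for G80-class hardware. `SmLimits` and `occupancy` are hypothetical names, and real hardware also applies allocation granularities that this model ignores.

```cpp
#include <algorithm>

struct SmLimits {
    int registers = 8192;    // per SM (from the slides)
    int smem_bytes = 16384;  // per SM (from the slides)
    int max_warps = 24;      // assumption: 768 resident threads per SM
    int max_blocks = 8;      // assumption: resident blocks per SM
};

// Fraction of the SM's warp slots that can be filled by one kernel.
// Assumes threads_per_block > 0 and at most max_warps * 32.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_per_block, SmLimits lim = SmLimits{}) {
    const int warps_per_block = (threads_per_block + 31) / 32;
    int blocks = lim.max_blocks;
    // Limit from the number of warp slots.
    blocks = std::min(blocks, lim.max_warps / warps_per_block);
    // Limit from the register file, partitioned among resident threads.
    if (regs_per_thread > 0)
        blocks = std::min(blocks,
                          lim.registers / (regs_per_thread * threads_per_block));
    // Limit from shared memory, partitioned among resident blocks.
    if (smem_per_block > 0)
        blocks = std::min(blocks, lim.smem_bytes / smem_per_block);
    return static_cast<double>(blocks * warps_per_block) / lim.max_warps;
}
```

Under these assumptions a 256-thread block using 10 registers per thread and 4KB of shared memory reaches full occupancy, while doubling the register count to 20 drops it to one resident block (8 of 24 warps). The 192 threads (6 warps) mentioned for hiding register latency correspond to 25% occupancy in the same model.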
  • 98. Perf: Determining resource usage. Use the "--ptxoptions=-v" option to nvcc. Or, compile the kernel code with the -cubin flag to determine register usage. Open the .cubin file with a text editor and look for the "code" section:
        architecture {sm_10}
        abiversion {0}
        modname {cubin}
        code {
            name = BlackScholesGPU
            lmem = 0        (per thread local memory)
            smem = 68       (per thread block shared memory)
            reg = 20        (per thread registers)
            bar = 0
            bincode {
                0xa0004205 0x04200780 0x40024c09 0x00200780 …
  • 99. Perf: CUDA Occupancy Calculator
  • 100. Perf: Optimizing threads per block. Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding. But, more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used. Heuristics: minimum 64 threads per block, and only if there are multiple concurrent blocks; 192 or 256 threads is a better choice, usually still leaving enough registers to compile and invoke successfully. This all depends on your computation, so experiment!
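The heuristics above translate into a small launch-configuration helper. This is a sketch in host-side C++, assuming a warp size of 32 and a one-thread-per-element kernel. `LaunchConfig` and `pick_launch` are hypothetical names, not CUDA API calls.

```cpp
struct LaunchConfig {
    int threads_per_block;
    int blocks;
};

// Rounds the requested block size up to a multiple of the warp size, applies
// the 64-thread minimum from the slide, and computes how many blocks are
// needed to cover n_elements with one thread per element.
LaunchConfig pick_launch(int n_elements, int requested_threads) {
    const int warp = 32;  // assumed warp size
    int threads = ((requested_threads + warp - 1) / warp) * warp;
    if (threads < 64) threads = 64;
    const int blocks = (n_elements + threads - 1) / threads;  // ceiling division
    return LaunchConfig{threads, blocks};
}
```

For 3M floats and 256 threads per block this yields 12K blocks, matching the configuration used in the coalescing timing experiment; a request for 100 threads is rounded up to 128 so that no warp is partially populated. The last block may still overshoot the data, so a kernel launched this way needs a bounds check.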
  • 101. Perf: Occupancy != Performance. Increasing occupancy does not necessarily increase performance, BUT… low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels. (It all comes down to arithmetic intensity and available parallelism.)
  • 102. Perf: Parameterize Your Application. Parameterization helps adaptation to different GPUs. GPUs vary in many ways: # of multiprocessors; memory bandwidth; shared memory size; register file size; threads per block. You can even make apps self-tuning (like FFTW and ATLAS): "experiment" mode discovers and saves the optimal configuration.
  • 103. Perf: Conclusion. Understand CUDA performance characteristics: memory coalescing; divergent branching; bank conflicts; latency hiding. Use peak performance metrics to guide optimization. Understand parallel algorithm complexity theory. Know how to identify the type of bottleneck: e.g. memory, core computation, or instruction overhead. Optimize your algorithm, then unroll loops. Use template parameters to generate optimal code.
  • 104. Perf: The CUDA Visual Profiler
  • 105. Perf: The CUDA Visual Profiler. Helps measure and find potential performance problems. GPU and CPU timing for all kernel invocations and memcpys. Time stamps. Access to hardware performance counters.
  • 106. Perf: Profiler Demo
  • 107. Perf: Signals. Events are tracked with hardware counters on signals in the chip: timestamp; gld_incoherent, gld_coherent, gst_incoherent, gst_coherent: global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent); local_load, local_store: local loads/stores; branch, divergent_branch: total branches and divergent branches taken by threads; instructions: instruction count; warp_serialize: thread warps that serialize on address conflicts to shared or constant memory; cta_launched: executed thread blocks.
  • 108. Perf: Interpreting profiler counters. Values represent events within a thread warp. Only targets one multiprocessor: values will not correspond to the total number of warps launched for a particular kernel; launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. Values are best used to identify relative performance differences between unoptimized and optimized code. In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize.
  • 109. ME CO
  • 110. Back Pocket Slides slide by David Cox
  • 111. 6.963 CUDA@MIT IAP09: Dense Linear Algebra
  • 112. Dense Linear Algebra. Major problems and sources of large-scale cases: linear systems, $Ax = b$ (aeronautics: BEM); eigenvalues, $Ax = \lambda x$ (computational chemistry); singular values, $A = U \Sigma V^T$ (data mining). © 2008 NVIDIA Corporation.
  • 113. Dense Linear Algebra. Major problems and algorithms: linear systems, $Ax = b$: one-sided factorizations (LU, Cholesky, QR); eigenvalues, $Ax = \lambda x$: two-sided factorizations (QR alg., Jacobi); singular values, $A = U \Sigma V^T$: two-sided factorizations (SVD). © 2008 NVIDIA Corporation.
  • 114. !quot;#$%&'!quot;#$%&()$*+,-#.(!quot;#quot;!quot;$quot;%quot;&quot;' 45*6quot;78,-0-/29::;< +=45*6quot;7+;< 6789:;$<=--(!)*+,!!)*+, !quot;##!$%&''(!)*+,!)*+, -,!.,!/, -,!.,!/, 012,!>quot;,!#3quot;, 012,!quot;,!#3quot;, >4,!#34, 012,!>!,!#3!!5? 4,!#34, 012,!!,!#3!!5 +,>.?0/01,2quot;12quot;@A=quot;-BC?1-BD< ! (2101/E1F/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20 ! *EE,I/01,2quot;,Gquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-Kquot;7L/2JEB-Dquot;!quot;#$!%#$!&; ! M-/2DGB-quot;,Gquot;J/0/quot;7>/0-1IBDquot;quot;#$%#$&; ! +,>.?0/01,2quot;7I?NE/D6OB>>; ! PB0-1BHBquot;-BD?E0quot;7>/0-1Qquot;&; ! 8-BBquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-K ! MB->12/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20 !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
  • 115. Table 1 (Volkov and Demmel, SC08): The list of the GPUs used in this study.
                                GeForce    GeForce    GeForce    GeForce
        GPU name                GTX280     9800GTX    8800GTX    8600GTS
        # of vector cores       30         16         16         4
        core clock, GHz         1.30       1.67       1.35       1.45
        registers/core          64KB       32KB       32KB       32KB
        smem/core               16KB       16KB       16KB       16KB
        memory bus, GHz         1.1        1.1        0.9        1.0
        memory bus, pins        512        256        384        128
        bandwidth, GB/s         141        70         86         32
        memory amount           1GB        512MB      768MB      256MB
        SP, peak Gflop/s        624        429        346        93
        SP, peak per core       21         27         22         23
        SP, flops:word          18         25         16         12
        DP, peak Gflop/s        78         -          -          -
        DP, flops:word          4.4        -          -          -
        SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.
  • 116. Benchmarking GPUs to Tune Dense Linear Algebra. Vasily Volkov (Computer Science Division, University of California at Berkeley) and James W. Demmel (Computer Science Division and Department of Mathematics, University of California at Berkeley). Abstract: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance. [The slide reproduces the first page of the paper, which also contains Table 1 and the opening sections.] Volkov and Demmel (SC08)
  • 117. Volkov and Demmel (SC08)
  • 118. [Paper excerpt, Volkov and Demmel (SC08). Figure 7: Rates achieved in the factorizations; percents indicate the highest fraction of the system's peak (GPU+CPU or CPU only) achieved. Axes: Gflop/s vs. order of matrix.]
  • 119. [Paper excerpt, Volkov and Demmel (SC08): overlapping crop of the same page, showing Figure 7 together with a plot of speedup vs. order of matrix.]
  • 120. [Paper excerpt, Volkov and Demmel (SC08). Figure 8: Speedup versus 3.0GHz Core2 Quad; numbers on the right are the best speedups. Table: comparison of best Gflop/s rates in the GPU and CPU versions and best speedup vs. the CPU-alone versions, for LU, Cholesky, QR and SGEMM on Q6850, 8800GTX/E6700 and GTX280/E6700; SGEMM rates for the GPU/CPU systems include GPU rates only.]
  • 121. [Paper excerpt, Volkov and Demmel (SC08): the same table, and Figure 9: Performance of one-GPU and two-GPU versions of the LU decomposition with best rates in Gflop/s shown on the right. Axes: Gflop/s vs. order of matrix.]
  • 122. [Paper excerpt, Volkov and Demmel (SC08). Figure 10: The breakdown of time in the LU decomposition run on GeForce 8800 GTX. Horizontal axis: order of matrix.]
  • 123. 6.963 CUDA@MIT / IAP09: CUFFT Example
  • 124. CUDA Example: Fourier-spectral Poisson Solver Solve a Poisson equation on a rectangular domain with periodic boundary conditions using a Fourier-spectral method. This example will show how to use the FFT library, transfer the data to/from GPU and perform simple computations on the GPU. 31 M02: High Performance Computing with CUDA
  • 125. Mathematical background
    $\nabla^2 \phi = r \;\xrightarrow{\text{FFT}}\; -(k_x^2 + k_y^2)\,\hat{\phi} = \hat{r}$
    (The equations write the unknown as $\phi$; the steps below call it u.)
    1. Apply a 2D forward FFT to r to obtain r(k), where k is the wave number.
    2. Apply the inverse of the Laplace operator to r(k) to obtain u(k). This is a simple element-wise division in Fourier space: $\hat{\phi} = -\dfrac{\hat{r}}{k_x^2 + k_y^2}$
    3. Apply a 2D inverse FFT to u(k) to obtain u.
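Why the Laplacian becomes a multiplication in Fourier space can be sketched in a few lines. The derivation below assumes a square periodic domain of side $L$ and the Fourier-series convention shown; the mode indices $m, n$ and the convention itself are assumptions for illustration, not taken from the slides.

```latex
\begin{align*}
\phi(x,y) &= \sum_{m,n} \hat{\phi}_{mn}\, e^{i (k_x x + k_y y)},
  \qquad k_x = \frac{2\pi m}{L},\quad k_y = \frac{2\pi n}{L} \\
\nabla^2 \phi &= \sum_{m,n} -(k_x^2 + k_y^2)\, \hat{\phi}_{mn}\, e^{i (k_x x + k_y y)} \\
\nabla^2 \phi = r \;&\Longrightarrow\; -(k_x^2 + k_y^2)\,\hat{\phi}_{mn} = \hat{r}_{mn}
  \;\Longrightarrow\; \hat{\phi}_{mn} = -\frac{\hat{r}_{mn}}{k_x^2 + k_y^2},
  \qquad (m,n) \neq (0,0)
\end{align*}
```

The $(0,0)$ mode is left undetermined: the solution is only defined up to an additive constant, and the problem is solvable only if $\hat{r}_{00} = 0$, which is why the code treats that mode as a special case.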
  • 126. Reference MATLAB implementation
    % No. of Fourier modes
    N = 64;
    % Domain size (assumed square)
    L = 1;
    % Characteristic width of f (make << 1)
    sig = 0.1;
    % Vector of wavenumbers
    k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
    % Matrix of (x,y) wavenumbers corresponding
    % to Fourier mode (m,n)
    [KX KY] = meshgrid(k,k);
    % Laplacian matrix acting on the wavenumbers
    delsq = -(KX.^2 + KY.^2);
    % Kludge to avoid division by zero for
    % wavenumber (0,0).
    % (this waveno. of fhat should be zero anyway!)
    delsq(1,1) = 1;
    % Grid spacing
    h = L/N;
    x = (0:(N-1))*h;
    y = (0:(N-1))*h;
    [X Y] = meshgrid(x,y);
    % Construct RHS f(x,y) at the Fourier gridpoints
    rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
    sigsq = sig^2;
    f = exp(-rsq/(2*sigsq)).*...
        (rsq - 2*sigsq)/(sigsq^2);
    % Spectral inversion of Laplacian
    fhat = fft2(f);
    u = real(ifft2(fhat./delsq));
    % Specify arbitrary constant by forcing corner
    % u = 0.
    u = u - u(1,1);
    % Compute L2 and Linf norm of error
    uex = exp(-rsq/(2*sigsq));
    errmax = norm(u(:)-uex(:),inf);
    errmax2 = norm(u(:)-uex(:),2)/(N*N);
    % Print L2 and Linf norm of error
    fprintf('N=%d\n',N);
    fprintf('Solution at (%d,%d): ',N/2,N/2);
    fprintf('computed=%10.6f reference = %10.6f\n',u(N/2,N/2), uex(N/2,N/2));
    fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax, errmax2);
    http://www.atmos.washington.edu/2005Q2/581/matlab/pois_FFT.m
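The host arrays kx and ky used later need the same FFT-ordered wavenumbers as the MATLAB vector k. A minimal C sketch of that one line follows; the function name `fill_wavenumbers` and its parameters are hypothetical, and N is assumed to be even.

```c
#include <assert.h>

/* Fill k[0..n-1] with wavenumbers in FFT order:
 *   0, 1, ..., n/2-1, -n/2, ..., -1, each scaled by 2*pi/domain_len.
 * Mirrors the MATLAB line k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)].
 * Hypothetical helper; n is assumed to be positive and even. */
void fill_wavenumbers(float *k, int n, float domain_len)
{
    const float two_pi = 6.28318530717958647692f;
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n; ++i) {
        /* signed mode number: non-negative in the first half, negative in the second */
        int mode = (i < n / 2) ? i : i - n;
        k[i] = (two_pi / domain_len) * (float)mode;
    }
}
```

On a square domain the same vector can serve as both kx and ky, which is what `meshgrid(k,k)` expresses in the MATLAB reference.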
  • 127. Implementation steps
    The following steps need to be performed:
    1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N)
    2. Allocate memory on device: r_d, u_d, kx_d, ky_d
    3. Transfer r, kx and ky from host memory to the corresponding arrays in device memory
    4. Initialize plan for FFT
    5. Compute execution configuration
    6. Transform real input to complex input
    7. 2D forward FFT
    8. Solve Poisson equation in Fourier space
    9. 2D inverse FFT
    10. Transform complex output to real output
    11. Transfer results from the GPU back to the host
    To keep the code simple, we do not take advantage of the symmetries of real data: a C2C transform is used even though the input is real.
  • 128. Solution walk-through (steps 1-2)
    /* Allocate arrays on the host */
    float *kx, *ky, *r;
    kx = (float *) malloc(sizeof(float)*N);
    ky = (float *) malloc(sizeof(float)*N);
    r  = (float *) malloc(sizeof(float)*N*N);
    /* Allocate arrays on the GPU with cudaMalloc.
       kx_d, ky_d and r_d hold real data, so sizeof(float) per element is enough. */
    float *kx_d, *ky_d, *r_d;
    cudaMalloc( (void **) &kx_d, sizeof(float)*N);
    cudaMalloc( (void **) &ky_d, sizeof(float)*N);
    cudaMalloc( (void **) &r_d , sizeof(float)*N*N);
    /* Complex work array for the FFT */
    cufftComplex *r_complex_d;
    cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);
  • 129. Code walk-through (steps 3-4)
    /* Initialize r, kx and ky on the host */
    ……………
    /* Transfer data from host to device with
       cudaMemcpy(target, source, size, direction) */
    cudaMemcpy (kx_d, kx, sizeof(float)*N  , cudaMemcpyHostToDevice);
    cudaMemcpy (ky_d, ky, sizeof(float)*N  , cudaMemcpyHostToDevice);
    cudaMemcpy (r_d , r , sizeof(float)*N*N, cudaMemcpyHostToDevice);
    /* Create plan for CUDA FFT (interface similar to FFTW) */
    cufftHandle plan;
    cufftPlan2d( &plan, N, N, CUFFT_C2C);
  • 130. Code walk-through (step 5)
    /* Compute the execution configuration
       NB: block_size_x*block_size_y = number of threads per block
       On G80 the number of threads per block must be <= 512 */
    dim3 dimBlock(block_size_x, block_size_y);
    dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
    /* Handle N not multiple of block_size_x or block_size_y */
    if (N % block_size_x != 0) dimGrid.x += 1;
    if (N % block_size_y != 0) dimGrid.y += 1;
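The divide-then-adjust pattern above is a ceiling division. A small C sketch of the same computation as a single expression follows; the helper name `blocks_for` is hypothetical.

```c
/* Number of blocks needed to cover n elements with blocks of
 * block_size threads: ceil(n / block_size) in integer arithmetic.
 * Hypothetical helper; block_size must be non-zero. */
unsigned int blocks_for(unsigned int n, unsigned int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

In the CUDA code this would be used as `dim3 dimGrid(blocks_for(N, dimBlock.x), blocks_for(N, dimBlock.y));`. For N = 64 and a 32 x 16 block it gives a 2 x 4 grid. Because the grid may overshoot N, the kernels still need their `idx < N && idy < N` guard.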
  • 131. Code walk-through (steps 6-10)
    /* Transform real input to complex input */
    real2complex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);
    /* Compute in-place forward FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);
    /* Solve Poisson equation in Fourier space */
    solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d, N);
    /* Compute in-place inverse FFT */
    cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);
    /* Copy the solution back to a real array and apply scaling
       (an FFT followed by an iFFT gives back the same array
       times the length of the transform) */
    float scale = 1.f / ( (float) N * (float) N );
    complex2real_scaled<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N, scale);
  • 132. Code walk-through (step 11)
    /* Transfer data from device to host with
       cudaMemcpy(target, source, size, direction) */
    cudaMemcpy (r , r_d , sizeof(float)*N*N, cudaMemcpyDeviceToHost);
    /* Destroy plan and clean up memory on device */
    cufftDestroy( plan);
    cudaFree(r_complex_d);
    …….
    cudaFree(kx_d);
  • 133. real2complex
    /* Copy real data to complex data */
    __global__ void real2complex (float *a, cufftComplex *c, int N)
    {
      /* compute idx and idy, the location of the element in the original NxN array */
      int idx = blockIdx.x*blockDim.x+threadIdx.x;
      int idy = blockIdx.y*blockDim.y+threadIdx.y;
      if ( idx < N && idy < N)
      {
        int index = idx + idy*N;
        c[index].x = a[index];
        c[index].y = 0.f;
      }
    }
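The walk-through also launches a `complex2real_scaled` kernel that the slides never list. A host-side C sketch of what it presumably computes is given below, as the mirror image of real2complex. The struct `complex_f` is a stand-in for `cufftComplex` (whose members are x for the real part and y for the imaginary part), and the function name and loop form are assumptions; a CUDA version would replace the two loops with the same idx/idy computation and bounds check as real2complex.

```c
/* Stand-in for cufftComplex: x is the real part, y the imaginary part. */
typedef struct { float x; float y; } complex_f;

/* Host-side equivalent of what complex2real_scaled presumably does:
 * a[index] = scale * Re(c[index]) for every element of the n x n array.
 * Hypothetical sketch; the kernel itself is not shown in the slides. */
void complex2real_scaled_host(float *a, const complex_f *c, int n, float scale)
{
    for (int idy = 0; idy < n; ++idy) {
        for (int idx = 0; idx < n; ++idx) {
            int index = idx + idy * n;
            a[index] = scale * c[index].x;   /* imaginary part is discarded */
        }
    }
}
```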
  • 134. solve_poisson (with shared memory)
    Computes $\hat{\phi} = -\hat{r} / (k_x^2 + k_y^2)$ element-wise.
    __global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
    {
      unsigned int idx = __umul24(blockIdx.x,blockDim.x)+threadIdx.x;
      unsigned int idy = __umul24(blockIdx.y,blockDim.y)+threadIdx.y;
      // use shared memory to minimize multiple access to same k values
      __shared__ float kx_s[BLOCK_WIDTH], ky_s[BLOCK_HEIGHT];
      // the first row of threads loads kx, the first column loads ky
      if (threadIdx.y < 1 && idx < N) kx_s[threadIdx.x] = kx[idx];
      if (threadIdx.x < 1 && idy < N) ky_s[threadIdx.y] = ky[idy];
      __syncthreads();
      if ( idx < N && idy < N)
      {
        unsigned int index = idx + __umul24(idy, N);
        float scale = - ( kx_s[threadIdx.x]*kx_s[threadIdx.x]
                        + ky_s[threadIdx.y]*ky_s[threadIdx.y] );
        // the (0,0) mode would divide by zero; leave it unscaled
        if ( idx == 0 && idy == 0 ) scale = 1.f;
        scale = 1.f / scale;
        c[index].x *= scale;
        c[index].y *= scale;
      }
    }
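A plain C reference of the same Fourier-space division can be handy for checking the kernel's output on small inputs. The sketch below is an assumption-laden host equivalent: `cplx` stands in for `cufftComplex`, and `solve_poisson_host` is a hypothetical name.

```c
/* Stand-in for cufftComplex: x is the real part, y the imaginary part. */
typedef struct { float x; float y; } cplx;

/* In-place host reference for the Fourier-space solve:
 *   c[idx + idy*n] /= -(kx[idx]^2 + ky[idy]^2)
 * with the (0,0) mode left unchanged to avoid division by zero.
 * Hypothetical helper for validating the kernel on small inputs. */
void solve_poisson_host(cplx *c, const float *kx, const float *ky, int n)
{
    for (int idy = 0; idy < n; ++idy) {
        for (int idx = 0; idx < n; ++idx) {
            float scale = -(kx[idx] * kx[idx] + ky[idy] * ky[idy]);
            if (idx == 0 && idy == 0) scale = 1.0f;
            scale = 1.0f / scale;
            c[idx + idy * n].x *= scale;
            c[idx + idy * n].y *= scale;
        }
    }
}
```

Copying the device array back after the solve_poisson launch and comparing it element by element against this loop is one simple way to catch indexing mistakes in the shared-memory loads.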
  • 135. Compile and run poisson
    Compile the example poisson.cu:
      nvcc -O3 -o poisson poisson.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft -I/usr/local/NVIDIA_CUDA_SDK/common/inc -L/usr/local/NVIDIA_CUDA_SDK/lib -lcutil
    Run the example:
      ./poisson -N64
      Poisson solver on a domain 64 x 64
      dimBlock 32 16 (512 threads)
      dimGrid 2 4
      L2 error 9.436995e-08:
      Time 0.000569:
      Time I/O 0.000200 (0.000136 + 0.000064):
      Solution at (32,32) computed=0.975879 reference=0.975882
    Reference values from MATLAB:
      N=64
      Solution at (32,32): computed= 0.975879 reference= 0.975882
      Linf err=2.404194e-05 L2 norm err = 9.412790e-08
  • 136. 6.963 CUDA@MIT / IAP09: Misc
  • 137. Tesla C1060 Computing Processor
    Processor: 1x Tesla T10P
    Core GHz: 1.33 GHz
    Form factor: Full ATX, 4.736” (H) x 10.5” (L), dual slot wide
    On-board memory: 4 GB
    System I/O: PCIe x16 gen2
    Memory I/O: 512-bit, 800MHz DDR; 102 GB/s peak bandwidth
    Display outputs: None
    Typical power: 160 W
  • 138. Tesla S1070 1U System
    Processors: 4 x Tesla T10P
    Core GHz: 1.5 GHz
    Form factor: 1U for an EIA 19” 4-post rack
    Total 1U system memory: 16 GB (4.0GB per GPU)
    System I/O: 2 PCIe x16
    Memory I/O per processor: 512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
    Display outputs: None
    Typical power: 700 W
    Chassis dimensions: 1.73” H × 17.5” W × 28.5” D
  • 139. Double Precision Floating Point
    Feature | NVIDIA GPU | SSE2 | Cell SPE
    Precision | IEEE 754 | IEEE 754 | IEEE 754
    Rounding modes for FADD and FMUL | All 4 IEEE: round to nearest, zero, inf, -inf | All 4 IEEE: round to nearest, zero, inf, -inf | Round to zero/truncate only
    Denormal handling | Full speed | Supported, costs 1000’s of cycles | Flush to zero
    NaN support | Yes | Yes | No
    Overflow and Infinity support | Yes | Yes | No infinity, clamps to max norm
    Flags | No | Yes | Some
    FMA | Yes | No | Yes
    Square root | Software with low-latency FMA-based convergence | Hardware | Software only
    Division | Software with low-latency FMA-based convergence | Hardware | Software only
    Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit
    Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit
    log2(x) and 2^x estimates accuracy | 23 bit | No | No