IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.

Transcript of "IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)"

  1. 6.963 IAP09 CUDA@MIT. Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA. Lecture 04: CUDA - Advanced #1. Nicolas Pinto (MIT)
  2. During this course, we’ll try to “adapt for 6.963” and use existing material ;-)
  3. warp != wrap
  4. Today yey!!
  5. Textures & OpenGL, Async API, Libraries, Interfacing CUDA, Performance
  6. CUDA: Textures and OpenGL
  7. Textures in CUDA
       Different hardware path to memory
       Benefits of CUDA textures:
         - Texture fetches are cached
             - Optimized for 2D locality
         - Textures are addressable in 2D
             - Using integer or normalized coordinates
             - Means fewer addressing calculations in code
         - Provide filtering for free
         - Free wrap modes (boundary conditions)
             - Clamp to edge / repeat
       Limitations of CUDA textures:
         - Read-only
         - Currently either 1D or 2D (3D will be added)
         - 9-bit accuracy of filter weights
       © NVIDIA Corporation 2008
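The "9-bit accuracy of filter weights" limitation can be illustrated on the CPU. The sketch below is not NVIDIA's implementation: it assumes texel centers sit at i + 0.5 and that the interpolation weight keeps only 8 fractional bits, as the texture-fetching appendix of the CUDA Programming Guide describes it, and it assumes round-to-nearest for the quantization. The function names are hypothetical.

```c
/* Clamp a texel index to [0, n-1] (clamp-to-edge addressing). */
int clamp_texel(int i, int n)
{
    if (i < 0) return 0;
    if (i >= n) return n - 1;
    return i;
}

/* Hypothetical CPU emulation of a 1D linearly filtered fetch with
 * unnormalized coordinates and clamp addressing.
 * Assumption: the weight alpha is stored with 8 fractional bits. */
float tex1d_linear_emulated(const float *texels, int n, float x)
{
    float xb = x - 0.5f;                 /* texel centers sit at i + 0.5 */
    int i = (int)xb;                     /* truncates toward zero ...    */
    if ((float)i > xb) i--;              /* ... so fix up to get floor   */
    float alpha = xb - (float)i;         /* fractional part in [0, 1)    */
    /* Quantize the weight to steps of 1/256 (rounding is an assumption). */
    alpha = (float)(int)(alpha * 256.0f + 0.5f) / 256.0f;
    float t0 = texels[clamp_texel(i, n)];
    float t1 = texels[clamp_texel(i + 1, n)];
    return (1.0f - alpha) * t0 + alpha * t1;
}
```

With texels {0, 10, 20, 30}, a fetch at x = 2.0 blends texels 1 and 2 equally, while a fetch at x = 1.501 returns texel 1 unchanged because the tiny weight quantizes to zero. That is the practical meaning of limited weight precision: hardware filtering is cheap, but not a substitute for full-precision interpolation.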
  8. Two CUDA Texture Types
       Bound to linear memory:
         - Global memory is bound to a texture
         - Only 1D
         - Integer addressing
         - No filtering, no addressing modes
       Bound to CUDA arrays:
         - CUDA array is bound to a texture
         - 1D or 2D
         - Float addressing (size-based or normalized)
         - Filtering
         - Addressing modes (clamping, repeat)
       Both:
         - Return either element type or normalized float
  9. CUDA Texturing Steps
       Host (CPU) code:
         - Allocate/obtain memory (global linear, or CUDA array)
         - Create a texture reference object
             - Currently must be at file-scope
         - Bind the texture reference to memory/array
         - When done: unbind the texture reference, free resources
       Device (kernel) code:
         - Fetch using texture reference
         - Linear memory textures: tex1Dfetch()
         - Array textures: tex1D() or tex2D()
  10. Texture Reference
       Immutable parameters (compile-time):
         - Type: type returned when fetching
             - Basic int, float types
             - CUDA 1-, 2-, 4-element vectors
         - Dimensionality: currently 1 or 2 (3 will be supported in the future)
         - Read Mode:
             - cudaReadModeElementType
             - cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints): returns [-1,1] for signed, [0,1] for unsigned
       Mutable parameters (run-time, only for array-textures):
         - Normalized: non-zero = addressing range [0, 1]
         - Filter Mode: cudaFilterModePoint, cudaFilterModeLinear
         - Address Mode: cudaAddressModeClamp, cudaAddressModeWrap
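The two address modes differ only in how an out-of-range coordinate is brought back into the texture. As a rough CPU-side illustration (a sketch with hypothetical names, using point sampling and normalized coordinates, and ignoring filtering), clamp pins the coordinate to the nearest edge while wrap keeps only its fractional part:

```c
/* Hypothetical emulation of point-sampled addressing with a normalized
 * coordinate u over a 1D texture of n texels.
 * wrap == 0 mimics cudaAddressModeClamp, wrap != 0 mimics cudaAddressModeWrap. */
int texel_from_normalized(float u, int n, int wrap)
{
    if (wrap) {
        int whole = (int)u;              /* truncates toward zero        */
        if ((float)whole > u) whole--;   /* fix up to get floor(u)       */
        u -= (float)whole;               /* keep the fractional part     */
    }
    int i = (int)(u * (float)n);         /* scale to texel units         */
    if (i < 0) i = 0;                    /* clamp to the valid range     */
    if (i >= n) i = n - 1;
    return i;
}
```

For a 4-texel texture, u = 1.25 lands on the last texel (index 3) under clamp and on index 1 under wrap; u = -0.25 lands on index 0 under clamp and index 3 under wrap. Getting these boundary conditions from the texture unit is what the earlier slide means by "free wrap modes".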
  11. Example: Host code for linear mem
        // declare texture reference (must be at file-scope)
        texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef;
        ...
        // set up linear memory
        unsigned short *dA = 0;
        cudaMalloc((void**)&dA, numBytes);
        cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);
        // bind texture reference to linear memory
        cudaBindTexture(NULL, texRef, dA);
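Because the texture reference above uses cudaReadModeNormalizedFloat on unsigned short data, a fetch returns a float in [0, 1] rather than the raw integer. A minimal sketch of that mapping for the unsigned case follows; the helper name is hypothetical, and the signed case is left out because its exact mapping is not given on the slides.

```c
/* Hypothetical helper: what a normalized-float read of an unsigned
 * 16-bit texel is expected to return (0 -> 0.0, 65535 -> 1.0). */
float normalize_u16(unsigned short v)
{
    return (float)v / 65535.0f;
}
```

In kernel code this means the value read through the texture can be used directly as a weight or intensity, with no per-fetch integer-to-float scaling.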
  12. cudaArray Type
        - Channel format, width, height
        - cudaChannelFormatDesc structure
            - int x, y, z, w: bits for each component
            - enum cudaChannelFormatKind, one of:
                cudaChannelFormatKindSigned
                cudaChannelFormatKindUnsigned
                cudaChannelFormatKindFloat
            - Some predefined constructors:
                cudaCreateChannelDesc<float>(void);
                cudaCreateChannelDesc<float4>(void);
        - Management functions: cudaMallocArray, cudaFreeArray, cudaMemcpyToArray, cudaMemcpyFromArray, ...
  13. Example: Host code for 2D array tex
        // declare texture reference (must be at file-scope)
        texture<float, 2, cudaReadModeElementType> texRef;
        ...
        // set up the CUDA array
        cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
        cudaArray *texArray = 0;
        cudaMallocArray(&texArray, &cf, dimX, dimY);
        cudaMemcpyToArray(texArray, 0,0, hA, numBytes, cudaMemcpyHostToDevice);
        // specify mutable texture reference parameters
        texRef.normalized = 0;
        texRef.filterMode = cudaFilterModeLinear;
        // addressMode is an array with one entry per dimension
        texRef.addressMode[0] = cudaAddressModeClamp;
        texRef.addressMode[1] = cudaAddressModeClamp;
        // bind texture reference to array
        cudaBindTextureToArray(texRef, texArray);
  14. OpenGL Interoperability
        - OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
            - Vertex buffer objects
            - Pixel buffer objects
        - Direct3D9 Vertex objects can be mapped
        - Data can be accessed like any other global data in the device code
        - Image data can be displayed from pixel buffer objects using glDrawPixels / glTexImage2D
            - Requires copy in video memory, but still fast
  15. OpenGL Interop Steps
        - Register a buffer object with CUDA
            cudaGLRegisterBufferObject(GLuint buffObj);
            - OpenGL can use a registered buffer only as a source
            - Unregister the buffer prior to rendering to it by OpenGL
        - Map the buffer object to CUDA memory
            cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
            - Returns an address in global memory
            - Buffer must be registered prior to mapping
        - Launch a CUDA kernel to process the buffer
        - Unmap the buffer object prior to use by OpenGL
            cudaGLUnmapBufferObject(GLuint buffObj);
        - Unregister the buffer object
            cudaGLUnregisterBufferObject(GLuint buffObj);
            - Optional: needed if the buffer is a render target
        - Use the buffer object in OpenGL code
  16. Interop Scenario: Dynamic CUDA-generated texture
        - Register the texture PBO with CUDA
        - For each frame:
            - Map the buffer
            - Generate the texture in a CUDA kernel
            - Unmap the buffer
            - Update the texture
            - Render the textured object
          unsigned char *p_d=0;
          cudaGLMapBufferObject((void**)&p_d, pbo);
          prepTexture<<<height,width>>>(p_d, time);
          cudaGLUnmapBufferObject(pbo);
          glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
          glBindTexture(GL_TEXTURE_2D, texID);
          glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256, GL_BGRA, GL_UNSIGNED_BYTE, 0);
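The launch prepTexture<<<height,width>>> uses one block per image row and one thread per pixel, so each thread's pixel sits at byte offset 4 * (blockIdx.x * width + threadIdx.x) in the mapped BGRA buffer. The slide does not show the kernel body, so the following plain C loop nest is only a hypothetical stand-in that makes the index mapping explicit; the pixel values it writes are arbitrary.

```c
/* Hypothetical CPU stand-in for prepTexture: the outer loop plays the role
 * of blockIdx.x (one block per row), the inner loop of threadIdx.x
 * (one thread per pixel). The buffer holds 4 bytes per pixel, matching
 * GL_BGRA with GL_UNSIGNED_BYTE. */
void prep_texture_emulated(unsigned char *pixels, int height, int width, float time)
{
    for (int block = 0; block < height; ++block) {
        for (int thread = 0; thread < width; ++thread) {
            unsigned char *p = pixels + 4 * (block * width + thread);
            p[0] = (unsigned char)thread;               /* B: column index  */
            p[1] = (unsigned char)block;                /* G: row index     */
            p[2] = (unsigned char)((int)time & 0xFF);   /* R: animates      */
            p[3] = 255;                                 /* A: opaque        */
        }
    }
}
```

This launch shape only works while the row width fits in a single thread block, which holds for the 256-pixel rows used in the glTexSubImage2D call above.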
  17. Interop Scenario: Frame Post-processing by CUDA
        - For each frame:
            - Render to PBO with OpenGL
            - Register the PBO with CUDA
            - Map the buffer
            - Process the buffer with a CUDA kernel
            - Unmap the buffer
            - Unregister the PBO from CUDA
          unsigned char *p_d=0;
          cudaGLRegisterBufferObject(pbo);
          cudaGLMapBufferObject((void**)&p_d, pbo);
          postProcess<<<blocks,threads>>>(p_d);
          cudaGLUnmapBufferObject(pbo);
          cudaGLUnregisterBufferObject(pbo);
          ...
  18. CUDA: Async API
  19. Asynchronous memory copy
        - Asynchronous host <-> device memory copy for page-locked memory frees up CPU on all CUDA capable devices
        - Overlap implemented by using a CUDA stream
        - CUDA Stream = Sequence of CUDA operations that execute in order
        - Stream API:
            - Each stream has an ID: 0 = default stream
            cudaMemcpyAsync(dst, src, size, dir, 0);   // dir is the cudaMemcpyKind copy direction
  20. Overlap kernel and memory copy
        - Concurrent execution of a kernel and a host <-> device memory copy for page-locked memory
            - Compute capability >= 1.1 (G84 and up)
            - Available as a preview feature in CUDA 1.1
        - Overlaps kernel execution in one stream with a memory copy from another stream
        - Stream API:
            cudaStreamCreate(&stream1);
            cudaStreamCreate(&stream2);
            cudaMemcpyAsync(dst, src, size, dir, stream1);   // overlapped with the kernel below
            kernel<<<grid, block, 0, stream2>>>(…);
            cudaStreamQuery(stream2);
  21. CUDA Event API
        - Events are inserted (recorded) into CUDA call streams
        - Usage scenarios:
            - measure elapsed time for CUDA calls (clock cycle precision)
            - query the status of an asynchronous CUDA call
            - block CPU until CUDA calls prior to the event are completed
            - asyncAPI sample in CUDA SDK
          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);
          cudaEventRecord(start, 0);
          kernel<<<grid, block>>>(...);
          cudaEventRecord(stop, 0);
          cudaEventSynchronize(stop);
          float et;
          cudaEventElapsedTime(&et, start, stop);
          cudaEventDestroy(start);
          cudaEventDestroy(stop);
  22. CUDA: Libraries
  23. CUDA libraries
        - CUDA includes 2 widely used libraries
            - CUBLAS: BLAS implementation
            - CUFFT: FFT implementation
        - CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/ :
            - Reduction
            - Scan
            - Sort
        M02: High Performance Computing with CUDA
  24. Closely Coupled CPU-GPU
        [Diagram labels: Init, Alloc, Function (x3), Lib (x2); CPU, GPU; Operation 1, Operation 2, Operation 3]
        - Integrated programming model
        - High speed data transfer: up to 5.5GB/sec
        - Asynchronous data transfer
        - Large GPU memory systems
  25. CUBLAS
        - Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver
            - Self-contained at the API level, no direct interaction with CUDA driver
        - Basic model for use
            - Create matrix and vector objects in GPU memory space
            - Fill objects with data
            - Call sequence of CUBLAS functions
            - Retrieve data from GPU
        - CUBLAS library contains helper functions
            - Creating and destroying objects in GPU space
            - Writing data to and retrieving data from objects
  26. Using CUBLAS
        - Interface to CUBLAS library is in cublas.h
        - Function naming convention
            - cublas + BLAS name
            - e.g., cublasSGEMM
        - Error handling
            - CUBLAS core functions do not return error
            - CUBLAS provides function to retrieve last error recorded
            - CUBLAS helper functions do return error
        - Helper functions: memory allocation, data transfer
        - Implemented using C-based CUDA tool chain
            - Interfacing to C/C++ applications is trivial
  27. Supported Features
                    Single Precision        Double Precision*
                    Real      Complex       Real                        Complex
        Level 1     yes       yes           yes                         -
        Level 2     yes       -             dgemv, dger, dsyr, dtrsv    -
        Level 3     yes       cgemm         yes                         zgemm
        *Double-precision functions only supported on GPUs with double-precision hardware
        © 2008 NVIDIA Corporation.
  28. CUBLAS Helper Functions
        - cublasInit(): initializes CUBLAS library
        - cublasShutdown(): releases resources used by CUBLAS library
        - cublasGetError(): returns last error from CUBLAS core function (+ resets)
        - cublasAlloc(): wrapper around cudaMalloc() to allocate space for array
        - cublasFree(): destroys object in GPU memory
        - cublas[Set|Get][Vector|Matrix](): copies array elements between CPU and GPU memory
            - Accommodates non-unit strides
  29. sgemmExample.c
        #include <stdio.h>
        #include <stdlib.h>
        #include "cublas.h"

        int main(void)
        {
            float *a_h, *b_h, *c_h;
            float *a_d, *b_d, *c_d;
            float alpha = 1.0f, beta = 0.0f;
            int N = 2048, n2 = N*N;
            int nBytes, i;

            nBytes = n2*sizeof(float);

            a_h = (float *)malloc(nBytes);
            b_h = (float *)malloc(nBytes);
            c_h = (float *)malloc(nBytes);

            for (i=0; i < n2; i++) {
                a_h[i] = rand() / (float) RAND_MAX;
                b_h[i] = rand() / (float) RAND_MAX;
            }

            cublasInit();

            cublasAlloc(n2, sizeof(float), (void **)&a_d);
            cublasAlloc(n2, sizeof(float), (void **)&b_d);
            cublasAlloc(n2, sizeof(float), (void **)&c_d);

            cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
            cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);

            cublasSgemm('n', 'n', N, N, N, alpha, a_d, N, b_d, N, beta, c_d, N);

            cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);

            free(a_h); free(b_h); free(c_h);
            cublasFree(a_d); cublasFree(b_d); cublasFree(c_d);

            cublasShutdown();
            return 0;
        }
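When experimenting with the example above, it helps to have a slow host-side reference to check the GPU result against, and to know how much arithmetic one call performs. The sketch below is an assumption-laden illustration rather than part of the lecture: it assumes CUBLAS follows the usual BLAS column-major layout, handles only the no-transpose square case used in the example, and uses hypothetical function names.

```c
#include <stddef.h>

/* Column-major offset of element (row i, column j) with leading dimension ld. */
#define IDX2C(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Reference C = alpha*A*B + beta*C for n-by-n column-major matrices
 * (the 'n','n' case only). As in BLAS, C is not read when beta is zero. */
void sgemm_reference(int n, float alpha, const float *a, const float *b,
                     float beta, float *c)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[IDX2C(i, k, n)] * b[IDX2C(k, j, n)];
            c[IDX2C(i, j, n)] = (beta == 0.0f)
                ? alpha * acc
                : alpha * acc + beta * c[IDX2C(i, j, n)];
        }
    }
}

/* Floating-point operations in one n-by-n SGEMM: n^3 multiplies + n^3 adds. */
double sgemm_flops(int n)
{
    return 2.0 * n * n * n;
}
```

For N = 2048 the flop count is 2 * 2048^3, roughly 17.2 billion operations, so dividing it by a measured time gives the achieved flop rate. The triple loop is far too slow for routine use at that size; comparing on a small N is usually enough to catch layout or argument-order mistakes.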
  30. Calling CUBLAS from FORTRAN
        Two interfaces:
        - Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c)
            - Allows interfacing to existing applications without any changes
            - During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy back the results to CPU memory space and deallocate the GPGPU memory
            - Intended for light testing due to call overhead
        - Non-Thunking (default)
            - Intended for production code
            - Substitute device pointers for vector and matrix arguments in all BLAS functions
            - Existing applications need to be modified slightly to allocate and deallocate data structures in GPGPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
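The "call overhead" of the thunking interface can be estimated with simple arithmetic. The sketch below assumes that each thunking SGEMM call moves four n-by-n matrices over the bus (A, B and C to the GPU, C back), takes the "up to 5.5GB/sec" figure from the Closely Coupled CPU-GPU slide as a best case, and reads GB as 10^9 bytes. All three are assumptions, and the helper names are hypothetical.

```c
/* Bytes moved when `matrices_moved` dense n-by-n matrices with
 * `elem_size`-byte elements cross the bus. */
double transfer_bytes(int n, int elem_size, int matrices_moved)
{
    return (double)matrices_moved * (double)n * (double)n * (double)elem_size;
}

/* Lower bound on transfer time at a given peak bandwidth in bytes/second. */
double transfer_seconds(double bytes, double bytes_per_second)
{
    return bytes / bytes_per_second;
}
```

For example, transfer_seconds(transfer_bytes(2048, 4, 4), 5.5e9) comes to a little over 12 ms per call for single-precision 2048 x 2048 matrices, before any allocation cost. A loop that calls SGEMM repeatedly on the same data pays that every iteration with thunking, whereas the non-thunking interface lets the data stay on the GPU between calls.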
  31. SGEMM example (THUNKING)
        ! Define 3 single precision matrices A, B, C
        real , dimension(m1,m1):: A, B, C
        ……
        ! Initialize
        ……
        #ifdef CUBLAS
        ! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of
        ! memory allocation on device and data movement)
        call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
        #else
        ! Call SGEMM in host BLAS library
        call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
        #endif

        To use the host BLAS routine:
          g95 -O3 code.f90 -L/usr/local/lib -lblas
        To use the CUBLAS routine (fortran.c is provided by NVIDIA):
          gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
          g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas
  32. SGEMM example (NON-THUNKING)
        ! Define 3 single precision matrices A, B, C
        real , dimension(m1,m1):: A, B, C
        integer:: devPtrA, devPtrB, devPtrC, size_of_real=4
        ……
        ! Initialize A, B, C
        ………
        ! Allocate matrices on GPU
        cublasAlloc(m1*m1, size_of_real, devPtrA)
        cublasAlloc(m1*m1, size_of_real, devPtrB)
        cublasAlloc(m1*m1, size_of_real, devPtrC)
        ! Copy data from CPU to GPU
        cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1)
        cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1)
        cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1)
        ! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory),
        ! so pass the device pointers rather than the host arrays
        call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
        ! Copy data from GPU to CPU
        cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1)
        ! Free memory on device
        cublasFree(devPtrA)
        ……

        g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
  33. Benchmarking GPUs to Tune Dense Linear Algebra
        Vasily Volkov, Computer Science Division, University of California at Berkeley
        James W. Demmel, Computer Science Division and Department of Mathematics, University of California at Berkeley

        Abstract
        We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

        GPU name              GeForce    GeForce    GeForce    GeForce
                              GTX280     9800GTX    8800GTX    8600GTS
        # of vector cores     30         16         16         4
        core clock, GHz       1.30       1.67       1.35       1.45
        registers/core        64KB       32KB       32KB       32KB
        smem/core             16KB       16KB       16KB       16KB
        memory bus, GHz       1.1        1.1        0.9        1.0
        memory bus, pins      512        256        384        128
        bandwidth, GB/s       141        70         86         32
        memory amount         1GB        512MB      768MB      256MB
        SP, peak Gflop/s      624        429        346        93
        SP, peak per core     21         27         22         23
        SP, flops:word        18         25         16         12
        DP, peak Gflop/s      78         -          -          -
        DP, flops:word        4.4        -          -          -
        Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.

        1 Introduction
        We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library [Anderson et al. 1990] for the GPUs.
        Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best ...
        ... describes the architecture of the GPUs we used, highlighting the features common to vector architectures. Section 3 benchmarks operations including memory transfer, kernel start-up, and barriers, and uses these to analyze the performance of the panel factorization of LU. Section 4 discusses the design and performance evaluation of matrix multiplication. Section 5 discusses the design of LU, QR and Cholesky, and Section 6 evaluates their performance. Section 7 summarizes and describes future work.

        2 GPU Architecture
        In this work we are concerned with programming 8 series, 9 series, and 200 series of NVIDIA GPUs, as listed in Table 1. For ...

        Volkov and Demmel (SC08)
  35. 35. Library: DGEMM Performance
      M02: High Performance Computing with CUDA
  36. 36. Library: Additional Resources
      • CUDA SDK example: simpleCUBLAS
      • CUBLAS Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone
      © 2008 NVIDIA Corporation.
  37. 37. Library: CUFFT
      • The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing the discrete Fourier transform of complex or real-valued data sets.
      • CUFFT is the CUDA FFT library
        - Provides a simple interface for computing parallel FFTs on an NVIDIA GPU
        - Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation
  38. 38. Library: Supported Features
      • 1D, 2D and 3D transforms of complex and real-valued data
      • Batched execution for doing multiple 1D transforms in parallel
      • 1D transform size up to 8M elements
      • 2D and 3D transform sizes in the range [2,16384]
      • In-place and out-of-place transforms for real and complex data
  39. 39. Library: Transform Types
      • Library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
      • Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1), according to the sign of the complex exponential term
      • Real and imaginary parts of complex input and output arrays are interleaved; the cufftComplex type is defined for this
      • Real-to-complex FFTs: the output array holds only the nonredundant coefficients
        - N -> N/2+1
        - N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1)
      • For in-place transforms the input/output arrays need to be padded
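The size rule for real-to-complex transforms can be sketched on the host as below. This is only an illustration of the arithmetic on the slide, not CUFFT code: the helper names `r2cOutputCount` and `inPlacePaddedRowLength` are made up here, and the padding helper assumes the usual convention that an in-place buffer must hold the Nn/2+1 interleaved complex outputs of each row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of complex outputs of a real-to-complex transform over dimensions
// N0 x N1 x ... x Nn: only the last dimension shrinks, to Nn/2 + 1.
// (Hypothetical helper, not part of the CUFFT API.)
std::size_t r2cOutputCount(const std::vector<std::size_t>& dims) {
    assert(!dims.empty());
    std::size_t count = 1;
    for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
        count *= dims[i];
    }
    return count * (dims.back() / 2 + 1);
}

// Real elements per row that an in-place real-to-complex buffer must reserve,
// assuming the complex output (Nn/2 + 1 interleaved re/im pairs) overwrites
// the real input of the same row.
std::size_t inPlacePaddedRowLength(std::size_t lastDim) {
    return 2 * (lastDim / 2 + 1);
}
```

For example, a 1024-point real transform yields 513 complex coefficients, so an in-place row needs room for 1026 reals rather than 1024.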
  40. 40. Library: More on Transforms
      • For 2D and 3D transforms, CUFFT performs transforms in row-major (C) order. If calling from FORTRAN or MATLAB, remember to change the order of the size parameters during plan creation.
      • CUFFT performs un-normalized transforms: IFFT(FFT(A)) = length(A)*A
      • The CUFFT API is modeled after FFTW. It is based on plans that completely specify the optimal configuration to execute a particular size of FFT.
      • Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration.
      • This works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources.
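The un-normalized convention is easy to check without a GPU. The sketch below uses a naive O(N^2) DFT (not CUFFT and not an FFT) with the same sign convention, -1 for forward and +1 for inverse; `naiveDft`, `normalize` and `Signal` are names chosen for this illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive O(N^2) discrete Fourier transform without any 1/N scaling.
// sign = -1 is the forward direction, sign = +1 the inverse direction.
Signal naiveDft(const Signal& in, int sign) {
    assert(sign == -1 || sign == 1);
    const std::size_t n = in.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                sign * twoPi * static_cast<double>(j * k) / static_cast<double>(n);
            sum += in[j] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}

// Divide by the length to undo the scaling left by forward + inverse.
Signal normalize(Signal s) {
    const double n = static_cast<double>(s.size());
    for (auto& v : s) {
        v /= n;
    }
    return s;
}
```

Applying `naiveDft(naiveDft(a, -1), 1)` returns `length(a) * a`, so the caller has to divide by the length, which is the same normalization the CUFFT example performs with `a_h[i].x/N`.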
  41. 41. Library: CUFFT Types and Definitions
      • cufftHandle: type used to store and access CUFFT plans
      • cufftResult: enumeration of API function return values
      • cufftReal: single-precision, real datatype
      • cufftComplex: single-precision, complex datatype
      • Real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
      • Directions: CUFFT_FORWARD, CUFFT_INVERSE
  42. 42. Library: CUFFT Example
      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>
      #include "cufft.h"

      int main(int argc, char *argv[])
      {
          cufftComplex *a_h, *a_d;
          cufftHandle plan;
          int N = 1024, batchSize = 10;
          int i, nBytes;
          double maxError;

          /* fill a host array with batchSize signals of length N */
          nBytes = sizeof(cufftComplex)*N*batchSize;
          a_h = (cufftComplex *)malloc(nBytes);
          for (i=0; i < N*batchSize; i++) {
              a_h[i].x = sinf(i);
              a_h[i].y = cosf(i);
          }
          cudaMalloc((void **)&a_d, nBytes);
          cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

          cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);

          /* in-place forward transform followed by in-place inverse */
          cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
          cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

          cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

          // check error - normalize
          for (maxError = 0.0, i=0; i < N*batchSize; i++) {
              maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError);
              maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError);
          }
          printf("Max fft error = %g\n", maxError);

          cufftDestroy(plan);
          free(a_h); cudaFree(a_d);
          return 0;
      }
  43. 43. Library: Additional CUFFT Resources
      • CUDA SDK examples: simpleCUFFT, convolutionFFT2D, oceanFFT
      • CUFFT Library documentation: in the doc folder of the CUDA Toolkit, or download from CUDA Zone
  44. 44. Glue?
  45. 45. 6.963 CUDA@MIT / IAP09: Interfacing CUDA
  46. 46. Glue: Interfacing CUDA with other languages
      • CUDA kernels from FORTRAN, allocate pinned memory from FORTRAN
      • Calling CUDA from MATLAB with MEX files
      • Several packages (open source and commercial) interface CUDA with Python, IDL, .NET, FORTRAN (Flagon). Browse CUDA Zone to find all the packages.
  47. 47. Glue: Pinned memory from FORTRAN
      Pinned memory provides a fast PCI-e transfer speed and enables use of streams:
      • Allocation needs to be done with cudaMallocHost
      • Use new Fortran 2003 features for interoperability with C

      use iso_c_binding
      ! The allocation is performed by C function calls. Define the C pointer as type (C_PTR)
      type(C_PTR) :: cptr_A, cptr_B, cptr_C
      ! Define Fortran arrays as pointer.
      real, dimension(:,:), pointer :: A, B, C
      ! Allocating memory with cudaMallocHost.
      ! The Fortran arrays, now defined as pointers, are then associated with the C pointers using the
      ! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1))
      res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
      call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
      ! Use A as usual.
      ! See example code for cudaMallocHost interface code

      http://www.nvidia.com/object/cuda_programming_tools.html
  48. 48. Glue: Calling CUDA kernels from FORTRAN
      From Fortran, call a C function that will call the CUDA kernel.

      ! Fortran -> C -> CUDA -> C -> Fortran
      call cudafunction(c,c2,N)

      /* NB: Fortran subroutine arguments are passed by reference. */
      extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
      {
          ...
          int N = *Np;
          cudaMalloc((void **) &a_d, sizeof(cuComplex)*N);
          cudaMemcpy(a_d, a, sizeof(cuComplex)*N, cudaMemcpyHostToDevice);
          dim3 dimBlock(block_size);
          dim3 dimGrid(N/dimBlock.x);
          if (N % block_size != 0) dimGrid.x += 1;
          square_complex<<<dimGrid,dimBlock>>>(a_d, a_d, N);
          cudaMemcpy(b, a_d, sizeof(cuComplex)*N, cudaMemcpyDeviceToHost);
          cudaFree(a_d);
      }

      complex_mul: main.f90 cuda_function.o
              $(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart
      cuda_function.o: cuda_function.cu
              nvcc -c -O3 cuda_function.cu
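Two details of the wrapper above can be tried on the host alone: every Fortran argument arrives as a pointer, and the grid size is a rounded-up division. The sketch below is a stand-in with made-up names (`gridSizeFor`, `grid_size_`); the trailing underscore follows the naming convention of many Fortran compilers and is an assumption about the toolchain, not a rule.

```cpp
#include <cassert>

// Blocks needed to cover n elements. Gives the same result as N/block_size
// plus one extra block when N % block_size != 0.
int gridSizeFor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical Fortran-callable entry point: all arguments are passed by
// reference, and the result is returned through an output argument.
// From Fortran this would be invoked as: call grid_size(n, block, blocks)
extern "C" void grid_size_(const int* n, const int* blockSize, int* blocks) {
    *blocks = gridSizeFor(*n, *blockSize);
}
```

With 1000 elements and 256 threads per block this gives 4 blocks, so the kernel still needs its own bounds check for the last, partially filled block.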
  49. 49. Glue: CUDA & MATLAB
      • Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C or Fortran).
      • MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files.
      • MEX files can be used to exploit multi-core processors with OpenMP or threaded codes or, as in this case, to offload functions to the GPU.
  50. 50. Glue: NVMEX
      • The native MATLAB mex script cannot parse CUDA code
      • New MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files
      • Syntax similar to the original mex script:
          >> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
      • Available for Windows and Linux from: http://developer.nvidia.com/object/matlab_cuda.html
  51. 51. Glue: Mex files for CUDA
      A typical mex file will perform the following steps:
      1. Convert from double to single precision
      2. Rearrange the data layout for complex data
      3. Allocate memory on the GPU
      4. Transfer the data from the host to the GPU
      5. Perform computation on GPU (library, custom code)
      6. Transfer results from the GPU to the host
      7. Rearrange the data layout for complex data
      8. Convert from single to double
      9. Clean up memory and return results to MATLAB
      Some of these steps will go away with new versions of the library (2, 7) and new hardware (1, 8).
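Steps 1-2 and 7-8 amount to converting between split double-precision arrays (separate real and imaginary parts) and an interleaved single-precision array. A minimal host-side sketch follows, with `Complex32` as a hypothetical stand-in for an interleaved type with `.x`/`.y` members such as cufftComplex; the MEX API calls that would supply the input arrays are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an interleaved single-precision complex type.
struct Complex32 {
    float x;  // real part
    float y;  // imaginary part
};

// Steps 1-2: double -> single, split (re[], im[]) -> interleaved.
std::vector<Complex32> splitToInterleaved(const std::vector<double>& re,
                                          const std::vector<double>& im) {
    assert(re.size() == im.size());
    std::vector<Complex32> out(re.size());
    for (std::size_t i = 0; i < re.size(); ++i) {
        out[i].x = static_cast<float>(re[i]);
        out[i].y = static_cast<float>(im[i]);
    }
    return out;
}

// Steps 7-8: interleaved -> split, single -> double.
void interleavedToSplit(const std::vector<Complex32>& in,
                        std::vector<double>& re, std::vector<double>& im) {
    re.resize(in.size());
    im.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        re[i] = in[i].x;
        im[i] = in[i].y;
    }
}
```

The round trip is exact only for values representable in single precision; in general the double-to-float conversion in step 1 loses precision.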
  52. 52. Glue: CUDA MEX example
      Additional code in MEX file to handle CUDA:

      /* Parse input, convert to single precision and to interleaved complex format */
      …
      /* Allocate array on the GPU */
      cufftComplex *rhs_complex_d;
      cudaMalloc((void **) &rhs_complex_d, sizeof(cufftComplex)*N*M);
      /* Copy input array in interleaved format to the GPU */
      cudaMemcpy(rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);
      /* Create plan for CUDA FFT. NB: transposing dimensions */
      cufftPlan2d(&plan, N, M, CUFFT_C2C);
      /* Execute FFT on GPU */
      cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE);
      /* Copy result back to host */
      cudaMemcpy(input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);
      /* Clean up memory and plan on the GPU */
      cufftDestroy(plan);
      cudaFree(rhs_complex_d);
      /* Convert back to double precision and to split complex format */
      …
  53. 53. Glue: Timing details
      1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence

                                          Opteron 250              Opteron 2210
                                          Runtime      Speed-up    Runtime      Speed-up
      PCI-e bandwidth,                    1135 MB/s                1483 MB/s
        host to/from device               1003 MB/s                1223 MB/s
      Standard MATLAB                     8098 s                   9525 s
      Overload FFT2 and IFFT2             4425 s       1.8x        4937 s       1.9x
      Overload Szeta                      735 s        11.x        789 s        12.x
      Overload Szeta, FFT2 and IFFT2      577 s        14.x        605 s        15.7x
  54. 54. Glue
  55. 55. Glue
  56. 56. Glue
  57. 57. Glue
  58. 58. Glue
  59. 59. Wanna Play with The Big Guys?
  60. 60. 6.963 CUDA@MIT / IAP09: CUDA Performance Strategies
  61. 61. Threading: Programming Model
      • A kernel is executed as a grid of thread blocks
      • A thread block is a batch of threads that can cooperate with each other by:
        - Sharing data through shared memory
        - Synchronizing their execution
      • Threads from different blocks cannot cooperate
      [Figure: the host launches Kernel 1 and Kernel 2; on the device, Kernel 1 runs as Grid 1 with blocks (0,0) to (2,1) and Kernel 2 as Grid 2. Block (1,1) is expanded into threads (0,0) to (4,2).]
      © NVIDIA Corporation 2006
  62. 62. Memory: Data Movement in a CUDA Program
      Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory
      © NVIDIA Corporation 2008
  63. 63. Perf: Optimize Algorithms for the GPU
      • Maximize independent parallelism
      • Maximize arithmetic intensity (math/bandwidth)
      • Sometimes it's better to recompute than to cache
        - GPU spends its transistors on ALUs, not memory
      • Do more computation on the GPU to avoid costly data transfers
        - Even low parallelism computations can sometimes be faster than transferring back and forth to host
  64. 64. Perf: Optimize Memory Coherence
      • Coalesced vs. non-coalesced = order of magnitude
        - Global/Local device memory
      • Optimize for spatial locality in cached texture memory
      • In shared memory, avoid high-degree bank conflicts
  65. 65. Perf: Take Advantage of Shared Memory
      • Hundreds of times faster than global memory
      • Threads can cooperate via shared memory
      • Use one / a few threads to load / compute data shared by all threads
      • Use it to avoid non-coalesced access
        - Stage loads and stores in shared memory to re-order non-coalesceable addressing
        - Matrix transpose example later
  66. 66. Perf: Use Parallelism Efficiently
      • Partition your computation to keep the GPU multiprocessors equally busy
        - Many threads, many thread blocks
      • Keep resource usage low enough to support multiple active thread blocks per multiprocessor
        - Registers, shared memory
  67. 67. Perf: Memory optimizations
      • Optimizing memory transfers
      • Coalescing global memory accesses
      • Using shared memory effectively
  68. 68. Perf: Data Transfers
      • Device memory to host memory bandwidth much lower than device memory to device bandwidth
        - 4GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)
        - 8GB/s for PCI-e 2.0
      • Minimize transfers
        - Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
      • Group transfers
        - One large transfer much better than many small ones
  69. 69. Perf: Page-Locked Memory Transfers
      • cudaMallocHost() allows allocation of page-locked host memory
      • Enables highest cudaMemcpy performance
        - 3.2 GB/s+ common on PCI-express (x16)
        - ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
      • See the "bandwidthTest" CUDA SDK sample
      • Use with caution
        - Allocating too much page-locked memory can reduce overall system performance
        - Test your systems and apps to learn their limits
  70. 70. Perf: Global Memory Reads/Writes
      • Highest latency instructions: 400-600 clock cycles
      • Likely to be performance bottleneck
      • Optimizations can greatly increase performance
        - Coalescing: up to 10x speedup
        - Latency hiding: up to 2.5x speedup
  71. 71. Perf: Coalescing
      • A coordinated read by a half-warp (16 threads)
      • A contiguous region of global memory:
        - 64 bytes: each thread reads a word (int, float, …)
        - 128 bytes: each thread reads a double-word (int2, float2, …)
        - 256 bytes: each thread reads a quad-word (int4, float4, …)
      • Additional restrictions on G8X/G9X architecture:
        - Starting address for a region must be a multiple of region size
        - The k-th thread in a half-warp must access the k-th element in a block being read
      • Exception: not all threads must be participating
        - Predicated access, divergence within a half-warp
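The G8X/G9X coalescing rules on this slide (a region of 16 words that starts on a multiple of its own size, thread k on element k, inactive threads allowed) can be modeled on the host as a rough sanity check for an access pattern. This is a simplified model written for illustration, not a hardware simulator; `isCoalesced`, `linearAccess` and `kInactive` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kHalfWarp = 16;
constexpr std::int64_t kInactive = -1;  // marks a thread that sits out the access

// Byte addresses for a half-warp where thread k accesses base + k * strideBytes.
std::vector<std::int64_t> linearAccess(std::int64_t base, std::int64_t strideBytes) {
    std::vector<std::int64_t> addr(kHalfWarp);
    for (int k = 0; k < kHalfWarp; ++k) {
        addr[k] = base + k * strideBytes;
    }
    return addr;
}

// Simplified host-side model of the half-warp rules listed on the slide.
// addr[k] is the byte address used by thread k; wordBytes is 4, 8 or 16.
bool isCoalesced(const std::vector<std::int64_t>& addr, std::int64_t wordBytes) {
    assert(addr.size() == static_cast<std::size_t>(kHalfWarp));
    assert(wordBytes == 4 || wordBytes == 8 || wordBytes == 16);
    const std::int64_t regionBytes = wordBytes * kHalfWarp;  // 64, 128 or 256
    bool haveBase = false;
    std::int64_t base = 0;
    for (int k = 0; k < kHalfWarp; ++k) {
        if (addr[k] == kInactive) continue;  // exception: not all threads participate
        const std::int64_t implied = addr[k] - k * wordBytes;
        if (!haveBase) {
            base = implied;
            haveBase = true;
        } else if (implied != base) {
            return false;  // thread k is not on the k-th element of the region
        }
    }
    if (!haveBase) return true;  // no active thread, nothing to coalesce
    return base >= 0 && base % regionBytes == 0;  // region starts on a multiple of its size
}
```

In this model `linearAccess(128, 4)` (consecutive floats from address 128) coalesces, while a start at 132 or a 12-byte stride, as when reading one field of a packed three-float structure, does not.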
  72. 72. Perf: Coalesced Access: Reading floats
      [Figure: threads t0, t1, …, t15 reading floats at addresses 128, 132, …, 192. Panels: "All threads participate"; "Some threads do not participate".]
  73. 73. Perf: Uncoalesced Access: Reading floats
      [Figure: threads t0, t1, …, t15 reading floats at addresses 128, 132, …, 192. Panels: "Permuted access by threads"; "Misaligned starting address (not a multiple of 64)".]
  74. 74. Perf: Coalescing: Timing Results
      • Experiment on G80:
        - Kernel: read a float, increment, write back
        - 3M floats (12MB)
        - Times averaged over 10K runs
      • 12K blocks x 256 threads:
        - 356 µs: coalesced
        - 357 µs: coalesced, some threads don't participate
        - 3,494 µs: permuted/misaligned thread access
  75. 75. Perf: Coalescing: Structures of size ≠ 4, 8, 16 Bytes
      • Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
      • If SoA is not viable:
        - Force structure alignment: __align__(X), where X = 4, 8, or 16
        - Use SMEM to achieve coalescing
      [Figure: a Point structure with fields x, y, z. AoS layout: x y z x y z x y z. SoA layout: x x x y y y z z z.]
  76. 76. Perf: Coalescing: Summary
      • Coalescing greatly improves throughput
      • Critical to memory-bound kernels
      • Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
        - Prefer Structures of Arrays over AoS
        - If SoA is not viable, read/write through SMEM
      • Additional resources:
        - Aligned Types SDK Sample
  77. 77. Perf: Parallel Memory Architecture
      • In a parallel machine, many threads access memory
        - Therefore, memory is divided into banks
        - Essential to achieve high bandwidth
      • Each bank can service one address per cycle
        - A memory can service as many simultaneous accesses as it has banks
      • Multiple simultaneous accesses to a bank result in a bank conflict
        - Conflicting accesses are serialized
      [Figure: shared memory drawn as Bank 0, Bank 1, …, Bank 15.]
  78. 78. Perf: Bank Addressing Examples
      • No bank conflicts: linear addressing, stride == 1
      • No bank conflicts: random 1:1 permutation
      [Figure: threads 0 to 15 mapped onto banks 0 to 15 for each case.]
  79. 79. Perf: Bank Addressing Examples
      • 2-way bank conflicts: linear addressing, stride == 2
      • 8-way bank conflicts: linear addressing, stride == 8
      [Figure: threads 0 to 15 mapped onto banks 0 to 15 for each case; with stride 8, eight threads (x8) land on each bank that is hit.]
  80. 80. Perf: How addresses map to banks on G80
      • Bandwidth of each bank is 32 bits per 2 clock cycles
      • Successive 32-bit words are assigned to successive banks
      • G80 has 16 banks
        - So bank = address % 16
        - Same as the size of a half-warp
      • No bank conflicts between different half-warps, only within a single half-warp
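With 16 banks of 32-bit words and bank = word address % 16, the conflict degree of a strided access by one half-warp can be counted directly. The sketch below is a host-side model for illustration (`bankOf` and `conflictDegree` are made-up names); it covers strided access only and does not model the broadcast case where all threads read the same address.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kBanks = 16;     // G80: 16 banks of 32-bit words
constexpr int kHalfWarp = 16;  // conflicts only arise within one half-warp

// Bank holding a given 32-bit word index.
int bankOf(int wordIndex) {
    return wordIndex % kBanks;
}

// Largest number of threads of one half-warp that hit the same bank when
// thread t accesses word t * strideWords. 1 means conflict-free.
int conflictDegree(int strideWords) {
    assert(strideWords > 0);
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[bankOf(t * strideWords)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Strides 1 and 17 are conflict-free, stride 2 gives 2-way conflicts, stride 8 gives 8-way conflicts, and stride 16 sends all 16 threads to the same bank. The stride-17 case is why padding a 16-wide tile with one extra column removes the conflicts.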
  81. 81. Perf: Shared memory bank conflicts
      • Shared memory is as fast as registers if there are no bank conflicts
      • The fast case:
        - If all threads of a half-warp access different banks, there is no bank conflict
        - If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
      • The slow case:
        - Bank conflict: multiple threads in the same half-warp access the same bank
        - Must serialize the accesses
        - Cost = max # of simultaneous accesses to a single bank
  82. 82. Perf: Conflicts, Coalescing, Warps... I hate growing up.
  83. 83. Perf: Optimization Example: Matrix Transpose
  84. 84. Perf: Matrix Transpose
      • SDK Sample ("transpose")
      • Illustrates:
        - Coalescing
        - Avoiding SMEM bank conflicts
        - Speedups for even small matrices
      [Figure: a 4x4 matrix with entries 1 to 16 and its transpose.]
  85. 85. Perf: Uncoalesced Transpose
      __global__ void transpose_naive(float *odata, float *idata, int width, int height)
      {
          unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
          unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
          if (xIndex < width && yIndex < height)
          {
              unsigned int index_in  = xIndex + width * yIndex;
              unsigned int index_out = yIndex + height * xIndex;
              odata[index_out] = idata[index_in];
          }
      }
  86. 86. Perf: Uncoalesced Transpose
      • Reads input from GMEM: stride = 1, coalesced
      • Writes output to GMEM: stride = 16, uncoalesced
      [Figure: a 16x16 tile read in row order, (0,0) (0,1) (0,2) … (0,15), and written in column order, (0,0) (1,0) (2,0) … (15,0).]
  87. 87. Perf: Coalesced Transpose
      • Assumption: matrix is partitioned into square tiles
      • Threadblock (bx, by):
        - Read the (bx,by) input tile, store into SMEM
        - Write the SMEM data to (by,bx) output tile
        - Transpose the indexing into SMEM
      • Thread (tx,ty):
        - Reads element (tx,ty) from input tile
        - Writes element (tx,ty) into output tile
      • Coalescing is achieved if:
        - Block/tile dimensions are multiples of 16
  88. 88. Perf: Coalesced Transpose
      [Figure, four panels of 16x16 element indices: "Reads from GMEM" and "Writes to SMEM" in row order, (0,0) (0,1) (0,2) … (0,15); "Reads from SMEM" in column order, (0,0) (1,0) (2,0) … (15,0); "Writes to GMEM" in row order, (0,0) (0,1) (0,2) … (0,15).]
  89. 89. Perf: SMEM Optimization
      • Reads from SMEM: threads read SMEM with stride = 16
        - Bank conflicts
      • Solution: allocate an "extra" column
        - Read stride = 17
        - Threads read from consecutive banks
      [Figure: the 16x16 SMEM tile read in column order, (0,0) (1,0) (2,0) … (15,0), without and with the extra padding column.]
  91. 91. Perf: Coalesced Transpose
      __global__ void transpose(float *odata, float *idata, int width, int height)
      {
          __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

          unsigned int xBlock = blockDim.x * blockIdx.x;
          unsigned int yBlock = blockDim.y * blockIdx.y;
          unsigned int xIndex = xBlock + threadIdx.x;
          unsigned int yIndex = yBlock + threadIdx.y;
          unsigned int index_out, index_transpose;

          if (xIndex < width && yIndex < height)
          {
              unsigned int index_in = width * yIndex + xIndex;
              unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
              block[index_block] = idata[index_in];
              index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
              index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
          }
          __syncthreads();

          if (xIndex < width && yIndex < height)
              odata[index_out] = block[index_transpose];
      }
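The index arithmetic of the tiled kernel can be checked on the CPU by replaying it with loops in place of thread blocks and threads. The sketch below is such an emulation, not the SDK code: `kTile` plays the role of BLOCK_DIM, the two inner loop nests stand for the code before and after `__syncthreads()`, and the matrix dimensions are assumed to be multiples of the tile size, as the slides assume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kTile = 16;  // plays the role of BLOCK_DIM

// Reference transpose: element (x, y) of a width x height input moves to
// position (y, x) of the height x width output.
std::vector<float> transposeNaive(const std::vector<float>& in, int width, int height) {
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out[y + height * x] = in[x + width * y];
        }
    }
    return out;
}

// Host emulation of the tiled kernel's indexing, one tile at a time.
std::vector<float> transposeTiled(const std::vector<float>& in, int width, int height) {
    assert(width % kTile == 0 && height % kTile == 0);
    assert(in.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<float> out(in.size());
    std::vector<float> tile((kTile + 1) * kTile);  // padded like the __shared__ array
    for (int by = 0; by < height / kTile; ++by) {
        for (int bx = 0; bx < width / kTile; ++bx) {
            const int xBlock = bx * kTile;
            const int yBlock = by * kTile;
            // Phase 1 (before the barrier): each "thread" stores one element.
            for (int ty = 0; ty < kTile; ++ty) {
                for (int tx = 0; tx < kTile; ++tx) {
                    tile[ty * (kTile + 1) + tx] = in[width * (yBlock + ty) + xBlock + tx];
                }
            }
            // Phase 2 (after the barrier): read the transposed slot, write row-wise.
            for (int ty = 0; ty < kTile; ++ty) {
                for (int tx = 0; tx < kTile; ++tx) {
                    out[height * (xBlock + ty) + yBlock + tx] = tile[tx * (kTile + 1) + ty];
                }
            }
        }
    }
    return out;
}
```

Comparing `transposeTiled` against `transposeNaive` on a small matrix, for example 32 x 48, confirms that the `index_block`, `index_transpose` and `index_out` formulas agree with a plain transpose.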
  92. 92. Perf: Transpose Timings
      Speedups with coalescing and SMEM optimization:
          Size         Optimized    Naive       Speedup
          128x128      0.011 ms     0.022 ms    2.0x
          512x512      0.07 ms      0.33 ms     4.5x
          1024x1024    0.30 ms      1.92 ms     6.4x
          1024x2048    0.79 ms      6.6 ms      8.4x
      Coalescing without SMEM optimization:
          128x128      0.014 ms
          512x512      0.101 ms
          1024x1024    0.412 ms
          1024x2048    0.869 ms
  93. 93. Perf: Execution Configuration Optimizations
  94. 94. Perf: Occupancy
      • Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
      • Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
      • Limited by resource usage:
        - Registers
        - Shared memory
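The occupancy definition can be turned into a rough estimate. The sketch below is a simplified model, not the CUDA Occupancy Calculator: the 8192 registers and 16KB of shared memory per multiprocessor come from these slides, while the caps of 24 resident warps and 8 resident blocks are assumed here for G80-class parts, and allocation granularity is ignored. `SmLimits` and `occupancy` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Register and shared-memory sizes are the figures
// quoted in the slides; maxWarps and maxBlocks are assumptions for G80-class
// hardware and should be checked against the target device.
struct SmLimits {
    int registers = 8192;
    int sharedBytes = 16 * 1024;
    int maxWarps = 24;
    int maxBlocks = 8;
    int warpSize = 32;
};

// Fraction of the warp slots that are filled, given a kernel's resource use.
// Simplified: ignores register and shared-memory allocation granularity.
double occupancy(int threadsPerBlock, int regsPerThread, int smemPerBlock,
                 const SmLimits& sm = SmLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && smemPerBlock >= 0);
    const int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int blocks = std::min(sm.maxBlocks, sm.maxWarps / warpsPerBlock);
    blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    if (smemPerBlock > 0) {
        blocks = std::min(blocks, sm.sharedBytes / smemPerBlock);
    }
    return static_cast<double>(blocks * warpsPerBlock) / sm.maxWarps;
}
```

Under these assumptions a kernel with 192 threads per block, 20 registers per thread and 68 bytes of shared memory per block fits two blocks per multiprocessor, which is 12 of 24 warps, or 50% occupancy; registers are the limiting resource.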
  95. 95. Perf: Grid/Block Size Heuristics
      • # of blocks > # of multiprocessors
        - So all multiprocessors have at least one block to execute
      • # of blocks / # of multiprocessors > 2
        - Multiple blocks can run concurrently in a multiprocessor
        - Blocks that aren't waiting at a __syncthreads() keep the hardware busy
        - Subject to resource availability: registers, shared memory
      • # of blocks > 100 to scale to future devices
        - Blocks executed in pipeline fashion
        - 1000 blocks per grid will scale across multiple generations
  96. 96. Perf: Register Dependency
      • Read-after-write register dependency
        - Instruction's result can be read ~22 cycles later
        - Scenarios:
            CUDA:               PTX:
            x = y + 5;          add.f32 $f3, $f1, $f2
            z = x + 3;          add.f32 $f5, $f3, $f4
            s_data[0] += 3;     ld.shared.f32 $f3, [$r31+0]
                                add.f32 $f3, $f3, $f4
      • To completely hide the latency:
        - Run at least 192 threads (6 warps) per multiprocessor
        - At least 25% occupancy
        - Threads do not have to belong to the same thread block
  97. 97. Perf: Register Pressure
      • Hide latency by using more threads per SM
      • Limiting factors:
        - Number of registers per kernel: 8192 per SM, partitioned among concurrent threads
        - Amount of shared memory: 16KB per SM, partitioned among concurrent threadblocks
      • Check .cubin file for # registers / kernel
      • Use -maxrregcount=N flag to NVCC
        - N = desired maximum registers / kernel
        - At some point "spilling" into LMEM may occur
        - Reduces performance: LMEM is slow
        - Check .cubin file for LMEM usage
  98. 98. Perf: Determining resource usage
      • Use the "--ptxas-options=-v" option to nvcc
      • Or, compile the kernel code with the -cubin flag to determine register usage
      • Open the .cubin file with a text editor and look for the "code" section:

          architecture {sm_10}
          abiversion {0}
          modname {cubin}
          code {
              name = BlackScholesGPU
              lmem = 0        (per thread local memory)
              smem = 68       (per thread block shared memory)
              reg  = 20       (per thread registers)
              bar  = 0
              bincode {
                  0xa0004205 0x04200780 0x40024c09 0x00200780
                  …
  99. 99. Perf: CUDA Occupancy Calculator
  100. 100. Perf: Optimizing threads per block
      • Choose threads per block as a multiple of warp size
        - Avoid wasting computation on under-populated warps
      • More threads per block == better memory latency hiding
      • But, more threads per block == fewer registers per thread
        - Kernel invocations can fail if too many registers are used
      • Heuristics
        - Minimum: 64 threads per block, only if multiple concurrent blocks
        - 192 or 256 threads a better choice; usually still enough regs to compile and invoke successfully
        - This all depends on your computation, so experiment!
  101. 101. Perf: Occupancy != Performance
      • Increasing occupancy does not necessarily increase performance
      BUT…
      • Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
        (It all comes down to arithmetic intensity and available parallelism)
  102. 102. Perf: Parameterize Your Application
      • Parameterization helps adaptation to different GPUs
      • GPUs vary in many ways
        - # of multiprocessors
        - Memory bandwidth
        - Shared memory size
        - Register file size
        - Threads per block
      • You can even make apps self-tuning (like FFTW and ATLAS)
        - "Experiment" mode discovers and saves optimal configuration
  103. 103. Perf: Conclusion
      • Understand CUDA performance characteristics
        - Memory coalescing
        - Divergent branching
        - Bank conflicts
        - Latency hiding
      • Use peak performance metrics to guide optimization
      • Understand parallel algorithm complexity theory
      • Know how to identify type of bottleneck
        - e.g. memory, core computation, or instruction overhead
      • Optimize your algorithm, then unroll loops
      • Use template parameters to generate optimal code
  104. 104. Perf: The CUDA Visual Profiler
  105. 105. Perf: The CUDA Visual Profiler
      • Helps measure and find potential performance problems
      • GPU and CPU timing for all kernel invocations and memcpys
      • Time stamps
      • Access to hardware performance counters
  106. 106. Perf: Profiler Demo
  107. 107. Perf: Signals
      Events are tracked with hardware counters on signals in the chip:
      • timestamp
      • gld_incoherent, gld_coherent, gst_incoherent, gst_coherent: global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
      • local_load, local_store: local loads/stores
      • branch, divergent_branch: total branches and divergent branches taken by threads
      • instructions: instruction count
      • warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
      • cta_launched: executed thread blocks
  108. 108. Perf: Interpreting profiler counters
      • Values represent events within a thread warp
      • Only targets one multiprocessor
        - Values will not correspond to the total number of warps launched for a particular kernel.
        - Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
      • Values are best used to identify relative performance differences between unoptimized and optimized code
        - In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
  109. 109. ME CO
  110. 110. Back Pocket Slides slide by David Cox
  111. 111. Dense Linear Algebra
  112. 112. Dense linear algebra
        Major problems:
            Linear systems: Ax = b
            Eigenvalues: Ax = λx
            Singular values: A = UΣV^T
        Source of large-scale cases:
            Aeronautics: BEM
            Computational chemistry
            Data mining
        © 2008 NVIDIA Corporation.
  113. 113. Dense linear algebra
        Major problems:                 Algorithms:
          Linear systems   Ax = b         One-sided factorizations: LU, Cholesky, QR
          Eigenvalues      Ax = λx        Two-sided factorizations: QR alg., Jacobi
          Singular values  A = UΣV^T      Two-sided factorizations: SVD
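As an illustration of a one-sided factorization for linear systems, here is a minimal sketch of an unblocked Cholesky factorization, A = L L^T, for a symmetric positive definite matrix. It is plain host C++ written for clarity, not a GPU implementation and not taken from the slides; the name `cholesky_lower` and the row-major storage are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unblocked Cholesky factorization A = L * L^T of a symmetric positive
// definite n x n matrix stored row-major. Returns L (lower triangular,
// entries above the diagonal are zero). Hypothetical helper.
std::vector<double> cholesky_lower(const std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    std::vector<double> l(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: remove the contribution of the columns already done.
        double d = a[j * n + j];
        for (std::size_t p = 0; p < j; ++p) {
            d -= l[j * n + p] * l[j * n + p];
        }
        assert(d > 0.0);  // fails if A is not positive definite
        l[j * n + j] = std::sqrt(d);
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t p = 0; p < j; ++p) {
                s -= l[i * n + p] * l[j * n + p];
            }
            l[i * n + j] = s / l[j * n + j];
        }
    }
    return l;
}
```

With L in hand, Ax = b reduces to two triangular solves, Ly = b followed by L^T x = y. Production libraries use blocked variants of this loop so that most of the work lands in matrix-matrix multiplies.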
  114. 114. [slide title illegible in source]
        BLAS (Fortran 77):
            CALL SGEMM( 'N', 'N', m, n, k, 1.0, A, LDA, B, LDB, 1.0, C, LDC )
        CUBLAS (C):
            cublasSgemm( 'N', 'N', m, n, k, 1.0, dA, LDA, dB, LDB, 1.0, dC, LDC );
        Computation in GPU requires:
            Initialization of CUDA environment
            Allocation of data structures in GPU memory (handlers dA, dB, dC)
            Transfer of data (matrices A, B, C)
            Computation (cublasSgemm)
            Retrieve result (matrix C)
            Free data structures in GPU memory
            Termination of CUDA environment
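Both calls on this slide invoke the same BLAS operation, SGEMM, which with no transposes computes C := alpha * A * B + beta * C on column-major matrices. The sketch below is a plain host reference for that operation, useful for checking a GPU result on small inputs. It is not the CUBLAS API: `sgemm_ref` is a hypothetical name, and the use of `std::vector` is an assumption made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference SGEMM without transposes: C := alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all stored column-major with
// leading dimensions lda, ldb, ldc (the BLAS storage convention).
void sgemm_ref(int m, int n, int k, float alpha,
               const std::vector<float>& A, int lda,
               const std::vector<float>& B, int ldb,
               float beta, std::vector<float>& C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    assert(A.size() >= static_cast<std::size_t>(lda) * k);
    assert(B.size() >= static_cast<std::size_t>(ldb) * n);
    assert(C.size() >= static_cast<std::size_t>(ldc) * n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];  // element (i,p) times (p,j)
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

On the GPU path the same arithmetic is surrounded by the steps listed on the slide: allocate device buffers, copy A, B and C over, run the multiply, copy C back, and free the buffers.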
  115. 115.
                              GeForce   GeForce   GeForce   GeForce
        GPU name              GTX280    9800GTX   8800GTX   8600GTS
        # of vector cores     30        16        16        4
        core clock, GHz       1.30      1.67      1.35      1.45
        registers/core        64KB      32KB      32KB      32KB
        smem/core             16KB      16KB      16KB      16KB
        memory bus, GHz       1.1       1.1       0.9       1.0
        memory bus, pins      512       256       384       128
        bandwidth, GB/s       141       70        86        32
        memory amount         1GB       512MB     768MB     256MB
        SP, peak Gflop/s      624       429       346       93
        SP, peak per core     21        27        22        23
        SP, flops:word        18        25        16        12
        DP, peak Gflop/s      78        -         -         -
        DP, flops:word        4.4       -         -         -

        Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in ...
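The derived rows of Table 1 can be reproduced from the raw rows. The sketch below assumes that a word is 4 bytes in single precision and 8 bytes in double precision, so that flops:word is the peak Gflop/s rate divided by the bandwidth expressed in Gwords/s. That word size is an assumption, since the caption is cut off in this transcript, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Ratio of peak arithmetic rate to memory bandwidth measured in words.
// bytes_per_word is an assumption: 4 for single precision, 8 for double.
double flops_per_word(double peak_gflops, double bandwidth_gbs, int bytes_per_word) {
    assert(bandwidth_gbs > 0.0 && bytes_per_word > 0);
    double gwords_per_s = bandwidth_gbs / bytes_per_word;
    return peak_gflops / gwords_per_s;
}

// Peak arithmetic rate available to a single vector core.
double peak_per_core(double peak_gflops, int cores) {
    assert(cores > 0);
    return peak_gflops / cores;
}
```

For the GTX280 column, `flops_per_word(624, 141, 4)` is about 17.7 and `peak_per_core(624, 30)` is 20.8, which round to the tabulated 18 and 21. The ratio says roughly how many arithmetic operations a kernel must perform per word fetched from memory before it stops being bandwidth bound.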