More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009
Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.
IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA) - npinto
The document discusses parallel computing using GPUs and CUDA. It introduces CUDA as a parallel programming model that allows writing parallel code in a C/C++-like language that can execute efficiently on NVIDIA GPUs. It describes key CUDA abstractions like a hierarchy of threads organized into blocks, different memory spaces, and synchronization methods. It provides an example of implementing parallel reduction and discusses strategies for mapping algorithms to GPU architectures. The overall message is that CUDA makes massively parallel computing accessible using a familiar programming approach.
The document provides an introduction to GPU programming using CUDA. It outlines GPU and CPU architectures, the CUDA programming model involving threads, blocks and grids, and CUDA C language extensions. It also discusses compilation with NVCC, memory hierarchies, profiling code with Valgrind/Callgrind, and Amdahl's law in the context of parallelization. A simple CUDA program example is provided to demonstrate basic concepts like kernel launches and data transfers between host and device memory.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This document discusses GPU computing with CUDA and NVIDIA Tesla hardware. It provides an overview of GPU computing and how it differs from CPU computing in being optimized for data-parallel throughput rather than low latency. It also describes the key specifications of the NVIDIA Tesla C1060 GPU and Tesla streaming multiprocessor. Finally, it outlines the CUDA parallel computing architecture and programming model, including how applications use the GPU as a coprocessor through kernels launched from the CPU.
This document provides an overview of CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model that allows software developers to leverage the parallel compute engines in NVIDIA GPUs. The document discusses key aspects of CUDA including: GPU hardware architecture with many scalar processors and concurrent threads; the CUDA programming model with host CPU code calling parallel kernels that execute across multiple GPU threads; memory hierarchies and data transfers between host and device memory; and programming basics like compiling with nvcc, allocating and copying data between host and device memory.
CUDA is a parallel computing platform and programming model developed by Nvidia that allows software developers and researchers to utilize GPUs for general purpose processing. CUDA allows developers to achieve up to 100x performance gains over CPU-only applications. CUDA works by having the CPU copy input data to GPU memory, executing a kernel program on the GPU that runs in parallel across many threads, and copying the results back to CPU memory. Key GPU memories that can be used in CUDA programs include shared memory for thread cooperation, textures for cached reads, and constants for read-only data.
This document provides an overview of CUDA (Compute Unified Device Architecture) and GPU programming. It begins with definitions of CUDA and GPU hardware architecture. The history of GPU development from basic graphics cards to modern programmable GPUs is discussed. The document then covers the CUDA programming model including the device model with multiprocessors and threads, and the execution model with grids, blocks and threads. It includes a code example to calculate squares on the GPU. Performance results are shown for different GPUs on a radix sort algorithm. The document concludes that GPU computing is powerful and will continue growing in importance for applications.
The document discusses Compute Unified Device Architecture (CUDA), which is a parallel computing platform and programming model created by Nvidia that allows software developers to use GPUs for general-purpose processing. It provides an overview of CUDA, including its execution model, implementation details, applications, and advantages/drawbacks. The document also covers CUDA programming, compiling CUDA code, CUDA architectures, and concludes that CUDA has brought significant innovations to high performance computing.
Kato Mivule: An Overview of CUDA for High Performance Computing - Kato Mivule
This document provides an overview of CUDA (Compute Unified Device Architecture), a parallel computing platform developed by NVIDIA that allows programming of GPUs for general-purpose processing. It outlines CUDA's process flow of copying data to the GPU, running a kernel program on the GPU, and copying results back to CPU memory. It then demonstrates CUDA concepts like kernel and thread structure, memory management, and provides a code example of vector addition to illustrate CUDA programming.
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
A beginner’s guide to programming GPUs with CUDA - Piyush Mittal
This document provides an overview of GPU programming with CUDA. It defines what a GPU is, that it has many compute cores for graphics processing. It explains that CUDA extends C to access GPU capabilities, allowing for parallel execution across GPU threads. It provides examples of CUDA code structure and keywords to specify where code runs and launch kernels. Performance considerations include data storage, shared memory, and efficient thread scheduling.
1. CUDA provides a programming environment and APIs that allow developers to leverage GPUs for general purpose computing. The CUDA C API offers both a high-level runtime API and a lower-level driver API.
2. CUDA programs define kernels that execute many parallel threads on the GPU. Threads are organized into blocks that can cooperate through shared memory, and blocks are organized into grids (see the kernel sketch after this list).
3. The CUDA memory model includes a hierarchy from fast per-thread registers to slower shared, global, and host memories. This hierarchy allows threads within blocks to communicate efficiently through shared memory.
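To make the kernel/block/grid vocabulary above concrete, here is a minimal CUDA sketch; it is our illustration, not taken from the summarized document, and the kernel name scale and the launch configuration are invented:

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)
        x[i] *= a;  // every thread runs this same line on a different element
}

// Launch with 256 threads per block and enough blocks to cover n elements:
// scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);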
1) The document provides an introduction to GPGPU programming with CUDA, outlining goals of providing an overview and vision for using GPUs to improve applications.
2) Key aspects of GPU programming are discussed, including the large number of cores devoted to data processing, example applications that are well-suited to parallelization, and the CUDA tooling in Visual Studio.
3) A hands-on example of matrix multiplication is presented to demonstrate basic CUDA programming concepts like memory management between host and device, kernel invocation across a grid of blocks, and using thread IDs to parallelize work.
This document provides an overview of CUDA architecture and programming. It discusses key CUDA concepts like the host/device model, CUDA C extensions, GPU memory management, and parallel programming using CUDA threads and blocks. CUDA allows developers to speed up applications by offloading work to the GPU. It provides a scalable parallel programming model that maps threads to GPU threads to express data-level parallelism across thousands of lightweight threads for applications like high-bandwidth computing and visual computing.
This document provides an overview of MXF and AAF file formats. It discusses:
1. Why these formats were developed, which was to allow for content-centric workflows with metadata handling, random access to material, and open standardized compression-independent formats.
2. What the formats are, with MXF being a wrapper format for interchange of finished audiovisual material and metadata, and AAF being a more complex wrapper of metadata and essence for post-production interchange.
3. Some key concepts around the formats, including the source reference chain that allows tracking material origins and derivations, and operational patterns that control complexity.
The document introduces SGC Ruby CUDA, a Ruby library that provides an object-oriented API for CUDA programming to bridge Ruby and CUDA C/C++. It allows performing operations like memory allocation and transfer as well as kernel launching from Ruby. The library aims to make CUDA programming accessible from Ruby while hiding complexity of the low-level CUDA driver and runtime APIs.
This document provides an introduction to the CUDA parallel computing platform from NVIDIA. It discusses the CUDA hardware capabilities including GPUDirect, Dynamic Parallelism, and HyperQ. It then outlines three main programming approaches for CUDA: using libraries, OpenACC directives, and programming languages. It provides examples of libraries like cuBLAS and cuRAND. For OpenACC, it shows how to add directives to existing Fortran/C code to parallelize loops. And for languages, it lists supports like CUDA C/C++, CUDA Fortran, Python with PyCUDA etc. The document aims to provide developers with maximum flexibility in choosing the best approach to accelerate their applications using CUDA and GPUs.
The document discusses the ext2 file system used in early Linux distributions, including its use of inodes to store file metadata and pointers to data blocks, and how it uses direct, single indirect, double indirect, and triple indirect blocks to address files larger than can fit in the inode's direct block pointers. It also explains how unclean unmounts could corrupt the file system and led to the journaling in ext3 to prevent data loss.
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput... - npinto
This document summarizes a presentation about using CUDA (Compute Unified Device Architecture) to accelerate lattice quantum chromodynamics (QCD) calculations. CUDA is used to parallelize computations across many GPU threads. Each thread processes one lattice site, with neighboring sites and links accessed sequentially. Initially, each thread required 1.4KB of local storage, limiting occupancy. Occupancy was improved by storing data in registers instead of shared memory, expanding loops explicitly. This achieved up to 82 gigabytes per second on a GTX 280, 20 times faster than CPUs. Memory access patterns, float4 arrays, and textures were optimized to improve bandwidth utilization.
Java and the machine - Martijn Verburg and Kirk Pepperdine - JAX London
In Terminator 3 - Rise of the Machines, bare metal comes back to haunt humanity, ruthlessly crushing all resistance. This keynote is here to warn you that the same thing is happening to Java and the JVM! Java was designed in a world where there were a wide range of hardware platforms to support. Its premise of Write Once Run Anywhere (WORA) proved to be one of the compelling reasons behind Java's dominance (even if the reality didn't quite meet the marketing hype). However, this WORA property means that Java and the JVM struggled to utilise specialist hardware and operating system features that could make a massive difference in the performance of your application. This problem has recently gotten much, much worse. Due to the rise of multi-core processors, massive increases in main memory and enhancements to other major hardware components (e.g. SSD), the JVM is now distant from utilising that hardware, causing some major performance and scalability issues! Kirk Pepperdine and Martijn Verburg will take you through the complexities of where Java meets the machine and loses. They'll give up some of their hard-won insights on how to work around these issues so that you can plan to avoid termination, unlike some of the poor souls that ran into the T-800...
Shader Model 5.0 introduces several new features for vertex, hull, domain, geometry, and pixel shaders, including uniform indexing of resources, SV_Coverage system value, and double precision support. Compute shaders also gain features like raw and structured buffer views, atomic operations, and thread local storage. Compute shaders are well-suited for general purpose GPU tasks like post-processing and can perform Gaussian blur more efficiently than pixel shaders by reducing memory bandwidth usage through thread local storage.
This document discusses cache memory. It describes the location, capacity, unit of transfer, access methods, and physical types of caches. Common cache organizations include direct mapping, set associative mapping, and replacement algorithms like LRU. Write policies can be write-through or write-back. Example architectures discussed include Pentium 4 and PowerPC caches.
This document summarizes VPU and GPGPU technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU architecture including stream processing, graphics pipelines, shaders, and GPU clusters. It provides an example of using CUDA for GPU computing and discusses how GPUs are used for general purpose computing through frameworks like CUDA.
An Introduction to CUDA-OpenCL - University.pptx - AnirudhGarg35
This document provides an introduction to CUDA and OpenCL for graphics processors. It discusses how GPUs are optimized for throughput rather than latency via parallel processing. The CUDA programming model exposes thread-level parallelism through blocks of cooperative threads and SIMD parallelism. OpenCL is inspired by CUDA but is hardware-vendor neutral. Both support features like shared memory, synchronization, and memory copies between host and device. Efficient CUDA coding requires exposing abundant fine-grained parallelism and minimizing execution and memory divergence.
GPU computing provides a way to access the power of massively parallel graphics processing units (GPUs) for general purpose computing. GPUs contain over 100 processing cores and can achieve over 500 gigaflops of performance. The CUDA programming model allows programmers to leverage this parallelism by executing compute kernels on the GPU from their existing C/C++ applications. This approach democratizes parallel computing by making highly parallel systems accessible through inexpensive GPUs in personal computers and workstations. Researchers can now explore manycore architectures and parallel algorithms using GPUs as a platform.
The document outlines a course on GPU architecture and CUDA programming. It covers GPU architecture overview, CUDA tools, CUDA C introduction, parallel computing patterns, thread cooperation and synchronization, memory types, atomic operations, events and streams, and advanced CUDA scenarios. Programming GPUs requires a CUDA capable GPU.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs provide massively parallel and multithreaded processing capabilities. GPUs are now commonly used for general purpose computing due to their ability to handle complex computational tasks faster than CPUs in some cases. The document then discusses GPU and PPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics, and scientific research.
The document discusses VPU and GPGPU computing. It explains that a VPU is a visual processing unit, also known as a GPU. GPUs are massively parallel and multithreaded processors that are better than CPUs for tasks like machine learning and graphics processing. The document then discusses GPU architecture, memory, and programming models like CUDA. It provides examples of GPU usage and concludes that GPGPU is used in fields like machine learning, robotics, and scientific computing.
The document summarizes upgrades made to the SVG supercomputer in 2012, including:
- Upgrading to Sandy Bridge processors with 192 cores and 1.5TB memory on thin nodes and 512GB memory on fat nodes.
- Installing an Infiniband FDR 56Gb/s network with 4Tb/s bandwidth and 1us MPI latency.
- Configuring queues to take advantage of the Infiniband network and turbo boost, allowing up to 112 cores and 1024GB memory per job.
- Benchmark results showed peak performance of 3788 GFlops on thin nodes and 563 GFlops on fat nodes.
IRJET - Performance Analysis of RSA Algorithm with CUDA Parallel Computing - IRJET Journal
This document summarizes a research paper that analyzes the performance of the RSA cryptographic algorithm when implemented in parallel using Nvidia's CUDA framework. It first describes the traditional RSA algorithm and CUDA architecture. It then discusses how RSA was designed for implementation in CUDA, with encryption and decryption operations parallelized across GPU threads and cores. The results show that GPU parallelization provides significant speedups compared to CPU-only implementation, with performance increasing based on the number of threads used.
This document provides a tutorial introduction to GPGPU computation using NVIDIA CUDA. It begins with a brief overview and warnings about the large numbers involved in GPGPU. The agenda then outlines topics to be covered including general purpose GPU computing using CUDA and optimization topics like memory bandwidth optimization. Key aspects of CUDA programming are introduced like the CUDA memory model, compute capabilities of GPUs, and profiling tools. Examples are provided of simple CUDA kernels and how to configure kernel launches for grids and blocks of threads. Optimization techniques like choosing block/grid sizes to maximize occupancy are also discussed.
[01][Languages, tools, and APIs for GPU computing] miller languages tools - laparuma
The document discusses GPU computing software and development tools from NVIDIA. It provides an overview of language and API options for GPU computing like CUDA, OpenCL, and DirectCompute. It also describes development tools like the CUDA toolkit, Parallel Nsight debugger, and Visual Profiler for optimizing GPU applications. Foundation libraries and common design patterns are also covered to help developers leverage GPUs effectively.
[05][CUDA and Fermi optimization techniques] hryu optimization - laparuma
The document discusses parallelizing a 1D heat equation simulation using CUDA. It begins with an overview of CUDA and the heat equation model. It then describes how to discretize and parallelize the heat equation using an explicit method. The key steps are:
1) Discretize the PDE into a finite difference equation using an explicit update rule between grid points over time.
2) Parallelize the computation by assigning each CUDA thread to update one grid point simultaneously, allowing the entire spatial domain to be updated in parallel at each time step (see the kernel sketch after this list).
3) Implement the algorithm by allocating memory on the GPU, launching kernels to perform the update in parallel, and copying data back to check results.
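A hedged sketch of the per-thread update described above; this is our illustration, not code from the summarized document, with the kernel name and the coefficient r (standing for alpha*dt/dx^2 in the explicit scheme) as assumptions:

// One thread updates one interior grid point per time step (explicit scheme).
__global__ void heat_step(const float *u_old, float *u_new, int n, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        u_new[i] = u_old[i] + r * (u_old[i-1] - 2.0f * u_old[i] + u_old[i+1]);
}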
CUDA is a parallel computing platform that allows developers to use GPUs for general purpose processing. It was developed by NVIDIA and is supported on their graphics cards. CUDA extends ANSI C, allowing programmers to write kernel functions that launch many threads to run simultaneously on the GPU. The GPU architecture includes streaming multiprocessors with scalar processors that can execute billions of threads per second. CUDA provides programmers with control over the GPU memory hierarchy and tools for optimizing parallel code performance. Future developments of CUDA and platforms like OpenCL will make GPU programming more accessible across different hardware and languages.
This document discusses GPU accelerated computing and programming with GPUs. It provides characteristics of GPUs from Nvidia, AMD, and Intel including number of cores, memory size and bandwidth, and power consumption. It also outlines the 7 steps for programming with GPUs which include building and loading a GPU kernel, allocating device memory, transferring data between host and device memory, setting kernel arguments, enqueueing kernel execution, transferring results back, and synchronizing the command queue. The goal is to achieve super parallel execution with GPUs.
Using GPUs to handle Big Data with Java by Adam Roberts - J On The Beach
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
This document summarizes Nvidia's GPU technology conference (GTC16) including announcements about their Tesla P100 GPU and DGX-1 deep learning supercomputer. Key points include:
- The new Tesla P100 GPU delivers up to 21 teraflops of performance for deep learning and uses new technologies like NVLink, HBM2 memory, and a page migration engine.
- The Nvidia DGX-1 is a deep learning supercomputer powered by 8 Tesla P100 GPUs with over 170 teraflops of performance for training neural networks.
- CUDA 8 and unified memory improvements on the P100 enable simpler programming and larger datasets by allowing allocations beyond GPU memory size and
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com and http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Similar to IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT) (20)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
This document discusses preliminary work using machine learning techniques to help improve blockchain security. It outlines initial experiments using a Cosmos SDK simulator to generate test data and identify "bug correlates" that could help predict vulnerabilities. Several bugs were already found in the simulator itself. The goal is to focus compute resources on more interesting test runs likely to produce bugs. This is an encouraging first step in exploring how AI may augment blockchain security testing.
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...) - npinto
This document discusses using high-performance computing for machine learning tasks like analyzing large convolutional neural networks for visual object recognition. It proposes running hundreds of thousands of large neural network models in parallel on GPUs to more efficiently search the parameter space, beyond what is normally possible with a single graduate student and model. This high-throughput screening approach aims to identify better performing network architectures through exploring a vast number of possible combinations in the available parameter space.
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi... - npinto
The document discusses challenges with parallel programming on GPUs including tasks with statically known data dependences, SIMD divergence, lack of fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on different GPU and CPU architectures, with GPUs providing much higher sorting throughput than CPUs. Parallel prefix sum is proposed as a method for allocating work in parallel tasks that require dynamic scheduling or allocation.
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect... - npinto
The document discusses changes in computer architecture and Microsoft's role in the transition to parallel computing. It notes that computer cores are increasing rapidly and that Microsoft aims to make parallelism accessible to all developers through tools like Visual Studio. It also outlines Microsoft's involvement in GPU computing through technologies like DirectX and efforts to support GPU programming across its software stack.
The document discusses dynamic compilation for massively parallel processors. It describes how execution models provide an interface between programming languages and hardware architectures. Emerging execution models like bulk-synchronous parallel and PTX aim to abstract parallelism on heterogeneous multi-core and many-core processors. The document outlines how dynamic compilers can translate between execution models and target instructions to different core architectures through techniques like thread fusion, vectorization, and subkernel extraction. This bridging of models and architectures through just-in-time compilation helps program entire processors rather than individual cores.
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr... - npinto
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform program transformations like loop transformations, fusion, distribution and tiling to optimize for parallelism and locality. It models the target machine and uses this to inform the mapping of operations to resources like GPUs.
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St... - npinto
The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli... - npinto
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau... - npinto
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le... - npinto
This document summarizes a paper about using high-level programming languages for low-level systems programming. It discusses the needs of scientists and engineers for software that is reliable, high-performance, and customizable. The paper aims to address these needs by exploring features of high-level languages that could enable low-level programming tasks typically done in C/C++, like developing device drivers, operating systems, and embedded systems.
This document outlines Andreas Klockner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... - npinto
Abstract:
Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine with hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web scale data.

Speaker biography:
Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
Creating cluster 'mycluster' with the following settings:
- Master node: m1.small using ami-fce3c696
- Number of nodes: 1
- Node type: m1.small
- Node AMI: ami-fce3c696
- Storage: EBS volume of size 10 GB
- Security group: mycluster-sg allowing SSH from anywhere
Launching instances...
This may take a few minutes. You can check progress with 'starcluster list'.
When instances have started, SSH will be automatically configured.
You can now ssh to the master with:
starcluster ssh mycluster
Have fun and please let us know if you have
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) - npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
This document summarizes a lecture on CUDA Ninja Tricks given on March 1st, 2011. The lecture covered scripting GPUs with PyCUDA, meta-programming and RTCG, and a case study in brain-inspired AI. It included sections on why scripting is useful for GPUs, an introduction to GPU scripting with PyCUDA, and a hands-on example of a simple PyCUDA program that defines and runs a CUDA kernel to double the values in a GPU memory array.
[Harvard CS264] 05 - Advanced-level CUDA Programming - npinto
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
[Harvard CS264] 04 - Intermediate-level CUDA Programming - npinto
This document provides an overview and summary of key points from a lecture on massively parallel computing using CUDA. The lecture covers CUDA language and APIs, threading and execution models, memory and communication, tools, and libraries. It discusses the CUDA programming model including host and device code, threads and blocks, and memory allocation and transfers between the host and device. It also summarizes the CUDA runtime and driver APIs for launching kernels and managing devices at different levels of abstraction.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics - npinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
1. IAP09 CUDA@MIT 6.963
Supercomputing on your desktop:
Programming the next generation of cheap and massively parallel hardware using CUDA
Lecture 04: CUDA - Advanced #1
Nicolas Pinto (MIT)
2. During this course (adapted for 6.963), we'll try to “ ” and use existing material ;-)
25. [Library] CUDA libraries
CUDA includes 2 widely used libraries:
CUBLAS: BLAS implementation
CUFFT: FFT implementation
CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/, provides:
Reduction
Scan
Sort
26. [Library] Closely Coupled CPU-GPU
(Diagram: the CPU calls library functions; GPU Init and Alloc happen once, then Operations 1, 2, and 3 execute back-to-back on the GPU.)
Integrated programming model
High-speed data transfer (up to 5.5 GB/sec)
Asynchronous data transfer
Large GPU memory systems
27. [Library] CUBLAS
Implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver
Self-contained at the API level, no direct interaction with the CUDA driver
Basic model for use:
Create matrix and vector objects in GPU memory space
Fill objects with data
Call a sequence of CUBLAS functions
Retrieve data from the GPU
The CUBLAS library contains helper functions for:
Creating and destroying objects in GPU space
Writing data to and retrieving data from objects
28. [Library] Using CUBLAS
The interface to the CUBLAS library is in cublas.h
Function naming convention: cublas + BLAS name, e.g., cublasSgemm
Error handling:
CUBLAS core functions do not return errors; CUBLAS provides a function to retrieve the last error recorded
CUBLAS helper functions do return errors
Helper functions: memory allocation, data transfer
Implemented using the C-based CUDA tool chain; interfacing to C/C++ applications is trivial
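To make the call sequence concrete, here is a minimal sketch in C against the legacy cublas.h API described above. The matrix size n, the fill values, and the choice of SGEMM are illustrative, not from the slides:

/* Minimal CUBLAS call sequence sketch: create objects on the GPU,
   fill them, call a core function, check the last error, retrieve results. */
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int n = 256;                      /* illustrative matrix size */
    float *A = malloc(n * n * sizeof(float));
    float *B = malloc(n * n * sizeof(float));
    float *C = malloc(n * n * sizeof(float));
    float *dA, *dB, *dC;
    int i;

    for (i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    cublasInit();                           /* initialize CUBLAS */

    /* create matrix objects in GPU memory space */
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    /* fill objects with data */
    cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

    /* call a CUBLAS core function: C = 1.0*A*B + 0.0*C */
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    /* core functions do not return errors; query the last one recorded */
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSgemm failed\n");

    /* retrieve data from the GPU */
    cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);
    printf("C[0] = %f (expect %f)\n", C[0], 2.0f * n);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(A); free(B); free(C);
    return 0;
}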
32. [Library] Calling CUBLAS from FORTRAN
Two interfaces:
Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c):
Allows interfacing to existing applications without any changes
During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPU memory
Intended for light testing due to call overhead
Non-Thunking (default):
Intended for production code
Substitute device pointers for vector and matrix arguments in all BLAS functions
Existing applications need to be modified slightly to allocate and deallocate data structures in GPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
33. [Library] SGEMM example (thunking)

! Define 3 single precision matrices A, B, C
real, dimension(m1,m1) :: A, B, C
...
! Initialize
...
#ifdef CUBLAS
! Call SGEMM in the CUBLAS library using the thunking interface
! (the library takes care of memory allocation on the device and data movement)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#else
! Call SGEMM in the host BLAS library
call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#endif

To use the host BLAS routine:
g95 -O3 code.f90 -L/usr/local/lib -lblas
To use the CUBLAS routine (fortran.c is provided by NVIDIA):
gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas
34. [Library] SGEMM example (non-thunking)

! Define 3 single precision matrices A, B, C
real, dimension(m1,m1) :: A, B, C
integer :: devPtrA, devPtrB, devPtrC, size_of_real = 4
...
! Initialize A, B, C
...
! Allocate matrices on the GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
! Copy data from CPU to GPU
cublasSetMatrix(m1, m1, size_of_real, A, m1, devPtrA, m1)
cublasSetMatrix(m1, m1, size_of_real, B, m1, devPtrB, m1)
cublasSetMatrix(m1, m1, size_of_real, C, m1, devPtrC, m1)
! Call SGEMM in the CUBLAS library using the non-thunking interface
! (the library expects data in GPU memory, so pass the device pointers)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
! Copy data from GPU to CPU
cublasGetMatrix(m1, m1, size_of_real, devPtrC, m1, C, m1)
! Free memory on the device
cublasFree(devPtrA)
...

g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
39. [Library] CUFFT
The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets.
CUFFT is the CUDA FFT library:
Provides a simple interface for computing parallel FFTs on an NVIDIA GPU
Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation
40. [Library] Supported Features
1D, 2D and 3D transforms of complex and real-valued data
Batched execution for doing multiple 1D transforms in parallel
1D transform sizes up to 8M elements
2D and 3D transform sizes in the range [2, 16384]
In-place and out-of-place transforms for real and complex data
41. [Library] Transform Types
The library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1), according to the sign of the complex exponential term
Real and imaginary parts of complex input and output arrays are interleaved; the cufftComplex type is defined for this
For real-to-complex FFTs, the output array holds only the nonredundant coefficients:
N -> N/2+1
N0 x N1 x ... x Nn -> N0 x N1 x ... x (Nn/2+1)
For in-place transforms the input/output arrays need to be padded
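A minimal sketch of the N -> N/2+1 layout in code (the helper name, array names, and size are illustrative): an N-point real input produces N/2+1 interleaved cufftComplex coefficients.

/* Hypothetical helper: 1D real-to-complex transform of an N-point signal.
   The output holds only the N/2+1 nonredundant coefficients. */
#include <cuda_runtime.h>
#include <cufft.h>

void forward_r2c(const float *signal_h, cufftComplex *spectrum_h, int N)
{
    cufftReal    *in_d;
    cufftComplex *out_d;
    cufftHandle   plan;

    cudaMalloc((void **)&in_d,  sizeof(cufftReal) * N);
    cudaMalloc((void **)&out_d, sizeof(cufftComplex) * (N/2 + 1));  /* N -> N/2+1 */
    cudaMemcpy(in_d, signal_h, sizeof(cufftReal) * N, cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, N, CUFFT_R2C, 1);    /* one transform (batch = 1) */
    cufftExecR2C(plan, in_d, out_d);

    cudaMemcpy(spectrum_h, out_d, sizeof(cufftComplex) * (N/2 + 1),
               cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(in_d);
    cudaFree(out_d);
}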
42. [Library] More on Transforms
For 2D and 3D transforms, CUFFT performs transforms in row-major (C) order
If calling from FORTRAN or MATLAB, remember to change the order of size parameters during plan creation
CUFFT performs un-normalized transforms: IFFT(FFT(A)) = length(A)*A
The CUFFT API is modeled after FFTW: it is based on plans that completely specify the optimal configuration to execute a particular size of FFT
Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration
This works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources
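A minimal sketch of plan reuse and the un-normalized round trip (function name illustrative): after a forward plus inverse transform, the data comes back scaled by N.

#include <cufft.h>

/* Forward then inverse C2C transform, reusing one plan.
   CUFFT is un-normalized, so a_d ends up holding N * (original a_d). */
void roundtrip(cufftComplex *a_d, int N)
{
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);          /* plan stores the configuration once */
    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);  /* exponent sign -1 */
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);  /* exponent sign +1, same plan reused */
    cufftDestroy(plan);
    /* to normalize, divide every element by N afterwards
       (e.g. with a small scaling kernel, or cublasCsscal from CUBLAS) */
}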
49. [Glue] Interfacing CUDA with other languages
CUDA kernels can be called from FORTRAN, and pinned memory can be allocated from FORTRAN
Calling CUDA from MATLAB with MEX files
Several packages (open source and commercial) interface CUDA with Python, IDL, .NET, and FORTRAN (Flagon). Browse CUDA Zone to find all the packages.
50. [Glue] Pinned memory from FORTRAN
Pinned memory provides fast PCI-e transfer speeds and enables the use of streams:
Allocation needs to be done with cudaMallocHost
Use the new Fortran 2003 features for interoperability with C

use iso_c_binding
! The allocation is performed by C function calls. Define the C pointer as type (C_PTR).
type(C_PTR) :: cptr_A, cptr_B, cptr_C
! Define the Fortran arrays as pointers.
real, dimension(:,:), pointer :: A, B, C
! Allocate memory with cudaMallocHost. The Fortran arrays, now defined as pointers,
! are then associated with the C pointers using the new interoperability defined in
! iso_c_binding. This is equivalent to allocate(A(m1,m1)).
res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
! Use A as usual.
! See the example code for the cudaMallocHost interface code:
! http://www.nvidia.com/object/cuda_programming_tools.html
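The same mechanism is visible from plain C. A minimal sketch (size and names illustrative) of why pinning matters: the asynchronous, stream-based copy requires page-locked host memory.

#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 22;             /* illustrative element count */
    float *a_h, *a_d;
    cudaStream_t stream;
    int i;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&a_h, n * sizeof(float)); /* pinned (page-locked) host memory */
    cudaMalloc((void **)&a_d, n * sizeof(float));

    for (i = 0; i < n; i++) a_h[i] = (float)i;

    /* asynchronous copy on a stream requires pinned host memory;
       the CPU is free to do other work until the synchronize */
    cudaMemcpyAsync(a_d, a_h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    /* ... overlap CPU work here ... */
    cudaStreamSynchronize(stream);

    cudaFree(a_d);
    cudaFreeHost(a_h);
    cudaStreamDestroy(stream);
    return 0;
}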
51. [Glue] Calling CUDA kernels from FORTRAN
From Fortran, call a C function that will call the CUDA kernel:
! Fortran -> C -> CUDA -> C -> Fortran
call cudafunction(c, c2, N)

/* NB: Fortran subroutine arguments are passed by reference. */
extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
{
  int N = *Np;
  cuComplex *a_d;
  cudaMalloc ((void **) &a_d, sizeof(cuComplex)*N);
  cudaMemcpy( a_d, a, sizeof(cuComplex)*N, cudaMemcpyHostToDevice);
  dim3 dimBlock(block_size);
  dim3 dimGrid (N/dimBlock.x);
  if( N % block_size != 0 ) dimGrid.x += 1;
  square_complex<<<dimGrid,dimBlock>>>(a_d, a_d, N);
  cudaMemcpy( b, a_d, sizeof(cuComplex)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
}

complex_mul: main.f90 cuda_function.o
	$(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart
cuda_function.o: cuda_function.cu
	nvcc -c -O3 cuda_function.cu
52. [Glue] CUDA & MATLAB
Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C or Fortran).
MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files.
MEX files can be used to exploit multi-core processors with OpenMP or threaded codes, or, as in this case, to offload functions to the GPU.
53. [Glue] NVMEX
The native MATLAB mex script cannot parse CUDA code
A new MATLAB script, nvmex.m, compiles CUDA code (.cu) to create MATLAB function files
Syntax similar to the original mex script:
>> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart
Available for Windows and Linux from:
http://developer.nvidia.com/object/matlab_cuda.html
54. [Glue] Mex files for CUDA
A typical mex file will perform the following steps:
1. Convert from double to single precision
2. Rearrange the data layout for complex data
3. Allocate memory on the GPU
4. Transfer the data from the host to the GPU
5. Perform the computation on the GPU (library, custom code)
6. Transfer the results from the GPU to the host
7. Rearrange the data layout for complex data
8. Convert from single to double precision
9. Clean up memory and return results to MATLAB
Some of these steps will go away with new versions of the library (steps 2 and 7) and new hardware (steps 1 and 8)
55. [Glue] CUDA MEX example
Additional code in the MEX file to handle CUDA:

/* Parse input, convert to single precision and to interleaved complex format */
...
/* Allocate array on the GPU */
cufftComplex *rhs_complex_d;
cudaMalloc( (void **) &rhs_complex_d, sizeof(cufftComplex)*N*M);
/* Copy input array in interleaved format to the GPU */
cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);
/* Create plan for CUDA FFT (NB: transposing dimensions) */
cufftPlan2d(&plan, N, M, CUFFT_C2C);
/* Execute FFT on GPU */
cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE);
/* Copy result back to host */
cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);
/* Clean up memory and plan on the GPU */
cufftDestroy(plan);
cudaFree(rhs_complex_d);
/* Convert back to double precision and to split complex format */
...
56. [Glue] Timing details
1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence

                                  Opteron 250          Opteron 2210
                                  Runtime   Speedup    Runtime   Speedup
PCI-e bandwidth                   1135 MB/s            1483 MB/s
Host to/from device               1003 MB/s            1223 MB/s
Standard MATLAB                   8098 s               9525 s
Overload FFT2 and IFFT2           4425 s    1.8x       4937 s    1.9x
Overload Szeta                    735 s     11.x       789 s     12.x
Overload Szeta, FFT2 and IFFT2    577 s     14.x       605 s     15.7x
79. [Perf] Coalescing:
Structures of size != 4, 8, or 16 bytes:
Use a Structure of Arrays (SoA) instead of an Array of Structures (AoS)
If SoA is not viable:
Force structure alignment: __align(X), where X = 4, 8, or 16
Use SMEM to achieve coalescing
(Diagram: a Point structure "x y z"; AoS lays the data out as "x y z x y z x y z", SoA as "x x x y y y z z z")
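A minimal sketch of the AoS-vs-SoA difference in kernel code (the struct and kernel names are illustrative):

/* AoS: a 12-byte struct, so consecutive threads read addresses
   12 bytes apart and the half-warp's loads cannot coalesce. */
struct Point { float x, y, z; };

__global__ void read_aos(const Point *p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x;
}

/* SoA: one array per field, so consecutive threads read
   consecutive floats and the loads coalesce. */
__global__ void read_soa(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];
}

/* If SoA is not viable: pad the struct to a 16-byte boundary instead. */
struct __align__(16) Point4 { float x, y, z, pad; };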
81. [Perf] Parallel Memory Architecture
In a parallel machine, many threads access memory
Therefore, memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
(Diagram: shared memory divided into Bank 0 through Bank 15)
82. [Perf] Bank Addressing Examples
No bank conflicts: linear addressing, stride == 1
(threads 0 through 15 each read from their own bank, Bank 0 through Bank 15)
No bank conflicts: random 1:1 permutation
(threads 0 through 15 still map one-to-one onto banks 0 through 15, just in permuted order)
83. [Perf] Bank Addressing Examples
2-way bank conflicts: linear addressing, stride == 2
(pairs of threads land on the same bank, so only the even banks are used)
8-way bank conflicts: linear addressing, stride == 8
(threads land on only two banks, eight threads per bank; marked x8 in the diagram)
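A minimal sketch of how the stride shows up in code (the kernel, the shared-array size, and the launch shape are illustrative; 16 banks assumed, as on this hardware generation):

/* Launch with 256 threads per block; out must hold one float per thread. */
__global__ void stride_read(float *out, int stride)
{
    __shared__ float s[2048];
    int tid = threadIdx.x;

    /* fill shared memory cooperatively */
    for (int i = tid; i < 2048; i += blockDim.x)
        s[i] = (float)i;
    __syncthreads();

    /* A half-warp (16 threads) hits bank (tid * stride) % 16:
       stride 1 -> 16 distinct banks, conflict-free
       stride 2 -> even banks only, 2-way conflict
       stride 8 -> banks 0 and 8 only, 8-way conflict */
    out[blockIdx.x * blockDim.x + tid] = s[tid * stride];  /* keep stride <= 8 here */
}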
111. [Perf] Signals
Events are tracked with hardware counters on signals in the chip:
timestamp
gld_incoherent, gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
gst_incoherent, gst_coherent: the same for global memory stores
local_load, local_store: local loads/stores
branch, divergent_branch: total branches and divergent branches taken by threads
instructions: instruction count
warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
cta_launched: executed thread blocks
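These signals are what the command-line profiler records. A minimal sketch of its use, assuming this CUDA generation's CUDA_PROFILE / CUDA_PROFILE_CONFIG environment variables (the config file name and application name are illustrative); the config file lists the extra counters to collect, a few at a time:

cat > profile_config <<EOF
gld_incoherent
gld_coherent
divergent_branch
warp_serialize
EOF
CUDA_PROFILE=1 CUDA_PROFILE_CONFIG=profile_config ./myapp

The per-launch counter values then land in the profiler log (cuda_profile.log by default), one line per kernel launch.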