[Harvard CS264] 04 - Intermediate-level CUDA Programming

Massively Parallel Computing
CS 264 / CSCI E-292
Lecture #4: Intermediate-level CUDA | February 15th, 2011

Nicolas Pinto (MIT, Harvard)
pinto@mit.edu

Administrivia

• HW1: due Fri 2/18/11 (this week)
• Projects: think about it, consult the staff
• New guest lecturers!
• Max Lin (Google), Kurt Messersmith et al. (Amazon),
David Rich et al. (Microsoft)

During this course,
r CS264
adapted fo

we’ll try to

“ ”

and use existing material ;-)

Outline

• CUDA Language & APIs (overview)
• Threading/Execution (cont’d)
• Memory/Communication (cont’d)
• Tools
• Libraries

gu age
Lan
! 49!:(1&/<.2"'(%('")(/+(5,.$)=.-(<"#)/&()61"'>
! *",-./+0*",-./+*",-1/+0*",-1/+*",-2/+
0*",-2/+*",-3/+0*",-3/+
! $"#-%./+0$"#-%./+$"#-%1/+0$"#-%1/+
$"#-%2/+0$"#-%2/+$"#-%3/+0$"#-%3/
! )4%./+0)4%./+)4%1/+0)4%1/+)4%2/+
0)4%2/+)4%3/+0)4%3/+
! 5#46./+05#46./+5#461/+05#461/+5#462/+
05#462/+5#463/+05#463/+
! 75#,%./+75#,%1/+75#,%2/+75#,%3+

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

gu age
Lan

! 4%-(#/-')&,#)(%(<"#)/&()61"(8.)*('1"#.%$(
+,-#)./->
8,9'!!"#$%&'(%):(;/+(.!"#$

! 4%-(%##"''("$"0"-)'(/+(%(<"#)/&()61"(8.)*(
!"#$%&!"'$%&!"($%&!")$*
('*(,-<=

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

gu age
Lan

! &)82 .'(%('1"#.%$(<"#)/&()61"

! ?%0"(%'(0)4%2@("3#"1)(#%-(5"(#/-')&,#)"2(
+&/0(%('#%$%&()/(+/&0(%(<"#)/&>
:$*,5,-/+./+.>

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

gu age
Lan

! 49!:(1&/<.2"'(+/,&(;$/5%$@(5,.$)=.-(<%&.%5$"'
! %"-',&?&=@(@5#*9?&=@(@5#*9A)8@(
6-)&A)8

! +',-.&/0&/&1&)822&34&10)4%22&

! :##"''.5$"(/-$6(+&/0(2"<.#"(#/2"
! 4%--/)()%A"(%22&"''
! 4%--/)(%''.;-(<%$,"
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

gu age
!"#$%&'()*$+(,%-.*./
Lan
0+'.'(*.'($12($/3'4(25(."#$%&'(&*$+(23'.*$%2#4(%#(4%#67'(
3.'8%4%2#
!!"#$%"&'9(:%.'8$(&*33%#6($2(+*.:1*.'(;<=
>*4$(-"$(721'.(*88".*8/(?4''(3.26@(6"%:'(52.(:'$*%74A
BC*&37'49(!!()$"&*'+,!!-*."&*'+,!!./0"&*+1'
"#$%"&',9(82&3%7'($2(&"7$%37'(%#4$."8$%2#4
<721'.(-"$(+%6+'.(*88".*8/(?D("73(2.(7'44A
BC*&37'49(()$"&*'+,-*."&*'+,./0"&*+1'

0+'(2#(-!"3(4!5346,82&3%7'.(23$%2#(52.8'4('E'./("#$%"&',$2(
82&3%7'($2(!!"#$%"&'

© NVIDIA Corporation 2010 5
4
Unit of Least Precision (ULP) is the gap between the floating-point
numbers nearest a given real number

A PI
CUDA APIs

API allows the host to manage the devices
Allocate memory & transfer data
Launch kernels

High level of abstraction - start here!

(aka “Device” API)
More control, more verbose

(OpenCL: Similar to CUDA C Driver API)

© NVIDIA Corporation 2010

A PI

! !"#$)*+,$(-.$2+$#@/*+#3$:+$,H*$3244#0#6,$
!"#$%&
! !"#$G*H$G#1#G$'#127#$(-.$I/0#42@8$75J
! !"#$"2;"$G#1#G$K56,29#$(-.$I/0#42@8$753:J

! >*9#$,"26;+$7:6$F#$3*6#$,"0*5;"$F*,"$(-.+L$
*,"#0+$:0#$+/#72:G2M#3
! %:6$F#$92@#3$,*;#,"#0$IH2,"$7:0#J

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! (GG$B-&$7*9/5,26;$2+$/#04*09#3$*6$:$3#127#
! !*$:GG*7:,#$9#9*0=L$056$:$/0*;0:9L$#,7$*6$
,"#$":03H:0#L$H#$6##3$:$!"#$%"&%'()"*)

! '#127#$7*6,#@,+$:0#$F*563$N8N$H2,"$"*+,$
,"0#:3+$IO5+,$G2P#$A/#6BCQJ
! >*L$#:7"$"*+,$,"0#:3$9:=$":1#$:,$9*+,$*6#$3#127#$
7*6,#@,
! (63L$#:7"$3#127#$7*6,#@,$2+$:77#++2FG#$40*9$*6G=$
*6#$"*+,$,"0#:3
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! (GG$3#127#$(-.$7:GG+$0#,506$:6$#00*0D+577#++$
7*3#$*4$,=/#8$+,-"./0)
! (GG$056,29#$(-.$7:GG+$0#,506$:6$#00*0D+577#++$
7*3#$*4$,=/#$%/!12--'-3)

! (6$26,#;#0$1:G5#$H2,"$M#0*$R$6*$#00*0

! %/!14")51.)2--'-L$%/!14")2--'-6)-$(7

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! K56,29#$(-.$7:GG+$:5,*9:,27:GG=$262,2:G2M#
! '#127#$(-.$7:GG+$95+,$7:GG$%/8($)

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! 78".-9$%:;<%35(,-+$)%*%)-930-1-$+%-".$51*#$%
1(5%#5$*.-"/%*%#(".$6.=

! !"+.'$(#$%&!$)/"0(
! !"+.1$(#$%&!$

! :"+%.'$%8)$180=

! !"+.)2//3$#$%&!$
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI
! :00(#*.$HI5$$%9$9(52=
! !"+.6.99/!@%!"+.<-$$

! <"-.-*0-E$%9$9(52=
! !"+.6$73$(

! 4(32%9$9(52=
! !"+.6$7!=4

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! +,!$--. !"#$%&#'()*+,#-*#%."#&+*/#01*#2223444#
'&"%0(5"#("65%.0(5"#758*9.059:

! 5,(%12-6#7("%8(0("+*()%1+77)%*2%+77%$(3#1(%9:;%
*2%)(*/6%*,(%(<(1/*#20%(03#"20-(0*

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

“Device” Driver API
(low-level)

A PI

! !"#$420+,$I*/,2*6:GSJ$+,#/$2+$,*$#659#0:,#$,"#$
:1:2G:FG#$3#127#+

! %/9"#$%"4")+'/()
! %/9"#$%"4")
! %/9"#$%"4"):1;"
! %/9"#$%"4")<')10=";'->
! %/9"#$%"4")?))-$@/)"
! !
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! !"#$%&$%#'(()$%*%+$,-#$%&-.'%!"#$%&!$'$(
&$%/$.%*%+$,-#$%'*"+0$%(1%.23$%)*+$%&!$

! 4*"%"(&%#5$*.$%*%#(".$6.%&-.'%!")(,)-$.($

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! !"#$%&$%'*,$%*%#(".$6.%>)*!/0($,(?%#*"%
*00(#*.$%9$9(52@%#*00%*%A;B%18"#.-("%$.#C%%
! 4(".$6.%-)%-930-#-.02%*))(#-*.$+%&-.'%#5$*.-"/%
.'5$*+

! D(%)2"#'5("-E$%*00%.'5$*+)%>4;B%'().%&-.'%
A;B%.'5$*+)?%#*00%!")(,140!2-/0&5$
! F*-.)%1(5%*00%A;B%.*)G)%.(%1-"-)'%
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! :00(#*.$HI5$$%9$9(52=
! !"6$7899/!:;!"6$7<-$$

! <"-.-*0-E$%9$9(52=
! !"6$73$(

! 4(32%9$9(52=
! !"6$7!=4>(/#:;!"6$7!=4#(/>:;
!"6$7!=4#(/#
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! 3")(4<'#"6."&"@'(@"(9"ABC"%(4#D4&$&"&'(-:"
56$7".()#"$+8#"6-9(*)&$6(-
! >%<@6- 96'#.

! 3")(4<'#"6."%*#&$#4"@+"'(&46-:"&"%<@6- 56$7"
!"#(2"'$,)$*-$ (*"!"#(2"'$3(*2.*-*

! ;(4<'#"%&-"@#"<-'(&4#4"56$7"
!"#(2"'$45'(*2
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! E(&46-:"&")(4<'#"&'.("%(86#."6$"$("$7#"4#F6%#

! ,&-"$7#-":#$"$7#"&44*#.."(9"9<-%$6(-."&-4"
:'(@&'"F&*6&@'#.G
!"#(2"'$6$-7"5!-8(5
!"#(2"'$6$-6'(9*'
!"#(2"'$6$-:$;<$=

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! H-%#"&")(4<'#"6."'(&4#4!"&-4"5#"7&F#"&"
9<-%$6(-"8(6-$#*!"5#"%&-"%&''"&"9<-%$6(-

! I#")<.$".#$<8"$7#"!"!#$%&'()!(*&+'(,!(%)
96*.$

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! JK#%<$6(-"#-F6*(-)#-$"6-%'<4#.G
" L7*#&4"M'(%?"N6=#
" N7&*#4";#)(*+"N6=#
" O<-%$6(-"B&*&)#$#*.
" A*64"N6=#

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! L7*#&4"M'(%?"N6=#G"
!"7"5!>$-?'(!@>A*0$

! N7&*#4";#)(*+"N6=#G
!"7"5!>$->A*)$2>8B$

! O<-%$6(-"B&*&)#$#*.G
!"C*)*%>$->8B$DE!"C*)*%>$-8DE
!"C*)*%>$-=DE!"C*)*%>$-F
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

A PI

! !"#$%&#'(%#)%)(*%+*%*,(%)+-(%*#-(%+)%*,(%
./01*#20%#0321+*#204
!"#$"%!&'()*

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

Threading Hierarchy
Execution Model
Software Hardware

Threads are executed by thread
Thread
processors
Processor
Thread

Thread blocks are executed on
multiprocessors

Thread blocks do not migrate

Several concurrent thread blocks can
Thread reside on one multiprocessor - limited
Block Multiprocessor by multiprocessor resources (shared
memory and register file)

A kernel is launched as a grid of
thread blocks
...
Only one kernel can execute on a
Grid device at one time
Device
© 2008 NVIDIA Corporation.

Thread Batching

Kernel launches a grid of thread blocks
Threads within a block cooperate via shared memory
Threads within a block can synchronize
Threads in different blocks cannot cooperate
Allows programs to transparently scale to
different GPUs
Grid
Thread Block 0 Thread Block 1 Thread Block N-1

…
Shared Memory Shared Memory Shared Memory


Transparent Scalability

Hardware is free to schedule thread blocks on any
processor
A kernel scales across parallel multiprocessors

Kernel grid
Device Device
Block 0 Block 1

Block 2 Block 3

Block 4 Block 5

Block 0 Block 1 Block 0 Block 1 Block 2 Block 3
Block 6 Block 7

Block 2 Block 3 Block 4 Block 5 Block 6 Block 7

Block 4 Block 5

Block 6 Block 7


Indexing Arrays: Example
In this example, the red entry would have an index of 21:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

M = 8 threads/block

blockIdx.x = 2

int index = threadIdx.x + blockIdx.x * M;
= 5 + 2 * 8;
= 21;

Indexing Arrays: Example
In this example, the red entry would have an index of 21:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

M = 8 threads/block

Addition with Threads and Blocks blockIdx.x = 2

int index = threadIdx.x + blockIdx.x * M;
The blockDim.x is a built-in variable for threads per block:
= 5 + 2 * 8;
int index= threadIdx.x + blockIdx.x * blockDim.x;
= 21;

A combined version of our vector addition kernel to use blocks and threads:
__global__ void add( int *a, int *b, int *c ) {
int index = threadIdx.x + blockIdx.x * blockDim.x;

Control Flow Divergence

What happens if you have the following code?
!"#"$$#%&'()*+*,-,..
/
*$01#.2
3
(45(
/
*$06#.2
3


Branch

Path A

Path B


Nested branches are handled as well
!"#"$$#%&'()*+*,-,..
/
!"#0)'#%&'()*+*,-,..
*$12#.3
(45(
*$16#.3
7
(45(
*$18#.3


Branch

Branch

Path A

Path B

Path C


for correctness (*)
You might have to think about it for performance
Depends on your branch conditions


Performance drops off with the degree of divergence

!"#$%&'$&()*+,+-.- /012
3
%*!) 45
...
%*!) 65
...
7

Divergence

35

30
Performance

25

20

15

10

5

0
0 2 4 6 8 10 12 14 16 18

Divergence

!""#$%&"'
()*+%,-.&/0*#"0.1&/-%*+-+2+"#0+,-/+3#+&0.%44'5-/1-+2+"#0.&6-
10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+-4%0+&".+/-%&,-8++$-0)+-
)%*,7%*+-9#/'

!""#$%&"' :-;#<9+*-1=-7%*$/-*#&&.&6-"1&"#**+&04'-1&-%-
<#40.$*1"+//1*-,.>.,+,-9'-<%2.<#<-&#<9+*-1=-7%*$/-0)%0-"%&-
*#&-"1&"#**+&04'

?.<.0+,-9'-*+/1#*"+-#/%6+@
A+6./0+*/
B)%*+,-<+<1*'


!"#$%&'()*'+*,-'.)/*,&0,$&

1'#2'3"#$%&'4'1'#2'5/"0,(*#$)&&#*&
6#'7""'5/"0,(*#$)&&#*&'879)'70'")7&0'#:)'3"#$%'0#');)$/0)

1'#2'3"#$%&'<'1'#2'5/"0,(*#$)&&#*&'4'=
>/"0,(")'3"#$%&'$7:'*/:'$#:$/**):0"?',:'7'5/"0,(*#$)&&#*
!!"#$%&'()*+",-.%))('08)'87*-@7*)'3/&?
6/3A)$0'0#'*)&#/*$)'797,"73,",0?' *)B,&0)*&C'&87*)-'5)5#*?

1'#2'3"#$%&'4'DEE'0#'&$7")'0#'2/0/*)'-)9,$)&
!"#$%&');)$/0)-',:'(,()",:)'27&8,#:
DEEE'3"#$%&'()*'B*,-'@,""'&$7")'7$*#&&'5/"0,(")'B):)*70,#:&


Kernel Memory Access

Per-thread
Registers On-chip
Thread
Local Memory Off-chip, uncached

Per-block
Shared • On-chip, small
Block • Fast
Memory

Per-device

Kernel 0 ... • Off-chip, large
• Uncached
Global • Persistent across
Time

Memory kernel launches
Kernel 1 ... • Kernel I/O

!"#$%&"'%
! !"#$%&'()*%#&+,&#%$-./%#.&0%#&./#%")&
0#+1%..+#&23456&-'&.+)%&7"#89"#%:
! ;%#+<1=+1>&1?1=%&"11%..
! @/+#%&%-/7%#&A5*-/&-'/%$%#&+#&A5*-/&,=+"/

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

()*+,-."/)'0
! B&.)"==&0+#/-+'&+,&$=+*"=&)%)+#?&/7"/&-.&
0#-C"/%&/+&"&./#%")&0#+1%..+#
! D,/%'&(.%8&".&+C%#,=+9&,#+)&#%$-./%#.
! @=+9&/+&"11%..&2.")%&".&$=+*"=&)%)+#?:

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

12+'"3-."/)'0
! B&*=+1>&+,&)%)+#?&/7"/&-.&.7"#%8&*?&"==&
./#%")&0#+1%..+#.&-'&"&)(=/-<0#+1%..+#

! 3EFG&0%#&*=+1>H&./+#%8&-'&3EI3FG&*"'>.
! J%#?&,"./&/+&"11%..&2-K%K&".&,"./&".&#%$-./%#.L:&&
9-/7+(/&!"#$%&'#()*&+,

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

Global Memory

• Different types of “global memory”
Per-thread
Registers On-chip

• Linear Memory Thread
Local Memory Off-chip, uncached

• Texture
Per-block Memory

• Constant Memory
Block
•
•
Shared
Memory
On-chip, small
Fast

Per-device

Kernel 0 ... • Off-chip, large
• Uncached
Global • Persistent across
Time

Memory kernel launches
Kernel 1 ... • Kernel I/O

! !"#$%#&'#%(")'*+%,*"-'*+%."/0#'/#%'/1%2$3#45$%
6$6"57%'**%)"6$%85"6%#&$%0'6$%9&70:)'*%
6$6"57%9""*
! ;40#%1:88$5%:/%'))$00%9'##$5/0+%)')&:/<+%$#)=

!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

4,)5+,-."/)'0
! M7%&="#$%&*=+1>&+,&)%)+#?&.7"#%8&*?&"==&
)(=/-<0#+1%..+#.&+'&/7%&1+)0(/%&8%C-1%
! @-N%&8%0%'8.&+'&8%C-1%&! 5OEPG&/+&3KOQG
! R-$7&*"'89-8/7&S&344QGT.

! @=+9&/+&"11%..&! .%C%#"=&7('8#%8&1=+1>&1?1=%&
="/%'1?K&
! 678-1"17%8
!"#$$%&$'()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<(
=3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0

Constant Memory
Constants set by CPU, read by GPU
Each SM has 8kiB cache for constants
Optimized for broadcast
Accessing different elements forces serialisation
Can speed some calculations
Can relieve register pressure

Constant Memory
Declared at ﬁle scope
__constant__ float dc_myConst;

Set via cudaMemcpyToSymbol API call
cudaMemcpyToSymbol( “dc_myConst”, 3.14f, sizeof(float) )

Accessed by name in kernel
__global__ MyKernel( ... ) {
....
float myVal = dc_myConst+1;
....
}

Textures

Textures are essentially look up tables
Can only be written by the host
Cached on each multiprocessor (8kiB)
Optimised for 2D spatial locality
Hardware interpolation possible
Limited precision
Can clamp or wrap at boundaries

Textures

Declaration and setup rather involved
See programming guide
Accessed in kernels via texture fetches:
tex1D, tex2D, tex3D, etc.

Co-ordinates at texel centres
Have to take care when accessing elements

Textures

Can improve load coalescing from global memory
If whole texture ﬁts in 8kiB cache, has grid lifetime
Clamping/wrapping can aid edge case handling
Have to test to determine beneﬁts

General Principles

Memory access patterns are crucial
Even CPUs are typically memory bound
GPUs have 100x FP
Only 10x memory bandwidth
Have to keep the GPU busy

PC Architecture
8 GB/s
>?@

?>L9G=2%&66"K16
J%+8#"F7(&"K16

H%'2$7,6">'%("I"
A+%#$)%7(B& F+1#$)%7(B&
>@C!

E&.+%/"K16 ?>L"K16
3+ Gb/s
CD!E F!:! G#$&%8&# !
160+ GB/s
to
VRAM 25+ GB/s
modiﬁed from Matthew Bolitho

PCIe Transfers
(ﬁrst thing to optimize?)

!"#$%&'()##*+&%,
-+#"(4&5&(6*+3(
-./(123+*"(5+ -./
0./(123+*"

!"#$
%&'()*+,
789(-/4) -,$#<25 0./
:2*92';<=

& -. &. /.'
()*+ ()*+

-+#"(4&5&(6*+3
0./(323+*"(5+ -./ 0./
123+*" 123+*"
-./(123+*"

*Averaged observed bandwidth

Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU
memory


Processing Flow

PCI Bus

memory
2. Load GPU program and execute,
caching data on chip for performance


Processing Flow

PCI Bus

memory
2. Load GPU program and execute,
caching data on chip for performance
3. Copy results from GPU memory to CPU
memory


PCIe Transfers

PCIe 2.0 x16 bus has
Latency of 10 µs (observed)
Bandwidth of 8GB/s (theory), 5 GB/s (observed)
A lot of calculations can happen in these times

PCIe Transfers
PCIe transfers occur via DMA
GPU reads pages direct from CPU memory
Very bad if page gets moved mid-transfer
CUDA maintains internal pinned memory buffers
Used for cudaMemcpy calls
Data staged through these

!"#$%&'#'()*+(#$,-'#)
!"#$%#&%'()*+'(',-$."/0$1'#&2'!"#$%&'#'()
32&$24'4#-$.521'#&26
7-$"/82'+9:6'+1;$.5&0$0-1
*&/<2&'+9:6'!"#$"%!&'()*!" 0&'!"#$"%!&'()*+,-%!!"
!;$.5&0$0-1',-$."/0$1'=40.>'0$'#$;'?&/0&'#1;$.5&0$0-1'
>2&$24'4#-$.521

!"#$%&'
!"#$%&'!()
')*&+,&- !"#$%&'()**"+),#"-.

!"#$%&'!() /,)#'(01%(')*&+,&- #1(21*3-"#"4(

$&#)-(213.()'(21*3-"#"5

!"#$%&'(!&)&*+,$-./010$&2#)1&(
'"#'(6"7,8)1%5(9%,+"100(6"#:""&(;<=(
2.2-"'(,&+(%"'31&'"('3""+
!"#$%&'(!&)!2&#",&)3(4
!"#$%&'(!&)!2&#",&5(&,#
!"#$%&'(!&6,7!8(4-)94!
()*+'),-./,0#1,'23*+#&'45,6745'"5,
6)'#5*74,8&#91

!"#$%&'($()"*!+,"
!""#$%&'()#'*%(+,-'./#0+.#+"/'#
1%#+/).02('.'3/4#+.5#(%,3(.#
-&&%5-+,%")
$%&'()#&3/,#1%#+""'0+,%5#+/#

!"#$%&'()*++'!,-
!"./&'()*++'!,- +,--./ &%&'()#+""'9/#5-(%0,#
;$!#,(+./8%(/#1)#,2%#<=>#,'#
6"5%(#7%(/-'.#'8#,2%/%#83.0,-'./#
!"#$%*++'!&'(),-0
"'0:%5#,'#+#*2)/-0+"#+55(%//
!"./#/++'!&'(),-+"/'#9'(:4#

!"#$%&'($()"*!+,"*-.($/01
!"#$%&'(')%*+%,&'-*%'./%/%0'/#'$+'12%'345
6+7',-/+82'"9%*2%-0'$&'"9%*,-##%0

:7+82*"+"/&'8-,,&'&2"/,0';%'0"+%'"/1&$0%'8*$1$8-,'
&%81$"+&' &"<%'"='12%&%'-*%'%>#%+&$9%?
@+$1$-,$A-1$"+
B%<"*7'-,,"8-1$"+&
:1*%-<'C'D9%+1'8*%-1$"+
@+1%*"# *%&"/*8%''234"/'5/4($

!"#$%&'
!"#$%&'!()*+),!!"#$%&"'"&()*"
#$%&"'"(+,*
(%-./)",$01)*
.102",0&34.2,567%1&"8%1&* .&+'$&/$()+'01(
,0&39)+.32/)"()+.32:""""""" (&&/2$-&+$/&3$0((,1'
()+'01$4$5

!"#$%&'!()*+),! 6'),+/($711'%70)'89
')-&.,&/ 6'),+/($711'%70)'89
!"#$%&'!()*+),! 6'),+/($711'%70)'89

!"#$%&'($&)*'+$(),--$*'+'

!"#$01.&$#2),!1.3,45&!:; :07)($-&+$';'+9)*7/<$&/$)*'$="#$)&$-7/7(*2$
)*'/$+'),+/(

!"#$%&'()$'*#'+&#,'%-'.-$/%-0'(-123#%/-$
!"#$%&'()*)+,+-.'&'()+
!"#$%#&'() !"#$%&'()*+'$)'/0+,+!"%&'()*+'$)'/0

!"#$%&'()*)+1)23#2('4
!"#$%&'()-5'$)'/61)23#2('7804
!"#$,'-!./01/(!/#'9)792"5!'7:;)'97!"#$<'=!>;129)?23'&@!'7804
!"#$%&'()2'!3+#/1)23#2('04
-/4'+('5
!"#$,'-!./01/(!/#'9)792"5!'7:;)'97!"#$<'=!>;3'&@!'?2129)7804
!;<'5$3'&0',%.//'2#"#
*$+%,'-.,%'/0"'#1#")%2+34'(#/0"#
!"#$%&'()6/(!7+3(89'/1)23#2('04 !"%&'()A'!25#/1)23#2('0
%0'50678#%#9'%2#3'"#%."3,
=2#'/+",%'6#60")'507)'+,'&03#9',0'%2#'6#60")'$%'
,0."5#'50.8&'(#'.,#&'$4$+3'()'%2#'!;<
!"#$:7+'$#6/(!7+3(89'/04 *$+%,'/0"'#1#")%2+34'03'%2#':;<'
%0'/+3+,29'%2#3'"#%."3,

!"#$"%&'()*+%#,$-)./01232"245)62"7)8#$539%#
!""#$%&'()(*+,,'-(./0(1233'456(*+,78*#,7(1'&9'',(&:';
!"#$%&'(&)$*$+,
-.!/$0$1234%&'567,
<<<((!""#$%&'(1233'45
89!:;$<=/.";>
?
!"#$%&'!()*+),!<1@34%&'A1234%&'5<%&'(&)BC>D67AE!F;A
G&/HI;)G1JK.E#L.M;-!G;A+>,
')-&.,&/0
.#,$244',&

')-&.,&/1

2.$3%)4.$'&<1234%&'5%&'(&)7>,

!"#$56.&$#7),!6.8,9:&<>,
%&'(&)BB,$%&'(&)$D*6,
N

!"#$%&'()*+, GPU

!"#$#%&'()*&+,-.&/0123-4&
/5256,7,-8&9:&;<;&.5=4&5 #-$% !.+56')
20340) 20340)
>4>,?5-4>&$@%&4AB,A4
$@%&-C5A*D4C*&0=4C&(/#4&?5A&64
?0A?3CC4A-&+,-.&/)$%&E4CA47&4F4?3-,0AG
&'()*+, 5770+*&,A>424A>4A-&?0A?3CC4A-&,AH -)+.(/ !.0'(.11)(

0C>4C&I3434*&0D&4F4?3-,0A
!"#$%&'($)*&+,-./&'($)
!"#$%&'($)-'($&(01+,!"%&'($)-'($&(01
@37-,274&*-C451*&4F,*-&+,-.,A&5&*,AB74&?0A-4F-J&-.48&
*.5C4&1410C8&5A>&0-.4C&C4*03C?4*
'()&@410C8

78#%!.54),%.01/9%%!"#$%&'!()*+,-).! :*00.'%.;)(1*5<

!"#$%&'(%#%&$"$#
!""#!$%&' ()&'*+,&#-./+0*+0$#1.-0#.
)"#$%& 2./.30*0/
4)&*+30#50/&0"#6.)&'1
!!!"#$%&'%()*"+,-./'%()*'010 '%()*"'2$)34555

7/+-0/#!89
.67368.9#$%&:;<8.=>68.2%-8*"?%&29*"9)%@92*"
;2$)34A

:,2+0$#;#50/&0"#".)&'10$#<+*1#*10#)%&$ $*/0.3#
2./.30*0/#0=0')*0#*+,-#.$#

Scheduling on GPU
,12'&3456789 ,;<= ,;8<A46
">?@>6 ">?@>6
Independent Tasks
,-./&'(
:'3!&' :'3!&+
,-./&+( !"#$"%&'(
,-./&'( ,-./&+(
,-./&+) !"#$"%&')
!"#$"%&'( ,-./&+)
!"#$"%&'*
!"#$"%&') !"#$"%&+(
,-./&') !"#$"%&+(
!"#$"%&'* ,-./&+*
,-./&+*
,-./&') ,-./&+0
,-./&+0

=#BC& =#EBF-(

!"#$%&'()$*+$,*-$#./ 7.D$.(
=2>5&!9
7.D$.(

'@17!A&!
01234&0!5/ 671378&!9
=2>5&!9

671378&!:
671378&!9

671378&!: 671378&!;

671378&!; =2>5&!:

=2>5&!:
=2>5&<9

'@17!A&< =2>5&<:

=2>5&<9
671378&<9
=2>5&<:

671378&<9 =2>5&<;

=2>5&<; !"#$"%&'(%(%)&*+%&,$--%.&$"&/0%&1+.%+&21.%& =2>5&<?

=2>5&<?
$)&%3%2(/%.

()<=' ()&<A"$

!"#$%&'()*$'+#*$# ->?@>$
(+34'12
->?@>$

!9.-1B'1 (+..-(9':14;
(+34'72 ,-./-0'12
(+34'12

,-./-0'12 (+34'75 ,-./-0'15

,-./-0'15
,-./-0'16
,-./-0'16 !9.-1B'7
(+34'72
(+34'15 ,-./-0'72
(+34'75
(+34'76
,-./-0'72

(+34'15 (+34'78

(+34'76

(+34'78

!"#$%&'(")*%&+,,-./0%+121+% 3')"45",*
>/%'%;6,'?;,'0*';./#%@%'.*9,A.*)910%'*@%+:;9=
61 7.+89'%!"#$%&+,,-./B
C1"0#)%D'!"#$%&'(&)*!&+,$-.23'?#0/'!"#$(&)*!&/$012.' $:;<
E+#@%+D'!"3'435&$'&23'?#0/'3673897/:;71<%8
:1 ;99"<+$'%,-..'=%5>?%('(")*
C1"0#)%D'!"#$12.':,,2!=>F'16%'!"#$12.':,,2!/$00&# $:;<
E+#@%+D'!"/&?12.':,,2!=> 16%'36(:7/@/1<%8:AA<37(@BC3@/:;
@1 A'$%+%5?B;%='C-<'%,"-.$')%$"%$D-#%('(")*
C1"0#)%D'!"#$12.'D&'(&)*!&;2*E'&5=>
E+#@%+D'!"/&?12.'D&'(&)*!&;2*E'&5=> !"#$#%&'(%)*+,'-+./#0%.01+%'
E1 FG#$%G#'%$D+$%,"-.$')%-.%*"G)%2').'9#H 2!(-3'45!6'7%+*8.*9,'
%:#)#";0%6'&;0;'0+;"6$%+'
BG/%.H'0/%'!"#$"%&'()$*+',- A'./01234.20566748/620.590$5:0&;<60$2$;7= ;:0*<%0/%+=
&%@#.%'9+*9%+0,'$:;<'0*'6%%'#$'I%+*8G*9,'#6';@;#:;J:%K

!"#$%&$'()*+,-".,/"0
!"#"$%&$#'"(&)*''*+$,-*'$#.*$/01*$23&$"3#,4"#%5"6678$

9&*$:.*($+"#"$%&$,(67$'*"+;:'%##*($,(5*
9&*$),'$-*'7$&4"66$"4,3(#&$,)$+"#"$<(*:$-"'%"26*&8$
0/9;=/9$5,443(%5"#%,(>
9&*$:.*($5,4?3#*;4*4,'7$'"#%,$%&$-*'7$.%@.$"(+$
,553?"(57$%&$.%@.8$&,$6"#*(57$,-*'$/01*$%&$.%++*(
0,"6*&5%(@$%&$!"#$#!%&&'(%4?,'#"(#A

PCIe Transfers Optimization

PCIe bus is slow
Try to minimize transfers
Use pinned memory on host whenever possible
Try to perform copies asynchronously

CUDA-GDB

Extended version of GDB with support for C for CUDA

Supported on Linux 32bit/64bit systems

Seamlessly debug both the host|CPU and device|GPU code
Set breakpoints on any source line or symbol name
Single step executes only one warp except on sync threads
Access and print all CUDA memory allocations, local, global,
constant and shared vars.

Walkthrough example with sourcecode : CUDA-GDB manual

Linux GDB
Integration with
EMACS


Linux GDB
Integration with
DDD


CUDA-MemCheck

Detects/tracks memory errors
Out of bounds accesses
Misaligned accesses (types must be aligned on their size)
Integrated into CUDA-GDB
Linux and WinXP
Win7 and Vista support coming

11
©NVIDIA 2010

CUDA Driver Low-level Profiling support
1. Set up environment variables
!"#$%&'()*+,-./012345
!"#$%&'()*+,-./0123,(6745
!"#$%&'()*+,-./0123,(/80194:$;<=>?&"&'
!"#$%&'()*+,-./0123,2/94#%$<=@!?:AB

2. Set up configuration file
FILE "config.txt": FILE "profile.csv":
>#CA&D%&&=E!A&DE# G'()*+,-./0123,2/9,73.61/8'5?H
G'()*+,*371(3'I'9!0$%:!'JJII'9K
=;A&%C:&=$;A G'()*+,-./0123,(67'5
G'K1F36K+F-0+(K/.'<DLMLNN5!DL:5L:
>#CA&D%&&=E!A&DE#OE!&P$QO>#C&=E!O:#C&=E!O$::C#D;:RO=;A&%C:&=$;A
3. Run application 55H<S!DD5I!TNLLIOE!E:#RU&$*OV?TLJO5L?III
ED&%="FC@ 55H<S!DD5I!HQD:IOE!E:#RU&$*OH?WWSOS?III
55H<S!DD5I!MH:!IOE!E:#RU&$*OV?TLJOW?III
55H<S!DD5I<L!DWIO,X5IQED&%="EC@-<==6,==6,O5M?LMWOSI?IIIOI?TTTOST
4. View profiler output HL
55H<S!DD5I<SSTDIOE!E:#R*&$UOV?VVWOTW?III


CUDA Visual Profiler - Overview
Performance analysis tool to fine tune CUDA applications

Supported on Linux/Windows/Mac platforms

Functionality:

Execute a CUDA application and collect profiling data

Multiple application runs to collect data for all hardware performance counters

Profiling data for all kernels and memory transfers

Analyze profiling data


CUDA Visual Profiler data for kernels


CUDA Visual Profiler computed data for kernels
Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

Global memory read throughput (Gigabytes/second)

Global memory write throughput (Gigabytes/second)

Overall global memory access throughput (Gigabytes/second)

Global memory load efficiency

Global memory store efficiency


CUDA Visual Profiler data for memory transfers

Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
e.g. H to D: Host to Device

Synchronous / Asynchronous

Memory transfer size, in bytes

Stream ID


CUDA Visual Profiler data analysis views
Views:
Summary table
Kernel table
Memcopy table
Summary plot
GPU Time Height plot
GPU Time Width plot
Profiler counter plot
Profiler table column plot
Multi-device plot
Multi-stream plot
Analyze profiler counters
Analyze kernel occupancy


CUDA Visual Profiler Misc.
Multiple sessions

Compare views for different sessions

Comparison Summary plot

Profiler projects save & load

Import/Export profiler data
(.CSV format)


CUBLAS
CUDA accelerated BLAS (Basic Linear Algebra
Subprograms)
Create matrix and vector objects in GPU memory space
Fill objects with data
Call sequence of CUBLAS functions
Retrieve data from GPU (optionally)

!"#$%&'#((')'*+,-#.%/'00'1%$.+2%!'3'4.56-.5$'7
8
9:;$+4<=%*>
?$5+.'+$6"+'@''1%$.+2%!'A'9:;$+4<15.&BC1-1CDC1-ECD7F'''''''''
9:;$+4<+,6E&BC'+$6"+C1-1CDC1-,CD7F'

AA'%>%/E'GH'#.%/+.#524C'/%4.+/.'/%4#1:+$
#?'&#'I'GH'@@'H7'8
9:;$+4<=%*>
9:;$+4<956E&BC'1-;C'DC'1-/C'D7F
9:;$+4<+,6E&BC'JDKHC'1-EC'DC'1-/C'D7F
L
%$4%'
9:;$+4<+,6E&BCJ+$6"+C1-ECDC1-/CD7F'
KKK'


CUBLAS Features
Single precision data:
Level 1 (vector-vector O(N) )
Level 2 (matrix-vector O(N2) )
3
Level 3 (matrix-matrix O(N ) )
Complex single precision data:
Level 1
CGEMM
Double precision data:
Level 1: DASUM, DAXPY, DCOPY, DDOT, DNRM2,
DROT, DROTM, DSCAL, DSWAP, ISAMAX, IDAMIN
Level 2: DGEMV, DGER, DSYR, DTRSV
Level 3: ZGEMM, DGEMM, DTRSM, DTRMM, DSYMM,
DSYRK, DSYR2K


CUBLAS Performance: CPU vs GPU

CUBLAS: CUDA 2.3, Tesla C1060
MKL 10.0.3: Intel Core2 Extreme, 3.00GHz

!"#$%&'()*+,*-./0)
'#"

Up to 2x average speedup over CUBLAS 3.1
'!"

&"
&1))231'456'78$

Less variation in performance
for different dimensions vs. 3.1
%"

$"

#"
,-.
/(0'
!"
/(0#
'!#$ #!$& (!)# $!*% +'#! %'$$ )'%&
7.9*:;'2:-)/5:,/5'<=;=>

Average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}
!"##$%&'(%)%&'*%+,%-./0/1%$2345%!(676%89"
:;<%*6'('&'6(=%+,%>?5@A!+B2%/,C24%!+B2%DE%F-2G542HI

CULA (LAPACK for heterogeneous systems)
GPU Accelerated
Linear Algebra

Partnership

! Dense linear algebra Developed in
! C/C++ & FORTRAN partnership with
! 150+ Routines NVIDIA

MATLAB Interface Supercomputer Speeds

! 15+ functions Performance 7x of
! Up to 10x speedup

CULA - Performance
Supercomputing Speeds
This graph shows the relative speed of many CULA functions when compared to

(Fermi) and an Intel Core i7 860. More at www.culatools.com

Sparse Matrix Performance: CPU vs. GPU
Multiplication of a sparse matrix by multiple vectors
35x

30x

25x

20x

15x

10x
"Non-transposed"
5x "Transposed"
0x MKL 10.2

Average speedup across S,D,C,Z

!"#$%&#'()*+(,-(./010%(23456(!+787(9$"
:;<(=7*+*)*7+>(,-(?@6AB!,C3(0-D35(!,C3(EF(G.3H653IJ

CUFFT
CUFFT is the CUDA FFT library
Computes parallel FFT on an NVIDIA GPU

Plan contains information about optimal configuration for a
given transform.
Plans can be persisted to prevent recalculation.
Good fit for CUFFT because different kinds of FFTs require
different thread/block/grid configurations.


CUFFT Features

1D, 2D and 3D transforms of complex and real-
valued data
Batched execution for doing multiple 1D transforms
in parallel
1D transform size up to 8M elements
2D and 3D transform sizes in the range [2,16384]
In-place and out-of-place transforms for real and
complex data.


CUFFT Example
Complex 2D transform
!"#$%&#'()'*+,
!"#$%&#'(-'.*/

01$$234&"5# 654&7
01$$289:65#; <%"424='<9"4247
01"4>45590??@9%"<<AB%"424='C%D#9$?01$$289:65#;A<()<(-A7
01"4>45590??@9%"<<AB9"424='C%D#9$?01$$289:65#;A<()<(-A7

E<'8F#42#'4'*G'HHI'654&J'<E
01$$2K54&*"?B654&='()=(-='8LHHIM8*8A7

E<'LC#'2N#'8LHHI'654&'29'2F4&C$9F:'2N#'C%O&45'912'9$'6540#J'<E
01$$2P;#08*8?654&='%"424='9"424='8LHHIMHQRSTRGA7

E<'U&@#FC#'2F4&C$9F:'2N#'C%O&45'%&'6540#J'<E
01$$2P;#08*8?654&='9"424='9"424='8LHHIMU(VPRWPA7

E<'G#C2F9X'2N#'8LHHI'654&J'<E
01$$2G#C2F9X?654&A7

01"4HF##?%"424A7
01"4HF##?9"424A7


CUFFT Performance: CPU vs GPU

!"##$%&'()%*+,-,.%$/012%34565%789
:;<%45'4=4)%>"2?@3A=/%,BC/1%3A=/%DE%F*/G21/HI%('&7JK

CUFFT 3.2: Improved Radix-3, -5, -7
123-45*6+&%768996(::06 123-45*6+;%768996(::60
$"! (!

'!
$!!
"!

#"!
!"#$%&

!"#$%&
&!
+$!(!,-%.$ +$!(!,-%.$
+$!(!,-%.# %! +$!(!,-%.#
#!!
/01 /01
$!
"!
#!

! !
# $ % & " ' ( ) * #! ## #$ #% #& #" # $ % & " ' ( ) * #! ## #$ #% #& #"
'()*+,-./0 '()*+,-./0

Radix-5, -7 and mixed radix improvements not shown

9<""=6*>?6@6*>A6(B6CDE;EF6=/,'269?GHG6!%<
IJ#6AG>?>*>G?K6(B6LM2359(N/6EBO/'69(N/6-H6+C/P2'/Q0

!"#$$
!"#$%&'&(")*"+$,+-./&*)&0'12/".'&'##/#".&$0$3$4/5"*)&"!678
9:";'&&$5"<=>?7?8@A"B:"CD/15"<6!7@A"E:"E/1,F.3'"<6!7@A"8:"7'4$G5)1"
<6!7@A"E:"HI/1,"<6!7@A"J:"K+'1,"<6!7@

8#,)&$3+05
2FG..E2'1A"2FG..E/,0/13/GE2'1A"2FG..L/GF2/
2FG..E)&3A"2FG..L'1GA"2FG..E.'&5/9'3&$M>/23)&9F#3$.#(

8GG$3$)1'#"'#,)&$3+05"$1".&),&/55
N&'.+5A"0)&/"5)&3$1,A"3&//5A"+'5+$1,A"'F3)3F1$1,

!"#$$%&'()*+,
!"#$$!%&'()*+,-(%& .%&'() /010!"#$$23!456000
!"#$$24##60!"#$$27894:60!"#$$29$:;95279<=4<#0>?

!"#$$@,&ABC DB,&?0
!"#$$<CE*B- +CE*B-0/0.*ADD$B,&FGDB,&60
.%&'()60
&*HIBCHC&-E60
J60KL?0
.*ADD3.,&FDB,&60A2%A,-,60A2(A,-,60&*HIBCHC&-EL?

Objectives

!   Programmer productivity
!   Rapidly develop complex applications
!   Leverage parallel primitives
!   Encourage generic programming
!   Don’t reinvent the wheel
!   E.g. one reduction to rule them all
!   High performance
!   With minimal programmer effort
!   Interoperability
!   Integrates with CUDA C/C++ code

3
© 2008 NVIDIA Corporation

!"#$%&
!""#$%&'()$%#(*+,),-#./,#!012
3*&*45#6$)78),8#9%&'()$%#:*+,),-#;69:<#
!/7$)*7%,5
!"#$%!&&"'%!()*+!'#,-.
!"#$%!&&/*)0+*()*+!'#,-.
2(=/,*$>&5
!"#$%!&&%'#!12
!"#$%!&&#*/$+*12
!"#$%!&&03+4$%0)*(%+5312
?$4@ 63

!"#$%&'()*+,-.
!!"#$%$&'($")*+"&'%,-."%/.0$&1"-%"(2$"2-1(
(2&/1(332-1(45$6(-&78%(9"245$6:)"77";<=>"
(2&/1(33#$%$&'($:245$6?0$#8%:=@"245$6?$%,:=@"&'%,=>

!!"(&'%1A$&",'('"(-"(2$",$586$
(2&/1(33,$586$45$6(-&78%(9",45$6 B"245$6>

!!"1-&(",'('"-%"(2$",$586$
(2&/1(331-&(:,45$6?0$#8%:=@",45$6?$%,:==>

!!"(&'%1A$&",'('"0'6C"(-"2-1(
(2&/1(336-DE:,45$6?0$#8%:=@",45$6?$%,:=@"245$6?0$#8%:==>

!"#$%&'()'*((+,-'.(/-
!"#$%&'()*(&+"#,-

./)012-3

45$"0-6()(#56

7)#2#68&9#3(&:(;*"3(<"3-*3=

!"#$%$&'($)*+&',-
!"#$%&'($)*+#$&#$,
-../+0%'#+#$&#&

12$%+34056$
7$58'&&'($+9'6$%&$+:;2<6=$+(>?

;6#'($+64880%'#*
>../+8$8@$5&+4%+60&2A0&$5&

PyCUDA

3rd party open source, written by Andreas Klöckner
Exposes all of CUDA via Python bindings
Compiles CUDA on the fly
presents CUDA as an interpreted language
Integration with numpy
Handles memory management, resource allocation
CUDA programs are Python strings
Metaprogramming modify source code on-the-fly
Like a really complex pre-processor
http://mathema.tician.de/software/pycuda


PyCUDA Example
! "#$%&' $()*+,-+&"./&0,1 )*+,
20 "#$%&' $()*+,-,*'%"3"'
4 "#$%&' 3*#$(
5
6 ,0703*#$(-&,3+%#-&,3+38595:-0,1'($/83*#$(-%,'42:
; ,<=$*070)*+,-#/#<,>>%)8,-1"?/90,-+'($/-"'/#1"?/:
@ )*+,-#/#)$(<A'%+8,<=$*90,:
B
C #%+070)*+,-D%*&)/E%+*>/8FFF
!G <<=>%H,><<0.%"+0+%*H>"I(8I>%,'0J,:
!! K
!2 "3'0"+L070'A&/,+M+L-L0N0'A&/,+M+L-(J5O
!4 ,P0"+L0Q0J702-GIO
!5 R
!6 FFF:
!; I*3)070#%+-=/'<I*3)'"%38F+%*H>"I(F:
!@ I*3)8,<=$*90H>%)S785959!::
!B
!C ,<+%*H>/+0703*#$(-/#$'(<>"S/8,:
2G )*+,-#/#)$(<+'%A8,<+%*H>/+90,<=$*:
2! $&"3' ,<+%*H>/+
22 $&"3' ,

RNG Performance: CPU vs. GPU
Generating 100K Sobol' Samples
25x

20x

15x

10x CURAND 3.2
MKL 10.2

5x

0x
SP DP SP DP

Uniform Normal
!"#$%&'()*'+,'%-.&.$'/0123'!*454'67"
89:';4)*)()4*<'+,'=>3?@!+A0'.,B02'!+A0'CD'E%0F320GH

OpenVIDIA
Open source, supported by NVIDIA
Computer Vision Workbench (CVWB)
GPU imaging & computer vision

Demonstrates most commonly used image
processing primitives on CUDA

Demos, code & tutorials/information

http://openvidia.sourceforge.net

References
• CUDA C Programming Guide
• CUDA C Best Practices Guide
• CUDA Reference Manual
• API Reference, PTX ISA 2.2
• CUDA-GDB User Manual
• Visual Proﬁler Manual
• User Guides: CUBLAS, CUFFT, CUSPARSE, CURAND

http://developer.nvidia.com/object/gpucomputing.html

one more thing
or two...

Life/Code Hacking #2.x
Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Life/Code Hacking #2.1
Speed listening


Speed listening
• Step 1: Collect
• online videos, tutorials, podcasts, etc.
• audiobooks
• youtube-dl, get_ﬂash_videos, jDownloader,
ffmpeg, mplayer, etc.
• etc.

Speed listening

• Step 2: Accelerate (time-stretch)
• VLC (Playback > Faster)
• sox $f{,.1.8X.mp3} tempo 1.8 50
• iPod ? mp3splt -t 5.00 -o small-@n large.mp3


Speed listening

• Step 3: chill or do more ;-)


[Harvard CS264] 04 - Intermediate-level CUDA Programming

More Related Content

What's hot

Similar to [Harvard CS264] 04 - Intermediate-level CUDA Programming

More from npinto

Recently uploaded

[Harvard CS264] 04 - Intermediate-level CUDA Programming