[Harvard CS264] 04 - Intermediate-level CUDA Programming

http://cs264.org

Transcript
  • 1. Massively Parallel Computing CS 264 / CSCI E-292 | Lecture #4: Intermediate-level CUDA | February 15th, 2011 | Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 2. Administrivia: • HW1: due Fri 2/18/11 (this week) • Projects: think about it, consult the staff • New guest lecturers! Max Lin (Google), Kurt Messersmith et al. (Amazon), David Rich et al. (Microsoft)
  • 3. During this course, we’ll try to adapt and use existing material ;-) (“adapted for CS264”)
  • 4. Today, yey!!
  • 5. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 6. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 7. Language: CUDA provides built-in vector types derived from the basic integer and float types: char1-char4, uchar1-uchar4, short1-short4, ushort1-ushort4, int1-int4, uint1-uint4, long1-long4, ulong1-ulong4, float1-float4. © Matthew Bolitho, Johns Hopkins University, Spring 2008
  • 8. Language: a vector type is constructed with a make_ helper function, e.g. make_int2(x, y); the individual elements of a vector type are accessed as .x, .y, .z, .w.
  • 9. Language: dim3 is a special vector type, the same as uint3 except that it can be constructed from a scalar to form a vector, e.g. dim3 grid(16) is equivalent to dim3 grid(16, 1, 1).
  • 10. Language: CUDA provides four global, built-in variables: threadIdx, blockIdx, blockDim, gridDim. They are accessible only from device code; you cannot take their address and cannot assign to them.
  • 11. Language: runtime math operations come in two types in single precision. __sinf(), __cosf(), __expf(): direct mapping to hardware, fast but lower accuracy (see the programming guide for details). sinf(), cosf(), expf(): compile to multiple instructions, slower but higher accuracy (2 ULP or less). The nvcc option -use_fast_math forces sinf() etc. to compile to __sinf() etc. © NVIDIA Corporation 2010. Unit of Least Precision (ULP) is the gap between the floating-point numbers nearest a given real number.
  • 12. CUDA APIs: an API allows the host to manage the devices: allocate memory & transfer data, launch kernels. Runtime API: high level of abstraction - start here! Driver API (aka “Device” API): more control, more verbose. (OpenCL: similar to the CUDA C Driver API) © NVIDIA Corporation 2010
  • 13. API: CUDA functionality is exposed via two different APIs: the low-level Driver API (prefix: cu) and the high-level Runtime API (prefix: cuda). Some things can be done through both APIs; others are specialized. They can be mixed together (with care).
  • 14. API: all GPU computing is performed on a device. To allocate memory, run a program, etc. on the hardware, we need a device context. Device contexts are bound 1:1 with host threads (just like OpenGL): each host thread may have at most one device context, and each device context is accessible from only one host thread.
  • 15. API: all Driver API calls return an error/success code of type CUresult; all Runtime API calls return an error/success code of type cudaError_t. An integer value of zero means no error. See cudaGetLastError and cudaGetErrorString.
  • 16. API: Runtime API calls initialize automatically; Driver API code must first call cuInit().
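The error-code convention on these slides lends itself to a checking macro around every runtime call; a minimal sketch (the macro name CUDA_CHECK is our own, not part of CUDA):

```
// Minimal error-checking sketch for the runtime API: wrap each call and
// print the human-readable message on failure.
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
        }                                                           \
    } while (0)

int main() {
    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void**)&d_buf, 1024 * sizeof(float)));
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```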
  • 17. Runtime API (high-level)
  • 18. API: the Runtime API provides a simplified interface for creating a context: cudaSetDevice, cudaGetDevice. Also useful: cudaChooseDevice.
  • 19. API: allocate/free memory with cudaMalloc / cudaFree; initialize memory with cudaMemset; copy memory with cudaMemcpy.
  • 20. API: the Runtime API has no explicit context management; behind the scenes, the runtime generates the calls to the device that set up the execution environment.
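The runtime-API allocate/initialize/copy calls on the slide above can be strung together; a minimal host-side sketch:

```
#include <cuda_runtime.h>

int main() {
    const size_t N = 256;
    float h_data[N];                       // host buffer
    float *d_data = NULL;                  // device buffer

    cudaMalloc((void**)&d_data, N * sizeof(float));   // allocate
    cudaMemset(d_data, 0, N * sizeof(float));         // initialize
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);               // host -> device
    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);               // device -> host
    cudaFree(d_data);
    return 0;
}
```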
  • 21. “Device” Driver API (low-level)
  • 22. API: the first (optional) step is to enumerate the available devices: cuDeviceGetCount, cuDeviceGet, cuDeviceGetName, cuDeviceTotalMem, cuDeviceComputeCapability, cuDeviceGetAttribute, ...
  • 23. API: when we select a device with cuDeviceGet we get a device handle of type CUdevice. We can now create a context with cuCtxCreate.
  • 24. API: once we have a context (made current), we can allocate memory, call a GPU function, etc. A context is implicitly associated with its creating thread. To synchronize all threads (the CPU host with the GPU threads), call cuCtxSynchronize: it waits for all GPU tasks to finish.
  • 25. API: allocate/free memory with cuMemAlloc / cuMemFree; initialize memory with cuMemset; copy memory with cuMemcpyHtoD, cuMemcpyDtoH, cuMemcpyDtoD.
  • 26. API: device code is packaged into modules (much like dynamically loaded libraries), e.g. cubin files. A module is loaded by passing a file name to cuModuleLoad, or from an in-memory image with cuModuleLoadData. A module can be unloaded with cuModuleUnload.
  • 27. API: loading a module also copies it to the device. We can then get the address of functions and global variables: cuModuleGetFunction, cuModuleGetGlobal, cuModuleGetTexRef.
  • 28. API: once a module is loaded and we have a function pointer, we can call the function. We must set up the execution environment first.
  • 29. API: the execution environment includes: thread block size, shared memory size, function parameters, grid size.
  • 30. API: thread block size: cuFuncSetBlockShape. Shared memory size: cuFuncSetSharedSize. Function parameters: cuParamSetSize, cuParamSeti, cuParamSetv, cuParamSetf.
  • 31. API: the grid size is set at the same time as the function invocation: cuLaunchGrid.
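The driver-API sequence in the slides above (init, device, context, module, execution environment, launch) can be sketched end to end; a minimal sketch assuming a precompiled module file kernel.cubin exporting a kernel named my_kernel that takes a float* and an int (file and kernel names are hypothetical):

```
#include <cuda.h>

int main() {
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  fn;
    CUdeviceptr d_buf;
    int n = 1024;

    cuInit(0);                                   // must be first driver call
    cuDeviceGet(&dev, 0);                        // take device 0
    cuCtxCreate(&ctx, 0, dev);                   // bind a context to this thread
    cuModuleLoad(&mod, "kernel.cubin");          // load compiled device code
    cuModuleGetFunction(&fn, mod, "my_kernel");  // look up the kernel

    cuMemAlloc(&d_buf, n * sizeof(float));

    // Execution environment: block shape and parameters, then launch.
    cuFuncSetBlockShape(fn, 256, 1, 1);
    cuParamSetv(fn, 0, &d_buf, sizeof(d_buf));   // parameter offsets are manual
    cuParamSeti(fn, sizeof(d_buf), n);
    cuParamSetSize(fn, sizeof(d_buf) + sizeof(int));
    cuLaunchGrid(fn, n / 256, 1);                // grid size set at launch time

    cuCtxSynchronize();                          // wait for all GPU tasks
    cuMemFree(d_buf);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```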
  • 32. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 33. Threading Hierarchy / Execution Model (software → hardware): threads are executed by thread processors (thread → thread processor). Thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file) (thread block → multiprocessor). A kernel is launched as a grid of thread blocks; only one kernel can execute on a device at one time (grid → device). © 2008 NVIDIA Corporation.
  • 34. Thread Batching Kernel launches a grid of thread blocks Threads within a block cooperate via shared memory Threads within a block can synchronize Threads in different blocks cannot cooperate Allows programs to transparently scale to different GPUs Grid Thread Block 0 Thread Block 1 Thread Block N-1 … Shared Memory Shared Memory Shared Memory © 2008 NVIDIA Corporation.
  • 35. Transparent Scalability Hardware is free to schedule thread blocks on any processor A kernel scales across parallel multiprocessors Kernel gridDevice Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 0 Block 1 Block 0 Block 1 Block 2 Block 3 Block 6 Block 7 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 © 2008 NVIDIA Corporation.
  • 36. Thread Arithmetic
  • 37. Indexing Arrays: Example In this example, the red entry would have an index of 21: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 M = 8 threads/block blockIdx.x = 2 int index = threadIdx.x + blockIdx.x * M; = 5 + 2 * 8; = 21;
  • 38. Addition with Threads and Blocks: blockDim.x is a built-in variable giving the number of threads per block: int index = threadIdx.x + blockIdx.x * blockDim.x; A combined version of our vector addition kernel using both blocks and threads: __global__ void add( int *a, int *b, int *c ) { int index = threadIdx.x + blockIdx.x * blockDim.x; ... }
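Written out in full, the combined kernel and its launch look like this; a minimal sketch (the kernel body and the N/M launch shape follow the indexing scheme on these slides, and assume N is a multiple of the block size M):

```
__global__ void add(const int *a, const int *b, int *c) {
    // Global index: thread offset within the block plus block offset.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// Host side: N total elements, M threads per block, N/M blocks.
// add<<<N / M, M>>>(d_a, d_b, d_c);
```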
  • 39. Control Flow
  • 40. Control Flow Divergence: what happens if you have the following code? if (threadIdx.x % 2) { x = A[i]; } else { x = B[i]; }
  • 41. Control Flow Divergence Branch Path APath B
  • 42. Control Flow Divergence: nested branches are handled as well. if (threadIdx.x % 2) { if (...) x = A[i]; else x = B[i]; } else x = C[i];
  • 43. Control Flow Divergence Branch Branch Path A Path BPath C
  • 44. Control Flow Divergence: you don't have to worry about divergence for correctness (*); you might have to think about it for performance - it depends on your branch conditions.
  • 45. Control Flow Divergence: performance drops off with the degree of divergence. switch (threadIdx.x % 32) { case 0: ... case 1: ... }
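The effect can be seen by comparing a divergent branch with a warp-aligned one; a minimal sketch (the threadIdx.x / 32 condition keeps all 32 threads of a warp on the same path, since 32 is the warp size):

```
__global__ void branchy(float *out, const float *a, const float *b) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Divergent: neighboring threads in a warp take different paths,
    // so the warp executes both sides serially.
    if (threadIdx.x % 2)
        out[i] = a[i];
    else
        out[i] = b[i];
}

__global__ void uniform(float *out, const float *a, const float *b) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Warp-aligned: all 32 threads of a warp take the same path,
    // so there is no divergence penalty.
    if ((threadIdx.x / 32) % 2)
        out[i] = a[i];
    else
        out[i] = b[i];
}
```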
  • 46. Divergence [chart: performance (y-axis) vs. degree of divergence (x-axis, 0-18): performance falls as divergence increases]
  • 47. Occupancy
  • 48. !""#$%&" ()*+%,-.&/0*#"0.1&/-%*+-+2+"#0+,-/+3#+&0.%445-/1-+2+"#0.&6- 10)+*-7%*$/-./-0)+-1&4-7%-01-).,+-4%0+&".+/-%&,-8++$-0)+- )%*,7%*+-9#/ !""#$%&" :-;#<9+*-1=-7%*$/-*#&&.&6-"1&"#**+&04-1&-%- <#40.$*1"+//1*-,.>.,+,-9-<%2.<#<-&#<9+*-1=-7%*$/-0)%0-"%&- *#&-"1&"#**+&04 ?.<.0+,-9-*+/1#*"+-#/%6+@ A+6./0+*/ B)%*+,-<+<1*© NVIDIA Corporation 2010
  • 49. !"#$%&()*+*,-.)/*,&0,$& 1#23"#$%&41#25/"0,(*#$)&&#*& 6#7""5/"0,(*#$)&&#*&879)70")7&0#:)3"#$%0#);)$/0) 1#23"#$%&<1#25/"0,(*#$)&&#*&4= >/"0,(")3"#$%&$7:*/:$#:$/**):0"?,:75/"0,(*#$)&&#* !!"#$%&()*+",-.%))(08)87*-@7*)3/&? 6/3A)$00#*)&#/*$)797,"73,",0? *)B,&0)*&C&87*)-5)5#*? 1#23"#$%&4DEE0#&$7")0#2/0/*)-)9,$)& !"#$%&);)$/0)-,:(,()",:)27&8,#: DEEE3"#$%&()*B*,-@,""&$7")7$*#&&5/"0,(")B):)*70,#:&© NVIDIA Corporation 2010
  • 50. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 51. Kernel Memory Access. Per-thread: registers (on-chip) and local memory (off-chip, uncached). Per-block: shared memory (on-chip, small, fast). Per-device: global memory (off-chip, large, uncached, persistent across kernel launches, kernel I/O).
  • 52. !"#$%&"%! !"#$%&()*%#&+,&#%$-./%#.&0%#&./#%")& 0#+1%..+#&23456&-&.+)%&7"#89"#%:! ;%#+<1=+1>&1?1=%&"11%..! @/+#%&%-/7%#&A5*-/&-/%$%#&+#&A5*-/&,=+"/ !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
  • 53. ()*+,-."/)0! B&.)"==&0+#/-+&+,&$=+*"=&)%)+#?&/7"/&-.& 0#-C"/%&/+&"&./#%")&0#+1%..+#! D,/%&(.%8&".&+C%#,=+9&,#+)&#%$-./%#.! @=+9&/+&"11%..&2.")%&".&$=+*"=&)%)+#?: !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
  • 54. 12+"3-."/)0! B&*=+1>&+,&)%)+#?&/7"/&-.&.7"#%8&*?&"==& ./#%")&0#+1%..+#.&-&"&)(=/-<0#+1%..+#! 3EFG&0%#&*=+1>H&./+#%8&-&3EI3FG&*">.! J%#?&,"./&/+&"11%..&2-K%K&".&,"./&".&#%$-./%#.L:&& 9-/7+(/&!"#$%&#()*&+, !"#$$%&$()*+,-.(/$$01(234-5(63*7,-5(8-,9:+5,;<( =3*<+,.4;(>(?@;;4:A(B3C,;43(/$$0
  • 55. Global Memory: different types of “global memory”: linear memory, texture memory, constant memory.
  • 56. Texture and constant memory are views of global memory: all come from the same physical memory pool; they just differ in access patterns, caching, etc.
  • 57. Global Memory: the large block of memory shared by all multiprocessors on the compute device. Size depends on the device: 256MB to 1.5GB. High bandwidth (~100GB/s). Slow to access - several hundred clock cycles of latency. Not cached.
  • 58. Constant Memory: constants are set by the CPU and read by the GPU. Each SM has an 8kiB cache for constants, optimized for broadcast (accessing different elements forces serialisation). Can speed some calculations and can relieve register pressure.
  • 59. Constant Memory: declared at file scope: __constant__ float dc_myConst; Set via the cudaMemcpyToSymbol API call: float h_val = 3.14f; cudaMemcpyToSymbol(dc_myConst, &h_val, sizeof(float)); Accessed by name in a kernel: __global__ void MyKernel( ... ) { ... float myVal = dc_myConst + 1; ... }
  • 60. Textures: textures are essentially look-up tables. Can only be written by the host. Cached on each multiprocessor (8kiB), optimised for 2D spatial locality. Hardware interpolation possible (limited precision). Can clamp or wrap at boundaries.
  • 61. Textures: declaration and setup are rather involved - see the programming guide. Accessed in kernels via texture fetches: tex1D, tex2D, tex3D, etc. Co-ordinates are at texel centres, so you have to take care when accessing elements.
  • 62. Textures: can improve load coalescing from global memory. If the whole texture fits in the 8kiB cache, it has grid lifetime. Clamping/wrapping can aid edge-case handling. Have to test to determine benefits.
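The texture slides above can be illustrated with the texture-reference API of this CUDA generation; a minimal sketch for linear memory (buffer names and sizes are illustrative):

```
// Legacy texture-reference API (CUDA 3.x era, as in these slides).
texture<float, 1, cudaReadModeElementType> texRef;   // declared at file scope

__global__ void scale(float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(texRef, i);       // cached texture read
}

// Host side: bind a linear device buffer to the texture reference.
// cudaBindTexture(0, texRef, d_in, n * sizeof(float));
// scale<<<(n + 255) / 256, 256>>>(d_out, n);
// cudaUnbindTexture(texRef);
```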
  • 63. General Principles: memory access patterns are crucial - even CPUs are typically memory bound. GPUs have ~100x the FP throughput but only ~10x the memory bandwidth; you have to keep the GPU busy.
  • 64. PC Architecture [diagram: CPU and system memory connected via northbridge/southbridge; PCIe x16 bus at 8 GB/s to the GPU; 160+ GB/s to VRAM; 25+ GB/s to system memory; 3+ Gb/s SATA/Ethernet] modified from Matthew Bolitho
  • 65. PCIe Transfers (first thing to optimize?)
  • 66. [Chart: copying data from/to CPU memory to/from GPU memory, pageable vs. pinned host buffers; *averaged observed bandwidth]
  • 67. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory© NVIDIA Corporation 2010
  • 68. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance© NVIDIA Corporation 2010
  • 69. Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Load GPU program and execute, caching data on chip for performance 3. Copy results from GPU memory to CPU memory© NVIDIA Corporation 2010
  • 70. PCIe Transfers: a PCIe 2.0 x16 bus has a latency of 10 µs (observed) and a bandwidth of 8 GB/s (theory), 5 GB/s (observed). A lot of calculations can happen in these times.
  • 71. PCIe Transfers: transfers occur via DMA: the GPU reads pages directly from CPU memory (very bad if a page gets moved mid-transfer). CUDA maintains internal pinned memory buffers, used for cudaMemcpy calls; data is staged through these.
  • 72. Pinned memory: host memory allocated with cudaMallocHost (freed with cudaFreeHost) is page-locked: it cannot be swapped out, so the GPU can DMA from it directly and transfers run at full speed.
  • 73. Asynchronous copies: cudaMemcpyAsync returns control to the host before the copy has completed; the host buffer involved must be pinned.
  • 74. Mapped (zero-copy) memory: pinned host memory can be mapped into the device address space, so kernels access it directly over the PCIe bus without an explicit copy.
  • 75. Some CUDA operations are expensive; perform them once and amortize the cost: initialization, memory allocations, stream/event creation, interop resources.
  • 76. Streams: cudaStreamCreate, cudaStreamQuery (has the stream's work completed?), cudaStreamSynchronize, cudaStreamDestroy.
  • 77. Asynchronous memory copies take a stream argument: cudaMemcpyAsync(dst, src, size, direction, stream); the copy proceeds in the background while the host continues.
  • 78. Example - overlap kernel and memory copy: create two streams, then issue cudaMemcpyAsync(dst, src, size, dir, stream1) and kernel<<<grid, block, 0, stream2>>>(...); the copy and the kernel are potentially overlapped.
  • 79. GPU work queues: kernel execution and memcpys are placed into in-order queues of execution; a dedicated DMA engine lets GPU transfers over PCIe run concurrently with CUDA kernel execution; multiple streams exist within a single context and share memory and other resources.
  • 80. The stream is specified as part of the kernel launch configuration: kernel<<<grid, block, sharedBytes, stream>>>(args).
  • 81. Scheduling on GPU,12&3456789 ,;<= ,;8<A46 ">?@>6 ">?@>6 Independent Tasks ,-./&( :3!& :3!&+ ,-./&+( !"#$"%&( ,-./&( ,-./&+( ,-./&+) !"#$"%&) !"#$"%&( ,-./&+) !"#$"%&* !"#$"%&) !"#$"%&+( ,-./&) !"#$"%&+( !"#$"%&* ,-./&+* ,-./&+* ,-./&) ,-./&+0 ,-./&+0
  • 82. [Diagram: copy-engine and kernel-engine queues] Within a stream, kernels and copies are executed in the order they were issued.
  • 83. [Diagram: work from different streams interleaved across the copy and kernel engines]
  • 84. Using pinned memory: allocate host buffers with cudaMallocHost instead of malloc and release them with cudaFreeHost; use them as the source/destination of cudaMemcpy(Async) transfers.
  • 85. When is zero-copy useful? Data is transferred over the PCIe bus automatically, so: use when data is only read/written once; use for very small amounts of data (new variables, CPU/GPU communication); use when the compute/memory ratio is very high and occupancy is high, so latency over PCIe is hidden. Coalescing is critically important!
  • 86. PCIe Transfers Optimization PCIe bus is slow Try to minimize transfers Use pinned memory on host whenever possible Try to perform copies asynchronously
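The two optimizations above (pinned host memory, asynchronous copies) can be combined; a minimal sketch:

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h_pinned = NULL, *d_buf = NULL;
    cudaStream_t stream;

    cudaMallocHost((void**)&h_pinned, bytes);  // pinned (page-locked) host memory
    cudaMalloc((void**)&d_buf, bytes);
    cudaStreamCreate(&stream);

    // Asynchronous copy: returns immediately; requires pinned host memory.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // ... launch kernels in other streams here to overlap with the copy ...
    cudaStreamSynchronize(stream);             // wait for the copy to finish

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```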
  • 87. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 88. CUDA-GDB: extended version of GDB with support for C for CUDA. Supported on Linux 32-bit/64-bit systems. Seamlessly debug both the host|CPU and device|GPU code. Set breakpoints on any source line or symbol name. Single step executes one warp at a time, except at a __syncthreads(). Access and print all CUDA memory allocations, local, global, constant and shared vars. Walkthrough example with source code: CUDA-GDB manual. © NVIDIA Corporation 2010
  • 89. Linux GDB Integration with EMACS© NVIDIA Corporation 2010
  • 90. Linux GDBIntegration withDDD© NVIDIA Corporation 2010
  • 91. CUDA-MemCheck: detects/tracks memory errors: out-of-bounds accesses, misaligned accesses (types must be aligned on their size). Integrated into CUDA-GDB. Linux and WinXP; Win7 and Vista support coming. © NVIDIA Corporation 2010
  • 92. CUDA Driver low-level profiling support. 1. Set up environment variables: export CUDA_PROFILE=1; export CUDA_PROFILE_CSV=1; export CUDA_PROFILE_CONFIG=config.txt; export CUDA_PROFILE_LOG=profile.csv. 2. Set up the configuration file (config.txt) listing the counters to collect. 3. Run the application. 4. View the profiler output (profile.csv: method, gputime, cputime, occupancy, counter values). © NVIDIA Corporation 2010
  • 93. CUDA Visual Profiler - Overview Performance analysis tool to fine tune CUDA applications Supported on Linux/Windows/Mac platforms Functionality: Execute a CUDA application and collect profiling data Multiple application runs to collect data for all hardware performance counters Profiling data for all kernels and memory transfers Analyze profiling data© NVIDIA Corporation 2010
  • 94. CUDA Visual Profiler data for kernels© NVIDIA Corporation 2010
  • 95. CUDA Visual Profiler computed data for kernels Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate Global memory read throughput (Gigabytes/second) Global memory write throughput (Gigabytes/second) Overall global memory access throughput (Gigabytes/second) Global memory load efficiency Global memory store efficiency© NVIDIA Corporation 2010
  • 96. CUDA Visual Profiler data for memory transfers Memory transfer type and direction (D=Device, H=Host, A=cuArray) e.g. H to D: Host to Device Synchronous / Asynchronous Memory transfer size, in bytes Stream ID© NVIDIA Corporation 2010
  • 97. CUDA Visual Profiler data analysis views Views: Summary table Kernel table Memcopy table Summary plot GPU Time Height plot GPU Time Width plot Profiler counter plot Profiler table column plot Multi-device plot Multi-stream plot Analyze profiler counters Analyze kernel occupancy© NVIDIA Corporation 2010
  • 98. CUDA Visual Profiler Misc. Multiple sessions Compare views for different sessions Comparison Summary plot Profiler projects save & load Import/Export profiler data (.CSV format)© NVIDIA Corporation 2010
  • 99. Outline• CUDA Language & APIs (overview)• Threading/Execution (cont’d)• Memory/Communication (cont’d)• Tools• Libraries
  • 100. CUBLAS
  • 101. CUBLAS: CUDA-accelerated BLAS (Basic Linear Algebra Subprograms). Create matrix and vector objects in GPU memory space; fill objects with data; call a sequence of CUBLAS functions; retrieve data from the GPU (optionally). [code sample garbled in extraction: a sequence of cublasAlloc / cublasSetVector / cublasS* calls] © NVIDIA Corporation 2009
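A hedged sketch of the call sequence described above, using the legacy CUBLAS (v1) API of this era; the Level-1 SAXPY call stands in for the slide's garbled example:

```
#include <cublas.h>   // legacy CUBLAS v1 API, as used in these slides

int main() {
    const int n = 1024;
    float h_x[1024], h_y[1024];
    float *d_x, *d_y;

    cublasInit();
    cublasAlloc(n, sizeof(float), (void**)&d_x);         // objects in GPU memory
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);   // fill with data
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);                // y = 2*x + y on the GPU
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);   // retrieve result
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}
```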
  • 102. CUBLAS Features Single precision data: Level 1 (vector-vector O(N) ) Level 2 (matrix-vector O(N2) ) 3 Level 3 (matrix-matrix O(N ) ) Complex single precision data: Level 1 CGEMM Double precision data: Level 1: DASUM, DAXPY, DCOPY, DDOT, DNRM2, DROT, DROTM, DSCAL, DSWAP, ISAMAX, IDAMIN Level 2: DGEMV, DGER, DSYR, DTRSV Level 3: ZGEMM, DGEMM, DTRSM, DTRMM, DSYMM, DSYRK, DSYR2K© NVIDIA Corporation 2009
  • 103. CUBLAS Performance: CPU vs GPU CUBLAS: CUDA 2.3, Tesla C1060 MKL 10.0.3: Intel Core2 Extreme, 3.00GHz© NVIDIA Corporation 2009
  • 104. CUBLAS performance: up to 2x average speedup over CUBLAS 3.1; less variation in performance for different dimensions vs. 3.1; average speedup of {S/D/C/Z}GEMM x {NN,NT,TN,TT}. [chart] CUBLAS 3.2 on Fermi (Tesla C2050) vs. MKL 10.2.3 on Quad-Core Intel Core i7 (Nehalem)
  • 105. CULA
  • 106. CULA (LAPACK for heterogeneous systems): GPU-accelerated linear algebra, developed in partnership with NVIDIA. Dense linear algebra; C/C++ & FORTRAN; 150+ routines; MATLAB interface (15+ functions); supercomputer speeds - performance up to 10x speedup.
  • 107. CULA performance - supercomputing speeds: this graph shows the relative speed of many CULA functions when compared between a GPU (Fermi) and an Intel Core i7 860. More at www.culatools.com
  • 108. CUSPARSE
  • 109. Sparse matrix performance, CPU vs. GPU: multiplication of a sparse matrix by multiple vectors. [chart: up to ~35x speedup, “non-transposed” and “transposed” cases; average speedup across S,D,C,Z vs. MKL 10.2] CUSPARSE on Fermi (Tesla C2050) vs. MKL 10.2 on Quad-Core Intel Core i7 (Nehalem)
  • 110. CUFFT
  • 111. CUFFT CUFFT is the CUDA FFT library Computes parallel FFT on an NVIDIA GPU Plan contains information about optimal configuration for a given transform. Plans can be persisted to prevent recalculation. Good fit for CUFFT because different kinds of FFTs require different thread/block/grid configurations.© NVIDIA Corporation 2009
  • 112. CUFFT Features 1D, 2D and 3D transforms of complex and real- valued data Batched execution for doing multiple 1D transforms in parallel 1D transform size up to 8M elements 2D and 3D transform sizes in the range [2,16384] In-place and out-of-place transforms for real and complex data.© NVIDIA Corporation 2009
  • 113. CUFFT example: complex 2D transform. #define NX 256 #define NY 128 cufftHandle plan; cufftComplex *idata, *odata; cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY); cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY); /* Create a 2D FFT plan. */ cufftPlan2d(&plan, NX, NY, CUFFT_C2C); /* Use the CUFFT plan to transform the signal out of place. */ cufftExecC2C(plan, idata, odata, CUFFT_FORWARD); /* Inverse transform the signal in place. */ cufftExecC2C(plan, odata, odata, CUFFT_INVERSE); /* Destroy the CUFFT plan. */ cufftDestroy(plan); cudaFree(idata); cudaFree(odata); © NVIDIA Corporation 2009
  • 114. CUFFT performance: CPU vs. GPU. [chart] CUFFT on Fermi (Tesla C2050); MKL on Quad-Core Intel Core i7 (Nehalem) © NVIDIA Corporation 2009
  • 115. CUFFT 3.2: improved radix-3, -5, -7. [charts: single- and double-precision GFLOPS vs. log2(size), comparing CUFFT 3.2, CUFFT 3.1, and MKL] Radix-5, -7 and mixed-radix improvements not shown. CUFFT 3.2 & 3.1 on Fermi (Tesla C2050); MKL on Quad-Core Intel Core i7 (Nehalem)
  • 116. CUDPP
  • 117. CUDPP: CUDA Data-Parallel Primitives Library, a library of data-parallel algorithm primitives for CUDA. M. Harris (NVIDIA), J. Owens (UCD), S. Sengupta (UCD), et al. Algorithms: cudppScan, cudppSegmentedScan, cudppReduce, cudppSort, cudppRand, cudppSparseMatrixVectorMultiply. Additional algorithms in progress: graphs, more sorting, trees, hashing, autotuning.
  • 118. CUDPP plans: CUDPPConfiguration config; config.algorithm = CUDPP_SCAN; config.op = CUDPP_ADD; config.datatype = CUDPP_FLOAT; config.options = CUDPP_OPTION_FORWARD; CUDPPHandle plan; result = cudppPlan(&plan, config, numElements, 1, 0); cudppScan(plan, d_odata, d_idata, numElements);
  • 119. More?
  • 120. Thrust
  • 121. Objectives!   Programmer productivity !   Rapidly develop complex applications !   Leverage parallel primitives!   Encourage generic programming !   Don’t reinvent the wheel !   E.g. one reduction to rule them all!   High performance !   With minimal programmer effort!   Interoperability !   Integrates with CUDA C/C++ code 3© 2008 NVIDIA Corporation
  • 122. Thrust: a template library for CUDA. Mimics the C++ Standard Template Library (STL). Containers: thrust::host_vector<T>, thrust::device_vector<T>. Algorithms: thrust::sort, thrust::reduce, thrust::inclusive_scan, etc.
  • 123. Thrust example: // generate random numbers on the host thrust::host_vector<int> h_vec(1 << 24); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to the device thrust::device_vector<int> d_vec = h_vec; // sort data on the device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  • 124. (slide text garbled in extraction)
  • 125. (slide text garbled in extraction)
  • 126. More?
  • 127. PyCUDA
  • 128. PyCUDA 3rd party open source, written by Andreas Klöckner Exposes all of CUDA via Python bindings Compiles CUDA on the fly presents CUDA as an interpreted language Integration with numpy Handles memory management, resource allocation CUDA programs are Python strings Metaprogramming modify source code on-the-fly Like a really complex pre-processor http://mathema.tician.de/software/pycuda© NVIDIA Corporation 2009
  • 129. PyCUDA example: import pycuda.driver as cuda import pycuda.autoinit import numpy a = numpy.random.randn(4,4) a = a.astype(numpy.float32) a_gpu = cuda.mem_alloc(a.nbytes) cuda.memcpy_htod(a_gpu, a) mod = cuda.SourceModule(""" __global__ void doublify(float *a) { int idx = threadIdx.x + threadIdx.y*4; a[idx] *= 2; } """) func = mod.get_function("doublify") func(a_gpu, block=(4,4,1)) a_doubled = numpy.empty_like(a) cuda.memcpy_dtoh(a_doubled, a_gpu) print a_doubled print a © NVIDIA Corporation 2009
  • 130. More?
  • 131. CURAND
  • 132. RNG performance: CPU vs. GPU. Generating 100K Sobol samples: [chart, up to ~25x speedup] CURAND 3.2 vs. MKL 10.2, for SP/DP uniform and normal distributions. CURAND 3.2 on Fermi (Tesla C2050); MKL 10.2 on Quad-Core Intel Core i7 (Nehalem)
  • 133. OpenVIDIA
  • 134. OpenVIDIA Open source, supported by NVIDIA Computer Vision Workbench (CVWB) GPU imaging & computer vision Demonstrates most commonly used image processing primitives on CUDA Demos, code & tutorials/information http://openvidia.sourceforge.net
  • 135. and many more...
  • 136. References• CUDA C Programming Guide • CUDA C Best Practices Guide • CUDA Reference Manual • API Reference, PTX ISA 2.2 • CUDA-GDB User Manual • Visual Profiler Manual  • User Guides: CUBLAS, CUFFT, CUSPARSE, CURANDhttp://developer.nvidia.com/object/gpucomputing.html
  • 137. one more thing or two...
  • 138. Life/Code Hacking #2.x Speed {listen,read,writ}ingaccelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 139. Life/Code Hacking #2.1 Speed listeningaccelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 140. Life/Code Hacking #2.1 Speed listening• Step 1: Collect • online videos, tutorials, podcasts, etc. • audiobooks • youtube-dl, get_flash_videos, jDownloader, ffmpeg, mplayer, etc. • etc. accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 141. Life/Code Hacking #2.1 Speed listening• Step 2: Accelerate (time-stretch) • VLC (Playback > Faster) • sox $f{,.1.8X.mp3} tempo 1.8 50 • iPod ? mp3splt -t 5.00 -o small-@n large.mp3 accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 142. Life/Code Hacking #2.1 Speed listening• Step 3: chill or do more ;-) accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 143. Demo
  • 144. (closing slide; text garbled in extraction)