IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009


Note that some slides were borrowed from NVIDIA.

Presentation Transcript

  • 6.963 CUDA@MIT IAP09. Supercomputing on your desktop: programming the next generation of cheap and massively parallel hardware using CUDA. Lecture 07: CUDA Advanced #2. Nicolas Pinto (MIT). Friday, January 23, 2009
  • During this course, we'll try to "adapt for 6.963" and use existing material ;-) Friday, January 23, 2009
  • Today yey!! Friday, January 23, 2009
  • Wanna Play with The Big Guys? Friday, January 23, 2009
  • Here are the keys to High-Performance in CUDA Friday, January 23, 2009
  • Warning! To optimize or not to optimize. Hoare said (and Knuth restated): "Premature optimization is the root of all evil." slide by Johan Seland Applied Mathematics 23/53 Friday, January 23, 2009
  • Warning! To optimize or not to optimize. Hoare said (and Knuth restated): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." ⇓ 3% of the time we really should worry about small efficiencies (every 33rd code line). slide by Johan Seland Applied Mathematics 23/53 Friday, January 23, 2009
  • 6.963 CUDA@MIT IAP09. Strategy: Memory Optimizations, Execution Optimizations. Friday, January 23, 2009
  • 6.963 CUDA@MIT IAP09. CUDA Performance Strategies. Friday, January 23, 2009
  • Strategy: Optimization goals. We should strive to reach GPU performance, so we must know the GPU performance: vendor specifications, synthetic benchmarks. Choose a performance metric: memory bandwidth or GFLOPS? Use clock() to measure. Experiment and profile! slide by Johan Seland Applied Mathematics 25/53 Friday, January 23, 2009
  • Threading: Programming Model. A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. Threads from different blocks cannot cooperate. [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid consists of thread blocks, each block of threads.] © NVIDIA Corporation 2006 Friday, January 23, 2009
  • Memory: Data Movement in a CUDA Program. Host Memory → Device Memory → [Shared Memory] → COMPUTATION → [Shared Memory] → Device Memory → Host Memory. © NVIDIA Corporation 2008 Friday, January 23, 2009
  • Perf: Optimize Algorithms for the GPU. Maximize independent parallelism. Maximize arithmetic intensity (math/bandwidth). Sometimes it's better to recompute than to cache: the GPU spends its transistors on ALUs, not memory. Do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to host. 39 Friday, January 23, 2009
  • Perf: Optimize Memory Coherence. Coalesced vs. non-coalesced: an order of magnitude. Global/local device memory. Optimize for spatial locality in cached texture memory. In shared memory, avoid high-degree bank conflicts. 40 Friday, January 23, 2009
  • Perf: Take Advantage of Shared Memory. Hundreds of times faster than global memory. Threads can cooperate via shared memory. Use one / a few threads to load / compute data shared by all threads. Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing (matrix transpose example later). 41 Friday, January 23, 2009
  • Perf: Use Parallelism Efficiently. Partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks. Keep resource usage low enough to support multiple active thread blocks per multiprocessor: registers, shared memory. 42 Friday, January 23, 2009
  • 6.963 CUDA@MIT IAP09. Memory Optimizations. Friday, January 23, 2009
  • Memory: Memory Optimizations. Optimizing memory transfers. Coalescing global memory accesses. Using shared memory effectively. 44 Friday, January 23, 2009
  • Memory: Data Transfers. Device-to-host memory bandwidth is much lower than device-to-device bandwidth: 4 GB/s peak (PCIe x16 1.0) vs. 76 GB/s peak (Quadro FX 5600); 8 GB/s for PCIe 2.0. Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory. Group transfers: one large transfer is much better than many small ones. 45 Friday, January 23, 2009
  • Memory: Page-Locked Memory Transfers. cudaMallocHost() allows allocation of page-locked host memory. Enables highest cudaMemcpy performance: 3.2 GB/s common on PCI Express (x16); ~4 GB/s measured on nForce 680i motherboards (overclocked PCIe). See the bandwidthTest CUDA SDK sample. Use with caution: allocating too much page-locked memory can reduce overall system performance. Test your systems and apps to learn their limits. 46 Friday, January 23, 2009
  • gmem: Global memory reads/writes. Highest-latency instructions: 400-600 clock cycles. Likely to be a performance bottleneck. Optimizations can greatly increase performance: coalescing (up to 10x speedup), latency hiding (up to 2.5x speedup). 47 Friday, January 23, 2009
  • gmem: Accessing global memory. 4 cycles to issue a memory fetch, but 400-600 cycles of latency: the equivalent of 100 MADs. Likely to be a performance bottleneck. Order of magnitude speedups possible: coalesce memory access; use shared memory to re-order non-coalesced addressing. slide by Johan Seland Applied Mathematics 32/53 Friday, January 23, 2009
  • gmem: Coalescing. A coordinated read by a half-warp (16 threads) of a contiguous region of global memory: 64 bytes, each thread reads a word (int, float, ...); 128 bytes, each thread reads a double-word (int2, float2, ...); 256 bytes, each thread reads a quad-word (int4, float4, ...). Additional restrictions on G8X/G9X architecture: the starting address for a region must be a multiple of the region size; the kth thread in a half-warp must access the kth element in a block being read. Exception: not all threads must be participating (predicated access, divergence within a half-warp). 48 Friday, January 23, 2009
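This half-warp coalescing rule for 4-byte words (64-byte-aligned segment, thread k at element k) can be checked with a small host-side model. This is a minimal illustrative sketch, not CUDA device code; the function names and the stride-based helper are made up for this example, and predicated-off threads are not modeled.

```c
#include <stdbool.h>

#define HALF_WARP 16

/* Host-side model (illustrative) of the G8x rule for 4-byte words:
   a half-warp coalesces if every thread k reads element k of a
   64-byte-aligned segment. addr[k] is thread k's byte address. */
bool coalesces(const unsigned long addr[HALF_WARP]) {
    unsigned long base = 0;
    int have_base = 0;
    for (int k = 0; k < HALF_WARP; ++k) {
        if (addr[k] < 4UL * (unsigned long)k) return false;
        unsigned long b = addr[k] - 4UL * (unsigned long)k; /* implied segment base */
        if (!have_base) { base = b; have_base = 1; }
        else if (b != base) return false;   /* thread k is not at element k */
    }
    return base % 64 == 0;                  /* segment must be 64-byte aligned */
}

/* Convenience helper: thread k reads base + k * stride_bytes. */
bool coalesces_stride(unsigned long base, unsigned long stride_bytes) {
    unsigned long a[HALF_WARP];
    for (int k = 0; k < HALF_WARP; ++k) a[k] = base + k * stride_bytes;
    return coalesces(a);
}
```

Under this model, a unit-stride access from an aligned base coalesces, while a misaligned base or a larger stride does not, matching the timing results on the next slides.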
  • gmem: Coalesced Access: Reading Floats. [Figure: threads t0-t15 reading consecutive word addresses from an aligned segment; one case with all threads participating, one with some threads not participating. Both coalesce.] 49 Friday, January 23, 2009
  • gmem: Uncoalesced Access: Reading Floats. [Figure: permuted access by threads, and a misaligned starting address (not a multiple of 64). Both break coalescing.] 50 Friday, January 23, 2009
  • gmem: Coalescing: Timing Results. Experiment on kernel: read a float, increment, write back. 3M floats (12 MB); times averaged over 10K runs. 12K blocks x 256 threads: 356 µs coalesced; 357 µs coalesced, some threads don't participate; 3,494 µs permuted/misaligned thread access. 51 Friday, January 23, 2009
  • gmem: Coalescing: structures of size ≠ 4, 8, or 16 bytes. Use a Structure of Arrays (SoA) instead of an Array of Structures (AoS). If SoA is not viable: force structure alignment with __align(X), where X = 4, 8, or 16; or use SMEM to achieve coalescing. [Figure: Point structure {x y z}; AoS stores x y z x y z ...; SoA stores x x x ... y y y ... z z z.] 58 Friday, January 23, 2009
  • gmem: Coalescing: Summary. Coalescing greatly improves throughput and is critical to memory-bound kernels. Reading structures of size other than 4, 8, or 16 bytes will break coalescing: prefer Structures of Arrays over AoS; if SoA is not viable, read/write through SMEM. Additional resources: Aligned Types SDK Sample. 59 Friday, January 23, 2009
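The AoS-versus-SoA point can be made concrete by comparing the byte stride between the x fields read by consecutive threads. A minimal sketch (the Point type and helper names are illustrative): with a 12-byte struct, consecutive x values sit 12 bytes apart, breaking the 4/8/16-byte rule, while the SoA layout keeps them 4 bytes apart.

```c
#include <stddef.h>

/* Illustrative 3-float point: 12 bytes, i.e. not 4, 8, or 16, so AoS reads
   of the x field by consecutive threads cannot coalesce on G8x. */
typedef struct { float x, y, z; } Point;

/* Byte offset of the i-th x value in an Array of Structures. */
size_t aos_x_offset(size_t i) { return i * sizeof(Point); }

/* Byte offset of the i-th x value in a Structure of Arrays
   (x[], y[], z[] stored as three separate dense arrays). */
size_t soa_x_offset(size_t i) { return i * sizeof(float); }
```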
  • smem: Parallel Memory Architecture. In a parallel machine, many threads access memory; therefore, memory is divided into banks. Essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict; conflicting accesses are serialized. [Figure: banks 0 through 15.] 64 Friday, January 23, 2009
  • smem: Bank Addressing Examples: no bank conflicts. [Figure: linear addressing with stride == 1, and a random 1:1 permutation; in both cases each thread of the half-warp hits a distinct bank.] 65 Friday, January 23, 2009
  • smem: Bank Addressing Examples: conflicts. [Figure: linear addressing with stride == 2 gives 2-way bank conflicts; linear addressing with stride == 8 gives 8-way bank conflicts.] 66 Friday, January 23, 2009
  • smem: How addresses map to banks on G80. Bandwidth of each bank is 32 bits per 2 clock cycles. Successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = address % 16. Same as the size of a half-warp: no bank conflicts between different half-warps, only within a single half-warp. 67 Friday, January 23, 2009
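The bank = address % 16 mapping makes it easy to predict the conflict degree of a strided shared-memory access. A small host-side sketch (the function name is made up): thread t of a half-warp reads word t * stride, and the worst-case number of threads landing on one bank is the serialization factor.

```c
/* G80 shared memory model: 16 banks, successive 32-bit words in successive
   banks, so bank = word_index % 16. Thread t of a half-warp accesses word
   t * stride; return the worst-case number of threads hitting one bank
   (the factor by which the access is serialized). Illustrative host code. */
int conflict_degree(int stride) {
    int count[16] = {0};
    int worst = 1;
    for (int t = 0; t < 16; ++t) {
        int bank = (t * stride) % 16;
        if (++count[bank] > worst) worst = count[bank];
    }
    return worst;
}
```

Note that stride 16 maps every thread to the same bank (a 16-way conflict), while stride 17 is conflict-free; this is exactly why the transpose example later pads its shared-memory tile with one extra column.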
  • smem: Shared memory bank conflicts. Shared memory is as fast as registers if there are no bank conflicts. The fast case: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp read the identical address, there is no bank conflict (broadcast). The slow case: bank conflict, i.e. multiple threads in the same half-warp access the same bank; the accesses must be serialized; cost = max # of simultaneous accesses to a single bank. 68 Friday, January 23, 2009
  • Strategy: Use the right kind of memory. Constant memory: quite small, ≈ 20K; as fast as register access if all threads in a warp access the same location. Texture memory: spatially cached; optimized for 2D locality, so neighboring threads should read neighboring addresses; no need to think about coalescing. Constraint: these memories can only be updated from the CPU. slide by Johan Seland Applied Mathematics 31/53 Friday, January 23, 2009
  • Strategy: Memory optimizations roundup. CUDA memory handling is complex, and I have not covered all topics... Using memory correctly can lead to huge speedups. At least CUDA exposes the memory hierarchy, unlike CPUs. Get your algorithm up and running first, then optimize. Use shared memory to let threads cooperate. Be wary of "data ownership": a thread does not have to read/write the data it calculates. slide by Johan Seland Applied Mathematics 41/53 Friday, January 23, 2009
  • Conflicts, Coalescing, Warps... I hate growing up. Friday, January 23, 2009
  • Example: Optimization example: matrix transpose. Friday, January 23, 2009
  • Example: Matrix Transpose. SDK sample ("transpose"). Illustrates: coalescing; avoiding SMEM bank conflicts. Speedups for even small matrices. [Figure: a 4x4 matrix 1..16 and its transpose.] 70 Friday, January 23, 2009
  • Example: Uncoalesced transpose. __global__ void transpose_naive(float *odata, float *idata, int width, int height) { unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x; unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y; if (xIndex < width && yIndex < height) { unsigned int index_in = xIndex + width * yIndex; unsigned int index_out = yIndex + height * xIndex; odata[index_out] = idata[index_in]; } } 71 Friday, January 23, 2009
  • Example: Uncoalesced transpose. [Figure: reads of the input from GMEM have stride = 1, coalesced; writes of the output to GMEM have stride = 16, uncoalesced.] 72 Friday, January 23, 2009
  • Example: Coalesced Transpose. Assumption: the matrix is partitioned into square tiles. Threadblock (bx, by): reads the (bx, by) input tile and stores it into SMEM; writes the SMEM data to the (by, bx) output tile; transposes the indexing into SMEM. Thread (tx, ty): reads element (tx, ty) from the input tile; writes element (tx, ty) into the output tile. Coalescing is achieved if block/tile dimensions are multiples of 16. 73 Friday, January 23, 2009
  • Example: Coalesced Transpose. [Figure: reads from GMEM, writes to SMEM; then reads from SMEM, writes to GMEM. Both GMEM phases access consecutive addresses and coalesce.] 74 Friday, January 23, 2009
  • Example: SMEM Optimization. Threads read SMEM with stride = 16: bank conflicts. Solution: allocate an "extra" column so the read stride becomes 17; threads then read from consecutive banks. 75 Friday, January 23, 2009
  • Example: Coalesced transpose. __global__ void transpose(float *odata, float *idata, int width, int height) { __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM]; unsigned int xBlock = blockDim.x * blockIdx.x; unsigned int yBlock = blockDim.y * blockIdx.y; unsigned int xIndex = xBlock + threadIdx.x; unsigned int yIndex = yBlock + threadIdx.y; unsigned int index_out, index_transpose; if (xIndex < width && yIndex < height) { unsigned int index_in = width * yIndex + xIndex; unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x; block[index_block] = idata[index_in]; index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y; index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x; } __syncthreads(); if (xIndex < width && yIndex < height) odata[index_out] = block[index_transpose]; } 76 Friday, January 23, 2009
  • Example: Coalesced transpose: Source code. __global__ void transpose( float *out, float *in, int width, int height ) { // Allocate shared memory. __shared__ float block[BLOCK_DIM*BLOCK_DIM]; // Set up indexing. unsigned int xBlock = blockDim.x * blockIdx.x; unsigned int yBlock = blockDim.y * blockIdx.y; unsigned int xIndex = xBlock + threadIdx.x; unsigned int yIndex = yBlock + threadIdx.y; unsigned int index_out, index_transpose; // Check that we are within the domain, calculate more indices. if ( xIndex < width && yIndex < height ) { unsigned int index_in = width * yIndex + xIndex; unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x; // Write to shared memory. block[index_block] = in[index_in]; // Calculate output indices. index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y; index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x; } // Synchronize. NB: outside the if-clause. __syncthreads(); // Write to global memory; note the different index. if ( xIndex < width && yIndex < height ) { out[index_out] = block[index_transpose]; } } slide by Johan Seland Applied Mathematics 39/53 Friday, January 23, 2009
  • Example: Transpose timings. Was it worth the trouble? Grid size / coalesced / non-coalesced / speedup: 128 x 128: 0.011 ms / 0.022 ms / 2.0x; 512 x 512: 0.07 ms / 0.33 ms / 4.5x; 1024 x 1024: 0.30 ms / 1.92 ms / 6.4x; 1024 x 2048: 0.79 ms / 6.6 ms / 8.4x. For me, this is a clear yes. slide by Johan Seland Applied Mathematics 40/53 Friday, January 23, 2009
  • 6.963 CUDA@MIT IAP09. Execution Optimizations. Friday, January 23, 2009
  • Exec: Know the arithmetic cost of operations. 4 clock cycles: floating point add, multiply, fused multiply-add; integer add, bitwise operations, compare, min, max. 16 clock cycles: log(x), 32-bit integer reciprocal, reciprocal square root, multiplication. 32 clock cycles: sin(x), cos(x) and exp(x). 36 clock cycles: floating point division (24-bit version in 20 cycles). Particularly costly: integer division, modulo. Remedy: replace with shifting whenever possible. Double precision (when available) will perform at half the speed. slide by Johan Seland Applied Mathematics 28/53 Friday, January 23, 2009
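The "replace with shifting" remedy applies when the divisor is a known power of two. A hedged host-side sketch (the helper names are made up for this example); the same transformation applies inside a kernel:

```c
/* If the divisor is a known power of two (d = 1 << log2d), unsigned
   division and modulo reduce to a shift and a mask, avoiding the
   costly integer division/modulo instructions. */
unsigned div_pow2(unsigned x, unsigned log2d) { return x >> log2d; }
unsigned mod_pow2(unsigned x, unsigned log2d) { return x & ((1u << log2d) - 1u); }
```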
  • Exec: Occupancy. Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory. 79 Friday, January 23, 2009
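This occupancy ratio can be sketched numerically. The limits below (24 resident warps, 8192 registers, 16 KB of shared memory, and at most 8 blocks per multiprocessor) are assumed G80-era values used for illustration only; real numbers come from the occupancy calculator or the device properties.

```c
/* Toy occupancy model under assumed G80-era per-multiprocessor limits. */
int active_warps(int threads_per_block, int regs_per_thread, int smem_per_block) {
    int warps_per_block = (threads_per_block + 31) / 32;
    int blocks = 24 / warps_per_block;                        /* warp limit */
    int by_regs = 8192 / (regs_per_thread * threads_per_block);
    int by_smem = smem_per_block > 0 ? 16384 / smem_per_block : 8;
    if (by_regs < blocks) blocks = by_regs;
    if (by_smem < blocks) blocks = by_smem;
    if (blocks > 8) blocks = 8;                               /* block limit */
    return blocks * warps_per_block;
}
/* occupancy = active_warps(...) / 24.0 */
```

For example, under these assumed limits, 256 threads per block at 10 registers per thread allows 3 resident blocks (full occupancy), while 16 registers per thread cuts that to 2 blocks.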
  • Exec: Grid/Block Size Heuristics. # of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute. # of blocks / # of multiprocessors > 2: multiple blocks can run concurrently on a multiprocessor; blocks that aren't waiting at a __syncthreads() keep the hardware busy; subject to resource availability (registers, shared memory). # of blocks > 100 to scale to future devices: blocks are executed in pipeline fashion; 1000 blocks per grid will scale across multiple generations. 80 Friday, January 23, 2009
  • Exec: Register Dependency. Read-after-write register dependency: an instruction's result can be read ~22 cycles later. Scenarios: arithmetic (x = y + 5; z = x + 3;) and shared-memory access (s_data[0] += 3;). To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor, i.e. at least 25% occupancy. Threads do not have to belong to the same thread block. 81 Friday, January 23, 2009
  • Exec: Register Pressure. Hide latency by using more threads per SM. Limiting factors: number of registers per kernel (8192 per SM, partitioned among concurrent threads); amount of shared memory (16 KB per SM, partitioned among concurrent thread blocks). Check the .cubin file for registers per kernel. Use the -maxrregcount=N flag to NVCC (N = desired maximum registers per kernel). At some point "spilling" into LMEM may occur, which reduces performance (LMEM is slow); check the .cubin file for LMEM usage. 82 Friday, January 23, 2009
  • Exec: Determining resource usage. Compile the kernel code with the -cubin flag to determine register usage. Open the .cubin file with a text editor and look for the "code" section: architecture {sm_10} abiversion {0} modname {cubin} code { name = BlackScholesGPU lmem = 0 (per-thread local memory) smem = 68 (per-thread-block shared memory) reg = 20 (per-thread registers) bar = 0 bincode { 0xa0004205 0x04200780 0x40024c09 0x00200780 ... } 83 Friday, January 23, 2009
  • Exec: CUDA Occupancy Calculator. 84 Friday, January 23, 2009
  • Exec: Optimizing threads per block. Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding. But more threads per block == fewer registers per thread, and kernel invocations can fail if too many registers are used. Heuristics: minimum 64 threads per block, and only if there are multiple concurrent blocks; 192 or 256 threads is a better choice, usually still enough registers to compile and invoke successfully. This all depends on your computation, so experiment! 85 Friday, January 23, 2009
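The "multiple of warp size" heuristic can be expressed as a tiny helper (illustrative host code, name made up): round the requested block size up to a multiple of 32 and enforce the 64-thread minimum from the heuristics above.

```c
/* Round a requested block size up to a multiple of the 32-thread warp,
   so no warp is under-populated, and enforce the 64-thread minimum. */
int heuristic_block_size(int threads) {
    int t = ((threads + 31) / 32) * 32;  /* next multiple of warp size */
    return t < 64 ? 64 : t;
}
```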
  • Exec: Occupancy != Performance. Increasing occupancy does not necessarily increase performance. BUT... low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels. (It all comes down to arithmetic intensity and available parallelism.) 86 Friday, January 23, 2009
  • Exec: Parameterize Your Application. Parameterization helps adaptation to different GPUs. GPUs vary in many ways: # of multiprocessors, memory bandwidth, shared memory size, register file size, threads per block. You can even make apps self-tuning (like FFTW and ATLAS): an "experiment" mode discovers and saves the optimal configuration. 87 Friday, January 23, 2009
  • Exec: Loop unrolling. Sometimes we know some kernel parameters at compile time: # of loop iterations, degrees of polynomials, number of data elements. If we could "tell" this to the compiler, it could unroll loops and optimize register usage. We need to be generic: avoid code duplication; sizes unknown at compile time. Templates to the rescue. The same trick can be used for regular C++ sources. slide by Johan Seland Applied Mathematics 43/53 Friday, January 23, 2009
  • Exec: Example: de Casteljau algorithm. A standard algorithm for evaluating polynomials in Bernstein form. Recursively defined: f(x) = b_00^d, with b_ij^k = x * b_(i+1)j^(k-1) + (1 - x) * b_i(j+1)^(k-1), where the b_ij^0 are the coefficients. [Figure: triangular evaluation scheme with edge weights x and 1 - x.] slide by Johan Seland Applied Mathematics 44/53 Friday, January 23, 2009
  • Exec: Implementation. The de Casteljau algorithm is usually implemented as nested for-loops; coefficients are overwritten for each iteration. float deCasteljau( float* c, float x, int d ) { for ( uint i = 1; i <= d; ++i ) { for ( uint j = 0; j <= d - i; ++j ) c[j] = (1.0f - x) * c[j] + x * c[j+1]; } return c[0]; } slide by Johan Seland Applied Mathematics 45/53 Friday, January 23, 2009
  • Exec: Template loop unrolling. We make d a template parameter: template<int d> float deCasteljau( float* c, float x ) { for ( uint i = 1; i <= d; ++i ) { for ( uint j = 0; j <= d - i; ++j ) c[j] = (1.0f - x) * c[j] + x * c[j+1]; } return c[0]; } The kernel is called as: switch ( d ) { case 1: deCasteljau<1><<<dimGrid, dimBlock>>>( c, x ); break; case 2: deCasteljau<2><<<dimGrid, dimBlock>>>( c, x ); break; ... case MAXD: deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break; } slide by Johan Seland Applied Mathematics 46/53 Friday, January 23, 2009
  • Exec — Results. For the de Casteljau algorithm we see a relatively small speedup, ≈ 1.2× (20%...). It is very easy to implement, but can lead to long compile times. Conclusion: probably worth it near the end of the development cycle. slide by Johan Seland — Applied Mathematics 47/53 Friday, January 23, 2009
  • Exec — Conclusion. Understand CUDA performance characteristics: memory coalescing, divergent branching, bank conflicts, latency hiding. Use peak performance metrics to guide optimization. Understand parallel algorithm complexity theory. Know how to identify the type of bottleneck, e.g. memory, core computation, or instruction overhead. Optimize your algorithm, then unroll loops. Use template parameters to generate optimal code. 88 Friday, January 23, 2009
  • Profiling — The CUDA Visual Profiler. Helps measure and find potential performance problems: GPU and CPU timing for all kernel invocations and memcpys, time stamps, and access to hardware performance counters. 61 Friday, January 23, 2009
  • Profiling — Signals. Events are tracked with hardware counters on signals in the chip:
      timestamp
      gld_incoherent, gld_coherent, gst_incoherent, gst_coherent — global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)
      local_load, local_store — local loads/stores
      branch, divergent_branch — total branches and divergent branches taken by threads
      instructions — instruction count
      warp_serialize — thread warps that serialize on address conflicts to shared or constant memory
      cta_launched — executed thread blocks
    62 Friday, January 23, 2009
  • Profiling — Interpreting profiler counters. Values represent events within a thread warp, and only one multiprocessor is targeted, so the values will not correspond to the total number of warps launched for a particular kernel. Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. Values are best used to identify relative performance differences between unoptimized and optimized code. In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize. 63 Friday, January 23, 2009
  • Example — Performance for 4M element reduction.

        Kernel                                      Time (2^22 ints)   Bandwidth     Step    Cumulative
        1: interleaved addressing
           with divergent branching                 8.054 ms            2.083 GB/s
        2: interleaved addressing
           with bank conflicts                      3.456 ms            4.854 GB/s   2.33x     2.33x
        3: sequential addressing                    1.722 ms            9.741 GB/s   2.01x     4.68x
        4: first add during global load             0.965 ms           17.377 GB/s   1.78x     8.34x
        5: unroll last warp                         0.536 ms           31.289 GB/s   1.8x     15.01x
        6: completely unrolled                      0.381 ms           43.996 GB/s   1.41x    21.16x
        7: multiple elements per thread             0.268 ms           62.671 GB/s   1.42x    30.04x

    Kernel 7 on 32M elements: 72 GB/s! 84 Friday, January 23, 2009
  • Build your own! Friday, January 23, 2009
  • Friday, January 23, 2009
  • Thank you! slide by David Kirk — © 2008 NVIDIA Corporation Friday, January 23, 2009
  • Back Pocket Slides slide by David Cox Friday, January 23, 2009
  • Friday, January 23, 2009
  • 6.963 CUDA@MIT IAP09 — Misc Friday, January 23, 2009
  • Tesla C1060 Computing Processor
        Processor:         1x Tesla T10P
        Core GHz:          1.33 GHz
        Form factor:       Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
        On-board memory:   4 GB
        System I/O:        PCIe x16 gen2
        Memory I/O:        512-bit, 800MHz DDR; 102 GB/s peak bandwidth
        Display outputs:   None
        Typical power:     160 W
    19 M02: High Performance Computing with CUDA Friday, January 23, 2009
  • Tesla S1070 1U System
        Processors:               4 x Tesla T10P
        Core GHz:                 1.5 GHz
        Form factor:              1U for an EIA 19" 4-post rack
        Total 1U system memory:   16 GB (4.0 GB per GPU)
        System I/O:               2 PCIe x16
        Memory I/O per processor: 512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
        Display outputs:          None
        Typical power:            700 W
        Chassis dimensions:       1.73" H x 17.5" W x 28.5" D
    20 M02: High Performance Computing with CUDA Friday, January 23, 2009
  • Double Precision Floating Point

                                        NVIDIA GPU                   SSE2                         Cell SPE
        Precision                       IEEE 754                     IEEE 754                     IEEE 754
        Rounding modes for              All 4 IEEE: round to         All 4 IEEE: round to         Round to zero/truncate
          FADD and FMUL                 nearest, zero, inf, -inf     nearest, zero, inf, -inf     only
        Denormal handling               Full speed                   Supported, costs 1000's      Flush to zero
                                                                     of cycles
        NaN support                     Yes                          Yes                          No
        Overflow and Infinity support   Yes                          Yes                          No infinity, clamps
                                                                                                  to max norm
        Flags                           No                           Yes                          Some
        FMA                             Yes                          No                           Yes
        Square root                     Software with low-latency    Hardware                     Software only
                                        FMA-based convergence
        Division                        Software with low-latency    Hardware                     Software only
                                        FMA-based convergence
        Reciprocal estimate accuracy    24 bit                       12 bit                       12 bit
        Reciprocal sqrt estimate acc.   23 bit                       12 bit                       12 bit
        log2(x) and 2^x estimates acc.  23 bit                       No                           No
    18 M02: High Performance Computing with CUDA Friday, January 23, 2009