IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Note that some slides were borrowed from NVIDIA.


IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)

  1. 6.963 CUDA@MIT IAP09. Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA. Lecture 07: CUDA Advanced #2. Nicolas Pinto (MIT)
  2. During this course, we’ll try to “adapt” and use existing material for 6.963 ;-)
  3. Today: yey!!
  4. Wanna Play with The Big Guys?
  5. Here are the keys to High-Performance in CUDA
  6. Warning! To optimize or not to optimize
      Hoare said (and Knuth restated): “Premature optimization is the root of all evil.”
      (slide by Johan Seland, Applied Mathematics)
  7. Warning! To optimize or not to optimize
      Hoare said (and Knuth restated): “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”
      ⇓
      3% of the time we really should worry about small efficiencies (every 33rd code line)
      (slide by Johan Seland, Applied Mathematics)
  8. 6.963 CUDA@MIT IAP09: Strategy; Memory Optimizations; Execution Optimizations
  9. 6.963 CUDA@MIT IAP09: CUDA Performance Strategies
  10. Strategy: Optimization goals
      We should strive to reach GPU performance
      We must know the GPU performance
        Vendor specifications
        Synthetic benchmarks
      Choose a performance metric
        Memory bandwidth or GFLOPS?
      Use clock() to measure
      Experiment and profile!
      (slide by Johan Seland, Applied Mathematics)
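A minimal timing sketch in the spirit of the clock() advice above, using CUDA events; the kernel name, launch configuration, and byte count are illustrative placeholders, not from the deck:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<grid, block>>>(d_data);       // hypothetical kernel under test
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    float gbps = bytesMoved / (ms * 1.0e6f); // effective bandwidth metric, GB/s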
  11. Threading: Programming Model
      A kernel is executed as a grid of thread blocks
      A thread block is a batch of threads that can cooperate with each other by:
        Sharing data through shared memory
        Synchronizing their execution
      Threads from different blocks cannot cooperate
      (figure: host kernels launch device grids; Grid 1 holds Blocks (0,0)..(2,1); Block (1,1) holds Threads (0,0)..(4,2))
      (© NVIDIA Corporation 2006)
  12. Memory: Data Movement in a CUDA Program
      Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory
      (© NVIDIA Corporation 2008)
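A sketch of this round trip in code; the kernel, names, and sizes are illustrative assumptions, not part of the slide:

    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n)             // stand-in computation
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void run(float *h_data, int n)
    {
        float *d_data;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void**)&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // Host -> Device
        scale<<<(n + 255) / 256, 256>>>(d_data, n);                 // COMPUTATION
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // Device -> Host
        cudaFree(d_data);
    }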
  13. Perf: Optimize Algorithms for the GPU
      Maximize independent parallelism
      Maximize arithmetic intensity (math/bandwidth)
      Sometimes it's better to recompute than to cache
        GPU spends its transistors on ALUs, not memory
      Do more computation on the GPU to avoid costly data transfers
        Even low-parallelism computations can sometimes be faster than transferring back and forth to host
  14. Perf: Optimize Memory Coherence
      Coalesced vs. non-coalesced = order of magnitude
        Global/Local device memory
      Optimize for spatial locality in cached texture memory
      In shared memory, avoid high-degree bank conflicts
  15. Perf: Take Advantage of Shared Memory
      Hundreds of times faster than global memory
      Threads can cooperate via shared memory
      Use one / a few threads to load / compute data shared by all threads
      Use it to avoid non-coalesced access
        Stage loads and stores in shared memory to re-order non-coalesceable addressing
        Matrix transpose example later
  16. Perf: Use Parallelism Efficiently
      Partition your computation to keep the GPU multiprocessors equally busy
        Many threads, many thread blocks
      Keep resource usage low enough to support multiple active thread blocks per multiprocessor
        Registers, shared memory
  18. 6.963 CUDA@MIT IAP09: Memory Optimizations
  19. Memory: Memory optimizations
      Optimizing memory transfers
      Coalescing global memory accesses
      Using shared memory effectively
  20. Memory: Data Transfers
      Device memory to host memory bandwidth much lower than device memory to device memory bandwidth
        4 GB/s peak (PCIe x16) vs. 80 GB/s peak (Quadro FX 5600)
        8 GB/s for PCIe 2.0
      Minimize transfers
        Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
      Group transfers
        One large transfer much better than many small ones
  21. Memory: Page-Locked Memory Transfers
      cudaMallocHost() allows allocation of page-locked host memory
      Enables highest cudaMemcpy performance
        3.2 GB/s common on PCI-express (x16)
        ~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)
      See the "bandwidthTest" CUDA SDK sample
      Use with caution
        Allocating too much page-locked memory can reduce overall system performance
        Test your systems and apps to learn their limits
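A sketch of the page-locked path described above (error checking omitted; d_data is assumed to be an existing device allocation):

    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);  // page-locked host memory
    // ... fill h_pinned with input data ...
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);  // fastest copy path
    cudaFreeHost(h_pinned);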
  22. gmem: Global Memory Reads/Writes
      Highest latency instructions: 400-600 clock cycles
      Likely to be performance bottleneck
      Optimizations can greatly increase performance
        Coalescing: up to 10x speedup
        Latency hiding: up to 2.5x speedup
  23. gmem: Accessing global memory
      4 cycles to issue a memory fetch, but 400-600 cycles of latency
        The equivalent of 100 MADs
      Likely to be a performance bottleneck
      Order-of-magnitude speedups possible
        Coalesce memory access
        Use shared memory to re-order non-coalesced addressing
      (slide by Johan Seland, Applied Mathematics)
  24. gmem: Coalescing
      A coordinated read by a half-warp (16 threads)
      A contiguous region of global memory:
        64 bytes - each thread reads a word: int, float, ...
        128 bytes - each thread reads a double-word: int2, float2, ...
        256 bytes - each thread reads a quad-word: int4, float4, ...
      Additional restrictions on G80/G90 architecture:
        Starting address for a region must be a multiple of region size
        The kth thread in a half-warp must access the kth element in a block being read
      Exception: not all threads must be participating
        Predicated access, divergence within a half-warp
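Two toy kernels contrasting the access patterns above; the arithmetic is a placeholder:

    // Thread k of each half-warp reads word k of one contiguous,
    // properly aligned region: coalesced.
    __global__ void coalesced(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] + 1.0f;
    }

    // With stride > 1 the half-warp no longer reads one contiguous
    // region: uncoalesced, an order of magnitude slower.
    __global__ void strided(float *out, const float *in, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i] + 1.0f;
    }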
  25. gmem: Coalesced Access: Reading floats
      All threads participate
      Some threads do not participate
      (figure: threads t0-t15 reading consecutive addresses 128-184 in both cases)
  26. gmem: Uncoalesced Access: Reading floats
      Permuted access by threads
      Misaligned starting address (not a multiple of 64)
      (figure: thread-to-address maps for both uncoalesced patterns)
  27. gmem: Coalescing: Timing Results
      Experiment on G80:
        Kernel: read a float, increment, write back
        3M floats (12MB)
        Times averaged over 10K runs
      12K blocks x 256 threads:
        356µs - coalesced
        357µs - coalesced, some threads don't participate
        3,494µs - permuted/misaligned thread access
  28. gmem: Coalescing: Structures of size != 4, 8, 16 bytes
      Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)
      If SoA is not viable:
        Force structure alignment: __align__(X), where X = 4, 8, or 16
        Use SMEM to achieve coalescing
      (figure: Point structure {x y z}; AoS lays out x y z x y z ..., SoA lays out x x x ... y y y ... z z z)
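The three layouts in code (names are illustrative):

    struct Point { float x, y, z; };                  // 12 bytes: breaks coalescing

    struct __align__(16) Point16 { float x, y, z; };  // forced 16-byte alignment

    struct PointsSoA {                                // SoA: per-component arrays,
        float *x, *y, *z;                             // so x[i] reads coalesce naturally
    };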
  29. gmem: Coalescing: Summary
      Coalescing greatly improves throughput
      Critical to memory-bound kernels
      Reading structures of size other than 4, 8, or 16 bytes will break coalescing:
        Prefer Structures of Arrays over AoS
        If SoA is not viable, read/write through SMEM
      Additional resources:
        Aligned Types SDK Sample
  30. smem: Parallel Memory Architecture
      In a parallel machine, many threads access memory
        Therefore, memory is divided into banks
        Essential to achieve high bandwidth
      Each bank can service one address per cycle
        A memory can service as many simultaneous accesses as it has banks
      Multiple simultaneous accesses to a bank result in a bank conflict
        Conflicting accesses are serialized
      (figure: Banks 0-15)
  31. smem: Bank Addressing Examples
      No bank conflicts: linear addressing, stride == 1
      No bank conflicts: random 1:1 permutation
      (figure: Threads 0-15 each mapped to a distinct Bank 0-15)
  32. smem: Bank Addressing Examples
      2-way bank conflicts: linear addressing, stride == 2
      8-way bank conflicts: linear addressing, stride == 8
      (figure: thread-to-bank maps; with stride 8, groups of 8 threads collide on the same bank)
  33. smem: How addresses map to banks on G80
      Bandwidth of each bank is 32 bits per 2 clock cycles
      Successive 32-bit words are assigned to successive banks
      G80 has 16 banks
        So bank = address % 16
        Same as the size of a half-warp
        No bank conflicts between different half-warps, only within a single half-warp
  34. smem: Shared memory bank conflicts
      Shared memory is as fast as registers if there are no bank conflicts
      The fast case:
        If all threads of a half-warp access different banks, there is no bank conflict
        If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
      The slow case:
        Bank conflict: multiple threads in the same half-warp access the same bank
        Must serialize the accesses
        Cost = max # of simultaneous accesses to a single bank
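A small kernel sketching how the access stride selects the fast or slow case (array size and arithmetic are illustrative):

    __global__ void bankDemo(float *out, int stride)
    {
        __shared__ float s[256];
        int tid = threadIdx.x;
        s[tid] = (float)tid;
        __syncthreads();
        // stride == 1: each thread of a half-warp hits its own bank (fast)
        // stride == 2: 2-way conflicts; stride == 16: fully serialized
        out[tid] = s[(tid * stride) % 256];
    }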
  35. Strategy: Use the right kind of memory
      Constant memory:
        Quite small, ≈ 20K
        As fast as register access if all threads in a warp access the same location
      Texture memory:
        Spatially cached
        Optimized for 2D locality
        Neighboring threads should read neighboring addresses
        No need to think about coalescing
      Constraint: these memories can only be updated from the CPU
      (slide by Johan Seland, Applied Mathematics)
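A hedged sketch of the constant-memory case above: every thread of a warp reads the same coeffs[k] address, which broadcasts at register-like speed. The polynomial-evaluation kernel and all names are illustrative:

    __constant__ float coeffs[20];           // small, updated only from the CPU
    // host side: cudaMemcpyToSymbol(coeffs, h_coeffs, (d + 1) * sizeof(float));

    __global__ void poly(float *out, const float *x, int n, int d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = coeffs[d];               // same location for the whole warp
        for (int k = d - 1; k >= 0; --k)
            acc = acc * x[i] + coeffs[k];    // Horner evaluation
        out[i] = acc;
    }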
  36. Strategy: Memory optimizations roundup
      CUDA memory handling is complex
        And I have not covered all topics...
      Using memory correctly can lead to huge speedups
        At least CUDA exposes the memory hierarchy, unlike CPUs
      Get your algorithm up and running first, then optimize
      Use shared memory to let threads cooperate
      Be wary of "data ownership"
        A thread does not have to read/write the data it calculates
      (slide by Johan Seland, Applied Mathematics)
  37. Conflicts, Coalescing, Warps... I hate growing up.
  38. Example: Optimization Example: Matrix Transpose
  39. Example: Matrix Transpose
      SDK Sample ("transpose")
      Illustrates:
        Coalescing
        Avoiding SMEM bank conflicts
        Speedups for even small matrices
      (figure: a 4x4 matrix and its transpose)
  40. Example: Uncoalesced Transpose

      __global__ void transpose_naive(float *odata, float *idata, int width, int height)
      {
          unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
          unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
          if (xIndex < width && yIndex < height) {
              unsigned int index_in  = xIndex + width * yIndex;
              unsigned int index_out = yIndex + height * xIndex;
              odata[index_out] = idata[index_in];
          }
      }
  41. Example: Uncoalesced Transpose
      Reads input from GMEM: stride = 1, coalesced
      Writes output to GMEM: stride = 16, uncoalesced
      (figure: element (i,j) read order vs. transposed write order)
  42. Example: Coalesced Transpose
      Assumption: matrix is partitioned into square tiles
      Threadblock (bx, by):
        Reads the (bx,by) input tile, stores into SMEM
        Writes the SMEM data to the (by,bx) output tile
        Transposes the indexing into SMEM
      Thread (tx,ty):
        Reads element (tx,ty) from input tile
        Writes element (tx,ty) into output tile
      Coalescing is achieved if:
        Block/tile dimensions are multiples of 16
  43. Example: Coalesced Transpose
      Reads from GMEM, writes to SMEM
      Reads from SMEM, writes to GMEM
      (figure: tile elements staged through SMEM so both GMEM accesses stay row-wise)
  44. Example: SMEM Optimization
      Threads read SMEM with stride = 16
        Bank conflicts
      Solution:
        Allocate an "extra" column
        Read stride = 17
        Threads read from consecutive banks
      (figure: threads reading a 16-wide tile with and without the padding column)
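The padding trick in code, assuming BLOCK_DIM = 16 and illustrative indices:

    __shared__ float tile[16][16 + 1];  // extra column: column reads use stride 17
    tile[threadIdx.y][threadIdx.x] = idata[index_in];    // row-wise write, stride 1
    __syncthreads();
    odata[index_out] = tile[threadIdx.x][threadIdx.y];   // column-wise read, now conflict-free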
  46. Example: Coalesced Transpose

      __global__ void transpose(float *odata, float *idata, int width, int height)
      {
          __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];
          unsigned int xBlock = blockDim.x * blockIdx.x;
          unsigned int yBlock = blockDim.y * blockIdx.y;
          unsigned int xIndex = xBlock + threadIdx.x;
          unsigned int yIndex = yBlock + threadIdx.y;
          unsigned int index_out, index_transpose;
          if (xIndex < width && yIndex < height) {
              unsigned int index_in = width * yIndex + xIndex;
              unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
              block[index_block] = idata[index_in];
              index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
              index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
          }
          __syncthreads();
          if (xIndex < width && yIndex < height)
              odata[index_out] = block[index_transpose];
      }
  47. Example: Coalesced transpose: Source code

      __global__ void transpose(float *out, float *in, int width, int height)
      {
          // Allocate shared memory.
          __shared__ float block[BLOCK_DIM*BLOCK_DIM];

          // Set up indexing.
          unsigned int xBlock = blockDim.x * blockIdx.x;
          unsigned int yBlock = blockDim.y * blockIdx.y;
          unsigned int xIndex = xBlock + threadIdx.x;
          unsigned int yIndex = yBlock + threadIdx.y;
          unsigned int index_out, index_transpose;

          // Check that we are within the domain, calculate more indices.
          if (xIndex < width && yIndex < height) {
              unsigned int index_in = width * yIndex + xIndex;
              unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x;
              // Write to shared memory.
              block[index_block] = in[index_in];
              // Calculate output indices.
              index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y;
              index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
          }

          // Synchronize. NB: outside the if-clause.
          __syncthreads();

          // Write to global mem. Different index.
          if (xIndex < width && yIndex < height) {
              out[index_out] = block[index_transpose];
          }
      }

      (slide by Johan Seland, Applied Mathematics)
  55. Example: Transpose timings
      Was it worth the trouble?

        Grid Size     Coalesced   Non-coalesced   Speedup
        128 × 128     0.011 ms    0.022 ms        2.0×
        512 × 512     0.07 ms     0.33 ms         4.5×
        1024 × 1024   0.30 ms     1.92 ms         6.4×
        1024 × 2048   0.79 ms     6.6 ms          8.4×

      For me, this is a clear yes.
      (slide by Johan Seland, Applied Mathematics)
  57. 6.963 CUDA@MIT IAP09: Execution Optimizations
  58. Exec: Know the arithmetic cost of operations
      4 clock cycles: floating point add, multiply, fused multiply-add; integer add, bitwise operations, compare, min, max
      16 clock cycles: reciprocal, reciprocal square root, log(x), 32-bit integer multiplication
      32 clock cycles: sin(x), cos(x) and exp(x)
      36 clock cycles: floating point division (24-bit version in 20 cycles)
      Particularly costly: integer division, modulo
        Remedy: replace with shifting whenever possible
      Double precision (when available) will perform at half the speed
      (slide by Johan Seland, Applied Mathematics)
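The shifting remedy, sketched for power-of-two divisors (valid for non-negative integers):

    int q = i / 16;    // costly integer division ...
    int r = i % 16;    // ... and modulo
    int q2 = i >> 4;   // same results via shift ...
    int r2 = i & 15;   // ... and mask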
  59. Exec: Occupancy
      Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
      Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
      Limited by resource usage:
        Registers
        Shared memory
  60. Exec: Grid/Block Size Heuristics
      # of blocks > # of multiprocessors
        So all multiprocessors have at least one block to execute
      # of blocks / # of multiprocessors > 2
        Multiple blocks can run concurrently in a multiprocessor
        Blocks that aren't waiting at a __syncthreads() keep the hardware busy
        Subject to resource availability: registers, shared memory
      # of blocks > 100 to scale to future devices
        Blocks executed in pipeline fashion
        1000 blocks per grid will scale across multiple generations
  61. Exec: Register Dependency
      Read-after-write register dependency
        Instruction's result can be read ~22 cycles later
      Scenarios (CUDA -> PTX):
        x = y + 5;       ->  add.f32 $f3, $f1, $f2
        z = x + 3;       ->  add.f32 $f5, $f3, $f4
        s_data[0] += 3;  ->  ld.shared.f32 $f3, [$r31+0]
                             add.f32 $f3, $f3, $f4
      To completely hide the latency:
        Run at least 192 threads (6 warps) per multiprocessor
        At least 25% occupancy
        Threads do not have to belong to the same thread block
  62. Exec: Register Pressure
      Hide latency by using more threads per SM
      Limiting factors:
        Number of registers per kernel
          8192 per SM, partitioned among concurrent threads
        Amount of shared memory
          16KB per SM, partitioned among concurrent threadblocks
      Check the .cubin file for # registers / kernel
      Use the -maxrregcount=N flag to nvcc
        N = desired maximum registers / kernel
        At some point "spilling" into LMEM may occur
          Reduces performance: LMEM is slow
        Check the .cubin file for LMEM usage
  63. Exec: Determining resource usage
      Use the -ptxas-options=-v option to nvcc
      Or, compile the kernel code with the -cubin flag to determine register usage
      Open the .cubin file with a text editor and look for the "code" section:

        architecture {sm_10}
        abiversion {0}
        modname {cubin}
        code {
            name = BlackScholesGPU
            lmem = 0      (per-thread local memory)
            smem = 68     (per-thread-block shared memory)
            reg = 20      (per-thread registers)
            bar = 0
            bincode {
                0xa0004205 0x04200780 0x40024c09 0x00200780 ...
            }
        }
  64. Exec: CUDA Occupancy Calculator (screenshot)
  65. Exec: Optimizing threads per block
      Choose threads per block as a multiple of warp size
        Avoid wasting computation on under-populated warps
      More threads per block == better memory latency hiding
      But, more threads per block == fewer registers per thread
        Kernel invocations can fail if too many registers are used
      Heuristics:
        Minimum: 64 threads per block
          Only if multiple concurrent blocks
        192 or 256 threads a better choice
          Usually still enough regs to compile and invoke successfully
        This all depends on your computation, so experiment!
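A launch-configuration sketch following these heuristics; the kernel and data names are hypothetical:

    int threadsPerBlock = 256;  // multiple of the warp size, in the 192-256 sweet spot
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover all n elements
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);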
  66. Exec: Occupancy != Performance
      Increasing occupancy does not necessarily increase performance
      BUT...
      Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
        (It all comes down to arithmetic intensity and available parallelism)
  67. Exec: Parameterize Your Application
      Parameterization helps adaptation to different GPUs
      GPUs vary in many ways:
        # of multiprocessors
        Memory bandwidth
        Shared memory size
        Register file size
        Threads per block
      You can even make apps self-tuning (like FFTW and ATLAS)
        "Experiment" mode discovers and saves optimal configuration
  68. Exec: Loop unrolling
      Sometimes we know some kernel parameters at compile time:
        # of loop iterations
        Degrees of polynomials
        Number of data elements
      If we could "tell" this to the compiler, it can unroll loops and optimize register usage
      We need to be generic
        Avoid code duplication, sizes unknown at compile time
      Templates to the rescue
        The same trick can be used for regular C++ sources
      (slide by Johan Seland, Applied Mathematics)
  69. Exec: Example: de Casteljau algorithm
      A standard algorithm for evaluating polynomials in Bernstein form
      Recursively defined:
        f(x) = b_{0,0}^d
        b_{i,j}^k = x * b_{i+1,j}^{k-1} + (1 - x) * b_{i,j+1}^{k-1}
      where the b_{i,j}^0 are the coefficients
      (figure: triangular scheme blending b_{1,0}^{d-1} and b_{0,1}^{d-1} with weights x and 1-x, down to b_{2,0}^{d-2}, b_{1,1}^{d-2}, b_{0,2}^{d-2})
      (slide by Johan Seland, Applied Mathematics)
  70. Exec: Implementation
      The de Casteljau algorithm is usually implemented as nested for-loops
      Coefficients are overwritten for each iteration

      float deCasteljau(float *c, float x, int d)
      {
          for (uint i = 1; i <= d; ++i) {
              for (uint j = 0; j <= d - i; ++j)
                  c[j] = (1.0f - x) * c[j] + x * c[j + 1];
          }
          return c[0];
      }

      (slide by Johan Seland, Applied Mathematics)
  71. Exec: Template loop unrolling
      We make d a template parameter:

      template<int d>
      float deCasteljau(float *c, float x)
      {
          for (uint i = 1; i <= d; ++i) {
              for (uint j = 0; j <= d - i; ++j)
                  c[j] = (1.0f - x) * c[j] + x * c[j + 1];
          }
          return c[0];
      }

      The kernel is called as:

      switch (d) {
          case 1:    deCasteljau<1><<<dimGrid, dimBlock>>>(c, x);    break;
          case 2:    deCasteljau<2><<<dimGrid, dimBlock>>>(c, x);    break;
          ...
          case MAXD: deCasteljau<MAXD><<<dimGrid, dimBlock>>>(c, x); break;
      }

      (slide by Johan Seland, Applied Mathematics)
  72. Exec: Results
      For the de Casteljau algorithm we see a relatively small speedup
        ≈ 1.2× (20%...)
      Very easy to implement
      Can lead to long compile times
      Conclusion: probably worth it near the end of the development cycle
      (slide by Johan Seland, Applied Mathematics)
  73. Exec: Conclusion
      Understand CUDA performance characteristics
        Memory coalescing
        Divergent branching
        Bank conflicts
        Latency hiding
      Use peak performance metrics to guide optimization
      Understand parallel algorithm complexity theory
      Know how to identify type of bottleneck
        e.g. memory, core computation, or instruction overhead
      Optimize your algorithm, then unroll loops
      Use template parameters to generate optimal code
  74. Profiling: The CUDA Visual Profiler
      Helps measure and find potential performance problems
        GPU and CPU timing for all kernel invocations and memcpys
        Time stamps
      Access to hardware performance counters
  75. Profiling: Signals
      Events are tracked with hardware counters on signals in the chips:
        timestamp
        gld_incoherent / gld_coherent: global memory loads are coalesced (coherent) or non-coalesced (incoherent)
        gst_incoherent / gst_coherent: same for global memory stores
        local_load / local_store: local loads/stores
        branch / divergent_branch: total branches and divergent branches taken by threads
        instructions: instruction count
        warp_serialize: thread warps that serialize on address conflicts to shared or constant memory
        cta_launched: executed thread blocks
  76. Profiling: Interpreting profiler counters
      Values represent events within a thread warp
      Only targets one multiprocessor
        Values will not correspond to the total number of warps launched for a particular kernel
        Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work
      Values are best used to identify relative performance differences between unoptimized and optimized code
        In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
  77. Example: Performance for 4M element reduction

                                                 Time (2^22 ints)  Bandwidth    Step     Cumulative
        Kernel 1: interleaved addressing
          with divergent branching               8.054 ms          2.083 GB/s
        Kernel 2: interleaved addressing
          with bank conflicts                    3.456 ms          4.854 GB/s   2.33x    2.33x
        Kernel 3: sequential addressing          1.722 ms          9.741 GB/s   2.01x    4.68x
        Kernel 4: first add during global load   0.965 ms          17.377 GB/s  1.78x    8.34x
        Kernel 5: unroll last warp               0.536 ms          31.289 GB/s  1.8x     15.01x
        Kernel 6: completely unrolled            0.381 ms          43.996 GB/s  1.41x    21.16x
        Kernel 7: multiple elements per thread   0.268 ms          62.671 GB/s  1.42x    30.04x

      Kernel 7 on 32M elements: 72 GB/s!
  78. Build your own!
  80. Thank you! (slide by David Kirk, © NVIDIA Corporation 2008)
  81. Back Pocket Slides (slide by David Cox)
  83. 6.963 CUDA@MIT IAP09: Misc
  84. Tesla C1060 Computing Processor
        Processor:        1x Tesla T10P
        Core GHz:         1.33 GHz
        Form factor:      Full ATX: 4.736" (H) x 10.5" (L), dual slot wide
        On-board memory:  4 GB
        System I/O:       PCIe x16 gen2
        Memory I/O:       512-bit, 800MHz DDR; 102 GB/s peak bandwidth
        Display outputs:  None
        Typical power:    160 W
      (from M02: High Performance Computing with CUDA)
  85. Tesla S1070 1U System
        Processors:               4 x Tesla T10P
        Core GHz:                 1.5 GHz
        Form factor:              1U for an EIA 19" 4-post rack
        Total 1U system memory:   16 GB (4.0 GB per GPU)
        System I/O:               2 PCIe x16
        Memory I/O per processor: 512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
        Display outputs:          None
        Typical power:            700 W
        Chassis dimensions:       1.73" H x 17.5" W x 28.5" D
      (from M02: High Performance Computing with CUDA)
  86. Double Precision Floating Point

                                      NVIDIA GPU                SSE2                      Cell SPE
        Precision                     IEEE 754                  IEEE 754                  IEEE 754
        Rounding modes for            All 4 IEEE: round to      All 4 IEEE: round to      Round to
        FADD and FMUL                 nearest, zero, inf, -inf  nearest, zero, inf, -inf  zero/truncate only
        Denormal handling             Full speed                Supported, costs          Flush to zero
                                                                1000's of cycles
        NaN support                   Yes                       Yes                       No
        Overflow and Infinity         Yes                       Yes                       No infinity,
        support                                                                           clamps to max norm
        Flags                         No                        Yes                       Some
        FMA                           Yes                       No                        Yes
        Square root                   Software with low-latency Hardware                  Software only
                                      FMA-based convergence
        Division                      Software with low-latency Hardware                  Software only
                                      FMA-based convergence
        Reciprocal estimate accuracy  24 bit                    12 bit                    12 bit
        Reciprocal sqrt estimate
        accuracy                      23 bit                    12 bit                    12 bit
        log2(x) and 2^x estimates
        accuracy                      23 bit                    No                        No
      (from M02: High Performance Computing with CUDA)