IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)
More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009

Note that some slides were borrowed from NVIDIA.

  1. 1. CUDA Tricks and Computational Physics Kipton Barros Boston University In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz
  2. 2. High energy physics huge computational needs Large Hadron Collider, CERN 27 km
  3. 3. A request: Please question/comment freely during the talk A disclaimer: I’m not a high energy physicist
  4. 4. View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN)
  5. 5. 15 Petabytes to be processed annually. View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
  6. 6. The “Standard Model” of Particle Physics
  7. 7. I’ll discuss Quantum ChromoDynamics. Although it’s “standard”, these equations are hard to solve. Big questions: Why do quarks appear in groups? What was the physics during the big bang?
  8. 8. Quantum ChromoDynamics The theory of nuclear interactions (bound by “gluons”) Extremely difficult: Must work at the level of fields, not particles Calculation is quantum mechanical
  9. 9. Lattice QCD: Solving Quantum Chromodynamics by Computer Discretize space and time (place the quarks and gluons on a 4D lattice)
  10. 10. Spacetime = 3+1 dimensions. 32^4 ∼ 10^6 lattice sites. Quarks live on sites (24 floats each). Gluons live on links (18 floats each). Total system size: 4 bytes/float × 32^4 lattice sites × (24 + 4 × 18) floats/site ∼ 384 MB
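Writing out the arithmetic behind that estimate, using exactly the numbers on the slide:

    \[
    \underbrace{4\,\mathrm{B}}_{\text{per float}} \times \underbrace{32^4}_{\text{lattice sites}} \times \underbrace{(24 + 4 \times 18)}_{\text{floats per site}}
    = 4 \times 1\,048\,576 \times 96\ \mathrm{B} \approx 384\,\mathrm{MB}
    \]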
  11. 11. Lattice QCD: Inner loop requires repeatedly solving linear equation quarks gluons DW is a sparse matrix with only nearest neighbor couplings DW needs to be fast!
  12. 12. DW Operation of 1 output quark site (24 floats)
  13. 13. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats)
  14. 14. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats)
  15. 15. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats) 1.4 kB of local storage required per quark update?
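The 1.4 kB figure follows from the counts on slides 12-15: eight neighbor quark sites, eight gluon links, and one output site, all in single precision:

    \[
    (8 \times 24 + 8 \times 18 + 24)\ \text{floats} \times 4\,\mathrm{B} = 360 \times 4\,\mathrm{B} = 1440\,\mathrm{B} \approx 1.4\,\mathrm{kB}
    \]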
  16. 16. CUDA Parallelization: Must process many quark updates simultaneously. Odd/even sites processed separately
  17. 17. Threading / Programming Model (NVIDIA slide): A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by: sharing data through shared memory; synchronizing their execution. Threads from different blocks cannot cooperate. [Diagram: Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the Device; each grid is made of blocks, each block of threads.] © NVIDIA Corporation 2006
  18. 18. DW parallelization: Each thread processes 1 site. No communication required between threads! All threads in warp execute same code
  19. 19. Step 1: Read neighbor site
  20. 20. Step 1: Read neighbor site Step 2: Read neighbor link
  21. 21. Step 1: Read neighbor site. Step 2: Read neighbor link. Step 3: Accumulate into the output site
  22. 22. Step 1: Read neighbor site. Step 2: Read neighbor link. Step 3: Accumulate into the output site. Step 4: Read neighbor site
  23. 23. Step 1: Read neighbor site. Step 2: Read neighbor link. Step 3: Accumulate into the output site. Step 4: Read neighbor site. Step 5: Read neighbor link
  24. 24. Step 1: Read neighbor site. Step 2: Read neighbor link. Step 3: Accumulate into the output site. Step 4: Read neighbor site. Step 5: Read neighbor link. Step 6: Accumulate into the output site
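To make the thread layout concrete, here is a minimal CUDA sketch in the spirit of the steps above, not the talk's actual D_W kernel: the real code works on 24-float quark spinors and 18-float SU(3) links, while the toy types and names below (site_t, link_t, apply_neighbor, nbr_idx) are invented for illustration.

    #include <cuda_runtime.h>

    // Toy stand-ins for the real data: a full quark site is 24 floats and a
    // gluon link is 18 floats; they are shrunk to 4 here so the sketch stays short.
    struct site_t { float v[4]; };   // hypothetical reduced "quark site"
    struct link_t { float u[4]; };   // hypothetical reduced "gluon link"

    // Hypothetical neighbor contribution: accumulate link * neighbor site.
    __device__ void apply_neighbor(site_t &out, const link_t &U, const site_t &in)
    {
        for (int i = 0; i < 4; ++i)
            out.v[i] += U.u[i] * in.v[i];
    }

    // One thread per output site (even/odd sublattices handled separately, as on
    // slide 16). nbr_idx[8*site + d] gives the index of the d-th neighbor.
    __global__ void dw_like_stencil(site_t *out, const site_t *in,
                                    const link_t *links, const int *nbr_idx,
                                    int n_sites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= n_sites) return;

        site_t acc = {{0.f, 0.f, 0.f, 0.f}};        // accumulator lives in registers
        for (int d = 0; d < 8; ++d) {               // 2x4 neighbors in 4D
            site_t nbr = in[nbr_idx[8 * site + d]]; // read neighbor site
            link_t U   = links[8 * site + d];       // read neighbor link
            apply_neighbor(acc, U, nbr);            // accumulate into the output
        }
        out[site] = acc;                            // 1 output quark site per thread
    }

Each thread owns exactly one output site and loops over its 2x4 neighbors, so no inter-thread communication is needed, exactly as slide 18 notes.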
  25. 25. Occupancy (NVIDIA slide): Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory
  26. 26. xec E !quot;#$%$&$'()#*+,-./)quot;,+)01234 5*22/,)#*+,-./)quot;,+)01234)-/)-)%61#$quot;1,)27)8-+quot;)/$&, 9:2$.)8-/#$'()32%quot;6#-#$2')2')6'.,+;quot;2quot;61-#,.)8-+quot;/ <2+,)#*+,-./)quot;,+)01234)==)0,##,+)%,%2+>)1-#,'3>) *$.$'( ?6#@)%2+,)#*+,-./)quot;,+)01234)==)7,8,+)+,($/#,+/)quot;,+) #*+,-. A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,. B,6+$/#$3/ <$'$%6%C)DE)#*+,-./)quot;,+)01234 !'1>)$7)%61#$quot;1,)32'36++,'#)01234/) FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3, J/6-11>)/#$11),'26(*)+,(/)#2)32%quot;$1,)-'.)$':24,)/633,//7611> K*$/)-11).,quot;,'./)2')>26+)32%quot;6#-#$2'@)/2),Lquot;+$%,'#M 85 Friday, January 23, 2009
  27. 27. Reminder -- each multiprocessor has: 16 kb shared memory, 16 k registers, 1024 active threads (max). High occupancy (roughly 25% or so) needed for maximum performance
  28. 28. DW : does it fit onto the GPU? Each thread requires 1.4 kb → 0.2 kb of fast local memory (24 → 12 floats, 18 floats, 24 floats)
  29. 29. DW : does it fit onto the GPU? Each thread requires 1.4 kb → 0.2 kb of fast local memory. MP has 16 kb shared mem. Threads/MP = 16 / 0.2 = 80
  30. 30. DW : does it fit onto the GPU? Each thread requires 1.4 kb → 0.2 kb of fast local memory. MP has 16 kb shared mem. Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only)
  31. 31. DW : does it fit onto the GPU? Each thread requires 1.4 kb → 0.2 kb of fast local memory. MP has 16 kb shared mem. Threads/MP = 16 / 0.2 = 80 → 64 (multiple of 64 only). MP occupancy = 64/1024 = 6%
  32. 32. 6% occupancy sounds pretty bad! Andreas Kuehn / Getty
  33. 33. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kb shared memory 16 k registers 1024 active threads (max) Each thread requires 0.2 kb of fast local memory
  34. 34. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kb shared memory; 16 k registers = 64 kb memory → occupancy > 25% becomes possible; 1024 active threads. Each thread requires 0.2 kb of fast local memory
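A back-of-the-envelope reading of that slide (my arithmetic, not shown explicitly): the register file is 16 k registers × 4 bytes = 64 kb per multiprocessor, four times the shared memory, so

    \[
    \frac{64\ \text{kb}}{0.2\ \text{kb per thread}} \approx 320\ \text{threads per MP}
    \qquad\Rightarrow\qquad
    \frac{320}{1024} \approx 31\%\ \text{occupancy}
    \]

In practice the kernel also needs working registers beyond the 0.2 kb of data, which fits the ≥ 19-25% occupancy bins reported on slide 39.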
  35. 35. Registers as data (possible because no inter-thread communication). Instead of shared memory, registers are allocated as the per-thread working storage (see the sketch after the code-sample slide below)
  36. 36. Registers as data Can’t be indexed. All loops must be EXPLICITLY expanded
  37. 37. Code sample (approx. 1000 LOC automatically generated)
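The code sample on that slide is an image and is not reproduced here, so the following is only a hedged illustration of "registers as data" with loops explicitly expanded, shrunk from 24 components to 4 so it stays readable; every name is invented. In the real kernel this expansion is what produces the roughly 1000 automatically generated lines.

    // Shared-memory style: an indexable array, but it consumes shared/local memory.
    __device__ void accumulate_shared(float *acc, const float *nbr, float u)
    {
        for (int i = 0; i < 4; ++i)   // indexed loop is fine on an array
            acc[i] += u * nbr[i];
    }

    // Registers-as-data style: one named variable per component, loop expanded
    // EXPLICITLY, so the compiler can keep every value in a register.
    __device__ void accumulate_regs(float &a0, float &a1, float &a2, float &a3,
                                    float n0, float n1, float n2, float n3,
                                    float u)
    {
        a0 += u * n0;
        a1 += u * n1;
        a2 += u * n2;
        a3 += u * n3;
    }

Since registers cannot be indexed, a small code generator (rather than hand typing) is the natural way to emit the fully unrolled version.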
  38. 38. Performance Results: 44 Gigabytes/sec (Tesla C870), 82 Gigabytes/sec (GTX 280) (90 Gflops/s) (completely bandwidth limited). For comparison: twice as fast as Cell impl. (arXiv:0804.3654), 20 times faster than CPU implementations
  39. 39. GB/s vs Occupancy. [Bar charts: Tesla C870, GB/s axis 0-45, occupancy bins ≥ 25%, 17%, 8%, 0%; GTX 280, GB/s axis 0-85, occupancy bins ≥ 19%, 13%, 6%, 0%.] Surprise! Very robust to low occupancy
  40. 40. Device memory is the bottleneck. Coalesced memory accesses crucial. Data reordering: instead of storing each quark contiguously (q1_1, q1_2, ... q1_24)(q2_1, q2_2, ... q2_24)(q3_1, ...), store the first component of every quark, then the second, and so on (q1_1 q2_1 q3_1 ... q1_2 q2_2 q3_2 ...), so that thread 0, thread 1, thread 2, ... read consecutive addresses
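In code, the reordering sketched on that slide is the usual array-of-structures to structure-of-arrays change. A minimal sketch under that reading (kernel and variable names are mine):

    // Array-of-structures: quark t stored as 24 consecutive floats.
    // Thread t reading component i touches address 24*t + i  -> strided, uncoalesced.
    __global__ void read_aos(float *out, const float *quarks_aos, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        float sum = 0.f;
        for (int i = 0; i < 24; ++i)
            sum += quarks_aos[24 * t + i];
        out[t] = sum;
    }

    // Structure-of-arrays: component i of all quarks stored contiguously.
    // Thread t reading component i touches address i*n + t  -> consecutive threads
    // hit consecutive addresses, so every read is coalesced.
    __global__ void read_soa(float *out, const float *quarks_soa, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        float sum = 0.f;
        for (int i = 0; i < 24; ++i)
            sum += quarks_soa[i * n + t];
        out[t] = sum;
    }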
  41. 41. Memory coalescing: store even/odd lattices separately
  42. 42. When memory access isn’t perfectly coalesced, sometimes float4 arrays can hide latency. The float4 global memory read corresponds to a single CUDA instruction, so in case of a coalesce miss at least 4x the data is transferred per instruction. [Diagram: thread 0, thread 1, thread 2 each reading a float4]
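A hedged sketch of the float4 idea, assuming the same structure-of-arrays layout as above; packing a site into six float4 values is illustrative, not the talk's exact layout:

    // Six float4 loads per site instead of 24 float loads. Each float4 load is a
    // single 128-bit instruction, so even an imperfectly coalesced access moves a
    // useful amount of data per memory transaction.
    __global__ void read_float4(float *out, const float4 *quarks, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        float sum = 0.f;
        for (int i = 0; i < 6; ++i) {          // 6 * 4 = 24 floats per site
            float4 q = quarks[i * n + t];      // structure-of-arrays of float4s
            sum += q.x + q.y + q.z + q.w;
        }
        out[t] = sum;
    }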
  43. 43. When memory access isn’t perfectly coalesced, binding to textures can help. The texture fetch also corresponds to a single CUDA instruction. This makes use of the texture cache and can reduce the penalty for nearly coalesced accesses
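A sketch of the texture variant using the era-appropriate (now deprecated) texture-reference API; the names are mine, but tex1Dfetch on a texture bound to a linear array is the mechanism the slide refers to:

    // Texture reference bound to a linear array of float4s; reads go through the
    // texture cache, softening the penalty for nearly coalesced access patterns.
    texture<float4, 1, cudaReadModeElementType> quark_tex;

    __global__ void read_via_texture(float *out, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        float sum = 0.f;
        for (int i = 0; i < 6; ++i) {
            float4 q = tex1Dfetch(quark_tex, i * n + t);  // cached fetch, one instruction
            sum += q.x + q.y + q.z + q.w;
        }
        out[t] = sum;
    }

    // Host side (error checking omitted):
    //   cudaBindTexture(0, quark_tex, d_quarks, 6 * n * sizeof(float4));
    //   read_via_texture<<<grid, block>>>(d_out, n);
    //   cudaUnbindTexture(quark_tex);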
  44. 44. Regarding textures, there are two kinds of memory. Linear array: can be modified in kernel; can only be bound to 1D texture. “CUDA array”: can’t be modified in kernel; gets reordered for 2D, 3D locality; allows various hardware features
  45. 45. When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality. (Wikipedia image)
  46. 46. Warnings: The effectiveness of float4 and textures depends on the CUDA hardware and driver (!) Certain “magic” access patterns are many times faster than others. Testing appears to be necessary
  47. 47. Memory bandwidth test Simple kernel Memory access completely coalesced Should be optimal
  48. 48. Memory bandwidth test Simple kernel Memory access completely coalesced Bandwidth: 54 Gigabytes / sec (GTX 280, 140 GB/s theoretical!)
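The "simple kernel" is not shown in the transcript; a bandwidth test of this kind is usually just a coalesced device-to-device copy, something like the following reconstruction:

    // Straight copy, perfectly coalesced: thread t copies element t.
    // Effective bandwidth is 2 * n * sizeof(float) / elapsed_time (one read + one write),
    // typically timed with cudaEvent timers around the launch.
    __global__ void copy_kernel(float *dst, const float *src, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            dst[t] = src[t];
    }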
  49. 49. So why are NVIDIA samples so fast? NVIDIA actually uses a different access pattern (next two slides): 54 Gigabytes/sec → 102 Gigabytes/sec (GTX 280, 140 GB/s theoretical)
  50. 50. Naive access pattern Step 1 ... ... Block 1 Block 2 Step 2 ... ... Block 1 Block 2
  51. 51. Modified access pattern (much more efficient) Step 1 ... ... Block 1 Block 2 ... Step 2 ... Block 1 Block 2
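One plausible reading of the two diagrams is the standard grid-stride loop: in the naive version each block walks through its own large contiguous chunk on its own schedule, while in the modified version all blocks advance together, so at any instant the active blocks touch one dense region of memory. A sketch under that assumption:

    // Naive: block b owns a contiguous chunk and steps through it by itself.
    // Simultaneously running blocks end up touching widely separated regions.
    __global__ void copy_chunked(float *dst, const float *src, int n, int per_block)
    {
        int start = blockIdx.x * per_block;
        for (int i = threadIdx.x; i < per_block && start + i < n; i += blockDim.x)
            dst[start + i] = src[start + i];
    }

    // Modified: grid-stride loop. At step s every block reads the region starting
    // at s * gridDim.x * blockDim.x, so concurrently active blocks access
    // neighboring memory and the traffic stays dense.
    __global__ void copy_gridstride(float *dst, const float *src, int n)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            dst[i] = src[i];
    }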
  52. 52. CUDA Compiler (LOTS of optimization here): CUDA C code → PTX code → CUDA machine code. Use the unofficial CUDA disassembler to view the CUDA machine code (CUDA disassembly)
  53. 53. CUDA Disassembler (decuda): start from foo.cu, compile and save the cubin file, then disassemble
  54. 54. Look how CUDA implements integer division!
  55. 55. CUDA provides fast (but imperfect) trigonometry in hardware!
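For reference, the fast-but-imperfect hardware trigonometry is exposed through intrinsics such as __sinf/__cosf (and enabled wholesale by nvcc --use_fast_math); a tiny kernel contrasting them with the accurate library calls:

    // sinf() is an accurate software implementation; __sinf() maps to the special
    // function unit: much faster, but with reduced precision and a limited
    // argument range.
    __global__ void trig_compare(float *accurate, float *fast, const float *x, int n)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        accurate[t] = sinf(x[t]);     // library call: slower, full precision
        fast[t]     = __sinf(x[t]);   // hardware intrinsic: fast, approximate
    }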
  56. 56. The compiler is very aggressive in optimization. It will group memory loads together to minimize latency (snippet from LQCD) Notice: each thread reads 20 floats!