IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)

    Presentation Transcript

    1. CUDA Tricks and Computational Physics. Kipton Barros, Boston University. In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, J. Ellowitz
    2. High energy physics: huge computational needs. Large Hadron Collider, CERN (27 km)
    3. A request: Please question/comment freely during the talk. A disclaimer: I’m not a high energy physicist
    4. View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN)
    5. 15 Petabytes to be processed annually. View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
    6. The “Standard Model” of Particle Physics
    7. I’ll discuss Quantum ChromoDynamics. Although it’s “standard”, these equations are hard to solve. Big questions: Why do quarks appear in groups? Physics during the big bang?
    8. Quantum ChromoDynamics: the theory of nuclear interactions (bound by “gluons”). Extremely difficult: must work at the level of fields, not particles; calculation is quantum mechanical
    9. Lattice QCD: Solving Quantum Chromodynamics by Computer Discretize space and time (place the quarks and gluons on a 4D lattice)
    10. Spacetime = 3+1 dimensions. $32^4 \sim 10^6$ lattice sites. Quarks live on sites (24 floats each). Gluons live on links (18 floats each). Total system size: 4 bytes/float × $32^4$ lattice sites × (24 quark floats + 4 × 18 gluon floats) ∼ 384 MB
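The total on slide 10 can be checked with a few lines of arithmetic. This is a minimal sketch in Python (chosen for brevity; the talk's own code is CUDA), assuming 4-byte single-precision floats and four link directions per site. The function name is illustrative and does not come from the talk.

```python
def lattice_bytes(extent=32, dims=4, quark_floats=24, gluon_floats=18,
                  bytes_per_float=4):
    """Storage for one quark field plus one gluon field on an extent**dims lattice."""
    sites = extent ** dims                               # 32^4 lattice sites
    floats_per_site = quark_floats + dims * gluon_floats  # 24 + 4 * 18 = 96
    return sites * floats_per_site * bytes_per_float


if __name__ == "__main__":
    total = lattice_bytes()
    print(total, "bytes =", total / 2**20, "MiB")
```

With the default arguments the result is 384 MiB, matching the "∼ 384 MB" quoted on the slide.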
    11. Lattice QCD: Inner loop requires repeatedly solving a linear equation [equation image not captured; its terms are labeled "quarks" and "gluons"]. DW is a sparse matrix with only nearest neighbor couplings. DW needs to be fast!
    12. DW Operation of 1 output quark site (24 floats)
    13. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats)
    14. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats)
    15. DW Operation of 1 output quark site (24 floats) 2x4 input quark sites (24x8 floats) 2x4 input gluon links (18x8 floats) 1.4 kB of local storage required per quark update?
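The 1.4 kB figure follows from adding up the floats named on slides 12 to 15. A small sketch, assuming 4-byte floats; the names below are illustrative only.

```python
FLOAT_BYTES = 4  # assumption: single precision


def site_update_floats(quark_floats=24, gluon_floats=18, directions=4):
    """Floats touched when updating one output quark site."""
    neighbors = 2 * directions                # forward and backward in each direction
    output = quark_floats                     # 1 output quark site
    input_quarks = neighbors * quark_floats   # 2x4 input quark sites
    input_gluons = neighbors * gluon_floats   # 2x4 input gluon links
    return output + input_quarks + input_gluons


if __name__ == "__main__":
    n = site_update_floats()
    print(n, "floats =", n * FLOAT_BYTES, "bytes")
```

The count is 360 floats, or 1440 bytes, which rounds to the 1.4 kB on the slide.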
    16. CUDA parallelization: must process many quark updates simultaneously. Odd/even sites processed separately
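One common reading of "odd/even" is a checkerboard colouring by the parity of the coordinate sum; the slide does not spell this out, so treat it as an assumption. Under that reading, and with an even lattice extent, every nearest neighbor of a site has the opposite parity, which is what lets one parity be updated while the other is only read. A minimal sketch with hypothetical helper names:

```python
from itertools import product


def parity(site):
    """0 for even sites, 1 for odd sites (assumed checkerboard convention)."""
    return sum(site) % 2


def neighbors(site, extent):
    """The 2 * len(site) nearest neighbors, with periodic wrap-around."""
    result = []
    for mu in range(len(site)):
        for step in (1, -1):
            n = list(site)
            n[mu] = (n[mu] + step) % extent
            result.append(tuple(n))
    return result


def neighbors_have_opposite_parity(extent, dims):
    """True if no site shares its parity with any of its nearest neighbors."""
    return all(parity(n) != parity(s)
               for s in product(range(extent), repeat=dims)
               for n in neighbors(s, extent))
```

The property holds for even extents such as 4 or 32; with an odd extent the periodic wrap-around breaks it.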
    17. Programming Model (Threading). A kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by: sharing data through shared memory; synchronizing their execution. Threads from different blocks cannot cooperate. [Diagram: the host launches Kernel 1 on Grid 1 (blocks (0, 0) to (2, 1)) and Kernel 2 on Grid 2 on the device; Block (1, 1) is expanded into threads (0, 0) to (4, 2).] © NVIDIA Corporation 2006. Friday, January 23, 2009
    18. DW parallelization: Each thread processes 1 site. No communication required between threads! All threads in warp execute same code
    19. Step 1: Read neighbor site
    20. Step 1: Read neighbor site Step 2: Read neighbor link
    21. Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into
    22. Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into Step 4: Read neighbor site
    23. Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into Step 4: Read neighbor site Step 5: Read neighbor link
    24. Step 1: Read neighbor site Step 2: Read neighbor link Step 3: Accumulate into Step 4: Read neighbor site Step 5: Read neighbor link Step 6: Accumulate into
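The read-site, read-link, accumulate cycle can be sketched on a toy model. This is not the operator from the talk: each site holds a single complex number instead of 24 floats, each link a single complex phase instead of 18 floats, and there is no spin structure. The periodic boundaries and the convention that a backward hop uses the conjugate of the link stored on the neighbor are assumptions made for illustration, as are all names.

```python
from itertools import product


def unit_configuration(extent, dims):
    """Toy input: every site holds 1 and every link holds the trivial phase 1."""
    sites = list(product(range(extent), repeat=dims))
    psi = {x: 1 + 0j for x in sites}
    links = {(x, mu): 1 + 0j for x in sites for mu in range(dims)}
    return psi, links


def hop(psi, links, extent, dims):
    """Nearest-neighbor sum for every site, one neighbor at a time."""
    out = {}
    for x in psi:
        acc = 0j
        for mu in range(dims):
            fwd = x[:mu] + ((x[mu] + 1) % extent,) + x[mu + 1:]
            bwd = x[:mu] + ((x[mu] - 1) % extent,) + x[mu + 1:]
            # Forward hop: read neighbor site, read the link at x, accumulate.
            acc += links[(x, mu)] * psi[fwd]
            # Backward hop: the link is stored on the neighbor; use its conjugate.
            acc += links[(bwd, mu)].conjugate() * psi[bwd]
        out[x] = acc
    return out
```

Each output site depends only on its own neighbors and links, so the loop over `x` has no dependencies between iterations and maps naturally to one thread per site.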
    25. Occupancy (Execution). Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory
    26. Optimizing threads per block (Execution). Choose threads per block as a multiple of warp size: avoid wasting computation on under-populated warps. More threads per block == better memory latency hiding. But, more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used. Heuristics: Minimum: 64 threads per block, only if multiple concurrent blocks. 192 or 256 threads a better choice: usually still enough regs to compile and invoke successfully. This all depends on your computation, so experiment!
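Warps on this hardware generation are 32 threads wide, so a block size that is not a multiple of 32 leaves lanes idle in its last warp. The sketch below counts those idle lanes; the function name is illustrative.

```python
WARP_SIZE = 32  # threads per warp on the GPUs discussed in the talk


def warps_and_idle_lanes(threads_per_block):
    """Return (warps needed, lanes left idle in the last warp)."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return warps, warps * WARP_SIZE - threads_per_block
```

For example, a 100-thread block needs 4 warps and wastes 28 lanes, while 192 or 256 threads fill their warps exactly.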
    27. Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers, 1024 active threads (max). High occupancy (roughly 25% or so) needed for maximum performance
    28. DW : does it fit onto the GPU? Each thread requires 0.2 kB (not 1.4 kB) of fast local memory: 12 floats (not 24) + 18 floats + 24 floats
    29. DW : does it fit onto the GPU? Each thread requires 0.2 kB (not 1.4 kB) of fast local memory. MP has 16 kB shared mem. Threads/MP = 16 / 0.2 = 80
    30. DW : does it fit onto the GPU? Each thread requires 0.2 kB (not 1.4 kB) of fast local memory. MP has 16 kB shared mem. Threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only)
    31. DW : does it fit onto the GPU? Each thread requires 0.2 kB (not 1.4 kB) of fast local memory. MP has 16 kB shared mem. Threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only). MP occupancy = 64/1024 = 6%
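The chain of numbers on slides 29 to 31 can be reproduced directly. This sketch uses the slides' rounded figures (0.2 kB taken as 200 bytes, 16 kB taken as 16000 bytes) so that the intermediate value of 80 comes out as on the slide; the function and parameter names are illustrative.

```python
def occupancy_from_shared(bytes_per_thread=200, shared_bytes=16000,
                          granularity=64, max_threads=1024):
    """Threads per multiprocessor when per-thread data lives in shared memory."""
    raw = shared_bytes // bytes_per_thread        # threads that fit
    threads = (raw // granularity) * granularity  # round down to a multiple of 64
    return raw, threads, threads / max_threads


if __name__ == "__main__":
    raw, threads, occupancy = occupancy_from_shared()
    print(raw, threads, f"{occupancy:.1%}")
```

The result is 80 threads that fit, 64 after rounding, and an occupancy of 6.25%, which the slide quotes as 6%.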
    32. 6% occupancy sounds pretty bad! Andreas Kuehn / Getty
    33. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers, 1024 active threads (max). Each thread requires 0.2 kB of fast local memory
    34. How can we get better occupancy? Reminder -- each multiprocessor has: 16 kB shared memory; 16 k registers = 64 kB memory (occupancy > 25%); 1024 active threads. Each thread requires 0.2 kB of fast local memory
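A rough version of the register budget behind slide 34 is sketched below. It assumes one 32-bit register per float, takes the 54 floats per thread (12 + 18 + 24) from slide 28, reuses the multiple-of-64 rounding from slide 30, and ignores the extra registers a real kernel needs for addresses and temporaries, so it is an upper bound rather than a measured figure. All names are illustrative.

```python
def threads_from_registers(floats_per_thread=54, registers=16 * 1024,
                           granularity=64, max_threads=1024):
    """Threads per multiprocessor when per-thread data lives in registers."""
    raw = registers // floats_per_thread
    threads = min((raw // granularity) * granularity, max_threads)
    return raw, threads, threads / max_threads


# 16 k registers of 4 bytes each is the 64 kB quoted on the slide.
REGISTER_FILE_BYTES = 16 * 1024 * 4
```

Under these assumptions about 303 threads fit, 256 after rounding, which is an occupancy of 25% compared with 6% for the shared-memory version.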
    35. Registers as data, instead of shared memory (possible because no inter-thread communication). Registers are allocated as [not captured in transcript]
    36. Registers as data: can’t be indexed. All loops must be EXPLICITLY expanded
    37. Code sample (approx. 1000 LOC automatically generated)
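Because registers cannot be indexed, every array loop has to be written out with named scalars, which is why generating the code makes sense. The generator used for the talk is not shown in the transcript; the sketch below is a hypothetical stand-in that emits a fully unrolled complex 3×3 matrix times 3-vector product as C-like statements. The variable naming scheme (`u01_re`, `p1_im`, `o0_re`) is invented for this example.

```python
def gen_matvec(mat="u", vec="p", out="o", n=3):
    """Emit C-like lines computing out = mat * vec for complex n x n, unrolled.

    Every operand is a named scalar such as u01_re, so nothing is indexed
    and a compiler is free to keep each value in a register.
    """
    lines = []
    for i in range(n):
        re_terms, im_terms = [], []
        for j in range(n):
            m, v = f"{mat}{i}{j}", f"{vec}{j}"
            re_terms.append(f"{m}_re*{v}_re - {m}_im*{v}_im")
            im_terms.append(f"{m}_re*{v}_im + {m}_im*{v}_re")
        lines.append(f"float {out}{i}_re = " + " + ".join(re_terms) + ";")
        lines.append(f"float {out}{i}_im = " + " + ".join(im_terms) + ";")
    return lines


if __name__ == "__main__":
    print("\n".join(gen_matvec()))
```

This emits six statements for one matrix-vector product; repeating such building blocks for every direction and spin component is how a generated file can plausibly reach the order of 1000 lines.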
    38. Performance Results: 44 Gigabytes/sec (Tesla C870); 82 Gigabytes/sec (GTX 280) (90 Gflops/s) (completely bandwidth limited). For comparison: twice as fast as Cell impl. (arXiv:0804.3654); 20 times faster than CPU implementations
    39. GB/s vs Occupancy. [Two bar charts of bandwidth (GB/s) against occupancy. Tesla C870: axis 0 to 45 GB/s, occupancy ≥ 25%, 17%, 8%, 0%. GTX 280: axis 0 to 85 GB/s, occupancy ≥ 19%, 13%, 6%, 0%.] Surprise! Very robust to low occupancy
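    40. Device memory is the bottleneck. Coalesced memory accesses crucial. Data reordering: from Quark 1 ($q_{1,1}, q_{1,2}, \ldots, q_{1,24}$), Quark 2 ($q_{2,1}, q_{2,2}, \ldots, q_{2,24}$), Quark 3 ($q_{3,1}, q_{3,2}, \ldots, q_{3,24}$), ... to $q_{1,1}, q_{2,1}, q_{3,1}, \ldots, q_{1,2}, q_{2,2}, q_{3,2}, \ldots$, read by thread 0, thread 1, thread 2, ...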
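The effect of the reordering is easiest to see by listing the addresses that neighboring threads touch when each reads the same component of its own quark. The sketch below assumes thread t handles quark t and measures addresses in floats; the layout names are illustrative.

```python
def addresses(layout, comp, n_threads, n_sites, n_comp=24):
    """Float offsets read by threads 0..n_threads-1 for component `comp`."""
    if layout == "site_major":        # q1_1 ... q1_24, q2_1 ... q2_24, ...
        return [t * n_comp + comp for t in range(n_threads)]
    if layout == "component_major":   # q1_1, q2_1, q3_1, ..., q1_2, q2_2, ...
        return [comp * n_sites + t for t in range(n_threads)]
    raise ValueError(f"unknown layout: {layout}")
```

In the site-major layout neighboring threads are 24 floats apart, while in the component-major layout they read consecutive floats, which is the pattern that coalesces.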
    41. Memory coalescing: store even/odd lattices separately
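One way to store the two parities separately is to halve the lexicographic site index, which gives each parity its own dense array. This is a common convention rather than something the slide specifies; it assumes an even lattice extent, the first coordinate running fastest, and parity defined by the coordinate sum. The function name is hypothetical.

```python
def checkerboard_index(site, extent):
    """Return (parity, index within that parity's array) for a lattice site."""
    lex = 0
    for coord in reversed(site):  # first coordinate ends up as the fastest axis
        lex = lex * extent + coord
    return sum(site) % 2, lex // 2
```

Sites with lexicographic indices 2m and 2m+1 differ only in the fastest coordinate, so they have opposite parity and both map to slot m of their respective arrays.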
    42. When memory access isn’t perfectly coalesced: sometimes float4 arrays can hide latency. This global memory read corresponds to a single CUDA instruction. In case of coalesce miss, at least 4x data is transferred. [Diagram: thread 0, thread 1, thread 2]
    43. When memory access isn’t perfectly coalesced: binding to textures can help; corresponds to a single CUDA instruction. This makes use of the texture cache and can reduce penalty for nearly coalesced accesses
    44. Regarding textures, there are two kinds of memory. Linear array: can be modified in kernel; can only be bound to 1D texture. “Cuda array”: can’t be modified in kernel; gets reordered for 2D, 3D locality; allows various hardware features
    45. When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality. (Wikipedia image)
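A Z-curve (Morton order) index is obtained by interleaving the bits of the two coordinates. The sketch below shows that mapping only; the actual layout used by the hardware is not documented in the talk, which itself says "probably" and "something like".

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y, with x in the even bit positions."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code
```

Each aligned 2×2 block of texels maps to four consecutive indices, as does each aligned 4×4 block to sixteen, so points that are close in 2D tend to stay close in memory.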
    46. Warnings: The effectiveness of float4, textures, depends on the CUDA hardware and driver (!) Certain “magic” access patterns are many times faster than others Testing appears to be necessary
    47. Memory bandwidth test Simple kernel Memory access completely coalesced Should be optimal
    48. Memory bandwidth test Simple kernel Memory access completely coalesced Bandwidth: 54 Gigabytes / sec (GTX 280, 140 GB/s theoretical!)
    49. So why are NVIDIA samples so fast? NVIDIA actually uses [not captured in transcript]. Bandwidth: 102 Gigabytes / sec, up from 54 Gigabytes / sec (GTX 280, 140 GB/s theoretical)
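The bandwidth figures on these slides can be put in proportion to the theoretical peak with a short calculation. The sketch assumes 1 GB = 10^9 bytes and counts both reads and writes as traffic; the timing itself is left to the caller, and the function names are illustrative.

```python
def effective_bandwidth_gb_s(bytes_read, bytes_written, seconds):
    """Achieved bandwidth of a kernel, in GB/s, from its measured run time."""
    return (bytes_read + bytes_written) / seconds / 1e9


def fraction_of_peak(measured_gb_s, peak_gb_s=140.0):
    """Share of the theoretical bandwidth actually reached."""
    return measured_gb_s / peak_gb_s
```

With the GTX 280 numbers quoted above, 54 GB/s is about 39% of the 140 GB/s peak and 102 GB/s is about 73%.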
    50. Naive access pattern. [Diagram: memory regions accessed by Block 1 and Block 2 at Step 1 and Step 2]
    51. Modified access pattern (much more efficient). [Diagram: memory regions accessed by Block 1 and Block 2 at Step 1 and Step 2]
    52. CUDA Compiler: CUDA C code → PTX code → CUDA machine code (LOTS of optimization here). Use unofficial CUDA disassembler to view CUDA machine code (CUDA disassembly)
    53. CUDA Disassembler (decuda): foo.cu → compile and save cubin file → disassemble
    54. Look how CUDA implements integer division!
    55. CUDA provides fast (but imperfect) trigonometry in hardware!
    56. The compiler is very aggressive in optimization. It will group memory loads together to minimize latency (snippet from LQCD) Notice: each thread reads 20 floats!


    More at http://sites.google.com/site/cudaiap2009
