IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Computational Physics (Kipton Barros, BU)
 


More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009


Note that some slides were borrowed from NVIDIA.

Presentation Transcript

    • CUDA Tricks and Computational Physics. Kipton Barros, Boston University. In collaboration with R. Babich, R. Brower, M. Clark, C. Rebbi, and J. Ellowitz.
    • High energy physics has huge computational needs. Large Hadron Collider, CERN (27 km).
    • A request: please question/comment freely during the talk. A disclaimer: I’m not a high energy physicist.
    • View of the CMS detector at the end of 2007 (Maximilien Brice, © CERN).
    • 15 petabytes to be processed annually. View of the Computer Center during the installation of servers. (Maximilien Brice; Claudia Marcelloni, © CERN)
    • The “Standard Model” of Particle Physics
    • I’ll discuss Quantum ChromoDynamics. Although it’s “standard”, these equations are hard to solve. Big questions: why do quarks appear in groups? What was the physics during the big bang?
    • Quantum ChromoDynamics: the theory of nuclear interactions (bound by “gluons”). Extremely difficult: we must work at the level of fields, not particles, and the calculation is quantum mechanical.
    • Lattice QCD: solving Quantum Chromodynamics by computer. Discretize space and time (place the quarks and gluons on a 4D lattice).
    • Spacetime = 3+1 dimensions; 32^4 ∼ 10^6 lattice sites. Quarks live on sites (24 floats each); gluons live on links (18 floats each). Total system size: 4 bytes/float × 32^4 lattice sites × (24 + 4 × 18) floats ∼ 384 MB.
    • Lattice QCD: the inner loop requires repeatedly solving a linear equation relating the quarks and gluons. DW is a sparse matrix with only nearest-neighbor couplings, and DW needs to be fast!
    • DW operation for 1 output quark site (24 floats): it reads 2×4 input quark sites (24×8 floats) and 2×4 input gluon links (18×8 floats). 1.4 kB of local storage required per quark update?
    • CUDA parallelization: we must process many quark updates simultaneously. Odd/even sites are processed separately.
    • Thread programming model (NVIDIA slide): a kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. Threads from different blocks cannot cooperate. (Diagram of host/device, kernels, grids, blocks, and threads; © NVIDIA Corporation 2006.)
    • DW parallelization: each thread processes 1 site. No communication is required between threads! All threads in a warp execute the same code.
    • Per-site update, built up neighbor by neighbor: Step 1: read a neighbor site. Step 2: read the corresponding neighbor link. Step 3: accumulate into the output site. Step 4: read the next neighbor site. Step 5: read the next neighbor link. Step 6: accumulate into the output site. (A schematic per-thread sketch appears after the transcript.)
    • Occupancy (NVIDIA slide): thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy. Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. Limited by resource usage: registers, shared memory.
    • Optimizing threads per block (NVIDIA slide): choose threads per block as a multiple of warp size; avoid wasting computation on under-populated warps. More threads per block means better memory latency hiding, but also fewer registers per thread, and kernel invocations can fail if too many registers are used. Heuristics: minimum of 64 threads per block (only if multiple concurrent blocks); 192 or 256 threads is a better choice; usually still enough regs to compile and invoke successfully. This all depends on your computation, so experiment!
    • Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers, 1024 active threads (max). High occupancy (roughly 25% or so) is needed for maximum performance.
    • DW: does it fit onto the GPU? Each thread actually requires only 0.2 kB of fast local memory, not the full 1.4 kB (roughly 12 + 18 + 24 floats held at a time). An MP has 16 kB shared mem, so threads/MP = 16 / 0.2 = 80, rounded down to 64 (multiple of 64 only). MP occupancy = 64/1024 = 6%.
    • 6% occupancy sounds pretty bad! (Photo: Andreas Kuehn / Getty)
    • How can we get better occupancy? Reminder -- each multiprocessor has: 16 kB shared memory, 16 k registers (= 64 kB of memory), 1024 active threads (max). Each thread requires 0.2 kB of fast local memory, and we want occupancy > 25%.
    • Registers as data (possible because no inter-thread communication is needed): instead of shared memory, the per-thread data is allocated as registers.
    • Registers as data: registers can’t be indexed, so all loops must be EXPLICITLY expanded.
    • Code sample (approx. 1000 LOC, automatically generated). (A minimal hand-written sketch of the registers-as-data idea appears after the transcript.)
    • Performance results: 44 gigabytes/sec (Tesla C870); 82 gigabytes/sec, or 90 Gflops/s (GTX 280) -- completely bandwidth limited. For comparison: twice as fast as the Cell implementation (arXiv:0804.3654) and 20 times faster than CPU implementations.
    • GB/s vs. occupancy (bar charts: Tesla C870, up to 45 GB/s, at occupancies of ≥25%, 17%, 8%, 0%; GTX 280, up to 85 GB/s, at occupancies of ≥19%, 13%, 6%, 0%). Surprise! Very robust to low occupancy.
    • Device memory is the bottleneck, so coalesced memory accesses are crucial. Data reordering: instead of storing quark 1 (q1,1, q1,2, ... q1,24), quark 2 (q2,1, q2,2, ... q2,24), quark 3, ... contiguously, interleave the components as q1,1 q2,1 q3,1 ... q1,2 q2,2 q3,2 ... so that thread 0, thread 1, thread 2, ... read consecutive words. (A layout sketch appears after the transcript.)
    • Memory coalescing: store even/odd lattices separately
    • When memory access isn’t perfectly coalesced: sometimes float4 arrays can hide latency. This global memory read corresponds to a single CUDA instruction, and in case of a coalesce miss at least 4× the data is transferred (thread 0, thread 1, thread 2, ...). (A float4 sketch appears after the transcript.)
    • When memory access isn’t perfectly coalesced: binding to textures can help. The read still corresponds to a single CUDA instruction; this makes use of the texture cache and can reduce the penalty for nearly coalesced accesses. (A texture-binding sketch appears after the transcript.)
    • Regarding textures, there are two kinds of memory. Linear array: can be modified in a kernel; can only be bound to a 1D texture. “CUDA array”: can’t be modified in a kernel; gets reordered for 2D/3D locality; allows various hardware features.
    • When a CUDA array is bound to a 2D texture, it is probably reordered to something like a Z-curve. This gives 2D locality. (Wikipedia image)
    • Warnings: the effectiveness of float4 and textures depends on the CUDA hardware and driver (!). Certain “magic” access patterns are many times faster than others. Testing appears to be necessary.
    • Memory bandwidth test: a simple kernel with completely coalesced memory access -- it should be optimal. Measured bandwidth: 54 gigabytes/sec (GTX 280; 140 GB/s theoretical!). (A timing sketch appears after the transcript.)
    • So why are NVIDIA samples so fast? NVIDIA actually achieves 102 gigabytes/sec, not 54 (GTX 280, 140 GB/s theoretical).
    • Naive access pattern (diagram: blocks 1 and 2 stepping through memory, steps 1 and 2).
    • Modified access pattern, much more efficient (diagram: blocks 1 and 2 stepping through memory, steps 1 and 2). (An illustrative sketch of two such sweep patterns appears after the transcript.)
    • CUDA compiler (LOTS of optimization here): CUDA C code → PTX code → CUDA machine code. Use the unofficial CUDA disassembler to view the CUDA machine code (CUDA disassembly).
    • CUDA disassembler (decuda): compile foo.cu and save the cubin file, then disassemble it.
    • Look how CUDA implements integer division!
    • CUDA provides fast (but imperfect) trigonometry in hardware! (An intrinsics sketch appears after the transcript.)
    • The compiler is very aggressive in its optimization: it will group memory loads together to minimize latency (snippet from LQCD). Notice: each thread reads 20 floats!
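
The sketches below are editorial illustrations of ideas mentioned in the transcript. None of them is code from the talk, and every kernel name, parameter, and data layout in them is invented for illustration.

First, a very schematic version of the per-thread update described on the "Step 1 ... Step 6" slides: each thread owns one output site and, for each neighbor direction, reads the neighbor site, reads the connecting link, and accumulates. In the real DW kernel a site carries 24 floats, a link 18 floats, and the accumulate step involves 3x3 complex (SU(3)) matrix-vector products; here everything is collapsed to single floats.

    // Schematic only -- NOT the real DW kernel.
    __global__ void dwSketch(const float* __restrict__ quarkIn,
                             const float* __restrict__ links,     // 8 per site (4 dims x 2 dirs)
                             const int*   __restrict__ neighbor,  // 8 neighbor indices per site
                             float* __restrict__ quarkOut,
                             int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        float acc = 0.0f;                                  // accumulator lives in a register
        for (int d = 0; d < 8; d++) {
            float q = quarkIn[neighbor[8 * site + d]];     // step 1/4/...: read neighbor site
            float U = links[8 * site + d];                 // step 2/5/...: read neighbor link
            acc += U * q;                                  // step 3/6/...: accumulate
        }
        quarkOut[site] = acc;
    }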
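
A minimal sketch of the "registers as data" idea, assuming a toy kernel that handles 24 floats per site. A fixed-size local array that is only ever indexed with compile-time constants (here obtained by fully unrolling the loops) can be kept in registers rather than in shared memory. The talk's actual code spells the expansion out explicitly in roughly 1000 generated lines; #pragma unroll is used here only to keep the sketch short.

    __global__ void scaleSite24(const float* in, float* out, float a, int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        float q[24];                       // held entirely in registers once the loops are unrolled
    #pragma unroll
        for (int c = 0; c < 24; c++)       // indices become compile-time constants
            q[c] = in[24 * site + c];      // (memory layout/coalescing ignored in this sketch)

    #pragma unroll
        for (int c = 0; c < 24; c++)
            out[24 * site + c] = a * q[c];
    }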
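
A sketch of the data-reordering (structure-of-arrays) layout from the coalescing slide, with illustrative names (NCOMP, nSites): component c of site s is stored at c * nSites + s, so for every component consecutive threads touch consecutive words.

    #define NCOMP 24   // floats per quark site

    // Array of structures:  q[site * NCOMP + c]   -> strided, poorly coalesced reads
    // Structure of arrays:  q[c * nSites + site]  -> threads 0, 1, 2, ... read adjacent addresses
    __global__ void copyQuarksSoA(const float* __restrict__ in,
                                  float* __restrict__ out, int nSites)
    {
        int site = blockIdx.x * blockDim.x + threadIdx.x;
        if (site >= nSites) return;

        for (int c = 0; c < NCOMP; c++)
            out[c * nSites + site] = in[c * nSites + site];   // coalesced read and write
    }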
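
A sketch of the float4 idea: each thread issues a single 16-byte load, so even when coalescing is imperfect every memory instruction moves a useful chunk of data. The kernel and names are illustrative.

    __global__ void sumFloat4(const float4* __restrict__ in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 v = in[i];                  // one 16-byte load per thread, one CUDA instruction
        out[i] = v.x + v.y + v.z + v.w;
    }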
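
A sketch of binding a linear device array to a 1D texture, using the texture-reference API that was current when the talk was given (it has since been removed from recent CUDA toolkits). Reads then go through the texture cache, which can soften the cost of nearly coalesced access patterns. The names quarkTex and launchRead are illustrative.

    #include <cuda_runtime.h>

    texture<float4, 1, cudaReadModeElementType> quarkTex;   // file-scope texture reference

    __global__ void readThroughTexture(float4* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(quarkTex, i);                // cached read from linear memory
    }

    void launchRead(const float4* d_in, float4* d_out, int n)
    {
        cudaBindTexture(0, quarkTex, d_in, n * sizeof(float4));  // bind the existing allocation
        readThroughTexture<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(quarkTex);
    }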
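
A sketch of the kind of memory bandwidth micro-benchmark the slides describe (an assumed reconstruction, not the kernel behind the 54 GB/s figure): a perfectly coalesced device-to-device copy timed with CUDA events.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copyKernel(const float* __restrict__ in, float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                    // perfectly coalesced read and write
    }

    int main()
    {
        const int n = 1 << 24;                        // 16M floats, 64 MB per buffer
        float *d_in, *d_out;                          // contents don't matter for a bandwidth test
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copyKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbMoved = 2.0 * n * sizeof(float) / 1e9;   // bytes read + bytes written
        printf("Effective bandwidth: %.1f GB/s\n", gbMoved / (ms / 1e3));

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }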
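
An illustration, in the spirit of the "naive" vs. "modified" access-pattern slides, of two ways a grid can sweep a large array. The exact pattern used in the NVIDIA samples is not recoverable from the transcript, so this is only indicative: in the chunked version each block walks its own contiguous region step by step, while in the interleaved (grid-stride) version all blocks advance together, so at every step the grid touches one contiguous window of memory. Which is faster depends on the hardware, which is the slide's point: measure, don't assume.

    __global__ void scaleChunked(float* data, float a, int n, int perBlock)
    {
        int base = blockIdx.x * perBlock;                    // each block owns a contiguous chunk
        for (int i = threadIdx.x; i < perBlock && base + i < n; i += blockDim.x)
            data[base + i] *= a;
    }

    __global__ void scaleInterleaved(float* data, float a, int n)
    {
        int stride = gridDim.x * blockDim.x;                 // all blocks advance together
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            data[i] *= a;
    }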
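
A sketch of the fast hardware trigonometry the slide alludes to: the __sinf, __cosf, and combined __sincosf intrinsics map to the GPU's special-function units and trade accuracy (notably for large arguments) for speed. The kernel is illustrative.

    __global__ void unitVectors(const float* angle, float2* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float s, c;
        __sincosf(angle[i], &s, &c);       // fast, reduced-precision sine and cosine
        out[i] = make_float2(c, s);        // (cos t, sin t) for each input angle t
    }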