Heterogeneous Particle based Simulation (SIGGRAPH ASIA 2011)


  1. HETEROGENEOUS PARTICLE BASED SIMULATION
     Takahiro Harada, AMD
  2. PARTICLE BASED SIMULATION ON THE GPU (Harada et al. 2007)
     • Large number of particles
     • Particles with identical size
       - Work granularity is almost the same
       - Good for the wide SIMD architecture
     [Slide footer: Harada, Heterogeneous Particle-based Simulation]
  3. PARTICLE BASED SIMULATION
     • Collision: $f_{\mathrm{collide}} = f_{ij}$
     • Integration: $\frac{dv}{dt} = \frac{f}{m}$, $\frac{dx}{dt} = v$, stepped as $v \mathrel{+}= \frac{f}{m}\Delta t$, $x \mathrel{+}= v\,\Delta t$
     • An acceleration structure is used for efficient collision detection
       - Uniform grid → suited for the GPU
       - Less divergence
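The integration rule on this slide can be sketched in a few lines. This is a minimal illustration, not the author's kernel: the function name `integrate`, the list-based vectors, and the choice to use the already-updated velocity for the position update (following the order in which the slide lists the two updates) are assumptions.

```python
def integrate(x, v, f, m, dt):
    """One time step: v += (f / m) * dt, then x += v * dt with the new velocity."""
    v_new = [vi + (fi / m) * dt for vi, fi in zip(v, f)]
    x_new = [xi + vi * dt for xi, vi in zip(x, v_new)]
    return x_new, v_new


# A particle of mass 2 moving along +x while a force pulls it along -z.
x1, v1 = integrate([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -2.0], m=2.0, dt=0.5)
print(x1, v1)
```

Every particle runs the same arithmetic with no branching, which is why this step maps well onto wide SIMD hardware.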
  4. DIVERGENCE ON SIMD
     [Figure: eight SIMD lanes, numbered 0 to 7]
     void Kernel() { if (A) FuncA(); else if (B) FuncB(); else FuncC(); }
  5. PARTICLE BASED SIMULATION ON THE GPU
     • Particle collision using a uniform grid
     [Figure: eight SIMD lanes, numbered 0 to 7; grid cells Cell0 to Cell8]
     void Kernel() { prepare(); collide(Cell0); collide(Cell1); collide(Cell2); collide(Cell3); collide(Cell4); collide(Cell5); collide(Cell6); collide(Cell7); collide(Cell8); }
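A uniform-grid collision query along the lines of this slide might look like the sketch below. It is written in 2D so that the nine visited cells form the 3×3 neighbourhood of a particle's cell; that reading of Cell0 to Cell8, the names `build_grid` and `colliding_pairs`, and the choice of cell size equal to the particle diameter are assumptions, not details taken from the slides.

```python
from collections import defaultdict


def build_grid(positions, cell_size):
    """Map each integer cell coordinate to the indices of the particles inside it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell_size), int(y // cell_size))].append(i)
    return grid


def colliding_pairs(positions, radius):
    """Return index pairs (i, j), i < j, of equal-sized particles that overlap."""
    cell_size = 2.0 * radius  # overlapping particles are at most one cell apart
    grid = build_grid(positions, cell_size)
    pairs = set()
    for i, (x, y) in enumerate(positions):
        cx, cy = int(x // cell_size), int(y // cell_size)
        # Every particle visits the same nine cells, so the work per particle is uniform.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if j > i:
                        px, py = positions[j]
                        if (px - x) ** 2 + (py - y) ** 2 < (2.0 * radius) ** 2:
                            pairs.add((i, j))
    return pairs


demo_positions = [(0.0, 0.0), (0.15, 0.0), (1.0, 1.0), (0.95, 1.05)]
demo_pairs = colliding_pairs(demo_positions, radius=0.1)
print(sorted(demo_pairs))
```

Because all particles have the same radius, the loop structure is identical for every particle, which is the property the slide relies on to avoid divergence.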
  6. MIXED PARTICLE SIMULATION
     • Not only small particles
     • Difficulty for GPUs
       - Large particles interact with small particles
       - Large-large collision
  7. CHALLENGE
     • Non-uniform work granularity
       - Small-small (SS) collision → uniform, GPU
       - Large-large (LL) collision → non-uniform, CPU
       - Large-small (LS) collision → non-uniform, CPU
  8. FUSION ARCHITECTURE
     • The CPU and GPU are:
       - On the same die
       - Much closer
       - Able to share data efficiently
     • The CPU and GPU are good at different kinds of work
       - CPU: serial computation, conditional branches
       - GPU: parallel computation
     • Work can be dispatched accordingly:
       - Serial work with varying granularity → CPU
       - Parallel work with uniform granularity → GPU
  9. MIXED PARTICLE SIMULATION
     • Benefits from the Fusion architecture
       - Different kinds of work in one simulation
       - The CPU and GPU work together
       - They share data
  10. METHOD
  11. TWO SIMULATIONS
      • Small particles
      • Large particles
      [Diagram. Stages: Build Acc. Structure, SS Collision, S Integration; Build Acc. Structure, LL Collision, L Integration, LS Collision. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force]
  12. CLASSIFY BY WORK GRANULARITY
      • Small particles
      • Large particles
      [Diagram labels: Uniform Work, Non Uniform Work. Stages: Build Acc. Structure (twice), SS Collision, S Integration, L Integration, LL Collision, LS Collision. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force]
  13. CLASSIFY BY WORK GRANULARITY, ASSIGN PROCESSOR
      • Small particles
      • Large particles
      [Diagram labels: GPU, CPU. Stages: Build Acc. Structure (twice), SS Collision, S Integration, L Integration, LL Collision, LS Collision. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force]
  14. DATA SHARING
      • Small particles
      • Large particles
      • The grid and the small-particle data have to be shared with the CPU for LS collision
        - Allocated as a zero-copy buffer
      [Diagram labels: GPU, CPU. Stages: Build Acc. Structure (twice), SS Collision, S Integration, L Integration, LL Collision, LS Collision. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force; Position, Velocity, Grid, Force]
  15. SYNCHRONIZATION
      • Small particles
      • Large particles
      • The grid and the small-particle data have to be shared with the CPU for LS collision
        - Allocated as a zero-copy buffer
      [Diagram labels: GPU, CPU. Stages: Build Acc. Structure (twice), SS Collision, S Integration, L Integration, LL Collision, LS Collision, Synchronization (twice). Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force; Position, Velocity, Grid, Force]
  16. VISUALIZING WORKLOADS
      • Small particles
      • Large particles
      • Grid construction can be moved to the end of the pipeline
        - Unbalanced workload
      [Diagram labels: GPU, CPU. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, LS Collision, Synchronization, L Integration. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force]
  17. LOAD BALANCING
      • Small particles
      • Large particles
      • To get better load balancing
        - The sync exists to pass the force buffer filled by the CPU to the GPU
        - Move the LL collision after the sync
      [Diagram labels: GPU, CPU. Stages: Build Acc. Structure, SS Collision, S Integration, LL Collision, Synchronization, L Integration, LS Collision. Buffers: Position, Velocity, Force, Grid; Position, Velocity, Force]
  18. [Figure labels: GPU Work, CPU Work]
  19. MULTI THREADING (4 THREADS)
  20. FURTHER OPTIMIZATION
      1. Not optimized for “Llano”, which is a 4-core CPU
         - Only 2 CPU cores were used
         - Can use 2 more cores for LS collision
      2. LL collision was not optimized
         - The CPU waits while the GPU is constructing a grid
         - Use the CPU to improve SS collision
      [Diagram labels: GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization]
  21. OPTIMIZATION 1: MULTITHREADING LARGE-SMALL COLLISION
      • Cannot split the work by large particle indices
        - More than one large particle can collide with the same small particle
        - Would have to lock the memory on write → inefficient
      • Prepare a local buffer for each thread
        - The buffer stores the force on small particles
        - Lock free
      • Local buffers are merged into one
      [Figure labels: L0, S0, S1, L1; Thread0, Thread1, Thread2]
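The per-thread buffer idea on this slide can be illustrated as follows. This is a sketch under stated assumptions: the 1D positions, the linear penalty force, and the names `ls_forces_local` and `ls_forces` are hypothetical and only serve to show the data layout (one private force buffer per thread, summed afterwards). In CPython the global interpreter lock prevents a real speedup here, so the example demonstrates the structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def ls_forces_local(large_chunk, small_pos, small_radius, stiffness):
    """Accumulate forces on all small particles from one chunk of large particles."""
    local = [0.0] * len(small_pos)  # private buffer, so no lock is needed on write
    for lx, lr in large_chunk:
        for s, sx in enumerate(small_pos):
            overlap = (lr + small_radius) - abs(sx - lx)
            if overlap > 0.0:
                direction = 1.0 if sx >= lx else -1.0
                local[s] += stiffness * overlap * direction
    return local


def ls_forces(large, small_pos, small_radius, stiffness, n_threads):
    """Split the large particles across threads, then merge the local buffers."""
    chunks = [large[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(
            lambda chunk: ls_forces_local(chunk, small_pos, small_radius, stiffness),
            chunks))
    # Merge step: sum the per-thread buffers element by element.
    return [sum(column) for column in zip(*buffers)]


# Two large particles (position, radius); the first small particle touches both.
large_particles = [(0.0, 1.0), (3.0, 1.0)]
small_positions = [1.5, 4.5]
merged = ls_forces(large_particles, small_positions, 0.75, 100.0, n_threads=2)
print(merged)
```

The first small particle receives contributions from both threads, which is the case that would need a lock if the threads wrote into one shared force array.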
  22. OPTIMIZATION 1: MULTITHREADING LARGE-SMALL COLLISION
      [Diagram labels: GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision, Synchronization]
  23. OPTIMIZATION 1: MULTITHREADING LARGE-SMALL COLLISION
      [Diagram labels: GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision (three times), Synchronization, Merge (three times), Synchronization]
  24. OPTIMIZATION 2: IMPROVING SMALL-SMALL COLLISION
      • A spatially coherent memory layout improves cache utilization
      • As particles move, spatial locality decreases
  26. SPATIAL SORT
      • Sort particles by spatial location to improve cache utilization
        - Z curve
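One common way to obtain a Z-curve ordering is to sort particles by a Morton code, built by interleaving the bits of the integer cell coordinates. The sketch below does this in 2D for brevity; the slides do not say how the key is computed, so the bit-interleaving helper, the names `part1by1`, `morton2d`, and `z_order`, and the restriction to non-negative coordinates below 2^16 cells per axis are all assumptions.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit separates each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(ix, iy):
    """Interleave the bits of two cell indices: x in even bits, y in odd bits."""
    return part1by1(ix) | (part1by1(iy) << 1)


def z_order(positions, cell_size):
    """Return particle indices sorted along the Z curve (non-negative coordinates)."""
    def key(i):
        x, y = positions[i]
        return morton2d(int(x // cell_size), int(y // cell_size))
    return sorted(range(len(positions)), key=key)


demo_points = [(1.5, 1.5), (0.5, 0.5), (0.5, 1.5), (1.5, 0.5)]
demo_order = z_order(demo_points, cell_size=1.0)
print(demo_order)
```

Particles whose keys are close tend to sit in nearby cells, so storing particle data in this order keeps neighbours close together in memory.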
  28. CHOOSE SORT
      • Requirements
        - A full sort was over the budget
        - A full sort is not a must
        - The sort is an optional computation for performance improvement
        - Incremental sort
        - Use multiple threads
      • Solution
        - Used a generalized “Odd-even transition sort”
  29. BLOCK TRANSITION SORT
      • Generalized “Odd-even transition sort”
      • Instead of sorting 2 adjacent elements, sort 2 adjacent blocks
      • Iterate until convergence
      • Use a thread to sort 2 adjacent blocks
        - 6 blocks for 3 threads
        - Radix sort
      [Figure labels: Odd-even transition sort, Block transition sort]
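The block-wise generalization described on this slide might be sketched as below. The scheme the slide calls odd-even transition sort is widely known as odd-even transposition sort. Several details here are assumptions: the function names, the use of Python's built-in `sorted` as a stand-in for the per-pair radix sort, the `max_passes` parameter used to model an incremental sort, and the serial loop over block pairs (each pair within a phase touches disjoint data, so in the slides' setting each pair could go to its own thread).

```python
def block_transition_pass(keys, block_size, phase):
    """Sort every pair of adjacent blocks for one phase (0: even pairs, 1: odd pairs).

    Returns True if any element moved.
    """
    changed = False
    start = phase * block_size
    # Only visit a pair if its second block contains at least one element.
    for lo in range(start, len(keys) - block_size, 2 * block_size):
        hi = min(lo + 2 * block_size, len(keys))
        merged = sorted(keys[lo:hi])  # stand-in for the per-pair radix sort
        if merged != keys[lo:hi]:
            keys[lo:hi] = merged
            changed = True
    return changed


def block_transition_sort(keys, block_size, max_passes=None):
    """Alternate even and odd phases in place; return the number of passes run."""
    if len(keys) <= block_size:  # a single block has no neighbour to pair with
        keys.sort()
        return 0
    passes, quiet, phase = 0, 0, 0
    # Converged once an even and an odd phase in a row change nothing.
    while quiet < 2 and (max_passes is None or passes < max_passes):
        quiet = 0 if block_transition_pass(keys, block_size, phase) else quiet + 1
        phase ^= 1
        passes += 1
    return passes


full = [9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10]   # 6 blocks of 2 keys
block_transition_sort(full, block_size=2)

partial = [9, 3, 7, 1, 8, 2, 6, 0, 5, 4, 11, 10]
block_transition_sort(partial, block_size=2, max_passes=1)  # one pass only
print(full, partial)
```

Stopping after a fixed number of passes leaves the data only partially ordered, which matches the slides' point that a full sort is not required: each pass improves locality and the remaining passes can run in later frames.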
  30. OPTIMIZATION 2: IMPROVING SMALL-SMALL COLLISION
      [Diagram labels: GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LL Collision, L Integ., LS Collision (three times), Synchronization, Merge (three times), Synchronization]
  31. OPTIMIZATION 2: IMPROVING SMALL-SMALL COLLISION
      [Diagram labels: GPU, CPU0, CPU1, CPU2. Stages: Build Acc. Structure, SS Collision, S Integ., LS Collision (three times), Synchronization, Merge (three times), LL Coll., L Integ., Synchronization, S Sorting (three times), Synchronization]
  32. DEMO
      [Figure labels: GPU Work, CPU Work]
  34. CONCLUSIONS
      • Realized a simulation that handles variable-sized particles by leveraging the best features of both the CPU and the GPU on AMD’s Fusion architecture
        - The CPU is used for work with non-identical compute granularity
        - The GPU is used for highly parallel work
      • Memory sharing between the CPU and the GPU is the key to efficiency
        - It avoids wasteful memory copies
  35. REFERENCE
      • Takahiro Harada, Seiichi Koshizuka, Yoichiro Kawaguchi. Smoothed Particle Hydrodynamics on GPUs. Proc. of Computer Graphics International, 63-70 (2007)
      • Justin Hensley, Takahiro Harada. Chapter X, OpenCL Case Study: Mixed Particle Simulation. Heterogeneous Computing with OpenCL, Morgan Kaufmann (2011)
