Brief Overview of a Parallel Nbody Code
This is a brief overview of an N-body code, including the sequential and parallel versions (OpenMP and CUDA) and its computational complexity.

Brief Overview of a Parallel Nbody Code Presentation Transcript

  • 1. Brief overview of a parallel nbody code: implementation and analysis. Filipo Novo Mór, Graduate Program in Computer Science, UFRGS. Prof. Nicollas Maillard. December 2013.
  • 2. Overview • About the nbody problem • The Serial Implementation • The OpenMP Implementation • The CUDA Implementation • Experimental Results • Conclusion
  • 3. About the nbody problem. Features: force calculation between all pairs of particles; complexity O(N²); total energy should remain constant; the brute-force algorithm demands huge computational power.
  • 4. The Serial Implementation (NAIVE!) • Clearly O(N²). • Each pair is evaluated twice. • The acceleration has to be adjusted at the end. A sketch of this structure follows below.
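    The slides do not include the listing, so the following is only a minimal sketch of what the naive O(N²) step could look like (function and variable names are assumed): for every particle i, all other particles j are visited, so each unordered pair {i, j} is evaluated twice over the whole computation.

        #include <math.h>

        /* Assumed sketch of the naive O(N^2) step: particle i accumulates the
           (softened) gravitational acceleration due to every other particle j,
           so each unordered pair {i, j} is evaluated twice overall. */
        void accel_naive(int n, const double *m, const double (*pos)[3],
                         double (*acc)[3], double G, double eps)
        {
            for (int i = 0; i < n; i++) {
                acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j == i) continue;
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dz = pos[j][2] - pos[i][2];
                    double r2 = dx*dx + dy*dy + dz*dz + eps*eps; /* softening */
                    double inv_r3 = 1.0 / (r2 * sqrt(r2));
                    acc[i][0] += G * m[j] * dx * inv_r3;
                    acc[i][1] += G * m[j] * dy * inv_r3;
                    acc[i][2] += G * m[j] * dz * inv_r3;
                }
            }
        }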
  • 5. The Serial Implementation • Still within the O(N²) domain, but: • Each pair is evaluated only once. • The acceleration is already correct at the end. A sketch follows below.
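    A corresponding, equally hypothetical sketch of the pair-once variant: the inner loop starts at j = i + 1 and Newton's third law updates both particles, halving the number of pair evaluations.

        #include <math.h>

        /* Assumed sketch of the "smart" serial version: each unordered pair
           {i, j} is visited exactly once (j starts at i + 1), and Newton's
           third law updates both accelerations, so nothing needs adjusting
           at the end. */
        void accel_pairwise(int n, const double *m, const double (*pos)[3],
                            double (*acc)[3], double G, double eps)
        {
            for (int i = 0; i < n; i++)
                acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dz = pos[j][2] - pos[i][2];
                    double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
                    double inv_r3 = 1.0 / (r2 * sqrt(r2));
                    acc[i][0] += G * m[j] * dx * inv_r3;   /* action on i */
                    acc[i][1] += G * m[j] * dy * inv_r3;
                    acc[i][2] += G * m[j] * dz * inv_r3;
                    acc[j][0] -= G * m[i] * dx * inv_r3;   /* reaction on j */
                    acc[j][1] -= G * m[i] * dy * inv_r3;
                    acc[j][2] -= G * m[i] * dz * inv_r3;
                }
            }
        }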
  • 6. The OpenMP Implementation • MUST be based on the "naive" version: each thread then writes only the accelerations of its own particles, whereas the pair-once version would create write conflicts on the second particle of each pair. • We lose the "/2", but we gain the "/p"! • Note: the static schedule seems to be slightly faster than the dynamic schedule. A sketch follows below.
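    A hedged sketch of how the OpenMP version could look (the slide only states that it parallelizes the naive loop with a static schedule; names are assumed):

        #include <math.h>

        /* Assumed OpenMP sketch: the outer loop of the naive version is split
           statically across the threads.  Each thread writes only acc[i] for
           its own values of i, so no synchronization is needed; the work per
           step becomes N^2 / p instead of N^2 / 2. */
        void accel_omp(int n, const double *m, const double (*pos)[3],
                       double (*acc)[3], double G, double eps)
        {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++) {
                double ax = 0.0, ay = 0.0, az = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j == i) continue;
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dz = pos[j][2] - pos[i][2];
                    double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
                    double inv_r3 = 1.0 / (r2 * sqrt(r2));
                    ax += G * m[j] * dx * inv_r3;
                    ay += G * m[j] * dy * inv_r3;
                    az += G * m[j] * dz * inv_r3;
                }
                acc[i][0] = ax; acc[i][1] = ay; acc[i][2] = az;
            }
        }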
  • 7. Analysis — one star is printed per inner-loop iteration to visualize the iteration space (illustrated on the slide with rows of asterisks):

        for (i = 0; i < N; i++) {
            for (j = i + 1; j < N; j++) {
                printf("*");
            }
            printf("\n");
        }

    With j starting at i + 1, each row of stars is one shorter than the previous one, giving N(N-1)/2 stars in total instead of N².
  • 8. The CUDA Implementation (figure: basic CUDA GPU architecture).
  • 9.–21. (Diagram sequence) The tiled shared-memory scheme, shown step by step for N = 15 bodies and a tile size of K = 3: each block repeatedly copies the next K particle entries from global memory into its shared memory bank (active transfers), waits at a barrier, lets its threads compute against the staged tile (active tasks), waits at another barrier, and moves on to the next tile until all 15 entries have been processed. A code sketch of this scheme follows below.
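    A hedged CUDA sketch of the tiling pattern the diagrams describe (this follows the classic tiled N-body kernel; it is not claimed to be the author's exact code, and the tile size below is only an example):

        /* Assumed CUDA sketch of the tiled scheme: each thread owns one body,
           and the block cooperatively stages K bodies at a time in shared
           memory; __syncthreads() plays the role of the barriers in the
           diagrams.  Launch with accel_tiled<<<(n + K - 1) / K, K>>>(...). */
        #define K 128                        /* tile size = block size (example) */

        __global__ void accel_tiled(int n, const float4 *pos,  /* xyz + mass in w */
                                    float3 *acc, float G, float eps)
        {
            __shared__ float4 tile[K];
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
            float3 ai = make_float3(0.f, 0.f, 0.f);

            for (int base = 0; base < n; base += K) {
                int j = base + threadIdx.x;                     /* active transfer */
                tile[threadIdx.x] = (j < n) ? pos[j]
                                            : make_float4(0.f, 0.f, 0.f, 0.f);
                __syncthreads();                                /* BARRIER */
                for (int k = 0; k < K && base + k < n; k++) {   /* active tasks */
                    float dx = tile[k].x - pi.x;
                    float dy = tile[k].y - pi.y;
                    float dz = tile[k].z - pi.z;
                    float r2 = dx*dx + dy*dy + dz*dz + eps*eps;
                    float s  = G * tile[k].w * rsqrtf(r2 * r2 * r2);
                    ai.x += s * dx;  ai.y += s * dy;  ai.z += s * dz;
                    /* the k == i - base term contributes zero (dx = dy = dz = 0) */
                }
                __syncthreads();                                /* BARRIER */
            }
            if (i < n) acc[i] = ai;
        }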
  • 22. Analysis • C: cost of the CalculateForce function. • M: transfer cost between global and shared memory. • T: transfer cost between CPU and device memory. • Access to shared memory is around 100x faster than access to global memory. A reconstructed cost sketch follows below.
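    The cost expression itself did not survive the transcript; a plausible reconstruction for one time step under the tiling scheme above (an assumption, not the author's formula) is:

        % Hypothetical per-step cost, assuming N bodies and tile size K,
        % using the symbols defined on the slide (C, M, T).
        T_{\text{step}} \;\approx\;
            \underbrace{2NT}_{\text{CPU}\leftrightarrow\text{device}}
          \;+\; \underbrace{\frac{N^{2}}{K}\,M}_{\text{global}\rightarrow\text{shared}}
          \;+\; \underbrace{N^{2}\,C}_{\text{force evaluations}}

    Each block of K threads reads each of the N bodies into shared memory only once, so the M term is reduced by a factor of K compared with fetching every interaction operand from global memory, which is where the roughly 100x speed difference pays off.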
  • 23. Experimental Results — how much would it cost? Testing environment: Dell PowerEdge R610; 2x Intel Xeon Quad-Core E5520 at 2.27 GHz with Hyper-Threading (8 physical cores, 16 threads); 16 GB RAM; NVIDIA Tesla S2050; Ubuntu Server 10.04 LTS; GCC 4.4.3; CUDA 5.0. Cost on Amazon EC2 (General Purpose m1.large plan for the CPU versions, GPU Instances g2.2xlarge plan for CUDA):

        Version   Cost
        Naive     $0.49
        Smart     $0.33
        OMP       $0.08
        CUDA      $0.05
  • 24. Conclusions • PRAM is an adequate model for the sequential and OpenMP versions. • But for CUDA we need a better model, one that accounts for thread blocks, warps, and latency. Thanks!
  • 25. Additional Slides
  • 26. About the nbody problem • Calculations. Force (acceleration), where $r_{ij}$ is the vector from particle $i$ to particle $j$ and $\varepsilon$ is the softening factor (collisionless system of virtual particles):

        f_i \;\approx\; G\,m_i \sum_{\substack{1 \le j \le N \\ j \ne i}}
            \frac{m_j\, r_{ij}}{\left( r_{ij}^{2} + \varepsilon^{2} \right)^{3/2}}

    Energy (kinetic and potential):

        E = E_k + E_p, \qquad
        E_k = \sum_{1 \le i \le N} \frac{m_i v_i^{2}}{2}, \qquad
        E_p = -\sum_{1 \le i < j \le N} \frac{G\, m_i m_j}{r_{ij}}
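    Slide 3 states that the total energy should stay constant, so a small check routine can be built directly from these formulas (a hedged sketch; function and variable names are assumed):

        #include <math.h>

        /* Assumed sketch of an energy check built from the formulas above:
           E = Ek + Ep should remain (approximately) constant across steps. */
        double total_energy(int n, const double *m, const double (*pos)[3],
                            const double (*vel)[3], double G)
        {
            double ek = 0.0, ep = 0.0;
            for (int i = 0; i < n; i++) {
                double v2 = vel[i][0]*vel[i][0] + vel[i][1]*vel[i][1]
                          + vel[i][2]*vel[i][2];
                ek += 0.5 * m[i] * v2;                 /* kinetic: m v^2 / 2 */
                for (int j = i + 1; j < n; j++) {      /* each pair once */
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dz = pos[j][2] - pos[i][2];
                    double r = sqrt(dx*dx + dy*dy + dz*dz);
                    ep -= G * m[i] * m[j] / r;         /* potential */
                }
            }
            return ek + ep;
        }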
  • 27. About the nbody problem