Brief Overview of a Parallel Nbody Code

This is a brief overview of an nbody code, covering its sequential and parallel versions (OpenMP and CUDA) and their computational complexity.

Published in: Technology, Education

Transcript

  • 1. Brief overview of a parallel nbody code: implementation and analysis. Filipo Novo Mór, Graduate Program in Computer Science, UFRGS. Prof. Nicollas Maillard. December 2013.
  • 2. Overview • About the nbody problem • The Serial Implementation • The OpenMP Implementation • The CUDA Implementation • Experimental Results • Conclusion
  • 3. About the nbody problem Features: • Force calculation between all particles. • Complexity O(N²). • Energy should be constant. • The brute force algorithm demands huge computational power.
  • 4. The Serial Implementation NAIVE! • Clearly O(N²). • Each pair is evaluated twice. • Acceleration has to be adjusted at the end.
  • 5. The Serial Implementation • It is still O(N²), but: • Each pair is evaluated only once. • Acceleration is correct at the end!
  • 6. The OpenMP Implementation • MUST be based on the “naive” version. • We lose the “/2”, but we gain the “/p”! • Note: the static schedule seems to be slightly faster than the dynamic schedule.
  • 7. Analysis ***** ***** ***** ***** ***** for (i = 0; i < N; i++) { for (j = i + 1; j < N; j++) { printf("*"); } printf("\n"); }
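The triangular loop on the Analysis slide prints N-1 stars on the first row, N-2 on the second, and so on down to zero, for a total of (N-1) + (N-2) + ... + 1 + 0 = N(N-1)/2. The small sketch below counts the iterations instead of printing them; the function name `count_pairs` is made up for this illustration.

```c
/* Counts how many times the inner body of the triangular loop runs.
 * The closed form is n * (n - 1) / 2, which is still O(n^2). */
unsigned long count_pairs(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; i++) {
        for (unsigned long j = i + 1; j < n; j++) {
            count++; /* one pair (i, j) with i < j */
        }
    }
    return count;
}
```

For example, `count_pairs(6)` returns 15, half of the 30 ordered pairs a full double loop would visit.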
  • 8. The CUDA Implementation Basic CUDA GPU architecture
  • 9. [Figure: global memory holding N = 15 particles (0 to 14) and a shared memory bank, with tile size K = 3.]
  • 10. [Figure: global memory and shared memory bank; active tasks and active transfers marked; BARRIER.]
  • 11. [Figure: global memory and shared memory bank; tile of particles 0, 1, 2 shown in transfer.]
  • 12. [Figure: global memory and shared memory bank; BARRIER.]
  • 13. [Figure: global memory and shared memory bank; tile of particles 3, 4, 5 shown in transfer.]
  • 14. [Figure: global memory and shared memory bank; BARRIER.]
  • 15. [Figure: global memory and shared memory bank; tile of particles 6, 7, 8 shown in transfer.]
  • 16. [Figure: global memory and shared memory bank; BARRIER.]
  • 17. [Figure: global memory and shared memory bank; tile of particles 9, 10, 11 shown in transfer.]
  • 18. [Figure: global memory and shared memory bank; BARRIER.]
  • 19. [Figure: global memory and shared memory bank; tile of particles 12, 13, 14 shown in transfer.]
  • 20. [Figure: global memory and shared memory bank; active tasks and active transfers marked.]
  • 21. [Figure: global memory and shared memory bank.]
  • 22. Analysis C: cost of the CalculateForce function. M: transfer cost between global and shared memories. T: transfer cost between CPU and device memories. • Access to shared memory is around 100x faster than access to global memory.
  • 23. Experimental Results
    Testing Environment:
      • Dell PowerEdge R610
      • 2 Intel Xeon Quad-Core E5520 2.27 GHz, Hyper-Threading
      • 8 physical cores, 16 threads
      • RAM 16GB
      • NVIDIA Tesla S2050
      • Ubuntu Server 10.04 LTS
      • GCC 4.4.3
      • CUDA 5.0
    How much would it cost?
      Version   Cost
      Naive     $0.49
      Smart     $0.33
      OMP       $0.08
      CUDA      $0.05
    Amazon EC2:
      • General Purpose: m1.large plan
      • GPU Instances: g2.2xlarge plan
  • 24. Conclusions • PRAM is OK for sequential and OpenMP. • But for CUDA, we need a better model! – Considering block threads, warps and latency. Thanks!
  • 25. Additional Slides
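  • 26. About the nbody problem • Calculations. Force (acceleration): $f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}}$. Energy (kinetic and potential): $E = E_k + E_p$, with $E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}$ and $E_p = -\sum_{\substack{1 \le j \le N \\ i \ne j}} \frac{G m_i m_j}{\lVert \mathbf{r}_{ij} \rVert}$. Softening factor $\varepsilon$: collisionless system, virtual particles.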
  • 27. About the nbody problem
