# Brief Overview of a Parallel Nbody Code

This is a brief overview of a nbody code, including the sequential and parallel versions (OpenMP and CUDA) and its computational complexity.

Published in: Technology, Education

### Transcript

• 1. Brief overview of a parallel nbody code: implementation and analysis. Filipo Novo Mór, Graduate Program in Computer Science, UFRGS. Prof. Nicollas Maillard. December 2013.
• 2. Overview: About the nbody problem; The Serial Implementation; The OpenMP Implementation; The CUDA Implementation; Experimental Results; Conclusion.
• 3. About the nbody problem. Features: force calculation between all pairs of particles; complexity O(N²); the total energy should remain constant; the brute-force algorithm demands huge computational power.
• 4. The Serial Implementation: naive! Clearly O(N²); each pair is evaluated twice; the accelerations have to be adjusted at the end.
• 5. The Serial Implementation: still in the O(N²) domain, but each pair is evaluated only once, and the accelerations are already correct at the end!
• 6. The OpenMP Implementation: it MUST be based on the "naive" version. We lose the "/2", but we gain the "/p"! Note: the static schedule seems to be slightly faster than the dynamic schedule.
• 7. Analysis. The inner loop starting at j = i+1 visits each pair exactly once, N(N−1)/2 iterations in total, as the star triangle printed by this snippet illustrates:

      for (i = 0; i < N; i++) {
          for (j = i + 1; j < N; j++) {
              printf("*");
          }
          printf("\n");
      }
• 8. The CUDA Implementation. [Figure: basic CUDA GPU architecture.]
• 9.–21. [Animation slides: shared-memory tiling with N = 15 bodies and tile width K = 3. In each round the next K body positions are copied from global memory into the shared memory bank ("Active Transfers"), a barrier waits for the tile to finish loading, all 15 threads ("Active Tasks") accumulate forces against the K cached bodies, and a second barrier precedes the next transfer, until every body has been streamed through shared memory.]
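The tiling walkthrough above can be sketched as a CUDA kernel (a minimal sketch, not the original code: positions are assumed stored as float4 with G·m in the w component, and names like `calculate_forces` are illustrative):

```cuda
// Tiled O(N^2) force kernel: each block stages blockDim.x bodies at a time
// into shared memory, synchronizes, and lets every thread accumulate
// interactions against the cached tile, matching the barrier pattern in
// the animation.  Launch with blockDim.x * sizeof(float4) shared memory.
__global__ void calculate_forces(int n, const float4 *pos, float3 *acc,
                                 float eps2)
{
    extern __shared__ float4 tile[];            // K = blockDim.x cached bodies
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 myPos = (i < n) ? pos[i] : make_float4(0, 0, 0, 0);
    float3 a = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0, 0, 0, 0);
        __syncthreads();                        // BARRIER: tile fully loaded
        for (int k = 0; k < blockDim.x && base + k < n; k++) {
            float4 bj = tile[k];
            float3 r = make_float3(bj.x - myPos.x, bj.y - myPos.y,
                                   bj.z - myPos.z);
            float d2  = r.x * r.x + r.y * r.y + r.z * r.z + eps2;  // softening
            float inv = rsqrtf(d2);
            float s   = bj.w * inv * inv * inv; // w holds G * m_j
            a.x += r.x * s;  a.y += r.y * s;  a.z += r.z * s;
        }
        __syncthreads();                        // BARRIER: before next transfer
    }
    if (i < n) acc[i] = a;
}
```

Each body is read from global memory once per block instead of once per thread, which is where the shared-memory speedup of the next slide comes from.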
• 22. Analysis. C: cost of the CalculateForce function. M: transfer cost between global and shared memory. T: transfer cost between the CPU and device memories. Note: access to shared memory is around 100× faster than access to global memory.
• 23. Experimental Results. How much would it cost? Testing environment: Dell PowerEdge R610; 2× Intel Xeon Quad-Core E5520 2.27 GHz with Hyper-Threading (8 physical cores, 16 threads); 16 GB RAM; NVIDIA Tesla S2050; Ubuntu Server 10.04 LTS; GCC 4.4.3; CUDA 5.0. Cost per run on Amazon EC2 (General Purpose m1.large plan for the CPU versions, GPU Instances g2.2xlarge plan for CUDA): Naive $0.49, Smart $0.33, OMP $0.08, CUDA $0.05.
• 24. Conclusions. PRAM is an adequate model for the sequential and OpenMP versions, but for CUDA we need a better model, one that considers thread blocks, warps, and memory latency. Thanks!
• 25. Additional Slides
• 26. About the nbody problem. Calculations:

Force (acceleration):

$$ f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \vec{r}_{ij}}{\left( r_{ij}^2 + \varepsilon^2 \right)^{3/2}} $$

Energy (kinetic and potential):

$$ E = E_k + E_p, \qquad E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}, \qquad E_p = - \sum_{1 \le i < j \le N} \frac{G m_i m_j}{r_{ij}} $$

The softening factor ε keeps the system collisionless (the particles are virtual).
• 27. About the nbody problem