Brief overview of a parallel nbody code
Implementation and analysis

Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
2013, December
Overview
• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion
About the nbody problem
Features:
 Force calculation between all particles.
 Complexity O(N²).
 Energy should be constant.
 The brute-force algorithm demands huge computational power.
The Serial Implementation
NAIVE!

• Clearly O(N²).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end.
The Serial Implementation

• It is still in the O(N²) domain, but:
• Each pair is evaluated only once.
• Acceleration is already correct at the end; no adjustment step is needed.
The OpenMP Implementation

• MUST be based on the “naive” version: in the each-pair-once variant, different threads would update the same acceleration entry, creating write conflicts.
• We lose the “/2”, but we gain the “/p”!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.
Analysis

How much work does the pair loop do? The inner loop starts at j = i + 1, so the output is a shrinking triangle of (N−1) + (N−2) + … + 1 = N(N−1)/2 stars, half of the full N² grid:

for (i = 0; i < N; i++)
{
    for (j = i + 1; j < N; j++)
    {
        printf("*");
    }
    printf("\n");
}
The CUDA Implementation

Basic CUDA GPU architecture
[Figure: block diagram of the basic CUDA GPU architecture.]

[Figures, one per step: the tiled kernel for N = 15 bodies with tile size K = 3. Global Memory holds all 15 bodies (0–14). In each step the Active Transfers copy one K-body tile (0–2, then 3–5, 6–8, 9–11, 12–14) into the Shared Memory Bank, a BARRIER separates the load phase from the compute phase, and the Active Tasks (all 15 threads) then evaluate forces against the staged tile. After N/K = 5 tiles, every body has interacted with every other.]
Analysis

C : cost of the CalculateForce function.
M : transfer cost between global and shared memories.
T : transfer cost between CPU and device memories.

 Access to shared memory is around 100× faster than access to global memory.
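The slides stop at the definitions; one plausible way to compose these costs for the tiled kernel (an assumption, not the original analysis) is, with p parallel threads:

```latex
% Assumption: illustrative composition, not from the slides.
% Per tile step: K global-to-shared loads (cost M each) and
% K force evaluations per thread (cost C each), N threads on p processors.
T_{\text{kernel}} \;\approx\;
  \underbrace{2T}_{\text{host}\leftrightarrow\text{device}}
  \;+\; \frac{N}{K}\left( K M + K C \cdot \frac{N}{p} \right)
  \;=\; 2T + N M + \frac{N^2}{p}\,C
```

The N² term keeps its quadratic shape but is divided by p and fed at shared-memory speed, which is what the tiling buys.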
Experimental Results

How much would it cost?

Testing Environment:
 Dell PowerEdge R610
 2× Intel Xeon Quad-Core E5520, 2.27 GHz, Hyper-Threading
 8 physical cores, 16 threads
 16 GB RAM
 NVIDIA Tesla S2050
 Ubuntu Server 10.04 LTS
 GCC 4.4.3
 CUDA 5.0

Amazon EC2 pricing plans:
 General Purpose: m1.large
 GPU Instances: g2.2xlarge

Version  Cost
Naive    $0.49
Smart    $0.33
OMP      $0.08
CUDA     $0.05
Conclusions
• PRAM is OK for sequential and OpenMP.
• But for CUDA, we need a better model!
– One that considers thread blocks, warps and latency.

Thanks!
Additional Slides
About the nbody problem
• Calculations

Force (acceleration), with softening factor ε:

    f_i \approx G m_i \sum_{1 \le j \le N,\ j \ne i} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}}

Energy (kinetic and potential):

    E = E_k + E_p
    E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}
    E_p = - \sum_{1 \le j \le N,\ i \ne j} \frac{G m_i m_j}{\lVert \mathbf{r}_{ij} \rVert}

The softening factor ε suits collisionless systems: the particles are virtual, and ε keeps the force from diverging when two of them pass arbitrarily close.