Brief overview of a parallel nbody code
Implementation and analysis

Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013
Overview
• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion
About the nbody problem
Features:
• Force calculation between all pairs of particles.
• Complexity O(N²).
• Total energy should be constant.
• The brute-force algorithm demands huge computational power.
The Serial Implementation
NAIVE!

• Clearly O(N²).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end.
The Serial Implementation
(the "smart" version)

• Still in the O(N²) domain, but:
• Each pair is evaluated only once.
• Acceleration is correct at the end!
The OpenMP Implementation

• MUST be based on the "naive" version.
• We lose the "/2", but we gain the "/p"!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.
Analysis

For N = 6, the triangular loop below prints a descending triangle — N(N−1)/2 stars in total, which is where the "/2" of the smart version comes from:

*****
****
***
**
*

int i, j;
for (i = 0; i < N; i++)
{
    for (j = i + 1; j < N; j++)
    {
        printf("*");
    }
    printf("\n");
}
The CUDA Implementation

Basic CUDA GPU architecture
(Figure sequence: tiled execution with N = 15 bodies and tile size K = 3.
All 15 bodies live in global memory. In each phase, one K-body tile —
0–2, then 3–5, 6–8, 9–11, and 12–14 — is copied into a shared memory
bank ("Active Transfers"); a BARRIER separates the transfer from the
computation, in which all 15 active tasks accumulate forces against the
cached tile; a second BARRIER precedes the next transfer. After the last
tile, the results are written back to global memory.)
Analysis

C : cost of the CalculateForce function.
M : transfer cost between global and shared memories.
T : transfer cost between CPU and device memories.
• Access to shared memory is around 100X faster than to global memory.
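The preview cuts the slide off before any cost expression, so the following is only a hedged sketch of how the three constants might combine — my reconstruction under the tiling scheme above (N bodies, tiles of K), not the slide's own formula:

```latex
% One force pass per thread: N/K tiles, each costing one shared-memory
% load (M) plus K force evaluations (C); host--device transfers (T)
% happen once per pass.
\text{Cost} \;\approx\; T + \frac{N}{K}\left(M + K\,C\right)
           \;=\; T + \frac{N M}{K} + N C
```

Under this sketch the dominant N·C term pays only shared-memory latency, while the ~100X slower global accesses are amortized by the factor 1/K.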
Experimental Results

How much would it cost?
Testing Environment:
• Dell PowerEdge R610
• 2× Intel Xeon Quad-Core E5520 2.27 GHz with Hyper-Threading
  (8 physical cores, 16 threads)
• 16 GB RAM
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS
• GCC 4.4.3
• CUDA 5.0

Amazon EC2 pricing:
• General Purpose - m1.large plan
• GPU Instances - g2.2xlarge plan

Version   Cost
Naive     $ 0.49
Smart     $ 0.33
OMP       $ 0.08
CUDA      $ 0.05
Conclusions
• PRAM is OK for sequential and OpenMP.
• But for CUDA, we need a better model!
  – Considering block threads, warps and latency.

Thanks!
Additional Slides
About the nbody problem
• Calculations

Force (acceleration), with softening factor \(\varepsilon\) (collisionless system, virtual particles):

\[
  \vec f_i \;\approx\; G\, m_i \sum_{\substack{1 \le j \le N \\ j \ne i}}
  \frac{m_j\, \vec r_{ij}}{\left(\lVert \vec r_{ij} \rVert^2 + \varepsilon^2\right)^{3/2}}
\]

Energy (kinetic and potential):

\[
  E = E_k + E_p, \qquad
  E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}, \qquad
  E_p = -\sum_{\substack{1 \le j \le N \\ i \ne j}} \frac{G\, m_i m_j}{r_{ij}}
\]
About the nbody problem
This is a brief overview of a nbody code, including the sequential and parallel versions (OpenMP and CUDA) and its computational complexity.

