Brief Overview of a Parallel Nbody Code
Implementation and analysis

Filipo Novo Mór
Graduate Program in Computer Science UFRGS
Prof. Nicollas Maillard
2013, December
Overview
• About the nbody problem
• The Serial Implementation

• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results

• Conclusion
About the nbody problem
Features:
• Forces are calculated between all pairs of particles.
• Complexity: O(N^2).
• Total energy should remain constant.
• The brute-force algorithm demands huge computational power.
The Serial Implementation
NAIVE!

• Clearly O(N^2).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end.
The Serial Implementation

• It is still O(N^2), but:
• Each pair is evaluated only once.
• Acceleration is already correct at the end.
The OpenMP Implementation

• MUST be based on the "naive" version.
• We lose the "/2" (each pair is evaluated twice again), but we gain the "/p" (the work is divided among p threads)!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.
Analysis

*****
*****
*****
*****
*****

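#include <stdio.h>

int main(void)
{
    int N = 6;  /* example size (assumed; not given on the slide) */

    /* Print one '*' for every pair (i, j) with j > i. */
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++)
            printf("*");
        printf("\n");
    }
    return 0;
}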
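The loop above prints one star per evaluated pair, so counting the stars gives the work of the "each pair once" version. A short worked count, under the idealized assumption that every pair evaluation costs the same and that OpenMP overhead is ignored (p is the number of threads, as in the "/p" remark):

```latex
\[
\sum_{i=0}^{N-1} (N - 1 - i) \;=\; \frac{N(N-1)}{2}
\]
\[
W_{\text{naive}} = N(N-1), \qquad
W_{\text{pairs once}} = \frac{N(N-1)}{2}, \qquad
W_{\text{OpenMP, per thread}} \approx \frac{N(N-1)}{p}
\]
```

All three remain O(N^2); only the constant factor changes.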
The CUDA Implementation

Basic CUDA GPU architecture

[Animated diagram, N = 15, K = 3. Labels: Global Memory, Shared Memory Bank, Active Tasks, Active Transfers, BARRIER. Bodies 0 to 14 reside in global memory. In successive frames, groups of K = 3 bodies (0 1 2, then 3 4 5, 6 7 8, 9 10 11, 12 13 14) are transferred from global memory into the shared memory bank, with a BARRIER separating each transfer phase from the following task phase.]
Analysis

C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.

• Access to shared memory is around 100x faster than access to global memory.
Experimental Results

How much would it cost?
Testing Environment:
• Dell PowerEdge R610
• 2 Intel Xeon Quad-Core E5520 2.27 GHz, Hyper-Threading
• 8 physical cores, 16 threads
• RAM: 16 GB
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS
• GCC 4.4.3
• CUDA 5.0

Version    Cost
Naive      $ 0.49
Smart      $ 0.33
OMP        $ 0.08
CUDA       $ 0.05

Amazon EC2:
• General Purpose: m1.large plan
• GPU Instances: g2.2xlarge plan
Conclusions
• PRAM is OK for sequential and OpenMP.
• But for CUDA, we need a better model!
– Considering block threads, warps and latency.

Thanks!
Additional Slides
About the nbody problem
• Calculations
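Force (acceleration)

\[
f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}}
\frac{m_j \, r_{ij}}{\left( \lVert r_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}}
\]

Energy (kinetic and potential)

\[
E = E_k + E_p
\]
\[
E_p = - \sum_{1 \le i < j \le N} \frac{G m_i m_j}{\lVert r_{ij} \rVert}
\]
\[
E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2}
\]

Softening factor (ε):
• collisionless system
• virtual particles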