1. Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
4. Motivation
• The scaling trend of CMOS technology has to stop:
✓ Direct-tunneling limit in SiO2 ~3 nm
✓ Distance between Si atoms ~0.3 nm
✓ Variability
• Fundamental economic reason: rising fabrication cost
5. Motivation
• Easy to build multi-core processors
• Concurrent software requires human effort to modify and adapt
• New classification needed for computer architectures
6. Classification
[Diagram: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD, each shown as a combination of instruction pool, data pool, and one or more CPUs]
8. Levels
recursion
memory
management
profiling
data dependency
branching overhead
control flow
algorithm
loop level
process management
SMP Multiprogramming
Multithreading and Scheduling
9. Backfire
• Difficulty in fully exploiting the offered parallelism
• Automatic tools required to adapt software to parallelism
• Compiler support for manual or semi-automatic enhancement
10. Applications
• OpenMP and MPI are two popular tools used to simplify parallelizing both new and old software
• Mathematics and Physics
• Computer Science
• Biomedicine
11. Specific Problem and Background
• Sally3D is a micromagnetism program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully cover the problem (no Infiniband or combined OpenMP+MPI solutions)
13. Strategy
• Install a Linux kernel with an ad-hoc configuration for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards; a smoke test follows this list)
• Add the Infiniband link among cluster nodes, with proper drivers in kernel and user space
• Select an MPI implementation library
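
A quick way to confirm that the rebuilt GCC really enables OpenMP is a one-file smoke test. This is a minimal sketch (the file and program names are illustrative), compiled with gfortran -fopenmp:

    ! check_omp.f90 -- verify OpenMP support in the freshly built GCC
    program check_omp
      use omp_lib                    ! OpenMP runtime library routines
      implicit none
      ! Reports how many threads a parallel region may use
      print *, 'OpenMP max threads: ', omp_get_max_threads()
    end program check_omp

If it prints a thread count greater than one on a multi-core node, the toolchain is ready.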
14. Strategy
• Verify the Infiniband network through some MPI test examples (a minimal ping-pong sketch follows this list)
• Install the target software
• Proceed to include OpenMP and MPI directives in the code
• Run test cases
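
A typical MPI test example for the link is a two-rank ping-pong that times one message round trip. Below is a minimal sketch, assuming exactly two ranks launched with mpirun -np 2; the 1024-element message size is illustrative:

    ! pingpong.f90 -- one timed round trip between rank 0 and rank 1
    program pingpong
      use mpi
      implicit none
      integer, parameter :: n = 1024          ! message length (illustrative)
      real(8) :: buf(n), t0, t1
      integer :: ierr, rank, stat(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = 0.0d0

      t0 = MPI_Wtime()
      if (rank == 0) then
         call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
         call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, stat, ierr)
      else if (rank == 1) then
         call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
         call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, ierr)
      end if
      t1 = MPI_Wtime()

      if (rank == 0) print *, 'round trip: ', (t1 - t0) * 1.0d6, ' us'
      call MPI_Finalize(ierr)
    end program pingpong

Sweeping n over increasing sizes reproduces latency curves like the ones on slides 27 and 28.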
15. OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple construction methods (see the sketch below)
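
The simplest of those constructs is a directive placed above an existing loop. Since Sally3D is written in FORTRAN, here is a minimal sketch of a parallel do; the array names are illustrative, not taken from the actual code:

    ! omp_loop.f90 -- parallelize an existing loop with one directive
    program omp_loop
      implicit none
      integer, parameter :: n = 1000000
      integer :: i
      real(8), allocatable :: a(:), b(:)

      allocate(a(n), b(n))
      b = 1.0d0
      !$omp parallel do              ! split the iterations across threads
      do i = 1, n
         a(i) = 2.0d0 * b(i)
      end do
      !$omp end parallel do
      print *, a(n)
    end program omp_loop

Compiled without -fopenmp the directive is ignored as a comment, so the serial program is untouched; with -fopenmp the loop runs on all available cores.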
25. MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available (a minimal skeleton follows this list)
- OpenMPI
- MVAPICH
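
Whichever implementation is chosen, the program structure is the same: initialize, query the process's rank and the communicator size, communicate, finalize. A minimal sketch:

    ! mpi_hello.f90 -- the basic MPI skeleton shared by OpenMPI and MVAPICH
    program mpi_hello
      use mpi
      implicit none
      integer :: ierr, rank, nprocs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! this process's id
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! total process count
      print *, 'rank ', rank, ' of ', nprocs
      call MPI_Finalize(ierr)
    end program mpi_hello

Built with mpif90 and launched with, e.g., mpirun -np 8 ./mpi_hello, the same binary runs over TCP or Infiniband depending on the library's transport selection.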
26. Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
27. MPI over Infiniband
[Chart: round-trip latency of OpenMPI vs. Mvapich2, log scale from 1 µs to 10,000,000 µs, across message sizes from bytes up to gigabytes]
28. MPI over Infiniband
[Chart: round-trip latency of OpenMPI vs. Mvapich2, log scale from 1 µs to 10,000,000 µs, across message sizes from bytes up to megabytes]
29. Optimizations
• Enabled at compile time
• Available only after porting the software to standard FORTRAN
• Substantial documentation available
• Unexpectedly positive results
32. Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models
37. Actual Results
OMP MPI seconds
* * 59
* - 129
- * 174
- - 249
Function Name Normal OpenMP MPI OpenMP+MPI
calc_intmudua 24.5 s 4.7 s 14.4 s 2.8 s
calc_hdmg_tet 16.9 s 3.0 s 10.8 s 1.7 s
calc_mudua 12.1 s 1.9 s 7.0 s 1.1 s
campo_effettivo 17.7 s 4.5 s 9.9 s 2.3 s
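
The last column reflects the hybrid pattern: MPI distributes the work across nodes while OpenMP threads the loop within each node. A minimal sketch of that pattern (illustrative names and workload, not the Sally3D routines):

    ! hybrid.f90 -- MPI across nodes, OpenMP within each node
    program hybrid
      use mpi
      implicit none
      integer, parameter :: n = 1000000
      integer :: ierr, rank, nprocs, provided, i, lo, hi
      real(8) :: partial, total

      ! Ask for threaded MPI: only the master thread makes MPI calls
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Each rank takes a contiguous slice of the iteration space
      lo = rank * n / nprocs + 1
      hi = (rank + 1) * n / nprocs

      partial = 0.0d0
      !$omp parallel do reduction(+:partial)   ! threads share the slice
      do i = lo, hi
         partial = partial + dble(i)
      end do
      !$omp end parallel do

      ! Combine the per-node partial sums on rank 0
      call MPI_Reduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'total = ', total
      call MPI_Finalize(ierr)
    end program hybrid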
38. Actual Results
• OpenMP – 6-8x speedup
• MPI – 2x speedup
• OpenMP + MPI – 14-16x speedup
Total raw runtime reduction: 76% (from 249 s to 59 s)
40. Conclusions and Future Work
• Computational time has been significantly decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size