Mapping Parallel Programs into Hierarchical Distributed Computer Systems

Prof. Victor G. Khoroshevsky and Mikhail G. Kurnosov
Computer Systems Laboratory,
The A.V. Rzhanov Institute of Semiconductor Physics of Siberian Branch of
Russian Academy of Sciences,
13 Lavrentyev ave., 630090 Novosibirsk, Russia
E-mail: mkurnosov@isp.nsc.ru
4th International Conference on Software and Data Technologies (ICSOFT 2009)
Sofia, Bulgaria, 26 - 29 July, 2009

Mapping Parallel Programs into Hierarchical Distributed Computer Systems
Mapping High-Performance Linpack into a hierarchical computer cluster:
• Mapping by standard MPI tools (mpiexec): execution time 118 sec. (44 GFLOPS)
• Optimized mapping: execution time 100 sec. (53 GFLOPS)
[Figure: High-Performance Linpack task graph (NP=8, PMAP=1, BCAST=5)]
[Figure: computer cluster with hierarchical organization (Level 1, Level 2). Two SMP nodes: 2 x Intel Xeon 5150]

Related Work
1. Mapping parallel programs into computer systems (CS) with a
fixed network topology (hypercube, 3D torus, mesh, etc.). A parallel program
is represented by a task graph:
(Yu, 2006), (Chen et al. 2006), (Bhanot et al. 2005), (Jose, 1999),
(Ahmad, 1997), (Kalinowski, 1994), (Yau, 1993), (Ercal et al. 1990), (Lee, 1989),
(Bokhari, 1981).
2. Mapping parallel programs into CSs with an arbitrary topology. A parallel
program is represented by an unweighted task graph:
(Ucar et al., 2006), (Prakash et al., 2004), (Miquel et al., 2003), (Träff, 2002),
(Moh, 2001), (Perego, 1998), (Lee, 1989).
Algorithms that take into account the hierarchical organization of modern distributed
computer systems are needed.
The objective of our research is the development of models and algorithms for
mapping parallel programs into modern hierarchical computer systems
(such as multicore computer clusters).

Model of Hierarchical Organization of Distributed Computer System
Example of the hierarchical organization of a computer cluster:
N = 12; L = 3; n_{23} = 2; C_{23} = {9, 10, 11, 12}; g(3, 3, 4) = 2; z(1, 7) = 1
Notation:
C = {1, 2, …, N} is the set of processor cores;
L is the number of levels in the communication network;
n_l is the number of elements at level l ∈ {1, 2, …, L};
n_{lk} is the number of children of element k ∈ {1, 2, …, n_l} at level l;
C_{lk} is the set of processor cores belonging to the descendants of element k at level l; c_{lk} = |C_{lk}|;
b_l is the bandwidth of communication channels at level l (bit/sec.).
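The example values above can be checked with a short sketch. It assumes a regular tree (one root at level 1, three nodes at level 2, six processors at level 3, two cores per processor) and assumes that z(p, q) and g(l, k1, k2) denote the level of the lowest common ancestor of two cores and of two elements, respectively. The slide does not define them, but this reading reproduces the example. All function names are hypothetical.

```python
# Sketch of the 3-level example (N = 12, L = 3). Assumption: the tree is
# regular, so every element at a given level owns the same number of cores.
L = 3
CORES_PER_ELEMENT = {1: 12, 2: 4, 3: 2}  # level -> cores under one element


def cores_of(level, k):
    """Return C_lk: the set of cores under element k at the given level."""
    size = CORES_PER_ELEMENT[level]
    return set(range((k - 1) * size + 1, k * size + 1))


def element_of(core, level):
    """Return the index of the element at `level` that owns `core`."""
    return (core - 1) // CORES_PER_ELEMENT[level] + 1


def z(p, q):
    """Assumed meaning: deepest level whose element contains both cores."""
    for level in range(L, 0, -1):
        if element_of(p, level) == element_of(q, level):
            return level
    raise ValueError("cores do not share a root element")


def g(level, k1, k2):
    """Assumed meaning: level of the lowest common ancestor of two elements."""
    p = min(cores_of(level, k1))
    q = min(cores_of(level, k2))
    return min(level, z(p, q))


print(sorted(cores_of(2, 3)))  # [9, 10, 11, 12]
print(g(3, 3, 4), z(1, 7))     # 2 1
```

Under these assumptions the sketch returns C_{23} = {9, 10, 11, 12}, g(3, 3, 4) = 2 and z(1, 7) = 1, matching the slide.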

The Problem of Mapping Parallel Programs into Hierarchical Distributed Computer Systems
Given a task graph G = (V, E) and a description of the hierarchical organization of a
computer system (CS):
• V = {1, 2, …, M} is the set of parallel processes;
• E ⊆ V × V is the set of inter-process communications;
• d_{ij} is the volume of data transmitted between processes i and j during program execution;
• b_{z(p, q)} is the bandwidth of the communication channel between cores p and q.
A mapping is a function f : V → C, which is defined by the values of
\[
x_{ij} = \begin{cases} 1, & \text{if } f(i) = j; \\ 0, & \text{otherwise.} \end{cases}
\]
The objective is to minimize the program execution time T(X):
\[
T(X) = \max_{i \in V} \left\{ \sum_{j=1}^{M} \sum_{p=1}^{N} \sum_{q=1}^{N} x_{ip} \cdot x_{jq} \cdot \frac{d_{ij}}{b_{z(p,q)}} \right\} \to \min_{(x_{ij})}
\]
subject to the constraints:
\[
\sum_{j=1}^{N} x_{ij} = 1, \quad i = 1, 2, \dots, M,
\]
\[
\sum_{i=1}^{M} x_{ij} \le 1, \quad j = 1, 2, \dots, N,
\]
\[
x_{ij} \in \{0, 1\}, \quad i \in V, \; j \in C.
\]
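The Heuristic Algorithm TMMGP
Step 1 – Partitioning: the task graph is partitioned into k = ⌈M / c_{L1}⌉ subsets (V′1, V′2, V′3 in the figure).
Step 2 – Mapping.
[Figure: task graph partitioned into subsets V′1, V′2, V′3; hierarchical system with channel bandwidths b1, b2, b3]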
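The slide does not spell out Step 2. The sketch below assumes that the processes of subset i are placed on the cores of element i at the lowest level, that all such elements have the same number of cores, and that the number of subsets is the ceiling of M divided by that number. The function name and the sample partition are hypothetical.

```python
import math


def map_subsets_to_cores(subsets, cores_per_element):
    """Assumed Step 2: subset i goes to the cores of element i at level L."""
    mapping = {}
    for element, subset in enumerate(subsets):
        if len(subset) > cores_per_element:
            raise ValueError("subset does not fit into one element")
        first_core = element * cores_per_element + 1
        for offset, process in enumerate(sorted(subset)):
            mapping[process] = first_core + offset
    return mapping


M, cores_per_element = 8, 4
k = math.ceil(M / cores_per_element)    # assumed number of subsets
subsets = [{1, 2, 5, 6}, {3, 4, 7, 8}]  # hypothetical Step 1 output
f = map_subsets_to_cores(subsets, cores_per_element)
print(k, f)
```

Each process receives its own core, so the resulting f satisfies the constraint of at most one process per core.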

Task Graph Partitioning in the TMMGP algorithm
1. Coarsen the graph: Heavy Edge Matching (Karypis, Kumar, 1998)
2. Partition the graph Gm into k subsets by recursive bisection (Schloegel et al. 2003)
3. Refine the partition by the FM heuristic (Fiduccia, Mattheyses, 1982)
The computational complexity of the TMMGP algorithm is O(|E| log_2 k + M)
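The coarsening step can be sketched as follows. This is a minimal illustration of heavy edge matching and contraction, not the authors' implementation: vertices are visited in sorted order for reproducibility (the usual formulation visits them in random order), the graph is assumed to have no self-loops, and the function names and sample graph are hypothetical.

```python
def heavy_edge_matching(vertices, weights):
    """Match each unmatched vertex with the unmatched neighbour of heaviest edge.

    weights maps frozenset({u, v}) -> edge weight (no self-loops assumed).
    """
    neighbours = {v: {} for v in vertices}
    for edge, w in weights.items():
        u, v = tuple(edge)
        neighbours[u][v] = w
        neighbours[v][u] = w

    match = {}
    for u in sorted(vertices):
        if u in match:
            continue
        free = [(w, v) for v, w in neighbours[u].items() if v not in match]
        if free:
            _, v = max(free)  # heaviest edge; ties go to the larger label
            match[u], match[v] = v, u
        else:
            match[u] = u  # stays single and is copied to the coarser graph
    return match


def contract(match, weights):
    """Merge matched pairs; parallel edges between merged vertices are summed."""
    label = {v: min(v, m) for v, m in match.items()}
    coarse = {}
    for edge, w in weights.items():
        u, v = tuple(edge)
        a, b = label[u], label[v]
        if a != b:  # edges inside a merged pair disappear
            key = frozenset({a, b})
            coarse[key] = coarse.get(key, 0) + w
    return coarse


weights = {
    frozenset({1, 2}): 5,
    frozenset({2, 3}): 1,
    frozenset({3, 4}): 4,
    frozenset({1, 3}): 2,
}
match = heavy_edge_matching([1, 2, 3, 4], weights)
coarse = contract(match, weights)
print(match, coarse)
```

In this example the heavy edges (1, 2) and (3, 4) are collapsed, and the two light edges between the merged pairs become one coarse edge of weight 3.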

Software Tools for Mapping MPI Programs

Experiments Organization
MPI programs:
• NAS Parallel Benchmarks (NPB);
• High-Performance Linpack (HPL).
Computer clusters:
• Cluster Xeon16: 4 nodes (2 x Intel Xeon 5150), interconnect: Gigabit/Fast Ethernet;
• Cluster Opteron10: 5 nodes (2 x AMD Opteron 248), interconnect: Gigabit/Fast Ethernet.
[Figure: HPL task graph: 16 processes, PMAP=0, BCAST=5]
[Figure: NPB Conjugate Gradient task graph: 16 processes, CLASS B]
[Figure: NPB Multigrid task graph: 16 processes, CLASS B]

Experiment Results
The execution time of the TMMGP algorithm on an Intel Core 2 Duo 2.13 GHz processor is less than 1 sec.
The execution time of MPI benchmarks on the Xeon16 cluster:
Benchmark                  Interconnect       T(X_RR), sec.  T(X_TMMGP), sec.  Speedup T(X_RR) / T(X_TMMGP)
High-Performance Linpack   Fast Ethernet      1108.69        911.81            1.22
High-Performance Linpack   Gigabit Ethernet   263.15         231.72            1.14
NPB Conjugate Gradient     Fast Ethernet      726.02         400.36            1.81
NPB Conjugate Gradient     Gigabit Ethernet   97.56          42.05             2.32
NPB Multigrid              Fast Ethernet      23.94          23.90             1.00
NPB Multigrid              Gigabit Ethernet   4.06           4.03              1.00
• T(X_RR) is the execution time of an MPI benchmark mapped by the round-robin algorithm
of the mpiexec tool (MPICH2 1.0.6).
• T(X_TMMGP) is the execution time of an MPI benchmark mapped by the TMMGP algorithm.
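The speedup column is the ratio of the two measured times. The short sketch below recomputes it for the HPL and Conjugate Gradient rows and also shows the corresponding reduction in execution time; the helper names are hypothetical and the times are copied from the table.

```python
# (T(X_RR), T(X_TMMGP)) in seconds, copied from the table above.
times = {
    ("HPL", "Fast Ethernet"): (1108.69, 911.81),
    ("HPL", "Gigabit Ethernet"): (263.15, 231.72),
    ("NPB CG", "Fast Ethernet"): (726.02, 400.36),
    ("NPB CG", "Gigabit Ethernet"): (97.56, 42.05),
}


def speedup(t_rr, t_tmmgp):
    """Ratio of round-robin time to TMMGP time."""
    return t_rr / t_tmmgp


def reduction_percent(t_rr, t_tmmgp):
    """Share of the round-robin time saved by the TMMGP mapping, in percent."""
    return 100.0 * (1.0 - t_tmmgp / t_rr)


for (benchmark, network), (t_rr, t_tmmgp) in times.items():
    print(benchmark, network,
          round(speedup(t_rr, t_tmmgp), 2),
          round(reduction_percent(t_rr, t_tmmgp), 1))
```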

Conclusions and Future Work
Conclusions
• Mapping algorithms must take into account the hierarchical organization of modern
computer systems and the structure of parallel programs.
• The proposed TMMGP algorithm reduces the execution time of
MPI programs by 40% on average.
• New algorithms for mapping parallel programs with full task graphs are
required.
Future Work
• Development of new algorithms for mapping parallel programs into arbitrary
subsystems of hierarchical distributed computer systems.
• Integration of the TMMGP mapping algorithm with the mpiexec tool and resource
management systems (such as TORQUE).
• Application of the described approach to optimizing MPI collective operations.

Thank You For Your Attention

Backup Slides

The k-way Graph Partitioning Problem
It is required to partition the graph G′ = (V′, E′) into k disjoint subsets V′1, V′2, …, V′k such that
the maximal sum of edge weights incident to any subset is minimized and |V′i| ≤ s.
• w(u, v) is the weight of edge (u, v) ∈ E′;
• W(i, j) is an additional weight for edges incident to subsets i and j;
• c(u, v, i, j) = w(u, v)W(i, j) is the total weight of edge (u, v) incident to subsets i and j.
In the mapping problem: c(u, v, i, j) = d_{uv} / b_{g(L, i, j)}.
Example of 3-way graph partitioning:
V′ = {1, 2, …, 12}; k = 3; s = 3;
W(1, 2) = 3; W(1, 3) = 2; W(2, 3) = 4.
The approximate partition:
edge-weights(V′1) = w(1, 5)W(1, 2) + w(6, 8)W(1, 3) + w(2, 3)W(1, 3) = 22;
edge-weights(V′2) = 40;
edge-weights(V′3) = 38.
[Figure: graph partitioned into subsets V′1, V′2, V′3]
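The partitioning objective can be sketched as a function that sums the total weights c(u, v, i, j) = w(u, v)W(i, j) of the cut edges for each subset and takes the maximum. The W values are those of the slide example; the 6-vertex graph and its partition are hypothetical and are not the graph shown on the slide.

```python
def subset_costs(parts, w, W):
    """Sum c(u, v, i, j) = w(u, v) * W(i, j) over cut edges, per subset.

    parts: vertex -> subset index; w: {(u, v): weight}, one entry per edge;
    W: {(i, j): weight} with i < j.
    """
    costs = {i: 0 for i in set(parts.values())}
    for (u, v), weight in w.items():
        i, j = parts[u], parts[v]
        if i == j:
            continue  # edges inside a subset add nothing
        c = weight * W[min(i, j), max(i, j)]
        costs[i] += c
        costs[j] += c
    return costs


W = {(1, 2): 3, (1, 3): 2, (2, 3): 4}          # from the slide example
parts = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}   # hypothetical partition, s = 2
w = {(1, 2): 5, (2, 3): 1, (3, 4): 4, (4, 5): 2, (5, 6): 3, (1, 6): 1}

costs = subset_costs(parts, w, W)
objective = max(costs.values())
print(costs, objective)  # {1: 5, 2: 11, 3: 10} 11
```

A partitioner would search for the assignment `parts` that minimizes `objective` while keeping every subset within the size limit s.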

Heavy Edge Matching algorithm
[Figure: Heavy Edge Matching (HEM): matching in the source graph (vertices 1–16) and the resulting coarser graph]

Graph Bisection
[Figure: graph bisection; labels: initial vertex, bisection]
