Parallel Computing: State-of-the-Art Perspective,
E.H. D’Hollander, G.R. Joubert, F.J. Peters, D.
Trystram (Eds.), Elsevier, 1996
A programming environment for heterogeneous network
computing with transparent workload redistribution
M. Angelaccio, M. Cermele and M. Colajanni
Dipartimento di Informatica, Sistemi e Produzione
Università di Roma “Tor Vergata”, Via della Ricerca Scientifica, Roma, Italy
The project presented in this paper aims to extend the SPMD programming
paradigm to a computational platform composed of a network of heterogeneous
workstations with time-varying conditions. Presently, packages such as PVM and
MPI allow us to use a network of distributed nodes as a single parallel machine but
do not overcome the potential inefficiencies due both to heterogeneity and the
unpredictable variability of usually shared resources. The aim of this paper is to
illustrate an environment that supports SPMD programming on a network of
workstations and provides transparent dynamic data re-distribution. Our
experiments demonstrate that workload re-distribution support is necessary to
achieve satisfactory efficiency when the computational platform is subject to heavy
modifications.
1. INTRODUCTION
SPMD programming is a widely adopted paradigm for a large class of problems.
Nevertheless, it becomes hard to preserve efficiency when the computing platform is
highly irregular and subject to dynamically varying conditions. The SPMD
programming paradigm, in fact, requires the choice of a specific data decomposition,
and the insertion of primitives in a decomposition dependent way. This approach
yields parallel programs that correspond to a single data distribution and guarantee
adequate efficiency only for regular problems running on homogeneous static
platforms. On the other hand, there are several cases where both the data
decomposition and the hardware platform are subject to dynamic variations: for
example, problems such as molecular dynamics, in which the workload intrinsically
changes at run-time; heterogeneous network computing, where load balancing must
be adjusted when the available resources dynamically vary; and the recovery of
parallel programs in the presence of faulty nodes, provided that a run-time
process/data reconfiguration support is available. In all these cases the use of static
environments would lead to serious inefficiencies that can be avoided by adapting
the workload distribution (in this case corresponding to data decomposition) to the
modified framework. This can be obtained by decomposition and machine
independent (DMI) parallel programs that do not require specification of data
decomposition and target machine at compile-time.
Presently, two main frameworks (i.e. PVM and MPI) allow us to use a network of
distributed nodes as a single parallel machine, thus yielding the design of machine
independent (MI) programs. These packages effectively hide the differences among the
nodes of a distributed platform from the programmer, but they do not overcome the
potential inefficiencies due to heterogeneity and unpredictable variability of usually
shared resources. At the moment, the solution to this problem is completely left to
the programmer who has to face any random modification of the computing platform.
The intent of our project, called DAME (DAta Migration Environment), is twofold:
first, to allow writing DMI programs in an explicit message-passing environment; second,
to support dynamic data re-distribution. The first goal has been accomplished by the
parallel run-time library PLUS, whose theoretical foundations are presented in [1]. PLUS
provides a set of DMI collective primitives that allow the design and implementation of
programs in which the distribution attributes can be set at run-time. The second
goal has been achieved by a transparent mechanism that, at regular intervals,
checks the status of the platform and, if necessary, autonomously provides suitable
data migrations from overloaded to under-loaded nodes.
The paper is organised as follows. Section 2 presents the DAME project focusing
on its aims and comparing them to related frameworks. Section 3 outlines the virtual
architecture and its effects on data decomposition. Section 4 describes the
programming model provided by PLUS and the interactions among the DAME
components. Section 5 presents experimental results on a computational platform
composed of a network of workstations.
2. THE DAME PROJECT
DAME is an environment that supports SPMD programming by means of
primitives that identify node properties (such as memory, current computational
power, etc.), facilitate node grouping operations, and support inter/intra group
communications. DAME provides double independence: from machines and from
data distribution. For SPMD programs the amount of computation performed by each
processing unit is usually proportional to the size of the data it owns. Therefore, at the
beginning, DAME automatically distributes data by taking into account the
differences in the current computational power of each workstation. At run-time,
DAME provides a dynamic data balancing support to preserve efficiency on a
platform subject to modification without forcing the programmer to manage
potentially complex operations such as workload monitoring, process
synchronisation, data migrations, and so on.
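For illustration only, the following C fragment sketches how such a power-proportional partition of matrix rows might be computed; it is a minimal sketch assuming a one-dimensional row decomposition, and it is not DAME's actual data balancing algorithm.

    #include <stdio.h>

    /* Hypothetical sketch (not DAME's published algorithm): split nrows matrix
     * rows among p nodes in proportion to their measured computational powers.
     * Rows lost to integer truncation are handed out round-robin afterwards. */
    void proportional_partition(int nrows, int p, const double power[], int rows[])
    {
        double total = 0.0;
        int assigned = 0;

        for (int i = 0; i < p; i++)
            total += power[i];

        for (int i = 0; i < p; i++) {
            rows[i] = (int)(nrows * power[i] / total);
            assigned += rows[i];
        }
        for (int i = 0; assigned < nrows; i = (i + 1) % p, assigned++)
            rows[i]++;                       /* place the leftover rows */
    }

    int main(void)
    {
        double power[3] = { 2.0, 1.0, 1.0 };     /* node 0 is twice as fast */
        int rows[3];
        proportional_partition(1000, 3, power, rows);
        for (int i = 0; i < 3; i++)
            printf("node %d gets %d rows\n", i, rows[i]);   /* 500, 250, 250 */
        return 0;
    }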
The literature presents various strategies for load balancing. Task
migration strategies for highly parallel computers are shown in [7], whereas optimal
scheduling algorithms for network computing are presented in [4]. Piranha
dynamically adapts Linda computations to the number of available workstations [2].
Nedeljkovic and Quinn propose a modification of the run-time system of Dataparallel
C (DPC) by adapting it to heterogeneous networks and providing transparent
workload migration [6]. Automatic Data Movement (ADM) furnishes a set of functions
that help the programmer to achieve load balancing by means of data migration [3].
Comparing DAME to the existing strategies for SPMD applications, it should be
noted that ADM is not yet transparent to the programmer, whereas DPC presents
some similarities even though it is carried out in a completely different way. In particular,
the programming language provided by the latter is Dataparallel C, a SIMD language
oriented to virtual processors without explicit communication primitives, whereas
DAME supports PLUS, a decomposition-independent message-passing language for
SPMD computations. In addition, DAME achieves dynamic load balancing by data
migration only, instead of the virtual parallel processor migration needed in DPC.
Moreover, since DAME is partially built over PVM [5], it inherits all the portability
advantages of the latter framework.
3. VIRTUAL COMPUTATIONAL ARCHITECTURE
DAME supports a virtual mesh topology because SPMD programming is
considerably simplified if an underlying regular platform is assumed. Nevertheless,
workstations are heterogeneous and irregularly connected. Their topology is usually
composed of a main backbone that connects several subnets by means of some
bridges (Figure 1.a). Even if widely used protocols such as TCP/IP provide complete
interconnection among nodes, efficiency considerations suggest clustering together
the nodes that are most quickly connected to each other.
To this end, DAME groups together the nodes of the same physical subnet to
form the rows of the virtual mesh topology (the so-called row subnets). In addition,
DAME emulates a regular platform (i.e. each group having the same number of nodes)
by splitting some nodes into several virtual nodes, whose number depends on the
offered computational power of each workstation.
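The exact splitting rule is not spelled out here; a minimal sketch of one plausible rule, in which the least powerful workstation acts as the unit of computational power, is the following (the function name and the rounding rule are assumptions). In practice the resulting counts would also be adjusted so that every row subnet ends up with the same number of virtual nodes, as in Figure 1.b.

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical rule: assign each workstation a number of virtual nodes
     * roughly proportional to its offered computational power, taking the
     * weakest machine as the unit. */
    static int virtual_nodes(double power, double min_power)
    {
        int v = (int)floor(power / min_power + 0.5);   /* round to nearest */
        return v > 0 ? v : 1;                          /* at least one virtual node */
    }

    int main(void)
    {
        double power[] = { 1.0, 2.1, 2.9, 1.2 };       /* relative powers, min = 1.0 */
        for (int i = 0; i < 4; i++)
            printf("workstation %d -> %d virtual node(s)\n",
                   i, virtual_nodes(power[i], 1.0));   /* 1, 2, 3, 1 */
        return 0;
    }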
[Diagrams: the actual workstation network (nodes A–I) and the corresponding virtual mesh of virtual nodes (A1, A2, B, C1, C2, D, E1–E4, F1, F2, G, H1, H2, I1–I3).]
Figure 1.a. Actual network. Figure 1.b. Virtual network.
For example, once the computational parameters have been evaluated, DAME
maps the irregular physical network of Figure 1.a into the virtual mesh of Figure 1.b.
The virtual mesh seems the best compromise because it introduces fewer virtual
links (grey lines in Figure 1.b) than unbounded-degree topologies, and it does not
represent a severe limitation, since several practical applications can be immediately
mapped onto such a domain or can easily be reduced to it.
As a consequence of this virtual topology definition, DAME always maps the data
domain onto an m×n virtual mesh (e.g. 3×6 in Figure 1.b). For example, in the case of a
2D matrix domain, the partition algorithm decomposes the matrix into m groups of
rows and n groups of columns (Figure 2.a). In this way, a programmer deals with
virtual nodes/decomposition and can adopt the usual SPMD paradigm for 2D regular
topologies. Figure 2.b shows the actual mapping of data onto the physical network:
each node has an amount of data proportional to the offered computational power,
thus implying a very irregular actual decomposition. The dynamic load balancing support that
causes data migration and run-time modifications of the physical data distribution
does not require any adjustment to the high-level code oriented to virtual nodes,
thanks to the decomposition-independent paradigm provided by the PLUS run-time
library underlying DAME. The PLUS language, in fact, overcomes the difficulties of
programming on irregular and variable domains by providing a suitable set of
functions whose syntax appears quite similar to that of traditional data-parallel
primitives. The DMI PLUS primitives are characterised by semantic flexibility, in the
sense that they self-adapt their effect to any data distribution.
[Diagrams: the matrix decomposed over the virtual nodes of the mesh, and the corresponding irregular decomposition over the physical workstations A–I.]
Figure 2.a. Virtual data decomposition. Figure 2.b. Actual data decomposition.
4. DAME COMPONENTS
DAME is organised into two logical components: master and computing nodes.
The whole evolution of programs is governed by the master, which is a process
resident on one node. Since the master is idle during most of the program execution, one
node (possibly the most powerful) carries out the double role of master and
computing node. The master starts the PVM daemon on each workstation and groups
nodes according to the network configuration. The static data distribution is carried
out by a “data balancing algorithm” on the basis of the network monitor function that
quantifies the current computational power of each workstation (in Figure 3 these
activities are evidenced by the grey arrows). Afterwards, each node can start the
execution of the parallel code.
During program execution, a plus_check() call guarantees load balancing by
performing, if necessary, a data migration. In such a case, the program execution is
interrupted, information about the current computational power is collected by the
network monitor and, if heavy modifications have occurred, the dynamic data
distribution algorithm is executed (in Figure 3 these activities are evidenced by the
black arrows). The re-distribution is not performed by the master, which only indicates
to each node which data are to be sent and received. In this way, each row subnet can
concurrently re-distribute data among its nodes. For the sake of efficiency we
distinguish between local and global reconfiguration, in the sense that data
exchange can happen among nodes belonging to the same row subnet only (local)
or among row subnets (global). The scalability requirement is satisfied since, if we
increase the number of nodes, the complexity of load balancing grows in proportion
to the square root of the number of nodes: on a roughly square mesh of p nodes each
row subnet contains about √p nodes, and the row subnets re-distribute their data concurrently.
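The criterion used to decide whether the monitored modifications are heavy enough to justify a migration is a design parameter; a plausible sketch of such a decision step, with an assumed 10% tolerance and an invented function name, is the following.

    #include <math.h>

    /* Hypothetical decision phase of plus_check(): ask for a re-distribution only
     * when some node's share of the data deviates from its share of the currently
     * monitored computational power by more than a tolerance. The function name
     * and the 10% threshold are assumptions, not part of the DAME interface. */
    #define IMBALANCE_TOL 0.10

    static int needs_redistribution(int p,
                                    const double data_share[],   /* fractions, sum to 1 */
                                    const double power_share[])  /* fractions, sum to 1 */
    {
        for (int i = 0; i < p; i++)
            if (fabs(data_share[i] - power_share[i]) > IMBALANCE_TOL)
                return 1;      /* heavy modification: trigger data migration */
        return 0;              /* platform almost unchanged: skip migration */
    }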
Each node behaves as in a usual SPMD programming environment. The
programmer inserts PLUS communication primitives as would be done for a
regular virtual mesh. The decomposition can be set and/or modified at run-time
by means of the plus_check() primitive, which can be called either by the programmer or
automatically by the system if heavy and unexpected events require a sudden
re-evaluation of the data partition.
The node program is written in C enriched with the PLUS primitives. Figure 3
illustrates the typical structure of a PLUS code and how the different DAME components
interact. The self-adapting behaviour of the PLUS primitives cannot be illustrated
there because it takes place at an underlying level.
Figure 3. Template of a PLUS node program and interactions among function calls
and DAME components.
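The following C skeleton gives a hedged idea of the node program structure sketched in Figure 3; only plus_init(), plus_check() and plus_end() are names taken from the text, while the header name, plus_mesh_info(), plus_broadcast_row() and all signatures are illustrative assumptions.

    #include "plus.h"        /* hypothetical header of the PLUS run-time library */

    #define MAX_ITER 100
    #define BUF_LEN  1024

    static double pivot_buffer[BUF_LEN];

    static void compute_local_block(void)
    {
        /* application-specific owner-computes work on the locally owned data */
    }

    int main(int argc, char *argv[])
    {
        int rows, cols, my_row, my_col;

        plus_init(&argc, &argv);                 /* join the DAME environment */

        /* identification primitives, called once before the main loop
           (plus_mesh_info() is an illustrative name, not the documented API) */
        plus_mesh_info(&rows, &cols, &my_row, &my_col);

        for (int iter = 0; iter < MAX_ITER; iter++) {
            compute_local_block();               /* local computation */
            plus_broadcast_row(pivot_buffer, BUF_LEN);  /* illustrative fan-out
                                                            within a row subnet */
            plus_check();                        /* re-balance if the platform changed */
        }

        plus_end();                              /* leave the environment */
        return 0;
    }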
The PLUS primitives can be divided into four groups. Some of them are currently
built on top of PVM [5], thus representing an auxiliary layer.
Identification primitives. Usually called once before the main loop of the program,
they return global (such as number of nodes involved in computation, number of row
subnets) and local information (such as the position of each node in the mesh and its
row subnet number).
Loop dependent primitives. Used inside the main loop, they can be distinguished
between owner-compute functions, which determine the owners of a given set of data,
and indexing rules, which allow the programmer to access local data by means of
their global indexes in the original data structure. These primitives are the
fundamental basis that supports the decomposition independence paradigm of
PLUS, since the programmer is never required to express exactly where data are
located.
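As a hedged illustration of how such primitives might be used, the fragment below scales one global matrix row only on the node that owns it; plus_owner_row(), plus_my_id() and plus_global_to_local_row() are invented names standing for the owner-compute and indexing rules described above, declared in the (hypothetical) PLUS header.

    /* Hypothetical use of loop-dependent primitives: scale global row g of the
     * matrix only on the node that owns it, addressing the row through its
     * global index rather than through decomposition-specific arithmetic. */
    void scale_global_row(double *local_block, int ncols, int g, double factor)
    {
        if (plus_owner_row(g) == plus_my_id()) {      /* owner-computes test */
            int l = plus_global_to_local_row(g);      /* global -> local index */
            for (int j = 0; j < ncols; j++)
                local_block[l * ncols + j] *= factor;
        }
    }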
Communication primitives. They conform to the PVM standard by supporting
several types of data exchange among nodes and among row subnets (such as fan-
in, fan-out, gathering). Some primitives are implemented by means of PVM
functions, others are designed and implemented ex novo.
DAME interface primitives. They represent the only non-transparent interface
between a traditional SPMD code and the irregular computational platform. At
present, three primitives belong to this class: plus_init(), plus_end(), and plus_check().
5. EXPERIMENTAL RESULTS
DAME is currently implemented on an Ethernet-based local area network
composed of four HP-9000, four Sun SPARCstations and one IBM RISC/6000 that
are connected as in Figure 1.a. Experiments were carried out on a dedicated network
and workstations. In some experiments, though, synthetic overheads were
added to the computational platform with the aim of emulating network and/or
machine contention. We have run several SPMD numerical algorithms such as
matrix multiplication, Gaussian and Cholesky factorisation, and block Jacobi. For
reasons of space, we restrict ourselves here to the LU factorisation algorithm, whose
results are representative of the performance achieved by DAME. We evaluate the
efficacy of the supports for irregular data decomposition, virtual network and dynamic
data re-distribution.
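The way the synthetic overheads were generated is not described; one common technique, given here as a minimal sketch and not as the authors' actual tool, is a competing CPU-burning process launched on the workstation to be slowed down.

    #include <time.h>
    #include <unistd.h>

    /* Hypothetical synthetic overhead: a competing process that burns roughly
     * `fraction` of one CPU for `seconds` seconds, alternating a busy-wait with
     * a short sleep within every 10 ms period. */
    static void burn_cpu(double fraction, int seconds)
    {
        time_t end = time(NULL) + seconds;
        while (time(NULL) < end) {
            clock_t start = clock();
            while ((double)(clock() - start) / CLOCKS_PER_SEC < 0.010 * fraction)
                ;                                            /* consume CPU */
            usleep((useconds_t)(10000 * (1.0 - fraction)));  /* yield the rest */
        }
    }

    int main(void)
    {
        burn_cpu(0.5, 600);   /* steal about half of one CPU for ten minutes */
        return 0;
    }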
The first set of experiments has been carried out on a dedicated computational
platform. The aim is to demonstrate that the DAME supports do not add heavy
overheads to the execution times under static conditions. Before starting the computation,
the irregular data decomposition support partitions the workload in proportion
to the current computational power of each workstation. It has been
verified that, for any number of machines and any data size, DAME execution times
are lower than those obtained with a workload equally partitioned among the nodes.
In particular, Figure 4 shows the execution time (in seconds) of a parallel algorithm
for the factorisation of a dense matrix running on different numbers of workstations
under the hypothesis that no modification occurs in the computational platform. The
figure shows that considerable speed-up is achieved with up to four workstations,
thus demonstrating that the irregular data decomposition and virtual
network supports do not degrade performance. The loss of efficiency for a higher
number of nodes is due to an increased number of communications and, mainly,
to the fact that the additional workstations belong to different physical subnets
connected through bridges.
[Plots: execution time (seconds) versus matrix size (100–1100). Figure 4: curves for 1, 2, 4, 6 and 8 nodes. Figure 5: curves for 2, 4 and 8 nodes, each with and without one plus_check() call.]
Figure 4. Execution times for LU factorisation of a dense matrix with varying dimensions (dedicated computational platform).
Figure 5. Overhead of one plus_check() call without data migration (dedicated computational platform).
The efficacy of the dynamic data re-distribution support has to be evaluated under
static and dynamic conditions. A trade-off exists between the performance
degradation due to load imbalance and the overhead due to the execution of the
plus_check() primitive. The latter consists of four phases: process synchronisation,
network monitoring, decision algorithm, and data re-distribution. Since DAME
efficiently implements the second and third phases, the main cost factors of the
plus_check() execution are process synchronisation and data re-distribution.
Figure 5 shows the execution times of a DAME program with and without the
plus_check() call. Since no modification occurs in the computational
platform, no data re-distribution is carried out. Therefore, the gap between the two
curves shows the cost of the first three phases. In particular, the small differences
demonstrate the scalability of the plus_check() primitive: the introduced overhead
does not increase for higher numbers of nodes. It should be noted, though, that
this low overhead is also due to the characteristics of the considered SPMD
algorithm, which implicitly synchronises the different processes at the end of each
iteration if the workload is well balanced.
Figure 6 shows the execution time t_ex of the same parallel algorithm when some
modification of the computational power of the workstations occurs. To evaluate the
impact of data re-distribution only, we preserve the global power of the
computational platform. In particular, at time t_ex/4, one workstation is burdened with
three synthetic workloads that cause a loss of power equal to 10%, 30% and 50%,
respectively. At the same time, some other workstations gain an analogous amount
of power. In this experiment the DAME program executes only one plus_check() call, at
time t_ex/2. The (plain) curves point out the importance of a dynamic data migration
support, especially when the modifications are heavy (for the considered
algorithm, at least 30%) and/or the computational cost of the problem is high (i.e. in
case of long execution times).
[Plots: execution time (seconds) versus matrix size (100–1100); curves for 10%, 30% and 50% power variations, each with and without data migration (plus_check).]
Figure 6. Execution times with and without data migration for different variations of the computational platform (1 plus_check() call).
Figure 7. Execution times with and without data migration for different variations of the computational platform (3 plus_check() calls).
Figure 7 illustrates the same experiments with a different frequency of the
plus_check() call, namely at times t_ex/4, t_ex/2 and 3t_ex/4. In this case, the modification of
the computational power occurs at time t_ex/8. We can observe that the additional
overhead caused by the multiple plus_check() calls is amply
compensated if heavy modifications occur in the platform: the execution time is
reduced if a power variation of at least 30% occurs, whereas a longer execution time
is observed when the modifications are light (less than 30%).
In addition, by considering Figures 6 and 7 together, we can observe that three
plus_check() calls improve the performance of the 50%-modification case to the extent that
the resulting execution time becomes lower than that of the unbalanced 30%-modification
case (compare the 30% and 50%-plus_check curves in the two figures). It should be noted,
though, that here the checkpoint frequency was chosen empirically, once the
program execution time was known. The optimal checkpoint insertion for an arbitrary SPMD
algorithm is one of the open problems still under study.
6. CONCLUSIONS
The DAME project presented in this paper addresses some of the intrinsic
difficulties of SPMD programming on heterogeneous and time-varying network
platforms. DAME supplies the programmer with four kinds of transparent supports: a
run-time library (PLUS) of decomposition and machine independent primitives; a
virtual mesh abstraction that hides irregularities of the network; a static mechanism
that automatically distributes workload in a way which is proportional to the current
computational power of each workstation; a dynamic and transparent data migration
support that masks any modification of the underlying platform. The satisfactory
experimental results obtained with all these supports demonstrate that DAME is a
theoretically grounded and effective framework for SPMD network computing that
preserves efficiency when the platform is subject to dynamic variations.
References
[1] M. Angelaccio, M. Colajanni, “Unifying and optimizing parallel linear algebra
algorithms”, IEEE Trans. on Parallel and Distributed Systems, v. 4, no. 12, pp.
1382-1397, Dec. 1993.
[2] N. Carriero, D. Kaminsky, “Adaptive parallelism and Piranha”, IEEE Computer, v.
28, no. 1, Jan. 1995.
[3] J. Casas, R. Konuru, S.W. Otto, R. Prouty, J. Walpole, “Adaptive load migration
systems for PVM”, Proc. of Supercomputing ’94, pp. 390-399, Nov. 1994.
[4] K. Efe, V. Krishnamoorty, “Optimal scheduling of compute-intensive tasks on a
network of workstations”, IEEE Trans. on Parallel and Distributed Systems, v. 6,
no. 6, pp. 668-673, June 1995.
[5] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM
3.0 User’s Guide and Reference Manual, Feb. 1993 (available via ftp).
[6] N. Nedeljkovic, M.J. Quinn, “Data-parallel programming on a network of
heterogeneous workstations”, Concurrency: Practice and Experience, v. 5, no. 4,
pp. 257-268, June 1993.
[7] M.H. Willebeek-Le Mair, A.P. Reeves, “Strategies for dynamic load balancing on
highly parallel computers”, IEEE Trans. on Parallel and Distributed Systems, v. 4,
no. 9, pp. 979-993, Sept. 1993.
