The document discusses porting a seismic inversion code to run in parallel using standard message-passing libraries. It describes three options considered for distributing the large 3D seismic data across processors: mapping the data onto a processor grid, treating it as a sparse-matrix problem, or distributing the data as 1D vectors assigned to each processor. The third option was chosen because it best preserved the code structure, had regular dependencies, and simplified communications. The parallel code was implemented using the Distributed Data Library (DDL) for data management and the Message Passing Interface (MPI) for basic point-to-point communication between processors. Initial tests showed near-linear speedup on up to 30 processors.
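The 1D-vector distribution described above amounts to a standard block decomposition: each processor owns a contiguous slice of the data, so neighbour exchange reduces to regular point-to-point messages. The sketch below shows how such a decomposition is typically computed; the function name and the even-block scheme are illustrative assumptions, not taken from the original code.

```python
# Sketch of a 1D block distribution: each processor owns a contiguous
# slice of the data vector, the usual MPI-style decomposition.

def block_range(n_items, n_procs, rank):
    """Return the [start, stop) slice of a length-n_items vector owned
    by processor `rank`, with any remainder spread over the first
    processors so slice sizes differ by at most one."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

# 10 items over 3 processors: ranks get 4, 3 and 3 contiguous items.
print([block_range(10, 3, r) for r in range(3)])  # [(0, 4), (4, 7), (7, 10)]
```

In an actual MPI code each rank would compute its own range once and then exchange only boundary elements with its neighbours.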
Effective Sparse Matrix Representation for the GPU Architectures (IJCSEA Journal)
General-purpose computation on the graphics processing unit (GPU) is prominent in the current high performance computing era. Porting data-parallel applications onto the GPU gives a default performance improvement because of the increased number of computational units, and better performance can be obtained if application-specific fine tuning is done with respect to the architecture under consideration. One very widely used computation-intensive kernel is sparse matrix-vector multiplication (SpMV) in sparse-matrix-based applications. Most existing data formats for sparse matrix representation were developed with the central processing unit (CPU) or multi-cores in mind. This paper gives a new format for sparse matrix representation targeted at the graphics processor architecture that can give a 2x to 5x performance improvement compared to CSR (compressed sparse row format), 2x to 54x compared to COO (coordinate format), and 3x to 10x compared to the CSR vector format, for the class of applications that fit the proposed format. It also gives 10% to 133% improvements in memory transfer (of only the access information of the sparse matrix) between CPU and GPU. The paper gives the details of the new format and its requirements, with complete experimentation details and comparison results.
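For readers unfamiliar with the two baseline formats the abstract compares against, the sketch below builds them for a toy matrix: COO stores (row, col, val) triples, while CSR compresses the row indices into a row-pointer array so SpMV can walk each row's slice. The matrix and helper name are illustrative.

```python
import numpy as np

# COO vs CSR on a small matrix, plus a scalar CSR SpMV kernel.
A = np.array([[5, 0, 0],
              [0, 8, 3],
              [0, 0, 6]])

rows, cols = np.nonzero(A)
vals = A[rows, cols]                  # COO: three parallel arrays of length nnz

# CSR keeps cols/vals but replaces `rows` with row pointers of length n+1:
# row i's non-zeros live at indices row_ptr[i] .. row_ptr[i+1].
row_ptr = np.zeros(A.shape[0] + 1, dtype=int)
np.cumsum(np.bincount(rows, minlength=A.shape[0]), out=row_ptr[1:])

def csr_spmv(row_ptr, cols, vals, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):           # one dot product per row
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

print(row_ptr)                                            # [0 1 3 4]
print(csr_spmv(row_ptr, cols, vals, np.ones(3)))          # [ 5. 11.  6.]
```

On a GPU the interesting question, which the paper addresses, is how to lay these arrays out so that threads in a warp make coalesced memory accesses.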
A Tale of Data Pattern Discovery in Parallel (Jenny Liu)
In the era of IoT and AI, distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress on parallel frameworks, algorithms, and accelerated computing capacity, it remains challenging to deliver an efficient and scalable data analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data parallelism improvement on the cloud.
Over time, machine learning inference workloads have become more and more demanding in terms of latency and throughput, with multiple models being deployed in the same system. This scenario leaves considerable room for runtime and memory optimizations, which current systems fall short of exploiting because they treat ML models and tasks as black boxes.
In contrast, Pretzel adopts a white-box description of ML models, which allows the framework to perform optimizations over deployed models and running tasks, saving memory and increasing overall system performance. In this talk we will show the motivation behind Pretzel, its current design, and possible future developments.
The objective of this paper is to present a hybrid approach for edge detection. Under this technique, edge detection is performed in two phases: in the first phase, the Canny algorithm is applied for image smoothing, and in the second phase a neural network detects the actual edges. A neural network is a good tool for edge detection, as it is a non-linear network with built-in thresholding capability. The network can be trained with the back-propagation technique using a few training patterns, but the most important and difficult part is to identify a correct and proper training set.
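The two-phase idea can be sketched in a few lines: smooth first, then let a single thresholding neuron decide edge/no-edge from the local gradient. This is only a hedged illustration; the 3x3 mean filter and hand-set weights below stand in for the paper's Canny stage and trained back-propagation network, and all names are invented for the example.

```python
import numpy as np

def smooth(img):
    """Phase 1 stand-in: 3x3 mean filter on interior pixels."""
    out = img.astype(float).copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i-1:i+2, j-1:j+2].mean()
    return out

def neuron_edges(img, w=1.0, bias=-2.0):
    """Phase 2 stand-in: one neuron = weighted sum of the gradient
    magnitude plus a bias, followed by a built-in step threshold."""
    gy, gx = np.gradient(smooth(img))
    activation = w * np.hypot(gx, gy) + bias
    return (activation > 0).astype(int)

img = np.zeros((6, 6))
img[:, 3:] = 10.0                     # vertical step edge at column 3
print(neuron_edges(img)[2])           # the neuron fires near the step only
```

In the paper the weights and bias would come from back-propagation on the training set rather than being hand-chosen.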
A NEW PARALLEL MATRIX MULTIPLICATION ALGORITHM ON HEX-CELL NETWORK (PMMHC) US... (ijcsit)
Widespread attention has been paid to parallelizing algorithms for computationally intensive applications. In this paper, we propose a new parallel matrix multiplication algorithm on the Hex-Cell interconnection network. The proposed algorithm has been evaluated and compared with the sequential algorithm in terms of speedup and efficiency using IMAN1, where a set of simulation runs was carried out on different input data distributions with different sizes. The simulation results supported the theoretical analysis and m…
A New Approach to Linear Estimation Problem in Multiuser Massive MIMO Systems (Radita Apriana)
A novel approach for solving the linear estimation problem in multi-user massive MIMO systems is proposed. In this approach, the difficulty of matrix inversion is attributed to the incomplete definition of the dot product. The general definition of the dot product implies that the columns of the channel matrix are always orthogonal whereas, in practice, they may not be. If the latter information can be incorporated into the dot product, then the unknowns can be computed directly from projections without inverting the channel matrix. By doing so, the proposed method is able to achieve an exact solution with a 25% reduction in computational complexity compared to the QR method. The proposed method is stable, offers the extra flexibility of computing any single unknown, and can be implemented in just twelve lines of code.
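The baseline fact behind this abstract is easy to check: when the channel matrix really does have orthogonal columns, every unknown is just a projection, x_i = (a_i . b) / (a_i . a_i), and no inversion is needed. The sketch below shows only that orthogonal special case (the abstract's contribution is extending the idea to non-orthogonal columns, which is not reproduced here); the matrix and values are illustrative.

```python
import numpy as np

A = np.array([[2.0,  1.0],
              [1.0, -2.0]])         # the two columns are orthogonal
x_true = np.array([3.0, -1.0])
b = A @ x_true                      # noiseless observation

# Each unknown recovered independently by projecting b onto one column.
x = np.array([A[:, i] @ b / (A[:, i] @ A[:, i]) for i in range(A.shape[1])])
print(x)  # [ 3. -1.]
```

Note how each x_i is computed on its own, which is the "flexibility of computing any single unknown" the abstract mentions.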
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI (ijtsrd)
Matrix multiplication is a concept used in technology applications such as digital image processing, digital signal processing, and graph problem solving. Multiplication of huge matrices requires a lot of computing time, as its complexity is O(n³). Because most engineering and science applications require high computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are selected, implemented, and analyzed. A performance analysis is presented, and some recommendations are given on using the OpenMP and MPI methods of parallel computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako, "Comprehensive Performance Evaluation on Multiplication of Matrices using MPI".
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
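The O(n³) cost mentioned above comes from the classic triple loop: n² output entries, each an n-term dot product. The sequential sketch below is the baseline that OpenMP/MPI variants split across threads or processes, typically by distributing the outer loop over rows; the function name is illustrative.

```python
def matmul(A, B):
    """Naive O(n^3) matrix multiplication over nested lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of A: the loop usually parallelised
        for j in range(p):      # columns of B
            for k in range(m):  # inner dot product: m multiply-adds
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

Distributing the i-loop is attractive because each row of C is computed independently, so no communication is needed until the results are gathered.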
Distributed graph frameworks formulate tasks as sequences of supersteps within which communication is performed asynchronously by sending messages over the graph edges. PageRank's communication pattern is identical across supersteps since each vertex sends messages to all its edges. We exploit this pattern to develop a new communication paradigm that allows us to exchange messages that include only edge payloads, dramatically reducing bandwidth requirements. Experiments on a web graph of 38 billion vertices and 3.1 trillion edges yield execution times of 34.4 seconds per iteration, suggesting more than an order of magnitude improvement over the state-of-the-art.
Author:
Stergios Stergiou
Publication:
WWW '20: Proceedings of The Web Conference 2020, April 2020. Pages 2761–2767.
https://doi.org/10.1145/3366423.3380035
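The communication pattern the paper exploits is visible even in a toy PageRank: in every superstep each vertex sends a message down every out-edge, so the (source, destination) structure of the message set never changes and only the payload (rank mass) differs between supersteps. The sketch below is a generic iterative PageRank, not the paper's system; the graph, damping factor, and iteration count are illustrative.

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Plain iterative PageRank over an edge list of n vertices."""
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):                       # one superstep per iteration
        incoming = [0.0] * n
        for u, v in edges:                       # identical edge set each step:
            incoming[v] += rank[u] / out_deg[u]  # only the payload varies
        rank = [(1 - d) / n + d * m for m in incoming]
    return rank

r = pagerank([(0, 1), (1, 2), (2, 0)], 3)
print(r)  # symmetric 3-cycle: all ranks stay at 1/3
```

Because the (u, v) part of each message is constant, a framework can transmit it once and thereafter exchange only the payloads, which is the bandwidth saving the abstract describes.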
A Dependent Set Based Approach for Large Graph Analysis (IJCATR)
Nowadays, social and computer networks produce graphs of thousands of nodes and millions of edges. Such large graphs are used to store and represent information. As a complex data structure, a large graph requires extra processing, so partitioning or clustering methods are used to decompose it. In this paper, a dependent-set-based graph partitioning approach is proposed which decomposes a large graph into subgraphs. It creates uniform partitions with very few edge cuts and also prevents loss of information. The work additionally focuses on an approach that handles dynamic updates to a large graph and represents it in abstract form.
Hex-Cell is an interconnection network with attractive features, such as the ability to embed topological structures like bus, ring, tree, and mesh topologies. In this paper, we present two algorithms for embedding bus and ring topologies onto the Hex-Cell interconnection network. We use three metrics to evaluate the proposed algorithms: dilation, congestion, and expansion. Our evaluation results show that the congestion of both proposed algorithms is equal to one, and that the dilation is equal to 2d-1 for the first algorithm and 1 for the second.
Stochastic Computing Correlation Utilization in Convolutional Neural Network ... (TELKOMNIKA Journal)
In recent years, many applications have been implemented in embedded systems and mobile Internet of Things (IoT) devices that typically have constrained resources and smaller power budgets, yet exhibit "smartness" or intelligence. To implement computation-intensive and resource-hungry Convolutional Neural Networks (CNNs) in this class of devices, many research groups have developed specialized parallel accelerators using Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), or Application-Specific Integrated Circuits (ASICs). An alternative computing paradigm called Stochastic Computing (SC) can implement CNNs with a low hardware footprint and power consumption. To enable building more efficient SC CNNs, this work incorporates CNN basic functions in SC that exploit correlation, share Random Number Generators (RNGs), and are more robust to rounding error. Experimental results show the proposed solution provides significant savings in hardware footprint and increased accuracy for the SC CNN basic function circuits compared to previous work.
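A standard stochastic-computing fact underlying circuits like these: with unipolar coding, a value p in [0, 1] becomes a random bitstream whose bits are 1 with probability p, and multiplying two *uncorrelated* streams is a single AND gate. Correlation between streams (for example, from a shared RNG) changes the result, which is exactly what correlation-aware designs must manage. The stream length and seed below are arbitrary.

```python
import random

random.seed(1)
N = 100_000                      # stream length; longer = less rounding noise

def bitstream(p):
    """Unipolar stochastic encoding: bit is 1 with probability p."""
    return [1 if random.random() < p else 0 for _ in range(N)]

a, b = bitstream(0.8), bitstream(0.5)
prod = [x & y for x, y in zip(a, b)]   # one AND gate acts as a multiplier
print(sum(prod) / N)                   # close to 0.8 * 0.5 = 0.4
```

The appeal for CNN hardware is that an entire multiplier shrinks to one gate, at the cost of long bitstreams and stochastic rounding error.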
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
The Science Information Network (SINET) is the Japanese academic backbone network for more than 800 universities and research institutions. The characteristic of SINET traffic is that it is enormous and highly variable. In this paper, we present a task-decomposition-based anomaly detection of massive and high-volatility session data from SINET. Three main features are discussed: task scheduling, traffic discrimination, and histogramming. We adopt a task-decomposition-based dynamic scheduling method to handle the massive session data stream of SINET. In the experiment, we analysed SINET traffic from 2/27 to 3/8 and detected some anomalies using LSTM-based time-series data processing.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the term-by-document matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition). The high computational complexity of the SVD, O(n³), makes the reduction of a large indexing structure a difficult task. The article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first is associated with the CPU and MATLAB R2011a, the second with graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a resulting matrix of large size. For both environments, computations were performed for double- and single-precision data.
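The LSI reduction itself is compact to state: take the SVD of the term-by-document matrix and keep only the k largest singular values. The sketch below does this for a toy matrix; the matrix contents and k are illustrative, and the O(n³) cost of the full SVD is what motivates offloading it to the GPU in the article.

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(6, 4)                 # toy 6-terms x 4-documents matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k]          # rank-k LSI approximation of A

# Eckart-Young: no rank-k matrix is closer to A in Frobenius norm, and the
# approximation error equals the energy of the discarded singular values.
err = np.linalg.norm(A - A_k)
print(err, np.sqrt((s[k:] ** 2).sum()))  # the two numbers agree
```

In LSI, queries and documents are then compared in the reduced k-dimensional space instead of the full term space.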
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ... (IJCNCJournal)
Intrusion detection poses a significant challenge within expansive and persistently interconnected environments. As malicious code continues to advance and sophisticated attack methodologies proliferate, various advanced deep learning-based detection approaches have been proposed. Nevertheless, the complexity and accuracy of intrusion detection models still need further enhancement to render them more adaptable to diverse system categories, particularly within resource-constrained devices, such as those embedded in edge computing systems. This research introduces a three-stage training paradigm, augmented by an enhanced pruning methodology and model compression techniques. The objective is to elevate the system's effectiveness, concurrently maintaining a high level of accuracy for intrusion detection. Empirical assessments conducted on the UNSW-NB15 dataset evince that this solution notably reduces the model's dimensions, while upholding accuracy levels equivalent to similar proposals.
Implementation of p pic algorithm in MapReduce to handle big data (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio... (RSIS International)
In this paper, we have designed the VLSI hardware for a novel RS decoding algorithm suitable for multi-Gb/s communication systems. We show that the performance benefit of the algorithm is truly realised when it is implemented in hardware, thus avoiding the extra processing time of the fetch-decode-execute cycle of traditional microprocessor-based computing systems. The new algorithm, with lower time complexity combined with its application-specific hardware implementation, is suitable for high-speed real-time systems with hard timing constraints. The design is implemented as digital hardware using VHDL.
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied there; secondly, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location. To address this, dimensionality reduction can be done at the local sites: an encoding is applied to the data so as to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done of how PCA can be useful in reducing data flow across a distributed network.
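The local-site step described above can be sketched directly: project each site's data onto its top principal components and ship only the reduced representation (plus the components themselves) to the aggregator. The dataset and the number of retained components below are illustrative.

```python
import numpy as np

np.random.seed(0)
local_data = np.random.rand(200, 10)     # 200 records x 10 features at a site

# PCA via the eigendecomposition of the covariance matrix.
centered = local_data - local_data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

k = 3
components = eigvecs[:, -k:]             # top-k principal directions
reduced = centered @ components          # this is what crosses the network

print(local_data.shape, '->', reduced.shape)  # (200, 10) -> (200, 3)
```

Here the per-record payload drops from 10 features to 3, which is the bandwidth saving the paper studies; the aggregator can mine the reduced features directly or approximately reconstruct the originals from the shipped components.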
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Vertex covering has important applications in wireless sensor networks, such as monitoring link failures, facility location, clustering, and data aggregation. In this study, we designed three algorithms for constructing a vertex cover in wireless sensor networks. The first algorithm, an adaptation of Parnas and Ron's algorithm, is a greedy approach that finds a vertex cover using the degrees of the nodes. The second algorithm finds a vertex cover from a graph matching, where Hoepman's weighted matching algorithm is used. The third algorithm first forms a breadth-first search tree and then constructs a vertex cover by selecting nodes at predefined levels of the tree. We show the operation of the designed algorithms, analyze them, and provide simulation results in the TOSSIM environment. Finally, we implemented, compared, and assessed all three approaches. The transmitted message count of the first algorithm is the smallest of the three, while the third algorithm turns out to give the best vertex cover approximation ratio.
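The first (greedy, degree-based) approach described above can be sketched centrally in a few lines: repeatedly add a highest-degree node to the cover and delete its edges until every edge is covered. The toy graph is illustrative, and note the paper's version runs distributed on sensor nodes rather than as the centralized loop shown here.

```python
def greedy_vertex_cover(edges):
    """Greedy degree-based vertex cover over a list of (u, v) edges."""
    remaining = set(edges)
    cover = set()
    while remaining:
        # Pick the node touching the most still-uncovered edges.
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        best = max(degree, key=degree.get)
        cover.add(best)
        remaining = {(u, v) for u, v in remaining if best not in (u, v)}
    return cover

edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
cover = greedy_vertex_cover(edges)
print(cover)  # node 0 covers three edges; one of {2, 3} covers the last
```

The greedy rule keeps the cover small in practice, though its worst-case approximation ratio is weaker than that of matching-based methods such as the second algorithm.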
Worldwide Interoperability for Microwave Access (WiMAX) is a broadband wireless access technology based on the IEEE 802.16 standards. It uses orthogonal frequency division multiple access (OFDMA) as one of its multiple access techniques. Major design factors in OFDMA resource allocation are scheduling and burst allocation. The burst allocation algorithm is responsible for calculating the appropriate dimensions and location of each user's data so as to construct the bursts in the downlink subframe; bursts are calculated in terms of the number of slots for each user. Burst allocation is used to overcome resource wastage in the form of unused and unallocated slots per frame, which affects base station performance in mobile WiMAX systems. In this paper, the HOCSA (Hybrid One Column Striping with Non-Increasing Area) algorithm is proposed to overcome frame wastage. HOCSA is implemented by improving the eOCSA algorithm and is evaluated using MATLAB. HOCSA achieves a significant reduction in resource wastage per frame, leading to better exploitation of the WiMAX frame.
Similar to Achieving Portability and Efficiency in a HPC Code Using Standard Message-passing Libraries (20)
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
Achieving Portability and Efficiency in a HPC Code Using Standard Message-passing Libraries
HPCS 96: The 10th Annual International Conference on High Performance Computers

Achieving Portability and Efficiency in a HPC code using standard message-passing libraries

Derryck L. Lamptey, G. A. Manson, R. K. England
National Transputer Support Centre, 5 Palmerston Road, Sheffield, S10 2TE, U.K.
Contact author: bq176@torfree.net
Telephone: +44 742 76 87 40
Fax: +44 742 72 75 63
1 Introduction
As part of a European Union EUREKA project (EU 638, PARSIM) the NTSC has ported sequential FORTRAN code for seismic inversion onto a parallel platform. The sequential code has been industrially tested by one of the PARSIM partners, Ødegaard & Danneskiold-Samsøe, and remains commercially confidential. To respect confidentiality the parallel code presented here is a sanitised version of the parallel code developed at the NTSC. As part of the GP-MIMD2 project the code has been tested on the CS-2 parallel processing supercomputer at CERN and results show near linear scalability on up to 30 processors. The development environment for this successful project involved using MPI on a number of different platforms to ensure a portable yet efficient parallel code.
The code has been functionally tested on in-house Silicon Graphics' servers, and is due to be benchmarked on a Silicon Graphics' PowerChallenge Array (16 nodes) at the Silicon Graphics' SuperComputer Technology Centre in Neuchâtel, Switzerland. Results from these runs should be available by the end of February, 1996.
2 The Code
A sequential implementation of the algorithm was available, from which the parallelisation could be developed. The relevant code, referred to as "sequential code", is the code necessary to solve the problem once initial estimates and a number of preset variables have been set up. The solution of the seismic problem principally involves repeating a number of 3-dimensional operations designed to minimise the error between the seismic (input) data and the synthetic (output) data sets. The calculation computes a set of variables which reduce the entropy of a given data set over a number of iterations, using a conjugate gradient method. The total seismic inversion involves a complex algorithm, most of which will not benefit greatly from parallelisation. However, a small section of the code, shown as "Parallel Prototype" in Figure 1, accounts for 95% of the computation time. This is the portion of the algorithm which has been parallelised.
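The paper does not reproduce the inversion kernel, but the conjugate gradient idea it relies on can be sketched in a few lines. This is an illustrative sketch only: the matrix, right-hand side and tolerance below are made up for the example and are not taken from the seismic code.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Iteratively minimise the quadratic error for a symmetric
    positive-definite system A x = b (A given as a list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual b - A x (x = 0 initially)
    p = r[:]                      # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        # New direction is conjugate to the previous ones.
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

# Small SPD system whose exact solution is x = [1.0, 1.0].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]
x = conjugate_gradient(A, b)
```

In the seismic code the same error-minimisation loop runs over 3-dimensional data sets rather than a toy 2-by-2 system.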
// Global Optimisation (labelled "Main") =
Loop for N
    Non linear optimisation of some binary variables
    // Local Optimisation (small) =
    1 dimensional optimisation
    3 dimensional optimisation
    // 3 dimensional optimisation (big, labelled "Parallel Prototype") =
    Loop for M
        scan 1: calculation of global scalar.
        scans 2..5: update of a number of data structures in each
            processing cell. Inter-cell communications required
            between each scan.
    End loop M
End loop N

Figure 1 Overview of the optimisation algorithm.
The main optimisation code (Main) does some external optimisation, then calls the internal
optimisation procedure (Parallel Prototype) to solve the system of equations.
3 Options for Code Parallelisation
For this problem, the data structures are large (circa 2GB), regular and 3-dimensional, and there are spatial dependencies between the data elements in all 3 dimensions. The data could not be distributed conveniently in the vertical direction. The computational code in the parallel prototype would be well-structured, if these spatial dependencies could be accommodated. Considerable time was therefore spent analysing these data structures and the spatial dependencies. Three data distribution approaches were proposed for this parallelisation:
• Option 1 (The processor grid approach)
• Option 2 (The sparse matrix solver approach)
• Option 3 (The distributed vector approach)
3.1 Option 1 (The processor grid approach)
Figure 2 Mapping the seismic data on to a grid of processors
[Figure: the (x, τ, y) seismic volume partitioned into blocks assigned to processors P1..P6]
This solution provides a straightforward mapping, and is conceptually simple, but has a number of disadvantages:
• the option would require complicated communications and process synchronisation, and scalability was not thought to be easily obtainable in a parallel version.
• load balancing was expected to be a problem because of the distribution patterns that would necessarily arise.
3.2 Option 2 (The sparse matrix solver approach)
Figure 3 Mapping the seismic data on to a sparse matrix
[Figure: the (x, τ, y) volume flattened into a banded sparse matrix of dimension x ∗ τ ∗ y]
The problem could be formulated as a sparse banded matrix type problem. The solution could then be handled by a number of parallel sparse matrix solvers currently available. Conceptually the sparse matrix approach is simple and elegant, but a number of implementation issues remained:
• Parallel formulation of the sparse matrix is not straightforward, because the data structures necessary for formulation as a matrix problem are not easily mapped into the data structures of the calling routine ("Main" in Figure 1). With this approach, the sparse matrix would have to be constructed on each entry to the prototype section of code, and "deconstructed" on each exit.
• Several sparse matrices would need to be set up to represent all the distributed data structures.
• Sparse matrix data storage was likely to be in the "triad" format, giving rise to a memory requirement up to three times that of any other approach.
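The triad (coordinate) format stores one (row, column, value) triple per non-zero element, which is where the roughly threefold memory estimate comes from. A minimal sketch of the idea; the band matrix here is invented for illustration:

```python
def to_triad(dense):
    """Convert a dense matrix (list of rows) to triad/coordinate
    storage: parallel lists of row index, column index and value."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals

# A small tridiagonal (banded) matrix as an illustration.
band = [
    [2.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
]
rows, cols, vals = to_triad(band)
# 10 non-zeros, but 30 stored numbers: three entries per non-zero.
```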
3.3 Option 3 (The distributed vector approach)
Figure 4 Mapping the seismic data on to one dimensional vectors
[Figure: the (x, τ, y) volume flattened to a 1-dimensional vector of length x ∗ τ ∗ y, split into contiguous blocks across processors P1..P4]
This approach formulates the 3-dimensional structures as 1-dimensional structures, permitting the parallel data structures to be viewed in a manner similar to that described in Option 1, but reducing the communication complexity by assigning entire seismic lines (y direction) to processors. This permits the data on each processor to be viewed as a contiguous data space. This option is possible, and beneficial in terms of processing, because:
• loop control is highly regular.
• the spatial dependencies between the data elements are well ordered.
• the code structure of the original sequential code could be preserved to a great degree in the parallel prototype.
• there was a fairly straightforward mapping of existing data structures.
In this formulation of the problem each processor has a number of complete seismic lines (see Figure 4), permitting an SPMD (single program multiple data) approach to be employed. Figure 4 also illustrates the mapping of the seismic data onto a number of processors, the seismic data being represented by equivalent 1-dimensional vectors.
Option 3 was chosen.
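The chosen distribution can be sketched in a few lines. The line and trace counts below are those of the benchmark data set reported later (60 lines of 151 traces by 56 samples); the 8-processor configuration and the rule for spreading left-over lines are assumptions made for the example, as the paper does not specify them.

```python
def distribute_lines(num_lines, num_procs):
    """Assign complete seismic lines to processors in contiguous
    blocks, spreading any remainder over the first processors.
    Returns one (offset, count) pair per processor, in line units."""
    base, extra = divmod(num_lines, num_procs)
    layout, offset = [], 0
    for p in range(num_procs):
        count = base + (1 if p < extra else 0)
        layout.append((offset, count))
        offset += count
    return layout

# 60 lines over 8 processors: each processor owns whole lines, so its
# portion of the flattened 1-D vector is one contiguous slab.
samples_per_line = 151 * 56
layout = distribute_lines(60, 8)
slabs = [(off * samples_per_line, cnt * samples_per_line)
         for off, cnt in layout]
```

Because each processor holds whole lines, the inter-cell communication between scans only crosses the line boundaries at each processor's edges.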
4 Library Interfaces
The software architecture of the parallel prototype is shown in Figure 5.
Figure 5 Organisation of the Library hierarchy
[Figure: the parallel prototype (application layer) sits above the DDL and MPI application libraries, with I/O handled through DDL and computation/communication through MPI]
4.1 Distributed Data Libraries
DDL is the Distributed Data Library developed at the Institute for Advanced Scientific Computation at the University of Liverpool[1]. DDL is a library system which creates, manages and operates upon distributed objects such as matrices and vectors, thus permitting the exploitation of distributed memory parallel computers from a single-threaded FORTRAN program. Many of the conventions employed by the DDL are derived from the MPI standard. The distributed data structures supported by the library are treated for the most part as opaque objects, i.e. they can only be manipulated through calls to DDL procedures. DDL objects may be passed to procedures via the use of handles, but DDL also permits the application to directly access the data inside a distributed vector, in which case the application can treat the data like a local data segment. The DDL manages system memory, allocating space required for new objects and de-allocating redundant objects. DDL is used in the parallel prototype primarily for input and output, memory allocation, and distributed data management. Because of the parallelisation option chosen, only a small subset of the DDL interface is required, namely:
• Input and Output (DDL_Open(), DDL_Fileformat(), DDL_Read(), DDL_Open_host(), DDL_Write(), DDL_Close()).
• Memory Allocation and Data distribution (DDL_Create_vector(), DDL_Free()).
• Data Access (DDL_Get_vector()).
• Data distribution query (DDL_Size(), DDL_GSize(), DDL_Offset_Vector(), DDL_Size_Vector()).
• Miscellaneous (DDL_Init_sparse(), DDL_Finalize_sparse(), DDL_Ishost()).
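The handle-plus-direct-access model described above can be mimicked in a few lines. This sketch only illustrates the semantics of a distributed vector with size and offset queries; it is not the DDL's actual FORTRAN interface, and every name in it is invented for the example.

```python
class DistVector:
    """Toy model of a distributed vector handle: each of nprocs
    ranks owns a contiguous slice of a global vector, and can query
    its local size and global offset (cf. the DDL distribution
    query routines). The local slice is ordinary storage, so the
    application may access it directly like a local data segment."""
    def __init__(self, global_size, nprocs, rank):
        base, extra = divmod(global_size, nprocs)
        self.local_size = base + (1 if rank < extra else 0)
        self.offset = rank * base + min(rank, extra)
        self.global_size = global_size
        self.data = [0.0] * self.local_size  # direct-access segment

# Three ranks sharing a 10-element vector.
vecs = [DistVector(10, 3, r) for r in range(3)]
sizes = [v.local_size for v in vecs]
offsets = [v.offset for v in vecs]
```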
4.2 Message Passing Interface
MPI, the Message Passing Interface, is intended to become a standard for applications running on distributed memory MIMD concurrent computers, and is well described elsewhere[2]. It
is not intended as a complete parallel programming environment, and currently lacks universal support for features such as parallel I/O, parallel program composition, dynamic process control and debugging. However, the core of MPI consists of a set of routines which support point-to-point communication between pairs of processes or between groups of processes. Because of the regular nature of the seismic data structures, the message-passing requirements of the chosen approach are not complicated and rely on a very basic and standard subset of the MPI library:
• Message Passing routines (MPI_Send(), MPI_Recv()).
• Communication contexts (MPI_Comm_size(), MPI_Comm_rank()).
• Data reduction (MPI_Allreduce()).
• Miscellaneous (MPI_Init(), MPI_Finalize()).
More and more vendors are providing proprietary implementations of MPI, e.g. Silicon Graphics.
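Scan 1's global scalar is exactly the pattern MPI_Allreduce serves: every rank contributes a partial value and every rank receives the combined result. A pure-Python illustration of those semantics (the per-rank partial sums are made up for the example; real code would call MPI_Allreduce with MPI_SUM):

```python
def allreduce_sum(partials):
    """Model of MPI_Allreduce with the MPI_SUM operation: rank i
    contributes partials[i]; afterwards every rank holds the same
    global total."""
    total = sum(partials)
    return [total for _ in partials]

# Four ranks each reduce their own slab of data to a partial scalar,
# and all four end up holding the identical global scalar.
partials = [12.5, 7.25, 9.0, 3.25]
results = allreduce_sum(partials)
```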
4.3 Use of standard libraries in porting
During the design stage, the portable libraries (DDL and MPI) were chosen. These libraries were used during the development, but a conscious effort had to be made during the development process to use the smallest subset of functions from these libraries, in order to minimise the library dependencies. All known implementations of MPI incorporate the MPI routines required by the parallel prototype. DDL is portable, and available for an increasing number of platforms.
During the development phase, coding and functional testing was carried out on the NTSC's set of Sun workstations, and performance testing and module integration was carried out on the CS-2 supercomputer at CERN, mostly because the system bandwidth of the in-house Sun network is around 0.5MB/sec, as compared to the 50MB/sec node-to-node bandwidth that the CS-2 can sustain. Portability was also ensured for a minimal cost in efficiency (benchmarks show that on the CS-2, around 80% of the native message-passing bandwidth is available through the use of the MPI libraries).
5 Results
5.1 Parallel Performance
Using data from North Sea oilfields, the parallel prototypes developed have demonstrated impressive parallel scalability in benchmarking tests on the CS-2 machine at CERN (see Figure 6).
Figure 6 Parallel scalability of the parallel prototype on real seismic data
[Plot: computation speedup versus number of processors (0 to 35), with the achieved speedup tracking linear speedup closely. Test data: 60 lines, 151 traces per line, 56 samples per trace, 9 samples per wavelet.]
5.2 Numerical Performance
A key objective for the parallel prototype is to reduce the entropy of the provided data set,
whilst maintaining identical numerical accuracy for all parallel configurations (invariant of the number
of processors). This objective has been met. See Figure 7.
Figure 7 Final energies for a number of processors on real seismic data.
[Plot: residual energy (circa 318000 to 321000) versus number of processes (0 to 35); the final energy is the same for all processor counts.]
6 Conclusions
It is clear that this port of the numerical kernel of a large sequential program has been a success, and that the code runs on a number of parallel processing machines which are currently commercially available. Portability was obtained without significant efficiency concessions. The authors conclude that the role played by MPI and DDL is crucial in formulating parallelisation strategies, and in providing prototype implementations. From this good base the project partners can consider moving the parallelisation to other parts of the problem solution, secure in the framework of data management which has been provided by MPI and DDL.
7 References
[1] Tim Oliver et al., "Sparse DDL – User Guide", Institute for Advanced Scientific Computation, Liverpool, England, 1995. tim@supr.scm.liv.ac.uk (http://supr.scm.liv.ac.uk/~tim/parsim/parsim.html)
[2] Bill Gropp, Rusty Lusk, Tony Skjellum, Nathan Doss, "A Portable MPI Implementation", Argonne National Laboratory/Mississippi State University, November 1994. gropp@mcs.anl.gov