HPCS '96: The 10th Annual International Conference on High Performance Computers
Achieving Portability and Efficiency in an HPC Code Using Standard Message-Passing Libraries
Derryck L. Lamptey, G. A. Manson, R. K. England
National Transputer Support Centre, 5 Palmerston Road, Sheffield, S10 2TE, U.K.
Contact author: bq176@torfree.net
Telephone: +44 742 76 87 40
Fax: +44 742 72 75 63
SuperCan2 (Interleaf)
Portability and Efficiency using standard message-passing libraries.
1 Introduction
As part of a European Union EUREKA project (EU 638, PARSIM) the NTSC has ported sequential FORTRAN code for seismic inversion onto a parallel platform. The sequential code has been industrially tested by one of the PARSIM partners, Ødegaard & Danneskiold-Samsøe, and remains commercially confidential. To respect confidentiality, the parallel code presented here is a sanitised version of the parallel code developed at the NTSC. As part of the GP-MIMD2 project the code has been tested on the CS-2 parallel processing supercomputer at CERN, and results show near-linear scalability on up to 30 processors. The development environment for this successful project involved using MPI on a number of different platforms to ensure a portable yet efficient parallel code.
The code has been functionally tested on in-house Silicon Graphics servers, and is due to be benchmarked on a Silicon Graphics PowerChallenge Array (16 nodes) at the Silicon Graphics SuperComputer Technology Centre in Neuchâtel, Switzerland. Results from these runs should be available by the end of February 1996.
2 The Code
A sequential implementation of the algorithm was available, from which the parallelisation could be developed. The relevant code, referred to as the "sequential code", is the code necessary to solve the problem once initial estimates and a number of preset variables have been set up. The solution of the seismic problem principally involves repeating a number of 3-dimensional operations designed to minimise the error between the seismic (input) data and the synthetic (output) data sets. The calculation computes a set of variables which reduce the entropy of a given data set over a number of iterations, using a conjugate gradient method. The total seismic inversion involves a complex algorithm, most of which will not benefit greatly from parallelisation. However, a small section of the code, shown as "Parallel Prototype" in Figure 1, accounts for 95% of the computation time. This is the portion of the algorithm which has been parallelised.
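The consequence of parallelising only this section can be made precise with Amdahl's law: the scalability figures later in the paper measure the parallel prototype itself, while the overall gain for the complete inversion is bounded by the remaining serial 5%. A minimal sketch (the 95% figure is from the text; the processor counts are illustrative):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when only this fraction of the runtime is parallelised."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# With 95% of the computation time inside the parallel prototype:
for p in (10, 30):
    print(p, round(amdahl_speedup(0.95, p), 1))  # -> 6.9 and 12.2 respectively
```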
// Global Optimisation:
Loop for N
    Non-linear optimisation of some binary variables
    // Local Optimisation (small):
    1-dimensional optimisation
    3-dimensional optimisation
    // 3-dimensional optimisation (big):
    Loop for M
        scan 1:    calculation of a global scalar
        scans 2-5: update of a number of data structures in each
                   processing cell; inter-cell communications
                   required between each scan
    End loop M
End loop N
Figure 1 Overview of the optimisation algorithm (Main and the Parallel Prototype section).
The main optimisation code (Main) does some external optimisation, then calls the internal
optimisation procedure (Parallel Prototype) to solve the system of equations.
3 Options for Code Parallelisation
For this problem, the data structures are large (circa 2 GB), regular and 3-dimensional, and there are spatial dependencies between the data elements in all 3 dimensions. The data could not be distributed conveniently in the vertical direction. The computational code in the parallel prototype would be well-structured if these spatial dependencies could be accommodated. Considerable time was therefore spent analysing these data structures and the spatial dependencies. Three data distribution approaches were proposed for this parallelisation:
• Option 1 (the processor grid approach)
• Option 2 (the sparse matrix solver approach)
• Option 3 (the distributed vector approach)
3.1 Option 1 (The processor grid approach)
Figure 2 Mapping the seismic data on to a grid of processors (axes x, y, τ; processors P1-P6).
This solution provides a straightforward mapping, and is conceptually simple, but has a number of disadvantages:
• the option would require complicated communications and process synchronisation, and scalability was not thought to be easily obtainable in a parallel version.
• load balancing was expected to be a problem because of the distribution patterns that would necessarily arise.
3.2 Option 2 (The sparse matrix solver approach)
Figure 3 Mapping the seismic data on to a sparse matrix (the x, y, τ volume becomes an (x·τ·y) × (x·τ·y) banded system).
The problem could be formulated as a sparse banded matrix type problem. The solution could then be handled by a number of parallel sparse matrix solvers currently available. Conceptually the sparse matrix approach is simple and elegant, but a number of implementation issues remained:
• Parallel formulation of the sparse matrix is not straightforward, because the data structures necessary for formulation as a matrix problem are not easily mapped into the data structures of the calling routine ("Main" in Figure 1). With this approach, the sparse matrix would have to be constructed on each entry to the prototype section of code, and "deconstructed" on each exit.
• Several sparse matrices would need to be set up to represent all the distributed data structures.
• Sparse matrix data storage was likely to be in the "triad" format, giving rise to a memory requirement up to three times that of any other approach.
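The "triad" format is the coordinate layout, which stores a (row, column, value) triple for every nonzero; the threefold memory cost follows directly. A small illustration (the matrix here is invented for the example):

```python
# A small banded matrix stored in "triad" (coordinate) format:
# each nonzero costs one row index, one column index and one value.
matrix = [[4.0, 1.0, 0.0],
          [1.0, 4.0, 1.0],
          [0.0, 1.0, 4.0]]

triads = [(i, j, v)
          for i, row in enumerate(matrix)
          for j, v in enumerate(row) if v != 0.0]

values_stored = len(triads)            # 7 nonzeros
quantities_stored = 3 * values_stored  # 21 numbers, vs 7 for the values alone
print(values_stored, quantities_stored)
```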
3.3 Option 3 (The distributed vector approach)
Figure 4 Mapping the seismic data on to one-dimensional vectors (the x, y, τ volume as an x·τ·y-element vector, distributed across processors P1-P4).
This approach formulates the 3-dimensional structures as 1-dimensional structures, permitting the parallel data structures to be viewed in a manner similar to that described in Option 1, but reducing the communication complexity by assigning entire seismic lines (y direction) to processors. This permits the data on each processor to be viewed as a contiguous data space. This option is possible, and beneficial in terms of processing, because:
• loop control is highly regular.
• the spatial dependencies between the data elements are well ordered.
• the code structure of the original sequential code could be preserved to a great degree in the parallel prototype.
• there was a fairly straightforward mapping of existing data structures.
In this formulation of the problem each processor has a number of complete seismic lines (see Figure 4), permitting an SPMD (single program, multiple data) approach to be employed. Figure 4 also illustrates the mapping of the seismic data onto a number of processors, the seismic data being represented by equivalent 1-dimensional vectors.
Option 3 was chosen.
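The decomposition above can be sketched in a few lines: flatten the 3-D (x, τ, y) volume so that each y-line is a contiguous block of the 1-D vector, then deal the lines out to processors in blocks. The index ordering and sizes here are illustrative assumptions, not taken from the confidential code:

```python
def flatten_index(ix, it, iy, nx, ntau):
    """Map a 3-D (x, tau, y) index into the 1-D vector, with y varying
    slowest so that one seismic line occupies one contiguous block."""
    return (iy * ntau + it) * nx + ix

def lines_for_processor(rank, nprocs, ny):
    """Block distribution of ny seismic lines over nprocs processors."""
    base, extra = divmod(ny, nprocs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return range(start, start + count)

nx, ntau, ny = 4, 3, 10          # toy sizes, not the real data set
# line 1 starts right where line 0's nx*ntau elements end:
assert flatten_index(0, 0, 1, nx, ntau) == nx * ntau
print([list(lines_for_processor(r, 4, ny)) for r in range(4)])
```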
4 Library Interfaces
The software architecture of the parallel prototype is shown in Figure 5.
Figure 5 Organisation of the library hierarchy: the application (the parallel prototype) sits on top of the application libraries, using DDL for I/O and MPI for communication during computation.
4.1 Distributed Data Libraries
DDL is the Distributed Data Library developed at the Institute for Advanced Scientific Computation at the University of Liverpool [1]. DDL is a library system which creates, manages and operates upon distributed objects such as matrices and vectors, thus permitting the exploitation of distributed-memory parallel computers from a number of single-threaded FORTRAN programs. Many of the conventions employed by the DDL are derived from the MPI standard. The distributed data structures supported by the library are treated for the most part as opaque objects, i.e. they can only be manipulated through calls to DDL procedures. DDL objects may be passed to procedures via the use of handles. However, DDL also permits the application to access the data inside the distributed vector directly, in which case the application can treat the data like a local data segment. The DDL manages system memory, allocating space required for new objects and de-allocating redundant objects. DDL is used in the parallel prototype primarily for input and output, memory allocation, and distributed data management. Because of the parallelisation option chosen, only a small subset of the DDL interface is required, namely:
• Input and output: DDL_Open(), DDL_Fileformat(), DDL_Read(), DDL_Open_host(), DDL_Write(), DDL_Close().
• Memory allocation and data distribution: DDL_Create_vector(), DDL_Free().
• Data access: DDL_Get_vector().
• Data distribution query: DDL_Size(), DDL_GSize(), DDL_Offset_Vector(), DDL_Size_Vector().
• Miscellaneous: DDL_Init_sparse(), DDL_Finalize_sparse(), DDL_Ishost().
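The opaque-object-with-handles style described above can be modelled in a few lines. This is a hypothetical stand-in, not the real DDL API: the class name, constructor and methods are invented for illustration, with comments pointing at the DDL routines they loosely correspond to:

```python
# Toy model of an opaque distributed object (NOT the real DDL interface):
# the application holds a handle and touches the data either through
# library calls or through an explicitly requested local segment.
class DistributedVector:
    def __init__(self, global_size, nprocs, rank):
        # cf. DDL_Create_vector(): the library decides the local portion
        base, extra = divmod(global_size, nprocs)
        self._local_size = base + (1 if rank < extra else 0)
        self._data = [0.0] * self._local_size

    def local_size(self):      # cf. DDL_Size(): query the distribution
        return self._local_size

    def local_segment(self):   # cf. DDL_Get_vector(): direct data access
        return self._data      # now treated like ordinary local data

vec = DistributedVector(global_size=10, nprocs=4, rank=0)
seg = vec.local_segment()
seg[0] = 1.5                   # the application works on local data directly
print(vec.local_size())
```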
4.2 Message Passing Interface
MPI is a Message Passing Interface which is intended to become a standard for applications running on distributed-memory MIMD concurrent computers, and is well described elsewhere [2]. It
is not intended as a complete parallel programming environment, and currently lacks universal support for features such as parallel I/O, parallel program composition, dynamic process control and debugging. However, the core of MPI consists of a set of routines which support point-to-point communication between pairs of processes and communication between groups of processes. Because of the regular nature of the seismic data structures, the message-passing requirements of the chosen approach are not complicated and rely on a very basic and standard subset of the MPI library:
• Message passing routines: MPI_Send(), MPI_Recv().
• Communication contexts: MPI_Comm_size(), MPI_Comm_rank().
• Data reduction: MPI_Allreduce().
• Miscellaneous: MPI_Init(), MPI_Finalize().
More and more vendors are providing proprietary implementations of MPI, e.g. Silicon Graphics.
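The reduction in this subset is the pattern behind scan 1 of the algorithm: every processor contributes a partial value and every processor receives the same combined result. Its semantics (for a sum reduction, as MPI_Allreduce with MPI_SUM provides) can be sketched without an MPI installation:

```python
# Pure-Python model of the MPI_Allreduce(..., MPI_SUM, ...) semantics used
# to form a global scalar from per-processor partial results.
def allreduce_sum(partials):
    """partials[rank] is rank's local contribution; the result is the
    value every rank holds after the collective completes."""
    total = sum(partials)
    return [total] * len(partials)

local_scalars = [1.0, 2.5, 0.5, 4.0]   # one partial result per rank
print(allreduce_sum(local_scalars))    # every rank sees the same 8.0
```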
4.3 Use of standard libraries in porting
During the design stage, the portable libraries (DDL and MPI) were chosen. These libraries were used during the development, but a conscious effort had to be made during the development process to use the smallest subset of functions from these libraries, in order to minimise the library dependencies. All known implementations of MPI incorporate the MPI routines required by the parallel prototype. DDL is portable, and available for an increasing number of platforms.
During the development phase, coding and functional testing were carried out on the NTSC's set of Sun workstations, and performance testing and module integration were carried out on the CS-2 supercomputer at CERN, mostly because the system bandwidth of the in-house Sun network is around 0.5 MB/s, as compared to the 50 MB/s node-to-node bandwidth that the CS-2 can sustain. Portability was also ensured for a minimal cost in efficiency: benchmarks show that on the CS-2, around 80% of the native message-passing bandwidth is available through the MPI library.
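The figures above quantify the trade-off directly; spelling out the arithmetic:

```python
# Effective MPI bandwidth on the CS-2, from the figures quoted above.
native_bw = 50.0        # MB/s, node-to-node on the CS-2
mpi_fraction = 0.80     # ~80% of native bandwidth retained through MPI
sun_network_bw = 0.5    # MB/s, in-house Sun network

mpi_bw = native_bw * mpi_fraction
print(mpi_bw)                    # 40.0 MB/s through MPI on the CS-2
print(mpi_bw / sun_network_bw)   # still ~80x the Sun network's bandwidth
```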
5 Results
5.1 Parallel Performance
Using data from North Sea oilfields, the parallel prototypes developed have demonstrated impressive parallel scalability in benchmarking tests on the CS-2 machine at CERN (see Figure 6).
Figure 6 Parallel scalability of the parallel prototype on real seismic data: achieved speedup against linear speedup for up to ~32 processors. Data set: 60 lines, 151 traces per line, 56 samples per trace, 9 samples per wavelet.
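Plots like Figure 6 are built from the standard definitions of speedup and parallel efficiency; a small helper (the timings below are illustrative placeholders, not the measured CS-2 values):

```python
def speedup(t1, tp):
    """Speedup of a p-processor run relative to the 1-processor time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup(t1, tp) / p

# Illustrative timings only: 1000 s sequential vs 36 s on 30 processors
# would correspond to near-linear scaling.
print(round(speedup(1000.0, 36.0), 1),
      round(efficiency(1000.0, 36.0, 30), 2))
```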
5.2 Numerical Performance
A key objective for the parallel prototype is to reduce the entropy of the provided data set whilst maintaining identical numerical accuracy for all parallel configurations, irrespective of the number of processors. This objective has been met; see Figure 7.
Figure 7 Final residual energies against the number of processes (1-32) on real seismic data.
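Identical results across processor counts are not automatic in floating point: addition is not associative, so a global sum combined in a processor-count-dependent order can drift between configurations. A sketch of the pitfall such an invariance requirement has to guard against:

```python
# Floating-point addition is not associative: the same four values summed
# in two different groupings (as two different processor partitions might)
# give different results.
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((values[0] + values[1]) + values[2]) + values[3]
regrouped     = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right, regrouped)   # 1.0 vs 2.0: same data, different order
```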
6 Conclusions
It is clear that this port of the numerical kernel of a large sequential program has been a success, and that the code runs on a number of parallel processing machines which are currently commercially available. Portability was obtained without significant efficiency concessions. The authors conclude that the role played by MPI and DDL is crucial in formulating parallelisation strategies, and in providing prototype implementations. From this good base the project partners can consider moving the parallelisation to other parts of the problem solution, secure in the framework of data management which has been provided by MPI and DDL.
7 References
[1] Tim Oliver et al., "Sparse DDL - User Guide", Institute for Advanced Scientific Computation, Liverpool, England, 1995. tim@supr.scm.liv.ac.uk (http://supr.scm.liv.ac.uk/~tim/parsim/parsim.html)
[2] Bill Gropp, Rusty Lusk, Tony Skjellum, Nathan Doss, "A Portable MPI Implementation", Argonne National Laboratory / Mississippi State University, November 1994. gropp@mcs.anl.gov
Big data analytics K.Kiruthika II-M.Sc.,Computer Science Bonsecours college f...
 
Distributed vertex cover
Distributed vertex coverDistributed vertex cover
Distributed vertex cover
 
Qiu bosc2010
Qiu bosc2010Qiu bosc2010
Qiu bosc2010
 
Optimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data PerspectiveOptimal Chain Matrix Multiplication Big Data Perspective
Optimal Chain Matrix Multiplication Big Data Perspective
 
User biglm
User biglmUser biglm
User biglm
 
Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU ArchitecturesEffective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU Architectures
 
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
HOCSA: AN EFFICIENT DOWNLINK BURST ALLOCATION ALGORITHM TO ACHIEVE HIGH FRAME...
 

Achieving Portability and Efficiency in a HPC Code Using Standard Message-passing Libraries

  • 1. HPCS 96 The 10th Annual International Conference on High Performance Computers Achieving Portability and Efficiency in a HPC code using standard message-passing libraries. Derryck L. Lamptey, G. A. Manson, R. K. England National Transputer Support Centre, 5 Palmerston Road, Sheffield, S10 2TE, U.K. Contact author: bq176@torfree.net Telephone: +44 742 76 87 40 Fax: +44 742 72 75 63
signed to minimise the error between the seismic (input) data and the synthetic (output) data sets. The calculation computes a set of variables which reduce the entropy of a given data set over a number of iterations, using a conjugate gradient method. The total seismic inversion involves a complex algorithm, most of which will not benefit greatly from parallelisation.
However, a small section of the code, shown as "Parallel Prototype" in Figure 1, accounts for 95% of the computation time. This is the portion of the algorithm which has been parallelised.

    Global Optimisation =
        Loop for N
            Non-linear optimisation of some binary variables
            Local Optimisation (small) =
                1-dimensional optimisation
                3-dimensional optimisation
            3-dimensional optimisation (big) =
                Loop for M
                    scan 1:     calculation of global scalar.
                    scans 2..5: update of a number of data structures
                                in each processing cell.
                    Inter-cell communications required between each scan.
                End loop M
        End loop N

Figure 1: Overview of the optimisation algorithm. The main optimisation code ("Main") does some external optimisation, then calls the internal optimisation procedure ("Parallel Prototype") to solve the system of equations.
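The loop structure of Figure 1 can be sketched as a single-process skeleton. The sketch below is hypothetical Python: the real operations are confidential, so placeholder updates stand in for the per-cell computation, and all names and data shapes are assumptions.

```python
def parallel_prototype(volume, n_outer=3, m_inner=4):
    """Hypothetical skeleton of the Figure 1 loop structure.

    'volume' stands in for the 3-dimensional seismic data, flattened
    to a list of floats; the update formula is a placeholder.
    """
    state = list(volume)
    for _ in range(n_outer):          # Loop for N: global optimisation
        for _ in range(m_inner):      # Loop for M: 3-D optimisation
            # scan 1: calculation of a global scalar (a reduction over
            # all processing cells in the parallel version)
            global_scalar = sum(state)
            # scans 2..5: update the data structures in each cell;
            # inter-cell communication would occur between scans
            for _scan in range(2, 6):
                state = [0.5 * v + 1e-6 * global_scalar for v in state]
    return state

out = parallel_prototype([1.0] * 8)
```

The placeholder update is chosen only so that the skeleton runs; the point is the nesting of the two loops and the position of the global reduction relative to the per-cell scans.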
3 Options for Code Parallelisation

For this problem, the data structures are large (circa 2GB), regular and 3-dimensional, and there are spatial dependencies between the data elements in all 3 dimensions. The data could not be distributed conveniently in the vertical direction. The computational code in the parallel prototype would be well-structured if these spatial dependencies could be accommodated. Considerable time was therefore spent analysing these data structures and the spatial dependencies. Three data distribution approaches were proposed for this parallelisation:

• Option 1 (The processor grid approach)
• Option 2 (The sparse matrix solver approach)
• Option 3 (The distributed vector approach)

3.1 Option 1 (The processor grid approach)

Figure 2: Mapping the seismic data on to a grid of processors (the x-y-τ volume divided among processors P1..P6).

This solution provides a straightforward mapping, and is conceptually simple, but has a number of disadvantages:

• the option would require complicated communications and process synchronisation, and scalability was not thought to be easily obtainable in a parallel version.
• load balancing was expected to be a problem because of the distribution patterns that would necessarily arise.
3.2 Option 2 (The sparse matrix solver approach)

Figure 3: Mapping the seismic data on to a sparse matrix (the x-τ-y volume flattened into a banded matrix of dimension x*τ*y).

The problem could be formulated as a sparse banded matrix type problem. The solution could then be handled by a number of parallel sparse matrix solvers currently available. Conceptually the sparse matrix approach is simple and elegant, but a number of implementation issues remained:

• Parallel formulation of the sparse matrix is not straightforward, because the data structures necessary for formulation as a matrix problem are not easily mapped into the data structures of the "calling routine" ("Main" in Figure 1). With this approach, the sparse matrix would have to be constructed on each entry to the prototype section of code, and "deconstructed" on each exit.
• Several sparse matrices would need to be set up to represent all the distributed data structures.
• Sparse matrix data storage was likely to be in the "triad" format, giving rise to a memory requirement up to three times that of any other approach.
3.3 Option 3 (The distributed vector approach)

Figure 4: Mapping the seismic data on to one-dimensional vectors (the x-τ-y volume flattened into vectors of length x*τ*y, distributed over processors P1..P4).

This approach formulates the 3-dimensional structures as 1-dimensional structures, permitting the parallel data structures to be viewed in a manner similar to that described in Option 1, but reducing the communication complexity by assigning entire seismic lines (y direction) to processors. This permits the data on each processor to be viewed as a contiguous data space. This option is possible, and beneficial in terms of processing, because:

• loop control is highly regular.
• the spatial dependencies between the data elements are well ordered.
• the code structure of the original sequential code could be preserved to a great degree in the parallel prototype.
• there was a fairly straightforward mapping of existing data structures.

In this formulation of the problem each processor has a number of complete seismic lines (see Figure 4), permitting an SPMD (single program, multiple data) approach to be employed. Figure 4 also illustrates the mapping of the seismic data onto a number of processors, the seismic data being represented by equivalent 1-dimensional vectors. Option 3 was chosen.

4 Library Interfaces

The software architecture of the parallel prototype is shown in Figure 5.
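The chosen distribution can be illustrated with a short sketch. The Python below is hypothetical: the helper names and the (x, τ, y) index ordering are assumptions for illustration, not taken from the original FORTRAN code. It assigns whole seismic lines to processors as contiguous blocks and flattens a 3-D index into the equivalent 1-D vector index.

```python
def lines_for_rank(n_lines, n_procs, rank):
    """Assign whole seismic lines (y direction) to a processor as one
    contiguous block, spreading any remainder over the first ranks."""
    base, extra = divmod(n_lines, n_procs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return list(range(start, start + count))

def flat_index(ix, it, iy, nx, nt):
    """Flatten a 3-D (x, tau, y) index into the equivalent 1-D vector
    index, with the line index iy varying slowest, so that each seismic
    line occupies a contiguous slice of the vector."""
    return (iy * nt + it) * nx + ix

# example: 10 seismic lines distributed over 3 processors
assignment = [lines_for_rank(10, 3, r) for r in range(3)]
```

Because iy varies slowest, line iy occupies indices [iy*nt*nx, (iy+1)*nt*nx), which is what makes each processor's data a contiguous data space.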
Figure 5: Organisation of the library hierarchy (the application / parallel prototype layered over the DDL and MPI libraries, which provide the I/O and computation support).

4.1 Distributed Data Libraries

DDL is the Distributed Data Library developed at the Institute for Advanced Scientific Computation at the University of Liverpool [1]. DDL is a library system which creates, manages and operates upon distributed objects such as matrices and vectors, thus permitting the exploitation of distributed memory parallel computers from a number of single-threaded FORTRAN programs. Many of the conventions employed by the DDL are derived from the MPI standard. The distributed data structures supported by the library are treated for the most part as opaque objects, i.e. they can only be manipulated through calls to DDL procedures. DDL objects may be passed to procedures via the use of handles. But DDL also permits the application to directly access the data inside the distributed vector, in which case the application can treat the data like a local data segment. The DDL manages system memory, allocating space required for new objects and de-allocating redundant objects. DDL is used in the parallel prototype primarily for input and output, memory allocation, and distributed data management. Because of the parallelisation option chosen, only a small subset of the DDL interface is required, namely:

• Input and Output ( DDL_Open(), DDL_Fileformat(), DDL_Read(), DDL_Open_host(), DDL_Write(), DDL_Close() ).
• Memory Allocation and Data distribution ( DDL_Create_vector(), DDL_Free() ).
• Data Access ( DDL_Get_vector() ).
• Data distribution query ( DDL_Size(), DDL_GSize(), DDL_Offset_Vector(), DDL_Size_Vector() ).
• Miscellaneous ( DDL_Init_sparse(), DDL_Finalize_sparse(), DDL_Ishost() ).
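The opaque-object/handle pattern described above can be illustrated with a toy in-process class. This is emphatically NOT the real DDL API: every name, signature and distribution rule below is invented for illustration only.

```python
class ToyDistributedVectors:
    """Toy illustration of the opaque-object/handle pattern: vectors are
    created and queried only through the manager, via integer handles.
    NOT the real DDL API; all names and semantics are invented."""

    def __init__(self, n_procs):
        self.n_procs = n_procs
        self._objects = {}
        self._next = 0

    def create_vector(self, global_size):
        """Allocate a vector and return an opaque handle to it."""
        handle = self._next
        self._next += 1
        self._objects[handle] = [0.0] * global_size
        return handle

    def gsize(self, handle):
        """Global size of the distributed vector."""
        return len(self._objects[handle])

    def size(self, handle, rank):
        """Local size on one rank: equal blocks, remainder spread first."""
        base, extra = divmod(self.gsize(handle), self.n_procs)
        return base + (1 if rank < extra else 0)

    def get_vector(self, handle, rank):
        """Expose the local segment directly, as a local data segment."""
        offset = sum(self.size(handle, r) for r in range(rank))
        return self._objects[handle][offset:offset + self.size(handle, rank)]

    def free(self, handle):
        """De-allocate a redundant object."""
        del self._objects[handle]

ddl = ToyDistributedVectors(n_procs=4)
h = ddl.create_vector(10)
local0 = ddl.get_vector(h, rank=0)
```

The design point being illustrated: the application never touches the storage except through handle-based calls, yet a direct local-segment view is still available where the computation needs it.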
4.2 Message Passing Interface

MPI is a Message Passing Interface which is intended to become a standard for applications running on distributed memory MIMD concurrent computers, and is well described elsewhere [2]. It
is not intended as a complete parallel programming environment, and currently lacks universal support for features such as parallel I/O, parallel program composition, dynamic process control and debugging. However, the core of MPI consists of a set of routines which support point-to-point communication between pairs of processes or between groups of processes. Because of the regular nature of the seismic data structures, the message-passing requirements of the chosen approach are not complicated and rely on a very basic and standard subset of the MPI library:

• Message passing routines ( MPI_Send(), MPI_Recv() ).
• Communication contexts ( MPI_Comm_size(), MPI_Comm_rank() ).
• Data reduction ( MPI_Allreduce() ).
• Miscellaneous ( MPI_Init(), MPI_Finalize() ).

More and more vendors are providing proprietary implementations of MPI, e.g. Silicon Graphics.

4.3 Use of standard libraries in porting

During the design stage, the portable libraries (DDL and MPI) were chosen. These libraries were used during the development, but a conscious effort had to be made during the development process to use the smallest subset of functions from these libraries, in order to minimise the library dependencies. All known implementations of MPI incorporate the MPI routines required by the parallel prototype. DDL is portable, and available for an increasing number of platforms.

During the development phase, coding and functional testing was carried out on the NTSC's set of Sun workstations, and performance testing and module integration was carried out on the CS-2 supercomputer at CERN, mostly because the system bandwidth of the in-house Sun network is around 0.5MB/sec, as compared to the 50MB/sec node-to-node bandwidth that the CS-2 can sustain.
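The scan-1 global scalar of Figure 1 is exactly the pattern MPI_Allreduce serves: each process reduces its own seismic lines to a partial value, and every process receives the combined result. The single-process Python sketch below simulates that communication structure only; it is an illustration, not the FORTRAN/MPI code of the prototype.

```python
def simulated_allreduce_sum(partial_values):
    """Stand-in for MPI_Allreduce with MPI_SUM: every rank contributes
    its partial value and every rank receives the same total."""
    total = sum(partial_values)
    return [total] * len(partial_values)

def spmd_global_scalar(local_segments):
    """Each simulated rank reduces its own local seismic lines to a
    partial scalar (scan 1), then all ranks obtain the global scalar."""
    partials = [sum(seg) for seg in local_segments]
    return simulated_allreduce_sum(partials)

segments = [[1.0, 2.0], [3.0], [4.0, 5.0]]       # 3 simulated ranks
globals_per_rank = spmd_global_scalar(segments)  # every rank sees 15.0
```

Because every rank ends up with the same scalar, scans 2..5 can then proceed independently on each rank's contiguous block, which is why this small MPI subset suffices for the chosen distribution.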
Portability was also ensured at a minimal cost in efficiency: benchmarks show that on the CS-2, around 80% of the native message-passing bandwidth is available through the use of the MPI libraries.

5 Results

5.1 Parallel Performance

Using data from North Sea oilfields, the parallel prototypes developed have demonstrated impressive parallel scalability in benchmarking tests on the CS-2 machine at CERN (see Figure 6).
Figure 6: Parallel scalability of the parallel prototype on real seismic data; the achieved computation speedup closely tracks linear speedup as the number of processors increases. Data set: 60 lines, 151 traces per line, 56 samples per trace, 9 samples per wavelet.

5.2 Numerical Performance

A key objective for the parallel prototype is to reduce the entropy of the provided data set, whilst maintaining identical numerical accuracy for all parallel configurations (invariant of the number of processors). This objective has been met. See Figure 7.

Figure 7: Final energies for a number of processors on real seismic data (residual energy versus number of processes).

6 Conclusions

It is clear that this port of the numerical kernel of a large sequential program has been a success, and that the code runs on a number of parallel processing machines which are currently commercially available. Portability was obtained without significant efficiency concessions. The authors conclude that the role played by MPI and DDL is crucial in formulating parallelisation strategies, and in providing prototype implementations. From this good base the project partners can consider moving the parallelisation to other parts of the problem solution, secure in the framework of data management which has been provided by MPI and DDL.
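Speedup curves such as Figure 6 are conventionally derived from run times as follows. The timings in this sketch are hypothetical placeholders, not the benchmark numbers from the CS-2 runs.

```python
def speedup(t_serial, t_parallel):
    """Computation speedup relative to the single-processor run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup(t_serial, t_parallel) / n_procs

# hypothetical timings (in seconds), for illustration only
t1, t30 = 900.0, 32.0
s = speedup(t1, t30)           # speedup on 30 processors
e = efficiency(t1, t30, 30)    # fraction of linear speedup
```

An efficiency close to 1.0 across the processor range is what "near linear scalability" denotes in Section 1.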
7 References

[1] Tim Oliver et al., "Sparse DDL – User Guide", Institute for Advanced Scientific Computation, Liverpool, England, 1995. tim@supr.scm.liv.ac.uk (http://supr.scm.liv.ac.uk/~tim/parsim/parsim.html)

[2] Bill Gropp, Rusty Lusk, Tony Skjellum, Nathan Doss, "A Portable MPI Implementation", Argonne National Laboratory / Mississippi State University, November 1994. gropp@mcs.anl.gov