A Lightweight C++ Interface to MPI
Simone Pellegrini, Radu Prodan, Thomas Fahringer
Institute of Computer Science, University of Innsbruck
Technikerstr. 21A, 6020 Innsbruck, Austria
Abstract
The Message Passing Interface (MPI) provides bindings
for the three programming languages commonly used in
High Performance Computing (HPC): C, C++ and Fortran.
Unfortunately, MPI supports only the lowest common de-
nominator of the three languages, providing a level of ab-
straction far lower than typical C++ libraries. Lately, af-
ter the decision of the MPI committee to deprecate and re-
move the C++ bindings from the MPI standard, program-
mers are forced to use either the C API or rely on third-party
libraries.
In this paper we present a lightweight, header-only C++
interface to MPI which uses object oriented and generic
programming concepts to improve its integration into the
C++ programming language. We compare our wrapper
with a related approach called Boost.MPI showing how
MPP facilitates the interaction with C++ objects. Performance-wise, MPP outperforms Boost.MPI by reducing the
interface overhead by a factor of eight. Additionally, MPP’s
handling of user-defined data types allows STL containers
(e.g. std::list) to be transferred up to 20 times faster
than Boost.MPI, which relies on software serialization, for
small linked lists.
1 Introduction
MPI is the de facto standard for writing parallel programs
for distributed memory systems. As its focus is on High
Performance Computing (HPC), MPI offers an Applica-
tion Programming Interface (API) for C, C++ and Fortran,
the most widely used languages for HPC. Unfortunately,
since the definition of the first standard in 1994 [3], MPI
has not kept pace with the evolution of the underlying
languages, such as object-oriented programming in Fortran
2000 and templates in C++. Nowadays, this problem is
mostly perceived in C++ which, unlike Fortran and C, pro-
vides much higher-level abstractions that are not reflected
in the design of the MPI interface [6]. MPI is so poorly in-
tegrated into the C++ environment that many programmers
prefer to use the C interface even in C++ programs. Fur-
thermore, to map common C++ constructs onto MPI, pro-
grammers are forced to weaken the language’s type safety. As
a consequence, errors that could easily be detected by the
compiler are no longer captured, leading to runtime failures.
These issues led the MPI committee to deprecate the C++
bindings in version 2.2 of the MPI stan-
dard. However, because of the growing interest and use of
C++ in HPC, several third-party wrappers to MPI have been
proposed [11], the most important being Boost.MPI [8] and
OOMPI [9].
Figure 1 shows a simple MPI program sending two float-
ing point values from process rank 0 to rank 1. A problem
of this code snippet is that the programmer is forced to un-
necessarily declare a temporary variable val to store the
values being sent by MPI_Send (line 4). Although the C99
standard [1] introduced compound literals to avoid such un-
necessary memory allocations (line 2), they are not widely
used because of the decreased code readability. Because the
compiler is not aware of the semantics of MPI_Send, which
guarantees that val’s value is not modified, no memory
optimizations can be performed. A second problem is that
the signature of all MPI routines requires the programmer
to provide the size and the type (i.e. MPI_FLOAT) of
the data being sent, which is error-prone and can be avoided
in C++ by inferring them at compile-time.
Boost.MPI [8] tries to simplify the MPI interface by de-
ducing several of those parameters at compile-time through
C++ template techniques. For example, the size of the data
sent and its associated MPI_Datatype is strictly related to
the type of the object being sent and, therefore, deducible at
compile-time from the C++ typing system. The send and
recv routines in Boost.MPI require only three parameters,
as shown in Figure 2 (lines 2, 3, 6, and 8): the source/desti-
nation rank, the message tag, and the message content. This
not only simplifies the usage of the routines, but also im-
proves their safety. Although Boost.MPI is a consistent im-
provement over the standard MPI C++ bindings, it is not
widely accepted within the MPI community because of two
main reasons: (i) the dependency on the Boost C++ library
and accompanying licensing issues; (ii) the use of a serial-
1 if ( rank == 0 ) {
2 MPI_Send((const int[1]) { 2 }, 1, MPI_INT, 1, 1,
3 MPI_COMM_WORLD );
4 std::array<float,2> val = {3.14f, 2.95f};
5 MPI_Send(&val.front(), val.size(), MPI_FLOAT, 1, 0,
6 MPI_COMM_WORLD);
7 } else if (rank == 1) {
8 int n;
9 MPI_Recv(&n, 1, MPI_INT, 0, 1, MPI_COMM_WORLD,
10 MPI_STATUS_IGNORE);
11 std::vector<float> values(n);
12 MPI_Recv(&values.front(), n, MPI_FLOAT, 0, 0,
13 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
14 }
Figure 1. Simple MPI program using C bind-
ings
ization library [10] to handle transmission of user-defined
data types (i.e. merging objects with a sparse memory
representation into a contiguous data chunk), which negatively
impacts performance.
1 if ( world.rank() == 0 ) {
2 world.send( 1, 1, 2 );
3 world.send( 1, 0, std::array<float,2>({3.14f, 2.95f}) );
4 } else if (world.rank() == 1) {
5 int n;
6 world.recv(0, 1, n);
7 std::vector<float> values(n);
8 world.recv(0, 0, values);
9 }
Figure 2. Boost.MPI version of the program
from Figure 1
An object-oriented approach to improve the C++ MPI
interface is OOMPI [9] which specifies send and receive
operations in a more user-friendly way by overloading the
insertion << and extraction >> C++ operators. In OOMPI, a
Port towards a process rank is obtained by using the array
subscript operator [] on a communicator object (see line 2
in Figure 3). A further advantage is the convenience of combining
these operators in a single C++ statement when inserting
or extracting data to/from the same stream. A drawback
of OOMPI is the poor integration of arrays and user data
types in general. For example, sending an array instance
requires the programmer to explicitly instantiate an object
of class OOMPI_Array_message, which requires the size
and type of the data to be manually specified as in the cur-
rent MPI specification (line 4). The support for generic user
data types requires the objects being sent to inherit from the
OOMPI_User_type interface. This is a rather severe lim-
itation as it does not allow any legacy class (e.g. the STL’s
containers) to be directly supported.
1 if ( OOMPI_COMM_WORLD.rank() == 0 ) {
2 OOMPI_COMM_WORLD[1] << 2;
3 std::array<float,2> val = {3.14f, 2.95f};
4 OOMPI_COMM_WORLD[1] <<
5 OOMPI_Array_message(&val.front(), val.size());
6 } else if (OOMPI_COMM_WORLD.rank() == 1) {
7 int n;
8 OOMPI_COMM_WORLD[0] >> n;
9 std::vector<float> values(n);
10 OOMPI_COMM_WORLD[0] >>
11 OOMPI_Array_message(&values.front(), n);
12 }
Figure 3. OOMPI version of the program from
Figure 1
In this paper, we combine some of the concepts pre-
sented in Boost.MPI and OOMPI and propose an advanced
lightweight MPI C++ interface called MPP that aims at
transparently integrating the message passing paradigm into
the C++ programming language without sacrificing perfor-
mance. Our approach focuses on point-to-point commu-
nications and integration of user data types which, unlike
Boost.MPI, relies entirely on native MPI_Datatypes for
better performance. Our interface also utilizes advanced
concepts from other parallel programming languages, such
as future objects [5], which simplify the use of MPI asyn-
chronous routines.
Overall, MPP is designed with a specific focus on per-
formance. As we target HPC systems, we understand how
critical performance is and have spent significant effort to
reduce the interface overhead. We compare the performance
of MPP with Boost.MPI and show that, for a simple ping-
pong application, MPP achieves a four times larger through-
put (in terms of messages per second). Compared to the
pure C bindings, MPP has an increased latency of only
9%. As far as the handling of user data types is concerned,
MPP is able to reduce the transfer time of a linked list (i.e.
std::list<T> from C++ STL) up to 20 times compared
to Boost.MPI. To assess the benefit of using MPP
in real applications, we rewrote the computational kernel
of QUAD_MPI [7] to use Boost.MPI and MPP. The ob-
tained results show a performance improvement of around
12% compared to Boost.MPI.
The rest of the paper is organized as follows. In Section 2
we introduce MPP as a lightweight C++ wrapper to MPI
using small code snippets. In Section 3 we compare our
library against Boost.MPI and a plain MPI implementation
using two micro-benchmark codes and an application code
called QUAD_MPI. Section 4 concludes the paper.
2 MPP: C++ Interface to MPI
We use object-oriented programming concepts and C++
templates to design a lightweight wrapper for MPI routines
that simplifies the way in which MPI programs are written.
Similar to Boost.MPI, we achieve this goal by reducing the
amount of information required by MPI routines and by in-
ferring as much as possible at compile-time. By reducing
the amount of code written by the users, we expect fewer pro-
gramming errors. Furthermore, by making type checking
safer, most common programming mistakes can be captured
at compile-time. In this work, we focus on point-to-point
operators, as the specialised semantics of collective oper-
ations has no counterpart in C++ STL. We also present a
generic mechanism of handling C++ user data types which
allows for easy transfer of C++ objects to any existing MPI
routine (including collective operations).
2.1 Point-to-Point Communication
While Boost.MPI maintains in its API design the style of
the traditional send/receive MPI routines, our approach is
more similar to OOMPI aiming at a better C++ integration
by defining these basic operations using streams. A stream
is an abstraction that represents a device on which input
and output operations are performed. Therefore, sending
or receiving a message through an MPI channel can be seen
as a stream operation. We introduce an mpi::endpoint
class which has the semantics of a bidirectional stream from
which data can be read (received) or written (sent) using the
<< and >> operators. The concept of endpoints is simi-
lar to the Port abstraction of OOMPI; however, because
our mechanism is based on generic programming, user-
defined data types can be transparently handled. In con-
trast, OOMPI is based on inheritance which forces the pro-
grammer to instantiate an OOMPI_Message class contain-
ing the data type and size required by the MPI routines un-
derneath [11] (see line 4 in Figure 3).
Because an MPI send/receive operation offers more ca-
pabilities than C++ streams (e.g. tags for messages, non-
blocking semantics), endpoints cannot be directly modelled
using an “is-a” relationship. Fortunately, STL’s utilities
(e.g. algorithms) are mostly based on templates and end-
points can be passed to any generic function which relies on
the << or >> stream operations. Figure 4 shows an example
that uses an endpoint as argument to a generic read_from
function. An endpoint is generated from a communicator
using the () operator to which the process rank is passed
(line 3). The mpi::comm class is a simple wrapper for an
MPI communicator, with the capability of creating endpoints
and retrieving the current process rank and the communicator
size. mpi::comm::world refers to an instance of
the comm class that wraps the MPI_COMM_WORLD com-
municator.
1 namespace mpi {
2 struct comm {
3 mpi::endpoint operator()(int) const;
4 };
5 } // end mpi namespace
6
7 template <class InStream, class T>
8 void read_from(InStream& in, T& val) {
9 in >> val;
10 }
11
12 int val[2];
13 // reads the first element of the val array from std::cin
14 read_from(std::cin, val[0]);
15
16 // receives 2nd element of val array from rank 1
17 read_from(mpi::comm::world(1), val[1]);
Figure 4. Example of usage of endpoints in a
generic function.
Figure 5 shows how the program in Figure 1 can be
rewritten with MPP. First of all, the objects are either sent
or received using stream operations, which allows for more
compact code (half the size) compared to the C MPI bindings or
to Boost.MPI. Secondly, objects are automatically wrapped
by a generic mpi::msg<T> object, which does not need
to be specified by the user (as opposed to OOMPI). Adding
this level of indirection allows MPP to handle both prim-
itive and user data types in a way transparent to the user.
R-values (i.e. values with no address such as constants) are
handled similarly to any regular L-value (e.g. variables) us-
ing C++ constant references via the msg class, which avoids
unnecessary memory allocation. The interface also allows
message tags to be specified by manually constructing the message
wrapper (e.g. line 4 in Figure 5).
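As an illustration, a minimal sketch of what such a message wrapper could look like is given below; the member names and the default tag value are assumptions, only the msg name and the (value, tag) constructor appear in the text and in Figure 5.

namespace mpi {

// Sketch of a message wrapper: it holds a constant reference to the user
// value, so R-values and L-values are treated uniformly and no copy is
// made, together with an optional tag (0 by default).
template <class T>
class msg {
    const T& m_value; // bound to the sender's object, no allocation
    int      m_tag;
public:
    explicit msg(const T& value, int tag = 0)
        : m_value(value), m_tag(tag) {}
    const T& value() const { return m_value; }
    int      tag()   const { return m_tag; }
};

} // namespace mpi

// Usage, mirroring Figure 5: comm::world(1) << mpi::msg<int>(2, 1);
// (with C++17 class template argument deduction the <int> can be omitted,
// as in the figure; in C++11 a small factory function would be used instead).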
MPP also supports non-blocking semantics for the
send and receive operations through the overloaded <
and > operators. Unlike blocking send/receives, asyn-
chronous operations return a future object [5] of class
mpi::request<T> which can be polled to test whether
the pending operation has completed or not. An exam-
ple of non-blocking operations in MPP is shown in Fig-
ure 6. For non-blocking receives, the method T& get()
waits for the underlying operation to complete (line 5)
and, upon completion, it returns a reference to the received
value. The mpi::request<T> class also provides a
void wait() and a bool test() method implement-
ing the semantics of MPI_Wait and MPI_Test, respec-
1 using namespace mpi;
2 if ( comm::world.rank() == 0 ) {
3 comm::world(1) << std::array<float,2>({3.14f, 2.95f});
4 comm::world(1) << msg(2, 1);
5 } else if ( comm::world.rank() == 1 ) {
6 int n;
7 comm::world(0) >> msg(n, 1);
8 std::vector<float> values(n);
9 comm::world(0) >> values;
10 }
Figure 5. MPP version of the program from
Figure 1.
tively. The example also shows MPP’s support for receive
operations which listen for messages coming from an un-
known process using the mpi::any constant rank when
creating an endpoint (line 3).
1 float real;
2 mpi::request<float>&& req =
3 mpi::comm::world(mpi::any) > real;
4 // ... do something else ...
5 use( req.get() );
Figure 6. Non-blocking MPP endpoints.
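A rough sketch of how such a future-like request could wrap an MPI_Request is shown below; the member names and the way the destination buffer is stored are assumptions, not MPP's actual implementation.

#include <mpi.h>

namespace mpi {

// Sketch of a future object for a non-blocking receive: the received value
// is reachable only through get(), which blocks until MPI reports completion,
// so the buffer cannot be read before it has been written.
template <class T>
class request {
    MPI_Request m_req;
    T&          m_storage; // destination of the pending receive
public:
    request(MPI_Request req, T& storage) : m_req(req), m_storage(storage) {}

    // MPI_Wait semantics: block until the pending operation completes.
    void wait() { MPI_Wait(&m_req, MPI_STATUS_IGNORE); }

    // MPI_Test semantics: poll for completion without blocking.
    bool test() {
        int flag = 0;
        MPI_Test(&m_req, &flag, MPI_STATUS_IGNORE);
        return flag != 0;
    }

    // Block until completion, then hand back a reference to the value.
    T& get() { wait(); return m_storage; }
};

} // namespace mpi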
Errors, which every MPI routine returns as an error code,
are handled in MPP via C++ exceptions. Any call to an MPP
routine can potentially throw an exception derived from
mpi::exception. The method get_error_code()
of this class allows the retrieval of the native error code.
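A minimal sketch of how this error mapping could be implemented is shown below; the check() helper and the exception internals are assumptions, only mpi::exception and get_error_code() are taken from the text. Note that MPI only returns error codes if the communicator's error handler is set to MPI_ERRORS_RETURN.

#include <mpi.h>
#include <stdexcept>
#include <string>

namespace mpi {

// Sketch of an exception type carrying the native MPI error code.
class exception : public std::runtime_error {
    int m_code;
public:
    exception(int code, const std::string& what)
        : std::runtime_error(what), m_code(code) {}
    int get_error_code() const { return m_code; }
};

// Hypothetical helper wrapping the return code of an MPI call.
inline void check(int rc) {
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int  len = 0;
        MPI_Error_string(rc, msg, &len);
        throw exception(rc, std::string(msg, len));
    }
}

} // namespace mpi

// Usage: mpi::check( MPI_Send(&value, 1, MPI_INT, dest, tag, MPI_COMM_WORLD) );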
2.2 User Data Types
OOMPI is one of the first APIs to introduce
support for user data types, through inheritance from an
OOMPI_User_type class. Unfortunately, this mechanism
is relatively weak because, by relying on inheritance, it does
not allow the handling of class instances provided by third-
party libraries (e.g. STL containers). Another attempt is
the use of serialization in Boost.MPI which, although el-
egant, introduces a high runtime overhead. The objective
of MPP is to reach the same level of integration with user
data types as Boost.MPI without performance loss, which
we achieve by relying on the existing MPI support for user
data types, i.e. MPI_Datatype. The definition of an
MPI_Datatype is rather cumbersome and therefore not
commonly used. Indeed, defining an MPI_Datatype re-
quires the programmer to specify several pieces of information
related to its memory layout, which often leads to program-
ming errors that are very difficult to debug. However, be-
1 template <class T>
2 struct mpi_type_traits<std::vector<T>> {
3 static inline const T*
4 get_addr( const std::vector<T>& vec ) {
5 return mpi_type_traits<T>::get_addr(vec.front());
6 }
7 static inline size_t
8 get_size( const std::vector<T>& vec ) {
9 return vec.size();
10 }
11 static inline MPI_Datatype
12 get_type( const std::vector<T>& ) {
13 return mpi_type_traits<T>::get_type( T() );
14 }
15 };
16 ...
17 typedef mpi_type_traits<vector<int>> vect_traits;
18 vector<int> v = { 2, 3, 5, 7, 11, 13, 17, 19 };
19 MPI_Ssend( vect_traits::get_addr(v),
20 vect_traits::get_size(v),
21 vect_traits::get_type(v), ... );
Figure 7. Example of using mpi_type_traits
to handle STL vectors.
cause operations on data types are mapped to DMA trans-
fers by the MPI library, the use of an MPI_Datatype out-
performs any other techniques based on software serializa-
tion.
The integration of user data types is achieved by using
a design pattern called type traits [4]. An example is illus-
trated in Figure 7 for C++ STL’s std::vector<T> class.
We let the user specialize a class which statically provides
the compiler with the three pieces of information required to map a
user data type to an MPI_Datatype:
1. the memory address from which the data type instance
begins;
2. the type of each element;
3. the number of elements.
Because a C++ vector is contiguously allocated in memory,
its starting address is that of its first element; the address is
computed recursively so that regular nested types
(e.g. vector<array<float,10>>) are handled as well (lines 3−6). The
length is the number of elements present in the vector (line
9) and the type is the data type of a vector element (lines
11−14). Because our mechanism is not based on inher-
itance (like in OOMPI), it is open for integration and use
with third-party class libraries. Lines 17−21 show how the
introduced type traits can be used with the MPI C binding.
This method can also be used for collective operations or
for one of the several flavors of MPI_Send for which an
[Figure 8. MPP performance evaluation results: (a) number of ping-pong operations per second; (b) comparison of Boost.MPI and MPP for STL’s linked list (std::list<T>).]
appropriate operator cannot be defined. MPP also provides
several type traits for some of the STL containers such as
vector, array and list.
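The recursive calls in Figure 7 (mpi_type_traits<T>::get_addr and get_type) bottom out at primitive types. A minimal sketch of such a base-case specialization, assuming the trait interface shown in Figure 7, could look as follows.

#include <mpi.h>
#include <cstddef>

// Primary template, specialized by the user for every transferable type
// (sketch; MPP's header is assumed to declare this).
template <class T> struct mpi_type_traits;

// Base case for a primitive type: the address is the object itself, the
// element count is one, and the MPI type is the matching built-in datatype.
template <>
struct mpi_type_traits<int> {
    static inline const int*   get_addr(const int& v) { return &v; }
    static inline size_t       get_size(const int&)   { return 1; }
    static inline MPI_Datatype get_type(const int&)   { return MPI_INT; }
};

// Analogous specializations for float (MPI_FLOAT), double (MPI_DOUBLE), and
// so on allow the vector specialization of Figure 7 to work for vector<int>,
// vector<float>, and regularly nested containers.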
3 Performance Evaluation
In this section we compare the performance of MPP
against Boost.MPI and the standard C binding of MPI. We
used Open MPI version 1.4.2 to execute the experiments.
We did not consider OOMPI for the performance evaluation
since its development was discontinued several
years ago. We first compared the MPI bindings by using micro-
benchmarks and then by using a real MPI application called
QUAD_MPI, a C program that approximates an
integral based on a quadrature rule [7].
3.1 Micro Benchmarks
The purpose of the first experiment is to measure the latency
overhead that MPP and Boost.MPI introduce over the standard C
interface to MPI. We implemented
a simple ping-pong application which we executed on a
shared memory machine with a single AMD Phenom II X2
555 dual-core processor (3.5 GHz, 1 MB of L2 cache, and
6 MB of L3 cache). This way, any data transmission over-
head is minimized and the focus is solely on the interface
overhead. Figure 8(a) displays the number of ping-pong
operations per second for varying message sizes. MPP has
approximately 9% larger latency for small messages com-
pared to the native MPI routines. This overhead is due to the
creation of a temporary status object corresponding to the
MPI_Status returned by the MPI receive routine contain-
ing the message source, size, tag, and error (if any). Com-
pared to Boost.MPI, MPP nevertheless shows a considerable
performance improvement of around 75% for small mes-
sage sizes. Because both implementations use plain vectors
to store the exchanged message, no serialization is involved
to explain the overhead difference. We believe that the main
reason for this overhead comes from the fact that Boost.MPI
is implemented as a compiled library and every call to an MPI routine
pays the overhead of an additional function call. We solved
the problem in MPP by designing a pure header-based im-
plementation, which allows all MPP routines to be inlined
by the compiler, thus eliminating the extra function call overhead. The graph
also illustrates that, as expected, the overhead decreases for
larger messages as the communication time becomes pre-
dominant.
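For reference, the kind of loop measured in this experiment looks roughly as follows; the buffer size and iteration count are illustrative and this is not the exact benchmark code.

#include <mpi.h>
#include <vector>

// Illustrative ping-pong: rank 0 sends a message to rank 1, which echoes it
// back; round trips divided by elapsed time give the operations-per-second
// metric plotted in Figure 8(a).
double ping_pong(int rank, int msg_size, int iterations) {
    std::vector<char> buf(msg_size);
    MPI_Status status;
    double start = MPI_Wtime();
    for (int i = 0; i < iterations; ++i) {
        if (rank == 0) {
            MPI_Send(&buf.front(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf.front(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&buf.front(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf.front(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return iterations / (MPI_Wtime() - start); // ping-pong operations per second
}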
In the second experiment, we compared MPP with
Boost.MPI for the support of user-defined data types. We
used a std::list<double> of varying size exchanged be-
tween two processes in a loop repeated one thousand times.
We executed the experiment on an IBM blade cluster with
quad-core Intel Xeon X5570 processors interconnected
through an InfiniBand network. We allocated the two MPI pro-
cesses on different blades in order to simulate a real use
case scenario. Figure 8(b) shows the time necessary to per-
form this micro-benchmark for different list sizes and the
1 double my_a, my_b;
2 my_total = 0.0;
3 if ( rank == 0 ) {
4 for ( unsigned q = 1; q < p; ++q ) {
5 my_a = ( ( p - q ) * a + ( q - 1 ) * b ) / ( p - 1 );
6 MPI_Send( &my_a, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD );
7
8 my_b = ( ( p - q - 1 ) * a + ( q ) * b ) / ( p - 1 );
9 MPI_Send( &my_b, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD );
10 }
11 } else {
12 MPI_Recv( &my_a, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status );
13 MPI_Recv( &my_b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status );
14
15 for ( unsigned i = 1; i <= my_n; ++i ) {
16 x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
17 my_total = my_total + f( x );
18 }
19 my_total = (my_b - my_a) * my_total / (double) my_n;
20 }
Figure 9. Computational kernel of QUAD_MPI.
speedup achieved by MPP over Boost.MPI. For small lists
of 100 elements, the speedup is approximately 20; however,
the performance gap closes as the list size increases.
The reason is that MPP’s std::list support uses
MPI_Type_struct, which requires enumerating all the
memory addresses that compose the object being sent. To
create an MPI_Datatype for a linked list, three arrays
have to be provided (a minimal sketch follows this list):
• the displacement of each list element relative to the
starting address;
• the size of each element;
• the data type of each element (i.e. O(3·N) of memory
overhead).
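A minimal sketch of this construction for a std::list<double> is shown below; it uses the standard MPI_Type_create_struct form (the non-deprecated equivalent of MPI_Type_struct) with absolute, MPI_BOTTOM-relative displacements, and the helper name is illustrative.

#include <mpi.h>
#include <list>
#include <vector>

// Builds an MPI datatype describing the in-place memory layout of a
// std::list<double>: one displacement, one block length and one element type
// per node (the O(3*N) metadata mentioned above). The resulting type is used
// as MPI_Send(MPI_BOTTOM, 1, dt, ...).
MPI_Datatype make_list_type(const std::list<double>& l) {
    const int n = static_cast<int>(l.size());
    std::vector<int>          blocklens(n, 1);
    std::vector<MPI_Datatype> types(n, MPI_DOUBLE);
    std::vector<MPI_Aint>     displs;
    displs.reserve(n);
    for (const double& elem : l) {             // enumerate every node address
        MPI_Aint addr;
        MPI_Get_address(const_cast<double*>(&elem), &addr);
        displs.push_back(addr);
    }
    MPI_Datatype dt;
    MPI_Type_create_struct(n, &blocklens.front(), &displs.front(),
                           &types.front(), &dt);
    MPI_Type_commit(&dt);
    return dt; // must be released with MPI_Type_free after use
}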
We observe in Figure 8(b) that building such a data type becomes
more expensive as the list size increases, so that for
large linked lists of over 50,000 elements software serialization
outperforms the MPI data typing mechanism. A future
optimization could improve the support for large data structures
by integrating into MPP a mechanism that switches from
MPI_Datatype to serialization above a critical size.
3.2 QUAD MPI Application Code
The micro-benchmarks highlighted the low latency of
the MPP bindings; however, this does not say much
about the benefits of using MPP for real application codes.
1 my_total = 0.0;
2 if ( rank == 0 ) {
3 for ( unsigned q = 1; q < p; ++q ) {
4 world.send(q, 1, (( p - q ) * a + ( q - 1 ) * b) / ( p - 1 ));
5 world.send(q, 2, (( p - q - 1 ) * a + ( q ) * b) / ( p - 1 ));
6 }
7 } else {
8 double my_a, my_b;
9 world.recv(0, 1, my_a);
10 world.recv(0, 2, my_b);
11
12 for ( unsigned i = 1; i <= my_n; ++i ) {
13 x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
14 my_total = my_total + f( x );
15 }
16 my_total = (my_b - my_a) * my_total / (double) my_n;
17 }
Figure 10. Computational kernel of
QUAD_MPI rewritten using Boost.MPI.
For this purpose we took a simple MPI application kernel
called QUAD_MPI and rewrote it using Boost.MPI and
MPP. QUAD_MPI is a C program which approximates an
integral using a quadrature rule [7] and can be efficiently
parallelized using MPI. From the original code [7], we ex-
tracted the computational kernel depicted in Figure 9. The
process rank 0 assigns to every other process a sub-interval
of [A, B] and these bounds are then communicated using
message passing routines. The number of communication
statements in the code is limited, i.e. 2 · (P − 1), where P
is the number of processes. Therefore, this code represents
a good balance between communication and computation,
making it a good choice to determine the benefits of the MPP
bindings.
This QUAD MPI kernel can be easily rewritten using
Boost.MPI and MPP, as shown respectively in Figures 10
and 11. In both cases, we removed the necessity of assigning
the value being sent to the my_a and my_b variables be-
cause both Boost.MPI and MPP support sending R-values
that are computed and directly sent to the destination (lines
4 and 5). The code at the receiver side is similar, the only
difference being that we can now restrict the scope of the
my_a and my_b variables to the else body (lines 9 and
10), which allows faster machine code generation as the
compiler can utilize the CPU registers more efficiently. Ad-
ditionally, MPP allows for a further reduction of the code
as shown in Figure 11, since the two sends (line 4) and the
two receives (line 9) can be combined together into a sin-
gle statement. MPP also relieves the programmer from the
burden of specifying a message tag by using tag 0
by default. With MPP we are able to shrink the input code
1 my_total = 0.0;
2 if ( rank == 0 ) {
3 for ( unsigned q = 1; q < p; ++q ) {
4 comm::world(q) << ((p - q) * a + (q - 1) * b) / (p - 1)
5                << ((p - q - 1) * a + q * b ) / (p - 1);
6 }
7 } else {
8 double my_a, my_b;
9 comm::world(0) >> my_a >> my_b;
10
11 for ( unsigned i = 1; i <= my_n; ++i ) {
12 x = ((my_n - i) * my_a + (i - 1) * my_b) / (my_n - 1);
13 my_total = my_total + f( x );
14 }
15 my_total = (my_b - my_a) * my_total / (double) my_n;
16 }
Figure 11. Computational kernel of
QUAD_MPI rewritten using MPP.
by 30% (in terms of number of characters), which reduces
the chances of programming errors and increases the overall
productivity.
We ran the three versions of the QUAD_MPI kernel on a
machine with 16 cores (a dual socket Intel Xeon CPU) and
used shared memory to minimize communications costs
and highlight the library overhead. We compiled the in-
put programs with optimization enabled (i.e. the -O3 flag),
repeated each experiment 10 times, and reported the average
and standard deviation of the execution time (see Fig-
ure 12).
Because of the removal of the superfluous assignment
operations to the my_a and my_b variables, the MPP version
performs slightly faster than the original code. It is
worth noting that, although the same optimization has
been applied to the Boost.MPI version, the large overhead
of Boost.MPI cancels any benefit, making the resulting code
the slowest of the three. Compared to Boost.MPI, the MPP
version has a performance improvement of around 12%.
4 Conclusions
In this paper we presented MPP as an advanced C++ in-
terface to MPI. We combined some of the ideas of OOMPI
and Boost.MPI into a lightweight, header-only interface
smoothly integrated with the C++ environment. We intro-
duced a transparent mechanism for dealing with user data
types which, for small objects, is up to 20 times faster than
Boost.MPI due to the use of MPI_Datatypes instead of
software serialization. We showed that programs written
using MPP are more compact compared to the MPI C bind-
ings and that the overhead introduced by the object-oriented design
[Figure 12. QUAD_MPI performance comparison.]
is negligible. Furthermore, MPP can avoid common pro-
gramming errors in two ways:
1. through its interface design that uses future objects to
avoid reading the buffer of an asynchronous receive
before data has been written;
2. by automatically inferring most of the input arguments
required by MPI routines.
The MPP interface is freely available at [2].
In the future we intend to extend the interface to sup-
port easier use of other complex MPI features such as dy-
namic process management, operations on communicators
and groups, and creation of process topologies.
5 Acknowledgments
This research has been partially funded by the Austrian
Research Promotion Agency (FFG) under grant P7030-025-
011 and by the Tiroler Zukunftsstiftung under the Trans-
lational Research Grant "Parallel Computing with Java for
Manycore Computers".
References
[1] C99 standard. www.open-std.org/JTC1/SC22/
wg14/www/docs/n1124.pdf
[2] MPI C++ Interface. https://github.com/
motonacciu/mpp
[3] The MPI-1 Specification. http://www.mpi-forum.
org/docs/docs.html
[4] A. Alexandrescu. Traits: The else-if-then of types. In
C++ Report, pages 22–25, 2000. http://erdani.
com/publications/traits.html
[5] H. C. Baker, Jr. and C. Hewitt. The incremental garbage
collection of processes. In Proceedings of the 1977 sympo-
sium on Artificial intelligence and programming languages,
pages 55–59, New York, NY, USA, 1977. ACM.
[6] J. M. Squyres, B. Saphir, and A. Lumsdaine. The design and evolution
of the MPI-2 C++ interface. In Proceedings of the 1997 In-
ternational Conference on Scientific Computing in Object-
Oriented Parallel Computing, Lecture Notes in Computer
Science. Springer-Verlag, 1997.
[7] J. Burkardt. http://people.sc.fsu.edu/
~jburkardt/c_src/quad_mpi/quad_mpi.html
[8] P. Kambadur, D. Gregor, A. Lumsdaine, and A. Dharurkar.
Modernizing the C++ interface to MPI. In Recent Advances
in Parallel Virtual Machine and Message Passing Inter-
face, Lecture Notes in Computer Science, pages 266–274.
Springer Berlin / Heidelberg, 2006.
[9] B. C. McCandless, J. M. Squyres, and A. Lumsdaine.
Object-Oriented MPI (OOMPI): A class library for the mes-
sage passing interface. In Proceedings of the Second MPI
Developers Conference, pages 87–, Washington, DC, USA,
1996. IEEE Computer Society.
[10] R. Ramey. Boost serialization library. www.boost.org/
doc/libs/release/libs/serialization/
[11] A. Skjellum, D. G. Wooley, A. Lumsdaine, Z. Lu, M. Wolf,
J. M. Squyres, B. Mccandless, and P. V. Bangalore. Object-
oriented analysis and design of the message passing inter-
face, 1998.