Message Passing Interface (MPI) is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported.
MPI "is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation." So, MPI is a specification, not an implementation.
MPI's goals are high performance, scalability, and portability.
OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most platforms, processor architectures and operating systems, including Solaris, AIX, HP-UX, Linux, MacOS, and Windows.
OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.
2. Topics Covered
MPI
MPI Principles
Building Blocks
The Message Passing Interface (MPI)
Overlapping Communication and Computation
Collective Communication Operations
Composite Synchronization Constructs
Pros and Cons of MPI
OpenMP
Threading
Parallel Programming Model
Combining MPI and OpenMP
Shared Memory Programming
Pros and Cons of OpenMP
3. What is MPI???
Message Passing Interface (MPI) is a language-independent
communications protocol used to program parallel computers. Both
point-to-point and collective communication are supported.
MPI "is a message-passing application programmer interface, together
with protocol and semantic specifications for how its features must
behave in any implementation." So, MPI is a specification, not an
implementation.
MPI's goals are high performance, scalability, and portability.
4. MPI Principles
The MPI-1 model has no shared memory concept.
MPI-2 has only a limited distributed shared memory concept.
MPI-3 includes new Fortran 2008 bindings, while it removes the deprecated C++ bindings as well as many deprecated routines and MPI objects.
5. MPI Building Blocks
Since interactions are accomplished by sending and receiving messages,
the basic operations in the message-passing programming paradigm are
SEND and RECEIVE.
In their simplest form, the prototypes of these operations are defined as
follows:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
The sendbuf points to a buffer that stores the data to be sent, recvbuf
points to a buffer that stores the data to be received, nelems is the
number of data units to be sent and received, dest is the identifier of the
process that receives the data, and source is the identifier of the process
that sends the data.
6. MPI: the Message Passing Interface
MPI defines a standard library for message-passing that can be used to
develop portable message-passing programs using either C or Fortran.
The MPI standard defines both the syntax as well as the semantics of a core set of library routines that are very useful in writing message-passing programs.
The MPI library contains over 125 routines.
These routines are used to initialize and terminate the MPI library, to
get information about the parallel computing environment, and to send
and receive messages.
7. MPI: the Message Passing Interface
MPI_Init - Initializes MPI.
This function must be called in every MPI program, must be called
before any other MPI functions and must be called only once in an MPI
program.
MPI_Init(&argc,&argv);
MPI_Comm_size - Determines the number of processes.
Returns the total number of MPI processes in the specified
communicator (MPI_COMM_WORLD).
It represents the number of MPI tasks available to your application.
MPI_Comm_size(comm, &size);
8. MPI: the Message Passing Interface
MPI_Comm_rank - Determines the label of the calling process.
Returns the rank of the calling MPI process within the specified
communicator. Initially, each process will be assigned a unique
integer rank between 0 and number of tasks - 1 within the
communicator MPI_COMM_WORLD. This rank is often referred
to as a task ID.
MPI_Comm_rank (comm,&rank);
MPI_Send - Sends a message.
It performs a blocking send i.e. this routine may block until the
message is received by the destination process.
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
buf -> initial address of send buffer
count -> number of elements in send buffer
datatype -> datatype of each send buffer element
dest -> rank of destination
tag -> message tag
comm -> communicator
9. MPI: the Message Passing Interface
MPI_Recv - Receives a message.
The count argument indicates the maximum length of a message.
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
buf -> initial address of receive buffer
count -> maximum number of elements in receive buffer
datatype -> datatype of each receive buffer element
source -> rank of source
tag -> message tag
comm -> communicator
status -> status object
MPI_Finalize - Terminates MPI.
This function should be the last MPI routine called in every MPI program - no other MPI routines may be called after it.
MPI_Finalize();
11. MPI Example – Hello World
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size; MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "Hello World from process %d of %dn", rank, size );
MPI_Finalize();
return 0;
}
12. MPI Example – Hello World
Output –
Hello World from process 0 of 4
Hello World from process 2 of 4
Hello World from process 3 of 4
Hello World from process 1 of 4
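Putting the point-to-point routines from the previous slides together, a minimal sketch (assuming the program is launched with at least two processes) might look like this:
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, value;
    MPI_Status status;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) {
        value = 42;
        /* dest = 1, tag = 0 */
        MPI_Send( &value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
    } else if ( rank == 1 ) {
        /* source = 0, tag = 0 */
        MPI_Recv( &value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
        printf( "Process 1 received %d\n", value );
    }
    MPI_Finalize();
    return 0;
}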
13. Overlapping Communication and Computation
A blocking send operation remains blocked until the message has been copied out of the send buffer (either into a system buffer at the source process or sent to the destination process).
Similarly, a blocking receive operation returns only after the message has been received and copied into the receive buffer.
In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations.
These functions are MPI_Isend and MPI_Irecv.
14. Overlapping Communication and Computation
MPI_Isend
MPI_Isend starts a send operation but does not complete, that is, it returns before the data is copied out of the buffer.
The calling sequence of MPI_Isend is
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_Irecv
MPI_Irecv starts a receive operation but returns before the data has been received and copied into the buffer.
The calling sequence of MPI_Irecv is
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
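A minimal sketch of overlapping communication with computation, assuming exactly two processes; MPI_Waitall is used to complete both requests before the buffers are reused:
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, partner, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    partner = ( rank == 0 ) ? 1 : 0;   /* assumes exactly two processes */
    sendval = 100 * rank;
    /* Both calls return immediately; the transfers proceed in the background. */
    MPI_Isend( &sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0] );
    MPI_Irecv( &recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1] );
    /* ... computation that does not touch sendval or recvval can run here ... */
    /* Complete both operations before the buffers are used. */
    MPI_Waitall( 2, reqs, MPI_STATUSES_IGNORE );
    printf( "Process %d received %d\n", rank, recvval );
    MPI_Finalize();
    return 0;
}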
15. Collective Communication Operations
MPI provides the following routines for collective communication:
MPI_Bcast() -> Broadcast (one to all)
MPI_Reduce() -> Reduction (all to one)
MPI_Allreduce() -> Reduction (all to all)
MPI_Scatter() -> Distribute data (one to all)
MPI_Gather() -> Collect data (all to one)
MPI_Alltoall() -> Distribute data (all to all)
MPI_Allgather() -> Collect data (all to all)
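A short sketch combining two of these routines, MPI_Bcast and MPI_Reduce, over the usual MPI_COMM_WORLD communicator:
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, n = 0, local, sum;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 ) n = 10;                       /* value initially known only at the root */
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );   /* one to all */
    local = rank * n;                              /* each process computes a local value */
    MPI_Reduce( &local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );   /* all to one */
    if ( rank == 0 )
        printf( "Sum of local values = %d\n", sum );
    MPI_Finalize();
    return 0;
}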
16. Composite Synchronization Constructs
By design, Pthreads provide support for a basic set of operations.
Higher level constructs can be built using basic synchronization constructs.
We discuss two such constructs - read-write locks and barriers.
A read lock is granted when there are other threads that may already have read locks.
If there is a write lock on the data (or if there are queued write locks), the thread performs a condition wait.
If there are multiple threads requesting a write lock, they must perform a condition wait.
With this description, we can design functions for read locks (mylib_rwlock_rlock), write locks (mylib_rwlock_wlock) and unlocking (mylib_rwlock_unlock).
17. Read-Write Locks
The lock data type mylib_rwlock_t holds the following:
a count of the number of readers,
the writer (a 0/1 integer specifying whether a writer is present),
a condition variable readers_proceed that is signaled when readers can proceed,
a condition variable writer_proceed that is signaled when one of the writers can proceed,
a count pending_writers of pending writers, and
a mutex read_write_lock associated with the shared data structure.
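A Pthreads sketch of this data type, an init routine, and the read-lock function, following the fields listed above (names mirror those on the previous slide; a full implementation would also provide mylib_rwlock_wlock and mylib_rwlock_unlock):
#include <pthread.h>
typedef struct {
    int readers;                     /* number of readers holding the lock */
    int writer;                      /* 0/1: a writer currently holds the lock */
    pthread_cond_t readers_proceed;  /* signaled when readers may proceed */
    pthread_cond_t writer_proceed;   /* signaled when one writer may proceed */
    int pending_writers;             /* number of writers waiting */
    pthread_mutex_t read_write_lock; /* mutex guarding the fields above */
} mylib_rwlock_t;

void mylib_rwlock_init( mylib_rwlock_t *l )
{
    l->readers = l->writer = l->pending_writers = 0;
    pthread_mutex_init( &l->read_write_lock, NULL );
    pthread_cond_init( &l->readers_proceed, NULL );
    pthread_cond_init( &l->writer_proceed, NULL );
}

void mylib_rwlock_rlock( mylib_rwlock_t *l )
{
    pthread_mutex_lock( &l->read_write_lock );
    /* Wait while a writer holds the lock or writers are queued. */
    while ( ( l->pending_writers > 0 ) || ( l->writer > 0 ) )
        pthread_cond_wait( &l->readers_proceed, &l->read_write_lock );
    l->readers++;                    /* read lock granted */
    pthread_mutex_unlock( &l->read_write_lock );
}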
18. Barriers
As in MPI, a barrier holds a thread until all threads participating in the
barrier have reached it.
Barriers can be implemented using a counter, a mutex and a condition
variable.
A single integer is used to keep track of the number of threads that have
reached the barrier.
If the count is less than the total number of threads, the threads execute
a condition wait.
The last thread entering (and setting the count to the number of
threads) wakes up all the threads using a condition broadcast.
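A sketch of such a barrier in Pthreads, assuming NUM_THREADS participating threads; it follows the counter/broadcast pattern described above (a production version would also guard against spurious wakeups):
#include <pthread.h>
#define NUM_THREADS 4                /* assumed number of participating threads */

typedef struct {
    pthread_mutex_t count_lock;      /* protects the counter */
    pthread_cond_t  ok_to_proceed;   /* threads wait on this until the last arrives */
    int             count;           /* threads that have reached the barrier */
} mylib_barrier_t;

void mylib_barrier_init( mylib_barrier_t *b )
{
    b->count = 0;
    pthread_mutex_init( &b->count_lock, NULL );
    pthread_cond_init( &b->ok_to_proceed, NULL );
}

void mylib_barrier( mylib_barrier_t *b )
{
    pthread_mutex_lock( &b->count_lock );
    b->count++;
    if ( b->count == NUM_THREADS ) {
        b->count = 0;                                 /* reset so the barrier can be reused */
        pthread_cond_broadcast( &b->ok_to_proceed );  /* last thread wakes everyone */
    } else {
        pthread_cond_wait( &b->ok_to_proceed, &b->count_lock );
    }
    pthread_mutex_unlock( &b->count_lock );
}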
20. Pros and Cons of MPI
Pros
Does not require shared memory architectures which are more expensive
than distributed memory architectures
Can be used on a wider range of problems since it exploits both task
parallelism and data parallelism
Can run on both shared memory and distributed memory architectures
Highly portable with specific optimization for the implementation on most
hardware
Cons
Requires more programming changes to go from serial to parallel version
Can be harder to debug
21. What is OpenMP???
OpenMP (Open Multi-Processing) is an application programming
interface (API) that supports multi-platform shared memory
multiprocessing programming in C, C++, and Fortran, on most
platforms, processor architectures and operating systems, including
Solaris, AIX, HP-UX, Linux, MacOS, and Windows.
OpenMP uses a portable, scalable model that gives programmers a
simple and flexible interface for developing parallel applications for
platforms ranging from the standard desktop computer to the
supercomputer.
22. What is OpenMP???
OpenMP is essentially a compiler extension. It is available in GCC (the GNU compiler), the Intel compiler and other compilers.
OpenMP targets shared memory systems, i.e. systems where the processors share the main memory.
OpenMP is based on a threading approach. It launches a single process, which in turn can create n threads as desired. It is based on what is called the "fork and join" method, i.e. depending on the particular task it can launch the desired number of threads as directed by the user.
23. Threading
A thread is a single stream of control in the flow of a program.
Static Threads
All work is allocated and assigned at runtime
Dynamic Threads
Consists of one Master and a pool of threads
The pool is assigned some of the work at runtime, but not all of it
When a thread from the pool becomes idle, the Master gives it a new
assignment
“Round-robin assignments”
24. Parallel Programming Model
OpenMP uses the fork-join model of parallel execution.
All OpenMP programs begin with a single master thread.
The master thread executes sequentially until a parallel region is
encountered, when it creates a team of parallel threads (FORK).
When the team threads complete the parallel region, they synchronize and
terminate, leaving only the master thread that executes sequentially
(JOIN).
25. Variables
2 types of Variables
Private
Shared
Private Variables
Variables in a thread's private space can only be accessed by that thread.
A private variable has a different address in the execution context of every thread.
Clause: private(variable list)
Shared Variables
Variables in the global data space are accessed by all parallel threads.
A shared variable has the same address in the execution context of every thread. All threads have access to shared variables.
26. Variables
A thread can access its own private variables, but cannot access the private variable of another thread.
In the parallel for pragma, variables are shared by default, except the loop index variable, which is private.
#pragma omp parallel for private(privIndx, privDbl)
for ( i = 0; i < arraySize; i++ ) {
  for ( privIndx = 0; privIndx < 16; privIndx++ ) {
    privDbl = ( (double) privIndx ) / 16;
    y[i] = sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
  }
}
27. OpenMP Functions
omp_get_num_procs()
Returns the number of CPUs in the multiprocessor on which this thread is executing.
The integer returned by this function may be less than the total number of physical processors in the multiprocessor, depending on how the run-time system gives processes access to processors.
e.g. int t = omp_get_num_procs();
omp_get_num_threads()
Returns the number of threads active in the current parallel region
t=omp_get_num_threads();
28. OpenMP Functions Contd.
omp_set_num_threads()
Sets the number of threads that execute the parallel sections of code.
Typically the number of threads is set equal to the number of available CPUs.
e.g. omp_set_num_threads(t);
omp_get_thread_num()
Returns the thread identification number, from 0 to n-1, where n is the number of active threads.
tid = omp_get_thread_num();
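A small sketch using these four functions together, assuming compilation with OpenMP enabled (e.g. gcc -fopenmp):
#include <omp.h>
#include <stdio.h>
int main( void )
{
    int t = omp_get_num_procs();     /* processors available to this program */
    omp_set_num_threads( t );        /* request one thread per processor */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();    /* 0 .. n-1 */
        int n   = omp_get_num_threads();   /* threads active in this parallel region */
        printf( "Thread %d of %d\n", tid, n );
    }
    return 0;
}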
29. OpenMP compiler directives (Pragma)
A compiler directive in C or C++ is called a pragma.
Format:
#pragma omp directive-name [clause,..]
1. #pragma omp parallel
The block of code is executed by all of the threads (the code block is replicated among the threads).
Use curly braces {} to create a block of code from a statement group.
30. OpenMP compiler directives (Pragma)
2. #pragma omp parallel for
indicate to the compiler that the iterations of a for loop may
be executed in parallel.
e.g.
#pragma omp parallel for
for (i = first; i < size; i += prime)
marked[i] = 1;
32. Combining MPI and OpenMP
In many cases hybrid programs using both MPI and OpenMP execute faster than programs using only MPI.
Sometimes hybrid programs execute faster because they have lower
communication overhead.
Suppose we are executing our program on a cluster of m multiprocessors,
where each multiprocessor has k CPUs. In order to utilize every CPU, a
program relying on MPI must create mk processes. During
communication steps, mk processes are active.
On the other hand, a hybrid program need only create m processes. In
parallel sections of code, the workload is divided among k threads on
each multiprocessor. Hence every CPU is utilized.
33. Combining MPI and OpenMP
However, during communication steps, only m processes are active.
This may well give the hybrid program lower communication overhead than a "pure" MPI program, resulting in higher speedup.
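A minimal hybrid sketch along these lines: each MPI process splits its local loop among OpenMP threads, and only the MPI processes take part in the communication step. It assumes compilation with an MPI wrapper with OpenMP enabled (e.g. mpicc -fopenmp) and one process launched per multiprocessor; a production code making MPI calls from threads would use MPI_Init_thread instead of MPI_Init.
#include "mpi.h"
#include <omp.h>
#include <stdio.h>
int main( int argc, char *argv[] )
{
    int rank, i;
    double local = 0.0, total;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    /* Inside each MPI process, the loop iterations are divided among k threads. */
    #pragma omp parallel for reduction(+:local)
    for ( i = 0; i < 1000; i++ )
        local += 1.0;                /* stand-in for the per-node computation */
    /* Only the m MPI processes take part in the communication step. */
    MPI_Reduce( &local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
    if ( rank == 0 )
        printf( "total = %f\n", total );
    MPI_Finalize();
    return 0;
}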
34. Shared Memory Programming
The underlying hardware is assumed to be a collection of processors, each with access to the same shared memory.
Because they have access to the same memory locations, processors can interact and synchronize with each other through shared variables.
35. Shared Memory Programming
The standard view of parallelism in a shared memory program is fork/join parallelism.
When the program begins execution, only a single thread, called the master thread, is active.
The master thread executes the sequential portions of the algorithm. At those points where parallel operations are required, the master thread forks (creates or awakens) additional threads.
The master thread and the created threads work concurrently through the parallel section. At the end of the parallel code the created threads die or are suspended, and the flow of control returns to the single master thread. This is called a join.
36. Shared Memory Programming
The shared-memory model is characterized by fork/join parallelism, in which parallelism comes and goes.
At the beginning of execution only a single thread, called the master thread, is active.
The master thread executes the serial portions of the program. It forks additional threads to help it execute parallel portions of the program.
These threads are deactivated when serial execution resumes.
37. Shared Memory Programming
A key difference, then, between the shared-memory model and the message-passing model is that in the message-passing model all processes typically remain active throughout the execution of the program, whereas in the shared-memory model the number of active threads is one at the program's start and finish and may change dynamically throughout the execution of the program.
Parallel shared-memory programs range from those with only a single fork/join around a single loop to those in which most of the code segments are executed in parallel. Hence the shared-memory model supports incremental parallelization, the process of transforming a sequential program into a parallel program one block of code at a time.
38. Pros and Cons of OpenMP
Pros
Considered by some to be easier to program and debug (compared to
MPI)
Data layout and decomposition is handled automatically by directives.
Allows incremental parallelism: directives can be added incrementally,
so the program can be parallelized one portion after another and thus
no dramatic change to code is needed.
Unified code for both serial and parallel applications: OpenMP
constructs are treated as comments when sequential compilers are
used.
Original (serial) code statements need not, in general, be modified when
parallelized with OpenMP. This reduces the chance of inadvertently
introducing bugs and helps maintenance as well.
Both coarse-grained and fine-grained parallelism are possible
39. Pros and Cons of OpenMP
Cons
Currently only runs efficiently in shared-memory multiprocessor
platforms
Requires a compiler that supports OpenMP.
Scalability is limited by memory architecture.
Reliable error handling is missing.
Lacks fine-grained mechanisms to control thread-processor
mapping.
Synchronization between subsets of threads is not allowed.
Mostly used for loop parallelization
Can be difficult to debug, due to implicit communication between
threads via shared variables.