This document introduces the Message Passing Interface (MPI) and distributed computing. MPI is a library specification for passing messages between processes that do not share memory. The document outlines key MPI functions and concepts, introduces MPI programming, and discusses thinking in parallel when using MPI. It also covers MPI implementations, versions of the MPI standard, and motivations for distributed computing.
2. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
3. Shared Memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]
Single- and multi-CPU desktops & laptops
Multi-threaded apps
GPGPU *
MPI *
Distributed Memory: memory available on a given node within a system is unique and distinct from its peers
MPI
Google MapReduce / Hadoop
6. Bandwidth (FSB, HT, Nehalem, CUDA, …)
Frequently run into with high-level languages (MATLAB)
Capacity – cost & availability
High-density chips are $$$ (if even available)
Memory limits on individual systems
Distributed computing addresses both bandwidth and capacity with multiple systems
MPI is the glue used to connect multiple distributed processes together
7. Custom iterative SENSE reconstruction
3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
Profile data (img space)
Estimate (img <-> k space)
Acquired data (k space)
> 4GB of data touched during each iteration
16- and 32-channel data here or on the way…
Trzasko, Josh. ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms”
M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
8. [Reconstruction data-flow diagram, split between the root node and worker nodes. Stages: FTx on incoming DATA; place view into correct x-Ky-Kz space (AP & LP); CAL FTyz (AP & LP); “Traditional” 2D SENSE unfold (AP & LP); Homodyne correction; GW correction (Y, Z) on pre-loaded data; GW correction (X) on real-time data; MIP; store the RESULT as DICOM. MPI communication links the stages across nodes.]
9. [Cluster hardware diagram. Root node: 3.6GHz P4, 16GB RAM, 500GB HDD, 1Gb Eth to the site intranet (MRI system), 1Gb Eth to the cluster, 2x8Gb IB. Worker nodes (x7): 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Eth, 2x8Gb IB. A 16-port Gigabit Ethernet switch carries the x7 file-system connections (1Gig Ethernet); a 24-port Infiniband switch carries the x7x2 MPI interconnects (2x8Gig Infiniband, 8Gb/s per connection, 16Gb/s bandwidth per node).]
12. [Diagram contrasting memory and network transfers. Left: a single host and OS running Process A with Threads 1…N, which communicate through memory transfers. Right: Hosts I…N, each with its own OS, running Processes A, B, … C, which communicate through network transfers.]
13. [Same diagram as the previous slide with a second process added per host: Hosts I…N run Processes A & D, B & E, … C & F, mixing memory transfers within each host and network transfers between hosts.]
14. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
15. Message Passing Interface is…
“a library specification for message-passing” [1]
Available in many implementations on multiple platforms *
A set of functions for moving messages between different processes without a shared memory environment
Low-level*; no concept of overall computing tasks to be performed
[1] http://www.mcs.anl.gov/research/projects/mpi/
16. MPI-1
Version 1.0 draft standard 1994
Version 1.1 in 1995
Version 1.2 in 1997
Version 1.3 in 2008
MPI-2
Added:
▪ 1-sided communication
▪ Dynamic “world” sizes; spawn / join
Version 2.0 in 1997
Version 2.1 in 2008
MPI-3
In process
Enhanced fault handling
Forward compatibility preserved
17. MPI is the de-facto standard for distributed computing
Freely available
Open source implementations exist
Portable
Mature
From a discussion of why MPI is dominant [1]:
[…] 100s of languages have come and gone. Good stuff must have been created [… yet] it is broadly accepted in the field that they’re not used.
MPI has a lock.
OpenMP is accepted, but a distant second.
There are substantial barriers to the introduction of new languages and language constructs. Economic, ecosystem related, psychological, a catch-22 of widespread use, etc.
Any parallel language proposal must come equipped with reasons why it will overcome those barriers.
[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html
18. MPI itself is just a specification. We want an implementation
MPICH, MPICH2
Widely portable
MVAPICH, MVAPICH2
Infiniband-centric; MPICH/MPICH2 based
OpenMPI
Plug-in architecture; many run-time options
And more:
IntelMPI
HP-MPI
MPI for IBM Blue Gene
MPI for Cray
Microsoft MPI
MPI for SiCortex
MPI for Myrinet Express (MX)
MPICH2 over SCTP
19. Without MPI:
Start all of the processes across a bank of machines (shell scripting + ssh)
socket(), bind(), listen(), accept() or connect() for each link
send(), read() on individual links
Raw byte interfaces; no discrete messages
23. Each process owns its data – there is no “ours”
Makes many things simpler; no mutexes, condition variables, semaphores, etc.; memory-access-order race conditions go away
Every message is an explicit copy
I have the memory I sent from; you have the memory you received into
Even when running in a “shared memory” environment
Synchronization comes along for free
I won’t get your message (or data) until you choose to send it
Programming to MPI first can make it easier to scale out later
24. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
25. Download / decompress MPICH source:
http://www.mcs.anl.gov/research/projects/mpich2/
Supports: C / C++ / Fortran
Requires Python >= 2.2
./configure
make install
Installs into /usr/local by default, or use --prefix=<chosen path>
Make sure <prefix>/bin is in PATH
Make sure <prefix>/share/man is in MANPATH
27. Set up password-less ssh to workers
Start the daemons with mpdboot -n <N>
Requires ~/.mpd.conf to exist on each host
▪ Contains (same on each host): MPD_SECRETWORD=<some gibberish string>
▪ Permissions set to 600 (r/w access for owner only)
Requires ./mpd.hosts to list other host names
▪ Unless run as mpdboot -n 1 (run on current host only)
▪ Will not accept current host in list (implicit)
Check for running daemons with mpdtrace
For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
28.
29. Use mpicc / mpicxx for the C / C++ compiler
Wrapper scripts around the C / C++ compilers detected during install
▪ $ mpicc --show
gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt
$ mpicc -o hello hello.c
Use mpiexec -np <nproc> <app> <args> to launch
$ mpiexec -np 4 ./hello
30. /* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
  int i, rank, nodes;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nodes);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < nodes; i++)
  {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
  }
  MPI_Finalize();
  return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
31. [Diagram: execution of ./threaded_app. main() calls pthread_create(func()); main() and func() then do work concurrently, sharing the process’s memory; func() finishes with pthread_exit(), and main() calls pthread_join() and then exit().]
32. [Diagram: mpiexec -np 4 ./mpi_app; mpd launches the jobs. Each rank of mpi_app runs main(), then MPI_Init(), MPI_Bcast(), work on local memory, MPI_Allreduce(), MPI_Finalize(), and exit(); the MPI_* calls are the points of MPI communication between ranks.]
33. /* hello.c */
#include <stdio.h>
#include <mpi.h>
int
main (int argc, char * argv[])
{
int i;
int rank;
int nodes;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nodes);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
for (i=0; i< nodes; i++)
{
MPI_Barrier(MPI_COMM_WORLD);
if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
}
MPI_Finalize();
return 0;
}
34. MPICH2 comes with MPE by default (unless disabled during configure)
Multiple tracing / logging options to track MPI traffic
Enabled through -mpe=<option> at compile time
MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.
39. int MPI_Send(
void *buf,
  memory location to send from
int count,
  number of elements (of type datatype) at buf
MPI_Datatype datatype,
  MPI_INT, MPI_FLOAT, etc…
  or custom datatypes: strided vectors, structures, etc.
int dest,
  rank (within the communicator comm) of the destination for this message
int tag,
  used to distinguish this message from other messages
MPI_Comm comm )
  communicator for this transfer
  often MPI_COMM_WORLD
40. int MPI_Recv(
void *buf,
  memory location to receive data into
int count,
  number of elements (of type datatype) available to receive into at buf
MPI_Datatype datatype,
  MPI_INT, MPI_FLOAT, etc…
  or custom datatypes: strided vectors, structures, etc.
  typically matches the sending datatype, but doesn’t have to…
int source,
  rank (within the communicator comm) of the source for this message
  can also be MPI_ANY_SOURCE
int tag,
  used to distinguish this message from other messages
  can also be MPI_ANY_TAG
MPI_Comm comm,
  communicator for this transfer
  often MPI_COMM_WORLD
MPI_Status *status )
  structure describing the received message, including:
  actual count (can be smaller than the passed count)
  source (useful if used with source = MPI_ANY_SOURCE)
  tag (useful if used with tag = MPI_ANY_TAG)
42. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
43.
44. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
45. 3.4 Communication Modes
The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.
The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.
http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40
46. [Diagram: eager vs. rendezvous transfer between Process 1 and Process 2. A “small” message is sent eagerly: the send returns immediately, and the receiver’s eager receive later picks up the message. A “large” message issues a rendezvous request and blocks until completion: the receiver matches the request and requests the large message, and the sender then transfers the rendezvous data. User activity and MPI activity are distinguished in the diagram.]
47. MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
Sends are “local” – they return independent of any remote activity
Message buffer can be touched immediately after the call returns
Requires a user-provided buffer, provided via MPI_Buffer_attach()
Forces an “eager”-like message transfer from the sender’s perspective
User can wait for completion by calling MPI_Buffer_detach()
MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
Won’t return until the matching receive is posted
Forces a “rendezvous”-like message transfer
Can be used to guarantee synchronization without additional MPI_Barrier() calls
MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
Erroneous if the matching receive has not been posted
Performance tweak (on some systems) when the user can guarantee the matching receive is posted
MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
Non-blocking; immediate return once the send/receive request is posted
Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
Send/receive buffers should not be touched until completed
MPI_Request * argument used for eventual completion
The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
49. $ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2
50. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
51. Task parallelism
Each process handles a unique kind of task
▪ Example: multi-image uploader (with resize/recompress)
▪ Thread 1: GUI / user interaction
▪ Thread 2: file reader & decompression
▪ Thread 3: resize & recompression
▪ Thread 4: network communication
Can be used in a grid with a pipeline of separable tasks to be performed on each data set
▪ Resample / warp volume
▪ Segment volume
▪ Calculate metrics on segmented volume
52. Data parallelism
Each process handles a portion of the entire data
Often used with large data sets
▪ [task 0… | … task 1 … | … | … task n]
Frequently used in MPI programming
Each process is “doing the same thing,” just on a different subset of the whole
53. Layout is crucial in high-performance computing
BW efficiency; cache efficiency
Even more important in distributed computing
Poor layout means extra communication
[Diagram: an example of “block” data distribution across Nodes 0–7; x is the contiguous dimension, z is the slowest dimension; each node holds a contiguous portion of z.]
54. [Reconstruction data-flow diagram, as on slide 8, split between the root node and worker nodes. Stages: FTx on incoming DATA; place view into correct x-Ky-Kz space (AP & LP); CAL FTyz (AP & LP); “Traditional” 2D SENSE unfold (AP & LP); Homodyne correction; GW correction (Y, Z) on pre-loaded data; GW correction (X) on real-time data; MIP; display the RESULT as DICOM. MPI communication links the stages across nodes.]
55. Completely separable problems:
Add 1 to everyone
Multiply each a[i] * b[i]
Inseparable problems: [?]
Max of a vector
Sort a vector
MIP of a volume
1D FFT of a volume
2D FFT of a volume
3D FFT of a volume
[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
56.
57. Dynamic datatypes
MPI_Type_vector()
Enables communication of sub-sets without packing
Combined with DMA, permits zero-copy transposes, etc.
Other collectives
MPI_Reduce
MPI_Scatter
MPI_Gather
MPI-2 (MPICH2, MVAPICH2)
One-sided (DMA) communication
▪ MPI_Put()
▪ MPI_Get()
Dynamic world size
▪ Ability to spawn new processes during run
58. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
59. Take time on the algorithm & data layout
Minimize traffic between nodes / separate the problem
▪ FTx into xKyKz in the SENSE example
Cache-friendly (linear, efficient) access patterns
Overlap processing and communication
MPI_Isend() / MPI_Irecv() with multiple work buffers
While actively transferring one, process the other
Larger messages will hit a higher BW (in general)
60. Profile
VTune (Intel; Linux / Windows)
Shark (Mac)
MPI profiling with -mpe=mpilog
Avoid “premature optimization” (Knuth)
Weigh implementation time & effort vs. runtime performance
Use derived datatypes rather than packing
Using a debugger with MPI is hard
Build in your own debugging messages from the start
61. If you might need MPI, build to MPI.
Works well in shared memory environments
▪ It’s getting better all the time
Encourages memory locality in NUMA architectures
▪ Nehalem, AMD
Portable, reusable, open-source
Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL (“hybrid model of parallel programming”)
Messaging paradigm can create “less obfuscated” code than threads / OpenMP
62. Homogeneous nodes
Private network
Shared filesystem; ssh communication
Password-less SSH
High-bandwidth private interconnect
MPI communication exclusively
GbE, 10GbE
Infiniband
Consider using Rocks
CentOS / RHEL based
Built for building clusters
Rapid network-boot-based install / reinstall of nodes
http://www.rocksclusters.org/
63. MPI documents
http://www.mpi-forum.org/docs/
MPICH2
http://www.mcs.anl.gov/research/projects/mpich2
http://lists.mcs.anl.gov/pipermail/mpich-discuss/
OpenMPI
http://www.open-mpi.org/
http://www.open-mpi.org/community/lists/ompi.php
MVAPICH[1|2] (Infiniband-tuned distribution)
http://mvapich.cse.ohio-state.edu/
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/
Rocks
http://www.rocksclusters.org/
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/
Books:
Pacheco, Peter S., Parallel Programming with MPI
Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
Gropp, W., Using MPI-2
64.
65.
66. This is the painting operation for one RGBA pixel (in) onto another (out). We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results. Post-multiply masks are applied in the shifted position to minimize the number of shift operations. Note: we’re using pre-multiplied colors & painting onto an opaque background.
#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void
blendPreToStatic(const uint32_t& in, uint32_t& out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | (RGB &
          (in +
              (
                  (
                      (alpha * (out & RB) & RB_8OFF) |
                      (alpha * (out & G)  & G_8OFF)
                  ) >> 8
              )
          ));
}
67. OUT = A | (RGB &
         (IN +
             (
                 (
                     (ALPHA * (OUT & RB) & RB_8OFF) |
                     (ALPHA * (OUT & G)  & G_8OFF)
                 ) >> 8
             )
         ));
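Restated as plain C functions for checking (a sketch, not from the deck: `blend` is the slide's bit-twiddled blend with pass-by-value instead of C++ references, and `blend_ref` is a straightforward per-channel reference), the two agree exactly whenever no channel overflows — the pre-multiplied-colors assumption:

```c
#include <stdint.h>

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

/* The slide's blend: treats in's top byte as transparency t,
 * computing per channel: result = in_c + (t * out_c) >> 8 */
static uint32_t blend(uint32_t in, uint32_t out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x80u) ++alpha;             /* round t up into [0,256] */
    return A | (RGB &
           (in +
               (
                   (
                       (alpha * (out & RB) & RB_8OFF) |
                       (alpha * (out & G)  & G_8OFF)
                   ) >> 8
               )
           ));
}

/* Per-channel reference: extract, scale, and reassemble each channel */
static uint32_t blend_ref(uint32_t in, uint32_t out)
{
    uint32_t t = in >> 24;
    if (t & 0x80u) ++t;
    uint32_t r = ((in >> 16) & 0xFFu) + ((t * ((out >> 16) & 0xFFu)) >> 8);
    uint32_t g = ((in >>  8) & 0xFFu) + ((t * ((out >>  8) & 0xFFu)) >> 8);
    uint32_t b = ( in        & 0xFFu) + ((t * ( out        & 0xFFu)) >> 8);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

A fully transparent source (t = 0xFF) reproduces the background; t = 0 leaves only the source's pre-multiplied color over black.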
68. For cases where there is no overlap between
the four output pixels for four input pixels, we
can use vectorized (SSE2) code
128-bit wide registers; load four 32-bit RGBA
values, use the same approach as previously
(R|B and G) in two registers to perform four
paints at once
69. inline
void
blend4PreToStatic(uint32_t ** in,
uint32_t * out) // Paints in (quad-word) onto out
{
__m128i rb, g, a, a_, o, mask_reg; // Registers
rb = _mm_loadu_si128((__m128i *) out); // Load destination (unaligned -- may not be on a 128-bit boundary)
a_ = _mm_load_si128((__m128i *) *in); // We make sure the input is on a 128-bit boundary before this call
*in += 4; _mm_prefetch((char*) (*in + 28), _MM_HINT_T0); // Fetch the memory two cache lines out
mask_reg = _mm_set1_epi32(0x0000FF00); // Set green mask (x4)
g = _mm_and_si128(rb, mask_reg); // Mask to greens (x4)
mask_reg = _mm_set1_epi32(0x00FF00FF); // Set red and blue mask (x4)
rb = _mm_and_si128(rb, mask_reg); // Mask to red and blue
rb = _mm_slli_epi32(rb, 8); // << 8 ; g is already biased by 256 in 16-bit spacing
a = _mm_srli_epi32(a_, 24); // >> 24 ; These are the four alpha values, shifted to the lower 8 bits of each word
mask_reg = _mm_slli_epi32(a, 16); // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
a = _mm_or_si128(a, mask_reg); // We now have the alpha value at both bits [0-7] and [16-23] of each word
// These steps add one to transparency values >= 0x80
o = _mm_srli_epi16(a, 7); // Now the high bit is the low bit
70. a = _mm_add_epi16(a, o); // Add that bit back in: rounds each 16-bit alpha up for values >= 0x80
// We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
// to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
// storing the upper 16 bits of the 32-bit result. (This is the operation that is available, so that's why we're
// doing it in this fashion!)
rb = _mm_mulhi_epu16(rb, a);
g = _mm_mulhi_epu16(g, a);
g = _mm_slli_epi32(g, 8); // Move green into the correct location.
// R and B, both in the lower 8 bits of their 16 bits, don't need to be shifted
o = _mm_set1_epi32(0xFF000000); // Opaque alpha value
o = _mm_or_si128(o, g);
o = _mm_or_si128(o, rb); // o now has the background's contribution to the output color
mask_reg = _mm_set1_epi32(0x00FFFFFF);
g = _mm_and_si128(mask_reg, a_); // Removes alpha from the foreground color
o = _mm_add_epi32(o, g); // Add foreground and background contributions together
_mm_storeu_si128((__m128i *) out, o); // Unaligned store
}
73. MPI_Init(3) MPI MPI_Init(3)
NAME
MPI_Init - Initialize the MPI execution environment
SYNOPSIS
int MPI_Init( int *argc, char ***argv )
INPUT PARAMETERS
argc - Pointer to the number of arguments
argv - Pointer to the argument vector
THREAD AND SIGNAL SAFETY
This routine must be called by one thread only. That thread is called
the main thread and must be the thread that calls MPI_Finalize .
NOTES
The MPI standard does not say what a program can do before an MPI_INIT
or after an MPI_FINALIZE . In the MPICH implementation, you should do
as little as possible. In particular, avoid anything that changes the
external state of the program, such as opening files, reading standard
input or writing to standard output.
74. MPI_Barrier(3) MPI MPI_Barrier(3)
NAME
MPI_Barrier - Blocks until all processes in the communicator have
reached this routine.
SYNOPSIS
int MPI_Barrier( MPI_Comm comm )
INPUT PARAMETER
comm - communicator (handle)
NOTES
Blocks the caller until all processes in the communicator have called
it; that is, the call returns at any process only after all members of
the communicator have entered the call.
75. MPI_Finalize(3) MPI MPI_Finalize(3)
NAME
MPI_Finalize - Terminates MPI execution environment
SYNOPSIS
int MPI_Finalize( void )
NOTES
All processes must call this routine before exiting. The number of
processes running after this routine is called is undefined; it is best
not to perform much more than a return rc after calling MPI_Finalize .
76. MPI_Comm_size(3) MPI MPI_Comm_size(3)
NAME
MPI_Comm_size - Determines the size of the group associated with a
communicator
SYNOPSIS
int MPI_Comm_size( MPI_Comm comm, int *size )
INPUT PARAMETER
comm - communicator (handle)
OUTPUT PARAMETER
size - number of processes in the group of comm (integer)
77. MPI_Comm_rank(3) MPI MPI_Comm_rank(3)
NAME
MPI_Comm_rank - Determines the rank of the calling process in the com-
municator
SYNOPSIS
int MPI_Comm_rank( MPI_Comm comm, int *rank )
INPUT ARGUMENT
comm - communicator (handle)
OUTPUT ARGUMENT
rank - rank of the calling process in the group of comm (integer)
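Taken together, the routines above are all a minimal MPI program needs; a sketch (compile with mpicc, run with mpiexec):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* every MPI program starts here */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are there? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Barrier(MPI_COMM_WORLD);            /* wait for everyone */
    MPI_Finalize();                         /* do little after this call */
    return 0;
}
```

As noted earlier, the printf lines from different ranks may interleave in any order; the barrier does not serialize output.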
78. MPI_Send(3) MPI MPI_Send(3)
NAME
MPI_Send - Performs a blocking send
SYNOPSIS
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (nonnegative integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
This routine may block until the message is received by the destination
process.
79. MPI_Recv(3) MPI MPI_Recv(3)
NAME
MPI_Recv - Blocking receive for a message
SYNOPSIS
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status)
OUTPUT PARAMETERS
buf - initial address of receive buffer (choice)
status - status object (Status)
INPUT PARAMETERS
count - maximum number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
The count argument indicates the maximum length of a message; the
actual length of the message can be determined with MPI_Get_count .
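A matching send/receive pair, sketched (assumes the job was started with at least 2 processes; the tag value is arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 sends an array of ints to rank 1. */
int main(int argc, char **argv)
{
    int rank, data[4] = { 1, 2, 3, 4 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, /* tag */ 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int count;
        MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count); /* actual length received */
        printf("rank 1 received %d ints\n", count);
    }
    MPI_Finalize();
    return 0;
}
```

The receive's count (4) is only an upper bound; MPI_Get_count on the returned status reports how many elements actually arrived.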
80. MPI_Isend(3) MPI MPI_Isend(3)
NAME
MPI_Isend - Begins a nonblocking send
SYNOPSIS
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
81. MPI_Irecv(3) MPI MPI_Irecv(3)
NAME
MPI_Irecv - Begins a nonblocking receive
SYNOPSIS
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of receive buffer (choice)
count - number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
82. MPI_Bcast(3) MPI MPI_Bcast(3)
NAME
MPI_Bcast - Broadcasts a message from the process with rank "root" to
all other processes of the communicator
SYNOPSIS
int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
INPUT/OUTPUT PARAMETER
buffer - starting address of buffer (choice)
INPUT PARAMETERS
count - number of entries in buffer (integer)
datatype
- data type of buffer (handle)
root - rank of broadcast root (integer)
comm - communicator (handle)
83. MPI_Allreduce(3) MPI MPI_Allreduce(3)
NAME
MPI_Allreduce - Combines values from all processes and distributes the
result back to all processes
SYNOPSIS
int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
INPUT PARAMETERS
sendbuf
- starting address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- data type of elements of send buffer (handle)
op - operation (handle)
comm - communicator (handle)
OUTPUT PARAMETER
recvbuf
- starting address of receive buffer (choice)
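A sketch combining the two collectives (the numbers are made up: the root broadcasts a problem size, each rank sums its stripe of 1..n, and MPI_Allreduce leaves the grand total on every rank):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n = 0;
    long local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;                      /* only the root knows n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* now every rank does */

    local = 0;                                    /* sum my stripe of 1..n */
    for (int i = rank + 1; i <= n; i += size)
        local += i;

    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %ld\n", rank, total); /* same on every rank */
    MPI_Finalize();
    return 0;
}
```

Unlike MPI_Reduce, the combined result lands on all ranks, so no follow-up broadcast is needed.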
84. MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3)
NAME
MPI_Type_create_hvector - Create a datatype with a constant stride
given in bytes
SYNOPSIS
int MPI_Type_create_hvector(int count,
int blocklength,
MPI_Aint stride,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
INPUT PARAMETERS
count - number of blocks (nonnegative integer)
blocklength
- number of elements in each block (nonnegative integer)
stride - number of bytes between start of each block (address integer)
oldtype
- old datatype (handle)
OUTPUT PARAMETER
newtype
- new datatype (handle)
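A sketch of the "no packing" use mentioned earlier: describing one column of a row-major matrix with MPI_Type_create_hvector so it can be sent directly (the matrix dimensions are made up; assumes at least 2 processes):

```c
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 5

int main(int argc, char **argv)
{
    int rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of 1 double, starting COLS*sizeof(double) bytes apart */
    MPI_Type_create_hvector(ROWS, 1, COLS * sizeof(double),
                            MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        double m[ROWS][COLS];
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                m[i][j] = i * COLS + j;
        MPI_Send(&m[0][2], 1, column, 1, 0, MPI_COMM_WORLD); /* column 2 */
    } else if (rank == 1) {
        double col[ROWS];                 /* arrives contiguously */
        MPI_Recv(col, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; ++i)
            printf("%g ", col[i]);
        printf("\n");
    }
    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```

The send and receive types differ (strided vs. contiguous), which is legal because their type signatures match: ROWS doubles on each side.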
85. mpicc(1) MPI mpicc(1)
NAME
mpicc - Compiles and links MPI programs written in C
DESCRIPTION
This command can be used to compile and link MPI programs written in C.
It provides the options and any special libraries that are needed to
compile and link MPI programs.
It is important to use this command, particularly when linking pro-
grams, as it provides the necessary libraries.
COMMAND LINE ARGUMENTS
-show - Show the commands that would be used without running them
-help - Give short help
-cc=name
- Use compiler name instead of the default choice. Use this
only if the compiler is compatible with the MPICH library (see
below)
-config=name
- Load a configuration file for a particular compiler. This
allows a single mpicc command to be used with multiple compil-
ers.
[…]
86. mpiexec(1) MPI mpiexec(1)
NAME
mpiexec - Run an MPI program
SYNOPSIS
mpiexec args executable pgmargs [ : args executable pgmargs ... ]
where args are command line arguments for mpiexec (see below), exe-
cutable is the name of an executable MPI program, and pgmargs are com-
mand line arguments for the executable. Multiple executables can be
specified by using the colon notation (for MPMD - Multiple Program Mul-
tiple Data applications). For example, the following command will run
the MPI program a.out on 4 processes:
mpiexec -n 4 a.out
The MPI standard specifies the following arguments and their meanings:
-n <np>
- Specify the number of processes to use
-host <hostname>
- Name of host on which to run processes
-arch <architecturename>
- Pick hosts with this architecture type
[…]
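A typical build-and-run cycle with these two commands (file and program names are hypothetical):

```shell
# Compile and link against the MPI library (mpicc wraps the real compiler)
mpicc -O2 -o hello hello.c

# Show what mpicc would actually invoke, without running it
mpicc -show hello.c

# Launch the program as 4 processes
mpiexec -n 4 ./hello
```
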
Editor's Notes
NUMA: a distinction within shared memory systems, e.g. AMD HyperTransport or Intel QPI vs. Northbridge w/ FSB. GPGPU: sort of; xfers into and out of GPU memory are from the main shared system memory; xfers within GPU memory by GPU kernels are shared memory within their own private (GPU) memory space. Distributed systems: comprised of multiple nodes; each node is typically an individual “computer”. MPI can be used on shared memory systems; modern implementations use the fastest xfer mechanism between each set of peers.
Some scale better. CPUs keep getting faster, either through GHz or # of cores; memory BW has not kept up. STREAM benchmark:
------------------------------------------------------------------
 name    kernel                 bytes/iter   FLOPS/iter
------------------------------------------------------------------
 COPY:   a(i) = b(i)            16           0
 SCALE:  a(i) = q*b(i)          16           1
 SUM:    a(i) = b(i) + c(i)     24           1
 TRIAD:  a(i) = b(i) + q*c(i)   24           2
------------------------------------------------------------------
8-million-element double-precision arrays (~64MB each); ICC 10; -xP. CPU manufacturers are focused on improving this, and have really sped things up with Nehalem… what about Nehalem?
Examples of tasks that hit BW walls: highly tuned inner loops (few ops per element, running over a large volume); masking operations (multiply each element from one volume by a mask in another volume); max / min / mean / std operations; MIPs. Still an issue on new systems, and likely to continue to be. Nehalem is NUMA as well -- another layer of complexity -> can control somewhat via binding (numactl; through Task Manager in Windows). This is not to say the 8 processors are useless; on programs where the inner loop operation does more work, the scaling can be close to ideal, e.g. sin(x).
Front side bus, QuickPath Interconnect, HyperTransport. High-level languages: need to finish one operation (A += B) before doing the next operation (A = A*A). MPI is the de facto standard for parallel programs on distributed memory systems, from Blue Gene to off-the-shelf Linux clusters. 1GB 1333 DDR3: $95 ($800); 2GB 1333 DDR3: $155 ($620); 4GB 1333 DDR3: $322 ($644); 8GB 1333 DDR3 chips: $3410 ($3410). Nehalem again makes this more confusing; memory bus clock changes based on # of modules. Also one of the key points that CUDA is focused on; 3 of the 8 called-out improvements in the latest rev focus on efficient / improved memory BW usage.
Needs for large data sets in image processing are real and here now.
2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set). (This is not the iterative reconstruction.)
*MY* taxonomy. SETI@home 1999-2005, now part of BOINC: 1.7PFlops > 1.4TFlops (RoadRunner). Grid: jobs ~independent and asynchronous; Hadoop / MapReduce; cycle stealing. ScaleMP: up to 32 processors (128 cores) and 4TB shared memory. Cluster computing: a distributed “process” starts on multiple machines concurrently; typically cookie-cutter (although support for different architectures is possible in MPI); significant communication between nodes during processing; massive simulations; applications sensitive to timings. Folding@Home: loosely coupled collective (grid), tightly coupled within a client (MPI); also grid+GPU; 4.6PFlops.
More taxonomy. Grid: loosely connected; nodes “unaware” of other nodes; works great for “batch” problems; different architectures, different implementations (CPU, GPU, … PS3 and Nvidia clients for Folding@Home); wildly varying performance between nodes “easily” accommodated; fail-over almost “automatic”; Sun Grid Engine; MapReduce / Hadoop; can be a cycle-stealing background process. Cluster: tightly connected; nodes in tight communication with each other; failures are hard to handle -- intermediate results often saved; MPI-2; usually homogeneous nodes; varying performance can cause severe performance loss if not accounted for carefully; MPI (SGE / other schedulers). We will be focusing on clusters; this is where MPI is used.
Network transfers (even on fast networks) are expensive compared to memory transactions
Number of bi-directional links for N nodes = N*(N-1)/2: 15 for 6 nodes, 28 for 8; ~N^2/2. Managing this yourself is complicated and time-consuming! >>> This is what MPI simplifies for us. So what is MPI?
ANL = Argonne National Laboratory. *Although available on many platforms, MPI has a unix heritage, and is most natural to use in unix-y (Mac, Linux, Sun) environments. (OpenMPI ships standard on Macs w/ Leopard.) Low-level: there are some functions that operate on the data type (reduce operations) -- but most “just” shuffle bytes around.
MPI is everywhere in high-performance computing, but why? >>> So what does MPI do for you? Why should you use it? Look at the complexity of setting up a distributed system again.
Can also provide profiling; MPI can use different communication for different sets of peers (e.g. SMP, Infiniband, TCP/IP). You could (almost) write any MPI program with these 4 calls; much different from pthreads w/ mutexes, OpenMP, GPU, etc.; communication provides synchronization by its nature -- no dealing with “locks” on “shared” variables, etc. BUT: need to be sure each node is initializing variables correctly…
Getting back to what MPI is a little more…Even though most MPI programs could be written with just a few MPI commands, there are quite a few available….
Linux / Mac instructions; Leopard already has OpenMPI installed, in /usr/bin/mpi[cc|cxx|run]. Not familiar with the Windows version; see the Windows portion of http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf. Fortran and C++ if desired. Supports shared-mem and TCP channels.
MPD = multiprocessing daemon; used to start one daemon per host; these daemons are used to start the actual jobs. Talk a little more about MPD.
MPD = multiprocessing daemon; launched (and left running) on each node that wants to be ready to participate in an MPI execution. Other options (mpirun) exist, but mpd is fast for starting new jobs (as opposed to new ssh sessions created each time a job is run).
MPD = multiprocessing daemon
MPI_Init() must be called in every program that will use MPI calls. Caveat: printing to stdout (stderr) from different nodes works, but it is not guaranteed to be synchronized. On click: note 2 printed before 1 (even though 2 occurred after 1, as enforced by MPI_Barrier()); fflush() does not fix this… send all IO to one process for printout. Now is a good time to discuss what actually happens when an MPI parallel job is run (in contrast to a threaded job).
I want to look a little more into what actually happens when a parallel MPI program runs. Let’s start by looking at how a parallel threaded app runs. Threads are spawned at runtime as requested by the program. Multiple threads may be spawned and joined over the course of a program. Each thread has access to memory to do its work (whatever it may be). main() is only entered and exited once.
Multiprocessing daemons are already running and know about each other. NOT SHOWING RANK 2. Each rank is a full program; it starts in main and exits from main. MPI_Bcast() / MPI_Allreduce() included here as a way to show communication between nodes. Program logic during execution determines who does what.
Only the portions highlighted are different between the nodes; however, every line -- the full program -- is executed on each node; tests are performed at run time to select different code to run on each node. This is different from threaded apps, where common (global) code sections (initializations, etc.) are really only run once. As long as init << parallel work, this is not a big performance issue. (It doesn’t have to be done this way, but this is the typical way; different executables can be run as different processes, if desired.)
MPE is useful for understanding what MPI is doing
Black sections between barriers are the printf calls ~20usec each
More of a debug tool
Transpose; ~88MB data set in 70 ms (1.2GB/s)
320x176x320 -> 384x256x384 in ~0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library!!! Custom labels in the profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
Let’s get back to MPI programming by examining the two basic building blocks for any MPI program: MPI_Send & MPI_Recv. You can make communicators that include only a subset of the active nodes; useful for doing “broadcasts” within a subset, etc. Tag can be used to separate classes of messages; it’s up to the user. It can be used with zero-length messages to communicate something via the tag alone, e.g. “ready” or “complete”.
It’s important to note that the types don’t have to be exactly the same (a strided vector could be received / sent from a contiguous vector).
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
Threshold is 16kB
We can see that sends can complete before the matching receives are posted, but not vice-versa. (Timing enforced by message passing; no mutexes required!)
Threshold is 16kB
“Small” messages get sent into pre-allocated (within the MPI library) buffers; this allows the sender to return quicker, generates less traffic, etc. (“eager”). “Large” messages get sent only once the receiver has posted the receive request, with the receive buffer (“rendezvous”).
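The “does this work?” question can be made concrete with a sketch (SENDSIZE and the 2-rank assumption are illustrative): both ranks send first and receive second, which succeeds for eager-sized messages but deadlocks once the size crosses the rendezvous threshold:

```c
#include <mpi.h>

#define SENDSIZE 65536   /* above a typical eager threshold -> deadlock */

/* Both ranks send first, then receive. With small (eager) messages the
 * sends complete into library buffers and this "works"; with large
 * (rendezvous) messages both ranks block in MPI_Send forever, each
 * waiting for a receive that is never posted. */
int main(int argc, char **argv)
{
    int rank, other;
    static char sbuf[SENDSIZE], rbuf[SENDSIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                        /* assumes exactly 2 ranks */

    MPI_Send(sbuf, SENDSIZE, MPI_CHAR, other, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, SENDSIZE, MPI_CHAR, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```

Reordering so one rank receives first, or using MPI_Isend / MPI_Sendrecv, makes the exchange correct regardless of the threshold.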
Most of these also have _init modes to create a persistent request that can be started with MPI_Start[all]() and completed with MPI_[Test|Wait][any|all|some]. *By basic, I mean excluding things like broadcasts, scatters, reduces, all of which have some send action included within them.
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
5-10 us for MPI_Isend / MPI_Irecv to return; xfer took 1.7 ms (147 MB/s).
It’s important to look at the work you need to speed up and understand which approach will do better for you. Multiple separable tasks, each of ~the same difficulty, work well with task parallelism.
Data parallelism works well with large data sets. Load balancing can become an issue if relative workloads aren’t known a priori.
It’s important to consider how to split the data in a data-parallel system. Suppose you know you want to do MIPs across Z repeatedly; ignoring everything else, you would want to lay out the data with z available locally (but not necessarily contiguous; SSE instructions for maximums don’t work along the four contiguous elements, but between pairs of elements in two four-value sets -- have z as your next-to-fastest dimension). Other examples of distributions are cyclic and block-cyclic; also high-dimension splitting (into a grid, for example).
People are really doing this… We’re really doing this… Data is split along x immediately after FTx and distributed to all nodes. Calibration scan taken earlier. (This is not the iterative reconstruction.) GW is done one dimension at a time; it requires data along that dimension to be local, so we transpose before the GWx correction. 2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set).
It’s important to look at your problem and determine where it can be separated out. In general, MPI works better if you can separate at the large scale rather than the fine scale; SIMD is an example of fine-scale parallelism. Are each of these separable? Maximum: can do local maximums, and then the max of maximums. Sort: parallel bitonic sort; out of scope here; 55 ms for 1/4 qsort; 85 ms for a full parallel sort; ~220 ms for one qsort of the full vector (1 mega-element ints). 1D FFT: fine as long as the FFT is not along the split dimension (assuming the time of a single 1D FFT is small enough that you won’t try to split it up). 2D FFTs: easy as long as not split along the FFTs. 3D FFTs: perform along contiguous dims; swap for the final one (“transposed input/output” options in the fftw3 MPI implementation).
320x176x320 -> 384x256x384 in ~0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library. Custom labels in the profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
One-sided communication opens up race conditions concerns again, but gains some latency / BW because of reduced negotiation
Efficient: make use of all data on a cache line when you read it; and only read it once
Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” There are some packages out there (OpenMPI w/ Eclipse; TotalView) to help with debugging MPI. Errors on other nodes can cause the one you’re debugging to receive a signal to exit.
You can build a cluster virtually just to see how things work…
All of these mailing lists are active, and wonderful places to get help (After you’ve read the Docs & FAQ!)