Eric Borisch, M.S.
Mayo Clinic
   Motivation for distributed computing
   What MPI is
   Intro to MPI programming
   Thinking in parallel
   Wrap up
   Shared Memory: all memory within a system is directly
    addressable (ignoring access restrictions) by each process [or
     Single- and multi-CPU desktops & laptops
     Multi-threaded apps
     GPGPU *
     MPI *
   Distributed Memory: memory available to a given node within
    a system is unique and distinct from its peers
     MPI
     Google MapReduce / Hadoop
[Figure: relative performance vs. # of processes (1 to 8). System: CentOS 5.2; Dual Quad-Core 3GHz P4 [E5472]; DDR2]
[Figure: STREAM benchmark OpenMP performance; relative performance vs. threads (0 to 16; 8 physical cores + HT). System: 2x X5570 (2.93GHz; Quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3]
   Bandwidth (FSB, HT, Nehalem, CUDA, …)
     Frequently run into with high-level languages (MATLAB)
   Capacity – cost & availability
     High-density chips are $$$ (if even available)
     Memory limits on individual systems
   Distributed computing addresses both bandwidth and
    capacity with multiple systems
   MPI is the glue used to connect multiple distributed
    processes together
   Custom iterative SENSE reconstruction
   3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
     Profile data (img space)
     Estimate (img<->k space)
     Acquired data (k space)
     > 4GB data touched during each iteration
   16, 32 channel data here or on the way…
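As a rough check on the "> 4GB" figure, the sketch below multiplies out the dimensions listed above. It assumes the trailing 8 is the size in bytes of one complex float (two 4-byte floats) and that the leading 3 counts the three arrays (profile, estimate, acquired). The function name `sense_bytes` is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes occupied by `arrays` volumes of coils x nx x ny x nz samples,
   each sample taking bytes_per_sample bytes.
   64-bit arithmetic: the product does not fit in 32 bits. */
uint64_t sense_bytes(uint64_t arrays, uint64_t coils,
                     uint64_t nx, uint64_t ny, uint64_t nz,
                     uint64_t bytes_per_sample)
{
    assert(arrays > 0 && coils > 0 && nx > 0 && ny > 0 && nz > 0);
    assert(bytes_per_sample > 0);
    return arrays * coils * nx * ny * nz * bytes_per_sample;
}
```

With the slide's numbers, `sense_bytes(3, 8, 400, 320, 176, 8)` gives 4,325,376,000 bytes, a little over 4 GiB.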

     Trzasko, Josh ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms”
                                                                           M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
[Figure: processing pipeline diagram]
   Boxes: Place view into correct x-Ky-Kz space (AP & LP); CAL; FTyz (AP & LP); “Traditional” 2D SENSE Unfold (AP & LP); Homodyne Correction; GW Correction (Y, Z); GW Correction (X); Store /
   Legend: Pre-loaded data; Real-time data; MPI Communication; Root node; Worker nodes
[Figure: cluster hardware diagram]
   Root Node: 3.6GHz P4 (x2); 16GB RAM; 80GB HDD; 500GB HDD; 1Gb Eth to Site Intranet; 1Gb Eth; 2x8Gb IB
   Worker Node (x7): 3.6GHz P4 (x2); 16GB RAM; 80GB HDD; 1Gb Eth; 2x8Gb IB
   16-Port Gigabit Ethernet Switch: x7 File system
   24-Port Infiniband Switch: x7x2 MPI interconnects; 16Gb/s bandwidth per node
   MRI System: 8Gb/s Connection
   Legend: Cluster Hardware; External Hardware; 2x8Gig Infiniband connection; 1Gig Ethernet connection
   Loosely coupled
     SETI / BOINC
     “Grid computing”
   BIOS-level abstraction
     ScaleMP
   Tightly coupled
     MPI
     “Cluster computing”
   Hybrid
     Folding@Home

[Figure: a single host (Host, OS; Process A with Thread 1 and Thread 2; Process B) next to distributed hosts I, II, … N (each with its own OS; Processes A, B, C). Labels: Master; Node. Legend: Memory Transfers; Network Transfers]
[Figure: the same diagram with a second process on each distributed host (Processes D, E, F)]
   Motivation for distributed computing
   What MPI is
   Intro to MPI programming
   Thinking in parallel
   Wrap up
 Message Passing Interface is…
   “a library specification for message-passing” 1
   Available in many implementations on multiple
    platforms *
 A set of functions for moving messages between
  different processes without a shared memory
 Low-level*; no concept of overall computing tasks
  to be performed

   MPI-1
       Version 1.0 draft standard 1994
       Version 1.1 in 1995
       Version 1.2 in 1997
       Version 1.3 in 2008
   MPI-2
     Added:
        ▪ 1-sided communication
        ▪ Dynamic “world” sizes; spawn / join
     Version 2.0 in 1997
     Version 2.1 in 2008
   MPI-3
     In process
     Enhanced fault handling
   Forward compatibility preserved
   MPI is the de-facto standard for distributed computing
       Freely available
       Open source implementations exist
       Portable
       Mature
   From a discussion of why MPI is dominant [1]:
     […] 100s of languages have come and gone.
     Good stuff must have been created [… yet] it is broadly accepted in the field
        that they’re not used.
       MPI has a lock.
       OpenMP is accepted, but a distant second.
       There are substantial barriers to the introduction of new languages and
        language constructs.
       Economic, ecosystem related, psychological, a catch-22 of widespread
        use, etc.
       Any parallel language proposal must come equipped with reasons why it will
        overcome those barriers.
   MPI itself is just a specification. We want an implementation
   MPICH, MPICH2
     Widely portable
   MVAPICH, MVAPICH2
     Infiniband-centric; MPICH/MPICH2 based
   OpenMPI
     Plug-in architecture; many run-time options
   And more:
       IntelMPI
       HP-MPI
       MPI for IBM Blue Gene
       MPI for Cray
       Microsoft MPI
       MPI for SiCortex
       MPI for Myrinet Express (MX)
       MPICH2 over SCTP
   Without MPI:
     Start all of the processes across bank of machines
      (shell scripting + ssh)
     socket(), bind(), listen(), accept() or connect() each
     send(), read() on individual links
     Raw byte interfaces; no discrete messages
   With MPI
       mpiexec -np <n> app
       MPI_Init()
       MPI_Send()
       MPI_Recv()
       MPI_Finalize()
   MPI:
     Manages the connections
     Packages messages
     Provides launching mechanism
Provides definitions for:
 Communication functions
       MPI_Send()
       MPI_Recv()
       MPI_Bcast()
       etc.
   Datatype management functions
     MPI_Type_create_hvector()
   C, C++, and Fortran bindings
   Also recommends process startup
     mpiexec -np <nproc> <program> <args>

MPI_Abort                    MPI_Comm_remote_size           MPI_File_read_ordered_end      MPI_Group_rank              MPI_Scatterv                    MPI_Unpublish_name
MPI_Accumulate               MPI_Comm_set_attr              MPI_File_read_shared           MPI_Group_size              MPI_Send                        MPI_Wait
MPI_Add_error_class          MPI_Comm_set_errhandler        MPI_File_seek                  MPI_Group_translate_ranks   MPI_Send_init                   MPI_Waitall
MPI_Add_error_code           MPI_Comm_set_name              MPI_File_seek_shared           MPI_Group_union             MPI_Sendrecv                    MPI_Waitany
MPI_Add_error_string         MPI_Comm_size                  MPI_File_set_atomicity         MPI_Ibsend                  MPI_Sendrecv_replace            MPI_Waitsome
MPI_Address                  MPI_Comm_spawn                 MPI_File_set_errhandler        MPI_Info_create             MPI_Ssend                       MPI_Win_call_errhandler
MPI_Allgather                MPI_Comm_spawn_multiple        MPI_File_set_info              MPI_Info_delete             MPI_Ssend_init                  MPI_Win_complete
MPI_Allgatherv               MPI_Comm_split                 MPI_File_set_size              MPI_Info_dup                MPI_Start                       MPI_Win_create
MPI_Alloc_mem                MPI_Comm_test_inter            MPI_File_set_view              MPI_Info_free               MPI_Startall                    MPI_Win_create_errhandler
MPI_Allreduce                MPI_Dims_create                MPI_File_sync                  MPI_Info_get                MPI_Status_set_cancelled        MPI_Win_create_keyval
MPI_Alltoall                 MPI_Errhandler_create          MPI_File_write                 MPI_Info_get_nkeys          MPI_Status_set_elements         MPI_Win_delete_attr
MPI_Alltoallv                MPI_Errhandler_free            MPI_File_write_all             MPI_Info_get_nthkey         MPI_Test                        MPI_Win_fence
MPI_Alltoallw                MPI_Errhandler_get             MPI_File_write_all_begin       MPI_Info_get_valuelen       MPI_Test_cancelled              MPI_Win_free
MPI_Attr_delete              MPI_Errhandler_set             MPI_File_write_all_end         MPI_Info_set                MPI_Testall                     MPI_Win_free_keyval
MPI_Attr_get                 MPI_Error_class                MPI_File_write_at              MPI_Init                    MPI_Testany                     MPI_Win_get_attr
MPI_Attr_put                 MPI_Error_string               MPI_File_write_at_all          MPI_Init_thread             MPI_Testsome                    MPI_Win_get_errhandler
MPI_Barrier                  MPI_Exscan                     MPI_File_write_at_all_begin    MPI_Initialized             MPI_Topo_test                   MPI_Win_get_group
MPI_Bcast                    MPI_File_c2f                   MPI_File_write_at_all_end      MPI_Intercomm_create        MPI_Type_commit                 MPI_Win_get_name
MPI_Bsend                    MPI_File_call_errhandler       MPI_File_write_ordered         MPI_Intercomm_merge         MPI_Type_contiguous             MPI_Win_lock
MPI_Bsend_init               MPI_File_close                 MPI_File_write_ordered_begin   MPI_Iprobe                  MPI_Type_create_darray          MPI_Win_post
MPI_Buffer_attach            MPI_File_create_errhandler     MPI_File_write_ordered_end     MPI_Irecv                   MPI_Type_create_hindexed        MPI_Win_set_attr
MPI_Buffer_detach            MPI_File_delete                MPI_File_write_shared          MPI_Irsend                  MPI_Type_create_hvector         MPI_Win_set_errhandler
MPI_Cancel                   MPI_File_f2c                   MPI_Finalize                   MPI_Is_thread_main          MPI_Type_create_indexed_block   MPI_Win_set_name
MPI_Cart_coords              MPI_File_get_amode             MPI_Finalized                  MPI_Isend                   MPI_Type_create_keyval          MPI_Win_start
MPI_Cart_create              MPI_File_get_atomicity         MPI_Free_mem                   MPI_Issend                  MPI_Type_create_resized         MPI_Win_test
MPI_Cart_get                 MPI_File_get_byte_offset       MPI_Gather                     MPI_Keyval_create           MPI_Type_create_struct          MPI_Win_unlock
MPI_Cart_map                 MPI_File_get_errhandler        MPI_Gatherv                    MPI_Keyval_free             MPI_Type_create_subarray        MPI_Win_wait
MPI_Cart_rank                MPI_File_get_group             MPI_Get                        MPI_Lookup_name             MPI_Type_delete_attr            MPI_Wtick
MPI_Cart_shift               MPI_File_get_info              MPI_Get_address                MPI_Op_create               MPI_Type_dup                    MPI_Wtime
MPI_Cart_sub                 MPI_File_get_position          MPI_Get_count                  MPI_Op_free                 MPI_Type_extent
MPI_Cartdim_get              MPI_File_get_position_shared   MPI_Get_elements               MPI_Open_port               MPI_Type_free
MPI_Close_port               MPI_File_get_size              MPI_Get_processor_name         MPI_Pack                    MPI_Type_free_keyval
MPI_Comm_accept              MPI_File_get_type_extent       MPI_Get_version                MPI_Pack_external           MPI_Type_get_attr
MPI_Comm_call_errhandler     MPI_File_get_view              MPI_Graph_create               MPI_Pack_external_size      MPI_Type_get_contents
MPI_Comm_compare             MPI_File_iread                 MPI_Graph_get                  MPI_Pack_size               MPI_Type_get_envelope
MPI_Comm_connect             MPI_File_iread_at              MPI_Graph_map                  MPI_Pcontrol                MPI_Type_get_extent
MPI_Comm_create              MPI_File_iread_shared          MPI_Graph_neighbors            MPI_Probe                   MPI_Type_get_name
MPI_Comm_create_errhandler   MPI_File_iwrite                MPI_Graph_neighbors_count      MPI_Publish_name            MPI_Type_get_true_extent
MPI_Comm_create_keyval       MPI_File_iwrite_at             MPI_Graphdims_get              MPI_Put                     MPI_Type_hindexed
MPI_Comm_delete_attr         MPI_File_iwrite_shared         MPI_Grequest_complete          MPI_Query_thread            MPI_Type_hvector
MPI_Comm_disconnect          MPI_File_open                  MPI_Grequest_start             MPI_Recv                    MPI_Type_indexed
MPI_Comm_dup                 MPI_File_preallocate           MPI_Group_compare              MPI_Recv_init               MPI_Type_lb
MPI_Comm_free                MPI_File_read                  MPI_Group_difference           MPI_Reduce                  MPI_Type_match_size
MPI_Comm_free_keyval         MPI_File_read_all              MPI_Group_excl                 MPI_Reduce_scatter          MPI_Type_set_attr
MPI_Comm_get_attr            MPI_File_read_all_begin        MPI_Group_free                 MPI_Register_datarep        MPI_Type_set_name
MPI_Comm_get_errhandler      MPI_File_read_all_end          MPI_Group_incl                 MPI_Request_free            MPI_Type_size
MPI_Comm_get_name            MPI_File_read_at               MPI_Group_intersection         MPI_Request_get_status      MPI_Type_struct
MPI_Comm_get_parent          MPI_File_read_at_all           MPI_Group_range_excl           MPI_Rsend                   MPI_Type_ub
MPI_Comm_group               MPI_File_read_at_all_begin     MPI_Group_range_incl           MPI_Rsend_init              MPI_Type_vector
MPI_Comm_join                MPI_File_read_at_all_end                                      MPI_Scan                    MPI_Unpack
MPI_Comm_rank                MPI_File_read_ordered                                         MPI_Scatter                 MPI_Unpack_external
MPI_Comm_remote_group        MPI_File_read_ordered_begin
   Each process owns their data – there is no “our”
     Makes many things simpler; no mutexes, condition
      variables, semaphores, etc; memory access order race
      conditions go away
   Every message is an explicit copy
     I have the memory I sent from, you have the memory you
      used to receive into
     Even when running in a “shared memory” environment
   Synchronization comes along for free
     I won’t get your message (or data) until you choose to
      send it
   Programming to MPI first can make it easier to scale out
    later
   Motivation for distributed computing
   What MPI is
   Intro to MPI programming
   Thinking in parallel
   Wrap up
   Download / decompress MPICH source:
     Supports: C / C++ / Fortran
     Requires Python >= 2.2
   ./configure
   make install
     installs into /usr/local by default, or use
      --prefix=<chosen path>
   Make sure <prefix>/bin is in PATH
   Make sure <prefix>/share/man is in MANPATH
c compiler wrapper   c++ compiler wrapper

                      MPI job launcher
 MPD launcher
   Set up passwordless ssh to workers
   Start the daemons with mpdboot -n <N>
     Requires ~/.mpd.conf to exist on each host
      ▪ Contains: (same on each host)
        ▪ MPD_SECRETWORD=<some gibberish string>
      ▪ permissions set to 600 (r/w access for owner only)
     Requires ./mpd.hosts to list other host names
      ▪ Unless run as mpdboot -n 1 (run on current host only)
      ▪ Will not accept current host in list (implicit)
   Check for running daemons with mpdtrace
             For details:
   Use mpicc / mpicxx as the C / C++ compiler
     Wrapper scripts around the C / C++ compilers detected
      during install
      ▪ $ mpicc -show
        gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt
     $ mpicc -o hello hello.c
   Use mpiexec -np <nproc> <app> <args> to launch
     $ mpiexec -np 4 ./hello
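/* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
    int i, rank, nodes;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nodes);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process: 0 .. nodes-1 */

    /* Ranks take turns printing; the barrier keeps them in step,
       but the order in which output reaches the terminal is not guaranteed. */
    for (i = 0; i < nodes; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }

    MPI_Finalize();
    return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!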
                                        Thread within
                                    threaded_app process
pthread_create( func() )

       Do work                            Do work

    pthread_join()                      pthread_exit()

mpiexec -np 4 ./mpi_app

                       mpd launches jobs

  mpi_app [rank 0]          mpi_app [rank 1]     mpi_app [rank 3]

       main()                   main()                main()

     MPI_Init()                MPI_Init()           MPI_Init()        MPI comm.

    MPI_Bcast()               MPI_Bcast()          MPI_Bcast()        MPI comm.

Do Work on local mem   Do Work on local mem    Do Work on local mem

  MPI_Allreduce()           MPI_Allreduce()      MPI_Allreduce()      MPI comm.

   MPI_Finalize()            MPI_Finalize()       MPI_Finalize()      MPI comm.

       exit()                    exit()               exit()
/* hello.c */
#include <stdio.h>
#include <mpi.h>
int main (int argc, char * argv[])
{
    int i;
    int rank;
    int nodes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < nodes; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
    }
    MPI_Finalize();
    return 0;
}
   MPICH2 comes with mpe by default (unless disabled
    during configure)
   Multiple tracing / logging options to track MPI traffic
   Enabled through -mpe=<option> at compile time
      MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
      MacPro:code$ mpiexec -np 4 ./hello
      Hello from 0 of 4!
      Hello from 2 of 4!
      Hello from 1 of 4!
      Hello from 3 of 4!
      Writing logfile....
      Enabling the Default clock synchronization...
      Finished writing logfile ./hello.clog2.
MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
  MacPro:code$ mpiexec -np 2 ./hello > trace

MacPro:code$ grep 0 trace                       MacPro:code$ grep 1 trace
[0] Ending MPI_Init                             [1] Ending MPI_Init
[0] Starting MPI_Comm_size...                   [1] Starting MPI_Comm_size...
[0] Ending MPI_Comm_size                        [1] Ending MPI_Comm_size
[0] Starting MPI_Comm_rank...                   [1] Starting MPI_Comm_rank...
[0] Ending MPI_Comm_rank                        [1] Ending MPI_Comm_rank
[0] Starting MPI_Barrier...                     [1] Starting MPI_Barrier...
[0] Ending MPI_Barrier                          [1] Ending MPI_Barrier
Hello from 0 of 2!                              [1] Starting MPI_Barrier...
[0] Starting MPI_Barrier...                     [1] Ending MPI_Barrier
[0] Ending MPI_Barrier                          Hello from 1 of 2!
[0] Starting MPI_Finalize...                    [1] Starting MPI_Finalize...
[0] Ending MPI_Finalize                         [1] Ending MPI_Finalize
int MPI_Send(
  void *buf,
     memory location to send from
  int count,
     number of elements (of type datatype) at buf
  MPI_Datatype datatype,
     MPI_INT, MPI_FLOAT, etc…
     Or custom datatypes; strided vectors; structures, etc.
  int dest,
     rank (within the communicator comm) of destination for this message
  int tag,
     used to distinguish this message from other messages
  MPI_Comm comm )
     communicator for this transfer
     often MPI_COMM_WORLD
int MPI_Recv(
  void *buf,
      memory location to receive data into
  int count,
      number of elements (of type datatype) available to receive into at buf
  MPI_Datatype datatype,
      MPI_INT, MPI_FLOAT, etc…
      Or custom datatypes; strided vectors; structures, etc.
      Typically matches sending datatype, but doesn’t have to…
  int source,
      rank (within the communicator comm) of source for this message
      can also be MPI_ANY_SOURCE
  int tag,
      used to distinguish this message from other messages
      can also be MPI_ANY_TAG
  MPI_Comm comm,
      communicator for this transfer
      often MPI_COMM_WORLD
  MPI_Status *status )
      Structure describing the received message, including:
             actual count (can be smaller than passed count)
             source (useful if used with source = MPI_ANY_SOURCE)
             tag (useful if used with tag = MPI_ANY_TAG)
/* sr.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[] )
{
    int rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
    MPI_Status sendStatus;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    myData[0] = rank;

    /* Every rank sends to its right-hand neighbor first, then receives
       from its left-hand neighbor. This completes only if MPI buffers
       the outgoing message. */
    MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
    MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes,
             0, MPI_COMM_WORLD, &sendStatus);

    printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

    MPI_Finalize();
    return 0;
}
$ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
(no output: the job hangs)
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
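A caveat on the last command, offered as a reading of the C grammar rather than of the author's intent: subtraction binds tighter than shift, so `0x1<<14 - 1` is parsed as `0x1 << (14 - 1)`. That is the same element count as the `0x1<<13` run, not one element below the size that failed. The helper names below are hypothetical and exist only to make the two readings explicit.

```c
#include <assert.h>
#include <stddef.h>

/* Element count the compiler sees for -DSENDSIZE="0x1<<14 - 1":
   additive operators bind tighter than shift operators. */
int sendsize_as_parsed(void)
{
    return 0x1 << (14 - 1);
}

/* Element count one below the failing size, if that was the intent. */
int sendsize_parenthesized(void)
{
    return (0x1 << 14) - 1;
}

/* Payload size in bytes of a message holding `count` ints. */
size_t payload_bytes(int count)
{
    assert(count >= 0);
    return (size_t)count * sizeof(int);
}
```

Assuming a 4-byte `int`, 8192 elements is a 32 KiB payload and 16384 elements is 64 KiB, so for that particular installation the switch away from buffered sends falls somewhere between those two sizes.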
3.4 Communication Modes
The send call described in Section Blocking send is blocking: it does not return until the message data
and envelope have been safely stored away so that the sender is free to access and overwrite the send
buffer. The message might be copied directly into the matching receive buffer, or it might be copied
into a temporary system buffer.
Message buffering decouples the send and receive operations. A blocking send can complete as soon
as the message was buffered, even if no matching receive has been executed by the receiver. On the
other hand, message buffering can be expensive, as it entails additional memory-to-memory
copying, and it requires the allocation of memory for buffering. MPI offers the choice of several
communication modes that allow one to control the choice of the communication protocol.
The send call described in Section Blocking send used the standard communication mode. In this
mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer
outgoing messages. In such a case, the send call may complete before a matching receive is
invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer
outgoing messages, for performance reasons. In this case, the send call will not complete until a
matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It
may complete before a matching receive is posted. The standard mode send is non-local: successful
completion of the send operation may depend on the occurrence of a matching receive.
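Because a standard-mode send may not complete until a matching receive is posted, the sr.c pattern (every rank sends, then receives) is only safe while messages happen to be buffered. One common remedy is to alternate the order by rank parity, so that each chain of waiting senders ends at a rank that is already receiving. MPI_Sendrecv and the non-blocking calls are other options. The sketch below covers only the bookkeeping and makes no MPI calls; all names in it are hypothetical.

```c
#include <assert.h>

typedef enum { SEND_THEN_RECV, RECV_THEN_SEND } exchange_order;

/* Right-hand neighbor in a ring of `nodes` ranks. */
int ring_next(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + 1) % nodes;
}

/* Left-hand neighbor in a ring of `nodes` ranks. */
int ring_prev(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + nodes - 1) % nodes;
}

/* Even ranks send first, odd ranks receive first. */
exchange_order ring_order(int rank)
{
    assert(rank >= 0);
    return (rank % 2 == 0) ? SEND_THEN_RECV : RECV_THEN_SEND;
}
```

A rank would call its blocking send and receive in the order `ring_order(rank)` reports, using `ring_next` as the destination and `ring_prev` as the source. A single-rank run would still send to itself, so `nodes == 1` needs separate handling.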

Process 1                            Process 2
 Send “small”
  message &
    return      Eager send             Eager recv

 Send “large”                                       Request & receive
   message                              Receive      small message
                Rndv. req.
                                       Rndv. req.
 Blocks until                         Match Rndv.    Request large
 completion.    Rndv. send                             message

                                        Receive       Receive large
                                       Rndv. data       message

                             User activity
                             MPI activity
   MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
       Sends are “local” – they return independent of any remote activity
       Message buffer can be touched immediately after call returns
       Requires a user-provided buffer, provided via MPI_Buffer_attach()
       Forces an “eager”-like message transfer from sender’s perspective
       User can wait for completion by calling MPI_Buffer_detach()
   MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
       Won’t return until matching receive is posted
       Forces a “rendezvous”-like message transfer
       Can be used to guarantee synchronization without additional MPI_Barrier() calls
   MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
       Erroneous if matching receive has not been posted
       Performance tweak (on some systems) when user can guarantee matching receive is posted
   MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
       Non-blocking, immediate return once send/receive request is posted
       Requires MPI_[Test|Wait][|all|any|some] call to guarantee completion
       Send/receive buffers should not be touched until completed
       MPI_Request * argument used for eventual completion

   The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to
    receive any send mode.
/* sr2.c */
#include <stdio.h>
#include <mpi.h>
#ifndef SENDSIZE
#define SENDSIZE 1
#endif

int main (int argc, char * argv[] )
{
     int rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
     MPI_Status xferStatus[2];
     MPI_Request xferRequest[2];

     MPI_Init(&argc, &argv);

     MPI_Comm_size(MPI_COMM_WORLD, &nodes);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     myData[0] = rank;

     /* Post both transfers; neither call blocks. */
     MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[0]);
     MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[1]);

     /* Wait for both to complete before touching either buffer. */
     MPI_Waitall(2, xferRequest, xferStatus);

     printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

     MPI_Finalize();
     return 0;
}
$ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2
   Motivation for distributed computing
   What MPI is
   Intro to MPI programming
   Thinking in parallel
   Wrap up
   Task parallelism
     Each process handles a unique kind of task
      ▪ Example: multi-image uploader (with resize/recompress)
        ▪   Thread 1: GUI / user interaction
        ▪   Thread 2: file reader & decompression
        ▪   Thread 3: resize & recompression
        ▪   Thread 4: network communication
     Can be used in a grid with a pipeline of separable tasks
      to be performed on each data set
      ▪ Resample / warp volume
      ▪ Segment volume
      ▪ Calculate metrics on segmented volume
   Data parallelism
     Each process handles a portion of the entire data
     Often used with large data sets
      ▪ [task 0… | … task 1 … | … | … task n]
     Frequently used in MPI programming
     Each process is “doing the same thing,” just on a
      different subset of the whole
   Layout is crucial in high-performance computing
     BW efficiency; cache efficiency
     Even more important in distributed
     Poor layout → extra communication
   Shown is an example of “block” data distribution
     x is contiguous dimension
     z is slowest dimension
     Each node has contiguous portion of z
[Figure: volume with axes x, y, z, split among Node 0 through Node 7]
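The "block" distribution can be sketched as a small helper that tells each rank which contiguous run of z slices it owns. This is one common way to split the slices evenly, not necessarily the one used in the reconstruction described here; the function name `block_range` and its parameters are made up for the illustration.

```c
#include <assert.h>

/* Give rank `rank` (of `nodes`) a contiguous run of the nz slices.
   Slices are divided as evenly as possible; when nz is not a multiple
   of nodes, the first nz % nodes ranks each take one extra slice. */
void block_range(int nz, int nodes, int rank, int *first, int *count)
{
    int base, extra;

    assert(nz >= 0 && nodes > 0);
    assert(rank >= 0 && rank < nodes);
    assert(first != 0 && count != 0);

    base  = nz / nodes;
    extra = nz % nodes;

    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

For 176 slices on 8 nodes every rank gets 22 slices, and rank 3 starts at slice 66. Because x is the contiguous dimension and z the slowest, each rank's share is a single contiguous region of memory.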
[Figure: processing pipeline diagram]
   Boxes: Place view into correct x-Ky-Kz space (AP & LP); CAL; FTyz (AP & LP); “Traditional” 2D SENSE Unfold (AP & LP); Homodyne Correction; GW Correction (Y, Z); GW Correction (X); Display /
   Legend: Pre-loaded data; Real-time data; MPI Communication; Root node; Worker nodes
   Completely separable problems:
     Add 1 to everyone
     Multiply each a[i] * b[i]
   Inseparable problems: [?]
       Max of a vector
       Sort a vector
       MIP of a volume
       1D FFT of a volume
       2D FFT of a volume
       3D FFT of a volume
                                  [Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
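Several of the problems listed as inseparable still split into a local step plus a small combine step. Taking "max of a vector" as the example, the sketch below is a serial stand-in for that pattern: each chunk plays the part of one rank's local data, and the final comparison plays the part of a reduction such as MPI_Reduce with MPI_MAX. No MPI calls are made, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element of v[0..n-1]; n must be at least 1. */
float local_max(const float *v, size_t n)
{
    size_t i;
    float m;

    assert(v != NULL && n > 0);
    m = v[0];
    for (i = 1; i < n; i++)
        if (v[i] > m) m = v[i];
    return m;
}

/* Split v into `nodes` contiguous chunks, take each chunk's maximum,
   then combine the partial results. Empty chunks (possible when
   nodes > n) are skipped. */
float chunked_max(const float *v, size_t n, size_t nodes)
{
    size_t r;
    int have = 0;
    float result = 0.0f;

    assert(v != NULL && n > 0 && nodes > 0);
    for (r = 0; r < nodes; r++) {
        size_t first = r * n / nodes;
        size_t last  = (r + 1) * n / nodes;
        float partial;

        if (first == last) continue;
        partial = local_max(v + first, last - first);
        if (!have || partial > result) {
            result = partial;
            have = 1;
        }
    }
    return result;
}
```

Only one value per chunk crosses the chunk boundary, which is why a reduction like this communicates far less than the data it summarizes.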
   Dynamic datatypes
     MPI_Type_vector()
     Enables communication of sub-sets without packing
     Combined with DMA, permits zero-copy transposes, etc.
   Other collectives
     MPI_Reduce
     MPI_Scatter
     MPI_Gather
     One-sided (DMA) communication
      ▪ MPI_Put()
      ▪ MPI_Get()
     Dynamic world size
      ▪ Ability to spawn new processes during run
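To make the datatype bullet concrete: MPI_Type_vector(count, blocklength, stride, oldtype, &newtype) describes `count` blocks of `blocklength` elements whose starts are `stride` elements apart. The sketch below does by hand, in plain C, the packing step that such a datatype lets the MPI library handle instead. The function name `pack_strided` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `count` blocks of `blocklen` floats from src into the contiguous
   buffer dst. Consecutive blocks start `stride` floats apart in src. */
void pack_strided(const float *src, size_t count, size_t blocklen,
                  size_t stride, float *dst)
{
    size_t b, k;

    assert(src != NULL && dst != NULL);
    assert(blocklen <= stride);   /* blocks must not overlap */

    for (b = 0; b < count; b++)
        for (k = 0; k < blocklen; k++)
            dst[b * blocklen + k] = src[b * stride + k];
}
```

For a row-major 3 x 4 matrix `m`, column 1 is `pack_strided(m + 1, 3, 1, 4, col)`: three blocks of one element, four elements apart. With a vector datatype built from the same three numbers, the send call can take `m + 1` directly and the separate buffer is not needed.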
   Motivation for distributed computing
   What MPI is
   Intro to MPI programming
   Thinking in parallel
   Wrap up
   Take time on the algorithm & data layout
     Minimize traffic between nodes / separate
      ▪ FTx into xKyKz in SENSE example
     Cache-friendly (linear, efficient) access patterns
   Overlap processing and communication
     MPI_Isend() / MPI_Irecv() with multiple work buffers
     While actively transferring one, process the other
     Larger messages will hit a higher BW (in general)
   Profile
     Vtune (Intel; Linux / Windows)
     Shark (Mac)
     MPI profiling with -mpe=mpilog
   Avoid “premature optimization” (Knuth)
   Implementation time & effort vs. runtime
   Use derived datatypes rather than packing
   Using a debugger with MPI is hard
     Build in your own debugging messages from go
   If you might need MPI, build to MPI.
     Works well in shared memory environments
      ▪ It’s getting better all the time
     Encourages memory locality in NUMA architectures
      ▪ Nehalem, AMD
     Portable, reusable, open-source
     Can be used in conjunction with threads / OpenMP /
      TBB / CUDA / OpenCL: “Hybrid model of parallel programming”
     Messaging paradigm can create “less obfuscated”
      code than threads / OpenMP
   Homogeneous nodes
   Private network
     Shared filesystem; ssh communication
   Password-less SSH
   High-bandwidth private interconnect
     MPI communication exclusively
     GbE, 10GbE
     InfiniBand
   Consider using Rocks
       CentOS / RHEL based
       Built for building clusters
       Rapid network boot based install/reinstall of nodes
   MPI documents
   MPICH2
   Open MPI
   MVAPICH[1|2] (InfiniBand-tuned distribution)
   Rocks
   Books:
     Pacheco, Peter S., Parallel Programming with MPI
     Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
     Gropp, W., Using MPI-2
   This is the painting operation for one RGBA pixel (in) onto
    another (out)
   We can do red and blue together, as we know they won’t collide,
    and we can mask out the unwanted results.
   Post-multiply masks are applied in the shifted position to
    minimize the number of shift operations
   Note: we’re using pre-multiplied colors & painting onto an
    opaque background

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void
blendPreToStatic(const uint32_t& in, uint32_t& out)
{
  uint32_t alpha = in >> 24;
  if (alpha & 0x00000080u) ++alpha;
  out = A | RGB &
    (in +
      (
        (
          (alpha * (out & RB) & RB_8OFF) |
          (alpha * (out & G) & G_8OFF)
        ) >> 8
      )
    );
}
OUT = A | RGB &
    (IN +
        (((ALPHA * (OUT & RB) & RB_8OFF) |
          (ALPHA * (OUT & G)  & G_8OFF)) >> 8))
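The same arithmetic can be written as a plain C function and checked by hand. This is a sketch under an assumption suggested by the comments in the SSE2 listing: the top byte of the pre-multiplied input pixel holds its transparency, 0 for fully opaque and 255 for fully transparent. The name blend_pre_to_static is a hypothetical C counterpart of blendPreToStatic.

```c
#include <assert.h>
#include <stdint.h>

#define RB      0x00FF00FFu  /* red and blue channels */
#define RB_8OFF 0xFF00FF00u  /* red and blue after the multiply, 8 bits up */
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u  /* opaque output alpha */

/* Paint pre-multiplied pixel `in` onto opaque pixel `out`; returns the result. */
uint32_t blend_pre_to_static(uint32_t in, uint32_t out)
{
    uint32_t t = in >> 24;   /* assumed: transparency of the foreground, 0..255 */
    if (t & 0x80u)
        ++t;                 /* 128..255 become 129..256, so 255 keeps the background exactly */
    /* Background contribution: scale R|B and G separately, then shift back down. */
    uint32_t bg = (((t * (out & RB)) & RB_8OFF) |
                   ((t * (out & G))  & G_8OFF)) >> 8;
    return A | (RGB & (in + bg));
}
```

A fully transparent input (0xFF000000) leaves the background unchanged, and a fully opaque input replaces it with the input color.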
   For cases where there is no overlap between
    the four output pixels for four input pixels, we
    can use vectorized (SSE2) code
   128-bit wide registers; load four 32-bit RGBA
    values, use the same approach as previously
    (R|B and G) in two registers to perform four
    paints at once
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics

inline void
blend4PreToStatic(uint32_t ** in, uint32_t * out) // Paints in (quad-word) onto out
{
  __m128i rb, g, a, a_, o, mask_reg; // Registers

  rb = _mm_loadu_si128((__m128i *) out); // Load destination (unaligned -- may not be on a 128-bit boundary)
  a_ = _mm_load_si128((__m128i *) *in); // We make sure the input is on a 128-bit boundary before this call
  *in += 4; _mm_prefetch((char*) (*in + 28), _MM_HINT_T0); // Fetch the two-cache-line-out memory

  mask_reg = _mm_set1_epi32(0x0000FF00); // Set green mask (x4)
  g = _mm_and_si128(rb, mask_reg); // Mask to greens (x4)
  mask_reg = _mm_set1_epi32(0x00FF00FF); // Set red and blue mask (x4)
  rb = _mm_and_si128(rb, mask_reg); // Mask to red and blue

  rb = _mm_slli_epi32(rb, 8); // << 8 ; g is already biased by 256 in 16-bit spacing

  a = _mm_srli_epi32(a_, 24); // >> 24 ; These are the four alpha values, shifted to lower 8 bits of each word
  mask_reg = _mm_slli_epi32(a, 16); // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
  a = _mm_or_si128(a, mask_reg); // We now have the alpha value at both bits [0-7] and [16-23] of each word

  // These steps add one to transparency values >= 0x80
  o = _mm_srli_epi16(a, 7); // Now the high bit is the low bit
  a = _mm_add_epi16(a, o); // Assumed step (the listing computes o but never applies it): add that bit to alpha

  // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
  // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
  // storing the upper 16 of the 32 bit result. (This is the operation that is available, so that's why we're
  // doing it in this fashion!)
  rb = _mm_mulhi_epu16(rb, a);
  g = _mm_mulhi_epu16(g, a);
  g = _mm_slli_epi32(g, 8); // Move green into the correct location.
  // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
  o = _mm_set1_epi32(0xFF000000); // Opaque alpha value
  o = _mm_or_si128(o, g);
  o = _mm_or_si128(o, rb); // o now has the background's contribution to the output color

  mask_reg = _mm_set1_epi32(0x00FFFFFF);
  g = _mm_and_si128(mask_reg, a_); // Removes alpha from foreground color

  o = _mm_add_epi32(o, g); // Add foreground and background contributions together

  _mm_storeu_si128((__m128i *) out, o); // Unaligned store
}
   Vectorizing this code achieves 3-4x speedup
    on cluster
     8x 2x(3.4|3.2GHz) Xeon, 800MHz FSB
     Render 512x512x409 (400MB) volume in
      ▪ ~22ms (45fps) (SIMD code)
      ▪ ~92ms (11fps) (Non-vectorized)
     ~18GB/s memory throughput
     ~11 cycles / voxel vs. ~45 cycles non-vectorized
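The cycles-per-voxel and throughput figures can be cross-checked from the render times. The sketch below assumes the counts cover all 16 processors (8 nodes x 2) at a 3.4 GHz clock, and that the 400 MB volume is read once per frame. The function names are hypothetical.

```c
#include <assert.h>

/* Processor cycles spent per voxel, summed across the whole cluster. */
double cycles_per_voxel(double seconds, double clock_hz, double processors,
                        double voxels)
{
    assert(voxels > 0.0);
    return seconds * clock_hz * processors / voxels;
}

/* Aggregate memory throughput in bytes/s if `bytes` are touched once. */
double throughput(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds;
}
```

For 512 x 512 x 409 voxels this gives roughly 11 cycles per voxel at 22 ms and roughly 47 at 92 ms (about 44 at 3.2 GHz). 400 MB in 22 ms is about 18 GB/s. Both are in line with the figures above.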
MPI_Init(3)                  MPI               MPI_Init(3)

MPI_Init - Initialize the MPI execution environment

    int MPI_Init( int *argc, char ***argv )

argc - Pointer to the number of arguments
argv - Pointer to the argument vector

    This routine must be called by one thread only. That thread is called
    the main thread and must be the thread that calls MPI_Finalize .

    The MPI standard does not say what a program can do before an MPI_INIT
    or after an MPI_FINALIZE . In the MPICH implementation, you should do
    as little as possible. In particular, avoid anything that changes the
    external state of the program, such as opening files, reading standard
    input or writing to standard output.
MPI_Barrier(3)               MPI                MPI_Barrier(3)

MPI_Barrier - Blocks until all processes in the communicator have
   reached this routine.

    int MPI_Barrier( MPI_Comm comm )

comm - communicator (handle)

    Blocks the caller until all processes in the communicator have called
    it; that is, the call returns at any process only after all members of
    the communicator have entered the call.
MPI_Finalize(3)                MPI               MPI_Finalize(3)

    MPI_Finalize - Terminates MPI execution environment

    int MPI_Finalize( void )

    All processes must call this routine before exiting. The number of
    processes running after this routine is called is undefined; it is best
    not to perform much more than a return rc after calling MPI_Finalize .
MPI_Comm_size(3)                MPI             MPI_Comm_size(3)

MPI_Comm_size - Determines the size of the group associated with a
   communicator

    int MPI_Comm_size( MPI_Comm comm, int *size )

comm - communicator (handle)

size - number of processes in the group of comm (integer)
MPI_Comm_rank(3)                  MPI              MPI_Comm_rank(3)

MPI_Comm_rank - Determines the rank of the calling process in the
   communicator

    int MPI_Comm_rank( MPI_Comm comm, int *rank )

comm - communicator (handle)

rank - rank of the calling process in the group of comm (integer)
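
A common use of the size and rank values is to pick each process's share of a data set. The sketch below assumes a contiguous block distribution, which is only one of several possible layouts, and uses a hypothetical helper named block_range. Its rank and size arguments stand for the values returned by MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [*lo, *hi) of the n items owned by `rank` out of `size`
 * processes. The first n % size ranks receive one extra item. */
void block_range(size_t n, int size, int rank, size_t *lo, size_t *hi)
{
    assert(size > 0 && rank >= 0 && rank < size);
    size_t base  = n / (size_t)size;
    size_t extra = n % (size_t)size;
    size_t r     = (size_t)rank;
    *lo = r * base + (r < extra ? r : extra);
    *hi = *lo + base + (r < extra ? 1 : 0);
}
```

With 10 items and 4 processes, ranks 0 through 3 own [0,3), [3,6), [6,8) and [8,10).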
MPI_Send(3)                   MPI                 MPI_Send(3)

MPI_Send - Performs a blocking send

   int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
                MPI_Comm comm)

buf - initial address of send buffer (choice)
count - number of elements in send buffer (nonnegative integer)
datatype - datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)

    This routine may block until the message is received by the destination
    process.
MPI_Recv(3)                   MPI                  MPI_Recv(3)

MPI_Recv - Blocking receive for a message

   int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
                MPI_Comm comm, MPI_Status *status)

buf - initial address of receive buffer (choice)
status - status object (Status)

count - maximum number of elements in receive buffer (integer)
datatype - datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)

    The count argument indicates the maximum length of a message; the
    actual length of the message can be determined with MPI_Get_count .
MPI_Isend(3)                  MPI                 MPI_Isend(3)

MPI_Isend - Begins a nonblocking send

   int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
                 MPI_Comm comm, MPI_Request *request)

buf - initial address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype - datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)

request - communication request (handle)
MPI_Irecv(3)                 MPI                 MPI_Irecv(3)

MPI_Irecv - Begins a nonblocking receive

   int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
                 int tag, MPI_Comm comm, MPI_Request *request)

buf - initial address of receive buffer (choice)
count - number of elements in receive buffer (integer)
datatype - datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)

request - communication request (handle)
MPI_Bcast(3)                    MPI               MPI_Bcast(3)

MPI_Bcast - Broadcasts a message from the process with rank "root" to
   all other processes of the communicator

   int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root,
                  MPI_Comm comm )

buffer - starting address of buffer (choice)

count - number of entries in buffer (integer)
datatype - data type of buffer (handle)
root - rank of broadcast root (integer)
comm - communicator (handle)
MPI_Allreduce(3)                 MPI               MPI_Allreduce(3)

MPI_Allreduce - Combines values from all processes and distributes the
   result back to all processes

   int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count,
                       MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

sendbuf - starting address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype - data type of elements of send buffer (handle)
op - operation (handle)
comm - communicator (handle)

recvbuf - starting address of receive buffer (choice)
MPI_Type_create_hvector(3)          MPI        MPI_Type_create_hvector(3)

MPI_Type_create_hvector - Create a datatype with a constant stride
   given in bytes

   int MPI_Type_create_hvector(int count,
                  int blocklength,
                  MPI_Aint stride,
                  MPI_Datatype oldtype,
                  MPI_Datatype *newtype)

count - number of blocks (nonnegative integer)
blocklength - number of elements in each block (nonnegative integer)
stride - number of bytes between start of each block (address integer)
oldtype - old datatype (handle)

newtype - new datatype (handle)
mpicc(1)                    MPI                   mpicc(1)

mpicc - Compiles and links MPI programs written in C

      This command can be used to compile and link MPI programs written in C.
      It provides the options and any special libraries that are needed to
      compile and link MPI programs.

      It is important to use this command, particularly when linking pro-
      grams, as it provides the necessary libraries.

-show - Show the commands that would be used without running them
-help - Give short help
         - Use compiler name instead of the default choice. Use this
         only if the compiler is compatible with the MPICH library (see
         - Load a configuration file for a particular compiler. This
         allows a single mpicc command to be used with multiple compil-

mpiexec(1)                   MPI                 mpiexec(1)

mpiexec - Run an MPI program

    mpiexec args executable pgmargs [ : args executable pgmargs ... ]

     where args are command line arguments for mpiexec (see below), executable
     is the name of an executable MPI program, and pgmargs are command line
     arguments for the executable. Multiple executables can be specified by
     using the colon notation (for MPMD - Multiple Program Multiple Data
     applications). For example, the following command will run the MPI
     program a.out on 4 processes:

         mpiexec -n 4 a.out

      The MPI standard specifies the following arguments and their meanings:

        - Specify the number of processes to use
        - Name of host on which to run processes
        - Pick hosts with this architecture type


Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance

Recently uploaded (20)

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf

ISBI MPI Tutorial

  • 2. Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
  • 3. Shared Memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]  Single- and multi-CPU desktops & laptops  Multi-threaded apps  GPGPU *  MPI *  Distributed Memory: memory available to a given node within a system is unique and distinct from its peers  MPI  Google MapReduce / Hadoop
  • 4. [Chart: relative performance of Copy, Scale, Add and Triad vs. # of processes (1 to 8)] CentOS 5.2; Dual Quad-Core 3GHz P4 [E5472]; DDR2 800MHz
  • 5. STREAM benchmark OpenMP performance [Chart: relative performance (0% to 400%) of Add, Copy, Scale and Triad vs. threads (0 to 16; 8 physical cores + HT)] 2x X5570 (2.93GHz; Quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3
  • 6. Bandwidth (FSB, HT, Nehalem, CUDA, …)  Frequently run into with high-level languages (MATLAB)  Capacity – cost & availability  High-density chips are $$$ (if even available)  Memory limits on individual systems  Distributed computing addresses both bandwidth and capacity with multiple systems  MPI is the glue used to connect multiple distributed processes together
  • 7. Custom iterative SENSE reconstruction  3 x 8 coils x 400 x 320 x 176 x 8 [complex float]  Profile data (img space)  Estimate (img<->k space)  Acquired data (k space)  > 4GB data touched during each iteration  16, 32 channel data here or on the way… Trzasko, Josh ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms” M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
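The "> 4GB" figure follows directly from the dimensions on this slide. The short C sketch below redoes the arithmetic. It assumes the trailing "x 8" is the size in bytes of one single-precision complex sample (two 4-byte floats) and that the leading 3 counts the three arrays listed (profile, estimate, acquired). The function name `recon_bytes` is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes touched per iteration:
   arrays x coils x nx x ny x nz x bytes per sample. */
uint64_t recon_bytes(uint64_t arrays, uint64_t coils,
                     uint64_t nx, uint64_t ny, uint64_t nz,
                     uint64_t bytes_per_sample)
{
    assert(bytes_per_sample > 0);
    return arrays * coils * nx * ny * nz * bytes_per_sample;
}
```

Calling `recon_bytes(3, 8, 400, 320, 176, 8)` gives 4,325,376,000 bytes, just over 4 GiB (4,294,967,296 bytes). Doubling or quadrupling the coil count for 16- or 32-channel data scales the total by the same factor.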
  • 8. FTx DATA Place view into correct x-Ky-Kz space (AP & LP) CAL FTyz (AP & LP) “Traditional” 2D SENSE Unfold (AP & LP) Homodyne Correction Pre-loaded data GW Correction (Y, Z) Real-time data GW Correction (X) MPI Communication MIP Root node Worker nodes Store / RESULT DICOM
  • 9. Root Node 1Gb Eth Site Intranet 3.6GHz P4 16GB RAM 1Gb Eth Worker Node (x7) 1Gb Eth 3.6GHz P4 16GB RAM 3.6GHz P4 1Gb Eth 80GB HDD 3.6GHz P4 2x8Gb IB 80GB HDD 500GB HDD 2x8Gb IB 16-Port Gigabit Ethernet Switch x7 File system connections 24-Port Infiniband Switch x7x2 MPI interconnects 16Gb/s bandwidth per node 8Gb/s Connection Key Cluster Hardware MRI System External Hardware 2x8Gig Infiniband connection 1Gig Ethernet connection
  • 10. Loosely coupled  SETI / BOINC  “Grid computing”  BIOS-level abstraction  ScaleMP  Tightly coupled  MPI  “Cluster computing”  Hybrid  Folding@Home 
  • 11. Worker Head Node Worker Worker Master Node Worker Worker Worker
  • 12. Host Host I OS OS I Process A Thread 1 Process A Thread 2 Host II ThreadN OS II Process B Process B Host N OS N Memory Transfers Process C Network Transfers
  • 13. Host Host I OS OS I Process A Thread 1 Process A Process D Thread 2 Host II ThreadN OS II Process B Process B Process E Host N OS N Memory Transfers Process C Process F Network Transfers
  • 14. Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
  • 15.  Message Passing Interface is…  “a library specification for message-passing” 1  Available in many implementations on multiple platforms *  A set of functions for moving messages between different processes without a shared memory environment  Low-level*; no concept of overall computing tasks to be performed [1]
  • 16. MPI-1  Version 1.0 draft standard 1994  Version 1.1 in 1995  Version 1.2 in 1997  Version 1.3 in 2008  MPI-2  Added: ▪ 1-sided communication ▪ Dynamic “world” sizes; spawn / join  Version 2.0 in 1997  Version 2.1 in 2008  MPI-3  In process  Enhanced fault handling  Forward compatibility preserved
  • 17. MPI is the de-facto standard for distributed computing  Freely available  Open source implementations exist  Portable  Mature  From a discussion of why MPI is dominant [1]:  […] 100s of languages have come and gone.  Good stuff must have been created [… yet] it is broadly accepted in the field that they’re not used.  MPI has a lock.  OpenMP is accepted, but a distant second.  There are substantial barriers to the introduction of new languages and language constructs.  Economic, ecosystem related, psychological, a catch-22 of widespread use, etc.  Any parallel language proposal must come equipped with reasons why it will overcome those barriers. [1]
  • 18. MPI itself is just a specification. We want an implementation  MPICH, MPICH2  Widely portable  MVAPICH, MVAPICH2  Infiniband-centric; MPICH/MPICH2 based  OpenMPI  Plug-in architecture; many run-time options  And more:  IntelMPI  HP-MPI  MPI for IBM Blue Gene  MPI for Cray  Microsoft MPI  MPI for SiCortex  MPI for Myrinet Express (MX)  MPICH2 over SCTP
  • 19. Without MPI:  Start all of the processes across bank of machines (shell scripting + ssh)  socket(), bind(), listen(), accept() or connect() each link  send(), read() on individual links  Raw byte interfaces; no discrete messages
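The last point deserves a concrete illustration: a socket delivers a stream of bytes with no message boundaries, so a hand-rolled solution has to add its own framing. The sketch below shows one common approach, a 4-byte length prefix. The names `frame_message` and `unframe_message` are hypothetical, and real code would also have to loop over partial reads and writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 4-byte big-endian length followed by the payload into out.
   Returns the number of bytes written, or 0 if out is too small. */
size_t frame_message(const void *payload, uint32_t len,
                     unsigned char *out, size_t out_cap)
{
    assert(out != NULL && (payload != NULL || len == 0));
    if (out_cap < 4 || out_cap - 4 < len)
        return 0;
    out[0] = (unsigned char)(len >> 24);
    out[1] = (unsigned char)(len >> 16);
    out[2] = (unsigned char)(len >> 8);
    out[3] = (unsigned char)len;
    if (len > 0)
        memcpy(out + 4, payload, len);
    return (size_t)len + 4;
}

/* Try to extract one complete message from the first `avail` bytes of a
   stream. On success, sets *payload and *len and returns the number of
   bytes consumed. Returns 0 if a complete message has not arrived yet. */
size_t unframe_message(const unsigned char *stream, size_t avail,
                       const unsigned char **payload, uint32_t *len)
{
    assert(stream != NULL && payload != NULL && len != NULL);
    if (avail < 4)
        return 0;
    uint32_t n = ((uint32_t)stream[0] << 24) | ((uint32_t)stream[1] << 16) |
                 ((uint32_t)stream[2] << 8)  |  (uint32_t)stream[3];
    if (avail - 4 < n)
        return 0;
    *payload = stream + 4;
    *len = n;
    return (size_t)n + 4;
}
```

The receiver has to keep calling `unframe_message` as bytes trickle in until a whole message is present. This is the kind of bookkeeping that a message-passing library does on the application's behalf.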
  • 20. With MPI  mpiexec -np <n> app  MPI_Init()  MPI_Send()  MPI_Recv()  MPI_Finalize()  MPI:  Manages the connections  Packages messages  Provides launching mechanism
  • 21. Provides definitions for:  Communication functions  MPI_Send()  MPI_Recv()  MPI_Bcast()  etc.  Datatype management functions  MPI_Type_create_hvector()  C, C++, and Fortran bindings  Also recommends process startup  mpiexec -np <nproc> <program> <args> [1]
  • 22. MPI_Abort MPI_Comm_remote_size MPI_File_read_ordered_end MPI_Group_rank MPI_Scatterv MPI_Unpublish_name MPI_Accumulate MPI_Comm_set_attr MPI_File_read_shared MPI_Group_size MPI_Send MPI_Wait MPI_Add_error_class MPI_Comm_set_errhandler MPI_File_seek MPI_Group_translate_ranks MPI_Send_init MPI_Waitall MPI_Add_error_code MPI_Comm_set_name MPI_File_seek_shared MPI_Group_union MPI_Sendrecv MPI_Waitany MPI_Add_error_string MPI_Comm_size MPI_File_set_atomicity MPI_Ibsend MPI_Sendrecv_replace MPI_Waitsome MPI_Address MPI_Comm_spawn MPI_File_set_errhandler MPI_Info_create MPI_Ssend MPI_Win_call_errhandler MPI_Allgather MPI_Comm_spawn_multiple MPI_File_set_info MPI_Info_delete MPI_Ssend_init MPI_Win_complete MPI_Allgatherv MPI_Comm_split MPI_File_set_size MPI_Info_dup MPI_Start MPI_Win_create MPI_Alloc_mem MPI_Comm_test_inter MPI_File_set_view MPI_Info_free MPI_Startall MPI_Win_create_errhandler MPI_Allreduce MPI_Dims_create MPI_File_sync MPI_Info_get MPI_Status_set_cancelled MPI_Win_create_keyval MPI_Alltoall MPI_Errhandler_create MPI_File_write MPI_Info_get_nkeys MPI_Status_set_elements MPI_Win_delete_attr MPI_Alltoallv MPI_Errhandler_free MPI_File_write_all MPI_Info_get_nthkey MPI_Test MPI_Win_fence MPI_Alltoallw MPI_Errhandler_get MPI_File_write_all_begin MPI_Info_get_valuelen MPI_Test_cancelled MPI_Win_free MPI_Attr_delete MPI_Errhandler_set MPI_File_write_all_end MPI_Info_set MPI_Testall MPI_Win_free_keyval MPI_Attr_get MPI_Error_class MPI_File_write_at MPI_Init MPI_Testany MPI_Win_get_attr MPI_Attr_put MPI_Error_string MPI_File_write_at_all MPI_Init_thread MPI_Testsome MPI_Win_get_errhandler MPI_Barrier MPI_Exscan MPI_File_write_at_all_begin MPI_Initialized MPI_Topo_test MPI_Win_get_group MPI_Bcast MPI_File_c2f MPI_File_write_at_all_end MPI_Intercomm_create MPI_Type_commit MPI_Win_get_name MPI_Bsend MPI_File_call_errhandler MPI_File_write_ordered MPI_Intercomm_merge MPI_Type_contiguous MPI_Win_lock MPI_Bsend_init MPI_File_close MPI_File_write_ordered_begin 
MPI_Iprobe MPI_Type_create_darray MPI_Win_post MPI_Buffer_attach MPI_File_create_errhandler MPI_File_write_ordered_end MPI_Irecv MPI_Type_create_hindexed MPI_Win_set_attr MPI_Buffer_detach MPI_File_delete MPI_File_write_shared MPI_Irsend MPI_Type_create_hvector MPI_Win_set_errhandler MPI_Cancel MPI_File_f2c MPI_Finalize MPI_Is_thread_main MPI_Type_create_indexed_block MPI_Win_set_name MPI_Cart_coords MPI_File_get_amode MPI_Finalized MPI_Isend MPI_Type_create_keyval MPI_Win_start MPI_Cart_create MPI_File_get_atomicity MPI_Free_mem MPI_Issend MPI_Type_create_resized MPI_Win_test MPI_Cart_get MPI_File_get_byte_offset MPI_Gather MPI_Keyval_create MPI_Type_create_struct MPI_Win_unlock MPI_Cart_map MPI_File_get_errhandler MPI_Gatherv MPI_Keyval_free MPI_Type_create_subarray MPI_Win_wait MPI_Cart_rank MPI_File_get_group MPI_Get MPI_Lookup_name MPI_Type_delete_attr MPI_Wtick MPI_Cart_shift MPI_File_get_info MPI_Get_address MPI_Op_create MPI_Type_dup MPI_Wtime MPI_Cart_sub MPI_File_get_position MPI_Get_count MPI_Op_free MPI_Type_extent MPI_Cartdim_get MPI_File_get_position_shared MPI_Get_elements MPI_Open_port MPI_Type_free MPI_Close_port MPI_File_get_size MPI_Get_processor_name MPI_Pack MPI_Type_free_keyval MPI_Comm_accept MPI_File_get_type_extent MPI_Get_version MPI_Pack_external MPI_Type_get_attr MPI_Comm_call_errhandler MPI_File_get_view MPI_Graph_create MPI_Pack_external_size MPI_Type_get_contents MPI_Comm_compare MPI_File_iread MPI_Graph_get MPI_Pack_size MPI_Type_get_envelope MPI_Comm_connect MPI_File_iread_at MPI_Graph_map MPI_Pcontrol MPI_Type_get_extent MPI_Comm_create MPI_File_iread_shared MPI_Graph_neighbors MPI_Probe MPI_Type_get_name MPI_Comm_create_errhandler MPI_File_iwrite MPI_Graph_neighbors_count MPI_Publish_name MPI_Type_get_true_extent MPI_Comm_create_keyval MPI_File_iwrite_at MPI_Graphdims_get MPI_Put MPI_Type_hindexed MPI_Comm_delete_attr MPI_File_iwrite_shared MPI_Grequest_complete MPI_Query_thread MPI_Type_hvector MPI_Comm_disconnect MPI_File_open 
MPI_Grequest_start MPI_Recv MPI_Type_indexed MPI_Comm_dup MPI_File_preallocate MPI_Group_compare MPI_Recv_init MPI_Type_lb MPI_Comm_free MPI_File_read MPI_Group_difference MPI_Reduce MPI_Type_match_size MPI_Comm_free_keyval MPI_File_read_all MPI_Group_excl MPI_Reduce_scatter MPI_Type_set_attr MPI_Comm_get_attr MPI_File_read_all_begin MPI_Group_free MPI_Register_datarep MPI_Type_set_name MPI_Comm_get_errhandler MPI_File_read_all_end MPI_Group_incl MPI_Request_free MPI_Type_size MPI_Comm_get_name MPI_File_read_at MPI_Group_intersection MPI_Request_get_status MPI_Type_struct MPI_Comm_get_parent MPI_File_read_at_all MPI_Group_range_excl MPI_Rsend MPI_Type_ub MPI_Comm_group MPI_File_read_at_all_begin MPI_Group_range_incl MPI_Rsend_init MPI_Type_vector MPI_Comm_join MPI_File_read_at_all_end MPI_Scan MPI_Unpack MPI_Comm_rank MPI_File_read_ordered MPI_Scatter MPI_Unpack_external MPI_Comm_remote_group MPI_File_read_ordered_begin
  • 23. Each process owns its data – there is no “our”  Makes many things simpler; no mutexes, condition variables, semaphores, etc; memory access order race conditions go away  Every message is an explicit copy  I have the memory I sent from, you have the memory you used to receive into  Even when running in a “shared memory” environment  Synchronization comes along for free  I won’t get your message (or data) until you choose to send it  Programming to MPI first can make it easier to scale-out later
  • 24. Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
  • 25. Download / decompress MPICH source: h2/  Supports: C / C++ / Fortran  Requires Python >= 2.2  ./configure  make install  installs into /usr/local by default, or use --prefix=<chosen path>  Make sure <prefix>/bin is in PATH  Make sure <prefix>/share/man is in MANPATH
  • 26. c compiler wrapper c++ compiler wrapper MPI job launcher MPD launcher
  • 27. Set up passwordless ssh to workers  Start the daemons with mpdboot -n <N>  Requires ~/.mpd.conf to exist on each host ▪ Contains: (same on each host) ▪ MPD_SECRETWORD=<some gibberish string> ▪ permissions set to 600 (r/w access for owner only)  Requires ./mpd.hosts to list other host names ▪ Unless run as mpdboot -n 1 (run on current host only) ▪ Will not accept current host in list (implicit)  Check for running daemons with mpdtrace For details:
  • 28.
  • 29. Use mpicc / mpicxx for C/C++ compiler  Wrapper script around C/C++ compilers detected during install ▪ $ mpicc --show gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt  $ mpicc -o hello hello.c  Use mpiexec -np <nproc> <app> <args> to launch  $ mpiexec -np 4 ./hello
  • 30. /* hello.c */
      #include <stdio.h>
      #include <mpi.h>
      int main (int argc, char * argv[]) {
        int i, rank, nodes;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes); /* total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's index */
        for (i=0; i < nodes; i++) {            /* one rank prints per iteration */
          MPI_Barrier(MPI_COMM_WORLD);
          if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
        }
        MPI_Finalize();
        return 0;
      }
      $ mpicc -o hello hello.c
      $ mpiexec -np 4 ./hello
      Hello from 0 of 4!
      Hello from 2 of 4!
      Hello from 1 of 4!
      Hello from 3 of 4!
  • 31. ./threaded_app main() Thread within threaded_app process pthread_create( func() ) func() Do work Do work Memory pthread_join() pthread_exit() exit()
  • 32. mpiexec -np 4 ./mpi_app mpd launches jobs mpi_app [rank 0] mpi_app [rank 1] mpi_app [rank 3] main() main() main() MPI_Init() MPI_Init() MPI_Init() MPI comm. MPI_Bcast() MPI_Bcast() MPI_Bcast() MPI comm. Do Work on local mem Do Work on local mem Do Work on local mem MPI_Allreduce() MPI_Allreduce() MPI_Allreduce() MPI comm. MPI_Finalize() MPI_Finalize() MPI_Finalize() MPI comm. exit() exit() exit()
  • 33. /* hello.c */
      #include <stdio.h>
      #include <mpi.h>
      int main (int argc, char * argv[]) {
        int i;
        int rank;
        int nodes;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i=0; i < nodes; i++) {
          MPI_Barrier(MPI_COMM_WORLD);
          if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
        }
        MPI_Finalize();
        return 0;
      }
  • 34. MPICH2 comes with mpe by default (unless disabled during configure)  Multiple tracing / logging options to track MPI traffic  Enabled through -mpe=<option> at compile time
      MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
      MacPro:code$ mpiexec -np 4 ./hello
      Hello from 0 of 4!
      Hello from 2 of 4!
      Hello from 1 of 4!
      Hello from 3 of 4!
      Writing logfile....
      Enabling the Default clock synchronization...
      Finished writing logfile ./hello.clog2.
  • 35.
  • 36. MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
      MacPro:code$ mpiexec -np 2 ./hello > trace
      MacPro:code$ grep 0 trace
      [0] Ending MPI_Init
      [0] Starting MPI_Comm_size...
      [0] Ending MPI_Comm_size
      [0] Starting MPI_Comm_rank...
      [0] Ending MPI_Comm_rank
      [0] Starting MPI_Barrier...
      [0] Ending MPI_Barrier
      Hello from 0 of 2!
      [0] Starting MPI_Barrier...
      [0] Ending MPI_Barrier
      [0] Starting MPI_Finalize...
      [0] Ending MPI_Finalize
      MacPro:code$ grep 1 trace
      [1] Ending MPI_Init
      [1] Starting MPI_Comm_size...
      [1] Ending MPI_Comm_size
      [1] Starting MPI_Comm_rank...
      [1] Ending MPI_Comm_rank
      [1] Starting MPI_Barrier...
      [1] Ending MPI_Barrier
      [1] Starting MPI_Barrier...
      [1] Ending MPI_Barrier
      Hello from 1 of 2!
      [1] Starting MPI_Finalize...
      [1] Ending MPI_Finalize
  • 37.
  • 38.
  • 39. int MPI_Send(
      void *buf,              memory location to send from
      int count,              number of elements (of type datatype) at buf
      MPI_Datatype datatype,  MPI_INT, MPI_FLOAT, etc., or custom datatypes (strided vectors, structures, etc.)
      int dest,               rank (within the communicator comm) of destination for this message
      int tag,                used to distinguish this message from other messages
      MPI_Comm comm )         communicator for this transfer; often MPI_COMM_WORLD
  • 40. int MPI_Recv(
      void *buf,              memory location to receive data into
      int count,              number of elements (of type datatype) available to receive into at buf
      MPI_Datatype datatype,  MPI_INT, MPI_FLOAT, etc., or custom datatypes (strided vectors, structures, etc.); typically matches the sending datatype, but doesn’t have to
      int source,             rank (within the communicator comm) of source for this message; can also be MPI_ANY_SOURCE
      int tag,                used to distinguish this message from other messages; can also be MPI_ANY_TAG
      MPI_Comm comm,          communicator for this transfer; often MPI_COMM_WORLD
      MPI_Status *status )    structure describing the received message, including: actual count (can be smaller than passed count); source (useful if used with source = MPI_ANY_SOURCE); tag (useful if used with tag = MPI_ANY_TAG)
  • 41. /* sr.c */
      #include <stdio.h>
      #include <mpi.h>
      #ifndef SENDSIZE
      #define SENDSIZE 1
      #endif
      int main (int argc, char * argv[] ) {
        int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
        MPI_Status sendStatus;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        myData[0] = rank;
        /* blocking send to the next rank, then blocking receive from the previous one */
        MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
        MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes -1 ) % nodes, 0, MPI_COMM_WORLD, &sendStatus);
        printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);
        MPI_Finalize();
        return 0;
      }
  • 42. $ mpicc -o sr sr.c
      $ mpiexec -np 2 ./sr
      0 sent 0; received 1
      1 sent 1; received 0
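The destination and source ranks in sr.c form a ring: each rank sends to the next rank and receives from the previous one, wrapping around at the ends. The helpers below isolate that arithmetic without MPI so it can be checked on its own. The names `ring_next` and `ring_prev` are illustrative. Adding `nodes` before subtracting keeps the left operand of `%` non-negative, which matters because in C the result of `%` takes the sign of the dividend.

```c
#include <assert.h>

/* Rank that `rank` sends to in a ring of `nodes` processes. */
int ring_next(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + 1) % nodes;
}

/* Rank that `rank` receives from in the same ring. */
int ring_prev(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + nodes - 1) % nodes;
}
```

With two processes each rank's neighbor in both directions is the other rank, which is why rank 0 receives 1 and rank 1 receives 0 in the run above.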
  • 43.
  • 44. $ mpicc -o sr sr.c
      $ mpiexec -np 2 ./sr
      0 sent 0; received 1
      1 sent 1; received 0
      $ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
      $ mpiexec -np 2 ./sr
      0 sent 0; received 1
      1 sent 1; received 0
      $ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
      $ mpiexec -np 2 ./sr
      ^C
      $ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
      $ mpiexec -np 2 ./sr
      0 sent 0; received 1
      1 sent 1; received 0
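One detail of the last run is worth checking. `SENDSIZE` is substituted textually, and in C binary `-` binds more tightly than `<<`, so `0x1<<14 - 1` evaluates as `0x1 << 13` rather than `(0x1<<14) - 1`. Assuming the macro is used exactly as in sr.c, the final run therefore sends the same 8192 ints as the `0x1<<13` run, not 16383. The sketch below uses illustrative macro and function names to show the difference. Some compilers warn about the unparenthesized form, which is the point of the example.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE_AS_ON_SLIDE   0x1<<14 - 1      /* parses as 0x1 << (14 - 1) */
#define SIZE_PARENTHESIZED ((0x1<<14) - 1)  /* explicit grouping */

/* Element count that the unparenthesized macro expands to. */
int size_as_on_slide(void)
{
    return SIZE_AS_ON_SLIDE;
}

/* Element count with explicit parentheses. */
int size_parenthesized(void)
{
    return SIZE_PARENTHESIZED;
}

/* Payload size in bytes for `count` elements of `elem_size` bytes each. */
size_t payload_bytes(int count, size_t elem_size)
{
    assert(count >= 0);
    return (size_t)count * elem_size;
}
```

With 4-byte ints, 8192 elements is 32 KiB and `0x1<<14` elements is 64 KiB. The runs above therefore show only that the blocking-send version stops completing somewhere above 32 KiB and at or below 64 KiB on that installation. The exact threshold depends on the MPI implementation and its configuration.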
  • 45. 3.4 Communication Modes The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer. Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol. The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver. Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.
  • 46. Process 1 Process 2 Send “small” message & return Eager send Eager recv Send “large” Request & receive message Receive small message Rndv. req. Rndv. req. Blocks until Match Rndv. Request large completion. Rndv. send message req. Receive Receive large Rndv. data message User activity MPI activity
  • 47. MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)  Sends are “local” – they return independent of any remote activity  Message buffer can be touched immediately after call returns  Requires a user-provided buffer, provided via MPI_Buffer_attach()  Forces an “eager”-like message transfer from sender’s perspective  User can wait for completion by calling MPI_Buffer_detach()  MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)  Won’t return until matching receive is posted  Forces a “rendezvous”-like message transfer  Can be used to guarantee synchronization without additional MPI_Barrier() calls  MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)  Erroneous if matching receive has not been posted  Performance tweak (on some systems) when user can guarantee matching receive is posted  MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)  Non-blocking, immediate return once send/receive request is posted  Requires MPI_[Test|Wait][|all|any|some] call to guarantee completion  Send/receive buffers should not be touched until completed  MPI_Request * argument used for eventual completion  The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
  • 48. /* sr2.c */
      #include <stdio.h>
      #include <mpi.h>
      #ifndef SENDSIZE
      #define SENDSIZE 1
      #endif
      int main (int argc, char * argv[] ) {
        int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
        MPI_Status xferStatus[2];
        MPI_Request xferRequest[2];
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        myData[0] = rank;
        /* post both transfers without blocking, then wait for both to complete */
        MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[0]);
        MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes -1) % nodes, 0, MPI_COMM_WORLD, &xferRequest[1]);
        MPI_Waitall(2,xferRequest,xferStatus);
        printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);
        MPI_Finalize(); /* closing lines are cut off in the transcript; restored to match sr.c */
        return 0;
      }
  • 49. $ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
      $ mpiexec -np 4 ./sr2
      0 sent 0; received 3
      2 sent 2; received 1
      1 sent 1; received 0
      3 sent 3; received 2
  • 50. Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
  • 51. Task parallelism  Each process handles a unique kind of task ▪ Example: multi-image uploader (with resize/recompress) ▪ Thread 1: GUI / user interaction ▪ Thread 2: file reader & decompression ▪ Thread 3: resize & recompression ▪ Thread 4: network communication  Can be used in a grid with a pipeline of separable tasks to be performed on each data set ▪ Resample / warp volume ▪ Segment volume ▪ Calculate metrics on segmented volume
  • 52. Data parallelism  Each process handles a portion of the entire data  Often used with large data sets ▪ [task 0… | … task 1 … | … | … task n]  Frequently used in MPI programming  Each process is “doing the same thing,” just on a different subset of the whole
  • 53. Layout is crucial in high-performance computing  BW efficiency; cache efficiency  Even more important in distributed  Poor layout -> extra communication  Shown is an example of “block” data distribution  x is contiguous dimension  z is slowest dimension  Each node has contiguous portion of z [Diagram: volume with axes x, y, z, divided along z into slabs for Node 0 through Node 7]
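A block distribution like the one pictured can be computed locally by every rank from nothing more than its own rank and the total number of ranks. The sketch below is one common convention, not necessarily the one used in the author's code. It splits `n` slices along z over `nodes` ranks and gives the first `n % nodes` ranks one extra slice, so counts differ by at most one. The struct and function names are illustrative.

```c
#include <assert.h>

/* Half-open range [start, start + count) of z-slices owned by one rank. */
struct slab {
    int start;
    int count;
};

/* Contiguous block of the n slices assigned to `rank` out of `nodes` ranks. */
struct slab block_range(int n, int nodes, int rank)
{
    assert(n >= 0 && nodes > 0 && rank >= 0 && rank < nodes);
    int base  = n / nodes;   /* slices that every rank gets */
    int extra = n % nodes;   /* leftovers, one each to the lowest ranks */
    struct slab s;
    s.count = base + (rank < extra ? 1 : 0);
    s.start = rank * base + (rank < extra ? rank : extra);
    return s;
}
```

For example, 176 slices over 8 ranks gives every rank 22 consecutive slices. Because z is the slowest-varying dimension, each such slab is a single contiguous region of memory.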
  • 54. FTx DATA Place view into correct x-Ky-Kz space (AP & LP) CAL FTyz (AP & LP) “Traditional” 2D SENSE Unfold (AP & LP) Homodyne Correction Pre-loaded data GW Correction (Y, Z) Real-time data GW Correction (X) MPI Communication MIP Root node Worker nodes Display / RESULT DICOM
  • 55. Completely separable problems:  Add 1 to everyone  Multiply each a[i] * b[i]  Inseparable problems: [?]  Max of a vector  Sort a vector  MIP of a volume  1D FFT of a volume  2D FFT of a volume  3D FFT of a volume [Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
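Problems such as the maximum of a vector cannot be finished by each process alone, but they do split into a local step followed by a small combine step. The serial sketch below imitates that pattern without MPI. Each of `nodes` simulated ranks takes the maximum of its own block, and the partial results are then combined. In a real program that combine step is the role of a collective such as `MPI_Reduce` with `MPI_MAX`. The function names and the block split are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum of a[0..n-1]; n must be at least 1. */
float local_max(const float *a, size_t n)
{
    assert(a != NULL && n > 0);
    float m = a[0];
    for (size_t i = 1; i < n; i++)
        if (a[i] > m)
            m = a[i];
    return m;
}

/* Two-stage maximum: per-block maxima first, then a max over those.
   Requires 1 <= nodes <= n so that every simulated rank owns data. */
float blocked_max(const float *a, size_t n, size_t nodes)
{
    assert(a != NULL && nodes > 0 && nodes <= n);
    float result = 0.0f;
    for (size_t r = 0; r < nodes; r++) {
        size_t start = r * n / nodes;        /* block boundaries */
        size_t end   = (r + 1) * n / nodes;
        float partial = local_max(a + start, end - start); /* local step */
        if (r == 0 || partial > result)      /* combine step */
            result = partial;
    }
    return result;
}
```

Only one value per rank crosses the "network" in the combine step, however large the local blocks are. That is why this kind of problem still parallelizes well even though it is not completely separable.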
  • 56.
  • 57. Dynamic datatypes  MPI_Type_vector()  Enables communication of sub-sets without packing  Combined with DMA, permits zero-copy transposes, etc.  Other collectives  MPI_Reduce  MPI_Scatter  MPI_Gather  MPI-2 (MPICH2, MVAPICH2)  One-sided (DMA) communication ▪ MPI_Put() ▪ MPI_Get()  Dynamic world size ▪ Ability to spawn new processes during run
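A datatype built with `MPI_Type_vector()` describes `count` blocks of `blocklength` elements whose starting points are `stride` elements apart. The plain-C sketch below spells out that layout by doing the packing by hand, which is the copy a derived datatype lets the library perform, or avoid, on the caller's behalf. `gather_strided` is a hypothetical helper, not an MPI call.

```c
#include <assert.h>
#include <stddef.h>

/* Copy `count` blocks of `blocklength` ints, whose starts are `stride`
   ints apart in src, into the contiguous buffer dst.
   Returns the number of ints copied. */
size_t gather_strided(const int *src, size_t count, size_t blocklength,
                      size_t stride, int *dst)
{
    assert(src != NULL && dst != NULL);
    size_t k = 0;
    for (size_t b = 0; b < count; b++)
        for (size_t j = 0; j < blocklength; j++)
            dst[k++] = src[b * stride + j];
    return k;
}
```

Extracting one column of a 3 x 4 row-major matrix corresponds to count = 3, blocklength = 1, stride = 4, starting at the first element of that column. A 2 x 2 corner of the same matrix is count = 2, blocklength = 2, stride = 4.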
  • 58. Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
  • 59. Take time on the algorithm & data layout  Minimize traffic between nodes / separate problem ▪ FTx into xKyKz in SENSE example  Cache-friendly (linear, efficient) access patterns  Overlap processing and communication  MPI_Isend() / MPI_Irecv() with multiple work buffers  While actively transferring one, process the other  Larger messages will hit a higher BW (in general)
  • 60. Profile  Vtune (Intel; Linux / Windows)  Shark (Mac)  MPI profiling with -mpe=mpilog  Avoid “premature optimization” (Knuth)  Implementation time & effort vs. runtime performance  Use derived datatypes rather than packing  Using a debugger with MPI is hard  Build in your own debugging messages from go
  • 61. If you might need MPI, build to MPI.  Works well in shared memory environments ▪ It’s getting better all the time  Encourages memory locality in NUMA architectures ▪ Nehalem, AMD  Portable, reusable, open-source  Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL “Hybrid model of parallel programming”  Messaging paradigm can create “less obfuscated” code than threads / OpenMP
  • 62. Homogeneous nodes  Private network  Shared filesystem; ssh communication  Password-less SSH  High-bandwidth private interconnect  MPI communication exclusively  GbE, 10GbE  Infiniband  Consider using Rocks  CentOS / RHEL based  Built for building clusters  Rapid network boot based install/reinstall of nodes 
  • 63. MPI documents  MPICH2  OpenMPI  MVAPICH[1|2] (Infiniband-tuned distribution)  Rocks  Books:  Pacheco, Peter S., Parallel Programming with MPI  Karniadakis, George E., Parallel Scientific Computing in C++ and MPI  Gropp, W., Using MPI-2
  • 64.
  • 65.
  • 66. This is the painting operation for one RGBA pixel (in) onto another (out)  We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results.  Post-multiply masks are applied in the shifted position to minimize the number of shift operations  Note: we’re using pre-multiplied colors & painting onto an opaque background
      #define RB      0x00FF00FFu
      #define RB_8OFF 0xFF00FF00u
      #define RGB     0x00FFFFFFu
      #define G       0x0000FF00u
      #define G_8OFF  0x00FF0000u
      #define A       0xFF000000u
      inline void blendPreToStatic(const uint32_t& in, uint32_t& out)
      {
        uint32_t alpha = in >> 24;
        if (alpha & 0x00000080u) ++alpha;
        out = A | RGB & (in + ( ( (alpha * (out & RB) & RB_8OFF) |
                                  (alpha * (out & G) & G_8OFF) ) >> 8 ) );
      }
  • 67. OUT = A | RGB& (IN + ( ( (ALPHA * (OUT &RB) &RB_8OFF) | (ALPHA * (OUT &G) &G_8OFF) ) >>8 ) );
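A worked instance of the expression above may help when checking the masks by hand. The following is a minimal C adaptation of the slide's `blendPreToStatic()`, under two assumptions that are not in the original: it returns the blended pixel instead of writing through a C++ reference, and the name `blend_pre_to_static` is made up for this sketch. Extra parentheses make the operator precedence explicit but do not change the result.

```c
#include <stdint.h>

#define RB      0x00FF00FFu  /* red and blue channels                 */
#define RB_8OFF 0xFF00FF00u  /* red and blue after multiply (<< 8)    */
#define RGB     0x00FFFFFFu  /* all colour channels                   */
#define G       0x0000FF00u  /* green channel                         */
#define G_8OFF  0x00FF0000u  /* green after multiply (<< 8)           */
#define A       0xFF000000u  /* opaque alpha for the output           */

/* Hypothetical C variant of the slide's function: returns the new
 * value of `out` rather than modifying it in place. */
static uint32_t blend_pre_to_static(uint32_t in, uint32_t out)
{
    uint32_t alpha = in >> 24;          /* top byte weights the background   */
    if (alpha & 0x00000080u) ++alpha;   /* 255 becomes 256: full weight      */
    return A | (RGB & (in + ((((alpha * (out & RB)) & RB_8OFF) |
                              ((alpha * (out & G))  & G_8OFF)) >> 8)));
}
```

With a top byte of 0x40 (weight 64/256), `in = 0x40102030` painted onto `out = 0xFF808080` keeps a quarter of each background channel (0x20) and adds the pre-multiplied foreground, giving 0xFF304050.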
  • 68. For cases where there is no overlap between the four output pixels for four input pixels, we can use vectorized (SSE2) code  128-bit wide registers; load four 32-bit RGBA values, use the same approach as previously (R|B and G) in two registers to perform four paints at once
  • 69. inline void blend4PreToStatic(uint32_t ** in, uint32_t * out) // Paints in (quad-word) onto out
      {
        __m128i rb, g, a, a_, o, mask_reg; // Registers
        rb = _mm_loadu_si128((__m128i *) out); // Load destination (unaligned -- may not be on a 128-bit boundary)
        a_ = _mm_load_si128((__m128i *) *in); // We make sure the input is on a 128-bit boundary before this call
        *in += 4;
        _mm_prefetch((char*) (*in + 28), _MM_HINT_T0); // Fetch the two-cache-line-out memory
        mask_reg = _mm_set1_epi32(0x0000FF00); // Set green mask (x4)
        g = _mm_and_si128(rb, mask_reg); // Mask to greens (x4)
        mask_reg = _mm_set1_epi32(0x00FF00FF); // Set red and blue mask (x4)
        rb = _mm_and_si128(rb, mask_reg); // Mask to red and blue
        rb = _mm_slli_epi32(rb, 8); // << 8 ; g is already biased by 256 in 16-bit spacing
        a = _mm_srli_epi32(a_, 24); // >> 24 ; These are the four alpha values, shifted to lower 8 bits of each word
        mask_reg = _mm_slli_epi32(a, 16); // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
        a = _mm_or_si128(a, mask_reg); // We now have the alpha value at both bits [0-7] and [16-23] of each word
        // These steps add one to transparency values >= 80
        o = _mm_srli_epi16(a, 7); // Now the high bit is the low bit
  • 70.   // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
        // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
        // storing the upper 16 of the 32 bit result. (This is the operation that is available, so that's why we're
        // doing it in this fashion!)
        rb = _mm_mulhi_epu16(rb, a);
        g = _mm_mulhi_epu16(g, a);
        g = _mm_slli_epi32(g, 8); // Move green into the correct location.
        // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
        o = _mm_set1_epi32(0xFF000000); // Opaque alpha value
        o = _mm_or_si128(o, g);
        o = _mm_or_si128(o, rb); // o now has the background's contribution to the output color
        mask_reg = _mm_set1_epi32(0x00FFFFFF);
        g = _mm_and_si128(mask_reg, a_); // Removes alpha from foreground color
        o = _mm_add_epi32(o, g); // Add foreground and background contributions together
        _mm_storeu_si128((__m128i *) out, o); // Unaligned store
      }
  • 71. Vectorizing this code achieves 3-4x speedup on cluster  8x 2x(3.4|3.2GHz) Xeon, 800MHz FSB  Render 512x512x409 (400MB) volume in ▪ ~22ms (45fps) (SIMD code) ▪ ~92ms (11fps) (Non-vectorized)  ~18GB/s memory throughput  ~11 cycles / voxel vs. ~45 cycles non-vectorized
  • 72.
  • 73. MPI_Init(3) MPI MPI_Init(3) NAME MPI_Init - Initialize the MPI execution environment SYNOPSIS int MPI_Init( int *argc, char ***argv ) INPUT PARAMETERS argc - Pointer to the number of arguments argv - Pointer to the argument vector THREAD AND SIGNAL SAFETY This routine must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize. NOTES The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
  • 74. MPI_Barrier(3) MPI MPI_Barrier(3) NAME MPI_Barrier - Blocks until all processes in the communicator have reached this routine. SYNOPSIS int MPI_Barrier( MPI_Comm comm ) INPUT PARAMETER comm - communicator (handle) NOTES Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.
  • 75. MPI_Finalize(3) MPI MPI_Finalize(3) NAME MPI_Finalize - Terminates MPI execution environment SYNOPSIS int MPI_Finalize( void ) NOTES All processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to perform much more than a return rc after calling MPI_Finalize.
  • 76. MPI_Comm_size(3) MPI MPI_Comm_size(3) NAME MPI_Comm_size - Determines the size of the group associated with a communicator SYNOPSIS int MPI_Comm_size( MPI_Comm comm, int *size ) INPUT PARAMETER comm - communicator (handle) OUTPUT PARAMETER size - number of processes in the group of comm (integer)
  • 77. MPI_Comm_rank(3) MPI MPI_Comm_rank(3) NAME MPI_Comm_rank - Determines the rank of the calling process in the communicator SYNOPSIS int MPI_Comm_rank( MPI_Comm comm, int *rank ) INPUT ARGUMENT comm - communicator (handle) OUTPUT ARGUMENT rank - rank of the calling process in the group of comm (integer)
  • 78. MPI_Send(3) MPI MPI_Send(3) NAME MPI_Send - Performs a blocking send SYNOPSIS int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) INPUT PARAMETERS buf - initial address of send buffer (choice) count - number of elements in send buffer (nonnegative integer) datatype - datatype of each send buffer element (handle) dest - rank of destination (integer) tag - message tag (integer) comm - communicator (handle) NOTES This routine may block until the message is received by the destination process.
  • 79. MPI_Recv(3) MPI MPI_Recv(3) NAME MPI_Recv - Blocking receive for a message SYNOPSIS int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) OUTPUT PARAMETERS buf - initial address of receive buffer (choice) status - status object (Status) INPUT PARAMETERS count - maximum number of elements in receive buffer (integer) datatype - datatype of each receive buffer element (handle) source - rank of source (integer) tag - message tag (integer) comm - communicator (handle) NOTES The count argument indicates the maximum length of a message; the actual length of the message can be determined with MPI_Get_count.
  • 80. MPI_Isend(3) MPI MPI_Isend(3) NAME MPI_Isend - Begins a nonblocking send SYNOPSIS int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) INPUT PARAMETERS buf - initial address of send buffer (choice) count - number of elements in send buffer (integer) datatype - datatype of each send buffer element (handle) dest - rank of destination (integer) tag - message tag (integer) comm - communicator (handle) OUTPUT PARAMETER request - communication request (handle)
  • 81. MPI_Irecv(3) MPI MPI_Irecv(3) NAME MPI_Irecv - Begins a nonblocking receive SYNOPSIS int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) INPUT PARAMETERS buf - initial address of receive buffer (choice) count - number of elements in receive buffer (integer) datatype - datatype of each receive buffer element (handle) source - rank of source (integer) tag - message tag (integer) comm - communicator (handle) OUTPUT PARAMETER request - communication request (handle)
  • 82. MPI_Bcast(3) MPI MPI_Bcast(3) NAME MPI_Bcast - Broadcasts a message from the process with rank "root" to all other processes of the communicator SYNOPSIS int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm ) INPUT/OUTPUT PARAMETER buffer - starting address of buffer (choice) INPUT PARAMETERS count - number of entries in buffer (integer) datatype - data type of buffer (handle) root - rank of broadcast root (integer) comm - communicator (handle)
  • 83. MPI_Allreduce(3) MPI MPI_Allreduce(3) NAME MPI_Allreduce - Combines values from all processes and distributes the result back to all processes SYNOPSIS int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm ) INPUT PARAMETERS sendbuf - starting address of send buffer (choice) count - number of elements in send buffer (integer) datatype - data type of elements of send buffer (handle) op - operation (handle) comm - communicator (handle) OUTPUT PARAMETER recvbuf - starting address of receive buffer (choice)
  • 84. MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3) NAME MPI_Type_create_hvector - Create a datatype with a constant stride given in bytes SYNOPSIS int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype) INPUT PARAMETERS count - number of blocks (nonnegative integer) blocklength - number of elements in each block (nonnegative integer) stride - number of bytes between start of each block (address integer) oldtype - old datatype (handle) OUTPUT PARAMETER newtype - new datatype (handle)
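To make the (count, blocklength, stride-in-bytes) triple concrete, here is a minimal serial sketch in plain C that copies exactly the elements such a datatype would describe. It is not MPI code: `gather_strided` and its `elem_size` parameter are hypothetical names introduced for illustration, and the explicit copy is the packing step that a derived datatype is meant to let you avoid.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `count` blocks, each `blocklength` elements of
 * `elem_size` bytes, whose starts are `stride_bytes` apart in `src`, into
 * the contiguous buffer `dst`. This mirrors the layout that count,
 * blocklength and a byte stride describe; it does not call MPI. */
static void gather_strided(void *dst, const void *src,
                           size_t count, size_t blocklength,
                           size_t stride_bytes, size_t elem_size)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t block_bytes = blocklength * elem_size;

    for (size_t i = 0; i < count; ++i) {
        /* Block i starts i strides into the source and lands
         * immediately after block i-1 in the destination. */
        memcpy(d + i * block_bytes, s + i * stride_bytes, block_bytes);
    }
}
```

For a row-major `int m[3][4]`, one column is three blocks of one element with a stride of `4 * sizeof(int)` bytes, so `gather_strided(col, &m[0][1], 3, 1, 4 * sizeof(int), sizeof(int))` extracts column 1.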
  • 85. mpicc(1) MPI mpicc(1) NAME mpicc - Compiles and links MPI programs written in C DESCRIPTION This command can be used to compile and link MPI programs written in C. It provides the options and any special libraries that are needed to compile and link MPI programs. It is important to use this command, particularly when linking programs, as it provides the necessary libraries. COMMAND LINE ARGUMENTS -show - Show the commands that would be used without running them -help - Give short help -cc=name - Use compiler name instead of the default choice. Use this only if the compiler is compatible with the MPICH library (see below) -config=name - Load a configuration file for a particular compiler. This allows a single mpicc command to be used with multiple compilers. […]
  • 86. mpiexec(1) MPI mpiexec(1) NAME mpiexec - Run an MPI program SYNOPSIS mpiexec args executable pgmargs [ : args executable pgmargs ... ] where args are command line arguments for mpiexec (see below), executable is the name of an executable MPI program, and pgmargs are command line arguments for the executable. Multiple executables can be specified by using the colon notation (for MPMD - Multiple Program Multiple Data applications). For example, the following command will run the MPI program a.out on 4 processes: mpiexec -n 4 a.out The MPI standard specifies the following arguments and their meanings: -n <np> - Specify the number of processes to use -host <hostname> - Name of host on which to run processes -arch <architecture name> - Pick hosts with this architecture type […]

Editor's Notes

  1. NUMA: NUMA is a distinction within shared memory systems, e.g. AMD HyperTransport or Intel QPI vs. Northbridge w/ FSB. GPGPU: Sort of; xfers into and out of GPU memory are from the main shared system memory; xfers within GPU memory by GPU kernels are shared memory within their own private (GPU) memory space. Distributed systems: comprised of multiple nodes. Each node typically == individual “computer”. MPI can be used on shared memory systems; modern implementations use the fastest xfer mechanism between each set of peers.
  2. Some scale better. CPUs keep getting faster, either through GHz or # of cores; memory BW has not kept up. STREAM benchmark:
     ------------------------------------------------------------------
     name        kernel                  bytes/iter      FLOPS/iter
     ------------------------------------------------------------------
     COPY:       a(i) = b(i)                 16              0
     SCALE:      a(i) = q*b(i)               16              1
     SUM:        a(i) = b(i) + c(i)          24              1
     TRIAD:      a(i) = b(i) + q*c(i)        24              2
     ------------------------------------------------------------------
     8 million element double-precision arrays; ~64MB arrays; ICC 10; -xP. CPU manufacturers are focused on improving this, and have really sped things up with Nehalem;… what about Nehalem?
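The bytes/iter column follows directly from the kernel: TRIAD reads `b(i)` and `c(i)` and writes `a(i)`, three 8-byte doubles per iteration. A minimal C sketch of that kernel and its traffic estimate is below; it is not the STREAM source, the function names are made up for this note, and it assumes 8-byte `double` with no write-allocate traffic counted.

```c
#include <stddef.h>

/* TRIAD kernel as listed in the table: a(i) = b(i) + q*c(i).
 * Two floating-point operations (one multiply, one add) per element. */
static void stream_triad(double *a, const double *b, const double *c,
                         double q, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + q * c[i];
}

/* Nominal bytes moved by TRIAD over n elements: two loads and one
 * store of sizeof(double) each, i.e. 24 bytes/iter for 8-byte doubles. */
static size_t triad_bytes(size_t n)
{
    return 3 * sizeof(double) * n;
}
```

With only 2 FLOPS for every 24 bytes, the loop is limited by memory bandwidth long before the arithmetic units are busy, which is the scaling wall the benchmark plots show.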
  3. Examples of tasks that hit BW walls: highly tuned inner loops (few ops per element; running over a large volume); masking operations (multiply each element from one volume by a mask in another volume); max / min / mean / std operations; MIPs. Still an issue on new systems; likely to continue to be an issue. Nehalem is NUMA as well; another layer of complexity -> can control somewhat via binding (numactl; through task manager in Windows). This is not to say the 8 processors are useless; on programs where the inner loop operation does more work, the scaling can be close to ideal, e.g. sin(x).
  4. Front side bus, QuickPath Interconnect, HyperTransport. High-level languages: need to finish one operation (A += B) before doing the next operation (A = A*A). MPI is the de facto standard for parallel programs on distributed memory systems, from Blue Gene to off-the-shelf Linux clusters. 1GB 1333 DDR3: $95 ($800); 2GB 1333 DDR3: $155 ($620); 4GB 1333 DDR3: $322 ($644); 8GB 1333 DDR3 chips: $3410 ($3410). Nehalem again makes this more confusing; memory bus clock changes based on # of modules… Also one of the key points that CUDA is focused on; 3 of the 8 called-out improvements in the latest rev focus on efficient / improved memory BW usage.
  5. Needs for large data sets in image processing are real and here now.
  6. 2s for 400x320x220 R=8 NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set). (This is not the iterative reconstruction.)
  7. *MY* taxonomy. SETI@home 1999 - 2005; now part of BOINC; 1.7PFlops > 1.4PFlops (RoadRunner). Grid: jobs ~independent and asynchronous; Hadoop / MapReduce; cycle stealing. ScaleMP: up to 32 processors (128 cores) and 4TB shared memory. Cluster computing: distributed “process” starts on multiple machines concurrently; typically cookie-cutter (although support for different architectures is possible in MPI); significant communication between nodes during processing; massive simulations; applications sensitive to timings. Folding@Home: loosely coupled collective (grid), tightly coupled within client (MPI); also Grid+GPU; 4.6PFlops.
  8. More taxonomy. Grid: loosely connected; nodes “unaware” of other nodes; works great for “batch” problems; different architectures, different implementations (CPU, GPU, … PS3 and Nvidia clients for Folding@Home); wildly varying performance between nodes “easily” accommodated; fail-over almost “automatic”; Sun Grid Engine; MapReduce / Hadoop; can be a cycle-stealing background process. Cluster: tightly connected; nodes in tight communication with each other; failures are hard to handle (intermediate results often saved; MPI-2); usually homogeneous nodes; varying performance can cause severe performance loss if not accounted for carefully; MPI (SGE / other schedulers). We will be focusing on clusters; this is where MPI is used.
  9. Network transfers (even on fast networks) are expensive compared to memory transactions
  10. Number of bi-directional links for N nodes = (N-1)*(N)/2: 15 for 6 nodes; 28 for 8; ~ N^2 / 2. Managing this yourself is complicated and time-consuming! >>> This is what MPI simplifies for us. So what is MPI?
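The link count is simply the number of unordered pairs of nodes. A one-function C sketch (the name `link_count` is made up for this note) reproduces the figures quoted above:

```c
#include <stddef.h>

/* Number of bi-directional links needed to connect every pair of
 * n nodes directly: n choose 2 = n * (n - 1) / 2. */
static size_t link_count(size_t n)
{
    return n * (n - 1) / 2;
}
```

`link_count(6)` is 15 and `link_count(8)` is 28; doubling the node count roughly quadruples the number of pairings to manage, which is the N^2 / 2 growth in the note.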
  11. ANL = Argonne National Laboratory. * Although available on many platforms, it has a unix heritage, and is most natural to use on unix-y (mac, linux, sun) environments. (OpenMPI ships standard on macs w/ Leopard.) Low-level: there are some functions that operate on the data-type (Reduce operations), but most “just” shuffle bytes around.
  12. MPI is everywhere in high-performance computing, but why? >>> So what does MPI do for you? Why should you use it? Look at the complexity of setting up a distributed system again.
  13. Can also provide profiling; MPI can use different communication for different sets of peers (e.g. SMP, Infiniband, TCP/IP). You could (almost) write any MPI program with these 4 calls; much different from pthreads w/ mutexes, OpenMP, GPU, etc.; communication provides synchronization by its nature; no dealing with “locks” on “shared” variables, etc. BUT: need to be sure each node is initializing variables correctly…
  14. Getting back to what MPI is a little more… Even though most MPI programs could be written with just a few MPI commands, there are quite a few available…
  15. Linux / mac instructions; Leopard already has openmpi installed, in /usr/bin/mpi[cc|cxx|run]. Not familiar with the windows version; see windows portion of andc++ if desired. Supports shared mem and tcp channels.
  16. MPD = multiprocessing daemon; used to start one daemon per host; these daemons are used to start the actual jobs. Talk a little more about MPD.
  17. MPD = multiprocessing daemon; launched (and left running) on each node that wants to be ready to participate in an MPI execution. Other options (mpirun) exist, but mpd is fast for starting new jobs (as opposed to new ssh sessions created each time a job is run).
  18. MPD = multiprocessing daemon
  19. MPI_Init() must be called in every program that will use MPI calls. Caveat: printing to stdout (stderr) from different nodes works, but it is not guaranteed to be synchronized. On click: note 2 printed before 1. (Even though 2 occurred after 1 as enforced by MPI_Barrier(); fflush() does not fix… send all IO to one process for printout.) Now is a good time to discuss what actually happens when an MPI parallel job is run. (In contrast to a threaded job.)
  20. I want to look a little more into what actually happens when a parallel MPI program runs. Let’s start by looking at how a parallel threaded app runs. Threads are spawned at runtime as requested by the program. Multiple threads may be spawned and joined over the course of a program. Each thread has access to memory to do its work (whatever it may be). main() is only entered and exited once.
  21. Multiprocessing daemons already running; know about each other. NOT SHOWING RANK 2. Each rank is a full program; starts in main; exits from main. MPI_Bcast() / MPI_Allreduce() included here as a way to show communication between nodes. Program logic during execution determines who does what.
  22. Only the portions highlighted are different between the nodes; however, every line is executed (the full program) on each node; tests are performed to select different code at run time to run on each node. This is different from threaded apps, where common (global) code sections (initializations, etc.) are really only run once. As long as init << parallel work, not a big performance issue. (Doesn’t have to be done this way, but this is the typical way; different executables can be run as different processes, if desired.)
  23. MPE is useful for understanding what MPI is doing
  24. Black sections between barriers are the printf calls ~20usec each
  25. More of a debug tool
  26. Transpose; ~88MB data set in 70 ms (1.2GB/s)
  27. 320x176x320 -> 384x256x384 ~ 0.5 sec (14GFlops just for FFTs). Light pink = fftw library!!! Custom labels in profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
  28. Let’s get back to MPI programming by examining the two basic building blocks for any MPI program: MPI_Send & MPI_Recv. You can make communicators that include only a subset of the active nodes; useful for doing “broadcasts” within a subset, etc. Tag can be used to separate classes of messages, etc.; up to the user. Can be used with zero-length messages to communicate something via the tag alone, e.g. “Ready” or “complete”.
  29. It’s important to note that the types don’t have to be exactly the same. (A strided vector could be received / sent from a contiguous vector.)
  30. Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
  31. Threshold is 16kB
  32. We can see that sends can complete before the matching receives are posted, but not vice-versa. (Timing enforced by message passing; no mutexes required!)
  33. Threshold is 16kB
  34. “Small” messages get sent into pre-allocated (within the MPI library) buffers; allows sender to return quicker; less traffic; etc.: “Eager”. “Large” messages get sent only once the receiver has posted the receive request (with the receive buffer): “Rendezvous”.
  35. Most of these also have _init modes to create a persistent request that can be started with MPI_Start[all]() and completed with MPI_[Test|Wait][any|all|some]. * By basic, I mean excluding things like broadcasts, scatters, reduces; all of which have some send action included within them.
  36. Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
  37. 5 - 10us for MPI_Isend / MPI_Irecv to return; xfer took 1.7ms (147MB/s).
  38. It’s important to look at the work you need to speed up and understand which approach will do better for you. Multiple separable tasks, each of ~ same difficulty, work well with task parallelism. Data parallelism works well
  39. Data parallelism works well with large data sets. Load balancing can become an issue if relative workloads aren’t known a priori.
  40. It’s important to consider how to split the data in a data-parallel system. Suppose you know you want to do MIPs across Z repeatedly; ignoring everything else, you would want to lay out the data with z available locally (but not necessarily contiguous: SSE instructions for maximums don’t want to work along the four contiguous elements, but between a pair of elements in two four-value sets, so have z as your next-to-fastest dimension). Other examples of distributions are cyclic and block-cyclic; also high-dimension splitting (into a grid, for example).
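The block, cyclic, and block-cyclic distributions named above can each be written as a mapping from a global index to the rank that owns it. The C sketch below is illustrative only: the function names are hypothetical, and the block variant assumes the common convention of ceil(n / nranks) elements per rank with the last rank possibly holding fewer.

```c
#include <stddef.h>

/* Block: contiguous chunks of ceil(n / nranks) elements per rank. */
static size_t block_owner(size_t i, size_t n, size_t nranks)
{
    size_t chunk = (n + nranks - 1) / nranks;
    return i / chunk;
}

/* Cyclic: elements are dealt out one at a time, round-robin. */
static size_t cyclic_owner(size_t i, size_t nranks)
{
    return i % nranks;
}

/* Block-cyclic: blocks of `bsize` elements are dealt out round-robin. */
static size_t block_cyclic_owner(size_t i, size_t bsize, size_t nranks)
{
    return (i / bsize) % nranks;
}
```

For 10 elements on 4 ranks, block ownership is 0,0,0,1,1,1,2,2,2,3, while cyclic ownership is 0,1,2,3,0,1,2,3,0,1. Cyclic layouts tend to help load balance when cost varies along the index, at the price of giving up contiguous local data.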
  41. People are really doing this… We’re really doing this… Data is split along x immediately after FTx; distributed to all nodes. Calibration scan taken earlier. (This is not the iterative reconstruction.) GW is done one dimension at a time; requires data along that dimension to be local, so we transpose before the GWx correction. 2s for 400x320x220 R=8 NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set).
  42. It’s important to look at your problem and determine where it can be separated out. In general, MPI works better if you can separate it at the large scale, rather than at the fine scale. SIMD is an example of fine-scale parallelism. Are each of these separable? Max: can do local maximums, and then max of maximums. Sort: parallel bitonic system; out of scope here; 55ms for ¼ qsort; 85 ms for full parallel sort; ~220 for one qsort of full vector (1 mega-element ints). 1D FFT: as long as the 1D FFT is not along the split dimension (assuming the time of a single 1D FFT is small enough that you won’t try to split it up). 2D FFTs: easy as long as not split along the FFTs. 3D FFTs: perform along contiguous dims; swap for final (“transposed input/output” options on fftw3 MPI implementation).
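The "local maximums, then max of maximums" idea can be checked serially before involving MPI. The C sketch below is a simulation under stated assumptions, not MPI code: each loop iteration stands in for one rank working on its own contiguous chunk, and the final comparison stands in for the reduction step (in a real program that step would typically be an `MPI_Allreduce` with `MPI_MAX`). The function names are made up for this note, and `n` and `nranks` are assumed to be at least 1.

```c
#include <stddef.h>

/* Maximum of a non-empty array: the work each rank does locally. */
static int local_max(const int *v, size_t n)
{
    int m = v[0];
    for (size_t i = 1; i < n; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Serial stand-in for the parallel version: split v into up to
 * `nranks` contiguous chunks, take each chunk's maximum, then
 * reduce those per-chunk maxima to a single global maximum. */
static int max_of_chunks(const int *v, size_t n, size_t nranks)
{
    size_t chunk = (n + nranks - 1) / nranks;  /* ceil(n / nranks) */
    int global = v[0];

    for (size_t r = 0; r < nranks; ++r) {
        size_t start = r * chunk;
        if (start >= n)
            break;                             /* this "rank" has no data */
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        int m = local_max(v + start, len);     /* per-rank work */
        if (m > global)
            global = m;                        /* the reduction step */
    }
    return global;
}
```

Only one value per rank crosses the network in the reduction, however large the local chunks are, which is why max separates well at the large scale.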
  43. 320x176x320 -> 384x256x384 ~ 0.5 sec (14GFlops just for FFTs). Light pink = fftw library. Custom labels in profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
  44. One-sided communication opens up race-condition concerns again, but gains some latency / BW because of reduced negotiation.
  45. Efficient: make use of all data on a cache line when you read it; and only read it once
  46. Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” There are some packages out there (OpenMPI w/ Eclipse; TotalView) to help with debugging MPI. Errors on other nodes can cause the one you’re debugging to receive a signal to exit.
  47. You can build a cluster virtually just to see how things work…
  48. All of these mailing lists are active, and wonderful places to get help (After you’ve read the Docs & FAQ!)