ISBI MPI Tutorial


Published in: Technology, Education
  • NUMA: NUMA is a distinction within shared memory systems, e.g. AMD HyperTransport or Intel QPI vs. a Northbridge w/ FSB. GPGPU: sort of shared memory; transfers into and out of GPU memory are from the main shared system memory, while transfers within GPU memory by GPU kernels are shared memory within their own private (GPU) memory space. Distributed systems: composed of multiple nodes; each node is typically an individual “computer”. MPI can be used on shared memory systems; modern implementations use the fastest transfer mechanism between each set of peers.
  • Some scale better. CPUs keep getting faster, either through GHz or # of cores; memory BW has not kept up.
    STREAM benchmark:
    ------------------------------------------------------------------
    name        kernel                  bytes/iter      FLOPS/iter
    ------------------------------------------------------------------
    COPY:       a(i) = b(i)                 16              0
    SCALE:      a(i) = q*b(i)               16              1
    SUM:        a(i) = b(i) + c(i)          24              1
    TRIAD:      a(i) = b(i) + q*c(i)        24              2
    ------------------------------------------------------------------
    8 million element double-precision arrays (~64MB arrays); ICC 10; -xP. CPU manufacturers are focused on improving this, and have really sped things up with Nehalem. … what about Nehalem?
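The four kernels in the STREAM table are simple enough to sketch directly. The following is a minimal illustration in C, not the benchmark's own source: the function names are made up for this example, and the byte counts in the comments assume 8-byte doubles, as in the table.

```c
#include <assert.h>
#include <stddef.h>

/* COPY: a(i) = b(i); 16 bytes moved and 0 FLOPs per iteration */
void stream_copy(double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* SCALE: a(i) = q*b(i); 16 bytes moved and 1 FLOP per iteration */
void stream_scale(double *a, const double *b, double q, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = q * b[i];
}

/* SUM: a(i) = b(i) + c(i); 24 bytes moved and 1 FLOP per iteration */
void stream_sum(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* TRIAD: a(i) = b(i) + q*c(i); 24 bytes moved and 2 FLOPs per iteration */
void stream_triad(double *a, const double *b, const double *c, double q, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```

Each loop does at most two floating-point operations for every 16 to 24 bytes of memory traffic, which is why these kernels tend to saturate memory bandwidth long before they saturate the cores.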
  • Examples of tasks that hit BW walls: highly tuned inner loops (few ops per element; running over a large volume); masking operations (multiply each element from one volume by a mask in another volume); max / min / mean / std operations; MIPs. Still an issue on new systems, and likely to continue to be an issue. Nehalem is NUMA as well, which adds another layer of complexity; this can be controlled somewhat via binding (numactl; through Task Manager in Windows). This is not to say the 8 processors are useless: in programs where the inner loop operation does more work, the scaling can be close to ideal, e.g. sin(x).
  • Front side bus, QuickPath Interconnect, HyperTransport. High-level languages: need to finish one operation (A += B) before doing the next operation (A = A*A). MPI is the de facto standard for parallel programs on distributed memory systems, from Blue Gene to off-the-shelf Linux clusters. 1GB 1333 DDR3: $95 ($800); 2GB 1333 DDR3: $155 ($620); 4GB 1333 DDR3: $322 ($644); 8GB 1333 DDR3 chips: $3410 ($3410). Nehalem again makes this more confusing; memory bus clock changes based on # of modules… Also one of the key points that CUDA is focused on; 3 of the 8 called-out improvements in the latest rev focus on efficient / improved memory BW usage.
  • Needs for large data sets in image processing are real and here now.
  • 2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set). (This is not the iterative reconstruction.)
  • *MY* taxonomy. SETI@home 1999 – 2005; now part of BOINC: 1.7PFlops > 1.4PFlops (RoadRunner). Grid: jobs ~independent and asynchronous; Hadoop / MapReduce; cycle stealing. ScaleMP: up to 32 processors (128 cores) and 4TB shared memory. Cluster computing: distributed “process” starts on multiple machines concurrently; typically cookie-cutter (although support for different architectures is possible in MPI); significant communication between nodes during processing; massive simulations; applications sensitive to timings. Folding@Home: loosely coupled collective (Grid), tightly coupled within client (MPI); also Grid+GPU; 4.6PFlops.
  • More taxonomy:
    Grid: loosely connected; nodes “unaware” of other nodes. Works great for “batch” problems. Different architectures; different implementations (CPU, GPU, … PS3 and Nvidia clients for Folding@Home). Wildly varying performance between nodes “easily” accommodated. Fail-over almost “automatic”. Sun Grid Engine; MapReduce / Hadoop. Can be a cycle-stealing background process.
    Cluster: tightly connected; nodes in tight communication with each other. Failures are hard to handle; intermediate results often saved; MPI-2. Usually homogeneous nodes; varying performance can cause severe performance loss if not accounted for carefully. MPI (SGE / other schedulers).
    We will be focusing on clusters; this is where MPI is used.
  • Network transfers (even on fast networks) are expensive compared to memory transactions
  • Number of bi-directional links for N nodes = (N-1)*N/2: 15 for 6 nodes; 28 for 8; ~ N^2 / 2. Managing this yourself is complicated and time-consuming! >>> This is what MPI simplifies for us. So what is MPI?
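The link count quoted above is easy to check with a few lines of C. This is only a small sketch of the arithmetic; the function name link_count is made up for the example.

```c
#include <assert.h>

/* Number of bi-directional links needed to connect every pair of n nodes:
 * each of the n nodes links to the other (n - 1), and every link is shared
 * by two nodes, hence the division by 2. */
unsigned long link_count(unsigned long n)
{
    assert(n >= 1);
    return n * (n - 1) / 2;
}
```

For example, link_count(6) gives 15 and link_count(8) gives 28, matching the figures in the note; the count grows roughly as N^2 / 2.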
  • ANL = Argonne National Laboratory. * Although available on many platforms, it has a Unix heritage, and is most natural to use in Unix-y (Mac, Linux, Sun) environments. (OpenMPI ships standard on Macs w/ Leopard.) Low-level: there are some functions that operate on the data type (reduce operations), but most “just” shuffle bytes around.
  • MPI is everywhere in high-performance computing, but why? >>> So what does MPI do for you? Why should you use it? Look at the complexity of setting up a distributed system again.
  • Can also provide profiling; MPI can use different communication for different sets of peers (e.g. SMP, Infiniband, TCP/IP). You could (almost) write any MPI program with these 4 calls; much different from pthreads w/ mutexes, OpenMP, GPU, etc. Communication provides synchronization by its nature; no dealing with “locks” on “shared” variables, etc. BUT: need to be sure each node is initializing variables correctly…
  • Getting back to what MPI is a little more… Even though most MPI programs could be written with just a few MPI commands, there are quite a few available…
  • Linux / Mac instructions; Leopard already has OpenMPI installed in /usr/bin/mpi[cc|cxx|run]. Not familiar with the Windows version; see windows portion of andc++ if desired. Supports shared mem and TCP channels.
  • MPD = multiprocessing daemon; used to start one daemon per host; these daemons are used to start the actual jobs. Talk a little more about MPD.
  • MPD = multiprocessing daemon; launched (and left running) on each node that wants to be ready to participate in an MPI execution. Other options (mpirun) exist, but mpd is fast for starting new jobs (as opposed to new ssh sessions created each time a job is run).
  • MPD = multiprocessing daemon
  • MPI_Init() must be called in every program that will use MPI calls. Caveat: printing to stdout (stderr) from different nodes works, but it is not guaranteed to be synchronized. On click: note 2 printed before 1, even though 2 occurred after 1 as enforced by MPI_Barrier(); fflush() does not fix this… send all IO to one process for printout. Now is a good time to discuss what actually happens when an MPI parallel job is run (in contrast to a threaded job).
  • I want to look a little more into what actually happens when a parallel MPI program runs. Let’s start by looking at how a parallel threaded app runs. Threads are spawned at runtime as requested by the program. Multiple threads may be spawned and joined over the course of a program. Each thread has access to memory to do its work (whatever it may be). main() is only entered and exited once.
  • Multiprocessing daemons already running; they know about each other. NOT SHOWING RANK 2. Each rank is a full program; starts in main; exits from main. MPI_Bcast() / MPI_Allreduce() included here as a way to show communication between nodes. Program logic during execution determines who does what.
  • Only the portions highlighted are different between the nodes; however, every line is executed (the full program) on each node; tests are performed to select different code at run time to run on each node. This is different from threaded apps, where common (global) code sections (initializations, etc.) are really only run once. As long as init << parallel work, this is not a big performance issue. (It doesn’t have to be done this way, but this is the typical way; different executables can be run as different processes, if desired.)
  • MPE is useful for understanding what MPI is doing
  • Black sections between barriers are the printf calls ~20usec each
  • More of a debug tool
  • Transpose; ~88MB data set in 70 ms (1.2GB/s)
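The bandwidth figure quoted for the transpose follows directly from the data size and the elapsed time. The sketch below reproduces the arithmetic, assuming that "MB" and "GB" here mean 10^6 and 10^9 bytes; the function name is made up for the example.

```c
#include <assert.h>

/* Effective bandwidth, in bytes per second, of a transfer that moves
 * `bytes` bytes in `seconds` seconds. */
double effective_bandwidth(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds;
}
```

With 88e6 bytes and 0.070 seconds this gives roughly 1.26e9 bytes per second, consistent with the ~1.2GB/s quoted in the note.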
  • 320x176x320 -> 384x256x384 ~ 0.5 sec (14GFlops just for FFTs). Light pink = fftw library!!! Custom labels in profile: 2d – tp - 1d – pad - 1d – tp – pad - 2d. On click: info boxes.
  • Let’s get back to MPI programming by examining the two basic building blocks for any MPI program: MPI_Send & MPI_Recv. You can make communicators that include only a subset of the active nodes; useful for doing “broadcasts” within a subset, etc. Tag can be used to separate classes of messages, etc.; up to the user. Can be used with zero-length messages to communicate something via the tag alone, e.g. “Ready” or “complete”.
  • It’s important to note that the types don’t have to be exactly the same. (A strided vector could be received / sent from a contiguous vector.)
  • Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
  • Threshold is 16kB
  • We can see that sends can complete before the matching receives are posted, but not vice-versa. (Timing is enforced by message passing; no mutexes required!)
  • Threshold is 16kB
  • “Small” messages get sent into pre-allocated (within the MPI library) buffers; this allows the sender to return quicker, with less traffic, etc.: “Eager”. “Large” messages get sent only once the receiver has posted the receive request (with the receive buffer): “Rendezvous”.
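The eager / rendezvous split described above can be modelled as a simple size test. The sketch below is a toy model only, not any MPI library's API: the type and function names are hypothetical, and both the threshold value and the choice of a strict "less than" comparison are assumptions that real implementations are free to make differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two transfer protocols described in the note */
typedef enum { PROTOCOL_EAGER, PROTOCOL_RENDEZVOUS } protocol_t;

/* Toy model: messages smaller than the threshold are copied into
 * pre-allocated library buffers ("eager"); larger ones wait until the
 * receiver has posted a matching receive ("rendezvous"). */
protocol_t choose_protocol(size_t message_bytes, size_t eager_threshold_bytes)
{
    assert(eager_threshold_bytes > 0);
    return (message_bytes < eager_threshold_bytes) ? PROTOCOL_EAGER
                                                   : PROTOCOL_RENDEZVOUS;
}

/* In this model, a blocking standard-mode send can return before the
 * matching receive is posted only when the message goes out eagerly. */
int send_may_return_before_recv(size_t message_bytes, size_t eager_threshold_bytes)
{
    return choose_protocol(message_bytes, eager_threshold_bytes) == PROTOCOL_EAGER;
}
```

This is why a program in which every rank calls a blocking send before its receive can run fine with small messages and hang once the message size crosses the threshold.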
  • Most of these also have _init modes to create a persistent request that can be started with MPI_Start[all]() and completed with MPI_[Test|Wait][any|all|some]. * By basic, I mean excluding things like broadcasts, scatters, reduces; all of which have some send action included within them.
  • Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
  • 5 – 10us for MPI_Isend / MPI_Irecv to return; xfer took 1.7ms(147MB/s)
  • It’s important to look at the work you need to speed up and understand which approach will do better for you. Multiple separable tasks, each of ~ the same difficulty, work well with task parallelism.
  • Data parallelism works well with large data sets. Load balancing can become an issue if relative workloads aren’t known a priori.
  • It’s important to consider how to split the data in a data-parallel system. Suppose you know you want to do MIPs across Z repeatedly; ignoring everything else, you would want to lay the data out with z available locally, but not necessarily contiguous: SSE instructions for maximums don’t want to work along the four contiguous elements, but between a pair of elements in two four-value sets (have z as your next-to-fastest dimension). Other examples of distributions are cyclic and block-cyclic; also high-dimension splitting (into a grid, for example).
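The block, cyclic, and block-cyclic distributions mentioned above differ only in how an element index maps to the node that owns it. The following is a minimal sketch of one common convention for each; the function names are made up for this example, and the block variant assumes chunks of ceil(n / nodes) elements, which is only one of several reasonable choices.

```c
#include <assert.h>
#include <stddef.h>

/* Block: node 0 owns the first chunk, node 1 the next, and so on.
 * Chunks hold ceil(n / nodes) elements; the last one may be shorter. */
size_t block_owner(size_t index, size_t n, size_t nodes)
{
    assert(nodes > 0 && index < n);
    size_t chunk = (n + nodes - 1) / nodes;
    return index / chunk;
}

/* Cyclic: elements are dealt out one at a time, like cards. */
size_t cyclic_owner(size_t index, size_t nodes)
{
    assert(nodes > 0);
    return index % nodes;
}

/* Block-cyclic: blocks of `block` consecutive elements are dealt out in turn. */
size_t block_cyclic_owner(size_t index, size_t block, size_t nodes)
{
    assert(nodes > 0 && block > 0);
    return (index / block) % nodes;
}
```

With 10 elements on 4 nodes, the block layout gives elements 0 to 2 to node 0 and element 9 to node 3, while the cyclic layout gives element 5 to node 1. Block layouts keep neighbours together; cyclic layouts tend to even out the load when the work per element varies.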
  • People are really doing this… We’re really doing this… Data is split along x immediately after FTx and distributed to all nodes. Calibration scan taken earlier. (This is not the iterative reconstruction.) GW is done one dimension at a time; it requires data along that dimension to be local, so we transpose before the GWx correction. 2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set).
  • It’s important to look at your problem and determine where it can be separated out. In general, MPI works better if you can separate it at the large scale, rather than at the fine scale. SIMD is an example of fine-scale parallelism. Are each of these separable?
    Can do local maximums, and then max of maximums.
    Parallel bitonic system; out of scope here; 55ms for ¼ qsort; 85ms for full parallel sort; ~220ms for one qsort of full vector (1 mega-element ints).
    1dfft: as long as the 1dfft is not along the split dimension (assuming the time of a single 1d fft is small enough that you won’t try to split it up).
    2dffts: easy as long as not split along ffts.
    3dffts: perform along contiguous dims; swap for final (“transposed input/output” options on fftw3 mpi implementation).
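The "local maximums, then max of maximums" idea can be sketched without any MPI at all by treating slices of one array as the data held by each rank. This is an illustrative single-process simulation under that assumption, with made-up function names; in a real MPI program the combining step is what a reduction with MPI_MAX (via MPI_Reduce or MPI_Allreduce) performs.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum of one rank's local slice */
int local_max(const int *data, size_t count)
{
    assert(data != NULL && count > 0);
    int best = data[0];
    for (size_t i = 1; i < count; i++)
        if (data[i] > best)
            best = data[i];
    return best;
}

/* Simulates `nodes` ranks, each holding a contiguous slice of `data`:
 * every rank computes its local maximum, then the maximums are combined. */
int max_of_maximums(const int *data, size_t count, size_t nodes)
{
    assert(nodes > 0 && count >= nodes);
    size_t chunk = count / nodes;
    int global = local_max(data, (nodes == 1) ? count : chunk); /* rank 0 */
    for (size_t rank = 1; rank < nodes; rank++) {
        size_t start = rank * chunk;
        /* the last rank also takes any leftover elements */
        size_t len = (rank == nodes - 1) ? count - start : chunk;
        int m = local_max(data + start, len);
        if (m > global)
            global = m;
    }
    return global;
}
```

Only one value per rank has to cross the network, no matter how large each local slice is, which is what makes this kind of operation separate so cleanly.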
  • 320x176x320 -> 384x256x384 ~ 0.5 sec (14GFlops just for FFTs). Light pink = fftw library. Custom labels in profile: 2d – tp - 1d – pad - 1d – tp – pad - 2d. On click: info boxes.
  • One-sided communication opens up race-condition concerns again, but gains some latency / BW because of reduced negotiation
  • Efficient: make use of all data on a cache line when you read it; and only read it once
  • Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” There are some packages out there (OpenMPI w/ Eclipse; TotalView) to help with debugging MPI. Errors on other nodes can cause the one you’re debugging to receive a signal to exit.
  • You can build a cluster virtually just to see how things work…
  • All of these mailing lists are active, and wonderful places to get help (After you’ve read the Docs & FAQ!)
  • ISBI MPI Tutorial

    1. 1. Eric Borisch, M.S. Mayo Clinic
    2. 2.  Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
    3. 3.  Shared Memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]  Single- and multi-CPU desktops & laptops  Multi-threaded apps  GPGPU *  MPI *  Distributed Memory: memory available to a given node within a system is unique and distinct from its peers  MPI  Google MapReduce / Hadoop
    4. 4. [Chart: relative performance of Copy, Scale, Add, and Triad vs. # of processes (1 to 8).] CentOS 5.2; Dual Quad-Core 3GHz P4 [E5472]; DDR2 800MHz
    5. 5. STREAM benchmark OpenMP performance. [Chart: relative performance (0% to 400%) of Add, Copy, Scale, and Triad vs. threads (0 to 16; 8 physical cores + HT).] 2x X5570 (2.93GHz; Quad-core; 6.4GT/s QPI); 12x4G 1033 DDR3
    6. 6.  Bandwidth (FSB, HT, Nehalem, CUDA, …)  Frequently run into with high-level languages (MATLAB)  Capacity – cost & availability  High-density chips are $$$ (if even available)  Memory limits on individual systems  Distributed computing addresses both bandwidth and capacity with multiple systems  MPI is the glue used to connect multiple distributed processes together
    7. 7.  Custom iterative SENSE reconstruction  3 x 8 coils x 400 x 320 x 176 x 8 [complex float]  Profile data (img space)  Estimate (img<->k space)  Acquired data (k space)  > 4GB data touched during each iteration  16, 32 channel data here or on the way… Trzasko, Josh ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms” M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
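The "> 4GB data touched during each iteration" figure can be checked from the dimensions on the slide. The sketch below reads "3 x 8 coils x 400 x 320 x 176 x 8 [complex float]" as three arrays (profile, estimate, acquired data), eight coils, a 400 x 320 x 176 volume, and 8 bytes per complex float sample; that reading, and the function name, are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes for `arrays` data sets of `coils` volumes, each nx*ny*nz
 * samples of `bytes_per_sample` bytes. 64-bit arithmetic is used because
 * the result can exceed what a 32-bit integer can hold. */
uint64_t touched_bytes(uint64_t arrays, uint64_t coils,
                       uint64_t nx, uint64_t ny, uint64_t nz,
                       uint64_t bytes_per_sample)
{
    assert(bytes_per_sample > 0);
    return arrays * coils * nx * ny * nz * bytes_per_sample;
}
```

With the slide's numbers, touched_bytes(3, 8, 400, 320, 176, 8) is 4,325,376,000 bytes, just over 4 GiB and more than a 32-bit size can index, which is the capacity argument the slide is making.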
    8. 8. FTx DATA Place view into correct x-Ky-Kz space (AP & LP) CAL FTyz (AP & LP) “Traditional” 2D SENSE Unfold (AP & LP) Homodyne Correction Pre-loaded data GW Correction (Y, Z) Real-time data GW Correction (X) MPI Communication MIP Root node Worker nodes Store / RESULT DICOM
    9. 9. Root Node 1Gb Eth Site Intranet 3.6GHz P4 16GB RAM 1Gb Eth Worker Node (x7) 1Gb Eth 3.6GHz P4 16GB RAM 3.6GHz P4 1Gb Eth 80GB HDD 3.6GHz P4 2x8Gb IB 80GB HDD 500GB HDD 2x8Gb IB 16-Port Gigabit Ethernet Switch x7 File system connections 24-Port Infiniband Switch x7x2 MPI interconnects 16Gb/s bandwidth per node 8Gb/s Connection Key Cluster Hardware MRI System External Hardware 2x8Gig Infiniband connection 1Gig Ethernet connection
    10. 10.  Loosely coupled  SETI / BOINC  “Grid computing”  BIOS-level abstraction  ScaleMP  Tightly coupled  MPI  “Cluster computing”  Hybrid  Folding@Home 
    11. 11. Worker Head Node Worker Worker Master Node Worker Worker Worker
    12. 12. Host Host I OS OS I Process A Thread 1 Process A Thread 2 Host II ThreadN OS II Process B Process B Host N OS N Memory Transfers Process C Network Transfers
    13. 13. Host Host I OS OS I Process A Thread 1 Process A Process D Thread 2 Host II ThreadN OS II Process B Process B Process E Host N OS N Memory Transfers Process C Process F Network Transfers
    14. 14.  Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
    15. 15.  Message Passing Interface is…  “a library specification for message-passing” 1  Available in many implementations on multiple platforms *  A set of functions for moving messages between different processes without a shared memory environment  Low-level*; no concept of overall computing tasks to be performed [1]
    16. 16.  MPI-1  Version 1.0 draft standard 1994  Version 1.1 in 1995  Version 1.2 in 1997  Version 1.3 in 2008  MPI-2  Added: ▪ 1-sided communication ▪ Dynamic “world” sizes; spawn / join  Version 2.0 in 1997  Version 2.1 in 2008  MPI-3  In process  Enhanced fault handling  Forward compatibility preserved
    17. 17.  MPI is the de-facto standard for distributed computing  Freely available  Open source implementations exist  Portable  Mature  From a discussion of why MPI is dominant [1]:  […] 100s of languages have come and gone.  Good stuff must have been created [… yet] it is broadly accepted in the field that they’re not used.  MPI has a lock.  OpenMP is accepted, but a distant second.  There are substantial barriers to the introduction of new languages and language constructs.  Economic, ecosystem related, psychological, a catch-22 of widespread use, etc.  Any parallel language proposal must come equipped with reasons why it will overcome those barriers. [1]
    18. 18.  MPI itself is just a specification. We want an implementation  MPICH, MPICH2  Widely portable  MVAPICH, MVAPICH2  Infiniband-centric; MPICH/MPICH2 based  OpenMPI  Plug-in architecture; many run-time options  And more:  IntelMPI  HP-MPI  MPI for IBM Blue Gene  MPI for Cray  Microsoft MPI  MPI for SiCortex  MPI for Myrinet Express (MX)  MPICH2 over SCTP
    19. 19.  Without MPI:  Start all of the processes across bank of machines (shell scripting + ssh)  socket(), bind(), listen(), accept() or connect() each link  send(), read() on individual links  Raw byte interfaces; no discrete messages
    20. 20.  With MPI  mpiexec -np <n> app  MPI_Init()  MPI_Send()  MPI_Recv()  MPI_Finalize()  MPI:  Manages the connections  Packages messages  Provides launching mechanism
    21. 21. Provides definitions for:  Communication functions  MPI_Send()  MPI_Recv()  MPI_Bcast()  etc.  Datatype management functions  MPI_Type_create_hvector()  C, C++, and Fortran bindings  Also recommends process startup  mpiexec -np <nproc> <program> <args> [1]
    22. 22. MPI_Abort MPI_Comm_remote_size MPI_File_read_ordered_end MPI_Group_rank MPI_Scatterv MPI_Unpublish_name MPI_Accumulate MPI_Comm_set_attr MPI_File_read_shared MPI_Group_size MPI_Send MPI_Wait MPI_Add_error_class MPI_Comm_set_errhandler MPI_File_seek MPI_Group_translate_ranks MPI_Send_init MPI_Waitall MPI_Add_error_code MPI_Comm_set_name MPI_File_seek_shared MPI_Group_union MPI_Sendrecv MPI_Waitany MPI_Add_error_string MPI_Comm_size MPI_File_set_atomicity MPI_Ibsend MPI_Sendrecv_replace MPI_Waitsome MPI_Address MPI_Comm_spawn MPI_File_set_errhandler MPI_Info_create MPI_Ssend MPI_Win_call_errhandler MPI_Allgather MPI_Comm_spawn_multiple MPI_File_set_info MPI_Info_delete MPI_Ssend_init MPI_Win_complete MPI_Allgatherv MPI_Comm_split MPI_File_set_size MPI_Info_dup MPI_Start MPI_Win_create MPI_Alloc_mem MPI_Comm_test_inter MPI_File_set_view MPI_Info_free MPI_Startall MPI_Win_create_errhandler MPI_Allreduce MPI_Dims_create MPI_File_sync MPI_Info_get MPI_Status_set_cancelled MPI_Win_create_keyval MPI_Alltoall MPI_Errhandler_create MPI_File_write MPI_Info_get_nkeys MPI_Status_set_elements MPI_Win_delete_attr MPI_Alltoallv MPI_Errhandler_free MPI_File_write_all MPI_Info_get_nthkey MPI_Test MPI_Win_fence MPI_Alltoallw MPI_Errhandler_get MPI_File_write_all_begin MPI_Info_get_valuelen MPI_Test_cancelled MPI_Win_free MPI_Attr_delete MPI_Errhandler_set MPI_File_write_all_end MPI_Info_set MPI_Testall MPI_Win_free_keyval MPI_Attr_get MPI_Error_class MPI_File_write_at MPI_Init MPI_Testany MPI_Win_get_attr MPI_Attr_put MPI_Error_string MPI_File_write_at_all MPI_Init_thread MPI_Testsome MPI_Win_get_errhandler MPI_Barrier MPI_Exscan MPI_File_write_at_all_begin MPI_Initialized MPI_Topo_test MPI_Win_get_group MPI_Bcast MPI_File_c2f MPI_File_write_at_all_end MPI_Intercomm_create MPI_Type_commit MPI_Win_get_name MPI_Bsend MPI_File_call_errhandler MPI_File_write_ordered MPI_Intercomm_merge MPI_Type_contiguous MPI_Win_lock MPI_Bsend_init MPI_File_close MPI_File_write_ordered_begin 
MPI_Iprobe MPI_Type_create_darray MPI_Win_post MPI_Buffer_attach MPI_File_create_errhandler MPI_File_write_ordered_end MPI_Irecv MPI_Type_create_hindexed MPI_Win_set_attr MPI_Buffer_detach MPI_File_delete MPI_File_write_shared MPI_Irsend MPI_Type_create_hvector MPI_Win_set_errhandler MPI_Cancel MPI_File_f2c MPI_Finalize MPI_Is_thread_main MPI_Type_create_indexed_block MPI_Win_set_name MPI_Cart_coords MPI_File_get_amode MPI_Finalized MPI_Isend MPI_Type_create_keyval MPI_Win_start MPI_Cart_create MPI_File_get_atomicity MPI_Free_mem MPI_Issend MPI_Type_create_resized MPI_Win_test MPI_Cart_get MPI_File_get_byte_offset MPI_Gather MPI_Keyval_create MPI_Type_create_struct MPI_Win_unlock MPI_Cart_map MPI_File_get_errhandler MPI_Gatherv MPI_Keyval_free MPI_Type_create_subarray MPI_Win_wait MPI_Cart_rank MPI_File_get_group MPI_Get MPI_Lookup_name MPI_Type_delete_attr MPI_Wtick MPI_Cart_shift MPI_File_get_info MPI_Get_address MPI_Op_create MPI_Type_dup MPI_Wtime MPI_Cart_sub MPI_File_get_position MPI_Get_count MPI_Op_free MPI_Type_extent MPI_Cartdim_get MPI_File_get_position_shared MPI_Get_elements MPI_Open_port MPI_Type_free MPI_Close_port MPI_File_get_size MPI_Get_processor_name MPI_Pack MPI_Type_free_keyval MPI_Comm_accept MPI_File_get_type_extent MPI_Get_version MPI_Pack_external MPI_Type_get_attr MPI_Comm_call_errhandler MPI_File_get_view MPI_Graph_create MPI_Pack_external_size MPI_Type_get_contents MPI_Comm_compare MPI_File_iread MPI_Graph_get MPI_Pack_size MPI_Type_get_envelope MPI_Comm_connect MPI_File_iread_at MPI_Graph_map MPI_Pcontrol MPI_Type_get_extent MPI_Comm_create MPI_File_iread_shared MPI_Graph_neighbors MPI_Probe MPI_Type_get_name MPI_Comm_create_errhandler MPI_File_iwrite MPI_Graph_neighbors_count MPI_Publish_name MPI_Type_get_true_extent MPI_Comm_create_keyval MPI_File_iwrite_at MPI_Graphdims_get MPI_Put MPI_Type_hindexed MPI_Comm_delete_attr MPI_File_iwrite_shared MPI_Grequest_complete MPI_Query_thread MPI_Type_hvector MPI_Comm_disconnect MPI_File_open 
MPI_Grequest_start MPI_Recv MPI_Type_indexed MPI_Comm_dup MPI_File_preallocate MPI_Group_compare MPI_Recv_init MPI_Type_lb MPI_Comm_free MPI_File_read MPI_Group_difference MPI_Reduce MPI_Type_match_size MPI_Comm_free_keyval MPI_File_read_all MPI_Group_excl MPI_Reduce_scatter MPI_Type_set_attr MPI_Comm_get_attr MPI_File_read_all_begin MPI_Group_free MPI_Register_datarep MPI_Type_set_name MPI_Comm_get_errhandler MPI_File_read_all_end MPI_Group_incl MPI_Request_free MPI_Type_size MPI_Comm_get_name MPI_File_read_at MPI_Group_intersection MPI_Request_get_status MPI_Type_struct MPI_Comm_get_parent MPI_File_read_at_all MPI_Group_range_excl MPI_Rsend MPI_Type_ub MPI_Comm_group MPI_File_read_at_all_begin MPI_Group_range_incl MPI_Rsend_init MPI_Type_vector MPI_Comm_join MPI_File_read_at_all_end MPI_Scan MPI_Unpack MPI_Comm_rank MPI_File_read_ordered MPI_Scatter MPI_Unpack_external MPI_Comm_remote_group MPI_File_read_ordered_begin
    23. 23.  Each process owns its data – there is no “our”  Makes many things simpler; no mutexes, condition variables, semaphores, etc; memory access order race conditions go away  Every message is an explicit copy  I have the memory I sent from, you have the memory you used to receive into  Even when running in a “shared memory” environment  Synchronization comes along for free  I won’t get your message (or data) until you choose to send it  Programming to MPI first can make it easier to scale out later
    24. 24.  Motivation for distributed computing  What MPI is  Intro to MPI programming  Thinking in parallel  Wrap up
    25. 25.  Download / decompress MPICH source: h2/  Supports: C / C++ / Fortran  Requires Python >= 2.2  ./configure  make install  installs into /usr/local by default, or use --prefix=<chosen path>  Make sure <prefix>/bin is in PATH  Make sure <prefix>/share/man is in MANPATH
    26. 26. c compiler wrapper c++ compiler wrapper MPI job launcher MPD launcher
    27. 27.  Set up passwordless ssh to workers  Start the daemons with mpdboot -n <N>  Requires ~/.mpd.conf to exist on each host ▪ Contains: (same on each host) ▪ MPD_SECRETWORD=<some gibberish string> ▪ permissions set to 600 (r/w access for owner only)  Requires ./mpd.hosts to list other host names ▪ Unless run as mpdboot -n 1 (run on current host only) ▪ Will not accept current host in list (implicit)  Check for running daemons with mpdtrace For details:
    28. 28.  Use mpicc / mpicxx for c/c++ compiler  Wrapper script around c/c++ compilers detected during install ▪ $ mpicc --show gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt  $ mpicc -o hello hello.c  Use mpiexec -np <nproc> <app> <args> to launch  $ mpiexec -np 4 ./hello
    29. 29. /* hello.c */
        #include <stdio.h>
        #include <mpi.h>

        int main (int argc, char * argv[])
        {
          int i, rank, nodes;

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &nodes);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          /* One barrier per iteration; each rank prints on its own turn */
          for (i=0; i < nodes; i++)
          {
            MPI_Barrier(MPI_COMM_WORLD);
            if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
          }

          MPI_Finalize();
          return 0;
        }

        $ mpicc -o hello hello.c
        $ mpiexec -np 4 ./hello
        Hello from 0 of 4!
        Hello from 2 of 4!
        Hello from 1 of 4!
        Hello from 3 of 4!
    30. 30. ./threaded_app main() Thread within threaded_app process pthread_create( func() ) func() Do work Do work Memory pthread_join() pthread_exit() exit()
    31. 31. [Diagram: mpiexec -np 4 ./mpi_app; mpd launches jobs mpi_app [rank 0], mpi_app [rank 1], mpi_app [rank 3]. Each rank runs main(), MPI_Init(), MPI_Bcast(), do work on local mem, MPI_Allreduce(), MPI_Finalize(), exit(); MPI communication between the ranks occurs at MPI_Init(), MPI_Bcast(), MPI_Allreduce(), and MPI_Finalize().]
    32. 32. /* hello.c */
        #include <stdio.h>
        #include <mpi.h>

        int main (int argc, char * argv[])
        {
          int i;
          int rank;
          int nodes;

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &nodes);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          for (i=0; i < nodes; i++)
          {
            MPI_Barrier(MPI_COMM_WORLD);
            if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
          }

          MPI_Finalize();
          return 0;
        }
    33. 33.  MPICH2 comes with mpe by default (unless disabled during configure)  Multiple tracing / logging options to track MPI traffic  Enabled through -mpe=<option> at compile time
        MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
        MacPro:code$ mpiexec -np 4 ./hello
        Hello from 0 of 4!
        Hello from 2 of 4!
        Hello from 1 of 4!
        Hello from 3 of 4!
        Writing logfile....
        Enabling the Default clock synchronization...
        Finished writing logfile ./hello.clog2.
    34. 34. MacPro:code$ mpicc -mpe=mpitrace -o hello hello.c
        MacPro:code$ mpiexec -np 2 ./hello > trace

        MacPro:code$ grep 0 trace            MacPro:code$ grep 1 trace
        [0] Ending MPI_Init                  [1] Ending MPI_Init
        [0] Starting MPI_Comm_size...        [1] Starting MPI_Comm_size...
        [0] Ending MPI_Comm_size             [1] Ending MPI_Comm_size
        [0] Starting MPI_Comm_rank...        [1] Starting MPI_Comm_rank...
        [0] Ending MPI_Comm_rank             [1] Ending MPI_Comm_rank
        [0] Starting MPI_Barrier...          [1] Starting MPI_Barrier...
        [0] Ending MPI_Barrier               [1] Ending MPI_Barrier
        Hello from 0 of 2!                   [1] Starting MPI_Barrier...
        [0] Starting MPI_Barrier...          [1] Ending MPI_Barrier
        [0] Ending MPI_Barrier               Hello from 1 of 2!
        [0] Starting MPI_Finalize...         [1] Starting MPI_Finalize...
        [0] Ending MPI_Finalize              [1] Ending MPI_Finalize
    35. 35. int MPI_Send(
          void *buf,               memory location to send from
          int count,               number of elements (of type datatype) at buf
          MPI_Datatype datatype,   MPI_INT, MPI_FLOAT, etc… or custom datatypes; strided vectors; structures, etc.
          int dest,                rank (within the communicator comm) of destination for this message
          int tag,                 used to distinguish this message from other messages
          MPI_Comm comm )          communicator for this transfer; often MPI_COMM_WORLD
    36. 36. int MPI_Recv(
          void *buf,               memory location to receive data into
          int count,               number of elements (of type datatype) available to receive into at buf
          MPI_Datatype datatype,   MPI_INT, MPI_FLOAT, etc… or custom datatypes; strided vectors; structures, etc. Typically matches sending datatype, but doesn’t have to…
          int source,              rank (within the communicator comm) of source for this message; can also be MPI_ANY_SOURCE
          int tag,                 used to distinguish this message from other messages; can also be MPI_ANY_TAG
          MPI_Comm comm,           communicator for this transfer; often MPI_COMM_WORLD
          MPI_Status *status )     structure describing the received message, including:
                                     actual count (can be smaller than passed count)
                                     source (useful if used with source = MPI_ANY_SOURCE)
                                     tag (useful if used with tag = MPI_ANY_TAG)
    37. 37. /* sr.c */
        #include <stdio.h>
        #include <mpi.h>

        #ifndef SENDSIZE
        #define SENDSIZE 1
        #endif

        int main (int argc, char * argv[] )
        {
          int i, rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
          MPI_Status sendStatus;

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &nodes);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          myData[0] = rank;
          /* Every rank sends to the next rank first, then receives from the previous one */
          MPI_Send(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0, MPI_COMM_WORLD);
          MPI_Recv(theirData, SENDSIZE, MPI_INT, (rank + nodes -1 ) % nodes, 0, MPI_COMM_WORLD, &sendStatus);

          printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

          MPI_Finalize();
          return 0;
        }
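The destination and source ranks in sr.c form a ring: each rank sends to the next rank and receives from the previous one, wrapping around at the ends. The sketch below isolates just that index arithmetic in plain C; the helper names ring_next and ring_prev are made up for the example and are not part of MPI.

```c
#include <assert.h>

/* Rank that `rank` sends to: its right-hand neighbour, wrapping to 0 */
int ring_next(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + 1) % nodes;
}

/* Rank that `rank` receives from: its left-hand neighbour.
 * Adding `nodes` before subtracting keeps the operand non-negative;
 * in C, (0 - 1) % nodes would evaluate to -1, not nodes - 1. */
int ring_prev(int rank, int nodes)
{
    assert(nodes > 0 && rank >= 0 && rank < nodes);
    return (rank + nodes - 1) % nodes;
}
```

With two processes, both helpers return the other rank, so each process exchanges data with its only peer, which is the case run on the following slides.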
    38. 38. $ mpicc -o sr sr.c
        $ mpiexec -np 2 ./sr
        0 sent 0; received 1
        1 sent 1; received 0
    39. 39. $ mpicc -o sr sr.c
        $ mpiexec -np 2 ./sr
        0 sent 0; received 1
        1 sent 1; received 0
        $ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
        $ mpiexec -np 2 ./sr
        0 sent 0; received 1
        1 sent 1; received 0
        $ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
        $ mpiexec -np 2 ./sr
        ^C
        $ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
        $ mpiexec -np 2 ./sr
        0 sent 0; received 1
        1 sent 1; received 0
    40. 40. 3.4 Communication Modes The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer. Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol. The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver. Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.
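    41. 41. [Figure: eager vs. rendezvous transfer between Process 1 (sender) and Process 2 (receiver). Small message: Process 1 sends the "small" message and returns (eager send); Process 2 requests and receives the small message (eager recv). Large message: Process 1 sends the "large" message and blocks until completion, issuing a rendezvous request; Process 2 requests the large message, the rendezvous request is matched, the data is sent (rendezvous send), and Process 2 receives the rendezvous data. Legend: user activity vs. MPI activity.]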
    42. 42.
        • MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
            • Sends are "local": they return independent of any remote activity
            • Message buffer can be touched immediately after call returns
            • Requires a user-provided buffer, provided via MPI_Buffer_attach()
            • Forces an "eager"-like message transfer from sender's perspective
            • User can wait for completion by calling MPI_Buffer_detach()
        • MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
            • Won't return until matching receive is posted
            • Forces a "rendezvous"-like message transfer
            • Can be used to guarantee synchronization without additional MPI_Barrier() calls
        • MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
            • Erroneous if matching receive has not been posted
            • Performance tweak (on some systems) when user can guarantee matching receive is posted
        • MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
            • Non-blocking, immediate return once send/receive request is posted
            • Requires MPI_[Test|Wait][|all|any|some] call to guarantee completion
            • Send/receive buffers should not be touched until completed
            • MPI_Request * argument used for eventual completion
        • The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
    43. 43.
        /* sr2.c */
        #include <stdio.h>
        #include <mpi.h>

        #ifndef SENDSIZE
        #define SENDSIZE 1
        #endif

        int main(int argc, char *argv[])
        {
            int rank, nodes, myData[SENDSIZE], theirData[SENDSIZE];
            MPI_Status xferStatus[2];
            MPI_Request xferRequest[2];

            MPI_Init(&argc, &argv);
            MPI_Comm_size(MPI_COMM_WORLD, &nodes);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            myData[0] = rank;

            /* Post the send to the next rank and the receive from the previous rank;
               neither call blocks. */
            MPI_Isend(myData, SENDSIZE, MPI_INT, (rank + 1) % nodes, 0,
                      MPI_COMM_WORLD, &xferRequest[0]);
            MPI_Irecv(theirData, SENDSIZE, MPI_INT, (rank + nodes - 1) % nodes, 0,
                      MPI_COMM_WORLD, &xferRequest[1]);

            /* Both buffers are off limits until the requests complete. */
            MPI_Waitall(2, xferRequest, xferStatus);

            printf("%i sent %i; received %i\n", rank, myData[0], theirData[0]);

            MPI_Finalize();
            return 0;
        }
    44. 44.
        $ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
        $ mpiexec -np 4 ./sr2
        0 sent 0; received 3
        2 sent 2; received 1
        1 sent 1; received 0
        3 sent 3; received 2
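The pairing in this output follows from the neighbour arithmetic in sr2.c: each rank sends to `(rank + 1) % nodes` and receives from `(rank + nodes - 1) % nodes`. A minimal sketch of that arithmetic, with illustrative function names that are not part of sr2.c:

```c
/* Rank that this process sends to in the ring. */
int ring_next(int rank, int nodes) { return (rank + 1) % nodes; }

/* Rank that this process receives from. Adding nodes before subtracting
   keeps the left operand of % non-negative, so rank 0 wraps to nodes - 1
   instead of producing -1. */
int ring_prev(int rank, int nodes) { return (rank + nodes - 1) % nodes; }
```

For four processes, `ring_prev(0, 4)` is 3, which matches the "0 sent 0; received 3" line above.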
    45. 45.
        • Motivation for distributed computing
        • What MPI is
        • Intro to MPI programming
        • Thinking in parallel
        • Wrap up
    46. 46.
        • Task parallelism
            • Each process handles a unique kind of task
                ▪ Example: multi-image uploader (with resize/recompress)
                ▪ Thread 1: GUI / user interaction
                ▪ Thread 2: file reader & decompression
                ▪ Thread 3: resize & recompression
                ▪ Thread 4: network communication
            • Can be used in a grid with a pipeline of separable tasks to be performed on each data set
                ▪ Resample / warp volume
                ▪ Segment volume
                ▪ Calculate metrics on segmented volume
    47. 47.
        • Data parallelism
            • Each process handles a portion of the entire data
            • Often used with large data sets
                ▪ [task 0 … | … task 1 … | … | … task n]
            • Frequently used in MPI programming
            • Each process is "doing the same thing," just on a different subset of the whole
    48. 48.
        • Layout is crucial in high-performance computing
            • BW efficiency; cache efficiency
        • Even more important in distributed computing
            • Poor layout → extra communication
        • Shown is an example of "block" data distribution
            • x is contiguous dimension
            • z is slowest dimension
            • Each node has contiguous portion of z
        [Figure: a volume with axes x, y, z, split along z into eight slabs labelled Node 0 through Node 7.]
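The "block" distribution described on this slide can be sketched as a small index calculation: each node owns a contiguous range of z slices, and because x is the contiguous dimension and z the slowest, that range is also one contiguous run of memory. The struct and function names below are illustrative, and the remainder handling (the first few ranks take one extra slice) is an assumption rather than something the slides specify.

```c
#include <stddef.h>

/* A contiguous run of z slices owned by one node. */
typedef struct {
    size_t z0;   /* first slice owned */
    size_t nz;   /* number of slices owned */
} Slab;

/* Split nz_total slices over `nodes` ranks as evenly as possible;
   ranks below the remainder get one extra slice. */
Slab block_slab(size_t nz_total, int nodes, int rank)
{
    size_t base  = nz_total / (size_t)nodes;
    size_t extra = nz_total % (size_t)nodes;
    size_t r     = (size_t)rank;
    Slab s;
    s.nz = base + (r < extra ? 1 : 0);
    s.z0 = r * base + (r < extra ? r : extra);
    return s;
}

/* Offset, in elements, of the slab's first voxel in an x-fastest,
   z-slowest array of nx * ny * nz_total elements. */
size_t slab_offset(Slab s, size_t nx, size_t ny)
{
    return s.z0 * nx * ny;
}
```

For a 409-slice volume on 8 nodes, rank 0 owns slices 0 to 51 and rank 7 owns slices 358 to 408, so each node can send or receive its share as a single contiguous buffer.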
    49. 49. [Figure: flowchart of the SENSE reconstruction pipeline. Inputs: DATA, CAL. Processing steps as labelled: FTx; Place view into correct x-Ky-Kz space (AP & LP); FTyz (AP & LP); "Traditional" 2D SENSE Unfold (AP & LP); Homodyne Correction; GW Correction (Y, Z); GW Correction (X); MIP; Display / DICOM. Output: RESULT. Legend: pre-loaded data, real-time data, MPI communication, root node, worker nodes.]
    50. 50.
        • Completely separable problems:
            • Add 1 to everyone
            • Multiply each a[i] * b[i]
        • Inseparable problems: [?]
            • Max of a vector
            • Sort a vector
            • MIP of a volume
            • 1D FFT of a volume
            • 2D FFT of a volume
            • 3D FFT of a volume
        [Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
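The "[?]" on this slide hints that some of these problems are only partly inseparable. Max of a vector is a useful example: each process can compute the maximum of its own portion independently, and only the per-process results need to be combined. The sketch below simulates that in a single process; in an MPI program the combine step would typically be a reduction with the predefined `MPI_MAX` operation. The function names are illustrative.

```c
#include <limits.h>
#include <stddef.h>

/* Maximum of one chunk; returns INT_MIN for an empty chunk. */
int local_max(const int *v, size_t n)
{
    int m = INT_MIN;
    for (size_t i = 0; i < n; ++i)
        if (v[i] > m)
            m = v[i];
    return m;
}

/* Split v into `nodes` contiguous chunks, take each chunk's maximum
   (the separable part), then combine the partial results (the part
   that needs communication in a distributed run). */
int max_by_chunks(const int *v, size_t n, size_t nodes)
{
    int global = INT_MIN;
    for (size_t r = 0; r < nodes; ++r) {
        size_t begin = r * n / nodes;
        size_t end   = (r + 1) * n / nodes;
        int partial  = local_max(v + begin, end - begin);
        if (partial > global)
            global = partial;
    }
    return global;
}
```

Only one integer per process crosses the network in this scheme, regardless of the vector length.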
    51. 51.
        • Dynamic datatypes
            • MPI_Type_vector()
            • Enables communication of sub-sets without packing
            • Combined with DMA, permits zero-copy transposes, etc.
        • Other collectives
            • MPI_Reduce
            • MPI_Scatter
            • MPI_Gather
        • MPI-2 (MPICH2, MVAPICH2)
            • One-sided (DMA) communication
                ▪ MPI_Put()
                ▪ MPI_Get()
            • Dynamic world size
                ▪ Ability to spawn new processes during run
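To make "communication of sub-sets without packing" concrete: a vector datatype built with `MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)` describes `count` blocks of `blocklength` elements whose starts are `stride` elements apart. The plain C loop below performs the manual packing that such a datatype replaces, which shows exactly which elements are selected. The function name `gather_vector` is hypothetical and is not an MPI call.

```c
#include <stddef.h>

/* Copy the elements a (count, blocklength, stride) vector layout selects
   from src into the dense buffer dst; returns the number of elements copied.
   stride is measured in elements, as in MPI_Type_vector. */
size_t gather_vector(const double *src, size_t count, size_t blocklength,
                     size_t stride, double *dst)
{
    size_t n = 0;
    for (size_t b = 0; b < count; ++b)
        for (size_t e = 0; e < blocklength; ++e)
            dst[n++] = src[b * stride + e];
    return n;
}
```

For a row-major 4x3 matrix `m`, `gather_vector(m + 1, 4, 1, 3, col)` extracts column 1. With a derived datatype, the same four elements could be sent directly from `m + 1` without the intermediate copy.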
    52. 52.
        • Motivation for distributed computing
        • What MPI is
        • Intro to MPI programming
        • Thinking in parallel
        • Wrap up
    53. 53.
        • Take time on the algorithm & data layout
            • Minimize traffic between nodes / separate problem
                ▪ FTx into xKyKz in SENSE example
            • Cache-friendly (linear, efficient) access patterns
        • Overlap processing and communication
            • MPI_Isend() / MPI_Irecv() with multiple work buffers
            • While actively transferring one, process the other
            • Larger messages will hit a higher BW (in general)
    54. 54.
        • Profile
            • VTune (Intel; Linux / Windows)
            • Shark (Mac)
            • MPI profiling with -mpe=mpilog
        • Avoid "premature optimization" (Knuth)
            • Implementation time & effort vs. runtime performance
        • Use derived datatypes rather than packing
        • Using a debugger with MPI is hard
            • Build in your own debugging messages from the get-go
    55. 55.
        • If you might need MPI, build to MPI.
            • Works well in shared memory environments
                ▪ It's getting better all the time
            • Encourages memory locality in NUMA architectures
                ▪ Nehalem, AMD
            • Portable, reusable, open-source
            • Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL: "Hybrid model of parallel programming"
            • Messaging paradigm can create "less obfuscated" code than threads / OpenMP
    56. 56.
        • Homogeneous nodes
        • Private network
            • Shared filesystem; ssh communication
            • Password-less SSH
        • High-bandwidth private interconnect
            • MPI communication exclusively
            • GbE, 10GbE
            • Infiniband
        • Consider using Rocks
            • CentOS / RHEL based
            • Built for building clusters
            • Rapid network boot based install/reinstall of nodes
    57. 57.
        • MPI documents
        • MPICH2
        • OpenMPI
        • MVAPICH[1|2] (Infiniband-tuned distribution)
        • Rocks
        • Books:
            • Pacheco, Peter S., Parallel Programming with MPI
            • Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
            • Gropp, W., Using MPI-2
    58. 58.
        • This is the painting operation for one RGBA pixel (in) onto another (out)
        • We can do red and blue together, as we know they won't collide, and we can mask out the unwanted results.
        • Post-multiply masks are applied in the shifted position to minimize the number of shift operations
        • Note: we're using pre-multiplied colors & painting onto an opaque background

        #include <stdint.h>

        #define RB      0x00FF00FFu
        #define RB_8OFF 0xFF00FF00u
        #define RGB     0x00FFFFFFu
        #define G       0x0000FF00u
        #define G_8OFF  0x00FF0000u
        #define A       0xFF000000u

        inline void blendPreToStatic(const uint32_t& in, uint32_t& out)
        {
            uint32_t alpha = in >> 24;
            if (alpha & 0x00000080u) ++alpha;
            out = A | RGB & (in +
                  ( ( (alpha * (out & RB) & RB_8OFF) |
                      (alpha * (out & G)  & G_8OFF) ) >> 8 ) );
        }
    59. 59. OUT = A | RGB & (IN + ( ( (ALPHA * (OUT & RB) & RB_8OFF) | (ALPHA * (OUT & G) & G_8OFF) ) >> 8 ) );
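The formula above can be traced on a single pixel. The sketch below is a plain C port of the same expression, using a pointer in place of the C++ reference and `MASK_`-prefixed constant names; both changes are made only for this illustration. As in the original, the top byte of `in` scales the background, so 0 leaves only the foreground and 255 (bumped to 256) keeps the background at full strength.

```c
#include <stdint.h>

#define MASK_RB      0x00FF00FFu
#define MASK_RB_8OFF 0xFF00FF00u
#define MASK_RGB     0x00FFFFFFu
#define MASK_G       0x0000FF00u
#define MASK_G_8OFF  0x00FF0000u
#define MASK_A       0xFF000000u

/* Paint one pre-multiplied RGBA pixel (in) onto an opaque pixel (*out). */
void blend_pre_to_static(uint32_t in, uint32_t *out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u)
        ++alpha;                       /* map 128..255 to 129..256 */

    /* Red and blue share one multiply; green gets its own. The masks are
       applied while the products are still shifted up by 8 bits. */
    uint32_t bg = ((alpha * (*out & MASK_RB) & MASK_RB_8OFF) |
                   (alpha * (*out & MASK_G)  & MASK_G_8OFF)) >> 8;

    *out = MASK_A | (MASK_RGB & (in + bg));
}
```

Worked by hand for `in = 0x80402010` over `out = 0xFF808080`: alpha becomes 129, each background channel 0x80 scales to 0x40, and adding the foreground channels 0x40, 0x20, 0x10 gives `0xFF806050`.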
    60. 60.
        • For cases where there is no overlap between the four output pixels for four input pixels, we can use vectorized (SSE2) code
        • 128-bit wide registers; load four 32-bit RGBA values, use the same approach as previously (R|B and G) in two registers to perform four paints at once
    61. 61.
        #include <stdint.h>
        #include <emmintrin.h>  // SSE2 intrinsics

        inline void blend4PreToStatic(uint32_t ** in, uint32_t * out) // Paints in (quad-word) onto out
        {
            __m128i rb, g, a, a_, o, mask_reg;             // Registers
            rb = _mm_loadu_si128((__m128i *) out);         // Load destination (unaligned -- may not be on a 128-bit boundary)
            a_ = _mm_load_si128((__m128i *) *in);          // We make sure the input is on a 128-bit boundary before this call
            *in += 4;
            _mm_prefetch((char*) (*in + 28), _MM_HINT_T0); // Fetch the two-cache-line-out memory
            mask_reg = _mm_set1_epi32(0x0000FF00);         // Set green mask (x4)
            g = _mm_and_si128(rb, mask_reg);               // Mask to greens (x4)
            mask_reg = _mm_set1_epi32(0x00FF00FF);         // Set red and blue mask (x4)
            rb = _mm_and_si128(rb, mask_reg);              // Mask to red and blue
            rb = _mm_slli_epi32(rb, 8);                    // << 8 ; g is already biased by 256 in 16-bit spacing
            a = _mm_srli_epi32(a_, 24);                    // >> 24 ; These are the four alpha values, shifted to lower 8 bits of each word
            mask_reg = _mm_slli_epi32(a, 16);              // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
            a = _mm_or_si128(a, mask_reg);                 // We now have the alpha value at both bits [0-7] and [16-23] of each word
            // These steps add one to transparency values >= 80
            o = _mm_srli_epi16(a, 7);                      // Now the high bit is the low bit
    62. 62.
            // We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
            // to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
            // storing the upper 16 of the 32 bit result. (This is the operation that is available, so that's why we're
            // doing it in this fashion!)
            rb = _mm_mulhi_epu16(rb, a);
            g = _mm_mulhi_epu16(g, a);
            g = _mm_slli_epi32(g, 8);                      // Move green into the correct location.
            // R and B, both the lower 8 bits of their 16 bits, don't need to be shifted
            o = _mm_set1_epi32(0xFF000000);                // Opaque alpha value
            o = _mm_or_si128(o, g);
            o = _mm_or_si128(o, rb);                       // o now has the background's contribution to the output color
            mask_reg = _mm_set1_epi32(0x00FFFFFF);
            g = _mm_and_si128(mask_reg, a_);               // Removes alpha from foreground color
            o = _mm_add_epi32(o, g);                       // Add foreground and background contributions together
            _mm_storeu_si128((__m128i *) out, o);          // Unaligned store
        }
    63. 63.
        • Vectorizing this code achieves 3-4x speedup on cluster
            • 8x 2x(3.4|3.2GHz) Xeon, 800MHz FSB
            • Render 512x512x409 (400MB) volume in
                ▪ ~22ms (45fps) (SIMD code)
                ▪ ~92ms (11fps) (Non-vectorized)
            • ~18GB/s memory throughput
            • ~11 cycles / voxel vs. ~45 cycles non-vectorized
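The figures on this slide can be cross-checked with a little arithmetic. The sketch below makes several assumptions that the slide does not state: the 8 nodes contribute 16 single-core CPUs in total, every CPU runs at 3.4 GHz (some are listed at 3.2 GHz), each voxel is 4 bytes (one RGBA value), and the work is spread evenly. The function names are illustrative.

```c
/* Number of voxels in an nx x ny x nz volume. */
double voxel_count(int nx, int ny, int nz)
{
    return (double)nx * ny * nz;
}

/* CPU cycles spent per voxel when `cpus` processors, each at `hz`,
   together render `voxels` voxels in `seconds`. */
double cycles_per_voxel(double seconds, double hz, int cpus, double voxels)
{
    return seconds * hz * cpus / voxels;
}

/* Aggregate volume data read per second, in bytes. */
double bytes_per_second(double voxels, double bytes_per_voxel, double seconds)
{
    return voxels * bytes_per_voxel / seconds;
}
```

Under those assumptions, `cycles_per_voxel(0.022, 3.4e9, 16, voxel_count(512, 512, 409))` comes to about 11.2, the 92 ms case to about 46.7, and `bytes_per_second(voxel_count(512, 512, 409), 4.0, 0.022)` to about 19.5e9 bytes/s, or roughly 18 GiB/s. These are in line with the slide's ~11 cycles, ~45 cycles and ~18GB/s.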
    64. 64. MPI_Init(3)    MPI    MPI_Init(3)
        NAME
            MPI_Init - Initialize the MPI execution environment
        SYNOPSIS
            int MPI_Init( int *argc, char ***argv )
        INPUT PARAMETERS
            argc - Pointer to the number of arguments
            argv - Pointer to the argument vector
        THREAD AND SIGNAL SAFETY
            This routine must be called by one thread only. That thread is called the main thread and must be the thread that calls MPI_Finalize.
        NOTES
            The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
    65. 65. MPI_Barrier(3)    MPI    MPI_Barrier(3)
        NAME
            MPI_Barrier - Blocks until all processes in the communicator have reached this routine.
        SYNOPSIS
            int MPI_Barrier( MPI_Comm comm )
        INPUT PARAMETER
            comm - communicator (handle)
        NOTES
            Blocks the caller until all processes in the communicator have called it; that is, the call returns at any process only after all members of the communicator have entered the call.
    66. 66. MPI_Finalize(3)    MPI    MPI_Finalize(3)
        NAME
            MPI_Finalize - Terminates MPI execution environment
        SYNOPSIS
            int MPI_Finalize( void )
        NOTES
            All processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to perform much more than a return rc after calling MPI_Finalize.
    67. 67. MPI_Comm_size(3)    MPI    MPI_Comm_size(3)
        NAME
            MPI_Comm_size - Determines the size of the group associated with a communicator
        SYNOPSIS
            int MPI_Comm_size( MPI_Comm comm, int *size )
        INPUT PARAMETER
            comm - communicator (handle)
        OUTPUT PARAMETER
            size - number of processes in the group of comm (integer)
    68. 68. MPI_Comm_rank(3)    MPI    MPI_Comm_rank(3)
        NAME
            MPI_Comm_rank - Determines the rank of the calling process in the communicator
        SYNOPSIS
            int MPI_Comm_rank( MPI_Comm comm, int *rank )
        INPUT ARGUMENT
            comm - communicator (handle)
        OUTPUT ARGUMENT
            rank - rank of the calling process in the group of comm (integer)
    69. 69. MPI_Send(3)    MPI    MPI_Send(3)
        NAME
            MPI_Send - Performs a blocking send
        SYNOPSIS
            int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
        INPUT PARAMETERS
            buf - initial address of send buffer (choice)
            count - number of elements in send buffer (nonnegative integer)
            datatype - datatype of each send buffer element (handle)
            dest - rank of destination (integer)
            tag - message tag (integer)
            comm - communicator (handle)
        NOTES
            This routine may block until the message is received by the destination process.
    70. 70. MPI_Recv(3)    MPI    MPI_Recv(3)
        NAME
            MPI_Recv - Blocking receive for a message
        SYNOPSIS
            int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
        OUTPUT PARAMETERS
            buf - initial address of receive buffer (choice)
            status - status object (Status)
        INPUT PARAMETERS
            count - maximum number of elements in receive buffer (integer)
            datatype - datatype of each receive buffer element (handle)
            source - rank of source (integer)
            tag - message tag (integer)
            comm - communicator (handle)
        NOTES
            The count argument indicates the maximum length of a message; the actual length of the message can be determined with MPI_Get_count.
    71. 71. MPI_Isend(3)    MPI    MPI_Isend(3)
        NAME
            MPI_Isend - Begins a nonblocking send
        SYNOPSIS
            int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
        INPUT PARAMETERS
            buf - initial address of send buffer (choice)
            count - number of elements in send buffer (integer)
            datatype - datatype of each send buffer element (handle)
            dest - rank of destination (integer)
            tag - message tag (integer)
            comm - communicator (handle)
        OUTPUT PARAMETER
            request - communication request (handle)
    72. 72. MPI_Irecv(3)    MPI    MPI_Irecv(3)
        NAME
            MPI_Irecv - Begins a nonblocking receive
        SYNOPSIS
            int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
        INPUT PARAMETERS
            buf - initial address of receive buffer (choice)
            count - number of elements in receive buffer (integer)
            datatype - datatype of each receive buffer element (handle)
            source - rank of source (integer)
            tag - message tag (integer)
            comm - communicator (handle)
        OUTPUT PARAMETER
            request - communication request (handle)
    73. 73. MPI_Bcast(3)    MPI    MPI_Bcast(3)
        NAME
            MPI_Bcast - Broadcasts a message from the process with rank "root" to all other processes of the communicator
        SYNOPSIS
            int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
        INPUT/OUTPUT PARAMETER
            buffer - starting address of buffer (choice)
        INPUT PARAMETERS
            count - number of entries in buffer (integer)
            datatype - data type of buffer (handle)
            root - rank of broadcast root (integer)
            comm - communicator (handle)
    74. 74. MPI_Allreduce(3)    MPI    MPI_Allreduce(3)
        NAME
            MPI_Allreduce - Combines values from all processes and distributes the result back to all processes
        SYNOPSIS
            int MPI_Allreduce( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
        INPUT PARAMETERS
            sendbuf - starting address of send buffer (choice)
            count - number of elements in send buffer (integer)
            datatype - data type of elements of send buffer (handle)
            op - operation (handle)
            comm - communicator (handle)
        OUTPUT PARAMETER
            recvbuf - starting address of receive buffer (choice)
    75. 75. MPI_Type_create_hvector(3)    MPI    MPI_Type_create_hvector(3)
        NAME
            MPI_Type_create_hvector - Create a datatype with a constant stride given in bytes
        SYNOPSIS
            int MPI_Type_create_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
        INPUT PARAMETERS
            count - number of blocks (nonnegative integer)
            blocklength - number of elements in each block (nonnegative integer)
            stride - number of bytes between start of each block (address integer)
            oldtype - old datatype (handle)
        OUTPUT PARAMETER
            newtype - new datatype (handle)
    76. 76. mpicc(1)    MPI    mpicc(1)
        NAME
            mpicc - Compiles and links MPI programs written in C
        DESCRIPTION
            This command can be used to compile and link MPI programs written in C. It provides the options and any special libraries that are needed to compile and link MPI programs. It is important to use this command, particularly when linking programs, as it provides the necessary libraries.
        COMMAND LINE ARGUMENTS
            -show - Show the commands that would be used without running them
            -help - Give short help
            -cc=name - Use compiler name instead of the default choice. Use this only if the compiler is compatible with the MPICH library (see below)
            -config=name - Load a configuration file for a particular compiler. This allows a single mpicc command to be used with multiple compilers.
            […]
    77. 77. mpiexec(1)    MPI    mpiexec(1)
        NAME
            mpiexec - Run an MPI program
        SYNOPSIS
            mpiexec args executable pgmargs [ : args executable pgmargs ... ]
            where args are command line arguments for mpiexec (see below), executable is the name of an executable MPI program, and pgmargs are command line arguments for the executable. Multiple executables can be specified by using the colon notation (for MPMD - Multiple Program Multiple Data applications). For example, the following command will run the MPI program a.out on 4 processes:
                mpiexec -n 4 a.out
            The MPI standard specifies the following arguments and their meanings:
            -n <np> - Specify the number of processes to use
            -host <hostname> - Name of host on which to run processes
            -arch <architecture name> - Pick hosts with this architecture type
            […]