MPI History

The history and development of the MPI Standard, by Jesper Larsson Träff.

Transcript

  • 1. History and Development of the MPI Standard. Jesper Larsson Träff, Vienna University of Technology, Faculty of Informatics, Institute of Information Systems, Research Group Parallel Computing, Favoritenstraße 16, 1040 Wien. www.par.tuwien.ac.at
  • 2. 21 September 2012, MPI Forum meeting in Vienna: MPI 3.0 has just been released... but MPI has a long history, and it is instructive to look at it. www.mpi-forum.org
  • 3. "Those who cannot remember the past are condemned to repeat it", George Santayana, The Life of Reason, 1905-1906. "History always repeats itself twice: first time as tragedy, second time as farce", Karl Marx. "History is written by the winners", George Orwell, 1944 (but he quotes someone else).
  • 4. The last quote: "history" depends. Who tells it, and why? What information is available? What's at stake? My stake: convinced of MPI as a well-designed and extremely useful standard that has posed productive research/development problems with broader parallel computing relevance; critical of the current standardization effort, MPI 3.0; MPI implementer 2000-2010 with NEC; MPI Forum member 2008-2010 (with Hubert Ritzdorf, representing NEC); voted "no" to MPI 2.2.
  • 5. A long debate: shared memory vs. distributed memory. Question: what should a parallel machine look like? The answer depends: What are your concerns? What is desirable? What is feasible? This has caused debate since (at least) the 1970s and 1980s.
  • 6. Hoare/Dijkstra: parallel programs shall be structured as collections of communicating, sequential processes. Their concern: CORRECTNESS. Wyllie, Vishkin: a parallel algorithm is like a collection of synchronized sequential algorithms that access a common shared memory, and the machine is a PRAM. Their concern: (asymptotic) PERFORMANCE. And, of course, PERFORMANCE: many, many practitioners. And, of course, CORRECTNESS: Hoare semantics.
  • 7. Hoare/Dijkstra: parallel programs shall be structured as collections of communicating, sequential processes. Wyllie, Vishkin: a parallel algorithm is like a collection of synchronized sequential algorithms that access a common shared memory, and the machine is a PRAM. [Fortune, Wyllie: Parallelism in Random Access Machines. STOC 1978: 114-118] [Shiloach, Vishkin: Finding the Maximum, Merging, and Sorting in a Parallel Computation Model. Jour. Algorithms 2(1): 88-102, 1981] [C. A. R. Hoare: Communicating Sequential Processes. Comm. ACM 21(8): 666-677, 1978]
  • 8. Hoare/Dijkstra: communicating sequential processes. Wyllie, Vishkin (and many, many practitioners, Burton Smith, ...): synchronized sequential algorithms accessing a common shared memory, the PRAM. Neither perhaps cared too much about how to build machines (in the beginning)...
  • 9. ...but others (fortunately) did. The INMOS transputer T400, T800, from ca. 1985: a complete architecture entirely based on the CSP idea, with an original programming language, OCCAM (1983, 1987). Parsytec (ca. 1988-1995).
  • 10. Intel iPSC/2, ca. 1990; Intel Paragon, ca. 1992; IBM SP/2, ca. 1996; Thinking Machines CM5, ca. 1994.
  • 11. MasPar (1987-1996) MP2; Thinking Machines (1982-94) CM2, CM5; KSR 2, ca. 1992; Denelcor HEP, mid-1980s.
  • 12. Ironically... despite the algorithmically stronger properties of shared-memory models (like the PRAM) and their potential for scaling to much, much larger numbers of processors, in practice high-performance systems with (quite) substantial parallelism have all been distributed-memory systems, and the corresponding de facto standard, MPI (the Message-Passing Interface), is much stronger than (say) OpenMP.
  • 13. Sources of MPI: the early years. Commercial vendors and national laboratories (including many European ones) needed practically working programming support for their machines and applications. The early 90s were fruitful years for practical parallel computing (funding for "grand challenge" and "star wars"). Vendors and labs proposed and maintained their own languages, interfaces, and libraries for parallel programming (early 90s): Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, ...
  • 14. Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, ... were intended for distributed-memory machines and centered around similar concepts; similar enough to warrant an effort towards creating a common standard for message-passing based parallel programming. Portability problem: wasted effort in maintaining an own interface for a small user group, lack of portability across systems.
  • 15. Message-passing interfaces/languages of the early 90s. Intel NX: send-receive message passing (non-blocking, buffering?), tags (tag groups?), no group concept, some collectives, weak encapsulation. IBM EUI: point-to-point and collectives (more than in MPI), group concept, high performance (??) [Snir et al.]. IBM CCL: point-to-point and collectives, encapsulation. Zipcode/Express: point-to-point, emphasis on library building [Skjellum]. PARMACS/Express: point-to-point, topological mapping [Hempel]. PVM: point-to-point communication, some collectives, virtual machine abstraction, fault tolerance.
  • 16. Some odd men out. Linda: tuple space get/put, a first PGAS approach? Active messages: seems to presuppose an SPMD model? OCCAM: too strictly CSP-based, synchronous message passing? PVM: heterogeneous systems, fault tolerance, ... [Hempel, Hey, McBryan, Walker: Special Issue: Message Passing Interfaces. Parallel Computing 29(4), 1994]
  • 17. Standardization: the MPI Forum and MPI 1.0. [Hempel, Walker: The emergence of the MPI message passing standard for parallel computing. Computer Standards & Interfaces, 21: 51-62, 1999] A standardization effort was started in early 1992; key people: Dongarra, Hempel, Hey, Walker. Goal: to come out within a few years with a standard for message-passing parallel programming, building on lessons learned from existing interfaces/languages. Not a research effort (as such)! Open to participation by all interested parties.
  • 18. Key technical design points. MPI should encompass and enable: basic message passing and related functionality (collective communication!); library building: safe encapsulation of messages (and other things, e.g. query functionality); high performance, across all available and future systems!; scalable design; support for C and Fortran.
  • 19. The MPI Forum. Not an ANSI/IEEE standardization body; nobody "owns" the MPI standard; "free". Open to participation by all interested parties; protocols open (votes, email discussions). Regular meetings at 6-8 week intervals. Those who participate at meetings (with a history) can vote, one vote per organization (current discussion: quorum, semantics of abstaining). The first MPI Forum set out to work in early 1993.
  • 20. After 7 meetings, the first version of the MPI standard was ready in early 1994, with two finalizing meetings in February 1994. MPI: A Message-Passing Interface Standard, May 5th, 1994. The standard is the 226-page PDF document, as voted by the MPI Forum, that can be found at www.mpi-forum.org. Errata and minor adjustments: MPI 1.0, 1.1, 1.2: 1994-1995.
  • 21. Take note: the MPI 1 standardization process was accompanied hand in hand by a(n amazingly good) prototype implementation: mpich from Argonne National Laboratory (Gropp, Lusk, ...). [W. Gropp, E. L. Lusk, N. E. Doss, A. Skjellum: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing 22(6): 789-828, 1996] Other parties and vendors could build on this implementation (and did!), so that MPI was quickly supported on many parallel systems.
  • 22. Why MPI has been successful: an appreciation. MPI made some fundamental abstractions, but is still close enough to common architectures to allow efficient, low-overhead implementations ("MPI is the assembler of parallel computing..."); is formulated with care and precision, but is not a formal specification; is complete (to a high degree), based on few, powerful, largely orthogonal key concepts (few exceptions, few optionals); and made few mistakes.
  • 23. The message-passing abstraction. Entities: MPI processes, which can communicate through a communication medium. Processes can be implemented as "processes" (most MPI implementations), "threads", ... The communication medium is a concrete network, ..., the nature of which is of no concern to the MPI standard: no explicit requirements on network structure or capabilities, no performance model or requirements.
  • 24. Basic message passing: point-to-point communication. Process i: MPI_Send(&data,count,type,j,tag,comm); process j: MPI_Recv(&data,count,type,i,tag,comm,&status); Only processes in the same communicator, a ranked set of processes with a unique "context", can communicate. Fundamental library-building concept: isolates communication in library routines from application communication.
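A minimal complete program along these lines (a sketch, not from the slides; the buffer contents and tag value are made up, and the standard C bindings are assumed):

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point example: rank 0 sends one integer to rank 1. */
int main(int argc, char *argv[])
{
    int rank, data = 42, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&data, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
```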
  • 25. Basic message passing: point-to-point communication. The receiving process blocks until the data have been transferred. The MPI implementation must ensure reliable transmission; there is no timeout (see RT-MPI). Semantics: messages from the same sender are delivered in order; it is possible to write fully deterministic programs.
  • 26. The receiving process blocks until the data have been transferred; the sending process may block or not. This is not synchronous communication (as in CSP; closest to that is the synchronous MPI_Ssend). Semantics: upon return, the data buffer can safely be reused.
  • 27. Non-blocking communication: MPI_Isend(&data,count,type,j,tag,comm,&req); MPI_Irecv(&data,count,type,i,tag,comm,&req); The receiving process returns immediately; the data buffer must not be touched until completion. Explicit completion: MPI_Wait(&req,&status), ... Design principle: the MPI specification shall not enforce internal buffering; all communication memory is in user space.
  • 28. Non-blocking communication: MPI_Isend/MPI_Irecv; explicit completion: MPI_Wait(&req,&status), ... Design choice: no progress rule; communication will/must eventually happen.
  • 29. Completeness: MPI_Send, MPI_Isend, MPI_Issend, ..., MPI_Recv, MPI_Irecv can be combined, and the semantics make sense.
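As an illustration of the non-blocking calls (a sketch under the assumption of a single neighbor rank; the function and variable names are invented), a process can post both operations, do independent work, and then complete them explicitly:

```c
#include <mpi.h>

/* Sketch: exchange one integer with a neighbor while overlapping local work. */
void exchange_with_neighbor(int neighbor, int *sendval, int *recvval)
{
    MPI_Request reqs[2];

    /* Post the receive first, then the send; neither call blocks. */
    MPI_Irecv(recvval, 1, MPI_INT, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendval, 1, MPI_INT, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation that touches neither buffer ... */

    /* Explicit completion: the buffers may be reused or read only after this. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```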
  • 30. An MPI_Datatype describes the structure of the communication data buffer: base types MPI_INT, MPI_DOUBLE, ..., and recursively applicable type constructors.
  • 31. Orthogonality: any MPI_Datatype can be used in any communication operation. Semantics: only the signatures of the data sent and the data received must match (performance!).
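For example (a sketch using the standard type constructors; the matrix dimensions and function name are illustrative, not from the slides), a strided column of a row-major matrix can be described once and then sent in a single call:

```c
#include <mpi.h>

/* Sketch: describe one column of a row-major 100 x 50 double matrix
   as a derived datatype and send it with a single MPI_Send. */
void send_column(double A[100][50], int col, int dest)
{
    MPI_Datatype column_type;

    /* 100 blocks of 1 double each, stride 50 doubles between blocks. */
    MPI_Type_vector(100, 1, 50, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    MPI_Send(&A[0][col], 1, column_type, dest, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column_type);
}
```

Because only the signatures must match, the receiver could receive this message into 100 contiguous MPI_DOUBLEs.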
  • 32. Other functionality (supporting library building): attributes to describe MPI objects (communicators, datatypes); query functionality for MPI objects (MPI_Status); error handlers to influence behavior on errors; MPI_Group's for manipulating ordered sets of processes.
  • 33. Often heard objections/complaints: "MPI is too large", "MPI is the assembler of parallel computing", ... And two answers: "MPI is designed not to make easy things easy, but to make difficult things possible" (Gropp, EuroPVM/MPI 2004). Conjecture (tested at EuroPVM/MPI 2002): for any MPI feature there will be at least one (significant) user depending essentially on exactly this feature.
  • 34. Collective communication: patterns of process communication. Fundamental, well-studied, and useful parallel communication patterns captured in MPI 1.0 as so-called collective operations: MPI_Barrier(comm); MPI_Bcast(...,comm); MPI_Gather(...,comm); MPI_Scatter(...,comm); MPI_Allgather(...,comm); MPI_Alltoall(...,comm); MPI_Reduce(...,comm); MPI_Allreduce(...,comm); MPI_Reduce_scatter(...,comm); MPI_Scan(...,comm). Semantics: all processes in comm participate; blocking; no tags.
  • 35. The same collective operations, completeness: MPI_Bcast is the dual of MPI_Reduce; MPI_Gather is the dual of MPI_Scatter. Regular and irregular (vector) variants.
  • 36. The irregular (vector) variants: MPI_Gatherv(...,comm); MPI_Scatterv(...,comm); MPI_Allgatherv(...,comm); MPI_Alltoallv(...,comm); MPI_Reduce_scatter(...,comm).
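A small hedged example of how two of these collectives combine in practice (the variable names and the "computation" are illustrative only):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: broadcast a parameter from rank 0, then reduce a local result. */
int main(int argc, char *argv[])
{
    int rank, size, n = 0;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;          /* problem size chosen at the root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = (double)n / size;         /* stand-in for a local computation */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("global = %f\n", global);
    MPI_Finalize();
    return 0;
}
```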
  • 37. Collectives capture complex patterns, often with non-trivial algorithms and implementations: delegate work to the library implementer, save work for the application programmer. Obligation: the MPI implementation must be of sufficiently high quality, otherwise application programmers will not use the collectives or will implement their own. This did (and does) happen! Likewise for datatypes: unused for a long time.
  • 38. Completeness: MPI makes it possible to (almost) implement the MPI collectives "on top of" MPI point-to-point communication. Some exceptions for reductions (MPI_Op) and datatypes.
  • 39. Conjecture: well-implemented collective operations contribute significantly towards application "performance portability". [Träff, Gropp, Thakur: Self-Consistent MPI Performance Guidelines. IEEE TPDS 21(5): 698-709, 2010]
  • 40. Three algorithms for matrix-vector multiplication. An m x n matrix A and an n-element vector y are distributed evenly across p MPI processes: compute z = Ay. Algorithm 1: row-wise matrix distribution; each process needs the full vector: MPI_Allgather(v); compute a block of the result vector locally.
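A sketch of Algorithm 1 for the regular case where p divides both m and n (not the code from the slides; the function and argument names are invented for illustration):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of Algorithm 1: row-wise distribution, MPI_Allgather of the vector.
   A_local holds m/p rows of the m x n matrix, y_local holds n/p elements of
   the input vector, z_local receives the m/p local results. */
void matvec_rowwise(const double *A_local, double *y_local,
                    double *z_local, int m, int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    int rows = m / p, chunk = n / p;
    double *y = malloc(n * sizeof(double));

    /* Every process obtains the full input vector. */
    MPI_Allgather(y_local, chunk, MPI_DOUBLE, y, chunk, MPI_DOUBLE, comm);

    /* Local dense matrix-vector product on the owned block of rows. */
    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * y[j];
        z_local[i] = sum;
    }
    free(y);
}
```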
  • 41. Algorithm 2: column-wise matrix distribution; compute a local partial result vector; MPI_Reduce_scatter to sum and distribute the partial results.
  • 42. Algorithm 3: matrix distribution into blocks of m/r x n/c elements; Algorithm 1 on the columns of the process grid; Algorithm 2 on the rows.
  • 43. Algorithm 3 (p = rc): MPI_Allgather within process columns, MPI_Reduce_scatter within process rows. Algorithm 3 is more scalable. Partitioning the set of processes (new communicators) is essential! Interfaces that do not support collectives on subsets of processes cannot express Algorithm 3: case in point, UPC.
  • 44. For the "regular" case where p divides n (and p = rc): regular collectives MPI_Allgather, MPI_Reduce_scatter. For the "irregular" case: irregular collectives MPI_Allgatherv, MPI_Reduce_scatter. MPI 1.0 defined regular/irregular versions (completeness) for all the considered collective patterns, except for MPI_Reduce_scatter. Performance: the irregular collectives subsume their regular counterparts, but much better algorithms are known for the regular ones.
  • 45. A lesson: dense linear algebra and (regular) collective communication as offered by MPI go hand in hand. Note: most of these collective communication algorithms are a factor 2 off from best possible. [R. A. van de Geijn, J. Watts: SUMMA: scalable universal matrix multiplication algorithm. Concurrency - Practice and Experience 9(4): 255-274, 1997] [E. Chan, M. Heimlich, A. Purkayastha, R. A. van de Geijn: Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19(13): 1749-1783, 2007] [F. G. van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, G. Quintana-Ortí: The libflame Library for Dense Matrix Computations. Computing in Science and Engineering 11(6): 56-63, 2009]
  • 46. Another example: integer (bucket) sort. n integers in a given range [0,R-1], distributed evenly across p MPI processes: m = n/p integers per process (local array A). Step 1: bucket sort locally; let B[i] be the number of local elements with key i. Step 2: MPI_Allreduce(B,AllB,R,MPI_INT,MPI_SUM,comm); Step 3: MPI_Exscan(B,RelB,R,MPI_INT,MPI_SUM,comm); Now element A[j] needs to go to position AllB[A[j]-1]+RelB[A[j]]+j'.
  • 47. Step 4: compute the number of elements to be sent to each other process, sendelts[i], i = 0,...,p-1. Step 5: MPI_Alltoall(sendelts,1,MPI_INT,recvelts,1,MPI_INT,comm); Step 6: redistribute the elements: MPI_Alltoallv(A,sendelts,sdispls,...,comm);
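A hedged sketch of the counting and redistribution steps (steps 2-6; the local bucketing, the derivation of the send counts, and the final placement are elided, and the array names follow the slides while the function itself is invented):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch of steps 2-6 of the bucket sort. A[] holds the local keys, B[] the
   local per-key counts (R buckets); sendelts/sdispls describe what goes to
   each process, recvbuf/recvelts/rdispls what is received. */
void redistribute(int *A, int *B, int R,
                  int *sendelts, int *sdispls,
                  int *recvbuf, int *recvelts, int *rdispls, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    int *AllB = malloc(R * sizeof(int));   /* global count per key */
    int *RelB = malloc(R * sizeof(int));   /* counts on lower-ranked processes */

    /* Step 2 and Step 3 from the slide (RelB is undefined on rank 0). */
    MPI_Allreduce(B, AllB, R, MPI_INT, MPI_SUM, comm);
    MPI_Exscan(B, RelB, R, MPI_INT, MPI_SUM, comm);

    /* Step 4 (elided): derive sendelts[i] and sdispls[i] from AllB and RelB. */

    /* Step 5: each process learns how many elements it will receive. */
    MPI_Alltoall(sendelts, 1, MPI_INT, recvelts, 1, MPI_INT, comm);

    /* Receive displacements follow as prefix sums of the receive counts. */
    rdispls[0] = 0;
    for (int i = 1; i < p; i++) rdispls[i] = rdispls[i - 1] + recvelts[i - 1];

    /* Step 6: redistribute the keys themselves. */
    MPI_Alltoallv(A, sendelts, sdispls, MPI_INT,
                  recvbuf, recvelts, rdispls, MPI_INT, comm);

    free(AllB);
    free(RelB);
}
```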
  • 48. The algorithm is stable, so it can serve as the basis for radix sort. The choice of radix R depends on properties of the network (fully connected, fat tree, mesh/torus, ...) and on the quality of the reduction/scan algorithms. The algorithm is portable (by virtue of the MPI collectives), but tuning depends on the system; a concrete performance model is needed, but this is outside the scope of MPI. Note: on a strong network, T(MPI_Allreduce(m)) = O(m + log p), NOT O(m log p).
  • 49. A last feature. Process topologies: specify the application communication pattern (as either a directed graph or a Cartesian grid) to the MPI library, and let the library assign processes to processors so as to improve communication following the specified pattern. MPI version: collective communicator construction functions; process ranks in the new communicator represent the new (improved) mapping. And a very last one: (simple) tool-building support, the MPI profiling interface.
  • 50. The mistakes: MPI_Cancel(), semantically ill-defined, difficult to implement; a concession to RT? MPI_Rsend(); vendors got too much leverage? MPI_Pack/Unpack, added as an afterthought in the last 1994 meetings. Some functions enforce a full copy of argument (list)s into the library.
  • 51. The mistakes, revisited: MPI_Cancel(), semantically ill-defined, difficult to implement; a concession to RT? (but useful for certain patterns, e.g. double buffering, client-server-like, ...). MPI_Rsend(); vendors got too much leverage? (but advantageous in some scenarios). MPI_Pack/Unpack, added as an afterthought in the last 1994 meetings (the functionality is useful/needed, but the specification has limitations). Some functions enforce a full copy of argument (list)s into the library.
  • 52. Missing functionality: datatype query functions (not possible to query/reconstruct the structure specified by a given datatype); some MPI objects are not first-class citizens (MPI_Aint, MPI_Op, MPI_Datatype), which makes it difficult to build certain types of libraries; reductions cannot be performed locally.
  • 53. Is MPI scalable? Definition: an MPI construct is non-scalable if its memory or time overhead(*) is Ω(p), p the number of processes. ((*) overhead that cannot be accounted for in the application.) Questions: are there aspects of the MPI specification that are non-scalable (force Ω(p) memory or time)? Are there aspects of (typical) MPI implementations that are non-scalable? The question must distinguish between specification and implementation.
  • 54. The answer is "yes" to both questions. Example: irregular collective all-to-all communication (each process exchanges some data with each other process). MPI_Alltoallw(sendbuf,sendcounts[],senddispls[],sendtypes[],recvbuf,recvcounts[],recvdispls[],recvtypes[],...) takes 6 p-sized arrays (of 4- or 8-byte integers), ~5 MBytes, 10% of the memory on BlueGene/L. Sparse usage pattern: often each process exchanges with only a few neighbors, so most send/recvcounts[i] = 0. MPI_Alltoallw is non-scalable.
  • 55. Experiment (Argonne National Laboratory BlueGene/L): sendcounts[i] = 0, recvcounts[i] = 0 for all processes and all i, which entails no communication. [Balaji, ..., Träff: MPI on millions of cores. Parallel Processing Letters 21(1): 45-60, 2011]
  • 56. Definitely non-scalable features in MPI 1.0: irregular collectives (p-sized lists of counts, displacements, types); the graph topology interface, which requires specification of the full process topology (communication graph) by all processes. (The Cartesian topology interface is perfectly scalable, and much used.)
  • 57. MPI 2: what (almost) went wrong. A number of issues/desired functionalities were left open by MPI 1.0, either because of no agreement or because of the deadline and the desire to get a consolidated standard out in time. Major open issues (parallel IO, one-sided communication, dynamic process management) were partly described in the so-called JOD, the "Journal of Development" (see www.mpi-forum.org). The challenge from PVM; the challenge from SHMEM...
  • 58. The MPI Forum started to reconvene already in 1995. Between 1995 and 1997 there were 16 meetings, which led to MPI 2.0 (MPI 1.0: 226 pages; MPI 2.0: 356 additional pages). Major new features, with new concepts: extended message-passing models: 1. dynamic process management, 2. one-sided communication, 3. MPI-IO.
  • 59. 1. Dynamic process management. MPI 1.0 was completely static: a communicator cannot change (design principle: no MPI object can change; new objects can be created and old ones destroyed), so the number of processes in MPI_COMM_WORLD cannot change; therefore it is not possible to add or remove processes from a running application. MPI 2.0 process management relies on inter-communicators (from MPI 1.0) to establish communication with newly started processes or already running applications: MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, MPI_Intercomm_merge.
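A hedged sketch of spawning workers and merging the resulting inter-communicator (the executable name "worker" and the process count are made up):

```c
#include <mpi.h>

/* Sketch: the parent spawns 4 copies of a hypothetical "worker" executable
   and merges the resulting inter-communicator into one intra-communicator. */
void spawn_workers(MPI_Comm *allcomm)
{
    MPI_Comm intercomm;

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* high = 0: the parent group is ordered first in the merged communicator. */
    MPI_Intercomm_merge(intercomm, 0, allcomm);
}
```

On the worker side, MPI_Comm_get_parent returns the matching inter-communicator.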
  • 60. What if a process (in a communicator) dies? The fault-tolerance problem. Most (all) MPI implementations also die, but this may be an implementation issue.
  • 61. If the implementation does not die, it might be possible to program around or isolate faults using MPI 1.0 error handlers and inter-communicators. [W. Gropp, E. Lusk: Fault Tolerance in Message Passing Interface Programs. IJHPCA 18(3): 363-372, 2004]
  • 62. The issue is contentious (and contagious)...
  • 63. 2. One-sided communication. Motivations/arguments: expressivity/convenience: for applications where only one process may readily know with which process to communicate data, the point-to-point message-passing model may be inconvenient; performance: on some architectures point-to-point communication can be inefficient, e.g. if shared memory is available. Challenge: define a model that captures the essence of one-sided communication but can be implemented without requiring specific hardware support.
  • 64. New MPI 2.0 concepts: communication window, communication epoch. The MPI one-sided model cleanly separates communication from synchronization; three specific synchronization mechanisms: MPI_Win_fence, MPI_Win_start/complete/post/wait, MPI_Win_lock/unlock, with cleverly thought-out semantics and memory model.
  • 65. Unfortunately, application programmers did not seem to like it: "too complicated", "too rigid", "not efficient", ...
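For reference, a minimal fence-synchronized put in the MPI 2.0 style (a sketch assuming at least two processes; the window contents and values are illustrative):

```c
#include <mpi.h>

/* Sketch: every rank exposes one integer in a window; rank 0 writes a value
   into rank 1's window inside a fence-synchronized epoch. */
void one_sided_example(void)
{
    int rank, exposed = 0, value = 42;
    MPI_Win win;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&exposed, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access/exposure epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* close the epoch: the put is complete */

    /* On rank 1, 'exposed' now holds 42. */
    MPI_Win_free(&win);
}
```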
  • 66. 3. MPI-IO. Communication with external (disk/file) memory. Could leverage MPI concepts and implementations: datatypes to describe file structure; collective communication for utilizing local file systems; fast communication. The MPI datatype mechanism is essential, and the power of this concept starts to become clear. MPI 2.0 introduces (inelegant!) functionality to decode a datatype, i.e. to discover the structure it describes; needed for MPI-IO implementations (on top of MPI) and supporting library building.
  • 67. Take note: apart from MPI-IO (ROMIO), the MPI 2.0 standardization process was not accompanied by prototype implementations. New concept (IO only): split collectives.
  • 68. Not discussed: thread support/compliance, the ability of MPI to work in a threaded environment. MPI 1.0: the design is (largely; exception: MPI_Probe/MPI_Recv) thread safe; recommendation that MPI implementations be thread safe (contrast: the PVM design). MPI 2.0: the level of thread support can be requested and queried; an MPI library is not required to support the requested level, but returns information on the highest (possibly smaller) level it supports: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE.
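A hedged example of requesting and checking a thread-support level at initialization (the error handling is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: ask for full thread support and check what the library provides. */
int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The library only guarantees a lower level (e.g. MPI_THREAD_FUNNELED):
           restrict which threads make MPI calls accordingly, or abort. */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (got %d)\n", provided);
    }

    MPI_Finalize();
    return 0;
}
```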
  • 69. Quiet years: 1997-2006. No standardization activity from 1997. MPI 2.0 implementations: Fujitsu (claimed) 1999, NEC 2000, mpich 2004, Open MPI 2005, LAM/MPI 2005(?), ... By ca. 2006 most/many implementations supported mostly full MPI 2.0. Implementations evolved and improved; MPI was an interesting topic to work on, and good MPI work was/is acceptable to all parallel computing conferences (SC, IPDPS, ICPP, Euro-Par, PPoPP, SPAA). [J. L. Träff, H. Ritzdorf, R. Hempel: The Implementation of MPI-2 One-Sided Communication for the NEC SX-5. SC 2000]
  • 70. The Euro(PVM/)MPI conference series, dedicated to MPI (biased towards MPI implementation; with co-located MPI Forum meetings): 2012 Wien, Austria; 2011 Santorini, Greece; 2010 Stuttgart, Germany (EuroMPI, no longer PVM); 2009 Helsinki, Finland (EuroPVM/MPI); 2008 Dublin, Ireland; 2007 Paris, France; 2006 Bonn, Germany; 2005 Sorrento, Italy; 2004 Budapest, Hungary; 2003 Venice, Italy; 2002 Linz, Austria; 2001 Santorini, Greece (9.11: did not actually take place); 2000 Balatonfüred, Hungary; 1999 Barcelona, Spain; 1998 Liverpool, UK; 1997 Cracow, Poland (now EuroPVM/MPI); 1996 Munich, Germany; 1995 Lyon, France; 1994 Rome, Italy (EuroPVM).
  • 71. Bonn 2006: discussions ("Open Forum") on restarting the MPI Forum. [Thanks to Xavier Vigouroux, Vienna 2012]
  • 72. The MPI 2.2 to MPI 3.0 process. Late 2007: the MPI Forum reconvenes, again. Consolidate the standard: MPI 1.2 and MPI 2.0 into a single standard document, MPI 2.1 (September 4th, 2008; 586 pages). MPI 2.2 (623 pages) as an intermediate step towards 3.0: address scalability problems, add missing functionality, BUT preserve backwards compatibility.
  • 73. Some MPI 2.2 features: addressing scalability problems: a new topology interface in which the application communication graph is specified in a distributed fashion; library building: MPI_Reduce_local; missing function: the regular MPI_Reduce_scatter_block; more flexible MPI_Comm_create (more in MPI 3.0: MPI_Comm_split_type); new datatypes, e.g. MPI_AINT. [T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski, R. Thakur, J. L. Träff: The scalable process topology interface of MPI 2.2. Concurrency and Computation: Practice and Experience 23(4): 293-310, 2011]
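A sketch of the distributed specification idea: each process declares only its own neighbors (here a hypothetical ring) with MPI_Dist_graph_create_adjacent; the library may reorder ranks to improve the mapping:

```c
#include <mpi.h>

/* Sketch: each process specifies only its own ring neighbors; no process
   needs to know the full communication graph. */
void make_ring_topology(MPI_Comm *ringcomm)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int sources[2]      = { left, right };   /* who sends to me */
    int destinations[2] = { left, right };   /* to whom I send  */

    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, sources, MPI_UNWEIGHTED,
                                   2, destinations, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 1 /* reorder */, ringcomm);
}
```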
  • 74. More MPI 2.2 features: the C++ bindings (present since MPI 2.0) are deprecated, with the intention that they will be removed. MPI_Op and MPI_Datatype are still not first-class citizens (datatype support is weak and cumbersome). The Fortran bindings were modernized and corrected.
  • 75. The MPI 2.1, 2.2, and 3.0 process: 7 meetings in 2008, 6 in 2009, 7 in 2010, 6 in 2011, 6 in 2012; a total of 32 meetings (and counting...). Recall: MPI 1: 7 meetings; MPI 2.0: 16 meetings. MPI Forum rules: presence at physical meetings with a history (presence at the past two meetings) is required to vote. Requirement: new functionality must be supported by a use case and a prototype implementation; backwards compatibility is not strict.
  • 77. The MPI 2.2 to MPI 3.0 process had working groups on: Collective Operations; Fault Tolerance; Fortran Bindings; Generalized Requests ("on hold"); Hybrid Programming; Point-to-Point ("on hold"); Remote Memory Access; Tools; MPI Subsetting ("on hold"); Backward Compatibility; Miscellaneous Items; Persistence.
  • 78. MPI 3.0: new features, new themes, new opportunities. Major new functionalities: 1. non-blocking collectives, 2. sparse collectives, 3. new one-sided communication, 4. performance tool support. Deprecated functions removed: the C++ interface is gone. MPI 3.0, 21 September 2012: 822 pages. Implementation status: mpich should cover MPI 3.0; performance/quality?
  • 79. 1. Non-blocking collectives. Introduced for performance (overlap) and convenience reasons. Similar to the non-blocking point-to-point routines; an MPI_Request object to check and enforce progress. Sound semantics based on ordering, no tags. Different from point-to-point (with good reason): blocking and non-blocking collectives do not mix and match; MPI_Ibcast() is incorrect with MPI_Bcast(). Incomplete: non-blocking versions of some other collective operations (MPI_Comm_idup). Non-orthogonal: split collectives and non-blocking collectives.
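A minimal hedged sketch of a non-blocking broadcast overlapped with independent work (the function and buffer names are illustrative):

```c
#include <mpi.h>

/* Sketch: start a broadcast from rank 0, overlap it with unrelated
   computation, and complete it before using the broadcast data. */
void overlapped_bcast(int *data, int count)
{
    MPI_Request req;

    MPI_Ibcast(data, count, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* ... computation that does not touch data ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* data is now valid on all ranks of MPI_COMM_WORLD */
}
```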
  • 80. 2. Sparse collectives. Address the scalability problem of irregular collectives; the neighborhood is specified with the topology functionality: MPI_Neighbor_allgather(...,comm); MPI_Neighbor_allgatherv(...,comm); MPI_Neighbor_alltoall(...,comm); MPI_Neighbor_alltoallv(...,comm); MPI_Neighbor_alltoallw(...,comm); and corresponding non-blocking versions. [T. Hoefler, J. L. Träff: Sparse collective operations for MPI. IPDPS 2009] Will users take this up? Optimization potential?
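Combined with a topology communicator such as the ring sketched after slide 73, a neighborhood collective exchanges data only with the declared neighbors (a hedged sketch; the in/out degree of 2 is an assumption of that ring example):

```c
#include <mpi.h>

/* Sketch: exchange one integer with each declared neighbor of ringcomm
   (a distributed-graph or Cartesian communicator created beforehand). */
void neighbor_exchange(MPI_Comm ringcomm, int myvalue, int neighborvals[2])
{
    /* Each process contributes one int and receives one int per neighbor. */
    MPI_Neighbor_allgather(&myvalue, 1, MPI_INT,
                           neighborvals, 1, MPI_INT, ringcomm);
}
```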
  • 81. 3. One-sided communication. Model extension for better performance on hybrid/shared-memory systems. Atomic operations (lacking in the MPI 2.0 model). Per-operation local completion, MPI_Rget, MPI_Rput, ... (but only for passive-target synchronization). [Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, Thakur: Leveraging MPI's one-sided communication for shared-memory programming. EuroMPI 2012, LNCS 7490, 133-141, 2012]
  • 82. 4. Performance tool support. The MPI 1.0 problem of allowing only one profiling interface at a time (linker interception of MPI calls) is NOT solved. Functionality was added to query certain internals of the MPI library. Will tool writers take this up?
  • 83. MPI at a turning point. Extremely large-scale systems now appearing stretch the scalability of MPI. Is MPI for exascale, that is, for heterogeneous, memory-constrained, low-bisection-width, unreliable systems?
  • 84. The MPI Forum at a turning point. Attendance large enough? Attendance broad enough? More meetings, smaller attendance. The MPI 2.1 to MPI 3.0 process has been long and exhausting: attendance driven by implementers, relatively little input from users and applications, non-technical goals have played a role; research was conducted that did not lead to a useful outcome for the standard (fault tolerance, thread/hybrid support, persistence, ...). Perhaps time to take a break?
  • 85. The recent votes (MPI 2.1 to MPI 3.0; see meetings.mpi-forum.org/secretary/):
    MPI 2.1 first vote (June 30-July 2, 2008): YES 22, NO 0, ABSTAIN 0, MISSED 0; passed.
    MPI 2.1 second vote (September 3-5, 2008): YES 23, NO 0, ABSTAIN 0, MISSED 0; passed.
    MPI 1.3 second vote (September 3-5, 2008): YES 22, NO 1, ABSTAIN 0, MISSED 0; passed.
    MPI 2.2 (September 2-4, 2009): YES 24, NO 1, ABSTAIN 0, MISSED 0; passed.
    MPI 3.0 (September 20-21, 2012): YES 17, NO/ABSTAIN 0/0; passed. Change/discussion of rules.
  • 86. Summary: study history and learn from it: how to do better than MPI. Standardization is a major effort and has taken a lot of dedication and effort from a relatively large (but declining?) group of people and institutions/companies. MPI 3.0 will raise many new implementation challenges. MPI 3.0 is not the end of the (hi)story. Thanks to the MPI Forum; discussions with Bill Gropp, Rusty Lusk, Rajeev Thakur, Jeff Squyres, and others.