The document discusses the history and development of the MPI standard for parallel programming. It describes how MPI was developed in the early 1990s to create a common standard for message passing programming that could unite the various proprietary interfaces that existed at the time. The first MPI standard was released in 1994 after several years of development and input from vendors, national labs, and researchers. MPI was quickly adopted due to a reference implementation and its ability to provide a portable abstraction while allowing for high-performance implementations.
MPI History
1.
History and Development of the
MPI Standard
Jesper Larsson Träff
Vienna University of Technology
Faculty of Informatics, Institute of Information Systems
Research Group Parallel Computing
Favoritenstraße 16, 1040 Wien
www.par.tuwien.ac.at
2.
21 September 2012, MPI Forum meeting in Vienna:
MPI 3.0 has just been released …
… but MPI has a long history, and it is instructive to look at that
www.mpi-forum.org
3.
"Those who cannot remember the past are condemned to repeat it", George Santayana, The Life of Reason, 1905-1906
"History always repeats itself twice: first time as tragedy, second time as farce", Karl Marx
"History is written by the winners", George Orwell, 1944 (but he quotes from someone else)
4.
Last quote:
"history" depends. Who tells it, and why? What information is available? What's at stake?
My stake:
• Convinced of MPI as a well-designed and extremely useful standard, one that has posed productive research/development problems with broader parallel computing relevance
• Critical of the current standardization effort, MPI 3.0
• MPI implementer, 2000-2010, with NEC
• MPI Forum member 2008-2010 (with Hubert Ritzdorf, representing NEC)
• Voted "no" to MPI 2.2
5.
A long debate: shared memory vs. distributed memory
Question: What shall a parallel machine look like?
[Figure: shared memory M accessed by processors P P P P]
The answer depends:
• What are your concerns?
• What is desirable?
• What is feasible?
causing debate since (at least) the 1970s and 1980s
6.
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Their concern: CORRECTNESS
Wyllie, Vishkin:
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
Their concern: (asymptotic) PERFORMANCE
And, of course, PERFORMANCE: many, many practitioners
And, of course, CORRECTNESS: Hoare semantics
7.
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Wyllie, Vishkin:
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
[Fortune, Wyllie: Parallelism in Random Access Machines. STOC
1978: 114-118]
[Shiloach, Vishkin: Finding the Maximum, Merging, and Sorting in a
Parallel Computation Model. Jour. Algorithms 2(1): 88-102, 1981]
[C. A. R. Hoare: Communicating Sequential Processes. Comm. ACM
21(8): 666-677, 1978]
8.
Hoare/Dijkstra:
Parallel programs shall be structured as collections of
communicating, sequential processes
Wyllie, Vishkin: (many, many practitioners, Burton Smith, …)
A parallel algorithm is like a collection of synchronized sequential
algorithms that access a common shared memory, and the machine
is a PRAM
[Figure: shared memory M accessed by processors P P P P]
Neither perhaps cared too much about how to build machines (in the beginning)
9.
The INMOS transputer T400, T800, from
ca. 1985
…but others (fortunately) did
A complete architecture entirely based on
the CSP idea. An original programming
language, OCCAM (1983, 1987)
Parsytec (ca. 1988-1995)
10.
Intel iPSC/2 ca. 1990
Intel Paragon, ca. 1992
IBM SP/2 ca. 1996
Thinking Machines CM-5, ca. 1994
12.
Ironically…
Despite the algorithmically stronger properties of shared-memory models (like the PRAM), and their potential for scaling to much, much larger numbers of processors,
practically, high-performance systems with (quite) substantial parallelism have all been distributed-memory systems,
and the corresponding de facto standard, MPI (the Message-Passing Interface), is much stronger than (say) OpenMP
13.
Sources of MPI: the early years
Commercial vendors and national laboratories (including
many European) needed practically working programming
support for their machines and applications
The early 1990s were fruitful years for practical parallel computing (funding for "grand challenge" and "star wars")
Vendors and labs proposed and maintained their own languages, interfaces, and libraries for parallel programming (early 1990s)
• Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, …
14.
• Intel NX, Express, Zipcode, PARMACS, IBM EUI/CCL, PVM, P4, OCCAM, …
intended for distributed-memory machines, and centered around similar concepts
Similar enough to warrant an effort towards creating a common standard for message-passing based parallel programming
Portability problem: wasted effort in maintaining one's own interface for a small user group; lack of portability across systems
15.
Message-passing interfaces/languages, early 1990s
• Intel NX: send-receive message passing (non-blocking, buffering?), tags (tag groups?), no group concept, some collectives, weak encapsulation
• IBM EUI: point-to-point and collectives (more than in MPI), group concept, high performance (??) [Snir et al.]
• IBM CCL: point-to-point and collectives, encapsulation
• Zipcode/Express: point-to-point, emphasis on library building [Skjellum]
• PARMACS/Express: point-to-point, topological mapping [Hempel]
• PVM: point-to-point communication, some collectives, virtual machine abstraction, fault-tolerance
16.
Some odd men out
• Linda: tuple space get/put; a first PGAS approach?
• Active messages: seems to presuppose an SPMD model?
• OCCAM: too strictly CSP-based, synchronous message passing?
• PVM: heterogeneous systems, fault-tolerance, …
[Hempel, Hey, McBryan, Walker: Special Issue: Message Passing Interfaces. Parallel Computing 29(4), 1994]
17.
Standardization: the MPI Forum and MPI 1.0
[Hempel, Walker: The emergence of the MPI message passing standard
for parallel computing. Computer Standards & Interfaces, 21: 51-62,
1999]
A standardization effort was started in early 1992; key figures: Dongarra, Hempel, Hey, Walker
Goal: to come out, within a time frame of a few years, with a standard for message-passing parallel programming, building on lessons learned from existing interfaces/languages
• Not a research effort (as such)!
• Open to participation from all interested parties
18.
Key technical design points
MPI should encompass and enable:
• Basic message passing and related functionality (collective communication!)
• Library building: safe encapsulation of messages (and other things, e.g. query functionality)
• High performance, across all available and future systems!
• Scalable design
• Support for C and Fortran
19.
The MPI Forum
Not an ANSI/IEEE standardization body; nobody "owns" the MPI standard; "free"
Open to participation for all interested parties; protocols open (votes, email discussions)
Regular meetings, at 6-8 week intervals
Those who participate at meetings (with a history) can vote, one vote per organization (current discussion: quorum, semantics of abstaining)
The 1st MPI Forum set to work in early 1993
20.
After 7 meetings, the 1st version of the MPI standard was ready in early 1994; two finalizing meetings in February 1994
Errata, minor adjustments: MPI 1.0, 1.1, 1.2: 1994-1995
MPI: A Message-Passing Interface Standard. May 5th, 1994
The standard is the 226-page PDF document that can be found at www.mpi-forum.org, as voted by the MPI Forum
21.
Take note:
The MPI 1 standardization process was followed hand-in-hand
by a(n amazingly good) prototype implementation: mpich from
Argonne National Laboratory (Gropp, Lusk, âŚ)
[W. Gropp, E. L. Lusk, N. E. Doss, A. Skjellum: A High-Performance,
Portable Implementation of the MPI Message Passing Interface
Standard. Parallel Computing 22(6): 789-828, 1996]
Other parties, vendors could build on this implementation (and
did!), so that MPI was quickly supported on many parallel
systems
22.
Why MPI has been successful: an appreciation
MPI made some fundamental
• abstractions, but is still close enough to common architectures to allow efficient, low-overhead implementations ("MPI is the assembler of parallel computing…");
• is formulated with care and precision, but is not a formal specification;
• is complete (to a high degree), based on few, powerful, largely orthogonal key concepts (few exceptions, few optionals);
• and made few mistakes
23.
Message-passing abstraction
[Figure: MPI processes i, j, k, l, m connected through a communication medium]
Entities: MPI processes, which can be implemented as "processes" (most MPI implementations), "threads", …
They communicate through a communication medium: a concrete network, …, the nature of which is of no concern to the MPI standard:
• No explicit requirements on network structure or capabilities
• No performance model or requirements
24.
i j
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Basic message-passing: point-to-point communication
Only processes in the same communicator (a ranked set of processes with a unique "context") can communicate
Fundamental library building concept: isolates communication
in library routines from application communication
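For illustration, a minimal complete program in this style (assuming at least two processes are started), in which rank 0 sends one integer to rank 1:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 42;
          /* blocking send of one int to rank 1, tag 0 */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Status status;
          /* blocking receive of one int from rank 0 */
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }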
25.
i j
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Basic message-passing: point-to-point communication
Receiving process blocks until data have been transferred
MPI implementation must ensure reliable transmission; no time
out (see RT-MPI)
Semantics: messages from same sender are delivered in order;
possible to write fully deterministic programs
26.
i j
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Basic message-passing: point-to-point communication
Receiving process blocks until data have been transferred
Sending process may block or not… this is not synchronous communication (as in CSP; close to this: the synchronous MPI_Ssend)
Semantics: upon return, data buffer can safely be reused
27.
i j
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
Basic message-passing: point-to-point communication
Receiving process returns immediately; the data buffer must not be touched
Non-blocking communication: MPI_Isend/MPI_Irecv
Explicit completion: MPI_Wait(&req,&status), …
Design principle: the MPI specification shall not enforce internal buffering; all communication memory is in user space…
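For illustration, a minimal sketch of this pattern, overlapping a data exchange with local work (partner, sendbuf, recvbuf, count, comm and compute_locally are assumed to be defined by the application):

  MPI_Request reqs[2];
  MPI_Status stats[2];
  /* post receive and send without blocking */
  MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
  MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, comm, &reqs[1]);
  compute_locally();            /* overlap communication with computation */
  MPI_Waitall(2, reqs, stats);  /* explicit completion; buffers may now be reused */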
28.
i j
Basic message-passing: point-to-point communication
Receiving process returns immediately; the data buffer must not be touched
Non-blocking communication: MPI_Isend/MPI_Irecv
Explicit completion: MPI_Wait(&req,&status), …
Design choice: no progress rule; communication will/must eventually happen
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
29.
i j
Basic message-passing: point-to-point communication
Completeness: MPI_Send, Isend, Issend, …, MPI_Recv, Irecv can be combined, and the semantics make sense
Receiving process returns immediately; the data buffer must not be touched
MPI_Isend(&data,count,type,j,tag,comm,&req);
MPI_Irecv(&data,count,type,i,tag,comm,&req);
30.
i j
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Basic message-passing: point-to-point communication
Receiving process blocks until data have been transferred
MPI_Datatype describes the structure of the communication data buffer: base types MPI_INT, MPI_DOUBLE, …, and recursively applicable type constructors
31.
i j
MPI_Send(&data,count,type,j,tag,comm);
MPI_Recv(&data,count,type,i,tag,comm,&status);
Basic message-passing: point-to-point communication
Receiving process blocks until data have been transferred
Orthogonality: any MPI_Datatype can be used in any communication operation
Semantics: only the signatures of the data sent and the data received must match (performance!)
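For illustration, a small sketch with assumed application variables (a flat row-major array A of rows*cols doubles, plus dest and comm): a derived datatype describing one matrix column, which can then be sent directly without manual packing:

  MPI_Datatype column;
  /* 'rows' blocks of 1 double each, stride 'cols' doubles: one matrix column */
  MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
  MPI_Type_commit(&column);
  MPI_Send(&A[2], 1, column, dest, 0, comm);  /* send the third column */
  MPI_Type_free(&column);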
32.
Other functionality (supporting library building)
• Attributes to describe MPI objects (communicators, datatypes)
• Query functionality for MPI objects (MPI_Status)
• Error handlers to influence behavior on errors
• MPI_Group objects for manipulating ordered sets of processes
33.
Often-heard objections/complaints:
"MPI is too large"
"MPI is the assembler of parallel computing"…
and two answers:
"MPI is designed not to make easy things easy, but to make difficult things possible" (Gropp, EuroPVM/MPI 2004)
Conjecture (tested at EuroPVM/MPI 2002): for any MPI feature there will be at least one (significant) user depending essentially on exactly this feature
34.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations:
• MPI_Barrier(comm);
• MPI_Bcast(…,comm);
• MPI_Gather(…,comm); MPI_Scatter(…,comm);
• MPI_Allgather(…,comm);
• MPI_Alltoall(…,comm);
• MPI_Reduce(…,comm); MPI_Allreduce(…,comm);
• MPI_Reduce_scatter(…,comm);
• MPI_Scan(…,comm);
Semantics: all processes in comm participate; blocking; no tags
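For illustration, a minimal sketch of two such calls (params, nparams, root, local, global are assumed application variables); every process in comm makes the same call, and the library chooses the algorithm:

  /* the root distributes a parameter block to all processes in comm */
  MPI_Bcast(params, nparams, MPI_DOUBLE, root, comm);
  /* all processes combine their local values; every process gets the sum */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);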
35.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations:
• MPI_Barrier(comm);
• MPI_Bcast(…,comm);
• MPI_Gather(…,comm); MPI_Scatter(…,comm);
• MPI_Allgather(…,comm);
• MPI_Alltoall(…,comm);
• MPI_Reduce(…,comm); MPI_Allreduce(…,comm);
• MPI_Reduce_scatter(…,comm);
• MPI_Scan(…,comm);
Completeness: MPI_Bcast is the dual of MPI_Reduce; MPI_Gather is the dual of MPI_Scatter. Regular and irregular (vector) variants
36.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations:
• MPI_Gatherv(…,comm); MPI_Scatterv(…,comm);
• MPI_Allgatherv(…,comm);
• MPI_Alltoallv(…,comm);
• MPI_Reduce_scatter(…,comm);
Completeness: MPI_Bcast is the dual of MPI_Reduce; MPI_Gather is the dual of MPI_Scatter. Regular and irregular (vector) variants
37.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations
Collectives capture complex patterns, often with non-trivial algorithms and implementations: delegate work to the library implementer, save work for the application programmer
Obligation: the MPI implementation must be of sufficiently high quality; otherwise the application programmer will not use the collectives and will implement their own
This did (and does) happen! For datatypes: unused for a long time
38.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations
Collectives capture complex patterns, often with non-trivial algorithms and implementations: delegate work to the library implementer, save work for the application programmer
Completeness: MPI makes it possible to (almost) implement the MPI collectives "on top of" MPI point-to-point communication
Some exceptions for reductions, MPI_Op; datatypes
39.
Collective communication: patterns of process communication
Fundamental, well-studied, and useful parallel communication patterns are captured in MPI 1.0 as so-called collective operations
Collectives capture complex patterns, often with non-trivial algorithms and implementations: delegate work to the library implementer, save work for the application programmer
Conjecture: well-implemented collective operations contribute significantly towards application "performance portability"
[Träff, Gropp, Thakur: Self-Consistent MPI Performance Guidelines.
IEEE TPDS 21(5): 698-709, 2010]
40.
Three algorithms for matrix-vector multiplication
[Figure: row-wise distribution of the matrix over Proc 0 … Proc p-1, and the vector x]
m×n matrix A and n-element vector y distributed evenly across p MPI processes: compute z = Ay
Algorithm 1:
• Row-wise matrix distribution
• Each process needs the full vector: MPI_Allgather(v)
• Compute blocks of the result vector locally
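A rough sketch of Algorithm 1 (assuming p divides both m and n, a local row block Arows of m/p rows stored row-major, the local vector piece xloc of n/p elements, and buffers x and z):

  /* every process obtains the full input vector from the p pieces */
  MPI_Allgather(xloc, n/p, MPI_DOUBLE, x, n/p, MPI_DOUBLE, comm);
  /* local (m/p) x n block times the full vector gives the local block of the result */
  for (int i = 0; i < m/p; i++) {
      z[i] = 0.0;
      for (int j = 0; j < n; j++)
          z[i] += Arows[i*n + j] * x[j];
  }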
41.
Three algorithms for matrix-vector multiplication
[Figure: column-wise distribution of the matrix over Proc 0 … Proc p-1, and the vector x]
Algorithm 2:
• Column-wise matrix distribution
• Compute local partial result vector
• MPI_Reduce_scatter to sum and distribute the partial results
42.
Three algorithms for matrix-vector multiplication
[Figure: 2D block distribution over processes Proc 0, …, Proc c-1, Proc c, Proc 2c, …, with MPI_Allgather along the columns]
Algorithm 3:
• Matrix distribution into blocks of m/r × n/c elements
• Algorithm 1 on columns
• Algorithm 2 on rows
43.
Three algorithms for matrix-vector multiplication
[Figure: 2D block distribution over processes Proc 0, …, Proc c-1, Proc c, Proc 2c, …, with MPI_Reduce_scatter along the rows; p = rc]
Algorithm 3:
• Matrix distribution into blocks of m/r × n/c elements
• Algorithm 1 on columns
• Algorithm 2 on rows
Algorithm 3 is more scalable. Partitioning the set of processes (new communicators) is essential!
Interfaces that do not support collectives on subsets of processes are not able to express Algorithm 3: case in point, UPC
44.
Three algorithms for matrix-vector multiplication
For the "regular" case where p divides n (and p = rc):
• Regular collectives: MPI_Allgather, MPI_Reduce_scatter
For the "irregular" case:
• Irregular collectives: MPI_Allgatherv, MPI_Reduce_scatter
MPI 1.0 defined regular/irregular versions (completeness) for all the considered collective patterns, except for MPI_Reduce_scatter
Performance: the irregular collectives subsume their regular counterparts, but much better algorithms are known for the regular ones
45.
[R. A. van de Geijn, J. Watts: SUMMA: scalable
universal matrix multiplication algorithm. Concurrency -
Practice and Experience 9(4): 255-274 (1997)]
[Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert A. van de Geijn:
Collective communication: theory, practice, and experience. Concurrency
and Computation: Practice and Experience 19(13): 1749-1783 (2007)]
[F. G. van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Ortí, G. Quintana-Ortí: The libflame Library for Dense Matrix Computations. Computing in Science and Engineering 11(6): 56-63 (2009)]
A lesson: Dense Linear Algebra and (regular) collective
communication as offered by MPI go hand in hand
Note: Most of these collective communication algorithms are a
factor 2 off from best possible
46.
Another example: Integer (bucket) sort
n integers in a given range [0,R-1], distributed evenly across p
MPI processes: m= n/p integers per process
[Figure: example local array A = 0 1 3 0 0 2 0 1 … with local bucket counts B = 4 2 1 3]
Step 1: bucket sort locally; let B[i] be the number of elements with key i
Step 2: MPI_Allreduce(B,AllB,R,MPI_INT,MPI_SUM,comm);
Step 3: MPI_Exscan(B,RelB,R,MPI_INT,MPI_SUM,comm);
Now: element A[j] needs to go to position AllB[A[j]-1]+RelB[A[j]]+j'
47.
Another example: Integer (bucket) sort
n integers in a given range [0,R-1], distributed evenly across p
MPI processes: m= n/p integers per process
[Figure: example local array A = 0 1 3 0 0 2 0 1 … with local bucket counts B = 4 2 1 3]
Step 4: compute the number of elements to be sent to each other process, sendelts[i], i=0,…,p-1
Step 5: MPI_Alltoall(sendelts,1,MPI_INT,recvelts,1,MPI_INT,comm);
Step 6: redistribute the elements:
MPI_Alltoallv(A,sendelts,sdispls,…,comm);
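A hedged sketch of the redistribution (steps 5 and 6), assuming the locally sorted array A and the per-destination counts sendelts have already been computed, and with arrays sdispls, rdispls, recvelts and Anew of the obvious sizes:

  /* step 5: exchange how many elements go to / come from every process */
  MPI_Alltoall(sendelts, 1, MPI_INT, recvelts, 1, MPI_INT, comm);
  /* displacements are exclusive prefix sums of the counts */
  sdispls[0] = rdispls[0] = 0;
  for (int i = 1; i < p; i++) {
      sdispls[i] = sdispls[i-1] + sendelts[i-1];
      rdispls[i] = rdispls[i-1] + recvelts[i-1];
  }
  /* step 6: redistribute the elements */
  MPI_Alltoallv(A, sendelts, sdispls, MPI_INT,
                Anew, recvelts, rdispls, MPI_INT, comm);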
48.
Another example: Integer (bucket) sort
The algorithm is stable (→ radix sort)
The choice of radix R depends on properties of the network (fully connected, fat tree, mesh/torus, …) and on the quality of the reduction/scan algorithms
The algorithm is portable (by virtue of the MPI collectives), but tuning depends on the system; a concrete performance model is needed, but this is outside the scope of MPI
Note: on a strong network, T(MPI_Allreduce(m)) = O(m + log p), NOT O(m log p)
49.
A last feature
Process topologies:
Specify the application communication pattern (as either a directed graph or a Cartesian grid) to the MPI library, and let the library assign processes to processors so as to improve communication following the specified pattern
MPI version: collective communicator construction functions; process ranks in the new communicator represent the new (improved) mapping
And a very last feature: (simple) tool-building support, the MPI profiling interface
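The profiling interface is simple: every MPI_X routine is also callable under the name PMPI_X, so a tool can intercept MPI calls at link time without source changes. A minimal, illustrative timing wrapper (the MPI 3.0 C binding of MPI_Send is assumed):

  #include <mpi.h>
  static double send_time = 0.0;   /* time accumulated inside MPI_Send */

  /* intercepts the application's MPI_Send and forwards to the real routine */
  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      double t = MPI_Wtime();
      int err = PMPI_Send(buf, count, type, dest, tag, comm);
      send_time += MPI_Wtime() - t;
      return err;
  }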
50.
The mistakes
• MPI_Cancel(): semantically ill-defined, difficult to implement; a concession to RT?
• MPI_Rsend(): vendors got too much leverage?
• MPI_Pack/Unpack: added as an afterthought in the last 1994 meetings
• Some functions enforce a full copy of argument (list)s into the library
51.
The mistakes
• MPI_Cancel(): semantically ill-defined, difficult to implement; a concession to RT? (but useful for certain patterns, e.g., double buffering, client-server-like, …)
• MPI_Rsend(): vendors got too much leverage? (but advantageous in some scenarios)
• MPI_Pack/Unpack: added as an afterthought in the last 1994 meetings (the functionality is useful/needed; there are limitations in the specification)
• Some functions enforce a full copy of argument (list)s into the library
52.
Missing functionality
• Datatype query functions: not possible to query/reconstruct the structure specified by a given datatype
• Some MPI objects are not first-class citizens (MPI_Aint, MPI_Op, MPI_Datatype); this makes it difficult to build certain types of libraries
• Reductions cannot be performed locally
53.
Is MPI scalable?
Definition:
An MPI construct is non-scalable if its memory or time overhead(*) is Ω(p), p being the number of processes
(*) cannot be accounted for in the application
Questions:
• Are there aspects of the MPI specification that are non-scalable (force Ω(p) memory or time)?
• Are there aspects of (typical) MPI implementations that are non-scalable?
The question must distinguish between specification and implementation
54.
The answer is "yes" to both questions
Example:
Irregular collective all-to-all communication (each process exchanges some data with each other process)
MPI_Alltoallw(sendbuf,sendcounts[],senddispls[],sendtypes[], recvbuf,recvcounts[],recvdispls[],recvtypes[],…)
takes 6 p-sized arrays (4- or 8-byte integers), ~5 MBytes, 10% of memory on BlueGene/L
Sparse usage pattern: often each process exchanges with only a few neighbors, so most send/recvcounts[i] = 0
MPI_Alltoallw is non-scalable
55.
Experiment: sendcounts[i]=0, recvcounts[i]=0 for all processes and all i; this entails no communication
[Plot: measured times on the Argonne Natl. Lab BlueGene/L]
[Balaji, …, Träff: MPI on millions of cores. Parallel Processing Letters 21(1): 45-60, 2011]
56.
Definitely non-scalable features in MPI 1.0
• Irregular collectives: p-sized lists of counts, displacements, types
• Graph topology interface: requires specification of the full process topology (communication graph) by all processes
(the Cartesian topology interface is perfectly scalable, and much used)
57.
MPI 2: what (almost) went wrong
A number of issues/desired functionalities were left open by MPI 1.0, either because of
• no agreement, or
• the deadline, and the desire to get a consolidated standard out in time
Major open issues
• Parallel IO
• One-sided communication
• Dynamic process management
were partly described in the so-called JOD: "Journal of Development" (see www.mpi-forum.org)
The challenge from PVM
The challenge from SHMEM…
58.
The MPI Forum started to reconvene already in 1995
Between 1995 and 1997 there were 16 meetings, which led to MPI 2.0
MPI 1.0: 226 pages; MPI 2.0: 356 additional pages
Major new features, with new concepts: extended message
passing models
1. Dynamic process management
2. One-sided communication
3. MPI-IO
59.
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
MPI 2.0 process management relies on inter-communicators
(from MPI 1.0) to establish communication with newly started
processes or already running applications
• MPI_Comm_spawn
• MPI_Comm_connect/MPI_Comm_accept
• MPI_Intercomm_merge
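For illustration, a hedged sketch of spawning workers and merging everything into one intra-communicator (the executable name "worker" and the count 4 are made up for the example):

  MPI_Comm intercomm, allcomm;
  /* parent side: start 4 new processes running the program 'worker' */
  MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
  /* turn the parent-child inter-communicator into a single intra-communicator */
  MPI_Intercomm_merge(intercomm, 0, &allcomm);
  /* the spawned processes obtain the same inter-communicator via MPI_Comm_get_parent */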
60.
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
Most (all) MPI implementations also die, but this may be an implementation issue
61.
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
If the implementation does not die, it might be possible to program around/isolate faults using MPI 1.0 error handlers and inter-communicators
[W. Gropp, E. Lusk: Fault Tolerance in Message Passing Interface
Programs. IJHPCA 18(3): 363-372, 2004]
62.
1. Dynamic process management:
MPI 1.0 was completely static: a communicator cannot change
(design principle: no MPI object can change; new objects can
be created and old ones destroyed), so the number of
processes in MPI_COMM_WORLD cannot change: therefore
not possible to add or remove processes from a running
application
What if a process (in a communicator) dies? The fault-
tolerance problem
The issue is contentious & contagious…
63.
2. One-sided communication
Motivations/arguments:
• Expressivity/convenience: for applications where only one process may readily know with which process to communicate data, the point-to-point message-passing communication model may be inconvenient
• Performance: on some architectures point-to-point communication could be inefficient, e.g. if shared memory is available
Challenge: define a model that captures the essence of one-
sided communication, but can be implemented without requiring
specific hardware support
64.
2. One-sided communication
Challenge: define a model that captures the essence of one-
sided communication, but can be implemented without requiring
specific hardware support
New MPI 2.0 concepts: communication window, communication
epoch
MPI one-sided model cleanly separates communication from
synchronization; three specific synchronization mechanisms
• MPI_Win_fence
• MPI_Win_start/complete/post/wait
• MPI_Win_lock/unlock
with cleverly thought out semantics and memory model
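For illustration, a minimal sketch of the fence-synchronized model (localbuf, n, origin, count, target and comm are assumed application variables):

  MPI_Win win;
  /* expose n doubles of local memory in a window shared by all processes in comm */
  MPI_Win_create(localbuf, n * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, comm, &win);
  MPI_Win_fence(0, win);   /* open an access/exposure epoch */
  /* one-sided: write count doubles into the target's window at displacement 0 */
  MPI_Put(origin, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);   /* close the epoch: all transfers are complete */
  MPI_Win_free(&win);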
65.
2. One-sided communication
MPI one-sided model cleanly separates communication from
synchronization; three specific synchronization mechanisms
• MPI_Win_fence
• MPI_Win_start/complete/post/wait
• MPI_Win_lock/unlock
with cleverly thought-out semantics and memory model
Unfortunately, application programmers did not seem to like it:
• "too complicated"
• "too rigid"
• "not efficient"
• …
66.
3. MPI-IO
Communication with external (disk/file) memory. Could leverage
MPI concepts and implementations:
• Datatypes to describe file structure
• Collective communication for utilizing local file systems
• Fast communication
The MPI datatype mechanism is essential, and the power of this concept starts to become clear
MPI 2.0 introduces (inelegant!) functionality to decode a datatype, i.e., to discover the structure described by the datatype. Needed for an MPI-IO implementation (on top of MPI), and it supports library building
67.
Take note:
Apart from MPI-IO (ROMIO), the MPI 2.0 standardization
process was not followed by prototype implementations
New concept (IO only): split collectives
68.
Not discussed:
Thread support/compliance: the ability of MPI to work in a threaded environment
• MPI 1.0: the design is (largely; exception: MPI_Probe/MPI_Recv) thread safe; recommendation that MPI implementations be thread safe (contrast: PVM design)
• MPI 2.0: the level of thread support can be requested and queried; an MPI library is not required to support the requested level, but returns information on the highest lower level it supports
MPI_THREAD_SINGLE
MPI_THREAD_FUNNELED
MPI_THREAD_SERIALIZED
MPI_THREAD_MULTIPLE
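In code, the thread level is negotiated at initialization (a minimal sketch):

  int provided;
  /* ask for full thread support; the library reports what it actually provides */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
      /* fall back, e.g. restrict MPI calls to a single thread */
  }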
69.
Quiet years: 1997-2006
No standardization activity from 1997
MPI 2.0 implementations
• Fujitsu (claim) 1999
• NEC 2000
• mpich 2004
• Open MPI 2005
• LAM/MPI 2005(?)
• …
Ca. 2006 most/many implementations support mostly full MPI 2.0
Implementations evolved and improved; MPI was an interesting
topic to work on, good MPI work was/is acceptable to all parallel
computing conferences (SC, IPDPS, ICPP, Euro-Par, PPoPP, SPAA)
[J. L. Träff, H. Ritzdorf, R. Hempel:
The Implementation of MPI-2 One-
Sided Communication for the NEC
SX-5. SC 2000]
70.
2012 Wien, Austria
2011 Santorini, Greece
2010 Stuttgart, Germany EuroMPI (no longer PVM)
2009 Helsinki, Finland EuroPVM/MPI
2008 Dublin, Ireland
2007 Paris, France
2006 Bonn, Germany
2005 Sorrento, Italy
2004 Budapest, Hungary
2003 Venice, Italy
2002 Linz, Austria
2001 Santorini, Greece (9/11: did not actually take place)
2000 Balatonfüred, Hungary
1999 Barcelona, Spain
1998 Liverpool, UK
1997 Cracow, Poland Now EuroPVM/MPI
1996 Munich, Germany
1995 Lyon, France
1994 Rome, Italy EuroPVM
Euro(PVM/)MPI conference series: dedicated to MPI
MPI Forum meetings
…biased towards MPI implementation
71.
[Thanks to Xavier Vigouroux, Vienna 2012]
Bonn 2006: discussions ("Open Forum") on restarting the MPI Forum
72.
The MPI 2.2 – MPI 3.0 process
Late 2007: the MPI Forum reconvenes, again
Consolidate the standard: MPI 1.2 and MPI 2.0 into a single standard document: MPI 2.1 (Sept. 4th, 2008)
MPI 2.2: an intermediate step towards 3.0
• Address scalability problems
• Missing functionality
• BUT preserve backwards compatibility
[Document sizes: 586 pages, 623 pages]
73.
Some MPI 2.2 features
• Addressing scalability problems: new topology interface; the application communication graph is specified in a distributed fashion
• Library building: MPI_Reduce_local
• Missing function: the regular MPI_Reduce_scatter_block
• More flexible MPI_Comm_create (more in MPI 3.0: MPI_Comm_split_type)
• New datatypes, e.g. MPI_AINT
[T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski,
R. Thakur, J. L. Träff: The scalable process topology
interface of MPI 2.2. Concurrency and Computation: Practice
and Experience 23(4): 293-310, 2011]
74.
Some MPI 2.2 features
C++ bindings (since MPI 2.0) deprecated! With the intention
that they will be removed
MPI_Op, MPI_Datatype still not first class citizens (datatype
support is weak and cumbersome)
Fortran bindings modernized and corrected
75.
MPI 2.1, 2.2, and 3.0 process
• 6 meetings 2012
• 6 meetings 2011
• 7 meetings 2010
• 6 meetings 2009
• 7 meetings 2008
Total: 32 meetings (and counting…)
Recall:
• MPI 1: 7 meetings
• MPI 2.0: 16 meetings
MPI Forum rules: presence at physical meetings with a history
(presence at past two meetings) required to vote
Requirement: new functionality must be supported by use-case
and prototype implementation; backwards compatibility not strict
77.
The MPI 2.2 – MPI 3.0 process had working groups on
• Collective Operations
• Fault Tolerance
• Fortran bindings
• Generalized requests ("on hold")
• Hybrid Programming
• Point-to-point (this working group is "on hold")
• Remote Memory Access
• Tools
• MPI subsetting ("on hold")
• Backward Compatibility
• Miscellaneous Items
• Persistence
78.
MPI 3.0: new features, new themes, new opportunities
Major new functionalities:
1. Non-blocking collectives
2. Sparse collectives
3. New one-sided communication
4. Performance tool support
Deprecated functions removed: the C++ interface is gone
MPI 3.0, 21 September 2012: 822 pages
Implementation status: mpich should cover MPI 3.0
Performance/quality?
79.
1. Non-blocking collectives
Introduced for performance (overlap) and convenience reasons
Similar to the non-blocking point-to-point routines; an MPI_Request object to check and enforce progress
Sound semantics based on ordering, no tags
Different from point-to-point (with good reason): blocking and non-blocking collectives do not mix and match: MPI_Ibcast() is incorrect with MPI_Bcast()
Incomplete: non-blocking versions for some other operations (MPI_Comm_idup)
Non-orthogonal: split and non-blocking collectives
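For illustration, a minimal sketch of the intended overlap (buf, count, root, comm and do_local_work are assumed application names):

  MPI_Request req;
  /* non-blocking broadcast from root; returns immediately */
  MPI_Ibcast(buf, count, MPI_DOUBLE, root, comm, &req);
  do_local_work();                    /* overlap with independent local work */
  MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the collective is complete only here */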
80.
2. Sparse collectives
Addresses the scalability problem of the irregular collectives. The neighborhood is specified with the topology functionality
MPI_Neighbor_allgather(…,comm);
MPI_Neighbor_allgatherv(…,comm);
MPI_Neighbor_alltoall(…,comm);
MPI_Neighbor_alltoallv(…,comm);
MPI_Neighbor_alltoallw(…,comm);
and corresponding non-blocking versions
[T. Hoefler, J. L. Träff: Sparse collective operations for MPI. IPDPS
2009]
Will users take this up? Optimization potential?
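For illustration, a hedged sketch: a distributed graph communicator is created in which each process names only its own neighbors, then a neighborhood collective exchanges data with exactly those neighbors (indegree, sources, outdegree, destinations and the buffers are assumed application data):

  MPI_Comm graphcomm;
  MPI_Dist_graph_create_adjacent(comm, indegree, sources, MPI_UNWEIGHTED,
                                 outdegree, destinations, MPI_UNWEIGHTED,
                                 MPI_INFO_NULL, 0, &graphcomm);
  /* exchange one block with every neighbor only, not with all p processes */
  MPI_Neighbor_allgather(sendbuf, count, MPI_DOUBLE,
                         recvbuf, count, MPI_DOUBLE, graphcomm);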
81.
3. One-sided communication
Model extension for better performance on hybrid/shared-memory systems
Atomic operations (lacking in the MPI 2.0 model)
Per-operation local completion, MPI_Rget, MPI_Rput, … (but only for passive synchronization)
[Hoefler, Dinan, Buntinas, Balaji, Barrett, Brightwell, Gropp, Kale, Thakur: Leveraging MPI's one-sided communication for shared-memory programming. EuroMPI 2012, LNCS 7490, 133-141, 2012]
82.
4. Performance tool support
The MPI 1.0 problem of allowing only one profiling interface at a time (linker interception of MPI calls) is NOT solved
Functionality added to query certain internals of the MPI library
Will tool writers take this up?
83.
MPI at a turning point
Extremely large-scale systems
now appearing stretch the
scalability of MPI
Is MPI for exascale systems?
• heterogeneous?
• memory-constrained?
• low bisection width?
• unreliable?
84.
MPI Forum at a turning point
Attendance large enough?
Attendance broad enough?
The MPI 2.1 – MPI 3.0 process has been long and exhausting; attendance has been driven by implementors, with relatively little input from users and applications; non-technical goals have played a role; research was conducted that did not lead to a useful outcome for the standard (fault tolerance, thread/hybrid support, persistence, …)
Perhaps time to take a break?
More meetings,
smaller attendance
85.
The recent votes (MPI 2.1 to MPI 3.0)
MPI 2.1 first (June 30-July 2, 2008): YES 22, NO 0, ABSTAIN 0, MISSED 0. Passed
MPI 2.1 second: YES 23, NO 0, ABSTAIN 0, MISSED 0. Passed
MPI 1.3 second (September 3-5, 2008): YES 22, NO 1, ABSTAIN 0, MISSED 0. Passed
MPI 2.2 (September 2-4, 2009): YES 24, NO 1, ABSTAIN 0, MISSED 0. Passed
MPI 3.0 (September 20-21, 2012): YES 17, NO/ABSTAIN 0/0. Passed
Change/discussion of rules
See meetings.mpi-forum.org/secretary/
86.
Summary:
Study history and learn from it: how to do better than MPI
Standardization is a major effort; it has taken a lot of dedication and effort from a relatively large (but declining?) group of people and institutions/companies
MPI 3.0 will raise many new implementation challenges
MPI 3.0 is not the end of the (hi)story
Thanks to the MPI Forum; discussions with Bill Gropp, Rusty Lusk, Rajeev Thakur, Jeff Squyres, and others