This document introduces the Message Passing Interface (MPI) and distributed computing. MPI is a library specification for passing messages between processes that do not share memory. The document outlines key MPI functions and concepts, introduces MPI programming, and discusses thinking in parallel when using MPI. It also covers MPI implementations, versions of the MPI standard, and motivations for distributed computing.
2. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
3. Shared Memory: all memory within a system is directly addressable (ignoring access restrictions) by each process [or thread]
Single- and multi-CPU desktops & laptops
Multi-threaded apps
GPGPU *
MPI *
Distributed Memory: memory available on a given node within a system is unique and distinct from its peers
MPI
Google MapReduce / Hadoop
6. Bandwidth (FSB, HT, Nehalem, CUDA, …)
Frequently run into with high-level languages (MATLAB)
Capacity – cost & availability
High-density chips are $$$ (if even available)
Memory limits on individual systems
Distributed computing addresses both bandwidth and capacity with multiple systems
MPI is the glue used to connect multiple distributed processes together
7. Custom iterative SENSE reconstruction
3 x 8 coils x 400 x 320 x 176 x 8 [complex float]
Profile data (img space)
Estimate (img <-> k space)
Acquired data (k space)
> 4GB of data touched during each iteration
16- and 32-channel data here or on the way…
Trzasko, Josh. ISBI 2009 #1349, “Practical Nonconvex Compressive Sensing Reconstruction of Highly-Accelerated 3D Parallel MR Angiograms”
M01.R4: MRI Reconstruction Algorithms, Monday @ 1045 in Stanbro
8. [Reconstruction data-flow diagram, split between the root node and worker nodes. Stages: FTx on incoming DATA; place view into correct x-Ky-Kz space (AP & LP); CAL FTyz (AP & LP); “Traditional” 2D SENSE unfold (AP & LP); Homodyne correction; GW correction (Y, Z) on pre-loaded data; GW correction (X) on real-time data; MIP; store the RESULT as DICOM. MPI communication links the stages across nodes.]
9. [Cluster hardware diagram. Root node: 3.6GHz P4, 16GB RAM, 500GB HDD, 1Gb Eth to the site intranet (MRI system), 1Gb Eth to the cluster, 2x8Gb IB. Worker nodes (x7): 3.6GHz P4, 16GB RAM, 80GB HDD, 1Gb Eth, 2x8Gb IB. A 16-port Gigabit Ethernet switch carries the x7 file-system connections (1Gig Ethernet); a 24-port Infiniband switch carries the x7x2 MPI interconnects (2x8Gig Infiniband, 8Gb/s per connection, 16Gb/s bandwidth per node).]
12. [Diagram contrasting memory and network transfers. Left: a single host and OS running Process A with Threads 1…N, which communicate through memory transfers. Right: Hosts I…N, each with its own OS, running Processes A, B, … C, which communicate through network transfers.]
13. [Same diagram as the previous slide with a second process added per host: Hosts I…N run Processes A & D, B & E, … C & F, mixing memory transfers within each host and network transfers between hosts.]
14. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
15. Message Passing Interface is…
“a library specification for message-passing” [1]
Available in many implementations on multiple platforms *
A set of functions for moving messages between different processes without a shared memory environment
Low-level*; no concept of overall computing tasks to be performed
[1] http://www.mcs.anl.gov/research/projects/mpi/
16. MPI-1
Version 1.0 draft standard 1994
Version 1.1 in 1995
Version 1.2 in 1997
Version 1.3 in 2008
MPI-2
Added:
▪ 1-sided communication
▪ Dynamic “world” sizes; spawn / join
Version 2.0 in 1997
Version 2.1 in 2008
MPI-3
In process
Enhanced fault handling
Forward compatibility preserved
17. MPI is the de-facto standard for distributed computing
Freely available
Open source implementations exist
Portable
Mature
From a discussion of why MPI is dominant [1]:
[…] 100s of languages have come and gone. Good stuff must have been created [… yet] it is broadly accepted in the field that they’re not used.
MPI has a lock.
OpenMP is accepted, but a distant second.
There are substantial barriers to the introduction of new languages and language constructs. Economic, ecosystem related, psychological, a catch-22 of widespread use, etc.
Any parallel language proposal must come equipped with reasons why it will overcome those barriers.
[1] http://www.ieeetcsc.org/newsletters/2006-01/why_all_mpi_discussion.html
18. MPI itself is just a specification. We want an implementation
MPICH, MPICH2
Widely portable
MVAPICH, MVAPICH2
Infiniband-centric; MPICH/MPICH2 based
OpenMPI
Plug-in architecture; many run-time options
And more:
IntelMPI
HP-MPI
MPI for IBM Blue Gene
MPI for Cray
Microsoft MPI
MPI for SiCortex
MPI for Myrinet Express (MX)
MPICH2 over SCTP
19. Without MPI:
Start all of the processes across a bank of machines (shell scripting + ssh)
socket(), bind(), listen(), accept() or connect() for each link
send(), read() on individual links
Raw byte interfaces; no discrete messages
23. Each process owns its data – there is no “ours”
Makes many things simpler; no mutexes, condition variables, semaphores, etc.; memory-access-order race conditions go away
Every message is an explicit copy
I have the memory I sent from; you have the memory you received into
Even when running in a “shared memory” environment
Synchronization comes along for free
I won’t get your message (or data) until you choose to send it
Programming to MPI first can make it easier to scale out later
24. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
25. Download / decompress MPICH source:
http://www.mcs.anl.gov/research/projects/mpich2/
Supports: C / C++ / Fortran
Requires Python >= 2.2
./configure
make install
Installs into /usr/local by default, or use --prefix=<chosen path>
Make sure <prefix>/bin is in PATH
Make sure <prefix>/share/man is in MANPATH
27. Set up password-less ssh to workers
Start the daemons with mpdboot -n <N>
Requires ~/.mpd.conf to exist on each host
▪ Contains (same on each host): MPD_SECRETWORD=<some gibberish string>
▪ Permissions set to 600 (r/w access for owner only)
Requires ./mpd.hosts to list other host names
▪ Unless run as mpdboot -n 1 (run on current host only)
▪ Will not accept current host in list (implicit)
Check for running daemons with mpdtrace
For details: http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf
28.
29. Use mpicc / mpicxx for the C / C++ compiler
Wrapper scripts around the C / C++ compilers detected during install
▪ $ mpicc --show
gcc -march=core2 -I/usr/local/builds/mpich2-1.0.8/include -L/usr/local/builds/mpich2-1.0.8/lib -lmpich -lhdf5_cpp -lhdf5 -lpthread -luuid -lpthread -lrt
$ mpicc -o hello hello.c
Use mpiexec -np <nproc> <app> <args> to launch
$ mpiexec -np 4 ./hello
30. /* hello.c */
#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[])
{
  int i, rank, nodes;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nodes);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < nodes; i++)
  {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
  }
  MPI_Finalize();
  return 0;
}

$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
31. [Diagram: execution of ./threaded_app. main() calls pthread_create(func()); main() and func() then do work concurrently, sharing the process’s memory; func() finishes with pthread_exit(), and main() calls pthread_join() and then exit().]
32. [Diagram: mpiexec -np 4 ./mpi_app; mpd launches the jobs. Each rank of mpi_app runs main(), then MPI_Init(), MPI_Bcast(), work on local memory, MPI_Allreduce(), MPI_Finalize(), and exit(); the MPI_* calls are the points of MPI communication between ranks.]
33. /* hello.c */
#include <stdio.h>
#include <mpi.h>
int
main (int argc, char * argv[])
{
int i;
int rank;
int nodes;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nodes);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
for (i=0; i< nodes; i++)
{
MPI_Barrier(MPI_COMM_WORLD);
if (i == rank) printf("Hello from %i of %i!\n", rank, nodes);
}
MPI_Finalize();
return 0;
}
34. MPICH2 comes with MPE by default (unless disabled during configure)
Multiple tracing / logging options to track MPI traffic
Enabled through -mpe=<option> at compile time
MacPro:code$ mpicc -mpe=mpilog -o hello hello.c
MacPro:code$ mpiexec -np 4 ./hello
Hello from 0 of 4!
Hello from 2 of 4!
Hello from 1 of 4!
Hello from 3 of 4!
Writing logfile....
Enabling the Default clock synchronization...
Finished writing logfile ./hello.clog2.
39. int MPI_Send(
void *buf,
  memory location to send from
int count,
  number of elements (of type datatype) at buf
MPI_Datatype datatype,
  MPI_INT, MPI_FLOAT, etc…
  or custom datatypes: strided vectors, structures, etc.
int dest,
  rank (within the communicator comm) of the destination for this message
int tag,
  used to distinguish this message from other messages
MPI_Comm comm )
  communicator for this transfer
  often MPI_COMM_WORLD
40. int MPI_Recv(
void *buf,
  memory location to receive data into
int count,
  number of elements (of type datatype) available to receive into at buf
MPI_Datatype datatype,
  MPI_INT, MPI_FLOAT, etc…
  or custom datatypes: strided vectors, structures, etc.
  typically matches the sending datatype, but doesn’t have to…
int source,
  rank (within the communicator comm) of the source for this message
  can also be MPI_ANY_SOURCE
int tag,
  used to distinguish this message from other messages
  can also be MPI_ANY_TAG
MPI_Comm comm,
  communicator for this transfer
  often MPI_COMM_WORLD
MPI_Status *status )
  structure describing the received message, including:
  actual count (can be smaller than the passed count)
  source (useful if used with source = MPI_ANY_SOURCE)
  tag (useful if used with tag = MPI_ANY_TAG)
42. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
43.
44. $ mpicc -o sr sr.c
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<13"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 2 ./sr
^C
$ mpicc -o sr sr.c -DSENDSIZE="0x1<<14 - 1"
$ mpiexec -np 2 ./sr
0 sent 0; received 1
1 sent 1; received 0
45. 3.4 Communication Modes
The send call described in Section Blocking send is blocking: it does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Message buffering decouples the send and receive operations. A blocking send can complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol.
The send call described in Section Blocking send used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard mode send is non-local: successful completion of the send operation may depend on the occurrence of a matching receive.
http://www.mpi-forum.org/docs/mpi-11-html/node40.html#Node40
46. [Diagram: eager vs. rendezvous transfer between Process 1 and Process 2. A “small” message is sent eagerly: the send returns immediately, and the receiver’s eager receive later picks up the message. A “large” message issues a rendezvous request and blocks until completion: the receiver matches the request and requests the large message, and the sender then transfers the rendezvous data. User activity and MPI activity are distinguished in the diagram.]
47. MPI_Bsend (Buffered) (MPI_Ibsend, MPI_Bsend_init)
Sends are “local” – they return independent of any remote activity
Message buffer can be touched immediately after the call returns
Requires a user-provided buffer, provided via MPI_Buffer_attach()
Forces an “eager”-like message transfer from the sender’s perspective
User can wait for completion by calling MPI_Buffer_detach()
MPI_Ssend (Synchronous) (MPI_Issend, MPI_Ssend_init)
Won’t return until the matching receive is posted
Forces a “rendezvous”-like message transfer
Can be used to guarantee synchronization without additional MPI_Barrier() calls
MPI_Rsend (Ready) (MPI_Irsend, MPI_Rsend_init)
Erroneous if the matching receive has not been posted
Performance tweak (on some systems) when the user can guarantee the matching receive is posted
MPI_Isend, MPI_Irecv (Immediate) (MPI_Send_init, MPI_Recv_init)
Non-blocking; immediate return once the send/receive request is posted
Requires an MPI_[Test|Wait][|all|any|some] call to guarantee completion
Send/receive buffers should not be touched until completed
MPI_Request * argument used for eventual completion
The basic* receive modes are MPI_Recv() and MPI_Irecv(); either can be used to receive any send mode.
49. $ mpicc -o sr2 sr2.c -DSENDSIZE="0x1<<14"
$ mpiexec -np 4 ./sr2
0 sent 0; received 3
2 sent 2; received 1
1 sent 1; received 0
3 sent 3; received 2
50. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
51. Task parallelism
Each process handles a unique kind of task
▪ Example: multi-image uploader (with resize/recompress)
▪ Thread 1: GUI / user interaction
▪ Thread 2: file reader & decompression
▪ Thread 3: resize & recompression
▪ Thread 4: network communication
Can be used in a grid with a pipeline of separable tasks to be performed on each data set
▪ Resample / warp volume
▪ Segment volume
▪ Calculate metrics on segmented volume
52. Data parallelism
Each process handles a portion of the entire data
Often used with large data sets
▪ [task 0… | … task 1 … | … | … task n]
Frequently used in MPI programming
Each process is “doing the same thing,” just on a different subset of the whole
53. Layout is crucial in high-performance computing
BW efficiency; cache efficiency
Even more important in distributed computing
Poor layout means extra communication
[Diagram: an example of “block” data distribution across Nodes 0–7; x is the contiguous dimension, z is the slowest dimension; each node holds a contiguous portion of z.]
54. [Reconstruction data-flow diagram, as on slide 8, split between the root node and worker nodes. Stages: FTx on incoming DATA; place view into correct x-Ky-Kz space (AP & LP); CAL FTyz (AP & LP); “Traditional” 2D SENSE unfold (AP & LP); Homodyne correction; GW correction (Y, Z) on pre-loaded data; GW correction (X) on real-time data; MIP; display the RESULT as DICOM. MPI communication links the stages across nodes.]
55. Completely separable problems:
Add 1 to everyone
Multiply each a[i] * b[i]
Inseparable problems: [?]
Max of a vector
Sort a vector
MIP of a volume
1D FFT of a volume
2D FFT of a volume
3D FFT of a volume
[Parallel sort] Pacheco, Peter S., Parallel Programming with MPI
56.
57. Dynamic datatypes
MPI_Type_vector()
Enables communication of sub-sets without packing
Combined with DMA, permits zero-copy transposes, etc.
Other collectives
MPI_Reduce
MPI_Scatter
MPI_Gather
MPI-2 (MPICH2, MVAPICH2)
One-sided (DMA) communication
▪ MPI_Put()
▪ MPI_Get()
Dynamic world size
▪ Ability to spawn new processes during run
58. Motivation for distributed computing
What MPI is
Intro to MPI programming
Thinking in parallel
Wrap up
59. Take time on the algorithm & data layout
Minimize traffic between nodes / separate the problem
▪ FTx into xKyKz in the SENSE example
Cache-friendly (linear, efficient) access patterns
Overlap processing and communication
MPI_Isend() / MPI_Irecv() with multiple work buffers
While actively transferring one, process the other
Larger messages will hit a higher BW (in general)
60. Profile
VTune (Intel; Linux / Windows)
Shark (Mac)
MPI profiling with -mpe=mpilog
Avoid “premature optimization” (Knuth)
Weigh implementation time & effort vs. runtime performance
Use derived datatypes rather than packing
Using a debugger with MPI is hard
Build in your own debugging messages from the start
61. If you might need MPI, build to MPI.
Works well in shared memory environments
▪ It’s getting better all the time
Encourages memory locality in NUMA architectures
▪ Nehalem, AMD
Portable, reusable, open-source
Can be used in conjunction with threads / OpenMP / TBB / CUDA / OpenCL (“hybrid model of parallel programming”)
Messaging paradigm can create “less obfuscated” code than threads / OpenMP
62. Homogeneous nodes
Private network
Shared filesystem; ssh communication
Password-less SSH
High-bandwidth private interconnect
MPI communication exclusively
GbE, 10GbE
Infiniband
Consider using Rocks
CentOS / RHEL based
Built for building clusters
Rapid network-boot-based install / reinstall of nodes
http://www.rocksclusters.org/
63. MPI documents
http://www.mpi-forum.org/docs/
MPICH2
http://www.mcs.anl.gov/research/projects/mpich2
http://lists.mcs.anl.gov/pipermail/mpich-discuss/
OpenMPI
http://www.open-mpi.org/
http://www.open-mpi.org/community/lists/ompi.php
MVAPICH[1|2] (Infiniband-tuned distribution)
http://mvapich.cse.ohio-state.edu/
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/
Rocks
http://www.rocksclusters.org/
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/
Books:
Pacheco, Peter S., Parallel Programming with MPI
Karniadakis, George E., Parallel Scientific Computing in C++ and MPI
Gropp, W., Using MPI-2
64.
65.
66. This is the painting operation for one RGBA pixel (in) onto another (out). We can do red and blue together, as we know they won’t collide, and we can mask out the unwanted results. Post-multiply masks are applied in the shifted position to minimize the number of shift operations. Note: we’re using pre-multiplied colors & painting onto an opaque background.
#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

inline void
blendPreToStatic(const uint32_t& in, uint32_t& out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x00000080u) ++alpha;
    out = A | (RGB &
          (in +
              (
                  (
                      (alpha * (out & RB) & RB_8OFF) |
                      (alpha * (out & G)  & G_8OFF)
                  ) >> 8
              )
          ));
}
67. OUT = A | (RGB &
         (IN +
             (
                 (
                     (ALPHA * (OUT & RB) & RB_8OFF) |
                     (ALPHA * (OUT & G)  & G_8OFF)
                 ) >> 8
             )
         ));
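Restated as plain C functions for checking (a sketch, not from the deck: `blend` is the slide's bit-twiddled blend with pass-by-value instead of C++ references, and `blend_ref` is a straightforward per-channel reference), the two agree exactly whenever no channel overflows — the pre-multiplied-colors assumption:

```c
#include <stdint.h>

#define RB      0x00FF00FFu
#define RB_8OFF 0xFF00FF00u
#define RGB     0x00FFFFFFu
#define G       0x0000FF00u
#define G_8OFF  0x00FF0000u
#define A       0xFF000000u

/* The slide's blend: treats in's top byte as transparency t,
 * computing per channel: result = in_c + (t * out_c) >> 8 */
static uint32_t blend(uint32_t in, uint32_t out)
{
    uint32_t alpha = in >> 24;
    if (alpha & 0x80u) ++alpha;             /* round t up into [0,256] */
    return A | (RGB &
           (in +
               (
                   (
                       (alpha * (out & RB) & RB_8OFF) |
                       (alpha * (out & G)  & G_8OFF)
                   ) >> 8
               )
           ));
}

/* Per-channel reference: extract, scale, and reassemble each channel */
static uint32_t blend_ref(uint32_t in, uint32_t out)
{
    uint32_t t = in >> 24;
    if (t & 0x80u) ++t;
    uint32_t r = ((in >> 16) & 0xFFu) + ((t * ((out >> 16) & 0xFFu)) >> 8);
    uint32_t g = ((in >>  8) & 0xFFu) + ((t * ((out >>  8) & 0xFFu)) >> 8);
    uint32_t b = ( in        & 0xFFu) + ((t * ( out        & 0xFFu)) >> 8);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

A fully transparent source (t = 0xFF) reproduces the background; t = 0 leaves only the source's pre-multiplied color over black.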
68. For cases where there is no overlap between
the four output pixels for four input pixels, we
can use vectorized (SSE2) code
128-bit wide registers; load four 32-bit RGBA
values, use the same approach as previously
(R|B and G) in two registers to perform four
paints at once
69. inline
void
blend4PreToStatic(uint32_t ** in,
uint32_t * out) // Paints in (quad-word) onto out
{
__m128i rb, g, a, a_, o, mask_reg; // Registers
rb = _mm_loadu_si128((__m128i *) out); // Load destination (unaligned -- may not be on a 128-bit boundary)
a_ = _mm_load_si128((__m128i *) *in); // We make sure the input is on a 128-bit boundary before this call
*in += 4; _mm_prefetch((char*) (*in + 28), _MM_HINT_T0); // Fetch the memory two cache lines out
mask_reg = _mm_set1_epi32(0x0000FF00); // Set green mask (x4)
g = _mm_and_si128(rb, mask_reg); // Mask to greens (x4)
mask_reg = _mm_set1_epi32(0x00FF00FF); // Set red and blue mask (x4)
rb = _mm_and_si128(rb, mask_reg); // Mask to red and blue
rb = _mm_slli_epi32(rb, 8); // << 8 ; g is already biased by 256 in 16-bit spacing
a = _mm_srli_epi32(a_, 24); // >> 24 ; These are the four alpha values, shifted to the lower 8 bits of each word
mask_reg = _mm_slli_epi32(a, 16); // << 16 ; A copy of the four alpha values, shifted to bits [16-23] of each word
a = _mm_or_si128(a, mask_reg); // We now have the alpha value at both bits [0-7] and [16-23] of each word
// These steps add one to transparency values >= 0x80
o = _mm_srli_epi16(a, 7); // Now the high bit is the low bit
70. a = _mm_add_epi16(a, o); // Add that bit back in: rounds each 16-bit alpha up for values >= 0x80
// We now have 8 16-bit alpha values, and 8 rb or 4 g values. The values are biased by 256, and we want
// to multiply by alpha and then divide by 65536; we achieve this by multiplying the 16-bit values and
// storing the upper 16 bits of the 32-bit result. (This is the operation that is available, so that's why we're
// doing it in this fashion!)
rb = _mm_mulhi_epu16(rb, a);
g = _mm_mulhi_epu16(g, a);
g = _mm_slli_epi32(g, 8); // Move green into the correct location.
// R and B, both in the lower 8 bits of their 16 bits, don't need to be shifted
o = _mm_set1_epi32(0xFF000000); // Opaque alpha value
o = _mm_or_si128(o, g);
o = _mm_or_si128(o, rb); // o now has the background's contribution to the output color
mask_reg = _mm_set1_epi32(0x00FFFFFF);
g = _mm_and_si128(mask_reg, a_); // Removes alpha from the foreground color
o = _mm_add_epi32(o, g); // Add foreground and background contributions together
_mm_storeu_si128((__m128i *) out, o); // Unaligned store
}
73. MPI_Init(3) MPI MPI_Init(3)
NAME
MPI_Init - Initialize the MPI execution environment
SYNOPSIS
int MPI_Init( int *argc, char ***argv )
INPUT PARAMETERS
argc - Pointer to the number of arguments
argv - Pointer to the argument vector
THREAD AND SIGNAL SAFETY
This routine must be called by one thread only. That thread is called
the main thread and must be the thread that calls MPI_Finalize .
NOTES
The MPI standard does not say what a program can do before an MPI_INIT
or after an MPI_FINALIZE . In the MPICH implementation, you should do
as little as possible. In particular, avoid anything that changes the
external state of the program, such as opening files, reading standard
input or writing to standard output.
74. MPI_Barrier(3) MPI MPI_Barrier(3)
NAME
MPI_Barrier - Blocks until all processes in the communicator have
reached this routine.
SYNOPSIS
int MPI_Barrier( MPI_Comm comm )
INPUT PARAMETER
comm - communicator (handle)
NOTES
Blocks the caller until all processes in the communicator have called
it; that is, the call returns at any process only after all members of
the communicator have entered the call.
75. MPI_Finalize(3) MPI MPI_Finalize(3)
NAME
MPI_Finalize - Terminates MPI execution environment
SYNOPSIS
int MPI_Finalize( void )
NOTES
All processes must call this routine before exiting. The number of
processes running after this routine is called is undefined; it is best
not to perform much more than a return rc after calling MPI_Finalize .
76. MPI_Comm_size(3) MPI MPI_Comm_size(3)
NAME
MPI_Comm_size - Determines the size of the group associated with a
communicator
SYNOPSIS
int MPI_Comm_size( MPI_Comm comm, int *size )
INPUT PARAMETER
comm - communicator (handle)
OUTPUT PARAMETER
size - number of processes in the group of comm (integer)
77. MPI_Comm_rank(3) MPI MPI_Comm_rank(3)
NAME
MPI_Comm_rank - Determines the rank of the calling process in the com-
municator
SYNOPSIS
int MPI_Comm_rank( MPI_Comm comm, int *rank )
INPUT ARGUMENT
comm - communicator (handle)
OUTPUT ARGUMENT
rank - rank of the calling process in the group of comm (integer)
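Taken together, the routines above are all a minimal MPI program needs; a sketch (compile with mpicc, run with mpiexec):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* every MPI program starts here */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are there? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Barrier(MPI_COMM_WORLD);            /* wait for everyone */
    MPI_Finalize();                         /* do little after this call */
    return 0;
}
```

As noted earlier, the printf lines from different ranks may interleave in any order; the barrier does not serialize output.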
78. MPI_Send(3) MPI MPI_Send(3)
NAME
MPI_Send - Performs a blocking send
SYNOPSIS
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (nonnegative integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
This routine may block until the message is received by the destination
process.
79. MPI_Recv(3) MPI MPI_Recv(3)
NAME
MPI_Recv - Blocking receive for a message
SYNOPSIS
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status)
OUTPUT PARAMETERS
buf - initial address of receive buffer (choice)
status - status object (Status)
INPUT PARAMETERS
count - maximum number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
NOTES
The count argument indicates the maximum length of a message; the
actual length of the message can be determined with MPI_Get_count .
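A matching send/receive pair, sketched (assumes the job was started with at least 2 processes; the tag value is arbitrary):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 sends an array of ints to rank 1. */
int main(int argc, char **argv)
{
    int rank, data[4] = { 1, 2, 3, 4 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(data, 4, MPI_INT, 1, /* tag */ 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        int count;
        MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count); /* actual length received */
        printf("rank 1 received %d ints\n", count);
    }
    MPI_Finalize();
    return 0;
}
```

The receive's count (4) is only an upper bound; MPI_Get_count on the returned status reports how many elements actually arrived.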
80. MPI_Isend(3) MPI MPI_Isend(3)
NAME
MPI_Isend - Begins a nonblocking send
SYNOPSIS
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- datatype of each send buffer element (handle)
dest - rank of destination (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
81. MPI_Irecv(3) MPI MPI_Irecv(3)
NAME
MPI_Irecv - Begins a nonblocking receive
SYNOPSIS
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)
INPUT PARAMETERS
buf - initial address of receive buffer (choice)
count - number of elements in receive buffer (integer)
datatype
- datatype of each receive buffer element (handle)
source - rank of source (integer)
tag - message tag (integer)
comm - communicator (handle)
OUTPUT PARAMETER
request
- communication request (handle)
82. MPI_Bcast(3) MPI MPI_Bcast(3)
NAME
MPI_Bcast - Broadcasts a message from the process with rank "root" to
all other processes of the communicator
SYNOPSIS
int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root,
MPI_Comm comm )
INPUT/OUTPUT PARAMETER
buffer - starting address of buffer (choice)
INPUT PARAMETERS
count - number of entries in buffer (integer)
datatype
- data type of buffer (handle)
root - rank of broadcast root (integer)
comm - communicator (handle)
83. MPI_Allreduce(3) MPI MPI_Allreduce(3)
NAME
MPI_Allreduce - Combines values from all processes and distributes the
result back to all processes
SYNOPSIS
int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
INPUT PARAMETERS
sendbuf
- starting address of send buffer (choice)
count - number of elements in send buffer (integer)
datatype
- data type of elements of send buffer (handle)
op - operation (handle)
comm - communicator (handle)
OUTPUT PARAMETER
recvbuf
- starting address of receive buffer (choice)
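A sketch combining the two collectives (the numbers are made up: the root broadcasts a problem size, each rank sums its stripe of 1..n, and MPI_Allreduce leaves the grand total on every rank):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n = 0;
    long local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 1000;                      /* only the root knows n */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* now every rank does */

    local = 0;                                    /* sum my stripe of 1..n */
    for (int i = rank + 1; i <= n; i += size)
        local += i;

    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: total = %ld\n", rank, total); /* same on every rank */
    MPI_Finalize();
    return 0;
}
```

Unlike MPI_Reduce, the combined result lands on all ranks, so no follow-up broadcast is needed.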
84. MPI_Type_create_hvector(3) MPI MPI_Type_create_hvector(3)
NAME
MPI_Type_create_hvector - Create a datatype with a constant stride
given in bytes
SYNOPSIS
int MPI_Type_create_hvector(int count,
int blocklength,
MPI_Aint stride,
MPI_Datatype oldtype,
MPI_Datatype *newtype)
INPUT PARAMETERS
count - number of blocks (nonnegative integer)
blocklength
- number of elements in each block (nonnegative integer)
stride - number of bytes between start of each block (address integer)
oldtype
- old datatype (handle)
OUTPUT PARAMETER
newtype
- new datatype (handle)
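A sketch of the "no packing" use mentioned earlier: describing one column of a row-major matrix with MPI_Type_create_hvector so it can be sent directly (the matrix dimensions are made up; assumes at least 2 processes):

```c
#include <mpi.h>
#include <stdio.h>

#define ROWS 4
#define COLS 5

int main(int argc, char **argv)
{
    int rank;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of 1 double, starting COLS*sizeof(double) bytes apart */
    MPI_Type_create_hvector(ROWS, 1, COLS * sizeof(double),
                            MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        double m[ROWS][COLS];
        for (int i = 0; i < ROWS; ++i)
            for (int j = 0; j < COLS; ++j)
                m[i][j] = i * COLS + j;
        MPI_Send(&m[0][2], 1, column, 1, 0, MPI_COMM_WORLD); /* column 2 */
    } else if (rank == 1) {
        double col[ROWS];                 /* arrives contiguously */
        MPI_Recv(col, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < ROWS; ++i)
            printf("%g ", col[i]);
        printf("\n");
    }
    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```

The send and receive types differ (strided vs. contiguous), which is legal because their type signatures match: ROWS doubles on each side.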
85. mpicc(1) MPI mpicc(1)
NAME
mpicc - Compiles and links MPI programs written in C
DESCRIPTION
This command can be used to compile and link MPI programs written in C.
It provides the options and any special libraries that are needed to
compile and link MPI programs.
It is important to use this command, particularly when linking pro-
grams, as it provides the necessary libraries.
COMMAND LINE ARGUMENTS
-show - Show the commands that would be used without running them
-help - Give short help
-cc=name
- Use compiler name instead of the default choice. Use this
only if the compiler is compatible with the MPICH library (see
below)
-config=name
- Load a configuration file for a particular compiler. This
allows a single mpicc command to be used with multiple compil-
ers.
[…]
86. mpiexec(1) MPI mpiexec(1)
NAME
mpiexec - Run an MPI program
SYNOPSIS
mpiexec args executable pgmargs [ : args executable pgmargs ... ]
where args are command line arguments for mpiexec (see below), exe-
cutable is the name of an executable MPI program, and pgmargs are com-
mand line arguments for the executable. Multiple executables can be
specified by using the colon notation (for MPMD - Multiple Program Mul-
tiple Data applications). For example, the following command will run
the MPI program a.out on 4 processes:
mpiexec -n 4 a.out
The MPI standard specifies the following arguments and their meanings:
-n <np>
- Specify the number of processes to use
-host <hostname>
- Name of host on which to run processes
-arch <architecturename>
- Pick hosts with this architecture type
[…]
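A typical build-and-run cycle with these two commands (file and program names are hypothetical):

```shell
# Compile and link against the MPI library (mpicc wraps the real compiler)
mpicc -O2 -o hello hello.c

# Show what mpicc would actually invoke, without running it
mpicc -show hello.c

# Launch the program as 4 processes
mpiexec -n 4 ./hello
```
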
Editor's Notes
NUMA: a distinction within shared memory systems, e.g. AMD HyperTransport or Intel QPI vs. Northbridge w/ FSB. GPGPU: sort of; xfers into and out of GPU memory are from the main shared system memory; xfers within GPU memory by GPU kernels are shared memory within their own private (GPU) memory space. Distributed systems: comprised of multiple nodes; each node is typically an individual “computer”. MPI can be used on shared memory systems; modern implementations use the fastest xfer mechanism between each set of peers.
Some scale better. CPUs keep getting faster, either through GHz or # of cores; memory BW has not kept up. STREAM benchmark:
------------------------------------------------------------------
 name    kernel                 bytes/iter   FLOPS/iter
------------------------------------------------------------------
 COPY:   a(i) = b(i)            16           0
 SCALE:  a(i) = q*b(i)          16           1
 SUM:    a(i) = b(i) + c(i)     24           1
 TRIAD:  a(i) = b(i) + q*c(i)   24           2
------------------------------------------------------------------
8-million-element double-precision arrays (~64MB each); ICC 10; -xP. CPU manufacturers are focused on improving this, and have really sped things up with Nehalem… what about Nehalem?
Examples of tasks that hit BW walls: highly tuned inner loops (few ops per element, running over a large volume); masking operations (multiply each element from one volume by a mask in another volume); max / min / mean / std operations; MIPs. Still an issue on new systems, and likely to continue to be. Nehalem is NUMA as well -- another layer of complexity -> can control somewhat via binding (numactl; through Task Manager in Windows). This is not to say the 8 processors are useless; on programs where the inner loop operation does more work, the scaling can be close to ideal, e.g. sin(x).
Front side bus, QuickPath Interconnect, HyperTransport. High-level languages: need to finish one operation (A += B) before doing the next operation (A = A*A). MPI is the de facto standard for parallel programs on distributed memory systems, from Blue Gene to off-the-shelf Linux clusters. 1GB 1333 DDR3: $95 ($800); 2GB 1333 DDR3: $155 ($620); 4GB 1333 DDR3: $322 ($644); 8GB 1333 DDR3 chips: $3410 ($3410). Nehalem again makes this more confusing; memory bus clock changes based on # of modules. Also one of the key points that CUDA is focused on; 3 of the 8 called-out improvements in the latest rev focus on efficient / improved memory BW usage.
Needs for large data sets in image processing are real and here now.
2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set). (This is not the iterative reconstruction.)
*MY* taxonomy. SETI@home 1999-2005, now part of BOINC: 1.7PFlops > 1.4TFlops (RoadRunner). Grid: jobs ~independent and asynchronous; Hadoop / MapReduce; cycle stealing. ScaleMP: up to 32 processors (128 cores) and 4TB shared memory. Cluster computing: a distributed “process” starts on multiple machines concurrently; typically cookie-cutter (although support for different architectures is possible in MPI); significant communication between nodes during processing; massive simulations; applications sensitive to timings. Folding@Home: loosely coupled collective (grid), tightly coupled within a client (MPI); also grid+GPU; 4.6PFlops.
More taxonomy. Grid: loosely connected; nodes “unaware” of other nodes; works great for “batch” problems; different architectures, different implementations (CPU, GPU, … PS3 and Nvidia clients for Folding@Home); wildly varying performance between nodes “easily” accommodated; fail-over almost “automatic”; Sun Grid Engine; MapReduce / Hadoop; can be a cycle-stealing background process. Cluster: tightly connected; nodes in tight communication with each other; failures are hard to handle -- intermediate results often saved; MPI-2; usually homogeneous nodes; varying performance can cause severe performance loss if not accounted for carefully; MPI (SGE / other schedulers). We will be focusing on clusters; this is where MPI is used.
Network transfers (even on fast networks) are expensive compared to memory transactions
Number of bi-directional links for N nodes = N*(N-1)/2: 15 for 6 nodes, 28 for 8; ~N^2/2. Managing this yourself is complicated and time-consuming! >>> This is what MPI simplifies for us. So what is MPI?
ANL = Argonne National Laboratory. *Although available on many platforms, MPI has a unix heritage, and is most natural to use in unix-y (Mac, Linux, Sun) environments. (OpenMPI ships standard on Macs w/ Leopard.) Low-level: there are some functions that operate on the data type (reduce operations) -- but most “just” shuffle bytes around.
MPI is everywhere in high-performance computing, but why? >>> So what does MPI do for you? Why should you use it? Look at the complexity of setting up a distributed system again.
Can also provide profiling; MPI can use different communication for different sets of peers (e.g. SMP, Infiniband, TCP/IP). You could (almost) write any MPI program with these 4 calls; much different from pthreads w/ mutexes, OpenMP, GPU, etc.; communication provides synchronization by its nature -- no dealing with “locks” on “shared” variables, etc. BUT: need to be sure each node is initializing variables correctly…
Getting back to what MPI is a little more…Even though most MPI programs could be written with just a few MPI commands, there are quite a few available….
Linux / Mac instructions; Leopard already has OpenMPI installed, in /usr/bin/mpi[cc|cxx|run]. Not familiar with the Windows version; see the Windows portion of http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-1.0.8-installguide.pdf. Fortran and C++ if desired. Supports shared-mem and TCP channels.
MPD = multiprocessing daemon; used to start one daemon per host; these daemons are used to start the actual jobs. Talk a little more about MPD.
MPD = multiprocessing daemon; launched (and left running) on each node that wants to be ready to participate in an MPI execution. Other options (mpirun) exist, but mpd is fast for starting new jobs (as opposed to new ssh sessions created each time a job is run).
MPD = multiprocessing daemon
MPI_Init() must be called in every program that will use MPI calls. Caveat: printing to stdout (stderr) from different nodes works, but it is not guaranteed to be synchronized. On click: note 2 printed before 1 (even though 2 occurred after 1, as enforced by MPI_Barrier()); fflush() does not fix this… send all IO to one process for printout. Now is a good time to discuss what actually happens when an MPI parallel job is run (in contrast to a threaded job).
I want to look a little more into what actually happens when a parallel MPI program runs. Let’s start by looking at how a parallel threaded app runs. Threads are spawned at runtime as requested by the program. Multiple threads may be spawned and joined over the course of a program. Each thread has access to memory to do its work (whatever it may be). main() is only entered and exited once.
Multiprocessing daemons are already running and know about each other. NOT SHOWING RANK 2. Each rank is a full program; it starts in main and exits from main. MPI_Bcast() / MPI_Allreduce() included here as a way to show communication between nodes. Program logic during execution determines who does what.
Only the portions highlighted are different between the nodes; however, every line -- the full program -- is executed on each node; tests are performed at run time to select different code to run on each node. This is different from threaded apps, where common (global) code sections (initializations, etc.) are really only run once. As long as init << parallel work, this is not a big performance issue. (It doesn’t have to be done this way, but this is the typical way; different executables can be run as different processes, if desired.)
MPE is useful for understanding what MPI is doing
Black sections between barriers are the printf calls ~20usec each
More of a debug tool
Transpose; ~88MB data set in 70 ms (1.2GB/s)
320x176x320 -> 384x256x384 in ~0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library!!! Custom labels in the profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
Let’s get back to MPI programming by examining the two basic building blocks for any MPI program: MPI_Send & MPI_Recv. You can make communicators that include only a subset of the active nodes; useful for doing “broadcasts” within a subset, etc. Tag can be used to separate classes of messages; it’s up to the user. It can be used with zero-length messages to communicate something via the tag alone, e.g. “ready” or “complete”.
It’s important to note that the types don’t have to be exactly the same (a strided vector could be received / sent from a contiguous vector).
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
Threshold is 16kB
We can see that sends can complete before the matching receives are posted, but not vice-versa. (Timing enforced by message passing; no mutexes required!)
Threshold is 16kB
“Small” messages get sent into pre-allocated (within the MPI library) buffers; this allows the sender to return quicker, generates less traffic, etc. (“eager”). “Large” messages get sent only once the receiver has posted the receive request, with the receive buffer (“rendezvous”).
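The “does this work?” question can be made concrete with a sketch (SENDSIZE and the 2-rank assumption are illustrative): both ranks send first and receive second, which succeeds for eager-sized messages but deadlocks once the size crosses the rendezvous threshold:

```c
#include <mpi.h>

#define SENDSIZE 65536   /* above a typical eager threshold -> deadlock */

/* Both ranks send first, then receive. With small (eager) messages the
 * sends complete into library buffers and this "works"; with large
 * (rendezvous) messages both ranks block in MPI_Send forever, each
 * waiting for a receive that is never posted. */
int main(int argc, char **argv)
{
    int rank, other;
    static char sbuf[SENDSIZE], rbuf[SENDSIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                        /* assumes exactly 2 ranks */

    MPI_Send(sbuf, SENDSIZE, MPI_CHAR, other, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, SENDSIZE, MPI_CHAR, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```

Reordering so one rank receives first, or using MPI_Isend / MPI_Sendrecv, makes the exchange correct regardless of the threshold.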
Most of these also have _init modes to create a persistent request that can be started with MPI_Start[all]() and completed with MPI_[Test|Wait][any|all|some]. *By basic, I mean excluding things like broadcasts, scatters, reduces, all of which have some send action included within them.
Does this work? You don’t have enough information; I haven’t told you if I compiled with SENDSIZE predefined, or what the eager / rendezvous threshold is.
5-10 us for MPI_Isend / MPI_Irecv to return; xfer took 1.7 ms (147 MB/s).
It’s important to look at the work you need to speed up and understand which approach will do better for you. Multiple separable tasks, each of ~the same difficulty, work well with task parallelism.
Data parallelism works well with large data sets. Load balancing can become an issue if relative workloads aren’t known a priori.
It’s important to consider how to split the data in a data-parallel system. Suppose you know you want to do MIPs across Z repeatedly; ignoring everything else, you would want to lay out the data with z available locally (but not necessarily contiguous; SSE instructions for maximums don’t work along the four contiguous elements, but between pairs of elements in two four-value sets -- have z as your next-to-fastest dimension). Other examples of distributions are cyclic and block-cyclic; also high-dimension splitting (into a grid, for example).
People are really doing this… We’re really doing this… Data is split along x immediately after FTx and distributed to all nodes. Calibration scan taken earlier. (This is not the iterative reconstruction.) GW is done one dimension at a time; it requires data along that dimension to be local, so we transpose before the GWx correction. 2s for a 400x320x220, R=8, NCOILS=8 volume; ~3.1 seconds for acquisition (calf data set).
It’s important to look at your problem and determine where it can be separated out. In general, MPI works better if you can separate at the large scale rather than the fine scale; SIMD is an example of fine-scale parallelism. Are each of these separable? Maximum: can do local maximums, and then the max of maximums. Sort: parallel bitonic sort; out of scope here; 55 ms for 1/4 qsort; 85 ms for a full parallel sort; ~220 ms for one qsort of the full vector (1 mega-element ints). 1D FFT: fine as long as the FFT is not along the split dimension (assuming the time of a single 1D FFT is small enough that you won’t try to split it up). 2D FFTs: easy as long as not split along the FFTs. 3D FFTs: perform along contiguous dims; swap for the final one (“transposed input/output” options in the fftw3 MPI implementation).
320x176x320 -> 384x256x384 in ~0.5 sec (14 GFlops just for the FFTs). Light pink = fftw library. Custom labels in the profile: 2d - tp - 1d - pad - 1d - tp - pad - 2d. On click: info boxes.
One-sided communication opens up race conditions concerns again, but gains some latency / BW because of reduced negotiation
Efficient: make use of all data on a cache line when you read it; and only read it once
Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” There are some packages out there (OpenMPI w/ Eclipse; TotalView) to help with debugging MPI. Errors on other nodes can cause the one you’re debugging to receive a signal to exit.
You can build a cluster virtually just to see how things work…
All of these mailing lists are active, and wonderful places to get help (After you’ve read the Docs & FAQ!)