How to make MPI Awesome:
MPI Sessions
Follow-on to Jeff’s crazy thoughts discussed in Bordeaux
Random group of people who have been talking about this stuff:
Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror,
Martin Schulz, Anthony Skjellum, Jeff Squyres
What we want
• Any thread (e.g., library) can use MPI any time it wants
• But still be able to totally clean up MPI if/when desired
• New parameters to initialize the MPI API
MPI Process
// Library 1
MPI_Init(…);
// Library 2
MPI_Init(…);
// Library 3
MPI_Init(…);
// Library 4
MPI_Init(…);
// Library 5
MPI_Init(…);
// Library 6
MPI_Init(…);
// Library 7
MPI_Init(…);
// Library 8
MPI_Init(…);
// Library 9
MPI_Init(…);
// Library 10
MPI_Init(…);
// Library 11
MPI_Init(…);
// Library 12
MPI_Init(…);
Before MPI-3.1, this could be erroneous
int my_thread1_main(void *context) {
  int flag;
  MPI_Initialized(&flag);
  // …
}
int my_thread2_main(void *context) {
  int flag;
  MPI_Initialized(&flag);
  // …
}
int main(int argc, char **argv) {
MPI_Init_thread(…, MPI_THREAD_FUNNELED, …);
pthread_create(…, my_thread1_main, NULL);
pthread_create(…, my_thread2_main, NULL);
// …
}
These might run at the same time (!)
The MPI-3.1 solution
• MPI_INITIALIZED (and friends) are allowed to
be called at any time
– …even by multiple threads
– …regardless of MPI_THREAD_* level
• This is a simple, easy-to-explain solution
– And probably what most applications do, anyway
• But many other paths were investigated
MPI-3.1 MPI_INIT / FINALIZE limitations
• Cannot init MPI from different entities within a process without
a priori knowledge / coordination
– I.e.: MPI-3.1 (intentionally) still did not solve the underlying problem
MPI Process
// Library 1 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);
// Library 2 (thread)
MPI_Initialized(&flag);
if (!flag) MPI_Init(…);
THIS IS INSUFFICIENT /
POTENTIALLY ERRONEOUS
(More of) What we want
• Fix MPI-3.1 limitations:
– Cannot init MPI from different entities within a
process without a priori knowledge / coordination
– Cannot initialize MPI more than once
– Cannot set error behavior of MPI initialization
– Cannot re-initialize MPI after it has been finalized
All these things overlap
Still be able to
finalize MPI
Any thread can
use MPI any time
Re-initialize MPI
Affect MPI
initialization error
behavior
New concept: “session”
• A local handle to the MPI library
– Implementation intent: lightweight / uses very few
resources
– Can also cache some local state
• Can have multiple sessions in an MPI process
– MPI_Session_init(…, &session);
– MPI_Session_finalize(&session);
MPI Session
[Diagram: one MPI process containing an ocean library and an atmosphere library; each has its own session into the shared MPI library — an "ocean session" and an "atmosphere session".]
Unique handles to the underlying MPI library
Initialize / finalize a session
• MPI_Session_init(
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– OUT MPI_Session *session)
• MPI_Session_finalize(
– INOUT MPI_Session *session)
• Parameters described in next slides…
Session init params
• Info: For future expansion
• Errhandler: to be invoked if
MPI_SESSION_INIT errors
– Likely need a new type of errhandler
• …or a generic errhandler
• The FT working group is discussing exactly this topic
MPI Session
[Diagram: one MPI process; the ocean library's session is configured so that errors return, while the atmosphere library's session is configured so that errors abort.]
Unique errhandlers, info, local state, etc.
Fair warning
• The MPI runtime has long been a bastard stepchild
– Barely acknowledged in
the standard
– Mainly in the form of
non-normative
suggestions
• It’s time to change that
Overview
• General scheme:
– Query the underlying
run-time system
• Get a “set” of processes
– Determine the processes
you want
• Create an MPI_Group
– Create a communicator
with just those processes
• Create an MPI_Comm
[Flow: MPI_Session → query runtime for set of processes → MPI_Group → MPI_Comm]
Runtime concepts
• Expose 2 concepts to MPI from the runtime:
1. Static sets of processes
2. Each set caches (key,value) string tuples
These slides only discuss static sets
(unchanged for the life of the process).
However, there are several useful scenarios that
involve dynamic membership of sets over time.
More discussion needs to occur for these scenarios.
For the purposes of these slides,
just consider static sets.
Static sets of processes
• Sets are identified by string name
• Two sets are mandated
– “mpi://WORLD”
– “mpi://SELF”
• Other sets can be defined by the system:
– “location://rack/19”
– “network://leaf-switch/37”
– “arch://x86_64”
– “job://12942”
– … etc.
• Processes can be in more than one set
These names are
implementation-
dependent
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
arch://x86_64
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://WORLD
job://12942
arch://x86_64
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
mpi://SELF mpi://SELF mpi://SELF mpi://SELF
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
location://rack/self location://rack/self
location://rack/17 location://rack/23
Examples of sets
MPI process 0 MPI process 1 MPI process 2 MPI process 3
user://ocean user://atmosphere
mpiexec
--np 2 --set user://ocean ocean.exe :
--np 2 --set user://atmosphere atmosphere.exe
Querying the run-time
• MPI_Session_get_names(
– IN MPI_Session session,
– OUT char **set_names)
• Returns argv-style list of
0-terminated names
– Must be freed by caller
Example list of set names returned
mpi://WORLD
mpi://SELF
arch://x86_64
location://rack/17
job://12942
user://ocean
Values in sets
• Each set has an associated MPI_Info object
• One mandated key in each info:
– “size”: number of processes in this set
• Runtime may also provide other keys
– Implementation-dependent
Querying the run-time
• MPI_Session_get_info(
– IN MPI_Session session,
– IN const char *set_name,
– OUT MPI_Info *info)
• Use existing MPI_Info functions to retrieve
(key,value) tuples
Make MPI_Groups!
• MPI_Group_create_from_session(
– IN MPI_Session session,
– IN const char *set_name,
– OUT MPI_Group *group);
Advice to implementors:
This MPI_Group can still be a
lightweight object (even if there are
a large number of processes in it)
Example
// Make a group of procs from "location://rack/self"
MPI_Group_create_from_session(
    session, "location://rack/self", &group);
// Use just the even procs
MPI_Group_size(group, &size);
ranges[0][0] = 0;
ranges[0][1] = size - 1;  // last rank (range ends are inclusive)
ranges[0][2] = 2;
MPI_Group_range_incl(group, 1, ranges,
    &group_of_evens);
Make a communicator from that group
• MPI_Create_comm_from_group(
– IN MPI_Group group,
– IN const char *tag, // for matching (see next slide)
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– OUT MPI_Comm *comm)
Note: this is different than
the existing function
MPI_Comm_create_group(
oldcomm, group, (int) tag,
&newcomm)
Might need a better name
for this new function…?
String tag is used to match concurrent
creations by different entities
[Diagram: three MPI processes, each containing both an ocean library and an atmosphere library; each library issues its own creation call across all three processes.]
MPI_Create_comm_from_group(…, tag = "gov.anl.ocean", …)
MPI_Create_comm_from_group(…, tag = "gov.llnl.atmosphere", …)
Make any kind of communicator
• MPI_Create_cart_comm_from_group(
– IN MPI_Group group,
– IN const char *tag,
– IN MPI_Info info,
– IN MPI_Errhandler errhandler,
– IN int ndims,
– IN const int dims[],
– IN const int periods[],
– IN int reorder,
– OUT MPI_Comm *comm)
Make any kind of communicator
• MPI_Create_graph_comm_from_group(…)
• MPI_Create_dist_graph_comm_from_group(…)
• MPI_Create_dist_graph_adjacent_comm_from_group(…)
Run-time static sets across different
sessions in the same process
• Making communicators from the same static
set will always result in the same local rank
– Even if created from different sessions
See example in the
next slide…
Run-time static sets across different
sessions in the same process
// Session, group, and communicator 1
MPI_Group_create_from_session(session_1,
    "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Comm_rank(comm1, &rank1);
// Session, group, and communicator 2
MPI_Group_create_from_session(session_2,
    "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …,
    &comm2);
MPI_Comm_rank(comm2, &rank2);
// Ranks are guaranteed to be the same
assert(rank1 == rank2);
Law of Least
Astonishment
Mixing requests from different
sessions: disallowed
// Session, group, and communicator 1
MPI_Group_create_from_session(session_1,
    "mpi://WORLD", &group1);
MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
MPI_Isend(…, &req[0]);
// Session, group, and communicator 2
MPI_Group_create_from_session(session_2,
    "mpi://WORLD", &group2);
MPI_Create_comm_from_group(group2, "atmosphere", …,
    &comm2);
MPI_Isend(…, &req[1]);
// ERRONEOUS: mixing requests from
// different sessions is disallowed
MPI_Waitall(2, req, …);
Rationale: this is difficult to
optimize, particularly if a session
maps to hardware resources
MPI_Session_finalize
• Analogous to MPI_FINALIZE
– Can block waiting for the destruction of the
objects derived from that session
• Communicators, Windows, Files, … etc.
– Each session that is initialized must be finalized
Well, that all sounds great.
…but who calls MPI_INIT?
And what session does
MPI_COMM_WORLD /
MPI_COMM_SELF belong to?
New concept: no longer require
MPI_INIT / MPI_FINALIZE
• WHAT?!
• When will MPI initialize itself?
• How will MPI finalize itself?
– It is still (very) desirable to allow MPI to clean
itself up so that MPI processes can be “valgrind
clean” when they exit
Split MPI APIs into two sets
Performance doesn’t
matter (as much)
• Functions that create / query /
destroy:
– MPI_Comm
– MPI_File
– MPI_Win
– MPI_Info
– MPI_Op
– MPI_Errhandler
– MPI_Datatype
– MPI_Group
– MPI_Session
– Attributes
– Processes
• MPI_T
Performance
absolutely matters
• Point to point
• Collectives
• I/O
• RMA
• Test/Wait
• Handle language xfer
Split MPI APIs into two sets (cont.)
• "Performance doesn't matter" functions: ensure that MPI is initialized (and/or finalized) by these functions
• "Performance matters" functions: these still can't be used unless MPI is initialized
Split MPI APIs into two sets (cont.)
• "Performance doesn't matter" functions: init / finalize MPI transparently
• "Performance matters" functions: can't be called without a handle created by the "performance doesn't matter" functions
Split MPI APIs into two sets (cont.)
MPI_COMM_WORLD and MPI_COMM_SELF are notable exceptions.
…I'll address this shortly.
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
The creation of the first user-
defined MPI object initializes MPI
Initialization can be a local action!
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Valgrind clean
return 0;
}
The destruction of the last user-
defined MPI object finalizes /
cleans up MPI. This is guaranteed.
There are some
corner cases
described on the
following slides.
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Re-initialize MPI!
MPI_Type_dup(MPI_INT, &mytype);
We can also re-initialize MPI!
(it’s transparent to the user – so why not?)
Example
int main() {
// Create a datatype – initializes MPI
MPI_Type_contiguous(2, MPI_INT, &mytype);
// Free the datatype – finalizes MPI
MPI_Type_free(&mytype);
// Re-initialize MPI!
MPI_Type_dup(MPI_INT, &mytype);
return 0;
}
(Sometimes) Not an error to exit
the process with MPI still initialized
The overall theme
• Just use MPI functions whenever you want
– MPI will initialize as it needs to
– Initialization essentially becomes an
implementation detail
• Finalization will occur whenever all user-
defined handles are destroyed
Wait a minute –
What about MPI_COMM_WORLD?
int main() {
// Can’t I do this?
MPI_Send(…, MPI_COMM_WORLD);
This would be calling a
“performance matters”
function before a
“performance doesn’t
matter” function
I.e., MPI has not initialized yet
Wait a minute –
What about MPI_COMM_WORLD?
int main() {
// This is valid
MPI_Init(NULL, NULL);
MPI_Send(…, MPI_COMM_WORLD);
Re-define MPI_INIT and MPI_FINALIZE:
constructor and destructor for
MPI_COMM_WORLD and MPI_COMM_SELF
INIT and FINALIZE
int main() {
MPI_Init(NULL, NULL);
MPI_Send(…, MPI_COMM_WORLD);
MPI_Finalize();
}
INIT and FINALIZE continue to exist for two reasons:
1. Backwards compatibility
2. Convenience
So let’s keep them as close to MPI-3.1 as possible:
• If you call INIT, you have to call FINALIZE
• You can only call INIT / FINALIZE once
• INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions)
If you want different behavior, use sessions
INIT and FINALIZE
• INIT/FINALIZE create an implicit session
– You cannot extract an MPI_Session handle for the
implicit session created by MPI_INIT[_THREAD]
• Yes, you can use INIT/FINALIZE in the same
MPI process as other sessions
Issues that still need more discussion
• Dynamic runtime sets
– Temporal
– Membership
• Covered in other proposals:
– Thread concurrent vs. non-concurrent
– Generic error handlers
Issues that still need more discussion
• If COMM_WORLD|SELF are not available by
default:
– Do we need new destruction hooks to replace SELF
attribute callbacks on FINALIZE?
– What is the default error handler behavior for
functions without comm/file/win?
• Do we need syntactic sugar to get a comm from
mpi://WORLD?
• How do tools hook into MPI initialization and
finalization?
Session queries
• Query session handle equality
– MPI_Session_query(handle1, handle1_type,
handle2, handle2_type, bool *are_they_equal)
– Not 100% sure we need this…?
Session thread support
• Associate thread level support with sessions
• Three options:
1. Similar to MPI-3.1: “first” initialization picks
thread level
2. Let each session pick its own thread level (via
info key in SESSION_CREATE)
3. Just make MPI always be THREAD_MULTIPLE
Editor's Notes
Even though WAITALL is semantically equivalent to a loop of WAITs, an implementation may have to scan the request array to choose between an optimized plural implementation and splitting it into a sequence of WAITs.
Plus: what does it mean if multiple sessions have different thread levels and we WAITALL on requests from them?
NOTE: Handle xfer functions need to be high performance, too. We convinced ourselves that this is still implementable:
MPICH-like implementations: no issue.
OMPI-like implementation: can have an initial table that is all the pre-defined f2c lookups. Upon first user handle creation, alloc a new table (and re-alloc every time you need to grow after that).