
MPI Sessions: a proposal to the MPI Forum

Computer Engineer at Cisco Systems, Inc
Mar. 3, 2016



  1. How to make MPI Awesome: MPI Sessions
     Follow-on to Jeff’s crazy thoughts discussed in Bordeaux.
     Random group of people who have been talking about this stuff: Wesley Bland, Ryan Grant, Dan Holmes, Kathryn Mohror, Martin Schulz, Anthony Skjellum, Jeff Squyres
  2. What we want
     • Any thread (e.g., a library) can use MPI any time it wants
     • But still be able to totally clean up MPI if/when desired
     • New parameters to initialize the MPI API
     (Figure: one MPI process in which many independent libraries – Library 1 through Library 12 – each call MPI_Init(…))
  3. Before MPI-3.1, this could be erroneous
     int my_thread1_main(void *context) {
         MPI_Initialized(&flag);
         // …
     }
     int my_thread2_main(void *context) {
         MPI_Initialized(&flag);
         // …
     }
     int main(int argc, char **argv) {
         MPI_Init_thread(…, MPI_THREAD_FUNNELED, …);
         pthread_create(…, my_thread1_main, NULL);
         pthread_create(…, my_thread2_main, NULL);
         // …
     }
     These might run at the same time (!)
  4. The MPI-3.1 solution • MPI_INITIALIZED (and friends) are allowed to be called at any time – …even by multiple threads – …regardless of MPI_THREAD_* level • This is a simple, easy-to-explain solution – And probably what most applications do, anyway  • But many other paths were investigated
  5. MPI-3.1 MPI_INIT / FINALIZE limitations
     • Cannot init MPI from different entities within a process without a priori knowledge / coordination
       – I.e.: MPI-3.1 (intentionally) still did not solve the underlying problem
     Within one MPI process:
       // Library 1 (thread)
       MPI_Initialized(&flag);
       if (!flag) MPI_Init(…);
       // Library 2 (thread)
       MPI_Initialized(&flag);
       if (!flag) MPI_Init(…);
     THIS IS INSUFFICIENT / POTENTIALLY ERRONEOUS
  6. (More of) What we want • Fix MPI-3.1 limitations: – Cannot init MPI from different entities within a process without a priori knowledge / coordination – Cannot initialize MPI more than once – Cannot set error behavior of MPI initialization – Cannot re-initialize MPI after it has been finalized
  7. All these things overlap: any thread can use MPI any time; still be able to finalize MPI; re-initialize MPI; affect MPI initialization error behavior
  8. How do we get those things?
  9. KEEP CALM AND LISTEN TO THE ENTIRE PROPOSAL
  10. New concept: “session” • A local handle to the MPI library – Implementation intent: lightweight / uses very few resources – Can also cache some local state • Can have multiple sessions in an MPI process – MPI_Session_init(…, &session); – MPI_Session_finalize(…, &session);
  11. MPI Session (Figure: within one MPI process, an ocean library and an atmosphere library each call MPI_SESSION_INIT(…) against the same underlying MPI library)
  12. MPI Session (Figure: the resulting ocean session and atmosphere session are unique handles to the underlying MPI library)
  13. Initialize / finalize a session • MPI_Session_init( – IN MPI_Info info, – IN MPI_Errhandler errhandler, – OUT MPI_Session *session) • MPI_Session_finalize( – INOUT MPI_Session *session) • Parameters described in next slides…
  14. Session init params • Info: for future expansion • Errhandler: to be invoked if MPI_SESSION_INIT errors – Likely need a new type of errhandler • …or a generic errhandler • The FT working group is discussing exactly this topic
  15. MPI Session (Figure: within one MPI process, the ocean session is set up so that errors return, while the atmosphere session is set up so that errors abort – each session has unique errhandlers, info, local state, etc.)
  16. Great. I have a session. Now what?
  17. Fair warning • The MPI runtime has long been a bastard stepchild – Barely acknowledged in the standard – Mainly in the form of non-normative suggestions • It’s time to change that
  18. Overview • General scheme: – Query the underlying run-time system • Get a “set” of processes – Determine the processes you want • Create an MPI_Group – Create a communicator with just those processes • Create an MPI_Comm Query runtime for set of processes MPI_Group MPI_Comm MPI_Session
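Put together, the general scheme above might look like the following sketch. This is C-style pseudocode against the proposed (not final) API – argument lists are abbreviated with "…"-style placeholders, and the tag string is an arbitrary example:

```
// Pseudocode sketch of the proposal's flow – not compilable as-is
MPI_Session session;
MPI_Group group;
MPI_Comm comm;

// 1. Get a local handle to the MPI library
MPI_Session_init(info, errhandler, &session);

// 2. Query the runtime for a named set of processes
//    and make an MPI_Group from it
MPI_Group_create_from_session(session, "mpi://WORLD", &group);

// 3. Make a communicator containing just those processes
MPI_Create_comm_from_group(group, "example-tag", info, errhandler, &comm);

// … use the communicator …

MPI_Comm_free(&comm);
MPI_Group_free(&group);
MPI_Session_finalize(&session);
```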
  19. Runtime concepts • Expose 2 concepts to MPI from the runtime: 1. Static sets of processes 2. Each set caches (key,value) string tuples These slides only discuss static sets (unchanged for the life of the process). However, there are several useful scenarios that involve dynamic membership of sets over time. More discussion needs to occur for these scenarios. For the purposes of these slides, just consider static sets.
  20. Static sets of processes • Sets are identified by string name • Two sets are mandated – “mpi://WORLD” – “mpi://SELF” • Other sets can be defined by the system: – “location://rack/19” – “network://leaf-switch/37” – “arch://x86_64” – “job://12942” – … etc. • Processes can be in more than one set These names are implementation- dependent
  21. Examples of sets (Figure: four MPI processes, 0 through 3, all belong to the set mpi://WORLD)
  22. Examples of sets (Figure: the same four processes also all belong to the set arch://x86_64)
  23. Examples of sets (Figure: the same four processes also all belong to the set job://12942)
  24. Examples of sets (Figure: each of the four processes belongs to its own mpi://SELF set)
  25. Examples of sets (Figure: per-process set names – processes 0 and 1 show location://rack/self, process 2 shows location://rack/17, process 3 shows location://rack/23)
  26. Examples of sets (Figure: processes 0 and 1 belong to user://ocean; processes 2 and 3 belong to user://atmosphere) mpiexec --np 2 --set user://ocean ocean.exe : --np 2 --set user://atmosphere atmosphere.exe
  27. Querying the run-time • MPI_Session_get_names( – IN MPI_Session session, – OUT char **set_names) • Returns an argv-style list of 0-terminated names – Must be freed by the caller. Example list of set names returned: mpi://WORLD, mpi://SELF, arch://x86_64, location://rack/17, job://12942, user://ocean
  28. Values in sets • Each set has an associated MPI_Info object • One mandated key in each info: – “size”: number of processes in this set • Runtime may also provide other keys – Implementation-dependent
  29. Querying the run-time • MPI_Session_get_info( – IN MPI_Session session, – IN const char *set_name, – OUT MPI_Info *info) • Use existing MPI_Info functions to retrieve (key,value) tuples
  30. Example
      MPI_Info info;
      MPI_Session_get_info(session, "mpi://WORLD", &info);
      char size_str[MPI_MAX_INFO_VAL];
      MPI_Info_get(info, "size", …, size_str, …);
      int size = atoi(size_str);
  31. Ummmm… great. What’s the point of that?
  32. Make MPI_Groups! • MPI_Group_create_from_session( – IN MPI_Session session, – IN const char *set_name, – OUT MPI_Group *group); Advice to implementers: This MPI_Group can still be a lightweight object (even if there are a large number of processes in it)
  33. Example
      // Make a group of procs from "location://rack/self"
      MPI_Group_create_from_session(session, "location://rack/self", &group);
      // Use just the even procs
      MPI_Group_size(group, &size);
      ranges[0][0] = 0;
      ranges[0][1] = size - 1;
      ranges[0][2] = 2;
      MPI_Group_range_incl(group, 1, ranges, &group_of_evens);
  34. Make a communicator from that group • MPI_Create_comm_from_group( – IN MPI_Group group, – IN const char *tag, // for matching (see next slide) – IN MPI_Info info, – IN MPI_Errhandler errhandler, – OUT MPI_Comm *comm) Note: this is different from the existing function MPI_Comm_create_group(oldcomm, group, (int) tag, &newcomm). Might need a better name for this new function…?
  35. The string tag is used to match concurrent creations by different entities (Figure: across three MPI processes, every ocean library calls MPI_Create_comm_from_group(…, tag = “gov.anl.ocean”, …) while every atmosphere library calls MPI_Create_comm_from_group(…, tag = “gov.llnl.atmosphere”, …))
  36. Make any kind of communicator • MPI_Create_cart_comm_from_group( – IN MPI_Group group, – IN const char *tag, – IN MPI_Info info, – IN MPI_Errhandler errhandler, – IN int ndims, – IN const int dims[], – IN const int periods[], – IN int reorder, – OUT MPI_Comm *comm)
  37. Make any kind of communicator • MPI_Create_graph_comm_from_group(…) • MPI_Create_dist_graph_comm_from_group(…) • MPI_Create_dist_graph_adjacent_comm_from_group(…)
  38. Run-time static sets across different sessions in the same process • Making communicators from the same static set will always result in the same local rank – Even if created from different sessions See example in the next slide…
  39. Run-time static sets across different sessions in the same process
      // Session, group, and communicator 1
      MPI_Group_create_from_session(session_1, "mpi://WORLD", &group1);
      MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
      MPI_Comm_rank(comm1, &rank1);
      // Session, group, and communicator 2
      MPI_Group_create_from_session(session_2, "mpi://WORLD", &group2);
      MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
      MPI_Comm_rank(comm2, &rank2);
      // Ranks are guaranteed to be the same
      assert(rank1 == rank2);
      Law of Least Astonishment
  40. Mixing requests from different sessions: disallowed
      // Session, group, and communicator 1
      MPI_Group_create_from_session(session_1, "mpi://WORLD", &group1);
      MPI_Create_comm_from_group(group1, "ocean", …, &comm1);
      MPI_Isend(…, &req[0]);
      // Session, group, and communicator 2
      MPI_Group_create_from_session(session_2, "mpi://WORLD", &group2);
      MPI_Create_comm_from_group(group2, "atmosphere", …, &comm2);
      MPI_Isend(…, &req[1]);
      // Mixing requests from different sessions is disallowed
      MPI_Waitall(2, req, …);
      Rationale: this is difficult to optimize, particularly if a session maps to hardware resources
  41. MPI_Session_finalize • Analogous to MPI_FINALIZE – Can block waiting for the destruction of the objects derived from that session • Communicators, Windows, Files, … etc. – Each session that is initialized must be finalized
  42. Well, that all sounds great. …but who calls MPI_INIT? And what session does MPI_COMM_WORLD / MPI_COMM_SELF belong to?
  43. New concept: no longer require MPI_INIT / MPI_FINALIZE
  44. New concept: no longer require MPI_INIT / MPI_FINALIZE • WHAT?! • When will MPI initialize itself? • How will MPI finalize itself? – It is still (very) desirable to allow MPI to clean itself up so that MPI processes can be “valgrind clean” when they exit
  45. Split MPI APIs into two sets
      Performance doesn’t matter (as much):
      • Functions that create / query / destroy: MPI_Comm, MPI_File, MPI_Win, MPI_Info, MPI_Op, MPI_Errhandler, MPI_Datatype, MPI_Group, MPI_Session, attributes, processes
      • MPI_T
      Performance absolutely matters:
      • Point-to-point • Collectives • I/O • RMA • Test/Wait • Handle language transfer
  46. Split MPI APIs into two sets (same two columns as the previous slide) MPI is ensured to be initialized (and/or finalized) by the “performance doesn’t matter” functions; the “performance absolutely matters” functions still can’t be used unless MPI is initialized
  47. Split MPI APIs into two sets (same two columns) The “performance doesn’t matter” functions init / finalize MPI transparently; the “performance absolutely matters” functions can’t be called without a handle created from the left-hand column
  48. Split MPI APIs into two sets (same two columns) MPI_COMM_WORLD and MPI_COMM_SELF are notable exceptions. …I’ll address this shortly.
  49. Example
      int main() {
          // Create a datatype – initializes MPI
          MPI_Type_contiguous(2, MPI_INT, &mytype);
      The creation of the first user-defined MPI object initializes MPI. Initialization can be a local action!
  50. Example
      int main() {
          // Create a datatype – initializes MPI
          MPI_Type_contiguous(2, MPI_INT, &mytype);
          // Free the datatype – finalizes MPI
          MPI_Type_free(&mytype);
          // Valgrind clean
          return 0;
      }
      The destruction of the last user-defined MPI object finalizes / cleans up MPI. This is guaranteed. There are some corner cases described on the following slides.
  51. Example
      int main() {
          // Create a datatype – initializes MPI
          MPI_Type_contiguous(2, MPI_INT, &mytype);
          // Free the datatype – finalizes MPI
          MPI_Type_free(&mytype);
          // Re-initialize MPI!
          MPI_Type_dup(MPI_INT, &mytype);
      We can also re-initialize MPI! (It’s transparent to the user – so why not?)
  52. Example
      int main() {
          // Create a datatype – initializes MPI
          MPI_Type_contiguous(2, MPI_INT, &mytype);
          // Free the datatype – finalizes MPI
          MPI_Type_free(&mytype);
          // Re-initialize MPI!
          MPI_Type_dup(MPI_INT, &mytype);
          return 0;
      }
      (Sometimes) it is not an error to exit the process with MPI still initialized
  53. The overall theme • Just use MPI functions whenever you want – MPI will initialize as it needs to – Initialization essentially becomes an implementation detail • Finalization will occur whenever all user- defined handles are destroyed
  54. Wait a minute – what about MPI_COMM_WORLD?
      int main() {
          // Can’t I do this?
          MPI_Send(…, MPI_COMM_WORLD);
      This would be calling a “performance matters” function before a “performance doesn’t matter” function, i.e., MPI has not been initialized yet
  55. Wait a minute – what about MPI_COMM_WORLD?
      int main() {
          // This is valid
          MPI_Init(NULL, NULL);
          MPI_Send(…, MPI_COMM_WORLD);
      Re-define MPI_INIT and MPI_FINALIZE as the constructor and destructor for MPI_COMM_WORLD and MPI_COMM_SELF
  56. INIT and FINALIZE
      int main() {
          MPI_Init(NULL, NULL);
          MPI_Send(…, MPI_COMM_WORLD);
          MPI_Finalize();
      }
      INIT and FINALIZE continue to exist for two reasons: 1. backwards compatibility, and 2. convenience. So let’s keep them as close to MPI-3.1 as possible:
      • If you call INIT, you have to call FINALIZE
      • You can only call INIT / FINALIZE once
      • INITIALIZED / FINALIZED only refer to INIT / FINALIZE (not sessions)
      If you want different behavior, use sessions
  57. INIT and FINALIZE • INIT/FINALIZE create an implicit session – You cannot extract an MPI_Session handle for the implicit session created by MPI_INIT[_THREAD] • Yes, you can use INIT/FINALIZE in the same MPI process as other sessions
  58. Backwards compatibility: INITIALIZED and FINALIZED behavior
      int main() {
          MPI_Initialized(&flag); assert(flag == false);
          MPI_Finalized(&flag); assert(flag == false);
          MPI_Session_init(…, &session1);
          MPI_Initialized(&flag); assert(flag == false);
          MPI_Finalized(&flag); assert(flag == false);
          MPI_Init(NULL, NULL);
          MPI_Initialized(&flag); assert(flag == true);
          MPI_Finalized(&flag); assert(flag == false);
          MPI_Session_finalize(&session1);
          MPI_Initialized(&flag); assert(flag == true);
          MPI_Finalized(&flag); assert(flag == false);
          MPI_Session_init(…, &session2);
          MPI_Initialized(&flag); assert(flag == true);
          MPI_Finalized(&flag); assert(flag == false);
          MPI_Finalize();
          MPI_Initialized(&flag); assert(flag == true);
          MPI_Finalized(&flag); assert(flag == true);
          MPI_Session_finalize(&session2);
          MPI_Initialized(&flag); assert(flag == true);
          MPI_Finalized(&flag); assert(flag == true);
      }
      Short version: INITIALIZED, FINALIZED, and IS_THREAD_MAIN all still refer to INIT / FINALIZE
  59. FIN (for the main part of the proposal)
  60. Items that still need more discussion
  61. Issues that still need more discussion • Dynamic runtime sets – Temporal – Membership • Covered in other proposals: – Thread concurrent vs. non-concurrent – Generic error handlers
  62. Issues that still need more discussion • If COMM_WORLD|SELF are not available by default: – Do we need new destruction hooks to replace SELF attribute callbacks on FINALIZE? – What is the default error handler behavior for functions without comm/file/win? • Do we need syntactic sugar to get a comm from mpi://WORLD? • How do tools hook into MPI initialization and finalization?
  63. Session queries • Query session handle equality – MPI_Session_query(handle1, handle1_type, handle2, handle2_type, bool *are_they_equal) – Not 100% sure we need this…?
  64. Session thread support • Associate thread level support with sessions • Three options: 1. Similar to MPI-3.1: the “first” initialization picks the thread level 2. Let each session pick its own thread level (via an info key in MPI_SESSION_INIT) 3. Just make MPI always be THREAD_MULTIPLE

Editor's Notes

  1. Even though WAITALL is semantically equivalent to a loop of WAITs, an implementation may have to scan/choose whether to use an optimized plural implementation or have to split it out into a sequence of WAITs. Plus: what does it mean if multiple sessions have different thread levels and we WAITALL on requests from them?
  2. NOTE: Handle xfer functions need to be high performance, too. We convinced ourselves that this is still implementable: MPICH-like implementations: no issue. OMPI-like implementation: can have an initial table that is all the pre-defined f2c lookups. Upon first user handle creation, alloc a new table (and re-alloc every time you need to grow after that).