SC'18 BoF Presentation

Ralph H. Castain
Intel
Joshua Hursey
IBM
PMIx: State-of-the-Union
https://pmix.org
Slack: pmix-workspace.slack.com
Agenda
• Quick status update
 Library releases
 Standards documents
• Application examples
 PMIx Basics: Querying information
 OpenSHMEM
 Interlibrary coordination
 MPI Sessions: Dynamic Process Grouping
 Dynamic Programming Models (web only)
Ralph H. Castain, Joshua Hursey, Aurelien Bouteiller, David Solt, “PMIx: Process management for exascale environments”, Parallel Computing, 2018. https://doi.org/10.1016/j.parco.2018.08.002
Overview Paper
• Standards Documents
 v2.0 released Sept. 27, 2018
 v2.1 (errata/clarifications, Dec.)
 v3.0 (Dec.)
 v4.0 (1Q19)
Status Snapshot
Current Library Releases
https://github.com/pmix/pmix-standard/releases
• Clarify terms
 Session, job, application
• New chapter detailing namespace registration data organization
• Data retrieval examples
 Application size in multi-app scenarios
 What rank to provide for which data attributes
PMIx Standard – In Progress
• Standards Documents
 v2.0 released Sept. 27, 2018
 v2.1 (errata/clarifications, Dec.)
 v3.0 (Dec.)
 v4.0 (1Q19)
• Reference Implementation
 v3.0 => v3.1 (Dec)
 v2.1 => v2.2 (Dec)
Status Snapshot
Current Library Releases
New dstore implementation
Full Standard Compliance
• v2 => workflow orchestration
 (Ch.7) Job Allocation Management and Reporting, (Ch.8) Event Notification
• v3 => tools support
 Security credentials, I/O management
• v4 => tools, network, dynamic programming models
 Complete debugger attach/detach/re-attach
• Direct and indirect launch
 Fabric topology information (switches, NICs, …)
• Communication cost, connectivity graphs, inventory collection
 PMIx Groups
 Bindings (Python, …)
What’s in the Standard?
• PRRTE
 Formal releases to begin in Dec. or Jan.
 Ties to external PMIx, libevent, hwloc installations
 Support up through the current head of the PMIx master branch
• Cross-version support
 Regularly tested, automated script
• Help Wanted: Container developers for regular regression testing
 v2.1.1 and above remain interchangeable
 https://pmix.org/support/faq/how-does-pmix-work-with-containers/
Status Snapshot
Adoption Updates
• MPI use-cases
 Re-ordering for load balance (UTK/ECP)
 Fault management (UTK)
 On-the-fly session formation/teardown (MPIF)
• MPI libraries
 OMPI, MPICH, Intel MPI, HPE-MPI, Spectrum MPI, Fujitsu MPI
• Resource Managers (RMs)
 Slurm, Fujitsu, IBM’s Job Step Manager (JSM), PBSPro (2019), Kubernetes(?)
• Spark, TensorFlow
 Just getting underway
• OpenSHMEM
 Multiple implementations
Artem Y. Polyakov, Joshua S. Ladd, Boris I. Karasev
Mellanox Technologies
Shared Memory Optimizations (ds21)
• Improvements to intra-node performance
 Highly optimized the intra-node communication component dstor/ds21[1]
• Significantly improved PMIx_Get latency using the 2N-lock algorithm on systems with a large number of processes per node accessing many keys.
• Designed for exascale: deployed in production on the POWER9-based CORAL systems, Summit and Sierra.
• Available in PMIx v2.2, v3.1, and subsequent releases.
 Enabled by default.
• Improvements to the SLURM PMIx plug-in
 Added a ring-based PMIx_Fence collective (Slurm v18.08)
 Added support for high-performance RDMA point-to-point via UCX (Slurm v17.11) [2]
 Verified PMIx compatibility through v3.x API (Slurm v18.08)
[1] Artem Polyakov et al. A Scalable PMIx Database (poster) EuroMPI’18: https://eurompi2018.bsc.es/sites/default/files/uploaded/dstore_empi2018.pdf
[2] Artem Polyakov PMIx Plugin with UCX Support (SC17), SchedMD Booth talk: https://slurm.schedmd.com/publications.html
Mellanox update
[Figure: PMIx_Get() latency on a POWER8 system]
Joshua Hursey
IBM
PMIx Basics: Querying Information
IBM
Systems
Harnessing PMIx capabilities
#include <stdio.h>
#include <string.h>
#include <pmix.h>

int main(int argc, char **argv) {
    pmix_proc_t myproc, proc_wildcard;
    pmix_value_t value;
    pmix_value_t *val = &value;
    uint32_t job_size;
    char hostname[256];

    PMIx_Init(&myproc, NULL, 0);
    PMIX_PROC_CONSTRUCT(&proc_wildcard);
    (void)strncpy(proc_wildcard.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    proc_wildcard.rank = PMIX_RANK_WILDCARD;

    PMIx_Get(&proc_wildcard, PMIX_JOB_SIZE, NULL, 0, &val);
    job_size = val->data.uint32;
    PMIx_Get(&myproc, PMIX_HOSTNAME, NULL, 0, &val);
    strncpy(hostname, val->data.string, 256);
    printf("%d/%d) Hello World from %s\n", myproc.rank, job_size, hostname);

    PMIx_Get(&proc_wildcard, PMIX_LOCALLDR, NULL, 0, &val);   // Lowest rank on this node
    printf("%d/%d) Lowest Local Rank: %d\n", myproc.rank, job_size, val->data.rank);
    PMIx_Get(&proc_wildcard, PMIX_LOCAL_SIZE, NULL, 0, &val); // Number of ranks on this node
    printf("%d/%d) Local Ranks: %d\n", myproc.rank, job_size, val->data.uint32);

    PMIx_Fence(&proc_wildcard, 1, NULL, 0); // Synchronize processes (not required, just for demo)
    PMIx_Finalize(NULL, 0);                 // Cleanup
    return 0;
}
• https://pmix.org/support/faq/rm-provided-information/
• https://pmix.org/support/faq/what-information-is-an-rm-supposed-to-provide/
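For reference, one way to build and launch the example above (hedged: linking with -lpmix against a PMIx install is standard, but 'prun' is PRRTE's launcher and your resource manager may use srun, jsrun, or another command instead; the file name is illustrative):
    cc pmix_hello.c -o pmix_hello -lpmix
    prun -n 4 ./pmix_hello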
IBM
Systems
Harnessing PMIx capabilities
volatile bool isdone = false;

static void cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo, void *cbdata,
                   pmix_release_cbfunc_t release_fn, void *release_cbdata)
{
    size_t n, i;
    if (0 < ninfo) {
        for (n = 0; n < ninfo; n++) {
            /* Iterate through the info[n].value.data.darray->array of:
             * typedef struct pmix_proc_info {
             *     pmix_proc_t proc;
             *     char *hostname;
             *     char *executable_name;
             *     pid_t pid;
             *     int exit_code;
             *     pmix_proc_state_t state;
             * } pmix_proc_info_t;
             */
            pmix_proc_info_t *pi = (pmix_proc_info_t *)info[n].value.data.darray->array;
            for (i = 0; i < info[n].value.data.darray->size; i++) {
                // e.g. print the proctable entries
                printf("Rank %d on %s (pid %d)\n",
                       pi[i].proc.rank, pi[i].hostname, (int)pi[i].pid);
            }
        }
    }
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    isdone = true;
}
• https://pmix.org/support/how-to/example-direct-launch-debugger-tool/
• https://pmix.org/support/how-to/example-indirect-launch-debugger-tool/
#include <stdlib.h>
#include <unistd.h>
#include <pmix_tool.h>   // tool API (PMIx_tool_init); also provides the pmix.h definitions

int main(int argc, char **argv) {
    pmix_status_t rc;
    pmix_query_t *query;
    char clientspace[PMIX_MAX_NSLEN+1];
    // ... Initialize and setup – PMIx_tool_init(), fill in clientspace ...

    /* get the proctable for this nspace */
    PMIX_QUERY_CREATE(query, 1);
    PMIX_ARGV_APPEND(rc, query[0].keys, PMIX_QUERY_PROC_TABLE);
    query[0].nqual = 1;
    PMIX_INFO_CREATE(query->qualifiers, query[0].nqual);
    PMIX_INFO_LOAD(&query->qualifiers[0], PMIX_NSPACE, clientspace, PMIX_STRING);
    if (PMIX_SUCCESS != (rc = PMIx_Query_info_nb(query, 1, cbfunc, NULL))) {
        exit(42);
    }
    /* wait to get a response */
    while (!isdone) { usleep(10); }
    // ... Finalize and cleanup ...
}
Tony Curtis <anthony.curtis@stonybrook.edu>
Dr. Barbara Chapman
Abdullah Shahneous Bari Sayket, Wenbin Lü, Wes Suttle
Stony Brook University
OSSS OpenSHMEM-on-UCX / Fault Tolerance
https://www.iacs.stonybrook.edu/
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• OpenSHMEM
 Partitioned Global Address Space library
 Take advantage of RDMA
• put/get directly to/from remote memory
• Atomics, locks
• Collectives
• And more…
http://www.openshmem.org/
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• OpenSHMEM: our bit
 Reference Implementation
• OSSS + LANL + SBU + Rice + Duke + Rhodes
• Wireup/maintenance: PMIx
• Communication substrate: UCX
 shm, IB, uGNI, CUDA, …
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• Reference Implementation
 PMIx/UCX for OpenSHMEM ≥ 1.4
 Also for < 1.4 based on GASNet
 Open-Source @ github
• Or my dev repo if you’re brave
• Our research vehicle
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• Reference Implementation
 PMIx/PRRTE used as out-of-band startup/launch
 Exchange of wireup info, e.g. heap addresses/rkeys (see the sketch below)
 Rank/locality and other “ambient” queries
 Stays out of the way until OpenSHMEM is finalized
 Could kill it once we’re up, but it is kept alive for the fault tolerance project
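As a concrete illustration of that wireup exchange, a minimal sketch of the put/commit/fence/get pattern (hedged: the "osss.heap.base" key is hypothetical, myproc and peer_rank come from the surrounding init code, and error handling is elided):

    pmix_value_t value;
    pmix_value_t *val;
    pmix_proc_t peer;
    uint64_t my_heap_base = /* local symmetric heap base */ 0;

    /* publish my heap base address for peers to fetch */
    PMIX_VALUE_LOAD(&value, &my_heap_base, PMIX_UINT64);
    PMIx_Put(PMIX_GLOBAL, "osss.heap.base", &value);
    PMIx_Commit();
    PMIx_Fence(NULL, 0, NULL, 0);   /* NULL procs => collect across the whole job */

    /* fetch a peer's heap base */
    PMIX_PROC_LOAD(&peer, myproc.nspace, peer_rank);
    PMIx_Get(&peer, "osss.heap.base", NULL, 0, &val);
    uint64_t peer_heap_base = val->data.uint64;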
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• Fault Tolerance (NSF) #1
 Project with UTK and Rutgers
 Current OpenSHMEM specification lacks general fault tolerance (FT) features
 PMIx has basic FT building blocks already
• Event handling, process monitoring, job control
 Using these features to build an FT API for OpenSHMEM (see the sketch below)
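A minimal sketch of the event-handling building block mentioned above – registering a handler for a process-failure event (hedged: PMIX_ERR_PROC_ABORTED is one plausible code, but which codes an RM actually delivers varies, and the mitigation body is application-specific):

    static void fault_cb(size_t evhdlr_registration_id, pmix_status_t status,
                         const pmix_proc_t *source,
                         pmix_info_t info[], size_t ninfo,
                         pmix_info_t results[], size_t nresults,
                         pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
    {
        /* application-specific fault mitigation for 'source' goes here */
        if (NULL != cbfunc) {
            cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
        }
    }

    /* register for process-abort events; the trailing NULLs skip the
       registration-confirmation callback */
    pmix_status_t code = PMIX_ERR_PROC_ABORTED;
    PMIx_Register_event_handler(&code, 1, NULL, 0, fault_cb, NULL, NULL);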
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• Fault Tolerance (NSF) #2
 User specifies
• Desired process monitoring scheme
• Application-specific fault mitigation procedure
 Then our FT API takes care of:
• Initiating specified process monitoring
• Registering fault mitigation routine with PMIx server
 PMIx takes care of the rest
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• Fault Tolerance (NSF) #3
 Longer-term goal: automated selection and implementation of FT techniques
 Compiler chooses from a small set of pre-packaged FT schemes
 Appropriate technique selected based on application structure and data access patterns
 Compiler implements the scheme at compile time
PMIx@SC18: OSSS OpenSHMEM-on-UCX / Fault Tolerance
• PMIx @ github
 We try to keep up on the bleeding edge of both PMIx & PRRTE for development
 But make sure releases work too!
 We’ve opened quite a few tickets
• Always fixed or nixed quickly!
Any (easy) Questions?
Tony Curtis <anthony.curtis@stonybrook.edu>
Dr. Barbara Chapman
Abdullah Shahneous Bari Sayket, Wenbin Lü, Wes Suttle
Stony Brook University
OSSS OpenSHMEM-on-UCX / Fault Tolerance
https://www.iacs.stonybrook.edu/
http://www.openshmem.org/
http://www.openucx.org/
Geoffroy Vallee
Oak Ridge National Laboratory
Runtime Coordination Based on PMIx
Introduction
• Many pre-exascale systems show a fairly drastic hardware architecture change
 Bigger/more complex compute nodes
 Summit: 2 IBM POWER9 CPUs, 6 GPUs per node
• Requires most scientific simulations to switch from pure MPI to an MPI+X paradigm
• MPI+OpenMP is a popular solution
Challenges
• MPI and OpenMP are independent communities
• Implementations are unaware of each other
 MPI will deploy ranks without any knowledge of the OpenMP threads that will run under each rank
 OpenMP assumes by default that all available resources can be used
It is very difficult for users to have fine-grained control over application execution
Context
• U.S. DOE Exascale Computing Project (ECP)
• ECP OMPI-X project
• ECP SOLLVE project
• Oak Ridge Leadership Computing Facility (OLCF)
Challenges (2)
• Impossible to coordinate runtimes and optimize resource utilization
 Runtimes independently allocate resources
 No fine-grain & coordinated control over the placement of MPI ranks and OpenMP threads
 Difficult to express affinity across runtimes
 Difficult to optimize applications
Illustration
• On Summit nodes
[Figure: Summit node architecture (two NUMA nodes, six GPUs) versus the desired placement of MPI ranks / OpenMP master threads and OpenMP worker threads.]
Proposed Solution
• Make all runtimes PMIx-aware for data exchange and notification through events
• Key PMIx characteristics
 Distributed key/value store for data exchange
 Asynchronous events for coordination
 Enable interactions with the resource manager
• Modification of LLVM OpenMP and Open MPI
Two Use-cases
• Fine-grain placement over MPI ranks and OpenMP threads (prototyping stage)
• Inter-runtime progress (design stage)
Fine-grain Placement
• Goals
 Explicit placement of MPI ranks
 Explicit placement of OpenMP threads
 Re-configure the MPI+OpenMP layout at runtime
• Propose the concept of a layout
 Static layout: explicit definition of rank and thread placement at initialization time only
 Dynamic layouts: change placement at runtime
Layout: Example
[Figure: example layout]
Static Layouts
• New MPI mapper
 Gets the layout specified by the user
 Publishes it through PMIx
• New MPI OpenMP Coordination (MOC) helper library gets the layout and sets OpenMP places to specify where threads will be created
• No modification to the OpenMP standard or runtime
Dynamic Layouts
• New API to define phases within the application
 A static layout is deployed for each phase
 Modification of the layout between phases (w/ MOC)
 Well adapted to the nature of some applications
• Requires
 OpenMP runtime modifications (LLVM prototype)
• Get phases’ layouts and set new places internally
 OpenMP standard modifications to allow dynamicity
Architecture Overview
• MOC ensures inter-runtime data exchange (MOC-specific PMIx keys)
• Runtimes are PMIx-aware
• Runtimes have a layout-aware mapper/affinity manager
• Respects the separation between standards, resource manager, and runtimes
Inter-runtime Progress
• Runtimes usually guarantee internal progress but are not aware of other runtimes
• Runtimes can end up in a deadlock if mutually waiting for completion
 An MPI operation is ready to complete
 The application calls into and blocks in another runtime
Inter-runtime Progress (2)
• Proposed solution (sketched below)
 Runtimes use PMIx to notify of progress
 When a runtime gets a progress notification, it yields once done with all tasks that can be completed
• Currently working on prototyping
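A hedged sketch of what such a notification could look like (MOC_EV_PROGRESS is a hypothetical, project-defined event code; a real implementation would reserve a value outside the ranges PMIx uses for its own constants):

    /* hypothetical project-defined event code */
    #define MOC_EV_PROGRESS  (-3000)

    /* tell co-located runtimes that pending work has progressed;
       PMIX_RANGE_LOCAL restricts delivery to processes on this node */
    PMIx_Notify_event(MOC_EV_PROGRESS, &myproc, PMIX_RANGE_LOCAL,
                      NULL, 0, NULL, NULL);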
Conclusion
• There is a real need for making runtimes aware of each other
• PMIx is a privileged option
 Asynchronous / event-based
 Data exchange
 Enables interaction with the resource manager
• Opens new opportunities for both research and development (e.g. a new project focusing on QoS)
Dan Holmes
EPCC University of Edinburgh
Howard Pritchard, Nathan Hjelm
Los Alamos National Laboratory
MPI Sessions: Dynamic Proc. Groups
Using PMIx to Help
Replace MPI_Init
14th November 2018
Dan Holmes, Howard Pritchard, Nathan Hjelm
LA-UR-18-30830
Problems with MPI_Init
● All MPI processes must initialize MPI exactly once
● MPI cannot be initialized within an MPI process from different
application components without coordination
● MPI cannot be re-initialized after MPI is finalized
● Error handling for MPI initialization cannot be specified
Sessions – a new way to start MPI
● General scheme:
● Query the underlying run-time system → get a “set” of processes
● Determine the processes you want → create an MPI_Group
● Create a communicator with just those processes → create an MPI_Comm
[Diagram: MPI_Session → query runtime for a set of processes → MPI_Group → MPI_Comm]
MPI Sessions proposed API
● Create (or destroy) a session:
● MPI_SESSION_INIT (and MPI_SESSION_FINALIZE)
● Get names of sets of processes:
● MPI_SESSION_GET_NUM_PSETS, MPI_SESSION_GET_NTH_PSET
● Create an MPI_GROUP from a process set name:
● MPI_GROUP_CREATE_FROM_SESSION
● Create an MPI_COMM from an MPI_GROUP:
● MPI_COMM_CREATE_FROM_GROUP ← PMIx Groups helps here
MPI_COMM_CREATE_FROM_GROUP
The ‘uri’ is supplied by the application.
Implementation challenge: ‘group’ is a local object. Need some way to synchronize with other “joiners” to the communicator. The ‘uri’ is different from a process set name.
MPI_Comm_create_from_group(IN MPI_Group group,
                           IN const char *uri,
                           IN MPI_Info info,
                           IN MPI_Errhandler hndl,
                           OUT MPI_Comm *comm);
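A hedged usage sketch of the proposed call (the 'ocean_group' is assumed to have been created from a process set, and the uri string is purely illustrative – the proposal's details were still in flux at this point):

    MPI_Comm ocean_comm;
    MPI_Comm_create_from_group(ocean_group, "mpi://user/ocean/step-1",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &ocean_comm);

All processes that pass the same uri end up in the same communicator, which is exactly the synchronization gap PMIx Groups fills below.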
Using PMIx Groups
● PMIx Groups – a collection of processes desiring a unified identifier for purposes such as passing events or participating in PMIx fence operations
● Invite/join/leave semantics
● Sessions prototype implementation currently uses PMIx_Group_construct/PMIx_Group_destruct
● Can be used to generate a “unique” 64-bit identifier for the group. Used by the sessions prototype to generate a communicator ID.
● Useful options for future work
● Timeout for processes joining the group
● Asynchronous notification when a process leaves the group
Using PMIx_Group_construct
● ‘id’ maps to/from the ‘uri’ in MPI_Comm_create_from_group (plus additional Open MPI internal info)
● ‘procs’ array comes from information previously supplied by PMIx
○ “mpi://world” and “mpi://self” already available
○ mpiexec -np 2 --pset user://ocean ocean.x : -np 2 --pset user://atmosphere atmosphere.x
PMIx_Group_construct(const char id[],
                     const pmix_proc_t procs[],
                     const pmix_info_t info[],
                     size_t ninfo);
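Putting the pieces together, a minimal sketch of forming a group for the ocean processes (hedged: this uses the prototype signature shown above, the namespace string and ranks are hypothetical, the destruct arguments are assumed, and error handling is elided):

    pmix_proc_t procs[2];
    PMIX_PROC_LOAD(&procs[0], "user.ocean.nspace", 0);   /* hypothetical namespace */
    PMIX_PROC_LOAD(&procs[1], "user.ocean.nspace", 1);

    /* 'id' doubles as the uri later passed to MPI_Comm_create_from_group */
    PMIx_Group_construct("mpi://user/ocean/step-1", procs, NULL, 0);
    /* ... use the group (fence, events, 64-bit context ID) ... */
    PMIx_Group_destruct("mpi://user/ocean/step-1", NULL, 0);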
MPI Sessions Prototype Status
● All “MPI_*_from_group” functions have been implemented
● Only pml/ob1 supported at this time
● Working on MPI_Group creation (from PSET) now
● Will start with mpi://world and mpi://self
● User-defined process sets need additional support from PMIx
● Up next: break apart MPI initialization
● Goal is to reduce startup time and memory footprint
Summary
● PMIx Groups provides an OOB mechanism for MPI processes to bootstrap the formation of a communication context (an MPI Communicator) from a group of MPI processes
● Functionality for future work
● Handling (unexpected) process exit
● User-defined process sets
● Group expansion
Funding Acknowledgments
Useful Links:
General Information: https://pmix.org/
PMIx Library: https://github.com/pmix/pmix
PMIx Reference RunTime Environment (PRRTE): https://github.com/pmix/prrte
PMIx Standard: https://github.com/pmix/pmix-standard
Slack: pmix-workspace.slack.com
Q&A