SlideShare a Scribd company logo
1 of 57
Process Management
Interface – Exascale
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration status
• Roadmap
• Malleable application support
• Wrap-Up/Open Forum
PMIx – PMI exascale
Collaborative open source effort led by Intel, Mellanox
Technologies, IBM, Adaptive Computing, and SchedMD.
New collaborators are most welcome!
Contributors
• Intel
 Ralph Castain
 Annapurna Dasari
• Mellanox
 Joshua Ladd
 Artem Polyakov
 Elena Shipunova
 Nadezhda Kogteva
 Igor Ivanov
• HP
 David Linden
 Andy Riebs
• IBM
 Dave Solt
• Adaptive Computing
 Gary Brown
• RIST
 Gilles Gouaillardet
• SchedMD
 David Bigagli
• LANL
 Nathan Hjelmn
Motivation
• Exascale launch times are a hot topic
 Desire: reduce from many minutes to few
seconds
 Target: O(106) MPI processes on O(105)
nodes thru MPI_Init in < 30 seconds
• New programming models are exploding
 Driven by need to efficiently exploit scale vs.
resource constraints
 Characterized by increased app-RM
integration
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec)
What Is Being Shared?
• Job Info (~90%)
 Names of participating nodes
 Location and ID of procs
 Relative ranks of procs (node, job)
 Sizes (#procs in job, #procs on each node)
• Endpoint info (~10%)
 Contact info for each supported fabric
Known to
local RM
daemon
Can be
computed for
many fabrics
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage I
Provide method for
RM to share job
info
Work with fabric
and library
implementers to
compute endpt
from RM info
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage II
Add on 1st
communication
(non-PMIx)
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage III
Use HSN+Coll
Changing Needs
• Notifications/response
 Errors, resource changes
 Negotiated response
• Request allocation changes
 shrink/expand
• Workflow management
 Steered/conditional execution
• QoS requests
 Power, file system, fabric
Multiple,
use-
specific
libs?
(difficult for RM
community to
support)
Single,
multi-
purpose
lib?
Objectives
• Establish an independent, open community
 Industry, academia, lab
• Standalone client/server libraries
 Ease adoption, enable broad/consistent support
 Open source, non-copy-left
 Transparent backward compatibility
• Support evolving programming requirements
• Enable “Instant On” support
 Eliminate time-devouring steps
 Provide faster, more scalable operations
Today’s Goal
• Inform the community
• Solicit your input on the roadmap
• Get you a little excited
• Encourage participation
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration/development status
• Roadmap
• Malleable/Evolving application support
• Wrap-Up/Open Forum
PMIx End Users
• OSHMEM consumers
 In Open MPI OSHMEM:
𝑠ℎ𝑚𝑒𝑚_𝑖𝑛𝑖𝑡 = 𝑚𝑝𝑖_𝑖𝑛𝑖𝑡 + 𝐶
• Job launch scales as MPI_Init.
 Data driven communication patterns
• Assume dense connectivity
• MPI consumers
 Large class of applications have spare connectivity
• Ideal for an API that supports the direct modex
concept
What’s been done
• Worked closely with customers, OEMs, and open source community to design a scalable API that addresses measured
limitations of PMI2
 Data driven design.
• Led to the PMIx v1.0 API
• Implementation and imminent release of PMIx v1.1
 November 2015 release scheduled.
• Significant architectural changes in Open MPI to support direct modex
 “Add procs” in bulk MPI_Init  “Add proc” on-demand on first use outside MPI_init.
 Available in the OMPI v2.x release Q1 2016.
• Integrated PMIx into Open MPI v2.x
 For native launching as well as direct launching under supported RMs.
 For mpirun launched jobs, ORTE implements PMIx callbacks.
 For srun launched jobs, SLURM implements PMIx callbacks in the PMIx plugin.
 Client side framework added to OPAL with components for
• Cray PMI
• PMI1
• PMI2
• PMIx
• backwards compatibility with PMI1 and PMI2.
• Implemented and submitted upstream SLURM PMIx plugin
 Currently available in SLURM Head
 To be released in SLURM 16.05
 Client side PMIx Framework and S1, S2, PMIxxx components in OPAL
• PMIx unit tests integrated into Jenkins test harness
srun --mpi=xxx hello_world
0
1
2
3
4
5
6
7
8
0 500 1000 1500 2000 2500 3000 3500 4000
Time(sec.)
MPI/OSHMEM Processes
MPI_Init / Shmem_init
PMIx async
PMIx
PMI2
Open MPI Trunk
SLURM 16.05 prerelease with PMIx plugin
PMIx v1.1
BTLs openib,self,vader
Stage I-II
srun --mpi=pmix ./hello_world
0
10
20
30
40
50
60
70
80
0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000
PMIx async projected performance
Stage I-II
0
0.2
0.4
0.6
0.8
1
1.2
1.4
0 10 20 30 40 50 60 70
Time(millisec)
Nodes
bin-true
orte-no-op
mpi hello world
async modex
Daemon wireup – linear scaling needs to be
addressed
mpirun/oshrun ./hello_world
Conclusions
• API is implemented and performing well in a variety of settings
 Server integrated in OMPI for native launching and in SLURM as PMIx plugin for
direct launching.
• PMIx shows improvement over other state-of-the-art PMI2
implementations when doing a full modex
 Data blobs versus encoded metakeys
 Data scoping to reduce the modex size
• PMIx supported direct modex significantly outperforms full modex
operations for BTL/MTLs that can support this feature
• Direct modex still scales as O(N)
• Efforts and energy should be focused on daemon bootstrap problem
• Instant-on capabilities could be used to further reduce deamon
bootstrap time
Next Steps
• Leverage PMIx features
• Reduce modex size with data scoping
• Change MTL/PMLs to support direct modex
• Investigate the impact of direct modex on densely
connected applications
• Continue to improve collective performance
 Still need to have a scalable solution
• Focus more efforts on the daemon bootstrap problem –
this becomes the limiting factor moving to exascale
 Leverage instant-on here as well
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration/development status
• Roadmap
• Malleable application support
• Wrap-Up/Open Forum
Client Implementation Status
• PMIx 1.1.1 released
 Complete API definition
• Future-proof API’s with Info array/length parameter
for most calls
• Blocking/non-blocking versions of most calls
• Picked up by Fedora, others to come
• PMIx MPI clients launched/tested with
 ORTE (indirect) / ORCM (direct launch)
 SLURM servers (direct launch)
 IBM PMIx server (direct launch)
Server Implementation Status
• Server implementation time is greatly
reduced through the PMIx convenience
library
 Handles all server/client interactions
 Handles many PMIx requests that can be
handled locally
 Bundles many off-host requests
• Optional
 RMs free to implement their own
Server Implementation Status
• Moab
 Integrated PMIx server in scheduler/launcher
 Currently integrating PMIx effort with Moab
 Scheduled for general availability: no time set
• ORTE/ORCM
 Full embedded PMIx reference server
implementation
 Scheduled for release with v2.0
Server Implementation Status
• SLURM
 PMIx support for initial job launch/wireup
currently developed & tested
 Scheduled for GA: 16.05 release
• IBM/LSF
 CORAL
• PMIx support for initial job launch/wireup currently
developed & tested w/PM
• Full PMIx support planned for CORAL
 Integration to LSF to follow (TBD)
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration/development status
• Roadmap
• Malleable/Evolving application support
• Wrap-Up/Open Forum
Scalability
• Memory footprint
 Distributed database for storing Key-Values
• Memory cache, DHT, other models?
 One instance of database per node
• "zero-message" data access using shared-memory
• Launch scaling
 Enhanced support for collective operations
• Provide pattern to host, host-provided send/recv functions,
embedded inter-node comm?
 Rely on HSN for launch, wireup support
• While app is quiescent, then return to OOB
Flexible Allocation Support
• Request additional resources
 Compute, memory, network, NVM, burst buffer
 Immediate, forecast
 Expand existing allocation, separate allocation
• Return extra resources
 No longer required
 Will not be used for some specified time, reclaim
(handshake) when ready to use
• Notification of preemption
 Provide opportunity to cleanly pause
I/O Support
• Asynchronous operations
 Anticipatory data fetch, staging
 Advise time to complete
 Notify upon available
• Storage policy requests
 Hot/warm/cold data movement
 Desired locations and striping/replication patterns
 Persistence of files, shared memory regions across
jobs, sessions
 ACL to generated data across jobs, sessions
Spawn Support
• Staging support
 Files, libraries required by new apps
 Allow RM to consider in scheduler
• Current location of data
• Time to retrieve and position
• Schedule scalable preload
• Provisioning requests
 Allow RM to consider in selecting resources,
minimize startup time due to provisioning
 Desired image, packages
Network Integration
• Quality of service requests
 Bandwidth, traffic priority, power constraints
 Multi-fabric failover, striping prioritization
 Security requirements
• Network domain definitions, ACLs
• Notification requests
 State-of-health
 Update process endpoint upon fault recovery
• Topology information
 Torus, dragonfly, …
Power Control/Management
• Application requests
 Advise of changing workload requirements
 Request changes in policy
 Specify desired policy for spawned applications
 Transfer allocated power to specifed job
• RM notifications
 Need to change power policy
• Allow application to accept, request pause
 Preemption notification
• Provide backward compatibility with
PowerAPI
PMIx: Fault Tolerance
• Notification
 App can register for error notifications, incipient faults
• RM will notify when app would be impacted
• App responds with desired action
 Terminate/restart job, wait for checkpoint, etc.
• RM/app negotiate final response
 App can notify RM of errors
• RM will notify specified, registered procs
• Restart support
 Specify source (remote NVM checkpoint, global filesystem, etc)
 Location hints/requests
 Entire job, specific processes
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration/development status
• Roadmap
• Malleable/Evolving application support
• Wrap-Up/Open Forum
Job Types
• Rigid
• Moldable
• Malleable
• Evolving
• Adaptive (Malleable + Evolving)
Job Type Characteristics
• Resource Allocation Type
 Static
 Dynamic
Who Decides
When it is decided
At job submission
(static allocation)
During job execution
(dynamic allocation)
User Rigid Evolving
Scheduler Moldable Malleable
Rigid Job
8
7
6
5
4
3
2
1
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
• Job Resources
 4 nodes
 60 minutes
Moldable Job
8
7
6
5
4
3
2
1
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
• Job Resources
 4 nodes/60 min
 2 nodes/120 min
 1 node/240 min
8
7
6
5
4
3
2
1
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
Malleable Job
• Job Resources
 1-8 nodes
 1000 node-min
8
7
6
5
4
3
2
1
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
C
B
A
D
Phases A, B, C
and D (each
phase 2x nodes
of previous
phase)
Evolving Job
• Job Resources
 1-8 nodes
 1000 node-min
Adaptive Job
8
7
6
5
4
3
2
1
C
B
A
D1
D2
Phases A, B, C and D
(each phase 2x nodes
of previous phase)
Scheduler takes 4 nodes
halfway through Phase D
and phase D2 takes 2x as
long as phase D1
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
• Job Resources
 1-8 nodes
 1000 node-min
Motivations
• New Programming Models
• New Algorithmic Techniques
• Unconventional Cluster Architectures
Adaptive Mesh Refinement
 Granularity
 Node Allocation
Coarse Medium Fine 
Few (1) Some (4) Many (64) 
Master/Slaves
Master
Slave Slave Slave Slave Slave Slave
Slave Slave Slave Slave Slave Slave
Slave Slave Slave Slave Slave Slave
Secondary Simulations
8
7
6
5
4
3
2
1
Sim2
Primary Simulation
0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240
Sim2
Sim2 Sim2
Same Network Domain
• Cluster Booster
Many-core “Booster”
Multi-core “Cluster”
Multi-core jobs dynamically
burst out to parallel “booster”
nodes with accelerators
Unconventional Architectures
Apps, RTEs and Archs
Applications
• Astrophysics
• Brain simulation
• Climate simulation
• Flow solvers (QuadFlow)
• Hydrodynamics (Lulesh)
• Molecular Dynamics (NAMD)
• Water reservoir storage/flow
• Wave propagation (Wave2D)
RTEs
• Charm++
• OmpSs
• Uintah
• Radical-Pilot
Architectures
• EU DEEP/DEEP-ER
Application
Scheduler/Malleable Job Dialog
• Expand resource allocation
• Contract resource allocation
ApplicationScheduler
ApplicationScheduler Application
Expand (more)
Response
Contract (less)
Response
Scheduler/Evolving Job Dialog
• Grow resource allocation
• Shrink resource allocation
ApplicationApplicationScheduler
ApplicationScheduler Application
Grow (more)
Response
Shrink (less)
Response
Adaptive Job Race Condition
• Reason for naming convention
 Prevent ambiguity and confusion
ApplicationApplicationScheduler
Grow (more)
Grow (more)Expand (more)
?
Need for Standard API
• MPI: standard API for parallel communication
• Need standard API for application /
scheduler resource management dialogs
 Same API for applications
 Scheduler-specific API implementations
• Scheduler and Malleable/Evolvable
Application Dialog (SMEAD) API
 Make part of PMIx
 Need application use cases
Interested Parties
• Adaptive Computing (Moab scheduler, TORQUE RM, Nitro)
• Altair (PBS Pro scheduler/RM)
• Argonne National Laboratory (Cobalt scheduler)
• HLRS at University of Stuttgart
• Jülich Supercomputing Centre (DEEP-ER)
• Lawrence Livermore National Laboratory (Flux scheduler)
• Partec (ParaStation)
• SchedMD (Slurm scheduler/RM)
• TU Darmstadt Laboratory for Parallel Programming
• UK Atomic Weapons Establishment (AWE)
• University of Cambridge COSMOS
• University of Illinois at Urbana-Champaign Parallel
Programming Laboratory (Charm++ RTE)
• University of Utah SCI Institute (Uintah RTE)
URLs
• Need your help to design a standard API!
 Malleable/Evolving Application Use Case Survey
 http://goo.gl/forms/lq85y3SkV3 (Google Form)
• Info on adaptive job types and scheduling
 4-part blog about malleable / evolving / adaptive
jobs and schedulers
 http://www.adaptivecomputing.com/series/
malleable-and-evolving-jobs/
Agenda
• Overview
 Introductions
 Vision/objectives
• Performance status
• Integration/development status
• Roadmap
• Malleable/Evolving application support
• Wrap-Up/Open Forum
Bottom Line
We now have an interface library the RMs
will support for application-directed requests
Need to collaboratively define
what we want to do with it
For any programming model
MPI, OSHMEM, PGAS,…
Contribute or Follow Along!
 Project: https://pmix.github.io/master
 Code: https://github.com/pmix
Contributors/collaborators
are welcomed!

More Related Content

What's hot

DPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy HarveyDPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy Harveyharryvanhaaren
 
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENEWorkshop
 
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudRow #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudAPNIC
 
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Cloud Native Day Tel Aviv
 
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceDeveloping Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceJack-Junjie Cai
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsBryan Bende
 
L4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef LaribiL4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef Laribibuildacloud
 
Architecture of OpenFlow SDNs
Architecture of OpenFlow SDNsArchitecture of OpenFlow SDNs
Architecture of OpenFlow SDNsUS-Ignite
 
Generic Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András KovácsGeneric Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András Kovácsharryvanhaaren
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Brocade
 
Hotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya MukawaHotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya Mukawaharryvanhaaren
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLinaro
 
Dynamic Service Chaining
Dynamic Service Chaining Dynamic Service Chaining
Dynamic Service Chaining Tail-f Systems
 
DEVNET-1175 OpenDaylight Service Function Chaining
DEVNET-1175	OpenDaylight Service Function ChainingDEVNET-1175	OpenDaylight Service Function Chaining
DEVNET-1175 OpenDaylight Service Function ChainingCisco DevNet
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVGhodhbane Mohamed Amine
 

What's hot (20)

Microservice Powered Orchestration
Microservice Powered OrchestrationMicroservice Powered Orchestration
Microservice Powered Orchestration
 
ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014ODP Presentation LinuxCon NA 2014
ODP Presentation LinuxCon NA 2014
 
DPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy HarveyDPDK Architecture Musings - Andy Harvey
DPDK Architecture Musings - Andy Harvey
 
Play With Streams
Play With StreamsPlay With Streams
Play With Streams
 
SDN Project PPT
SDN Project PPTSDN Project PPT
SDN Project PPT
 
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
SERENE 2014 Workshop: Paper "Automatic Generation of Description Files for Hi...
 
FD.io - The Universal Dataplane
FD.io - The Universal DataplaneFD.io - The Universal Dataplane
FD.io - The Universal Dataplane
 
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloudRow #9: An architecture overview of APNIC's RDAP deployment to the cloud
Row #9: An architecture overview of APNIC's RDAP deployment to the cloud
 
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
Barak Perlman, ConteXtream - SFC (Service Function Chaining) Using Openstack ...
 
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to MicroserviceDeveloping Enterprise Applications for the Cloud, from Monolith to Microservice
Developing Enterprise Applications for the Cloud, from Monolith to Microservice
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
L4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef LaribiL4-L7 services for SDN and NVF by Youcef Laribi
L4-L7 services for SDN and NVF by Youcef Laribi
 
Architecture of OpenFlow SDNs
Architecture of OpenFlow SDNsArchitecture of OpenFlow SDNs
Architecture of OpenFlow SDNs
 
Generic Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András KovácsGeneric Resource Manager - László Vadkerti, András Kovács
Generic Resource Manager - László Vadkerti, András Kovács
 
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
 
Hotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya MukawaHotplug and Virtio - Tetsuya Mukawa
Hotplug and Virtio - Tetsuya Mukawa
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
 
Dynamic Service Chaining
Dynamic Service Chaining Dynamic Service Chaining
Dynamic Service Chaining
 
DEVNET-1175 OpenDaylight Service Function Chaining
DEVNET-1175	OpenDaylight Service Function ChainingDEVNET-1175	OpenDaylight Service Function Chaining
DEVNET-1175 OpenDaylight Service Function Chaining
 
Learn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFVLearn more about the tremendous value Open Data Plane brings to NFV
Learn more about the tremendous value Open Data Plane brings to NFV
 

Similar to SC15 PMIx Birds-of-a-Feather

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Supportrcastain
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentationrcastain
 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated OverviewRalph Castain
 
12 Factor App Methodology
12 Factor App Methodology12 Factor App Methodology
12 Factor App Methodologylaeshin park
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsYong Feng
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsRonny López
 
Meetup Openshift Geneva 03/10
Meetup Openshift Geneva 03/10Meetup Openshift Geneva 03/10
Meetup Openshift Geneva 03/10MagaliDavidCruz
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundaryrcastain
 
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceiSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
Change Management in Hybrid landscapes 2017
Change Management in Hybrid landscapes 2017Change Management in Hybrid landscapes 2017
Change Management in Hybrid landscapes 2017Chris Kernaghan
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Lucas Jellema
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard LiAmbassador Labs
 
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...Bert Jan Schrijver
 
Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Bert Jan Schrijver
 
Dublin JUG February 2018 - Microservices in action at the Dutch National Police
Dublin JUG February 2018 - Microservices in action at the Dutch National PoliceDublin JUG February 2018 - Microservices in action at the Dutch National Police
Dublin JUG February 2018 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
Devoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceDevoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Subbu Rama
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyserAlex Moskvin
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology confluent
 

Similar to SC15 PMIx Birds-of-a-Feather (20)

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Support
 
SC'16 PMIx BoF Presentation
SC'16 PMIx BoF PresentationSC'16 PMIx BoF Presentation
SC'16 PMIx BoF Presentation
 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated Overview
 
12 Factor App Methodology
12 Factor App Methodology12 Factor App Methodology
12 Factor App Methodology
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 
Adding Real-time Features to PHP Applications
Adding Real-time Features to PHP ApplicationsAdding Real-time Features to PHP Applications
Adding Real-time Features to PHP Applications
 
Meetup Openshift Geneva 03/10
Meetup Openshift Geneva 03/10Meetup Openshift Geneva 03/10
Meetup Openshift Geneva 03/10
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundary
 
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceiSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
 
Change Management in Hybrid landscapes 2017
Change Management in Hybrid landscapes 2017Change Management in Hybrid landscapes 2017
Change Management in Hybrid landscapes 2017
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
 
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
[QCon London 2020] The Future of Cloud Native API Gateways - Richard Li
 
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...
Bol.com Tech lab September 2017 - Microservices in action at the Dutch Nation...
 
Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...
 
Dublin JUG February 2018 - Microservices in action at the Dutch National Police
Dublin JUG February 2018 - Microservices in action at the Dutch National PoliceDublin JUG February 2018 - Microservices in action at the Dutch National Police
Dublin JUG February 2018 - Microservices in action at the Dutch National Police
 
Devoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceDevoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National Police
 
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures Bitfusion Nimbix Dev Summit Heterogeneous Architectures
Bitfusion Nimbix Dev Summit Heterogeneous Architectures
 
Realtime traffic analyser
Realtime traffic analyserRealtime traffic analyser
Realtime traffic analyser
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
 

Recently uploaded

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxdharshini369nike
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |aasikanpl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Masticationvidulajaib
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayZachary Labe
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555kikilily0909
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 

Recently uploaded (20)

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
TOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptxTOTAL CHOLESTEROL (lipid profile test).pptx
TOTAL CHOLESTEROL (lipid profile test).pptx
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Lajpat Nagar (Delhi) |
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Mastication
 
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Aiims Metro Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Welcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work Day
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 

SC15 PMIx Birds-of-a-Feather

  • 2. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration status • Roadmap • Malleable application support • Wrap-Up/Open Forum
  • 3. PMIx – PMI exascale Collaborative open source effort led by Intel, Mellanox Technologies, IBM, Adaptive Computing, and SchedMD. New collaborators are most welcome!
  • 4. Contributors • Intel  Ralph Castain  Annapurna Dasari • Mellanox  Joshua Ladd  Artem Polyakov  Elena Shipunova  Nadezhda Kogteva  Igor Ivanov • HP  David Linden  Andy Riebs • IBM  Dave Solt • Adaptive Computing  Gary Brown • RIST  Gilles Gouaillardet • SchedMD  David Bigagli • LANL  Nathan Hjelmn
  • 5. Motivation • Exascale launch times are a hot topic  Desire: reduce from many minutes to few seconds  Target: O(106) MPI processes on O(105) nodes thru MPI_Init in < 30 seconds • New programming models are exploding  Driven by need to efficiently exploit scale vs. resource constraints  Characterized by increased app-RM integration
  • 7. What Is Being Shared? • Job Info (~90%)  Names of participating nodes  Location and ID of procs  Relative ranks of procs (node, job)  Sizes (#procs in job, #procs on each node) • Endpoint info (~10%)  Contact info for each supported fabric Known to local RM daemon Can be computed for many fabrics
  • 8. Launch Initialization ExchangeMPIcontactinfo SetupMPIstructures barrier mpi_initcompletion barrier MPI_Init MPI_Finalize RRZ, 16-nodes, 8ppn, rank=0 Time(sec) Stage I Provide method for RM to share job info Work with fabric and library implementers to compute endpt from RM info
  • 11. Changing Needs • Notifications/response  Errors, resource changes  Negotiated response • Request allocation changes  shrink/expand • Workflow management  Steered/conditional execution • QoS requests  Power, file system, fabric Multiple, use- specific libs? (difficult for RM community to support) Single, multi- purpose lib?
  • 12. Objectives • Establish an independent, open community  Industry, academia, lab • Standalone client/server libraries  Ease adoption, enable broad/consistent support  Open source, non-copy-left  Transparent backward compatibility • Support evolving programming requirements • Enable “Instant On” support  Eliminate time-devouring steps  Provide faster, more scalable operations
  • 13. Today’s Goal • Inform the community • Solicit your input on the roadmap • Get you a little excited • Encourage participation
  • 14. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration/development status • Roadmap • Malleable/Evolving application support • Wrap-Up/Open Forum
  • 15. PMIx End Users • OSHMEM consumers  In Open MPI OSHMEM: 𝑠ℎ𝑚𝑒𝑚_𝑖𝑛𝑖𝑡 = 𝑚𝑝𝑖_𝑖𝑛𝑖𝑡 + 𝐶 • Job launch scales as MPI_Init.  Data driven communication patterns • Assume dense connectivity • MPI consumers  Large class of applications have spare connectivity • Ideal for an API that supports the direct modex concept
  • 16. What’s been done • Worked closely with customers, OEMs, and open source community to design a scalable API that addresses measured limitations of PMI2  Data driven design. • Led to the PMIx v1.0 API • Implementation and imminent release of PMIx v1.1  November 2015 release scheduled. • Significant architectural changes in Open MPI to support direct modex  “Add procs” in bulk MPI_Init  “Add proc” on-demand on first use outside MPI_init.  Available in the OMPI v2.x release Q1 2016. • Integrated PMIx into Open MPI v2.x  For native launching as well as direct launching under supported RMs.  For mpirun launched jobs, ORTE implements PMIx callbacks.  For srun launched jobs, SLURM implements PMIx callbacks in the PMIx plugin.  Client side framework added to OPAL with components for • Cray PMI • PMI1 • PMI2 • PMIx • backwards compatibility with PMI1 and PMI2. • Implemented and submitted upstream SLURM PMIx plugin  Currently available in SLURM Head  To be released in SLURM 16.05  Client side PMIx Framework and S1, S2, PMIxxx components in OPAL • PMIx unit tests integrated into Jenkins test harness
  • 17. srun --mpi=xxx hello_world 0 1 2 3 4 5 6 7 8 0 500 1000 1500 2000 2500 3000 3500 4000 Time(sec.) MPI/OSHMEM Processes MPI_Init / Shmem_init PMIx async PMIx PMI2 Open MPI Trunk SLURM 16.05 prerelease with PMIx plugin PMIx v1.1 BTLs openib,self,vader Stage I-II
  • 18. srun --mpi=pmix ./hello_world 0 10 20 30 40 50 60 70 80 0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000 PMIx async projected performance Stage I-II
  • 19. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 10 20 30 40 50 60 70 Time(millisec) Nodes bin-true orte-no-op mpi hello world async modex Daemon wireup – linear scaling needs to be addressed mpirun/oshrun ./hello_world
  • 20. Conclusions • API is implemented and performing well in a variety of settings  Server integrated in OMPI for native launching and in SLURM as PMIx plugin for direct launching. • PMIx shows improvement over other state-of-the-art PMI2 implementations when doing a full modex  Data blobs versus encoded metakeys  Data scoping to reduce the modex size • PMIx supported direct modex significantly outperforms full modex operations for BTL/MTLs that can support this feature • Direct modex still scales as O(N) • Efforts and energy should be focused on daemon bootstrap problem • Instant-on capabilities could be used to further reduce deamon bootstrap time
  • 21. Next Steps • Leverage PMIx features • Reduce modex size with data scoping • Change MTL/PMLs to support direct modex • Investigate the impact of direct modex on densely connected applications • Continue to improve collective performance  Still need to have a scalable solution • Focus more efforts on the daemon bootstrap problem – this becomes the limiting factor moving to exascale  Leverage instant-on here as well
  • 22. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration/development status • Roadmap • Malleable application support • Wrap-Up/Open Forum
  • 23. Client Implementation Status • PMIx 1.1.1 released  Complete API definition • Future-proof API’s with Info array/length parameter for most calls • Blocking/non-blocking versions of most calls • Picked up by Fedora, others to come • PMIx MPI clients launched/tested with  ORTE (indirect) / ORCM (direct launch)  SLURM servers (direct launch)  IBM PMIx server (direct launch)
  • 24. Server Implementation Status • Server implementation time is greatly reduced through the PMIx convenience library  Handles all server/client interactions  Handles many PMIx requests that can be handled locally  Bundles many off-host requests • Optional  RMs free to implement their own
  • 25. Server Implementation Status • Moab  Integrated PMIx server in scheduler/launcher  Currently integrating PMIx effort with Moab  Scheduled for general availability: no time set • ORTE/ORCM  Full embedded PMIx reference server implementation  Scheduled for release with v2.0
  • 26. Server Implementation Status • SLURM  PMIx support for initial job launch/wireup currently developed & tested  Scheduled for GA: 16.05 release • IBM/LSF  CORAL • PMIx support for initial job launch/wireup currently developed & tested w/PM • Full PMIx support planned for CORAL  Integration to LSF to follow (TBD)
  • 27. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration/development status • Roadmap • Malleable/Evolving application support • Wrap-Up/Open Forum
  • 28. Scalability • Memory footprint  Distributed database for storing Key-Values • Memory cache, DHT, other models?  One instance of database per node • "zero-message" data access using shared-memory • Launch scaling  Enhanced support for collective operations • Provide pattern to host, host-provided send/recv functions, embedded inter-node comm?  Rely on HSN for launch, wireup support • While app is quiescent, then return to OOB
  • 29. Flexible Allocation Support • Request additional resources  Compute, memory, network, NVM, burst buffer  Immediate, forecast  Expand existing allocation, separate allocation • Return extra resources  No longer required  Will not be used for some specified time, reclaim (handshake) when ready to use • Notification of preemption  Provide opportunity to cleanly pause
  • 30. I/O Support • Asynchronous operations  Anticipatory data fetch, staging  Advise time to complete  Notify upon available • Storage policy requests  Hot/warm/cold data movement  Desired locations and striping/replication patterns  Persistence of files, shared memory regions across jobs, sessions  ACL to generated data across jobs, sessions
  • 31. Spawn Support • Staging support  Files, libraries required by new apps  Allow RM to consider in scheduler • Current location of data • Time to retrieve and position • Schedule scalable preload • Provisioning requests  Allow RM to consider in selecting resources, minimize startup time due to provisioning  Desired image, packages
  • 32. Network Integration • Quality of service requests  Bandwidth, traffic priority, power constraints  Multi-fabric failover, striping prioritization  Security requirements • Network domain definitions, ACLs • Notification requests  State-of-health  Update process endpoint upon fault recovery • Topology information  Torus, dragonfly, …
  • 33. Power Control/Management • Application requests  Advise of changing workload requirements  Request changes in policy  Specify desired policy for spawned applications  Transfer allocated power to specifed job • RM notifications  Need to change power policy • Allow application to accept, request pause  Preemption notification • Provide backward compatibility with PowerAPI
  • 34. PMIx: Fault Tolerance • Notification  App can register for error notifications, incipient faults • RM will notify when app would be impacted • App responds with desired action  Terminate/restart job, wait for checkpoint, etc. • RM/app negotiate final response  App can notify RM of errors • RM will notify specified, registered procs • Restart support  Specify source (remote NVM checkpoint, global filesystem, etc)  Location hints/requests  Entire job, specific processes
  • 35. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration/development status • Roadmap • Malleable/Evolving application support • Wrap-Up/Open Forum
  • 36. Job Types • Rigid • Moldable • Malleable • Evolving • Adaptive (Malleable + Evolving)
  • 37. Job Type Characteristics • Resource Allocation Type  Static  Dynamic Who Decides When it is decided At job submission (static allocation) During job execution (dynamic allocation) User Rigid Evolving Scheduler Moldable Malleable
  • 38. Rigid Job 8 7 6 5 4 3 2 1 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 • Job Resources  4 nodes  60 minutes
  • 39. Moldable Job 8 7 6 5 4 3 2 1 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 • Job Resources  4 nodes/60 min  2 nodes/120 min  1 node/240 min
  • 40. 8 7 6 5 4 3 2 1 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 Malleable Job • Job Resources  1-8 nodes  1000 node-min
  • 41. 8 7 6 5 4 3 2 1 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 C B A D Phases A, B, C and D (each phase 2x nodes of previous phase) Evolving Job • Job Resources  1-8 nodes  1000 node-min
  • 42. Adaptive Job 8 7 6 5 4 3 2 1 C B A D1 D2 Phases A, B, C and D (each phase 2x nodes of previous phase) Scheduler takes 4 nodes halfway through Phase D and phase D2 takes 2x as long as phase D1 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 • Job Resources  1-8 nodes  1000 node-min
  • 43. Motivations • New Programming Models • New Algorithmic Techniques • Unconventional Cluster Architectures
  • 44. Adaptive Mesh Refinement  Granularity  Node Allocation Coarse Medium Fine  Few (1) Some (4) Many (64) 
  • 45. Master/Slaves Master Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave Slave
  • 46. Secondary Simulations 8 7 6 5 4 3 2 1 Sim2 Primary Simulation 0 15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 Sim2 Sim2 Sim2
  • 47. Same Network Domain • Cluster Booster Many-core “Booster” Multi-core “Cluster” Multi-core jobs dynamically burst out to parallel “booster” nodes with accelerators Unconventional Architectures
  • 48. Apps, RTEs and Archs Applications • Astrophysics • Brain simulation • Climate simulation • Flow solvers (QuadFlow) • Hydrodynamics (Lulesh) • Molecular Dynamics (NAMD) • Water reservoir storage/flow • Wave propagation (Wave2D) RTEs • Charm++ • OmpSs • Uintah • Radical-Pilot Architectures • EU DEEP/DEEP-ER
  • 49. Application Scheduler/Malleable Job Dialog • Expand resource allocation • Contract resource allocation ApplicationScheduler ApplicationScheduler Application Expand (more) Response Contract (less) Response
  • 50. Scheduler/Evolving Job Dialog • Grow resource allocation • Shrink resource allocation ApplicationApplicationScheduler ApplicationScheduler Application Grow (more) Response Shrink (less) Response
  • 51. Adaptive Job Race Condition • Reason for naming convention  Prevent ambiguity and confusion ApplicationApplicationScheduler Grow (more) Grow (more)Expand (more) ?
  • 52. Need for Standard API • MPI: standard API for parallel communication • Need standard API for application / scheduler resource management dialogs  Same API for applications  Scheduler-specific API implementations • Scheduler and Malleable/Evolvable Application Dialog (SMEAD) API  Make part of PMIx  Need application use cases
  • 53. Interested Parties • Adaptive Computing (Moab scheduler, TORQUE RM, Nitro) • Altair (PBS Pro scheduler/RM) • Argonne National Laboratory (Cobalt scheduler) • HLRS at University of Stuttgart • Jülich Supercomputing Centre (DEEP-ER) • Lawrence Livermore National Laboratory (Flux scheduler) • Partec (ParaStation) • SchedMD (Slurm scheduler/RM) • TU Darmstadt Laboratory for Parallel Programming • UK Atomic Weapons Establishment (AWE) • University of Cambridge COSMOS • University of Illinois at Urbana-Champaign Parallel Programming Laboratory (Charm++ RTE) • University of Utah SCI Institute (Uintah RTE)
  • 54. URLs • Need your help to design a standard API!  Malleable/Evolving Application Use Case Survey  http://goo.gl/forms/lq85y3SkV3 (Google Form) • Info on adaptive job types and scheduling  4-part blog about malleable / evolving / adaptive jobs and schedulers  http://www.adaptivecomputing.com/series/ malleable-and-evolving-jobs/
  • 55. Agenda • Overview  Introductions  Vision/objectives • Performance status • Integration/development status • Roadmap • Malleable/Evolving application support • Wrap-Up/Open Forum
  • 56. Bottom Line We now have an interface library the RMs will support for application-directed requests Need to collaboratively define what we want to do with it For any programming model MPI, OSHMEM, PGAS,…
  • 57. Contribute or Follow Along!  Project: https://pmix.github.io/master  Code: https://github.com/pmix Contributors/collaborators are welcomed!