SlideShare a Scribd company logo
1 of 34
Process Management
Interface – Exascale
PMIx – PMI exascale
Collaborative open source effort led by Intel, Mellanox
Technologies, IBM, Adaptive Computing, and SchedMD.
New collaborators are most welcome!
Motivation
• Exascale launch times are a hot topic
 Desire: reduce from many minutes to few
seconds
 Target: O(106) MPI processes on O(105)
nodes thru MPI_Init in < 30 seconds
• New programming models are exploding
 Driven by need to efficiently exploit scale vs.
resource constraints
 Characterized by increased app-RM
integration
Objectives
• Establish an independent, open community
 Industry, academia, lab
• Standalone client/server libraries
 Ease adoption, enable broad/consistent support
 Open source, non-copy-left
 Transparent backward compatibility
• Support evolving programming requirements
• Enable “Instant On” support
 Eliminate time-devouring steps
 Provide faster, more scalable operations
Active Thrust Areas
• Launch scaling
• Flexible allocation
support
• I/O support
• Advanced spawn
support
• Network support
• Power management
support
• Fault tolerance and
prediction
• Workflow
orchestration
Motivation
• Exascale launch times are a hot topic
 Desire: reduce from many minutes to few
seconds
 Target: O(106) MPI processes on O(105)
nodes thru MPI_Init in < 30 seconds
• New programming models are exploding
 Driven by need to efficiently exploit scale vs.
resource constraints
 Characterized by increased app-RM
integration
Launch Process
• Allocation definition sent to affected root-level daemons
 Session user-level daemons are started and wireup
• Application launch command sent to session daemons
 Daemons spawn local procs
• Procs initialize libraries
 Setup required bookkeeping, identify their endpoints
 Publish all information to session key-value store, and barrier
• Session daemons circulate published values
 Variety of collective algorithms employed
 Release procs from barrier
• Procs complete initialization
 Retrieve endpoint info for peers
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec)
Launch Process
• Allocation definition sent to affected root-level daemons
 Session user-level daemons are started and wireup
• Application launch command sent to session daemons
 Daemons spawn local procs
• Procs initialize libraries
 Setup required bookkeeping, identify their endpoints
 Publish all information to session key-value store, and barrier
• Session daemons circulate published values
 Variety of collective algorithms employed
 Release procs from barrier
• Procs complete initialization
 Retrieve endpoint info for peers
Obstacles to
scalability
What Is Being Shared?
• Job Info (~90%)
 Names of participating nodes
 Location and ID of procs
 Relative ranks of procs (node, job)
 Sizes (#procs in job, #procs on each node)
• Endpoint info (~10%)
 Contact info for each supported fabric
Known to
local RM
daemon
Can be
computed for
many fabrics
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage I
Provide method for
RM to share job
info
Work with fabric
and library
implementers to
compute endpt
from RM info
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage II
Add on 1st
communication
(non-PMIx)
Launch
Initialization
ExchangeMPIcontactinfo
SetupMPIstructures
barrier
mpi_initcompletion
barrier
MPI_Init MPI_Finalize
RRZ, 16-nodes, 8ppn, rank=0
Time(sec) Stage III
Use HSN+Coll
Current Status
Stage I-II
Stage III (projected)
Can We Go Faster?
• Remaining startup time primarily consumed by
 Positioning of application binary
 Loading dynamic libraries
 Bottlenecks: global file system, IO nodes
• Stage IV: Pre-Positioning
 WLM extended to consider retrieval, positioning times
 Add NVRAM to rack-level controllers and/or compute
nodes for staging
 Pre-cache in background (during epilog, at finalize, …)
What Gets Positioned?
• Historically used static binaries
 Lose support for scripting languages, dynamic programs
• Proposal: containers
 Working with Shifter & Singularity groups
 Execution unit: application => container (single/multi-process)
 Detect containerized application at job submission
 Auto-containerize if not (past history of library usage)? Layered
containers?
• Provisioning
 Provision when base OS is incompatible with container
(Warewulf)
How Low Can We Go?
Launch/wireup
as fast as
executables
can be started
Motivation
• Exascale launch times are a hot topic
 Desire: reduce from many minutes to few
seconds
 Target: O(106) MPI processes on O(105)
nodes thru MPI_Init in < 30 seconds
• New programming models are exploding
 Driven by need to efficiently exploit scale vs.
resource constraints
 Characterized by increased app-RM
integration
Changing Needs
• Notifications/response
 Errors, resource changes
 Negotiated response
• Request allocation changes
 shrink/expand
• Workflow management
 Steered/conditional execution
• QoS requests
 Power, file system, fabric
Multiple,
use-
specific
libs?
(difficult for RM
community to
support)
Single,
multi-
purpose
lib?
Scalability
• Memory footprint
 Distributed database for storing Key-Values
• Memory cache, DHT, other models?
 One instance of database per node
• "zero-message" data access using shared-memory
• Launch scaling
 Enhanced support for collective operations
• Provide pattern to host, host-provided send/recv functions,
embedded inter-node comm?
 Rely on HSN for launch, wireup support
• While app is quiescent, then return to mgmtEth
Flexible Allocation Support
• Request additional resources
 Compute, memory, network, NVM, burst buffer
 Immediate, forecast
 Expand existing allocation, separate allocation
• Return extra resources
 No longer required
 Will not be used for some specified time, reclaim
(handshake) when ready to use
• Notification of preemption
 Provide opportunity to cleanly pause
I/O Support
• Asynchronous operations
 Anticipatory data fetch, staging
 Advise time to complete
 Notify upon available
• Storage policy requests
 Hot/warm/cold data movement
 Desired locations and striping/replication patterns
 Persistence of files, shared memory regions across
jobs, sessions
 ACL to generated data across jobs, sessions
Spawn Support
• Staging support
 Files, libraries required by new apps
 Allow RM to consider in scheduler
• Current location of data
• Time to retrieve and position
• Schedule scalable preload
• Provisioning requests
 Allow RM to consider in selecting resources,
minimize startup time due to provisioning
 Desired image, packages
Network Integration
• Quality of service requests
 Bandwidth, traffic priority, power constraints
 Multi-fabric failover, striping prioritization
 Security requirements
• Network domain definitions, ACLs
• Notification requests
 State-of-health
 Update process endpoint upon fault recovery
• Topology information
 Torus, dragonfly, …
Power Control/Management
• Application requests
 Advise of changing workload requirements
 Request changes in policy
 Specify desired policy for spawned applications
 Transfer allocated power to specifed job
• RM notifications
 Need to change power policy
• Allow application to accept, request pause
 Preemption notification
• Provide backward compatibility with
PowerAPI
Fault Tolerance
• Notification
 App can register for event notifications, incipient faults
• RM-app negotiate to determine response
 App can notify RM of errors
• RM will notify specified, registered procs
• Rapid application-driven checkpoint
 Local on-node NVRAM, auto-stripe checkpoints
 Policy-driven “bleed” to remote burst buffers and/or global
file system
• Restart support
 Specify source (remote NVM checkpoint, global filesystem)
 Location hints/requests
 Entire job, specific processes
Fault Prediction: Methodology
• Exploit access to internals
 Investigate optimal location, number of sensors
 Embed intelligence, communications capability
• Integrate data from all available sources
 Engineering design tests
 Reliability life tests
 Production qualification tests
• Utilize learning algorithms to improve
performance
 Both embedded, post process
 Seed with expert knowledge
Fault Prediction: Outcomes
• Continuous update of mean-time-to-
preventative-maintenance
 Feed into projected downtime planning
 Incorporate into scheduling algorithm
• Alarm reports for imminent failures
 Notify impacted sessions/applications
 Plan/execute preemptive actions
• Store predictions
 Algorithm improvement
Workflow Orchestration
• Growing range of workflow patterns
 Traditional HPC: bulk synchronous, single application
 Analytics: asynchronous, staged
 Parametric: large number of simultaneous
independent jobs
• Ability to dynamically spawn a job
 Need flexibility – min/max size, rolling allocation, …
 Events between jobs and cross-linking of I/O
 Queuing of spawn requests
• Tool interfaces that work in these environments
Workload Manager: Job
Description Language
• Complexity of describing job is growing
 Power, file/lib positioning
 Performance vs capacity, programming model
 System, project, application-level defaults
• Provide templates?
 System defaults, with modifiers
• --hadoop:mapper=foo,reducer=bar
 User-defined
• Application templates
• Shared, group templates
 Markup language definition of behaviors, priorities
30 yrs
PMIx Roadmap
2014
1/2016
1.1.3
Full PMIx reference server in OpenMPI
Client library includes all APIs used by MPI
SLURM server integration complete
PMIx Roadmap
2014
1/2016
1.1.3
2Q2016
2.0.0
Support for new APIs:
• Event notification
• Flexible allocation
• System query
Scaling support:
• Shared memory datastore
PMIx Roadmap
2014
1/2016
1.1.3
2Q2016
2.0.0
3Q2016
RM Production Release*
2.1
Fabric manager integration
• Topology info
• Event reports
• Process endpoint updates
File system integration
• Pre-position requests
4Q2016
*SLURM, IBM, Moab, PBSPro,…
Bottom Line
We now have an interface library to support
application-directed requests
Community is working to
define what they want to do
with it
For any programming model
MPI, OSHMEM, PGAS,…

More Related Content

What's hot

Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Apache Apex Introduction with PubMatic
Apache Apex Introduction with PubMaticApache Apex Introduction with PubMatic
Apache Apex Introduction with PubMaticApache Apex
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmDataWorks Summit
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkDataWorks Summit/Hadoop Summit
 
Stream Processing with Apache Apex
Stream Processing with Apache ApexStream Processing with Apache Apex
Stream Processing with Apache ApexPramod Immaneni
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down InternetMapR Technologies
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Lessons learned from scaling YARN to 40K machines in a multi tenancy environment
Lessons learned from scaling YARN to 40K machines in a multi tenancy environmentLessons learned from scaling YARN to 40K machines in a multi tenancy environment
Lessons learned from scaling YARN to 40K machines in a multi tenancy environmentDataWorks Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopDataWorks Summit
 

What's hot (20)

Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Apache Apex Introduction with PubMatic
Apache Apex Introduction with PubMaticApache Apex Introduction with PubMatic
Apache Apex Introduction with PubMatic
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache FlinkUnifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
Unifying Stream, SWL and CEP for Declarative Stream Processing with Apache Flink
 
Stream Processing with Apache Apex
Stream Processing with Apache ApexStream Processing with Apache Apex
Stream Processing with Apache Apex
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Lessons learned from scaling YARN to 40K machines in a multi tenancy environment
Lessons learned from scaling YARN to 40K machines in a multi tenancy environmentLessons learned from scaling YARN to 40K machines in a multi tenancy environment
Lessons learned from scaling YARN to 40K machines in a multi tenancy environment
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
 

Similar to PMIx Updated Overview

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Supportrcastain
 
SC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-FeatherSC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-Featherrcastain
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
Azure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesIvo Andreev
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundaryrcastain
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology confluent
 
Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)ewerkboy
 
Interop ITX: Moving applications: From Legacy to Cloud-to-Cloud
Interop ITX: Moving applications: From Legacy to Cloud-to-CloudInterop ITX: Moving applications: From Legacy to Cloud-to-Cloud
Interop ITX: Moving applications: From Legacy to Cloud-to-CloudSusan Wu
 
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Amazon Web Services
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsCarole Goble
 
HPC Controls Future
HPC Controls FutureHPC Controls Future
HPC Controls Futurercastain
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptxSamPrem3
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...Operating-System-(1-3 group) Case study on windows Mac and linux among variou...
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...ssuser4a97d3
 
The evolving container landscape
The evolving container landscapeThe evolving container landscape
The evolving container landscapeNilesh Trivedi
 

Similar to PMIx Updated Overview (20)

PMIx Tiered Storage Support
PMIx Tiered Storage SupportPMIx Tiered Storage Support
PMIx Tiered Storage Support
 
SC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-FeatherSC15 PMIx Birds-of-a-Feather
SC15 PMIx Birds-of-a-Feather
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
Azure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challengesAzure architecture design patterns - proven solutions to common challenges
Azure architecture design patterns - proven solutions to common challenges
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Hpc 6 7
Hpc 6 7Hpc 6 7
Hpc 6 7
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
PMIx: Bridging the Container Boundary
PMIx: Bridging the Container BoundaryPMIx: Bridging the Container Boundary
PMIx: Bridging the Container Boundary
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
 
Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)Pm ix tutorial-june2019-pub (1)
Pm ix tutorial-june2019-pub (1)
 
Interop ITX: Moving applications: From Legacy to Cloud-to-Cloud
Interop ITX: Moving applications: From Legacy to Cloud-to-CloudInterop ITX: Moving applications: From Legacy to Cloud-to-Cloud
Interop ITX: Moving applications: From Legacy to Cloud-to-Cloud
 
地产知识.ppt
地产知识.ppt地产知识.ppt
地产知识.ppt
 
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
Big Data in the Cloud: How the RISElab Enables Computers to Make Intelligent ...
 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
 
HPC Controls Future
HPC Controls FutureHPC Controls Future
HPC Controls Future
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptx
 
Scaling tappsi
Scaling tappsiScaling tappsi
Scaling tappsi
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...Operating-System-(1-3 group) Case study on windows Mac and linux among variou...
Operating-System-(1-3 group) Case study on windows Mac and linux among variou...
 
The evolving container landscape
The evolving container landscapeThe evolving container landscape
The evolving container landscape
 

Recently uploaded

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 

Recently uploaded (20)

Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

PMIx Updated Overview

  • 2. PMIx – PMI exascale Collaborative open source effort led by Intel, Mellanox Technologies, IBM, Adaptive Computing, and SchedMD. New collaborators are most welcome!
  • 3. Motivation • Exascale launch times are a hot topic  Desire: reduce from many minutes to few seconds  Target: O(106) MPI processes on O(105) nodes thru MPI_Init in < 30 seconds • New programming models are exploding  Driven by need to efficiently exploit scale vs. resource constraints  Characterized by increased app-RM integration
  • 4. Objectives • Establish an independent, open community  Industry, academia, lab • Standalone client/server libraries  Ease adoption, enable broad/consistent support  Open source, non-copy-left  Transparent backward compatibility • Support evolving programming requirements • Enable “Instant On” support  Eliminate time-devouring steps  Provide faster, more scalable operations
  • 5. Active Thrust Areas • Launch scaling • Flexible allocation support • I/O support • Advanced spawn support • Network support • Power management support • Fault tolerance and prediction • Workflow orchestration
  • 6. Motivation • Exascale launch times are a hot topic  Desire: reduce from many minutes to few seconds  Target: O(106) MPI processes on O(105) nodes thru MPI_Init in < 30 seconds • New programming models are exploding  Driven by need to efficiently exploit scale vs. resource constraints  Characterized by increased app-RM integration
  • 7. Launch Process • Allocation definition sent to affected root-level daemons  Session user-level daemons are started and wireup • Application launch command sent to session daemons  Daemons spawn local procs • Procs initialize libraries  Setup required bookkeeping, identify their endpoints  Publish all information to session key-value store, and barrier • Session daemons circulate published values  Variety of collective algorithms employed  Release procs from barrier • Procs complete initialization  Retrieve endpoint info for peers
  • 9. Launch Process • Allocation definition sent to affected root-level daemons  Session user-level daemons are started and wireup • Application launch command sent to session daemons  Daemons spawn local procs • Procs initialize libraries  Setup required bookkeeping, identify their endpoints  Publish all information to session key-value store, and barrier • Session daemons circulate published values  Variety of collective algorithms employed  Release procs from barrier • Procs complete initialization  Retrieve endpoint info for peers Obstacles to scalability
  • 10. What Is Being Shared? • Job Info (~90%)  Names of participating nodes  Location and ID of procs  Relative ranks of procs (node, job)  Sizes (#procs in job, #procs on each node) • Endpoint info (~10%)  Contact info for each supported fabric Known to local RM daemon Can be computed for many fabrics
  • 11. Launch Initialization ExchangeMPIcontactinfo SetupMPIstructures barrier mpi_initcompletion barrier MPI_Init MPI_Finalize RRZ, 16-nodes, 8ppn, rank=0 Time(sec) Stage I Provide method for RM to share job info Work with fabric and library implementers to compute endpt from RM info
  • 15. Can We Go Faster? • Remaining startup time primarily consumed by  Positioning of application binary  Loading dynamic libraries  Bottlenecks: global file system, IO nodes • Stage IV: Pre-Positioning  WLM extended to consider retrieval, positioning times  Add NVRAM to rack-level controllers and/or compute nodes for staging  Pre-cache in background (during epilog, at finalize, …)
  • 16. What Gets Positioned? • Historically used static binaries  Lose support for scripting languages, dynamic programs • Proposal: containers  Working with Shifter & Singularity groups  Execution unit: application => container (single/multi-process)  Detect containerized application at job submission  Auto-containerize if not (past history of library usage)? Layered containers? • Provisioning  Provision when base OS is incompatible with container (Warewulf)
  • 17. How Low Can We Go? Launch/wireup as fast as executables can be started
  • 18. Motivation • Exascale launch times are a hot topic  Desire: reduce from many minutes to few seconds  Target: O(106) MPI processes on O(105) nodes thru MPI_Init in < 30 seconds • New programming models are exploding  Driven by need to efficiently exploit scale vs. resource constraints  Characterized by increased app-RM integration
  • 19. Changing Needs • Notifications/response  Errors, resource changes  Negotiated response • Request allocation changes  shrink/expand • Workflow management  Steered/conditional execution • QoS requests  Power, file system, fabric Multiple, use- specific libs? (difficult for RM community to support) Single, multi- purpose lib?
  • 20. Scalability • Memory footprint  Distributed database for storing Key-Values • Memory cache, DHT, other models?  One instance of database per node • "zero-message" data access using shared-memory • Launch scaling  Enhanced support for collective operations • Provide pattern to host, host-provided send/recv functions, embedded inter-node comm?  Rely on HSN for launch, wireup support • While app is quiescent, then return to mgmtEth
  • 21. Flexible Allocation Support • Request additional resources  Compute, memory, network, NVM, burst buffer  Immediate, forecast  Expand existing allocation, separate allocation • Return extra resources  No longer required  Will not be used for some specified time, reclaim (handshake) when ready to use • Notification of preemption  Provide opportunity to cleanly pause
  • 22. I/O Support • Asynchronous operations  Anticipatory data fetch, staging  Advise time to complete  Notify upon available • Storage policy requests  Hot/warm/cold data movement  Desired locations and striping/replication patterns  Persistence of files, shared memory regions across jobs, sessions  ACL to generated data across jobs, sessions
  • 23. Spawn Support • Staging support  Files, libraries required by new apps  Allow RM to consider in scheduler • Current location of data • Time to retrieve and position • Schedule scalable preload • Provisioning requests  Allow RM to consider in selecting resources, minimize startup time due to provisioning  Desired image, packages
  • 24. Network Integration • Quality of service requests  Bandwidth, traffic priority, power constraints  Multi-fabric failover, striping prioritization  Security requirements • Network domain definitions, ACLs • Notification requests  State-of-health  Update process endpoint upon fault recovery • Topology information  Torus, dragonfly, …
  • 25. Power Control/Management • Application requests  Advise of changing workload requirements  Request changes in policy  Specify desired policy for spawned applications  Transfer allocated power to specifed job • RM notifications  Need to change power policy • Allow application to accept, request pause  Preemption notification • Provide backward compatibility with PowerAPI
  • 26. Fault Tolerance • Notification  App can register for event notifications, incipient faults • RM-app negotiate to determine response  App can notify RM of errors • RM will notify specified, registered procs • Rapid application-driven checkpoint  Local on-node NVRAM, auto-stripe checkpoints  Policy-driven “bleed” to remote burst buffers and/or global file system • Restart support  Specify source (remote NVM checkpoint, global filesystem)  Location hints/requests  Entire job, specific processes
  • 27. Fault Prediction: Methodology • Exploit access to internals  Investigate optimal location, number of sensors  Embed intelligence, communications capability • Integrate data from all available sources  Engineering design tests  Reliability life tests  Production qualification tests • Utilize learning algorithms to improve performance  Both embedded, post process  Seed with expert knowledge
  • 28. Fault Prediction: Outcomes • Continuous update of mean-time-to- preventative-maintenance  Feed into projected downtime planning  Incorporate into scheduling algorithm • Alarm reports for imminent failures  Notify impacted sessions/applications  Plan/execute preemptive actions • Store predictions  Algorithm improvement
  • 29. Workflow Orchestration • Growing range of workflow patterns  Traditional HPC: bulk synchronous, single application  Analytics: asynchronous, staged  Parametric: large number of simultaneous independent jobs • Ability to dynamically spawn a job  Need flexibility – min/max size, rolling allocation, …  Events between jobs and cross-linking of I/O  Queuing of spawn requests • Tool interfaces that work in these environments
  • 30. Workload Manager: Job Description Language • Complexity of describing job is growing  Power, file/lib positioning  Performance vs capacity, programming model  System, project, application-level defaults • Provide templates?  System defaults, with modifiers • --hadoop:mapper=foo,reducer=bar  User-defined • Application templates • Shared, group templates  Markup language definition of behaviors, priorities 30 yrs
  • 31. PMIx Roadmap 2014 1/2016 1.1.3 Full PMIx reference server in OpenMPI Client library includes all APIs used by MPI SLURM server integration complete
  • 32. PMIx Roadmap 2014 1/2016 1.1.3 2Q2016 2.0.0 Support for new APIs: • Event notification • Flexible allocation • System query Scaling support: • Shared memory datastore
  • 33. PMIx Roadmap 2014 1/2016 1.1.3 2Q2016 2.0.0 3Q2016 RM Production Release* 2.1 Fabric manager integration • Topology info • Event reports • Process endpoint updates File system integration • Pre-position requests 4Q2016 *SLURM, IBM, Moab, PBSPro,…
  • 34. Bottom Line We now have an interface library to support application-directed requests Community is working to define what they want to do with it For any programming model MPI, OSHMEM, PGAS,…