HPC Resource Management: View to the Future
Ralph H. Castain
Jan 27, 2016
Intel Confidential

HPC Resource Management: Futures

Future vision of HPC resource management


  1. HPC Resource Management: View to the Future. Ralph H. Castain, Jan 27, 2016
  2. Definition: RM
     • Scheduler/Workload Manager (WLM)
     o Allocates resources to session
     o Interactive and batch
     • Run-Time Environment
     o Launch and monitor applications
     o Support inter-process communication wireup
     o Serve as intermediary between applications and WLM
     • Dynamic resource requests
     • Error notification
     o Implement failure policies
  3. Current State
     • Provided as complete package (WLM+RTE)
     o Sometimes offer hook to replace one
     o Mostly proprietary
     o Open source often GPL (integration difficult)
     • Programming model specific, static
     • Independent island
     o Not integrated with monitoring, file systems, or networks
     • Limited fault tolerance support
     o Restart failed job
  4. RM => Orchestrator
     [Diagram labels: Monitoring; Console; DB; File System; Network; Resource Manager; Overlay Network; Pub-Sub; SCON; Prov. Agent; RM; APP]
  5. Resource Management Emerging Needs
     • Emerging Application Needs
     o Dynamic resource management
     o Application involvement in decisions, including fault notification
     o Workflow orchestration
     • Emerging System Needs
     o Scalable operations
     o Cross-subsystem integration (RM, RAS, site utilities, …)
     o Data staging, Burst Buffers, persistent memory management
     o System notification requests, topology, QoS
     o Power management
  6. Multi-Tiered Strategy
     [Diagram labels, in extraction order: Existing HPC RMs; Open Resilient Cluster Mgr; SLURM; MOAB; PBSPro; LSF; PMIx; RAS Integration; Hadoop Support; Enable; Reference Implementation; Overlay Network; Containers (S&S); Provisioning (WW); Data Analytics; ULFM & Fault Pred.; Technology]
  7. Reference Implementation: ORCM
     • Scalable to exascale levels & beyond
     – Better-than-linear scaling
     – Constrained memory footprint
     • Dynamically configurable
     – Sense and adapt, user-directable
     – On-the-fly updates
     – Fully utilize underlying hw
     • Open source (non-GPL)
     – Support proprietary add-ons
     • Maintainable, flexible
     – Plugin architecture
     – Easy extension for R&D
     • Resilient
     – Self-heal around failures
     – Reintegrate recovered resources
  8. Launch Scaling: Core Capability
  9. [Chart: launch timing. Axis: Time (sec). Configuration: RRZ, 16 nodes, 8 ppn, rank=0. Labels: Launch; Initialization; Exchange MPI contact info; Setup MPI structures; barrier; mpi_init completion; barrier; MPI_Init; MPI_Finalize]
  10. [Chart: launch timing, same labels and configuration (RRZ, 16 nodes, 8 ppn, rank=0; Time (sec))]
     Stage I
     • Provide method for RM to share job info
     • Work with fabric and library implementers to compute endpt from RM info
  11. [Chart: launch timing, same labels and configuration (RRZ, 16 nodes, 8 ppn, rank=0; Time (sec))]
     Stage II
     • Add on 1st communication (non-PMIx)
  12. [Chart: launch timing, same labels and configuration (RRZ, 16 nodes, 8 ppn, rank=0; Time (sec))]
     Stage III
     • Use HSN, Daemon “instant on”, Coll
  13. Current Status
     [Chart labels: Stage I-II; Stage III (projected)]
     Adoption: SLURM, LSF, Moab, PBSPro
  14. Flexible Allocation Support
     • Request additional resources
     o Compute, memory, network, NVM, burst buffer
     o Immediate, forecast
     o Expand existing allocation, separate allocation
     • Return extra resources
     o No longer required
     o Will not be used for some specified time, reclaim (handshake) when ready to use
     • Notification of preemption
     o Provide opportunity to cleanly pause
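The request and return operations listed on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions: `AllocationRequest`, `Timing`, `Mode`, and `ToyResourceManager` are hypothetical names invented here, not part of ORCM, PMIx, or any scheduler named in the deck.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Timing(Enum):
    IMMEDIATE = "immediate"
    FORECAST = "forecast"


class Mode(Enum):
    EXPAND = "expand existing allocation"
    SEPARATE = "separate allocation"


@dataclass(frozen=True)
class AllocationRequest:
    """Hypothetical request record covering the options named on the slide."""
    resource: str                 # e.g. "nodes", "burst_buffer_gb"
    amount: int
    timing: Timing = Timing.IMMEDIATE
    mode: Mode = Mode.EXPAND


class ToyResourceManager:
    """Toy RM: grants from a free pool and accepts returned resources."""

    def __init__(self, free: Dict[str, int]) -> None:
        self.free = dict(free)
        self.held: Dict[str, int] = {}

    def request(self, req: AllocationRequest) -> bool:
        # Grant only if the free pool can cover the whole request.
        if self.free.get(req.resource, 0) < req.amount:
            return False
        self.free[req.resource] -= req.amount
        self.held[req.resource] = self.held.get(req.resource, 0) + req.amount
        return True

    def release(self, resource: str, amount: int) -> int:
        # Return at most what the application actually holds.
        amount = min(amount, self.held.get(resource, 0))
        self.held[resource] = self.held.get(resource, 0) - amount
        self.free[resource] = self.free.get(resource, 0) + amount
        return amount


rm = ToyResourceManager({"nodes": 4})
granted = rm.request(AllocationRequest("nodes", 3))
denied = rm.request(AllocationRequest("nodes", 2, mode=Mode.SEPARATE))
returned = rm.release("nodes", 1)
```

A real RM would also have to handle forecast requests, the reclaim handshake, and preemption notices; the toy model leaves those out.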
  15. I/O Support
     • Asynchronous operations
     o Anticipatory data fetch, staging
     o Advise time to complete
     o Notify upon available
     • Storage policy requests
     o Hot/warm/cold data movement
     o Desired locations and striping/replication patterns
     o Persistence of files, shared memory regions across jobs, sessions
     o ACL to generated data across jobs, sessions
  16. Spawn Support
     • Staging support
     o Files, libraries required by new apps
     o Allow RM to consider in scheduler
     • Current location of data
     • Time to retrieve and position
     • Schedule scalable preload
     • Specified topology, min/max procs, …
     • Provisioning requests
     o Allow RM to consider in selecting resources, minimize startup time due to provisioning
     o Desired image, packages
  17. Network Integration
     • Quality of service requests
     o Bandwidth, traffic priority, power constraints
     o Multi-fabric failover, striping prioritization
     o Security requirements
     • Network domain definitions, ACLs
     • Notification requests
     o State-of-health
     o Update process endpoint upon fault recovery
     • Topology information
     o Torus, dragonfly, …
  18. Power Control/Management
     • Scheduler analytics
     o Predict power usage, advise updates
     • Application requests
     o Advise of changing workload requirements
     o Request changes in policy
     o Specify desired policy for spawned applications
     • RM notifications
     o Need to change power policy
     o Preemption notification
     • Allow application to accept, request pause
  19. Fault Tolerance
     • Notification
     o App can register for error notifications, incipient faults
     • RM-app negotiate to determine response
     o App can notify RM of errors
     • RM will notify specified, registered procs
     • Rapid application-driven checkpoint
     o Local on-node NVRAM, auto-stripe checkpoints
     o Policy-driven “bleed” to remote burst buffers and/or global file system
     • Restart support
     o Specify source (remote NVM checkpoint, global filesystem, etc)
     o Location hints/requests
     o Entire job, specific processes
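The register-then-notify pattern on this slide might look roughly like the following. It is a minimal sketch, assuming a simple in-process callback table; `NotificationRegistry` and the event names are hypothetical and do not reproduce the interface of any real RM.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Callback = Callable[[str, Dict[str, str]], None]


class NotificationRegistry:
    """Toy stand-in for the RM side of error notification."""

    def __init__(self) -> None:
        self._callbacks: DefaultDict[str, List[Callback]] = defaultdict(list)

    def register(self, event_type: str, callback: Callback) -> None:
        # An application process registers interest in one class of event.
        self._callbacks[event_type].append(callback)

    def notify(self, event_type: str, info: Dict[str, str]) -> int:
        # Deliver to every registered callback; return how many were reached.
        delivered = 0
        for callback in self._callbacks.get(event_type, []):
            callback(event_type, info)
            delivered += 1
        return delivered


received: List[str] = []
registry = NotificationRegistry()

# Hypothetical event names: an incipient fault and a hard failure.
registry.register("incipient_fault", lambda ev, info: received.append(info["node"]))
registry.register("proc_failed", lambda ev, info: received.append(info["rank"]))

# The RM reports a predicted fault; an app could report its own errors the same way.
reached = registry.notify("incipient_fault", {"node": "node042"})
unheard = registry.notify("link_down", {"node": "node007"})
```

The negotiation step (the RM and the application agreeing on a response) would sit on top of this, for example by letting a callback return its preferred action.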
  20. Workflow Orchestration
     • Growing range of workflow patterns
     o Traditional HPC: bulk synchronous, single application
     o Analytics: asynchronous, staged
     o Parametric: large number of simultaneous independent jobs
     • Ability to dynamically spawn a job
     o Need flexibility – min/max size, rolling allocation, …
     o Events between jobs and cross-linking of I/O
     o Queuing of spawn requests
     • Tool interfaces that work in these environments
  21. Workload Manager: Job Description Language
     • Complexity of describing job is growing
     o Power, file/lib positioning
     o Performance vs capacity, programming model
     o System, project, application-level defaults
     • Provide templates?
     o System defaults, with modifiers
     • --hadoop:mapper=foo,reducer=bar
     o User-defined
     • Application templates
     • Shared, group templates
     o Markup language definition of behaviors, priorities
     30 yrs
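The "system defaults, with modifiers" idea can be made concrete with the slide's own example option, `--hadoop:mapper=foo,reducer=bar`. The sketch below assumes that the text after `--` names a template and that the comma-separated `key=value` pairs override its defaults; that reading, the `SYSTEM_DEFAULTS` table, and both function names are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical system-level template; the keys and values are invented.
SYSTEM_DEFAULTS: Dict[str, Dict[str, str]] = {
    "hadoop": {"mapper": "default_map", "reducer": "default_reduce", "nodes": "4"},
}


def parse_modifier(arg: str) -> Tuple[str, Dict[str, str]]:
    """Split '--name:key=val,key=val' into (name, {key: val})."""
    if not arg.startswith("--"):
        raise ValueError(f"not a template option: {arg!r}")
    name, _, params = arg[2:].partition(":")
    overrides: Dict[str, str] = {}
    for item in filter(None, params.split(",")):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        overrides[key] = value
    return name, overrides


def resolve(arg: str) -> Dict[str, str]:
    """Start from the template defaults, then apply the user's modifiers."""
    name, overrides = parse_modifier(arg)
    return {**SYSTEM_DEFAULTS[name], **overrides}


job = resolve("--hadoop:mapper=foo,reducer=bar")
```

Here `job` keeps the default `nodes` value and takes `mapper` and `reducer` from the command line, which is the layering the slide describes for system, project, and application-level defaults.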
  22. Flexible Architecture
     • Each tool built on top of same plugin system
     o Different combinations of frameworks
     o Different plugins activated to play different roles
     o Example: orcmd on compute node vs on rack/row controllers
     • Designed for distributed, centralized, hybrid operations
     o Centralized for small clusters
     o Hybrid for larger clusters
     o Example: centralized scheduler, distributed “worker-bees”
     • Accessible to users for interacting with RM
     o Add shim libraries (abstract, public APIs) to access framework APIs
     o Examples: SCON, pub-sub, in-flight analytics
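One way to picture "different plugins activated to play different roles" is a framework object that holds registered plugins and a per-role selection table. This is a sketch only: the `Framework` class, the plugin names, and the `ROLES` table are hypothetical and are not taken from the ORCM source.

```python
from typing import Callable, Dict, List


class Framework:
    """A named extension point holding interchangeable plugins."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._plugins: Dict[str, Callable[[], str]] = {}

    def register(self, plugin: str, factory: Callable[[], str]) -> None:
        self._plugins[plugin] = factory

    def activate(self, selected: List[str]) -> List[str]:
        # Instantiate only the plugins chosen for this role.
        return [self._plugins[p]() for p in selected]


sensor = Framework("sensor")
sensor.register("coretemp", lambda: "coretemp sampler")
sensor.register("aggregate", lambda: "rack aggregator")

# Hypothetical role table: the same daemon binary, different active plugins.
ROLES: Dict[str, List[str]] = {
    "compute_node": ["coretemp"],
    "rack_controller": ["aggregate"],
}


def start_daemon(role: str) -> List[str]:
    """Return the plugin instances a daemon would run in the given role."""
    return sensor.activate(ROLES[role])
```

Calling `start_daemon("compute_node")` and `start_daemon("rack_controller")` yields different active plugin sets from the same framework, which mirrors the orcmd example on the slide.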
  23. Breaking it Down
     • Workload Manager
     o Dedicated framework
     o Plugins for two-way integration to external WM (Moab, Cobalt)
     o Plugins for implementing internal WM (FIFO)
     • Run-Time Environment
     o Broken down into functional blocks, each with own framework
     • Loosely divided into three general categories
     • Messaging, launch, error handling
     • One or more frameworks for each category
     o Knitted together via “state machine”
     • Event-driven, async
     • Each functional block can be separate thread
     • Each plugin within each block can be separate thread(s)
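The event-driven "state machine" that knits the functional blocks together can be sketched as a queue of (state, job) events with one handler per state. The state names, handler functions, and job fields below are hypothetical, and the sketch runs on a single thread for clarity, whereas the slide notes that each block may run in its own thread.

```python
import queue
from typing import Callable, Dict, List, Tuple


class StateMachine:
    """Dispatch queued (state, job) events to the handler registered per state."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[["StateMachine", dict], None]] = {}
        self._events: "queue.Queue[Tuple[str, dict]]" = queue.Queue()
        self.trace: List[str] = []

    def on(self, state: str, handler: Callable[["StateMachine", dict], None]) -> None:
        self._handlers[state] = handler

    def activate(self, state: str, job: dict) -> None:
        # Handlers never call each other directly; they only post the next event.
        self._events.put((state, job))

    def run(self) -> None:
        while not self._events.empty():
            state, job = self._events.get()
            self.trace.append(state)
            self._handlers[state](self, job)


def allocate(sm: StateMachine, job: dict) -> None:
    job["nodes"] = 2
    sm.activate("LAUNCH", job)


def launch(sm: StateMachine, job: dict) -> None:
    job["procs"] = job["nodes"] * job["ppn"]
    sm.activate("RUNNING", job)


def running(sm: StateMachine, job: dict) -> None:
    sm.activate("COMPLETE", job)


def complete(sm: StateMachine, job: dict) -> None:
    job["done"] = True


sm = StateMachine()
for state, handler in [("ALLOCATE", allocate), ("LAUNCH", launch),
                       ("RUNNING", running), ("COMPLETE", complete)]:
    sm.on(state, handler)

job = {"ppn": 8}
sm.activate("ALLOCATE", job)
sm.run()
```

Because each handler only posts an event, swapping the plugin behind a state (for example, a different launch method) does not require changes to the other blocks.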
  24. Analytics Workflow Concept
     [Diagram labels: Input Module; Generalized Format; Sensors; Other Workflows; Other Workflow; RAS Event; Pub-Sub; Database. Note on diagram: Available in SCON as well]
  25. Workflow Elements
     • Average (window, running, etc.)
     • Rate (convert incoming data to events/sec)
     • Threshold (high, low)
     • Filter
     o Selects input values based on provided params
     • RAS event
     o Generates a RAS event corresponding to input description
     • Publish data
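Several of these elements compose naturally as a pipeline. The following sketch chains a windowed average, a high/low threshold, and a RAS-event generator using Python generators; the function names, the event record layout, and the sample data are all assumptions made for illustration, not the ORCM analytics plugin interface.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, Optional, Tuple


def window_average(values: Iterable[float], size: int) -> Iterator[float]:
    """Average element: yield the mean of the most recent `size` samples."""
    window: deque = deque(maxlen=size)
    for value in values:
        window.append(value)
        yield sum(window) / len(window)


def threshold(values: Iterable[float], high: Optional[float] = None,
              low: Optional[float] = None) -> Iterator[Tuple[float, str]]:
    """Threshold element: pass on only the samples outside the limits."""
    for value in values:
        if high is not None and value > high:
            yield value, "high"
        elif low is not None and value < low:
            yield value, "low"


def ras_events(crossings: Iterable[Tuple[float, str]],
               description: str) -> Iterator[Dict[str, object]]:
    """RAS event element: wrap each crossing in a (hypothetical) event record."""
    for value, level in crossings:
        yield {"description": description, "level": level, "value": value}


# Invented temperature samples; averages over a window of 2 are
# 40.0, 41.0, 61.0, 85.0, 65.5, so three of them exceed the 60.0 limit.
samples = [40.0, 42.0, 80.0, 90.0, 41.0]
events = list(ras_events(threshold(window_average(samples, 2), high=60.0),
                         "coretemp above limit"))
```

Each stage consumes the previous stage's stream, so a workflow is just an ordering of elements; a Rate, Filter, or Publish element would slot in the same way.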
  26. Analytics
     • Execute on aggregator nodes for in-flight reduction
     o Sys admin defines, user can define (if permitted)
     • Event-based state machine
     o Each workflow in own thread, own instance of each plugin
     o Branch and merge of workflow
     o Tap stream between workflow steps
     o Tap data streams (sensors, others)
     • Event generation
     o Generate events/alarms
     o Specify data to be included (window)
  27. Distributed Architecture
     • Hierarchical, distributed approach for unlimited scalability
     o Utilize daemons on rack/row controllers
     • Analysis done at each level of the hierarchy
     o Support rapid response to critical events
     o Distribute processing load
     o Minimize data movement
     • RM’s error manager framework controls response
     o Based on specified policies
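Analysis at each level of the hierarchy can be illustrated with a two-level reduction: each rack controller summarizes its own node readings, reacts locally to anything critical, and forwards only the summary upward. The rack names, readings, and `LIMIT` value below are invented for the example.

```python
from typing import Dict, Iterable, List


def summarize(readings: List[float]) -> Dict[str, float]:
    """Rack-level reduction: keep only count, total, and max."""
    return {"count": len(readings), "total": sum(readings), "max": max(readings)}


def merge(summaries: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Row-level reduction: combine rack summaries without the raw data."""
    merged = {"count": 0, "total": 0.0, "max": float("-inf")}
    for summary in summaries:
        merged["count"] += summary["count"]
        merged["total"] += summary["total"]
        merged["max"] = max(merged["max"], summary["max"])
    return merged


# Invented per-node temperature readings, grouped by rack.
racks = {"rack0": [55.0, 58.0, 61.0], "rack1": [57.0, 91.0, 60.0]}
LIMIT = 85.0  # hypothetical policy threshold

rack_summaries = {name: summarize(values) for name, values in racks.items()}

# Critical events are detected at the rack level, without waiting for the row.
local_alarms = [name for name, s in rack_summaries.items() if s["max"] > LIMIT]

# Only three numbers per rack move up the hierarchy.
row_summary = merge(rack_summaries.values())
row_mean = row_summary["total"] / row_summary["count"]
```

The row controller still obtains an exact mean and maximum while receiving a fixed-size summary per rack, which is the data-movement saving the slide points to.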
  28. Fault Diagnosis
     • Identify root cause and location
     o Sometimes obvious – e.g., when direct measurement
     o Other times non-obvious
     • Multiple cascading impacts
     • Cause identified by multi-sensor correlations (indirect measurement)
     • Direct measurement yields early report of non-root cause
     • Example: power supply fails due to borderline cooling + high load
     • Estimate severity
     o Safety issue, long-term damage, imminent failure
     • Requires in-depth understanding of hardware
  29. Fault Prediction: Methodology
     • Exploit access to internals
     o Investigate optimal location, number of sensors
     o Embed intelligence, communications capability
     • Integrate data from all available sources
     o Engineering design tests
     o Reliability life tests
     o Production qualification tests
     • Utilize learning algorithms to improve performance
     o Both embedded, post process
     o Seed with expert knowledge
  30. Fault Prediction: Outcomes
     • Continuous update of mean-time-to-preventative-maintenance
     o Feed into projected downtime planning
     o Incorporate into scheduling algo
     • Alarm reports for imminent failures
     o Notify impacted sessions/applications
     o Plan/execute preemptive actions
     • Store predictions
     o Algorithm improvement
  31. HPC Controls
