Altair Confidential 1
PBS Professional
20 Features in 20 Minutes
USABILITY | RECLAIM RESOURCES | AUTOMATIONS | AUTO HEALTH CHECK | ASYNC THROUGHPUT
FLEXI RESERVATIONS | ALLOCATION MGMT | ARM64 READY | BURST BUFFER READY | NVIDIA DCGM READY
HIGH AVAILABILITY | OS PROVISIONING | MULTI SCHEDULER | DYNAMIC RESOURCES | HOOKS
ENERGY AWARE | TOPOLOGY AWARE | CLOUD BURSTING | CONTAINERS | CGROUPS
Hooks
• PBS Plugin (“Hooks”) Framework
• Unified data model built on industry-standard Python
• Augment core capabilities on-the-fly
• No recompiling, which preserves PBS Pro core stability
• Hook events at major state transition points
• Use cases
• Routing jobs
• Managing job resource requests
• Managing access to resources for users and jobs
• Ensuring efficient use of resources
• Ensuring that jobs run properly
• Converting requests to usable format
• Controlling interactive jobs
• Communicating information to users
• Helping to schedule jobs
• Managing user activity
• Enabling accounting and validation
• Allocation management
• Helping manage job execution
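To illustrate the "managing job resource requests" use case, here is a minimal sketch of the kind of decision logic a job-submission hook might apply. It is plain Python so it runs outside PBS; the function names, the dictionary layout, and the 24-hour limit are assumptions made for illustration, not part of PBS. In a real hook, the same check would read the request from the event object supplied by the `pbs` module and accept or reject the event.

```python
# Hypothetical site policy: reject jobs that request more than MAX_WALLTIME_S.
MAX_WALLTIME_S = 24 * 3600  # assumed limit, not a PBS default


def parse_walltime(text):
    """Convert 'HH:MM:SS', 'MM:SS', or 'SS' into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def check_walltime(resource_list):
    """Return (accepted, message) for a dict of requested resources."""
    walltime = resource_list.get("walltime")
    if walltime is None:
        return False, "walltime must be specified"
    if parse_walltime(walltime) > MAX_WALLTIME_S:
        return False, "walltime exceeds the 24-hour site limit"
    return True, "ok"


print(check_walltime({"walltime": "02:30:00"}))
print(check_walltime({"walltime": "48:00:00"}))
```

Keeping the policy in a pure function like this makes it easy to unit-test before it is wired into a live hook.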
Dynamic Resources
• Represent elements that are outside of the control of PBS
• Modular
• Scalable
• Rich rules with hooks
• License as a resource
• Global license managers
• Storage
• User quotas
• Scratch spaces on nodes
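As a sketch of the "scratch spaces on nodes" case, a dynamic resource can be backed by a small script that reports the current value when queried. The example below assumes the convention that the script prints a single value on standard output, and it assumes a size format such as `118gb`; the function name is hypothetical, and the system temporary directory merely stands in for a site's scratch path.

```python
import shutil
import tempfile


def free_scratch_gb(path):
    """Return the free space under `path` as a string such as '118gb'."""
    usage = shutil.disk_usage(path)
    return f"{usage.free // 1024**3}gb"


# The temporary directory is a stand-in for a real scratch file system.
print(free_scratch_gb(tempfile.gettempdir()))
```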
Multi-scheduler
Diagram labels: PBS Pro; FIFO, job sort formula, fairshare
• Run multiple scheduling engines within the same PBS Complex
• Heterogeneous user groups and workloads
• Load balancing
• Testing and staging
OS Provisioning
• Operating System as a Resource
• Integrate with third-party OS provisioning tools
• Provisioning / Orchestration – Bare metal
• Install the required operating system or application on bare metal
• Post-install automation support
• Multi-boot systems
• Workstation grids
Diagram labels: PBS Pro; WRF, LS-DYNA, OpenFOAM; provisioning tools
High Availability
• Built-in high availability
• No third-party software required
• All critical services moved in real time
• No loss of service availability
• Transparent
• Notifications
• Full-featured management tools
• Maintain quorum
• Interventions and servicing
Cgroups
• Ensures jobs have access to requested resources
• Can restrict resources for PBS jobs, preventing OOM conditions
• Ensures accurate resource accounting
• Enforces resource limits at the kernel level instead of relying on MoM polling for usage
• Consistent job runtime
Containers
• Lightweight virtualized environment for traditional HPC apps
• Number of containers that can be run on a host
• Time to launch a container
• All the benefits of containers (application maintenance)
• Conflicting application requirements (e.g., an app that runs only on CentOS 6 or needs an older library)
• Easy packaging of applications into their own “containers” with all dependencies included
• Natural extension to cgroups and cpusets
• Resource constraints, CPU pinning, etc.
Cloud bursting
Diagram labels: Microsoft Azure, Amazon Web Services, GCP, Oracle, PBS Works
• On-demand use of cloud resources to maximize efficiency
• Improve responsiveness, adding capacity exactly when needed
• Automatic governance and cost controls via site-defined policy and quotas
• Understands on-premises utilization, ensuring bursting only when cost-efficient
• Vendor-agnostic: no lock-in
• Fast: 1,000+ nodes in minutes
Topology Aware
Chart: average runtimes before vs. after, about 45% faster (actual customer-reported results)
• Inter-node & intra-node placement
• Switches, clusters, and NUMA
• All networks
• InfiniBand, Ethernet, custom
• Dynamic (runtime changeable)
• Support for all popular topologies
Energy Aware
Chart: DoD HPCMP, yearly savings (estimate)
• Eliminate energy waste with no loss in service
• turn off idle machines and backfill holes
• A/C savings by scheduling work onto cooler nodes
• Power capping: power_budget=0.5MW
• fit more hardware into smaller datacenters
• run in degraded mode during power emergencies
• Per-job power profiles: power=600W
• Power saving mode: off, standby, …
• Power ramping: slow up/down
• Energy accounting: energy=64.2kWh
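To make the power-capping figures concrete, the sketch below shows how a cap could gate job starts. It uses the slide's example values (a 0.5 MW budget and 600 W per job); the function name, the queue of 1,000 identical jobs, and the greedy in-order admission are assumptions for illustration and do not describe the actual PBS scheduling algorithm.

```python
def admit_under_power_cap(job_power_watts, power_budget_watts):
    """Walk the queue in order and admit each job that still fits under the cap."""
    admitted = []
    used = 0
    for index, watts in enumerate(job_power_watts):
        if used + watts <= power_budget_watts:
            admitted.append(index)
            used += watts
    return admitted, used


# Slide values: 0.5 MW budget, 600 W per job; the 1,000 queued jobs are assumed.
admitted, used = admit_under_power_cap([600] * 1000, 500_000)
print(len(admitted), used)  # 833 499800
```

With these numbers, 833 jobs fit under the cap (833 × 600 W = 499,800 W), and the 834th would exceed it.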
NVIDIA DCGM Ready
• Pre-job node risk identification and GPU resource allocation
• Automated monitoring of node health
• Reduced job terminations due to GPU failures
• Increased system resilience via intelligent routing decisions
• Increased job throughput via topology optimization
• Optimized job scheduling through GPU load and health monitoring
Burst Buffer Ready
• Stage or cache data between an application's computation and the parallel file system (PFS)
• Use as private scratch on compute nodes
• Out-of-core memory
• Shared storage, giving multiple jobs the same access to data
• Shared inputs
• Ensemble analysis
• In-transit analysis
• Compute node swap
• Over-commit compute node memory
• Job script support
• Native client integrations through hooks
ARM64
• Fujitsu's Post-K supercomputer will be powered by 64-bit Arm processors
• HPE and Sandia National Laboratories: Arm-based Astra supercomputer
• Fast-evolving ecosystem
• Support for ARMv8 in PBS Pro starting with v18
Allocation Management
• Supports compute, storage and budget ($)
• Manages grants, quotas, budgets, limits, etc.
• Implements charge-back business logic
• Includes reporting tools
• PBS Pro add-on module
Flexi Reservations
• Resource Reservation
• SLA
• Predictable workloads, e.g., weather models
• Standing reservations
• Allow reservations to start early or run over schedule
Throughput Mode
• Scheduler can run asynchronously
• Does not wait for each job to be accepted by MoM
• 10,000 jobs per minute
• Add-on hierarchical scheduler
• Handles small, short-job workloads
• Deploys per-user/project or site-wide
• Automatically adjusts to demand
• Built-in fairshare and limits
• Scales to millions of jobs
Auto Health check
• Handling failures at scale
• Degraded hardware health
• Mean time between failures (MTBF) of hardware components
• Improve Productivity
• Job failures prevented
• Improved throughput
• Improve admin productivity
• Offline nodes with possible causes
• Notifications
Automations
• HPC and High Throughput Workflows
• Directed acyclic graphs
• Expressed as Job Dependencies between two or more jobs
• Specifying the order in which jobs in a set should execute
• Requesting a job run only if an error occurs in another job
• Holding jobs until a particular job starts or completes execution
• Cylc
• Open Source project founded by NIWA
• Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore and more
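The dependency bullets above correspond to the `-W depend=...` option of `qsub`. The sketch below orders a small workflow with Python's standard `graphlib` module and builds the dependency string for each job. The workflow and job names are hypothetical, and the names stand in for the job IDs that `qsub` would return at submission time; `afterok` runs a job only if the upstream job succeeds, and `afternotok` only if it fails.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: job -> list of (dependency type, upstream job).
workflow = {
    "preprocess": [],
    "simulate": [("afterok", "preprocess")],
    "postprocess": [("afterok", "simulate")],
    "cleanup_on_error": [("afternotok", "simulate")],
}


def submission_order(flow):
    """Return job names ordered so that every upstream job comes first."""
    graph = {job: {up for _, up in deps} for job, deps in flow.items()}
    return list(TopologicalSorter(graph).static_order())


def depend_option(deps):
    """Build the value for `qsub -W`, or None for a job with no dependencies."""
    if not deps:
        return None
    return "depend=" + ",".join(f"{kind}:{up}" for kind, up in deps)


for job in submission_order(workflow):
    print(job, depend_option(workflow[job]))
```

Submitting in topological order matters because a job can only name an upstream job ID that already exists.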
Reclaim Resources
• Releasing Unneeded Vnodes from Your Job
• User level: -W release_nodes_on_stageout=true
• Admin: pbs_release_nodes
• Shrink-to-fit jobs
• Jobs that are internally checkpointed
• Jobs using periodic PBS checkpointing
• Jobs whose real running time might be much less than the expected time
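The idea behind shrink-to-fit can be sketched as follows, assuming the job supplies a minimum and a maximum walltime and the scheduler knows how long a free slot lasts. The function name and the example values are hypothetical, and this is a simplification rather than the scheduler's actual algorithm.

```python
def shrink_to_fit(min_walltime_s, max_walltime_s, available_s):
    """Pick a walltime within [min, max] that fits the available window.

    Returns None when even the minimum walltime does not fit.
    """
    if available_s < min_walltime_s:
        return None
    return min(max_walltime_s, available_s)


print(shrink_to_fit(3600, 36000, 7200))   # 7200
print(shrink_to_fit(3600, 36000, 1800))   # None
print(shrink_to_fit(3600, 36000, 86400))  # 36000
```

A job that checkpoints its own progress can make use of whatever window it is given, which is why the slide lists checkpointed jobs as good candidates.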
Usability
• Manage, Monitor and Measure
• Backward compatibility
• Behaves as a platform
• REST web service
• Data exchange formats for upstream processing and integrations
• Unlimited feature extensions
20 Altair PBS Professional Features in 20 minutes, 2018

  • 1. PBS Professional: 20 Features in 20 Minutes (Altair Confidential)
  • 2. Usability • Reclaim Resources • Automations • Auto Health Check • Async Throughput • Flexi Reservations • Allocation Mgmt • ARM64 Ready • Burst Buffer Ready • NVIDIA DCGM Ready • High Availability • OS Provisioning • Multi Scheduler • Dynamic Resources • Hooks • Energy Aware • Topology Aware • Cloud Bursting • Containers • Cgroups
  • 3. Hooks • PBS plugin (“Hooks”) framework • Unified data model built on industry-standard Python • Augment core capabilities on the fly • No recompiling, so PBS Pro core stability is preserved • Hook events at major state transition points • Use cases: routing jobs; managing job resource requests; managing access to resources for users and jobs; ensuring efficient use of resources; ensuring that jobs run properly; converting requests to a usable format; controlling interactive jobs; communicating information to users; helping to schedule jobs; managing user activity; enabling accounting and validation; allocation management; helping manage job execution
  • 4. Dynamic Resources • Represent elements that are outside the control of PBS • Modular • Scalable • Rich rules with hooks • License as a resource: global license managers • Storage: user quotas, scratch spaces on nodes
  • 5. Multi-scheduler (diagram labels: PBS Pro; FIFO, job sort formula, fairshare) • Run multiple scheduling engines within the same PBS complex • Heterogeneous user groups and workloads • Load balancing • Testing and staging
  • 6. OS Provisioning (diagram labels: PBS Pro; WRF, LS-DYNA, OpenFOAM; provisioning tools) • Operating system as a resource • Integrate with third-party OS provisioning tools • Provisioning / orchestration on bare metal • Install the required operating system or application on bare metal • Post-install automation support • Multi-boot systems • Workstation grids
  • 7. High Availability • High availability built in • No third-party software required • All critical services moved in real time • No loss of service availability • Transparent • Notifications • Full-feature manageability tools • Maintain quorum • Interventions and servicing
  • 8. Cgroups • Ensures jobs have access to requested resources • Can restrict resources for PBS jobs, preventing OOM conditions • Ensures accurate resource accounting • Provides resource enforcement at the kernel level instead of the MoM polling for usage • Consistent job runtime
  • 9. Containers • Lightweight virtualized environment for traditional HPC apps • Number of containers that can be run on a host • Time to launch a container • All the goodies of containers (app maintenance) • Conflicting requirements for applications (e.g., an app can run only on CentOS 6, or needs an older library) • Ease of packaging applications into their own “containers” with all dependencies included • Natural extension to cgroups and cpusets: resource constraining, CPU pinning, etc.
  • 10. Cloud Bursting (logos: Microsoft Azure, Amazon Web Services, GCP, Oracle, PBS Works) • On-demand use of cloud resources to maximize efficiency • Improve responsiveness, adding capacity exactly when needed • Automatic governance and cost controls via site-defined policy and quotas • Understands on-premises utilization, ensuring bursting only when cost-efficient • Vendor-agnostic: no lock-in • Fast: 1,000+ nodes in minutes
  • 11. Topology Aware (chart: before/after average runtimes, ~45% faster; actual customer-reported results) • Inter-node and intra-node placement • Switches, clusters, and NUMA • All networks: InfiniBand, Ethernet, custom • Dynamic (runtime changeable) • Support for all popular topologies
  • 12. Energy Aware (chart: DoD HPCMP yearly savings, estimate) • Eliminate energy waste with no loss in service: turn off idle machines and backfill holes • A/C savings by scheduling work onto cooler nodes • Power capping (power_budget=0.5MW): fit more hardware into smaller datacenters; run in degraded mode during power emergencies • Per-job power profiles: power=600W • Power saving mode: off, standby, … • Power ramping: slow up/down • Energy accounting: energy=64.2kWh
  • 13. NVIDIA DCGM Ready • Pre-job node risk identification and GPU resource allocation • Automated monitoring of node health • Reduced job terminations due to GPU failures • Increased system resilience via intelligent routing decisions • Increased job throughput via topology optimization • Optimized job scheduling through GPU load and health monitoring
  • 14. Burst Buffer Ready • Stage / cache data between an application computation and the PFS • Use as private scratch on compute nodes • Out-of-core memory • Shared storage, giving multiple jobs the same access to data: shared inputs, ensemble analysis, in-transit analysis • Compute node swap: over-commit compute node memory • Job script support • Native client integrations through hooks
  • 15. ARM64 • Fujitsu Post-K supercomputer will be powered by 64-bit Arm processors • HPE and Sandia National Laboratories: Arm-based Astra supercomputer • Fast-evolving ecosystem • Support for Armv8 in PBS Pro starting with v18
  • 16. Allocation Management • Supports compute, storage, and budget ($) • Manages grants, quotas, budgets, limits, etc. • Implements charge-back business logic • Includes reporting tools • PBS Pro add-on module
  • 17. Flexi Reservations • Resource reservation • SLA • Predictable workloads, e.g., weather models • Standing reservations • Allow reservations to start early or run over schedule
  • 18. Throughput Mode • Scheduler can run asynchronously: it doesn’t wait for each job to be accepted by MoM • 10,000 jobs per minute • Add-on hierarchical scheduler • Handles small, short-job workloads • Deploys per user/project or site-wide • Automatically adjusts to demand • Built-in fairshare and limits • Scales to millions of jobs
  • 19. Auto Health Check • Handling failures at scale: degraded hardware health; mean time between failures of hardware components • Improve productivity: job failures prevented; improved throughput • Improve admin productivity: offline nodes with possible causes; notifications
  • 20. Automations • HPC and high-throughput workflows • Directed acyclic graphs • Expressed as job dependencies between two or more jobs: specifying the order in which jobs in a set should execute; requesting that a job run only if an error occurs in another job; holding jobs until a particular job starts or completes execution • Cylc: open source project founded by NIWA; now includes NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, and more
  • 21. Reclaim Resources • Releasing unneeded vnodes from your job • User level: -W release_nodes_on_stageout=true • Admin: pbs_release_nodes • Shrink-to-fit jobs: jobs that are internally checkpointed; jobs using periodic PBS checkpointing; jobs whose real running time might be much less than the expected time
  • 22. Usability • Manage, monitor, and measure • Backward compatibility • Behaves as a platform • REST web service • Data exchange formats for upstream processing and integrations • Unlimited feature extensions
  • 23. Usability • Reclaim Resources • Automations • Auto Health Check • Async Throughput • Flexi Reservations • Allocation Mgmt • ARM64 Ready • Burst Buffer Ready • NVIDIA DCGM Ready • High Availability • OS Provisioning • Multi Scheduler • Dynamic Resources • Hooks • Energy Aware • Topology Aware • Cloud Bursting • Containers • Cgroups
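
As an illustration of the "managing job resource requests" use case from the Hooks slide (slide 3), here is a minimal sketch of the logic a queuejob hook might carry. It is not taken from the deck: the 24-hour limit, the messages, and the function names `walltime_problem` and `handle_event` are hypothetical. It assumes the event object exposes `job.Resource_List`, `reject()`, and `accept()` as the PBS hook framework's `pbs.event()` does, and that an unset walltime reads back as `None`.

```python
# Hypothetical site policy: reject jobs with no walltime or more than 24 hours.
MAX_WALLTIME_SECONDS = 24 * 3600  # assumed limit, not from the slides


def walltime_problem(walltime_seconds):
    """Return a rejection message, or None if the request is acceptable."""
    if walltime_seconds is None:
        return "Please request a walltime (-l walltime=HH:MM:SS)."
    if walltime_seconds > MAX_WALLTIME_SECONDS:
        return "Requested walltime exceeds the 24-hour site limit."
    return None


def handle_event(event):
    """Apply the policy to a queuejob-style event object."""
    requested = event.job.Resource_List["walltime"]
    seconds = None if requested is None else int(requested)
    message = walltime_problem(seconds)
    if message is not None:
        event.reject(message)  # the job is refused and the user sees the message
        return
    event.accept()


# Inside an actual hook script, the `pbs` module is supplied by the hook
# framework, and the last lines would be:
#   import pbs
#   handle_event(pbs.event())
```

Keeping the policy in a plain function means it can be checked outside PBS before the hook is imported into the server.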

Editor's Notes

  1. Mana: 9,216 cores; Harold: 500 nodes; AbUtil: 80 nodes. Overall benefits: eliminate waste with no loss in service (as we turn off idle machines and backfill holes); A/C savings by scheduling work onto cooler nodes; power capping means you can fit more hardware into smaller datacenters (provision only for used power, not peak power); power capping can also be used to run in degraded mode during power emergencies / disasters; measure, report, and charge back power use. Note: not running a job twice (because PBS mitigates system failures) is also very green.
  2. Staging copies files from the PFS to the burst buffer for execution and then stages the data out. Cache moves data implicitly (read-ahead and write-behind), which is useful for checkpoint/restart, periodic output, and application libraries.
  3. Open source project founded by NIWA, New Zealand. Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, …
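
The job-dependency styles listed on the Automations slide (slide 20) correspond to the `-W depend=` option of `qsub`. The sketch below only builds the command line and does not submit anything; the helper name `qsub_command`, the script names, and the job ID are hypothetical. Of the dependency types PBS offers, it covers `after` (wait until the job starts), `afterok` (run after success), `afternotok` (run only after an error), and `afterany` (run after completion either way).

```python
DEPENDENCY_TYPES = {"after", "afterok", "afternotok", "afterany"}


def qsub_command(script, depend_type=None, job_ids=()):
    """Build a qsub argument list, optionally with a job dependency."""
    cmd = ["qsub"]
    if depend_type is not None:
        if depend_type not in DEPENDENCY_TYPES:
            raise ValueError("unsupported dependency type: " + depend_type)
        if not job_ids:
            raise ValueError("a dependency needs at least one job ID")
        # Several job IDs are joined with ':' after the dependency type.
        cmd += ["-W", "depend={}:{}".format(depend_type, ":".join(job_ids))]
    cmd.append(script)
    return cmd


# A cleanup job that runs only if job 123.server ends with an error:
print(qsub_command("cleanup.sh", "afternotok", ["123.server"]))
# prints ['qsub', '-W', 'depend=afternotok:123.server', 'cleanup.sh']
```

Passing the returned list to `subprocess.run` would submit the job. Chaining such calls, each depending on the previous job's ID, gives a simple linear case of the directed acyclic graph the slide mentions.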