Altair PBS Works Suite is the industry's most advanced suite of software for high-performance computing. It includes PBS Access, aimed at engineers and researchers, and PBS Control, aimed at administrators and HPC managers.
2. Altair Confidential 2
USABILITY | RECLAIM RESOURCES | AUTOMATIONS | AUTO HEALTH CHECK | ASYNC THROUGHPUT
FLEXI RESERVATIONS | ALLOCATION MGMT | ARM64 READY | BURST BUFFER READY | NVIDIA DCGM READY
HIGH AVAILABILITY | OS PROVISIONING | MULTI SCHEDULER | DYNAMIC RESOURCES | HOOKS
ENERGY AWARE | TOPOLOGY AWARE | CLOUD BURSTING | CONTAINERS | CGROUPS
3. Hooks
• PBS Plugin (“Hooks”) Framework
• Unified data model built on industry-standard Python
• Augment core capabilities on-the-fly
• No recompiling of PBS Pro, so core stability is preserved
• Hook events at major state transition points
• Use cases
• Routing jobs
• Managing job resource requests
• Managing access to resources for users and jobs
• Ensuring efficient use of resources
• Ensuring that jobs run properly
• Converting requests to usable format
• Controlling interactive jobs
• Communicating information to users
• Helping to schedule jobs
• Managing user activity
• Enabling accounting and validation
• Allocation management
• Helping manage job execution
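The "managing job resource requests" and "routing jobs" use cases can be illustrated with a minimal sketch. The decision logic is written as a plain Python function so that it runs outside PBS; the function name, the 72-hour limit, and the queue names are all hypothetical. In a real hook, checks like these would be attached to a hook event through the pbs module, which is available only inside PBS and is not used here.

```python
from dataclasses import dataclass

MAX_WALLTIME_S = 72 * 3600  # hypothetical site limit: 72 hours


@dataclass
class Decision:
    accepted: bool
    queue: str = ""
    message: str = ""


def check_job(walltime_s: int, ncpus: int) -> Decision:
    """Validate a job request and choose a queue (illustrative policy only)."""
    if walltime_s > MAX_WALLTIME_S:
        return Decision(False, message="walltime exceeds the 72-hour site limit")
    # Route small, short jobs to a hypothetical 'short' queue.
    if walltime_s <= 3600 and ncpus <= 4:
        return Decision(True, queue="short")
    # Everything else goes to a hypothetical default queue.
    return Decision(True, queue="workq")


print(check_job(1800, 2))        # small, short job
print(check_job(100 * 3600, 64)) # over the walltime limit
```

Keeping the policy in a pure function makes it easy to test before it is wired into a live scheduler.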
4. Dynamic Resources
• Represent elements that are outside of the control of PBS
• Modular
• Scalable
• Rich rules with hooks
• License as a resource
• Global license Managers
• Storage
• User quotas
• Scratch spaces on nodes
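A dynamic resource value is typically produced by a site-supplied script that reports the current amount available. As a minimal sketch of the "scratch spaces on nodes" case, the script below prints the free space under a directory in gigabytes. The function name and the choice of directory are assumptions for illustration; how such a script is registered with PBS is not shown.

```python
import shutil
import tempfile


def free_scratch_gb(path: str) -> int:
    """Return the free space under `path` in whole gigabytes (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free // 1024**3


if __name__ == "__main__":
    # The system temp directory stands in for a node-local scratch area.
    print(f"{free_scratch_gb(tempfile.gettempdir())}gb")
```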
6. OS Provisioning
• Operating System as a Resource
• Integrate with third-party OS provisioning tools
• Provisioning / Orchestration – Bare metal
• Install the required operating system or application on bare metal
• Post-install automation support
• Multi boot systems
• Workstation grids
Diagram: PBS Pro, provisioning tools, and applications (WRF, LS-DYNA, OpenFOAM)
7. High Availability
• Built-in high availability
• No third-party software required
• All critical services moved in real time
• No loss of service availability
• Transparent
• Notifications
• Full-featured management tools
• Maintain quorum
• Interventions and servicing
8. Cgroups
• Ensures jobs have access to requested resources
• Can restrict resources for PBS jobs, preventing OOM conditions
• Ensures accurate resource accounting
• Provides resource enforcement at the kernel level instead of relying on MoM polling for usage
• Consistent job runtime
9. Containers
• Lightweight virtualized environment for traditional HPC apps
• Number of containers that can be run on a host
• Time to launch a container
• All the benefits of containers (application maintenance)
• Conflicting requirements for applications (e.g., an app can run only on CentOS 6, or needs an older library)
• Ease of packaging applications into their own “containers” with all dependencies included
• Natural extension to cgroups and cpusets
• resource constraining, CPU pinning, etc.
10. Cloud bursting
Diagram: PBS Works with Microsoft Azure, Amazon Web Services, GCP, and Oracle
• On-demand use of cloud resources to maximize efficiency
• Improve responsiveness, adding capacity exactly when needed
• Automatic governance and cost controls via site-defined policy and quotas
• Understands on-premise utilization, ensuring bursting only when cost-efficient
• Vendor-agnostic: no lock-in
• Fast: 1,000+ nodes in minutes
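The governance bullets above can be read as a simple decision rule. The sketch below bursts only when queued demand exceeds idle on-premise capacity, and caps the request by a remaining budget and a node quota. Every name, price, and threshold here is hypothetical; this is not the PBS Works policy engine.

```python
def nodes_to_burst(queued_nodes: int, idle_onprem_nodes: int,
                   price_per_node_hour: float, hours: float,
                   remaining_budget: float, max_cloud_nodes: int) -> int:
    """Return how many cloud nodes to request under a budget and a node quota."""
    # Burst only for demand that on-premise capacity cannot absorb.
    shortfall = max(0, queued_nodes - idle_onprem_nodes)
    if shortfall == 0 or price_per_node_hour <= 0 or hours <= 0:
        return 0
    # Number of nodes the remaining budget can pay for over the planned duration.
    affordable = int(remaining_budget // (price_per_node_hour * hours))
    return max(0, min(shortfall, affordable, max_cloud_nodes))


# 100 nodes short, but the budget covers only 62 nodes for 4 hours at 2.0 per node-hour.
print(nodes_to_burst(120, 20, 2.0, 4.0, 500.0, 1000))  # 62
```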
11. Topology Aware
Chart: average runtimes before and after, ~45% faster (actual customer-reported results)
• Inter-node & intra-node placement
• Switches, clusters, and NUMA
• All networks
• InfiniBand, Ethernet, custom
• Dynamic (runtime changeable)
• Support for all popular topologies
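Inter-node placement can be illustrated with a small sketch that keeps a job's nodes under a single switch. The best-fit rule, the function name, and the node and switch names are assumptions for illustration; the actual scheduler's placement logic is not described in the slides.

```python
from typing import Optional


def place_on_one_switch(free_nodes: dict[str, list[str]],
                        needed: int) -> Optional[list[str]]:
    """Pick `needed` nodes that share one switch, or None if no switch has enough.

    Among the switches that fit, the one with the fewest free nodes is chosen
    (best fit), which leaves larger groups intact for bigger jobs.
    """
    fitting = [(len(nodes), sw) for sw, nodes in free_nodes.items()
               if len(nodes) >= needed]
    if not fitting:
        return None
    _, switch = min(fitting)
    return free_nodes[switch][:needed]


free = {"sw1": ["n1", "n2", "n3", "n4"], "sw2": ["n5", "n6"]}
print(place_on_one_switch(free, 2))  # ['n5', 'n6']
```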
12. Energy Aware
Chart: DoD HPCMP yearly savings (estimate)
• Eliminate energy waste with no loss in service
• turn off idle machines and backfill holes
• A/C savings by scheduling work onto cooler nodes
• Power capping: power_budget=0.5MW
• fit more hardware into smaller datacenters
• run in degraded mode during power emergencies
• Per-job power profiles: power=600W
• Power saving mode: off, standby, …
• Power ramping: slow up/down
• Energy accounting: energy=64.2kWh
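Power capping combines a site-wide budget with per-job power requests. A minimal sketch using the figures from this slide (a 0.5 MW budget and 600 W jobs) follows. The greedy admission rule and the function name are assumptions for illustration, not the PBS scheduler's algorithm.

```python
def admit_jobs(job_power_w: list[int], budget_w: int) -> list[int]:
    """Greedily admit jobs in order while total requested power stays within budget.

    Returns the indices of the admitted jobs.
    """
    admitted, used = [], 0
    for i, p in enumerate(job_power_w):
        if used + p <= budget_w:
            admitted.append(i)
            used += p
    return admitted


budget_w = 500_000   # power_budget=0.5MW
jobs = [600] * 1000  # one thousand jobs, each requesting power=600W
print(len(admit_jobs(jobs, budget_w)))  # 833, since 833 * 600 W = 499,800 W
```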
13. NVIDIA DCGM Ready
• Pre-job node risk identification and GPU resource allocation
• Automated monitoring of node health
• Reduced job terminations due to GPU failures
• Increased system resilience via intelligent routing decisions
• Increased job throughput via topology optimization
• Optimized job scheduling through GPU load and health monitoring
14. Burst Buffer Ready
• Stage / Cache data between an application computation and the PFS
• Use as private scratch on compute nodes
• Out-of-core memory
• Shared Storage, provides multiple jobs the same access to data
• Shared inputs
• Ensemble analysis
• In-transit analysis
• Compute Node Swap
• over-commit compute node memory.
• Job script support
• Native client integrations through hooks
15. ARM64
• Fujitsu Post-K supercomputer will be powered by 64-bit Arm processors
• HPE and Sandia National Laboratories: Arm-based Astra supercomputer
• Fast-evolving ecosystem
• Support for ARMv8 in PBS Pro starting with v18
16. Allocation Management
• Supports compute, storage and budget ($)
• Manages grants, quotas, budgets, limits, etc.
• Implements charge-back business logic
• Includes reporting tools
• PBS Pro add-on module
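Charge-back logic of the kind listed above can be sketched as debiting a project budget by core-hours at a fixed rate. The class, the rate, and the refusal rule are hypothetical and do not describe the add-on module's API.

```python
class Allocation:
    """A project budget in currency units, debited per core-hour (illustrative)."""

    def __init__(self, budget: float, rate_per_core_hour: float):
        self.budget = budget
        self.rate = rate_per_core_hour

    def cost(self, ncpus: int, hours: float) -> float:
        """Cost of a job that uses `ncpus` cores for `hours` hours."""
        return ncpus * hours * self.rate

    def charge(self, ncpus: int, hours: float) -> bool:
        """Debit the job's cost; refuse (return False) if the budget is insufficient."""
        c = self.cost(ncpus, hours)
        if c > self.budget:
            return False
        self.budget -= c
        return True


alloc = Allocation(budget=100.0, rate_per_core_hour=0.5)
print(alloc.charge(10, 10), alloc.budget)  # True 50.0
print(alloc.charge(20, 10), alloc.budget)  # False 50.0
```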
17. Flexi Reservations
• Resource Reservation
• SLA
• Predictable workloads – e.g., weather models
• Standing reservations
• Allow reservations to start early or run over schedule
18. Throughput Mode
• Scheduler can run asynchronously
• doesn’t wait for each job to be accepted by MoM
• 10,000 jobs per minute
• Add-on hierarchical scheduler
• Handles small, short-job workloads
• Deploys per-user/project or site-wide
• Automatically adjusts to demand
• Built-in fairshare and limits
• Scales to millions of jobs
19. Auto Health check
• Handling failures at scale
• Degraded Hardware health
• Mean time between failures of hardware components
• Improve Productivity
• Job failures prevented
• Improved throughput
• Improve admin productivity
• Offline nodes with possible causes
• Notifications
20. Automations
• HPC and High Throughput Workflows
• Directed acyclic graphs
• Expressed as Job Dependencies between two or more jobs
• Specifying the order in which jobs in a set should execute
• Requesting a job run only if an error occurs in another job
• Holding jobs until a particular job starts or completes execution
• Cylc
• Open Source project founded by NIWA
• Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore and more
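Job dependencies form a directed acyclic graph, so a valid execution order is a topological sort of that graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the job names are hypothetical. In PBS such dependencies are typically declared at submission time with qsub -W depend=..., which this sketch does not invoke.

```python
from graphlib import TopologicalSorter

# Map each job to the set of jobs that must complete before it may run.
deps = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "postprocess": {"simulate"},
    "archive": {"simulate"},
}

# static_order() yields every job after all of its prerequisites.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The relative order of "postprocess" and "archive" is not fixed, because neither depends on the other.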
21. Reclaim Resources
• Releasing Unneeded Vnodes from Your Job
• User level: -W release_nodes_on_stageout=true
• Admin: pbs_release_nodes
• Shrink-to-fit jobs
• Jobs that are internally checkpointed.
• Jobs using periodic PBS checkpointing
• Jobs whose real running time might be much less than the expected time
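A shrink-to-fit job gives a walltime range rather than a single value, and the scheduler grants a walltime inside that range that fits the available slot. A minimal sketch of that selection follows, assuming the range is given as minimum and maximum seconds. The function name is hypothetical, and the real scheduler takes more into account than this.

```python
from typing import Optional


def shrink_to_fit(min_walltime_s: int, max_walltime_s: int,
                  slot_s: int) -> Optional[int]:
    """Return the walltime to grant for a free slot of `slot_s` seconds.

    The job gets as much time as the slot allows, capped at its maximum.
    It cannot run in this slot (None) if the slot is shorter than its minimum.
    """
    if min_walltime_s > max_walltime_s:
        raise ValueError("min_walltime must not exceed max_walltime")
    if slot_s < min_walltime_s:
        return None
    return min(slot_s, max_walltime_s)


print(shrink_to_fit(3600, 36000, 7200))  # 7200: the job shrinks to the 2-hour slot
print(shrink_to_fit(3600, 36000, 1800))  # None: the slot is below the minimum
```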
22. Usability
• Manage, Monitor and Measure
• Backward compatibility
• Behaves as a platform
• REST web service
• Data exchange formats for upstream processing and integrations
• Unlimited feature extensions
Editor's Notes
Mana – 9216 cores
Harold – 500 nodes
AbUtil – 80 nodes
Overall Benefits:
Eliminate waste with no loss in service (as we turn off idle machines and backfill holes)
A/C savings by scheduling work onto cooler nodes
Power capping means you can fit more hardware into smaller datacenters (provision only for used power, not peak power)
Power capping can also be used to run in degraded mode during power emergencies / disasters
Measure, report, charge-back power use
Note: not running a job twice (because PBS mitigates system failures) is also very green
Staging copies files from the PFS to the Burst Buffer for execution and then stages the data out
Cache moves data implicitly (read-ahead and write-behind); useful for the following
Checkpoint/Restart
Periodic output
Application libraries
Open Source project founded by NIWA, New Zealand
Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, …