20 Altair PBS Professional Features in 20 minutes, 2018

PBS Professional
20 Features in 20 Minutes

USABILITY | RECLAIM RESOURCES | AUTOMATIONS | AUTO HEALTH CHECK | ASYNC THROUGHPUT
FLEXI RESERVATIONS | ALLOCATION MGMT | ARM64 READY | BURST BUFFER READY | NVIDIA DCGM READY
HIGH AVAILABILITY | OS PROVISIONING | MULTI SCHEDULER | DYNAMIC RESOURCES | HOOKS
ENERGY AWARE | TOPOLOGY AWARE | CLOUD BURSTING | CONTAINERS | CGROUPS

Hooks
• PBS Plugin (“Hooks”) Framework
• Unified data model built on industry-standard Python
• Augment core capabilities on-the-fly
• No recompiling needed, which protects PBS Pro core stability
• Hook events at major state transition points
• Use cases
  • Routing jobs
  • Managing job resource requests
  • Managing access to resources for users and jobs
  • Ensuring efficient use of resources
  • Ensuring that jobs run properly
  • Converting requests to usable format
  • Controlling interactive jobs
  • Communicating information to users
  • Helping to schedule jobs
  • Managing user activity
  • Enabling accounting and validation
  • Allocation management
  • Helping manage job execution
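
As an illustration of the "routing jobs" use case, here is a minimal sketch of a queuejob hook that picks a destination queue from the requested walltime. The queue names "short" and "long" and the one-hour threshold are hypothetical site policy, not part of the original slides. The routing decision is kept in a plain function so it can be exercised outside PBS; the `pbs` module is only importable inside the PBS hook interpreter, so the wiring at the bottom is skipped anywhere else.

```python
# Sketch of a queuejob hook that routes jobs by requested walltime.
# Queue names and the threshold below are hypothetical site policy.

SHORT_LIMIT_SECONDS = 3600  # assumed cut-off between "short" and "long" jobs


def choose_queue(walltime_seconds):
    """Return the name of the (hypothetical) destination queue."""
    if walltime_seconds is None:
        # No walltime requested: treat the job as long-running.
        return "long"
    if walltime_seconds <= SHORT_LIMIT_SECONDS:
        return "short"
    return "long"


try:
    import pbs  # provided by PBS only when the script runs as a hook
except ImportError:
    pbs = None

if pbs is not None:
    event = pbs.event()
    job = event.job
    walltime = job.Resource_List["walltime"]  # None if not requested
    seconds = int(walltime) if walltime is not None else None
    # Redirect the job to the chosen queue, then let submission proceed.
    job.queue = pbs.server().queue(choose_queue(seconds))
    event.accept()
```

Keeping the policy in `choose_queue` means a site can unit-test routing rules without a running PBS server.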

Dynamic Resources
• Represent elements that are outside of the control of PBS
• Modular
• Scalable
• Rich rules with hooks
• License as a resource
• Global license managers
• Storage
• User quotas
• Scratch spaces on nodes
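
A dynamic resource is typically backed by a small site-written script that reports the current value of something PBS does not control. The sketch below reports free scratch space as a single size value on standard output. It assumes the scheduler is configured to run the script and read that one line; the choice of the system temporary directory as the "scratch" location is a stand-in for a real site filesystem.

```python
import shutil
import tempfile


def free_space_kb(path):
    """Free space under `path`, in whole kilobytes."""
    return shutil.disk_usage(path).free // 1024


def format_pbs_size(kilobytes):
    """Render a size with a kb suffix, e.g. 2048 -> '2048kb'."""
    return f"{kilobytes}kb"


if __name__ == "__main__":
    # The temp directory stands in for a site scratch filesystem (assumption).
    scratch_path = tempfile.gettempdir()
    print(format_pbs_size(free_space_kb(scratch_path)))
```

The same pattern would apply to license counts: replace `free_space_kb` with a query to the license manager and print the number of available tokens.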

Multi-scheduler
[Diagram: PBS Pro with FIFO, job sort formula, and fairshare scheduling policies]
• Run multiple scheduling engines within the same PBS Complex
• Heterogeneous user groups and workloads
• Load balancing
• Testing and staging

OS Provisioning
• Operating System as a Resource
• Integrate with third-party OS provisioning tools
• Provisioning / Orchestration – Bare metal
• Install the required operating system or application on bare metal
• Post-install automation support
• Multi-boot systems
• Workstation grids
[Diagram labels: PBS Pro; WRF, LS-DYNA, OpenFOAM; provisioning tools]

High Availability
• Built-in high availability
• No third-party software required
• All critical services moved in real time
• No loss of service availability
• Transparent
• Notifications
• Full-featured management tools
• Maintain quorum
• Interventions and servicing

Cgroups
• Ensures jobs have access to requested resources
• Can restrict resources for PBS jobs, preventing OOM conditions
• Ensures accurate resource accounting
• Enforces resource limits at the kernel level instead of relying on MoM polling for usage
• Consistent job runtime

Containers
• Lightweight virtualized environment for traditional HPC apps
• Number of containers that can be run on a host
• Time to launch a container
• All the goodies of containers (app maintenance)
• Conflicting requirements for applications (e.g., an app can run only on CentOS 6, or needs an older library)
• Ease of packaging applications into their own “containers” with all dependencies included
• Natural extension to cgroups and cpusets
  • Resource constraints, CPU pinning, etc.

Cloud bursting
[Diagram labels: PBS Works; Microsoft Azure, Amazon Web Services, GCP, Oracle]
• On-demand use of cloud resources to maximize efficiency
• Improve responsiveness, adding capacity exactly when needed
• Automatic governance and cost controls via site-defined policy and quotas
• Understands on-premises utilization, ensuring bursting only when cost-efficient
• Vendor-agnostic: no lock-in
• Fast: 1,000+ nodes in minutes

Topology Aware
[Chart: average runtimes before and after; about 45% faster, per actual customer-reported results]
• Inter-node & intra-node placement
• Switches, clusters, and NUMA
• All networks
• InfiniBand, Ethernet, custom
• Dynamic (runtime changeable)
• Support for all popular topologies

Energy Aware
[Chart: DoD HPCMP yearly savings (estimate)]
• Eliminate energy waste with no loss in service
  • Turn off idle machines and backfill holes
  • A/C savings by scheduling work onto cooler nodes
• Power capping: power_budget=0.5MW
  • Fit more hardware into smaller datacenters
  • Run in degraded mode during power emergencies
• Per-job power profiles: power=600W
• Power saving mode: off, standby, …
• Power ramping: slow up/down
• Energy accounting: energy=64.2kWh

NVIDIA DCGM Ready
• Pre-job node risk identification and GPU resource allocation
• Automated monitoring of node health
• Reduced job terminations due to GPU failures
• Increased system resilience via intelligent routing decisions
• Increased job throughput via topology optimization
• Optimized job scheduling through GPU load and health monitoring

Burst Buffer Ready
• Stage / cache data between an application's computation and the PFS
• Use as private scratch on compute nodes
• Out-of-core memory
• Shared storage: gives multiple jobs the same access to data
• Shared inputs
• Ensemble analysis
• In-transit analysis
• Compute node swap
  • Over-commit compute node memory
• Job script support
• Native client integrations through hooks

ARM64
• Fujitsu Post-K supercomputer will be powered by 64-bit Arm processors
• HPE and Sandia National Laboratories: Arm-based Astra supercomputer
• Fast-evolving ecosystem
• Support for ARMv8 in PBS Pro starting with v18

Allocation Management
• Supports compute, storage and budget ($)
• Manages grants, quotas, budgets, limits, etc.
• Implements charge-back business logic
• Includes reporting tools
• PBS Pro add-on module

Flexi Reservations
• Resource Reservation
• SLA
• Predictable workloads, e.g., weather models
• Standing reservations
• Allow reservations to start early or run over schedule

Throughput Mode
• Scheduler can run asynchronously
  • Doesn’t wait for each job to be accepted by MoM
• 10,000 jobs per minute
• Add-on hierarchical scheduler
• Handles small, short-job workloads
• Deploys per-user/project or site-wide
• Automatically adjusts to demand
• Built-in fairshare and limits
• Scales to millions of jobs

Auto Health check
• Handling failures at scale
• Degraded hardware health
• Mean time between failures of hardware components
• Improve Productivity
• Job failures prevented
• Improved throughput
• Improve admin productivity
• Offline nodes with possible causes
• Notifications

Automations
• HPC and High Throughput Workflows
• Directed acyclic graphs
• Expressed as job dependencies between two or more jobs
• Specifying the order in which jobs in a set should execute
• Requesting a job run only if an error occurs in another job
• Holding jobs until a particular job starts or completes execution
• Cylc
• Open Source project founded by NIWA
• Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore and more
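
The job dependencies listed above map onto the `-W depend=` option of `qsub`. The sketch below only builds the command-line argument lists (it does not submit anything), covering ordered execution (`afterok`), running only when another job fails (`afternotok`), and waiting until a job starts (`after`) or finishes in any state (`afterany`). The helper name, script names, and job IDs such as `101.pbs` are hypothetical.

```python
# Dependency types covered by this sketch; PBS defines others as well.
SUPPORTED_TYPES = {"after", "afterok", "afternotok", "afterany"}


def qsub_dependency_args(script, depend_type, on_jobs):
    """Build a qsub argument list for a job that depends on other jobs."""
    if depend_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported dependency type: {depend_type}")
    if not on_jobs:
        raise ValueError("at least one job ID is required")
    # Multiple job IDs are joined with ':' after the dependency type.
    dependency = f"depend={depend_type}:" + ":".join(on_jobs)
    return ["qsub", "-W", dependency, script]


if __name__ == "__main__":
    # Run post-processing only after job 101.pbs succeeds.
    print(qsub_dependency_args("post.sh", "afterok", ["101.pbs"]))
    # Run a cleanup job only if job 101.pbs fails.
    print(qsub_dependency_args("cleanup.sh", "afternotok", ["101.pbs"]))
```

The returned list can be passed to `subprocess.run` on a host where `qsub` is installed; chaining the job ID printed by each submission into the next call yields a simple DAG.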

Reclaim Resources
• Releasing Unneeded Vnodes from Your Job
• User level: -W release_nodes_on_stageout=true
• Admin: pbs_release_nodes
• Shrink to fit Jobs
• Jobs that are internally checkpointed
• Jobs using periodic PBS checkpointing
• Jobs whose real running time might be much less than the expected time
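
A rough sketch of how the two mechanisms above could be expressed as command lines. It builds argument lists only and submits nothing. The `-W release_nodes_on_stageout=true` option and the `pbs_release_nodes` command come from the slide; the `min_walltime` and `max_walltime` resource names for shrink-to-fit jobs and the `-j`/`-a` options are stated here as assumptions based on the PBS Professional documentation. Helper names, script names, job IDs, and vnode names are hypothetical.

```python
def shrink_to_fit_args(script, min_walltime, max_walltime=None,
                       release_on_stageout=False):
    """Build qsub arguments for a shrink-to-fit job (walltime range)."""
    args = ["qsub", "-l", f"min_walltime={min_walltime}"]
    if max_walltime is not None:
        args += ["-l", f"max_walltime={max_walltime}"]
    if release_on_stageout:
        # Give back sister vnodes once the job reaches stage-out.
        args += ["-W", "release_nodes_on_stageout=true"]
    args.append(script)
    return args


def release_nodes_args(job_id, vnodes=None):
    """Build pbs_release_nodes arguments for specific vnodes, or all of them."""
    args = ["pbs_release_nodes", "-j", job_id]
    # With no vnodes given, "-a" asks for all releasable vnodes (assumption).
    args += list(vnodes) if vnodes else ["-a"]
    return args


if __name__ == "__main__":
    print(shrink_to_fit_args("sim.sh", "00:30:00", "04:00:00"))
    print(release_nodes_args("101.pbs", ["node07", "node08"]))
```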

Usability
• Manage, Monitor and Measure
• Backward compatibility
• Behaves as a platform
• REST web service
• Data exchange formats for upstream processing and integrations
• Unlimited feature extensions
