Altair Confidential 1
PBS Professional
20 Features in 20 Minutes
Usability, Reclaim Resources, Automations, Auto Health Check, Async Throughput
Flexi Reservations, Allocation Mgmt, ARM64 Ready, Burst Buffer Ready, NVIDIA DCGM Ready
High Availability, OS Provisioning, Multi-Scheduler, Dynamic Resources, Hooks
Energy Aware, Topology Aware, Cloud Bursting, Containers, Cgroups
Hooks
• PBS Plugin (“Hooks”) Framework
• Unified data model built on industry-standard Python
• Augment core capabilities on-the-fly
• No recompiling, which preserves PBS Pro core stability
• Hook events at major state transition points
• Use cases
• Routing jobs
• Managing job resource requests
• Managing access to resources for users and jobs
• Ensuring efficient use of resources
• Ensuring that jobs run properly
• Converting requests to usable format
• Controlling interactive jobs
• Communicating information to users
• Helping to schedule jobs
• Managing user activity
• Enabling accounting and validation
• Allocation management
• Helping manage job execution
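As a rough illustration of the "managing job resource requests" use case, the sketch below shows what a small job-submission hook might look like. The policy function `validate_request` and the 64-CPU limit are hypothetical. The `pbs.event()`, `Resource_List`, `reject()`, and `accept()` calls follow the PBS hook interface; the `pbs` module is only importable inside the PBS hook interpreter, so the import is guarded here.

```python
def validate_request(resources, max_ncpus=64):
    """Return an error message for a bad request, or None if it is acceptable.

    `resources` is a plain mapping of resource name to requested value.
    The 64-CPU ceiling is a made-up site policy for illustration.
    """
    if resources.get("walltime") is None:
        return "Please request a walltime, e.g. -l walltime=01:00:00"
    ncpus = resources.get("ncpus")
    if ncpus is not None and int(ncpus) > max_ncpus:
        return "ncpus=%s exceeds the site limit of %d" % (ncpus, max_ncpus)
    return None


try:
    import pbs  # available only when run by PBS as a hook
except ImportError:
    pbs = None

if pbs is not None:
    event = pbs.event()
    requested = event.job.Resource_List
    problem = validate_request(
        {"walltime": requested["walltime"], "ncpus": requested["ncpus"]}
    )
    if problem:
        event.reject(problem)  # the message is returned to the submitting user
    else:
        event.accept()
```

Keeping the policy in a plain function means it can be exercised outside PBS before the hook is imported into the server.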
Dynamic Resources
• Represent elements that are outside of the control of PBS
• Modular
• Scalable
• Rich rules with hooks
• License as a resource
• Global license managers
• Storage
• User quotas
• Scratch spaces on nodes
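One common way to expose "license as a resource" is a small site script that prints the currently available amount on a single line, which the scheduler then reads. The sketch below shows only the core of such a script. The function name `free_licenses` and the FlexNet-style status line it parses are assumptions for illustration, not something the slides specify.

```python
import re

# Assumed FlexNet-style status line; adjust the pattern to your license manager.
_PATTERN = re.compile(
    r"Total of (\d+) licenses? issued;\s*Total of (\d+) licenses? in use"
)


def free_licenses(status_text):
    """Return issued minus in-use licenses, or 0 if the text is not recognized."""
    match = _PATTERN.search(status_text)
    if match is None:
        return 0  # report nothing available rather than guess
    issued, in_use = (int(group) for group in match.groups())
    return max(issued - in_use, 0)


sample = "Users of solver:  (Total of 10 licenses issued;  Total of 3 licenses in use)"
# A dynamic-resource script would print just the number on one line.
print(free_licenses(sample))
```

Returning 0 on unrecognized input is a deliberately conservative choice: jobs wait instead of starting without a license.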
Multi-scheduler
[Figure: PBS Pro with multiple scheduler policies: FIFO, job sort formula, fairshare]
• Run multiple scheduling engines within the same PBS Complex
• Heterogeneous user groups and workloads
• Load balancing
• Testing and staging
OS Provisioning
• Operating System as a Resource
• Integrate with third-party OS provisioning tools
• Provisioning / Orchestration – Bare metal
• Install the required operating system or application on bare metal
• Post-install automation support
• Multi-boot systems
• Workstation grids
[Figure: PBS Pro and provisioning tools serving WRF, LS-DYNA, and OpenFOAM]
High Availability
• Built-in high availability
• No third-party software required
• All critical services moved in real time
• No loss of service availability
• Transparent
• Notifications
• Full feature manageability tools
• Maintain quorum
• Interventions and servicing
Cgroups
• Ensures jobs have access to requested resources
• Can restrict resources for PBS jobs, preventing OOM conditions
• Ensures accurate resource accounting
• Provides resource enforcement at kernel level instead of the MoM polling for usage
• Consistent job runtime
Containers
• Lightweight virtualized environment for traditional HPC apps
• Number of containers that can be run on a host
• Time to launch a container
• All the benefits of containers (app maintenance)
• Conflicting requirements for applications (e.g., an app can run only on CentOS 6, or needs an older library)
• Ease of packaging applications into their own “containers” with all dependencies included
• Natural extension to cgroups and cpusets
• resource constraining, CPU pinning, etc.
Cloud bursting
[Figure: PBS Works bursting to Microsoft Azure, Amazon Web Services, GCP, and Oracle]
• On-demand use of cloud resources to maximize efficiency
• Improve responsiveness, adding capacity exactly when needed
• Automatic governance and cost controls via site-defined policy and quotas
• Understands on-premises utilization, ensuring bursting only when cost-efficient
• Vendor-agnostic: no lock-in
• Fast: 1,000+ nodes in minutes
Topology Aware
[Figure: average runtimes before and after, about 45% faster (actual customer-reported results)]
• Inter-node & intra-node placement
• Switches, clusters, and NUMA
• All networks
• InfiniBand, Ethernet, custom
• Dynamic (runtime changeable)
• Support for all popular topologies
Energy Aware
[Figure: DoD HPCMP yearly savings (estimate)]
• Eliminate energy waste with no loss in service
• turn off idle machines and backfill holes
• A/C savings by scheduling work onto cooler nodes
• Power capping: power_budget=0.5MW
• fit more hardware into smaller datacenters
• run in degraded mode during power emergencies
• Per-job power profiles: power=600W
• Power saving mode: off, standby, …
• Power ramping: slow up/down
• Energy accounting: energy=64.2kWh
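The arithmetic behind power capping and energy accounting can be sketched as follows. This is a simplified illustration and not the scheduler's actual algorithm: `admit_jobs` and `energy_kwh` are hypothetical helper names, the 1,000-job queue is invented, and the 107-hour runtime is an assumed figure chosen only because it reproduces the 64.2 kWh value on the slide.

```python
def admit_jobs(job_watts, budget_watts):
    """Greedily admit jobs in queue order while total power stays within budget.

    Returns the indices of admitted jobs and the power they consume.
    """
    admitted, used = [], 0.0
    for index, watts in enumerate(job_watts):
        if used + watts <= budget_watts:
            admitted.append(index)
            used += watts
    return admitted, used


def energy_kwh(watts, hours):
    """Energy drawn by a constant load, in kilowatt-hours."""
    return watts * hours / 1000.0


budget = 0.5e6           # power_budget=0.5MW, expressed in watts
queue = [600.0] * 1000   # a hypothetical queue of jobs with power=600W each
admitted, used = admit_jobs(queue, budget)
print(len(admitted), used)       # jobs that fit under the cap, watts in use
print(energy_kwh(600.0, 107.0))  # energy for one 600 W job over 107 hours
```

Under these assumptions 833 of the 1,000 jobs fit under the cap, since 833 × 600 W = 499,800 W and one more job would exceed 500,000 W.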
NVIDIA DCGM Ready
• Pre-job node risk identification and GPU resource allocation
• Automated monitoring of node health
• Reduced job terminations due to GPU failures
• Increased system resilience via intelligent routing decisions
• Increased job throughput via topology optimization
• Optimized job scheduling through GPU load and health monitoring
Burst Buffer Ready
• Stage / cache data between an application's computation and the parallel file system (PFS)
• Use as private scratch on compute nodes
• Out-of-core memory
• Shared storage, giving multiple jobs the same access to data
• Shared inputs
• Ensemble analysis
• In-transit analysis
• Compute node swap
• Over-commit compute node memory
• Job script support
• Native client integrations through hooks
ARM64
• Fujitsu's Post-K supercomputer will be powered by 64-bit Arm processors
• HPE and Sandia National Laboratories: Arm-based Astra supercomputer
• Fast-evolving ecosystem
• Support for ARMv8 in PBS Pro starting with v18
Allocation Management
• Supports compute, storage and budget ($)
• Manages grants, quotas, budgets, limits, etc.
• Implements charge-back business logic
• Includes reporting tools
• PBS Pro add-on module
Flexi Reservations
• Resource Reservation
• SLA
• Predictable workloads, e.g., weather models
• Standing reservations
• Allow reservations to start early or run over schedule
Throughput Mode
• Scheduler can run asynchronously
• Doesn't wait for each job to be accepted by MoM
• 10,000 jobs per minute
• Add-on hierarchical scheduler
• Handles small, short-job workloads
• Deploys per-user/project or site-wide
• Automatically adjusts to demand
• Built-in fairshare and limits
• Scales to millions of jobs
Auto Health check
• Handling failures at scale
• Degraded hardware health
• Mean time between failures of hardware components
• Improve Productivity
• Job failures prevented
• Improved throughput
• Improve admin productivity
• Offline nodes with possible causes
• Notifications
Automations
• HPC and High Throughput Workflows
• Directed acyclic graphs
• Expressed as Job Dependencies between two or more jobs
• Specifying the order in which jobs in a set should execute
• Requesting a job run only if an error occurs in another job
• Holding jobs until a particular job starts or completes execution
• Cylc
• Open Source project founded by NIWA
• Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore and more
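The job-dependency bullets above map onto the `depend` attribute of `qsub`, for example `-W depend=afterok:<job id>` to run only after another job succeeds, `afternotok` to run only if it fails, and `after` to wait until it has started. The sketch below plans the submission order for a small directed acyclic graph and builds the matching command lines without running them. The function `plan_submissions`, the script names, and the `$ID_<name>` placeholders (standing in for job IDs returned by earlier submissions) are hypothetical. It relies on `graphlib`, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def plan_submissions(deps):
    """Return (job, qsub argument list) pairs in a valid submission order.

    `deps` maps a job name to a list of (dependency_type, parent_job) pairs.
    """
    graph = {job: {parent for _, parent in edges} for job, edges in deps.items()}
    plan = []
    for job in TopologicalSorter(graph).static_order():
        by_type = {}
        for dep_type, parent in deps.get(job, []):
            # Placeholder for the job ID that qsub printed for the parent.
            by_type.setdefault(dep_type, []).append("$ID_" + parent)
        command = ["qsub"]
        if by_type:
            spec = ",".join(
                dep_type + ":" + ":".join(ids) for dep_type, ids in by_type.items()
            )
            command += ["-W", "depend=" + spec]
        command.append(job + ".sh")
        plan.append((job, command))
    return plan


workflow = {
    "prep": [],
    "solve": [("afterok", "prep")],
    "post": [("afterok", "solve")],
    "recover": [("afternotok", "solve")],  # runs only if solve fails
}
for name, command in plan_submissions(workflow):
    print(name, " ".join(command))
```

Parents are always submitted before their children, so each placeholder can be filled in by the time the dependent job is submitted.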
Reclaim Resources
• Releasing Unneeded Vnodes from Your Job
• User level: qsub -W release_nodes_on_stageout=true
• Admin: pbs_release_nodes
• Shrink-to-fit jobs
• Jobs that are internally checkpointed
• Jobs using periodic PBS checkpointing
• Jobs whose real running time might be much less than the expected time
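A shrink-to-fit job asks for a walltime range rather than a single value (in PBS Professional, through the `min_walltime` and `max_walltime` resources), and the scheduler chooses a walltime within that range that fits the available slot. The function below is a simplified illustration of that choice, not the scheduler's actual algorithm; `fit_walltime` and its arguments, all in seconds, are hypothetical names.

```python
def fit_walltime(min_walltime, max_walltime, free_seconds):
    """Pick a walltime for a shrink-to-fit job, or None if the slot is too short.

    The job gets as much of the free slot as it can use, capped at its maximum.
    """
    if min_walltime > max_walltime:
        raise ValueError("min_walltime must not exceed max_walltime")
    if free_seconds < min_walltime:
        return None  # the job cannot run in this slot at all
    return min(max_walltime, free_seconds)


# A job that needs at least 1 hour and at most 10 hours, offered a 2-hour slot.
print(fit_walltime(3600, 36000, 7200))
```

This is why the job types listed above are good candidates: a checkpointed job can make useful progress in whatever slot it is given and resume later.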
Usability
• Manage, Monitor and Measure
• Backward compatibility
• Behaves as a platform
• REST web service
• Data exchange formats for upstream processing and integrations
• Unlimited feature extensions
20 Altair PBS Professional Features in 20 minutes, 2018

  • 1. Altair Confidential 1 PBS Professional 20 Features in 20 Minutes
  • 2. Altair Confidential 2 USABILITYRECLAIM RESOURCESAUTOMATIONSAUTO HEALTH CHECKASYNC THROUGHPUT FLEXI RESERVATIONSALLOCATION MGMTARM64 READYBURST BUFFER READYNVIDIA DGCM READY HIGH AVAILABILITYOS PROVISIONINGMULTI SCHEDULERDYNAMIC RESOURCESHOOKS ENERGY AWARETOPOLOGY AWARECLOUD BURSTINGCONTAINERSCGROUPS >
  • 3. Hooks
      • PBS Plugin (“Hooks”) Framework
      • Unified data model built on industry-standard Python
      • Augment core capabilities on-the-fly
      • No re-compiling: PBS Pro core stability
      • Hook events at major state transition points
      • Use cases
          • Routing jobs
          • Managing job resource requests
          • Managing access to resources for users and jobs
          • Ensuring efficient use of resources
          • Ensuring that jobs run properly
          • Converting requests to usable format
          • Controlling interactive jobs
          • Communicating information to users
          • Helping to schedule jobs
          • Managing user activity
          • Enabling accounting and validation
          • Allocation management
          • Helping manage job execution
  • 4. Dynamic Resources
      • Represent elements that are outside of the control of PBS
      • Modular
      • Scalable
      • Rich rules with hooks
      • License as a resource
          • Global license managers
      • Storage
          • User quotas
          • Scratch spaces on nodes
  • 5. Multi-scheduler
      • Diagram labels: PBS Pro; FIFO, job sort formula, fairshare
      • Run multiple scheduling engines within the same PBS Complex
      • Heterogeneous user groups and workloads
      • Load balancing
      • Testing and staging
  • 6. OS Provisioning
      • Operating system as a resource
      • Integrate with third-party OS provisioning tools
      • Provisioning / Orchestration – Bare metal
          • Install required operating system or application on bare metal
          • Post-install automation support
      • Multi-boot systems
      • Workstation grids
      • Diagram labels: PBS Pro; WRF, LS-DYNA, OpenFOAM; provisioning tools
  • 7. High Availability
      • High availability built in
      • No third-party software required
      • All critical services moved in real time
      • No loss of service availability
      • Transparent
      • Notifications
      • Full-feature manageability tools
          • Maintain quorum
          • Interventions and servicing
  • 8. Cgroups
      • Ensures jobs have access to requested resources
      • Can restrict resources for PBS jobs, preventing OOM conditions
      • Ensures accurate resource accounting
      • Provides resource enforcement at kernel level instead of the MoM polling for usage
      • Consistent job runtime
  • 9. Containers
      • Lightweight virtualized environment for traditional HPC apps
          • Number of containers that can be run on a host
          • Time to launch a container
      • All the goodies of containers (app maintenance)
          • Conflicting requirements for applications (e.g., app can run only on CentOS 6, or needs an older library)
          • Ease of packaging applications into their own “containers” with all dependencies included
      • Natural extension to cgroups and cpusets
          • Resource constraining, CPU pinning, etc.
  • 10. Cloud bursting
      • Diagram labels: Microsoft Azure, Amazon Web Services, GCP, Oracle, PBS Works
      • On-demand use of cloud resources to maximize efficiency
      • Improve responsiveness, adding capacity exactly when needed
      • Automatic governance and cost controls via site-defined policy and quotas
      • Understands on-premise utilization, ensuring bursting only when cost-efficient
      • Vendor-agnostic: no lock-in
      • Fast: 1,000+ nodes in minutes
  • 11. Topology Aware
      • Chart (before vs. after): average runtimes ~45% faster (actual customer-reported results)
      • Inter-node & intra-node placement
          • Switches, clusters, and NUMA
      • All networks
          • InfiniBand, Ethernet, custom
      • Dynamic (runtime changeable)
      • Support for all popular topologies
  • 12. Energy Aware
      • Chart: DoD HPCMP yearly savings (estimate)
      • Eliminate energy waste with no loss in service
          • Turn off idle machines and backfill holes
          • A/C savings by scheduling work onto cooler nodes
      • Power capping: power_budget=0.5MW
          • Fit more hardware into smaller datacenters
          • Run in degraded mode during power emergencies
      • Per-job power profiles: power=600W
      • Power saving mode: off, standby, …
      • Power ramping: slow up/down
      • Energy accounting: energy=64.2kWh
  • 13. NVIDIA DCGM Ready
      • Pre-job node risk identification and GPU resource allocation
      • Automated monitoring of node health
      • Reduced job terminations due to GPU failures
      • Increased system resilience via intelligent routing decisions
      • Increased job throughput via topology optimization
      • Optimized job scheduling through GPU load and health monitoring
  • 14. Burst Buffer Ready
      • Stage / cache data between an application computation and the PFS
      • Use as private scratch on compute nodes
          • Out-of-core memory
      • Shared storage, provides multiple jobs the same access to data
          • Shared inputs
          • Ensemble analysis
          • In-transit analysis
      • Compute node swap
          • Over-commit compute node memory
      • Job script support
      • Native client integrations through hooks
  • 15. ARM64
      • Fujitsu Post K supercomputer will be powered by 64-bit Arm processors
      • HPE / Sandia National Lab: ARM-based Astra supercomputer
      • Fast-evolving ecosystem
      • Support for ARMv8 in PBS Pro starting with v18
  • 16. Allocation Management
      • Supports compute, storage and budget ($)
      • Manages grants, quotas, budgets, limits, etc.
      • Implements charge-back business logic
      • Includes reporting tools
      • PBS Pro add-on module
  • 17. Flexi Reservations
      • Resource reservation
      • SLA
      • Predictable workloads, e.g., weather models
      • Standing reservations
      • Allow reservations to start early or run over schedule
  • 18. Throughput Mode
      • Scheduler can run asynchronously
          • Doesn’t wait for each job to be accepted by MoM
          • 10,000 jobs / minute
      • Add-on hierarchical scheduler
          • Handles small, short-job workloads
          • Deploys per-user/project or site-wide
          • Automatically adjusts to demand
          • Built-in fairshare and limits
          • Scales to millions of jobs
  • 19. Auto Health Check
      • Handling failures at scale
          • Degraded hardware health
          • Mean time between failures of hardware components
      • Improve productivity
          • Job failures prevented
          • Improved throughput
      • Improve admin productivity
          • Offline nodes with possible causes
          • Notifications
  • 20. Automations
      • HPC and high-throughput workflows
      • Directed acyclic graphs
          • Expressed as job dependencies between two or more jobs
          • Specifying the order in which jobs in a set should execute
          • Requesting a job run only if an error occurs in another job
          • Holding jobs until a particular job starts or completes execution
      • Cylc
          • Open source project founded by NIWA
          • Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore and more
  • 21. Reclaim Resources
      • Releasing unneeded vnodes from your job
          • User level: -W release_nodes_on_stageout=true
          • Admin: pbs_release_nodes
      • Shrink-to-fit jobs
          • Jobs that are internally checkpointed
          • Jobs using periodic PBS checkpointing
          • Jobs whose real running time might be much less than the expected time
  • 22. Usability
      • Manage, monitor and measure
      • Backward compatibility
      • Behaves as a platform
      • REST web service
      • Data exchange formats for upstream processing and integrations
      • Feature extensions: unlimited
  • 23. Feature overview (repeated): Usability, Reclaim Resources, Automations, Auto Health Check, Async Throughput, Flexi Reservations, Allocation Mgmt, ARM64 Ready, Burst Buffer Ready, NVIDIA DCGM Ready, High Availability, OS Provisioning, Multi Scheduler, Dynamic Resources, Hooks, Energy Aware, Topology Aware, Cloud Bursting, Containers, Cgroups
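
The Hooks slide (slide 3) lists "managing job resource requests" as a use case. A minimal sketch of what such a hook might look like follows, assuming it is installed for the job-submission (queuejob) event. `pbs.event()`, `event.job.Resource_List`, `event.accept()` and `event.reject()` are PBS hook calls as I understand them from the hooks documentation and should be checked against the PBS Professional Hooks Guide for your version. The 24-hour limit and the helper `walltime_verdict` are hypothetical and do not come from the slides.

```python
# Sketch of a queuejob hook that rejects jobs asking for too much walltime.
# MAX_WALLTIME_S and walltime_verdict are illustrative names, not PBS API.

MAX_WALLTIME_S = 24 * 3600  # hypothetical site limit: 24 hours


def walltime_verdict(requested_s, limit_s=MAX_WALLTIME_S):
    """Return (ok, message) for a requested walltime in seconds.

    A job with no walltime request (None) is let through unchanged.
    """
    if requested_s is None or requested_s <= limit_s:
        return True, ""
    return False, f"walltime {requested_s}s exceeds site limit of {limit_s}s"


try:
    import pbs  # only importable inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the decision logic be exercised outside PBS

if pbs is not None:
    event = pbs.event()
    requested = event.job.Resource_List["walltime"]
    # Assumption: the walltime value converts to seconds with int().
    ok, message = walltime_verdict(None if requested is None else int(requested))
    if ok:
        event.accept()
    else:
        event.reject(message)
```

Keeping the decision in a plain function means the policy can be unit-tested on a workstation, while the few lines that touch the `pbs` module run only inside the server.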
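
The Dynamic Resources slide (slide 4) mentions scratch space on nodes as something PBS does not control but can still schedule on. One common pattern, sketched here under assumptions, is a small site script that prints the current value on standard output for PBS to read. The function names and the use of the temporary directory as a stand-in scratch path are hypothetical. How the script is registered with the scheduler or MoM is site configuration and is not shown.

```python
import shutil
import tempfile


def free_scratch_gb(path):
    """Return free space under `path` in whole gigabytes (rounded down)."""
    return shutil.disk_usage(path).free // (1024 ** 3)


def report(path):
    """Format the value as an integer plus unit suffix, e.g. '120gb'."""
    return f"{free_scratch_gb(path)}gb"


if __name__ == "__main__":
    # Stand-in path; a real site would use its scratch mount point.
    print(report(tempfile.gettempdir()))
```

The script does one thing and prints one line, which keeps it cheap enough to call on every scheduling cycle.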
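
The Automations slide (slide 20) describes workflows as directed acyclic graphs expressed through job dependencies. The sketch below turns a small, made-up workflow into `qsub` argument lists using the `-W depend=` option, with `afterok` (run only after success), `afternotok` (run only after failure) and `afterany` as the dependency types. The job names, script names and the `<id:NAME>` placeholders are hypothetical. In real use each placeholder would be replaced by the job ID that `qsub` printed for the upstream job. Nothing is submitted here, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: job name -> list of (dependency type, upstream job).
WORKFLOW = {
    "preprocess": [],
    "simulate": [("afterok", "preprocess")],
    "postprocess": [("afterok", "simulate")],
    "cleanup_on_error": [("afternotok", "simulate")],
}


def submission_plan(workflow):
    """Return qsub argument lists in an order that respects dependencies."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {job: {up for _, up in deps} for job, deps in workflow.items()}
    plan = []
    for job in TopologicalSorter(graph).static_order():
        args = ["qsub"]
        deps = workflow[job]
        if deps:
            # <id:NAME> stands in for the job ID qsub printed for NAME.
            spec = ",".join(f"{kind}:<id:{up}>" for kind, up in deps)
            args += ["-W", f"depend={spec}"]
        args.append(f"{job}.sh")
        plan.append(args)
    return plan


for args in submission_plan(WORKFLOW):
    print(" ".join(args))
```

Submitting in topological order guarantees that every upstream job ID exists before a dependent job needs to reference it.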

Editor's Notes

  1. Mana – 9216 cores; Harold – 500 nodes; AbUtil – 80 nodes. Overall benefits:
      • Eliminate waste with no loss in service (as we turn off idle machines and backfill holes)
      • A/C savings by scheduling work onto cooler nodes
      • Power capping means you can fit more hardware into smaller datacenters (provision only for used power, not peak power)
      • Power capping can also be used to run in degraded mode during power emergencies / disasters
      • Measure, report, and charge back power use
      • Note: not running a job twice (because PBS mitigates system failures) is also very green
  2. Staging copies files from the PFS to the burst buffer for execution and then stages the data out. Cache moves data implicitly (read-ahead and write-behind), which is useful for the following:
      • Checkpoint/restart
      • Periodic output
      • Application libraries
  3. Open source project founded by NIWA, New Zealand. Now includes: NIWA, UK Met, BOM, KMA, NCAR, Meteorological Service Singapore, …
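
The power-capping arithmetic behind the Energy Aware slide and the first note can be illustrated with a toy admission check. This is a deliberately simplified model, not PBS's scheduling algorithm: jobs are considered in queue order and started only while the sum of their power profiles stays within the budget. The function name is hypothetical. The figures are the examples from the slide (a 0.5 MW budget and a 600 W per-job profile).

```python
def admit_under_power_cap(job_watts, power_budget_w):
    """Greedily admit jobs, in queue order, while total draw stays in budget.

    Returns (admitted_indices, total_watts). A job that would exceed the
    budget is skipped, so a smaller job later in the queue can still fit.
    """
    admitted, total = [], 0
    for index, watts in enumerate(job_watts):
        if total + watts <= power_budget_w:
            admitted.append(index)
            total += watts
    return admitted, total


# Figures from the slide: a 0.5 MW budget and a 600 W per-job profile.
budget_w = 500_000
queue = [600] * 1000
admitted, total = admit_under_power_cap(queue, budget_w)
print(len(admitted), total)
```

With these numbers 833 of the 1,000 queued jobs fit, drawing 499,800 W. That is the sense in which a site can provision for used power rather than peak power.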