This document discusses modernizing the software stack for high performance computing (HPC) systems. It proposes using minimal operating systems on compute nodes to improve manageability, scalability, and security. Cluster services should be containerized and stateless to improve resilience. Jobs would utilize containerization as well. Configuration management, state management, orchestration and provisioning are identified as key components for managing future HPC systems, with recommendations provided for each area. Adapting practices from cloud computing could help HPC systems become more manageable, serviceable, scalable, resilient and secure.
Paper sharing_Modernizing the HPC system
1. MODERNIZING THE HPC SYSTEM SOFTWARE STACK
2021.05.28 Presenter: 陳佑昇
ALLEN, BENJAMIN S.; EZELL, MATTHEW A.;
PELTZ, PAUL; JACOBSEN, DOUG;
ROMAN, ERIC; LUENINGHOENER, CORY;
LOWELL WOFFORD, J.
2. Keyword: high performance computing, distributed computing, operating systems
This paper was published in SC20.
3. CONTENT
• Introduction
• Compute and service nodes within an HPC system
• The logical components for future HPC systems management
- Configuration management
- State management
- Orchestration
- Provisioning
• Conclusions
5. INTRODUCTION
Mid-1990s
• The US DOE had some of the largest HPC systems.
By around the 2010s
• HPC was eclipsed by the scale of web-scale and cloud computing technology.
The paper contends that HPC needs a modern system software stack that focuses on manageability, scalability, security, and modern methods, and it makes recommendations for the HPC community.
6. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Minimal OS
Cluster Services
Jobs
• Provide for stateless services
• Mount from the network
• Hierarchically managed
• Easy to copy
• Containerized environment (resolves conflicts)
7. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Minimal OS
• Current State
Most minimal OS distributions are targeted toward microservices environments.
• Manageability & Serviceability
A. Reduced code base
- Only include the kernel and base services
B. Reduced image configuration
- Simplified node image configurations
- Lower node boot time when moving conf
8. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Minimal OS
• Scalability & Resiliency
- Easier to logically separate a node's
- A layer where a center can include sandboxing
- Automatic remediation tools
• Implementation
- Kernel, kernel modules, and hardware support
- Initial ramdisk
- Read-only root filesystem image
- Boot-time OS configuration
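The combination of a read-only root image and boot-time OS configuration can be sketched roughly as follows. This is an illustration only, not the paper's implementation: every node boots the same image and derives its per-node settings from the kernel command line at boot. The `node.` parameter prefix, the `boot_config` helper, and the default values are assumptions made up for this sketch; `ro` is the standard kernel parameter for mounting the root filesystem read-only.

```python
import shlex


def parse_kernel_cmdline(cmdline: str) -> dict:
    """Split a /proc/cmdline-style string into key/value pairs.

    Bare flags such as "ro" are stored with the value True.
    """
    params = {}
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def boot_config(cmdline: str, image_defaults: dict) -> dict:
    """Overlay per-node boot parameters on the defaults baked into the image.

    Only parameters with the hypothetical "node." prefix are treated as
    node configuration; the shared image itself is never modified.
    """
    params = parse_kernel_cmdline(cmdline)
    config = dict(image_defaults)
    for key, value in params.items():
        if key.startswith("node."):
            config[key[len("node."):]] = value
    config["root_read_only"] = params.get("ro", False) is True
    return config


if __name__ == "__main__":
    defaults = {"role": "service", "ntp": "10.0.0.1"}
    line = "root=/dev/nfs ro node.role=compute node.cluster=alpha"
    print(boot_config(line, defaults))
```

Keeping per-node differences in boot parameters rather than in the image is one way to get the "reduced image configuration" benefit listed for the minimal OS: a single image can serve every node.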
9. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Cluster Services
• Current State
- Multiple copies need to run at the same time
- Request monitoring and system managers are not easy to work with
• Manageability & Serviceability
- Containerization/Virtualization
- Minimal OS
- Service profiling
- Visibility into operations
10. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Cluster Services
• Scalability & Resiliency
Resiliency
Components can be quickly started, restarted, and replaced without affecting a running system
Failure Modes
A failure in a service node should not result in failures of client nodes
Cluster independence
Services can be used for more than one logical cluster at a time
- Transparent load balancing
- Automatic scalability
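The failure-mode goal above (a failed service node should not cause client failures) can be illustrated with a small client-side failover sketch. It assumes stateless service replicas, so any replica can answer any request. The replica functions and the `call_with_failover` helper are hypothetical names for this example and do not come from the paper.

```python
from typing import Callable, Sequence


class AllReplicasFailed(RuntimeError):
    """Raised only when every replica of a service is unreachable."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each stateless replica in turn and return the first answer.

    Because the replicas hold no per-client state, the client does not
    need to care which one responds.
    """
    failures = 0
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            failures += 1  # this service node is down, try the next one
    raise AllReplicasFailed(f"{failures} replicas failed for request {request!r}")


def down(request: str) -> str:
    # Stand-in for a crashed service node.
    raise ConnectionError("service node unreachable")


def up(request: str) -> str:
    # Stand-in for a healthy service node.
    return f"ok:{request}"


if __name__ == "__main__":
    print(call_with_failover([down, up], "status"))  # prints "ok:status"
```

A real deployment would more likely put this logic in a load balancer in front of the replicas, which is what keeps it transparent to clients.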
11. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Jobs
• Current State
- Container adoption has been slow in the HPC community
- There are full-service and lightweight containers
• Usability
1. Standardize on a single container image format that can work on any system
2. Provide transparent containerization to one
12. COMPUTE AND SERVICE NODES WITHIN AN HPC SYSTEM
Jobs
• Manageability & Serviceability
- User environment upgrades
- User environment flexibility
- Operating system separation
• Implementation
- To provide a very basic level of support, this requires the ability to start jobs on compute nodes with a minimal set of Linux namespaces in use. HPC systems often have dependencies that cross that boundary.
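As a rough sketch of starting a job with a minimal set of Linux namespaces, the helper below builds a command line for the util-linux `unshare` tool. It only constructs the argument list and does not run anything. The `build_job_command` helper, the default choice of mount and user namespaces, and the example job `./solver` are assumptions for illustration; the slides do not say which namespaces or which launcher the paper uses.

```python
# Long options of the util-linux `unshare` command, keyed by namespace name.
NAMESPACE_FLAGS = {
    "mount": "--mount",
    "user": "--user",
    "pid": "--pid",
    "net": "--net",
    "ipc": "--ipc",
    "uts": "--uts",
}


def build_job_command(job_argv, namespaces=("mount", "user")):
    """Return an argv list that runs job_argv inside the given namespaces.

    The default is a deliberately small set so that the job still shares
    the host network and IPC, which HPC interconnects often rely on.
    """
    flags = []
    for name in namespaces:
        if name not in NAMESPACE_FLAGS:
            raise ValueError(f"unknown namespace: {name}")
        flags.append(NAMESPACE_FLAGS[name])
    if "user" in namespaces:
        # Map the calling user to root inside the new user namespace.
        flags.append("--map-root-user")
    if "pid" in namespaces:
        # A new PID namespace needs a forked child to act as its init.
        flags.append("--fork")
    return ["unshare", *flags, "--", *job_argv]


if __name__ == "__main__":
    print(" ".join(build_job_command(["./solver", "--steps", "10"])))
```

Leaving the network and IPC namespaces shared by default reflects the point that HPC jobs often have dependencies crossing the container boundary.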
13. THE LOGICAL COMPONENTS FOR FUTURE HPC SYSTEMS MANAGEMENT
• Configuration management
• State management
• Orchestration
• Provisioning
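14. THE LOGICAL COMPONENTS FOR FUTURE HPC SYSTEMS MANAGEMENT
- Configuration management
• Manageability & Serviceability (automate policy enforcement, unify the API interface, and provide settings for different environments)
• Scalability & Resiliency (implement asynchronous operations)
• Modern methods (implement version control)
• Security (implement firewall control and key management)
• Current State (the technology is mature today; only security management still needs strengthening)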
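Two of the configuration management points, per-environment settings and version control, can be sketched together: a base configuration is overlaid with environment-specific overrides, and the result gets a content-derived version identifier. This is a minimal sketch under stated assumptions, not how any particular configuration management tool works. The `merge` and `render` helpers and the sample keys are invented for the example.

```python
import hashlib
import json


def merge(base: dict, override: dict) -> dict:
    """Recursively overlay override on base without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def render(base: dict, environments: dict, env_name: str):
    """Return the effective config for one environment plus a version id.

    The version is a hash of the canonical JSON form, so identical
    configurations always get the same id and any change produces a new one.
    """
    config = merge(base, environments[env_name])
    canonical = json.dumps(config, sort_keys=True).encode()
    return config, hashlib.sha256(canonical).hexdigest()[:12]


if __name__ == "__main__":
    base = {"ntp": {"server": "10.0.0.1"},
            "firewall": {"default": "deny", "allow": ["ssh"]}}
    environments = {"prod": {},
                    "test": {"firewall": {"allow": ["ssh", "http"]}}}
    for name in environments:
        config, version = render(base, environments, name)
        print(name, version, config["firewall"]["allow"])
```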
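15. THE LOGICAL COMPONENTS FOR FUTURE HPC SYSTEMS MANAGEMENT
- State management
• Manageability & Serviceability (provide a reliable method for state management)
• Scalability & Resiliency (managed state allows events to be handled consistently)
• Implementation (implement a state management server)
• Current State (state management is an aspect that HPC very often overlooks)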
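To make the idea of a state management server with consistent event handling more concrete, here is a small in-memory sketch. It assumes node state can be modeled as a finite state machine with an explicit table of allowed transitions. The state names, the `ALLOWED` table, and the `StateStore` class are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical node lifecycle: each state maps to the states it may move to.
ALLOWED = {
    "unknown": {"booting", "down"},
    "booting": {"ready", "down"},
    "ready": {"busy", "down"},
    "busy": {"ready", "down"},
    "down": {"booting"},
}


@dataclass
class StateStore:
    """Single source of truth for node state, with an audit log."""

    states: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply(self, node: str, new_state: str) -> bool:
        """Apply a state-change event; reject transitions that are not allowed."""
        current = self.states.get(node, "unknown")
        if new_state not in ALLOWED[current]:
            return False
        self.states[node] = new_state
        self.log.append((node, current, new_state))
        return True


if __name__ == "__main__":
    store = StateStore()
    print(store.apply("node001", "booting"))  # prints True
    print(store.apply("node001", "busy"))     # prints False (booting -> busy is not allowed)
    print(store.apply("node001", "ready"))    # prints True
```

Routing every event through one `apply` method is what makes the handling consistent: all components see the same current state and the same rules.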
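16. THE LOGICAL COMPONENTS FOR FUTURE HPC SYSTEMS MANAGEMENT
- Orchestration
• Manageability & Serviceability (carry out updates, system control, and recovery across systems)
• Scalability & Resiliency (orchestrate appropriate task logic)
• Modern methods (provide access through an API interface)
• Implementation (fully automated control and execution)
• Current State (automated systems are still uncommon in HPC)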
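Orchestrating task logic for something like a system update can be sketched as running tasks in dependency order and stopping at the first failure so that recovery can begin from a known point. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names and the `run_plan` helper are assumptions for this example, not part of the paper.

```python
from graphlib import TopologicalSorter


def run_plan(dependencies: dict, actions: dict):
    """Run tasks in dependency order.

    dependencies maps each task to the set of tasks that must finish first.
    Returns (completed_tasks, error); error is None when everything ran.
    """
    completed = []
    for task in TopologicalSorter(dependencies).static_order():
        try:
            actions[task]()
        except Exception as exc:
            # Stop here; tasks that depend on the failed one never start.
            return completed, f"{task}: {exc}"
        completed.append(task)
    return completed, None


if __name__ == "__main__":
    # Hypothetical update workflow for a set of compute nodes.
    dependencies = {
        "drain": set(),
        "update_image": {"drain"},
        "reboot": {"update_image"},
        "resume": {"reboot"},
    }
    actions = {name: (lambda n=name: print("running", n)) for name in dependencies}
    print(run_plan(dependencies, actions))
```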
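17. THE LOGICAL COMPONENTS FOR FUTURE HPC SYSTEMS MANAGEMENT
- Provisioning
• Manageability & Serviceability (implement simple, automated provisioning)
• Scalability & Resiliency (fast boot requirements, node discovery)
• Security (produce immutable, read-only images)
• Implementation (node discovery, image generation, and image transfer)
• Current State (existing tools mainly target enterprise deployments)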
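The provisioning steps of node discovery and immutable image delivery can be sketched as a lookup from a discovered MAC address to a node role, followed by a checksum check on the image before it is handed out. This is a simplified illustration: the `Image` class, the `select_image` helper, the inventory layout, and the sample MAC address are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    """An immutable boot image with the digest recorded when it was built."""

    name: str
    data: bytes
    sha256: str


def build_image(name: str, data: bytes) -> Image:
    """Record the digest at build time so later tampering is detectable."""
    return Image(name, data, hashlib.sha256(data).hexdigest())


def select_image(mac: str, inventory: dict, images: dict):
    """Return the verified image for a discovered node, or None if unknown.

    inventory maps lower-case MAC addresses to roles; images maps roles
    to Image objects.
    """
    role = inventory.get(mac.lower())
    if role is None:
        return None  # undiscovered node: nothing to provision yet
    image = images[role]
    if hashlib.sha256(image.data).hexdigest() != image.sha256:
        raise ValueError(f"image {image.name} failed its integrity check")
    return image


if __name__ == "__main__":
    images = {"compute": build_image("compute-rootfs", b"example image bytes")}
    inventory = {"aa:bb:cc:00:00:01": "compute"}
    print(select_image("AA:BB:CC:00:00:01", inventory, images).name)  # prints "compute-rootfs"
```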
18. CONCLUSIONS
• A variety of practices can be beneficially adapted to make HPC systems more manageable, serviceable, scalable, resilient, and secure.
• Many can translate to this model with minimal effort due to their horizontal scaling features.
• There is a lot of potential in moving toward containerized workflows in both of these areas.