Paper sharing_Modernizing the HPC system

MODERNIZING THE HPC SYSTEM
SOFTWARE STACK
2021.05.28 Presenter: 陳佑昇 (You Sheng Chen)
ALLEN, BENJAMIN S.; EZELL, MATTHEW A.;
PELTZ, PAUL; JACOBSEN, DOUG;
ROMAN, ERIC; LUENINGHOENER, CORY;
LOWELL WOFFORD, J.

Keywords: high performance computing, distributed computing, operating systems
THIS PAPER WAS PUBLISHED IN SC20

CONTENT
• Introduction
• Compute and service nodes within an HPC system
• The logical components for future HPC systems management
- configuration management
- state management
- orchestration
- provisioning
• Conclusions

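VOCABULARIES
English term: Meaning
Exascale HPC (high performance computing): HPC at the scale of 10^18 floating-point operations per second
magnitudes: orders of size
stagnate: to stop progressing
monolithic: massive, built as one large unit
orchestration: automated coordination of tasks (a computing technique)
remediation: correction of a fault
Stateless service: a service that keeps no state
transparent: invisible to the user
intervention: stepping in to change an outcome
hierarchical: arranged in layers
credential: proof of identity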

INTRODUCTION
Mid-1990s
• The US DOE had some of the largest HPC systems.
By around the 2010s
• HPC systems were eclipsed in scale by web-scale and cloud computing technology.
The paper contends that HPC needs a modern system software stack that focuses on manageability, scalability, security, and modern methods, and it makes recommendations for the HPC community.

COMPUTE AND SERVICE NODES
WITHIN AN HPC SYSTEM
Minimal OS / Cluster Services / Jobs
• Provide for stateless services
• Mount from the network
• Managed hierarchically
• Easy to copy
• Containerized environment (resolves conflicts)

Minimal OS
• Current State
Most minimal OS distributions are targeted toward microservices environments.
• Manageability & Serviceability
A. Reduced code base
-Only include the kernel and base services
B. Reduced image configuration
-Simplified node image configurations
-Lower node boot time when moving conf

Minimal OS
• Scalability & Resiliency
-Easier to logically separate a node’s
-Provides a layer where a center can include sandboxing
-Automatic remediation tools
• Implementation
-Kernel, kernel modules, and hardware support
-Initial ramdisk
-Read-only root filesystem image
-Boot-time OS configuration

Cluster Services
-Multiple copies need to run at the same time
-Request monitoring and system managers are not easy to work with
-Containerization/Virtualization
-Minimal OS
-Service profiling
-Visibility into operations

Cluster Services
• Scalability & Resiliency
Resiliency
Components that can be quickly started, restarted, and replaced without affecting a running system
Failure Modes
A failure in a service node should not result in failures of client nodes
Cluster independence
Services can be used for more than one logical cluster at a time
-Transparent load balancing
-Automatic scalability
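The failure-mode goal above (a service node failure should not fail its clients) can be illustrated with a minimal client-side failover sketch. It is not taken from the paper: the replica names, the `call_with_failover` helper, and the `fake_request` stand-in are all hypothetical, and a real deployment would more likely place a load balancer in front of the replicas.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[str],
                       request: Callable[[str], str]) -> str:
    """Try each replica in order and return the first successful reply."""
    errors = []
    for replica in replicas:
        try:
            return request(replica)
        except ConnectionError as exc:
            # A dead service node is recorded and skipped, not propagated.
            errors.append(f"{replica}: {exc}")
    raise AllReplicasFailed("; ".join(errors))


def fake_request(replica: str) -> str:
    """Hypothetical transport: pretend that svc-a is down."""
    if replica == "svc-a":
        raise ConnectionError("node down")
    return f"served by {replica}"


result = call_with_failover(["svc-a", "svc-b"], fake_request)
print(result)
```

The client only fails when every replica is unreachable, which is the property the slide asks for.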

Jobs
-Container adoption has been slow in the HPC community
-There are full-service and lightweight containers
• Usability
1. Standardize on a single container image format that can work on any system
2. Provide transparent containerization to one

Jobs
-User Environment Upgrades
-User Environment Flexibility
-Operating System Separation
• Implementation
-To provide a very basic level of support, this requires the ability to start jobs on compute nodes with a minimal set of Linux namespaces in use
-HPC systems often have dependencies that cross that boundary
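As a rough illustration of starting a job with a minimal set of Linux namespaces, the sketch below builds a command line for the util-linux `unshare` tool. The helper name `build_job_argv`, the job command `./my_app`, and the choice to unshare only the mount namespace by default are assumptions for illustration, not the paper's implementation. Leaving the network and IPC namespaces shared is one way to keep dependencies that cross the container boundary working.

```python
import shlex

# Long options of util-linux `unshare` for each namespace type.
NAMESPACE_FLAGS = {
    "mount": "--mount",
    "pid": "--pid",
    "uts": "--uts",
    "ipc": "--ipc",
    "net": "--net",
    "user": "--user",
}


def build_job_argv(job_cmd, namespaces=("mount",)):
    """Return an argv list that runs job_cmd in the given new namespaces."""
    argv = ["unshare"]
    for ns in namespaces:
        argv.append(NAMESPACE_FLAGS[ns])  # KeyError on an unknown namespace
    if "pid" in namespaces:
        # A new PID namespace only applies to children, so fork first.
        argv.append("--fork")
    argv.append("--")
    argv.extend(job_cmd)
    return argv


# Hypothetical job: only the mount namespace is private to the job.
argv = build_job_argv(["./my_app", "--input", "data.bin"])
print(shlex.join(argv))
```

The list can be passed to `subprocess.run` on a node where the caller has the privileges that `unshare` needs; the sketch only constructs the command and does not execute it.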

THE LOGICAL COMPONENTS
FOR FUTURE HPC SYSTEMS
MANAGEMENT
• Configuration management
• State management
• Orchestration
• Provisioning

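MANAGEMENT
-Configuration management
• Manageability & Serviceability (automate policy enforcement, unify the API interface, provide settings for different environments)
• Scalability & Resiliency (use asynchronous operations)
• Modern methods (use version control)
• Security (apply firewall control and key management)
• Current State (the technology is mature; only security management still needs strengthening)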

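MANAGEMENT
-State management
• Manageability & Serviceability (provide a reliable method for managing state)
• Scalability & Resiliency (managed state allows consistent event handling)
• Implementation (deploy a state management server)
• Current State (state management is often overlooked in HPC)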
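One common way to think about state management is as a comparison between desired and actual state. The sketch below is a minimal illustration under that assumption; the node names, state labels, and the `reconcile` helper are hypothetical and are not the state management server the paper describes.

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Return {node: (current_state, wanted_state)} for nodes that differ."""
    actions = {}
    for node, want in desired.items():
        # A node with no report is treated as "unknown".
        have = actual.get(node, "unknown")
        if have != want:
            actions[node] = (have, want)
    return actions


# Hypothetical cluster snapshot.
desired = {"n1": "ready", "n2": "ready", "n3": "off"}
actual = {"n1": "ready", "n2": "booting"}

pending = reconcile(desired, actual)
print(pending)
```

Each entry in `pending` is a transition that some other component would still have to carry out; this sketch only computes the difference.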

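MANAGEMENT
-Orchestration
• Manageability & Serviceability (apply updates, system control, and recovery across systems)
• Scalability & Resiliency (orchestrate appropriate task logic)
• Modern methods (provide API access)
• Implementation (control and fully automate operations)
• Current State (automated systems are still uncommon in HPC)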
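Automated updates across a system are often done as a rolling update: change a few nodes, check their health, and stop if something breaks. The sketch below illustrates that idea only; `rolling_update`, the batch size, and the `update` and `healthy` callbacks are hypothetical names, not an API from the paper or from any orchestration tool.

```python
def rolling_update(nodes, update, healthy, batch_size=2):
    """Update nodes batch by batch; stop at the first unhealthy batch.

    Returns (updated_ok, failed), where failed lists the unhealthy nodes
    of the batch that stopped the rollout (empty on full success).
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            update(node)
        failed = [n for n in batch if not healthy(n)]
        if failed:
            # Halt so the remaining nodes keep running the old version.
            return done, failed
        done.extend(batch)
    return done, []


# Hypothetical run: node n4 fails its health check after the update.
touched = []
ok, bad = rolling_update(
    ["n1", "n2", "n3", "n4", "n5"],
    update=touched.append,
    healthy=lambda n: n != "n4",
)
print(ok, bad, touched)
```

Stopping at the first failing batch limits the damage to at most `batch_size` nodes, which leaves room for the recovery step the slide mentions.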

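MANAGEMENT
-Provisioning
• Manageability & Serviceability (simple, automated provisioning)
• Scalability & Resiliency (fast boot requirements, node discovery)
• Security (produce immutable, read-only files)
• Implementation (node discovery, image generation and transfer)
• Current State (existing tools mainly target enterprise deployments)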

CONCLUSIONS
• A variety of practices can be beneficially adapted to make HPC systems more manageable, serviceable, scalable, resilient, and secure.
• Many can translate to this model with minimal effort due to their horizontal scaling features.
• There is a lot of potential in moving toward containerized workflows in both of these areas.
