ML Training Troubleshooting
with a Large-Scale GPU-Based K8S Cluster
Open Infrastructure & Cloud Native Days Korea 2019
19.July.2019(Fri)
조규남
mystous@{naver, gmail}.com
1
mystous@kyunam.com:~$ who am i
• Principal Software Engineer / Software Architect @ Samsung Electronics
• 22 years and counting since first understanding C pointers…
• Working
Private/Public Cloud Solution and Application – VM & Container
Possibility of HPC application on Cloud infrastructure by container cluster
(The 22nd IEEE International Conference on Computational Science and Engineering, 2019)
Time-efficient simulations of tight-binding electronic structures with Intel
Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016)
Parallelization of a Poisson equation solver using Intel Xeon Phi
(Korea Information Processing Society, 2015 Fall Conference)
Excellence Award, Korea Supercomputing Programming Contest (2015)
2
Previous Presentation
https://developer.ibm.com/kr/devday2018/
https://www.slideshare.net/ssuser3e70ba/deep-learning-100-high-performance-computing-for-ai
3
Today
+ +
4
Introduction
What is Machine Learning Platform and Why
5
Machine Learning Platform Era
• Rise of the Machine Learning Platform
1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform
Photo by frank mckenna on Unsplash
Personal PC HPC Platform
Mark by Vladyslav Severyn from the Noun Project
+Performance +Convenience
6
Why is a Platform Needed?
• Too many pain points
End-to-End Management
: Many versions of data sets, unmanaged hyperparameters, and uncontrolled trained models
Configuration
: Too many ML frameworks, version dependencies, and a huge number of ML architecture versions
Utilization
: Dedicated resources, siloed management
Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699
*1
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
*2
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
7
Machine Learning Platforms
• The "Warring States" era of Machine Learning Platforms (many competing platforms)
8
How to build
• On-Premise? On Public Cloud?
versus by Hea Poh Lin from the Noun Project
9
Some Platforms
Machine Learning Platforms
10
Best Practice
• Uber {michelangelo}
Images from https://eng.uber.com/michelangelo/ copyright to Uber
11
Best Practice
• Airbnb {Bighead}
Slide clip from https://www.slideshare.net/databricks/bighead-airbnbs-endtoend-machine-learning-platform-with-krishna-puttaswamy-and-andrew-hoh copyright to Airbnb
12
Best Practice
• {Singularity}
Slide and Image from http://www.hpcadvisorycouncil.com/events/2017/stanford-workshop/pdf/GMKurtzer_Singularity_Keynote_Tuesday_02072017.pdf#43 copyright to Gregory M. Kurtzer <gmk@lbl.gov>
13
Machine Learning Platform
Basic components of Machine Learning Platform
14
Basic Sequence
• Machine Learning Basic Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Collect
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Prediction
Machine Learning Platform Coverage
AI Engineer
15
Overall Software Stack
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
Revisited from A. Reed, Daniel & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
Cluster
Hardware
System Software
HPC&AITechnology
Middleware &
Management
Infiniband + Ethernet SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
ParallelFramework
NumericalLibraries
SystemTool
Development Language
Training Algorithm
MLFramework
Hadoop
*1
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb
Platform
Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
*2
16
Based on HPC Technology
• Low latency and high throughput, not traffic handling
Enterprise Solution – Mass Traffic Handling HPC – Large Scale Problem Solving
17
With Container and WAS
• Easy deployment + Isolated environment + Convenient
HPC – Large Scale Problem Solving
+
19
Performance overhead on Kubernetes
[CPU Intensive Application] [Infiniband Comparison] [GPGPU Intensive Application]
K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering (IEEE CSE 2019), 2019.
20
Business Logic Layer
• Deliverable Management Layer
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon, ProSymbols
Cleansing & Labeling
Model Selection
Training
Evaluation
Parameter Tuning
Machine Learning Platform Coverage
21
Basic Architecture
• Kubernetes-based Machine Learning Platform
[ Architecture diagram: Storage; Management Servers (×3); Servers with GPGPU (×4); Kubernetes Cluster running Management Modules (×4), Training Jobs (×6), and Preprocessing (×6) ]
22
Basic Architecture
• Prerequisites to consider
[ Same architecture diagram as the previous slide, annotated with the prerequisites below ]
- .yaml templating
- RDMA-SRIOV plug-in
- NVIDIA-peer-memory package
- Pre-processing before running a training task
- Docker insecure registry
- Docker: unlock memory limit
- Persistent Volume mount
- Multi-tenant management
- Unified time zone
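Several of these prerequisites (templating, GPU requests, persistent volume mount, unified time zone) can be combined in one generated manifest. The following is a minimal sketch, not the platform's actual template: the function name, image name, claim name, mount path, and the Asia/Seoul default are all assumptions for illustration. It relies on two facts: `nvidia.com/gpu` is the extended resource name exposed by the NVIDIA device plugin, and JSON is valid YAML, so the output can be used as a manifest. The `TZ` variable only takes effect if the image ships time zone data.

```python
import json


def render_training_pod(job_name, image, gpus, pvc_name, timezone="Asia/Seoul"):
    """Return a Pod manifest (as a dict) for one training job.

    All argument values are supplied by the caller; nothing here is
    taken from the original platform.
    """
    if gpus < 1:
        raise ValueError("a training pod needs at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": job_name, "labels": {"app": "ml-training"}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                # Same TZ in every container so that logs line up across nodes.
                "env": [{"name": "TZ", "value": timezone}],
                # GPUs are requested through the extended resource exposed
                # by the NVIDIA device plugin.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
                # Hypothetical mount path for the user's data.
                "volumeMounts": [{"name": "user-data", "mountPath": "/data"}],
            }],
            "volumes": [{
                "name": "user-data",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }


# Hypothetical job, image, and claim names.
manifest = render_training_pod(
    "bert-pretrain-001", "registry.local:5000/tf:1.14-gpu", 4, "user-a-pvc"
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-editing YAML keeps every training job consistent on the points listed above.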
23
Troubleshooting
Problems you may encounter
24
Basic Architecture
• Kubernetes-based Machine Learning Platform
[ Same architecture diagram, annotated with the issues encountered ]
- Storage issue
- Insufficient data feeding speed
- Multi-GPU processing
- Data locality
- CNI overhead
- Unsuitable Pod scheduler
- Dynamic Pod configuration and operation
- Direct calls to kubectl commands
- No vGPU
- Communication overhead between servers
- Resource management
- Container root privilege
25
Troubleshooting
• Storage Issue
- 1) Each environment is built and stored as a Docker image
→ Storage use grows with every combination of ML frameworks
→ Storage use surges when personalized per-user environments are offered
- 2) Users may store large files inside Docker
→ Processing large data sets, as with BERT, triggers k8s resource eviction
Solution
1) a. Provide Docker images on demand; manage the cache with a dirty flag
b. Garbage-collect user custom images and establish a policy
2) a. Notice only
b. Planned: mount a persistent volume to the user directory
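One way to read solution 1) is a size-bounded image cache where images with unsaved user changes are protected from eviction. The sketch below is an assumption about what "dirty flag" means here, not the platform's implementation; the class names, the least-recently-used policy, and the sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedImage:
    name: str
    size_gb: float
    last_used: int       # logical clock tick of the most recent use
    dirty: bool = False  # True if the user changed it and it is not yet saved


class ImageCache:
    """Size-bounded cache that never evicts dirty images."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.images = {}
        self.clock = 0

    def used_gb(self):
        return sum(img.size_gb for img in self.images.values())

    def touch(self, name, size_gb, dirty=False):
        """Record a pull or use of an image, then evict if over capacity."""
        self.clock += 1
        img = self.images.get(name)
        if img is None:
            self.images[name] = CachedImage(name, size_gb, self.clock, dirty)
        else:
            img.last_used = self.clock
            img.dirty = img.dirty or dirty
        return self.collect_garbage(keep=name)

    def collect_garbage(self, keep=None):
        """Evict clean images, least recently used first; return their names."""
        evicted = []
        candidates = sorted(
            (i for i in self.images.values() if not i.dirty and i.name != keep),
            key=lambda i: i.last_used,
        )
        for img in candidates:
            if self.used_gb() <= self.capacity_gb:
                break
            del self.images[img.name]
            evicted.append(img.name)
        return evicted


cache = ImageCache(capacity_gb=10)
cache.touch("tf-1.14", 4)
cache.touch("user-a-custom", 5, dirty=True)
evicted = cache.touch("pytorch-1.1", 4)
print(evicted)
```

In this run the clean base image is evicted while the dirty user image stays, which is the behavior a garbage-collection policy for user custom images would need to define explicitly.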
26
Troubleshooting
• Data Feeding Bottleneck Issue
- Data feeding speed becomes a problem as the number of GPGPUs used for training grows
Graph from https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
[ Data required by number of GPGPUs ]
Solution
a. Provide a CPU-intensive pipeline
→ Raises resource issues and requires multiple nodes
b. Enable hardware vendor solutions
e.g. NVIDIA DALI 1), Intel DAAL 2) / MKL 3), etc.
1) https://github.com/NVIDIA/DALI
2) https://software.intel.com/en-us/intel-daal
3) https://software.intel.com/en-us/mkl
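The idea behind solution a. is to keep preprocessing running ahead of the GPUs. The sketch below shows only the prefetching pattern, using the standard library; the function name and parameters are hypothetical. Python threads do not speed up CPU-bound work because of the GIL, so a real pipeline would more likely use processes or a vendor library such as DALI.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetch(items, preprocess, workers=4, depth=8):
    """Yield preprocess(item) in input order, keeping up to `depth` items in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for item in items:
            pending.append(pool.submit(preprocess, item))
            # Once the window is full, hand the oldest result to the consumer.
            if len(pending) >= depth:
                yield pending.popleft().result()
        # Drain whatever is still in flight.
        while pending:
            yield pending.popleft().result()


def decode(sample_id):
    # Stand-in for decoding and augmenting one training sample.
    return sample_id * sample_id


batches = list(prefetch(range(10), decode, workers=4, depth=4))
print(batches)
```

The `depth` bound matters: without it, preprocessing could run arbitrarily far ahead and exhaust host memory on a node that is already short on resources.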
27
Troubleshooting
• Resource Management & Unsuitable Pod Scheduler Issue
- 1) Fragmentation of GPGPU machine resources
→ Kubernetes resource affinity spreads computation out, which makes multi-GPU scheduling difficult
- 2) Abusive users
→ Resources are grabbed in advance and utilization stays low
Solution
1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
b. Offer a variety of resource packings, e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
2) a. Fair-share scheduling and quota consumption
b. Planned: preemption scheduler for GPGPU
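The packing example can be enumerated mechanically. The sketch below assumes that "1*16" means 1 node with 16 GPGPUs, "2*8" means 2 nodes with 8 each, and so on; that reading, the function names, and the best-fit rule are assumptions, not the custom scheduler from the talk. The second function illustrates one common way to limit fragmentation: place a job on the fitting nodes that have the fewest free GPUs, so that large free blocks stay intact.

```python
def packings(total_gpus, max_per_node, min_per_node=1):
    """List (nodes, gpus_per_node) shapes that give exactly total_gpus."""
    shapes = []
    for per_node in range(min(max_per_node, total_gpus), min_per_node - 1, -1):
        if total_gpus % per_node == 0:
            shapes.append((total_gpus // per_node, per_node))
    return shapes


def best_fit(free_gpus, nodes_needed, per_node):
    """Pick the nodes with the fewest free GPUs that still fit per_node each.

    free_gpus maps a node name to its number of free GPUs.
    Returns a list of node names, or None if the shape cannot be placed.
    """
    fitting = sorted(
        (free, name) for name, free in free_gpus.items() if free >= per_node
    )
    if len(fitting) < nodes_needed:
        return None
    return [name for _, name in fitting[:nodes_needed]]


shapes = packings(16, max_per_node=16, min_per_node=2)
print(shapes)

# Hypothetical cluster state: node name -> free GPUs.
placement = best_fit({"n1": 8, "n2": 4, "n3": 2, "n4": 4}, nodes_needed=2, per_node=4)
print(placement)
```

With these inputs the four shapes from the slide are produced, and the 2-node, 4-GPU job lands on n2 and n4, leaving the 8 free GPUs on n1 available for a larger job.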
28
Troubleshooting
• Data Locality and Copy Issue
- Data copy overhead between Storage, GPGPU server, and GPGPU
Solution
1) Enable a cache between Storage and the GPGPU server
→ Solutions differ by hardware vendor
2) In-memory DB, SR-IOV, GPUDirect RDMA 1), etc.
3) GPUDirect 1), GPGPU memory alignment
1) https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
[ Diagram: Storage and a Server with GPGPU (CPUs and GPGPUs), with points labeled 1), 2), 3) ]
29
Troubleshooting
• Multi GPGPU & Server Communication & Dynamic Pod Issue
- 1) Providing multi-GPGPU on demand
→ Provide individual clusters rather than multi-GPGPU inside a single cluster
→ Support the various ML frameworks that support multi-GPGPU: Horovod, CNTK, mxnet, Caffe-MPI, etc.
- 2) Performance issue
→ Scalability degrades because of network overhead
Solution
1) a. Configure a separate subnet per cluster
b. Establish a cluster setup method per ML framework and an ML framework plug-in structure
2) a. Provide hardware optimization and a GPGPU-locality-aware topology
b. Provide a variety of peer-to-peer communication APIs
30
Troubleshooting
• Multi GPGPU & Server Communication
- GPUDirectRDMA
Revisits from https://developer.nvidia.com/gpudirect
Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013
Image Source from http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
31
Troubleshooting
• Multi GPGPU & Server Communication
- Single Root Complex
Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/
https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
32
Troubleshooting
• Direct kubectl Calls & Root Privilege Issue
- The k8s API and kubectl commands differ; command-line permissions
- Container root privilege must be removed
Solution
a. Wrapper layer over the k8s API and kubectl
b. Grant user-level privilege to Docker containers
- Register the Docker insecure registry
- Lift the Docker memory limit
- Store and manage user secrets in a container
• https://blog.paranoidsoftware.com/dirty-cow-cve-2016-5195-docker-container-escape/
• https://0x0d.im/archives/docker-security.html
• https://www.slideshare.net/BorgHan/hacking-docker-the-easy-way
• https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails
• https://dirtycow.ninja/
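A wrapper layer can be as simple as an allowlist that turns a platform action into a fixed kubectl argument vector, so that users never type kubectl themselves. The sketch below is illustrative only: the action names, the allowlist, and the conservative name pattern are assumptions, and the talk does not describe how its wrapper is built. The kubectl subcommands and flags used (`get pods -o json`, `logs`, `delete pod`, `--namespace`) are standard.

```python
import re

# Hypothetical allowlist: platform action -> (fixed kubectl args, needs a pod name).
ALLOWED_ACTIONS = {
    "list_pods": (["get", "pods", "-o", "json"], False),
    "pod_logs": (["logs"], True),
    "delete_pod": (["delete", "pod"], True),
}

# Conservative pattern (an assumption): lowercase DNS label, at most 63 characters.
NAME_RE = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def build_kubectl_argv(action, namespace, pod=None):
    """Return the argv list for an allowed action, or raise on anything else."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not allowed: {action}")
    base_args, needs_pod = ALLOWED_ACTIONS[action]
    if needs_pod != (pod is not None):
        raise ValueError(f"{action}: pod name {'required' if needs_pod else 'not accepted'}")
    for value in (namespace, pod):
        if value is not None and not NAME_RE.fullmatch(value):
            raise ValueError(f"invalid name: {value!r}")
    argv = ["kubectl", *base_args]
    if pod is not None:
        argv.append(pod)
    return argv + ["--namespace", namespace]


# The list would be passed to subprocess.run(argv, ...) without a shell,
# so user input is never interpreted as shell syntax.
argv = build_kubectl_argv("pod_logs", "team-a", "bert-pretrain-001")
print(argv)
```

Because the namespace is always set by the wrapper, the same layer is also a natural place to enforce the multi-tenant boundary.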
33
Troubleshooting
• CNI Overhead Issue
- Supported CNIs differ by hardware; the network layer differs
- Performance issues from virtualization; RPC is used between containers on the same server
Table from https://chrislovecnm.com/kubernetes/cni/choosing-a-cni-provider/
Graph from Zeng, Hao, et al. Measurement and evaluation for docker container networking. In: 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). IEEE, 2017. p. 105-108.
[ CNI comparison table ] [ Performance comparison of selected CNIs ]
34
Troubleshooting
• CNI Overhead Issue
- Objective comparison figures are scarce; most cover Flannel and Calico
- Hardware configurations and virtualization layers all differ
Graph from https://community.cisco.com/t5/jive-developer-archive-blogs/docker-overlay-network-performance-comparison-intel-driver/ba-p/3664582
35
Troubleshooting
• CNI Overhead Issue
Graph from K. Cho, H. Lee, K. Bang, and S. Kim, “Possibility of HPC application on Cloud infrastructure by container cluster,” in The 22nd IEEE International Conference on Computational Science and Engineering
(IEEE CSE 2019), 2019.
Solution
a. Consider your own hardware requirements and hardware architecture
b. Measure performance yourself; treat the network as a constraint and consider workarounds
c. Offer a variety of resource packings
e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2
36
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Data sheet from NVIDIA official document https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
37
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Image from NVIDIA official document https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
[ vGPU Overall Architecture ]
Solution
[ Diagram labels: Servers with GPGPU (×3), Kubernetes Cluster, OpenStack Cluster, VM with vGPGPU (×3), Training Job (×4) ]
38
Suggestion
So what can we do?
39
Suggestion
• Watch your stage
All icon from the noun project (http://thenounprojecct.com) - Daouna Jeong, ruliani, Vectors Market
Beginner or Individual
Local Environment
cf. DIGITS, Anaconda
Professional
Cloud Environment
cf. ML Studio, Sagemaker
Expert and Product
Customized
cf. mlflow, kubeflow
40
Question?
Do you remain Curious?
41
