mystous@kyunam.com:~$ who am i
• Principal Software Engineer / Software Architect @ Samsung Electronics
• 23 years since first understanding C-language pointers…
• Working
Private/Public Cloud Solution and Application – VM & Container
Possibility of HPC application on Cloud infrastructure by container cluster
(The 22nd IEEE International Conference on Computational Science and Engineering, 2019)
Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016)
Parallelizing a Poisson-equation solver with Intel Xeon Phi (KIPS 2015 Fall Conference)
Korea Supercomputing Programming Contest, Excellence Award (2015)
Large Scale GPU Cluster for AI
Icons from the noun project (http://thenounproject.com) - Chad Remsing, Mohamed Mbarki
GPU Cluster AI Dev Platform
Large Scale GPU Cluster for AI
GPU Cluster
AI Dev Platform
TODAY
AI Basic Sequence
• Machine Learning Basic Flow
All icons from the noun project (http://thenounproject.com) - National Park Service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Machine Learning Platform Coverage
AI Engineer
Data collection
Data cleaning & labeling
Model selection & reconfiguration
Training
Evaluation
Hyperparameter re-tuning
Model deployment
Repeat
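The loop above can be turned into a concrete sketch. The following is purely illustrative (a toy linear model stands in for a neural network; every function name is hypothetical), showing the collect → clean & label → train → evaluate → re-tune cycle:

```python
import math

# Purely illustrative sketch of the basic ML flow; all names are hypothetical
# and a toy linear model stands in for a real network.
def collect_data():
    return [(x, 2 * x + 1) for x in range(10)]            # data collection

def clean_and_label(raw):
    return [(float(x), float(y)) for x, y in raw]         # cleaning & labeling stand-in

def train(data, lr):
    w, b = 0.0, 0.0                                       # training: plain SGD
    for _ in range(500):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def evaluate(model, data):                                # evaluation: mean squared error
    w, b = model
    total = 0.0
    for x, y in data:
        d = (w * x + b) - y
        total += d * d
    return total / len(data)

def basic_ml_flow():
    data = clean_and_label(collect_data())
    best_score, best_model = math.inf, None
    for lr in (1.0, 0.1, 0.01, 0.001):                    # hyperparameter re-tuning loop
        model = train(data, lr)
        score = evaluate(model, data)
        if score < best_score:                            # NaN from a diverged run never wins
            best_score, best_model = score, model
        if best_score < 1e-3:                             # good enough: deploy the model
            break
    return best_model
```

A real platform runs each iteration of this loop as a tracked job, which is exactly the coverage gap discussed in the following slides.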
Recall from Open Infrastructure & Cloud Native Days Korea 2019 – ML Training Troubleshooting on a Large-Scale GPU-Based K8s Cluster (talk by Kyunam Cho)
GPU Cluster with HPC Technology
Cluster Hardware
System Software
HPC Technology
Middleware & Management
Infiniband + Ethernet SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
Parallel Framework
Numerical Libraries
System Tool
Development Language
ⓒ Romanzes637 @Wikimedia Commons, ⓒ Éducation nationale @Wikimedia Commons
HPC & AI
Applications
ML Framework
Hadoop
Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
Some icons from the noun project (http://thenounproject.com) - Creaticca Creative Agency, Chad Remsing
Revisited from Reed, Daniel A. & Dongarra, Jack (2015). Exascale Computing and Big Data. Communications of the ACM, 58, 56–68. 10.1145/2699414.
Is it enough?
• Too many pain points
End-to-End Management
: various data-set versions, unmanaged hyperparameters, and uncontrolled trained models
Configuration
: too many ML frameworks, version dependencies, and a huge number of ML architecture versions
Utilization
: dedicated resources, silo management
Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
*2 Source) Chrislb @Wikimedia Commons (created by Chrislb)
Machine Learning Platform on GPU Cluster
• The rise of the Machine Learning Platform
1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform
Photo by frank mckenna on Unsplash
Personal PC HPC Platform
Mark by Vladyslav Severyn from the Noun Project
+Performance +Convenience
The new wave is already here
• New ML Trend
1) Deeper, more complex neural architectures and diversified training environments
2) Multi-node distributed training
3) Automated ML (AutoML)
4) Federated Learning
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018.
J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A Survey on Distributed Machine Learning,” Dec. 2019.
B. Wu et al., “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search,” in Proceedings of CVPR, 2019, doi: 10.1109/CVPR.2019.01099.
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
B. Wu et al., “FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search,” in Proceedings of CVPR, 2019, doi: 10.1109/CVPR.2019.01099.
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
※ M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” May 2019.
As neural-network accuracy rises, the number of parameters rises with it.
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
B. Lorica and P. Nathan, AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice. O’Reilly, 2019.
There is no framework one could call a de facto standard; a wide variety of ML frameworks remain in continuous use.
New ML Trend #1
01. Platform-level support that matches new technology trends
Fine-grained customization (hardware and software) must be supported, e.g. via Kubernetes
New ML Trend #2
• No. 2: Multi-node distributed training
R. Mayer and H.-A. Jacobsen, “Scalable Deep Learning on Distributed Infrastructures,” ACM Comput. Surv., vol. 53, no. 1, pp. 1–37, Feb. 2020.
New ML Trend #2
• [149] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing
Systems. MIT Press, 693–701.
• [38] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. MIT Press, 1223–1231.
• [28] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the straggler problem with bounded staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS’13). USENIX, Santa Ana Pueblo, NM. Retrieved from https://www.usenix.org/conference/hotos13/solving-straggler-problem-bounded-staleness
• [133] Cyprien Noel and Simon Osindero. 2014. Dogwild!-Distributed hogwild for CPU & GPU. In Proceedings of the NIPS Workshop on Distributed Machine Learning and Matrix Computations.
• [35] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2014.
Exploiting bounded staleness to speed up big data analytics. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley,
CA, 37–48. Retrieved from http://dl.acm.org/citation.cfm?id=2643634.2643639.
• [102] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with
the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 583–598. Retrieved from
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu.
• [37] Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and Eric P. Xing. 2015. High-performance distributed ML at scale through parameter server consistency models. In Proceedings
of the 29th AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 79–87. Retrieved from http://dl.acm.org/citation.cfm?id=2887007.2887019
• [101] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. 2015. MALT: Distributed data-parallelism for existing ML applications. In Proceedings of the 10th European Conference on Computer
Systems (EuroSys’15). ACM, New York, NY. DOI:https://doi.org/10.1145/2741948.2741965
• [204] H. Zhang, C. Hsieh, and V. Akella. 2016. HogWild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In Proceedings of the IEEE 16th International Conference on
Data Mining (ICDM’16). 629–638. DOI:https://doi.org/10.1109/ICDM.2016.0074
• [36] Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. 2016. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In
Proceedings of the 11th European Conference on Computer Systems (EuroSys’16). ACM, New York, NY. DOI:https://doi.org/10.1145/2901318.2901323
• [83] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the ACM International Conference on Management of Data
(SIGMOD’17). ACM, New York, NY, 463–478. DOI:https://doi.org/10.1145/3035918.3035933
• [184] Shaoqi Wang, Wei Chen, Aidi Pi, and Xiaobo Zhou. 2018. Aggressive synchronization with partial processing for iterative ml jobs on clusters. In Proceedings of the 19th International Middleware
Conference (Middleware’18). ACM, New York, NY, 253–265. DOI:https://doi.org/10.1145/3274808.3274828
• [89] Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, and Peter R. Pietzuch. 2019. CROSSBOW: Scaling deep learning with small batch sizes on multi-GPU
servers. Retrieved from http://arxiv.org/abs/1901.02244.
• [19] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe M. Kiddon, Jakub Konecný, Stefano Mazzocchi, Brendan McMahan, Timon Van
Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. In Proceedings of the Conference on Systems and Machine Learning
(SysML’19). Retrieved from https://arxiv.org/abs/1902.01046.
New ML Trend #2
• No. 2: Multi-node distributed training
Infrastructure Layer: decided at initial deployment and difficult to modify later
Software Layer: can be changed at any time, but operational aspects must be considered
1) Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013, image source: http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
2) Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/, https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
3) Images from NVIDIA Korea homepage https://www.nvidia.com/ko-kr/data-center/nvlink/
NVIDIA NCCL
Self-service environment configuration and optimization are needed so that the state of the art can be adopted continuously.
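Software-wise, synchronous multi-node training boils down to one collective per step: every worker computes gradients on its own shard, the gradients are averaged with an all-reduce (the operation NCCL accelerates over NVLink/InfiniBand), and every worker applies an identical update. A toy pure-Python sketch with a naive stand-in for the collective; all names are illustrative:

```python
# Toy sketch of synchronous data-parallel SGD. In a real cluster the workers
# run in parallel and all_reduce_mean is an NCCL ring all-reduce.
def local_gradients(worker_data, w):
    # each worker computes the mean-squared-error gradient on its own shard
    g = 0.0
    for x, y in worker_data:
        g += 2 * ((w * x) - y) * x
    return g / len(worker_data)

def all_reduce_mean(values):
    # stand-in for the collective: every worker ends up with the mean
    mean = sum(values) / len(values)
    return [mean] * len(values)

def distributed_step(shards, w, lr=0.01):
    grads = [local_gradients(shard, w) for shard in shards]   # parallel in reality
    synced = all_reduce_mean(grads)                           # one collective per step
    return w - lr * synced[0]                                 # identical update everywhere

def train_distributed(shards, steps=200):
    w = 0.0
    for _ in range(steps):
        w = distributed_step(shards, w)
    return w
```

The hardware choices on this slide (InfiniBand, NVLink, GPUDirect RDMA) exist to make that one collective per step cheap; the software layer decides how often it happens.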
New ML Trend #2
01. Platform-level support that matches new technology trends
Fine-grained customization (hardware and software) must be supported, e.g. via Kubernetes
02. Support for diverse node and network configurations
Users must be able to compose the desired framework and network topology via self-service
New ML Trend #3
• No. 3 Automated ML (AutoML)
Automated ML = NAS (Neural Architecture Search) + Feature Engineering + Hyper-parameter Optimization
New ML Trend #3
• Hyperparameter Optimization
- Hyperparameter optimization tuners supported by NNI
(https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md)
Various automation methods mechanize the repetition of model training, producing tens or hundreds of result models simultaneously.
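Tuners like those built into NNI automate this loop; the simplest is random search. A minimal illustrative sketch (the toy objective stands in for a full training run; all names are hypothetical):

```python
import random

def random_search(objective, space, trials=50, seed=0):
    """Minimal random-search tuner in the spirit of AutoML HPO frameworks
    such as NNI; every trial could run as an independent cluster job."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)          # e.g. validation loss of one training run
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective standing in for a full training run; optimum at lr=0.1, momentum=0.9
def objective(p):
    return (p["lr"] - 0.1) ** 2 + (p["momentum"] - 0.9) ** 2

space = {"lr": (0.0, 1.0), "momentum": (0.0, 1.0)}
```

Each trial is independent, which is exactly why HPO multiplies job counts on the cluster: hundreds of such trials can run at once.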
New ML Trend #3
• Neural Architecture Search
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
With current architectures, whose results are hard to explain, automatic network generation by randomness or reinforcement learning is outpacing human effort.
New ML Trend #3
• Neural Architecture Search
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
The accuracy of NAS-designed networks is already at a usable level.
New ML Trend #3
• [4] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning.” [Online]. Available: http://arxiv.org/abs/1611.01578
• [5] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” vol. ICML. [Online]. Available: http://arxiv.org/abs/1802.03268
• [7] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition.” [Online]. Available: http://arxiv.org/abs/1707.07012
• [8] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu, “Practical block-wise neural network architecture generation.” [Online]. Available: http://arxiv.org/abs/1708.05552
• [9] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search.” [Online]. Available: http://arxiv.org/abs/1806.09055
• [10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search.” [Online]. Available:
http://arxiv.org/abs/1712.00559
• [11] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” in ICLR.
• [15] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” vol. ICLR. [Online]. Available: http://arxiv.org/abs/1611.02167
• [17] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin, “Large-scale evolution of image classifiers.” [Online]. Available: http://arxiv.org/abs/1703.01041
• [18] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search.” [Online]. Available: http://arxiv.org/abs/1802.01548
• [19] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective neural architecture search via lamarckian evolution.” [Online]. Available: http://arxiv.org/abs/1804.09081
• [20] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to designing convolutional neural network architectures.” [Online]. Available: http://arxiv.org/abs/1704.00764
• [21] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat, “Evolving deep neural networks.” [Online]. Available:
http://arxiv.org/abs/1703.00548
• [22] L. Xie and A. Yuille, “Genetic CNN,” vol. ICCV. [Online]. Available: http://arxiv.org/abs/1703.01513
• [94] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
• [96] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 1294–1303.
• [123] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” arXiv preprint arXiv:1812.09926, 2018.
• [125] A. Hundt, V. Jain, and G. D. Hager, “sharpDARTS: Faster and More Accurate Differentiable Architecture Search,” Tech. Rep. [Online]. Available: https://arxiv.org/pdf/1903.09900.pdf
• [142] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu, “Neural architecture optimization,” in Advances in neural information processing systems, 2018, pp. 7816–7827.
• [143] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” arXiv preprint arXiv:1902.07638, 2019.
New ML Trend #3
• Neural Architecture Search (reinforcement-learning based)
[Figure] Architecture search via reinforcement learning, model refinement via deep learning: a controller (agent) issues an <<action>>, a network architecture A sampled with probability P from operations such as Conv3x3, Sep5x5, Max3x3, and concat; the environment trains the network and returns <<reward, state>> as the change in P driven by the target accuracy R.
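The loop in the figure can be approximated in miniature. Below is a deliberately simplified sketch using random search over operation sequences, which is a strong NAS baseline per [143] above; the accuracy estimator is a toy stand-in for actually training each candidate network, and all names and scores are hypothetical:

```python
import random

# Simplified sketch of architecture search; real NAS trains each candidate
# network, here a toy score stands in for validation accuracy.
OPS = ["conv3x3", "sep5x5", "max3x3", "identity"]

def sample_architecture(rng, layers=4):
    return [rng.choice(OPS) for _ in range(layers)]

def estimate_accuracy(arch):
    # hypothetical proxy: pretend some operations help more than others
    scores = {"conv3x3": 0.20, "sep5x5": 0.25, "max3x3": 0.10, "identity": 0.05}
    return sum(scores[op] for op in arch)

def random_search_nas(trials=100, seed=0):
    rng = random.Random(seed)
    best_arch, best_acc = None, -1.0
    for _ in range(trials):
        arch = sample_architecture(rng)
        acc = estimate_accuracy(arch)      # in practice: train + evaluate the network
        if acc > best_acc:
            best_arch, best_acc = arch, acc
    return best_arch, best_acc
```

An RL controller replaces the uniform sampler with a learned policy over the same operation space, but the evaluate-and-feed-back loop, and its heavy GPU cost per candidate, is the same.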
New ML Trend #3
• Neural Architecture Search (NASNet)
[Figure] NASNet overall structure: a 3x3 conv (stride 2), then repeated stacks of Reduction Cells and Normal Cells (×2, ×6, ×6, ×6), ending in Softmax. Each cell takes the previous cell outputs h(i-1) and h(i) and combines operations such as sep 3x3/5x5/7x7, max 3x3, avg 3x3, and identity via add and concat to produce h(i+1).
Revisited from B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
New ML Trend #3
• AutoML Applied Machine Learning Flow
Machine Learning Platform Coverage
Data collection
Data cleaning & labeling
Model construction & refinement
Training
Evaluation
Hyperparameter re-tuning
Model deployment
Repeat
Repeat
New ML Trend #3
ALGORITHMIA: 2020 state of enterprise machine learning(https://cdn2.hubspot.net/hubfs/2631050/0284%20CDAO%20FS/Algorithmia_2020_State_of_Enterprise_ML.pdf)
Models now multiply in many different ways, and the burden of managing them grows accordingly.
New ML Trend #3
01. Platform-level support that matches new technology trends
Fine-grained customization (hardware and software) must be supported, e.g. via Kubernetes
02. Support for diverse node and network configurations
Desired framework and network topology selectable via self-service
03. Automated training, plus management of artifacts such as data and models
Versioning, configuration management, history tracking, and other management functions
New ML Trend #4
• No. 4 Federated Learning
T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018.
New ML Trend #4
• Why is Federated Learning needed?
Privacy Issue Data Regulation
New ML Trend #4
• No. 4 Federated Learning
Park, Jihong & Samarakoon, Sumudu & Bennis, Mehdi & Debbah, Mérouane (2018). Wireless Network Intelligence at the Edge.
Neural-network training no longer happens only on servers or in the cloud; training runs on devices, and the resulting models are exchanged.
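The core of FedAvg-style federated learning is that each device refines the global model on its private data, and the server only aggregates weights, never the data. A toy sketch with a one-parameter model; all names are illustrative:

```python
# Minimal Federated-Averaging-style sketch: devices train locally on private
# data and only model weights are exchanged, never the raw samples.
def local_update(w, device_data, lr=0.05, epochs=20):
    for _ in range(epochs):
        for x, y in device_data:               # on-device SGD on private samples
            w -= lr * 2 * ((w * x) - y) * x
    return w

def federated_round(w_global, devices):
    # each device starts from the global model; the server averages the
    # returned weights, weighted by local data size
    updates = [(local_update(w_global, d), len(d)) for d in devices]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

def federated_training(devices, rounds=30):
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, devices)
    return w
```

The platform consequences follow directly: the server side must connect to and version models from a huge device fleet, which is the "connect outside the platform" requirement below.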
New ML Trend #4
• Federated Learning Applied Machine Learning Flow
All icons from the noun project (http://thenounproject.com) - National Park Service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon, Adrien Coquet, LAFS
Machine Learning Platform Coverage
Data collection
Data cleaning & labeling
Model construction & refinement
Training
Evaluation
Hyperparameter re-tuning
Model deployment
Repeat
Repeat
On-device training with private data
Device model aggregation
Repeat
New ML Trend #4
01. Platform-level support that matches new technology trends
Fine-grained customization (hardware and software) must be supported, e.g. via Kubernetes
02. Support for diverse node and network configurations
Desired framework and network topology selectable via self-service
03. Automated training, plus management of artifacts such as data and models
Versioning, configuration management, history tracking, and other management functions
04. Connections beyond the platform, and between platforms, are essential
Connecting and managing dozens of device types and hundreds of thousands of devices, plus cross-region system federation
Fully Managed End-to-End ML Platform
Management | Machine Learning Platform | Developer Experience | Hardware
Icons from the noun project (http://thenounproject.com) - Luis Prado, Mello, Product Pencil, Gan Khoon Lay, pxLens, Adrien Coquet, Chad Remsing, Bartama Graphic, Creative Stall, Angelina, Alfredo @ IconsAlfredo.com, Eucalyp, Adi Kurniawan
Data Management / Model Management / Training Management
Hyperparameter Optimization / Neural Architecture Search
H/W Optimization / Network as Code / Connect Device
Configurable Environment / Keep Behavior
Fully Managed End-to-End ML Platform
Machine Learning is no longer satisfied by SOTA alone.
The platform must be able to manage the latest ML training techniques (SOTA architectures, Automated ML);
it must manage, reproduce, and reuse trained results together with the training environment, training code, hyperparameters, and data-set versions;
and it must federate diverse physical environments rather than train in a single one.
Fully Managed End-to-End ML Platform
Overall Architecture
Cluster Hardware
System Software
HPC & AI Technology
Middleware & Management
Infiniband + Ethernet SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
Parallel Framework
Numerical Libraries
System Tool
Development Language
Training Algorithm
ML Framework
Hadoop
Platform
Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
*2 Source) Chrislb @Wikimedia Commons (created by Chrislb)
Basic Architecture
• Preliminary considerations
Storage
Management Servers (×3)
Servers with GPGPU (×4)
Kubernetes Cluster
Management Modules | Training Jobs | Preprocessing
.yaml templating
RDMA-SRIOV plug-in
NVIDIA-peer-memory package
Training-task launch preprocessing
Docker insecure registry
Docker unlocked memory limit
Persistent Volume mount
Multi-tenant management
Unified timezone
Basic Architecture
• Kubernetes-based Machine Learning Platform
Storage
Management Servers (×3)
Servers with GPGPU (×4)
Kubernetes Cluster
Management Modules | Training Jobs | Preprocessing
Issues encountered:
Storage issues, insufficient data-feeding speed
Multi-GPU handling, data locality
CNI overhead, unsuitable pod scheduler, dynamic pod composition & behavior
Direct kubectl command calls
No vGPU support
Inter-server communication overhead
Resource management
Container root privilege
Troubleshooting
• Storage Issue
- NFS hits its performance ceiling sooner than the enterprise servers do
ML training characteristically reads a very large number of files
After training, derived files amount to N times the original training data
Users train on similar or identical data sets
Solution
1) Minimize original copies through training data-set metadata management
2) Data-set lifetime control is essential; use object storage
A library is needed to support training code written against file-system access
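Point 1) can be sketched as content-addressed storage: identical data sets registered by different users share one physical object, and the catalog holds only metadata. An illustrative sketch (the class and method names are hypothetical):

```python
import hashlib

# Hypothetical sketch of data-set metadata management: identical content is
# stored once (content-addressed, as in object storage); users hold references.
class DatasetStore:
    def __init__(self):
        self.objects = {}   # content hash -> bytes (single physical copy)
        self.catalog = {}   # dataset name -> content hash (metadata only)

    def register(self, name, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self.objects.setdefault(key, content)   # skip the upload if already present
        self.catalog[name] = key
        return key

    def fetch(self, name) -> bytes:
        return self.objects[self.catalog[name]]

    def physical_copies(self) -> int:
        return len(self.objects)
```

Lifetime control then becomes a metadata operation: dropping a catalog entry, and garbage-collecting objects no catalog entry references.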
Troubleshooting
• Resource Management & unsuitable Pod Scheduler issues
- 1) GPGPU machine-resource fragmentation
Kubernetes resource affinity spreads computation across nodes, making multi-GPU scheduling difficult
- 2) Abusing users
Resource hoarding and low utilization
Solution
1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
b. Provide various resource packings, e.g. 16 GPGPU = 1×16, 2×8, 4×4, 8×2
2) a. Fair-share scheduling and quota consumption
b. Planned: a preemption scheduler for GPGPU
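The packing options in 1) b. can be sketched as a small placement helper. The function names and the node map are hypothetical; a real custom scheduler would make this decision inside the Kubernetes scheduling framework:

```python
# Sketch of the resource-packing idea above: map a job's GPU request onto
# whole per-node groups (e.g. 16 GPUs = 1x16, 2x8, 4x4, 8x2) to limit
# fragmentation of GPGPU machines.
def packings(total_gpus, group_sizes=(16, 8, 4, 2, 1)):
    """Return the (nodes, gpus_per_node) options that exactly cover the request."""
    return [(total_gpus // g, g) for g in group_sizes if total_gpus % g == 0]

def pick_packing(total_gpus, free_per_node):
    # prefer the fewest nodes whose free GPUs can host the per-node group
    for nodes, per_node in packings(total_gpus):
        candidates = [n for n, free in free_per_node.items() if free >= per_node]
        if len(candidates) >= nodes:
            return {n: per_node for n in sorted(candidates)[:nodes]}
    return None   # fragmentation: the request cannot be placed right now
```

Returning None is the signal a fair-share or preemption policy would then act on, rather than letting the job hoard partial allocations.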
Troubleshooting
• Kubernetes Out-of-Memory (OOM) issues
- 1) Configuration differs from VM solutions such as OpenStack
Resource overcommit (CPU, RAM, etc.) is not possible
- 2) Kubernetes default option = swap off
Pods must share CPU and memory, and paging is unavailable
The memory-usage spikes that cause OOM grow faster than the cAdvisor–kubelet reporting interval can catch
Solution
1) Partitioned allocation and a watchdog
Distribute memory according to per-server GPU allocation, or react quickly via a watchdog
Icons from the noun project (http://thenounproject.com) - Gregor Cresnar
Working on it…
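The watchdog in 1) can be sketched as a fast polling loop that fires before the kernel OOM killer does. All names, units, and the 90% threshold below are illustrative assumptions, not the production implementation:

```python
# Hypothetical watchdog sketch: poll pod memory faster than the
# cAdvisor/kubelet reporting interval and act before the kernel OOM killer.
def watchdog_scan(pods, limit_bytes, headroom=0.9):
    """Return pods to act on: anything above `headroom` of the memory limit."""
    threshold = limit_bytes * headroom
    return [name for name, used in pods.items() if used > threshold]

def watchdog_loop(poll_fn, act_fn, limit_bytes, ticks):
    # poll_fn() -> {pod: bytes used}; act_fn(pod) evicts or throttles the pod;
    # real code would sleep briefly between ticks instead of counting them
    for _ in range(ticks):
        for pod in watchdog_scan(poll_fn(), limit_bytes):
            act_fn(pod)
```

The key design point is the tight polling interval: the loop's reaction time, not its policy, is what beats the memory-spike growth described above.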
Troubleshooting
• vGPU Issue
- Hardware-vendor dependent, VM only (NVIDIA GRID vGPU)
Data sheet from NVIDIA official document https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
Troubleshooting
• vGPU Issue
- Hardware-vendor dependent, VM only (NVIDIA GRID vGPU)
Image from NVIDIA official document https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
[ vGPU Overall Architecture ]
Solution
Kubernetes Cluster: Servers with GPGPU (×3)
OpenStack Cluster: VMs with vGPGPU (×3)
Training Jobs (×4)