0
Large scale GPU
Cluster for AI
KRnet 2020
23.June.2020(Tue)
조규남
mystous@{naver, gmail}.com
1
mystous@kyunam.com:~$ who am i
• Principal Software Engineer / Software Architect @ Samsung Electronics
• 23 years and counting since I first understood C pointers…
• Working
Private/Public Cloud Solution and Application – VM & Container
Possibility of HPC application on Cloud infrastructure by container cluster
(The 22nd IEEE International Conference on Computational Science and Engineering, 2019)
Time-efficient simulations of tight-binding electronic structures with Intel
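Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016)
Parallelizing a Poisson equation solver with Intel Xeon Phi
(Korea Information Processing Society, 2015 Fall Conference)
Excellence Award, Korea Supercomputing Programming Contest (2015)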
2
Previous Presentation
https://developer.ibm.com/kr/devday2018/
https://www.slideshare.net/ssuser3e70ba/deep-learning-100-high-performance-computing-for-ai
3
Previous Presentation
https://openinfradays.kr/
https://www.slideshare.net/ssuser3e70ba/gpu-k8s-cluster-ml-training-troubleshooting
https://www.youtube.com/watch?v=cabIO2ZHtU8
4
Prologue
Today's Scope
5
Large Scale GPU Cluster for AI
Icons from the noun project (http://thenounprojecct.com) - Chad Remsing, Mohamed Mbarki
GPU Cluster AI Dev Platform
6
Large Scale GPU Cluster for AI
Well Made New Trend
7
Large Scale GPU Cluster for AI
Icons from the noun project (http://thenounprojecct.com) - Chad Remsing, Mohamed Mbarki
GPU Cluster
AI Dev Platform
TODAY
8
Let’s Start
9
AI Basic Sequence
• Machine Learning Basic Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Machine Learning Platform Coverage
AI Engineer
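Data collection
Data cleaning and labeling
Training model selection and reconfiguration
Machine learning (training)
Evaluation
Hyperparameter re-tuning
Use of the trained model
Repeat
Recall from Open Infrastructure & Cloud Native Days Korea 2019 – "ML Training Troubleshooting Using a Large-Scale GPU-Based K8S Cluster" (presented by 조규남)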
10
What for Artificial Intelligence
Machine Learning Platform with GPU Cluster and Pain Points
11
GPU Cluster with HPC Technology
Cluster Hardware
System Software
HPC Technology
Middleware & Management
Infiniband + Ethernet SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
Parallel Framework
Numerical Libraries
System Tool
Development Language
ⓒ Romanzes637@Wikimedia Commons @Wikimedia Commons ⓒ Éducation nationale @Wikimedia Commons
HPC & AI Applications
ML Framework
Hadoop
Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
Revisited from A. Reed, Daniel & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
12
Is it enough?
• Too many pain points
End-to-End Management
: Many versions of datasets, unmanaged hyperparameters, and uncontrolled trained models
Configuration
: Too many ML frameworks, version dependencies, and a huge number of ML architecture versions
Utilization
: Dedicated resources, siloed management
Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
*2 Source) Chrislb - Erstellt von Chrislb
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
13
Machine Learning Platform on GPU Cluster
• Rise of the Machine Learning Platform
1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform
Photo by frank mckenna on Unsplash
Personal PC HPC Platform
Mark by Vladyslav Severyn from the Noun Project
+Performance +Convenience
14
New Trend in AI
Machine Learning Trend
15
A New Wave Has Arrived
16
A New Wave Has Arrived
• New ML Trend
1. More complex neural architectures and more diverse training environments
2. Multi-node distributed training
3. Automated ML (AutoML)
4. Federated Learning
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018.
J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A Survey on Distributed Machine Learning,” Dec. 2019.
B. Wu et al., “FBNET: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, vol. 2019-June, doi: 10.1109/CVPR.2019.01099.
17
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
B. Wu et al., “FBNET: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
2019, vol. 2019-June, doi: 10.1109/CVPR.2019.01099.
18
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
※ M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” May 2019.
As neural network accuracy rises, the number of parameters tends to rise with it.
19
New ML Trend #1
• No. 1: Deeper, more complex neural architectures; more diverse data types and application domains
L. Ben and N. Paco, AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice. O’Reilly, 2019.
No framework qualifies as a de facto standard;
a variety of ML frameworks remain in continued use.
20
New ML Trend #1
01. Platform-level support that keeps up with new technology trends
Fine-grained customization (hardware and software) must be supported, for example through Kubernetes
21
New ML Trend #2
• No. 2 Multi Node Distributed Training
R. Mayer and H.-A. Jacobsen, “Scalable Deep Learning on Distributed Infrastructures,” ACM Comput. Surv., vol. 53, no. 1, pp. 1–37, Feb. 2020.
22
New ML Trend #2
• [149] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing
Systems. MIT Press, 693–701.
• [38] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. 2012. Large scale distributed deep networks. In
Advances in Neural Information Processing Systems. MIT Press, 1223–1231.
• [28] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the straggler problem with bounded staleness. In
Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS’13). USENIX, Santa Ana Pueblo, NM. Retrieved from https://www.usenix.org/conference/hotos13/solving-straggler-
problem-bounded-staleness
• [133] Cyprien Noel and Simon Osindero. 2014. Dogwild!-Distributed hogwild for CPU & GPU. In Proceedings of the NIPS Workshop on Distributed Machine Learning and Matrix Computations.
• [35] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2014.
Exploiting bounded staleness to speed up big data analytics. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley,
CA, 37–48. Retrieved from http://dl.acm.org/citation.cfm?id=2643634.2643639.
• [102] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with
the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 583–598. Retrieved from
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu.
• [37] Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and Eric P. Xing. 2015. High-performance distributed ML at scale through parameter server consistency models. In Proceedings
of the 29th AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 79–87. Retrieved from http://dl.acm.org/citation.cfm?id=2887007.2887019
• [101] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. 2015. MALT: Distributed data-parallelism for existing ML applications. In Proceedings of the 10th European Conference on Computer
Systems (EuroSys’15). ACM, New York, NY. DOI:https://doi.org/10.1145/2741948.2741965
• [204] H. Zhang, C. Hsieh, and V. Akella. 2016. HogWild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In Proceedings of the IEEE 16th International Conference on
Data Mining (ICDM’16). 629–638. DOI:https://doi.org/10.1109/ICDM.2016.0074
• [36] Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. 2016. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In
Proceedings of the 11th European Conference on Computer Systems (EuroSys’16). ACM, New York, NY. DOI:https://doi.org/10.1145/2901318.2901323
• [83] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the ACM International Conference on Management of Data
(SIGMOD’17). ACM, New York, NY, 463–478. DOI:https://doi.org/10.1145/3035918.3035933
• [184] Shaoqi Wang, Wei Chen, Aidi Pi, and Xiaobo Zhou. 2018. Aggressive synchronization with partial processing for iterative ml jobs on clusters. In Proceedings of the 19th International Middleware
Conference (Middleware’18). ACM, New York, NY, 253–265. DOI:https://doi.org/10.1145/3274808.3274828
• [89] Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, and Peter R. Pietzuch. 2019. CROSSBOW: Scaling deep learning with small batch sizes on multi-GPU
servers. Retrieved from http://arxiv.org/abs/1901.02244.
• [19] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe M. Kiddon, Jakub Konecný, Stefano Mazzocchi, Brendan McMahan, Timon Van
Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. In Proceedings of the Conference on Systems and Machine Learning
(SysML’19). Retrieved from https://arxiv.org/abs/1902.01046.
23
New ML Trend #2
• No. 2 Multi Node Distributed Training
Infrastructure Layer: decided at initial deployment and hard to change afterward
Software Layer: can be changed at any time, but the operational impact must be considered
1) Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013, Image Source from http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf
2) Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/, https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/
3) Images from NVIDIA Korea homepage https://www.nvidia.com/ko-kr/data-center/nvlink/
NVIDIA NCCL
Self-service environment configuration and optimization are needed
so that the state of the art can be applied continuously.
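The software layer of multi-node training can be illustrated with a minimal sketch of synchronous data-parallel SGD. It is only an illustration: `all_reduce_mean` is a hypothetical stand-in for a collective such as the one NCCL provides, the "workers" are plain Python lists, and the model (fitting y = w·x) is an assumption chosen to keep the arithmetic easy to follow.

```python
from typing import List, Tuple

Sample = Tuple[float, float]

def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error for y ≈ w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for a collective all-reduce: every worker receives the mean."""
    return sum(values) / len(values)

def train(shards: List[List[Sample]], steps: int = 200, lr: float = 0.01) -> float:
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers apply the same averaged gradient, so replicas stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
shards = [data[i::4] for i in range(4)]            # 4 workers, 2 samples each
w = train(shards)
```

With equally sized shards the averaged gradient equals the full-batch gradient, which is why the network path between nodes, rather than the arithmetic, becomes the limiting factor as the number of workers grows.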
24
New ML Trend #2
01. Platform-level support that keeps up with new technology trends
Fine-grained customization (hardware and software) must be supported, for example through Kubernetes
02. Support for diverse node and network configurations
Users need to set up the framework and network topology they want through self-service
25
New ML Trend #3
• No. 3 Automated ML (AutoML)
Automated ML
NAS (Neural Architecture Search)
Feature Engineering
Hyper-parameter Optimization
26
New ML Trend #3
• Hyperparameter Optimization
- Hyperparameter optimization methods supported by NNI
(https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md)
With a variety of automation methods, repeated model training is mechanized
and automated, producing tens to hundreds of results at the same time.
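A minimal sketch of why automated tuning multiplies results: random search, one of the simplest strategies, already produces one result per trial. This does not use the NNI API; `validation_loss` is a hypothetical stand-in for a full training run, and the search ranges are assumptions.

```python
import random

def validation_loss(lr: float, batch_size: int) -> float:
    """Hypothetical stand-in for training a model and measuring its loss."""
    return (lr - 0.1) ** 2 + ((batch_size - 64) / 256) ** 2

def random_search(trials: int, seed: int = 0):
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        params = {
            "lr": 10 ** rng.uniform(-4, 0),                    # log-uniform in [1e-4, 1]
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        results.append((validation_loss(**params), params))
    # Sort by loss only, so the parameter dicts are never compared.
    return sorted(results, key=lambda r: r[0])

results = random_search(100)
best_loss, best_params = results[0]
```

Each of the 100 trials corresponds to one trained model whose parameters, metrics, and artifacts the platform has to keep track of.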
27
New ML Trend #3
• Neural Architecture Search
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
While neural network results remain hard to explain, networks generated automatically
by random search or reinforcement learning are overtaking human design effort.
28
New ML Trend #3
• Neural Architecture Search
X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019.
The accuracy of networks developed with NAS is already at a usable level.
29
New ML Trend #3
• [4] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning.” [Online]. Available: http://arxiv.org/abs/1611.01578
• [5] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” vol. ICML. [Online]. Available: http://arxiv.org/abs/1802.03268
• [7] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition.” [Online]. Available: http://arxiv.org/abs/1707.07012
• [8] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu, “Practical block-wise neural network architecture generation.” [Online]. Available: http://arxiv.org/abs/1708.05552
• [9] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search.” [Online]. Available: http://arxiv.org/abs/1806.09055
• [10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search.” [Online]. Available:
http://arxiv.org/abs/1712.00559
• [11] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” in ICLR, p.
• [15] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” vol. ICLR. [Online]. Available: http://arxiv.org/abs/1611.02167
• [17] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin, “Large-scale evolution of image classifiers.” [Online]. Available: http://arxiv.org/abs/1703.01041
• [18] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search.” [Online]. Available: http://arxiv.org/abs/1802.01548
• [19] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective neural architecture search via lamarckian evolution.” [Online]. Available: http://arxiv.org/abs/1804.09081
• [20] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to designing convolutional neural network architectures.” [Online]. Available: http://arxiv.org/abs/1704.00764
• [21] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat, “Evolving deep neural networks.” [Online]. Available:
http://arxiv.org/abs/1703.00548
• [22] L. Xie and A. Yuille, “Genetic CNN,” vol. ICCV. [Online]. Available: http://arxiv.org/abs/1703.01513
• [94] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
• [96] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 1294–1303.
• [123] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” arXiv preprint arXiv:1812.09926, 2018.
• [125] G. D. H. Andrew Hundt, Varun Jain, “sharpDARTS: Faster and More Accurate Differentiable Architecture Search,” Tech. Rep. [Online]. Available: https://arxiv.org/pdf/1903.09900.pdf
• [142] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu, “Neural architecture optimization,” in Advances in neural information processing systems, 2018, pp. 7816–7827.
• [143] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” arXiv preprint arXiv:1902.07638, 2019.
30
New ML Trend #3
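• Neural Architecture Search (reinforcement-learning based)
Architecture search: searching for network structures via reinforcement learning (NAS)
Architecture evaluation: deep learning training, refining the model via deep learning
Controller (agent) and Environment
Candidate operations: Conv3x3, sep5x5, Max3x3, Conv5x5, concat
<<Action>>
Network architecture A, probability P
<<Reward, state>>
Change in probability P and
the change driven by R
Target prediction accuracy R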
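The controller/environment loop above can be sketched as a toy policy-gradient search. This is an illustration under loud assumptions, not the method of any cited paper: the controller chooses a single operation instead of a whole architecture, `FAKE_ACCURACY` is a made-up table that replaces actual training in the environment, and the update is a plain REINFORCE step with a moving-average baseline.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "sep5x5", "max3x3"]
# Hypothetical accuracies that stand in for "train the candidate and evaluate it".
FAKE_ACCURACY = {"conv3x3": 0.70, "conv5x5": 0.60, "sep5x5": 0.95, "max3x3": 0.30}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def search(steps: int = 5000, lr: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    logits = [0.0] * len(OPS)   # controller parameters
    baseline = None             # running average of rewards
    for _ in range(steps):
        probs = softmax(logits)
        # <<Action>>: sample an operation with probability P.
        action = rng.choices(range(len(OPS)), weights=probs)[0]
        # <<Reward>>: accuracy R reported by the environment.
        reward = FAKE_ACCURACY[OPS[action]]
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        # REINFORCE: gradient of log P(action) with respect to each logit.
        for i in range(len(OPS)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
        baseline = 0.9 * baseline + 0.1 * reward
    return dict(zip(OPS, softmax(logits)))

probs = search()
best_op = max(probs, key=probs.get)
```

In this toy setting the controller tends to concentrate probability on the operation with the highest stand-in accuracy; in a real search every reward would cost a full training run, which is where the GPU cluster comes in.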
31
New ML Trend #3
• Neural Architecture Search (NASNet)
[Figure: NASNet architecture. The overall network consists of a 3x3 conv (stride 2), stacked Reduction Cells and Normal Cells, and a final Softmax, with repeat counts × 2, × 6, × 6, × 6. One detail panel shows the Reduction Cell (sep 7x7, sep 5x5, sep 3x3, max 3x3, avg 3x3, and identity branches joined by add nodes and a final concat); the other shows the Normal Cell (sep 3x3, sep 5x5, avg 3x3, and identity branches joined by add nodes and a final concat). Each cell takes h(i-1) and h(i) from the previous cells and outputs h(i+1).]
Revisited from B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
32
New ML Trend #3
• AutoML Applied Machine Learning Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon
Machine Learning Platform Coverage
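Data collection
Data cleaning and labeling
Training model construction and refinement
Machine learning (training)
Evaluation
Hyperparameter re-tuning
Use of the trained model
Repeat
Repeat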
33
New ML Trend #3
ALGORITHMIA: 2020 state of enterprise machine learning(https://cdn2.hubspot.net/hubfs/2631050/0284%20CDAO%20FS/Algorithmia_2020_State_of_Enterprise_ML.pdf)
Models multiply in many ways, and the burden of managing them keeps growing.
34
New ML Trend #3
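01. Platform-level support that keeps up with new technology trends
Fine-grained customization (hardware and software) must be supported, for example through Kubernetes
02. Support for diverse node and network configurations
Users can select the framework and network topology they want through self-service
03. Automated training and management of artifacts such as data and models
Support for versioning, configuration management, history tracking, and other management functions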
35
New ML Trend #4
• No. 4 Federated Learning
T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018.
36
New ML Trend #4
• Why is Federated Learning needed?
Privacy Issue Data Regulation
37
New ML Trend #4
• No. 4 Federated Learning
Park, Jihong & Samarakoon, Sumudu & Bennis, Mehdi & Debbah, Mérouane. (2018). Wireless Network Intelligence at the Edge.
Neural networks are no longer trained only on servers or in the cloud;
training also runs on devices, and the results are exchanged between them.
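One common way to combine models trained on devices is weighted averaging of their parameters (the FedAvg scheme). The slides do not say which aggregation rule is used, so treat this as an assumption; `federated_average` and the two-client example are hypothetical.

```python
from typing import List, Tuple

ClientUpdate = Tuple[List[float], int]  # (weight vector, number of local samples)

def federated_average(updates: List[ClientUpdate]) -> List[float]:
    """Average client weights, weighting each client by its local sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# Two devices report locally trained weights; raw data never leaves the device.
clients = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
global_weights = federated_average(clients)
```

Only weight vectors and sample counts cross the network, which is what lets this style of training coexist with the privacy and data regulation constraints.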
38
New ML Trend #4
• Federated Learning Applied Machine Learning Flow
All icon from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon, Adrien Coquet, LAFS
Machine Learning Platform Coverage
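Data collection
Data cleaning and labeling
Training model construction and refinement
Machine learning (training)
Evaluation
Hyperparameter re-tuning
Use of the trained model
Repeat
Repeat
Training on personal data
Device model aggregation
Repeat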
39
New ML Trend #4
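01. Platform-level support that keeps up with new technology trends
Fine-grained customization (hardware and software) must be supported, for example through Kubernetes
02. Support for diverse node and network configurations
Users can select the framework and network topology they want through self-service
03. Automated training and management of artifacts such as data and models
Support for versioning, configuration management, history tracking, and other management functions
04. Connections outside the platform and between platforms are essential
Connecting to and managing dozens of device types and hundreds of thousands of devices; linking systems across regions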
40
Fully Managed End-to-End ML Platform
Management
Machine Learning Platform
Developer Experience
Hardware
Icons from the noun project (http://thenounprojecct.com) - Luis Prado, Mello, Product Pencil, Gan Khoon Lay, pxLens, Adrien Coquet, Chad Remsing, Bartama Graphic, Creative Stall, Angelina, Alfredo @ IconsAlfredo.com, Eucalyp, Adi Kurniawan
Data Management
Model Management
Training Management
Hyperparameter Optimization
Neural Architecture Search
H/W Optimization
Network as Code
Connect Device
Configurable Environment
Keep Behavior
41
Fully Managed End-to-End ML Platform
Machine learning is no longer satisfied by SOTA alone.
The platform must be able to manage the latest ML training techniques (SOTA architectures, Automated ML);
→ it must manage, reproduce, and reuse trained results together with
the training environment, training code, hyperparameters, and dataset version;
and training must span multiple connected physical environments rather than a single one.
Fully Managed End-to-End ML Platform
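Reproducing a result requires that environment, code, hyperparameters, and dataset version are recorded together. A minimal sketch of that idea is a run manifest with a deterministic fingerprint; the field names and values below are hypothetical, not part of any platform described in the talk.

```python
import hashlib
import json

def run_fingerprint(manifest: dict) -> str:
    """Deterministic short ID for a training run, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

manifest = {
    "environment": {"image": "example/train:1.0"},     # hypothetical container image
    "code": "git:abc1234",                             # hypothetical commit reference
    "hyperparameters": {"lr": 0.01, "batch_size": 64},
    "dataset_version": "v3",
}
run_id = run_fingerprint(manifest)
```

Two runs with the same fingerprint were configured identically, and any change to the dataset version or a hyperparameter yields a different ID, which gives model management a stable key to index results by.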
42
Overall Architecture
Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing,
Revisited from A. Reed, Daniel & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
Cluster Hardware
System Software
HPC & AI Technology
Middleware & Management
Infiniband + Ethernet SAN + Local Node Storage
Linux OS variant
GPGPU or Accelerators
Parallel Framework
Numerical Libraries
System Tool
Development Language
Training Algorithm
ML Framework
Hadoop
Platform Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management
*1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0
*2 Source) Chrislb - Erstellt von Chrislb
43
Troubleshooting
Problems you may run into
44
Basic Architecture
• Things to consider in advance
Storage
Management Servers (×3)
Servers with GPGPU (×4)
Kubernetes Cluster
Management Modules (×4), Training Job (×6)
Preprocessing (×6)
.yaml templating
RDMA-SRIOV plug-in
NVIDIA-peer-memory package
Pre-processing before Training Task execution
Docker insecure registry
Docker unlock memory limit
Persistent Volume Mount
Multi-tenant management
Unified timezone
45
Basic Architecture
• Kubernetes-based Machine Learning Platform
Storage
Management Servers (×3)
Servers with GPGPU (×4)
Kubernetes Cluster
Management Modules (×4), Training Job (×6)
Preprocessing (×6)
Issues marked on the architecture:
Storage issue
Insufficient data feeding speed
Multi-GPU handling
Data locality
CNI overhead
Unsuitable Pod scheduler
Dynamic Pod configuration and operation
Direct calls to the kubectl command
No vGPU
Inter-server communication overhead
Resource management
Container root privilege
46
Troubleshooting
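• Storage Issue
- NFS reaches its performance limit sooner than it does in enterprise server workloads
By nature, ML training reads a large number of files
Files processed after training amount to N times the number of original training data files
Users train with similar or identical datasets
Solution
1) Minimize original copies by managing training dataset metadata
2) Dataset lifetime control is essential; use object storage
A library must be provided for training code written against file-system access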
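Solution 1) can be sketched as a catalog that keys stored data by content hash, so identical datasets registered by different users map to one stored copy. `DatasetCatalog` is hypothetical, and the in-memory dictionary stands in for an object store.

```python
import hashlib
from typing import Dict, Tuple

class DatasetCatalog:
    """Metadata catalog: many (user, name) entries may point at one stored blob."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}            # digest -> content (object-store stand-in)
        self._names: Dict[Tuple[str, str], str] = {}  # (user, dataset name) -> digest

    def register(self, user: str, name: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(digest, content)       # store only if not already present
        self._names[(user, name)] = digest
        return digest

    def read(self, user: str, name: str) -> bytes:
        return self._blobs[self._names[(user, name)]]

    def stored_copies(self) -> int:
        return len(self._blobs)

catalog = DatasetCatalog()
catalog.register("alice", "images-train", b"same training data")
catalog.register("bob", "my-copy", b"same training data")
```

Here two users hold their own dataset names while the store keeps a single copy; lifetime control could then be handled at the metadata level, for example by deleting a blob once no name refers to it.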
47
Troubleshooting
• Resource Management & Unsuitable Pod Scheduler Issue
- 1) GPGPU machine resource fragmentation
→ Kubernetes resource affinity spreads computation across nodes, which makes multi-GPU scheduling difficult
- 2) Abusive users
→ Resource hoarding and low utilization
Solution
1) a. Develop and apply a custom Kubernetes scheduler; adjust resource affinity
b. Offer a range of resource packings, e.g., 16 GPGPU = 1*16, 2*8, 4*4, 8*2
2) a. Fair-share scheduling and quota consumption
b. Planned: preemption scheduler for GPGPU
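The packing and fragmentation points can be illustrated with a small sketch. It assumes that "1*16, 2*8, 4*4, 8*2" reads as workers × GPUs per worker, and it uses a best-fit placement rule as one possible counter to spreading; neither function reflects the actual custom scheduler, and the node names are made up.

```python
from typing import Dict, List, Optional, Tuple

def packings(total_gpus: int, min_per_worker: int = 2) -> List[Tuple[int, int]]:
    """All (workers, gpus_per_worker) splits that use exactly total_gpus."""
    return [
        (workers, total_gpus // workers)
        for workers in range(1, total_gpus + 1)
        if total_gpus % workers == 0 and total_gpus // workers >= min_per_worker
    ]

def best_fit(free: Dict[str, int], request: int) -> Optional[str]:
    """Place a request on the fitting node with the fewest free GPUs.

    Filling nearly-full nodes first keeps large contiguous blocks available,
    whereas spreading tends to leave every node partially used.
    """
    candidates = [(gpus, node) for node, gpus in free.items() if gpus >= request]
    if not candidates:
        return None
    _, node = min(candidates)
    free[node] -= request
    return node

options = packings(16)
free = {"node-a": 8, "node-b": 3, "node-c": 5}
small_job = best_fit(free, 2)   # goes to the node with the least spare capacity
large_job = best_fit(free, 8)   # still fits because node-a was left untouched
```

Had the 2-GPU job been spread onto `node-a`, the 8-GPU job would no longer fit on any single node even though 14 GPUs remained free in total.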
48
Troubleshooting
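• Kubernetes Out of Memory (OOM) Issue
- 1) Configuration differs from VM solutions such as OpenStack
Resource overcommit is not possible (CPU, RAM, etc.)
- 2) Kubernetes default option = swap off
Pods must share CPU and memory, and paging is not possible
The memory growth that causes an OOM outpaces the communication interval between cAdvisor and kubelet
Solution
1) Partitioned allocation, watchdog
Allocate resources in proportion to the number of GPUs assigned within a server, or act quickly through a watchdog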
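A watchdog of this kind can be sketched as a check that runs more often than the regular monitoring interval and flags pods close to their memory limit. Everything here is hypothetical: `read_usage` is an injected callable that would, in a real deployment, read per-pod memory counters, and the 90% threshold is an arbitrary choice.

```python
from typing import Callable, Dict, List

def watchdog_pass(
    read_usage: Callable[[], Dict[str, int]],
    limits: Dict[str, int],
    threshold: float = 0.9,
) -> List[str]:
    """Return the pods whose memory use has reached threshold * limit."""
    usage = read_usage()
    return sorted(
        pod
        for pod, used in usage.items()
        if pod in limits and used >= threshold * limits[pod]
    )

# Fake reader for illustration; values are in MiB.
def fake_usage() -> Dict[str, int]:
    return {"job-a": 950, "job-b": 400}

limits = {"job-a": 1000, "job-b": 1000}
at_risk = watchdog_pass(fake_usage, limits)
```

A surrounding loop would call `watchdog_pass` at a short, fixed interval and checkpoint or evict the flagged pods before the kernel OOM killer acts.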
Icons from the noun project (http://thenounprojecct.com) - Gregor Cresnar
Working on it…
49
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Data sheet from NVIDIA official document https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
50
Troubleshooting
• vGPU Issue
- Hardware vendor dependence, VM only (NVIDIA GRID vGPU)
Image from NVIDIA official document https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html
[ vGPU Overall Architecture ]
Solution
Servers with GPGPU (×3)
Kubernetes Cluster
OpenStack Cluster
VM with vGPGPU (×3)
Training Job (×4)
51
Question?
Still curious?
52

More Related Content

What's hot

Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixBill Liu
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9inside-BigData.com
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Google edge tpu
Google edge tpuGoogle edge tpu
Google edge tpuRouyun Pan
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer ModelsDatabricks
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowDatabricks
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
Hardware architecture of Summit Supercomputer
 Hardware architecture of Summit Supercomputer Hardware architecture of Summit Supercomputer
Hardware architecture of Summit SupercomputerVigneshwarRamaswamy
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for researchEsteban Hernandez
 
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super AffordableSupermicro AI Pod that’s Super Simple, Super Scalable, and Super Affordable
Supermicro AI Pod that’s Super Simple, Super Scalable, and Super AffordableRebekah Rodriguez
 
Uncover the mysteries of infrastructure as code (iac)!
Uncover the mysteries of infrastructure as code (iac)!Uncover the mysteries of infrastructure as code (iac)!
Uncover the mysteries of infrastructure as code (iac)!Prashant Kalkar
 

What's hot (20)

Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Google edge tpu
Google edge tpuGoogle edge tpu
Google edge tpu
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 

GPU Cluster Platform for Distributed Deep Learning

  • 1. 0 Large scale GPU Cluster for AI KRnet 2020 23.June.2020(Tue) 조규남 mystous@{naver, gmail}.com
  • 2. 1 mystous@kyunam.com:~$ who am i • Principal Software Engineer / Software Architect @ Samsung Electronics • 23 years and counting since first understanding C pointers… • Working: Private/Public Cloud Solution and Application – VM & Container • Possibility of HPC application on Cloud infrastructure by container cluster (The 22nd IEEE International Conference on Computational Science and Engineering, 2019) • Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi™ many-core processors (Computer Physics Communications, vol. 209, 2016) • Parallelization of a Poisson equation solver using Intel Xeon Phi (Korea Information Processing Society, 2015 Fall Conference) • Korea Supercomputing Programming Contest, Excellence Award (2015)
  • 6. 5 Large Scale GPU Cluster for AI Icons from the noun project (http://thenounprojecct.com) - Chad Remsing, Mohamed Mbarki GPU Cluster AI Dev Platform
  • 7. 6 Large Scale GPU Cluster for AI Well Made New Trend
  • 8. 7 Large Scale GPU Cluster for AI Icons from the noun project (http://thenounprojecct.com) - Chad Remsing, Mohamed Mbarki GPU Cluster AI Dev Platform TODAY
  • 10. 9 AI Basic Sequence • Machine Learning Basic Flow: data collection → data cleansing and labeling → model selection and reconfiguration → machine learning (training) → evaluation → hyperparameter re-tuning → use of the trained model → repeat. Machine Learning Platform Coverage. AI Engineer. All icons from the noun project (http://thenounprojecct.com) - National Park service, Yazmin Alanis, Bakunetsu Kaito, Gan Khoon Lay, ProSymbols, pxLens, Matt Hawdon. Recall from Open Infrastructure & Cloud Native Days Korea 2019 – ML Training Troubleshooting on a Large-Scale GPU-Based K8S Cluster (presented by 조규남)
  • 11. 10 What for Artificial Intelligence Machine Learning Platform with GPU Cluster and Pain points
  • 12. 11 GPU Cluster with HPC Technology Cluster Hardware System Software HPC Technology Middleware & Management Infiniband + Ethernet SAN + Local Node Storage Linux OS variant GPGPU or Accelerators Parallel Framework Numerical Libraries System Tool Development Language ⓒ Romanzes637@Wikimedia Commons @Wikimedia Commons ⓒ Éducation nationale @Wikimedia Commons HPC & AI Applications ML Framework Hadoop Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing. Revisited from Reed, Daniel A. & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414.
  • 13. 12 Is it enough? • Too many pain points • End to End Management: various versions of data sets, unmanaged hyperparameters, and uncontrolled trained models • Configuration: too many ML frameworks, version dependencies, and a huge number of ML architecture versions • Utilization: dedicated resources, silo management. Image from https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699 *1 Source) https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0 *2 Source) Chrislb - Erstellt von Chrislb. Some icons from the noun project (http://thenounprojecct.com) - Creaticca Creative Agency, Chad Remsing. Recall from Open Infrastructure & Cloud Native Days Korea 2019 – ML Training Troubleshooting on a Large-Scale GPU-Based K8S Cluster (presented by 조규남)
  • 14. 13 Machine Learning Platform on GPU Cluster • Rise of the Machine Learning Platform: 1) Laptop, 2) High Performance Computing [HPC], 3) Machine Learning Platform. Personal PC → HPC (+Performance) → Platform (+Convenience). Photo by frank mckenna on Unsplash. Mark by Vladyslav Severyn from the Noun Project. Recall from Open Infrastructure & Cloud Native Days Korea 2019 – ML Training Troubleshooting on a Large-Scale GPU-Based K8S Cluster (presented by 조규남)
  • 15. 14 New Trend in AI Machine Learning Trend
  • 16. 15 New wave has been here
  • 17. 16 New wave has been here • New ML Trend: 1) More complex neural architectures and diversified training environments 2) Multi Node Distributed Training 3) Automated ML (AutoML) 4) Federated Learning. X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019. T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A Survey on Distributed Machine Learning,” Dec. 2019. B. Wu et al., “FBNET: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, vol. 2019-June, doi: 10.1109/CVPR.2019.01099.
  • 18. 17 New ML Trend #1 • No. 1 Deeper and more complex neural architectures; diversified data types and application areas. B. Wu et al., “FBNET: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, vol. 2019-June, doi: 10.1109/CVPR.2019.01099.
  • 19. 18 New ML Trend #1 • No. 1 Deeper and more complex neural architectures; diversified data types and application areas. ※ M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” May 2019. As the accuracy of a neural network rises, its parameter count tends to rise with it.
  • 20. 19 New ML Trend #1 • No. 1 Deeper and more complex neural architectures; diversified data types and application areas. B. Lorica and P. Nathan, AI Adoption in the Enterprise: How Companies Are Planning and Prioritizing AI Projects in Practice. O’Reilly, 2019. No framework qualifies as a de facto standard, and a variety of ML frameworks remain in continued use.
  • 21. 20 New ML Trend #1 01. Platform-level support that keeps pace with new technology trends: fine-grained customization (hardware and software) is needed, for example through Kubernetes.
  • 22. 21 New ML Trend #2 • No. 2 Multi Node Distributed Training R. Mayer and H.-A. Jacobsen, “Scalable Deep Learning on Distributed Infrastructures,” ACM Comput. Surv., vol. 53, no. 1, pp. 1–37, Feb. 2020.
  • 23. 22 New ML Trend #2
      • [149] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. MIT Press, 693–701.
      • [38] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. MIT Press, 1223–1231.
      • [28] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. 2013. Solving the straggler problem with bounded staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS’13). USENIX, Santa Ana Pueblo, NM. Retrieved from https://www.usenix.org/conference/hotos13/solving-straggler-problem-bounded-staleness
      • [133] Cyprien Noel and Simon Osindero. 2014. Dogwild!-Distributed hogwild for CPU & GPU. In Proceedings of the NIPS Workshop on Distributed Machine Learning and Matrix Computations.
      • [35] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. 2014. Exploiting bounded staleness to speed up big data analytics. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley, CA, 37–48. Retrieved from http://dl.acm.org/citation.cfm?id=2643634.2643639.
      • [102] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 583–598. Retrieved from https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu.
      • [37] Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth Gibson, and Eric P. Xing. 2015. High-performance distributed ML at scale through parameter server consistency models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 79–87. Retrieved from http://dl.acm.org/citation.cfm?id=2887007.2887019
      • [101] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. 2015. MALT: Distributed data-parallelism for existing ML applications. In Proceedings of the 10th European Conference on Computer Systems (EuroSys’15). ACM, New York, NY. DOI:https://doi.org/10.1145/2741948.2741965
      • [204] H. Zhang, C. Hsieh, and V. Akella. 2016. HogWild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM’16). 629–638. DOI:https://doi.org/10.1109/ICDM.2016.0074
      • [36] Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. 2016. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the 11th European Conference on Computer Systems (EuroSys’16). ACM, New York, NY. DOI:https://doi.org/10.1145/2901318.2901323
      • [83] Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware distributed parameter servers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’17). ACM, New York, NY, 463–478. DOI:https://doi.org/10.1145/3035918.3035933
      • [184] Shaoqi Wang, Wei Chen, Aidi Pi, and Xiaobo Zhou. 2018. Aggressive synchronization with partial processing for iterative ml jobs on clusters. In Proceedings of the 19th International Middleware Conference (Middleware’18). ACM, New York, NY, 253–265. DOI:https://doi.org/10.1145/3274808.3274828
      • [89] Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, and Peter R. Pietzuch. 2019. CROSSBOW: Scaling deep learning with small batch sizes on multi-GPU servers. Retrieved from http://arxiv.org/abs/1901.02244.
      • [19] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe M. Kiddon, Jakub Konecný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. In Proceedings of the Conference on Systems and Machine Learning (SysML’19). Retrieved from https://arxiv.org/abs/1902.01046.
  • 24. 23 New ML Trend #2 • No. 2 Multi Node Distributed Training • Infrastructure Layer: decided at the initial adoption stage and difficult to change afterwards • Software Layer: can be changed at any time, but operational aspects must be considered. 1) Mellanox. Accelerating High Performance Computing with GPUDirect RDMA. GTC 2013, Image Source from http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-gpudirect-rdma.pdf 2) Images from Microway homepage https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/octoputer-8-gpu-with-dual-root-tesla-v100/, https://www.microway.com/product/octoputer-4u-8-gpu-server-2-5-drives/ 3) Images from NVIDIA Korea homepage https://www.nvidia.com/ko-kr/data-center/nvlink/ NVIDIA NCCL. Self-service environment configuration and optimization are needed so that the state of the art can be applied continuously.
  • 25. 24 New ML Trend #2 01. Platform-level support that keeps pace with new technology trends: fine-grained customization (hardware and software) is needed, for example through Kubernetes. 02. Support for diverse node and network configurations: users need to set up the framework and network topology they want through self-service.
  • 26. 25 New ML Trend #3 • No. 3 Automated ML (AutoML) NAS(Neural Architecture Search) Automated ML Feature Engineering Hyper-parameter Optimization
  • 27. 26 New ML Trend #3 • Hyperparameter Optimization - Hyperparameter optimization methods supported by NNI (https://github.com/microsoft/nni/blob/master/docs/en_US/Tuner/BuiltinTuner.md). Various automation methods mechanize and automate the repetition of model training, producing tens to hundreds of results at the same time.
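The automated repetition of training runs mentioned on this slide can be illustrated with plain random search, one of the simplest hyperparameter optimization strategies. This is a minimal sketch and does not use NNI's API: the search space, the `toy_trial` scoring function, and all names below are hypothetical stand-ins, and a real trial would train and validate a model.

```python
import math
import random

# Hypothetical search space; names and ranges are illustrative assumptions.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),    # sampled log-uniformly
    "batch_size": [16, 32, 64, 128],  # sampled uniformly from the choices
}


def sample_config(rng):
    """Draw one hyperparameter configuration from the search space."""
    low, high = SEARCH_SPACE["learning_rate"]
    lr = math.exp(rng.uniform(math.log(low), math.log(high)))
    return {
        "learning_rate": lr,
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


def toy_trial(config):
    """Stand-in for one training run; returns a made-up validation score."""
    lr_penalty = abs(math.log10(config["learning_rate"]) + 2)  # best near 1e-2
    bs_penalty = abs(config["batch_size"] - 64) / 64           # best at 64
    return 1.0 - 0.1 * lr_penalty - 0.1 * bs_penalty


def random_search(num_trials, seed=0):
    """Run num_trials independent trials and return (best, all_trials)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        config = sample_config(rng)
        trials.append((config, toy_trial(config)))
    best = max(trials, key=lambda t: t[1])
    return best, trials


if __name__ == "__main__":
    best, trials = random_search(50)
    print(len(trials), "trials, best config:", best[0])
```

Because the trials are independent of each other, they can be dispatched to separate GPUs or nodes at once, which is why this kind of workload yields many results simultaneously on a cluster.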
  • 28. 27 New ML Trend #3 • Neural Architecture Search X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019. While neural network results remain hard to explain, automatic network generation driven by randomness or reinforcement learning is outpacing human effort.
  • 29. 28 New ML Trend #3 • Neural Architecture Search X. He, K. Zhao, and X. Chu, “AutoML: A Survey of the State-of-the-Art,” Aug. 2019. The accuracy of networks designed by NAS is already at a usable level.
  • 30. 29 New ML Trend #3
      • [4] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning.” [Online]. Available: http://arxiv.org/abs/1611.01578
      • [5] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” vol. ICML. [Online]. Available: http://arxiv.org/abs/1802.03268
      • [7] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition.” [Online]. Available: http://arxiv.org/abs/1707.07012
      • [8] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu, “Practical block-wise neural network architecture generation.” [Online]. Available: http://arxiv.org/abs/1708.05552
      • [9] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search.” [Online]. Available: http://arxiv.org/abs/1806.09055
      • [10] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search.” [Online]. Available: http://arxiv.org/abs/1712.00559
      • [11] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” in ICLR, p.
      • [15] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” vol. ICLR. [Online]. Available: http://arxiv.org/abs/1611.02167
      • [17] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. Le, and A. Kurakin, “Large-scale evolution of image classifiers.” [Online]. Available: http://arxiv.org/abs/1703.01041
      • [18] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search.” [Online]. Available: http://arxiv.org/abs/1802.01548
      • [19] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective neural architecture search via lamarckian evolution.” [Online]. Available: http://arxiv.org/abs/1804.09081
      • [20] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to designing convolutional neural network architectures.” [Online]. Available: http://arxiv.org/abs/1704.00764
      • [21] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat, “Evolving deep neural networks.” [Online]. Available: http://arxiv.org/abs/1703.00548
      • [22] L. Xie and A. Yuille, “Genetic CNN,” vol. ICCV. [Online]. Available: http://arxiv.org/abs/1703.01513
      • [94] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
      • [96] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1294–1303.
      • [123] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” arXiv preprint arXiv:1812.09926, 2018.
      • [125] A. Hundt, V. Jain, and G. D. Hager, “sharpDARTS: Faster and More Accurate Differentiable Architecture Search,” Tech. Rep. [Online]. Available: https://arxiv.org/pdf/1903.09900.pdf
      • [142] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu, “Neural architecture optimization,” in Advances in neural information processing systems, 2018, pp. 7816–7827.
      • [143] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” arXiv preprint arXiv:1902.07638, 2019.
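  • 31. 30 New ML Trend #3 • Neural Architecture Search (reinforcement learning based) • [Diagram] Two stages: neural architecture search (NAS) by reinforcement learning, covering architecture search and architecture evaluation, followed by model refinement through deep learning training. The controller (agent) sends an <<action>> to the environment: a network architecture A chosen with probability P, built from operations such as Conv3x3, Conv5x5, sep5x5, Max3x3, and concat. The environment returns <<reward, state>>: the change in probability P and the variation driven by R, where R is the target prediction accuracy.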
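The controller and environment loop on this slide can be sketched as a toy REINFORCE search. This is a minimal illustration under stated assumptions, not the method used in the talk or in the cited papers: `NUM_LAYERS`, `mock_accuracy`, the learning rate, and the moving-average baseline are all hypothetical, and the "environment" is a stand-in function instead of real training.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "sep5x5", "max3x3"]  # operation names taken from the slide
NUM_LAYERS = 3  # hypothetical: number of positions the controller fills


def softmax(logits):
    """Turn raw scores into a probability distribution P."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Action: sample one operation index per layer from the controller policy."""
    return [rng.choices(range(len(OPS)), weights=softmax(row))[0] for row in logits]


def mock_accuracy(arch):
    """Stand-in for training and evaluation (assumption: sep5x5 is always best)."""
    best = OPS.index("sep5x5")
    return sum(1.0 for op in arch if op == best) / len(arch)


def reinforce_step(logits, arch, reward, baseline, lr=0.5):
    """Policy-gradient update: raise the log-probability of the sampled ops
    in proportion to (reward - baseline)."""
    advantage = reward - baseline
    for row, chosen in zip(logits, arch):
        probs = softmax(row)
        for i in range(len(OPS)):
            grad = (1.0 if i == chosen else 0.0) - probs[i]
            row[i] += lr * advantage * grad


def search(steps=300, seed=0):
    """Run the controller/environment loop and return the most likely op per layer."""
    rng = random.Random(seed)
    logits = [[0.0] * len(OPS) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        arch = sample_architecture(logits, rng)          # <<action>>
        reward = mock_accuracy(arch)                     # <<reward>>
        reinforce_step(logits, arch, reward, baseline)   # update P
        baseline = 0.9 * baseline + 0.1 * reward         # moving-average baseline
    return [OPS[max(range(len(OPS)), key=row.__getitem__)] for row in logits]
```

In a real system `mock_accuracy` would be replaced by a full training job, which is why NAS multiplies the number of training runs a GPU cluster has to schedule.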
  • 32. 31 New ML Trend #3 • Neural Architecture Search (NASNet) • [Figure: NASNet architecture. A 3x3 conv (stride 2) stem is followed by alternating Reduction Cells and Normal Cells and ends in a Softmax layer; repeat counts shown in the figure are x2, x6, x6, x6. Both cell types take the hidden states h(i-1) and h(i) from the previous cells and produce h(i+1). The Reduction Cell combines sep 7x7, sep 5x5, sep 3x3, max 3x3, avg 3x3, and identity branches; the Normal Cell combines sep 3x3, sep 5x5, avg 3x3, and identity branches. Branches are merged pairwise with add and then joined with concat.] Revisited from B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
  • 33. 32 New ML Trend #3 • Machine Learning Flow with AutoML Applied • Data collection → data cleansing and labeling → model construction and refinement → training → evaluation → hyperparameter re-tuning → model use, with two "repeat" loops in the flow. [Label: Machine Learning Platform Coverage]
  • 34. 33 New ML Trend #3 • Models now multiply in many different ways, and the burden of managing them keeps growing. Source: ALGORITHMIA, 2020 State of Enterprise Machine Learning (https://cdn2.hubspot.net/hubfs/2631050/0284%20CDAO%20FS/Algorithmia_2020_State_of_Enterprise_ML.pdf)
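  • 35. 34 New ML Trend #3
    01. Platform-level support that keeps up with new technology trends: fine-grained customization of Kubernetes and related components is needed (hardware and software).
    02. Support for diverse node and network configurations: users pick the framework and network topology they want through self-service.
    03. Automated training and management of artifacts such as data and models: versioning, configuration management, history management, and other management functions.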
  • 36. 35 New ML Trend #4 • No. 4 Federated Learning T. Yang et al., “Applied Federated Learning: Improving Google Keyboard Query Suggestions,” Dec. 2018.
  • 37. 36 New ML Trend #4 • Why is Federated Learning needed? • Privacy issues • Data regulation
  • 38. 37 New ML Trend #4 • No. 4 Federated Learning • Neural network training no longer happens only on servers or in the cloud: training also runs on devices, and the results are exchanged between them. Source: Park, Jihong & Samarakoon, Sumudu & Bennis, Mehdi & Debbah, Mérouane. (2018). Wireless Network Intelligence at the Edge.
  • 39. 38 New ML Trend #4 • Machine Learning Flow with Federated Learning Applied • Data collection → data cleansing and labeling → model construction and refinement → training → evaluation → hyperparameter re-tuning → model use, with two "repeat" loops. Federated learning adds a further repeated loop: training on personal data on the device, followed by model aggregation. [Label: Machine Learning Platform Coverage]
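The device training and model aggregation loop above can be sketched with federated averaging (FedAvg). This is a minimal illustration, assuming a one-parameter linear model `y = w * x` and sample-count weighting; the function names, learning rate, and data are hypothetical and are not taken from the talk.

```python
def local_update(w, samples, lr=0.05, epochs=5):
    """Train on one device: gradient descent on mean squared error for y = w * x.
    The raw samples never leave this function; only the updated weight does."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w


def federated_average(device_updates):
    """Aggregate on the server: average weights, weighted by each device's sample count.
    device_updates is a list of (weights, num_samples) pairs."""
    total = sum(n for _, n in device_updates)
    if total == 0:
        raise ValueError("no samples reported by any device")
    dim = len(device_updates[0][0])
    avg = [0.0] * dim
    for weights, n in device_updates:
        for i, value in enumerate(weights):
            avg[i] += value * n / total
    return avg


def run_rounds(device_data, rounds=10, w0=0.0):
    """Repeat: broadcast the global weight, train locally, aggregate."""
    w = w0
    for _ in range(rounds):
        updates = [([local_update(w, data)], len(data)) for data in device_data]
        w = federated_average(updates)[0]
    return w
```

For example, `run_rounds([[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]])` trains on two devices whose data follow `y = 2x` and moves the shared weight toward 2 without pooling the raw data.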
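  • 40. 39 New ML Trend #4
    01. Platform-level support that keeps up with new technology trends: fine-grained customization of Kubernetes and related components is needed (hardware and software).
    02. Support for diverse node and network configurations: users pick the framework and network topology they want through self-service.
    03. Automated training and management of artifacts such as data and models: versioning, configuration management, history management, and other management functions.
    04. Connections outside the platform and between platforms are essential: connecting to and managing dozens of device types and hundreds of thousands of devices, and linking systems across regions.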
  • 41. 40 Fully Managed End-to-End ML Platform • [Diagram] Areas: Management, Machine Learning Platform, Developer Experience, Hardware. Components: Data Management, Model Management, Training Management, Hyperparameter Optimization, Neural Architecture Search, H/W Optimization, Network as Code, Connect Device, Configurable Environment, Keep Behavior.
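  • 42. 41 Fully Managed End-to-End ML Platform • Machine learning is no longer satisfied by SOTA alone. The platform must be able to manage the latest ML training techniques (SOTA architectures, automated ML). It must manage, reproduce, and reuse training results together with the training environment, training code, hyperparameters, and dataset version. And training must span multiple interconnected physical environments instead of a single one. Hence: a Fully Managed End-to-End ML Platform.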
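One way to make the "manage, reproduce, reuse" requirement concrete is to record every training run as a manifest and derive a deterministic fingerprint from it. The sketch below is only an illustration: the `RunManifest` fields and the `find_or_register` helper are hypothetical names, not part of any platform described in the talk.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class RunManifest:
    """Everything needed to reproduce one training run (hypothetical schema)."""
    environment_image: str   # e.g. container image tag of the training environment
    code_revision: str       # e.g. commit id of the training code
    dataset_version: str     # version label of the dataset
    hyperparameters: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Stable SHA-256 over the manifest; key order does not matter."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_or_register(registry: dict, manifest: RunManifest, result_path: str) -> str:
    """Reuse an existing result for an identical run, otherwise record the new one."""
    return registry.setdefault(manifest.fingerprint(), result_path)
```

Two runs with the same environment, code, dataset version, and hyperparameters get the same fingerprint, so a stored model can be reused instead of retrained; changing any one of them produces a new entry.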
  • 43. 42 Overall Architecture • [Diagram] Layers: Cluster Hardware, System Software, HPC & AI Technology, Middleware & Management. Components shown: Infiniband + Ethernet, SAN + Local Node Storage, Linux OS variant, GPGPU or Accelerators, Parallel Framework, Numerical Libraries, System Tool, Development Language, Training Algorithm, ML Framework, Hadoop, and Platform Components { Environment, Workflow, Model, Quota, Resource, Log, Metering, … } Management. Revisited from Reed, Daniel A. & Dongarra, Jack. (2015). Exascale Computing and Big Data. Communications of the ACM. 58. 56-68. 10.1145/2699414. Image sources: *1 https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0; *2 Chrislb (created by Chrislb).
  • 45. 44 Basic Architecture • Prerequisites to consider • [Diagram] Storage and Management Servers next to a Kubernetes Cluster running on Servers with GPGPU; the cluster hosts Management Modules, Training Jobs, and Preprocessing tasks. Items to prepare in advance: templated .yaml files; RDMA-SRIOV plug-in; NVIDIA-peer-memory package; preprocessing before training tasks run; Docker insecure registry; Docker unlocked memory limit; Persistent Volume mount; multi-tenant management; a unified timezone.
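The ".yaml templating" item can be illustrated by rendering a Kubernetes Job manifest from a template, so that users only supply a few parameters. This is a sketch under assumptions: the template layout, the parameter names, the `/data` mount path, and setting the timezone through a `TZ` environment variable are illustrative choices, not the templates used in the talk. The manifest fields themselves (`batch/v1` Job, `nvidia.com/gpu` limit, `persistentVolumeClaim`) are standard Kubernetes fields.

```python
from string import Template

# Hypothetical training-job template; $-placeholders are filled per job.
JOB_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: $job_name
  namespace: $tenant
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: $image
        env:
        - name: TZ
          value: $timezone
        resources:
          limits:
            nvidia.com/gpu: $gpus
        volumeMounts:
        - name: dataset
          mountPath: /data
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: $pvc
""")


def render_job(job_name, tenant, image, gpus, pvc, timezone="Asia/Seoul"):
    """Fill the template; substitute() raises KeyError if a placeholder is missing."""
    return JOB_TEMPLATE.substitute(
        job_name=job_name, tenant=tenant, image=image,
        gpus=gpus, pvc=pvc, timezone=timezone,
    )
```

Mapping each tenant to its own namespace, as done here, is one common way to handle the multi-tenant item; the slide does not say which approach was used.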
  • 46. 45 Basic Architecture • Kubernetes-based Machine Learning Platform • [Diagram] Storage and Management Servers next to a Kubernetes Cluster running on Servers with GPGPU; the cluster hosts Management Modules, Training Jobs, and Preprocessing tasks. Issues encountered: storage issues; insufficient data feeding speed; multi-GPU processing; data locality; CNI overhead; unsuitable Pod scheduler; dynamic Pod configuration and operation; direct calls to the kubectl command; no vGPU; communication overhead between servers; resource management; container root privilege.
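  • 47. 46 Troubleshooting • Storage
    Issue: NFS reaches its performance limit sooner than in typical enterprise server use.
      - By nature, ML training reads a large number of files.
      - After training, the processed files amount to N times the number of original training files.
      - Users train on similar or identical datasets.
    Solution:
      1) Minimize copies of the original data by managing training dataset metadata.
      2) Dataset lifetime control is essential; use object storage. A library must be provided for training code written against file system access.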
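The two storage solutions, metadata management that avoids duplicate originals and dataset lifetime control, can be sketched with a content-addressed registry. The class below is a hypothetical in-memory illustration: the lease model, the `ttl` parameter, and all names are assumptions, and a real platform would keep the data in object storage instead of a Python dict.

```python
import hashlib


def dataset_digest(files):
    """Content hash of a dataset given as {path: bytes}; identical data gives an identical id."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


class DatasetRegistry:
    """Keeps one stored copy per distinct dataset and tracks who still needs it."""

    def __init__(self):
        self.copies = {}  # digest -> files (the single stored copy)
        self.leases = {}  # digest -> {owner: expiry time}

    def register(self, owner, files, now, ttl):
        """Store the data only if it is new, and record the owner's lease."""
        digest = dataset_digest(files)
        self.copies.setdefault(digest, files)
        self.leases.setdefault(digest, {})[owner] = now + ttl
        return digest

    def purge(self, now):
        """Lifetime control: drop expired leases, then delete datasets nobody holds."""
        removed = []
        for digest in list(self.copies):
            leases = self.leases[digest]
            for owner in [o for o, expiry in leases.items() if expiry <= now]:
                del leases[owner]
            if not leases:
                del self.copies[digest]
                del self.leases[digest]
                removed.append(digest)
        return removed
```

When two users register the same files, the registry returns the same id and keeps a single copy; the copy is removed only after every lease on it has expired.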
  • 48. 47 Troubleshooting • Resource Management & unsuitable Pod scheduler
    Issue:
      1) GPGPU machine resource fragmentation: Kubernetes resource affinity spreads computation across nodes, which makes multi-GPU scheduling difficult.
      2) Abusing users: resources are grabbed first and then left at low utilization.
    Solution:
      1) a. Develop and apply a custom Kubernetes scheduler and adjust resource affinity.
         b. Offer several resource packings, e.g. 16 GPGPU = 1*16, 2*8, 4*4, 8*2.
      2) a. Fair-share scheduling and quota consumption.
         b. Planned: a preemption scheduler for GPGPU.
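The packing example and the fragmentation problem can be sketched as follows. Two assumptions are made here and are not stated on the slide: `1*16, 2*8, 4*4, 8*2` is read as (number of workers) x (GPUs per worker), and fragmentation is reduced with a simple best-fit rule. The actual custom scheduler from the talk is not described, so `best_fit` is only an illustration of the idea.

```python
def packings(total_gpus, min_gpus_per_worker=1):
    """List (workers, gpus_per_worker) pairs whose product equals total_gpus."""
    result = []
    for workers in range(1, total_gpus + 1):
        if total_gpus % workers == 0:
            per_worker = total_gpus // workers
            if per_worker >= min_gpus_per_worker:
                result.append((workers, per_worker))
    return result


def best_fit(free_gpus, request):
    """Pick the node with the fewest free GPUs that still fits the request.
    Filling nearly-full nodes first keeps large blocks free for multi-GPU jobs.
    free_gpus maps node name -> free GPU count; returns a node name or None."""
    candidates = [(free, node) for node, free in free_gpus.items() if free >= request]
    if not candidates:
        return None
    return min(candidates)[1]


def place(free_gpus, workers, gpus_per_worker):
    """Place every worker of a job or nothing at all; returns node names or None."""
    remaining = dict(free_gpus)
    placement = []
    for _ in range(workers):
        node = best_fit(remaining, gpus_per_worker)
        if node is None:
            return None
        remaining[node] -= gpus_per_worker
        placement.append(node)
    return placement
```

With `packings(16, min_gpus_per_worker=2)` the four shapes from the slide are reproduced, and `place` shows why a spreading policy hurts: a 2-GPU request goes to the node that has exactly 2 GPUs free instead of breaking up an empty 8-GPU node.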
  • 49. 48 Troubleshooting • Kubernetes Out of Memory (OOM)
    Issue:
      1) The configuration differs from VM solutions such as OpenStack: resource overcommit (CPU, RAM, etc.) is not possible.
      2) Kubernetes runs with swap off by default: Pods must share CPU and memory, and paging is not available. The memory usage that triggers an OOM grows faster than the communication interval between cAdvisor and kubelet can catch.
    Solution (work in progress):
      1) Partitioned allocation and a watchdog: divide resources according to the number of GPUs allocated within a server, or react quickly through a watchdog.
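The partitioned allocation and watchdog idea can be sketched as simple arithmetic. The slide marks this work as in progress and gives no formula, so everything below is an assumption: memory is split in proportion to the GPUs a Pod receives, a fixed amount is reserved for the system, and the watchdog acts at 90% of the limit.

```python
def memory_limit_gib(node_memory_gib, node_gpus, pod_gpus, reserved_gib=0.0):
    """Give a Pod a share of node memory proportional to its share of the node's GPUs."""
    if not 0 < pod_gpus <= node_gpus:
        raise ValueError("pod_gpus must be between 1 and node_gpus")
    usable = node_memory_gib - reserved_gib
    return usable * pod_gpus / node_gpus


def watchdog_should_act(usage_gib, limit_gib, threshold=0.9):
    """Return True when usage crosses the threshold, before the hard limit is hit."""
    return usage_gib >= threshold * limit_gib
```

For a hypothetical node with 512 GiB of memory, 8 GPUs, and 32 GiB reserved, a 2-GPU Pod gets a 120 GiB limit, and the watchdog would step in once the Pod uses 108 GiB. Polling faster than the default monitoring interval is what lets the watchdog react before the OOM occurs.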
  • 50. 49 Troubleshooting • vGPU • Issue: dependence on the hardware vendor; available for VMs only (NVIDIA GRID vGPU). Data sheet from the official NVIDIA document: https://images.nvidia.com/content/pdf/grid/data-sheet/tesla-gpu-linecard-virtualization-us-nvidia-669786-r7.pdf
  • 51. 50 Troubleshooting • vGPU • Issue: dependence on the hardware vendor; available for VMs only (NVIDIA GRID vGPU). [Figure: vGPU Overall Architecture. Image from the official NVIDIA document: https://docs.nvidia.com/grid/4.3/grid-vgpu-user-guide/index.html] Solution: [Diagram] a Kubernetes Cluster and an OpenStack Cluster on Servers with GPGPU; the OpenStack Cluster provides VMs with vGPGPU, and Training Jobs run on top.