1. digitalocean.com
A Journey to Do AI Research
in the Cloud
Liang Yan
Sr. software engineer – virtualization, DigitalOcean
https://www.linkedin.co
SELF 06/12/2022
2. Liang Yan
xryan.net
Software engineer - Virtualization KY
Kentucky Open Source Society,
openSUSE Member, Debian Contributor,
ARM64 board Enthusiast
http://xryan.net
https://www.linkedin.com/in/lyantech
2022 @DigitalOcean
KVM/QEMU development for public cloud
Performance optimization for Scalable VMs
2017 @SUSE
Hardware Virtualization
GPU(AI/ML accelerator)
ARM64-KVM/QEMU Maintainer
7. • IDE: Jupyter notebook
• ML framework: PyTorch, TensorFlow
• Accelerator: CUDA, OpenCL, ROCm
• Driver: NVIDIA/AMD/Intel
• OS: Linux
AI Dev Stack == Open-Source
Almost!
https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/
8. Predict the training time in the cloud
A Runtime-Based Computational Performance Predictor for Deep Neural Network Training
https://github.com/geoffxy/habitat
https://github.com/liayan/habitat
https://www.usenix.org/conference/atc21/presentation/yu
Habitat makes accurate predictions, with an average error of 11.8% across all
configurations
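To put the 11.8% figure in context, here is a minimal sketch of turning a predicted per-iteration time into a total training-time estimate with an error band. This is not Habitat's API or method. The function name `estimate_training_hours`, the example numbers, and the choice to apply the average error as a symmetric band are all assumptions made for illustration.

```python
def estimate_training_hours(iter_ms, iters_per_epoch, epochs, rel_error=0.118):
    """Return (low, expected, high) training time in hours.

    iter_ms: predicted time for one training iteration, in milliseconds.
    rel_error: relative prediction error. The default 0.118 mirrors the
    11.8% average error quoted above, applied as a symmetric band
    (an assumption, not something the paper prescribes).
    """
    total_hours = iter_ms * iters_per_epoch * epochs / 1000.0 / 3600.0
    return (
        total_hours * (1 - rel_error),
        total_hours,
        total_hours * (1 + rel_error),
    )


if __name__ == "__main__":
    # Hypothetical workload: 120 ms per iteration, 5000 iterations per epoch.
    low, expected, high = estimate_training_hours(120.0, 5000, 30)
    print(f"expected {expected:.2f} h (range {low:.2f} to {high:.2f} h)")
```

Multiplying the three values by an instance's hourly price gives a rough budget range before renting the GPU.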
9. Predict the training time in the cloud
https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/
10. Distributed training in the cloud cluster
https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md
https://wallpaperaccess.com/to-be-continued
Distributed Machine Learning
Shared Machine Learning
Federated Machine Learning
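As a rough illustration of the federated case, the sketch below shows only the aggregation step: each client trains locally, and only model parameters (not data) are combined, weighted by how many samples each client holds. This follows the general idea of federated averaging. The function name `federated_average` and the plain-list representation of parameters are assumptions chosen to keep the example dependency-free.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors.

    client_weights: list of parameter vectors, one list of floats per client.
    client_sizes: number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Clients with more data contribute proportionally more.
            averaged[i] += w * size / total
    return averaged


if __name__ == "__main__":
    # Two hypothetical clients holding 1 and 3 samples respectively.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a real cluster the same averaging would be done on tensors by the framework's communication layer rather than on Python lists.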
11. AI Accelerator
AI accelerator:
• GPU: AMD/Intel/NVIDIA
• FPGA: AMD Xilinx, Intel Altera
• Google TPU
• NPU/BPU/XPU...
Use case:
• Graphics: gaming, streaming, 3D...
• Compute: training, inference
12. AI Accelerator: do we really need it?
http://makeyourownneuralnetwork.blogspot.com/2017/05/learning-mnist-with-gpu-acceleration.html
https://christiancosgrove.com/blog/2019/10/06/challenges-in-distributed-machine-learning.html
13. AI Cloud Implementation
• AI Cloud technical implementation
  • Passthrough: perf ~95%
    • MIG
    • FPGA
    • TPU/NPU
  • Virtualization: perf ~90%
    • NVIDIA: mdev and SR-IOV (Ampere and later)
    • AMD: SR-IOV
    • Intel: mdev
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#pass-through-gpu-use-introduction
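The ~95% and ~90% figures above can be turned into a quick runtime projection. The sketch assumes "perf" means throughput relative to bare metal, which the slide does not state explicitly. The names `RELATIVE_PERF` and `projected_hours` are hypothetical.

```python
# Relative performance figures quoted on the slide, read here as a
# fraction of bare-metal throughput (an assumption).
RELATIVE_PERF = {"passthrough": 0.95, "virtualization": 0.90}


def projected_hours(bare_metal_hours, mode):
    """Scale a bare-metal runtime by the quoted relative performance."""
    return bare_metal_hours / RELATIVE_PERF[mode]


if __name__ == "__main__":
    for mode in RELATIVE_PERF:
        print(f"{mode}: {projected_hours(10.0, mode):.2f} h for a 10 h bare-metal job")
```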
17. Lesson learned
• PyTorch: clarify version dependencies first
  PyTorch -> other Python tools
  PyTorch -> CUDA/cuDNN
  CUDA/cuDNN -> GPU driver -> kernel -> OS
  Python tools -> OS
• AI cloud usage experience:
  • Google Cloud: quite cheap, quota policy, not easy to move data
  • AWS: need to apply in advance, too many choices, expensive
  • Aliyun/Tencent: require a lot of personal information, a lot of deals
  • Azure: decent, offers FPGA and AMD Instinct MI25
  • Intel GPU: Tencent (Intel SG1)
  • Paperspace or Colaboratory: make sure to set up the GPU backend
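The dependency chain in the first lesson can be checked from inside a fresh cloud instance or notebook before starting any training. The following is a minimal sketch: the helper name `describe_stack` is hypothetical, PyTorch is treated as optional so the report still runs where it is not installed, and the GPU driver version is not covered here (on NVIDIA systems it can be read separately with `nvidia-smi`).

```python
import platform
import sys


def describe_stack():
    """Collect versions along the PyTorch -> CUDA/cuDNN -> kernel -> OS chain."""
    info = {
        "os": platform.system(),
        "kernel": platform.release(),
        "python": sys.version.split()[0],
    }
    try:
        import torch  # optional: the report still works without PyTorch
    except ImportError:
        info["pytorch"] = None
        return info
    info["pytorch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for builds without CUDA
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_stack().items():
        print(f"{key}: {value}")
```

If `gpu_available` prints `False` on a hosted notebook, the GPU backend has most likely not been selected for the session.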
18. Q & A
Thanks!
In memory of
John Hicks
Founder of KYOSS
kyoss.dev
Disclaimer:
All information is based on personal experience,
with no preference or commercial advertising intended. In case of
any conflict, please refer to the providers' own statements.