A journay to do AI research in the cloud.pdf


Liang Yan

A journey to do AI research in the cloud

Technology

digitalocean.com
A Journey to Do AI Research in the Cloud
Liang Yan
Sr. software engineer – virtualization, DigitalOcean
https://www.linkedin.co
SELF 06/12/2022

Liang Yan
xryan.net
Software engineer - Virtualization KY
Kentucky Open Source Society,
OpenSUSE Member, Debian Contributor,
ARM64 board Enthusiast
http://xryan.net
https://www.linkedin.com/in/lyantech
2022 @DigitalOcean
• KVM/QEMU development for public cloud
• Performance optimization for scalable VMs
2017 @SUSE
• Hardware virtualization
• GPU (AI/ML accelerator)
• ARM64 KVM/QEMU maintainer

Outline
Open-Source in AI
Project: Predict the predict
AI Accelerator
AI Cloud
Lessons learned
Q&A

AI 101
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/

Open-Source in AI

PyTorch
https://pytorch.org/get-started/locally/

AI Dev Stack == Open-Source
• IDE: Jupyter Notebook
• ML framework: PyTorch, TensorFlow
• Accelerator: CUDA, OpenCL, ROCm
• Driver: NVIDIA/AMD/Intel
• OS: Linux
Almost! NVIDIA has released open-source GPU kernel modules:
https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/
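To see which layers of this stack are present on a given machine, each layer can be queried in turn. The following is a minimal sketch, not a definitive tool: the helper names (`package_version`, `nvidia_driver_version`, `describe_stack`) are made up for illustration, and it assumes `nvidia-smi` is on the PATH whenever an NVIDIA driver is installed. Missing layers are reported as None.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def nvidia_driver_version():
    """Ask nvidia-smi for the driver version; None if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def describe_stack():
    """Collect one entry per layer of the stack listed on the slide."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "driver": nvidia_driver_version(),
        "framework": {
            "torch": package_version("torch"),
            "tensorflow": package_version("tensorflow"),
        },
        "ide": {"notebook": package_version("notebook")},
    }


if __name__ == "__main__":
    for layer, value in describe_stack().items():
        print(f"{layer}: {value}")
```

When PyTorch is installed, `torch.version.cuda` additionally reports the CUDA version that build targets, which covers the accelerator layer.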

Predict the training time in the cloud
Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training
https://github.com/geoffxy/habitat
https://github.com/liayan/habitat
https://www.usenix.org/conference/atc21/presentation/yu
Habitat makes accurate predictions, with an average error of 11.8% across all configurations.
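To make the "average error" figure concrete, here is a small sketch of how such a number can be computed, assuming it is a mean absolute percentage error between predicted and measured iteration times. The timings below are invented for illustration and are not Habitat results.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean of |predicted - measured| / measured, expressed in percent."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical per-iteration times in milliseconds for three configurations.
predicted_ms = [90.0, 110.0, 210.0]
measured_ms = [100.0, 100.0, 200.0]

# Individual errors are 10%, 10% and 5%, so the mean is about 8.33%.
print(f"{mean_abs_pct_error(predicted_ms, measured_ms):.2f}%")
```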

Predict the training time in the cloud
https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/
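A simple runtime-based way to estimate training time before renting an instance is to time a few training steps and extrapolate. The sketch below is an assumption-laden illustration, not the method used by Habitat: `step_fn` stands for one training iteration supplied by the caller, and the step count and hourly price are hypothetical.

```python
import time


def time_per_step(step_fn, warmup=2, samples=5):
    """Average wall-clock seconds per call of step_fn, after a short warm-up.

    With PyTorch on a GPU, step_fn should end with torch.cuda.synchronize(),
    because CUDA kernels run asynchronously with respect to the host clock.
    """
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(samples):
        step_fn()
    return (time.perf_counter() - start) / samples


def estimate_run(step_seconds, steps_per_epoch, epochs, price_per_hour):
    """Extrapolate total seconds and rental cost from one step's duration."""
    total_seconds = step_seconds * steps_per_epoch * epochs
    cost = total_seconds / 3600.0 * price_per_hour
    return total_seconds, cost


if __name__ == "__main__":
    # Stand-in for a real training step: sleep for 10 ms.
    per_step = time_per_step(lambda: time.sleep(0.01))
    seconds, cost = estimate_run(per_step, steps_per_epoch=1000, epochs=10,
                                 price_per_hour=2.0)  # hypothetical price
    print(f"~{seconds / 3600.0:.3f} h, ~${cost:.2f}")
```

The estimate ignores data loading stalls, checkpointing, and evaluation passes, so it is best read as a lower bound.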

Distributed training in the cloud cluster
https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md
https://wallpaperaccess.com/to-be-continued
Distributed Machine Learning
Shared Machine Learning
Federated Machine Learning
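As one illustration of the last item, federated learning typically combines models trained on separate clients by averaging their parameters, weighted by how much data each client holds (the FedAvg idea). The sketch below is a plain-Python illustration with made-up numbers; it is not taken from the talk, and real systems operate on tensors rather than lists.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients; the second holds three times as much data.
merged = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(merged)  # [2.5, 3.5]
```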

AI Accelerator
AI accelerator:
• GPU: AMD/Intel/NVIDIA
• FPGA: AMD Xilinx, Intel Altera
• Google TPU
• NPU/BPU/XPU...
Use case:
• Graphics: gaming, streaming, 3D...
• Compute: training, inference

AI Accelerator: do we really need it?
http://makeyourownneuralnetwork.blogspot.com/2017/05/learning-mnist-with-gpu-acceleration.html
https://christiancosgrove.com/blog/2019/10/06/challenges-in-distributed-machine-learning.html

AI Cloud Implementation
• AI Cloud technical implementation
  • Passthrough: about 95% of native performance
    • MIG
    • FPGA
    • TPU/NPU
  • Virtualization: about 90% of native performance
    • NVIDIA: mdev and SR-IOV (Ampere and later)
    • AMD: SR-IOV
    • Intel: mdev
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#pass-through-gpu-use-introduction
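On a Linux host, one quick way to tell how a GPU is exposed is to look at which kernel driver each PCI device is bound to: a device reserved for passthrough is usually bound to `vfio-pci`, while one used directly shows its vendor driver. The sketch below reads the standard sysfs layout; the function name and the vendor table are illustrative assumptions, and it only covers PCI devices (mdev instances live under a different sysfs path).

```python
import os
from pathlib import Path

# PCI vendor IDs for the three GPU vendors named on the slide.
VENDORS = {"0x10de": "NVIDIA", "0x1002": "AMD", "0x8086": "Intel"}

# PCI base classes: 0x03 = display controller, 0x12 = processing accelerator.
ACCELERATOR_CLASS_PREFIXES = ("0x03", "0x12")


def list_gpu_drivers(sysfs_root="/sys/bus/pci/devices"):
    """Return address, vendor and bound driver for each GPU-like PCI device."""
    results = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return results
    for dev in sorted(root.iterdir()):
        try:
            pci_class = (dev / "class").read_text().strip()
            vendor = (dev / "vendor").read_text().strip()
        except OSError:
            continue  # not a readable PCI device entry
        if not pci_class.startswith(ACCELERATOR_CLASS_PREFIXES):
            continue
        link = dev / "driver"
        # The "driver" entry is a symlink to the bound driver's directory.
        driver = os.path.basename(os.readlink(link)) if link.is_symlink() else None
        results.append({
            "address": dev.name,
            "vendor": VENDORS.get(vendor, vendor),
            "driver": driver,
        })
    return results


if __name__ == "__main__":
    for entry in list_gpu_drivers():
        print(entry)
```

Inside a guest VM the same check shows whether the passed-through or virtual device has been picked up by the vendor driver.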

AI Cloud Providers
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/

Support Matrix
Columns: M60 | P4 | P40 | P100 | T4 | RTX 6000 | V100 | A10 | A40 | A100 | Notes
Providers (with notes):
• Aliyun
• AWS
• Baidu
• Google: TPU
• IBM
• Microsoft: FPGA/AMD
• Oracle
• Tencent
• Linode
• Paperspace
• Lambda
• Vultr: vGPU/MIG
https://developer.nvidia.com/cuda-gpus
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/

AI Cloud Service
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
• IaaS
  • ML VM image
  • Container:
    • Docker
    • NGC
  • Conda/pip3
• PaaS: helps manage data and models (Paperspace, Colaboratory)
• SaaS: helps consume AI solutions (IBM Watson, Google Voice)

Lessons learned
• PyTorch: clarify version dependencies first
  PyTorch -> the other Python tools
  PyTorch -> CUDA/cuDNN
  CUDA/cuDNN -> GPU driver -> kernel -> OS
  Python tools -> OS
• AI cloud user experience:
  • Google Cloud: quite cheap, but has a quota policy and data is not easy to move
  • AWS: need to apply in advance, too many choices, expensive
  • Aliyun/Tencent: require a lot of personal information, a lot of deals
  • Azure: decent; offers FPGA and AMD Instinct MI25
  • Intel GPU: Tencent offers Intel SG1
  • Paperspace or Colaboratory: make sure to set up the GPU backend
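The dependency arrows in the first lesson can be read as "decide this before that". A minimal sketch of turning them into a decision order, using the standard-library `graphlib` module (Python 3.9+), might look like the following; the dictionary simply restates the arrows from the slide, and the names `DECIDED_AFTER` and `decision_order` are made up for illustration.

```python
from graphlib import TopologicalSorter

# Each key is decided only after everything in its set has been decided.
DECIDED_AFTER = {
    "PyTorch": set(),
    "Python tools": {"PyTorch"},
    "CUDA/cuDNN": {"PyTorch"},
    "GPU driver": {"CUDA/cuDNN"},
    "Kernel": {"GPU driver"},
    "OS": {"Kernel", "Python tools"},
}


def decision_order(graph=DECIDED_AFTER):
    """Return the components in an order that respects every arrow."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(decision_order()))
```

The order starts with PyTorch and ends with the OS: pick the framework version first, then the CUDA/cuDNN build it supports, then a driver, kernel, and OS image that can host them.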

Q & A
Thanks!
In memory of
John Hicks
Founder of KYOSS
kyoss.dev
Disclaimer:
All the information here is based on personal experience, with no
preference or commercial advertising intended. If there are any
conflicts, please refer to the statements from the providers.
