digitalocean.com
A Journey to Do AI Research
in the Cloud
Liang Yan
Sr. software engineer – virtualization, DigitalOcean
https://www.linkedin.co
SELF 06/12/2022
Liang Yan
xryan.net
Software engineer - Virtualization KY
Kentucky Open Source Society,
OpenSUSE Member, Debian Contributor,
ARM64 board Enthusiast
http://xryan.net
https://www.linkedin.com/in/lyantech
2022 @DigitalOcean
KVM/QEMU development for public cloud
Performance optimization for Scalable VMs
2017 @SUSE
Hardware Virtualization
GPU(AI/ML accelerator)
ARM64-KVM/QEMU Maintainer
Outline
Open-Source in AI
Project: Predict the predict
AI Accelerator
AI Cloud
Lessons learned
Q&A
AI 101
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/
Open-Source in AI
PyTorch
https://pytorch.org/get-started/locally/
• IDE: Jupyter notebook
• ML framework: PyTorch, TensorFlow
• Accelerator: CUDA, OpenCL, ROCm
• Driver: NVIDIA/AMD/Intel
• OS: Linux
AI Dev Stack == Open-Source
https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/
Almost!
Predict the training time in the cloud
A Runtime-Based Computational Performance Predictor for Deep
Neural Network Training
https://github.com/geoffxy/habitat
https://github.com/liayan/habitat
https://www.usenix.org/conference/atc21/presentation/yu
Habitat makes accurate predictions, with an average error of 11.8% across all
configurations
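A minimal sketch of how an "average error" figure like the one above can be computed, assuming it is a mean absolute percentage error between predicted and measured iteration times. Whether this matches the exact metric used in the Habitat paper is an assumption, and the function name and sample numbers are hypothetical.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean absolute percentage error between two equal-length sequences."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured times must be positive")
    # Per-configuration relative error, averaged and expressed in percent.
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical iteration times in milliseconds, not data from the paper.
predicted_ms = [95, 210, 48]
measured_ms = [100, 200, 50]
print(f"average error: {mean_abs_pct_error(predicted_ms, measured_ms):.1f}%")
```

Each entry stands for one (model, GPU) configuration, so the result summarizes the predictor across all of them.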
Predict the training time in the cloud
https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/
Distributed training in the cloud cluster
https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md
https://wallpaperaccess.com/to-be-continued
Distributed Machine Learning
Shared Machine Learning
Federated Machine Learning
AI Accelerator
• GPU: AMD/Intel/NVIDIA
• FPGA: AMD Xilinx, Intel Altera
• Google TPU
• NPU/BPU/XPU...
AI accelerator use cases:
• Graphics: Game, Streaming, 3D...
• Compute: Training, Inference
AI Accelerator: do we really need it?
http://makeyourownneuralnetwork.blogspot.com/2017/05/learning-mnist-with-gpu-acceleration.html
https://christiancosgrove.com/blog/2019/10/06/challenges-in-distributed-machine-learning.html
AI Cloud Implementation
• AI Cloud technical implementation
  • Passthrough: perf ~95%
    • MIG
    • FPGA
    • TPU/NPU
  • Virtualization: perf ~90%
    • NVIDIA: mdev and SR-IOV (Ampere and later)
    • AMD: SR-IOV
    • Intel: mdev
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#pass-through-gpu-use-introduction
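To put the ~95% and ~90% figures in perspective, here is a rough sketch that converts them into wall-clock training time. It assumes runtime scales inversely with relative performance, which is a simplification. The dictionary, the function name, and the 10-hour baseline are hypothetical.

```python
# Relative performance versus bare metal, taken from the slide above.
RELATIVE_PERF = {
    "bare_metal": 1.00,
    "passthrough": 0.95,
    "virtualized": 0.90,
}


def estimated_hours(bare_metal_hours, mode):
    """Estimate training time for a mode, assuming time ~ 1 / relative perf."""
    return bare_metal_hours / RELATIVE_PERF[mode]


# Hypothetical job that takes 10 hours on bare metal.
for mode in RELATIVE_PERF:
    print(f"{mode:12} {estimated_hours(10, mode):.2f} h")
```

Under these assumptions, a 10-hour job grows to roughly 10.5 hours with passthrough and 11.1 hours with a virtualized GPU.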
AI Cloud Providers
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
Support Matrix
GPU columns: M60 | P4 | P40 | P100 | T4 | RTX 6000 | V100 | A10 | A40 | A100 | Notes
Provider rows:
Aliyun
AWS
Baidu
Google (Notes: TPU)
IBM
Microsoft (Notes: FPGA/AMD)
Oracle
Tencent
Linode
Paperspace
Lambda
Vultr (Notes: vGPU/MIG)
https://developer.nvidia.com/cuda-gpus
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
AI Cloud Service
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
• IaaS
  • ML VM Image
  • Container:
    • Docker
    • NGC
  • Conda/pip3
• PaaS
Helps manage data and models
(Paperspace, Colaboratory)
• SaaS
Helps consume AI solutions
(IBM Watson, Google Voice)
Lessons learned
• PyTorch: clarify version dependencies first
PyTorch -> the other Python tools
PyTorch -> CUDA/cuDNN
CUDA/cuDNN -> GPU driver -> Kernel -> OS
Python tools -> OS
• AI cloud user experience:
• Google Cloud: quite cheap, but has a quota policy and data is not easy to move
• AWS: need to apply in advance, too many choices, expensive
• Aliyun/Tencent: require a lot of personal information, a lot of deals
• Azure: decent, offers FPGA and AMD Instinct MI25
• Intel GPU: Tencent offers Intel SG1
• Paperspace or Colaboratory: make sure to set up the GPU backend
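One way to check the dependency chain and the GPU backend before training is to print the version of each layer. This is a minimal sketch, not part of the original talk. The function name `environment_report` is hypothetical, and PyTorch is treated as optional so the script still runs on a machine without it.

```python
import platform
import sys


def environment_report():
    """Collect version info for each layer of the stack, from OS up to PyTorch."""
    report = {
        "os": platform.system(),
        "kernel": platform.release(),
        "python": sys.version.split()[0],
        "torch": None,
        "cuda": None,
        "cudnn": None,
        "gpu_available": False,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; report only the lower layers.
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None on builds without CUDA
    report["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    report["gpu_available"] = torch.cuda.is_available()
    return report


for name, value in environment_report().items():
    print(f"{name:14} {value}")
```

If `gpu_available` is False on Paperspace or Colaboratory, the GPU backend is probably not selected for the session. The GPU driver version is not covered here and is usually checked separately, for example with `nvidia-smi` on NVIDIA hardware.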
Q & A
Thanks!
In memory of
John Hicks
Founder of KYOSS
kyoss.dev
Disclaimer:
All information is based on personal experience, with no
preference or commercial advertising intended. If there are
any conflicts, please refer to the providers' statements.
