digitalocean.com
A Journey to Do AI Research
in the Cloud
Liang Yan
Sr. software engineer – virtualization, DigitalOcean
https://www.linkedin.co
SELF 06/12/2022
Liang Yan
xryan.net
Software engineer - Virtualization KY
Kentucky Open Source Society,
openSUSE Member, Debian Contributor,
ARM64 board Enthusiast
http://xryan.net
https://www.linkedin.com/in/lyantech
2022 @DigitalOcean
KVM/QEMU development for public cloud
Performance optimization for Scalable VMs
2017 @SUSE
Hardware Virtualization
GPU (AI/ML accelerator)
ARM64-KVM/QEMU Maintainer
Outline
Open-Source in AI
Project: Predict the predict
AI Accelerator
AI Cloud
Lessons learned
Q&A
AI 101
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/
Open-Source in AI
PyTorch
https://pytorch.org/get-started/locally/
• IDE: Jupyter notebook
• ML framework: PyTorch, TensorFlow
• Accelerator: CUDA, OpenCL, ROCm
• Driver: NVIDIA/AMD/Intel
• OS: Linux
AI Dev Stack == Open-Source
https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/
Almost!
Predict the training time in the cloud
A Runtime-Based Computational Performance Predictor for Deep
Neural Network Training
https://github.com/geoffxy/habitat
https://github.com/liayan/habitat
https://www.usenix.org/conference/atc21/presentation/yu
Habitat makes accurate predictions, with an average error of 11.8% across all
configurations
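As a rough illustration of what an average prediction error means, the sketch below computes the mean absolute percentage error between predicted and measured iteration times. The timing values and the function name are hypothetical and are not taken from Habitat or the paper, whose exact error definition may differ.

```python
def mean_abs_pct_error(predicted_ms, measured_ms):
    """Average of |predicted - measured| / measured, as a percentage."""
    if len(predicted_ms) != len(measured_ms) or not measured_ms:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(p - m) / m for p, m in zip(predicted_ms, measured_ms)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical per-iteration times (ms) for three GPU/model configurations.
predicted = [110.0, 45.0, 200.0]
measured = [100.0, 50.0, 200.0]
print(f"average error: {mean_abs_pct_error(predicted, measured):.1f}%")
```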
Predict the training time in the cloud
https://pyimagesearch.com/2021/07/19/pytorch-training-your-first-convolutional-neural-network-cnn/
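A simple baseline for predicting training time is to time a few iterations and extrapolate linearly. The sketch below shows that idea only; it is not Habitat's method, and `fake_training_step`, the epoch count, and the batch count are assumptions made for illustration.

```python
import time

def estimate_training_seconds(step_fn, total_iterations, warmup=2, samples=5):
    """Time a few iterations and extrapolate linearly to the full run."""
    for _ in range(warmup):          # warm-up iterations are not timed
        step_fn()
    start = time.perf_counter()
    for _ in range(samples):
        step_fn()
    per_iteration = (time.perf_counter() - start) / samples
    return per_iteration * total_iterations

def fake_training_step():
    """Stand-in for one forward/backward pass (hypothetical workload)."""
    sum(i * i for i in range(10_000))

iterations = 20 * 500   # assumed: 20 epochs of 500 batches each
estimate = estimate_training_seconds(fake_training_step, iterations)
print(f"estimated training time: {estimate:.2f} s")
```

On a GPU, kernels run asynchronously, so a real measurement would need to synchronize (for example with `torch.cuda.synchronize()`) before reading the clock.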
Distributed training in the cloud cluster
https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md
https://wallpaperaccess.com/to-be-continued
Distributed Machine Learning
Shared Machine Learning
Federated Machine Learning
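One common form of distributed training is data parallelism: each worker computes a gradient on its own shard of the data, the gradients are averaged, and every worker applies the same update. The sketch below simulates this in a single process with a one-parameter model; the data, learning rate, and function names are hypothetical, and `all_reduce_mean` is only a stand-in, not the `torch.distributed` API.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the average."""
    return sum(values) / len(values)

# Hypothetical data for y = 3x, split into two equal shards (one per worker).
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w, lr = 0.0, 0.02
for step in range(200):
    # In a real cluster each worker computes its gradient in parallel.
    grads = [local_gradient(w, shard) for shard in shards]
    # Every worker applies the same averaged update, so replicas stay in sync.
    w -= lr * all_reduce_mean(grads)
print(f"learned w = {w:.3f}")
```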
AI Accelerator
• GPU: AMD/Intel/NVIDIA
• FPGA: AMD Xilinx, Intel Altera
• Google TPU
• NPU/BPU/XPU...
AI accelerator use cases:
• Graphics: games, streaming, 3D...
• Compute: training, inference
AI Accelerator: do we really need it?
http://makeyourownneuralnetwork.blogspot.com/2017/05/learning-mnist-with-gpu-acceleration.html
https://christiancosgrove.com/blog/2019/10/06/challenges-in-distributed-machine-learning.html
AI Cloud Implementation
• AI Cloud technical implementation
  • Passthrough: perf ~95%
    • MIG
    • FPGA
    • TPU/NPU
  • Virtualization: perf ~90%
    • NVIDIA: mdev and SR-IOV (Ampere and later)
    • AMD: SR-IOV
    • Intel: mdev
https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#pass-through-gpu-use-introduction
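With passthrough, the guest sees the GPU as an ordinary PCI device, so a quick check from inside a Linux VM is to scan sysfs for display-class devices. The sketch below is a minimal example under that assumption; `list_gpus` and the vendor table are illustrative names, and the function returns an empty list on systems without `/sys/bus/pci/devices`.

```python
import os

GPU_VENDORS = {"0x10de": "NVIDIA", "0x1002": "AMD", "0x8086": "Intel"}
DISPLAY_CLASS_PREFIX = "0x03"   # PCI base class 0x03 = display controller

def list_gpus(pci_root="/sys/bus/pci/devices"):
    """Return (pci_address, vendor_name) for display-class PCI devices."""
    found = []
    if not os.path.isdir(pci_root):
        return found                      # not Linux, or sysfs not mounted
    for address in sorted(os.listdir(pci_root)):
        try:
            with open(os.path.join(pci_root, address, "class")) as f:
                pci_class = f.read().strip()
            with open(os.path.join(pci_root, address, "vendor")) as f:
                vendor = f.read().strip()
        except OSError:
            continue                      # skip entries we cannot read
        if pci_class.startswith(DISPLAY_CLASS_PREFIX):
            # Unknown vendors are reported by their raw PCI vendor ID.
            found.append((address, GPU_VENDORS.get(vendor, vendor)))
    return found

for address, vendor in list_gpus():
    print(address, vendor)
```

Seeing the device here only shows that it is exposed to the guest; a working driver and CUDA or ROCm stack are still needed before a framework can use it.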
AI Cloud Providers
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
Support Matrix
GPUs: M60, P4, P40, P100, T4, RTX 6000, V100, A10, A40, A100
Providers (notes):
• Aliyun
• AWS
• Baidu
• Google (TPU)
• IBM
• Microsoft (FPGA/AMD)
• Oracle
• Tencent
• Linode
• Paperspace
• Lambda
• Vultr (vGPU/MIG)
https://developer.nvidia.com/cuda-gpus
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
AI Cloud Service
https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/
• IaaS
  • ML VM image
  • Container:
    • Docker
    • NGC
    • Conda/pip3
• PaaS: helps manage data and models (Paperspace, Colaboratory)
• SaaS: helps consume AI solutions (IBM Watson, Google Voice)
Lessons learned
• PyTorch: clarify version dependencies first
  PyTorch -> the other Python tools
  PyTorch -> CUDA/cuDNN
  CUDA/cuDNN -> GPU driver -> Kernel -> OS
  Python tools -> OS
• AI cloud user experience:
  • Google Cloud: quite cheap, quota policy, not easy to move data
  • AWS: need to apply in advance, too many choices, expensive
  • Aliyun/Tencent: require a lot of personal information, a lot of deals
  • Azure: decent, offers FPGA and AMD Instinct MI25
  • Intel GPU: Tencent (Intel SG1)
  • Paperspace or Colaboratory: make sure to set up the GPU backend
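A quick way to check the version chain and the GPU backend on a new cloud instance or notebook is to print what PyTorch itself reports. The sketch below is a minimal example; `report_versions` is a hypothetical helper name, and it falls back gracefully when PyTorch is not installed.

```python
import platform
import sys

def report_versions():
    """Collect versions for part of the chain (GPU driver is not reported here)."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "kernel": platform.release(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None              # PyTorch is not installed
        return info
    info["torch"] = torch.__version__
    info["cuda_built_with"] = torch.version.cuda          # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()        # None if cuDNN is unavailable
    info["gpu_available"] = torch.cuda.is_available()     # False if driver/GPU is missing
    return info

for name, value in report_versions().items():
    print(f"{name}: {value}")
```

On NVIDIA systems the driver version itself can be read with `nvidia-smi`. If `gpu_available` is false on Paperspace or Colaboratory, the runtime is most likely still set to a CPU backend.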
Q & A
Thanks!
In memory of
John Hicks
Founder of KYOSS
kyoss.dev
Disclaimer:
All information is based on personal experience, with
no preference or commercial advertising intended. If there are
any conflicts, please refer to the providers' own statements.
