Copyright © SUSE 2021
Accelerate Your AI Cloud Infrastructure
A Virtualization Perspective
12 April 2021
Liang Yan – SUSE Labs
Outline
• Background
• Cloud & AI
• Hardware Acceleration
• NVIDIA® GPU Virtualization
• Current status at SUSE
• Demo
• Running NGC inside a VM
• Current limitations and futures
• Q & A
Background
Building Machine Learning Infrastructure in the Cloud
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/
Tools for Deep Learning
https://jameskle.com/writes/deep-learning-infrastructure-tooling
Hardware Accelerator Landscape
Component                  | GPU                   | FPGA                     | ASIC
Participants               | NVIDIA®, AMD®, INTEL® | Xilinx®, INTEL® (Altera) | TPU, AI Chips
Development Frameworks     | OpenCL, CUDA          | OpenCL                   | OpenCL, TensorFlow
Machine Learning Lifecycle | Training              | Inference                | Inference
FPGA: Field-Programmable Gate Array
ASIC: Application-Specific Integrated Circuit
TPU: Tensor Processing Unit
NVIDIA® GPU Virtualization
Why Choose NVIDIA
• Software Ecosystem
• Powerful Performance
https://becominghuman.ai/nvidia-and-the-gpu-contribution-to-the-ai-world-of-self-driving-cars-1f00e3212508
http://www.nvidia.com/object/grid-certified-servers.html
NVIDIA® GPU Virtualization
• Scalability
• Split
• Time Slices
• Framebuffer
• Isolate
• MDEV/SR-IOV
• Schedule
• RR, BOND
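
As a rough illustration of the time-slice idea, the sketch below simulates a plain round-robin (RR) scheduler that hands a fixed slice of GPU time to each vGPU in turn. The function name, the slice length, and the workload sizes are hypothetical; the slides do not describe the internals of NVIDIA's scheduler, so this is only a toy model of the general technique.

```python
from collections import deque


def round_robin(work_ms, slice_ms):
    """Return the time (ms) at which each vGPU's pending work completes.

    work_ms:  dict mapping a vGPU name to its pending GPU work in ms (illustrative)
    slice_ms: length of one time slice in ms (illustrative)
    """
    queue = deque(work_ms.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)  # run for one slice or until done
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back of the queue for another turn
        else:
            finished[name] = clock
    return finished


# Hypothetical workloads: three vGPUs sharing one physical GPU.
print(round_robin({"vgpu0": 4, "vgpu1": 2, "vgpu2": 6}, slice_ms=2))
```

Short jobs finish early while long jobs keep cycling, which is the fairness property a round-robin policy is usually chosen for.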
Current Status at SUSE®
SUSE Reference Platform: Tests and Results
• Test Setup
  – Host: SUSE Linux Enterprise Server 15 SP2
  – Guests: SUSE Linux Enterprise Server 15 SP2, 15 SP1, Windows Server 2019
  – Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
  – Benchmarks: LAMMPS, TensorRT, SPECviewperf
• Functional Tests
  – Driver
  – CUDA
  – 3D Graphics
  – Virt-manager display
  – Max mdev support
• Performance Tests
  – vGPU vs Passthrough
  – vGPU across different guest VMs
  – vGPU with different memory configurations
  – vGPU scalability
Performance Results

SPECviewperf              | creo-02  | energy-02 | maya-05  | medical-02 | sw-04
vGPU 16C                  | 54.74    | 22.87     | 60.02    | 42.77      | 53.35
vGPU 16Q                  | 52.62    | 36.35     | 60.3     | 55.5       | 51.52
Passthrough               | 199.87   | 24.67     | 269.3    | 61.47      | 136.59
Diff vs Passthrough (16C) | -72.612% | -7.296%   | -77.712% | -30.421%   | -60.941%
Diff vs Passthrough (16Q) | -73.672% | +47.344%  | -77.708% | -9.712%    | -62.281%

SPECviewperf              | creo-02  | energy-02 | maya-05  | medical-02 | sw-04
vGPU 16C                  | 198.68   | 29.77     | 311.2    | 69.67      | 126.15
vGPU 16Q                  | 188.36   | 39.93     | 320.99   | 111.71     | 153.92
Passthrough               | 199.87   | 24.67     | 269.3    | 61.47      | 136.59
Diff vs Passthrough (16C) | -0.5%    | +20.7%    | +15.6%   | +13.3%     | -7.64%
Diff vs Passthrough (16Q) | -5.7%    | +61.9%    | +19.2%   | +91.5%     | +12.7%
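
The percentage rows in the tables above appear to be the relative difference of each vGPU score against the passthrough score. A minimal sketch of that calculation follows, using the vGPU 16C row of the first table. The function name is illustrative, and reading the rows this way is an assumption; the slides do not state the formula.

```python
def relative_difference(score, baseline):
    """Percentage change of `score` relative to `baseline`."""
    return (score - baseline) / baseline * 100.0


viewsets = ["creo-02", "energy-02", "maya-05", "medical-02", "sw-04"]
passthrough = [199.87, 24.67, 269.3, 61.47, 136.59]
vgpu_16c = [54.74, 22.87, 60.02, 42.77, 53.35]  # first table, vGPU 16C row

for name, score, base in zip(viewsets, vgpu_16c, passthrough):
    print(f"{name}: {relative_difference(score, base):+.1f}%")
```

A negative value means the vGPU profile scored below passthrough on that viewset; a positive value means it scored above.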
Performance Results

fp32
average     | times     | host walltime | 99% percentile time
16C         | 21.79712  | 23.14408      | 22.43136
16Q         | 21.79712  | 22.59112      | 22.01536
4C          | 22.06052  | 22.95922      | 22.3498
4Q          | 21.8033   | 22.68498      | 21.94474
Passthrough | 21.69214  | 22.08166      | 21.83638
4C-194      | 55.47008  | 56.92716      | 64.96326
4C-210      | 37.50868  | 38.47168      | 41.63402
4C-211      | 22.44084  | 23.30192      | 27.70308
4C-212      | 37.90536  | 38.91082      | 43.96012

fp16
average     | times     | host walltime | 99% percentile time
16C         | 22.0795   | 22.92232      | 22.50462
16Q         | 22.18726  | 23.07548      | 22.48336
4C          | 21.9007   | 22.76988      | 22.12804
4Q          | 22.24228  | 23.2023       | 22.43974
Passthrough | 21.86884  | 22.2265       | 22.01682
4C-194      | 40.85198  | 41.91288      | 44.86924
4C-210      | 39.3009   | 40.47824      | 42.50746
4C-211      | 23.5173   | 24.25984      | 25.05932
4C-212      | 25.75758  | 26.4973       | 28.11208

int8
average     | times     | host walltime | 99% percentile time
16C         | 6.311586  | 6.868434      | 6.402446
16Q         | 6.332234  | 6.96658       | 6.39591
4C          | 6.071664  | 6.65127       | 6.197454
4Q          | 6.069044  | 6.632992      | 6.144616
Passthrough | 6.064272  | 6.423492      | 6.161406
4C-194      | 6.073552  | 6.606642      | 6.17433
4C-210      | 11.095482 | 12.068492     | 12.20814
4C-211      | 10.803672 | 11.764844     | 11.968938
4C-212      | 7.265528  | 7.846488      | 8.420966
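
To put the single-vGPU rows in perspective, the sketch below computes how far each profile's "times" value sits above the passthrough value, per precision. It assumes that "times" is a latency-style measurement where lower is better, which the slides do not state. The dictionary layout and the function name are illustrative; the numbers are copied from the table.

```python
# "times" column for each precision, copied from the results table.
times = {
    "16C": {"fp32": 21.79712, "fp16": 22.0795, "int8": 6.311586},
    "16Q": {"fp32": 21.79712, "fp16": 22.18726, "int8": 6.332234},
    "4C": {"fp32": 22.06052, "fp16": 21.9007, "int8": 6.071664},
    "4Q": {"fp32": 21.8033, "fp16": 22.24228, "int8": 6.069044},
}
passthrough = {"fp32": 21.69214, "fp16": 21.86884, "int8": 6.064272}


def overhead_percent(profile):
    """Percentage by which a profile's time exceeds passthrough, per precision."""
    return {
        precision: (value - passthrough[precision]) / passthrough[precision] * 100.0
        for precision, value in times[profile].items()
    }


for profile in times:
    formatted = ", ".join(
        f"{precision} {value:+.2f}%"
        for precision, value in overhead_percent(profile).items()
    )
    print(f"{profile}: {formatted}")
```

With these figures, every single-vGPU profile comes out within roughly 5% of passthrough.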
Conclusions
• No major discernible difference between vGPU and passthrough
• Similar results were achieved across different SUSE Linux Enterprise guest environments (15 SP2, 15 SP1)
• vGPU memory size showed no effect on performance (V100-16C vs V100-4C)
• vGPU model types showed no major differences (V100-16C vs V100-16Q)
• Scalability impacts performance, but results were still better than expected
  – V100-16C vs 4 × V100-4C
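
The scalability point can be made concrete with the fp32 "times" figures from the results table. The sketch below rests on two assumptions that the slides do not confirm: that the rows 4C-194, 4C-210, 4C-211 and 4C-212 are four V100-4C guests running at the same time on one GPU, and that the naive expectation for four guests sharing a GPU in equal time slices is a fourfold slowdown. The function name is illustrative.

```python
# fp32 "times" values copied from the results table.
single_16c_fp32 = 21.79712
four_4c_fp32 = {
    "4C-194": 55.47008,
    "4C-210": 37.50868,
    "4C-211": 22.44084,
    "4C-212": 37.90536,
}


def slowdown(concurrent_times, single_time):
    """Mean time of the concurrent guests divided by the single-guest time."""
    values = list(concurrent_times)
    return (sum(values) / len(values)) / single_time


factor = slowdown(four_4c_fp32.values(), single_16c_fp32)
naive_factor = len(four_4c_fp32)  # assumed naive expectation: one share per guest
print(f"measured slowdown: {factor:.2f}x, naive expectation: {naive_factor}x")
```

Under these assumptions the mean fp32 time of the four guests is about 1.76 times the single V100-16C figure, well below the assumed fourfold worst case.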
Feature Checklist - Review
• Graphics Performance
• CUDA installation
• AI Platform installation
• Remote Display
• Secure boot for vGPU
• VM Snapshots
• Live Migration
• A100 support
DEMO
• Test Setup
  – Host: SUSE Linux Enterprise Server 15 SP2
  – Guest: SUSE Linux Enterprise Server 15 SP2
  – Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
• Steps
  – Secure trial license and acquire drivers
  – Set up license server
  – Install vGPU manager on SUSE Linux Enterprise Server 15 SP2
  – Create vGPU
  – Pass the vGPU through to the VM
  – Install vGPU driver in VM
  – Register vGPU
  – Install CUDA
  – Register NGC account
  – Set up NGC environment
  – Pull TensorRT image
  – Run TensorRT benchmark
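
The "Create vGPU" and passthrough steps can be sketched as follows, assuming the generic Linux mdev sysfs interface and libvirt's `<hostdev type='mdev'>` element. The PCI address `0000:3b:00.0` and the type ID `nvidia-299` are hypothetical placeholders, and the helper names are illustrative; the demo's actual commands are not shown in the slides. The script only prints what would be done and does not touch sysfs.

```python
import uuid
import xml.etree.ElementTree as ET


def mdev_create_path(pci_address, type_id):
    """sysfs file that creates an mdev instance when a UUID is written to it."""
    return f"/sys/class/mdev_bus/{pci_address}/mdev_supported_types/{type_id}/create"


def mdev_hostdev_xml(mdev_uuid):
    """Build a libvirt <hostdev> element that attaches an mdev device to a guest."""
    hostdev = ET.Element("hostdev", mode="subsystem", type="mdev", model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", uuid=str(mdev_uuid))
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    new_uuid = uuid.uuid4()
    # Hypothetical PCI address and vGPU type ID; real values come from the host.
    create_file = mdev_create_path("0000:3b:00.0", "nvidia-299")
    print(f"echo {new_uuid} > {create_file}")  # run as root on the host
    print(mdev_hostdev_xml(new_uuid))  # add inside <devices> of the guest XML
```

The same UUID ties the two steps together: it names the vGPU instance on the host and identifies it in the guest definition.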
Futures
Roadmap and Further Exploration
• Current
  – vGPU 12.x supported on SUSE Linux Enterprise Server 15 SP2
• Future
  – vGPU 12.x and 13.x (long-term release) to be supported with SUSE Linux Enterprise Server 15 SP3
  – GPU passthrough for ARM64
  – vGPU plugin in KubeVirt (Kubernetes scenario)
  – vGPU plugin in SUSE Manager (lifecycle management tool)
  – vGPU plugin in rust-vmm
Thank you

For more information, contact SUSE at:
+1 800 796 3700 (U.S./Canada)
+49 (0)911-740 53-0 (Worldwide)
Maxfeldstrasse 5
90409 Nuremberg
www.suse.com

© 2020 SUSE LLC. All Rights Reserved. SUSE and the SUSE logo are registered trademarks of SUSE LLC in the United States and other countries. All third-party trademarks are the property of their respective owners.
