Accelerate-your-AI-Cloud-infrastructure.pdf

Copyright © SUSE 2021
Accelerate Your AI Cloud Infrastructure
A Virtualization Perspective
Liang Yan – SUSE Labs
12 April 2021

Outline
• Background
• Cloud & AI
• Hardware Acceleration
• NVIDIA® GPU Virtualization
• Current status at SUSE
• Demo
• Running NGC inside a VM
• Current limitations and futures
• Q & A

Background

Building Machine Learning Infrastructure in the Cloud
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/

Tools for Deep Learning
https://jameskle.com/writes/deep-learning-infrastructure-tooling

Hardware Accelerator Landscape
Component                    GPU                     FPGA                       ASIC
Participants                 NVIDIA®, AMD®, Intel®   Xilinx®, Intel® (Altera)   TPU, AI chips
Development Frameworks       OpenCL, CUDA            OpenCL                     OpenCL, TensorFlow
Machine Learning Lifecycle   Training                Inference                  Inference
FPGA: Field-Programmable Gate Array
ASIC: Application-Specific Integrated Circuit
TPU: Tensor Processing Unit

NVIDIA® GPU Virtualization

Why Choose NVIDIA
• Software Ecosystem
• Powerful Performance
https://becominghuman.ai/nvidia-and-the-gpu-contribution-to-the-ai-world-of-self-driving-cars-1f00e3212508
http://www.nvidia.com/object/grid-certified-servers.html

NVIDIA® GPU Virtualization
• Scalability
• Split
• Time Slices
• Framebuffer
• Isolate
• MDEV/SR-IOV
• Schedule
• RR, BOND

Current Status at SUSE®

SUSE Reference Platform: Tests and Results
— Test Setup
– Host: SUSE Linux Enterprise Server 15 SP2
– Guests: SUSE Linux Enterprise Server 15 SP2, 15 SP1, Windows Server 2019
– Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
– Benchmarks: LAMMPS, TensorRT, Perfview
— Functional Tests:
– Driver
– CUDA
– 3D Graphics
– Virt-manager display
– Max mdev support
— Performance Tests:
– vGPU vs Passthrough
– vGPU across different guest VMs
– vGPU with different memory configurations
– vGPU scalability

Performance Results
SPECviewperf   creo-02    energy-02  maya-05    medical-02  sw-04
vGPU 16C       54.74      22.87      60.02      42.77       53.35
vGPU 16Q       52.62      36.35      60.3       55.5        51.52
Passthrough    199.87     24.67      269.3      61.47       136.59
abs (16C)      -72.612%   -7.296%    -77.712%   -30.421%    -60.941%
abs (16Q)      -73.672%   +47.344%   -77.708%   -9.712%     -62.281%

SPECviewperf   creo-02    energy-02  maya-05    medical-02  sw-04
vGPU 16C       198.68     29.77      311.2      69.67       126.15
vGPU 16Q       188.36     39.93      320.99     111.71      153.92
Passthrough    199.87     24.67      269.3      61.47       136.59
abs (16C)      -0.5%      +20.7%     +15.6%     +13.3%      -7.64%
abs (16Q)      -5.7%      +61.9%     +19.2%     +91.5%      +12.7%

The abs rows give each vGPU profile's percentage difference from the Passthrough row.
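The abs rows can be recomputed from the raw scores. The following is a minimal sketch, assuming the abs rows are the relative difference from the Passthrough row and that a higher SPECviewperf score is better. The scores are copied from the second block of results above; the function names `relative_diff` and `compare` are illustrative only.

```python
# SPECviewperf scores copied from the second block of results above.
PASSTHROUGH = {"creo-02": 199.87, "energy-02": 24.67, "maya-05": 269.3,
               "medical-02": 61.47, "sw-04": 136.59}
VGPU_16C = {"creo-02": 198.68, "energy-02": 29.77, "maya-05": 311.2,
            "medical-02": 69.67, "sw-04": 126.15}
VGPU_16Q = {"creo-02": 188.36, "energy-02": 39.93, "maya-05": 320.99,
            "medical-02": 111.71, "sw-04": 153.92}


def relative_diff(score, baseline):
    """Percentage difference of score from baseline (positive = above baseline)."""
    return (score - baseline) / baseline * 100.0


def compare(profile, baseline):
    """Relative difference per viewset for one vGPU profile against the baseline."""
    return {name: relative_diff(profile[name], baseline[name]) for name in baseline}


def main():
    for label, profile in (("16C", VGPU_16C), ("16Q", VGPU_16Q)):
        diffs = compare(profile, PASSTHROUGH)
        row = "  ".join(f"{name} {pct:+.1f}%" for name, pct in diffs.items())
        print(f"abs ({label}): {row}")


if __name__ == "__main__":
    main()
```

Recomputed this way, most cells match the listed percentages to within rounding. The 16Q medical-02 cell comes out near +81.7% rather than the listed +91.5%, so one of those two figures may contain a transcription error.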

Performance Results
fp32 (average)
              times       host walltime   99% percentile time
16C           21.79712    23.14408        22.43136
16Q           21.79712    22.59112        22.01536
4C            22.06052    22.95922        22.3498
4Q            21.8033     22.68498        21.94474
Passthrough   21.69214    22.08166        21.83638
4C-194        55.47008    56.92716        64.96326
4C-210        37.50868    38.47168        41.63402
4C-211        22.44084    23.30192        27.70308
4C-212        37.90536    38.91082        43.96012

fp16 (average)
              times       host walltime   99% percentile time
16C           22.0795     22.92232        22.50462
16Q           22.18726    23.07548        22.48336
4C            21.9007     22.76988        22.12804
4Q            22.24228    23.2023         22.43974
Passthrough   21.86884    22.2265         22.01682
4C-194        40.85198    41.91288        44.86924
4C-210        39.3009     40.47824        42.50746
4C-211        23.5173     24.25984        25.05932
4C-212        25.75758    26.4973         28.11208

int8 (average)
              times       host walltime   99% percentile time
16C           6.311586    6.868434        6.402446
16Q           6.332234    6.96658         6.39591
4C            6.071664    6.65127         6.197454
4Q            6.069044    6.632992        6.144616
Passthrough   6.064272    6.423492        6.161406
4C-194        6.073552    6.606642        6.17433
4C-210        11.095482   12.068492       12.20814
4C-211        10.803672   11.764844       11.968938
4C-212        7.265528    7.846488        8.420966
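One way to read the scalability rows is to express each one as a slowdown factor against the single 4C run. The sketch below assumes that the rows 4C-194, 4C-210, 4C-211 and 4C-212 are four VMs sharing the GPU at the same time, each with a 4C profile, and that a lower "times" value is better. Neither assumption is stated in the table. The values are the fp32 "times" figures from the table above.

```python
from statistics import mean

# fp32 "times" values copied from the table above (lower assumed to be better).
SINGLE_4C = 22.06052
# Assumption: these four rows are VMs running concurrently, one 4C vGPU each.
CONCURRENT_4C = {
    "4C-194": 55.47008,
    "4C-210": 37.50868,
    "4C-211": 22.44084,
    "4C-212": 37.90536,
}


def slowdown(value, baseline):
    """Factor by which value is slower than baseline (1.0 = no slowdown)."""
    return value / baseline


def slowdowns(runs, baseline):
    """Slowdown factor for every run in a mapping of name -> measured time."""
    return {name: slowdown(value, baseline) for name, value in runs.items()}


def main():
    factors = slowdowns(CONCURRENT_4C, SINGLE_4C)
    for name, factor in factors.items():
        print(f"{name}: {factor:.2f}x")
    print(f"mean: {mean(factors.values()):.2f}x")


if __name__ == "__main__":
    main()
```

Under these assumptions the mean fp32 slowdown is roughly 1.7x. Naive time slicing across four busy VMs would suggest something closer to 4x, though the table does not say how the four runs overlapped.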

Conclusions
— No major discernible difference between vGPU and passthrough
— Similar results were achieved across different SUSE Linux Enterprise guest environments (15 SP2, 15 SP1)
— vGPU memory size showed no effect on performance (V100-16C vs V100-4C)
— vGPU model types showed no major differences (V100-16C vs V100-16Q)
— Scaling out impacts performance, but results were still better than expected
– V100-16C vs 4 x V100-4C

Feature Checklist - Review
— Graphic Performance
— CUDA installation
— AI Platform installation
— Remote Display
— Secure boot for vGPU
— VM Snapshots
— Live Migration
— A100 support

DEMO

Demo
— Test Setup
– Host: SUSE Linux Enterprise Server 15 SP2
– Guest: SUSE Linux Enterprise Server 15 SP2
– Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
— Steps
– Secure trial license and acquire drivers
– Set up license server
– Install vGPU manager on SUSE Linux Enterprise Server 15 SP2
– Create vGPU
– Pass the vGPU through to the VM
– Install vGPU driver in VM
– Register vGPU
– Install CUDA
– Register NGC account
– Set up NGC environment
– Pull TensorRT image
– Run TensorRT benchmark
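The "create vGPU" and "pass through" steps can be sketched with the Linux mediated device (mdev) sysfs interface and a libvirt hostdev element. This is a minimal sketch, not the procedure used in the demo. The PCI address `0000:3b:00.0` and the type name `nvidia-105` are placeholders; the real values depend on the host and on which types the installed vGPU manager exposes. The helper names are illustrative.

```python
import uuid
from pathlib import Path
from xml.etree import ElementTree as ET


def mdev_create_path(pci_addr, type_id, sysfs_root="/sys"):
    """Path of the sysfs 'create' node for one mdev type on one physical GPU."""
    return (Path(sysfs_root) / "class" / "mdev_bus" / pci_addr
            / "mdev_supported_types" / type_id / "create")


def create_vgpu(pci_addr, type_id, sysfs_root="/sys"):
    """Create a mediated device by writing a fresh UUID to the 'create' node.

    Needs root and a host with the vGPU manager loaded; returns the new UUID.
    """
    dev_uuid = str(uuid.uuid4())
    mdev_create_path(pci_addr, type_id, sysfs_root).write_text(dev_uuid)
    return dev_uuid


def hostdev_xml(dev_uuid):
    """libvirt <hostdev> fragment that attaches the mdev with this UUID to a guest."""
    hostdev = ET.Element("hostdev", mode="subsystem", type="mdev", model="vfio-pci")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", uuid=dev_uuid)
    return ET.tostring(hostdev, encoding="unicode")


def main():
    # Placeholder values: nothing is written to sysfs here, the sketch only
    # prints the path and the XML fragment that would be used.
    pci_addr, type_id = "0000:3b:00.0", "nvidia-105"
    print(mdev_create_path(pci_addr, type_id))
    print(hostdev_xml(str(uuid.uuid4())))


if __name__ == "__main__":
    main()
```

The printed fragment would go into the `<devices>` section of the guest definition, for example through `virsh edit`. Calling `create_vgpu` is left out of `main` on purpose, since it changes host state.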

Futures

Roadmap and Further Exploration
— Current:
– vGPU 12.x supported on SUSE Linux Enterprise Server 15 SP2
— Future:
– vGPU 12.x and 13.x (long-term release) to be supported with SUSE Linux Enterprise Server 15 SP3
– GPU passthrough for ARM64
– vGPU plugin in KubeVirt (Kubernetes scenario)
– vGPU plugin in SUSE Manager (lifecycle management tool)
– vGPU plugin in rust-vmm

Thank you
For more information, contact SUSE at:
+1 800 796 3700 (U.S./Canada)
+49 (0)911-740 53-0 (Worldwide)
Maxfeldstrasse 5
90409 Nuremberg
www.suse.com
© 2020 SUSE LLC. All Rights Reserved. SUSE and the SUSE logo are registered trademarks of SUSE LLC in the United States and other countries. All third-party trademarks are the property of their respective owners.
