Copyright © SUSE 2021
Accelerate Your AI Cloud Infrastructure
A Virtualization Perspective
12 April 2021
Liang Yan – SUSE Labs
Outline
• Background
• Cloud & AI
• Hardware Acceleration
• NVIDIA® GPU Virtualization
• Current status at SUSE
• Demo
• Running NGC inside a VM
• Current limitations and futures
• Q & A
Background
Building Machine Learning Infrastructure in the Cloud
https://www.7wdata.be/big-data/building-the-machine-learning-infrastructure/
Tools for Deep Learning
https://jameskle.com/writes/deep-learning-infrastructure-tooling
Hardware Accelerator Landscape
Component                   GPU                    FPGA                      ASIC
Participants                NVIDIA®, AMD®, INTEL®  Xilinx®, INTEL® (Altera)  TPU, AI Chips
Development Frameworks      OpenCL, CUDA           OpenCL                    OpenCL, TensorFlow
Machine Learning Lifecycle  Training               Inference                 Inference
FPGA: Field-Programmable Gate Array
ASIC: Application-Specific Integrated Circuit
TPU: Tensor Processing Unit
NVIDIA® GPU Virtualization
Why Choose NVIDIA
• Software Ecosystem
• Powerful Performance
https://becominghuman.ai/nvidia-and-the-gpu-contribution-to-the-ai-world-of-self-driving-cars-1f00e3212508
http://www.nvidia.com/object/grid-certified-servers.html
NVIDIA® GPU Virtualization
• Scalability
• Split
– Time Slices
– Framebuffer
• Isolate
– MDEV/SR-IOV
• Schedule
– RR, BOND
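As a rough illustration of the split and schedule ideas on this slide, the sketch below partitions a framebuffer statically and hands out GPU time in round-robin (RR) slices. It is a toy model, not NVIDIA's scheduler: the function names, the 16 GB and 4 GB sizes, and the workload figures are all assumptions chosen for illustration.

```python
from collections import deque


def max_instances(total_fb_gb, profile_fb_gb):
    """Static framebuffer split: how many fixed-size vGPUs fit on one GPU."""
    return total_fb_gb // profile_fb_gb


def round_robin(work_ms, slice_ms):
    """Return the (vgpu, ran_ms) slices in the order a round-robin policy runs them.

    work_ms maps a hypothetical vGPU name to the GPU time it still needs.
    """
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(slice_ms, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Unfinished work goes to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline


# Assumed sizes: a 16 GB GPU carved into 4 GB profiles.
print(max_instances(16, 4))

# Three hypothetical vGPUs sharing the GPU in 2 ms slices.
timeline = round_robin({"vgpu0": 5, "vgpu1": 2, "vgpu2": 3}, slice_ms=2)
for name, ran in timeline:
    print(f"{name} runs for {ran} ms")
```

The framebuffer is divided once, up front, while GPU time is shared continuously; this is why a vGPU's memory size and its share of compute time can vary independently.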
Current Status at SUSE®
SUSE Reference Platform: Tests and Results
• Test Setup
– Host: SUSE Linux Enterprise Server 15 SP2
– Guests: SUSE Linux Enterprise Server 15 SP2, 15 SP1, Windows Server 2019
– Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
– Benchmarks: LAMMPS, TensorRT, Perfview
• Functional Tests:
– Driver
– CUDA
– 3D Graphics
– Virt-manager display
– Max mdev support
• Performance Tests:
– vGPU vs Passthrough
– vGPU across different guest VMs
– vGPU with different memory configurations
– vGPU scalability
Performance Results

SPECviewperf                  creo-02     energy-02   maya-05     medical-02  sw-04
vGPU 16C                      54.74       22.87       60.02       42.77       53.35
vGPU 16Q                      52.62       36.35       60.3        55.5        51.52
Passthrough                   199.87      24.67       269.3       61.47       136.59
Diff vs. passthrough (16C)    -72.612%    -7.296%     -77.712%    -30.421%    -60.941%
Diff vs. passthrough (16Q)    -73.672%    +47.344%    -77.708%    -9.712%     -62.281%

SPECviewperf                  creo-02     energy-02   maya-05     medical-02  sw-04
vGPU 16C                      198.68      29.77       311.2       69.67       126.15
vGPU 16Q                      188.36      39.93       320.99      111.71      153.92
Passthrough                   199.87      24.67       269.3       61.47       136.59
Diff vs. passthrough (16C)    -0.5%       +20.7%      +15.6%      +13.3%      -7.64%
Diff vs. passthrough (16Q)    -5.7%       +61.9%      +19.2%      +91.5%      +12.7%
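The percentage rows in these tables appear to be the relative difference of each vGPU score against the passthrough score. A minimal sketch of that arithmetic, using the first table's vGPU 16C row; the function and variable names are assumptions for illustration:

```python
def relative_diff_percent(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Scores copied from the first SPECviewperf table.
passthrough = {
    "creo-02": 199.87,
    "energy-02": 24.67,
    "maya-05": 269.3,
    "medical-02": 61.47,
    "sw-04": 136.59,
}
vgpu_16c = {
    "creo-02": 54.74,
    "energy-02": 22.87,
    "maya-05": 60.02,
    "medical-02": 42.77,
    "sw-04": 53.35,
}

diffs = {
    name: relative_diff_percent(vgpu_16c[name], passthrough[name])
    for name in passthrough
}
for name, diff in diffs.items():
    print(f"{name}: {diff:+.3f}%")
```

A negative value means the vGPU scored lower than passthrough on that viewset, and a positive value means it scored higher.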
Performance Results

Averages, by precision:

fp32
Config        times       host walltime   99% percentile time
16C           21.79712    23.14408        22.43136
16Q           21.79712    22.59112        22.01536
4C            22.06052    22.95922        22.3498
4Q            21.8033     22.68498        21.94474
Passthrough   21.69214    22.08166        21.83638
4C-194        55.47008    56.92716        64.96326
4C-210        37.50868    38.47168        41.63402
4C-211        22.44084    23.30192        27.70308
4C-212        37.90536    38.91082        43.96012

fp16
Config        times       host walltime   99% percentile time
16C           22.0795     22.92232        22.50462
16Q           22.18726    23.07548        22.48336
4C            21.9007     22.76988        22.12804
4Q            22.24228    23.2023         22.43974
Passthrough   21.86884    22.2265         22.01682
4C-194        40.85198    41.91288        44.86924
4C-210        39.3009     40.47824        42.50746
4C-211        23.5173     24.25984        25.05932
4C-212        25.75758    26.4973         28.11208

int8
Config        times       host walltime   99% percentile time
16C           6.311586    6.868434        6.402446
16Q           6.332234    6.96658         6.39591
4C            6.071664    6.65127         6.197454
4Q            6.069044    6.632992        6.144616
Passthrough   6.064272    6.423492        6.161406
4C-194        6.073552    6.606642        6.17433
4C-210        11.095482   12.068492       12.20814
4C-211        10.803672   11.764844       11.968938
4C-212        7.265528    7.846488        8.420966
Conclusions
• No major discernible difference between vGPU and passthrough
• Similar results were achieved across different SUSE Linux Enterprise guest environments (15 SP2, 15 SP1)
• vGPU memory size showed no effect on performance (V100-16C vs V100-4C)
• vGPU model types showed no major differences (V100-16C vs V100-16Q)
• Scalability impacts performance, but results were still better than expected
– V100-16C vs 4 × V100-4C
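The scalability point can be checked with a little arithmetic on the fp32 figures from the TensorRT results. The sketch below assumes that the rows labelled 4C-194, 4C-210, 4C-211 and 4C-212 are four V100-4C guests running at the same time on one GPU, and that the "times" column is a duration where lower is better. Neither is stated explicitly on the slides, and the variable names are illustrative.

```python
from statistics import mean

# fp32 "times" column from the TensorRT results.
single_16c_fp32 = 21.79712
concurrent_4c_fp32 = {
    "4C-194": 55.47008,
    "4C-210": 37.50868,
    "4C-211": 22.44084,
    "4C-212": 37.90536,
}

# Average time across the four guests (assumed to run concurrently).
mean_concurrent = mean(concurrent_4c_fp32.values())

# Slowdown relative to a single guest that owns the whole GPU.
slowdown = mean_concurrent / single_16c_fp32

print(f"mean concurrent time: {mean_concurrent:.5f}")
print(f"slowdown vs single 16C: {slowdown:.2f}x")
```

Under these assumptions the average slowdown comes out well below the 4x that an even four-way split of GPU time would suggest, which is consistent with the "better than expected" conclusion.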
Feature Checklist - Review
• Graphic Performance
• CUDA installation
• AI Platform installation
• Remote Display
• Secure boot for vGPU
• VM Snapshots
• Live Migration
• A100 support
DEMO
Demo
• Test Setup
– Host: SUSE Linux Enterprise Server 15 SP2
– Guest: SUSE Linux Enterprise Server 15 SP2
– Hardware: HPE ProLiant DL380 Gen9, NVIDIA® Tesla V100
• Steps
– Secure trial license and acquire drivers
– Set up license server
– Install vGPU manager on SUSE Linux Enterprise Server 15 SP2
– Create vGPU
– Pass vGPU through to VM
– Install vGPU driver in VM
– Register vGPU
– Install CUDA
– Register NGC account
– Set up NGC environment
– Pull TensorRT image
– Run TensorRT benchmark
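The "create vGPU" and "pass through" steps can be sketched roughly as follows, assuming the vGPU is exposed through the Linux mediated device (mdev) sysfs interface and attached to the guest with a libvirt `hostdev` element. The helper names, the `sysfs_root` parameter, the PCI address and the type ID in the usage comment are hypothetical; the actual type IDs depend on the installed vGPU manager.

```python
import uuid
from pathlib import Path


def create_mdev(parent_pci, type_id, sysfs_root="/sys"):
    """Create a mediated device under the given parent and return its UUID.

    Writes a fresh UUID to the type's `create` node, as the kernel's mdev
    interface expects. Requires root on a real host.
    """
    type_dir = (
        Path(sysfs_root) / "class" / "mdev_bus" / parent_pci
        / "mdev_supported_types" / type_id
    )
    available = int((type_dir / "available_instances").read_text().strip())
    if available < 1:
        raise RuntimeError(f"no free instances of {type_id} on {parent_pci}")
    dev_uuid = str(uuid.uuid4())
    (type_dir / "create").write_text(dev_uuid)
    return dev_uuid


def hostdev_xml(dev_uuid):
    """Build a libvirt <hostdev> snippet that attaches the mdev to a guest."""
    return (
        "<hostdev mode='subsystem' type='mdev' model='vfio-pci'>\n"
        "  <source>\n"
        f"    <address uuid='{dev_uuid}'/>\n"
        "  </source>\n"
        "</hostdev>"
    )


# Hypothetical usage on a host with the vGPU manager loaded:
#   dev_uuid = create_mdev("0000:3b:00.0", "nvidia-example")
#   print(hostdev_xml(dev_uuid))  # paste into the guest's libvirt domain XML
```

The `sysfs_root` parameter exists only so the helper can be exercised against a scratch directory; on a real host the default `/sys` applies.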
Futures
Roadmap and Further Exploration
• Current:
– vGPU 12.x supported on SUSE Linux Enterprise Server 15 SP2
• Future:
– vGPU 12.x and 13.x (long-term release) to be supported with SUSE Linux Enterprise Server 15 SP3
– GPU passthrough for ARM64
– vGPU plugin in KubeVirt (Kubernetes scenario)
– vGPU plugin in SUSE Manager (lifecycle management tool)
– vGPU plugin in rust-vmm
Thank you
For more information, contact SUSE at:
+1 800 796 3700 (U.S./Canada)
+49 (0)911-740 53-0 (Worldwide)
Maxfeldstrasse 5
90409 Nuremberg
www.suse.com
© 2020 SUSE LLC. All Rights Reserved. SUSE and the SUSE logo are registered trademarks of SUSE LLC in the United States and other countries. All third-party trademarks are the property of their respective owners.
