Part 3: Maximizing the Utilization of GPU Resources On-Premise and in the Cloud
1. Adam Tetelman, Solutions Architect
An overview of the MIG feature and its configuration caveats, with details on isolation,
communication, value-add, and use-cases.
MIG Technical Overview
2. Why - MIG on the DGX A100
Multi-Instance GPU - Major MIG benefits and goals
● The goal of the MIG feature is to increase GPU utilization. This is done by partitioning a single GPU into
multiple fully isolated devices that are efficiently sized per use-case, specifically smaller use-cases that
require only a subset of GPU resources.
● Benefits of MIG on the NVIDIA A100:
○ Physical allocation of resources used by parallel GPU workloads
→ Secure multi-tenant environments with isolation and predictable QoS
○ Versatile profiles with dynamic configuration (152 configurations on a DGX!)
→ Maximized utilization by configuring NVIDIA A100 for specific workloads
○ CUDA programming model unchanged
→ No code changes required
3. When - MIG on the DGX A100
Workloads that don’t utilize the full A100 GPU - HPC, prototyping, inference, and light training
4. How - MIG on the DGX A100
Optimize GPU utilization and expand access to more users with guaranteed quality of service
● Up to 7 GPU instances in a single A100: dedicated SMs, memory, L2 cache, and bandwidth for hardware QoS and isolation
● Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency
● Right-sized GPU allocation: different sized MIG instances based on target workloads
● Diverse deployment environments: supported with bare metal, Docker, Kubernetes, and virtualized environments
MIG User Guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
6. MIG Terminology - Instances
A MIG device is made up of a GPU Instance and a Compute Instance
● Multi-Instance GPU (MIG) - The MIG feature allows one or more GPU Instances to be allocated within a
GPU, making a single GPU appear as if it were many.
● GPU Instance (GI) - A fully isolated collection of physical GPU resources. Contains one or more
Compute Instances.
● Compute Instance (CI) - An isolated collection of GPU SMs (CUDA cores) that belongs to a single GPU Instance.
Shares GPU memory with the other CIs in that GPU Instance.
● Instance Profile - A GPU Instance Profile (GIP) or Compute Instance Profile (CIP) defines the
configuration and available resources of an Instance.
● MIG Device - A MIG Device is the actual “GPU” an application sees: the combination of a GPU,
a GPU Instance, and a Compute Instance.
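To make this terminology concrete, below is a hedged sketch of how nvidia-smi -L enumerates MIG devices once instances have been created; the profile, UUIDs, and instance IDs are illustrative placeholders, not real output.
$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-<gpu-id>)
  MIG 3g.20gb Device 0: (UUID: MIG-<gpu-id>/1/0)
  MIG 3g.20gb Device 1: (UUID: MIG-<gpu-id>/2/0)
Each MIG device UUID encodes the GPU, the GPU Instance ID, and the Compute Instance ID, mirroring the MIG Device definition above.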
7. GPU Instances vs. Compute Instances
Full isolation vs. shared resources
A GPU instance per user; a compute instance per process
8. GPU Instances in MIG
Applications in a single GPU Instance get isolation and guaranteed QoS
Each GPU Instance has physical partitions of GPU memory, GPU SMs, and all other GPU hardware. This
provides each GPU Instance (and any VMs running on it) with guaranteed QoS through resource isolation.
Each Compute Instance within a GPU Instance provides partial isolation: isolated compute resources and
independent workload scheduling.
Clock speed, MIG profile configuration, and other settings should be optimized for the expected MIG
use-cases; there is no single “best” or “optimal” combination of profiles and configurations.
The creation of GPU instances requires root privileges.
9. Compute Instances on MIG
Processes in a Compute Instance get isolation & flexibility at the expense of QoS
Each Compute Instance within a GPU Instance provides functional isolation: isolated compute resources
and independent workload scheduling.
Processes in a CI (such as a container) execute as applications against that Compute Instance.
A process can get full isolation by being the only CUDA application running on a specific GPU Instance.
If multiple processes run in Compute Instances on the same GPU Instance, the same level of isolation is
not provided, but IPC is available.
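As an illustration of CI-level sharing, the hedged sketch below splits one GPU Instance into two Compute Instances and pins one process to each; the GPU Instance ID, CI profile IDs, device UUIDs, and application names are assumptions.
sudo nvidia-smi mig -lcip # list the Compute Instance profiles available on each GPU Instance
sudo nvidia-smi mig -gi 1 -cci 0,0 # create two small CIs inside GPU Instance 1 (profile IDs from -lcip)
# each process gets its own CI's SMs but shares GPU Instance 1's memory with the other
CUDA_VISIBLE_DEVICES=MIG-<gpu-id>/1/0 ./app_a &
CUDA_VISIBLE_DEVICES=MIG-<gpu-id>/1/1 ./app_b &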
10. MIG Terminology - Slices
Memory slices, compute slices, and GPU engines
● GPU Slice - The smallest fraction of the GPU that combines a single GPU memory slice and a single
GPU SM slice.
● GPU Memory Slice - The smallest fraction of the GPU’s memory, including the corresponding memory
controllers and cache. A GPU memory slice is roughly 1/8 of the total GPU memory resources, including
both capacity and bandwidth (on an A100-SXM4-40GB, roughly 5 GB).
● GPU SM Slice - The smallest fraction of the GPU’s SMs, roughly 1/7 of the total number of SMs (on an
A100, 14 SMs).
● GPU Engine - What executes work on the GPU. Different engines are responsible for different actions,
such as the Compute engine or the Copy engine. Engines are scheduled independently and execute work
in a GPU context.
11. Configuration Options with MIG
Configuring MIG on the NVIDIA A100
Flexible configurations
► Driver presents the available profiles
► 18 combinations possible
Constraints
► Graphics contexts not supported
► P2P not available (e.g. no NVLink)
► For running VMs within each instance, vGPU is required
Setup
► The workflow to enable MIG requires a GPU reset
► The setting is persistent across reboots (stored in the InfoROM)
► Reconfiguration is dynamic and can be done from a container
[Figure: MIG instances on an A100-SXM4-40GB]
12. GPU Instance Profiles
All the options for a single A100-SXM4-40GB GPU:
● 1g.5gb - 7 instances per GPU, 14 SMs, 5 GB memory, 0 NVDECs. Training targets: BERT fine-tuning
(e.g. SQuAD), multiple chatbots, Jupyter notebooks.
● 2g.10gb - 3 instances per GPU, 28 SMs, 10 GB memory, 1 NVDEC.
● 3g.20gb - 2 instances per GPU, 42 SMs, 20 GB memory, 2 NVDECs. Training targets: ResNet-50, BERT,
and W&D networks.
● 4g.20gb - 1 instance per GPU, 56 SMs, 20 GB memory, 2 NVDECs.
● 7g.40gb - 1 instance per GPU, 98 SMs, 40 GB memory, 5 NVDECs.
● Inference target across profiles: multiple inference servers (e.g. Triton) serving ResNet-50, BERT, and
W&D networks.
* A small carveout (~0.5 GB) is needed for MIG management and is not usable by the end user
* The GPU Instance profile name is <SM_SLICE_COUNT>g.<GPU_MEMORY_COUNT>gb
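To see these profiles as the driver reports them, nvidia-smi can list them directly; this is a sketch assuming a MIG-capable driver, and the long-form flag spellings are worth confirming against nvidia-smi mig --help.
nvidia-smi mig -lgip # list the GPU Instance profiles the driver presents, with free/total counts (long form: --list-gpu-instance-profiles)
nvidia-smi mig -lgipp # list the placements (start:size) each profile can occupy (long form: --list-gpu-instance-possible-placements)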
13. 13
GPU reset required to enable/disable MIG mode, this one-time operation is
per-gpu and persists across reboots
Use NVML/nvidia-smi to manage MIG
Configuration supported through management containers
Configure and reconfigure as-needed
Example: List available profiles with nvidia-smi
MIG Management & Configuration
List/Create/Update/Destroy Instances via NVML and nvidia-smi
# nvidia-smi mig --list-gpu-instances
+------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|================================================|
| 0 1g.5gb 19 9 2:1 |
+------------------------------------------------+
| 0 1g.5gb 19 10 3:1 |
+------------------------------------------------+
| 0 1g.5gb 19 13 6:1 |
+------------------------------------------------+
| 0 2g.10gb 14 3 0:2 |
+------------------------------------------------+
| 0 2g.10gb 14 5 4:2 |
+------------------------------------------------+
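Putting the pieces together, here is a hedged end-to-end sketch: enable MIG mode, create GPU Instances plus their default Compute Instances, verify, and tear down. Profile IDs 19 and 14 match the 1g.5gb and 2g.10gb rows in the listing above; confirm the IDs on your own system with nvidia-smi mig -lgip.
sudo nvidia-smi -i 0 -mig 1 # enable MIG mode on GPU 0 (one-time, needs a GPU reset, persists across reboots)
sudo nvidia-smi mig -i 0 -cgi 19,19,14 -C # create two 1g.5gb and one 2g.10gb GIs; -C adds the default CI in each
sudo nvidia-smi mig -lgi # verify the instances were created
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi # tear down: destroy CIs first, then GIs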
14. Using MIG in Docker Containers
Passing through specific MIG devices
With MIG disabled, a GPU is specified by its GPU index or GPU UUID. When MIG is enabled, however,
the device is specified by index in the format <gpu-device-index>:<mig-device-index> or by UUID in the
format MIG-<gpu-id>/<gpu-instance-id>/<gpu-compute-instance-id>.
Note: Commands are slightly different on Docker version 18.x (shown here is 19.03+)
Run containers on GPUs and MIG devices:
docker run --gpus '"device=0,1"' nvidia/cuda:9.0-base nvidia-smi # MIG DISABLED
docker run --gpus '"device=0:0,0:1"' nvidia/cuda:9.0-base nvidia-smi # MIG ENABLED
Configure MIG (as root):
docker run --cap-add SYS_ADMIN --gpus '"device=0"' -e NVIDIA_MIG_CONFIG_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
Configure MIG (non-root):
chown nobody /proc/driver/nvidia/capabilities/mig/config
docker run --cap-add SYS_ADMIN --user nobody --gpus '"device=0"' -e NVIDIA_MIG_CONFIG_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
MIG monitoring:
docker run --cap-add SYS_ADMIN --gpus '"device=0"' -e NVIDIA_MIG_MONITOR_DEVICES="all" nvidia/cuda:9.0-base nvidia-smi
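A hedged variant of the same run command using the UUID format described above; the UUID is a placeholder for a value taken from nvidia-smi -L, and passing MIG UUIDs through --gpus assumes a recent NVIDIA container toolkit.
docker run --gpus '"device=MIG-<gpu-id>/1/0"' nvidia/cuda:9.0-base nvidia-smi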
15. MIG for Maximized Resource Utilization
Dynamically change available cloud GPU instance types to meet demand
● Inferencing - 7 GPU Instances (GI) available, 5 GB of GPU memory per model. Run multiple inference
servers simultaneously (e.g. Jarvis chatbots).
● Development - 7 GIs available, 5 GB per model. Serve multiple development environments simultaneously
(e.g. prototyping in Jupyter notebooks).
● Fine-tuning - 3 GIs available, 10 GB per model. Perform fine-tuning for multiple smaller models and
datasets simultaneously (e.g. BERT fine-tuning).
● Training - 2 GIs available, 20 GB per model. Train multiple larger models simultaneously (e.g. ResNet-50
training).
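As a hedged sketch of that dynamic reconfiguration, the commands below switch an idle GPU from the seven-way inferencing/development layout to the two-way training layout. Profile ID 19 is 1g.5gb per the earlier listing; ID 9 is assumed to be 3g.20gb (confirm with nvidia-smi mig -lgip), and instances must be idle before they can be destroyed.
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C # 7x 1g.5gb for inference or development
sudo nvidia-smi mig -i 0 -dci && sudo nvidia-smi mig -i 0 -dgi # drain workloads, then tear down
sudo nvidia-smi mig -i 0 -cgi 9,9 -C # 2x 3g.20gb for training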
16. Mixed Workloads with MIG
Running 3 different applications on a single GPU
[Figure: one GPU partitioned into three GPU Instances, each with its own compute and memory partitions
(framebuffer, L2 cache, compute engines, NVDECs): a 4-slice instance running 4 parallel CUDA processes
(P0-P3), a 2-slice instance running one CUDA process (P0), and a 1-slice instance running a debugger.]
When a whole GPU is underutilized, use MIG to better match GPU resources to users’ needs, optimizing
overall GPU utilization.
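A hedged docker-based sketch of the scenario in the figure; the image names, scripts, and the mapping of MIG device indices to instance sizes are assumptions for illustration.
docker run -d --gpus '"device=0:0"' my-training-image ./run_four_process_job.sh # 4-slice instance (P0-P3)
docker run -d --gpus '"device=0:1"' my-batch-image ./run_single_process_job.sh # 2-slice instance (P0)
docker run -it --gpus '"device=0:2"' my-dev-image cuda-gdb ./app # 1-slice instance running the debugger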
17. MIG for Optimized Inferencing
Scaling out microservices with Triton and MIG
● Horizontally scale out your containers or VMs 7x by using MIG compute instances instead of whole GPU
devices
● No updates needed to application code; deployment code must be updated to target MIG devices rather
than whole GPUs
● Continue using the microservice best practice of one server per app, or allow Triton to manage all MIG
devices
● Ideal for batch-size-1 inferencing
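A hedged sketch of the 7x scale-out: one Triton container per 1g.5gb MIG device, each mapped to its own host port. The image tag, the model repository path, and the assumption that the seven MIG devices enumerate as 0:0 through 0:6 are all illustrative.
for i in 0 1 2 3 4 5 6; do
  docker run -d --gpus "device=0:${i}" \
    -p $((8000 + i)):8000 \
    -v /path/to/models:/models \
    nvcr.io/nvidia/tritonserver:20.12-py3 \
    tritonserver --model-repository=/models
done
Each server then answers HTTP inference requests on its own port (8000-8006) while MIG guarantees each one its own SMs, memory, and bandwidth.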
18. MIG for Many Developers
Using a single DGX A100 as a development platform for 50+ people
Mix user development that requires minimal resources with large, resource-intensive workloads on the
same platform.
Give a very large number of users GPUs for development work, learning, demos, and POCs.
Perfect for:
● Dev teams that do not need extensive GPU memory
● Large instructor-led workshops
● Academic coursework and institutions
● Sales organizations with ongoing demos
20. Summary of MIG
Flexibility & Isolation
● MIG allows you to configure a single GPU to fit any set of workloads
● A MIG device looks like an ordinary GPU, so no changes to user code are necessary
● With the proper MIG configuration, no GPU resources remain idle
● A single A100 can be dynamically configured to best serve the current workload
● MIG integrates into existing monitoring frameworks and works with virtualization and container
orchestration platforms
● As your team grows and your clusters scale, the DGX A100 with MIG will continue to be the right tool
for the job