Scaling MLOps on NVIDIA DGX Systems

Scaling MLOps on NVIDIA DGX platforms
Yochay Ettun, CEO
cnvrg.io
Michael Balint, Sr. Product Manager
NVIDIA

Version control (models, data, git, etc.)
Collaboration and sharing
Deployment
Production monitoring
Automate Continual Learning
Configuring hardware/platforms
Resource scheduling and assignments
Datasets import and management
Mass scale experimentation
Open source tools, plug-ins and dashboards
Hybrid cloud ML resource orchestration
Kubernetes and Container management
Models and repo management
Industry Pain Point: Data Scientists are doing very little Data Science
[Chart: Data Scientist Work Distribution*, axis 0% to 100%. Labels: 35%; Models, Insights, Production; Not Data Science Work]
50% OPEX waste
2X slower time to market
30% revenues lost

cnvrg.io Accelerates and Automates Data Science from Research to Production
[Diagram labels: MLOps/DevOps; Start, Deployment, Launch; Start, Deployment, Finish]
[Screenshot: cnvrg.io Datasets page, organization GTC2020, user @Johnm. Navigation: Projects, Datasets, AI Library, Dashboard, Compute, Team, Settings, Containers, Docs, Feedback]
Dataset          Badges   Size       Commits   Last active
fraud_jun        V D J    32.4 MB    4         2 days ago
fraud_may        V        57.6 MB    2         2 days ago
consumer_img1    -        27.3 MB    4         2 days ago
fraud_sim_base   V        45.1 MB    2         24 days ago
misc_base1       D J      25.1 MB    1         2 days ago
public_set12     V D J    102.7 MB   7         2 days ago
Summary: 6 datasets, 290.2 MB; 3 NFS connected; 659.5/360.77 GB used by cached commits
Cached commits (as listed on the page): 2, 0, 0, 1, 1, 0
Data selection
[Screenshot: cnvrg.io Compute page, Templates tab (other tabs: Jobs, Resources)]
Compute Templates
The available Compute Templates for Workspaces, Experiments, and model endpoints
Template                  CPUs     Memory   GPUs / Executors   Jobs   Status
DGX-1 4 GPUs              4 CPUs   8.0 GB   4 GPUs             2      HEALTHY
DGX-1 1 GPU               1 CPU    2.0 GB   1 GPU              0      HEALTHY
DGX Station 2 GPUs        2 CPUs   4.0 GB   2 GPUs             0      HEALTHY
DGX Station 1 GPU         1 CPU    20 GB    1 GPU              2      HEALTHY
Large 8 CPUs              8 CPUs   32 GB    8 Executors        0      HEALTHY
Medium 4 CPUs             4 CPUs   16 GB    4 Executors        2      HEALTHY
gpuxxl-p3.2xlarge                                              0      HEALTHY
gpuxl-spot – p2.2xlarge                                        2      HEALTHY
On-Board Compute
Data preparation
Model research
Experiments
cnvrg.io is a code-first, full-stack, open platform built on containers and Kubernetes. It accelerates data science from research to production across any platform, in any cloud.
Validation + Tuning

cnvrg.io ML/AI Control Plane Architecture
ON PREM
cnvrg CONTROL PLANE WORKERS (PODS or CONTAINERS)
CLOUD
Minimum Configuration
1. 8 CPU cores
2. 8 GB RAM
3. 30 GB disk space
4. Kubernetes 1.15+ or Docker 18.01+
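The minimum configuration above can be expressed as a small pre-flight check. The following is a minimal sketch, not part of cnvrg.io: `HostSpec`, `parse_version` and `unmet_requirements` are hypothetical names introduced for illustration, and the host figures are supplied by the caller rather than probed from the machine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Thresholds copied from the slide's minimum configuration.
MIN_CPU_CORES = 8
MIN_RAM_GB = 8
MIN_DISK_GB = 30
MIN_K8S = (1, 15)
MIN_DOCKER = (18, 1)


@dataclass(frozen=True)
class HostSpec:
    """Hypothetical description of a candidate control-plane host."""
    cpu_cores: int
    ram_gb: float
    disk_gb: float
    k8s_version: Optional[str] = None     # e.g. "1.15.0"
    docker_version: Optional[str] = None  # e.g. "18.09.1"


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.15.3' into (1, 15, 3). Non-numeric suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def unmet_requirements(spec: HostSpec) -> List[str]:
    """Return the names of the requirements the host does not meet."""
    problems = []
    if spec.cpu_cores < MIN_CPU_CORES:
        problems.append("cpu")
    if spec.ram_gb < MIN_RAM_GB:
        problems.append("ram")
    if spec.disk_gb < MIN_DISK_GB:
        problems.append("disk")
    # Either a recent enough Kubernetes or a recent enough Docker is sufficient.
    k8s_ok = spec.k8s_version is not None and parse_version(spec.k8s_version) >= MIN_K8S
    docker_ok = (spec.docker_version is not None
                 and parse_version(spec.docker_version) >= MIN_DOCKER)
    if not (k8s_ok or docker_ok):
        problems.append("runtime")
    return problems
```

An empty list means the host satisfies all four items; otherwise the list names what to fix before installing.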
50% OPEX saving
2X faster time to market
+30% revenues
Training
Data Prep
Deployed Model
Workspace (e.g. Jupyter Notebooks)
Open source tools (e.g. TensorBoard)

Meta-scheduler: Foundation for Accelerated Data Science Development & Production
DEPLOY, BUILD, EXPERIMENT, RESEARCH, MONITOR, DATASETS
On Prem Kubernetes Clusters
Bare metal On Prem Machines
Cloud Based Kubernetes Clusters
Cloud Based Instances
Other Schedulers
cnvrg meta-scheduler
cnvrg platform

Meta-scheduler: How Does it Work?
COMPUTE TEMPLATES
Template        GPUs    Memory              Resource
smallA          1 GPU   2 GB                DGX-1-A
smallB          1 GPU   4 GB                DGX-1-B
mediumA         4 GPU   16 GB               DGX-1-B
large           8 GPU   32 GB               DGX-2-A
1 GPU           1 GPU   4 GB                DGX Station 11
1 GPU           1 GPU   4 GB                DGX Station 12
Spark(Medium)   4 GPU   16 GB + 4 Workers   DGX-1-A
Spark(Large)    8 GPU   32 GB + 8 Workers   DGX-1-A
gpuxl-spot      1 GPU   4 GB                V100
gpuxxl          2 GPU   8 GB                V100
PRIORITY: smallA, smallB, gpuxl-spot
• Whenever a data scientist wants to attach compute to an experiment, workspace, project or a task in a flow, a drop-down menu shows all available compute templates
• The data scientist can rank compute resources in a priority list. If a resource is 100% utilized, the next one in line is picked
• Many use cases can be deployed:
1. Ensuring high on-prem utilization before bursting to the cloud
2. Resource segmentation per project
3. Cost-driven resource assignments
4. Etc.
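The priority-list behavior described above can be sketched in a few lines. This is an illustration of the stated rule only (skip any template that is 100% utilized, take the next in line), not the cnvrg.io implementation; `pick_template` and the utilization map are hypothetical.

```python
from typing import Dict, List, Optional


def pick_template(priority: List[str], utilization: Dict[str, float]) -> Optional[str]:
    """Return the first template in priority order that is not fully utilized.

    `utilization` maps template name to a fraction in [0.0, 1.0].
    Templates missing from the map are treated as fully utilized (an assumption).
    """
    for name in priority:
        if utilization.get(name, 1.0) < 1.0:
            return name
    return None  # every listed resource is busy


# Example using the priority list from the slide: two on-prem DGX templates
# first, then a cloud spot instance as the burst target.
choice = pick_template(
    ["smallA", "smallB", "gpuxl-spot"],
    {"smallA": 1.0, "smallB": 1.0, "gpuxl-spot": 0.2},
)
print(choice)
```

Ordering on-prem templates ahead of cloud templates is what yields the "high on-prem utilization before bursting to the cloud" use case.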

Flows Decouple Workloads and Workflow from the Physical Assets and Location
DGX-1 CLUSTER
DGX-2
DGX Station
DGX-1
• Assign each task to a different compute type, location and scale
• Flows also serve as an automation tool, running iteratively based on new datasets
• Flows can be versioned, modified, shared, stored, revoked and customized
• When a task completes, it frees the resource
• Flows are a foundation for mass scale experimentation and continual machine learning
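The idea that each task in a flow carries its own compute assignment, and releases it on completion, can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: `Task`, `run_flow` and the slot counts are hypothetical and do not reflect the cnvrg.io Flows API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    compute: str                       # compute template the task runs on
    run: Callable[[], None]            # the work itself
    depends_on: List[str] = field(default_factory=list)


def run_flow(tasks: List[Task], free_slots: Dict[str, int]) -> List[str]:
    """Run tasks one at a time in dependency order; return the execution order.

    A task holds one slot of its compute template while it runs and
    frees it as soon as it completes, even if it raises.
    """
    done: List[str] = []
    pending = list(tasks)
    while pending:
        ready = [
            t for t in pending
            if all(dep in done for dep in t.depends_on)
            and free_slots.get(t.compute, 0) > 0
        ]
        if not ready:
            raise RuntimeError("flow is stuck: unmet dependency or no free compute")
        task = ready[0]
        free_slots[task.compute] -= 1      # acquire the resource
        try:
            task.run()
        finally:
            free_slots[task.compute] += 1  # task finished: free the resource
        done.append(task.name)
        pending.remove(task)
    return done
```

Because the compute name is just a field on the task, moving a step from a DGX Station to a DGX-2 changes one string and leaves the workflow itself untouched.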

Launch Mass Scale Experiments
DGX-2
DGX Station
DGX-1
DGX-1 CLUSTER
Launch experiments with different arguments (i.e. HPS)
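Launching one experiment per argument combination amounts to expanding a grid. The sketch below shows that expansion with the standard library; `expand_grid`, `to_command`, the script name `train.py` and the flag names are assumptions for illustration, and how the platform actually submits each run is not covered by the slide.

```python
from itertools import product
from typing import Any, Dict, List, Sequence


def expand_grid(grid: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Return one parameter dict per combination of the listed values."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def to_command(script: str, params: Dict[str, Any]) -> List[str]:
    """Build the argument vector for one experiment (hypothetical CLI flags)."""
    cmd = ["python", script]
    for key, value in params.items():
        cmd += [f"--{key}", str(value)]
    return cmd


# Two learning rates x two batch sizes = four experiments.
experiments = expand_grid({"lr": [0.1, 0.01], "batch_size": [32, 64]})
for params in experiments:
    print(" ".join(to_command("train.py", params)))
```

Each resulting command is independent, so the runs can be spread across whatever compute templates are free.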

Onboard Any DGX Machines
Create ‘one-click’ Compute Templates
Assign ML Models from NGC
Meta-scheduler
100’s of Parallel Experiments
Distributed Pipelines
DGX-3 DGX-1 POD DGX-2 DGX Stations
Put Your Cluster into Action with High Utilization

Attach Distributed Training Model to Multi-nodes
[Diagram: 16 V100 GPUs shown as four rows of four, MPI Enabled Connectivity]
1. Define multi-node compute template
2. Attach Distributed Node Compute Template to the Model
3. Execute! (Run)
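To make the three steps concrete, here is a minimal sketch of what a multi-node compute template might hold and how it could translate into an MPI launch. `MultiNodeTemplate`, the host names and `train.py` are hypothetical, and the command uses Open MPI-style `mpirun -np` and `-H host:slots` arguments as an assumption; the slide does not describe how cnvrg.io launches the job internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MultiNodeTemplate:
    """Step 1: a compute template that spans several nodes (hypothetical)."""
    hosts: List[str]
    gpus_per_node: int

    @property
    def world_size(self) -> int:
        """Total number of worker processes, one per GPU."""
        return len(self.hosts) * self.gpus_per_node


def mpirun_command(template: MultiNodeTemplate, script: str) -> List[str]:
    """Steps 2 and 3: attach the template to a training script and build the launch command."""
    # Open MPI-style host list, e.g. "node1:4,node2:4" (slots per host).
    host_arg = ",".join(f"{host}:{template.gpus_per_node}" for host in template.hosts)
    return ["mpirun", "-np", str(template.world_size), "-H", host_arg, "python", script]


# Four nodes with four V100s each, matching the 16 GPUs in the diagram.
template = MultiNodeTemplate(hosts=["node1", "node2", "node3", "node4"], gpus_per_node=4)
print(" ".join(mpirun_command(template, "train.py")))
```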

Delivering the DevOps and Data Science Unified Control Plane
ON PREM CPUs / GPUs / AI Servers
cnvrg CONTROL PLANE WORKERS (PODS or CONTAINERS)
CLOUD
Training (e.g. PyTorch)
Data Prep (e.g. Spark)
Deployed Model
Workspace (e.g. Jupyter Notebooks)
Open source tools (e.g. TensorBoard, Grafana)

Adding New Infrastructure that Integrates with the Cluster
DGX-2
DGX-1
DGX-1 CLUSTER
DGX A100
• No forklift upgrade
• No need to ‘lift and shift’ configurations, environments or applications
• Integrates with the existing infrastructure

Mass Scale and Automated DGX Benchmarks
DGX-3 DGX-3
1. Onboard machines
2. Create templates
3. Run 100s of benchmarks

Find the ‘Right-Size A100 GPU’ for your Workload
1. Create templates
2. Assign NGC containers in one click
3. Profile and compare results, find the ‘right-size’ GPU
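One way to read "right-size" is: among the benchmarked templates that meet a target, pick the smallest. The sketch below encodes that reading as an assumption; the slide does not define the selection rule, and `BenchmarkResult`, `right_size` and the throughput metric are hypothetical. No real benchmark numbers are implied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    template: str       # compute template the benchmark ran on
    gpus: int           # GPUs (or GPU slices) the template provides
    throughput: float   # measured samples per second


def right_size(results: List[BenchmarkResult],
               target_throughput: float) -> Optional[BenchmarkResult]:
    """Return the smallest template that meets the target, or None.

    Ties on GPU count are broken in favor of the higher throughput.
    """
    good_enough = [r for r in results if r.throughput >= target_throughput]
    if not good_enough:
        return None
    return min(good_enough, key=lambda r: (r.gpus, -r.throughput))
```

In practice the comparison might weigh cost or latency instead of GPU count; only the sort key would change.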

Increase Utilization and Grow from 1-2z to cnvrg.io meta-scheduler
1. Single POD, low utilization
2. Clusters with high utilization, with the cnvrg.io meta-scheduler
DGX-3 DGX-1 POD DGX-2
Workload mobility between DGX platforms
