Scaling MLOps on
NVIDIA DGX platforms
Yochay Ettun, CEO
cnvrg.io
Michael Balint, Sr. Product Manager
NVIDIA
Announcing DGX-Ready
AI Boom
DGX A100
MIG1
MIG2
END2END
Version control (models, data, git, etc.)
Collaboration and sharing
Deployment
Production monitoring
Automate Continual Learning
Configuring hardware/platforms
Resource scheduling and assignments
Datasets import and management
Mass scale experimentation
Open source tools, plug-ins and dashboards
Hybrid cloud ML resource orchestration
Kubernetes and Container management
Models and repo management
Industry Pain Point: Data Scientists are doing very little Data Science
[Chart: Data Scientist Work Distribution*, axis 0% to 100%. 35%: Models, Insights, Production. Remainder: Not Data Science Work.]
50% OPEX WASTE
2X SLOWER TIME TO MARKET
30% REVENUES LOST
cnvrg.io Accelerates and Automates Data Science from Research to Production
MLOps/DevOps
START
Deployment
LAUNCH
Start
Deployment
Finish
[Screenshot: cnvrg.io Datasets page]
Datasets (+ New Dataset)
fraud_jun         32.4 MB    4 commits   Active 2 days ago
fraud_may         57.6 MB    2 commits   Active 2 days ago
consumer_img1     27.3 MB    4 commits   Active 2 days ago
fraud_sim_base    45.1 MB    2 commits   Active 24 days ago
misc_base1        25.1 MB    1 commits   Active 2 days ago
public_set12     102.7 MB    7 commits   Active 2 days ago
Cached commits (as listed): 2, 0, 0, 1, 1, 0
Summary: 6 Datasets, 290.2 MB; 3 NFS Connected; 659.5/360.77 GB used by cached commits
Data selection
Compute Templates
Compute Templates
The available Compute Templates for Workspaces, Experiments, and model endpoints
Template                  CPUs     Memory    GPUs / Executors   Jobs     Status
DGX-1 4 GPUs              4 CPUs   8.0 GB    4 GPUs             2 Jobs   HEALTHY
DGX-1 2 GPUs              2 CPUs   4.0 GB    2 GPUs             1 Jobs   HEALTHY
DGX-1 1 GPU               1 CPUs   2.0 GB    1 GPUs             0 Jobs   HEALTHY
DGX-2 8 GPUs              8 CPUs   16.0 GB   8 GPUs             1 Jobs   HEALTHY
DGX-2 4 GPUs              4 CPUs   8.0 GB    4 GPUs             1 Jobs   HEALTHY
DGX Station 2 GPUs        2 CPUs   4.0 GB    2 GPUs             0 Jobs   HEALTHY
DGX Station 1 GPU         1 CPUs   20 GB     1 GPUs             2 Jobs   HEALTHY
Large 8 CPUs              8 CPUs   32 GB     8 Executors        0 Jobs   HEALTHY
Medium 4 CPUs             4 CPUs   16 GB     4 Executors        2 Jobs   HEALTHY
gpuxxl-p3.2xlarge                                               0 Jobs   HEALTHY
gpuxl-spot – p2.2xlarge                                         2 Jobs   HEALTHY
+ Add Compute Template
On-Board Compute
Data preparation
Model research
Experiments
cnvrg.io is a code-first, full-stack, open platform built on
containers and Kubernetes. It accelerates data science
from research to production across any platform
and any cloud
Validation + Tuning
cnvrg.io ML/AI Control Plane Architecture
ON PREM
cnvrg CONTROL PLANE WORKERS (PODS or CONTAINERS)
CLOUD
Minimal Configuration
1. 8 CPUs (core)
2. 8GB RAM
3. 30GB Disk Space
4. K8S 1.15+ or Docker 18.01+
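The minimal configuration above can be read as a simple preflight check. The sketch below is illustrative only: the three thresholds come from the slide, while the function name and the idea of passing measured values in by hand are assumptions, not part of cnvrg.io. The Kubernetes/Docker version requirement is not checked here.

```python
from typing import List

# Thresholds taken from the slide; everything else is a hypothetical helper.
MIN_CPU_CORES = 8
MIN_RAM_GB = 8
MIN_DISK_GB = 30


def unmet_requirements(cpu_cores: int, ram_gb: float, disk_gb: float) -> List[str]:
    """Return a description of each unmet minimum (empty list means the host qualifies)."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU cores: need {MIN_CPU_CORES}, found {cpu_cores}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: need {MIN_RAM_GB} GB, found {ram_gb} GB")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: need {MIN_DISK_GB} GB, found {disk_gb} GB")
    return problems


# A host with 4 cores, 16 GB RAM and 100 GB disk fails only the CPU check.
print(unmet_requirements(cpu_cores=4, ram_gb=16, disk_gb=100))
```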
50% OPEX SAVING
2X FASTER TIME TO MARKET
+30% REVENUES
Training
Data Prep
Deployed Model
Workspace (e.g. Jupyter Notebooks)
Open source tools (e.g. TensorBoard)
Meta-scheduler: Foundation for Accelerated Data Science Development & Production
DEPLOY BUILD EXPERIMENT RESEARCH MONITOR DATASETS
On Prem
Kubernetes Clusters
Bare metal On Prem
Machines
Cloud Based
Kubernetes Clusters
Cloud Based
Instances
Other Schedulers
cnvrg meta-scheduler
cnvrg platform
Meta-scheduler: How Does it Work?
Compute
Template        GPUs    Memory               Resource
smallA          1 GPU   2 GB                 DGX-1-A
smallB          1 GPU   4 GB                 DGX-1-B
mediumA         4 GPU   16 GB                DGX-1-B
large           8 GPU   32 GB                DGX-2-A
1 GPU           1 GPU   4 GB                 DGX Station 11
1 GPU           1 GPU   4 GB                 DGX Station 12
Spark(Medium)   4 GPU   16 GB + 4 Workers    DGX-1-A
Spark(Large)    8 GPU   32 GB + 8 Workers    DGX-1-A
gpuxl-spot      1 GPU   4 GB                 V100
gpuxxl          2 GPU   8 GB                 V100
COMPUTE TEMPLATES
PRIORITY: smallA, smallB, gpuxl-spot
• Whenever a data scientist wants to attach compute to an experiment, workspace, project or a task in a flow, a drop-down menu shows all available compute templates
• A data scientist can rank compute resources in a priority list. If a resource is 100% utilized, the next in line is picked
• This supports many use cases:
1. Ensuring high on-prem utilization before bursting to the cloud
2. Resource segmentation per project
3. Cost-driven resource assignments
4. Etc.
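The priority-list behaviour described above can be sketched in a few lines. This is a minimal illustration, not cnvrg.io's implementation: the `ComputeTemplate` class, the `pick_template` function, and the choice to model utilization as GPUs in use are all assumptions. The template names are borrowed from the slide; the utilization figures are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComputeTemplate:
    name: str
    gpus_total: int
    gpus_in_use: int = 0  # hypothetical utilization figure

    @property
    def fully_utilized(self) -> bool:
        return self.gpus_in_use >= self.gpus_total


def pick_template(priority: List[ComputeTemplate]) -> Optional[ComputeTemplate]:
    """Return the first template in priority order that is not 100% utilized."""
    for template in priority:
        if not template.fully_utilized:
            return template
    return None  # every listed resource is busy


# On-prem templates come first, the cloud spot template last (burst only when needed).
priority = [
    ComputeTemplate("smallA", gpus_total=1, gpus_in_use=1),
    ComputeTemplate("smallB", gpus_total=1, gpus_in_use=0),
    ComputeTemplate("gpuxl-spot", gpus_total=1, gpus_in_use=0),
]
print(pick_template(priority).name)
```

Ordering on-prem templates ahead of cloud ones is how the "high on-prem utilization before bursting" use case would fall out of this scheme.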
Flows Decouple Workloads and Workflow from the Physical Assets and Location
DGX-1 CLUSTER
DGX-2
DGX Station
DGX-1
• Assign each task to a different compute type, location and scale
• Flows also serve as an automation tool, running iteratively based on new datasets
• Flows can be versioned, modified, shared, stored, revoked and customized
• When a task completes, it frees its resource
• Flows are a foundation for mass scale experimentation and continual machine learning
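To make the idea of a flow concrete, here is a small sketch in which each task names its own compute and the resource is released as soon as the task finishes. The `Task` class, `run_flow` function, and task names are hypothetical, and the sketch assumes the flow is a DAG (there is no cycle detection); it does not reflect cnvrg.io's actual flow format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    name: str
    compute: str                      # compute label; values below mirror the slide
    run: Callable[[], None]
    depends_on: List[str] = field(default_factory=list)


def run_flow(tasks: List[Task]) -> List[Tuple[str, str, str]]:
    """Run tasks in dependency order and log when each compute is held and freed."""
    by_name: Dict[str, Task] = {t.name: t for t in tasks}
    finished = set()
    log: List[Tuple[str, str, str]] = []

    def visit(task: Task) -> None:
        if task.name in finished:
            return
        for dep in task.depends_on:
            visit(by_name[dep])
        log.append(("acquire", task.name, task.compute))
        try:
            task.run()
        finally:
            # The resource is freed as soon as the task completes (or fails).
            log.append(("release", task.name, task.compute))
        finished.add(task.name)

    for task in tasks:
        visit(task)
    return log


flow = [
    Task("train", "DGX-2", lambda: None, depends_on=["prep"]),
    Task("prep", "DGX Station", lambda: None),
    Task("deploy", "DGX-1", lambda: None, depends_on=["train"]),
]
for event in run_flow(flow):
    print(event)
```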
Launch Mass Scale Experiments
DGX-2
DGX Station
DGX-1
DGX-1 CLUSTER
Launch experiments with different arguments (i.e. HPS)
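Assuming HPS here stands for hyperparameter search, launching experiments "with different arguments" amounts to expanding a grid of values into one argument set per experiment. The sketch below shows that expansion only; `expand_grid`, `to_command`, the script name `train.py`, and the parameter names are illustrative assumptions, and the sketch does not submit anything to a scheduler.

```python
import itertools
from typing import Dict, Iterator, List


def expand_grid(grid: Dict[str, List]) -> Iterator[Dict]:
    """Yield one argument dict per combination of values in the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def to_command(script: str, args: Dict) -> str:
    """Render one experiment as a command line (hypothetical flag style)."""
    flags = " ".join(f"--{key} {value}" for key, value in args.items())
    return f"python {script} {flags}"


grid = {"lr": [0.01, 0.001], "batch_size": [32, 64]}
for args in expand_grid(grid):
    print(to_command("train.py", args))
```

The number of experiments is the product of the list lengths, so grids grow quickly; that is where spreading runs across several compute templates becomes useful.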
Onboard Any DGX Machines
Create ‘one-click’ Compute Templates
Assign ML Models from NGC
Meta-scheduler
100s of Parallel Experiments
Distributed Pipelines
DGX-3 DGX-1 POD DGX-2 DGX Stations
Put Your Cluster into Action with High Utilization
Attach Distributed Training Model to Multi-nodes
V100 V100 V100 V100
V100 V100 V100 V100
V100 V100 V100 V100
V100 V100 V100 V100
MPI-Enabled Connectivity
1. Define a multi-node compute template
2. Attach the distributed-node compute template to the model
3. Run: Execute!
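The 4 x 4 grid of V100s suggests 16 workers. Assuming it depicts four nodes with four GPUs each (the slide does not say), a common convention in MPI-style training is one rank per GPU, laid out node by node. The sketch below shows only that arithmetic; `Placement` and `place_ranks` are hypothetical names, and nothing here is claimed to be how cnvrg.io assigns ranks.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    rank: int        # global rank across the whole job
    node: int        # index of the node hosting this rank
    local_gpu: int   # GPU index within that node


def place_ranks(nodes: int, gpus_per_node: int) -> List[Placement]:
    """Map each global rank to a (node, local GPU) pair, filling nodes in order."""
    world_size = nodes * gpus_per_node
    return [
        Placement(rank, rank // gpus_per_node, rank % gpus_per_node)
        for rank in range(world_size)
    ]


# Assumed layout: 4 nodes x 4 GPUs.
for placement in place_ranks(nodes=4, gpus_per_node=4):
    print(placement)
```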
Delivering the DevOps and Data Science Unified Control Plane
ON PREM CPUs / GPUs / AI Servers
cnvrg CONTROL PLANE WORKERS (PODS or CONTAINERS)
CLOUD
Training (e.g. PyTorch)
Data Prep (e.g. Spark)
Deployed Model
Workspace (e.g. Jupyter Notebooks)
Open source tools (e.g. TensorBoard, Grafana)
Adding New Infrastructure that Integrates with the Cluster
DGX-2
DGX-1
DGX-1 CLUSTER
DGX A100
• No forklift upgrade
• No need to ‘lift and shift’ configurations, environments or applications
• Integrates with the existing infrastructure
Mass Scale and Automated DGX Benchmarks
DGX-3 DGX-3
1. Onboard machines
2. Create templates
3. Run 100s of Benchmarks
Find the ‘Right-Size A100 GPU’ for your Workload
1. Create templates
2. Assign NGC containers in one click
3. Profile and compare results, find the ‘right-size’ GPU
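One way to read "right-size" is: among the benchmarked configurations that meet a performance target, pick the smallest. The sketch below encodes that rule under stated assumptions: `BenchmarkResult`, `right_size`, the use of GPU memory as the size measure, and every number in the example are hypothetical placeholders, not measurements or a cnvrg.io API.

```python
from typing import List, NamedTuple, Optional


class BenchmarkResult(NamedTuple):
    template: str          # compute template the benchmark ran on
    gpu_memory_gb: int     # size of the GPU (or GPU slice) used
    throughput: float      # e.g. samples per second; units are up to the benchmark


def right_size(results: List[BenchmarkResult],
               min_throughput: float) -> Optional[BenchmarkResult]:
    """Return the smallest configuration that still meets the throughput target."""
    good_enough = [r for r in results if r.throughput >= min_throughput]
    if not good_enough:
        return None
    return min(good_enough, key=lambda r: r.gpu_memory_gb)


# Placeholder numbers for illustration only.
results = [
    BenchmarkResult("slice-small", gpu_memory_gb=5, throughput=80.0),
    BenchmarkResult("slice-medium", gpu_memory_gb=10, throughput=150.0),
    BenchmarkResult("full-gpu", gpu_memory_gb=40, throughput=400.0),
]
print(right_size(results, min_throughput=100.0))
```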
Increase Utilization and Grow from 1-2z to cnvrg.io meta-scheduler
1. Single POD, low utilization
DGX-3 DGX-1 POD DGX-2
2. Clusters with High Utilization, with cnvrg.io meta-scheduler
Workload mobility between DGX platforms
Demo

More Related Content

What's hot

OpenACC Monthly Highlights May 2017
OpenACC Monthly Highlights  May 2017OpenACC Monthly Highlights  May 2017
OpenACC Monthly Highlights May 2017
NVIDIA
 
Part 1 Maximizing the utilization of GPU resources on-premise and in the c...
Part 1    Maximizing the utilization of GPU resources on-premise and in the c...Part 1    Maximizing the utilization of GPU resources on-premise and in the c...
Part 1 Maximizing the utilization of GPU resources on-premise and in the c...
Univa, an Altair Company
 
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4  Maximizing the utilization of GPU resources on-premise and in the cloudPart 4  Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
Univa, an Altair Company
 
Deep Learning on the SaturnV Cluster
Deep Learning on the SaturnV ClusterDeep Learning on the SaturnV Cluster
Deep Learning on the SaturnV Cluster
inside-BigData.com
 
GPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierGPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next Frontier
NVIDIA
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPU
Renee Yao
 
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
Renee Yao
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
NVIDIA Taiwan
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- December
NVIDIA
 
OpenACC Monthly Highlights - September
OpenACC Monthly Highlights - SeptemberOpenACC Monthly Highlights - September
OpenACC Monthly Highlights - September
NVIDIA
 
Kubeflow
KubeflowKubeflow
Kubeflow
Karane Vieira
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
NVIDIA Japan
 
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
Preferred Networks
 
GTC Taiwan 2017 主題演說
GTC Taiwan 2017 主題演說GTC Taiwan 2017 主題演說
GTC Taiwan 2017 主題演說
NVIDIA Taiwan
 
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangAi Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
NVIDIA Taiwan
 
High Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloudHigh Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloud
Accubits Technologies
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
inside-BigData.com
 
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
Preferred Networks
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
inside-BigData.com
 
Continuous Deployment using Kubernetes
Continuous Deployment using KubernetesContinuous Deployment using Kubernetes
Continuous Deployment using Kubernetes
Arun Veeramani
 

What's hot (20)

OpenACC Monthly Highlights May 2017
OpenACC Monthly Highlights  May 2017OpenACC Monthly Highlights  May 2017
OpenACC Monthly Highlights May 2017
 
Part 1 Maximizing the utilization of GPU resources on-premise and in the c...
Part 1    Maximizing the utilization of GPU resources on-premise and in the c...Part 1    Maximizing the utilization of GPU resources on-premise and in the c...
Part 1 Maximizing the utilization of GPU resources on-premise and in the c...
 
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4  Maximizing the utilization of GPU resources on-premise and in the cloudPart 4  Maximizing the utilization of GPU resources on-premise and in the cloud
Part 4 Maximizing the utilization of GPU resources on-premise and in the cloud
 
Deep Learning on the SaturnV Cluster
Deep Learning on the SaturnV ClusterDeep Learning on the SaturnV Cluster
Deep Learning on the SaturnV Cluster
 
GPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next FrontierGPU Computing with Python and Anaconda: The Next Frontier
GPU Computing with Python and Anaconda: The Next Frontier
 
Building the World's Largest GPU
Building the World's Largest GPUBuilding the World's Largest GPU
Building the World's Largest GPU
 
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
Orchestrate Your AI Workload with Cisco Hyperflex, Powered by NVIDIA GPUs
 
GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用GTC Taiwan 2017 企業端深度學習與人工智慧應用
GTC Taiwan 2017 企業端深度學習與人工智慧應用
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- December
 
OpenACC Monthly Highlights - September
OpenACC Monthly Highlights - SeptemberOpenACC Monthly Highlights - September
OpenACC Monthly Highlights - September
 
Kubeflow
KubeflowKubeflow
Kubeflow
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
KubeCon + CloudNativeCon Europe 2021 Virtual Overview / Kubernetes Meetup Tok...
 
GTC Taiwan 2017 主題演說
GTC Taiwan 2017 主題演說GTC Taiwan 2017 主題演說
GTC Taiwan 2017 主題演說
 
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen HuangAi Forum at Computex 2017 - Keynote Slides by Jensen Huang
Ai Forum at Computex 2017 - Keynote Slides by Jensen Huang
 
High Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloudHigh Performance Computing (HPC) in cloud
High Performance Computing (HPC) in cloud
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
KubeCon EU 2021 Recap - Running Cache-Efficient Builds at Scale on Kubernetes...
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
Continuous Deployment using Kubernetes
Continuous Deployment using KubernetesContinuous Deployment using Kubernetes
Continuous Deployment using Kubernetes
 

Similar to Scaling MLOps on NVIDIA DGX Systems

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
iguazio
 
NextGenML
NextGenML NextGenML
Hpc Cloud project Overview
Hpc Cloud project OverviewHpc Cloud project Overview
Hpc Cloud project Overview
Floris Sluiter
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
Dirk Petersen
 
Recreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingRecreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web Scraping
KP Kaiser
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks
 
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPCHPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
HPC DAY
 
Greenplum for Kubernetes - Greenplum Summit 2019
Greenplum for Kubernetes - Greenplum Summit 2019Greenplum for Kubernetes - Greenplum Summit 2019
Greenplum for Kubernetes - Greenplum Summit 2019
VMware Tanzu
 
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Saptak Sen
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Databricks
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Databricks
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
Deltares
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
OpenStack
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
Ganesan Narayanasamy
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Fwdays
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Jason Dai
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
Weaveworks
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
DESMOND YUEN
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Joachim Schlosser
 

Similar to Scaling MLOps on NVIDIA DGX Systems (20)

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to ProductionWebinar: Cutting Time, Complexity and Cost from Data Science to Production
Webinar: Cutting Time, Complexity and Cost from Data Science to Production
 
NextGenML
NextGenML NextGenML
NextGenML
 
Hpc Cloud project Overview
Hpc Cloud project OverviewHpc Cloud project Overview
Hpc Cloud project Overview
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Recreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web ScrapingRecreating "The Clock" with Machine Learning and Web Scraping
Recreating "The Clock" with Machine Learning and Web Scraping
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPCHPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
HPC DAY 2017 | FlyElephant Solutions for Data Science and HPC
 
Greenplum for Kubernetes - Greenplum Summit 2019
Greenplum for Kubernetes - Greenplum Summit 2019Greenplum for Kubernetes - Greenplum Summit 2019
Greenplum for Kubernetes - Greenplum Summit 2019
 
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...Managing and Deploying High Performance Computing Clusters using Windows HPC ...
Managing and Deploying High Performance Computing Clusters using Windows HPC ...
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado BlascoDSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
DSD-INT 2015 - RSS Sentinel Toolbox - J. Manuel Delgado Blasco
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Free GitOps Workshop
Free GitOps WorkshopFree GitOps Workshop
Free GitOps Workshop
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
 

More from cnvrg.io AI OS - Hands-on ML Workshops

Webinar kubernetes and-spark
Webinar  kubernetes and-sparkWebinar  kubernetes and-spark
Webinar kubernetes and-spark
cnvrg.io AI OS - Hands-on ML Workshops
 
How to set up Kubernetes for all your machine learning workflows
How to set up Kubernetes for all your machine learning workflowsHow to set up Kubernetes for all your machine learning workflows
How to set up Kubernetes for all your machine learning workflows
cnvrg.io AI OS - Hands-on ML Workshops
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
cnvrg.io AI OS - Hands-on ML Workshops
 
How to use continual learning in your ML models
How to use continual learning in your ML modelsHow to use continual learning in your ML models
How to use continual learning in your ML models
cnvrg.io AI OS - Hands-on ML Workshops
 
How To Build Auto-Adaptive Machine Learning Models with Kubernetes
How To Build Auto-Adaptive Machine Learning Models with KubernetesHow To Build Auto-Adaptive Machine Learning Models with Kubernetes
How To Build Auto-Adaptive Machine Learning Models with Kubernetes
cnvrg.io AI OS - Hands-on ML Workshops
 
MLOps for production-level machine learning
MLOps for production-level machine learningMLOps for production-level machine learning
MLOps for production-level machine learning
cnvrg.io AI OS - Hands-on ML Workshops
 
Continual learning with human in-the-loop
Continual learning with human in-the-loopContinual learning with human in-the-loop
Continual learning with human in-the-loop
cnvrg.io AI OS - Hands-on ML Workshops
 
How to monitor your ML models in production with Kubernetes
How to monitor your ML models in production with KubernetesHow to monitor your ML models in production with Kubernetes
How to monitor your ML models in production with Kubernetes
cnvrg.io AI OS - Hands-on ML Workshops
 
Build machine learning pipelines from research to production
Build machine learning pipelines from research to productionBuild machine learning pipelines from research to production
Build machine learning pipelines from research to production
cnvrg.io AI OS - Hands-on ML Workshops
 
Why more than half of ML models don't make it to production
Why more than half of ML models don't make it to productionWhy more than half of ML models don't make it to production
Why more than half of ML models don't make it to production
cnvrg.io AI OS - Hands-on ML Workshops
 
Training Machine Learning models directly from GitHub with cnvrg.io MLOps
Training Machine Learning models directly from GitHub with cnvrg.io MLOpsTraining Machine Learning models directly from GitHub with cnvrg.io MLOps
Training Machine Learning models directly from GitHub with cnvrg.io MLOps
cnvrg.io AI OS - Hands-on ML Workshops
 
Deploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with KubernetesDeploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with Kubernetes
cnvrg.io AI OS - Hands-on ML Workshops
 

More from cnvrg.io AI OS - Hands-on ML Workshops (12)

Webinar kubernetes and-spark
Webinar  kubernetes and-sparkWebinar  kubernetes and-spark
Webinar kubernetes and-spark
 
How to set up Kubernetes for all your machine learning workflows
How to set up Kubernetes for all your machine learning workflowsHow to set up Kubernetes for all your machine learning workflows
How to set up Kubernetes for all your machine learning workflows
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
How to use continual learning in your ML models
How to use continual learning in your ML modelsHow to use continual learning in your ML models
How to use continual learning in your ML models
 
How To Build Auto-Adaptive Machine Learning Models with Kubernetes
How To Build Auto-Adaptive Machine Learning Models with KubernetesHow To Build Auto-Adaptive Machine Learning Models with Kubernetes
How To Build Auto-Adaptive Machine Learning Models with Kubernetes
 
MLOps for production-level machine learning
MLOps for production-level machine learningMLOps for production-level machine learning
MLOps for production-level machine learning
 
Continual learning with human in-the-loop
Continual learning with human in-the-loopContinual learning with human in-the-loop
Continual learning with human in-the-loop
 
How to monitor your ML models in production with Kubernetes
How to monitor your ML models in production with KubernetesHow to monitor your ML models in production with Kubernetes
How to monitor your ML models in production with Kubernetes
 
Build machine learning pipelines from research to production
Build machine learning pipelines from research to productionBuild machine learning pipelines from research to production
Build machine learning pipelines from research to production
 
Why more than half of ML models don't make it to production
Why more than half of ML models don't make it to productionWhy more than half of ML models don't make it to production
Why more than half of ML models don't make it to production
 
Training Machine Learning models directly from GitHub with cnvrg.io MLOps
Training Machine Learning models directly from GitHub with cnvrg.io MLOpsTraining Machine Learning models directly from GitHub with cnvrg.io MLOps
Training Machine Learning models directly from GitHub with cnvrg.io MLOps
 
Deploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with KubernetesDeploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with Kubernetes
 

Scaling MLOps on NVIDIA DGX Systems

  • 1. Scaling MLOps on NVIDIA DGX platforms Yochay Ettun, CEO cnvrg.io Michael Balint, Sr. Product Manager NVIDIA
  • 8. Industry Pain Point: Data Scientists are doing very little Data Science. Data Scientist Work Distribution* (0% to 100%): 35% Models, Insights, Production; the remainder is labeled "Not Data Science Work". Tasks listed: version control (models, data, git, etc.); collaboration and sharing; deployment; production monitoring; automate continual learning; configuring hardware/platforms; resource scheduling and assignments; datasets import and management; mass scale experimentation; open source tools, plug-ins and dashboards; hybrid cloud ML resource orchestration; Kubernetes and container management; models and repo management. 50% OPEX WASTE; 2X SLOWER TIME TO MARKET; 30% REVENUES LOST
  • 9. cnvrg.io Accelerates and Automates Data Science from Research to Production. MLOps/DevOps stages shown (Start to Finish): On-Board Compute, Data selection, Data preparation, Model research, Experiments, Validation + Tuning, Deployment, Launch. cnvrg.io is a code-first, full stack, container / Kubernetes and open platform. cnvrg.io accelerates data science from research to production across any platform in any cloud. [Screenshots: Datasets page with 6 datasets totaling 290.2 MB (fraud_jun, fraud_may, consumer_img1, fraud_sim_base, misc_base1, public_set12); Compute Templates page ("The available Compute Templates for Workspaces, Experiments, and model endpoints") listing DGX-1, DGX-2 and DGX Station GPU templates, Large and Medium CPU templates, and cloud templates gpuxxl-p3.2xlarge and gpuxl-spot – p2.2xlarge, all HEALTHY]
  • 10. cnvrg.io ML/AI Control Plane Architecture. ON PREM and CLOUD: cnvrg CONTROL PLANE and WORKERS (PODS or CONTAINERS). Workloads: Training, Data Prep, Deployed Model, Workspace (e.g. Jupyter Notebooks), Open source tools (e.g. TensorBoard). Minimal Configuration: 1. 8 CPUs (core) 2. 8 GB RAM 3. 30 GB Disk Space 4. K8S 1.15+ or Docker 18.01+. 50% OPEX SAVING; 2X FASTER TIME TO MARKET; +30% REVENUES
  • 11. Meta-scheduler: Foundation for Accelerated Data Science Development & Production. cnvrg platform: DEPLOY, BUILD, EXPERIMENT, RESEARCH, MONITOR, DATASETS. cnvrg meta-scheduler: On Prem Kubernetes Clusters; Bare metal On Prem Machines; Cloud Based Kubernetes Clusters; Cloud Based Instances; Other Schedulers
  • 12. Meta-scheduler: How Does it Work? COMPUTE TEMPLATES: smallA (1 GPU, 2 GB, DGX-1-A); smallB (1 GPU, 4 GB, DGX-1-B); mediumA (4 GPU, 16 GB, DGX-1-B); large (8 GPU, 32 GB, DGX-2-A); 1 GPU (1 GPU, 4 GB, DGX Station 11); 1 GPU (1 GPU, 4 GB, DGX Station 12); Spark(Medium) (4 GPU, 16 GB + 4 Workers, DGX-1-A); Spark(Large) (8 GPU, 32 GB + 8 Workers, DGX-1-A); gpuxl-spot (1 GPU, 4 GB, V100); gpuxxl (2 GPU, 8 GB, V100). PRIORITY: smallA, smallB, gpuxl-spot. • Whenever a data scientist wants to attach compute to an experiment, workspace, project or a task in a flow, a drop-down menu shows all available compute templates • The data scientist can rank compute resources in a priority list. If a resource is 100% utilized, the next in line is picked • Many use cases can be deployed: 1. Assuring high on-prem utilization before bursting to the cloud 2. Resource segmentation per project 3. Cost-driven resource assignments 4. Etc.
  • 13. Flows Decouple Workloads and Workflow from the Physical Assets and Location. DGX-1 CLUSTER, DGX-2, DGX Station, DGX-1 • Assign each task to a different compute type, location and scale • Flows also serve as an automation tool, running iteratively based on new datasets • Flows can be versioned, modified, shared, stored, revoked and customized • When a task completes, it frees the resource • Flows are a foundation for mass scale experimentation and continual machine learning
  • 14. Launch Mass Scale Experiments. DGX-2, DGX Station, DGX-1, DGX-1 CLUSTER. Launch experiments with different arguments (i.e. HPS)
  • 15. Put Your Cluster into Action with High Utilization. Onboard Any DGX Machines; Create ‘one-click’ Compute Templates; Assign ML Models from NGC; Meta-scheduler; 100’s of Parallel Experiments; Distributed Pipelines. DGX-3, DGX-1 POD, DGX-2, DGX Stations
  • 16. Attach Distributed Training Model to Multi-nodes. 16 × V100 with MPI Enabled Connectivity. 1. Define multi-node compute template 2. Attach Distributed Node Compute Template to the Model Run 3. Execute!
  • 17. Delivering the DevOps and Data Science Unified Control Plane. ON PREM (CPUs / GPUs / AI Servers) and CLOUD: cnvrg CONTROL PLANE and WORKERS (PODS or CONTAINERS). Workloads: Training (e.g. PyTorch), Data Prep (e.g. Spark), Deployed Model, Workspace (e.g. Jupyter Notebooks), Open source tools (e.g. TensorBoard, Grafana)
  • 18. Adding New Infrastructure that Integrates to the Cluster DGX-2 DGX-1 DGX-1 CLUSTER
  • 19. Adding New Infrastructure that Integrates to the Cluster DGX-2 DGX-1 DGX-1 CLUSTER DGX A100 • No forklift upgrade • No need to ‘lift and shift’ configurations, environments or applications • Integrates to the existing infrastructure
  • 20. Mass Scale and Automated DGX Benchmarks. DGX-3, DGX-3. 1. Onboard machines 2. Create templates 3. Run 100s of Benchmarks
  • 21. Find the ‘Right-Size A100 GPU’ for your Workload. 1. Create templates 2. Assign NGC containers in one-click 3. Profile and compare results, find the ‘right-size’ GPU
  • 22. Increase Utilization and Grow from 1-2z to cnvrg.io meta-scheduler. 1. Single POD, low utilization 2. With cnvrg.io meta-scheduler: Clusters with High Utilization (DGX-3, DGX-1 POD, DGX-2); Workload mobility between DGX platforms
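
The priority-list behavior described on slide 12 (if a resource is 100% utilized, the next in line is picked) can be sketched roughly as follows. This is a minimal illustration, not cnvrg.io's API or implementation: the `ComputeTemplate` class, the `pick_template` function and the utilization figures are all hypothetical, and only the template names and sizes come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ComputeTemplate:
    """Hypothetical stand-in for a compute template entry."""
    name: str
    gpus: int
    memory_gb: int
    utilization: float  # fraction in use, 0.0 to 1.0 (illustrative values)


def pick_template(priority: Sequence[ComputeTemplate]) -> Optional[ComputeTemplate]:
    """Return the first template in priority order that is not fully utilized."""
    for template in priority:
        if template.utilization < 1.0:
            return template
    return None  # every template in the list is at 100% utilization


# Priority order from the slide: smallA, then smallB, then gpuxl-spot.
# Here the two on-prem templates are assumed busy, so the job bursts to the cloud.
priority_list = [
    ComputeTemplate("smallA", gpus=1, memory_gb=2, utilization=1.0),
    ComputeTemplate("smallB", gpus=1, memory_gb=4, utilization=1.0),
    ComputeTemplate("gpuxl-spot", gpus=1, memory_gb=4, utilization=0.25),
]

chosen = pick_template(priority_list)
print(chosen.name if chosen else "no capacity available")
```

Putting on-prem templates ahead of cloud templates in the list is one way to express the first use case on the slide, assuring high on-prem utilization before bursting to the cloud.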