Using Deep Learning Toolkits
with Kubernetes clusters
Wee Hyong, Joy Qiao
Cloud AI, Microsoft
Credits: Jin Li, Sanjeev Mehrotra, Hongzhi Li, Lachie Evenson, William Buchwalter,
Mathew Salvaris, Ilia Karmanov, Taifeng Wang, CNTK Team
O'Reilly Artificial Intelligence Conference 2017
Sept 17-20, San Francisco, CA
Tips & tricks
learned from using
Deep Learning
Toolkits on
Kubernetes
1. Getting the K8S cluster to run in the cloud
using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance
Models
5. Distributed Training Performance on
Kubernetes
Deep Learning Common Patterns
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
How long does it take to train DNN models?
[Chart: training time in hours. Model labels: ResNet (ImageNet), GoogleNet (ImageNet), LSTM Model (2000h Speech), Neural Translation Model. Time labels: 130 hours, 570 hours, 1,100 hours, 2,000 hours. Hardware labels: K40 x 8, K40, K40, K40.]
ImageNet: 1M Images, 1K Classes
Getting Started with Deep Learning
Toolkits Environment
Desktop / Laptop
Virtual Machine Devices / Edge
Cloud
And more….
Infrastructure that allows you to
do lots of experimentation
Infrastructure that enables you to
scale up/down as needed
Simplified View of Kubernetes Concepts
[Diagram: a Client (kubectl get nodes) talks to the Master; the Master manages three Nodes; each Node runs a kubelet and hosts Pods; each Pod wraps a Container.]
Where can you run K8S? Servers / Virtual Machines, Cloud
CNTK + K8S
1. Using acs-engine to set up K8S
Getting Started with Kubernetes and Deep Learning Toolkit: 3 Easy Steps
1. Deploy the Kubernetes Cluster
2. Use an existing image or prep a new Docker Image
3. Choose storage to persist the data (logs, checkpoint files, model, etc.)
Using acs-engine to set up K8S on Azure
K8s cluster definition file -> acs-engine -> Azure Resource Manager (ARM) templates, ssh keys, kubeconfig file -> Deploy to Azure
Resources: https://github.com/Azure/acs-engine
VM1
k8s-master-27473156-0
VM2
VM3
k8s-agentpool1-27473156-1
k8s-agentpool2-27473156-0
NC6
(GPU Enabled)
DS_v2
(Non-GPU)
Virtual network
K8s-vnet-27473156
Deployment to Azure
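A cluster definition for a deployment like the one above (one master, one GPU agent pool, one non-GPU agent pool) can be sketched as a Python dict and written out as JSON for acs-engine. This is a minimal sketch, not the definition used in the talk: the field names are modelled on the example definitions in the acs-engine repository and should be checked against the version in use, and the VM sizes, pool-to-size mapping, DNS prefix, user name, key and service principal values are all placeholder assumptions.

```python
import json


def build_cluster_definition(dns_prefix, admin_user, ssh_public_key,
                             sp_client_id, sp_secret):
    """Return an acs-engine style cluster definition as a plain dict.

    Field names are assumptions based on acs-engine example files.
    """
    return {
        "apiVersion": "vlabs",
        "properties": {
            "orchestratorProfile": {"orchestratorType": "Kubernetes"},
            "masterProfile": {
                "count": 1,
                "dnsPrefix": dns_prefix,
                "vmSize": "Standard_D2_v2",  # assumed master size
            },
            "agentPoolProfiles": [
                # GPU pool (NC6) and non-GPU pool (a DS_v2 size); which pool
                # gets which size is an assumption.
                {"name": "agentpool1", "count": 1, "vmSize": "Standard_NC6",
                 "availabilityProfile": "AvailabilitySet"},
                {"name": "agentpool2", "count": 1, "vmSize": "Standard_DS2_v2",
                 "availabilityProfile": "AvailabilitySet"},
            ],
            "linuxProfile": {
                "adminUsername": admin_user,
                "ssh": {"publicKeys": [{"keyData": ssh_public_key}]},
            },
            "servicePrincipalProfile": {
                "clientId": sp_client_id,
                "secret": sp_secret,
            },
        },
    }


if __name__ == "__main__":
    definition = build_cluster_definition(
        "my-k8s", "azureuser", "ssh-rsa AAAA...", "<client-id>", "<secret>")
    print(json.dumps(definition, indent=2))
```

The JSON output would be saved to a file and passed to acs-engine, which generates the ARM templates that are then deployed to Azure.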
Checking that the NVIDIA drivers are used
output logs from nvidia-smi
Running
nvidia-smi to display
the GPU info
dockerfile
1 Specify base image from NVIDIA
2 Define entry file that is run on startup
3 Install relevant tools
4 Install CNTK 2.1
5 Specify entry point and port to be exposed
docker build -t <image-name> -f <path-to-dockerfile>/dockerfile <src-folder>
Example: https://hub.docker.com/r/weehyong/cntkresnetgpu/
Demo
Prep Kubernetes Cluster using
acs-engine
Hello, CNTK Job!
Defining a Training Job
1 Run this as a K8S Job
2 Secret for Azure Storage
3 Specify the image to use
4 Run the download and
train script
5 Mount a folder on Azure File
Creating a Deployment for Serving
1 Specify this as a K8S Deployment
2 Specify the image to use
3 Mount a folder on Azure File
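A serving Deployment with the same three parts can be sketched in the same way. This is a sketch under assumptions: the names, port and share are hypothetical, and `apps/v1` is the current API group, whereas a Kubernetes 1.6 cluster would need an older beta API group for Deployments.

```python
import json


def build_serving_deployment(name, image, port, secret_name, share_name,
                             replicas=1, mount_path="/mnt/azurefile"):
    """Build a Deployment that serves a model stored on an Azure File share."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",   # assumption: a recent cluster
        "kind": "Deployment",      # 1. specify this as a K8S Deployment
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,  # 2. the image to use
                        "ports": [{"containerPort": port}],
                        "volumeMounts": [
                            {"name": "azurefile", "mountPath": mount_path},
                        ],
                    }],
                    "volumes": [{
                        "name": "azurefile",  # 3. folder on Azure File
                        "azureFile": {
                            "secretName": secret_name,
                            "shareName": share_name,
                            "readOnly": True,
                        },
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    dep = build_serving_deployment(
        "cntk-serving", "weehyong/cntkresnetgpu", 80,
        "azure-storage-secret", "models")  # hypothetical values
    print(json.dumps(dep, indent=2))
```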
2. Scaling your Deployment
Have the GPU resources when
you need them
Auto-Scaling Deployment
1. To handle more load for serving, I want to scale my deployment
2. Having more pods to run different training jobs
Auto-Scaling
Deployment
Pod-Level
Horizontal Pod Autoscaling
kubectl autoscale
Node-Level
Autoscaling
aka.ms/k8sautoscaleazure
Walkthrough by @wbuchwalter
Based on OpenAI kubernetes-ec2-autoscaler
Horizontal Pod AutoScaling
[Diagram: a Horizontal Pod Autoscaler watches the Pods of an RC / Deployment (CPU% = 70%) and scales the number of Pods, each wrapping a Container, on a Node.]
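The scaling decision behind Horizontal Pod Autoscaling can be sketched with the ratio rule described in the Kubernetes documentation: the desired replica count is the current count scaled by the ratio of observed to target utilization, rounded up. This is a simplified sketch: it treats 70% as the target CPU utilization (an assumption about the diagram), uses hypothetical replica bounds, and leaves out the tolerance band and stabilization behaviour of the real controller.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent,
                     target_cpu_percent=70, min_replicas=1, max_replicas=10):
    """Replica count suggested by the HPA ratio rule, clamped to bounds."""
    raw = math.ceil(
        current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods averaging 140% of their CPU request against a 70% target.
    print(desired_replicas(2, 140))
```

The equivalent declarative setup is what `kubectl autoscale` creates for a Deployment.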
What if the nodes are maxed out?
Node-Level AutoScaling
1. User creates a pod (kubectl create pod); the request goes through the kube API Server and is stored in ETCD.
2. kube scheduler (kube master) asks "Anything to schedule?": Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { }.
3. kubernetes-acs-autoscaler asks "Do we have pending pods?": Pending pods = { X }.
4. The autoscaler gets the current state of all agents from Azure Container Services and sets the agent pool size (Set size = 20).
5. Azure creates a VM for the new agent (Azure VM); its kubelet registers: "I am Node D".
6. kube scheduler (kube master) asks "Anything to schedule?" again: Pending pods = { X }, Nodes = {A, B, C, D}, Nodes with free capacity = { D }. Put Pod X on Node D.
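The decision the autoscaler makes in this flow can be sketched as a pure function: count the pending pods that no existing node can take, and grow the agent pool by enough nodes to hold them. This is an illustrative simplification, not the kubernetes-acs-autoscaler implementation; the function name, the one-pod-per-new-agent default and the cap are assumptions.

```python
def plan_scale_up(pending_pods, free_slots_per_node, current_agents,
                  pods_per_new_agent=1, max_agents=20):
    """Return the agent pool size needed to schedule all pending pods.

    pending_pods: list of pod names waiting to be scheduled.
    free_slots_per_node: dict mapping node name to free pod slots.
    """
    free = sum(free_slots_per_node.values())
    unschedulable = max(0, len(pending_pods) - free)
    # Ceiling division: new agents needed for the unschedulable pods.
    extra = -(-unschedulable // pods_per_new_agent)
    return min(max_agents, current_agents + extra)


if __name__ == "__main__":
    # Pod X is pending and nodes A, B, C have no free capacity.
    size = plan_scale_up(["X"], {"A": 0, "B": 0, "C": 0}, current_agents=3)
    print(size)  # one more agent (Node D) is requested
```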
Scheduling GPUs
1. GPU setup scripts
source: https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh
Each node needs to be pre-installed with NVIDIA drivers
2. Check to make sure your K8s has your GPU resources data.
kubectl describe nodes
Resource name to use NVIDIA GPUs: alpha.kubernetes.io/nvidia-gpu
Sample YAML for a TensorFlow worker pod with GPUs
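A worker pod that requests GPUs through this resource name can be sketched as a manifest built in Python. This is a minimal sketch under assumptions: the pod name and image are hypothetical, and the host-path mounts for the NVIDIA driver libraries that clusters of this generation typically needed are left out. Later Kubernetes releases replaced the alpha resource name with `nvidia.com/gpu` exposed by a device plugin, so the name is a parameter here.

```python
import json


def build_gpu_worker_pod(name, image, num_gpus,
                         gpu_resource="alpha.kubernetes.io/nvidia-gpu"):
    """Build a Pod whose container asks the scheduler for num_gpus GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested as a resource limit on the container.
                "resources": {"limits": {gpu_resource: num_gpus}},
            }],
        },
    }


if __name__ == "__main__":
    pod = build_gpu_worker_pod(
        "tf-worker-0", "example/tensorflow-gpu", 4)  # hypothetical values
    print(json.dumps(pod, indent=2))
```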
Demo
Node-level Scaling for Deep
Learning Jobs on k8s
3. Distributed Deep Learning
Distributed Training Architecture
Data Parallelism
1. Parallel training on different machines
2. Update the parameter server synchronously/asynchronously
3. Refresh the local model with new parameters, go to 1 and repeat
Model Parallelism
1. The global model is partitioned into K sub-models without overlap.
2. The sub-models are distributed over K local workers and serve as their local models.
3. In each mini-batch, the local workers compute the gradients of the local weights by back propagation.
Credits: Taifeng Wang, DMTK team
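The three data-parallelism steps can be sketched in plain Python with a single shared parameter standing in for the parameter server. This is a toy illustration under stated assumptions: a one-weight linear model y = w * x, two hypothetical data shards, and a synchronous update that averages the workers' gradients; it is not how any particular toolkit implements the pattern.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train_sync(shards, steps=50, lr=0.05):
    """Synchronous data-parallel SGD with one worker per shard."""
    w = 0.0  # the parameter held by the parameter server
    for _ in range(steps):
        # 1. Every worker computes a gradient on its own shard.
        grads = [gradient(w, shard) for shard in shards]
        # 2. The server waits for all workers and applies the average.
        w -= lr * sum(grads) / len(grads)
        # 3. Workers read the refreshed w on the next iteration.
    return w


if __name__ == "__main__":
    # Both shards are drawn from y = 3x, so w should approach 3.
    shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
    print(train_sync(shards))
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting for the slowest worker, at the cost of gradients computed from stale parameters.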
TensorFlow Training on Multi-GPU single node
• Places an individual model replica on each GPU.
• Splits the batch across the GPUs.
• Updates model parameters synchronously by waiting for all GPUs to finish processing a batch of data.
Each tower computes the gradients for a portion of the batch and the gradients are combined and averaged across the multiple towers in order to provide a single update of the Variables stored on the CPU.
Source:
https://www.tensorflow.org/tutorials/deep_cnn#launching_and_training_the_model_on_multiple_gpu_cards
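The combine-and-average step across towers can be sketched without TensorFlow. This is a minimal sketch, assuming each tower reports a list of (gradient, variable name) pairs in the same variable order, with plain floats standing in for tensors; the function name is illustrative.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients across towers.

    tower_grads: one list per GPU tower of (gradient, variable_name)
    pairs, with variables in the same order on every tower.
    """
    averaged = []
    for pairs in zip(*tower_grads):
        grads = [g for g, _ in pairs]
        name = pairs[0][1]  # the variable is shared, so any tower's name works
        averaged.append((sum(grads) / len(grads), name))
    return averaged


if __name__ == "__main__":
    tower_0 = [(1.0, "w"), (4.0, "b")]
    tower_1 = [(3.0, "w"), (0.0, "b")]
    print(average_gradients([tower_0, tower_1]))
```

The averaged list is what a single optimizer step would apply to the variables stored on the CPU.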
Distributed TensorFlow Architecture
For Variable Distribution &
Gradient Aggregation
• Parameter_server
Source: https://www.tensorflow.org/performance/performance_models
4. Best Practices for High-Performance Models
Best Practices
• Input Pipeline
o Do not use feed_dict; it is the slowest way of reading data
o Use the Dataset API
o Use the native parallelism in TensorFlow
➢ Parallelize I/O Reads
➢ Parallelize Image Processing
➢ Parallelize CPU-to-GPU Data Transfer
➢ Software Pipelining
Source: https://www.tensorflow.org/performance/performance_models
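The idea behind software pipelining in an input pipeline can be sketched with the Python standard library: a background thread reads and preprocesses the next batches while the main thread consumes the current one, with a bounded queue in between. This is a conceptual sketch only; `read_fn` and `train_fn` are hypothetical stand-ins, and TensorFlow's own input pipeline does this natively and far more efficiently.

```python
import queue
import threading


def prefetching_pipeline(read_fn, items, train_fn, buffer_size=4):
    """Overlap reading/preprocessing with training via a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        try:
            for item in items:
                q.put(read_fn(item))  # blocks when the buffer is full
        finally:
            q.put(sentinel)           # always signal the end of the stream

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()

    results = []
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        results.append(train_fn(batch))
    thread.join()
    return results


if __name__ == "__main__":
    out = prefetching_pipeline(lambda i: i * 2, range(5), lambda b: b + 1)
    print(out)
```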
Best Practices
• Preprocessing on the CPU
• Use large files, e.g. large TFRecord files.
o E.g. TensorFlow’s official benchmark training files are 140 MB each, in TFRecord format
• Place shared parameters on CPU vs GPU
• NCCL vs TensorFlow’s implicit copy mechanism
o NCCL is an NVIDIA® library that can efficiently broadcast and aggregate data across different GPUs, with optimized utilization of the underlying hardware topology
Best Practices
• Build the model with both NHWC and NCHW
o NCHW is the optimal format when training with GPUs.
o A flexible model can be trained on GPUs using NCHW, with inference done on
CPU using NHWC with the weights obtained from training.
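The difference between the two layouts is only the axis order: NHWC stores (batch, height, width, channels) and NCHW stores (batch, channels, height, width). A minimal sketch of the conversion on nested Python lists, with an illustrative function name, shows what a model that supports both formats has to account for.

```python
def nhwc_to_nchw(batch):
    """Convert batch[n][h][w][c] into out[n][c][h][w] for nested lists."""
    return [
        [
            [[pixel[c] for pixel in row] for row in image]
            for c in range(len(image[0][0]))
        ]
        for image in batch
    ]


if __name__ == "__main__":
    # One image, 1 pixel high, 2 pixels wide, 3 channels.
    nhwc = [[[[1, 2, 3], [4, 5, 6]]]]
    print(nhwc_to_nchw(nhwc))
```

The weights themselves do not change between the two layouts; only the order in which activations are laid out in memory does.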
5. Distributed Training Performance on Kubernetes
Training Environment on Azure
• VM SKU
oNC24r for workers
▪ 4x NVIDIA® Tesla® K80 GPU
▪ 24 CPU cores, 224 GB RAM
oD14_v2 for parameter server
▪ 16 CPU cores, 112 GB RAM
• Kubernetes: 1.6.6 (created using ACS-Engine)
• GPU: NVIDIA® Tesla® K80
• Benchmark scripts:
https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
• OS: Ubuntu 16.04 LTS
• TensorFlow: 1.2
• CUDA / cuDNN: 8.0 / 6.0
• Disk: Local SSD
• DataSet: ImageNet (real data,
not synthetic)
Training on Single node, Multi-GPU
• Linear scalability
• GPUs are fully saturated
• variable_update mode: parameter_server
• local_parameter_device: cpu
[Chart: Resnet-50 with batchsize=64, images/sec vs. No. of GPUs. Plotted values: 48, 96, 190.]
Training on Single node, Multi-GPU
For Tesla K80:
• If the GPUs can use NVIDIA GPUDirect Peer to Peer, place the variables equally across the GPUs used for training.
• If the GPUs cannot use GPUDirect, place the variables on the CPU.
(source: https://www.tensorflow.org/performance/performance_guide)
[Chart: Resnet-50 with batchsize=64, images/sec vs. No. of GPUs. Series1: 96, 190. Series2: 95, 182.]
Training on Single node, Multi-GPU
• Larger batch size helps with training performance
• Batch size is limited by GPU memory (e.g. 12GB RAM for NVIDIA® Tesla® K80)
variable_update mode: parameter_server
[Chart: VGG16, images/sec vs. batch size. Plotted values: 135, 124.]
[Chart: GoogLeNet, images/sec vs. batch size. Plotted values: 440, 413.]
Distributed Training
Settings:
• Topology: 1 ps and 2 workers
• Async variables update
• Using cpu as the local_parameter_device
• Each ps/worker pod has its own dedicated host
• variable_update mode: parameter_server
• Network protocol: gRPC
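The settings above map onto command-line flags of the tf_cnn_benchmarks script. The helper below sketches how the command for each ps or worker pod might be assembled. The flag names follow the tf_cnn_benchmarks script as commonly documented and should be verified against the script's `--help` output for the version in use; the host names, port and data path are hypothetical.

```python
def benchmark_args(model, batch_size, num_gpus, data_dir, job_name=None,
                   task_index=0, ps_hosts=(), worker_hosts=(), sync=False):
    """Assemble a tf_cnn_benchmarks command line as a list of arguments."""
    args = [
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--batch_size={batch_size}",
        f"--num_gpus={num_gpus}",
        "--variable_update=parameter_server",
        "--local_parameter_device=cpu",
        f"--data_dir={data_dir}",
        "--data_name=imagenet",
    ]
    if job_name:  # distributed run: one command per ps/worker pod
        args += [
            f"--job_name={job_name}",
            f"--task_index={task_index}",
            "--ps_hosts=" + ",".join(ps_hosts),
            "--worker_hosts=" + ",".join(worker_hosts),
            f"--cross_replica_sync={sync}",  # False gives async updates
        ]
    return args


if __name__ == "__main__":
    # Topology from the settings: 1 ps and 2 workers (hypothetical hosts).
    ps = ["ps-0:2222"]
    workers = ["worker-0:2222", "worker-1:2222"]
    cmd = benchmark_args("resnet50", 64, 4, "/data/imagenet",
                         job_name="worker", task_index=0,
                         ps_hosts=ps, worker_hosts=workers)
    print(" ".join(cmd))
```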
Single-node Training with 4 GPUs
vs Distributed Training with 2 workers with 8 GPUs in total
[Chart: images/sec for five models. Series1: 440, 107.6, 190, 73, 135. Series2: 818, 172.6, 296, 93, 84.5.]
Observations on distributed training:
• Linear scalability largely depends on the model and network bandwidth.
• GPUs were not fully saturated on the worker nodes, likely due to a network bottleneck.
• VGG16 performed worse in distributed training than in single-node training. GPUs were “starved” most of the time.
• Running directly on host VMs rather than in K8s pods did not make a large difference in this particular test environment.
Distributed Training
Distributed training scalability depends on the compute/bandwidth ratio of the model
[Chart: Training Speedup on 2 nodes vs single-node, for five models. Speedup values: 1.86, 1.60, 1.56, 1.27, 0.63.]
Source: https://arxiv.org/abs/1704.04560
The model with a higher ratio scales better.
GoogLeNet scales pretty well.
VGG16 is suboptimal, due to its large size.
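The speedup figures are the distributed throughput divided by the single-node throughput for each model. The sketch below recomputes them from the images/sec values reported for single-node training with 4 GPUs and distributed training with 2 workers. The positional pairing of the two series and the model labels are assumptions inferred from the other charts in the deck; the second and fourth models are not named in the surviving text.

```python
# Images/sec, in the order the values appear in the deck.
single_node = [440, 107.6, 190, 73, 135]   # 1 node, 4 GPUs
distributed = [818, 172.6, 296, 93, 84.5]  # 2 workers, 8 GPUs in total

# Labels are inferred, not stated next to the chart values.
models = ["GoogLeNet", "model 2", "ResNet-50", "model 4", "VGG16"]

speedups = [round(d / s, 2) for s, d in zip(single_node, distributed)]

if __name__ == "__main__":
    for name, value in zip(models, speedups):
        print(f"{name}: {value:.2f}x")
```

A value of 2.0 would be perfect scaling when doubling the GPU count, and a value below 1.0 means the distributed run is slower than the single node.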
Distributed Training
• Sync vs Async variable updates
• parameter_server vs distributed_replicated mode
[Chart: GoogLeNet with 128 batch size, images/sec. Plotted values: 814, 800, 764.]
Distributed Training
Observations on different cluster topologies in this test environment
• Adding more ps servers does not seem to make much difference.
• Having ps servers running on the same pods as the workers seems to result in worse performance
o Don’t forget to “export CUDA_VISIBLE_DEVICES=” before starting the ps job session if running the ps server on the same pods with GPUs
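When the ps process is launched from Python rather than from a shell, the same effect as the export can be had by passing a modified environment to the child process. This is a minimal sketch; the helper name is illustrative and the command in the usage example is a placeholder.

```python
import os
import subprocess
import sys


def ps_environment(base_env=None):
    """Copy the environment and hide all GPUs from the process using it."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty value: no GPUs are visible
    return env


if __name__ == "__main__":
    # Placeholder child process that only echoes the variable it sees.
    subprocess.run(
        [sys.executable, "-c",
         "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))"],
        env=ps_environment(), check=True)
```

Hiding the GPUs keeps the ps process from claiming GPU memory that the worker on the same pod needs.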
[Chart: Resnet-50 with 64 batch size, images/sec. Plotted values: 296, 296, 274.]
variable_update mode: parameter_server
Demo
Deep Learning Workspace from Microsoft Research
Powered by Kubernetes
• Alpha release available at https://github.com/microsoft/DLWorkspace/
Documentation at https://microsoft.github.io/DLWorkspace/
• Note that DL Workspace is NOT a MS product/service.
It’s an open source solution, and we welcome contributions!
Summary
Tips & tricks
learned from using
Deep Learning
Toolkits on
Kubernetes
1. Getting the K8S cluster to run in the cloud
using acs-engine
2. Scaling your Deployments
3. Distributed Deep Learning
4. Best Practices for High-Performance
Models
5. Distributed Training Performance on
Kubernetes
Resources
• Getting Started with Kubernetes on Azure
https://github.com/Azure/acs-engine
https://docs.microsoft.com/en-us/azure/container-service/kubernetes/
• Running Distributed TensorFlow on Kubernetes using ACS-Engine
https://github.com/joyq-github/TensorFlowonK8s
• Using CNTK and Kubernetes
https://aka.ms/cntkkubernetes
• Distributed CNTK and TensorFlow resources
• https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines
• https://www.tensorflow.org/performance/
• https://arxiv.org/abs/1704.04560
• Deep Learning Workspace powered by Kubernetes
https://github.com/microsoft/DLWorkspace/
https://microsoft.github.io/DLWorkspace/

saastr
 

Recently uploaded (20)

Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 

Using Deep Learning Toolkits with Kubernetes clusters

  • 1. Using Deep Learning Toolkits with Kubernetes clusters. Wee Hyong, Joy Qiao (Cloud AI, Microsoft). Credits: Jin Li, Sanjeev Mehrotra, Hongzhi Li, Lachie Evenson, William Buchwalter, Mathew Salvaris, Ilia Karmanov, Taifeng Wang, CNTK Team. O'Reilly Artificial Intelligence Conference 2017, Sept 17-20, San Francisco, CA
  • 2. Tips & tricks learned from using Deep Learning Toolkits on Kubernetes 1. Getting the K8S cluster to run in the cloud using acs-engine 2. Scaling your Deployments 3. Distributed Deep Learning 4. Best Practices for High-Performance Models 5. Distributed Training Performance on Kubernetes
  • 3. Deep Learning Common Patterns: CNN (Convolutional Neural Network), RNN (Recurrent Neural Network)
  • 4. How long does it take to train DNN models? Models shown: ResNet (ImageNet), GoogLeNet (ImageNet), 2000h Speech LSTM Model, Neural Translational Model. Training times shown: 130 hours, 570 hours, 1,100 hours, 2,000 hours. Hardware shown: K40 x 8, K40, K40, K40. ImageNet: 1M images, 1K classes.
  • 5. Getting Started with Deep Learning Toolkits. Environment: Desktop / Laptop, Virtual Machine, Devices / Edge, Cloud, and more.
  • 6. Infrastructure that allows you to do lots of experimentation
  • 7. Infrastructure that enables you to scale up/down as needed
  • 8. Simplified View of Kubernetes Concepts. [Diagram: a client runs kubectl get nodes against the master; each of three nodes runs a kubelet and hosts two pods, each pod holding a container.] Where can you run K8S? Servers / virtual machines, cloud.
  • 9. Part 1: CNTK + K8S. Using acs-engine to set up K8S
  • 10. Getting Started with Kubernetes and Deep Learning Toolkit in 3 Easy Steps: (1) Deploy the Kubernetes cluster. (2) Use an existing image or prep a new Docker image. (3) Choose storage to persist the data (logs, checkpoint files, model, etc.).
  • 11. Using acs-engine to set up K8S on Azure. [Diagram: K8s cluster definition file, acs-engine, Azure Resource Manager (ARM) templates, ssh keys, kubeconfig file, Deploy to Azure.] Resources: https://github.com/Azure/acs-engine
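As a rough illustration of the cluster definition file that acs-engine consumes, the sketch below builds one as a Python dictionary and prints it as JSON. The field names follow the acs-engine example definitions as best recalled and should be checked against the repository; the DNS prefix, admin user name, VM sizes, and key string are placeholder assumptions, not values from the talk.

```python
import json


def make_cluster_definition(dns_prefix, ssh_public_key, agent_count=2,
                            agent_vm_size="Standard_NC6"):
    """Build a minimal acs-engine-style Kubernetes cluster definition.

    All argument values are illustrative; verify field names against the
    examples shipped in the acs-engine repository before using this.
    """
    return {
        "apiVersion": "vlabs",
        "properties": {
            "orchestratorProfile": {"orchestratorType": "Kubernetes"},
            "masterProfile": {
                "count": 1,
                "dnsPrefix": dns_prefix,
                "vmSize": "Standard_D2_v2",  # assumed CPU-only master size
            },
            "agentPoolProfiles": [
                {
                    "name": "agentpool1",
                    "count": agent_count,
                    "vmSize": agent_vm_size,  # assumed GPU VM size for agents
                    "availabilityProfile": "AvailabilitySet",
                }
            ],
            "linuxProfile": {
                "adminUsername": "azureuser",  # placeholder user name
                "ssh": {"publicKeys": [{"keyData": ssh_public_key}]},
            },
            # Placeholders: supply your own service principal credentials.
            "servicePrincipalProfile": {"clientId": "", "secret": ""},
        },
    }


definition = make_cluster_definition("my-dl-cluster", "ssh-rsa <public-key>")
print(json.dumps(definition, indent=2))
```

Writing the printed JSON to a file gives a starting point that can be edited by hand before it is passed to acs-engine.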
  • 13. Checking that the NVIDIA drivers are used: running nvidia-smi to display the GPU info (output logs from nvidia-smi)
  • 14. Dockerfile: (1) Specify base image from NVIDIA. (2) Define entry file that is run on startup. (3) Install relevant tools. (4) Install CNTK 2.1. (5) Specify entry point and port to be exposed.
  • 15. docker build -t <image-name> -f <path-to-dockerfile>/dockerfile <src-folder> Example: https://hub.docker.com/r/weehyong/cntkresnetgpu/
  • 16. Demo Prep Kubernetes Cluster using acs-engine
  • 18. Defining a Training Job: (1) Run this as a K8S Job. (2) Secret for Azure Storage. (3) Specify the image to use. (4) Run the download and train script. (5) Mount a folder on Azure File.
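The five callouts for the training job can be sketched as a Kubernetes Job manifest built in Python. This is not the manifest from the slide (which is not in the transcript). It is a minimal reconstruction in which the job name, script name, secret name, share name, and mount path are all hypothetical; the image name is taken from the Docker Hub example given earlier in the deck.

```python
import json


def make_training_job(name, image, command, secret_name, share_name,
                      mount_path="/mnt/azurefile"):
    """Return a Job manifest whose parts mirror the five slide callouts."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",  # (1) run this as a K8S Job
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,      # (3) the image to use
                        "command": command,  # (4) download and train script
                        "volumeMounts": [
                            {"name": "azurefile", "mountPath": mount_path}
                        ],
                    }],
                    "volumes": [{
                        "name": "azurefile",
                        "azureFile": {       # (5) folder on Azure File
                            "secretName": secret_name,  # (2) storage secret
                            "shareName": share_name,
                            "readOnly": False,
                        },
                    }],
                }
            }
        },
    }


job = make_training_job(
    name="cntk-train",                       # hypothetical job name
    image="weehyong/cntkresnetgpu",
    command=["/bin/bash", "-c", "./download_and_train.sh"],  # hypothetical
    secret_name="azure-storage-secret",      # hypothetical secret name
    share_name="training-data",              # hypothetical share name
)
print(json.dumps(job, indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output could be saved to a file and submitted with kubectl create -f.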
  • 19. Creating a Deployment for Serving: (1) Specify this as a K8S Deployment. (2) Specify the image to use. (3) Mount a folder on Azure File.
  • 21. Have the GPU resources when you need them
  • 22. Auto-Scaling Deployment 1. To handle more load for serving, I want to scale my deployment 2. Having more pods to run different training jobs
  • 23. Auto-Scaling Deployment. Pod-level: Horizontal Pod Autoscaling (kubectl autoscale). Node-level: autoscaling (aka.ms/k8sautoscaleazure), walkthrough by @wbuchwalter, based on OpenAI kubernetes-ec2-autoscaler.
  • 24. Horizontal Pod Autoscaling. [Diagram: the Horizontal Pod Autoscaler watches pod CPU utilization (CPU% = 70%) and scales the RC / Deployment, which adds pods (each with a container) on the node.]
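The scaling rule behind the Horizontal Pod Autoscaler can be sketched in a few lines. The function below applies the ratio-based formula documented for Kubernetes (desired = ceil(current replicas x current metric / target metric)) and clamps the result to a replica range. It deliberately leaves out the tolerance band and stabilization behavior of the real controller, and the default limits are arbitrary.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent,
                     target_cpu_percent, min_replicas=1, max_replicas=10):
    """Ratio-based replica count, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(
        current_replicas * current_cpu_percent / target_cpu_percent
    )
    return max(min_replicas, min(max_replicas, raw))


# Two pods averaging 90% CPU against a 70% target: scale out.
print(desired_replicas(2, 90, 70))
# Four pods averaging 35% CPU against a 70% target: scale in.
print(desired_replicas(4, 35, 70))
```

With a 70% target as in the diagram, load above the target grows the replica count in proportion, and load below it shrinks the count again.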
  • 25. What if the nodes are maxed out?
  • 26. Node-Level AutoScaling (timeline): (1) User creates a pod (kubectl create pod) through the kube API server, backed by etcd. (2) kube scheduler (kube master): Anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { }. (3) kubernetes-acs-autoscaler: Do we have pending pods? Pending pods = { X }. Azure Container Services: set size = 20; get current state of all agents. (4) A new agent (Azure VM) is created; its kubelet reports "I am Node D". (5) kube scheduler (kube master): Anything to schedule? Pending pods = { X }, Nodes = {A, B, C}, Nodes with free capacity = { D }. Put Pod X on Node D.
  • 27. Sample YAML for a TensorFlow worker pod with GPUs. 1. GPU setup scripts source: https://github.com/Microsoft/DLWorkspace/blob/master/src/ClusterBootstrap/scripts/prepare_acs.sh 2. Check that your K8s nodes report their GPU resources: kubectl describe nodes
  • 28. Scheduling GPUs: Each node needs to be pre-installed with NVIDIA drivers. Resource name to use NVIDIA GPUs: alpha.kubernetes.io/nvidia-gpu
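The decision made by the node-level autoscaler in this timeline can be approximated as: place pending pods on existing free capacity where possible, and request new nodes for whatever is left. The sketch below is a simplified first-fit planner, not the logic of kubernetes-acs-autoscaler itself. It counts only GPUs, and all names and parameters are hypothetical.

```python
def plan_scale_up(pending_pods, free_gpus, gpus_per_new_node, max_new_nodes):
    """Return how many new nodes to request.

    pending_pods: dict of pod name -> GPUs requested.
    free_gpus: dict of existing node name -> free GPUs.
    """
    free = dict(free_gpus)  # copy so the caller's view is not mutated
    new_nodes = []          # free GPUs left on each planned new node
    for pod in sorted(pending_pods):
        need = pending_pods[pod]
        if need > gpus_per_new_node:
            raise ValueError(f"pod {pod} can never fit on a new node")
        # Prefer capacity that already exists in the cluster.
        target = next((n for n, f in free.items() if f >= need), None)
        if target is not None:
            free[target] -= need
            continue
        # Otherwise try a node that is already planned.
        for i, left in enumerate(new_nodes):
            if left >= need:
                new_nodes[i] -= need
                break
        else:
            if len(new_nodes) < max_new_nodes:
                new_nodes.append(gpus_per_new_node - need)
            # If the cap is reached, the pod simply stays pending.
    return len(new_nodes)


# Pod X is pending and nodes A, B, C have no free capacity: add one node.
print(plan_scale_up({"X": 1}, {"A": 0, "B": 0, "C": 0}, 1, 5))
```

Once the new node (D in the timeline) registers with free capacity, the same call with D included reports that no further nodes are needed and the scheduler can place the pod.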
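A pod asks for GPUs by listing the resource name above under its container's resource limits. The sketch below builds such a pod manifest in Python. The pod name and image are placeholders. On the Kubernetes version used in the talk, a real manifest would typically also need to make the host's NVIDIA driver libraries visible to the container, which is omitted here.

```python
import json

GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"  # resource name from the slide


def make_gpu_pod(name, image, num_gpus):
    """Return a Pod manifest that requests `num_gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {GPU_RESOURCE: num_gpus}},
            }],
        },
    }


# Both the pod name and the image are hypothetical placeholders.
pod = make_gpu_pod("tf-worker-0", "<your-registry>/tf-worker:gpu", 4)
print(json.dumps(pod, indent=2))
```

The scheduler will only place this pod on a node that reports at least that many free GPUs in kubectl describe nodes.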
  • 29. Demo Node-level Scaling for Deep Learning Jobs on k8s
  • 31. Distributed Training Architecture. Data Parallelism: 1. Parallel training on different machines. 2. Update the parameter server synchronously/asynchronously. 3. Refresh the local model with new parameters, go to 1 and repeat. Model Parallelism: 1. The global model is partitioned into K sub-models without overlap. 2. The sub-models are distributed over K local workers and serve as their local models. 3. In each mini-batch, the local workers compute the gradients of the local weights by back propagation. Credits: Taifeng Wang, DMTK team
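The three data-parallelism steps can be simulated without any deep learning framework. The toy example below fits y = w * x with two "workers" that each hold a shard of the data and a "parameter server" that averages their gradients synchronously. The model, data, learning rate, and step count are all made up for illustration, and the workers run one after another rather than on separate machines.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train_sync(shards, lr=0.01, steps=200):
    w = 0.0  # global parameter held by the parameter server
    for _ in range(steps):
        # Step 1: every worker computes a gradient on its own shard.
        grads = [shard_gradient(w, shard) for shard in shards]
        # Step 2: the server waits for all workers, then applies the average.
        w -= lr * sum(grads) / len(grads)
        # Step 3: workers read the refreshed w on the next iteration.
    return w


data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
shards = [data[0::2], data[1::2]]           # one shard per worker
w = train_sync(shards)
print(round(w, 4))
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting for the average, trading gradient staleness for less idle time.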
  • 32. TensorFlow Training on Multi-GPU single node • Places an individual model replica on each GPU. • Splits the batch across the GPUs. • Updates model parameters synchronously by waiting for all GPUs to finish processing a batch of data. Each tower computes the gradients for a portion of the batch and the gradients are combined and averaged across the multiple towers in order to provide a single update of the Variables stored on the CPU. Source: https://www.tensorflow.org/tutorials/deep_cnn#launching_and_training_the_model_on_multiple_gpu_cards
  • 33. Distributed TensorFlow Architecture For Variable Distribution & Gradient Aggregation • Parameter_server Source: https://www.tensorflow.org/performance/performance_models
  • 35. Best Practices • Input pipeline: do not use feed_dict (the slowest way of reading data); use the Dataset API; use the native parallelism in TensorFlow: parallelize I/O reads, parallelize image processing, parallelize CPU-to-GPU data transfer, software pipelining. Source: https://www.tensorflow.org/performance/performance_models
  • 36. Best Practices • Preprocessing on the CPU • Use large files, e.g. large TFRecord files (TensorFlow’s official benchmark training files are 140MB each, in TFRecord format) • Place shared parameters on CPU vs GPU • NCCL vs TensorFlow’s implicit copy mechanism: NCCL is an NVIDIA® library that can efficiently broadcast and aggregate data across different GPUs, with optimized utilization of the underlying hardware topology
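Software pipelining means the next batch is being read and preprocessed while the current one is being consumed. The sketch below shows the idea with only the Python standard library: a background thread fills a small bounded buffer and the consumer drains it. It is a stand-in for the concept, not TensorFlow's input pipeline, and read_batches is a made-up placeholder for real file reading and image decoding.

```python
import queue
import threading


def prefetch(generator, buffer_size=2):
    """Yield items from `generator`, produced ahead of time by a thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in generator:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def read_batches(n, batch_size=4):
    """Placeholder for reading and decoding one batch at a time."""
    for i in range(n):
        yield [i * batch_size + j for j in range(batch_size)]


# The "training step" here is just a sum over each batch.
results = [sum(batch) for batch in prefetch(read_batches(3))]
print(results)
```

The bounded buffer is the key design choice: it lets the producer run ahead far enough to hide I/O latency without letting unread batches pile up in memory.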
  • 37. Best Practices • Build the model with both NHWC and NCHW o NCHW is the optimal format when training with GPUs. o A flexible model can be trained on GPUs using NCHW, with inference done on CPU using NHWC with the weights obtained from training.
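NHWC and NCHW hold the same values in a different memory order: NHWC keeps the channels of one pixel together, while NCHW keeps each channel plane together. The pure-Python sketch below computes flat offsets for both layouts and converts a tiny tensor between them. The helper names are illustrative and not part of any library.

```python
def offset_nhwc(n, h, w, c, H, W, C):
    """Flat index of element (n, h, w, c) when channels vary fastest."""
    return ((n * H + h) * W + w) * C + c


def offset_nchw(n, h, w, c, H, W, C):
    """Flat index of the same element when width varies fastest."""
    return ((n * C + c) * H + h) * W + w


def nhwc_to_nchw(flat, N, H, W, C):
    """Reorder a flat NHWC buffer into NCHW order."""
    out = [None] * len(flat)
    for n in range(N):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    src = offset_nhwc(n, h, w, c, H, W, C)
                    dst = offset_nchw(n, h, w, c, H, W, C)
                    out[dst] = flat[src]
    return out


# One 2x2 image with 3 channels, numbered in NHWC order.
nhwc = list(range(12))
nchw = nhwc_to_nchw(nhwc, N=1, H=2, W=2, C=3)
print(nchw)
```

Because only the ordering of values changes, a model written to accept either layout can share one set of trained weights, which is what makes the train-on-GPU, serve-on-CPU split possible.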
  • 39. Training Environment on Azure • VM SKU: NC24r for workers (4x NVIDIA® Tesla® K80 GPU, 24 CPU cores, 224 GB RAM); D14_v2 for parameter server (16 CPU cores, 112 GB RAM) • Kubernetes: 1.6.6 (created using ACS-Engine) • GPU: NVIDIA® Tesla® K80 • Benchmark scripts: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks • OS: Ubuntu 16.04 LTS • TensorFlow: 1.2 • CUDA / cuDNN: 8.0 / 6.0 • Disk: Local SSD • DataSet: ImageNet (real data, not synthetic)
  • 40. Training on Single node, Multi-GPU • Linear scalability • GPUs are fully saturated • variable_update mode: parameter_server • local_parameter_device: cpu. [Chart: ResNet-50 with batch size 64, images/sec vs. number of GPUs; plotted values 48, 96, 190.]
  • 41. Training on Single node, Multi-GPU. For Tesla K80: • If the GPUs can use NVIDIA GPUDirect Peer to Peer, place the variables equally across the GPUs used for training. • If the GPUs cannot use GPUDirect, place the variables on the CPU. (Source: https://www.tensorflow.org/performance/performance_guide) [Chart: ResNet-50 with batch size 64, images/sec vs. number of GPUs; Series1: 96, 190; Series2: 95, 182.]
  • 42. Training on Single node, Multi-GPU • Larger batch size helps with training performance • Batch size is limited by GPU memory (e.g. 12GB RAM for NVIDIA® Tesla® K80) • variable_update mode: parameter_server. [Chart: VGG16, images/sec vs. batch size; plotted values 135, 124.] [Chart: GoogLeNet, images/sec vs. batch size; plotted values 440, 413.]
  • 43. Distributed Training. Settings: • Topology: 1 ps and 2 workers • Async variable updates • Using cpu as the local_parameter_device • Each ps/worker pod has its own dedicated host • variable_update mode: parameter_server • Network protocol: gRPC. [Chart: single-node training with 4 GPUs vs. distributed training with 2 workers (8 GPUs in total), images/sec for five models; Series1: 440, 107.6, 190, 73, 135; Series2: 818, 172.6, 296, 93, 84.5.] Observations on distributed training: • Whether scaling is linear depends largely on the model and the network bandwidth. • GPUs are not fully saturated on the worker nodes, likely due to a network bottleneck. • VGG16 performed worse than in single-node training; its GPUs were “starved” most of the time. • Running directly on host VMs rather than in K8s pods did not make a huge difference in this particular test environment.
  • 44. Distributed Training. Distributed training scalability depends on the compute/bandwidth ratio of the model; the model with a higher ratio scales better. GoogLeNet scales pretty well. VGG16 is suboptimal, due to its large size. [Chart: training speedup on 2 nodes vs. single node for five models; plotted values 1.86, 1.60, 1.56, 1.27, 0.63.] Source: https://arxiv.org/abs/1704.04560
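The speedup values on this slide can be reproduced by dividing the distributed throughput by the single-node throughput reported on the previous slide. In the sketch below, the labels GoogLeNet, ResNet-50, and VGG16 are inferred by matching throughput figures with other slides rather than read from the chart, and the two remaining models are left unnamed because the transcript does not identify them.

```python
# Images/sec from the previous slide's chart (Series1 and Series2).
single_node_4_gpus = {
    "GoogLeNet (inferred)": 440,
    "unnamed model 2": 107.6,
    "ResNet-50 (inferred)": 190,
    "unnamed model 4": 73,
    "VGG16 (inferred)": 135,
}
two_workers_8_gpus = {
    "GoogLeNet (inferred)": 818,
    "unnamed model 2": 172.6,
    "ResNet-50 (inferred)": 296,
    "unnamed model 4": 93,
    "VGG16 (inferred)": 84.5,
}

speedup = {
    model: round(two_workers_8_gpus[model] / single_node_4_gpus[model], 2)
    for model in single_node_4_gpus
}
for model, value in speedup.items():
    print(f"{model}: {value:.2f}x")
```

A value of 2.0 would be perfect scaling for twice the GPUs. A value below 1.0 means the distributed run was slower than a single node.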
  • 45. Distributed Training • Sync vs. async variable updates • parameter_server vs. distributed_replicated mode. [Chart: GoogLeNet with batch size 128, images/sec for three configurations; plotted values 814, 800, 764.]
  • 46. Distributed Training. Observations on different cluster topologies in this test environment: • Adding more ps servers does not seem to make much difference. • Running ps servers on the same pods as the workers seems to give worse performance. Don’t forget to “export CUDA_VISIBLE_DEVICES=” before starting the ps job session if running the ps server on the same pods with GPUs. [Chart: ResNet-50 with batch size 64, variable_update mode: parameter_server, images/sec for three topologies; plotted values 296, 296, 274.]
  • 47. Demo: Deep Learning Workspace from Microsoft Research, powered by Kubernetes • Alpha release available at https://github.com/microsoft/DLWorkspace/ • Documentation at https://microsoft.github.io/DLWorkspace/ • Note that DL Workspace is NOT a MS product/service. It’s an open source solution, and we welcome contributions!
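The CUDA_VISIBLE_DEVICES tip can also be applied from a Python launcher instead of a shell export. The sketch below builds the environment for a task and hides every GPU from parameter-server processes so they do not claim GPU memory that the co-located worker needs. The function name and the launch command in the comment are hypothetical.

```python
import os


def env_for_task(job_name, base_env=None):
    """Return a copy of the environment suited to a ps or worker task."""
    env = dict(os.environ if base_env is None else base_env)
    if job_name == "ps":
        # An empty value hides all GPUs from the process.
        env["CUDA_VISIBLE_DEVICES"] = ""
    return env


ps_env = env_for_task("ps", {"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
worker_env = env_for_task("worker", {"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(repr(ps_env["CUDA_VISIBLE_DEVICES"]))
print(repr(worker_env["CUDA_VISIBLE_DEVICES"]))

# Hypothetical launch of the ps task with the adjusted environment:
# subprocess.Popen(["python", "train.py", "--job_name=ps"], env=ps_env)
```

The variable has to be set before the framework initializes CUDA in that process, which is why it is passed in the environment at launch rather than changed later.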
  • 48. Summary Tips & tricks learned from using Deep Learning Toolkits on Kubernetes 1. Getting the K8S cluster to run in the cloud using acs-engine 2. Scaling your Deployments 3. Distributed Deep Learning 4. Best Practices for High-Performance Models 5. Distributed Training Performance on Kubernetes
  • 49. Resources • Getting Started with Kubernetes on Azure https://github.com/Azure/acs-engine https://docs.microsoft.com/en-us/azure/container-service/kubernetes/ • Running Distributed TensorFlow on Kubernetes using ACS-Engine https://github.com/joyq-github/TensorFlowonK8s • Using CNTK and Kubernetes https://aka.ms/cntkkubernetes • Distributed CNTK and TensorFlow resources • https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines • https://www.tensorflow.org/performance/ • https://arxiv.org/abs/1704.04560 • Deep Learning Workspace powered by Kubernetes https://github.com/microsoft/DLWorkspace/ https://microsoft.github.io/DLWorkspace/