DISTRIBUTED TRAINING OF
DEEP LEARNING MODELS
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
Mathew Salvaris (@msalvaris) – Ilia Karmanov (@ikdeepl) – Miguel Fierro (@miguelgfierro)

Rosetta Stone of Deep Learning
more info: https://github.com/ilkarman/DeepLearningFrameworks

ImageNet Competition
ImageNet top-5 error (%)
▪ AlexNet (2012): 15.3%
▪ VGG (2014): 7.3%
▪ Inception (2015): 6.7%
▪ ResNet (2015): 3.6%
▪ Inception-ResNet (2016): 3.1%
▪ NASNet (2017): 3.8%
▪ AmoebaNet (2017): 3.8%
▪ ResNeXt Instagram (2018): 2.4%
▪ Human: 5.1%

Distributed training mode: Data parallelism
[Diagram: the dataset is split into Subset 1 and Subset 2; Worker 1 and Worker 2 each hold a copy of the CNN model and train on one subset, coordinated by a job manager.]
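
The split shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `shard_dataset` and the round-robin split are assumptions, not something the slides specify.

```python
def shard_dataset(samples, num_workers):
    """Split samples into num_workers near-equal subsets (round-robin).

    Hypothetical helper: in data parallelism every worker keeps a full
    copy of the model and only the data is divided.
    """
    return [samples[rank::num_workers] for rank in range(num_workers)]


if __name__ == "__main__":
    dataset = list(range(10))
    subsets = shard_dataset(dataset, num_workers=2)
    for rank, subset in enumerate(subsets, start=1):
        # Each worker would train an identical model copy on its own subset.
        print(f"Worker {rank}: {subset}")
```

Round-robin sharding keeps the subsets within one sample of each other in size, which matters when workers must wait for each other.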

Distributed training mode: Model parallelism
[Diagram: the CNN model is split into Submodel 1 on Worker 1 and Submodel 2 on Worker 2; both workers use the full dataset, coordinated by a job manager.]

Data parallelism vs model parallelism
Data parallelism
▪ Easier implementation
▪ Stronger fault tolerance
▪ Higher cluster utilization
Model parallelism
▪ Better scalability of large models
▪ Less memory needed on each GPU
Why not both? Data parallelism for convolutional layers and model parallelism for fully connected (FC) layers
source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997

Training strategies: parameter averaging
[Diagram: Worker 1 trains a CNN model on Subset 1 and Worker 2 trains a copy on Subset 2; the weights of the workers are averaged.]
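
A minimal sketch of parameter averaging, assuming a toy linear model y = w0 + w1 * x in place of the CNN and two simulated workers in one process. The names `local_sgd` and `average_parameters`, the learning rate, and the data are all illustrative assumptions.

```python
def local_sgd(weights, subset, lr=0.05):
    """One pass of per-sample SGD on a worker's own subset (toy linear model)."""
    w0, w1 = weights
    for x, y in subset:
        err = (w0 + w1 * x) - y
        w0 -= lr * err
        w1 -= lr * err * x
    return [w0, w1]


def average_parameters(worker_weights):
    """Element-wise mean of the weight vectors held by each worker."""
    num_workers = len(worker_weights)
    return [sum(ws) / num_workers for ws in zip(*worker_weights)]


# Noise-free data generated from y = 1 + 2x, split across two workers.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
subsets = [data[0::2], data[1::2]]

weights = [0.0, 0.0]
for _ in range(300):
    # Every worker starts the round from the same averaged weights.
    local = [local_sgd(weights, subset) for subset in subsets]
    weights = average_parameters(local)

if __name__ == "__main__":
    print("averaged weights:", weights)  # approaches [1.0, 2.0]
```

Workers only exchange weights once per round, so communication is infrequent, at the cost of each worker drifting on its own subset between averages.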

Training strategies: distributed gradient based
[Diagram: Worker 1 trains a CNN model on Subset 1 and Worker 2 trains a copy on Subset 2; the gradients of each worker are combined.]
▪ Synchronous
▪ Asynchronous
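
The synchronous variant can be sketched as follows, again with a toy linear model standing in for the CNN and workers simulated in a single process. The function names and hyperparameters are assumptions for illustration. In the asynchronous variant, each worker's gradient would instead be applied as soon as it arrives, without waiting for the others.

```python
def gradient(weights, subset):
    """Mean-squared-error gradient of y = w0 + w1 * x over one worker's subset."""
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in subset:
        err = (w0 + w1 * x) - y
        g0 += 2 * err
        g1 += 2 * err * x
    n = len(subset)
    return [g0 / n, g1 / n]


def synchronous_step(weights, subsets, lr=0.1):
    """Wait for every worker's gradient, average them, apply one shared update."""
    grads = [gradient(weights, subset) for subset in subsets]
    mean_grad = [sum(g) / len(grads) for g in zip(*grads)]
    return [w - lr * g for w, g in zip(weights, mean_grad)]


# Noise-free data generated from y = 1 + 2x, split across two workers.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
subsets = [data[0::2], data[1::2]]

w = [0.0, 0.0]
for _ in range(300):
    w = synchronous_step(w, subsets)

if __name__ == "__main__":
    print("weights after synchronous training:", w)  # approaches [1.0, 2.0]
```

Because the subsets are the same size, averaging the per-worker gradients gives the same update as one large batch over all the data.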

Overview of distributed training
▪ Install software and containers
▪ Provision clusters of VMs
▪ Schedule jobs
▪ Distribute data
▪ Share results
▪ Handle failures
▪ Scale resources

Azure Distributed Platforms
▪ Batch AI
▪ Batch Shipyard
▪ DL Workspace
Horovod

Batch Shipyard
https://github.com/Azure/batch-shipyard
• Supports Docker and Singularity: run your Docker and Singularity containers within the same job, side-by-side or even concurrently
• Move data easily between locally accessible storage systems, remote filesystems, Azure Blob or File Storage, and compute nodes
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes

Batch AI
https://github.com/Azure/BatchAI
• Supports running in Docker containers as well as on the Data Science Virtual Machine
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes

DL Workspace
https://github.com/Microsoft/DLWorkspace
• Runs jobs inside Docker
• Uses Kubernetes
• Can be deployed anywhere, not just on Azure
• Supports local storage and NFS

Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry

1) Create a Batch AI Pool
2) Each job will pull in the appropriate container and script, and load data from the chosen storage
3) Once the job is completed, all the results will be written to the fileshare

Batch AI Interface
▪ CLI
▪ Python SDK
CLI example:
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs

Distributed training with NFS
▪ Batch AI cluster configuration with NFS share
[Diagram labels: Batch AI Pool, NFS Share, Mounted Fileshare, Copy Data]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs

Distributed training with blob storage
▪ Batch AI cluster configuration with mounted blob
[Diagram labels: Batch AI Pool, Mounted Blob, Mounted Fileshare, Copy Data]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key

Distributed training with local storage
▪ Batch AI cluster configuration that copies the data to the nodes
[Diagram labels: Batch AI Pool, Node preparation configuration, Copy Data, Mounted Fileshare]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24r \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    -c cluster.json

Distributed training Results
[Charts on three slides; axis label: images/second]

Distributed training with Horovod
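
Horovod combines gradients across workers with an all-reduce, commonly implemented as a ring all-reduce. The following is not Horovod code: it is a pure-Python, single-process simulation of the ring algorithm, and the function name `ring_allreduce` and the chunking scheme are assumptions made for illustration.

```python
def ring_allreduce(worker_vectors):
    """Simulate a ring all-reduce (sum) over equally long per-worker vectors.

    Returns one list per worker; after the call every worker holds the
    element-wise sum of all input vectors.
    """
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    data = [list(v) for v in worker_vectors]
    # Split each vector into n contiguous chunks.
    bounds = [(i * length) // n for i in range(n + 1)]

    def chunk(i):
        return range(bounds[i], bounds[i + 1])

    # Phase 1, scatter-reduce: after n - 1 steps worker r owns the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages first so all sends are simultaneous.
        messages = []
        for rank in range(n):
            c = (rank - step) % n
            messages.append((c, [data[rank][j] for j in chunk(c)]))
        for rank, (c, values) in enumerate(messages):
            dest = (rank + 1) % n
            for j, v in zip(chunk(c), values):
                data[dest][j] += v

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        messages = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            messages.append((c, [data[rank][j] for j in chunk(c)]))
        for rank, (c, values) in enumerate(messages):
            dest = (rank + 1) % n
            for j, v in zip(chunk(c), values):
                data[dest][j] = v
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    summed = ring_allreduce(grads)
    # Dividing by the number of workers gives the averaged gradient.
    averaged = [[v / len(grads) for v in vec] for vec in summed]
    print(summed[0], averaged[0])
```

Each worker only ever talks to its neighbour in the ring and sends one chunk per step, so the data sent per worker stays roughly constant as workers are added.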

Distributed training with PyTorch

Distributed training with Chainer

Distributed training with CNTK
▪ 1-bit SGD with MPI
▪ Block Momentum with MPI
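
1-bit SGD, as it is usually described, sends only the sign of each gradient value plus a couple of reconstruction values, and keeps the quantization error on the worker to add to the next gradient (error feedback). A minimal sketch under those assumptions follows. It is plain Python rather than CNTK's API, the name `one_bit_quantize` is hypothetical, and using the mean of each sign group as the reconstruction value is an assumption.

```python
def one_bit_quantize(gradient, residual):
    """Quantize a gradient to one bit per value with error feedback.

    Returns (bits, (pos_value, neg_value), decoded, new_residual).
    Only `bits` and the two reconstruction values would be transmitted.
    """
    # Add the error left over from the previous step before quantizing.
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [v >= 0 for v in corrected]
    positives = [v for v in corrected if v >= 0]
    negatives = [v for v in corrected if v < 0]
    pos_value = sum(positives) / len(positives) if positives else 0.0
    neg_value = sum(negatives) / len(negatives) if negatives else 0.0
    # What the receiver reconstructs from the bits and the two values.
    decoded = [pos_value if b else neg_value for b in bits]
    # The part that was lost is carried into the next step.
    new_residual = [c - d for c, d in zip(corrected, decoded)]
    return bits, (pos_value, neg_value), decoded, new_residual


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.5, -0.2, 0.1, -0.4], [0.3, 0.3, -0.1, -0.2]):
        bits, values, decoded, residual = one_bit_quantize(grad, residual)
        print(bits, values, decoded)
```

The residual is what keeps the scheme from losing information permanently: whatever a step rounds away is added back into the next gradient before it is quantized.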

Demo

Acknowledgements
Hongzhi Li
Alex Sutton
Alex Yukhanov
Attribution of some images: http://morguefile.com/
Thanks!
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
