DISTRIBUTED TRAINING OF
DEEP LEARNING MODELS
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
Rosetta Stone of Deep Learning
more info: https://github.com/ilkarman/DeepLearningFrameworks
ImageNet Competition
ImageNet top-5 error (%):
▪ AlexNet (2012): 15.3%
▪ VGG (2014): 7.3%
▪ Inception (2015): 6.7%
▪ ResNet (2015): 3.6%
▪ Inception-ResNet (2016): 3.1%
▪ NASNet (2017): 3.8%
▪ AmoebaNet (2017): 3.8%
▪ ResNeXt Instagram (2018): 2.4%
▪ Human: 5.1%
Distributed training mode: Data parallelism
[Diagram: the job manager splits the dataset into subsets; each worker trains a full copy of the CNN model on its own subset (Subset 1 → Worker 1, Subset 2 → Worker 2).]
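In code, data parallelism means every worker holds a complete copy of the model and trains on its own shard of the data. A minimal, framework-agnostic NumPy sketch of the sharding step (the array shapes and names are illustrative, not from the slides):

import numpy as np

# Hypothetical dataset: 10,000 RGB images of size 224x224
dataset = np.zeros((10000, 224, 224, 3), dtype=np.float32)
num_workers = 2

# The job manager splits the data; each worker trains the SAME model
# architecture on its own subset, and the copies are later combined by
# averaging weights or gradients (see the training-strategy slides).
subsets = np.array_split(dataset, num_workers)
for worker_id, subset in enumerate(subsets):
    print("Worker %d receives %d images" % (worker_id, len(subset)))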
Distributed training mode: Model parallelism
[Diagram: the CNN model is split into submodels; each worker holds one submodel (Submodel 1 → Worker 1, Submodel 2 → Worker 2) while all workers read the same dataset, coordinated by the job manager.]
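Model parallelism instead splits a single model across devices. A minimal PyTorch-style sketch, assuming two visible GPUs (the layer sizes and device names are illustrative, not from the slides):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Submodel 1 lives on the first GPU, submodel 2 on the second
        self.submodel1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.submodel2 = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32 * 32, 10)).to("cuda:1")

    def forward(self, x):
        x = self.submodel1(x.to("cuda:0"))
        # Activations are copied between GPUs at the split point
        return self.submodel2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 3, 32, 32))  # batch of 8 random 32x32 images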
Data parallelism vs model parallelism
Data parallelism
▪ Easier implementation
▪ Stronger fault tolerance
▪ Higher cluster utilization
Model parallelism
▪ Better scalability of large models
▪ Less memory on each GPU
Why not both? Use data parallelism for the convolutional layers and model parallelism for the FC layers
source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
Training strategies: parameter averaging
[Diagram: each worker trains a full CNN model copy on its own data subset; the weights of all workers are periodically averaged.]
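A toy NumPy sketch of the averaging step, assuming two workers that have each run a few local updates (the parameter names and shapes are illustrative):

import numpy as np

# Weights of the same layer on each worker after some local training
worker_weights = [
    {"conv1": np.random.randn(64, 3, 3, 3)},  # worker 1
    {"conv1": np.random.randn(64, 3, 3, 3)},  # worker 2
]

# The job manager averages parameter by parameter and sends the result
# back, so every worker continues training from identical weights
averaged = {
    name: np.mean([w[name] for w in worker_weights], axis=0)
    for name in worker_weights[0]
}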
Training strategies: distributed gradient based
[Diagram: each worker computes gradients of the CNN model on its own data subset; the gradients are exchanged either synchronously or asynchronously.]
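Here the workers exchange gradients rather than weights. Synchronous training averages the gradients of all workers before every update; asynchronous training lets each worker apply its gradient as soon as it is ready, at the cost of stale updates. A toy NumPy sketch of one synchronous step (the learning rate and shapes are illustrative):

import numpy as np

weights = np.random.randn(256, 10)  # shared model parameters
worker_grads = [np.random.randn(256, 10) for _ in range(2)]  # one gradient per worker
lr = 0.01

# Synchronous update: average the gradients from all workers and apply a
# single step, so every worker ends up with the same new weights
mean_grad = np.mean(worker_grads, axis=0)
weights -= lr * mean_grad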
Overview of distributed training
▪ Install software and containers
▪ Provision clusters of VMs
▪ Schedule jobs
▪ Distribute data
▪ Share results
▪ Handle failures
▪ Scale resources
Azure Distributed Platforms
▪ Batch AI
▪ Batch Shipyard
▪ DL Workspace
Horovod
Batch Shipyard
https://github.com/Azure/batch-shipyard
• Supports Docker and Singularity: run your Docker and Singularity containers within the same job, side-by-side or even concurrently
• Move data easily between locally accessible storage systems, remote filesystems, Azure Blob or File Storage, and compute nodes
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes
Batch AI
https://github.com/Azure/BatchAI
• Supports running in a Docker container as well as on the Data Science Virtual Machine
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes
DL Workspace
https://github.com/Microsoft/DLWorkspace
• Runs jobs inside Docker
• Uses Kubernetes
• Can be deployed anywhere, not just Azure
• Supports local storage and NFS
Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry
1) Create a Batch AI pool
2) Each job pulls in the appropriate container and script, and loads the data from the chosen storage
3) Once the job completes, all results are written to the fileshare
Batch AI Interface
CLI
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs
Python SDK
Distributed training with NFS
▪ Batch AI cluster configuration with NFS share
[Diagram: data is copied to the NFS share, which is mounted on every node of the Batch AI pool alongside the mounted fileshare.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs
Distributed training with blob storage
▪ Batch AI cluster configuration with mounted blob
[Diagram: data is copied to a blob container, which is mounted on every node of the Batch AI pool alongside the mounted fileshare.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key
Distributed training with local storage
▪ Batch AI cluster configuration that copies the data to the nodes
[Diagram: a node preparation task copies the data to local storage on every node of the Batch AI pool; a fileshare is also mounted.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24r \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    -c cluster.json
Distributed training results
[Charts: training throughput in images/second.]
Distributed training with Horovod
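The Horovod slides walked through the few changes needed to make single-GPU training code distributed. As a hedged illustration of that pattern (not the code from the slides), a minimal sketch with horovod.torch; the model, optimizer, and learning rate are placeholders:

import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU, launched with mpirun/horovodrun
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

model = torch.nn.Linear(784, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are all-reduced (averaged) across workers
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make every worker start from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)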
Distributed training with PyTorch
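As an illustration of PyTorch's native approach (not necessarily the code shown on the slide): torch.distributed plus DistributedDataParallel, assuming one process per GPU with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT set in the environment:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(784, 10).cuda()  # placeholder model
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])

# backward() on ddp_model now all-reduces gradients across processes;
# pair it with torch.utils.data.distributed.DistributedSampler so each
# process reads a different shard of the dataset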
Distributed training with Chainer
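Chainer's multi-node training comes from ChainerMN: a communicator is created over MPI and the optimizer is wrapped so that gradients are all-reduced. A minimal sketch (the model and optimizer are placeholders, not the code from the slide):

import chainer
import chainermn

# One MPI process per GPU, launched with mpiexec
comm = chainermn.create_communicator("hierarchical")
device = comm.intra_rank                 # local GPU id for this process

model = chainer.links.Linear(784, 10)    # placeholder model
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

# The wrapped optimizer all-reduces gradients across workers
optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.SGD(), comm)
optimizer.setup(model)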
Distributed training with CNTK
▪ 1-bit SGD with MPI
▪ Block Momentum with MPI
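In CNTK the training script stays largely the same and only the learner is wrapped: 1-bit SGD quantizes gradients to cut communication, while Block Momentum synchronizes workers less frequently. A sketch of both wrappers (the toy model and base learner are placeholders; 1-bit quantization requires a CNTK build with 1-bit SGD support):

import cntk as C
from cntk.train.distributed import (Communicator,
                                    block_momentum_distributed_learner,
                                    data_parallel_distributed_learner)

# Placeholder model and base learner
z = C.layers.Dense(10)(C.input_variable(784))
learner = C.sgd(z.parameters, lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch))

# 1-bit SGD: gradients are quantized to 1 bit before the MPI exchange
dist_learner = data_parallel_distributed_learner(learner, num_quantization_bits=1)

# Block Momentum: workers synchronize every `block_size` samples
bm_learner = block_momentum_distributed_learner(learner, block_size=32000)

# ... pass one of the wrapped learners to a Trainer and run under mpiexec ...
Communicator.finalize()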
Demo
Acknowledgements
Hongzhi Li
Alex Sutton
Alex Yukhanov
Attribution of some images: http://morguefile.com/
Thanks!
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
