DISTRIBUTED TRAINING OF
DEEP LEARNING MODELS
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
Rosetta Stone of Deep Learning
more info: https://github.com/ilkarman/DeepLearningFrameworks
ImageNet Competition
ImageNet top-5 error (%):
▪ AlexNet (2012): 15.3%
▪ VGG (2014): 7.3%
▪ Inception (2015): 6.7%
▪ ResNet (2015): 3.6%
▪ Inception-ResNet (2016): 3.1%
▪ NASNet (2017): 3.8%
▪ AmoebaNet (2017): 3.8%
▪ ResNeXt Instagram (2018): 2.4%
▪ Human: 5.1%
Distributed training mode: Data parallelism
[Diagram: the job manager splits the dataset into subsets; each worker trains a full copy of the CNN model on its own subset (Subset 1 → Worker 1, Subset 2 → Worker 2).]
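In code, data parallelism means every worker holds a complete copy of the model and trains on its own shard of the data. A minimal, framework-agnostic NumPy sketch of the sharding step (the array shapes and names are illustrative, not from the slides):

import numpy as np

# Hypothetical dataset: 10,000 RGB images of size 224x224
dataset = np.zeros((10000, 224, 224, 3), dtype=np.float32)
num_workers = 2

# The job manager splits the data; each worker trains the SAME model
# architecture on its own subset, and the copies are later combined by
# averaging weights or gradients (see the training-strategy slides).
subsets = np.array_split(dataset, num_workers)
for worker_id, subset in enumerate(subsets):
    print("Worker %d receives %d images" % (worker_id, len(subset)))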
Distributed training mode: Model parallelism
[Diagram: the CNN model is split into submodels; each worker holds one submodel (Submodel 1 → Worker 1, Submodel 2 → Worker 2) while all workers read the same dataset, coordinated by the job manager.]
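Model parallelism instead splits a single model across devices. A minimal PyTorch-style sketch, assuming two visible GPUs (the layer sizes and device names are illustrative, not from the slides):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Submodel 1 lives on the first GPU, submodel 2 on the second
        self.submodel1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to("cuda:0")
        self.submodel2 = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32 * 32, 10)).to("cuda:1")

    def forward(self, x):
        x = self.submodel1(x.to("cuda:0"))
        # Activations are copied between GPUs at the split point
        return self.submodel2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 3, 32, 32))  # batch of 8 random 32x32 images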
Data parallelism vs model parallelism
Data parallelism
▪ Easier implementation
▪ Stronger fault tolerance
▪ Higher cluster utilization
Model parallelism
▪ Better scalability of large models
▪ Less memory on each GPU
Why not both? Use data parallelism for the convolutional layers and model parallelism for the FC layers
source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
Training strategies: parameter averaging
[Diagram: each worker trains a full CNN model copy on its own data subset; the weights of all workers are periodically averaged.]
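A toy NumPy sketch of the averaging step, assuming two workers that have each run a few local updates (the parameter names and shapes are illustrative):

import numpy as np

# Weights of the same layer on each worker after some local training
worker_weights = [
    {"conv1": np.random.randn(64, 3, 3, 3)},  # worker 1
    {"conv1": np.random.randn(64, 3, 3, 3)},  # worker 2
]

# The job manager averages parameter by parameter and sends the result
# back, so every worker continues training from identical weights
averaged = {
    name: np.mean([w[name] for w in worker_weights], axis=0)
    for name in worker_weights[0]
}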
Training strategies: distributed gradient based
[Diagram: each worker computes gradients of the CNN model on its own data subset; the gradients are exchanged either synchronously or asynchronously.]
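Here the workers exchange gradients rather than weights. Synchronous training averages the gradients of all workers before every update; asynchronous training lets each worker apply its gradient as soon as it is ready, at the cost of stale updates. A toy NumPy sketch of one synchronous step (the learning rate and shapes are illustrative):

import numpy as np

weights = np.random.randn(256, 10)  # shared model parameters
worker_grads = [np.random.randn(256, 10) for _ in range(2)]  # one gradient per worker
lr = 0.01

# Synchronous update: average the gradients from all workers and apply a
# single step, so every worker ends up with the same new weights
mean_grad = np.mean(worker_grads, axis=0)
weights -= lr * mean_grad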
Overview of distributed training
▪ Install software and containers
▪ Provision clusters of VMs
▪ Schedule jobs
▪ Distribute data
▪ Share results
▪ Handle failures
▪ Scale resources
Azure Distributed Platforms
▪ Batch AI
▪ Batch Shipyard
▪ DL Workspace
Horovod
Batch Shipyard
https://github.com/Azure/batch-shipyard
• Supports Docker and Singularity: run your Docker and Singularity containers within the same job, side-by-side or even concurrently
• Move data easily between locally accessible storage systems, remote filesystems, Azure Blob or File Storage, and compute nodes
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes
Batch AI
https://github.com/Azure/BatchAI
• Supports running in a Docker container as well as on the Data Science Virtual Machine
• Supports local storage, Azure Blob or File Storage, and NFS
• Low priority nodes
DL Workspace
https://github.com/Microsoft/DLWorkspace
• Runs jobs inside Docker
• Uses Kubernetes
• Can be deployed anywhere, not just Azure
• Supports local storage and NFS
Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry
1) Create a Batch AI pool
2) Each job pulls in the appropriate container and script, and loads the data from the chosen storage
3) Once the job completes, all results are written to the fileshare
Batch AI Interface
CLI
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs
Python SDK
Distributed training with NFS
▪ Batch AI cluster configuration with NFS share
[Diagram: data is copied to the NFS share, which is mounted on every node of the Batch AI pool alongside the mounted fileshare.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs
Distributed training with blob storage
▪ Batch AI cluster configuration with mounted blob
[Diagram: data is copied to a blob container, which is mounted on every node of the Batch AI pool alongside the mounted fileshare.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24rs_v3 \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key
Distributed training with local storage
▪ Batch AI cluster configuration that copies the data to the nodes
[Diagram: a node preparation task copies the data to local storage on every node of the Batch AI pool; a fileshare is also mounted.]
az batchai cluster create \
    --name nc24r \
    --image UbuntuLTS \
    --vm-size Standard_NC24r \
    --min 8 --max 8 \
    --afs-name $FILESHARE_NAME \
    --afs-mount-path extfs \
    --container-name $CONTAINER_NAME \
    --container-mount-path extcn \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    -c cluster.json
Distributed training results
[Charts: training throughput in images/second.]
Distributed training with Horovod
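The Horovod slides walked through the few changes needed to make single-GPU training code distributed. As a hedged illustration of that pattern (not the code from the slides), a minimal sketch with horovod.torch; the model, optimizer, and learning rate are placeholders:

import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU, launched with mpirun/horovodrun
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

model = torch.nn.Linear(784, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are all-reduced (averaged) across workers
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make every worker start from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)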
Distributed training with PyTorch
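As an illustration of PyTorch's native approach (not necessarily the code shown on the slide): torch.distributed plus DistributedDataParallel, assuming one process per GPU with RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT set in the environment:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
dist.init_process_group(backend="nccl", init_method="env://")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(784, 10).cuda()  # placeholder model
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])

# backward() on ddp_model now all-reduces gradients across processes;
# pair it with torch.utils.data.distributed.DistributedSampler so each
# process reads a different shard of the dataset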
Distributed training with Chainer
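Chainer's multi-node training comes from ChainerMN: a communicator is created over MPI and the optimizer is wrapped so that gradients are all-reduced. A minimal sketch (the model and optimizer are placeholders, not the code from the slide):

import chainer
import chainermn

# One MPI process per GPU, launched with mpiexec
comm = chainermn.create_communicator("hierarchical")
device = comm.intra_rank                 # local GPU id for this process

model = chainer.links.Linear(784, 10)    # placeholder model
chainer.cuda.get_device_from_id(device).use()
model.to_gpu()

# The wrapped optimizer all-reduces gradients across workers
optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.SGD(), comm)
optimizer.setup(model)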
Distributed training with CNTK
▪ 1-bit SGD with MPI
▪ Block Momentum with MPI
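In CNTK the training script stays largely the same and only the learner is wrapped: 1-bit SGD quantizes gradients to cut communication, while Block Momentum synchronizes workers less frequently. A sketch of both wrappers (the toy model and base learner are placeholders; 1-bit quantization requires a CNTK build with 1-bit SGD support):

import cntk as C
from cntk.train.distributed import (Communicator,
                                    block_momentum_distributed_learner,
                                    data_parallel_distributed_learner)

# Placeholder model and base learner
z = C.layers.Dense(10)(C.input_variable(784))
learner = C.sgd(z.parameters, lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch))

# 1-bit SGD: gradients are quantized to 1 bit before the MPI exchange
dist_learner = data_parallel_distributed_learner(learner, num_quantization_bits=1)

# Block Momentum: workers synchronize every `block_size` samples
bm_learner = block_momentum_distributed_learner(learner, block_size=32000)

# ... pass one of the wrapped learners to a Trainer and run under mpiexec ...
Communicator.finalize()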
Demo
Acknowledgements
Hongzhi Li
Alex Sutton
Alex Yukhanov
Attribution of some images: http://morguefile.com/
Thanks!
Mathew Salvaris @msalvaris
Ilia Karmanov @ikdeepl
Miguel Fierro @miguelgfierro
