Distributed Training (ODSC)
1. Distributed Training on Multi-Node Multi-GPU of Deep Neural Networks
By Mathew Salvaris, Ilia Karmanov and Miguel Fierro
@msalvaris, @ikdeepl and @miguelgfierro
2. Deep Learning Model (CNN)
(Diagram: from the RGB channels of the input image, through convolution layers with kernels, a pooling layer and a fully connected layer, to the penultimate layer and the output classes Cat, Dog, Mouse)
5. Distributed training mode: Data parallelism
(Diagram: a job manager splits the dataset into subsets; Worker 1 and Worker 2 each hold a full copy of the CNN model and train on Subset 1 and Subset 2 respectively)
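To make the data-parallel pattern concrete, here is a minimal sketch in the style of the Horovod-based runs used later in the talk: one process per GPU, each seeing a different shard of the data, with gradients averaged across workers. The tiny model and the synthetic dataset are illustrative placeholders, not the benchmark code.

```python
# Minimal data-parallel training sketch with Horovod (PyTorch).
# The model and synthetic dataset are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.utils.data as data
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its GPU

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 10)).cuda()

dataset = data.TensorDataset(torch.randn(1024, 3, 224, 224),
                             torch.randint(0, 10, (1024,)))
# Each worker trains on a different subset of the data.
sampler = data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = data.DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Gradients are averaged across all workers with an allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# All workers start from the same weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```

Every worker holds the full model; only the data is partitioned.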
6. Distributed training mode: Model parallelism
(Diagram: the CNN model itself is split between Worker 1 and Worker 2, with a job manager coordinating work on the dataset)
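A minimal sketch of the model-parallel idea in PyTorch, assuming two GPUs on one node: the convolutional trunk lives on cuda:0, the fully connected head on cuda:1, and only the activations cross the device boundary. This illustrates the concept, not the configuration benchmarked in this talk.

```python
# Model parallelism sketch: one network split across two GPUs (assumes 2 GPUs).
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional trunk on the first GPU.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten()).to("cuda:0")
        # Fully connected head on the second GPU.
        self.classifier = nn.Sequential(
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 10)).to("cuda:1")

    def forward(self, x):
        x = self.features(x.to("cuda:0"))
        # Move the activations (not the weights) between GPUs.
        return self.classifier(x.to("cuda:1"))

model = TwoGPUNet()
logits = model(torch.randn(64, 3, 224, 224))   # output lives on cuda:1
```

Each GPU holds only its share of the parameters, which is where the memory savings on the next slide come from.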
7. Data parallelism vs model parallelism
Data parallelism
• Easier implementation
• Stronger fault tolerance
• Higher cluster utilization
Model parallelism
• Better scalability for large models
• Less memory needed on each GPU
Why not both? Use data parallelism for the convolutional layers and model parallelism for the fully connected layers
source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
8. Managed distributed training: Batch AI
• Dependencies and containers
• Provision clusters of VMs
• Schedule jobs
• Distribute data
• Gather results
• Handle failures
• Scale resources
9. Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the docker containers for each DL framework and transfer them to a container registry
10. 1) Create a Batch AI pool
2) Each job will pull in the appropriate container and script and load data from the chosen storage
3) Once the job is completed all the results will be written to the fileshare
(Diagram: the Batch AI pool executing the job)
11. Setup
Clusters of 8 nodes using K80, P40, P100 and V100 GPUs (4 GPUs per node + InfiniBand)
Two MPI configurations: OpenMPI+NCCL and Intel MPI
12. Experiments
345 experiments across many different models including ResNet50, MobileNet V2, etc.
Using synthetic data
Batch size remains 64 across all models and GPUs
Use the benchmarking scripts that TensorFlow and Horovod use
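The numbers reported here come from the TensorFlow/Horovod benchmarking scripts, but the idea behind a synthetic-data run is simple: feed random tensors of the right shape so that storage and preprocessing are taken out of the measurement. A rough single-GPU sketch of that measurement (the warm-up and iteration counts are arbitrary choices, not the benchmark settings):

```python
# Rough images-per-second measurement on synthetic data (single-GPU sketch).
import time
import torch
import torchvision.models as models

model = models.resnet50().cuda().train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

batch = torch.randn(64, 3, 224, 224, device="cuda")    # synthetic images
labels = torch.randint(0, 1000, (64,), device="cuda")   # synthetic labels

def step():
    optimizer.zero_grad()
    loss_fn(model(batch), labels).backward()
    optimizer.step()

for _ in range(10):          # warm-up iterations, not timed
    step()
torch.cuda.synchronize()

start, iters = time.time(), 50
for _ in range(iters):
    step()
torch.cuda.synchronize()
print("images/sec:", 64 * iters / (time.time() - start))
```

In the multi-node runs, each MPI worker executes a loop like this under Horovod and the per-worker rates are summed to give cluster throughput.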
13. Distributed training with synthetic data
• Cluster configuration with synthetic data
(Diagram: Batch AI pool with a mounted fileshare)
19. Experiments
Using ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]
Using real and synthetic data. Real data on local, NFS and Blob storage
Batch size remains 64 across all configurations
Uses V100 GPUs
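For the real-data runs, the only thing that changes between the local, NFS and blob configurations is where the images are read from. A minimal sketch using torchvision's ImageFolder; the mount paths below are hypothetical examples, not the paths used in these benchmarks:

```python
# Reading real training data from whichever storage is mounted on the node.
# The paths are hypothetical; swap in the local-disk, NFS or blob mount point.
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

DATA_DIR = "/mnt/local_ssd/imagenet/train"   # e.g. "/mnt/nfs/..." or "/mnt/blob/..."

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)
```

The slower the mount, the longer the GPUs sit idle waiting for batches, which is exactly what the local vs NFS vs blob comparison in the next slides measures.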
20. Distributed training with NFS
• Cluster configuration with NFS share
(Diagram: Batch AI pool with an NFS share and a mounted fileshare; data is copied onto the NFS share)
21. Distributed training with blob storage
• Cluster configuration with mounted blob
(Diagram: Batch AI pool with a mounted blob container and a mounted fileshare; data is copied to blob)
22. Distributed training with local storage
• Cluster configuration with the data copied to the nodes
(Diagram: Batch AI pool with data copied onto each node and a mounted fileshare)
28. Observations & Conclusions
• Don't use blob
• Use local storage wherever possible; if not, use NFS
• For distributing across nodes use Intel MPI; within nodes OpenMPI+NCCL is probably preferable
• Scaling efficiency gets worse with faster GPUs at a batch size of 64
• Don't use distributed training for small models
• Distributed training can be quite inefficient and should only be used under the right circumstances:
  • The model is too big and a sensible batch size can't fit on a single GPU
  • The problem can't be addressed by distributing the model in a simple parallel way
• Be aware of framework-specific limitations
29. Thanks! @msalvaris, @ikdeepl and @miguelgfierro
https://github.com/msalvaris/BatchAIHorovodBenchmark
https://github.com/msalvaris/gpu_monitor
https://github.com/Microsoft/DistributedDeepLearning
Editor's Notes
Copy of the entire model on each worker, processing different subsets of the training data set on each.
Provisioning clusters of VMs, installing software and containers, queuing work, prioritizing and scheduling jobs, handling failures, distributing data, sharing results, scaling resources to manage costs, and integrating with tools and workflows
Example flow of working with Batch AI
Describe the diagram
Flow of execution
4 x 250 GB disks on a single Standard_DS4_v2
Copy data onto the node using AzCopy
Should provide greater throughput; was able to achieve around 350 MB/s
Better IOPS than Blob storage
Cons: expensive compared to other options
The y axis is images per second
On the x axis we have the different CNN architectures and GPU types
Later generations of GPU are faster, with the V100s being the fastest
Larger networks are slower to train than smaller ones
These numbers are more or less the same everywhere
Now at 32 GPUs
The y axis is images per second
The x axis shows GPU type and network architecture
The purple bar is using Intel MPI (InfiniBand)
The light blue is OpenMPI and NCCL (no InfiniBand)
As we can see the V100 is faster, but it isn't quite as dominating as with the single GPU
Here we are reporting something a little different: scaling efficiency
As we can see, the V100s' scaling efficiency is quite poor
We interpret this as follows: the amount of information that has to be passed around is the same for each CNN configuration, but the pace at which the GPUs process each batch isn't. So what we see here is that we don't only need faster GPUs but also far faster networks
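For reference, the scaling efficiency plotted here is, in the usual definition, the measured multi-GPU throughput divided by what perfect linear scaling of the single-GPU rate would give. A one-line version of that calculation (the numbers in the call are made up for illustration, not benchmark results):

```python
def scaling_efficiency(images_per_sec_n_gpus, images_per_sec_1_gpu, n_gpus):
    """Measured throughput relative to perfect linear scaling."""
    return images_per_sec_n_gpus / (n_gpus * images_per_sec_1_gpu)

# Illustrative numbers only:
print(scaling_efficiency(images_per_sec_n_gpus=5000.0,
                         images_per_sec_1_gpu=300.0, n_gpus=32))  # ~0.52
```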
This is the same as an earlier graph, except now we are adding MobileNet, which is a small CNN designed to be quick
As we can see it is very quick to train. We achieve over 25k images a second on 32 GPUs
The problem is that the scaling efficiency is miserable. So for smaller networks it really isn't worth doing distributed training
4 x 250 GB disks on a single Standard_DS4_v2
Copy data onto the node using AzCopy
Should provide greater throughput; was able to achieve around 350 MB/s
Better IOPS than Blob storage
Cons: expensive compared to other options
Cheaper to use
Still good performance 200 MB/s
Copy data to blob with AzCopy
Has to be copied as separate files
Cheap and less complicated since no attached storage
The longest to set up. Need to copy the files to every node.
If a node goes down or we need to recreate the cluster we have to copy the data again.
Here we compare local and synthetic data
Local on the left
Synthetic on the right
Blue is Keras
Red is PyTorch
Yellow is TF
We can see synthetic is quicker overall, as we might expect
In terms of speed TensorFlow is fastest, second is Keras and then PyTorch. This is because PyTorch uses NCCL and therefore cannot use Intel MPI, and therefore no InfiniBand
It may be a little hard to see here, but on a single node (up to 4 GPUs) PyTorch is the quickest
We also notice a drop in performance from synthetic to local
Blob on the left
NFS on the right
Blob is really slow; even on a single node blob is terrible