Distributed Training (ODSC)
1. Distributed Training on Multi-Node Multi-GPU of Deep Neural Networks
By Mathew Salvaris, Ilia Karmanov and Miguel Fierro
@msalvaris, @ikdeepl and @miguelgfierro
2. Deep Learning Model (CNN)
(Diagram: from the RGB channels of the input image, through convolution layers with kernels, a pooling layer and a fully connected layer, to the penultimate layer and the output classes Cat, Dog, Mouse)
5. Distributed training mode: Data parallelism
(Diagram: a job manager splits the dataset into subsets; Worker 1 and Worker 2 each hold a full copy of the CNN model and train on Subset 1 and Subset 2 respectively)
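To make the data-parallel pattern concrete, here is a minimal sketch in the style of the Horovod-based runs used later in the talk: one process per GPU, each seeing a different shard of the data, with gradients averaged across workers. The tiny model and the synthetic dataset are illustrative placeholders, not the benchmark code.

```python
# Minimal data-parallel training sketch with Horovod (PyTorch).
# The model and synthetic dataset are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.utils.data as data
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its GPU

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 10)).cuda()

dataset = data.TensorDataset(torch.randn(1024, 3, 224, 224),
                             torch.randint(0, 10, (1024,)))
# Each worker trains on a different subset of the data.
sampler = data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = data.DataLoader(dataset, batch_size=64, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Gradients are averaged across all workers with an allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# All workers start from the same weights.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```

Every worker holds the full model; only the data is partitioned.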
6. Distributed training mode: Model parallelism
(Diagram: the CNN model itself is split between Worker 1 and Worker 2, with a job manager coordinating work on the dataset)
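A minimal sketch of the model-parallel idea in PyTorch, assuming two GPUs on one node: the convolutional trunk lives on cuda:0, the fully connected head on cuda:1, and only the activations cross the device boundary. This illustrates the concept, not the configuration benchmarked in this talk.

```python
# Model parallelism sketch: one network split across two GPUs (assumes 2 GPUs).
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional trunk on the first GPU.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten()).to("cuda:0")
        # Fully connected head on the second GPU.
        self.classifier = nn.Sequential(
            nn.Linear(32, 128), nn.ReLU(),
            nn.Linear(128, 10)).to("cuda:1")

    def forward(self, x):
        x = self.features(x.to("cuda:0"))
        # Move the activations (not the weights) between GPUs.
        return self.classifier(x.to("cuda:1"))

model = TwoGPUNet()
logits = model(torch.randn(64, 3, 224, 224))   # output lives on cuda:1
```

Each GPU holds only its share of the parameters, which is where the memory savings on the next slide come from.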
7. Data parallelism vs model parallelism
Data parallelism
• Easier implementation
• Stronger fault tolerance
• Higher cluster utilization
Model parallelism
• Better scalability for large models
• Less memory needed on each GPU
Why not both? Use data parallelism for the convolutional layers and model parallelism for the fully connected layers
source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
8. Managed distributed training: Batch AI
• Dependencies and containers
• Provision clusters of VMs
• Schedule jobs
• Distribute data
• Gather results
• Handle failures
• Scale resources
9. Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the docker containers for each DL framework and transfer them to a container registry
10. 1) Create a Batch AI pool
2) Each job will pull in the appropriate container and script and load data from the chosen storage
3) Once the job is completed all the results will be written to the fileshare
(Diagram: the Batch AI pool executing the job)
11. Setup
Clusters of 8 nodes using K80, P40, P100 and V100 GPUs (4 GPUs per node + InfiniBand)
Two MPI configurations: OpenMPI+NCCL and Intel MPI
12. Experiments
345 experiments across many different models including ResNet50, MobileNet V2, etc.
Using synthetic data
Batch size remains 64 across all models and GPUs
Use the benchmarking scripts that TensorFlow and Horovod use
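The numbers reported here come from the TensorFlow/Horovod benchmarking scripts, but the idea behind a synthetic-data run is simple: feed random tensors of the right shape so that storage and preprocessing are taken out of the measurement. A rough single-GPU sketch of that measurement (the warm-up and iteration counts are arbitrary choices, not the benchmark settings):

```python
# Rough images-per-second measurement on synthetic data (single-GPU sketch).
import time
import torch
import torchvision.models as models

model = models.resnet50().cuda().train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

batch = torch.randn(64, 3, 224, 224, device="cuda")    # synthetic images
labels = torch.randint(0, 1000, (64,), device="cuda")   # synthetic labels

def step():
    optimizer.zero_grad()
    loss_fn(model(batch), labels).backward()
    optimizer.step()

for _ in range(10):          # warm-up iterations, not timed
    step()
torch.cuda.synchronize()

start, iters = time.time(), 50
for _ in range(iters):
    step()
torch.cuda.synchronize()
print("images/sec:", 64 * iters / (time.time() - start))
```

In the multi-node runs, each MPI worker executes a loop like this under Horovod and the per-worker rates are summed to give cluster throughput.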
13. Distributed training with synthetic data
• Cluster configuration with synthetic data
(Diagram: Batch AI pool with a mounted fileshare)
19. Experiments
Using ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]
Using real and synthetic data. Real data on local, NFS and Blob storage
Batch size remains 64 across all configurations
Uses V100 GPUs
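For the real-data runs, the only thing that changes between the local, NFS and blob configurations is where the images are read from. A minimal sketch using torchvision's ImageFolder; the mount paths below are hypothetical examples, not the paths used in these benchmarks:

```python
# Reading real training data from whichever storage is mounted on the node.
# The paths are hypothetical; swap in the local-disk, NFS or blob mount point.
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

DATA_DIR = "/mnt/local_ssd/imagenet/train"   # e.g. "/mnt/nfs/..." or "/mnt/blob/..."

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)
```

The slower the mount, the longer the GPUs sit idle waiting for batches, which is exactly what the local vs NFS vs blob comparison in the next slides measures.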
20. Distributed training with NFS
• Cluster configuration with NFS share
(Diagram: Batch AI pool with an NFS share and a mounted fileshare; data is copied onto the NFS share)
21. Distributed training with blob storage
• Cluster configuration with mounted blob
(Diagram: Batch AI pool with a mounted blob container and a mounted fileshare; data is copied to blob)
22. Distributed training with local storage
• Cluster configuration with the data copied to the nodes
(Diagram: Batch AI pool with data copied onto each node and a mounted fileshare)
28. Observations & Conclusions
• Don't use blob
• Use local storage wherever possible; if not, use NFS
• For distributing across nodes use Intel MPI; within nodes OpenMPI+NCCL is probably preferable
• Scaling efficiency gets worse with faster GPUs at a batch size of 64
• Don't use distributed training for small models
• Distributed training can be quite inefficient and should only be used under the right circumstances:
  • The model is too big and a sensible batch size can't fit on a single GPU
  • The problem can't be addressed by distributing the model in a simple parallel way
• Be aware of framework-specific limitations
29. Thanks! @msalvaris, @ikdeepl and @miguelgfierro
https://github.com/msalvaris/BatchAIHorovodBenchmark
https://github.com/msalvaris/gpu_monitor
https://github.com/Microsoft/DistributedDeepLearning
Editor's Notes
Copy of the entire model on each worker, processing different subsets of the training data set on each.
Provisioning clusters of VMs, installing software and containers, queuing work, prioritizing and scheduling jobs, handling failures, distributing data, sharing results, scaling resources to manage costs, and integrating with tools and workflows
Example flow of working with Batch AI
Describe the diagram
Flow of execution
4 x 250 GB disks on a single Standard_DS4_v2
Copy data onto the node using AzCopy
Should provide greater throughput; was able to achieve around 350 MB/s
Better IOPS than Blob storage
Cons: expensive compared to other options
The y axis is images per second
On the x axis we have the different CNN architectures and GPU types
Later generations of GPU are faster, with the V100s being the fastest
Larger networks are slower to train than smaller ones
These numbers are more or less the same everywhere
Now at 32 GPUs
The y axis is images per second
The x axis shows GPU type and network architecture
The purple bar is using Intel MPI (InfiniBand)
The light blue is OpenMPI and NCCL (no InfiniBand)
As we can see the V100 is faster, but it isn't quite as dominating as with the single GPU
Here we are reporting something a little different: scaling efficiency
As we can see, the V100s' scaling efficiency is quite poor
We interpret this as follows: the amount of information that has to be passed around is the same for each CNN configuration, but the pace at which the GPUs process each batch isn't. So what we see here is that we don't only need faster GPUs but also far faster networks
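For reference, the scaling efficiency plotted here is, in the usual definition, the measured multi-GPU throughput divided by what perfect linear scaling of the single-GPU rate would give. A one-line version of that calculation (the numbers in the call are made up for illustration, not benchmark results):

```python
def scaling_efficiency(images_per_sec_n_gpus, images_per_sec_1_gpu, n_gpus):
    """Measured throughput relative to perfect linear scaling."""
    return images_per_sec_n_gpus / (n_gpus * images_per_sec_1_gpu)

# Illustrative numbers only:
print(scaling_efficiency(images_per_sec_n_gpus=5000.0,
                         images_per_sec_1_gpu=300.0, n_gpus=32))  # ~0.52
```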
This is the same as an earlier graph, except now we are adding MobileNet, which is a small CNN designed to be quick
As we can see it is very quick to train. We achieve over 25k images a second on 32 GPUs
The problem is that the scaling efficiency is miserable. So for smaller networks it really isn't worth doing distributed training
4 x 250 GB disks on a single Standard_DS4_v2
Copy data onto the node using AzCopy
Should provide greater throughput; was able to achieve around 350 MB/s
Better IOPS than Blob storage
Cons: expensive compared to other options
Cheaper to use
Still good performance 200 MB/s
Copy data to blob with AzCopy
Has to be copied as separate files
Cheap and less complicated since no attached storage
The longest to set up. Need to copy the files to every node.
If a node goes down or we need to recreate the cluster we have to copy the data again.
Here we compare local and synthetic data
Local on the left
Synthetic on the right
Blue is Keras
Red is PyTorch
Yellow is TF
We can see synthetic is quicker overall, as we might expect
In terms of speed TensorFlow is fastest, second is Keras and then PyTorch. This is because PyTorch uses NCCL and therefore cannot use Intel MPI, and therefore no InfiniBand
It may be a little hard to see here, but on a single node (up to 4 GPUs) PyTorch is the quickest
We also notice a drop in performance from synthetic to local
Blob on the left
NFS on the right
Blob is really slow; even on a single node blob is terrible