Distributed Training on Multi-Node Multi-GPU of Deep Neural Networks
By Mathew Salvaris, Ilia Karmanov and Miguel Fierro
@msalvaris, @ikdeepl and @miguelgfierro
Deep Learning Model (CNN)
[Diagram: the RGB channels of an input image pass through convolution layers with kernels and a pooling layer into a fully connected layer; the penultimate layer feeds the output classes (Cat, Dog, Mouse).]
Rosetta Stone of Deep Learning
More info: https://github.com/ilkarman/DeepLearningFrameworks
ImageNet Competition (top-5 error, %)
AlexNet (2012): 15.3%
VGG (2014): 7.3%
Inception (2015): 6.7%
ResNet (2015): 3.6%
Inception-ResNet (2016): 3.1%
NASNet (2017): 3.8%
AmoebaNet (2017): 3.8%
ResNeXt Instagram (2018): 2.4%
Human: 5.1%
Distributed training mode: Data parallelism
[Diagram: a job manager splits the dataset into subsets; each worker (Worker 1, Worker 2) holds a full copy of the CNN model and trains it on its own subset.]
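As a concrete illustration, here is a minimal data-parallel training sketch with Horovod and PyTorch, the combination benchmarked later in this deck. The model, learning rate and synthetic batch are placeholders, not the deck's actual scripts (those are the TensorFlow/Horovod benchmark scripts linked at the end).

```python
# Minimal data parallelism with Horovod + PyTorch: every worker holds a
# full model replica, trains on its own shard of data, and gradients are
# averaged with allreduce. Hyperparameters here are illustrative.
import torch
import torch.nn as nn
import horovod.torch as hvd
from torchvision import models

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to its GPU

model = models.resnet50().cuda()             # full replica on each worker
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are allreduce-averaged across workers
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all replicas from identical weights
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

criterion = nn.CrossEntropyLoss()
batch = torch.randn(64, 3, 224, 224).cuda()    # this worker's (synthetic) subset
labels = torch.randint(0, 1000, (64,)).cuda()
for _ in range(10):
    optimizer.zero_grad()
    criterion(model(batch), labels).backward()
    optimizer.step()   # allreduce happens inside the wrapped optimizer
```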
Distributed training mode: Model parallelism
[Diagram: coordinated by the job manager, the CNN model itself is partitioned across workers, each holding and training only part of the model rather than a full replica.]
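For contrast, a minimal model-parallelism sketch: one process splits a small network across two GPUs so each device holds only part of the weights, and activations hop between devices. The layer split is illustrative, not the deck's setup.

```python
# Model parallelism in PyTorch: the model is split across two GPUs in one
# process, so each GPU stores only part of the parameters.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0 ...
        self.part1 = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        ).to("cuda:0")
        # ... second half (including the classifier) lives on GPU 1
        self.part2 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1000),
        ).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # activations hop between GPUs

model = TwoGPUModel()
out = model(torch.randn(64, 3, 224, 224))   # batch of 64, as in the deck
print(out.shape)   # torch.Size([64, 1000]); output lives on cuda:1
```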
Data parallelism vs model parallelism
Data parallelism
• Easier implementation
• Stronger fault tolerance
• Higher cluster utilization
Model parallelism
• Better scalability of large models
• Less memory on each GPU
Why not both? Data parallelism for the CNN layers and model parallelism in the FC layers (sketched below).
Source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
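A single-process, two-GPU sketch of the hybrid idea from Krizhevsky's paper: the convolutional stack is replicated (data parallel, each GPU sees half the batch) while the fully connected layer is split by output neurons (model parallel). The tiny network and sizes are placeholders, and this simplifies the paper's multi-worker recipe.

```python
# Hybrid parallelism sketch: data-parallel conv stack, model-parallel FC.
import copy
import torch
import torch.nn as nn

conv0 = nn.Sequential(                       # small stand-in conv stack
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
).to("cuda:0")
conv1 = copy.deepcopy(conv0).to("cuda:1")    # identical replica on GPU 1

fc0 = nn.Linear(16, 500).to("cuda:0")        # first 500 output neurons
fc1 = nn.Linear(16, 500).to("cuda:1")        # last 500 output neurons

x = torch.randn(64, 3, 32, 32)
x0, x1 = x[:32].to("cuda:0"), x[32:].to("cuda:1")   # split the batch

f0, f1 = conv0(x0), conv1(x1)                # data-parallel conv pass
full0 = torch.cat([f0, f1.to("cuda:0")])     # each GPU gathers the full
full1 = torch.cat([f0.to("cuda:1"), f1])     # batch of features
logits = torch.cat([fc0(full0), fc1(full1).to("cuda:0")], dim=1)
print(logits.shape)   # torch.Size([64, 1000])
```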
Managed distributed training: Batch AI
• Dependencies and containers
• Provision clusters of VMs
• Schedule jobs
• Distribute data
• Gather results
• Handle failures
• Scale resources
Training with Batch AI
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry
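A sketch of step 1, assuming the (now legacy) azure-storage-file SDK that was current alongside Batch AI; the account name, key, share and paths are placeholders.

```python
# Push the training script to an Azure file share so Batch AI jobs can
# mount it. Uses the legacy azure-storage-file SDK.
from azure.storage.file import FileService

file_service = FileService(account_name="mystorageacct",  # placeholder
                           account_key="<storage-key>")   # placeholder

file_service.create_share("scripts", fail_on_exist=False)
file_service.create_directory("scripts", "resnet50")
file_service.create_file_from_path(
    share_name="scripts",
    directory_name="resnet50",
    file_name="train.py",
    local_file_path="./train.py",   # e.g. the Horovod training script
)
```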
1) Create a Batch AI pool
2) Each job pulls in the appropriate container and script, and loads data from the chosen storage
3) Once a job is completed, all the results are written to the fileshare
[Diagram: jobs running on the Batch AI pool]
Setup
Clusters of 8 nodes using K80, P40, P100 and V100 GPUs (4 GPUs per node + InfiniBand)
Two MPI configurations: OpenMPI+NCCL and Intel MPI
Experiments
345 experiments across many different models, including ResNet50, MobileNet V2, etc.
Using synthetic data
Batch size fixed at 64 across all models and GPUs
Using the benchmarking scripts that TensorFlow and Horovod use (a throughput sketch follows)
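A sketch of what such a synthetic-data throughput benchmark looks like in PyTorch: train on random tensors at batch size 64 and report images/sec. The warmup and iteration counts are arbitrary; the actual experiments used the TensorFlow/Horovod benchmark scripts.

```python
# Synthetic-data benchmark: time fixed-size training steps on random
# tensors (batch size 64, as in the deck) and report images/sec.
import time
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda")
model = models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

batch = torch.randn(64, 3, 224, 224, device=device)    # synthetic images
labels = torch.randint(0, 1000, (64,), device=device)  # synthetic labels

for _ in range(10):                       # warmup iterations
    optimizer.zero_grad()
    criterion(model(batch), labels).backward()
    optimizer.step()

torch.cuda.synchronize()
start, iters = time.time(), 50
for _ in range(iters):
    optimizer.zero_grad()
    criterion(model(batch), labels).backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"{64 * iters / (time.time() - start):.1f} images/sec")
```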
Distributed training with synthetic data
• Cluster configuration with synthetic data
[Diagram: Batch AI pool with a mounted fileshare]
Single GPU
[Chart: images per second for each CNN architecture and GPU type]
32 GPUs
[Chart: images per second by GPU type and architecture; Intel MPI (InfiniBand) vs OpenMPI+NCCL (no InfiniBand)]
32 GPUs
[Chart: scaling efficiency by GPU type and architecture]
MobileNet
[Chart: throughput including MobileNet; over 25k images per second on 32 GPUs]
MobileNet
[Chart: scaling efficiency for MobileNet]
Experiments
Using ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]
Using real and synthetic data; real data on local, NFS and Blob storage
Batch size fixed at 64 across all configurations
Using V100 GPUs
Distributed training with NFS
• Cluster configuration with an NFS share
[Diagram: data is copied to an NFS share, which the Batch AI pool mounts alongside the fileshare]
Distributed training with blob storage
• Cluster configuration with a mounted blob
[Diagram: data is copied to blob storage, which is mounted on the Batch AI pool alongside the fileshare]
Distributed training with local storage
• Cluster configuration with the data copied to the nodes
[Diagram: data is copied from the fileshare directly onto each node of the Batch AI pool]
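Whichever backend is used, the training code just reads from a mount point; only the path (and the throughput behind it) differs between local disk, an NFS mount, and fuse-mounted blob storage. A sketch with a placeholder path:

```python
# The data pipeline is identical across storage backends; only DATA_DIR
# changes (local SSD, NFS mount, or blob mount point).
import torch
from torchvision import datasets, transforms

DATA_DIR = "/mnt/data/imagenet"   # placeholder mount path

dataset = datasets.ImageFolder(
    DATA_DIR,
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=64, shuffle=True,
    num_workers=8, pin_memory=True,   # loader workers hide some storage latency
)
```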
Local vs Synthetic
[Chart: local data (left) vs synthetic data (right), per framework and GPU count]
Blob vs NFS
[Chart: blob storage (left) vs NFS (right), per framework and GPU count]
PyTorch
Keras
TensorFlow
Observations & Conclusions
• Don't use blob storage
• Use local storage wherever possible; if not, use NFS
• For distributing across nodes use Intel MPI; within nodes, OpenMPI+NCCL is probably preferable
• Scaling efficiency gets worse with faster GPUs at a batch size of 64 (see the sketch after this list)
• Don't use distributed training for small models
• Distributed training can be quite inefficient and should only be used under the right circumstances:
  • The model is too big to fit a sensible batch size on a single GPU
  • The problem can't be addressed by distributing the model in a simple parallel way
• Be aware of framework-specific limitations
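For reference, scaling efficiency as discussed in these results is the throughput on N GPUs divided by N times the single-GPU throughput. A small worked example with illustrative numbers (not measurements from the deck):

```python
# Scaling efficiency: the fraction of ideal linear speedup achieved.
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """throughput_n: images/sec on n_gpus; throughput_1: images/sec on 1 GPU."""
    return throughput_n / (n_gpus * throughput_1)

# e.g. one GPU at 300 images/sec and 32 GPUs at 7200 images/sec:
print(f"{scaling_efficiency(7200, 300, 32):.0%}")   # 75%
```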
Thanks! @msalvaris, @ikdeepl and @miguelgfierro
https://github.com/msalvaris/BatchAIHorovodBenchmark
https://github.com/msalvaris/gpu_monitor
https://github.com/Microsoft/DistributedDeepLearning
Editor's Notes
  1. Copy of the entire model on each worker, processing different subsets of the training data set on each.
  2. Copy of the entire model on each worker, processing different subsets of the training data set on each.
  3. Provisioning clusters of VMs, installing software and containers, queuing work, prioritizing and scheduling jobs, handling failures, distributing data, sharing results, scaling resources to manage costs, and integrating with tools and workflows.
  4. Example flow of working with Batch AI; describe the diagram.
  5. Flow of execution
  6. 4×250 GB disks on a single Standard_DS4_v2. Copy the data onto the node using AzCopy. Should provide greater throughput; we were able to achieve around 350 MB/s, with better IOPS than Blob storage. Cons: expensive compared to the other options.
  7. The y-axis is images per second, and on the x-axis we have the different CNN architectures and GPU types. Later generations of GPU are faster, with the V100s being the fastest. Larger networks are slower to train than smaller ones. These numbers are more or less the same everywhere.
  8. Now at 32 GPUs. Y-axis: images per second; x-axis: GPU type and network architecture. The purple bar is using Intel MPI (InfiniBand); the light blue is OpenMPI and NCCL (no InfiniBand). As we can see, the V100 is faster, but it isn't quite as dominant as with the single GPU.
  9. Here we are reporting something a little different: scaling efficiency. As we can see, the V100's scaling efficiency is quite poor. We interpret this as follows: the amount of information that has to be passed around is the same for each CNN configuration, but the pace at which the GPUs process each batch isn't. So what we see here is that we don't only need faster GPUs but far faster networks.
  10. This is the same as an earlier graph, except now we are adding MobileNet, which is a small CNN designed to be quick. As we can see, it is very quick to train: we achieve over 25k images a second on 32 GPUs.
  11. The problem is that the scaling efficiency is miserable, so for smaller networks it really isn't worth doing distributed training.
  12. 4×250 GB disks on a single Standard_DS4_v2. Copy the data onto the node using AzCopy. Should provide greater throughput; we were able to achieve around 350 MB/s, with better IOPS than Blob storage. Cons: expensive compared to the other options.
  13. Cheaper to use, and still good performance (200 MB/s). Copy the data to blob with AzCopy; it has to be copied as separate files.
  14. Cheap and less complicated, since there is no attached storage. The longest to set up: we need to copy the files to every node, and if a node goes down or we need to recreate the cluster, we have to copy the data again.
  15. Here we compare local and synthetic data: local on the left, synthetic on the right. Blue is Keras, red is PyTorch, yellow is TensorFlow. We can see synthetic is quicker overall, as we might expect. In terms of speed, TensorFlow is fastest, second is Keras, and then PyTorch. This is because PyTorch uses NCCL and therefore cannot use Intel MPI, and therefore no InfiniBand. It may be a little hard to see here, but on a single node (up to 4 GPUs) PyTorch is the quickest. We also notice a drop in performance from synthetic to local.
  16. Blob on the left, NFS on the right. Blob is really slow; even on a single node, blob is terrible.