This document discusses training deep learning models using multiple GPUs in the cloud. It covers the challenges of distributed training including data bottlenecks, communication bottlenecks, and scaling batch sizes and learning rates. It provides benchmarks for frameworks like MXNet and TensorFlow on AWS and discusses the impact of infrastructure like GPU type and interconnect bandwidth on training performance and efficiency. It also analyzes the costs of using different cloud platforms for deep learning training.
Deep Learning Model Training on Multiple GPU Clouds
2. Deep Learning & Multi-GPUs
Training Deep Learning Models on Multiple GPUs in the Cloud
BEE PART OF THE CHANGE
Avenida de Burgos, 16 D, 28036 Madrid
hablemos@beeva.com
www.beeva.com
10. Training times vs. accuracy
Accelerating training is essential!
Source: https://github.com/sailing-pmls/pmls-caffe/
Source: Canziani et al 2017
11. ● Stochastic Gradient Descent (SGD)
● Mini-batch SGD
Source: Andrew Ng.
Source: http://www.eeng.dcu.ie/~mcguinne/
Error (loss) function
Stochastic gradient descent
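To make the two bullets of this slide concrete, here is a minimal sketch of mini-batch SGD on a toy linear model. It is an illustration, not code from the talk: the function name `minibatch_sgd`, the toy data, and every hyperparameter value are assumptions. Setting `batch_size=1` gives plain SGD, and setting it to the dataset size gives full-batch gradient descent.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error (loss) function."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            # Step against the gradient averaged over the mini-batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Toy, noise-free data generated from y = 3x + 1 (hypothetical example).
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Larger batches give a less noisy gradient estimate per step but fewer updates per epoch, which is the trade-off that matters once the batch is spread over several GPUs.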
12. Distributed training
Data parallel vs. model parallel
● Faster or larger models?
Asynchronous vs. synchronous
● Fast or precise?
Source: https://github.com/tensorflow/models/tree/master/research/inception
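As a rough illustration of synchronous data parallelism, the sketch below simulates several workers in a single process. Each worker computes a gradient on its own shard of the batch, the gradients are averaged, and one shared update is applied. This is not the implementation of any framework named in the talk: the helper names `mse_gradient` and `synchronous_step`, the toy model, and the data are all assumptions. In an asynchronous scheme, by contrast, each worker would apply its gradient as soon as it is ready, possibly against parameters that other workers have already changed.

```python
def mse_gradient(w, b, xs, ys):
    """Mean gradient of (w*x + b - y)^2 over one shard of examples."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def synchronous_step(w, b, shards, lr):
    """One synchronous data-parallel update over equally sized shards."""
    # Each simulated worker sees the same parameters and its own shard.
    grads = [mse_gradient(w, b, xs, ys) for xs, ys in shards]
    # Averaging stands in for the gradient exchange between workers.
    grad_w = sum(g[0] for g in grads) / len(grads)
    grad_b = sum(g[1] for g in grads) / len(grads)
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical toy batch from y = 2x - 1, split across 4 simulated workers.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2.0 * x - 1.0 for x in xs]
shards = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, len(xs), 2)]

w_multi, b_multi = synchronous_step(0.0, 0.0, shards, lr=0.05)
w_single, b_single = synchronous_step(0.0, 0.0, [(xs, ys)], lr=0.05)
```

With equally sized shards, the averaged update matches the single-worker update on the full batch up to floating-point rounding, which is why synchronous training is the "precise" option. Its cost is that every step waits for the slowest worker.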
15. (Multi-node) third-party benchmarks
Small print:
● High-speed connections!
● Synthetic data vs. real data
● Bottlenecks in hard disk
And more...
● Accuracy penalization
● Number of parameter servers
Source: tensorflow.org
Source: https://chainer.org
32. Tesla K80 price on premises
Source: amazon.com November 2017
33. Tesla K80 prices on cloud
1$/h with per-second billing
Only 0.3$/h on the AWS spot market
Buy 1 GPU or rent one for 4000 to 12000 hours!
Training ResNet-50 on ImageNet-1K (100 epochs): 180$ to 730$
Fine-tuning (8 epochs): < 2$
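The buy-or-rent comparison on this slide is simple arithmetic, sketched below. The two hourly rates come from the slide. The card price is a hypothetical placeholder, because the on-premises price itself is not reproduced in this transcript, and the helper names are assumptions as well.

```python
def break_even_hours(purchase_price_usd, hourly_rate_usd):
    """Hours of cloud rental that cost as much as buying one GPU outright."""
    return purchase_price_usd / hourly_rate_usd

def rental_cost(gpu_hours, hourly_rate_usd):
    """Cloud cost of a job that needs the given number of GPU-hours."""
    return gpu_hours * hourly_rate_usd

ON_DEMAND_USD_PER_HOUR = 1.0   # rate quoted on the slide
SPOT_USD_PER_HOUR = 0.3        # AWS spot rate quoted on the slide
CARD_PRICE_USD = 4000.0        # ASSUMPTION: placeholder purchase price

on_demand_hours = break_even_hours(CARD_PRICE_USD, ON_DEMAND_USD_PER_HOUR)
spot_hours = break_even_hours(CARD_PRICE_USD, SPOT_USD_PER_HOUR)
print(f"Break-even: {on_demand_hours:.0f} h on demand, {spot_hours:.0f} h on spot")
```

The comparison ignores power, hosting, and hardware depreciation on the purchase side, and spot interruptions on the rental side, so it gives only a rough break-even point.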
34. 2014 to 2017: from Kepler... to Volta!
Source: aws.amazon.com New! October 2017
And Tesla Pascal P100 beta on Google Cloud Platform (New! September 2017).