Distributed Deep Learning
Alireza Shafaei
LCI – MLRG, April 2018
Strategies
• Parallelism: data parallelism vs. model parallelism (see the data-parallel sketch below).
• Optimization: synchronous vs. asynchronous.
(Figure source: [5])
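To make the data-parallelism option concrete, here is a minimal single-process sketch (not taken from [5]; the toy linear-regression problem, `num_workers`, and the learning rate are all illustrative): each simulated worker computes a gradient on its shard of the mini-batch, and the shard gradients are averaged before one shared update. Model parallelism would instead split the parameters themselves across workers.

```python
# Minimal single-process sketch of synchronous data parallelism on a toy
# linear-regression problem: each "worker" (simulated by the inner loop) holds
# one shard of the global mini-batch, computes a local gradient, and the
# local gradients are averaged (the all-reduce) before one shared update.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(512, 2))
y = X @ true_w + 0.1 * rng.normal(size=512)

w = np.zeros(2)      # replicated parameters, identical on every worker
num_workers = 4      # illustrative worker count
lr = 0.1

for step in range(100):
    batch = rng.choice(len(X), size=128, replace=False)   # global mini-batch
    shards = np.array_split(batch, num_workers)           # one shard per worker

    grads = []
    for shard in shards:                                   # conceptually parallel
        Xs, ys = X[shard], y[shard]
        grads.append(Xs.T @ (Xs @ w - ys) / len(shard))    # local gradient

    w -= lr * np.mean(grads, axis=0)                       # "all-reduce": average

print("estimated w:", w)   # should be close to true_w
```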
Challenges
• Infrastructural barriers
  • Communication overhead.
  • Long-tail machine/network latency.
• Optimization barriers
  • Synchronous: batch-size trade-off.
  • Asynchronous: gradient interference.
• Programming barriers
  • Error-prone pipelines.
• Beyond 128 GPUs there is little point in adding more [1].
(Figure source: [1])
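As a rough illustration of the communication-overhead bullet, the back-of-envelope below estimates per-step traffic for one worker, assuming a ring all-reduce (where each worker transmits about 2(N-1)/N times the gradient size) and taking the 97 MB ResNet-50 gradient size quoted later in this deck; the 10 Gbit/s link speed is an assumption for illustration only.

```python
# Back-of-envelope for the communication-overhead bullet: how much each worker
# sends per step to all-reduce ResNet-50-sized gradients (~97 MB, the figure
# quoted on the Deep Gradient Compression slides), assuming a ring all-reduce
# in which every worker transmits about 2*(N-1)/N times the gradient size.
grad_mb = 97.0
link_mb_per_s = 1250.0   # assumed 10 Gbit/s link, purely for illustration

for n_workers in (8, 64, 256):
    sent_mb = 2 * (n_workers - 1) / n_workers * grad_mb
    comm_ms = sent_mb / link_mb_per_s * 1000
    print(f"{n_workers:4d} workers: {sent_mb:6.1f} MB sent/step, "
          f"~{comm_ms:.0f} ms of pure communication per step")
```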
Outline
• Downpour SGD, Sandblaster LBFGS (2012).
• ImageNet in one hour (2017).
• ImageNet in fifteen minutes (2017).
• Codistillation (2018).
• Deep gradient compression (2018).
Downpour SGD, Sandblaster LBFGS [5]
• Runs on CPUs (2012!).
• Downpour SGD: asynchronous SGD, essentially HOGWILD! across multiple machines (see the sketch below).
• Sandblaster L-BFGS: a simple distributed version of L-BFGS.
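The sketch below illustrates the asynchronous idea in a single process (it is not Downpour SGD itself, which shards both the data and a parameter-server tier across machines): several threads compute gradients against a shared, lock-free parameter vector, so each update may be based on slightly stale parameters, in the spirit of HOGWILD!. The toy regression problem and all constants are illustrative.

```python
# Single-process sketch of HOGWILD!-style asynchronous SGD: worker threads read
# the shared parameter vector, compute a gradient on their own data shard, and
# write updates back without locks, so some updates use slightly stale
# parameters. Downpour SGD applies the same idea across machines.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(4000, 3))
y = X @ true_w + 0.05 * rng.normal(size=4000)

w = np.zeros(3)   # shared parameters, updated without synchronization
lr = 0.05

def worker(shard, seed):
    local_rng = np.random.default_rng(seed)            # per-thread RNG
    Xs, ys = X[shard], y[shard]
    for _ in range(300):
        i = local_rng.integers(len(Xs), size=32)       # local mini-batch
        grad = Xs[i].T @ (Xs[i] @ w - ys[i]) / len(i)  # gradient at (stale) w
        w[:] -= lr * grad                              # unsynchronized write

shards = np.array_split(np.arange(len(X)), 4)
threads = [threading.Thread(target=worker, args=(s, k)) for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("estimated w:", w)   # close to true_w despite the races
```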
ImageNet in one hour [3]
• Use synchronous SGD with a linear scaling rule to increase the mini-batch size: when the mini-batch grows by a factor of k, multiply the learning rate by k, with a gradual warmup at the start of training (see the sketch below).
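A small sketch of the linear scaling rule with gradual warmup; the constants mirror the commonly quoted ImageNet recipe (base LR 0.1 at batch size 256, 5 warmup epochs) but are illustrative rather than copied from [3], which performs the warmup per iteration rather than per epoch.

```python
# Sketch of the linear scaling rule with gradual warmup: when the mini-batch is
# k times the reference size, target a learning rate k times the base rate, and
# ramp up to it over the first few epochs.
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    scaled_lr = base_lr * batch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:                            # gradual warmup
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr

for epoch in (0, 1, 4, 5, 30):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch, batch_size=8192):.2f}")
```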
ImageNet in 15 minutes [4]
Co-distillation / Online Distillation [1]
• Based on knowledge distillation (KD) [2]:
  1. Train a high-capacity teacher model.
  2. Train a low-capacity student model with an added distillation loss.
• KD gives better generalization and enables model compression.
• When a single model cannot be scaled up further, train an ensemble in parallel.
• Merge the ensemble into a single model with KD.
• Instead of training in two phases, do both at the same time (see the sketch below).
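A minimal sketch of a single two-way codistillation step, under the assumption that the distillation term is a KL divergence toward the peer's detached predictions weighted by a coefficient `alpha`; the linear "models", the batch, and `alpha` are stand-ins, not the setup of [1].

```python
# One two-way codistillation step: each model minimizes cross-entropy plus a
# distillation term that pulls its predictions toward the other model's
# (detached) predictions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(20, 10)   # two peer models of the same architecture
model_b = torch.nn.Linear(20, 10)
opt = torch.optim.SGD(list(model_a.parameters()) + list(model_b.parameters()), lr=0.1)

x = torch.randn(64, 20)             # toy batch
y = torch.randint(0, 10, (64,))
alpha = 0.5                         # weight of the distillation term (assumed)

def codistill_loss(own_logits, peer_logits):
    ce = F.cross_entropy(own_logits, y)
    # KL divergence toward the peer's predictions; the peer is a fixed target here.
    kl = F.kl_div(F.log_softmax(own_logits, dim=1),
                  F.softmax(peer_logits.detach(), dim=1),
                  reduction="batchmean")
    return ce + alpha * kl

logits_a, logits_b = model_a(x), model_b(x)
loss = codistill_loss(logits_a, logits_b) + codistill_loss(logits_b, logits_a)
opt.zero_grad()
loss.backward()
opt.step()
print("combined loss:", float(loss))
```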
Co-distillation / Online Distillation [1]
• Unlike distributed SGD, using stale θ_i's does not hurt performance (see the sketch below).
  • Small changes in the weights do not significantly change the predictions.
• The synchronization can happen at the model level or at the prediction level.
• Each model can be trained on a partition of the dataset.
• Each model can itself be trained in a distributed way.
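To illustrate the staleness point, the sketch below simulates two workers in one process: each distills against a copy of its peer's parameters that is refreshed only every `exchange_every` steps, mimicking occasional checkpoint exchange, and each draws its own batches as if training on its own partition. All models and constants are illustrative, not from [1].

```python
# Two simulated codistillation workers in one process: each worker distills
# against a copy of its peer that is refreshed only every `exchange_every`
# steps (mimicking occasional checkpoint exchange), and each draws its own
# batches, as if training on its own data partition.
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
workers = [torch.nn.Linear(20, 10) for _ in range(2)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in workers]
stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]
exchange_every = 50                 # steps between simulated checkpoint exchanges

for step in range(200):
    for i, (model, opt) in enumerate(zip(workers, opts)):
        x = torch.randn(32, 20)     # each worker sees its own batch
        y = torch.randint(0, 10, (32,))
        with torch.no_grad():       # the teacher is the *stale* peer copy
            soft_targets = F.softmax(stale_peers[i](x), dim=1)
        logits = model(x)
        loss = F.cross_entropy(logits, y) + 0.5 * F.kl_div(
            F.log_softmax(logits, dim=1), soft_targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    if (step + 1) % exchange_every == 0:
        # "Exchange checkpoints": refresh each worker's stale copy of its peer.
        stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]
```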
Deep Gradient Compression [7]
• Gradient compression ratio of 300x to 600x without losing accuracy.
• The ResNet-50 gradient exchange shrinks from 97 MB to 0.35 MB.
• Gradient sparsification: broadcast only the largest gradient entries; accumulate the small ones locally until they grow large enough to send (see the sketch below).
• Local gradient clipping: clip gradients on each worker before they enter the local accumulation.
• Momentum correction: apply the momentum update locally, before the accumulation, rather than ignoring it.
• Momentum factor masking.
• Warm-up training.
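A minimal sketch of the core step as described above (top-k gradient sparsification with local accumulation plus momentum correction); the dimensionality, sparsity, and random "gradients" are illustrative, and the clipping, factor-masking, and warm-up pieces are omitted.

```python
# Core Deep Gradient Compression step on one worker: apply momentum locally
# (momentum correction), add the result to a residual buffer, then send only
# the largest entries of the buffer and keep accumulating the rest.
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
momentum, sparsity = 0.9, 0.001       # keep roughly 0.1% of the entries per step

velocity = np.zeros(dim)              # local momentum buffer
residual = np.zeros(dim)              # small updates accumulate here until sent

for step in range(5):
    grad = rng.normal(size=dim)       # stand-in for a real back-prop gradient
    velocity = momentum * velocity + grad     # momentum correction
    residual += velocity                      # accumulate locally

    k = max(1, int(sparsity * dim))
    topk = np.argpartition(np.abs(residual), -k)[-k:]
    sparse_update = np.zeros(dim)
    sparse_update[topk] = residual[topk]      # only these entries are communicated
    residual[topk] = 0.0                      # everything else stays on the worker

    print(f"step {step}: sending {np.count_nonzero(sparse_update)} of {dim} values")
# A receiver would sum such sparse updates from all workers and apply them to w.
```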
Conclusion
• After a certain point we cannot speed up training further.
  • The stragglers (covered before) come first (infrastructural barrier).
  • Too large a batch size does not seem to work (optimization barrier).
    • It is not obvious whether any of the “trick” papers generalize beyond ImageNet.
  • Communication overhead comes next (infrastructural barrier). Possible remedies:
    • Asynchronous training.
    • Model parallelism (distributing the model weights).
    • Gradient compression.
References
1) R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large Scale Distributed Neural Network Training through Online Distillation,” in ICLR, 2018.
2) G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in NIPS Workshops, 2015.
3) P. Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” arXiv, 2017.
4) T. Akiba, S. Suzuki, and K. Fukuda, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” arXiv, 2017.
5) J. Dean et al., “Large Scale Distributed Deep Networks,” in NIPS, 2012.
6) M. Jaderberg et al., “Decoupled Neural Interfaces using Synthetic Gradients,” in ICML, 2017.
7) Y. Lin et al., “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” in ICLR, 2018.
