Distributed Deep Learning
Alireza Shafaei
LCI – MLRG, April 2018
Strategies
• Parallelism: data parallelism vs. model parallelism (see the data-parallel sketch below).
• Optimization: synchronous vs. asynchronous.
(Figure source: [5])
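To make the data-parallelism option concrete, here is a minimal single-process sketch (not taken from [5]; the toy linear-regression problem, `num_workers`, and the learning rate are all illustrative): each simulated worker computes a gradient on its shard of the mini-batch, and the shard gradients are averaged before one shared update. Model parallelism would instead split the parameters themselves across workers.

```python
# Minimal single-process sketch of synchronous data parallelism on a toy
# linear-regression problem: each "worker" (simulated by the inner loop) holds
# one shard of the global mini-batch, computes a local gradient, and the
# local gradients are averaged (the all-reduce) before one shared update.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(512, 2))
y = X @ true_w + 0.1 * rng.normal(size=512)

w = np.zeros(2)      # replicated parameters, identical on every worker
num_workers = 4      # illustrative worker count
lr = 0.1

for step in range(100):
    batch = rng.choice(len(X), size=128, replace=False)   # global mini-batch
    shards = np.array_split(batch, num_workers)           # one shard per worker

    grads = []
    for shard in shards:                                   # conceptually parallel
        Xs, ys = X[shard], y[shard]
        grads.append(Xs.T @ (Xs @ w - ys) / len(shard))    # local gradient

    w -= lr * np.mean(grads, axis=0)                       # "all-reduce": average

print("estimated w:", w)   # should be close to true_w
```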
Challenges
• Infrastructural barriers
  • Communication overhead.
  • Long-tail machine/network latency.
• Optimization barriers
  • Synchronous: batch-size trade-off.
  • Asynchronous: gradient interference.
• Programming barriers
  • Error-prone pipelines.
• Beyond 128 GPUs there is little point in adding more [1].
(Figure source: [1])
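As a rough illustration of the communication-overhead bullet, the back-of-envelope below estimates per-step traffic for one worker, assuming a ring all-reduce (where each worker transmits about 2(N-1)/N times the gradient size) and taking the 97 MB ResNet-50 gradient size quoted later in this deck; the 10 Gbit/s link speed is an assumption for illustration only.

```python
# Back-of-envelope for the communication-overhead bullet: how much each worker
# sends per step to all-reduce ResNet-50-sized gradients (~97 MB, the figure
# quoted on the Deep Gradient Compression slides), assuming a ring all-reduce
# in which every worker transmits about 2*(N-1)/N times the gradient size.
grad_mb = 97.0
link_mb_per_s = 1250.0   # assumed 10 Gbit/s link, purely for illustration

for n_workers in (8, 64, 256):
    sent_mb = 2 * (n_workers - 1) / n_workers * grad_mb
    comm_ms = sent_mb / link_mb_per_s * 1000
    print(f"{n_workers:4d} workers: {sent_mb:6.1f} MB sent/step, "
          f"~{comm_ms:.0f} ms of pure communication per step")
```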
Outline
• Downpour SGD, Sandblaster LBFGS (2012).
• ImageNet in one hour (2017).
• ImageNet in fifteen minutes (2017).
• Codistillation (2018).
• Deep gradient compression (2018).
Downpour SGD, Sandblaster LBFGS [5]
• Runs on CPUs (2012!).
• Downpour SGD: asynchronous SGD, essentially HOGWILD! across multiple machines (see the sketch below).
• Sandblaster L-BFGS: a simple distributed version of L-BFGS.
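The sketch below illustrates the asynchronous idea in a single process (it is not Downpour SGD itself, which shards both the data and a parameter-server tier across machines): several threads compute gradients against a shared, lock-free parameter vector, so each update may be based on slightly stale parameters, in the spirit of HOGWILD!. The toy regression problem and all constants are illustrative.

```python
# Single-process sketch of HOGWILD!-style asynchronous SGD: worker threads read
# the shared parameter vector, compute a gradient on their own data shard, and
# write updates back without locks, so some updates use slightly stale
# parameters. Downpour SGD applies the same idea across machines.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(4000, 3))
y = X @ true_w + 0.05 * rng.normal(size=4000)

w = np.zeros(3)   # shared parameters, updated without synchronization
lr = 0.05

def worker(shard, seed):
    local_rng = np.random.default_rng(seed)            # per-thread RNG
    Xs, ys = X[shard], y[shard]
    for _ in range(300):
        i = local_rng.integers(len(Xs), size=32)       # local mini-batch
        grad = Xs[i].T @ (Xs[i] @ w - ys[i]) / len(i)  # gradient at (stale) w
        w[:] -= lr * grad                              # unsynchronized write

shards = np.array_split(np.arange(len(X)), 4)
threads = [threading.Thread(target=worker, args=(s, k)) for k, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("estimated w:", w)   # close to true_w despite the races
```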
ImageNet in one hour [3]
• Use synchronous SGD with a linear scaling rule to increase the mini-batch size: when the mini-batch grows by a factor of k, multiply the learning rate by k, with a gradual warmup at the start of training (see the sketch below).
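A small sketch of the linear scaling rule with gradual warmup; the constants mirror the commonly quoted ImageNet recipe (base LR 0.1 at batch size 256, 5 warmup epochs) but are illustrative rather than copied from [3], which performs the warmup per iteration rather than per epoch.

```python
# Sketch of the linear scaling rule with gradual warmup: when the mini-batch is
# k times the reference size, target a learning rate k times the base rate, and
# ramp up to it over the first few epochs.
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    scaled_lr = base_lr * batch_size / base_batch        # linear scaling rule
    if epoch < warmup_epochs:                            # gradual warmup
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr

for epoch in (0, 1, 4, 5, 30):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch, batch_size=8192):.2f}")
```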
ImageNet in 15 minutes [4]
Co-distillation / Online Distillation [1]
• Based on knowledge distillation (KD) [2]:
  1. Train a high-capacity teacher model.
  2. Train a low-capacity student model with an added distillation loss.
• KD gives better generalization and enables model compression.
• When a single model cannot be scaled up further, train an ensemble in parallel.
• Merge the ensemble into a single model with KD.
• Instead of training in two phases, do both at the same time (see the sketch below).
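A minimal sketch of a single two-way codistillation step, under the assumption that the distillation term is a KL divergence toward the peer's detached predictions weighted by a coefficient `alpha`; the linear "models", the batch, and `alpha` are stand-ins, not the setup of [1].

```python
# One two-way codistillation step: each model minimizes cross-entropy plus a
# distillation term that pulls its predictions toward the other model's
# (detached) predictions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model_a = torch.nn.Linear(20, 10)   # two peer models of the same architecture
model_b = torch.nn.Linear(20, 10)
opt = torch.optim.SGD(list(model_a.parameters()) + list(model_b.parameters()), lr=0.1)

x = torch.randn(64, 20)             # toy batch
y = torch.randint(0, 10, (64,))
alpha = 0.5                         # weight of the distillation term (assumed)

def codistill_loss(own_logits, peer_logits):
    ce = F.cross_entropy(own_logits, y)
    # KL divergence toward the peer's predictions; the peer is a fixed target here.
    kl = F.kl_div(F.log_softmax(own_logits, dim=1),
                  F.softmax(peer_logits.detach(), dim=1),
                  reduction="batchmean")
    return ce + alpha * kl

logits_a, logits_b = model_a(x), model_b(x)
loss = codistill_loss(logits_a, logits_b) + codistill_loss(logits_b, logits_a)
opt.zero_grad()
loss.backward()
opt.step()
print("combined loss:", float(loss))
```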
Co-distillation / Online Distillation [1]
• Unlike distributed SGD, using stale θ_i's does not hurt performance (see the sketch below).
  • Small changes in the weights do not significantly change the predictions.
• The synchronization can happen at the model level or at the prediction level.
• Each model can be trained on a partition of the dataset.
• Each model can itself be trained in a distributed way.
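To illustrate the staleness point, the sketch below simulates two workers in one process: each distills against a copy of its peer's parameters that is refreshed only every `exchange_every` steps, mimicking occasional checkpoint exchange, and each draws its own batches as if training on its own partition. All models and constants are illustrative, not from [1].

```python
# Two simulated codistillation workers in one process: each worker distills
# against a copy of its peer that is refreshed only every `exchange_every`
# steps (mimicking occasional checkpoint exchange), and each draws its own
# batches, as if training on its own data partition.
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
workers = [torch.nn.Linear(20, 10) for _ in range(2)]
opts = [torch.optim.SGD(m.parameters(), lr=0.1) for m in workers]
stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]
exchange_every = 50                 # steps between simulated checkpoint exchanges

for step in range(200):
    for i, (model, opt) in enumerate(zip(workers, opts)):
        x = torch.randn(32, 20)     # each worker sees its own batch
        y = torch.randint(0, 10, (32,))
        with torch.no_grad():       # the teacher is the *stale* peer copy
            soft_targets = F.softmax(stale_peers[i](x), dim=1)
        logits = model(x)
        loss = F.cross_entropy(logits, y) + 0.5 * F.kl_div(
            F.log_softmax(logits, dim=1), soft_targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    if (step + 1) % exchange_every == 0:
        # "Exchange checkpoints": refresh each worker's stale copy of its peer.
        stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]
```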
Deep Gradient Compression [7]
• Gradient compression ratio of 300x to 600x without losing accuracy.
• The ResNet-50 gradient exchange shrinks from 97 MB to 0.35 MB.
• Gradient sparsification: broadcast only the largest gradient entries; accumulate the small ones locally until they grow large enough to send (see the sketch below).
• Local gradient clipping: clip gradients on each worker before they enter the local accumulation.
• Momentum correction: apply the momentum update locally, before the accumulation, rather than ignoring it.
• Momentum factor masking.
• Warm-up training.
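A minimal sketch of the core step as described above (top-k gradient sparsification with local accumulation plus momentum correction); the dimensionality, sparsity, and random "gradients" are illustrative, and the clipping, factor-masking, and warm-up pieces are omitted.

```python
# Core Deep Gradient Compression step on one worker: apply momentum locally
# (momentum correction), add the result to a residual buffer, then send only
# the largest entries of the buffer and keep accumulating the rest.
import numpy as np

rng = np.random.default_rng(0)
dim = 10_000
momentum, sparsity = 0.9, 0.001       # keep roughly 0.1% of the entries per step

velocity = np.zeros(dim)              # local momentum buffer
residual = np.zeros(dim)              # small updates accumulate here until sent

for step in range(5):
    grad = rng.normal(size=dim)       # stand-in for a real back-prop gradient
    velocity = momentum * velocity + grad     # momentum correction
    residual += velocity                      # accumulate locally

    k = max(1, int(sparsity * dim))
    topk = np.argpartition(np.abs(residual), -k)[-k:]
    sparse_update = np.zeros(dim)
    sparse_update[topk] = residual[topk]      # only these entries are communicated
    residual[topk] = 0.0                      # everything else stays on the worker

    print(f"step {step}: sending {np.count_nonzero(sparse_update)} of {dim} values")
# A receiver would sum such sparse updates from all workers and apply them to w.
```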
Conclusion
• After a certain point we cannot speed up training further.
  • The stragglers (covered before) come first (infrastructural barrier).
  • Too large a batch size does not seem to work (optimization barrier).
    • It is not obvious whether any of the “trick” papers generalize beyond ImageNet.
  • Communication overhead comes next (infrastructural barrier). Possible remedies:
    • Asynchronous training.
    • Model parallelism (distributing the model weights).
    • Gradient compression.
References
1) R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton, “Large Scale Distributed Neural Network Training through Online Distillation,” in ICLR, 2018.
2) G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” in NIPS Workshops, 2015.
3) P. Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” arXiv, 2017.
4) T. Akiba, S. Suzuki, and K. Fukuda, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” arXiv, 2017.
5) J. Dean et al., “Large Scale Distributed Deep Networks,” in NIPS, 2012.
6) M. Jaderberg et al., “Decoupled Neural Interfaces using Synthetic Gradients,” in ICML, 2017.
7) Y. Lin et al., “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training,” in ICLR, 2018.
