3. Introduction
• Minimize training time by reducing the
bandwidth for gradient exchange in distributed
training
• Preserve model accuracy for faster training
• Focus on reducing data communication on
inexpensive commodity networks or training on
mobile devices
4. Introduction
(continued)
• To preserve accuracy during compression:
Momentum correction, Local gradient
clipping, Momentum factor masking and
Warm-up training
• Applied DGC to CNNs - Cifar10, ImageNet;
RNNs - Penn Treebank (NLP); Speech -
LibriSpeech corpus
• No need to modify neural network model
structure
Gradient compression 300x to
600x without losing accuracy
12. The Challenges
● AlexNet has 240 MB of weights and ResNet has
100 MB of weights
● For ResNet, every node has to exchange 100 MB
of gradients with every other node during each
training iteration, which makes the network the
bottleneck of the infrastructure
15. Related Distributed Training
Research
• Asynchronous SGD
• Gradient Quantization
• Gradient Dropping (Aji & Heafield, 2017)
• Training ImageNet in one hour (FB)
• Training ImageNet in 15 mins (PFN)
16. Related - Gradient Quantization
• Quantizing the gradients to low-precision
values can reduce the communication
bandwidth.
• Seide et al. (2014) proposed 1-bit SGD to
reduce the gradient transfer data size and
achieved a 10× speedup in traditional speech
applications.
17. Related - Gradient Dropping
• Sparsify the gradients by a single
threshold value.
• To keep the convergence speed, Gradient
Dropping requires adding layer
normalization
• Gradient Dropping saves 99% of gradient
exchange while incurring 0.3% loss on a
machine translation task.
18. Related - Training ImageNet in 1 hour
• FB Big Basin server
• Large minibatch SGD – 8k
• Caffe2 trains ResNet 50
• 256 GPU, Tesla P100
19. Related - Training ImageNet in 1 hour
• Used Facebook’s Big Basin GPU servers
• Each server has 8 Tesla P100 GPUs and 3.2TB
of SSDs.
• Servers have a 50 Gbit Ethernet network card
• ResNet-50 has approximately 25 million
parameters. This means the total size of the
parameters is 25 · 10^6 · sizeof(float) = 100 MB
$$ Expensive hardware
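As a quick back-of-the-envelope check of the parameter-size arithmetic above (a sketch only; the ~25 million parameter count is the slide's approximation):

    # Rough size of the gradients exchanged per node per iteration for ResNet-50.
    num_params = 25 * 10**6      # ~25 million parameters (approximation from the slide)
    bytes_per_float = 4          # sizeof(float32)
    size_mb = num_params * bytes_per_float / 10**6
    print(size_mb)               # -> 100.0, i.e. ~100 MB of gradients per exchange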
24. Comparison
• DGC pushes the gradient compression ratio to up to
600× without expensive hardware
• DGC does not require extra layer normalization, and
thus does not need to change the model structure.
• Most importantly, Deep Gradient Compression
results in no loss of accuracy.
27. Deep Gradient Compression
• Gradient Sparsification
• Local Gradient Accumulation
• Momentum Correction
• Local Gradient Clipping
• Momentum Factor Masking
• Warm-up Training
28. 1. GRADIENT
SPARSIFICATION
• Reduce the communication bandwidth by
sending only the important gradients.
• Use the gradient magnitude as a simple
heuristic for importance
• Only gradients larger than a threshold are
transmitted (top 0.1%)
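A minimal NumPy sketch of this top-0.1% selection (the function name and structure are hypothetical, not the paper's code):

    import numpy as np
    def sparsify(grad, ratio=0.001):
        # Keep only the largest 0.1% of gradients by magnitude; zero out the rest.
        k = max(1, int(grad.size * ratio))
        threshold = np.sort(np.abs(grad).ravel())[-k]   # magnitude of the k-th largest entry
        mask = np.abs(grad) >= threshold
        return grad * mask, mask

Only the entries selected by mask would be transmitted; everything else stays on the local node.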
29. 2. Local Gradient
Accumulation
• To avoid losing information, we
accumulate the rest of the gradients
locally.
• Eventually, these gradients become large
enough to be transmitted.
Accuracy Image classification: -1.6%
Accuracy speech recognition: -3.3%
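A minimal NumPy sketch of the local accumulation described on this slide (hypothetical helper names; selection uses the same top-0.1% rule as above):

    import numpy as np
    def accumulate_and_select(grad, residual, ratio=0.001):
        # residual holds the small gradients that were not sent in earlier iterations.
        acc = residual + grad                          # add this iteration's gradient
        k = max(1, int(acc.size * ratio))
        threshold = np.sort(np.abs(acc).ravel())[-k]
        mask = np.abs(acc) >= threshold                # entries now large enough to transmit
        return acc * mask, acc * (~mask)               # (gradients to send, new local residual)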
30. 3. Momentum Correction
● Momentum SGD – uses part of the previous
gradient and the current gradient to reduce
noise
● A new vector is created, called 'Velocity'
● We should do local accumulation of the
velocity rather than of the gradient
Accuracy Image classification: -0.3%
Speech recognition: can’t converge
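A hedged sketch of accumulating velocity instead of raw gradients (names are mine; the paper's Algorithm 1 is the authoritative version):

    import numpy as np
    def momentum_corrected_step(grad, velocity, residual, m=0.9, ratio=0.001):
        # Local momentum ("velocity"), then accumulate the velocity, not the raw gradient.
        velocity = m * velocity + grad
        residual = residual + velocity
        k = max(1, int(residual.size * ratio))
        threshold = np.sort(np.abs(residual).ravel())[-k]
        mask = np.abs(residual) >= threshold           # entries to transmit this iteration
        sent = residual * mask
        return sent, velocity, residual * (~mask), mask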
31. 4. Local Gradient Clipping
• Gradient clipping is widely adopted to avoid the
exploding gradient problem
• This step is conventionally executed after gradient
aggregation from all nodes.
• Perform the gradient clipping locally before adding
the current gradient to the previous accumulation
Accuracy Image classification: N/A
Speech recognition: -2.0%
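A sketch of clipping locally before accumulation; the N^(-1/2) scaling of the per-node threshold follows the paper, but the function itself is a hypothetical illustration:

    import numpy as np
    def local_clip(grad, clip_norm, num_nodes):
        # Scale the global clipping threshold by N^(-1/2) when clipping on each node.
        local_threshold = clip_norm * num_nodes ** -0.5
        norm = np.linalg.norm(grad)
        if norm > local_threshold:
            grad = grad * (local_threshold / norm)
        return grad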
32. 5. Momentum Factor Masking
There is a long-tail accumulation issue (~2k
iterations)
Introduce momentum factor masking, to
alleviate staleness
This mask stops the momentum for delayed
gradients
Preventing the stale momentum from carrying
the weights in the wrong direction.
Accuracy Image classification: -0.1%
Speech recognition: -0.5%
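A minimal sketch of the masking step (hypothetical names): once an entry has been transmitted, its local residual and its momentum are both zeroed so stale momentum cannot keep pushing that weight.

    import numpy as np
    def apply_momentum_factor_mask(velocity, residual, sent_mask):
        # sent_mask marks the entries that were just transmitted (boolean array).
        keep = ~sent_mask
        return velocity * keep, residual * keep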
33. 6. Warm-up Training
Use a less aggressive learning rate to slow down the
changing speed of the neural network at the start of
training
Instead of linearly ramping up the learning rate during the
first several epochs, we exponentially increase the gradient
sparsity from a relatively small value to the final value, in
order to help the training adapt to the gradients of larger
sparsity.
Accuracy Image classification: +0.37%
Speech recognition: +0.4%
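One possible way to implement the exponential sparsity ramp described above (a sketch; the 4-epoch warm-up and the 75% to 99.9% endpoints follow the paper's setup, the function itself is hypothetical):

    def warmup_sparsity(epoch, warmup_epochs=4, initial=0.75, final=0.999):
        # Exponentially increase sparsity, roughly 75% -> 93.7% -> 98.4% -> 99.6% -> 99.9%.
        if epoch >= warmup_epochs:
            return final
        density_ratio = (1 - final) / (1 - initial)            # total shrink of the dense part
        density = (1 - initial) * density_ratio ** (epoch / warmup_epochs)
        return 1 - density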
40. Conclusion
• Scaling up computation alone hits a limit; optimizing communication comes next.
• Deep Gradient Compression compresses the gradient by
300-600× for a wide range of CNNs and RNNs.
• To achieve this compression without slowing down the
convergence, DGC employs momentum correction, local
gradient clipping, momentum factor masking and warm-up
training.
• Deep Gradient Compression reduces the required
communication bandwidth and improves the scalability of
distributed training with inexpensive, commodity networking
infrastructure.
This paper addresses the problem that large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure.
It proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth while preserving accuracy during compression, using inexpensive commodity hardware.
General introduction to this paper, background, authors
Some introduction to distributed deep learning training
Related research work/papers on distributed deep learning training (DDLT)
Today's paper - DGC
Experiments and detailed results (300-600× compression)
Wrap-up and discussion
Adding more nodes provides more computing power, but there is another factor, communication, which can limit distributed training scalability.
99.9% of the gradient exchange is redundant, especially for recurrent neural networks (RNNs), where the computation-to-communication ratio is low. Therefore, the network bandwidth becomes a significant bottleneck for scaling up distributed training.
Lots of related work/research achieves fast training, like 1 hour or even 15 minutes.
e.g., Uber's framework Horovod requires an expensive 40 Gbit/s network, the same as other big companies like Google, Amazon, FB...
Enable distributed training with a less expensive network, e.g., AWS 1 Gbit/s Ethernet, to democratize deep learning training using commodity hardware.
Training on mobile – for privacy and better personalization
Cifar10 is an established computer-vision dataset used for object recognition.
The ImageNet project is a large visual database designed for use in visual object recognition software research.
ResNet-50 compressed from 97 MB to 0.35 MB, DeepSpeech from 488 MB to 0.74 MB.
Penn Treebank dataset, known as PTB dataset, is widely used in machine learning of NLP (Natural Language Processing) research
LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech.
DGC does not require extra layer normalization, and thus does not need to change the model structure.
This is especially true for recurrent neural networks (RNNs), where the computation-to-communication ratio is low; the network bandwidth therefore becomes a significant bottleneck for scaling up distributed training.
2018 ICLR conference paper
Song Han – PhD from Stanford EECS, now an assistant professor at MIT, where he also manages the HAN Lab
His Deep Compression paper won the 2016 ICLR Best Paper Award
Bill Dally – Professor at Stanford, Chief Scientist at NVIDIA for 10 years
Other authors are from Tsinghua University in China
Next is an overview of deep learning training in a distributed environment.
This is like any general distributed system; it is similar for distributed databases, distributed computing, etc.
vertical vs horizontal scaling
scale up or out
Data parallelism
Different chunks of data go to different nodes; this is easier to implement, and the same model runs on each node (CNN or RNN).
Node 1 may have the batch of training images 1-32, node 2 may have the next 32 images, etc.
All the nodes share the same model, but they are fed different chunks of data; they calculate local gradients from their own chunk of data and then exchange gradients with each other.
Can be implemented in two ways (see the sketch after this list):
a. Parameter server (centralized) - it receives the gradients from all nodes, sums them up, calculates the average, updates its local weights, and then broadcasts them to all the training nodes.
b. All-reduce operation (de-centralized)
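A toy sketch of the centralized option (a); the function name is hypothetical, and real systems implement option (b) with optimized collectives such as NCCL all-reduce:

    import numpy as np
    def parameter_server_step(weights, node_gradients, lr):
        # Average the gradients received from all training nodes,
        # update the central weights, then broadcast them back.
        avg_grad = np.mean(node_gradients, axis=0)
        return weights - lr * avg_grad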
Model parallelism
Different chunks of the model go to different nodes; this is harder to implement and fewer people adopt this approach.
For single node training, there is no gradient exchange over network
Every node receives every other node's calculated gradients and then calculates the average; there is still a master training node (like a tree structure).
This is one of the basic implementations; more advanced ones use a butterfly structure.
where χ is the training dataset,
w are the weights of the network,
f(x, w) is the loss computed from samples x ∈ χ,
η is the learning rate,
N is the number of training nodes, and B_{k,t} for 1 ≤ k < N is a sequence of N minibatches sampled from χ at iteration t, each of size b.
After T iterations, we have Equation 2, which shows that local gradient accumulation can be considered as increasing the batch size from Nb to NbT (the second summation over τ), where T is the length of the sparse update interval between two iterations at which the gradient of w^(i) is sent. Learning rate scaling (Goyal et al., 2017) is a commonly used technique to deal with a large minibatch.
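For reference, the two update rules these symbols describe, reconstructed from the definitions above (Equations 1 and 2 in the DGC paper):

    F(w) = \frac{1}{|\chi|} \sum_{x \in \chi} f(x, w), \qquad
    w_{t+1} = w_{t} - \eta \, \frac{1}{N b} \sum_{k=1}^{N} \sum_{x \in B_{k,t}} \nabla f(x, w_{t})    (Eq. 1)

    w_{t+T}^{(i)} = w_{t}^{(i)} - \eta \, T \cdot \frac{1}{N b T} \sum_{k=1}^{N} \Big( \sum_{\tau=0}^{T-1} \sum_{x \in B_{k,t+\tau}} \nabla^{(i)} f(x, w_{t+\tau}) \Big)    (Eq. 2)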
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. It provides all-gather, all-reduce, broadcast...
AlexNet: 2012
ResNet 2015
LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION. G. E. Hinton
In our first set of experiments, our goal was to approximately determine the maximum number of GPU workers that can be productively employed for SGD in our Common Crawl neural language model setup
The Common Crawl dataset is an open repository of web-crawl data and the largest to-date dataset used for neural language modeling. Common Crawl consists of petabytes of data collected since 2011 and generally completes a crawl every month.
Figure 1a plots the validation error as a function of global steps for the different numbers of workers we tried, using the best learning rate for each number of workers. Increasing the number of workers (and thus the effective batch size) reduced the number of steps required to reach the best validation error until 128 workers, at which point there was no additional improvement. Even with idealized perfect infrastructure, 256 workers would at best result in the same end to end training time on this problem. However, because steps can take so much longer with 256 workers, going from 128 to 256 workers is highly counterproductive in practice. Figure 1b plots validation error against wall time for the same varying numbers of synchronous workers. There is a large degradation in step time, and thus learning progress, at 256 workers. Although it might be possible to improve the step time at 256 workers by using a more sophisticated scheme with backup workers (Chen et al., 2016), the operative limit to scalability on this task is the diminishing return from increasing the effective batch size, not the degradation in step times.
Next related work
Researchers have proposed many approaches to overcome the communication bottleneck in distributed training
We will quickly take a look at existing research on distributed deep learning and compare it with today's paper.
BLEU (Bilingual Evaluation Understudy) is a score for comparing a candidate translation of text to one or more reference translations.
FB - large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU, P100
Closer look of FB big basin
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
ImageNet top-1 validation error vs. minibatch size.
large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
“Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” arXiv, 2017
basically it is a supercomputer
Chainer - The development is led by Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia
NVIDIA Collective Communications Library (NCCL2) - multi-node, multi-GPU systems
It provides functions like all-gather, all-reduce, broadcast...
PyTorch has integrated NCCL2 to accelerate deep learning training on multi-GPU systems.
large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
In some cases, it improves accuracy.
next - today’s paper
Gradient basic
- Optimization problem - single node: like finding the direction when climbing downhill
- Multiple nodes - each node has its own images and finds its own direction; how do they merge together? They need to communicate and exchange the gradients via the network
- The exchange can be bulky, e.g., AlexNet has 240 MB of weights and ResNet 100 MB; every iteration, every node has to exchange 100 MB of gradients with every other node, which makes the network the bottleneck of the infrastructure
In synchronized training, each node needs to know every other node's computed gradient.
DeepSpeech from 488MB to 0.74MB
Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile
Some of the gradients are very small - not zero, but small. So sort the gradients by magnitude and send out only the largest 0.1%.
Simply doing this thresholding alone does not even converge for CNNs or RNNs.
So the small gradients still affect accuracy.
If we don't send the small gradients, it will hurt the accuracy => locally accumulate the gradients over more iterations until they get large enough, then send them out. In this way, accuracy can be recovered.
This is almost equivalent to increasing the batch size over N iterations - that is the mathematical way to interpret it.
Let's say we accumulate the gradients locally for 3 iterations; it is almost equivalent to increasing the batch size 3 times.
Take the previous gradient into account.
Use part of the previous gradients and the current gradient to compute a weighted average, which gives a new vector called velocity. We should do local accumulation of velocity rather than local accumulation of gradients.
Gradient clipping is for RNNs only; change the order between clipping and summation.
Gradients go through repeated matrix multiplications because of the chain rule. As they approach the earlier layers, if they have small values (<1), they shrink exponentially until they vanish and make it impossible for the model to learn; this is the vanishing gradient problem. On the other hand, if they have large values (>1), they get larger and eventually blow up and crash the model; this is the exploding gradient problem.
Long-tail accumulation (~2k iterations): it is necessary to cut or mask the gradients,
to mask away the obsolete velocity.
75%, 93.75%, ..., 99.9%
In the early stages of training, the network is changing rapidly, and the gradients are more diverse and aggressive
The only hyper-parameter introduced by Deep Gradient Compression is the warm-up training strategy. In all experiments related to DGC, we raise the sparsity during the warm-up period as follows: 75%, 93.75%, 98.4375%, 99.6%, 99.9%.
The warm-up period for DGC is 4 epochs out of 164 epochs for Cifar10 and 4 epochs out of 90 epochs for the ImageNet dataset.
Figure 6 shows the speedup of multi-node training compared with single-node training. Conventional training achieves much worse speedup with 1Gbps (Figure 6(a)) than 10Gbps Ethernet (Figure 6(b)). Nonetheless, Deep Gradient Compression enables the training with 1Gbps Ethernet to be competitive with conventional training with 10Gbps Ethernet
We refer to this migration as the momentum correction. It is a tweak to the update equation; it doesn't introduce any new hyper-parameter.
Shorter training time
Equal or better model accuracy (no degradation)
Programming?
LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION
As the number of machines increases, there are diminishing improvements to the time needed to train a high quality model, to a point where adding workers does not further improve training time
For the synchronous algorithm, there are rapidly diminishing returns from increasing the effective batch size
For the asynchronous algorithm, gradient interference from inconsistent weights can cause updates to thrash and even, in some cases, result in worse final accuracy or completely stall learning progress
In our experience it can be very difficult to scale effectively much beyond a hundred GPU workers in realistic setups
The encode() function packs the 32-bit nonzero gradient values and 16-bit run lengths of zeros.
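A hedged sketch of what such an encoder could look like (hypothetical helper, not the authors' implementation): each nonzero gradient is stored as a 32-bit float preceded by a 16-bit count of zeros skipped since the previous nonzero.

    import numpy as np
    def encode(sparse_grad):
        # Pack 32-bit nonzero values together with 16-bit run lengths of zeros.
        runs, values, zeros = [], [], 0
        for g in sparse_grad.ravel():
            if g == 0.0:
                zeros += 1
                continue
            # A real encoder would split runs longer than 65535 into chunks.
            runs.append(zeros)
            values.append(g)
            zeros = 0
        return np.asarray(runs, dtype=np.uint16), np.asarray(values, dtype=np.float32)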
PFN's strategies to mitigate the all-reduce network bottleneck
Downpour SGD is an asynchronous variant of SGD in their DistBelief (predecessor to TensorFlow) at Google. It runs multiple replicas of a model in parallel on subsets of the training data. These models send their updates to a parameter server, which is split across many machines. Each machine is responsible for storing and updating a fraction of the model's parameters. However, as replicas don't communicate with each other e.g. by sharing weights or updates, their parameters are continuously at risk of diverging, hindering convergence.
ImageNet in one hour: FB – large minibatch SGD, Caffe2 trains ResNet-50 with minibatch 8192 on 256 GPUs
ImageNet in 15 mins: Preferred Networks (Japanese IoT company) – Chainer, 1024 P100 GPUs, batch size = 32k
Codistillation - Google
The idea of distillation is to first train a teacher model, which traditionally is an ensemble or another high-capacity model, and then, once this teacher model is trained, train a student model with an additional term in the loss function which encourages its predictions to be similar to the predictions of the teacher model.