3. Introduction
• Minimize training time by reducing the
bandwidth for gradient exchange in distributed
training
• Preserve model accuracy for faster training
• Focus on reducing data communication on
inexpensive commodity networks or training on
mobile devices
4. Introduction
(continued)
• To preserve accuracy during compression:
Momentum correction, Local gradient
clipping, Momentum factor masking and
Warm-up training
• Applied DGC to CNNs - Cifar10, ImageNet;
RNNs - Penn Treebank (NLP); Speech -
LibriSpeech corpus
• No need to modify neural network model
structure
Gradient compression 300x to
600x without losing accuracy
12. The Challenges
● AlexNet has 240 MB of weights and ResNet has
100 MB of weights
● For ResNet, every node has to exchange 100 MB
of gradients with every other node during each
training iteration, which makes the network the
bottleneck of the infrastructure
15. Related Distributed Training
Research
• Asynchronous SGD
• Gradient Quantization
• Gradient Dropping (Aji & Heafield, 2017)
• Training ImageNet in one hour (FB)
• Training ImageNet in 15 mins (PFN)
16. Related - Gradient Quantization
• Quantizing the gradients to low-precision
values can reduce the communication
bandwidth.
• Seide et al. (2014) proposed 1-bit SGD to
reduce the gradient transfer data size and
achieved a 10× speedup in traditional speech
applications.
17. Related - Gradient Dropping
• Sparsify the gradients by a single
threshold value.
• To keep the convergence speed, Gradient
Dropping requires adding layer
normalization
• Gradient Dropping saves 99% of gradient
exchange while incurring 0.3% loss on a
machine translation task.
18. Related - Training ImageNet in 1 hour
• FB Big Basin server
• Large minibatch SGD – 8k
• Caffe2 trains ResNet 50
• 256 GPU, Tesla P100
19. Related - Training ImageNet in 1 hour
• Used Facebook’s Big Basin GPU servers
• Each server has 8 Tesla P100 GPUs and 3.2TB
of SSDs.
• Servers have a 50 Gbit Ethernet network card
• ResNet-50 has approximately 25 million
parameters. This means the total size of the
parameters is 25 · 10^6 · sizeof(float) = 100 MB
$$ Expensive hardware
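As a quick back-of-the-envelope check of the parameter-size arithmetic above (a sketch only; the ~25 million parameter count is the slide's approximation):

    # Rough size of the gradients exchanged per node per iteration for ResNet-50.
    num_params = 25 * 10**6      # ~25 million parameters (approximation from the slide)
    bytes_per_float = 4          # sizeof(float32)
    size_mb = num_params * bytes_per_float / 10**6
    print(size_mb)               # -> 100.0, i.e. ~100 MB of gradients per exchange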
24. Comparison
• DGC pushes the gradient compression ratio to up to
600× without expensive hardware
• DGC does not require extra layer normalization, and
thus does not need to change the model structure.
• Most importantly, Deep Gradient Compression
results in no loss of accuracy.
27. Deep Gradient Compression
• Gradient Sparsification
• Local Gradient Accumulation
• Momentum Correction
• Local Gradient Clipping
• Momentum Factor Masking
• Warm-up Training
28. 1. GRADIENT
SPARSIFICATION
• Reduce the communication bandwidth by
sending only the important gradients.
• Use the gradient magnitude as a simple
heuristic for importance
• Only gradients larger than a threshold are
transmitted (top 0.1%)
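A minimal NumPy sketch of this top-0.1% selection (the function name and structure are hypothetical, not the paper's code):

    import numpy as np
    def sparsify(grad, ratio=0.001):
        # Keep only the largest 0.1% of gradients by magnitude; zero out the rest.
        k = max(1, int(grad.size * ratio))
        threshold = np.sort(np.abs(grad).ravel())[-k]   # magnitude of the k-th largest entry
        mask = np.abs(grad) >= threshold
        return grad * mask, mask

Only the entries selected by mask would be transmitted; everything else stays on the local node.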
29. 2. Local Gradient
Accumulation
• To avoid losing information, we
accumulate the rest of the gradients
locally.
• Eventually, these gradients become large
enough to be transmitted.
Accuracy Image classification: -1.6%
Accuracy speech recognition: -3.3%
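A minimal NumPy sketch of the local accumulation described on this slide (hypothetical helper names; selection uses the same top-0.1% rule as above):

    import numpy as np
    def accumulate_and_select(grad, residual, ratio=0.001):
        # residual holds the small gradients that were not sent in earlier iterations.
        acc = residual + grad                          # add this iteration's gradient
        k = max(1, int(acc.size * ratio))
        threshold = np.sort(np.abs(acc).ravel())[-k]
        mask = np.abs(acc) >= threshold                # entries now large enough to transmit
        return acc * mask, acc * (~mask)               # (gradients to send, new local residual)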
30. 3. Momentum Correction
● Momentum SGD – uses part of the previous
gradient and the current gradient to reduce
noise
● A new vector is created, called 'Velocity'
● We should do local accumulation of the
velocity rather than of the gradient
Accuracy Image classification: -0.3%
Speech recognition: can’t converge
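A hedged sketch of accumulating velocity instead of raw gradients (names are mine; the paper's Algorithm 1 is the authoritative version):

    import numpy as np
    def momentum_corrected_step(grad, velocity, residual, m=0.9, ratio=0.001):
        # Local momentum ("velocity"), then accumulate the velocity, not the raw gradient.
        velocity = m * velocity + grad
        residual = residual + velocity
        k = max(1, int(residual.size * ratio))
        threshold = np.sort(np.abs(residual).ravel())[-k]
        mask = np.abs(residual) >= threshold           # entries to transmit this iteration
        sent = residual * mask
        return sent, velocity, residual * (~mask), mask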
31. 4. Local Gradient Clipping
• Gradient clipping is widely adopted to avoid the
exploding gradient problem
• This step is conventionally executed after gradient
aggregation from all nodes.
• Perform the gradient clipping locally before adding
the current gradient to the previous accumulation
Accuracy Image classification: N/A
Speech recognition: -2.0%
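A sketch of clipping locally before accumulation; the N^(-1/2) scaling of the per-node threshold follows the paper, but the function itself is a hypothetical illustration:

    import numpy as np
    def local_clip(grad, clip_norm, num_nodes):
        # Scale the global clipping threshold by N^(-1/2) when clipping on each node.
        local_threshold = clip_norm * num_nodes ** -0.5
        norm = np.linalg.norm(grad)
        if norm > local_threshold:
            grad = grad * (local_threshold / norm)
        return grad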
32. 5. Momentum Factor Masking
There is a long-tail accumulation issue (~2k
iterations)
Introduce momentum factor masking, to
alleviate staleness
This mask stops the momentum for delayed
gradients
Preventing the stale momentum from carrying
the weights in the wrong direction.
Accuracy Image classification: -0.1%
Speech recognition: -0.5%
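A minimal sketch of the masking step (hypothetical names): once an entry has been transmitted, its local residual and its momentum are both zeroed so stale momentum cannot keep pushing that weight.

    import numpy as np
    def apply_momentum_factor_mask(velocity, residual, sent_mask):
        # sent_mask marks the entries that were just transmitted (boolean array).
        keep = ~sent_mask
        return velocity * keep, residual * keep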
33. 6. Warm-up Training
Use a less aggressive learning rate to slow down the
changing speed of the neural network at the start of
training
Instead of linearly ramping up the learning rate during the
first several epochs, we exponentially increase the gradient
sparsity from a relatively small value to the final value, in
order to help the training adapt to the gradients of larger
sparsity.
Accuracy Image classification: +0.37%
Speech recognition: +0.4%
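One possible way to implement the exponential sparsity ramp described above (a sketch; the 4-epoch warm-up and the 75% to 99.9% endpoints follow the paper's setup, the function itself is hypothetical):

    def warmup_sparsity(epoch, warmup_epochs=4, initial=0.75, final=0.999):
        # Exponentially increase sparsity, roughly 75% -> 93.7% -> 98.4% -> 99.6% -> 99.9%.
        if epoch >= warmup_epochs:
            return final
        density_ratio = (1 - final) / (1 - initial)            # total shrink of the dense part
        density = (1 - initial) * density_ratio ** (epoch / warmup_epochs)
        return 1 - density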
40. Conclusion
• Scaling up computation alone hits a limit; optimizing communication comes next.
• Deep Gradient Compression compresses the gradient by
300-600× for a wide range of CNNs and RNNs.
• To achieve this compression without slowing down the
convergence, DGC employs momentum correction, local
gradient clipping, momentum factor masking and warm-up
training.
• Deep Gradient Compression reduces the required
communication bandwidth and improves the scalability of
distributed training with inexpensive, commodity networking
infrastructure.
This paper addresses the problem that large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high-bandwidth network infrastructure.
It proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth while preserving accuracy during compression, using inexpensive commodity hardware.
General introduction to this paper, background, authors
Some introduction to distributed deep learning training
Related research work/papers on distributed deep learning training (DDLT)
Today's paper - DGC
Experiments and detailed results (300-600× compression)
Wrap-up and discussion
Adding more nodes provides more computing power, but there is another factor, communication, which can limit distributed training scalability.
99.9% of the gradient exchange is redundant, especially for recurrent neural networks (RNNs), where the computation-to-communication ratio is low. Therefore, the network bandwidth becomes a significant bottleneck for scaling up distributed training.
Lots of related work/research achieves fast training, like 1 hour or even 15 minutes.
e.g., Uber's framework Horovod requires an expensive 40 Gbit/s network, the same as other big companies like Google, Amazon, FB...
Enable distributed training with a less expensive network, e.g., AWS 1 Gbit/s Ethernet, to democratize deep learning training using commodity hardware.
Training on mobile – for privacy and better personalization
Cifar10 is an established computer-vision dataset used for object recognition.
The ImageNet project is a large visual database designed for use in visual object recognition software research.
ResNet-50 compressed from 97 MB to 0.35 MB, DeepSpeech from 488 MB to 0.74 MB.
Penn Treebank dataset, known as PTB dataset, is widely used in machine learning of NLP (Natural Language Processing) research
LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech.
DGC does not require extra layer normalization, and thus does not need to change the model structure.
This is especially true for recurrent neural networks (RNNs), where the computation-to-communication ratio is low; the network bandwidth therefore becomes a significant bottleneck for scaling up distributed training.
2018 ICLR conference paper
Song Han – PhD from Stanford EECS, now an assistant professor at MIT, where he also manages the HAN Lab
His Deep Compression paper won the 2016 ICLR Best Paper Award
Bill Dally – Professor at Stanford, Chief Scientist at NVIDIA for 10 years
Other authors are from Tsinghua University in China
Next is an overview of deep learning training in a distributed environment.
This is like any general distributed system; it is similar for distributed databases, distributed computing, etc.
vertical vs horizontal scaling
scale up or out
Data parallelism
Different chunks of data go to different nodes; this is easier to implement, and the same model runs on each node (CNN or RNN).
Node 1 may have the batch of training images 1-32, node 2 may have the next 32 images, etc.
All the nodes share the same model, but they are fed different chunks of data; they calculate local gradients from their own chunk of data and then exchange gradients with each other.
Can be implemented in two ways (see the sketch after this list):
a. Parameter server (centralized) - it receives the gradients from all nodes, sums them up, calculates the average, updates its local weights, and then broadcasts them to all the training nodes.
b. All-reduce operation (de-centralized)
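A toy sketch of the centralized option (a); the function name is hypothetical, and real systems implement option (b) with optimized collectives such as NCCL all-reduce:

    import numpy as np
    def parameter_server_step(weights, node_gradients, lr):
        # Average the gradients received from all training nodes,
        # update the central weights, then broadcast them back.
        avg_grad = np.mean(node_gradients, axis=0)
        return weights - lr * avg_grad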
Model parallelism
Different chunks of the model go to different nodes; this is harder to implement and fewer people adopt this approach.
For single node training, there is no gradient exchange over network
Every node receives every other node's calculated gradients and then calculates the average; there is still a master training node (like a tree structure).
This is one of the basic implementations; more advanced ones use a butterfly structure.
where χ is the training dataset,
w are the weights of the network,
f(x, w) is the loss computed from samples x ∈ χ,
η is the learning rate,
N is the number of training nodes, and B_{k,t} for 1 ≤ k < N is a sequence of N minibatches sampled from χ at iteration t, each of size b.
After T iterations, we have Equation 2, which shows that local gradient accumulation can be considered as increasing the batch size from Nb to NbT (the second summation over τ), where T is the length of the sparse update interval between two iterations at which the gradient of w^(i) is sent. Learning rate scaling (Goyal et al., 2017) is a commonly used technique to deal with a large minibatch.
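For reference, the two update rules these symbols describe, reconstructed from the definitions above (Equations 1 and 2 in the DGC paper):

    F(w) = \frac{1}{|\chi|} \sum_{x \in \chi} f(x, w), \qquad
    w_{t+1} = w_{t} - \eta \, \frac{1}{N b} \sum_{k=1}^{N} \sum_{x \in B_{k,t}} \nabla f(x, w_{t})    (Eq. 1)

    w_{t+T}^{(i)} = w_{t}^{(i)} - \eta \, T \cdot \frac{1}{N b T} \sum_{k=1}^{N} \Big( \sum_{\tau=0}^{T-1} \sum_{x \in B_{k,t+\tau}} \nabla^{(i)} f(x, w_{t+\tau}) \Big)    (Eq. 2)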
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. It provides all-gather, all-reduce, broadcast...
AlexNet: 2012
ResNet 2015
LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION. G. E. Hinton
In our first set of experiments, our goal was to approximately determine the maximum number of GPU workers that can be productively employed for SGD in our Common Crawl neural language model setup
The Common Crawl dataset is an open repository of web-crawl data and the largest to-date dataset used for neural language modeling. Common Crawl consists of petabytes of data collected since 2011 and generally completes a crawl every month.
Figure 1a plots the validation error as a function of global steps for the different numbers of workers we tried, using the best learning rate for each number of workers. Increasing the number of workers (and thus the effective batch size) reduced the number of steps required to reach the best validation error until 128 workers, at which point there was no additional improvement. Even with idealized perfect infrastructure, 256 workers would at best result in the same end to end training time on this problem. However, because steps can take so much longer with 256 workers, going from 128 to 256 workers is highly counterproductive in practice. Figure 1b plots validation error against wall time for the same varying numbers of synchronous workers. There is a large degradation in step time, and thus learning progress, at 256 workers. Although it might be possible to improve the step time at 256 workers by using a more sophisticated scheme with backup workers (Chen et al., 2016), the operative limit to scalability on this task is the diminishing return from increasing the effective batch size, not the degradation in step times.
Next related work
Researchers have proposed many approaches to overcome the communication bottleneck in distributed training
We will quickly take a look at existing research on distributed deep learning and compare it with today's paper.
BLEU (Bilingual Evaluation Understudy) is a score for comparing a candidate translation of text to one or more reference translations.
FB - large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU, P100
Closer look of FB big basin
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
ImageNet top-1 validation error vs. minibatch size.
large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
“Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” arXiv, 2017
basically it is a supercomputer
Chainer - The development is led by Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia
NVIDIA Collective Communications Library (NCCL2) - multi-node, multi-GPU systems
It provides functions like all-gather, all-reduce, broadcast...
PyTorch has integrated NCCL2 to accelerate deep learning training on multi-GPU systems.
large minibatch SGD, caffe2 trains ResNet 50 with minibatch 8192 on 256 GPU
In some cases, it improves accuracy.
next - today’s paper
Gradient basic
- Optimization problem - single node: like finding the direction when climbing downhill
- Multiple nodes - each node has its own images and finds its own direction; how do they merge together? They need to communicate and exchange the gradients via the network
- The exchange can be bulky, e.g., AlexNet has 240 MB of weights and ResNet 100 MB; every iteration, every node has to exchange 100 MB of gradients with every other node, which makes the network the bottleneck of the infrastructure
In synchronized training, each node needs to know every other node's computed gradient.
DeepSpeech from 488MB to 0.74MB
Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile
Some of the gradients are very small - not zero, but small. So sort the gradients by magnitude and send out only the largest 0.1%.
Simply doing this thresholding alone does not even converge for CNNs or RNNs.
So the small gradients still affect accuracy.
If we don't send the small gradients, it will hurt the accuracy => locally accumulate the gradients over more iterations until they get large enough, then send them out. In this way, accuracy can be recovered.
This is almost equivalent to increasing the batch size over N iterations - that is the mathematical way to interpret it.
Let's say we accumulate the gradients locally for 3 iterations; it is almost equivalent to increasing the batch size 3 times.
Take the previous gradient into account.
Use part of the previous gradients and the current gradient to compute a weighted average, which gives a new vector called velocity. We should do local accumulation of velocity rather than local accumulation of gradients.
Gradient clipping is for RNNs only; change the order between clipping and summation.
Gradients go through repeated matrix multiplications because of the chain rule. As they approach the earlier layers, if they have small values (<1), they shrink exponentially until they vanish and make it impossible for the model to learn; this is the vanishing gradient problem. On the other hand, if they have large values (>1), they get larger and eventually blow up and crash the model; this is the exploding gradient problem.
Long-tail accumulation (~2k iterations): it is necessary to cut or mask the gradients,
to mask away the obsolete velocity.
75%, 93.75%, ..., 99.9%
In the early stages of training, the network is changing rapidly, and the gradients are more diverse and aggressive
The only hyper-parameter introduced by Deep Gradient Compression is the warm-up training strategy. In all experiments related to DGC, we raise the sparsity during the warm-up period as follows: 75%, 93.75%, 98.4375%, 99.6%, 99.9%.
The warm-up period for DGC is 4 epochs out of 164 epochs for Cifar10 and 4 epochs out of 90 epochs for the ImageNet dataset.
Figure 6 shows the speedup of multi-node training compared with single-node training. Conventional training achieves much worse speedup with 1Gbps (Figure 6(a)) than 10Gbps Ethernet (Figure 6(b)). Nonetheless, Deep Gradient Compression enables the training with 1Gbps Ethernet to be competitive with conventional training with 10Gbps Ethernet
We refer to this migration as the momentum correction. It is a tweak to the update equation; it doesn't introduce any new hyper-parameter.
Shorter training time
Equal or better model accuracy (no degradation)
Programming?
LARGE SCALE DISTRIBUTED NEURAL NETWORK TRAINING THROUGH ONLINE DISTILLATION
As the number of machines increases, there are diminishing improvements to the time needed to train a high quality model, to a point where adding workers does not further improve training time
For the synchronous algorithm, there are rapidly diminishing returns from increasing the effective batch size
For the asynchronous algorithm, gradient interference from inconsistent weights can cause updates to thrash and even, in some cases, result in worse final accuracy or completely stall learning progress
In our experience it can be very difficult to scale effectively much beyond a hundred GPU workers in realistic setups
The encode() function packs the 32-bit nonzero gradient values and 16-bit run lengths of zeros.
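A hedged sketch of what such an encoder could look like (hypothetical helper, not the authors' implementation): each nonzero gradient is stored as a 32-bit float preceded by a 16-bit count of zeros skipped since the previous nonzero.

    import numpy as np
    def encode(sparse_grad):
        # Pack 32-bit nonzero values together with 16-bit run lengths of zeros.
        runs, values, zeros = [], [], 0
        for g in sparse_grad.ravel():
            if g == 0.0:
                zeros += 1
                continue
            # A real encoder would split runs longer than 65535 into chunks.
            runs.append(zeros)
            values.append(g)
            zeros = 0
        return np.asarray(runs, dtype=np.uint16), np.asarray(values, dtype=np.float32)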
PFN's strategies to mitigate the all-reduce network bottleneck
Downpour SGD is an asynchronous variant of SGD in their DistBelief (predecessor to TensorFlow) at Google. It runs multiple replicas of a model in parallel on subsets of the training data. These models send their updates to a parameter server, which is split across many machines. Each machine is responsible for storing and updating a fraction of the model's parameters. However, as replicas don't communicate with each other e.g. by sharing weights or updates, their parameters are continuously at risk of diverging, hindering convergence.
ImageNet in one hour: FB – large minibatch SGD, Caffe2 trains ResNet-50 with minibatch 8192 on 256 GPUs
ImageNet in 15 mins: Preferred Networks (Japanese IoT company) – Chainer, 1024 P100 GPUs, batch size = 32k
Codistillation - Google
The idea of distillation is to first train a teacher model, which traditionally is an ensemble or another high-capacity model, and then, once this teacher model is trained, train a student model with an additional term in the loss function which encourages its predictions to be similar to the predictions of the teacher model.