- Title: Horovod: fast and easy distributed deep learning in TensorFlow
- Paper: https://arxiv.org/abs/1802.05799
- Youtube: https://youtu.be/8zQECRiONAo
Taekmin Kim, http://github.com/tantara
1. Horovod: fast and easy distributed deep learning in TensorFlow
PR-129
Taekmin Kim
Dec 23, 2018
2. On a single machine
● Training ResNet-50 on TPU
○ Batch Size: 1024
○ Accuracy: 76% (Top-1)
○ Training Time: 17 hours
● Training Faster R-CNN on 8 GPUs
○ Batch Size: 8~16
■ 1~2 per GPU
3. Why large-scale training?
● Better accuracy
○ E.g. object detection
○ Group Normalization (ECCV 2018)
● Faster training
○ ResNet-50
■ 6.6 minutes, 75.8% (Top-1)
■ 64k mini-batch size, 2048 GPUs
Group Normalization
21. Related Work
● Training
○ Communication Cost
■ Deep Gradient Compression
■ Training ImageNet in Four Minutes
● using mixed precision
○ RNN, RL
■ Dynamic Control Flow
● Inference
○ Low Latency RNN Inference with Cellular Batching
Editor's Notes
In the ring-allreduce algorithm, shown in Figure 4, each of N nodes communicates with two of its peers 2 ∗ (N − 1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N − 1 iterations, received values are added to the values in the node’s buffer. In the second N − 1 iterations, received values replace the values held in the node’s buffer. Patarasuk and Yuan in [9] suggest that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
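The two phases above can be sketched as a toy in-process simulation — a scatter-reduce phase where received chunks are added, then an allgather phase where fully reduced chunks replace stale ones. This is an illustrative sketch of the algorithm's data movement, not Horovod's actual MPI/NCCL implementation; the chunk schedule is one standard choice.

```python
def ring_allreduce(buffers):
    """Sum identical-length buffers across N simulated nodes, in place.

    Each buffer is split into N chunks. Phase 1 (scatter-reduce): in N-1
    steps, node i sends one chunk to node i+1, which ADDS it to its own.
    Phase 2 (allgather): in N-1 more steps, fully reduced chunks REPLACE
    stale ones. Total: 2*(N-1) communication steps per node.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer must split evenly into N chunks"
    chunk = size // n
    span = lambda c: range((c % n) * chunk, (c % n) * chunk + chunk)

    # Phase 1: scatter-reduce -- received values are added.
    for step in range(n - 1):
        for i in range(n):
            c = i - step                      # chunk node i forwards this step
            for k in span(c):
                buffers[(i + 1) % n][k] += buffers[i][k]

    # Phase 2: allgather -- received values replace.
    # After phase 1, node i holds the fully reduced chunk (i+1) mod N.
    for step in range(n - 1):
        for i in range(n):
            c = i + 1 - step                  # fully reduced chunk to forward
            for k in span(c):
                buffers[(i + 1) % n][k] = buffers[i][k]
    return buffers

bufs = ring_allreduce([[1, 2], [3, 4]])
# both simulated nodes end with the elementwise sum [4, 6]
```

Because each node only ever sends one chunk (1/N of the buffer) per step to a single neighbor, total bytes sent per node are independent of N up to a factor of 2, which is the bandwidth-optimality the paper cites.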
In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) [10] implementation such as Open MPI [11] to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.
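The user-facing change the paper describes — averaging gradients with a single allreduce() call after backprop — can be sketched as follows. The `allreduce` helper here is a hypothetical in-process stand-in for illustration; in Horovod the collective runs across MPI-launched worker processes.

```python
def allreduce(tensors):
    """Elementwise sum across workers (toy in-process stand-in,
    not Horovod's actual collective)."""
    return [sum(vals) for vals in zip(*tensors)]

def averaged_gradients(local_grads):
    """What each worker's training step does after computing its gradient."""
    n = len(local_grads)             # number of workers
    summed = allreduce(local_grads)  # one collective call
    # Every worker ends up applying the same averaged gradient,
    # keeping model replicas in sync.
    return [g / n for g in summed]

grads = averaged_gradients([[1.0, 2.0], [3.0, 4.0]])  # two workers
# -> [2.0, 3.0] on every worker
```

In real Horovod usage this averaging is hidden behind the optimizer wrapper, so the only per-program change is a few lines at startup plus wrapping the optimizer.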
We found that RDMA did not significantly improve our performance, achieving only a three to four percent increase over TCP networking. RDMA, however, did help Horovod exceed 90 percent scaling efficiency on both models.
The VGG-16 model experienced a significant 30 percent speedup when we leveraged RDMA networking.