Large-Scale Training with GPUs at Facebook
Aapo Kyrola
Distributed AI Team @ Facebook
Contents
1. Quick intro to the Caffe2 framework
2. Parallel Training: Async & Sync
3. Synchronous SGD with Caffe2 and GLOO
4. Case Study: How we trained ResNet-50 for ImageNet in just 1 hour
Deep Learning Frameworks by FB
Caffe2
Caffe2 is...
• A lightweight framework for deep learning / ML / ...
• Primarily designed for production use cases and large-scale training
• Speed and low footprint
• C++- and Python-based interfaces
• Supports deployment on multiple platforms
  • Linux, Mac, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Computational graph
• Describes the model as a DAG of operators and blobs
• The Caffe2 runtime has no deep-learning concepts → it just executes a DAG
• The DAG also covers loss functions, data reading, metrics, etc.
• Graph
  • construction in Python, incl. auto-gradient (flexibility) – a small construction sketch follows the graph example below
  • description in Protobuf (portability)
Graph Example
[Figure: training as a directed graph. A DAG with operators DataReader, FC, CrossEntropy, FCGradient, IterOp, LearningRate and two WeightedSum updates, connecting the blobs X, W, b, Y, Label, Loss, W_grad and b_grad]
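To make the DAG idea concrete, here is a minimal sketch (not from the talk) of building and running a tiny training graph with Caffe2's Python API; the model, blob names and sizes are illustrative.

import numpy as np
from caffe2.python import brew, model_helper, workspace

# Graph construction: the "model" is just a protobuf DAG of operators and blobs.
m = model_helper.ModelHelper(name="toy")
pred = brew.fc(m, "X", "pred", dim_in=4, dim_out=2)
softmax, loss = m.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])
m.AddGradientOperators([loss])          # auto-gradient: appends gradient ops to the same DAG

# Execution: feed the input blobs, then run the graph.
workspace.FeedBlob("X", np.random.rand(8, 4).astype(np.float32))
workspace.FeedBlob("label", np.random.randint(2, size=8).astype(np.int32))
workspace.RunNetOnce(m.param_init_net)  # initializes the FC layer's W and b
workspace.CreateNet(m.net)
workspace.RunNet(m.net)
print(workspace.FetchBlob("loss"))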
Parallel Training
Asynchronous and Synchronous SGD
Asynchronous SGD
• Parameters are updated by parallel workers on a “best-effort” basis, “in the background”.
• Various algorithms exist for adjusting the learning to handle delayed updates, such as EASGD or Block Momentum.
• Parameter Servers manage the parameters.
• Can be used for very large models that do not fit on one machine.
Synchronous SGD
• Workers synchronize (“all-reduce”) parameter gradients after each iteration.
• Models are always in sync.
• Mathematically, the number of workers does not matter: the computation is a function of the total batch size only (see the sketch below).
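A toy numerical check of that last point (my own illustration, not from the talk): for a loss that is a mean over examples, averaging the per-worker gradients of equal-sized local batches gives exactly the gradient over the combined batch, so the synchronous result depends only on the total batch size kn.

import numpy as np

# Linear model with a mean-squared-error loss.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)

def grad(Xb, yb, w):
    # gradient of mean((Xb @ w - yb)^2) with respect to w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

shards = np.split(np.arange(32), 4)                               # 4 "workers", 8 examples each
g_sync = np.mean([grad(X[s], y[s], w) for s in shards], axis=0)   # all-reduce average
g_single = grad(X, y, w)                                          # one worker, batch of 32
assert np.allclose(g_sync, g_single)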
Async vs. Sync
+ Async can scale to very large clusters.
- Async requires tuning when runtime characteristics change.
+ The sync result is not affected by execution: it is only a function of the total batch size.
- Sync is harder to scale to large clusters.
GPUs are very fast, so we can use fewer servers for computation
→ Sync SGD can scale sufficiently.
Sync SGD with Caffe2 + GLOO
SyncSGD with Caffe2
• Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU-multi-host models (see the sketch below).
• DPM injects AllReduce and Broadcast operators into the graph
  • (the Caffe2 runtime does not know about being parallel – everything is based on operators)
• Each worker runs the same code, the same DAG, in parallel.
• AllReduce & Broadcast act as implicit barriers.
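A minimal sketch of what using data_parallel_model looks like, assuming Caffe2 is installed; the tiny FC model and the builder-function names are illustrative, not the actual ResNet-50 training setup. DPM calls the builders once per device and splices AllReduce/Broadcast operators into the resulting graph.

from caffe2.python import brew, data_parallel_model, model_helper

def add_input(model):
    # In real training an ImageInput/DBReader operator would produce "data" and "label".
    pass

def add_forward_pass(model, loss_scale):
    pred = brew.fc(model, "data", "pred", dim_in=16, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])
    loss = model.Scale(loss, "loss_scaled", scale=loss_scale)
    return [loss]

def add_param_update(model):
    iteration = brew.iter(model, "iter")
    lr = model.LearningRate(iteration, "lr", base_lr=-0.1, policy="fixed")
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]          # gradients are all-reduced by DPM
        model.WeightedSum([param, one, grad, lr], param)

model = model_helper.ModelHelper(name="dpm_example")
data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=add_forward_pass,
    param_update_builder_fun=add_param_update,
    devices=[0, 1, 2, 3],   # one replica per GPU; DPM injects AllReduce/Broadcast ops
    rendezvous=None,        # set a Gloo rendezvous for multi-host training (sketched after the GLOO slide)
)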
SyncSGD with Caffe2
[Figure: backward-pass DAG in which ConvGradient and FCGradient produce conv1_w_grad and fc1_w_grad; AllReduce operators on those gradient blobs feed the ParamUpdate operators]
Parameter updates execute in parallel with the backward pass.
GLOO
https://github.com/facebookincubator/gloo
• Library for very fast distributed reductions: AllReduce, Reduce, Broadcast, Allgather
• External library, with operators for Caffe2 and PyTorch.
• “Mini-MPI”
• Uses NVIDIA’s NCCL for inter-GPU reductions
• TCP/IP and RDMA transports supported (a rendezvous sketch follows below)
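A hedged sketch of how a Caffe2 trainer typically wires Gloo into data_parallel_model for multi-host training, following the pattern of Caffe2's example trainers. The shared-filesystem path, blob names, and shard counts are illustrative assumptions, and the dyndep target may differ between builds.

from caffe2.python import core, dyndep, workspace

# Load the store-handler operators (build-dependent; shown as in Caffe2's example trainer).
dyndep.InitOpsLibrary("@/caffe2/caffe2/distributed:file_store_handler_ops")

shard_id, num_shards = 0, 32          # this host's rank and the number of hosts

# A key-value store reachable by every host (here: a shared filesystem path)
# lets Gloo exchange addresses and build its communication context.
workspace.RunOperatorOnce(core.CreateOperator(
    "FileStoreHandlerCreate", [], ["store_handler"],
    path="/mnt/shared/rendezvous"))   # illustrative path

rendezvous = dict(
    kv_handler="store_handler",
    shard_id=shard_id,
    num_shards=num_shards,
    engine="GLOO",                    # use Gloo for the cross-host AllReduce
    exit_nets=None,
)
# rendezvous is then passed to data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)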
Case Study: ImageNet in 1hr
Goal (June 2017)
• Train ResNet-50 (a very popular image classification architecture) on the ImageNet-1K dataset in less than an hour, to roughly state-of-the-art accuracy.
• On a single 8-GPU P100 machine: ~1.5 days for 90 epochs.
• Why? (A) Faster training improves development iteration speed;
• (B) it enables training on extremely large datasets in a reasonable time.
Challenges
• Accuracy
  • Very large mini-batch sizes were believed to hurt convergence
• Scale efficiently
  • Facebook uses commodity networking (i.e., no InfiniBand)
  • 32 x 8 = 256 NVIDIA P100 GPUs, “Big Basin” architecture
“Big Basin” is an open-sourced hardware design.
Baseline 8 GPUs
[Figure: training error % vs. epochs (0–80); kn = 256, η = 0.1, final error 23.60% ± 0.12]
• k = #GPUs
• n = per-GPU batch size
• η = learning rate
• kn = 256 = 8 x 32
32 x 8 GPUs: Same Learning Rate
[Figure: training error % vs. epochs; kn = 256, η = 0.1, 23.60% ± 0.12 vs. kn = 8k, η = 0.1, 41.78% ± 0.10]
• kn = 8192 = 256 (#GPUs) x 32 (per-GPU batch)
32 x 8 GPUs: Sqrt-scaling of LR?
[Figure: training error % vs. epochs; kn = 256, η = 0.1, 23.60% ± 0.12 vs. kn = 8k, η = 0.6, 26.28% ± 0.03]
32 x 8 GPUs: Linear Scaling of LR?
Linear Scaling + Constant LR Warmup
Rapid changes at the beginning of training → use a small constant LR for the first few epochs.
Linear Scaling + Gradual LR Warmup
Start from an LR of η and increase it by a constant amount at each iteration so that η̂ = kη after 5 epochs (see the sketch below).
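A small sketch of the resulting schedule (my own illustration, not code from the paper): linear scaling sets the target LR to kη, and gradual warmup ramps up to it by a constant per-iteration increment over the first 5 epochs. The defaults assume the 8k-batch ImageNet setup (k = 8192/256 = 32, ~1.28M images / 8192 ≈ 156 iterations per epoch); the step decays later in training are omitted.

def lr_at(iteration, base_lr=0.1, k=32, warmup_epochs=5, iters_per_epoch=156):
    """Learning rate under linear scaling with gradual warmup."""
    target_lr = k * base_lr                             # linear scaling: eta_hat = k * eta
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        # increase by a constant amount each iteration until eta_hat is reached
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters
    return target_lr

# Starts at 0.1, ramps linearly, and reaches 3.2 after 5 epochs.
print(lr_at(0), lr_at(400), lr_at(2000))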
Linear LR scaling
In this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique.
More tricks in the paper.
Scaling Efficiently
Efficient All-Reduce
• ResNet-50: 25 million float parameters (100 MB)
• Each iteration takes about ~0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue.
• The halving-doubling algorithm by Thakur et al. provides optimal throughput (sketched below).
• 3x speedup vs. the “ring” algorithm for all-reduce on 32 servers
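To illustrate why halving-doubling is bandwidth-efficient, here is a toy in-memory simulation of the algorithm (my own sketch, not Gloo's implementation): a reduce-scatter by recursive halving followed by an allgather by recursive doubling, so each rank sends roughly 2·(p-1)/p times the buffer size in total.

import numpy as np

def halving_doubling_allreduce(bufs):
    """Simulated sum-allreduce over a list of equal-length arrays, one per rank.

    len(bufs) must be a power of two and divide the buffer length.
    Phase 1: reduce-scatter by recursive halving.
    Phase 2: allgather by recursive doubling.
    """
    p, n = len(bufs), bufs[0].size
    assert p & (p - 1) == 0 and n % p == 0
    bufs = [b.astype(np.float64) for b in bufs]
    lo, hi = [0] * p, [n] * p          # the range each rank is still responsible for

    # Reduce-scatter: partners exchange halves of their current range and each
    # keeps (and reduces) only one half, so ranges shrink to size n/p.
    dist = p // 2
    while dist >= 1:
        for a in range(p):
            b = a ^ dist
            if a < b:                  # handle each pair once; lo/hi are equal within a pair
                mid = (lo[a] + hi[a]) // 2
                bufs[a][lo[a]:mid] += bufs[b][lo[a]:mid]   # a keeps the lower half
                bufs[b][mid:hi[b]] += bufs[a][mid:hi[b]]   # b keeps the upper half
                hi[a], lo[b] = mid, mid
        dist //= 2

    # Allgather: reverse the pattern, copying the fully reduced chunks back out.
    dist = 1
    while dist < p:
        for a in range(p):
            b = a ^ dist
            if a < b:
                bufs[a][lo[b]:hi[b]] = bufs[b][lo[b]:hi[b]]
                bufs[b][lo[a]:hi[a]] = bufs[a][lo[a]:hi[a]]
                lo[a] = lo[b] = min(lo[a], lo[b])
                hi[a] = hi[b] = max(hi[a], hi[b])
        dist *= 2
    return bufs

ranks = [np.random.rand(16) for _ in range(8)]
result = halving_doubling_allreduce(ranks)
assert all(np.allclose(r, np.sum(ranks, axis=0)) for r in result)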
Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency.
Follow-up Work by Others
• There are already several follow-up papers reproducing & improving on our results
• For example: You, Gitman and Ginsburg demonstrate batch sizes of up to 32K (using a layer-wise adaptive learning rate)
• Alternatives to GPUs, such as Intel Xeon Phis
On-going work
• Elasticity: survive crashes; incrementally add nodes to the cluster when they become available
• Data input is becoming a bottleneck
• FP16 for training
• Implement & experiment with asynchronous algorithms
Lessons Learned
• SyncSGD can go a long way and has fewer tunable parameters than asynchronous SGD.
• The learning rate is the fundamental parameter when increasing the mini-batch size.
• Utilize the inherent parallelism in training to hide latency.
• Commodity hardware can go a long way.
Thank You!
Caffe2.ai
