Large-Scale Training with GPUs at Facebook
Aapo Kyrola
Distributed AI Team @ Facebook
Contents
1. Quick intro to the Caffe2 framework
2. Parallel Training: Async & Sync
3. Synchronous SGD with Caffe2 and GLOO
4. Case Study: How we trained ResNet-50 for ImageNet in just 1 hour
Deep Learning Frameworks by FB
Caffe2
Caffe2 is...
• A lightweight framework for deep learning / ML / ...
• Primarily designed for production use cases and large-scale training
• Speed and low footprint
• C++- and Python-based interfaces
• Supports deployment on multiple platforms
  • Linux, Mac, iOS, Android and Windows
  • IoT devices, Raspberry Pi, Tegra X1, ...
Computational graph
• Describes the model as a DAG of operators and blobs
• The Caffe2 runtime has no deep-learning concepts → it just executes a DAG
• The DAG also covers loss functions, data reading, metrics, etc.
• Graph
  • construction in Python, incl. auto-gradient (flexibility) – a small construction sketch follows the graph example below
  • description in Protobuf (portability)
Graph Example
[Figure: training as a directed graph. A DAG with operators DataReader, FC, CrossEntropy, FCGradient, IterOp, LearningRate and two WeightedSum updates, connecting the blobs X, W, b, Y, Label, Loss, W_grad and b_grad]
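To make the DAG idea concrete, here is a minimal sketch (not from the talk) of building and running a tiny training graph with Caffe2's Python API; the model, blob names and sizes are illustrative.

import numpy as np
from caffe2.python import brew, model_helper, workspace

# Graph construction: the "model" is just a protobuf DAG of operators and blobs.
m = model_helper.ModelHelper(name="toy")
pred = brew.fc(m, "X", "pred", dim_in=4, dim_out=2)
softmax, loss = m.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])
m.AddGradientOperators([loss])          # auto-gradient: appends gradient ops to the same DAG

# Execution: feed the input blobs, then run the graph.
workspace.FeedBlob("X", np.random.rand(8, 4).astype(np.float32))
workspace.FeedBlob("label", np.random.randint(2, size=8).astype(np.int32))
workspace.RunNetOnce(m.param_init_net)  # initializes the FC layer's W and b
workspace.CreateNet(m.net)
workspace.RunNet(m.net)
print(workspace.FetchBlob("loss"))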
Parallel Training
Asynchronous and Synchronous SGD
Asynchronous SGD
• Parameters are updated by parallel workers on a “best-effort” basis, “in the background”.
• Various algorithms exist for adjusting the learning to handle delayed updates, such as EASGD or Block Momentum.
• Parameter Servers manage the parameters.
• Can be used for very large models that do not fit on one machine.
Synchronous SGD
• Workers synchronize (“all-reduce”) parameter gradients after each iteration.
• Models are always in sync.
• Mathematically, the number of workers does not matter: the computation is a function of the total batch size only (see the sketch below).
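A toy numerical check of that last point (my own illustration, not from the talk): for a loss that is a mean over examples, averaging the per-worker gradients of equal-sized local batches gives exactly the gradient over the combined batch, so the synchronous result depends only on the total batch size kn.

import numpy as np

# Linear model with a mean-squared-error loss.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
X, y = rng.normal(size=(32, 3)), rng.normal(size=32)

def grad(Xb, yb, w):
    # gradient of mean((Xb @ w - yb)^2) with respect to w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

shards = np.split(np.arange(32), 4)                               # 4 "workers", 8 examples each
g_sync = np.mean([grad(X[s], y[s], w) for s in shards], axis=0)   # all-reduce average
g_single = grad(X, y, w)                                          # one worker, batch of 32
assert np.allclose(g_sync, g_single)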
Async vs. Sync
+ Async can scale to very large clusters.
- Async requires tuning when runtime characteristics change.
+ The sync result is not affected by execution: it is only a function of the total batch size.
- Sync is harder to scale to large clusters.
GPUs are very fast, so we can use fewer servers for computation
→ Sync SGD can scale sufficiently.
Sync SGD with Caffe2 + GLOO
SyncSGD with Caffe2
• Simple interface: data_parallel_model (DPM) for both multi-GPU and multi-GPU-multi-host models (see the sketch below).
• DPM injects AllReduce and Broadcast operators into the graph
  • (the Caffe2 runtime does not know about being parallel – everything is based on operators)
• Each worker runs the same code, the same DAG, in parallel.
• AllReduce & Broadcast act as implicit barriers.
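A minimal sketch of what using data_parallel_model looks like, assuming Caffe2 is installed; the tiny FC model and the builder-function names are illustrative, not the actual ResNet-50 training setup. DPM calls the builders once per device and splices AllReduce/Broadcast operators into the resulting graph.

from caffe2.python import brew, data_parallel_model, model_helper

def add_input(model):
    # In real training an ImageInput/DBReader operator would produce "data" and "label".
    pass

def add_forward_pass(model, loss_scale):
    pred = brew.fc(model, "data", "pred", dim_in=16, dim_out=10)
    softmax, loss = model.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])
    loss = model.Scale(loss, "loss_scaled", scale=loss_scale)
    return [loss]

def add_param_update(model):
    iteration = brew.iter(model, "iter")
    lr = model.LearningRate(iteration, "lr", base_lr=-0.1, policy="fixed")
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]          # gradients are all-reduced by DPM
        model.WeightedSum([param, one, grad, lr], param)

model = model_helper.ModelHelper(name="dpm_example")
data_parallel_model.Parallelize_GPU(
    model,
    input_builder_fun=add_input,
    forward_pass_builder_fun=add_forward_pass,
    param_update_builder_fun=add_param_update,
    devices=[0, 1, 2, 3],   # one replica per GPU; DPM injects AllReduce/Broadcast ops
    rendezvous=None,        # set a Gloo rendezvous for multi-host training (sketched after the GLOO slide)
)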
SyncSGD with Caffe2
[Figure: backward-pass DAG in which ConvGradient and FCGradient produce conv1_w_grad and fc1_w_grad; AllReduce operators on those gradient blobs feed the ParamUpdate operators]
Parameter updates execute in parallel with the backward pass.
GLOO
https://github.com/facebookincubator/gloo
• Library for very fast distributed reductions: AllReduce, Reduce, Broadcast, Allgather
• External library, with operators for Caffe2 and PyTorch.
• “Mini-MPI”
• Uses NVIDIA’s NCCL for inter-GPU reductions
• TCP/IP and RDMA transports supported (a rendezvous sketch follows below)
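A hedged sketch of how a Caffe2 trainer typically wires Gloo into data_parallel_model for multi-host training, following the pattern of Caffe2's example trainers. The shared-filesystem path, blob names, and shard counts are illustrative assumptions, and the dyndep target may differ between builds.

from caffe2.python import core, dyndep, workspace

# Load the store-handler operators (build-dependent; shown as in Caffe2's example trainer).
dyndep.InitOpsLibrary("@/caffe2/caffe2/distributed:file_store_handler_ops")

shard_id, num_shards = 0, 32          # this host's rank and the number of hosts

# A key-value store reachable by every host (here: a shared filesystem path)
# lets Gloo exchange addresses and build its communication context.
workspace.RunOperatorOnce(core.CreateOperator(
    "FileStoreHandlerCreate", [], ["store_handler"],
    path="/mnt/shared/rendezvous"))   # illustrative path

rendezvous = dict(
    kv_handler="store_handler",
    shard_id=shard_id,
    num_shards=num_shards,
    engine="GLOO",                    # use Gloo for the cross-host AllReduce
    exit_nets=None,
)
# rendezvous is then passed to data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)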
Case Study: ImageNet in 1hr
Goal (June 2017)
• Train ResNet-50 (a very popular image classification architecture) on the ImageNet-1K dataset in less than an hour, to roughly state-of-the-art accuracy.
• On a single 8-GPU P100 machine: ~1.5 days for 90 epochs.
• Why? (A) Faster training improves development iteration speed;
• (B) it enables training on extremely large datasets in a reasonable time.
Challenges
• Accuracy
  • Very large mini-batch sizes were believed to hurt convergence
• Scale efficiently
  • Facebook uses commodity networking (i.e., no InfiniBand)
  • 32 x 8 = 256 NVIDIA P100 GPUs, “Big Basin” architecture
“Big Basin” is an open-sourced hardware design.
Baseline 8 GPUs
[Figure: training error % vs. epochs (0–80); kn = 256, η = 0.1, final error 23.60% ± 0.12]
• k = #GPUs
• n = per-GPU batch size
• η = learning rate
• kn = 256 = 8 x 32
32 x 8 GPUs: Same Learning Rate
[Figure: training error % vs. epochs; kn = 256, η = 0.1, 23.60% ± 0.12 vs. kn = 8k, η = 0.1, 41.78% ± 0.10]
• kn = 8192 = 256 (#GPUs) x 32 (per-GPU batch)
32 x 8 GPUs: Sqrt-scaling of LR?
[Figure: training error % vs. epochs; kn = 256, η = 0.1, 23.60% ± 0.12 vs. kn = 8k, η = 0.6, 26.28% ± 0.03]
32 x 8 GPUs: Linear Scaling of LR?
Linear Scaling + Constant LR Warmup
Rapid changes at the beginning of training → use a small constant LR for the first few epochs.
Linear Scaling + Gradual LR Warmup
Start from an LR of η and increase it by a constant amount at each iteration so that η̂ = kη after 5 epochs (see the sketch below).
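A small sketch of the resulting schedule (my own illustration, not code from the paper): linear scaling sets the target LR to kη, and gradual warmup ramps up to it by a constant per-iteration increment over the first 5 epochs. The defaults assume the 8k-batch ImageNet setup (k = 8192/256 = 32, ~1.28M images / 8192 ≈ 156 iterations per epoch); the step decays later in training are omitted.

def lr_at(iteration, base_lr=0.1, k=32, warmup_epochs=5, iters_per_epoch=156):
    """Learning rate under linear scaling with gradual warmup."""
    target_lr = k * base_lr                             # linear scaling: eta_hat = k * eta
    warmup_iters = warmup_epochs * iters_per_epoch
    if iteration < warmup_iters:
        # increase by a constant amount each iteration until eta_hat is reached
        return base_lr + (target_lr - base_lr) * iteration / warmup_iters
    return target_lr

# Starts at 0.1, ramps linearly, and reaches 3.2 after 5 epochs.
print(lr_at(0), lr_at(400), lr_at(2000))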
Linear LR scaling
In this case, we found an 8K mini-batch to be close to the maximum we could reach with the linear LR scaling technique.
More tricks in the paper.
Scaling Efficiently
Efficient All-Reduce
• ResNet-50: 25 million float parameters (100 MB)
• Each iteration takes about ~0.3 s, and the backward pass runs in parallel with the all-reduces → latency is not an issue.
• The halving-doubling algorithm by Thakur et al. provides optimal throughput (sketched below).
• 3x speedup vs. the “ring” algorithm for all-reduce on 32 servers
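To illustrate why halving-doubling is bandwidth-efficient, here is a toy in-memory simulation of the algorithm (my own sketch, not Gloo's implementation): a reduce-scatter by recursive halving followed by an allgather by recursive doubling, so each rank sends roughly 2·(p-1)/p times the buffer size in total.

import numpy as np

def halving_doubling_allreduce(bufs):
    """Simulated sum-allreduce over a list of equal-length arrays, one per rank.

    len(bufs) must be a power of two and divide the buffer length.
    Phase 1: reduce-scatter by recursive halving.
    Phase 2: allgather by recursive doubling.
    """
    p, n = len(bufs), bufs[0].size
    assert p & (p - 1) == 0 and n % p == 0
    bufs = [b.astype(np.float64) for b in bufs]
    lo, hi = [0] * p, [n] * p          # the range each rank is still responsible for

    # Reduce-scatter: partners exchange halves of their current range and each
    # keeps (and reduces) only one half, so ranges shrink to size n/p.
    dist = p // 2
    while dist >= 1:
        for a in range(p):
            b = a ^ dist
            if a < b:                  # handle each pair once; lo/hi are equal within a pair
                mid = (lo[a] + hi[a]) // 2
                bufs[a][lo[a]:mid] += bufs[b][lo[a]:mid]   # a keeps the lower half
                bufs[b][mid:hi[b]] += bufs[a][mid:hi[b]]   # b keeps the upper half
                hi[a], lo[b] = mid, mid
        dist //= 2

    # Allgather: reverse the pattern, copying the fully reduced chunks back out.
    dist = 1
    while dist < p:
        for a in range(p):
            b = a ^ dist
            if a < b:
                bufs[a][lo[b]:hi[b]] = bufs[b][lo[b]:hi[b]]
                bufs[b][lo[a]:hi[a]] = bufs[a][lo[a]:hi[a]]
                lo[a] = lo[b] = min(lo[a], lo[b])
                hi[a] = hi[b] = max(hi[a], hi[b])
        dist *= 2
    return bufs

ranks = [np.random.rand(16) for _ in range(8)]
result = halving_doubling_allreduce(ranks)
assert all(np.allclose(r, np.sum(ranks, axis=0)) for r in result)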
Using 100% commodity hardware and an open-source software stack: 90% scaling efficiency.
Follow-up Work by Others
• There are already several follow-up papers reproducing & improving on our results
• For example: You, Gitman and Ginsburg demonstrate batch sizes of up to 32K (using a layer-wise adaptive learning rate)
• Alternatives to GPUs, such as Intel Xeon Phis
On-going work
• Elasticity: survive crashes; incrementally add nodes to the cluster when they become available
• Data input is becoming a bottleneck
• FP16 for training
• Implement & experiment with asynchronous algorithms
Lessons Learned
• SyncSGD can go a long way and has fewer tunable parameters than asynchronous SGD.
• The learning rate is the fundamental parameter when increasing the mini-batch size.
• Utilize the inherent parallelism in training to hide latency.
• Commodity hardware can go a long way.
Thank You!
Caffe2.ai
