PipeDream: Generalized Pipeline
Parallelism for DNN Training
Deepak Narayanan
Stanford University
In collaboration with many others
Deep Neural Networks have enabled state-of-the-art results across a range of applications…
[Examples: Image Classification (cat vs. dog), Machine Translation (Tamil sentence → "Hello, my name is Deepak"), Game Playing, Speech-to-Text]
…but first need to be trained!
[Figure: one training iteration — input 𝑥 (an image), label 𝑦 = tiger, prediction ŷ = lion; activations flow forward and gradients flow backward through the weight parameters 𝑊, driven by loss(𝑦, ŷ)]

𝑊 optimized using standard iterative optimization procedures:
𝑊 = 𝑊 − 𝜂 ⋅ ∇𝑊
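The update rule above corresponds to one iteration of a standard training loop. A minimal PyTorch sketch (with a placeholder model and random data, not the talk's code) looks like this:

```python
import torch
import torch.nn as nn

# Hypothetical model, inputs, and labels -- placeholders, not from the talk.
model = nn.Linear(1024, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # implements W = W - eta * grad(W)

x = torch.randn(32, 1024)           # inputs x
y = torch.randint(0, 10, (32,))     # labels y (e.g., "tiger")

optimizer.zero_grad()
y_hat = model(x)                    # forward pass: compute activations and prediction
loss = criterion(y_hat, y)          # loss(y, y_hat)
loss.backward()                     # backward pass: compute gradients grad(W)
optimizer.step()                    # update: W = W - eta * grad(W)
```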
Parallelizing DNN training: Data Parallelism

[Figure: 𝑛 workers, each holding a copy of the same model, with inputs sharded across them; Worker 1 computes ∇𝑊₁, …, Worker 𝑛 computes ∇𝑊ₙ, and gradients are aggregated using all_reduce(.) so that ∇𝑊 = ∇𝑊₁ + ∇𝑊₂ + ⋯ + ∇𝑊ₙ]
Parallelizing DNN training: Data Parallelism

[Same figure as the previous slide, alongside a chart of communication overhead measured on 8xV100s with NVLink (AWS), using PyTorch + NCCL 2.4]

Despite many performance optimizations, communication overhead is high!
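As a rough sketch of the gradient aggregation step (hypothetical code, not PipeDream's), each worker could average gradients with all_reduce(.) after its backward pass, assuming a torch.distributed process group has already been initialized:

```python
import torch
import torch.distributed as dist

def aggregate_gradients(model: torch.nn.Module) -> None:
    """Average gradients across the n data-parallel workers.

    Each worker computed grad(W_i) on its own input shard; after this call every
    worker holds grad(W) = (grad(W_1) + grad(W_2) + ... + grad(W_n)) / n.
    Call after loss.backward() and before optimizer.step().
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over workers
            param.grad.div_(world_size)                        # average
```

In practice, PyTorch's DistributedDataParallel overlaps this communication with the backward pass, which is among the performance optimizations the slide alludes to.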
Model Parallelism: An alternative to data parallelism

[Figure: Worker 1 … Worker 𝑛, each processing all inputs]
• Single version of weights, split over workers
• Activations and gradients sent between workers using send(.) and recv(.)
Model Parallelism: An alternative to data parallelism

[Same figure as the previous slide, with layers partitioned over workers]
• Single version of weights, split over workers
• Activations and gradients sent between workers using send(.) and recv(.)

Low throughput due to poor resource utilization!
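A minimal sketch of this inter-worker communication pattern (hypothetical stage modules and shapes, assuming an initialized torch.distributed process group; not PipeDream's implementation):

```python
import torch
import torch.distributed as dist

def run_stage_forward(stage: torch.nn.Module, rank: int, world_size: int,
                      batch: torch.Tensor, act_shape: tuple) -> torch.Tensor:
    """Forward pass for one model-parallel worker.

    The first worker reads the input batch; every other worker receives its
    input activations from the previous worker with recv(.), runs its layers,
    and sends the resulting activations to the next worker with send(.).
    Gradients flow in the reverse direction during the backward pass.
    """
    if rank == 0:
        x = batch                                  # all inputs enter at the first worker
    else:
        x = torch.empty(act_shape)
        dist.recv(x, src=rank - 1)                 # activations from previous worker
    out = stage(x)
    if rank != world_size - 1:
        dist.send(out, dst=rank + 1)               # activations to next worker
    return out
```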
Solution: Pipelining can increase throughput
Pipelining: injecting multiple
inputs into the system
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
Challenge 1: Pipelining leads to weight version mismatches

Naïve pipelining leads to a mismatch in weight versions: input 𝒏 sees updates in the backward pass not seen in the forward pass, leading to incorrect gradients.

[Figure, naïve pipelining: the forward pass for input 𝑛 uses weight version 𝑊ₙ (𝑥ₙ → 𝑦ₙ), but its backward pass uses a later weight version (∇𝑦ₙ → ∇𝑥ₙ) after intervening updates such as 𝑊ₙ₊₁]
Weight stashing: A solution to version mismatches

[Figure, weight stashing: the forward pass for input 𝑛 uses weight version 𝑊ₙ (𝑥ₙ → 𝑦ₙ) and its backward pass uses the same version 𝑾ₙ (∇𝑦ₙ → ∇𝑥ₙ); stashed weights 𝑊ₙ, 𝑊ₙ₊₁, 𝑊ₙ₊₂ are kept for in-flight inputs]

Store multiple <weight, activation> versions
• Ensures the same weight versions are used in both the forward and backward pass
• Worst-case memory footprint similar to data parallelism (each worker stashes up to 𝑑 versions of its 1/𝑑 share of the weights and activations)
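Conceptually, weight stashing amounts to the following bookkeeping on each stage (a simplified sketch with hypothetical helper names, not PipeDream's actual implementation):

```python
import copy
import torch

class WeightStash:
    """Keep one stashed weight version per in-flight input so that an input's
    backward pass uses the same weights as its forward pass (a sketch only)."""

    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.stash = {}                      # input id -> stashed weight version

    def before_forward(self, input_id: int) -> None:
        # Snapshot the weights this input will see in its forward pass.
        self.stash[input_id] = copy.deepcopy(self.module.state_dict())

    def before_backward(self, input_id: int) -> None:
        # Restore that exact version for the input's backward pass, then discard it.
        self.module.load_state_dict(self.stash.pop(input_id))
```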
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
• How should the DNN operators be partitioned into pipeline stages?
• Each operator has a different computation time
• Activations and gradients need to be communicated across stages
Challenge 2: How do we assign operators to pipeline stages?

[Figure: Stage 1, Stage 2, Stage 3 with per-stage compute times 𝑡₁, 𝑡₂, 𝑡₃ and inter-stage communication times 𝑡₁→₂, 𝑡₂→₃]

• Desideratum #1: 𝑡₁, 𝑡₂, 𝑡₃ as close to each other as possible
  • Compute resources seldom idle → better hardware efficiency
• Desideratum #2: the communication times 𝑡₁→₂ and 𝑡₂→₃ minimized
  • Less communication → better hardware efficiency

See the SOSP paper for details on PipeDream's optimizer!
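PipeDream's actual optimizer is described in the SOSP paper; as a simplified stand-in that captures only desideratum #1 (balancing per-stage compute time, ignoring communication and stage replication), a contiguous-partition dynamic program might look like this:

```python
def partition_layers(layer_times, num_stages):
    """Split layers into contiguous stages so the slowest stage is as fast as
    possible (classic linear-partition DP). A simplified stand-in for
    PipeDream's optimizer, which also models communication and replication."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)          # prefix[i] = sum of first i layer times

    INF = float("inf")
    # best[k][i]: minimal achievable max-stage-time for the first i layers in k stages.
    best = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    split = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):          # last stage covers layers j..i-1
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i], split[k][i] = cost, j

    # Recover stage boundaries as (start, end) layer-index pairs.
    stages, i = [], n
    for k in range(num_stages, 0, -1):
        stages.append((split[k][i], i))
        i = split[k][i]
    return stages[::-1], best[num_stages][n]


# Example: 6 layers with uneven compute times split into 3 stages.
print(partition_layers([2.0, 1.0, 3.0, 1.0, 2.0, 1.0], 3))
```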
Evaluation
Setup
• Integrated PipeDream with PyTorch in ~3000 lines of Python code
• Integrated with PyTorch’s communication library
• NCCL backend for Data Parallelism baselines
• Gloo backend for PipeDream
• Experiments run on three different server types
• Cluster A: 4xV100 GPUs, PCIe intra-server, and 10 Gbps inter-server (Azure)
• Cluster B: 8xV100 GPUs, NVLink intra-server, and 25 Gbps inter-server (AWS)
• Cluster C: 1xTitan X, and 40 Gbps inter-server (private)
PipeDream > Data Parallelism (DP) end-to-end

[Charts: time-to-accuracy for PipeDream vs. data parallelism, annotated "5.3x faster" and "2.5x faster"]
PipeDream vs. Data Parallelism on Time-to-Accuracy
• Experiments on 4 different tasks: image classification, translation, language modeling, video captioning
• With the same number of GPUs, PipeDream is up to 5.3x faster than Data Parallelism
Takeaways
• Model and data parallelism often suffer from high communication
overhead and low resource utilization for certain models and deployments
• PipeDream shows pipelining can be used to accelerate distributed training
for models that fit on a single worker
• Pipelining, when combined with data and model parallelism in a principled
way, achieves end-to-end speedups of up to 5.3x compared to data
parallelism, with similar worst-case memory footprint
Appeared at SOSP 2019
…but modern Deep Neural Networks are becoming extremely large!

[Figure from "Language Models are Few-Shot Learners", Brown et al., annotated: 700 GB in 32-bit precision for the largest model]
Background: GPipe
How should weight and activation versions be managed?
>> Single weight version
>> Periodic pipeline flushes update weight version across workers
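Sketched without the actual pipeline schedule, the single-weight-version approach behaves like gradient accumulation followed by one update per pipeline flush (a rough illustration, not GPipe's implementation; `stage`, `microbatches`, and `criterion` are placeholders):

```python
import torch

def train_minibatch_with_flush(stage: torch.nn.Module,
                               optimizer: torch.optim.Optimizer,
                               microbatches, criterion) -> None:
    """One minibatch on one stage, GPipe-style in spirit: every microbatch is
    processed against the same (single) weight version, gradients accumulate,
    and the weights are updated only once at the pipeline flush, after which
    all workers move to the new version."""
    optimizer.zero_grad()
    for x, y in microbatches:
        loss = criterion(stage(x), y)
        loss.backward()              # gradients accumulate across microbatches
    optimizer.step()                 # single update at the flush
```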
GPipe and PipeDream make different tradeoffs
• GPipe: pipeline flushes are expensive
• PipeDream: high memory footprint from weight versions
Double-buffered weight updates: high
throughput and low memory footprint
How should weight and activation versions be managed?
>> Two weight versions (shadow version and main version)
Double-buffered weight updates: high throughput and low memory footprint

Weight updates from inputs 1 to 4 are accumulated and applied as a single weight update, generating a new weight version; input 5 then uses that new version throughout (both its forward and backward passes).

Use activation recomputation to limit the memory footprint of intermediate activations.
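Activation recomputation can be sketched with PyTorch's generic checkpointing utility (shown as an illustration, not PipeDream-2BW's exact mechanism): only the stage input is kept, and the stage's intermediate activations are recomputed during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

class RecomputedStage(torch.nn.Module):
    """Wrap a pipeline stage so its intermediate activations are discarded after
    the forward pass and recomputed from the stage input during the backward pass."""

    def __init__(self, stage: torch.nn.Module):
        super().__init__()
        self.stage = stage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # checkpoint() saves only `x`; inner activations are rebuilt on backward.
        return checkpoint(self.stage, x)
```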
Double-buffered weight updates: weight semantics
• Assuming a per-GPU microbatch size of 𝑏, minibatch size 𝐵 = 𝑏 ⋅ 𝑑, where 𝑑 is the depth of the pipeline
• Weight update semantics of data parallelism:
  𝑊^(𝑡+1) = 𝑊^(𝑡) − 𝜈 ⋅ ∇𝑓(𝑊^(𝑡))
• Weight update semantics with 2BW are almost unchanged (note the additional delay term of 1 in the gradient computation):
  𝑊^(𝑡+1) = 𝑊^(𝑡) − 𝜈 ⋅ ∇𝑓(𝑊^(𝑡−1))
• Semantics are similar with replicated stages or gradient aggregation (minibatch size 𝐵 multiplied by the appropriate scale factor)
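A rough sketch of the per-stage bookkeeping behind these semantics (a hypothetical helper, not PipeDream-2BW's code): at most two weight versions are kept, gradients accumulate in the parameters' .grad fields across the microbatches of a minibatch, and one update is applied per minibatch; the one-step delay in the update rule reflects that those gradients were computed with the weight version that was current when their inputs entered the pipeline.

```python
import copy
import torch

class TwoBufferedWeights:
    """Minimal 2BW-style bookkeeping for one pipeline stage (illustration only).

    `module` holds the current version used by newly injected inputs; `shadow`
    holds the previous version, kept only while older in-flight inputs still
    need it. One update is applied per minibatch of `k` microbatches.
    """

    def __init__(self, module: torch.nn.Module, lr: float, k: int):
        self.module, self.lr, self.k = module, lr, k
        self.shadow = copy.deepcopy(module.state_dict())  # version for in-flight inputs
        self.seen = 0

    def after_microbatch_backward(self) -> None:
        """Call once per microbatch backward pass; gradients keep accumulating in
        .grad until the whole minibatch has been processed."""
        self.seen += 1
        if self.seen < self.k:
            return
        # Retire the old shadow and remember the version in-flight inputs used.
        self.shadow = copy.deepcopy(self.module.state_dict())
        with torch.no_grad():
            for p in self.module.parameters():
                if p.grad is not None:
                    p.add_(p.grad, alpha=-self.lr / self.k)  # W(t+1) = W(t) - lr * avg grad
                    p.grad = None
        self.seen = 0
```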
Evaluation
Setup
• Experiments run on p3.16xlarge instances on AWS (8-V100 servers w/ NVLink)
• Baselines are hybrid parallelism (no pipelining), PipeDream, and GPipe
• Model and associated activations do not fit on a single worker for many
models, so data parallelism is not applicable
• Evaluation on BERT models with various numbers of transformer layers (24 to
192), and a GPT-2 model with 760 million parameters
2BW has weight update semantics similar to data parallelism
• Training loss trajectory identical
• Accuracy on downstream GLUE tasks unchanged

[Charts: training loss while pre-training a BERT-24 model with identical hyperparameters, and downstream GLUE task accuracy after fine-tuning pre-trained models 3 times]
PipeDream-2BW faster than baselines
• 1.9x faster than GPipe
• 6.9x faster than hybrid parallelism (no pipelining)

[Chart: throughput in sequences/second for PipeDream-2BW and baselines on various models]
PipeDream-2BW has low memory footprint
• PipeDream-2BW with activation recomputation (R) has a memory footprint similar to model parallelism

[Chart: memory footprint for various systems, using a fixed per-GPU microbatch size of 4]
Takeaways
• Model parallelism can be used to train large models that do not fit on a single
worker, but suffers from low resource utilization
• PipeDream-2BW carefully manages weight versions and uses activation
recomputation when necessary to limit memory footprint
• PipeDream-2BW can accelerate training by up to 6.9x compared to optimized
baselines that do not use pipelining, and up to 1.9x compared to GPipe
Preprint on arXiv: https://arxiv.org/pdf/2006.09503.pdf
Conclusion

Pipeline parallelism can accelerate distributed training both in regimes where model metadata (weight parameters and intermediate activations) fits on a single worker, and in regimes where it does not.

Code open sourced on GitHub: https://github.com/msr-fiddle/pipedream
https://cs.stanford.edu/~deepakn/