PipeDream: Generalized Pipeline
Parallelism for DNN Training
Deepak Narayanan
Stanford University
In collaboration with many others
Deep Neural Networks have enabled state-of-the-art results across a range of applications…
[Examples: Image Classification (cat vs. dog), Machine Translation (Tamil sentence → "Hello, my name is Deepak"), Game Playing, Speech-to-Text]
…but first need to be trained!
[Figure: one training iteration — input 𝑥 (an image), label 𝑦 = tiger, prediction ŷ = lion; activations flow forward and gradients flow backward through the weight parameters 𝑊, driven by loss(𝑦, ŷ)]

𝑊 optimized using standard iterative optimization procedures:
𝑊 = 𝑊 − 𝜂 ⋅ ∇𝑊
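The update rule above corresponds to one iteration of a standard training loop. A minimal PyTorch sketch (with a placeholder model and random data, not the talk's code) looks like this:

```python
import torch
import torch.nn as nn

# Hypothetical model, inputs, and labels -- placeholders, not from the talk.
model = nn.Linear(1024, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # implements W = W - eta * grad(W)

x = torch.randn(32, 1024)           # inputs x
y = torch.randint(0, 10, (32,))     # labels y (e.g., "tiger")

optimizer.zero_grad()
y_hat = model(x)                    # forward pass: compute activations and prediction
loss = criterion(y_hat, y)          # loss(y, y_hat)
loss.backward()                     # backward pass: compute gradients grad(W)
optimizer.step()                    # update: W = W - eta * grad(W)
```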
Parallelizing DNN training: Data Parallelism

[Figure: 𝑛 workers, each holding a copy of the same model, with inputs sharded across them; Worker 1 computes ∇𝑊₁, …, Worker 𝑛 computes ∇𝑊ₙ, and gradients are aggregated using all_reduce(.) so that ∇𝑊 = ∇𝑊₁ + ∇𝑊₂ + ⋯ + ∇𝑊ₙ]
Parallelizing DNN training: Data Parallelism

[Same figure as the previous slide, alongside a chart of communication overhead measured on 8xV100s with NVLink (AWS), using PyTorch + NCCL 2.4]

Despite many performance optimizations, communication overhead is high!
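As a rough sketch of the gradient aggregation step (hypothetical code, not PipeDream's), each worker could average gradients with all_reduce(.) after its backward pass, assuming a torch.distributed process group has already been initialized:

```python
import torch
import torch.distributed as dist

def aggregate_gradients(model: torch.nn.Module) -> None:
    """Average gradients across the n data-parallel workers.

    Each worker computed grad(W_i) on its own input shard; after this call every
    worker holds grad(W) = (grad(W_1) + grad(W_2) + ... + grad(W_n)) / n.
    Call after loss.backward() and before optimizer.step().
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over workers
            param.grad.div_(world_size)                        # average
```

In practice, PyTorch's DistributedDataParallel overlaps this communication with the backward pass, which is among the performance optimizations the slide alludes to.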
Model Parallelism: An alternative to data parallelism

[Figure: Worker 1 … Worker 𝑛, each processing all inputs]
• Single version of weights, split over workers
• Activations and gradients sent between workers using send(.) and recv(.)
Model Parallelism: An alternative to data parallelism

[Same figure as the previous slide, with layers partitioned over workers]
• Single version of weights, split over workers
• Activations and gradients sent between workers using send(.) and recv(.)

Low throughput due to poor resource utilization!
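A minimal sketch of this inter-worker communication pattern (hypothetical stage modules and shapes, assuming an initialized torch.distributed process group; not PipeDream's implementation):

```python
import torch
import torch.distributed as dist

def run_stage_forward(stage: torch.nn.Module, rank: int, world_size: int,
                      batch: torch.Tensor, act_shape: tuple) -> torch.Tensor:
    """Forward pass for one model-parallel worker.

    The first worker reads the input batch; every other worker receives its
    input activations from the previous worker with recv(.), runs its layers,
    and sends the resulting activations to the next worker with send(.).
    Gradients flow in the reverse direction during the backward pass.
    """
    if rank == 0:
        x = batch                                  # all inputs enter at the first worker
    else:
        x = torch.empty(act_shape)
        dist.recv(x, src=rank - 1)                 # activations from previous worker
    out = stage(x)
    if rank != world_size - 1:
        dist.send(out, dst=rank + 1)               # activations to next worker
    return out
```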
Solution: Pipelining can increase throughput
Pipelining: injecting multiple
inputs into the system
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
Challenge 1: Pipelining leads to weight version mismatches

Naïve pipelining leads to a mismatch in weight versions: input 𝒏 sees updates in the backward pass not seen in the forward pass, leading to incorrect gradients.

[Figure, naïve pipelining: the forward pass for input 𝑛 uses weight version 𝑊ₙ (𝑥ₙ → 𝑦ₙ), but its backward pass uses a later weight version (∇𝑦ₙ → ∇𝑥ₙ) after intervening updates such as 𝑊ₙ₊₁]
Weight stashing: A solution to version mismatches

[Figure, weight stashing: the forward pass for input 𝑛 uses weight version 𝑊ₙ (𝑥ₙ → 𝑦ₙ) and its backward pass uses the same version 𝑾ₙ (∇𝑦ₙ → ∇𝑥ₙ); stashed weights 𝑊ₙ, 𝑊ₙ₊₁, 𝑊ₙ₊₂ are kept for in-flight inputs]

Store multiple <weight, activation> versions
• Ensures the same weight versions are used in both the forward and backward pass
• Worst-case memory footprint similar to data parallelism (each worker stashes up to 𝑑 versions of its 1/𝑑 share of the weights and activations)
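Conceptually, weight stashing amounts to the following bookkeeping on each stage (a simplified sketch with hypothetical helper names, not PipeDream's actual implementation):

```python
import copy
import torch

class WeightStash:
    """Keep one stashed weight version per in-flight input so that an input's
    backward pass uses the same weights as its forward pass (a sketch only)."""

    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.stash = {}                      # input id -> stashed weight version

    def before_forward(self, input_id: int) -> None:
        # Snapshot the weights this input will see in its forward pass.
        self.stash[input_id] = copy.deepcopy(self.module.state_dict())

    def before_backward(self, input_id: int) -> None:
        # Restore that exact version for the input's backward pass, then discard it.
        self.module.load_state_dict(self.stash.pop(input_id))
```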
Pipelining in DNN training != Traditional pipelining
• How should weight and activation versions be managed?
• Backward pass operators depend on internal state (𝑊, activations)
• Backward pass for inputs should use the same weight version as
corresponding forward pass
• How should the DNN operators be partitioned into pipeline stages?
• Each operator has a different computation time
• Activations and gradients need to be communicated across stages
Challenge 2: How do we assign operators to pipeline stages?

[Figure: Stage 1, Stage 2, Stage 3 with per-stage compute times 𝑡₁, 𝑡₂, 𝑡₃ and inter-stage communication times 𝑡₁→₂, 𝑡₂→₃]

• Desideratum #1: 𝑡₁, 𝑡₂, 𝑡₃ as close to each other as possible
  • Compute resources seldom idle → better hardware efficiency
• Desideratum #2: the communication times 𝑡₁→₂ and 𝑡₂→₃ minimized
  • Less communication → better hardware efficiency

See the SOSP paper for details on PipeDream's optimizer!
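PipeDream's actual optimizer is described in the SOSP paper; as a simplified stand-in that captures only desideratum #1 (balancing per-stage compute time, ignoring communication and stage replication), a contiguous-partition dynamic program might look like this:

```python
def partition_layers(layer_times, num_stages):
    """Split layers into contiguous stages so the slowest stage is as fast as
    possible (classic linear-partition DP). A simplified stand-in for
    PipeDream's optimizer, which also models communication and replication."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)          # prefix[i] = sum of first i layer times

    INF = float("inf")
    # best[k][i]: minimal achievable max-stage-time for the first i layers in k stages.
    best = [[INF] * (n + 1) for _ in range(num_stages + 1)]
    split = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):          # last stage covers layers j..i-1
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i], split[k][i] = cost, j

    # Recover stage boundaries as (start, end) layer-index pairs.
    stages, i = [], n
    for k in range(num_stages, 0, -1):
        stages.append((split[k][i], i))
        i = split[k][i]
    return stages[::-1], best[num_stages][n]


# Example: 6 layers with uneven compute times split into 3 stages.
print(partition_layers([2.0, 1.0, 3.0, 1.0, 2.0, 1.0], 3))
```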
Evaluation
Setup
• Integrated PipeDream with PyTorch in ~3000 lines of Python code
• Integrated with PyTorch’s communication library
• NCCL backend for Data Parallelism baselines
• Gloo backend for PipeDream
• Experiments run on three different server types
• Cluster A: 4xV100 GPUs, PCIe intra-server, and 10 Gbps inter-server (Azure)
• Cluster B: 8xV100 GPUs, NVLink intra-server, and 25 Gbps inter-server (AWS)
• Cluster C: 1xTitan X, and 40 Gbps inter-server (private)
PipeDream > Data Parallelism (DP) end-to-end

[Charts: time-to-accuracy for PipeDream vs. data parallelism, annotated "5.3x faster" and "2.5x faster"]
PipeDream vs. Data Parallelism on Time-to-Accuracy
• Experiments on 4 different tasks: image classification, translation, language modeling, video captioning
• With the same number of GPUs, PipeDream is up to 5.3x faster than Data Parallelism
Takeaways
• Model and data parallelism often suffer from high communication
overhead and low resource utilization for certain models and deployments
• PipeDream shows pipelining can be used to accelerate distributed training
for models that fit on a single worker
• Pipelining, when combined with data and model parallelism in a principled
way, achieves end-to-end speedups of up to 5.3x compared to data
parallelism, with similar worst-case memory footprint
Appeared at SOSP 2019
…but modern Deep Neural Networks are becoming extremely large!

[Figure from "Language Models are Few-Shot Learners", Brown et al., annotated: 700 GB in 32-bit precision for the largest model]
Background: GPipe
How should weight and activation versions be managed?
>> Single weight version
>> Periodic pipeline flushes update weight version across workers
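Sketched without the actual pipeline schedule, the single-weight-version approach behaves like gradient accumulation followed by one update per pipeline flush (a rough illustration, not GPipe's implementation; `stage`, `microbatches`, and `criterion` are placeholders):

```python
import torch

def train_minibatch_with_flush(stage: torch.nn.Module,
                               optimizer: torch.optim.Optimizer,
                               microbatches, criterion) -> None:
    """One minibatch on one stage, GPipe-style in spirit: every microbatch is
    processed against the same (single) weight version, gradients accumulate,
    and the weights are updated only once at the pipeline flush, after which
    all workers move to the new version."""
    optimizer.zero_grad()
    for x, y in microbatches:
        loss = criterion(stage(x), y)
        loss.backward()              # gradients accumulate across microbatches
    optimizer.step()                 # single update at the flush
```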
GPipe and PipeDream make different tradeoffs
• GPipe: pipeline flushes are expensive
• PipeDream: high memory footprint from weight versions
Double-buffered weight updates: high
throughput and low memory footprint
How should weight and activation versions be managed?
>> Two weight versions (shadow version and main version)
Double-buffered weight updates: high throughput and low memory footprint

Weight updates from inputs 1 to 4 are accumulated and applied as a single weight update, generating a new weight version; input 5 then uses that new version throughout (both its forward and backward passes).

Use activation recomputation to limit the memory footprint of intermediate activations.
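Activation recomputation can be sketched with PyTorch's generic checkpointing utility (shown as an illustration, not PipeDream-2BW's exact mechanism): only the stage input is kept, and the stage's intermediate activations are recomputed during the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

class RecomputedStage(torch.nn.Module):
    """Wrap a pipeline stage so its intermediate activations are discarded after
    the forward pass and recomputed from the stage input during the backward pass."""

    def __init__(self, stage: torch.nn.Module):
        super().__init__()
        self.stage = stage

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # checkpoint() saves only `x`; inner activations are rebuilt on backward.
        return checkpoint(self.stage, x)
```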
Double-buffered weight updates: weight semantics
• Assuming a per-GPU microbatch size of 𝑏, minibatch size 𝐵 = 𝑏 ⋅ 𝑑, where 𝑑 is the depth of the pipeline
• Weight update semantics of data parallelism:
  𝑊^(𝑡+1) = 𝑊^(𝑡) − 𝜈 ⋅ ∇𝑓(𝑊^(𝑡))
• Weight update semantics with 2BW are almost unchanged (note the additional delay term of 1 in the gradient computation):
  𝑊^(𝑡+1) = 𝑊^(𝑡) − 𝜈 ⋅ ∇𝑓(𝑊^(𝑡−1))
• Semantics are similar with replicated stages or gradient aggregation (minibatch size 𝐵 multiplied by the appropriate scale factor)
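A rough sketch of the per-stage bookkeeping behind these semantics (a hypothetical helper, not PipeDream-2BW's code): at most two weight versions are kept, gradients accumulate in the parameters' .grad fields across the microbatches of a minibatch, and one update is applied per minibatch; the one-step delay in the update rule reflects that those gradients were computed with the weight version that was current when their inputs entered the pipeline.

```python
import copy
import torch

class TwoBufferedWeights:
    """Minimal 2BW-style bookkeeping for one pipeline stage (illustration only).

    `module` holds the current version used by newly injected inputs; `shadow`
    holds the previous version, kept only while older in-flight inputs still
    need it. One update is applied per minibatch of `k` microbatches.
    """

    def __init__(self, module: torch.nn.Module, lr: float, k: int):
        self.module, self.lr, self.k = module, lr, k
        self.shadow = copy.deepcopy(module.state_dict())  # version for in-flight inputs
        self.seen = 0

    def after_microbatch_backward(self) -> None:
        """Call once per microbatch backward pass; gradients keep accumulating in
        .grad until the whole minibatch has been processed."""
        self.seen += 1
        if self.seen < self.k:
            return
        # Retire the old shadow and remember the version in-flight inputs used.
        self.shadow = copy.deepcopy(self.module.state_dict())
        with torch.no_grad():
            for p in self.module.parameters():
                if p.grad is not None:
                    p.add_(p.grad, alpha=-self.lr / self.k)  # W(t+1) = W(t) - lr * avg grad
                    p.grad = None
        self.seen = 0
```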
Evaluation
Setup
• Experiments run on p3.16xlarge instances on AWS (8-V100 servers w/ NVLink)
• Baselines are hybrid parallelism (no pipelining), PipeDream, and GPipe
• Model and associated activations do not fit on a single worker for many
models, so data parallelism is not applicable
• Evaluation on BERT models with various numbers of transformer layers (24 to
192), and a GPT-2 model with 760 million parameters
2BW has weight update semantics similar to data parallelism
• Training loss trajectory identical
• Accuracy on downstream GLUE tasks unchanged

[Charts: training loss while pre-training a BERT-24 model with identical hyperparameters, and downstream GLUE task accuracy after fine-tuning pre-trained models 3 times]
PipeDream-2BW faster than baselines
• 1.9x faster than GPipe
• 6.9x faster than hybrid parallelism (no pipelining)

[Chart: throughput in sequences/second for PipeDream-2BW and baselines on various models]
PipeDream-2BW has low memory footprint
• PipeDream-2BW with activation recomputation (R) has a memory footprint similar to model parallelism

[Chart: memory footprint for various systems, using a fixed per-GPU microbatch size of 4]
Takeaways
• Model parallelism can be used to train large models that do not fit on a single
worker, but suffers from low resource utilization
• PipeDream-2BW carefully manages weight versions and uses activation
recomputation when necessary to limit memory footprint
• PipeDream-2BW can accelerate training by up to 6.9x compared to optimized
baselines that do not use pipelining, and up to 1.9x compared to GPipe
Preprint on arXiv: https://arxiv.org/pdf/2006.09503.pdf
Conclusion

Pipeline parallelism can accelerate distributed training both in regimes where model metadata (weight parameters and intermediate activations) fits on a single worker, and in regimes where it does not.

Code open sourced on GitHub: https://github.com/msr-fiddle/pipedream
https://cs.stanford.edu/~deepakn/