Pipeline Parallel Training of
Large-scale Neural Networks
Changjiang Gou
Zhejiang Lab
January 2022
1
Agenda
1. Introduction
2. Fundamentals
3. Core Techniques
4. Evaluation on BERT
5. To be continued
6. Closing notes
2
Why do we need it?
Compared to data parallelism, the most prevalent approach, pipeline parallelism (PP) can:
1. Train larger models
2. Cut communication overhead (by around 90%)
3. Overlap computation and communication
Introduction
But a naive implementation suffers from:
1. Idle devices
2. Low throughput
3. State staleness
3
Fundamentals
Partition the NN into several stages (contiguous sequences of layers).
Assign each stage to a device.
All devices then work on different tasks (stages) and different micro-batches of the data stream at the same time. (A toy partitioning sketch follows below.)
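A toy sketch of these two steps with hypothetical layer names and a naive even split (real systems balance the stages using profiled per-layer costs):

```python
# Minimal sketch (not from the slides): split a layer list into contiguous
# stages and assign each stage to a device.
def partition_into_stages(layers, num_stages):
    """Split `layers` into `num_stages` contiguous chunks of (almost) equal size."""
    chunk = (len(layers) + num_stages - 1) // num_stages  # ceiling division
    return [layers[i * chunk:(i + 1) * chunk] for i in range(num_stages)]

layers = [f"layer_{i}" for i in range(10)]          # placeholder layer names
stages = partition_into_stages(layers, num_stages=4)
placement = {f"device_{d}": stage for d, stage in enumerate(stages)}
for device, stage in placement.items():
    print(device, "->", stage)
```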
[Figure: pipeline schedule over time; each device d1-d4 runs forward (F) and then backward (B) passes of the micro-batches assigned to its stage]
4
[Figure: per-device computation and memory over time; the count of stashed micro-batch activations grows during the forward phase and shrinks back to 0 during the backward phase]
[Pie chart] Memory consumption of a Transformer, split across the model, the optimizer state, and the activations (5% / 16% / 79%).
Fundamentals
Coarse-grain computation graph
[Figure: forward stages, loss, and backward/gradient nodes of the partitioned network]
5
Core Techniques
[Figure: GPipe pipeline schedule across devices d1-d4: all forward passes of a mini-batch, then all backward passes, with a bubble of idle slots in between]
NeurIPS19, GPipe
• Micro-batching
Divides a mini-batch of size N into M micro-batches; at the end of the mini-batch, the accumulated gradients are applied to update the parameters.
• Gradient checkpointing
Each device stores only the output activations at the stage boundaries; during the backward pass it re-computes the forward function (sub-linear memory cost). (A minimal sketch of both ideas follows below.)
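A minimal single-process PyTorch sketch of both ideas; the toy model, sizes, and M are placeholders, and the real GPipe additionally spreads the stages across devices. Gradients accumulate over M micro-batches, and activations are re-computed during the backward pass via torch.utils.checkpoint:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))  # toy stand-in
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 32), torch.randn(16, 1)   # one mini-batch (N = 16)
M = 4                                            # number of micro-batches

optimizer.zero_grad()
for xm, ym in zip(x.chunk(M), y.chunk(M)):
    # Re-compute the forward pass during backward instead of storing
    # intermediate activations (gradient checkpointing; needs a recent PyTorch
    # for the use_reentrant flag).
    out = checkpoint(model, xm, use_reentrant=False)
    loss = loss_fn(out, ym) / M          # scale so gradients average over the mini-batch
    loss.backward()                      # gradients accumulate in .grad
optimizer.step()                         # one parameter update per mini-batch
```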
[Pie chart] Time consumption breakdown (67% / 23% / 4% / 3% / 3%) across: computation, weight update, recompute, load imbalance, bubble, and setup.
6
SOSP19, PipeDream
[Figure: 1F1B pipeline schedule across devices d1-d4 (an improvement over the GPipe schedule above): after a short warm-up, each device alternates forward and backward slots]
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between a forward computation and a backward computation. (A small scheduling sketch follows below.)
• Weight stashing to reduce staleness
• Discrepancy between weight versions can prevent the model from converging
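A small pure-Python sketch of the per-device order that 1F1B implies (device and micro-batch counts are arbitrary): a warm-up of forward passes, a steady state alternating one forward and one backward, then a cool-down of backward passes.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Return the F/B order executed by `stage` (0-indexed) under 1F1B."""
    schedule = []
    warmup = min(num_stages - stage - 1, num_microbatches)
    for mb in range(warmup):                      # warm-up: forwards only
        schedule.append(("F", mb))
    for i in range(num_microbatches - warmup):    # steady state: alternate F and B
        schedule.append(("F", warmup + i))
        schedule.append(("B", i))
    for mb in range(num_microbatches - warmup, num_microbatches):  # cool-down
        schedule.append(("B", mb))
    return schedule

for s in range(4):
    print(f"device {s}:", one_f_one_b(s, num_stages=4, num_microbatches=6))
```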
Core Techniques
7
SOSP19, PipeDream
• 1F1B: one-forward-one-backward
To eliminate idle slots, each device alternates between forward and backward computations.
• Weight stashing to reduce staleness
• Discrepancy between weight versions can prevent the model from converging:
for instance, device d1 starts the forward computation of the green micro-batch at time t1 and its backward computation at time t2, but the weights have already been updated twice in between (at times t3 and t4 in the figure).
Core Techniques
[Figure: 1F1B schedule with times t1 and t2 for device d1's green micro-batch, and the intervening weight updates at t3 and t4, highlighted]
8
SOSP19, PipeDream
• Weight stashing to reduce staleness
Weight stashing: for a given micro-batch (denoted by colors in the figure), each stage keeps its own copy of the weight version it used for that micro-batch's forward pass and reuses the same version for the backward pass.
An instance: at time t1, device d1 uses the weight version produced by the yellow micro-batch's update and stores that version until it is used again at t2; on device d2, the weight used at time t3 has just been updated by the blue micro-batch and is kept only until t2.
Shortcoming: weight versions remain inconsistent across stages. (A toy bookkeeping sketch follows below.)
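A toy bookkeeping sketch of weight stashing on one stage (plain Python; the weights and the per-micro-batch update are symbolic strings): the version used by a micro-batch's forward pass is stashed and reused by its backward pass, then dropped.

```python
class Stage:
    """Bookkeeping for weight stashing on a single pipeline stage."""
    def __init__(self, weights):
        self.weights = weights      # latest local weight version
        self.stash = {}             # micro-batch id -> weight version used for its F

    def forward(self, mb):
        self.stash[mb] = self.weights               # remember the version this F used
        print(f"F mb{mb} with {self.weights}")

    def backward(self, mb):
        w = self.stash.pop(mb)                      # same version as this micro-batch's F
        print(f"B mb{mb} with {w}")
        # the resulting gradient is applied to the *latest* weights (asynchronous update)
        self.weights = f"({self.weights})+grad(mb{mb})"

stage = Stage("w0")
stage.forward(0); stage.forward(1)   # two in-flight micro-batches
stage.backward(0)                    # uses w0, then a new local version appears
stage.forward(2)                     # newest version; mb1's version stays stashed
stage.backward(1)                    # still uses the version stashed at its forward
```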
!" !%
!'
!&
Core Techniques
[Figure: the same 1F1B schedule, annotated with the weight versions each device stashes at times t1-t3]
9
SOSP19, PipeDream
• Weight stashing to reduce staleness
Vertical sync: for a given micro-batch, only the input stage picks the latest weight version; the chosen version is propagated along with the activations and gradients (a p2p operation), so all stages use matching versions.
An instance: at time t1, device d1 uses the weight version produced by the yellow micro-batch's update and stores it until it is used again at t2; on device d2, the weight used at time t3 is the version propagated from d1, and it is kept only until t2.
Each device still stashes several versions of the weights. (A small sketch of this version propagation follows below.)
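A toy sketch of the vertical-sync bookkeeping (names are hypothetical): the version id chosen by the input stage travels with the activations, and every stage looks up its own stashed weights under that id.

```python
# Each stage stashes weights by version id.  The input stage picks the latest
# version; the chosen id is sent along with the activations (p2p), so every
# stage uses a matching stashed version for this micro-batch.
stashes = [{"v0": f"stage{s}_w_v0", "v1": f"stage{s}_w_v1"} for s in range(4)]
latest_version = "v1"                       # known only at the input stage

def forward_pipeline(microbatch):
    version = latest_version                # decided once, at the input stage
    activation = microbatch
    for s, stash in enumerate(stashes):
        activation = f"f({activation}, {stash[version]})"
    return activation, version              # the version id also tags the backward pass

out, v = forward_pipeline("x0")
print(out, "| consistent weight version:", v)
```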
!" !%
!'
!&
Core Techniques
[Figure: the same 1F1B schedule, with the weight version chosen by the input stage travelling down the pipeline]
10
ICML21, PipeDream-2BW
[Figure: 1F1B-style schedule across devices d1-d4 with the double-buffered weight versions marked at times t1-t4]
• Double-buffered weight update
Each device stashes at most 2 versions of the weights, so the memory footprint is reduced!
An instance:
1. at t1, a training period starts with weight version W1;
2. at t2, another training period starts with W2;
3. at t3, the period using W1 finishes its mini-batch of 4 micro-batches, and W1 is discarded;
4. at t4, a third period starts with the weight version just produced at t3;
5. only 2 versions are ever needed! (A toy bookkeeping sketch follows below.)
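A toy sketch of the double-buffered bookkeeping (plain Python; weights and updates are symbolic): newly injected micro-batches use the current version, in-flight ones finish with the version they started with, and a new version is produced once per mini-batch, so at most two versions coexist.

```python
class DoubleBufferedStage:
    """PipeDream-2BW-style bookkeeping: at most two weight versions alive."""
    def __init__(self):
        self.versions = {0: "W1"}        # version id -> weights
        self.current = 0                 # version used by newly injected micro-batches
        self.in_flight = {}              # micro-batch id -> version id it started with

    def forward(self, mb):
        self.in_flight[mb] = self.current

    def backward(self, mb, end_of_minibatch):
        self.in_flight.pop(mb)
        if end_of_minibatch:
            new = self.current + 1
            self.versions[new] = f"update({self.versions[self.current]})"
            self.current = new
            # discard versions no longer referenced by any in-flight micro-batch
            live = {self.current} | set(self.in_flight.values())
            self.versions = {v: w for v, w in self.versions.items() if v in live}
        assert len(self.versions) <= 2, "never more than two versions stashed"

stage = DoubleBufferedStage()
for step in range(12):                   # mini-batches of 4 micro-batches, pipelined
    stage.forward(step)
    if step >= 3:                        # backward lags the forward by the pipeline depth
        stage.backward(step - 3, end_of_minibatch=((step - 3) % 4 == 3))
print("versions alive:", stage.versions)
```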
Core Techniques
11
SC20, GEMS
Core Techniques
[Figure: two model replicas across devices d1-d4, with replica 2's stages placed in the reverse device order of replica 1]
Model parallelism with a second model replica to increase memory efficiency
[Figure: F/B timelines of the two replicas, with the total memory (summed over all devices) shown for each]
13
SC20, GEMS
Core Techniques
Model parallelism with a second model replica to increase memory efficiency
It is designed for:
• Extremely large DNNs, e.g., ResNet-1k with 1000 layers, which consume a huge amount of memory
• High-resolution inputs, where even a batch size of 1 is already large, so micro-batching is not feasible
Example:
• High-resolution histopathology images of 100,000 x 100,000 pixels.
14
SC21, Chimera
Core Techniques
[Figure: Chimera combines two pipelines running in opposite directions: model replica 1 maps its stages to devices d1 through d4, model replica 2 maps them to d4 through d1, and their F/B schedules are interleaved]
15
SC21, Chimera
Core Techniques
[Figure: schedule in which gradient synchronization happens only after all local computation has finished]
Gradient synchronization between model replicas
• After all local computation
Synchronize gradients (an allreduce operation) only after the micro-batches of both directions (∧ and ∨) have completed.
[Figure: schedule in which the allreduce of the first and last stages starts eagerly and overlaps with the remaining computation]
• Eager, as soon as gradients are ready
Synchronize the gradients (allreduce operation) of the first and last stages as soon as they are ready, overlapping the communication with the remaining computation. The bubble is reduced! (A minimal overlap sketch follows below.)
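A rough sketch of this overlap in PyTorch, assuming a torch.distributed process group has already been initialized (e.g. launched via torchrun); the gradient tensor and the "remaining work" are placeholders, not Chimera's actual implementation:

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) was already called (e.g. via torchrun).
grad_last_stage = torch.randn(1024)       # placeholder: a stage gradient that is ready early

# Launch the allreduce eagerly, as soon as this stage's gradient is ready ...
handle = dist.all_reduce(grad_last_stage, op=dist.ReduceOp.SUM, async_op=True)

# ... and keep doing the remaining forward/backward work of other micro-batches meanwhile.
remaining_work = torch.randn(1024) @ torch.randn(1024)   # stand-in for useful computation

handle.wait()                             # block only right before the weight update
grad_last_stage /= dist.get_world_size()  # average across the stage replicas
```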
16
SC21, Chimera
Core Techniques
Combine Pipeline and Data Parallelism
[Figure: two replicas of a 4-stage bidirectional pipeline, one on devices d1-d4 and one on devices d5-d8, combined via data parallelism]
With data parallelism:
• p2p communication is reduced, since fewer devices are used for pipeline stages (e.g., 4 stages instead of 8 here).
• allreduce communication increases due to gradient synchronization; high-bandwidth interconnects (such as IB, Cray Aries, Slingshot, NVLink) can partially alleviate this.
• The workload on each stage is reduced.
• It is important to find a sweet spot between D (the number of stages) and W (the number of replicas). (A tiny enumeration sketch follows below.)
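A tiny sketch of the search this implies (pure Python): enumerate every factorization of the device count into D pipeline stages times W replicas, and keep the cheapest according to a cost model such as the one on the next slide (passed in here as a dummy placeholder callable).

```python
def candidate_configs(num_devices):
    """All (D, W) splits of the devices into D pipeline stages x W replicas."""
    return [(d, num_devices // d)
            for d in range(1, num_devices + 1) if num_devices % d == 0]

def pick_config(num_devices, cost_model):
    """Pick the (D, W) split with the lowest estimated iteration runtime."""
    return min(candidate_configs(num_devices), key=lambda dw: cost_model(*dw))

print(candidate_configs(8))                       # [(1, 8), (2, 4), (4, 2), (8, 1)]
# A dummy cost model just to show the call; a real one would be the next slide's formula.
print(pick_config(8, cost_model=lambda D, W: 1.0 / D + 0.1 * (W - 1)))
```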
17
SC21, Chimera
Core Techniques
Performance modelling
!"
!#
!$
!%
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
F
F
F
F B
B
B
B
Runtime of a single training iteration:
T = (f + Com_p2p) · N_f + (b + Com_p2p) · N_b + max{ Com_uncovered(i) : i ∈ [0, D−1] }
• f, b: runtime of a single forward and a single backward computation, respectively
• Com_p2p: p2p communication cost between stages; classical alpha-beta model α + βm, with m the message size
• N_f, N_b: number of forward and backward computations on the critical path, respectively
• Com_allreduce = 2·log2(W)·α + 2·((W−1)/W)·β·m, the classical Rabenseifner algorithm, with W the number of stage replicas
• Com_uncovered(i): the part of Com_allreduce that cannot be covered by (overlapped with) the bubble on device i
(A direct Python transcription of this model follows below.)
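A direct transcription of the model above into Python; the numbers fed to it at the bottom are placeholders, only there to show the expected inputs.

```python
import math

def allreduce_cost(alpha, beta, m, W):
    """Rabenseifner allreduce of an m-byte message across W stage replicas."""
    return 2 * math.log2(W) * alpha + 2 * (W - 1) / W * beta * m

def iteration_runtime(f, b, N_f, N_b, alpha, beta, m, uncovered):
    """T = (f + Com_p2p) * N_f + (b + Com_p2p) * N_b + max_i Com_uncovered(i)."""
    com_p2p = alpha + beta * m
    return (f + com_p2p) * N_f + (b + com_p2p) * N_b + max(uncovered)

# Placeholder numbers, only to illustrate the shapes of the inputs.
ar = allreduce_cost(alpha=1e-5, beta=1e-9, m=4e6, W=2)
bubble = [0.004, 0.002, 0.002, 0.004]                 # hypothetical per-device bubble time
uncovered = [max(0.0, ar - bi) for bi in bubble]      # allreduce time not hidden by bubbles
print(iteration_runtime(f=1.0e-3, b=2.0e-3, N_f=8, N_b=8,
                        alpha=1e-5, beta=1e-9, m=4e6, uncovered=uncovered))
```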
[Figure: timeline spanning training iteration 0 and training iteration 1]
18
Evaluation on BERT
4 workstations interconnected by IB; each is equipped with 8 V100 GPUs connected by NVLink.
Memory:
20
Evaluation on BERT
4 workstations interconnected by IB; each is equipped with 8 V100 GPUs connected by NVLink.
Throughput:
21
To be continued
Auto parallelism
SOSP19, PipeDream
PPoPP21, DAPPLE
Figure 2. PipeDream framework overview
• Micro-benchmarks to profile computation time, memory overhead, etc.
• Formulate a multi-constraint optimization problem
• Dynamic programming is at the core of partitioning and mapping the DNN (a minimal sketch follows below)
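Not the actual planner of either paper, just a minimal sketch of the flavour of dynamic program involved, with hypothetical per-layer costs and ignoring communication and replication: split a chain of layers into contiguous stages so that the slowest stage is as fast as possible.

```python
def partition_min_max(costs, num_stages):
    """DP over (first j layers, k stages): minimize the maximum stage cost."""
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    dp = [[INF] * (num_stages + 1) for _ in range(n + 1)]   # dp[j][k]
    cut = [[0] * (num_stages + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for k in range(1, min(j, num_stages) + 1):
            for i in range(k - 1, j):                       # last stage = layers i..j-1
                cand = max(dp[i][k - 1], prefix[j] - prefix[i])
                if cand < dp[j][k]:
                    dp[j][k], cut[j][k] = cand, i
    # recover the stage boundaries
    stages, j = [], n
    for k in range(num_stages, 0, -1):
        i = cut[j][k]
        stages.append(list(range(i, j)))
        j = i
    return dp[n][num_stages], stages[::-1]

layer_costs = [4, 2, 7, 1, 3, 5, 2, 6]          # hypothetical per-layer runtimes
best, stages = partition_min_max(layer_costs, num_stages=4)
print("max stage cost:", best, "stages:", stages)
```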
22
To be continued
Extremely memory-efficient training
To tackle the low-throughput problem of GEMS in high-resolution histopathology-image scenarios:
• Mixed-precision training, FP16 together with FP32
• Re-computation: a trade-off between computation and memory
• The ZeRO technique from DeepSpeed: a trade-off between communication and memory
• Harnessing sparsity: remove zeros from computation and storage, a trade-off between accuracy and computation/memory
• …
Closing notes
We investigated several SOTA pipeline-parallel training techniques in ML:
• Compared to data parallelism, it enables training models that do not fit in a single device's memory
• Compared to model parallelism, it improves throughput
• It is a multi-objective optimization problem: computation efficiency (fewer bubbles), memory overhead (lower is better), and convergence guarantees (synchronous updates)
• It lays the foundation for auto-parallelism
25
Closing notes
Thank you
26
