Rethinking Perturbations in
Encoder-Decoders for Fast Training
Sho Takase (Tokyo Institute of Technology)
Shun Kiyono (RIKEN / Tohoku University)
Summary
• Explore time-efficient perturbation methods
– Compare perturbations across various seq-to-seq tasks
• MT, summarization, grammatical error correction
• Simple techniques are more time-efficient
– Simple perturbations (e.g., word dropout) achieve scores comparable to
complex ones (e.g., scheduled sampling) with shorter training time
• Use simple perturbations as a first step to construct a strong model
Exposure bias in seq-to-seq models
• Gap between training and inference in the decoder
– Training: the decoder input is the correct (gold) sequence
– Test: the decoder input is its own predicted sequence
• This gap (probably) harms the output sequences
[Figure — Training: encoder input "Où est …", decoder input "<B> Where is …" (the correct prefix), decoder output "Where is …". Test: decoder input "<B> What is …" (the predicted prefix), decoder output "What is …".]
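To make the gap concrete, here is a minimal, hypothetical PyTorch sketch (a toy GRU decoder, not the paper's Transformer; all sizes and names are illustrative) contrasting the teacher-forced decoder input at training time with the free-running input at test time:

```python
import torch
import torch.nn as nn

vocab_size, hidden = 10, 16
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)
proj = nn.Linear(hidden, vocab_size)
gold = torch.tensor([[2, 5, 7, 3]])               # <B> w1 w2 <E>, batch size 1

# Training (teacher forcing): the decoder always sees the *gold* prefix.
train_logits = proj(rnn(embed(gold[:, :-1]))[0])  # input <B> w1 w2, predicts w1 w2 <E>

# Test (free running): the decoder sees its *own* predictions, so errors can compound.
tok, h, preds = gold[:, :1], None, []
for _ in range(3):
    out, h = rnn(embed(tok), h)
    tok = proj(out).argmax(dim=-1)                # feed the prediction back in
    preds.append(tok)
print(train_logits.shape, torch.cat(preds, dim=1))
```

At test time each predicted token becomes the next decoder input, so an early mistake can propagate; this is the gap the following slides try to close.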
Address exposure bias
• Scheduled sampling [Bengio+ 15]
– Probabilistically use predicted tokens during training (sketched below)
• We can also add other kinds of noise to the inputs
– Adversarial perturbations, word dropout, …
– We call such noise "perturbations" in this study
[Figure: encoder input "Où est ma …"; the decoder input "<B> What is you …" mixes correct tokens ("Where is my …") with predicted tokens; decoder output "What was you …".]
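A minimal sketch of the mixing step, assuming a two-pass recipe in which a first forward pass produces predictions and the decoder input for the loss-bearing pass keeps each gold token with probability alpha; model, src, and alpha are illustrative names, not the authors' code:

```python
import torch

def scheduled_sampling_inputs(model, src, gold_in, alpha):
    """Keep each gold token with probability alpha; otherwise use the model's prediction."""
    with torch.no_grad():
        pred = model(src, gold_in).argmax(dim=-1)   # first pass: per-position predictions
    keep = torch.rand(gold_in.shape) < alpha        # Bernoulli mask over positions
    return torch.where(keep, gold_in, pred)         # mixed decoder input for the second pass

# Toy usage with a dummy "model" that returns random logits over a vocabulary of 100.
dummy = lambda src, tgt_in: torch.randn(tgt_in.size(0), tgt_in.size(1), 100)
gold_in = torch.randint(0, 100, (2, 6))
print(scheduled_sampling_inputs(dummy, None, gold_in, alpha=0.7))
```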
Are perturbations widely used?
• There are various (complex) perturbations
• Researchers report that (complex) perturbations are useful
– Scheduled sampling [Zhang+ 19] received an ACL Best Paper award
• However, perturbations are NOT widely used
Recent WMT systems didn’t use perturbations.
We also didn’t use perturbations but achieved
the top score on WMT 2020 news translation!!!
(Re-)investigate the usefulness of perturbations
Focus on computational time
• We know perturbations improve performance
• But are perturbations time-efficient?
– Complex perturbations require considerable extra time
• Scheduled sampling: requires as many decoding computations as the sequence length
• Adversarial: requires an extra backpropagation pass to obtain the perturbations
• We compare perturbations in terms of computational time
– Are the improvements worth the extra training time?
Perturbations in this study
[Figure: encoder-decoder producing y1 y2 …; at each position the correct input token x_i may be replaced by x'_i (Word replacement, Rep), its embedding e(x'_i) is scaled by a mask b_{x_i} (Word dropout, WDrop) and shifted by a noise vector r_{x_i} (Adversarial perturbation, Adv); likewise for y_i on the decoder side.]
Perturbation                     Speed
w/o perturbation (Vanilla)       ×1.00
Adv                              ×0.33
WDrop                            ×1.00
Rep (Uniform, Uni)               ×1.00
Rep (Similarity, Sim)            ×0.92
Rep (Scheduled sampling, SS)     ×0.87
(Computational speed relative to Vanilla, measured with Transformer (Big))
Word replacement (Rep)
• Randomly replace words with sampled ones
– Where is my cat ? → What is my dog ?
• Three sampling distributions
– Uniform (Uni): sample uniformly from the vocabulary
– Similarity (Sim)
• Based on embeddings: Q_{x_i} = softmax(E_x e(x_i))
– Conditional probability = Scheduled sampling (SS)
• Prediction of the decoder: Q_{y_i} = p(ŷ_i | y'_{0:i−1}, X)
[Paper excerpt shown on the slide, translated and reconstructed]
Encoder-decoder: p(Y | X) = Π_{j=1}^{J+1} p(y_j | y_{0:j−1}, X), where X = x_{1:I}, Y = y_{1:J+1}, and y_0 / y_{J+1} are the begin- and end-of-sentence tokens.
Training objective: minimize L(θ) = −(1/|D|) Σ_{(X_n, Y_n) ∈ D} log p(Y_n | X_n; θ) over the training data D = {(X_n, Y_n)}_{n=1}^{|D|}.
Rep(Sim): Q_{x_i} = softmax(E_x e(x_i)), where E_x is the embedding matrix and e(x_i) the embedding of token x_i; tokens whose embeddings are similar to e(x_i) receive high probability.
Rep(SS) [Bengio+ 15]: Q_{y_i} = p(ŷ_i | y'_{0:i−1}, X), the decoder's own prediction (scheduled sampling); the schedule α_t decreases from 1 toward q as training proceeds (q and k are hyperparameters).
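As an illustration of Rep(Sim) under the formula above, here is a minimal PyTorch sketch (rep_sim and alpha are made-up names; this is not the released implementation) that samples replacement tokens from softmax(E_x e(x_i)) and applies them with probability alpha:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rep_sim(tokens, emb_weight, alpha):
    # tokens: (batch, len) LongTensor; emb_weight: (vocab, dim) embedding matrix E_x
    e_x = emb_weight[tokens]                           # e(x_i) for every position
    logits = e_x @ emb_weight.t()                      # E_x e(x_i): similarity scores over the vocab
    q = F.softmax(logits, dim=-1)                      # Q_{x_i}
    sampled = torch.multinomial(q.view(-1, q.size(-1)), 1).view(tokens.shape)
    replace = torch.rand(tokens.shape) < alpha         # replace each position with probability alpha
    return torch.where(replace, sampled, tokens)

emb = nn.Embedding(100, 32)                            # toy vocabulary of 100 types
x = torch.randint(0, 100, (2, 5))
print(rep_sim(x, emb.weight.detach(), alpha=0.3))
```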
Word dropout (WDrop)
• Randomly replace embeddings
with zero vectors
b_{x_i} = 1 with a fixed keep probability, and 0 otherwise (the embedding is zeroed when b_{x_i} = 0)
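A minimal illustrative sketch (the word_dropout helper is hypothetical, not the released code): sample a 0/1 mask per position and multiply it into the embeddings so that dropped positions become zero vectors:

```python
import torch

def word_dropout(embedded, drop_prob):
    # embedded: (batch, len, dim); b_{x_i} = 0 with probability drop_prob, else 1
    b = (torch.rand(embedded.shape[:2], device=embedded.device) >= drop_prob).float()
    return embedded * b.unsqueeze(-1)   # dropped positions become zero vectors

emb = torch.randn(2, 5, 8)
print(word_dropout(emb, drop_prob=0.1)[0])
```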
Adversarial perturbation (Adv)
• Noise that increases (damages) the loss value
– Add the noise to the embeddings
– Find the noise from the gradients of the loss
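A minimal sketch of the gradient-based construction, assuming an epsilon-scaled, L2-normalized gradient step and a toy linear head so the snippet is self-contained; the actual method computes the gradient of the encoder-decoder loss with respect to the input embeddings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.randn(2, 5, 8, requires_grad=True)   # embedded input e(x)
head = nn.Linear(8, 3)                           # toy classifier standing in for the model
target = torch.tensor([0, 2])

loss = F.cross_entropy(head(emb.mean(dim=1)), target)
grad, = torch.autograd.grad(loss, emb)           # dL / d e(x)

epsilon = 1.0
r = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)  # perturbation r_{x_i}
perturbed = emb.detach() + r                     # train on e(x) + r, the loss-increasing direction
print(perturbed.shape)
```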
Summary of perturbations
Perturbation                     Speed
w/o perturbation (Vanilla)       ×1.00
Adv                              ×0.33
WDrop                            ×1.00
Rep (Uniform, Uni)               ×1.00
Rep (Similarity, Sim)            ×0.92
Rep (Scheduled sampling, SS)     ×0.87
(Computational speed relative to Vanilla, measured with Transformer (Big))
[Figure: the same encoder-decoder diagram as slide 7; Adv and Rep(SS) are marked as Complex, WDrop and Rep(Uni)/Rep(Sim) as Simple.]
BLEU improvements in MT
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Test: newstest2010-2016
– Report averaged BLEU
– Higher is better
• Method: Transformer(Big)
• Simple perturbations
outperform complex ones
– Similar results in GEC
and summarization
[Bar chart: averaged BLEU for each perturbation, grouped into simple and complex.]
Time to reach the BLEU score of Vanilla
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Method: Transformer(Big)
• Time to reach the BLEU score of Vanilla on newstest2013
– Lower is better (faster)
• Simple perturbations reach a high BLEU score with less training time
[Bar chart: training time to reach Vanilla's BLEU score, grouped into simple and complex.]
Robustness of each model
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Method: Transformer(Big)
• Replace input tokens with randomly sampled tokens at various replacement ratios (sketched below)
– Report the averaged BLEU score on newstest2010-2016
• Rep(Sim) is the most robust
– We will investigate the reason in future work
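A minimal sketch of the corruption step used in this robustness check (helper and variable names are hypothetical); the real evaluation translates the corrupted sources and averages BLEU over newstest2010-2016:

```python
import random

def corrupt(tokens, vocab, ratio, rng=random):
    # Replace each token with a randomly sampled one with probability `ratio`.
    return [rng.choice(vocab) if rng.random() < ratio else t for t in tokens]

vocab = ["where", "is", "my", "cat", "dog", "what", "?"]
src = ["where", "is", "my", "cat", "?"]
for ratio in (0.0, 0.1, 0.3):
    print(ratio, corrupt(src, vocab, ratio))
    # In the real setup: translate(corrupt(src, ...)) and score BLEU for each ratio.
```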
Conclusion
• Explore time-efficient perturbation methods
– Compare perturbations across various seq-to-seq tasks
• MT, summarization, grammatical error correction
• Simple techniques are more time-efficient
– Simple perturbations (e.g., word dropout) achieve scores comparable to
complex ones (e.g., scheduled sampling) with shorter training time
– Replacement based on similarity constructs a more robust model
• Use simple perturbations as a first step to construct a strong model
• Our code is publicly available: https://github.com/takase/rethink_perturbations