Rethinking Perturbations in
Encoder-Decoders for Fast Training
Sho Takase (Tokyo Institute of Technology)
Shun Kiyono (RIKEN / Tohoku University)
Summary
• Explore time-efficient perturbation methods
– Compare perturbations across various seq-to-seq tasks
• MT, summarization, grammatical error correction
• Simple techniques are more time-efficient
– Simple perturbations (e.g., word dropout) achieve scores comparable to
complex ones (e.g., scheduled sampling) with shorter training time
• Use simple perturbations as a first step to construct a strong model
Exposure bias in seq-to-seq models
• Gap between training and inference in the decoder
– Training: the decoder input is the correct (gold) sequence
– Test: the decoder input is its own predicted sequence
• This gap (probably) harms the output sequences
[Figure — Training: encoder input "Où est …", decoder input "<B> Where is …" (the correct prefix), decoder output "Where is …". Test: decoder input "<B> What is …" (the predicted prefix), decoder output "What is …".]
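To make the gap concrete, here is a minimal, hypothetical PyTorch sketch (a toy GRU decoder, not the paper's Transformer; all sizes and names are illustrative) contrasting the teacher-forced decoder input at training time with the free-running input at test time:

```python
import torch
import torch.nn as nn

vocab_size, hidden = 10, 16
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)
proj = nn.Linear(hidden, vocab_size)
gold = torch.tensor([[2, 5, 7, 3]])               # <B> w1 w2 <E>, batch size 1

# Training (teacher forcing): the decoder always sees the *gold* prefix.
train_logits = proj(rnn(embed(gold[:, :-1]))[0])  # input <B> w1 w2, predicts w1 w2 <E>

# Test (free running): the decoder sees its *own* predictions, so errors can compound.
tok, h, preds = gold[:, :1], None, []
for _ in range(3):
    out, h = rnn(embed(tok), h)
    tok = proj(out).argmax(dim=-1)                # feed the prediction back in
    preds.append(tok)
print(train_logits.shape, torch.cat(preds, dim=1))
```

At test time each predicted token becomes the next decoder input, so an early mistake can propagate; this is the gap the following slides try to close.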
Address exposure bias
• Scheduled sampling [Bengio+ 15]
– Probabilistically use predicted tokens during training (sketched below)
• We can also add other kinds of noise to the inputs
– Adversarial perturbations, word dropout, …
– We call such noise "perturbations" in this study
[Figure: encoder input "Où est ma …"; the decoder input "<B> What is you …" mixes correct tokens ("Where is my …") with predicted tokens; decoder output "What was you …".]
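A minimal sketch of the mixing step, assuming a two-pass recipe in which a first forward pass produces predictions and the decoder input for the loss-bearing pass keeps each gold token with probability alpha; model, src, and alpha are illustrative names, not the authors' code:

```python
import torch

def scheduled_sampling_inputs(model, src, gold_in, alpha):
    """Keep each gold token with probability alpha; otherwise use the model's prediction."""
    with torch.no_grad():
        pred = model(src, gold_in).argmax(dim=-1)   # first pass: per-position predictions
    keep = torch.rand(gold_in.shape) < alpha        # Bernoulli mask over positions
    return torch.where(keep, gold_in, pred)         # mixed decoder input for the second pass

# Toy usage with a dummy "model" that returns random logits over a vocabulary of 100.
dummy = lambda src, tgt_in: torch.randn(tgt_in.size(0), tgt_in.size(1), 100)
gold_in = torch.randint(0, 100, (2, 6))
print(scheduled_sampling_inputs(dummy, None, gold_in, alpha=0.7))
```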
Are perturbations widely used?
• There are various (complex) perturbations
• Researchers report that (complex) perturbations are useful
– Scheduled sampling [Zhang+ 19] received an ACL Best Paper award
• However, perturbations are NOT widely used
Recent WMT systems didn’t use perturbations.
We also didn’t use perturbations but achieved
the top score on WMT 2020 news translation!!!
(Re-)investigate the usefulness of perturbations
Focus on computational time
• We know perturbations improve performance
• But are perturbations time-efficient?
– Complex perturbations require considerable extra time
• Scheduled sampling: requires as many decoding computations as the sequence length
• Adversarial: requires an extra backpropagation pass to obtain the perturbations
• We compare perturbations in terms of computational time
– Are the improvements worth the extra training time?
Perturbations in this study
[Figure: encoder-decoder producing y1 y2 …; at each position the correct input token x_i may be replaced by x'_i (Word replacement, Rep), its embedding e(x'_i) is scaled by a mask b_{x_i} (Word dropout, WDrop) and shifted by a noise vector r_{x_i} (Adversarial perturbation, Adv); likewise for y_i on the decoder side.]
Perturbation                     Speed
w/o perturbation (Vanilla)       ×1.00
Adv                              ×0.33
WDrop                            ×1.00
Rep (Uniform, Uni)               ×1.00
Rep (Similarity, Sim)            ×0.92
Rep (Scheduled sampling, SS)     ×0.87
(Computational speed relative to Vanilla, measured with Transformer (Big))
Word replacement (Rep)
• Randomly replace words with sampled ones
– Where is my cat ? → What is my dog ?
• Three sampling distributions
– Uniform (Uni): sample uniformly from the vocabulary
– Similarity (Sim)
• Based on embeddings: Q_{x_i} = softmax(E_x e(x_i))
– Conditional probability = Scheduled sampling (SS)
• Prediction of the decoder: Q_{y_i} = p(ŷ_i | y'_{0:i−1}, X)
[Paper excerpt shown on the slide, translated and reconstructed]
Encoder-decoder: p(Y | X) = Π_{j=1}^{J+1} p(y_j | y_{0:j−1}, X), where X = x_{1:I}, Y = y_{1:J+1}, and y_0 / y_{J+1} are the begin- and end-of-sentence tokens.
Training objective: minimize L(θ) = −(1/|D|) Σ_{(X_n, Y_n) ∈ D} log p(Y_n | X_n; θ) over the training data D = {(X_n, Y_n)}_{n=1}^{|D|}.
Rep(Sim): Q_{x_i} = softmax(E_x e(x_i)), where E_x is the embedding matrix and e(x_i) the embedding of token x_i; tokens whose embeddings are similar to e(x_i) receive high probability.
Rep(SS) [Bengio+ 15]: Q_{y_i} = p(ŷ_i | y'_{0:i−1}, X), the decoder's own prediction (scheduled sampling); the schedule α_t decreases from 1 toward q as training proceeds (q and k are hyperparameters).
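As an illustration of Rep(Sim) under the formula above, here is a minimal PyTorch sketch (rep_sim and alpha are made-up names; this is not the released implementation) that samples replacement tokens from softmax(E_x e(x_i)) and applies them with probability alpha:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rep_sim(tokens, emb_weight, alpha):
    # tokens: (batch, len) LongTensor; emb_weight: (vocab, dim) embedding matrix E_x
    e_x = emb_weight[tokens]                           # e(x_i) for every position
    logits = e_x @ emb_weight.t()                      # E_x e(x_i): similarity scores over the vocab
    q = F.softmax(logits, dim=-1)                      # Q_{x_i}
    sampled = torch.multinomial(q.view(-1, q.size(-1)), 1).view(tokens.shape)
    replace = torch.rand(tokens.shape) < alpha         # replace each position with probability alpha
    return torch.where(replace, sampled, tokens)

emb = nn.Embedding(100, 32)                            # toy vocabulary of 100 types
x = torch.randint(0, 100, (2, 5))
print(rep_sim(x, emb.weight.detach(), alpha=0.3))
```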
Word dropout (WDrop)
• Randomly replace embeddings
with zero vectors
b_{x_i} = 1 with a fixed keep probability, and 0 otherwise (the embedding is zeroed when b_{x_i} = 0)
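A minimal illustrative sketch (the word_dropout helper is hypothetical, not the released code): sample a 0/1 mask per position and multiply it into the embeddings so that dropped positions become zero vectors:

```python
import torch

def word_dropout(embedded, drop_prob):
    # embedded: (batch, len, dim); b_{x_i} = 0 with probability drop_prob, else 1
    b = (torch.rand(embedded.shape[:2], device=embedded.device) >= drop_prob).float()
    return embedded * b.unsqueeze(-1)   # dropped positions become zero vectors

emb = torch.randn(2, 5, 8)
print(word_dropout(emb, drop_prob=0.1)[0])
```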
Adversarial perturbation (Adv)
• Noise that increases (damages) the loss value
– Add the noise to the embeddings
– Find the noise from the gradients of the loss
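A minimal sketch of the gradient-based construction, assuming an epsilon-scaled, L2-normalized gradient step and a toy linear head so the snippet is self-contained; the actual method computes the gradient of the encoder-decoder loss with respect to the input embeddings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.randn(2, 5, 8, requires_grad=True)   # embedded input e(x)
head = nn.Linear(8, 3)                           # toy classifier standing in for the model
target = torch.tensor([0, 2])

loss = F.cross_entropy(head(emb.mean(dim=1)), target)
grad, = torch.autograd.grad(loss, emb)           # dL / d e(x)

epsilon = 1.0
r = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)  # perturbation r_{x_i}
perturbed = emb.detach() + r                     # train on e(x) + r, the loss-increasing direction
print(perturbed.shape)
```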
Summary of perturbations
Perturbation                     Speed
w/o perturbation (Vanilla)       ×1.00
Adv                              ×0.33
WDrop                            ×1.00
Rep (Uniform, Uni)               ×1.00
Rep (Similarity, Sim)            ×0.92
Rep (Scheduled sampling, SS)     ×0.87
(Computational speed relative to Vanilla, measured with Transformer (Big))
[Figure: the same encoder-decoder diagram as slide 7; Adv and Rep(SS) are marked as Complex, WDrop and Rep(Uni)/Rep(Sim) as Simple.]
BLEU improvements in MT
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Test: newstest2010-2016
– Report averaged BLEU
– Higher is better
• Method: Transformer(Big)
• Simple perturbations
outperform complex ones
– Similar results in GEC
and summarization
[Bar chart: averaged BLEU for each perturbation, grouped into simple and complex.]
Time to reach the BLEU score of Vanilla
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Method: Transformer(Big)
• Time to reach the BLEU score of Vanilla on newstest2013
– Lower is better (faster)
• Simple perturbations reach a high BLEU score with less training time
[Bar chart: training time to reach Vanilla's BLEU score, grouped into simple and complex.]
Robustness of each model
• Training: WMT 16 En-De
– 4.5M sentence pairs
• Method: Transformer(Big)
• Replace input tokens with randomly sampled tokens at various replacement ratios (sketched below)
– Report the averaged BLEU score on newstest2010-2016
• Rep(Sim) is the most robust
– We will investigate the reason in future work
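A minimal sketch of the corruption step used in this robustness check (helper and variable names are hypothetical); the real evaluation translates the corrupted sources and averages BLEU over newstest2010-2016:

```python
import random

def corrupt(tokens, vocab, ratio, rng=random):
    # Replace each token with a randomly sampled one with probability `ratio`.
    return [rng.choice(vocab) if rng.random() < ratio else t for t in tokens]

vocab = ["where", "is", "my", "cat", "dog", "what", "?"]
src = ["where", "is", "my", "cat", "?"]
for ratio in (0.0, 0.1, 0.3):
    print(ratio, corrupt(src, vocab, ratio))
    # In the real setup: translate(corrupt(src, ...)) and score BLEU for each ratio.
```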
Conclusion
• Explore time-efficient perturbation methods
– Compare perturbations across various seq-to-seq tasks
• MT, summarization, grammatical error correction
• Simple techniques are more time-efficient
– Simple perturbations (e.g., word dropout) achieve scores comparable to
complex ones (e.g., scheduled sampling) with shorter training time
– Replacement based on similarity constructs a more robust model
• Use simple perturbations as a first step to construct a strong model
• Our code is publicly available: https://github.com/takase/rethink_perturbations