
On the Vanishing Gradient Problem When Stacking Transformer Layers and How to Solve It
(Transformerを多層にする際の勾配消失問題と解決法について)

Presentation slides from the 28th Annual Meeting of the Association for Natural Language Processing

Slide 1: On the Vanishing Gradient Problem When Stacking Transformer Layers and How to Solve It
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
Tokyo Institute of Technology, RIKEN, Tohoku University, Preferred Networks (PFN)
2022/3/15

Slide 2: Transformer architectures and a summary of this work
• Transformers are broadly divided into two types by the position of Layer Normalization (LN)
  – Post-LN: LN is applied after each residual connection
  – Pre-LN: LN is applied before each sub-layer (before the function application)

Contributions of this work:
  ・Experimentally show the performance gap between Post-LN and Pre-LN
  ・Identify why deep Post-LN models are hard to train
  ・Propose a method that enables deep stacking while keeping the high performance

                  Performance   Deep stacking
  Post-LN              ○              ×
  Pre-LN               ×              ○
  B2T (proposed)       ○              ○

[Figure: layer structures of (a) Post-LN, (b) Pre-LN, and (c) Post-LN with the B2T connection; each layer consists of Attention and FFN sub-layers with Layer Norms, stacked × N]

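For readers who want a concrete reference for (a) Post-LN and (b) Pre-LN, the following is a minimal PyTorch sketch of one encoder layer of each type. This is an illustration based on the figure, not the authors' implementation; dropout, masking, and encoder-decoder cross-attention are omitted, and the dimensions are assumed defaults.

```python
import torch
import torch.nn as nn

class PostLNLayer(nn.Module):
    """Post-LN: LayerNorm is applied after each residual connection."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # LN after the residual sum
        x = self.ln2(x + self.ffn(x))            # LN after the residual sum
        return x

class PreLNLayer(nn.Module):
    """Pre-LN: LayerNorm is applied before each sub-layer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]            # residual path bypasses LN
        x = x + self.ffn(self.ln2(x))            # residual path bypasses LN
        return x
```
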
Slide 3: We want more layers to improve performance
• Performance scales roughly with the logarithm of the number of parameters
• How should we add parameters?
  – widen the hidden dimensions, or stack more layers

[Figure: BLEU scores of Transformer encoder-decoders on translation, comparing a 6-layer model, a 6-layer model with wider dimensions, and an 18-layer model]
Stacking more layers gives higher performance than widening the dimensions
(the slope of the improvement looks better) → we want to stack more layers

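As a rough back-of-the-envelope companion to this slide, the sketch below counts the parameters of the two ways of growing a model, widening versus deepening. The dimensions used here (512/2048 base, 1024/4096 wide) are assumptions for illustration, not the exact configurations compared in the experiments.

```python
def transformer_layer_params(d_model: int, d_ff: int) -> int:
    # self-attention (4 projection matrices) + FFN (2 linear layers);
    # biases, embeddings, and LayerNorm parameters are ignored in this estimate
    return 4 * d_model * d_model + 2 * d_model * d_ff

base = 6 * transformer_layer_params(512, 2048)    # 6 layers, base width
wide = 6 * transformer_layer_params(1024, 4096)   # 6 layers, wider dimensions
deep = 18 * transformer_layer_params(512, 2048)   # 18 layers, base width
print(f"6-layer base: {base / 1e6:.1f}M parameters")
print(f"6-layer wide: {wide / 1e6:.1f}M parameters")
print(f"18-layer:     {deep / 1e6:.1f}M parameters")
```
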
Slide 4: Deep Post-LN models are hard to train
• A deep Post-LN model (e.g., 18 layers) is hard to train
  – vanishing gradients prevent training from making progress
• Behavior of an 18-layer Transformer encoder-decoder on a translation task:

[Figure: training loss, validation loss, and per-layer gradient norms (encoder side and decoder side), gradient norms on a log scale roughly from 10^-1 to 10^1]
Post-LN: the training loss does not decrease and the validation loss stays high;
vanishing gradients occur on the decoder side.

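The per-layer gradient norms shown in the figure can be collected with a few lines of PyTorch. The sketch below is a guess at a typical setup, not the authors' measurement code; it assumes parameter names containing `encoder.layers.<i>` / `decoder.layers.<i>`, as in fairseq-style models.

```python
import re
from collections import defaultdict

def layer_gradient_norms(model):
    """Aggregate gradient norms per (encoder/decoder, layer index) after backward()."""
    squared = defaultdict(float)
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        match = re.search(r"(encoder|decoder)\.layers\.(\d+)", name)
        if match:
            squared[(match.group(1), int(match.group(2)))] += param.grad.norm().item() ** 2
    return {key: value ** 0.5 for key, value in sorted(squared.items())}

# Usage: run a forward pass and loss.backward(), then plot
# layer_gradient_norms(model) on a log scale (the slide's axes span ~1e-1 to 1e1).
```
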
Slide 5: LN is the cause of the vanishing gradient
• Investigating the cause of the vanishing gradient
  – measure the gradient norm at each location inside the 18th decoder layer

[Figure: gradient norms at numbered locations (1)-(5) around the Layer Norm, Attention, and FFN blocks of the layer]
The gradient shrinks sharply from (4) to (3) and from (2) to (1)
→ the gradient shrinks sharply whenever it crosses an LN
→ LN attenuates the gradient
→ LN is the cause of the vanishing gradient

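To measure the gradient norm at a specific location inside a layer, a hook can be attached to the intermediate tensor. The toy sketch below is my own illustration (not the paper's experiment): a single Linear layer stands in for a sub-layer, and the gradient norm is recorded just below and just above one LayerNorm.

```python
import torch
import torch.nn as nn

norms = {}

def tap(tensor, key):
    # register_hook fires during back-propagation with the gradient w.r.t.
    # this intermediate tensor, i.e. the gradient "at this location"
    tensor.register_hook(lambda grad: norms.__setitem__(key, grad.norm().item()))
    return tensor

d_model = 8
layer_norm, sublayer = nn.LayerNorm(d_model), nn.Linear(d_model, d_model)
x = torch.randn(4, d_model, requires_grad=True)

s = tap(x + sublayer(x), "below_LN")   # residual sum, before the LayerNorm
y = tap(layer_norm(s), "above_LN")     # output of the LayerNorm
y.sum().backward()

# Comparing the two recorded norms shows how the gradient changes when it
# crosses the LayerNorm during back-propagation.
print(norms["above_LN"], norms["below_LN"])
```
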
Slide 6: Why can Pre-LN be trained?
• Pre-LN has a gradient path that is not affected by LN
  – not affected by LN = the gradient is not attenuated
• Compare the Post-LN and Pre-LN formulations
  – let x be the input and F(·) an attention or FFN sub-layer; the forward
    computations and gradients are given in the paper excerpt below
(Excerpt from the paper, Section 2 "Post-LN and Pre-LN Transformers")
The original Transformer (Vaswani et al., 2017) uses Post-LN, in which layer normalizations are located after each residual connection. Let x be an input of a sub-layer, and F(·) be a sub-layer of the Transformer such as a feed-forward network or multi-head attention. Post-LN is defined as

  PostLN(x) = LN(x + F(x)),   (1)

where LN(·) is the layer normalization function. In contrast, Pre-LN places the layer normalization before the input of each sub-layer:

  PreLN(x) = x + F(LN(x)).   (2)

[Figure 4 caption: gradient norms at each location in the 18th decoder layer of an 18-layer Post-LN Transformer encoder-decoder trained on WMT English-to-German translation data. Figure 5 caption: cosine similarities among the outputs of each layer.]

Gradient norms decay exponentially as they are back-propagated to shallower layers, consistent with the previous study (Liu et al., 2020), and this vanishing gradient makes it difficult to stack many layers in the Post-LN setting. The gradient norms drop drastically from (4) to (3) and from (2) to (1); these parts correspond to layer normalizations, so the layer normalizations in Post-LN Transformers probably cause the vanishing gradient problem. To compare the gradient flows of Post-LN and Pre-LN theoretically, the derivatives of Equations (1) and (2) are

  ∂PostLN(x)/∂x = ∂LN(x + F(x))/∂(x + F(x)) · (I + ∂F(x)/∂x),   (3)

  ∂PreLN(x)/∂x = I + (∂F(LN(x))/∂LN(x)) · (∂LN(x)/∂x),   (4)

where I is the identity matrix. As Equation (3) shows, the derivative of Post-LN is the product of two factors: the derivative of the layer normalization, and a factor consisting of the residual connection and the sub-layer F. In Pre-LN, by contrast, the derivative of the residual connection is isolated from the term involving the derivative of the layer normalization.
Post-LN: the whole gradient is a product with the derivative of LN → the gradient is attenuated.
Pre-LN: Equation (4) contains a term that is independent of the derivative of LN
(the residual connection bypasses LN) → this term helps preserve the gradient.

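As a small numerical companion to Equations (3) and (4), the sketch below compares the gradient that reaches the input x of one Post-LN block and one Pre-LN block. It is an illustration under toy assumptions (a single Linear layer as F), not the paper's analysis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
F_sub = nn.Linear(d_model, d_model)   # stands in for an attention/FFN sub-layer
LN = nn.LayerNorm(d_model)

def grad_norm_at_input(block):
    x = torch.randn(32, d_model, requires_grad=True)
    block(x).sum().backward()
    return x.grad.norm().item()

post_ln = lambda x: LN(x + F_sub(x))   # Eq. (1): every gradient path crosses LN
pre_ln = lambda x: x + F_sub(LN(x))    # Eq. (2): the identity path bypasses LN

print("Post-LN block:", grad_norm_at_input(post_ln))
print("Pre-LN block: ", grad_norm_at_input(pre_ln))
# In Pre-LN, the identity term I in Eq. (4) guarantees that part of the
# upstream gradient reaches x unchanged, regardless of LN's derivative.
```
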
Slide 7: Isn't Pre-LN good enough?
• Pre-LN is stable to train when many layers are stacked
  – recent deep models (e.g., GPT) use Pre-LN
• However, Pre-LN yields lower performance
  – when training succeeds, Post-LN outperforms Pre-LN
• Performance comparison with 6-layer Transformer encoder-decoders:

[Figure: BLEU on translation (WMT En-De) and ROUGE-1 on headline generation, Post-LN vs Pre-LN]

Slide 8: Why does Post-LN perform better?
• Post-LN can transform its input substantially at each layer
  – it uses the parameters of each layer effectively
• We examine the cosine similarities among the outputs of each layer
  – red: high similarity
  – blue: low similarity

[Figure 5: cosine similarities among the outputs of each layer (encoder and decoder) for (a) Post-LN, (b) Pre-LN, and (c) the B2T connection]
Post-LN: the similarity between the outputs of the first and the last layer is low
(lower-left of the matrix) → the input is transformed substantially.
Why is the similarity high for Pre-LN?
→ the residual connection bypasses LN → the input x is directly connected to the output.

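One plausible way to reproduce this kind of similarity matrix (the aggregation over tokens here is an assumption; the paper may aggregate differently) is to flatten each layer's hidden states and compare all pairs:

```python
import torch
import torch.nn.functional as F

def layer_cosine_matrix(layer_outputs):
    """layer_outputs: list of L tensors, each of shape [num_tokens, d_model]."""
    vectors = torch.stack([h.flatten() for h in layer_outputs])  # [L, num_tokens * d_model]
    vectors = F.normalize(vectors, dim=-1)
    return vectors @ vectors.T                                   # [L, L] cosine similarities

# A low value in the corner comparing the first and the last layer means the
# stack transforms the input substantially (the Post-LN pattern on the slide);
# values close to 1 everywhere indicate the input is mostly passed through.
```
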
Slide 9: Proposed method: B2T Connection
• We want to stack many layers while keeping Post-LN's high performance
• What is needed?
  1. Prevent vanishing gradients: bypass LN as much as possible
  2. Prevent the input from being directly connected to the output: transform the input at each layer

Proposed method: add the residual connection shown in red in the figure,
a path from the bottom to the top of each layer (Bottom-to-Top, B2T, connection).
  1. It bypasses everything in the layer except the final LN
     → suppresses the gradient attenuation caused by LN
  2. The final LN is applied to the layer input as well

[Figure: (a) Post-LN, (b) Pre-LN, (c) Post-LN with the B2T connection]

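Below is a minimal sketch of a Post-LN encoder layer with a B2T connection, written directly from the description on this slide: the added path carries the layer input x to just before the layer's final LN. It is an illustration, so the exact placement, and in particular the decoder-side formulation, should be checked against the paper.

```python
import torch
import torch.nn as nn

class B2TPostLNLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        a = self.ln1(x + self.attn(x, x, x)[0])  # ordinary Post-LN attention sub-layer
        # B2T connection: the layer input x is added just before the final LN,
        # so the extra path bypasses every LN except this last one (point 1)
        # and the final LN is applied to the input as well (point 2).
        return self.ln2(a + self.ffn(a) + x)
```
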
Slide 10: Effect of the proposed method
• With the B2T connection introduced:

[Figure: per-layer gradient norms of an 18-layer Transformer encoder-decoder (encoder side and decoder side) for Post-LN, Pre-LN, and the B2T connection]
The proposed method prevents the vanishing gradient.
Its transformation behavior is similar to Post-LN → high performance can be expected.

Slide 11: Related work
• DLCL [Wang+ 19]: uses a weighted sum of the outputs of the previous layers as the input to each layer
  – drawback: requires additional learnable parameters
• Admin [Liu+ 20]: introduces parameters that suppress the variance of the outputs
  – drawback: incurs additional computational cost
    • a forward pass before training is needed to initialize the additional parameters
• As long as they are based on Post-LN, their performance is presumably comparable
  – the proposed method enables deep stacking with no additional parameters or computational cost

Slide 12: Experiments
• What the experiments show:
  – when training succeeds, Post-LN outperforms Pre-LN
  – the proposed method (B2T connection) makes training with many layers possible
  – the B2T connection retains the advantages of Post-LN
• NLP tasks used in the experiments
  – experiments with 6-layer and 18-layer encoder-decoders
  – machine translation: WMT En-De
    • training data: 4.5M sentence pairs from WMT (a widely used dataset)
    • evaluation: average BLEU over the 2010-2016 test sets
  – headline generation (an abstractive summarization task)
    • generate a headline from the first sentence of a news article taken from English Gigaword
    • training data: 13M sentence pairs constructed from Gigaword + REALNEWS + NewsCrawl
    • evaluation: ROUGE on 1,951 sentence pairs

Slide 13: Results on the machine translation task

[Table: BLEU for 6-layer and 18-layer encoder-decoders]
• 6-layer encoder-decoder: Post-LN outperforms Pre-LN;
  the B2T connection performs on par with Post-LN.
• 18-layer encoder-decoder: Post-LN fails to train;
  the B2T connection trains successfully and also outperforms the other methods.

Slide 14: Results on the headline generation task

[Table: ROUGE for 6-layer and 18-layer encoder-decoders]
• 6-layer encoder-decoder: Post-LN outperforms Pre-LN;
  the B2T connection performs on par with Post-LN.
• 18-layer encoder-decoder: Post-LN fails to train;
  the B2T connection trains successfully and also outperforms the other methods.

Slide 15: Another modality: experiments on speech recognition
• Dataset: LibriSpeech (1,000 hours of English speech)
• Setup: 6-layer and 12-layer encoders (the decoder is fixed at 6 layers)
• Evaluation: average word error rate on the dev and test sets (lower is better)

[Table: results for the 6-layer encoder / 6-layer decoder and the 12-layer encoder / 6-layer decoder settings; Post-LN fails to train in the 12-layer setting]

Slide 16: Summary
• Background: Transformers fall into two types according to the position of LN
• Problem: Post-LN becomes unstable to train when many layers are stacked
  – LN causes vanishing gradients
• Contributions of this work
  – showed that Post-LN outperforms Pre-LN
  – proposed a method that keeps the advantages of Post-LN while enabling deep stacking
    • investigated what stabilizes training and what underlies Post-LN's advantage
  – demonstrated the above through experiments on multiple tasks