[DL Reading Group] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
1. Deep Transformers without Shortcuts:
Modifying Self-attention for Faithful Signal Propagation
Shohei Taniguchi, Matsuo Lab
2. Deep Transformers without Shortcuts
Bibliographic information
Authors
• Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock,
Samuel L Smith, Yee Whye Teh (DeepMind)
Overview
• Modifies the Transformer so that it can be trained without layer normalization or skip connections
• ICLR 2023 accepted
18. Deep Transformers without Shortcuts
• Can a Transformer be trained without normalization or skip connections?
→ Yes, with some effort
• Simply removing normalization and skip connections makes the gradients explode
• With the proposed method, the gradients are kept largely under control
19. Deep Transformers without Shortcuts
• This paper targets causal masked attention of the kind used in GPT-style models
• The mask M keeps each position from attending to future positions in the sequence

Attn(X) = A(X) V(X)
A(X) = softmax( M ∘ (Q(X) K(X)^⊤ / √d_k) − Γ (1 − M) ),   M_{i,j} = 1_{i ≥ j}

where Γ is a sufficiently large constant.
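As a concrete illustration, here is a minimal NumPy sketch of this masked attention for a single head; the weight matrices `W_Q`, `W_K`, `W_V` and the value of `gamma` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def causal_masked_attention(X, W_Q, W_K, W_V, gamma=1e9):
    """A(X) V(X) with A(X) = softmax(M ∘ (Q K^T / sqrt(d_k)) - Γ (1 - M)), M_ij = 1[i >= j]."""
    T = X.shape[0]
    d_k = W_K.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    M = np.tril(np.ones((T, T)))                      # causal mask: M_ij = 1 if i >= j
    scores = M * (Q @ K.T / np.sqrt(d_k)) - gamma * (1.0 - M)

    # row-wise softmax; the -Γ term drives masked (future) positions to ~0 weight
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

# toy check: each position only attends to itself and earlier positions
rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_masked_attention(X, W_Q, W_K, W_V)
```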
20. Deep Transformers without Shortcuts
• First, consider an attention-only model without MLPs; the features at layer L are

X_L = [A_L A_{L−1} ⋯ A_1] X_0 W,   W = ∏_{l=1}^{L} W_l^V W_l^O

• Letting Σ_l = X_l X_l^⊤ and Π_l = A_l A_{l−1} ⋯ A_1, when W is an orthogonal matrix,

Σ_l = Π_l ⋅ Σ_0 ⋅ Π_l^⊤
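This identity can be checked numerically. Below is a small NumPy sketch; the orthogonal per-layer W_l and the lower-triangular, row-stochastic A_l are random stand-ins assumed only for the check, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 6, 8, 4

def random_orthogonal(d, rng):
    # orthogonal stand-in for W_l^V W_l^O
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def random_causal_attention(T, rng):
    # lower-triangular, row-stochastic stand-in for A_l
    A = np.tril(rng.uniform(size=(T, T))) + 1e-6 * np.eye(T)
    return A / A.sum(axis=1, keepdims=True)

X = rng.normal(size=(T, d))
Sigma0 = X @ X.T                     # Σ_0
Pi = np.eye(T)                       # Π_0

for l in range(1, L + 1):
    A = random_causal_attention(T, rng)
    W = random_orthogonal(d, rng)
    X = A @ X @ W                    # attention-only layer: no MLP, no skip, no LN
    Pi = A @ Pi                      # Π_l = A_l A_{l-1} ... A_1

# with orthogonal W, Σ_l = X_l X_l^T equals Π_l Σ_0 Π_l^T
print(np.allclose(X @ X.T, Pi @ Sigma0 @ Pi.T))   # -> True
```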
21. Deep Transformers without Shortcuts
• Letting Σ_l = X_l X_l^⊤ and Π_l = A_l A_{l−1} ⋯ A_1, when W is an orthogonal matrix,

Σ_l = Π_l ⋅ Σ_0 ⋅ Π_l^⊤

• If Σ_l stays close to the identity matrix, the gradients remain stable
→ we want to design A_l so that this holds
• However, A_l is constrained to be a lower-triangular matrix with non-negative entries
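To see why the design of A_l matters, here is a small sketch of my own (assuming uniform causal averaging as the attention matrix and an X_0 with orthonormal rows, so that Σ_0 = I_T): even a perfectly valid non-negative lower-triangular A_l quickly pushes Σ_l away from the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 6, 16, 10

# start with orthonormal rows so that Σ_0 = X_0 X_0^T = I_T
Q, _ = np.linalg.qr(rng.normal(size=(d, T)))
X = Q.T                                        # shape (T, d), rows orthonormal

# uniform causal averaging: non-negative, lower triangular, row stochastic
A = np.tril(np.ones((T, T)))
A /= A.sum(axis=1, keepdims=True)

for l in range(1, L + 1):
    X = A @ X                                  # attention-only update (W = I for simplicity)
    Sigma = X @ X.T
    print(l, np.linalg.norm(Sigma - np.eye(T)))   # deviation from I_T grows with depth
```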
22. Deep Transformers without Shortcuts
• Setting A_l = L_l L_{l−1}^{−1}, with L_0 chosen so that L_0^{−1} Σ_0 L_0^{−⊤} = I_T, gives

Σ_l = L_l L_l^⊤

• This corresponds to a Cholesky decomposition of Σ_l
• So by designing a suitable Σ_l and computing its Cholesky factor L_l, we can construct an A_l that satisfies the conditions
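A numerical sketch of this construction (NumPy); the AR(1)-style target kernel ρ^{|i−j|} is a hypothetical choice of Σ_l used only for illustration, whereas the paper designs Σ_l more carefully, in particular so that the resulting A_l also has non-negative entries.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 6, 8, 3

X0 = rng.normal(size=(T, d))
Sigma0 = X0 @ X0.T
Ls = [np.linalg.cholesky(Sigma0)]       # L_0 = chol(Σ_0), so L_0^{-1} Σ_0 L_0^{-T} = I_T

def target_sigma(T, rho):
    # hypothetical target covariance Σ_l: AR(1)-style kernel ρ^{|i-j|} (positive definite)
    idx = np.arange(T)
    return rho ** np.abs(idx[:, None] - idx[None, :])

As = []
for l in range(1, L + 1):
    Ls.append(np.linalg.cholesky(target_sigma(T, rho=l / (L + 1))))
    As.append(Ls[l] @ np.linalg.inv(Ls[l - 1]))     # A_l = L_l L_{l-1}^{-1}

Pi = np.linalg.multi_dot(As[::-1])                  # Π_L = A_L ... A_1 = L_L L_0^{-1}
print(np.allclose(Pi @ Sigma0 @ Pi.T, Ls[-1] @ Ls[-1].T))    # Σ_L = Π_L Σ_0 Π_L^T = L_L L_L^T
print(all(np.allclose(A, np.tril(A)) for A in As))           # each A_l is lower triangular
```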
31. References
[1] Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. "Resurrecting the
sigmoid in deep learning through dynamical isometry: theory and practice."
Advances in neural information processing systems 30 (2017).
[2] Xiao, Lechao, et al. "Dynamical isometry and a mean field theory of cnns: How to
train 10,000-layer vanilla convolutional neural networks." International Conference
on Machine Learning. PMLR, 2018.
[3] Bachlechner, Thomas, et al. "Rezero is all you need: Fast convergence at large
depth." Uncertainty in Artificial Intelligence. PMLR, 2021. APA
32. References
[4] Burkholz, Rebekka, and Alina Dubatovka. "Initialization of relus for dynamical
isometry." Advances in Neural Information Processing Systems 32 (2019).
[5] Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. "Attention is not all
you need: Pure attention loses rank doubly exponentially with depth." International
Conference on Machine Learning. PMLR, 2021.
[6] He, Bobby, et al. "Deep Transformers without Shortcuts: Modifying Self-attention
for Faithful Signal Propagation." The Eleventh International Conference on Learning
Representations. 2023.