Introduction of Transformer
Lab Meeting Material
Yuta Niki
1st-year Master's Student
Izumi Lab. UTokyo
This Material’s Objective
◼Transformer and its advanced models (e.g., BERT) show
high performance!
◼Experiments with those models are necessary in
NLP×Deep Learning research.
◼First Step (in this slide)
• Learn basic knowledge of Attention
• Understand the architecture of Transformer
◼Next Step (in the future)
• Fine-Tuning for Sentiment Analysis, etc.
• Learn BERT, etc.
※Reference materials are collected on the last slide; you should read them.
※This is written in English because an international student came to the Lab.
What is “Transformer”?
◼Paper
• “Attention Is All You Need”[1]
◼Motivation
• Build a model with sufficient representation power for a difficult
task (the translation task in the paper)
• Train a model efficiently in parallel (RNNs cannot be trained in parallel)
◼Methods and Results
• Architecture with an attention mechanism and without RNNs
• Less time to train
• Achieves strong BLEU scores on the translation task
◼Application
• Use the Encoder, which has acquired strong representation power,
for other tasks via fine-tuning.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Transformer’s structure
◼Encoder(Left)
• Stack of 6 layers
• self-attention + feed-forward network(FFN)
◼Decoder(Right)
• Stack of 6 layers
• self-attention + source-target attention + FFN
◼Components
• Positional Encoding
• Multi-Head Attention
• Position-wise Feed-Forward Network
• Residual Connection
◼Regularization
• Residual Dropout
• Label Smoothing
• Attention Dropout
Positional Encoding
◼Proposed in “End-To-End Memory Network”[1]
◼Motivation
• Add information about the position of each word in the
sentence (the Transformer contains no RNN or CNN)
d_model: the dim. of the word embedding

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

where pos is the position and i is the dimension.
[1] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
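Below is a minimal NumPy sketch of the sinusoidal positional encoding defined above (my own illustration; the function name and the example sizes are not from the slides):

import numpy as np

def positional_encoding(max_len, d_model):
    # pos indexes positions, i indexes embedding dimensions
    pos = np.arange(max_len)[:, np.newaxis]                  # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                    # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                     # even dims: sin
    pe[:, 1::2] = np.cos(angle[:, 1::2])                     # odd dims: cos
    return pe                                                # added to the word embeddings

pe = positional_encoding(max_len=50, d_model=512)            # shape (50, 512)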
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax( Q K^T / √d_k ) V

where
Q ∈ ℝ^(n×d_k): query matrix
K ∈ ℝ^(n×d_k): key matrix
V ∈ ℝ^(n×d_v): value matrix
n: length of the sentence
d_k: dim. of queries and keys
d_v: dim. of values
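A minimal NumPy sketch of this formula (assuming Q, K, V have already been projected to these shapes; SciPy's softmax is used for convenience):

import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of values, shape (n, d_v)

n, d_k, d_v = 5, 64, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)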
2 Types of Attention
• Additive Attention [1]
  Att(H) = softmax( W H + b )
• Dot-Product Attention [2,3]
  Att(Q, K, V) = softmax( Q K^T ) V
[1] Bahdanau, Dzmitry, et al. “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR, 2015.
[2] Miller, Alexander, et al. “Key-Value Memory Networks for Directly Reading Documents.” EMNLP, 2016.
[3] Daniluk, Michal, et al. “Frustratingly Short Attention Spans in Neural Language Modeling.” ICLR, 2017.
In the Transformer, Dot-Product Attention is used.
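To make the contrast concrete, here is one way to read the two simplified formulas above in NumPy; the shapes and the single projection W, b for the additive case are my own assumptions:

import numpy as np
from scipy.special import softmax

def additive_attention(H, W, b):
    # scores come from a small feed-forward layer over the hidden states H: (n, d)
    scores = H @ W + b                     # (n,) one score per position; W: (d,), b: scalar
    weights = softmax(scores)              # attention weights over positions
    return weights @ H                     # weighted sum of hidden states, shape (d,)

def dot_product_attention(Q, K, V):
    # unscaled version shown on this slide; the Transformer adds the 1/sqrt(d_k) scaling
    return softmax(Q @ K.T, axis=-1) @ V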
Why Use Scaled Dot-Product Attention?
◼Dot-Product Attention is faster and more
efficient than Additive Attention.
• Additive Attention uses a feed-forward network as the
compatibility function.
• Dot-Product Attention can be implemented using highly
optimized matrix multiplication code.
◼Use the scaling term 1/√d_k so that Dot-Product Attention
still performs well when d_k is large
• Additive Attention outperforms Dot-Product Attention
without scaling for larger values of 𝑑 𝑘 [1]
[1] Britz, Denny, et al. “Massive Exploration of Neural Machine Translation Architectures." EMNLP, 2017.
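A quick numeric check of why the 1/√d_k term matters (my own toy experiment, not from the paper): for random q and k with unit-variance components, the standard deviation of q·k grows like √d_k, which pushes the softmax into saturated regions with tiny gradients; dividing by √d_k keeps it around 1.

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    dots = (q * k).sum(axis=1)                             # 10000 sample dot products
    print(d_k, round(dots.std(), 1), round((dots / np.sqrt(d_k)).std(), 2))
    # unscaled std ≈ sqrt(d_k); scaled std ≈ 1 for every d_k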
Source-Target or Self Attention
◼2 types of Dot-Product Attention
• Source-Target Attention
➢Used in the 2nd Multi-Head Attention Layer of Transformer
Decoder Layer
• Self-Attention
➢Used in the Multi-Head Attention Layer of Transformer
Encoder Layer and the 1st one of Transformer Decoder Layer
◼What is the difference?
• It depends on where the query comes from.
➢query from Encoder → Self-Att.
➢query from Decoder → Source-Target Att.
(Figure: K and V come from the Encoder; if the query also comes from the Encoder → Self-Attention, if the query comes from the Decoder → Source-Target Attention.)
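The two types share the same attention computation; only where Q comes from differs. A toy sketch (array names and sizes are my own):

import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
enc_out = rng.standard_normal((7, 64))     # encoder output: 7 source tokens
dec_state = rng.standard_normal((5, 64))   # decoder states: 5 target tokens

self_att = attention(enc_out, enc_out, enc_out)        # Q, K, V all from the Encoder
src_tgt_att = attention(dec_state, enc_out, enc_out)   # Q from the Decoder; K, V from the Encoder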
Multi-Head Attention
MultiHead(Q, K, V) = Concat( head_1, …, head_h ) W^O
where head_i = Attention( Q W_i^Q, K W_i^K, V W_i^V )

where W_i^Q ∈ ℝ^(d_model×d_k), W_i^K ∈ ℝ^(d_model×d_k),
W_i^V ∈ ℝ^(d_model×d_v) and W^O ∈ ℝ^(h·d_v×d_model).

h: # of parallel attention layers (heads)
d_k = d_v = d_model / h

⇒Attention with Dropout
Attention(Q, K, V) = dropout( softmax( Q K^T / √d_k ) ) V
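A minimal NumPy sketch of these equations (random weights as placeholders; dropout and masking omitted):

import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: lists of h projection matrices; Wo: (h*d_v, d_model)
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo             # Concat(head_1..head_h) W^O

d_model, h, n = 512, 8, 10
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))
Wq = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wk = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
Wv = [rng.standard_normal((d_model, d_v)) for _ in range(h)]
Wo = rng.standard_normal((h * d_v, d_model))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)        # self-attention, shape (n, d_model)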
Why Multi-Head Attention?
Experiments (Table 3 (a) in the paper) show that the multi-head
attention model outperforms single-head attention.

"Multi-head attention allows the model to jointly
attend to information from different representation
subspaces at different positions."[1]

Multi-Head Attention can be seen as an ensemble of attention mechanisms.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
What Multi-Head Attention Learns
◼Learns the importance of relationships between words
regardless of their distance
• In the figure below, the relationship between
"making" and "difficult" is strong in many attention heads.
(Figure cited from http://deeplearning.hatenablog.com/entry/transformer)
FFN and Residual Connection
◼Position-wise Feed-Forward Network
FFN(x) = ReLU( x W_1 + b_1 ) W_2 + b_2

where
d_ff (= 2048): dim. of the inner layer
◼Residual Connection
LayerNorm( x + Sublayer(x) )
⇒Residual Dropout
LayerNorm( x + Dropout( Sublayer(x), droprate ) )
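A minimal NumPy sketch of the position-wise FFN and the residual sublayer wrapper (dropout omitted; the layer-norm epsilon and the small random weights are my assumptions):

import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    # applied to each position independently: d_model -> d_ff -> d_model
    return np.maximum(0, x @ W1 + b1) @ W2 + b2             # ReLU inner activation

def residual_sublayer(x, sublayer):
    # LayerNorm(x + Sublayer(x)); the paper applies dropout to Sublayer(x) first
    return layer_norm(x + sublayer(x))

d_model, d_ff, n = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d_model))
W1, b1 = 0.01 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.01 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
out = residual_sublayer(x, lambda t: ffn(t, W1, b1, W2, b2))   # shape (n, d_model)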
Many Thanks to the Great Predecessors
◼Summary blogs helped my understanding m(_ _)m
• 論文解説 Attention Is All You Need (Transformer)
➢Commentary including background knowledge necessary for
full understanding
• 論文読み "Attention Is All You Need"
➢Helps to understand the flow of data in the Transformer
• The Annotated Transformer(harvardnlp)
➢PyTorch implementation and corresponding parts of the paper
are explained simply.
• 作って理解する Transformer / Attention
➢I could not understand from the paper alone how to calculate Q, K and V in
Dot-Product Attention; this page shows one solution.