Transformers
Presented by: Ali Zoljodi
• What is the Transformer?
The Transformer is a transduction model based on the attention mechanism.
• Initially developed to address natural language processing tasks such as text-to-text transformation (Vaswani et al., 2017; Devlin et al., 2018)
• The approach is now widespread in computer vision:
– Image classification (Dosovitskiy et al., 2020)
– Object detection (Carion et al., 2020)
– Segmentation (Wang et al., 2020)
What are the benefits of Transformers in comparison with RNNs?
• Capture long-range dependencies
• No vanishing gradients
• Fewer training steps
• Parallel computation
Transformers vs. CNNs
• CNN advantages
• Fast convergence
• Locally sensitive
• Needs less training data
• Transformer advantages
• More robust results
• Capture long-range dependencies
• Globally sensitive
[Figure: model improvement vs. amount of training data, with one curve per model; the Transformer curve overtakes the CNN curve as training data grows]
Attention Mechanism
• Mimics a soft retrieval: a query is compared against keys $k_1, \dots, k_T$, and the associated values $v_1, \dots, v_T$ are blended according to the match strength
[Diagram: a Query is matched against Key1…KeyT and the corresponding Value1…ValueT are combined into the output]
• $\mathrm{attention}(q, k, v) = \sum_i \mathrm{similarity}(q, k_i) \times v_i$, where every similarity weight is $\geq 0$ and $\leq 1$
• Common choices for the score $s_i$ between the query $q$ and key $k_i$:
– Dot product: $s_i = q^T k_i$
– Scaled dot product: $s_i = (q^T k_i) / \sqrt{d}$
– General (bilinear): $s_i = q^T W k_i$
– Additive: $s_i = w_q^T q + w_k^T k_i$
• The scores are normalized into weights with a softmax: $a_i = \exp(s_i) \big/ \sum_j \exp(s_j)$
[Diagram: keys $k_1 \dots k_4$ score the query into $s_1 \dots s_4$; the softmax gives $a_1 \dots a_4$, which weight the values $v_1 \dots v_4$]
• $\mathrm{attention\ value} = \sum_i a_i v_i$
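To make the slide concrete, here is a minimal NumPy sketch of single-query attention with the four scoring variants listed above; the function names and argument layout are illustrative, not from the slides.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

def attention_value(q, keys, values, score="scaled_dot", W=None, w_q=None, w_k=None):
    """q: (d,), keys: (T, d), values: (T, d_v). Returns sum_i a_i v_i."""
    d = q.shape[0]
    if score == "dot":               # s_i = q^T k_i
        s = keys @ q
    elif score == "scaled_dot":      # s_i = (q^T k_i) / sqrt(d)
        s = keys @ q / np.sqrt(d)
    elif score == "general":         # s_i = q^T W k_i
        s = keys @ W.T @ q
    elif score == "additive":        # s_i = w_q^T q + w_k^T k_i
        s = w_q @ q + keys @ w_k
    else:
        raise ValueError(score)
    a = softmax(s)                   # weights a_i: each >= 0 and <= 1, summing to 1
    return a @ values                # attention value = sum_i a_i v_i

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(4, 8))          # k_1 ... k_4
V = rng.normal(size=(4, 8))          # v_1 ... v_4
print(attention_value(q, K, V))
```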
Self-Attention
[Diagram: input tokens I1, I2, I3 attend to each other]
Cross-Attention
[Diagram: output tokens O1, O2 attend to the input tokens I1, I2, I3]
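A minimal sketch of the difference, assuming row-per-token matrices and omitting the learned projections: self-attention draws Q, K, and V from the same sequence, while cross-attention takes its queries from the output sequence and its keys and values from the inputs.

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, one row per token
    S = Q @ K.T / np.sqrt(K.shape[-1])
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return (E / E.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))      # input tokens  I1, I2, I3
Y = rng.normal(size=(2, 8))      # output tokens O1, O2

self_out  = attention(X, X, X)   # self-attention: Q, K, V all from the inputs
cross_out = attention(Y, X, X)   # cross-attention: queries from the outputs,
                                 # keys and values from the inputs
```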
Attention is all you need!
• Encoder: encodes the input sequence and extracts the relationships between the input words and their order
• Decoder: encodes the predicted sequence of outputs and decodes the input/output attentions
Attention is all you need!
• Multi-Head Attention
• $\mathrm{multihead}(Q, K, V) = W_0\, \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h)$
• $\mathrm{head}_i = \mathrm{attention}(W_i^Q Q,\; W_i^K K,\; W_i^V V)$
• $\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q^T K}{\sqrt{d_k}}\right) V$
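A compact NumPy sketch of the three formulas above, assuming row-per-token matrices (so the head projections and $W_0$ are applied on the right, the transpose of the slide's column-vector convention); all shapes and names are illustrative.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multihead(Q, K, V, Wq, Wk, Wv, W0):
    # head_i = attention(Q Wq[i], K Wk[i], V Wv[i])
    # multihead = concat(head_1, ..., head_h) W0
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ W0

T, d, h = 5, 16, 4                      # tokens, model dim, number of heads
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(h, d, d // h)) for _ in range(3))
W0 = rng.normal(size=(d, d))
print(multihead(X, X, X, Wq, Wk, Wv, W0).shape)   # (5, 16)
```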
Attention is all you need!
• Masked Multi-Head Attention
• Applied when some of the inputs/outputs must be hidden from the attention mechanism (e.g., future tokens during decoding)
• $\mathrm{maskedattention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q^T K + M}{\sqrt{d_K}}\right) V$
where $M$ is a mask matrix with $-\infty$ at the masked positions and 0 elsewhere
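A sketch of the mask construction, assuming the common causal case where each position may attend only to itself and earlier positions; the helper is illustrative, not from the slides.

```python
import numpy as np

def masked_attention(Q, K, V):
    T, d_k = Q.shape[0], K.shape[-1]
    # M: -inf above the diagonal (future positions), 0 elsewhere;
    # exp(-inf) = 0, so masked positions receive zero attention weight.
    M = np.triu(np.full((T, T), -np.inf), k=1)
    S = (Q @ K.T + M) / np.sqrt(d_k)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return (E / E.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(masked_attention(X, X, X).shape)   # (4, 8)
```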
Attention is all you need!
• Layer normalization: normalize each position to mean 0 and variance 1
• $h_i \leftarrow \dfrac{g}{\sigma}\,(h_i - \mu), \quad \mu = \dfrac{1}{H}\sum_{i=1}^{H} h_i \quad \text{and} \quad \sigma = \sqrt{\dfrac{1}{H}\sum_{i=1}^{H} (h_i - \mu)^2}$
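A direct transcription of the normalization formula into NumPy; the eps term is an addition of mine for numerical stability and is not on the slide.

```python
import numpy as np

def layer_norm(h, g, eps=1e-5):
    # mu and sigma are computed over the H features of one position
    mu = h.mean()
    sigma = np.sqrt(((h - mu) ** 2).mean() + eps)  # eps: stability only
    return (g / sigma) * (h - mu)                  # mean ~0, variance ~g^2

h = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(h, g=1.0))
```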
• Positional encoding, added to the token embeddings so the model can use word order:
• $PE_{\mathrm{position},\,2i} = \sin\!\left(\dfrac{\mathrm{position}}{10000^{2i/d}}\right)$
• $PE_{\mathrm{position},\,2i+1} = \cos\!\left(\dfrac{\mathrm{position}}{10000^{2i/d}}\right)$
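A vectorized sketch of the sinusoidal positional encoding above, assuming an even model dimension d; the constant 10000 follows Vaswani et al. (2017).

```python
import numpy as np

def positional_encoding(T, d):
    # PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(T)[:, None]               # (T, 1)
    two_i = np.arange(0, d, 2)[None, :]       # even indices 0, 2, ..., d-2
    angle = pos / (10000 ** (two_i / d))      # (T, d/2)
    PE = np.zeros((T, d))
    PE[:, 0::2] = np.sin(angle)               # even dimensions
    PE[:, 1::2] = np.cos(angle)               # odd dimensions
    return PE

print(positional_encoding(4, 8).shape)   # (4, 8)
```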