Transformers
Presented by: Ali Zoljodi
• What is the Transformer?
The Transformer is a transduction model based on the attention mechanism.
• Initially developed to address natural language processing tasks such as text-to-text transformation (Vaswani et al., 2017; Devlin et al., 2018)
• The approach is now widespread in computer vision:
– Image classification (Dosovitskiy et al., 2020)
– Object detection (Carion et al., 2020)
– Segmentation (Wang et al., 2020)
What are the benefits of Transformers in comparison with RNNs?
• Capture long-range dependencies
• No vanishing gradients
• Fewer training steps
• Parallel computation
Transformers vs. CNNs
• CNN advantages
• Fast convergence
• Locally sensitive
• Needs less training data
• Transformer advantages
• More robust results
• Capture long-range dependencies
• Globally sensitive
[Figure: model improvement vs. amount of training data, with one curve per model; the Transformer curve overtakes the CNN curve as training data grows]
Attention Mechanism
• Mimics a soft retrieval: a query is compared against keys $k_1, \dots, k_T$, and the associated values $v_1, \dots, v_T$ are blended according to the match strength
[Diagram: a Query is matched against Key1…KeyT and the corresponding Value1…ValueT are combined into the output]
• $\mathrm{attention}(q, k, v) = \sum_i \mathrm{similarity}(q, k_i) \times v_i$, where every similarity weight is $\geq 0$ and $\leq 1$
• Common choices for the score $s_i$ between the query $q$ and key $k_i$:
– Dot product: $s_i = q^T k_i$
– Scaled dot product: $s_i = (q^T k_i) / \sqrt{d}$
– General (bilinear): $s_i = q^T W k_i$
– Additive: $s_i = w_q^T q + w_k^T k_i$
• The scores are normalized into weights with a softmax: $a_i = \exp(s_i) \big/ \sum_j \exp(s_j)$
[Diagram: keys $k_1 \dots k_4$ score the query into $s_1 \dots s_4$; the softmax gives $a_1 \dots a_4$, which weight the values $v_1 \dots v_4$]
• $\mathrm{attention\ value} = \sum_i a_i v_i$
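To make the slide concrete, here is a minimal NumPy sketch of single-query attention with the four scoring variants listed above; the function names and argument layout are illustrative, not from the slides.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # subtract max for numerical stability
    return e / e.sum()

def attention_value(q, keys, values, score="scaled_dot", W=None, w_q=None, w_k=None):
    """q: (d,), keys: (T, d), values: (T, d_v). Returns sum_i a_i v_i."""
    d = q.shape[0]
    if score == "dot":               # s_i = q^T k_i
        s = keys @ q
    elif score == "scaled_dot":      # s_i = (q^T k_i) / sqrt(d)
        s = keys @ q / np.sqrt(d)
    elif score == "general":         # s_i = q^T W k_i
        s = keys @ W.T @ q
    elif score == "additive":        # s_i = w_q^T q + w_k^T k_i
        s = w_q @ q + keys @ w_k
    else:
        raise ValueError(score)
    a = softmax(s)                   # weights a_i: each >= 0 and <= 1, summing to 1
    return a @ values                # attention value = sum_i a_i v_i

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(4, 8))          # k_1 ... k_4
V = rng.normal(size=(4, 8))          # v_1 ... v_4
print(attention_value(q, K, V))
```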
Self-Attention
[Diagram: input tokens I1, I2, I3 attend to each other]
Cross-Attention
[Diagram: output tokens O1, O2 attend to the input tokens I1, I2, I3]
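A minimal sketch of the difference, assuming row-per-token matrices and omitting the learned projections: self-attention draws Q, K, and V from the same sequence, while cross-attention takes its queries from the output sequence and its keys and values from the inputs.

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, one row per token
    S = Q @ K.T / np.sqrt(K.shape[-1])
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return (E / E.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))      # input tokens  I1, I2, I3
Y = rng.normal(size=(2, 8))      # output tokens O1, O2

self_out  = attention(X, X, X)   # self-attention: Q, K, V all from the inputs
cross_out = attention(Y, X, X)   # cross-attention: queries from the outputs,
                                 # keys and values from the inputs
```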
Attention is all you need!
• Encoder: encodes the input sequence and extracts the relationships between the input words and their order
• Decoder: encodes the predicted sequence of outputs and decodes the input/output attentions
Attention is all you need!
• Multi-Head Attention
• $\mathrm{multihead}(Q, K, V) = W_0\, \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h)$
• $\mathrm{head}_i = \mathrm{attention}(W_i^Q Q,\; W_i^K K,\; W_i^V V)$
• $\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q^T K}{\sqrt{d_k}}\right) V$
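A compact NumPy sketch of the three formulas above, assuming row-per-token matrices (so the head projections and $W_0$ are applied on the right, the transpose of the slide's column-vector convention); all shapes and names are illustrative.

```python
import numpy as np

def softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multihead(Q, K, V, Wq, Wk, Wv, W0):
    # head_i = attention(Q Wq[i], K Wk[i], V Wv[i])
    # multihead = concat(head_1, ..., head_h) W0
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ W0

T, d, h = 5, 16, 4                      # tokens, model dim, number of heads
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(h, d, d // h)) for _ in range(3))
W0 = rng.normal(size=(d, d))
print(multihead(X, X, X, Wq, Wk, Wv, W0).shape)   # (5, 16)
```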
Attention is all you need!
• Masked Multi-Head Attention
• Applied when some of the inputs/outputs must be hidden from the attention mechanism (e.g., future tokens during decoding)
• $\mathrm{maskedattention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q^T K + M}{\sqrt{d_K}}\right) V$
where $M$ is a mask matrix with $-\infty$ at the masked positions and 0 elsewhere
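A sketch of the mask construction, assuming the common causal case where each position may attend only to itself and earlier positions; the helper is illustrative, not from the slides.

```python
import numpy as np

def masked_attention(Q, K, V):
    T, d_k = Q.shape[0], K.shape[-1]
    # M: -inf above the diagonal (future positions), 0 elsewhere;
    # exp(-inf) = 0, so masked positions receive zero attention weight.
    M = np.triu(np.full((T, T), -np.inf), k=1)
    S = (Q @ K.T + M) / np.sqrt(d_k)
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return (E / E.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(masked_attention(X, X, X).shape)   # (4, 8)
```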
Attention is all you need!
• Layer normalization: normalize each position to mean 0 and variance 1
• $h_i \leftarrow \dfrac{g}{\sigma}\,(h_i - \mu), \quad \mu = \dfrac{1}{H}\sum_{i=1}^{H} h_i \quad \text{and} \quad \sigma = \sqrt{\dfrac{1}{H}\sum_{i=1}^{H} (h_i - \mu)^2}$
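A direct transcription of the normalization formula into NumPy; the eps term is an addition of mine for numerical stability and is not on the slide.

```python
import numpy as np

def layer_norm(h, g, eps=1e-5):
    # mu and sigma are computed over the H features of one position
    mu = h.mean()
    sigma = np.sqrt(((h - mu) ** 2).mean() + eps)  # eps: stability only
    return (g / sigma) * (h - mu)                  # mean ~0, variance ~g^2

h = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(h, g=1.0))
```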
• Positional encoding, added to the token embeddings so the model can use word order:
• $PE_{\mathrm{position},\,2i} = \sin\!\left(\dfrac{\mathrm{position}}{10000^{2i/d}}\right)$
• $PE_{\mathrm{position},\,2i+1} = \cos\!\left(\dfrac{\mathrm{position}}{10000^{2i/d}}\right)$
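A vectorized sketch of the sinusoidal positional encoding above, assuming an even model dimension d; the constant 10000 follows Vaswani et al. (2017).

```python
import numpy as np

def positional_encoding(T, d):
    # PE[pos, 2i] = sin(pos / 10000**(2i/d)), PE[pos, 2i+1] = cos(...)
    pos = np.arange(T)[:, None]               # (T, 1)
    two_i = np.arange(0, d, 2)[None, :]       # even indices 0, 2, ..., d-2
    angle = pos / (10000 ** (two_i / d))      # (T, d/2)
    PE = np.zeros((T, d))
    PE[:, 0::2] = np.sin(angle)               # even dimensions
    PE[:, 1::2] = np.cos(angle)               # odd dimensions
    return PE

print(positional_encoding(4, 8).shape)   # (4, 8)
```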