3. Problems of the traditional ED model
◦ Long-sentence problem
◦ All the information of the input must be compressed into a fixed-length vector 𝑠0
◦ The encoder cannot encode all the information when the input sentence gets long
◦ Input-output alignment problem
◦ The model is unable to align the input with the output
◦ The decoder lacks any mechanism to selectively focus on relevant input tokens while generating each output token
[Figure: an RNN encoder-decoder translating “I am learning math” into “Tôi đang học toán”; the entire source sentence is compressed into a single fixed-length vector passed to the decoder.]
4. Attention idea
◦ Allowing the decoder to access the entire encoded input sequence
◦ Induce attention weights over the input sequence to prioritize the set of positions where
relevant information is present for generating the next output token
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015, Neural machine translation by jointly learning to
align and translate, ICLR 2015
6. Attention’s intuition
◦ The objective: align each output’s hidden state with the input’s hidden states
◦ Intuition:
◦ To predict 𝑠𝑡, we want to measure how much 𝑠𝑡 relates to each input ℎ𝑖
◦ Then, we assign a higher weight to the ℎ𝑖 that is more relevant to 𝑠𝑡
◦ Problem: we do not know 𝑠𝑡 yet
◦ Solution: measure the relevance between ℎ𝑖 and 𝑠𝑡−1, which is the nearest state to 𝑠𝑡
9. Core Attention model
◦ Alignment function (𝑎): 𝑒𝑡𝑖 = 𝑎(𝒔𝒕−𝟏, 𝒉𝒊), computed between the previous decoder state 𝒔𝒕−𝟏 and every encoder state 𝒉𝟏, 𝒉𝟐, …, 𝒉𝑻
◦ 𝑒𝑡1, 𝑒𝑡2, …, 𝑒𝑡𝑇 are the energy scores
◦ Another name for the “alignment function” is “compatibility function”
◦ Distribution function (𝑝): 𝒂𝒕 = 𝑝(𝒆𝒕)
◦ 𝑎𝑡1, 𝑎𝑡2, …, 𝑎𝑡𝑇 are the attention weights
◦ When 𝑝 is the softmax, we have: 𝑎𝑡𝑖 = exp(𝑒𝑡𝑖) / ∑𝑗=1…𝑇 exp(𝑒𝑡𝑗)
◦ Weighted sum: 𝒄𝒕 = ∑𝑖=1…𝑇 𝑎𝑡𝑖𝒉𝒊, where 𝒄𝒕 is the context vector
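The three steps above can be sketched in a few lines of NumPy. This is a minimal toy example, not a trained model: the hidden states and decoder state are random stand-ins, and the dot product is used as one possible alignment function.

```python
import numpy as np

# Toy dimensions: T encoder states of size d (made up for illustration).
T, d = 4, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(T, d))        # encoder hidden states h_1 .. h_T
s_prev = rng.normal(size=d)        # previous decoder state s_{t-1}

# Alignment function a: here the dot product, one common choice.
e = H @ s_prev                     # energy scores e_t1 .. e_tT

# Distribution function p: softmax over the energy scores.
a = np.exp(e - e.max())
a = a / a.sum()                    # attention weights, summing to 1

# Weighted sum: context vector c_t = sum_i a_ti * h_i
c = a @ H
print(c.shape)                     # (3,)
```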
10. Core Attention model
Image credit: Attention in Natural Language Processing, IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021
◦ The objective is to map the keys 𝑲 to the attention weights 𝒂
◦ The keys 𝑲 encode the data features whereupon attention is computed
◦ in the RNN ED model: 𝑲 = 𝒉
◦ The query 𝒒 is used as a reference when computing the attention distribution
◦ the attention mechanism emphasizes the input elements relevant to the task according to 𝒒
◦ if no query is defined, attention emphasizes the elements inherently relevant to the task at hand
◦ in the RNN ED model: 𝒒 = 𝒔𝒕−𝟏
11. General Attention model
◦ Sometimes, we need to compute the final task on another representation of the keys
◦ i.e., the data representation for computing the attention is different from the one used for computing the final task
◦ Introduce a new term: the values 𝑽, representing the data whereupon the attention is to be applied
◦ each item of 𝑽 corresponds to an item of 𝑲
[Figure: the weights calculation takes 𝑲 and 𝒒 and produces 𝒂; the weighted sum is then taken over 𝑲 (basic model) or over the values 𝑽 (general model) to produce 𝒄𝒕.]
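The key/value split can be sketched as follows. This is a toy illustration with random tensors and made-up dimensions; the point is only that the scores come from 𝑲 and 𝒒, while the weighted sum runs over 𝑽.

```python
import numpy as np

# Toy setup: T key/value pairs; keys of size dk, values of size dv.
T, dk, dv = 4, 3, 5
rng = np.random.default_rng(1)
K = rng.normal(size=(T, dk))   # keys: representation used to score relevance
V = rng.normal(size=(T, dv))   # values: representation the attention is applied to
q = rng.normal(size=dk)        # query

e = K @ q                               # energy scores from keys and query
a = np.exp(e - e.max()); a /= a.sum()   # attention weights

c = a @ V                      # context vector: weighted sum of the VALUES
print(c.shape)                 # (5,)
```

Note that the context vector lives in the value space (size 5), not the key space (size 3).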
12. General Attention model
◦ About the aggregation step (Z):
◦ The commonest way is summation
◦ But alternatives have been proposed, e.g., gating functions
◦ About 𝒂:
◦ Deterministic attention (soft attention): as described so far
◦ Stochastic attention (hard attention): use the attention weights to sample a single input as the context vector 𝒄
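The soft/hard contrast can be sketched on the same set of weights. Everything here is toy data; the Dirichlet draw merely stands in for a learned attention distribution.

```python
import numpy as np

# Toy data: T inputs of size d, plus one attention distribution over them.
rng = np.random.default_rng(5)
T, d = 4, 3
H = rng.normal(size=(T, d))
a = rng.dirichlet(np.ones(T))          # attention weights, sum to 1

c_soft = a @ H                         # soft: expectation of H under a
i = rng.choice(T, p=a)                 # hard: sample ONE input position
c_hard = H[i]                          # context is that single input
print(c_soft.shape, c_hard.shape)      # (3,) (3,)
```

The sampling step makes hard attention non-differentiable, which is why it is typically trained with techniques such as REINFORCE rather than plain backpropagation.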
13. Alignment functions
◦ Main approach 1: matching and comparing 𝑲 and 𝒒
◦ Idea: the most relevant keys are the most similar to the query
◦ Methods: rely on similarity functions
◦ Cosine, dot product, scaled multiplicative attention, …
◦ Main approach 2: combining 𝑲 and 𝒒
◦ Others
◦ Convolution-based attention
◦ Deep attention
15. Alignment functions
◦ Dot product-based score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝒌𝒊: the more similar 𝒌𝒊 is to 𝒒, the higher the attention weight
◦ Limitation: the dimensions of 𝒌𝒊 and 𝒒 must be the same
◦ General score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝑾𝒌𝒊
◦ 𝑾 can be seen as mapping 𝒒 into the space of 𝒌𝒊 when the dimensions of 𝒒 and 𝒌𝒊 are different
16. Alignment functions
◦ General score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝑾𝒌𝒊
◦ 𝑾 can be seen as mapping 𝒒 into the space of 𝒌𝒊 when the dimensions of 𝒒 and 𝒌𝒊 are different
[Figure: the product 𝒒𝑻𝑾𝒌𝒊, with 𝑾 shaped to bridge the dimensions of 𝒒 and 𝒌𝒊.]
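A minimal numeric sketch of the two scores. All tensors are toy random data; in practice 𝑾 is learned, and here it simply bridges a 3-dimensional query and a 4-dimensional key.

```python
import numpy as np

# Toy vectors: dq and dk differ, so the general score needs W.
dq, dk = 3, 4
rng = np.random.default_rng(2)
q = rng.normal(size=dq)
k = rng.normal(size=dk)
W = rng.normal(size=(dq, dk))   # learned in practice; random here

# Dot-product score (only valid when dimensions match):
k_same = rng.normal(size=dq)
score_dot = q @ k_same

# General (bilinear) score: W maps between the two spaces.
score_general = q @ W @ k
print(float(score_general))
```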
17. Alignment functions
◦ Biased general(1)
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒌𝒊(𝑾𝒒 + 𝒃) = 𝒌𝒊𝑾𝒒 + 𝒌𝒊𝒃 (a bilinear term plus a biased term)
◦ Activated general(2)
◦ 𝑎(𝒌𝒊, 𝒒) = tanh(𝒒𝑻𝑾𝒌𝒊 + 𝒃)
◦ Generalized kernel(3)
◦ 𝑎(𝒌𝒊, 𝒒) = 𝝓(𝒒)𝑻𝝓(𝒌𝒊)
(1)Alessandro Sordoni, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. arXiv:1606.02245
(2)Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. IJCAI 2017
(3)Krzysztof Marcin Choromanski et al. 2021. Rethinking attention with performers. ICLR 2021
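The generalized kernel score can be sketched with a simple feature map. The 𝝓 below (random projection followed by ReLU) is just one illustrative choice, not the specific map used by Performers; its non-negative outputs guarantee non-negative scores.

```python
import numpy as np

# Toy sketch of a generalized-kernel score a(k_i, q) = phi(q)^T phi(k_i).
d, m = 4, 16
rng = np.random.default_rng(6)
P = rng.normal(size=(m, d))          # random projection, fixed across inputs

def phi(x):
    # Non-negative random features: project, then clip at zero (ReLU).
    return np.maximum(P @ x, 0.0)

q, k = rng.normal(size=d), rng.normal(size=d)
score = phi(q) @ phi(k)
print(score >= 0.0)                  # True
```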
18. Alignment functions
◦ Concat attention(4)
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒘𝒊𝒎𝒑𝑻 ⋅ 𝑎𝑐𝑡(𝑾[𝒒; 𝒌𝒊] + 𝒃)
◦ Question: what is the intuition behind this formula?
◦ The objective:
◦ estimate the attention weights 𝒂
◦ use 𝒄𝒕 = ∑𝑖=1…𝑇 𝑎𝑡𝑖𝒉𝒊 as the input of the decoder
◦ Solution
◦ put all we have ([𝒒; 𝒌𝒊]) into an attention-weight-estimation neural network
◦ use gradient descent to find the optimal one
• In this way, 𝑎𝑡𝑖 does not exactly reflect the original idea of the attention mechanism, which is to assign a higher weight to more relevant input
(4)Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. EMNLP 2015
19. Alignment functions
◦ Additive attention(5)
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒘𝒊𝒎𝒑𝑻 ⋅ 𝑎𝑐𝑡(𝑾𝟏𝒒 + 𝑾𝟐𝒌𝒊 + 𝒃)
◦ The same as concat attention with 𝑾 = [𝑾𝟏; 𝑾𝟐] split over [𝒒; 𝒌𝒊], but 𝑾𝟐𝒌𝒊 can be precomputed for each key only one time
◦ Deep alignment(6)
(5)Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015
(6)John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deeper attention to abusive user content moderation. EMNLP 2017, 1125–1135
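The equivalence between the concat and additive forms can be checked numerically. This is a toy sketch: the names 𝒘𝒊𝒎𝒑 and 𝑎𝑐𝑡 follow the slides, the dimensions are made up, and all parameters are random stand-ins for learned ones.

```python
import numpy as np

# Toy sketch of concat vs. additive attention scores.
dq, dk, dh = 3, 3, 6
rng = np.random.default_rng(3)
q, k = rng.normal(size=dq), rng.normal(size=dk)
W = rng.normal(size=(dh, dq + dk))          # concat: one matrix over [q; k]
W1, W2 = W[:, :dq], W[:, dq:]               # additive: the same matrix, split
b = rng.normal(size=dh)
w_imp = rng.normal(size=dh)

act = np.tanh                               # one common choice for act

score_concat = w_imp @ act(W @ np.concatenate([q, k]) + b)
score_additive = w_imp @ act(W1 @ q + W2 @ k + b)

# The two formulations coincide when W = [W1; W2]:
print(np.isclose(score_concat, score_additive))   # True
```

The practical difference is efficiency: 𝑾𝟐𝒌𝒊 depends only on the key, so it can be computed once per key and reused at every decoding step.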
20. Alignment functions
◦ Compositional De-Attention(7)
◦ Idea: attend to similar inputs, and de-attend to dissimilar ones
◦ Algorithm
◦ 𝑎𝑖: key; 𝑏𝑖: query
◦ Pairwise similarity measurement
◦ Dissimilarity measurement
◦ The final (quasi-)attention matrix
◦ The similarity and dissimilarity are computed with parameterized functions, with 𝛼, 𝛽 as parameters
(7)Yi Tay, Anh Tuan Luu, Aston Zhang, Shuohang Wang, and Siu Cheung Hui. 2019. Compositional de-attention networks. NeurIPS 2019
21. Luong attention
◦ Global attention
◦ Considers all the hidden states of the encoder when deriving the context vector 𝒄𝒕
◦ Local attention
◦ Focuses only on a small subset of the source positions (i.e., encoder states) per target word
22. Luong attention
[Figure: Luong’s global attention mechanism side by side with the first attention mechanism proposed by Bahdanau (2015), both computed over the hidden states of the encoder.]
◦ What is the main difference between them?
23. Luong attention
◦ Luong attention predicts the relevance between the current hidden state 𝒉𝒕 and the input states (the hidden states of the encoder)
◦ Bahdanau attention predicts the relevance between the previous hidden state 𝒉𝒕−𝟏 and the input states
◦ In Luong attention, we cannot predict 𝑦𝑡 directly from 𝒉𝒕: a new term 𝒉̃𝒕 is introduced to compute 𝑦𝑡. The flow is 𝒉𝒕 → 𝒄𝒕 → 𝒉̃𝒕 → 𝑦𝑡
◦ In Bahdanau attention, the context vector 𝒄𝒕 is an input when calculating 𝒉𝒕, and 𝑦𝑡 is computed from 𝒉𝒕. The flow is 𝒉𝒕−𝟏 → 𝒄𝒕 → 𝒉𝒕 → 𝑦𝑡
25. Luong attention
◦ Local attention
◦ Main idea: calculate the attention weights for only a subset of the keys
◦ Flow: for each target word at timestep 𝑡
◦ generate an aligned position 𝑝𝑡
◦ the attention weights and the context vector 𝒄𝒕 are calculated over the window [𝑝𝑡 − 𝐷, 𝑝𝑡 + 𝐷], where 𝐷 is a hyperparameter
◦ How to determine 𝑝𝑡
◦ Monotonic alignment (local-m): 𝑝𝑡 = 𝑡
◦ Predictive alignment (local-p): 𝑝𝑡 = 𝑆 ⋅ sigmoid(𝒗𝒑𝑻 tanh(𝑾𝒑𝒉𝒕)), where 𝑆 is the length of the input and 𝒗𝒑, 𝑾𝒑 are a learnable vector and matrix
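The local-p flow can be sketched as follows. All parameters are random stand-ins for learned ones, and the dimensions are made up; the sigmoid guarantees that 𝑝𝑡 falls inside [0, 𝑆].

```python
import numpy as np

# Toy sketch of local-p attention: predict a position, attend in a window.
S, d, D = 10, 4, 2                      # input length, hidden size, window radius
rng = np.random.default_rng(4)
H = rng.normal(size=(S, d))             # encoder hidden states
h_t = rng.normal(size=d)                # current decoder hidden state
W_p = rng.normal(size=(d, d))           # learnable matrix (random here)
v_p = rng.normal(size=d)                # learnable vector (random here)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Predicted aligned position, guaranteed to fall in [0, S]:
p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))

# Attend only inside the window [p_t - D, p_t + D]:
lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
window = H[lo:hi]
e = window @ h_t                        # energy scores over the window only
a = np.exp(e - e.max()); a /= a.sum()
c_t = a @ window                        # local context vector
print(c_t.shape)                        # (4,)
```

Luong et al. additionally reweight the window with a Gaussian centered at 𝑝𝑡 to favor positions near the predicted alignment; that factor is omitted here for brevity.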
26. Distribution functions
◦ Argmax
◦ Softmax
◦ Returns a dense distribution: 𝑎𝑡𝑖 = exp(𝑒𝑡𝑖) / ∑𝑗=1…𝑇 exp(𝑒𝑡𝑗)
◦ Sparsemax(8)
◦ Returns sparse posterior distributions, assigning zero probability to some output variables
(8)Martins et al. 2016. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. PMLR 2016
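Sparsemax can be sketched compactly: it projects the scores onto the probability simplex, which zeroes out low-scoring entries instead of giving them small positive mass as softmax does. This is a minimal NumPy sketch of that projection, not an optimized implementation.

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of the score vector z onto the probability simplex.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * z_sorted > cssv          # entries that stay nonzero
    k = ks[support][-1]                         # size of the support
    tau = (cssv[k - 1] - 1) / k                 # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([2.0, 1.0, -1.0])
# p sums to 1 and assigns exactly zero probability to the weaker scores,
# whereas softmax would give every entry some positive mass.
print(p)
```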