Attention mechanism
Nguyen Phi Le
November 2021
Traditional encoder-decoder
2
[Figure: an RNN encoder-decoder translating the Vietnamese sentence "Tôi đang học toán" into "I am learning math". The encoder states ℎ1, ℎ2, ℎ3, ℎ4 are compressed into a single vector 𝑠0, which initializes the decoder states 𝑠1, 𝑠2, … that emit the outputs 𝑦1, 𝑦2, 𝑦3, ….]
Problem of the traditional ED model
◦ Long sentence problem
◦ All the information of the input must be compressed into a fixed-length vector 𝑠0
◦  cannot encode all of the information when the input sentence gets long
◦ Input and output alignment
◦ The model is unable to align the input with the output
◦ The decoder lacks any mechanism to selectively focus on the relevant input tokens while generating each output token
3
Attention idea
◦ Allow the decoder to access the entire encoded input sequence
◦ Induce attention weights over the input sequence that prioritize the positions holding the information relevant for generating the next output token
4
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015, Neural machine translation by jointly learning to
align and translate, ICLR 2015
5
[Figure: two views of an encoder-decoder with an attention block. The encoder states ℎ1, ℎ2, …, ℎ𝑇, the previous decoder state 𝑠𝑡−1, and the previous output 𝑦𝑡−1 feed the attention module, which produces attention weights and the context vector 𝒄𝒕 used to compute the next decoder state 𝑠𝑡.]
Attention’s intuition
◦ The objective: align each decoder hidden state with the encoder hidden states
◦ Intuition:
◦ To predict 𝑠𝑡, we want to measure how strongly 𝑠𝑡 relates to each input state ℎ𝑖
◦ Then, we assign a higher weight to the ℎ𝑖 that is more relevant to 𝑠𝑡
◦ Problem: 𝑠𝑡 is not known yet
◦ Solution: measure the relevance of ℎ𝑖 to 𝑠𝑡−1, the closest available state to 𝑠𝑡
6
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015, Neural machine translation by jointly learning
to align and translate, ICLR 2015
Attention’s basic mechanism
7
[Figure: the same encoder-decoder with attention as on the previous slide; the attention block combines ℎ1, …, ℎ𝑇, 𝑠𝑡−1, and 𝑦𝑡−1 into the context vector 𝒄𝒕 that drives 𝑠𝑡.]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015
Attention’s basic mechanism
8
[Figure: the attention pipeline. The encoder states ℎ1, ℎ2, …, ℎ𝑇 and the previous decoder state 𝑠𝑡−1 enter the alignment function (𝑎), producing the energy scores 𝑒𝑡1, 𝑒𝑡2, …, 𝑒𝑡𝑇; the distribution function (𝑝) turns them into the attention weights 𝑎𝑡1, 𝑎𝑡2, …, 𝑎𝑡𝑇; a weighted sum of the encoder states yields the context vector 𝒄𝒕.]
Core Attention model
9
Inputs: the encoder states 𝒉𝟏, 𝒉𝟐, …, 𝒉𝑻 and the previous decoder state 𝒔𝒕−𝟏
◦ Alignment function (𝑎), also called the compatibility function: produces the energy scores
$e_{ti} = a(\mathbf{s}_{t-1}, \mathbf{h}_i)$
◦ Distribution function (𝑝): produces the attention weights $\mathbf{a}_t = p(\mathbf{e}_t)$. When 𝑝 is the softmax:
$a_{ti} = \dfrac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}$
◦ Weighted sum: the context vector
$\mathbf{c}_t = \sum_{i=1}^{T} a_{ti}\, \mathbf{h}_i$
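To make these three steps concrete, here is a minimal NumPy sketch of the pipeline (alignment → distribution → weighted sum), assuming a plain dot product as the alignment function; the function and variable names are illustrative, not taken from the slides.

```python
import numpy as np

def core_attention(encoder_states, prev_decoder_state):
    """One attention step: energy scores -> softmax weights -> context vector.

    encoder_states:     (T, d) matrix holding h_1 ... h_T
    prev_decoder_state: (d,)   vector s_{t-1}
    """
    # Alignment (compatibility) function: here a simple dot product a(s_{t-1}, h_i).
    energies = encoder_states @ prev_decoder_state        # e_t = (e_t1, ..., e_tT)

    # Distribution function: softmax turns energies into attention weights.
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()                              # a_t = (a_t1, ..., a_tT)

    # Weighted sum of the encoder states gives the context vector c_t.
    context = weights @ encoder_states                    # c_t
    return weights, context

# Toy usage: 4 encoder states of dimension 8.
h = np.random.randn(4, 8)
s_prev = np.random.randn(8)
a_t, c_t = core_attention(h, s_prev)
```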
Core Attention model
10
Image credit: Attention in Natural Language Processing, IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021
 The objective is to map the keys 𝑲 to the attention weights 𝒂
 The keys 𝑲 encode the data features on which attention is computed
 in the RNN ED model: 𝑲 = 𝒉
 The query 𝒒 is used as a reference when computing the attention distribution
 the attention mechanism emphasizes the input elements that are relevant to the task according to 𝒒
 If no query is defined, attention emphasizes the elements inherently relevant to the task at hand
 in the RNN ED model: 𝒒 = 𝒔𝒕−𝟏
General Attention model
◦ Sometimes, we need to compute the final task on another representation of the keys
◦ i.e., the data representation used for computing the attention differs from the one used for computing the final task
◦ Introduce a new term: the values 𝑽, representing the data to which the attention is applied
◦ each item of 𝑽 corresponds to an item of 𝑲
11
[Figure: two attention diagrams. Left: the weights 𝒂 are computed from 𝑲 and 𝒒, and the weighted sum over 𝑲 gives 𝒄𝒕. Right: the weights are still computed from 𝑲 and 𝒒, but the weighted sum is taken over the values 𝑽 to give 𝒄𝒕.]
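A minimal sketch of this general model, assuming the softmax as the distribution function: the weights are computed from the keys 𝑲 and the query 𝒒, but the weighted sum is taken over the values 𝑽. Only the key/value/query roles come from the slide; the rest is illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def general_attention(K, V, q):
    """K: (T, d_k) keys, V: (T, d_v) values (one value per key), q: (d_k,) query."""
    energies = K @ q              # compare every key with the query
    weights = softmax(energies)   # attention distribution over the keys
    context = weights @ V         # but the weighted sum is taken over the values
    return weights, context

# In the plain RNN encoder-decoder case, K = V = encoder states and q = s_{t-1}.
K = np.random.randn(5, 16)
V = np.random.randn(5, 32)        # values may live in a different space than keys
q = np.random.randn(16)
w, c = general_attention(K, V, q)
```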
General Attention model
12
About Z:
◦ The commonest choice is summation (the weighted sum)
◦ Alternatives have been proposed, e.g., gating functions
About 𝒂:
◦ Deterministic attention (soft attention): as described so far, every input contributes in proportion to its weight
◦ Stochastic attention (hard attention): use the attention weights to sample a single input, which then serves as the context vector 𝒄
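A small sketch of the difference, assuming the attention weights 𝒂 have already been computed: soft attention returns the weighted sum of all inputs, while hard attention samples a single input (sampling with NumPy is just one common way to realize the stochastic choice).

```python
import numpy as np

def soft_context(weights, values):
    # Deterministic (soft) attention: weighted sum of all inputs.
    return weights @ values

def hard_context(weights, values, rng=np.random.default_rng()):
    # Stochastic (hard) attention: sample one input index with probability a_i
    # and use that single input as the context vector.
    idx = rng.choice(len(weights), p=weights)
    return values[idx]

a = np.array([0.1, 0.7, 0.2])
V = np.random.randn(3, 8)
c_soft = soft_context(a, V)
c_hard = hard_context(a, V)
```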
Alignment functions
◦ Main approach 1: matching and comparing 𝑲 and 𝒒
◦ Idea: the most relevant keys are the most similar to the query
◦ Methods: rely on similarity functions
◦ Cosine, dot product, scaled multiplicative attention, …
◦ Main approach 2: combining 𝑲 and 𝒒
◦ Others
◦ Convolution-based attention
◦ Deep attention
13
Alignment functions
14
Alignment functions
◦ Dot product-based score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝒌𝒊  the more similar 𝒌𝒊 is to 𝒒, the higher the attention weight
◦ Limitation: the dimensions of 𝒌𝒊 and 𝒒 must be the same
◦ General score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝑾𝒌𝒊
◦ 𝑾 can be seen as mapping 𝒒 into the space of 𝒌𝒊 when the dimensions of 𝒒 and 𝒌𝒊 differ
15
Alignment functions
◦ General score
◦ 𝑎(𝒌𝒊, 𝒒) = 𝒒𝑻𝑾𝒌𝒊
◦ 𝑾 can be seen as mapping 𝒒 into the space of 𝒌𝒊 when the dimensions of 𝒒 and 𝒌𝒊 differ
16
[Figure: the matrix 𝑾 sits between the row vector 𝒒𝑻 and the column vector 𝒌𝒊, bridging their (possibly different) dimensions.]
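The two scores side by side in a short sketch; the scaled multiplicative variant (dividing by √d) is added for completeness because it was listed earlier among the similarity-based methods, and all names here are illustrative.

```python
import numpy as np

d_q, d_k = 16, 24
q = np.random.randn(d_q)
k = np.random.randn(d_k)
W = np.random.randn(d_q, d_k)     # maps between the query space and the key space

# Dot-product score: requires d_q == d_k, so use a key of matching size.
k_same = np.random.randn(d_q)
score_dot = q @ k_same

# Scaled multiplicative score: same idea, divided by sqrt(d) to keep magnitudes stable.
score_scaled = (q @ k_same) / np.sqrt(d_q)

# General (bilinear) score: W bridges the two spaces, so dimensions may differ.
score_general = q @ W @ k
```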
Alignment functions
◦ Biased general(1)
◦ $a(\mathbf{k}_i, \mathbf{q}) = \mathbf{k}_i(\mathbf{W}\mathbf{q} + \mathbf{b}) = \underbrace{\mathbf{k}_i\mathbf{W}\mathbf{q}}_{\text{bilinear term}} + \underbrace{\mathbf{k}_i\mathbf{b}}_{\text{biased term}}$
◦ Activated general(2)
◦ $a(\mathbf{k}_i, \mathbf{q}) = \tanh(\mathbf{q}^{T}\mathbf{W}\mathbf{k}_i + \mathbf{b})$
◦ Generalized kernel(3)
◦ $a(\mathbf{k}_i, \mathbf{q}) = \boldsymbol{\phi}(\mathbf{q})^{T}\boldsymbol{\phi}(\mathbf{k}_i)$
17
(1)Alessandro Sordoni, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. arXiv:1606.02245
(2)Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. IJCAI 2017
(3)Krzysztof Marcin Choromanski et al. 2021. Rethinking attention with performers. ICLR 2021
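A hedged sketch of the activated general and generalized kernel scores. The feature map φ below is a simple ELU(x)+1 choice borrowed from the linear-attention literature, used only for illustration; it is not the random-feature map of the Performer paper cited above.

```python
import numpy as np

def activated_general(q, k, W, b):
    # tanh(q^T W k + b): the "activated general" score.
    return np.tanh(q @ W @ k + b)

def phi(x):
    # Illustrative feature map (ELU(x) + 1); other kernels / feature maps are possible.
    return np.where(x > 0, x + 1.0, np.exp(x))

def generalized_kernel(q, k):
    # phi(q)^T phi(k): similarity computed in feature space.
    return phi(q) @ phi(k)

q = np.random.randn(8)
k = np.random.randn(8)
W = np.random.randn(8, 8)
s1 = activated_general(q, k, W, b=0.1)
s2 = generalized_kernel(q, k)
```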
Alignment functions
◦ Concat attention(4)
◦ $a(\mathbf{k}_i, \mathbf{q}) = \mathbf{w}_{imp}^{T} \cdot \mathrm{act}\!\left(\mathbf{W}[\mathbf{q}; \mathbf{k}_i] + \mathbf{b}\right)$
◦ Question: what is the intuition behind this formula?
◦ The objective:
◦ estimate the attention weights 𝒂
◦ use $\mathbf{c}_t = \sum_{i=1}^{T} a_{ti}\,\mathbf{h}_i$ as the input of the decoder
◦ Solution
◦ Put everything we have ($[\mathbf{q}; \mathbf{k}_i]$) into an attention-weight-estimation neural network
◦ use gradient descent to find the optimal weights
18
• Note: in this formulation, 𝑎𝑡𝑖 no longer directly reflects the original idea of the attention mechanism, which is to assign a higher weight to the more relevant inputs
(4)Minh-Thang Luong, et al., “Effective approaches to attention-based neural machine Translation”, 2015
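A minimal sketch of the concat score: concatenate 𝒒 and 𝒌𝒊, pass them through one hidden layer, and project with 𝒘𝒊𝒎𝒑 (tanh stands in for the generic activation act; all dimensions are illustrative).

```python
import numpy as np

def concat_score(q, k, W, b, w_imp):
    """w_imp^T . act(W [q; k] + b), with act = tanh here."""
    qk = np.concatenate([q, k])       # [q; k]
    hidden = np.tanh(W @ qk + b)      # one hidden layer over the concatenation
    return w_imp @ hidden             # scalar energy score

d_q, d_k, d_h = 8, 8, 12
q, k = np.random.randn(d_q), np.random.randn(d_k)
W = np.random.randn(d_h, d_q + d_k)
b = np.zeros(d_h)
w_imp = np.random.randn(d_h)
e = concat_score(q, k, W, b, w_imp)
```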
Alignment functions
◦ Additive attention(5)
◦ $a(\mathbf{k}_i, \mathbf{q}) = \mathbf{w}_{imp}^{T} \cdot \mathrm{act}\!\left(\mathbf{W}_1\mathbf{q} + \mathbf{W}_2\mathbf{k}_i + \mathbf{b}\right)$
◦ Deep alignment(6)
19
With $\mathbf{W} = [\mathbf{W}_1; \mathbf{W}_2]$ acting on $[\mathbf{q}; \mathbf{k}_i]$, this is the same as concat attention, but $\mathbf{W}_2\mathbf{k}_i$ can be precomputed only once per key.
(5)Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015, Neural machine translation by jointly learning to align and translate, ICLR 2015
(6)John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deeper attention to abusive user content moderation.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1125–1135
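A matching sketch of the additive score. Note that 𝑾𝟐𝒌𝒊 does not depend on the decoding step, so it can be precomputed once per key, which is the efficiency point made above (tanh again stands in for act; the names are illustrative).

```python
import numpy as np

def additive_scores(q, K, W1, W2, b, w_imp):
    """w_imp^T . act(W1 q + W2 k_i + b) for every key k_i, with act = tanh."""
    projected_keys = K @ W2.T                        # precompute W2 k_i once for all steps
    hidden = np.tanh(W1 @ q + projected_keys + b)    # broadcasts over the T keys
    return hidden @ w_imp                            # one energy score per key

T, d_q, d_k, d_h = 5, 8, 8, 12
q, K = np.random.randn(d_q), np.random.randn(T, d_k)
W1, W2 = np.random.randn(d_h, d_q), np.random.randn(d_h, d_k)
b, w_imp = np.zeros(d_h), np.random.randn(d_h)
energies = additive_scores(q, K, W1, W2, b, w_imp)
```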
◦ Compositional De-Attention(7)
◦ Idea: attend to similar inputs, and de-attend from dissimilar ones
◦ Algorithm
◦ 𝑎𝑖: key; 𝑏𝑖: query
◦ Pairwise similarity measurement
◦ Dissimilarity measurement
◦ The final (quasi-)attention matrix
20
Parameterized functions with parameters 𝛼, 𝛽
(7)Yi Tay, Anh Tuan Luu, Aston Zhang, Shuohang Wang, and Siu Cheung Hui. 2019. Compositional de-attention
networks. Adv. Neural Inf. Process. Syst. 32 (2019)
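The exact formulas are in the cited paper; the sketch below is only one plausible instantiation of the idea, assuming a scaled dot product as the pairwise similarity and a negative L1 distance as the dissimilarity, so the resulting quasi-attention weights lie in (−1, 1): positive entries attend, negative entries de-attend.

```python
import numpy as np

def compositional_de_attention(A, B, alpha=1.0, beta=1.0):
    """A: (n, d) keys a_i, B: (m, d) queries b_j. Returns an (n, m) quasi-attention matrix."""
    # Pairwise similarity (here: scaled dot product), squashed to (-1, 1) by tanh.
    sim = np.tanh((A @ B.T) / alpha)
    # Pairwise dissimilarity (here: L1 distance), turned into a gate via sigmoid(-l1 / beta).
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    gate = 1.0 / (1.0 + np.exp(l1 / beta))
    # The tanh term carries the sign (attend vs. de-attend); the gate damps pairs that are far apart.
    return sim * gate

A = np.random.randn(4, 8)
B = np.random.randn(6, 8)
M = compositional_de_attention(A, B)   # entries in (-1, 1)
```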
Luong attention
◦ Global attention
◦ Consider all the hidden states of the encoder when deriving the context vector 𝒄𝒕
◦ Local attention
◦ Focus only on a small subset of the source positions (i.e., of the encoder states) per target word
21
(4)Minh-Thang Luong, et al., “Effective approaches to attention-based neural machine
Translation”, 2015
Luong attention
22
[Figure: side by side, Luong’s global attention mechanism and the first attention mechanism, proposed by Bahdanau (2015), both attending over the hidden states of the encoder.]
What is the main difference between them?
Luong attention
23
[Figure: the same two mechanisms, attending over the hidden states of the encoder.]
◦ Luong attention predicts the relevance between the current decoder hidden state 𝒉𝒕 and the input states
◦ Bahdanau attention predicts the relevance between the previous decoder hidden state 𝒉𝒕−𝟏 and the input states
◦ In Luong attention, 𝑦𝑡 cannot be predicted from 𝒉𝒕 alone  a new term, the attentional state 𝒉̃𝒕, is introduced to compute 𝑦𝑡. The flow is 𝒉𝒕 → 𝒄𝒕 → 𝒉̃𝒕 → 𝑦𝑡
◦ In Bahdanau attention, the context vector 𝒄𝒕 is an input when calculating 𝒉𝒕, and 𝑦𝑡 is computed from 𝒉𝒕. The flow is 𝒉𝒕−𝟏 → 𝒄𝒕 → 𝒉𝒕 → 𝑦𝑡
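A compact sketch of one Luong-style decoding step, to make the flow 𝒉𝒕 → 𝒄𝒕 → 𝒉̃𝒕 → 𝑦𝑡 concrete. Dot-product scoring is assumed here, and W_c and W_s are illustrative names for the two output projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_decode_step(h_t, encoder_states, W_c, W_s):
    """h_t: current decoder state, already produced by the decoder RNN."""
    scores = encoder_states @ h_t                        # relevance of h_t to every source state
    a_t = softmax(scores)
    c_t = a_t @ encoder_states                           # context vector
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional state h~_t
    y_t = softmax(W_s @ h_tilde)                         # output distribution
    return y_t                                           # flow: h_t -> c_t -> h~_t -> y_t

d, T, vocab = 16, 6, 100
h_t = np.random.randn(d)
H = np.random.randn(T, d)
W_c = np.random.randn(d, 2 * d)
W_s = np.random.randn(vocab, d)
y = luong_decode_step(h_t, H, W_c, W_s)
```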
Luong attention
◦ Alignment functions
◦ They proposed three alignment functions: dot, general, and concat (discussed above)
24
Luong attention
◦ Local attention
◦ Main idea: calculate the attention weights for only a subset of the keys
◦ Flow: for each target word at timestep 𝑡
◦ generate an aligned position 𝑝𝑡
◦ The attention weights and the context vector 𝑐𝑡 are calculated over the window [𝑝𝑡 − 𝐷, 𝑝𝑡 + 𝐷], where 𝐷 is a hyperparameter
◦ How to determine 𝑝𝑡
◦ Monotonic alignment (local-m)
◦ 𝑝𝑡 = 𝑡
◦ Predictive alignment (local-p)
◦ $p_t = S \cdot \mathrm{sigmoid}\!\left(\mathbf{v}_p^{T}\tanh(\mathbf{W}_p \mathbf{h}_t)\right)$, where 𝑆 is the length of the input and 𝒗𝒑, 𝑾𝒑 are a learnable vector and matrix
25
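A sketch of local-p under these definitions: 𝑝𝑡 = 𝑆 · sigmoid(𝒗𝒑ᵀ tanh(𝑾𝒑𝒉𝒕)), a window of radius 𝐷 around 𝑝𝑡, and a Gaussian re-weighting centered at 𝑝𝑡 with σ = 𝐷/2 as in the Luong paper; the dot-product scoring inside the window is an illustrative choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_p_attention(h_t, encoder_states, W_p, v_p, D):
    S = len(encoder_states)
    # Predict the aligned position p_t in [0, S].
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))
    # Score only the window [p_t - D, p_t + D].
    lo = int(max(0, np.floor(p_t - D)))
    hi = int(min(S, np.ceil(p_t + D) + 1))
    window = encoder_states[lo:hi]
    a = softmax(window @ h_t)
    # Favor positions near p_t with a Gaussian (sigma = D / 2); no renormalization.
    positions = np.arange(lo, hi)
    a = a * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    c_t = a @ window
    return p_t, a, c_t

d, S = 16, 20
h_t = np.random.randn(d)
H = np.random.randn(S, d)
W_p, v_p = np.random.randn(d, d), np.random.randn(d)
p, a, c = local_p_attention(h_t, H, W_p, v_p, D=3)
```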
Distribution functions
◦ Argmax: select the single highest-scoring element
◦ Softmax: a dense distribution  every element receives a nonzero weight:
$a_{ti} = \dfrac{\exp(e_{ti})}{\sum_{j=1}^{T} \exp(e_{tj})}$
◦ Sparsemax(8)
◦ Returns sparse posterior distributions, assigning zero probability to some output variables
26
(8)Martins et al., From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification, PMLR 2016
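A minimal sparsemax sketch: the scores are projected onto the probability simplex, which drives some weights to exactly zero (the sort-based threshold computation below follows the standard closed-form algorithm from the cited paper).

```python
import numpy as np

def sparsemax(z):
    """Project the score vector z onto the probability simplex; many entries become exactly 0."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    # Largest k such that 1 + k * z_(k) > cumulative sum of the top-k scores.
    support = k * z_sorted > cumsum - 1
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z      # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.0, 0.1, -1.0])
print(sparsemax(scores))   # sparse output: the trailing entries are exactly zero
```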
PM2.5 multi-stations prediction
31
[Figure: PM2.5 multi-station prediction. Each station i (i = 1, …, n) has encoder states h_1^i, h_2^i, …, h_l^i; Hard Self-Attention and Soft Self-Attention blocks combine them into per-station states s_0^i, s_1^i, …, s_n^i.]
PM2.5 multi-stations prediction
32
[Figure: the same multi-station architecture as on the previous slide.]
33
[Figure: attention between states $s_i$ and $s_j$, with seven component scores $\alpha_1^{(i,j)}, \dots, \alpha_7^{(i,j)}$ per pair.]
$e_{i,j} = \sum_{k=1}^{7} \alpha_k^{(i,j)}$ and, likewise, $e_{i,j+1} = \sum_{k=1}^{7} \alpha_k^{(i,j+1)}$   $e_{i,j}$ is different from $e_{i,j+1}$
Biased variant: $e_{ij} = \sum_{k=1}^{7} \alpha_k^{(ij)} + b\,\alpha_i^{(ij)}$
34
[Figure: the analogous scoring between a state $s_i$ and an encoder state $h_j$, with component scores $\alpha_1^{(i,j)}, \dots, \alpha_7^{(i,j)}$, states $s_1, \dots, s_7$, encoder states $h_{j-i+1}, \dots, h_{j-i+7}$, and a bias factor $\times\, b$.]
$e_{i,j} = \sum_{k=1}^{7} \alpha_k^{(i,j)}$ and $e_{i,j+1} = \sum_{k=1}^{7} \alpha_k^{(i,j+1)}$   $e_{i,j}$ is different from $e_{i,j+1}$
Biased variant: $e_{ij} = \sum_{k=1}^{7} \alpha_k^{(ij)} + b\,\alpha_i^{(ij)}$