2. Contents
• Sequence to sequence and bottleneck problem
• Sequence to sequence with attention
• Attention variants (1) Score
• Attention variants (2) Self-attention
• Attention variants (3) Multi-headed attention
• Attention variants (4) Others
3. Sequence to sequence and bottleneck problem
References
cs224n-2019-lecture08-nmt
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983
“On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. Cho et al., 2014.
“Neural Machine Translation by Jointly Learning to Align and Translate”. Bahdanau et al., 2015.
4. Sequence to sequence and bottleneck problem
• Encoder-Decoder architecture
• learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence (a minimal sketch follows this list).
7. Sequence to sequence and bottleneck problem
• Bottleneck problem
• “We show that the neural machine translation performs relatively
well on short sentences without unknown words, but its
performance degrades rapidly as the length of the sentence and
the number of unknown words increase.” (Cho et al. 2014)
9. Sequence to sequence and bottleneck problem
• Bottleneck problem
• “we conjecture that the use of a fixed-length vector is a
bottleneck in improving the performance of this basic encoder–
decoder architecture, and propose to extend this by allowing a
model to automatically (soft-)search for parts of a source sentence
that are relevant to predicting a target word, without having to form
these parts as a hard segment explicitly.” (Bahdanau et al., 2015)
10. Sequence to sequence and bottleneck problem
• Alignment
• the correspondence between particular words in the translated
sentence pair.
18. Sequence to sequence with Attention
• Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence (sketched below).
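As a rough illustration of that core idea, here is one decoder step with attention in numpy. The slides do not fix a particular score function at this point, so plain dot-product scoring is assumed here; the shapes are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # decoder_state:  (hidden,)          current decoder hidden state (the query)
    # encoder_states: (src_len, hidden)  one hidden state per source token (the values)
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # attention distribution over the source
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights                   # context is used to predict the next target word

Because a fresh context vector is computed at every decoder step, the decoder is no longer forced to work from a single fixed-length summary of the source.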
49. Attention variants (2) Self-attention
• LSTMN (Cheng et al., 2016)
• Inter-attention (e.g. seq2seq with attention)
• Query: a token of the target sentence
• Values: tokens of the source sentence
• Intra-attention (= self-attention)
• Query: the current token in the input sentence
• Values: the previous tokens in the input sentence (both cases are contrasted in the sketch below)
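A toy numpy comparison of the two cases; the dimensions and random vectors are purely illustrative, and dot-product scoring is assumed.

import numpy as np

def attention_weights(query, values):
    # Dot-product attention of one query vector over a set of value vectors.
    scores = values @ query
    w = np.exp(scores - scores.max())
    return w / w.sum()

# Inter-attention (seq2seq with attention): the query comes from the target side,
# the values from the source side.
source_states = np.random.randn(5, 8)   # hidden states of 5 source tokens
decoder_state = np.random.randn(8)      # current target-side state
inter = attention_weights(decoder_state, source_states)

# Intra-attention (self-attention): the current token of the input sentence
# attends over the previous tokens of the same sentence.
input_states = np.random.randn(6, 8)    # states of the 6 tokens read so far
intra = attention_weights(input_states[-1], input_states[:-1])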
50. Attention variants (2) Self-attention
• How can a sequence-level network induce relations which
are presumed latent during text processing?
• How to render sequence-level networks better at handling
structured input?
54. Attention variants (2) Self-attention
• How to render sequence-level networks better at handling structured input?
• Drawing inspiration from human language processing and the fact that language comprehension is incremental, with readers continuously extracting the meaning of utterances on a word-by-word basis
55. Attention variants (2) Self-attention
• LSTMN (Cheng et al., 2016)
• Inter-attention (e.g. seq2seq with attention)
• Query: a token of the target sentence
• Values: tokens of the source sentence
• Intra-attention (= self-attention)
• Query: the current token in the input sentence
• Values: the previous tokens in the input sentence
• A key idea behind the LSTMN is to use attention for inducing relations between tokens.
56. Attention variants (2) Self-attention
• $x_t$: the current input
• $C_{t-1} = (c_1, \dots, c_{t-1})$: the current memory tape
• $H_{t-1} = (h_1, \dots, h_{t-1})$: the previous hidden tape
• At time step $t$, the model computes the relation between $x_t$ and $x_1, \dots, x_{t-1}$ through $h_1, \dots, h_{t-1}$ with an attention layer:
  $a_i^t = v^\top \tanh(W_h h_i + W_x x_t + W_{\tilde{h}} \tilde{h}_{t-1})$   (additive attention)
  $s_i^t = \operatorname{softmax}(a_i^t)$
• This yields a probability distribution over the hidden state vectors of previous tokens (sketched in code below)
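A small numpy sketch of these two equations; the matrix shapes are illustrative, and the parameters $W_h$, $W_x$, $W_{\tilde{h}}$ and $v$ would normally be learned rather than passed in.

import numpy as np

def lstmn_attention(x_t, H_prev, h_tilde_prev, W_h, W_x, W_ht, v):
    # x_t:            (d,)      current input
    # H_prev:         (t-1, d)  previous hidden tape (h_1, ..., h_{t-1})
    # h_tilde_prev:   (d,)      previous attentive hidden state h~_{t-1}
    # W_h, W_x, W_ht: (d, d)    learned projections;  v: (d,) learned scoring vector
    # a_i^t = v^T tanh(W_h h_i + W_x x_t + W_h~ h~_{t-1}) for every previous position i
    scores = np.tanh(H_prev @ W_h.T + x_t @ W_x.T + h_tilde_prev @ W_ht.T) @ v
    # s_i^t = softmax(a_i^t): probability distribution over previous hidden states
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()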
70. Attention variants (3) Multi-headed attention
• It gives the attention layer multiple “representation subspaces”.
• With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices, one set per head (see the sketch below).
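A minimal numpy sketch of that idea, assuming scaled dot-product attention inside each head (as in the Transformer); the head count and dimensions are illustrative.

import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # X:        (seq_len, d_model)           input token representations
    # Wq/Wk/Wv: (n_heads, d_model, d_head)   a separate Q/K/V projection per head
    # Wo:       (n_heads * d_head, d_model)  output projection that merges the heads
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h        # project into this head's subspace
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len) similarity scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
        heads.append(weights @ V)                     # (seq_len, d_head) per-head output
    return np.concatenate(heads, axis=-1) @ Wo        # concatenate the subspaces and mix

Because each head has its own Query/Key/Value projections, each head can learn to attend to different positions or relations in its own representation subspace.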