2. Contents
• Sequence to sequence and bottleneck problem
• Sequence to sequence with attention
• Attention variants (1) Score
• Attention variants (2) Self-attention
• Attention variants (3) Multi-headed attention
• Attention variants (4) Others
3. Sequence to sequence and bottleneck problem
References
cs224n-2019-lecture08-nmt
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983
“On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. Cho et al., 2014.
“Neural Machine Translation by Jointly Learning to Align and Translate”. Bahdanau et al., 2015.
4. Sequence to sequence and bottleneck problem
• Encoder-Decoder architecture
• learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence (a minimal sketch follows this list).
7. Sequence to sequence and bottleneck problem
• Bottleneck problem
• “We show that the neural machine translation performs relatively
well on short sentences without unknown words, but its
performance degrades rapidly as the length of the sentence and
the number of unknown words increase.” (Cho et al. 2014)
9. Sequence to sequence and bottleneck problem
• Bottleneck problem
• “we conjecture that the use of a fixed-length vector is a
bottleneck in improving the performance of this basic encoder–
decoder architecture, and propose to extend this by allowing a
model to automatically (soft-)search for parts of a source sentence
that are relevant to predicting a target word, without having to form
these parts as a hard segment explicitly.” (Bahdanau et al., 2015)
10. Sequence to sequence and bottleneck problem
• Alignment
• the correspondence between particular words in the translated
sentence pair.
18. Sequence to sequence with Attention
• Attention
• Attention provides a solution to the bottleneck problem.
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence (sketched below).
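As a rough illustration of that core idea, here is one decoder step with attention in numpy. The slides do not fix a particular score function at this point, so plain dot-product scoring is assumed here; the shapes are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # decoder_state:  (hidden,)          current decoder hidden state (the query)
    # encoder_states: (src_len, hidden)  one hidden state per source token (the values)
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # attention distribution over the source
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights                   # context is used to predict the next target word

Because a fresh context vector is computed at every decoder step, the decoder is no longer forced to work from a single fixed-length summary of the source.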
49. Attention variants (2) Self-attention
• LSTMN (Cheng et al., 2016)
• Inter-attention (e.g. seq2seq with attention)
• Query: a token of the target sentence
• Values: tokens of the source sentence
• Intra-attention (= self-attention)
• Query: the current token in the input sentence
• Values: the previous tokens in the input sentence (both cases are contrasted in the sketch below)
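A toy numpy comparison of the two cases; the dimensions and random vectors are purely illustrative, and dot-product scoring is assumed.

import numpy as np

def attention_weights(query, values):
    # Dot-product attention of one query vector over a set of value vectors.
    scores = values @ query
    w = np.exp(scores - scores.max())
    return w / w.sum()

# Inter-attention (seq2seq with attention): the query comes from the target side,
# the values from the source side.
source_states = np.random.randn(5, 8)   # hidden states of 5 source tokens
decoder_state = np.random.randn(8)      # current target-side state
inter = attention_weights(decoder_state, source_states)

# Intra-attention (self-attention): the current token of the input sentence
# attends over the previous tokens of the same sentence.
input_states = np.random.randn(6, 8)    # states of the 6 tokens read so far
intra = attention_weights(input_states[-1], input_states[:-1])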
50. Attention variants (2) Self-attention
• How can a sequence-level network induce relations which
are presumed latent during text processing?
• How to render sequence-level networks better at handling
structured input?
54. Attention variants (2) Self-attention
• How to render sequence-level networks better at handling structured input?
• Drawing inspiration from human language processing and the fact that language comprehension is incremental, with readers continuously extracting the meaning of utterances on a word-by-word basis
55. Attention variants (2) Self-attention
• LSTMN (Cheng et al., 2016)
• Inter-attention (e.g. seq2seq with attention)
• Query: a token of the target sentence
• Values: tokens of the source sentence
• Intra-attention (= self-attention)
• Query: the current token in the input sentence
• Values: the previous tokens in the input sentence
• A key idea behind the LSTMN is to use attention for inducing relations between tokens.
56. Attention variants (2) Self-attention
• $x_t$: the current input
• $C_{t-1} = (c_1, \dots, c_{t-1})$: the current memory tape
• $H_{t-1} = (h_1, \dots, h_{t-1})$: the previous hidden tape
• At time step $t$, the model computes the relation between $x_t$ and $x_1, \dots, x_{t-1}$ through $h_1, \dots, h_{t-1}$ with an attention layer:
  $a_i^t = v^\top \tanh(W_h h_i + W_x x_t + W_{\tilde{h}} \tilde{h}_{t-1})$   (additive attention)
  $s_i^t = \operatorname{softmax}(a_i^t)$
• This yields a probability distribution over the hidden state vectors of previous tokens (sketched in code below)
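A small numpy sketch of these two equations; the matrix shapes are illustrative, and the parameters $W_h$, $W_x$, $W_{\tilde{h}}$ and $v$ would normally be learned rather than passed in.

import numpy as np

def lstmn_attention(x_t, H_prev, h_tilde_prev, W_h, W_x, W_ht, v):
    # x_t:            (d,)      current input
    # H_prev:         (t-1, d)  previous hidden tape (h_1, ..., h_{t-1})
    # h_tilde_prev:   (d,)      previous attentive hidden state h~_{t-1}
    # W_h, W_x, W_ht: (d, d)    learned projections;  v: (d,) learned scoring vector
    # a_i^t = v^T tanh(W_h h_i + W_x x_t + W_h~ h~_{t-1}) for every previous position i
    scores = np.tanh(H_prev @ W_h.T + x_t @ W_x.T + h_tilde_prev @ W_ht.T) @ v
    # s_i^t = softmax(a_i^t): probability distribution over previous hidden states
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()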
70. Attention variants (3) Multi-headed attention
• It gives the attention layer multiple “representation subspaces”.
• With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices, one set per head (see the sketch below).
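A minimal numpy sketch of that idea, assuming scaled dot-product attention inside each head (as in the Transformer); the head count and dimensions are illustrative.

import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # X:        (seq_len, d_model)           input token representations
    # Wq/Wk/Wv: (n_heads, d_model, d_head)   a separate Q/K/V projection per head
    # Wo:       (n_heads * d_head, d_model)  output projection that merges the heads
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h        # project into this head's subspace
        scores = Q @ K.T / np.sqrt(K.shape[-1])       # (seq_len, seq_len) similarity scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
        heads.append(weights @ V)                     # (seq_len, d_head) per-head output
    return np.concatenate(heads, axis=-1) @ Wo        # concatenate the subspaces and mix

Because each head has its own Query/Key/Value projections, each head can learn to attend to different positions or relations in its own representation subspace.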