Attention mechanism
in Neural Machine Translation
PHAM QUANG KHANG
Machine Translation task in NLP
1. Definition: translating text from one language to another
2. Evaluation datasets: public datasets consisting of sentence pairs in two languages (source
language and target language)
a. Most common in research: Eng-Fra, Eng-Ger
b. Vietnamese: Eng-Vi, 133k sentence pairs
French-to-English translations from newstest2014 (Artetxe et al., 2018)

Eng-to-Vi example from the IWSLT'15 dataset (133K sentence pairs):
Source: And of course, we all share the same adaptive imperatives.
Reference: Và tất nhiên, tất cả chúng ta đều trải qua quá trình phát triển và thích nghi như nhau.
BLEU score: the standard evaluation metric for MT
1. Definition: BLEU compares n-grams of the candidate translation with n-grams of the reference
translation and counts the number of matches (Papineni et al., 2002)
2. Calculation: modified (clipped) n-gram precision, combined with a brevity penalty
Example (Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation):
Candidate: the the the the the the the
Reference: The cat is on the mat
Count_clip(the) = 2, Count(the) = 7
Modified unigram precision = 2/7
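To make the clipping concrete, here is a minimal Python sketch of the modified unigram precision used in the example above; the function name and the plain word-splitting tokenization are my own illustrative choices, not code from the paper.

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """Clipped unigram precision (Papineni et al., 2002)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each candidate count by the count observed in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Count_clip(the) = 2, Count(the) = 7  =>  2/7
print(modified_unigram_precision("the the the the the the the",
                                 "The cat is on the mat"))  # ~0.286
```

The full BLEU score combines clipped precisions for n-grams up to length 4 (as a geometric mean) with a brevity penalty for short candidates.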
NMT as a hot topic for research
• The number of published papers on NMT spiked last year
• Key players:
• Facebook: tackling low-resource languages (Turkish, Vietnamese …)
• Amazon: improving efficiency
• Google: improving the output quality of NMT
• Business need is higher than ever, since automatic translation can save massive costs for global firms
Number of NMT papers in the last few years (papers counted from arXiv)
Source: https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/
Main approaches for NMT recently
1. Recurrent Neural Networks (RNNs):
• Using LSTM, GRU, or bidirectional RNNs for the encoder and decoder
2. Attention with RNNs: attention between the encoder and decoder
• Using an attention mechanism while decoding to improve the ability to capture long-term dependencies
3. Attention only: the Transformer
• Using only attention, both to capture long-term dependencies and to reduce computation cost
RNNs: processing sequential data
• Recurrent Neural Network (RNN): a neural network with a hidden state h; at each time step t,
h_t is computed from the input x_t and the previous hidden state h_{t-1} (sketched below)
Sources: http://colah.github.io/posts/2015-08-Understanding-LSTMs/; Luong et al., 2015
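To make the recurrence concrete, a minimal NumPy sketch of a vanilla RNN step is shown below; the weight names W, U, b and the tanh nonlinearity follow the textbook formulation, not code from the cited sources.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One time step: h_t = tanh(W @ h_{t-1} + U @ x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

hidden, embed, length = 4, 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, hidden))   # recurrent weights
U = rng.normal(size=(hidden, embed))    # input weights
b = np.zeros(hidden)

h = np.zeros(hidden)                           # initial hidden state
for x_t in rng.normal(size=(length, embed)):   # toy input sequence, one vector per word
    h = rnn_step(x_t, h, W, U, b)              # h_t depends on x_t and h_{t-1}
```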
RNNs: processing sequential data
RNNs have shown promising results: RNN-based systems achieve performance close to the state of the art
of conventional phrase-based machine translation on the English-to-French task.
BLEU scores of several architectures on the Eng-Ger dataset (Luong et al., 2015)
Attention: aligning while translating
• Intuition: each time the model generates a word in the translation, it searches for a set of
positions in the source sentence where the most relevant information is concentrated
(Bahdanau et al., 2015)
• Advantages over pure RNNs:
1. Does not encode the whole input into a single vector => less information is lost
2. Allows the model to adaptively select which parts of the source to attend to
Sources: Bahdanau et al., 2015; https://github.com/tensorflow/nmt
Attention mechanism
• When decoding to predict an output word, compute a score between the hidden vector of the current
decoder state and all hidden vectors of the input sentence (sketched below)
Sources: Luong et al., 2015; https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=TNfHIF71ulLu
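A minimal sketch of this scoring step in the spirit of Luong et al.'s global dot-product attention; the shapes and variable names are illustrative assumptions, not the implementation from the linked notebook.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention(dec_state, enc_states):
    """dec_state: (d,); enc_states: (T, d), one hidden vector per source word."""
    scores = enc_states @ dec_state   # score(h_t, h_s) = h_t . h_s for every source position
    weights = softmax(scores)         # alignment weights over the source sentence
    context = weights @ enc_states    # weighted sum of encoder hidden states
    return context, weights

rng = np.random.default_rng(1)
context, weights = dot_attention(rng.normal(size=8), rng.normal(size=(6, 8)))
# context is then combined with the decoder state to predict the next target word
```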
Attention for long sentence translation
Bahdanau et al.: compared against a pure RNN decoder on the same dataset, tested on the combined
WMT'12 and WMT'13 test sets.
Luong et al.: tested on the WMT'14 Eng-Ger test set.
Attention proved better than a plain RNN decoder for long-sentence translation.
Google NMT system
Attention between the encoder and decoder
Wu et al., Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation
Self-attention for representation learning
• The essence of the encoder is to create a representation of the input
• Self-attention allows direct connections between words within one sentence while shortening the
path between them, which clearly differentiates it from RNNs and CNNs
• Input and output: the query, keys, and values all come from the same source
Sources: http://deeplearning.hatenablog.com/entry/transformer; Lin et al., A Structured Self-Attentive Sentence Embedding
Transformer: Attention is all you need
1. Uses self-attention instead of RNNs or CNNs
a. Multi-head attention for self-attention and source-target attention
b. Position-wise feed-forward layers after attention
c. Masked multi-head attention to prevent target words from attending to "future" words
d. Word embedding + positional encoding (see the sketch below)
2. Reduces total computational complexity per layer
3. More of the computation can be parallelized
4. Enhances the ability to learn long-range dependencies
An architecture based solely on attention mechanisms (Attention Is All You Need)
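Because self-attention alone is order-agnostic, the Transformer injects word order through the positional encoding mentioned in item 1d above. Below is a short NumPy sketch of the sinusoidal encoding from the paper; the array shapes and variable names are my own choices.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

# Added to the word embeddings before the first encoder/decoder layer
pe = positional_encoding(max_len=50, d_model=512)
```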
Multi-head attention
• The attention operation itself is only a scaled dot product between the query and the keys, whose softmax-normalized scores weight the values
• Each (V, K, Q) is projected into multiple sets of (v, k, q) so that different heads can learn different relations (multi-head)
• The head outputs are concatenated and then linearly transformed (see the sketch below)
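A compact NumPy sketch of scaled dot-product attention and the multi-head wrapper described above (the sketch referenced in the last bullet); the random projection matrices and the optional mask argument are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:                         # e.g. to hide "future" target words in the decoder
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, n_heads, rng):
    d_model = Q.shape[-1]
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):                     # each head has its own projections
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.normal(size=(d_model, d_model))     # final linear transform after concatenation
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 64))                     # 5 tokens, d_model = 64
out = multi_head_attention(x, x, x, n_heads=8, rng=rng)   # self-attention: Q = K = V
```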
State-of-the-art in MT
• Outperforms the best reported models (at the time of the paper) by more than 2.0 BLEU, at a far
lower training cost than those models
Attention Is All You Need
Visualization of self-attention in the Transformer
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The model recognizes what "it" refers to (left: animal, right: street)
Potential of MT and Transformer
1. Customer-oriented applications:
a. Text-to-text translation: Google Translate, Facebook translation…
=> Need better engines for rare and difficult languages such as Japanese, Vietnamese, Arabic…
b. Speech-to-text, speech-to-speech …
2. B2B applications:
a. Full-text translation (company documentation …)
b. Domain-specific real-time translation: for meetings, for workflow automation…
3. Scientific:
a. Attention and self-attention as an approach for other tasks such as language modeling and question
answering (BERT from Google: AI outperforms humans in question answering)
References
1. Vaswani et al., Attention Is All You Need, NIPS 2017
2. Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate
3. Lin et al., A Structured Self-Attentive Sentence Embedding
4. Goodfellow et al., Deep Learning, 2016
5. IWSLT 2015 dataset for English-Vietnamese: https://wit3.fbk.eu/mt.php?release=2015-01
6. Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
7. https://machinelearningmastery.com/introduction-neural-machine-translation/
8. http://deeplearning.hatenablog.com/entry/transformer
9. https://medium.com/the-new-nlp/ai-outperforms-humans-in-question-answering-70554f51136b