Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
Jinchao Zhang,	Mingxuan Wang,	Qun Liu,	Jie Zhou
ACL 2017
paper presentation: Sekizawa Yuuki, Komachi lab M2 (2017/11/13)
Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
• word reordering model
• a crucial sub-component of SMT
• attention mechanism of NMT
• sometimes attends to inappropriate source words
• leads to incorrect translations
• proposed method
• incorporates word reordering knowledge into attention-based NMT using a distortion model
• attention considers both the semantic requirement and a word reordering penalty
• achieves SOTA translation quality
• improves word alignment quality
Chinese-English	translation	example
src: youguan baodao shi zhichi tamen lundian de zuixin yiju .
gloss: related report is support their arguments 's latest evidence .
ref: the report is the latest evidence that supports their arguments .
NMT output: the report supports their perception of the latest .
count of the collocation "zuixin yiju": 0 (a rare collocation)
zuixin (latest): a common adjective in Chinese
→ in the Chinese-to-English translation direction, the word that follows it should be translated soon
yiju (evidence): does not obtain appropriate attention (see the figure),
which leads to the incorrect translation
(figure: attention matrix, with the incorrect attention highlighted)
proposed method
• distortion model using word reordering knowledge
• modeled as the probability distribution of the relative jump distances between the newly translated source word and the to-be-translated source word
• extends the attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty
• merits
• extended word reordering knowledge
• convenient to incorporate into attention-based NMT
• flexible in utilizing various contexts for computing the word reordering penalty
Distortion	Models	in	SMT	
(figure: the distortion feature as one of N features in the SMT log-linear model)
N: the number of features
SMT: the distortion feature is one of N features and is separately trained
NMT (proposed): the distortion model is trained jointly with the NMT model in an end-to-end style
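For reference, a sketch of the standard SMT log-linear formulation that this slide contrasts against (this equation is my reconstruction, not the slide's original figure):

\[
\hat{e} = \arg\max_{e} \sum_{n=1}^{N} \lambda_n h_n(f, e)
\]

Here one feature h_n(f, e) is the distortion (reordering) feature, and the feature weights λ_n are tuned separately from the other model components; these λ_n are unrelated to the interpolation parameter λ used in the proposed method below.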
proposed method: general architecture
• α^t: alignment vector computed by the basic attention mechanism
• d^t: alignment vector calculated by the distortion model
• λ: hyperparameter for interpolating the two alignment vectors
• c^t: related source context
• Ψ: context (source context, target context, or translation status (hidden state of the decoder))
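A sketch of how these pieces combine, reconstructed from the definitions above (the exact notation, and which alignment vector receives the weight λ, follow my reading of the paper and may differ in detail):

\[
\tilde{\alpha}^t = (1 - \lambda)\,\alpha^t + \lambda\, d^t, \qquad
c^t = \sum_{j} \tilde{\alpha}^t_j\, h_j
\]

where h_j are the encoder annotations of the source words, and the interpolated alignment vector \tilde{\alpha}^t is used to compute the source context c^t fed to the decoder.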
proposed	method’s	attention
k: possible relative jump distance within the window
l: window size parameter
P(k | Ψ): probability of jump distance k given the context Ψ
Γ_k: operator that shifts the alignment vector by k positions
(figure: relative jumps on source words)
the distortion model estimates the probability distribution of the possible relative jump distances between the newly translated source word and the to-be-translated source word, conditioned on the context
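A minimal Python/NumPy sketch of this shift-and-weight computation (my own illustration of the idea; the function and variable names are not from the paper, and edge handling at sentence boundaries is simplified):

```python
import numpy as np

def distortion_alignment(prev_align, jump_probs, window):
    """Compute the distortion-based alignment vector d^t.

    prev_align: previous alignment vector alpha^{t-1}, shape (src_len,)
    jump_probs: P(k | context) for k = -window, ..., +window, shape (2*window + 1,)
    window:     window size parameter l
    """
    src_len = prev_align.shape[0]
    d = np.zeros(src_len)
    for idx, k in enumerate(range(-window, window + 1)):
        # Gamma_k: shift the previous alignment vector by k source positions
        shifted = np.zeros(src_len)
        if k >= 0:
            shifted[k:] = prev_align[:src_len - k]
        else:
            shifted[:k] = prev_align[-k:]
        # weight the shifted vector by the probability of jump distance k
        d += jump_probs[idx] * shifted
    return d
```

For example, with window l = 3, jump_probs has 7 entries; a peak at k = +1 pushes the attention mass one position forward on the source side.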
3	distortion	models	(1/2)
1. S-Distortion model
• adopts the previous source context c^{t-1} as the context Ψ, with the intuition that certain source words indicate certain jump distances
• underlying linguistic intuition: synchronous grammars
• e.g. NP → JJ NN | JJ NN,  JJ → zuixin | latest
• once zuixin (latest) is translated, the translation orientation is forward with shift distance 1
3	distortion	models	(2/2)
2. T-Distortion model
• exploits the embedding of the previously generated target word y_{t-1} as the context Ψ
• focuses on word reordering knowledge conditioned on the target word context
3. H-Distortion model
• exploits the decoder hidden state s_{t-1} as the context Ψ; it reflects the translation status and contains both source context and target context information
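A sketch of how the jump-distance distribution P(k | Ψ) could be produced from each choice of Ψ (my own illustration; the exact parameterization, dimensions, and any nonlinearity used in the paper are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def jump_distribution(psi, W, b):
    """Return P(k | psi) over jump distances k = -l, ..., +l.

    psi: the context vector Psi
    W:   projection matrix, shape (2*l + 1, dim(psi))
    b:   bias vector, shape (2*l + 1,)
    """
    return softmax(W @ psi + b)

# The three distortion models differ only in the choice of Psi:
#   S-Distortion: psi = c^{t-1}        (previous source context vector)
#   T-Distortion: psi = emb(y_{t-1})   (embedding of the previous target word)
#   H-Distortion: psi = s_{t-1}        (previous decoder hidden state)
```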
Experiment
• language:	Chinese-to-English
• data
• train:	1.25M	sentence	pairs	from	LDC	corpora	
• validation:	NIST	2002	dataset	
• test:	NIST	2003-2006	dataset	
• alignment evaluation data: Tsinghua dataset (Liu and Sun, 2015),
which contains 900 manually aligned sentence pairs
• evaluation:	BLEU,	Alignment	error	rate	(AER)
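For reference, AER here is the standard alignment error rate (Och and Ney, 2003); this definition is not on the slide. With predicted alignment A, sure gold links S, and possible gold links P:

\[
\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
\]

Lower AER means better alignment quality.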
• MT	system
• Moses,	Groundhog,	RNNsearch*	(in-house	implementation)
• NMT hyperparameters
• max	length	of	sentence:	50
• vocabulary	size:	16K,	30K
• encoder:	bi-directional	GRU
• word	embedding	dimension:	620
• hidden	layer	size:	1,000
• interpolation parameter λ: 0.5
• window size l: 3
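The same settings collected as a hypothetical Python config dictionary for readability (the key names are my own, not from any particular toolkit):

```python
# hyperparameters listed on the slide
nmt_config = {
    "max_sentence_length": 50,
    "vocabulary_sizes": [16000, 30000],   # two settings are compared
    "encoder": "bidirectional GRU",
    "word_embedding_dim": 620,
    "hidden_layer_size": 1000,
    "interpolation_lambda": 0.5,          # weight between attention and distortion alignments
    "distortion_window_l": 3,
}
```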
result	(BLEU)
the 16K-vocabulary setting improves more than the 30K setting
→ the proposed models alleviate the rare-word-collocation problem that leads to incorrect word alignments
comparison with previous work
• Coverage: the basic RNNsearch model with a coverage model that alleviates the over-translation and under-translation problems
• MEMDEC: improves translation quality with an external memory
• NMTIA: exploits a readable and writable attention mechanism to keep track of the interaction history during decoding
• our work: uses the H-Distortion model
• vocab size: 30K; Length: maximum sentence length
comparison of the proposed methods (BLEU↑, AER↓)
attention	improvement
(figure: attention matrices of the base model vs. the distortion model)
hyperparameter settings
l = 3, λ = 0.5
Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
• word reordering model
• a crucial sub-component of SMT
• attention mechanism of NMT
• sometimes attends to inappropriate source words
• leads to incorrect translations
• proposed method
• incorporates word reordering knowledge into attention-based NMT using a distortion model
• attention considers both the semantic requirement and a word reordering penalty
• achieves SOTA translation quality
• improves word alignment quality
