Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
Jinchao Zhang,	Mingxuan Wang,	Qun Liu,	Jie Zhou
ACL 2017
paper presentation: Sekizawa Yuuki, Komachi lab M2 (2017/11/13)
Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
• word reordering model
• a crucial sub-component of SMT
• attention mechanism of NMT
• sometimes attends to inappropriate source words
• leads to incorrect translations
• proposed method
• incorporates word reordering knowledge into attention-based NMT using a distortion model
• attention considers both the semantic requirement and a word reordering penalty
• achieves SOTA translation quality
• improves word alignment quality
Chinese-English	translation	example
src: youguan baodao shi zhichi tamen lundian de zuixin yiju .
gloss: related report is support their arguments 's latest evidence .
ref: the report is the latest evidence that supports their arguments .
NMT output: the report supports their perception of the latest .
count of the collocation "zuixin yiju": 0 (a rare collocation)
zuixin (latest): a common adjective in Chinese
→ in the Chinese-to-English translation direction, the word that follows it should be translated soon
yiju (evidence): does not obtain appropriate attention (see the figure),
which leads to the incorrect translation
(figure: attention matrix, with the incorrect attention highlighted)
proposed method
• distortion model using word reordering knowledge
• modeled as the probability distribution of the relative jump distances between the newly translated source word and the to-be-translated source word
• extends the attention mechanism to attend to source words regarding both the semantic requirement and the word reordering penalty
• merits
• extended word reordering knowledge
• convenient to incorporate into attention-based NMT
• flexible in utilizing various contexts for computing the word reordering penalty
Distortion	Models	in	SMT	
(figure: the distortion feature as one of N features in the SMT log-linear model)
N: the number of features
SMT: the distortion feature is one of N features and is separately trained
NMT (proposed): the distortion model is trained jointly with the NMT model in an end-to-end style
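For reference, a sketch of the standard SMT log-linear formulation that this slide contrasts against (this equation is my reconstruction, not the slide's original figure):

\[
\hat{e} = \arg\max_{e} \sum_{n=1}^{N} \lambda_n h_n(f, e)
\]

Here one feature h_n(f, e) is the distortion (reordering) feature, and the feature weights λ_n are tuned separately from the other model components; these λ_n are unrelated to the interpolation parameter λ used in the proposed method below.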
proposed method: general architecture
• α^t: alignment vector computed by the basic attention mechanism
• d^t: alignment vector calculated by the distortion model
• λ: hyperparameter for interpolating the two alignment vectors
• c^t: related source context
• Ψ: context (source context, target context, or translation status (hidden state of the decoder))
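A sketch of how these pieces combine, reconstructed from the definitions above (the exact notation, and which alignment vector receives the weight λ, follow my reading of the paper and may differ in detail):

\[
\tilde{\alpha}^t = (1 - \lambda)\,\alpha^t + \lambda\, d^t, \qquad
c^t = \sum_{j} \tilde{\alpha}^t_j\, h_j
\]

where h_j are the encoder annotations of the source words, and the interpolated alignment vector \tilde{\alpha}^t is used to compute the source context c^t fed to the decoder.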
proposed	method’s	attention
k: possible relative jump distance within the window
l: window size parameter
P(k | Ψ): probability of jump distance k given the context Ψ
Γ_k: operator that shifts the alignment vector by k positions
(figure: relative jumps on source words)
the distortion model estimates the probability distribution of the possible relative jump distances between the newly translated source word and the to-be-translated source word, conditioned on the context
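A minimal Python/NumPy sketch of this shift-and-weight computation (my own illustration of the idea; the function and variable names are not from the paper, and edge handling at sentence boundaries is simplified):

```python
import numpy as np

def distortion_alignment(prev_align, jump_probs, window):
    """Compute the distortion-based alignment vector d^t.

    prev_align: previous alignment vector alpha^{t-1}, shape (src_len,)
    jump_probs: P(k | context) for k = -window, ..., +window, shape (2*window + 1,)
    window:     window size parameter l
    """
    src_len = prev_align.shape[0]
    d = np.zeros(src_len)
    for idx, k in enumerate(range(-window, window + 1)):
        # Gamma_k: shift the previous alignment vector by k source positions
        shifted = np.zeros(src_len)
        if k >= 0:
            shifted[k:] = prev_align[:src_len - k]
        else:
            shifted[:k] = prev_align[-k:]
        # weight the shifted vector by the probability of jump distance k
        d += jump_probs[idx] * shifted
    return d
```

For example, with window l = 3, jump_probs has 7 entries; a peak at k = +1 pushes the attention mass one position forward on the source side.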
3	distortion	models	(1/2)
1. S-Distortion model
• adopts the previous source context c^{t-1} as the context Ψ, with the intuition that certain source words indicate certain jump distances
• underlying linguistic intuition: synchronous grammars
• e.g. NP → JJ NN | JJ NN,  JJ → zuixin | latest
• once zuixin (latest) is translated, the translation orientation is forward with shift distance 1
3	distortion	models	(2/2)
2. T-Distortion model
• exploits the embedding of the previously generated target word y_{t-1} as the context Ψ
• focuses on word reordering knowledge conditioned on the target word context
3. H-Distortion model
• exploits the decoder hidden state s_{t-1} as the context Ψ; it reflects the translation status and contains both source context and target context information
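A sketch of how the jump-distance distribution P(k | Ψ) could be produced from each choice of Ψ (my own illustration; the exact parameterization, dimensions, and any nonlinearity used in the paper are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def jump_distribution(psi, W, b):
    """Return P(k | psi) over jump distances k = -l, ..., +l.

    psi: the context vector Psi
    W:   projection matrix, shape (2*l + 1, dim(psi))
    b:   bias vector, shape (2*l + 1,)
    """
    return softmax(W @ psi + b)

# The three distortion models differ only in the choice of Psi:
#   S-Distortion: psi = c^{t-1}        (previous source context vector)
#   T-Distortion: psi = emb(y_{t-1})   (embedding of the previous target word)
#   H-Distortion: psi = s_{t-1}        (previous decoder hidden state)
```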
Experiment
• language:	Chinese-to-English
• data
• train:	1.25M	sentence	pairs	from	LDC	corpora	
• validation:	NIST	2002	dataset	
• test:	NIST	2003-2006	dataset	
• alignment evaluation data: Tsinghua dataset (Liu and Sun, 2015),
which contains 900 manually aligned sentence pairs
• evaluation:	BLEU,	Alignment	error	rate	(AER)
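For reference, AER here is the standard alignment error rate (Och and Ney, 2003); this definition is not on the slide. With predicted alignment A, sure gold links S, and possible gold links P:

\[
\mathrm{AER}(A; S, P) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
\]

Lower AER means better alignment quality.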
• MT	system
• Moses,	Groundhog,	RNNsearch*	(in-house	implementation)
• NMT hyperparameters
• max	length	of	sentence:	50
• vocabulary	size:	16K,	30K
• encoder:	bi-directional	GRU
• word	embedding	dimension:	620
• hidden	layer	size:	1,000
• interpolation parameter λ: 0.5
• window size l: 3
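The same settings collected as a hypothetical Python config dictionary for readability (the key names are my own, not from any particular toolkit):

```python
# hyperparameters listed on the slide
nmt_config = {
    "max_sentence_length": 50,
    "vocabulary_sizes": [16000, 30000],   # two settings are compared
    "encoder": "bidirectional GRU",
    "word_embedding_dim": 620,
    "hidden_layer_size": 1000,
    "interpolation_lambda": 0.5,          # weight between attention and distortion alignments
    "distortion_window_l": 3,
}
```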
result	(BLEU)
the 16K-vocabulary setting improves more than the 30K setting
→ the proposed models alleviate the rare-word-collocation problem that leads to incorrect word alignments
comparison with previous work
• Coverage: the basic RNNsearch model with a coverage model that alleviates the over-translation and under-translation problems
• MEMDEC: improves translation quality with an external memory
• NMTIA: exploits a readable and writable attention mechanism to keep track of the interaction history during decoding
• our work: uses the H-Distortion model
• vocab size: 30K; Length: maximum sentence length
comparison of the proposed methods (BLEU↑, AER↓)
attention	improvement
(figure: attention matrices of the base model vs. the distortion model)
hyperparameter settings
l = 3, λ = 0.5
Incorporating	Word	Reordering	Knowledge
into	Attention-based	Neural	Machine	Translation
• word reordering model
• a crucial sub-component of SMT
• attention mechanism of NMT
• sometimes attends to inappropriate source words
• leads to incorrect translations
• proposed method
• incorporates word reordering knowledge into attention-based NMT using a distortion model
• attention considers both the semantic requirement and a word reordering penalty
• achieves SOTA translation quality
• improves word alignment quality
