Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machine Translation paper review presentation

Learning New Semi-Supervised Deep Auto-encoder
Features for Statistical Machine Translation
by Shixiang Lu, Zhenbiao Chen, Bo Xu
Presented By V B Wickramasinghe (148245F)

Overview
● Introduction
● Input features for DNN feature learning
● Semi-supervised deep auto-encoder feature learning for SMT
● Experiments and Results
● Conclusion

Introduction
● The paper describes a novel approach to statistical machine
translation (SMT).
● It uses two deep neural network architectures:
○ Deep belief networks (DBN)
○ Deep auto-encoders (DAE)
● The goal is to extract useful language features
automatically using DAEs instead of designing them manually.
● It achieves statistically significant improvements over
unsupervised DBN features and the baseline features.

Input features for DNN feature learning
● Uses a phrase-based translation model.
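● Four phrase features are used as the baseline. With f as source and e as
target, these are the bidirectional phrase translation probabilities
P(e|f) and P(f|e) and the bidirectional lexical weights Lex(e|f) and Lex(f|e).
Other features: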
● Bidirectional phrase pair similarity.
● Bidirectional phrase generative probability.

Input features for DNN feature learning
● Phrase frequency.
● Phrase length.
In total there are 16 input features, which are represented by 16 input nodes
in the DAE.
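
As a rough illustration of how a baseline phrase feature can be computed, the sketch below estimates bidirectional phrase translation probabilities by relative frequency over extracted phrase pairs. This is an assumption about the standard phrase-based setup, not code from the paper: the function name phrase_translation_probs and the toy phrase pairs are hypothetical.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimates of (P(e|f), P(f|e)) for each phrase pair.

    phrase_pairs is a list of (source phrase f, target phrase e) tuples,
    one entry per extracted occurrence (hypothetical input format).
    """
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(f for f, _ in phrase_pairs)
    tgt_counts = Counter(e for _, e in phrase_pairs)
    probs = {}
    for (f, e), c in pair_counts.items():
        # P(e|f) = count(f, e) / count(f);  P(f|e) = count(f, e) / count(e)
        probs[(f, e)] = (c / src_counts[f], c / tgt_counts[e])
    return probs

# Toy data, invented for illustration only.
pairs = [
    ("ni hao", "hello"),
    ("ni hao", "hello"),
    ("ni hao", "hi"),
    ("zai jian", "goodbye"),
]
probs = phrase_translation_probs(pairs)
```

Each phrase pair would contribute values like these, alongside the other listed features, to its 16-dimensional input vector.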

Semi-supervised deep auto-encoder
● The introduced set of features (X) is fed to a stack of
RBMs.
● Stacked together, these RBMs form a DBN.
● The RBMs are pretrained layer by layer to learn deep,
higher-order correlations between the input features.
● The DBN is then unrolled to form a DAE.
● The DAE is then fine-tuned using backpropagation.
● The final step is to stack a number of these trained DAEs to
form a 16-32-32-32-16-16-8 architecture after tuning.
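
The pipeline above (layerwise RBM pretraining, unrolling into a DAE, fine-tuning with backpropagation) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses one-step contrastive divergence (CD-1), treats inputs scaled to [0, 1] as visible-unit probabilities, fine-tunes on plain reconstruction error, and uses a small hypothetical 16-32-8 layout with random data rather than the architecture or features from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
        self.b_vis = np.zeros(n_visible)
        self.b_hid = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_vis)

    def cd1_step(self, v0, lr=0.1):
        # One CD-1 update: positive phase, one Gibbs step, negative phase.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += lr * (v0 - v1).mean(axis=0)
        self.b_hid += lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(X, layer_sizes, epochs=20):
    # Greedy layerwise pretraining: each RBM trains on the previous layer's output.
    rbms, data = [], X
    for n_vis, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        rbm = RBM(n_vis, n_hid)
        for _ in range(epochs):
            rbm.cd1_step(data)
        rbms.append(rbm)
        data = rbm.hidden_probs(data)
    return rbms

def unroll(rbms):
    # Encoder layers, then mirrored decoder layers initialised with transposed weights.
    encoder = [(r.W.copy(), r.b_hid.copy()) for r in rbms]
    decoder = [(r.W.T.copy(), r.b_vis.copy()) for r in reversed(rbms)]
    return encoder + decoder

def forward(layers, X):
    acts = [X]
    for W, b in layers:
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts

def finetune_step(layers, X, lr=0.5):
    # One full-batch gradient step on mean squared reconstruction error.
    acts = forward(layers, X)
    out = acts[-1]
    loss = np.mean((out - X) ** 2)
    delta = 2.0 * (out - X) / X.size * out * (1.0 - out)
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            # Propagate the error through the pre-update weights.
            delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
        layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return loss

# Hypothetical data: 200 phrase pairs with 16 features scaled to [0, 1].
X = rng.random((200, 16))
rbms = pretrain_dbn(X, [16, 32, 8])
layers = unroll(rbms)
losses = [finetune_step(layers, X) for _ in range(50)]
codes = forward(layers, X)[len(rbms)]  # activations of the middle (code) layer
```

Here codes holds the learned 8-dimensional representation for each phrase pair. How the learned features are added to the translation model, and the semi-supervised aspect of the training, are not covered by this sketch.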

Semi-supervised deep auto-encoder

Experiments & Results
● Experimental Setup
IWSLT. The bilingual corpus is the Chinese-English part of the Basic Traveling
Expression corpus (BTEC) and the China-Japan-Korea (CJK) corpus (0.38M
sentence pairs with 3.5/3.8M Chinese/English words).
NIST. The bilingual corpus is LDC (3.4M sentence pairs with 64/70M
Chinese/English words). The LM corpus is the English side of the parallel data
as well as the English Gigaword corpus (LDC2007T07) (11.3M sentences).
