SlideShare a Scribd company logo
1 of 1
Download to read offline
Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014
Graham Neubig
Nara Institute of Science and Technology (NAIST), Japan
Framework: F2S SMT Base Model
Data Selection
友達 と ご飯 を 食べ た
SUF5
VP0-5
PP0-1
VP2-5
PP2-3
N2
P3
V4
N0
P1
VP4-5
a meal
a meal
x1 x0
x1 x0
my friend
my friend
x1 with x0
x1 x0
with
ate
ate
Data Preparation
● ja-zh TM: All data (672k)
● ja-en TM:
● ASPEC first 2M
● Optional dictionaries: EIJIRO, EDICT, Wiki
● LM: All data
Tokenization
● en: Stanford Tokenizer (+ split “-” and “/”)
● zh: Stanford Segmenter
● ja: KyTea
→
農産業 農 産業
Parsing
● Parser: Egret
● en→Penn TB, zh→Penn CTB,
ja→Ja. Word Dependency Corpus [Mori+14]
● For Japanese, convert w/ head rules
これ は テスト です 。
これ は テスト です 。 これ は テスト です 。
Alignment
● ja-zh: GIZA++
● ja-en: Nile (supervised syntactic aligner)
Trained on KFTT aligned data
This is a test .
これ は テスト です 。
Translation Model
● Synchronous tree substitution grammar
● 5 composed rules
● Kneser-Ney count smoothing
● Right binarization
Language Model
● KenLM trained 6-gram model
● For ja, interpolated zh-ja and en-ja data
I can eat an apple </s>
Recurrent Neural Network LM
● Improve robustness, longer context
● 500 hidden nodes, 300 classes
● Trained on first 500k sentences
● 10,000-best reranking
Optimization
● Minimum error rate training
● Tested with BLEU or BLEU+RIBES
Unknown Splitting (ja-en)
これ を 球内部 に
[Mikolov+ 10]
[Koehn+ 03]
UNK
● Choose split that maximizes unigram prob:
P( 球 )P( 内部 ) > P( 球内 )P( 部 )
これ を 球 内部 に
● Test time only
Transliteration/Dictionaries
ja-zh, zh-ja
● Post: Convert Simplified ↔ Japanese Kanji
● Use Kanconvit.pm script
イチョウ黄叶 イチョウ黄葉 臭気鉴定师 臭気鑑定師
Japan インテック Japan Intekku
ja-en
● Pre: Normalize 標題 to 表題
● Post: If word exists in dictionary, translate
● Eijiro, Edict, Wiki Language Link
● Post: Romanize Hiragana/Katakana Words
● Post: Delete remaining Japanese words
en-ja ja-en zh-ja ja-zh
10
20
30
40
50
BLEU
en-ja ja-en zh-ja ja-zh
0
10
20
30
40
50
60
HUMAN
Other
NAIST
+2.2
+2.7
+3.6
+1.8
+13.0
+15.0
+28.3
+3.8
First place in all tasks!
Results
RNNLM Helps!
BLEU en-ja ja-en zh-ja ja-zh
w/o RNN 36.50 23.76 39.82 29.27
w/ RNN 37.21 24.72 40.61 29.78
Scripts available!
http://phontron.com/project/wat2014
Tuning w/ RIBES hurts human eval!
Tune BLEU B+R
Eval B R H B R H
en-ja 37.2 80.2 56.3 37.2 80.7 51.5
zh-ja 41.3 83.5 50.8 40.8 83.8 38.0
ja-zh 30.5 81.8 17.8 29.8 83.0 1.3
Why? Too-short hypotheses.

More Related Content

More from Association for Computational Linguistics

Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Association for Computational Linguistics
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Association for Computational Linguistics
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Association for Computational Linguistics
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Association for Computational Linguistics
 
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Association for Computational Linguistics
 
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Association for Computational Linguistics
 
Toshiaki Nakazawa - 2015 - Overview of the 2nd Workshop on Asian Translation
Toshiaki Nakazawa  - 2015 - Overview of the 2nd Workshop on Asian TranslationToshiaki Nakazawa  - 2015 - Overview of the 2nd Workshop on Asian Translation
Toshiaki Nakazawa - 2015 - Overview of the 2nd Workshop on Asian TranslationAssociation for Computational Linguistics
 
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT System
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT SystemHua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT System
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT SystemAssociation for Computational Linguistics
 
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...Association for Computational Linguistics
 

More from Association for Computational Linguistics (20)

Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
 
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
Terumasa Ehara - 2015 - System Combination of RBMT plus SPE and Preordering p...
 
Toshiaki Nakazawa - 2015 - Overview of the 2nd Workshop on Asian Translation
Toshiaki Nakazawa  - 2015 - Overview of the 2nd Workshop on Asian TranslationToshiaki Nakazawa  - 2015 - Overview of the 2nd Workshop on Asian Translation
Toshiaki Nakazawa - 2015 - Overview of the 2nd Workshop on Asian Translation
 
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT System
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT SystemHua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT System
Hua Shan - 2015 - A Dependency-to-String Model for Chinese-Japanese SMT System
 
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...
Wei Yang - 2015 - Sampling-based Alignment and Hierarchical Sub-sentential Al...
 

Graham Neubig - 2014 - Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014

  • 1. Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014 Graham Neubig Nara Institute of Science and Technology (NAIST), Japan Framework: F2S SMT Base Model Data Selection 友達 と ご飯 を 食べ た SUF5 VP0-5 PP0-1 VP2-5 PP2-3 N2 P3 V4 N0 P1 VP4-5 a meal a meal x1 x0 x1 x0 my friend my friend x1 with x0 x1 x0 with ate ate Data Preparation ● ja-zh TM: All data (672k) ● ja-en TM: ● ASPEC first 2M ● Optional dictionaries: EIJIRO, EDICT, Wiki ● LM: All data Tokenization ● en: Stanford Tokenizer (+ split “-” and “/”) ● zh: Stanford Segmenter ● ja: KyTea → 農産業 農 産業 Parsing ● Parser: Egret ● en→Penn TB, zh→Penn CTB, ja→Ja. Word Dependency Corpus [Mori+14] ● For Japanese, convert w/ head rules これ は テスト です 。 これ は テスト です 。 これ は テスト です 。 Alignment ● ja-zh: GIZA++ ● ja-en: Nile (supervised syntactic aligner) Trained on KFTT aligned data This is a test . これ は テスト です 。 Translation Model ● Synchronous tree substitution grammar ● 5 composed rules ● Kneser-Ney count smoothing ● Right binarization Language Model ● KenLM trained 6-gram model ● For ja, interpolated zh-ja and en-ja data I can eat an apple </s> Recurrent Neural Network LM ● Improve robustness, longer context ● 500 hidden nodes, 300 classes ● Trained on first 500k sentences ● 10,000-best reranking Optimization ● Minimum error rate training ● Tested with BLEU or BLEU+RIBES Unknown Splitting (ja-en) これ を 球内部 に [Mikolov+ 10] [Koehn+ 03] UNK ● Choose split that maximizes unigram prob: P( 球 )P( 内部 ) > P( 球内 )P( 部 ) これ を 球 内部 に ● Test time only Transliteration/Dictionaries ja-zh, zh-ja ● Post: Convert Simplified ↔ Japanese Kanji ● Use Kanconvit.pm script イチョウ黄叶 イチョウ黄葉 臭気鉴定师 臭気鑑定師 Japan インテック Japan Intekku ja-en ● Pre: Normalize 標題 to 表題 ● Post: If word exists in dictionary, translate ● Eijiro, Edict, Wiki Language Link ● Post: Romanize Hiragana/Katakana Words ● Post: Delete remaining Japanese words en-ja ja-en zh-ja ja-zh 10 20 30 40 50 BLEU en-ja ja-en zh-ja ja-zh 0 10 20 30 40 50 60 HUMAN Other NAIST +2.2 +2.7 +3.6 +1.8 +13.0 +15.0 +28.3 +3.8 First place in all tasks! Results RNNLM Helps! BLEU en-ja ja-en zh-ja ja-zh w/o RNN 36.50 23.76 39.82 29.27 w/ RNN 37.21 24.72 40.61 29.78 Scripts available! http://phontron.com/project/wat2014 Tuning w/ RIBES hurts human eval! Tune BLEU B+R Eval B R H B R H en-ja 37.2 80.2 56.3 37.2 80.7 51.5 zh-ja 41.3 83.5 50.8 40.8 83.8 38.0 ja-zh 30.5 81.8 17.8 29.8 83.0 1.3 Why? Too-short hypotheses.