Forest-to-String SMT for Asian Language Translation: NAIST at WAT 2014
Graham Neubig
Nara Institute of Science and Technology (NAIST), Japan
Framework: F2S SMT Base Model
Data Selection
[Figure: forest-to-string derivation for 友達 と ご飯 を 食べ た ("[I] ate a meal with [my] friend"): a parse forest over the Japanese sentence (N, P, V, PP, VP, SUF nodes) with tree-to-string rules such as "my friend", "a meal", "ate", "x1 with x0", and "x1 x0".]
Data Preparation
● ja-zh TM: all data (672k)
● ja-en TM: first 2M sentences of ASPEC
● Optional dictionaries: EIJIRO, EDICT, Wiki
● LM: all data
Tokenization
● en: Stanford Tokenizer (+ split “-” and “/”)
● zh: Stanford Segmenter
● ja: KyTea
● Example: 農産業 → 農 産業
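The extra splitting of "-" and "/" on top of the Stanford Tokenizer could be done with a single regex pass; a minimal sketch (the function name and exact behavior are illustrative, not the team's actual script):

```python
import re

def split_hyphen_slash(tokens):
    """Further split pre-tokenized English tokens on "-" and "/",
    keeping the delimiters as tokens of their own."""
    out = []
    for tok in tokens:
        # A capturing group in re.split keeps the delimiters in the result
        out.extend(p for p in re.split(r"([-/])", tok) if p)
    return out

print(split_hyphen_slash(["state-of-the-art", "input/output", "well"]))
# → ['state', '-', 'of', '-', 'the', '-', 'art', 'input', '/', 'output', 'well']
```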
Parsing
● Parser: Egret
● en→Penn TB, zh→Penn CTB,
ja→Ja. Word Dependency Corpus [Mori+14]
● For Japanese, convert w/ head rules
[Figure: dependency tree for これ は テスト です 。 ("This is a test.") converted into a constituency tree via head rules.]
Alignment
● ja-zh: GIZA++
● ja-en: Nile (supervised syntactic aligner)
  Trained on KFTT aligned data
[Figure: example word alignment between "This is a test ." and これ は テスト です 。]
Translation Model
● Synchronous tree substitution grammar
● Composed rules (up to size 5)
● Kneser-Ney count smoothing
● Right binarization
Language Model
● 6-gram model trained with KenLM
● For ja, interpolated zh-ja and en-ja data
● Example context: I can eat an apple </s>
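The zh-ja/en-ja interpolation presumably mixes the two models' probability estimates; a minimal sketch of linear LM interpolation, with a hypothetical weight and made-up per-word probabilities (KenLM's own interpolation machinery may work differently):

```python
def interpolate(p_a, p_b, lam):
    """Linear interpolation of two language-model probabilities:
    lam * p_a + (1 - lam) * p_b, with 0 <= lam <= 1."""
    return lam * p_a + (1.0 - lam) * p_b

# Hypothetical probabilities of the same word under two 6-gram models
p_zhja, p_enja = 0.020, 0.010
print(interpolate(p_zhja, p_enja, lam=0.6))  # → 0.016
```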
Recurrent Neural Network LM [Mikolov+ 10]
● Improve robustness, longer context
● 500 hidden nodes, 300 classes
● Trained on first 500k sentences
● 10,000-best reranking
Optimization
● Minimum error rate training
● Tested with BLEU or BLEU+RIBES
Unknown Splitting (ja-en) [Koehn+ 03]
● Test time only
● Choose the split that maximizes the unigram probability:
  P(球)P(内部) > P(球内)P(部)
● Example: これ を 球内部 に → これ を 球 内部 に (球内部 is UNK)
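The split criterion above can be sketched as a search over two-way splits of the unknown word; the unigram probabilities below are made up, and the real system's model (and any handling of multi-way splits) may differ:

```python
def best_split(word, unigram):
    """Return the two-way split of an OOV word that maximizes the product
    of unigram probabilities; keep the word whole if no split has two
    known parts."""
    best, best_p = [word], 0.0
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]
        p = unigram.get(a, 0.0) * unigram.get(b, 0.0)
        if p > best_p:
            best, best_p = [a, b], p
    return best

# Hypothetical unigram probabilities
uni = {"球": 0.002, "内部": 0.004, "球内": 0.0001, "部": 0.003}
print(best_split("球内部", uni))  # → ['球', '内部']
```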
Transliteration/Dictionaries
ja-zh, zh-ja
● Post: Convert Simplified Chinese ↔ Japanese Kanji (Kanconvit.pm script)
● Examples: イチョウ黄叶 → イチョウ黄葉, 臭気鉴定师 → 臭気鑑定師
ja-en
● Pre: Normalize 標題 to 表題 (both "title")
● Post: If a word exists in a dictionary, translate it (Eijiro, Edict, Wiki language links)
● Post: Romanize Hiragana/Katakana words (e.g. Japan インテック → Japan Intekku)
● Post: Delete remaining Japanese words
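The romanization step could look like the sketch below; the kana table covers only this one example, and a real system would use a complete table or a transliteration tool:

```python
# Minimal katakana-to-romaji sketch (table is deliberately tiny)
KANA = {"イ": "i", "ン": "n", "テ": "te", "ク": "ku"}

def romanize(word):
    out, geminate = [], False
    for ch in word:
        if ch == "ッ":              # sokuon: doubles the next consonant
            geminate = True
            continue
        roma = KANA[ch]
        if geminate:
            roma = roma[0] + roma   # e.g. "ku" -> "kku"
            geminate = False
        out.append(roma)
    return "".join(out)

print(romanize("インテック").capitalize())  # → Intekku
```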
Results
● First place in all tasks!
[Figure: bar charts of BLEU (axis 10-50) and HUMAN (axis 0-60) scores on en-ja, ja-en, zh-ja, ja-zh, comparing NAIST against other systems. Margins shown: +2.2, +2.7, +3.6, +1.8 (BLEU) and +13.0, +15.0, +28.3, +3.8 (HUMAN).]
RNNLM Helps!
BLEU en-ja ja-en zh-ja ja-zh
w/o RNN 36.50 23.76 39.82 29.27
w/ RNN 37.21 24.72 40.61 29.78
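The 10,000-best reranking behind the "w/ RNN" row can be sketched as picking the hypothesis with the best combined score; the scores, weight, and stand-in RNNLM log-probabilities below are invented for illustration:

```python
def rerank(nbest, rnn_logprob, weight):
    """Return the n-best entry maximizing base score + weight * RNNLM log-prob."""
    return max(nbest, key=lambda h: h["score"] + weight * rnn_logprob(h["text"]))

# Stand-in RNNLM log-probabilities (hypothetical numbers)
RNN = {"this is test": -6.0, "this is a test": -3.0}
nbest = [{"text": "this is test", "score": -4.0},
         {"text": "this is a test", "score": -4.2}]

best = rerank(nbest, RNN.get, weight=0.5)
print(best["text"])  # → this is a test
```

Here the RNNLM rescues the more fluent hypothesis even though the base model scored it slightly lower.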
Scripts available!
http://phontron.com/project/wat2014
Tuning w/ RIBES hurts human eval!
         Tune: BLEU              Tune: BLEU+RIBES
Eval     BLEU  RIBES  HUMAN     BLEU  RIBES  HUMAN
en-ja    37.2  80.2   56.3      37.2  80.7   51.5
zh-ja    41.3  83.5   50.8      40.8  83.8   38.0
ja-zh    30.5  81.8   17.8      29.8  83.0    1.3
Why? Too-short hypotheses.
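One plausible reading of this diagnosis: BLEU explicitly punishes short outputs through its brevity penalty, so tuning away from pure BLEU can let hypotheses drift shorter. A sketch of the standard BLEU brevity penalty (textbook definition, not code from the system):

```python
from math import exp

def brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty: 1 if the candidate is at least as long as
    the reference, else exp(1 - ref_len / cand_len)."""
    if cand_len >= ref_len:
        return 1.0
    return exp(1.0 - ref_len / cand_len)

print(round(brevity_penalty(10, 10), 3))  # → 1.0
print(round(brevity_penalty(5, 10), 3))   # → 0.368  (half-length output)
```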