Masato Hagiwara, Satoshi Sekine
Rakuten Institute of Technology, New York
NEWS 2012, July 12 2012
Transliteration has usually been handled by spelling-based supervised models. However, a single model cannot deal with a mixture of words with different origins, such as "get" in "piaget" and "target". Li et al. (2007) propose a class transliteration method, which explicitly models source language origins and switches between them to address this issue. In contrast to their model, which requires a training corpus explicitly tagged with language origins, Hagiwara and Sekine (2011) proposed the latent class transliteration model, which models language origins as latent classes and trains the transliteration table via the EM algorithm. However, this model, which can be formulated as a unigram mixture, is prone to overfitting since it is based on maximum likelihood estimation. We propose a novel latent semantic transliteration model based on Dirichlet mixture, in which a Dirichlet mixture prior is introduced to mitigate overfitting. We show that the proposed method considerably outperforms conventional transliteration models.
1. Latent Semantic Transliteration using Dirichlet Mixture
Masato Hagiwara, Satoshi Sekine
Rakuten Institute of Technology, New York
NEWS 2012, July 12 2012
2. Background
• Transliteration
– Phonetic translation between languages with
different writing systems
e.g., flextime / furekkusutaimu フレックスタイム
– A major way to import words into other languages
• Transliteration models
– Phonetic-based re-writing models
(Knight and Graehl 1998)
– Spelling-based supervised models
(Li et al. 2004) (Finch and Sumita 2008)
3. Alpha-Beta Model [Brill and Moore 2000]
• Generalization of edit distance (substitution, insertion, deletion = cost 1) to string-to-string substitutions α→β
• Transliteration probability = product of "transliteration unit (TU)" probabilities, taking the maximum re-writing probability over all possible partitions
e.g., flextime / furekkusutaimu フレックスタイム
P(flextime→furekkusutaimu)
= P(f→fu)×P(le→re)×P(x→kkusu)×P(ti→tai)×P(me→mu)
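The maximum over all partitions can be computed with dynamic programming. A minimal sketch (not the authors' implementation), with a hypothetical hand-filled TU table `tu_prob`:

```python
# Alpha-Beta model: transliteration probability as the maximum over all
# partitions of the product of transliteration-unit (TU) probabilities.
# tu_prob is a hypothetical table P(alpha -> beta); real tables are
# learned from aligned training pairs.
tu_prob = {
    ("f", "fu"): 0.8, ("le", "re"): 0.6, ("x", "kkusu"): 0.5,
    ("ti", "tai"): 0.4, ("me", "mu"): 0.7,
}

def ab_prob(src: str, tgt: str, max_len: int = 5) -> float:
    """Max-product DP over all ways to split src/tgt into TU pairs."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == 0.0:
                continue
            for di in range(1, max_len + 1):
                for dj in range(1, max_len + 1):
                    if i + di <= n and j + dj <= m:
                        p = tu_prob.get((src[i:i+di], tgt[j:j+dj]), 0.0)
                        if best[i][j] * p > best[i+di][j+dj]:
                            best[i+di][j+dj] = best[i][j] * p
    return best[n][m]

print(ab_prob("flextime", "furekkusutaimu"))  # ≈ 0.0672 = 0.8·0.6·0.5·0.4·0.7
```

The DP cell `best[i][j]` holds the best score for transliterating the first `i` source characters into the first `j` target characters; insertions and deletions would correspond to empty-side TUs, omitted here for brevity.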
5. Joint Source Channel Model [Li et al. 04]
• JSC model: transliteration probability = product of TU n-gram probabilities
P(flextime→furekkusutaimu)
= P(f→fu|BOW)×P(le→re|f→fu)×P(x→kkusu|le→re)× …
• TU probability estimation: EM algorithm over the training corpus
– Random initial alignment
e.g., fl/ext/im/e → frek/ku/suta/imu, p/i/a/get → pi/a/j/e
– Re-alignment via the Viterbi algorithm
e.g., f/le/x/ti/me → fu/re/kkusu/tai/mu, pi/a/get → pi/a/je
– Freq. → Prob., yielding a TU probability table
P( fl→flek |・) = XXX
P( ext→ku |・) = YYY
P( p→pi |・) = ZZZ
…
6. Multiple Language Origins
• A single model cannot handle words with different origins:
– piaget / piaje ピアジェ (French origin → French model)
– target / taagetto ターゲット (English origin → English model)
– 亚历山大 Yalishanda / Alexander (Indo-European origin → Chinese transliteration model)
– 山本 Yamamoto / Yamamoto (Japanese origin → Japanese reading model)
• Class Transliteration Model [Li et al. 07]: explicit language detection, then origin-specific models
– Requires a training set annotated with language origins
8. Issues with the Class Transliteration Model
• Requires training sets tagged with language origins
– Rare, especially for proper nouns
• Language origins ≠ transliteration models
– e.g., spaghetti / supageti スパゲティ
Italian origin, but found in English dictionaries
– e.g., Carl Laemmle / kaaru remuri カール・レムリ
A German immigrant, but listed as an "American" film producer
→ An English transliteration model doesn't work
⇒ Model source language origins as latent classes
9. Latent Class Transliteration (LCT) Model [Hagiwara & Sekine 11]
• Models the "source language origins" as latent classes
• Latent classes correspond to sets of words with similar transliteration characteristics
• Trained via the EM algorithm from transliteration pairs
Class transliteration [Li et al. 07]: explicit language detection
P(t|s) = Σ_c P(c|s) P(t|s, c)
Latent class transliteration [Hagiwara & Sekine 11]: latent class distribution
P(s, t) = Σ_{z=1..K} P(z) Π_i P(u_i|z)
(s: source, t: target, z: latent class, u_i: TU pairs, K: # of latent classes, determined using dev. sets)
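For a fixed segmentation, the LCT score is a mixture over latent classes of TU-unigram products. A toy illustration (class names and all probabilities are made up, not from the paper):

```python
# Latent class transliteration: P(s,t) = sum_z P(z) * prod_i P(u_i | z),
# where u_i are the TU pairs of one (fixed) segmentation.
# All numbers below are illustrative.
class_prior = {"Lx": 0.5, "Ly": 0.3, "Lz": 0.2}   # P(z)
tu_prob = {                                        # P(u | z)
    "Lx": {("get", "je"): 0.4, ("pi", "pi"): 0.3},
    "Ly": {("get", "getto"): 0.5, ("pi", "pi"): 0.2},
    "Lz": {("get", "je"): 0.1, ("pi", "pi"): 0.1},
}

def lct_prob(tus):
    """Mixture over classes of per-class TU unigram products."""
    total = 0.0
    for z, prior in class_prior.items():
        p = prior
        for u in tus:
            p *= tu_prob[z].get(u, 0.0)
        total += p
    return total

# Two TUs of "piaget -> piaje"; the class where get->je is likely dominates.
print(lct_prob([("pi", "pi"), ("get", "je")]))
```

Because the class is summed out, no explicit language tag is needed at prediction time.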
14. Iterative Learning via the EM Algorithm
Training pairs: piaget → piaje, target → taagetto, …
• E step: compute the transliteration probability of each candidate alignment under each latent class (Lx, Ly, Lz), based on Viterbi search
e.g., p/i/a/get → pi/a/j/e, t/ar/get → taa/ge/tto
pi/a/get → pi/a/je, tar/get → taa/getto
• M step: update each class's transliteration model from the expected counts Σγ·f(get→je ジェ)
P(pi→pi ピ)
P(ar→aa アー)
P(get→je ジェ)
P(get→getto ゲット)
…
→ Sensitive to noise in the training data because of maximum likelihood estimation
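One EM iteration of the unigram-mixture view can be sketched as follows (segmentation held fixed for simplicity; the real E step also re-aligns pairs with Viterbi search, and all initial numbers here are made up):

```python
from collections import defaultdict

# Toy EM for a mixture of TU-unigram models (latent class transliteration).
pairs = [
    [("pi", "pi"), ("a", "a"), ("get", "je")],      # piaget -> piaje
    [("t", "ta"), ("ar", "a"), ("get", "getto")],   # target -> taagetto
]
K = 2                                   # number of latent classes
prior = [0.5, 0.5]                      # P(z)
# Hypothetical initialization of P(u|z); real training starts from
# random alignments.
tu = [defaultdict(lambda: 0.1) for _ in range(K)]
tu[0][("get", "je")] = 0.5
tu[1][("get", "getto")] = 0.5

for _ in range(10):
    counts = [defaultdict(float) for _ in range(K)]   # expected TU counts
    new_prior = [0.0] * K
    for tus in pairs:
        # E step: responsibility gamma(z) of each class for this pair
        joint = []
        for z in range(K):
            p = prior[z]
            for u in tus:
                p *= tu[z][u]
            joint.append(p)
        total = sum(joint)
        for z in range(K):
            gamma = joint[z] / total
            new_prior[z] += gamma / len(pairs)
            for u in tus:
                counts[z][u] += gamma                 # sum of gamma * f(u)
    # M step: re-normalize expected counts into new probabilities
    prior = new_prior
    for z in range(K):
        norm = sum(counts[z].values())
        tu[z] = defaultdict(float, {u: c / norm for u, c in counts[z].items()})

# Each class specializes: one prefers get->je, the other get->getto.
print(round(tu[0][("get", "je")], 3), round(tu[1][("get", "getto")], 3))
```

The maximum-likelihood M step is exactly where noise hurts: a few spurious alignments receive probability mass with nothing pulling the estimates back, which motivates the Dirichlet prior below.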
15. Transliteration Models vs. Document Topic Models
– Transliteration unit (atomic unit of substitution, e.g., pia/pia ピア, get/je ジェ) ↔ Word
– Transliteration pair (sequence of transliteration units, e.g., pia get / pia je ピアジェ) ↔ Document
– Alpha-Beta model ↔ Word unigram language model
– Joint source channel model ↔ Word n-gram language model
– Class transliteration model [Li et al. 07] ↔ Classification + switching LMs
– Latent class transliteration model [Hagiwara & Sekine 11] ↔ Unigram mixture [Nigam et al. 00]
– Proposed ↔ Dirichlet mixture [Yamamoto & Sadamitsu 03]
⇒ Introduce a Dirichlet mixture prior to alleviate overfitting
16. Proposed Method: Latent Semantic Transliteration Model using Dirichlet Mixture (DM-LST)
• Latent class transliteration [Hagiwara & Sekine 11]: each latent class (e.g., French, English) holds a point-estimated multinomial TU distribution 𝑃(𝑢|𝑧1), 𝑃(𝑢|𝑧2), 𝑃(𝑢|𝑧3) over TUs such as 𝑢1 = get/je, 𝑢2 = get/getto, 𝑢3, …
• Proposed: place a mixture of Dirichlet priors 𝑃𝐷𝑖𝑟(𝑝; 𝛼1), 𝑃𝐷𝑖𝑟(𝑝; 𝛼2), 𝑃𝐷𝑖𝑟(𝑝; 𝛼3) over the TU distribution 𝑝 (multinomial Dirichlet mixture); integrating out 𝑝 yields a Polya distribution [Yamamoto & Mochihashi 06]
• Dirichlet mixture parameters are estimated via the EM algorithm [Yamamoto et al. 03]
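As a reconstruction of the standard Dirichlet-multinomial result behind the slide (the notation α_k, n_u, λ_k is assumed here, not copied from the slides):

```latex
% TU distribution p integrated out under one Dirichlet component \alpha_k
% gives the Polya (Dirichlet-multinomial) distribution:
P(u_1,\dots,u_N \mid \alpha_k)
  = \int \prod_{i=1}^{N} p_{u_i}\, \mathrm{Dir}(p;\alpha_k)\, dp
  = \frac{\Gamma\!\left(\sum_u \alpha_{k,u}\right)}
         {\Gamma\!\left(\sum_u \alpha_{k,u} + N\right)}
    \prod_{u} \frac{\Gamma(\alpha_{k,u} + n_u)}{\Gamma(\alpha_{k,u})}
% where n_u is the count of TU u. The mixture marginal combines K components:
P(u_1,\dots,u_N) = \sum_{k=1}^{K} \lambda_k\, P(u_1,\dots,u_N \mid \alpha_k)
```

Because the TU probabilities are integrated out rather than point-estimated, rare and noisy TUs are smoothed by the pseudo-counts α, which is what mitigates the overfitting of maximum likelihood estimation.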
17. Transliteration Generation Pipeline
Training corpus: (flextime, furekkusutaimu), (piaget, piaje), (target, taagetto), …
• Generation: input (e.g., smith) → JSC model + stack decoder → candidate list
sumisu スミス, zumisu ズミス, sumaisu スマイス, sumaizu スマイズ, …
• Re-ranking: DM-LST (proposed) re-ranks the candidates → output: sumisu スミス
18. Experiments
Compared models:
• Alpha-Beta Model (AB)
• Joint Source Channel (JSC)
• Latent Class Transliteration (LCT)
• Latent Semantic Transliteration using Dirichlet Mixture (DM-LST; Proposed)
19. Experimental Settings
• Evaluation data
– Transliteration pairs En-Ja, En-Ch, En-Ko from NEWS 2009 [Li et al. 09]
Set     Train    Dev.   Test
En-Ja   23,225   1,492  1,489
En-Ch   31,961   2,896  2,896
En-Ko    4,785     987    989
• Evaluation metrics
– ACC: averaged top-1 accuracy
– MFS: mean F-score
– MRR: mean reciprocal rank
• Parameters
– Fixed: stack beam width B = 32, EM iterations = 15
– Number of latent classes M: tuned on the dev. set for each language pair
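ACC and MRR can be sketched as follows (the candidate lists and references are made-up examples; MFS requires per-string character F-scores and is omitted):

```python
# ACC: fraction of inputs whose top-1 candidate matches some reference.
# MRR: mean of 1/rank of the first correct candidate (0 if none is correct).
def acc(ranked_lists, references):
    hits = sum(1 for cands, refs in zip(ranked_lists, references)
               if cands and cands[0] in refs)
    return hits / len(references)

def mrr(ranked_lists, references):
    total = 0.0
    for cands, refs in zip(ranked_lists, references):
        for rank, c in enumerate(cands, start=1):
            if c in refs:
                total += 1.0 / rank
                break
    return total / len(references)

ranked = [["sumisu", "zumisu"], ["taagetto"], ["piaje", "piajetto"]]
refs = [{"sumisu"}, {"taagetto"}, {"piajetto"}]
print(acc(ranked, refs), mrr(ranked, refs))
```

Both metrics reward ranking the correct transliteration early, which is exactly what the DM-LST re-ranking stage targets.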
23. Conclusion
• Proposed Latent Semantic Transliteration based on Dirichlet Mixture (DM-LST)
– Formalized conventional transliteration models as document topic models
– Introduced a Dirichlet mixture prior to alleviate overfitting
– Achieved transliteration performance superior to the conventional methods
• Future work
– Handle transliteration unit N-grams (N ≧ 2)
– Context-dependent transliteration
e.g., Charles → チャールズ chāruzu or シャルル sharuru