Latent Semantic Transliteration
using Dirichlet Mixture
Masato Hagiwara, Satoshi Sekine
Rakuten Institute of Technology, New York
NEWS 2012, July 12 2012
2
Background
• Transliteration
– Phonetic translation between languages with
different writing systems
e.g., flextime / furekkusutaimu フレックスタイム
– A major way of importing words into other languages
• Transliteration models
– Phonetic-based re-writing models
(Knight and Graehl 1998)
– Spelling-based supervised models
(Li et al. 2004) (Finch and Sumita 2008)
3
Alpha-Beta Model [Brill and Moore 2000]
Edit distance
– substitution, insertion, deletion = cost 1
Alpha-Beta Model
– Generalization of edit distance: string-to-string substitution α→β
– e.g., flextime → furekkusutaimu フレックスタイム
P(flextime→furekkusutaimu)
= P(f→fu)×P(le→re)×P(x→kkusu)×P(ti→tai)×P(me→mu)
Transliteration Probability
= Product of “Transliteration Unit (TU)” Probs.
– Maximum re-writing probability over all possible partitions
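The "maximum over all possible partitions" can be sketched as a small dynamic program. This is only an illustration: the TU probabilities in `TU_PROB`, the length limits `MAX_SRC` / `MAX_TGT`, and the function name `best_log_prob` are hypothetical and do not come from the slides.

```python
import math
from functools import lru_cache

# Hypothetical TU probabilities, chosen only for illustration.
TU_PROB = {
    ("f", "fu"): 0.5,
    ("le", "re"): 0.2,
    ("x", "kkusu"): 0.3,
    ("ti", "tai"): 0.4,
    ("me", "mu"): 0.25,
    ("fl", "fu"): 0.01,   # competing partition: fl/e/x/ti/me
    ("e", "re"): 0.02,
}
MAX_SRC = 2  # longest source side of a TU (assumption)
MAX_TGT = 5  # longest target side of a TU (assumption)


def best_log_prob(src: str, tgt: str) -> float:
    """Best partition score: max over partitions of the summed log TU probs."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> float:
        # Both strings fully consumed: an empty product, log(1) = 0.
        if i == len(src) and j == len(tgt):
            return 0.0
        best = -math.inf
        for di in range(1, MAX_SRC + 1):
            for dj in range(1, MAX_TGT + 1):
                if i + di > len(src) or j + dj > len(tgt):
                    continue
                p = TU_PROB.get((src[i:i + di], tgt[j:j + dj]))
                if p is not None:
                    best = max(best, math.log(p) + go(i + di, j + dj))
        return best  # -inf when no partition covers the remainder

    return go(0, 0)


score = best_log_prob("flextime", "furekkusutaimu")
```

With these toy numbers the partition f/le/x/ti/me wins over fl/e/x/ti/me, so `score` equals the log of the product shown on the slide.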
4
Joint Source Channel Model [Li et al. 04]
P(flextime→furekkusutaimu)
= P(f→fu|BOW)×P(le→re|f→fu)×P(x→kkusu|le→re)× …
JSC Model: Transliteration Prob. = Prod. of TU n-gram probs.
TU Probability Estimation (EM Algorithm)
1. Random Initial Alignment of the Training Corpus
fl/ext/im/e → frek/ku/suta/imu
p/i/aget → pi/a/j/e
2. Freq.→Prob.: TU Probability Table
P( fl→frek |・) = XXX
P( ext→ku |・) = YYY
P( p→pi |・) = ZZZ
…
3. Viterbi Algorithm: re-align the corpus with the current table, then repeat from 2
Alignment after Viterbi re-alignment
f/le/x/ti/me → fu/re/kkusu/tai/mu
pi/a/get → pi/a/je
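The "Freq.→Prob." step and the bigram product can be sketched as follows, assuming the alignments are already fixed. The marker `BOW`, the function names, and the absence of smoothing are assumptions made for this illustration; the EM / Viterbi re-alignment loop itself is not shown.

```python
from collections import Counter

BOW = ("<bow>", "<bow>")  # hypothetical beginning-of-word marker


def train_bigram(aligned_pairs):
    """Relative-frequency estimate of P(u_i | u_{i-1}) from aligned pairs."""
    bigram, context = Counter(), Counter()
    for units in aligned_pairs:
        prev = BOW
        for u in units:
            bigram[(prev, u)] += 1
            context[prev] += 1
            prev = u
    return {key: c / context[key[0]] for key, c in bigram.items()}


def jsc_prob(units, model):
    """Product of TU bigram probabilities; unseen bigrams get probability 0."""
    prob, prev = 1.0, BOW
    for u in units:
        prob *= model.get((prev, u), 0.0)
        prev = u
    return prob


# Alignments taken from the slides' examples (each TU is a source/target pair).
aligned = [
    [("f", "fu"), ("le", "re"), ("x", "kkusu"), ("ti", "tai"), ("me", "mu")],
    [("pi", "pi"), ("a", "a"), ("get", "je")],
    [("tar", "taa"), ("get", "getto")],
]
model = train_bigram(aligned)
p_flextime = jsc_prob(aligned[0], model)
```

On this three-pair toy corpus every bigram except the word-initial one is deterministic, so `p_flextime` is 1/3. A real system would need smoothing for unseen bigrams.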
6
Multiple Language Origins
亚历山大 Yalishanda / Alexander: Indo-European origin → Chinese Transliteration Model
山本 Yamamoto / Yamamoto: Japanese origin → Japanese Reading Model
piaget / piaje ピアジェ: French origin → French model
target / taagetto ターゲット: English origin → English model
Class Transliteration Model (Li et al. 07)
– Explicit language detection
– Requires a training set annotated with language origins
7
Issues on Class Transliteration Model
• Requires training sets tagged with language origins
– Rare especially for proper nouns
• Language origins ≠ transliteration models
– e.g., spaghetti / supageti スパゲティ
Italian origin, but found in English dictionaries
– e.g., Carl Laemmle / kaaru remuri カール・レムリ
German immigrant, but listed as an “American” film producer
→ An English transliteration model doesn’t work
Model source language origins as latent classes
9
Latent Class Transliteration (LCT) Model
[Hagiwara & Sekine 11]
• Models the “source language origins” as latent classes
• “Latent classes” correspond to sets of words with similar
transliteration characteristics
• Trained via the EM algorithm from transliteration pairs
Class transliteration [Li et al. 07]: explicit language detection
Latent Class Transliteration [Hagiwara & Sekine 11]: latent class distribution
s: source
t: target
z: latent class
K: # of latent classes
(determined using dev. sets)
10
Iterative Learning via EM Algorithm
Training Pairs
piaget → piaje
target → taagetto
…
Transliteration Model (one per latent class Lx, Ly, Lz)
P(pi→pi ピ)
P(ar→aa アー)
P(get→je ジェ)
P(get→getto ゲット)
…
E step: Transliteration probability based on Viterbi search
– Segmentation of each training pair is refreshed for the latent classes Lx, Ly, Lz
p/i/a/get→pi/a/j/e ⇒ pi/a/get→pi/a/je
t/ar/get→taa/ge/tto ⇒ tar/get→taa/getto
…
M step: Update the transliteration model of each latent class
– TU probabilities are re-estimated from weighted counts, e.g., Σγ*f(get→je ジェ)
– The E and M steps are repeated
Sensitive to noise in the training data
because of maximum likelihood estimation
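A minimal sketch of the E and M steps, assuming a plain unigram mixture over TUs with fixed segmentations (the Viterbi re-segmentation from the slides is left out). The function name, the random initialization, and the small constant `eps` are assumptions; `eps` is only a numerical guard and not part of the model.

```python
import random
from collections import Counter


def em_unigram_mixture(pairs, num_classes, iterations=15, seed=0, eps=1e-9):
    """EM for a mixture of unigram models over fixed TU sequences."""
    rng = random.Random(seed)
    vocab = sorted({u for units in pairs for u in units})
    # Random, strictly positive initial P(u|z); P(z) starts uniform.
    p_u = []
    for _ in range(num_classes):
        w = [rng.random() + 0.1 for _ in vocab]
        total = sum(w)
        p_u.append({u: x / total for u, x in zip(vocab, w)})
    p_z = [1.0 / num_classes] * num_classes
    gamma = []

    for _ in range(iterations):
        # E step: gamma[n][z] = P(z | pair n) under the current parameters.
        gamma = []
        for units in pairs:
            joint = []
            for z in range(num_classes):
                p = p_z[z]
                for u in units:
                    p *= p_u[z][u]
                joint.append(p)
            norm = sum(joint)
            gamma.append([j / norm for j in joint])
        # M step: P(u|z) is proportional to sum_n gamma[n][z] * f_n(u).
        for z in range(num_classes):
            counts = Counter()
            for g, units in zip(gamma, pairs):
                for u in units:
                    counts[u] += g[z]
            total = sum(counts.values()) + eps * len(vocab)
            p_u[z] = {u: (counts[u] + eps) / total for u in vocab}
            p_z[z] = sum(g[z] for g in gamma) / len(pairs)
    return p_z, p_u, gamma


# Segmented pairs from the slides; each TU is a (source, target) tuple.
pairs = [
    [("pi", "pi"), ("a", "a"), ("get", "je")],
    [("tar", "taa"), ("get", "getto")],
]
p_z, p_u, gamma = em_unigram_mixture(pairs, num_classes=2)
```

Because the M step is a (nearly) pure maximum likelihood update, a class can fit a handful of pairs exactly, which is the noise sensitivity noted above.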
15
Transliteration Models vs Topic Models
Transliteration Models ↔ Document Topic Models
Transliteration Unit (atomic unit of substitution), e.g., pia / pia ピア, get / je ジェ ↔ Word
Transliteration Pair (sequence of transliteration units), e.g., pia get / pia je ピア ジェ ↔ Document
Alpha-Beta Model ↔ Word Unigram Language Model
Joint Source Channel Model ↔ Word n-gram Language Model
Class Transliteration Model [Li et al. 07] ↔ Classification + Switch LMs
Latent Class Transliteration Model [Hagiwara & Sekine 11] ↔ Unigram Mixture [Nigam et al. 00]
Proposed ↔ Dirichlet Mixture [Yamamoto & Sadamitsu 03]
Introduce a Dirichlet mixture prior to alleviate overfitting
16
Proposed Method
Latent Semantic Transliteration Model
using Dirichlet Mixture (DM-LST)
Latent Class Transliteration [Hagiwara & Sekine 11]: Multinomial
– P(u|z1), P(u|z2), P(u|z3)
Latent Semantic Transliteration using Dirichlet Mixture (Proposed): Dirichlet Mixture
– P_Dir(p; α1), P_Dir(p; α2), P_Dir(p; α3)
– Polya distribution [Yamamoto, Mochihashi 06]
[Figure: distributions over TUs u1 = get/je, u2 = get/getto, u3, with French and English marked]
Estimate Dirichlet Mixture Parameters via the EM Algorithm [Yamamoto et al. 03]
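As a rough illustration of how a Dirichlet mixture scores a sequence of TUs, the sketch below uses the standard Polya (Dirichlet-multinomial) form, in which the multinomial p is integrated out under each Dirichlet component. The mixture weights, the α values, and the function names are hypothetical; parameter estimation via EM is not shown.

```python
import math
from collections import Counter


def log_polya(units, alpha):
    """log P(units | alpha) with the multinomial p integrated out.

    Every unit must have an entry in alpha (a KeyError is raised otherwise).
    """
    counts = Counter(units)
    a0 = sum(alpha.values())
    lp = math.lgamma(a0) - math.lgamma(a0 + len(units))
    for u, c in counts.items():
        lp += math.lgamma(alpha[u] + c) - math.lgamma(alpha[u])
    return lp


def dm_prob(units, weights, alphas):
    """Mixture of Polya distributions: sum_k weight_k * P_Polya(units; alpha_k)."""
    return sum(w * math.exp(log_polya(units, a)) for w, a in zip(weights, alphas))


JE = ("get", "je")
GETTO = ("get", "getto")
# Hypothetical components: one favours get/je, the other get/getto.
alphas = [{JE: 3.0, GETTO: 1.0}, {JE: 1.0, GETTO: 3.0}]
weights = [0.5, 0.5]

p_single = math.exp(log_polya([JE], alphas[0]))      # 3 / 4
p_double = math.exp(log_polya([JE, JE], alphas[0]))  # (3 / 4) * (4 / 5)
p_mix = dm_prob([JE], weights, alphas)               # 0.5 * 0.75 + 0.5 * 0.25
```

Unlike a plain multinomial, repeated observations of a unit raise its predictive probability (3/4, then 4/5 in the example), and the α values act as pseudo-counts, which is how the prior can reduce overfitting.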
17
Transliteration Generation Pipeline
Training Corpus
(flextime, furekkusutaimu)
(piaget, piaje)
(target, taagetto)
Input: smith
Generation: JSC Model + Stack Decoder
Candidate List
sumisu スミス
zumisu ズミス
sumaisu スマイス
sumaizu スマイズ
…
Re-ranking: DM-LST (Proposed)
Output: sumisu スミス
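The generate-then-re-rank flow can be sketched in a few lines. The scores in `toy_scores` are hypothetical stand-ins for whatever the re-ranking model assigns; the slides do not state how generation and re-ranking scores are combined, so this sketch simply sorts by the re-ranking score.

```python
def rerank(candidates, score):
    """Sort generator candidates by a re-ranking score, best first."""
    return sorted(candidates, key=score, reverse=True)


# Candidate list as produced by the generation stage (order is arbitrary here).
candidates = ["zumisu", "sumisu", "sumaisu", "sumaizu"]
# Hypothetical re-ranking scores, for illustration only.
toy_scores = {"sumisu": 0.6, "zumisu": 0.1, "sumaisu": 0.2, "sumaizu": 0.1}

ranked = rerank(candidates, toy_scores.get)
output = ranked[0]
```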
18
Experiments
• Alpha-Beta Model (AB)
• Joint Source Channel (JSC)
• Latent Class Transliteration (LCT)
• Latent Semantic Transliteration using Dirichlet Mixture
(DM-LST; Proposed)
19
Experimental Settings
• Evaluation Data
– Transliteration pairs En-Ja, En-Ch, En-Ko in NEWS2009 [Li et al. 09]
• Evaluation Metric
– ACC: Averaged Top-1 Accuracy
– MFS: Mean F-Score
– MRR: Mean Reciprocal Rank
• Parameters
– Fixed: Stack beam width B=32, EM Iteration=15
– Number of latent classes M: tuned on the dev. set of each language pair
Set    Train   Dev.   Test
En-Ja  23,225  1,492  1,489
En-Ch  31,961  2,896  2,896
En-Ko   4,785    987    989
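The three metrics can be sketched as below. This assumes the usual definitions: the F-score is computed from the longest common subsequence (LCS) between the top candidate and a reference, and the reciprocal rank uses the first correct candidate. Taking the best F-score over multiple references is a simplification, and all function names are assumptions rather than the official evaluation script.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def f_score(candidate, reference):
    """Harmonic mean of LCS-based precision and recall."""
    common = lcs_len(candidate, reference)
    if common == 0:
        return 0.0
    p, r = common / len(candidate), common / len(reference)
    return 2 * p * r / (p + r)


def evaluate(ranked_lists, references):
    """ranked_lists[i]: non-empty candidates, best first; references[i]: set of answers."""
    n = len(ranked_lists)
    acc = mfs = mrr = 0.0
    for cands, refs in zip(ranked_lists, references):
        top = cands[0]
        acc += top in refs
        mfs += max(f_score(top, ref) for ref in refs)
        for rank, cand in enumerate(cands, 1):
            if cand in refs:
                mrr += 1.0 / rank
                break
    return acc / n, mfs / n, mrr / n


ranked_lists = [["sumisu", "zumisu"], ["diyon", "dijon"]]
references = [{"sumisu"}, {"dijon"}]
acc, mfs, mrr = evaluate(ranked_lists, references)
```

In this toy run the second input is wrong at rank 1 but right at rank 2, so ACC is 0.5, MRR is 0.75, and MFS is 0.9 (the wrong top candidate still shares four of five characters with the reference).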
20
Results
Set    Model   ACC    MFS    MRR
En-Ja  AB      0.293  0.755  0.378
En-Ja  JSC     0.326  0.770  0.428
En-Ja  LCT     0.345  0.768  0.437
En-Ja  DM-LST  0.349  0.776  0.444
En-Ch  AB      0.358  0.741  0.471
En-Ch  JSC     0.417  0.761  0.527
En-Ch  LCT     0.430  0.764  0.532
En-Ch  DM-LST  0.445  0.779  0.546
En-Ko  AB      0.145  0.537  0.211
En-Ko  JSC     0.151  0.543  0.221
En-Ko  LCT     0.079  0.483  0.167
En-Ko  DM-LST  0.174  0.556  0.237
22
Examples
Input (Set) | Conventional Methods | Proposed Method
dijon (En-Ja) | ☓ diyon ディヨン | ○ dijon ディジョン
goldenberg (En-Ja) | ☓ gōrudenberugu ゴールデンベルグ | ○ gōrudenbāgu ゴールデンバーグ
covell (En-Ch) | ☓ kefuer 科夫尔 | ○ keweier 科维尔
netherwood (En-Ch) | ☓ neitehewude 内特赫伍德 | ○ neisewude 内瑟伍德
darling (En-Ko) | ☓ dareuling 다르링 | ○ dalling 달링
gutheim (En-Ch) | ○ gutehaimu 古特海姆 | ○ gutehaimu 古特海姆
martina (En-Ko) | ○ mareutina 마르티나 | ○ mareutina 마르티나
23
Conclusion
• Proposed Latent Semantic Transliteration
based on Dirichlet Mixture (DM-LST)
– Formalized conventional transliteration models
in terms of document topic models
– Introduced a Dirichlet Mixture prior to alleviate
overfitting
– Transliteration performance superior
to that of the conventional methods
• Future Work
– Deal with transliteration unit N-grams (N≧2)
– Context-dependent transliteration
• e.g., Charles → チャールズ chāruzu or シャルル sharuru

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 

Latent Semantic Transliteration using Dirichlet Mixture

  • 1. Latent Semantic Transliteration using Dirichlet Mixture
      Masato Hagiwara, Satoshi Sekine
      Rakuten Institute of Technology, New York
      NEWS 2012, July 12, 2012
  • 2. Background
      • Transliteration
          – Phonetic translation between languages with different writing systems, e.g., flextime / furekkusutaimu フレックスタイム
          – Major way to import words into different languages
      • Transliteration models
          – Phonetic-based re-writing models (Knight and Graehl 1998)
          – Spelling-based supervised models (Li et al. 2004) (Finch and Sumita 2008)
  • 3. Alpha-Beta Model [Brill and Moore 2000]
      • Edit distance: substitution, insertion, deletion = cost 1
      • Alpha-Beta Model: generalization of edit distance to string-to-string substitutions α→β
      • Transliteration probability = product of "Transliteration Unit (TU)" probabilities:
          P(flextime→furekkusutaimu) = P(f→fu) × P(le→re) × P(x→kkusu) × P(ti→tai) × P(me→mu)
      • Maximum re-writing probability over all possible partitions
      • [Figure: α = flextime aligned with β = furekkusutaimu フレックスタイム]
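The "maximum over all possible partitions" on slide 3 can be sketched as a small dynamic program. This is only an illustration: the TU probabilities in `TU_PROB`, the length cap `MAX_UNIT_LEN`, and the function name `best_rewrite_prob` are assumptions made here, and the sketch requires both sides of every unit to be non-empty, which is a simplification.

```python
from functools import lru_cache

# Hypothetical TU probabilities, invented for illustration only.
TU_PROB = {
    ("f", "fu"): 0.5,
    ("le", "re"): 0.4,
    ("x", "kkusu"): 0.3,
    ("ti", "tai"): 0.2,
    ("me", "mu"): 0.1,
    ("fle", "fure"): 0.1,
}
MAX_UNIT_LEN = 5  # assumed cap on the length of either side of a unit


def best_rewrite_prob(source: str, target: str) -> float:
    """Return the maximum product of TU probabilities over all joint partitions."""

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> float:
        # best(i, j): best probability of rewriting source[i:] into target[j:]
        if i == len(source) and j == len(target):
            return 1.0
        score = 0.0
        for a in range(i + 1, min(len(source), i + MAX_UNIT_LEN) + 1):
            for b in range(j + 1, min(len(target), j + MAX_UNIT_LEN) + 1):
                p = TU_PROB.get((source[i:a], target[j:b]), 0.0)
                if p > 0.0:
                    score = max(score, p * best(a, b))
        return score

    return best(0, 0)


prob = best_rewrite_prob("flextime", "furekkusutaimu")
```

With the toy table above, the partition f/le/x/ti/me wins over the one starting with fle→fure, which has no continuation in the table.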
  • 5. Joint Source Channel Model [Li et al. 04]
      • JSC Model: transliteration probability = product of TU n-gram probabilities
          P(flextime→furekkusutaimu) = P(f→fu|BOW) × P(le→re|f→fu) × P(x→kkusu|le→re) × …
      • TU probability estimation from the training corpus (EM algorithm)
          – Random initial alignment:
              fl ext im e → frek ku suta imu
              p i aget → pi a j e
          – Freq.→Prob.: TU probability table
              P( fl→flek |・) = XXX
              P( ext→ku |・) = YYY
              P( p→pi |・) = ZZZ
              …
          – Viterbi algorithm (re-alignment):
              f le x ti me → fu re kkusu tai mu
              pi a get → pi a je
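The "Freq.→Prob." step and the bigram product on slide 5 can be sketched as follows. This assumes the corpus is already aligned into TUs and uses plain relative frequencies; the two-word corpus, the `BOW` marker tuple, and the function names are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

BOW = ("<w>", "<w>")  # assumed begin-of-word marker for the first unit


def estimate_bigrams(aligned_corpus):
    """Turn TU bigram frequencies into conditional probabilities P(unit | previous unit)."""
    counts = defaultdict(Counter)
    for units in aligned_corpus:
        prev = BOW
        for unit in units:
            counts[prev][unit] += 1
            prev = unit
    return {
        prev: {unit: c / sum(following.values()) for unit, c in following.items()}
        for prev, following in counts.items()
    }


def jsc_prob(units, table):
    """Product of TU bigram probabilities for one aligned transliteration pair."""
    prob = 1.0
    prev = BOW
    for unit in units:
        prob *= table.get(prev, {}).get(unit, 0.0)
        prev = unit
    return prob


# Toy aligned corpus taken from the slide's examples.
corpus = [
    [("f", "fu"), ("le", "re"), ("x", "kkusu"), ("ti", "tai"), ("me", "mu")],
    [("pi", "pi"), ("a", "a"), ("get", "je")],
]
table = estimate_bigrams(corpus)
flextime_prob = jsc_prob(corpus[0], table)
```

In a full EM loop this estimation would alternate with a Viterbi re-alignment of the corpus; only the counting half is shown here.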
  • 6. Multiple Language Origins
      • Explicit language detection
          – Requires a training set annotated with language origins
      • Examples
          – piaget / piaje ピアジェ: French origin → French model
          – target / taagetto ターゲット: English origin → English model
          – 亚历山大 Yalishanda / Alexander: Indo-European origin → Chinese Transliteration Model
          – 山本 Yamamoto / Yamamoto: Japanese origin → Japanese Reading Model
      • Class Transliteration Model (Li et al. 07)
  • 8. Issues on Class Transliteration Model
      • Requires training sets tagged with language origins
          – Rare, especially for proper nouns
      • Language origins ≠ transliteration models
          – e.g., spaghetti / supageti スパゲティ: Italian origin, but can be found in English dictionaries
          – e.g., Carl Laemmle / kaaru remuri カール・レムリ: German immigrant, but listed as an "American" film producer → an English transliteration model doesn't work
      • Solution: model source language origins as latent classes
  • 9. Latent Class Transliteration (LCT) Model [Hagiwara & Sekine 11]
      • Models the "source language origins" as latent classes
      • "Latent classes" correspond to sets of words with similar transliteration characteristics
      • Trained via the EM algorithm from transliteration pairs
      • Class transliteration [Li et al. 07]: explicit language detection
      • Latent Class Transliteration [Hagiwara & Sekine 11]: latent class distribution
      • Notation: s = source, t = target, z = latent class, K = number of latent classes (determined using dev. sets)
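The formulas on slide 9 did not survive extraction. Using only the slide's notation (s, t, z, K), a latent class model of this kind would typically take the mixture form below; this is an assumed reconstruction of the general shape, not a copy of the slide's equations.

```latex
% Assumed mixture form: marginalize over the K latent classes z
P_{\mathrm{LCT}}(t \mid s) = \sum_{z=1}^{K} P(t, z \mid s)
                           = \sum_{z=1}^{K} P(z \mid s)\, P(t \mid s, z),
\qquad
P(z \mid s) \propto P(s \mid z)\, P(z)
```

The class transliteration model has the same shape, with z replaced by an explicitly detected language class.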
  • 14. Iterative Learning via EM Algorithm
      • Training pairs (each with latent classes Lx, Ly, Lz)
          – piaget → piaje
          – target → taagetto
          – …
      • Initial segmentation
          – p/i/a/get → pi/a/j/e
          – t/ar/get → taa/ge/tto
          – …
      • E step: transliteration probability based on Viterbi search, giving
          – pi/a/get → pi/a/je
          – tar/get → taa/getto
          – …
      • M step: update the transliteration model of each class (Lx, Ly, Lz) using Σγ*f(get→je ジェ)
          – P(pi→pi ピ)
          – P(ar→aa アー)
          – P(get→je ジェ)
          – P(get→getto ゲット)
          – …
      • Sensitive to noise in the training data because of maximum likelihood estimation
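The E step / M step loop of slide 14 can be sketched as EM for a mixture of unigram TU models. To stay short, this sketch assumes the segmentation into TUs is fixed (the slides re-segment with Viterbi search in the E step), and it adds a tiny floor `EPS` so that logarithms stay finite; the toy pairs and all names are illustrative assumptions.

```python
import math
import random
from collections import Counter

EPS = 1e-9  # tiny floor so log() never sees zero (an assumption of this sketch)


def em_unigram_mixture(pairs, num_classes, iterations=15, seed=0):
    """EM for a mixture of unigram TU models with fixed segmentation."""
    rng = random.Random(seed)
    vocab = sorted({unit for units in pairs for unit in units})
    counts = [Counter(units) for units in pairs]  # f(u) for each training pair

    # Uniform class priors and random class-conditional TU distributions.
    prior = [1.0 / num_classes] * num_classes
    unit_prob = []
    for _ in range(num_classes):
        weights = [rng.random() + 0.1 for _ in vocab]
        total = sum(weights)
        unit_prob.append({u: w / total for u, w in zip(vocab, weights)})

    gammas = []
    for _ in range(iterations):
        # E step: gamma[z] = posterior probability of class z for each pair.
        gammas = []
        for freq in counts:
            log_joint = [
                math.log(prior[z])
                + sum(c * math.log(unit_prob[z][u]) for u, c in freq.items())
                for z in range(num_classes)
            ]
            peak = max(log_joint)
            scaled = [math.exp(x - peak) for x in log_joint]
            norm = sum(scaled)
            gammas.append([x / norm for x in scaled])

        # M step: P(u|z) proportional to the gamma-weighted frequency of u.
        for z in range(num_classes):
            mass = sum(g[z] for g in gammas)
            prior[z] = (mass + EPS) / (len(pairs) + EPS * num_classes)
            expected = Counter()
            for freq, g in zip(counts, gammas):
                for u, c in freq.items():
                    expected[u] += g[z] * c
            total = sum(expected.values()) + EPS * len(vocab)
            unit_prob[z] = {u: (expected[u] + EPS) / total for u in vocab}
    return prior, unit_prob, gammas


pairs = [
    ["pi/pi", "a/a", "get/je"],
    ["tar/taa", "get/getto"],
    ["pi/pi", "a/a", "get/je"],
]
prior, unit_prob, gammas = em_unigram_mixture(pairs, num_classes=2)
```

Because the M step is a (nearly) pure maximum likelihood update, a class can lock onto a handful of noisy pairs, which is the sensitivity the slide points out.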
  • 15. Transliteration Models vs Topic Models
      Correspondence (transliteration models ↔ document topic models):
      • Transliteration Unit (atomic unit of substitution), e.g., pia / pia ピア, get / je ジェ ↔ Word
      • Transliteration Pair (sequence of transliteration units), e.g., pia get / pia je ピア ジェ ↔ Document
      • Alpha-Beta Model ↔ Word Unigram Language Model
      • Joint Source Channel Model ↔ Word n-gram Language Model
      • Class Transliteration Model [Li et al. 07] ↔ Classification + Switch LMs
      • Latent Class Transliteration Model [Hagiwara & Sekine 11] ↔ Unigram Mixture [Nigam et al. 00]
      • Proposed ↔ Dirichlet Mixture [Yamamoto & Sadamitsu 03]
      Introduce a Dirichlet mixture prior to alleviate overfitting
  • 16. Proposed Method: Latent Semantic Transliteration Model using Dirichlet Mixture (DM-LST)
      • Latent Class Transliteration [Hagiwara & Sekine 11]: one multinomial per latent class, P(u|z1), P(u|z2), P(u|z3)
      • Latent Semantic Transliteration using Dirichlet Mixture (Proposed): a mixture of Dirichlet distributions, P_Dir(p; α1), P_Dir(p; α2), P_Dir(p; α3)
      • Polya distribution [Yamamoto, Mochihashi 06]
      • Estimate Dirichlet mixture parameters via the EM algorithm [Yamamoto et al. 03]
      • [Figure: simplex over units u1 = get/je, u2 = get/getto, u3, with "French" and "English" regions, comparing multinomial and Dirichlet mixture]
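To make the Polya distribution on slide 16 concrete, the sketch below scores a bag of TUs under each Dirichlet component and normalizes to get component posteriors. It uses the standard Dirichlet-multinomial (Polya) form without the multinomial coefficient; the α values, the mixture weights, and the function names are assumptions chosen for illustration, not estimated parameters.

```python
import math


def log_polya(counts, alpha):
    """Log Polya probability of TU counts under one Dirichlet with parameters alpha.

    log P = lgamma(a0) - lgamma(a0 + n) + sum_u [lgamma(a_u + f_u) - lgamma(a_u)]
    """
    alpha0 = sum(alpha.values())
    n = sum(counts.values())
    result = math.lgamma(alpha0) - math.lgamma(alpha0 + n)
    for unit, c in counts.items():
        result += math.lgamma(alpha[unit] + c) - math.lgamma(alpha[unit])
    return result


def component_posterior(counts, weights, alphas):
    """Posterior over Dirichlet components given the observed TU counts."""
    logs = [math.log(w) + log_polya(counts, a) for w, a in zip(weights, alphas)]
    peak = max(logs)
    scaled = [math.exp(x - peak) for x in logs]
    norm = sum(scaled)
    return [x / norm for x in scaled]


# Hypothetical "French-like" and "English-like" components over two units.
french_like = {"get/je": 3.0, "get/getto": 1.0}
english_like = {"get/je": 1.0, "get/getto": 3.0}

p_single = math.exp(log_polya({"get/je": 1}, french_like))
posterior = component_posterior(
    {"get/je": 1}, [0.5, 0.5], [french_like, english_like]
)
```

For a single observed unit the Polya probability reduces to α_u / Σα, so the α values act as smoothed pseudo-counts; this is how the Dirichlet prior softens the maximum likelihood estimates.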
  • 17. Transliteration Generation Pipeline
      • Training corpus: (flextime, furekkusutaimu), (piaget, piaje), (target, taagetto)
      • Input: smith
      • Generation: JSC Model + stack decoder produce a candidate list
          – sumisu スミス
          – zumisu ズミス
          – sumaisu スマイス
          – sumaizu スマイズ
          – …
      • Re-ranking: DM-LST (Proposed)
      • Output: sumisu スミス
  • 18. Experiments
      Compared models:
      • Alpha-Beta Model (AB)
      • Joint Source Channel (JSC)
      • Latent Class Transliteration (LCT)
      • Latent Semantic Transliteration using Dirichlet Mixture (DM-LST; Proposed)
  • 19. Experimental Settings
      • Evaluation data
          – Transliteration pairs for En-Ja, En-Ch, En-Ko in NEWS2009 [Li et al. 09]
      • Evaluation metrics
          – ACC: Averaged Top-1 Accuracy
          – MFS: Mean F-Score
          – MRR: Mean Reciprocal Rank
      • Parameters
          – Fixed: stack beam width B = 32, EM iterations = 15
          – Number of latent classes: tuned on the dev. set for each data set
      • Data set sizes (number of pairs)
          Set     Train    Dev.    Test
          En-Ja   23,225   1,492   1,489
          En-Ch   31,961   2,896   2,896
          En-Ko    4,785     987     989
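Two of the metrics on slide 19 are easy to state in code. The sketch below assumes exactly one reference transliteration per input, which is a simplification; the function names and the toy candidate lists are illustrative, and MFS is omitted.

```python
def top1_accuracy(ranked_lists, references):
    """ACC: fraction of inputs whose top-ranked candidate equals the reference."""
    hits = sum(
        1
        for candidates, ref in zip(ranked_lists, references)
        if candidates and candidates[0] == ref
    )
    return hits / len(references)


def mean_reciprocal_rank(ranked_lists, references):
    """MRR: mean of 1/rank of the reference, counting 0 when it is not in the list."""
    total = 0.0
    for candidates, ref in zip(ranked_lists, references):
        if ref in candidates:
            total += 1.0 / (candidates.index(ref) + 1)
    return total / len(references)


# Toy candidate lists; the third input has no correct candidate at all.
ranked = [
    ["sumisu", "zumisu", "sumaisu"],
    ["diyon", "dijon"],
    ["gorudenberugu"],
]
refs = ["sumisu", "dijon", "gorudenbagu"]

acc = top1_accuracy(ranked, refs)
mrr = mean_reciprocal_rank(ranked, refs)
```

ACC only rewards the top candidate, while MRR gives partial credit when the reference appears lower in the list, which is why re-ranking can move the two metrics by different amounts.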
  • 20. Results
          Set     Model    ACC     MFS     MRR
          En-Ja   AB       0.293   0.755   0.378
          En-Ja   JSC      0.326   0.770   0.428
          En-Ja   LCT      0.345   0.768   0.437
          En-Ja   DM-LST   0.349   0.776   0.444
          En-Ch   AB       0.358   0.741   0.471
          En-Ch   JSC      0.417   0.761   0.527
          En-Ch   LCT      0.430   0.764   0.532
          En-Ch   DM-LST   0.445   0.779   0.546
          En-Ko   AB       0.145   0.537   0.211
          En-Ko   JSC      0.151   0.543   0.221
          En-Ko   LCT      0.079   0.483   0.167
          En-Ko   DM-LST   0.174   0.556   0.237
  • 22. Examples (☓ = incorrect, ○ = correct)
      • dijon (En-Ja): conventional ☓ diyon ディヨン; proposed ○ dijon ディジョン
      • goldenberg (En-Ja): conventional ☓ gōrudenberugu ゴールデンベルグ; proposed ○ gōrudenbāgu ゴールデンバーグ
      • covell (En-Ch): conventional ☓ kefuer 科夫尔; proposed ○ keweier 科维尔
      • netherwood (En-Ch): conventional ☓ neitehewude 内特赫伍德; proposed ○ neisewude 内瑟伍德
      • darling (En-Ko): conventional ☓ dareuling 다르링; proposed ○ dalling 달링
      • gutheim (En-Ch): conventional ○ gutehaimu 古特海姆; proposed ○ gutehaimu 古特海姆
      • martina (En-Ko): conventional ○ mareutina 마르티나; proposed ○ mareutina 마르티나
  • 23. Conclusion
      • Proposed Latent Semantic Transliteration based on Dirichlet Mixture (DM-LST)
          – Formalized conventional transliteration models in terms of document topic models
          – Introduced a Dirichlet mixture prior to alleviate overfitting
          – Achieved better transliteration performance than the conventional methods
      • Future work
          – Deal with transliteration unit N-grams (N ≥ 2)
          – Context-dependent transliteration, e.g., Charles → チャールズ chāruzu or シャルル sharuru