# Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics


Published at ACL 2012 / NEWS 2012



1. Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics (Yoh Okuno)
2. Outline
    - Introduction
    - System
    - Experiments
    - Conclusion
3. Outline
    - Introduction
      - Statistical Machine Transliteration
      - Baseline and Our Systems
    - System
    - Experiments
    - Conclusion
4. Machine Transliteration as Monotonic SMT [Finch+ 2008]
    - The most common approach to machine transliteration follows the manner of SMT (Statistical Machine Translation)
    - It consists of 3 steps:
      1. Align the training data monotonically (character-based)
      2. Train a discriminative model on the aligned data
      3. Decode the input string into an n-best list
5. Example of Statistical Transliteration
    - Given training data of transliteration pairs:
      - OKUNO 奥野
      - NOMURA 野村
      - MURAI 村井
6. Example of Statistical Transliteration
    - Step 1: Align the training data utilizing co-occurrence:
      - OKU:NO 奥:野
      - NO:MURA 野:村
      - MURA:I 村:井
7. Example of Statistical Transliteration
    - Step 2: Train a statistical model from the aligned data; learned rules:
      - OKU → 奥
      - NO → 野
      - MURA → 村
      - I → 井
8. Example of Statistical Transliteration
    - Step 3: Decode new inputs and return outputs:
      - OKUMURA → 奥村
      - OKUI → 奥井
      - MURANO → 村野
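The align–train–decode example on slides 5–8 can be sketched in a few lines of Python. The rule table is exactly the one shown on slide 7; the greedy longest-match decoder is my simplification for illustration only (the actual system decodes with DirecTL+'s discriminative model, not a greedy lookup):

```python
# Toy illustration of the align -> train -> decode pipeline from the slides.
# The "learned model" is the rule table from slide 7; real systems learn it
# via EM alignment (mpaligner) and discriminative training (DirecTL+).
RULES = {"OKU": "奥", "NO": "野", "MURA": "村", "I": "井"}

def decode(source: str) -> str:
    """Greedy longest-match decoding (a simplification of DirecTL+)."""
    out, i = [], 0
    while i < len(source):
        # Try the longest matching rule starting at position i.
        for length in range(len(source) - i, 0, -1):
            segment = source[i:i + length]
            if segment in RULES:
                out.append(RULES[segment])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {source[i:]!r}")
    return "".join(out)

print(decode("OKUMURA"))  # 奥村
print(decode("OKUI"))     # 奥井
print(decode("MURANO"))   # 村野
```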
9. The Baseline System using m2m-aligner [Jiampojamarn+ 2007, 2008]
    - Pipeline: Training Data → Align (m2m-aligner) → Train (DirecTL+) → Decode (DirecTL+) → Output (N-best list)
10. Our System: mpaligner with Heuristics
    - Pre-processing: Japanese-specific heuristics
      1. JnJk: De-romanization
      2. EnJa: Syllable-based alignment
    - Align: mpaligner, an improved alignment tool [Kubo+ 2011]
      1. Better accuracy than m2m-aligner
      2. No hand-tuned parameters
    - Train and Decode: DirecTL+
    - Output: N-best list
11. Outline
    - Introduction
    - System
      - Comparing Aligners
      - Japanese-Specific Heuristics
    - Experiments
    - Conclusion
12. m2m-aligner: Many-to-Many Alignments [Jiampojamarn+ 2007]
    - Alignment tool based on the EM algorithm and MLE
    - Advantages:
      1. Can align multiple characters
      2. Performs well on short alignments
    - Disadvantages:
      1. Poor performance on long alignments due to overfitting
      2. Requires hand-tuning of length-limit parameters
    - http://code.google.com/p/m2m-aligner/
13. mpaligner: Minimum Pattern Aligner [Kubo+ 2011]
    - Idea: penalize long alignments during the E-step by a simple scaling of pair probabilities
    - Notation: x is the source string, y the target string, |x| and |y| their lengths, and P(x, y) the probability of the string pair (x, y)
    - Good performance without hand-tuned parameters
    - http://sourceforge.jp/projects/mpaligner/
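The scaling equation on this slide was rendered as an image and did not survive extraction. From the variable definitions listed on the slide and the description in Kubo+ 2011, a hedged reconstruction is:

```latex
% Reconstruction (not verbatim from the slide): the probability of each
% string pair is scaled by its combined length during the E-step.
P(x, y) \leftarrow P(x, y)^{|x| + |y|}
```

Since P(x, y) < 1, raising it to the combined length shrinks the scores of long pairs, which matches the stated idea of penalizing long alignments during the E-step.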
14. Motivation: Invalid Alignment Problem
    - Character-based alignment can be phonetically invalid: it may divide atomic units into meaningless pieces
    - We call the smallest unit of alignment a syllable
    - Syllable-based alignment should be used for this task, but there is no training data for syllable-based alignment
    - In this study, we propose Japanese-specific heuristics for this problem, drawing on knowledge of Japanese
15. Examples of Invalid and Valid Alignment
    - In Japanese, consonants should be combined with vowels

    JnJk Task:

    | Type | Source | Target |
    |---|---|---|
    | Valid | SUZU:KI | 鈴:木 |
    | Invalid | SUZ:UKI | 鈴:木 |
    | Valid | HIRO:MI | 裕:実 |
    | Invalid | HIR:OMI | 裕:実 |
    | Valid | OKU:NO | 奥:野 |
    | Invalid | OK:UNO | 奥:野 |

    EnJa Task:

    | Type | Source | Target |
    |---|---|---|
    | Valid | Ar:thur | アー:サー |
    | Invalid | A:r:th:ur | ア:ー:サ:ー |
    | Valid | Cha:p:li:n | チャッ:プ:リ:ン |
    | Invalid | C:h:a:p:li:n | チ:ャ:ッ:プ:リ:ン |
    | Valid | Ju:s:mi:ne | ジャ:ス:ミ:ン |
    | Invalid | J:u:s:mi:ne | ジ:ャ:ス:ミ:ン |
16. Language-Specific Heuristics as Preprocessing
    - Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing:
      - Combine atomic strings into syllables
      - Treat a syllable as one character during alignment
    - The definition of syllable must be chosen carefully, since it may cause bad side effects
    - Some context is incorporated as n-gram features
17. JnJk Task: De-romanization Heuristic
    - De-romanization: convert Roman characters to Kana
    - A consonant and a vowel are coupled into one Kana
    - A common romanization table (Hepburn) is used: http://www.social-ime.com/conv-table.html

    | Roman | A | I | U | E | O | KA | KI | KU | KE | KO | … |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | Kana | あ | い | う | え | お | か | き | く | け | こ | … |
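A minimal sketch of the de-romanization heuristic, assuming greedy longest-match over a Hepburn table. Only a few illustrative rows are included here; the actual heuristic uses the full conversion table linked on the slide:

```python
# Minimal de-romanization sketch: greedy longest-match over a Hepburn table.
# Only a handful of rows are included for illustration; the real heuristic
# uses the full table at http://www.social-ime.com/conv-table.html.
HEPBURN = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
    "MA": "ま", "MI": "み", "MU": "む", "ME": "め", "MO": "も",
    "NA": "な", "NI": "に", "NU": "ぬ", "NE": "ね", "NO": "の",
    "RA": "ら", "RI": "り", "RU": "る", "RE": "れ", "RO": "ろ",
}
MAX_KEY = max(map(len, HEPBURN))

def deromanize(roman: str) -> str:
    """Couple consonants with vowels into Kana, longest match first."""
    out, i = [], 0
    while i < len(roman):
        for length in range(min(MAX_KEY, len(roman) - i), 0, -1):
            chunk = roman[i:i + length]
            if chunk in HEPBURN:
                out.append(HEPBURN[chunk])
                i += length
                break
        else:
            raise ValueError(f"cannot de-romanize {roman[i:]!r}")
    return "".join(out)

print(deromanize("OKUNO"))   # おくの
print(deromanize("NOMURA"))  # のむら
print(deromanize("MURAI"))   # むらい
```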
18. EnJa Task: Syllable-based Alignment
    - In the EnJa task, the target side should be aligned in units of syllables, not characters
    - Combine sub-characters with the preceding character
    - There are 3 types of sub-characters:
      1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
      2. Silent character (Soku-on, the geminate marker): e.g. ッ
      3. Hyphen (Cho-on; long vowel): e.g. ー
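The combination rule above can be sketched as follows; the three sub-character classes are as listed on the slide, though the exact small-kana inventory in the set below is my illustrative choice:

```python
# Sketch of the EnJa syllable heuristic: attach sub-characters (small kana,
# the sokuon, and the long-vowel mark) to the preceding character so that
# each resulting unit is one alignment "syllable".
SMALL_KANA = set("ャュョァィゥェォ")  # Yo-on lower-case characters (illustrative set)
SOKUON = "ッ"                         # geminate marker
CHOON = "ー"                          # long-vowel hyphen

def syllabify(katakana: str) -> list[str]:
    units: list[str] = []
    for ch in katakana:
        if units and (ch in SMALL_KANA or ch == SOKUON or ch == CHOON):
            units[-1] += ch   # combine with the previous unit
        else:
            units.append(ch)  # start a new unit
    return units

print(syllabify("アーサー"))      # ['アー', 'サー']
print(syllabify("チャップリン"))  # ['チャッ', 'プ', 'リ', 'ン']
```

This reproduces the valid EnJa alignments from slide 15, e.g. Arthur → アー:サー and Chaplin → チャッ:プ:リ:ン.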
19. Outline
    - Introduction
    - System
    - Experiments
      - Official Scores for 8 Language Pairs
      - Further Investigation for JnJk and EnJa
    - Conclusion
20. Experimental Settings
    - Conducted 2 types of experiments:
      - Official evaluation on the test set for 8 language pairs
      - Comparison of the proposed and baseline systems on the development set for the JnJk and EnJa tasks
    - Mostly followed the default settings of the tools:
      - m2m-aligner: length limits selected carefully
      - Iteration number: optimized on the development set
      - Features: n-gram (N=2) and context (size=7) features
21. Official Scores for 8 Language Pairs
    - Applied the heuristics to the JnJk and EnJa tasks
    - Performed well (top rank on EnPe and EnHe)

    | Task | ACC | F-Score | MRR | MAP | Rank |
    |---|---|---|---|---|---|
    | JnJk | 0.512 | 0.693 | 0.582 | 0.401 | 2 |
    | EnJa | 0.362 | 0.803 | 0.469 | 0.359 | 2 |
    | EnCh | 0.301 | 0.655 | 0.376 | 0.292 | 5 |
    | ChEn | 0.013 | 0.259 | 0.017 | 0.013 | 4 |
    | EnKo | 0.334 | 0.688 | 0.411 | 0.334 | 3 |
    | EnBa | 0.404 | 0.882 | 0.515 | 0.403 | 2 |
    | EnPe | 0.658 | 0.941 | 0.761 | 0.640 | 1 |
    | EnHe | 0.191 | 0.808 | 0.254 | 0.190 | 1 |
22. Results in the JnJk and EnJa Tasks
    - The proposed system outperforms both baselines

    JnJk Task:

    | Method | ACC | F-Score | MRR | MAP |
    |---|---|---|---|---|
    | m2m-aligner | 0.113 | 0.389 | 0.182 | 0.114 |
    | mpaligner | 0.121 | 0.391 | 0.197 | 0.122 |
    | Proposed | 0.199 | 0.494 | 0.300 | 0.200 |

    EnJa Task:

    | Method | ACC | F-Score | MRR | MAP |
    |---|---|---|---|---|
    | m2m-aligner | 0.280 | 0.737 | 0.359 | 0.280 |
    | mpaligner | 0.326 | 0.761 | 0.431 | 0.326 |
    | Proposed | 0.358 | 0.774 | 0.469 | 0.358 |
23. Output Examples (10-best lists), JnJk inputs Harui and Kyotaro, EnJa inputs Bloy and Grothendieck:

    | Rank | Harui | Kyotaro | Bloy | Grothendieck |
    |---|---|---|---|---|
    | 1 | 春井 | 京太郎 | ブロイ | グローテンディック |
    | 2 | 晴井 | 恭太郎 | ブロア | グロートンディック |
    | 3 | 治井 | 匡太郎 | ブローイ | グローテンディーク |
    | 4 | 榛井 | 強太郎 | ブロワ | グローテンディック |
    | 5 | 敏井 | 共太郎 | ブロッイ | グローゾンディック |
    | 6 | 明井 | 享太郎 | ブロヤ | グローテンジーク |
    | 7 | 陽井 | 亨太郎 | ブロヨ | グローザーンディック |
    | 8 | 遙井 | 杏太郎 | ブウォイ | グローザンディック |
    | 9 | 遥井 | 鋸太郎 | ブロティ | グローシンディック |
    | 10 | 温井 | 教太郎 | ブロレィ | グローゼンディック |
24. Error Analysis
    - Sparseness problem:
      - Side effect of syllable-based alignment in the EnJa task
      - Too many target-side characters in the JnJk task
    - Word origin [Hagiwara+ 2011]:
      - English names come from various languages
      - First and family names can be modeled differently
      - Gender: first names are quite different
    - Training-data inconsistency or ambiguity:
      - e.g. JAPAN → 日本国 (not a transliteration)
25. Outline
    - Introduction
    - System
    - Experiments
    - Conclusion
      - Future Work
26. Conclusion
    - Applied mpaligner to the machine transliteration task for the first time
      - Performed better than m2m-aligner
      - The maximum likelihood estimation approach is not suitable
    - Proposed Japanese-specific heuristics for the JnJk and EnJa tasks:
      - De-romanization for the JnJk task
      - Syllable-based alignment for the EnJa task
27. Future Work
    - Combine these heuristics with language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
    - Develop language-dependent heuristics for languages other than Japanese
    - Can we find such heuristics automatically?
28. References (1)
    - Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
    - Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
    - Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
    - Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
    - Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.
29. References (2)
    - Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
    - Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
    - Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.
30. WTIM: Workshop on Text Input Methods
    - 1st workshop held with IJCNLP 2011 (Thailand)
      - 12 presenters from Google, Microsoft, and Yahoo
      - https://sites.google.com/site/wtim2011/
    - 2nd workshop planned with COLING 2012 (India)
      - December 2012 in Mumbai, India
      - Are you interested as a presenter or an attendee?
31. Any Questions?