Deep Technologies about Kana Kanji Conversion

  1. Deep Technologies in Kana Kanji Conversion (Yoh Okuno)
  2. Components of a Converter
     •  Model, training, storage, interface, etc.
     •  [Diagram: corpora → training (batch) → model; the converter looks up the model to turn the user's kana input into kanji output]
  3. Deep Technologies
     •  Various language models
     •  Training LMs using the Web and Hadoop
     •  Automatic pronunciation inference
     •  Data compression
     •  Predictive conversion and spelling correction
  4. Various Language Models
  5. Various Language Models
     •  Word N-gram
        –  Accurate, but too large!
     •  Class N-gram
        –  Small, but inaccurate
     •  Combination
        –  A good trade-off is needed
  6. Language Models
     •  Word N-gram
        P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})
     •  Class bigram
        P(y) = \prod_i P(y_i \mid c_i) \, P(c_i \mid c_{i-1})
     •  Phrase-based model
        P(y) = \prod_{i \in I_C} P(y_i \mid c_i) \, P(c_i \mid c_{i-1})
               \times \prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i) \, P(c_i \mid c_{i-1})
        (class-based sub-model over positions I_C, word-based sub-model over positions I_W)
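
As an illustration of the class bigram factorization above, here is a minimal sketch that scores a segmented sentence. The probability tables and the romanized example sentence are made-up illustration values, not real model data:

    import math

    # Toy class-bigram model: P(y) = prod_i P(y_i | c_i) * P(c_i | c_{i-1}).
    # All probabilities below are invented for illustration.
    p_word_given_class = {             # P(word | class)
        ("kyou", "NOUN"): 0.01,
        ("wa", "PARTICLE"): 0.30,
        ("hareru", "VERB"): 0.005,
    }
    p_class_bigram = {                 # P(class | previous class)
        ("NOUN", "<S>"): 0.4,
        ("PARTICLE", "NOUN"): 0.5,
        ("VERB", "PARTICLE"): 0.2,
    }

    def class_bigram_logprob(words, classes):
        """Log probability of a segmented sentence under the class bigram model."""
        logp = 0.0
        prev = "<S>"
        for w, c in zip(words, classes):
            logp += math.log(p_word_given_class[(w, c)])
            logp += math.log(p_class_bigram[(c, prev)])
            prev = c
        return logp

    print(class_bigram_logprob(["kyou", "wa", "hareru"], ["NOUN", "PARTICLE", "VERB"]))
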
  7. Phrase-based Model
     •  Replace part of the class bigram with a word N-gram
     •  Intermediate classes are marginalized out
     •  Phrase probability: P(w1, w2, w3, c3 | c1)
        –  Only the left-side class is conditioned on!
     [Diagram: lattice of classes and words showing the phrase spanning w1..w3]
  8. Training Large-Scale Language Models
  9. Issues in Training
     •  How to collect large corpora?
        –  Crawl, crawl, crawl!
        –  A morphological analyzer is needed
     •  How to store and process them?
        –  Hadoop MapReduce helps us
        –  How can N-gram counting be sped up?
  10. Crawling the Web
     •  Raw HTML can be collected from the Web
     •  Statistics have no copyright
     •  Required components:
        –  Web crawler
        –  Body text extraction
        –  (Spam filter)
        –  Morphological analyzer
     •  Make use of the cloud
  11. Japanese Morphological Analyzer
     •  Input: raw text
     •  Output: segmented words, part-of-speech tags, ...
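
For concreteness, a minimal sketch of this step using MeCab's Python binding; it assumes the mecab-python3 package and a system dictionary are installed, and the sample sentence is arbitrary:

    import MeCab

    # Segment a raw Japanese sentence into words with part-of-speech tags.
    tagger = MeCab.Tagger()
    text = "今日は晴れです"   # arbitrary example: "It is sunny today."

    for line in tagger.parse(text).splitlines():
        if line == "EOS":                  # MeCab marks the end of the sentence
            break
        surface, features = line.split("\t")
        pos = features.split(",")[0]       # first feature field is the part of speech
        print(surface, pos)
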
  12. MapReduce for Language Models
     •  Distributed computation of N-gram statistics
        –  Mapper: extract N-grams from the corpora
        –  Reducer: aggregate N-gram counts
     [Diagram: corpora → mappers → reducers → N-gram counts]
  13. MapReduce: Pseudo Code
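
The pseudo code itself is shown as an image on the original slide. A minimal Hadoop Streaming style sketch in Python (trigram counting over whitespace-segmented text, run here as a single local process) might look like this:

    import sys
    from collections import defaultdict

    N = 3  # order of the N-grams to count

    def mapper(lines):
        """Emit (N-gram, 1) pairs for each line of pre-segmented text."""
        for line in lines:
            words = line.split()
            for i in range(len(words) - N + 1):
                yield " ".join(words[i:i + N]), 1

    def reducer(pairs):
        """Aggregate the counts for each N-gram."""
        counts = defaultdict(int)
        for ngram, count in pairs:
            counts[ngram] += count
        return counts

    if __name__ == "__main__":
        # Local, single-process stand-in for the distributed job.
        for ngram, count in sorted(reducer(mapper(sys.stdin)).items()):
            print(f"{ngram}\t{count}")
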
  14. Speeding up N-gram Counting
     •  Use a binary representation for N-grams
        –  Variable-length word IDs are efficient
     •  Use in-mapper combining (Jimmy Lin)
        –  Combining in memory is more efficient
     •  Use the stripes pattern (Jimmy Lin)
        –  Group N-grams by their first word (a sketch follows this slide)
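
As referenced above, here is a minimal sketch of in-mapper combining with the stripes pattern for bigrams (counts buffered in memory and emitted as per-word dictionaries); the flush threshold is an arbitrary illustration value:

    from collections import defaultdict

    FLUSH_EVERY = 10000  # arbitrary buffer size; a real job would tune this

    def stripes_mapper(lines):
        """In-mapper combining: buffer bigram counts as stripes keyed by the
        first word, emitting (word, {next_word: count}) dictionaries in bulk."""
        stripes = defaultdict(lambda: defaultdict(int))
        buffered = 0
        for line in lines:
            words = line.split()
            for w1, w2 in zip(words, words[1:]):
                stripes[w1][w2] += 1
                buffered += 1
            if buffered >= FLUSH_EVERY:
                yield from stripes.items()
                stripes = defaultdict(lambda: defaultdict(int))
                buffered = 0
        yield from stripes.items()   # flush whatever is left at the end

    def stripes_reducer(key, stripe_list):
        """Merge the stripes emitted for one first word into final bigram counts."""
        merged = defaultdict(int)
        for stripe in stripe_list:
            for w2, count in stripe.items():
                merged[w2] += count
        return key, dict(merged)
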
  15. Performance-Size Trade-off [Okuno+ 2011]
     [Figure: cross entropy (bits) versus model size (bytes) as the count threshold varies, with operating points marked for mobile, PC, and cloud]
  16. Automatic Pronunciation Inference
  17. Pronunciation Inference
     •  A Japanese word typically has 1-3 pronunciations
     •  How do we pronounce whole sentences or phrases?
     •  Basic approaches:
        –  Word-based: combine the pronunciations of words
        –  Character-based: combine the pronunciations of characters
  18. Mining Pronunciations via Hadoop
     •  Corpora contain (phrase, pronunciation) pairs, written as the phrase followed by its reading in parentheses
        –  In English: Phrase (Pronunciation)
     •  Distributed grep with a regular expression along the lines of:
        "\p{InCJKUnifiedIdeographs}+（\p{InHiragana}+）"
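
A minimal grep-style mapper sketch for this mining step, using the third-party regex module (its Unicode script properties Han and Hiragana stand in for the Java character classes on the slide, and the fullwidth parentheses are an assumption about how readings are written in the corpora):

    import sys
    import regex  # third-party module with Unicode script properties (pip install regex)

    # Match a kanji phrase immediately followed by its hiragana reading in
    # fullwidth parentheses, e.g. "...東京（とうきょう）..." (assumed notation).
    PATTERN = regex.compile(r"(\p{Han}+)（(\p{Hiragana}+)）")

    def mapper(lines):
        """Grep-style mapper: emit (phrase, reading) candidate pairs with count 1."""
        for line in lines:
            for phrase, reading in PATTERN.findall(line):
                yield f"{phrase}\t{reading}", 1

    if __name__ == "__main__":
        for key, count in mapper(sys.stdin):
            print(f"{key}\t{count}")
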
  19. Character Alignment Task
     •  Character alignment for noise reduction
     •  Input: pairs of word and pronunciation
     •  Output: aligned pairs, e.g. iPhone segmented as i|Ph|o|n|e and aligned character-by-character to its reading
     •  We can use an HMM and the EM algorithm
  20. Data Compression
  21. Why Compression?
     •  Input methods should save memory for other apps
     •  Typically 50 MB for PC and 1-2 MB for mobile
     •  Compress the data as much as possible!
     •  Solution: succinct data structures
  22. LOUDS: Succinct Trie
     •  Use unary codes to represent the tree compactly
     •  Example: a 9-node tree is encoded as the bit string 10 11110 0 110 0 10 0 0 10 0
        –  size = #nodes × 2 + 1 = 19 bits
     •  Requires an auxiliary index (for rank/select) besides the bit string
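
A minimal sketch of building the LOUDS bit string by breadth-first traversal; the example tree is an arbitrary illustration, not the one drawn on the slide:

    from collections import deque

    def louds_bits(children, root):
        """Build the LOUDS bit string: a super-root '10', then for each node in
        BFS order emit one '1' per child followed by a terminating '0'."""
        bits = ["1", "0"]                      # super-root pointing at the real root
        queue = deque([root])
        while queue:
            node = queue.popleft()
            kids = children.get(node, [])
            bits.extend("1" * len(kids))
            bits.append("0")
            queue.extend(kids)
        return "".join(bits)

    # Arbitrary example tree: a has children b and c, and b has one child d.
    tree = {"a": ["b", "c"], "b": ["d"]}
    encoding = louds_bits(tree, "a")
    print(encoding, len(encoding), "bits")     # 4 nodes -> 2*4 + 1 = 9 bits
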
  23. MARISA: Nested Patricia Trie [Yata+ 11]
     •  Merge non-branching nodes of the trie into a Patricia trie, and apply the idea recursively
     [Diagram: normal trie vs. Patricia trie]
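
For illustration, a minimal sketch using the marisa-trie Python binding (assuming the marisa-trie package is installed; the key list is arbitrary):

    import marisa_trie

    # Build a MARISA trie over a small, arbitrary set of kana keys and
    # look keys up by exact match and by prefix.
    keys = ["きょう", "きょうと", "きょうし", "とうきょう"]
    trie = marisa_trie.Trie(keys)

    print("きょうと" in trie)            # exact lookup -> True
    print(trie.prefixes("きょうとふ"))   # keys that are prefixes of the query
    print(trie.keys("きょう"))           # keys that start with the given prefix
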
  24. Other Functions
  25. Predictive Conversion
     •  Motivation: we want to save keystrokes
     •  Approach: show the most probable completions once the user has typed the first few characters
  26. Predictive Conversion
     •  Accuracy and candidate length are a trade-off
        –  e.g. "Good" vs. "Good morning"
     •  Phrase extraction is needed
        –  Eliminate candidates like "(you very much)": a sub-sequence of a longer phrase
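
A minimal sketch of prediction by prefix completion over a unigram table; the entries and probabilities are made-up illustration values:

    # Toy prediction: given a typed kana prefix, rank completions by a
    # made-up unigram probability table (illustration values only).
    unigram = {
        "おはよう": 0.006,
        "おはようございます": 0.004,
        "おめでとう": 0.003,
        "おやすみ": 0.002,
    }

    def predict(prefix, k=3):
        """Return the k most probable dictionary entries extending the prefix."""
        candidates = [(p, word) for word, p in unigram.items() if word.startswith(prefix)]
        return [word for p, word in sorted(candidates, reverse=True)[:k]]

    print(predict("おはよ"))   # -> ['おはよう', 'おはようございます']
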
  27. Phrase Extraction for Prediction [Okuno+ 2011]
     •  A paper on phrase extraction is to appear
     •  Digest: fast and accurate phrase extraction
  28. Spelling Correction
     •  Correct the user's mistyped input
     •  Search: a trie for fuzzy matching
     •  Model: edit distance as the error model
     •  Edit operations: insert, delete, and replace
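
A minimal sketch of the error model: Levenshtein edit distance with the three operations above, used to pick dictionary entries close to the mistyped input. The dictionary is an arbitrary toy list; a real IME would search a trie rather than scan every entry:

    def edit_distance(a, b):
        """Levenshtein distance with insert, delete, and replace operations."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                 # delete ca
                               cur[j - 1] + 1,              # insert cb
                               prev[j - 1] + (ca != cb)))   # replace (or match)
            prev = cur
        return prev[-1]

    # Arbitrary toy dictionary for illustration.
    dictionary = ["こんにちは", "こんばんは", "ありがとう"]

    def correct(query, max_dist=1):
        """Return dictionary entries within the allowed edit distance of the query."""
        return [w for w in dictionary if edit_distance(query, w) <= max_dist]

    print(correct("こんにちわ"))   # -> ['こんにちは']
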
  29. Conclusion
  30. Conclusion
     •  Various technologies are needed:
        –  Statistical language models and their training
        –  Morphological analysis and pronunciation inference
        –  Data compression and retrieval
        –  Predictive conversion and spelling correction
