Deep Technologies about Kana Kanji Conversion

Transcript

  • 1. Deep Technologies in Kana Kanji Conversion (Yoh Okuno)
  • 2. Components of a Converter
      • Model, training, storage, interface, etc.
      • [Figure: corpora → train (batch) → model; the converter looks the model up, taking kana input from the user and returning kanji output]
  • 3. Deep Technologies
      • Various language models
      • Training LMs using the Web and Hadoop
      • Automatic pronunciation inference
      • Data compression
      • Predictive conversion and spelling correction
  • 4. Various Language Models
  • 5. Various Language Models
      • Word N-gram
          – Accurate but too large!
      • Class N-gram
          – Small but inaccurate
      • Combination
          – A good trade-off is needed
  • 6. Language Models
      • Word N-gram: $P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})$
      • Class bigram: $P(y) = \prod_i P(y_i \mid c_i) P(c_i \mid c_{i-1})$
      • Phrase-based model: $P(y) = \prod_{i \in I_C} P(y_i \mid c_i) P(c_i \mid c_{i-1}) \cdot \prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i) P(c_i \mid c_{i-1})$
        (the first product is the class-based sub-model, the second is the word-based sub-model)
  • 7. Phrase-based Model
      • Replace part of the class bigram with a word N-gram
      • Intermediate classes are marginalized out
      • Phrase probability: P(w1, w2, w3, c3 | c1); only the left-side class is conditioned on!
  • 8. Training Large-Scale Language Models
  • 9. Issues in Training
      • How do we collect large corpora?
          – Crawl, crawl, crawl!
          – A morphological analyzer is needed
      • How do we store and process them?
          – Hadoop MapReduce helps us
          – How do we speed up N-gram counting?
  • 10. Crawling the Web
      • Raw HTML can be collected from the Web
      • Statistics have no copyright
      • Required components:
          – Web crawler
          – Body text extraction
          – (Spam filter)
          – Morphological analyzer
      • Make use of the cloud
  • 11. Japanese Morphological Analyzer
      • Input: raw text
      • Output: segmented words, part-of-speech tags, ...
  • 12. MapReduce for Language Models
      • Distributed computation of N-gram statistics
          – Mapper: extract N-grams from the corpora
          – Reducer: aggregate N-gram counts
      • [Figure: corpora fan out to multiple mappers; their output is shuffled to reducers, which emit the N-gram counts]
  • 13. MapReduce: Pseudo Code
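A minimal Hadoop Streaming sketch of such a mapper/reducer pair in Python. The file names, the order N = 3, and the assumption of a whitespace-segmented corpus are illustrative, not taken from the slides.

```python
# mapper.py: emit (N-gram, 1) for every N-gram in each input line
import sys

N = 3  # assumed N-gram order

for line in sys.stdin:
    words = line.split()          # corpus is assumed to be pre-segmented
    for i in range(len(words) - N + 1):
        ngram = " ".join(words[i:i + N])
        print(f"{ngram}\t1")
```

```python
# reducer.py: sum the counts for each N-gram (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    ngram, count = line.rsplit("\t", 1)
    if ngram != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = ngram, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```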
  • 14. Speeding up N-gram Counting
      • Use a binary representation for N-grams
          – Variable-length IDs for words are efficient
      • Use in-mapper combining (by Jimmy Lin)
          – Combining in memory is more efficient
      • Use the stripes pattern (by Jimmy Lin)
          – Group N-grams by their first word
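A sketch of the last two ideas, continuing the hypothetical streaming job above: counts are combined in memory inside the mapper and grouped into one stripe per first word. Buffering the whole stripe map for a map task is a simplification.

```python
# mapper_stripes.py: in-mapper combining with the stripes pattern --
# counts are buffered in memory and grouped by the first word of the N-gram.
import sys
import json
from collections import defaultdict

N = 3  # assumed N-gram order
stripes = defaultdict(lambda: defaultdict(int))

for line in sys.stdin:
    words = line.split()
    for i in range(len(words) - N + 1):
        first, rest = words[i], " ".join(words[i + 1:i + N])
        stripes[first][rest] += 1

# flush once at the end of the map task (the in-mapper combine step)
for first, counts in stripes.items():
    print(f"{first}\t{json.dumps(counts, ensure_ascii=False)}")
```

The matching reducer would merge the JSON stripes for each first word element-wise before emitting the final N-gram counts.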
  • 15. Performance-Size Trade-off [Okuno+ 2011]
      • [Figure: cross entropy (bits) vs. model size (bytes) as the count threshold varies, with operating points for mobile, PC, and cloud]
  • 16. Automatic Pronunciation Inference
  • 17. Pronunciation Inference
      • A Japanese word typically has 1-3 pronunciations
      • How do we pronounce sentences or phrases?
      • Basic approaches:
          – Word-based: combine each word's pronunciations
          – Character-based: combine each character's pronunciations
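As a toy illustration of the word-based approach, phrase readings can be generated as the Cartesian product of each word's dictionary readings; the tiny dictionary below is a made-up example, not from the slides.

```python
# word_pron.py: word-based pronunciation inference as a Cartesian product of
# each word's known readings. Words missing from the dictionary yield no
# candidates in this simplified sketch.
from itertools import product

readings = {"東京": ["とうきょう"], "都": ["と", "みやこ"]}   # toy dictionary

def phrase_readings(words):
    options = [readings.get(w, []) for w in words]
    return ["".join(combo) for combo in product(*options)]

print(phrase_readings(["東京", "都"]))   # ['とうきょうと', 'とうきょうみやこ']
```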
  • 18. Mining Pronunciations via Hadoop
      • Corpora contain (phrase, pronunciation) pairs
      • In English they look like: Phrase (Pronunciation)
      • Distributed grep with the regular expression "\p{InCJKUnifiedIdeographs}+（\p{InHiragana}+）"
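A Python version of that grep step might look like the sketch below. The standard re module does not support Java's \p{In...} block names, so explicit code-point ranges stand in for them; matching on fullwidth or ASCII parentheses is an assumption about the surface pattern.

```python
# grep_pron.py: extract "phrase（pronunciation）" pairs from raw text lines.
# CJK Unified Ideographs: U+4E00-U+9FFF; Hiragana: U+3041-U+3096.
import re
import sys

PAIR = re.compile(r"([\u4E00-\u9FFF]+)[（(]([\u3041-\u3096]+)[）)]")

for line in sys.stdin:
    for phrase, pron in PAIR.findall(line):
        print(f"{phrase}\t{pron}")
```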
  • 19. Character Alignment Task
      • Character alignment for noise reduction
      • Input: pairs of word and pronunciation
      • Output: aligned pairs, e.g. iPhone → i|Ph|o|n|e
      • We can use an HMM and the EM algorithm
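One simple way to realize this is hard (Viterbi) EM over monotone character alignments rather than a full forward-backward HMM; MAX_SPAN, the smoothing constant, and the toy training pairs are assumptions for illustration.

```python
# align_em.py: a minimal hard-EM sketch for monotone character alignment of
# (word, pronunciation) pairs: re-estimate P(kana segment | character) from
# the current best alignments.
import math
from collections import defaultdict

MAX_SPAN = 4      # max pronunciation characters aligned to one word character
SMOOTH = 1e-6     # probability for unseen (character, segment) pairs

def best_alignment(word, pron, prob):
    """DP over monotone alignments; returns a list of (character, segment)."""
    n, m = len(word), len(pron)
    score = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n):
        for j in range(m + 1):
            if score[i][j] == -math.inf:
                continue
            for k in range(1, MAX_SPAN + 1):
                if j + k > m:
                    break
                seg = pron[j:j + k]
                s = score[i][j] + math.log(prob.get((word[i], seg), SMOOTH))
                if s > score[i + 1][j + k]:
                    score[i + 1][j + k] = s
                    back[i + 1][j + k] = (j, seg)
    if n > 0 and back[n][m] is None:
        return None               # no alignment possible within MAX_SPAN
    pairs, i, j = [], n, m
    while i > 0:
        j_prev, seg = back[i][j]
        pairs.append((word[i - 1], seg))
        i, j = i - 1, j_prev
    return list(reversed(pairs))

def train(pairs, iterations=5):
    prob = {}                     # start from the uniform (smoothed) model
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for word, pron in pairs:
            alignment = best_alignment(word, pron, prob)
            if alignment is None:
                continue
            for char, seg in alignment:
                counts[(char, seg)] += 1.0
                totals[char] += 1.0
        prob = {(c, s): v / totals[c] for (c, s), v in counts.items()}
    return prob

pairs = [("日本", "にほん"), ("日本語", "にほんご"), ("東京", "とうきょう")]
model = train(pairs)
```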
  • 20. Data Compression
  • 21. Why Compression?
      • Input methods should save memory for other apps
      • Typically 50 MB for PC and 1-2 MB for mobile
      • Compress the data as much as possible!
      • Solution: succinct data structures
  • 22. LOUDS: Succinct Trie
      • Use unary codes to represent the tree compactly
      • [Figure: a 9-node trie (nodes a-i) encoded as the bit string 10 11110 0 110 0 10 0 0 10 0]
      • Size = #nodes × 2 + 1 = 19 bits
      • Requires an auxiliary index besides the bit string
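A minimal sketch of building the LOUDS bit string by breadth-first traversal of a plain nested-dict trie; the rank/select indexes needed to navigate it are omitted, and the toy trie is an assumption.

```python
# louds.py: build the LOUDS bit string of a trie by level-order traversal.
# Each node contributes <degree> 1-bits followed by a 0-bit, preceded by the
# conventional "10" super-root, so the total length is 2 * #nodes + 1 bits,
# as on the slide.
from collections import deque

def louds_bits(root):
    """root: dict mapping edge label -> child dict (a nested-dict trie)."""
    bits = ["1", "0"]                 # super-root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = sorted(node.items())
        bits.append("1" * len(children) + "0")
        for _, child in children:
            queue.append(child)
    return "".join(bits)

# a toy trie with 4 nodes: root -> a, b; a -> c
trie = {"a": {"c": {}}, "b": {}}
print(louds_bits(trie))   # "10" + "110" + "10" + "0" + "0" = 101101000 (2*4+1 = 9 bits)
```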
  • 23. MARISA: Nested Patricia Trie [Yata+ 11]
      • Merge non-branching nodes in the tree: normal trie → Patricia trie
      • Apply recursively
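A sketch of the basic Patricia step: chains of single-child nodes are merged into one edge labelled with the concatenated characters. MARISA's nesting of the merged labels into further tries is not shown; the nested-dict representation and the "$" end-of-word marker are assumptions.

```python
# patricia.py: merge chains of single-child nodes into one labelled edge.
def to_patricia(node):
    out = {}
    for ch, child in node.items():
        if ch == "$":                       # end-of-word marker
            out["$"] = True
            continue
        label, cur = ch, child
        # follow single-child chains that carry no end-of-word marker
        while len(cur) == 1 and "$" not in cur:
            (nxt, cur), = cur.items()
            label += nxt
        out[label] = to_patricia(cur)
    return out

# toy trie for "car" and "card"
trie = {"c": {"a": {"r": {"$": True, "d": {"$": True}}}}}
print(to_patricia(trie))   # {'car': {'$': True, 'd': {'$': True}}}
```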
  • 24. Other Functions
  • 25. Predictive Conversion
      • Motivation: we want to save keystrokes
      • Approach: show the most probable completion once the user has typed the first few characters
  • 26. Predictive Conversion
      • Accuracy and length are a trade-off, e.g. "Good" vs. "Good morning"
      • Phrase extraction is needed
          – Eliminate candidates like "(you very much)": a sub-sequence of a longer phrase
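A toy sketch of prediction as prefix lookup over candidate phrases with probabilities. A real IME would serve this from a trie over readings; the candidate table and its probabilities below are made up.

```python
# predict.py: predictive conversion as prefix lookup, most probable first.
def predict(prefix, candidates, top_k=3):
    """Return the top_k most probable completions of `prefix`."""
    matches = [(p, phrase) for phrase, p in candidates.items()
               if phrase.startswith(prefix)]
    return [phrase for p, phrase in sorted(matches, reverse=True)[:top_k]]

candidates = {"good": 0.30, "good morning": 0.25, "good night": 0.15,
              "google": 0.10}       # toy probabilities (assumed)
print(predict("goo", candidates))   # ['good', 'good morning', 'good night']
```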
  • 27. Phrase Extraction for Prediction [Okuno+ 2011]
      • A paper on phrase extraction is to appear
      • Digest: fast and accurate phrase extraction
  • 28. Spelling Correction
      • Correct the user's typos
      • Search: trie traversal for fuzzy matching
      • Model: edit distance as the error model
      • Edit operations: insert, delete, and replace
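A sketch of fuzzy search over a nested-dict trie, carrying one edit-distance DP row per node and pruning branches that already exceed the threshold; max_dist and the toy vocabulary are assumptions.

```python
# fuzzy_trie.py: fuzzy match over a trie with insert / delete / replace edits.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                              # end-of-word marker
    return root

def fuzzy_search(trie, query, max_dist=1):
    results = []
    first_row = list(range(len(query) + 1))           # row for the empty prefix

    def walk(node, prefix, prev_row):
        if node.get("$") and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for i, qc in enumerate(query, start=1):
                row.append(min(row[i - 1] + 1,                 # insertion
                               prev_row[i] + 1,                # deletion
                               prev_row[i - 1] + (qc != ch)))  # replace/match
            if min(row) <= max_dist:                  # prune hopeless branches
                walk(child, prefix + ch, row)

    walk(trie, "", first_row)
    return sorted(results, key=lambda r: r[1])

trie = build_trie(["apple", "apply", "ample"])
print(fuzzy_search(trie, "aplle", max_dist=1))        # [('apple', 1)]
```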
  • 29. Conclusion
  • 30. Conclusion
      • Various technologies are needed
          – Statistical language models and their training
          – Morphological analysis and pronunciation inference
          – Data compression and retrieval
          – Predictive conversion and spelling correction