Deep Technologies about Kana Kanji Conversion
Presentation Transcript

• Deep Technologies in Kana Kanji Conversion (Yoh Okuno)
• Components of a Converter: Model, Training, Storage, Interface, etc.
  [Diagram: Corpora → Train (batch) → Model; the Converter looks up the Model to turn the user's kana input into kanji output]
• Deep Technologies
  – Various Language Models
  – Training LMs using the Web and Hadoop
  – Automatic Pronunciation Inference
  – Data Compression
  – Predictive Conversion and Spelling Correction
• Various Language Models
• Various Language Models
  – Word N-gram: accurate but too large!
  – Class N-gram: small but inaccurate
  – Combination: a good trade-off is needed
• Language Models
  – Word N-gram:
      P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})
  – Class Bigram:
      P(y) = \prod_i P(y_i \mid c_i) \, P(c_i \mid c_{i-1})
  – Phrase-based Model (class-based sub-model over positions i \in I_C, word-based sub-model over i \in I_W):
      P(y) = \prod_{i \in I_C} P(y_i \mid c_i) \, P(c_i \mid c_{i-1}) \prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i) \, P(c_i \mid c_{i-1})
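To make the factorizations concrete, here is a minimal sketch of scoring a sentence under the class bigram model; the words, classes, and probabilities are illustrative stand-ins, not values from the presentation:

    import math

    # Toy tables for P(word | class) and P(class | previous class);
    # the numbers and class names are illustrative only.
    emission = {("kyou", "N"): 0.01, ("wa", "P"): 0.30, ("hareru", "V"): 0.005}
    transition = {("BOS", "N"): 0.20, ("N", "P"): 0.40, ("P", "V"): 0.10}

    def class_bigram_logprob(words, classes):
        # log P(y) = sum_i [ log P(y_i | c_i) + log P(c_i | c_{i-1}) ]
        logp, prev = 0.0, "BOS"
        for w, c in zip(words, classes):
            logp += math.log(emission[(w, c)]) + math.log(transition[(prev, c)])
            prev = c
        return logp

    print(class_bigram_logprob(["kyou", "wa", "hareru"], ["N", "P", "V"]))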
• Phrase-based Model
  – Replace a partial class bigram with a word N-gram
  – Intermediate classes are marginalized out
  [Diagram: a lattice of classes over words; phrase probability P(w1, w2, w3, c3 | c1); only the left-side class is conditioned on!]
• Training Large-Scale Language Models
• Issues in Training
  – How to collect large corpora? Crawl, crawl, crawl! A morphological analyzer is needed.
  – How to store and process them? Hadoop MapReduce helps us; how do we speed up N-gram counting?
• Crawling the Web
  – Raw HTML can be collected from the Web
  – Statistics carry no copyright
  – Required components: web crawler, body text extraction, (spam filter), morphological analyzer
  – Make use of the cloud
• Japanese Morphological Analyzer
  – Input: raw text
  – Output: segmented words, part-of-speech, ...
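For reference, a typical call into MeCab, a widely used Japanese morphological analyzer (the presentation does not name a specific tool; this assumes the mecab-python3 bindings and an installed dictionary such as IPADIC):

    import MeCab  # assumes the mecab-python3 package and a dictionary are installed

    tagger = MeCab.Tagger()
    # parse() returns one "surface<TAB>features" line per token; the features
    # include the part-of-speech and, with IPADIC, the reading.
    print(tagger.parse("今日は晴れです"))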
• MapReduce for Language Models
  – Distributed computation of N-gram statistics
  – Mapper: extract N-grams from the corpora
  – Reducer: aggregate N-gram counts
  [Diagram: Corpora → Mappers → Reducers → N-grams]
• MapReduce: Pseudo Code
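The code on this slide did not survive extraction; what follows is a minimal Hadoop-Streaming-style sketch of N-gram counting, assuming the corpus is already word-segmented, not the presentation's original pseudocode:

    from itertools import groupby

    N = 3  # count unigrams up to trigrams

    def mapper(lines):
        # Emit (ngram, 1) for every n-gram, n = 1..N, in each pre-segmented line.
        for line in lines:
            words = line.split()
            for n in range(1, N + 1):
                for i in range(len(words) - n + 1):
                    yield " ".join(words[i:i + n]), 1

    def reducer(pairs):
        # Sum the counts for each n-gram; Hadoop delivers pairs sorted by key,
        # so one groupby pass is enough.
        for ngram, group in groupby(pairs, key=lambda kv: kv[0]):
            yield ngram, sum(count for _, count in group)

In Hadoop Streaming the mapper and reducer would read stdin and write tab-separated lines; the generator form keeps the sketch framework-agnostic.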
• Speeding up N-gram Counting
  – Use a binary representation for N-grams: variable-length word IDs are efficient
  – Use in-mapper combining (Jimmy Lin): combining in memory is more efficient
  – Use the stripes pattern (Jimmy Lin): group N-grams by their first word
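A sketch of the last two optimizations combined, in the spirit of Jimmy Lin's design patterns (illustrative, not the slide's code): the mapper buffers bigram counts in memory (in-mapper combining) and emits one stripe, an associative array keyed by the first word:

    from collections import defaultdict

    def stripes_mapper(lines):
        # In-mapper combining: accumulate counts in memory and emit them
        # only once per first word, as a stripe {next word: count}.
        stripes = defaultdict(lambda: defaultdict(int))
        for line in lines:
            words = line.split()
            for w1, w2 in zip(words, words[1:]):
                stripes[w1][w2] += 1
        for w1, counts in stripes.items():
            yield w1, dict(counts)

    def stripes_reducer(key, stripes_for_key):
        # Element-wise sum of all stripes emitted for one first word.
        total = defaultdict(int)
        for stripe in stripes_for_key:
            for w2, count in stripe.items():
                total[w2] += count
        return key, dict(total)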
• Performance-Size Trade-off [Okuno+ 2011]
  [Figure: cross entropy (bits) vs. model size (bytes) at different count thresholds, with operating points for mobile, PC, and cloud]
• Automatic Pronunciation Inference
• Pronunciation Inference
  – A Japanese word has 1-3 pronunciations
  – How do we pronounce sentences or phrases?
  – Basic approaches: word-based (combine word pronunciations) or character-based (combine character pronunciations)
• Mining Pronunciations via Hadoop
  – Corpora contain (phrase, pronunciation) pairs, written as Phrase（Pronunciation）
  – Distributed grep with the regular expression:
    \p{InCJKUnifiedIdeographs}+（\p{InHiragana}+）
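A sketch of that extraction using Python's third-party regex module; its \p{Han} and \p{Hiragana} script properties stand in for the Java-style block names on the slide:

    import regex  # third-party module with Unicode property support

    # \p{Han} / \p{Hiragana} play the role of the slide's
    # \p{InCJKUnifiedIdeographs} / \p{InHiragana}.
    PAIR = regex.compile(r"(\p{Han}+)（(\p{Hiragana}+)）")

    def extract_pairs(text):
        # Return (phrase, pronunciation) pairs such as 漢字（かんじ）.
        return PAIR.findall(text)

    print(extract_pairs("かな漢字（かんじ）変換（へんかん）"))
    # [('漢字', 'かんじ'), ('変換', 'へんかん')]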
• Character Alignment Task
  – Character alignment for noise reduction
  – Input: pairs of word and pronunciation
  – Output: aligned pairs, e.g. iPhone segmented as i|Ph|o|n|e and aligned group by group to its kana reading
  – We can use an HMM and the EM algorithm (see the sketch below)
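A minimal EM sketch for monotone character alignment, where each character of the word emits a short, non-empty span of the pronunciation; this is one simple instance of the HMM/EM approach, not necessarily the presentation's exact model:

    from collections import defaultdict

    def em_align(pairs, max_span=4, iters=10):
        # Learn P(span | character) by EM over all monotone segmentations.
        prob = defaultdict(lambda: 1.0)  # unnormalized uniform start
        for _ in range(iters):
            counts, totals = defaultdict(float), defaultdict(float)
            for word, pron in pairs:
                n, m = len(word), len(pron)
                # fwd[i][j]: total weight of aligning word[:i] to pron[:j]
                fwd = [[0.0] * (m + 1) for _ in range(n + 1)]
                fwd[0][0] = 1.0
                for i in range(1, n + 1):
                    for j in range(1, m + 1):
                        for k in range(max(0, j - max_span), j):
                            fwd[i][j] += fwd[i - 1][k] * prob[(word[i - 1], pron[k:j])]
                # bwd[i][j]: total weight of aligning word[i:] to pron[j:]
                bwd = [[0.0] * (m + 1) for _ in range(n + 1)]
                bwd[n][m] = 1.0
                for i in range(n - 1, -1, -1):
                    for j in range(m - 1, -1, -1):
                        for k in range(j + 1, min(m, j + max_span) + 1):
                            bwd[i][j] += prob[(word[i], pron[j:k])] * bwd[i + 1][k]
                z = fwd[n][m]
                if z == 0:
                    continue
                # E-step: expected count of each (character, span) edge
                for i in range(1, n + 1):
                    for j in range(1, m + 1):
                        for k in range(max(0, j - max_span), j):
                            e = fwd[i - 1][k] * prob[(word[i - 1], pron[k:j])] * bwd[i][j] / z
                            counts[(word[i - 1], pron[k:j])] += e
                            totals[word[i - 1]] += e
            # M-step: renormalize per character
            prob = defaultdict(float, {ck: c / totals[ck[0]] for ck, c in counts.items()})
        return prob

    pairs = [("東京", "とうきょう"), ("京都", "きょうと")]
    model = em_align(pairs)
    # The shared character typically converges so that 京 aligns to きょう.
    print(max((k for k in model if k[0] == "京"), key=model.get))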
• Data Compression
• Why Compression?
  – IMs should save memory for other apps
  – Typically 50 MB for PC and 1-2 MB for mobile
  – Compress the data as much as possible!
  – Solution: succinct data structures
• LOUDS: Succinct Trie
  – Uses a unary code to represent a tree compactly
  – Example: nine nodes a-i encoded as 10 11110 0 110 0 10 0 0 10 0
  – Size = #nodes × 2 + 1 = 19 bits; an auxiliary index is required besides the bit string
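A sketch that reproduces the slide's 19-bit example: write the super-root "10", then for each node in breadth-first order write one "1" per child followed by a terminating "0". The tree below is the one the slide's bit string encodes, with labels assigned in BFS order:

    from collections import deque

    def louds_bits(root, children):
        # Super-root "10", then "1" * degree + "0" per node in BFS order.
        bits, queue = ["10"], deque([root])
        while queue:
            node = queue.popleft()
            kids = children.get(node, [])
            bits.append("1" * len(kids) + "0")
            queue.extend(kids)
        return "".join(bits)

    tree = {"a": ["b", "c", "d", "e"], "c": ["f", "g"], "e": ["h"], "h": ["i"]}
    encoded = louds_bits("a", tree)
    print(encoded, len(encoded))  # 1011110011001000100 19 (= 2 * 9 nodes + 1)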
• MARISA: Nested Patricia Trie [Yata+ 11]
  – Merge chains of no-branch nodes in the tree (normal trie → patricia trie)
  – Apply recursively (nesting)
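A sketch of the patricia step only, merging chains of single-child nodes into one labeled edge; MARISA's recursive nesting of the merged edge labels into further tries is not shown:

    def compress(trie):
        # Merge chains of single-child, non-terminal nodes into one labeled
        # edge; "$" marks the end of a key in the nested-dict trie.
        out = {}
        for label, child in trie.items():
            if label == "$":
                out["$"] = child
                continue
            while len(child) == 1 and "$" not in child:
                step, child = next(iter(child.items()))
                label += step
            out[label] = compress(child)
        return out

    trie = {"c": {"a": {"r": {"$": True}, "t": {"$": True}}}}
    print(compress(trie))  # {'ca': {'r': {'$': True}, 't': {'$': True}}}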
• Other Functions
• Predictive Conversion
  – Motivation: we want to save keystrokes
  – Approach: show the most probable completion once the user has typed the first few characters
• Predictive Conversion
  – Accuracy and length are a trade-off: "Good" vs. "Good morning"
  – Phrase extraction is needed: eliminate candidates like "you very much", which is a sub-sequence of a phrase ("thank you very much")
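A toy sketch of the accuracy/length filter for completions; the phrase table and threshold are purely illustrative:

    # Toy phrase table mapping a typed prefix to (completion, probability).
    phrases = {"good": [("good morning", 0.40), ("good night", 0.30), ("good", 0.20)]}

    def predict(prefix, min_prob=0.25):
        # Keep only completions probable enough to justify their extra length.
        return [(c, p) for c, p in phrases.get(prefix, []) if p >= min_prob]

    print(predict("good"))  # [('good morning', 0.4), ('good night', 0.3)]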
• Phrase Extraction for Prediction [Okuno+ 2011]
  – A paper on phrase extraction is to appear
  – Digest: fast and accurate phrase extraction
• Spelling Correction
  – Correct the user's mistypes
  – Search: a trie for fuzzy matching
  – Model: edit distance as the error model
  – Edit operations: insert, delete, and replace
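A sketch of fuzzy matching over a nested-dict trie with a bounded edit distance: one Levenshtein DP row is computed per trie node, and branches whose entire row already exceeds the bound are pruned:

    def fuzzy_search(trie, query, max_dist):
        # Find words in the trie within max_dist edits of query;
        # "$" marks the end of a word in the nested-dict trie.
        results = []
        first_row = list(range(len(query) + 1))

        def walk(node, char, prev_row, word):
            row = [prev_row[0] + 1]  # distance to the empty query prefix
            for i in range(1, len(query) + 1):
                row.append(min(row[i - 1] + 1,                            # insertion
                               prev_row[i] + 1,                           # deletion
                               prev_row[i - 1] + (query[i - 1] != char))) # substitution
            if "$" in node and row[-1] <= max_dist:
                results.append((word, row[-1]))
            if min(row) <= max_dist:  # prune branches that cannot recover
                for c, child in node.items():
                    if c != "$":
                        walk(child, c, row, word + c)

        for c, child in trie.items():
            if c != "$":
                walk(child, c, first_row, c)
        return results

    trie = {"c": {"a": {"t": {"$": True}, "r": {"$": True}}}}
    print(fuzzy_search(trie, "cst", 1))  # [('cat', 1)]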
• Conclusion
• Conclusion: various technologies are needed
  – Statistical language models and their training
  – Morphological analysis and pronunciation inference
  – Data compression and retrieval
  – Predictive conversion and spelling correction