Cleaning Social IME Dictionary

Published in: Technology, Education

  1. Cleaning Social IME Dictionary
     Yoh Okuno
     #IME2011
  2. About the presenter
     • Name: Yoh Okuno
     • Software Engineer at Yahoo! Japan
     • Interests: NLP, Machine Learning, Data Mining
     • Skills: C/C++, Python, Hadoop, and English
     • Website: http://yoh.okuno.name/
  3. What is Social IME?
     • The most popular “cloud-based” Japanese input method (230k unique users per month)
     • http://www.social-ime.com/
  4. Shared Dictionary of Social IME
     • Shared with all users
     • Noisy & crazy: needs cleaning!
  5. Character alignment
     • Align pairs of Kana and Kanji characters monotonically and detect failures of alignment
     • Techniques from statistical machine translation
     • Used m2m-aligner because of its functions: http://code.google.com/p/m2m-aligner/
     [Figure: character-by-character alignment examples, e.g. iPhone segmented as i|Ph|o|n|e|]
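
To make "monotonic alignment" and "alignment failure" concrete, here is a minimal sketch. It is not m2m-aligner, which learns many-to-many mappings from data; it assumes a tiny hand-written table of allowed readings (`READINGS`, hypothetical) and checks whether a Kana reading can be consumed left to right by the characters of the surface form. The function name `align` is also illustrative.

```python
from functools import lru_cache

# Hypothetical reading table: surface character -> allowed Kana readings.
# A real aligner learns these mappings from training data instead.
READINGS = {
    "東": {"とう", "ひがし"},
    "京": {"きょう", "けい"},
    "都": {"と", "みやこ"},
}


def align(surface, reading):
    """Return a monotonic alignment as (char, kana) pairs, or None on failure."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # surface[i:] must consume exactly reading[j:].
        if i == len(surface):
            return () if j == len(reading) else None
        ch = surface[i]
        # Kana that appear in the surface form align to themselves.
        candidates = READINGS.get(ch, set()) | {ch}
        for kana in sorted(candidates):
            if reading.startswith(kana, j):
                rest = solve(i + 1, j + len(kana))
                if rest is not None:
                    return ((ch, kana),) + rest
        return None

    result = solve(0, 0)
    return None if result is None else list(result)


print(align("東京都", "とうきょうと"))  # a list of aligned pairs
print(align("東京都", "おおさか"))      # None: alignment failure
```

A pair for which no alignment exists is a candidate noisy entry, which is the signal used for cleaning in this talk.
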
  6. Training m2m-aligner
     • Train 3 datasets
       – Mozc’s dictionary (1.5M words)
       – unidic (230k words)
       – alt-cannadic (400k words), most suitable
     • Just run 2 commands
  7. Trained results
     • Three files are generated: Alignment, Error, and Model
  8. Applying m2m-aligner
     • Apply to 5 datasets
       – Social IME shared dictionary (93k words)
       – Mined from Wikipedia (169k words)
       – Crawled MS-IME dictionary (18k words)
       – Manually corrected MS-IME dictionary (92k words)
       – Hatena keyword (315k words)
  9. Mining words from Wikipedia
     • grep like “[ - ]+ [ - ]+ ”
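
The character ranges of the pattern on this slide did not survive extraction, so the original expression cannot be recovered. As a sketch only, assume the goal is to find a Kanji word immediately followed by its Hiragana reading in parentheses, as often appears in the opening sentence of a Japanese Wikipedia article. The Unicode ranges, the `PAIR` pattern, and `mine_pairs` are all illustrative assumptions.

```python
import re

# Assumed pattern: one or more Kanji, then a Hiragana reading inside
# full-width or ASCII parentheses. Not the pattern from the slide.
PAIR = re.compile(r"([\u4e00-\u9fff]+)[（(]([\u3041-\u3096ー]+)[）)]")


def mine_pairs(text):
    """Return (reading, surface) pairs found in text."""
    return [(m.group(2), m.group(1)) for m in PAIR.finditer(text)]


sample = "東京都（とうきょうと）は、日本の首都である。"
print(mine_pairs(sample))
```

Pairs mined this way are noisy, which is why they are passed through the aligner afterwards.
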
  10. Crawling MS-IME user dictionary
  11. Hatena keyword
  12. Applied results
      • Run:
      • Results:

        Dataset  Social IME  Wikipedia  MS-IME  MS-IME2  hatena
        Size     93k         169k       18k     97k      314k
        Align    48k         137k       16k     86k      235k
        Error    45k         32k        2k      10k      78k
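
The share of entries that failed alignment is easier to compare across datasets than the raw counts. The short sketch below computes it from the rounded figures on this slide (in thousands of words); the dictionary layout and the `error_rate` helper are illustrative.

```python
# Counts in thousands of words, copied from the results on this slide.
results = {
    "Social IME": {"size": 93, "align": 48, "error": 45},
    "Wikipedia": {"size": 169, "align": 137, "error": 32},
    "MS-IME": {"size": 18, "align": 16, "error": 2},
    "MS-IME2": {"size": 97, "align": 86, "error": 10},
    "hatena": {"size": 314, "align": 235, "error": 78},
}


def error_rate(row):
    """Fraction of entries that failed alignment."""
    return row["error"] / row["size"]


# Print datasets from the highest to the lowest failure rate.
for name, row in sorted(results.items(), key=lambda kv: error_rate(kv[1]), reverse=True):
    print(f"{name:<10} {error_rate(row):.0%}")
```

With these figures, the Social IME shared dictionary has the highest failure rate by a wide margin: nearly half of its entries fail to align.
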
  13. Alignment examples
      • Not perfect, but precise enough for practical use
      • From Social IME:
      • From Wikipedia:
      • “ ” should be combined with the previous character
  14. Error examples (from Social IME)
      • Error analysis is the most interesting part!
      • Emoticons ( ):
      • Abbreviations:
      • Personal information:
  15. Error examples (from Hatena)
      • Length limit (16 chars):
      • Semantic translation:
      • Chinese / Korean / old Japanese words:
  16. 12/29 Released!!
  17. Conclusion
      • Described how to clean the Social IME, Wikipedia, and MS-IME dictionaries using m2m-aligner
      • Released the cleaned dictionary today!
      • Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information, and so on
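
The classification named as future work could start from simple rules before any learned model is tried. The sketch below is purely hypothetical: the thresholds, the regular expressions, and the `classify` function are assumptions made for illustration, not a method from this talk.

```python
import re
import unicodedata

# Illustrative patterns for personal information (assumptions).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\d{2,4}-\d{2,4}-\d{3,4}")


def classify(reading, surface):
    """Assign a coarse, heuristic label to a pair that failed alignment."""
    if EMAIL.search(surface) or PHONE.search(surface):
        return "personal information"
    # Count punctuation and symbol characters (Unicode categories P* and S*).
    symbols = sum(unicodedata.category(c)[0] in "PS" for c in surface)
    if surface and symbols / len(surface) >= 0.5:
        return "emoticon"
    # A short reading that expands to a much longer surface form.
    if len(surface) > 2 * len(reading):
        return "abbreviation"
    return "other"


print(classify("かお", "(^_^)"))
print(classify("めあど", "taro@example.com"))
print(classify("おつ", "お疲れ様でした"))
```

Rules like these would only give a rough first pass; their output could serve as seed labels for a proper classifier.
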
  18. Any questions?
