Modeless Japanese Input Method

Yukino Ikegami, Setsuo Tsuruta.
Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary.
Multimedia Tools and Applications, Volume 74, Issue 11, pp. 3933–3946, 2015.

Published in: Engineering

  1. Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary
     Yukino Ikegami, Setsuo Tsuruta
     2014/01/20
  2. Necessity of Japanese Input Method
     • Japanese has many characters
       – Kana
         • Hiragana: 81 characters, e.g. いろはにほへと
         • Katakana: 81 characters, e.g. イロハニホヘト
       – Kanji (Chinese characters)
         • More than 6,000 characters, e.g. 以呂波仁保反止
     • These characters cannot be typed directly on a keyboard
       → A Japanese input method (converting alphabetic input to Japanese characters) is necessary
  3. If every Japanese character were assigned to its own key…
     • Too many keys!
     • A Japanese input method is necessary
  4. Japanese Input Method: Roman-to-Kana-Kanji Converter
     • Flow
       1. Receive the romanized alphabetic input
       2. Convert the romanized input into Kana using a Roman-to-Kana table
       3. Convert Kana into Kanji (if necessary)
     • Example: ① n e k o d e s u → ② ねこです → ③ 猫です
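Step 2 of this flow can be illustrated with a short sketch. It assumes a tiny hypothetical table, `ROMAN_TO_KANA`, holding only the entries needed for the slide's examples, and a greedy longest-match lookup; the slides do not say which matching strategy a real input method uses.

```python
# Tiny illustrative subset of a Roman-to-Kana table.
# Assumption: a real table contains many more entries.
ROMAN_TO_KANA = {
    "a": "あ", "re": "れ", "ha": "は",
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
}


def roman_to_kana(text: str) -> str:
    """Convert romanized input to Kana by greedy longest match."""
    max_len = max(len(key) for key in ROMAN_TO_KANA)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate chunk first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in ROMAN_TO_KANA:
                out.append(ROMAN_TO_KANA[chunk])
                i += size
                break
        else:
            # No table entry matched: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(roman_to_kana("nekodesu"))  # ねこです
```

Step 3 (Kana to Kanji) needs a dictionary and a language model, so it is left out of this sketch.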
  5. Problems with Japanese Input Methods
     • Users need to switch input modes between Japanese and ASCII
       – e.g. to input ‘あれは8Byteです’ (That is 8Byte):
         areha [Return] [ASCII Mode] 8Byte [Japanese Mode] desu
         (two mode switches)
     • Switching is cumbersome!
  6. Adding Terms to the Dictionary to Address the Mode-Switching Problem
     • Add terms from other languages to the dictionary of a conventional input method editor
     • Shortcomings
       – New terms are created continuously
       – Homograph problem
  7. Related Work
     • Modeless Pinyin-Chinese Input [Chen et al. 2000]
       – Converts alphabetic input (Pinyin) to Chinese
       – Uses only word-surface features for classification
     • Type-Any [Ehara et al. 2009]
       – Converts alphabetic input to any language
       – Requires pressing a delimiter key when converting
       – Uses only word-surface features for classification
  8. Approach: Modeless Japanese Input Method
     • Automatically switch the input mode
       1. Generate a discriminative model with a Support Vector Machine (SVM)
          – The model describes multiple n-gram features
       2. Use the discriminative model to decide, for each segment of the alphabetic sequence, whether it is Kana or not
          – e.g. nekohacatdesu → nekoha / cat / desu → ねこはcatです
            (Japanese / English / Japanese)
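The segmentation in step 2 can be sketched as grouping consecutive characters that share a label. In this minimal illustration the per-character labels are hard-coded ("K" for Kana, "A" for ASCII) as a stand-in for the SVM output; the function name `segment` and the label letters are assumptions made for the example.

```python
from itertools import groupby


def segment(text: str, labels: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a label into segments."""
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((text[pos:pos + length], label))
        pos += length
    return segments


text = "nekohacatdesu"
# Stand-in for classifier output: one label per input character.
labels = "K" * 6 + "A" * 3 + "K" * 4

print(segment(text, labels))
# [('nekoha', 'K'), ('cat', 'A'), ('desu', 'K')]
```

Segments labeled "K" would then go through Kana conversion, while segments labeled "A" stay as typed.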
  9. Main flow of Modeless Japanese Input Method
     [Flowchart. Nodes: user input (alphabet sequence); loop over each character in the user input; decision "is the character still ASCII?" (True / False); Kana conversion; system response (Kana and alphabet sequence). Resources: Kana-conversion discriminative model; non-Japanese dictionary.]
  10. Flow of Generating Discriminative Model
      1. Load texts
         • 猫はcatです
      2. Kanji to Kana
         • Using a Japanese morphological analyzer (MeCab)
         • ネコハcatデス
      3. Kana to ASCII
         • Using a Kana-to-ASCII table (the one used by Google Japanese Input)
         • nekohacatdesu
      4. ASCII to n-gram
         • Character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
         • Character-type: LL, LL, LLL, LL, LLL, LL, LLL...
         • History: KK, KK, KKK, KK, KKK, KKK...
      5. n-gram to ID
         • 1, 3, 4, 13, 22...
      6. Describe as binary model
         • 1:1, 3:1, 4:1, 13:1, 32:1...
      7. Learning on SVM
         • 1.344, 0.691, 0.023, -1.398...
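The "n-gram to ID" and "describe as binary model" steps can be sketched as follows. This is an illustration under assumptions: the helper names `build_vocab` and `to_sparse_line` are hypothetical, IDs are assigned in order of first appearance, and the output line imitates the sparse `index:value` text format used by SVM tools such as LIBSVM. The slides do not state which tool or file format the authors used.

```python
def build_vocab(samples: list[list[str]]) -> dict[str, int]:
    """Assign each distinct n-gram feature a 1-based integer ID."""
    vocab: dict[str, int] = {}
    for features in samples:
        for feature in features:
            vocab.setdefault(feature, len(vocab) + 1)
    return vocab


def to_sparse_line(label: str, features: list[str], vocab: dict[str, int]) -> str:
    """Encode one sample as 'label id:1 id:1 ...' (binary features)."""
    ids = sorted({vocab[f] for f in features if f in vocab})
    return " ".join([label] + [f"{i}:1" for i in ids])


# Hypothetical feature lists for two focus points.
samples = [["-2/ha", "-1/a8"], ["-2/ha", "0/8B"]]
vocab = build_vocab(samples)

print(vocab)                                      # {'-2/ha': 1, '-1/a8': 2, '0/8B': 3}
print(to_sparse_line("+1", samples[1], vocab))    # +1 1:1 3:1
```

Lines in this form could then be passed to an SVM trainer, which learns one weight per feature ID.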
  11. n-gram Features
      • Example: あれは8Byte is typed as areha8Byte
        (n-gram upper limit n = 2, window size m = 2, focus point xi = the 2nd “a”)
      • Character-surface
        – Substrings before and after the focus point
        – e.g. -2/ha -1/a8 0/8B 1/By
      • Character-type
        – Upper-case (U), lower-case (L), number (N), and symbol (S)
        – e.g. -2/LL -1/LN 0/NU 1/UL
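The two feature families can be reproduced for the slide's example with a short sketch. The offset convention used here (the n-gram for offset k starts at position focus + k + 1) is inferred from the slide's example output and is an assumption, as are the function names. Only n-grams of length n = 2 are generated, matching the examples shown.

```python
def char_type(ch: str) -> str:
    """Map a character to U (upper), L (lower), N (number) or S (symbol)."""
    if ch.isdigit():
        return "N"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return "S"


def window_ngrams(seq: str, focus: int, n: int = 2, m: int = 2) -> list[str]:
    """Collect 'offset/ngram' features in a window around the focus index."""
    features = []
    for offset in range(-m, m):
        start = focus + offset + 1  # assumed convention, see the note above
        if start < 0 or start + n > len(seq):
            continue  # skip n-grams that fall outside the input
        features.append(f"{offset}/{seq[start:start + n]}")
    return features


text = "areha8Byte"
focus = 4  # 0-based index of the second "a"
types = "".join(char_type(c) for c in text)

print(window_ngrams(text, focus))   # ['-2/ha', '-1/a8', '0/8B', '1/By']
print(window_ngrams(types, focus))  # ['-2/LL', '-1/LN', '0/NU', '1/UL']
```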
  12. Generating Non-Japanese Dictionary
      • Words that never appear in Japanese-only text
        – Longer than 5 characters
        – Contain a substring that cannot be converted to Kana
      • Sources
        – Corpus of Contemporary American English (COCA)
        – Japanese Wikipedia article title list
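One possible reading of these selection criteria is sketched below. Several points are assumptions rather than details from the slides: the three conditions are treated as all required, `japanese_words` stands for romanized words seen in Japanese-only text, and `SYLLABLES` is a deliberately simplified set of romanized syllables used to test whether a word can be converted to Kana in full.

```python
VOWELS = "aiueo"
CONSONANTS = "kstnhmyrwgzdbp"
# Simplified romanized syllable inventory (assumption, not a complete table).
SYLLABLES = (
    set(VOWELS)
    | {c + v for c in CONSONANTS for v in VOWELS}
    | {"n", "shi", "chi", "tsu", "fu", "ji"}
)


def fully_convertible(word: str) -> bool:
    """Return True if the whole word splits into known syllables."""
    word = word.lower()
    ok = [True] + [False] * len(word)  # ok[i]: the prefix word[:i] is convertible
    for end in range(1, len(word) + 1):
        for size in (1, 2, 3):
            start = end - size
            if start >= 0 and ok[start] and word[start:end] in SYLLABLES:
                ok[end] = True
                break
    return ok[len(word)]


def build_non_japanese_dictionary(candidates, japanese_words):
    """Keep long words unseen in Japanese text that resist Kana conversion."""
    seen = {w.lower() for w in japanese_words}
    return {
        w for w in candidates
        if len(w) > 5 and w.lower() not in seen and not fully_convertible(w)
    }


words = ["keyboard", "banana", "cat", "twitter", "sakana"]
print(sorted(build_non_japanese_dictionary(words, {"sakana"})))
# ['keyboard', 'twitter']
```

Here "banana" is dropped because it converts cleanly to Kana (ba-na-na), and "cat" is dropped because it is too short.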
  13. Comparison with Conventional IME
      • Conventional method
        – areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu (two mode switches)
        – Keystrokes: 17
      • Modeless Japanese input method
        – areha8Bytedesu
        – Keystrokes: 14
      • The number of keystrokes is reduced
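The two keystroke totals can be checked with a few lines of arithmetic. The sketch assumes that each bracketed item (Return or a mode switch) costs one key press and that every typed character costs one key press, which is the counting that reproduces the slide's numbers.

```python
def count_keys(steps: list[str]) -> int:
    """Count key presses: bracketed items are one key, text is one key per character."""
    return sum(1 if step.startswith("[") else len(step) for step in steps)


conventional = ["areha", "[Return]", "[Alphabet Mode]", "8Byte", "[Japanese Mode]", "desu"]
modeless = ["areha8Bytedesu"]

print(count_keys(conventional))  # 17
print(count_keys(modeless))      # 14
```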
  14. Datasets Used in the Evaluation Experiment
      • Generating the model and evaluating the method
        – Balanced Corpus of Contemporary Written Japanese (BCCWJ)
          • Books, magazines, blogs, government documents, and others
      • Non-Japanese dictionary sources
        – COCA
        – Japanese Wikipedia article title list
  15. Criteria
  16. Results of Evaluation
      • The proposed method outperforms the baseline
        – Baseline: character-surface n-gram
        – Proposed method: character-{surface, type} n-gram and dictionary

      Metric            Baseline  Proposed
      Kana precision    .998      .999
      ASCII precision   .989      .996
      Kana recall       .993      .998
      ASCII recall      .780      .884
      Kana F1-measure   .953      .968
      ASCII F1-measure  .858      .924
  17. User test
      • The proposed method outperforms the conventional method

      Person No.        1      2      3      4      5      6      7      8      9      …
      Conventional IME  18.18  17.89  15.4   12.71  11.09  10.18  11.42  12.38  10.48  …
      Proposed method   13.34  14.68  9.88   12.23  6.03   7.00   11.03  11.37  10.30  …

      • 4 females and 7 males
      • Input: example sentences (chat, mail, technical text)
  18. Summary
      • Switching input modes is cumbersome
      • Hybrid modeless Japanese input method
        – Automatically switches the input mode between Japanese and ASCII
        – Uses an n-gram feature model for discrimination
          • character-{surface, type}
        – Outperforms conventional methods
