Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners @IJCNLP2011

1. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners
   Tomoya Mizumoto†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto†
   † Nara Institute of Science and Technology  ‡ NTT Communication Science Laboratories
   IJCNLP, 2011.11.09
2. Background
   • The number of Japanese language learners has increased: 3.65 million people in 133 countries and regions.
   • There are only 50,000 Japanese language teachers overseas.
   • Hence there is high demand for good instructors for writers of Japanese as a Second Language (JSL).
3. Recent error correction for language learners
   • NLP research has begun to pay attention to second language learning.
   • Most previous research deals with restricted types of learners' errors.
     - E.g., research on JSL learners' errors mainly focuses on Japanese case particles.
   • Real JSL learners' writing contains various errors:
     - Spelling errors
     - Collocation errors
4. Error correction using SMT [Brockett et al., 2006]
   • Proposed correcting ESL learners' errors using statistical machine translation (SMT).
   • Advantage: it does not require expert knowledge; a correction model is learned from learners' and corrected corpora.
   • Problems:
     - It is not easy to acquire large-scale learners' corpora.
     - Japanese sentences are not segmented into words, and JSL learners' sentences are hard to tokenize.
5. Purpose of our study
   1. Solve the knowledge-acquisition bottleneck: create a large-scale learners' corpus from the error revision logs of a language learning SNS.
   2. Solve the problem of word segmentation errors caused by erroneous input, using SMT techniques with the extracted learners' corpus.
6. SNS sites that help language learners
   • smart.fm: helps learners practice language learning.
   • Livemocha: offers grammar instruction, reading comprehension exercises, and practice.
   • Lang-8: a multilingual language learning and language exchange SNS. Soon after a learner writes a passage in the language being learned, native speakers of that language correct the errors in it.
7. Lang-8 data
   • Sentences of JSL learners: 925,588
   • Corrected sentences: 1,288,934
   • Example of a corrected sentence from Lang-8 (pair of learner's sentence and corrected sentence):
     Learner: ビデオゲームをやまシた。
     Correct: ビデオゲームをやりました。
   • Number of learners' sentences by language:

     Language            | English   | Japanese | Mandarin | Korean
     Number of sentences | 1,069,549 | 925,588  | 136,203  | 93,955
8. Types of correction
   • Correction by insertion, deletion, and substitution:
     Learner: ビデオゲームをやまシた。
     Correct: ビデオゲームをやりました。
   • Correction with a comment:
     Learner: 銭湯に行った。
     Correct: 銭湯に行った。 いつ行ったかがあるほうがいい ← comment ("it would be better to include when you went")
   • Some "corrected" sentences have only the word "GOOD" appended at the end:
     Learner: 銭湯に行った。
     Correct: 銭湯に行った。 GOOD
   • After removing comments, the number of sentence pairs drops from 1,288,934 to 849,894.
9. Comparison of Japanese learners' corpora

   Corpus                                | Data size
   --------------------------------------|------------------------------
   Our Lang-8 corpus                     | 849,894 sentences (448 MB)
   Teramura error data (1990)            | 4,601 sentences (420 KB)
   Ohso Database (1998)                  | 756 files (15 MB)
   JSL learners parallel database (2000) | 1,500 JSL learners' writings
10. Error correction using SMT

    SMT finds the target sentence $\hat{e}$ for a source sentence $f$:

    $\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e)$

    • e: target sentences; f: source sentences
    • P(e): probability of the language model
    • P(f|e): probability of the translation model
11. Error correction using SMT

    $\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e)$

    SMT:
    • e: target sentences; f: source sentences
    • P(e): language model, learned from a monolingual corpus of the language to be learned
    • P(f|e): translation model, learned from a sentence-aligned parallel corpus

    Error correction:
    • e: corrected sentences; f: Japanese learners' sentences
    • P(e): language model, learned from a monolingual corpus of the language to be learned
    • P(f|e): translation model, learned from the sentence-aligned learners' corpus
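For completeness, here is the one-step derivation of the decision rule above via Bayes' rule; since $P(f)$ does not depend on $e$, it drops out of the argmax:

```latex
\hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f)
        = \operatorname*{arg\,max}_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
        = \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e)
```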
12. Difficulty of handling the JSL learners' sentences
    • Word segmentation is usually performed as a pre-processing step.
    • JSL learners' sentences contain many errors and much hiragana (phonetic characters), so they are hard to tokenize with a traditional morphological analyzer.
13. Difficulty of handling the JSL learners' sentences
    • E.g.:
      Learner: でもじょずじゃりません
      Correct: でもじょうずじゃありません
      After tokenization:
      Learner: でも じ ょずじゃりません   (the analyzer fails on the erroneous input)
      Correct: でも じょうず じゃ ありません
14. Character-wise model
    • Sentences are segmented character by character, e.g.:
      でもじょずじゃりません → で も じ ょ ず じ ゃ り ま せ ん
    • Not affected by word segmentation errors; expected to be more robust.
      Learner: で も じ ょ   ず じ ゃ   り ま せ ん
      Correct: で も じ ょ う ず じ ゃ あ り ま せ ん
    A minimal sketch of this preprocessing follows.
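As an illustration, a minimal sketch of the character-wise preprocessing (the function name is ours, not from the paper): each sentence is rewritten with a space between every character so that a standard SMT toolkit treats characters as tokens.

```python
def to_char_tokens(sentence: str) -> str:
    # Remove any existing spaces, then put one space between every
    # character so an SMT toolkit treats each character as a "word".
    return " ".join(sentence.replace(" ", ""))

print(to_char_tokens("でもじょずじゃりません"))
# で も じ ょ ず じ ゃ り ま せ ん
```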
15. Experiment
    • Carried out experiments to see:
      1. the effect of corpus size
      2. the effect of the granularity of tokenization
16. Experimental setting
    • Methods:
      - Baseline: word-wise model (language model: 3-gram)
      - Proposed: character-wise model (language models: 3-gram and 5-gram)
    • Data:
      - Training: 849,894 sentences extracted from the revision logs of Lang-8
      - Test: 500 sentences, re-annotated to make a gold standard
17. Evaluation metrics
    • BLEU
      - [Park and Levy, 2011] adapted BLEU for automatic assessment of ESL errors; we follow their use of BLEU in the error correction task for JSL learners.
      - Since JSL learners' sentences are hard to tokenize with a morphological analyzer, we use character-based BLEU (see the sketch below).
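A minimal sketch of character-based BLEU, assuming a single reference per hypothesis and uniform n-gram weights (the function and its defaults are our illustration, not the paper's exact scorer):

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Multiset of character n-grams of a sentence.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def corpus_char_bleu(references, hypotheses, max_n=4):
    # Corpus-level, character-based BLEU with one reference per hypothesis.
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for ref, hyp in zip(references, hypotheses):
            ref_counts, hyp_counts = char_ngrams(ref, n), char_ngrams(hyp, n)
            # Clipped n-gram matches, as in standard BLEU.
            matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total += sum(hyp_counts.values())
        log_precisions.append(math.log(matched / total) if matched else float("-inf"))
    ref_len = sum(len(r) for r in references)
    hyp_len = sum(len(h) for h in hypotheses)
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

print(corpus_char_bleu(["でもじょうずじゃありません"], ["でもじょうずじゃありません"]))
# 100.0
```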
18. The larger the corpus, the higher the BLEU
    • Character-wise model: character 5-gram
    [Figure: BLEU as a function of TM training data size (0.1M, 0.15M, 0.3M, 0.85M sentences); y-axis 81.0–81.9]
    • The difference is not statistically significant.
19. Character-wise models are better than the word-wise model
    • TM training corpus: 0.3M sentences
    • The character 5-gram model achieves the best result.

      Model | Word 3-gram | Character 3-gram | Character 5-gram
      BLEU  | 80.72       | 81.63            | 81.81
20. Both the 0.1M and 0.3M models corrected
    Learner: またど もう ありがとう ("Thanks, Matadomou" — an out-of-vocabulary word)
    Correct: またど うも ありがとう (Thank you again)

    Learner: TRUTH わ 美しいです (TRUTH wa beautiful)
    Correct: TRUTH は 美しいです (TRUTH is beautiful)
21. The 0.3M model corrected
    Learner: 学生な るたら 学校に行ける (the learner made an error in the conjunctive form)
    Correct: 学生な ったら 学校に行ける (Becoming a student, I can go to school)
    0.1M model: 学生な るため 学校に行ける (I can go to school to be a student)
    0.3M model: 学生な ったら 学校に行ける (Becoming a student, I can go to school)
22. Conclusions
    • Made use of a large-scale corpus built from the revision logs of a language learning SNS.
    • Adopted SMT approaches to alleviate the problem of erroneous input from learners.
    • The character-wise model outperforms the word-wise model.
    • Future work: apply the method, using SMT techniques with the extracted learners' corpus, to error correction for learners of English as a second language.
23. Handling the comments
    • We conduct the following three pre-processing steps (see the sketch below):
      1. If the corrected sentence contains only "GOOD" or "OK", we do not include the pair in the corpus.
      2. If the edit distance between the learner's sentence and the corrected sentence is larger than 5, we simply drop the pair from the corpus.
      3. If the corrected sentence ends with "OK" or "GOOD", we remove the comment and retain the sentence pair.
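A minimal sketch of these three filters, assuming character-level Levenshtein distance and an English-only comment list (both are our assumptions; the slide does not spell out these details):

```python
def edit_distance(a: str, b: str) -> int:
    # Character-level Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

COMMENTS = ("GOOD", "OK")  # assumed comment markers

def clean_pair(learner: str, corrected: str, max_dist: int = 5):
    corrected = corrected.strip()
    # 1. Correction is only a comment -> drop the pair.
    if corrected.upper() in COMMENTS:
        return None
    # 2. Too different -> probably a rewrite or a comment; drop the pair.
    if edit_distance(learner, corrected) > max_dist:
        return None
    # 3. Trailing comment -> strip it and keep the pair.
    for marker in COMMENTS:
        if corrected.upper().endswith(marker):
            corrected = corrected[:-len(marker)].rstrip()
    return learner, corrected
```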
24. Future work
    • Apply the method, using SMT techniques with the extracted learners' corpus, to error correction of English as a second language.
    • Apply factored language and translation models incorporating the POS information of the words on the target side, while learners' input is processed by a character-wise model.
25. Approaches for correcting unrestricted errors
    • EM-based unsupervised approach to whole-sentence grammar correction [Park and Levy, 2011]
      - Types of error must be pre-determined, which requires expert knowledge of L2 teaching.
    • Error correction using SMT [Brockett et al., 2006]
      - Advantage: it does not require expert knowledge; a correction model is learned from learners' corpora.
      - However, it is not easy to acquire large-scale learners' corpora.
26. Statistical machine translation
    [Diagram: an English sentence enters the Translation Model, and the candidate outputs are scored by the Language Model to produce a Japanese sentence.]
    • The TM is learned from a sentence-aligned parallel corpus (e.g., "I like English" — 私は英語が好き).
    • The LM is learned from a Japanese monolingual corpus.
27. Japanese error correction
    [Diagram: a learner's sentence enters the Translation Model, and the candidate outputs are scored by the Language Model to produce a corrected sentence.]
    • The TM is learned from the sentence-aligned learners' corpus (e.g., learner 私わ英語が好き — correct 私は英語が好き).
    • The LM is learned from a Japanese monolingual corpus.
    A toy sketch of this decoding step follows.
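A toy sketch of how the two models combine at decoding time. The candidate list, the dictionary-based models, and all probability values are invented for illustration; the actual system uses a phrase-based SMT decoder, not this enumeration.

```python
import math

# Toy stand-ins for the two models; all values are hypothetical.
LM = {"私はアメリカに行った": 0.8,   # fluent Japanese -> high P(e)
      "私わアメリカに行った": 0.2}   # particle error -> low P(e)
TM = {("私わアメリカに行った", "私はアメリカに行った"): 0.6,  # P(f | e)
      ("私わアメリカに行った", "私わアメリカに行った"): 0.4}

def correct(source: str, candidates: list[str]) -> str:
    # Noisy-channel decoding: argmax_e  log P(e) + log P(f | e).
    return max(candidates,
               key=lambda e: math.log(LM[e]) + math.log(TM[(source, e)]))

print(correct("私わアメリカに行った",
              ["私はアメリカに行った", "私わアメリカに行った"]))
# 私はアメリカに行った  (わ corrected to は)
```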
28. Evaluation metrics
    • Character-based BLEU [Park and Levy, 2011]
    • Recall and precision based on the longest common subsequence (LCS); F-measure is the harmonic mean of R and P (a sketch follows):

      $\mathrm{recall}(R) = \dfrac{N_{LCS}}{N_{CORRECT}}, \qquad \mathrm{precision}(P) = \dfrac{N_{LCS}}{N_{SYSTEM}}$

      N_CORRECT: number of characters in the corrected answer
      N_SYSTEM: number of characters in the system output
      N_LCS: number of characters in the LCS of the corrected answer and the system output

    • E.g., correct: 私 は 学 生 で す; system: 私 は 学 生 だ
      N_CORRECT = 6, N_SYSTEM = 5, N_LCS = 4, so R = 4/6 and P = 4/5.
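A minimal sketch of the LCS-based metrics (function names are ours); it reproduces the worked example above:

```python
def lcs_length(a: str, b: str) -> int:
    # Dynamic programming over the standard LCS recurrence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_prf(correct: str, system: str):
    n_lcs = lcs_length(correct, system)
    recall = n_lcs / len(correct)     # N_LCS / N_CORRECT
    precision = n_lcs / len(system)   # N_LCS / N_SYSTEM
    f = 2 * recall * precision / (recall + precision) if n_lcs else 0.0
    return recall, precision, f

print(lcs_prf("私は学生です", "私は学生だ"))
# (0.666..., 0.8, 0.727...)  ->  R = 4/6, P = 4/5
```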
29. Experimental results – granularity of tokenization
    • Training corpus: L1 = ALL; test corpus: L1 = English; TM size: 0.3M sentences

                | W     | C3    | C5
      Recall    | 90.43 | 90.89 | 90.83
      Precision | 91.75 | 92.34 | 92.43
      F-measure | 91.09 | 91.61 | 91.62
      BLEU      | 80.72 | 81.63 | 81.81
30. Purpose of our study
    1. Solve the knowledge-acquisition bottleneck: create a large-scale learners' corpus from the error revision logs of a language learning SNS.
    2. Propose a method using SMT techniques with the extracted learners' corpus.
    3. Solve the problem of word segmentation errors caused by erroneous input.
31. Experiment
    • Carried out experiments to see:
      1. the effect of the granularity of tokenization
      2. the effect of corpus size
      3. the difference of first language (L1)
32. Experimental data
    • Training data: extracted from the revision logs of Lang-8; we prepare three L1 models:
      - L1 = ALL: 849,894 sentences
      - L1 = English: 320,655 sentences
      - L1 = Mandarin: 186,807 sentences
    • Test data: 500 sentences extracted from each of L1 = English and L1 = Mandarin, re-annotated to make a gold standard.
33. Experimental results – granularity of tokenization
    • Training corpus: L1 = ALL; test corpus: L1 = English

           | W     | C3    | C5
      BLEU | 80.72 | 81.63 | 81.81
34. Experimental results – corpus size
    • Training corpus: L1 = ALL model; test corpus: L1 = English model
    [Figure: BLEU as a function of TM training data size (0.1M, 0.15M, 0.3M, 0.85M sentences); y-axis 81.0–81.9]
35. Experimental results – L1 model
    • TM size: 0.18M sentences

      L1 of training data \ L1 of test data | English | Mandarin
      --------------------------------------|---------|---------
      English                               | 81.48   | 85.73
      Mandarin                              | 80.83   | 85.89
      all                                   | 81.21   | 85.53
36. Difficulty of handling the JSL learners' sentences
    • Word segmentation is usually performed as a pre-processing step.
    • JSL learners' sentences contain many errors and much hiragana (phonetic characters), so they are hard to tokenize with a traditional morphological analyzer.
    • E.g.:
      Learner: でもじょずじゃりません
      Correct: でもじょうずじゃありません
      After tokenization:
      Learner: でも じ ょずじゃりません
      Correct: でも じょうず じゃ ありません
37. Statistical machine translation
    [Diagram: the English input "went to America." passes through the Translation Model, and the candidate Japanese outputs (私はアメリカに行った, 私は英語が好き, …) are scored by the Language Model.]
    • The TM is learned from an English–Japanese parallel corpus (e.g., "I like English" — 私は英語が好き); the LM from a Japanese monolingual corpus.
38. Statistical machine translation (applied to error correction)
    [Diagram: the learner's input 私わアメリカに行った passes through the Translation Model, and the candidate corrected sentences (私はアメリカに行った, 私は英語が好き, …) are scored by the Language Model.]
    • The TM is learned from the learners' corpus (learner 私わ英語が好き — correct 私は英語が好き); the LM from a Japanese monolingual corpus.