5. bleu

  1. BLEU: a Method for Automatic Evaluation of Machine Translation (BiLingual Evaluation Understudy)
     Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
     Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318
  2. Viewpoint
     • The idea: the closer a machine translation is to a professional human translation, the better it is.
     • To judge the quality, we need a numerical metric.
     • So, an MT evaluation system requires:
       1. A numerical "translation closeness" metric
       2. A corpus of good-quality human reference translations
     • The closeness metric is modeled on the word error rate metric used in speech recognition
       – Idea: use a weighted average of variable-length phrase matches against the reference translations
  3. Baseline BLEU Metric
     • The primary programming task for a BLEU implementor is to compare the n-grams of the candidate with the n-grams of the reference translations and count the number of matches.
     • So, we first look at computing unigram matches.
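
The counting itself is simple. Below is a minimal sketch (not from the slides) of extracting n-gram counts with Python's collections.Counter; the function name ngrams and the example sentence are illustrative, and the later sketches in this transcript reuse this helper.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Unigram and bigram counts for a toy candidate sentence
candidate = "the cat is on the mat".split()
print(ngrams(candidate, 1))  # e.g. ('the',): 2, ('cat',): 1, ...
print(ngrams(candidate, 2))  # e.g. ('the', 'cat'): 1, ...
```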
  4. n-gram precision
     • Precision measure
       – Count the number of candidate translation words (unigrams) which occur in any reference translation, then divide by the total number of words in the candidate translation.
     • However, MT systems can generate improbable, high-precision translations, such as a candidate that simply repeats a common reference word (see the sketch below).
       – Fix: a reference word is considered exhausted after a matching candidate word is identified.
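
A sketch of the unclipped unigram precision described above, using the paper's degenerate example of a candidate that only repeats the word "the"; the function name is illustrative.

```python
def naive_unigram_precision(candidate, references):
    """Unclipped precision: fraction of candidate words found in any reference."""
    ref_vocab = {w for ref in references for w in ref}
    matches = sum(1 for w in candidate if w in ref_vocab)
    return matches / len(candidate)

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(naive_unigram_precision(cand, refs))  # 1.0 (7/7), even though the candidate is useless
```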
  5. Modified n-gram precision
     • Modified unigram precision
       – Count the maximum number of times a word occurs in any single reference translation.
       – Clip the total count of each candidate word by its maximum reference count.
       – Add these clipped counts up.
       – Divide by the total (unclipped) number of candidate words.
     • Modified n-gram precision
       – All candidate n-gram counts and the corresponding maximum reference counts are collected.
       – The candidate counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams.
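
A minimal sketch of the modified (clipped) n-gram precision for a single sentence, reusing the ngrams helper and the example above; the names are illustrative.

```python
def modified_precision(candidate, references, n):
    """Clip each candidate n-gram count at its maximum count in any single reference."""
    cand_counts = ngrams(candidate, n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

print(modified_precision(cand, refs, 1))  # 2/7: "the" is clipped at its max reference count of 2
```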
  6. Modified n-gram precision on text blocks
     • The basic unit of evaluation is the sentence.
     • Compute the n-gram matches sentence by sentence.
     • Add the clipped n-gram counts over all the candidate sentences.
     • Divide by the number of candidate n-grams in the test corpus to compute a modified precision score.
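
A corpus-level sketch under the same assumptions: clipped matches and candidate n-gram totals are pooled over all sentences before dividing, rather than averaging per-sentence precisions.

```python
def corpus_modified_precision(candidate_sents, reference_sets, n):
    """Pool clipped matches and candidate n-gram counts over the whole test corpus."""
    clipped_total, candidate_total = 0, 0
    for cand, refs in zip(candidate_sents, reference_sets):
        cand_counts = ngrams(cand, n)
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped_total += sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        candidate_total += sum(cand_counts.values())
    return clipped_total / candidate_total if candidate_total else 0.0
```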
  7. Ranking systems
     • A human translation and a machine translation were compared.
     • 4 reference translations for each of 127 source sentences.
     • Result (figure not reproduced in this transcript): the human translation scores higher at every n-gram order.
     • From this result:
       – A single n-gram precision score can distinguish a good translation from a bad one.
     • To be useful, however, the metric must also distinguish between two human translations that do not differ so greatly in quality.
  8. Ranking systems
     • Translations done by:
       – A human translator lacking native proficiency in both the source and target languages
       – A native English speaker
       – Three commercial systems
     • Result:
       – The order of the systems by modified n-gram precision is the same as the rank order assigned by human judges.
  9. Combining the modified n-gram precisions
     • The result in the previous slide shows:
       – Modified n-gram precision decays roughly exponentially with n.
       – Modified unigram precision > bigram precision > trigram precision.
     • BLEU therefore averages the logarithms of the modified precisions with uniform weights, which is equivalent to a geometric mean (sketched below).
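
A sketch of that combination: averaging the logarithms with uniform weights is the same as taking a geometric mean. The sketch assumes every p_n is nonzero (the baseline metric does not smooth zero counts).

```python
import math

def combine_precisions(precisions, weights=None):
    """exp(sum_n w_n * log p_n): a weighted geometric mean, uniform weights by default."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(combine_precisions([0.8, 0.6, 0.4, 0.2]))  # geometric mean of four modified precisions
```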
  10. Recall
      • BLEU considers multiple reference translations, each of which may use a different word choice to translate the same source word.
      • A good candidate translation will only use (recall) one of these possible choices, not all of them. Indeed, recalling all of the choices leads to a bad translation.
  11. Sentence brevity penalty
      • Candidate translations longer than their references are already penalized by the modified n-gram precision measure, so the brevity penalty only needs to punish candidates that are too short.
      • Brevity penalty factor:
        – A high-scoring candidate translation must match the reference translations in length, in word choice, and in word order.
      • The brevity penalty is 1.0 when the candidate's length is the same as any reference translation's length.
      • c: the length of the candidate translation
      • r: the effective reference corpus length
      • Brevity penalty: exp(1 - r/c) when c ≤ r, and 1 when c > r
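
A sketch of the brevity penalty exactly as defined above; c and r would come from the candidate corpus and the effective reference lengths.

```python
def brevity_penalty(c, r):
    """1.0 if the candidate corpus is longer than the effective reference length,
    otherwise exp(1 - r/c). Assumes c > 0."""
    return 1.0 if c > r else math.exp(1.0 - r / c)
```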
  12. BLEU details
      • Take the geometric mean of the test corpus' modified precision scores and then multiply the result by the exponential brevity penalty factor.
      • We first compute the geometric average of the modified n-gram precisions, p_n, using n-grams up to length N and positive weights w_n summing to one; then BLEU = BP · exp( Σ_{n=1..N} w_n log p_n ).
      • To make the ranking behavior apparent, look at the log domain: log BLEU = min(1 - r/c, 0) + Σ_{n=1..N} w_n log p_n.
      • The baseline uses N = 4 and uniform weights w_n = 1/N.
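
Putting the pieces together, a sketch of corpus-level BLEU built from the helpers above. One assumption: for the effective reference length r, this picks the reference whose length is closest to each candidate's length (breaking ties toward the shorter reference); implementations differ on this detail.

```python
def bleu(candidate_sents, reference_sets, N=4):
    """Corpus BLEU = brevity penalty * geometric mean of modified 1..N-gram precisions."""
    precisions = [corpus_modified_precision(candidate_sents, reference_sets, n)
                  for n in range(1, N + 1)]
    c = sum(len(cand) for cand in candidate_sents)
    r = sum(min((len(ref) for ref in refs),
                key=lambda length: (abs(length - len(cand)), length))
            for cand, refs in zip(candidate_sents, reference_sets))
    return brevity_penalty(c, r) * combine_precisions(precisions)
```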
  13. The BLEU Evaluation
      • The BLEU metric ranges from 0 to 1.
      • A score of 1 is very rare: it requires a perfect match with a reference translation.
      • The higher, the better.
      • The human translation scored 0.3468 against four references and 0.2571 against two references.
      • Table 1: the 5 systems evaluated against two references.
  14.
      • Is the difference in the BLEU metric reliable? What is the variance of the BLEU score?
      • If we were to pick another random set of 500 sentences, would we still judge S3 to be better than S2?
      • The test corpus was divided into 20 blocks of 25 sentences each, and the BLEU metric was computed on each block.
      • We computed the means, variances, and paired t-statistics.
      • What Table 2 indicates:
        – Table 1 uses the single set of 500 sentences, while Table 2 reports statistics over the 25-sentence blocks.
        – A t-statistic of 1.7 or above is considered significant at the 95% level.
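
A sketch of the paired comparison described above, using scipy's paired t-test; the per-block scores here are randomly generated placeholders, not the paper's numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-block BLEU scores for systems S2 and S3 on the same 20 blocks
scores_s2 = rng.uniform(0.06, 0.10, size=20)
scores_s3 = scores_s2 + rng.normal(0.01, 0.005, size=20)

t_stat, p_value = stats.ttest_rel(scores_s3, scores_s2)
print(t_stat)  # with 19 degrees of freedom, t >= ~1.73 is significant at the 95% level (one-sided)
```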
  15. Evaluation
      • Two groups of human judges, ten people in each group:
        – Monolingual group
        – Bilingual group
      • They evaluated the previous 5 systems.
      • Rating scale: 1 (very bad) to 5 (very good).
      • Some judges were more liberal in their ratings than others.
  16. Pairwise Judgments
  17. BLEU predictions
  18. BLEU vs. Bilingual and Monolingual Judgments
