
5. BLEU



Summary of the BLEU paper



  1. BLEU: a Method for Automatic Evaluation of Machine Translation (BiLingual Evaluation Understudy). Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311–318.
  2. Viewpoint
     • The idea: the closer a machine translation is to a professional human translation, the better it is.
     • Judging quality therefore requires a numerical metric.
     • An MT evaluation system requires:
       1. A numerical "translation closeness" metric
       2. A corpus of good-quality human reference translations
     • The metric follows the word error rate idea: use a weighted average of variable-length phrase matches against the reference translations.
  3. Baseline BLEU Metric
     • The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches.
     • We start by computing unigram matches.
  4. n-gram precision
     • Precision measure: count the candidate translation words (unigrams) that occur in any reference translation, then divide by the total number of words in the candidate translation.
     • However, under this measure an MT system can generate improbable but high-precision translations.
     • Fix: a reference word is considered exhausted once a matching candidate word has been identified.
  5. Modified n-gram precision
     • Modified unigram precision:
       – Count the maximum number of times a word occurs in any single reference translation.
       – Clip the total count of each candidate word by its maximum reference count.
       – Add these clipped counts up.
       – Divide by the total (unclipped) number of candidate words.
     • Modified n-gram precision:
       – Collect all candidate n-gram counts and their corresponding maximum reference counts.
       – Clip the candidate counts by their corresponding reference maximum, sum, and divide by the total number of candidate n-grams.
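The clipping procedure above can be sketched in Python. Function and variable names are my own; the example sentences are the paper's well-known illustration, where naive unigram precision would be 7/7 but clipping yields 2/7.

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against multiple references."""
    def ngrams(tokens, k):
        return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

    cand_counts = ngrams(candidate, n)
    # Maximum count of each n-gram observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    # Clip each candidate count by its maximum reference count, then sum.
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_ngram_precision(cand, refs))  # 2/7: "the" is clipped at 2
```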
  6. Modified n-gram precision on text blocks
     • The basic unit of evaluation is the sentence.
     • Compute the n-gram matches sentence by sentence.
     • Add the clipped n-gram counts for all candidate sentences.
     • Divide by the number of candidate n-grams in the test corpus to compute a modified precision score.
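The corpus-level computation described above can be sketched as follows, assuming tokenized sentences (helper names are my own):

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_modified_precision(candidates, reference_sets, n=1):
    """Sum clipped n-gram counts over all candidate sentences, then divide
    by the total number of candidate n-grams in the test corpus."""
    clipped_total = 0
    cand_total = 0
    for cand, refs in zip(candidates, reference_sets):
        cand_counts = ngram_counts(cand, n)
        max_ref = Counter()
        for ref in refs:
            for g, c in ngram_counts(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped_total += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        cand_total += sum(cand_counts.values())
    return clipped_total / cand_total if cand_total else 0.0

cands = [["the", "cat", "sat"], ["a", "dog"]]
refsets = [[["the", "cat", "sat"]], [["the", "dog"]]]
print(corpus_modified_precision(cands, refsets))  # (3 + 1) / (3 + 2) = 0.8
```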
  7. Ranking systems
     • Compared a human translation against machine translations.
     • 4 reference translations for each of 127 source sentences.
     • Result (figure omitted): a single n-gram precision score can distinguish a good (human) translation from a bad (machine) one.
     • To be useful, however, the metric must also distinguish between two human translations that do not differ so greatly in quality.
  8. Ranking systems
     • Translations produced by:
       – A translator lacking native proficiency in both the source and target languages
       – A native English speaker
       – Three commercial systems
     • Result: the systems' rank order under the metric is the same as the rank order given by human judges.
  9. Combining the modified n-gram precisions
     • The result on the previous slide shows:
       – Modified precision decays roughly exponentially with n.
       – Modified unigram precision > bigram precision > trigram precision.
     • BLEU uses the average of the logarithms with uniform weights.
  10. Recall
      • BLEU considers multiple reference translations, each of which may use a different word choice to translate the same source word.
      • A good candidate translation will use (recall) only one of these possible choices, not all of them; indeed, recalling all choices leads to a bad translation.
  11. Sentence brevity penalty
      • Candidate translations longer than their references are already penalized by the modified n-gram precision measure, so the brevity penalty targets candidates that are too short.
      • A high-scoring candidate translation must match the reference translations in length, in word choice, and in word order.
      • The brevity penalty is 1.0 when the candidate's length matches a reference translation's length.
      • c: the length of the candidate translation; r: the effective reference corpus length.
      • Brevity penalty: exp(1 − r/c) when c ≤ r, and 1 otherwise.
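The penalty above can be written as a small helper (a sketch; the case split for c > r follows the paper's definition):

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c).
    c = candidate translation length, r = effective reference length."""
    if c == 0:
        return 0.0  # guard: an empty candidate gets no credit
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(brevity_penalty(12, 12))  # 1.0: same length, no penalty
print(brevity_penalty(6, 12))   # exp(1 - 2) = exp(-1), a strong penalty
print(brevity_penalty(15, 12))  # 1.0: longer candidates are not penalized here
```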
  12. BLEU details
      • Take the geometric mean of the test corpus' modified precision scores, then multiply the result by an exponential brevity penalty factor.
      • First compute the geometric average of the modified n-gram precisions, pn, using n-grams up to length N and positive weights wn summing to one.
      • To make the behavior apparent
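Putting the pieces together: the weighted geometric average of the pn times the brevity penalty gives the overall score. A sketch with made-up precision values (not results from the paper):

```python
import math

def bleu(p_ns, c, r, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n); uniform weights by default."""
    if weights is None:
        weights = [1.0 / len(p_ns)] * len(p_ns)
    if any(p == 0.0 for p in p_ns):
        return 0.0  # log(0) undefined: in this simple form the score is 0
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, p_ns)))

# Illustrative modified precisions p1..p4 with matching corpus lengths:
print(bleu([0.8, 0.6, 0.4, 0.3], c=20, r=20))  # geometric mean of the four
```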
  13. The BLEU Evaluation
      • The BLEU metric ranges from 0 to 1.
      • A score of 1 is very rare: it requires a perfect match.
      • The higher the score, the better the translation.
      • A human translation scored 0.3468 against four references and 0.2571 against two references.
      • Table 1: five systems scored against two references.
  14. • Is the difference in the BLEU metric reliable? What is the variance of the BLEU score?
      • If we were to pick another random set of 500 sentences, would we still judge S3 to be better than S2?
      • The test corpus was split into 20 blocks of 25 sentences each, and the BLEU metric was computed on each block.
      • The means, variances, and paired t-statistics were computed.
      • Table 2 shows these statistics (Table 1 uses 500 sentences; Table 2 uses 25-sentence blocks); a t-statistic of 1.7 or above is considered significant at the 95% level.
  15. Evaluation
      • Two groups of human judges, 10 people each: a monolingual group and a bilingual group.
      • They evaluated the previous five systems.
      • Rating scale: 1 (very bad) to 5 (very good).
      • Some judges gave more liberal ratings than others.
  16. Pairwise Judgments
  17. BLEU predictions
  18. BLEU vs. Bilingual and Monolingual Judgments