A summary of the BLEU paper.


- 1. BLEU: a Method for Automatic Evaluation of Machine Translation (BiLingual Evaluation Understudy)
  Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
  Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311–318
- 2. Viewpoint
  • The idea: the closer a machine translation is to a professional human translation, the better it is.
  • Judging quality therefore requires a numerical metric.
  • So an MT evaluation system requires:
    1. A numerical "translation closeness" metric
    2. A corpus of good-quality human reference translations
  • The metric is fashioned after the word error rate metric: it uses a weighted average of variable-length phrase matches against the reference translations.
- 3. Baseline BLEU Metric
  • The primary programming task for a BLEU implementer is to compare n-grams of the candidate with n-grams of the reference translation and count the number of matches.
  • We start by computing unigram matches.
- 4. n-gram precision
  • Precision measure: count the number of candidate translation words (unigrams) that occur in any reference translation, then divide by the total number of words in the candidate translation.
  • However, MT systems can generate improbable translations that nonetheless score high precision under this measure, such as the paper's "the the the the the the the" example.
  • Intuition for the fix: a reference word should be considered exhausted once a matching candidate word has been identified.
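The unclipped precision on this slide can be sketched in a few lines of Python. The function name `unigram_precision` is this sketch's own choice; the sentences are the paper's pathological example.

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Unclipped unigram precision: the fraction of candidate words
    that appear anywhere in any reference translation."""
    ref_words = set(w for ref in references for w in ref)
    matches = sum(1 for w in candidate if w in ref_words)
    return matches / len(candidate)

# The paper's pathological example: every candidate word occurs in
# a reference, so unclipped precision is a perfect 7/7.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(unigram_precision(candidate, references))  # 1.0
```

This is exactly the failure mode the modified precision on the next slide is designed to fix.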
- 5. Modified n-gram precision
  • Modified unigram precision:
    – Count the maximum number of times a word occurs in any single reference translation
    – Clip the total count of each candidate word by its maximum reference count
    – Add these clipped counts up
    – Divide by the total (unclipped) number of candidate words
  • Modified n-gram precision:
    – Collect all candidate n-gram counts and their corresponding maximum reference counts
    – Clip the candidate counts by their corresponding reference maximum, sum them, and divide by the total number of candidate n-grams
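The clipping procedure above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation; `modified_precision` is a name chosen here.

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Modified n-gram precision: candidate n-gram counts are clipped
    by the maximum count observed in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    # Maximum count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7 ≈ 0.286
```

"the" occurs at most twice in any single reference, so the count of 7 is clipped to 2, giving 2/7 instead of the unclipped 7/7.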
- 6. Modified n-gram precision on text blocks
  • The basic unit of evaluation is the sentence.
  • Compute the n-gram matches sentence by sentence.
  • Add the clipped n-gram counts over all the candidate sentences.
  • Divide by the number of candidate n-grams in the test corpus to compute a modified precision score.
- 7. Ranking systems
  • Compared a human translation with a machine translation.
  • 4 reference translations for each of 127 source sentences.
  • Result: a single modified n-gram precision score can distinguish a good translation from a bad one.
  • To be useful, however, the metric must also distinguish between two human translations that do not differ greatly in quality.
- 8. Ranking systems
  • Translations produced by:
    – A human lacking native proficiency in both source and target language
    – A native English speaker
    – Three commercial systems
  • Result: the systems' order under the metric matches the rank order given by human judges.
- 9. Combining the modified n-gram precisions
  • The result on the previous slide shows:
    – Precision decays roughly exponentially with n
    – Modified unigram precision > bigram > trigram
  • BLEU therefore averages the logarithms of the precisions with uniform weights (i.e., a geometric mean).
- 10. Recall
  • BLEU considers multiple reference translations, each of which may use a different word choice to translate the same source word.
  • A good candidate translation will only use (recall) one of these possible choices, not all of them; indeed, recalling all choices leads to a bad translation.
- 11. Sentence brevity penalty
  • Candidate translations longer than their references are already penalized by the modified n-gram precision measure, so the brevity penalty targets candidates that are too short.
  • Brevity penalty factor:
    – A high-scoring candidate translation must match the reference translations in length, in word choice, and in word order.
    – Brevity penalty 1.0: the candidate's length is the same as any reference translation's length.
  • c: the length of the candidate translation
  • r: the effective reference corpus length
  • BP = 1 if c > r, otherwise exp(1 − r/c).
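The penalty as defined on this slide translates directly into code (a minimal sketch; the example lengths are illustrative):

```python
import math

def brevity_penalty(c, r):
    """BP = 1 when the candidate length c exceeds the effective
    reference length r; otherwise exp(1 - r/c)."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(12, 12))  # 1.0 (no penalty at matching length)
print(brevity_penalty(6, 12))   # ~0.368 (candidate half the reference length)
```

Note that the penalty only ever shrinks the score: there is no bonus for candidates longer than the references, since those are already penalized by the modified precisions.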
- 12. BLEU details
  • Take the geometric mean of the test corpus' modified precision scores and then multiply the result by an exponential brevity penalty factor.
  • First compute the geometric average of the modified n-gram precisions, p_n, using n-grams up to length N and positive weights w_n summing to one: BLEU = BP · exp(Σ_n w_n log p_n).
  • To make the ranking behavior more apparent, take the logarithm: log BLEU = min(1 − r/c, 0) + Σ_n w_n log p_n.
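Putting the pieces together, here is a sentence-level sketch of the full score with uniform weights w_n = 1/N and the closest reference length as r. The function name `bleu` and the sentence-level framing are this sketch's own choices; the paper's BLEU aggregates clipped counts and lengths over the whole test corpus before combining.

```python
import math
from collections import Counter

def bleu(candidate, references, N=4):
    """Sentence-level BLEU sketch: geometric mean of modified n-gram
    precisions p_1..p_N (uniform weights 1/N) times the brevity penalty."""
    def mod_precision(n):
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
        cand = ngrams(candidate)
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        return clipped / total if total else 0.0

    precisions = [mod_precision(n) for n in range(1, N + 1)]
    if min(precisions) == 0:        # log(0) is undefined; score is 0
        return 0.0
    c = len(candidate)
    # Effective reference length: the reference length closest to c.
    r = min((len(ref) for ref in references), key=lambda l: abs(l - c))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) / N for p in precisions))
```

A candidate identical to a reference scores 1.0; any sentence with no matching 4-gram scores 0.0 under this sentence-level formulation, which is one reason the paper evaluates at the corpus level.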
- 13. The BLEU Evaluation
  • The BLEU metric ranges from 0 to 1.
  • A score of 1 is very rare: it requires a perfect match with a reference.
  • Higher is better.
  • A human translation scored 0.3468 against four references and 0.2571 against two references.
  • Table 1: 5 systems evaluated against two references.
- 14. Reliability of the metric
  • Is the difference in BLEU scores reliable? What is the variance of the BLEU score?
  • If we were to pick another random set of 500 sentences, would we still judge S3 to be better than S2?
  • Split the test data into 20 blocks of 25 sentences each and computed BLEU on each block.
  • Computed the means, variances, and paired t-statistics.
  • What Table 2 indicates:
    – Table 1 uses 500 sentences; Table 2 uses 25-sentence blocks
    – A t-statistic of 1.7 or above is considered 95% significant
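The block-level significance check above can be sketched as a paired t-statistic over per-block scores. The scores below are made-up illustrative numbers, not the paper's data, and `paired_t` is a name chosen here.

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t-statistic over matched per-block scores
    (one score per block of sentences)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative per-block BLEU scores for two hypothetical systems:
blocks_s3 = [0.20, 0.22, 0.21, 0.23]
blocks_s2 = [0.18, 0.19, 0.20, 0.21]
t = paired_t(blocks_s3, blocks_s2)
```

If the resulting t exceeds the slide's 1.7 threshold, the difference between the systems is judged significant at the 95% level.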
- 15. Evaluation
  • Two groups of human judges, 10 people each:
    – A monolingual group
    – A bilingual group
  • They evaluated the previous 5 systems.
  • Rating scale: 1 (very bad) to 5 (very good).
  • Some judges rated more liberally than others.
- 16. Pairwise Judgments
- 17. BLEU predictions
- 18. BLEU vs. Bilingual and Monolingual Judgments
