BLEU Evaluation vs ROUGE Evaluation
BLEU
The closer a machine translation is to a professional human translation, the better it is.
Brevity penalty: defined to be e^(1 - r/c) when the candidate length c is no longer than the reference length r, and 1 otherwise (so only short candidates are penalized).
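The brevity penalty above can be sketched directly from the formula. This is a minimal illustration, not the full BLEU computation (which also multiplies in clipped n-gram precisions):

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU brevity penalty: 1 if the candidate is at least as long as
    the reference, e^(1 - r/c) otherwise (penalizes short candidates)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# A 6-word candidate against a 9-word reference is penalized:
bp = brevity_penalty(6, 9)  # e^(1 - 9/6) = e^(-0.5) ≈ 0.6065
```

The penalty only kicks in when the candidate is shorter than the reference; precision alone would otherwise reward systems for emitting very short, safe outputs.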
ROUGE
 ROUGE-N: N-gram based co-occurrence statistics.
 ROUGE-L: Longest Common Subsequence (LCS) based statistics. Longest
common subsequence problem takes into account sentence level structure
similarity naturally and identifies longest co-occurring in sequence n-grams
automatically.
 ROUGE-W: Weighted LCS-based statistics that favors consecutive LCS matches.
 ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair
of words in their sentence order.
 ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
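The core quantity behind ROUGE-L is the longest common subsequence between candidate and reference tokens. A minimal sketch of the LCS length via dynamic programming (illustrative only; real ROUGE implementations add stemming, F-measure weighting, etc.):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists
    (the quantity ROUGE-L is built on)."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

ref = "the cat sat on the mat".split()
cand = "the cat was on the mat".split()
lcs = lcs_length(ref, cand)      # 5 ("the cat on the mat")
rouge_l_recall = lcs / len(ref)  # 5/6
```

Because LCS respects word order but allows gaps, it captures sentence-level structure without requiring a fixed n-gram size.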
 BLEU measures precision: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.
 ROUGE measures recall: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.
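The precision/recall distinction can be made concrete with a toy unigram overlap. This is a simplified sketch (single reference, unigrams only, clipped counts) rather than the full BLEU or ROUGE definitions:

```python
from collections import Counter

def clipped_overlap(candidate: list[str], reference: list[str]) -> int:
    """Clipped unigram overlap: each word is credited at most as many
    times as it appears in the reference (as in BLEU's clipped counts)."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(count, ref[word]) for word, count in cand.items())

cand = "the cat".split()
ref = "the cat sat on mat".split()
overlap = clipped_overlap(cand, ref)  # 2

precision = overlap / len(cand)  # BLEU-style: 2/2 = 1.0
recall = overlap / len(ref)      # ROUGE-style: 2/5 = 0.4
```

The same overlap count yields very different scores depending on the denominator: a short candidate can have perfect precision yet poor recall, which is why the two metrics complement each other.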
 These results are complementary, as is often the case with precision vs. recall.
 If many words from the system results appear in the human references, you will have high BLEU.
 If many words from the human references appear in the system results, you will have high ROUGE.
F1 = 2 · (BLEU · ROUGE) / (BLEU + ROUGE)
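The harmonic mean above balances the precision-like and recall-like scores; a one-liner suffices (the guard against a zero denominator is an added convention, not part of the formula):

```python
def f1(bleu: float, rouge: float) -> float:
    """Harmonic mean of a precision-like (BLEU) and recall-like (ROUGE) score."""
    if bleu + rouge == 0:
        return 0.0  # convention: both scores zero -> F1 of zero
    return 2 * bleu * rouge / (bleu + rouge)

score = f1(0.5, 0.25)  # 2 * 0.125 / 0.75 = 1/3
```

Like any harmonic mean, it is dominated by the lower of the two scores, so a system cannot compensate for poor recall with inflated precision (or vice versa).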