2. BLEU
The central idea: the closer a machine translation is to a professional human
translation, the better it is.
brevity penalty: 1 if c > r, otherwise e^(1-r/c), where c is the candidate (system output) length and r is the effective reference length. It penalizes candidates shorter than the reference, since precision alone would reward very short outputs.
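The piecewise definition above can be sketched directly (the function name and argument order are my own; c is the candidate length, r the effective reference length):

```python
import math

def brevity_penalty(c: int, r: int) -> float:
    """BLEU brevity penalty: 1 when the candidate is at least as long
    as the reference (c > r leaves no penalty), else exp(1 - r/c)."""
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

# A candidate half as long as the reference is penalized:
# brevity_penalty(5, 10) -> exp(-1), roughly 0.368
```

Note the asymmetry: overly long candidates are not penalized here; they are already punished by the modified n-gram precision.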
3. ROUGE
ROUGE-N: N-gram based co-occurrence statistics.
ROUGE-L: Longest Common Subsequence (LCS) based statistics. The longest
common subsequence naturally takes sentence-level structure similarity into
account and automatically identifies the longest co-occurring in-sequence
n-grams.
ROUGE-W: Weighted LCS-based statistics that favor consecutive matches.
ROUGE-S: Skip-bigram based co-occurrence statistics. A skip-bigram is any pair
of words in their sentence order, allowing for arbitrary gaps between them.
ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
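The ROUGE-L variant above can be sketched with the standard dynamic-programming LCS; the function names and the F-measure form (precision/recall over LCS length, as in the ROUGE paper) are my own framing:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure: combine LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p = lcs / len(cand)   # fraction of candidate covered by the LCS
    r = lcs / len(ref)    # fraction of reference covered by the LCS
    return (1 + beta**2) * p * r / (r + beta**2 * p)

# "the cat on the mat" (5 tokens) is the LCS of these two sentences,
# so precision = recall = 5/6 here.
score = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

Because the LCS only requires in-sequence (not contiguous) matches, word order is rewarded without demanding exact n-gram adjacency.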
4. BLEU measures precision: how many of the words (and/or n-grams) in the
machine-generated summaries appear in the human reference summaries.
ROUGE measures recall: how many of the words (and/or n-grams) in the
human reference summaries appear in the machine-generated summaries.
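The two directions of counting can be shown in one small sketch over clipped unigram overlap (the function name is my own; real BLEU and ROUGE add n-grams, multiple references, and other refinements):

```python
from collections import Counter

def unigram_overlap(candidate: str, reference: str):
    """Clipped unigram overlap, divided both ways:
    by candidate length (BLEU-style precision) and
    by reference length (ROUGE-style recall)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return precision, recall

# A short candidate copied from the reference scores perfect precision
# but poor recall: p = 2/2 = 1.0, r = 2/6.
p, r = unigram_overlap("the cat", "the cat sat on the mat")
```

The same overlap count appears in both numerators; only the denominator (whose words we ask about) changes, which is exactly the precision/recall distinction above.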
6. These measures are complementary, as is often the case with precision vs. recall:
if many words from the system results appear in the human references, you will
have high BLEU;
if many words from the human references appear in the system results, you will
have high ROUGE.