This document discusses discrepancies between automatic machine translation evaluation metrics, such as BLEU and RIBES, and human judgements of translation quality. The authors find that higher scores on these metrics do not always correspond to translations that humans judge to be better. Translations separated by only minor BLEU differences can nonetheless contain major adequacy errors that the metric fails to detect. The metrics may also correlate better with human judgements for translations rated positively than for those rated negatively. The authors conclude that a higher BLEU or RIBES score does not necessarily mean the machine translation is better.
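To make the BLEU point concrete, here is a minimal illustrative sketch (not taken from the paper; the sentences and scores are invented for illustration, and the sacrebleu package is assumed to be installed). Because BLEU measures n-gram overlap with a reference, a hypothesis that inserts a single negating word shares almost all of its n-grams with the reference and still scores high, even though its meaning is reversed.

```python
# Illustrative only: shows how n-gram overlap (BLEU) can stay high while
# adequacy collapses. Assumes `pip install sacrebleu`; sentences are invented.
import sacrebleu

reference = ["The committee approved the proposal after a long debate."]

# Hypothesis A matches the reference; hypothesis B negates the meaning
# but shares nearly every n-gram with the reference.
hyp_adequate = ["The committee approved the proposal after a long debate."]
hyp_negated = ["The committee never approved the proposal after a long debate."]

for name, hyp in [("adequate", hyp_adequate), ("negated", hyp_negated)]:
    bleu = sacrebleu.corpus_bleu(hyp, [reference])
    print(f"{name}: BLEU = {bleu.score:.1f}")

# The negated hypothesis keeps a high BLEU score from n-gram overlap alone,
# although a human judge would flag it as a serious translation error.
```

This kind of single-sentence contrast is only a toy demonstration of the general finding the paper reports at corpus level: small differences in metric scores need not track differences in adequacy.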