Welocalize presentation by Lena Marg. This machine translation research focused on the results of a major data-gathering exercise carried out in 2014 by the Welocalize Language Tools team. We correlated results from automatic scoring (in this case using BLEU), human scoring of raw MT output on a 1-5 Likert scale, and productivity test deltas from 2013 data. The total test set comprised 22 locales, five different MT systems and various source content types. In line with findings from other speakers and recent publications, we found that while automatic scores such as BLEU serve as good trend indicators of overall MT system performance, they tell us little about how useful the given MT output is for post-editors. Human scoring, on the other hand, correlated with the productivity gains seen in post-editing, and error classification proved a better indicator of usability. This confirmed the validity of our evaluation approach, which combines productivity data with human evaluation. For additional information, visit http://www.welocalize.com/wemt/why-wemt/
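
The correlation analysis described above can be sketched as follows. This is a minimal illustration only: the per-engine numbers below are invented placeholders, not Welocalize results, and Pearson correlation is one reasonable choice of measure (the presentation does not specify which statistic was used).

```python
# Sketch: how well do BLEU vs. human Likert scores track post-editing
# productivity deltas? Uses Pearson correlation; all data are placeholders.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-engine results: BLEU (0-100), mean human score
# (1-5 Likert), and post-editing productivity delta (% gain).
bleu  = [32.1, 45.8, 28.4, 51.2, 39.7]
human = [2.8, 3.9, 2.5, 4.2, 3.4]
delta = [12.0, 31.0, 9.0, 38.0, 22.0]

print("BLEU  vs. productivity:", round(pearson(bleu, delta), 3))
print("Human vs. productivity:", round(pearson(human, delta), 3))
```

In practice one would run this per locale and content type, since the finding was that human scores track post-editing usefulness more reliably than BLEU does across such conditions.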