One of the main problems in the translation industry today is the lack of benchmarking. The output of MT engines cannot be compared against industry averages or standards because these do not yet exist, and automated scores are meaningless outside the “laboratory”. At the same time, buyers of translation services are increasingly interested in translated content at different quality levels: they want to save on some content and invest more in other content. They also want to know how different engines perform on different content types and in different language pairs. How can we be sure MT providers deliver what they are paid for? Benchmarking MT engines and creating a library of MT use cases is one way to move forward. Using industry benchmarking based on evaluation and productivity data is another option. One way or another, buyers need to be able to compare and benchmark MT solutions to make informed decisions.
Major competitions like WMT (Europe), OpenMT (US), CWMT (China), and WAT (Japan, running since 2014)
* (only used in WAT in Japan, but correlates with human judgment significantly better than BLEU, especially when word order differs greatly between the source and target languages)
** (measures how much effort is needed to correct the actual MT output)
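One common way to operationalize the effort-based note above is a word-level edit-distance score between the raw MT output and its post-edited version, in the spirit of TER/HTER. Below is a minimal sketch, assuming simple whitespace tokenization and normalization by post-edit length; the example segments are invented for illustration and this is not a TAUS-defined metric.

```python
# Minimal sketch: word-level edit distance between raw MT output and its
# post-edited version, normalized by post-edit length (TER/HTER-style).
# The example strings below are invented for illustration only.

def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    rows, cols = len(hyp_tokens) + 1, len(ref_tokens) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[-1][-1]

def post_edit_effort(mt_output, post_edited):
    """Edits needed to turn the MT output into the post-edited text,
    normalized by the post-edit length (lower = less effort)."""
    hyp, ref = mt_output.split(), post_edited.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

if __name__ == "__main__":
    mt = "the contract must been signed by both the parties"
    pe = "the contract must be signed by both parties"
    print(f"post-editing effort: {post_edit_effort(mt, pe):.2f}")  # -> 0.25
```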
(commercially interesting domains for MT Buyers; TAUS can survey its users to come up with a short list)
(https://evaluate.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines)
(https://evaluate.taus.net/academy/best-practices/evaluate-best-practices/error-typology-guidelines)
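To make the human-evaluation step concrete, below is a minimal sketch of how per-engine adequacy and fluency ratings could be aggregated. The 1-4 rating scale follows the linked TAUS adequacy/fluency guidelines; the engine names, data layout, and sample ratings are assumptions for illustration, not a prescribed data model.

```python
# Minimal sketch: averaging adequacy/fluency ratings per MT engine.
# Assumes a 1-4 rating scale as in the TAUS adequacy/fluency guidelines;
# engine names and ratings below are invented for illustration.
from collections import defaultdict
from statistics import mean

# One row per evaluator judgment: (engine, segment_id, adequacy, fluency)
ratings = [
    ("engine_a", 1, 4, 3),
    ("engine_a", 2, 3, 4),
    ("engine_b", 1, 2, 3),
    ("engine_b", 2, 3, 2),
]

scores = defaultdict(lambda: {"adequacy": [], "fluency": []})
for engine, _segment, adequacy, fluency in ratings:
    scores[engine]["adequacy"].append(adequacy)
    scores[engine]["fluency"].append(fluency)

for engine, vals in sorted(scores.items()):
    print(f"{engine}: adequacy={mean(vals['adequacy']):.2f} "
          f"fluency={mean(vals['fluency']):.2f}")
```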
Step 3: This step requires resource time and therefore has an associated cost:
Two potential solutions for funding quality Human Evaluations (see the evaluator-consistency sketch after this list):
Human Evaluation costs are paid by participating MT Vendors
Human Evaluation costs are offset by Paid Access to the results of MT Benchmarking for TAUS Members/Non-Members
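Whichever funding model is chosen, a quick agreement check between evaluators is one common way to verify that the human evaluation itself is reliable. The sketch below computes Cohen's kappa for two evaluators who rated the same segments; the two-rater setup, the 1-4 scale, and the sample ratings are assumptions for illustration, not part of the proposal itself.

```python
# Minimal sketch: Cohen's kappa for two evaluators who rated the same
# segments on the same scale (e.g., 1-4 adequacy). Ratings are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

if __name__ == "__main__":
    evaluator_1 = [4, 3, 3, 2, 4, 1, 3, 2]
    evaluator_2 = [4, 3, 2, 2, 4, 2, 3, 2]
    print(f"Cohen's kappa: {cohens_kappa(evaluator_1, evaluator_2):.2f}")
```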
Submission Period:
Much like in an academic competition, I think we should set an open period for submissions; once that period has ended, only the engines submitted during it will be benchmarked and the results published.
Engines submitted later will have to wait for the next open submission period to be benchmarked.