
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation (presentation at WMT18)

We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.


  1. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation. WMT18, Brussels, 31st October 2018. Antonio Toral, Sheila Castilho, Ke Hu, Andy Way
  2. Table of contents: 1. NMT Today and Human Parity; 2. Potential Issues; 3. Findings and Recommendations
  3. NMT Today and Human Parity
  4–8. Recent claims (built up across slides 4–8) • Google (2016): “bridging the gap between human and machine translation [quality]” • Microsoft (2018): “achieved human parity” on news translation from Chinese to English • SDL (2018): “cracked” Russian-to-English NMT with “near perfect” translation quality. Is NMT approaching human quality or is it hype?
  9. Microsoft’s human parity study • Dataset: WMT 2017 Chinese→English • Human evaluation: direct assessment, by bilingual crowd workers • Human parity definition: a human judges the quality of a translation produced by a human to be equivalent to one produced by a machine, i.e. the difference between the two is not statistically significant
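The parity criterion on slide 9 reduces to a statistical test over paired quality scores. Below is a minimal sketch in Python, assuming segment-level direct-assessment scores for the same source sentences and a Wilcoxon signed-rank test; the slide does not specify which significance test Microsoft actually used, so the test choice and the scores are illustrative only.

    # Hedged sketch of the "no significant difference" parity criterion.
    # The choice of test (Wilcoxon signed-rank) and the alpha level are
    # assumptions, not necessarily what Hassan et al. (2018) used.
    from scipy.stats import wilcoxon

    def parity_claimed(human_scores, mt_scores, alpha=0.05):
        """True if MT and human DA scores are not significantly different."""
        _, p_value = wilcoxon(human_scores, mt_scores)
        return p_value >= alpha

    # Hypothetical, made-up segment scores:
    # parity_claimed([78, 85, 90, 70, 64], [80, 83, 88, 72, 61])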
  10. Potential Issues
  11–12. Issue #1: Isolated sentences. Issue: sentences were evaluated in isolation. Why an issue: • There are referential relations that go beyond the sentence level (Voigt and Jurafsky, 2012); these are disregarded in the evaluation. • Favouring MT? Their MT system does not take inter-sentential context into account, while human translators do. • Läubli et al. (2018): stronger preference for human over MT when evaluating documents (vs isolated sentences)
  13. Issue #1: Isolated sentences. Our setup: • Sentences evaluated in the order they appear in the documents that make up the test set • Test set: a subset of randomised documents (vs randomised sentences) • Instructions: “Each task corresponds to one document. If possible please annotate all the sentences of a document in one go.”
  14. Issue #1: Isolated sentences [figure-only slide]
  15–16. Issue #2: Translationese. Issue: 50% of the source sentences are translationese. Why an issue: that part of the test set may be easier for MT due to three principles of translationese: simplification, explicitation and normalisation. • Type-token ratios: 0.19 (original ZH) vs 0.17 (translationese ZH)
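The type-token ratios on slides 15–16 follow from a simple calculation once the text is tokenised. A minimal Python sketch is below; the tokenisation is an assumption (real Chinese text would need proper word segmentation first), and the example corpora are made up.

    # Hedged sketch of a type-token ratio (TTR) computation.
    # Assumes the text has already been segmented into tokens; for Chinese,
    # a word segmenter would be needed before this step.
    def type_token_ratio(tokens):
        """Ratio of distinct tokens (types) to total tokens."""
        return len(set(tokens)) / len(tokens)

    # Hypothetical usage on pre-tokenised corpora:
    # ttr_original = type_token_ratio(original_zh_tokens)              # e.g. ~0.19
    # ttr_translationese = type_token_ratio(translationese_zh_tokens)  # e.g. ~0.17
    # A lower TTR suggests lexically simpler (more repetitive) text.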
  17. Issue #2: Translationese. Evaluation on subsets that differ in the original language of the source. [Chart: TrueSkill scores for HT, MS and GG on the Chinese-original vs English-original subsets; *: the translation is significantly better than the one in the next rank.]
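The rankings on slide 17 (and slides 21–22 below) are expressed as TrueSkill scores derived from pairwise human judgements. The sketch below shows how such a ranking can be computed with the Python trueskill package; it mirrors the WMT-style TrueSkill ranking in spirit rather than reproducing the authors’ exact setup, and the judgements are made up.

    # Hedged sketch of TrueSkill-based ranking from pairwise judgements.
    import trueskill

    systems = {"HT": trueskill.Rating(), "MS": trueskill.Rating(), "GG": trueskill.Rating()}

    # Hypothetical (winner, loser) pairs from human pairwise rankings.
    judgements = [("HT", "MS"), ("HT", "GG"), ("MS", "GG"), ("HT", "MS")]

    for winner, loser in judgements:
        systems[winner], systems[loser] = trueskill.rate_1vs1(systems[winner], systems[loser])

    # Rank by a conservative skill estimate (mu - 3 * sigma).
    ranking = sorted(systems.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True)
    print([name for name, _ in ranking])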
  18. Issue #2: Translationese. We check whether this effect of translationese is also present in Microsoft’s evaluation. [Chart: direct-assessment scores (range [0, 1]) for HT, MS and GG by original language of the source sentence (zh vs en).]
  19–20. Issue #3: Crowd. Issue: the evaluation was conducted by “bilingual crowd workers”. Why an issue: non-expert evaluators lack knowledge of translation and might not be able to notice subtle differences that make one translation better than another (Castilho et al., 2017). Our setup: professional and non-professional translators
  21–22. Issue #3: Crowd. [Chart: TrueSkill scores for HT, MS and GG as judged by experts vs non-experts; *: the translation is significantly better than the one in the next rank.] Also, IAA is higher among experts (0.254 vs 0.130)
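The IAA figures on slides 21–22 compare how consistently the two evaluator groups judge the same items. The slide does not name the agreement coefficient, so the sketch below assumes Cohen’s kappa over pairwise-preference labels, computed with scikit-learn on made-up annotations.

    # Hedged sketch of inter-annotator agreement as Cohen's kappa;
    # the coefficient actually reported on the slide is not specified there.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical preference labels from two annotators over the same items.
    annotator_a = ["HT", "MS", "HT", "tie", "MS", "HT"]
    annotator_b = ["HT", "HT", "HT", "tie", "MS", "MS"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.3f}")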
  23. Issue #4: Quality of the Test Sets
  24. Issue #4: Quality of the Test Sets. B2: EN→ZH • Most sentences contain grammatical errors and/or mistranslated proper nouns • Some sentences are not fluent
  25. Issue #4: Quality of the Test Sets. C2: ZH→EN: similar issues. EN original (A1): A front-row seat to the stunning architecture of the Los Angeles Central Library / ZH (B2): 洛杉矶中央图书馆的惊艳结构先睹为快 / EN (C2): Take a look of the astounding architecture of the Los Angeles Central Library.
  26. Issue #4: Quality of the Test Sets. EN original (A1): An open, industrial loft in DTLA gets a cozy makeover / ZH (B2): DTLA的开放式工厂阁楼进行了一次舒适的改造。 / EN (C2): A comfortable makeover was provided to the open factory building design of DTLA
  27. Findings and Recommendations
  28. Findings • Translationese: if removed, evidence that human parity has not been achieved • Professional translators: wider gap between HT and MT and higher IAA • Quality of human references: issues seem to indicate that they were produced by non-expert translators and possibly post-edited • Progress: current NMT is much better than 2017 production systems. Hype?
  29–30. Recommendations. The gap between MT and HT is narrowing; human evaluation should be more discriminative. • Translationese: should be avoided → artificially easier for MT • Professional translators: should translate the test sets from scratch → ensures high quality and independence from any MT engine; should conduct the human evaluation → fine-grained translation nuances taken into account • Context: beyond the sentence level, evaluate documents → inter-sentential phenomena taken into account. Caution: these recommendations are based on experiments on one language direction with five evaluators
  31–32. Thanks to: Microsoft, WMT, translators. 谢谢! Thank you! Questions?
  33. References • Sheila Castilho, Joss Moorkens, Federico Gaspari, Andy Way, Panayota Georgakopoulou, Maria Gialama, Vilelmini Sosoni, and Rico Sennrich. 2017. Crowdsourcing for NMT evaluation: Professional translators versus the crowd. In Translating and the Computer 39, London. https://www.asling.org/tc39/?page_id=3223 • Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation. https://arxiv.org/abs/1808.07048
  34. References (continued) • Rob Voigt and Dan Jurafsky. 2012. Towards a literary machine translation: The role of referential cohesion. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pages 18–25, Montréal, Canada.
