Massively Multilingual Conference & Expo | San Jose, CA, USA | 11-12-13 October 2022
Mind the Gap
Detecting and Monitoring Quality Gaps in Machine Translation Services
Achim Ruopp, Polyglot Technology LLC
Sources: Nimdzi Language Technology Atlas 2022 (used with permission); bytelevel/research The 2022 Web Globalization Report Card
Challenge: Selecting MT providers for 40+ languages
bytelevel/research Web Globalization Report Card 2022
● Top-25 websites support 56 languages on average
● Machine translation plays an increasing role
● “The 40+ language club”
Source: Polyglot Technology MT Decider Benchmark Q2/2022
Myth #1: With NMT, all MT services are the same
● MT quality varies by up to 54%, or more than 9 BLEU points, between Amazon Translate, DeepL, Google Translate, and Microsoft Translator
● 21.6% (over one fifth) of top rankings by BLEU score change every quarter (see the sketch below)
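As a hedged illustration of how such a quarter-over-quarter ranking change can be tracked, here is a minimal Python sketch; the CSV layout, column names, and file names are assumptions for illustration, not the MT Decider Benchmark's actual format.

```python
# Sketch of the quarter-over-quarter comparison behind the "top rankings change"
# statistic. The CSV layout (language_pair, system, bleu) and file names are
# assumptions for illustration, not the benchmark's actual file format.
import csv
from collections import defaultdict

def load_scores(path):
    """Map language pair -> {system: BLEU score} from a simple CSV."""
    scores = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            scores[row["language_pair"]][row["system"]] = float(row["bleu"])
    return scores

def top_system(by_system):
    """Return the system with the highest score for one language pair."""
    return max(by_system, key=by_system.get)

q1 = load_scores("benchmark_q1.csv")
q2 = load_scores("benchmark_q2.csv")

changed = [lp for lp in q1
           if lp in q2 and top_system(q1[lp]) != top_system(q2[lp])]
print(f"Top ranking changed for {len(changed) / len(q1):.1%} of language pairs")
```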
Machine Translation Quality Reports

| Report | Languages | Language pairs | Test data | Evaluation | Frequency | Vendor-independent |
|---|---|---|---|---|---|---|
| Vendor Report 1 | 8 | 7 (English→) | Proprietary | Manual (non-standard) | One-off? | No |
| Vendor Report 2 | 10 | 34 | Proprietary | Automatic (post-editing score) | Quarterly | No |
| Vendor Report 3 | 12 | 11 (English→) | Proprietary | Automatic (COMET¹ + others) | Annual | No |
| Vendor Report 4 | 20 | 19 (English→) | Proprietary | Automatic (BLEU²) | Bi-annual | No |
| MT Decider Benchmark | 24+ | 48+ | Open | Automatic (COMET¹ + BLEU²) | Quarterly | Yes |

¹ Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
² Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
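To make the "Automatic (COMET¹ + BLEU²)" evaluation concrete, below is a minimal sketch of corpus-level scoring with the open-source sacrebleu library that accompanies reference ²; the file names and the one-segment-per-line layout are assumptions for illustration, not part of the benchmark itself.

```python
# Minimal sketch: corpus-level BLEU and chrF with sacrebleu (Post, 2018).
# File names and the one-segment-per-line layout are illustrative assumptions.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

hypotheses = read_lines("mt_output.txt")   # one MT segment per line
references = read_lines("reference.txt")   # matching human reference per line

# sacrebleu expects a list of reference streams; here there is a single reference.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

print(f"BLEU = {bleu.score:.1f}")
print(f"chrF = {chrf.score:.1f}")
```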
Myth #2: Domain-specific evaluation is always necessary
● 30 TAUS domain test sets
  ○ Financial Services
  ○ E-Commerce
  ○ Medical/Pharma/Biotech
● Generic test sets
  ○ Conference on Machine Translation (WMT) test sets
● 3 MT services
  ○ Amazon Translate
  ○ Google Translate
  ○ Microsoft Translator
[Chart: 1 = 63%, 2 = 17%, 3 = 20%]
Generic evaluation already picks the best baseline system
The Long Tail of machine translation evaluation
[Long-tail chart: 20% / 80%]

| Languages | Tool | Evaluation data |
|---|---|---|
| High-resource / easy for machine translation | MT Decider Benchmark | Generic, open evaluation data |
| Low-resource / difficult for machine translation | MT Decider Scorer¹ | Your data (source + target) |

Picture by Hay Kranen / PD; ¹Upcoming
Photo by Microsoft 365 on Unsplash; Philipp Koehn, Neural Machine Translation
Myth #3: Project-specific data is not necessary – domain evaluation or generic quality estimation is enough
Example parallel project data (English source with Arabic translation):
● “We need data, so we put a bunch of sensors in there to tell us what's going on.” → نحتاج الى بيانات. لذا وضعنا مجموعة من الحساسات هناك ليبينوا لنا ماذا يحصل.
● “We correlate those data points to individual plants.” → نربط نقاط البيانات هذه بنباتات منفردة.
● …
● Similarity/edit metrics: BLEU, chrF, TER, …
● Machine learning metrics: COMET (see the scoring sketch below)
Opportunity to define your organization’s style and terminology!
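As a hedged sketch of what scoring project-specific data with a machine-learning metric can look like, the snippet below uses the open-source unbabel-comet package; the checkpoint name, file paths, and English→Arabic setup are illustrative assumptions, and this is not the MT Decider Scorer itself.

```python
# Sketch: segment- and system-level COMET scores (Rei et al., 2020) on your own
# source/MT/reference triplets. Checkpoint name and file paths are illustrative.
from comet import download_model, load_from_checkpoint

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

sources    = read_lines("source.en.txt")     # project source segments
mt_outputs = read_lines("mt.ar.txt")         # MT output for the same segments
references = read_lines("reference.ar.txt")  # post-edited or human translations

data = [{"src": s, "mt": m, "ref": r}
        for s, m, r in zip(sources, mt_outputs, references)]

# Download a publicly released COMET checkpoint (assumes unbabel-comet >= 2.0).
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# gpus=0 runs on CPU; raise batch_size/gpus if hardware allows.
result = model.predict(data, batch_size=8, gpus=0)
print("System-level COMET:", result.system_score)
```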
Make the right decision on machine translation with MT Decider
● + all evaluation data
● + API¹
For cost-optimized MT supplier selection
¹Upcoming
