State of the Machine Translation by Intento (stock engines, Jan 2019)

Evaluation of 23 major Cloud Machine Translation Services with Stock (pre-trained) models (Alibaba, Amazon, Baidu, DeepL, Google Translate, GTCom Yeecloud, IBM Watson v3, Microsoft Text Translator v3, ModernMT, Naver Papago, Niutrans, PROMT, SAP Translation Hub, SDL Language Cloud and BeGlobal, Systran SMT and PNMT, Sogou, Tencent, Yandex, Youdao) for 48 language pairs: pricing, performance, quality, and language coverage. We also analyze how the MT landscape changed over the last year.

Published in: Technology

  1. STATE OF THE MACHINE TRANSLATION: STOCK* MT MODELS, by Intento, Jan 2019. (* commercially available pre-trained MT models)
  2. DISCLAIMER (January 2019 © Intento, Inc.). The MT systems used in this report were accessed from Dec 15 to Dec 31, 2018, and may have changed many times since then. This report shows the performance of those systems only on the dataset used for this report (see slide 14), measured with proximity scores. The final MT decision requires human LQA and depends on the use case. We run multiple evaluations for our clients across various language pairs and domains and observe different rankings of the MT systems. There is no “best” MT system: performance depends on how similar your data is to the data a vendor used to train its models, and on the vendor's algorithms. Don’t jump to conclusions. Do your homework.
  3. About. At Intento, we make Cloud Cognitive AI easy to discover, access, and evaluate for a specific use. We have been evaluating Machine Translation models since May 2017 (Custom NMT as well). As we show in this report, the Machine Translation landscape is complex: models from 9 different vendors are required to get the best performance across popular language pairs, and there is a 200x difference in price. We deliver this overview report for FREE. To evaluate on your own dataset, reach us at hello@inten.to
  4. Intento MT Hub: that’s how we run such evaluations. Vendor-agnostic API. Universal CLI and SDK. Connects to MemoQ, SDL Trados, Matecat and more. 10-20x faster due to multi-threading. Works with files of any size. May be deployed in a private cloud. Get your API key at inten.to
  5. Important highlights.
     - Changes in the MT engines list: ModernMT and SDL BeGlobal (NMT) added to the quantitative evaluation; eBay, Kakao, Naver, Niutrans and Sogou added to the MT systems list; IBM SMT and Microsoft SMT deprecated and removed.
     - For 21 language pairs, the best MT provider has changed since July 2018. To get the best quality across 48 language pairs, one needs 9 engines (see slide 18).
     - The Optimal MT chart changed significantly due to a 50% price reduction by Yandex (see slide 19).
     - Amazon, DeepL, Youdao, SAP and IBM increased their language coverage. At the same time, the deprecation of SMT engines reduced coverage for low-resource language pairs.
     - For 2 language pairs, the available MT quality rose more than 5% since July 2018: en-de (▲8%) and it-pt (▲5%). We have also updated some of the datasets, which led to a 3-4% drop in performance in general.
  6. Overview: 48 language pairs, 23 machine translation engines. Sections: 1 Translation Quality, 2 Pricing, 3 Language Coverage, 4 Historical Progress, 5 Conclusions.
  7. Machine Translation Engines* with Pre-Trained Models: Alibaba Cloud MT, Amazon Translate, Baidu Translate API, DeepL API, eBay Translation API, Google Cloud Translation API, GTCom YeeCloud MT, IBM Watson Language Translator, Kakao Developers Translation, Microsoft Translator Text API v3, ModernMT Enterprise API, Naver Cloud Papago NMT, Niutrans Maverick Translation, PROMT Cloud API, SAP Translation Hub, SDL BeGlobal, SDL Language Cloud, Sogou Deepi MT, Systran PNMT Enterprise Server, Systran REST Translation API, Tencent Cloud TMT API (preview), Yandex Translate API, Youdao Cloud Translation API. (MT systems marked in grey on the slide were unavailable for quantitative evaluation for various reasons.) * We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise or custom MT engines, which may differ in all aspects from what we’ve evaluated.
  8. 1 Translation Quality: 1.1 Evaluation Methodology, 1.2 Available MT Quality, 1.3 Top-Performing Engines, 1.4 Best General-Purpose Engines, 1.5 Optimal General-Purpose Engines.
  9. Evaluation methodology (I). Translation quality is evaluated by computing the LEPOR score between reference translations and the MT output (slide 11). Currently, our goal is to evaluate the performance of translation between the most popular languages (slide 12). We use public datasets from StatMT/WMT, CASMACAT News Commentary and Tatoeba (slide 13). We have performed a LEPOR metric convergence analysis to identify the minimal viable number of segments in the dataset; see slide 14 for some details.
  10. Evaluation methodology (II). We judge that the MT quality of service A is better than that of B for the language pair C if both conditions hold:
      - the mean LEPOR score of A is greater than the mean LEPOR score of B for the pair C, and
      - the lower bound of the LEPOR 95% confidence interval of A is greater than the upper bound of the LEPOR confidence interval of B for the pair C.
      See slide 14 for an example. Different language pairs (and different datasets) impose different translation complexity. To compare the overall MT performance of different services, we regularize LEPOR scores across all language pairs (see Appendix A for more details).
  11. LEPOR score. LEPOR is an automatic machine translation evaluation metric that considers an enhanced Length Penalty, an n-gram Position difference Penalty and Recall. In our evaluation, we used hLEPORA v.3.1 (best metric from the ACL-WMT 2013 contest). LIKE BLEU, BUT BETTER. See https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt and https://github.com/aaronlifenghan/aaron-project-lepor
  12. 48 Language Pairs. Language groups by web popularity*: P1, ≥2.0% of websites; P2, 0.5%-2% of websites; P3, 0.1-0.3% of websites; P4, <0.1% of websites. We focus on en-P1, P1-en and (partially) P1-P1. Evaluated pairs, by source language:
      - en → ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl
      - ru → en, de, es, fr, pt
      - ja → en, fr, zh
      - de → en, ru, ja, fr, it
      - es → en, zh
      - fr → en, ru, de, es
      - pt → en
      - it → en, de, pt
      - zh → en, ja, it
      - cs, tr, fi, ro, ko, ar, nl → en
      * https://w3techs.com/technologies/overview/content_language/all
  13. Datasets.
      - WMT-2013 (translation task, news domain): en-es, es-en
      - WMT-2015 (translation task, news domain): fr-en, en-fr
      - WMT-2016 (translation task, news domain): ro-en, en-ro
      - WMT-2018 (translation task, news domain): zh-en, en-zh, cs-en, en-cs, de-en, en-de, ru-en, en-ru, tr-en, en-tr, fi-en, en-fi
      - NewsCommentary-2011: en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, de-ru, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, it-de, ja-zh, zh-ja
      - Tatoeba, JHE: en-ko, ko-en
  14. LEPOR Convergence. We used 1600-2000 sentences per language pair. The metric stabilizes, and adding more sentences from the same domain won’t change the outcome. Charts: regularised hLEPOR scores against the number of sentences, aggregated across all language pairs (aggregated mean with confidence interval), plus examples for individual language pairs.
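  15. Available MT Quality. The slide is a source-by-target matrix over the 48 pairs. Cell colour shows the maximal available hLEPOR score (>80%, 70%, 60%, 50%, 40%, <40%); a price tag shows the minimal price for this quality per 1M characters at the base pricing tier ($$$: ≥$20, $$: $10-15, $: <$10); the number in each cell is the count of top-performing MT providers (up to 5% worse than the leader, SMT and NMT counted separately). Colours and price tags are not recoverable from this transcript. Provider counts, with target columns assigned from the pair list on slides 12 and 13:
      - en → ru 5, ja 7, de 8, es 9, fr 6, pt 8, it 7, zh 6, cs 2, tr 2, fi 2, ro 3, ko 1, ar 4, nl 1
      - ru → en 6, de 5, es 5, fr 5, pt 3
      - ja → en 4, fr 4, zh 6
      - de → en 7, ru 4, ja 3, fr 6, it 5
      - es → en 9, zh 5
      - fr → en 9, ru 4, de 7, es 8
      - pt → en 8
      - it → en 5, de 5, pt 4
      - zh → en 7, ja 5, it 4
      - cs → en 2; tr → en 4; fi → en 1; ro → en 3; ko → en 1; ar → en 7; nl → en 1
      Check Appendix B for more detailed data.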
  16. Sample pair analysis: English-Chinese (based on the WMT-18 dataset). LEPOR score, providers, and price range per 1M characters:
      - 74%: Tencent (preview)
      - 73%: Baidu, GTCom; $8-10
      - 72%: Google, Amazon; $15-20
      - 70%: Yandex; $7
      BEST QUALITY: Tencent (preview). TOP 5%: Tencent, Baidu, GTCom, Google, Amazon, Yandex. BEST PRICE IN TOP 5%: Yandex.
  17. TOP Performing MT Providers across 48 language pairs. Bar chart of the number of language pairs (0 to 50) per provider, in three categories: best (provides the best MT quality for a language pair), top 5% (within 5% of the best available MT quality for a language pair), and optimal (provides the lowest price among the top 5% MT engines for a language pair). Providers, in chart order: deepl, google, amazon, yandex, systran-pnmt, modernmt, ibm-nmt, promt, msft-nmt, tencent, baidu, sdl-beglobal, gtcom, sdl-smt.
  18. Best general-purpose MT engines. Source-by-target matrix over en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl, colour-coded by engine: deepl, google, amazon, yandex, systran-pnmt, modernmt, ibm, promt, microsoft, tencent, baidu (cell colours are not recoverable from this transcript). In several cases, there’s no statistically significant difference between the top engines. Check Appendix B for more detailed data.
  19. Optimal* general-purpose MT engines. Source-by-target matrix over en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl, colour-coded by engine: deepl, google, amazon, yandex, systran-pnmt, modernmt, ibm, promt, microsoft, tencent, baidu (cell colours are not recoverable from this transcript). * Cheapest with a performance within 5% of the best available for this language pair.
  20. 2 Public pricing, in USD per 1M symbols (the pricing chart is not recoverable from this transcript). * +20% for some language pairs. ** Estimation based on 4.79 symbols per word.
  21. 3 Language Coverage: 3.1 Supported and Unique per Provider, 3.2 Coverage by Language Popularity.
  22. 3.1 Supported and Unique Language Pairs*. Bar chart (log scale, 1 to 10000) of total and unique language pairs per provider; unique language pairs are those supported exclusively by one provider. Providers, in chart order: Niutrans, Google, Yandex, Microsoft v3, Sogou, Baidu, Amazon, Tencent, Youdao, Systran, SDL Language Cloud, PROMT, SAP, DeepL, IBM Watson v3, ModernMT, Naver, Alibaba, GTCom, Kakao, eBay. Bar values are not reliably recoverable from this transcript. * Where possible, we have checked via API whether all language pairs advertised in the documentation are supported, and removed the pairs we were unable to locate in the API. ** As advertised (not validated via API); this mark applies to eight providers on the slide.
  23. Language popularity. Language groups by web popularity*: P1, ≥2.0% of websites; P2, 0.5%-2% of websites; P3, 0.1-0.3% of websites; P4, <0.1% of websites. A total of 29070 pairs are possible; 14290 are supported across all providers.
      - P1: en, ru, ja, de, es, fr, pt, it, zh
      - P2: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
      - P3: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
      - P4: hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, …
      * https://w3techs.com/technologies/overview/content_language/all
  24. 3.2 Language coverage by popularity. Supported pairs make up 49% of possible language pairs. The slide is a 4x4 matrix of coverage between popularity groups (P1-P4 by P1-P4); the cell values shown (100%, 99%, 63%, 60%, 38%) cannot be assigned to cells from this transcript.
  25. Language coverage by service provider (per-provider charts, not recoverable from this transcript): Niutrans Maverick Translation, Google Cloud Translation API, Yandex Translate API, Microsoft Translator Text API v3, Sogou Deepi MT, Baidu Translate API, Amazon Translate, Tencent Cloud TMT API (preview), Youdao Cloud Translation API, Systran REST Translation API, SDL Language Cloud, PROMT Cloud API, SAP Translation Hub, DeepL API, IBM Watson Language Translator v3, ModernMT API, Naver Papago NMT, Alibaba Translate, GTCom YeeCloud MT, Kakao MT, eBay MT (preview).
  26. 4 Historical Progress: 4.1 Number of Cloud MT Vendors, 4.2 MT Quality, 4.3 Performance/Price Efficiency.
  27. 4.1 Independent Cloud MT Vendors with pre-built models. Commercial: Alibaba, Amazon, Baidu, DeepL, Google, GTCom, IBM, Microsoft, ModernMT, Naver, Niutrans, PROMT, SAP, SDL, Sogou, Systran, Yandex, Youdao. Preview: Tencent, eBay, Kakao. Chart: number of commercial and preview vendors (0 to 25) in Nov 17, Mar 18, Jul 18 and Dec 18. (Intento, Inc., July 2018)
  28. 4.2 Best available MT Quality. Chart: number of language pairs available at each level of LEPOR quality (30% to 80%), out of the 35 pairs we have evaluated since November 2017, in Nov 17, Mar 18, Jul 18 and Dec 18, with the best and worst pairs marked. The per-level counts are not recoverable from this transcript. (Intento, Inc., Dec 2018)
  29. 4.3 Best available Performance/Price Efficiency. Efficiency = (hLEPOR in %)² / (USD per 1M symbols). Chart: number of language pairs available at each level of efficiency (100 to 900), out of the 35 pairs we have evaluated since November 2017, in Nov 17, Mar 18, Jul 18 and Dec 18, with the best and worst pairs marked. The per-level counts are not recoverable from this transcript. (Intento, Inc., Dec 2018)
  30. 5 Conclusions. Since July 2018, the MT landscape has changed completely, in terms of both quality and price. Even for the general domain, getting the best quality across 48 language pairs requires 9 engines used simultaneously (and those are different from half a year ago). Re-evaluate your MT choice often to stay competitive.
  31. Intento Professional Services: MT Evaluation and Integration. Training and statistically significant evaluation of NMT engines, which may bring the largest cost and time reduction at the post-editing stage (see the example here). Identifying a subset of MT results for fast and affordable manual inspection (~200x reduction of LQA effort). LQA and HTER are also available via our LSP partners. MT integration: SDK and connectors to open platforms and in-house software. Reach us at hello@inten.to
  32. Intento Web-Tools. A human-friendly UI working directly with the Intento API. A quick way to try every MT engine and translate large files without API integration. Available in preview at no added cost to the Intento API. SIGN UP at https://console.inten.to
  33. Intento Plugins and Connectors. MemoQ (private plugin). SDL Trados (private plugin, also in the SDL AppStore). Matecat (private plugin). Also, many of the engines are available in Smartcat. Missing a connector? Reach us at hello@inten.to!
  34. Intento MT Hub: that’s how we run such evaluations. Vendor-agnostic API. Universal CLI and SDK. Connects to MemoQ, SDL Trados, Matecat and more. 10-20x faster due to multi-threading. Works with files of any size. May be deployed in a private cloud. Get your API key at inten.to
  35. STATE OF THE MACHINE TRANSLATION: STOCK* MT MODELS, by Intento (https://inten.to), January 2019. Konstantin Savenkov, ks@inten.to, (415) 429-0021, 2150 Shattuck Ave, Berkeley CA 94705.
  36. Appendix A. Overall performance of the MT services across many language pairs is computed in the following way:
      1. [Standardisation] We compute the mean language-standardized LEPOR score (or z-score) for each provider.
      2. [Scale adjustment] We restore the original scale by multiplying the z-score for each MT provider by the global LEPOR standard deviation and adding the global mean LEPOR score.
  37. Appendix B. Average hLEPOR ranking across all 48 language pairs. WARNING: This chart looks cool but requires a high level of color sensitivity. Also, there are lots of overlapping circles. Please look at slides 18 and 19 for more digestible data. (Chart axis: average hLEPOR. Intento, Inc., Dec 2018)
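
The comparison rule on slide 10 (higher mean, and confidence intervals that do not overlap) can be sketched as follows. The deck does not say how the 95% confidence interval is computed, so the normal approximation in `summarize` is an assumption, and `ScoreSummary`, `summarize` and `is_better` are hypothetical names used only for illustration.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreSummary:
    """Mean segment-level score with a confidence interval around it."""
    mean: float
    ci_low: float
    ci_high: float


def summarize(segment_scores, z=1.96):
    """Summarize per-segment scores (needs at least two segments).

    Assumption: a normal-approximation 95% interval, mean +/- z * s / sqrt(n).
    The report does not state which interval estimator was actually used.
    """
    n = len(segment_scores)
    mean = statistics.fmean(segment_scores)
    half_width = z * statistics.stdev(segment_scores) / math.sqrt(n)
    return ScoreSummary(mean, mean - half_width, mean + half_width)


def is_better(a, b):
    """Slide 10 rule: A beats B only if its mean is higher AND the lower
    bound of A's interval lies above the upper bound of B's interval."""
    return a.mean > b.mean and a.ci_low > b.ci_high
```

With this rule, two engines whose intervals overlap are treated as tied even when their means differ, which is why several cells on slide 18 have no single statistically significant winner.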
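
The "best", "top 5%" and "optimal" categories from slides 15 to 17 can be sketched on the English-Chinese figures from slide 16. Two assumptions are made here: "within 5%" is read as five percentage points below the leader (a relative reading would give a slightly different cut-off on these rounded scores), and engines that share a price range on the slide are given the lower bound of that range. `Engine`, `best`, `top_performers` and `optimal` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Engine(NamedTuple):
    name: str
    score: float            # hLEPOR, in percent
    price: Optional[float]  # USD per 1M characters; None if no price is listed


def best(engines):
    """Engine with the highest score for the language pair."""
    return max(engines, key=lambda e: e.score)


def top_performers(engines, tolerance=5.0):
    """Engines within `tolerance` percentage points of the leader (assumed reading)."""
    leader = best(engines).score
    return [e for e in engines if e.score >= leader - tolerance]


def optimal(engines, tolerance=5.0):
    """Cheapest priced engine among the top performers, or None if none has a price."""
    priced = [e for e in top_performers(engines, tolerance) if e.price is not None]
    return min(priced, key=lambda e: e.price) if priced else None


# Scores from slide 16; prices are the lower bounds of the listed ranges (assumption).
EN_ZH = [
    Engine("tencent", 74, None),   # preview, no price on the slide
    Engine("baidu", 73, 8.0),
    Engine("gtcom", 73, 8.0),
    Engine("google", 72, 15.0),
    Engine("amazon", 72, 15.0),
    Engine("yandex", 70, 7.0),
]
```

On this data the sketch reproduces the slide: Tencent is best, all six engines fall in the top group, and Yandex is the optimal choice.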
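
The efficiency formula on slide 29 and the per-word price conversion from the footnote on slide 20 are simple enough to write out. This is a sketch only: the function names are hypothetical, and treating 4.79 symbols per word as a flat conversion factor follows the footnote's wording rather than any published vendor rule.

```python
SYMBOLS_PER_WORD = 4.79  # estimate quoted in the slide 20 footnote


def price_per_million_symbols(price_per_word):
    """Convert a per-word price (USD) into USD per 1M symbols."""
    return price_per_word * 1_000_000 / SYMBOLS_PER_WORD


def efficiency(hlepor_percent, usd_per_million_symbols):
    """Slide 29: Efficiency = (hLEPOR in %)^2 / (USD per 1M symbols)."""
    if usd_per_million_symbols <= 0:
        raise ValueError("price must be positive")
    return hlepor_percent ** 2 / usd_per_million_symbols
```

For example, the Yandex English-Chinese figures from slide 16 (70% at $7 per 1M characters) give an efficiency of 70² / 7 = 700.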
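
The two steps in Appendix A can be sketched as below. The appendix does not spell out the details, so this sketch assumes that scores are standardized within each language pair across the providers evaluated on it, that population standard deviations are used, and that the global mean and deviation are taken over all raw scores. `regularized_scores` and the sample data in the usage note are hypothetical.

```python
import statistics
from collections import defaultdict


def regularized_scores(scores):
    """scores: {language_pair: {provider: lepor_score}} -> {provider: regularized score}."""
    all_values = [v for per_pair in scores.values() for v in per_pair.values()]
    global_mean = statistics.fmean(all_values)
    global_sd = statistics.pstdev(all_values)

    # Step 1 (standardisation): z-score each provider within each language pair,
    # then average the z-scores over the pairs that provider was evaluated on.
    z_by_provider = defaultdict(list)
    for per_pair in scores.values():
        values = list(per_pair.values())
        mu = statistics.fmean(values)
        sd = statistics.pstdev(values)
        for provider, value in per_pair.items():
            z_by_provider[provider].append((value - mu) / sd if sd else 0.0)

    # Step 2 (scale adjustment): map the mean z-score back to the LEPOR scale.
    return {
        provider: global_mean + global_sd * statistics.fmean(zs)
        for provider, zs in z_by_provider.items()
    }
```

Calling `regularized_scores({"en-de": {"a": 0.8, "b": 0.6}, "en-fi": {"a": 0.5, "b": 0.3}})` ranks `a` above `b` on a common scale, even though `b` on the easier pair scores higher than `a` on the harder one.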
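
The coverage figures on slides 23 and 24 are consistent with counting ordered (source, target) pairs. The deck never states how many languages were counted; the 171 below is inferred from 29070 = n × (n − 1) and should be read as an assumption, as should the helper names.

```python
import math


def ordered_pairs(n_languages):
    """Number of directed language pairs among n languages (source != target)."""
    return n_languages * (n_languages - 1)


def languages_for(pair_count):
    """Invert n * (n - 1) = pair_count; returns None if there is no integer solution."""
    n = (1 + math.isqrt(1 + 4 * pair_count)) // 2
    return n if ordered_pairs(n) == pair_count else None


POSSIBLE_PAIRS = 29070   # slide 23
SUPPORTED_PAIRS = 14290  # slide 23

n_languages = languages_for(POSSIBLE_PAIRS)                    # inferred, not stated in the deck
coverage_percent = 100 * SUPPORTED_PAIRS / POSSIBLE_PAIRS      # compare with the 49% on slide 24
```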
