
State of the Machine Translation by Intento (July 2018)

Evaluation of 19 major Cloud Machine Translation Engines (Alibaba, Amazon, Baidu, DeepL, Google, GRCom, IBM SMT and NMT, Microsoft SMT and NMT, ModernMT, PROMT, SAP, SDL Language Cloud, Systran SMT and PNMT, Tencent, Yandex, Youdao) for 48 language pairs: pricing, performance, quality, and language coverage. We also analyse how the MT landscape changed over the last year.


  1. STATE OF THE MACHINE TRANSLATION by Intento, July 2018
  2. July 2018 © Intento, Inc. About: At Intento, we make Cloud Cognitive AI easy to discover, access, and evaluate for a specific use. — Evaluation is a pain for everyone: to compare different services, you have to sign a lot of contracts and integrate many APIs. — As we show in this report, the Machine Translation landscape is complex, with a 4x difference in quality and a 195x difference in price across pre-built models available from different vendors. — We deliver this overview report for FREE. To evaluate on your own dataset, reach us at hello@inten.to
  3. Intento MT Gateway — that’s how we run such evaluations: vendor-agnostic API; sync and async modes; CLI tools and SDKs; works with files of any size; much faster due to hyperthreading. Get your API key at inten.to
  4. Important highlights: Amazon and SAP went from preview to production. — Amazon, Baidu, IBM, Microsoft, and PROMT increased language coverage. — For 7 language pairs, the available MT quality improved by more than 5% since Mar 2018: en-ko (▲25%), en-nl (▲11%), nl-en (▲14%), ru-de (▲8%), ja-fr (▲10%), en-cs (▲5%), en-tr (▲7%) (see slide 15). — For 13 language pairs, the best MT provider has changed since Mar 2018: en-zh, de-ru, ru-de, en-tr, en-pt, nl-en, en-nl, ja-en, zh-it, cs-en, en-cs, en-it, ru-en. — To get the best quality across 48 language pairs, one needs 9 engines (see slide 18).
  5. Overview: 1 Translation Quality; 2 Pricing; 3 Language Coverage; 4 Historical Progress; 5 Conclusions. 48 language pairs, 19 Machine Translation engines.
  6. Benchmark changes since March 2018: Added 3 engines: ModernMT*, Alibaba**, Youdao**. — Updated to new versions: IBM (v3/NMT), Microsoft (v3/NMT). — Updated SAP*** and Amazon from preview to public. — Added a detailed best and optimal engines chart (slides 18-19). — Added a Pricing section (slide 21). * evaluated on one language pair (cost-prohibitive) ** not yet available outside China *** not evaluated (cost-prohibitive & unstable)
  7. Machine Translation Engines Evaluated*: Alibaba Cloud Machine Translation, Amazon Translate, Baidu Translate API, DeepL API, Google Cloud Translation API, GTCom YeeCloud MT, IBM Watson NMT Language Translator, IBM Watson SMT Language Translator, Microsoft NMT Translator Text API, Microsoft SMT Translator Text API, ModernMT API, PROMT Cloud API, SAP Translation Hub, SDL Language Cloud Translation Toolkit, Systran PNMT Enterprise Server, Systran REST Translation API, Tencent Cloud TMT API (preview), Yandex Translate API, Youdao Cloud Translation API. * We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise, or custom MT engines, which may differ in all respects from what we’ve evaluated.
  8. 1 Translation Quality: 1.1 Evaluation Methodology; 1.2 Available MT Quality; 1.3 Top-Performing Engines; 1.4 Best General-Purpose Engines; 1.5 Optimal General-Purpose Engines; 1.6 Price vs. Performance
  9. Evaluation methodology (I): Translation quality is evaluated by computing the LEPOR score between reference translations and the MT output (slide 11). — Currently, our goal is to evaluate the performance of translation between the most popular languages (slide 12). — We use public datasets from StatMT/WMT, CASMACAT News Commentary, and Tatoeba (slide 13). — We have performed a LEPOR metric convergence analysis to identify the minimal viable number of segments in the dataset; see slide 14 for details.
  10. Evaluation methodology (II): We judge that the MT quality of service A is better than that of B for the language pair C if: (a) the mean LEPOR score of A is greater than that of B for the pair C, and (b) the lower bound of the 95% LEPOR confidence interval of A is greater than the upper bound of the LEPOR confidence interval of B for the pair C (see slide 14 for an example). — Different language pairs (and different datasets) pose different levels of translation complexity, so to compare the overall MT performance of different services, we regularize LEPOR scores across all language pairs (see Appendix A for details).
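The significance rule on this slide can be sketched in a few lines of Python. This is an illustrative simplification assuming a normal-approximation 95% interval over per-sentence scores; the report does not specify how its intervals are computed:

```python
import statistics

def ci95(scores):
    """Normal-approximation 95% confidence interval for the mean score."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean - 1.96 * sem, mean + 1.96 * sem

def is_better(scores_a, scores_b):
    """Service A beats B under the report's rule: higher mean LEPOR
    AND A's lower CI bound above B's upper CI bound."""
    lower_a, _ = ci95(scores_a)
    _, upper_b = ci95(scores_b)
    return statistics.mean(scores_a) > statistics.mean(scores_b) and lower_a > upper_b
```

With overlapping intervals the rule declares no winner in either direction, which is why several providers can tie as "top-performing" on a pair.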
  11. LEPOR score: an automatic machine translation evaluation metric considering the enhanced Length Penalty, n-gram Position difference Penalty, and Recall ("like BLEU, but better"). In our evaluation, we used hLEPORA v3.1, the best-performing metric in the ACL-WMT 2013 contest. https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt https://github.com/aaronlifenghan/aaron-project-lepor
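For intuition, here is a heavily simplified baseline-LEPOR sketch in Python: an enhanced length penalty times a position-difference penalty times the harmonic mean of unigram precision and recall. The nearest-position word alignment and unigram-only matching are simplifications of ours; the hLEPORA v3.1 used in the report (see the links above) differs in its alignment and weighting:

```python
import math
from collections import Counter

def lepor(hypothesis: str, reference: str) -> float:
    """Simplified baseline LEPOR; NOT the hLEPORA v3.1 used in the report."""
    hyp, ref = hypothesis.split(), reference.split()
    c, r = len(hyp), len(ref)
    # Enhanced length penalty: punish output that is too short or too long.
    lp = 1.0 if c == r else math.exp(1 - r / c) if c < r else math.exp(1 - c / r)
    # Clipped unigram matches, as in BLEU.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    # Position-difference penalty: distance between each hypothesis word's
    # relative position and the nearest occurrence of that word in the reference.
    diffs = []
    for i, word in enumerate(hyp):
        positions = [j for j, w in enumerate(ref) if w == word]
        if positions:
            diffs.append(min(abs((i + 1) / c - (j + 1) / r) for j in positions))
    pos_penalty = math.exp(-sum(diffs) / len(diffs))
    precision, recall = matches / c, matches / r
    harmonic = 2 * precision * recall / (precision + recall)
    return lp * pos_penalty * harmonic
```

An exact match scores 1.0; reordering the words lowers the score through the position penalty, which is the main way LEPOR improves on plain precision/recall metrics.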
  12. 48 Language Pairs. Language groups by web popularity*: P1 - ≥2.0% of websites; P2 - 0.5-2%; P3 - 0.1-0.3%; P4 - <0.1%. We focus on the en-P1, P1-en, and (partially) P1-P1 pairs. [Matrix: the 48 evaluated language pairs across en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl.] * https://w3techs.com/technologies/overview/content_language/all
  13. Datasets: WMT-2013 (translation task, news domain): en-es, es-en. WMT-2015 (translation task, news domain): fr-en, en-fr. WMT-2016 (translation task, news domain): cs-en, en-cs, de-en, en-de, ro-en, en-ro, fi-en, en-fi, ru-en, en-ru, tr-en, en-tr. WMT-2017 (translation task, news domain): zh-en, en-zh. NewsCommentary-2011: en-ja, ja-en, en-pt, pt-en, en-it, it-en, ru-de, ru-es, ru-fr, ru-pt, ja-fr, de-ja, es-zh, fr-ru, fr-es, it-pt, zh-it, en-ar, ar-en, en-nl, nl-en, fr-de, de-fr, de-it, ja-zh, zh-ja. Tatoeba: en-ko, ko-en.
  14. LEPOR Convergence. We used 900-3,000 sentences per language pair; at that size the metric stabilizes, and adding more sentences from the same domain won’t change the outcome. [Charts: regularised hLEPOR scores vs. number of sentences, aggregated across all language pairs and for individual pairs, with confidence intervals and aggregated means.]
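The convergence analysis can be sketched as follows: grow the sample and stop once the running mean of per-sentence scores stops moving. A minimal sketch; the window and tolerance values are hypothetical choices of ours, not Intento's:

```python
def converged_sample_size(scores, window=200, tol=0.002):
    """Smallest prefix length after which adding `window` more sentences
    moves the running mean score by less than `tol`."""
    running, total = [], 0.0
    for i, s in enumerate(scores, 1):
        total += s
        running.append(total / i)
    for n in range(window, len(running)):
        if abs(running[n] - running[n - window]) < tol:
            return n + 1
    return len(scores)  # did not stabilize within this sample
```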
  15. Available MT Quality. [Matrix: for each of the 48 language pairs, the maximal available hLEPOR score (>80%, 70%, 60%, 50%, 40%, <40%), the minimal price for this quality per 1M characters at the base pricing tier ($ <$10, $$ $10-15, $$$ ≥$20), and the number of top-performing MT providers (up to 5% worse than the leader; SMT and NMT counted separately).]
  16. Sample pair analysis: English-Chinese (based on the WMT-17 dataset). LEPOR 71%: Tencent (preview); 70%: Google, GTCom ($10-20 per 1M characters); 68%: Baidu ($7); 66.5%: Systran PNMT, Amazon ($15-?); 65%: Microsoft, IBM NMT ($10-21.4). BEST QUALITY: Tencent (preview). TOP 5%: Tencent, Google, GTCom, Baidu. BEST PRICE IN TOP 5%: Baidu.
  17. Top-performing MT Providers across 48 language pairs. Definitions: best: provides the best MT quality for a language pair; top 5%: within 5% of the best available MT quality for a language pair; optimal: provides the lowest price among the top-5% MT engines for a language pair. [Chart: number of best / top-5% / optimal language pairs per provider: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, tencent, ibm-smt, baidu, systran-pnmt, gtcom, msft-smt, sdl-smt, modernmt.]
  18. Best general-purpose MT engines. [Matrix: the engine providing the best quality for each of the 48 language pairs; engines used: google, deepl, amazon, yandex, ibm-nmt, promt, msft-nmt, ibm-smt, tencent.]
  19. Optimal* general-purpose MT engines. * Cheapest with a performance within 5% of the best available for the language pair. [Matrix: the optimal engine for each of the 48 language pairs; engines used: msft-nmt, yandex, msft-smt, baidu, google, amazon, ibm-nmt, promt, ibm-smt.]
  20. Price vs. Performance (as of March 2018; only production-ready engines shown). Performance: regularized hLEPOR score aggregated across all language pairs in the dataset. Affordability = 1/price, using public volume-based pricing tiers. [Scatter chart: affordability vs. performance, with quadrants labeled ACCURATE, COST-EFFECTIVE, and NOT PUBLIC; each engine shown with its performance range (regularized average, max, and min across all pairs) and its price range.]
  21. 2 Public Pricing, USD per 1M symbols. * +20% for some language pairs. ** estimation based on 4.79 symbols per word.
  22. 3 Language Coverage: 3.1 Supported and Unique per Provider; 3.2 Coverage by Language Popularity
  23. Supported and Unique Language Pairs. Unique language pairs are those supported exclusively by one provider. [Bar chart, log scale: total and unique supported language pairs per provider (Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom); totals range from 6 to 10,712 pairs.]
  24. Language popularity. Language groups by web popularity*: P1 - ≥2.0% of websites: en, ru, ja, de, es, fr, pt, it, zh. P2 - 0.5-2%: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu. P3 - 0.1-0.3%: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et. P4 - <0.1%: hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, … A total of 29,070 pairs are possible; 13,098 are supported across all providers. * https://w3techs.com/technologies/overview/content_language/all
  25. Language coverage by popularity: 45% of possible language pairs are supported. [Matrix: share of supported pairs for each popularity-group combination (P1-P4 × P1-P4), from 100% for pairs among the most popular languages down to 31% for the least popular.]
  26. Language coverage by service provider: Google Cloud Translation API, Yandex Translate API, Microsoft Translator Text API (SMT), Microsoft Translator Text API (NMT), Baidu Translate API, Tencent Cloud TMT API (preview), Systran REST Translation API, Systran PNMT Enterprise Server, PROMT Cloud API, SDL Language Cloud Translation Toolkit, Youdao Cloud Translation API, SAP Translation Hub, ModernMT API, DeepL API, IBM Watson Language Translator (NMT), Amazon Translate, IBM Watson Language Translator (SMT), Alibaba Translate, GTCom YeeCloud MT.
  27. 4 Historical Progress: 4.1 Number of Cloud MT Vendors; 4.2 MT Quality; 4.3 Performance/Price Efficiency
  28. Independent Cloud MT Vendors with pre-built models. Commercial: Alibaba, Amazon, Baidu, DeepL, Google, GTCom, IBM, Microsoft, ModernMT, PROMT, SAP, SDL, Systran, Yandex, Youdao. Preview: Tencent. [Chart: number of commercial and preview vendors, Jul 2017 to Jul 2018.]
  29. Best available MT Quality. [Chart: best-pair and worst-pair LEPOR quality (30%-80%) from Jul 2017 to Jul 2018, annotated with the number of language pairs available at each quality level, out of the 14 pairs we have evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back).]
  30. Best available Performance/Price Efficiency. Efficiency = (hLEPOR in %)² / (USD per 1M symbols). [Chart: best-pair and worst-pair efficiency (100-900) from Jul 2017 to Jul 2018, annotated with the number of language pairs available at each efficiency level, out of the 14 pairs we have evaluated since July 2017 (ru, de, cs, tr, fi, ro, zh to en and back).]
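The efficiency formula on this slide is straightforward to compute; squaring the quality score deliberately weighs performance more heavily than price:

```python
def efficiency(hlepor_percent: float, usd_per_million_chars: float) -> float:
    """Performance/price efficiency: (hLEPOR in %) squared, per USD per 1M symbols."""
    return hlepor_percent ** 2 / usd_per_million_chars

# For example, a 70%-quality engine at $10 per 1M characters
# scores 70**2 / 10 = 490.
```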
  31. 5 Conclusions: Machine Translation quality and efficiency improve monthly but are still far from ideal, so choosing the right MT engine is a must. — At the same time, the MT landscape is getting more fragmented as the focus shifts from having the best algorithms to having the best data. — Even for the general domain, achieving the best quality across 48 language pairs requires 9 engines used simultaneously.
  32. Custom version of this report: You may run the evaluation for your project using our vendor-agnostic API and command-line tools. — We can also help with translating your corpus via multiple vendors or handle the whole evaluation for your project. — Reach us at hello@inten.to
  33. Evaluate vendors on your own data with no effort: up to +230% quality and −87% price by choosing the right vendor; save 12 weeks of engineering and data science effort. — Manage and optimise your vendor portfolio with our smart routing AI: use the best vendor for each language pair and domain with no hassle. — A single integration and contract covers multiple vendors and models: save 5-7 weeks upfront per vendor API and 1 day per month per vendor API. The Intento Single API routes requests to the best models. Reach us for pricing and contract.
  34. STATE OF THE MACHINE TRANSLATION by Intento (https://inten.to), July 2018. Konstantin Savenkov, ks@inten.to, (415) 429-0021, 2150 Shattuck Ave, Berkeley CA 94705
  35. Appendix A: Overall performance of the MT services across many language pairs is computed in the following way: 1. [Standardisation] We compute the mean language-standardised LEPOR score (z-score) for each provider. 2. [Scale adjustment] We restore the original scale by multiplying each provider's mean z-score by the global LEPOR standard deviation and adding the global mean LEPOR score.
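The two steps of Appendix A can be sketched in Python. This is a sketch under assumptions of ours (population standard deviation; a unit fallback when a pair has zero variance); the report does not specify these details:

```python
import statistics

def regularized_scores(scores_by_pair):
    """scores_by_pair: {language_pair: {provider: lepor_score}}.
    Returns one cross-pair score per provider on the original LEPOR scale."""
    all_scores = [s for pair in scores_by_pair.values() for s in pair.values()]
    g_mean = statistics.mean(all_scores)
    g_std = statistics.pstdev(all_scores)
    # Step 1 [Standardisation]: per-pair z-scores, averaged per provider.
    z = {}
    for pair_scores in scores_by_pair.values():
        m = statistics.mean(pair_scores.values())
        sd = statistics.pstdev(pair_scores.values()) or 1.0
        for provider, score in pair_scores.items():
            z.setdefault(provider, []).append((score - m) / sd)
    # Step 2 [Scale adjustment]: back to the global LEPOR scale.
    return {p: statistics.mean(v) * g_std + g_mean for p, v in z.items()}
```

Standardising per pair removes each pair's intrinsic difficulty, so a provider that is consistently above average on hard pairs is not penalised relative to one evaluated mostly on easy pairs.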
