State of the Machine Translation by Intento (November 2017)
Nov. 3, 2017
Evaluation of 11 major Machine Translation providers (Google, Microsoft, IBM, SAP, Yandex, SDL, Systran, Baidu, GTCom, PROMT, DeepL) across the 35 most popular language pairs: performance, quality, language coverage, and API update frequency.
5. Machine Translation Services* Compared
* We have evaluated general-purpose Cloud Machine Translation services with prebuilt translation models, provided via API. Some vendors also provide web-based, on-premise, or custom MT engines, which may differ in every respect from what we have evaluated.
• Baidu Translate API
• DeepL API (beta)
• Google Cloud Translation API
• GTCom YeeCloud MT
• IBM Watson Language Translator
• Microsoft Translator Text API
• PROMT Cloud API
• SAP Translation Hub (beta)
• SDL Cloud Machine Translation
• Systran REST Translation API
• Yandex Translate API
7. Evaluation methodology (I)
• Translation quality is evaluated by computing the LEPOR score between reference translations and the MT output (Slide 9).
• Currently, our goal is to evaluate the performance of translation between the most popular languages (Slide 10).
• We use public datasets from StatMT/WMT and CASMACAT News Commentary (Slide 11).
• We have performed a LEPOR metric convergence analysis to identify the minimal viable number of segments in the dataset; see Slide 12 for details and the sketch below.
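A minimal sketch of such a convergence check, assuming per-segment scores have already been produced by an hLEPOR implementation (the helper names below are hypothetical, not from the report):

```python
import random
from statistics import mean

def convergence_curve(segment_scores, step=100, seed=42):
    """Running mean of per-segment LEPOR scores on growing prefixes
    of the evaluation set (segment_scores: list of floats)."""
    shuffled = list(segment_scores)
    random.Random(seed).shuffle(shuffled)      # remove any ordering effects
    return [(n, mean(shuffled[:n]))
            for n in range(step, len(shuffled) + 1, step)]

def has_converged(curve, epsilon=0.005, window=5):
    """Treat the metric as converged once the last few points of the curve
    move by less than epsilon, i.e. adding more segments from the same
    domain no longer changes the outcome."""
    tail = [score for _, score in curve[-window:]]
    return max(tail) - min(tail) < epsilon
```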
8. Evaluation methodology (II)
• We consider MT service A more performant than service B for language pair C if:
- the mean LEPOR score of A is greater than that of B for pair C, and
- the lower bound of A's 95% LEPOR confidence interval is greater than the upper bound of B's confidence interval for pair C (see Slide 12 for an example and the sketch below).
• Different language pairs (and different datasets) pose different levels of translation complexity. To compare the overall MT performance of different services, we regularise LEPOR scores across all language pairs (see Appendix A for more details).
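A sketch of this comparison rule over per-segment LEPOR scores; the normal-approximation confidence interval below is an assumption, since the slides do not say how the 95% intervals were computed:

```python
from statistics import mean, stdev

def confidence_interval(scores, z=1.96):
    """Approximate 95% confidence interval for the mean per-segment score."""
    m = mean(scores)
    half_width = z * stdev(scores) / len(scores) ** 0.5
    return m - half_width, m + half_width

def outperforms(scores_a, scores_b):
    """Service A is considered more performant than B for a language pair
    if its mean LEPOR is higher AND the intervals do not overlap:
    the lower bound of A's CI lies above the upper bound of B's CI."""
    lower_a, _ = confidence_interval(scores_a)
    _, upper_b = confidence_interval(scores_b)
    return mean(scores_a) > mean(scores_b) and lower_a > upper_b
```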
9. LEPOR score
• LEPOR: an automatic machine translation evaluation metric combining an enhanced Length Penalty, an n-gram Position difference Penalty, and Recall.
• In our evaluation, we used hLEPORA v.3.1, the best-performing metric of the ACL-WMT 2013 contest (a toy sketch of the LEPOR components follows below).
https://www.slideshare.net/AaronHanLiFeng/lepor-an-augmented-machine-translation-evaluation-metric-thesis-ppt
https://github.com/aaronlifenghan/aaron-project-lepor
LIKE BLEU, BUT BETTER
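For intuition only, here is a toy word-level sketch of the three LEPOR components: length penalty, position difference penalty, and the harmonic mean of precision and recall. It is a simplified illustration, not the hLEPORA v.3.1 implementation linked above:

```python
import math
from collections import Counter

def length_penalty(cand_len, ref_len):
    """LP: penalises the length mismatch between candidate and reference."""
    if cand_len == ref_len:
        return 1.0
    shorter, longer = sorted((cand_len, ref_len))
    return math.exp(1 - longer / shorter)

def position_penalty(candidate, reference):
    """NPosPenal = exp(-NPD): NPD is the mean normalised position difference
    of words matched between candidate and reference (simplified here to
    exact word matches, taking the first unused occurrence)."""
    used = [False] * len(reference)
    diffs = []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if not used[j] and ref_word == word:
                used[j] = True
                diffs.append(abs((i + 1) / len(candidate)
                                 - (j + 1) / len(reference)))
                break
    return math.exp(-sum(diffs) / len(candidate))

def harmonic_pr(candidate, reference, alpha=1.0, beta=1.0):
    """Weighted harmonic mean of unigram recall and precision."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    recall = overlap / len(reference)
    precision = overlap / len(candidate)
    return (alpha + beta) / (alpha / recall + beta / precision)

def toy_lepor(candidate, reference):
    """Sentence-level score: product of the three components."""
    return (length_penalty(len(candidate), len(reference))
            * position_penalty(candidate, reference)
            * harmonic_pr(candidate, reference))

print(toy_lepor("the cat sat on the mat".split(),
                "the cat is sitting on the mat".split()))
```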
10. Language Pairs
We focus on en-P1, P1-en and (partially) P1-P1 pairs.

Language groups by web popularity*:
P1 - ≥ 2.0% of websites
P2 - 0.5%-2% of websites
P3 - 0.1%-0.3% of websites
P4 - < 0.1% of websites

* https://w3techs.com/technologies/overview/content_language/all
(Matrix of evaluated pairs; rows are source languages, columns are target languages, in the order en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro. From en, all 12 other languages are covered; from ru, 5 targets; from fr, 3; from ja, de, es, it and zh, 2 each; from pt, cs, tr, fi and ro, 1 each; 35 directed pairs in total.)
12. LEPOR Convergence
We used 1,440-3,000 sentences per language pair. In all cases it is clear that the metric stabilises, and adding more sentences from the same domain would not change the outcome.
(Chart: regularised hLEPOR scores vs. number of sentences; aggregated mean and confidence interval across all language pairs, plus examples for individual language pairs.)
Detailed data on each language pair provided in the full report
13. Overall Performance
35 language pairs, 1,440-3,000 sentences per pair
(Chart: overall regularised hLEPOR score per provider; scores for individual pairs range from above 70% to below 40%, reflecting the variance among language pairs.)
Detailed data on each language pair provided in the full report
14. Available MT Quality
en ru ja de es fr pt it zh cs tr fi ro
en 4 7 3 2 7 1 2 6 2 4 5 2
ru 2 3 3 4 4
ja 1 4
de 8 2
es 7 4
fr 6 1 8
pt 4
it 7 2
zh 6 4
cs 2
tr 2
fi 1
ro 4
Legend:
• Cell value - number of top-performing MT providers for the pair
• Cell colour - maximal achieved hLEPOR score, on a scale from 30% to 70%
• Price marker - minimal price for this quality, per 1M characters: $$$ ≥ $20, $$ $10-15, $ < $10
Detailed data on each language pair provided in the full report
15. Sample pair analysis: en-pt
LEPOR score   Providers           Price range (per 1M characters)
77 %          Google              $20
72 %          Yandex, Microsoft   $4.5-15
70 %          Baidu, SDL, IBM     $8-$21
62 %          Systran, PROMT      $3-$8

BEST QUALITY: Google
BEST PRICE: PROMT
PRICE & QUALITY: Microsoft

ALL 35 PAIRS AVAILABLE IN THE FULL REPORT
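To illustrate how such call-outs can be derived from the table, here is a small sketch over the en-pt rows above. The "price & quality" rule used here (best score among rows priced at no more than half of the top row) is an assumption; the report does not state the exact criterion, and it names a single provider per call-out rather than a whole row.

```python
# Rows from the en-pt table: (hLEPOR %, providers, (min, max) price per 1M chars, USD)
EN_PT_ROWS = [
    (77, ["Google"],              (20.0, 20.0)),
    (72, ["Yandex", "Microsoft"], (4.5, 15.0)),
    (70, ["Baidu", "SDL", "IBM"], (8.0, 21.0)),
    (62, ["Systran", "PROMT"],    (3.0, 8.0)),
]

best_quality = max(EN_PT_ROWS, key=lambda row: row[0])
best_price = min(EN_PT_ROWS, key=lambda row: row[2][0])

# Hypothetical price & quality trade-off: best score among rows whose lower
# price bound is at most half of the best-quality row's price.
cap = best_quality[2][0] / 2
price_quality = max((row for row in EN_PT_ROWS if row[2][0] <= cap),
                    key=lambda row: row[0])

print("Best quality:", best_quality[1])        # ['Google']
print("Best price:", best_price[1])            # ['Systran', 'PROMT']
print("Price & quality:", price_quality[1])    # ['Yandex', 'Microsoft']
```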
16. Price vs. Performance
(Chart: Affordability vs. Performance for each provider, as of November 2017, with quadrants labelled COST-EFFECTIVE and ACCURATE; beta services are free, with pricing not yet set.)

Performance: regularised hLEPOR score aggregated across all language pairs in the dataset.
Affordability: 1 / price, using public volume-based pricing tiers (see the sketch below).

Legend:
• performance range: regularised average, with max and min across all pairs
• price range
Detailed data on each language pair provided in the full report
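A one-function sketch of how each provider's point on this chart is formed from the axis definitions above (the numbers in the example are placeholders, not report figures):

```python
def chart_point(regularised_hlepor, price_per_1m_chars):
    """Price vs. Performance coordinates:
    performance   = regularised hLEPOR aggregated across all language pairs,
    affordability = 1 / price (public volume-based pricing tier)."""
    return 1.0 / price_per_1m_chars, regularised_hlepor

# Placeholder example: a provider with 0.55 aggregated hLEPOR at $20 per 1M chars.
affordability, performance = chart_point(0.55, 20.0)
```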
19. Language popularity
Language groups by web popularity*:
P1 - ≥ 2.0% of websites
P2 - 0.5%-2% of websites
P3 - 0.1%-0.3% of websites
P4 - < 0.1% of websites

* https://w3techs.com/technologies/overview/content_language/all

A total of 29,070 language pairs are possible; 12,989 pairs are supported across all providers.
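A quick arithmetic check of these totals, assuming "possible pairs" means all directed source-target combinations over the union of languages offered by at least one provider (which would put that union at 171 languages; the 171 figure is inferred, not stated in the slides):

```python
n_languages = 171                                   # inferred: 171 * 170 = 29,070
possible_pairs = n_languages * (n_languages - 1)    # directed pairs
supported_pairs = 12_989                            # from the slide
print(possible_pairs)                               # 29070
print(round(supported_pairs / possible_pairs, 2))   # ~0.45 of possible pairs covered
```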
P1: en, ru, ja, de, es, fr, pt, it, zh
P2: pl, fa, tr, nl, ko, cs, ar, vi, el, sv, in, ro, hu
P3: da, sk, fi, th, bg, he, lt, uk, hr, no, nb, sr, ca, sl, lv, et
P4: hi, az, bs, ms, is, mk, bn, eu, ka, sq, gl, mn, kk, hy, se, uz, kr, ur, ta, nn, af, be, si, my, br, ne, sw, km, fil, ml, pa, …
23. Evaluation methodology (I)
Here we evaluate overall service organisation from the following
angles:
• Product - Support of Machine Translation features desired for using the API in various MT scenarios
• Design - Overall API design and technical convenience
• Documentation - How well the API is documented
• Onboarding - How easy it is to integrate and start using the API
• Commercial - Flexibility of the commercial terms
• Implementation - Important low-level features of the API
• Maintenance - Convenience of getting information about the API changes for ongoing support
• Reliability - Various technical issues we’ve encountered
Some references:
• http://talks.kinlane.com/apistrat/api101/index.html#/14
• https://mathieu.fenniak.net/the-api-checklist/
• https://www.slideshare.net/jmusser/ten-reasons-developershateyourapi
• https://restfulapi.net/richardson-maturity-model/
• https://github.com/shieldfy/API-Security-Checklist
• https://nordicapis.com/why-api-developer-experience-matters-more-than-ever/
• http://www.drdobbs.com/windows/measuring-api-usability/184405654?pgno=1
24. Evaluation methodology (II)
Product
• Translation domains
• Translation engines
• Language autodetect
• Glossaries
• TM Support
• Custom engines
• Bulk mode
• Formatted text
• XLIFF support

Design
• Authentication
• Use of SSL
• Quota info
• Domain info
• Balance info
• Self-sufficient
• Intuitive
• Versioning
• Bulk mode
• Task-invocation ratio
• I/O Structure
• List of endpoints

Documentation
• User documentation
• Supported languages
• Quotas
• Response codes
• Error codes
• Error messages
• API explorer
• API console
• Number of docs
• HTML doc

Onboarding
• Self-registration
• Self-issued keys
• Self-payment
• Free / Trial plan
• Sandbox
• Test console
• Github repo
• Code libraries
• SDK / PDK
• Sample code
• Direct support
• Ticket system
• Self-support
• Tutorial
• FAQ / KB
• Starter package

Commercial
• Public pricing
• Pay as you go
• Post-paid
• Volume discounts
• Payment systems
• Billing history

Implementation
• API spec
• Data compression
• Supports JSON
• Negotiable content
• Unicode support
• Error codes
• Error messages

Maintenance
• News source
• Subscription news
• Versioning
• Changelog
• Release notes
• Roadmap
• Status dashboard
• Developer dashboard
• Exportable logs

Reliability
• Uptime
• Sporadic errors
• Bugs
• Performance issues
• Status dashboard
• Outage alerts
28. Detailed version of this report
• We provide this overview version for free.
• The full evaluation report contains:
- Detailed best-deal analysis for each of the 35 language pairs
- Developer experience analysis for each of the 11 MT providers
• Also, by ordering the full report you support our ongoing evaluation of Cloud MT services.
• To get the full report, reach us at hello@inten.to