Evaluating Domain-Adaptive Neural Machine Translation (IMUG, March 2019)

2. Intento
AGENDA
1 INTRO & DISCLAIMERS
2 DOMAIN-ADAPTIVE NMT
3 AVAILABLE SOLUTIONS
4 EVALUATION PROCESS
© Intento, Inc. / March 2019
4. Intento
1 HOW WE STARTED TO EVALUATE MT
Universal API Middleware
—
Noticed 300x price difference for Machine Translation
—
Are price & language support the only differences?
—
Evaluated stock MT engines: performance differs as well!
—
Also, the landscape changes fast!
7. Affordable Custom NMT — Domain-Adaptive Models

“Custom NMT” (from scratch):
• builds upon: open or proprietary NMT frameworks
• training data size: 1M…10M segments
• training process: heavily curated
• main cost drivers: licenses, human services ($$$$-$$$$$)
• super expensive just to try

Domain-Adaptive NMT:
• builds upon: baseline models or datasets as a service
• training data size: 1K…1M segments
• training process: automatic
• main cost drivers: computing time ($$-$$$)
8. 2018 in Machine Translation
Rise of Domain-Adaptive NMT*

[Timeline, Sep 2017 – Oct 2018: Lilt Adaptive NMT, Systran PNMT, IBM Custom NMT, Microsoft Custom Translate, Globalese Custom NMT, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise]

* Neural Machine Translation with automated customisation using domain-specific corpora, also known as domain adaptation.
11. Intento
4 EVALUATION PROCESS
4.1 Identify the goals (PE vs. raw MT vs. IR, etc.)
—
4.2 Define set of projects
—
4.3 Select candidate engines (stock and adaptive)
—
4.4 Large-scale automatic scoring
—
4.5 Human evaluation
—
4.6 Identify complementary engines
—
4.7 Setup workflow, train people, track performance (PE or conversions)
12. Intento
4.1 IDENTIFY THE GOALS
Use case: PEMT, Raw MT or Information Retrieval
—
Mission critical? (inbound vs outbound)
—
How to calculate ROI?
—
Expected gains?
—
BATNA?
13. Intento
4.2 IDENTIFY DISTINCT PROJECTS
One TM = One Project
—
The fewer, the better (an error matrix will help)
—
Track productivity to identify new projects
14. Intento
4.3 PROJECT ATTRIBUTES
Language pair
—
Domain
—
Availability of reference data
—
Availability and volume of training data
—
Availability of glossary
Data Protection requirements
—
Data Locality requirements
—
Contracting requirements
—
Deployment requirements
—
Usage scenario (monthly and peak volume, frequency of updates)
15. Intento
4.4 SELECTING CANDIDATE ENGINES
Language support
—
Domain- and dialect-specific baseline models
—
Minimal training data requirements
—
Glossary support
Data Protection
—
Deployment options (regions)
—
Contracting jurisdictions, payment options
—
Deployment options (cloud, on-premise etc)
—
TCO model
17. Globalese Custom NMT (November 2018)
Language Pairs: “all of them”
Customization: parallel corpus
Deployment: cloud, on-premise
Cost to train*: —
Cost to maintain*: $58 per month
Cost to translate*: — (limited volume)
* base pricing tier
18. Google Cloud AutoML Translation β (October 2018)
Language Pairs: 50
Customization: parallel corpus
Deployment: cloud
Cost to train*: $76 per hour of training
Cost to maintain*: —
Cost to translate*: $80 per 1M symbols
* base pricing tier
19. IBM Cloud Language Translator v3 (October 2018)
Language Pairs: 48
Customization: parallel corpus, glossary
Deployment: cloud**
Cost to train*: free
Cost to maintain*: $15 per month
Cost to translate*: $100 per 1M symbols
* advanced pricing tier
** with optional no-trace
20. Microsoft Custom Translator β APIv3 (October 2018)
Language Pairs: 74***
Customization: parallel corpus, glossary***
Deployment: cloud**
Cost to train*: $10 per 1M symbols of training data (capped at $300)
Cost to maintain*: $10 per month
Cost to translate*: $40 per 1M symbols
* base pricing tier
** no trace
*** since October 25, 2018
21. ModernMT Enterprise Edition (October 2018)
Language Pairs: 45
Customization: parallel corpus
Deployment: cloud**, on-premise
Cost to train*: free
Cost to maintain*: free
Cost to translate*: €4 per 1,000 words (~$960 per 1M symbols)
* base pricing tier
** claims non-exclusive rights on content
22. Tilde Custom Machine Translation (October 2018)
Language Pairs: ?*
Customization: parallel corpus
Deployment: cloud, on-premise
Cost to train**: N/A
Cost to maintain**: N/A
Cost to translate**: N/A
* language pair support provided on-demand
** no public pricing available
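Taking the base-tier prices quoted on the preceding slides at face value, the different TCO models can be compared with a quick back-of-the-envelope script. This is a sketch only: the monthly volume, the assumed 10 hours of AutoML training, and the engine list are illustrative assumptions, not vendor figures.

```python
# Rough first-year TCO sketch for a domain-adaptive NMT engine, using the
# base-tier prices quoted on the slides above (as of October/November 2018).
# The monthly volume and training hours are illustrative assumptions.

MONTHLY_VOLUME = 2_000_000  # symbols translated per month (assumption)

engines = {
    # name: (one-off training cost, monthly maintenance, cost per 1M symbols)
    "Google AutoML Translation":   (76 * 10, 0, 80),  # assumes ~10 h of training
    "IBM Language Translator v3":  (0, 15, 100),
    "Microsoft Custom Translator": (300, 10, 40),     # training cost capped at $300
    "ModernMT Enterprise":         (0, 0, 960),       # ~$960 per 1M symbols
}

def first_year_tco(train, monthly, per_million, volume=MONTHLY_VOLUME):
    """One-off training + 12 months of maintenance + 12 months of translation."""
    return train + 12 * monthly + 12 * (volume / 1_000_000) * per_million

for name, costs in sorted(engines.items(), key=lambda kv: first_year_tco(*kv[1])):
    print(f"{name:28s} ${first_year_tco(*costs):>10,.0f}")
```

Per-symbol translation cost dominates at this volume, which is why the cheap-to-train but expensive-to-run options fall behind as usage grows.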
25. Intento
OTHER DIMENSIONS
Different learning curves

Some engines start low but improve fast
—
Others start high and improve slowly
—
Also, some have unusual behavior
—
Depends on the language pair and domain
—
Worth exploring to plan the engine update & re-evaluation strategy
27. Intento
Language Support by MT engines
[Chart: total vs. unique supported language pairs per MT engine (log scale): Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom]
https://bit.ly/mt_jul2018
29. Intento
MT QUALITY EVALUATION
Getting the best of both worlds
1. Run automatic scoring at scale
2. Filter out outlier MT engines
3. Extract meaningful segments for LQA
4. Run LQA
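The four-step funnel above can be sketched as follows. This is a minimal illustration, not Intento's pipeline: `difflib`'s similarity ratio stands in for a real reference-based metric (BLEU, chrF, hLEPOR, …), and the data layout is hypothetical.

```python
# Sketch of the scoring funnel: score at scale, drop outlier engines,
# pick the highest-disagreement segments for human LQA.
from difflib import SequenceMatcher
from statistics import mean, stdev

def segment_score(hypothesis: str, reference: str) -> float:
    # Stand-in for a real MT metric such as BLEU or chrF.
    return SequenceMatcher(None, hypothesis, reference).ratio()

def score_engines(outputs: dict[str, list[str]], references: list[str]) -> dict[str, float]:
    """Corpus-level score per engine: mean of per-segment scores."""
    return {name: mean(segment_score(h, r) for h, r in zip(hyps, references))
            for name, hyps in outputs.items()}

def filter_outliers(scores: dict[str, float], z: float = 1.5) -> dict[str, float]:
    """Keep engines within z standard deviations below the mean (needs >= 2 engines)."""
    mu, sd = mean(scores.values()), stdev(scores.values())
    return {n: s for n, s in scores.items() if s >= mu - z * sd}

def segments_for_lqa(outputs, references, keep, n=45):
    """Pick the n segments where the surviving engines disagree most."""
    def spread(i):
        vals = [segment_score(outputs[e][i], references[i]) for e in keep]
        return max(vals) - min(vals)
    return sorted(range(len(references)), key=spread, reverse=True)[:n]
```

Sending only disagreement-heavy segments to LQA keeps the human effort small (the deck uses 45 segments across 5 experts) while still separating the surviving engines.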
33. Intento
(N)MT quirks

Source: “It’s not a story the Jedi would tell you.”
MT output: “” (empty)
As if the engine said: “Star Wars franchise is overrated”

A good NMT engine, in the top 5% for a couple of language pairs
34. Intento
(N)MT quirks

Source: “Unisex Nylon Laptop Backpack School Travel Rucksack Satchel Shoulder Bag”
MT output: “рюкзак” (a backpack)
As if the engine said: “Author is an idiot. I will fix it!”

Otherwise a good NMT from a famous brand
36. Intento
(N)MT quirks

Source: “Revenue X, profit Y.”
MT output: “Выручка X, прибыль Y, EBITDA Z” (Revenue X, profit Y, EBITDA Z — a figure the source never mentioned)
As if the engine said: “I see you love numbers, here you are!”

A good new NMT engine, best at some language pairs
38. Intento
(N)MT quirks

Source: “And here you may buy our new product X”
MT output: “And here you may buy our new product X https://zzz.com/SKUXXX”
As if the engine said: “Let me google it for you…”

A good new NMT engine, best at some language pairs
40. Intento
MT QUALITY EVALUATION
Expert judgement
A. Human LQA: error type and severity labeled by linguists or subject-matter experts
B. Post-editing effort: zero-edits, keystrokes and time spent
41. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Blind within-subjects review
An expert receives a source segment and all translations (including
the human reference) without labels. 45 segments are distributed
across 5 experts.
—
Human LQA was conducted by Logrus IT using their issue type
and issue severity metrics (see Slides 26-27)
—
Each expert records all issues and their estimated severity rates
for every translated segment. Segments without errors are
considered “perfect”.
—
Typically, for LQA of a single text, the weighted normalized sum of errors (errors per word) is used. For the engine-selection problem we aggregated errors at the segment level (see slide 30 for details).
42. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Issue types (© Logrus IT)

Adequacy: Adequacy of translation defines how closely the translation follows the meaning of and the message conveyed by the source text. Problems with adequacy can reflect additions and omissions, mistranslation, partially translated or untranslated pieces, or pieces that should not have been translated at all.

Readability: Intelligibility of translation defines how easy it is to read/consume and understand the target text/content.

Language: Language issues include all deviations from formal language rules and requirements. Includes grammar, spelling and typography issues.

Terminology: A term (domain-specific word) is translated with a term other than the one expected for the domain or otherwise specified in a terminology glossary or client requirements. Alternatively, terminology can be correct but inconsistent throughout the content.

Locale: This category refers only to whether the text is given the proper mechanical form for the locale, not whether the content applies to the locale or not.

Style: Style measures compliance with existing formal style requirements as well as language, cultural and regional specifics.

© Logrus IT, 2018
43. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Issue severity (© Logrus IT)

Critical: Showstopper-level errors are the ones that have the biggest, sometimes dramatic impact on the reader’s perception. Showstoppers are regular errors that can result in dire consequences for the publisher, including causing life or health risks, equipment damage, violating local or international laws, unintentionally offending large groups of people, potential risks of misinformation and/or incorrect or dangerous user behavior, etc. Typically the content should not be published without fixing all showstopper-level problems first.

Major: The issue is serious and has a noticeable effect on the overall text perception. Typical examples include locale errors (like incorrect date, numeric or currency format) as well as problems with translation adequacy or readability for a particular sentence or string.

Medium: The issue has a noticeable but moderate effect on the overall text perception. Typical examples include incorrect capitalization, wrong markup and regular spelling errors. While somewhat annoying and more noticeable, medium-severity issues still do not result in misinformation and do not affect the reader’s perception seriously.

Minor: The issue has minimal effect on the overall text perception. Typical examples include dual spaces or incorrect typography (as far as it is not misleading and does not change the meaning). Formally, a dual space in place of a single one or a redundant comma represents an error, but its effect on the reader’s perception is minimal.

Preferential: Use this severity for recommendations and preferential issues that should not affect the rating.

© Logrus IT, 2018
44. November 2018 © Intento, Inc.
Linguistic Quality Analysis
Dealing with reviewer disagreement

For LQA, accurate counting of errors may not work. Human reviewers disagree in several situations:
• Stop counting errors after discovering a critical one vs. counting all errors in a segment (including multiple critical ones).
• Major vs. critical severity (e.g. if a dependent clause is missing).
• Counting several consecutive errors as one vs. many.
—
Weighted-average ranking is not stable in the presence of such disagreement.
—
Mitigating the disagreement:
• Count critical errors as major (in both cases post-editing is likely to take as much time as translating from scratch).
• Count not individual errors, but the number of segments at each severity level (including zero errors).
—
NMT customization should have the most impact on terminology. 84% of terminology errors have “Medium” severity, hence we rank engines by the number of segments with severity < “Medium” (see the next slide).
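The mitigation just described can be sketched in a few lines. The record format (one list of issue severities per segment, per engine) is hypothetical; the logic follows the slide: collapse critical into major, take each segment's worst severity, and rank engines by the share of segments whose worst issue is below “Medium”.

```python
# Sketch of the disagreement mitigation: critical counts as major, and engines
# are ranked by the share of segments with worst severity below "medium"
# (i.e. minor, preferential, or no issues at all).
SEVERITY_RANK = {"none": 0, "preferential": 1, "minor": 2, "medium": 3, "major": 4}

def worst_severity(issues: list[str]) -> str:
    """Worst severity among a segment's issues, counting critical as major."""
    issues = ["major" if s == "critical" else s for s in issues]
    return max(issues, key=SEVERITY_RANK.__getitem__, default="none")

def rank_engines(lqa: dict[str, list[list[str]]]) -> list[tuple[str, float]]:
    """lqa maps engine -> per-segment lists of issue severities.
    Returns engines sorted by share of segments below 'medium' severity."""
    def share_ok(segments):
        ok = sum(SEVERITY_RANK[worst_severity(seg)] < SEVERITY_RANK["medium"]
                 for seg in segments)
        return ok / len(segments)
    return sorted(((e, share_ok(segs)) for e, segs in lqa.items()),
                  key=lambda t: t[1], reverse=True)
```

Counting segments instead of individual errors makes the ranking insensitive to whether a reviewer logged one consecutive error or three in the same segment.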
48. November 2018 © Intento, Inc.
Post-Editing Effort
“Hard evidence”

Post-editor accepts or corrects MT output.
—
Things to watch: zero-edits, keystrokes, time to edit.
—
Highly dependent on tools, workflow (simple or complex task) and PE training (up to 4x difference across reviewers).
—
Requires PE tracking (e.g. PET), but it is better to use production tools.
—
Results may be surprising (e.g. one engine produces 2x more perfect segments, while another is 4x easier to edit).
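A minimal sketch of the two “hard evidence” signals above: zero-edit rate, plus a keystroke proxy computed as character-level Levenshtein distance between MT output and its post-edited version. This is an approximation; real PE tracking tools also log actual keystrokes and editing time, which a distance metric cannot recover.

```python
# Sketch of post-editing effort metrics: zero-edit rate and a crude
# keystroke proxy (character-level edit distance MT -> post-edited).

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions/deletions/substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pe_effort(mt_segments: list[str], post_edited: list[str]) -> dict[str, float]:
    """Zero-edit rate and edit operations per 100 post-edited characters."""
    pairs = list(zip(mt_segments, post_edited))
    zero_edits = sum(mt == pe for mt, pe in pairs)
    edits = sum(levenshtein(mt, pe) for mt, pe in pairs)
    chars = sum(len(pe) for _, pe in pairs)
    return {"zero_edit_rate": zero_edits / len(pairs),
            "edits_per_100_chars": 100 * edits / chars}
```

Tracking both numbers matters because, as the slide notes, one engine can win on perfect segments while another wins on how cheaply its imperfect segments are fixed.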
49. Intento
EVALUATION IN USE
SMALL PEMT PROJECTS
1. Get a list of stock MT engines for your language pair
—
2. Select 4-5 candidate MT engines
—
3a. Enable them in your CAT tool
—
3b. Translate everything with the 4-5 candidate engines and upload the results as a TM in your CAT tool
—
4. Choose per segment as you translate
50. Intento
EVALUATION IN USE
MEDIUM / ONGOING / MT-FIRST
1. Prepare a reference translation (1,500-2,000 segments)
—
2. Get a list of stock MT engines for your language pair
—
3. Translate the sample with the appropriate engines
—
4. Calculate a reference-based score for the MT results
—
5. Evaluate top-performing engines manually
—
6. Translate everything with the winning engine
51. Intento
EVALUATION IN USE
LARGE PROJECTS / MT ONLY
1. Evaluate stock engines and get a baseline quality score
—
2. Prepare a term base and a domain adaptation corpus (from
10K segments)
—
3. Train appropriate custom NMT engines
—
4. Evaluate custom NMT to see if it works better than stock MT
—
5. Update and re-evaluate the winning model as you collect
more post-edited content