Evaluating Domain-Adaptive Neural Machine Translation (IMUG, March 2019)

2. Intento
AGENDA
1 INTRO & DISCLAIMERS
2 DOMAIN-ADAPTIVE NMT
3 AVAILABLE SOLUTIONS
4 EVALUATION PROCESS
© Intento, Inc. / March 2019
4. Intento
1 HOW WE STARTED TO EVALUATE MT
Universal API Middleware
—
Noticed 300x price difference for Machine Translation
—
Are price & language support the only differences?
—
Evaluated stock MT engines: performance differs as well!
—
Also, the landscape changes fast!
7. Affordable Custom NMT — Domain-Adaptive Models

“Custom NMT” (from scratch):
• builds upon: open or proprietary NMT frameworks
• training data size: 1M…10M segments
• training process: heavily curated
• main cost drivers: licenses, human services ($$$$-$$$$$)
• super expensive just to try

Domain-Adaptive NMT:
• builds upon: baseline models or datasets as a service
• training data size: 1K…1M segments
• training process: automatic
• main cost drivers: computing time ($$-$$$)
8. 2018 in Machine Translation
Rise of Domain-Adaptive NMT*

[Timeline, Sep 2017 – Oct 2018: Lilt Adaptive NMT, Systran PNMT, IBM Custom NMT, Microsoft Custom Translate, Globalese Custom NMT, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise]

* Neural Machine Translation with automated customisation using domain-specific corpora, also known as domain adaptation.
11. Intento
4 EVALUATION PROCESS
4.1 Identify the goals (PE vs. raw MT vs. IR, etc.)
—
4.2 Define set of projects
—
4.3 Select candidate engines (stock and adaptive)
—
4.4 Large-scale automatic scoring
—
4.5 Human evaluation
—
4.6 Identify complementary engines
—
4.7 Setup workflow, train people, track performance (PE or conversions)
12. Intento
4.1 IDENTIFY THE GOALS
Use case: PEMT, Raw MT or Information Retrieval
—
Mission critical? (inbound vs outbound)
—
How to calculate ROI?
—
Expected gains?
—
BATNA?
13. Intento
4.2 IDENTIFY DISTINCT PROJECTS
One TM = One Project
—
The fewer, the better (an error matrix will help)
—
Track productivity to identify new projects
14. Intento
4.3 PROJECT ATTRIBUTES
Language pair
—
Domain
—
Availability of reference data
—
Availability and volume of training data
—
Availability of glossary
Data Protection requirements
—
Data Locality requirements
—
Contracting requirements
—
Deployment requirements
—
Usage scenario (monthly and peak volume, frequency of updates)
15. Intento
4.4 SELECTING CANDIDATE ENGINES
Language support
—
Domain- and dialect-specific baseline models
—
Minimal training data requirements
—
Glossary support
Data Protection
—
Deployment options (regions)
—
Contracting jurisdictions, payment options
—
Deployment options (cloud, on-premise etc)
—
TCO model
17. Globalese Custom NMT (November 2018)
Language Pairs: “all of them”
Customization: parallel corpus
Deployment: cloud, on-premise
Cost to train*: —
Cost to maintain*: $58 per month
Cost to translate*: — (limited volume)
* base pricing tier
18. Google Cloud AutoML Translation β (October 2018)
Language Pairs: 50
Customization: parallel corpus
Deployment: cloud
Cost to train*: $76 per hour of training
Cost to maintain*: —
Cost to translate*: $80 per 1M symbols
* base pricing tier
19. IBM Cloud Language Translator v3 (October 2018)
Language Pairs: 48
Customization: parallel corpus, glossary
Deployment: cloud**
Cost to train*: free
Cost to maintain*: $15 per month
Cost to translate*: $100 per 1M symbols
* advanced pricing tier
** with optional no-trace
20. Microsoft Custom Translator β APIv3 (October 2018)
Language Pairs: 74***
Customization: parallel corpus, glossary***
Deployment: cloud**
Cost to train*: $10 per 1M symbols of training data (capped at $300)
Cost to maintain*: $10 per month
Cost to translate*: $40 per 1M symbols
* base pricing tier
** no trace
*** since October 25, 2018
21. ModernMT Enterprise Edition (October 2018)
Language Pairs: 45
Customization: parallel corpus
Deployment: cloud**, on-premise
Cost to train*: free
Cost to maintain*: free
Cost to translate*: €4 per 1,000 words (~$960 per 1M symbols)
* base pricing tier
** claims non-exclusive rights on content
22. Tilde Custom Machine Translation (October 2018)
Language Pairs: ?*
Customization: parallel corpus
Deployment: cloud, on-premise
Cost to train**: N/A
Cost to maintain**: N/A
Cost to translate**: N/A
* language pair support provided on-demand
** no public pricing available
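Taking the base-tier prices quoted on the preceding slides at face value, the different TCO models can be compared with a quick back-of-the-envelope script. This is a sketch only: the monthly volume, the assumed 10 hours of AutoML training, and the engine list are illustrative assumptions, not vendor figures.

```python
# Rough first-year TCO sketch for a domain-adaptive NMT engine, using the
# base-tier prices quoted on the slides above (as of October/November 2018).
# The monthly volume and training hours are illustrative assumptions.

MONTHLY_VOLUME = 2_000_000  # symbols translated per month (assumption)

engines = {
    # name: (one-off training cost, monthly maintenance, cost per 1M symbols)
    "Google AutoML Translation":   (76 * 10, 0, 80),  # assumes ~10 h of training
    "IBM Language Translator v3":  (0, 15, 100),
    "Microsoft Custom Translator": (300, 10, 40),     # training cost capped at $300
    "ModernMT Enterprise":         (0, 0, 960),       # ~$960 per 1M symbols
}

def first_year_tco(train, monthly, per_million, volume=MONTHLY_VOLUME):
    """One-off training + 12 months of maintenance + 12 months of translation."""
    return train + 12 * monthly + 12 * (volume / 1_000_000) * per_million

for name, costs in sorted(engines.items(), key=lambda kv: first_year_tco(*kv[1])):
    print(f"{name:28s} ${first_year_tco(*costs):>10,.0f}")
```

Per-symbol translation cost dominates at this volume, which is why the cheap-to-train but expensive-to-run options fall behind as usage grows.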
25. Intento
OTHER DIMENSIONS
Different learning curves

Some engines start low but improve fast
—
Others start high and improve slowly
—
Also, some have unusual behavior
—
Depends on the language pair and domain
—
Worth exploring to plan the engine update & re-evaluation strategy
27. Intento
Language Support by MT engines
[Chart: total vs. unique supported language pairs per MT engine (log scale): Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, Amazon, IBM SMT, Alibaba, GTCom]
https://bit.ly/mt_jul2018
29. Intento
MT QUALITY EVALUATION
Getting the best of both worlds
1. Run automatic scoring at scale
2. Filter out outlier MT engines
3. Extract meaningful segments for LQA
4. Run LQA
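The four-step funnel above can be sketched as follows. This is a minimal illustration, not Intento's pipeline: `difflib`'s similarity ratio stands in for a real reference-based metric (BLEU, chrF, hLEPOR, …), and the data layout is hypothetical.

```python
# Sketch of the scoring funnel: score at scale, drop outlier engines,
# pick the highest-disagreement segments for human LQA.
from difflib import SequenceMatcher
from statistics import mean, stdev

def segment_score(hypothesis: str, reference: str) -> float:
    # Stand-in for a real MT metric such as BLEU or chrF.
    return SequenceMatcher(None, hypothesis, reference).ratio()

def score_engines(outputs: dict[str, list[str]], references: list[str]) -> dict[str, float]:
    """Corpus-level score per engine: mean of per-segment scores."""
    return {name: mean(segment_score(h, r) for h, r in zip(hyps, references))
            for name, hyps in outputs.items()}

def filter_outliers(scores: dict[str, float], z: float = 1.5) -> dict[str, float]:
    """Keep engines within z standard deviations below the mean (needs >= 2 engines)."""
    mu, sd = mean(scores.values()), stdev(scores.values())
    return {n: s for n, s in scores.items() if s >= mu - z * sd}

def segments_for_lqa(outputs, references, keep, n=45):
    """Pick the n segments where the surviving engines disagree most."""
    def spread(i):
        vals = [segment_score(outputs[e][i], references[i]) for e in keep]
        return max(vals) - min(vals)
    return sorted(range(len(references)), key=spread, reverse=True)[:n]
```

Sending only disagreement-heavy segments to LQA keeps the human effort small (the deck uses 45 segments across 5 experts) while still separating the surviving engines.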
33. Intento
(N)MT quirks

Source: “It’s not a story the Jedi would tell you.”
MT output: “” (empty)
As if the engine said: “Star Wars franchise is overrated”

A good NMT engine, in the top 5% for a couple of language pairs
34. Intento
(N)MT quirks

Source: “Unisex Nylon Laptop Backpack School Travel Rucksack Satchel Shoulder Bag”
MT output: “рюкзак” (a backpack)
As if the engine said: “Author is an idiot. I will fix it!”

Otherwise a good NMT from a famous brand
36. Intento
(N)MT quirks

Source: “Revenue X, profit Y.”
MT output: “Выручка X, прибыль Y, EBITDA Z” (Revenue X, profit Y, EBITDA Z — a figure the source never mentioned)
As if the engine said: “I see you love numbers, here you are!”

A good new NMT engine, best at some language pairs
38. Intento
(N)MT quirks

Source: “And here you may buy our new product X”
MT output: “And here you may buy our new product X https://zzz.com/SKUXXX”
As if the engine said: “Let me google it for you…”

A good new NMT engine, best at some language pairs
40. Intento
MT QUALITY EVALUATION
Expert judgement
A. Human LQA: error type and severity labeled by linguists or subject-matter experts
B. Post-editing effort: zero-edits, keystrokes and time spent
41. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Blind within-subjects review
An expert receives a source segment and all translations (including
the human reference) without labels. 45 segments are distributed
across 5 experts.
—
Human LQA was conducted by Logrus IT using their issue type
and issue severity metrics (see Slides 26-27)
—
Each expert records all issues and their estimated severity rates
for every translated segment. Segments without errors are
considered “perfect”.
—
Typically, for LQA of a single text, the weighted normalized sum of errors (errors per word) is used. For the engine-selection problem we aggregated errors at the segment level (see slide 30 for details).
42. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Issue types (© Logrus IT)

Adequacy: Adequacy of translation defines how closely the translation follows the meaning of and the message conveyed by the source text. Problems with adequacy can reflect additions and omissions, mistranslation, partially translated or untranslated pieces, or pieces that should not have been translated at all.

Readability: Intelligibility of translation defines how easy it is to read/consume and understand the target text/content.

Language: Language issues include all deviations from formal language rules and requirements. Includes grammar, spelling and typography issues.

Terminology: A term (domain-specific word) is translated with a term other than the one expected for the domain or otherwise specified in a terminology glossary or client requirements. Alternatively, terminology can be correct but inconsistent throughout the content.

Locale: This category refers only to whether the text is given the proper mechanical form for the locale, not whether the content applies to the locale or not.

Style: Style measures compliance with existing formal style requirements as well as language, cultural and regional specifics.

© Logrus IT, 2018
43. November 2018 © Intento, Inc.
Human Linguistic Quality Analysis
Issue severity (© Logrus IT)

Critical: Showstopper-level errors are the ones that have the biggest, sometimes dramatic impact on the reader’s perception. Showstoppers are regular errors that can result in dire consequences for the publisher, including causing life or health risks, equipment damage, violating local or international laws, unintentionally offending large groups of people, potential risks of misinformation and/or incorrect or dangerous user behavior, etc. Typically the content should not be published without fixing all showstopper-level problems first.

Major: The issue is serious and has a noticeable effect on the overall text perception. Typical examples include locale errors (like incorrect date, numeric or currency format) as well as problems with translation adequacy or readability for a particular sentence or string.

Medium: The issue has a noticeable but moderate effect on the overall text perception. Typical examples include incorrect capitalization, wrong markup and regular spelling errors. While somewhat annoying and more noticeable, medium-severity issues still do not result in misinformation and do not affect the reader’s perception seriously.

Minor: The issue has minimal effect on the overall text perception. Typical examples include dual spaces or incorrect typography (as far as it is not misleading and does not change the meaning). Formally, a dual space in place of a single one or a redundant comma represents an error, but its effect on the reader’s perception is minimal.

Preferential: Use this severity for recommendations and preferential issues that should not affect the rating.

© Logrus IT, 2018
44. November 2018 © Intento, Inc.
Linguistic Quality Analysis
Dealing with reviewer disagreement

For LQA, accurate counting of errors may not work. Human reviewers disagree in several situations:
• Stop counting errors after discovering a critical one vs. counting all errors in a segment (including multiple critical ones).
• Major vs. critical severity (e.g. if a dependent clause is missing).
• Counting several consecutive errors as one vs. many.
—
Weighted-average ranking is not stable in the presence of such disagreement.
—
Mitigating the disagreement:
• Count critical errors as major (in both cases post-editing is likely to take as much time as translating from scratch).
• Count not individual errors, but the number of segments at each severity level (including zero errors).
—
NMT customization should have the most impact on terminology. 84% of terminology errors have “Medium” severity, hence we rank engines by the number of segments with severity < “Medium” (see the next slide).
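The mitigation just described can be sketched in a few lines. The record format (one list of issue severities per segment, per engine) is hypothetical; the logic follows the slide: collapse critical into major, take each segment's worst severity, and rank engines by the share of segments whose worst issue is below “Medium”.

```python
# Sketch of the disagreement mitigation: critical counts as major, and engines
# are ranked by the share of segments with worst severity below "medium"
# (i.e. minor, preferential, or no issues at all).
SEVERITY_RANK = {"none": 0, "preferential": 1, "minor": 2, "medium": 3, "major": 4}

def worst_severity(issues: list[str]) -> str:
    """Worst severity among a segment's issues, counting critical as major."""
    issues = ["major" if s == "critical" else s for s in issues]
    return max(issues, key=SEVERITY_RANK.__getitem__, default="none")

def rank_engines(lqa: dict[str, list[list[str]]]) -> list[tuple[str, float]]:
    """lqa maps engine -> per-segment lists of issue severities.
    Returns engines sorted by share of segments below 'medium' severity."""
    def share_ok(segments):
        ok = sum(SEVERITY_RANK[worst_severity(seg)] < SEVERITY_RANK["medium"]
                 for seg in segments)
        return ok / len(segments)
    return sorted(((e, share_ok(segs)) for e, segs in lqa.items()),
                  key=lambda t: t[1], reverse=True)
```

Counting segments instead of individual errors makes the ranking insensitive to whether a reviewer logged one consecutive error or three in the same segment.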
48. November 2018 © Intento, Inc.
Post-Editing Effort
“Hard evidence”

Post-editor accepts or corrects MT output.
—
Things to watch: zero-edits, keystrokes, time to edit.
—
Highly dependent on tools, workflow (simple or complex task) and PE training (up to 4x difference across reviewers).
—
Requires PE tracking (e.g. PET), but it is better to use production tools.
—
Results may be surprising (e.g. one engine produces 2x more perfect segments, while another is 4x easier to edit).
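A minimal sketch of the two “hard evidence” signals above: zero-edit rate, plus a keystroke proxy computed as character-level Levenshtein distance between MT output and its post-edited version. This is an approximation; real PE tracking tools also log actual keystrokes and editing time, which a distance metric cannot recover.

```python
# Sketch of post-editing effort metrics: zero-edit rate and a crude
# keystroke proxy (character-level edit distance MT -> post-edited).

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions/deletions/substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def pe_effort(mt_segments: list[str], post_edited: list[str]) -> dict[str, float]:
    """Zero-edit rate and edit operations per 100 post-edited characters."""
    pairs = list(zip(mt_segments, post_edited))
    zero_edits = sum(mt == pe for mt, pe in pairs)
    edits = sum(levenshtein(mt, pe) for mt, pe in pairs)
    chars = sum(len(pe) for _, pe in pairs)
    return {"zero_edit_rate": zero_edits / len(pairs),
            "edits_per_100_chars": 100 * edits / chars}
```

Tracking both numbers matters because, as the slide notes, one engine can win on perfect segments while another wins on how cheaply its imperfect segments are fixed.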
49. Intento
EVALUATION IN USE
SMALL PEMT PROJECTS
1. Get a list of stock MT engines for your language pair
—
2. Select 4-5 candidate MT engines
—
3a. Enable them in your CAT tool
—
3b. Translate everything with the 4-5 candidate engines and upload the results as a TM in your CAT tool
—
4. Choose per segment as you translate
50. Intento
EVALUATION IN USE
MEDIUM / ONGOING / MT-FIRST
1. Prepare a reference translation (1,500-2,000 segments)
—
2. Get a list of stock MT engines for your language pair
—
3. Translate the sample with the appropriate engines
—
4. Calculate a reference-based score for the MT results
—
5. Evaluate top-performing engines manually
—
6. Translate everything with the winning engine
51. Intento
EVALUATION IN USE
LARGE PROJECTS / MT ONLY
1. Evaluate stock engines and get a baseline quality score
—
2. Prepare a term base and a domain adaptation corpus (from
10K segments)
—
3. Train appropriate custom NMT engines
—
4. Evaluate custom NMT to see if it works better than stock MT
—
5. Update and re-evaluate the winning model as you collect
more post-edited content