EVALUATING
DOMAIN-ADAPTIVE
NEURAL MACHINE TRANSLATION
Konstantin Savenkov

CEO Intento, Inc.
© Intento, Inc.
IMUG
Adobe
March 2019, San Jose, CA
Intento
AGENDA
1 INTRO & DISCLAIMERS
2 DOMAIN-ADAPTIVE NMT
3 AVAILABLE SOLUTIONS
4 EVALUATION PROCESS
2© Intento, Inc. / March 2019
Intento
3
INTENTO
Discover, evaluate
and use best-of-
breed AI models
© Intento, Inc. / March 2019
Intento
1 HOW WE STARTED TO EVALUATE MT
Universal API Middleware
—
Noticed 300x price difference for Machine Translation
—
Are price & language support the only differences?
—
Evaluated stock MT engines - performance as well!
—
Also, changes fast!
4© Intento, Inc. / March 2019
Intento
MT QUALITY LANDSCAPE IS WILD
And changes fast!
5© Intento, Inc. / March 2019
6
I. 1949: Memorandum on Translation
II. 1996: Affordable stock RBMT
III. 2006: Affordable stock SMT
IV. 2016: Affordable stock NMT
V. Affordable custom NMT (You Are Here)
© Intento, Inc. / March 2019
Intento
2 DOMAIN ADAPTIVE NMT
Affordable Custom NMT —
Domain-Adaptive Models
7
“Custom NMT” (from scratch): builds upon open or proprietary NMT frameworks; training data size 1M…10M segments; heavily curated training process; main cost drivers are licenses and human services ($$$$-$$$$$). Super expensive just to try.
Domain-Adaptive NMT: builds upon baseline models or datasets as a service; training data size 1K…1M segments; automatic training process; main cost driver is computing time ($$-$$$).
© Intento, Inc. / March 2019
Intento
2018 in Machine Translation
Rise of Domain-Adaptive NMT*
8
[Timeline, Sep 2017 – Oct 2018: Globalese Custom NMT, Lilt Adaptive NMT, IBM Custom NMT, Systran PNMT, Microsoft Custom Translate, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise]
* Neural Machine Translation with automated customization using domain-specific corpora, also known as domain adaptation.
Intento
WHY EVALUATE?
Different learning curves
9
Starting dataset sizes
vary
—

Learning curves vary
—

Depends on language
pair and domain
© Intento, Inc. / March 2019
Intento
WHY EVALUATE?
The right choice drives ROI (English-to-German, Life Sciences)
10© Intento, Inc. / March 2019
Intento
4 EVALUATION PROCESS
4.1 Identify the goals (PE vs Raw vs IR etc.)
—
4.2 Define set of projects
—
4.3 Select candidate engines (stock and adaptive)
—
4.4 Large-scale automatic scoring
—
4.5 Human evaluation
—
4.6 Identify complementary engines
—
4.7 Set up workflow, train people, track performance (PE or conversions)
11© Intento, Inc. / March 2019
Intento
4.1 IDENTIFY THE GOALS
Use case: PEMT, Raw MT or Information Retrieval
—
Mission critical? (inbound vs outbound)
—
How to calculate ROI?
—
Expected gains?
—
BATNA? (best alternative to a negotiated agreement)
12© Intento, Inc. / March 2019
Intento
4.2 IDENTIFY DISTINCT PROJECTS
One TM = One Project
—
The fewer, the better (an error matrix will help)
—
Track productivity to identify new projects
13© Intento, Inc. / March 2019
Intento
4.3 PROJECT ATTRIBUTES
Language pair
—
Domain
—
Availability of reference data
—
Availability and volume of training data
—
Availability of glossary
14© Intento, Inc. / March 2019
Data Protection requirements
—
Data Locality requirements
—
Contracting requirements
—
Deployment requirements
—
Usage scenario (monthly and peak volume, frequency of updates)
Intento
4.4 SELECTING CANDIDATE ENGINES
Language support
—
Domain- and dialect-specific baseline models
—
Minimal training data requirements
—
Glossary support
15© Intento, Inc. / March 2019
Data Protection
—
Deployment options (regions)
—
Contracting jurisdictions, payment options
—
Deployment options (cloud, on-premise etc)
—
TCO model
Intento
3 AVAILABLE MT SOLUTIONS
16© Intento, Inc. / March 2019
November 2018© Intento, Inc.
Globalese
Custom NMT
Language Pairs “all of them”
Customization parallel corpus
Deployment cloud
on-premise
Cost to train* -
Cost to maintain* $58 per month
Cost to translate* - (limited volume)
17
* base pricing tier
October 2018© Intento, Inc.
Google Cloud
AutoML Translation β
Language Pairs 50
Customization parallel corpus
Deployment cloud
Cost to train* $76 per hour of training
Cost to maintain* -
Cost to translate* $80 per 1M symbols
18
* base pricing tier
October 2018© Intento, Inc.
IBM Cloud
Language Translator v3
Language Pairs 48
Customization parallel corpus
glossary
Deployment cloud**
Cost to train* free
Cost to maintain* $15 per month
Cost to translate* $100 per 1M symbols
19
* advanced pricing tier
** with optional no-trace
October 2018© Intento, Inc.
Microsoft
Custom Translator β APIv3
Language Pairs 74***
Customization parallel corpus,
glossary***
Deployment cloud**
Cost to train* $10 per 1M symbols
of training data (capped $300)
Cost to maintain* $10 per month
Cost to translate* $40 per 1M symbols
* base pricing tier
** no trace
*** since October 25, 2018
20
October 2018© Intento, Inc.
ModernMT
Enterprise Edition
Language Pairs 45
Customization parallel corpus
Deployment cloud**
on-premise
Cost to train* free
Cost to maintain* free
Cost to translate* 4 EUR per 1000 words
(~ $960 per 1M symbols)
* base pricing tier
** claims non-exclusive rights on content
21
October 2018© Intento, Inc.
Tilde
Custom Machine Translation
Language Pairs ?*
Customization parallel corpus
Deployment cloud
on-premise
Cost to train** N/A
Cost to maintain** N/A
Cost to translate** N/A
* language pair support provided on-demand
** no public pricing available
22
Intento
OTHER DIMENSIONS
Pricing - Stock Engines
23
[Chart: per-engine price, USD per 1M symbols]
© Intento, Inc. / March 2019
Intento
OTHER DIMENSIONS
Pricing - Total Cost of Ownership
24© Intento, Inc. / March 2019
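As a hypothetical worked example using the Microsoft Custom Translator rates listed earlier in this deck: a year in which the $300 training cap is reached, with 12 months of hosting at $10 and 10M translated symbols at $40 per 1M, comes to $300 + $120 + $400 = $820, before any human evaluation costs.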
Intento
OTHER DIMENSIONS
Different learning curves
25
Some engines start low, but improve
fast
—

Others start high and improve slowly
—

Also, some have unusual behavior
—

Depends on language pair and
domain
—

Worth exploring to plan the engine
update & re-evaluation strategy
© Intento, Inc. / March 2019
Intento
OTHER DIMENSIONS
Language Support
26
All stock engines combined support 14,290 language pairs out of 29,070 possible (45%)
* https://w3techs.com/technologies/overview/content_language/all
© Intento, Inc. / March 2019
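For scale: 29,070 is presumably the count of ordered pairs among 171 content languages, since 171 × 170 = 29,070.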
Intento
Language Support by MT engines
27
[Chart: total vs. unique language pairs per MT engine, log scale. Engines: Google, Yandex, Microsoft NMT, Microsoft SMT, Baidu, Tencent, Systran, Systran PNMT, PROMT, SDL Language Cloud, Youdao, SAP, ModernMT, DeepL, IBM NMT, IBM SMT, Amazon, Alibaba, GTCom]
https://bit.ly/mt_jul2018
© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Automatic Scores vs. Expert Judgement
28© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Getting the best of both worlds
29
1. Run automatic scoring at scale
2. Filter out outlier MT engines
3. Extract meaningful segments for LQA
4. Run LQA
(steps 1–2 are sketched in code below)
© Intento, Inc. / March 2019
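To make steps 1–2 concrete, here is a minimal Python sketch, assuming the sacrebleu package and using BLEU as a stand-in reference-based score; the metric choice and the 10-point outlier margin are illustrative assumptions, not the exact method behind these slides:

```python
# Sketch only: score several engines against one reference set and
# drop clear outliers before spending human LQA effort.
import sacrebleu

def score_engines(references, engine_outputs):
    """Return {engine_name: BLEU score} for hypothesis lists aligned
    with the reference segments."""
    scores = {}
    for engine, hypotheses in engine_outputs.items():
        scores[engine] = sacrebleu.corpus_bleu(hypotheses, [references]).score
    return scores

def filter_outliers(scores, margin=10.0):
    """Keep engines within `margin` BLEU points of the best one."""
    best = max(scores.values())
    return {e: s for e, s in scores.items() if s >= best - margin}
```

The surviving engines then go into LQA on the extracted segments.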
Intento
MT QUALITY EVALUATION
Reference-Based Scores
30© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Filter outlier engines
31© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Focus on segments that matter
32© Intento, Inc. / March 2019
Intento
(N)MT quirks
33
“It’s not a story the Jedi
would tell you.”
“”
“Star Wars franchise
is overrated”
A good NMT engine, in
top-5% for a couple of pairs
© Intento, Inc. / March 2019
Intento
Otherwise a good NMT from
a famous brand
(N)MT quirks
34
“Unisex Nylon Laptop
Backpack School Travel
Rucksack Satchel
Shoulder Bag”
“рюкзак”
“Author is an idiot.
I will fix it!”
(a backpack)
© Intento, Inc. / March 2019
Intento
(N)MT quirks
35
“hello, world!” “”
“Are you kidding
me?!”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / March 2019
Intento
(N)MT quirks
36
“Revenue X, profit Y.”
“Выручка X, прибыль Y,
EBITDA Z” (Russian for “Revenue X, profit Y, EBITDA Z”)
“I see you love
numbers, here you
are!”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / March 2019
Intento
(N)MT quirks
37
“3.29346” “3.29”
“Let me round it up
for you…”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / March 2019
Intento
(N)MT quirks
38
“And here you may buy
our new product X”
“And here you may buy
our new product X https:///
zzz.com/SKUXXX”
“Let me google it for
you…”
Good new NMT engine, best
at some language pairs
© Intento, Inc. / March 2019
Intento
(N)MT quirks
39
“Please, one more time” “Поёалуйста, ещж раз”
(the correct Russian is “Пожалуйста, ещё раз”; the engine swapped ж and ё)
“Let’s play ‘find the
difference’?”
Good popular NMT engine
© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Expert judgement
40
A. Human LQA – error type and severity labeled
by linguists or subject matter experts
B. Post-editing effort – zero-edits, keystrokes
and time spent
© Intento, Inc. / March 2019
November 2018© Intento, Inc.
Human Linguistic Quality Analysis
Blind within-subjects review
An expert receives a source segment and all translations (including
the human reference) without labels. 45 segments are distributed
across 5 experts.
—
Human LQA was conducted by Logrus IT using their issue type
and issue severity metrics (see the next two slides)
—
Each expert records all issues and their estimated severity rates
for every translated segment. Segments without errors are
considered “perfect”.
—
Typically, for LQA of a single text, the weighted normalized sum of
errors is used (errors per word). For the engine selection problem
we aggregated errors at the segment level (details below).
41
November 2018© Intento, Inc.
Human Linguistic Quality Analysis
Issue types (© LogrusIT)
42
Issue

Type
Description
Adequacy
Adequacy of translation defines how closely the translation follows the meaning of and the message conveyed by
the source text. Problems with adequacy can reflect additions and omissions, mistranslation, partially translated or
untranslated pieces, or pieces that should not have been translated at all.

Readability Intelligibility of translation defines how easy it is to read/consume and understand the target text/content.
Language
Language issues include all deviations from formal language rules and requirements. Includes grammar, spelling
and typography issues.
Terminology
A term (domain-specific word) is translated with a term other than the one expected for the domain or otherwise specified in a
terminology glossary or client requirements. Alternatively, terminology can be correct but inconsistent throughout the content.
Locale
This category refers only to whether the text is given the proper mechanical form for the locale, not whether the
content applies to the locale or not.
Style
Style measures compliance with existing formal style requirements as well as language, cultural and regional
specifics.
© Logrus IT, 2018
November 2018© Intento, Inc.
Human Linguistic Quality Analysis
Issue severity (© LogrusIT)
43
Issue

Severity
Description
Critical
Showstopper-level errors are the ones that have the biggest, sometimes dramatic impact on the reader’s perception.
Showstoppers are regular errors that can result in dire consequences for the publisher, including causing life or health risks, equipment damage, violating local or international
laws, unintentionally offending large groups of people, potential risks of misinformation and/or incorrect or dangerous user behavior, etc. Typically the content should not be
published without fixing all showstopper-level problems first.
Major
The issue is serious, and has a noticeable effect on the overall text perception. Typical examples include locale
errors (like incorrect date, numeric or currency format) as well as problems with translation adequacy or readability
for a particular sentence or string.
Medium
The issue has a noticeable but moderate effect on the overall text perception. Typical examples include incorrect
capitalization, wrong markup and regular spelling errors. While somewhat annoying and more noticeable,
medium-severity issues still do not result in misinformation, and do not affect the reader’s perception seriously.
Minor
The issue has minimal effect on the overall text perception. Typical examples include dual spaces or incorrect
typography (as far as it is not misleading and does not change the meaning). Formally a dual space in place of a
single one or a redundant comma represents an error, but its effect on the reader’s perception is minimal.
Preferential Use this severity for recommendations and preferential issues that should not affect the rating.
© Logrus IT, 2018
November 2018© Intento, Inc.
Linguistic Quality Analysis
Dealing with reviewer disagreement
44
For LQA, accurate counting of errors may not work. Human reviewers
disagree in several situations:
• Stop counting errors after discovering a critical one vs. count all errors in a segment
(including multiple critical).
• Major vs. critical severity (e.g. if a dependent clause is missing).
• Counting several consecutive errors as one vs. many.
—
Weighted average ranking is not stable in the presence of such disagreement.
—
Mitigating the disagreement:
• Count critical errors as major (in both cases post-editing is likely to take as much time as
translating from scratch)
• Count not individual errors, but the number of segments at each severity level
(including zero errors).
—
NMT customization should have the most impact on terminology. 84% of
terminology errors have “Medium” severity, hence we rank engines by the
number of segments with severity < “Medium” (see the next slide)
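Below is a minimal Python sketch of this aggregation, assuming a hypothetical input format of one (engine, segment_id, severity) tuple per recorded issue, plus known engine and segment counts:

```python
# Sketch only: rank engines by the number of segments whose issues are
# all below "medium" severity (or that have no issues at all).
from collections import defaultdict

SEVERE = {"critical", "major", "medium"}  # critical is folded into major anyway

def rank_by_clean_segments(issues, engines, num_segments):
    flagged = defaultdict(set)  # engine -> segment ids with a medium+ issue
    for engine, seg_id, severity in issues:
        if severity in SEVERE:
            flagged[engine].add(seg_id)
    clean = {e: num_segments - len(flagged[e]) for e in engines}
    return sorted(clean.items(), key=lambda kv: kv[1], reverse=True)
```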
Intento
MT QUALITY EVALUATION
Human LQA
45© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Do good engines make mistakes in the same
sentences?
46© Intento, Inc. / March 2019
Intento
MT QUALITY EVALUATION
Looking for complementary engines
47© Intento, Inc. / March 2019
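One way to make “complementary” measurable, as a sketch: given per-engine sets of segment ids judged imperfect in LQA (a hypothetical input format), count the segments where exactly one engine of a pair fails.

```python
# Sketch only: pairs with high counts cover each other's weaknesses.
from itertools import combinations

def complementarity(imperfect_by_engine):
    results = {}
    for a, b in combinations(imperfect_by_engine, 2):
        # symmetric difference: segments where exactly one engine fails
        results[(a, b)] = len(imperfect_by_engine[a] ^ imperfect_by_engine[b])
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```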
November 2018© Intento, Inc.
Post-Editing Effort
“Hard evidence”
Post-editor accepts or corrects MT output.
—
Things to watch: zero-edits, keystrokes, time to edit.
—
Highly dependent on tools, workflow (simple or complex
task) and PE training (up to 4x difference across different
reviewers)
—
Requires PE tracking, e.g. PET, but it is better to use production
tools
—
Results may be surprising (e.g. one engine produces 2x
more perfect segments, while another is 4x easier to edit).
48
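As a minimal sketch of the two simplest signals named above, zero-edit rate and a character-level similarity, using only the Python standard library (real setups would log keystrokes and editing time in the production CAT tool):

```python
from difflib import SequenceMatcher

def pe_effort(mt_outputs, post_edited):
    """Compare raw MT segments with their post-edited versions."""
    pairs = list(zip(mt_outputs, post_edited))
    zero_edits = sum(1 for mt, pe in pairs if mt == pe)
    avg_similarity = sum(
        SequenceMatcher(None, mt, pe).ratio() for mt, pe in pairs
    ) / len(pairs)
    return {
        "zero_edit_rate": zero_edits / len(pairs),
        "avg_char_similarity": avg_similarity,  # 1.0 means untouched
    }
```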
Intento
EVALUATION IN USE
SMALL PEMT PROJECTS
1. Get a list of stock MT engines for your language pair
—

2. Select 4-5 candidate MT engines
—

3a. Enable them in your CAT tool
—

3b. Translate everything with the 4-5 candidate engines and
upload the output as a TM in your CAT tool
—

4. Choose per segment as you translate
49© Intento, Inc. / March 2019
Intento
EVALUATION IN USE
MEDIUM / ONGOING / MT-FIRST
1. Prepare a reference translation (1,500-2,000 segments)
—

2. Get a list of stock MT engines for your language pair
—

3. Translate the sample with the appropriate engines
—

4. Calculate a reference-based score for the MT results
—

5. Evaluate top-performing engines manually
—

6. Translate everything with the winning engine
50© Intento, Inc. / March 2019
Intento
EVALUATION IN USE
LARGE PROJECTS / MT ONLY
1. Evaluate stock engines and get a baseline quality score
—

2. Prepare a term base and a domain adaptation corpus (from
10K segments)
—

3. Train appropriate custom NMT engines
—

4. Evaluate custom NMT to see if it works better than stock MT
—

5. Update and re-evaluate the winning model as you collect
more post-edited content
51© Intento, Inc. / March 2019
THANK YOU!
Konstantin Savenkov
ks@inten.to
(415) 429-0021
2150 Shattuck Ave
Berkeley CA 94705
52
