Enterprise MT Content Drift: Challenges, Impacts and Advanced Solutions. Presentation by Olga Beregovaya from Welocalize and Alon Lavie from Safaba, presented at AMTA 2014 (Association for Machine Translation in the Americas), Vancouver, October 2014.
4. Content Sent Through the TM/MT Process
Dell content types handled through the MT post-editing process vary
across different – mainly Marketing – content categories:
Partner Marketing
Global Support
Channel Support
Consumer Marketing Communication
Corporate HR
Customer Proposals
Product Launches
Global eDell Web content
The daily/weekly/monthly volumes per content type vary depending
on the Dell business priorities
5. Translation with MT Post-Editing for Dell
Translation Setup:
Source document is pre-translated by translation memory matches augmented
by Safaba MT
Translation Memory “fuzzy match” threshold typically 75-85%
Pre-translations are presented to human translator as starting point for editing;
translators can use or ignore the suggested pre-translations
Currently 28 languages go through the TM/MT workflow
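The pre-translation routing described above can be sketched as a small decision function. This is an illustrative sketch only; the function names and the TM/MT interfaces are assumptions, not Safaba's or Welocalize's actual API:

```python
def pretranslate(segment, tm_lookup, mt_translate, fuzzy_threshold=0.75):
    """Route one source segment through the TM/MT pre-translation step.

    tm_lookup returns a (fuzzy_score, target_text) pair or None;
    mt_translate returns a machine translation of the segment.
    Matches at or above the fuzzy threshold (typically 0.75-0.85)
    come from the translation memory; everything else falls back to MT.
    The human post-editor is free to use or ignore either suggestion.
    """
    match = tm_lookup(segment)
    if match is not None and match[0] >= fuzzy_threshold:
        return ("TM", match[1])
    return ("MT", mt_translate(segment))
```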
Post-Editing Productivity Assessment:
Contrastive translation projects that measure and compare translation team
productivity with MT post-editing versus translation using just translation
memories
Productivity measured by contrasting translated words per hour under both
conditions: MT-PE throughput / HT throughput
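The throughput comparison reduces to a one-line ratio; the numbers in the example are illustrative, not from the Dell program:

```python
def productivity_delta(mtpe_words_per_hour, ht_words_per_hour):
    """Ratio of MT post-editing throughput to TM-only translation
    throughput: 1.0 means no gain, 1.5 means 50% more words per hour."""
    return mtpe_words_per_hour / ht_words_per_hour

# Illustrative numbers: 900 post-edited vs. 600 translated words per hour
print(productivity_delta(900, 600))  # 1.5
```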
6. MT Post-Editing Productivity Assessment
Evaluated by Welocalize in the context of the Dell MT Program
[Chart: per-language Productivity Delta (0-300%, left axis) plotted against BLEU and PE Distance (0-90, right axis)]
10. Dell - “High-traffic” MT Program
[Chart: Q4/Q3/Q2/Q1 quarterly volumes]
Quarterly MT throughput
volumes allow Welocalize
and Safaba to accumulate
post-edits sufficient for far
more frequent re-trainings
than scheduled maintenance
engine updates
11. Final Post-Edited Output Quality
MT quality results are consistently above target; if the engine degrades,
translators must compensate with additional effort
Continuously above target, monitoring trend
[Chart: weekly post-edited quality, Result vs. Target, Weeks 40-53; y-axis 99.30%-100.00%, with Result continuously above Target]
12. Safaba EMTGlobal
Client-Specific MT Adaptation
The majority of the MT systems Safaba develops are built and
optimized for specific client content types
Data Scenario:
Some amount of client-specific data: translation memories, terminology
glossaries and monolingual data resources
Additional domain-specific and general background data resources: other
client-specific content types, TAUS data, other general parallel and
monolingual background data
13. Safaba EMTGlobal
Client-Specific MT Adaptation
Safaba Suite of Adaptation Approaches:
Data selection, filtering and prioritization methods
Data mixture and interpolation methods
Model mixture and interpolation methods
Client-specific Automated Post-Editing (Language Optimization Engine)
Styling and Formatting post-processing modules
Terminology and DNT runtime overrides
14. Enterprise Content Drift
Client-specific Enterprise MT systems often degrade in performance over time
for two main reasons:
1. Client content, even in controlled domains, gradually changes over time:
new products, new terminology, new content developers
2. The typical integrated setup of MT and translation memories: TMs are
updated more frequently, so over time, only “harder” source segments are
sent for translation to MT
Full MT system retraining is currently resource- and time-consuming:
MT systems are relatively static – they are fully retrained only periodically (typically
only a couple of times per year)
The Result: MT accuracy for new projects declines over time; post-editing
productivity also declines over time
We see strong evidence of “content drift” over time with many of our clients,
especially in post-editing setups
15. Evidence from Safaba EMTGlobal Systems for Dell MT Program:
BLEU scores before and after retraining on held out “recent”
incremental data
[Chart: BLEU scores (0-70) before and after retraining, on held-out 2013 and 2014 incremental data]
16. Enterprise Content Drift
Evidence from a typical client-specific MT system:
EMTGlobal English-to-German Dell MT System:
February 2013 System: 565K client + 964K background segments
March 2014 System: 594K client + 6,795K background segments
Two test sets:
“Original” test set from February 2013 system build (1,200 segments)
“Incremental” test set extracted from incremental data (500 segments)
System Test Scores and Statistics:
Lang | System     | Gloss Inconsist. | Orig. BLEU | Orig. MET | Orig. TER | Orig. LEN | Orig. OOVs | Incr. BLEU | Incr. MET | Incr. TER | Incr. LEN | Incr. OOVs
DE   | Feb. 2013  | 55.7%            | 51.0       | 63.4      | 38.2      | 101.2     | 63         | 41.7       | 56.6      | 45.0      | 101.2     | 107
DE   | March 2014 | 24.8%            | 52.9       | 64.2      | 36.9      | 100.5     | 33         | 60.5       | 69.9      | 30.3      | 99.9      | 31
22. Enterprise Content Drift
Analysis of Content Drift Over Time:
Three EMTGlobal MT systems for Dell:
English to Chinese, Spanish and German
Systems trained and deployed in February 2013
Test sets:
“Original” test set from February 2013 system build (1,200
segments)
“Incremental” test set extracted from 2014 incremental data (500
segments)
Data sets extracted from live Dell production projects in August 2013,
December 2013 and March 2014, along with their post-edited references
23. Enterprise Content Drift
Analysis of Content Drift Over Time:
[Chart: BLEU scores (0-70) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
24. Identifying Enterprise Content Drift
Content Drift Indicators
Goal: Establish real-time quantifiable measures that are indicative of Enterprise
Content Drift
Immediate: Available immediately at MT production time, prior to any post-editing of
the MT output
Predictive: Strongly correlate with expected MT evaluation score and post-editing effort
Similar to real-time MT Quality Estimation scores, but specific to capturing content drift
Three Measures:
Core Out-of-Vocabulary (OOV) Type and Token fractions:
Fraction of source types (tokens) that were out-of-vocabulary in the core MT system (OOVs)
Source-side Unigram Coverage:
Fraction of source type (token) unigrams that were observed in the MT system training data
Source-side Trigram Coverage:
Fraction of source type (token) trigrams that were observed in the MT system training data
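A minimal sketch of how the three indicators above could be computed over a batch of tokenized source text. The function name and the simplified token-level counting are assumptions; the production indicators also track type-level statistics:

```python
def drift_indicators(source_tokens, train_vocab, train_trigrams):
    """Token-level content-drift indicators for one batch of source text.

    train_vocab is the set of unigram types seen in the MT training data;
    train_trigrams is the set of trigrams seen there. Returns the OOV
    token fraction, unigram coverage and trigram coverage.
    """
    n = len(source_tokens)
    oov = sum(1 for tok in source_tokens if tok not in train_vocab)
    trigrams = list(zip(source_tokens, source_tokens[1:], source_tokens[2:]))
    covered = sum(1 for tri in trigrams if tri in train_trigrams)
    return (oov / n,
            1 - oov / n,
            covered / len(trigrams) if trigrams else 0.0)

# Illustrative batch: one OOV token, one of two trigrams covered
print(drift_indicators(["a", "b", "c", "d"],
                       {"a", "b", "c"},
                       {("a", "b", "c")}))  # (0.25, 0.75, 0.5)
```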
25. Identifying Enterprise Content Drift
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
[Chart: OOVs (fraction of tokens, 0-6%) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
26. Identifying Enterprise Content Drift
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
[Chart: Source Trigram Coverage (0-80%) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
27. “Overnight” Incremental Adaptation
Objective: Counter “content drift” and help maintain and accelerate post-editing
productivity with fast and frequent incremental adaptation retraining
Setting: New additional post-edited client data is deposited and made available for
adaptation in small incremental batches
Challenge: Full offline system retraining is slow and computationally intensive and
can take several days
Safaba Solution: implement fast “light-weight” adaptations that can be executed,
tested and deployed into production within hours (“overnight”)
Suffix-array variant of Moses supports rapid updating of indexed training data
Safaba Language Optimization Engine (automated post-editing module) supports rapid retraining
KenLM supports rapid rebuilding of language models
Currently in pilot testing with Welocalize and Dell
28. Safaba Overnight Retraining
The Approach:
Goal: Rapid MT System Adaptation using Incremental Data
Current Approach: Language Optimization Engine (LOE) Incremental Retraining
Safaba EMTGlobal MT systems include a core MT engine and a target-side Language Optimization
Engine
Retraining the LOE component is fast – typically within a few hours
Not equivalent to full MT system retraining, but effective in closing the gap
New Approach: EMTGlobal v4.0 Advanced Adaptation Technology:
Supports significantly improved client-specific adaptation within the core MT engine
Supports rapid incremental retraining of core MT engines
Much closer to full MT system retraining, in a time frame similar to LOE retraining
Will be available in late Q4 of 2014
29. Safaba Overnight Retraining
The Approach:
Full Solution: Overnight Retraining
Incremental data from post-edited MT projects is delivered to Safaba
Incremental system retraining is launched automatically, completed within hours
Newly-adapted version of the MT system is automatically tested and QAed for
quality
Newly-adapted version of the MT system is deployed into production
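The four steps above can be sketched as a single decision loop. All callables here are hypothetical stand-ins for the actual pipeline stages, not Safaba's real interfaces:

```python
def overnight_retrain(batch, retrain, evaluate, deploy, baseline_score):
    """One cycle of the overnight-retraining loop described above.

    The newly adapted system is deployed into production only if its
    automatic QA score does not fall below the current baseline.
    """
    candidate = retrain(batch)    # incremental retraining, hours not days
    score = evaluate(candidate)   # automatic test / QA step
    if score >= baseline_score:
        deploy(candidate)         # push into production
        return ("deployed", score)
    return ("rejected", score)
```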
30. Safaba Overnight Retraining
The Pilot Project
Pilot project with Welocalize to assess impact of Overnight Retraining on Safaba
EMTGlobal Dell MT systems, using samples of real post-edited translation project data
Setup:
Languages: English to Chinese, Spanish and German
Baseline Systems: 2014 retrained Dell EMTGlobal 3.0 MT systems
Incremental Data: Three batches of incremental data from live translation projects
Methodology:
Three versions of the MT systems:
Baseline
Baseline + Retrained on Data Set #1
Baseline + Retrained on Data Set #1 & #2
MT Evaluation:
Translate Data Set #3 (unseen) with the three versions of the MT system
Assess impact on translation performance using automated MT evaluation metrics
Additional analysis using Safaba “Content Drift Indicators”
31. Safaba Overnight Retraining
Data
Language pair | Original segments (Set 1 / Set 2 / Set 3) | Post-filtering (Set 1 / Set 2 / Set 3)
ENUS-ESXL     | 1108 / 4553 / 704                         | 926 / 2411 / 528
ENUS-ZHCN     | 3191 / 2181 / 1328                        | 1143 / 1084 / 714
ENUS-DEDE     | 3043 / 1220 / 2270                        | 2325 / 977 / 1466
32. Pilot Results: Automated Metric Scores
English-to-Chinese:
Incremental Adaptation of Language Optimization Engine (LOE)
Incrementally retraining on Data Sets #1 & #2 results in gain of +3.0 BLEU points on Data Set
#3
[Chart: BLEU, METEOR and TER scores (30-70) for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]
33. Safaba Overnight Retraining
Pilot Results: Content Drift Indicator Statistics
English-to-Chinese:
Incremental Adaptation of Language Optimization Engine (LOE)
Adding Data Sets #1 & #2 reduces Data Set #3 OOVs by 0.3%, improves unigram coverage
by 0.36% and improves trigram coverage by 14.22%
[Charts: OOV Tokens (0-7%) and Unigrams/Trigrams Covered (0.4-1.0) for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]
34. Preliminary Results: Advanced Adaptation with EMTGlobal v4.0
English-to-Chinese:
Incremental Adaptation with EMTGlobal v4.0
Incrementally retraining on Data Sets #1 & #2 results in gain of +6.8 BLEU points on Data
Set #3
[Chart: BLEU, METEOR and TER scores (30-75) for the 2014 Baseline and Baseline+DS1&2]
35. Safaba Overnight Retraining
Summary of Pilot Results
Excellent results for English-to-Chinese!
Spanish and German results show no gain or loss in MT accuracy as a result of LOE
incremental retraining with the available data sets
Performance on Data Set #3 remains completely flat with both retrainings according
to all automated metrics
Data analysis with Content Drift Indicators reveals that Data Sets #1 & #2 for these
two language pairs did not contain novel translations sufficient for improving MT
performance on Data Set #3
No significant reduction in Data Set #3 OOVs
No significant improvement in coverage of source-side n-grams
36. Overnight Retraining Pilot Evaluation Setup
Translators were asked to compare each engine iteration using the same source strings
Day 1: read the MT output first. Then read the source text (ST). Then score the segment for Adequacy and Fluency
Adequacy
On a 4-point scale, rate how much of the meaning is rendered in the translation:
4 Everything
3 Most
2 Little
1 None
Fluency
Rate on a 4- point scale the extent to which the translation is well-formed grammatically, contains correct spellings, adheres to common
use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker:
4 Flawless
3 Good
2 Disfluent
1 Incomprehensible
*Based on TAUS Adequacy/Fluency Guidelines
Comparing the iterations: compare the NEW MT output to that of the previous week and indicate with an X in the corresponding column
whether it is better / worse / equal.
If it is better or worse, indicate in the error categories & comments column what has improved or regressed.
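The TAUS-style rating scheme above can be captured in a small validation helper. This sketch is illustrative and not part of the actual evaluation tooling:

```python
# TAUS-style 4-point scales used in the pilot evaluation
ADEQUACY = {4: "Everything", 3: "Most", 2: "Little", 1: "None"}
FLUENCY = {4: "Flawless", 3: "Good", 2: "Disfluent", 1: "Incomprehensible"}

def record_rating(segment_id, adequacy, fluency):
    """Validate one segment rating and return it in labeled form."""
    if adequacy not in ADEQUACY or fluency not in FLUENCY:
        raise ValueError("ratings must be integers on the 1-4 scale")
    return {"segment": segment_id,
            "adequacy": ADEQUACY[adequacy],
            "fluency": FLUENCY[fluency]}
```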
37. Interim Chinese Pilot Results - Welocalize
• The human evaluation results we have at our disposal are a work in progress, based on
evaluating a small subset of translated data and just one iteration of “overnight
retraining”
• The improvements observed by automated metrics are not yet reflected in the human
assessment
• Human evaluation results are consistent between Baseline+DS1 and DS2: no degradation is
introduced, but from the translators' perspective no significant change in quality is captured;
this may require a larger evaluation set or a different approach to evaluation string
selection
Translator feedback: improvement in fluency but no improvement in capturing the
meaning of the whole sentence; punctuation has improved, but the translation still needs
improvement; part of the sentence is now more fluent
38. Welocalize “Wins” from “Overnight Retraining”
We need to be more granular than “Quality” and look at
“Relevance” (coverage and fluency will increase based on Safaba
findings)
Our expected benefits from this approach (contingent on sufficient daily
volumes):
No need to wait for scheduled retrainings
Two things happen: the translator gets more used to post-editing,
and the MT engines catch up with changes in the
source content in “live” mode
Benefit for the client: once actual ongoing engine relevance
statistics have been captured, we will be able to predict higher
throughputs and offer better discounts
39. Summary and Conclusions
Enterprise Content Drift is a natural and frequent phenomenon in large-scale
commercial MT implementation projects
Enterprise MT systems need to adapt constantly, or they are likely to
degrade significantly in translation accuracy and value over time
Safaba’s Content Drift Indicators can identify and quantify content drift
and can be effectively used to predict the impact of incremental MT
system retraining
Are being incorporated into Safaba’s new EMTGlobal MT Monitoring Portal
Safaba’s “Overnight Retraining” incremental adaptation is effective in
combating content drift and maintaining/improving MT system
performance over time and maintaining translator productivity levels
Safaba’s upcoming EMTGlobal v4.0 will dramatically enhance these
capabilities!
Editor's Notes
Beyond the quality standards
Engine retraining and additional productivity gains
A solid engine translates into solid gains
In translators’ words – 1.5 years into the program
Meeting high quality standards while reducing turnaround times requires high-quality up-to-date MT output