Enterprise MT Content Drift: Challenges, Impacts and Advanced Solutions. Presentation by Olga Beregovaya from Welocalize and Alon Lavie from Safaba, presented at AMTA 2014 (Association for Machine Translation in the Americas), Vancouver, October 2014.
4. Content Sent Through the TM/MT Process
Dell content types handled through the MT post-editing process vary
across different – mainly Marketing – content categories:
Partner Marketing
Global Support
Channel Support
Consumer Marketing Communication
Corporate HR
Customer Proposals
Product Launches
Global eDell Web content
The daily/weekly/monthly volumes per content type vary depending
on the Dell business priorities
5. Translation with MT Post-Editing for Dell
Translation Setup:
Source document is pre-translated by translation memory matches augmented
by Safaba MT
Translation Memory “fuzzy match” threshold typically 75-85%
Pre-translations are presented to human translator as starting point for editing;
translators can use or ignore the suggested pre-translations
Currently 28 languages go through the TM/MT workflow
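The pre-translation routing described above can be sketched as a small decision function. This is an illustrative sketch only; the function names and the TM/MT interfaces are assumptions, not Safaba's or Welocalize's actual API:

```python
def pretranslate(segment, tm_lookup, mt_translate, fuzzy_threshold=0.75):
    """Route one source segment through the TM/MT pre-translation step.

    tm_lookup returns a (fuzzy_score, target_text) pair or None;
    mt_translate returns a machine translation of the segment.
    Matches at or above the fuzzy threshold (typically 0.75-0.85)
    come from the translation memory; everything else falls back to MT.
    The human post-editor is free to use or ignore either suggestion.
    """
    match = tm_lookup(segment)
    if match is not None and match[0] >= fuzzy_threshold:
        return ("TM", match[1])
    return ("MT", mt_translate(segment))
```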
Post-Editing Productivity Assessment:
Contrastive translation projects that measure and compare translation team
productivity with MT post-editing versus translation using just translation
memories
Productivity measured by contrasting translated words per hour under both
conditions: MT-PE throughput / HT throughput
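The throughput comparison reduces to a one-line ratio; the numbers in the example are illustrative, not from the Dell program:

```python
def productivity_delta(mtpe_words_per_hour, ht_words_per_hour):
    """Ratio of MT post-editing throughput to TM-only translation
    throughput: 1.0 means no gain, 1.5 means 50% more words per hour."""
    return mtpe_words_per_hour / ht_words_per_hour

# Illustrative numbers: 900 post-edited vs. 600 translated words per hour
print(productivity_delta(900, 600))  # 1.5
```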
6. MT Post-Editing Productivity Assessment
Evaluated by Welocalize in the context of the Dell MT Program
[Chart: per-language Productivity Delta (0-300%, left axis) plotted against BLEU and PE Distance (0-90, right axis)]
10. Dell - “High-traffic” MT Program
[Chart: Q4/Q3/Q2/Q1 quarterly volumes]
Quarterly MT throughput
volumes allow Welocalize
and Safaba to accumulate
post-edits sufficient for far
more frequent re-trainings
than scheduled maintenance
engine updates
11. Final Post-Edited Output Quality
MT quality results are consistently above target; if the engine degrades,
translators must compensate with additional effort
Continuously above target, monitoring trend
[Chart: weekly post-edited quality, Result vs. Target, Weeks 40-53; y-axis 99.30%-100.00%, with Result continuously above Target]
12. Safaba EMTGlobal
Client-Specific MT Adaptation
The majority of the MT systems Safaba develops are built and
optimized for specific client content types
Data Scenario:
Some amount of client-specific data: translation memories, terminology
glossaries and monolingual data resources
Additional domain-specific and general background data resources: other
client-specific content types, TAUS data, other general parallel and
monolingual background data
13. Safaba EMTGlobal
Client-Specific MT Adaptation
Safaba Suite of Adaptation Approaches:
Data selection, filtering and prioritization methods
Data mixture and interpolation methods
Model mixture and interpolation methods
Client-specific Automated Post-Editing (Language Optimization Engine)
Styling and Formatting post-processing modules
Terminology and DNT runtime overrides
14. Enterprise Content Drift
Client-specific Enterprise MT systems often degrade in performance over time
for two main reasons:
1. Client content, even in controlled domains, gradually changes over time:
new products, new terminology, new content developers
2. The typical integrated setup of MT and translation memories: TMs are
updated more frequently, so over time, only “harder” source segments are
sent for translation to MT
Full MT system retraining is currently resource- and time-consuming:
MT systems are relatively static – they are fully retrained only periodically (typically
only a couple of times per year)
The Result: MT accuracy for new projects declines over time; post-editing
productivity also declines over time
We see strong evidence of “content drift” over time with many of our clients,
especially in post-editing setups
15. Evidence from Safaba EMTGlobal Systems for Dell MT Program:
BLEU scores before and after retraining on held out “recent”
incremental data
[Chart: BLEU scores (0-70) before and after retraining, on held-out 2013 and 2014 incremental data]
16. Enterprise Content Drift
Evidence from a typical client-specific MT system:
EMTGlobal English-to-German Dell MT System:
February 2013 System: 565K client + 964K background segments
March 2014 System: 594K client + 6,795K background segments
Two test sets:
“Original” test set from February 2013 system build (1,200 segments)
“Incremental” test set extracted from incremental data (500 segments)
System Test Scores and Statistics:
Lang | System     | Gloss Inconsist. | Orig. BLEU | Orig. MET | Orig. TER | Orig. LEN | Orig. OOVs | Incr. BLEU | Incr. MET | Incr. TER | Incr. LEN | Incr. OOVs
DE   | Feb. 2013  | 55.7%            | 51.0       | 63.4      | 38.2      | 101.2     | 63         | 41.7       | 56.6      | 45.0      | 101.2     | 107
DE   | March 2014 | 24.8%            | 52.9       | 64.2      | 36.9      | 100.5     | 33         | 60.5       | 69.9      | 30.3      | 99.9      | 31
22. Enterprise Content Drift
Analysis of Content Drift Over Time:
Three EMTGlobal MT systems for Dell:
English to Chinese, Spanish and German
Systems trained and deployed in February 2013
Test sets:
“Original” test set from February 2013 system build (1,200
segments)
“Incremental” test set extracted from 2014 incremental data (500
segments)
Data sets extracted from live Dell production projects in August 2013,
December 2013 and March 2014, along with their post-edited references
23. Enterprise Content Drift
Analysis of Content Drift Over Time:
[Chart: BLEU scores (0-70) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
24. Identifying Enterprise Content Drift
Content Drift Indicators
Goal: Establish real-time quantifiable measures that are indicative of Enterprise
Content Drift
Immediate: Available immediately at MT production time, prior to any post-editing of
the MT output
Predictive: Strongly correlate with expected MT evaluation score and post-editing effort
Similar to real-time MT Quality Estimation scores, but specific to capturing content drift
Three Measures:
Core Out-of-Vocabulary (OOV) Type and Token fractions:
Fraction of source types (tokens) that were out-of-vocabulary in the core MT system (OOVs)
Source-side Unigram Coverage:
Fraction of source type (token) unigrams that were observed in the MT system training data
Source-side Trigram Coverage:
Fraction of source type (token) trigrams that were observed in the MT system training data
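A minimal sketch of how the three indicators above could be computed over a batch of tokenized source text. The function name and the simplified token-level counting are assumptions; the production indicators also track type-level statistics:

```python
def drift_indicators(source_tokens, train_vocab, train_trigrams):
    """Token-level content-drift indicators for one batch of source text.

    train_vocab is the set of unigram types seen in the MT training data;
    train_trigrams is the set of trigrams seen there. Returns the OOV
    token fraction, unigram coverage and trigram coverage.
    """
    n = len(source_tokens)
    oov = sum(1 for tok in source_tokens if tok not in train_vocab)
    trigrams = list(zip(source_tokens, source_tokens[1:], source_tokens[2:]))
    covered = sum(1 for tri in trigrams if tri in train_trigrams)
    return (oov / n,
            1 - oov / n,
            covered / len(trigrams) if trigrams else 0.0)

# Illustrative batch: one OOV token, one of two trigrams covered
print(drift_indicators(["a", "b", "c", "d"],
                       {"a", "b", "c"},
                       {("a", "b", "c")}))  # (0.25, 0.75, 0.5)
```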
25. Identifying Enterprise Content Drift
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
[Chart: OOVs (fraction of tokens, 0-6%) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
26. Identifying Enterprise Content Drift
Content Drift Indicators
Performance of Content Drift Indicators on Dell EMTGlobal Systems:
[Chart: Source Trigram Coverage (0-80%) for Chinese, Spanish and German, measured on the Feb, Aug, Dec and Mar production batches and the Inc-2013 and Inc-2014 test sets]
27. “Overnight” Incremental Adaptation
Objective: Counter “content drift” and help maintain and accelerate post-editing
productivity with fast and frequent incremental adaptation retraining
Setting: New additional post-edited client data is deposited and made available for
adaptation in small incremental batches
Challenge: Full offline system retraining is slow and computationally intensive and
can take several days
Safaba Solution: implement fast “light-weight” adaptations that can be executed,
tested and deployed into production within hours (“overnight”)
Suffix-array variant of Moses supports rapid updating of indexed training data
Safaba Language Optimization Engine (automated post-editing module) supports rapid retraining
KenLM supports rapid rebuilding of language models
Currently in pilot testing with Welocalize and Dell
28. Safaba Overnight Retraining
The Approach:
Goal: Rapid MT System Adaptation using Incremental Data
Current Approach: Language Optimization Engine (LOE) Incremental Retraining
Safaba EMTGlobal MT systems include a core MT engine and a target-side Language Optimization
Engine
Retraining the LOE component is fast – typically within a few hours
Not equivalent to full MT system retraining, but effective in closing the gap
New Approach: EMTGlobal v4.0 Advanced Adaptation Technology:
Supports significantly improved client-specific adaptation within the core MT engine
Supports rapid incremental retraining of core MT engines
Much closer to full MT system retraining, in a time frame similar to LOE retraining
Will be available in late Q4 of 2014
29. Safaba Overnight Retraining
The Approach:
Full Solution: Overnight Retraining
Incremental data from post-edited MT projects is delivered to Safaba
Incremental system retraining is launched automatically, completed within hours
Newly-adapted version of the MT system is automatically tested and QAed for
quality
Newly-adapted version of the MT system is deployed into production
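The four steps above can be sketched as a single decision loop. All callables here are hypothetical stand-ins for the actual pipeline stages, not Safaba's real interfaces:

```python
def overnight_retrain(batch, retrain, evaluate, deploy, baseline_score):
    """One cycle of the overnight-retraining loop described above.

    The newly adapted system is deployed into production only if its
    automatic QA score does not fall below the current baseline.
    """
    candidate = retrain(batch)    # incremental retraining, hours not days
    score = evaluate(candidate)   # automatic test / QA step
    if score >= baseline_score:
        deploy(candidate)         # push into production
        return ("deployed", score)
    return ("rejected", score)
```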
30. Safaba Overnight Retraining
The Pilot Project
Pilot project with Welocalize to assess impact of Overnight Retraining on Safaba
EMTGlobal Dell MT systems, using samples of real post-edited translation project data
Setup:
Languages: English to Chinese, Spanish and German
Baseline Systems: 2014 retrained Dell EMTGlobal 3.0 MT systems
Incremental Data: Three batches of incremental data from live translation projects
Methodology:
Three versions of the MT systems:
Baseline
Baseline + Retrained on Data Set #1
Baseline + Retrained on Data Set #1 & #2
MT Evaluation:
Translate Data Set #3 (unseen) with the three versions of the MT system
Assess impact on translation performance using automated MT evaluation metrics
Additional analysis using Safaba “Content Drift Indicators”
31. Safaba Overnight Retraining
Data
Language pair | Original segments (Set 1 / Set 2 / Set 3) | Post-filtering (Set 1 / Set 2 / Set 3)
ENUS-ESXL     | 1108 / 4553 / 704                         | 926 / 2411 / 528
ENUS-ZHCN     | 3191 / 2181 / 1328                        | 1143 / 1084 / 714
ENUS-DEDE     | 3043 / 1220 / 2270                        | 2325 / 977 / 1466
32. Pilot Results: Automated Metric Scores
English-to-Chinese:
Incremental Adaptation of Language Optimization Engine (LOE)
Incrementally retraining on Data Sets #1 & #2 results in gain of +3.0 BLEU points on Data Set
#3
[Chart: BLEU, METEOR and TER scores (30-70) for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]
33. Safaba Overnight Retraining
Pilot Results: Content Drift Indicator Statistics
English-to-Chinese:
Incremental Adaptation of Language Optimization Engine (LOE)
Adding Data Sets #1 & #2 reduces Data Set #3 OOVs by 0.3%, improves unigram coverage
by 0.36% and improves trigram coverage by 14.22%
[Charts: OOV Tokens (0-7%) and Unigrams/Trigrams Covered (0.4-1.0) for the 2013 System, 2014 Baseline, Baseline+DS1 and Baseline+DS1&2]
34. Preliminary Results: Advanced Adaptation with EMTGlobal v4.0
English-to-Chinese:
Incremental Adaptation with EMTGlobal v4.0
Incrementally retraining on Data Sets #1 & #2 results in gain of +6.8 BLEU points on Data
Set #3
[Chart: BLEU, METEOR and TER scores (30-75) for the 2014 Baseline and Baseline+DS1&2]
35. Safaba Overnight Retraining
Summary of Pilot Results
Excellent results for English-to-Chinese!
Spanish and German results show no gain or loss in MT accuracy as a result of LOE
incremental retraining with the available data sets
Performance on Data Set #3 remains completely flat with both retrainings according
to all automated metrics
Data analysis with Content Drift Indicators reveals that Data Sets #1 & #2 for these
two language pairs did not contain novel translations sufficient for improving MT
performance on Data Set #3
No significant reduction in Data Set #3 OOVs
No significant improvement in coverage of source-side n-grams
36. Overnight Retraining Pilot Evaluation Setup
Translators were asked to compare each engine iteration using the same source strings
Day 1: read the MT output first. Then read the source text (ST). Then score the segment for Adequacy and Fluency
Adequacy
On a 4-point scale, rate how much of the meaning is rendered in the translation:
4 Everything
3 Most
2 Little
1 None
Fluency
Rate on a 4- point scale the extent to which the translation is well-formed grammatically, contains correct spellings, adheres to common
use of terms, titles and names, is intuitively acceptable and can be sensibly interpreted by a native speaker:
4 Flawless
3 Good
2 Disfluent
1 Incomprehensible
*Based on TAUS Adequacy/Fluency Guidelines
Comparing the iterations: compare the NEW MT output to that of the previous week and indicate with an X in the corresponding column
whether it is better / worse / equal.
If it is better or worse, indicate in the error categories & comments column what has improved or regressed.
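The TAUS-style rating scheme above can be captured in a small validation helper. This sketch is illustrative and not part of the actual evaluation tooling:

```python
# TAUS-style 4-point scales used in the pilot evaluation
ADEQUACY = {4: "Everything", 3: "Most", 2: "Little", 1: "None"}
FLUENCY = {4: "Flawless", 3: "Good", 2: "Disfluent", 1: "Incomprehensible"}

def record_rating(segment_id, adequacy, fluency):
    """Validate one segment rating and return it in labeled form."""
    if adequacy not in ADEQUACY or fluency not in FLUENCY:
        raise ValueError("ratings must be integers on the 1-4 scale")
    return {"segment": segment_id,
            "adequacy": ADEQUACY[adequacy],
            "fluency": FLUENCY[fluency]}
```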
37. Interim Chinese Pilot Results - Welocalize
• The human evaluation results we have at our disposal are a work in progress, based on
evaluating a small subset of translated data and just one iteration of “overnight
retraining”
• The improvements observed by automated metrics are not yet reflected in the human
assessment
• Human evaluation results are consistent between Baseline+DS1 and DS2: no degradation is
introduced, but from the translators' perspective no significant change in quality is captured;
this may require a larger evaluation set or a different approach to evaluation string
selection
Translator feedback: improvement in fluency but no improvement in capturing the
meaning of the whole sentence; punctuation has improved, but the translation still needs
improvement; part of the sentence is now more fluent
38. Welocalize “Wins” from “Overnight Retraining”
We need to be more granular than “Quality” and look at
“Relevance” (coverage and fluency will increase based on Safaba
findings)
Our expected benefits from this approach (contingent on sufficient daily
volumes):
No need to wait for scheduled retrainings
Two things happen: the translator gets more used to post-editing,
and the MT engines catch up with changes in the
source content in “live” mode
Benefit for the client: once actual ongoing engine relevance
statistics have been captured, we will be able to predict higher
throughputs and offer better discounts
39. Summary and Conclusions
Enterprise Content Drift is a natural and frequent phenomenon in large-scale
commercial MT implementation projects
Enterprise MT systems need to adapt constantly, or they are likely to
degrade significantly in translation accuracy and value over time
Safaba’s Content Drift Indicators can identify and quantify content drift
and can be effectively used to predict the impact of incremental MT
system retraining
Are being incorporated into Safaba’s new EMTGlobal MT Monitoring Portal
Safaba’s “Overnight Retraining” incremental adaptation is effective in
combating content drift and maintaining/improving MT system
performance over time and maintaining translator productivity levels
Safaba’s upcoming EMTGlobal v4.0 will dramatically enhance these
capabilities!
Editor's Notes
Beyond the quality standards
Engine retraining and additional productivity gains
A solid engine translates into solid gains
In translators’ words – 1.5 years into the program
Meeting high quality standards while reducing turnaround times requires high-quality up-to-date MT output