MT Quality Evaluations: From Test Environment to Production
AMTA PRESENTATION
ELAINE O'CURRAN
Welocalize
October 2015
AGENDA
• Our MT evaluation methodologies
• Correlations between automatic scores and human evaluations
• Differences between system autoscores and PE autoscores
• MT evaluations in a production setting
• MT evaluations of post-edited files: a case study
OUR EVALUATION METHODS
A TYPICAL EVALUATION PROCESS PER LOCALE AND PER ENGINE
Engine Training → Autoscoring Test Sets (2500 TUs) → Engine Ranking (100 TUs) → Adequacy & Fluency Scoring (100 TUs) → Error Typology Scoring (25 TUs) → Productivity Test (8 hours)
OUR EVALUATION METHODS
AUTOMATIC SCORES GENERATED BY WESCORE
[Chart: Precision, Recall, GTM, METEOR, BLEU, Avg. PE Dist. and TER per engine; engines labelled Hub, Comm1, Comm2, DIY1, DIY2 and Engine1 to Engine5; score axis from 30.0 to 80.0]
OUR EVALUATION METHODS
HUMAN EVALUATIONS: ADEQUACY AND FLUENCY SCORING
Adequacy & Fluency by Locale

Locale   Adequacy   Fluency
DA       4.1        3.8
ES       3.9        3.7
FR       3.9        3.6
HI-IN    2.2        2.3
IT       4.0        3.8
JA       3.1        2.8
NL       4.0        3.8
NO       3.9        3.5
PT-BR    4.1        3.8
SV       4.2        3.8
OUR EVALUATION METHODS
HUMAN EVALUATION: ERROR TYPOLOGY
[Chart: Errors per 25 Segments, by locale (DA, ES, FR, HI-IN, IT, JA, NL, NO, PT-BR, SV); axis from 0 to 25]
OUR EVALUATION METHODS
HUMAN EVALUATION: ENGINE RANKING
[Chart: Engine Ranked Best (out of 100 segments), by locale (de-DE, fr-FR, ja-JP, zh-Hans); engines labelled Hub, Comm1, Comm2, DIY1, DIY2 and Engine1 to Engine5; data labels shown: 41.5, 66, 51.5, 58.5, 39, 22.5]
OUR EVALUATION METHODS
PRODUCTIVITY TESTS
• Tool: iOmegaT, an instrumented version of the open-source CAT tool OmegaT
• iOmegaT tracks time spent in each segment plus keystrokes
• Mimics the translators’ work environment: integrated glossary, compatible with third-party tools for quality checks
• Test sets consist of a mix of MT segments to post-edit and no-match segments to translate from scratch
• Usual scope is 8 hours of translation / post-editing
• Provides the productivity delta between post-edited and translated words (words per hour)
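The productivity delta described above can be sketched as a comparison of two throughputs. This is a minimal illustration only: the function names and the word and time totals are hypothetical, and the deck does not state the exact formula used in the iOmegaT reports.

```python
def words_per_hour(words: int, seconds: float) -> float:
    """Throughput in words per hour for a set of segments."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return words / (seconds / 3600.0)


def productivity_gain(pe_words: int, pe_seconds: float,
                      ht_words: int, ht_seconds: float) -> float:
    """Relative gain (in percent) of post-editing over translating from scratch.

    Assumption: the delta is expressed relative to the from-scratch throughput.
    """
    pe_rate = words_per_hour(pe_words, pe_seconds)
    ht_rate = words_per_hour(ht_words, ht_seconds)
    return (pe_rate - ht_rate) / ht_rate * 100.0


# Hypothetical totals: 4 hours of post-editing and 4 hours of translation.
gain = productivity_gain(pe_words=3000, pe_seconds=4 * 3600,
                         ht_words=2000, ht_seconds=4 * 3600)
print(f"{gain:.1f}%")  # prints "50.0%"
```

A positive value means the post-edited segments were processed faster than the segments translated from scratch.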
OUR EVALUATION METHODS
PRODUCTIVITY TESTS
[Chart: Average MT Productivity Gain per Engine, by locale (de-DE, es-MX, fr-FR, it-IT, pt-BR); series Engine1 and Engine2; axis from 0.0% to 50.0%; data labels shown: 42.0%, 14.7%, 45.9%, 38.6%, 18.1%, 22.1%, 22.0%, 35.0%, 34.1%, 21.9%]
LESSONS LEARNED
• We always perform autoscoring PLUS human scoring for all our MT evaluations.
We have internal thresholds that determine whether an engine is ready for
deployment and indicate its level of maturity.
• For bake-offs between several engines, we always include engine ranking in
addition to our standard scores.
• Productivity tests are valuable during the initial phase of an MT program to build
up productivity data for future reference across languages, domains and MT
systems.
• Our MT program is now mature and we are able to perform most of our
evaluations based on autoscoring PLUS human scoring, and by referencing the
productivity data we have collected over a number of years.
NEXT: Correlations between automatic scores and human evaluations
CORRELATIONS
CORRELATIONS BETWEEN AUTOMATIC SCORES AND HUMAN EVALUATIONS

Pearson's r    Variables                 Strength of Correlation           Tests (N)   Locales
 0.50576955    Fluency & METEOR          Strong positive relationship      150         11
 0.50070425    Fluency & BLEU            Strong positive relationship      150         11
 0.49816365    Fluency & Recall          Strong positive relationship      150         11
 0.49724893    Fluency & NIST            Strong positive relationship      150         11
 0.49195687    Fluency & GTM             Strong positive relationship      150         11
 0.47064566    Fluency & Precision       Strong positive relationship      150         11
 0.38293518    Adequacy & NIST           Moderate positive relationship    150         11
 0.31354314    Adequacy & METEOR         Moderate positive relationship    150         11
 0.2940756     Adequacy & Recall         Weak positive relationship        150         11
 0.28586852    Adequacy & GTM            Weak positive relationship        150         11
 0.28386332    Adequacy & BLEU           Weak positive relationship        150         11
 0.26685854    Adequacy & Precision      Weak positive relationship        150         11
-0.40270902    Adequacy & TER            Strong negative relationship      150         11
-0.4788575     Fluency & PE Distance     Strong negative relationship      150         11
-0.5385275     Adequacy & PE Distance    Strong negative relationship      150         11
-0.5421933     Fluency & TER             Strong negative relationship      150         11
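For readers who want to reproduce this kind of analysis, Pearson's r between a human score and an automatic score can be computed as sketched below. The score lists are hypothetical placeholders, not the data behind the table, and `pearson_r` is an illustrative helper rather than part of any tool named in this deck.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-test averages: higher fluency paired with lower TER.
fluency = [4.1, 3.8, 3.2, 2.5, 3.9]
ter = [28.0, 35.0, 47.0, 62.0, 31.0]
print(round(pearson_r(fluency, ter), 3))  # negative: TER falls as fluency rises
```

Because TER and PE Distance are "lower is better" metrics, a negative r against Fluency or Adequacy indicates agreement with human judgment.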
CORRELATIONS
THE STRONGEST CORRELATION WAS FOUND BETWEEN FLUENCY AND TER
[Scatter plot: TER (0 to 100) against Fluency (1.00 to 5.00)]
CORRELATIONS
THE 2ND STRONGEST CORRELATION WAS FOUND BETWEEN ADEQUACY AND PE DISTANCE
[Scatter plot: PE Distance (0% to 100%) against Adequacy (1.00 to 5.00)]
LESSONS LEARNED
• It seems that we cannot rely solely on autoscores as long as the correlation with
human judgment is no stronger than this data suggests.
• TER and PE Distance show the strongest correlation to both Fluency and
Adequacy, and therefore seem closer to human judgment than the other scores.
• Fluency correlates more strongly with system autoscores than Adequacy does overall.
• PE Distance is the only metric that correlates more strongly with Adequacy than
with Fluency. PE Distance is also the only character-based metric.
NEXT: Differences between system autoscores and post-editing autoscores
SYSTEM VS PE AUTOSCORES
CORRELATIONS BETWEEN SYSTEM SCORES AND POST-EDITING SCORES

Pearson's r    Variables                                  Strength of Correlation              Tests (N)   Locales
0.832226688    BLEU (System) & BLEU (PE)                  Very strong positive relationship    57          9
0.832218909    PE Distance (System) & PE Distance (PE)    Very strong positive relationship    57          9
ON AVERAGE, THE POST-EDITING SCORE IS 15 AND 17 POINTS BETTER FOR PE DISTANCE AND BLEU RESPECTIVELY

Metric             System   PE      Difference
BLEU               33.18    50.58   17.40
PE Distance (%)    44.32    29.62   14.70
SYSTEM VS PE AUTOSCORES
CORRELATIONS BETWEEN SYSTEM BLEU AND POST-EDITING BLEU
[Scatter plot: BLEU (PE) against BLEU (System), both axes from 0 to 100]
SYSTEM VS PE AUTOSCORES
CORRELATIONS BETWEEN SYSTEM PE DISTANCE AND POST-EDITING PE DISTANCE
[Scatter plot: PE Distance (PE) against PE Distance (System), both axes from 0% to 70%]
SYSTEM VS PE AUTOSCORES
REAL DATA WHERE WE COMPARE EVALUATION SCORES WITH SCORES FROM A 3-MONTH PILOT
LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS
[Chart: PE Distance (%) by locale (es-ES, fr-FR, it-IT, pt-BR); series Pilot1 and Eval1; axis from 0 to 100; data labels shown: 32.2, 35.3, 35.08, 20.46, 35.26, 43.69, 46.76, 34.71]
LESSONS LEARNED
• There is a very high correlation between the MT system autoscores generated
during the evaluation phase and the autoscores generated from production using
the same engines.
• However, the post-editing autoscores are considerably better than the MT
system autoscores, by around 15%.
• We now differentiate the autoscores in our database as ‘System’ and ‘PE’.
NEXT: MT evaluations in a production setting
PRODUCTION SETTING
HOW TO MEASURE POST-EDITING EFFORT
• It is important to monitor the performance of MT and post-editing,
especially during the initial launch of a new program
• The use of autoscoring to analyze post-project files is a valuable and cost-
effective method to measure the post-editing effort
• Autoscores support rate negotiations and can help us to identify over- or
under-editing by post-editors
• TER and PE Distance are useful metrics, with different underlying
algorithms
PRODUCTION SETTING
HOW TO MEASURE POST-EDITING EFFORT
PE Distance - lower is better!
• Measures the number of insertions, deletions, substitutions required to
transform MT output to the required quality level
• PE Distance values are derived by comparing the post-edited segments
with the corresponding machine translation segments
• In our analysis the PE distance applies the Levenshtein algorithm and is
character-based. This captures morphological post-edits, such as fixing
word forms.
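A character-based PE Distance of this kind can be sketched as follows. The Levenshtein routine is the standard dynamic-programming algorithm; normalising by the longer of the two strings to obtain a percentage is an assumption made here for illustration, since the deck does not state how the percentage is derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def pe_distance(mt: str, post_edited: str) -> float:
    """Character-level edit distance as a percentage of the longer segment.

    The normalisation is an assumption, not the documented formula.
    """
    longest = max(len(mt), len(post_edited))
    if longest == 0:
        return 0.0
    return levenshtein(mt, post_edited) / longest * 100.0


# A morphological fix (one deleted character) yields a small distance.
print(f"{pe_distance('the houses is red', 'the house is red'):.1f}")  # prints "5.9"
```

Because the comparison is made per character, correcting a single word form changes the score only slightly, which is the behaviour the last bullet describes.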
PRODUCTION SETTING
HOW TO MEASURE POST-EDITING EFFORT
TER - lower is better!
• TER stands for Translation Edit Rate
• It is an error metric for machine translation that measures the number of
edits required to change a system output into the post-edited version
• Possible edits include the insertion, deletion, and substitution of single
words as well as shifts of word sequences.
• Unlike PE Distance, TER is a word-based error metric and therefore does
not capture morphological changes during post-editing.
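The word-based nature of TER can be illustrated with a simplified sketch. Note the stated limitation: this version counts only word insertions, deletions and substitutions and omits the shift operation, so it is a word-level edit rate that approximates TER rather than a TER implementation. The function names are hypothetical.

```python
def word_edit_distance(hyp: list, ref: list) -> int:
    """Minimum number of word insertions, deletions and substitutions."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete a hypothesis word
                               current[j - 1] + 1,       # insert a reference word
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def approx_ter(mt: str, post_edited: str) -> float:
    """Word edits divided by post-edited length, in percent (shifts not modelled)."""
    hyp, ref = mt.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited segment must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref) * 100.0


# Fixing one word form counts as a full word substitution: 1 edit over 4 words.
print(f"{approx_ter('the houses is red', 'the house is red'):.1f}")  # prints "25.0"
```

The one-character correction is scored as a whole-word edit here, which is why a word-based metric cannot tell a morphological fix apart from a replaced word.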
PRODUCTION SETTING
LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS
[Chart: PE Distance (%) per resource for de-DE, fr-FR and it-IT; axis from 0 to 50; resource labels in order: Resource1, Resource2, Resource1, Resource2, Resource2, Resource3; data labels shown: 23, 22, 19, 18, 18, 35]
PRODUCTION SETTING
LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS:
POST-PROJECT AUTOSCORES INDICATE UNDEREDITING
TER scores for several batches

Locale   Job    Resource   TER
de-DE    JOB1   VIPR 1     30.5
de-DE    JOB2   VIPR 3     38.47
de-DE    JOB3   VIPR 2     43.18
es-MX    JOB1   VIPR 1     24.28
es-MX    JOB2   VIPR 3     23.35
es-MX    JOB3   VIPR 2     24.89
es-MX    JOB4   VIPR 5     15.28
fr-FR    JOB1   VIPR 1     34.61
fr-FR    JOB2   VIPR 3     25.25
fr-FR    JOB3   VIPR 2     35.1
it-IT    JOB1   VIPR 1     33.36
it-IT    JOB2   VIPR 3     25.57
it-IT    JOB3   VIPR 2     38.27
pt-BR    JOB1   VIPR 1     20.73
pt-BR    JOB2   VIPR 3     18.04
pt-BR    JOB3   VIPR 2     22.84
PRODUCTION SETTING
TOOLS TO MEASURE POST-EDITING EFFORT
• iOmegaT. Input files: xliff & more. Output report: xml. Pros: includes productivity data. Cons: generated in the CAT tool during translation, requires post-editor buy-in.
• MateCat. Input files: xliff. Output report: Excel. Pros: includes productivity data as a built-in feature. Cons: generated in the CAT tool during translation, requires post-editor buy-in.
• Okapi. Input files: xliff. Output report: html. Pros: allows us to measure PE distance post-project. Cons: requires access to pre- and post-edited file sets.
• Post-Edit Compare. Input files: sdlxliff. Output report: html. Pros: allows us to measure PE distance post-project. Cons: requires access to pre- and post-edited file sets.
• Qualitivity. Input files: sdlxliff. Output report: Excel. Pros: includes productivity data. Cons: generated in the CAT tool during translation, requires post-editor buy-in.
• wescore. Input files: tmx. Output report: Excel. Pros: allows us to measure PE distance post-project. Cons: proprietary tool, requires access to pre- and post-edited file sets.
PRODUCTION SETTING
MATECAT IS A FREE ONLINE CAT TOOL WITH EDITING LOG
http://www.matecat.com/support/translation-toolbox/editing-log/
PRODUCTION SETTING
USE POST-EDIT COMPARE TO ANALYSE SDLXLIFF FILES
http://www.translationzone.com/openexchange/app/post-editcompare-495.html
PRODUCTION SETTING
OKAPI FRAMEWORK TRANSLATION COMPARISON STEP
http://www.opentag.com/okapi/wiki/index.php?title=Translation_Comparison_Step
PRODUCTION SETTING
QUALITIVITY PLUGIN FOR SDL TRADOS STUDIO
http://www.translationzone.com/openexchange/app/qualitivity-788.html
LESSONS LEARNED
• The use of autoscoring to analyze post-project files is a valuable and cost-
effective method to measure the post-editing effort.
• A productivity test requires upfront organization and buy-in from translators.
• It is important to find a tool that works with the given file format and workflow.
• Access to pre- and post-edit versions of projects is required. This is a challenge
on some accounts.
• Identification and separation of MT segments from fuzzy segments may be
required for some tools.
• Look for consistency across languages and resources. Unusually high or low
scores can be a sign of over-editing or under-editing.
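One simple way to act on the last point is to compare each resource's score with the median for the same locale and flag large deviations. The sketch below is illustrative only: the resource names, the scores and the 10-point tolerance are hypothetical and are not thresholds taken from this deck.

```python
from statistics import median


def flag_outliers(scores: dict, tolerance: float = 10.0) -> dict:
    """Return resources whose score deviates from the median by more than tolerance.

    Positive deviations suggest heavier editing than peers (possible over-editing),
    negative deviations suggest lighter editing (possible under-editing).
    """
    mid = median(scores.values())
    return {name: round(score - mid, 2)
            for name, score in scores.items()
            if abs(score - mid) > tolerance}


# Hypothetical PE Distance (%) values for one locale.
scores = {"resource_a": 23.0, "resource_b": 22.0,
          "resource_c": 19.0, "resource_d": 35.0}
print(flag_outliers(scores))  # prints "{'resource_d': 12.5}"
```

A flagged score is a prompt for a closer look at the files, not proof of over- or under-editing on its own.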
NEXT: MT evaluations of post-edited files: a case study
CASE STUDY
TEST PILOT FOR LIGHT AND FULL POST-EDITING
Full Post-editing / Light Post-editing → LQA of PE Kits → Autoscoring PE Kits → Adequacy & Fluency Scoring → Error Typology Scoring → Analysis & Report
• Languages: Chinese (Simplified) and Japanese
• The resources are regular translators for this client
• In order to have comparable data, the same resource performed both light
and full post-editing tasks of 438 segments
CASE STUDY: HUMAN EVALS
ADEQUACY AND FLUENCY SCORES
[Chart: Average of Adequacy and Average of Fluency for ja-JP and zh-CN; axis from 1 to 5; data labels shown: 3.68, 3.66, 3.58, 3.4]
CASE STUDY: AUTOSCORES
AUTOSCORES FOR LIGHT AND FULL POST-EDITING
[Chart: BLEU, METEOR, GTM, Recall (%) and TER for ja-JP Full PE, zh-CN Full PE, ja-JP Light PE and zh-CN Light PE; axis from 20.00 to 90.00]
CASE STUDY: LESSONS
LESSONS LEARNED
• Using autoscores on post-edited translations can indicate the level of post-
editing effort involved for a specific content and MT engine
• The autoscores also illustrate the difference in effort between Light and
Full Post-editing: a delta of approximately 20 points for BLEU and 15 points
for TER
• The autoscores confirm that the resources have indeed managed to
perform two distinct post-editing levels
THANK YOU

Call Girls in DELHI Cantt, ( Call Me )-8377877756-Female Escort- In Delhi / Ncrdollysharma2066
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Pereraictsugar
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy Verified Accounts
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCRashishs7044
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMintel Group
 
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedLean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedKaiNexus
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaoncallgirls2057
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607dollysharma2066
 
Future Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionFuture Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionMintel Group
 
Annual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesAnnual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesKeppelCorporation
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailAriel592675
 
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...lizamodels9
 

Recently uploaded (20)

Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...Global Scenario On Sustainable  and Resilient Coconut Industry by Dr. Jelfina...
Global Scenario On Sustainable and Resilient Coconut Industry by Dr. Jelfina...
 
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu MenzaYouth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
 
Islamabad Escorts | Call 03274100048 | Escort Service in Islamabad
Islamabad Escorts | Call 03274100048 | Escort Service in IslamabadIslamabad Escorts | Call 03274100048 | Escort Service in Islamabad
Islamabad Escorts | Call 03274100048 | Escort Service in Islamabad
 
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deckPitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
Pitch Deck Teardown: Geodesic.Life's $500k Pre-seed deck
 
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
8447779800, Low rate Call girls in Kotla Mubarakpur Delhi NCR
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdf
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
 
Call Girls in DELHI Cantt, ( Call Me )-8377877756-Female Escort- In Delhi / Ncr
Call Girls in DELHI Cantt, ( Call Me )-8377877756-Female Escort- In Delhi / NcrCall Girls in DELHI Cantt, ( Call Me )-8377877756-Female Escort- In Delhi / Ncr
Call Girls in DELHI Cantt, ( Call Me )-8377877756-Female Escort- In Delhi / Ncr
 
Kenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith PereraKenya Coconut Production Presentation by Dr. Lalith Perera
Kenya Coconut Production Presentation by Dr. Lalith Perera
 
Buy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail AccountsBuy gmail accounts.pdf Buy Old Gmail Accounts
Buy gmail accounts.pdf Buy Old Gmail Accounts
 
8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR8447779800, Low rate Call girls in Saket Delhi NCR
8447779800, Low rate Call girls in Saket Delhi NCR
 
Market Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 EditionMarket Sizes Sample Report - 2024 Edition
Market Sizes Sample Report - 2024 Edition
 
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… AbridgedLean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
Lean: From Theory to Practice — One City’s (and Library’s) Lean Story… Abridged
 
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City GurgaonCall Us 📲8800102216📞 Call Girls In DLF City Gurgaon
Call Us 📲8800102216📞 Call Girls In DLF City Gurgaon
 
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
(Best) ENJOY Call Girls in Faridabad Ex | 8377087607
 
Future Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted VersionFuture Of Sample Report 2024 | Redacted Version
Future Of Sample Report 2024 | Redacted Version
 
Annual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesAnnual General Meeting Presentation Slides
Annual General Meeting Presentation Slides
 
Case study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detailCase study on tata clothing brand zudio in detail
Case study on tata clothing brand zudio in detail
 
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
Call Girls In Radisson Blu Hotel New Delhi Paschim Vihar ❤️8860477959 Escorts...
 

MT Quality Evaluations: From Test Environment to Production

  • 1. MT Quality Evaluations: From Test Environment to Production. Presentation at AMTA, October 2015. Elaine O'Curran, Welocalize.
  • 2. AGENDA
      • Our MT evaluation methodologies
      • Correlations between automatic scores and human evaluations
      • Differences between system autoscores and PE autoscores
      • MT evaluations in a production setting
      • MT evaluations of post-edited files: a case study
  • 3. OUR EVALUATION METHODS: A TYPICAL EVALUATION PROCESS PER LOCALE AND PER ENGINE
      Engine Training; Autoscoring Test Sets (2500 TUs); Engine Ranking (100 TUs); Adequacy & Fluency Scoring (100 TUs); Error Typology Scoring (25 TUs); Productivity Test (8 hours)
  • 4. OUR EVALUATION METHODS: AUTOMATIC SCORES GENERATED BY WESCORE
      [Chart: Precision, Recall, GTM, METEOR, BLEU, Avg. PE Dist. and TER per engine; engine labels: Hub, Comm1, Comm2, DIY1, DIY2, Engine1 to Engine5; axis from 30.0 to 80.0]
  • 5. OUR EVALUATION METHODS: HUMAN EVALUATIONS: ADEQUACY AND FLUENCY SCORING
  • 6. OUR EVALUATION METHODS: HUMAN EVALUATIONS: ADEQUACY AND FLUENCY SCORING
      [Chart: Adequacy & Fluency by Locale; series: Accuracy, Fluency; locales: DA, ES, FR, HI-IN, IT, JA, NL, NO, PT-BR, SV; axis from 1.0 to 4.5; data labels: 4.1, 3.9, 3.9, 2.2, 4.0, 3.1, 4.0, 3.9, 4.1, 4.2, 3.8, 3.7, 3.6, 2.3, 3.8, 2.8, 3.8, 3.5, 3.8, 3.8]
  • 7. OUR EVALUATION METHODS: HUMAN EVALUATION: ERROR TYPOLOGY
      [Chart: Errors per 25 Segments; locales: DA, ES, FR, HI-IN, IT, JA, NL, NO, PT-BR, SV; axis from 0 to 25]
  • 8. OUR EVALUATION METHODS: HUMAN EVALUATION: ENGINE RANKING
      [Chart: Engine Ranked Best (out of 100 segments); locales: de-DE, fr-FR, ja-JP, zh-Hans; engine labels: Hub, Comm1, Comm2, DIY1, DIY2, Engine1 to Engine5; axis from 0 to 100; data labels: 41.5, 66, 51.5, 58.5, 39, 22.5]
  • 9. OUR EVALUATION METHODS: PRODUCTIVITY TESTS
      • Tool: iOmegaT, an instrumented version of the open-source CAT tool OmegaT
      • iOmegaT tracks time spent in each segment plus keystrokes
      • Mimics the translators' work environment: integrated glossary, compatible with third-party tools for quality checks
      • Test sets consist of a mix of MT segments to post-edit and no-match segments to translate from scratch
      • Usual scope is 8h of translation / post-editing
      • Provides the productivity delta between post-edited and translated words (words per hour)
  • 10. OUR EVALUATION METHODS: PRODUCTIVITY TESTS
      [Chart: Average MT Productivity Gain per Engine; series: Engine1, Engine2; locales: de-DE, es-MX, fr-FR, it-IT, pt-BR; axis from 0.0% to 50.0%; data labels: 42.0%, 14.7%, 45.9%, 38.6%, 18.1%, 22.1%, 22.0%, 35.0%, 34.1%, 21.9%]
  • 11. LESSONS LEARNED
      • We always perform autoscoring PLUS human scoring for all our MT evaluations. We have internal thresholds that determine whether an engine is ready for deployment and indicate its level of maturity.
      • For bake-offs between several engines, we always include engine ranking in addition to our standard scores.
      • Productivity tests are valuable during the initial phase of an MT program to build up productivity data for future reference across languages, domains and MT systems.
      • Our MT program is now mature, and we are able to perform most of our evaluations based on autoscoring PLUS human scoring, and by referencing the productivity data we have collected over a number of years.
  • 12. NEXT: Correlations between automatic scores and human evaluations
  • 13. CORRELATIONS: CORRELATIONS BETWEEN AUTOMATIC SCORES AND HUMAN EVALUATIONS
      Pearson's r  | Variables              | Strength of Correlation        | Tests (N) | Locales
      -------------|------------------------|--------------------------------|-----------|--------
      0.50576955   | Fluency & METEOR       | Strong positive relationship   | 150       | 11
      0.50070425   | Fluency & BLEU         | Strong positive relationship   | 150       | 11
      0.49816365   | Fluency & Recall       | Strong positive relationship   | 150       | 11
      0.49724893   | Fluency & NIST         | Strong positive relationship   | 150       | 11
      0.49195687   | Fluency & GTM          | Strong positive relationship   | 150       | 11
      0.47064566   | Fluency & Precision    | Strong positive relationship   | 150       | 11
      0.38293518   | Adequacy & NIST        | Moderate positive relationship | 150       | 11
      0.31354314   | Adequacy & METEOR      | Moderate positive relationship | 150       | 11
      0.2940756    | Adequacy & Recall      | Weak positive relationship     | 150       | 11
      0.28586852   | Adequacy & GTM         | Weak positive relationship     | 150       | 11
      0.28386332   | Adequacy & BLEU        | Weak positive relationship     | 150       | 11
      0.26685854   | Adequacy & Precision   | Weak positive relationship     | 150       | 11
      -0.40270902  | Adequacy & TER         | Strong negative relationship   | 150       | 11
      -0.4788575   | Fluency & PE Distance  | Strong negative relationship   | 150       | 11
      -0.5385275   | Adequacy & PE Distance | Strong negative relationship   | 150       | 11
      -0.5421933   | Fluency & TER          | Strong negative relationship   | 150       | 11
  • 14. CORRELATIONS: THE STRONGEST CORRELATION WAS FOUND BETWEEN FLUENCY AND TER
      [Scatter plot: TER (0 to 100) against Fluency (1.00 to 5.00)]
  • 15. CORRELATIONS: THE 2ND STRONGEST CORRELATION WAS FOUND BETWEEN ADEQUACY AND PE DISTANCE
      [Scatter plot: PE Distance (0% to 100%) against Adequacy (1.00 to 5.00)]
  • 16. LESSONS LEARNED
      • It seems that we cannot rely solely on autoscores while their correlation with human judgment is no stronger than this data shows (|r| of about 0.54 at most).
      • TER and PE Distance show the strongest correlation with both Fluency and Adequacy, and therefore seem closer to human judgment than the other scores.
      • Overall, Fluency correlates more strongly with system autoscores than Adequacy does.
      • PE Distance is the only metric that correlates more strongly with Adequacy than with Fluency. PE Distance is also the only character-based metric.
  • 17. NEXT: Differences between system autoscores and post-editing autoscores
  • 18. SYSTEM VS PE AUTOSCORES: CORRELATIONS BETWEEN SYSTEM SCORES AND POST-EDITING SCORES
      Pearson's r  | Variables                               | Strength of Correlation           | Tests (N) | Locales
      -------------|-----------------------------------------|-----------------------------------|-----------|--------
      0.832226688  | BLEU (System) & BLEU (PE)               | Very strong positive relationship | 57        | 9
      0.832218909  | PE Distance (System) & PE Distance (PE) | Very strong positive relationship | 57        | 9
      [Chart: BLEU (System) 33.18; BLEU (PE) 50.58; PE Dist % (System) 44.32; PE Dist % (PE) 29.62; BLEU DIFF 17.40; PE DIST DIFF % 14.70]
      ON AVERAGE, THE POST-EDITING SCORE IS ABOUT 15 POINTS BETTER FOR PE DISTANCE AND 17 POINTS BETTER FOR BLEU
  • 19. SYSTEM VS PE AUTOSCORES: CORRELATIONS BETWEEN SYSTEM BLEU AND POST-EDITING BLEU
      [Scatter plot: BLEU (PE) against BLEU (System), both axes from 0 to 100]
  • 20. SYSTEM VS PE AUTOSCORES: CORRELATIONS BETWEEN SYSTEM PE DISTANCE AND POST-EDITING PE DISTANCE
      [Scatter plot: PE Distance (PE) against PE Distance (System), both axes from 0% to 70%]
  • 21. SYSTEM VS PE AUTOSCORES: REAL DATA WHERE WE COMPARE EVALUATION SCORES WITH SCORES FROM A 3-MONTH PILOT
      [Chart: PE Distance (%); series: Pilot1, Eval1; locales: es-ES, fr-FR, it-IT, pt-BR; axis from 0 to 100; data labels: 32.2, 35.3, 35.08, 20.46, 35.26, 43.69, 46.76, 34.71]
      LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS
  • 22. LESSONS LEARNED
      • There is a very high correlation between the MT system autoscores generated during the evaluation phase and the autoscores generated from production using the same engines.
      • However, the post-editing autoscores are considerably better than the MT system autoscores, by around 15 points.
      • We now differentiate the autoscores in our database as 'System' and 'PE'.
  • 23. NEXT: MT evaluations in a production setting
  • 24. PRODUCTION SETTING: HOW TO MEASURE POST-EDITING EFFORT
      • It is important to monitor the performance of MT and post-editing, especially during the initial launch of a new program
      • The use of autoscoring to analyze post-project files is a valuable and cost-effective method to measure the post-editing effort
      • Autoscores support rate negotiations and can help us to identify over- or under-editing by post-editors
      • TER and PE Distance are useful metrics, with different underlying algorithms
  • 25. PRODUCTION SETTING: HOW TO MEASURE POST-EDITING EFFORT
      PE Distance (lower is better)
      • Measures the number of insertions, deletions and substitutions required to transform MT output to the required quality level
      • PE Distance values are derived by comparing the post-edited segments with the corresponding machine translation segments
      • In our analysis, PE Distance is computed with the Levenshtein algorithm and is character-based. This captures morphological post-edits, such as fixing word forms.
  • 26. PRODUCTION SETTING: HOW TO MEASURE POST-EDITING EFFORT
      TER (lower is better)
      • TER stands for Translation Edit Rate
      • It is an error metric for machine translation that measures the number of edits required to change a system output into the post-edited version
      • Possible edits include the insertion, deletion and substitution of single words as well as shifts of word sequences
      • Unlike PE Distance, TER is a word-based error metric and therefore does not capture morphological changes during post-editing
  • 27. PRODUCTION SETTING: LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS
      [Chart: PE Distance (%); locales: de-DE, fr-FR, it-IT; resource labels: Resource1, Resource2, Resource1, Resource2, Resource2, Resource3; axis from 0 to 50; data labels: 23, 22, 19, 18, 18, 35]
  • 28. PRODUCTION SETTING: LOOK FOR CONSISTENCY AND BEWARE OF OUTLIERS: POST-PROJECT AUTOSCORES INDICATE UNDEREDITING
      [Chart: TER scores for several batches; locales: de-DE, es-MX, fr-FR, it-IT, pt-BR; job labels: JOB1, JOB2, JOB3, JOB1, JOB2, JOB3, JOB4, JOB1, JOB2, JOB3, JOB1, JOB2, JOB3, JOB1, JOB2, JOB3; resource labels: VIPR 1, VIPR 3, VIPR 2, VIPR 1, VIPR 3, VIPR 2, VIPR 5, VIPR 1, VIPR 3, VIPR 2, VIPR 1, VIPR 3, VIPR 2, VIPR 1, VIPR 3, VIPR 2; axis from 0 to 50; data labels: 30.5, 38.47, 43.18, 24.28, 23.35, 24.89, 15.28, 34.61, 25.25, 35.1, 33.36, 25.57, 38.27, 20.73, 18.04, 22.84]
  • 29. PRODUCTION SETTING: TOOLS TO MEASURE POST-EDITING EFFORT
      TOOL              | INPUT FILES  | OUTPUT REPORT | PROS                                             | CONS
      ------------------|--------------|---------------|--------------------------------------------------|-----
      iOmegaT           | xliff & more | xml           | Includes productivity data                       | Generated in the CAT tool during translation, requires post-editor buy-in
      MateCat           | xliff        | Excel         | Includes productivity data as a built-in feature | Generated in the CAT tool during translation, requires post-editor buy-in
      Okapi             | xliff        | html          | Allows us to measure PE distance post-project    | Requires access to pre- and post-edited file sets
      Post-Edit Compare | sdlxliff     | html          | Allows us to measure PE distance post-project    | Requires access to pre- and post-edited file sets
      Qualitivity       | sdlxliff     | Excel         | Includes productivity data                       | Generated in the CAT tool during translation, requires post-editor buy-in
      wescore           | tmx          | Excel         | Allows us to measure PE distance post-project    | Proprietary tool, requires access to pre- and post-edited file sets
  • 30. PRODUCTION SETTING MATECAT IS A FREE ONLINE CAT TOOL WITH EDITING LOG http://www.matecat.com/support/translation-toolbox/editing-log/
  • 31. PRODUCTION SETTING USE POST-EDIT COMPARE TO ANALYSE SDLXLIFF FILES http://www.translationzone.com/openexchange/app/post-editcompare-495.html
  • 32. PRODUCTION SETTING OKAPI FRAMEWORK TRANSLATION COMPARISON STEP http://www.opentag.com/okapi/wiki/index.php?title=Translation_Comparison_Step
  • 33. PRODUCTION SETTING QUALITIVITY PLUGIN FOR SDL TRADOS STUDIO http://www.translationzone.com/openexchange/app/qualitivity-788.html
  • 34. LESSONS LEARNED
      • The use of autoscoring to analyze post-project files is a valuable and cost-effective method to measure the post-editing effort.
      • A productivity test requires upfront organization and buy-in from translators.
      • It is important to find a tool that works with the given file format and workflow.
      • Access to pre- and post-edit versions of projects is required. This is a challenge on some accounts.
      • Identification and separation of MT segments from fuzzy segments may be required for some tools.
      • Look for consistency across languages and resources. Unusually high or low scores can be a sign of over-editing or under-editing.
  • 35. NEXT: MT evaluations of post-edited files: a case study
  • 36. CASE STUDY: TEST PILOT FOR LIGHT AND FULL POST-EDITING
      [Process diagram: Full Post-editing; Light Post-editing; LQA of PE Kits; Autoscoring PE Kits; Adequacy & Fluency Scoring; Error Typology Scoring; Analysis & Report]
      • Languages: Chinese (Simplified) and Japanese
      • The resources are regular translators for this client
      • To obtain comparable data, the same resource performed both the light and the full post-editing task on 438 segments
  • 37. CASE STUDY: HUMAN EVALS: ADEQUACY AND FLUENCY SCORES
      [Chart: series: Average of Adequacy, Average of Fluency; locales: ja-JP, zh-CN; axis from 1 to 5; data labels: 3.68, 3.66, 3.58, 3.4]
  • 38. CASE STUDY: AUTOSCORES: AUTOSCORES FOR LIGHT AND FULL POST-EDITING
      [Chart: BLEU, METEOR, GTM, Recall (%) and TER; series: ja-JP Full PE, zh-CN Full PE, ja-JP Light PE, zh-CN Light PE; axis from 20.00 to 90.00]
  • 39. CASE STUDY: LESSONS LEARNED
      • Using autoscores on post-edited translations can indicate the level of post-editing effort involved for a specific content type and MT engine
      • The autoscores also illustrate the difference in effort between Light and Full Post-editing: a delta of approximately 20 points for BLEU and 15 points for TER
      • The autoscores confirm that the resources have indeed managed to perform two distinct post-editing levels
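
Slides 25 and 26 contrast a character-based PE Distance (Levenshtein) with a word-based TER. The Python sketch below illustrates that contrast only; it is not the wescore implementation. The function names, the normalization of PE Distance by the longer string, and the omission of TER's shift operation are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any two sequences: strings (characters) or lists (words).
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def pe_distance(mt, pe):
    """Character-based edit distance as a percentage.

    Assumption: normalized by the length of the longer string.
    """
    longest = max(len(mt), len(pe))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(mt, pe) / longest


def word_edit_rate(mt, pe):
    """Word-based edit rate as a percentage of post-edited words.

    Simplification: real TER also allows shifts of word sequences,
    which this sketch does not model.
    """
    mt_words, pe_words = mt.split(), pe.split()
    if not pe_words:
        return 0.0 if not mt_words else 100.0
    return 100.0 * levenshtein(mt_words, pe_words) / len(pe_words)


# Hypothetical segment pair: the post-editor fixes one word form.
mt = "the engine translate the file"
pe = "the engine translates the file"
print(round(pe_distance(mt, pe), 1))  # 3.3
print(word_edit_rate(mt, pe))         # 20.0
```

In this made-up example the single added letter costs one character edit out of 30, while the word-based rate counts one whole substituted word out of five, so the two metrics weight the same morphological fix very differently.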

Editor's Notes

  1. We started collecting all our MT evaluation scores into a database in 2013. We are users of MT, practitioners rather than researchers, so we welcome any feedback from the audience on how to analyze our evaluation data in a way that is meaningful and still practical.
  2. What can you do when there is no means to evaluate MT upfront? Run your analysis post-production.
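
The correlation figures on slides 13 and 18 are Pearson's r values. As a rough sketch of how such a coefficient can be computed from paired scores, the Python below uses plain lists. The function name and the sample scores are hypothetical and are not taken from the evaluation database.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up paired scores: one Fluency average and one TER score per test.
fluency = [2.5, 3.0, 3.5, 4.0, 4.5]
ter = [70.0, 55.0, 60.0, 40.0, 35.0]
r = pearson_r(fluency, ter)
print(round(r, 2))  # negative: higher Fluency goes with lower TER
```

A negative r is the expected direction for error metrics such as TER and PE Distance, where lower values are better.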