Rating Evaluation Methods
through Correlation
presented by Lena Marg,
Language Tools Team
@ MTE 2014, Workshop on Automatic and Manual Metrics for Operational
Translation Evaluation
The 9th edition of the Language Resources and Evaluation Conference, Reykjavik
Background on MT Programs @
MT programs vary with regard to:
Scope
Locales
Maturity
System Setup & Ownership
MT Solution used
Key Objective of using MT
Final Quality Requirements
Source Content
MT Quality Evaluation @
1. Automatic Scores
- Provided by the MT system (typically BLEU)
- Provided by our internal scoring tool (range of metrics)
2. Human Evaluation
- Adequacy, scores 1-5
- Fluency, scores 1-5
3. Productivity Tests
- Post-Editing versus Human Translation in iOmegaT
The Database
Objective:
Establish correlations between these 3 evaluation approaches to
- draw conclusions on predicting productivity gains
- see how & when to use the different metrics best
Contents:
- Data from 2013
- Metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas)
- Various locales, MT systems, content types
- MT error analysis
- Post-editing quality scores
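The slides do not define how PE Distance or the productivity delta are computed. As a rough sketch only, assuming PE Distance is a character-level edit distance between the raw MT output and its post-edited version, normalized by the longer string, and assuming the productivity delta is the relative throughput gain of post-editing over human translation, the two metrics might look like this. The function names and both definitions are illustrative assumptions, not the internal scoring tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def pe_distance(mt_output: str, post_edited: str) -> float:
    """ASSUMED definition: edit distance divided by the longer string's length (0.0 to 1.0)."""
    longest = max(len(mt_output), len(post_edited))
    if longest == 0:
        return 0.0
    return levenshtein(mt_output, post_edited) / longest


def productivity_delta(pe_words_per_hour: float, ht_words_per_hour: float) -> float:
    """ASSUMED definition: relative throughput gain of post-editing over human translation."""
    return (pe_words_per_hour - ht_words_per_hour) / ht_words_per_hour
```

Under these assumptions, an unchanged segment has a PE distance of 0.0, and a post-editor who is slower than when translating from scratch gets a negative productivity delta.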
Method
Pearson’s r
If r =
+.70 or higher Very strong positive relationship
+.40 to +.69 Strong positive relationship
+.30 to +.39 Moderate positive relationship
+.20 to +.29 Weak positive relationship
+.01 to +.19 No or negligible relationship
-.01 to -.19 No or negligible relationship
-.20 to -.29 Weak negative relationship
-.30 to -.39 Moderate negative relationship
-.40 to -.69 Strong negative relationship
-.70 or lower Very strong negative relationship
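A minimal sketch of the method: Pearson's r computed from paired scores, then mapped onto the strength labels of the scale above. The function names are illustrative, and treating values that fall between two bands (e.g. 0.695) by plain threshold comparison is an assumption, since the scale only lists two-decimal values.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Map r onto the labels of the scale used in these slides."""
    size = abs(r)
    if size >= 0.70:
        label = "Very strong"
    elif size >= 0.40:
        label = "Strong"
    elif size >= 0.30:
        label = "Moderate"
    elif size >= 0.20:
        label = "Weak"
    else:
        return "No or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} relationship"
```

For example, `strength(0.26)` returns "Weak positive relationship" and `strength(-0.41)` returns "Strong negative relationship".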
The Database: Data Used
27 locales in total, with varying amounts of available data
+ 5 different MT systems (SMT & Hybrid)
Correlation Results
Adequacy vs Fluency
[Figure: scatter plot "Fluency and Adequacy - All Locales"; axes: Fluency and Adequacy, each on the 1-5 scale]
A Pearson’s r of 0.82 across 182 test sets and 22 locales is a very strong,
positive relationship
COMMENT
- most locales show a strong correlation between their Fluency and Adequacy scores
- a high correlation is expected (with MT systems customized on in-domain data): if a segment is really not understandable, it is neither accurate nor fluent, and if a segment is almost perfect, both scores would be very high
- some evaluators might not differentiate enough between Adequacy & Fluency, falsely creating a higher correlation
Correlation Results
Adequacy and Fluency versus BLEU
[Figure: two scatter plots; BLEU Score against Fluency Score (1-5) and BLEU Score against Adequacy Score (1-5)]
Fluency and BLEU across locales have a Pearson’s r of 0.41, a strong positive relationship.
Adequacy and BLEU across locales have a Pearson’s r of 0.26, a weak positive relationship.
[Figure: bar chart "Adequacy, Fluency & BLEU Correlation - All Locales"; Pearson's r (-1.00 to 1.00) for Adequacy & BLEU and Fluency & BLEU per locale: da_DK, de_DE, es_ES, es_LA, fr_CA, fr_FR, it_IT, ja_JP, ko_KR, pt_BR, ru_RU, zh_CN]
Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets*
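The per-locale breakdown with a minimum number of test sets can be sketched as follows. The row layout `(locale, human_score, bleu)`, the function names, and the sample rows in the usage note are hypothetical; only the threshold of 4 test sets comes from the slide.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Plain Pearson's r; raises ValueError if a variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def per_locale_r(rows, min_tests=4):
    """Group test sets by locale and correlate only locales with enough test sets."""
    grouped = defaultdict(list)
    for locale, human_score, bleu in rows:
        grouped[locale].append((human_score, bleu))
    results = {}
    for locale, pairs in grouped.items():
        if len(pairs) < min_tests:
            continue  # too few test sets for a meaningful r
        scores, bleus = zip(*pairs)
        results[locale] = pearson_r(scores, bleus)
    return results
```

With made-up rows such as four `de_DE` test sets and three `fr_FR` test sets, only `de_DE` would appear in the result, because `fr_FR` falls below the threshold.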
Correlation Results
Adequacy and Fluency versus PE Distance
Fluency and PE distance across all locales have a cumulative Pearson’s r of -0.70, a very strong negative relationship
Adequacy and PE distance across all locales have a cumulative Pearson’s r of -0.41, a strong negative relationship
[Figure: two scatter plots; PE Distance (in %) against Fluency Score (1-5) and PE Distance (in %) against Adequacy Score (1-5)]
A negative correlation is desired: as Adequacy and Fluency scores increase, PE distance
should decrease proportionally.
[Figure: bar chart "Adequacy, Fluency and PE Distance Correlation"; Pearson's r (-1.00 to 1.00) for Adequacy & PE Distance and Fluency & PE Distance, for de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR]
Correlation Results
Adequacy and Fluency versus Productivity Delta
Productivity and Adequacy across all locales have a cumulative Pearson’s r of 0.77, a very strong positive relationship
Productivity and Fluency across all locales have a cumulative Pearson’s r of 0.71, a very strong positive relationship
[Figure: two scatter plots, "Productivity Delta and Adequacy" and "Productivity Delta and Fluency"; Productivity Delta (in %) against Human Evaluation Adequacy Score (1-5) and against Human Evaluation Fluency Score (1-5)]
Correlation Results
Automatic Metrics versus Productivity Delta
[Figure: scatter plot "Productivity Delta and PE Distance"; Productivity Delta (as %) against Post-Edit Distance (0%-100%)]
Productivity Delta and PE Distance have a Pearson’s r of -0.436, a strong negative relationship: as PE distance increases, indicating a greater effort from the post-editor, productivity declines.
[Figure: scatter plot "BLEU & Productivity Delta"; Productivity Delta (as %) against BLEU Score (0-100)]
Productivity Delta and BLEU have a cumulative Pearson’s r of 0.24, a weak positive relationship.
Correlation Results
Summary
Pearson's r  Variables                             Strength of Correlation            Tests (N)  Locales  Statistical Significance (p value <)
 0.82        Adequacy & Fluency                    Very strong positive relationship  182        22       0.0001
 0.77        Adequacy & P Delta                    Very strong positive relationship  23         9        0.0001
 0.71        Fluency & P Delta                     Very strong positive relationship  23         9        0.00015
 0.55        Cognitive Effort Rank & PE Distance   Strong positive relationship       16         10       0.027
 0.41        Fluency & BLEU                        Strong positive relationship       146        22       0.0001
 0.26        Adequacy & BLEU                       Weak positive relationship         146        22       0.0015
 0.24        BLEU & P Delta                        Weak positive relationship         106        26       0.012
 0.13        Numbers of Errors & PE Distance       No or negligible relationship      16         10       ns
-0.30        Predominant Error & BLEU              Moderate negative relationship     63         13       0.017
-0.32        Cognitive Effort Rank & PE Delta      Moderate negative relationship     20         10       ns
-0.41        Numbers of Errors & BLEU              Strong negative relationship       63         20       0.00085
-0.41        Adequacy & PE Distance                Strong negative relationship       38         13       0.011
-0.42        PE Distance & P Delta                 Strong negative relationship       72         27       0.00024
-0.70        Fluency & PE Distance                 Very strong negative relationship  38         13       0.0001
-0.81        BLEU & PE Distance                    Very strong negative relationship  75         27       0.0001
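The slides do not say how the significance column was obtained. A common approach, shown here as an assumption rather than the authors' procedure, is the two-sided t-test for a correlation coefficient: t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom. The sketch below uses only the standard library and integrates the Student t density numerically with Simpson's rule; the function names are illustrative.

```python
import math


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    if n <= 2 or abs(r) >= 1:
        raise ValueError("need n > 2 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))


def t_pdf(x, df):
    """Density of Student's t distribution (computed in log space for stability)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(r, n, steps=2000):
    """Two-sided p-value: 1 minus twice the area under the t density on [0, |t|]."""
    df = n - 2
    t = abs(t_statistic(r, n))
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)
```

For the row with r = 0.55 and N = 16, this test gives a p-value between 0.02 and 0.05, which is in the same range as the value reported in the table. With small N (16 to 23 tests), even large r values carry wide uncertainty, which is why N is worth reading alongside r.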
Takeaways: Correlations
The strongest correlations were found between:
- Adequacy & Fluency
- BLEU & PE Distance
- Adequacy & Productivity Delta
- Fluency & Productivity Delta
- Fluency & PE Distance
The Human Evaluations come out as stronger indicators for potential post-editing productivity gains than Automatic metrics.
Error Analysis
[Figure: bar chart "Error Type Frequency"; share of logged errors per error category, y-axis 0%-25%]
Data size: 117 evaluations x 25 segments (3125 segments), includes 22 locales, different
MT systems (hybrid & SMT).
Taking this “broad sweep” view, the errors most often logged by evaluators across all categories are:
- Sentence structure (word order)
- MT output too literal
- Wrong terminology
- Word form disagreements
- Source term left untranslated
Error Analysis
Similar picture when we focus on the 8 dominant language pairs that
constituted the bulk of the evaluations in the dataset.
Takeaways: Most Frequent Errors Logged
Across different MT systems, content types AND locales, 5 error categories stand out in particular.
Questions:
- How (if at all) do these correlate with post-editing effort, and can they help predict productivity gains?
- How (if at all) can the findings on errors be used to improve the underlying systems?
- Are the current error categories what we need?
- Can the categories be improved for evaluators?
- Will these categories work for other post-editing scenarios (e.g. light PE)?
Takeaways
Remodelling of Human Evaluation Form to:
- increase user-friendliness
- distinguish better between Adequacy & Fluency errors
- align with cognitive effort categories proposed in literature
- improve relevance for system updates
E.g. “Literal Translation” seemed too broad and was probably over-used.
Next Steps
- focus on language groups and individual languages: do we see the same correlations?
- focus on different MT systems
- add categories to database (e.g. string length, post-editor experience)
- add new data to database and repeat correlations
- continuously tweak Human Evaluation template and process, as it proves to provide valuable insights for predictions, as well as for post-editor on-boarding / education and MT system improvement
- investigate correlation with other AutoScores (…)
THANK YOU!
lena.marg@welocalize.com
with Laura Casanellas Luri, Elaine O’Curran, Andy Mallett

Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
Lowrate Call Girls In Laxmi Nagar Delhi ❤️8860477959 Escorts 100% Genuine Ser...
 
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Keshav Puram 🔝 Delhi NCR
 

Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014

  • 1. Rating Evaluation Methods through Correlation presented by Lena Marg, Language Tools Team @ MTE 2014, Workshop on Automatic and Manual Metrics for Operational Translation Evaluation The 9th edition of the Language Resources and Evaluation Conference, Reykjavik
  • 2.
  • 3.
  • 4. Background on MT Programs @ MT programs vary with regard to: Scope; Locales; Maturity; System Setup & Ownership; MT Solution used; Key Objective of using MT; Final Quality Requirements; Source Content
  • 5. MT Quality Evaluation @ 1. Automatic Scores: provided by the MT system (typically BLEU); provided by our internal scoring tool (range of metrics). 2. Human Evaluation: Adequacy, scores 1-5; Fluency, scores 1-5. 3. Productivity Tests: Post-Editing versus Human Translation in iOmegaT
  • 6. The Database. Objective: Establish correlations between these 3 evaluation approaches to - draw conclusions on predicting productivity gains - see how & when to use the different metrics best. Contents: - Data from 2013 - Metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas) - Various locales, MT systems, content types - MT error analysis - Post-editing quality scores
  • 7. Method: Pearson’s r. If r =
      +.70 or higher: Very strong positive relationship
      +.40 to +.69: Strong positive relationship
      +.30 to +.39: Moderate positive relationship
      +.20 to +.29: Weak positive relationship
      +.01 to +.19: No or negligible relationship
      -.01 to -.19: No or negligible relationship
      -.20 to -.29: Weak negative relationship
      -.30 to -.39: Moderate negative relationship
      -.40 to -.69: Strong negative relationship
      -.70 or lower: Very strong negative relationship
  • 8. The Database: Data Used. 27 locales in total, with varying amounts of available data, plus 5 different MT systems (SMT & Hybrid)
  • 9. Correlation results: Adequacy vs Fluency. [Scatter plot: Fluency and Adequacy - All Locales; axes: Fluency and Adequacy, each scored 1-5.] A Pearson’s r of 0.82 across 182 test sets and 22 locales is a very strong positive relationship. COMMENT: - most locales show a strong correlation between their Fluency and Adequacy scores - high correlation is expected (with MT systems customized on in-domain data): if a segment is really not understandable, it is neither accurate nor fluent, and if a segment is almost perfect, both would score very high - some evaluators might not differentiate enough between Adequacy & Fluency, falsely creating a higher correlation
  • 10. Correlation results: Adequacy and Fluency versus BLEU. [Scatter plots: BLEU Score against Fluency Score (1-5) and BLEU Score against Adequacy Score (1-5).] Fluency and BLEU across locales have a Pearson’s r of 0.41, a strong positive relationship. Adequacy and BLEU across locales have a Pearson’s r of 0.26, a weak positive relationship. [Bar chart: Adequacy, Fluency & BLEU Correlation - All Locales; Pearson’s r from -1.00 to 1.00 for da_DK, de_DE, es_ES, es_LA, fr_CA, fr_FR, it_IT, ja_JP, ko_KR, pt_BR, ru_RU, zh_CN; series: Adequacy & BLEU, Fluency & BLEU.] Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets*
  • 11. Correlation results: Adequacy and Fluency versus PE Distance. Fluency and PE distance across all locales have a cumulative Pearson’s r of -0.70, a very strong negative relationship. Adequacy and PE distance across all locales have a cumulative Pearson’s r of -0.41, a strong negative relationship. [Scatter plots: PE Distance (%) against Fluency Score (1-5) and PE Distance (%) against Adequacy Score (1-5).] A negative correlation is desired: as Adequacy and Fluency scores increase, PE distance should decrease proportionally. [Bar chart: Adequacy, Fluency and PE Distance Correlation; Pearson’s r from -1.00 to 1.00 for de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR; series: Adequacy & PE Distance, Fluency & PE Distance.]
  • 12. Correlation results: Adequacy and Fluency versus Productivity Delta. Productivity and Adequacy across all locales have a cumulative Pearson’s r of 0.77, a very strong correlation. Productivity and Fluency across all locales have a cumulative Pearson’s r of 0.71, a very strong correlation. [Scatter plots: Productivity Delta and Adequacy (Productivity Delta against Human Evaluation Adequacy Score, 1-5); Productivity Delta and Fluency (Productivity Delta against Human Evaluation Fluency Score, 1-5).]
  • 13. Correlation results: Automatic Metrics versus Productivity Delta. [Scatter plot: Productivity Delta and PE Distance (Productivity Delta as % against Post-Edit Distance).] Productivity delta and PE distance have a Pearson’s r of -0.436, a strong negative relationship: as PE distance increases, indicating a greater effort from the post-editor, productivity declines. [Scatter plot: BLEU & Productivity Delta (Productivity Delta as % against BLEU Score).] Productivity delta and BLEU have a cumulative Pearson’s r of 0.24, a weak positive relationship.
  • 14. Correlation results: Summary
      Pearson's r | Variables | Strength of Correlation | Tests (N) | Locales | Statistical Significance (p value <)
      0.82 | Adequacy & Fluency | Very strong positive relationship | 182 | 22 | 0.0001
      0.77 | Adequacy & P Delta | Very strong positive relationship | 23 | 9 | 0.0001
      0.71 | Fluency & P Delta | Very strong positive relationship | 23 | 9 | 0.00015
      0.55 | Cognitive Effort Rank & PE Distance | Strong positive relationship | 16 | 10 | 0.027
      0.41 | Fluency & BLEU | Strong positive relationship | 146 | 22 | 0.0001
      0.26 | Adequacy & BLEU | Weak positive relationship | 146 | 22 | 0.0015
      0.24 | BLEU & P Delta | Weak positive relationship | 106 | 26 | 0.012
      0.13 | Numbers of Errors & PE Distance | No or negligible relationship | 16 | 10 | ns
      -0.30 | Predominant Error & BLEU | Moderate negative relationship | 63 | 13 | 0.017
      -0.32 | Cognitive Effort Rank & PE Delta | Moderate negative relationship | 20 | 10 | ns
      -0.41 | Numbers of Errors & BLEU | Strong negative relationship | 63 | 20 | 0.00085
      -0.41 | Adequacy & PE Distance | Strong negative relationship | 38 | 13 | 0.011
      -0.42 | PE Distance & P Delta | Strong negative relationship | 72 | 27 | 0.00024
      -0.70 | Fluency & PE Distance | Very strong negative relationship | 38 | 13 | 0.0001
      -0.81 | BLEU & PE Distance | Very strong negative relationship | 75 | 27 | 0.0001
  • 15. Takeaways: Correlations. The strongest correlations were found between: Adequacy & Fluency; BLEU and PE Distance; Adequacy & Productivity Delta; Fluency & Productivity Delta; Fluency & PE Distance. The Human Evaluations come out as stronger indicators for potential post-editing productivity gains than Automatic metrics.
  • 16. Error analysis. [Bar chart: Error Type Frequency; bar values in chart order, category labels not preserved: 12%, 2%, 8%, 3%, 16%, 3%, 5%, 12%, 5%, 7%, 20%, 1%, 1%, 3%, 2%.] Data size: 117 evaluations x 25 segments (3125 segments), includes 22 locales, different MT systems (hybrid & SMT). Taking this "broad sweep" view, most errors logged by evaluators across all categories are: - Sentence structure (word order) - MT output too literal - Wrong terminology - Word form disagreements - Source term left untranslated
  • 17. Error analysis. Similar picture when we focus on the 8 dominant language pairs that constituted the bulk of the evaluations in the dataset.
  • 18. Takeaways: Most Frequent Errors Logged. Across different MT systems, content types AND locales, 5 error categories stand out in particular. Questions: How (if) do these correlate to the post-editing effort and predicting productivity gains? How (if) can the findings on errors be used to improve the underlying systems? Are the current error categories what we need? Can the categories be improved for evaluators? Will these categories work for other post-editing scenarios (e.g. light PE)?
  • 19. Takeaways. Remodelling of Human Evaluation Form to: - increase user-friendliness - distinguish better between Ad & Fl errors - align with cognitive effort categories proposed in literature - improve relevance for system updates. E.g. "Literal Translation" seemed too broad and probably over-used.
  • 20. Next steps: o focus on language groups and individual languages: do we see the same correlations? o focus on different MT systems o add categories to database (e.g. string length, post-editor experience) o add new data to database and repeat correlations o continuously tweak Human Evaluation template and process, as it proves to provide valuable insights for predictions, as well as post-editor on-boarding / education and MT system improvement o investigate correlation with other AutoScores (…)
  • 21. THANK YOU! lena.marg@welocalize.com with Laura Casanellas Luri, Elaine O’Curran, Andy Mallett
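
The method on slide 7 can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the study: the function names `pearson_r` and `strength` are hypothetical, and the slide's bands are treated as continuous thresholds on the absolute value of r (so a value such as 0.295, which falls between two printed bands, is assigned to the lower band). An r of exactly 0 is not covered by the slide and is labelled negligible here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def strength(r):
    """Map r to the slide 7 labels (assumption: bands applied to |r|)."""
    size = abs(r)
    if size >= 0.70:
        label = "Very strong"
    elif size >= 0.40:
        label = "Strong"
    elif size >= 0.30:
        label = "Moderate"
    elif size >= 0.20:
        label = "Weak"
    else:
        return "No or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} relationship"


# r values taken from the summary slide (slide 14).
for r in (0.82, 0.26, -0.41, 0.13):
    print(f"{r:+.2f}: {strength(r)}")
```

Applied to the values reported on the summary slide, `strength(0.26)` returns "Weak positive relationship", which matches the label given there for Adequacy & BLEU.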

Editor's Notes

  1. NOTE: correlations were found for individual languages with 4 or more test sets, but there is not enough data for them to be statistically relevant. It does show, however, that BLEU isn't a good indicator. * Also points out that HE (Human Evaluation) needs to be done accurately
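
The point about small test sets can be illustrated with the standard t-test for a Pearson correlation, t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom. The slides do not say which test produced the reported p-values, so this is an assumption, and `t_statistic` is a hypothetical helper name. The sketch compares the same r at N = 16 (the Cognitive Effort Rank & PE Distance row of the summary) and at a hypothetical N = 4, the minimum number of test sets mentioned in the note.

```python
from math import sqrt

# Two-tailed 5% critical values of Student's t for the degrees of
# freedom used below (standard table values).
T_CRIT_05 = {2: 4.303, 14: 2.145}


def t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation (df = n - 2)."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined for |r| >= 1")
    return r * sqrt((n - 2) / (1 - r * r))


# Same r, different sample sizes: N = 16 is from the summary slide,
# N = 4 is a hypothetical small sample.
for n in (16, 4):
    t = t_statistic(0.55, n)
    significant = abs(t) > T_CRIT_05[n - 2]
    print(f"r=0.55, N={n}: t={t:.2f}, significant at 5%: {significant}")
```

With r = 0.55, the statistic clears the 5% threshold at N = 16 but not at N = 4, which is why per-locale correlations computed on a handful of test sets should be read with caution.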