Welocalize presentation by Lena Marg. Machine translation research focused on the results of a major data-gathering exercise carried out in 2014 by the Welocalize Language Tools team.
We correlated results from automatic scoring (in this case BLEU), human scoring of raw MT output on a 1-5 Likert scale, and productivity test deltas from 2013 data. The total test set comprised 22 locales, five different MT systems and various source content types. In line with findings from other speakers and recent publications, we found that while automatic scores such as BLEU serve as good trend indicators of overall MT system performance, they tell us little about how useful the given MT output is for post-editors. Human scoring, on the other hand, correlated with the productivity gains seen in post-editing, and error classification proved a better indicator of usability. This confirmed the validity of our evaluation approach, which combines productivity data and human evaluation.
For additional information, visit http://www.welocalize.com/wemt/why-wemt/
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
1. Rating Evaluation Methods through Correlation
presented by Lena Marg, Language Tools Team
@ MTE 2014, Workshop on Automatic and Manual Metrics for Operational Translation Evaluation
The 9th edition of the Language Resources and Evaluation Conference, Reykjavik
4. Background on MT Programs @ Welocalize
MT programs vary with regard to:
Scope
Locales
Maturity
System Setup & Ownership
MT Solution used
Key Objective of using MT
Final Quality Requirements
Source Content
5. MT Quality Evaluation @ Welocalize
1. Automatic Scores
Provided by the MT system (typically BLEU)
Provided by our internal scoring tool (range of metrics)
2. Human Evaluation
Adequacy, scores 1-5
Fluency, scores 1-5
3. Productivity Tests
Post-Editing versus Human Translation in iOmegaT
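The slides do not spell out how the productivity delta is computed; a common definition, assumed here, is the relative throughput gain of post-editing (PE) over human translation from scratch (HT):

```python
def productivity_delta(pe_words_per_hour: float, ht_words_per_hour: float) -> float:
    """Relative throughput gain of post-editing over human translation.

    NOTE: the deck does not state its formula; (PE - HT) / HT is a common
    convention and is only an assumption here. Returned as a fraction
    (multiply by 100 for the percentage deltas shown on later slides).
    """
    return (pe_words_per_hour - ht_words_per_hour) / ht_words_per_hour


# e.g. 750 words/hour post-editing vs 500 words/hour translating from scratch
# gives a delta of 0.5, i.e. a 50% productivity gain
gain = productivity_delta(750, 500)
```

A negative delta (as in the -20% points on the later scatter plots) then means post-editing was slower than translating from scratch.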
6. The Database
Objective:
Establish correlations between these 3 evaluation approaches to
- draw conclusions on predicting productivity gains
- see how & when to use the different metrics best
Contents:
- Data from 2013
- Metrics (BLEU & PE Distance, Adequacy & Fluency, Productivity deltas)
- Various locales, MT systems, content types
- MT error analysis
- Post-editing quality scores
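The deck lists PE Distance as a metric without defining it; a common formulation, assumed here, is a Levenshtein edit distance between the raw MT output and the post-edited text, normalised by string length:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pe_distance(mt_output: str, post_edited: str) -> float:
    """Character-level edit distance normalised by the longer string (0.0-1.0).

    NOTE: this normalisation is an assumption; the deck's exact PE Distance
    formula (character vs word level, normalisation choice) is not given.
    """
    if not mt_output and not post_edited:
        return 0.0
    return levenshtein(mt_output, post_edited) / max(len(mt_output), len(post_edited))
```

Under this reading, a PE Distance of 0% means the post-editor accepted the MT output unchanged, and higher percentages mean heavier editing.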
7. Method
Pearson’s r
If r =
+.70 or higher Very strong positive relationship
+.40 to +.69 Strong positive relationship
+.30 to +.39 Moderate positive relationship
+.20 to +.29 Weak positive relationship
+.01 to +.19 No or negligible relationship
-.01 to -.19 No or negligible relationship
-.20 to -.29 Weak negative relationship
-.30 to -.39 Moderate negative relationship
-.40 to -.69 Strong negative relationship
-.70 or higher Very strong negative relationship
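The strength bands above map directly onto a small helper; a minimal Python sketch (function names are illustrative, not from the deck):

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def strength(r):
    """Map r to the qualitative labels used on this slide."""
    a = abs(r)
    if a >= 0.70:
        label = "very strong"
    elif a >= 0.40:
        label = "strong"
    elif a >= 0.30:
        label = "moderate"
    elif a >= 0.20:
        label = "weak"
    else:
        return "no or negligible relationship"
    sign = "positive" if r > 0 else "negative"
    return f"{label} {sign} relationship"
```

For example, the deck's r of 0.82 for Adequacy vs Fluency falls in the "very strong positive relationship" band, and -0.41 for Adequacy vs PE Distance in the "strong negative relationship" band.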
9. Correlation Results
Adequacy vs Fluency
[Scatter plot: "Fluency and Adequacy - All Locales"; Adequacy score (x-axis, 1-5) vs Fluency score (y-axis, 1-5)]
A Pearson’s r of 0.82 across 182 test sets and 22 locales is a very strong positive relationship.
COMMENT
- Most locales show a strong correlation between their Fluency and Adequacy scores.
- A high correlation is expected (with MT systems customised on in-domain data): if a segment is not understandable at all, it is neither accurate nor fluent; if a segment is almost perfect, both scores will be very high.
- Some evaluators might not differentiate enough between Adequacy and Fluency, artificially inflating the correlation.
10. Correlation Results
Adequacy and Fluency versus BLEU
[Scatter plots: BLEU score (y-axis, 0-80) vs Fluency score and vs Adequacy score (x-axis, 1-5)]
Fluency and BLEU across locales have a Pearson’s r of 0.41, a strong positive relationship.
Adequacy and BLEU across locales have a Pearson’s r of 0.26, a weak positive relationship.
[Bar chart: Pearson’s r (-1.00 to 1.00) of Adequacy & BLEU and Fluency & BLEU per locale: da_DK, de_DE, es_ES, es_LA, fr_CA, fr_FR, it_IT, ja_JP, ko_KR, pt_BR, ru_RU, zh_CN]
Adequacy, Fluency and BLEU correlation for locales with 4 or more test sets*
11. Correlation Results
Adequacy and Fluency versus PE Distance
Fluency and PE Distance across all locales have a cumulative Pearson’s r of -0.70, a very strong negative relationship.
Adequacy and PE Distance across all locales have a cumulative Pearson’s r of -0.41, a strong negative relationship.
[Scatter plots: PE Distance (y-axis, 0-90%) vs Fluency score and vs Adequacy score (x-axis, 1-5)]
A negative correlation is desired: as Adequacy and Fluency scores increase, PE Distance should decrease accordingly.
[Bar chart: "Adequacy, Fluency and PE Distance Correlation"; Pearson’s r (-1.00 to 1.00) of Adequacy & PE Distance and Fluency & PE Distance per locale: de_DE, es_ES/LA, fr_FR/CA, it_IT, pt_BR]
12. Correlation Results
Adequacy and Fluency versus Productivity Delta
Productivity and Adequacy across all locales have a cumulative Pearson’s r of 0.77, a very strong correlation.
Productivity and Fluency across all locales have a cumulative Pearson’s r of 0.71, a very strong correlation.
[Scatter plots: "Productivity Delta and Adequacy" and "Productivity Delta and Fluency"; Productivity Delta (y-axis, -20% to 100%) vs human evaluation score (x-axis, 1-5)]
13. Correlation Results
Automatic Metrics versus Productivity Delta
[Scatter plot: "Productivity Delta and PE Distance"; Productivity Delta (y-axis, -100% to 200%) vs Post-Edit Distance (x-axis, 0-100%)]
With a Pearson’s r of -0.436, Productivity declines as PE Distance increases (indicating greater effort from the post-editor); a strong negative relationship.
[Scatter plot: "BLEU & Productivity Delta"; Productivity Delta (y-axis, -100% to 200%) vs BLEU score (x-axis, 0-100)]
Productivity Delta and BLEU have a cumulative Pearson’s r of 0.24, a weak positive relationship.
14. Correlation Results
Summary

Pearson's r  Variables                            Strength of Correlation            Tests (N)  Locales  Significance (p <)
 0.82        Adequacy & Fluency                   Very strong positive relationship  182        22       0.0001
 0.77        Adequacy & P Delta                   Very strong positive relationship  23         9        0.0001
 0.71        Fluency & P Delta                    Very strong positive relationship  23         9        0.00015
 0.55        Cognitive Effort Rank & PE Distance  Strong positive relationship       16         10       0.027
 0.41        Fluency & BLEU                       Strong positive relationship       146        22       0.0001
 0.26        Adequacy & BLEU                      Weak positive relationship         146        22       0.0015
 0.24        BLEU & P Delta                       Weak positive relationship         106        26       0.012
 0.13        Numbers of Errors & PE Distance      No or negligible relationship      16         10       ns
-0.30        Predominant Error & BLEU             Moderate negative relationship     63         13       0.017
-0.32        Cognitive Effort Rank & PE Delta     Moderate negative relationship     20         10       ns
-0.41        Numbers of Errors & BLEU             Strong negative relationship       63         20       0.00085
-0.41        Adequacy & PE Distance               Strong negative relationship       38         13       0.011
-0.42        PE Distance & P Delta                Strong negative relationship       72         27       0.00024
-0.70        Fluency & PE Distance                Very strong negative relationship  38         13       0.0001
-0.81        BLEU & PE Distance                   Very strong negative relationship  75         27       0.0001
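As a sanity check on p-values like the ones in the summary, the significance of a Pearson's r is conventionally tested via the statistic t = r * sqrt((N - 2) / (1 - r^2)) with N - 2 degrees of freedom. A minimal stdlib sketch (the exact p-value needs a t-distribution CDF, which Python's standard library does not provide, so only the statistic is computed here):

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson's r over n paired
    observations differs from zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# r = 0.82 over 182 test sets (Adequacy & Fluency) gives t around 19,
# far beyond any conventional critical value, consistent with p < 0.0001.
t_adequacy_fluency = t_statistic(0.82, 182)

# r = 0.24 over 106 test sets (BLEU & P Delta) gives t around 2.5,
# a borderline value, consistent with the reported p < 0.012.
t_bleu_pdelta = t_statistic(0.24, 106)
```

Feeding each (r, N) row through this gives statistics whose order of magnitude matches the reported p column, which is a quick way to audit a table like this one.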
15. Takeaways: Correlations
The strongest correlations were found between:
Adequacy & Fluency
BLEU & PE Distance
Adequacy & Productivity Delta
Fluency & Productivity Delta
Fluency & PE Distance
The human evaluations come out as stronger indicators of potential post-editing productivity gains than automatic metrics.
16. Error Analysis
[Bar chart: "Error Type Frequency" by category; values range from 1% to 20%]
Data size: 117 evaluations x 25 segments (2,925 segments), including 22 locales and different MT systems (hybrid & SMT).
Taking this “broad sweep” view, most errors logged by evaluators across all categories are:
- Sentence structure (word order)
- MT output too literal
- Wrong terminology
- Word form disagreements
- Source term left untranslated
17. Error Analysis
A similar picture emerges when we focus on the 8 dominant language pairs that constituted the bulk of the evaluations in the dataset.
18. Takeaways: Most Frequent Errors Logged
Across different MT systems, content types AND locales, 5 error categories stand out in particular.
Questions:
How (if at all) do these errors correlate with post-editing effort and with predicting productivity gains?
How (if at all) can the findings on errors be used to improve the underlying systems?
Are the current error categories what we need?
Can the categories be improved for evaluators?
Will these categories work for other post-editing scenarios (e.g. light PE)?
19. Takeaways
Remodelling of the Human Evaluation form to:
- increase user-friendliness
- distinguish better between Adequacy and Fluency errors
- align with cognitive effort categories proposed in the literature
- improve relevance for system updates
E.g. “Literal Translation” seemed too broad and was probably over-used.
20. Next Steps
o Focus on language groups and individual languages: do we see the same correlations?
o Focus on different MT systems.
o Add categories to the database (e.g. string length, post-editor experience).
o Add new data to the database and repeat the correlations.
o Continuously tweak the Human Evaluation template and process, as it proves to provide valuable insights for predictions, as well as for post-editor on-boarding/education and MT system improvement.
o Investigate correlation with other automatic scores (…).
NOTE: correlations were found for individual languages with 4 or more test sets, but there was not enough data for statistical significance.
It does show, however, that BLEU isn’t a good indicator.
* Also points out that human evaluation needs to be done accurately.