Lucia Specia - Quality Estimation in MT

Quality Estimation of Machine Translation
Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk
Faculdade de Letras da Universidade do Porto
13 May 2013

Outline
1 Quality of Machine Translation
2 Quality Estimation
3 Open issues
4 Conclusions


Introduction
Machine Translation:
Around since the early 1950s
Increasingly popular since 1990: statistical approaches
Software tools and data available to build translation systems - Moses and others
Increasing demand for cheaper and faster translations
How do we measure quality and progress over time?
So far... mostly automatic evaluation metrics

MT evaluation metrics
N-gram matching between system output and one or
more reference translations: BLEU and many others

Issue 1: Too many possible good quality translations, need thousands of references to capture valid variations
Solution: HyTER (Language Weaver) annotation tool to generate all possible correct translations! [DM12]
Translations built bottom-up from word/phrase translation equivalents using FSA
2-2.5 hours' worth of expert annotation per sentence
One annotator: $5.2 \times 10^{6}$ paths
A bunch of annotators: $8.5 \times 10^{11}$ paths
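To see how a handful of local alternatives can multiply into millions of translations, here is a minimal sketch that counts the paths through a small acyclic lattice. The lattice, its state numbers and its phrases are invented for illustration; this is not the HyTER tool or its data format.

```python
from functools import lru_cache

# Hypothetical toy lattice: state -> list of (phrase, next_state).
# State 3 is the final state. This is not HyTER's actual format.
LATTICE = {
    0: [("the battery", 1), ("its battery", 1)],
    1: [("lasts", 2), ("runs for", 2), ("works for", 2)],
    2: [("6 hours", 3), ("six hours", 3)],
    3: [],
}
FINAL = 3


def count_paths(lattice, start, final):
    """Count distinct start-to-final paths in an acyclic lattice."""
    @lru_cache(maxsize=None)
    def paths_from(state):
        if state == final:
            return 1
        return sum(paths_from(nxt) for _, nxt in lattice[state])
    return paths_from(start)


n_paths = count_paths(LATTICE, 0, FINAL)
print(n_paths)  # 2 * 3 * 2 alternatives
```

Three slots with two or three alternatives each already give 12 translations; longer sentences with more alternatives per slot grow multiplicatively.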

Issue 2: Difficult to quantify severity of mismatching n-grams
ref: Do not buy this product, it’s their craziest invention!
sys: Do buy this product, it’s their craziest invention!
Some attempts to weight mismatches differently - sparse, lexicalised approach
However, same error is more or less important depending on the user or purpose:
Severe if end-user does not speak source language
Trivial to post-edit by translators
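The "Do not buy" example can be checked with a minimal sketch of clipped n-gram precision, the core ingredient of BLEU-style metrics. This is a simplified illustration (lowercased, hand-tokenised, no brevity penalty), not the full BLEU formula.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(sys_tokens, ref_tokens, n):
    """Fraction of system n-grams also found in the reference (counts clipped)."""
    sys_counts = Counter(ngrams(sys_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    total = sum(sys_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in sys_counts.items())
    return matched / total


ref = "do not buy this product , it's their craziest invention !".split()
sys_out = "do buy this product , it's their craziest invention !".split()

p1 = clipped_precision(sys_out, ref, 1)
p2 = clipped_precision(sys_out, ref, 2)
print(f"unigram precision: {p1:.2f}, bigram precision: {p2:.2f}")
```

Every system unigram and 8 of the 9 system bigrams match the reference, so the n-gram scores stay high even though the meaning is reversed.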

Conversely:
ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
sys: Six-hours battery, 30 minutes to full charge last.
Ok for gisting - meaning preserved
Very costly for post-editing if style is to be preserved

Task-based evaluation
Measure translation quality within task. E.g. Autodesk -
Productivity test through post-editing [Aut11]
2-day translation and post-editing, 37 participants
In-house Moses (Autodesk data: software)
Time spent on each segment

E.g.: Intel - User satisfaction with un-edited MT
Translation is good if customer can solve problem

MT for Customer Support websites [Int10]
Overall customer satisfaction: 75% for English→Chinese
95% reduction in cost
Project cycle from 10 days to 1 day
From 300 to 60,000 words translated/hour
Customers in China using MT texts were more satisfied with support than natives using original texts (68%)!

MT for chat and community forums [Int12]
∼60% “understandable and actionable” (→English/Spanish)
Max ∼10% “not understandable” (→Chinese)


Overview
Metrics either depend on references or post-editing/use of translations (task-based)
Our proposal
Quality assessment without reference, prior to post-editing/use of translations

Overview
Why don’t translators use (more) MT?
Translations are not good enough!
What about TMs? Aren’t fuzzy matches useful?

Framework
Quality estimation (QE): provide an estimate of quality for new translated text *before* it is post-edited
Quality = post-editing effort
No access to reference translations: machine learning techniques to predict post-editing effort scores
Considers interaction with TM systems: only used for low fuzzy match cases, or to select between TM and MT
QTLaunchPad project
Multidimensional Quality Metrics for MT and HT, for manual and (semi-)automatic evaluation (QE):
http://www.qt21.eu/launchpad/

Framework
[Diagram. Training: examples (source & translations, quality scores) and quality indicators are used to build the QE system. Use: source text → MT system → translation → QE system → quality score.]
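The framework can be illustrated with a minimal sketch: scored examples are turned into indicator vectors, and the effort of a new translation is predicted from them without any reference. The two indicators, the training values and the choice of a k-nearest-neighbour regressor are all assumptions made for illustration; the slides only say that a wide range of learning algorithms is used.

```python
import math

# Hypothetical training examples: ((oov_ratio, length_mismatch), effort).
# Effort is an HTER-like score in [0, 1]; higher means more post-editing.
TRAIN = [
    ((0.00, 0.05), 0.10),
    ((0.05, 0.10), 0.20),
    ((0.20, 0.30), 0.60),
    ((0.30, 0.50), 0.80),
]


def predict_effort(train, features, k=2):
    """Average the effort of the k training examples nearest in feature space."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], features))[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


good = predict_effort(TRAIN, (0.04, 0.08))  # indicators of a clean-looking output
bad = predict_effort(TRAIN, (0.28, 0.45))   # indicators of a suspicious output
print(f"{good:.2f} {bad:.2f}")
```

Any regressor could replace the nearest-neighbour average; the point is only that prediction needs the source, the translation and their indicators, not a reference translation.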

Examples of positive results
Time to post-edit subset of sentences predicted as “good” (low effort) vs time to post-edit random subset of sentences
Language   no QE            QE
fr-en      0.75 words/sec   1.09 words/sec
en-es      0.32 words/sec   0.57 words/sec
Accuracy in selecting best translation among 4 MT systems
Best MT system   Highest QE score
54%              77%
State-of-the-art
Quality indicators (extracted from the source text, the MT system and the translation):
Confidence indicators
Complexity indicators
Fluency indicators
Adequacy indicators
Learning algorithms: wide range
Datasets: few with absolute human scores (1-4/5 scores, PE time, edit distance)


State-of-the-art indicators
Shallow indicators:
(S/T/S-T) Sentence length
(S/T) Language model
(S/T) Token-type ratio
(S) Average number of possible translations per word
(S) % of n-grams belonging to different frequency quartiles of a source language corpus
(T) Untranslated/OOV words
(T) Mismatching brackets, quotation marks
(S-T) Preservation of punctuation
(S-T) Word alignment score, etc.
These do well for estimating post-editing effort...
...but are not enough for other aspects of quality, e.g. adequacy
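A few of the shallow indicators listed above can be sketched in a handful of lines. The function and key names are hypothetical, tokenisation is plain whitespace splitting, and the token-type ratio is computed here as distinct tokens over total tokens, which is an assumption about the intended definition.

```python
import string


def shallow_indicators(source, target):
    """Compute a few shallow QE indicators from whitespace-tokenised text."""
    src, tgt = source.split(), target.split()
    src_punct = sum(ch in string.punctuation for ch in source)
    tgt_punct = sum(ch in string.punctuation for ch in target)
    return {
        "source_length": len(src),
        "target_length": len(tgt),
        "length_ratio": len(tgt) / len(src) if src else 0.0,
        "target_type_token_ratio": len(set(tgt)) / len(tgt) if tgt else 0.0,
        "punctuation_difference": abs(src_punct - tgt_punct),
        "unbalanced_brackets": target.count("(") != target.count(")"),
    }


feats = shallow_indicators(
    "The battery ( 6 hours ) lasts long .",
    "A bateria ( 6 horas dura muito .",
)
print(feats)
```

In the example the target drops the closing bracket, so both the punctuation difference and the bracket check flag a problem without any linguistic analysis.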

Linguistic indicators - count-based:
(S/T/S-T) Content/non-content words
(S/T/S-T) Nouns/verbs/... NP/VP/...
(S/T/S-T) Deictics (references)
(S/T/S-T) Discourse markers (references)
(S/T/S-T) Named entities
(S/T/S-T) Zero-subjects
(S/T/S-T) Pronominal subjects
(S/T/S-T) Negation indicators
(T) Subject-verb / adjective-noun agreement
(T) Language Model of POS
(T) Grammar checking (dangling words)
(T) Coherence

Linguistic indicators - alignment-based:
(S-T) Correct translation of pronouns
(S-T) Matching of dependency relations
(S-T) Matching of named entities
(S-T) Alignment of parse trees
(S-T) Alignment of predicates & arguments, etc.

Some indicators are language-dependent, others need
resources that are language-dependent, but apply to most
languages, e.g. LM of POS tags

Fine-grained, lexicalised indicators:
$$\text{target-word = “process”} = \begin{cases} 1, & \text{if source-word = “hdhh alamlyt”} \\ 0, & \text{otherwise} \end{cases}$$
$$\text{target-word = “process”} = \begin{cases} 1, & \text{if source-pos = “DT DTNN”} \\ 0, & \text{otherwise} \end{cases}$$
Closer to error detection
Need large amounts of training data [BHAO11], or RB approaches
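Lexicalised indicators of this kind are binary functions that fire only for one specific combination of target word and source-side context. A minimal sketch follows, using the values from the slide; the dictionary layout and the function names are hypothetical and not taken from [BHAO11].

```python
def make_indicator(target_word, attribute, value):
    """Build a binary feature that fires for one target word and one source-side value."""
    def indicator(token):
        return int(token["target_word"] == target_word and token[attribute] == value)
    return indicator


word_feature = make_indicator("process", "source_word", "hdhh alamlyt")
pos_feature = make_indicator("process", "source_pos", "DT DTNN")

# Hypothetical aligned tokens: a target word with its aligned source words and POS tags.
token = {"target_word": "process", "source_word": "hdhh alamlyt", "source_pos": "DT DTNN"}
other = {"target_word": "process", "source_word": "alamlyt", "source_pos": "DTNN"}

fired = [word_feature(token), pos_feature(token)]
silent = [word_feature(other), pos_feature(other)]
print(fired, silent)
```

Because each indicator covers a single word pairing, a realistic feature set contains a very large number of them, which is why large amounts of training data are needed.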

Do these indicators work?

To some extent... Issues:
Representation of shallow/deep indicators: counts, ratios, (absolute) differences?
$F = S - T$, $F = |S - T|$, $F = \frac{T}{S}$, $F = \frac{S - T}{S}$, ...
Resources to extract deep indicators: availability and reliability
Data to extract fine-grained indicators: need previously translated and post-edited data, esp. for negative examples
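The alternative representations of a source-side count S and a target-side count T can be compared side by side with a small sketch. The function and key names are hypothetical, and returning 0.0 when S is zero is an assumption made here to avoid division by zero.

```python
def representations(s, t):
    """Four ways to combine a source-side count s and a target-side count t."""
    return {
        "difference": s - t,                                 # F = S - T
        "absolute_difference": abs(s - t),                   # F = |S - T|
        "ratio": t / s if s else 0.0,                        # F = T / S
        "normalised_difference": (s - t) / s if s else 0.0,  # F = (S - T) / S
    }


# Example: 10 content words in the source, 8 in the translation.
reps = representations(10, 8)
print(reps)
```

The signed difference keeps the direction of the mismatch, while the ratio and normalised difference make sentences of different lengths comparable; which one helps a given learner is the open question raised on the slide.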

Manual scoring: agreement between translators
Absolute value judgements: difficult to achieve consistency across annotators even in highly controlled setup
en-es news WMT12 dataset: 3 professional translators, 1-5 scores
15% of initial dataset discarded: annotators disagreed by more than one category
Remaining annotations had to be scaled (0.33, 0.17, 0.50)

Manual scoring: agreement between translators
en-pt subtitles of TV series: 3 non-professional annotators, 1-4 scores
351 cases (41%): full agreement
445 cases (52%): partial agreement
54 cases (7%): null agreement
Agreement by score:
Score   Full agreement
4       59%
3       35%
2       23%
1       50%
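The three agreement categories can be computed with a short sketch. It assumes that "full" means all three annotators gave the same score, "partial" means exactly two agreed, and "null" means all three differed; the slides do not spell these definitions out, and the sample annotations are invented.

```python
from collections import Counter


def agreement_level(scores):
    """Classify one segment's scores as 'full', 'partial' or 'null' agreement."""
    distinct = len(set(scores))
    if distinct == 1:
        return "full"
    if distinct == len(scores):
        return "null"
    return "partial"


# Hypothetical 1-4 scores from three annotators for four segments.
annotations = [(4, 4, 4), (3, 4, 4), (1, 2, 4), (2, 2, 2)]
summary = Counter(agreement_level(a) for a in annotations)
print(summary)
```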

More objective ways of annotating translations
HTER: Edit distance between MT output and its minimally post-edited version
$$\mathrm{HTER} = \frac{\#\text{edits}}{\#\text{words in post-edited version}}$$
Edits: substitute, delete, insert, shift
Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores
A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice-versa
Certain edits seem to require more cognitive effort than others - not captured by HTER
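The HTER ratio can be approximated with a word-level edit distance. The sketch below is a simplification: it counts substitutions, deletions and insertions only and omits the shift operation, so it can overestimate the number of edits when words are moved. The function names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Minimum substitutions, deletions and insertions turning hyp into ref."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def hter_no_shifts(mt_output, post_edited):
    """Edits divided by the number of words in the post-edited version."""
    mt, pe = mt_output.split(), post_edited.split()
    if not pe:
        raise ValueError("post-edited version must contain at least one word")
    return word_edit_distance(mt, pe) / len(pe)


score = hter_no_shifts("Do buy this product", "Do not buy this product")
print(f"{score:.2f}")
```

A single inserted word gives a low score of 0.2 here, even though the edit restores a missing negation, which illustrates why the number of edits alone does not capture how much an edit matters.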

TIME: varies considerably across translators (expected)
[Plot: post-editing time in seconds (0-600) for segments 1-20, by annotator (A1-A8)]
[Plot: post-editing time in seconds per word (0-25) for segments 1-20, by annotator (A1-A8)]
Can we normalise this variation?
A dedicated QE system for each translator?
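One possible way to normalise the variation, offered here only as an illustrative assumption since the slides leave the question open, is to convert each annotator's times to z-scores using that annotator's own mean and standard deviation. The timing values below are invented.

```python
from statistics import mean, pstdev


def normalise_per_annotator(times):
    """Map each annotator's seconds-per-word values to per-annotator z-scores."""
    normalised = {}
    for annotator, values in times.items():
        mu, sigma = mean(values), pstdev(values)
        normalised[annotator] = [
            (v - mu) / sigma if sigma else 0.0 for v in values
        ]
    return normalised


# Hypothetical seconds per word on the same three segments.
times = {
    "A1": [2.0, 4.0, 6.0],     # a fast post-editor
    "A2": [10.0, 20.0, 30.0],  # a slow post-editor
}
z = normalise_per_annotator(times)
print(z)
```

After normalisation both annotators show the same relative pattern across segments, so their labels could be pooled to train one model instead of a dedicated system per translator.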

Time, HTER, Keystrokes: data from 8 post-editors

PET: http://pers-www.wlv.ac.uk/~in1676/pet/

How to use estimated PE effort scores?
Should (supposedly) bad quality translations be filtered out or shown to translators (different scores/colour codes as in TMs)?
Wasting time to read scores and translations vs wasting “gisting” information
How to define a threshold on the estimated translation quality to decide what should be filtered out?
Translator dependent
Task dependent (SDL)
Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?
Too much information vs hard-to-interpret scores
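Filtering with a translator-dependent threshold can be sketched as follows. The segment identifiers, predicted effort values and threshold values are all hypothetical, and the sketch assumes a higher predicted effort means a worse translation.

```python
def triage(segments, threshold):
    """Split (segment_id, predicted_effort) pairs at an effort threshold."""
    shown = [s for s in segments if s[1] <= threshold]
    filtered = [s for s in segments if s[1] > threshold]
    return shown, filtered


segments = [("seg1", 0.15), ("seg2", 0.55), ("seg3", 0.35)]

# Hypothetical per-translator thresholds on predicted effort.
thresholds = {"translator_1": 0.30, "translator_2": 0.50}

shown_1, filtered_1 = triage(segments, thresholds["translator_1"])
shown_2, filtered_2 = triage(segments, thresholds["translator_2"])
print([s[0] for s in shown_1], [s[0] for s in shown_2])
```

The same predictions lead to different working sets for the two translators, which is the sense in which the threshold is translator dependent.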


Conclusions
It is possible to estimate at least certain aspects of MT quality, esp. wrt PE effort: QuEst
http://quest.dcs.shef.ac.uk/
PE effort estimates can be used in real applications
Ranking translations: filter out bad quality translations
Selecting translations from multiple MT systems
Commercial products by SDL (document-level for gisting) and Multilizer
A number of open issues to be investigated...
Collaboration with “human translators” essential
My vision
Sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence


[Aut11] Autodesk. Translation and Post-Editing Productivity. http://translate.autodesk.com/productivity.html, 2011.
[BHAO11] Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. Goodness: a method for measuring machine translation confidence. Pages 211–219, Portland, Oregon, 2011.
[DM12] Markus Dreyer and Daniel Marcu. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montréal, Canada, 2012.
[Int10] Intel. Being Streetwise with Machine Translation in an Enterprise Neighborhood. http://mtmarathon2010.info/JEC2010_Burgett_slides.pptx, 2010.
[Int12] Intel. Enabling Multilingual Collaboration through Machine Translation. http://media12.connectedsocialmedia.com/intel/06/8647/Enabling_Multilingual_Collaboration_Machine_Translation.pdf, 2012.
