10. Lucia Specia (USFD) Evaluation of Machine Translation
Presentation Transcript

  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Translation Quality Assessment: Evaluation and Estimation Lucia Specia University of Sheffield l.specia@sheffield.ac.uk EXPERT Winter School, 12 November 2013 Translation Quality Assessment: Evaluation and Estimation 1 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Overview “Machine Translation evaluation is better understood than Machine Translation” (Carbonell and Wilks, 1991) Translation Quality Assessment: Evaluation and Estimation 2 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 3 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 4 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Why is evaluation important? Compare MT systems Measure progress of MT systems over time Diagnosis of MT systems Assess (and pay) human translators Quality assurance Tuning of SMT systems Decision on fitness-for-purpose ... Translation Quality Assessment: Evaluation and Estimation 5 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Why is evaluation hard? What does quality mean? Fluent? Adequate? Easy to post-edit? Translation Quality Assessment: Evaluation and Estimation 6 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Why is evaluation hard? What does quality mean? Fluent? Adequate? Easy to post-edit? Quality for whom/what? End-user: gisting (Google Translate), internal communications, or publication (dissemination) MT-system: tuning or diagnosis Post-editor: draft translations (light vs heavy post-editing) Other applications, e.g. CLIR Translation Quality Assessment: Evaluation and Estimation 6 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Ref: Do not buy this product, it’s their craziest invention! MT: Do buy this product, it’s their craziest invention! Translation Quality Assessment: Evaluation and Estimation 7 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Ref: Do not buy this product, it’s their craziest invention! MT: Do buy this product, it’s their craziest invention! Severe if end-user does not speak source language Trivial to post-edit by translators Translation Quality Assessment: Evaluation and Estimation 7 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Overview Ref: Do not buy this product, it’s their craziest invention! MT: Do buy this product, it’s their craziest invention! Severe if end-user does not speak source language Trivial to post-edit by translators Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes. MT: Six-hour battery, 30 minutes to full charge last. Translation Quality Assessment: Evaluation and Estimation 7 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Overview Ref: Do not buy this product, it’s their craziest invention! MT: Do buy this product, it’s their craziest invention! Severe if end-user does not speak source language Trivial to post-edit by translators Ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes. MT: Six-hour battery, 30 minutes to full charge last. Ok for gisting - meaning preserved Very costly for post-editing if style is to be preserved Translation Quality Assessment: Evaluation and Estimation 7 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Overview How do we measure quality? Manual metrics: Error counts, ranking, acceptability, 1-N judgements on fluency/adequacy Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user-satisfaction, reading comprehension Translation Quality Assessment: Evaluation and Estimation 8 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Overview How do we measure quality? Manual metrics: Error counts, ranking, acceptability, 1-N judgements on fluency/adequacy Task-based human metrics: productivity tests (HTER, PE time, keystrokes), user-satisfaction, reading comprehension Automatic metrics: Based on human references: BLEU, METEOR, TER, ... Reference-less: quality estimation Translation Quality Assessment: Evaluation and Estimation 8 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 9 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Judgements in an n-point scale Adequacy using 5-point scale (NIST-like) 5 All meaning expressed in the source fragment appears in the translation fragment. 4 Most of the source fragment meaning is expressed in the translation fragment. 3 Much of the source fragment meaning is expressed in the translation fragment. 2 Little of the source fragment meaning is expressed in the translation fragment. 1 None of the meaning expressed in the source fragment is expressed in the translation fragment. Translation Quality Assessment: Evaluation and Estimation 10 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Judgements in an n-point scale Fluency using 5-point scale (NIST-like) 5 Native language fluency. No grammar errors, good word choice and syntactic structure in the translation fragment. 4 Near native language fluency. Few terminology or grammar errors which don’t impact the overall understanding of the meaning. 3 Not very fluent. About half of translation contains errors. 2 Little fluency. Wrong word choice, poor grammar and syntactic structure. 1 No fluency. Absolutely ungrammatical and for the most part doesn’t make any sense. Translation Quality Assessment: Evaluation and Estimation 11 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Judgements in an n-point scale Issues: Subjective judgements Hard to reach significant agreement Is it reliable at all? Can we use multiple annotators? Translation Quality Assessment: Evaluation and Estimation 12 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Judgements in an n-point scale Issues: Subjective judgements Hard to reach significant agreement Is it reliable at all? Can we use multiple annotators? Are fluency and adequacy really separable? Ref: Absolutely ungrammatical and for the most part doesn’t make any sense. MT: Absolutely sense doesn’t ungrammatical for the and most make any part. Translation Quality Assessment: Evaluation and Estimation 12 / 66 Conclusions
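Agreement between annotators on such scales is usually quantified with chance-corrected statistics. Below is a minimal sketch using scikit-learn's cohen_kappa_score on two invented lists of 1-5 adequacy judgements; the quadratic weighting is one common choice for ordinal scales, not something prescribed by the slides.

```python
# Sketch: measuring inter-annotator agreement on 1-5 adequacy judgements.
# The judgement lists below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = [5, 4, 4, 2, 3, 5, 1, 3, 4, 2]
annotator_b = [4, 4, 3, 2, 3, 5, 2, 3, 5, 1]

# Plain kappa treats a 4-vs-5 disagreement the same as 1-vs-5; quadratic
# weights penalise large disagreements more, which suits ordinal scales.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")
```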
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Ranking WMT-13 Appraise tool: rank translations best-worst (w. ties) Translation Quality Assessment: Evaluation and Estimation 13 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Ranking Issues: Subjective judgements: what does “best” mean? Hard to judge for long sentences Translation Quality Assessment: Evaluation and Estimation 14 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Ranking Issues: Subjective judgements: what does “best” mean? Hard to judge for long sentences Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators. MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators. MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation. Translation Quality Assessment: Evaluation and Estimation 14 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Ranking Issues: Subjective judgements: what does “best” mean? Hard to judge for long sentences Ref: The majority of existing work focuses on predicting some form of post-editing effort to help professional translators. MT1: Few of the existing work focuses on predicting some form of post-editing effort to help professional translators. MT2: The majority of existing work focuses on predicting some form of post-editing effort to help machine translation. Only serve for comparison purposes - the best system might not be good enough Absolute evaluation can do both Translation Quality Assessment: Evaluation and Estimation 14 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Error counts More fine-grained Aimed at diagnosis of MT systems, quality control of human translation. Translation Quality Assessment: Evaluation and Estimation 15 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Error counts More fine-grained Aimed at diagnosis of MT systems, quality control of human translation. E.g.: Multidimensional Quality Metrics (MQM) Machine and human translation quality Takes quality of source text into account Actual metric is based on a specification Translation Quality Assessment: Evaluation and Estimation 15 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions MQM Issues selected based on a given specification (dimensions): Language/locale Subject field/domain Text Type Audience Purpose Register Style Content correspondence Output modality, ... Translation Quality Assessment: Evaluation and Estimation 16 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation MQM Issue types (core): Altogether: 120 categories Translation Quality Assessment: Evaluation and Estimation 17 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions MQM Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0 Combining issue types: TQ = 100 − Acc_P − (Flu_PT − Flu_PS) − (Ver_PT − Ver_PS) (Acc = accuracy, Flu = fluency, Ver = verity penalties; subscripts T/S = penalties computed on the target/source text) Translation Quality Assessment: Evaluation and Estimation 18 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions MQM Issue types: http://www.qt21.eu/launchpad/content/high-level-structure-0 Combining issue types: TQ = 100 − Acc_P − (Flu_PT − Flu_PS) − (Ver_PT − Ver_PS) translate5: open source graphical (Web) interface for inline error annotation: www.translate5.net Translation Quality Assessment: Evaluation and Estimation 18 / 66
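To make the scoring formula concrete, here is a minimal sketch that plugs invented penalty values into it; how the penalties themselves are derived (weighted issue counts according to the chosen specification) is defined by MQM, not by this snippet.

```python
# Sketch: plugging hypothetical penalty values into the MQM formula above.
# All penalty values are invented for illustration.
acc_p = 6.0                  # accuracy penalty (target)
flu_pt, flu_ps = 4.0, 1.0    # fluency penalties on target and source
ver_pt, ver_ps = 2.0, 0.5    # verity penalties on target and source

tq = 100 - acc_p - (flu_pt - flu_ps) - (ver_pt - ver_ps)
print(f"TQ = {tq:.1f}")      # 89.5 for these made-up numbers
```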
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation MQM Translation Quality Assessment: Evaluation and Estimation 19 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Error counts Issues: Time consuming Requires training, esp. to distinguish between fine-grained error types Different errors are more relevant for different specifications: need to select and weight them accordingly Translation Quality Assessment: Evaluation and Estimation 20 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 21 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Post-editing Productivity analysis: Measure translation quality within task. E.g. Autodesk - Productivity test through post-editing 2-day translation and post-editing, 37 participants In-house Moses (Autodesk data: software) Time spent on each segment Translation Quality Assessment: Evaluation and Estimation 22 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Post-editing PET: Records time, keystrokes, edit distance Translation Quality Assessment: Evaluation and Estimation 23 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Post-editing - PET Translation Quality Assessment: Evaluation and Estimation 24 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Post-editing - PET Translation Quality Assessment: Evaluation and Estimation 25 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Post-editing - PET How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):
    System          Google   Moses    Systran   Trados
    Faster than HT  94%      86.8%    81.20%    72.40%
    Translation Quality Assessment: Evaluation and Estimation 26 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Post-editing - PET How often post-editing (PE) a translation tool output is faster than translating from scratch (HT):
    System          Google   Moses    Systran   Trados
    Faster than HT  94%      86.8%    81.20%    72.40%
    Comparing the time to translate from scratch with the time to PE MT, in seconds (average and deviation across annotators):
               Average   Deviation
    HT (s)     31.89     9.99
    PE (s)     18.82     6.79
    HT/PE      1.73      0.26
    PE/HT      0.59      0.09
    Translation Quality Assessment: Evaluation and Estimation 26 / 66 Conclusions
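As a rough illustration of how numbers like these are derived, here is a sketch that aggregates per-segment times into average HT and PE times and their ratio; the timing values are invented, and a tool such as PET would supply the real per-segment logs.

```python
# Sketch: per-annotator productivity from logged segment times (seconds).
# Times are invented; a tool such as PET would log these per segment.
ht_times = [28.4, 35.1, 30.2, 41.0, 24.7]   # translating from scratch
pe_times = [17.3, 20.9, 16.5, 43.0, 14.1]   # post-editing MT output

ht_avg = sum(ht_times) / len(ht_times)
pe_avg = sum(pe_times) / len(pe_times)
faster = sum(p < h for p, h in zip(pe_times, ht_times)) / len(ht_times)

print(f"HT avg: {ht_avg:.2f}s  PE avg: {pe_avg:.2f}s")
print(f"HT/PE ratio: {ht_avg / pe_avg:.2f}; PE faster on {faster:.0%} of segments")
```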
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation User satisfaction Solving a problem: E.g.: Intel measuring user satisfaction with un-edited MT Translation is good if customer can solve problem Translation Quality Assessment: Evaluation and Estimation 27 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions User satisfaction Solving a problem: E.g.: Intel measuring user satisfaction with un-edited MT Translation is good if customer can solve problem MT for Customer Support websites Overall customer satisfaction: 75% for English→Chinese 95% reduction in cost Project cycle from 10 days to 1 day From 300 to 60,000 words translated/hour Translation Quality Assessment: Evaluation and Estimation 27 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions User satisfaction Solving a problem: E.g.: Intel measuring user satisfaction with un-edited MT Translation is good if customer can solve problem MT for Customer Support websites Overall customer satisfaction: 75% for English→Chinese 95% reduction in cost Project cycle from 10 days to 1 day From 300 to 60,000 words translated/hour Customers in China using MT texts were more satisfied with support than natives using original texts (68%)! Translation Quality Assessment: Evaluation and Estimation 27 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Reading comprehension Defense language proficiency test (Jones et al., 2005): Translation Quality Assessment: Evaluation and Estimation 28 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Reading comprehension Translation Quality Assessment: Evaluation and Estimation 29 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Reading comprehension MT quality: function of (1) text passage comprehension, as measured by answer accuracy, and (2) time taken to complete a test item (read a passage + answer its questions) Translation Quality Assessment: Evaluation and Estimation 29 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Reading comprehension Compared to Human Translation (HT): Translation Quality Assessment: Evaluation and Estimation 30 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Reading comprehension Compared to Human Translation (HT): Translation Quality Assessment: Evaluation and Estimation 31 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Task-based metrics Issues: Final goal needs to be very clear Can be more cost/time consuming Final task has to have a meaningful metric Other elements may affect the final quality measurement (e.g. Chinese vs. Americans) Translation Quality Assessment: Evaluation and Estimation 32 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 33 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Automatic metrics Compare output of an MT system to one or more reference (usually human) translations: how close is the MT output to the reference translation? Numerous metrics: BLEU, NIST, etc. Advantages: Fast and cheap, minimal human labour, no need for bilingual speakers Once test set is created, can be reused many times Can be used on an on-going basis during system development to test changes Translation Quality Assessment: Evaluation and Estimation 34 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Automatic metrics Disadvantages: Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases Individual sentence scores are not very reliable, aggregate scores on a large test set are required Translation Quality Assessment: Evaluation and Estimation 35 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Automatic metrics Disadvantages: Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases Individual sentence scores are not very reliable, aggregate scores on a large test set are required Very few of these metrics penalise different mismatches differently Translation Quality Assessment: Evaluation and Estimation 35 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Automatic metrics Disadvantages: Very few metrics look at variable ways of saying the same thing (word-level): stems, synonyms, paraphrases Individual sentence scores are not very reliable, aggregate scores on a large test set are required Very few of these metrics penalise different mismatches differently Reference translations are only a subset of the possible good translations Translation Quality Assessment: Evaluation and Estimation 35 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions String matching BLEU: BiLingual Evaluation Understudy Most widely used metric, both for MT system evaluation/comparison and SMT tuning Matching of n-grams between MT and Ref: rewards same words in equal order #clip(g): count of reference n-grams g that appear in a hypothesis sentence h, clipped by the number of times g appears in the reference sentence for h; #(g): number of n-grams in the hypotheses. n-gram precision p_n for a set of MT translations H: p_n = [ Σ_{h∈H} Σ_{g∈ngrams(h)} #clip(g) ] / [ Σ_{h∈H} Σ_{g∈ngrams(h)} #(g) ] Translation Quality Assessment: Evaluation and Estimation 36 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation BLEU Combine the n-gram precisions for n = 1..N via the mean of their logs: (1/N) Σ_{n=1..N} log p_n Translation Quality Assessment: Evaluation and Estimation 37 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Combine the n-gram precisions for n = 1..N via the mean of their logs: (1/N) Σ_{n=1..N} log p_n Bias towards translations with fewer words (denominator) Brevity penalty to penalise MT sentences that are shorter than reference Compares the overall number of words w_h of the entire hypotheses set with ref length w_r: BP = 1 if w_h ≥ w_r, BP = e^(1 − w_r/w_h) otherwise Translation Quality Assessment: Evaluation and Estimation 37 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Combine the n-gram precisions for n = 1..N via the mean of their logs: (1/N) Σ_{n=1..N} log p_n Bias towards translations with fewer words (denominator) Brevity penalty to penalise MT sentences that are shorter than reference Compares the overall number of words w_h of the entire hypotheses set with ref length w_r: BP = 1 if w_h ≥ w_r, BP = e^(1 − w_r/w_h) otherwise BLEU = BP * exp( (1/N) Σ_{n=1..N} log p_n ) Translation Quality Assessment: Evaluation and Estimation 37 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Scale: 0-1, but highly dependent on the test set Rewards fluency by matching high n-grams (up to 4) Adequacy rewarded by unigrams and brevity penalty – poor model of recall Synonyms and paraphrases only handled if they are in any of the reference translations All tokens are equally weighted: missing out on a content word = missing out on a determiner Better for evaluating changes in the same system than comparing different MT architectures Translation Quality Assessment: Evaluation and Estimation 38 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Not good at sentence-level, unless smoothing is applied: Ref: the Iraqi weapons are to be handed over to the army within two weeks MT: in two weeks Iraq’s weapons will give army 1-gram precision: 4/8 2-gram precision: 1/7 3-gram precision: 0/6 4-gram precision: 0/5 BLEU = 0 Translation Quality Assessment: Evaluation and Estimation 39 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Importance of clipping and brevity penalty Ref1: the Iraqi weapons are to be handed over to the army within two weeks Ref2: the Iraqi weapons will be surrendered to the army in two weeks MT: the the the the Count for the should be clipped at 2: max count of the word in any reference. Unigram score = 2/4 (not 4/4) Translation Quality Assessment: Evaluation and Estimation 40 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions BLEU Importance of clipping and brevity penalty Ref1: the Iraqi weapons are to be handed over to the army within two weeks Ref2: the Iraqi weapons will be surrendered to the army in two weeks MT: the the the the Count for the should be clipped at 2: max count of the word in any reference. Unigram score = 2/4 (not 4/4) MT: Iraqi weapons will be 1-gram precision: 4/4 2-gram precision: 3/3 3-gram precision: 2/2 4-gram precision: 1/1 Precision (p_n) = 1 for all n, but the score is penalised by the brevity penalty because w_h < w_r Translation Quality Assessment: Evaluation and Estimation 40 / 66
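The following from-scratch sketch implements the clipped n-gram precision and brevity penalty described above, just to make the formulas concrete (it is not the official BLEU implementation; a maintained tool such as sacrebleu should be used for real evaluations). It reproduces both examples: "the the the the" gets unigram precision 2/4, and "Iraqi weapons will be" gets p_n = 1 for all n but a brevity penalty below 1.

```python
# Sketch of BLEU's building blocks: clipped n-gram precision + brevity penalty.
# Not a reference implementation; shown only to make the formulas concrete.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    # Clip each hypothesis n-gram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def bleu(hyp, refs, max_n=4):
    precisions = [clipped_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:   # unsmoothed: any zero p_n makes BLEU 0
        return 0.0
    # Brevity penalty uses the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda rl: (abs(rl - len(hyp)), rl))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref1 = "the Iraqi weapons are to be handed over to the army within two weeks".split()
ref2 = "the Iraqi weapons will be surrendered to the army in two weeks".split()
print(clipped_precision("the the the the".split(), [ref1, ref2], 1))  # 0.5 (2/4)
print(bleu("Iraqi weapons will be".split(), [ref1, ref2]))            # p_n = 1, BP < 1
```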
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation String matching BLEU: Translation Quality Assessment: Evaluation and Estimation 41 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Edit distance WER: Word Error Rate: Levenshtein edit distance Minimum proportion of insertions, deletions, and substitutions needed to transform an MT sentence into the reference sentence Heavily penalises reorderings: correct translation in the wrong location: deletion + insertion WER = (S + D + I) / N PER: Position-independent word Error Rate: Does not penalise reorderings: output and reference sentences are unordered sets Translation Quality Assessment: Evaluation and Estimation 42 / 66
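A minimal sketch of WER as word-level Levenshtein distance normalised by the reference length, reusing the example sentences from the BLEU slides; illustration only.

```python
# Sketch: WER as word-level Levenshtein distance (S + D + I) / N,
# where N is the reference length.
def wer(hyp, ref):
    h, r = hyp.split(), ref.split()
    # dp[i][j] = minimum edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(f"WER: {wer(mt, ref):.2f}")
```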
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Edit distance: TER TER: Translation Error Rate Adds shift operation Translation Quality Assessment: Evaluation and Estimation 43 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Edit distance: TER TER: Translation Error Rate Adds shift operation REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times HYP: [this week] the saudis denied information published in the ***** Translation Quality Assessment: Evaluation and Estimation new york times 43 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Edit distance: TER TER: Translation Error Rate Adds shift operation REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times HYP: [this week] the saudis denied information published in the ***** new york times 1 Shift, 2 Substitutions, 1 Deletion → 4 Edits: TER = 4/13 ≈ 0.31 Translation Quality Assessment: Evaluation and Estimation 43 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Edit distance: TER TER: Translation Error Rate Adds shift operation REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times HYP: [this week] the saudis denied information published in the ***** new york times 1 Shift, 2 Substitutions, 1 Deletion → 4 Edits: TER = 4/13 ≈ 0.31 Human-targeted TER (HTER): TER between MT and its post-edited version Translation Quality Assessment: Evaluation and Estimation 43 / 66
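TER's shift operation makes it non-trivial to implement by hand, so the sketch below assumes the sacrebleu package (2.x), which ships a TER implementation, purely as a usage illustration; the exact API and score scale may differ across versions, and the sentences are a plain-case rendering of the slide's example.

```python
# Sketch: scoring TER with sacrebleu (assumes sacrebleu 2.x is installed;
# check your version's API, this is not code from the slides).
from sacrebleu.metrics import TER

refs = ["Saudi Arabia denied this week information published in the American New York Times"]
hyps = ["this week the Saudis denied information published in the New York Times"]

ter = TER()
result = ter.corpus_score(hyps, [refs])
print(result)  # TER score (sacrebleu reports it on a 0-100 scale)

# HTER is the same computation with the post-edited MT output as the reference.
```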
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Edit distance: TER TER: Translation Quality Assessment: Evaluation and Estimation 44 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Alignment-based METEOR: Unigram Precision and Recall Align MT output with reference. Take best scoring pair for multiple refs. Matching considers word inflection variations (stems), synonyms/paraphrases Fluency addressed via a direct penalty: fragmentation of the matching METEOR score = F-mean score discounted for fragmentation = F-mean * (1 - DF) Translation Quality Assessment: Evaluation and Estimation 45 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation METEOR Example: Ref: the Iraqi weapons are to be handed over to the army within two weeks MT: in two weeks Iraq’s weapons will give army Translation Quality Assessment: Evaluation and Estimation 46 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation METEOR Example: Ref: the Iraqi weapons are to be handed over to the army within two weeks MT: in two weeks Iraq’s weapons will give army Matching: Ref: Iraqi weapons army two weeks MT two weeks Iraq’s weapons army Translation Quality Assessment: Evaluation and Estimation 46 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation METEOR Example: Ref: the Iraqi weapons are to be handed over to the army within two weeks MT: in two weeks Iraq’s weapons will give army Matching: Ref: Iraqi weapons army two weeks MT two weeks Iraq’s weapons army P = 5/8 =0.625 R = 5/14 = 0.357 F-mean = 10*P*R/(9P+R) = 0.3731 Translation Quality Assessment: Evaluation and Estimation 46 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions METEOR Example: Ref: the Iraqi weapons are to be handed over to the army within two weeks MT: in two weeks Iraq’s weapons will give army Matching: Ref: Iraqi weapons army two weeks MT two weeks Iraq’s weapons army P = 5/8 =0.625 R = 5/14 = 0.357 F-mean = 10*P*R/(9P+R) = 0.3731 Fragmentation: 3 frags of 5 words = (3)/(5) = 0.6 Discounting factor: DF = 0.5 * (0.6**3) = 0.108 METEOR: F-mean * (1 - DF) = 0.373 * 0.892 = 0.333 Translation Quality Assessment: Evaluation and Estimation 46 / 66
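The arithmetic of the worked example above, spelled out as a short sketch. The parameter values (the 9:1 recall weighting in the F-mean and the 0.5 and cube in the penalty) are the ones shown on the slide; the released METEOR tool uses tuned, language-specific parameters.

```python
# Sketch: METEOR-style score from the matching statistics on the slide.
matches, hyp_len, ref_len, chunks = 5, 8, 14, 3

precision = matches / hyp_len                               # 5/8  = 0.625
recall = matches / ref_len                                  # 5/14 = 0.357
fmean = 10 * precision * recall / (9 * precision + recall)  # recall-weighted F

fragmentation = chunks / matches                            # 3/5 = 0.6
penalty = 0.5 * fragmentation ** 3                          # 0.108
print(f"F-mean = {fmean:.4f}, METEOR = {fmean * (1 - penalty):.3f}")  # ~0.373, ~0.333
```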
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Others WMT shared task on metrics: TerrorCat DepRef MEANT and TINE TESLA LEPOR ROSE AMBER Many other linguistically motivated metrics where matching is not done at the word-level (only) ... Translation Quality Assessment: Evaluation and Estimation 47 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 48 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations No access to reference translations Quality defined by the data Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations No access to reference translations Quality defined by the data Quality = Can we publish it as is? Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations No access to reference translations Quality defined by the data Quality = Can we publish it as is? Quality = Can a reader get the gist? Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations No access to reference translations Quality defined by the data Quality = Can we publish it as is? Quality = Can a reader get the gist? Quality = Is it worth post-editing it? Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Overview Quality estimation (QE): metrics that provide an estimate on the quality of unseen translations No access to reference translations Quality defined by the data Quality = Can we publish it as is? Quality = Can a reader get the gist? Quality = Is it worth post-editing it? Quality = How much effort to fix it? Translation Quality Assessment: Evaluation and Estimation 49 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Framework [Training diagram: X (examples of source & translations) → feature extraction → features; features + Y (quality scores for the examples in X) → machine learning → QE model] Translation Quality Assessment: Evaluation and Estimation 50 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Framework [Prediction diagram: source text xs' and its MT translation xt' → feature extraction → features → QE model → quality score y'] Translation Quality Assessment: Evaluation and Estimation 50 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Framework Main components to build a QE system: 1 Definition of quality: what to predict 2 (Human) labelled data (for quality) 3 Features 4 Machine learning algorithm Translation Quality Assessment: Evaluation and Estimation 51 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Definition of quality Predict 1-N absolute scores for adequacy/fluency Predict 1-N absolute scores for post-editing effort Predict average post-editing time per word Predict relative rankings Predict relative rankings for same source Predict percentage of edits needed for sentence Predict word-level edits and their types Predict BLEU, etc. scores for document Translation Quality Assessment: Evaluation and Estimation 52 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Datasets SHEF (several): http://staffwww.dcs.shef.ac.uk/people/L.Specia/resources.html LIG (10K, fr-en): http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download LIMSI (14K, fr-en, en-fr, 2 post-editors): http://web.limsi.fr/Individu/wisniews/recherche/index.html Translation Quality Assessment: Evaluation and Estimation 53 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Features Adequacy indicators Source text Complexity indicators MT system Confidence indicators Translation Quality Assessment: Evaluation and Estimation Translation Fluency indicators 54 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions QuEst Goal: framework to explore features for QE Feature extractors for 150+ features of all types: Java Machine learning: wrappers for a number of algorithms in the scikit-learn toolkit, grid search, feature selection Open source: http://www.quest.dcs.shef.ac.uk/ Translation Quality Assessment: Evaluation and Estimation 55 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation State of the art in QE WMT12-13 shared tasks Translation Quality Assessment: Evaluation and Estimation 56 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation State of the art in QE WMT12-13 shared tasks Sentence- and word-level estimation of PE effort Translation Quality Assessment: Evaluation and Estimation 56 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions State of the art in QE WMT12-13 shared tasks Sentence- and word-level estimation of PE effort Datasets and language pairs:
    Task                                    Year       Languages
    Quality 1-5 subjective scores           WMT12      en-es
    Ranking all sentences best-worst        WMT12/13   en-es
    HTER scores                             WMT13      en-es
    Post-editing time                       WMT13      en-es
    Word-level edits: change/keep           WMT13      en-es
    Word-level edits: keep/delete/replace   WMT13      en-es
    Ranking 5 MTs per source                WMT13      en-es; de-en
    Translation Quality Assessment: Evaluation and Estimation 56 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions State of the art in QE WMT12-13 shared tasks Sentence- and word-level estimation of PE effort Datasets and language pairs:
    Task                                    Year       Languages
    Quality 1-5 subjective scores           WMT12      en-es
    Ranking all sentences best-worst        WMT12/13   en-es
    HTER scores                             WMT13      en-es
    Post-editing time                       WMT13      en-es
    Word-level edits: change/keep           WMT13      en-es
    Word-level edits: keep/delete/replace   WMT13      en-es
    Ranking 5 MTs per source                WMT13      en-es; de-en
    Evaluation metric: MAE = (1/N) Σ_{i=1..N} |H(s_i) − V(s_i)|
    Translation Quality Assessment: Evaluation and Estimation 56 / 66
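A minimal sketch of the MAE evaluation metric, plus RMSE since the result tables below also report it; the predicted and gold scores are invented.

```python
# Sketch: MAE and RMSE between predicted (H) and gold (V) quality scores.
# The values below are invented for illustration.
import math

predicted = [3.2, 4.1, 2.5, 4.8, 1.9]   # H(s_i)
gold      = [3.0, 4.5, 2.0, 5.0, 2.5]   # V(s_i)

n = len(gold)
mae = sum(abs(h - v) for h, v in zip(predicted, gold)) / n
rmse = math.sqrt(sum((h - v) ** 2 for h, v in zip(predicted, gold)) / n)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```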
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Baseline system Features: number of tokens in the source and target sentences average source token length average number of occurrences of words in the target number of punctuation marks in source and target sentences LM probability of source and target sentences average number of translations per source word % of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4 % of seen source unigrams Translation Quality Assessment: Evaluation and Estimation 57 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Baseline system Features: number of tokens in the source and target sentences average source token length average number of occurrences of words in the target number of punctuation marks in source and target sentences LM probability of source and target sentences average number of translations per source word % of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4 % of seen source unigrams SVM regression with RBF kernel, with the parameters γ and C optimised using a grid search and 5-fold cross-validation on the training set Translation Quality Assessment: Evaluation and Estimation 57 / 66
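A scikit-learn sketch in the spirit of the baseline described above: SVR with an RBF kernel, γ and C tuned by grid search under 5-fold cross-validation. The feature matrix and labels are random placeholders standing in for the baseline features and quality labels, not the actual WMT data.

```python
# Sketch: QE baseline learner in the spirit of the slide - SVR (RBF kernel)
# with gamma and C tuned via grid search and 5-fold cross-validation.
# Random data stands in for the baseline feature vectors and quality scores.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))      # placeholder feature matrix (17 features assumed)
y = rng.uniform(1, 5, size=500)     # placeholder 1-5 quality scores

param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```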
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Results - scoring sub-task (WMT12)
    System ID                   MAE    RMSE
    • SDLLW M5PbestDeltaAvg     0.61   0.75
    UU best                     0.64   0.79
    SDLLW SVM                   0.64   0.78
    UU bltk                     0.64   0.79
    Loria SVMlinear             0.68   0.82
    UEdin                       0.68   0.82
    TCD M5P-resources-only*     0.68   0.82
    Baseline bb17 SVR           0.69   0.82
    Loria SVMrbf                0.69   0.83
    SJTU                        0.69   0.83
    WLV-SHEF FS                 0.69   0.85
    PRHLT-UPV                   0.70   0.85
    WLV-SHEF BL                 0.72   0.86
    DCU-SYMC unconstrained      0.75   0.97
    DFKI grcfs-mars             0.82   0.98
    DFKI cfs-plsreg             0.82   0.99
    UPC 1                       0.84   1.01
    DCU-SYMC constrained        0.86   1.12
    UPC 2                       0.87   1.04
    TCD M5P-all                 2.09   2.32
    Translation Quality Assessment: Evaluation and Estimation 58 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Results - scoring sub-task (WMT13)
    System ID              MAE     RMSE
    • SHEF FS              12.42   15.74
    SHEF FS-AL             13.02   17.03
    CNGL SVRPLS            13.26   16.82
    LIMSI                  13.32   17.22
    DCU-SYMC combine       13.45   16.64
    DCU-SYMC alltypes      13.51   17.14
    CMU noB                13.84   17.46
    CNGL SVR               13.85   17.28
    FBK-UEdin extra        14.38   17.68
    FBK-UEdin rand-svr     14.50   17.73
    LORIA inctrain         14.79   18.34
    Baseline bb17 SVR      14.81   18.22
    TCD-CNGL open          14.81   19.00
    LORIA inctraincont     14.83   18.17
    TCD-CNGL restricted    15.20   19.59
    CMU full               15.25   18.97
    UMAC                   16.97   21.94
    Translation Quality Assessment: Evaluation and Estimation 59 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues Agreement between annotators Absolute value judgements: difficult to achieve consistency even in highly controlled settings WMT12: 30% of initial dataset discarded Remaining annotations had to be scaled Translation Quality Assessment: Evaluation and Estimation 60 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues Annotation costs: active learning to select subset of instances to be annotated (Beck et al., ACL 2013) Translation Quality Assessment: Evaluation and Estimation 61 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013) [Chart: results for feature sets BL, AF, BL+PR, AF+PR and FS across the WMT12, EAMT11 (en-es), EAMT11 (fr-en), EAMT09-s1..s4 and GALE11-s1/s2 datasets] Translation Quality Assessment: Evaluation and Estimation 62 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues Curse of dimensionality: feature selection to identify relevant info for dataset (Shah et al., MT Summit 2013) [Chart as above] Common feature set identified, but nuanced subsets for specific datasets Translation Quality Assessment: Evaluation and Estimation 62 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues How to use estimated PE effort scores?: Do users prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence or not seeing bad sentences at all? Too much information vs hard-to-interpret scores Translation Quality Assessment: Evaluation and Estimation 63 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues How to use estimated PE effort scores?: Do users prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence or not seeing bad sentences at all? Too much information vs hard-to-interpret scores IBM’s Goodness metric Translation Quality Assessment: Evaluation and Estimation 63 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Open issues How to use estimated PE effort scores?: Do users prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence or not seeing bad sentences at all? Too much information vs hard-to-interpret scores IBM’s Goodness metric MATECAT project investigating it Translation Quality Assessment: Evaluation and Estimation 63 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Outline 1 Translation quality 2 Manual metrics 3 Task-based metrics 4 Reference-based metrics 5 Quality estimation 6 Conclusions Translation Quality Assessment: Evaluation and Estimation 64 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions (Machine) Translation evaluation & estimation: still an open problem Translation Quality Assessment: Evaluation and Estimation 65 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions (Machine) Translation evaluation & estimation: still an open problem Different metrics for: different purposes/users, different needs, different notions of quality Translation Quality Assessment: Evaluation and Estimation 65 / 66 Conclusions
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Conclusions (Machine) Translation evaluation & estimation: still an open problem Different metrics for: different purposes/users, different needs, different notions of quality Quality estimation: learning of these different notions, but requires labelled data Translation Quality Assessment: Evaluation and Estimation 65 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Conclusions (Machine) Translation evaluation & estimation: still an open problem Different metrics for: different purposes/users, different needs, different notions of quality Quality estimation: learning of these different notions, but requires labelled data Solution: Think of what quality means in your scenario Measure significance Measure agreement if manual metrics Translation Quality Assessment: Evaluation and Estimation 65 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Conclusions (Machine) Translation evaluation & estimation: still an open problem Different metrics for: different purposes/users, different needs, different notions of quality Quality estimation: learning of these different notions, but requires labelled data Solution: Think of what quality means in your scenario Measure significance Measure agreement if manual metrics Use various metrics Translation Quality Assessment: Evaluation and Estimation 65 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Conclusions Conclusions (Machine) Translation evaluation & estimation: still an open problem Different metrics for: different purposes/users, different needs, different notions of quality Quality estimation: learning of these different notions, but requires labelled data Solution: Think of what quality means in your scenario Measure significance Measure agreement if manual metrics Use various metrics Invent your own metric! Translation Quality Assessment: Evaluation and Estimation 65 / 66
  • Translation quality Manual metrics Task-based metrics Reference-based metrics Quality estimation Translation Quality Assessment: Evaluation and Estimation Lucia Specia University of Sheffield l.specia@sheffield.ac.uk EXPERT Winter School, 12 November 2013 Translation Quality Assessment: Evaluation and Estimation 66 / 66 Conclusions