Human Evaluation: Why do we need it? - Dr. Sheila Castilho

Talk at the 8th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/241198412/) by Dr. Sheila Castilho, postdoc at ADAPT Centre, Dublin City University.


  1. Human Evaluation: Why do we need it?
     Dr. Sheila Castilho
     The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
  2. Why do we need evaluation?
     - Evaluation provides data on whether a system works and why, which parts of it are effective and which need improvement.
     - Evaluation needs to be honest and replicable, and its methods should be as rigorous as possible.
  3. A bit of history…
     - ALPAC Report (1966)
     - Generated a long and drastic cut in funding (especially in MT)
     - Evaluation was a forbidden topic in the NLP community (Paroubek et al 2007)
  4. Automatic Metrics
     • Interdisciplinary:
       - WER (speech recognition – MT; sketched below)
       - ROUGE (text summarization – MT)
       - F-Measure (IR – many other areas)
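Since WER is the first metric named above, a minimal sketch of how it is computed may help: word-level edit distance between a hypothesis and a reference, normalised by the reference length. The two example sentences are invented for illustration.

```python
# Minimal WER sketch: Levenshtein distance over words, divided by
# the number of reference words. Example sentences are invented.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("when you do human evaluation of the systems",
          "when you make human systems evaluation"))  # 0.625
```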
  5. Who’s afraid of Human Evaluation?
     • Time consuming
     • Expensive
     • Humans don’t agree with each other
     • Automatic metrics should be enough! It’s grand!
  6. Human Translation Quality Assessment
     - Why evaluate machine translation with humans?
     - More detailed evaluation
     - Assess complex linguistic phenomena
     - Feedback to the MT system
     - Diagnosis
  7. Human Translation Quality Assessment
     - Most commonly carried out under the adequacy-fluency paradigm and post-editing.
     - Secondary measures are: readability, comprehensibility, usability, acceptability of source and target texts.
     - Carried out by professional and amateur evaluators.
     - Performance-based measures and user-centred approaches are more recent additions.
  8. Adequacy
     • Also known as “accuracy” or “fidelity”
     • Focus on the source text
     • “The extent to which the translation transfers the meaning of the source text translation unit into the target”
     • Likert scale: 1. None of it; 2. Little of it; 3. Most of it; 4. All of it
     • Why is Adequacy useful for MT evaluation? It tells us how much of the source message has been transferred to the translation.
  9. Fluency
     • Also known as intelligibility
     • Focuses on the target text
     • “The flow and naturalness of the target text unit in the context of the target audience and its linguistic and sociocultural norms in the given context”
     • Likert scale: 1. No fluency; 2. Little fluency; 3. Near native; 4. Native (see the aggregation sketch below)
     • Why is Fluency useful for MT evaluation? It tells us whether the message is fluent/intelligible (i.e. sounds natural to a native speaker) or whether it is “broken language”.
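To make the adequacy-fluency paradigm concrete, here is a minimal sketch of how such 1–4 Likert judgments are commonly aggregated per system. The system names, segment IDs, and ratings are invented for illustration.

```python
# Hedged sketch: averaging adequacy and fluency Likert ratings (1-4)
# per MT system across annotators. All data below is invented.
from statistics import mean

# (system, segment_id, adequacy, fluency), one tuple per annotator judgment
judgments = [
    ("system_A", 1, 4, 3), ("system_A", 1, 3, 3), ("system_A", 1, 4, 4),
    ("system_B", 1, 2, 2), ("system_B", 1, 3, 2), ("system_B", 1, 2, 1),
]

for system in ("system_A", "system_B"):
    adequacy = [a for s, _, a, _ in judgments if s == system]
    fluency = [f for s, _, _, f in judgments if s == system]
    print(f"{system}: adequacy={mean(adequacy):.2f}, fluency={mean(fluency):.2f}")
```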
  10. PE (Post-Editing)
     • The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
     • “Checking, proof-reading and revising translations carried out by any kind of translating automaton” (Gouadec 2007)
     • Common use of MT in production – over 80% of Language Service Providers now offer post-edited MT (Common Sense Advisory 2016)
  11. PE
     - Why use post-editing for Machine Translation evaluation?
     - Assess usefulness of an MT system in production
     - Identify common errors
     - Create new training or test data
     - However, measurements of post-editing effort tend to differ between novices (students) and professionals:
       - Temporal effort: time spent on PE, words per second (WPS)
       - Technical effort: edits performed – HTER (sketched below)
       - Cognitive effort: measured in several ways – e.g. eye tracking
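As a concrete illustration of technical effort, here is a hedged sketch of HTER: the edit distance between the raw MT output and its human post-edited version, normalised by the length of the post-edited text. Real TER also counts block shifts as single edits; this simplification uses plain word-level edit distance, and the sentence pair is borrowed from the DQF/MQM example later in the deck.

```python
# Hedged HTER sketch: word-level edit distance from MT output to its
# post-edited version, divided by the post-edited length. Real TER
# additionally treats block shifts as single edits; omitted here.

def edit_distance(a: list[str], b: list[str]) -> int:
    # Rolling-row Levenshtein distance over word lists.
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

mt = "When you make human systems evaluation".split()
pe = "When you do human evaluation of the systems".split()
hter = edit_distance(mt, pe) / len(pe)
print(f"HTER = {hter:.3f}")  # 5 edits / 8 post-edited words = 0.625
```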
  12. Translation Quality Assessment
     - Why use error taxonomies for translation evaluation?
     - Identify types of errors in MT or human translation
     - A detailed error report is useful for adjusting MT systems and reporting back to clients
     - LSPs use taxonomies and severity ratings to monitor translators’ work
     - However, error annotation is expensive
  13. DQF / Multidimensional Quality Metrics (MQM)
  14. DQF / MQM Example
     ST: Quando você faz avaliação humana dos sistemas, é mais provável que os seus resultados tenham mais peso.
     MT: When you make human systems evaluation, it is more likely that the your results will have much more weight.
     HT: When you do human evaluation of the systems, it is more likely that your results will have more credibility.
     Errors (scored in the sketch below):
     • Word order
     • Extraneous function word
     • Mistranslation
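To show how such annotations become a number, here is a hedged sketch of an MQM-style quality score: each error carries a severity weight, the weighted penalties are summed, and the total is normalised by the word count of the evaluated text. The severity weights (minor=1, major=5, critical=10) are a common convention rather than a fixed standard, and the severities assigned to the three errors above are my own illustrative assumptions, not values from the talk.

```python
# Hedged MQM-style scoring sketch. Severity weights follow one common
# convention (minor=1, major=5, critical=10); projects calibrate their own.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

# (category, severity) for the example segment; severities are assumptions.
errors = [
    ("word order", "minor"),
    ("extraneous function word", "minor"),
    ("mistranslation", "major"),
]

word_count = 19  # words in the evaluated MT segment above

penalty = sum(SEVERITY_WEIGHTS[sev] for _, sev in errors)
score = 100 * (1 - penalty / word_count)
print(f"penalty={penalty}, quality score={score:.1f}/100")  # penalty=7, ~63.2
```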
  15. Crowdsourcing
     • Cheap
     • Fast
     • Various tasks (fluency/adequacy, PE, error mark-up, ranking…)
     • Quality? (see the gold-question sketch below)
       - Contributors’ level
       - Country/region
       - Constant monitoring
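One common answer to the quality question is control (gold) items: seed the job with segments whose correct judgment is already known and drop contributors who miss too many of them. A minimal sketch, with invented worker IDs, answers, and acceptance threshold:

```python
# Hedged crowdsourcing quality-control sketch: score each contributor
# against gold items with known answers and reject low-accuracy workers.
# All IDs, answers, and the 0.7 threshold are invented for illustration.

gold_answers = {"q1": "adequate", "q2": "inadequate", "q3": "adequate"}

worker_responses = {
    "worker_1": {"q1": "adequate", "q2": "inadequate", "q3": "adequate"},
    "worker_2": {"q1": "inadequate", "q2": "inadequate", "q3": "inadequate"},
}

THRESHOLD = 0.7  # minimum gold accuracy to keep a contributor's judgments

for worker, answers in worker_responses.items():
    correct = sum(answers[q] == a for q, a in gold_answers.items())
    accuracy = correct / len(gold_answers)
    verdict = "keep" if accuracy >= THRESHOLD else "reject"
    print(f"{worker}: gold accuracy {accuracy:.2f} -> {verdict}")
```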
  16. Usability
     • A concept borrowed from human-computer interaction
     • Real-world problems
     • Understand how end users engage with machine-translated texts, and how usable such texts are.
     • Applied in different areas (video/text summarisation, UI, information retrieval, etc.).
     • Why is Usability useful for MT evaluation? It identifies the impact the translation might have on its final readers, including their satisfaction with the translation and products.
     • The users of the translation should be the ones who tell us whether the final translation is acceptable.
  18. So... Human Evaluation is a great thing!
     • Human evaluation avoids awkward situations…
     • And backs up good results!
  19. Thank you!
  20. References
     • Patrick Paroubek, Stéphane Chaudiron, Lynette Hirschman. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues, ATALA, 2007, 48 (1), pp. 7–31.
