SlideShare a Scribd company logo
1 of 22
Download to read offline
HOPE: A Task-Oriented and Human-Centric
Evaluation Framework Using Professional Post-
Editing Towards More Effective MT Evaluation
Serge Gladkoff * & Lifeng Han **
* CEO at Logrus Global https://logrusglobal.com/
** ADAPT Research Centre, DCU, IE (former) & The University of Manchester, UK (Now)
@ LREC2022, June 20th, Marseille, France
lifeng.han@{manchester.ac.uk, adaptcentre.ie} serge.gladkoff@logrusglobal.com
https://github.com/poethan/LREC22_MetaEval_Tutorial
Something bonus: our tutorial on MTE
* The University of Manchester (current), UK & ADAPT Research Centre, DCU (former), Ireland

** Logrus Global (https://logrusglobal.com)
Meta-Evaluation of Translation Evaluation Methods:
a systematic up-to-date overview
Lifeng Han * and Serge Gladkoff **
Half-day Tutorial @ LREC2022, June 20th, Marseille, France
lifeng.han@{manchester.ac.uk, adaptcentre.ie} serge.gladkoff@logrusglobal.com
X
Presentation schedule: HOPE
Motivation
Design: HOPE
Experimental evaluation
Summary
Motivation
Traditional automatic evaluation metrics for MT have been widely criticized by linguists due to their:
- low accuracy
- lack of transparency
- focus on language mechanics rather than semantics
- and low agreement with human quality evaluation.
Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and
translation service providers (TSPs). However:
- traditional human translation quality evaluations are costly to perform
- and go into great linguistic detail, raise issues as to inter-rater reliability (IRR)
- and are not designed to measure quality of worse than premium quality translations.
Motivation
For large-scale deployment of MT, a more appropriate quality metric is required
which:
a) allows for faster learning curve for evaluators to be applied correctly;
b) is faster to apply;
c) is specifically designed to address less than perfect MT output of “good enough
quality”;
d) does not track so many unnecessary linguistic details as standard MQM
metrics, designed as a tool to measure near-premium quality of human
translations.
Design
HOPE: a task-oriented and Human-centric evaluation framework for machine
translation output based on professional Post-editing annotations.
It contains:
- a limited number of commonly occurring error types (proper name, impact,
required adaptation, terminology, grammar, accuracy, style, and proofreading
error)
- uses a scoring model with geometric progression of error penalty points (EPPs)
reflecting error severity level to each translation unit.
- Error annotation and scoring can be done either without post-editing itself, or
during post-editing towards a newly generated post-edited reference translation
Design: factors
Design: scoring segments
Errors of each type can have the following severity differences: (minor, medium,
major, severe, critical) with the corresponding values (1, 2, 4, 8, 16).
Error points for each Translation Unit (TU) are added to form the Error Point
Penalty (EPP) of the TU (EPPTU) under-study.
Each TU has its own EPPTU not depending on other TUs.
Importantly, repeated errors in different TUs are not counted as one error.
The system-level score of HOPE is calculated by the sum of overall segment-level
EPPTUs.
Design: scoring HOPE
Design: segment/sentence & word-level
segment/sentence-level: We apply this metric into a sentence level (or segment-
level) error severity classification, i.e. minor vs major with the EPPTU score (1~4)
vs 5+.
The benefit of such design is that it immediately allows to distill sentences with
only minor errors, with EPPTUs 1, 2, 3 and 4, and sentences with major errors
(EPPTUs 5 and more).
One can say that EPPTU 1-4 is precisely what is often meant by ``good enough
quality'' of MT, where budget, time or frequency of visiting the content does not
provide for premium quality translation and lower quality is just fine.
Design: segment/sentence & word-level
The word level HOPE: follows the segment-level indicators including “unchanged”,
“good enough”, and “must be fixed”.
However, the statistics will be reflected at word level, e.g. how many words of the
whole document/text belong to each of the three categories.
Both segment/sentence-level and word-level HOPE indicators can be used to
reflect the overall MT quality in translating the overall material/document.
However, they can tell different aspects of the MT systems, e.g. when there are
many sentences falling into very different length (short {vs} long sentences).
Experimental evaluation
The pilot experiments contain two tasks.
Task-I is carried out using:
- English=>Russian (EN=>RU) language pair from marketing document in
technical domain with 111 sentences (segments), using two MT engines,
Google Translator and a customised MT engine.
Task-II uses a document from business domain containing 671 segments (3,339
words) on the same translation direction but using an alternative NMT engine
DeepL.
Experimental evaluation: sentence level
Each error categories in % and overall how many percents of sentences fall into no-change/minor/major-edit
Experimental evaluation: comparison of two quality profiles
Task-I HOPE quality indicators: comparison of System1 vs Google Translate
Experimental evaluation: word level vs segment level
Quality Indicators via Segment {vs} Word Level HOPE: number counts and percentage
Experimental evaluation: word vs segment
HOPE quality indicators via word-level (left) {vs} segment-level (right) in percentage
Summary
We designed HOPE metric:
can be seen as an MQM implementation tailored specifically for evaluating MT output.
The approach has several key advantages:
- ability to measure and compare less than perfect MT output from different systems
- ability to indicate human perception of quality
- immediate estimation of the labor effort required to bring MT output to premium quality / good
enough
- low-cost and faster application, as well as higher IRR
Initial experimental eval on: segment/word-level HOPE using en=>ru works!
Source files (corpus and scoring): https://github.com/lHan87/HOPE
Demo: HOPE
● https://github.com/poethan/LREC22_MetaEval_Tutorial (downloading link)
● A Link to download our tutorial draft PPT slides (will be updated after
tutorial)
● Or this drive:
● https://drive.google.com/drive/folders/1sajkcrnDgTOFiYVkFjYqh9EDYUhneY
qA?usp=sharing
References (selected)
Gladkoff, S., Sorokina, I., Han, L., and Alekseeva, A. (2021). Measuring uncertainty in translation quality evaluation (TQE). CoRR,
abs/2111.07699.
Han, L., Jones, G., and Smeaton, A. (2020b). Mul-tiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In
Proceedings of the 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France, May. European
Language Resources Association.
Han, L., Smeaton, A., and Jones, G. (2021a). Translation quality assessment: A brief survey on manual and automatic methods. In
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age, pages 15–33, online, May. Association for
Computational Linguistics.
Han, L., Sorokina, I., Erofeev, G., and Gladkoff, S. (2021b). cushLEPOR: customising hLEPOR metric using optuna for higher agreement
with human judgments or pre-trained language model labse. In Proceedings of Six Conference on Machine Translation (WMT2021), In
Press. Association for Computational Linguistics.
Han, L. (2022). An Investigation into Multi-Word Expressions in Machine Translation. PhD thesis, Dublin City University.
https://doras.dcu.ie/26559/.
Lommel, A., Burchardt, A., and Uszkoreit, H. (2014). Multidimensional quality metrics (mqm): A frame-work for declaring and describing
translation quality metrics. Traduma`tica: tecnologies de la traduccio ́, 0:455–463, 12.

More Related Content

Similar to HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation

software development methodologies
software development methodologiessoftware development methodologies
software development methodologiesUTeM
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniquesiosrjce
 
An overview of software development methodologies.
An overview of software development methodologies.An overview of software development methodologies.
An overview of software development methodologies.Masoud Kalali
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...Lifeng (Aaron) Han
 
ASE Notes - Kuntucky.docx
ASE Notes - Kuntucky.docxASE Notes - Kuntucky.docx
ASE Notes - Kuntucky.docxSanLizasAiren1
 
Sanket 895 presentation
Sanket 895 presentationSanket 895 presentation
Sanket 895 presentationsanketsp
 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Lifeng (Aaron) Han
 
Problem Solving Techniques
Problem Solving TechniquesProblem Solving Techniques
Problem Solving TechniquesAshesh R
 
Ranking The Refactoring Techniques Based on The External Quality Attributes
Ranking The Refactoring Techniques Based on The External Quality AttributesRanking The Refactoring Techniques Based on The External Quality Attributes
Ranking The Refactoring Techniques Based on The External Quality AttributesIJRES Journal
 
IRJET- Speech Based Answer Sheet Evaluation System
IRJET- Speech Based Answer Sheet Evaluation SystemIRJET- Speech Based Answer Sheet Evaluation System
IRJET- Speech Based Answer Sheet Evaluation SystemIRJET Journal
 
Language and Offensive Word Detection
Language and Offensive Word DetectionLanguage and Offensive Word Detection
Language and Offensive Word DetectionIRJET Journal
 
7a Good Programming Practice.pptx
7a Good Programming Practice.pptx7a Good Programming Practice.pptx
7a Good Programming Practice.pptxDylanTilbury1
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...Dr. Haxel Consult
 
Multi-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMulti-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMatīss ‎‎‎‎‎‎‎  
 

Similar to HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation (20)

software development methodologies
software development methodologiessoftware development methodologies
software development methodologies
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 
SE.pdf
SE.pdfSE.pdf
SE.pdf
 
team10.ppt.pptx
team10.ppt.pptxteam10.ppt.pptx
team10.ppt.pptx
 
Class Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP TechniquesClass Diagram Extraction from Textual Requirements Using NLP Techniques
Class Diagram Extraction from Textual Requirements Using NLP Techniques
 
D017232729
D017232729D017232729
D017232729
 
An overview of software development methodologies.
An overview of software development methodologies.An overview of software development methodologies.
An overview of software development methodologies.
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
 
ASE Notes - Kuntucky.docx
ASE Notes - Kuntucky.docxASE Notes - Kuntucky.docx
ASE Notes - Kuntucky.docx
 
Pesttt testing
Pesttt testingPesttt testing
Pesttt testing
 
Sanket 895 presentation
Sanket 895 presentationSanket 895 presentation
Sanket 895 presentation
 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...
 
Problem Solving Techniques
Problem Solving TechniquesProblem Solving Techniques
Problem Solving Techniques
 
Beekman5 std ppt_13
Beekman5 std ppt_13Beekman5 std ppt_13
Beekman5 std ppt_13
 
Ranking The Refactoring Techniques Based on The External Quality Attributes
Ranking The Refactoring Techniques Based on The External Quality AttributesRanking The Refactoring Techniques Based on The External Quality Attributes
Ranking The Refactoring Techniques Based on The External Quality Attributes
 
IRJET- Speech Based Answer Sheet Evaluation System
IRJET- Speech Based Answer Sheet Evaluation SystemIRJET- Speech Based Answer Sheet Evaluation System
IRJET- Speech Based Answer Sheet Evaluation System
 
Language and Offensive Word Detection
Language and Offensive Word DetectionLanguage and Offensive Word Detection
Language and Offensive Word Detection
 
7a Good Programming Practice.pptx
7a Good Programming Practice.pptx7a Good Programming Practice.pptx
7a Good Programming Practice.pptx
 
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
AI-SDV 2022: Accommodating the Deep Learning Revolution by a Development Proc...
 
Multi-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMulti-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-Latvian
 

More from Lifeng (Aaron) Han

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterLifeng (Aaron) Han
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Lifeng (Aaron) Han
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Lifeng (Aaron) Han
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...Lifeng (Aaron) Han
 
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaMultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaLifeng (Aaron) Han
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the societyLifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a surveyLifeng (Aaron) Han
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 

More from Lifeng (Aaron) Han (20)

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
 
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaMultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a survey
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the society
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-Aaron
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a survey
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 

Recently uploaded

Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝soniya singh
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)Basil Achie
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...NETWAYS
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AITatiana Gurgel
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfhenrik385807
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...NETWAYS
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...NETWAYS
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptxBasil Achie
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSebastiano Panichella
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...NETWAYS
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringSebastiano Panichella
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@vikas rana
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Salam Al-Karadaghi
 

Recently uploaded (20)

Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
NATIONAL ANTHEMS OF AFRICA (National Anthems of Africa)
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
Open Source Camp Kubernetes 2024 | Running WebAssembly on Kubernetes by Alex ...
 
The 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software EngineeringThe 3rd Intl. Workshop on NL-based Software Engineering
The 3rd Intl. Workshop on NL-based Software Engineering
 
call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@call girls in delhi malviya nagar @9811711561@
call girls in delhi malviya nagar @9811711561@
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
Exploring protein-protein interactions by Weak Affinity Chromatography (WAC) ...
 

HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation

  • 1. HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post- Editing Towards More Effective MT Evaluation Serge Gladkoff * & Lifeng Han ** * CEO at Logrus Global https://logrusglobal.com/ ** ADAPT Research Centre, DCU, IE (former) & The University of Manchester, UK (Now) @ LREC2022, June 20th, Marseille, France lifeng.han@{manchester.ac.uk, adaptcentre.ie} serge.gladkoff@logrusglobal.com
  • 3. * The University of Manchester (current), UK & ADAPT Research Centre, DCU (former), Ireland ** Logrus Global (https://logrusglobal.com) Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview Lifeng Han * and Serge Gladkoff ** Half-day Tutorial @ LREC2022, June 20th, Marseille, France lifeng.han@{manchester.ac.uk, adaptcentre.ie} serge.gladkoff@logrusglobal.com X
  • 4.
  • 5.
  • 6. Presentation schedule: HOPE Motivation Design: HOPE Experimental evaluation Summary
  • 7. Motivation Traditional automatic evaluation metrics for MT have been widely criticized by linguists due to their: - low accuracy - lack of transparency - focus on language mechanics rather than semantics - and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and translation service providers (TSPs). However: - traditional human translation quality evaluations are costly to perform - and go into great linguistic detail, raise issues as to inter-rater reliability (IRR) - and are not designed to measure quality of worse than premium quality translations.
  • 8. Motivation For large-scale deployment of MT, a more appropriate quality metric is required which: a) allows for faster learning curve for evaluators to be applied correctly; b) is faster to apply; c) is specifically designed to address less than perfect MT output of “good enough quality”; d) does not track so many unnecessary linguistic details as standard MQM metrics, designed as a tool to measure near-premium quality of human translations.
  • 9. Design HOPE: a task-oriented and Human-centric evaluation framework for machine translation output based on professional Post-editing annotations. It contains: - a limited number of commonly occurring error types (proper name, impact, required adaptation, terminology, grammar, accuracy, style, and proofreading error) - uses a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. - Error annotation and scoring can be done either without post-editing itself, or during post-editing towards a newly generated post-edited reference translation
  • 11. Design: scoring segments Errors of each type can have the following severity differences: (minor, medium, major, severe, critical) with the corresponding values (1, 2, 4, 8, 16). Error points for each Translation Unit (TU) are added to form the Error Point Penalty (EPP) of the TU (EPPTU) under-study. Each TU has its own EPPTU not depending on other TUs. Importantly, repeated errors in different TUs are not counted as one error. The system-level score of HOPE is calculated by the sum of overall segment-level EPPTUs.
  • 13. Design: segment/sentence & word-level segment/sentence-level: We apply this metric into a sentence level (or segment- level) error severity classification, i.e. minor vs major with the EPPTU score (1~4) vs 5+. The benefit of such design is that it immediately allows to distill sentences with only minor errors, with EPPTUs 1, 2, 3 and 4, and sentences with major errors (EPPTUs 5 and more). One can say that EPPTU 1-4 is precisely what is often meant by ``good enough quality'' of MT, where budget, time or frequency of visiting the content does not provide for premium quality translation and lower quality is just fine.
  • 14. Design: segment/sentence & word-level The word level HOPE: follows the segment-level indicators including “unchanged”, “good enough”, and “must be fixed”. However, the statistics will be reflected at word level, e.g. how many words of the whole document/text belong to each of the three categories. Both segment/sentence-level and word-level HOPE indicators can be used to reflect the overall MT quality in translating the overall material/document. However, they can tell different aspects of the MT systems, e.g. when there are many sentences falling into very different length (short {vs} long sentences).
  • 15. Experimental evaluation The pilot experiments contain two tasks. Task-I is carried out using: - English=>Russian (EN=>RU) language pair from marketing document in technical domain with 111 sentences (segments), using two MT engines, Google Translator and a customised MT engine. Task-II uses a document from business domain containing 671 segments (3,339 words) on the same translation direction but using an alternative NMT engine DeepL.
  • 16. Experimental evaluation: sentence level Each error categories in % and overall how many percents of sentences fall into no-change/minor/major-edit
  • 17. Experimental evaluation: comparison of two quality profiles Task-I HOPE quality indicators: comparison of System1 vs Google Translate
  • 18. Experimental evaluation: word level vs segment level Quality Indicators via Segment {vs} Word Level HOPE: number counts and percentage
  • 19. Experimental evaluation: word vs segment HOPE quality indicators via word-level (left) {vs} segment-level (right) in percentage
  • 20. Summary We designed HOPE metric: can be seen as an MQM implementation tailored specifically for evaluating MT output. The approach has several key advantages: - ability to measure and compare less than perfect MT output from different systems - ability to indicate human perception of quality - immediate estimation of the labor effort required to bring MT output to premium quality / good enough - low-cost and faster application, as well as higher IRR Initial experimental eval on: segment/word-level HOPE using en=>ru works! Source files (corpus and scoring): https://github.com/lHan87/HOPE
  • 21. Demo: HOPE ● https://github.com/poethan/LREC22_MetaEval_Tutorial (downloading link) ● A Link to download our tutorial draft PPT slides (will be updated after tutorial) ● Or this drive: ● https://drive.google.com/drive/folders/1sajkcrnDgTOFiYVkFjYqh9EDYUhneY qA?usp=sharing
  • 22. References (selected) Gladkoff, S., Sorokina, I., Han, L., and Alekseeva, A. (2021). Measuring uncertainty in translation quality evaluation (TQE). CoRR, abs/2111.07699. Han, L., Jones, G., and Smeaton, A. (2020b). Mul-tiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France, May. European Language Resources Association. Han, L., Smeaton, A., and Jones, G. (2021a). Translation quality assessment: A brief survey on manual and automatic methods. In Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age, pages 15–33, online, May. Association for Computational Linguistics. Han, L., Sorokina, I., Erofeev, G., and Gladkoff, S. (2021b). cushLEPOR: customising hLEPOR metric using optuna for higher agreement with human judgments or pre-trained language model labse. In Proceedings of Six Conference on Machine Translation (WMT2021), In Press. Association for Computational Linguistics. Han, L. (2022). An Investigation into Multi-Word Expressions in Machine Translation. PhD thesis, Dublin City University. https://doras.dcu.ie/26559/. Lommel, A., Burchardt, A., and Uszkoreit, H. (2014). Multidimensional quality metrics (mqm): A frame-work for declaring and describing translation quality metrics. Traduma`tica: tecnologies de la traduccio ́, 0:455–463, 12.