From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations that meet customer specifications under harsh constraints on required quality level, time frame, and cost. MT researchers strive to improve their models, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of low reliability and agreement. Is this caused by subjectivity, or are statistical effects at play? To avoid checking the entire text and make TQE more efficient in terms of cost and effort, what is the optimal sample size of the translated text that allows a reliable estimate of the translation quality of the entire material? Motivated by these questions, this work correctly estimates the confidence intervals \cite{Brown_etal2001Interval} depending on the size of the translated text sample, e.g. the number of words or sentences, that needs to be processed at the TQE workflow step for a confident and reliable evaluation of overall translation quality.
The methodology we apply in this work is based on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S. Gladkoff, I. Sorokina, L. Han, A. Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC 2022. arXiv preprint arXiv:2111.07699.
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Serge Gladkoff (1), Irina Sorokina (2,1), Lifeng Han (3,4)*, Alexandra Alekseeva (5)
{serge.gladkoff, irina.sorokina}@logrusglobal.com, lifeng.han@{adaptcentre.ie, manchester.ac.uk}
Institutes: (1) Logrus Global LLC, (2) Tver State University, (3) The Uni of Manchester, (4) ADAPT Centre, DCU, (5) ROKO Labs
LREC2022: 13th Edition of Language Resources and Evaluation Conference, Marseille, France, 20-25 June
Motivations
I. From both human translators (HT) and machine translation (MT) researchers’
point of view, translation quality evaluation (TQE) is an essential task.
II. This is especially the case when language service providers (LSPs) frequently face a huge number of requests from their clients and users for high-quality translations.
III. While automatic translation quality assessment (TQA) metrics and quality estimation (QE) tools are widely available and easy to access, human assessment from professional translators (HAP) is often chosen as the gold standard [1].
IV. One challenge arises: to avoid quality-checking the entire text, from both cost and efficiency perspectives, how do we choose a confident sample size of the translated text so as to properly estimate the overall translation quality?
Bernoulli Distribution on Errors
Errors of a certain type (category) are either present in a sentence or not.
Errors are independent of each other.
The probability of an error is the same for every sentence.
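Under these assumptions, the per-sentence error indicator follows a Bernoulli distribution, and the predicted standard deviation of the estimated error rate can be checked with a small simulation (a minimal sketch; the 10% error rate, sample size of 400, and trial count are illustrative assumptions, not values from the study):

```python
import math
import random
import statistics

random.seed(0)
p, n, trials = 0.10, 400, 2000  # illustrative values

# Each sentence independently contains an error with probability p;
# estimate p from n sentences, repeated over many independent trials.
estimates = [sum(random.random() < p for _ in range(n)) / n
             for _ in range(trials)]

empirical_sd = statistics.stdev(estimates)
theoretical_sd = math.sqrt(p * (1 - p) / n)  # Bernoulli prediction
print(f"empirical: {empirical_sd:.4f}, theoretical: {theoretical_sd:.4f}")
```

The empirical spread of the estimates should closely match the theoretical value sqrt(p(1-p)/n), which is the quantity the next formula relies on.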
When the sample size n is significantly smaller than the overall population N, the standard deviation of the sample measurement is given by:
\[ \sigma = \sqrt{\frac{p \cdot (1 - p)}{n}} \]
where p is the probability estimation of an event under study. The confidence interval
CI, using the Wald interval, will be:
\[ CI = p \pm \Delta \]
where ∆ is the product of the standard deviation and the factor 1.96 (when a 95% confidence level is chosen):
\[ \Delta = 1.96 \cdot \sigma = 1.96 \cdot \sqrt{\frac{p \cdot (1 - p)}{n}} \]
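For concreteness, the Wald interval above can be computed directly; a minimal sketch follows, where the assumed observed error rate of 10% and the sample sizes are illustrative, not values from the study:

```python
import math

def wald_interval(p, n, z=1.96):
    """Wald confidence interval for an error probability p
    estimated from a sample of n sentences (95% level for z=1.96)."""
    sigma = math.sqrt(p * (1 - p) / n)  # standard deviation of the estimate
    delta = z * sigma                   # half-width of the interval
    return p - delta, p + delta

# Example: an assumed observed error rate of 10% in samples of growing size.
for n in (100, 500, 2000):
    lo, hi = wald_interval(0.10, n)
    print(f"n={n:4d}: CI = [{lo:.3f}, {hi:.3f}], width = {hi - lo:.3f}")
```

The interval width shrinks only with the square root of n, which is why quadrupling the sample merely halves the uncertainty.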
Figure 1 shows the error density value with the sample size varying from 100 to 2K sentences.
Confident Modelling
Figure 2 shows how the confidence delta value changes with the variation of sample size.
Monte Carlo Simulation (MCS)
Figure 3 shows how the mean normalised post-editing distance (PEDn) value varies with increasing sample size, using a 95% confidence level.
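The Monte Carlo procedure can be sketched as repeatedly drawing random sentence samples from a scored corpus and watching the spread of the sample-mean PEDn shrink as the sample grows (a minimal sketch; the synthetic PEDn scores, corpus size, and sample sizes are purely illustrative assumptions):

```python
import random
import statistics

random.seed(7)

# Hypothetical corpus: one normalised post-editing distance (PEDn) score per
# sentence, drawn from a synthetic distribution purely for illustration.
corpus = [min(1.0, max(0.0, random.gauss(0.25, 0.15))) for _ in range(10_000)]

def mc_spread(sample_size, trials=1000):
    """Std dev of the sample-mean PEDn over repeated random samples."""
    means = [statistics.mean(random.sample(corpus, sample_size))
             for _ in range(trials)]
    return statistics.stdev(means)

for size in (50, 200, 800):
    print(f"n={size:3d}: std of sample mean = {mc_spread(size):.4f}")
```

Plotting this spread against the sample size reproduces the qualitative shape of the curve discussed above: small samples give wildly varying quality estimates, and the variation decays slowly as the sample grows.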
Discussion and Summary
Our statistical modelling using MCS suggests that a sample size of fewer than 200 sentences cannot reflect the overall material quality at a confident enough level for the translation quality evaluation task.
Using MCS, we also reduced the suggested sample size from 10k words (around 625 sentences), derived from Bernoulli statistics, to 4k words for a reliable estimation of overall translation quality.
Summary: this work investigates confidence interval estimation for the translation quality evaluation task, which plays an important role for language service providers and informatics-related fields, including machine translation (MT) and natural language processing (NLP). We used Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA), and give concrete feedback and guidelines for practical situations where translation quality evaluation (TQE) is deployed.
Our data will be hosted as an open-source repository at https://github.com/lHan87/MCMC_TQE. Corresponding author: LH*.
Acknowledgement: Funding source: Logrus Global https://logrusglobal.com/. LH is partially funded by The Uni of
Manchester and ADAPT Research Centre (DCU): The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.