SlideShare a Scribd company logo
1 of 1
Download to read offline
Measuring Uncertainty in Translation Quality
Evaluation (TQE)
Serge Gladkoff1
, Irina Sorokina21
, Lifeng Han34*
and Alexandra Alekseeva5
,
{serge.gladkoff, irina.sorokina}@logrusglobal.com, lifeng.han@{adaptcentre.ie, manchester.ac.uk}
Institutes: 1
Logrus Global LLC 2
Tver State University 3
The Uni of Manchester 4
ADAPT Centre, DCU 5
ROKO Labs
LREC2022: 13th Edition of Language Resources and Evaluation Conference, Marseille, France, 20-25 June
Motivations
I. From both human translators (HT) and machine translation (MT) researchers’
point of view, translation quality evaluation (TQE) is an essential task.
II. This is especially the case, when language service providers (LSPs) face huge
amount of request frequently from their clients and users to acquire high-quality
translations.
III. While automatic translation quality assessment (TQA) metrics and quality es-
timation (QE) tools are widely available and easy to access, human assessment from
professional translators (HAP) are often chosen as the golden standard [1].
IV. One challenge that comes to us: to avoid the overall text quality checking from
both cost and efficiency perspectives, how to choose the confidence sample size of
the translated text, so as to properly estimate the overall text translation quality?
1
Bernoulli Distribution on Errors
Errors of certain type (category) either present in a sentence, or not.
Errors are independent from each other.
The probability of errors is the same.
When the sample size n is significantly smaller than the overall population N, the
standard deviation of sample measurement falls into the following formula:
σ =
r
p · (1 − p)
n
where p is the probability estimation of an event under study. The confidence interval
CI, using the Wald interval, will be:
CI = p ± ∆
where ∆ is the product of standard deviation and factor 1.96 (when confidence level
95 % is chosen):
∆ = 1,96 · σ = 1,96 ·
r
p · (1 − p)
n
Figure 1 shows the Error density value with sample size variation from 100 to 2K
sentences.
Confident Modelling
Figure 2: how the confident delta value changes with the variation of sample size.
Monte Carlo Simulation (MCS)
Figure 3 shows Mean post-editing distance (normalised) PEDn value variation with
increasing sample size using a 95 % confidence level.
Discussion and Summary
Our statistical modelling using MCS suggests that a sample size of less than 200
sentences can not reflect the overall material quality in a confident enough level on
translation quality evaluation task.
Using MCS, we also reduced the suggested sample size from 10k words (around 625
sentences, from Bernoulli statistics to 4k for reliable estimation of overall translation
quality.
Summary: this work investigates into confidence interval estimation for translation
quality evaluation task, which has been an important role among language servi-
ce providers and Informatics related fields, including machine translation (MT) and
natural language processing (NLP). We used Bernoulli Statistical Distribution Mo-
delling (BSDM) and Monte Carlo Sampling Analysis (MCSA), and gave concrete
feed-backs and guidelines regarding practical situations when translation quality eva-
luation (TQE) is deployed.
1
Our data will be hosted as open-source repository on https://github.com/lHan87/MCMC_TQE. Corresponding author: LH∗
Acknowledgement: Funding source: Logrus Global https://logrusglobal.com/. LH is partially funded by The Uni of
Manchester and ADAPT Research Centre (DCU): The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

More Related Content

More from Lifeng (Aaron) Han

Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...Lifeng (Aaron) Han
 
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaMultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaLifeng (Aaron) Han
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the societyLifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a surveyLifeng (Aaron) Han
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 

More from Lifeng (Aaron) Han (20)

Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
 
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaMultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a survey
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the society
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-Aaron
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a survey
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 

Measuring Uncertainty in Translation Quality Evaluation (TQE)

  • 1. Measuring Uncertainty in Translation Quality Evaluation (TQE) Serge Gladkoff1 , Irina Sorokina21 , Lifeng Han34* and Alexandra Alekseeva5 , {serge.gladkoff, irina.sorokina}@logrusglobal.com, lifeng.han@{adaptcentre.ie, manchester.ac.uk} Institutes: 1 Logrus Global LLC 2 Tver State University 3 The Uni of Manchester 4 ADAPT Centre, DCU 5 ROKO Labs LREC2022: 13th Edition of Language Resources and Evaluation Conference, Marseille, France, 20-25 June Motivations I. From both human translators (HT) and machine translation (MT) researchers’ point of view, translation quality evaluation (TQE) is an essential task. II. This is especially the case, when language service providers (LSPs) face huge amount of request frequently from their clients and users to acquire high-quality translations. III. While automatic translation quality assessment (TQA) metrics and quality es- timation (QE) tools are widely available and easy to access, human assessment from professional translators (HAP) are often chosen as the golden standard [1]. IV. One challenge that comes to us: to avoid the overall text quality checking from both cost and efficiency perspectives, how to choose the confidence sample size of the translated text, so as to properly estimate the overall text translation quality? 1 Bernoulli Distribution on Errors Errors of certain type (category) either present in a sentence, or not. Errors are independent from each other. The probability of errors is the same. When the sample size n is significantly smaller than the overall population N, the standard deviation of sample measurement falls into the following formula: σ = r p · (1 − p) n where p is the probability estimation of an event under study. The confidence interval CI, using the Wald interval, will be: CI = p ± ∆ where ∆ is the product of standard deviation and factor 1.96 (when confidence level 95 % is chosen): ∆ = 1,96 · σ = 1,96 · r p · (1 − p) n Figure 1 shows the Error density value with sample size variation from 100 to 2K sentences. Confident Modelling Figure 2: how the confident delta value changes with the variation of sample size. Monte Carlo Simulation (MCS) Figure 3 shows Mean post-editing distance (normalised) PEDn value variation with increasing sample size using a 95 % confidence level. Discussion and Summary Our statistical modelling using MCS suggests that a sample size of less than 200 sentences can not reflect the overall material quality in a confident enough level on translation quality evaluation task. Using MCS, we also reduced the suggested sample size from 10k words (around 625 sentences, from Bernoulli statistics to 4k for reliable estimation of overall translation quality. Summary: this work investigates into confidence interval estimation for translation quality evaluation task, which has been an important role among language servi- ce providers and Informatics related fields, including machine translation (MT) and natural language processing (NLP). We used Bernoulli Statistical Distribution Mo- delling (BSDM) and Monte Carlo Sampling Analysis (MCSA), and gave concrete feed-backs and guidelines regarding practical situations when translation quality eva- luation (TQE) is deployed. 1 Our data will be hosted as open-source repository on https://github.com/lHan87/MCMC_TQE. Corresponding author: LH∗ Acknowledgement: Funding source: Logrus Global https://logrusglobal.com/. LH is partially funded by The Uni of Manchester and ADAPT Research Centre (DCU): The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.