A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
ACL 2013 Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, August 8-9, 2013.

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Yi Lu, Liangye He, Yiming Wang and Jiaji Zhou
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory (NLP2CT)
Department of Computer and Information Science, University of Macau, Macau
Acknowledgements
The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding this research under reference Nos. 017/2009/A and RG060/09-10S/CS/FST.
Introduction
We developed two metrics for the traditional machine translation evaluation metrics task at ACL-WMT13:

nLEPOR*: an n-gram based, language-independent MT evaluation metric that combines a modified sentence length penalty, a position difference penalty, n-gram precision and n-gram recall.

hLEPOR**: a new version of the LEPOR metric that groups the factors with the mathematical harmonic mean and employs linguistic features such as part-of-speech (POS) information.

* nLEPOR_baseline; ** LEPOR_v3.1 (the system names used in the official results).
Proposed Models
Firstly, we introduce the augmented factors designed in the proposed models.

Modified length penalty:
$$LP = \begin{cases} e^{1-\frac{r}{c}} & \text{if } c < r \\ 1 & \text{if } c = r \\ e^{1-\frac{c}{r}} & \text{if } c > r \end{cases} \quad (1)$$
The parameters $c$ and $r$ represent the lengths of the candidate sentence and the reference sentence respectively.

Precision and Recall:
$$P_n = \frac{\#\text{ngram matched}}{\#\text{ngram chunks in hypothesis}} \quad (2)$$
$$R_n = \frac{\#\text{ngram matched}}{\#\text{ngram chunks in reference}} \quad (3)$$
$$WNHPR = \exp\left(\sum_{n=1}^{N} w_n \log H(\alpha R_n, \beta P_n)\right) \quad (4)$$
$WNHPR$ denotes the weighted n-gram harmonic mean of precision and recall, and $\#\text{ngram matched}$ is the number of matching n-grams between the hypothesis and the reference.
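To make the factor definitions concrete, here is a minimal Python sketch of equations (1)-(4) for a single hypothesis/reference pair. It assumes whitespace tokenisation, a single reference, uniform n-gram weights $w_n$ and $\alpha=\beta=1$; the function names are illustrative rather than the ones used in the released system.

```python
import math
from collections import Counter

def length_penalty(c_len, r_len):
    """Modified length penalty, equation (1)."""
    if c_len < r_len:
        return math.exp(1 - r_len / c_len)
    if c_len > r_len:
        return math.exp(1 - c_len / r_len)
    return 1.0

def ngrams(tokens, n):
    """All n-gram chunks of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision_recall(hyp, ref, n):
    """Clipped n-gram precision and recall, equations (2) and (3)."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum((hyp_counts & ref_counts).values())
    precision = matched / max(len(hyp) - n + 1, 1)
    recall = matched / max(len(ref) - n + 1, 1)
    return precision, recall

def wnhpr(hyp, ref, max_n=4, alpha=1.0, beta=1.0, eps=1e-9):
    """Weighted n-gram harmonic mean of precision and recall, equation (4)."""
    total = 0.0
    w_n = 1.0 / max_n  # uniform n-gram weights (an assumption of this sketch)
    for n in range(1, max_n + 1):
        p, r = precision_recall(hyp, ref, n)
        # H(alpha*R_n, beta*P_n): weighted harmonic mean of recall and precision
        h = (alpha + beta) * p * r / (alpha * p + beta * r + eps)
        total += w_n * math.log(h + eps)
    return math.exp(total)

hyp = "the cat sat on a mat".split()
ref = "the cat sat on the mat".split()
print(length_penalty(len(hyp), len(ref)), wnhpr(hyp, ref))
```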
Position difference penalty:
$$NPosPenal = e^{-NPD} \quad (5)$$
$$NPD = \frac{1}{Length_{hyp}} \sum_{i=1}^{Length_{hyp}} |PD_i| \quad (6)$$
$$PD_i = |MatchN_{hyp} - MatchN_{ref}| \quad (7)$$
The factor $NPosPenal$ is the n-gram based position difference penalty, $NPD$ is the n-gram position difference value, $Length_{hyp}$ is the length of the hypothesis (candidate) sentence, and $MatchN_{hyp}$ and $MatchN_{ref}$ are the positions of the matched items in the hypothesis sentence and the reference sentence respectively.
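A small sketch of equations (5)-(7) follows. It uses a deliberately simplified alignment (each hypothesis token is matched to the nearest reference position carrying the same token) and normalises positions by sentence length; the full metric uses a more careful, context-aware alignment of matched n-grams, so treat this only as an illustration of the penalty's shape.

```python
import math

def npos_penalty(hyp, ref):
    """N-gram position difference penalty, equations (5)-(7),
    with a simplified nearest-match word alignment."""
    if not hyp:
        return 1.0
    total_pd = 0.0
    for i, tok in enumerate(hyp):
        positions = [j for j, r_tok in enumerate(ref) if r_tok == tok]
        if not positions:
            continue  # unmatched tokens contribute no position difference here
        j = min(positions, key=lambda pos: abs(pos - i))
        # |MatchN_hyp - MatchN_ref|, positions normalised by sentence length (assumption)
        total_pd += abs(i / len(hyp) - j / len(ref))
    npd = total_pd / len(hyp)   # equation (6)
    return math.exp(-npd)       # equation (5)

print(npos_penalty("the cat sat on a mat".split(),
                   "on the mat the cat sat".split()))
```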
Secondly, the Proposed LEPOR Metrics:
$$nLEPOR = LP \times NPosPenal \times WNHPR \quad (8)$$
$$hLEPOR = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}} \quad (9)$$
nLEPOR uses the simple product of the three sub-factors, while hLEPOR uses their weighted harmonic mean.
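Putting the pieces together, a minimal sketch of equations (8) and (9), reusing the illustrative helpers above; the weights shown are placeholders, not the tuned values used in the submission.

```python
def nlepor(hyp, ref, max_n=4):
    """Equation (8): simple product of the three sub-factors."""
    return (length_penalty(len(hyp), len(ref))
            * npos_penalty(hyp, ref)
            * wnhpr(hyp, ref, max_n))

def hlepor(hyp, ref, w_lp=1.0, w_pen=1.0, w_hpr=1.0):
    """Equation (9): weighted harmonic mean of the three sub-factors."""
    lp = length_penalty(len(hyp), len(ref))
    pen = npos_penalty(hyp, ref)
    hpr = wnhpr(hyp, ref)
    return (w_lp + w_pen + w_hpr) / (w_lp / lp + w_pen / pen + w_hpr / hpr)
```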
Finally, the Combination of Linguistic Features:
We extract the POS sequence of each corresponding word sentence and then compute $hLEPOR_{POS}$ on the POS sequences with the same algorithm used on the surface words for $hLEPOR_{word}$. The final score $hLEPOR_{final}$ is the weighted combination of the two scores:
$$hLEPOR_{final} = \frac{1}{w_{hw} + w_{hp}} \times (w_{hw} \cdot hLEPOR_{word} + w_{hp} \cdot hLEPOR_{POS}) \quad (10)$$
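Equation (10) in the same sketch style; the POS sequences are assumed to come from an external tagger, and the weights are again placeholders.

```python
def hlepor_final(hyp_words, ref_words, hyp_pos, ref_pos, w_hw=1.0, w_hp=1.0):
    """Equation (10): weighted combination of word-level and POS-level hLEPOR."""
    score_word = hlepor(hyp_words, ref_words)
    score_pos = hlepor(hyp_pos, ref_pos)  # same algorithm, run on POS tag sequences
    return (w_hw * score_word + w_hp * score_pos) / (w_hw + w_hp)
```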
Experiment
Training data: the WMT11 data is used, which contains 3,003 sentences per document for each English-to-other language pair (Spanish, German, French and Czech).
Testing data: the WMT13 data, which contains 3,000 sentences per document for each language pair, with English-Russian newly added.

Using the Pearson correlation coefficient, LEPOR_v3.1 yields the best system-level performance on the English-to-other language pairs, with a mean correlation score of 0.86. Furthermore, LEPOR_v3.1 and nLEPOR_baseline also yield the highest Pearson correlation scores on the English-to-Russian (0.77) and English-to-Czech (0.82) language pairs respectively.
On the other hand, LEPOR yields moderate performance on the other-to-English language pairs, where METEOR achieves the best scores. Nevertheless, LEPOR's mean score of 0.87 on other-to-English is close to its 0.86 on English-to-other, which shows its robustness across translation directions.
The Spearman rank correlation results differ only slightly from the Pearson results. Using Kendall's tau, the segment-level correlation scores of LEPOR are moderate in both translation directions, reflecting the fact that LEPOR was tuned at system level during the training stage of this task.
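For reference, correlations of the kind reported in Tables 1-4 can be computed with scipy. The sketch below uses made-up scores, since the WMT13 human judgment data is not reproduced here, and the official segment-level Kendall's tau is derived from pairwise human rankings, so this shows only the general form of each statistic.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical system-level scores for one language pair: metric vs. human.
metric_scores = [0.291, 0.334, 0.305, 0.352]
human_scores = [0.12, 0.25, 0.17, 0.31]

print("Pearson:  %.2f" % pearsonr(metric_scores, human_scores)[0])
print("Spearman: %.2f" % spearmanr(metric_scores, human_scores)[0])
print("Kendall:  %.2f" % kendalltau(metric_scores, human_scores)[0])
```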
Metric              EN-FR  EN-DE  EN-ES  EN-CS  EN-RU   Av
LEPOR_v3.1           .91    .94    .91    .76    .77    .86
nLEPOR_baseline      .92    .92    .90    .82    .68    .85
SIMPBLEU_RECALL      .95    .93    .90    .82    .63    .84
SIMPBLEU_PREC        .94    .90    .89    .82    .65    .84
NIST-mteval-inter    .91    .83    .84    .79    .68    .81
Meteor               .91    .88    .88    .82    .55    .81
BLEU-mteval-inter    .89    .84    .88    .81    .61    .80
BLEU-moses           .90    .82    .88    .80    .62    .80
BLEU-mteval          .90    .82    .87    .80    .62    .80
CDER-moses           .91    .82    .88    .74    .63    .80
NIST-mteval          .91    .79    .83    .78    .68    .79
PER-moses            .88    .65    .88    .76    .62    .76
TER-moses            .91    .73    .78    .70    .61    .75
WER-moses            .92    .69    .77    .70    .61    .74
TerrorCat            .94    .96    .95    na     na     .95
SEMPOS               na     na     na     .72    na     .72
ACTa                 .81   -.47    na     na     na     .17
ACTa5+6              .81   -.47    na     na     na     .17
Table 1. System-level Pearson correlation scores on WMT13 English-to-other
Metric              FR-EN  DE-EN  ES-EN  CS-EN  RU-EN   Av
Meteor               .98    .96    .97    .99    .84    .95
SEMPOS               .95    .95    .96    .99    .82    .93
Depref-align         .97    .97    .97    .98    .74    .93
Depref-exact         .97    .97    .96    .98    .73    .92
SIMPBLEU_RECALL      .97    .97    .96    .94    .78    .92
UMEANT               .96    .97    .99    .97    .66    .91
MEANT                .96    .96    .99    .96    .63    .90
CDER-moses           .96    .91    .95    .90    .66    .88
SIMPBLEU_PREC        .95    .92    .95    .91    .61    .87
LEPOR_v3.1           .96    .96    .90    .81    .71    .87
nLEPOR_baseline      .96    .94    .94    .80    .69    .87
BLEU-mteval-inter    .95    .92    .94    .90    .61    .86
NIST-mteval-inter    .94    .91    .93    .84    .66    .86
BLEU-moses           .94    .91    .94    .89    .60    .86
BLEU-mteval          .95    .90    .94    .88    .60    .85
NIST-mteval          .94    .90    .93    .84    .65    .85
TER-moses            .93    .87    .91    .77    .52    .80
WER-moses            .93    .84    .89    .76    .50    .78
PER-moses            .84    .88    .87    .74    .45    .76
TerrorCat            .98    .98    .97    na     na     .98
Table 2. System-level Pearson correlation scores on WMT13 other-to-English
Metric                    EN-FR  EN-DE  EN-ES  EN-CS  EN-RU   Av
SIMPBLEU_RECALL            .16    .09    .23    .06    .12    .13
Meteor                     .15    .05    .18    .06    .11    .11
SIMPBLEU_PREC              .14    .07    .19    .06    .09    .11
sentBLEU-moses             .13    .05    .17    .05    .09    .10
LEPOR_v3.1                 .13    .06    .18    .02    .11    .10
nLEPOR_baseline            .12    .05    .16    .05    .10    .10
dfki_logregNorm-411        na     na     .14    na     na     .14
TerrorCat                  .12    .07    .19    na     na     .13
dfki_logregNormSoft-431    na     na     .03    na     na     .03
Table 3. Segment-level Kendall's tau correlation scores on WMT13 English-to-other
Metric              FR-EN  DE-EN  ES-EN  CS-EN  RU-EN   Av
SIMPBLEU_RECALL      .19    .32    .28    .26    .23    .26
Meteor               .18    .29    .24    .27    .24    .24
Depref-align         .16    .27    .23    .23    .20    .22
Depref-exact         .17    .26    .23    .23    .19    .22
SIMPBLEU_PREC        .15    .24    .21    .21    .17    .20
nLEPOR_baseline      .15    .24    .20    .18    .17    .19
sentBLEU-moses       .15    .22    .20    .20    .17    .19
LEPOR_v3.1           .15    .22    .16    .19    .18    .18
UMEANT               .10    .17    .14    .16    .11    .14
MEANT                .10    .16    .14    .16    .11    .14
dfki_logregFSS-33    na     .27    na     na     na     .27
dfki_logregFSS-24    na     .27    na     na     na     .27
TerrorCat            .16    .30    .23    na     na     .23
Table 4. Segment-level Kendall's tau correlation scores on WMT13 other-to-English