Automatic Machine Translation Evaluation with
Part-of-Speech Information
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, and Liangye He
University of Macau, Department of Computer and Information Science
Av. Padre Tomás Pereira, Taipa, Macau, China
{hanlifengaaron}@gmail.com, {derekfw,lidiasc}@umac.mo
{wutianshui0515}@gmail.com
Abstract. One problem of automatic translation is the evaluation of the result:
the output should be as close to a human reference translation as possible, but
varying word order and synonyms have to be taken into account when judging the
similarity of the two. Conventional methods tend to employ many resources, such
as synonym vocabularies, paraphrasing, and text entailment data. To make the
evaluation model both accurate and concise, this paper explores evaluation
using only the Part-of-Speech (POS) information of the words, i.e. a method
based solely on the consilience of the POS strings of the hypothesis
translation and the reference. In the developed method, the POS acts as a
partial substitute for synonym information, in addition to capturing the
syntactic or morphological behaviour of the lexical item in question. Measures
of the similarity between machine translation and human reference depend on the
language pair, since, for instance, the word order or the number of synonyms
may vary. The new measure addresses this problem to a certain extent by
introducing weights for the different sources of information. Experimental
results on English, German, and French show that, on average, the measure
correlates better with human judgments than several existing measures, such as
BLEU, AMBER, and MP4IBM1.
Keywords: Natural language processing, Machine translation evaluation, Part-of-Speech, Reference translation
1 Introduction
With the rapid development of Machine Translation (MT) systems, how to evaluate
the quality of each MT system, and by what criteria, has become a new challenge
for MT researchers. Commonly used automatic evaluation metrics include the word
error rate (WER) [2], BLEU [3] (the geometric mean of the n-gram precision of
the system output with respect to reference translations), and NIST [4]. More
recently, many other methods have been proposed to revise or improve these
earlier works.
The METEOR [5] metric performs flexible matching that considers stems,
synonyms, and paraphrases; its method and formula for computing a score are
much more complicated than BLEU's [1]. The matching process involves
computationally expensive word alignment, and several parameters, such as the
relative weight of recall to precision and the weights for stemming and synonym
matching, must be tuned. Snover et al. [6] pointed out that one disadvantage of
the Levenshtein distance is that mismatches in word order require the deletion
and re-insertion of the misplaced words; they proposed TER, which adds an
editing step that allows the movement of word sequences from one part of the
output to another. AMBER [7], with its variants AMBER-TI and AMBER-NL, is a
modified version of BLEU that attaches additional penalty coefficients,
combining n-gram precision and recall through the arithmetic average of the
F-measure. F15 [8] and F15G3 perform evaluation with the F1 measure (assigning
the same weight to precision and recall) over target features; the target
features they define include the true positive (TP), true negative (TN), false
positive (FP), and false negative (FN) rates. To take the phrases surrounding a
missing token in the translation into account, they employ gapped word sequence
kernels [9]. Other related works include [10], [11], and [12] on word order,
ROSE [13] and MPF and WMPF [14] on the use of POS information, and MP4IBM1
[15], which does not rely on reference translations.
The evaluation methods proposed previously tend to rely either on many
linguistic features (which makes them difficult to replicate) or on no
linguistic information at all (which leads to low correlation with human
judgments). To address this problem, this paper explores the performance of a
novel method that uses only the consilience of the POS strings of the
hypothesis translation and the reference translation. This ensures that
linguistic information is considered in the evaluation while keeping the model
very concise.
2 Linguistic Features
As discussed above, language variability means there is no single correct
translation, and different languages do not always express the same content in
the same way. To address this variability, researchers have typically employed
synonyms, paraphrasing, or text entailment as auxiliary information. All of
these approaches have their advantages and weaknesses; for example, synonym
dictionaries can hardly cover all acceptable expressions. Instead, the designed
metric uses part-of-speech (POS) information (as also applied by ROSE [13] and
by MPF and WMPF [14]). If a system output is a good translation, then its
sentence is likely to carry semantic information similar to that of the
reference sentence: the two sentences may not contain exactly the same words,
but words with similar semantic meanings. For example, "there is a big bag" and
"there is a large bag" can express the same thing, since "big" and "large" have
similar meanings (and the same POS, adjective). To try this approach, we
conduct the evaluation on the POS tags of the words instead of the words
themselves, and we do not use external resources such as synonym dictionaries.
We also test the approach by calculating the correlation of this method with
human judgments in the experiments. Assume we have two sentences: one reference
and one system output translation. First, we extract the POS of each word;
then we calculate the similarity of the two sentences through the alignment of
their POS information.
3 Calculation Methods
3.1 Design of hLEPOR Metric
First, we introduce the mathematical harmonic mean of n variables
(X_1, X_2, ..., X_n):

\mathrm{Harmonic}(X_1, X_2, \dots, X_n) = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}    (1)
where n is the number of variables (also called factors). The weighted harmonic
mean for multiple variables is then:

\mathrm{Harmonic}(w_{X_1} X_1, w_{X_2} X_2, \dots, w_{X_n} X_n) = \frac{\sum_{i=1}^{n} w_{X_i}}{\sum_{i=1}^{n} \frac{w_{X_i}}{X_i}}    (2)
where w_{X_i} denotes the weight assigned to the corresponding variable X_i.
Finally, the proposed evaluation metric hLEPOR (tunable Harmonic mean of Length
Penalty, Precision, n-gram Position difference Penalty and Recall) is designed
as:

\mathrm{hLEPOR} = \mathrm{Harmonic}(w_{LP}\,LP,\; w_{NPosPenal}\,NPosPenal,\; w_{HPR}\,HPR)    (3)
    = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}}
    = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}
where LP, NPosPenal, and HPR are the three factors of hLEPOR and are introduced
below. Three tunable weight parameters, w_{LP}, w_{NPosPenal}, and w_{HPR}, are
assigned to the three factors respectively.
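As a minimal illustration (our own sketch, not the released implementation;
function and parameter names are hypothetical), the weighted harmonic mean of
Eq. (2) and the combination of Eq. (3) can be written in Python as:

def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean, Eq. (2): sum(w_i) / sum(w_i / x_i)."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def hlepor(lp, npos_penal, hpr, w_lp=2.0, w_npos=1.0, w_hpr=3.0):
    """Eq. (3): harmonic combination of the three hLEPOR factors.

    The default weights follow the tuned ratio HPR:LP:NPosPenal = 3:2:1
    reported in Table 1.
    """
    return weighted_harmonic_mean([lp, npos_penal, hpr],
                                  [w_lp, w_npos, w_hpr])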
3.2 Design of Internal Factors
Length Penalty In Eq. (3), LP is a length penalty, which penalizes system
outputs that are either longer or shorter than the reference translations:

LP = \begin{cases} e^{1-\frac{r}{c}} & c < r \\ 1 & c = r \\ e^{1-\frac{c}{r}} & c > r \end{cases}    (4)
where c and r denote the sentence lengths of the candidate translation and the
reference translation, respectively.
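Eq. (4) translates directly into code; this sketch assumes c and r are the
token counts of the candidate and the reference:

import math

def length_penalty(c: int, r: int) -> float:
    """Eq. (4): penalize candidates longer or shorter than the reference."""
    if c < r:
        return math.exp(1.0 - r / c)
    if c > r:
        return math.exp(1.0 - c / r)
    return 1.0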
N-gram Position Difference Penalty In Eq. (3), NPosPenal is defined as:

NPosPenal = e^{-NPD}    (5)
where NPD denotes the n-gram position difference penalty. The NPosPenal value
compares the POS order between the reference translation and the output
translation. NPD is defined as:

NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|    (6)
where Length_{output} is the length of the system output sentence and PD_i is
the n-gram position difference value of an aligned POS between the output and
reference sentences. Every POS from the output translation and the reference
should be aligned at most once (one-to-one alignment). When there is no match,
the PD_i value for that output POS defaults to zero.
Calculating the NPD value takes two steps: aligning and calculating. First, the
context-dependent n-gram alignment: we use the n-gram method with higher
priority, meaning we take the surrounding context (surrounding POS tags) of a
candidate POS into account to select the better matching pair between the
output and the reference. If there are matches on both sides nearby, or no
matched POS around the candidate pair, we take the nearest match as a backup
alignment choice. The alignment direction is from the output sentence to the
reference. See the example in Figure 1. In the second (calculating) step, we
label each POS with its position number divided by the corresponding sentence
length for normalization, and then use Eq. (6) to finish the calculation.
Fig. 1. Example of n-gram POS alignment
We reuse the example in Figure 1 to illustrate the NPD computation; Figure 2
(Example of NPD calculation) shows the result. When labelling the position
numbers of the output sentence, we divide the numerical position (from 1 to 5)
of each POS by the sentence length 5; the reference sentence is treated in the
same way. After obtaining the NPD value, the NPosPenal value is calculated
using Eq. (5), as sketched below.
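The following simplified sketch (an illustration under stated assumptions, not
the paper's exact algorithm: it uses only nearest-position matching, whereas
the paper's aligner additionally prefers matches supported by surrounding
n-gram context) computes NPD and NPosPenal for two POS tag sequences:

import math

def npd(out_pos, ref_pos):
    """Simplified NPD of Eq. (6) with nearest-match, one-to-one alignment.

    out_pos, ref_pos: lists of POS tags, e.g. ["DT", "VBZ", ...].
    Positions are normalized by sentence length before taking |PD_i|;
    unmatched output tags contribute PD_i = 0, as in the paper.
    """
    used = set()                      # reference indices already aligned
    total = 0.0
    for i, tag in enumerate(out_pos):
        candidates = [j for j, t in enumerate(ref_pos)
                      if t == tag and j not in used]
        if not candidates:
            continue                  # no match: PD_i defaults to 0
        j = min(candidates, key=lambda k: abs(k - i))   # nearest match
        used.add(j)
        total += abs((i + 1) / len(out_pos) - (j + 1) / len(ref_pos))
    return total / len(out_pos) if out_pos else 0.0

def npos_penal(out_pos, ref_pos):
    """Eq. (5): NPosPenal = exp(-NPD)."""
    return math.exp(-npd(out_pos, ref_pos))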
Precision and Recall Precision reflects the accuracy of the output, while
recall measures its loyalty to the references. In Eq. (3), HPR is the weighted
harmonic mean of precision and recall, i.e. Harmonic(αR, βP), with parameters α
and β as the tunable weights for recall and precision, respectively:
\mathrm{Harmonic}(\alpha R, \beta P) = \frac{\alpha + \beta}{\frac{\alpha}{R} + \frac{\beta}{P}}    (7)
P = \frac{aligned_{num}}{system_{length}}    (8)

R = \frac{aligned_{num}}{reference_{length}}    (9)
where aligned_{num} is the number of successfully aligned (matched) POS tags
appearing in both the translation and the reference, and system_{length} and
reference_{length} are the sentence lengths of the system output and the
reference, respectively.
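A compact sketch of Eqs. (7)-(9) follows, assuming aligned_{num} is taken as
the multiset overlap of POS tags (which equals the number of one-to-one tag
matches); the defaults alpha=9, beta=1 follow the tuned (α, β) in Table 1:

from collections import Counter

def hpr(out_pos, ref_pos, alpha=9.0, beta=1.0):
    """Eqs. (7)-(9): weighted harmonic mean of recall and precision."""
    aligned_num = sum((Counter(out_pos) & Counter(ref_pos)).values())
    if aligned_num == 0:
        return 0.0
    p = aligned_num / len(out_pos)    # Eq. (8): precision
    r = aligned_num / len(ref_pos)    # Eq. (9): recall
    return (alpha + beta) / (alpha / r + beta / p)   # Eq. (7)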
System-level hLEPOR We have introduced the calculation of hLEPOR for a single
output sentence; we also need a proper way to compute the score at the document
(or system) level. We compute the system-level hLEPOR as:
\mathrm{hLEPOR}_{sys} = \mathrm{Harmonic}(w_{LP}\,LP_{sys},\; w_{NPosPenal}\,NPosPenal_{sys},\; w_{HPR}\,HPR_{sys})    (10)
As shown in the formula, to calculate the system-level score hLEPOR_{sys} we
first calculate the system-level scores of its factors LP_{sys},
NPosPenal_{sys}, and HPR_{sys}. Each system-level factor score is the
arithmetic mean of the corresponding sentence-level factor scores, as sketched
below.
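A minimal sketch of this aggregation, reusing the hypothetical hlepor function
from the earlier sketch:

def hlepor_system(sentence_factors):
    """System-level hLEPOR per Eq. (10).

    sentence_factors: list of (lp, npos_penal, hpr) tuples, one per sentence.
    Each factor is first averaged arithmetically over sentences; the averages
    are then combined with the same weighted harmonic mean as Eq. (3).
    """
    n = len(sentence_factors)
    lp_sys  = sum(f[0] for f in sentence_factors) / n
    npp_sys = sum(f[1] for f in sentence_factors) / n
    hpr_sys = sum(f[2] for f in sentence_factors) / n
    return hlepor(lp_sys, npp_sys, hpr_sys)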
4 Experiments
We trained hLEPOR and tuned its parameters on the public ACL WMT 2008 data
(http://www.statmt.org/wmt08/). The WMT 2008 data covers five languages:
English, Spanish, German, French, and Czech; however, we did not find suitable
parsers for Spanish and Czech. We therefore tested on English, German, and
French, using the Berkeley parser [16] to extract the POS information of the
tested sentences. Thus there are four language pairs in our test corpora:
German and French to English, and the inverse directions. The tuned parameter
values, shared by all language pairs, are shown in Table 1.
Table 1. Values of tuned parameters

Parameter   (α, β)   n-gram POS alignment   Weights (HPR : LP : NPosPenal)
Value       (9, 1)   2-gram                 3 : 2 : 1
The test corpora are from ACL WMT 2011 (http://www.statmt.org/wmt11/). More
than one hundred MT systems submitted their output translations; most are
statistical, except for five rule-based systems. The gold-standard reference
data for these corpora consist of 3,003 sentences. Each language pair has a
different number of participating MT systems. Automatic MT evaluation metrics
are compared by calculating their Spearman rank correlation coefficients with
the human judgment results [17].
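As an illustration of this meta-evaluation step (with hypothetical scores, not
the paper's data), the Spearman rank correlation between a metric's system
scores and the human scores can be computed with SciPy:

from scipy.stats import spearmanr

# Hypothetical example: scores for five MT systems from a metric and from
# human judgment; only the relative rankings matter for Spearman correlation.
metric_scores = [0.41, 0.37, 0.52, 0.45, 0.30]
human_scores  = [0.62, 0.55, 0.71, 0.64, 0.49]

rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman rank correlation: {rho:.2f}")  # 1.00 for identical rankings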
Table 2. Correlation coefficients with human judgments

                 Other-to-English     English-to-Other
Metrics          DE-EN     FR-EN      EN-DE     EN-FR      Mean score
hLEPOR           0.83      0.74       0.84      0.82       0.81
MPF              0.69      0.87       0.63      0.89       0.77
WMPF             0.66      0.87       0.61      0.89       0.76
AMBER-TI         0.63      0.94       0.54      0.84       0.74
AMBER            0.59      0.95       0.53      0.84       0.73
AMBER-NL         0.58      0.94       0.45      0.83       0.70
METEOR-1.3       0.71      0.93       0.30      0.85       0.70
ROSE             0.59      0.86       0.41      0.86       0.68
BLEU             0.48      0.85       0.44      0.86       0.66
F15G3            0.48      0.88       0.30      0.84       0.63
F15              0.45      0.87       0.19      0.85       0.59
MP4IBM1          0.56      0.08       0.91      0.61       0.54
TER              0.33      0.77       0.12      0.84       0.52
We compare the experimental results with several classic metrics, including
BLEU, METEOR, and TER, and with some recent ones (e.g. MPF, ROSE, F15, AMBER,
MP4IBM1). The system-level correlation coefficients of these metrics with human
judgments are shown in Table 2, ranked by the mean correlation score over the
four language pairs. Several conclusions can be drawn from the results. First,
many evaluation metrics perform well on certain language pairs but weakly on
others: e.g., WMPF reaches 0.89 correlation with human judgments on the
English-to-French corpus but drops to 0.61 on English-to-German; F15 scores
0.87 on French-to-English but only 0.45 on German-to-English; and ROSE performs
well on both French-to-English and
English-to-French but worse on German-to-English and English-to-German. Second,
recently proposed evaluation metrics (e.g. MPF and AMBER) generally perform
better than the traditional ones (e.g. BLEU and TER), showing the progress of
research in this area.
5 Conclusion and Perspectives
To make the evaluation model both accurate and concise, instead of using the
synonym vocabularies, paraphrasing, and text entailment data commonly used by
other researchers, this paper explores the use of the POS information of the
word sequences. It is worth noting that some researchers have applied the
n-gram method to word alignment (e.g. BLEU uses 1-gram to 4-gram), and others
have used POS information in the similarity calculation by counting the number
of corresponding POS tags; this paper, however, applies the n-gram method to
POS alignment. Since the developed metric relies only on the consilience of the
POS strings of the evaluated sentences, without using the surface words, it has
the potential to be developed further into a reference-independent metric. The
main differences between this paper and our previous work LEPOR (the product of
factors, performed on words) [18] are that this method groups the factors by a
mathematically weighted harmonic mean instead of a simple product, and that it
operates on POS tags instead of words. The overall weighted harmonic mean
allows the model to be tuned neatly for different circumstances, and the POS
information can act as a partial substitute for synonyms in addition to
capturing the syntactic or morphological behaviour of the lexical item.
Even though the designed metric has shown promising performance on the tested
language pairs (EN to DE and FR, and the inverse directions), it has several
weaknesses. First, POS tag sets are not language independent, so the method may
not work well on distant language pairs, e.g. English and Japanese. Second,
parsing accuracy affects the performance of the evaluation. Third, to keep the
evaluation model concise, this work uses only the POS information without
considering the surface words. To address these weaknesses, in future work more
language pairs will be tested, other POS generation tools will be explored, and
a combination of surface words and POS will be employed.
The source code developed for this paper can be freely downloaded for research
purposes; it is open source online. The source code of the hLEPOR measuring
algorithm is available at
https://github.com/aaronlifenghan/aaron-project-hlepor. The final publication
is available at http://www.springerlink.com.
Acknowledgments. The authors wish to thank the anonymous reviewers for many
helpful comments. The authors are grateful to the Science and Technology
Development Fund of Macau and the Research Committee of the University of Macau
for funding this research under reference Nos. 017/2009/A and
RG060/09-10S/CS/FST.
References
1. Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)
2. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the 14th International Conference on Computational Linguistics, pages 433–439, Nantes, France (1992)
3. Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL 2002, pages 311–318, Philadelphia, PA, USA (2002)
4. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Diego, California, USA (2002)
5. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of ACL-WMT, pages 65–72, Prague, Czech Republic (2005)
6. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Boston, USA (2006)
7. Chen, B., Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In: Proceedings of ACL-WMT, pages 71–77, Edinburgh, Scotland, UK (2011)
8. Bicici, E., Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In: Proceedings of ACL-WMT, pages 323–329, Edinburgh, Scotland, UK (2011)
9. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
10. Wong, B. T.-M., Kit, C.: Word choice and word position for automatic MT evaluation. In: Workshop: MetricsMATR of the Association for Machine Translation in the Americas, short paper, 3 pages, Waikiki, Hawai'i, USA (2008)
11. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In: Proceedings of the 2010 Conference on EMNLP, pages 944–952, Cambridge, MA (2010)
12. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M., Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In: Proceedings of the Sixth ACL-WMT, pages 12–21, Edinburgh, Scotland, UK (2011)
13. Song, X., Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In: Proceedings of ACL-WMT, pages 123–129, Edinburgh, Scotland, UK (2011)
14. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of ACL-WMT, pages 104–107, Edinburgh, Scotland, UK (2011)
15. Popovic, M., Vilar, D., Avramidis, E., Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of ACL-WMT, pages 99–103, Edinburgh, Scotland, UK (2011)
16. Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st ACL, pages 433–440, Sydney, Australia (2006)
17. Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pages 22–64, Edinburgh, Scotland, UK (2011)
18. Han, A. L.-F., Wong, D. F., Chao, L. S.: LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, India (2012)
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

TSD 2013: Automatic Machine Translation Evaluation with Part-of-Speech Information

(the false positive), and FN (the false negative) rates, etc. To take the surrounding phrases of a missing token in the translation into account, they employed the gapped word sequence kernel approach [9] to evaluate translations. Other related work includes [10], [11] and [12] on word order, ROSE [13] and MPF and WMPF [14] on the use of POS information, and MP4IBM1 [15], which does not rely on reference translations.

The evaluation methods proposed previously tend to rely either on many linguistic features (which makes them difficult to replicate) or on no linguistic information at all (which leads to low correlation with human judgments). To address this problem, this paper explores the performance of a novel method that uses only the consilience of the POS strings of the hypothesis translation and the reference translation. This ensures that linguistic information is considered in the evaluation while the model remains very concise.

2 Linguistic Features

As discussed above, language variability means there is no single correct translation, and different languages do not always express the same content in the same way. To address this variability, researchers have typically employed synonyms, paraphrasing or text entailment as auxiliary information. All of these approaches have their advantages and weaknesses; e.g., synonym lists can hardly cover all acceptable expressions. Instead, the designed metric uses part-of-speech (POS) information (also applied by ROSE [13], and MPF and WMPF [14]). If a system output sentence is a good translation, it is likely to carry semantic information similar to the reference sentence: the two sentences may not contain exactly the same words, but words with similar semantic meaning. For example, "there is a big bag" and "there is a large bag" can be treated as the same expression, since "big" and "large" have similar meanings (both with POS adjective). Following this idea, we conduct the evaluation on the POS of the words instead of the words themselves, without external resources such as synonym dictionaries, and we test the approach by calculating the correlation of this method with human judgments in the experiments. Assume we have two sentences: one reference and one system output translation. First, we extract the POS of each word. Then, we calculate the similarity of the two sentences through the alignment of their POS information.
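To make the intuition concrete, here is a minimal sketch of the POS-level view, using NLTK's off-the-shelf Penn Treebank tagger purely for illustration (the paper itself obtains POS tags with the Berkeley parser [16]):

```python
# Minimal sketch: words differ, POS sequences match. NLTK's tagger is a
# stand-in for the Berkeley parser used in the paper.
import nltk

# One-time downloads (uncomment on first run; resource names may vary
# slightly across NLTK versions):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

hypothesis = "there is a big bag"
reference = "there is a large bag"

hyp_pos = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(hypothesis))]
ref_pos = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(reference))]

print(hyp_pos)  # e.g. ['EX', 'VBZ', 'DT', 'JJ', 'NN']
print(ref_pos)  # e.g. ['EX', 'VBZ', 'DT', 'JJ', 'NN']
# "big" and "large" differ as surface words but share the JJ (adjective)
# tag, so the two POS sequences match exactly.
```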
3 Calculation Methods

3.1 Design of hLEPOR Metric

First, we introduce the mathematical harmonic mean of $n$ variables $(X_1, X_2, \ldots, X_n)$:

\[ \mathrm{Harmonic}(X_1, X_2, \ldots, X_n) = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}} \tag{1} \]

where $n$ is the number of variables (also called factors). The weighted harmonic mean of multiple variables is then:

\[ \mathrm{Harmonic}(w_{X_1} X_1, w_{X_2} X_2, \ldots, w_{X_n} X_n) = \frac{\sum_{i=1}^{n} w_{X_i}}{\sum_{i=1}^{n} \frac{w_{X_i}}{X_i}} \tag{2} \]

where $w_{X_i}$ is the weight assigned to the corresponding variable $X_i$. Finally, the proposed evaluation metric hLEPOR (tunable Harmonic mean of Length Penalty, Precision, n-gram Position difference Penalty and Recall) is designed as:

\[ \mathrm{hLEPOR} = \mathrm{Harmonic}(w_{LP}\,LP,\; w_{NPosPenal}\,NPosPenal,\; w_{HPR}\,HPR) = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}} \tag{3} \]

where LP, NPosPenal and HPR are the three factors of hLEPOR, introduced below, and $w_{LP}$, $w_{NPosPenal}$ and $w_{HPR}$ are the tunable weights assigned to them, respectively.

3.2 Design of Internal Factors

Length Penalty. In Eq. (3), LP is the length penalty, which penalises system outputs that are either longer or shorter than the reference translation:

\[ LP = \begin{cases} e^{1 - \frac{r}{c}} & c < r \\ 1 & c = r \\ e^{1 - \frac{c}{r}} & c > r \end{cases} \tag{4} \]

where $c$ and $r$ are the sentence lengths of the candidate translation and the reference translation, respectively.

N-gram Position Difference Penalty. In Eq. (3), NPosPenal is defined as:

\[ NPosPenal = e^{-NPD} \tag{5} \]
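To make Eqs. (1)–(5) concrete, here is a minimal Python sketch of the weighted harmonic mean, the length penalty, and the final hLEPOR combination; the function names are ours, and the default weights follow the tuned HPR : LP : NPosPenal ratio of 3 : 2 : 1 reported in Table 1:

```python
import math

def weighted_harmonic_mean(weights, factors):
    """Weighted harmonic mean of Eq. (2): sum(w) / sum(w / x).
    Factors are assumed to be strictly positive."""
    return sum(weights) / sum(w / x for w, x in zip(weights, factors))

def length_penalty(c, r):
    """Length penalty of Eq. (4); c and r are the candidate and
    reference sentence lengths."""
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0

def hlepor(lp, npos_penal, hpr, w_lp=2.0, w_npos=1.0, w_hpr=3.0):
    """hLEPOR of Eq. (3); defaults follow the 3:2:1 (HPR:LP:NPosPenal)
    weights of Table 1."""
    return weighted_harmonic_mean([w_lp, w_npos, w_hpr],
                                  [lp, npos_penal, hpr])
```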
where NPD is the n-gram position difference penalty. The NPosPenal value is designed to compare the POS order between the reference translation and the output translation. NPD is defined as:

\[ NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i| \tag{6} \]

where $Length_{output}$ is the length of the system output sentence and $PD_i$ is the position difference of the $i$-th aligned POS between the output and reference sentences. Every POS from the output translation and the reference should be aligned at most once (one-to-one alignment); when there is no match, $PD_i$ defaults to zero for that output POS.

Calculating the NPD value takes two steps: aligning and calculating. First comes the context-dependent n-gram alignment task: we use the n-gram method with higher priority, meaning that we take the surrounding context (the neighbouring POS tags) of a candidate POS into account when selecting the better matching pair between output and reference. If several candidates match equally well in context, or no candidate has matching context, we fall back to the nearest match as a backup choice. The alignment direction is from the output sentence to the reference. See the example in Figure 1. In the second step (calculation), we label each POS with its position number divided by the corresponding sentence length for normalisation, and then use Eq. (6) to finish the calculation.

Fig. 1. Example of n-gram POS alignment

We also use the example of Figure 1 to illustrate the NPD computation (Figure 2). In the example, when labelling the position numbers of the output sentence, we divide the numerical position (from 1 to 5) of each POS by the sentence length 5; the reference sentence is labelled analogously. Once the NPD value is obtained, NPosPenal is calculated using Eq. (5).

Precision and Recall. Precision reflects the accuracy of the output, while recall reflects its loyalty to the reference. In Eq. (3), HPR is the weighted harmonic mean of precision and recall, i.e. $\mathrm{Harmonic}(\alpha R, \beta P)$, with parameters $\alpha$ and $\beta$ as the tunable weights for recall and precision, respectively:

\[ \mathrm{Harmonic}(\alpha R, \beta P) = \frac{\alpha + \beta}{\frac{\alpha}{R} + \frac{\beta}{P}} \tag{7} \]
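The sketch below deliberately simplifies the alignment: it applies only the nearest-match rule, whereas the paper first prefers candidates whose surrounding n-gram context also matches, so its numbers only approximate the full method. A rough implementation of Eqs. (5)–(6) under that simplification:

```python
import math

def align_nearest(out_pos, ref_pos):
    """One-to-one alignment from output POS tags to reference POS tags.
    Simplified: each output tag takes the nearest unused reference
    position with the same tag (no n-gram context preference)."""
    used = set()
    alignment = {}  # output index -> reference index
    for i, tag in enumerate(out_pos):
        candidates = [j for j, t in enumerate(ref_pos)
                      if t == tag and j not in used]
        if candidates:
            j = min(candidates, key=lambda j: abs(j - i))
            alignment[i] = j
            used.add(j)
    return alignment

def npos_penal(out_pos, ref_pos):
    """NPD of Eq. (6) and NPosPenal = exp(-NPD) of Eq. (5).
    Positions are normalised by sentence length; unmatched output
    tags contribute zero, as in the paper."""
    alignment = align_nearest(out_pos, ref_pos)
    npd = sum(abs((i + 1) / len(out_pos) - (j + 1) / len(ref_pos))
              for i, j in alignment.items()) / len(out_pos)
    return math.exp(-npd)
```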
Fig. 2. Example of NPD calculation

\[ P = \frac{aligned_{num}}{system_{length}} \tag{8} \]

\[ R = \frac{aligned_{num}}{reference_{length}} \tag{9} \]

where $aligned_{num}$ is the number of successfully aligned (matched) POS tags appearing in both translation and reference, and $system_{length}$ and $reference_{length}$ are the sentence lengths of the system output and the reference, respectively.

System-level hLEPOR. Having introduced the calculation of hLEPOR on a single output sentence, we also need a proper way to calculate the score at the document (system) level. The system-level hLEPOR is:

\[ hLEPOR_{sys} = \mathrm{Harmonic}(w_{LP}\,LP_{sys},\; w_{NPosPenal}\,NPosPenal_{sys},\; w_{HPR}\,HPR_{sys}) \tag{10} \]

As the formula shows, to calculate the system-level score $hLEPOR_{sys}$ we first calculate the system-level scores of its factors $LP_{sys}$, $NPosPenal_{sys}$ and $HPR_{sys}$; each is the arithmetic mean of the corresponding sentence-level factor scores.

4 Experiments

We trained hLEPOR and tuned its parameters on the public ACL WMT 2008 data (http://www.statmt.org/wmt08/). The WMT 2008 data covers five languages: English, Spanish, German, French and Czech; however, we could not find suitable parser tools for Spanish and Czech. We therefore tested on English, German and French, using the Berkeley parsers [16] to extract the POS information of the tested sentences. This gives four language pairs in our test corpora: from German and French to English, and the inverse directions. The parameter values, shared across all language pairs, are shown in Table 1.
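Under the same simplifications as before, a sketch of the full system-level computation, reusing the helper functions from the earlier sketches (align_nearest, length_penalty, npos_penal, hlepor) and the (α, β) = (9, 1) setting from Table 1 below:

```python
def hlepor_system(sentence_pairs, alpha=9.0, beta=1.0):
    """System-level hLEPOR, Eq. (10): arithmetic mean of each
    sentence-level factor, then one weighted harmonic combination.
    sentence_pairs is a list of (output_pos, reference_pos) tag lists."""
    lps, npps, hprs = [], [], []
    for out_pos, ref_pos in sentence_pairs:
        aligned_num = len(align_nearest(out_pos, ref_pos))
        if aligned_num:
            precision = aligned_num / len(out_pos)  # Eq. (8)
            recall = aligned_num / len(ref_pos)     # Eq. (9)
            hpr = (alpha + beta) / (alpha / recall + beta / precision)  # Eq. (7)
        else:
            hpr = 0.0  # no POS aligned at all
        lps.append(length_penalty(len(out_pos), len(ref_pos)))
        npps.append(npos_penal(out_pos, ref_pos))
        hprs.append(hpr)
    mean = lambda xs: sum(xs) / len(xs)
    return hlepor(mean(lps), mean(npps), mean(hprs))
```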
Table 1. Values of tuned parameters

  Parameters (α, β)    n-gram POS alignment    Weights (HPR : LP : NPosPenal)
  (9, 1)               2-gram                  3 : 2 : 1

The test corpora are from ACL WMT 2011 (http://www.statmt.org/wmt11/). More than one hundred MT systems submitted their output translations; most of them are statistical, except for five rule-based systems. The gold-standard reference data for these corpora consists of 3003 sentences, and the number of participating MT systems differs per language pair. Automatic MT evaluation metrics are compared by calculating their Spearman rank correlation coefficient with the human judgment results [17].

Table 2. System-level correlation coefficients with human judgments

  Metric        DE-EN   FR-EN   EN-DE   EN-FR   Mean
  hLEPOR        0.83    0.74    0.84    0.82    0.81
  MPF           0.69    0.87    0.63    0.89    0.77
  WMPF          0.66    0.87    0.61    0.89    0.76
  AMBER-TI      0.63    0.94    0.54    0.84    0.74
  AMBER         0.59    0.95    0.53    0.84    0.73
  AMBER-NL      0.58    0.94    0.45    0.83    0.70
  METEOR-1.3    0.71    0.93    0.30    0.85    0.70
  ROSE          0.59    0.86    0.41    0.86    0.68
  BLEU          0.48    0.85    0.44    0.86    0.66
  F15G3         0.48    0.88    0.30    0.84    0.63
  F15           0.45    0.87    0.19    0.85    0.59
  MP4IBM1       0.56    0.08    0.91    0.61    0.54
  TER           0.33    0.77    0.12    0.84    0.52

(DE-EN and FR-EN are the other-to-English directions; EN-DE and EN-FR are the English-to-other directions.)
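For the comparison protocol itself, a minimal sketch: score every participating MT system with a metric and with human judgments, then correlate the two rankings with scipy. The numbers below are invented placeholders, not WMT data:

```python
# Sketch of the evaluation protocol [17]: Spearman rank correlation
# between per-system metric scores and per-system human judgments.
from scipy.stats import spearmanr

metric_scores = [0.31, 0.28, 0.35, 0.22, 0.30]  # one hLEPOR score per MT system (made up)
human_scores = [0.58, 0.49, 0.66, 0.40, 0.52]   # human judgment per system (made up)

rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman rank correlation: {rho:.2f}")
```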
We compare the experimental results with several classic metrics, including BLEU, METEOR and TER, and some more recent ones (e.g. MPF, ROSE, F15, AMBER, MP4IBM1). The system-level correlation coefficients of these metrics with human judgments are shown in Table 2, ranked by the mean correlation score over the four language pairs. Several conclusions can be drawn from the results. First, many evaluation metrics perform well on certain language pairs but weakly on others; e.g., WMPF reaches a 0.89 correlation with human judgments on the English-to-French corpus but drops to 0.61 on English-to-German; F15 scores 0.87 on French-to-English but only 0.45 on German-to-English; and ROSE performs well on both French-to-English and English-to-French but worse on German-to-English and English-to-German. Second, recently proposed evaluation metrics (e.g. MPF and AMBER) generally perform better than the traditional ones (e.g. BLEU and TER), showing the progress of the research field.

5 Conclusion and Perspectives

To make the evaluation model both accurate and concise, instead of using the synonym vocabularies, paraphrasing and text entailment commonly employed by other researchers, this paper explores the use of the POS information of the word sequences. It is worth noting that some researchers have applied the n-gram method to word alignment (e.g. BLEU uses 1-gram to 4-gram), and others have used POS information in the similarity calculation by counting corresponding POS tags; this paper, however, applies the n-gram method to the POS alignment. Since the developed metric relies only on the consilience of the POS strings of the evaluated sentences, without using the surface words, it has the potential to be further developed into a reference-independent metric. The main difference between this paper and our previous work LEPOR (a product of factors, performed on words) [18] is that this method groups the factors via a weighted harmonic mean instead of a simple product, and operates on POS tags instead of words. The overall weighted harmonic mean allows the model to be tuned flexibly for different circumstances, and the POS information can act partly as synonym knowledge in addition to capturing the syntactic or morphological behaviour of the lexical item.

Even though the designed metric has shown promising performance on the tested language pairs (EN to DE and FR, and the inverse directions), it has several weaknesses. Firstly, POS tag sets are not language independent, so the method may not work well on distant language pairs, e.g. English and Japanese. Secondly, parsing accuracy affects the performance of the evaluation. Thirdly, to keep the evaluation model concise, this work uses only the POS information, without considering the surface words. To address these weaknesses, future work will test more language pairs, explore other POS generation tools, and combine surface words with POS information.

The source code developed for this paper can be freely downloaded for research purposes and is open source online: the hLEPOR measuring algorithm is available at https://github.com/aaronlifenghan/aaron-project-hlepor. The final publication is available at http://www.springerlink.com.

Acknowledgments. The authors wish to thank the anonymous reviewers for many helpful comments. The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for funding support, under reference No. 017/2009/A and RG060/09-10S/CS/FST.
References

1. Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)
2. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the 14th International Conference on Computational Linguistics, pages 433–439, Nantes, France (1992)
3. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of ACL 2002, pages 311–318, Philadelphia, PA, USA (2002)
4. Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Diego, CA, USA (2002)
5. Banerjee, S., Lavie, A.: METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In: Proceedings of ACL-WMT, pages 65–72, Prague, Czech Republic (2005)
6. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A Study of Translation Edit Rate with Targeted Human Annotation. In: Proceedings of the Conference of the Association for Machine Translation in the Americas, pages 223–231, Boston, USA (2006)
7. Chen, B., Kuhn, R.: AMBER: A Modified BLEU, Enhanced Ranking Metric. In: Proceedings of ACL-WMT, pages 71–77, Edinburgh, Scotland, UK (2011)
8. Bicici, E., Yuret, D.: RegMT System for Machine Translation, System Combination, and Evaluation. In: Proceedings of ACL-WMT, pages 323–329, Edinburgh, Scotland, UK (2011)
9. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
10. Wong, B.T.-M., Kit, C.: Word Choice and Word Position for Automatic MT Evaluation. In: Workshop: MetricsMATR of the Association for Machine Translation in the Americas, short paper, 3 pages, Waikiki, Hawai'i, USA (2008)
11. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., Tsukada, H.: Automatic Evaluation of Translation Quality for Distant Language Pairs. In: Proceedings of the 2010 Conference on EMNLP, pages 944–952, Cambridge, MA (2010)
12. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M., Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In: Proceedings of the Sixth ACL-WMT, pages 12–21, Edinburgh, Scotland, UK (2011)
13. Song, X., Cohn, T.: Regression and Ranking Based Optimisation for Sentence Level MT Evaluation. In: Proceedings of ACL-WMT, pages 123–129, Edinburgh, Scotland, UK (2011)
14. Popovic, M.: Morphemes and POS Tags for N-gram Based Evaluation Metrics. In: Proceedings of ACL-WMT, pages 104–107, Edinburgh, Scotland, UK (2011)
15. Popovic, M., Vilar, D., Avramidis, E., Burchardt, A.: Evaluation Without References: IBM1 Scores as Evaluation Metrics. In: Proceedings of ACL-WMT, pages 99–103, Edinburgh, Scotland, UK (2011)
16. Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning Accurate, Compact, and Interpretable Tree Annotation. In: Proceedings of the 21st ACL, pages 433–440, Sydney (2006)
17. Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O.F.: Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pages 22–64, Edinburgh, Scotland, UK (2011)
18. Han, A.L.-F., Wong, D.F., Chao, L.S.: LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, India (2012)