Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling whether systems are actually improving. Traditional human judgments are time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have several weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they use either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– they rely on incomplete factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
Experiments on the ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. Since evaluation is closely related to similarity measurement, this work can be extended to other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further collaboration are very welcome.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Young Seok Kim
Review of paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ArXiv link: https://arxiv.org/abs/1810.04805
YouTube Presentation: https://youtu.be/GK4IO3qOnLc
(Slides are written in English, but the presentation is done in Korean)
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors - Lifeng (Aaron) Han
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Presentation of "Challenges in transfer learning in NLP" from Madrid Natural Language Processing Meetup Event, May, 2019.
https://www.meetup.com/es-ES/Madrid-Natural-Language-Processing-meetup/
Practical related work in repository: https://github.com/laraolmos/madrid-nlp-meetup
Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of transfer learning methods and architectures which significantly improved upon the state of the art on nearly every NLP task.
The wide availability and ease of integration of these transfer learning models are strong indicators that these methods will become a common tool in the NLP landscape as well as a major research direction.
In this talk, I'll present a quick overview of modern transfer learning methods in NLP and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks, focusing on open-source solutions.
Website: https://fwdays.com/event/data-science-fwdays-2019/review/transfer-learning-in-nlp
TARGETED ADVERSARIAL EXAMPLES FOR BLACK BOX AUDIO SYSTEMS - Rohan Taori, Amog Kamsetty - GeekPwn Keen
Youtube: https://www.youtube.com/watch?v=ofPPObIXdpI
The application of deep recurrent networks to audio transcription has led to impressive gains in automatic speech recognition (ASR) systems. Many have demonstrated that small adversarial perturbations can fool deep neural networks into incorrectly predicting a specified target with high confidence. Current work on fooling ASR systems has focused on white-box attacks, in which the model architecture and parameters are known. In this paper, we adopt a black-box approach to adversarial generation, combining genetic algorithms and gradient estimation to solve the task. We achieve an 89.25% targeted attack similarity after 3000 generations while maintaining 94.6% audio file similarity.
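The gradient-estimation half of such a black-box attack can be illustrated with a toy zeroth-order optimizer: since the model's internals are hidden, the loss is queried at perturbed inputs and a finite-difference estimate stands in for backpropagation. This is a minimal sketch on a stand-in quadratic loss, not the paper's actual audio pipeline; all names and values here are illustrative.

```python
import random

def estimate_gradient(f, x, eps=1e-3, n_coords=None):
    """Zeroth-order gradient estimate of f at x via central finite
    differences; optionally sample only a few coordinates per step,
    as black-box attacks do to limit the number of model queries."""
    idx = range(len(x)) if n_coords is None else random.sample(range(len(x)), n_coords)
    grad = [0.0] * len(x)
    for i in idx:
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad[i] = (f(xp) - f(xm)) / (2 * eps)  # two loss queries per coordinate
    return grad

# Stand-in for the (unknown) model loss on the target transcription:
# squared distance to a hidden target perturbation.
target = [0.5, -1.0, 2.0]
loss = lambda x: sum((a - b) ** 2 for a, b in zip(x, target))

x = [0.0, 0.0, 0.0]
for _ in range(200):  # signed-gradient descent using only loss queries
    g = estimate_gradient(loss, x)
    x = [xi - 0.05 * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
```

In the paper's setting the genetic algorithm supplies coarse exploration and this kind of estimated gradient refines promising candidates; the sketch shows only the latter step.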
Rohan Taori (@rtaori13 on Twitter) is an undergraduate at UC Berkeley studying EECS, with an interest in machine learning and AI. He heads the educational division of Machine Learning at Berkeley and is also a researcher at BAIR (Berkeley AI Research).
Amog Kamsetty is an undergraduate studying EECS at UC Berkeley, with an interest in both machine learning and systems. He is involved with Machine Learning @ Berkeley and is currently pursuing research at UC Berkeley RISE Lab.
[slide] A Compare-Aggregate Model with Latent Clustering for Answer Selection - Seoul National University
CIKM 2019
In this paper, we propose a novel method for the sentence-level answer-selection task, one of the fundamental problems in natural language processing. First, we explore the effect of additional information by adopting a pretrained language model to compute the vector representation of the input text and by applying transfer learning from a large-scale corpus. Second, we enhance the compare-aggregate model by proposing a novel latent clustering method to compute additional information within the target corpus and by changing the objective function from listwise to pointwise. To evaluate the performance of the proposed approaches, experiments are performed with the WikiQA and TRECQA datasets. The empirical results demonstrate the superiority of our proposed approach, which achieves state-of-the-art performance on both datasets.
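The listwise-to-pointwise change in the objective can be made concrete with a small sketch: a pointwise objective scores each candidate answer independently with binary cross-entropy, while a listwise objective normalizes scores over the whole candidate list. The functions below are illustrative, not the authors' implementation.

```python
import math

def pointwise_loss(scores, labels):
    """Pointwise objective: independent binary cross-entropy per
    candidate answer (label 1 = correct answer for the question)."""
    eps = 1e-12
    sig = [1 / (1 + math.exp(-s)) for s in scores]
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(sig, labels)) / len(scores)

def listwise_loss(scores, labels):
    """Listwise objective: softmax over the whole candidate list,
    cross-entropy against the correct answer."""
    m = max(scores)                      # stabilise the softmax
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    return -sum(math.log(e / z) for e, y in zip(exp, labels) if y)

scores, labels = [2.0, -1.0, 0.5], [1, 0, 0]
pw, lw = pointwise_loss(scores, labels), listwise_loss(scores, labels)
```

Both losses push the correct candidate's score up; the pointwise form additionally penalises every incorrect candidate on an absolute scale, which is the property the paper exploits.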
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and tell whether the translation system makes an improvement or not. Traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes show low agreement. On the other hand, the popular automatic MT evaluation methods have several weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but poorly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes a metric hard to replicate and apply to other language pairs. Thirdly, some popular metrics use incomplete factors, which results in low performance on some practical tasks.
In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of each language. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance when using external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
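The idea of a tunable evaluation model, where factor weights are re-optimized per language pair, can be sketched as a weighted harmonic mean of simple factors (a length penalty and a unigram precision/recall harmonic mean). This is a deliberately simplified illustration in the spirit of the LEPOR family, not the exact published formula; the factor definitions and weights are illustrative.

```python
import math
from collections import Counter

def length_penalty(hyp_len, ref_len):
    # Symmetric brevity/verbosity penalty (the published enhanced
    # length penalty differs in detail; this is a simplification).
    if hyp_len == ref_len:
        return 1.0
    return math.exp(1 - max(hyp_len, ref_len) / min(hyp_len, ref_len))

def precision_recall(hyp, ref):
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return overlap / len(hyp), overlap / len(ref)

def tunable_score(hyp, ref, w_lp=1.0, w_pr=1.0):
    """Weighted harmonic mean of factors; w_lp / w_pr are the tunable
    weights that can be re-optimised for each language pair."""
    lp = length_penalty(len(hyp), len(ref))
    p, r = precision_recall(hyp, ref)
    hpr = 2 * p * r / (p + r) if p + r else 0.0
    factors, weights = [lp, hpr], [w_lp, w_pr]
    if min(factors) == 0:
        return 0.0
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
score = tunable_score(hyp, ref, w_lp=1.0, w_pr=9.0)
```

Raising `w_pr` makes the metric behave more like an adequacy measure, while raising `w_lp` punishes length mismatches more, which is the kind of per-language tuning the thesis describes.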
The NLP muppets revolution! @ Data Science London 2019
video: https://skillsmatter.com/skillscasts/13940-a-deep-dive-into-contextual-word-embeddings-and-understanding-what-nlp-models-learn
event: https://www.meetup.com/Data-Science-London/events/261483332/
In this presentation we discuss several concepts, including word representation using SVD as well as neural network-based techniques. In addition, we cover core concepts such as cosine similarity and atomic versus distributed representations.
While academic research increasingly focuses on integrating deep learning approaches into machine translation, also called Neural Machine Translation, and shows promising and exciting results, the resulting systems still have important pragmatic limitations compared to the current generation of translation engines. We will discuss how SYSTRAN is integrating these new techniques into production systems, the results and benefits for end users, and our vision for the next versions.
TSD2013. AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION - Lifeng (Aaron) Han
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
Becoming a Tech-Savvy Translator and Interpreter in the Digital Age - BrauerTraining.com
I believe that learning technology is equivalent to learning another language. Technology in itself is a whole separate language that we need to learn in order to perform in the digital age.
Let's suppose we are language interpreters working from English to French. If we were to become ASL interpreters in that language combination, we would first have to fully learn American Sign Language. But that is not enough, because we would also need to learn how it differs from French Sign Language. The same goes for technology. We need to learn the skills as if we were learning ASL plus FSL techniques. We need not only to learn about the technology but, more importantly, to PRACTICE with it to acquire the skills needed to WORK with it. That takes time and money, and we need to be ready and available to make that investment. Technology is no longer an option; it is a requirement of the Digital Age, at least in the world of business.
In the past 20 years, the world became interconnected, creating the need to deliver content in multiple languages at all points of contact. Digital technologies caused tectonic changes in the language services industry, impacting translators and interpreters, who now need to revamp their knowledge/abilities to remain relevant in the Digital Age. They need to “upgrade” their skills and become tech savvy.
There is a need for change: mostly a change in understanding and subsequent behavior on the part of translators and interpreters with regard to the future of the industry, and these are the most difficult changes to make.
Translators and interpreters need to start investing time and money to update their skills and so become an integral part of this evolving industry. We have been cut off from the most important conversations about our own future. Many of us are afraid of the new technologies because there is as yet no clear answer to the question "what's in it for me?". We need to become part of the equation going forward. If translators and interpreters do not learn, quickly and swiftly, to use 21st-century technologies, we may not survive as a viable profession.
Becoming a tech-savvy translator and interpreter is the most efficient way to tap into a short-term opportunity to transform current knowledge and experience into useful and valuable skills that may help fuel a new generation of translators and interpreters that respond to the new challenges faced by the Digital Age.
Many translators and interpreters have lost sight of the changes that have occurred in the "means of production" of the goods and services we deliver. In a world of increased competition and shrinking profit margins, translators and interpreters need to understand the investments (in time AND money) they must make in software, training, and processes to catch up with the demand for multilingual content "immediately".
Translators and interpreters need to stop being suspicious of innovations in technology.
Collection evaluation techniques for academic libraries - ALISS
Sally Halper, Lead Content Specialist - Business & Management, British Library. An excellent introduction to some really good practical qualitative and quantitative tools including White's brief tests. A bibliography of further readings is also provided.
Subject: English 18
Translation and Editing Text
Topic: Techniques in Translation
Techniques in Translation
1. Computer assisted
2. Machine translation
3. Subtitling
4. Editing/Post-editing
1. COMPUTER-ASSISTED
Computer-assisted translation is also called 'computer-aided translation' or 'machine-aided human translation'. It is a form of translation wherein a human translator creates a target text with the assistance of a computer program; the machine supports the human translator.
What is Computer Aided Translation?
Computer aided translation (also called computer assisted translation) is a system in which a human translator uses a computer in the translation process.
Humans and computers each have their strengths and weaknesses. The idea of computer aided translation (CAT) software is to make the most of the strengths of people and computers.
Translation performed solely by computers ("machine translation") has very poor quality. Meanwhile, no human can translate as fast as a computer can. By using a CAT tool, however, you can gain some of the speed, consistency, and memory benefits of the computer, without sacrificing the high quality of human translation.
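The translation-memory core of a CAT tool can be sketched in a few lines: each new segment is fuzzily matched against previously translated pairs, and a high-scoring hit is offered to the human translator instead of retranslating from scratch. This is a minimal sketch using Python's difflib; the segments, memory format, and 70% threshold are illustrative.

```python
from difflib import SequenceMatcher

# A tiny translation memory: previously translated segment pairs.
tm = [
    ("The printer is out of paper.", "L'imprimante n'a plus de papier."),
    ("Turn off the printer.",        "Éteignez l'imprimante."),
    ("Close the document.",          "Fermez le document."),
]

def best_match(segment, memory, threshold=0.7):
    """Return (score, source, translation) for the closest TM entry,
    or None when nothing clears the fuzzy-match threshold (commercial
    CAT tools commonly use cut-offs around 70-75%)."""
    scored = [(SequenceMatcher(None, segment.lower(), src.lower()).ratio(), src, tgt)
              for src, tgt in memory]
    score, src, tgt = max(scored)
    return (score, src, tgt) if score >= threshold else None

hit = best_match("Turn off the printer!", tm)
```

A near-identical segment returns a high-scoring hit for the translator to confirm or post-edit, which is where the speed and consistency gains of CAT come from.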
Translation Skills: Theory and practice
The theoretical base should include general information regarding the translator's workshop and the issues one should be familiar with.
*Internet
The role of the internet as a source of information is worth discussing. It is important to use translations which have been on the market for some time and are recognized by other people. This is where the internet becomes very useful, for it allows us to search for given information (google.com, yahoo.com, altavista.com, etc.), use online dictionaries and corpora, or compare different language versions of the same site (e.g. Wikipedia, the free encyclopedia, with its ability to switch between languages defining a given notion - www.wikipedia.org). Google itself is a powerful tool, since it not only searches for information on webpages but also indexes *.doc and *.pdf files stored on servers, allowing us to browse through their contents in search of a context.
*Software
A successful translator needs to know how to handle various computer applications in his/her work. That is why basic software used to compress and decompress files (WinZip, WinRAR) should be mentioned, along with readers for PDF and multimedia files (images, audio). Last, word processors are usually the first application that leads people to use a computer for their work. They offer spell checking, standard layouts, and the ability to set characters in bold, italics, or underline. We can save documents so they can be used again, and we can print them.
It is important to mention CAT tool, how the
Assessment of student learning must be directly connected to the learning objectives of your course. You should make these connections clear to students in your syllabus.
What makes Japanese companies more progressive than others? It actually lies in their employee centered way of management and utmost dedication to Quality.
Machine translation is an easy tool for translating text from one language to another. You've probably used it. But do you know what machine translation really is? Or when you should or shouldn't use it? Navigate through this presentation to learn more!
TSD2013 PPT. AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATION - Lifeng (Aaron) Han
Publisher: Springer-Verlag Berlin Heidelberg, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
LP&IIS2013 PPT. Chinese Named Entity Recognition with Conditional Random Fields - Lifeng (Aaron) Han
LP&IIS 2013 Presentation PPT. Authors: Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao
In Proceeding of International Conference of Language Processing and Intelligent Information Systems. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57–68, 17 - 18 June 2013, Warsaw, Poland. Springer-Verlag Berlin Heidelberg 2013
COLING 2012 - LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors - Lifeng (Aaron) Han
"LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors"
Publisher: Association for Computational Linguistics, December 2012
Authors: Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao
Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, December 2012. Open tool https://github.com/aaronlifenghan/aaron-project-lepor
MT SUMMIT13. Language-independent Model for Machine Translation Evaluation with Reinforced Factors - Lifeng (Aaron) Han
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview - Lifeng (Aaron) Han
Starting from the 1950s, Machine Translation (MT) has been tackled with a range of scientific approaches, from rule-based methods, example-based and statistical models (SMT), to hybrid models, and in very recent years neural models (NMT).
While NMT has achieved a huge quality improvement over conventional methodologies, by taking advantage of the huge amount of parallel corpora available from the internet and recently developed computational power at an acceptable cost, it still struggles to achieve real human parity in many domains and most language pairs, if not all of them.
Along the long road of MT research and development, quality evaluation metrics have played a very important role in MT advancement and evolution.
In this tutorial, we overview traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, and the meta-evaluation of these evaluation methods. Among these, we also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models to customise automatic metrics towards exactly the deployed language pairs and domains. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real-practice simulation.
Natural language processing for requirements engineering: ICSE 2021 Technical Briefing - alessio_ferrari
These are the slides for the technical briefing given at ICSE 2021 by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
Please find the links to the colab notebooks here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
Machine translation from English to Hindi - Rajat Jain
Machine translation is a part of natural language processing. The algorithm suggested is a word-based algorithm. We have done translation from English to Hindi.
submitted by
Garvita Sharma,10103467,B3
Rajat Jain,10103571,B6
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester - Lifeng (Aaron) Han
Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual datasets that are freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) have been proposed very recently to claim supreme performance over smaller-sized PLMs in tasks such as machine translation (MT). These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine whether xLPLMs are absolutely superior to smaller-sized PLMs in fine-tuning toward domain-specific MT. We use two in-domain datasets of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn2022 challenge at WMT2022. We choose the popular Marian Helsinki as the smaller-sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs.
Our experimental investigation shows that: 1) on the smaller-sized in-domain commercial automotive data, the xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using the SACREBLEU and hLEPOR metrics than the smaller-sized Marian, even though its rate of score increase after fine-tuning is lower than Marian's; 2) when fine-tuning on the relatively larger, well-prepared clinical data, the xLPLM NLLB tends to lose its advantage over the smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) under the ClinSpEn-offered metrics METEOR, COMET, and ROUGE-L, and loses to Marian outright on Task-1 (clinical cases) on all official metrics, including SACREBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs; 4) clinic-Marian ranked No. 2 on Task-1 (via SACREBLEU/BLEU) and Task-3 (via METEOR and ROUGE) among all submissions.
Measuring Uncertainty in Translation Quality Evaluation (TQE) - Lifeng (Aaron) Han
From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations which meet customer specifications under harsh constraints on required quality level, tight time-frames, and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment by professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of low reliability and agreement. Is this caused by subjectivity, or is statistics at play? How can we avoid checking the entire text, making TQE more efficient in terms of cost and effort, and what is the optimal sample size of the translated text for reliably estimating the translation quality of the entire material? This work carries out such motivated research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} as a function of the sample size of translated text, e.g. the number of words or sentences, that needs to be processed in the TQE workflow step for a confident and reliable evaluation of overall translation quality.
The methodology we apply in this work draws on Bernoulli Statistical Distribution Modeling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S Gladkoff, I Sorokina, L Han, A Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC2022. arXiv preprint arXiv:2111.07699
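The sample-size arithmetic behind this kind of confidence estimation can be sketched with the standard normal approximation for a Bernoulli proportion (e.g. the fraction of acceptable-quality segments). The paper's interval construction (cf. Brown et al.) may differ, so treat this as an illustrative back-of-the-envelope with hypothetical names.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a Bernoulli
    proportion estimated from n sampled translation units."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

def sample_size(margin, p_hat=0.5, z=1.96):
    """Smallest n whose interval half-width is <= margin;
    p_hat = 0.5 gives the worst (largest) case."""
    return math.ceil(z * z * p_hat * (1 - p_hat) / margin ** 2)

# How many sampled segments to pin overall quality to within +-5%?
n = sample_size(0.05)          # worst-case n = 385
lo, hi = wald_interval(0.9, n) # interval around an observed 90% pass rate
```

The key practical point the abstract makes falls out directly: the required sample depends on the target margin, not on the size of the whole translated material, so checking the entire text is unnecessary.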
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing - Lifeng (Aaron) Han
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues as to inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium translations.
In this work, we introduce \textbf{HOPE}, a task-oriented and \textit{\textbf{h}}uman-centric evaluation framework for machine translation output based \textit{\textbf{o}}n professional \textit{\textbf{p}}ost-\textit{\textbf{e}}diting annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) that reflects error severity for each translation unit.
The initial experimental work, carried out on English-Russian MT outputs on marketing content from a highly technical domain, reveals that our evaluation framework is effective in reflecting MT output quality with regard to both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation.
The approach has several key advantages: the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, lower cost and faster application, as well as higher IRR. Our experimental data is available at \url{https://github.com/lHan87/HOPE}.
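The geometric progression of EPPs can be sketched as follows; the base of 2 and the 1–4 severity scale here are illustrative assumptions, not the framework's published constants:

```python
def segment_epp(error_severities, base: int = 2) -> int:
    """Total error penalty points (EPPs) for one translation unit.
    Penalties grow geometrically with severity: base ** (severity - 1),
    so a severe error outweighs several minor ones."""
    return sum(base ** (s - 1) for s in error_severities)

# Hypothetical segment with one minor (severity 1) and one critical (severity 4) error:
assert segment_epp([1, 4]) == 1 + 8
```

The design choice is that a single critical error should dominate the segment score, which a linear penalty scale cannot guarantee.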
Meta-evaluation of machine translation evaluation methods — Lifeng (Aaron) Han
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
Apply Chinese radicals into neural machine translation: deeper than character level — Lifeng (Aaron) Han
LPRC 2018: Limerick Postgraduate Research Conference
Lifeng Han and Shaohui Kuang. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. ArXiv pre-print https://arxiv.org/abs/1805.01565v1
Chinese Character Decomposition for Neural MT with Multi-Word Expressions — Lifeng (Aaron) Han
ADAPT seminar series. June 2021
Research papers @ NoDaLiDa 2021 (the 23rd Nordic Conference on Computational Linguistics) & the COLING 2020 MWE-LEX workshop.
Bonus takeaway: the AlphaMWE multilingual corpus with MWEs.
Build Moses on an Ubuntu (64-bit) system in VirtualBox, recorded by Aaron (v2, longer) — Lifeng (Aaron) Han
Build Moses Statistical Machine Translation system with Ubuntu
Tree to tree Machine Translation with Universal phrase tagset. https://github.com/aaronlifenghan/A-Universal-Phrase-Tagset
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking — Lifeng (Aaron) Han
ADAPT Centre & Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @ DLSS2017 Bilbao.
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations — Lifeng (Aaron) Han
In this work, we present the construction of multilingual parallel corpora annotated with multiword expressions (MWEs). The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, i.e. terms with a verb as head. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus, followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors: each MT output sentence first received manual post-editing and annotation, followed by a second round of manual quality rechecking. One of our findings during corpus preparation is that the accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing — Lifeng (Aaron) Han
Invited presentation at the NLP lab of Soochow University, about my NLP journey and the ADAPT Centre. The NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, and Parsing.
A deep analysis of Multi-word Expression and Machine Translation — Lifeng (Aaron) Han
A deep analysis of Multi-word Expression and Machine Translation. Faculty research open day. DCU, Dublin. 2019.
Covering MWE identification, MT with radicals, and MTE.
Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling — Lifeng (Aaron) Han
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection) and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric, without reference translations, to evaluate translation quality. In Task 1.2, we utilized the probabilistic Naïve Bayes (NB) model as a classification algorithm, with features borrowed from traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, the conditional random field (CRF), in addition to the NB algorithm. Training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical CRF and NB models. The official results show that our CRF model achieved the highest F-score of 0.8297 in the binary classification of Task 2.
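The reported F-score can be reproduced from binary confusion counts with a minimal computation (a generic sketch, not the official task scorer):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1: harmonic mean of precision and recall over binary labels,
    here the word-level 'good'/'bad' classification of Task 2."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not the shared-task data):
assert abs(f1_score(8, 2, 2) - 0.8) < 1e-9
```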
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
1. Aaron L.-F. Han
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
University of Macau, Macau S.A.R., China
2013.08 @ CUHK, Hong Kong
Email: hanlifengaaron AT gmail DOT com
Homepage: http://www.linkedin.com/in/aaronhan
2. The importance of machine translation (MT) evaluation
Automatic MT evaluation metrics introduction
1. Lexical similarity
2. Linguistic features
3. Metrics combination
Designed metric: LEPOR Series
1. Motivation
2. LEPOR Metrics Description
3. Performances on international ACL-WMT corpora
4. Publications and Open source tools
Other research interests and publications
3. • Eager communication among people of different nationalities
– Promotes translation technology
• Rapid development of machine translation
– Machine translation (MT) began as early as the 1950s (Weaver, 1955)
– Big progress since the 1990s due to the development of computers (storage capacity and computational power) and the enlarged bilingual corpora (Marino et al. 2006)
4. • Some recent works in MT research:
– Och (2003) presents MERT (Minimum Error Rate Training) for log-linear SMT
– Su et al. (2009) use the Thematic Role Templates model to improve translation
– Xiong et al. (2011) employ the maximum-entropy model, etc.
– Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), became the main approaches in the MT literature.
5. • How well do the MT systems perform, and do they make progress?
• Difficulties of MT evaluation:
– Language variability means there is no single correct translation
– Natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003)
6. • Traditional manual evaluation criteria:
– Intelligibility (measuring how understandable the sentence is)
– Fidelity (measuring how much information the translated sentence retains compared to the original), by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)
– Adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)
7. • Problems of manual evaluation:
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch et al., 2011)
9. • Precision-based:
BLEU (Papineni et al., 2002 ACL)
• Recall-based:
ROUGE (Lin, 2004 WAS)
• Precision and recall:
Meteor (Banerjee and Lavie, 2005 ACL)
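The precision family above can be illustrated with BLEU's clipped unigram count (a minimal sketch of the precision component only, not full BLEU with higher-order n-grams and the brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-style modified unigram precision: each candidate word's count
    is clipped by its count in the reference before dividing by the
    candidate length, so repeating a matched word earns no extra credit."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / len(cand)

# "the the the" vs "the cat": only one "the" is credited.
assert abs(clipped_unigram_precision("the the the", "the cat") - 1/3) < 1e-9
```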
10. • Word-order based:
NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)
• Word-alignment based:
AER (Och and Ney, 2003 J.CL)
• Edit-distance based:
WER (Su et al., 1992 Coling), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)
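The edit-distance family can be illustrated with a minimal WER implementation (a generic sketch; published WER/PER/TER tools add normalization, and TER adds block shifts):

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """WER: word-level Levenshtein distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

assert word_error_rate("the cat sat", "the cat sat") == 0.0
```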
11. • Language model:
LM-SVM (Gamon et al., 2005 EAMT)
• Shallow parsing:
GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT)
• Semantic roles:
Named entity, morphological, synonymy, paraphrasing, discourse representation, etc.
12. • MTeRater-Plus (Parton et al., 2011 WMT)
– Combines BLEU, TERp (Snover et al., 2009) and Meteor (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)
• MPF & WMPBleu (Popovic, 2011 WMT)
– Arithmetic mean of F-score and BLEU score
• SIA (Liu and Gildea, 2006 ACL)
– Combines the advantages of n-gram-based metrics and loose-sequence-based metrics
13. • LEPOR: an automatic machine translation evaluation metric considering Length Penalty, Precision, n-gram Position difference Penalty and Recall.
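The way these factors compose can be sketched as below. This is a simplified illustration of the factor structure only: the unigram matching, the first-match alignment for the position penalty, and the default weights are simplifications, not the exact published algorithm.

```python
import math

def lepor_sketch(candidate: str, reference: str,
                 alpha: float = 1.0, beta: float = 1.0) -> float:
    """Illustrative composition of LEPOR-style factors:
    score = LP * NPosPenal * Harmonic(alpha*R, beta*P)."""
    cand, ref = candidate.split(), reference.split()
    c, r = len(cand), len(ref)

    # Length penalty: punish both shorter and longer candidates
    # (unlike BLEU's one-sided brevity penalty)
    lp = 1.0 if c == r else math.exp(1 - r / c) if c < r else math.exp(1 - c / r)

    # Crude unigram matching
    matched = [w for w in cand if w in ref]
    if not matched:
        return 0.0
    precision, recall = len(matched) / c, len(matched) / r

    # n-gram position difference penalty: average normalized position
    # difference of matched words, mapped through exp(-NPD)
    diffs = [abs((i + 1) / c - (ref.index(w) + 1) / r)
             for i, w in enumerate(cand) if w in ref]
    npd_penalty = math.exp(-sum(diffs) / c)

    harmonic = (alpha + beta) / (alpha / recall + beta / precision)
    return lp * npd_penalty * harmonic

# An identical candidate and reference score 1.0:
assert abs(lepor_sketch("the cat sat", "the cat sat") - 1.0) < 1e-9
```

Note how word-order errors reduce the score even when precision and recall are perfect, which is the role of the position difference penalty.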
14. • Weaknesses in existing metrics:
– Perform well on certain language pairs but weakly on others, which we call the language-bias problem;
– Consider no linguistic information (leading the metrics to correlate poorly with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– Use incomprehensive factors (e.g. BLEU focuses on precision only).
– What to do?
15. • To address some of the existing problems:
– Design tunable parameters to address the language-bias problem;
– Use concise or optimized linguistic features for the linguistic extremism problem;
– Design augmented factors.
24. • Example: employment of linguistic features
Fig. 5. Example of n-gram POS alignment
Fig. 6. Example of NPD calculation
25. • Combination with linguistic features:
• hLEPOR_final = (1 / (w_hw + w_hp)) × (w_hw · hLEPOR_word + w_hp · hLEPOR_POS)   (11)
• hLEPOR_POS and hLEPOR_word use the same algorithm on the POS sequence and the word sequence, respectively.
26. • When there are multiple references:
• Select the alignment that results in the minimum NPD score.
Fig. 7. N-gram alignment with multiple references
27. • How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approach.
• Correlation with human judgments:
– System-level correlation
– Segment-level correlation
29. • Segment-level Kendall's tau correlation:
• τ = (num concordant pairs − num discordant pairs) / (total pairs)   (14)
• The segment unit can be a single sentence or a fragment that contains several sentences.
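The tau in (14) can be computed directly over segment pairs (a minimal sketch; ties here count toward the total but toward neither direction):

```python
def kendall_tau(metric_scores, human_scores) -> float:
    """Segment-level Kendall's tau: (concordant - discordant) / total pairs,
    where a pair is concordant when the metric and the human judgments
    rank the two segments in the same order."""
    assert len(metric_scores) == len(human_scores)
    concordant = discordant = total = 0
    n = len(metric_scores)
    for i in range(n):
        for j in range(i + 1, n):
            m = metric_scores[i] - metric_scores[j]
            h = human_scores[i] - human_scores[j]
            if m * h > 0:
                concordant += 1
            elif m * h < 0:
                discordant += 1
            total += 1
    return (concordant - discordant) / total

# Perfect agreement gives tau = 1, full disagreement gives -1:
assert kendall_tau([1, 2, 3], [10, 20, 30]) == 1.0
assert kendall_tau([1, 2, 3], [30, 20, 10]) == -1.0
```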
36. • LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors
– Aaron L.-F. Han, Derek F. Wong and Lidia S. Chao. Proceedings of COLING 2012: Posters, pages 441–450, Mumbai, India.
• Language-independent Model for Machine Translation Evaluation with Reinforced Factors
– Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT Summit 2013. Nice, France.
39. • Language-independent MT evaluation, LEPOR:
https://github.com/aaronlifenghan/aaron-project-lepor
• MT evaluation with linguistic features, hLEPOR:
https://github.com/aaronlifenghan/aaron-project-hlepor
• English-French phrase tagset mapping and application in unsupervised MT evaluation, HPPR:
https://github.com/aaronlifenghan/aaron-project-hppr
• Unsupervised English-Spanish MT evaluation, EBLEU:
https://github.com/aaronlifenghan/aaron-project-ebleu
• Projects homepage: https://github.com/aaronlifenghan
40. • My research interests:
– Natural Language Processing
– Signal Processing
– Machine Learning
– Artificial Intelligence
– Pattern Recognition
• My past research works:
– Machine Translation Evaluation, Word Segmentation, Entity Recognition, Multilingual Treebanks
41. • Other publications:
• A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
– Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Yervant Ho, Yiming Wang, Zhou Jiaji. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013. Sofia, Bulgaria.
– ACL-WMT13 Metrics Task:
Our metrics are language independent
English vs. other (French, Spanish, Czech, German, Russian)
Can perform on both system level and segment level
The official results show our metrics have advantages compared to others.
42. • Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling
– Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Yervant Ho, Anson Xing. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013. Sofia, Bulgaria.
– ACL-WMT13 Quality Estimation Task (no reference translations):
Task 1.1: sentence-level EN-ES quality estimation
Task 1.2: system selection, EN-ES, EN-DE, new
Task 2: word-level QE, EN-ES, binary classification, multi-class classification, new
We design a novel EN-ES POS tagset mapping and the metric EBLEU in Task 1.1.
We explore Naïve Bayes and Support Vector Machines in Task 1.2.
We achieve the highest F1 score in Task 2 using Conditional Random Fields.
43. Designed POS tagset mapping of Spanish (TreeTagger) to the universal tagset (Petrov et al., 2012)
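Such a mapping is essentially a lookup table from fine-grained tags to the 12 universal tags. The fragment below is a hypothetical illustration; the tags shown are examples, not the paper's full mapping:

```python
# Hypothetical fragment of a fine-grained-to-universal POS mapping
# (Spanish TreeTagger tags on the left, universal tags on the right):
SPANISH_TO_UNIVERSAL = {
    "NC": "NOUN",    # common noun
    "ADJ": "ADJ",    # adjective
    "ADV": "ADV",    # adverb
    "ART": "DET",    # article
    "CARD": "NUM",   # cardinal number
    "PREP": "ADP",   # preposition
}

def to_universal(tags):
    """Map a fine-grained tag sequence to the universal tagset,
    falling back to 'X' for tags outside the mapping."""
    return [SPANISH_TO_UNIVERSAL.get(t, "X") for t in tags]

assert to_universal(["ART", "NC", "ADJ"]) == ["DET", "NOUN", "ADJ"]
```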
49. • Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Shuo Li, Lynn Ling Zhu. In GSCL 2013. LNCS Vol. 8105. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
– German Society for Computational Linguistics (oral presentation):
To facilitate future research in unsupervised induction of syntactic structures
We design a French-English phrase tagset mapping
We propose a universal phrase tagset
Phrase tags extracted from the French Treebank and the English Penn Treebank
Explore the employment of the proposed mapping in unsupervised MT evaluation
54. • A Study of Chinese Word Segmentation Based on the Characteristics of Chinese
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Lynn Ling Zhu, Shuo Li. Accepted. In GSCL 2013. LNCS Vol. 8105. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
– German Society for Computational Linguistics (poster paper):
No word boundaries in Chinese expressions
Chinese word segmentation is a difficult problem
Word segmentation is crucial to word alignment in machine translation
We discuss the characteristics of Chinese and design optimized features
We formalize some problems and issues in Chinese word segmentation
56. • Automatic Machine Translation Evaluation with Part-of-Speech Information
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho. In TSD 2013. Plzen, Czech Republic. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg.
– Text, Speech and Dialogue 2013 (oral presentation):
We explore the unsupervised machine translation evaluation method
We design the hLEPOR algorithm for the first time
We explore POS usage in unsupervised MT evaluation
Experiments are performed on English vs. French and German
58. • Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics
– Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. In Proceedings of LP&IIS. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57–68, Warsaw, Poland. Springer-Verlag Berlin Heidelberg.
– Intelligent Information Systems 2013 (oral presentation):
Named entity recognition is important in IR, MT, text analysis, etc.
Chinese named entity recognition is more difficult due to the lack of word boundaries
We compare the performance of different algorithms: NB, CRF, SVM, ME
We analyse the characteristics of personal, location and organization names respectively
We show the performance of different features and select the optimized ones.
60. • Ongoing and further works:
– The combination of translation and evaluation, tuning the translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– The exploration of unsupervised evaluation models, extracting features from source and target languages
61. • Strictly speaking, evaluation work is closely related to similarity measuring. So far I have employed it only in MT evaluation, but these works can be further developed in other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
62. Q and A
Thanks for your attention!
Aaron L.-F. Han, 2013.08
63. • 1. Weaver, Warren.: Translation. In William Locke and A. Donald Booth, editors,
• Machine Translation of Languages: Fourteen Essays. John Wiley and Sons, New
• York, pages 15-23 (1955)
• 2. Marino B. Jose, Rafael E. Banchs, Josep M. Crego, Adria de Gispert, Patrik Lambert,
• Jose A. Fonollosa, Marta R. Costa-jussa: N-gram based machine translation,
• Computational Linguistics, Vol. 32, No. 4. pp. 527-549, MIT Press (2006)
• 3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In
• Proceedings of (ACL-2003). pp. 160-167 (2003)
• 4. Su Hung-Yu and Chung-Hsien Wu: Improving Structural Statistical Machine Translation
• for Sign Language With Small Corpus Using Thematic Role Templates as
• Translation Memory, IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
• PROCESSING, VOL. 17, NO. 7, SEPTEMBER (2009)
• 5. Xiong D., M. Zhang, H. Li: A Maximum-Entropy Segmentation Model for Statistical
• Machine Translation, Audio, Speech, and Language Processing, IEEE Transactions
• on, Volume: 19, Issue: 8, 2011 , pp. 2494- 2505 (2011)
• 6. Carl, M. and A. Way (eds): Recent Advances in Example-Based Machine Translation.
• Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)
64. • 7. Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)
• 8. Arnold, D.: Why translation is difficult for computers. In Computers and Translation: A translator's guide. Benjamins Translation Library (2003)
• 9. Carroll, J. B.: An experiment in evaluating the quality of translation. In Pierce, J. (Chair), Languages and machines: computers in translation and linguistics. A report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pp. 67-75 (1966)
• 10. White, J. S., O'Connell, T. A., and O'Mara, F. E.: The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of AMTA 1994, pp. 193-205 (1994)
• 11. Su, K.-Y., Wu, M.-W. and Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In Proceedings of the 14th International Conference on Computational Linguistics, pp. 433-439, Nantes, France, July (1992)
65. • 12. Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., and Sawaf, H.: Accelerated DP Based Search For Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH 97) (1997)
• 13. Papineni, K., Roukos, S., Ward, T. and Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pp. 311-318, Philadelphia, PA, USA (2002)
• 14. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT 2002), pp. 138-145, San Diego, California, USA (2002)
• 15. Turian, J. P., Shen, L. and Melamed, I. D.: Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pp. 386-393, New Orleans, LA, USA (2003)
• 16. Banerjee, S. and Lavie, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of ACL-WMT, pp. 65-72, Prague, Czech Republic (2005)
66. • 17. Denkowski, M. and Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of ACL-WMT, pp. 85-91, Edinburgh, Scotland, UK (2011)
• 18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J.: A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, pp. 223-231, Boston, USA (2006)
• 19. Chen, B. and Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In Proceedings of ACL-WMT, pp. 71-77, Edinburgh, Scotland, UK (2011)
• 20. Bicici, E. and Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In Proceedings of ACL-WMT, pp. 323-329, Edinburgh, Scotland, UK (2011)
• 21. Shawe-Taylor, J. and Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
• 22. Wong, B. T.-M. and Kit, C.: Word choice and word position for automatic MT evaluation. In Workshop: MetricsMATR of AMTA, short paper, Waikiki, Hawai'i, USA (2008)
67. • 23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In Proceedings of EMNLP 2010, pp. 944-952, Cambridge, MA (2010)
• 24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M. and Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings of the Sixth ACL-WMT, pp. 12-21, Edinburgh, Scotland, UK (2011)
• 25. Song, X. and Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In Proceedings of ACL-WMT, pp. 123-129, Edinburgh, Scotland, UK (2011)
• 26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In Proceedings of ACL-WMT, pp. 104-107, Edinburgh, Scotland, UK (2011)
• 27. Popovic, M., Vilar, D., Avramidis, E. and Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In Proceedings of ACL-WMT, pp. 99-103, Edinburgh, Scotland, UK (2011)
• 28. Petrov, S., Barrett, L., Thibaux, R., and Klein, D.: Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st ACL, pp. 433-440, Sydney, July (2006)
68. • 29. Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pp. 22-64, Edinburgh, Scotland, UK (2011)
• 30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M. and Zaidan, O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of ACL-WMT, pp. 17-53, PA, USA (2010)
• 31. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pp. 1-28, Athens, Greece (2009)
• 32. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Further meta-evaluation of machine translation. In Proceedings of ACL-WMT, pp. 70-106, Columbus, Ohio, USA (2008)
• 33. Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL-WMT), pp. 65-70, Edinburgh, Scotland, UK (2011)