1. LEPOR: An Augmented Machine
Translation Evaluation Metric
Master thesis defense, 2014.07
(Aaron) Li-Feng Han (韓利峰), MB-154887
Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao
NLP2CT, University of Macau
2. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
3. MTE - background
• MT began as early as the 1950s (Weaver, 1955)
• Rapid development since the 1990s, for two reasons:
  – Computer technology
  – Enlarged bilingual corpora (Mariño et al., 2006)
• Several promotion events for MT:
  – NIST Open MT Evaluation series (OpenMT)
    • 2001-2009 (Li, 2005)
    • By the National Institute of Standards and Technology, US
    • Corpora: Arabic-English, Chinese-English, etc.
  – International Workshop on Statistical Machine Translation (WMT)
    • From 2006 (Koehn and Monz, 2006; Callison-Burch et al., 2007 to 2012; Bojar et al., 2013)
    • Annually by SIGMT of ACL
    • Corpora: English to French/German/Spanish/Czech/Hungarian/Haitian Creole/Russian & the inverse directions
  – International Workshop on Spoken Language Translation (IWSLT)
    • From 2004 (Eck and Hori, 2005; Paul, 2009; Paul et al., 2010; Federico et al., 2011)
    • English & Asian languages (Chinese, Japanese, Korean)
4. • With the rapid development of MT, how do we evaluate an MT model?
  – Does a newly designed algorithm/feature enhance the existing MT system or not?
  – Which MT system yields the best output for a specified language pair, or generally across languages?
• Difficulties in MT evaluation:
  – Language variability: there is no single correct translation
  – Natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003)
5. Existing MTE methods
• Manual MTE methods:
• Traditional manual judgment
  – Intelligibility: how understandable the sentence is
  – Fidelity: how much information the translated sentence retains as compared to the original
    • By the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)
  – Adequacy (similar to fidelity)
  – Fluency: whether the sentence is well-formed and fluent
  – Comprehension (improved intelligibility)
    • By the Defense Advanced Research Projects Agency (DARPA) of the US (Church et al., 1991; White et al., 1994)
6. • Advanced manual judgment:
  – Task-oriented method (White and Taylor, 1998)
    • In light of the tasks for which the output might be used
  – Further developed criteria
    • Bangalore et al. (2000): simple string accuracy / generation string accuracy / two corresponding tree-based accuracies
    • LDC (Linguistic Data Consortium): 5-point-scale fluency & adequacy
    • Specia et al. (2011): 4-level adequacy: highly adequate / fairly adequate / poorly adequate / completely inadequate
  – Segment ranking (WMT 2011-2013)
    • Judges are asked to provide a complete ranking over all the candidate translations of the same source segment (Callison-Burch et al., 2011, 2012)
    • 5 systems are randomly selected for the judges (Bojar et al., 2013)
7. • Problems in manual MTE
  – Time consuming
    • What if a document contains 3,000 sentences or more?
  – Expensive
    • Professional translators? Or other people?
  – Unrepeatable
    • Precious human labor cannot simply be re-run
  – Sometimes low agreement (Callison-Burch et al., 2011)
    • E.g. in the WMT 2011 English-Czech task, the multi-annotator agreement kappa value is very low
    • Even identical strings produced by two systems are ranked differently each time by the same annotator
8. • How to address these problems?
  – Automatic MT evaluation!
• What do we expect (as compared with manual judgment)?
  – Repeatable
    • Can be re-used whenever we change the MT system and want to check the translation quality
  – Fast
    • Seconds or minutes to evaluate 3,000 sentences, vs. hours of human labor
  – Cheap
    • No need for expensive manual judgments
  – High agreement
    • Each run yields the same scores for unchanged outputs
  – Reliable
    • Gives a higher score to better translation output
    • Measured by correlation with human judgments
9. Automatic MTE methods
• BLEU (Papineni et al., 2002)
  – Proposed by IBM
  – The first automatic MTE method
  – Based on the degree of n-gram overlap between the strings of words produced by the machine and the human translation references
  – Corpus-level evaluation
• $BLEU = BrevityPenalty \times \exp\left(\sum_{n=1}^{N} w_n \log Precision_n\right)$
  – $BrevityPenalty = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$
  – $Precision = \frac{\#correct}{\#output}$, $Precision_n = \frac{\#ngram\_correct}{\#ngram\_output}$
• $c$ is the total length of the candidate translation corpus (the sum of the sentences' lengths)
• $r$ refers to the sum of the effective reference sentence lengths in the corpus
  – If there are multiple references for a candidate sentence, the reference length nearest to the candidate's is selected as the effective one
Papineni et al. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In proc. of ACL.
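To make the computation concrete, here is a minimal Python sketch of corpus-level BLEU as defined above, assuming a single reference per sentence and uniform weights w_n = 1/N; the function names are illustrative, not the official BLEU script.

import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights w_n = 1/N (single reference)."""
    c = sum(len(cand) for cand in candidates)   # total candidate corpus length
    r = sum(len(ref) for ref in references)     # effective reference length
    score = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_ngrams = Counter(ngrams(cand, n))
            ref_ngrams = Counter(ngrams(ref, n))
            # clipped n-gram matches
            matched += sum(min(k, ref_ngrams[g]) for g, k in cand_ngrams.items())
            total += max(len(cand) - n + 1, 0)
        score += math.log(matched / total) / max_n   # assumes matched > 0
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    return bp * math.exp(score)

# Identical sentences give a score of 1.0:
hyp = [["the", "cat", "sat", "on", "the", "mat"]]
print(bleu(hyp, hyp))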
10. • METEOR (Banerjee and Lavie, 2005)
  – Proposed by CMU
  – To address weaknesses in BLEU, e.g. the lack of recall and the lack of explicit word matching
  – Based on the general concept of flexible unigram matching
    • Surface forms, stemmed forms and meanings
• $METEOR = \frac{10PR}{R + 9P} \times (1 - Penalty)$
  – $Penalty = 0.5 \times \left(\frac{\#chunks}{\#unigrams\_matched}\right)^3$
  – $P = \frac{\#correct}{\#output}$, $R = \frac{\#correct}{\#reference}$ (unigram precision and recall)
• Unigram matches between the MT output and the references; words match if they are
  – identical surface forms,
  – simple morphological variants of each other (identical stem), or
  – synonyms of each other
• The metric score is a combination of factors
  – Unigram precision and unigram recall
  – A measure of fragmentation, to capture how well-ordered the matched words in the machine translation are in relation to the reference
    • The penalty increases as the number of chunks increases
Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In proc. of ACL.
Example:
  Reference: He is a clever boy in the class
  Output: he is an clever boy in class
  $\#chunks = 2$, $\#unigrams\_matched = 6$
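A minimal Python sketch of this scoring step, assuming the match statistics (matched unigrams, chunks) have already been computed by the matcher; exact matching only, and the function name is mine:

def meteor_score(matched, candidate_len, reference_len, chunks):
    """METEOR final score from unigram match statistics."""
    if matched == 0:
        return 0.0
    p = matched / candidate_len        # unigram precision
    r = matched / reference_len        # unigram recall
    f_mean = 10 * p * r / (r + 9 * p)  # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matched) ** 3
    return f_mean * (1 - penalty)

# The slide's example: 6 matched unigrams in 2 chunks; the output has
# 7 tokens and the reference 8.
print(meteor_score(matched=6, candidate_len=7, reference_len=8, chunks=2))  # ~0.75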
11. • WER (Su et al., 1992)
• TER / HTER (Snover et al., 2006)
  – By the University of Maryland & BBN Technologies
• HyTER (Dreyer and Marcu, 2012)
• $WER = \frac{substitution + insertion + deletion}{reference\_length}$
• WER is based on the Levenshtein distance
  – The minimum number of editing steps needed to match two sequences
• WER does not take word ordering into account appropriately
  – WER scores an output very poorly when its word order is "wrong" according to the reference.
  – In the Levenshtein distance, mismatches in word order require the deletion and re-insertion of the misplaced words.
  – However, due to the diversity of language expression, some sentences with a so-called "wrong" order according to WER prove to be good translations.
• TER adds a novel editing step: block movement
  – Allows the movement of word sequences from one part of the output to another
  – A block movement is counted as one edit step, with the same cost as other edits
• HyTER develops an annotation tool used to create meaning-equivalent networks of translations for a given sentence
  – Based on the large reference networks
Snover et al. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In proc. of AMTA.
Dreyer, Markus and Daniel Marcu. 2012. HyTER: Meaning-Equivalent Semantics for Translation Evaluation. In proc. of NAACL.
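As a sketch, the word-level Levenshtein distance underlying WER can be computed with standard dynamic programming (illustrative code, not any metric's official implementation):

def wer(output, reference):
    """Word error rate: word-level Levenshtein distance / reference length."""
    m, n = len(output), len(reference)
    # dist[i][j]: edits to turn output[:i] into reference[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                              # deletions
    for j in range(n + 1):
        dist[0][j] = j                              # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if output[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[m][n] / n

print(wer("he is an clever boy in class".split(),
          "he is a clever boy in the class".split()))  # 2 edits / 8 words = 0.25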
12. Weaknesses of MTE methods
• Good performance only on certain language pairs
  – Metrics perform worse on language pairs with English as the source than on those with English as the target
  – E.g. TER (Snover et al., 2006) achieved a 0.83 (Czech-English) vs. 0.50 (English-Czech) correlation score with human judgments on the WMT 2011 shared tasks
• Reliance on many linguistic features for good performance
  – E.g. METEOR relies on stemming, synonyms, etc.
• Employment of incomprehensive factors
  – E.g. BLEU (Papineni et al., 2002) is based only on the n-gram precision score
  – A higher BLEU score is not necessarily indicative of a better translation (Callison-Burch et al., 2006)
13. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
14. Designed factors
• How to solve the problems mentioned?
• Our designed methods:
  – To make a comprehensive judgment: enhanced/augmented factors
  – To deal with the language-bias problem (performing differently across languages): tunable parameters
• Try to contribute to some existing weak points:
  – Evaluation with English as the source language
  – Some low-resource language pairs, e.g. Czech-English
15. Factor-1: enhanced length penalty
• BLEU (Papineni et al., 2002) utilizes only a brevity penalty, for shorter sentences
• Redundant/longer sentences are not penalized properly
• To enhance the length penalty factor, we design a new version of the penalty:

$LP = \begin{cases} \exp(1 - \frac{r}{c}) & \text{if } c < r \\ 1 & \text{if } c = r \\ \exp(1 - \frac{c}{r}) & \text{if } c > r \end{cases}$

• $r$: length of the reference sentence
• $c$: length of the candidate (system output) sentence
  – A penalty score for both longer and shorter sentences as compared with the reference
• Our length penalty is designed at the sentence level, while BLEU's is at the corpus level
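A direct sketch of this factor in Python (the two exponential branches collapse into one symmetric expression; the function name is mine):

import math

def length_penalty(c, r):
    """Enhanced length penalty: penalizes both shorter and longer candidates.
    c: candidate (system output) length, r: reference length."""
    if c == r:
        return 1.0
    return math.exp(1 - max(c, r) / min(c, r))

print(length_penalty(8, 10))   # shorter candidate: exp(1 - 10/8) ~ 0.78
print(length_penalty(12, 10))  # longer candidate:  exp(1 - 12/10) ~ 0.82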
16. Factor-2: n-gram position difference penalty ($NPosPenal$)
• Word order information was introduced in ATEC (Wong and Kit, 2008)
  – However, it utilizes the traditional nearest matching strategy
  – Without giving clearly formulated measuring steps
• We design an n-gram based position difference factor with formulated steps
• Calculating the $NPosPenal$ score takes several steps:
  – N-gram word alignment, where n is the number of considered neighbors
  – Labeling each word with a sequence number
  – Measuring the position difference score of each word ($PD_i$)
  – Measuring the sentence-level position difference score ($NPD$, and $NPosPenal$)
17. • Step 1: N-gram word alignment (single reference)
• N-gram word alignment
  – Fixed alignment direction: from hypothesis (output) to reference
  – Considering word neighbors: higher priority is given to the matching candidate with neighbor information
    • As compared with the traditional nearest matching strategy, which does not consider neighbors
  – If several matching candidates all have neighbors, we select nearest matching as the backup choice
19. • Examples of n-gram word alignment:
• If the nearest matching strategy were used instead, the alignment would be different
20. • Step 2: NPD calculation for each sentence
• Labeling the token units:
  • $MatchN_{output}$: position of the matched token in the output sentence
  • $MatchN_{ref}$: position of the matched token in the reference sentence
• Measuring the scores:
  • Each token: $PD_i = |MatchN_{output} - MatchN_{ref}|$
  • Whole sentence: $NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|$
  • N-gram position difference score: $NPosPenal = \exp(-NPD)$
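Given the word alignment from Step 1, the rest is a few lines of Python. A sketch where the alignment is assumed as input: each matched token is represented by its (output position, reference position) pair, with positions normalized by sentence length as in the worked example on the following slides.

import math

def npos_penal(alignment, output_len):
    """N-gram position difference penalty for one sentence.
    alignment: (MatchN_output, MatchN_ref) pairs of normalized positions."""
    npd = sum(abs(out_pos - ref_pos) for out_pos, ref_pos in alignment) / output_len
    return math.exp(-npd)

# Two aligned words in a 3-token output; positions are token index / length.
print(npos_penal([(1/3, 1/4), (2/3, 3/4)], output_len=3))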
21. • Example of the NPD score with a single reference:
  – N-gram position difference penalty score of each word: $PD_i$
  – Normalized penalty score for the sentence: $NPD$
  – In this example: $NPosPenal = \exp(-NPD) = e^{-\frac{1}{2}}$
22. Multi-reference solution
• Design of the n-gram word alignment for the multi-reference situation
• N-gram alignment for multiple references:
  – The same direction: output to references
  – Higher priority again for the matching candidate with neighbor information
  – Added principle:
    • If the matching candidates from different references all have neighbors, we select the one leading to a smaller NPD value (with nearest matching as the backup choice)
23. • N-gram alignment examples with multiple references:
• For the word "on":
  – Reference one: $PD_i = PD_3 = \left|\frac{3}{6} - \frac{4}{8}\right|$
  – Reference two: $PD_i = PD_3 = \left|\frac{3}{6} - \frac{4}{7}\right|$
  – Since $\left|\frac{3}{6} - \frac{4}{8}\right| < \left|\frac{3}{6} - \frac{4}{7}\right|$, the "on" in reference one is selected, leading to the smaller NPD value
• The other two words, "a" and "bird", are aligned using the same principle
24. Factor-3: weighted harmonic mean of precision and recall
• METEOR (Banerjee and Lavie, 2005) puts a fixed, higher weight on the recall value as compared with precision
  – For different language pairs, the relative importance of precision and recall differs
• To make a generalized factor for a wide spread of language pairs, we design tunable parameters for precision and recall:

$Harmonic(\alpha R, \beta P) = \frac{\alpha + \beta}{\frac{\alpha}{R} + \frac{\beta}{P}}$

$P = \frac{common\_num}{system\_length}$, $R = \frac{common\_num}{reference\_length}$

• $\alpha$ and $\beta$ are two parameters to adjust the weights of $R$ (recall) and $P$ (precision)
• $common\_num$ represents the number of aligned (matching) words and marks appearing in both the automatic translation and the references
• $system\_length$ and $reference\_length$ specify the sentence lengths of the system output and the reference respectively
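A sketch of this factor; with alpha = beta it reduces to the balanced F-measure, and tuning the ratio shifts the weight between recall and precision:

def harmonic(alpha, recall, beta, precision):
    """Weighted harmonic mean of recall and precision (tunable alpha, beta)."""
    return (alpha + beta) / (alpha / recall + beta / precision)

print(harmonic(1, 0.5, 1, 1.0))  # balanced F-measure: 0.666...
print(harmonic(9, 0.5, 1, 1.0))  # recall weighted 9x, as METEOR fixes it: ~0.526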
26. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora and compared existing methods
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
27. Evaluation criteria
• Evaluation criteria:
  – Human judgments are assumed to be the gold standard
  – We measure the correlation between the automatic evaluation scores and the human judgments
• System-level correlation (one commonly used correlation criterion)
  – Spearman rank correlation coefficient (Callison-Burch et al., 2011):

$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

  – $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
  – $d_i = (x_i - y_i)$ is the difference between two corresponding ranked variables
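A sketch of the computation, assuming two tie-free rank lists over the same systems:

def spearman_rho(ranks_x, ranks_y):
    """Spearman rank correlation from two aligned rank lists (no ties)."""
    n = len(ranks_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Metric ranking vs. human ranking of five MT systems:
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 4, 5]))  # 0.9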
28. Corpora and compared methods
• Corpora:
  – Development data, for the tuning of parameters
    – WMT 2008 (http://www.statmt.org/wmt08/)
    – EN: English, ES: Spanish, DE: German, FR: French and CZ: Czech
    – Two directions: EN-other and other-EN
  – Testing data
    – WMT 2011 (http://www.statmt.org/wmt11/)
    – Numbers of participating automatic MT systems in WMT 2011:
      – 10, 22, 15 and 17 respectively for English-to-CZ/DE/ES/FR
      – 8, 20, 15 and 18 respectively for CZ/DE/ES/FR-to-EN
    – The gold standard reference data consist of 3,003 sentences
29. • Comparisons (3 standard metrics BLEU/TER/METEOR & 2 recent metrics):
  – BLEU (Papineni et al., 2002): precision-based metric
  – TER (Snover et al., 2006): edit-distance-based metric
  – METEOR (version 1.3) (Denkowski and Lavie, 2011): precision and recall, using synonyms, stemming and paraphrasing as external resources
  – AMBER (Chen and Kuhn, 2011): a modified version of BLEU, attaching more kinds of penalty coefficients and combining n-gram precision and recall
  – MP4IBM1 (Popovic et al., 2011): based on morphemes, POS (4-grams), lexicon probabilities, etc.
• Ours: the initial version of LEPOR ($LEPOR_A$ & $LEPOR_B$)
  – A simple product of the factor values
  – Without using linguistic features
  – Based on the augmented factors
30. Result analysis
The system-level Spearman correlation with human judgments on the WMT11 corpora:
- LEPOR yielded three top-one correlation scores, on CZ-EN / ES-EN / EN-ES
- LEPOR showed robust performance across languages, resulting in the top mean score
Aaron L.F. Han, Derek F. Wong and Lidia S. Chao. 2012. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. In proc. of COLING.
31. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
32. Variants of LEPOR
• New factor, to consider content information:
  – Design n-gram precision and n-gram recall
  – Harmonic mean of the n-gram sub-factors
  – Also measured at the sentence level, vs. BLEU (corpus level)
• N is the number of words in the matched block

$P_n = \frac{\#ngram\_matched}{\#ngram\_chunks\_in\_system\_output}$

$R_n = \frac{\#ngram\_matched}{\#ngram\_chunks\_in\_reference}$

$HPR = Harmonic(\alpha R_n, \beta P_n) = \frac{\alpha + \beta}{\frac{\alpha}{R_n} + \frac{\beta}{P_n}}$
33. • Example of bigram (n = 2) block matching for bigram precision and bigram recall:
• Similar strategies apply for block matching with n >= 3
  – For the calculation of n-gram precision and recall (see the sketch below)
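One simple realization of the n-gram counting behind $P_n$ and $R_n$, using contiguous n-gram blocks with clipped match counts; illustrative only, and the function name is mine:

from collections import Counter

def ngram_prf(output, reference, n):
    """Clipped n-gram match count, n-gram precision and n-gram recall."""
    out_blocks = Counter(tuple(output[i:i + n]) for i in range(len(output) - n + 1))
    ref_blocks = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(k, ref_blocks[g]) for g, k in out_blocks.items())
    p_n = matched / sum(out_blocks.values())
    r_n = matched / sum(ref_blocks.values())
    return matched, p_n, r_n

print(ngram_prf("he is an clever boy in class".split(),
                "he is a clever boy in the class".split(), n=2))  # (3, 0.5, ~0.43)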
34. Variant-1: $hLEPOR$
• To achieve a higher correlation with human judgments for a focused language pair:
  – Design tunable parameters at the factor level
  – Weighted harmonic mean of the factors:

$hLEPOR = Harmonic(w_{LP} LP, w_{NPosPenal} NPosPenal, w_{HPR} HPR) = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}$

$hLEPOR_A = \frac{1}{SentNum} \sum_{i=1}^{SentNum} hLEPOR_i$ (average of the sentence-level scores)

$hLEPOR_B = Harmonic(w_{LP} LP, w_{NPosPenal} NPosPenal, w_{HPR} HPR)$ (the same combination applied to system-level factor scores)

• In this way, there are more parameters to tune for the focused language pair
  – To capture the characteristics of the focused language pair
  – Especially for distant language pairs
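A sketch of the combination in Python, taking the three factor scores as inputs (the factor computations themselves are sketched on the earlier slides; the default weights are placeholders for the tuned values):

def hlepor(lp, npos_penal, hpr, w_lp=1.0, w_npd=1.0, w_hpr=1.0):
    """Sentence-level hLEPOR: weighted harmonic mean of the three factors."""
    weights = (w_lp, w_npd, w_hpr)
    factors = (lp, npos_penal, hpr)
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

def hlepor_a(sentence_scores):
    """System-level hLEPOR_A: average of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)

print(hlepor(0.9, 0.95, 0.7))  # the weakest factor dominates a harmonic mean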
35. Variant-2: $nLEPOR$
• For languages that require high fluency:
  – We design the n-gram based metric
  – N-gram based product of the factors:

$nLEPOR = LP \times NPosPenal \times \exp\left(\sum_{n=1}^{N} w_n \log HPR_n\right)$

$HPR_n = Harmonic(\alpha R_n, \beta P_n) = \frac{\alpha + \beta}{\frac{\alpha}{R_n} + \frac{\beta}{P_n}}$

$nLEPOR_A = \frac{1}{SentNum} \sum_{i=1}^{SentNum} nLEPOR_i$ (average of the sentence-level scores)

$nLEPOR_B = LP \times NPosPenal \times \exp\left(\sum_{n=1}^{N} w_n \log HPR_n\right)$ (on system-level factor scores)

• In this way, the n-gram information is considered in measuring precision and recall
  – To take more of the content information into account
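A sketch of the n-gram product form; it mirrors BLEU's geometric n-gram combination, but over harmonic precision/recall factors and with the two LEPOR penalties in front (uniform w_n assumed when no tuned weights are given):

import math

def nlepor(lp, npos_penal, hpr_scores, weights=None):
    """Sentence-level nLEPOR. hpr_scores: [HPR_1, ..., HPR_N]."""
    n = len(hpr_scores)
    weights = weights or [1 / n] * n                 # uniform w_n by default
    log_mean = sum(w * math.log(h) for w, h in zip(weights, hpr_scores))
    return lp * npos_penal * math.exp(log_mean)

print(nlepor(0.9, 0.95, [0.8, 0.6]))  # unigram and bigram HPR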
36. Linguistic feature
• Enhancing the metric with a concise linguistic feature
• Example: part-of-speech (POS) utilization
  – POS sometimes serves as synonym information
  – E.g. "say" and "claim" in the example translation
37. • Scores with linguistic features:
• Sentence-level score:

$LEPOR_{final} = \frac{1}{w_{hw} + w_{hp}} (w_{hw} LEPOR_{word} + w_{hp} LEPOR_{POS})$

• $LEPOR_{POS}$ and $LEPOR_{word}$ are measured using the same algorithm on the POS sequence and the word sequence respectively
• System-level score:

$LEPOR_{final} = \frac{1}{w_{hw} + w_{hp}} (w_{hw} LEPOR_{word} + w_{hp} LEPOR_{POS})$ (the same combination applied to the system-level scores)

Aaron L.F. Han, Derek F. Wong, Lidia S. Chao, et al. 2013. Language-independent Model for Machine Translation Evaluation with Reinforced Factors. In proc. of MT Summit.
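Combining the two runs is then a weighted arithmetic mean; a sketch (the weights default to equal importance of the word and POS runs):

def lepor_final(lepor_word, lepor_pos, w_hw=1.0, w_hp=1.0):
    """Final score: weighted mean of the word-level and POS-level LEPOR runs."""
    return (w_hw * lepor_word + w_hp * lepor_pos) / (w_hw + w_hp)

print(lepor_final(0.62, 0.70))  # 0.66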
38. • Experiments with the enhanced metric
• Corpora setting
  – The same corpora as in the previous experiments
  – WMT08 for development and WMT11 for testing
• Variant of the LEPOR model
  – Harmonic mean to combine the main factors
  – More parameters to tune
  – Utilizing a concise linguistic feature (POS) as the external resource

$hLEPOR = \frac{1}{SubSent} \sum_{i=1}^{SubSent} hLEPOR_i$

$hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} (w_{hw} hLEPOR_{word} + w_{hp} hLEPOR_{POS})$
39. • Comparison (metrics) with related works:
  – In addition to the state-of-the-art metrics METEOR / BLEU / TER
  – We compare with ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011)
  – The ROSE and MPF metrics both utilize POS as external information
40. Tuned parameter values of our enhanced method
System-level Spearman correlation with human judgments on the WMT11 corpora
Our enhanced method yielded the highest mean score, 0.83, over the eight language pairs
41. Performance in the WMT shared task
• Performance in the MT evaluation shared task of ACL-WMT 2013
  – The Eighth Workshop on Statistical Machine Translation, held with ACL 2013
• Corpora:
  – English, Spanish, German, French, Czech, and Russian (new)
44. Official results
System-level Pearson (left) / Spearman (right) correlation scores with human judgments
By average score, our methods rank first by Pearson and second by Spearman correlation
45. • From the shared task results:
  – Practical performance: the LEPOR methods are effective, yielding generally higher correlations across language pairs
  – Robustness: the LEPOR methods achieved the highest score on the new language pair, English-Russian
  – Contribution to an existing weak point: MT evaluation with English as the source language
Aaron L.F. Han, Derek F. Wong, Lidia S. Chao, et al. 2013. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task. In proc. of ACL workshop of the 8th WMT.
46. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
47. • The methods and contributions
  – Designed and trained for system-level MTE
  – Using reference translations
• Future work:
  – Tune MT systems using the designed MTE methods
  – Design models for segment-level MTE
  – MTE without using reference translations
  – Investigate more linguistic features (e.g. textual entailment, paraphrasing and synonyms) for MTE with English as the target language
48. Content
• MT evaluation (MTE) introduction
  – MTE background
  – Existing methods
  – Weaknesses analysis
• Designed model
  – Designed factors
  – LEPOR metric
• Experiments and results
  – Evaluation criteria
  – Corpora
  – Results analysis
• Enhanced models
  – Variants of LEPOR
  – Performances in shared task
• Conclusion and future work
• Selected publications
49. Selected publications
• Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu. Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation. The Scientific World Journal, Issue: Recent Advances in Information Technology, pages 1-12, April 2014. Hindawi Publishing Corporation. ISSN: 1537-744X. http://www.hindawi.com/journals/tswj/aip/760301/
• Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. Language-independent Model for Machine Translation Evaluation with Reinforced Factors. Proceedings of the 14th International Conference of Machine Translation Summit (MT Summit), pages 215-222, Nice, France, 2-6 September 2013. International Association for Machine Translation. http://www.mt-archive.info/10/MTS-2013-Han.pdf
• Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Liangye He, Yiming Wang and Jiaji Zhou. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT), pages 414-421, 8-9 August 2013, Sofia, Bulgaria. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2253
• Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Liangye He and Junwen Xing. Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT), pages 365-372, 8-9 August 2013, Sofia, Bulgaria. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2245
• Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li and Ling Zhu. Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. Language Processing and Knowledge in the Web, Lecture Notes in Computer Science, Volume 8105, 2013, pages 119-131. Volume editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_13
• Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Ling Zhu and Shuo Li. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. Language Processing and Knowledge in the Web, Lecture Notes in Computer Science, Volume 8105, 2013, pages 111-118. Volume editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_12
• Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao and Liangye He. Automatic Machine Translation Evaluation with Part-of-Speech Information. Text, Speech, and Dialogue, Lecture Notes in Computer Science, Volume 8082, 2013, pages 121-128. Volume editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40585-3_16
• Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Language Processing and Intelligent Information Systems, Lecture Notes in Computer Science, Volume 7912, 2013, pages 57-68. M.A. Klopotek et al. (Eds.): IIS 2013. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-38634-3_8
• Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. Proceedings of the 24th International Conference on Computational Linguistics (COLING): Posters, pages 441-450, Mumbai, December 2012. Association for Computational Linguistics. http://aclweb.org/anthology//C/C12/C12-2044.pdf
51. Thanks for your attention!
(Aaron) Li-Feng Han (韓利峰), MB-154887
Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao
NLP2CT, University of Macau