LEPOR: An Augmented Machine
Translation Evaluation Metric
Master thesis defense, 2014.07
(Aaron) Li-Feng Han (้Ÿ“ๅˆฉๅณฐ), MB-154887
Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao
NLP2CT, University of Macau
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
2
MTE - background
• MT research began as early as the 1950s (Weaver, 1955)
• Rapid development since the 1990s, for two reasons:
– advances in computer technology
– enlarged bilingual corpora (Mariño et al., 2006)
โ€ข Several promotion events for MT:
โ€“ NIST Open MT Evaluation series (OpenMT)
โ€ข 2001-2009 (Li, 2005)
โ€ข By National Institute of Standards and Technology, US
โ€ข Corpora: Arabic-English, Chinese-English, etc.
โ€“ International Workshop on Statistical Machine Translation (WMT)
โ€ข from 2006 (Koehn and Monz, 2006; Callison-Burch et al., 2007 to 2012; Bojar et al., 2013)
โ€ข Annually by SIGMT of ACL
โ€ข Corpora: English to French/German/Spanish/Czech/Hungarian/Haitian Creole/Russian &
inverse direction
– International Workshop on Spoken Language Translation (IWSLT)
โ€ข from 2004 (Eck and Hori, 2005; Paul, 2009; Paul, et al., 2010; Federico et al., 2011)
โ€ข English & Asian language (Chinese, Japanese, Korean)
3
• With the rapid development of MT, how do we
evaluate MT systems?
– Does a newly designed algorithm/feature
enhance the existing MT system, or not?
– Which MT system yields the best output for a specified
language pair, or generally across languages?
โ€ข Difficulties in MT evaluation:
โ€“ language variability results in no single correct
translation
โ€“ natural languages are highly ambiguous and different
languages do not always express the same content in
the same way (Arnold, 2003)
4
Existing MTE methods
โ€ข Manual MTE methods:
โ€ข Traditional Manual judgment
โ€“ Intelligibility: how understandable the sentence is
โ€“ Fidelity: how much information the translated sentence
retains as compared to the original
โ€ข by the Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
– Adequacy (similar to fidelity)
โ€“ Fluency: whether the sentence is well-formed and fluent
โ€“ Comprehension (improved intelligibility)
โ€ข by Defense Advanced Research Projects Agency (DARPA) of US
(Church et al., 1991; White et al., 1994)
5
โ€ข Advanced manual judgment:
โ€“ Task oriented method (White and Taylor, 1998)
• In light of the tasks for which the output might be used
โ€“ Further developed criteria
โ€ข Bangalore et al. (2000): simple string accuracy/ generation string
accuracy/ two corresponding tree-based accuracies.
• LDC (Linguistic Data Consortium): 5-point fluency & adequacy scales
• Specia et al. (2011): a 4-level adequacy scale: highly adequate / fairly
adequate / poorly adequate / completely inadequate
โ€“ Segment ranking (WMT 2011~2013)
โ€ข Judges are asked to provide a complete ranking over all the candidate
translations of the same source segment (Callison-Burch et al., 2011,
2012)
โ€ข 5 systems are randomly selected for the judges (Bojar et al., 2013)
6
โ€ข Problems in Manual MTE
โ€“ Time consuming
• What if a document contains 3,000 sentences or more?
โ€“ Expensive
โ€ข Professional translators? or other people?
โ€“ Unrepeatable
• Precious human labor cannot simply be re-run
โ€“ Low agreement, sometimes (Callison-Burch et al., 2011)
โ€ข E.g. in WMT 2011 English-Czech task, multi-annotator agreement
kappa value is very low
โ€ข Even the same strings produced by two systems are ranked
differently each time by the same annotator
7
โ€ข How to address the problems?
โ€“ Automatic MT evaluation!
โ€ข What do we expect? (as compared with manual judgments)
โ€“ Repeatable
• Can be re-run whenever we change the MT system and want to
check the translation quality
– Fast
• several minutes or seconds for evaluating 3,000 sentences
• vs. hours of human labor
โ€“ Cheap
โ€ข We do not need expensive manual judgments
โ€“ High agreement
• Each run yields the same scores for unchanged outputs
โ€“ Reliable
โ€ข Give a higher score for better translation output
โ€ข Measured by correlation with human judgments
8
Automatic MTE methods
โ€ข BLEU (Papineni et al., 2002)
โ€“ Proposed by IBM
– The first widely adopted automatic MTE method
โ€“ based on the degree of n-gram overlapping between the strings of words produced by the machine and the
human translation references
โ€“ corpus level evaluation
โ€ข ๐ต๐ฟ๐ธ๐‘ˆ = ๐ต๐‘Ÿ๐‘’๐‘ฃ๐‘–๐‘ก๐‘ฆ ๐‘๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ ร— ๐‘’๐‘ฅ๐‘ ๐œ† ๐‘› ๐‘™๐‘œ๐‘”๐‘ƒ๐‘Ÿ๐‘’๐‘๐‘–๐‘ ๐‘–๐‘œ๐‘› ๐‘›
๐‘
๐‘›=1
โ€“ ๐ต๐‘Ÿ๐‘’๐‘ฃ๐‘–๐‘ก๐‘ฆ ๐‘๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ =
1 ๐‘–๐‘“ ๐‘ > ๐‘Ÿ
๐‘’(1โˆ’
๐‘Ÿ
๐‘
)
๐‘–๐‘“ ๐‘ โ‰ค ๐‘Ÿ
โ€“ ๐‘ƒ๐‘Ÿ๐‘’๐‘๐‘–๐‘ ๐‘–๐‘œ๐‘› =
#๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก
#๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก
, ๐‘ƒ๐‘Ÿ๐‘’๐‘๐‘–๐‘ ๐‘–๐‘œ๐‘› ๐‘› =
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก
โ€ข ๐‘ is the total length of candidate translation corpus (the sum of sentencesโ€™ length)
โ€ข ๐‘Ÿ refers to the sum of effective reference sentence length in the corpus
โ€“ if there are multi-references for each candidate sentence, then the nearest length as compared to the
candidate sentence is selected as the effective one
Papineni et al. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proc. of ACL.
9
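The formulas above can be sketched in code as follows. This is an illustrative implementation, not the official one: it assumes tokenized sentences, a single reference per sentence, and uniform weights $\lambda_n = 1/N$.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights and one reference per sentence."""
    c = sum(len(cand) for cand in candidates)   # total candidate length
    r = sum(len(ref) for ref in references)     # total reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # clipped counts: credit each n-gram at most as often as in the reference
            matched += sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
            total += max(len(cand) - n + 1, 0)
        if matched == 0:
            return 0.0
        log_prec_sum += (1 / max_n) * math.log(matched / total)
    return bp * math.exp(log_prec_sum)
```

Note the clipped counts: a candidate n-gram is only credited up to the number of times it appears in the reference.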
โ€ข METEOR (Banerjee and Lavie, 2005)
โ€“ Proposed by CMU
– To address weaknesses in BLEU, e.g. the lack of recall and of explicit word matching
– Based on the general concept of flexible unigram matching
– on surface forms, stemmed forms and meanings
โ€ข ๐‘€๐ธ๐‘‡๐ธ๐‘‚๐‘… =
10๐‘ƒ๐‘…
๐‘…+9๐‘ƒ
ร— (1 โˆ’ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ)
โ€“ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ = 0.5 ร—
#๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘ 
#๐‘ข๐‘›๐‘–๐‘”๐‘Ÿ๐‘Ž๐‘š๐‘ _๐‘š๐‘Ž๐‘กโ„Ž๐‘’๐‘‘
3
โ€“ ๐‘…๐‘’๐‘๐‘Ž๐‘™๐‘™ =
#๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก
#๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’
โ€ข Unigram strings matches between MT and references, the match of words
โ€“ simple morphological variants of each other
โ€“ by the identical stem
โ€“ synonyms of each other
• The metric score is a combination of factors
– unigram precision and unigram recall
– a measure of fragmentation, capturing how well-ordered the matched words in the machine
translation are in relation to the reference
– the penalty increases as the number of chunks increases
Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments. In proc. of ACL.
10
Reference: He is a clever boy in the class
Output: he is an clever boy in class
#๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘ =2
#๐‘ข๐‘›๐‘–๐‘”๐‘Ÿ๐‘Ž๐‘š๐‘ _๐‘š๐‘Ž๐‘กโ„Ž๐‘’๐‘‘=6
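A minimal sketch of the score computation above; the flexible unigram matching (surface/stem/synonym) is assumed to have been done already, and its statistics are passed in. Function and argument names are illustrative.

```python
def meteor_score(matches, hyp_len, ref_len, chunks):
    """METEOR from unigram match statistics (Banerjee and Lavie, 2005)."""
    if matches == 0:
        return 0.0
    p = matches / hyp_len             # unigram precision
    r = matches / ref_len             # unigram recall
    fmean = 10 * p * r / (r + 9 * p)  # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3
    return fmean * (1 - penalty)
```

With the example statistics above (6 matched unigrams in 2 chunks, output length 7, reference length 8) this gives roughly 0.745.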
โ€ข WER (Su et al., 1992)
โ€ข TER/HTER (Snover et al., 2006)
โ€“ by University of Maryland & BBN Technologies
โ€ข HyTER (Dreyer and Marcu, 2012)
โ€ข ๐‘Š๐ธ๐‘… =
๐‘ ๐‘ข๐‘๐‘ ๐‘ก๐‘–๐‘ก๐‘ข๐‘ก๐‘–๐‘œ๐‘›+๐‘–๐‘›๐‘ ๐‘’๐‘Ÿ๐‘ก๐‘–๐‘œ๐‘›+๐‘‘๐‘’๐‘™๐‘’๐‘ก๐‘–๐‘œ๐‘›
๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’ ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž
• WER is based on the Levenshtein distance
– the minimum number of editing steps needed to turn one sequence into the other
• WER does not take word ordering into account appropriately
– WER scores a system output very poorly when its word order is "wrong" according to the reference.
– In the Levenshtein distance, mismatches in word order require the deletion and re-insertion of the misplaced words.
– However, due to the diversity of language expression, some of these so-called "wrong"-order sentences also prove to be good
translations.
• TER adds a novel editing step: block movement
– allows the movement of word sequences from one part of the output to another
– a block movement is counted as one edit step, with the same cost as other edits
โ€ข HyTER develops an annotation tool that is used to create meaning-equivalent networks of translations for a given
sentence
โ€“ Based on the large reference networks
Snover et al. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In proc. of AMTA.
Dreyer, Markus and Daniel Marcu. 2012. HyTER: Meaning-Equivalent Semantics for Translation Evaluation. In proc. of NAACL.
11
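The Levenshtein-based definition can be sketched as below (an illustrative implementation operating on token lists, not the tooling used in the cited papers):

```python
def wer(hyp, ref):
    """Word error rate: Levenshtein distance over tokens, divided by reference length."""
    d = list(range(len(ref) + 1))  # one row of the edit-distance table
    for i in range(1, len(hyp) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(ref) + 1):
            cur = min(d[j] + 1,      # deletion
                      d[j - 1] + 1,  # insertion
                      prev + (hyp[i - 1] != ref[j - 1]))  # substitution or match
            prev, d[j] = d[j], cur
    return d[len(ref)] / len(ref)
```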
Weaknesses of MTE methods
• Good performance only on certain language pairs
– Metrics perform worse on language pairs with English as the source than
with English as the target
– E.g. TER (Snover et al., 2006) achieved a 0.83 (Czech-English) vs. 0.50
(English-Czech) correlation with human judgments on the WMT-
2011 shared tasks
• Reliance on many linguistic features for good performance
– E.g. METEOR relies on stemming, synonyms, etc.
• Use of incomprehensive factors
– E.g. BLEU (Papineni et al., 2002) is based on the n-gram precision score alone
– a higher BLEU score is not necessarily indicative of better translation
(Callison-Burch et al., 2006)
12
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
13
Designed factors
• How do we address the mentioned problems?
• Our designed methods:
– to make comprehensive judgments:
enhanced/augmented factors
– to deal with the language-bias problem (performing differently
across languages): tunable parameters
• We try to contribute on some existing weak
points:
– evaluation with English as the source language
– some low-resource language pairs, e.g. Czech-English
14
Factor-1: enhanced length penalty
• BLEU (Papineni et al., 2002) only utilizes a brevity penalty for shorter sentences
• redundant/longer sentences are not penalized properly
• to enhance the length penalty factor, we design a new version of the penalty
โ€ข ๐ฟ๐‘ƒ =
exp 1 โˆ’
๐‘Ÿ
๐‘
: ๐‘ < ๐‘Ÿ
1 โˆถ ๐‘ = ๐‘Ÿ
exp 1 โˆ’
๐‘
๐‘Ÿ
: ๐‘ > ๐‘Ÿ
โ€ข ๐‘Ÿ: length of reference sentence
โ€ข ๐‘: length of candidate (system-output) sentence
โ€“ A penalty score for both longer and shorter sentences as compared with
reference one
• Our length penalty is designed at sentence level first, while BLEU's is
corpus level
15
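The three-case penalty above transcribes directly to code (a sketch; names are illustrative):

```python
import math

def length_penalty(c, r):
    """LEPOR length penalty: penalizes both shorter and longer candidates.
    c = candidate (system output) length, r = reference length."""
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0
```

Swapping the candidate and reference lengths yields the same penalty, so longer and shorter outputs are treated symmetrically, unlike BLEU's brevity-only penalty.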
Factor-2: n-gram position difference
penalty (๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™)
• Word order information was introduced in ATEC (Wong and Kit, 2008)
– However, it utilizes the traditional nearest-matching strategy
– without giving clearly formalized measuring steps
• We design an n-gram based position difference factor with
formalized steps
โ€ข To calculate ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ score: several steps
โ€“ N-gram word alignment, n is the number of considered neighbors
โ€“ labeling each word a sequence number
– Measure the position difference score of each word ($PD_i$)
– Measure the sentence-level position difference score ($NPD$, and
$nPosPenal$)
16
โ€ข Step 1: N-gram word alignment (single reference)
โ€ข N-gram word alignment
โ€“ Alignment direction fixed: from hypothesis (output) to
reference
– Considering word neighbors: higher priority is given
to a candidate match with matching neighbor
information
• as compared with the traditional nearest-matching strategy,
which does not consider the neighbors
– If multiple candidates all have matching neighbors, we
select the nearest match as the backup choice
17
Fig. N-gram word alignment algorithm
18
โ€ข Examples of n-gram word alignment:
โ€ข If using the nearest matching strategy, the
alignment will be different
19
โ€ข Step 2: NPD calculation for each sentence
โ€ข Labeling the token units:
โ€ข ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก: position of matched token in output sentence
โ€ข ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘Ÿ๐‘’๐‘“: position of matched token in reference sentence
โ€ข Measure the scores:
โ€ข Each token: ๐‘ƒ๐ท๐‘– = |๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก โˆ’ ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘Ÿ๐‘’๐‘“|
โ€ข Whole sentence: ๐‘๐‘ƒ๐ท =
1
๐ฟ๐‘’๐‘›๐‘”๐‘กโ„Ž ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก
|๐‘ƒ๐ท๐‘–|
๐ฟ๐‘’๐‘›๐‘”๐‘กโ„Ž ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก
๐‘–=1
โ€ข N-gram Position difference score: ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ = exp โˆ’๐‘๐‘ƒ๐ท
20
โ€ข Examples of NPD score with single reference:
– N-gram position difference penalty score of each word: $PD_i$
– Normalize the penalty score for each sentence: $NPD$
– This example: $nPosPenal = \exp(-NPD) = e^{-\frac{1}{2}}$
21
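Step 2 can be sketched as follows for the fully aligned case (every output token has a match; handling of unmatched tokens is omitted). Positions are 1-based and normalized by sentence length, as in the worked examples; names are illustrative.

```python
import math

def n_pos_penal(out_positions, ref_positions, out_len, ref_len):
    """nPosPenal = exp(-NPD) from aligned token positions (1-based)."""
    pds = [abs(o / out_len - r / ref_len)    # PD_i per matched token
           for o, r in zip(out_positions, ref_positions)]
    npd = sum(pds) / out_len                 # sentence-level NPD
    return math.exp(-npd)
```

For example, two swapped words in two-word sentences give NPD = 1/2, hence nPosPenal = e^(-1/2).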
multi-reference solution
โ€ข Design the n-gram word alignment for multi-
reference situation
โ€ข N-gram alignment for multi-reference:
โ€“ The same direction, output to references
โ€“ Higher priority also for candidate with neighbor
information
– Adding principle:
• If the matching candidates from different references all have
neighbors, we select the one leading to a smaller NPD value
(with nearest matching as the backup choice)
22
โ€ข N-gram alignment examples of multi-references:
โ€ข For the word โ€œonโ€:
– Reference one: $PD_i = PD_3 = \left|\frac{3}{6}-\frac{4}{8}\right|$
– Reference two: $PD_i = PD_3 = \left|\frac{3}{6}-\frac{4}{7}\right|$
– Since $\left|\frac{3}{6}-\frac{4}{8}\right| < \left|\frac{3}{6}-\frac{4}{7}\right|$, the "on" in reference-1 is selected, as it leads to a smaller NPD value
• The other two words, "a" and "bird", are aligned using the same principle
23
Factor-3: weighted Harmonic mean of
precision and recall
• METEOR (Banerjee and Lavie, 2005) puts a fixed, higher weight on recall as
compared with precision
– For different language pairs, the importance of precision and recall differs
• To make a generalized factor for a wide range of language pairs,
we design tunable parameters for precision and recall
โ€ข ๐ป๐‘Ž๐‘Ÿ๐‘š๐‘œ๐‘›๐‘–๐‘ ๐›ผ๐‘…, ๐›ฝ๐‘ƒ = (๐›ผ + ๐›ฝ)/(
๐›ผ
๐‘…
+
๐›ฝ
๐‘ƒ
)
โ€ข ๐‘ƒ =
๐‘๐‘œ๐‘š๐‘š๐‘œ๐‘›_๐‘›๐‘ข๐‘š
๐‘ ๐‘ฆ๐‘ ๐‘ก๐‘’๐‘š_๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž
โ€ข ๐‘… =
๐‘๐‘œ๐‘š๐‘š๐‘œ๐‘›_๐‘›๐‘ข๐‘š
๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’_๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž
โ€ข ๐›ผ and ๐›ฝ are two parameters to adjust the weight of ๐‘… (recall) and ๐‘ƒ (precision)
โ€ข ๐‘๐‘œ๐‘š๐‘š๐‘œ๐‘›_๐‘›๐‘ข๐‘š represents the number of aligned (matching) words and marks appearing both in
automatic translations and references
โ€ข ๐‘ ๐‘ฆ๐‘ ๐‘ก๐‘’๐‘š_๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž and ๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’_๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž specify the sentence length of system output and
reference respectively
24
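The tunable harmonic mean above can be sketched as (illustrative):

```python
def weighted_harmonic(p, r, alpha, beta):
    """Harmonic(alpha*R, beta*P): tunable trade-off between recall and precision."""
    if p == 0 or r == 0:
        return 0.0
    return (alpha + beta) / (alpha / r + beta / p)
```

With alpha = beta it reduces to the ordinary F1 harmonic mean, and with alpha = 9, beta = 1 it reproduces METEOR's fixed recall-weighted mean 10PR/(R + 9P).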
LEPOR metric
โ€ข LEPOR: automatic machine translation evaluation metric
considering the enhanced Length Penalty, Precision, n-gram
Position difference Penalty and Recall.
โ€ข Initial version: The product value of the factors
โ€ข Sentence-level score:
โ€“ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… = ๐ฟ๐‘ƒ ร— ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ ร— ๐ป๐‘Ž๐‘Ÿ๐‘š๐‘œ๐‘›๐‘–๐‘(๐›ผ๐‘…, ๐›ฝ๐‘ƒ)
โ€ข System-level score:
โ€“ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ด =
1
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘–๐‘กโ„Ž๐‘†๐‘’๐‘›๐‘ก
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐‘–=1
โ€“ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ต = ๐น๐‘Ž๐‘๐‘ก๐‘œ๐‘Ÿ๐‘–
๐‘›
๐‘–=1
โ€“ ๐น๐‘Ž๐‘๐‘ก๐‘œ๐‘Ÿ๐‘–=
1
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐น๐‘Ž๐‘๐‘ก๐‘œ๐‘Ÿ๐‘–๐‘กโ„Ž๐‘†๐‘’๐‘›๐‘ก
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐‘–=1
25
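Putting the three factors together, a sketch of the sentence-level product and the LEPOR_A system score (LEPOR_B would instead average each factor over sentences before multiplying); function names are illustrative:

```python
def sentence_lepor(lp, npos_penal, p, r, alpha=1.0, beta=1.0):
    """Sentence-level LEPOR: product of the length penalty, the position
    penalty and the weighted harmonic mean of recall and precision."""
    if p == 0 or r == 0:
        return 0.0
    harmonic = (alpha + beta) / (alpha / r + beta / p)
    return lp * npos_penal * harmonic

def system_lepor_a(sentence_scores):
    """System-level LEPOR_A: arithmetic mean of the sentence scores."""
    return sum(sentence_scores) / len(sentence_scores)
```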
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora and compared existing methods
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
26
Evaluation criteria
โ€ข Evaluation criteria:
โ€“ Human judgments are assumed as the golden ones
โ€“ Measure the correlation score between automatic evaluation and
human judgments
โ€ข System-level correlation (one commonly used correlation criterion)
โ€“ Spearman rank correlation coefficient (Callison-Burch et al., 2011):
โ€“ ๐œŒ ๐‘‹๐‘Œ = 1 โˆ’
6 ๐‘‘ ๐‘–
2๐‘›
๐‘–=1
๐‘›(๐‘›2โˆ’1)
โ€“ ๐‘‹ = ๐‘ฅ1, โ€ฆ , ๐‘ฅ ๐‘› , ๐‘Œ = {๐‘ฆ1, โ€ฆ , ๐‘ฆ๐‘›}
โ€“ ๐‘‘๐‘– = (๐‘ฅ๐‘– โˆ’ ๐‘ฆ๐‘–) is the difference value of two corresponding ranked
variants
27
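The criterion can be sketched as follows, taking the two rank lists as input (no tie correction; illustrative):

```python
def spearman_rho(x_ranks, y_ranks):
    """Spearman rank correlation coefficient from two rank lists (no ties)."""
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))
```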
Corpora and compared methods
โ€ข Corpora:
โ€“ Development data for tuning of parameters
โ€“ WMT2008 (http://www.statmt.org/wmt08/)
โ€“ EN: English, ES: Spanish, DE: German, FR: French and CZ: Czech
โ€“ Two directions: EN-other and other-EN
โ€“ Testing data
โ€“ WMT2011 (http://www.statmt.org/wmt11/)
– The numbers of participating automatic MT systems in WMT 2011:
โ€“ 10, 22, 15 and 17 respectively for English-to-CZ/DE/ES/FR
โ€“ 8, 20, 15 and 18 respectively for CZ/DE/ES/FR-to-EN
โ€“ The gold standard reference data consists of 3,003 sentences
28
• Comparisons (three standard metrics, BLEU/TER/METEOR, and two recent metrics):
โ€“ BLEU (Papineni et al., 2002), precision based metric
โ€“ TER (Snover et al., 2006), edit distance based metric
โ€“ METEOR (version 1.3) (Denkowski and Lavie, 2011), precision and recall, using
synonym, stemming, and paraphrasing as external resources
– AMBER (Chen and Kuhn, 2011), a modified version of BLEU, attaching more
kinds of penalty coefficients and combining n-gram precision and recall
– MP4IBM1 (Popović et al., 2011), based on morphemes, POS (4-grams) and
IBM1 lexicon probabilities, etc.
• Ours: the initial version of LEPOR ($LEPOR_A$ & $LEPOR_B$)
– Simple product of the factor values
– Without using linguistic features
– Based on the augmented factors
29
Result analysis
The system-level Spearman correlation with human judgment on WMT11 corpora
- LEPOR yielded three top-one correlation scores, on CZ-EN / ES-EN / EN-ES
- LEPOR showed robust performance across languages, yielding the top mean score
Aaron L.F. Han, Derek F. Wong and Lidia S. Chao. 2012. LEPOR: A Robust Evaluation Metric for Machine
Translation with Augmented Factors. In proc. of COLING.
30
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
31
Variant of LEPOR
• New factor: to consider content information
– We design n-gram precision and n-gram recall
– and take the harmonic mean of the n-gram sub-factors
– Also measured at sentence level, vs. BLEU (corpus level)
โ€ข N is the number of words in the block matching
โ€ข ๐‘ƒ๐‘› =
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘š๐‘Ž๐‘ก๐‘โ„Ž๐‘’๐‘‘
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘  ๐‘–๐‘› ๐‘ ๐‘ฆ๐‘ ๐‘ก๐‘’๐‘š ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก
โ€ข ๐‘… ๐‘› =
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘š๐‘Ž๐‘ก๐‘โ„Ž๐‘’๐‘‘
#๐‘›๐‘”๐‘Ÿ๐‘Ž๐‘š ๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘  ๐‘–๐‘› ๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’
โ€ข ๐ป๐‘ƒ๐‘… = ๐ป๐‘Ž๐‘Ÿ๐‘š๐‘œ๐‘›๐‘–๐‘ ๐›ผ๐‘… ๐‘›, ๐›ฝ๐‘ƒ๐‘› =
๐›ผ+๐›ฝ
๐›ผ
๐‘… ๐‘›
+
๐›ฝ
๐‘ƒ ๐‘›
32
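An illustrative sketch of n-gram precision and recall; for simplicity it uses plain multiset n-gram overlap instead of the block-matching alignment the slides describe:

```python
from collections import Counter

def ngram_pr(hyp, ref, n):
    """Clipped n-gram precision and recall between two token lists."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    p_n = matched / max(sum(hyp_counts.values()), 1)
    r_n = matched / max(sum(ref_counts.values()), 1)
    return p_n, r_n
```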
โ€ข Example of bigram (n=2) block matching for
bigram precision and bigram recall:
• Similar block-matching strategies apply for n >= 3
– for the calculation of n-gram precision and recall
33
Variant-1: โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…
โ€ข To achieve higher correlation with human judgments for focused language
pair
โ€“ Design tunable parameters at factors level
โ€“ Weighted harmonic mean of the factors:
– $hLEPOR = Harmonic(w_{LP}LP,\ w_{nPosPenal}\,nPosPenal,\ w_{HPR}HPR)$
$=\frac{\sum_{i=1}^{n}w_i}{\sum_{i=1}^{n}\frac{w_i}{Factor_i}}=\frac{w_{LP}+w_{nPosPenal}+w_{HPR}}{\frac{w_{LP}}{LP}+\frac{w_{nPosPenal}}{nPosPenal}+\frac{w_{HPR}}{HPR}}$
– $hLEPOR_A = \frac{1}{SentNum}\sum_{i=1}^{SentNum}hLEPOR_i$
– $hLEPOR_B = Harmonic(w_{LP}LP,\ w_{nPosPenal}\,nPosPenal,\ w_{HPR}HPR)$, computed on the system-level averaged factors
โ€ข In this way, it has more parameters to tune for the focused language pair
โ€“ To seize the characteristics of focused language pair
โ€“ Especially for distant language pairs
34
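The weighted harmonic combination can be sketched as (illustrative; the default weights give the unweighted harmonic mean):

```python
def hlepor(lp, npos_penal, hpr, w_lp=1.0, w_npos=1.0, w_hpr=1.0):
    """hLEPOR: weighted harmonic mean of the three LEPOR factors."""
    if min(lp, npos_penal, hpr) == 0:
        return 0.0
    return (w_lp + w_npos + w_hpr) / (w_lp / lp + w_npos / npos_penal + w_hpr / hpr)
```

Because a harmonic mean is dominated by its smallest input, raising a factor's weight makes a deficit in that factor more costly.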
Variant-2: ๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…
โ€ข For the languages that request high fluency
โ€“ We design the n-gram based metric
โ€“ N-gram based product of the factors:
โ€“ ๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… = ๐ฟ๐‘ƒ ร— ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ ร— ๐‘’๐‘ฅ๐‘( ๐‘ค ๐‘› ๐‘™๐‘œ๐‘”๐ป๐‘ƒ๐‘…๐‘
๐‘›=1 )
โ€“ ๐ป๐‘ƒ๐‘… = ๐ป๐‘Ž๐‘Ÿ๐‘š๐‘œ๐‘›๐‘–๐‘ ๐›ผ๐‘… ๐‘›, ๐›ฝ๐‘ƒ๐‘› =
๐›ผ+๐›ฝ
๐›ผ
๐‘… ๐‘›
+
๐›ฝ
๐‘ƒ ๐‘›
โ€“ ๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ด =
1
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘–๐‘กโ„Ž๐‘†๐‘’๐‘›๐‘ก
๐‘†๐‘’๐‘›๐‘ก๐‘๐‘ข๐‘š
๐‘–=1
โ€“ ๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ต = ๐ฟ๐‘ƒ ร— ๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ ร— ๐‘’๐‘ฅ๐‘( ๐‘ค ๐‘› ๐‘™๐‘œ๐‘”๐ป๐‘ƒ๐‘…๐‘
๐‘›=1 )
โ€ข In this way, the n-gram information is considered for the measuring of
precision and recall
โ€“ To consider more about content information
35
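The n-gram variant combines the per-n HPR scores by a weighted geometric mean, sketched below (illustrative; uniform weights by default):

```python
import math

def nlepor(lp, npos_penal, hpr_scores, weights=None):
    """nLEPOR: LP x nPosPenal x weighted geometric mean of per-n HPR scores."""
    if weights is None:
        weights = [1 / len(hpr_scores)] * len(hpr_scores)
    if min(hpr_scores) == 0:
        return 0.0
    log_mean = sum(w * math.log(h) for w, h in zip(weights, hpr_scores))
    return lp * npos_penal * math.exp(log_mean)
```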
Linguistic feature
• We enhance the metric with a concise linguistic feature:
• Example of part-of-speech (POS) utilization
– POS can sometimes act as synonym information
– e.g. "say" and "claim" in the example translation
36
โ€ข Scores with linguistic features:
โ€ข Sentence-level score:
โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘“๐‘–๐‘›๐‘Ž๐‘™ =
1
๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘
(๐‘คโ„Ž๐‘ค ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†)
โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘† and ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ are measured using the same
algorithm on POS sequence and word sequence respectively
โ€ข System-level score:
โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘“๐‘–๐‘›๐‘Ž๐‘™ =
1
๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘
(๐‘คโ„Ž๐‘ค ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†)
Aaron L.F. Han, Derek F. Wong Lidia S. Chao, et al. 2013. Language-independent Model for Machine
Translation Evaluation with Reinforced Factors. In proc. of MT Summit.
37
โ€ข Experiments of enhanced metric
โ€ข Corpora setting
– The same corpora as in the previous experiments
โ€“ WMT08 for development and WMT11 for testing
โ€ข Variant of LEPOR model
โ€“ Harmonic mean to combine the main factors
โ€“ More parameters to tune
โ€“ Utilizing concise linguistic features (POS) as external resource
– $hLEPOR=\frac{1}{num\_sent}\sum_{i=1}^{num\_sent}hLEPOR_i$
– $hLEPOR_E=\frac{1}{w_{hw}+w_{hp}}(w_{hw}\,hLEPOR_{word}+w_{hp}\,hLEPOR_{POS})$
38
โ€ข Comparison (Metrics) with related works:
โ€“ In addition to the state-of-the-art metrics METEOR
/ BLEU / TER
โ€“ Compare with ROSE (Song and Cohn, 2011) and
MPF (Popovic, 2011)
โ€“ ROSE and MPF metrics both utilize the POS as
external information
39
Tuned parameter values of our enhanced method
System-level Spearman correlation with human judgment on WMT11 corpora
Our enhanced method yielded the highest mean score, 0.83, over the eight language pairs
40
Performance in WMT task
โ€ข Performances on MT evaluation shared tasks
in ACL-WMT 2013
– The Eighth Workshop on Statistical Machine
Translation, co-located with ACL-2013
โ€ข Corpora:
โ€“ English, Spanish, German, French, Czech, and
Russian (new)
41
โ€ข Submitted methods:
โ€“ hLEPOR (LEPOR_v3.1): with linguistic feature & tunable
parameters
– $hLEPOR=\frac{1}{num\_sent}\sum_{i=1}^{num\_sent}hLEPOR_i$
– $hLEPOR_{final}=\frac{1}{w_{hw}+w_{hp}}(w_{hw}\,hLEPOR_{word}+w_{hp}\,hLEPOR_{POS})$
– nLEPOR_baseline: without using external resources, with default
weights
– $nLEPOR = LP \times nPosPenal \times \exp\left(\sum_{n=1}^{N}w_n\log HPR_n\right)$
42
Metric evaluations
โ€ข Evaluation criteria:
โ€“ Human judgments are assumed the golden ones
โ€“ Measure the correlation score between automatic evaluation and
human judgments
โ€ข System-level correlation (two commonly used correlations)
– Spearman rank correlation coefficient:
– $\rho_{XY}=1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)}$
– $X=\{x_1,\ldots,x_n\}$, $Y=\{y_1,\ldots,y_n\}$
– Pearson correlation coefficient:
– $\rho_{XY}=\frac{\sum_{i=1}^{n}(x_i-\mu_x)(y_i-\mu_y)}{\sqrt{\sum_{i=1}^{n}(x_i-\mu_x)^2}\sqrt{\sum_{i=1}^{n}(y_i-\mu_y)^2}}$, where $\mu_x=\frac{1}{n}\sum_{i=1}^{n}x_i$ and $\mu_y=\frac{1}{n}\sum_{i=1}^{n}y_i$
43
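The Pearson coefficient can be sketched on two lists of system scores (illustrative):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mu_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mu_y) ** 2 for y in ys))
    return cov / (sx * sy)
```

Unlike Spearman's rho, which only uses the ranks, Pearson is sensitive to the absolute score differences between systems.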
Official results
System-level Pearson (left) / Spearman (right) correlation scores with human judgment
Our methods rank first by Pearson and second by Spearman, respectively, by average score
44
โ€ข From the shared task results
– Practical performance: LEPOR methods are
effective, yielding generally higher correlations across
language pairs
– Robustness: LEPOR methods achieved the
highest score on the new language pair English-
Russian
– Contribution to an existing weak point: MT
evaluation with English as the source language
45
Aaron L.F. Han, Derek F. Wong and Lidia S. Chao et al. 2013. A Description of Tunable Machine Translation
Evaluation Systems in WMT13 Metrics Task. In proc. of ACL workshop of 8th WMT.
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
46
โ€ข The methods and contributions
โ€“ Designed and trained for system-level MTE
โ€“ Using reference translations
โ€ข For future work:
– Tune MT systems using the designed MTE methods
– Design models for segment-level MTE
– MTE without using reference translations
– Investigate more linguistic features (e.g. textual
entailment, paraphrasing, and synonyms) for MTE with
English as the target language
47
Content
โ€ข MT evaluation (MTE) introduction
โ€“ MTE background
โ€“ Existing methods
โ€“ Weaknesses analysis
โ€ข Designed model
โ€“ Designed factors
โ€“ LEPOR metric
โ€ข Experiments and results
โ€“ Evaluation criteria
โ€“ Corpora
โ€“ Results analysis
โ€ข Enhanced models
โ€“ Variants of LEPOR
โ€“ Performances in shared task
โ€ข Conclusion and future work
โ€ข Selected publications
48
Selected publications
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu. Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive
Supervised Evaluation. The Scientific World Journal, Issue: Recent Advances in Information Technology. Page 1-12, April 2014. Hindawi Publishing Corporation. ISSN:1537-744X.
http://www.hindawi.com/journals/tswj/aip/760301/
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. Language-independent Model for Machine Translation Evaluation with
Reinforced Factors. Proceedings of the 14th International Conference of Machine Translation Summit (MT Summit), pp. 215-222. Nice, France. 2 - 6 September 2013. International
Association for Machine Translation. http://www.mt-archive.info/10/MTS-2013-Han.pdf
๏‚ฒ Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Liangye He, Yiming Wang, Jiaji Zhou. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task.
Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT), pp. 414-421, 8-9 August 2013. Sofia, Bulgaria. Association for
Computational Linguistics. http://www.aclweb.org/anthology/W13-2253
๏‚ฒ Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Liangye He, Junwen Xing. Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and
Statistical Modeling. Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT), pp. 365-372. 8-9 August 2013. Sofia, Bulgaria.
Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2245
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li and Ling Zhu. Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine
Translation Evaluation. Language Processing and Knowledge in the Web. Lecture Notes in Computer Science Volume 8105, 2013, pp 119-131. Volume Editors: Iryna Gurevych, Chris
Biemann and Torsten Zesch. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_13
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Ling Zhu and Shuo Li. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. Language
Processing and Knowledge in the Web. Lecture Notes in Computer Science Volume 8105, 2013, pp 111-118. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Springer-
Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_12
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He. Automatic Machine Translation Evaluation with Part-of-Speech Information. Text, Speech, and Dialogue. Lecture Notes in
Computer Science Volume 8082, 2013, pp 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40585-3_16
๏‚ฒ Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Language
Processing and Intelligent Information Systems. Lecture Notes in Computer Science Volume 7912, 2013, pp 57-68. M.A. Klopotek et al. (Eds.): IIS 2013. Springer-Verlag Berlin Heidelberg.
http://dx.doi.org/10.1007/978-3-642-38634-3_8
๏‚ฒ Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. Proceedings of the 24th International
Conference on Computational Linguistics (COLING): Posters, pages 441โ€“450, Mumbai, December 2012. Association for Computational Linguistics.
http://aclweb.org/anthology//C/C12/C12-2044.pdf
49
Warm pictures from NLP2CT
50
Thanks for your attention!
(Aaron) Li-Feng Han (้Ÿ“ๅˆฉๅณฐ), MB-154887
Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao
NLP2CT, University of Macau
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingYoung Seok Kim
ย 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT IntroductionRIILP
ย 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Dhruv Gohil
ย 
Natural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyRimzim Thube
ย 
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize
ย 
Can Deep Learning solve the Sentiment Analysis Problem
Can Deep Learning solve the Sentiment Analysis ProblemCan Deep Learning solve the Sentiment Analysis Problem
Can Deep Learning solve the Sentiment Analysis ProblemMark Cieliebak
ย 

What's hot (20)

NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
ย 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
ย 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
ย 
Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
ย 
1909 paclic
1909 paclic1909 paclic
1909 paclic
ย 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
ย 
SemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment AnalysisSemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment Analysis
ย 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
ย 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
ย 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
ย 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
ย 
NLP and its application in Insurance -Short story presentation
NLP and its application in Insurance -Short story presentationNLP and its application in Insurance -Short story presentation
NLP and its application in Insurance -Short story presentation
ย 
Natural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A SurveyNatural Language Processing Advancements By Deep Learning - A Survey
Natural Language Processing Advancements By Deep Learning - A Survey
ย 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
ย 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ย 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
ย 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
ย 
Natural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A Survey
ย 
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
Welocalize EAMT 2014 Presentation Assumptions, Expectations and Outliers in P...
ย 
Can Deep Learning solve the Sentiment Analysis Problem
Can Deep Learning solve the Sentiment Analysis ProblemCan Deep Learning solve the Sentiment Analysis Problem
Can Deep Learning solve the Sentiment Analysis Problem
ย 

Viewers also liked

Making the Contents Page
Making the Contents PageMaking the Contents Page
Making the Contents PageCatfishMike
ย 
Production Brief
Production BriefProduction Brief
Production BriefCatfishMike
ย 
Making the Cover
Making the CoverMaking the Cover
Making the CoverCatfishMike
ย 
Medios de transmision
Medios de transmisionMedios de transmision
Medios de transmisiondeissynen
ย 
D ay 12 graphing inequalities
D ay 12 graphing inequalitiesD ay 12 graphing inequalities
D ay 12 graphing inequalitiesErik Tjersland
ย 
Digital marketing
Digital marketingDigital marketing
Digital marketingRajan Soni
ย 

Viewers also liked (8)

Making the Contents Page
Making the Contents PageMaking the Contents Page
Making the Contents Page
ย 
Production Brief
Production BriefProduction Brief
Production Brief
ย 
Contactual
ContactualContactual
Contactual
ย 
Making the Cover
Making the CoverMaking the Cover
Making the Cover
ย 
Medios de transmision
Medios de transmisionMedios de transmision
Medios de transmision
ย 
D ay 12 graphing inequalities
D ay 12 graphing inequalitiesD ay 12 graphing inequalities
D ay 12 graphing inequalities
ย 
Catalogo
CatalogoCatalogo
Catalogo
ย 
Digital marketing
Digital marketingDigital marketing
Digital marketing
ย 

Similar to Lepor: augmented automatic MT evaluation metric

TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...Lifeng (Aaron) Han
ย 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
ย 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
ย 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
ย 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
ย 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a surveyLifeng (Aaron) Han
ย 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
ย 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needsIvan Berlocher
ย 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...Lifeng (Aaron) Han
ย 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
ย 
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...Lifeng (Aaron) Han
ย 
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Lifeng (Aaron) Han
ย 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...ๅบƒๆจน ๆœฌ้–“
ย 
Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda
ย 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)kevig
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)kevig
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)kevig
ย 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
ย 

Similar to Lepor: augmented automatic MT evaluation metric (20)

TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
ย 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
ย 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
ย 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
ย 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
ย 
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
ย 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a survey
ย 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ย 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
ย 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
ย 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
ย 
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...
ย 
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
ย 
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
2019 dynamically composing_domain-data_selection_with_clean-data_selection_by...
ย 
Thamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slidesThamme Gowda's PhD dissertation defense slides
Thamme Gowda's PhD dissertation defense slides
ย 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
ย 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
ย 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
ย 

More from Lifeng (Aaron) Han

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterLifeng (Aaron) Han
ย 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Lifeng (Aaron) Han
ย 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...Lifeng (Aaron) Han
ย 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
ย 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
ย 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
ย 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
ย 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...Lifeng (Aaron) Han
ย 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
ย 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
ย 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
ย 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
ย 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
ย 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronLifeng (Aaron) Han
ย 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric Lifeng (Aaron) Han
ย 
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksPPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksLifeng (Aaron) Han
ย 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Lifeng (Aaron) Han
ย 

More from Lifeng (Aaron) Han (17)

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
ย 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)
ย 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
ย 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
ย 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
ย 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
ย 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
ย 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
ย 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
ย 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a survey
ย 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
ย 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
ย 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
ย 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-Aaron
ย 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric
ย 
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksPPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
ย 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Lepor: augmented automatic MT evaluation metric

  • 1. LEPOR: An Augmented Machine Translation Evaluation Metric Master thesis defense, 2014.07 (Aaron) Li-Feng Han (韓利峰), MB-154887 Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao NLP2CT, University of Macau
  • 2. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 2
  • 3. MTE - background • MT began as early as the 1950s (Weaver, 1955) • Rapid development since the 1990s (two reasons) – Computer technology – Enlarged bilingual corpora (Mariño et al., 2006) • Several promotional events for MT: – NIST Open MT Evaluation series (OpenMT) • 2001-2009 (Li, 2005) • By the National Institute of Standards and Technology, US • Corpora: Arabic-English, Chinese-English, etc. – International Workshop on Statistical Machine Translation (WMT) • From 2006 (Koehn and Monz, 2006; Callison-Burch et al., 2007 to 2012; Bojar et al., 2013) • Annually by SIGMT of ACL • Corpora: English to French/German/Spanish/Czech/Hungarian/Haitian Creole/Russian & the inverse direction – International Workshop on Spoken Language Translation (IWSLT) • From 2004 (Eck and Hori, 2005; Paul, 2009; Paul, et al., 2010; Federico et al., 2011) • English & Asian languages (Chinese, Japanese, Korean) 3
  • 4. โ€ข With the rapid development of MT, how to evaluate the MT model? โ€“ Whether the newly designed algorithm/feature enhance the existing MT system, or not? โ€“ Which MT system yields the best output for specified language pair, or generally across languages? โ€ข Difficulties in MT evaluation: โ€“ language variability results in no single correct translation โ€“ natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003) 4
  • 5. Existing MTE methods • Manual MTE methods: • Traditional manual judgment – Intelligibility: how understandable the sentence is – Fidelity: how much information the translated sentence retains as compared to the original • By the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966) – Adequacy (similar to fidelity) – Fluency: whether the sentence is well-formed and fluent – Comprehension (improved intelligibility) • By the Defense Advanced Research Projects Agency (DARPA) of the US (Church et al., 1991; White et al., 1994) 5
  • 6. โ€ข Advanced manual judgment: โ€“ Task oriented method (White and Taylor, 1998) โ€ข In light of the tasks for which the output might by used โ€“ Further developed criteria โ€ข Bangalore et al. (2000): simple string accuracy/ generation string accuracy/ two corresponding tree-based accuracies. โ€ข LDC (Linguistics Data Consortium): 5-point scales fluency & adequacy โ€ข Specia et al. (2011): design 4-level adequacy, highly adequacy/fairly adequacy /poorly adequacy/completely inadequate โ€“ Segment ranking (WMT 2011~2013) โ€ข Judges are asked to provide a complete ranking over all the candidate translations of the same source segment (Callison-Burch et al., 2011, 2012) โ€ข 5 systems are randomly selected for the judges (Bojar et al., 2013) 6
  • 7. โ€ข Problems in Manual MTE โ€“ Time consuming โ€ข How about a document contain 3,000 sentences or more โ€“ Expensive โ€ข Professional translators? or other people? โ€“ Unrepeatable โ€ข Precious human labor can not be simply re-run โ€“ Low agreement, sometimes (Callison-Burch et al., 2011) โ€ข E.g. in WMT 2011 English-Czech task, multi-annotator agreement kappa value is very low โ€ข Even the same strings produced by two systems are ranked differently each time by the same annotator 7
  • 8. โ€ข How to address the problems? โ€“ Automatic MT evaluation! โ€ข What do we expect? (as compared with manual judgments) โ€“ Repeatable โ€ข Can be re-used whenever we make some change of the MT system, and plan to have a check of the translation quality โ€“ Fast โ€ข several minutes or seconds for evaluating 3,000 sentences โ€ข V.s. hours of human labor โ€“ Cheap โ€ข We do not need expensive manual judgments โ€“ High agreement โ€ข Each time of running, result in same scores for un-changed outputs โ€“ Reliable โ€ข Give a higher score for better translation output โ€ข Measured by correlation with human judgments 8
  • 9. Automatic MTE methods • BLEU (Papineni et al., 2002) – Proposed by IBM – First automatic MTE method – Based on the degree of n-gram overlap between the strings of words produced by the machine and the human translation references – Corpus-level evaluation
  • BLEU = Brevity_Penalty × exp(Σ_{n=1}^{N} λ_n log Precision_n)
  – Brevity_Penalty = 1 if c > r; e^{1 − r/c} if c ≤ r
  – Precision = #correct / #output; Precision_n = #ngram_correct / #ngram_output
  • c is the total length of the candidate translation corpus (the sum of sentence lengths)
  • r refers to the sum of effective reference sentence lengths in the corpus – if there are multiple references for each candidate sentence, the reference length nearest to the candidate sentence's is selected as the effective one
  Papineni et al. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. Proc. of ACL. 9
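The BLEU computation above can be sketched in a few lines. This is a minimal single-reference, sentence-level illustration only (BLEU as defined on the slide is corpus-level, and λ_n are typically uniform weights 1/N); the function names are ours, not from any particular BLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of one candidate against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """Uniform weights lambda_n = 1/max_n: geometric mean of n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log undefined; real implementations use smoothing here
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)
```

A perfect match scores 1.0, and a candidate shorter than the reference is scaled down by the brevity penalty.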
  • 10. โ€ข METEOR (Banerjee and Lavie, 2005) โ€“ Proposed by CMU โ€“ To address weaknesses in BLEU, e.g. lack of recall, lack of explicit word matching โ€“ Based on general concept of flexible unigram matching โ€“ Surface form, stemmed form and meanings โ€ข ๐‘€๐ธ๐‘‡๐ธ๐‘‚๐‘… = 10๐‘ƒ๐‘… ๐‘…+9๐‘ƒ ร— (1 โˆ’ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ) โ€“ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™๐‘ก๐‘ฆ = 0.5 ร— #๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘  #๐‘ข๐‘›๐‘–๐‘”๐‘Ÿ๐‘Ž๐‘š๐‘ _๐‘š๐‘Ž๐‘กโ„Ž๐‘’๐‘‘ 3 โ€“ ๐‘…๐‘’๐‘๐‘Ž๐‘™๐‘™ = #๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก #๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’ โ€ข Unigram strings matches between MT and references, the match of words โ€“ simple morphological variants of each other โ€“ by the identical stem โ€“ synonyms of each other โ€ข Metric score is combination of factors โ€“ unigram-precision, unigram-recall โ€“ a measure of fragmentation , to capture how well-ordered the matched words in the machine translation are in relation to the reference โ€“ Penalty increases as the number of chunks increases Banerjee, Satanjeev and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In proc. of ACL. 10 Reference: He is a clever boy in the class Output: he is an clever boy in class #๐‘โ„Ž๐‘ข๐‘›๐‘˜๐‘ =2 #๐‘ข๐‘›๐‘–๐‘”๐‘Ÿ๐‘Ž๐‘š๐‘ _๐‘š๐‘Ž๐‘กโ„Ž๐‘’๐‘‘=6
  • 11. โ€ข WER (Su et al., 1992) โ€ข TER/HTER (Snover et al., 2006) โ€“ by University of Maryland & BBN Technologies โ€ข HyTER (Dreyer and Marcu, 2012) โ€ข ๐‘Š๐ธ๐‘… = ๐‘ ๐‘ข๐‘๐‘ ๐‘ก๐‘–๐‘ก๐‘ข๐‘ก๐‘–๐‘œ๐‘›+๐‘–๐‘›๐‘ ๐‘’๐‘Ÿ๐‘ก๐‘–๐‘œ๐‘›+๐‘‘๐‘’๐‘™๐‘’๐‘ก๐‘–๐‘œ๐‘› ๐‘Ÿ๐‘’๐‘“๐‘’๐‘Ÿ๐‘’๐‘›๐‘๐‘’ ๐‘™๐‘’๐‘›๐‘”๐‘กโ„Ž โ€ข WER is Based on Levenshtein distance โ€“ the minimum number of editing steps needed to match two sequences โ€ข WER does not take word ordering into account appropriately โ€“ The WER scores very low when the word order of system output translation is โ€œwrongโ€ according to the reference. โ€“ In the Levenshtein distance, the mismatches in word order require the deletion and re-insertion of the misplaced words. โ€“ However, due to the diversity of language expression, some so-called โ€œwrongโ€ order sentences by WER also prove to be good translations. โ€ข TER adds a novel editing step: blocking movement โ€“ allow the movement of word sequences from one part of the output to another โ€“ Block movement is considered as one edit step with same cost of other edits โ€ข HyTER develops an annotation tool that is used to create meaning-equivalent networks of translations for a given sentence โ€“ Based on the large reference networks Snover et al. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In proc. of AMTA. Dreyer, Markus and Daniel Marcu. 2012. HyTER: Meaning-Equivalent Semantics for Translation Evaluation. In proc. of NAACL. 11
  • 12. Weaknesses of MTE methods • Good performance only on certain language pairs – Perform worse on language pairs with English as the source, compared with English as the target – E.g. TER (Snover et al., 2006) achieved a 0.83 (Czech-English) vs 0.50 (English-Czech) correlation score with human judgments on the WMT-2011 shared tasks • Rely on many linguistic features for good performance – E.g. METEOR relies on both stemming and synonyms, etc. • Employ incomprehensive factors – E.g. BLEU (Papineni et al., 2002) is based on n-gram precision only – a higher BLEU score is not necessarily indicative of better translation (Callison-Burch et al., 2006) 12
  • 13. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 13
  • 14. Designed factors • How to solve the mentioned problems? • Our designed methods – to make comprehensive judgments: enhanced/augmented factors – to deal with the language-bias problem (performing differently across languages): tunable parameters • Try to make a contribution on some existing weak points – evaluation with English as the source language – some low-resource language pairs, e.g. Czech-English 14
  • 15. Factor-1: enhanced length penalty • BLEU (Papineni et al., 2002) only utilizes a brevity penalty for shorter sentences • Redundant/longer sentences are not penalized properly • To enhance the length penalty factor, we design a new version of the penalty:
  • LP = exp(1 − r/c) if c < r; 1 if c = r; exp(1 − c/r) if c > r
  • r: length of the reference sentence • c: length of the candidate (system output) sentence – A penalty score for both longer and shorter sentences as compared with the reference – Our length penalty is designed at sentence level, while BLEU's is corpus-level 15
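The three-case LP definition above translates directly into code; a minimal sketch (the function name is ours):

```python
import math

def length_penalty(c, r):
    """Enhanced length penalty: unlike BLEU's brevity penalty, it penalizes
    both shorter and longer outputs. c = candidate length, r = reference length."""
    if c < r:
        return math.exp(1 - r / c)  # too short
    if c > r:
        return math.exp(1 - c / r)  # too long
    return 1.0                      # exact length match
```

Note the symmetry: a candidate half as long as the reference and one twice as long receive the same penalty e^{-1}, which is the point of the enhancement over BLEU.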
  • 16. Factor-2: n-gram position difference penalty (NPosPenal) • Word order information was introduced in ATEC (Wong and Kit, 2008) – However, they utilize the traditional nearest matching strategy – Without giving clearly formalized measuring steps • We design the n-gram based position difference factor with formalized steps • To calculate the NPosPenal score: several steps – N-gram word alignment, where n is the number of considered neighbors – Labeling each word with a sequence number – Measure the position difference score of each word (PD_i) – Measure the sentence-level position difference score (NPD, and NPosPenal) 16
  • 17. โ€ข Step 1: N-gram word alignment (single reference) โ€ข N-gram word alignment โ€“ Alignment direction fixed: from hypothesis (output) to reference โ€“ Considering word neighbors, higher priority shall be given for the candidate matching with neighbor information โ€ข As compared with the traditional nearest matching strategy, without consider the neighbors โ€“ If both the candidates have neighbors, we select nearest matching as backup choice 17
  • 18. Fig. N-gram word alignment algorithm 18
  • 19. โ€ข Examples of n-gram word alignment: โ€ข If using the nearest matching strategy, the alignment will be different 19
  • 20. โ€ข Step 2: NPD calculation for each sentence โ€ข Labeling the token units: โ€ข ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก: position of matched token in output sentence โ€ข ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘Ÿ๐‘’๐‘“: position of matched token in reference sentence โ€ข Measure the scores: โ€ข Each token: ๐‘ƒ๐ท๐‘– = |๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก โˆ’ ๐‘€๐‘Ž๐‘ก๐‘โ„Ž๐‘๐‘Ÿ๐‘’๐‘“| โ€ข Whole sentence: ๐‘๐‘ƒ๐ท = 1 ๐ฟ๐‘’๐‘›๐‘”๐‘กโ„Ž ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก |๐‘ƒ๐ท๐‘–| ๐ฟ๐‘’๐‘›๐‘”๐‘กโ„Ž ๐‘œ๐‘ข๐‘ก๐‘๐‘ข๐‘ก ๐‘–=1 โ€ข N-gram Position difference score: ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ = exp โˆ’๐‘๐‘ƒ๐ท 20
  • 21. โ€ข Examples of NPD score with single reference: โ€“ N-gram position difference penalty score of each word: ๐‘ƒ๐ท๐‘– โ€“ Normalize the penalty score for each sentence: ๐‘๐‘ƒ๐ท โ€“ This example: ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ = exp โˆ’๐‘๐‘ƒ๐ท = ๐‘’โˆ’ 1 2 21
  • 22. multi-reference solution • Design the n-gram word alignment for the multi-reference situation • N-gram alignment for multi-reference: – The same direction, output to references – Higher priority also for the candidate with neighbor information – Adding principle: • If the matching candidates from different references all have neighbors, we select the one leading to a smaller NPD value (backup choice for nearest matching) 22
  • 23. โ€ข N-gram alignment examples of multi-references: โ€ข For the word โ€œonโ€: โ€“ Reference one: PDi = PD3 = 3 6 โˆ’ 4 8 โ€“ Reference two: PDi = PD3 = 3 6 โˆ’ 4 7 โ€“ 3 6 โˆ’ 4 8 < 3 6 โˆ’ 4 7 , the โ€œonโ€ in reference-1 is selected by leading to a smaller NPD value โ€ข Other two words โ€œaโ€ and โ€œbirdโ€ are aligned using the same principle 23
  • 24. Factor-3: weighted harmonic mean of precision and recall • METEOR (Banerjee and Lavie, 2005) puts a fixed higher weight on recall as compared with precision – For different language pairs, the importance of precision and recall differs • To make a generalized factor for widespread language pairs • We design tunable parameters for precision and recall:
  • Harmonic(αR, βP) = (α + β) / (α/R + β/P)
  • P = common_num / system_length
  • R = common_num / reference_length
  • α and β are two parameters to adjust the weight of R (recall) and P (precision) • common_num represents the number of aligned (matching) words and marks appearing in both the automatic translation and the references • system_length and reference_length specify the sentence lengths of the system output and the reference respectively 24
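The weighted harmonic mean above reduces to the familiar F1 when α = β; a minimal sketch (function name ours):

```python
def harmonic(alpha, recall, beta, precision):
    """Weighted harmonic mean of recall and precision:
    (alpha + beta) / (alpha/R + beta/P).
    With alpha == beta this is the standard F1 measure."""
    if recall == 0 or precision == 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)
```

Raising α pulls the score toward recall, raising β toward precision, which is exactly the per-language-pair tunability the factor is designed for.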
  • 25. LEPOR metric • LEPOR: automatic machine translation evaluation metric considering the enhanced Length Penalty, Precision, n-gram Position difference Penalty and Recall • Initial version: the product of the factors • Sentence-level score:
  – LEPOR = LP × NPosPenal × Harmonic(αR, βP)
  • System-level scores:
  – LEPOR_A = (1 / SentNum) × Σ_{i=1}^{SentNum} LEPOR(sent_i)
  – LEPOR_B = Π_{i=1}^{n} Factor_i
  – Factor_i = (1 / SentNum) × Σ_{s=1}^{SentNum} Factor_i(sent_s) 25
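Putting the three factors together, the sentence- and system-level scores can be sketched from pre-computed factor values; `lepor_sentence` and `lepor_system_a` are illustrative names of ours, not the authors' code.

```python
def lepor_sentence(lp, n_pos_penal, recall, precision, alpha=1.0, beta=1.0):
    """Sentence-level LEPOR = LP * NPosPenal * Harmonic(alpha*R, beta*P)."""
    if recall == 0 or precision == 0:
        return 0.0
    harm = (alpha + beta) / (alpha / recall + beta / precision)
    return lp * n_pos_penal * harm

def lepor_system_a(sentence_scores):
    """System-level LEPOR_A: arithmetic mean of the sentence scores."""
    return sum(sentence_scores) / len(sentence_scores)
```

A perfect hypothesis (all factors 1) scores 1.0, and any single weak factor, being multiplied in, drags the whole sentence score down.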
  • 26. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora and compared existing methods โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 26
  • 27. Evaluation criteria • Evaluation criteria: – Human judgments are assumed to be the gold standard – Measure the correlation between automatic evaluation scores and human judgments • System-level correlation (one commonly used correlation criterion) – Spearman rank correlation coefficient (Callison-Burch et al., 2011):
  – ρ_XY = 1 − (6 × Σ_{i=1}^{n} d_i²) / (n(n² − 1))
  – X = {x_1, …, x_n}, Y = {y_1, …, y_n}
  – d_i = x_i − y_i is the difference between the two corresponding ranked variables 27
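The Spearman formula above can be sketched directly: convert each score list to ranks, then apply the closed form (which assumes no rank ties, as in a strict system ranking). Function names are ours.

```python
def rank(values):
    """Rank positions (1 = largest value), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), d_i = rank difference."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Identical orderings give ρ = 1, fully reversed orderings give ρ = −1, matching the use here: a metric is good when it ranks the MT systems the same way the human judges do.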
  • 28. Corpora and compared methods • Corpora: – Development data for parameter tuning – WMT2008 (http://www.statmt.org/wmt08/) – EN: English, ES: Spanish, DE: German, FR: French and CZ: Czech – Two directions: EN-other and other-EN – Testing data – WMT2011 (http://www.statmt.org/wmt11/) – The numbers of participating automatic MT systems in WMT 2011: – 10, 22, 15 and 17 respectively for English-to-CZ/DE/ES/FR – 8, 20, 15 and 18 respectively for CZ/DE/ES/FR-to-EN – The gold standard reference data consists of 3,003 sentences 28
  • 29. โ€ข Comparisons (3 gold standard BLEU/TER/METEOR & 2 latest metrics): โ€“ BLEU (Papineni et al., 2002), precision based metric โ€“ TER (Snover et al., 2006), edit distance based metric โ€“ METEOR (version 1.3) (Denkowski and Lavie, 2011), precision and recall, using synonym, stemming, and paraphrasing as external resources โ€“ AMBER (Chen and Kuhn, 2011), a modified version of BLEU, attaching more kinds of penalty coefficients, combining the n-gram precision and recall โ€“ MP4IMB1 (Popovic et al., 2011), based on morphemes, POS (4-grams) and lexicon probabilities, etc. โ€ข Ours, initial version of LEPOR (๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ด & ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ต) โ€“ Simple product value of the factors โ€“ Without using linguistic feature โ€“ Based on augmented factors 29
  • 30. Result analysis The system-level Spearman correlation with human judgments on the WMT11 corpora - LEPOR yielded three top-one correlation scores, on CZ-EN / ES-EN / EN-ES - LEPOR showed robust performance across languages, resulting in the top-one Mean score Aaron L.F. Han, Derek F. Wong and Lidia S. Chao. 2012. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. In proc. of COLING. 30
  • 31. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 31
  • 32. Variant of LEPOR • New factor: to consider content information – Design n-gram precision and n-gram recall – Harmonic mean of the n-gram sub-factors – Also measured at sentence level, vs BLEU (corpus-level) • n is the number of words in the block matching
  • P_n = #ngram_matched / #ngram_chunks_in_system_output
  • R_n = #ngram_matched / #ngram_chunks_in_reference
  • HPR = Harmonic(αR_n, βP_n) = (α + β) / (α/R_n + β/P_n) 32
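A minimal sketch of the n-gram precision and recall counts above, using clipped counting over n-gram blocks (the function name is ours; the slide's block-matching example on the next slide uses bigrams):

```python
from collections import Counter

def ngram_pr(hypothesis, reference, n):
    """Clipped n-gram precision P_n and recall R_n from n-gram block matches."""
    hyp = Counter(tuple(hypothesis[i:i + n])
                  for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in hyp.items())
    p_n = matched / sum(hyp.values()) if hyp else 0.0
    r_n = matched / sum(ref.values()) if ref else 0.0
    return p_n, r_n
```

For example, "the cat sat" against "the cat sat down" with n = 2 shares both hypothesis bigrams but only two of the three reference bigrams, so P_2 = 1 and R_2 = 2/3.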
  • 33. โ€ข Example of bigram (n=2) block matching for bigram precision and bigram recall: โ€ข Similar strategies for n>=3, block matching โ€“ For the calculation of n-gram precision and recall 33
  • 34. Variant-1: hLEPOR • To achieve higher correlation with human judgments for a focused language pair – Design tunable parameters at the factor level – Weighted harmonic mean of the factors:
  – hLEPOR = Harmonic(w_LP·LP, w_NPosPenal·NPosPenal, w_HPR·HPR) = (Σ_{i=1}^{n} w_i) / (Σ_{i=1}^{n} w_i/Factor_i) = (w_LP + w_NPosPenal + w_HPR) / (w_LP/LP + w_NPosPenal/NPosPenal + w_HPR/HPR)
  – hLEPOR_A = (1 / SentNum) × Σ_{i=1}^{SentNum} hLEPOR(sent_i)
  – hLEPOR_B = Harmonic(w_LP·LP, w_NPosPenal·NPosPenal, w_HPR·HPR) over the system-level factors
  • In this way, it has more parameters to tune for the focused language pair – To seize the characteristics of the focused language pair – Especially for distant language pairs 34
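The weighted harmonic combination of the three factors can be sketched as follows (illustrative function and parameter names, not the authors' implementation):

```python
def hlepor(lp, n_pos_penal, hpr, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """hLEPOR: weighted harmonic mean of the three factors,
    sum(w_i) / sum(w_i / Factor_i)."""
    factors = [lp, n_pos_penal, hpr]
    weights = [w_lp, w_np, w_hpr]
    if any(f == 0 for f in factors):
        return 0.0
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))
```

Compared with the plain product of the initial LEPOR, the weights let a tuned configuration emphasize whichever factor matters most for a given language pair: increasing the weight of a factor pulls the score toward that factor's value.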
  • 35. Variant-2: nLEPOR • For languages that require high fluency – We design the n-gram based metric – N-gram based product of the factors:
  – nLEPOR = LP × NPosPenal × exp(Σ_{n=1}^{N} w_n log HPR_n)
  – HPR_n = Harmonic(αR_n, βP_n) = (α + β) / (α/R_n + β/P_n)
  – nLEPOR_A = (1 / SentNum) × Σ_{i=1}^{SentNum} nLEPOR(sent_i)
  – nLEPOR_B = LP × NPosPenal × exp(Σ_{n=1}^{N} w_n log HPR_n) over the system-level factors
  • In this way, the n-gram information is considered in measuring precision and recall – To consider more of the content information 35
  • 36. Linguistic feature • Enhance the metric with a concise linguistic feature: • Example of part-of-speech (POS) utilization – Sometimes serves as synonym information – E.g. the "say" and "claim" in the example translation 36
  • 37. โ€ข Scores with linguistic features: โ€ข Sentence-level score: โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘“๐‘–๐‘›๐‘Ž๐‘™ = 1 ๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘ (๐‘คโ„Ž๐‘ค ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†) โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘† and ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ are measured using the same algorithm on POS sequence and word sequence respectively โ€ข System-level score: โ€ข ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘“๐‘–๐‘›๐‘Ž๐‘™ = 1 ๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘ (๐‘คโ„Ž๐‘ค ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘ ๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†) Aaron L.F. Han, Derek F. Wong Lidia S. Chao, et al. 2013. Language-independent Model for Machine Translation Evaluation with Reinforced Factors. In proc. of MT Summit. 37
  • 38. โ€ข Experiments of enhanced metric โ€ข Corpora setting โ€“ The same corpora utilization with last experiments โ€“ WMT08 for development and WMT11 for testing โ€ข Variant of LEPOR model โ€“ Harmonic mean to combine the main factors โ€“ More parameters to tune โ€“ Utilizing concise linguistic features (POS) as external resource โ€“ โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… = 1 ๐‘›๐‘ข๐‘š ๐‘ ๐‘’๐‘›๐‘ก |โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘–| ๐‘›๐‘ข๐‘š ๐‘ ๐‘’๐‘›๐‘ก ๐‘–=1 โ€“ โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐ธ = 1 ๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘ (๐‘คโ„Ž๐‘คโ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†) 38
  • 39. โ€ข Comparison (Metrics) with related works: โ€“ In addition to the state-of-the-art metrics METEOR / BLEU / TER โ€“ Compare with ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011) โ€“ ROSE and MPF metrics both utilize the POS as external information 39
  • 40. Tuned parameter values of our enhanced method System-level Spearman correlation with human judgments on the WMT11 corpora Our enhanced method yielded the highest Mean score, 0.83, across the eight language pairs 40
  • 41. Performance in the WMT task • Performance in the MT evaluation shared task at ACL-WMT 2013 – The Eighth Workshop on Statistical Machine Translation, co-located with ACL 2013 • Corpora: – English, Spanish, German, French, Czech, and Russian (new) 41
  • 42. โ€ข Submitted methods: โ€“ hLEPOR (LEPOR_v3.1): with linguistic feature & tunable parameters โ€“ โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… = 1 ๐‘›๐‘ข๐‘š ๐‘ ๐‘’๐‘›๐‘ก |โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘–| ๐‘›๐‘ข๐‘š ๐‘ ๐‘’๐‘›๐‘ก ๐‘–=1 โ€“ โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘…๐‘“๐‘–๐‘›๐‘Ž๐‘™ = 1 ๐‘คโ„Ž๐‘ค+๐‘คโ„Ž๐‘ (๐‘คโ„Ž๐‘คโ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ค๐‘œ๐‘Ÿ๐‘‘ + ๐‘คโ„Ž๐‘โ„Ž๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… ๐‘ƒ๐‘‚๐‘†) โ€“ nLEPOR_baseline: without using external resource, default weights โ€“ ๐‘›๐ฟ๐ธ๐‘ƒ๐‘‚๐‘… = ๐ฟ๐‘ƒ ร— ๐‘๐‘ƒ๐‘œ๐‘ ๐‘ƒ๐‘’๐‘›๐‘Ž๐‘™ ร— exp( ๐‘ค ๐‘› ๐‘™๐‘œ๐‘”๐ป๐‘ƒ๐‘…๐‘ ๐‘›=1 ) 42
  • 43. Metric evaluations • Evaluation criteria: – Human judgments are assumed to be the gold standard – Measure the correlation between the automatic evaluation scores and the human judgments • System-level correlation (two commonly used coefficients) – Spearman rank correlation coefficient: – ρ_XY = 1 − (6 Σ_{i=1}^{n} d_i²) / (n(n² − 1)) – where X = {x_1, …, x_n}, Y = {y_1, …, y_n}, and d_i is the rank difference of the i-th item – Pearson correlation coefficient: – ρ_XY = Σ_{i=1}^{n}(x_i − μ_x)(y_i − μ_y) / (√(Σ_{i=1}^{n}(x_i − μ_x)²) √(Σ_{i=1}^{n}(y_i − μ_y)²)), with μ_x = (1/n) Σ x_i, μ_y = (1/n) Σ y_i 43
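Both coefficients can be computed directly from their definitions. The sketch below follows the slide formulas; note the simplified Spearman formula 1 − 6Σd_i²/(n(n²−1)) is only exact when there are no tied ranks:

```python
def pearson(x, y):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (sum((xi - mx) ** 2 for xi in x)
           * sum((yi - my) ** 2 for yi in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman rank correlation via the simplified formula
    (assumes no tied values in either list)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For system-level evaluation, x would be the metric's scores for the participating MT systems and y the corresponding human judgment scores; Spearman only cares about the induced ranking, while Pearson is sensitive to the actual score values.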
  • 44. Official results System-level Pearson (left) / Spearman (right) correlation scores with human judgments Our methods rank first by Pearson and second by Spearman in terms of average score 44
  • 45. โ€ข From the shared task results โ€“ Practical performance: LEPOR methods are effective yielding generally higher across language pairs โ€“ Robustness: LEPOR methods achieved the first highest score on the new language pair English- Russian โ€“ Contribution for the existing weak point: MT evaluation with English as the source language 45 Aaron L.F. Han, Derek F. Wong and Lidia S. Chao et al. 2013. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task. In proc. of ACL workshop of 8th WMT.
  • 46. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 46
  • 47. โ€ข The methods and contributions โ€“ Designed and trained for system-level MTE โ€“ Using reference translations โ€ข For future work: โ€“ Tune MT system using designed MTE methods โ€“ Design model for Segment-level MTE โ€“ MTE without using reference translation โ€“ Investigate more linguistic features (e.g. text entailment, paraphrasing, and synonym) for MTE with English as target language 47
  • 48. Content โ€ข MT evaluation (MTE) introduction โ€“ MTE background โ€“ Existing methods โ€“ Weaknesses analysis โ€ข Designed model โ€“ Designed factors โ€“ LEPOR metric โ€ข Experiments and results โ€“ Evaluation criteria โ€“ Corpora โ€“ Results analysis โ€ข Enhanced models โ€“ Variants of LEPOR โ€“ Performances in shared task โ€ข Conclusion and future work โ€ข Selected publications 48
  • 49. Selected publications ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu. Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation. The Scientific World Journal, Issue: Recent Advances in Information Technology. Page 1-12, April 2014. Hindawi Publishing Corporation. ISSN:1537-744X. http://www.hindawi.com/journals/tswj/aip/760301/ ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng. Language-independent Model for Machine Translation Evaluation with Reinforced Factors. Proceedings of the 14th International Conference of Machine Translation Summit (MT Summit), pp. 215-222. Nice, France. 2 - 6 September 2013. International Association for Machine Translation. http://www.mt-archive.info/10/MTS-2013-Han.pdf ๏‚ฒ Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Liangye He, Yiming Wang, Jiaji Zhou. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task. Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT), pp. 414-421, 8-9 August 2013. Sofia, Bulgaria. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2253 ๏‚ฒ Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Liangye He, Junwen Xing. Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling. Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT), pp. 365-372. 8-9 August 2013. Sofia, Bulgaria. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2245 ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li and Ling Zhu. Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. Language Processing and Knowledge in the Web. Lecture Notes in Computer Science Volume 8105, 2013, pp 119-131. 
Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_13 ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Ling Zhu and Shuo Li. A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. Language Processing and Knowledge in the Web. Lecture Notes in Computer Science Volume 8105, 2013, pp 111-118. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40722-2_12 ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He. Automatic Machine Translation Evaluation with Part-of-Speech Information. Text, Speech, and Dialogue. Lecture Notes in Computer Science Volume 8082, 2013, pp 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-40585-3_16 ๏‚ฒ Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics. Language Processing and Intelligent Information Systems. Lecture Notes in Computer Science Volume 7912, 2013, pp 57-68. M.A. Klopotek et al. (Eds.): IIS 2013. Springer-Verlag Berlin Heidelberg. http://dx.doi.org/10.1007/978-3-642-38634-3_8 ๏‚ฒ Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. Proceedings of the 24th International Conference on Computational Linguistics (COLING): Posters, pages 441-450, Mumbai, December 2012. Association for Computational Linguistics. http://aclweb.org/anthology//C/C12/C12-2044.pdf 49
  • 50. Warm pictures from NLP2CT 50
  • 51. Thanks for your attention! (Aaron) Li-Feng Han (้Ÿ“ๅˆฉๅณฐ), MB-154887 Supervisors: Dr. Derek F. Wong & Dr. Lidia S. Chao NLP2CT, University of Macau