CUHK intern PPT. Machine Translation Evaluation: Methods and Tools

Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have some weaknesses:
– they perform well on certain language pairs but weakly on others, which we call the language-bias problem;
– they consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (hurting replicability), which we call the extremism problem;
– they use incomprehensive factors (e.g. precision only).
To address these problems, he has developed several automatic evaluation metrics that:
– use tunable parameters to address the language-bias problem;
– use concise linguistic features for the linguistic extremism problem;
– use augmented factors.
Experiments on the ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT Summit. In fact, the evaluation work is closely related to similarity measurement, so it can be further developed for other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, covering Chinese named entity recognition, word segmentation, and multilingual treebanks, published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated. Opportunities for further cooperation would be even more exciting.

Slide 1.
Aaron L.-F. Han
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
University of Macau, Macau S.A.R., China
2013.08 @ CUHK, Hong Kong
Email: hanlifengaaron AT gmail DOT com
Homepage: http://www.linkedin.com/in/aaronhan
Slide 2.
• The importance of machine translation (MT) evaluation
• Automatic MT evaluation metrics: an introduction
  1. Lexical similarity
  2. Linguistic features
  3. Metrics combination
• Designed metric: the LEPOR series
  1. Motivation
  2. LEPOR metrics description
  3. Performance on the international ACL-WMT corpora
  4. Publications and open-source tools
• Other research interests and publications
Slide 3.
• Eager communication among people of different nationalities
– promotes translation technology
• Rapid development of machine translation
– machine translation (MT) began as early as the 1950s (Weaver, 1955)
– big progress since the 1990s, due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)
Slide 4.
• Some recent works in MT research:
– Och (2003) presents MERT (Minimum Error Rate Training) for log-linear SMT
– Su et al. (2009) use the Thematic Role Templates model to improve translation
– Xiong et al. (2011) employ the maximum-entropy model, etc.
– Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), became the main approaches in the MT literature.
Slide 5.
• How well do MT systems perform, and are they making progress?
• Difficulties of MT evaluation:
– language variability means there is no single correct translation
– natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003)
Slide 6.
• Traditional manual evaluation criteria:
– intelligibility (measuring how understandable the sentence is) and fidelity (measuring how much information the translated sentence retains compared to the original), used by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)
– adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), used by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)
Slide 7.
• Problems of manual evaluation:
– time-consuming
– expensive
– unrepeatable
– low agreement (Callison-Burch et al., 2011)
Slide 8.
2.1 Lexical similarity
2.2 Linguistic features
2.3 Metrics combination
Slide 9.
• Precision-based: BLEU (Papineni et al., 2002 ACL)
• Recall-based: ROUGE (Lin, 2004 WAS)
• Precision and recall: Meteor (Banerjee and Lavie, 2005 ACL)
Slide 10.
• Word-order based: NKT_NSR (Isozaki et al., 2010 EMNLP), PORT (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)
• Word-alignment based: AER (Och and Ney, 2003 J.CL)
• Edit-distance based: WER (Su et al., 1992 Coling), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)
Slide 11.
• Language model: LM-SVM (Gamon et al., 2005 EAMT)
• Shallow parsing: GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT)
• Semantic roles, named entities, morphology, synonymy, paraphrasing, discourse representation, etc.
Slide 12.
• MTeRater-Plus (Parton et al., 2011 WMT)
– combines BLEU, TERp (Snover et al., 2009) and Meteor (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)
• MPF & WMPBleu (Popovic, 2011 WMT)
– arithmetic mean of F-score and BLEU score
• SIA (Liu and Gildea, 2006 ACL)
– combines the advantages of n-gram-based metrics and loose-sequence-based metrics
Slide 13.
• LEPOR: an automatic machine translation evaluation metric considering Length Penalty, Precision, n-gram Position difference Penalty, and Recall.
Slide 14.
• Weaknesses in existing metrics:
– they perform well on certain language pairs but weakly on others, which we call the language-bias problem;
– they consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (hurting replicability), which we call the extremism problem;
– they use incomprehensive factors (e.g. BLEU focuses on precision only).
– What to do?
Slide 15.
• To address some of the existing problems:
– design tunable parameters to address the language-bias problem;
– use concise or optimized linguistic features for the linguistic extremism problem;
– design augmented factors.
Slide 16.
• Sub-factors: length penalty:

$LP = \begin{cases} \exp(1 - \frac{r}{c}) & : c < r \\ 1 & : c = r \\ \exp(1 - \frac{c}{r}) & : c > r \end{cases}$   (1)

• $r$: length of the reference sentence
• $c$: length of the candidate (system output) sentence
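A minimal sketch of Eq. (1) in Python (the function and argument names are mine, not from the paper; lengths are token counts of non-empty sentences):

```python
import math

def length_penalty(c: int, r: int) -> float:
    """Eq. (1): penalize candidates whose length c deviates from the
    reference length r; the penalty is 1 only when c == r."""
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0
```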
Slide 17.

$NPosPenal = \exp(-NPD)$   (2)

$NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|$   (3)

$PD_i = |MatchN_{output} - MatchN_{ref}|$   (4)

• $MatchN_{output}$: position of the matched token in the output sentence
• $MatchN_{ref}$: position of the matched token in the reference sentence
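A sketch of Eqs. (2)-(4), assuming the n-gram alignment step (Fig. 1) has already produced one (MatchN_output, MatchN_ref) position pair per matched token; the helper name is illustrative:

```python
import math

def npos_penal(matched_positions, output_length):
    """Eqs. (2)-(4): average absolute position difference over the
    output length, wrapped in exp(-NPD)."""
    npd = sum(abs(out_pos - ref_pos)  # Eq. (4): PD_i per matched token
              for out_pos, ref_pos in matched_positions) / output_length  # Eq. (3)
    return math.exp(-npd)             # Eq. (2)
```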
Slide 18. Fig. 1. N-gram word alignment algorithm
Slide 19. Fig. 2. Example of n-gram word alignment
Slide 20. Fig. 3. Example of NPD calculation
Slide 21.
• N-gram precision and recall:

$P_n = \frac{\#\,ngram\ matched}{\#\,ngram\ chunks\ in\ system\ output}$   (5)

$R_n = \frac{\#\,ngram\ matched}{\#\,ngram\ chunks\ in\ reference}$   (6)

$HPR = Harmonic(\alpha R_n, \beta P_n) = \frac{\alpha+\beta}{\frac{\alpha}{R_n} + \frac{\beta}{P_n}}$   (7)
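The same quantities in Python (a sketch; counting matched n-gram chunks is the aligner's job and is taken as given here):

```python
def ngram_scores(num_matched, num_output_chunks, num_ref_chunks,
                 alpha=1.0, beta=1.0):
    """Eqs. (5)-(7): n-gram precision, recall, and their weighted
    harmonic mean HPR."""
    p_n = num_matched / num_output_chunks              # Eq. (5)
    r_n = num_matched / num_ref_chunks                 # Eq. (6)
    hpr = (alpha + beta) / (alpha / r_n + beta / p_n)  # Eq. (7)
    return p_n, r_n, hpr
```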
Slide 22. Fig. 4. Example of bigram matching
Slide 23.
• LEPOR metrics:

$LEPOR = LP \times NPosPenal \times Harmonic(\alpha R, \beta P)$   (8)

$hLEPOR = Harmonic(w_{LP} LP,\ w_{NPosPenal} NPosPenal,\ w_{HPR} HPR) = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{LP}+w_{NPosPenal}+w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}$   (9)

$nLEPOR = LP \times NPosPenal \times \exp\big(\sum_{n=1}^{N} w_n \log HPR\big)$   (10)
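How the three variants combine the sub-factors, as a sketch (the default weights of 1.0 are placeholders, not tuned values from the paper):

```python
import math

def lepor(lp, npos_penal, p, r, alpha=1.0, beta=1.0):
    """Eq. (8): plain product of the sub-factors."""
    return lp * npos_penal * (alpha + beta) / (alpha / r + beta / p)

def hlepor(lp, npos_penal, hpr, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """Eq. (9): weighted harmonic mean of the three factors."""
    return (w_lp + w_np + w_hpr) / (
        w_lp / lp + w_np / npos_penal + w_hpr / hpr)

def nlepor(lp, npos_penal, hpr_per_n, weights):
    """Eq. (10): HPR scores for n = 1..N combined geometrically."""
    return lp * npos_penal * math.exp(
        sum(w * math.log(h) for w, h in zip(weights, hpr_per_n)))
```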
Slide 24.
• Example: employing linguistic features.
Fig. 5. Example of n-gram POS alignment
Fig. 6. Example of NPD calculation
Slide 25.
• Combination with linguistic features:

$hLEPOR_{final} = \frac{1}{w_{hw}+w_{hp}} (w_{hw}\, hLEPOR_{word} + w_{hp}\, hLEPOR_{POS})$   (11)

• $hLEPOR_{POS}$ and $hLEPOR_{word}$ apply the same algorithm to the POS sequence and the word sequence, respectively (a sketch of Eq. (11) follows).
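Eq. (11) is a simple weighted average; a one-line sketch:

```python
def hlepor_final(h_word, h_pos, w_hw=1.0, w_hp=1.0):
    """Eq. (11): weighted average of word-level and POS-level hLEPOR."""
    return (w_hw * h_word + w_hp * h_pos) / (w_hw + w_hp)
```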
Slide 26.
• With multiple references:
• select the alignment that results in the minimum NPD score (see the sketch below).
Fig. 7. N-gram alignment with multiple references
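A sketch of the multi-reference rule, reusing `npos_penal` from the earlier sketch: since NPosPenal = exp(-NPD) decreases monotonically in NPD, picking the minimum-NPD alignment is the same as keeping the alignment with the highest position penalty score:

```python
def best_multi_ref_alignment(alignments, output_length):
    """alignments: one matched-position list per reference; keep the
    one whose NPD is minimal (equivalently, exp(-NPD) maximal)."""
    return max(alignments, key=lambda m: npos_penal(m, output_length))
```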
Slide 27.
• How reliable is an automatic metric?
• Evaluation criteria for evaluation metrics:
– human judgments are currently the gold standard to approach.
• Correlation with human judgments:
– system-level correlation
– segment-level correlation
Slide 28.
• System-level correlation:
• Spearman rank correlation coefficient:

$\rho_{XY} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}$   (12)

– $X = \{x_1, \dots, x_n\}$, $Y = \{y_1, \dots, y_n\}$
• Pearson correlation coefficient:

$\rho_{XY} = \frac{\sum_{i=1}^{n} (x_i-\mu_x)(y_i-\mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i-\mu_x)^2} \sqrt{\sum_{i=1}^{n} (y_i-\mu_y)^2}}$   (13)
Slide 29.
• Segment-level Kendall's tau correlation:

$\tau = \frac{num\ concordant\ pairs - num\ discordant\ pairs}{total\ pairs}$   (14)

• The segment unit can be a single sentence or a fragment containing several sentences.
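All three coefficients, Eqs. (12)-(14), are available in SciPy; a toy check with invented scores for four systems:

```python
from scipy.stats import spearmanr, pearsonr, kendalltau

metric_scores = [0.62, 0.55, 0.71, 0.48]  # hypothetical metric scores
human_scores = [3.1, 2.8, 3.6, 2.4]       # hypothetical human judgments

rho_spearman, _ = spearmanr(metric_scores, human_scores)  # Eq. (12)
rho_pearson, _ = pearsonr(metric_scores, human_scores)    # Eq. (13)
tau, _ = kendalltau(metric_scores, human_scores)          # Eq. (14)
print(rho_spearman, rho_pearson, tau)
```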
Slide 30.
• Performance on the ACL-WMT 2011 corpora
• Two translation directions:
– English-to-other (Spanish, German, French, Czech)
– other-to-English
• System-level metrics:

$LEPOR_A = \frac{1}{num\_sent} \sum_{i=1}^{num\_sent} |LEPOR_i|$   (15)

$LEPOR_B = LP \times NPosPenal \times Harmonic(\alpha R, \beta P)$   (16)
Slide 31. Table 1. System-level Spearman correlation with human judgments on the WMT11 corpora
Slide 32.
• Performance on the ACL-WMT 2013 corpora
• Two translation directions:
– English-to-other (Spanish, German, French, Czech, and Russian)
– other-to-English
Slide 33.
• System-level & sentence-level
– LEPOR_v3.1: hLEPOR, nLEPOR_baseline

$hLEPOR = \frac{1}{num\_sent} \sum_{i=1}^{num\_sent} |hLEPOR_i|$   (17)

$hLEPOR_{final} = \frac{1}{w_{hw}+w_{hp}} (w_{hw}\, hLEPOR_{word} + w_{hp}\, hLEPOR_{POS})$   (18)

$nLEPOR = LP \times NPosPenal \times \exp\big(\sum_{n=1}^{N} w_n \log HPR\big)$   (19)
Slide 34. Tables 2 & 3. System-level Pearson correlation with human judgments
Slide 35. Tables 4 & 5. Segment-level Kendall's tau correlation with human judgments
Slide 36.
• LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors
– Aaron L.-F. Han, Derek F. Wong and Lidia S. Chao. Proceedings of COLING 2012: Posters, pages 441-450, Mumbai, India.
• Language-independent Model for Machine Translation Evaluation with Reinforced Factors
– Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT Summit 2013, Nice, France.
Slide 37.
• Language-independent MT evaluation, LEPOR: https://github.com/aaronlifenghan/aaron-project-lepor
• MT evaluation with linguistic features, hLEPOR: https://github.com/aaronlifenghan/aaron-project-hlepor
• English-French phrase tagset mapping and its application in unsupervised MT evaluation, HPPR: https://github.com/aaronlifenghan/aaron-project-hppr
• Unsupervised English-Spanish MT evaluation, EBLEU: https://github.com/aaronlifenghan/aaron-project-ebleu
• Projects homepage: https://github.com/aaronlifenghan
Slide 38.
• My research interests:
– Natural Language Processing
– Signal Processing
– Machine Learning
– Artificial Intelligence
– Pattern Recognition
• My past research works:
– Machine Translation Evaluation, Word Segmentation, Entity Recognition, Multilingual Treebanks
Slide 39.
• Other publications:
• A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task
– Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Yervant Ho, Yiming Wang, Zhou Jiaji. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria.
– ACL-WMT13 Metrics Task:
Our metrics are language independent
English-vs-other (French, Spanish, Czech, German, Russian)
They work at both the system level and the segment level
The official results show our metrics have advantages compared to others.
Slide 40.
• Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling
– Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Yervant Ho, Anson Xing. Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria.
– ACL-WMT13 Quality Estimation Task (no reference translation):
Task 1.1: sentence-level EN-ES quality estimation
Task 1.2: system selection, EN-ES, EN-DE (new)
Task 2: word-level QE, EN-ES, binary and multi-class classification (new)
We design a novel EN-ES POS tagset mapping and the metric EBLEU for task 1.1.
We explore Naive Bayes and Support Vector Machines for task 1.2.
We achieve the highest F1 score in task 2 using Conditional Random Fields.
Slide 41. Designed POS tagset mapping of Spanish (TreeTagger) to the universal tagset (Petrov et al., 2012)
Slide 42.

$EBLEU = 1 - MLP \times \exp\big(\sum w_n \log(H(\alpha R_n, \beta P_n))\big)$

$MLP = \begin{cases} e^{1-\frac{s}{h}} & \text{if } h < s \\ e^{1-\frac{h}{s}} & \text{if } h \ge s \end{cases}$

$P_n = \frac{\#\,common\ ngram\ chunks}{\#\,ngram\ chunks\ in\ target\ sentence}$, $R_n = \frac{\#\,common\ ngram\ chunks}{\#\,ngram\ chunks\ in\ source\ sentence}$
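The MLP factor as a sketch (following the slide's notation, h and s are the hypothesis and source lengths; the function name is mine):

```python
import math

def mlp(h: int, s: int) -> float:
    """EBLEU length penalty: symmetric in h and s, equal to 1 when h == s."""
    return math.exp(1 - s / h) if h < s else math.exp(1 - h / s)
```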
Slide 43.
• Bayes' rule:

$p(c_i \mid x_1, x_2, \dots, x_n) = \frac{p(x_1, x_2, \dots, x_n \mid c_i)\, p(c_i)}{p(x_1, x_2, \dots, x_n)}$

• SVM: find the point with the smallest margin to the hyperplane, then maximize this margin:

$\arg\max_{w,b} \Big\{ \min_n \big(label \cdot (w^T x + b)\big) \cdot \frac{1}{\|w\|} \Big\}$

• Conditional random fields:

$P(Y \mid X) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, Y|_e, X) + \sum_{v \in V,\,k} \mu_k g_k(v, Y|_v, X) \Big)$
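A toy scikit-learn sketch of the two classifiers (the feature vectors and labels are invented; the CRF used for the word-level task requires a sequence-labeling library and is omitted here):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X_train = [[0.6, 0.2], [0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]  # toy feature vectors
y_train = ["GOOD", "BAD", "GOOD", "BAD"]                    # toy quality labels

nb = GaussianNB().fit(X_train, y_train)           # Bayes' rule with Gaussian likelihoods
svm = SVC(kernel="linear").fit(X_train, y_train)  # maximum-margin hyperplane

print(nb.predict([[0.5, 0.4]]), svm.predict([[0.5, 0.4]]))
```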
Slide 44. Designed features for the CRF & NB algorithms
Slide 45. ACL-WMT13 word-level quality estimation task results
Slide 46.
• Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Shuo Li, Lynn Ling Zhu. In GSCL 2013, LNCS Vol. 8105. Volume editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
– German Society for Computational Linguistics (oral presentation):
To facilitate future research in unsupervised induction of syntactic structures
We design a French-English phrase tagset mapping
We propose a universal phrase tagset
Phrase tags are extracted from the French Treebank and the English Penn Treebank
We explore the use of the proposed mapping in unsupervised MT evaluation
Slide 47. Designed phrase tagset mapping for English and French
Slide 48. Evaluation based on parsing information from syntactic treebanks
Slide 49. Converting the word sequence into a universal phrase-tag sequence
Slide 50.

$HPPR = Har(w_{Ps}\, N_1PsDif,\ w_{Pr}\, N_2Pre,\ w_{Rc}\, N_3Rec)$

$N_1PsDif = \frac{1}{n} \sum_i N_1PsDif_i$

$N_2Pre = \exp\big(\sum_{n=1}^{N_2} w_n \log P_n\big)$

$N_3Rec = \exp\big(\sum_{n=1}^{N_3} w_n \log R_n\big)$
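A sketch of the HPPR combination only (computing the three component scores from the phrase-tag sequences is assumed done elsewhere; `Har` is read here as the weighted harmonic mean used in the earlier slides):

```python
def hppr(ps_dif, precision, recall, w_ps=1.0, w_pr=1.0, w_rc=1.0):
    """Weighted harmonic mean of the position-difference (N1),
    n-gram precision (N2) and n-gram recall (N3) components."""
    weights = (w_ps, w_pr, w_rc)
    values = (ps_dif, precision, recall)
    return sum(weights) / sum(w / v for w, v in zip(weights, values))
```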
Slide 51.
• A Study of Chinese Word Segmentation Based on the Characteristics of Chinese
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Lynn Ling Zhu, Shuo Li. Accepted. In GSCL 2013, LNCS Vol. 8105. Volume editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
– German Society for Computational Linguistics (poster paper):
Chinese text has no word boundaries, so Chinese word segmentation is a difficult problem
Word segmentation is crucial to word alignment in machine translation
We discuss the characteristics of Chinese and design optimized features
We formalize some problems and issues in Chinese word segmentation
Slide 52.
• Automatic Machine Translation Evaluation with Part-of-Speech Information
– Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho. In TSD 2013, Plzen, Czech Republic. LNAI Vol. 8082, pp. 121-128. Volume editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg.
– Text, Speech and Dialogue 2013 (oral presentation):
We explore unsupervised machine translation evaluation methods
We design the hLEPOR algorithm for the first time
We explore the use of POS in unsupervised MT evaluation
Experiments are performed on English vs. French and German
Slide 53.
• Chinese Named Entity Recognition with Conditional Random Fields in the Light of Chinese Characteristics
– Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. In Proceedings of LP&IIS, M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57-68, Warsaw, Poland. Springer-Verlag Berlin Heidelberg.
– Intelligent Information Systems 2013 (oral presentation):
Named entity recognition is important in IR, MT, text analysis, etc.
Chinese named entity recognition is more difficult because there are no word boundaries
We compare the performance of different algorithms: NB, CRF, SVM, ME
We analyze the characteristics of personal, location, and organization names respectively
We show the performance of different features and select the optimized set.
Slide 54.
• Ongoing and further works:
– the combination of translation and evaluation: tuning the translation model using evaluation metrics
– evaluation models from the perspective of semantics
– the exploration of unsupervised evaluation models, extracting features from source and target languages
Slide 55.
• In fact, the evaluation work is closely related to similarity measurement; so far I have employed it only in MT evaluation. It can be further developed for other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
Slide 56.
Q and A
Thanks for your attention!
Aaron L.-F. Han, 2013.08
Slides 57-62. References:
1. Weaver, Warren: Translation. In William Locke and A. Donald Booth, editors, Machine Translation of Languages: Fourteen Essays. John Wiley and Sons, New York, pages 15-23 (1955)
2. Marino, Jose B., Rafael E. Banchs, Josep M. Crego, Adria de Gispert, Patrik Lambert, Jose A. Fonollosa, Marta R. Costa-jussa: N-gram based machine translation. Computational Linguistics, Vol. 32, No. 4, pp. 527-549, MIT Press (2006)
3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In Proceedings of ACL 2003, pp. 160-167 (2003)
4. Su, Hung-Yu and Chung-Hsien Wu: Improving Structural Statistical Machine Translation for Sign Language With Small Corpus Using Thematic Role Templates as Translation Memory. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 7, September (2009)
5. Xiong, D., M. Zhang, H. Li: A Maximum-Entropy Segmentation Model for Statistical Machine Translation. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 8, pp. 2494-2505 (2011)
6. Carl, M. and A. Way (eds): Recent Advances in Example-Based Machine Translation. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)
7. Koehn, P.: Statistical Machine Translation (University of Edinburgh). Cambridge University Press (2010)
8. Arnold, D.: Why translation is difficult for computers. In Computers and Translation: A translator's guide. Benjamins Translation Library (2003)
9. Carroll, J. B.: An experiment in evaluating the quality of translation. In Pierce, J. (Chair), Languages and machines: computers in translation and linguistics. A report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pages 67-75 (1966)
10. White, J. S., O'Connell, T. A., and O'Mara, F. E.: The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In Proceedings of AMTA 1994, pp. 193-205 (1994)
11. Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin: A New Quantitative Quality Measure for Machine Translation Systems. In Proceedings of the 14th International Conference on Computational Linguistics, pages 433-439, Nantes, France, July (1992)
12. Tillmann, C., Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf: Accelerated DP Based Search For Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH-97) (1997)
13. Papineni, K., Roukos, S., Ward, T. and Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311-318, Philadelphia, PA, USA (2002)
14. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (HLT 2002), pages 138-145, San Diego, California, USA (2002)
15. Turian, J. P., Shen, L. and Melamed, I. D.: Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX, pages 386-393, New Orleans, LA, USA (2003)
16. Banerjee, S. and Lavie, A.: Meteor: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of ACL-WMT, pages 65-72, Prague, Czech Republic (2005)
17. Denkowski, M. and Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of ACL-WMT, pages 85-91, Edinburgh, Scotland, UK (2011)
18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J.: A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, pages 223-231, Boston, USA (2006)
19. Chen, B. and Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In Proceedings of ACL-WMT, pages 71-77, Edinburgh, Scotland, UK (2011)
20. Bicici, E. and Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In Proceedings of ACL-WMT, pages 323-329, Edinburgh, Scotland, UK (2011)
21. Shawe-Taylor, J. and N. Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
22. Wong, B. T-M and Kit, C.: Word choice and word position for automatic MT evaluation. In Workshop: MetricsMATR of AMTA, short paper, 3 pages, Waikiki, Hawai'i, USA (2008)
23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In Proceedings of EMNLP 2010, pages 944-952, Cambridge, MA (2010)
24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M. and Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings of the Sixth ACL-WMT, pages 12-21, Edinburgh, Scotland, UK (2011)
25. Song, X. and Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In Proceedings of ACL-WMT, pages 123-129, Edinburgh, Scotland, UK (2011)
26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In Proceedings of ACL-WMT, pages 104-107, Edinburgh, Scotland, UK (2011)
27. Popovic, M., Vilar, D., Avramidis, E. and Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In Proceedings of ACL-WMT, pages 99-103, Edinburgh, Scotland, UK (2011)
28. Petrov, S., Leon Barrett, Romain Thibaux, and Dan Klein: Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st ACL, pages 433-440, Sydney, July (2006)
29. Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pages 22-64, Edinburgh, Scotland, UK (2011)
30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M. and Zaidan, O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proceedings of ACL-WMT, pages 17-53, PA, USA (2010)
31. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Findings of the 2009 Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pages 1-28, Athens, Greece (2009)
32. Callison-Burch, C., Koehn, P., Monz, C. and Schroeder, J.: Further meta-evaluation of machine translation. In Proceedings of ACL-WMT, pages 70-106, Columbus, Ohio, USA (2008)
33. Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL-WMT), pages 65-70, Edinburgh, Scotland, UK (2011)