MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors

Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng
Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013). Nice, France. 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)

  1. 1. MT SUMMIT 2013 Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng September 2nd-6th, 2013, Nice, France Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory Department of Computer and Information Science University of Macau
  2. 2.  The importance of machine translation (MT) evaluation  Automatic MT evaluation metrics introduction 1. Lexical similarity 2. Linguistic features 3. Metrics combination  Designed metric: LEPOR Series 1. Motivation 2. LEPOR Metrics Description 3. Performances on international ACL-WMT corpora 4. Publications and Open source tools  Further information
  3. 3. • Eager communication among people of different nationalities – promotes translation technology • Rapid development of machine translation – machine translation (MT) began as early as the 1950s (Weaver, 1955) – big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)
  4. 4. • Some recent works in MT research: – Och (2003) presents MERT (Minimum Error Rate Training) for log-linear SMT – Su et al. (2009) use the Thematic Role Templates model to improve translation – Xiong et al. (2011) employ the maximum-entropy model, etc. – Data-driven methods, including example-based MT (Carl and Way, 2003) and statistical MT (Koehn, 2010), became the main approaches in the MT literature.
  5. 5. • How well do MT systems perform, and are they making progress? • Difficulties of MT evaluation – language variability means there is no single correct translation – natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)
  6. 6. • Traditional manual evaluation criteria: – intelligibility (measuring how understandable the sentence is) – fidelity (measuring how much information the translated sentence retains compared to the original), by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966) – adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)
  7. 7. • Problems of manual evaluation: – Time-consuming – Expensive – Unrepeatable – Low agreement (Callison-Burch et al., 2011)
  8. 8. 2.1 Lexical similarity 2.2 Linguistic features 2.3 Metrics combination
  9. 9. • Precision-based: BLEU (Papineni et al., 2002 ACL) • Recall-based: ROUGE (Lin, 2004 WAS) • Precision and recall: METEOR (Banerjee and Lavie, 2005 ACL)
  10. 10. • Word-order based: NKT_NSR (Isozaki et al., 2010 EMNLP), PORT (Chen et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA) • Word-alignment based: AER (Och and Ney, 2003 J. CL) • Edit-distance based: WER (Su et al., 1992 COLING), PER (Tillmann et al., 1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)
  11. 11. • Language model: LM-SVM (Gamon et al., 2005 EAMT) • Shallow parsing: GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel et al., 2012 WMT) • Semantic roles • Named entities, morphology, synonymy, paraphrasing, discourse representation, etc.
  12. 12. • MTeRater-Plus (Parton et al., 2011 WMT) – combines BLEU, TERp (Snover et al., 2009) and METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009) • MPF & WMPBleu (Popovic, 2011 WMT) – arithmetic mean of the F-score and the BLEU score • SIA (Liu and Gildea, 2006 ACL) – combines the advantages of n-gram-based metrics and loose-sequence-based metrics
  13. 13. • hLEPOR: harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference Penalty and Recall
  14. 14. • Weaknesses in existing metrics: – they perform well on certain language pairs but weakly on others, which we call the language-bias problem; – they consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem; – they rely on incomprehensive factors (e.g. BLEU focuses on precision only). – What to do?
  15. 15. • To address some of the existing problems: – design tunable parameters to address the language-bias problem; – use concise or optimized linguistic features to address the extremism problem; – design augmented factors.
  16. 16. • Sub-factors: enhanced length penalty
  • $ELP = e^{1 - r/c}$ if $c < r$; $ELP = e^{1 - c/r}$ if $c \geq r$   (1)
  • $r$: length of the reference sentence
  • $c$: length of the candidate (system-output) sentence
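A minimal Python sketch of Eq. (1), assuming r and c are the token counts of the reference and candidate sentences (the function name is illustrative, not taken from the released tool):

    import math

    def enhanced_length_penalty(c, r):
        # Penalize candidates that are shorter or longer than the reference (Eq. 1).
        # Assumes c > 0 and r > 0.
        if c < r:
            return math.exp(1.0 - r / c)
        return math.exp(1.0 - c / r)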
  17. 17. • $NPosPenal = \exp(-NPD)$   (2)
  • $NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|$   (3)
  • $PD_i = |MatchN_{output} - MatchN_{ref}|$   (4)
  • $MatchN_{output}$: position of the matched token in the output sentence
  • $MatchN_{ref}$: position of the matched token in the reference sentence
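A small sketch of Eqs. (2)-(4), assuming the n-gram word alignment (Figs. 1-2) has already produced the matched position pairs; treating unmatched output tokens as contributing no PD term is an assumption of this sketch:

    import math

    def n_pos_penal(aligned_pairs, output_length):
        # aligned_pairs: (position_in_output, position_in_reference) for each matched token,
        # with positions counted from 1 as in Fig. 2. Returns (NPosPenal, NPD).
        npd = sum(abs(o - r) for o, r in aligned_pairs) / output_length
        return math.exp(-npd), npd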
  18. 18. Fig. 1. N-gram word alignment algorithm
  19. 19. Fig. 2. Example of n-gram word alignment
  20. 20. Fig. 3. Example of NPD calculation
  21. 21. • N-gram precision and recall:
  • $Precision = \frac{Aligned_{num}}{Length_{output}}$   (5)
  • $Recall = \frac{Aligned_{num}}{Length_{reference}}$   (6)
  • $HPR = \frac{(\alpha + \beta)\, Precision \times Recall}{\alpha\, Precision + \beta\, Recall}$   (7)
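A sketch of Eqs. (5)-(7); alpha and beta are the tunable weights of the harmonic mean, and the default values below are only illustrative:

    def harmonic_precision_recall(aligned_num, output_len, ref_len, alpha=1.0, beta=1.0):
        # Precision and recall over the aligned tokens (Eqs. 5-6),
        # combined by a weighted harmonic mean (Eq. 7).
        precision = aligned_num / output_len
        recall = aligned_num / ref_len
        if precision == 0.0 or recall == 0.0:
            return 0.0
        return (alpha + beta) * precision * recall / (alpha * precision + beta * recall)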
  22. 22. • Sentence-level hLEPOR metric:
  • $hLEPOR = Harmonic(w_{LP} LP,\; w_{NPosPenal} NPosPenal,\; w_{HPR} HPR) = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}$   (8)
  • System-level hLEPOR metric:
  • $hLEPOR = \frac{1}{num_{sent}} \sum_{i=1}^{num_{sent}} hLEPOR_i$   (9)
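A sketch of Eqs. (8)-(9): the sentence-level score is the weighted harmonic mean of the three sub-factors, and the system-level score is the average over sentences; the weight defaults below are placeholders for the tuned values in Table 1:

    def hlepor_sentence(elp, npos_penal, hpr, w_lp=1.0, w_npp=1.0, w_hpr=1.0):
        # Weighted harmonic mean of the three factors (Eq. 8); all factors assumed > 0.
        return (w_lp + w_npp + w_hpr) / (w_lp / elp + w_npp / npos_penal + w_hpr / hpr)

    def hlepor_system(sentence_scores):
        # System-level score: arithmetic mean of the sentence-level scores (Eq. 9).
        return sum(sentence_scores) / len(sentence_scores)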
  23. 23. • Example, employment of linguistic features: Fig. 4. Example of n-gram POS alignment Fig. 5. Example of NPD calculation
  24. 24. • Enhanced version with linguistic features:
  • $hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} (w_{hw}\, hLEPOR_{word} + w_{hp}\, hLEPOR_{POS})$   (10)
  • The system-level scores $hLEPOR_{word}$ and $hLEPOR_{POS}$ use the same algorithm on the word sequence and the POS sequence respectively.
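A one-line sketch of Eq. (10); the weights w_hw and w_hp are tuned per language pair (Table 1), so the defaults here are placeholders:

    def hlepor_enhanced(hlepor_word, hlepor_pos, w_hw=1.0, w_hp=1.0):
        # Weighted combination of the word-level and POS-level system scores (Eq. 10).
        return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)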
  25. 25. • When there are multiple references: • Select the alignment that results in the minimum NPD score. Fig. 6. N-gram alignment with multiple references
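A sketch of the multi-reference rule on this slide: among the candidate alignments against the available references, keep the one with the smallest NPD (the data structure used here is an assumption):

    def select_alignment(candidate_alignments):
        # candidate_alignments: list of (aligned_pairs, npd) tuples, one per possible
        # alignment; the alignment with the minimum NPD score is kept.
        return min(candidate_alignments, key=lambda alignment: alignment[1])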
  26. 26. • How reliable is the automatic metric?
  • Evaluation criteria for evaluation metrics: human judgments are currently the gold standard to approach.
  • Correlation with human judgments: system-level Spearman rank correlation coefficient:
  • $\rho_{XY} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$   (11)
  • $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$
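A sketch of Eq. (11), computing the system-level Spearman rank correlation from the two rank vectors (metric ranking vs. human ranking); ties are not handled in this simplified form:

    def spearman_rho(metric_ranks, human_ranks):
        # Spearman rank correlation coefficient (Eq. 11) over n systems.
        n = len(metric_ranks)
        d_squared = sum((x - y) ** 2 for x, y in zip(metric_ranks, human_ranks))
        return 1.0 - 6.0 * d_squared / (n * (n * n - 1))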
  27. 27. • Training data (WMT08) – 2,028 sentences for each document – English vs Spanish/German/French/Czech • Testing data (WMT11) – 3,003 sentences for each document – English vs Spanish/German/French/Czech
  28. 28. Table 1. Values of the tuned parameters
  29. 29. Table 2. Correlation with human judgments on the WMT11 corpora
  30. 30. • Language-independent Model for Machine Translation Evaluation with Reinforced Factors – Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT Summit 2013. Nice, France. • Machine translation evaluation tool hLEPOR: https://github.com/aaronlifenghan/aaron-project-hlepor
  31. 31. • Ongoing and further works: – The combination of translation and evaluation, tuning the translation model using evaluation metrics – Evaluation models from the perspective of semantics – The exploration of unsupervised evaluation models, extracting features from source and target languages
  32. 32. • Broadly speaking, evaluation work is closely related to similarity measurement; here we have applied it to MT evaluation. These techniques can be further developed for other areas: – information retrieval – question answering – search – text analysis – etc.
  33. 33. MT SUMMIT 2013, September 2nd-6th, 2013, Nice, France Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing and Xiaodong Zeng Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory Department of Computer and Information Science University of Macau
