Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
ACL-WMT2013. A Description of Tunable Machine Translation Evaluation Systems i... - Lifeng (Aaron) Han
The document describes two machine translation evaluation systems, nLEPOR_baseline and LEPOR_v3.1, that were submitted to the WMT13 Metrics Task. nLEPOR_baseline is an n-gram based metric that considers modified sentence length penalty, position difference penalty, and n-gram precision and recall. LEPOR_v3.1 is an enhanced version that uses a harmonic mean to combine factors and includes part-of-speech information. Evaluation results showed LEPOR_v3.1 had the highest correlation of 0.86 with human judgments for English to other language pairs.
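As a rough illustration of how such factors combine multiplicatively, here is a minimal sketch in the spirit of the metric family described above. It is not the authors' exact nLEPOR formula: the length penalty shape, unigram counting, and the alpha/beta weights are simplified assumptions, and the position difference penalty is omitted.

```python
import math
from collections import Counter

def length_penalty(c, r):
    # Penalize length mismatch between candidate length c and reference
    # length r in either direction; 1.0 when the lengths agree.
    if c == r:
        return 1.0
    return math.exp(1 - (r / c if c < r else c / r))

def ngram_harmonic_prf(hyp, ref, n=1, alpha=1.0, beta=1.0):
    # Weighted harmonic mean of n-gram recall (weight alpha) and
    # n-gram precision (weight beta).
    hyp_grams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    common = sum((hyp_grams & ref_grams).values())  # clipped matches
    if common == 0:
        return 0.0
    precision = common / sum(hyp_grams.values())
    recall = common / sum(ref_grams.values())
    return (alpha + beta) / (alpha / recall + beta / precision)

def simple_lepor(hyp, ref):
    # Product of sub-scores over tokenized sentences.
    return length_penalty(len(hyp), len(ref)) * ngram_harmonic_prf(hyp, ref)
```

A perfect match scores 1.0, and any length mismatch or missed n-gram pulls the product below 1.0, which is the design intuition behind combining penalties multiplicatively.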
TSD2013 PPT. AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO... - Lifeng (Aaron) Han
Publisher: Springer-Verlag Berlin Heidelberg, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
COLING 2012 - LEPOR: A Robust Evaluation Metric for Machine Translation with ... - Lifeng (Aaron) Han
"LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors"
Publisher: Association for Computational Linguistics, December 2012
Authors: Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao
Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, December 2012. Open tool https://github.com/aaronlifenghan/aaron-project-lepor
LEPOR: an augmented machine translation evaluation metric - Thesis PPT - Lifeng (Aaron) Han
The document provides an overview of machine translation evaluation (MTE). It discusses existing MTE methods like BLEU, METEOR, WER, and their weaknesses. The author's thesis proposes a new metric called LEPOR that incorporates additional factors to address weaknesses. The additional factors include an enhanced length penalty, n-gram position difference penalty, and tunable parameters to handle cross-language performance differences. The thesis will experiment with LEPOR on various language pairs and shared tasks to evaluate its performance.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... - cscpconf
Source and target word segmentation and alignment is a primary step in the statistical learning of transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether this would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages). Our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report on dataset creation and selection, since standard datasets are not available.
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati... - Lifeng (Aaron) Han
This document discusses developing a statistical approach for measuring confidence intervals in translation quality evaluation and post-editing distance. It proposes modeling errors as independent binomial distributions and using Monte Carlo simulations to determine confidence intervals for different sample sizes. The simulations show that with samples of 100 sentences or less, the 95% confidence interval is too broad to reliably measure quality. A minimum sample of 30 pages is recommended to achieve a reasonable confidence level and narrower interval. Understanding confidence intervals provides a measure of reliability for translation quality scores.
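The simulation idea can be sketched in a few lines: model each translation unit as an independent Bernoulli error and measure how the empirical 95% interval of the observed error rate shrinks with sample size. The error rate, trial count, and seed below are illustrative choices, not the paper's settings.

```python
import random

def mc_interval(n_units, error_rate=0.05, trials=2000, seed=42):
    # Draw `trials` simulated samples of `n_units` translation units each,
    # where every unit contains an error independently with probability
    # `error_rate`; return the empirical 95% interval of the observed rate.
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.random() < error_rate for _ in range(n_units)) / n_units
        for _ in range(trials))
    return rates[int(0.025 * trials)], rates[int(0.975 * trials)]
```

At around 100 units the interval spans several percentage points on either side of the true rate; growing the sample narrows it substantially, which mirrors the conclusion that small samples cannot reliably measure quality.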
2010 PACLIC - pay attention to categories - WarNik Chow
This document summarizes a research paper on a proposed method called Metadata Projection Matrix (MPM) for sentence modeling that allows controlling attention to certain syntactic categories. The method uses a projection matrix to incorporate syntactic category information when calculating attention weights. Experimental results on several datasets show MPM outperforms baselines on tasks where attention to specific categories is important, like detecting terms or irony, but is weaker on more context-dependent tasks. The method is best suited to applications where syntactic structure significantly informs predictions.
IRJET- Automatic Language Identification using Hybrid Approach and Classifica... - IRJET Journal
This document presents a method for automatic language identification that uses a hybrid approach combining n-gram text processing and Naive Bayesian classification algorithms. The method first preprocesses text documents by removing special characters, suffixes, and generating tokens. It then extracts n-gram features from the text and calculates n-gram frequencies. Finally, it uses the n-gram frequencies as inputs to a Naive Bayesian classifier to identify the language of the document. The approach is able to identify languages like Hindi, English, Gujarati, and Sanskrit without requiring any prior information about the number of languages or initial partitioning of texts.
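A toy sketch of that pipeline follows: character n-gram frequencies fed into a multinomial Naive Bayes classifier with add-one smoothing. The tiny English/German training samples are made up for illustration; a real system would train on much larger corpora for the languages named above.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NGramLanguageID:
    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # per-language n-gram counts
        self.totals = Counter()             # per-language total n-grams

    def fit(self, samples):
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)

    def predict(self, text):
        vocab = {g for c in self.counts.values() for g in c}
        best, best_lp = None, float("-inf")
        for lang in self.counts:
            # Log-likelihood under add-one-smoothed n-gram frequencies.
            lp = sum(
                math.log((self.counts[lang][g] + 1) /
                         (self.totals[lang] + len(vocab)))
                for g in char_ngrams(text, self.n))
            if lp > best_lp:
                best, best_lp = lang, lp
        return best
```

Because the classifier only compares per-language n-gram likelihoods, adding a new language is just another call to `fit`, with no need to declare the number of languages up front.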
13. Constantin Orasan (UoW) Natural Language Processing for Translation - RIILP
This document discusses how natural language processing (NLP) techniques can help improve machine translation (MT). It describes some of the linguistic challenges in MT, such as ambiguity at the lexical, syntactic, semantic and pragmatic levels. It then discusses how various NLP tasks, such as tokenization, word sense disambiguation, and handling of named entities could enhance MT systems. Several studies that have successfully integrated NLP techniques like word sense disambiguation into statistical machine translation systems are also summarized.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application... - Lifeng (Aaron) Han
The document presents a method for unsupervised machine translation evaluation using universal phrase tags. It designs a mapping between phrase tags from different treebanks to 9 universal tags. An unsupervised metric called HPPR is introduced to measure similarity between the universal phrase sequences of the source and translated sentences. Experiments on French-English data show HPPR achieves promising correlations with human judgments without using reference translations.
This paper introduces a state-of-the-art machine translation (MT) evaluation survey that covers both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria. We classify the automatic evaluation methods into two categories: the lexical similarity scenario and the application of linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic and semantic features: the syntactic features include part-of-speech tags, phrase types, and sentence structures, while the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Subsequently, we also introduce methods for evaluating the MT evaluation metrics themselves, including different correlation scores, and the recent quality estimation (QE) tasks for MT.
This paper differs from existing works (Dorr et al., 2009; EuroMatrix, 2007) in several aspects: it introduces recent developments in MT evaluation measures, the different classifications from manual to automatic evaluation measures, the recent QE tasks for MT, and a concise organization of the content. For the latest version, please go to: https://arxiv.org/abs/1605.04515
Meta-evaluation of machine translation evaluation methods - Lifeng (Aaron) Han
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING - kevig
This paper describes the use of Naive Bayes to assign function tags and a context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging, and chunking information. We propose Myanmar grammar rules and apply the CFG to derive the parse tree of function-tagged Myanmar sentences. Experiments show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
This document discusses natural language processing and language models. It begins by explaining that natural language processing aims to give computers the ability to process human language in order to perform tasks like dialogue systems, machine translation, and question answering. It then discusses how language models assign probabilities to strings of text to determine if they are valid sentences. Specifically, it covers n-gram models which use the previous n words to predict the next, and how smoothing techniques are used to handle uncommon words. The document provides an overview of key concepts in natural language processing and language modeling.
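The n-gram idea in that overview can be made concrete as a bigram model with add-one smoothing; a minimal sketch (sentence markers, vocabulary handling, and the toy training data are illustrative choices):

```python
import math
from collections import Counter

class BigramModel:
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.vocab = set()
        for s in sentences:
            tokens = ["<s>"] + s.split() + ["</s>"]
            self.vocab.update(tokens)
            self.unigrams.update(tokens[:-1])             # contexts
            self.bigrams.update(zip(tokens, tokens[1:]))  # word pairs

    def prob(self, prev, word):
        # P(word | prev) with add-one (Laplace) smoothing, so unseen
        # bigrams still get a small nonzero probability.
        return (self.bigrams[(prev, word)] + 1) / (
            self.unigrams[prev] + len(self.vocab))

    def logprob(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log(self.prob(p, w))
                   for p, w in zip(tokens, tokens[1:]))
```

A well-ordered sentence receives a higher log-probability than a scrambled one, which is exactly how such models judge whether a string is a plausible sentence.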
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES - ijnlc
The document summarizes a technique for generating summaries using sentence compression and statistical measures. It first implements a graph-based technique to achieve sentence compression and information fusion. It then uses hand-crafted syntactic rules to prune compressed sentences. Finally, it uses probabilistic measures and word co-occurrence to obtain the summaries. The system can generate summaries at any user-defined compression rate.
Increasing interpreting needs call for a more objective and automatic measurement. We hold the basic idea that 'translating means translating meaning', so we can assess interpretation quality by comparing the meaning of the interpreting output with the source input. Specifically, a translation unit of a 'chunk' called a Frame, which comes from frame semantics, and its components called Frame Elements (FEs), which come from FrameNet, are proposed, and their matching rate between target and source texts is explored. A case study in this paper verifies the usability of this semi-automatic graded semantic-scoring measurement for human simultaneous interpreting and shows how to use Frame and FE matches to score. Experimental results show that the semantic-scoring metrics have a significant correlation coefficient with human judgment.
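In its simplest form, the matching idea reduces to a coverage rate over Frame Elements. This is a deliberately simplified sketch: the graded scoring described above is richer than the exact-match rate below, and the FE labels used here are illustrative strings.

```python
def fe_match_rate(source_fes, target_fes):
    # Fraction of the source's Frame Elements that reappear in the
    # interpreting output; 1.0 when everything is recovered.
    if not source_fes:
        return 1.0
    found = set(target_fes)
    return sum(1 for fe in source_fes if fe in found) / len(source_fes)
```

For a Frame whose source realizes the FEs Agent, Theme, and Time, an output that recovers only Agent and Theme would score 2/3 under this exact-match simplification.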
Unsupervised Quality Estimation Model for English to German Translation and I... - Lifeng (Aaron) Han
Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation
Publisher: Hindawi Publishing Corporation
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu
The Scientific World Journal, Issue: Recent Advances in Information Technology. ISSN:1537-744X. SCIE, IF=1.73. http://www.hindawi.com/journals/tswj/aip/760301/
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS - kevig
This article focuses on evaluating and comparing the available feature selection methods in general versatility regarding authorship attribution problems and tries to identify which method is the most effective. The discussions on general versatility of feature selection methods and its connection in selecting the appropriate features for varying data were done. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis results indicate the best feature selection method is different for each dataset; however, some methods can always extract useful information to discriminate the classes. The chi-square was proved to be a better method overall.
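For reference, the chi-square score for a single feature/class pair can be computed from its 2x2 contingency table with the standard formula; the toy counts in the test below are invented for illustration.

```python
def chi_square(n11, n10, n01, n00):
    # n11: in-class docs containing the feature; n10: out-of-class docs
    # containing it; n01: in-class docs without it; n00: the rest.
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if den == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / den
```

A feature that concentrates in one class scores high, while a feature spread evenly across classes scores zero, which is why chi-square ranks discriminative features ahead of uninformative ones.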
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES - kevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans.In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
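Such a correlation analysis typically compares human similarity scores for word pairs against cosine similarities of the corresponding embeddings, e.g. via Spearman's rank correlation. A minimal version (no tie handling, dense-list vectors rather than a real embedding lookup) looks like:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    # Spearman rank correlation between two score lists, assuming no ties.
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Feeding in the benchmark's human scores as one list and the per-pair cosine similarities as the other yields the correlation coefficient used to rank embedding methods.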
Machine translation systems can translate text from one language to another. Moses is an open-source statistical machine translation toolkit that is commonly used. It takes parallel text corpora to train models for translation. The Moses training process involves word alignment, phrase extraction, and language model building. The Moses decoder then translates new text using these statistical models.
Named Entity Recognition using Hidden Markov Model (HMM) - kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name, organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific
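The decoding step of such an HMM tagger is the Viterbi algorithm; a compact sketch follows. The two-tag state set and all probabilities in the test are toy values chosen for illustration, not trained estimates.

```python
import math

UNSEEN = 1e-8  # floor probability for words unseen in a state's emissions

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Most likely tag sequence for `obs` under the given HMM parameters,
    # computed in log space to avoid numerical underflow.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], UNSEEN))
          for s in states}]
    path = {s: [s] for s in states}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for s at position t.
            lp, prev = max(
                (V[t - 1][p] + math.log(trans_p[p][s]) +
                 math.log(emit_p[s].get(obs[t], UNSEEN)), p)
                for p in states)
            V[t][s] = lp
            new_path[s] = path[prev] + [s]
        path = new_path
    return path[max(states, key=lambda s: V[-1][s])]
```

Nothing in the decoder is tied to a fixed state inventory, which reflects the point above that the states are dynamic and the approach is language independent.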
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning - CITE
HU, Yuxiu (Harbin Institute of Technology Shenzhen Graduate School, China)
BODOMO, Adams (The University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_603.htm
Machine Translation System: Chhattisgarhi to Hindi - Padma Metta
The document discusses trends in machine translation, including different approaches such as direct machine translation, rule-based machine translation, and corpus-based machine translation. It provides examples of challenges in machine translation like word order, word sense disambiguation, and idioms. The document also describes the proposed methodology for Chhattisgarhi to Hindi machine translation, including lexical analysis, syntactic analysis, and rule-based conversion between the languages.
Indexing of Arabic documents automatically based on lexical analysis - kevig
The continuous information explosion through the Internet and all other information sources makes it necessary to perform all information processing activities automatically, in a quick and reliable manner. In this paper, we propose and implement a method to automatically create an index for books written in the Arabic language. The process depends largely on text summarization and abstraction to collect the main topics and statements in the book. It was evaluated in terms of accuracy and performance, and the results showed that it can effectively replace the effort of manually indexing books and documents, which can be very useful in all information processing and retrieval applications.
To enable multilingual communication for European citizens and public administrations, the European Commission has embarked on the creation of a large-scale automated translation platform, CEF.AT. Developed within the Connecting Europe Facility programme, the platform will power automated translation in Europe's public online services, helping to break down language barriers between people and nations in 21st century Europe. In this presentation, Andrejs Vasiljevs of Tilde will describe the first steps in building this important multilingual infrastructure – the identification and gathering of language resources relevant to national public services, administrations, and governmental institutions. These efforts are currently being conducted as part of the European Language Resource Coordination action, where Tilde is a partner.
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ... - Lifeng (Aaron) Han
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
1. The document proposes the TrueSkill algorithm as an improvement over existing models for ranking machine translation systems based on pairwise comparisons from human evaluators.
2. TrueSkill is shown to outperform baselines by requiring less training data to achieve accurate rankings while also better predicting pairwise preferences.
3. It functions by modeling systems as distributions that are efficiently updated online during a matching process, unlike batch models, allowing more effective data collection and system clustering from fewer annotations.
The document presents a Gedankenexperiment to evaluate machine translation without psychological or epistemological biases. It proposes two scenarios: the default scenario where a reviewer evaluates a source text in one language and its machine translation in another language, and the neutral scenario where the source text and machine translation are both in controlled quasi-natural languages to avoid biases. It then addresses 5 counterobjections to the neutral scenario approach from perspectives of cognition, practicality, artificial languages, typology, and human interfaces.
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET Journal
This document presents a method for automatic language identification that uses a hybrid approach combining n-gram text processing and Naive Bayesian classification algorithms. The method first preprocesses text documents by removing special characters, suffixes, and generating tokens. It then extracts n-gram features from the text and calculates n-gram frequencies. Finally, it uses the n-gram frequencies as inputs to a Naive Bayesian classifier to identify the language of the document. The approach is able to identify languages like Hindi, English, Gujarati, and Sanskrit without requiring any prior information about the number of languages or initial partitioning of texts.
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
This document discusses how natural language processing (NLP) techniques can help improve machine translation (MT). It describes some of the linguistic challenges in MT, such as ambiguity at the lexical, syntactic, semantic and pragmatic levels. It then discusses how various NLP tasks, such as tokenization, word sense disambiguation, and handling of named entities could enhance MT systems. Several studies that have successfully integrated NLP techniques like word sense disambiguation into statistical machine translation systems are also summarized.
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
The document presents a method for unsupervised machine translation evaluation using universal phrase tags. It designs a mapping between phrase tags from different treebanks to 9 universal tags. An unsupervised metric called HPPR is introduced to measure similarity between the universal phrase sequences of the source and translated sentences. Experiments on French-English data show HPPR achieves promising correlations with human judgments without using reference translations.
This paper introduces the state-of-the-art machine translation (MT) evaluation survey that contains both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include the intelligibility, fidelity, fluency , adequacy, comprehension, and in-formativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteriea, etc. We classify the automatic evaluation methods into two categories , including lexical similarity scenario and linguistic features application. The lexical similarity methods contain edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features respectively. The syntactic features include part of speech tag, phrase types and sentence structures, and the semantic features include named entity, synonyms , textual entailment, paraphrase, semantic roles, and language models. Subsequently , we also introduce the evaluation methods for MT evaluation including different correlation scores, and the recent quality estimation (QE) tasks for MT.
This paper differs from the existing works (Dorr et al., 2009; EuroMatrix, 2007) from several aspects, by introducing some recent development of MT evaluation measures, the different classifications from manual to automatic evaluation measures, the introduction of recent QE tasks of MT, and the concise construction of the content. For latest version, please goto: https://arxiv.org/abs/1605.04515
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
This paper describes the use of Naive Bayes to address the task of assigning function tags and context free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging, and chunking information. We propose Myanmar grammar rules and apply CFG to derive the parse tree of function-tagged Myanmar sentences. Experiments show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
This document discusses natural language processing and language models. It begins by explaining that natural language processing aims to give computers the ability to process human language in order to perform tasks like dialogue systems, machine translation, and question answering. It then discusses how language models assign probabilities to strings of text to determine if they are plausible sentences. Specifically, it covers n-gram models, which use the previous n−1 words to predict the next word, and how smoothing techniques are used to handle unseen or rare words. The document provides an overview of key concepts in natural language processing and language modeling.
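As a concrete illustration of the n-gram and smoothing ideas summarized above, here is a minimal sketch (the function names and toy corpus are my own, not from the document) of a bigram model with add-one smoothing, so that unseen word pairs still receive non-zero probability:

```python
from collections import Counter

def train_bigram(sentences):
    # Count unigrams and bigrams over sentence-boundary-padded text.
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def prob(uni, bi, prev, word):
    # Add-one (Laplace) smoothing: every bigram count is incremented by 1,
    # and the denominator grows by the vocabulary size.
    v = len(uni)
    return (bi[(prev, word)] + 1) / (uni[prev] + v)

uni, bi = train_bigram(["the cat sat", "the cat ran"])
# Seen bigram ("the", "cat") scores higher than unseen ("the", "sat").
```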
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURESijnlc
The document summarizes a technique for generating summaries using sentence compression and statistical measures. It first implements a graph-based technique to achieve sentence compression and information fusion. It then uses hand-crafted syntactic rules to prune compressed sentences. Finally, it uses probabilistic measures and word co-occurrence to obtain the summaries. The system can generate summaries at any user-defined compression rate.
Increasing interpreting demand calls for a more objective and automatic measurement. We hold the basic idea that 'translating means translating meaning', so we can assess interpretation quality by comparing the meaning of the interpreting output with the source input. That is, a translation unit of a 'chunk' named Frame, which comes from frame semantics, and its components named Frame Elements (FEs), which come from FrameNet, are proposed to explore their matching rate between target and source texts. A case study in this paper verifies the usability of semi-automatic graded semantic-scoring measurement for human simultaneous interpreting and shows how to use Frame and FE matches to score. Experimental results show that the semantic-scoring metrics have a significant correlation coefficient with human judgment.
Unsupervised Quality Estimation Model for English to German Translation and I...Lifeng (Aaron) Han
• Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation
o Hindawi Publishing Corporation
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu
The Scientific World Journal, Issue: Recent Advances in Information Technology. ISSN:1537-744X. SCIE, IF=1.73. http://www.hindawi.com/journals/tswj/aip/760301/
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
This article evaluates and compares available feature selection methods in terms of their general versatility on authorship attribution problems and tries to identify which method is the most effective. It discusses the general versatility of feature selection methods and their role in selecting appropriate features for varying data. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine) classifiers, and different criteria for ranking feature selection methods were used together to measure the general versatility of these methods. The analysis indicates that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the better method overall.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pair similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
Machine translation systems can translate text from one language to another. Moses is an open-source statistical machine translation toolkit that is commonly used. It takes parallel text corpora to train models for translation. The Moses training process involves word alignment, phrase extraction, and language model building. The Moses decoder then translates new text using these statistical models.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail the Hidden Markov Model (HMM) based machine learning approach to identifying named entities. The main idea behind using an HMM to build a NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed but dynamic in nature, so one can configure them according to one's interest. The corpus used by our NER system is also not domain specific.
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningCITE
HU, Yuxiu (Harbin Institute of Technology Shenzhen Graduate School, China)
BODOMO, Adams (The University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_603.htm
Machine Translation System: Chhattisgarhi to HindiPadma Metta
The document discusses trends in machine translation, including different approaches such as direct machine translation, rule-based machine translation, and corpus-based machine translation. It provides examples of challenges in machine translation like word order, word sense disambiguation, and idioms. The document also describes the proposed methodology for Chhattisgarhi to Hindi machine translation, including lexical analysis, syntactic analysis, and rule-based conversion between the languages.
Indexing of Arabic documents automatically based on lexical analysis kevig
The continuous information explosion through the Internet and all information sources makes it necessary to perform all information processing activities automatically, quickly, and reliably. In this paper, we propose and implement a method to automatically create an index for books written in the Arabic language. The process depends largely on text summarization and abstraction to collect the main topics and statements in a book. The process was assessed in terms of accuracy and performance, and the results showed that it can effectively replace the effort of manually indexing books and documents, which can be very useful in all information processing and retrieval applications.
To enable multilingual communication for European citizens and public administrations, the European Commission has embarked on the creation of a large-scale automated translation platform, CEF.AT. Developed within the Connecting Europe Facility programme, the platform will power automated translation in Europe's public online services, helping to break down language barriers between people and nations in 21st century Europe. In this presentation, Andrejs Vasiljevs of Tilde will describe the first steps in building this important multilingual infrastructure – the identification and gathering of language resources relevant to national public services, administrations, and governmental institutions. These efforts are currently being conducted as part of the European Language Resource Coordination action, where Tilde is a partner.
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
1. The document proposes the TrueSkill algorithm as an improvement over existing models for ranking machine translation systems based on pairwise comparisons from human evaluators.
2. TrueSkill is shown to outperform baselines by requiring less training data to achieve accurate rankings while also better predicting pairwise preferences.
3. It functions by modeling systems as distributions that are efficiently updated online during a matching process, unlike batch models, allowing more effective data collection and system clustering from fewer annotations.
The document presents a Gedankenexperiment to evaluate machine translation without psychological or epistemological biases. It proposes two scenarios: the default scenario where a reviewer evaluates a source text in one language and its machine translation in another language, and the neutral scenario where the source text and machine translation are both in controlled quasi-natural languages to avoid biases. It then addresses 5 counterobjections to the neutral scenario approach from perspectives of cognition, practicality, artificial languages, typology, and human interfaces.
MT SUMMIT2013 poster boaster slides.Language-independent Model for Machine Tr...Lifeng (Aaron) Han
The document proposes a novel language-independent evaluation metric for machine translation that aims to address the weaknesses of existing metrics. It combines several factors, including enhanced length penalty, n-gram position difference penalty, and precision and recall, to calculate sentence-level and system-level scores. The proposed metric is evaluated on several language pairs from the WMT datasets and is shown to have high correlation with human judgments.
NLP Professional Publication and Presentation Links.Aaron 2011-2013Lifeng (Aaron) Han
Aaron has given several professional presentations and published papers from 2011-2013 related to his work in natural language processing. This includes presentations on machine translation evaluation given in Hong Kong in 2013 and France in 2013, as well as a presentation on Chinese named entity recognition given in Poland in 2013. Aaron has also published several papers in peer-reviewed conferences and journals, including papers on machine translation evaluation metrics, Chinese word segmentation, and Chinese named entity recognition. Links are provided to view the slides from Aaron's presentations and download copies of his published papers.
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
Rule-based machine translation systems were evaluated based on errors in translations from English to Persian. Several error categories were identified including syntactic errors (word order, missing words, parts of speech), unknown words, and semantic errors (incorrect words, idiomatic expressions). Three texts (a short story, user guide, and magazine article) were translated using two machine translation systems and analyzed sentence-by-sentence to identify errors according to the defined categories.
The document summarizes the BLEU method for automatically evaluating machine translation systems. BLEU calculates n-gram precision between a candidate translation and multiple reference translations, with modifications (count clipping) to address weaknesses of plain precision. It combines the average logarithm of the modified n-gram precisions with a brevity penalty for translations shorter than the references. Evaluation tests on multiple translation systems found that BLEU scores reliably distinguished system quality and correlated well with human judgements.
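To make the mechanics concrete, here is a minimal, hedged sketch of BLEU's two ingredients (clipped n-gram precision and the brevity penalty). It is an illustration, not the official implementation: the floor value used to avoid log(0) is my own choice, and real implementations use more careful smoothing.

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(zip(*[cand[i:] for i in range(n)]))
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for g, c in Counter(zip(*[ref[i:] for i in range(n)])).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        # Uniformly weighted geometric mean of modified precisions.
        log_p += math.log(max(clipped, 1e-9) / total) / max_n
    # Brevity penalty: punish candidates shorter than the closest reference.
    r = min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
    bp = 1.0 if len(cand) > r else math.exp(1 - r / max(len(cand), 1))
    return bp * math.exp(log_p)
```

A perfect match scores 1.0; a too-short candidate is penalized both by the missing higher-order n-grams and by the brevity penalty.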
Shallow-transfer rule-based machine translation from Czech to PolishJim O'Regan
This document summarizes the development of a shallow-transfer rule-based machine translation system from Czech to Polish. It first provides background on the two languages, noting their shared Slavic roots and similarities in inflection and word order. It then discusses the historical divergences between their vocabularies due to different experiences with Germanization. The document also presents the rule-based system developed under Google Summer of Code and justifies the choice of this approach over statistical machine translation due to data scarcity issues and a desire to create open-source software.
LEPOR: an augmented machine translation evaluation metric Lifeng (Aaron) Han
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and tell whether the translation system has made an improvement or not. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes show low agreement. On the other hand, the popular automatic MT evaluation methods have some weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but weakly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metric hard to replicate and apply to other language pairs. Thirdly, some popular metrics use incomprehensive factors, which results in low performance on some practical tasks.
In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of each language. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance with some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
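As a hedged sketch of two of the augmented factors described above, the following follows the general shape of the enhanced length penalty and the n-gram position difference penalty from the LEPOR papers, but the word-matching step is a deliberate simplification of my own (each output token is aligned to the nearest same token in the reference):

```python
import math

def length_penalty(c, r):
    # Enhanced length penalty: penalizes outputs that are either
    # shorter (c < r) or longer (c > r) than the reference length.
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0

def npos_penalty(output, reference):
    # Position difference penalty (simplified): sum the normalized
    # position gaps of matched words, then map to (0, 1] via exp(-npd).
    out, ref = output.split(), reference.split()
    npd = 0.0
    for i, w in enumerate(out):
        positions = [j for j, t in enumerate(ref) if t == w]
        if positions:
            j = min(positions, key=lambda p: abs(p / len(ref) - i / len(out)))
            npd += abs(i / len(out) - j / len(ref))
    return math.exp(-npd / len(out))
```

Both factors equal 1.0 for a perfectly matching output and decay smoothly as length or word order diverges; in the full metric they are combined with precision and recall under tunable weights.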
11. manuel leiva & juanjo arevalillo (hermes) evaluation of machine translationRIILP
The document discusses a company's evaluation of their machine translation systems. They had hoped automated metrics would correlate with productivity gains reported by post-editors, but found no correlation. Reasons for variability included different translation environments, engines, clients, post-editors, and word volumes. While some metrics indicated better translation quality, other factors like automatic terminology tools impacted productivity more. The company now combines automated metrics with time/productivity data and qualitative reviews to evaluate their machine translation performance.
An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Q...Kyoshiro Sugiyama
In these slides, I described translation quality with regards to cross-lingual question-answering tasks.
Our investigation makes clear what kind of evaluation metric is appropriate for question answering tasks, and what kind of mistranslations affect accuracy of question answering.
ESR12 Hanna Bechara - EXPERT Summer School - Malaga 2015RIILP
This document discusses using semantic similarity measures to evaluate machine translation quality. It explores how translation quality can be defined in terms of meaning preservation, fluency, and matching reference translations. The document then examines semantic textual similarity as a way to quantify similarity between texts on a scale from 0 to 5. It presents experiments using semantic similarity scores and BLEU scores as features to predict translation quality, finding these semantic approaches outperform baselines. The document concludes that while semantic features improve quality estimation, access to semantically similar texts is needed to apply these methods.
The machine and the crowd – the translator's new friends?Jari Herrgård
What are machine translation and crowdsourced translation? How do they affect the work of translation professionals and the field as a whole?
Presentation from the International Translation Day seminar held on 28 September 2012 in Helsinki.
ACL-WMT13 poster.Quality Estimation for Machine Translation Using the Joint M...Lifeng (Aaron) Han
Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling
Publisher: Association for Computational Linguistics2013
Authors: Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Yervant Ho, Anson Xing
Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT 2013), 8-9 August 2013. Sofia, Bulgaria. Open tool https://github.com/aaronlifenghan/aaron-project-ebleu (ACM digital library, ACL anthology)
10. Lucia Specia (USFD) Evaluation of Machine TranslationRIILP
This document discusses various methods for evaluating translation quality, including manual metrics, task-based metrics, and reference-based automatic metrics. It notes that evaluating translation quality is difficult because the definition of quality depends on factors like the end user and intended purpose. Methods discussed include n-point scales for adequacy and fluency, ranking translations, and counting errors. Issues with subjective judgments, reliability, and defining what makes a translation "best" are also covered.
TSD2013.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFORMATIONLifeng (Aaron) Han
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
ACL-WMT2013.Quality Estimation for Machine Translation Using the Joint Method...Lifeng (Aaron) Han
Proceedings of the ACL 2013 EIGHTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT 2013), 8-9 August 2013. Sofia, Bulgaria. Open tool https://github.com/aaronlifenghan/aaron-project-ebleu (ACM digital library, ACL anthology)
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have some weaknesses:
– they perform well on certain language pairs but weakly on others, which we call the language-bias problem;
– they consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– they are designed with incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
Experiments on the ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. The evaluation work is closely related to similarity measurement, so it can be carried over to other areas such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which has been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further cooperation are welcome.
Automated evaluation of coherence in student essays.pdfSarah Marie
This document summarizes a study that explored using Centering Theory to develop an automated metric for essay coherence. The study aimed to improve the performance of e-rater, an existing automated essay scoring system, by adding a new feature based on Centering Theory. Specifically, the percentage of "Rough Shifts" identified through Centering Theory analysis was tested as a predictor of essay coherence and scoring. The results showed that incorporating this new metric improved e-rater's ability to approximate human scores and provide more instructionally useful feedback to students.
The document summarizes an academic thesis defense presentation on evaluating machine translation. It introduces the background of machine translation evaluation (MTE), existing MTE methods like BLEU, METEOR, WER, and their weaknesses. It then outlines the designed model for a new MTE metric called LEPOR, including designed factors like an enhanced length penalty and n-gram position difference penalty. The document concludes by discussing experiments, enhanced models, and applications in shared tasks to evaluate LEPOR's performance.
The document describes a system for semantic textual similarity (STS) that uses various techniques to estimate the semantic similarity between texts. The system combines lexical, syntactic, and semantic information sources using state-of-the-art algorithms. In SemEval 2016 tasks, the system achieved a mean Pearson correlation of 75.7% on the monolingual English task and 86.3% on the cross-lingual Spanish-English task, ranking first in the cross-lingual task. The system utilizes techniques such as word embeddings, paragraph vectors, tree-structured LSTMs, and word alignment to capture semantic similarity.
Measuring word alignment_quality_for_statistical_machine_translation_tcm17-29663Yafi Azhari
1) Previous metrics like Alignment Error Rate (AER) did not strongly correlate with statistical machine translation quality when word alignments were varied, showing the need for a new alignment quality metric.
2) The paper proposes using balanced F-measure with different precision-recall weights (α values) to measure alignment quality, which showed much higher correlation to translation quality across three language pairs than AER.
3) The best α values were 0.3-0.6, showing recall is somewhat more important than precision for alignment quality as it relates to translation quality. This new metric provides a better predictive measurement of how alignment quality impacts machine translation.
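The α-weighted F-measure referred to above is, in its usual formulation (notation mine), F_α = 1 / (α/P + (1−α)/R), so that α below 0.5 shifts weight toward recall:

```python
def f_alpha(precision, recall, alpha=0.5):
    # Weighted harmonic mean of precision and recall.
    # alpha = 0.5 recovers the standard F1; alpha < 0.5 weights recall
    # more heavily, matching the paper's finding that recall matters
    # somewhat more for alignment quality.
    return 1.0 / (alpha / precision + (1 - alpha) / recall)
```

For example, with P = 0.9 and R = 0.4, lowering α pulls the score toward the weaker recall, which is exactly the behavior the paper exploits when tuning α against translation quality.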
Phonetic Recognition In Words For Persian Text To Speech Systemspaperpublications3
Abstract: Interest in text-to-speech synthesis has increased worldwide. Text-to-speech systems have been developed for many popular languages such as English, Spanish, and French, and much research and development has been applied to those languages. Persian, on the other hand, has been given little attention compared to other languages of similar importance, and research on Persian is still in its infancy. The Persian language has many difficulties and exceptions that increase the complexity of text-to-speech systems; for example, short vowels are absent in written text, and homograph words exist. In this paper we propose a new method for Persian text-to-phonetic conversion based on pronunciation by analogy (PbA) over words, semantic relations, and grammatical rules for finding the proper phonetics. Keywords: PbA, text to speech, Persian language, phonetic recognition.
Title:Phonetic Recognition In Words For Persian Text To Speech Systems
Author:Ahmad Musavi Nasab, Ali Joharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
A Pilot Study On Computer-Aided Coreference AnnotationDarian Pruitt
This document describes a pilot study on using automatic coreference resolution to aid human annotation of coreference. It finds that pre-annotating data with the predictions of an existing coreference system can both reduce the time needed for human annotation and decrease error rates, by reducing the task to checking and modifying existing annotations rather than creating everything from scratch. The study uses the output of an automatic system for resolving nominal anaphora to pre-annotate German newspaper text, which is then manually edited using annotation software.
Two Level Disambiguation Model for Query TranslationIJECEIAES
Selection of the most suitable translation among all translation candidates returned by a bilingual dictionary has always been quite a challenging task for cross-language query translation. Researchers have frequently tried to use word co-occurrence statistics to determine the most probable translation of a user query. Algorithms using such statistics have certain shortcomings, which are the focus of this paper. We propose a novel method for ambiguity resolution, named the 'two-level disambiguation model'. At the first level of disambiguation, the model properly weighs the importance of the translation alternatives of query terms obtained from the dictionary. The importance factor measures the probability of a translation candidate being selected as the final translation of a query term, which removes the problem of making a binary decision over translation candidates. At the second level of disambiguation, the model treats the user query as a single concept and deduces the translations of all query terms simultaneously, also taking into account the weights of the translation alternatives. This is contrary to previous research, which selects a translation for each word in the source-language query independently. Experimental results with English-Hindi cross-language information retrieval show that the proposed two-level disambiguation model achieved 79.53% and 83.50% of monolingual translation performance and 21.11% and 17.36% improvement over greedy disambiguation strategies in terms of MAP for short and long queries respectively.
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deren Lei
Deep reinforcement learning (RL) has been a commonly-used strategy for the abstractive summarization task to address both the exposure bias and non-differentiable task issues. However, the conventional reward ROUGE-L simply looks for exact n-grams matches between candidates and annotated references, which inevitably makes the generated sentences repetitive and incoherent. In this paper, we explore the practicability of utilizing the distributional semantics to measure the matching degrees. Our proposed distributional semantics reward has distinct superiority in capturing the lexical and compositional diversity of natural language.
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
Machine Learning Techniques with Ontology for Subjective Answer Evaluationijnlc
Computerized evaluation of English essays is performed using machine learning techniques like Latent Semantic Analysis (LSA), Generalized LSA, Bilingual Evaluation Understudy, and Maximum Entropy. An ontology, a concept map of domain knowledge, can enhance the performance of these techniques. Use of an ontology makes the evaluation process holistic, as the presence of keywords, synonyms, the right word combinations, and coverage of concepts can be checked. In this paper, the above-mentioned techniques are implemented both with and without an ontology and tested on common input data consisting of technical answers from Computer Science. A domain ontology of Computer Graphics is designed and developed. The software used for implementation includes the Java programming language and tools such as MATLAB, Protégé, etc. Ten questions from Computer Graphics, with sixty answers for each question, are used for testing. The results are analyzed, and it is concluded that they are more accurate with the use of the ontology.
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...Kim Daniels
This document summarizes a study that compares different term reduction methods for representing the vocabulary of literary texts. Specifically, it examines the effectiveness of the entropy method, transition point method, and a hybrid method in reducing the size of the vocabulary from a collection of Quran texts, while preserving important terms. The results indicate that the transition point method was most effective at reducing the vocabulary size without losing important terms, compared to the other methods.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
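An intrinsic evaluation of this kind reduces to correlating embedding cosine similarities with benchmark judgment scores. The toy vectors, word pairs, and human scores below are invented purely for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman's rho via Pearson correlation of ranks (no ties assumed)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx = sum(rx) / len(rx); my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy embeddings and human similarity judgments (illustrative only).
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("dog", "car", 2.5)]
model_sims = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_sims = [h for _, _, h in pairs]
rho = spearman(model_sims, human_sims)
```

A higher rho on a benchmark word-pair dataset indicates that the embedding space better captures the similarities humans perceive.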
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...Lifeng (Aaron) Han
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists for their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have long been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium translations. In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs), reflecting error severity level, for each translation unit. Initial experimental work carried out on English-Russian MT outputs on marketing content from a highly technical domain reveals that our evaluation framework is effective in reflecting MT output quality with regard to both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation. The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE
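The geometric-progression scoring model can be sketched as follows. The penalty base and severity scale are assumptions chosen for illustration, not HOPE's published values.

```python
# Sketch of an EPP-style scorer: each error carries a severity level, and
# penalty points grow geometrically with severity (base 2 is chosen here
# purely for illustration; the framework's actual scale may differ).

def epp(severity, base=2):
    """Error penalty points for severity levels 1, 2, 3, ... -> 1, 2, 4, ..."""
    return base ** (severity - 1)

def segment_score(errors, base=2):
    """Total penalty for one translation unit given its error severities."""
    return sum(epp(s, base) for s in errors)

# A unit with one minor (severity 1) and one critical (severity 3) error:
score = segment_score([1, 3])  # 1 + 4 = 5
```

Because penalties grow geometrically rather than linearly, a single severe error dominates several minor ones, which matches how post-editors perceive the effort needed to fix a segment.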
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
Text analysis has been attracting increasing attention in this data era, and selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed; however, it remains unclear which is the most effective in practice. This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify the most effective one. It discusses the general versatility of feature selection methods and their connection to selecting appropriate features for varying data. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used together to measure the general versatility of these methods. The analysis indicates that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes, and chi-square proved to be the best method overall.
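Chi-square term scoring, as used in such comparisons, can be sketched with the standard 2x2 contingency formulation. This is generic feature-selection code, not the study's own implementation.

```python
# Chi-square score for one term/class pair from a 2x2 contingency table.

def chi2_term(n11, n10, n01, n00):
    """n11: docs in class containing the term, n10: docs in class without it,
    n01: docs outside the class with the term, n00: docs outside without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

# Terms with higher chi-square are more strongly associated with the class,
# so ranking terms by this score and keeping the top-k reduces dimensionality:
score = chi2_term(n11=40, n10=10, n01=5, n00=45)
```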
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
This paper describes a universal phrase tagset mapping between the French Treebank and English Penn Treebank using 9 phrase categories. It then applies this mapping to an unsupervised machine translation evaluation method that calculates similarity between the source and target sentences without reference translations. The method extracts phrase tags from the source and target, maps them to universal tags, and measures n-gram precision, recall, and position difference as similarity metrics. Evaluation on French-English data shows promising correlation with human judgments, though there is still room for improvement. The tagset and methods could facilitate future multilingual research.
The document presents a knowledge-based method for measuring semantic similarity between texts. It combines word-to-word semantic similarity metrics with information about word specificity to calculate a text-to-text similarity score. An example application shows how word similarity scores from WordNet are combined using the Wu & Palmer metric to determine the semantic similarity between two text segments. The method is evaluated on paraphrase identification tasks and shown to outperform approaches based only on lexical matching.
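The combination described above can be sketched as follows. The tiny word-similarity table and idf values are invented stand-ins for the WordNet-based metrics (such as Wu & Palmer) and corpus statistics the method actually uses.

```python
# Text-to-text similarity combining word-to-word similarity with word
# specificity (idf), in the symmetric directed form described above.

WORD_SIM = {("car", "automobile"): 1.0, ("ride", "drive"): 0.8}
IDF = {"car": 2.0, "automobile": 2.0, "ride": 1.5, "drive": 1.5, "a": 0.1, "i": 0.1}

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return WORD_SIM.get((w1, w2), WORD_SIM.get((w2, w1), 0.0))

def directed(t1, t2):
    """For each word in t1, take its best match in t2, weighted by idf."""
    num = sum(max(word_sim(w, v) for v in t2) * IDF.get(w, 1.0) for w in t1)
    den = sum(IDF.get(w, 1.0) for w in t1)
    return num / den

def text_sim(t1, t2):
    """sim(T1, T2) = (directed(T1 -> T2) + directed(T2 -> T1)) / 2"""
    return 0.5 * (directed(t1, t2) + directed(t2, t1))

s = text_sim(["i", "ride", "a", "car"], ["i", "drive", "a", "automobile"])
```

The idf weighting ensures that matching specific content words ("car"/"automobile") counts far more than matching function words, which is what lets the method beat purely lexical matching on paraphrase identification.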
Cross lingual similarity discrimination with translation characteristicsijaia
This document summarizes a research paper on cross-lingual similarity discrimination using translation characteristics. The paper proposes a discriminative model trained on bilingual corpora to classify sentences in a target language as similar or dissimilar to a given sentence in a source language. Features used in the model include translation characteristics like sentence length ratios, word alignments, and polarity. The model is trained on various sampling methods to address the imbalanced data of having many more negative samples than positive translations. Experiments on 1500 English-Chinese sentence pairs show the model achieves satisfactory performance according to three evaluation metrics, outperforming a baseline system.
Similar to MT SUMMIT13.Language-independent Model for Machine Translation Evaluation with Reinforced Factors (20)
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterLifeng (Aaron) Han
The document summarizes the results of experiments comparing large pre-trained language models for machine translation. In a machine translation challenge, a smaller Marian model demonstrated better or similar results to much larger pretrained models, contradicting expectations. This suggests that very large models do not necessarily improve translation quality and that current automatic evaluation metrics are limited. Human evaluation remains important for fully assessing machine translation quality.
Measuring Uncertainty in Translation Quality Evaluation (TQE)Lifeng (Aaron) Han
From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations which meet customer specifications, under harsh constraints on required quality level, time-frames, and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessments from professional translators (HAPs) are often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of low reliability and agreement. Is this caused by subjectivity, or are statistics at play? How can we avoid checking the entire text, making TQE more efficient in terms of cost and effort, and what is the optimal sample size of translated text needed to reliably estimate the translation quality of the entire material? This work carries out research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of translated text, e.g. the number of words or sentences, that needs to be processed in the TQE workflow step for a confident and reliable evaluation of overall translation quality.
The methodology applied in this work draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S Gladkoff, I Sorokina, L Han, A Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC2022. arXiv preprint arXiv:2111.07699
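The dependence of evaluation confidence on sample size can be illustrated with a standard binomial (Wilson score) interval, one of the intervals analysed in the cited Brown et al. line of work. The error counts below are illustrative.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """95% Wilson score interval for a Bernoulli error rate estimated from
    `errors` erroneous units out of `n` sampled translation units."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Same observed error rate (10%) at growing sample sizes: the interval
# narrows, quantifying how many units must be checked for a given confidence.
for n in (50, 200, 800):
    lo, hi = wilson_interval(errors=n // 10, n=n)
```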
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Lifeng (Aaron) Han
Starting in the 1950s, Machine Translation (MT) has been tackled with different scientific approaches, from rule-based methods, example-based and statistical models (SMT), to hybrid models and, in very recent years, neural models (NMT).
While NMT has achieved a huge quality improvement over conventional methodologies, taking advantage of the huge amounts of parallel corpora available from the internet and recently developed computational power at acceptable cost, it struggles to achieve real human parity in many domains and most language pairs, if not all of them.
Along the long road of MT research and development, quality evaluation metrics have played very important roles in MT advancement and evolution.
In this tutorial, we overview the traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, and the meta-evaluation of the evaluation methods themselves. We also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models for automatic metric customisation towards exactly the deployed language pairs and domains. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real-practice simulation.
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
The document proposes incorporating Chinese radicals into neural machine translation models. It discusses related work incorporating word and character level information into neural MT. The proposed model combines radical-level MT with an attention-based neural model, representing input text with word, character, and radical combinations. Experiments show the character+radical and word+radical models outperform baselines on standard MT evaluation metrics using a Chinese-English dataset. Future work includes improving model optimization and testing on additional data.
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...Lifeng (Aaron) Han
cushLEPOR uses LABSE distilled knowledge to improve correlation with human translation evaluations. It customizes the hLEPOR metric by optimizing its parameters against LABSE similarity scores or human evaluations to achieve lower RMSE than vanilla hLEPOR or BLEU. The optimized cushLEPOR metric then shows better correlation with human judgments than existing automated metrics like BLEU.
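The customisation loop can be sketched as a parameter search minimising RMSE against trusted scores (LABSE similarities or human judgments). The toy tunable metric below, a weighted harmonic mean standing in for hLEPOR's parameterised components, and all numbers are assumptions for illustration.

```python
import math

def metric(p, r, alpha):
    """Toy tunable metric: harmonic mean of precision p and recall r,
    with recall weighted by alpha (a stand-in for hLEPOR's parameters)."""
    return (alpha + 1.0) / (alpha / p + 1.0 / r)

def rmse(preds, golds):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(preds, golds)) / len(golds))

segments = [(0.8, 0.6), (0.5, 0.9), (0.7, 0.7)]   # (precision, recall) per segment
trusted = [0.70, 0.62, 0.70]                       # trusted scores to fit against

# Grid-search the weight that brings the metric closest to the trusted scores.
best_alpha, best_err = None, float("inf")
for step in range(1, 50):
    alpha = step / 10.0
    preds = [metric(p, r, alpha) for p, r in segments]
    err = rmse(preds, trusted)
    if err < best_err:
        best_alpha, best_err = alpha, err
```

Once optimised, the tuned metric is cheap to run at scale, which is the point of distilling LABSE's (or humans') judgments into lightweight parameters.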
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
ADAPT seminar series. June 2021
research papers @NoDaLiDa2021:the 23rd Nordic Conference on Computational Linguistics
& COLING20:MWE-LEX WS
Bonus takeaway:
AlphaMWE multilingual corpus
with MWEs
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
Build Moses Statistical Machine Translation system with Ubuntu
Tree to tree Machine Translation with Universal phrase tagset. https://github.com/aaronlifenghan/A-Universal-Phrase-Tagset
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
ADAPT Centre & Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @ DLSS2017 Bilbao.
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...Lifeng (Aaron) Han
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, which have a verb as the head of the studied terms. The annotated vMWEs are also manually aligned bilingually and multilingually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors, i.e., each MT output sentence received first manual post-editing and annotation, plus a second manual quality recheck. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
A deep analysis of Multi-word Expression and Machine Translation. Faculty research open day. DCU, Dublin. 2019.
Including MWE identification, MT with radical, MTE.
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
This document surveys machine translation evaluation resources and methods. It reviews both traditional human evaluation methods, such as adequacy and fluency judgments, and automatic evaluation methods, such as lexical similarity and linguistic features. Recent deep learning models are also covered. The survey finds that accurately evaluating translations is difficult due to language variation and low agreement between humans. It also discusses evaluating the human judgments and automatic evaluations themselves.
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
The document presents a proposed model to apply Chinese radicals into neural machine translation. It discusses related work on machine translation and neural networks. The proposed model would combine radical-level machine translation with an attention-based neural model, incorporating radicals into the input data. Experiments would evaluate the model on various data settings and metrics, comparing performance with and without radicals. Future work could improve parameter tuning and include more diverse data.
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
This document summarizes an experiment on using graph-based semi-supervised learning to improve a conditional random field model for Chinese named entity recognition. The experiment used unlabeled data from previous NER tasks to extend the labeled training data via label propagation. This enhanced CRF model was evaluated on a standard test corpus and showed a slight improvement over a closed CRF baseline, particularly for person and organization entities. However, the unlabeled data was not large enough to cover all entity types. Future work could explore using more unlabeled data and optimizing features for the graph construction.
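Label propagation of the kind described can be sketched on a toy similarity graph; the paper's graph construction and feature set are more elaborate than this minimal clamped-averaging loop, which is shown only to make the mechanism concrete.

```python
# Minimal label propagation: labelled (seed) nodes keep their labels
# clamped, unlabelled nodes iteratively take the weighted average of
# their neighbours' label scores until convergence.

def propagate(adj, labels, n_iter=50):
    """adj: symmetric weight matrix; labels: dict node -> score in {0.0, 1.0}
    for seed nodes. Returns per-node class scores after propagation."""
    n = len(adj)
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(n_iter):
        new = scores[:]
        for i in range(n):
            if i in labels:           # clamp seed labels
                continue
            total = sum(adj[i])
            if total:
                new[i] = sum(adj[i][j] * scores[j] for j in range(n)) / total
        scores = new
    return scores

# A chain 0-1-2-3: nodes 0 (class 1) and 3 (class 0) are seeds from the
# labelled data; nodes 1 and 2 stand in for unlabelled examples.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
scores = propagate(adj, {0: 1.0, 3: 0.0})
```

The propagated scores on unlabelled nodes can then be thresholded into labels to extend the CRF's training data, as in the described experiment.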
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection) and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric, without reference translations, to evaluate translation quality. In Task 1.2, we utilized a probability model, Naïve Bayes (NB), as a classification algorithm, with features borrowed from traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, Conditional Random Field (CRF), in addition to the NB algorithm. Training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical CRF and NB models. The official results show that our CRF model achieved the highest F-score, 0.8297, in the binary classification of Task 2.
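The reported binary F-score is the standard harmonic mean of precision and recall over word-level labels; a minimal sketch with toy labels (1 marking, say, a "good" word and 0 a "bad" one):

```python
# Binary F-score from predicted vs gold labels.

def f1(pred, gold, positive=1):
    tp = sum(1 for p, g in zip(pred, gold) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(pred, gold) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(pred, gold) if p != positive and g == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```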
This thesis proposes new machine translation evaluation metrics called LEPOR that aim to address issues with existing methods. LEPOR incorporates additional evaluation factors like length penalty and n-gram position difference penalty to provide more accurate assessments. It also explores using part-of-speech tags as linguistic features. Experimental results on multiple language pairs show LEPOR achieves strong performance, correlating highly with human judgments of translation quality. The metrics were also submitted to the ACL-WMT shared translation evaluation task where they proved robust across different languages.
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
The document discusses a phrase tagset mapping between French and English treebanks and its application in machine translation evaluation. Key points:
- A universal phrase tagset with 9 categories was designed to map phrase tags from the French Treebank and English Penn Treebank.
- The tagset mapping aims to facilitate multilingual research by bridging differences in treebank tagsets.
- An unsupervised machine translation evaluation method was proposed that uses the universal tagset to compare phrase categories between source and translated sentences, without needing reference translations.
- Experiments on French-English translation tasks showed promising results, with the unsupervised method correlating reasonably well with BLEU and TER scores. However, there is still room for improvement.
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation with Reinforced Factors
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He
Yi Lu, Junwen Xing, and Xiaodong Zeng
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau, Macau S.A.R., China
hanlifengaaron@gmail.com, {derekfw, lidiasc}@umac.mo
{wutianshui0515,takamachi660,nlp2ct.anson,nlp2ct.samuel}@gmail.com
Abstract

Conventional machine translation evaluation metrics tend to perform well on certain language pairs but weakly on others. Furthermore, some evaluation metrics can only work on certain language pairs, i.e. they are not language-independent. Finally, ignoring linguistic information usually leaves a metric with low correlation with human judgments, while too many linguistic features or external resources make a metric complicated and difficult to replicate. To address these problems, a novel language-independent evaluation metric is proposed in this work, with enhanced factors and a modest amount of optional linguistic information (part-of-speech, n-grams). To make the metric perform well on different language pairs, extensive factors are designed to reflect translation quality, and the assigned parameter weights are tunable according to the special characteristics of the language pairs in focus. Experiments show that this novel evaluation metric yields better performance compared with several classic evaluation metrics (including BLEU, TER and METEOR) and two state-of-the-art ones, ROSE and MPF.1
1 Introduction
Machine translation (MT) began as early as the 1950s (Weaver, 1955) and has made great progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006). For example, (Och, 2003) presented MERT (Minimum Error Rate Training) for log-linear statistical machine translation (SMT) models to achieve better translation quality, (Su et al., 2009) used the Thematic Role Templates model to improve translation, and (Xiong et al., 2011) employed a maximum-entropy model. Statistical MT (Koehn, 2010) became the mainstream approach in the MT literature. Due to the widespread development of MT systems, MT evaluation has become more and more important to tell us how well the MT systems perform and whether they make progress. However, MT evaluation is difficult for several reasons: language variability means there is no single correct translation, natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003).

[1] The final publication is available at http://www.mt-archive.info/
How to evaluate each MT system's quality, and what the criteria should be, have become new challenges for MT researchers. The earliest human assessment methods include intelligibility (measuring how understandable the sentence is) and fidelity (measuring how much information the translated sentence retains compared to the original), used by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966), and the subsequently proposed adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) introduced by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994). Manual evaluations suffer from the main disadvantage that they are time-consuming and thus too expensive to perform frequently.
The early automatic evaluation metrics include word error rate WER (Su et al., 1992) (the edit distance between the system output and the closest reference translation), position-independent word error rate PER (Tillmann et al., 1997) (a variant of WER that disregards word ordering), BLEU (Papineni et al., 2002) (the geometric mean of n-gram precision of the system output with respect to reference translations), NIST (Doddington, 2002) (adding information weights) and GTM (Turian et al., 2003). Recently, many other methods have been proposed to revise or improve these works.
One category is the lexical similarity based metrics. Metrics of this kind include edit distance based methods, such as TER (Snover et al., 2006) and the work of (Akiba et al., 2001) in addition to WER and PER; precision based methods such as SIA (Liu and Gildea, 2006) in addition to BLEU and NIST; recall based methods such as ROUGE (Lin and Hovy, 2003); the word order information utilized by (Wong and Kit, 2008), (Isozaki et al., 2010) and (Talbot et al., 2011); and combinations of precision and recall such as Meteor-1.3 (Denkowski and Lavie, 2011) (a modified version of Meteor that includes ranking and adequacy versions and overcomes some weaknesses of the previous version, such as noise in the paraphrase matching, lack of punctuation handling and no discrimination between word types), BLANC (Lita et al., 2005), LEPOR (Han et al., 2012) and PORT (Chen et al., 2012). Another category employs linguistic features. Metrics of this kind include syntactic similarity, such as the part-of-speech information used by ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011) and the phrase information employed by (Echizen-ya and Araki, 2010) and (Han et al., 2013b), and semantic similarity, such as textual entailment used by (Mirkin et al., 2009), synonyms by (Chan and Ng, 2008) and paraphrases by (Snover et al., 2009).
The previously proposed evaluation methods suffer, more or less, from several main weaknesses: they perform well on certain language pairs but poorly on others, which we call the language-bias problem; they consider no linguistic information (not reasonable from the perspective of linguistic analysis) or too many linguistic features (making them difficult to replicate), which we call the extremism problem; or they employ incomprehensive factors (e.g. BLEU focuses on precision only). To address these problems, a novel automatic evaluation metric is proposed in this paper with enhanced factors, tunable parameters and optional linguistic information (part-of-speech, n-gram).
2 Designed Model
2.1 Employed Internal Factors
Firstly, we introduce the internal factors utilized in
the calculation model.
2.1.1 Enhanced Length Penalty
The enhanced length penalty ELP penalizes both longer and shorter system output translations (an enhanced version of the brevity penalty in BLEU):

ELP = \begin{cases} e^{1 - r/c} & \text{if } c < r \\ e^{1 - c/r} & \text{if } c \geq r \end{cases}    (1)

where c and r are the sentence lengths of the system output (candidate) and the reference translation respectively.
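As a minimal sketch of Eq. (1) (the function name is ours for illustration, not taken from the released hLEPOR tool):

```python
import math

def enhanced_length_penalty(c: int, r: int) -> float:
    """Enhanced length penalty (Eq. 1): penalizes candidate translations
    that are shorter *or* longer than the reference.
    c = candidate length, r = reference length (in tokens)."""
    if c < r:
        return math.exp(1 - r / c)
    return math.exp(1 - c / r)

# Equal lengths give no penalty (ELP = 1.0); deviation in either
# direction shrinks the score toward 0.
print(enhanced_length_penalty(6, 6))  # 1.0
print(enhanced_length_penalty(4, 6))  # e^(1-6/4) = e^-0.5, about 0.607
```

Note the symmetry: a candidate that is too short by a given ratio is penalized exactly as much as one that is too long by the same ratio.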
2.1.2 N-gram Position Difference Penalty
The N-gram Position Difference Penalty
NPosPenal is developed to compare the word
order between the output and reference translation.
NPosPenal = e^{-NPD}    (2)

where NPD is defined as:

NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|    (3)

where Length_{output} is the length of the system output sentence and PD_i is the position difference value of each output word. Every word, from both the output translation and the reference, may be aligned only once. When there is no match, PD_i is assigned zero by default for that output token.
Two steps are designed to measure the NPD value. The first step is context-dependent n-gram alignment: we use the n-gram method and assign it higher priority, which means the surrounding context of the candidate words is considered when selecting the matched pairs between the output and reference sentences. If there are multiple nearby matches, or no other matched words around the potential word pair, the nearest match is accepted as a backup choice to establish the alignment. The alignment is one-directional, from the output sentence to the reference.

Assume that w_x represents the current word in the output sentence and w_{x+k} the kth word before (k < 0) or after (k > 0) it. Correspondingly, w^r_y is the word matching w_x in the reference, and w^r_{y+j} has the same meaning as w_{x+k} but in the reference sentence. The variable Distance is the position difference value between the matching words in the output and the reference. The operation process and pseudocode of the context-dependent n-gram word alignment algorithm are shown in Figure 1 (with → denoting the alignment); Figure 2 gives an example. In the calculation step, each word is labeled with the quotient of its position number divided by the sentence length (the total number of tokens in the sentence).

Figure 3 illustrates the NPD calculation for the example in Figure 2. Each output word is labeled with a position quotient from 1/6 to 6/6 (the word position normalized by the sentence length, which is 6). The words in the reference sentence are labeled using the same scheme.
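Given an alignment, Eqs. (2)-(3) reduce to a few lines. The sketch below (our own naming; the context-dependent n-gram alignment of Figure 1 itself is not reimplemented here) takes a precomputed alignment and the two sentence lengths:

```python
import math

def npos_penal(alignment, len_out, len_ref):
    """N-gram position difference penalty (Eqs. 2-3).
    `alignment` maps 1-based output positions to 1-based reference
    positions; unmatched output words contribute PD_i = 0.
    Positions are normalized by sentence length, as in Figure 3."""
    npd = sum(abs(i / len_out - j / len_ref)
              for i, j in alignment.items()) / len_out
    return math.exp(-npd)

# Toy example: a 3-word output aligned monotonically to a 3-word
# reference has zero position difference, hence no penalty.
print(npos_penal({1: 1, 2: 2, 3: 3}, 3, 3))  # 1.0
```

Any word-order deviation raises NPD above zero and pushes the penalty below 1.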
2.1.3 Precision and Recall
Precision and recall are two commonly used criteria in the NLP literature. We use HPR to denote the weighted harmonic mean of precision and recall, i.e. Harmonic(αR, βP), where the weights α and β are tunable parameters.

HPR = \frac{(\alpha + \beta) \cdot Precision \cdot Recall}{\alpha \cdot Precision + \beta \cdot Recall}    (4)

Precision = \frac{Aligned_{num}}{Length_{output}}    (5)

Recall = \frac{Aligned_{num}}{Length_{reference}}    (6)

where Aligned_{num} is the number of successfully matched words appearing in both the translation and the reference.
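Eqs. (4)-(6) can be sketched as follows (a minimal illustration with our own function name, not code from the released tool):

```python
def hpr(aligned_num, len_out, len_ref, alpha, beta):
    """Weighted harmonic mean of precision and recall (Eqs. 4-6).
    alpha weights recall, beta weights precision."""
    precision = aligned_num / len_out
    recall = aligned_num / len_ref
    return ((alpha + beta) * precision * recall
            / (alpha * precision + beta * recall))

# 4 of 5 output words matched against a 6-word reference, with
# alpha:beta = 9:1 (the ratio tuned for most pairs in Table 1):
print(round(hpr(4, 5, 6, 9, 1), 4))  # 0.678
```

With α:β = 9:1 the score leans heavily toward recall, which the tuning in Table 1 favors for most language pairs.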
2.2 Sentence Level Score
Secondly, we introduce the mathematical harmonic mean to group multiple variables (X_1, X_2, ..., X_n):

Harmonic(X_1, X_2, ..., X_n) = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}    (7)

where n is the number of factors. The weighted harmonic mean for multiple variables is then:

Harmonic(w_{X_1}X_1, w_{X_2}X_2, ..., w_{X_n}X_n) = \frac{\sum_{i=1}^{n} w_{X_i}}{\sum_{i=1}^{n} \frac{w_{X_i}}{X_i}}    (8)

where w_{X_i} is the weight of variable X_i. Finally, the sentence-level score of the developed evaluation metric hLEPOR (Harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference penalty and Recall) is measured by:

hLEPOR = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{ELP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{ELP}}{ELP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}    (9)

where ELP, NPosPenal and HPR are the three factors explained in the previous section, with tunable weights w_{ELP}, w_{NPosPenal} and w_{HPR} respectively.
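The sentence-level combination in Eq. (9) can be sketched as below; the default weights illustrate the HPR:ELP:NPP = 7:2:1 ratio tuned for several language pairs in Table 1 (function and parameter names are ours):

```python
def hlepor_sentence(elp, npos_penal, hpr, w_elp=2.0, w_npp=1.0, w_hpr=7.0):
    """Sentence-level hLEPOR (Eq. 9): weighted harmonic mean of the
    three factors ELP, NPosPenal and HPR."""
    return ((w_elp + w_npp + w_hpr)
            / (w_elp / elp + w_npp / npos_penal + w_hpr / hpr))

# If every factor equals 1 the score is 1, regardless of the weights;
# the harmonic mean is dragged down sharply by any single weak factor.
print(hlepor_sentence(1.0, 1.0, 1.0))  # 1.0
```

The harmonic mean rewards balanced factor scores: a translation cannot compensate for, say, very poor word order with length alone.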
2.3 System-level Score
The system-level score is the arithmetic mean of the sentence-level scores:

\overline{hLEPOR} = \frac{1}{SentNum} \sum_{i=1}^{SentNum} hLEPOR_i    (10)

where \overline{hLEPOR} represents the system-level score of hLEPOR, SentNum is the number of sentences in the test document, and hLEPOR_i is the score of the ith sentence.
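Eq. (10) is a plain average over the per-sentence scores (a trivial sketch, with our own naming):

```python
def hlepor_system(sentence_scores):
    """System-level hLEPOR (Eq. 10): arithmetic mean of the
    sentence-level scores over the whole test document."""
    return sum(sentence_scores) / len(sentence_scores)

print(round(hlepor_system([0.8, 0.6, 0.7]), 3))  # 0.7
```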
3 Enhanced Version
This section introduces an enhanced version of the developed metric hLEPOR, called hLEPORE. As discussed by many researchers, language variability means there is no single correct translation, and different languages do not always express the same content in the same way. In addition to the augmented factors of the designed metric hLEPOR, we show that optional linguistic information can be combined into the metric concisely. As an example, we show how part-of-speech (POS) information can be employed. First, we calculate the system-level hLEPOR score on the surface words (hLEPOR_{word}). Then we apply the same algorithm to the corresponding POS sequences of the words (hLEPOR_{POS}). Finally, we combine these two system-level scores with tunable weights w_{hw} and w_{hp} as the final score:

hLEPORE = \frac{1}{w_{hw} + w_{hp}} (w_{hw} \cdot hLEPOR_{word} + w_{hp} \cdot hLEPOR_{POS})    (11)
We use POS information because it sometimes functions similarly to synonym matching; e.g. "there is a big bag" and "there is a large bag" can convey the same meaning with the different surface words "big" and "large" (which share the same POS, adjective). POS information has been shown to be helpful in the research works on ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011). In our designed model, the POS information could be replaced by any other concise linguistic information.
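The combination in Eq. (11) is a weighted average of the two system-level scores; a minimal sketch (our own naming), using the w_hw:w_hp = 1:9 ratio tuned for DE-EN in Table 1 as the example:

```python
def hlepor_enhanced(score_word, score_pos, w_hw, w_hp):
    """hLEPORE (Eq. 11): weighted combination of the system-level
    hLEPOR score on surface words and on POS sequences."""
    return (w_hw * score_word + w_hp * score_pos) / (w_hw + w_hp)

# DE-EN weights POS agreement heavily (1:9), so the combined score
# sits close to the POS-level score:
print(round(hlepor_enhanced(0.80, 0.90, 1, 9), 2))  # 0.89
```

Because Eq. (11) is linear, any other concise linguistic score (not just POS) could be dropped in for score_pos without changing the formula.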
4 Evaluating the Evaluation Metric
To assess the reliability of different MT evaluation metrics, the Spearman rank correlation coefficient ρ is commonly used to calculate the correlation with human judgments in the annual Workshop on Statistical Machine Translation (WMT) of the Association for Computational Linguistics (ACL) (Callison-Burch et al., 2011). When there are no ties, the Spearman rank correlation coefficient is calculated as:

\rho_{\varphi(XY)} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}    (12)

where d_i is the difference (D-value) between the two corresponding rank variables X = \{x_1, x_2, ..., x_n\} and Y = \{y_1, y_2, ..., y_n\} describing the system φ, and n is the number of variables in the system.
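Eq. (12) can be sketched directly for the no-ties case (our own implementation for illustration; WMT's official scoring scripts are not reproduced here):

```python
def spearman_rho(x, y):
    """Spearman rank correlation (Eq. 12), assuming no ties:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Identical system rankings give rho = 1; fully reversed give -1.
print(spearman_rho([0.93, 0.86, 0.74], [0.9, 0.8, 0.7]))  # 1.0
print(spearman_rho([1, 2, 3], [3, 2, 1]))                 # -1.0
```

Only the ranks matter, so a metric need not match human scores numerically; it only needs to order the MT systems the same way.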
5 Experiments
The experiment corpora are from the ACL's special interest group on machine translation, SIGMT (the WMT workshop), and contain eight corpora: English-to-other (Spanish, Czech, French and German) and other-to-English. Many POS taggers are available for different languages. We conducted an evaluation with different POS taggers and found that employing POS information increases the correlation with human judgment for some language pairs but has little or no effect on others. The employed POS tagging tools include the Berkeley POS tagger for French, English and German (Petrov et al., 2006), the COMPOST Czech morphology tagger (Collins, 2002) and the TreeTagger Spanish tagger (Schmid, 1994). To avoid overfitting, the WMT 2008 data[2] are used in the development stage for tuning the parameters, and the WMT 2011 corpora are used for testing. The tuned parameter values for the different language pairs are shown in Table 1. The abbreviations EN, CZ, DE, ES and FR mean English, Czech, German, Spanish and French respectively. In the n-gram word (POS) alignment, bigram is selected for all language pairs. To keep the model concise, using as few external resources as possible, the value "N/A" means the POS information of that language pair is not employed because it has little or no effect on the correlation scores. The labels "(W)" and "(POS)" indicate the parameters tuned on words and POS respectively, and "NPP" abbreviates NPosPenal to save table space. The tuned parameter values also show that different language pairs have different characteristics.
The testing results on the WMT 2011[3] corpora are shown in Table 2. The compared language-independent evaluation metrics include the classic metrics (BLEU, TER and METEOR) and two state-of-the-art metrics, MPF and ROSE. We select MPF and ROSE because these two metrics also employ POS information, and MPF yielded the highest correlation with human judgments among all language-independent metrics (covering eight language pairs) in WMT 2011. The numbers of participating automatic MT systems in WMT 2011 are 10, 22, 15 and 17 respectively for English-to-other (CZ, DE, ES and FR) and 8, 20, 15 and 18 respectively for the opposite translation directions. The gold-standard reference data for those corpora consist of 3,003 manually produced sentences. Automatic MT evaluation metrics are evaluated by their correlation coefficient with the human judgments.

[2] http://www.statmt.org/wmt08/
[3] http://www.statmt.org/wmt11/

Ratio                CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR
HPR:ELP:NPP (W)      7:2:1  3:2:1  7:2:1  3:2:1  3:2:1  1:3:7  3:2:1  3:2:1
HPR:ELP:NPP (POS)    N/A    3:2:1  N/A    3:2:1  N/A    7:2:1  N/A    3:2:1
α:β (W)              1:9    9:1    1:9    9:1    9:1    9:1    9:1    9:1
α:β (POS)            N/A    9:1    N/A    9:1    N/A    9:1    N/A    9:1
w_hw:w_hp            N/A    1:9    N/A    9:1    N/A    1:9    N/A    9:1

Table 1: Values of tuned weight parameters
Several conclusions can be drawn from the results. First, some evaluation metrics perform well on part of the language pairs but poorly on others; e.g. ROSE reaches 0.92 correlation with human judgments on the Spanish-to-English corpus but drops to 0.41 on English-to-German, and METEOR scores 0.93 on French-to-English but 0.3 on English-to-German. Second, hLEPORE generally performs well across the different language pairs, except for English-to-Czech, and achieves the highest mean correlation score, 0.83, over the eight corpora. Third, the recently developed methods (e.g. MPF, 0.81 mean score) correlate better with human judgments than the traditional ones (e.g. BLEU, 0.74 mean score), indicating progress in the research. Finally, no metric yields high performance on all language pairs, which shows that there remains large potential for improvement.
6 Conclusion and Perspectives
This work proposes a language-independent model for machine translation evaluation. Considering the different characteristics of different languages, hLEPORE has been designed to cover multiple aspects, spanning word order (context-dependent n-gram alignment), output accuracy (precision), loyalty (recall) and translation length (sentence length). Different weight parameters are assigned to adjust the importance of each factor; for instance, word order may be free in some languages but strictly constrained in others. In practice, the features employed by hLEPORE are also the vital ones people attend to when performing translation. This is the philosophy behind the formulation and study of this work, and we believe that human translation ideology is the exact direction MT systems should try to approach. Furthermore, this work shows that different external resources or linguistic information can be integrated into the model easily. As suggested by other works, e.g. (Avramidis et al., 2011), POS information is considered in the experiments and shows some improvements on certain language pairs.
This paper makes several main contributions compared with our previous work (Han et al., 2013). It combines surface words and linguistic features together (instead of relying on the agreement of the POS sequences only). It measures the system-level hLEPOR score by the arithmetic mean of the sentence-level scores (instead of the harmonic mean of system-level internal factors). It reports the performance of the enhanced method hLEPORE on all eight language pairs released by the official WMT website (instead of a subset, as in previous work), and most of the scores improve on the previous work for the same language pairs (e.g. the correlation score on German-English increases from 0.83 to 0.86; the correlation score on French-English increases from 0.74 to 0.92). Other potential linguistic features can easily be employed in the flexible model built in this paper.
Several aspects should be addressed in future work. Firstly, more language pairs beyond the European languages will be tested, such as Japanese, Korean and Chinese, and the performance of linguistic features (e.g. POS tagging) will also be explored on the new language pairs. Secondly, the tuning of weight parameters to achieve high correlation with human judgments during the development period will be performed automatically. Thirdly, since the use of multiple references helps translation quality measures correlate with human judgments, a scheme for using multiple references will be designed.

Metrics    CZ-EN  DE-EN  ES-EN  FR-EN  EN-CZ  EN-DE  EN-ES  EN-FR  Mean
hLEPORE    0.93   0.86   0.88   0.92   0.56   0.82   0.85   0.83   0.83
MPF        0.95   0.69   0.83   0.87   0.72   0.63   0.87   0.89   0.81
ROSE       0.88   0.59   0.92   0.86   0.65   0.41   0.90   0.86   0.76
METEOR     0.93   0.71   0.91   0.93   0.65   0.30   0.74   0.85   0.75
BLEU       0.88   0.48   0.90   0.85   0.65   0.44   0.87   0.86   0.74
TER        0.83   0.33   0.89   0.77   0.50   0.12   0.81   0.84   0.64

Table 2: Correlation coefficients with human judgments

The source code developed for this paper is open source and can be freely downloaded for research purposes. The hLEPOR measuring algorithm is available at https://github.com/aaronlifenghan/aaron-project-hlepor.
Acknowledgments
The authors are grateful to the Science and
Technology Development Fund of Macau and the
Research Committee of the University of Macau
for the funding support for our research, under
the reference No. 017/2009/A and RG060/09-
10S/CS/FST. The authors also wish to thank the
anonymous reviewers for many helpful comments.
References
Akiba, Y., K. Imamura, and E. Sumita. 2001. Using
Multiple Edit Distances to Automatically Rank Ma-
chine Translation Output. Proceedings of MT Sum-
mit VIII , Santiago de Compostela, Spain.
Arnold, D. 2003. Why translation is difficult for com-
puters. In Computers and Translation: A transla-
tor’s guide , Benjamins Translation Library.
Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.
2011. Evaluate with Confidence Estimation: Ma-
chine ranking of translation outputs using grammat-
ical features. Proceedings of ACL-WMT , pages 65-
70, Edinburgh, Scotland, UK.
Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O. F. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. Proceedings of ACL-WMT, pages 22-64, Edinburgh, Scotland, UK.
Carroll, J. B. 1966. An experiment in evaluating the quality of translation. Languages and machines: computers in translation and linguistics, Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pages 67-75.
Chan, Y. S. and Ng, H. T. 2008. MAXSIM: A maxi-
mum similarity metric for machine translation eval-
uation. Proceedings of ACL 2008: HLT , pages
55–62.
Chen, Boxing, Roland Kuhn and Samuel Larkin. 2012. PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning. Proceedings of the 50th ACL, pages 930-939, Jeju, Republic of Korea.
Collins, M. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of the ACL-02 conference, Volume 10 (EMNLP 02), pages 1-8, Stroudsburg, PA, USA.
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. Proceedings of ACL-WMT, pages 85-91, Edinburgh, Scotland, UK.
Doddington, G. 2002. Automatic evaluation of ma-
chine translation quality using n-gram co-occurrence
statistics. Proceedings of the second internation-
al conference on Human Language Technology Re-
search , pages 138-145, San Diego, California, USA.
Echizen-ya, H. and Araki, K. 2010. Automatic evaluation method for machine translation using noun-phrase chunking. Proceedings of ACL 2010, pages 108-117. Association for Computational Linguistics.
Han, Aaron L.-F., Derek F. Wong, and Lidia S. Chao.
2012. LEPOR: A Robust Evaluation Metric for Ma-
chine Translation with Augmented Factors. Pro-
ceedings of the 24th International Conference of
COLING, Posters, pages 441-450, Mumbai, India.
Han, Aaron L.-F., Derek F. Wong, Lidia S. Chao, and Liangye He. 2013. Automatic Machine Translation Evaluation with Part-of-Speech Information. Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), LNCS, Volume Editors: Vaclav Matousek et al. Springer-Verlag Berlin Heidelberg. Plzen, Czech Republic.
Han, Aaron L.-F., Derek F. Wong, Lidia S. Chao,
Liangye He, Shuo Li, and Ling Zhu. 2013b.
Phrase Mapping for French and English Treebank
and the Application in Machine Translation Evalu-
ation. Proceedings of the International Conference
of the German Society for Computational Linguistics and Language Technology (GSCL 2013), LNCS, Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch. Darmstadt, Germany.
Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsuka-
da, H. 2010. Automatic evaluation of translation
quality for distant language pairs. Proceedings of
the 2010 Conference on EMNLP , pages 944–952,
Cambridge, MA.
Koehn, P. 2010. Statistical Machine Translation. Cam-
bridge University Press .
Marino, B. Jose, Rafael E. Banchs, Josep M. Crego, Adria de Gispert, Patrik Lambert, Jose A. Fonollosa, and Marta R. Costa-jussa. 2006. N-gram based machine translation. Computational Linguistics, Vol. 32, No. 4, pp. 527-549, MIT Press.
Lin, Chin-Yew and E.H. Hovy. 2003. Automatic Eval-
uation of Summaries Using N-gram Co-occurrence
Statistics. Proceedings of HLT-NAACL 2003, Ed-
monton, Canada.
Lita, Lucian Vlad, Monica Rogati and Alon Lavie.
2005. BLANC: Learning Evaluation Metrics for
MT. Proceedings of the HLT/EMNLP, pages
740–747, Vancouver.
Liu D. and Daniel Gildea. 2006. Stochastic iterative
alignment for machine translation evaluation. Pro-
ceedings of ACL-06, Sydney.
Mirkin S., Lucia Specia, Nicola Cancedda, Ido Dagan, Marc Dymetman, and Idan Szpektor. 2009. Source-Language Entailment Modeling for Translating Unknown Terms. Proceedings of ACL-IJCNLP 2009, pages 791-799, Suntec, Singapore.
Och, F. J. 2003. Minimum Error Rate Training for Statistical Machine Translation. Proceedings of ACL-2003, pp. 160-167.
Papineni, K., Roukos, S., Ward, T. and Zhu, W. J. 2002.
BLEU: a method for automatic evaluation of ma-
chine translation. Proceedings of the ACL 2002 ,
pages 311-318, Philadelphia, PA, USA.
Petrov, S., Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. Proceedings of the 21st ACL, pages 433-440, Sydney.
Popovic, M. 2011. Morphemes and POS tags for n-
gram based evaluation metrics. Proceedings of WMT
, pages 104-107, Edinburgh, Scotland, UK.
Schmid, H. 1994. Probabilistic Part-of-Speech Tag-
ging Using Decision Trees. Proceedings of Inter-
national Conference on New Methods in Language
Processing , Manchester, UK.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and
Makhoul J. 2006. A study of translation edit rate
with targeted human annotation. Proceedings of the
AMTA, pages 223-231, Boston, USA.
Snover, Matthew G., Nitin Madnani, Bonnie Dorr, and
Richard Schwartz. 2009. TER-Plus: paraphrase, se-
mantic, and alignment enhancements to Translation
Edit Rate. J. Machine Translation, 23: 117-127.
Song, X. and Cohn, T. 2011. Regression and rank-
ing based optimisation for sentence level MT eval-
uation. Proceedings of the WMT , pages 123-129,
Edinburgh, Scotland, UK.
Su, Hung-Yu and Chung-Hsien Wu. 2009. Improving
Structural Statistical Machine Translation for Sign
Language With Small Corpus Using Thematic Role
Templates as Translation Memory. IEEE TRANS-
ACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING , VOL. 17, NO. 7.
Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin.
1992. A New Quantitative Quality Measure for Ma-
chine Translation Systems. Proceedings of COL-
ING, pages 433–439, Nantes, France.
Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J.,
Seno, M. and Och, F. 2011. A Lightweight Evalu-
ation Framework for Machine Translation Reorder-
ing. Proceedings of the WMT, pages 12-21, Edin-
burgh, Scotland, UK.
Tillmann, C., Stephan Vogel, Hermann Ney, Arkaitz
Zubiaga, and Hassan Sawaf. 1997. Accelerated
DP Based Search For Statistical Translation. Pro-
ceedings of the 5th European Conference on Speech
Communication and Technology .
Turian, J. P., Shen, L. and Melamed, I. D. 2003. Evaluation of machine translation and its evaluation. Proceedings of MT Summit IX, pages 386-393, New Orleans, LA, USA.
Weaver, Warren. 1955. Translation. Machine Trans-
lation of Languages: Fourteen Essays, In William
Locke and A. Donald Booth, editors, John Wiley and
Sons. New York, pages 15—23.
White, J. S., O’Connell, T. A., and O’Mara, F. E. 1994.
The ARPA MT evaluation methodologies: Evolu-
tion, lessons, and future approaches. Proceedings
of AMTA, pp193-205.
Wong, B. T.-M. and Kit, C. 2008. Word choice and word position for automatic MT evaluation. Workshop: MetricsMATR of AMTA, short paper, Waikiki, Hawai'i, USA.
Xiong, D., M. Zhang, H. Li. 2011. A Maximum-
Entropy Segmentation Model for Statistical Machine
Translation. IEEE Transactions on Audio, Speech,
and Language Processing , Volume: 19, Issue: 8,
2011 , pp. 2494- 2505.
Figure 1: N-gram word alignment algorithm
Figure 2: Example of n-gram word alignment
Figure 3: Example of NPD calculation