This paper describes a universal phrase tagset mapping between the French Treebank and English Penn Treebank using 9 phrase categories. It then applies this mapping to an unsupervised machine translation evaluation method that calculates similarity between the source and target sentences without reference translations. The method extracts phrase tags from the source and target, maps them to universal tags, and measures n-gram precision, recall, and position difference as similarity metrics. Evaluation on French-English data shows promising correlation with human judgments, though there is still room for improvement. The tagset and methods could facilitate future multilingual research.
A General Method Applicable to the Search for Anglicisms in Russian Social Network Texts (Ilia Karpov)
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell-checking, tagging, and other software in the field of natural language processing, loan words are not easily recognized and should be handled separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous script, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good overlap with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
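As a rough illustration of the approach described above (not the authors' implementation), the following Python sketch generates a Latin-script hypothesis for a Cyrillic word via a rule-based transliteration table and keeps the pair only if the two words are also close in a shared semantic space; the transliteration table and the `vectors` lookup are illustrative stand-ins.

# A minimal sketch of the anglicism-detection idea: transliterate a Cyrillic
# word into a Latin hypothesis, then apply a distributional-semantics filter.
# The transliteration table is deliberately tiny and illustrative.
TRANSLIT = {"к": "c", "о": "o", "м": "m", "п": "p", "ь": "", "ю": "u",
            "т": "t", "е": "e", "р": "r"}

def transliterate(cyrillic_word):
    return "".join(TRANSLIT.get(ch, ch) for ch in cyrillic_word.lower())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def is_anglicism(cyrillic_word, english_lexicon, vectors, threshold=0.5):
    hypothesis = transliterate(cyrillic_word)  # e.g. "компьютер" -> "computer"
    if hypothesis not in english_lexicon:
        return False
    # Semantic filter: the borrowing and its source should be distributionally similar.
    return cosine(vectors[cyrillic_word], vectors[hypothesis]) >= threshold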
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
Parallel Corpora in (Machine) Translation: Goals, Issues and Methodologies (Antonio Toral)
Parallel corpora play a central role in current approaches to machine and computer-assisted translation and also in any corpus-based study that involves original text and its translation. This talk motivates the use of parallel data, as well as its desired properties. It then introduces practical methodologies to automatically acquire and prepare parallel data for the task at hand. Finally, it glances at the neighbouring field of Translation Studies to assert that translations can differ to a great extent depending on the strategy followed by the translator, which might lead to the translation being more or less appropriate for its use in corpus-based studies.
Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
In this presentation we discuss several concepts, including word representations built with SVD as well as neural network-based techniques. We also cover core concepts such as cosine similarity and atomic versus distributed representations.
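Since the talk covers SVD-based word representations and cosine similarity, here is a minimal self-contained sketch of the idea with a toy co-occurrence matrix (the words and counts are invented for illustration):

import numpy as np

# LSA-style word vectors: factorize a word-by-context co-occurrence matrix
# with SVD and keep the top-k latent dimensions as dense representations.
words = ["king", "queen", "apple"]
cooc = np.array([[5.0, 1.0, 0.0],
                 [4.0, 1.0, 0.0],
                 [0.0, 0.0, 6.0]])

U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
k = 2
vectors = U[:, :k] * S[:k]      # one dense vector per word

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors[0], vectors[1]))   # "king" vs "queen": close to 1
print(cosine(vectors[0], vectors[2]))   # "king" vs "apple": close to 0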
Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ∼35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality — and increasing the big data suitability — to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.
Language variety identification is an author profiling subtask which aims to detect lexical and semantic variations in order to classify different varieties of the same language. In this work we approach the task by using distributed representations based on the investigations of Mikolov et al.
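To make the idea concrete, here is a much-simplified Python sketch of a low dimensionality representation: terms receive per-variety weights from the training corpus, and a document is reduced to a few statistics per variety (the real LDR uses six features per variety; this cut-down version uses two and is only meant to show the shape of the method):

from collections import Counter, defaultdict

# Per-variety term weights: relative frequency of each term within a variety.
def term_weights(train_docs):
    # train_docs: list of (tokens, variety) pairs
    counts = defaultdict(Counter)
    for tokens, variety in train_docs:
        counts[variety].update(tokens)
    weights = {}
    for variety, counter in counts.items():
        total = sum(counter.values())
        weights[variety] = {t: n / total for t, n in counter.items()}
    return weights

# Reduce a document to (mean, max) of its term weights for every variety,
# yielding a tiny fixed-size feature vector for any classifier.
def ldr_features(tokens, weights):
    features = []
    for variety in sorted(weights):
        w = [weights[variety].get(t, 0.0) for t in tokens] or [0.0]
        features += [sum(w) / len(w), max(w)]
    return features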
ACL-WMT2013. A Description of Tunable Machine Translation Evaluation Systems i... (Lifeng (Aaron) Han)
Proceedings of the ACL 2013 Eighth Workshop on Statistical Machine Translation (ACL-WMT 2013), 8-9 August 2013, Sofia, Bulgaria. Open tool: https://github.com/aaronlifenghan/aaron-project-lepor & https://github.com/aaronlifenghan/aaron-project-hlepor (ACM Digital Library, ACL Anthology)
Natural Language Processing with Python and Amharic Syntax Parse Tree (Daniel Adenew)
Natural Language Processing is an interrelated discipline that brings the capability of human-like communication to the computer world. The Amharic language has seen much improvement over time thanks to researchers at PhD and MSc level at AAU. Here, I have tried to study and build a limited-scope solution that performs syntax parsing for Amharic and draws syntax parse trees using Python.
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ... (Lifeng (Aaron) Han)
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
LEPOR: An Augmented Machine Translation Evaluation Metric - Thesis PPT (Lifeng (Aaron) Han)
Machine translation (MT) has developed into one of the hottest research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reasonably and tell whether the translation system makes an improvement or not. The traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. On the other hand, the popular automatic MT evaluation methods have some weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but weakly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes the metric difficult to replicate and apply to other language pairs. Thirdly, some popular metrics utilize incomprehensive factors, which results in low performance on some practical tasks.
In this thesis, to address the existing problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of languages. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance when using some external linguistic resources. Finally, we report the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools (Lifeng (Aaron) Han)
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, there are some weaknesses in the existing automatic MT evaluation metrics:
– perform well in certain language pairs but weak on others, which we call the language-bias problem;
– consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (difficult to replicate), which we call the extremism problem;
– design incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
The experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. The evaluation work is closely related to similarity measurement, so it can be further developed for other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further cooperation are welcome.
TSD2013 PPT. Automatic Machine Translation Evaluation with Part-of-Speech Info... (Lifeng (Aaron) Han)
Publisher: Springer-Verlag Berlin Heidelberg, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks (Lifeng (Aaron) Han)
Many syntactic treebanks and parser toolkits have been developed in the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which results in inconvenience when conducting multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs of the parsing models and even improve parsing accuracy.
MT SUMMIT13. Language-independent Model for Machine Translation Evaluation wit... (Lifeng (Aaron) Han)
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Meta-evaluation of Machine Translation Evaluation Methods (Lifeng (Aaron) Han)
Cite: Lifeng Han. 2021. Meta-evaluation of machine translation evaluation methods. In Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
Sentence-level translation quality estimation with cross-lingual transformers.
Please consider citing our paper
@InProceedings{transquest:2020,
author = {Ranasinghe, Tharindu and Orasan, Constantin and Mitkov, Ruslan},
title = {TransQuest: Translation Quality Estimation with Cross-lingual Transformers},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
year = {2020}
}
Machine translation (MT) is one of the earliest and most successful applications of natural language processing. Many MT services have been deployed via web and smartphone apps, enabling communication and information access across the globe by bypassing language barriers. However, MT is not yet a solved problem. MT services that cover the most languages cover only about a hundred; thousands more are currently unsupported. Even for the currently supported languages, the translation quality is far from perfect.
A key obstacle in our way to achieving usable MT models for any language is data imbalance. On the one hand, machine learning techniques perform subpar on rare categories, having only a few to no training examples. On the other hand, natural language datasets are inevitably imbalanced with a long tail of rare types. The rare types carry more information content, and hence correctly translating them is crucial. In addition to the rare word types, rare phenomena also manifest in other forms as rare languages and rare linguistic styles.
Our contributions towards advancing rare phenomena learning in MT are four-fold: (1) We show that MT models have much in common with classification models, especially regarding the data imbalance and frequency-based biases. We describe a way to reduce the imbalance severity during the model training. (2) We show that the currently used automatic evaluation metrics overlook the importance of rare words. We describe an interpretable evaluation metric that treats important words as important. (3) We propose methods to evaluate and improve translation robustness to rare linguistic styles such as partial translations and language alternations in inputs. (4) Lastly, we present a set of tools intended to advance MT research across a wider range of languages. Using these tools, we demonstrate 600 languages to English translation, thus supporting 500 more rare languages currently unsupported by others.
This lecture provides students with an introduction to natural language processing, with a specific focus on the basics of two applications: vector semantics and text classification.
(Lecture at the QUARTZ PhD Winter School, http://www.quartz-itn.eu/training/winter-school/, in Padua, Italy, on February 12, 2018)
Similar to Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation:
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester (Lifeng (Aaron) Han)
Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual datasets that are freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) have been proposed very recently, claiming supreme performance over smaller-sized PLMs in tasks such as machine translation (MT). These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine whether xLPLMs are absolutely superior to smaller-sized PLMs when fine-tuned toward domain-specific MT. We use two in-domain datasets of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn-2022 challenge at WMT2022. We choose the popular Marian Helsinki as the smaller-sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs.
Our experimental investigation shows that 1) on the smaller-sized in-domain commercial automotive data, the xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using SACREBLEU and hLEPOR metrics than the smaller-sized Marian, even though its score increase rate after fine-tuning is lower than Marian's; 2) on fine-tuning with the relatively larger, well-prepared clinical data, the xLPLM NLLB tends to lose its advantage over the smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using the ClinSpEn-offered metrics METEOR, COMET, and ROUGE-L, and totally loses to Marian on Task-1 (clinical cases) on all official metrics including SACREBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs; 4) clinic-Marian ranked No. 2 on Task-1 (via SACREBLEU/BLEU) and Task-3 (via METEOR and ROUGE) among all submissions.
Measuring Uncertainty in Translation Quality Evaluation (TQE) (Lifeng (Aaron) Han)
From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations that meet customer specifications under harsh constraints on required quality level, time-frames, and costs. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity, or are statistics at play? How can we avoid checking the entire text and be more efficient with TQE from cost and efficiency perspectives, and what is the optimal sample size of the translated text that reliably estimates the translation quality of the entire material? This work carries out such motivated research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of translated text, e.g. the number of words or sentences, that needs to be processed in the TQE workflow step for confident and reliable evaluation of overall translation quality.
The methodology we applied for this work is from Bernoulli Statistical Distribution Modeling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S Gladkoff, I Sorokina, L Han, A Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC2022. arXiv preprint arXiv:2111.07699
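A minimal sketch of the underlying statistics, assuming (as a simplification of the paper's Bernoulli modeling) that each sampled unit passes or fails quality control independently; the half-width of the normal-approximation confidence interval shrinks with the square root of the sample size, and a Monte Carlo run checks the nominal coverage:

import math
import random

# Half-width of the ~95% normal-approximation interval for a pass rate p_hat
# estimated from n sampled units.
def ci_half_width(p_hat, n, z=1.96):
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Monte Carlo check: how often does the interval cover a true quality of 0.9?
def coverage(true_p=0.9, n=400, trials=10_000):
    hits = 0
    for _ in range(trials):
        p_hat = sum(random.random() < true_p for _ in range(n)) / n
        if abs(p_hat - true_p) <= ci_half_width(p_hat, n):
            hits += 1
    return hits / trials

print(ci_half_width(0.9, 400))   # ~0.029: about +/-3% at 400 sampled units
print(coverage())                # close to the nominal 0.95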
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov... (Lifeng (Aaron) Han)
Starting in the 1950s, Machine Translation (MT) has been approached with different scientific solutions, from rule-based methods, example-based and statistical models (SMT), to hybrid models, and in very recent years neural models (NMT).
While NMT has achieved a huge quality improvement compared to conventional methodologies, by taking advantage of the huge amounts of parallel corpora available from the internet and recently developed computational power at an acceptable cost, it struggles to achieve real human parity in many domains and most language pairs, if not all of them.
Along the long road of MT research and development, quality evaluation metrics have played very important roles in MT advancement and evolution.
In this tutorial, we overview the traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, as well as the meta-evaluation of the evaluation methods. Among these, we also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models for automatic metric customisation towards the exact language pairs and domains deployed. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real practice simulation.
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession... (Lifeng (Aaron) Han)
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium translations.
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level of each translation unit.
The initial experimental work, carried out on English-Russian MT outputs of marketing content from a highly technical domain, reveals that our evaluation framework is quite effective in reflecting MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation.
The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE
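A minimal sketch of the geometric-progression scoring idea, with an invented base and point budget (the official HOPE error taxonomy and constants are in the paper and repository):

# Penalty points grow geometrically with error severity:
# severity 0 -> 1 EPP, severity 1 -> 2 EPPs, severity 2 -> 4 EPPs, ...
def error_penalty_points(severity, base=2):
    return base ** severity

# A translation unit starts from a fixed point budget and loses EPPs
# for every error annotated by the professional post-editor.
def segment_score(error_severities, max_points=10):
    penalty = sum(error_penalty_points(s) for s in error_severities)
    return max(max_points - penalty, 0)

print(segment_score([]))       # 10: clean segment
print(segment_score([0, 1]))   # 7: one minor and one medium error
print(segment_score([2, 2]))   # 2: two severe errors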
Apply Chinese Radicals into Neural Machine Translation: Deeper Than Character Level (Lifeng (Aaron) Han)
LPRC 2018: Limerick Postgraduate Research Conference
Lifeng Han and Shaohui Kuang. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. ArXiv pre-print https://arxiv.org/abs/1805.01565v1
Chinese Character Decomposition for Neural MT with Multi-Word Expressions (Lifeng (Aaron) Han)
ADAPT seminar series. June 2021
Research papers @ NoDaLiDa2021 (the 23rd Nordic Conference on Computational Linguistics) & the COLING2020 MWE-LEX workshop.
Bonus takeaway: the AlphaMWE multilingual corpus with MWEs.
Build Moses on Ubuntu (64-bit) System in VirtualBox, recorded by Aaron (v2, longer) (Lifeng (Aaron) Han)
Build Moses Statistical Machine Translation system with Ubuntu
Tree-to-tree machine translation with the universal phrase tagset: https://github.com/aaronlifenghan/A-Universal-Phrase-Tagset
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with... (Lifeng (Aaron) Han)
ADAPT Centre & Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @ DLSS2017 Bilbao.
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ... (Lifeng (Aaron) Han)
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence first received manual post-editing and annotation, plus a second round of manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
ADAPT Centre and My NLP Journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing (Lifeng (Aaron) Han)
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
A deep analysis of Multi-word Expression and Machine Translation. Faculty research open day. DCU, Dublin. 2019.
Including MWE identification, MT with radical, MTE.
Quality Estimation for Machine Translation Using the Joint Method of Evaluati... (Lifeng (Aaron) Han)
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection) and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric without reference translations to evaluate translation quality. In Task 1.2, we utilized a probabilistic model, Naive Bayes (NB), as the classification algorithm, with features borrowed from traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, the Conditional Random Field (CRF), in addition to the NB algorithm. Training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical models CRF and NB. The official results show that our CRF model achieved the highest F-score, 0.8297, in the binary classification of Task 2.
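As a toy illustration of the Task 1.2 setup (not the submitted system), one can feed metric-derived features to a Naive Bayes classifier; the two feature columns and their values below are invented for the example:

from sklearn.naive_bayes import GaussianNB

# Each row: features borrowed from evaluation metrics for one candidate
# translation, e.g. an n-gram precision score and a length ratio.
X_train = [[0.42, 0.95],
           [0.18, 0.70],
           [0.55, 1.02],
           [0.12, 0.60]]
y_train = [1, 0, 1, 0]   # 1 = preferred system output, 0 = not preferred

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.50, 0.98]]))   # likely [1]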
Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
1. 25th International Conference, GSCL 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu
September 25th-27th, 2013, Darmstadt, Germany
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau
2. Background of language treebanks
Motivation
Designed phrase tagset mapping
Application in MT evaluation:
1. Manual evaluations
2. Traditional automatic MT evaluation methods
3. Designed unsupervised MT evaluation
4. Evaluating the evaluation method
5. Experiments
6. Open source code
Discussion
Further information
3. • To promote the development of syntactic analysis
• Many language treebanks have been developed
– English Penn Treebank (Marcus et al., 1993; Mitchell et al., 1994)
– German Negra Treebank (Skut et al., 1997)
– French Treebank (Abeillé et al., 2003)
– Chinese Sinica Treebank (Chen et al., 2003)
– Etc.
4. • Problems
– Different treebanks use their own syntactic tagsets
– The number of tags ranges from tens (e.g. English Penn Treebank) to hundreds (e.g. Chinese Sinica Treebank)
– Inconvenient when undertaking multilingual or cross-lingual research
5. • To bridge the gap between these treebanks and facilitate future research
– E.g. the unsupervised induction of syntactic structure
• Petrov et al. (2012) developed a universal POS tagset
• How about phrase-level tags?
• The disaccord problem in phrase-level tags remains unsolved
– Let's try to solve it
6. • Tentative design of the phrase tagset mapping
– On English Penn Treebank I, II & French Treebank
• 9 universal phrasal categories covering:
– 14 phrase tags in English Penn Treebank I
– 26 phrase tags in English Penn Treebank II
– 14 phrase tags in French Treebank
7. Table 1: phrase tagset mapping for French and English treebanks
8. • Universal phrasal categories: NP (noun phrase), VP (verb phrase), AJP (adjective phrase), AVP (adverbial phrase), PP (prepositional phrase), S (sub-sentence), CONJP (conjunction phrase), COP (coordinated phrase), X (other phrases or unknown)
• NP covering
– French tags: NP
– English tags: NP, NAC (the scope of certain prenominal modifiers within an NP), NX (within certain complex NPs to mark the head of the NP), WHNP (wh-noun phrase), QP (quantifier phrase)
9. • VP covering
– French tags: VN (verbal nucleus), VP (infinitives and nonfinite clauses)
– English tags: VP (verb phrase)
• AJP covering
– French tags: AP (adjectival phrase)
– English tags: ADJP (adjective phrase), WHADJP (wh-adjective phrase)
10. • AVP covering
– French tags: AdP (adverbial phrase)
– English tags: ADVP (adverb phrase), WHAVP (wh-adverb phrase), PRT (particle)
• PP covering
– French tags: PP
– English tags: PP, WHPP (wh-prepositional phrase)
11. • S covering
– French tags: SENT (sentence), S (finite clause)
– English tags: S (simple declarative clause), SBAR (clause introduced by a subordinating conjunction), SBARQ (direct question introduced by a wh-phrase), SINV (declarative sentence with subject-aux inversion), SQ (sub-constituent of SBARQ), PRN (parenthetical), FRAG (fragment), RRC (reduced relative clause)
• CONJP covering
– French tags: N/A
– English tags: CONJP
12. • COP covering
– French tags: COORD (coordinated phrase)
– English tags: UCP (coordinated phrases belonging to different categories)
• X covering
– French tags: unknown
– English tags: X (unknown or uncertain), INTJ (interjection), LST (list marker)
• (The complete mapping is collected in the code sketch below)
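The mapping on slides 8-12 can be collected into a single lookup table; a sketch in Python, with the French and English tag inventories merged into one dictionary (tags that happen to share a name, such as NP, PP and S, map identically from both treebanks):

# Universal phrase tagset mapping from slides 8-12: English Penn Treebank
# and French Treebank phrase tags to the 9 universal categories.
UNIVERSAL_TAG = {
    # NP
    "NP": "NP", "NAC": "NP", "NX": "NP", "WHNP": "NP", "QP": "NP",
    # VP
    "VN": "VP", "VP": "VP",
    # AJP
    "AP": "AJP", "ADJP": "AJP", "WHADJP": "AJP",
    # AVP
    "AdP": "AVP", "ADVP": "AVP", "WHAVP": "AVP", "PRT": "AVP",
    # PP
    "PP": "PP", "WHPP": "PP",
    # S
    "SENT": "S", "S": "S", "SBAR": "S", "SBARQ": "S", "SINV": "S",
    "SQ": "S", "PRN": "S", "FRAG": "S", "RRC": "S",
    # CONJP, COP, X
    "CONJP": "CONJP", "COORD": "COP", "UCP": "COP",
    "X": "X", "INTJ": "X", "LST": "X",
}

def to_universal(tags):
    return [UNIVERSAL_TAG.get(t, "X") for t in tags]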
14. • Rapid development of machine translation
– MT began as early as the 1950s (Weaver, 1955)
– Big progress since the 1990s due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al., 2006)
• Difficulties of MT evaluation
– Language variability results in no single correct translation
– Natural languages are highly ambiguous and different languages do not always express the same content in the same way (Arnold, 2003)
15. • Traditional manual evaluation criteria:
– Intelligibility (measuring how understandable the sentence is)
– Fidelity (measuring how much information the translated sentence retains compared to the original), by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966)
– Adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994)
16. • Problems of manual evaluations:
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch et al., 2011)
17. • Measuring the similarity of the automatic translation and the reference translation
– Automatic translation (or hypothesis translation, target translation): by the automatic MT system
– Reference translation: by professional translators
– Source language and source document: not used
• Traditional automatic evaluation:
– BLEU: n-gram precision (Papineni et al., 2002)
– TER: edit distance (Snover et al., 2006)
– METEOR: precision and recall (Banerjee and Lavie, 2005)
18. • Problems in supervised MT evaluation
– Reference translations are expensive
– Reference translations are not available in some cases
• Could we get rid of the reference translation?
– Unsupervised MT evaluation method
– Extract information from the source and target languages
– How to use the designed universal phrase tagset?
19. • Assume that the translated sentence should have a similar set of phrase categories to the source sentence.
– This design is inspired by the synonymous relation between the source and target sentences.
• Two sentences that have similar sets of phrases may talk about different things.
– However, this evaluation approach is not designed for general circumstances
– Assume that the target sentences are indeed the translated sentences of the source document
20. • First, we parse the source and target languages respectively
• Then we extract the phrase set from the source and target sentences
• Third, we convert the phrases into the developed universal phrase categories
• Last, we measure the similarity of the source and target language on the universal phrase sequences (see the sketch below)
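A minimal sketch of these four steps in Python, assuming the phrase-tag sequences from the parsers are already available and using the UNIVERSAL_TAG table sketched earlier (the matching below is simplified and unclipped; the actual HPPR factors are defined on the following slides):

def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Convert both sides to universal tags, then compare their n-gram sets.
def ngram_precision_recall(src_tags, tgt_tags, n=2):
    src = ngrams(to_universal(src_tags), n)
    tgt = ngrams(to_universal(tgt_tags), n)
    matched = sum(1 for g in tgt if g in src)
    precision = matched / len(tgt) if tgt else 0.0
    recall = matched / len(src) if src else 0.0
    return precision, recall

# French source tags vs English hypothesis tags: VN and VP both map to VP.
print(ngram_precision_recall(["NP", "VN", "PP"], ["NP", "VP", "PP"]))  # (1.0, 1.0)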
22. The level of extracted phrase tags: just the level immediately above the POS tags, bottom-up
Figure 2: converting the extracted phrases into universal phrase tags
23. • What is the similarity metric we employed?
• Designed similarity metric: HPPR
– N1-gram position order difference penalty
– Weighted N2-gram precision
– Weighted N3-gram recall
– Weighted geometric mean in n-gram precision & recall
– Weighted harmonic mean to combine sub-factors
– The parameters are tunable according to different language pairs
24. • $\mathrm{HPPR} = Har(w_{Ps}\,N_1\mathrm{PsDif},\ w_{Pr}\,N_2\mathrm{Pre},\ w_{Rc}\,N_3\mathrm{Rec})$
• $\mathrm{HPPR} = \dfrac{w_{Ps}+w_{Pr}+w_{Rc}}{\frac{w_{Ps}}{N_1\mathrm{PsDif}}+\frac{w_{Pr}}{N_2\mathrm{Pre}}+\frac{w_{Rc}}{N_3\mathrm{Rec}}}$
• $N_1\mathrm{PsDif}$, $N_2\mathrm{Pre}$, and $N_3\mathrm{Rec}$ are the corpus-level scores of the sub-factors position difference penalty, precision and recall.
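The weighted harmonic mean on this slide translates directly into code; a sketch assuming all three sub-scores are non-zero:

# HPPR as the weighted harmonic mean of the three sub-factor scores,
# with tunable weights w_ps, w_pr, w_rc.
def hppr(ps_dif, precision, recall, w_ps=1.0, w_pr=1.0, w_rc=1.0):
    return (w_ps + w_pr + w_rc) / (
        w_ps / ps_dif + w_pr / precision + w_rc / recall)

print(hppr(0.8, 0.6, 0.7))   # combined score, between the three sub-scores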
25. • The sentence-level $N_1\mathrm{PsDif}$ score:
• $N_1\mathrm{PsDif} = \exp(-N_1\mathrm{PD})$
• $N_1\mathrm{PD} = \frac{1}{\mathrm{Length}_{hyp}} \sum_i |PD_i|$
• $PD_i = |\mathrm{PsN}_{hyp} - \mathrm{MatchPsN}_{src}|$
• $\mathrm{PsN}_{hyp}$ and $\mathrm{MatchPsN}_{src}$ are the position numbers of the matching tag in the hypothesis and source sentence respectively. When there is no match for the tag: $PD_i = |\mathrm{PsN}_{hyp} - 0|$
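A sentence-level sketch of this computation, assuming a simple leftmost-match alignment between hypothesis and source tag sequences (the paper's matching strategy may differ in detail):

import math

def position_difference_penalty(hyp_tags, src_tags):
    diffs = []
    for i, tag in enumerate(hyp_tags, start=1):
        if tag in src_tags:
            match = src_tags.index(tag) + 1   # 1-based position of the match
        else:
            match = 0                          # no match: PD_i = |PsN_hyp - 0|
        diffs.append(abs(i - match))
    n1pd = sum(diffs) / len(hyp_tags)          # normalize by hypothesis length
    return math.exp(-n1pd)                     # N1PsDif score in (0, 1]

print(position_difference_penalty(["NP", "VP", "PP"], ["NP", "VP", "PP"]))  # 1.0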
30. • How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approach
– Correlation with human judgments (Callison-Burch et al., 2011, 2012)
• Spearman rank correlation coefficient $r_s$:
– $r_s(XY) = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$
– Two rank sequences $X = \{x_1, \dots, x_n\}$, $Y = \{y_1, \dots, y_n\}$
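The formula in code, applied to two rank lists (the example ranks are invented):

# Spearman rank correlation between two rank sequences of equal length n.
def spearman(x_ranks, y_ranks):
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks of five MT systems by the metric vs by human judges.
print(spearman([1, 2, 3, 4, 5], [2, 1, 3, 4, 5]))   # 0.9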
31. • Corpus from WMT
– Workshop on Statistical Machine Translation
– SIGMT, ACL's special interest group on machine translation
• Training data (WMT11), to tune the parameters
– 3,003 sentences per document
– 18 automatic French-to-English MT systems
• Testing data (WMT12)
– 3,003 sentences per document
– 15 automatic French-to-English MT systems
32. • Training, tuning the parameters
– N1, N2 and N3 are tuned to 2, 3 and 3, because 4-gram chunk matching usually results in a 0 score.
– Tuned values of the factor weights are shown in the table
Table 2: tuned parameter values
33. • Comparisons with:
– BLEU, which measures the closeness of the hypothesis and reference translations via n-gram precision
– TER, which measures the edit distance of the hypothesis to the reference translations
34. Table 3: training (development) scores on WMT11 corpus
Table 4: testing scores on WMT12 corpus
35. Table 5: interpretation of correlation scores (Cohen, 1988)
The experimental results on the development and testing corpora show that HPPR, without using reference translations, has yielded promising correlation scores (0.63 and 0.59 respectively).
There is still potential to improve the performance of all three metrics, even though correlation scores higher than 0.5 are already considered strong correlation, as shown in Table 5.
36. • Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation
– Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li, and Ling Zhu. GSCL 2013, Darmstadt, Germany. LNCS Vol. 8105, pp. 119-131. Volume Editors: Iryna Gurevych, Chris Biemann and Torsten Zesch.
• Open source tool for the phrase tagset mapping and HPPR similarity measuring algorithms:
https://github.com/aaronlifenghan/aaron-project-hppr
37. • To facilitate future research in the multilingual or cross-lingual literature, this paper designs a phrase tagset mapping between the French Treebank and the English Penn Treebank using 9 phrase categories.
• One of the potential applications of the designed universal phrase tagset is shown in the unsupervised MT evaluation task in the experiment section.
38. • There are still some limitations in this work to be addressed in the future.
– The designed universal phrase categories may not be able to cover all the phrase tags of other language treebanks, so this tagset could be expanded when necessary.
– The designed HPPR formula contains the n-gram factors of position difference, precision and recall, which may not be sufficient or suitable for some other language pairs, so different measuring factors should be added or switched when facing new tasks.
39. • The designed models are closely related to similarity measurement. We have employed them in MT evaluation; these works may be further developed for other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
40. • Ongoing and further work:
– The combination of translation and evaluation, tuning the translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– Further exploration of unsupervised evaluation models, extracting other features from the source and target languages
• Aaron's open source tools: https://github.com/aaronlifenghan
• Aaron's network home: http://www.linkedin.com/in/aaronhan
41. GSCL 2013, Darmstadt, Germany
Aaron L.-F. Han
Email: hanlifengaaron AT gmail DOT com
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau