Cite: Lifeng Han. 2021. Meta-evaluation of Machine Translation Evaluation Methods. Tutorial at Metrics 2021: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
ADAPT Centre and My NLP Journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing - Lifeng (Aaron) Han
Invited presentation at the NLP lab of Soochow University about my NLP journey and the ADAPT Centre. The NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, and Parsing.
Chinese Character Decomposition for Neural MT with Multi-Word Expressions - Lifeng (Aaron) Han
ADAPT seminar series. June 2021
Research papers at NoDaLiDa 2021 (the 23rd Nordic Conference on Computational Linguistics) and the COLING 2020 MWE-LEX workshop.
Bonus takeaway: the AlphaMWE multilingual corpus with MWE annotations.
Apply Chinese Radicals into Neural Machine Translation: Deeper than Character Level - Lifeng (Aaron) Han
LPRC 2018: Limerick Postgraduate Research Conference
Lifeng Han and Shaohui Kuang. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. ArXiv pre-print https://arxiv.org/abs/1805.01565v1
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
Machine translation (MT) has developed into one of the most active research topics in the natural language processing (NLP) literature. One important issue in MT is how to evaluate an MT system reliably and determine whether the translation system has actually improved. Traditional manual judgment methods are expensive, time-consuming, unrepeatable, and sometimes suffer from low agreement. On the other hand, the popular automatic MT evaluation methods have several weaknesses. Firstly, they tend to perform well on language pairs with English as the target language, but poorly when English is the source. Secondly, some methods rely on many additional linguistic features to achieve good performance, which makes them difficult to replicate and apply to other language pairs. Thirdly, some popular metrics use incomprehensive factors, which results in low performance on some practical tasks.
In this thesis, to address these problems, we design novel MT evaluation methods and investigate their performance on different languages. Firstly, we design augmented factors to yield highly accurate evaluation. Secondly, we design a tunable evaluation model in which the weighting of factors can be optimized according to the characteristics of each language. Thirdly, in the enhanced version of our methods, we design concise linguistic features using POS to show that our methods can yield even higher performance when some external linguistic resources are used. Finally, we present the practical performance of our metrics in the ACL-WMT workshop shared tasks, which shows that the proposed methods are robust across different languages.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools - Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling whether systems are actually making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, the existing automatic MT evaluation metrics have several weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;
– they use incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
Experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT Summit. The evaluation work is closely related to similarity measurement, so it can be further extended to other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further collaboration are welcome.
Practical Machine Learning - Part 1 contains:
- Basic notions of ML (what tasks are there, what is a model, how to measure performance)
- A couple of examples of problems and solutions (taken from previous work)
- A brief presentation of open-source software used for ML (R, scikit-learn, Weka)
A brief survey presentation about Arabic Question Answering, touching on the different Natural Language Processing and Information Retrieval approaches to question analysis, passage retrieval, and answer extraction, in addition to a listing of the different NLP tools used in Arabic QA and the challenges and future trends in this area.
If you want to cite this paper, you can download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf
In this presentation we discuss several concepts, including word representation using SVD as well as neural-network-based techniques. We also cover core concepts such as cosine similarity and atomic and distributed representations.
MT SUMMIT 13. Language-independent Model for Machine Translation Evaluation with Reinforced Factors - Lifeng (Aaron) Han
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013) pp. 215-222. Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Young Seok Kim
Review of paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ArXiv link: https://arxiv.org/abs/1810.04805
YouTube Presentation: https://youtu.be/GK4IO3qOnLc
(Slides are written in English, but the presentation is done in Korean)
Arabic is the sixth most widespread natural language in the world, with more than 350 million native speakers. Arabic question answering systems are gaining great significance due to the increasing amount of Arabic unstructured content on the Internet and the increasing demand for information that regular information retrieval techniques do not satisfy. Question answering systems in general, and Arabic systems are no exception, hit an upper bound of performance due to the propagation of error in their pipelines. This increases the significance of answer selection and validation systems, as they enhance the certainty and accuracy of question answering systems. Very few works have tackled the Arabic answer selection and validation problem, and they used the same question answering pipeline without any changes to satisfy the requirements of answer selection and validation, which is why they did not perform adequately well on this task. In this dissertation, a new approach to Arabic answer selection and validation is presented through "ALQASIM", a QA4MRE (Question Answering for Machine Reading Evaluation) system. ALQASIM analyzes the reading test documents instead of the questions, and utilizes sentence splitting, root expansion, and semantic expansion using an ontology built from the CLEF 2012 background collections. Our experiments have been conducted on the test set provided by CLEF 2012 through the QA4MRE task. This approach led to a promising performance of 0.36 Accuracy and 0.42 C@1, which is double the performance of the best performing Arabic QA4MRE system.
Publications:
http://scholar.google.com/citations?user=XGJiEioAAAAJ&hl=en
https://aast.academia.edu/AhmedMagdy
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview - Lifeng (Aaron) Han
Starting from the 1950s, Machine Translation (MT) has been approached with different scientific solutions, from rule-based methods, example-based and statistical models (SMT), to hybrid models and, in very recent years, neural models (NMT).
While NMT has achieved a huge quality improvement compared to conventional methodologies, by taking advantage of the huge amount of parallel corpora available on the internet and recently developed computational power at an acceptable cost, it still struggles to achieve real human parity in many domains and most language pairs, if not all of them.
Along the long road of MT research and development, quality evaluation metrics have played very important roles in MT advancement and evolution.
In this tutorial, we give an overview of traditional human judgement criteria, automatic evaluation metrics, and unsupervised quality estimation models, as well as the meta-evaluation of the evaluation methods themselves. We also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models for automatic metric customisation towards the exact deployed language pairs and domains. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in realistic practice.
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Annotations - Lifeng (Aaron) Han
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium-quality translations. In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level of each translation unit. The initial experimental work, carried out on English-Russian MT outputs on marketing content from a highly technical domain, reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation. The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE
French Machine Reading for Question Answering - Ali Kabbadj
This paper proposes to unlock the main barrier to machine reading and comprehension of French natural language texts. This opens the way for machines to find a precise answer to a question buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets. Until now, these techniques could not actually be used for French text question answering (Q&A) applications, since there was no large Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, together with GloVe French word and character embedding vectors from the French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with an F1 score of around 70%.
A Deep Analysis of Multi-word Expressions and Machine Translation - Lifeng (Aaron) Han
A deep analysis of multi-word expressions and machine translation. Faculty research open day, DCU, Dublin, 2019.
Covers MWE identification, MT with radicals, and MTE.
Machine translation course program (in English) - Dmitry Kan
This is the English version of my Machine Translation course program for the following course slides (in Russian):
http://www.slideshare.net/dmitrykan/introduction-to-machine-translation-2911038
and
http://www.slideshare.net/dmitrykan/introduction-to-machine-translation-1
A hybrid composite features based sentence level sentiment analyzer - IAESIJAI
Current lexica- and machine-learning-based sentiment analysis approaches still suffer from a two-fold limitation. First, manual lexicon construction and machine training is time-consuming and error-prone. Second, the prediction's accuracy entails that sentences and their corresponding training text should fall under the same domain. In this article, we experimentally evaluate four sentiment classifiers, namely support vector machines (SVMs), Naive Bayes (NB), logistic regression (LR) and random forest (RF). We quantify the quality of each of these models using three real-world datasets that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic movie reviews. Specifically, we study the impact of a variety of natural language processing (NLP) pipelines on the quality of the predicted sentiment orientations. Additionally, we measure the impact of incorporating lexical semantic knowledge captured by WordNet on expanding original words in sentences. Findings demonstrate that utilizing different NLP pipelines and semantic relationships impacts the quality of the sentiment analyzers. In particular, results indicate that coupling lemmatization and knowledge-based n-gram features produces higher accuracy results. With this coupling, the accuracy of the SVM classifier improved to 90.43%, while it was 86.83%, 90.11%, and 86.20%, respectively, using the three other classifiers.
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation with Reinforced Factors - Lifeng (Aaron) Han
Presentation PPT at MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of the Machine Translation Summit (MT Summit 2013). Nice, France, 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Annotations - Lifeng (Aaron) Han
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of worse-than-premium-quality translations.
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level of each translation unit.
The initial experimental work, carried out on English-Russian MT outputs on marketing content from a highly technical domain, reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation.
The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE
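To make the geometric EPP idea concrete, below is a minimal sketch in Python. The error types, the three-level severity scale, and the base/ratio of the progression are illustrative assumptions, not the exact values of the HOPE scorecard.

```python
# Minimal sketch of a HOPE-style scoring model with geometrically
# progressing error penalty points (EPPs). Error types, severity
# levels, and the base/ratio below are illustrative assumptions.

from dataclasses import dataclass
from typing import List

BASE_PENALTY = 1  # EPPs grow geometrically with severity: 1, 2, 4, ...
RATIO = 2


@dataclass
class Error:
    error_type: str   # e.g. "terminology", "omission", "grammar"
    severity: int     # 0 = minor, 1 = major, 2 = critical (assumed scale)


def segment_epp(errors: List[Error]) -> int:
    """Total error penalty points for one translation unit."""
    return sum(BASE_PENALTY * RATIO ** e.severity for e in errors)


def system_score(segments: List[List[Error]]) -> float:
    """Average EPPs per segment; lower means better MT output."""
    if not segments:
        return 0.0
    return sum(segment_epp(seg) for seg in segments) / len(segments)


if __name__ == "__main__":
    # Hypothetical post-editing annotations for three MT segments.
    annotated = [
        [Error("terminology", 1), Error("grammar", 0)],  # 2 + 1 = 3 EPPs
        [],                                              # perfect segment
        [Error("omission", 2)],                          # 4 EPPs
    ]
    print("Per-segment EPPs:", [segment_epp(s) for s in annotated])
    print("System score (avg EPPs/segment):", system_score(annotated))
```

The geometric progression makes a single critical error outweigh several minor ones, which is the property the framework uses to keep severe errors visible in the aggregate score.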
Human Evaluation: Why do we need it? - Dr. Sheila Castilho - Sebastian Ruder
Talk at the 8th NLP Dublin meetup (https://www.meetup.com/NLP-Dublin/events/241198412/) by Dr. Sheila Castilho, postdoc at ADAPT Centre, Dublin City University.
Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation - Lifeng (Aaron) Han
Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation
Hindawi Publishing Corporation
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He and Yi Lu
The Scientific World Journal, Issue: Recent Advances in Information Technology. ISSN: 1537-744X. SCIE, IF=1.73. http://www.hindawi.com/journals/tswj/aip/760301/
Pangeanic Cor-ActivaTM - Neural Machine Translation, TAUS Tokyo 2017 - Manuel Herranz
Presentation of Pangeanic language technologies as a result of EU and national R&D: Cor for web crawling and website translation, linked to Elastic Search-based ActivaTM and NeuralMT
Similar to Meta-evaluation of machine translation evaluation methods (20)
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester - Lifeng (Aaron) Han
Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual datasets that are freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) have been proposed very recently to claim supreme performance over smaller-sized PLMs in tasks such as machine translation (MT). These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X (2021) and NLLB (2022). In this work, we examine whether xLPLMs are absolutely superior to smaller-sized PLMs in fine-tuning toward domain-specific MT. We use two different in-domain datasets of different sizes: commercial automotive in-house data and clinical shared task data from the ClinSpEn-2022 challenge at WMT2022. We choose the popular Marian Helsinki as the smaller-sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs.
Our experimental investigation shows that 1) on the smaller-sized in-domain commercial automotive data, the xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using SACREBLEU and hLEPOR metrics than the smaller-sized Marian, even though its score increase rate is lower than Marian's after fine-tuning; 2) on the relatively larger-sized, well-prepared clinical data fine-tuning, the xLPLM NLLB tends to lose its advantage over the smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using the ClinSpEn-offered metrics METEOR, COMET, and ROUGE-L, and totally loses to Marian on Task-1 (clinical cases) on all official metrics including SACREBLEU and BLEU; 3) metrics do not always agree with each other on the same tasks using the same model outputs; 4) clinic-Marian ranked No. 2 on Task-1 (via SACREBLEU/BLEU) and Task-3 (via METEOR and ROUGE) among all submissions.
Measuring Uncertainty in Translation Quality Evaluation (TQE) - Lifeng (Aaron) Han
From the point of view of both human translators (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translations which meet customer specifications, with harsh constraints on the required quality level, in tight time frames and at low cost. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automatic machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}.
Human evaluations, however, are often accused of having low reliability and agreement. Is this caused by subjectivity, or is statistics at play? How can we avoid checking the entire text and be more efficient with TQE from cost and efficiency perspectives, and what is the optimal sample size of the translated text that allows us to reliably estimate the translation quality of the entire material? This work carries out such research to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of the translated text, e.g. the number of words or sentences, that needs to be processed in the TQE workflow step for a confident and reliable evaluation of the overall translation quality.
The methodology we apply for this work is Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
Reference: S Gladkoff, I Sorokina, L Han, A Alekseeva. 2022. Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC2022. arXiv preprint arXiv:2111.07699
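To illustrate how the confidence interval for an observed quality rate narrows with sample size, here is a minimal sketch combining a Wilson-style binomial interval (one of the intervals recommended in Brown et al., 2001) with a Monte Carlo resampling check. The pass rate and sample sizes are made-up numbers for illustration, not results from the paper.

```python
# Minimal sketch: confidence interval width for an observed
# translation-quality pass rate as a function of sample size.
# The pass rate and sample sizes are illustrative assumptions.

import math
import random

Z = 1.96  # ~95% confidence


def wilson_interval(successes: int, n: int, z: float = Z):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (max(0.0, centre - half), min(1.0, centre + half))


def monte_carlo_interval(true_rate: float, n: int, trials: int = 10_000):
    """Empirical 95% interval of the observed rate over repeated Bernoulli samples."""
    rates = sorted(
        sum(random.random() < true_rate for _ in range(n)) / n
        for _ in range(trials)
    )
    return (rates[int(0.025 * trials)], rates[int(0.975 * trials)])


if __name__ == "__main__":
    true_quality = 0.85  # assumed share of acceptable segments
    for n in (30, 100, 300, 1000):
        observed = int(round(true_quality * n))
        lo, hi = wilson_interval(observed, n)
        mc_lo, mc_hi = monte_carlo_interval(true_quality, n)
        print(f"n={n:4d}  Wilson: [{lo:.3f}, {hi:.3f}]  "
              f"Monte Carlo: [{mc_lo:.3f}, {mc_hi:.3f}]")
```

Running this shows the interval shrinking roughly with the square root of the sample size, which is the effect the paper exploits to choose how much text needs to be checked.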
Build Moses on an Ubuntu (64-bit) system in VirtualBox, recorded by Aaron (v2, longer) - Lifeng (Aaron) Han
Build the Moses statistical machine translation system on Ubuntu.
Tree to tree Machine Translation with Universal phrase tagset. https://github.com/aaronlifenghan/A-Universal-Phrase-Tagset
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking - Lifeng (Aaron) Han
ADAPT Centre & Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking @ DLSS 2017, Bilbao.
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations - Lifeng (Aaron) Han
In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). The MWEs include verbal MWEs (vMWEs) as defined in the PARSEME shared task, which have a verb as the head of the studied terms. The annotated vMWEs are also aligned bilingually and multilingually by hand. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors, i.e., each MT output sentence received a first round of manual post-editing and annotation plus a second round of manual quality rechecking. One of our findings during corpus preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
Quality Estimation for Machine Translation Using the Joint Method of Evaluati... - Lifeng (Aaron) Han
This is a short presentation for the poster of the WMT13 shared task. The paper introduces our participation in the WMT13 shared tasks on Quality Estimation for machine translation without using reference translations. We submitted results for Task 1.1 (sentence-level quality estimation), Task 1.2 (system selection) and Task 2 (word-level quality estimation). In Task 1.1, we used an enhanced version of the BLEU metric without using reference translations to evaluate the translation quality. In Task 1.2, we utilized a probability model, Naïve Bayes (NB), as a classification algorithm, with features borrowed from the traditional evaluation metrics. In Task 2, to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, conditional random fields (CRF), in addition to the NB algorithm. The training experiments on past WMT corpora showed that the designed methods yielded promising results, especially the statistical models of CRF and NB. The official results show that our CRF model achieved the highest F-score of 0.8297 in the binary classification of Task 2.
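As a toy illustration of the Task 1.2 idea (classifying candidate translations with a Naive Bayes model over metric-style features), here is a minimal sketch; the feature vectors and labels are synthetic, and scikit-learn's GaussianNB is used purely for illustration rather than the exact implementation from the paper.

```python
# Minimal sketch of the WMT13 Task 1.2 idea: a Naive Bayes classifier
# over features borrowed from traditional evaluation metrics
# (e.g. n-gram precision, recall, length ratio). The feature vectors
# and labels are synthetic illustrations, not the paper's data.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row: [1-gram precision, 1-gram recall, length ratio] for one
# candidate translation; label 1 = preferred output, 0 = not preferred.
X_train = np.array([
    [0.72, 0.70, 0.98],
    [0.41, 0.45, 1.20],
    [0.66, 0.61, 1.02],
    [0.35, 0.38, 0.80],
    [0.80, 0.77, 1.00],
    [0.50, 0.48, 1.15],
])
y_train = np.array([1, 0, 1, 0, 1, 0])

model = GaussianNB()
model.fit(X_train, y_train)

# Rank two hypothetical system outputs for the same source segment
# by the predicted probability of being the preferred translation.
candidates = np.array([
    [0.68, 0.64, 1.01],   # system A
    [0.44, 0.40, 1.25],   # system B
])
probs = model.predict_proba(candidates)[:, 1]
best = int(np.argmax(probs))
print("P(preferred) per system:", probs)
print("Selected system:", "A" if best == 0 else "B")
```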
This paper introduces a state-of-the-art machine translation (MT) evaluation survey that contains both manual and automatic evaluation methods. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, and extended criteria, etc. We classify the automatic evaluation methods into two categories: lexical similarity scenarios and linguistic feature applications. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Subsequently, we also introduce the evaluation methods for MT evaluation itself, including different correlation scores, and the recent quality estimation (QE) tasks for MT.
This paper differs from the existing works (Dorr et al., 2009; EuroMatrix, 2007) in several aspects: it introduces some recent developments in MT evaluation measures, a different classification from manual to automatic evaluation measures, the recent QE tasks of MT, and a concise construction of the content. For the latest version, please go to: https://arxiv.org/abs/1605.04515
LEPOR: an augmented machine translation evaluation metric Lifeng (Aaron) Han
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks - Lifeng (Aaron) Han
Many syntactic treebanks and parser toolkits have been developed over the past twenty years, including dependency structure parsers and phrase structure parsers. Phrase structure parsers usually utilize different phrase tagsets for different languages, which is inconvenient for multilingual research. This paper designs a refined universal phrase tagset that contains 9 commonly used phrase categories. Furthermore, the mapping covers 25 constituent treebanks and 21 languages. The experiments show that the universal phrase tagset can generally reduce the costs of the parsing models and even improve parsing accuracy.
Meta-evaluation of machine translation evaluation methods
1. ADAPT Research Centre, School of Computing, Dublin City University, Ireland
lifeng.han@adaptcentre.ie
Meta-evaluation of machine translation evaluation methods
Lifeng Han
Metrics 2021 Tutorial: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24.
2. Content
• Background (motivation of this work)
• Related work (earlier surveys)
• Machine Translation Evaluation (MTE) & meta-evaluation
• Discussion and perspectives
• Conclusions
Cartoon: https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm
3. Background
Driven by two factors:
• MTE as a key point in translation assessment & MT model development
- Human evaluation as the golden criterion of assessment -> high cost?
- Automatic metrics for MT system tuning and parameter optimisation -> reliability?
• New MTE methods are much needed for
- i) distinguishing SOTA high-performance MT systems (ref. the ACL 2021 outstanding paper by Marie et al.)
- ii) helping MT research on low-resource language pairs / domains
• Related earlier surveys
- do not cover recent years' developments (EuroMatrix 2007, GALE 2009),
- are specialised in / focused on a certain domain (Secară 2005),
- or are very initial and brief (Goutte 2006).
Work based on an earlier pre-print: L. Han (2016). "Machine Translation Evaluation Resources and Methods: A Survey". arXiv:1605.04515 [cs.CL], updated Sep. 2018.
4. Background
Goals:
• a concise but structured classification of MTE
• an overview extending the related literature, e.g. MTE work published in recent years, plus Quality Estimation (QE) and the evaluation of MTE methods themselves (meta-evaluation)
• make it easy for users/researchers to grasp what they need, and to find the corresponding assessment methods efficiently from existing ones, or to design their own, revised from / inspired by this overview
• We expect this survey to be helpful for more NLP tasks (model evaluations): NLP tasks share similarities, and translation is a relatively large task that covers/impacts/interacts with others, e.g.:
- text summarisation (Bhandari et al. 2021)
- language generation (Novikova et al. 2017, EMNLP)
- search (Liu et al. 2021, ACM), code generation (Liguori et al. 2021)
6. Background example: MTE (Han et al. 2020, LREC)
MT output / candidate T
MTE1: src + ref + output => (HE)
MTE2: src + output => (HE, QE)
MTE3: ref + output => (HE, Metrics)
ZH source: 年年岁岁花相似，岁岁年年人不同。
ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng.
EN reference: The flowers are similar each year, while people are changing every year.
EN MT output: One year spent similar, each year is different.
Han, Lifeng, Gareth Jones and Alan F. Smeaton (2020). MultiMWE: building multi-lingual multi-word expression (MWE) parallel corpora. In: 12th International Conference on Language Resources and Evaluation (LREC), 11-16 May 2020, Marseille, France (virtual).
8. Related work: earlier surveys on MTE
• Alina Secară (2005): in Proceedings of the eCoLoRe/MeLLANGE workshop, pages 39-44.
- A special focus on error classification schemes from the translation industry and translation teaching institutions
- A consistent and systematic error classification survey up to the early 2000s
• Cyril Goutte (2006): Automatic Evaluation of Machine Translation Quality. 5 pages.
- Surveys the two kinds of metrics to date (2006): string match and IR-inspired.
- String match: WER, PER; IR-style techniques (n-gram precision): BLEU, NIST, F, METEOR
• EuroMatrix (2007): 1.3: Survey of Machine Translation Evaluation. In EuroMatrix Project Report, Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit
- Human evaluation and automatic metrics for MT, to date (2007).
- An extensive referenced metric list, compared to other earlier peer work.
9. Related work: earlier surveys on MTE
• Bonnie Dorr (2009), editor: DARPA GALE program report, Chapter 5: Machine Translation Evaluation.
- Global Autonomous Language Exploitation programme from NIST.
- Covers HE (introduction) and Metrics (in more depth)
- Finishing up with METEOR, TER-P, SEPIA (a syntax-aware metric, 2007), etc.
- Concludes with the limitations of these metrics: high resource requirements, several seconds to score one segment pair, etc. Calls for better automatic metrics.
• Màrquez L., Dialogue 2013 invited talk, extended: automatic evaluation of machine translation quality. @EAMT
- An emphasis on linguistically motivated measures
- Introduces the Asiya online interface by UPC for MT output error analysis
- Introduces QE (a fresh TQA task in the WMT community)
• Pre-print: L. Han (2016): Machine Translation Evaluation Resources and Methods: A Survey. https://arxiv.org/abs/1605.04515
We acknowledge and endorse the EuroMatrix (2007) and NIST's GALE (2009) program reports.
11. Machine Translation Evaluation: meta-evaluation of the MTE paradigm
[Diagram: within Natural Language Processing, Machine Translation is assessed by Human evaluation, Automatic metrics, and Quality estimation, linked to Meta-evaluation, Human-in-the-Loop metrics, and Metrics as QE.]
12. Machine Translation Evaluation: meta-evaluation of the MTE paradigm
Manual methods
- started as soon as translation was deployed
- developed rapidly together with MT technology
- traditional criteria (from the 1960s) vs advanced criteria (further developed)
Automatic methods - from the 2000s on, when MT became usable and popular
- Metrics (traditional): word n-gram matching, linguistic features, statistical models, then ML/DL based.
- QE (developed later): without using reference translations.
Evaluating the MTE methods themselves (meta-evaluation/assessment)
- statistical significance, agreement, correlation methodologies, etc. (a small correlation sketch follows this slide)
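The correlation methodologies mentioned in the last bullet are the usual machinery for meta-evaluating a metric: compare its scores against human judgments across systems. Here is a minimal sketch using scipy; the system-level scores are invented for illustration only.

```python
# Minimal sketch of correlation-based meta-evaluation: how well do
# automatic metric scores track human judgments across MT systems?
# The system scores below are invented for illustration only.

from scipy.stats import pearsonr, spearmanr, kendalltau

# System-level scores for five hypothetical MT systems.
human_scores  = [78.0, 71.5, 69.0, 64.2, 58.3]   # e.g. adequacy ratings
metric_scores = [0.41, 0.38, 0.39, 0.30, 0.25]   # e.g. an automatic metric

pearson,  _ = pearsonr(human_scores, metric_scores)
spearman, _ = spearmanr(human_scores, metric_scores)
kendall,  _ = kendalltau(human_scores, metric_scores)

print(f"Pearson r:  {pearson:.3f}")   # linear correlation
print(f"Spearman:   {spearman:.3f}")  # rank correlation
print(f"Kendall:    {kendall:.3f}")   # pairwise ranking agreement
```

A metric whose scores preserve the human ranking of systems gets high Spearman/Kendall values even if its absolute scale differs from the human one, which is why rank correlations are standard in WMT metrics meta-evaluation.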
13. Human Assessment Methods
Traditional: intelligibility and fidelity; fluency, adequacy, comprehension
Advanced (further developments): task-oriented, extended criteria, utilizing post-editing, segment ranking, crowd-source intelligence, revisiting traditional criteria
14. www.adaptcentre.ie
selected definitions:
- Intelligible: as far as possible, the translation should read like normal, well-edited prose and
be readily understandable in the same way that such a sentence would be understandable if
originally composed in the translation language.
- Fidelity or accuracy: the translation should, as little as possible, twist, distort, or controvert
the meaning intended by the original.
- Comprehension relates to "informativeness": the objective is to measure a system's ability to
produce a translation that conveys sufficient information, such that people can gain the
necessary information from it.
- Fluency: whether a sentence is well-formed and fluent/grammatical in context
- Adequacy: the quantity of the information present in the original text that a translation
contains (recall-oriented)
14
Meta-eval of MTE paradigm: Human Eval
15. www.adaptcentre.ie
- task-oriented: in light of the tasks for which the output might be used, e.g. hotel booking
- Extended Criteria: suitability, interoperability, reliability, usability, efficiency, maintainability,
portability (focus on MT)
- Utilising Post-editing: compare the post-edited (corrected) translation with the original MT output
and count the number of editing steps; see the sketch after this list
- Segment Ranking: assessors are asked to provide a complete ranking over all the candidate
translations of the same source segment, as in the WMT shared tasks
- Crowd-source intelligence: with Amazon MTurk, using fluency criteria, different scale set-ups,
and some quality-control solutions
- Revisiting Traditional Criteria: ask the assessors to mark any errors in the candidate
translations, with questions on adequacy and comprehension
15
Meta-eval of MTE paradigm: Human Eval
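To make the post-editing count concrete, here is a minimal sketch (not the official TER/HTER implementation) that counts word-level edit steps (insertions, deletions and substitutions) between an MT output and its post-edited version; block shifts, which TER also models, are omitted for brevity, and the example post-edit is hypothetical.

```python
def edit_steps(mt_output: str, post_edited: str) -> int:
    """Word-level Levenshtein distance between an MT output and its post-edit."""
    hyp, ref = mt_output.split(), post_edited.split()
    # dp[i][j] = minimum edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(hyp)][len(ref)]

mt = "One year spent similar , each year is different"
pe = "The flowers are similar each year , while people are changing every year"
print(edit_steps(mt, pe))  # number of word-level edit steps
```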
16. Automatic Quality Assessment Methods
Traditional → Advanced
- Traditional: n-gram surface matching, edit distance, precision and recall, word order
- Deeper linguistic features: syntax (POS, phrase, sentence structure); semantics (named entities, MWEs, synonyms, textual entailment, paraphrase, semantic roles, language models)
- Advanced: deep learning models
16
17. www.adaptcentre.ie
- n-gram string matching: carried out at the word surface level
- edit distance
- precision, recall, F-score, word order (a toy sketch follows this slide)
- augmented/hybrid
- linguistic features: to capture syntax and meaning
- syntactic (POS, chunk, sentence structure)
- semantic (dictionary, paraphrase, NE, MWE, entailment, language
model)
- deep learning / ML models
- LSTMs, NNs, RNNs: modelling syntactic and semantic features
17
Meta-eval of MTE paradigm: Automatic Metrics
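As an illustration of the surface-level factors above, here is a minimal sketch of n-gram precision, recall and F-score against a single reference; real metrics such as BLEU add multiple n-gram orders, clipping across several references and a brevity penalty. The example reuses the slide deck's earlier MT output and reference.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_prf(candidate: str, reference: str, n: int = 1):
    """Clipped n-gram precision, recall and F1 against one reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((cand & ref).values())               # clipped matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(ngram_prf("One year spent similar , each year is different",
                "The flowers are similar each year , while people are changing every year"))
```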
18. Our work:
N-gram string match: augmented/hybrid
LEPOR, nLEPOR: Length Penalty, Precision, n-gram Position difference Penalty and Recall
(n-gram P and R: nLEPOR). https://github.com/poethan/LEPOR/ (code in Perl) -
best score on the English-to-other WMT2013 shared task. A small sketch of the
LEPOR components follows the references below.
Deeper linguistic features:
hLEPOR: word level + POS (harmonic mean of the factors)
HPPR: word level + phrase structure
News: cushLEPOR, a customised hLEPOR using a pre-trained language model and
human-labelled data, released this year at https://github.com/poethan/cushLEPOR
(presented at MT Summit 2021: https://aclanthology.org/2021.mtsummit-up.28/)
www.adaptcentre.ie
18
Meta-eval of MTE paradigm: Automatic Metrics
Han et al. (2012) "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors", in COLING.
hLEPOR: "Language-independent Model for Machine Translation Evaluation with Reinforced Factors", in MT Summit 2013.
HPPR: "Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation", in GSCL 2013.
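Below is a minimal, illustrative sketch of the three word-level LEPOR components named above (length penalty, n-gram position difference penalty, and the weighted harmonic mean of precision and recall), roughly following the formulation in Han et al. (2012); the alignment strategy and default weights here are simplified assumptions and may differ from the released Perl implementation.

```python
import math
from collections import Counter

def length_penalty(c_len: int, r_len: int) -> float:
    """LP: penalise candidates that are shorter or longer than the reference."""
    if c_len == r_len:
        return 1.0
    shorter, longer = min(c_len, r_len), max(c_len, r_len)
    return math.exp(1 - longer / shorter) if shorter else 0.0

def npos_penal(cand, ref) -> float:
    """NPosPenal = exp(-NPD): penalty for position differences of matched words
    (simplified: each candidate word aligns to the nearest identical reference word)."""
    if not cand or not ref:
        return 0.0
    total = 0.0
    for i, w in enumerate(cand):
        positions = [j for j, r in enumerate(ref) if r == w]
        if positions:
            j = min(positions, key=lambda pos: abs((i + 1) / len(cand) - (pos + 1) / len(ref)))
            total += abs((i + 1) / len(cand) - (j + 1) / len(ref))
    return math.exp(-total / len(cand))

def harmonic(precision: float, recall: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted harmonic mean of recall (weight alpha) and precision (weight beta)."""
    if precision <= 0 or recall <= 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)

def lepor(candidate: str, reference: str) -> float:
    """Product of the three word-level factors on a single segment."""
    cand, ref = candidate.split(), reference.split()
    matches = sum((Counter(cand) & Counter(ref)).values())
    p = matches / len(cand) if cand else 0.0
    r = matches / len(ref) if ref else 0.0
    return length_penalty(len(cand), len(ref)) * npos_penal(cand, ref) * harmonic(p, r)

print(lepor("One year spent similar , each year is different",
            "The flowers are similar each year , while people are changing every year"))
```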
19. www.adaptcentre.ie
How can we estimate translation quality without depending on a prepared reference translation?
- e.g. based on features that can be extracted from the MT output and the source text
- estimation spans: word, sentence, system level.
Evaluation methods:
- word level: Mean Absolute Error (MAE) as a primary metric and Root Mean Squared
Error (RMSE) as a secondary metric, borrowed from regression evaluation
(a small sketch of both follows this slide)
- sentence level: DA, HTER
- document level: MQM (Multidimensional Quality Metrics)
Recent trends:
- Metrics and QE integration / cross-performance: how to share knowledge between the
metrics task and the QE task
- WMT19: 10 reference-less evaluation metrics were used for the QE task,
"QE as a Metric"
- challenges: predicting the source words that lead to errors, multilingual or
language-independent models, low-resource languages, etc.
19
Meta-eval of MTE paradigm: Metrics / QE
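A tiny sketch of the two error measures named above, MAE and RMSE, computed over QE predictions against gold scores; the numbers are made up for illustration, and real WMT evaluation scripts add task-specific details.

```python
import math

def mae(pred, gold):
    """Mean Absolute Error between predicted and gold quality scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    """Root Mean Squared Error between predicted and gold quality scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

predicted = [0.20, 0.35, 0.10]  # hypothetical QE predictions
gold      = [0.25, 0.30, 0.05]  # hypothetical gold quality scores
print(mae(predicted, gold), rmse(predicted, gold))
```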
20. www.adaptcentre.ie
Our QE models: statistical modelling with metrics criteria for QE
I. sentence-level quality estimation: EBLEU (an enhanced BLEU with POS)
II. system selection: the probabilistic Naïve Bayes (NB) model and SVM as classification
algorithms, with features from evaluation metrics (length penalty, precision, recall
and rank values); a hypothetical sketch follows the reference below
III. word-level quality estimation: to take contextual information into account, we
employed a discriminative undirected probabilistic graphical model, the conditional
random field (CRF), in addition to the NB algorithm.
20
Meta-eval of MTE paradigm: Automatic Eval/QE
Han et al. (2013) Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical
Modeling. In WMT13. https://www.aclweb.org/anthology/W13-2245/
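A hedged sketch of the system-selection idea (II): train classifiers on metric-derived features such as length penalty, precision, recall and rank. The scikit-learn pipeline and feature values below are illustrative assumptions, not the exact setup of Han et al. (2013).

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical training data: one row per candidate translation,
# features = [length penalty, precision, recall, rank value]
X_train = [
    [0.95, 0.62, 0.58, 1.0],
    [0.80, 0.41, 0.45, 3.0],
    [0.99, 0.70, 0.66, 1.0],
    [0.75, 0.30, 0.35, 4.0],
]
y_train = [1, 0, 1, 0]  # 1 = output of the preferred system, 0 = not preferred

nb = GaussianNB().fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)

X_new = [[0.90, 0.55, 0.50, 2.0]]  # metric features of an unseen candidate
print(nb.predict(X_new), svm.predict(X_new))
```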
21. www.adaptcentre.ie
Statistical significance: using bootstrap re-sampling methods, do different systems
indeed have different quality? Statistical significance intervals for evaluation
metrics on small test sets.
Human judgement agreement (kappa): how often assessors agree with each
other on the same segments, or with themselves when re-judging the same segment.
Correlation methods: how well do the candidate rankings and
scores (metrics, QE) correlate with the reference ranking (e.g. human)?
Spearman, Pearson, Kendall's tau.
Metrics comparisons: compare and manipulate different metrics.
(a small sketch of bootstrap re-sampling and the correlation measures follows this slide)
21
MT evaluation paradigm: Meta-eval
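A minimal sketch of two of the meta-evaluation tools above: paired bootstrap re-sampling to check how stable system A's metric advantage over system B is, and Pearson/Spearman/Kendall correlation between metric scores and human scores. scipy is assumed, and all segment-level scores are hypothetical.

```python
import random
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical segment-level metric scores for two MT systems on the same test set
sys_a = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59, 0.70, 0.52]
sys_b = [0.58, 0.50, 0.69, 0.51, 0.60, 0.57, 0.68, 0.49]

def paired_bootstrap(a, b, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B."""
    rng = random.Random(seed)
    indices = list(range(len(a)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]
        if sum(a[i] for i in sample) > sum(b[i] for i in sample):
            wins += 1
    return wins / n_samples

print("P(A > B) over resamples:", paired_bootstrap(sys_a, sys_b))

# Correlation of a metric's segment scores with (hypothetical) human judgements
human = [0.65, 0.52, 0.75, 0.45, 0.70, 0.60, 0.72, 0.50]
print("Pearson:", pearsonr(sys_a, human)[0])
print("Spearman:", spearmanr(sys_a, human)[0])
print("Kendall tau:", kendalltau(sys_a, human)[0])
```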
23. www.adaptcentre.ie
Discussion and Perspectives
23
• MQM has been getting more attention recently, applied to WMT tasks (QE 2020); it has many criteria
and needs to be adapted to the specific evaluation task. (www.qt21.eu/mqm-definition)
• QE is receiving more attention: it is more realistic in practical situations where no reference is
available, e.g. emergency instruction/translation in war, natural disaster or rescue settings,
and in some low-resource scenarios. As SAE stated in 2001: poor MT can lead to
death, ….
• Human assessment agreement has always been an issue, and recent Google human evaluation
of WMT data shows that crowd-sourced human assessment disagrees largely with
professional translators, i.e. crowd-sourced assessors cannot distinguish genuinely
improved MT systems whose syntax or semantics have improved.
• How to improve test-set diversity (domains) and reference-translation coverage of
semantic equivalence, e.g. 口水战 (kǒushuǐ zhàn: "war of words" vs "oral combat").
24. www.adaptcentre.ie
Discussion and Perspectives
24
• Metrics still face challenges in segment-level performance/correlation.
• Metric score interpretation: automatic evaluation metrics usually yield scores whose absolute values are
not meaningful on their own; they are very test-set specific:
• — what is the meaning of a -16094 score from the MTeRater metric, or a 1.98 score from ROSE?
• — the same goes for 19.07 from BEER / 28.47 from BLEU / 33.03 from METEOR for a mostly good translation
• — we suggest our own LEPOR and hLEPOR/nLEPOR, where relatively good/better MT system translations
were reported to score 0.60 to 0.80 on an overall 0-1 scale
• Low-resource MT development & evaluation: can we avoid the long road that high-resource language pairs took?
• — semantic evaluation metrics
• — human-in-the-loop evaluation
• — semantically interpreted test suites
26. www.adaptcentre.ie
Conclusions
26
• A brief overview of both human/manual and automatic MTE methods for
translation quality, and of meta-evaluation: evaluating the assessment methods themselves.
• Automated models cover both reference-dependent metrics and
quality estimation without references. There is a new trend of reference-less
metrics.
• MT is still far from reaching human parity, especially in broader domains and topics and for
low-resource language pairs. MTE will continue to play a key role in MT
development.
• MTE methods will surely influence the evaluation of other NLP tasks and be
influenced by them in turn.
27. www.adaptcentre.ie
References
• Our earlier work related to this tutorial:
• Lifeng Han. (2016) Machine Translation Evaluation Resources and Methods: A Survey. https://arxiv.org/abs/1605.04515 (updated 2018).
• Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. (2020) MultiMWE: Building a Multi-lingual Multi-word Expression (MWE) Parallel Corpora. In LREC.
• Lifeng Han. (2014) LEPOR: An Augmented Machine Translation Evaluation Metric. MSc thesis, University of Macau. https://library2.um.edu.mo/etheses/b33358400_ft.pdf
• Lifeng Han, I. Sorokina, G. Erofeev, and S. Gladkoff. (2021) cushLEPOR: Customised hLEPOR Metric Using LaBSE Distilled Knowledge Model to Improve Agreement with Human Judgements. WMT2021 (in press). http://doras.dcu.ie/26182/
• Lifeng Han, Gareth Jones, and Alan Smeaton. (2021) Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods. MoTra2021@NoDaLiDa2021. https://aclanthology.org/2021.motra-1.3/
• WMT findings:
• WMT and Metrics task: Koehn and Monz (2006), …, Ondřej Bojar et al. (2013), … on to Barrault et al. (2020)
• QE task findings: from Callison-Burch et al. (2012), … on to Specia et al. (2020)
• MTE interaction with other NLP task evaluations:
• Liu et al. (2021) Meta-evaluation of Conversational Search Evaluation Metrics. ACM Transactions on Information Systems. https://arxiv.org/pdf/2104.13453.pdf
• Pietro Liguori et al. (2021) Shellcode_IA32: A Dataset for Automatic Shellcode Generation. https://arxiv.org/abs/2104.13100
• Jekaterina Novikova et al. (2017) Why We Need New Evaluation Metrics for NLG. EMNLP 2017. https://www.aclweb.org/anthology/D17-1238/
• A. Celikyilmaz, E. Clark, and J. Gao. (2020) Evaluation of Text Generation: A Survey. arXiv preprint arXiv:2006.14799.
• D. Qiu, B. Rothrock, T. Islam, A. K. Didier, V. Z. Sun, et al. (2020) SCOTI: Science Captioning of Terrain Images for Data Prioritization and Local Image Search. Planetary and Space …, Elsevier.
27
28. www.adaptcentre.ie
28
• Go raibh maith agat!
• 谢谢!
• Dankeschön
• Thank you!
• Gracias!
• Grazie!
• Dziękuję Ci!
• Merci!
• Dank je!
• спасибі!
• धन्यवाद!
• Благодаря ти!
quiz: which language do you recognise? 🙃
謝謝!
tak skal du have
Takk skal du ha
tack
Kiitos
Þakka þér fyrir
Qujan Qujanaq Qujanarsuaq
29. www.adaptcentre.ie
29
• Go raibh maith agat!
• 谢谢!
• Dankeschön
• Thank you!
• Gracias!
• Grazie!
• Dziękuję Ci!
• Merci!
• Dank je!
• спасибі!
• धन्यवाद!
• Благодаря ти!
How many can you speak? 🙃
Dankeschön
Danish: tak skal du have
Norwegian: Takk skal du ha
Swedish: tack
Finnish: Kiitos
Icelandic: Þakka þér fyrir
Kalaallisut/Greenlandic: Qujan Qujanaq Qujanarsuaq
謝謝!