There are several issues with existing general machine translation and natural language generation evaluation metrics, and question-answering (QA) systems are no exception in this regard. To build robust QA systems, we need equally robust evaluation systems that verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics rather than pure string overlap is important for comparing models fairly and for setting more realistic acceptance criteria in real-life applications. We build on the first paper, to our knowledge, that uses transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
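As a minimal illustration of the two model types named above, the sketch below scores a predicted answer against a ground-truth annotation with a bi-encoder and a cross-encoder via the sentence-transformers library; the checkpoints are generic public models standing in for the ones trained in the paper.

```python
# Minimal sketch: semantic answer similarity with a bi-encoder and a
# cross-encoder. Model names are generic stand-ins, not the paper's models.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

prediction = "Joe Biden"
ground_truth = "Joseph Robinette Biden Jr."

# Bi-encoder: embed each answer independently, compare with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = bi_encoder.encode([prediction, ground_truth], convert_to_tensor=True)
bi_score = util.cos_sim(emb[0], emb[1]).item()

# Cross-encoder: score the pair jointly, which typically tracks human
# judgement more closely at the cost of pairwise inference.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
cross_score = cross_encoder.predict([(prediction, ground_truth)])[0]

print(f"bi-encoder: {bi_score:.3f}, cross-encoder: {cross_score:.3f}")
```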
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity (pathsproject)
This document describes the SemEval-2012 Task 6 on semantic textual similarity. The task involved measuring the semantic equivalence of sentence pairs on a scale from 0 to 5. The training data consisted of 2000 sentence pairs from existing paraphrase and machine translation datasets. The test data also had 2000 sentence pairs from these datasets as well as surprise datasets. Systems were evaluated based on their Pearson correlation with human annotations. 35 teams participated and the best systems achieved a Pearson correlation over 80%. This pilot task established semantic textual similarity as an area for further exploration.
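Evaluation in this setup reduces to correlating system scores with the human 0-5 annotations; a small sketch with SciPy, using invented scores:

```python
# Hypothetical sketch: scoring an STS system the way SemEval-2012 did,
# by Pearson correlation between system output and human 0-5 annotations.
from scipy.stats import pearsonr

human_scores  = [5.0, 3.8, 1.2, 0.0, 4.4]   # gold annotations (toy values)
system_scores = [4.7, 3.1, 1.9, 0.3, 4.9]   # model predictions (toy values)

r, p_value = pearsonr(human_scores, system_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")  # best 2012 systems: r > 0.80
```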
ACL-WMT2013. A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task (Lifeng (Aaron) Han)
The document describes two machine translation evaluation systems, nLEPOR_baseline and LEPOR_v3.1, that were submitted to the WMT13 Metrics Task. nLEPOR_baseline is an n-gram based metric that considers modified sentence length penalty, position difference penalty, and n-gram precision and recall. LEPOR_v3.1 is an enhanced version that uses a harmonic mean to combine factors and includes part-of-speech information. Evaluation results showed LEPOR_v3.1 had the highest correlation of 0.86 with human judgments for English to other language pairs.
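For intuition, here is a much-simplified sketch of how a LEPOR-style metric multiplies such factors together, assuming the commonly published forms of the two-sided length penalty and the recall-weighted harmonic mean; the weights and inputs are illustrative placeholders, not the tuned WMT13 submissions.

```python
# Simplified, illustrative LEPOR-style score: length penalty x position
# penalty x weighted harmonic mean of n-gram precision and recall.
import math

def length_penalty(cand_len: int, ref_len: int) -> float:
    # Two-sided brevity penalty: penalize both shorter and longer candidates.
    if cand_len == ref_len:
        return 1.0
    shorter, longer = min(cand_len, ref_len), max(cand_len, ref_len)
    return math.exp(1.0 - longer / shorter)

def harmonic(precision: float, recall: float, alpha=9.0, beta=1.0) -> float:
    # Weighted harmonic mean, recall-weighted as in published LEPOR variants.
    return (alpha + beta) / (alpha / recall + beta / precision)

def lepor_like(precision, recall, cand_len, ref_len, npd_penalty):
    # npd_penalty stands for exp(-NPD), NPD = normalized position difference.
    return length_penalty(cand_len, ref_len) * npd_penalty * harmonic(precision, recall)

print(lepor_like(precision=0.6, recall=0.7, cand_len=18, ref_len=20, npd_penalty=0.9))
```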
Semantic similarity, and semantic relatedness measures in particular, are very important at present due to the huge demand for natural language processing applications such as chatbots and information retrieval systems, for example FAQ systems backed by knowledge bases. Current approaches generally use similarity measures that do not exploit the context-sensitive relationships between words. This leads to erroneous similarity predictions and is of little use in real-life applications. This work proposes a novel approach that gives an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction results in more accurate similarity predictions and hence in higher accuracy for information retrieval systems.
Measuring Word Alignment Quality for Statistical Machine Translation (Yafi Azhari)
1) Previous metrics like Alignment Error Rate (AER) did not strongly correlate with statistical machine translation quality when word alignments were varied, showing the need for a new alignment quality metric.
2) The paper proposes using balanced F-measure with different precision-recall weights (α values) to measure alignment quality, which showed much higher correlation to translation quality across three language pairs than AER.
3) The best α values were 0.3-0.6, showing recall is somewhat more important than precision for alignment quality as it relates to translation quality. This new metric provides a better predictive measurement of how alignment quality impacts machine translation.
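The metric in point 2 is the weighted harmonic mean of precision and recall, F(α) = 1 / (α/P + (1−α)/R); a worked example with toy precision and recall values:

```python
# Worked example of the alpha-weighted F-measure for alignment quality:
# F(alpha) = 1 / (alpha/P + (1 - alpha)/R). alpha < 0.5 weights recall
# more heavily, matching the paper's best range of 0.3-0.6.
def f_alpha(precision: float, recall: float, alpha: float) -> float:
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

p, r = 0.90, 0.70
for alpha in (0.3, 0.5, 0.6):
    print(f"alpha={alpha}: F={f_alpha(p, r, alpha):.3f}")
# alpha=0.5 reduces to the balanced harmonic mean 2PR/(P+R) = 0.787
```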
A Pilot Study On Computer-Aided Coreference Annotation (Darian Pruitt)
This document describes a pilot study on using automatic coreference resolution to aid human annotation of coreference. It finds that pre-annotating data with the predictions of an existing coreference system can both reduce the time needed for human annotation and decrease error rates, by reducing the task to checking and modifying existing annotations rather than creating everything from scratch. The study uses the output of an automatic system for resolving nominal anaphora to pre-annotate German newspaper text, which is then manually edited using annotation software.
Assessing the Sufficiency of Arguments through Conclusion Generation (Asia Smith)
The paper proposes assessing the sufficiency of arguments by generating conclusions from premises using large language models. The authors fine-tune BART to generate conclusions for arguments with masked conclusions. They then use the generated and original conclusions in a modified RoBERTa model to assess argument sufficiency. Evaluating on student essay data annotated for argument structure and sufficiency, the approach outperforms the previous state-of-the-art for sufficiency assessment and achieves performance close to human levels, though the impact of conclusion generation is ultimately limited.
Automated evaluation of coherence in student essays (Sarah Marie)
This document summarizes a study that explored using Centering Theory to develop an automated metric for essay coherence. The study aimed to improve the performance of e-rater, an existing automated essay scoring system, by adding a new feature based on Centering Theory. Specifically, the percentage of "Rough Shifts" identified through Centering Theory analysis was tested as a predictor of essay coherence and scoring. The results showed that incorporating this new metric improved e-rater's ability to approximate human scores and provide more instructionally useful feedback to students.
TSD2013. Automatic Machine Translation Evaluation with Part-of-Speech Information (Lifeng (Aaron) Han)
Proceedings of the 16th International Conference of Text, Speech and Dialogue (TSD 2013). Plzen, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor
The document presents a knowledge-based method for measuring semantic similarity between texts. It combines word-to-word semantic similarity metrics with information about word specificity to calculate a text-to-text similarity score. An example application shows how word similarity scores from WordNet are combined using the Wu & Palmer metric to determine the semantic similarity between two text segments. The method is evaluated on paraphrase identification tasks and shown to outperform approaches based only on lexical matching.
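A hedged sketch of that scheme: match each word to its most similar counterpart in the other segment with a WordNet measure (Wu & Palmer here, via NLTK) and weight matches by word specificity. The idf values below are toy numbers; a real system estimates them from a corpus.

```python
# Hedged sketch of specificity-weighted, WordNet-based text similarity.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def word_sim(w1: str, w2: str) -> float:
    # Max Wu-Palmer similarity over noun synset pairs (0 if none found).
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = s1.wup_similarity(s2)
            if sim and sim > best:
                best = sim
    return best

def directional_sim(seg1, seg2, idf):
    # Specificity-weighted average of each word's best match in the other segment.
    num = sum(idf[w] * max(word_sim(w, w2) for w2 in seg2) for w in seg1)
    return num / sum(idf[w] for w in seg1)

idf = {"cat": 2.0, "mat": 2.5, "feline": 3.0, "rug": 2.8}  # toy specificity values
seg1, seg2 = ["cat", "mat"], ["feline", "rug"]
score = 0.5 * (directional_sim(seg1, seg2, idf) + directional_sim(seg2, seg1, idf))
print(f"text-to-text similarity: {score:.3f}")
```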
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS (gerogepatton)
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms and determine which ones are the most useful. It addresses the question of the most effective document similarity algorithm by categorizing them into three types: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are then compared using a series of benchmark datasets and evaluations that cover every area in which each algorithm could be used.
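For a concrete instance of the statistical category compared in the report, the snippet below computes pairwise document similarity with TF-IDF vectors and cosine similarity in scikit-learn:

```python
# Minimal example of a "statistical" document similarity algorithm:
# TF-IDF vectors compared with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Plagiarism detection compares documents for overlapping content.",
    "Text summarization condenses a document into a short summary.",
    "Detecting plagiarism means measuring how much documents overlap.",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(cosine_similarity(tfidf))  # pairwise matrix; docs 0 and 2 score highest
```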
A Formal Machine Learning or Multi Objective Decision Making System for Deter... (Editor IJCATR)
Decision-making typically needs mechanisms to compromise among opposing criteria. Once multiple objectives are involved in machine learning, a vital step is to relate the weights of individual objectives to system-level performance. Determining the weights of multiple objectives is an analysis process, and it has typically been treated as an optimization problem. However, our preliminary investigation has shown that existing methodologies for managing the weights of multiple objectives have some obvious limitations: when the determination of weights is treated as a single optimization problem, a result based on such an optimization is limited, and it can even be unreliable when knowledge about the multiple objectives is incomplete, for example due to poor data quality. The constraints on weights are also discussed. Variable weights are natural in decision-making processes, so we develop a systematic methodology for determining variable weights of multiple objectives. The roles of weights in multi-objective decision-making or machine learning are analyzed, and the weights are determined with the help of a standard neural network.
Using Word Embedding for Automatic Query Expansion (Dwaipayan Roy)
The document proposes using word embeddings to perform automatic query expansion. It explores using the Word2Vec framework to find semantically similar terms to add to the original query terms. The authors experiment with pre-retrieval, post-retrieval, and incremental nearest neighbor methods for query expansion using Word2Vec embeddings. While results show improvement over the baseline, the methods lagged behind the established RM3 query expansion technique. Future work could explore incorporating co-occurrence statistics with the Word2Vec embeddings to further improve performance.
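A hedged sketch of the pre-retrieval variant: train Word2Vec with gensim and expand each query term with its nearest neighbors in embedding space. The toy corpus and parameters below are placeholders for the retrieval collections used in the paper.

```python
# Hedged sketch of pre-retrieval query expansion with Word2Vec (gensim 4.x).
from gensim.models import Word2Vec

corpus = [
    ["information", "retrieval", "ranks", "documents"],
    ["query", "expansion", "adds", "related", "terms"],
    ["word", "embeddings", "capture", "term", "similarity"],
] * 50  # repeat the toy corpus so training has enough examples

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=20, seed=7)

def expand(query_terms, k=3):
    # Append each term's k nearest neighbors in the embedding space.
    expanded = list(query_terms)
    for term in query_terms:
        if term in model.wv:
            expanded += [w for w, _ in model.wv.most_similar(term, topn=k)]
    return expanded

print(expand(["query", "retrieval"]))
```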
COLING 2012 - LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors (Lifeng (Aaron) Han)
"LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors"
Publisher: Association for Computational Linguistics, December 2012
Authors: Aaron Li-Feng Han, Derek F. Wong and Lidia S. Chao
Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 441–450, Mumbai, December 2012. Open tool https://github.com/aaronlifenghan/aaron-project-lepor
The document discusses using machine learning to identify factors that correlate with essay grades and English proficiency levels. It analyzes a database of 481 essays written by Dutch students and coded with 81 linguistic variables. Various machine learning algorithms are tested to predict the essays' proficiency levels as assigned by human raters. The best performing algorithm, Logistic Model Tree, achieves accuracy comparable to human raters when using all 81 features directly. Feature selection and discretization are also tested to identify a minimal set of highly predictive features.
Automatic vs. human question answering over multimedia meeting recordings (Lê Anh)
Information access in meeting recordings can be assisted by
meeting browsers, or can be fully automated following a
question-answering (QA) approach. An information access task
is defined, aiming at discriminating true vs. false parallel statements
about facts in meetings. An automatic QA algorithm is
applied to this task, using passage retrieval over a meeting transcript.
The algorithm scores 59% accuracy for passage retrieval,
while random guessing is below 1%, but only scores 60% on
combined retrieval and question discrimination, for which humans
reach 70%–80% and the baseline is 50%. The algorithm
clearly outperforms humans for speed, at less than 1 second
per question, vs. 1.5–2 minutes per question for humans. The
degradation on ASR compared to manual transcripts still yields
lower but acceptable scores, especially for passage identification.
Automatic QA thus appears to be a promising enhancement
to meeting browsers used by humans, as an assistant for
relevant passage identification.
Sentence Validation by Statistical Language Modeling and Semantic Relations (Editor IJCATR)
This paper deals with sentence validation, a sub-field of Natural Language Processing. It finds applications in many areas, as it deals with understanding and manipulating natural language (English in most cases). The effort is thus on understanding and extracting the important information delivered to the computer, making efficient human-computer interaction possible. Sentence validation is approached in two ways: a statistical approach and a semantic approach. In both approaches, a database is trained with the help of sample sentences from the Brown corpus in NLTK. The statistical approach uses a trigram technique based on the N-gram Markov model, with modified Kneser-Ney smoothing to handle zero probabilities. As another statistical test, tagging and chunking of sentences containing named entities is carried out using pre-defined grammar rules and semantic tree parsing, and the chunked sentences are fed into another database, upon which testing is carried out. Finally, semantic analysis is carried out by extracting entity-relation pairs, which are then tested. After the results of all three approaches are compiled, graphs are plotted and the variations are studied. A comparison of the three models is thereby calculated and formulated; graphs of the probabilities from the three approaches clearly demarcate them and throw light on the findings of the project.
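The statistical component described above can be sketched with NLTK's language-modeling package: a trigram model with interpolated Kneser-Ney smoothing, here trained on a two-sentence toy corpus standing in for the Brown corpus.

```python
# Hedged sketch of a Kneser-Ney smoothed trigram model with nltk.lm.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
train, vocab = padded_everygram_pipeline(3, sentences)

lm = KneserNeyInterpolated(order=3)
lm.fit(train, vocab)

# Probability of "sat" given the context ("the", "cat"); smoothing keeps
# unseen continuations from collapsing to zero probability.
print(lm.score("sat", ["the", "cat"]))
```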
Sending out an SOS (Summary of Summaries): A Brief Survey of Recent Work on A... (Griffin Adams)
Research on the evaluation of abstractive summarization models has evolved considerably in the last few years. To take stock of these changes, namely, the shift from n-gram overlap to fact-based assessments, I have written a brief summary of papers on evaluation metrics, most of which focuses on the period from 2018 to mid-2020. The paper is written in a terse literature review format, so as to aid researchers when crafting related works sections of papers on summary evaluation.
IRJET - Semantic Similarity Between Sentences (IRJET Journal)
This document discusses approaches to measuring semantic similarity between sentences. It evaluates three approaches: cosine similarity, path-based measures using WordNet, and a feature-based approach. The feature-based approach generates similarity scores based on tagging parts of speech, lemmatization, and comparing nouns and verbs between sentences. It is concluded that the feature-based approach provides better semantic similarity scores compared to existing path-based and cosine similarity measures.
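A hedged sketch of the feature-based approach: POS-tag and lemmatize both sentences, then compare their noun and verb sets. The Jaccard overlap used below is an illustrative scoring choice, not necessarily the paper's exact formula.

```python
# Hedged sketch of feature-based sentence similarity with NLTK.
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("wordnet")
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def content_lemmas(sentence: str):
    # Tag parts of speech, then lemmatize and keep nouns and verbs.
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
    nouns = {lemmatizer.lemmatize(w, "n") for w, t in tagged if t.startswith("NN")}
    verbs = {lemmatizer.lemmatize(w, "v") for w, t in tagged if t.startswith("VB")}
    return nouns | verbs

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

s1 = "The cat chased the mouse."
s2 = "A cat was chasing a small mouse."
print(jaccard(content_lemmas(s1), content_lemmas(s2)))
```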
Data linkage involves matching records from two different datasets that belong to the same person. It typically involves three stages: pre-linkage (data cleaning), linkage (comparing records and determining matches), and post-linkage (checking matches and estimating error rates). The two main methods are deterministic matching, which requires exact matches on identifying fields, and probabilistic matching, which calculates match probabilities based on partial identifier agreements. Probabilistic matching is more computationally demanding but reduces missed matches by modeling inconsistencies in identifying data. Key parameters that affect probabilistic matching are data quality, the probability of random field agreements between non-matched records, and the overall number of true matches.
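The probabilistic method can be illustrated with Fellegi-Sunter-style match weights: each field contributes log2(m/u) on agreement and log2((1−m)/(1−u)) on disagreement, where m and u are the agreement probabilities among true matches and non-matches respectively. The m/u values and the threshold below are illustrative assumptions.

```python
# Hedged sketch of probabilistic record linkage (Fellegi-Sunter style).
import math

FIELDS = {            # field: (m, u), both assumed/illustrative
    "surname":   (0.95, 0.01),
    "birthdate": (0.97, 0.003),
    "postcode":  (0.90, 0.05),
}

def match_weight(record_a: dict, record_b: dict) -> float:
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if record_a[field] == record_b[field]:
            total += math.log2(m / u)          # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

a = {"surname": "smith", "birthdate": "1980-01-02", "postcode": "2000"}
b = {"surname": "smith", "birthdate": "1980-01-02", "postcode": "2010"}
w = match_weight(a, b)
print(f"weight = {w:.2f}, match = {w > 10.0}")  # threshold chosen for illustration
```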
This document provides details about a case study assignment for a management information systems course. It includes guidelines for completing four individual case study analyses over the semester. Each case study includes 3-5 multiple choice questions worth 1 mark each. The questions assess understanding of concepts from the readings and their application to the case studies. Topics covered in the case studies include automated essay grading systems, data management at a water utility company, and the use of analytics and data management software at a fleet management company. Students will also present a summary of their analysis of one case study.
This document outlines guidelines for an assignment involving analysis of four case studies. It provides details of the individual assessment, including that it is worth 25 marks and involves answering questions for each case study as well as presenting one case study. It then provides the details of Case Study 1, involving whether computers should grade essays. It discusses the benefits and drawbacks of automated essay scoring and factors to consider in deciding to use it.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING (ijnlc)
In this paper, we propose a novel algorithm that rearranges the topic assignments obtained from topic modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how closely the results conform to expert opinion, captured in a data structure we define, called TDAG, which represents the probability that a pair of highly correlated words appear together. To ensure the internal structure does not change too much under the rearrangement, coherence, a well-known metric for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure. We develop two ways to systematically obtain the expert opinion from data, depending on whether the data has relevant expert writing or not. The final algorithm, which takes both coherence and expert opinion into account, is presented. Finally, we compare the amount of adjustment needed for each topic modeling method, NMF and LDA.
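Coherence, the controlling metric mentioned above, can be computed with gensim's CoherenceModel; a toy sketch in which the corpus and topics are placeholders:

```python
# Hedged sketch of measuring topic coherence with gensim's CoherenceModel.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

texts = [
    ["neural", "network", "training", "loss"],
    ["topic", "model", "words", "documents"],
    ["neural", "network", "layers", "weights"],
    ["topic", "words", "coherence", "model"],
]
dictionary = Dictionary(texts)

# Two toy topics, e.g. as produced by NMF or LDA.
topics = [["neural", "network", "training"], ["topic", "model", "words"]]
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())
```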
An Approach for Big Data to Evolve the Auspicious Information from Cross-Domains (IJECEIAES)
Sentiment analysis is the pre-eminent technology for extracting relevant information from a data domain. In this paper, the cross-domain sentiment classification approach Cross_BOMEST is proposed. The approach extracts negative words using the existing BOMEST technique; with the help of MS Word Interop, Cross_BOMEST determines negative words and replaces all their synonyms to escalate the polarity, then blends two different domains and detects all the self-sufficient words. The proposed algorithm is executed on Amazon datasets, where two different domains are trained to analyze the sentiment of reviews from the remaining domain. The approach yields propitious results in cross-domain analysis, and an accuracy of 92% is obtained. Precision and recall of BOMEST are improved by 16% and 7%, respectively, by Cross_BOMEST.
Hybrid Deep Learning Model for Multilingual Sentiment Analysis (IRJET Journal)
This document discusses hybrid deep learning models for multilingual sentiment analysis. It proposes a hybrid model that uses a convolutional neural network (CNN) for feature extraction and long short-term memory (LSTM) for recurrence. The model aims to improve accuracy over existing techniques by up to 11.6% on benchmarks. Previous research found that combining deep learning models with support vector machines (SVM) produced better sentiment analysis results than single models alone. However, hybrid models with SVM took significantly longer to compute. The document also reviews related work applying deep learning techniques like DNN, CNN and RNN to sentiment analysis tasks.
Generation of Question and Answer from Unstructured Document using Gaussian M... (IJACEE)
The document describes a system that automatically generates questions and answers from an unstructured document. It involves several steps: (1) simplifying complex sentences, (2) generating initial questions using named entities and semantic role labeling, (3) identifying subtopics using LDA and GMNTM models, (4) measuring syntactic correctness of questions, and (5) extracting answers using pattern matching. The system is expected to produce more accurate results compared to using only LDA for subtopic identification, as GMNTM also considers word order and semantics. Key techniques include semantic role labeling, Extended String Subsequence Kernel for similarity measurement, and syntactic tree kernel for question ranking.
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING (kevig)
Applying natural language processing algorithms is currently popular in legal applications, for instance document classification of legal documents, contract review, and machine translation. All of these machine learning algorithms need to encode the words in a document in the form of vectors. The word embedding model is a modern distributed word representation approach and the most common unsupervised word encoding method; it provides the input for other algorithms that perform the downstream natural language processing tasks. The most common and practical approach to accuracy evaluation of a word embedding model uses a benchmark set with linguistic rules or relationships between words to perform analogy reasoning via algebraic calculation. This paper proposes establishing a 1,256-question Legal Analogical Reasoning Question Set (LARQS) from the 2,388 Chinese Codex corpus using five kinds of legal relations, which are then used to evaluate the accuracy of a Chinese word embedding model. Moreover, we discovered that legal relations may be ubiquitous in the word embedding model.
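A hedged sketch of the benchmark procedure: answer each analogy question a : b :: c : ? by vector algebra over the embeddings and compare against the expected word. Public English GloVe vectors and one classic question stand in here for the Chinese legal embeddings and the 1,256 LARQS items.

```python
# Hedged sketch of analogy-based evaluation of a word embedding model.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # stand-in model; downloads on first use

questions = [("man", "king", "woman", "queen")]  # (a, b, c, expected)

correct = 0
for a, b, c, expected in questions:
    # b - a + c ~ expected, answered via most_similar's vector arithmetic.
    predicted = vectors.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
    correct += predicted == expected
print(f"analogy accuracy: {correct / len(questions):.2%}")
```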
Identification and Classification of Named Entities in Indian Languages (kevig)
The process of identifying Named Entities (NEs) in a given document and then classifying them into different categories is referred to as Named Entity Recognition (NER). A great deal of effort is needed to perform NER in Indian languages and to achieve accuracy equal to or higher than that obtained for English and the European languages. In this paper, we present the results we have achieved by performing NER in Hindi, Bengali and Telugu using a Hidden Markov Model (HMM), along with performance metrics.
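A minimal sketch of HMM-based NER with NLTK's supervised HMM trainer; the two toy training sentences stand in for an annotated Hindi, Bengali, or Telugu corpus.

```python
# Hedged sketch: train a supervised HMM tagger on NE-labeled sentences.
from nltk.tag import hmm

train_data = [  # toy stand-ins for an annotated corpus
    [("mohan", "PER"), ("lives", "O"), ("in", "O"), ("delhi", "LOC")],
    [("sita", "PER"), ("works", "O"), ("at", "O"), ("infosys", "ORG")],
]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_data)
print(tagger.tag(["mohan", "works", "in", "delhi"]))
```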
Effect of Query Formation on Web Search Engine Results (kevig)
A query to a search engine is generally based on natural language. A query can be expressed in more than one way without changing its meaning, depending on the searcher's thinking at a particular moment. The searcher's aim is to get the most relevant results regardless of how the query has been expressed. In the present paper, we examine search engine results for changes in coverage and in the similarity of the first few results when a query is entered in two semantically equivalent but differently worded formats. Searches were made through the Google search engine, and fifteen pairs of queries were chosen for the study. The t-test was used for this purpose, and the results were checked on the basis of the total documents found and the similarity of the first five and first ten documents returned for a query entered in the two different formats. It was found that the total coverage is the same, but the first few results are significantly different.
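The significance test can be sketched as a paired t-test over the fifteen query pairs, for example on the total documents found for the two phrasings; the counts below are invented for illustration.

```python
# Hedged sketch of the paper's significance test: paired t-test on the
# total documents found for each query pair (toy values).
from scipy.stats import ttest_rel

totals_phrasing_a = [1200, 950, 3100, 400, 780, 2200, 640, 1500,
                     990, 870, 1320, 450, 2750, 610, 1080]
totals_phrasing_b = [1180, 960, 3050, 410, 800, 2150, 655, 1490,
                     1010, 860, 1300, 470, 2700, 600, 1100]

t_stat, p_value = ttest_rel(totals_phrasing_a, totals_phrasing_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # large p: no significant coverage difference
```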
Investigations of the Distributions of Phonemic Durations in Hindi and Dogri (kevig)
Speech generation is one of the most important areas of research in speech signal processing and is now gaining serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of more natural man-machine interfaces. However, in order to provide functions that are even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can generate more natural-sounding systems. This field of study, also called speech synthesis and more prominently acknowledged as text-to-speech synthesis, originated in the mid-eighties with the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field of speech, it is necessary to understand the basic theory of speech production. Every language has different phonetic alphabets and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we recorded five speakers in Dogri (3 male and 5 female) and eight speakers in Hindi (4 male and 4 female). For estimating the durational distributions, the mean of means of ten instances of vowels of each speaker in both languages was calculated. Investigations have shown that the two durational distributions differ significantly with respect to mean and standard deviation. The duration of a phoneme is speaker dependent. The whole investigation leads to the conclusion that almost all Dogri phonemes have shorter durations than their Hindi counterparts: the durations in milliseconds of the same phonemes were longer when uttered in Hindi than when spoken by a person with Dogri as their mother tongue. There are many applications directly or indirectly related to this research, for instance transforming Dogri speech into Hindi and vice versa, and, building on that, a speech aid to teach Dogri to children. The results may also be useful for synthesizing the phonemes of Dogri using the parameters of the phonemes of Hindi and for building large-vocabulary speech recognition systems.
May 2024 - Top 10 Cited Articles in Natural Language Computing (kevig)
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate the languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perception (kevig)
Speech is an important biological signal, the primary mode of communication among human beings and the most natural and efficient form of exchanging information. Speech processing is a central topic in signal processing. In this paper, the linear algebra technique called singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving the important parameters of a signal. The parameters derived using SVD may be further reduced by perceptual evaluation of the synthesized speech using only perceptually important parameters, so that the speech signal can be compressed and the information transformed into compressed form without losing its quality. This technique finds wide application in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD-based feature selection on the perception of the processed speech signal. Speech signals consisting of the vowels \a\, \e\, \u\ were recorded from each of six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD-based processing, and the effect of reducing the number of singular values on the perception of the resynthesized vowels was investigated. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
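A hedged numeric sketch of SVD-based processing: frame a (here synthetic) vowel-like signal into a matrix, keep only the dominant singular values, and measure the reconstruction error; real experiments would use the recorded vowels.

```python
# Hedged sketch of SVD-based reduction of a framed speech-like signal.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
signal += 0.05 * rng.standard_normal(t.size)   # synthetic "vowel" plus noise

frames = signal.reshape(100, 80)               # 100 frames of 80 samples
U, s, Vt = np.linalg.svd(frames, full_matrices=False)

k = 8                                          # keep the k dominant components
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

err = np.linalg.norm(frames - approx) / np.linalg.norm(frames)
print(f"kept {k}/{len(s)} singular values, relative error {err:.3f}")
```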
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models (kevig)
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. We have observed two main findings from our study. First, we discovered that prompts using the term "answer" lead to more effective relevance evaluations than those using "relevant." This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of "relevance." While the term "relevant" can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps in more precisely defining this balance; by providing clearer contexts for the term "relevance," few-shot examples contribute to refining the relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of that word. In this paper we consider the problem of deciding whether identical words contained in different documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the similarity of documents in which some words may be used with different meanings. We present three new strategies for solving this problem, which are used to filter homonyms out of the similarity computation. Two of them are intrinsically non-semantic, whereas the third has a semantic flavor and can also be applied to word sense disambiguation. The three strategies have been embedded in an article recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach For Arabic Part Of Speech Tagging (kevig)
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabikevig
Â
Machine Transliteration has come out to be an emerging and a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of name entities plays a very significant role in improving the quality of machine translation.
In this paper we are doing machine transliteration for English-Punjabi language pair using rule based
approach. We have constructed some rules for syllabification. Syllabification is the process to extract or
separate the syllable from the words. In this we are calculating the probabilities for name entities (Proper
names and location). For those words which do not come under the category of name entities, separate
probabilities are being calculated by using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities we are transliterating our input text from English to
Punjabi.
Improving Dialogue Management Through Data Optimizationkevig
Â
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification using Parsed Language Structurekevig
Â
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of âproof texts,â The Federalist Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted from the statisticalnaturallanguage parserwere explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
Rag-Fusion: A New Take on Retrieval Augmented Generationkevig
Â
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...kevig
Â
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLM and generative models), consuming much less energy and requiring fewer resources. These findings suggest that companies should consider additional considerations when choosing machine learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Languagekevig
Â
Large language models (LLMs) have garnered significant attention, but the definition of âlargeâ lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces an own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7% which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
IMPROVING DIALOGUE MANAGEMENT THROUGH DATA OPTIMIZATIONkevig
Â
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and
computers through natural language stands as a critical advancement. Central to these systems is the
dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user
goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue
management has embraced a variety of methodologies, including rule-based systems, reinforcement
learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This
research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue
managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as
Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the
introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset
imperfections, especially mislabeling, on the challenges inherent in refining dialogue management
processes.
Document Author Classification Using Parsed Language Structurekevig
Â
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the
text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used,
for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern
times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of
using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship
using grammatical structural information extracted using a statistical natural language parser. This paper provides a
proof of concept, testing author classification based on grammatical structure on a set of âproof texts,â The Federalist
Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted
of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the
features into a lower dimensional space. Statistical experiments on these documents demonstrate that information
from a statistical parser can, in fact, assist in distinguishing authors.
RAG-FUSION: A NEW TAKE ON RETRIEVALAUGMENTED GENERATIONkevig
Â
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product
information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots,
but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines
RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal
scores and fusing the documents and scores. Through manually evaluating answers on accuracy,
relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and
comprehensive answers due to the generated queries contextualizing the original query from various
perspectives. However, some answers strayed off topic when the generated queries' relevance to the
original query is insufficient. This research marks significant progress in artificial intelligence (AI) and
natural language processing (NLP) applications and demonstrates transformations in a global and multiindustry context
Performance, energy consumption and costs: a comparative analysis of automati...kevig
Â
The common practice in Machine Learning research is to evaluate the top-performing models based on their
performance. However, this often leads to overlooking other crucial aspects that should be given careful
consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into
account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches
(SVM-based) and more recent approaches such as LLM (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance
parameters (standard indices), but also alternative measures such as timing, energy consumption and costs,
which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and
the in-production phases. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of
complex models (LLM and generative models), consuming much less energy and requiring fewer resources.
These findings suggest that companies should consider additional considerations when choosing machine
learning (ML) solutions. The analysis also demonstrates that it is increasingly necessary for the scientific
world to also begin to consider aspects of energy consumption in model evaluations, in order to be able to
give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
EVALUATION OF MEDIUM-SIZED LANGUAGE MODELS IN GERMAN AND ENGLISH LANGUAGEkevig
Â
Large language models (LLMs) have garnered significant attention, but the definition of âlargeâ lacks
clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six
billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative
question answering, which requires models to provide elaborate answers without external document
retrieval. The paper introduces an own test dataset and presents results from human evaluation. Results
show that combining the best answers from different MLMs yielded an overall correct answer rate of
82.7% which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B
parameters, which highlights the importance of using appropriate training data for fine-tuning rather than
solely relying on the number of parameters. More fine-grained feedback should be used to further improve
the quality of answers. The open source community is quickly closing the gap to the best commercial
models.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
Â
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Â
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
Â
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Â
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Fueling AI with Great Data with Airbyte WebinarZilliz
Â
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Â
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
Â
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di piÚ di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilitĂ , standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunitĂ open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. à stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Â
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Ivantiâs Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There weâll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Â
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
Â
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Â
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Â
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
Â
An English đŹđ§ translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech đ¨đż version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
20240609 QFM020 Irresponsible AI Reading List May 2024
Â
EVALUATION OF SEMANTIC ANSWER SIMILARITY METRICS

1 Farida Mustafazade and 2 Peter F. Ebbinghaus
1 GAM Systematic
2 Teufel Audio

International Journal on Natural Language Computing (IJNLC) Vol.11, No.3, June 2022
DOI: 10.5121/ijnlc.2022.11305
ABSTRACT
There are several issues with the existing general machine translation or natural language generation evaluation metrics, and question-answering (QA) systems are no different in that regard. To build robust QA systems, we need equivalently robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics as opposed to pure string overlap is important to compare models fairly and to indicate more realistic acceptance criteria in real-life applications. We build upon the first paper, to our knowledge, that uses transformer-based model metrics to assess semantic answer similarity, and achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
KEYWORDS
Question-answering, semantic answer similarity, exact match, pre-trained language models, cross-encoder,
bi-encoder, semantic textual similarity, automated data labelling
1. INTRODUCTION
Having reliable metrics for evaluation of language models in general, and models solving difficult
question answering (QA) problems, is crucial in this rapidly developing field. These metrics are
not only useful to identify issues with the current models, but they also influence the development
of a new generation of models. In addition, it is preferable to have an automatic, simple metric as
opposed to expensive, manual annotation or a highly configurable and parameterisable metric so
that the development and the hyperparameter tuning do not add more layers of complexity. SAS, a
cross-encoder-based metric for the estimation of semantic answer similarity [1], provides one such
metric to compare answers based on semantic similarity.
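As a minimal illustration of how such a cross-encoder similarity score can be obtained, consider the following sketch (ours, not the authors' released code), using the Sentence Transformers library and the public English STS cross-encoder referenced later in this paper:

```python
# Minimal sketch of a SAS-style cross-encoder similarity score.
# Assumes the sentence-transformers package; the model name is the public
# English STS cross-encoder discussed in Section 4.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-large")

# The cross-encoder sees both answers jointly, so no embeddings can be
# pre-computed; every pair requires a full forward pass.
pairs = [("National Football League", "the NFL")]
scores = model.predict(pairs)  # regression score for STS-trained models
print(scores[0])
```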
The central objective of this research project is to analyse pairs of answers similar to the one in Figure 1 and to examine evaluation errors across datasets and evaluation metrics.
The main hypotheses that we will aim to test thoroughly through experiments are twofold. Firstly,
lexical-based metrics are not well suited for automated QA model evaluation as they lack a notion
of context and semantics. Secondly, most metrics, specifically SAS and BERTScore, as described
in [1], find some data types more difficult to assess for similarity than others.
After familiarising ourselves with the current state of research in the field in Section 2, we describe
the datasets provided in [1] and the new dataset of names that we purposefully tailor to our
model in Section 3. This is followed by Section 4, introducing the four new semantic answer
similarity approaches described in [1], our fine-tuned model as well as three lexical n-gram-based
automated metrics. Then in Section 5, we thoroughly analyse the evaluation datasets described in the previous section and conduct an in-depth qualitative analysis of the errors. Finally, in Section 6, we summarise our contributions.

Figure 1: Representative example from NQ-open of a question and all semantic answer similarity measurement results.

Question: Who makes more money: NFL or Premier League?
Ground-truth answer: National Football League
Predicted answer: the NFL
EM: 0.00 | F1: 0.00 | Top-1-Accuracy: 0.00
SAS: 0.9008 | fBERT: 0.4317 | f′BERT: 0.4446 | Bi-Encoder: 0.5019
Human judgement: 2 (definitely correct prediction)
2. RELATED WORK
We define semantic similarity as different descriptions for something that has the same meaning
in a given context, following largely [2]'s definition of semantic and contextual synonyms. [3] noted that open-domain QA is inherently ambiguous because of the uncertainties in the language itself. The human annotators attach a label of 2 to all predictions that are "definitely correct", 1 to those that are "possibly correct", and 0 to those that are "definitely incorrect". Automatic evaluation based on exact match (EM)
fails to capture semantic similarity for definitely correct answers, where 60% of the predictions are
semantically equivalent to the ground-truth answer. Just under a third of the predictions that do
not match the ground-truth labels were nonetheless correct. They also mention other reasons for
failure to spot equivalence, such as time-dependence of the answers or underlying ambiguity in the
questions.
QA evaluation metrics in the context of the SQuAD v1.0 dataset [4] are analysed in [5]. They
thoroughly discuss the limitations of EM and F1 score from n-gram based metrics, as well as the
importance of context including the relevance of questions to the interpretation of answers. A BERT
matching metric (Bert Match) is proposed for answer equivalence prediction, which performs better when the questions are included alongside the two answers, although appending contexts did not improve results. Additionally, the authors demonstrate the better suitability of Bert Match for constructing a top-k model's predictions. In contrast, we will cover multilingual datasets as well as more token-level equivalence measures, but limit our focus to the similarity of answer pairs without accompanying questions or contexts.
Two out of four semantic textual similarity (STS) metrics that we analyse and the model that we
eventually train depend on bi-encoder and BERTScore [6]. The bi-encoder model is based on the Sentence Transformers architecture [7], which is a faster adaptation of BERT for semantic search and clustering types of problems. BERTScore uses BERT to generate contextual embeddings,
then match the tokens of the ground-truth answer and prediction, followed by creating a score from
the maximum cosine similarity of the matched tokens. This metric is not one-size-fits-all. On top
of choosing a suitable contextual embedding and model, there is an optional feature of importance
weighting using inverse document frequency (idf). The idea is to limit the influence of common
words. One of the findings is that most automated evaluation metrics demonstrate significantly
better results on datasets without adversarial examples, even when these are introduced within
the training dataset, while the performance of BERTScore suffers only slightly. [6] uses machine
translation (MT) and image captioning tasks in experiments and not QA. [8] apply BERT-based
evaluation metrics for the first time in the context of QA. Even though they find that METEOR
as an n-gram based evaluation metric proved to perform better than the BERT-based approaches,
they encourage more research in the area of semantic text analysis for QA. Moreover, [5] uses
only the base BERTScore model as one of the benchmarks, while we explore the larger model, as well as a fine-tuned variation of it.
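To make the two metric families concrete, the following sketch (our illustration, using the public sentence-transformers and bert-score packages with model names as reported in this paper, not the exact evaluation pipeline) scores one answer pair with a bi-encoder and with BERTScore:

```python
# Sketch: scoring one answer pair with a bi-encoder and with BERTScore.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bertscore

prediction, ground_truth = "the NFL", "National Football League"

# Bi-encoder: encode each answer independently, then compare with cosine.
bi_encoder = SentenceTransformer(
    "T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
emb = bi_encoder.encode([prediction, ground_truth], convert_to_tensor=True)
print("bi-encoder:", util.cos_sim(emb[0], emb[1]).item())

# BERTScore: greedy token matching over contextual embeddings;
# returns precision, recall, and F1 tensors.
P, R, F1 = bertscore([prediction], [ground_truth], lang="en")
print("BERTScore F1:", F1.item())
```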
The authors of [1] expand on this idea and further address the issues with existing general MT and natural language generation (NLG) evaluation metrics, which also cover generative QA and extractive QA. These include reliance on string-based methods, such as EM, F1-score, and top-n-accuracy.
The problem is even more substantial for multi-way annotations. Here, multiple ground-truth
answers exist in the document for the same question, but only one of them is annotated. The
major contribution of the authors is the formulation and analysis of four semantic answer similarity
approaches that aim to resolve the issues mentioned above to a large extent. They also release two three-way annotated datasets, subsets of the English SQuAD dataset [9] and of the German GermanQuAD dataset [10], as well as an annotated subset of NQ-open [3].
Looking into error categories (see Table 1 and Section 5) revealed problematic data types, where
entities, particularly those involving names of any kind, turned out to be the leading category. [11]
analyse Natural Questions (NQ) [12], TriviaQA [13] as well as SQuAD and address the issue that
current QA benchmarks neglect the possibility of multiple correct answers. They focus on the
variations of names, e.g. nicknames, and improve the evaluation of Open-domain QA models
based on a higher EM score by augmenting ground-truth answers with aliases from Wikipedia
and Freebase. In our work, we focus solely on the evaluation of answer evaluation metrics and
generate a standalone names dataset from another dataset, described in greater detail in Section 3.
Our main assumption is that better metrics will have a higher correlation with human judgement,
but the choice of a correlation metric is important. Pearson correlation is a commonly used metric
in evaluating semantic text similarity (STS) for comparing the system output to human evaluation.
[14] show that Pearson power-moment correlation can be misleading when it comes to intrinsic
evaluation. They further go on to demonstrate that no single evaluation metric is well suited for
all STS tasks; hence, evaluation metrics should be chosen based on the specific task. In our case, most of the assumptions behind Pearson correlation, such as normality of the data and continuity of the variables, do not hold. Kendall's rank correlations are meant to be more robust and slightly more efficient in comparison to Spearman's, as demonstrated in [15].
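For reference, all three correlation coefficients used in this paper can be computed with scipy (a sketch of ours; scipy's kendalltau defaults to the tau-b variant discussed here):

```python
# Sketch: computing the three correlation coefficients between human
# labels and metric scores with scipy.
from scipy.stats import pearsonr, spearmanr, kendalltau

labels = [2, 0, 1, 2, 0]            # human annotations (illustrative)
scores = [0.9, 0.1, 0.4, 0.8, 0.3]  # metric outputs (illustrative)

r, _ = pearsonr(labels, scores)
rho, _ = spearmanr(labels, scores)
tau, _ = kendalltau(labels, scores)  # tau-b by default, robust to ties
print(f"r={r:.3f}, rho={rho:.3f}, tau={tau:.3f}")
```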
Soon after Transformers took over the field, adversarial tests resulted in significantly lower performance figures, which increased the importance of adversarial attacks [16]. General shortcomings of
language models and their benchmarks led to new approaches such as Dynabench [17]. Adversarial
GLUE (AdvGLUE) [18] focuses on the added difficulty of maintaining the semantic meaning when
applying a general attack framework for generating adversarial texts. There are other shortcomings
of large language models, including environmental and financial costs [19]. Hence, analysing
existing benchmarks is crucial in effectively supporting the pursuit of more robust models. We
therefore carefully analyse state-of-the-art benchmarking for semantic answer similarity metrics
while keeping in mind the more general underlying shortcomings of large pre-trained language
models.
3. DATA
We perform our analysis on three subsets of larger datasets annotated by three human raters and
provided by [1]. Unless specified otherwise, these will be referred to by their associated dataset
names.
Table 1: Category definitions and examples from the annotated NQ-open dataset.

Category | Definition | Question | Gold label | Prediction
Acronym | An abbreviation formed from the initial letters of other words and pronounced as a word | what channel does the haves and have nots come on on directv | OWN | Oprah Winfrey Network
Alias | Indicates an additional name that a person sometimes uses | who is the man in black the dark tower | Randall Flagg | Walter Padick
Co-reference | Requires resolution of a relationship between two distinct words referring to the same entity | who is marconi in we built this city | the father of the radio | Italian inventor Guglielmo Marconi
Different levels of precision | When both answers are correct, but one is more precise | when does the sympathetic nervous system be activated | constantly | fight-or-flight response
Imprecise question | There can be more than one correct answer | b-25 bomber accidentally flew into the empire state building | Old John Feather Merchant | 1945
Medical term | Language used to describe components and processes of the human body | what is the scientific name for the shoulder bone | shoulder blade | scapula
Multiple correct answers | There is no single definite answer | city belonging to mid west of united states | Des Moines | kansas city
Spatial | Requires an understanding of the concept of space, location, or proximity | where was the tv series pie in the sky filmed | Marlow in Buckinghamshire | bray studios
Synonyms | Gold label and prediction are synonymous | what is the purpose of a chip in a debit card | control access to a resource | security
Biological term | Of or relating to biology or life and living processes | where is the ground tissue located in plants | in regions of new growth | cortex
Wrong gold label | The ground-truth label is incorrect | how do you call a person who cannot speak | sign language | mute
Wrong label | The human judgement is incorrect | who wrote the words to the original pledge of allegiance | Captain George Thatcher Balch | Francis Julius Bellamy
Incomplete answer | The gold label answer contains only a subset of the full answer | what are your rights in the first amendment | religion | freedom of the press
3.1. Original datasets
SQuAD is an English-language dataset containing multi-way annotated questions with 4.8 answers per question on average. GermanQuAD is a three-way annotated German-language question/answer pairs dataset created by the deepset team, which also wrote [1]. Based on the German counterpart of the English Wikipedia articles used in SQuAD, GermanQuAD is the SOTA dataset for German question answering models. To address a shortcoming of SQuAD that was mentioned in [20], GermanQuAD was created with the goal of preventing strong lexical overlap between questions and answers. Hence, more complex questions were encouraged, and questions were rephrased with synonyms and altered syntax. SQuAD and GermanQuAD contain a pair of answers and a hand-labelled annotation of 0 if the answers are completely dissimilar, 1 if the answers have a somewhat similar meaning, and 2 if the two answers express the same meaning. NQ-open is a five-way annotated open-domain adaptation of [20]'s Natural Questions dataset. NQ-open is
based on actual Google search engine queries. In the case of NQ-open, the labels follow a different methodology, as described in [3]. The assumption is that we only keep questions with a non-vague interpretation (see Table 1). Questions like Who won the last FIFA World Cup? received the label 1 because they have different correct answers, without a precise answer at any point in time later than when the question was retrieved. There is yet another ambiguity in this question, namely whether it refers to the FIFA Women's World Cup or the FIFA Men's World Cup. This way, the two answers can be correct without semantic similarity, even though only one correct answer is expected.
The annotation of NQ-open indicates the truthfulness of the predicted answer, whereas for SQuAD and GermanQuAD the annotation relates to the semantic similarity of the two answers, which can lead to differences in interpretation as well as evaluation. To keep the methodology consistent and improve the NQ-open subset, vague questions with more than one ground-truth label have been filtered out, and incorrect labels have been manually corrected.
Table 2 describes the size and some lexical features for each of the three datasets. There were 2, 3
and 23 duplicates in each dataset respectively. Dropping these duplicates led to slight changes in
the metric scores.
Table 2: Percentage distribution of the labels and statistics on the subsets of datasets used in the analyses. The average answer size column refers to the average of both the first and second answers as well as the ground-truth answer and predicted answer (NQ-open only). F1 = 0 indicates no string similarity, F1 ≠ 0 indicates some string similarity. Label distribution is given in percentages.

                 SQuAD   GermanQuAD   NQ-open
Label 0           56.7         27.3      71.7
Label 1           30.7         51.5      16.6
Label 2           12.7         21.1      11.7
F1 = 0             565          124      3030
F1 ≠ 0             374          299       529
Size               939          423      3559
Avg answer size     23           68        13
3.2. Augmented dataset
For NQ-open, the largest of the three datasets, names was the most challenging category for similarity prediction (see Figure 4). While names include city and country names as well, we focus on the names of public figures in our work. To resolve this issue, we provide a new dataset that consists of ~40,000 (39,593) name pairs and employ the Augmented SBERT approach [21]: we use the cross-encoder
model to label a new dataset consisting of name pairs and then train a bi-encoder model on the
resulting dataset. We discuss the deployed models in more detail in Section 4.
The underlying dataset is created from an open dbpedia-data dataset [22], which includes the names of more than a million public figures that have a page on Wikipedia and DBpedia, including actors, politicians, scientists, sportsmen, and writers. Out of these, we only use those with a U.S. nationality, as the questions in NQ-open are predominantly on U.S.-related topics. We then shuffle the list of 25,462 names and pair them randomly to get the name pairs that are then labelled by the cross-encoder model.
The dataset includes different ways of writing a person's name, including aliases: for example, Gary A Labranche and Labranche Gary, aliases like Lisa Marie Abato's stage name Holly Ryder, as well as names written in another script, such as Rulan Chao Pian and the Chinese form of her name. We filter out all examples where more than three different ways of writing a person's name exist, because in these cases the names do not refer to the same person but were mistakenly included in the dataset; for example, names of various members of the Tampa Bay Rays minor league, who have one page for all members. Since most public figures in the dataset have at most one variation of their name, we only leave out close to 800 other variations this way, and can add 14,131 additional pairs. These are labelled as 1 because they refer to the same person.
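A condensed sketch of this Augmented SBERT procedure follows (our illustration of the recipe from [21]; the name list, pairing step, and training hyperparameters are simplified assumptions):

```python
# Sketch of the Augmented SBERT recipe: a cross-encoder silver-labels
# random name pairs, then a bi-encoder is fine-tuned on those labels.
import random
from torch.utils.data import DataLoader
from sentence_transformers import (CrossEncoder, SentenceTransformer,
                                   InputExample, losses)

names = ["..."]  # US-American public figures from the dbpedia-data dump
random.shuffle(names)
pairs = list(zip(names[0::2], names[1::2]))  # random, mostly dissimilar pairs

# 1) Silver-label the random pairs with the cross-encoder.
cross = CrossEncoder("cross-encoder/stsb-roberta-large")
silver = cross.predict(pairs)
train = [InputExample(texts=list(p), label=float(s))
         for p, s in zip(pairs, silver)]

# 2) Add same-person variants (aliases, reorderings) with label 1.0.
train.append(InputExample(texts=["Gary A Labranche", "Labranche Gary"],
                          label=1.0))

# 3) Fine-tune the bi-encoder on the resulting silver dataset.
bi = SentenceTransformer(
    "T-Systems-onsite/cross-en-de-roberta-sentence-transformer")
loader = DataLoader(train, shuffle=True, batch_size=64)
bi.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(bi))],
       epochs=2, warmup_steps=100)
```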
4. MODELS / METRICS
The focus of our research lies on different semantic similarity metrics and their underlying models.
As a human baseline, [1] reports correlations between the labels by the first and the second
annotator for subsets of SQuAD and GermanQuAD and omits these for the NQ-open subset since
they are not publicly available. Maximum Kendall's tau-b rank correlations are 0.64 for SQuAD
and 0.57 for GermanQuAD. The baseline semantic similarity models considered are bi-encoder,
BERTScore vanilla, and BERTScore trained, whereas the focus will be on cross-encoder (SAS)
performance. Table 3 outlines the exact configurations used for each model.
Table 3: Configuration details of each of the models used in evaluations. The architectures of the first two models follow the corresponding sequence-classification variants; the T-Systems-onsite model, as well as our trained model, follow XLMRobertaModel, and the other two follow the BertForMaskedLM and ElectraForPreTraining architectures respectively. Most of the models use absolute position embeddings.

Model | hidden size | intermediate size | max position embeddings | model type | num attention heads | num hidden layers | vocab size | transformers version
deepset/gbert-large-sts | 1,024 | 4,096 | 512 | bert | 16 | 24 | 31,102 | 4.9.2
cross-encoder/stsb-roberta-large | 1,024 | 4,096 | 514 | roberta | 16 | 24 | 50,265 | -
T-Systems-onsite/cross-en-de-roberta-sentence-transformer | 768 | 3,072 | 514 | xlm-roberta | 12 | 12 | 250,002 | -
bert-base-uncased | 768 | 3,072 | 512 | bert | 12 | 12 | 30,522 | 4.6.0.dev0
deepset/gelectra-base | 768 | 3,072 | 512 | electra | 12 | 12 | 31,102 | -
Augmented cross-en-de-roberta-sentence-transformer | 768 | 3,072 | 514 | xlm-roberta | 12 | 12 | 250,002 | 4.12.2
A cross-encoder architecture [23] concatenates two sentences with a special separator token and
passes them to a network to apply multi-head attention over all input tokens in one pass. Pre-computation is not possible with the cross-encoder approach because it takes both input texts into
account at the same time to calculate embeddings. A well-known language model that makes use
of the cross-encoder architecture is BERT [24]. The resulting improved performance in terms
of more accurate similarity scores for text pairs comes with the cost of higher time complexity,
i.e. lower speed, of cross-encoders in comparison to bi-encoders. A bi-encoder calculates the
embeddings of the two input texts separately by mapping independently encoded sentences for comparison to a dense vector space, which can then be compared using cosine similarity. The separate embeddings result in higher speed but reduced scoring quality, due to treating the text pairs completely separately [25]. In our work, both cross- and bi-encoder architectures are based
on Sentence Transformers [26].
The original bi-encoder applied in [1] uses the multilingual T-Systems-onsite/cross-en-de-roberta-sentence-transformer [27], which is based on xlm-roberta-base, further trained on an unreleased multilingual paraphrase dataset, resulting in the model paraphrase-xlm-r-multilingual-v1. The latter in turn was fine-tuned on an English-language STS benchmark dataset [28] and a machine-translated German STS benchmark.
[1] used separate English and German models for the cross-encoder because no multilingual cross-encoder implementation is available yet. Similar to the bi-encoder approach, the English SAS cross-encoder model relies on cross-encoder/stsb-roberta-large, which was trained on the same English STS benchmark. For German, a new cross-encoder model had to be trained, as there were no German cross-encoder models available. It is based on deepset's gbert-large [29] and trained on the same machine-translated German STS benchmark as the bi-encoder model, resulting in gbert-large-sts.
The BERTScore implementation from [6] is used for our evaluation, with minor changes to accommodate missing key-value pairs for the [27] model type. For BERTScore trained, the last-layer representations were used, while for BERTScore vanilla, only the second layer. BERTScore vanilla is based on bert-base-uncased for English (SQuAD and NQ-open) and deepset's gelectra-base [29] for German (GermanQuAD), whereas BERTScore trained is based on the multilingual model that is used by the bi-encoder [27]. BERTScore trained outperforms SAS for answer-prediction pairs without lexical overlap, the largest group in NQ-open, but neither of the models performs well on names. We use our new name-pairs dataset to train the Sentence Transformer [30] with the same hyperparameters as were used to train paraphrase-xlm-r-multilingual-v1 on the English-language STS benchmark dataset.
We ran an automatic hyperparameter search (Table 6) for 5 trials with Optuna [31]. Note that cross-validation is an approximation of Bayesian optimisation, so it is not necessary to use it with Optuna. The following set of hyperparameters was found to be the best: 'batch': 64, 'epochs': 2, 'warm': 0.45.
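The search space from Table 6 maps to Optuna roughly as follows (a sketch; train_and_evaluate is a hypothetical stand-in for the actual fine-tune-and-evaluate loop):

```python
# Sketch: the hyperparameter search space of Table 6 expressed in Optuna.
import optuna

def train_and_evaluate(batch_size, epochs, warmup_proportion):
    # Hypothetical placeholder: fine-tune the bi-encoder with these
    # settings and return, e.g., Kendall's tau-b on a validation split.
    return 0.0

def objective(trial):
    batch = trial.suggest_categorical("batch", [16, 32, 64, 128, 256])
    epochs = trial.suggest_categorical("epochs", [1, 2, 3, 4])
    warm = trial.suggest_float("warm", 0.0, 0.5)  # warm-up proportion
    return train_and_evaluate(batch, epochs, warm)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)
print(study.best_params)  # the paper reports batch=64, epochs=2, warm=0.45
```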
We have profiled all metrics from Table 5 for time complexity on NQ-open, as it is the largest evaluation dataset. Note that we haven't profiled training times, as those are not defined for lexical-based metrics, but only measured CPU time for predicting answer pairs in NQ-open. N-gram based metrics are much faster, as they don't involve any encoding or decoding steps, and they take ~10 s to generate similarity scores. The slowest is the cross-encoder, as it requires concatenating the answers first, followed by encoding, and it takes ~10 minutes; the cost of the concatenated input grows quadratically with input length due to the self-attention mechanism. For the same dataset, the bi-encoder takes ~2 minutes. BERTScore trained takes ~3 minutes, hence the computational costs of BERTScore and bi-encoders are comparable. The additional complexity for all methods mentioned above except SAS would be marginal when used during training on the validation set. The measurements were taken on the following system:

System name="Darwin", Release="20.6.0", Machine="x86_64",
Total Memory=8.00GB, Total cores=4, Frequency=2700.00MHz
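A simple harness along the following lines reproduces this kind of CPU-time measurement (our sketch; the metric function and the pair list are placeholders):

```python
# Sketch: timing a similarity metric over all answer pairs in a dataset.
# metric_fn and pairs are placeholders for any of the metrics above.
import time

def profile(metric_fn, pairs):
    start = time.process_time()  # CPU time, as reported in the text
    scores = [metric_fn(a, b) for a, b in pairs]
    elapsed = time.process_time() - start
    return scores, elapsed

pairs = [("the NFL", "National Football League")] * 1000
scores, secs = profile(lambda a, b: float(a == b), pairs)  # trivial metric
print(f"{secs:.2f}s of CPU time for {len(pairs)} pairs")
```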
5. ANALYSIS
To evaluate the shortcomings of lexical-based metrics in the context of QA, we compare BLEU,
ROUGE-L, METEOR, F1 and the semantic answer similarity metrics, i.e. Bi-Encoder, BERTScore
vanilla, BERTScore trained, and Cross-Encoder (SAS) scores on evaluation datasets. To address
the second hypothesis, we delve deeply into every single dataset and find differences between
different types of answers.
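Since EM and token-level F1 recur throughout this comparison, a minimal sketch of the standard SQuAD-style definitions may help (our illustration; whitespace tokenisation and the absence of answer normalisation are simplifications):

```python
# Sketch: SQuAD-style exact match and token-level F1 between a
# prediction and a ground-truth answer.
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    return float(prediction.strip().lower() == truth.strip().lower())

def token_f1(prediction: str, truth: str) -> float:
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the NFL", "National Football League"))  # 0.0, no overlap
```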
5.1. Quantitative Analysis
As can be observed from Table 4 and Table 5, lexical-based metrics show considerably lower results
than any of the semantic similarity approaches. BLEU lags behind all other metrics, followed by
METEOR. Similarly, we found that ROUGE-L and F1 achieve close results. In the absence of lexical overlap, METEOR gives results superior to the other n-gram-based metrics in the case of SQuAD, but ROUGE-L is closer to human judgement for the rest. The highest correlations are achieved by the trained BERTScore-based models, followed closely by the bi- and cross-encoder models. We found some inconsistencies regarding the performance of the cross-encoder based SAS metric: its superior performance doesn't hold up for correlation metrics other than Pearson. We observed that the SAS score underperformed when F1 = 0 compared to all other semantic answer similarity metrics and overperformed when there was some lexical similarity.
NQ-open is not only by far the largest of the three datasets but also the most skewed one. We
observe that the vast majority of answer-prediction pairs have a label 0 (see Table 2). In the
majority of cases, the underlying QA model predicted the wrong answer.
All four semantic similarity metrics perform considerably worse on NQ-open than on SQuAD
and GermanQuAD. In particular, answer-prediction pairs that have no lexical overlap (F1 = 0)
amount to 95 per cent of all pairs with the label 0 indicating incorrect predictions. Additionally,
they perform only marginally better than METEOR or ROUGE-L.
[Figure 2 plot "Distribution of evaluation metric scores": score distributions of BLEU, ROUGE-L, METEOR, F1-score, Bi-Encoder, fBERT, f′BERT, SAS, New Bi-Encoder, and f̃BERT for SQuAD, GermanQuAD, and NQ-open.]

Figure 2: Comparison of all (similarity) scores for the pairs in the evaluation datasets. METEOR computations for GermanQuAD are omitted since METEOR is not available for German.

The score distributions for SAS and BERTScore trained show that SAS scores are heavily tilted towards 0 (Figure 2).
In Figure 3, we analyse the SQuAD subset of answer pairs and observe a phenomenon similar to [1] when there is no lexical overlap between the answer pairs: the higher the layer we extract embeddings from for BERTScore trained, the higher the correlation with human labels. Quite the opposite is observed for BERTScore vanilla, which is either not as sensitive to the embedding representations in the case of no lexical overlap, or its correlations decrease with higher embedding layers.
Table 4: Pearson (r), Spearman's (ρ), and Kendall's (τ) rank correlations of annotator labels and automated metrics on subsets of GermanQuAD. fBERT is BERTScore vanilla and f′BERT is BERTScore trained.

              GermanQuAD
              F1 = 0               F1 ≠ 0
Metrics       r     ρ     τ        r     ρ     τ
BLEU          0.000 0.000 0.000    0.153 0.095 0.089
ROUGE-L       0.172 0.106 0.100    0.579 0.554 0.460
F1-score      0.000 0.000 0.000    0.560 0.534 0.443
Bi-Encoder    0.392 0.337 0.273    0.596 0.595 0.491
fBERT         0.149 0.008 0.006    0.599 0.554 0.457
f′BERT        0.410 0.349 0.284    0.606 0.592 0.489
SAS           0.488 0.432 0.349    0.713 0.690 0.574
Table 5: Pearson (r), Spearman's (ρ), and Kendall's (τ) rank correlations of annotator labels and automated metrics on subsets of SQuAD and NQ-open. fBERT is BERTScore vanilla, f′BERT is BERTScore trained, and f̃BERT is the new BERTScore trained on names.

                 SQuAD                                    NQ-open
                 F1 = 0              F1 ≠ 0               F1 = 0              F1 ≠ 0
Metrics          r     ρ     τ       r     ρ     τ        r     ρ     τ       r     ρ     τ
BLEU             0.000 0.000 0.000   0.182 0.168 0.159    0.000 0.000 0.000   0.052 0.054 0.051
ROUGE-L          0.100 0.043 0.041   0.556 0.537 0.455    0.220 0.163 0.159   0.450 0.458 0.377
METEOR           0.398 0.207 0.200   0.450 0.464 0.378    0.233 0.152 0.148   0.188 0.179 0.139
F1-score         0.000 0.000 0.000   0.594 0.579 0.497    0.000 0.000 0.000   0.394 0.407 0.337
Bi-Encoder       0.487 0.372 0.303   0.684 0.684 0.566    0.294 0.212 0.170   0.454 0.446 0.351
fBERT            0.249 0.132 0.108   0.612 0.601 0.492    0.156 0.169 0.135   0.165 0.142 0.112
f′BERT           0.516 0.391 0.318   0.698 0.688 0.571    0.319 0.225 0.181   0.452 0.449 0.354
SAS              0.561 0.359 0.291   0.743 0.735 0.613    0.422 0.196 0.158   0.662 0.647 0.512
New Bi-Encoder   0.501 0.391 0.318   0.694 0.690 0.572    0.338 0.252 0.203   0.501 0.501 0.392
f̃BERT            0.519 0.399 0.324   0.707 0.698 0.581    0.351 0.257 0.208   0.498 0.507 0.398
5.2. Qualitative Analysis
This section is entirely dedicated to highlighting the major categories of problematic samples in
each of the datasets.
5.2.1. SQuAD
In SQuAD, there are only 16 cases where SAS completely diverges from the human labels. In all seven cases where the SAS score is above 0.5 and the label is 0, we notice that the two answers either have a common substring or could often be used in the same context. In the other nine extreme cases, where the label is indicative of semantic similarity and SAS gives scores below 0.25, there are three spatial translations. There is also an encoding-related example, with 12 and 10 special characters in the respective answers, which seems to be a mislabelled example.

Table 6: Experimental setup for hyperparameter tuning of cross-encoder augmented BERTScore.

Batch Size   {16, 32, 64, 128, 256}
Epochs       {1, 2, 3, 4}
warm         uniform(0.0, 0.5)

[Figure 3 panels: "Correlation of BERTScore to label with no lexical overlap", "Correlation of BERTScore to human judgement for pairs with some lexical overlap", and "Correlation of BERTScore to human judgement", each plotting the Pearson, Spearman, and Kendall correlations of fBERT, f′BERT, and SAS against the embedding layer (0-12).]

Figure 3: Pearson, Spearman's, and Kendall's rank correlations for different embedding extractions when there is no lexical overlap (F1 = 0), when there is some overlap (F1 ≠ 0), and aggregated, for the SQuAD subset. fBERT is BERTScore vanilla and f′BERT is BERTScore trained.
5.2.2. GermanQuAD
Overall, error analysis for GermanQuAD is limited to a few cases, because it is the smallest dataset of the three and all language-model-based metrics perform comparably well, SAS in particular. SAS fails to identify semantic similarity in cases where the answers are synonyms or translations, which also include technical terms that rely on Latin (e.g. vis viva and living forces (translated) (SAS score: 0.5), Anorthotiko Komma Ergazomenou Laou and Progressive Party of the Working People (translation) (0.04), Nährgebiet and Akkumulationsgebiet (0.45), Zehrgebiet and Ablationsgebiet (0.43)). This is likely the case because SAS does not use a multilingual model. Since multilingual models have not been implemented for cross-encoders yet, this remains an area for future research. Text-based calculations and numbers are also problematic: (translated) 46th day before Easter Sunday and Wednesday after the 7th Sunday before Easter (0.41).

SAS also fails to recognise aliases or descriptions of relations that point to the same person or object: Thayendanegea and Joseph Brant (0.028) are the same person; BERTScore vanilla and BERTScore trained both find some similarity (0.36, 0.22). Goring House and Buckinghams Haus (0.29) refer to the same object, but one is the official name and the other a description of it; again, BERTScore vanilla and BERTScore trained identify more similarity (0.44, 0.37).
5.2.3. NQ-open
We also observe that for answer-prediction pairs which include numbers, e.g. an amount, a date, or a year, SAS as well as BERTScore trained diverge from the labels. For an answer expected to contain a numeric value, the only semantically similar prediction is the exact value, not one unit more or less. Also, the position within the pair seems to matter for digits and their string representations: for SAS, the pair of 11 and eleven has a score of 0.09, whereas the pair of eleven and 11 has a score of 0.89.
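This asymmetry is easy to probe directly, since the cross-encoder scores an ordered pair (a sketch using the same public STS cross-encoder; the exact scores will depend on the model used in [1]):

```python
# Sketch: probing the order sensitivity of a cross-encoder metric.
# Scores depend on the specific model; the values quoted in the text
# come from the SAS setup of [1].
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-roberta-large")
print(model.predict([("11", "eleven")]))  # ground truth first
print(model.predict([("eleven", "11")]))  # prediction first
```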
Figure 4 depicts the major error categories for when SAS scores range below 0.25 while human
annotations indicate a label of 2. We observe that entities related to names, which include spatial names as well as co-references and synonyms, form the largest group of scoring errors. After correcting encoding errors and manually fixing the labels in the NQ-open subset, totalling 70 samples, the correlations already improved by about one per cent for SAS. Correcting wrong labels in extreme cases, where the SAS score is below 0.25 and the label is 2, or where SAS is above
[Figure 4 bar chart of the error categories: Names, Spatial, Co-reference, Different levels of precision, Synonyms, Incomplete answer, Medical term, Biological term, Dates, Temporal, Alias, Numeric, Unrelated answer, Acronym, and Chemical term.]

Figure 4: Subset of the NQ-open test set, where the SAS score is < 0.01 and the human label is 2, manually annotated for an explanation of the discrepancies. The original questions and Google search have been used to assess the correctness of the gold labels.
0.5 and the label is 0 improves results almost across the board for all models, but more so for SAS. After removing duplicates and samples with imprecise questions, wrong gold labels, or multiple correct answers, we are left with 3559 ground-truth answer/prediction pairs, compared to the 3658 we started with.
An example of the better performance on names when applying our new bi-encoder and trained SBERT models can be seen in Figure 5, where both models perform well in comparison to SAS when measured against human judgement.
6. CONCLUSION
Existing evaluation metrics for QA models have various limitations. N-gram based metrics suffer from asymmetry, strictness, failure to capture multi-hop dependencies, penalisation of semantically critical word order, and failure to account for the relevant context or question, to name a few. We have found patterns in the mistakes that SAS makes. These include spatial awareness, names, numbers, dates, context awareness, translations, acronyms, scientific terminology, historical events, conversions, and encodings.
The comparison to annotator labels is performed on answer pairs taken from subsets of the SQuAD
and GermanQuAD datasets, while for NQ-open we compare a prediction and a ground-truth answer.
For cases with lexical overlap, ROUGE-L achieves results comparable to pre-trained semantic
similarity evaluation models at a fraction of the computational cost that the other models require. This
holds for GermanQuAD, SQuAD and NQ-open alike. We conclude that, to further improve
semantic answer similarity evaluation in German, future work should focus on providing a larger
dataset of answer pairs, since we could find only a few examples where SAS and the other metrics
based on pre-trained language models did not match the human annotator labels. Dataset size was
one of the reasons why we focused more heavily on the NQ-open dataset. In addition, focusing on the
other two would provide weaker evidence of how the metric performs when applied to model
predictions behind a real-world application. Furthermore, all semantic similarity metrics failed to
achieve a high correlation with human labels when there was no token-level overlap, which is arguably
the most important use case for a semantic answer similarity metric as opposed to, say, ROUGE-L.
NQ-open happened to have the largest number of samples that satisfied this requirement.
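The no-overlap condition itself reduces to a trivial set intersection; the whitespace tokenisation in this sketch is a simplification of the actual preprocessing:

```python
# Sketch: partition answer/prediction pairs by token-level overlap.
def has_token_overlap(answer: str, prediction: str) -> bool:
    return bool(set(answer.lower().split()) & set(prediction.lower().split()))

pairs = [("Amma", "Luke"), ("11 players", "eleven players")]
no_overlap = [p for p in pairs if not has_token_overlap(*p)]
print(no_overlap)  # [('Amma', 'Luke')] -- the hardest bucket for all metrics
```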
duplicates and re-labelling led to significant improvements across the board.

Figure 5: Representative example from NQ-open of a question and all semantic answer similarity
measurement results.
Question: Who killed Natalie and Ann in Sharp Objects?
Ground-truth answer: Amma
Predicted answer: Luke
EM: 0.00; F1: 0.00; Top-1-Accuracy: 0.00; SAS: 0.0096; Human judgement: 0; fBERT: 0.226;
f′BERT: 0.145; f̃BERT: 0.00; Bi-Encoder: 0.208; Bi-Encoder (new model): −0.034

We have generated a names dataset, which was then used to fine-tune the bi-encoder and
BERTScore models. The latter
achieves and beats state-of-the-art rank correlation figures when there is no lexical overlap for
datasets with English as the core language. Bi-encoders outperformed cross-encoders on
answer-prediction pairs without lexical overlap, both in terms of correlation to human judgement
and speed, which makes them more applicable in real-world scenarios. This, together with support
for multilingual setups, could be essential for companies as well, because off-the-shelf models most
probably will not understand the relationships between the different employees and stakeholders
mentioned in internal documents. A reason to prefer BERTScore would be the ability to use it as a
training objective that generates soft predictions, allowing the network to remain differentiable
end-to-end.
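To illustrate the last point, a stripped-down, differentiable BERTScore-style F1 can be written directly in PyTorch. This sketch omits the IDF weighting and baseline rescaling of the real metric and is not the implementation used in our experiments:

```python
# Sketch: a differentiable BERTScore-style F1; every step is a tensor op,
# so gradients flow back into the token embeddings that produced it.
import torch
import torch.nn.functional as F

def soft_bertscore_f1(cand_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """cand_emb: (m, d) candidate token embeddings; ref_emb: (n, d) reference."""
    sim = F.normalize(cand_emb, dim=-1) @ F.normalize(ref_emb, dim=-1).T  # (m, n)
    precision = sim.max(dim=1).values.mean()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy usage with random embeddings that require gradients.
cand = torch.randn(5, 768, requires_grad=True)
ref = torch.randn(7, 768)
loss = 1 - soft_bertscore_f1(cand, ref)
loss.backward()  # gradients reach cand, so the producing network stays trainable
```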
An element of future research would be further improving the performance on names of public
figures as well as spatial names such as cities and countries. Knowledge bases such as Freebase or
Wikipedia, as explored in [11], could be used to find equivalent answers for named geographical
entities. Numbers and dates, which remain problematic data types in multilingual as well as
monolingual contexts, would be another dimension.
7. ACKNOWLEDGEMENTS
We would like to thank Ardhendu Singh, Julian Risch, Malte Pietsch and XCS224U course
facilitators, Ankit Chadha in particular, as well as Christopher Potts for their constant support.
8. REFERENCES
[1] J. Risch, T. Möller, J. Gutsch, and M. Pietsch, "Semantic answer similarity for evaluating
question answering models," arXiv preprint arXiv:2108.06130, 2021.
[2] X.-M. Zeng, "Semantic relationships between contextual synonyms," US-China Education
Review, vol. 4, pp. 33–37, 2007.
[3] S. Min, J. Boyd-Graber, C. Alberti, D. Chen, E. Choi, M. Collins, K. Guu, H. Hajishirzi,
K. Lee, J. Palomaki, C. Raffel, A. Roberts, T. Kwiatkowski, P. Lewis, Y. Wu, H. Küttler,
L. Liu, P. Minervini, P. Stenetorp, S. Riedel, S. Yang, M. Seo, G. Izacard, F. Petroni,
L. Hosseini, N. D. Cao, E. Grave, I. Yamada, S. Shimaoka, M. Suzuki, S. Miyawaki, S. Sato,
R. Takahashi, J. Suzuki, M. Fajcik, M. Docekal, K. Ondrej, P. Smrz, H. Cheng, Y. Shen,
X. Liu, P. He, W. Chen, J. Gao, B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko,
M. Schlichtkrull, S. Gupta, Y. Mehdad, and W.-t. Yih, "NeurIPS 2020 EfficientQA competition:
Systems, analyses and lessons learned," in Proceedings of the NeurIPS 2020 Competition
and Demonstration Track (H. J. Escalante and K. Hofmann, eds.), vol. 133 of Proceedings of
Machine Learning Research, pp. 86–111, PMLR, 06–12 Dec 2021.
[4] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine
comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[5] J. Bulian, C. Buck, W. Gajewski, B. Boerschinger, and T. Schuster, "Tomayto, tomahto.
Beyond token-level answer equivalence for question answering evaluation," arXiv preprint
arXiv:2202.07654, 2022.
[6] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text
generation with BERT," arXiv preprint arXiv:1904.09675, 2019.
[7] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-
networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), (Hong Kong, China), pp. 3982–3992, Association for Computational
Linguistics, Nov. 2019.
[8] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, "Evaluating question answering evaluation,"
in Proceedings of the 2nd Workshop on Machine Reading for Question Answering, (Hong
Kong, China), pp. 119–124, Association for Computational Linguistics, Nov. 2019.
[9] P. Rajpurkar, R. Jia, and P. Liang, "Know what you don't know: Unanswerable questions for
SQuAD," arXiv preprint arXiv:1806.03822, 2018.
[10] T. Möller, J. Risch, and M. Pietsch, "GermanQuAD and GermanDPR: Improving non-English
question answering and passage retrieval," arXiv preprint arXiv:2104.12741, 2021.
[11] C. Si, C. Zhao, and J. Boyd-Graber, "What's in a name? Answer equivalence for open-domain
question answering," arXiv preprint arXiv:2109.05289, 2021.
[12] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein,
I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai,
J. Uszkoreit, Q. Le, and S. Petrov, "Natural Questions: A benchmark for question answering
research," Transactions of the Association for Computational Linguistics, 2019.
[13] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly
supervised challenge dataset for reading comprehension," CoRR, vol. abs/1705.03551, 2017.
[14] N. Reimers, P. Beyer, and I. Gurevych, "Task-oriented intrinsic evaluation of semantic
textual similarity," in Proceedings of COLING 2016, the 26th International Conference on
Computational Linguistics: Technical Papers, (Osaka, Japan), pp. 87–96, The COLING 2016
Organizing Committee, Dec. 2016.
[15] C. Croux and C. Dehon, "Influence functions of the Spearman and Kendall correlation
measures," Statistical Methods & Applications, vol. 19, pp. 497–515, 2010.
[16] T. Niven and H.-Y. Kao, "Probing neural network comprehension of natural language argu-
ments," in Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, (Florence, Italy), pp. 4658–4664, Association for Computational Linguistics,
July 2019.
[17] D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad,
A. Singh, P. Ringshia, et al., "Dynabench: Rethinking benchmarking in NLP," arXiv preprint
arXiv:2104.14337, 2021.
[18] B. Wang, C. Xu, S. Wang, Z. Gan, Y. Cheng, J. Gao, A. H. Awadallah, and B. Li, "Adversarial
GLUE: A multi-task benchmark for robustness evaluation of language models," in Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track
(Round 2), 2021.
[19] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic
parrots: Can language models be too big?," in Proceedings of the 2021 ACM Conference on
Fairness, Accountability, and Transparency, FAccT '21, (New York, NY, USA), pp. 610–623,
Association for Computing Machinery, 2021.
[20] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein,
I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai,
J. Uszkoreit, Q. Le, and S. Petrov, "Natural Questions: A benchmark for question answering
research," Transactions of the Association for Computational Linguistics, vol. 7, pp. 452–466,
Mar. 2019.
[21] N. Thakur, N. Reimers, J. Daxenberger, and I. Gurevych, "Augmented SBERT: Data
augmentation method for improving bi-encoders for pairwise sentence scoring tasks,"
arXiv preprint arXiv:2010.08240v2 [cs.CL], 2021.
[22] C. Wagner, "Politicians on Wikipedia and DBpedia," 2017.
[23] S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston, "Poly-encoders: Transformer archi-
tectures and pre-training strategies for fast and accurate multi-sentence scoring," 2020.
[24] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional
transformers for language understanding," CoRR, vol. abs/1810.04805, 2018.
[25] J. Chen, L. Yang, K. Raman, M. Bendersky, J.-J. Yeh, Y. Zhou, M. Najork, D. Cai, and
E. Emadzadeh, "DiPair: Fast and accurate distillation for trillion-scale text matching and
pair modeling," in Findings of the Association for Computational Linguistics: EMNLP 2020,
(Online), pp. 2925–2937, Association for Computational Linguistics, Nov. 2020.
[26] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-
networks," CoRR, vol. abs/1908.10084, 2019.
[27] P. May, "T-Systems-onsite/cross-en-de-roberta-sentence-transformer," Hugging Face, 2020.
[28] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia, "SemEval-2017 Task 1: Semantic
textual similarity multilingual and crosslingual focused evaluation," in Proceedings of the
11th International Workshop on Semantic Evaluation (SemEval-2017), (Vancouver, Canada),
pp. 1–14, Association for Computational Linguistics, Aug. 2017.
[29] B. Chan, S. Schweter, and T. Möller, "German's next language model," in Proceedings of the
28th International Conference on Computational Linguistics, (Barcelona, Spain (Online)),
pp. 6788–6796, International Committee on Computational Linguistics, Dec. 2020.
[30] F. Mustafazade and P. F. Ebbinghaus, "Evaluation of semantic answer similarity metrics."
https://github.com/e184633/semantic-answer-similarity, 2021.
[31] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, "Optuna: A next-generation hyperpa-
rameter optimization framework," in Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp. 2623–2631, 2019.
9. AUTHORS
Peter F. Ebbinghaus is currently SEO Team Lead at Teufel Audio. With a background in
econometrics and copywriting, his work focuses on the quantitative analysis of e-commerce
content. Based in Germany and Mexico, his research focuses in particular on applied
multilingual NLP.
Farida Mustafazade is a London-based quantitative researcher in the GAM Systematic Cambridge
investment team, where she focuses on macro and sustainable macro trading strategies. Her
expertise extends to applying machine learning, including NLP techniques, to financial as well
as alternative data.