The document describes research on developing an automatic system to detect plagiarized spoken responses in an English proficiency exam. The system was trained using two types of features: text similarity measures between responses and sources, and speaking proficiency measures from an automated speech scoring system. The system was evaluated on a large dataset of real exam responses and achieved an F1-score of 0.761 for detecting plagiarized responses. It was further validated on responses from a single exam and achieved a recall of 0.897, showing potential to help improve the validity of both human and automated exam scoring.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts (kevig)
Cyberbullying is currently one of the most active research fields. Most researchers have contributed to bully-text identification in English texts or comments; due to the scarcity of data, stemming Tamil text is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, so the creation of a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, the authors propose a rule-based iterative preprocessing algorithm (RBIPA). In this approach, Tamil morphemes and lemmas are extracted using a suffix-stripping technique, and a supervised machine learning algorithm classifies words as pronouns or proper nouns. The novelty of the proposed system is a preprocessing algorithm that stems and lemmatizes iteratively to recover the exact base words from Tamil-language comments. RBIPA achieves 84.96% accuracy on a test dataset of 13,000 words in total.
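The iterative suffix-stripping idea can be sketched as below. This is a minimal, hypothetical illustration: the suffix list is a tiny sample, the real RBIPA rule set is far larger and rule-ordered, and a full system would also repair sandhi changes at the stem boundary, which this sketch does not attempt.

```python
# A few common Tamil inflectional suffixes (illustrative sample only).
SUFFIXES = ["கள்", "ின்", "ால்", "ை"]

def iterative_stem(word: str, min_stem_len: int = 2) -> str:
    """Repeatedly strip the longest matching suffix until none applies."""
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so "கள்" wins over "ை" when both match.
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: len(word) - len(suffix)]
                changed = True
                break
    return word

# Strips the plural suffix; sandhi repair (e.g. restoring the final ம்)
# is left out of this sketch.
print(iterative_stem("மரங்கள்"))
```

The minimum-stem-length guard prevents the loop from eating a short word down to nothing, a standard safeguard in suffix strippers.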
A survey on phrase structure learning methods for text classification (ijnlc)
Text classification is the task of automatically assigning text to one of a set of predefined categories. The problem has been widely studied in communities such as natural language processing, data mining, and information retrieval. Text classification is an important constituent of many information management tasks, such as topic identification, spam filtering, email routing, language identification, genre classification, and readability assessment. The performance of text classification improves notably when phrase patterns are used: phrase patterns help capture non-local behaviour and thus improve the classification task. Phrase structure extraction is the first step toward phrase pattern identification. In this survey, a detailed study of phrase structure learning methods has been carried out. This will enable future work on several NLP tasks that use syntactic information from phrase structure, such as grammar checking, question answering, information extraction, machine translation, and text classification. The paper also provides different levels of classification and a detailed comparison of the phrase structure learning methods.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG... (ijnlc)
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical source. Our method leverages a resource available in another language, such as SentiWordNet in English, to supplement the limited language resources of Arabic. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method is to build a bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet. The scores for each text are added to the model as three values: objective, positive, and negative. The last step trains the ML classifier on that model to predict the sentiment of the Arabic text. In most cases, our method improves performance over the BOW baseline. It therefore appears to be a viable approach to sentiment analysis for Arabic text, where available resources are limited.
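The feature-injection step can be sketched as follows. Everything here is a hypothetical stand-in: the tiny lexicon mimics SentiWordNet-style (positive, negative) scores, and the tokens are assumed to be already machine-translated into English.

```python
# word -> (positive, negative); objectivity is taken as 1 - pos - neg,
# loosely mirroring the SentiWordNet score triple. Values are made up.
LEXICON = {
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
    "movie": (0.0, 0.0),
}

def featurize(tokens, vocabulary):
    """Bag-of-words counts extended with summed pos/neg/obj lexicon scores."""
    bow = [tokens.count(w) for w in vocabulary]
    pos = sum(LEXICON.get(t, (0.0, 0.0))[0] for t in tokens)
    neg = sum(LEXICON.get(t, (0.0, 0.0))[1] for t in tokens)
    obj = sum(1.0 - sum(LEXICON.get(t, (0.0, 0.0))) for t in tokens)
    return bow + [pos, neg, obj]

vocab = ["good", "bad", "movie"]
print(featurize(["good", "movie"], vocab))  # [1, 0, 1, 0.75, 0.0, 1.25]
```

Any downstream classifier then trains on the extended vector, so the lexicon knowledge is available alongside the plain BOW counts.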
Cross lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model, trained on a bilingual corpus, that divides a set of sentences in the target language into two classes according to their similarity to a given sentence in the source language. Positive outputs of the discriminative model are then ranked by similarity probability, and the translation candidates for the given sentence are finally selected from the top-n positive results. One of the problems in model building is the extremely imbalanced training data: the positive samples are the translations of the target sentences, while the negative samples, the non-translations, are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on an open dataset of 1,500 English-Chinese sentence pairs, evaluated with three metrics, show satisfactory performance, much higher than the baseline system.
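One common way to construct a sampling set from such imbalanced data is random undersampling of the abundant negatives, sketched below. The names and the 1:1 ratio are illustrative assumptions, not the paper's actual four sampling schemes.

```python
import random

def undersample(positives, negatives, ratio=1.0, seed=0):
    """Keep all positives; sample roughly ratio * len(positives) negatives."""
    rng = random.Random(seed)
    k = min(len(negatives), int(ratio * len(positives)))
    sampled = rng.sample(negatives, k)
    data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(data)
    return data

pos = ["pair%d" % i for i in range(3)]        # translation pairs
neg = ["pair%d" % i for i in range(3, 100)]   # non-translation pairs
train = undersample(pos, neg)
print(len(train))  # 6 -> 3 positives + 3 sampled negatives
```

A discriminative classifier trained on such a balanced set can then output similarity probabilities for ranking the positive class, as the abstract describes.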
The growth of interpreting calls for a more objective and automatic quality measurement. We hold the basic idea that 'translating means translating meaning', so we can assess interpretation quality by comparing the meaning of the interpreting output with that of the source input. Specifically, we propose a translation unit, a 'chunk' called a Frame, which comes from frame semantics, together with its components, called Frame Elements (FEs), which come from FrameNet, and explore their matching rate between target and source texts. A case study in this paper verifies the usability of this semi-automatic, graded semantic-scoring measurement for human simultaneous interpreting and shows how to score using frame and FE matches. Experimental results show that the semantic-scoring metrics correlate significantly with human judgment.
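The matching-rate idea can be reduced to a small sketch: represent source and target as sets of (frame, frame-element) pairs and score the output by the fraction of source pairs it recovers. The pairs below are hypothetical; in practice they would come from FrameNet-style annotation of both texts.

```python
def fe_match_rate(source_fes, target_fes):
    """Fraction of source (frame, FE) pairs found in the target."""
    if not source_fes:
        return 0.0
    return len(set(source_fes) & set(target_fes)) / len(set(source_fes))

src = {("Commerce_buy", "Buyer"), ("Commerce_buy", "Goods"),
       ("Commerce_buy", "Seller"), ("Arriving", "Theme")}
tgt = {("Commerce_buy", "Buyer"), ("Commerce_buy", "Goods"),
       ("Arriving", "Theme")}
print(fe_match_rate(src, tgt))  # 0.75 -- the Seller FE was dropped
```

A graded score could weight core FEs more heavily than peripheral ones; this sketch treats all pairs equally.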
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (ijnlc)
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far; however, it remains unclear which method is the most effective in practice. This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. The general versatility of feature selection methods, and its connection to selecting appropriate features for varying data, is discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be a better method overall.
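The chi-square criterion the study favours scores how strongly a term's presence deviates from independence with a class, using the 2x2 contingency counts. A minimal sketch, with made-up counts:

```python
def chi_square(a, b, c, d):
    """Chi-square over a 2x2 table:
    a = docs with term in class, b = docs with term outside class,
    c = class docs without term, d = other docs without term."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Term appears in 40 of 50 class docs but only 10 of 50 other docs:
# a strong class indicator, so the score is high.
print(chi_square(40, 10, 10, 40))  # 36.0
```

Ranking all candidate features by this score and keeping the top k is the usual selection procedure the comparison evaluates.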
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES (csandit)
ABSTRACT
Natural Language Processing is an interdisciplinary branch of linguistics and computer science, studied under Artificial Intelligence (AI), that gave birth to an allied area called 'Computational Linguistics', which focuses on processing natural languages on computational devices. A natural language consists of a large number of sentences, which are linguistic units involving one or more words linked together in accordance with a set of predefined rules called a grammar. Grammar checking is the task of validating sentences syntactically and is a prominent tool within language engineering. Our review draws on the recent development of various grammar checkers to look at the past, present, and future in a new light. It covers grammar checkers for many languages, with the aim of examining their approaches and methodologies for developing new tools and systems as a whole. The survey concludes with a discussion of the features included in existing grammar checkers for foreign languages as well as a few Indian languages.
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK (ijitcs)
Speech technology is an emerging field, and automatic speech recognition has made notable advances in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing is now attracting research attention. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
These are the presentation slides for the paper "Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech", presented at the 5th International Conference on Statistical Language and Speech Processing (SLSP 2017).
Abstract
Until very recently, the generation of punctuation marks for automatic speech recognition (ASR) output has mostly been done by looking at the syntactic structure of the recognized utterances. Prosodic cues such as breaks, speech rate, and pitch intonation, which influence the placement of punctuation marks in speech transcripts, have seldom been used. We propose a method that uses recurrent neural networks, taking prosodic and lexical information into account, to predict punctuation marks for raw ASR output. Our experiments show that an attention mechanism over parallel sequences of prosodic cues aligned with transcribed speech improves the accuracy of punctuation generation.
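The core attention idea can be illustrated in isolation: given a query state and a parallel sequence of prosodic cue vectors, compute softmax attention weights and a weighted context vector. Dimensions and values below are toy assumptions; the paper's actual parallel-RNN architecture is not reproduced here.

```python
import math

def attend(query, prosodic_seq):
    """Dot-product attention: weights over time steps, plus context vector."""
    scores = [sum(q * k for q, k in zip(query, vec)) for vec in prosodic_seq]
    m = max(scores)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * vec[i] for w, vec in zip(weights, prosodic_seq))
               for i in range(len(query))]
    return weights, context

# Three time steps of hypothetical (pause-length, pitch-slope) cues;
# a query emphasizing pauses attends most to the long pause at t=2.
weights, _ = attend([1.0, 0.0], [[0.1, 0.2], [0.0, 0.1], [0.9, 0.3]])
print(max(range(3), key=lambda i: weights[i]))  # 2
```

In the full model these weights would modulate a decoder that emits punctuation tokens; here they only show how prosodic evidence (a long pause) can dominate the alignment.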
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO... (cseij)
In this paper we combine our previous research in the field of the Semantic Web, especially ontology learning and population, with sentence retrieval. To do this we developed a new approach to sentence retrieval, modifying our previous TF-ISF method, which used local context information, to take into account only document-level information. This is a new approach to sentence retrieval, presented for the first time in this paper and compared against existing methods that use information from the whole document collection. Using this approach and the developed methods for sentence retrieval at the document level, it is possible to assess the relevance of a sentence using only information from the retrieved sentence's document, and to define a document-level OWL representation for sentence retrieval that can be populated automatically. In this way, the idea of the Semantic Web is supported through automatic and semi-automatic extraction of additional information from existing web resources. The additional information is formatted in an OWL document containing document sentence relevance for sentence retrieval.
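Document-level TF-ISF works like TF-IDF, except the "collection" is the set of sentences within one document, so a term's weight depends on its sentence frequency in that document alone. A minimal sketch, with an assumed natural-log variant of the formula:

```python
import math

def tf_isf(term, sentence, all_sentences):
    """Term frequency times inverse sentence frequency within one document."""
    tf = sentence.count(term)
    sf = sum(1 for s in all_sentences if term in s)   # sentence frequency
    if tf == 0 or sf == 0:
        return 0.0
    return tf * math.log(len(all_sentences) / sf)

doc = [["semantic", "web", "ontology"],
       ["ontology", "learning"],
       ["sentence", "retrieval"]]
# "sentence" occurs in 1 of 3 sentences -> weight log(3) ~= 1.099
print(round(tf_isf("sentence", doc[2], doc), 3))
```

A term appearing in every sentence of the document scores log(1) = 0, so only locally distinctive terms contribute to a sentence's relevance.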
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (kevig)
This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. The general versatility of feature selection methods, and its connection to selecting appropriate features for varying data, is discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be a better method overall.
A hybrid composite features based sentence level sentiment analyzer (IAESIJAI)
Current lexicon- and machine-learning-based sentiment analysis approaches still suffer from a two-fold limitation. First, manual lexicon construction and machine training are time-consuming and error-prone. Second, prediction accuracy requires that sentences and their corresponding training text fall under the same domain. In this article, we experimentally evaluate four sentiment classifiers, namely support vector machines (SVMs), Naive Bayes (NB), logistic regression (LR), and random forest (RF). We quantify the quality of each of these models using three real-world datasets that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic movie reviews. Specifically, we study the impact of a variety of natural language processing (NLP) pipelines on the quality of the predicted sentiment orientations. Additionally, we measure the impact of incorporating lexical semantic knowledge captured by WordNet to expand the original words in sentences. The findings demonstrate that utilizing different NLP pipelines and semantic relationships impacts the quality of the sentiment analyzers. In particular, the results indicate that coupling lemmatization with knowledge-based n-gram features produces higher accuracy. With this coupling, the accuracy of the SVM classifier improved to 90.43%, while it was 86.83%, 90.11%, and 86.20%, respectively, using the three other classifiers.
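The winning combination, lemmatization coupled with n-gram features, has a simple pipeline shape, sketched below. The tiny lemma table is a hypothetical stand-in for a real (e.g. WordNet-based) lemmatizer; only the feature-construction order is the point.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"loved": "love", "movies": "movie", "was": "be"}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(tokens, max_n=2):
    """Lemmatize first, then extract 1..max_n grams over the lemmas."""
    lemmas = lemmatize(tokens)
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(lemmas, n))
    return feats

print(features(["loved", "movies"]))  # ['love', 'movie', 'love movie']
```

Lemmatizing before n-gram extraction collapses inflectional variants ("loved movies", "love movie") onto the same features, which is plausibly why the coupling helps the classifiers generalize.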
Language Needs Analysis for English Curriculum Validation (inventionjournals)
This study aims to identify language needs for English curriculum validation at the tertiary level. The descriptive method is utilized, with purposive sampling, also called judgmental sampling: a deliberate selection of individuals made by the researcher based on predefined criteria. Three hundred forty-nine (349) students served as respondents and were tested on listening, speaking, reading, writing, vocabulary, identifying errors, and correct usage. Results showed that skills in identifying errors, writing, correct usage, reading, and listening were significantly related to the respondents' profile, the computed P-value falling below the 0.05 significance level. Speaking skills and vocabulary skills, however, were not significantly related to the respondents' profile.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Segmentation and alignment of source and target words is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for handling out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, we asked whether it would also hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but found instead that, even though our proposed approach improved Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed it when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages). Training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying the character-level maximum-entropy transliteration learning system of Kumaran and Kellner; testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
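A crude illustration of what a syllable-like segmentation of an English source word could look like: group each consonant run with the following vowel run, attaching any trailing consonants to the last unit. This regex heuristic is an assumption for illustration only, not the paper's actual segmenter; it merely shows the unit of alignment (sub-syllable-like chunks rather than single characters).

```python
import re

def syllable_like_units(word):
    """Split a lowercase word into onset+vowel chunks (crude heuristic)."""
    units = re.findall(r"[^aeiou]*[aeiou]+", word)
    tail = word[len("".join(units)):]       # leftover final consonants
    if tail:
        if units:
            units[-1] += tail
        else:
            units = [tail]
    return units

print(syllable_like_units("telugu"))   # ['te', 'lu', 'gu']
print(syllable_like_units("chennai"))  # ['che', 'nnai']
```

Aligning ['te', 'lu', 'gu'] against Telugu akshara-like units gives far fewer, more phonetically coherent alignment decisions than character-by-character alignment, which is the intuition the paper tests.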
Source and target word segmentation and alignment is a primary step in the statistical learning of a Transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful in the case of dealing with Out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked if this would be true for Indic languages which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English to Telugu transliteration (our method will apply equally well to most written Indic languages); our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum entropy based Transliteration learning system due to Kumaran and Kellner; our testing consisted of using the same segmentation of a test English word, followed by applying the model, and reranking the resulting top 10 Telugu words. We also report the dataset creation and selection since standard datasets are not available.
Similar to Application Of An Automatic Plagiarism Detection System In A Large-Scale Assessment Of English Speaking Proficiency (20)
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
Normal Labour/ Stages of Labour/ Mechanism of LabourWasim Ak
Normal labor is also termed spontaneous labor, defined as the natural physiological process through which the fetus, placenta, and membranes are expelled from the uterus through the birth canal at term (37 to 42 weeks
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Safalta Digital marketing institute in Noida, provide complete applications that encompass a huge range of virtual advertising and marketing additives, which includes search engine optimization, virtual communication advertising, pay-per-click on marketing, content material advertising, internet analytics, and greater. These university courses are designed for students who possess a comprehensive understanding of virtual marketing strategies and attributes.Safalta Digital Marketing Institute in Noida is a first choice for young individuals or students who are looking to start their careers in the field of digital advertising. The institute gives specialized courses designed and certification.
for beginners, providing thorough training in areas such as SEO, digital communication marketing, and PPC training in Noida. After finishing the program, students receive the certifications recognised by top different universitie, setting a strong foundation for a successful career in digital marketing.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Model Attribute Check Company Auto PropertyCeline George
In Odoo, the multi-company feature allows you to manage multiple companies within a single Odoo database instance. Each company can have its own configurations while still sharing common resources such as products, customers, and suppliers.
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 435–443, Florence, Italy, August 2, 2019. © 2019 Association for Computational Linguistics
Application of an Automatic Plagiarism Detection System in a Large-scale Assessment of English Speaking Proficiency

Xinhao Wang (1), Keelan Evanini (2), Matthew Mulholland (2), Yao Qian (1), James V. Bruno (2)
Educational Testing Service
(1) 90 New Montgomery St #1450, San Francisco, CA 94105, USA
(2) 660 Rosedale Road, Princeton, NJ 08541, USA
{xwang002, kevanini, mmulholland, yqian, jbruno}@ets.org
Abstract

This study aims to build an automatic system for the detection of plagiarized spoken responses in the context of an assessment of English speaking proficiency for non-native speakers. Classification models were trained to distinguish between plagiarized and non-plagiarized responses with two different types of features: text-to-text content similarity measures, which are commonly used in the task of plagiarism detection for written documents, and speaking proficiency measures, which were specifically designed for spontaneous speech and extracted using an automated speech scoring system. The experiments were first conducted on a large data set drawn from an operational English proficiency assessment across multiple years, and the best classifier on this heavily imbalanced data set resulted in an F1-score of 0.761 on the plagiarized class. This system was then validated on operational responses collected from a single administration of the assessment and achieved a recall of 0.897. The results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
1 Introduction

Plagiarism of spoken responses has become a vexing problem in the domain of spoken language assessment, in particular the evaluation of non-native speaking proficiency, since there exists a vast amount of easily accessible online resources covering a wide variety of topics that test takers can use to prepare responses prior to the test. In the context of large-scale, standardized assessments of spoken English for academic purposes, such as the TOEFL iBT test (ETS, 2012), the Pearson Test of English Academic (Longman, 2010), and the IELTS Academic assessment (Cullen et al., 2014), some test takers may utilize content from online resources or other prepared sources in their spoken responses to test questions that are intended to elicit spontaneous speech. These responses based on canned material pose a problem for both human raters and automated scoring systems, and can reduce the validity of scores that are provided to the test takers; therefore, research into the automated detection of plagiarized spoken responses is necessary.

In this paper, we investigate a variety of features for automatically detecting plagiarized spoken responses in the context of a standardized assessment of English speaking proficiency. In addition to examining several commonly used text-to-text content similarity features, we also use features that compare various aspects of speaking proficiency across multiple responses provided by a test taker, based on the hypothesis that certain aspects of speaking proficiency, such as fluency, may be artificially inflated in a test taker's canned responses in comparison to non-canned responses. These features are designed to be independent of the availability of the reference source materials. Finally, we evaluate the effectiveness of this system on a data set with a large number of control (non-plagiarized) responses in an attempt to simulate the imbalanced distribution of an operational setting in which only a small number of the test takers' responses are plagiarized. In addition, we further validate this system on operational data and show how it can practically assist both human and automated scoring in a large-scale assessment of English speaking proficiency.
2 Previous Work

Previous research related to automated plagiarism detection for natural language has mainly focused on written documents. For example, a series of shared tasks has enabled a variety of NLP approaches for detecting plagiarism to be compared on a standardized set of texts (Potthast et al., 2013), and several plagiarism detection services are available online [1]. Various techniques have been employed in this task, including n-gram overlap (Lyon et al., 2006), document fingerprinting (Brin et al., 1995), word frequency statistics (Shivakumar and Garcia-Molina, 1995), information retrieval-based metrics (Hoad and Zobel, 2003), text summarization evaluation metrics (Chen et al., 2010), WordNet-based features (Nahnsen et al., 2005), stopword-based features (Stamatatos, 2011), features based on shared syntactic patterns (Uzuner et al., 2005), features based on word swaps detected via dependency parsing (Mozgovoy et al., 2007), and stylometric features (Stein et al., 2011), among others. In general, for the task of monolingual plagiarism detection, these methods can be categorized as either external plagiarism detection, in which a document is compared to a body of reference documents, or intrinsic plagiarism detection, in which a document is evaluated independently without a reference collection (Alzahrani et al., 2012). This task is also related to the widely studied task of paraphrase recognition, which benefits from similar types of features (Finch et al., 2005; Madnani et al., 2012). The current study adopts several of these features that are designed to be robust to the presence of word-level modifications between the source and the plagiarized text; since this study focuses on spoken responses that are reproduced from memory and subsequently processed by a speech recognizer, metrics that rely on exact matches are likely to perform sub-optimally.

Little prior work has been conducted on the task of automatically detecting similar spoken responses, although research in the field of Spoken Document Retrieval (Hauptmann, 2006) is relevant. Due to the difficulties involved in collecting corpora of actual plagiarized material, nearly all published results of approaches to the task of plagiarism detection have relied on either simulated plagiarism (i.e., plagiarized texts generated by experimental human participants in a controlled environment) or artificial plagiarism (i.e., plagiarized texts generated by algorithmically modifying a source text) (Potthast et al., 2010). These results, however, may not reflect actual performance in a deployed setting, since the characteristics of the plagiarized material may differ from actual plagiarized responses.

[1] For example, http://turnitin.com/en_us/what-we-offer/feedback-studio, http://www.grammarly.com/plagiarism, and http://www.paperrater.com/plagiarism_checker.
In previous studies (Evanini and Wang, 2014; Wang et al., 2016), we conducted experiments on a simulated data set from an operational, large-scale, standardized English proficiency assessment and obtained initial results with an F1-measure of 70.6% using an automatic system to detect plagiarized spoken responses (Wang et al., 2016). Based on these previous findings, we extend this line of research and contribute in the following ways: 1) an improved automatic speech recognition (ASR) system based on Kaldi was introduced, and an unsupervised language model adaptation method was employed to improve the ASR performance on spontaneous speech elicited by new, unseen test questions; 2) an improved set of text-to-text content similarity features based on n-gram overlap and Word Mover's Distance was investigated; 3) in addition to evaluating the system on a simulated imbalanced data set, we also validated the developed automatic system using all of the responses from a single operational administration of the English speaking assessment in order to obtain a reliable estimate of the system's performance in a practical deployment.
3 Data

The data used in this study was drawn from a large-scale, high-stakes assessment of English for non-native speakers, which assesses English communication skills for academic purposes. The Speaking section of this assessment contains six tasks designed to elicit spontaneous spoken responses: two of them require test takers to provide an opinion or preference based on personal experience, which are referred to as independent tasks; and the other four tasks require test takers to summarize or discuss material provided in a reading and/or listening passage, which are referred to as integrated tasks.

In general, the independent tasks ask questions on topics that are familiar to test takers and are not based on any stimulus materials. Therefore, test takers can provide responses containing a wide variety of specific examples. In some cases, test takers may attempt to game the assessment by memorizing canned material from an external source and adapting it to the questions presented in the independent tasks. This type of plagiarism can affect the validity of a test taker's speaking score and can be grounds for score cancellation. However, it is often difficult even for trained human raters to recognize plagiarized spoken responses, due to the large number and variety of external sources that are available to test takers.
In order to identify plagiarized spoken responses from the operational test, human raters first flag spoken responses that contain potentially plagiarized material, and trained experts subsequently review them to make the final decision. In the review process, the responses were transcribed and compared to external source materials obtained through manual internet searches; if it was determined that the presence of plagiarized material made it impossible to provide a valid assessment of the test taker's performance on the speaking task, the response was labeled as a plagiarized response and assigned a score of 0. In this study, a total of 1,557 plagiarized responses to independent test questions were collected from the operational assessment across multiple years.

During the process of reviewing potentially plagiarized responses, the raters also collected a data set of external sources that appeared to have been used by test takers in their responses. In some cases, the test taker's spoken response was nearly identical to an identified source; in other cases, several sentences or phrases were clearly drawn from a particular source, although some modifications were apparent. Table 1 presents a sample source that was identified for several of the responses in the data set, along with a sample plagiarized response that apparently contains extended sequences of words directly matching idiosyncratic features of this source, such as the phrases "how romantic it can ever be" and "just relax yourself on the beach." In general, test takers typically do not reproduce the entire source material in their responses; rather, they often attempt to adapt the source material to a specific test question by providing some speech that is directly relevant to the prompt and combining it with the plagiarized material. An example of this is shown by the opening and closing non-italicized portions of the sample plagiarized response in Table 1. In total, human raters identified 224 different source materials while reviewing the potentially plagiarized responses; their summary statistics are as follows: the average number of words is 95.7 (std. dev. = 38.5), the average number of clauses is 10.3 (std. dev. = 5.1), and the average number of words per clause is 9.3 (std. dev. = 7.1).
In addition to the source materials and the plagiarized responses, a set of non-plagiarized control responses was also obtained in order to conduct classification experiments between plagiarized and non-plagiarized responses. Since the plagiarized responses were collected over the course of multiple years, they were drawn from many different test forms, and it was not practical to obtain control data from all of the test forms that were represented in the plagiarized set. So, only the 166 test forms that appear most frequently in the canned data set were used for the collection of control responses, and 200 test takers were randomly selected from each form, without any overlap with speakers in the plagiarized set. The two spoken responses from the two independent questions were collected from each speaker; in total, 66,400 spoken responses from 33,200 speakers were obtained as the control set. Therefore, the data set used in this study is quite imbalanced: the number of control responses is larger than the number of plagiarized responses by a factor of 43.
4 Methodology

This study employed two different types of features in the automatic detection of plagiarized spoken responses: 1) similar to human raters' behavior in identifying canned spoken responses, a set of features is developed to measure the content similarities between a test response and the source materials that were collected; 2) to deal with this particular task of plagiarism detection for spontaneous spoken responses, a set of features is introduced based on the assumption that the production of spoken language based on memorized material is expected to differ from the production of non-plagiarized speech in aspects of a test taker's speech delivery, such as fluency, pronunciation, and prosody.
4.1 Content Similarity

Previous work (Wang et al., 2016; Evanini and Wang, 2014) has demonstrated the effectiveness of using content-based features for the task of automatic plagiarized spoken response detection. Therefore, this study investigates the use of improved features based on the measurement of text-to-text content similarity. Given a test response, a comparison is made with each of the 224 reference sources using the following two content similarity metrics: word n-gram overlap and Word Mover's Distance. Then, the maximum similarity or the minimum distance is taken as a single feature to measure the content relevance between the test responses and the source materials.

Table 1: A sample source passage and the transcription of a sample plagiarized spoken response that was apparently drawn from the source. The test question/prompt used to elicit this response is also included. The overlapping word sequences between the source material and the transcription of the spoken response are indicated in italics.

Sample source passage: Well, the place I enjoy the most is a small town located in France. I like this small town because it has very charming ocean view. I mean the sky there is so blue and the beach is always full of sunshine. You know how romantic it can ever be, just relax yourself on the beach, when the sun is setting down, when the ocean breeze is blowing and the seabirds are singing. Of course I like this small French town also because there are many great French restaurants. They offer the best seafood in the world like lobsters and tuna fishes. The most important, I have been benefited a lot from this trip to France because I made friends with some gorgeous French girls. One of them even gave me a little watch as a souvenir of our friendship.

Prompt: Talk about an activity you enjoyed doing with your family when you were a kid.

Transcription of a plagiarized response: family is a little trip to France when I was in primary school ten years ago I enjoy this activity first because we visited a small French town located by the beach the town has very charming ocean view and in the sky is so blue and the beach is always full of sunshine you know how romantic it can ever be just relax yourself on the beach when the sun is settling down the sea birds are singing of course I enjoy this activity with my family also because there are many great French restaurants they offer the best sea food in the world like lobsters and tuna fishes so I enjoy this activity with my family very much even it has passed several years
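The max-over-sources reduction described above can be sketched as follows; `sim` stands in for either similarity metric, and the function name is illustrative rather than taken from the paper (for a distance metric such as WMD, `min` would be used instead):

```python
def max_similarity_feature(response, sources, sim):
    """Single feature: the response's best match against any reference
    source (224 sources in the paper's setup)."""
    return max(sim(response, src) for src in sources)
```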
4.1.1 N-gram Overlap

Features based on the BLEU metric have been proven to be effective in measuring the content appropriateness of spoken responses in the context of English proficiency assessment (Zechner and Wang, 2013) and in measuring content similarity in the detection of plagiarized spoken responses (Wang et al., 2016; Evanini and Wang, 2014). In this study, we first design a new type of feature, known as n-gram overlap, by simulating and improving the previous BLEU-based features. Word n-grams, with n varying from 1 word to 11 words, are first extracted from both the test response and each of the source materials, and then the number of overlapping n-grams is counted, where n-gram types and tokens are counted separately. The intuition behind decreasing the maximum order is to increase the classifier's recall by evaluating the overlap of shorter word sequences, such as individual words in the unigram setting. On the other hand, the motivation behind increasing the maximum order is to boost the classifier's precision, since it will focus on matches of longer word sequences. Here, the maximum order of 11 was experimentally decided in consideration of the average number of words per clause in the source materials, which is 9.3 as described in Section 3.

In order to calculate the maximum similarity across all source materials, the 11 n-gram overlap counts are combined to generate one weighted score between a test response and each source as in Equation 1, and then the maximum score across all sources is calculated as a feature to measure the similarity between a test response and the set of source materials. Meanwhile, the 11 n-gram overlap counts calculated using the source with the maximum similarity score are also taken as features.

\[ \sum_{i=1}^{n} \frac{i}{n(n+1)/2} \, \mathrm{count\_overlap}(i\text{-gram}) \qquad (1) \]
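A minimal sketch of the weighted score in Equation 1, assuming whitespace-tokenized texts and counting token-level overlap only (function names are illustrative, not from the paper):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of contiguous word n-grams of order n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_overlap(response, source, max_n=11):
    """Equation 1: overlap counts for orders 1..max_n, each weighted by
    i / (n(n+1)/2) so that longer matching sequences contribute more."""
    norm = max_n * (max_n + 1) / 2
    score = 0.0
    for i in range(1, max_n + 1):
        resp, src = ngram_counts(response, i), ngram_counts(source, i)
        score += i / norm * sum((resp & src).values())  # token overlap count
    return score
```

Taking the maximum of this score over all 224 sources then yields the single similarity feature described above.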
Furthermore, the n-gram based feature set can be enlarged by: 1) normalizing the n-gram counts by either the number of n-grams in the test response or the number of n-grams in each of the sources; 2) combining all source materials into a single document for comparison (11 features based on n-gram overlap with the combined source), which is designed based on the assumption that test takers may attempt to use content from multiple sources. Similarly, this type of feature can be further normalized by the number of n-grams in the test response. Based on all of these variations, a total of 116 n-gram overlap features is generated for each spoken response.
4.1.2 Word Mover's Distance

More recently, various approaches based on deep neural networks (DNN) and word embeddings trained on large corpora have shown promising performance in document similarity detection (Kusner et al., 2015). In contrast to traditional similarity features such as the above n-gram overlap features, which are limited to a reliance on exact word matching, these new approaches have the advantage of capturing topically relevant words that are not identical. In this study, we employ Word Mover's Distance (WMD) (Kusner et al., 2015) to measure the distance between a test response and a source material based on word embeddings.

Words are first represented as embedding vectors, and then the distance between each word appearing in a test response and each word in a source can be measured using the Euclidean distance in the embedding space. WMD represents the sum of the minimum values among the Euclidean word distances between words in the two compared documents. This minimization problem is a special case of Earth Mover's Distance (Rubner et al., 1998), for which efficient algorithms are available. Kusner et al. (2015) reported that WMD outperformed other distance measures on document retrieval tasks and that embeddings trained on the Google News corpus consistently performed well across a variety of contexts. For this work, we used the same word embeddings used in the weighted embedding features as the input for the WMD calculation.
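The nearest-neighbor form described above can be sketched with toy vectors standing in for real word embeddings; note this is the common lower-bound relaxation of WMD (the exact WMD additionally solves an optimal-transport problem over word frequencies), and the names below are illustrative:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(response_vecs, source_vecs):
    """Sum, over response words, of the distance to the nearest source
    word: each word's mass 'moves' entirely to its closest counterpart."""
    return sum(min(euclidean(r, s) for s in source_vecs) for r in response_vecs)
```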
4.2 Difference in Speaking Proficiency

The performance of the above content similarity features greatly depends on the availability of a comprehensive set of source materials. If a test taker uses unseen source materials as the basis for a plagiarized response, the system may fail to detect it. Accordingly, a set of features that do not rely on a comparison with source materials has been proposed previously (Wang et al., 2016). The current study also examined this type of feature.

As described in Section 3, the Speaking section of the assessment includes both independent and integrated tasks. In a given test administration, test takers are required to respond to all six questions; however, plagiarized responses are more likely to appear in the two independent tasks, since they are not based on specific reading and/or listening passages and thus elicit a wider range of variation across responses. Since the plagiarized responses are mostly constructed based on memorized material, they may be delivered in a more fluent and proficient manner compared to responses that contain fully spontaneous speech. Based on this assumption, a set of features can be developed to capture the difference between various speaking proficiency features extracted from the canned and the fully spontaneous speech produced by the same test taker, where an automated spoken English assessment system, SpeechRater® (Zechner et al., 2007, 2009), can be used to provide the automatic proficiency scores along with 29 features measuring fluency, pronunciation, prosody, rhythm, vocabulary, and grammar. Since most plagiarized responses are expected to occur in the independent tasks, we assume the integrated responses are based on spontaneous speech. A mismatch between the proficiency scores and feature values of the independent responses and those of the integrated responses from the same speaker can potentially indicate the presence of both prepared speech and spontaneous speech, and, therefore, the presence of plagiarized spoken responses.

Given an independent response from a test taker, along with the other independent response and four integrated responses from the same test taker, 6 features can be extracted for each of the proficiency scores and 29 SpeechRater features. First, the difference in score/feature values between the two independent responses was calculated as a feature, which was used to deal with the case in which only one independent response was canned and the other contained spontaneous speech. Then, basic descriptive statistics, including mean, median, min, and max, were obtained across the four integrated responses. The differences between the score/feature value of the independent response and these four basic statistics were extracted as additional features. Finally, another feature was extracted by standardizing the score/feature value of the independent response with the mean and standard deviation from the integrated responses. In total, a set of 180 features was extracted, referred to as SpeechRater in the following experiments.
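The six-feature construction above can be sketched per score or feature value; names below are illustrative, and applying this to 1 proficiency score plus 29 SpeechRater features yields the 180-dimensional set (6 × 30):

```python
from statistics import mean, median, pstdev

def proficiency_diff_features(ind, other_ind, integrated):
    """Six difference features for one score/feature value: the gap between
    the two independent responses, the gaps to four statistics of the four
    integrated responses, and a z-score against the integrated responses."""
    feats = {"diff_other_ind": ind - other_ind}
    stats = {"mean": mean(integrated), "median": median(integrated),
             "min": min(integrated), "max": max(integrated)}
    for name, value in stats.items():
        feats[f"diff_int_{name}"] = ind - value
    sd = pstdev(integrated)  # population std. dev. of the 4 integrated values
    feats["z_int"] = (ind - stats["mean"]) / sd if sd else 0.0
    return feats
```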
5 Experiments and Results
5.1 ASR Improvement
In this study, spoken responses need to be tran-
scribed into text so that they can be compared
with the source materials to measure the text-to-
text similarity. However, due to the large amount
of spoken responses considered in this study, it is
not practical to manually transcribe all of them;
therefore, a Kaldi2-based automatic speech recog-
nition engine was employed. The training set used
to develop the speech recognizer consists of simi-
lar responses (around 800 hours of speech) drawn
from the same assessment and do not overlap with
the data sets included in this study.
When using an ASR system to recognize spo-
ken responses from new prompts that are not seen
in the ASR training data, a degradation in recog-
nition accuracy is expected because of the mis-
match between the training and test data. In this
study, we used an unsupervised language model
(LM) adaptation method to improve the ASR per-
formance on unseen data. In this method, two
rounds of language model adaptation were con-
ducted with the following steps: first, out-of-
vocabulary (OOV) words from the prompt mate-
rials were added to the pronunciation dictionary
and the baseline models were adapted with the
prompts; second, the adapted models were ap-
plied to spoken responses from these new prompts
to produce the recognized texts along with con-
fidence scores corresponding to each response;
third, automatically transcribed texts with confi-
dence scores higher than a predefined threshold
of 0.8 were selected; finally, these high-confidence
recognized texts were used to conduct another
round of language model adaptation. We evalu-
ated this unsupervised language model adaptation
method on a stand-alone data set with 1,500 re-
sponses from 250 test speakers, where the prompts
used to elicit these responses were unseen in the
baseline recognition models. In this experiment,
supervised language model adaptation with human
transcriptions was also examined for comparison.
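The two-round adaptation loop described above can be sketched as follows; the `adapt_lm` and `decode` callables are hypothetical stand-ins for the Kaldi-based components, not an actual Kaldi API:

```python
def two_round_lm_adaptation(adapt_lm, decode, prompts, responses, threshold=0.8):
    """Sketch of the two-round unsupervised LM adaptation procedure.

    adapt_lm(texts) -> model and decode(model, response) -> (text, confidence)
    are placeholder interfaces for the recognition pipeline.
    """
    # Round 1: adapt the baseline models with the new prompt materials
    # (OOV words from the prompts are assumed to be in the lexicon already)
    model = adapt_lm(prompts)
    # Decode each response and keep only high-confidence transcripts
    decoded = [decode(model, r) for r in responses]
    confident = [text for text, conf in decoded if conf >= threshold]
    # Round 2: re-adapt using the selected machine transcripts
    return adapt_lm(confident), confident
```

The key design point is that no human transcription enters the loop: the confidence threshold substitutes for manual quality control.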
As shown in Table 2, the overall word error rate
2 http://kaldi-asr.org/
Table 2: Word error rate (WER %) reduction with
an unsupervised language model adaptation method,
where the WERs on Independent items (IND), Inte-
grated items (INT), as well as all items (ALL), are
reported. The WER with the supervised adaptation
method based on human transcriptions is also listed for
comparison.
ASR Systems IND INT ALL
Baseline 22.5 26.6 25.5
Unsupervised 21.5 23.5 22.9
Supervised 21.2 22.1 21.8
(WER) is 25.5% for the baseline models. By ap-
plying the unsupervised LM adaptation method,
the overall WER can be reduced to 22.9%; in com-
parison, the overall WER can be reduced to 21.8%
by using supervised LM adaptation. The unsuper-
vised method can achieve very similar results to
the supervised method, especially for responses to
the independent prompts, i.e., with WER of 21.5%
(unsupervised) vs 21.2% (supervised). These re-
sults indicate the effectiveness of the proposed
unsupervised adaptation method, which was em-
ployed in the subsequent automatic plagiarism de-
tection work.
5.2 Experimental Setup
Due to ASR failures, a small number of responses
were excluded from the experiments; finally, a to-
tal of 1,551 canned and 66,257 control responses
were included in the simulated data. Since this
work was conducted on a very imbalanced data
set and only 2.3% of the responses in the simu-
lated data are authentic plagiarized ones confirmed
by human raters, 10-fold cross-validation was per-
formed first on the simulated data. Afterward, the
classification model built on the simulated data set
was further evaluated on a corpus with real opera-
tional data.
We employed the machine learning toolkit
scikit-learn (Pedregosa et al., 2011) to train
the classifier. It provides various classifica-
tion methods, such as decision tree, random for-
est, AdaBoost, etc. This study involves a variety of
features from two different categories, and prelim-
inary experiments demonstrated that the random
forest model achieves the best overall perfor-
3 SKLL, a Python tool that simplifies running scikit-learn
experiments, was used. Downloaded from
https://github.com/EducationalTestingService/skll.
mance. Therefore, the random forest method is
used to build classification models in the follow-
ing experiments, and the precision, recall, as well
as F1-score on the positive class (plagiarized re-
sponses) are used as the evaluation metrics.
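A minimal version of this evaluation setup, assuming scikit-learn's standard cross-validation API (the paper ran scikit-learn via SKLL; the hyperparameters here are illustrative, not those used in the study):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

def evaluate_detector(X, y, n_folds=10, seed=0):
    """k-fold evaluation of a random forest plagiarism detector.

    X is the feature matrix, y the binary labels with 1 = plagiarized
    (the positive class); both names are ours.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Stratified folds preserve the heavily skewed class ratio in each split
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=("precision", "recall", "f1"))
    # average precision/recall/F1 on the positive class across folds
    return {m: scores[f"test_{m}"].mean() for m in ("precision", "recall", "f1")}
```

Stratification matters here: with only 2.3% positive examples, unstratified folds could leave some folds with almost no plagiarized responses.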
5.3 Experimental Results
First, in order to verify the effectiveness of the
newly developed n-gram overlap features, a pre-
liminary experiment was conducted to compare
this set of features with BLEU-based features,
since they had been shown to be effective in pre-
vious research (Wang et al., 2016). The results as
shown in Table 3 indicate that the n-gram features
outperform the BLEU features in F1-measure
(0.761 vs. 0.748), and the recall of the n-
gram features is higher than the BLEU features
(0.716 vs. 0.683); inversely, the BLEU features
result in higher precision (0.83 vs. 0.814). Ac-
cordingly, the n-gram features are used to replace
the BLEU ones, since it is more important to re-
duce the number of false negatives, i.e., improve
the recall, for our task.
Table 3: Comparison of n-gram and BLEU features.
Features Precision Recall F1
BLEU 0.83 0.683 0.748
n-gram 0.814 0.716 0.761
Furthermore, each individual type of feature
and their combinations were examined in the clas-
sification experiments described above. As shown
in Table 4, each feature set alone was used to build
classification models. The n-gram overlap fea-
tures result in the best performance with an F1-
score of 0.761, and the WMD features capturing
the topical relevance between word pairs result in
a much lower F1-score of 0.649. Furthermore,
the combination of both types of content similarity
features, i.e., n-gram and WMD, slightly reduces
the F1-score to 0.76. These results indicate that for
this particular task, the exact match of certain ex-
pressions appearing in both the test response and a
source material plays a critical role.
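One plausible form of such an exact-match overlap measure is sketched below; the paper's precise n-gram feature definitions may differ from this version:

```python
def ngram_overlap(response_tokens, source_tokens, n=3):
    """Fraction of the response's n-grams that also appear in a source
    text -- an illustrative n-gram overlap feature between an ASR
    transcript and one source material."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    resp = ngrams(response_tokens)
    if not resp:  # response shorter than n tokens
        return 0.0
    return len(resp & ngrams(source_tokens)) / len(resp)
```

In practice such a score would be computed against every source in the pool (e.g., at several n-gram orders), with the maximum over sources taken as the feature value.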
As to the speaking proficiency related features,
they can lead to a promising precision of 0.8 but
with a very low recall of 0.009. After reexamining
the assumption that human experts may be able
to differentiate prepared speech from fully spon-
taneous speech based on the way the speech
is delivered, it turns out that it is quite challenging
Table 4: Classification performance using each individ-
ual feature set and their combinations.
Features Precision Recall F1
1. n-gram 0.814 0.716 0.761
2. WMD 0.663 0.636 0.649
3. SpeechRater 0.8 0.009 0.018
1 + 2 0.812 0.716 0.76
1 + 2 + 3 0.821 0.696 0.752
for human experts to make a reliable judgment of
plagiarism just based on the speech delivery with-
out any reference to the source materials, in par-
ticular, within the context of high-stakes language
assessment. Accordingly, the features capturing
the difference in speaking proficiency of prepared
and spontaneous speech can be used as contrib-
utory information to improve the accuracy of an
automatic detection system, but they are unable
to achieve promising performance alone. Also as
shown in Table 4, by adding the speaking profi-
ciency features, the precision can be improved to
0.821, but the recall is reduced; finally, the F1-
score is reduced to 0.752.
6 Employment in Operational Test
6.1 Operational Use
In order to obtain a more accurate estimate of how
well the automatic plagiarism detection system
might perform in a practical application in which
the distribution is expected to be heavily skewed
towards the non-plagiarized category, all test-taker
responses to independent prompts were collected
from a single administration of the speaking as-
sessment. In total, 13,516 independent responses
from 6,758 speakers were extracted for system
evaluation. We collected 39 responses confirmed
as plagiarized through the human review process,
which represents 0.29% of the data set.
In this particular task, automatic detection sys-
tems can be applied to support human raters,
where all potentially plagiarized responses can be
first automatically identified and then human ex-
perts can be involved to review flagged responses
and make the final decision about potential in-
stances of plagiarism. In this scenario, it is more
important to increase the number of true positives
flagged by the automated system; thus, recall of
plagiarized responses was used as the evaluation
metric, i.e., how many responses can be success-
fully detected among these 39 confirmed instances
of plagiarism.
6.2 Results and Discussion
In order to maximize the recall of plagiarized re-
sponses for this review, several models were built
with either different types of features or different
types of classification models, for example, ran-
dom forest and Adaboost with decision tree as
the weak classifier, and then they were combined
in an ensemble manner to flag potentially plagia-
rized responses, i.e., a response is flagged if any
of the models detects it as a plagiarized response.
This ensemble system flagged 850 responses as in-
stances of plagiarism in total and achieved a recall
of 0.897, i.e., 35 of the confirmed plagiarized re-
sponses were successfully identified by the auto-
matic system and 4 of them were missed.
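The union-style flagging rule and the recall computation can be sketched as follows (the data layout is illustrative):

```python
def ensemble_flag(model_predictions):
    """Union-style ensemble: flag a response if ANY member model flags it.

    model_predictions is a list of per-model binary prediction lists,
    one inner list per model, aligned over the same responses.
    """
    return [int(any(flags)) for flags in zip(*model_predictions)]

def positive_recall(flags, gold):
    """Recall on the positive (plagiarized) class."""
    true_pos = sum(1 for f, g in zip(flags, gold) if f and g)
    total_pos = sum(gold)
    return true_pos / total_pos if total_pos else 0.0
```

Taking the union of flags trades precision for recall by construction, which matches the goal of this review scenario: a missed plagiarized response is costlier than an extra response for experts to check.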
These results demonstrate that the developed sys-
tem can provide a substantial benefit to both hu-
man and automated assessment of non-native spo-
ken English. Manual identification of plagia-
rized responses can be a very difficult and time-
consuming task, where human experts need to
memorize hundreds of source materials and com-
pare them to thousands of responses. By apply-
ing an automated system, potentially plagiarized
responses can first be filtered out of the standard
scoring pipeline; subsequently, experts can review
these flagged responses to confirm whether they
actually contain plagiarized material. Accord-
ingly, instead of reviewing all 13,516 responses
in this administration for plagiarized content, hu-
man effort is required only for the 850 flagged
responses, substantially reducing the overall
workload. Thus, optimizing recall is appro-
priate in this targeted use case, since the number
of false positives is within an acceptable range for
the expert review. In addition, the source labels in-
dicating which source materials were likely used
in the preparation of each response are automati-
cally generated by the automatic system for each
suspected response; this information can help to
accelerate the manual review process.
7 Conclusion and Future Work
This study proposed a system which can bene-
fit a high-stakes assessment of English speaking
proficiency by automatically detecting potentially
plagiarized spoken responses, and investigated the
empirical effectiveness of two different types of
features. One is based on automatic plagiarism
detection methods commonly applied to written
texts, in which the content similarity between a
test response and a set of source materials col-
lected from human raters was measured. In addi-
tion, this study also adopted a set of features which
do not rely on the human effort involved in source
material collection and can be easily applied to un-
seen test questions. This type of feature attempts
to capture the difference in speech patterns be-
tween prepared responses and fully spontaneous
responses from the same speaker in a test. Finally,
the classification models were evaluated on a large
set of responses collected from an operational test,
and the experimental results demonstrate that the
automatic detection system can achieve an F1-
measure of 0.761. Further evaluation on the real
operational data also shows the effectiveness of
the automatic detection system.
The task of applying an automatic system in a
large-scale operational assessment is quite chal-
lenging since typically only a small number of pla-
giarized responses are distributed among a much
larger number of non-plagiarized responses to a
wide range of different test questions. In the fu-
ture, we will continue our research efforts to im-
prove the automatic detection system along the
following lines. First, since deep learning tech-
niques have recently shown their effectiveness in
the fields of both speech processing and natural
language understanding, we will further explore
various deep learning techniques to improve the
metrics used to measure the content similarity be-
tween test responses and source materials. Next,
further analysis will be conducted to determine the
extent of differences between canned and sponta-
neous speech, and additional features will be ex-
plored based on the findings. In addition, another
focus is to build automatic tools to regularly up-
date the pool of source materials. Besides internet
search, new sources can also be detected by com-
paring all candidate responses from the same test
question, since different test takers may use the
same source to produce their canned responses.
References
S. M. Alzahrani, N. Salim, and A. Abraham. 2012.
Understanding plagiarism linguistic patterns, textual
features, and detection methods. IEEE Transactions
on Systems, Man, and Cybernetics-Part C: Applica-
tions and Reviews, 42(2):133–149.
S. Brin, J. Davis, and H. Garcia-Molina. 1995. Copy
detection mechanisms for digital documents. In
Proceedings of the ACM SIGMOD Annual Confer-
ence, pages 398–409.
C. Chen, J. Yeh, and H. Ke. 2010. Plagiarism detection
using ROUGE and WordNet. Journal of Computing,
2(3):34–44.
P. Cullen, A. French, and V. Jakeman. 2014. The Of-
ficial Cambridge Guide to IELTS. Cambridge Uni-
versity Press.
ETS. 2012. The Official Guide to the TOEFL® Test,
Fourth Edition. McGraw-Hill.
Keelan Evanini and Xinhao Wang. 2014. Automatic
detection of plagiarized spoken responses. In Pro-
ceedings of the Ninth Workshop on Innovative Use of
NLP for Building Educational Applications, pages
22–27.
A. Finch, Y. Hwang, and E. Sumita. 2005. Using ma-
chine translation evaluation techniques to determine
sentence-level semantic equivalence. In Proceed-
ings of the Third International Workshop on Para-
phrasing, pages 17–24.
A. Hauptmann. 2006. Automatic spoken document re-
trieval. In Keith Brown, editor, Encyclopedia of
Language and Linguistics (Second Edition), pages
95–103. Elsevier Science.
T. C. Hoad and J. Zobel. 2003. Methods for identi-
fying versioned and plagiarised documents. Journal
of the American Society for Information Science and
Technology, 54:203–215.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian
Weinberger. 2015. From word embeddings to doc-
ument distances. In Proceedings of the 32nd In-
ternational Conference on Machine Learning, vol-
ume 37 of Proceedings of Machine Learning Re-
search, pages 957–966, Lille, France.
P. Longman. 2010. The Official Guide to Pearson Test
of English Academic. Pearson Education ESL.
C. Lyon, R. Barrett, and J. Malcolm. 2006. Plagiarism
is easy, but also easy to detect. Plagiary, 1:57–65.
N. Madnani, J. Tetreault, and M. Chodorow. 2012.
Re-examining machine translation metrics for para-
phrase identification. In Proceedings of the 2012
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 182–190, Montréal,
Canada. Association for Computational Linguistics.
M. Mozgovoy, T. Kakkonen, and E. Sutinen. 2007. Us-
ing natural language parsers in plagiarism detection.
In Proceedings of the ISCA Workshop on Speech and
Language Technology in Education (SLaTE).
T. Nahnsen, Ö. Uzuner, and B. Katz. 2005. Lexical
chains and sliding locality windows in content-based
text similarity detection. CSAIL Technical Report,
MIT-CSAIL-TR-2005-034.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gram-
fort, Vincent Michel, Bertrand Thirion, Olivier
Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
Weiss, Vincent Dubourg, et al. 2011. Scikit-learn:
Machine learning in Python. Journal of Machine
Learning Research, 12(Oct):2825–2830.
M. Potthast, M. Hagen, T. Gollub, M. Tippmann,
J. Kiesel, P. Rosso, E. Stamatatos, and B. Stein.
2013. Overview of the 5th International Competi-
tion on Plagiarism Detection. In CLEF 2013 Evalu-
ation Labs and Workshop – Working Notes Papers.
M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso.
2010. An evaluation framework for plagiarism de-
tection. In Proceedings of the 23rd International
Conference on Computational Linguistics.
Y. Rubner, C. Tomasi, and L. J. Guibas. 1998. A
metric for distributions with applications to image
databases. In Proceedings of the Sixth International
Conference on Computer Vision.
N. Shivakumar and H. Garcia-Molina. 1995. SCAM:
A copy detection mechanism for digital documents.
In Proceedings of the Second Annual Conference on
the Theory and Practice of Digital Libraries.
E. Stamatatos. 2011. Plagiarism detection using stop-
word n-grams. Journal of the American Society for
Information Science and Technology, 62(12):2512–2527.
B. Stein, N. Lipka, and P. Prettenhofer. 2011. Intrinsic
plagiarism analysis. Language Resources and Eval-
uation, 45(1):63–82.
Ö. Uzuner, B. Katz, and T. Nahnsen. 2005. Using syn-
tactic information to identify plagiarism. In Pro-
ceedings of the 2nd Workshop on Building Educa-
tional Applications using NLP. Ann Arbor.
X. Wang, K. Evanini, J. Bruno, and M. Mulholland.
2016. Automatic plagiarism detection for spoken
responses in an assessment of English language pro-
ficiency. In Proceedings of the IEEE Spoken Lan-
guage Technology Workshop, pages 121–128.
K. Zechner, D. Higgins, and X. Xi. 2007.
SpeechRater℠: A construct-driven approach to
scoring spontaneous non-native speech. In Proceed-
ings of the International Speech Communication
Association Special Interest Group on Speech and
Language Technology in Education, pages 128–131.
K. Zechner, D. Higgins, X. Xi, and D. M. Williamson.
2009. Automatic scoring of non-native spontaneous
speech in tests of spoken English. Speech Commu-
nication, 51(10):883–895.
K. Zechner and X. Wang. 2013. Automated content
scoring of spoken responses in an assessment for
teachers of English. In Proceedings of the Eighth
Workshop on Innovative Use of NLP for Building
Educational Applications, page 73–81.