Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 435–443, Florence, Italy, August 2, 2019. © 2019 Association for Computational Linguistics
Application of an Automatic Plagiarism Detection System in a Large-scale
Assessment of English Speaking Proficiency
Xinhao Wang¹, Keelan Evanini², Matthew Mulholland², Yao Qian¹, James V. Bruno²
Educational Testing Service
¹ 90 New Montgomery St #1450, San Francisco, CA 94105, USA
² 660 Rosedale Road, Princeton, NJ 08541, USA
{xwang002, kevanini, mmulholland, yqian, jbruno}@ets.org
Abstract
This study aims to build an automatic system for the detection of plagiarized spoken responses in the context of an assessment of English speaking proficiency for non-native speakers. Classification models were trained to distinguish between plagiarized and non-plagiarized responses with two different types of features: text-to-text content similarity measures, which are commonly used in the task of plagiarism detection for written documents, and speaking proficiency measures, which were specifically designed for spontaneous speech and extracted using an automated speech scoring system. The experiments were first conducted on a large data set drawn from an operational English proficiency assessment across multiple years, and the best classifier on this heavily imbalanced data set resulted in an F1-score of 0.761 on the plagiarized class. This system was then validated on operational responses collected from a single administration of the assessment and achieved a recall of 0.897. The results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
1 Introduction
Plagiarism of spoken responses has become a vexing problem in the domain of spoken language assessment, in particular the evaluation of non-native speaking proficiency, since there exists a vast amount of easily accessible online resources covering a wide variety of topics that test takers can use to prepare responses prior to the test. In the context of large-scale, standardized assessments of spoken English for academic purposes, such as the TOEFL iBT test (ETS, 2012), the Pearson Test of English Academic (Longman, 2010), and the IELTS Academic assessment (Cullen et al., 2014), some test takers may utilize content from online resources or other prepared sources in their spoken responses to test questions that are intended to elicit spontaneous speech. These responses based on canned material pose a problem for both human raters and automated scoring systems, and can reduce the validity of scores that are provided to the test takers; therefore, research into the automated detection of plagiarized spoken responses is necessary.
In this paper, we investigate a variety of features for automatically detecting plagiarized spoken responses in the context of a standardized assessment of English speaking proficiency. In addition to examining several commonly used text-to-text content similarity features, we also use features that compare various aspects of speaking proficiency across multiple responses provided by a test taker, based on the hypothesis that certain aspects of speaking proficiency, such as fluency, may be artificially inflated in a test taker's canned responses in comparison to non-canned responses. These features are designed to be independent of the availability of the reference source materials. Finally, we evaluate the effectiveness of this system on a data set with a large number of control (non-plagiarized) responses in an attempt to simulate the imbalanced distribution from an operational setting in which only a small number of the test takers' responses are plagiarized. In addition, we further validate this system on operational data and show how it can practically assist both human and automated scoring in a large-scale assessment of English speaking proficiency.
2 Previous Work
Previous research related to automated plagiarism detection for natural language has mainly focused on written documents. For example, a series of shared tasks has enabled a variety of NLP approaches for detecting plagiarism to be compared on a standardized set of texts (Potthast et al., 2013), and several plagiarism detection services are available online.¹ Various techniques have been employed in this task, including n-gram overlap (Lyon et al., 2006), document fingerprinting (Brin et al., 1995), word frequency statistics (Shivakumar and Garcia-Molina, 1995), information retrieval-based metrics (Hoad and Zobel, 2003), text summarization evaluation metrics (Chen et al., 2010), WordNet-based features (Nahnsen et al., 2005), stopword-based features (Stamatatos, 2011), features based on shared syntactic patterns (Uzuner et al., 2005), features based on word swaps detected via dependency parsing (Mozgovoy et al., 2007), and stylometric features (Stein et al., 2011), among others. In general, for the task of monolingual plagiarism detection, these methods can be categorized as either external plagiarism detection, in which a document is compared to a body of reference documents, or intrinsic plagiarism detection, in which a document is evaluated independently without a reference collection (Alzahrani et al., 2012). This task is also related to the widely studied task of paraphrase recognition, which benefits from similar types of features (Finch et al., 2005; Madnani et al., 2012). The current study adopts several of these features that are designed to be robust to the presence of word-level modifications between the source and the plagiarized text; since this study focuses on spoken responses that are reproduced from memory and subsequently processed by a speech recognizer, metrics that rely on exact matches are likely to perform sub-optimally.

¹ For example, http://turnitin.com/en_us/what-we-offer/feedback-studio, http://www.grammarly.com/plagiarism, and http://www.paperrater.com/plagiarism_checker.
Little prior work has been conducted on the task of automatically detecting similar spoken responses, although research in the field of Spoken Document Retrieval (Hauptmann, 2006) is relevant. Due to the difficulties involved in collecting corpora of actual plagiarized material, nearly all published results of approaches to the task of plagiarism detection have relied on either simulated plagiarism (i.e., plagiarized texts generated by experimental human participants in a controlled environment) or artificial plagiarism (i.e., plagiarized texts generated by algorithmically modifying a source text) (Potthast et al., 2010). These results, however, may not reflect actual performance in a deployed setting, since the characteristics of the plagiarized material may differ from actual plagiarized responses.
In previous studies (Evanini and Wang, 2014; Wang et al., 2016), we conducted experiments on a simulated data set from an operational, large-scale, standardized English proficiency assessment and obtained initial results with an F1-measure of 70.6% using an automatic system to detect plagiarized spoken responses (Wang et al., 2016). Based on these previous findings, we extend this line of research and contribute in the following ways: 1) an improved automatic speech recognition (ASR) system based on Kaldi was introduced, and an unsupervised language model adaptation method was employed to improve the ASR performance on spontaneous speech elicited by new, unseen test questions; 2) an improved set of text-to-text content similarity features based on n-gram overlap and Word Mover's Distance was investigated; 3) in addition to evaluating the system on a simulated imbalanced data set, we also validated the developed automatic system using all of the responses from a single operational administration of the English speaking assessment in order to obtain a reliable estimate of the system's performance in a practical deployment.
3 Data
The data used in this study was drawn from a large-scale, high-stakes assessment of English for non-native speakers, which assesses English communication skills for academic purposes. The Speaking section of this assessment contains six tasks designed to elicit spontaneous spoken responses: two of them require test takers to provide an opinion or preference based on personal experience, which are referred to as independent tasks; and the other four tasks require test takers to summarize or discuss material provided in a reading and/or listening passage, which are referred to as integrated tasks.
In general, the independent tasks ask questions on topics that are familiar to test takers and are not based on any stimulus materials. Therefore, test takers can provide responses containing a wide variety of specific examples. In some cases, test takers may attempt to game the assessment by memorizing canned material from an external source and adapting it to the questions presented in the independent tasks. This type of plagiarism can affect the validity of a test taker's speaking score and can be grounds for score cancellation. However, it is often difficult even for trained human raters to recognize plagiarized spoken responses, due to the large number and variety of external sources that are available to test takers.
In order to identify plagiarized spoken responses from the operational test, human raters first flag spoken responses that contain potentially plagiarized material, and trained experts subsequently review them to make the final decision. In the review process, the responses were transcribed and compared to external source materials obtained through manual internet searches; if it was determined that the presence of plagiarized material made it impossible to provide a valid assessment of the test taker's performance on the speaking task, the response was labeled as a plagiarized response and assigned a score of 0. In this study, a total of 1,557 plagiarized responses to independent test questions were collected from the operational assessment across multiple years.
During the process of reviewing potentially plagiarized responses, the raters also collected a data set of external sources that appeared to have been used by test takers in their responses. In some cases, the test taker's spoken response was nearly identical to an identified source; in other cases, several sentences or phrases were clearly drawn from a particular source, although some modifications were apparent. Table 1 presents a sample source that was identified for several of the responses in the data set along with a sample plagiarized response that apparently contains extended sequences of words directly matching idiosyncratic features of this source, such as the phrases "how romantic it can ever be" and "just relax yourself on the beach." In general, test takers typically do not reproduce the entire source material in their responses; rather, they often attempt to adapt the source material to a specific test question by providing some speech that is directly relevant to the prompt and combining it with the plagiarized material. An example of this is shown by the opening and closing non-italicized portions of the sample plagiarized response in Table 1. In total, human raters identified 224 different source materials while reviewing the potentially plagiarized responses, and their summary statistics are as follows: the average number of words is 95.7 (std. dev. = 38.5), the average number of clauses is 10.3 (std. dev. = 5.1), and the average number of words per clause is 9.3 (std. dev. = 7.1).
In addition to the source materials and the plagiarized responses, a set of non-plagiarized control responses was also obtained in order to conduct classification experiments between plagiarized and non-plagiarized responses. Since the plagiarized responses were collected over the course of multiple years, they were drawn from many different test forms, and it was not practical to obtain control data from all of the test forms that were represented in the plagiarized set. So, only the 166 test forms that appear most frequently in the canned data set were used for the collection of control responses, and 200 test takers were randomly selected from each form, without any overlap with speakers in the plagiarized set. The two spoken responses to the two independent questions were collected from each speaker; in total, 66,400 spoken responses from 33,200 speakers were obtained as the control set. Therefore, the data set used in this study is quite imbalanced: the number of control responses is larger than the number of plagiarized responses by a factor of 43.
4 Methodology
This study employed two different types of features in the automatic detection of plagiarized spoken responses: 1) similar to human raters' behavior in identifying canned spoken responses, a set of features is developed to measure the content similarities between a test response and the source materials that were collected; 2) to deal with this particular task of plagiarism detection for spontaneous spoken responses, a set of features is introduced based on the assumption that the production of spoken language based on memorized material is expected to differ from the production of non-plagiarized speech in aspects of a test taker's speech delivery, such as fluency, pronunciation, and prosody.
4.1 Content Similarity
Previous work (Wang et al., 2016; Evanini and Wang, 2014) has demonstrated the effectiveness of using content-based features for the task of automatic plagiarized spoken response detection. Therefore, this study investigates the use of improved features based on the measurement of text-to-text content similarity.
Table 1: A sample source passage and the transcription of a sample plagiarized spoken response that was apparently drawn from the source. The test question/prompt used to elicit this response is also included. The overlapping word sequences between the source material and the transcription of the spoken response are indicated in italics.

Sample source passage: Well, the place I enjoy the most is a small town located in France. I like this small town because it has very charming ocean view. I mean the sky there is so blue and the beach is always full of sunshine. You know how romantic it can ever be, just relax yourself on the beach, when the sun is setting down, when the ocean breeze is blowing and the seabirds are singing. Of course I like this small French town also because there are many great French restaurants. They offer the best seafood in the world like lobsters and tuna fishes. The most important, I have been benefited a lot from this trip to France because I made friends with some gorgeous French girls. One of them even gave me a little watch as a souvenir of our friendship.

Prompt: Talk about an activity you enjoyed doing with your family when you were a kid.

Transcription of a plagiarized response: family is a little trip to France when I was in primary school ten years ago I enjoy this activity first because we visited a small French town located by the beach the town has very charming ocean view and in the sky is so blue and the beach is always full of sunshine you know how romantic it can ever be just relax yourself on the beach when the sun is settling down the sea birds are singing of course I enjoy this activity with my family also because there are many great French restaurants they offer the best sea food in the world like lobsters and tuna fishes so I enjoy this activity with my family very much even it has passed several years
Given a test response, a comparison is made with each of the 224 reference sources using the following two content similarity metrics: word n-gram overlap and Word Mover's Distance. Then, the maximum similarity or the minimum distance is taken as a single feature to measure the content relevance between the test responses and the source materials.
4.1.1 N-gram Overlap
Features based on the BLEU metric have been proven to be effective in measuring the content appropriateness of spoken responses in the context of English proficiency assessment (Zechner and Wang, 2013) and in measuring content similarity in the detection of plagiarized spoken responses (Wang et al., 2016; Evanini and Wang, 2014). In this study, we first design a new type of feature, known as n-gram overlap, by emulating and improving on the previous BLEU-based features. Word n-grams, with n varying from 1 word to 11 words, are first extracted from both the test response and each of the source materials, and then the number of overlapping n-grams is counted, where n-gram types and tokens are counted separately. The intuition behind decreasing the maximum order is to increase the classifier's recall by evaluating the overlap of shorter word sequences, such as individual words in the unigram setting. On the other hand, the motivation behind increasing the maximum order is to boost the classifier's precision, since it will focus on matches of longer word sequences. Here, the maximum order of 11 was experimentally decided in consideration of the average number of words per clause in the source materials, which is 9.3 as described in Section 3.
In order to calculate the maximum similarity across all source materials, the 11 n-gram overlap counts are combined to generate one weighted score between a test response and each source as in Equation 1, and then the maximum score across all sources is calculated as a feature to measure the similarity between the test response and the set of source materials. Meanwhile, the 11 n-gram overlap counts calculated using the source with the maximum similarity score are also taken as features.

\[
\sum_{i=1}^{n} \frac{i}{n(n+1)/2}\,\mathrm{count\_overlap}(i\text{-gram}) \tag{1}
\]
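As a concrete illustration, the weighted overlap score in Equation 1 and the max-over-sources feature might be computed as in the following minimal sketch; the tokenization, function names, and overlap counting shown here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the Equation-1 weighted n-gram overlap score (names are illustrative).
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset (Counter) of word n-grams of order n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def weighted_overlap(response_tokens, source_tokens, max_order=11):
    """Equation 1: overlap counts for orders 1..max_order, longer n-grams weighted more."""
    norm = max_order * (max_order + 1) / 2
    score = 0.0
    for i in range(1, max_order + 1):
        resp, src = ngrams(response_tokens, i), ngrams(source_tokens, i)
        overlap = sum((resp & src).values())  # token-level overlap count
        score += (i / norm) * overlap
    return score

def max_source_similarity(response_tokens, sources):
    """Maximum Equation-1 score of a response against all reference sources."""
    return max(weighted_overlap(response_tokens, s) for s in sources)
```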
Furthermore, the n-gram-based feature set can be enlarged by: 1) normalizing the n-gram counts by either the number of n-grams in the test response or the number of n-grams in each of the sources; 2) combining all source materials together into a single document for comparison (11 features based on n-gram overlap with the combined source), which is designed based on the assumption that test takers may attempt to use content from multiple sources. Similarly, this type of feature can be further normalized by the number of n-grams in the test response. Based on all of these variations, a total of 116 n-gram overlap features is generated for each spoken response.
4.1.2 Word Mover's Distance
More recently, various approaches based on deep neural networks (DNN) and word embeddings trained on large corpora have shown promising performance in document similarity detection (Kusner et al., 2015). In contrast to traditional similarity features, which are limited to a reliance on exact word matching, such as the above n-gram overlap features, these new approaches have the advantage of capturing topically relevant words that are not identical. In this study, we employ Word Mover's Distance (WMD) (Kusner et al., 2015) to measure the distance between a test response and a source material based on word embeddings.

Words are first represented as embedding vectors, and then the distance between each word appearing in a test response and each word in a source can be measured using the Euclidean distance in the embedding space. WMD is then the minimum cumulative Euclidean distance required to match the words of one document to the words of the other. This minimization problem is a special case of Earth Mover's Distance (Rubner et al., 1998), for which efficient algorithms are available. Kusner et al. (2015) reported that WMD outperformed other distance measures on document retrieval tasks and that the embeddings trained on the Google News corpus consistently performed well across a variety of contexts. For this work, we used the same word embeddings that were used in the weighted embedding features as the input for the WMD calculation.
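A minimal sketch of this feature, assuming the pre-trained Google News vectors and gensim's wmdistance method (the file path, preprocessing, and feature name are assumptions; the paper's own implementation may differ):

```python
# Sketch: minimum Word Mover's Distance between a response and any reference source.
from gensim.models import KeyedVectors

# Pre-trained Google News word2vec vectors (the path is illustrative).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def min_source_wmd(response_tokens, sources):
    """Smallest WMD between the tokenized response and any tokenized source passage."""
    return min(vectors.wmdistance(response_tokens, src_tokens)
               for src_tokens in sources)
```

A smaller distance suggests closer content overlap with some source, so the minimum over all 224 sources plays the same role as the maximum n-gram similarity.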
4.2 Difference in Speaking Proficiency
The performance of the above content similarity features depends greatly on the availability of a comprehensive set of source materials. If a test taker uses unseen source materials as the basis for a plagiarized response, the system may fail to detect it. Accordingly, a set of features that do not rely on a comparison with source materials has been proposed previously (Wang et al., 2016). The current study also examined this type of feature.
As described in Section 3, the Speaking section of the assessment includes both independent and integrated tasks. In a given test administration, test takers are required to respond to all six questions; however, plagiarized responses are more likely to appear in the two independent tasks, since they are not based on specific reading and/or listening passages and thus elicit a wider range of variation across responses. Since the plagiarized responses are mostly constructed based on memorized material, they may be delivered in a more fluent and proficient manner compared to responses that contain fully spontaneous speech. Based on this assumption, a set of features can be developed to capture the difference between various speaking proficiency features extracted from the canned and the fully spontaneous speech produced by the same test taker, where an automated spoken English assessment system, SpeechRater® (Zechner et al., 2007, 2009), can be used to provide the automatic proficiency scores along with 29 features measuring fluency, pronunciation, prosody, rhythm, vocabulary, and grammar. Since most plagiarized responses are expected to occur in the independent tasks, we assume the integrated responses are based on spontaneous speech. A mismatch between the proficiency scores and the feature values from the independent responses and the integrated responses from the same speaker can potentially indicate the presence of both prepared speech and spontaneous speech, and, therefore, the presence of plagiarized spoken responses.
Given an independent response from a test taker, along with the other independent response and the four integrated responses from the same test taker, 6 features can be extracted for each of the proficiency scores and 29 SpeechRater features. First, the difference in score/feature values between the two independent responses was calculated as a feature, which was used to deal with the case in which only one independent response was canned and the other contained spontaneous speech. Then, basic descriptive statistics, including mean, median, min, and max, were obtained across the four integrated responses. The differences between the score/feature value of the independent response and these four basic statistics were extracted as additional features. Finally, another feature was extracted by standardizing the score/feature value of the independent response with the mean and standard deviation from the integrated responses. In total, a set of 180 features was extracted, referred to as SpeechRater in the following experiments.
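A minimal sketch of the six difference features derived for each SpeechRater score or feature under the setup described above; the function and key names are illustrative, not the paper's code.

```python
# Sketch: six difference features per proficiency measure (Section 4.2).
import statistics

def proficiency_difference_features(indep_value, other_indep_value, integrated_values):
    """indep_value: measure for the independent response under test;
    other_indep_value: same measure for the other independent response;
    integrated_values: the measure for the four integrated responses."""
    mean = statistics.mean(integrated_values)
    stdev = statistics.pstdev(integrated_values) or 1e-6  # guard against zero variance
    return {
        "diff_other_independent": indep_value - other_indep_value,
        "diff_integrated_mean": indep_value - mean,
        "diff_integrated_median": indep_value - statistics.median(integrated_values),
        "diff_integrated_min": indep_value - min(integrated_values),
        "diff_integrated_max": indep_value - max(integrated_values),
        "z_score_vs_integrated": (indep_value - mean) / stdev,
    }
```

Applying this to the overall proficiency score plus the 29 SpeechRater features yields 30 × 6 = 180 features per independent response, matching the count reported above.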
5 Experiments and Results
5.1 ASR Improvement
In this study, spoken responses need to be transcribed into text so that they can be compared with the source materials to measure the text-to-text similarity. However, due to the large number of spoken responses considered in this study, it is not practical to manually transcribe all of them; therefore, a Kaldi-based (http://kaldi-asr.org/) automatic speech recognition engine was employed. The training set used to develop the speech recognizer consists of similar responses (around 800 hours of speech) drawn from the same assessment and does not overlap with the data sets included in this study.
When using an ASR system to recognize spoken responses from new prompts that are not seen in the ASR training data, a degradation in recognition accuracy is expected because of the mismatch between the training and test data. In this study, we used an unsupervised language model (LM) adaptation method to improve the ASR performance on unseen data. In this method, two rounds of language model adaptation were conducted with the following steps: first, out-of-vocabulary (OOV) words from the prompt materials were added to the pronunciation dictionary and the baseline models were adapted with the prompts; second, the adapted models were applied to spoken responses from these new prompts to produce the recognized texts along with confidence scores corresponding to each response; third, automatically transcribed texts with confidence scores higher than a predefined threshold of 0.8 were selected; finally, these high-confidence recognized texts were used to conduct another round of language model adaptation. We evaluated this unsupervised language model adaptation method on a stand-alone data set with 1,500 responses from 250 test speakers, where the prompts used to elicit these responses were unseen in the baseline recognition models. In this experiment, supervised language model adaptation with human transcriptions was also examined for comparison.
Table 2: Word error rate (WER %) reduction with an unsupervised language model adaptation method, where the WERs on Independent items (IND), Integrated items (INT), as well as all items (ALL), are reported. The WER with the supervised adaptation method based on human transcriptions is also listed for comparison.

ASR Systems    IND   INT   ALL
Baseline       22.5  26.6  25.5
Unsupervised   21.5  23.5  22.9
Supervised     21.2  22.1  21.8

As shown in Table 2, the overall word error rate (WER) is 25.5% for the baseline models. By applying the unsupervised LM adaptation method, the overall WER can be reduced to 22.9%; in comparison, the overall WER can be reduced to 21.8% by using supervised LM adaptation. The unsupervised method can achieve very similar results to the supervised method, especially for responses to the independent prompts, i.e., a WER of 21.5% (unsupervised) vs. 21.2% (supervised). These results indicate the effectiveness of the proposed unsupervised adaptation method, which was employed in the subsequent automatic plagiarism detection work.
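The two-round adaptation loop described above might be sketched as follows; adapt_lm(), decode(), and lexicon.add_oov_words() stand in for Kaldi-side steps and are hypothetical placeholders, not real Kaldi commands.

```python
# Sketch of the two-round unsupervised LM adaptation procedure (Section 5.1).
CONFIDENCE_THRESHOLD = 0.8  # threshold reported in the paper

def unsupervised_lm_adaptation(baseline_lm, lexicon, prompt_texts, responses):
    # Round 1: add prompt OOV words to the pronunciation dictionary and
    # adapt the baseline LM with the prompt materials.
    lexicon.add_oov_words(prompt_texts)
    adapted_lm = adapt_lm(baseline_lm, prompt_texts)

    # Decode the responses to the new prompts, keeping per-response confidences.
    decoded = [decode(adapted_lm, lexicon, r) for r in responses]  # -> (text, confidence)
    confident_texts = [text for text, conf in decoded if conf > CONFIDENCE_THRESHOLD]

    # Round 2: adapt the LM again on the high-confidence automatic transcripts.
    return adapt_lm(adapted_lm, confident_texts)
```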
5.2 Experimental Setup
Due to ASR failures, a small number of responses were excluded from the experiments; finally, a total of 1,551 canned and 66,257 control responses were included in the simulated data. Since this work was conducted on a very imbalanced data set and only 2.3% of the responses in the simulated data are authentic plagiarized ones confirmed by human raters, 10-fold cross-validation was performed first on the simulated data. Afterward, the classification model built on the simulated data set was further evaluated on a corpus with real operational data.
We employed the machine learning toolkit scikit-learn (Pedregosa et al., 2011) for training the classifier, via SKLL, a Python tool that simplifies running scikit-learn experiments (available from https://github.com/EducationalTestingService/skll). Scikit-learn provides various classification methods, such as decision tree, random forest, AdaBoost, etc. This study involves a variety of features from two different categories, and preliminary experiments demonstrated that the random forest model achieves the best overall performance. Therefore, the random forest method is used to build classification models in the following experiments, and the precision, recall, as well as F1-score on the positive class (plagiarized responses) are used as the evaluation metrics.
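A minimal sketch of this setup is shown below, assuming a feature matrix X and binary labels y (1 = plagiarized) built from the features in Section 4; the hyperparameters are illustrative rather than the paper's.

```python
# Sketch: random forest with 10-fold cross-validation, scored on the plagiarized class.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

clf = RandomForestClassifier(n_estimators=500, random_state=0)
# X: feature matrix; y: 1 for plagiarized responses, 0 for control responses.
pred = cross_val_predict(clf, X, y, cv=10)
precision, recall, f1, _ = precision_recall_fscore_support(
    y, pred, average="binary", pos_label=1)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```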
5.3 Experimental Results
First, in order to verify the effectiveness of the newly developed n-gram overlap features, a preliminary experiment was conducted to compare this set of features with BLEU-based features, since the latter had been shown to be effective in previous research (Wang et al., 2016). The results shown in Table 3 indicate that the F1-score of the n-gram features outperforms that of the BLEU features (0.761 vs. 0.748), and the recall of the n-gram features is higher than that of the BLEU features (0.716 vs. 0.683); conversely, the BLEU features result in higher precision (0.83 vs. 0.814). Accordingly, the n-gram features are used to replace the BLEU ones, since it is more important to reduce the number of false negatives, i.e., improve the recall, for our task.
Table 3: Comparison of n-gram and BLEU features.

Features  Precision  Recall  F1
BLEU      0.83       0.683   0.748
n-gram    0.814      0.716   0.761
Furthermore, each individual type of feature and their combinations were examined in the classification experiments described above. As shown in Table 4, each feature set alone was used to build classification models. The n-gram overlap features result in the best performance with an F1-score of 0.761, and the WMD features capturing the topical relevance between word pairs result in a much lower F1-score of 0.649. Furthermore, the combination of both types of content similarity features, i.e., n-gram and WMD, slightly reduces the F1-score to 0.76. These results indicate that for this particular task, the exact match of certain expressions appearing in both the test response and a source material plays a critical role.
Table 4: Classification performance using each individual feature set and their combinations.

Features        Precision  Recall  F1
1. n-gram       0.814      0.716   0.761
2. WMD          0.663      0.636   0.649
3. SpeechRater  0.8        0.009   0.018
1 + 2           0.812      0.716   0.76
1 + 2 + 3       0.821      0.696   0.752

As for the speaking proficiency-related features, they can lead to a promising precision of 0.8 but with a very low recall of 0.009. After reexamining the assumption that human experts may be able to differentiate prepared speech from fully spontaneous speech based on the way the speech is delivered, it turns out that it is quite challenging for human experts to make a reliable judgment of plagiarism based only on speech delivery without any reference to the source materials, in particular within the context of high-stakes language assessment. Accordingly, the features capturing the difference in speaking proficiency between prepared and spontaneous speech can be used as contributory information to improve the accuracy of an automatic detection system, but they are unable to achieve promising performance alone. As also shown in Table 4, by adding the speaking proficiency features, the precision can be improved to 0.821, but the recall is reduced; as a result, the F1-score is reduced to 0.752.
6 Employment in Operational Test
6.1 Operational Use
In order to obtain a more accurate estimate of how well the automatic plagiarism detection system might perform in a practical application, in which the distribution is expected to be heavily skewed towards the non-plagiarized category, all test-taker responses to independent prompts were collected from a single administration of the speaking assessment. In total, 13,516 independent responses from 6,758 speakers were extracted for system evaluation. We collected 39 responses confirmed as plagiarized through the human review process, which represents 0.29% of the data set.
In this particular task, automatic detection systems can be applied to support human raters: all potentially plagiarized responses can first be automatically identified, and human experts can then review the flagged responses and make the final decision about potential instances of plagiarism. In this scenario, it is more important to increase the number of true positives flagged by the automated system; thus, recall of plagiarized responses was used as the evaluation metric, i.e., how many responses can be successfully detected among these 39 confirmed instances of plagiarism.
6.2 Results and Discussion
In order to maximize the recall of plagiarized responses for this review, several models were built with either different types of features or different types of classification models, for example, random forest and AdaBoost with a decision tree as the weak classifier, and they were then combined in an ensemble manner to flag potentially plagiarized responses, i.e., a response is flagged if any of the models detects it as a plagiarized response. This ensemble system flagged 850 responses as instances of plagiarism in total and achieved a recall of 0.897, i.e., 35 of the confirmed plagiarized responses were successfully identified by the automatic system and 4 of them were missed.
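A minimal sketch of this ensemble flagging rule, assuming a list of trained scikit-learn-style classifiers; the variable names are illustrative.

```python
# Sketch: flag a response if any model in the ensemble predicts the plagiarized class.
import numpy as np

def ensemble_flag(models, X_operational):
    """Return a boolean mask that is True where at least one model flags the response."""
    votes = np.column_stack([m.predict(X_operational) for m in models])
    return votes.any(axis=1)

# Example usage (model names are placeholders):
# flagged = ensemble_flag([random_forest, adaboost], X_operational)
# Responses with flagged == True are routed to human expert review.
```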
These results demonstrate that the developed system can provide a substantial benefit to both human and automated assessment of non-native spoken English. Manual identification of plagiarized responses can be a very difficult and time-consuming task, in which human experts need to memorize hundreds of source materials and compare them to thousands of responses. By applying an automated system, potentially plagiarized responses can first be filtered out of the standard scoring pipeline; subsequently, experts can review these flagged responses to confirm whether they actually contain plagiarized material. Accordingly, instead of reviewing all 13,516 responses in this administration for plagiarized content, human effort is required only for the 850 flagged responses, thus substantially reducing the overall human effort. Optimizing recall is therefore appropriate in this targeted use case, since the number of false positives is within an acceptable range for the expert review. In addition, the source labels indicating which source materials were likely used in the preparation of each response are automatically generated by the system for each suspected response; this information can help to accelerate the manual review process.
7 Conclusion and Future Work
This study proposed a system which can benefit a high-stakes assessment of English speaking proficiency by automatically detecting potentially plagiarized spoken responses, and investigated the empirical effectiveness of two different types of features. One is based on automatic plagiarism detection methods commonly applied to written texts, in which the content similarity between a test response and a set of source materials collected by human raters was measured. In addition, this study also adopted a set of features which do not rely on the human effort involved in source material collection and can be easily applied to unseen test questions. This type of feature attempts to capture the difference in speech patterns between prepared responses and fully spontaneous responses from the same speaker in a test. Finally, the classification models were evaluated on a large set of responses collected from an operational test, and the experimental results demonstrate that the automatic detection system can achieve an F1-measure of 0.761. Further evaluation on real operational data also shows the effectiveness of the automatic detection system.
The task of applying an automatic system in a large-scale operational assessment is quite challenging, since typically only a small number of plagiarized responses are distributed among a much larger number of non-plagiarized responses to a wide range of different test questions. In the future, we will continue our research efforts to improve the automatic detection system along the following lines. First, since deep learning techniques have recently shown their effectiveness in the fields of both speech processing and natural language understanding, we will further explore various deep learning techniques to improve the metrics used to measure the content similarity between test responses and source materials. Next, further analysis will be conducted to determine the extent of the differences between canned and spontaneous speech, and additional features will be explored based on the findings. In addition, another focus is to build automatic tools to regularly update the pool of source materials. Besides internet search, new sources can also be detected by comparing all candidate responses to the same test question, since different test takers may use the same source to produce their canned responses.
References

S. M. Alzahrani, N. Salim, and A. Abraham. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(2):133–149.

S. Brin, J. Davis, and H. Garcia-Molina. 1995. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, pages 398–409.

C. Chen, J. Yeh, and H. Ke. 2010. Plagiarism detection using ROUGE and WordNet. Journal of Computing, 2(3):34–44.

P. Cullen, A. French, and V. Jakeman. 2014. The Official Cambridge Guide to IELTS. Cambridge University Press.

ETS. 2012. The Official Guide to the TOEFL® Test, Fourth Edition. McGraw-Hill.

Keelan Evanini and Xinhao Wang. 2014. Automatic detection of plagiarized spoken responses. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–27.

A. Finch, Y. Hwang, and E. Sumita. 2005. Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In Proceedings of the Third International Workshop on Paraphrasing, pages 17–24.

A. Hauptmann. 2006. Automatic spoken document retrieval. In Keith Brown, editor, Encyclopedia of Language and Linguistics (Second Edition), pages 95–103. Elsevier Science.

T. C. Hoad and J. Zobel. 2003. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54:203–215.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 957–966, Lille, France.

P. Longman. 2010. The Official Guide to Pearson Test of English Academic. Pearson Education ESL.

C. Lyon, R. Barrett, and J. Malcolm. 2006. Plagiarism is easy, but also easy to detect. Plagiary, 1:57–65.

N. Madnani, J. Tetreault, and M. Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190, Montréal, Canada. Association for Computational Linguistics.

M. Mozgovoy, T. Kakkonen, and E. Sutinen. 2007. Using natural language parsers in plagiarism detection. In Proceedings of the ISCA Workshop on Speech and Language Technology in Education (SLaTE).

T. Nahnsen, Ö. Uzuner, and B. Katz. 2005. Lexical chains and sliding locality windows in content-based text similarity detection. CSAIL Technical Report, MIT-CSAIL-TR-2005-034.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

M. Potthast, M. Hagen, T. Gollub, M. Tippmann, J. Kiesel, P. Rosso, E. Stamatatos, and B. Stein. 2013. Overview of the 5th International Competition on Plagiarism Detection. In CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers.

M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics.

Y. Rubner, C. Tomasi, and L. J. Guibas. 1998. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision.

N. Shivakumar and H. Garcia-Molina. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries.

E. Stamatatos. 2011. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.

B. Stein, N. Lipka, and P. Prettenhofer. 2011. Intrinsic plagiarism analysis. Language Resources and Evaluation, 45(1):63–82.

Ö. Uzuner, B. Katz, and T. Nahnsen. 2005. Using syntactic information to identify plagiarism. In Proceedings of the 2nd Workshop on Building Educational Applications Using NLP, Ann Arbor.

X. Wang, K. Evanini, J. Bruno, and M. Mulholland. 2016. Automatic plagiarism detection for spoken responses in an assessment of English language proficiency. In Proceedings of the IEEE Spoken Language Technology Workshop, pages 121–128.

K. Zechner, D. Higgins, and X. Xi. 2007. SpeechRater℠: A construct-driven approach to scoring spontaneous non-native speech. In Proceedings of the International Speech Communication Association Special Interest Group on Speech and Language Technology in Education, pages 128–131.

K. Zechner, D. Higgins, X. Xi, and D. M. Williamson. 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10):883–895.

K. Zechner and X. Wang. 2013. Automated content scoring of spoken responses in an assessment for teachers of English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 73–81.

More Related Content

Similar to Application Of An Automatic Plagiarism Detection System In A Large-Scale Assessment Of English Speaking Proficiency

50 3 10_norris
50 3 10_norris50 3 10_norris
50 3 10_norris
havannamsih
Ā 
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
Cemal Ardil
Ā 
Site2011 tomidaokibayashitamura
Site2011 tomidaokibayashitamuraSite2011 tomidaokibayashitamura
Site2011 tomidaokibayashitamura
Eiji Tomida
Ā 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
IAESIJAI
Ā 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
cscpconf
Ā 

Similar to Application Of An Automatic Plagiarism Detection System In A Large-Scale Assessment Of English Speaking Proficiency (20)

50 3 10_norris
50 3 10_norris50 3 10_norris
50 3 10_norris
Ā 
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
A black-box-approach-for-response-quality-evaluation-of-conversational-agent-...
Ā 
Cross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristicsCross lingual similarity discrimination with translation characteristics
Cross lingual similarity discrimination with translation characteristics
Ā 
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATIONSEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
Ā 
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATIONSEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
SEMI-AUTOMATIC SIMULTANEOUS INTERPRETING QUALITY EVALUATION
Ā 
UTPL-LENGUAGE TESTING-I-BIMESTRE-(OCTUBRE 2011-FEBRERO 2012)
UTPL-LENGUAGE TESTING-I-BIMESTRE-(OCTUBRE 2011-FEBRERO 2012)UTPL-LENGUAGE TESTING-I-BIMESTRE-(OCTUBRE 2011-FEBRERO 2012)
UTPL-LENGUAGE TESTING-I-BIMESTRE-(OCTUBRE 2011-FEBRERO 2012)
Ā 
Automatic Distractor Generation For Multiple-Choice English Vocabulary Questions
Automatic Distractor Generation For Multiple-Choice English Vocabulary QuestionsAutomatic Distractor Generation For Multiple-Choice English Vocabulary Questions
Automatic Distractor Generation For Multiple-Choice English Vocabulary Questions
Ā 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
Ā 
Chapter 3-different ways of data gathering
Chapter 3-different ways of data gatheringChapter 3-different ways of data gathering
Chapter 3-different ways of data gathering
Ā 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
Ā 
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGESA SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES
Ā 
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK
 MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK
Ā 
Thesis malyn
Thesis malynThesis malyn
Thesis malyn
Ā 
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
Ā 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
Ā 
Site2011 tomidaokibayashitamura
Site2011 tomidaokibayashitamuraSite2011 tomidaokibayashitamura
Site2011 tomidaokibayashitamura
Ā 
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSA COMPARATIVE STUDY OF FEATURE SELECTION METHODS
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS
Ā 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
Ā 
Language Needs Analysis for English Curriculum Validation
Language Needs Analysis for English Curriculum ValidationLanguage Needs Analysis for English Curriculum Validation
Language Needs Analysis for English Curriculum Validation
Ā 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
Ā 

More from Daphne Smith

More from Daphne Smith (20)

8 Page Essay Topics
8 Page Essay Topics8 Page Essay Topics
8 Page Essay Topics
Ā 
250 Word Essay Structure
250 Word Essay Structure250 Word Essay Structure
250 Word Essay Structure
Ā 
500 Word College Essay Format
500 Word College Essay Format500 Word College Essay Format
500 Word College Essay Format
Ā 
4 Main Causes Of World War 1 Essay
4 Main Causes Of World War 1 Essay4 Main Causes Of World War 1 Essay
4 Main Causes Of World War 1 Essay
Ā 
5 Paragraph Essay Outline Template Doc
5 Paragraph Essay Outline Template Doc5 Paragraph Essay Outline Template Doc
5 Paragraph Essay Outline Template Doc
Ā 
5Th Grade Essay Outline
5Th Grade Essay Outline5Th Grade Essay Outline
5Th Grade Essay Outline
Ā 
1. Write An Essay On The Evolution Of Computers
1. Write An Essay On The Evolution Of Computers1. Write An Essay On The Evolution Of Computers
1. Write An Essay On The Evolution Of Computers
Ā 
3Rd Grade Persuasive Essay Example
3Rd Grade Persuasive Essay Example3Rd Grade Persuasive Essay Example
3Rd Grade Persuasive Essay Example
Ā 
1 Page Essay Topics
1 Page Essay Topics1 Page Essay Topics
1 Page Essay Topics
Ā 
2 Page Essay On The Cold War
2 Page Essay On The Cold War2 Page Essay On The Cold War
2 Page Essay On The Cold War
Ā 
250 Word Scholarship Essay Examples
250 Word Scholarship Essay Examples250 Word Scholarship Essay Examples
250 Word Scholarship Essay Examples
Ā 
5 Paragraph Descriptive Essay Examples
5 Paragraph Descriptive Essay Examples5 Paragraph Descriptive Essay Examples
5 Paragraph Descriptive Essay Examples
Ā 
5 Paragraph Essay Topics For 8Th Grade
5 Paragraph Essay Topics For 8Th Grade5 Paragraph Essay Topics For 8Th Grade
5 Paragraph Essay Topics For 8Th Grade
Ā 
5 Paragraph Essay On Flowers For Algernon
5 Paragraph Essay On Flowers For Algernon5 Paragraph Essay On Flowers For Algernon
5 Paragraph Essay On Flowers For Algernon
Ā 
3 Page Essay Template
3 Page Essay Template3 Page Essay Template
3 Page Essay Template
Ā 
500 Word Narrative Essay Examples
500 Word Narrative Essay Examples500 Word Narrative Essay Examples
500 Word Narrative Essay Examples
Ā 
8 On My Sat Essay
8 On My Sat Essay8 On My Sat Essay
8 On My Sat Essay
Ā 
400 Word Essay Example
400 Word Essay Example400 Word Essay Example
400 Word Essay Example
Ā 
2 Essays Due Tomorrow
2 Essays Due Tomorrow2 Essays Due Tomorrow
2 Essays Due Tomorrow
Ā 
5 Paragraph Essay Template
5 Paragraph Essay Template5 Paragraph Essay Template
5 Paragraph Essay Template
Ā 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
Ā 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
Ā 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
Ā 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
Ā 

Recently uploaded (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
Ā 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
Ā 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Ā 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
Ā 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
Ā 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
Ā 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
Ā 
Tį»”NG ƔN Tįŗ¬P THI VƀO Lį»šP 10 MƔN TIįŗ¾NG ANH NĂM Hį»ŒC 2023 - 2024 CƓ ĐƁP ƁN (NGį»® Ƃ...
Tį»”NG ƔN Tįŗ¬P THI VƀO Lį»šP 10 MƔN TIįŗ¾NG ANH NĂM Hį»ŒC 2023 - 2024 CƓ ĐƁP ƁN (NGį»® Ƃ...Tį»”NG ƔN Tįŗ¬P THI VƀO Lį»šP 10 MƔN TIįŗ¾NG ANH NĂM Hį»ŒC 2023 - 2024 CƓ ĐƁP ƁN (NGį»® Ƃ...
Tį»”NG ƔN Tįŗ¬P THI VƀO Lį»šP 10 MƔN TIįŗ¾NG ANH NĂM Hį»ŒC 2023 - 2024 CƓ ĐƁP ƁN (NGį»® Ƃ...
Ā 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Ā 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
Ā 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
Ā 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
Ā 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
Ā 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
Ā 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
Ā 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
Ā 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Ā 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
Ā 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
Ā 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
Ā 

Application Of An Automatic Plagiarism Detection System In A Large-Scale Assessment Of English Speaking Proficiency

  • 1. Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 435ā€“443 Florence, Italy, August 2, 2019. c 2019 Association for Computational Linguistics 435 Application of an Automatic Plagiarism Detection System in a Large-scale Assessment of English Speaking Proficiency Xinhao Wang1 , Keelan Evanini2 , Matthew Mulholland2 , Yao Qian1 , James V. Bruno2 Educational Testing Service 1 90 New Montgomery St #1450, San Francisco, CA 94105, USA 2 660 Rosedale Road, Princeton, NJ 08541, USA {xwang002, kevanini, mmulholland, yqian, jbruno}@ets.org Abstract This study aims to build an automatic sys- tem for the detection of plagiarized spoken responses in the context of an assessment of English speaking proficiency for non-native speakers. Classification models were trained to distinguish between plagiarized and non- plagiarized responses with two different types of features: text-to-text content similarity measures, which are commonly used in the task of plagiarism detection for written doc- uments, and speaking proficiency measures, which were specifically designed for spon- taneous speech and extracted using an auto- mated speech scoring system. The experi- ments were first conducted on a large data set drawn from an operational English proficiency assessment across multiple years, and the best classifier on this heavily imbalanced data set resulted in an F1-score of 0.761 on the plagia- rized class. This system was then validated on operational responses collected from a single administration of the assessment and achieved a recall of 0.897. The results indicate that the proposed system can potentially be used to improve the validity of both human and au- tomated assessment of non-native spoken En- glish. 1 Introduction Plagiarism of spoken responses has become a vex- ing problem in the domain of spoken language assessment, in particular, the evaluation of non- native speaking proficiency, since there exists a vast amount of easily accessible online resources covering a wide variety of topics that test tak- ers can use to prepare responses prior to the test. In the context of large-scale, standardized as- sessments of spoken English for academic pur- poses, such as the TOEFL iBT test (ETS, 2012), the Pearson Test of English Academic (Long- man, 2010), and the IELTS Academic assessment (Cullen et al., 2014), some test takers may uti- lize content from online resources or other pre- pared sources in their spoken responses to test questions that are intended to elicit spontaneous speech. These responses that are based on canned material pose a problem for both human raters and automated scoring systems, and can reduce the va- lidity of scores that are provided to the test takers; therefore, research into the automated detection of plagiarized spoken responses is necessary. In this paper, we investigate a variety of features for automatically detecting plagiarized spoken re- sponses in the context of a standardized assess- ment of English speaking proficiency. In addition to examining several commonly used text-to-text content similarity features, we also use features that compare various aspects of speaking profi- ciency across multiple responses provided by a test taker, based on the hypothesis that certain as- pects of speaking proficiency, such as fluency, may be artificially inflated in a test takerā€™s canned re- sponses in comparison to non-canned responses. 
the availability of the reference source materials. Finally, we evaluate the effectiveness of this system on a data set with a large number of control (non-plagiarized) responses in an attempt to simulate the imbalanced distribution from an operational setting in which only a small number of the test takers' responses are plagiarized. In addition, we further validate this system on operational data and show how it can practically assist both human and automated scoring in a large-scale assessment of English speaking proficiency.

2 Previous Work

Previous research related to automated plagiarism detection for natural language has mainly focused on written documents.
For example, a series of shared tasks has enabled a variety of NLP approaches for detecting plagiarism to be compared on a standardized set of texts (Potthast et al., 2013), and several plagiarism detection services are available online.1 Various techniques have been employed in this task, including n-gram overlap (Lyon et al., 2006), document fingerprinting (Brin et al., 1995), word frequency statistics (Shivakumar and Garcia-Molina, 1995), information retrieval-based metrics (Hoad and Zobel, 2003), text summarization evaluation metrics (Chen et al., 2010), WordNet-based features (Nahnsen et al., 2005), stopword-based features (Stamatatos, 2011), features based on shared syntactic patterns (Uzuner et al., 2005), features based on word swaps detected via dependency parsing (Mozgovoy et al., 2007), and stylometric features (Stein et al., 2011), among others. In general, for the task of monolingual plagiarism detection, these methods can be categorized as either external plagiarism detection, in which a document is compared to a body of reference documents, or intrinsic plagiarism detection, in which a document is evaluated independently without a reference collection (Alzahrani et al., 2012). This task is also related to the widely studied task of paraphrase recognition, which benefits from similar types of features (Finch et al., 2005; Madnani et al., 2012). The current study adopts several of these features that are designed to be robust to the presence of word-level modifications between the source and the plagiarized text; since this study focuses on spoken responses that are reproduced from memory and subsequently processed by a speech recognizer, metrics that rely on exact matches are likely to perform sub-optimally.

1 For example, http://turnitin.com/en_us/what-we-offer/feedback-studio, http://www.grammarly.com/plagiarism, and http://www.paperrater.com/plagiarism_checker.

Little prior work has been conducted on the task of automatically detecting similar spoken responses, although research in the field of Spoken Document Retrieval (Hauptmann, 2006) is relevant. Due to the difficulties involved in collecting corpora of actual plagiarized material, nearly all published results of approaches to the task of plagiarism detection have relied on either simulated plagiarism (i.e., plagiarized texts generated by experimental human participants in a controlled environment) or artificial plagiarism (i.e., plagiarized texts generated by algorithmically modifying a source text) (Potthast et al., 2010). These results, however, may not reflect actual performance in a deployed setting, since the characteristics of the plagiarized material may differ from actual plagiarized responses.

In previous studies (Evanini and Wang, 2014; Wang et al., 2016), we conducted experiments on a simulated data set from an operational, large-scale, standardized English proficiency assessment and obtained initial results with an F1-measure of 70.6% using an automatic system to detect plagiarized spoken responses (Wang et al., 2016).
Based on these previous findings, we extend this line of research and contribute in the following ways: 1) an improved automatic speech recognition (ASR) system based on Kaldi was introduced, and an unsupervised language model adaptation method was employed to improve the ASR performance on spontaneous speech elicited by new, unseen test questions; 2) an improved set of text-to-text content similarity features based on n-gram overlap and Word Mover's Distance was investigated; 3) in addition to evaluating the system on a simulated imbalanced data set, we also validated the developed automatic system using all of the responses from a single operational administration of the English speaking assessment in order to obtain a reliable estimate of the system's performance in a practical deployment.

3 Data

The data used in this study was drawn from a large-scale, high-stakes assessment of English for non-native speakers, which assesses English communication skills for academic purposes. The Speaking section of this assessment contains six tasks designed to elicit spontaneous spoken responses: two of them require test takers to provide an opinion or preference based on personal experience, which are referred to as independent tasks; the other four tasks require test takers to summarize or discuss material provided in a reading and/or listening passage, which are referred to as integrated tasks.

In general, the independent tasks ask questions on topics that are familiar to test takers and are not based on any stimulus materials. Therefore, test takers can provide responses containing a wide variety of specific examples.
In some cases, test takers may attempt to game the assessment by memorizing canned material from an external source and adapting it to the questions presented in the independent tasks. This type of plagiarism can affect the validity of a test taker's speaking score and can be grounds for score cancellation. However, it is often difficult even for trained human raters to recognize plagiarized spoken responses, due to the large number and variety of external sources that are available to test takers.

In order to identify plagiarized spoken responses from the operational test, human raters first flag spoken responses that contain potentially plagiarized material, and trained experts subsequently review them to make the final decision. In the review process, the responses were transcribed and compared to external source materials obtained through manual internet searches; if it was determined that the presence of plagiarized material made it impossible to provide a valid assessment of the test taker's performance on the speaking task, the response was labeled as a plagiarized response and assigned a score of 0. In this study, a total of 1,557 plagiarized responses to independent test questions were collected from the operational assessment across multiple years.

During the process of reviewing potentially plagiarized responses, the raters also collected a data set of external sources that appeared to have been used by test takers in their responses. In some cases, the test taker's spoken response was nearly identical to an identified source; in other cases, several sentences or phrases were clearly drawn from a particular source, although some modifications were apparent. Table 1 presents a sample source that was identified for several of the responses in the data set, along with a sample plagiarized response that apparently contains extended sequences of words directly matching idiosyncratic features of this source, such as the phrases "how romantic it can ever be" and "just relax yourself on the beach." In general, test takers typically do not reproduce the entire source material in their responses; rather, they often attempt to adapt the source material to a specific test question by providing some speech that is directly relevant to the prompt and combining it with the plagiarized material. An example of this is shown by the opening and closing non-italicized portions of the sample plagiarized response in Table 1. In total, human raters identified 224 different source materials while reviewing the potentially plagiarized responses; their statistics are as follows: the average number of words is 95.7 (std. dev. = 38.5), the average number of clauses is 10.3 (std. dev. = 5.1), and the average number of words per clause is 9.3 (std. dev. = 7.1).

In addition to the source materials and the plagiarized responses, a set of non-plagiarized control responses was also obtained in order to conduct classification experiments between plagiarized and non-plagiarized responses. Since the plagiarized responses were collected over the course of multiple years, they were drawn from many different test forms, and it was not practical to obtain control data from all of the test forms that were represented in the plagiarized set. Therefore, only the 166 test forms that appear most frequently in the canned data set were used for the collection of control responses, and 200 test takers were randomly selected from each form, without any overlap with speakers in the plagiarized set.
The two spoken responses to the two independent questions were collected from each speaker; in total, 66,400 spoken responses from 33,200 speakers were obtained as the control set. Therefore, the data set used in this study is quite imbalanced: the number of control responses is larger than the number of plagiarized responses by a factor of 43.

4 Methodology

This study employed two different types of features for the automatic detection of plagiarized spoken responses: 1) similar to the way human raters identify canned spoken responses, a set of features was developed to measure the content similarity between a test response and the source materials that were collected; 2) to address the particular challenges of plagiarism detection for spontaneous spoken responses, a set of features was introduced based on the assumption that the production of spoken language based on memorized material is expected to differ from the production of non-plagiarized speech in aspects of a test taker's speech delivery, such as fluency, pronunciation, and prosody.

4.1 Content Similarity

Previous work (Wang et al., 2016; Evanini and Wang, 2014) has demonstrated the effectiveness of using content-based features for the task of automatic plagiarized spoken response detection. Therefore, this study investigates the use of improved features based on the measurement of text-to-text content similarity.
Table 1: A sample source passage and the transcription of a sample plagiarized spoken response that was apparently drawn from the source. The test question/prompt used to elicit this response is also included. The overlapping word sequences between the source material and the transcription of the spoken response are indicated in italics.

Sample source passage: Well, the place I enjoy the most is a small town located in France. I like this small town because it has very charming ocean view. I mean the sky there is so blue and the beach is always full of sunshine. You know how romantic it can ever be, just relax yourself on the beach, when the sun is setting down, when the ocean breeze is blowing and the seabirds are singing. Of course I like this small French town also because there are many great French restaurants. They offer the best seafood in the world like lobsters and tuna fishes. The most important, I have been benefited a lot from this trip to France because I made friends with some gorgeous French girls. One of them even gave me a little watch as a souvenir of our friendship.

Prompt: Talk about an activity you enjoyed doing with your family when you were a kid.

Transcription of a plagiarized response: family is a little trip to France when I was in primary school ten years ago I enjoy this activity first because we visited a small French town located by the beach the town has very charming ocean view and in the sky is so blue and the beach is always full of sunshine you know how romantic it can ever be just relax yourself on the beach when the sun is settling down the sea birds are singing of course I enjoy this activity with my family also because there are many great French restaurants they offer the best sea food in the world like lobsters and tuna fishes so I enjoy this activity with my family very much even it has passed several years

Given a test response, a comparison is made with each of the 224 reference sources using the following two content similarity metrics: word n-gram overlap and Word Mover's Distance. Then, the maximum similarity or the minimum distance is taken as a single feature to measure the content relevance between the test response and the source materials.

4.1.1 N-gram Overlap

Features based on the BLEU metric have been shown to be effective in measuring the content appropriateness of spoken responses in the context of English proficiency assessment (Zechner and Wang, 2013) and in measuring content similarity in the detection of plagiarized spoken responses (Wang et al., 2016; Evanini and Wang, 2014). In this study, we first design a new type of feature, n-gram overlap, by adapting and improving the previous BLEU-based features. Word n-grams, with n varying from 1 to 11, are first extracted from both the test response and each of the source materials, and then the number of overlapping n-grams is counted, where n-gram types and tokens are counted separately. The intuition behind decreasing the maximum order is to increase the classifier's recall by evaluating the overlap of shorter word sequences, such as individual words in the unigram setting. On the other hand, the motivation behind increasing the maximum order is to boost the classifier's precision, since it will focus on matches of longer word sequences.
Here, the maximum order of 11 was experimentally decided in consideration of the average number of words per clause in the source materials, which is 9.3, as described in Section 3.

In order to calculate the maximum similarity across all source materials, the 11 n-gram overlap counts are combined into one weighted score between a test response and each source, as in Equation 1, and then the maximum score across all sources is taken as a feature measuring the similarity between the test response and the set of source materials. In addition, the 11 n-gram overlap counts calculated using the source with the maximum similarity score are also taken as features.

score = \sum_{i=1}^{n} \frac{i}{n(n+1)/2} \, \text{count\_overlap}(i\text{-gram})    (1)

Furthermore, the n-gram based feature set is enlarged by: 1) normalizing the n-gram counts by either the number of n-grams in the test response or the number of n-grams in each of the sources; 2) combining all source materials together into a single document for comparison (11 features based on n-gram overlap with the combined source), which is designed based on the assumption that test takers may attempt to use content from multiple sources; this type of feature can similarly be further normalized by the number of n-grams in the test response. Based on all of these variations, a total of 116 n-gram overlap features is generated for each spoken response. A minimal sketch of the weighted overlap score in Equation 1 is given below.
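The following sketch illustrates how the weighted n-gram overlap score of Equation 1 and the maximum-similarity feature could be computed; the function and variable names are illustrative assumptions rather than the authors' implementation, and the sketch counts overlapping n-gram tokens only (the full feature set also counts types and applies the normalization variants described above).

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Multiset of word n-grams of order n.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_count(response_tokens, source_tokens, n):
    # Number of overlapping n-gram tokens between a response and a source.
    resp, src = ngram_counts(response_tokens, n), ngram_counts(source_tokens, n)
    return sum(min(resp[g], src[g]) for g in resp if g in src)

def weighted_overlap_score(response_tokens, source_tokens, max_order=11):
    # Equation 1: higher-order matches receive proportionally larger weights.
    norm = max_order * (max_order + 1) / 2
    return sum((i / norm) * overlap_count(response_tokens, source_tokens, i)
               for i in range(1, max_order + 1))

def max_similarity_feature(response_tokens, all_source_tokens):
    # Content similarity feature: maximum weighted score over all sources.
    return max(weighted_overlap_score(response_tokens, s)
               for s in all_source_tokens)
```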
4.1.2 Word Mover's Distance

More recently, various approaches based on deep neural networks (DNNs) and word embeddings trained on large corpora have shown promising performance in document similarity detection (Kusner et al., 2015). In contrast to traditional similarity features such as the n-gram overlap features above, which are limited by a reliance on exact word matching, these new approaches have the advantage of capturing topically relevant words that are not identical. In this study, we employ Word Mover's Distance (WMD) (Kusner et al., 2015) to measure the distance between a test response and a source material based on word embeddings.

Words are first represented as embedding vectors, and the distance between each word appearing in a test response and each word in a source can then be measured using the Euclidean distance in the embedding space. WMD is the minimum cumulative distance required to move the words of one document onto the words of the other document in this space. This minimization problem is a special case of Earth Mover's Distance (Rubner et al., 1998), for which efficient algorithms are available. Kusner et al. (2015) reported that WMD outperformed other distance measures on document retrieval tasks and that embeddings trained on the Google News corpus consistently performed well across a variety of contexts. For this work, we used the same word embeddings that were used for the weighted embedding features as the input for the WMD calculation. A minimal sketch of a WMD-based distance feature is shown below.
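As a concrete illustration, a WMD-based distance feature could be computed with gensim's wmdistance method over pre-trained embeddings; this is a minimal sketch under assumed file paths and tokenization, not the authors' pipeline, and gensim's WMD support additionally requires its optional earth-mover's-distance dependency.

```python
from gensim.models import KeyedVectors

# Pre-trained Google News embeddings (path is an illustrative assumption).
embeddings = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def min_wmd_feature(response_tokens, all_source_tokens):
    # Minimum Word Mover's Distance between the ASR transcription of a test
    # response and any of the collected source materials.
    return min(embeddings.wmdistance(response_tokens, src)
               for src in all_source_tokens)
```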
4.2 Difference in Speaking Proficiency

The performance of the above content similarity features greatly depends on the availability of a comprehensive set of source materials. If a test taker uses unseen source materials as the basis for a plagiarized response, the system may fail to detect it. Accordingly, a set of features that does not rely on a comparison with source materials has been proposed previously (Wang et al., 2016); the current study also examined this type of feature.

As described in Section 3, the Speaking section of the assessment includes both independent and integrated tasks. In a given test administration, test takers are required to respond to all six questions; however, plagiarized responses are more likely to appear in the two independent tasks, since they are not based on specific reading and/or listening passages and thus elicit a wider range of variation across responses. Since plagiarized responses are mostly constructed from memorized material, they may be delivered in a more fluent and proficient manner than responses containing fully spontaneous speech. Based on this assumption, a set of features can be developed to capture the difference between various speaking proficiency measures extracted from the canned and the fully spontaneous speech produced by the same test taker, where an automated spoken English assessment system, SpeechRater (Zechner et al., 2007, 2009), is used to provide the automatic proficiency scores along with 29 features measuring fluency, pronunciation, prosody, rhythm, vocabulary, and grammar.

Since most plagiarized responses are expected to occur in the independent tasks, we assume the integrated responses are based on spontaneous speech. A mismatch between the proficiency scores and the feature values from the independent responses and the integrated responses from the same speaker can potentially indicate the presence of both prepared speech and spontaneous speech, and, therefore, the presence of plagiarized spoken responses.

Given an independent response from a test taker, along with the other independent response and the four integrated responses from the same test taker, 6 features can be extracted for each of the proficiency scores and 29 SpeechRater features. First, the difference in score/feature values between the two independent responses was calculated as a feature, which was intended to handle the case in which only one independent response was canned and the other contained spontaneous speech. Then, basic descriptive statistics, including the mean, median, minimum, and maximum, were obtained across the four integrated responses, and the differences between the score/feature value of the independent response and these four statistics were extracted as additional features. Finally, another feature was extracted by standardizing the score/feature value of the independent response with the mean and standard deviation from the integrated responses. In total, a set of 180 features was extracted, referred to as SpeechRater features in the following experiments. A minimal sketch of these difference features is given below.
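The sketch below shows how the six difference features for a single score or SpeechRater feature could be assembled; the function and feature names are illustrative assumptions, not the operational implementation.

```python
import statistics

def proficiency_difference_features(ind_value, other_ind_value, integrated_values):
    # ind_value: score/feature value of the independent response being checked;
    # other_ind_value: value of the speaker's other independent response;
    # integrated_values: values of the speaker's four integrated responses.
    mean = statistics.mean(integrated_values)
    std = statistics.pstdev(integrated_values)
    return {
        "diff_other_independent": ind_value - other_ind_value,
        "diff_integrated_mean": ind_value - mean,
        "diff_integrated_median": ind_value - statistics.median(integrated_values),
        "diff_integrated_min": ind_value - min(integrated_values),
        "diff_integrated_max": ind_value - max(integrated_values),
        # Standardize the independent value against the integrated responses.
        "zscore_vs_integrated": (ind_value - mean) / std if std > 0 else 0.0,
    }
```

Applying this to the overall proficiency score and to each of the 29 SpeechRater features yields the 180 features described above.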
5 Experiments and Results

5.1 ASR Improvement

In this study, spoken responses need to be transcribed into text so that they can be compared with the source materials to measure the text-to-text similarity. However, due to the large number of spoken responses considered in this study, it is not practical to manually transcribe all of them; therefore, a Kaldi-based automatic speech recognition engine was employed.2 The training set used to develop the speech recognizer consists of similar responses (around 800 hours of speech) drawn from the same assessment and does not overlap with the data sets included in this study.

2 http://kaldi-asr.org/

When using an ASR system to recognize spoken responses from new prompts that are not seen in the ASR training data, a degradation in recognition accuracy is expected because of the mismatch between the training and test data. In this study, we used an unsupervised language model (LM) adaptation method to improve the ASR performance on unseen data. In this method, two rounds of language model adaptation were conducted with the following steps: first, out-of-vocabulary (OOV) words from the prompt materials were added to the pronunciation dictionary and the baseline models were adapted with the prompts; second, the adapted models were applied to spoken responses from these new prompts to produce the recognized texts along with confidence scores corresponding to each response; third, automatically transcribed texts with confidence scores higher than a predefined threshold of 0.8 were selected; finally, these high-confidence recognized texts were used to conduct another round of language model adaptation. The confidence-based selection step is sketched below.
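The following is a minimal, tool-agnostic sketch of the confidence-based selection used for the second adaptation round; it assumes the first-pass recognizer returns (text, confidence) pairs and does not reflect any specific Kaldi recipe.

```python
def select_adaptation_texts(first_pass_output, threshold=0.8):
    # first_pass_output: iterable of (recognized_text, confidence) pairs from
    # the prompt-adapted recognizer. Only high-confidence hypotheses are kept
    # as the text corpus for the second round of language model adaptation.
    return [text for text, confidence in first_pass_output
            if confidence >= threshold]
```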
We evaluated this unsupervised language model adaptation method on a stand-alone data set with 1,500 responses from 250 test speakers, where the prompts used to elicit these responses were unseen in the baseline recognition models. In this experiment, supervised language model adaptation with human transcriptions was also examined for comparison.

Table 2: Word error rate (WER %) reduction with an unsupervised language model adaptation method, where the WERs on independent items (IND), integrated items (INT), as well as all items (ALL) are reported. The WER with the supervised adaptation method based on human transcriptions is also listed for comparison.

ASR Systems    IND   INT   ALL
Baseline       22.5  26.6  25.5
Unsupervised   21.5  23.5  22.9
Supervised     21.2  22.1  21.8

As shown in Table 2, the overall word error rate (WER) is 25.5% for the baseline models. By applying the unsupervised LM adaptation method, the overall WER can be reduced to 22.9%; in comparison, the overall WER can be reduced to 21.8% by using supervised LM adaptation. The unsupervised method achieves results very similar to the supervised method, especially for responses to the independent prompts, i.e., a WER of 21.5% (unsupervised) vs. 21.2% (supervised). These results indicate the effectiveness of the proposed unsupervised adaptation method, which was employed in the subsequent automatic plagiarism detection work.

5.2 Experimental Setup

Due to ASR failures, a small number of responses were excluded from the experiments; in the end, a total of 1,551 canned and 66,257 control responses were included in the simulated data. Since this work was conducted on a very imbalanced data set and only 2.3% of the responses in the simulated data are authentic plagiarized ones confirmed by human raters, 10-fold cross-validation was performed first on the simulated data. Afterward, the classification model built on the simulated data set was further evaluated on a corpus of real operational data.

We employed the machine learning toolkit scikit-learn3 (Pedregosa et al., 2011) for training the classifiers. It provides various classification methods, such as decision trees, random forests, AdaBoost, etc. This study involves a variety of features from two different categories, and preliminary experiments demonstrated that the random forest model achieves the best overall performance. Therefore, the random forest method is used to build classification models in the following experiments, and the precision, recall, and F1-score on the positive class (plagiarized responses) are used as the evaluation metrics. A minimal sketch of this setup is given below.

3 SKLL, a Python tool that simplifies running scikit-learn experiments, was used. Downloaded from https://github.com/EducationalTestingService/skll.
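The evaluation setup could be reproduced along the following lines; this is a hedged sketch with placeholder data, not the actual SKLL configuration, and the hyperparameters shown are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

# Placeholders: X holds the content-similarity and SpeechRater-difference
# features; y is 1 for plagiarized responses and 0 for control responses.
X = np.random.rand(1000, 20)
y = (np.random.rand(1000) < 0.023).astype(int)  # ~2.3% positives, as in the data

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
predictions = cross_val_predict(clf, X, y, cv=cv)

# Precision, recall, and F1 reported on the positive (plagiarized) class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y, predictions, average="binary", pos_label=1)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```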
5.3 Experimental Results

First, in order to verify the effectiveness of the newly developed n-gram overlap features, a preliminary experiment was conducted to compare this set of features with the BLEU-based features, since the latter had been shown to be effective in previous research (Wang et al., 2016). The results in Table 3 indicate that the F1-score of the n-gram features outperforms that of the BLEU features (0.761 vs. 0.748), and the recall of the n-gram features is higher than that of the BLEU features (0.716 vs. 0.683); conversely, the BLEU features result in higher precision (0.83 vs. 0.814). Accordingly, the n-gram features are used to replace the BLEU features, since it is more important for our task to reduce the number of false negatives, i.e., to improve the recall.

Table 3: Comparison of n-gram and BLEU features.

Features   Precision  Recall  F1
BLEU       0.83       0.683   0.748
n-gram     0.814      0.716   0.761

Furthermore, each individual type of feature and their combinations were examined in the classification experiments described above. As shown in Table 4, each feature set alone was first used to build classification models. The n-gram overlap features result in the best performance with an F1-score of 0.761, while the WMD features capturing the topical relevance between word pairs result in a much lower F1-score of 0.649. The combination of both types of content similarity features, i.e., n-gram and WMD, slightly reduces the F1-score to 0.76. These results indicate that for this particular task, the exact match of certain expressions appearing in both the test response and a source material plays a critical role.

Table 4: Classification performance using each individual feature set and their combinations.

Features        Precision  Recall  F1
1. n-gram       0.814      0.716   0.761
2. WMD          0.663      0.636   0.649
3. SpeechRater  0.8        0.009   0.018
1 + 2           0.812      0.716   0.76
1 + 2 + 3       0.821      0.696   0.752

As for the speaking proficiency related features, they lead to a promising precision of 0.8 but a very low recall of 0.009. After reexamining the assumption that human experts may be able to differentiate prepared speech from fully spontaneous speech based on the way the speech is delivered, it turns out that it is quite challenging for human experts to make a reliable judgment of plagiarism based on speech delivery alone, without any reference to the source materials, particularly within the context of high-stakes language assessment. Accordingly, the features capturing the difference in speaking proficiency between prepared and spontaneous speech can be used as contributory information to improve the accuracy of an automatic detection system, but they are unable to achieve promising performance alone. Also as shown in Table 4, adding the speaking proficiency features improves the precision to 0.821, but the recall is reduced; as a result, the F1-score is reduced to 0.752.
6 Employment in an Operational Test

6.1 Operational Use

In order to obtain a more accurate estimate of how well the automatic plagiarism detection system might perform in a practical application, in which the distribution is expected to be heavily skewed towards the non-plagiarized category, all test-taker responses to independent prompts were collected from a single administration of the speaking assessment. In total, 13,516 independent responses from 6,758 speakers were extracted for system evaluation. Through the human review process, 39 responses were confirmed as plagiarized, which represents 0.29% of the data set.

In this particular task, automatic detection systems can be applied to support human raters: all potentially plagiarized responses are first flagged automatically, and human experts then review the flagged responses and make the final decision about potential instances of plagiarism. In this scenario, it is more important to increase the number of true positives flagged by the automated system; thus, recall of plagiarized responses was used as the evaluation metric, i.e., how many responses can be successfully detected among the 39 confirmed instances of plagiarism.

6.2 Results and Discussion

In order to maximize the recall of plagiarized responses for this review, several models were built with either different types of features or different types of classification models, for example, random forest and AdaBoost with a decision tree as the weak classifier, and they were then combined in an ensemble manner to flag potentially plagiarized responses, i.e., a response is flagged if any of the models detects it as a plagiarized response (see the sketch below). This ensemble system flagged 850 responses as instances of plagiarism in total and achieved a recall of 0.897, i.e., 35 of the confirmed plagiarized responses were successfully identified by the automatic system and 4 of them were missed.
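The ensemble flagging rule referenced above can be summarized in a few lines; this is an illustrative sketch assuming fitted scikit-learn-style classifiers, not the production code.

```python
import numpy as np

def flag_for_review(models, X):
    # models: fitted classifiers (e.g., random forest, AdaBoost) exposing
    # .predict(); X: feature matrix for the responses to be screened.
    # A response is flagged if ANY model predicts the plagiarized class,
    # which trades precision for higher recall.
    votes = np.column_stack([m.predict(X) for m in models])
    return votes.any(axis=1)
```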
These results indicate that the developed system can provide a substantial benefit to both human and automated assessment of non-native spoken English. Manual identification of plagiarized responses can be a very difficult and time-consuming task, in which human experts need to memorize hundreds of source materials and compare them to thousands of responses. By applying an automated system, potentially plagiarized responses can first be filtered out of the standard scoring pipeline; subsequently, experts can review these flagged responses to confirm whether they actually contain plagiarized material. Accordingly, instead of reviewing all 13,516 responses in this administration for plagiarized content, human effort is required only for the 850 flagged responses, thus substantially reducing the overall review workload. Optimizing recall is therefore appropriate in this targeted use case, since the number of false positives is within an acceptable range for the expert review. In addition, the source labels indicating which source materials were likely used in the preparation of each response are automatically generated by the system for each suspected response; this information can help to accelerate the manual review process.

7 Conclusion and Future Work

This study proposed a system which can benefit a high-stakes assessment of English speaking proficiency by automatically detecting potentially plagiarized spoken responses, and investigated the empirical effectiveness of two different types of features. One is based on automatic plagiarism detection methods commonly applied to written texts, in which the content similarity between a test response and a set of source materials collected by human raters is measured. In addition, this study also adopted a set of features which do not rely on the human effort involved in source material collection and can be easily applied to unseen test questions. This type of feature attempts to capture the difference in speech patterns between prepared responses and fully spontaneous responses from the same speaker in a test. Finally, the classification models were evaluated on a large set of responses collected from an operational test, and the experimental results demonstrate that the automatic detection system can achieve an F1-measure of 0.761. Further evaluation on real operational data also shows the effectiveness of the automatic detection system.

The task of applying an automatic system in a large-scale operational assessment is quite challenging, since typically only a small number of plagiarized responses are distributed among a much larger number of non-plagiarized responses to a wide range of different test questions.
In the future, we will continue our research efforts to improve the automatic detection system along the following lines. First, since deep learning techniques have recently shown their effectiveness in the fields of both speech processing and natural language understanding, we will further explore various deep learning techniques to improve the metrics used to measure the content similarity between test responses and source materials. Next, further analysis will be conducted to determine the extent of differences between canned and spontaneous speech, and additional features will be explored based on the findings. In addition, another focus is to build automatic tools to regularly update the pool of source materials. Besides internet search, new sources can also be detected by comparing all candidate responses to the same test question, since different test takers may use the same source to produce their canned responses.

References

S. M. Alzahrani, N. Salim, and A. Abraham. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, 42(2):133–149.

S. Brin, J. Davis, and H. Garcia-Molina. 1995. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Annual Conference, pages 398–409.
C. Chen, J. Yeh, and H. Ke. 2010. Plagiarism detection using ROUGE and WordNet. Journal of Computing, 2(3):34–44.

P. Cullen, A. French, and V. Jakeman. 2014. The Official Cambridge Guide to IELTS. Cambridge University Press.

ETS. 2012. The Official Guide to the TOEFL Test, Fourth Edition. McGraw-Hill.

Keelan Evanini and Xinhao Wang. 2014. Automatic detection of plagiarized spoken responses. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–27.

A. Finch, Y. Hwang, and E. Sumita. 2005. Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In Proceedings of the Third International Workshop on Paraphrasing, pages 17–24.

A. Hauptmann. 2006. Automatic spoken document retrieval. In Keith Brown, editor, Encyclopedia of Language and Linguistics (Second Edition), pages 95–103. Elsevier Science.

T. C. Hoad and J. Zobel. 2003. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54:203–215.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 957–966, Lille, France.

P. Longman. 2010. The Official Guide to Pearson Test of English Academic. Pearson Education ESL.

C. Lyon, R. Barrett, and J. Malcolm. 2006. Plagiarism is easy, but also easy to detect. Plagiary, 1:57–65.

N. Madnani, J. Tetreault, and M. Chodorow. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 182–190, Montréal, Canada. Association for Computational Linguistics.

M. Mozgovoy, T. Kakkonen, and E. Sutinen. 2007. Using natural language parsers in plagiarism detection. In Proceedings of the ISCA Workshop on Speech and Language Technology in Education (SLaTE).

T. Nahnsen, Ö. Uzuner, and B. Katz. 2005. Lexical chains and sliding locality windows in content-based text similarity detection. CSAIL Technical Report, MIT-CSAIL-TR-2005-034.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

M. Potthast, M. Hagen, T. Gollub, M. Tippmann, J. Kiesel, P. Rosso, E. Stamatatos, and B. Stein. 2013. Overview of the 5th International Competition on Plagiarism Detection. In CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers.

M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics.

Y. Rubner, C. Tomasi, and L. J. Guibas. 1998. A metric for distributions with applications to image databases. In Proceedings of the Sixth International Conference on Computer Vision.

N. Shivakumar and H. Garcia-Molina. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries.
E. Stamatatos. 2011. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.

B. Stein, N. Lipka, and P. Prettenhofer. 2011. Intrinsic plagiarism analysis. Language Resources and Evaluation, 45(1):63–82.

Ö. Uzuner, B. Katz, and T. Nahnsen. 2005. Using syntactic information to identify plagiarism. In Proceedings of the 2nd Workshop on Building Educational Applications Using NLP, Ann Arbor.

X. Wang, K. Evanini, J. Bruno, and M. Mulholland. 2016. Automatic plagiarism detection for spoken responses in an assessment of English language proficiency. In Proceedings of the IEEE Spoken Language Technology Workshop, pages 121–128.

K. Zechner, D. Higgins, and X. Xi. 2007. SpeechRater: A construct-driven approach to scoring spontaneous non-native speech. In Proceedings of the International Speech Communication Association Special Interest Group on Speech and Language Technology in Education, pages 128–131.

K. Zechner, D. Higgins, X. Xi, and D. M. Williamson. 2009. Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51(10):883–895.

K. Zechner and X. Wang. 2013. Automated content scoring of spoken responses in an assessment for teachers of English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 73–81.