A Review of Plagiarism Detection Based on
Lexical and Semantic Approach
Shameem Yousuf
Student
Department of Information Technology,
Central University of Kashmir
Srinagar, Jammu and Kashmir, India
Email: bhat.shameem@gmail.com
Muzamil Ahmad
Student
Department of Information Technology,
Central University of Kashmir
Srinagar, Jammu and Kashmir, India
Email: muzamilahmad87@gmail.com
Sheikh Nasrullah
Assistant Professor
Department of Information Technology,
Central University of Kashmir
Srinagar, Jammu and Kashmir, India
Email: nasrullah@cukashmir.ac.in
Abstract—Due to the easy availability of documents over the web,
plagiarism has become a serious problem for teachers, researchers
and publishers. In this paper, we discuss the plagiarism process,
its types and detection methodologies. Further, we classify the
different plagiarism detection techniques based on the lexical and
semantic approaches. Finally, we present a brief study of the
different tools (offline and online) available for detecting
plagiarism.
I. INTRODUCTION
Plagiarism is the act of claiming authorship over someone else's
work, wholly or in part, without proper authorization. With the
evolution of the modern web, information is available to everybody,
just a click away; but with this information boom arise certain
problems as well: many people have started copying and pasting their
work from already existing digital documents on the web, and this act
of cloning others' work without giving proper credit to the author
can be regarded as plagiarism. Formally, it can be defined as an act
or instance of using or closely imitating the language and thoughts
of another author without authorization and the representation of
that author's work as one's own, as by not crediting the original
author¹.
The areas where plagiarism can be found include literature, music,
software, scientific articles, research papers, newspapers,
advertisements, websites, etc. A study carried out in the United
States shows that among 18,000 university students almost 40% have
plagiarized at least once².
Plagiarism is a growing challenge in modern society, and in order to
maintain academic integrity the use of plagiarism detection tools has
become the norm in many higher-education institutions. However, the
effectiveness of detection depends on the type of algorithm and on
the obfuscation strategy employed by the plagiarist to create the
plagiarised text.
A. Plagiarising Process
Plagiarising is essentially the task of reusing existing content in a
way that meets the requirements of the given task.
¹http://dictionary.reference.com/browse/plagiarism
²D. McCabe, Research Report of the Center for Academic Integrity,
http://www.academicintegrity.org, 2005.
The task includes searching for related content on the web, in text
documents or in any other resources, and then copying and pasting
this content into the newly created document, using the different
obfuscation strategies available to evade plagiarism detection. The
result is a new plagiarised version of the existing document, which
can be used wherever required to fulfil the task. The whole
plagiarism process is illustrated in Figure 1.
Fig. 1. The basic steps of Plagiarizing.
B. Types of Plagiarism
Plagiarism is vast and dynamic; that is, there exist a number of
obfuscation strategies that help create plagiarised text.
Plagiarism.org has classified and ranked the different types of
plagiarism based on the severity of the intent.
According to plagiarism.org³, the ten most common types of
plagiarism, illustrated in Figure 2, are:
1) Clone: When a plagiarist submits another person's work, word for
word, as his or her own.
³http://www.plagiarism.org
2) Ctrl-C: When the plagiarised text contains significant portions
of the original text without any alterations.
3) Find-Replace: Changing the keywords in the text while retaining
the essential content of the source.
4) Remix: When a plagiarist paraphrases material from multiple
sources and combines it into a single document.
5) Recycle: When an author reuses his or her own previous work,
without proper citation, to form a new document. This type is
sometimes called self-plagiarism.
6) Hybrid: When a plagiarist copies passages from multiple cited
sources without proper citation.
7) Mashup: A mix of copied content from several different sources.
8) 404 Error: When a plagiarist includes citations to non-existent
sources or inaccurate information about sources.
9) Aggregator: The paper includes proper citation of its sources but
contains almost no original work.
10) Re-Tweet: The author includes proper citations but relies too
closely on the original text's wording or structure.
Section II discusses the plagiarism detection methodologies, Section
III reviews the detection techniques, and Section IV provides a brief
study of various plagiarism detection tools.
Fig. 2. Types of Plagiarism.
II. PLAGIARISM DETECTION
Plagiarising means reusing someone else's work without proper
citation and pretending it to be one's own. Text plagiarism is one of
the oldest forms of plagiarism and remains difficult to identify in
practice to this day. The challenges in automatic plagiarism
detection have been widely discussed in [1].
In order to create an automatic plagiarism detection system we need
an existing corpus and detection techniques. The general detection
process is shown in Figure 3. It is divided into three stages:
• Pre Processing
• Intermediate Processing
• Post Processing
Pre-processing includes uploading the source document and retrieving
suspicious documents from the corpus based on it. Once the specific
data is acquired, it is sent for intermediate processing. The main
design issue at this stage is how accurately documents are retrieved
from the corpus.
The intermediate processing stage includes the detailed detection and
comparison of the source and the suspicious documents, based on the
algorithm in use. The design issues at this stage are the running
time and the effectiveness of the comparison logic.
Post-processing, the final stage of the process, prepares the results
of the detection task; based on those results we decide whether the
source document is plagiarised or not.
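The three stages above can be sketched as a minimal pipeline. This is an illustrative sketch only: the shared-word retrieval heuristic, the Jaccard-style scoring and the 0.5 decision threshold are assumptions chosen for the example, not details taken from the paper.

```python
def preprocess(corpus, source):
    # Pre-processing: retrieve candidate documents that share at least
    # one word with the source (a toy retrieval heuristic).
    src_words = set(source.lower().split())
    return [d for d in corpus if src_words & set(d.lower().split())]

def compare(source, candidates):
    # Intermediate processing: score each candidate by word overlap
    # (Jaccard index over word sets).
    src_words = set(source.lower().split())
    scores = {}
    for d in candidates:
        cand = set(d.lower().split())
        scores[d] = len(src_words & cand) / len(src_words | cand)
    return scores

def postprocess(scores, threshold=0.5):
    # Post-processing: turn scores into a plagiarised / not-plagiarised
    # decision per candidate document.
    return {d: s >= threshold for d, s in scores.items()}

corpus = ["the cat sat on the mat", "stock prices fell sharply"]
source = "the cat sat on a mat"
verdict = postprocess(compare(source, preprocess(corpus, source)))
```

The unrelated document is filtered out during pre-processing, and the near-copy is flagged in post-processing.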
Fig. 3. Generic retrieval process.
In order to detect plagiarism in program source code and in free text
(plain natural-language text), the following methodologies can be
used:
• Manual detection: Detection carried out by humans, who
compare and verify the given set of documents. This requires
considerable expertise, as one has to look for the various
plagiarism strategies. It is suitable for checking class work,
articles and short notes, but is impractical for verifying
large numbers of documents and infeasible in terms of cost
and time.
• Computer-aided detection: Detecting plagiarism with the
help of a computer system equipped with a plagiarism
detection algorithm. Since automatic detection emerged,
different approaches have been followed to get the job done
efficiently. This approach has both advantages and
disadvantages compared with manual detection, and there is
always a trade-off between performance and the associated
cost.
III. PLAGIARISM DETECTION TECHNIQUES
In this section, we present a review of the various plagiarism
detection techniques that have been developed. The techniques are
classified into Source Code Plagiarism Detection and Free Text
Plagiarism Detection.
Fig. 4. Classification of Plagiarism Detection.
A. Source Code Plagiarism Detection
Techniques used to create plagiarised source code include comment
removal, identifier renaming, structured constant renaming, and
removal of debugging information.
The Source code plagiarism detection techniques include:
i) Textual-Based Approach: This approach is based on comparing
line or string sequences in the code and usually works with raw
source code. An example is the diff⁴ file comparison utility,
which finds the differences between two files by computing
their longest common subsequence.
ii) Token-Based Approach: The source code is parsed into a
sequence of tokens according to the rules of the programming
language. An example of such a detector is the Java-based tool
JPlag [2].
iii) Tree-Based Approach: The source code is parsed into a parse
tree or Abstract Syntax Tree (AST). The AST represents the
syntactic structure of the parsed source code, with an abstract
representation of every element; a tree-matching algorithm then
searches for similar subtrees in order to detect code clones.
Clone Digger [3] is an example of such a tool.
iv) PDG-Based / Semantics-Aware Approach: This approach analyses
the behaviour of the source code rather than its syntactic
features. A highly abstracted representation of the source
code, the Program Dependency Graph (PDG), is obtained; it
carries the semantic information of the code, capturing data
and control flow while ignoring the syntactic structure. A
subgraph-matching algorithm is then applied to discover similar
subgraphs, which are returned as clones. Scorpio [4] is an
example of such a tool.
⁴http://pubs.opengroup.org/onlinepubs/9699919799/utilities/diff.html
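The longest common subsequence underlying the textual approach (item i above) can be sketched with a small dynamic program over code lines. The code snippets below are invented for illustration; the suspect simply copies the original and inserts one extra line.

```python
def lcs_length(a, b):
    # dp[i+1][j+1] holds the LCS length of the first i+1 elements of a
    # and the first j+1 elements of b (classic dynamic programming).
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

# Invented snippets: every original line reappears, in order, in the suspect.
original = ["int x = 1;", "int y = 2;", "return x + y;"]
suspect = ["int x = 1;", "// comment", "int y = 2;", "return x + y;"]
shared = lcs_length(original, suspect)
```

A long shared subsequence relative to the file lengths is the signal a diff-style detector reports.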
B. Free Text Plagiarism Detection
i) Lexical Approach: This approach focuses on the lexical features
of the text, operating at the character or word level of the
document [5], in order to trace plagiarism in suspicious
documents. It tries to enhance standard string-matching
comparison. The pre-processing it relies upon typically
includes tokenization, lowercasing, punctuation removal and
stemming [6], though this varies from technique to technique.
The comparison units adopted also differ from one technique to
another; they include words, sentences, passages, human-defined
sliding windows, and n-grams. Work done using these techniques
includes [7], [8], [9], [10], [11], [12], [13]. With the
evolution of time and technology, researchers are breaking the
problem down into simpler forms to obtain efficient results,
for instance by merging different detection approaches, such as
using Natural Language Processing (NLP) to extract key features
of the text.
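A minimal sketch of lexical n-gram comparison, combining the pre-processing steps named above (lowercasing, punctuation removal, tokenization) with word trigrams and a Jaccard similarity. The example sentences and the choice of n = 3 are illustrative assumptions.

```python
import string

def ngrams(text, n=3):
    # Lexical pre-processing: lowercase, strip punctuation, tokenize,
    # then slide an n-word window over the token stream.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    # Jaccard similarity of the two documents' n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

src = "The quick brown fox jumps over the lazy dog."
sus = "The quick brown fox leaps over the lazy dog!"
score = jaccard(src, sus)
```

Changing a single word ("jumps" to "leaps") breaks three of the seven trigrams, which is exactly why lexical methods degrade under synonym substitution.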
ii) Semantic Approach: Plagiarists sometimes use sophisticated
obfuscation techniques to create plagiarised text, such as
replacing words or phrases with others of similar meaning,
thereby creating plagiarised copies of the original text. The
basic detection method using the semantic approach is
illustrated in Figure 5.
Fig. 5. Hypothetical semantic retrieval approach.
In this approach, semantic features (synonyms, hyponyms,
hypernyms, semantic dependencies) [5] are extracted from the
source documents and then used to trace plagiarism cases in the
corpus and in a fact database built from already existing
documents.
The semantic approach aims to attain high detection performance,
and should address the issues of polysemy (the same word
referring to different things depending on context, e.g. mouse
the computer input device and mouse the rodent) and synonymy
(different words referring to the same thing, e.g. car and
automobile), which are not handled by the lexical
(straightforward term-matching) approach.
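The synonymy problem can be illustrated with a toy normalization step. The synonym table below is invented and merely stands in for a lexical database such as WordNet; a real system would look synonym sets up programmatically.

```python
# Toy synonym table (an assumption for this sketch): each word maps to
# a canonical representative of its synonym set.
SYNONYMS = {
    "car": "automobile", "auto": "automobile",
    "buy": "purchase", "acquire": "purchase",
}

def normalize(text):
    # Replace each word by its canonical synonym before comparison, so
    # "car" and "automobile" match even though the strings differ.
    return [SYNONYMS.get(w, w) for w in text.lower().split()]

a = normalize("she decided to buy a car")
b = normalize("she decided to purchase an automobile")
overlap = len(set(a) & set(b))
```

After normalization the two sentences share five words; plain lexical matching would find only three.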
Lin et al. [14] have explored semantic similarity using lexical
databases such as Stanford WordNet⁵ to acquire synonyms. Other
algorithms that can be used to extract the semantic features of
sentences include Latent Dirichlet Allocation [15]. Another
novel way of representing a document is the RDF framework⁶. In
this approach a document is represented as RDF triples; an RDF
triple has the form (subject, predicate, object). For example,
in the triple (john, livesIn, ohio), john and ohio are entities
and livesIn (the predicate) is a relation between the two
entities. A predicate has a domain (restricting the set of
subjects) and a range (restricting the set of objects); livesIn,
for instance, has the domain humans and the range locations,
denoted livesIn(humans, locations), which is a binary relation.
A set of RDF facts is referred to as an ontology and can be
extracted from text documents.
The semantic approach is not widely used because of the level of
difficulty involved; however, work that has been done in this
area includes [16].
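The RDF idea above can be sketched by representing each document's extracted facts as (subject, predicate, object) tuples and intersecting the fact sets. The facts themselves are illustrative, extending the paper's livesIn example; a real system would extract them with an information-extraction pipeline.

```python
# Facts extracted from the source document, as RDF-style triples.
source_facts = {
    ("john", "livesIn", "ohio"),
    ("john", "worksAt", "acme"),
}

# Facts extracted from a suspicious document.
suspect_facts = {
    ("john", "livesIn", "ohio"),
    ("mary", "livesIn", "texas"),
}

# Two documents express the same fact when they share a triple, even if
# the surface wording differs ("John lives in Ohio" vs
# "Ohio is where John lives").
shared = source_facts & suspect_facts
```

Comparing at the fact level is what lets this approach survive paraphrasing that defeats string matching.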
IV. TOOLS
Various tools have been developed to detect plagiarism. They can be
classified as online and offline. Offline tools are those that can be
run in an offline environment to perform the detection process; they
usually include an inbuilt corpus against which the suspicious
document is checked, which can be one of their limitations. Online
tools, on the other hand, perform detection in an online environment
and check documents against indexed documents. These tools constantly
build up their corpus by indexing the web; hence their detection is
fairly good compared to offline tools.
A. Offline Tools
Some of the offline tools for plagiarism detection are given below:
i) CopyCatch: CopyCatch is a plagiarism detection tool that evolved
from WordCheck. Early on it was a primary plagiarism detection
tool for research papers, essays, etc.; its algorithm relied on
the principle of hapax legomena (words that appear only once in
a text). Instead of counting the occurrences of every word, it
returned the list of words that occur exactly once in the
document. If a document shared over 50 percent of its hapax
legomena with another, it was marked as possible plagiarism.
The idea was based on research which found that independently
written texts on the same subject can have up to 50 percent
hapax legomena overlap, while any more indicates potential
plagiarism [17].
⁵http://ai.stanford.edu/~rion/swn/
⁶http://www.w3.org/RDF/
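CopyCatch's hapax-legomena criterion can be sketched in a few lines. The sample strings and the exact overlap measure (the fraction of the first document's hapax words that also occur as hapax words in the second) are illustrative assumptions, not CopyCatch's actual implementation.

```python
from collections import Counter

def hapax(text):
    # Hapax legomena: the words that occur exactly once in the document.
    counts = Counter(text.lower().split())
    return {w for w, c in counts.items() if c == 1}

def hapax_overlap(a, b):
    # Fraction of a's hapax words shared with b's hapax words; values
    # above roughly 0.5 would be flagged as possible plagiarism.
    ha, hb = hapax(a), hapax(b)
    return len(ha & hb) / len(ha) if ha else 0.0

a = "alpha beta gamma beta"
b = "alpha gamma delta"
score = hapax_overlap(a, b)
```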
ii) SCAM: SCAM (Stanford Copy Analysis Mechanism) [18] is a
plagiarism detection tool that first appeared in 1995. Its
algorithm is much more statistics-oriented than those of many
other programs. It first gathers the list of word occurrences,
exactly like WordCheck. It then statistically normalises the
list based on the number of occurrences, essentially fitting
the data to a bell-shaped curve, and stores the list as a
vector. SCAM then uses the vector space model [19] to compare
this vector with the vectors of other documents, using a dot
product or cosine function for similarity: if the distributions
of words are similar, the documents are likely to be similar.
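The vector-space comparison SCAM builds on can be sketched as follows. This is a generic cosine similarity over raw word counts, not SCAM's actual statistical normalisation, and the example documents are invented.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    # Vector space model: each document becomes a word-count vector;
    # similarity is the cosine of the angle between the two vectors.
    va = Counter(doc_a.lower().split())
    vb = Counter(doc_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Identical word distributions give a cosine of 1.0, completely disjoint vocabularies give 0.0, and near-copies fall close to 1.0.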
iii) CHECK: CHECK is a tool that combines statistical analysis with
computer science techniques. It still has to maintain a huge
database of documents to compare against the submission.
However, it narrows down the search by attempting to determine
the contents and semantics of the paper: instead of comparing
the submitted paper with every document in its database, it
compares it only with those determined to be of similar
content.
CHECK determines the semantics of a paper by creating its
document tree. The CHECK algorithm builds this tree from a
document by layering sections, subsections, paragraphs,
sentences, etc.
B. Online Tools
The various online tools for detecting plagiarism include:
i) TurnItIn: TurnItIn was designed by four UC Berkeley graduate
students as a peer-review application for their classes.
Eventually, that prototype developed into one of the most
recognizable names in plagiarism detection. TurnItIn, which
processed over 60 million academic papers in 2011, is
accessible for a fee per educator. Students can use TurnItIn's
WriteCheck service to maintain proper citations and to access
various writing tools.
ii) iThenticate: Like TurnItIn, iThenticate is a service offered by
Plagiarism.org, but it is geared more toward professional
writing and scholarly research. Publishers like Oxford
University Press use iThenticate for its CrossCheck software,
which includes a database of more than 31 million articles and
67,664 books and journals.
iii) Viper: Viper calls itself the "Free TurnItIn Alternative." It
scans a large database of academic essays and other online
sources, offering side-by-side comparisons for plagiarism. Its
limitation is that it is available only to Microsoft Windows
users.
iv) PlagiarismChecker.com: PlagiarismChecker.com makes it simple
for educators to check for copied work by pasting phrases from
a student's paper into a search box. The system can search
through either Google or Yahoo. Users can also use the "Author"
option to check whether others have plagiarized their work
online.
v) PlagiarismDetect.com: PlagiarismDetect.com scans text at a rate
of $0.50 per page. The system takes about 5-7 minutes per page,
which makes for a thorough examination. According to the
website, PlagiarismDetect.com has recently updated its system
with a new advanced algorithm combining multi-layered
technology and SMART scanning (which supposedly scans papers
the way humans do).
vi) Plagiarisma.net: Plagiarisma.net has a search box as well
as a software download available for Windows. Users can
also search for entire URLs and files in HTML, DOC,
DOCX, RTF, TXT, ODT and PDF formats.
vii) PlagiarismSoftware.net: Formerly known as Duplichecker, this
minimalistic checker lets users search for text and upload text
files.
viii) CheckForPlagiarism.net: CheckForPlagiarism.net claims its
licensing fees are, on average, between 35% and 70% lower than
those of competing services. Its basic account, meant for high
school students, costs $20 and allows users to scan five
documents. The service can scan multiple languages, and users
can compare papers.
ix) Essay Verification Engine (EVE2): The EVE plagiarism detection
system is one of the older services on this list, having
performed almost 150 million scans since its creation in 2000.
It costs users $29.99 for unlimited use and includes a 10-day
money-back guarantee.
REFERENCES
[1] P. Clough, "Old and new challenges in automatic plagiarism
detection," National Plagiarism Advisory Service, 2003, pp. 391–407.
[Online]. Available: http://ir.shef.ac.uk/cloughie/index.html
[2] L. Prechelt, G. Malpohl, and M. Philippsen, "JPlag: Finding
plagiarisms among a set of programs," Tech. Rep., 2000.
[3] P. Bulychev and M. Minea, "An evaluation of duplicate code
detection using anti-unification," in Proceedings of the 3rd
International Workshop on Software Clones at CSMR, 2009.
[4] Y. Higo and S. Kusumoto, "Code clone detection on specialized
PDGs with heuristics," in 2011 15th European Conference on Software
Maintenance and Reengineering, pp. 75–84, 2011.
[5] S. M. Alzahrani, N. Salim, and A. Abraham, “Understanding plagiarism
linguistic patterns, textual features, and detection methods,” Trans. Sys.
Man Cyber Part C, vol. 42, no. 2, pp. 133–149, Mar. 2012. [Online].
Available: http://dx.doi.org/10.1109/TSMCC.2011.2134847
[6] M. Chong and L. Specia, “Lexical generalisation for word-level match-
ing in plagiarism detection,” in RANLP, 2011, pp. 704–709.
[7] S. Brin, J. Davis, and H. Garcia-Molina, “Copy detection mechanisms
for digital documents,” in SIGMOD Conference, 1995, pp. 398–409.
[8] D. R. White and M. Joy, “Sentence-based natural language plagiarism
detection,” ACM Journal of Educational Resources in Computing,
vol. 4, no. 4, pp. 1–20, 2004.
[9] S. Niezgoda and T. P. Way, “Snitch: a software tool for detecting cut
and paste plagiarism,” in SIGCSE, 2006, pp. 51–55.
[10] A. Barrón-Cedeño and P. Rosso, “On automatic plagiarism detection
based on n-grams comparison,” in ECIR, 2009, pp. 696–700.
[11] M. S. Pera and Y.-K. Ng, "A naïve Bayes classifier for web
document summaries created by using word similarity and significant
factors," International Journal on Artificial Intelligence Tools,
vol. 19, no. 4, pp. 465–486, 2010.
[12] E. Stamatatos, “Plagiarism detection using stopword n-grams,” JASIST,
vol. 62, no. 12, pp. 2512–2527, 2011.
[13] J. Grman and R. Ravas, “Improved implementation for finding text
similarities in large sets of data - notebook for pan at clef 2011,” in
CLEF (Notebook Papers/Labs/Workshop), 2011.
[14] H.-H. Chen, M.-S. Lin, and Y.-C. Wei, “Novel association measures
using web search with double checking,” in ACL, 2006.
[15] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,”
Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[16] G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis, “Text relatedness based
on a word thesaurus,” J. Artif. Intell. Res. (JAIR), vol. 37, pp. 1–39,
2010.
[17] P. Clough, “Plagiarism in natural and programming languages: an
overview of current tools and technologies,” 2000.
[18] N. Shivakumar and H. Garcia-Molina, "SCAM: A copy detection
mechanism for digital documents," in Proceedings of the Second
Annual Conference on the Theory and Practice of Digital Libraries,
1995. [Online]. Available: http://ilpubs.stanford.edu:8090/95/
[19] G. Salton, A. Wong, and C. S. Yang, “A vector space model for
automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, Nov.
1975. [Online]. Available: http://doi.acm.org/10.1145/361219.361220