2. Scope, Objectives, Significance
Objectives
Propose a framework
Investigate the role of machine learning in the proposed framework
Significance of the Project
Degradation of education quality
Scope
External plagiarism
3. Plagiarism
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”
Clough, Gaizauskas, Piao, and Wilks, “METER: Measuring TExt Reuse” (2000)
Shallow
Deep
Direct copy or paraphrase of n-grams
Can be of various lengths
Natural Language Processing
Computer science + artificial intelligence + linguistics
Information Retrieval
Word Segmentation
Sentence Breaking
Word Sense Disambiguation
4. Works That Influenced Us
SCAM (Shivakumar and Garcia-Molina, 1995, 1996)
The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003)
PRAISE (Culwin and Lancaster, 2001)
N-gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)
Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)
Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and SangYong Han)
Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)
Limitations
7. Experimental Setup (cont.)
[Figure: framework diagram. Box labels: Corpus; Original Documents; Suspicious Documents; Text Pre-processing & NLP Techniques; Comparison Methodologies; Feature Selection; Machine Learning; Construction of a Train Model; Test Model; Accuracy Score; Plagiarism Detection]
8. Experimental Setup (cont.)
Text pre-processing & NLP techniques:
[Figure: tree of pre-processing options. Lower Case branches into Stop Word / Without Stop Word; each branch splits into Punctuation / No Punctuation; each of those splits into Stemming / No Stemming and Lemmatizing / No Lemmatizing]
Sentence Segmentation
Input:
[“To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.”]
Output:
[“To be or not to be – that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.”]
[“To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – 'tis a consummation devoutly to be wished.”]
Tokenization
Input: “To be or not to be – that is the question:”
Output: [To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
Lower Case
Input: “To be or not to be – that is the question:”
Output: to be or not to be – that is the question
Stop Word Removal
Input: “To be or not to be – that is the question:”
Output: be or not be – question:
Punctuation Removal
Input: “Hello Dear, how are You?”
Output: Hello Dear how are you
Lemmatizing
Produced: Produce
Stemming
Produced / Product / Produce: Produc
Computational: Comput
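The pre-processing steps on this slide can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the authors' implementation: the function names, the regular expressions, and the four-word `STOP_WORDS` set (chosen only so that the output mirrors the slide's stop-word example) are all assumptions. Stemming and lemmatizing are left out because they normally rely on an external library.

```python
import re

# Hypothetical stop-word list, chosen only to mirror the slide's example.
STOP_WORDS = {"to", "that", "is", "the"}


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return word tokens (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


def preprocess(sentence, lower=True, drop_stop_words=False, drop_punctuation=False):
    """Apply the optional steps in a fixed order and return the token list."""
    tokens = tokenize(sentence)
    if lower:
        tokens = [t.lower() for t in tokens]
    if drop_punctuation:
        # Keep only tokens that contain at least one word character.
        tokens = [t for t in tokens if re.search(r"\w", t)]
    if drop_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


line = "To be or not to be – that is the question:"
print(tokenize(line))
print(preprocess(line, drop_stop_words=True))
print(preprocess("Hello Dear, how are You?", drop_punctuation=True))
```

Each switch corresponds to one branch of the pre-processing options, so a document can be run through several combinations and compared under each.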
9. Experimental Setup (cont.)
Comparison methodologies:
N-gram frequency-based similarity measure
N-gram similarity measure using Jaccard index
Machine learning algorithms:
J48 classifier, Naïve Bayes classifier
10. N-gram Similarity Measure
1-gram similarity measure (pre-processing + NLP + comparison)
Original Document
The girl is standing outside of PUCSD and talking with her friend
Suspicious Document
The boy is talking with his friend outside of Symbiosis
1-gram representation
[[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]
[[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]
With SP and P: 7/10 = 70%
No SP and P: 3/10 = 30%
1-gram representation (stemmed)
[[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]
[[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]
With SP and P: 7/10 = 70%
No SP and P: 3/10 = 30%
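The 7/10 and 3/10 figures can be reproduced with a short sketch. It assumes the measure counts how many tokens of the suspicious document also occur in the original and divides by the suspicious document's full token count (10), which is the reading that matches both numbers on the slide. The function name and the stop-word list are hypothetical.

```python
# Hypothetical stop-word list for this example only.
STOP_WORDS = frozenset({"the", "is", "with", "of", "his", "her", "and"})


def unigram_overlap(original, suspicious, stop_words=frozenset()):
    """Return (shared, total): suspicious tokens found in the original,
    skipping stop words, and the full suspicious token count."""
    original_tokens = set(original.split())
    suspicious_tokens = suspicious.split()
    shared = [
        t for t in suspicious_tokens
        if t in original_tokens and t.lower() not in stop_words
    ]
    return len(shared), len(suspicious_tokens)


original = "The girl is standing outside of PUCSD and talking with her friend"
suspicious = "The boy is talking with his friend outside of Symbiosis"

shared, total = unigram_overlap(original, suspicious)
print(f"{shared}/{total} = {shared / total:.0%}")   # all tokens counted

shared, total = unigram_overlap(original, suspicious, STOP_WORDS)
print(f"{shared}/{total} = {shared / total:.0%}")   # stop words ignored
```

Stemming leaves both counts unchanged here because the shared words (talking, outside, friend) reduce to the same stems in both documents.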
11. N-gram Similarity Measure Using Jaccard Index
2-gram similarity measure (pre-processing + NLP + comparison)
Original Document
The girl is standing outside of PUCSD and talking with her friend
Suspicious Document
The boy is talking with his friend outside of Symbiosis
2-gram representation
[[The girl], [girl is], [is standing], [standing outside], [outside of], [of PUCSD], [PUCSD and], [and talking], [talking with], [with her], [her friend], [friend ]]
[[The boy], [boy is], [is talking], [talking with], [with his], [his friend], [friend outside], [outside of], [of Symbiosis], [Symbiosis ]]
Similarity Index
2/20 = 10%
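The Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below reproduces the 2/20 figure under one assumption taken from the slide's own lists: the last token of each document is kept as a trailing, incomplete 2-gram, which gives 12 and 10 grams, 2 shared, and a union of 20. The function names are hypothetical.

```python
def padded_bigrams(text):
    """Pair each token with its successor; the last token is paired with an
    empty string, mirroring the trailing gram shown on the slide."""
    tokens = text.split()
    return set(zip(tokens, tokens[1:] + [""]))


def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "The girl is standing outside of PUCSD and talking with her friend"
suspicious = "The boy is talking with his friend outside of Symbiosis"

a = padded_bigrams(original)
b = padded_bigrams(suspicious)
print(sorted(a & b))                       # the shared 2-grams
print(f"{len(a & b)}/{len(a | b)} = {jaccard(a, b):.0%}")
```

Dropping the trailing incomplete gram would leave 11 and 9 grams and a union of 18, so the score would be 2/18 rather than 2/20.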
12. Experiment and Findings-1
Generating Decision Tree
95 instances, 121 attributes
Selecting Features
Build train model
95 instances, 27 attributes
Accuracy: 94.6809% on J48
Accuracy: 65.9574% on Naïve Bayes
Accuracy: 71.2766% on Naïve Bayes
Accuracy: 93.617% on J48
Accuracy: 89.0052% on J48
Accuracy: 86.3874% on Naïve Bayes
13. Experiment and Findings-2
Generating Decision Tree
95 instances, 121 attributes
Use Filter Metrics
Build train model
95 instances, 26 attributes
Accuracy: 94.6809% on J48
Accuracy: 65.9574% on Naïve Bayes
Accuracy: 71.2766% on Naïve Bayes
Accuracy: 93.617% on J48
Accuracy: 89.0052% on J48
Accuracy: 86.3874% on Naïve Bayes