Saugata Bose
12204
M.Sc-II
Natural Language Processing:
Plagiarism Detection
Scope, Objectives, Significance
Objectives
• Propose a framework
• Investigate the role of machine learning in the proposed framework
Significance of the Project
• Degradation of education quality
Scope
• External plagiarism
• Plagiarism
“The action or practice of taking someone else's work, idea, etc., and passing it off as one's own; literary theft.”
Direct copy or paraphrase of n-grams; can be of various lengths
Clough, Gaizauskas, Piao, and Wilks, “METER: Measuring TExt Reuse” (2000)
• Natural Language Processing
computer science + artificial intelligence + linguistics
Shallow, Deep
Information Retrieval, Word Segmentation, Sentence Breaking, Word Sense Disambiguation
Works That Influenced Us
• SCAM (Shivakumar and Garcia-Molina, 1995, 1996)
• The more complex the metrics are, the more processing power is required (Lancaster and Culwin, 2003)
• PRAISE (Culwin and Lancaster, 2001)
• N-gram overlap method (Roman Tesar, Massimo Poesio, Vaclav Strnad, and Karel Jezek)
• Use of cosine similarity and tf-idf (Thade Nahnsen, Ozlem Uzuner, and Boris Katz)
• Plagiarism Pattern Checker (Nam Oh Kang, Alexander Gelbukh, and SangYong Han)
• Use of VSM (Benno Stein, Sven Meyer zu Eissen, and Martin Potthast)
Limitations!
Our Initiatives
• Frequency comparison approach
• N-gram similarity measure along with Jaccard index
• Shallow NLP
Experimental Setup
• Corpus of Plagiarised Short Answers (Clough & Stevenson, 2009)
• Original source documents: 5
• Plagiarised documents: 57
---------- Near copy: 19
---------- Light revision: 19
---------- Heavy revision: 19
• Non-plagiarised documents: 38
• Total suspicious documents: 95 (57 + 38)
Experimental Setup (cont.)
[Framework diagram: Corpus (Original Documents, Suspicious Documents) → Text Pre-processing & NLP Techniques → Comparison Methodologies → Feature Selection → Machine Learning (Construction of a Train Model, Test Model) → Accuracy Score → Plagiarism Detection]
Experimental Setup (cont.)
• Text pre-processing & NLP techniques:
[Tree diagram of pre-processing combinations: Lower Case → Stop Word / Without Stop Word → Punctuation / No Punctuation → Stemming / No Stemming → Lemmatizing / No Lemmatizing]
Sentence Segmentation
Input: [“To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them. To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.”]
Output:
[“To be or not to be– that is the question: whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune, or to take arms against a sea of troubles and, by opposing, end them.”]
[“To die, to sleep no more – and by a sleep to say we end the heartache and the thousand natural shocks that flesh is heir to – ‘tis a consummation devoutly to be wished.”]
Tokenization
“To be or not to be– that is the question:” → [To] [be] [or] [not] [to] [be] [–] [that] [is] [the] [question] [:]
Lower Case
“To be or not to be– that is the question:” → to be or not to be– that is the question
Stop Word Removal
“To be or not to be– that is the question:” → be or not be - question:
Punctuation Removal
“Hello Dear, how are You?” → Hello Dear how are you
Lemmatizing
Produced → Produce
Stemming
Produced / Product / Produce → Produc
Computational → Comput
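The shallow pre-processing steps illustrated above can be sketched in a few lines of Python. This is a minimal sketch, not the deck's implementation: the function names, the regular expressions, and the tiny `STOP_WORDS` set are assumptions made for illustration (the slides do not say which stop-word list or tokenizer was used). Stemming and lemmatizing are left out because they normally come from a library such as NLTK.

```python
import re

# Assumption: a tiny illustrative stop-word list; the slides do not name the list used.
STOP_WORDS = {"to", "that", "is", "the"}

def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lower_case(tokens):
    """Lower-case every token."""
    return [t.lower() for t in tokens]

def remove_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

def remove_stop_words(tokens):
    """Drop tokens found in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = tokenize("To be or not to be– that is the question:")
print(tokens)
print(remove_stop_words(remove_punctuation(tokens)))
print(remove_punctuation(lower_case(tokenize("Hello Dear, how are You?"))))
```

Each step is a separate function so that the combinations in the tree diagram (with or without stop words, with or without punctuation) can be produced by chaining different subsets.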
Experimental Setup (cont.)
• Comparison methodologies:
N-gram frequency-based similarity measure
N-gram similarity measure using Jaccard index
• Machine learning algorithms:
J48 classifier, Naïve Bayes classifier
N-gram Similarity Measure
• 1-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
1-gram representation (No SP and P)
Original: [[The] [girl] [is] [standing] [outside] [of] [PUCSD] [and] [talking] [with] [her] [friend]]
Suspicious: [[The] [boy] [is] [talking] [with] [his] [friend] [outside] [of] [Symbiosis]]
7/10 = 70% (1-grams of the suspicious document found in the original)
3/10 = 30% (1-grams of the suspicious document not found in the original)
1-gram representation (With SP and P)
Original: [[The] [girl] [is] [stand] [outsid] [of] [PUCSD] [and] [talk] [with] [her] [friend]]
Suspicious: [[The] [boy] [is] [talk] [with] [his] [friend] [outsid] [of] [Symbiosis]]
7/10 = 70%
3/10 = 30%
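The 7/10 figure in the 1-gram example can be reproduced by counting how many 1-grams of the suspicious document also occur in the original. The sketch below takes that reading as an assumption; the slides do not give a formula, and the names `ngrams` and `containment` are illustrative.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def containment(original_tokens, suspicious_tokens, n=1):
    """Fraction of the suspicious document's n-grams that also occur in the original."""
    original = set(ngrams(original_tokens, n))
    suspicious = ngrams(suspicious_tokens, n)
    if not suspicious:
        return 0.0
    matched = sum(1 for gram in suspicious if gram in original)
    return matched / len(suspicious)

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

score = containment(original, suspicious, n=1)
print(f"matched: {score:.0%}, unmatched: {1 - score:.0%}")
```

The seven matching words are The, is, talking, with, friend, outside and of; boy, his and Symbiosis are the three that do not appear in the original.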
N-gram Similarity Measure using Jaccard Index
• 2-gram similarity measure (Pre-processing + NLP + Comparison)
Original Document: The girl is standing outside of PUCSD and talking with her friend
Suspicious Document: The boy is talking with his friend outside of Symbiosis
2-gram representation
Original: [[The girl],[girl is],[is standing],[standing outside],[outside of],[of PUCSD],[PUCSD and],[and talking],[talking with],[with her],[her friend],[friend ]]
Suspicious: [[The boy],[boy is],[is talking],[talking with],[with his],[his friend],[friend outside],[outside of],[of Symbiosis],[Symbiosis ]]
Similarity Index
2/20 = 10% (2 shared 2-grams, [talking with] and [outside of], out of 20 distinct 2-grams across both documents, counting the trailing single-word entries)
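The Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to sets of 2-grams; the helper names and the `pad_end` flag are assumptions for illustration. The flag is there because the 2/20 figure is only reproduced when the trailing single-word entries ([friend ] and [Symbiosis ]) are counted, as they are in the 2-gram representation above.

```python
def ngram_set(tokens, n, pad_end=False):
    """Set of n-grams; with pad_end, also keep the shorter trailing grams."""
    grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if pad_end:
        start = max(0, len(tokens) - n + 1)
        grams |= {tuple(tokens[i:]) for i in range(start, len(tokens))}
    return grams

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

original = "The girl is standing outside of PUCSD and talking with her friend".split()
suspicious = "The boy is talking with his friend outside of Symbiosis".split()

padded = jaccard(ngram_set(original, 2, pad_end=True),
                 ngram_set(suspicious, 2, pad_end=True))
strict = jaccard(ngram_set(original, 2), ngram_set(suspicious, 2))
print(f"with trailing entries: {padded:.1%}")     # 2 shared / 20 distinct
print(f"without trailing entries: {strict:.1%}")  # 2 shared / 18 distinct
```

Without the trailing entries the two documents have 11 and 9 true 2-grams, so the union shrinks to 18 and the index rises slightly to about 11.1%.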
Experiment and Findings-1
Generating Decision Tree
95 instances, 121 attributes
Selecting Features
Build train model
95 instances, 27 attributes
Accuracy: 94.6809% on J48
Accuracy: 65.9574% on Naïve Bayes
Accuracy: 71.2766% on Naïve Bayes
Accuracy: 93.617% on J48
Accuracy: 89.0052% on J48
Accuracy: 86.3874% on Naïve Bayes
Experiment and Findings-2
Generating Decision Tree
95 instances, 121 attributes
Use Filter Metrics
Build train model
95 instances, 26 attributes
Accuracy: 94.6809% on J48
Accuracy: 65.9574% on Naïve Bayes
Accuracy: 71.2766% on Naïve Bayes
Accuracy: 93.617% on J48
Accuracy: 89.0052% on J48
Accuracy: 86.3874% on Naïve Bayes
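To illustrate how similarity scores become a classifier and an accuracy figure, the sketch below fits a one-level threshold rule on a single similarity score. This is a deliberately simplified stand-in, not J48 (Weka's implementation of the C4.5 decision tree) and not Naïve Bayes. The scores and labels are hypothetical toy data invented for the example, not values from the corpus, so the resulting accuracy is not comparable to the figures reported above.

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

def fit_threshold(scores, labels):
    """Pick the cut-off on one similarity score that maximizes training accuracy."""
    best_threshold, best_accuracy = 0.0, -1.0
    for candidate in sorted(set(scores)):
        predictions = [s >= candidate for s in scores]
        acc = accuracy(predictions, labels)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy

# Hypothetical toy data: one similarity score per suspicious document,
# and True where the document is labelled as plagiarised.
scores = [0.05, 0.10, 0.12, 0.40, 0.55, 0.70, 0.90, 0.30]
labels = [False, False, False, True, True, True, True, False]

threshold, train_accuracy = fit_threshold(scores, labels)
print(f"threshold={threshold}, training accuracy={train_accuracy:.4f}%")
```

A real decision tree repeats this kind of split recursively over many attributes, which is why the number of attributes (121 before selection, 27 or 26 after) matters for the trained model.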
Future Improvements
• Integrate WordNet with the current framework
• Address paraphrasing
• Address multi-lingual plagiarism detection
