Automatic Plagiarism Detection System for Specialized Corpora
Authors:
Filip Cristian Buruiană
Adrian Scoică
Traian Rebedea – traian.rebedea@cs.pub.ro
Razvan Rughiniș
University Politehnica of Bucharest
Overview
• Introduction
• System architecture
• Detection of plagiarism
• Algorithms for candidate selection
• Algorithms for detailed analysis
• Algorithms for post-processing
• Results
• Conclusions
22.09.13 Bachelor's Thesis Session - July 2012 2
Introduction
• Plagiarism: unauthorized appropriation of the language or
thoughts of another author and the representation of that
author's work as pertaining to one's own without according
proper credit to the original author
• Lots of documents => automatic detection
needed
• Information Retrieval
– Stemming (e.g. beauty, beautiful, beautifulness => beauti)
– Vector Space Model
– tf-idf weighting, cosine similarity
• Measuring results
– precision, recall, granularity => F-measure
22.09.13 CSCS 2013 – Bucharest, Romania 3
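The Vector Space Model items above can be illustrated with a minimal sketch. It assumes documents are already tokenized (and, in a real pipeline, stemmed), uses a plain tf * log(N/df) weighting, and is not the AuthentiCop implementation; the function names and the sample sentences are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection: two near-duplicates and one unrelated sentence.
docs = [
    "the raw image is convolved with a gaussian filter".split(),
    "the image is convolved with a gaussian filter first".split(),
    "stemming maps related word forms to one stem".split(),
]
vecs = tf_idf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # the two near-duplicates
sim_far = cosine(vecs[0], vecs[2])    # no shared terms
```

Terms that occur in every document get weight 0 under this idf, so very common words contribute nothing to the similarity.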
Existing solutions
• Lots of commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
• They are general solutions, topic independent
• No open-source solutions that offer good results
• No solutions specialized for Computer Science
• Difficult to evaluate: a good corpus is needed (human annotation,
finding plagiarized documents, etc.)
• AuthentiCop – developed for specialized corpora, also evaluated on
general texts
• Used corpora:
– PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and
social software misuse” at CLEF)
– Bachelor theses @ A&C
System Architecture
• Web interface for accessing AuthentiCop
– Simple to add documents (text, pdf) and to highlight suspicious
elements
System architecture
• Logical separation
– Front-end (PHP, JavaScript + AJAX, jQuery)
– Back-end (C++)
– Cross-Language Communication
• Scalable solution, easy to update
– Web server (front-end) and the plagiarism detection
modules (back-end) may run on different machines
– Plagiarism detection can be distributed on different
machines (distributed workers)
• Several external open-source libraries are used
(e.g. Apache Tika, CLucene, etc.)
System architecture
• Example: sequence of steps for processing PDF files
• Apache Tika is used for transforming PDFs into text
• Automatic build module for the back-end components
• Automatic deployment system for the solution
Detection of plagiarism
• Different problems
– Intrinsic plagiarism (analyze only the suspicious
document)
– External plagiarism (also has a reference collection
to check against)
• How large is the collection? Online sources?
• Source identification
• Text alignment
Detection of plagiarism
Steps for external plagiarism detection
1. Candidate selection
– Find pairs of suspicious texts
– Combines source identification with text
alignment
2. Detailed analysis
3. Post-processing
Algorithms for candidate selection
• Selection of plausible plagiarism pairs
• Using stop-word elimination, tf-idf & cosine similarity
• Initial hypothesis
• “Similarity Search Problem”: All-Pairs,
ppjoin (Prefix Filtering with Positional
Information Join)
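The prefix-filtering idea behind All-Pairs and ppjoin can be sketched as follows. This is a simplified illustration for Jaccard similarity over token sets, without the positional filtering that ppjoin adds, and it is not the code used in the system; all function names and the sample sets are assumptions.

```python
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a or b else 0.0

def candidate_pairs(token_sets, threshold):
    """Pairs whose prefixes share a token; a superset of all pairs with Jaccard >= threshold."""
    # Global ordering: rare tokens first, ties broken alphabetically.
    freq = Counter(t for s in token_sets for t in s)
    index = defaultdict(list)  # token -> ids of earlier sets with that token in their prefix
    candidates = set()
    for doc_id, tokens in enumerate(token_sets):
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        # Prefix length |x| - ceil(t * |x|) + 1; the epsilon guards against float round-up.
        needed = math.ceil(threshold * len(ordered) - 1e-9)
        prefix_len = len(ordered) - needed + 1
        for token in ordered[:prefix_len]:
            for other in index[token]:
                candidates.add((other, doc_id))
            index[token].append(doc_id)
    return candidates

def similar_pairs(token_sets, threshold):
    """Verify each candidate with the exact Jaccard similarity."""
    return sorted((i, j) for i, j in candidate_pairs(token_sets, threshold)
                  if jaccard(token_sets[i], token_sets[j]) >= threshold)

# Hypothetical token sets (after stop-word elimination).
sets = [
    {"canny", "edge", "detector", "gaussian", "filter"},
    {"canny", "edge", "detector", "gaussian", "noise"},
    {"latent", "semantic", "analysis", "space"},
]
pairs = similar_pairs(sets, 0.5)
```

Only sets that share a token in their short prefixes are compared in full, which is what avoids the quadratic all-against-all comparison.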
Algorithms for candidate selection
• FastDocode (presented at PAN 2010)
+ caching + sub-linear merging
• New approach
– Text segments => fingerprints & indexing with Apache
CLucene
– Compute the number of inversions
N-gram length  Segment size  Retention rate  TP    FP     FN     Time (h)  Plagdet
3              150           10%             5413  44522  11469  ~1        0.162
4              150           10%             4913  10297  11969  ~2        0.306
4              150           30%             7633  35169  9249   ~4.5      0.256
5              150           20%             5194  6256   11688  ~3        0.367

Method (evaluated on 1000 documents)  TP   FP    FN    Prec.  Recall  Plagdet
Fingerprinting & indexing             685  494   761   0.581  0.474   0.522
FastDocode#3                          634  4097  812   0.134  0.438   0.205
FastDocode#4                          424  815   1022  0.342  0.293   0.316
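A rough sketch of fingerprinting text segments and looking them up in an inverted index follows. It assumes word n-grams hashed with MD5 and a "keep hashes divisible by k" retention rule, and it uses an in-memory dictionary in place of a CLucene index; these choices and all names are illustrative assumptions, not the system's actual design.

```python
import hashlib
from collections import defaultdict

def fingerprints(words, n=4, keep_every=5):
    """Hash every word n-gram and retain roughly 1/keep_every of the hashes."""
    kept = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        if h % keep_every == 0:
            kept.add(h)
    return kept

def build_index(segments, n=4, keep_every=5):
    """Inverted index: fingerprint -> ids of the source segments containing it."""
    index = defaultdict(set)
    for seg_id, words in segments.items():
        for h in fingerprints(words, n, keep_every):
            index[h].add(seg_id)
    return index

def shared_fingerprints(index, query_words, n=4, keep_every=5):
    """Count, per source segment, how many fingerprints it shares with the query."""
    counts = defaultdict(int)
    for h in fingerprints(query_words, n, keep_every):
        for seg_id in index.get(h, ()):
            counts[seg_id] += 1
    return dict(counts)

# Hypothetical segments; keep_every=1 retains every n-gram so the result is easy to check.
segments = {"src": "the raw image is convolved with a gaussian filter".split()}
index = build_index(segments, n=4, keep_every=1)
query = "first the raw image is convolved with noise".split()
hits = shared_fingerprints(index, query, n=4, keep_every=1)
```

Segments whose shared-fingerprint count exceeds some threshold would then be passed on as candidates for the detailed analysis.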
Algorithms for detailed analysis
• DotPlot: “Sequence Alignment Problem”
• Modified FastDocode
• Extending the analysis to the right and to the left,
starting from common words/passages
• Using passages instead of words as seeds for the
comparison
• tf-idf weighting & cosine similarity
Image source: Wikipedia
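The "extend to the left and to the right from a common passage" step can be sketched as below. This simplified version uses exact word n-grams as seeds and stops at the first mismatch, whereas the slide's modified FastDocode also relies on tf-idf weighting and cosine similarity; the function names and sample sentences are assumptions.

```python
def find_seeds(a, b, n=3):
    """Positions (i, j) where the n-gram starting at a[i] equals the one starting at b[j]."""
    positions = {}
    for j in range(len(b) - n + 1):
        positions.setdefault(tuple(b[j:j + n]), []).append(j)
    return [(i, j)
            for i in range(len(a) - n + 1)
            for j in positions.get(tuple(a[i:i + n]), [])]

def extend_seed(a, b, i, j, length):
    """Grow the exact match a[i:i+length] == b[j:j+length] in both directions."""
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + length, j + length
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return (start_a, end_a), (start_b, end_b)

def aligned_passages(a, b, n=3):
    """Distinct maximal matching spans, as ((start_a, end_a), (start_b, end_b))."""
    return sorted({extend_seed(a, b, i, j, n) for i, j in find_seeds(a, b, n)})

# Hypothetical suspicious and source texts.
suspicious = "we note that the raw image is convolved with a gaussian filter today".split()
source = "first the raw image is convolved with a gaussian filter as shown".split()
passages = aligned_passages(suspicious, source)
```

Every seed inside the same copied run extends to the same span, so collecting the spans in a set removes the duplicates.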
Algorithms for post-processing
• Semantic analysis using LSA
– Built a semantic space with papers from Computer
Science (and pages from Wikipedia)
– Gensim framework in Python
• Smith-Waterman Algorithm
– Dynamic programming
– Similar to the longest common subsequence
– Insert and delete operations may have any cost
(they may be greater than 1)
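A compact sketch of Smith-Waterman local alignment over word sequences is given below. The match, mismatch, and gap scores are illustrative assumptions (the slide only says that insert and delete costs are configurable), and the function name is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b); None marks a gap."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]

# Hypothetical passages: the second one drops the word "a".
x = "the result is a slightly blurred version".split()
y = "result is slightly blurred".split()
best_score, part_x, part_y = smith_waterman(x, y)
```

With these scores the four matching words contribute 8 and the single deleted word costs 1, so the two matching runs are joined into one alignment instead of being reported separately.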
Results
• Corpus: PAN 2011 (~ 22k documents)
• Run time on laptop: ~ 20 hours
• Results:
• Official results from PAN 2011:
Plagdet         Recall          Precision       Granularity
0.221929185084  0.202996955425  0.366482242839  1.26150173611
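As a sanity check on the reported figures, the sketch below combines precision, recall, and granularity into plagdet using the PAN definition, F1 divided by log2(1 + granularity). It only combines values that are already computed; PAN itself measures precision and recall at character level, which is not reproduced here, and the function names are illustrative.

```python
import math

def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def plagdet(precision, recall, granularity=1.0):
    """F1 score discounted by granularity: F1 / log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# Official PAN 2011 values from the table above.
official = plagdet(0.366482242839, 0.202996955425, 1.26150173611)
```

With granularity 1 the denominator is log2(2) = 1, so plagdet reduces to the plain F1 score; a granularity above 1 (the same case reported in several pieces) lowers the score.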
Results
• Specific corpus for CS:
– 940 BSc theses + 8700 CS articles from Wikipedia
• Detecting theses written in English: TextCat
– 307 BSc theses in English
Plagiarized text:
The Canny edge detector uses a filter based
on the first derivative of a Gaussian, because
it is susceptible to noise present on raw
unprocessed image data, so to begin with,
the raw image is convolved with a Gaussian
filter. The result is a slightly blurred version of
the original which is not affected by a single
noisy pixel to any significant degree.
Original text from Wikipedia:
Because the Canny edge detector is
susceptible to noise present in raw
unprocessed image data, it uses a filter based
on a Gaussian (bell curve), where the raw
image is convolved with a Gaussian filter. The
result is a slightly blurred version of the
original which is not affected by a single noisy
pixel to any significant degree.
• Some elements are incorrectly identified as
plagiarism: quotes, bibliographic references
Conclusions
• Improving the corpus
• The system uses several parameters that were
determined empirically => use machine
learning for finding the best values
• Increase the speed of the processing
• Improve the method: “bag of words” +
information about the position of the words
• Need better post-processing for real
documents (like scientific papers or theses)
Thank you!
• Questions?
• Discussion

Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 

Automatic plagiarism detection system for specialized corpora

  • 1. Automatic Plagiarism Detection System for Specialized Corpora
      Authors: Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea (traian.rebedea@cs.pub.ro), Razvan Rughiniș
      University Politehnica of Bucharest
  • 2. Overview
      • Introduction
      • System architecture
      • Detection of plagiarism
      • Algorithms for candidate selection
      • Algorithms for detailed analysis
      • Algorithms for post-processing
      • Results
      • Conclusions
      22.09.13 Bachelor's Thesis Session - July 2012
  • 3. Introduction
      • Plagiarism: unauthorized appropriation of the language or thoughts of another author and the representation of that author's work as pertaining to one's own without according proper credit to the original author
      • Many documents => automatic detection needed
      • Information Retrieval
        – Stemming (e.g. beauty, beautiful, beautifulness => beauti)
        – Vector Space Model
        – tf-idf weighting, cosine similarity
      • Measuring results
        – precision, recall, granularity => F-measure
      22.09.13 CSCS 2013 – Bucharest, Romania
  • 4. Existing solutions
      • Many commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
      • They are general, topic-independent solutions
      • No open-source solutions that offer good results
      • No solutions specialized for Computer Science
      • Difficult to evaluate: a good corpus is needed (human annotation, how to find plagiarized documents, etc.)
      • AuthentiCop: developed for specialized corpora, also evaluated on general texts
      • Corpora used:
        – PAN 2011 (“evaluation lab on uncovering plagiarism, authorship, and social software misuse” at CLEF)
        – Bachelor's theses @ A&C
  • 5. System architecture
      • Web interface for accessing AuthentiCop
        – Simple to add documents (text, PDF) and to highlight suspicious elements
  • 6. System architecture
      • Logical separation
        – Front-end (PHP, JavaScript + AJAX, jQuery)
        – Back-end (C++)
        – Cross-language communication
      • Scalable solution, easy to update
        – The web server (front-end) and the plagiarism detection modules (back-end) may run on different machines
        – Plagiarism detection can be distributed across different machines (distributed workers)
      • Several external open-source libraries are used (e.g. Apache Tika, CLucene, etc.)
  • 7. System architecture
  • 8. System architecture
      • Example: sequence of steps for processing PDF files
      • Apache Tika is used for transforming PDFs into text
      • Automatic build module for the back-end components
      • Automatic deployment system for the solution
  • 9. Detection of plagiarism
      • Different problems
        – Intrinsic plagiarism (analyze only the suspicious document)
        – External plagiarism (a reference collection is also available to check against)
      • How large is the collection? Online sources?
      • Source identification
      • Text alignment
  • 10. Detection of plagiarism
      Steps for external plagiarism detection
      1. Candidate selection
        – Find pairs of suspicious texts
        – Combines source identification with text alignment
      2. Detailed analysis
      3. Post-processing
  • 11. Algorithms for candidate selection
      • Selection of the plausible plagiarism pairs
      • Using stop-word elimination, tf-idf & cosine similarity
      • Initial hypothesis
      • “Similarity Search Problem”: All-Pairs, ppjoin (Prefix Filtering with Positional Information Join)
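The combination of stop-word elimination, tf-idf weighting and cosine similarity named on this slide can be illustrated with a minimal sketch. It is not AuthentiCop's implementation (the back-end is written in C++): the stop-word list, the smoothed idf formula, the threshold and all function names are assumptions made for the example, and the quadratic loop over pairs is exactly what All-Pairs and ppjoin are designed to avoid.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "on", "it"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (assumed variant) so no weight collapses to zero.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(docs, threshold=0.3):
    """Brute-force comparison of all pairs; keep those above the threshold."""
    vecs = tfidf_vectors(docs)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = cosine(vecs[i], vecs[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Calling `candidate_pairs` on a list of document strings returns `(i, j, score)` triples for the pairs that would be passed on to the detailed analysis step.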
  • 12. Algorithms for candidate selection
      • FastDocode (presented at PAN 2010) + caching + sub-linear merging
      • New approach
        – Text segments => fingerprints & indexing with Apache CLucene
        – Compute the number of inversions

      N-gram length  Segment size  Retention rate  TP    FP     FN     Time (h)  Plagdet
      3              150           10%             5413  44522  11469  ~1        0.162
      4              150           10%             4913  10297  11969  ~2        0.306
      4              150           30%             7633  35169  9249   ~4.5      0.256
      5              150           20%             5194  6256   11688  ~3        0.367

      Method (run on 1000 documents)  TP   FP    FN    Prec.  Recall  Plagdet
      Fingerprinting & indexing       685  494   761   0.581  0.474   0.522
      FastDocode#3                    634  4097  812   0.134  0.438   0.205
      FastDocode#4                    424  815   1022  0.342  0.293   0.316
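The "text segments => fingerprints & indexing" idea can be sketched as follows, using the three parameters from the first table (n-gram length, segment size, retention rate). Everything else is an assumption for illustration: the slides do not say how the retained hashes are chosen (here, the smallest ones), whether the segment size is counted in words, or how CLucene is used, so a plain in-memory dictionary stands in for the index and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def ngram_hashes(words, n):
    """Hash every word n-gram of a segment to a 64-bit integer."""
    hashes = set()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n]).encode("utf-8")
        hashes.add(int.from_bytes(hashlib.blake2b(gram, digest_size=8).digest(), "big"))
    return hashes


def fingerprint(words, n=4, retention=0.10):
    """Keep the smallest `retention` fraction of the hashes (assumed selection rule)."""
    hashes = sorted(ngram_hashes(words, n))
    if not hashes:
        return set()
    keep = max(1, int(len(hashes) * retention))
    return set(hashes[:keep])


def segments(text, segment_size):
    """Split a text into consecutive segments of `segment_size` words (assumed unit)."""
    words = text.lower().split()
    return [words[s:s + segment_size] for s in range(0, len(words), segment_size)]


def build_index(docs, n=4, segment_size=150, retention=0.10):
    """Inverted index: retained hash -> set of (doc_id, segment_no) containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for seg_no, seg in enumerate(segments(text, segment_size)):
            for h in fingerprint(seg, n, retention):
                index[h].add((doc_id, seg_no))
    return index


def candidate_segments(index, text, n=4, segment_size=150, retention=0.10):
    """Count shared fingerprint hashes between a suspicious text and indexed segments."""
    hits = defaultdict(int)
    for seg in segments(text, segment_size):
        for h in fingerprint(seg, n, retention):
            for location in index.get(h, ()):
                hits[location] += 1
    return dict(hits)
```

A higher retention rate keeps more hashes per segment, which is in line with the table above: raising it from 10% to 30% for 4-grams finds more true positives but also many more false positives and takes longer.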
  • 13. Algorithms for detailed analysis
      • DotPlot: “Sequence Alignment Problem” (image source: Wikipedia)
      • Modified FastDocode
        – Extending the analysis to the right and to the left, starting from common words/passages
        – Using passages instead of words as seeds for the comparison
        – tf-idf weighting & cosine similarity
  • 14. Algorithms for post-processing
      • Semantic analysis using LSA
        – Built a semantic space from Computer Science papers (and pages from Wikipedia)
        – Gensim framework in Python
      • Smith-Waterman algorithm
        – Dynamic programming
        – Similar to the longest common subsequence
        – Insert and delete operations may have any cost (they may be greater than 1)
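The Smith-Waterman step can be illustrated with a short dynamic-programming sketch over word sequences. The scoring values and parameter names (`match`, `mismatch`, `insert_cost`, `delete_cost`) are assumptions chosen for the example, not the values used in AuthentiCop; the two separate gap costs reflect the remark that insert and delete operations may have any cost.

```python
def smith_waterman(a, b, match=2, mismatch=-1, insert_cost=2, delete_cost=2):
    """Best local alignment of sequences a and b.

    Returns (score, aligned_a, aligned_b); None marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] - delete_cost,             # a[i-1] has no partner in b
                score[i][j - 1] - insert_cost,             # b[j-1] has no partner in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - delete_cost:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]
```

Called on two lists of words, the function returns the highest-scoring pair of aligned passages, which is how a lightly edited copy (a few inserted or deleted words) can still be matched to its source.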
  • 15. Results
      • Corpus: PAN 2011 (~ 22k documents)
      • Run time on laptop: ~ 20 hours
      • Results:
      • Official results from PAN 2011:

      Plagdet         Recall          Precision       Granularity
      0.221929185084  0.202996955425  0.366482242839  1.26150173611
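The figures above are linked by the plagdet measure used in the PAN evaluations: the F1 score of precision and recall divided by log2(1 + granularity). The sketch below recomputes it; the function names are hypothetical, and treating granularity as 1 for the candidate-selection table on slide 12 is an assumption (it is consistent with the numbers reported there).

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def plagdet(precision, recall, granularity=1.0):
    """PAN plagdet score: F1 discounted by the logarithm of the granularity."""
    return f1(precision, recall) / math.log2(1 + granularity)


# PAN 2011 figures from this slide.
official = plagdet(0.366482242839, 0.202996955425, 1.26150173611)

# "Fingerprinting & indexing" row of the slide 12 table, assuming granularity 1.
p, r = precision_recall(685, 494, 761)
candidate_selection = plagdet(p, r)
```

A granularity above 1 means that a single plagiarized passage was reported as several fragments, so the score is lower than the plain F1 of the same precision and recall.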
  • 16. Results
      • Specific corpus for CS:
        – 940 BSc theses + 8700 CS articles from Wikipedia
      • Detecting theses written in English with TextCat
        – 307 BSc theses in English
      • Example of a detected pair:
        – Plagiarized text: The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
        – Original text from Wikipedia: Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree.
      • Some elements are incorrectly identified as plagiarism: quotes, bibliographic references
  • 17. Conclusions
      • Improve the corpus
      • The system uses several parameters that were determined empirically => use machine learning to find the best values
      • Increase the processing speed
      • Improve the method: “bag of words” + information about the position of the words
      • Better post-processing is needed for real documents (such as scientific papers or theses)
  • 18. Thank you!
      • Questions?
      • Discussion