
Automatic plagiarism detection system for specialized corpora


  1. Automatic Plagiarism Detection System for Specialized Corpora
     Filip Cristian Buruiană, Adrian Scoică, Traian Rebedea (traian.rebedea@cs.pub.ro), Razvan Rughiniș
     University Politehnica of Bucharest
  2. Overview
     • Introduction
     • System architecture
     • Detection of plagiarism
     • Algorithms for candidate selection
     • Algorithms for detailed analysis
     • Algorithms for post-processing
     • Results
     • Conclusions
  3. Introduction
     • Plagiarism: the unauthorized appropriation of another author's language or ideas, presented as one's own without giving proper credit to the original author
     • The volume of documents makes manual checking impractical => automatic detection is needed
     • Information Retrieval techniques:
       – Stemming (e.g. beauty, beautiful, beautifulness => beauti)
       – Vector Space Model: tf-idf weighting, cosine similarity
     • Measuring results: precision, recall and granularity, combined into an F-measure
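The slide compresses the whole vector-space toolbox into a few keywords. A minimal, self-contained sketch of that machinery follows; the toy suffix-stripping stemmer and the function names are illustrative assumptions, not the system's actual code (a real system would use a proper stemmer such as Porter's):

```python
import math
from collections import Counter

def stem(word):
    # Toy suffix stripping standing in for a real stemmer:
    # "beauty", "beautiful", "beautifulness" all reduce toward "beauti".
    for suffix in ("fulness", "ful", "ness", "y"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def tf_idf_vectors(documents):
    """Return one sparse tf-idf vector (term -> weight) per document."""
    tokenized = [[stem(w) for w in doc.lower().split()] for doc in documents]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: (1 + math.log(c)) * math.log(n / df[t])
         for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

For scoring, the PAN evaluation labs fold the three numbers into a single measure; quoting the standard PAN definition for context (not stated on the slide):

    plagdet = F₁ / log₂(1 + granularity)

where F₁ is the harmonic mean of precision and recall, and granularity penalizes reporting one plagiarized passage as several fragments.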
  4. Existing solutions
     • Many commercial systems exist (Turnitin, Antiplag, Ephorus, etc.)
     • They are general, topic-independent solutions
     • No open-source solutions that offer good results
     • No solutions specialized for Computer Science
     • Difficult to evaluate: a good corpus is needed (human-annotated; how to find plagiarized documents, etc.)
     • AuthentiCop – developed for specialized corpora, also evaluated on general texts
     • Corpora used:
       – PAN 2011 (the "evaluation lab on uncovering plagiarism, authorship, and social software misuse" at CLEF)
       – Bachelor theses @ A&C
  5. System architecture
     • Web interface for accessing AuthentiCop
       – Simple to add documents (text, PDF) and to highlight suspicious elements
  6. System architecture
     • Logical separation:
       – Front-end (PHP, JavaScript + AJAX, jQuery)
       – Back-end (C++)
       – Cross-language communication between the two
     • Scalable solution, easy to update:
       – The web server (front-end) and the plagiarism detection modules (back-end) may run on different machines
       – Plagiarism detection can be distributed across machines (distributed workers)
     • Several external open-source libraries are used (e.g. Apache Tika, CLucene)
  7. System architecture (architecture diagram)
  8. System architecture
     • Example: sequence of steps for processing PDF files – Apache Tika is used for transforming PDFs into text (see the sketch below)
     • Automatic build module for the back-end components
     • Automatic deployment system for the solution
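The slides do not say how Tika is invoked (the back-end is C++, so any binding shown here is only illustrative). A minimal sketch of the PDF-to-text step, assuming the tika-python bindings to Apache Tika and a hypothetical input file:

```python
from tika import parser  # pip install tika; spawns a local Tika server on first use

def pdf_to_text(path):
    """Extract plain text from a PDF via Apache Tika."""
    parsed = parser.from_file(path)          # dict with "content" and "metadata"
    return (parsed.get("content") or "").strip()

text = pdf_to_text("thesis.pdf")             # hypothetical input file
```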
  9. Detection of plagiarism
     • Two different problems:
       – Intrinsic plagiarism (analyze only the suspicious document)
       – External plagiarism (a reference collection to check against is also available)
     • How large is the collection? Online sources?
     • Source identification
     • Text alignment
  10. Detection of plagiarism
      Steps for external plagiarism detection:
      1. Candidate selection
         – Find pairs of suspicious texts
         – Combines source identification with text alignment
      2. Detailed analysis
      3. Post-processing
  11. Algorithms for candidate selection
      • Selection of plausible plagiarism pairs
      • Uses stop-word elimination, tf-idf weighting and cosine similarity
      • Initial hypothesis
      • The "Similarity Search Problem": All-Pairs, ppjoin (Prefix Filtering with Positional Information Join) – see the sketch below
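All-Pairs and ppjoin share one core trick: prefix filtering. The sketch below shows that idea for a Jaccard threshold t over token sets; it is a simplified illustration of the filtering principle, not ppjoin's positional filter itself, and the threshold value is an assumption:

```python
import math
from collections import defaultdict

def candidate_pairs(docs_tokens, t=0.8):
    """Prefix-filtering candidate generation for Jaccard threshold t.

    docs_tokens: list of token sets.  If tokens are sorted by global
    rarity, any pair with Jaccard similarity >= t must share at least
    one token within each side's prefix of length
    len(x) - ceil(t * len(x)) + 1, so only prefixes need indexing.
    """
    df = defaultdict(int)
    for tokens in docs_tokens:
        for tok in tokens:
            df[tok] += 1
    # Rarest-first ordering makes the prefixes maximally selective.
    ordered = [sorted(tokens, key=lambda tok: df[tok]) for tokens in docs_tokens]
    index = defaultdict(set)           # token -> ids whose prefix contains it
    pairs = set()
    for i, tokens in enumerate(ordered):
        prefix_len = len(tokens) - math.ceil(t * len(tokens)) + 1
        for tok in tokens[:prefix_len]:
            for j in index[tok]:
                pairs.add((j, i))      # candidates only; verify similarity later
            index[tok].add(i)
    return pairs
```

Candidates surviving this filter would then be verified with the actual similarity measure (tf-idf cosine, per the slide).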
  12. Algorithms for candidate selection
      • FastDocode (presented at PAN 2010) + caching + sub-linear merging
      • New approach (sketched below):
        – Text segments => fingerprints, indexed with Apache CLucene
        – Compute the number of inversions

      Parameter tuning for the fingerprinting approach:

      | N-gram length | Segment size | Retention rate | TP   | FP    | FN    | Time (h) | Plagdet |
      |---------------|--------------|----------------|------|-------|-------|----------|---------|
      | 3             | 150          | 10%            | 5413 | 44522 | 11469 | ~1       | 0.162   |
      | 4             | 150          | 10%            | 4913 | 10297 | 11969 | ~2       | 0.306   |
      | 4             | 150          | 30%            | 7633 | 35169 | 9249  | ~4.5     | 0.256   |
      | 5             | 150          | 20%            | 5194 | 6256  | 11688 | ~3       | 0.367   |

      Comparison of methods (on 1000 documents):

      | Method                    | TP  | FP   | FN   | Prec. | Recall | Plagdet |
      |---------------------------|-----|------|------|-------|--------|---------|
      | Fingerprinting & indexing | 685 | 494  | 761  | 0.581 | 0.474  | 0.522   |
      | FastDocode#3              | 634 | 4097 | 812  | 0.134 | 0.438  | 0.205   |
      | FastDocode#4              | 424 | 815  | 1022 | 0.342 | 0.293  | 0.316   |
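A minimal sketch of the fingerprinting idea, under assumptions the slides leave open: word n-grams are hashed, only a fraction (the "retention rate") is kept via mod-based selection, and retained fingerprints go into an index (CLucene in the real system; a plain dict here). The inversion count over matched positions is one way to read the slide's "compute the number of inversions":

```python
import zlib
from collections import defaultdict

def fingerprints(words, n=4, retention=0.10):
    """Hash word n-grams, keeping roughly `retention` of them (mod selection)."""
    modulus = max(1, round(1 / retention))
    kept = []
    for i in range(len(words) - n + 1):
        h = zlib.crc32(" ".join(words[i : i + n]).encode())  # stable across runs
        if h % modulus == 0:           # deterministic sub-sampling
            kept.append((h, i))        # keep position for ordering checks
    return kept

def index_segments(segments):
    """Fingerprint hash -> (segment id, position); stand-in for the CLucene index."""
    idx = defaultdict(list)
    for seg_id, words in enumerate(segments):
        for h, pos in fingerprints(words):
            idx[h].append((seg_id, pos))
    return idx

def count_inversions(positions):
    """O(n^2) inversion count; a low count means the matched fingerprints
    occur in the same order in both texts, suggesting a contiguous copy."""
    return sum(1 for i in range(len(positions))
                 for j in range(i + 1, len(positions))
                 if positions[i] > positions[j])
```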
  13. Algorithms for detailed analysis
      • DotPlot: the "Sequence Alignment Problem" (dot plot image; source: Wikipedia)
      • Modified FastDocode:
        – Extending the analysis to the right and to the left, starting from common words/passages (sketched below)
        – Using passages instead of words as seeds for the comparison
        – tf-idf weighting & cosine similarity
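A sketch of the seed-and-extend strategy described above: starting from a matching passage, grow the match leftwards and rightwards while the adjacent windows stay similar. The window size, the threshold, and the Jaccard stand-in for the passage similarity (the slide says tf-idf cosine) are all illustrative assumptions:

```python
def extend_match(src, susp, i, j, length, window=20, threshold=0.5):
    """Grow a seed match src[i:i+length] ~ susp[j:j+length] outwards.

    src, susp: token lists.  similar() can be any passage-level measure;
    a Jaccard overlap stands in here for the tf-idf cosine of the slide.
    """
    def similar(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(1, len(sa | sb)) >= threshold

    # Extend to the left while the preceding windows still look alike.
    while i >= window and j >= window and \
          similar(src[i - window : i], susp[j - window : j]):
        i -= window; j -= window; length += window
    # Extend to the right symmetrically.
    while i + length + window <= len(src) and j + length + window <= len(susp) and \
          similar(src[i + length : i + length + window],
                  susp[j + length : j + length + window]):
        length += window
    return i, j, length
```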
  14. Algorithms for post-processing
      • Semantic analysis using LSA:
        – Built a semantic space from Computer Science papers (and pages from Wikipedia)
        – Gensim framework in Python
      • Smith-Waterman algorithm:
        – Dynamic programming
        – Similar to the longest common subsequence
        – Insert and delete operations may have arbitrary costs (they may be greater than 1)
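A minimal sketch of an LSA semantic space in Gensim, as the slide names it; the corpus placeholders, the topic count, and the query step are assumptions, not the system's configuration:

```python
from gensim import corpora, models, similarities

documents = ["..."]   # CS papers and Wikipedia pages (placeholder corpus)
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=300)  # "semantic space"
index = similarities.MatrixSimilarity(lsi[bow_corpus])

passage = "..."       # passage flagged by the detailed analysis (placeholder)
sims = index[lsi[dictionary.doc2bow(passage.lower().split())]]  # cosine in LSA space
```

And the classic Smith-Waterman recurrence over token sequences, with the configurable gap costs the slide mentions (the score values here are illustrative):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score of token lists a, b.

    Unlike longest-common-subsequence, insertions/deletions carry an
    arbitrary penalty (gap), and scores are clamped at zero so the
    best *local* region dominates.
    """
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            best = max(best, h[i][j])
    return best
```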
  15. Results
      • Corpus: PAN 2011 (~22k documents)
      • Run time on a laptop: ~20 hours
      • Official results from PAN 2011:

      | Plagdet        | Recall         | Precision      | Granularity   |
      |----------------|----------------|----------------|---------------|
      | 0.221929185084 | 0.202996955425 | 0.366482242839 | 1.26150173611 |
  16. Results
      • Specific corpus for CS: 940 BSc theses + 8700 articles on CS from Wikipedia
      • Detecting theses written in English with TextCat: 307 BSc theses in English
      • Example of a detected pair:

      Plagiarized text: "The Canny edge detector uses a filter based on the first derivative of a Gaussian, because it is susceptible to noise present on raw unprocessed image data, so to begin with, the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."

      Original text from Wikipedia: "Because the Canny edge detector is susceptible to noise present in raw unprocessed image data, it uses a filter based on a Gaussian (bell curve), where the raw image is convolved with a Gaussian filter. The result is a slightly blurred version of the original which is not affected by a single noisy pixel to any significant degree."

      • Some elements are incorrectly identified as plagiarism: quotes, bibliographic references
  17. Conclusions
      • Improve the corpus
      • The system uses several parameters that were determined empirically => use machine learning to find the best values
      • Increase the processing speed
      • Improve the method: "bag of words" + information about word positions
      • Better post-processing is needed for real documents (such as scientific papers or theses)
  18. Thank you!
      • Questions?
      • Discussion
