Final presentation
Term Extraction in Plagiarism Detection

Published in: Engineering, Education, Technology

  1. IST 441: Query Formulation for Similarity Search
     Student: Nitish Upreti (nzu100@cse.psu.edu)
     Customer: Kyle Williams (kwilliams@psu.edu)
  2. Outline
     • Introduction
     • Motivation
     • Challenges with Similarity Search
     • Background & Reference Point
     • Approaches to Similarity Search
     • Our Approach to the Problem
     • JateToolkit Introduction
     • Solution Architecture
     • Evaluation
     • Conclusion
  3. What is Similarity Search?
     “Given a sample document and a standard Web search engine, the goal is to find documents similar to the given document.”
     What makes a document similar?
     • Cosine similarity
     • Citation similarity
     • Code similarity
     • Multimedia content similarity
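Cosine similarity, the first facet listed above, has a compact definition worth making concrete. The sketch below scores two documents by the cosine of the angle between their raw term-frequency vectors; it is a minimal illustration of the idea, not the project's actual implementation.

```python
from collections import Counter
import math

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents over raw term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product only needs terms shared by both documents.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Identical documents score 1.0 and documents with no shared vocabulary score 0.0, which is what makes the measure usable as a similarity ranking.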
  4. Motivation
     Plagiarism Detection: the process of locating instances of plagiarism in a suspicious document using sources from the web. Example: Turnitin™.
     Content Recommendation: recommending articles from credible news sources based on social media entities such as tweets.
     Academic Scenario: finding relevant documents for research paper recommendation.
  5. Challenges Involved
     • Constructing queries from the sample document in order to find similar documents is not obvious.
     • Scalability imposes constraints on the maximum number of queries issued and results downloaded.
     • Capturing different facets of similarity: how can queries be general enough to capture the theme yet specific enough to capture unique document attributes? (Domain dependent.)
  6. BACKGROUND
  7. The Big Picture
     Credits: “Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring”, Notebook for PAN at CLEF 2013.
  8. Our Reference Point
     • Source retrieval is the KEY component (it dictates the possibility of all later steps).
     • Query formulation is at the heart of this problem.
     • Challenges:
       – How can we design better algorithms to formulate accurate queries?
       – What has been done, and what can be explored?
  9. Our Reference Point (contd.)
     • CLEF: Conference and Labs of the Evaluation Forum.
     • The PAN Lab centers on plagiarism, authorship, and social software misuse:
       – Author Identification
       – Author Profiling
       – Plagiarism Detection
     • Evaluation is possible in the plagiarism domain.
  10. Approaching Similarity Search
      Major classes of approaches to similarity search:
      • Choosing sentences from the text corpus.
      • Choosing a set of generic keywords.
      • Term extraction algorithms.
      • Topic mining for the document using machine learning techniques.
      Mix and match these ideas, and apply well-known tweaks depending on the scenario. (Most of this is experimental.)
  11. Query Formulation Approach
      Term extraction: the automatic extraction of relevant terms from a given corpus.
  12. Approach (contd.)
      • Central theme: term extraction algorithms.
      • Approach similarity search in the context of term extraction algorithms.
      • Design a framework which incorporates these algorithms.
      • Evaluate the algorithms.
      • Document all the approaches.
  13. Enter JateToolkit
      Java Automatic Term Extraction toolkit: a library of state-of-the-art term extraction algorithms and a framework for developing term extraction algorithms.
      https://code.google.com/p/jatetoolkit/
  14. Term Extraction Approaches
      Term extraction algorithms:
      – TF-IDF
      – RIDF
      – Weirdness
      – C-value
      – GlossEx
      – TermEx
      Open-ended project, work in progress:
      – Justeson & Katz algorithm
      – NC-value algorithm
      – RAKE algorithm
      – Chi-squared algorithm
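The simplest scorer in the family above is TF-IDF: a term is important in a document if it is frequent there but rare across the corpus. JATE implements this in Java; the Python sketch below is purely illustrative of the formula.

```python
import math
from collections import Counter

def tfidf_scores(corpus: list[str]) -> list[dict[str, float]]:
    """Score every term of every document by TF-IDF over a small corpus."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in docs for t in set(doc))
    scored = []
    for doc in docs:
        tf = Counter(doc)
        scored.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scored
```

Note that a term occurring in every document gets an IDF of log(1) = 0, so corpus-wide terms are automatically suppressed as query candidates.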
  15. Solution Architecture
  16. Phase 1: Pre-Processing
  17. Pre-Processing the Document: Stop List
      Extremely common words, which appear to be of little value in helping select documents matching a user need, are excluded from the vocabulary entirely. These words form the stop list.
      • Use Jate's built-in “StopList” for filtering.
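Stop-list filtering amounts to a set-membership check per token. The sketch below uses a tiny hand-picked stop list for illustration only; the project uses Jate's built-in list, which is far larger.

```python
# Tiny illustrative stop list; Jate's real StopList is much more complete.
STOP_LIST = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop extremely common words that carry little selective value."""
    return [t for t in tokens if t.lower() not in STOP_LIST]
```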
  18. Pre-Processing the Document (contd.): Lemmatization
      Group the different inflected forms of a word that occur in the document into a single item so they can be analyzed together.
      Example: “run, runs, ran and running are forms of the same lexeme, with run as the lemma.”
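To make the run/runs/ran/running example concrete, here is a deliberately crude lemmatizer sketch: an exception lookup for irregular forms plus naive suffix stripping. The exception table is hypothetical; real lemmatizers (as used in the actual pipeline) rely on full morphological dictionaries.

```python
# Hypothetical lookup table for irregular forms; real systems use a dictionary.
LEMMA_EXCEPTIONS = {"ran": "run", "running": "run", "better": "good"}

def lemmatize(token: str) -> str:
    """Crude lemmatizer: exception lookup, then strip common inflectional suffixes."""
    t = token.lower()
    if t in LEMMA_EXCEPTIONS:
        return LEMMA_EXCEPTIONS[t]
    for suffix in ("ing", "es", "s"):
        # Keep at least a 3-letter stem so short words are left alone.
        if t.endswith(suffix) and len(t) - len(suffix) >= 3:
            return t[: -len(suffix)]
    return t
```

With this sketch, all four inflected forms from the slide's example collapse onto the single lemma "run".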
  19. Phase 2: Candidate Term Extraction
  20. Candidate Term Extraction
      Approaches to candidate term extraction:
      1. Simply extracting single words as candidate terms (naïve approach).
      2. A generic n-gram extractor that extracts n-grams.
      Final approach: the OpenNLP NPE (Noun Phrase Extractor), which extracts noun phrases as candidate terms.
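The second candidate-generation approach above, a generic n-gram extractor, can be sketched in a few lines: every contiguous span of 1 to n_max tokens becomes a candidate term.

```python
def ngrams(tokens: list[str], n_max: int = 3) -> list[tuple[str, ...]]:
    """Extract all contiguous 1..n_max grams as candidate terms."""
    return [
        tuple(tokens[i : i + n])
        for n in range(1, n_max + 1)      # for each gram length n
        for i in range(len(tokens) - n + 1)  # slide a window of size n
    ]
```

Noun-phrase extraction differs in that it keeps only the windows that a chunker labels as noun phrases, which is why it produces far fewer, more meaningful candidates on this kind of corpus.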
  21. Why are the other two approaches worth mentioning?
      The performance of term extraction algorithms is corpus dependent. (Our dataset was more receptive to NPE.)
  22. Phase 3: Index Building
  23. Building the Document Index
      • Use the Jate toolkit to build a corpus index (a prerequisite for term extraction).
      • Storage options: memory based, disk-resident file, or export to HSQL (HyperSQL).
  24. Phase 4: Building Statistical Features
  25. Building Features for the Jate Toolkit
      • Word count.
      • Feature Corpus Term Frequency: a feature store that contains information about term distributions over a corpus.
      • Feature Term Nest Frequency: a feature store that contains information about nested terms. Example: “hedgehog” is a nested term in “European hedgehog”.
      • Executing a single-threaded or multithreaded client.
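The nested-term feature from the hedgehog example can be sketched as a substring check over the candidate list: for each candidate, count how many longer candidates contain it as a whole-word subsequence. This is only an illustration of the statistic, not JATE's actual feature-store implementation.

```python
def nest_frequency(terms: list[str]) -> dict[str, int]:
    """Count how often each candidate term appears nested inside another candidate."""
    counts: dict[str, int] = {}
    for t in terms:
        # Pad with spaces so only whole-word containment counts
        # ("hog" must not match inside "hedgehog").
        counts[t] = sum(
            1 for other in terms
            if other != t and f" {t} " in f" {other} "
        )
    return counts
```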
  26. Phase 5: Register and Execute Algorithms
      Jate output file format:
      term { variations } score
      The output file is arranged in descending order of score.
  27. Phase 6: Post-Processing
      Writing an output file suitable for submission. Format:
      DocumentId { query terms }
      (Maximum of 10 non-repeating query terms.)
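The post-processing step above reduces to picking the top-scoring non-repeating terms and emitting one submission line per document. A minimal sketch, assuming scored terms arrive as (term, score) pairs (the function names here are hypothetical, not the project's):

```python
def select_query_terms(scored_terms: list[tuple[str, float]], limit: int = 10) -> list[str]:
    """Pick the top-scoring non-repeating terms for the query line."""
    seen: set[str] = set()
    picked: list[str] = []
    for term, _ in sorted(scored_terms, key=lambda pair: pair[1], reverse=True):
        if term not in seen:
            seen.add(term)
            picked.append(term)
        if len(picked) == limit:
            break
    return picked

def format_submission_line(doc_id: str, scored_terms: list[tuple[str, float]]) -> str:
    """Emit one line in the submission format: DocumentId { query terms }."""
    return f"{doc_id} {{ {' '.join(select_query_terms(scored_terms))} }}"
```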
  28. Evaluation
      • Last year's PAN CLEF baseline: precision = 0.244388609715 (200 queries).
      • Performance of the term extraction algorithms (105 queries):
        1. IBM's GlossEx: 0.171428571429
        2. C-value: 0.0598255721489
        3. TermEx: 0.0635
        4. Weirdness: 0.03190851
        5. RIDF: 0.176470588235
        6. TF-IDF: 0.13058482157
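The precision figures above follow the standard definition: of all documents the formulated queries retrieved, what fraction are true sources? A one-line sketch (assuming retrieved and relevant are sets of document identifiers):

```python
def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant sources."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
```

So a run that retrieves four documents of which one is a true source scores 0.25, roughly the baseline's level.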
  29. Results
      • The code is live on GitHub: https://github.com/myth17/QF
      • Code, query logs, and the full results were submitted to Kyle.
      • Working on incorporating the other alpha-stage term extraction algorithms.
      • Future work: how can the results be improved and integrated with topic modeling?
  30. Questions? (Thank You!)
