Project progress

Published in: Technology, Education

  1. IST 441: Progress Report. Query Formulation for Similarity Search. Student: Nitish Upreti (nzu100@cse.psu.edu). Customer: Kyle Williams (kwilliams@psu.edu).
  2. What is Similarity Search? • Given a sample document and a standard Web search engine, the goal is to find similar documents. • Applications of similarity search: – Plagiarism detection: locating the Web sources of plagiarized passages in a suspicious document. – Research paper recommendation: finding relevant documents to recommend for a given paper.
  3. Query Formulation Approach: Term Extraction (automatic extraction of relevant terms from a given corpus)
  4. Enter JateToolkit: the Java Automatic Term Extraction toolkit. A library of state-of-the-art term extraction algorithms and a framework for developing new ones. https://code.google.com/p/jatetoolkit/
  5. Approaches for Query Formulation • Term extraction algorithms: – TF-IDF – RIDF – Weirdness – C-value – GlossEx – TermEx (alpha phase) – Justeson & Katz algorithm – NC-value algorithm – RAKE algorithm – Chi-squared algorithm
  6. Continuing On…
  7. Query Formulation Approach: Topic Extraction
  8. Topic Extraction • Topic models provide a simple way to analyze large volumes of unlabeled text. • A "topic" consists of a cluster of words that frequently occur together. • Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
  9. MALLET + MAUI • The MALLET topic model package includes fast, efficient methods for document-topic hyperparameter optimization, and tools for inferring topics for new documents given trained models. • MAUI generates candidate topics for a given document, analyzes the properties (features) of those candidates, and then filters them to keep the most significant ones.
  10. THANK YOU!
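
To make the term-extraction approach from slide 5 concrete, here is a minimal sketch of TF-IDF-based query formulation in plain Python. It is not the JateToolkit implementation and does not reflect the project's actual pipeline. The function names (`tokenize`, `formulate_query`), the tiny stopword list, and the use of a small background collection to estimate document frequency are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the slides).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "from", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def formulate_query(document, background, k=3):
    """Return the k highest-scoring TF-IDF terms of `document` as a query string.

    `background` is a list of other documents, used only to estimate
    document frequency (hypothetical setup for this sketch).
    """
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    corpus = [set(doc_tokens)] + [set(tokenize(d)) for d in background]
    n_docs = len(corpus)

    scores = {}
    for term, count in tf.items():
        # df >= 1 because the sample document itself is part of the corpus.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc_tokens)) * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return " ".join(ranked[:k])


if __name__ == "__main__":
    sample = "plagiarism detection finds plagiarism in documents"
    others = ["search engines index documents", "topic models analyze documents"]
    print(formulate_query(sample, others, k=2))
```

The resulting string would then be submitted to a standard Web search engine as the query. Terms that appear in every background document (here, "documents") get an IDF of zero and fall to the bottom of the ranking, which is the behavior one wants from a discriminative query.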
