  1. IST 441: Progress Report. Query Formulation for Similarity Search. Student: Nitish Upreti. Customer: Kyle Williams.
  2. What is Similarity Search? • Given a sample document and a standard Web search engine, the goal is to find similar documents. • Applications of similarity search: – Plagiarism detection: locating, on the web, the sources of plagiarized passages in a suspicious document. – Research paper recommendation: finding documents relevant to a given research paper.
  3. Query Formulation Approach: Term Extraction (automatic extraction of relevant terms from a given corpus)
  4. Enter JateToolkit (Java Automatic Term Extraction toolkit): a library of state-of-the-art term extraction algorithms and a framework for developing new term extraction algorithms.
  5. Approaches for Query Formulation • Term extraction algorithms: – TF-IDF – RIDF – Weirdness – C-value – GlossEx – TermEx (Alpha Phase) – Justeson & Katz algorithm – NC-value algorithm – RAKE algorithm – Chi-squared algorithm
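To make the first algorithm in the list concrete, here is a minimal sketch of TF-IDF-based query formulation: score each term of the sample document against a small background collection and join the top-scoring terms into a query. The function names (`tokenize`, `tfidf_query`), the tokenizer, and the unsmoothed `log(N / df)` weighting are assumptions made for illustration; this is not the JateToolkit implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_query(sample, background, k=3):
    """Return the k highest-scoring TF-IDF terms of `sample` as a query string.

    `background` is a list of reference documents used only to estimate
    document frequencies; the sample itself is counted as one more document.
    """
    tf = Counter(tokenize(sample))
    docs = [set(tokenize(d)) for d in background] + [set(tf)]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(doc)

    # Assumed weighting: raw term frequency times log(N / df).
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return " ".join(ranked[:k])
```

The resulting string can then be submitted to a search engine as the query for the sample document. Terms that occur in every document (such as "in") receive a score of zero and fall to the bottom of the ranking.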
  7. Query Formulation Approach: Topic Extraction
  8. Topic Extraction • Topic models provide a simple way to analyze large volumes of unlabeled text. • A "topic" consists of a cluster of words that frequently occur together. • Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.
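The idea that a topic is a cluster of words that frequently occur together can be illustrated by counting document-level co-occurrence. The sketch below is not a topic model; it only computes the raw signal that topic models build on, and the function name `cooccurrence` and the whitespace tokenization are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for each unordered word pair, how many documents contain both words.

    Pairs are stored as alphabetically ordered tuples, e.g. ("engine", "search").
    """
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts
```

Word pairs with high counts, such as two words that appear together in most documents about search, are the kind of pairs a topic model would tend to place in the same topic.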
  9. MALLET + MAUI • The MALLET topic model package includes fast, efficient methods for document-topic hyperparameter optimization, and tools for inferring topics for new documents given trained models. • MAUI identifies topics in a given document in two steps: candidate generation, which proposes candidate topics, and filtering, which analyzes the properties (features) of the candidates and keeps the most significant ones.
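The two-step pattern described for MAUI (generate candidates, then filter them by their features) can be sketched as follows. This is a simplified illustration under stated assumptions, not MAUI's actual code: the stopword list, the two features (frequency and first-occurrence position), the hand-picked ranking rule, and the names `candidates` and `top_topics` are all hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def candidates(text, max_len=2):
    """Step 1: generate n-gram candidates that neither start nor end with a stopword.

    Returns a list of (phrase, word_position) pairs.
    """
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                found.append((" ".join(gram), i))
    return found


def top_topics(text, k=2, max_len=2):
    """Step 2: compute simple features per candidate and keep the k best."""
    freq = Counter()   # feature 1: how often the candidate occurs
    first = {}         # feature 2: position of its first occurrence
    for phrase, pos in candidates(text, max_len):
        freq[phrase] += 1
        first.setdefault(phrase, pos)

    # Assumed ranking rule: more frequent first, then earlier, then alphabetical.
    ranked = sorted(freq, key=lambda p: (-freq[p], first[p], p))
    return ranked[:k]
```

A hand-written ranking rule stands in here for the filtering step; the selected phrases could then be used directly as query terms.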
