The document discusses query formulation approaches for similarity search. It describes term extraction and topic extraction methods. Term extraction involves using algorithms like TF-IDF and RIDF to automatically extract relevant terms from a given corpus. Topic extraction uses topic models to cluster words that frequently occur together and connect words with similar meanings. MALLET and MAUI are tools mentioned for topic modeling, with MALLET providing efficient topic inference and MAUI identifying significant topics in documents.
Project progress
1. IST 441 : Progress Report
Query Formulation for Similarity
Search
Student : Nitish Upreti
Customer : Kyle Williams
nzu100@cse.psu.edu
kwilliams@psu.edu
2. What is Similarity Search?
• Given a sample document and a standard Web
search engine, the goal is to find similar
documents.
• Multiple Applications of Similarity Search :
Plagiarism Detection
The process of locating plagiarized passages in a
suspicious document using sources from the web.
Research Paper Recommendation
Finding relevant documents for research paper
recommendation.
4. Enter JateToolkit
Java Automatic Term Extraction toolkit
A library of state-of-the-art term extraction
algorithms and framework for developing term
extraction algorithms.
https://code.google.com/p/jatetoolkit/
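To make the link between term extraction and query formulation concrete, here is a minimal sketch in Python that scores the terms of a sample document with TF-IDF and joins the top terms into a search query. It does not use JATE or reproduce its API; the function names, the smoothed IDF formula, the crude tokenizer, and the toy corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_query(sample, corpus, k=3):
    """Return the k highest-scoring TF-IDF terms of `sample` as a query string."""
    docs = [set(tokenize(d)) for d in corpus]
    n_docs = len(docs)
    tf = Counter(tokenize(sample))
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d)
        # Smoothed IDF (an assumed variant) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = freq * idf
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return " ".join(ranked[:k])


# Toy background corpus and sample document, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell on monday",
]
sample = "plagiarism detection finds plagiarism in the suspicious document"
query = tfidf_query(sample, corpus, k=2)
print(query)
```

The resulting string could then be submitted to a standard Web search engine. Common words such as "the" rank low because they appear in every background document.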
8. Topic Extraction
• Topic models provide a simple way to analyze
large volumes of unlabeled text.
• A "topic" consists of a cluster of words that
frequently occur together.
• Using contextual clues, topic models can
connect words with similar meanings and
distinguish between uses of words with
multiple meanings.
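As an illustration of how a topic model groups co-occurring words, the following is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, written in plain Python. It is not the MALLET implementation; the function names, the hyperparameter values (alpha, beta), the iteration count, and the four toy documents are assumptions chosen for the example. The word "bank" appears in both a finance and a river context, which is the kind of ambiguity a topic model resolves from surrounding words.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]  # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics  # tokens per topic
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * v)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, ties broken alphabetically."""
    return [sorted(tw, key=lambda w: (-tw[w], w))[:n] for tw in topic_word]


# Toy documents invented for this example.
docs = [
    "bank loan money bank interest".split(),
    "river bank water fish river".split(),
    "money loan interest money bank".split(),
    "water fish river water bank".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {words}")
```

The sampler is stochastic, so the exact topics depend on the seed and on the number of iterations; on a corpus this small the output should be read as a demonstration of the mechanics rather than a meaningful result.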
9. MALLET + MAUI
• The MALLET topic model package includes
extremely fast and efficient methods for
document-topic hyperparameter optimization,
and tools for inferring topics for new documents
given trained models.
• MAUI identifies topics in a given document in
two stages: candidate generation, then
filtering, which analyzes the properties, or
features, of the candidate topics and selects
the most significant ones.
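The candidate generation and filtering steps described for MAUI can be sketched roughly as follows. This is not MAUI's code or API: the stopword list, the n-gram candidate rule, the three features (frequency, phrase length, position of first occurrence), and the hand-written scoring rule are all assumptions that stand in for MAUI's own feature analysis.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not MAUI's list).
STOPWORDS = {"a", "an", "and", "for", "in", "into", "is", "of", "on", "the", "to"}


def generate_candidates(text, max_len=2):
    """Stage 1: all n-grams (n <= max_len) not starting or ending with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = []  # (phrase, position of its first token)
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.append((" ".join(gram), i))
    return candidates, len(tokens)


def filter_candidates(candidates, n_tokens, k=3):
    """Stage 2: compute features for each candidate and keep the k best."""
    freq = Counter(phrase for phrase, _ in candidates)
    first = {}
    for phrase, pos in candidates:
        first.setdefault(phrase, pos)  # keeps the earliest position seen

    def score(phrase):
        # Hypothetical rule: frequent, longer, earlier phrases score higher.
        earliness = 1.0 - first[phrase] / n_tokens
        return freq[phrase] * len(phrase.split()) * earliness

    ranked = sorted(freq, key=lambda p: (-score(p), p))
    return ranked[:k]


# Toy input invented for this example.
text = ("Topic models analyze large text collections. "
        "Topic models group words into topics.")
candidates, n_tokens = generate_candidates(text)
topics = filter_candidates(candidates, n_tokens, k=3)
print(topics)
```

Separating the two stages keeps candidate generation cheap and permissive, while the filtering stage carries the judgment about which candidates matter; a learned model could replace the scoring function without changing stage 1.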