Information Retrieval basic presentation

HINDUSTHAN INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
20CS512 – INFORMATION RETRIEVAL
Dr. M. Thangamani

Components of an Information Retrieval System (IRS)
• Document Collection: The document collection is the set of documents that the IRS indexes
and searches. Documents can be in various formats, including text, images, audio, video, or a
combination of these.
• Preprocessing: Preprocessing involves transforming the raw documents into a form that can
be easily indexed and searched. This may include tokenization, stemming, stop-word
removal, and other text processing techniques.
• Indexing: Indexing is the process of creating a data structure (such as an inverted index) that
allows for efficient searching and retrieval of documents. The index maps terms or keywords
to the documents in which they appear.
• Query Processing: Query processing involves interpreting and transforming the user's query
into a form that can be matched against the indexed documents. This may include parsing,
tokenization, stemming, and other query processing techniques.
• Search and Retrieval: Search and retrieval involves searching the index to find documents
that match the query. This may involve ranking the documents based on their relevance to
the query and returning the most relevant documents to the user.
• User Interface: The user interface is the front end of the IRS that allows users to enter
queries, view search results, and interact with the system. The user interface may include
features such as faceted search, filtering, sorting, and visualization.
• Relevance Feedback: Relevance feedback involves collecting feedback from users about the
relevance of the retrieved documents and using this feedback to improve the search results.
This may involve adjusting the search algorithm, updating the index, or re-ranking the
documents.
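The preprocessing, indexing, query processing, and retrieval components above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the stop-word list, the function names (`preprocess`, `build_index`, `search`), and the sample query are all hypothetical. Stemming is omitted for brevity, and ranking is a simple count of matching query terms.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "also", "i"}

def preprocess(text):
    """Tokenize, lowercase, and drop stop words (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(collection):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in preprocess(query):  # the query gets the same preprocessing
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# A tiny document collection.
collection = {
    "D1": "Taj mahal is a beautiful monument.",
    "D2": "Victoria Memorial is also a monument.",
    "D3": "I like to visit agra.",
}

index = build_index(collection)
results = search(index, "beautiful monument")  # hypothetical query
print(results)
```

Applying the same `preprocess` function to both documents and queries is the key design choice here: a query term can only match an index term if both went through identical transformations.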

[Figure: process of querying, indexing and retrieval in an IR system]

Information Retrieval Models
• Basic Information Retrieval Model
• Classical IR Model
• Boolean Model
• Vector Space Model
• Probabilistic Model

Information Retrieval Models (continued)
• Boolean Model: This model is based on Boolean logic,
where a query is formulated using logical operators such as
AND, OR, and NOT. The model returns documents that
satisfy the Boolean expression specified in the query.
• Vector Space Model (VSM): This model represents
documents and queries as vectors in a high-dimensional
space, where each dimension corresponds to a term. The
similarity between a document and a query is measured
using a similarity metric, such as the cosine similarity.
• Probabilistic Model: This model estimates the probability
of a document being relevant to a query based on the
probability of the query terms appearing in the document.
One of the most well-known probabilistic models is the
Binary Independence Model (BIM).
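The Boolean and vector space models described above can be compared on a small example. The sketch below is illustrative only: the helper names (`terms`, `cosine`), the Boolean expression, and the query text are assumptions, and the vectors hold raw term counts rather than TF-IDF weights.

```python
import math
from collections import Counter

docs = {
    "D1": "Taj mahal is a beautiful monument.",
    "D2": "Victoria Memorial is also a monument.",
    "D3": "I like to visit agra.",
}

def terms(text):
    """Split on whitespace, strip the trailing period, lowercase."""
    return [w.strip(".").lower() for w in text.split()]

# Boolean model: each term maps to the set of documents containing it.
postings = {}
for doc_id, text in docs.items():
    for t in set(terms(text)):
        postings.setdefault(t, set()).add(doc_id)

# Hypothetical query: monument AND NOT beautiful
# (AND NOT is a set difference; AND would be &, OR would be |).
boolean_hits = postings.get("monument", set()) - postings.get("beautiful", set())

# Vector space model: term-count vectors compared by cosine similarity.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_vec = Counter(terms("beautiful monument"))  # hypothetical query
ranking = sorted(
    docs,
    key=lambda d: cosine(query_vec, Counter(terms(docs[d]))),
    reverse=True,
)

print(boolean_hits)
print(ranking)
```

The Boolean model returns an unordered set of exact matches, while the vector space model orders every document by similarity, so partially matching documents still appear in the ranking.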

Visualization Interface
• Scatter Plots: Scatter plots are used to display the relationship between two variables, and
they can be used to show the distribution of search results based on two different dimensions,
such as relevance and publication date.
• Bar Charts: Bar charts are used to display the frequency or count of items in different
categories, and they can be used to show the distribution of search results based on different
facets, such as document type, author, or topic.
• Line Charts: Line charts are used to display trends over time, and they can be used to show
the distribution of search results based on a temporal dimension, such as publication date or
time of access.
• Heat Maps: Heat maps are used to display data in a matrix format, where the values are
represented by colors. They can be used to show the distribution of search results based on
two dimensions, such as relevance and publication date.
• Word Clouds: Word clouds are used to display the frequency of terms in a text, with more
frequent terms displayed in larger font sizes. They can be used to show the distribution of
terms in the search results or to highlight the most relevant terms in a document.
• Network Graphs: Network graphs are used to display relationships between entities, such as
documents, authors, or topics. They can be used to show the connections between documents
based on citations, references, or other relationships.
• Map Visualizations: Map visualizations are used to display geographical data, and they can be
used to show the distribution of search results based on geographical dimensions, such as
location or region.
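As one concrete case, the data behind a word cloud is just a term-frequency table scaled to font sizes. The sketch below assumes a linear scaling between a smallest and largest point size. The function name `font_size` and the size limits are hypothetical, and no plotting library is involved.

```python
import re
from collections import Counter

# Text of the search results to summarize (sample sentences).
texts = [
    "Taj mahal is a beautiful monument.",
    "Victoria Memorial is also a monument.",
    "I like to visit agra.",
]

# Count how often each lowercased word occurs across all texts.
freq = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
max_freq = max(freq.values())

def font_size(term, smallest=10, largest=40):
    """Scale a term's frequency linearly into a font size (assumed scheme)."""
    return smallest + (largest - smallest) * freq[term] / max_freq

for term, count in freq.most_common(5):
    print(term, count, font_size(term))
```

In practice stop words such as "is" and "a" would usually be removed first, since they would otherwise dominate the cloud.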

Term-Document Matrix Exercise
• Construct the term-document matrix for the following documents and query.
Documents:
1. Taj mahal is a beautiful monument.
2. Victoria Memorial is also a monument.
3. I like to visit agra.

Term-Document Matrix

Term        D1  D2  D3
Taj          1   0   0
mahal        1   0   0
is           1   1   0
a            1   1   0
beautiful    1   0   0
monument     1   1   0
Victoria     0   1   0
Memorial     0   1   0
also         0   1   0
I            0   0   1
like         0   0   1
to           0   0   1
visit        0   0   1
agra         0   0   1

In this matrix, the rows represent the terms and the columns represent the documents (D1, D2, and D3). Each entry is the frequency of the term in that document. For example, the term "Taj" appears once in document D1 and not in documents D2 and D3, so its row is (1, 0, 0).
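The construction can be checked with a short script. This is a minimal sketch, assuming terms are whole alphabetic words with their original capitalization and that rows are ordered by first appearance. The names `tokenize` and `term_document_matrix` are hypothetical. Because tokens are matched as whole words, "a" is counted in D1 and D2 only; "agra" in D3 is a separate term.

```python
import re

docs = {
    "D1": "Taj mahal is a beautiful monument.",
    "D2": "Victoria Memorial is also a monument.",
    "D3": "I like to visit agra.",
}

def tokenize(text):
    """Return the alphabetic words of a sentence, keeping their case."""
    return re.findall(r"[A-Za-z]+", text)

def term_document_matrix(docs):
    """Return (vocabulary, {term: [frequency in each document]})."""
    vocab = []
    for text in docs.values():
        for tok in tokenize(text):
            if tok not in vocab:  # keep order of first appearance
                vocab.append(tok)
    matrix = {
        term: [tokenize(text).count(term) for text in docs.values()]
        for term in vocab
    }
    return vocab, matrix

vocab, matrix = term_document_matrix(docs)

print("Term".ljust(10), *docs.keys())
for term in vocab:
    print(term.ljust(10), *matrix[term])
```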
