Unit 1 Information Storage and Retrieval

COURSE CONTENTS
Unit I Introduction to Information Retrieval (06 hrs)
Basic Concepts of IR, Data Retrieval & Information Retrieval, Text Mining and IR Relation, IR System
Block Diagram. Automatic Text Analysis: Luhn's Ideas, Conflation Algorithm, Indexing and Index Term Weighting,
Probabilistic Indexing. Clustering Techniques: Single Pass Algorithm, Single Link Algorithm.
Text & Reference Books
• R. Baeza-Yates & B. Ribeiro-Neto, Modern Information Retrieval, Pearson Education, ISBN: 81-297-0274-6
• C.J. van Rijsbergen, Information Retrieval, 2nd ed. (www.dcs.gla.ac.uk), ISBN: 978- 408709293
CO 1:
Understand the concept of Information retrieval and apply
clustering in information retrieval.
Prepared By : Prof. Datta S. Shingate

• Retrieval - “Fetch something”
• Data - raw alphanumeric values.
• Information – Processed data.
• Knowledge – What we know.
• Types of Information
  • Text
  • Images
  • Audio
  • Video
  • Source code
  • Applications/Web services
  • XML and structured documents

Defining Data, Information, Knowledge & Wisdom

Definition of IR
• Goal
Find the documents most relevant to the user's query.
• Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that
deals with the organization, storage, retrieval and evaluation of
information from document repositories, particularly textual information.

Data Retrieval Vs Information Retrieval
• Data Retrieval: Retrieves data based on the keywords in the query entered by the user.
  Information Retrieval: Retrieves information based on the similarity between the query and the document.
• Data Retrieval: There is no room for error; a single erroneous result means the system has failed.
  Information Retrieval: Small errors are tolerated and will likely go unnoticed.
• Data Retrieval: The data has a well-defined structure and semantics.
  Information Retrieval: The information is ambiguous and has no well-defined structure.
• Data Retrieval: Provides solutions to the user of the database system.
  Information Retrieval: Does not provide a solution to the user of the database system.
• Data Retrieval: Produces exact results.
  Information Retrieval: Produces approximate results.
• Data Retrieval: Displayed results are not sorted by relevance.
  Information Retrieval: Displayed results are sorted by relevance.
• Data Retrieval: E.g. SQL
  Information Retrieval: E.g. Google Search Engine

Text mining and Information Retrieval (IR)
Text mining is the process of extracting useful information and patterns from
large volumes of text.

IR System Block Diagram
Fig: Typical IR System (Black Box)
Fig: Information Retrieval (IR) Process

Evaluation Criteria
• Recall: the proportion of all relevant documents in the collection that is retrieved.
$$\text{Recall} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of relevant documents in the collection}} \times 100$$
• Precision: the proportion of the retrieved documents that is relevant.
$$\text{Precision} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of documents retrieved}} \times 100$$
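The two measures can be checked with a few lines of Python. This is a minimal sketch: the function name `precision_recall` and the document identifiers are made up for illustration, and both measures are reported as percentages to match the formulas above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 6 relevant documents in the collection.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7", "d8"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {precision:.1f}%, recall = {recall:.1f}%")
```

Here two of the four retrieved documents are relevant, and two of the six relevant documents were found, so precision is higher than recall.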

Automatic Text Analysis
1. Document Representative
2. Text Summarization
3. Luhn’s Idea
Fig: Document → Document Representative (labels: Predictions from Frequency of Words, Conflation Algorithm)

Luhn’s Idea
Luhn's idea says:
-> Words with too low a frequency are not significant.
-> Words with too high a frequency (stop words, e.g. “is”, “and”) are also not significant.
-> Removing low-frequency words is easy:
-- Set a minimum frequency threshold.
-> Removing common (high-frequency) words:
-- Set a maximum frequency threshold (statistically obtained), or
-- Compare against a common-word list.
-> Used for summarizing technical documents.
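The frequency cut-offs above can be sketched as follows. This is only an illustration: the function name `significant_words`, the threshold values, the sample sentence and the common-word list are all assumptions, not values from Luhn's work.

```python
import re
from collections import Counter


def significant_words(text, min_freq=2, max_freq=None, common_words=frozenset()):
    """Keep words whose frequency lies between the two cut-offs."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    kept = {}
    for word, freq in counts.items():
        if freq < min_freq:
            continue  # too rare to be significant
        if max_freq is not None and freq > max_freq:
            continue  # too frequent to be significant
        if word in common_words:
            continue  # listed as a common (stop) word
        kept[word] = freq
    return kept


# Made-up sample text and common-word list.
sample = ("the index is built from the text and the index is stored; "
          "retrieval uses the index and the query")
COMMON = frozenset({"the", "is", "and", "from", "a", "of"})
result = significant_words(sample, min_freq=2, max_freq=4, common_words=COMMON)
print(result)
```

In this sample "the" is dropped by the maximum threshold, "is" and "and" by the common-word list, and every word that occurs only once by the minimum threshold, which leaves "index".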

Conflation Algorithm
1. Open and read each input file and create a single index file.
2. Remove high-frequency words (stop words).
3. Remove suffixes (and other affixes) from each word, if present.
4. Detect equivalent stems.
5. Store the stems in the index file.
Examples:
{compute, computer, computing} → comput
{walks, walking, walker} → walk
{develop, developing, development, developments} → develop
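The steps above can be sketched in Python. This is a deliberately simplified illustration, not a standard stemmer such as Porter's: the stop-word list, the suffix list, the minimum stem length and the names `stem` and `conflate` are all assumptions chosen so that the three examples above come out as shown. An in-memory dictionary stands in for the index file, and two stems are treated as equivalent only when they are identical strings.

```python
import re

# Illustrative lists only; a real system would use much larger ones.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})
SUFFIXES = sorted(["ments", "ment", "ing", "ers", "er", "es", "s", "e"],
                  key=len, reverse=True)  # try the longest suffix first


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflate(documents):
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in STOP_WORDS:
                continue  # step 2: drop stop words
            index.setdefault(stem(word), set()).add(doc_id)  # steps 3 to 5
    return index


docs = {"d1": "The computer is computing",
        "d2": "Walking to compute the development"}
index = conflate(docs)
print(index)
```

Because "computer", "computing" and "compute" all reduce to the same stem, a query on any one of them would reach both documents.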

Clustering in Information Retrieval
Fig: A documents collection grouped into Medical, Legal and Financial clusters

Clustering in Information Retrieval
Similarity matrix
Objects: {1,2,3,4,5,6}
Threshold: 0.89
Graph Theoretic Approach
C1: {1, 4, 5, 6}
C2: {2}
C3: {3}
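One way to read the graph theoretic approach is: connect two objects whenever their similarity reaches the threshold, then take each connected component of the resulting graph as a cluster. The sketch below assumes that reading (including the choice of "greater than or equal" for the threshold test). The function name `threshold_clusters` and the similarity values for four objects A to D are made up for illustration and are not the matrix from the example above.

```python
def threshold_clusters(objects, similarity, threshold):
    """Cluster objects as connected components of the thresholded graph."""
    neighbours = {obj: set() for obj in objects}
    for (a, b), score in similarity.items():
        if score >= threshold:  # assumption: ties count as similar
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in objects:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Made-up pairwise similarities (upper triangle of a symmetric matrix).
similarity = {("A", "B"): 0.95, ("A", "C"): 0.40, ("A", "D"): 0.10,
              ("B", "C"): 0.90, ("B", "D"): 0.20, ("C", "D"): 0.30}
clusters = threshold_clusters(["A", "B", "C", "D"], similarity, threshold=0.89)
print(clusters)
```

A and C end up in the same cluster even though their direct similarity is low, because both are linked to B. Requiring every pair in a cluster to be linked (maximal complete subgraphs) would give a stricter grouping.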

Jaccard’s Similarity Example
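Jaccard's similarity of two sets A and B is the size of their intersection divided by the size of their union, |A ∩ B| / |A ∪ B|. A minimal sketch, assuming each document is represented as its set of words; the two short documents are made up for illustration, and returning 1.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


# Made-up documents, tokenized on whitespace.
doc1 = "information retrieval system".split()
doc2 = "information storage and retrieval".split()
score = jaccard(doc1, doc2)
print(score)
```

The two documents share 2 words ("information", "retrieval") out of 5 distinct words overall, so the score is 2/5.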

Single Pass Clustering Algorithm
1. Assign the first document D1 as the representative for C1.
2. For the next document Di, calculate the similarity S with the representative of each existing cluster.
3. If the maximum similarity Smax is greater than a threshold value ST, add Di to the corresponding cluster
and recalculate that cluster's representative; otherwise, use Di to initiate a new cluster.
4. If any document Di remains to be clustered, return to step 2.
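The four steps can be sketched as below. Several choices here are assumptions rather than part of the algorithm as stated: documents are given as term-weight vectors, similarity is the cosine measure, and the cluster representative is the centroid (mean) of its members. The vectors and the threshold are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def single_pass(vectors, threshold):
    """Cluster vectors in one pass; returns lists of document indices."""
    clusters = []  # each cluster: {"members": [...], "rep": centroid}
    for i, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for cluster in clusters:  # step 2: compare with every representative
            sim = cosine(vec, cluster["rep"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:  # step 3: join best cluster
            best["members"].append(i)
            n = len(best["members"])
            best["rep"] = [(r * (n - 1) + x) / n for r, x in zip(best["rep"], vec)]
        else:  # steps 1 and 3: start a new cluster
            clusters.append({"members": [i], "rep": list(vec)})
    return [c["members"] for c in clusters]


# Made-up term-weight vectors for five documents.
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2], [1, 0.1, 0]]
result = single_pass(vectors, threshold=0.8)
print(result)
```

The outcome of a single pass method depends on the order in which the documents are processed, so reordering `vectors` can change the clusters.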

Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and
that we are interested in clustering the terms using the single pass
method. The threshold value is 10.


Single Link Clustering Algorithm
Dissimilarity matrix:
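A minimal sketch of single link clustering on a dissimilarity matrix follows. It assumes the agglomerative form of the method: start with every object in its own cluster and repeatedly merge the two clusters whose closest members are least dissimilar, stopping once that dissimilarity exceeds a chosen level. The function name `single_link`, the 5×5 matrix and the stopping level are made-up illustrative values.

```python
def single_link(dissimilarity, threshold):
    """Merge clusters by nearest-member distance until it exceeds threshold."""
    clusters = [{i} for i in range(len(dissimilarity))]
    merges = []  # history of (level, cluster_a, cluster_b)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the two closest members.
                d = min(dissimilarity[i][j]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # no remaining pair is close enough
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], merges


# Made-up symmetric dissimilarity matrix for objects 0..4.
D = [[0, 2, 6, 10, 9],
     [2, 0, 5, 9, 8],
     [6, 5, 0, 4, 5],
     [10, 9, 4, 0, 3],
     [9, 8, 5, 3, 0]]
clusters, merges = single_link(D, threshold=4)
print(clusters)
print(merges)
```

The merge history is the information a dendrogram would display: objects 0 and 1 join at level 2, objects 3 and 4 at level 3, and object 2 joins the latter pair at level 4.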
