COURSE CONTENTS
Unit I Introduction to Information Retrieval (06 hrs)
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram. Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighting,
Probabilistic Indexing. Clustering Techniques: Single pass algorithm, Single Link algorithm
Text & Reference Book
Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, Pearson
Education, ISBN: 81-297-0274-6
C. J. van Rijsbergen, Information Retrieval, 2nd ed.
(www.dcs.gla.ac.uk), ISBN: 978- 408709293
CO 1:
Understand the concept of Information retrieval and apply
clustering in information retrieval.
Prepared By : Prof. Datta S. Shingate
• Retrieval - “Fetch something”
• Data - raw alphanumeric values.
• Information – Processed data.
• Knowledge – What we know.
• Types of Information
• Text
• Images
• Audio
• Video
• Source Code
• Applications/Web services
• XML and structured documents
Defining Data, Information, Knowledge & Wisdom
Definition of IR
• Goal
Find the documents most relevant to the user's query.
• Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that
deals with the organization, storage, retrieval and evaluation of
information from document repositories, particularly textual information.
Data Retrieval Vs Information Retrieval
Data Retrieval | Information Retrieval
• Retrieves data based on the keywords in the query entered by the user. | • Retrieves information based on the similarity between the query and the document.
• There is no room for errors since it results in complete system failure. | • Small errors are tolerated and will likely go unnoticed.
• It has a defined structure with respect to semantics. | • It is ambiguous and doesn’t have a defined structure.
• Provides solutions to the user of the database system. | • Does not provide a solution to the user of the database system.
• Data Retrieval system produces exact results. | • Information Retrieval system produces approximate results.
• Displayed results are not sorted by relevance. | • Displayed results are sorted by relevance.
• E.g.: SQL | • E.g.: Google Search Engine
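The contrast between exact results and relevance-ranked results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the function names `data_retrieval` and `information_retrieval`, and the scoring rule (fraction of query terms found in the document) are all hypothetical and are not taken from the slides.

```python
# Hypothetical toy collection; ids and texts are made up for illustration.
DOCS = {
    "d1": "information retrieval system",
    "d2": "database retrieval query",
    "d3": "cooking recipes",
}

def data_retrieval(query, docs=DOCS):
    """Exact match: return only documents containing every query term, unranked."""
    terms = set(query.lower().split())
    return sorted(doc_id for doc_id, text in docs.items()
                  if terms <= set(text.lower().split()))

def information_retrieval(query, docs=DOCS):
    """Approximate match: score each document by the fraction of query terms
    it contains (an assumed, very simple similarity) and sort by relevance."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        score = len(terms & set(text.lower().split())) / len(terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

print(data_retrieval("information retrieval"))         # exact, unranked
print(information_retrieval("information retrieval"))  # partial matches, ranked
```

For the query "information retrieval", the exact matcher returns only d1, while the ranked version also returns d2 as a partial match with a lower score.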
Text mining and Information Retrieval (IR)
Text mining is a process of extracting useful information and patterns from
a large volume of text databases.
IR System Block Diagram
Fig : Typical IR System (Black Box)
Fig : Information Retrieval (IR) Process
Evaluation Criteria
• Recall – is defined as the fraction of all relevant documents in the collection that are
retrieved.
\[ \text{Recall} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of relevant documents in the collection}} \times 100 \]
• Precision - is defined as the fraction of the retrieved documents that are
relevant.
\[ \text{Precision} = \frac{\text{No. of relevant documents retrieved}}{\text{Total no. of documents retrieved}} \times 100 \]
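As a worked illustration, the two measures can be computed from a set of retrieved document ids and a set of relevant document ids. The sketch below is a minimal one: the function name `precision_recall` and the document ids are hypothetical, and returning 0 for an empty denominator is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages.

    retrieved: ids of the documents the system returned
    relevant:  ids of all relevant documents in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumed convention: report 0 when a denominator is empty.
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 relevant in total, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5", "d6", "d7"})
print(p, r)  # precision = 2/4 * 100, recall = 2/5 * 100
```

Here 2 of the 4 retrieved documents are relevant (precision 50%), and 2 of the 5 relevant documents were retrieved (recall 40%).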
Automatic Text Analysis
1. Document Representative
2. Text Summarization
3. Luhn’s Idea
Fig : Document, Document Representative (labels: Predictions from Frequency of Words, Conflation Algorithm)
Luhn’s Idea
Stop words
Luhn’s idea says:
-> Words with very low frequency are not significant.
-> Words with very high frequency are also not significant
(e.g. “is”, “and”).
-> Removing low-frequency words is easy:
-- Set a minimum frequency threshold
-> Removing common (high-frequency) words:
-- Set a maximum frequency threshold
(statistically obtained)
-- Compare against a common-word list
-> Used for summarizing technical documents.
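The frequency cut-offs described above can be sketched as a simple filter over word counts. This is a minimal sketch, not Luhn's full method: the function name `significant_words`, the parameter names, the threshold values and the sample sentence are assumptions chosen for illustration.

```python
import re
from collections import Counter

def significant_words(text, min_freq=2, max_freq=None, common_words=frozenset()):
    """Keep words whose frequency lies between the two cut-offs
    and that do not appear in the common-word list."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {
        word: count
        for word, count in counts.items()
        if count >= min_freq                           # drop low-frequency words
        and (max_freq is None or count <= max_freq)    # drop high-frequency words
        and word not in common_words                   # drop listed common words
    }

# Hypothetical sample text and thresholds.
sample = ("the index is the key and the index term is the unit "
          "of retrieval retrieval")
print(significant_words(sample, min_freq=2, max_freq=3,
                        common_words={"is", "and"}))
```

In this sample, "the" is dropped by the maximum threshold, "is" by the common-word list, and words occurring once by the minimum threshold, leaving "index" and "retrieval".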
Conflation Algorithm
1. Open and read each input file and create a single index file.
2. Remove high-frequency words (stop words).
3. Remove all suffixes/affixes from each word, if present.
4. Detect equivalent stems.
5. Store the stems in the index file.
{Compute, Computer, Computing} → Comput
{Walks, Walking, Walker} → Walk
{develop, developing, development, developments } → develop
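The steps above can be sketched as follows. This is a toy illustration only: the stop-word list and the suffix list are hypothetical and deliberately tiny, the stemmer is a naive suffix stripper rather than an established algorithm such as Porter's, and the index is kept in memory instead of being written to a file.

```python
import re

# Hypothetical, deliberately small lists chosen to fit the examples above.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "of", "to", "in"})
SUFFIXES = ("ments", "ment", "ing", "ers", "er", "es", "e", "s")  # longer first

def stem(word, min_len=3):
    """Strip the first matching suffix, keeping at least min_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word

def conflate(text):
    """Map each stem to the set of word forms that reduce to it."""
    index = {}
    for token in re.findall(r"[A-Za-z]+", text):
        token = token.lower()
        if token in STOP_WORDS:   # step 2: remove stop words
            continue
        # steps 3-5: strip suffix, group equivalent stems, store in the index
        index.setdefault(stem(token), set()).add(token)
    return index

print(conflate("Compute Computer Computing the Walks Walking Walker"))
```

With these lists, the three example groups reduce to the stems "comput", "walk" and "develop". A real system would need a far more careful stemmer, since blindly stripping a suffix such as "e" or "s" over-stems many words.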
High frequency words
Indexing Subsystem
Clustering in Information Retrieval
Fig : Documents Collection with clusters Medical, Legal, Financial
Clustering in Information Retrieval
Similarity matrix
Objects: {1,2,3,4,5,6}
Threshold: 0.89
Graph Theoretic Approach
C1 : {1,4,5,6}
C2 : {2}
C3 : {3}
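One common reading of the graph-theoretic approach is to link every pair of objects whose similarity reaches the threshold and take the connected components of the resulting graph as clusters. The sketch below follows that reading as an assumption. The similarity values are invented for illustration (the slide's matrix is not reproduced here), as are the function name `threshold_clusters` and the use of `>=` for the threshold test.

```python
def threshold_clusters(objects, similarity, threshold):
    """Cluster objects as connected components of the graph whose edges
    join pairs with similarity >= threshold (assumed comparison)."""
    neighbours = {obj: set() for obj in objects}
    for (a, b), value in similarity.items():
        if value >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for obj in objects:
        if obj in seen:
            continue
        component, stack = set(), [obj]
        while stack:                      # depth-first walk of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Invented pairwise similarities; pairs not listed are treated as unlinked.
SIM = {(1, 4): 0.90, (4, 5): 0.95, (5, 6): 0.90,
       (1, 2): 0.50, (2, 3): 0.30, (3, 6): 0.60}
print(threshold_clusters([1, 2, 3, 4, 5, 6], SIM, 0.89))
```

With these invented values, objects 1, 4, 5 and 6 are chained together by links at or above 0.89, while 2 and 3 remain singletons.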
Similarity Measures
Jaccard’s Similarity Example
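The Jaccard coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch follows; the function name `jaccard` and the two term sets are hypothetical, and treating two empty sets as identical is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical term sets for two documents.
doc1 = {"information", "retrieval", "system"}
doc2 = {"information", "retrieval", "model", "query"}
print(jaccard(doc1, doc2))  # 2 shared terms out of 5 distinct terms
```

The two documents share 2 terms out of 5 distinct terms overall, giving a similarity of 2/5.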
Single Pass Clustering Algorithm
1. Assign the first document D1 as the representative for C1.
2. For the next document Di, calculate the similarity S with the representative of each existing cluster.
3. If the largest similarity, Smax, is greater than a threshold value ST, add Di to the corresponding cluster
and recalculate that cluster's representative; otherwise, use Di to initiate a new cluster.
4. If any item Di remains to be clustered, return to step 2.
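The four steps can be sketched as below. The slides do not fix the similarity measure or how the representative is recomputed, so this sketch assumes a dot product as the similarity and the centroid (mean vector) of the members as the representative. The function names and the sample vectors are hypothetical.

```python
def dot(a, b):
    """Assumed similarity measure: dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    """Assumed cluster representative: component-wise mean of the members."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def single_pass(items, threshold, sim=dot):
    """Return clusters as lists of item indices, processing items in order."""
    clusters, reps = [], []
    for i, item in enumerate(items):
        if reps:
            sims = [sim(item, rep) for rep in reps]           # step 2
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] > threshold:                        # step 3: join
                clusters[best].append(i)
                reps[best] = centroid([items[j] for j in clusters[best]])
                continue
        clusters.append([i])                                  # steps 1 and 3: new cluster
        reps.append(list(item))
    return clusters

# Hypothetical two-dimensional item vectors.
ITEMS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(single_pass(ITEMS, threshold=0.5))
```

Note that the result depends on the order in which items are processed, which is a known property of single-pass clustering.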
Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and
that we are interested in clustering the terms using the single pass
method. The threshold value is 10.
Single Link Clustering Algorithm
Dissimilarity matrix:
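In single-link clustering, the distance between two clusters is the smallest dissimilarity between any member of one and any member of the other, and the two closest clusters are merged repeatedly. The sketch below is a naive version of that procedure. The 4 x 4 dissimilarity matrix is invented for illustration and is not the one from the slide; the function name `single_link` and the `cutoff` stopping rule are assumptions.

```python
def single_link(dist, cutoff):
    """Merge the two closest clusters until the closest pair is farther apart
    than cutoff. Returns (clusters, merges), where merges records
    (distance, cluster_a, cluster_b) in merge order."""
    clusters = [{i} for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: minimum pairwise dissimilarity between clusters.
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        merges.append((d, sorted(clusters[i]), sorted(clusters[j])))
        clusters[i] |= clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges

# Invented symmetric dissimilarity matrix for four objects (indices 0..3).
DIST = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
clusters, merges = single_link(DIST, cutoff=4.5)
print(clusters)
print(merges)
```

The recorded merge distances are the levels at which a dendrogram would join the clusters. Raising the cutoff to 5 or more would merge everything into a single cluster for this matrix.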
Thanks

Unit 1 Information Storage and Retrieval
