Module 1
Information Retrieval
Topics
 Introduction
 Information versus Data Retrieval
 IR: Past, present, and future
 Basic concepts: The retrieval process
 Logical view of documents
 Modeling: A Taxonomy of IR models
 Ad-hoc retrieval and filtering
 Classic IR models
 Set theoretic
 Algebraic
 Probabilistic IR models
 Models for browsing
Introduction
 Information retrieval (IR) deals with the
representation, storage, organization and access
to information items.
 The representation and organization of the
information items should provide the user with
easy access to the information in which he is
interested.
 Unfortunately, characterization of the user
information need is not a simple problem.
 Web search engines are the most visible IR
applications.
 The user must first translate the information need into a query which can be processed
by the search engine.
 An information retrieval process begins when a
user enters a query into the system.
 Queries are formal statements of information
needs, for example search strings in web search
engines.
 Given the user query, the key goal of an IR
system is to retrieve information which might be
useful or relevant to the user.
 The emphasis is on the retrieval of information as
opposed to the retrieval of data.
 Data is a raw fact and information is
processed data.
Information versus Data Retrieval
Information Retrieval | Data Retrieval
Retrieves information about a subject | Determines the keywords in the user query and retrieves data
Small errors are likely to go unnoticed | A single erroneous object means total failure
Deals with natural language text | Deals with relational databases
Not always well structured and semantically ambiguous | Has a well-defined structure and semantics
Does not provide a solution to the user of a database system | Provides a solution to the user of a database system
Extracts syntactic and semantic information from the document text and uses this information to match the user information need | Does not solve the problem of retrieving information about a subject
The primary goal is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible | The primary goal is to retrieve the data according to the user query
IR: Past, present, and future
 A typical example of information retrieval is the table of
contents of a book
 Since the volume of information eventually grew
beyond a few books, it became necessary to build
specialized data structures to ensure faster access to
the stored information.
 An old and popular data structure for faster
information retrieval is the index: a collection of selected words
or concepts with associated pointers to the
related information.
 In one form or another, indexes are at the core of
every modern information retrieval system.
 They provide faster access to the data and allow the
query processing task to be sped up.
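The index described above can be sketched in a few lines. This is a minimal illustration, not a production design: the three-document collection and the name `build_index` are assumptions made for the example, and words are split on whitespace only.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical collection, invented for illustration.
docs = {
    1: "information retrieval systems",
    2: "data retrieval in databases",
    3: "web search engines",
}
index = build_index(docs)
print(index["retrieval"])  # [1, 2]
```

Looking up a word now costs one dictionary access instead of a scan over every document, which is the speed-up the slide refers to.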
 Two different views of the IR problem: a
computer-centered one and a human-centered
one.
 In the computer-centered view, the IR problem
consists mainly of building up efficient indexes,
processing user queries with high performance,
and developing ranking algorithms which improve
the `quality' of the answer set.
 In the human-centered view, the IR problem
consists mainly of studying the behavior of the
user, of understanding his main needs, and of
determining how such understanding affects the
organization and operation of the retrieval system
Information Retrieval in the Library
 Libraries were among the first institutions to adopt
IR systems for retrieving information.
 In the first generation, the systems allowed
searches based on author name and title.
 In the second generation, increased search
functionality was added which allowed searching
by subject headings, by keywords.
 In the third generation, which is currently being
deployed, the focus is on improved graphical
interfaces, electronic forms, hypertext features,
and open system architectures.
The Web and Digital Libraries
 Three dramatic and fundamental changes have
occurred due to the advances in modern computer
technology and the boom of the Web.
 First, it became a lot cheaper to have access to various
sources of information.
 Second, the advances in all kinds of digital
communication provided greater access to networks.
 Third, the freedom to post whatever information
someone judges useful has greatly contributed to the
popularity of the Web.
 Fundamentally, low cost, greater access, and
publishing freedom have allowed people to use the
Web as a highly interactive medium.
 Such interactivity allows people to exchange
messages, photos, documents, software, videos, and
Practical Issues
 Security
 Privacy
 Copyright
Basic concepts: The retrieval process
 To describe the retrieval process, we use a simple
and generic software architecture.
 First of all, before the retrieval process can even be
initiated, it is necessary to define the text database.
 This is usually done by the manager of the
database, who specifies the following:
(a) The documents to be used
(b) The operations to be performed on the text
(c) The text model (i.e., the text structure and what
elements can be retrieved)
 The text operations transform the original documents
and generate a logical view of them.
 Once the logical view of the documents is defined,
the database manager (using the DB Manager
Module) builds an index of the text.
 An index is a critical data structure because it allows
fast searching over large volumes of data.
 Given that the document database is indexed, the
retrieval process can be initiated.
 The user first specifies a user need which is then
parsed and transformed by the same text operations
applied to the text.
 Then, query operations might be applied before the
actual query, which provides a system representation
for the user need, is generated.
 The query is then processed to obtain the retrieved
documents.
 Fast query processing is made possible by the index
structure previously built.
 Before being sent to the user, the retrieved
documents are ranked according to a likelihood of
relevance.
 At this point, the user might pinpoint a subset of the
documents seen as definitely of interest and initiate
a user feedback cycle.
 In such a cycle, the system uses the documents
selected by the user to change the query
formulation.
Logical view of documents
 Due to historical reasons, documents in a
collection are frequently represented through a
set of index terms or keywords.
 Such keywords might be extracted directly from
the text of the document or might be specified by
a human subject.
 No matter whether these representative keywords
are derived automatically or generated by a
specialist, they provide a logical view of the
document.
 Modern computers are making it possible to
represent a document by its full set of words. In
this case, we say that the retrieval system adopts
a full text logical view of the documents.
 With very large collections, however, even modern
computers might have to reduce the set of
representative keywords.
 This can be accomplished through the elimination
of stopwords (such as articles and connectives), the
use of stemming (which reduces distinct words to
their common grammatical root), and the
identification of noun groups (which eliminates
adjectives, adverbs, and verbs).
 These operations are called text operations (or
transformations).
 Text operations reduce the complexity of the
document representation and allow moving the
logical view from that of a full text to that of a set
of index terms.
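The text operations above can be sketched as a small pipeline that moves from full text to a set of index terms. The stopword list and the suffix-stripping rule below are deliberately tiny assumptions for illustration; a real system would use a proper stemmer, and noun-group identification is not attempted here.

```python
# Assumed toy stopword list and suffix list; neither comes from the slides.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop punctuation and stopwords, then stem what is left."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    kept = [w for w in words if w and w not in STOPWORDS]
    return sorted({crude_stem(w) for w in kept})

terms = index_terms("The indexing of documents and the ranking of indexed documents")
print(terms)  # ['document', 'index', 'rank']
```

Ten words of full text are reduced to three index terms, which is the reduction in representation complexity the slide describes.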
Modeling: A Taxonomy of IR models
 Three models:
 Classic models
 Structured models
 Browsing models
 Three classic models in IR are:
 Boolean: documents and queries are represented as
sets of index terms. Also known as set theoretic.
 Vector: documents and queries are represented as
vectors in a t-dimensional space. Also known as
algebraic.
 Probabilistic: the framework for modeling document and
query representations is based on probability theory.
 Set theoretic: fuzzy, extended Boolean
 Algebraic: generalized vector, latent semantic indexing, neural networks
 Probabilistic: inference network, belief network
 Two structured models in IR are:
 Non-overlapping lists model
 Proximal nodes
 Three models for browsing are:
 Flat
 Structure guided
 Hypertext
Ad-hoc retrieval and filtering
 Ad hoc retrieval
 Standard retrieval task in which the user specifies his
information need through a query which initiates a
search (executed by the information system) for
documents which are likely to be relevant to the user.
 The documents in the collection remain relatively static
while new queries are submitted to the system
 The most common form of user task
 Filtering
 The queries remain relatively static while new
documents come into the system (and leave)
 User profile
 Describing the user’s preferences
 Routing (variation of filtering, rank the filtered document)
 User profile is compared to the incoming documents
to determine the user’s interest.
 E.g., selecting news articles of interest among the thousands of
articles which are broadcast each day.
 Typically, filtering only indicates which documents might be of interest. The task of
determining which ones are really relevant is fully reserved to the user.
 A variation is to rank the filtered documents, so that the user
needs to examine only a smaller number of documents. This variation of
filtering is called routing.
 To rank the documents, the vector model is preferred.
 In filtering, the crucial task is not ranking, but the
construction of the user profile.
 An approach for constructing a user profile is to
describe the profile through a set of keywords and to
require the user to provide the necessary keywords.
 The process is to collect information from the user
about his preferences and use this information to
build the user profile.
 In the beginning, the user provides a set of keywords
which describes an initial profile of his preferences.
 As new documents arrive, the system uses this
profile to select documents, and the user indicates which of them are
relevant and which are not relevant.
 The system uses this information to adjust the user
profile description such that it reflects new
preferences.
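The profile-construction cycle above can be sketched as follows. The slides do not say how the profile is adjusted, so the update rule here (add 1 to each word of a relevant document, subtract 1 for a non-relevant one) is purely an assumption for illustration, as are the function names and the sample documents.

```python
from collections import Counter

def words_of(document):
    """Distinct lowercase words of a document."""
    return set(document.lower().split())

def score(profile, document):
    """Sum the profile weights of the words that occur in the document."""
    return sum(profile[word] for word in words_of(document))

def adjust(profile, document, relevant):
    """Assumed update rule: +1 per word if relevant, -1 otherwise."""
    step = 1 if relevant else -1
    for word in words_of(document):
        profile[word] += step

# Initial profile: keywords supplied by the user (hypothetical).
profile = Counter({"football": 1, "league": 1})
incoming = ["football league results", "stock market results", "football transfer news"]
selected = [doc for doc in incoming if score(profile, doc) > 0]
print(selected)  # ['football league results', 'football transfer news']

# The user judges the selected documents and the profile is adjusted.
adjust(profile, "football league results", relevant=True)
adjust(profile, "football transfer news", relevant=False)
print(score(profile, "league results"))  # 3
print(score(profile, "transfer news"))   # -2
```

After one round of feedback the profile favors league results and penalizes transfer news, even though neither "results" nor "transfer" was in the initial keyword set.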
Formal characterization of IR models
 An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the
collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and
queries
4. R(qi, dj) is a ranking function
Classic IR models
 Basic concepts: Each document is described by a set of
representative keywords called index terms.
 An index term is a word that represents a document's
main theme.
 It is used to index and summarize the document contents.
 Numerical weights are assigned to index terms because distinct index terms
have varying relevance when describing a document.
 Three classic models: Boolean, vector, probabilistic
 $k_i$: a generic index term
 $K$: the set of all index terms $\{k_1, \ldots, k_t\}$
 $w_{i,j}$: a weight associated with index term
$k_i$ of a document $d_j$.
For an index term that does not appear in the document, $w_{i,j} = 0$.
 $g_i$: a function that returns the weight associated
with index term $k_i$ in any t-dimensional vector, i.e. $g_i(\vec{d}_j) = w_{i,j}$
Boolean model
 Simple retrieval model based on set theory and
Boolean algebra
 Binary decision criterion
 Either relevant or not relevant (no partial match)
 Data retrieval model
 Advantage
 Simplicity
 Disadvantage
 It is not simple to translate an information need into a
Boolean expression
 Exact matching may lead to retrieval of too few or too
many documents
 Formal definition
 For the Boolean model, the index term weights are
all binary, i.e. $w_{i,j} \in \{0,1\}$.
 A query q is composed of index terms linked by
three connectives: not, and, or.
 A query is a conventional Boolean expression,
which can be represented as a disjunction of
conjunctive vectors (in DNF).
 For example,
\[ q = k_a \wedge (k_b \vee \neg k_c) = (1,1,1) \vee (1,1,0) \vee (1,0,0) \]
where each of the components is a binary weighted
vector associated with the tuple $(k_a, k_b, k_c)$.
[Figure: the three conjunctive components (1,1,1), (1,1,0) and (1,0,0) of q over the index terms ka, kb, kc]
 The similarity of the document $d_j$ to the query q is
defined as
\[ sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; w_{i,j} = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases} \]
 If $sim(d_j, q) = 1$ then the Boolean model predicts
that the document $d_j$ is relevant to the query q.
 Otherwise the prediction is that the document is not
relevant.
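The Boolean similarity for the example query ka AND (kb OR NOT kc) can be sketched directly from its DNF. The names `TERMS`, `Q_DNF`, `binary_vector` and `sim` are assumptions for this illustration. Only the query's index terms take part in the match, so other terms in a document are ignored.

```python
TERMS = ("ka", "kb", "kc")
# Conjunctive components of q = ka AND (kb OR NOT kc), as (ka, kb, kc) vectors.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def binary_vector(doc_terms):
    """Binary weights of the document for the query's index terms."""
    return tuple(1 if term in doc_terms else 0 for term in TERMS)

def sim(doc_terms, q_dnf=Q_DNF):
    """1 if the document's vector equals some conjunctive component, else 0."""
    return 1 if binary_vector(doc_terms) in q_dnf else 0

print(sim({"ka", "kb"}))  # 1, matches (1,1,0)
print(sim({"ka", "kc"}))  # 0, (1,0,1) is not a component
print(sim({"kb", "kc"}))  # 0, ka is missing
```

The result is always 0 or 1, which illustrates why the model cannot express a partial match.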
Vector model
 Assign non-binary weights to index terms in queries and in
documents
 Compute the similarity between documents and query =
Sim(dj,q)
 More precise than Boolean model
 The weight $w_{i,j}$ associated with the pair
$(k_i, d_j)$ is positive and non-binary.
 The index terms in the query are also weighted.
 The vector for a document $d_j$ is represented by
\[ \vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \]
 The query vector is
\[ \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) \]
where $w_{i,q} \ge 0$ is the weight
associated with the pair $(k_i, q)$.
 Term weights are used to compute the degree of
similarity between documents and the user query. The similarity
varies from 0 to +1.
 A document might be retrieved even if it matches the
query only partially.
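 Degree of similarity
Figure 2.4: The cosine of $\theta$, the angle between $\vec{d}_j$ and $\vec{q}$, is adopted
as $sim(d_j, q)$.
\[ sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}} \]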
 Advantages
 Its term-weighting scheme improves retrieval
performance
 Its partial matching strategy allows retrieval of
documents that approximate the query conditions
 Its cosine ranking formula sorts the documents
according to their degree of similarity to the query
 Disadvantage
 The assumption of mutual independence between index
terms
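The cosine ranking of the vector model can be sketched as below. The weight vectors are invented for illustration, and returning 0.0 for an all-zero vector is an assumption of this sketch, since the cosine is undefined in that case.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for an empty vector
    return dot / (norm_d * norm_q)

# Hypothetical weights over t = 3 index terms.
query = (3, 4, 0)
documents = {"d1": (3, 4, 0), "d2": (3, 0, 0), "d3": (0, 0, 2)}

# Rank documents by decreasing similarity to the query.
ranking = sorted(documents, key=lambda name: cosine_sim(documents[name], query), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Here d2 shares only one of the two query terms and is still retrieved, ranked between the exact match d1 and the unrelated d3. This is the partial matching listed among the advantages.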
Probabilistic model
 Introduced by Robertson and Sparck Jones, 1976
 Binary independence retrieval (BIR) model
 Idea: Given a user query q, and the ideal answer set R of
the relevant documents, the problem is to specify the
properties for this set
 Assumption (probabilistic principle): the probability of
relevance depends on the query and document
representations only; ideal answer set R should
maximize the overall probability of relevance
 The probabilistic model tries to estimate the probability
that the user will find the document dj relevant, and measures the similarity of dj to q by the ratio
P(dj relevant to q)/P(dj nonrelevant to q)
 Definition
 All index term weights are binary, i.e. $w_{i,j} \in \{0,1\}$
 Let $R$ be the set of documents known to be relevant to
query q
 Let $\bar{R}$ be the complement of $R$
 Let $P(R \mid \vec{d}_j)$ be the probability that the document $d_j$ is
relevant to the query q
 Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document $d_j$ is
non-relevant to query q
 $\Pr(k_i \mid R)$ stands for the probability that the index
term $k_i$ is present in a document randomly selected
from the set $R$
 $\Pr(\bar{k}_i \mid R)$ stands for the probability that the index
term $k_i$ is not present in a document randomly
selected from the set $R$
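The two term probabilities defined above can be illustrated by counting over a known relevant set. Estimating them as relative frequencies is an assumption of this sketch: in practice R is not known in advance and has to be guessed and refined. The function name and the sample set are hypothetical.

```python
def term_probabilities(term, relevant_docs):
    """Return (Pr(k_i | R), Pr(not k_i | R)) as relative frequencies in R."""
    if not relevant_docs:
        raise ValueError("R must contain at least one document")
    present = sum(1 for doc in relevant_docs if term in doc)
    p_present = present / len(relevant_docs)
    return p_present, 1 - p_present

# Hypothetical relevant set R; each document is its set of index terms
# (binary weights, as the model assumes).
R = [{"ir", "index"}, {"ir", "query"}, {"index", "ranking"}, {"ir"}]
p_present, p_absent = term_probabilities("ir", R)
print(p_present, p_absent)  # 0.75 0.25
```

The two values always sum to 1, since a randomly selected document from R either contains the term or does not.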
Models for browsing
Flat Browsing
Structure Guided Browsing
Hypertext Model