Model of information retrieval (3)
DPT OF LIS
Information retrieval is the activity of obtaining
information resources relevant to an information need
from a collection of information resources.
Searches can be based on metadata or on full-text (or
other content-based) indexing.
Goal: Find the documents most relevant to a certain Query
Dealing with notions of:
Collection of documents
Query (User’s information need)
Notion of Relevancy
A model is a construct designed to help us understand a
A particular way of “looking at things”
Models inevitably make simplifying assumptions
What are the limitations of the model?
Different types of models:
Physical analog models
A retrieval model specifies the details
Determines a notion of relevance.
Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
CLASSES OF RM
Boolean models (set theoretic)
Vector space models
Latent Semantic Indexing
MODELS OF IR
Boolean model
Based on the notion of sets
Documents are retrieved only if they satisfy Boolean
conditions specified in the query
Does not impose a ranking on retrieved documents
Vector space model
Based on geometry, the notion of vectors in high-dimensional space
Documents are ranked based on their similarity to the query
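As a rough illustration of ranking by similarity to the query, the sketch below scores documents by the cosine of the angle between term-count vectors. Cosine similarity is one common choice and is an assumption here, not something the slides specify. The toy collection and the names cosine_similarity, docs and ranking are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy collection: each document is a bag of words.
docs = {
    "d1": "cheap hotel in rio brazil",
    "d2": "hawaii surf report",
    "d3": "hotel reviews",
}
vectors = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
query = Counter("hotel rio".split())

# Rank document ids from most to least similar to the query.
ranking = sorted(
    docs, key=lambda d: cosine_similarity(query, vectors[d]), reverse=True
)
```

Unlike a Boolean match, every document gets a score, so a document that shares only some query terms still appears in the ranking, just lower down.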
Language model
Based on the notion of probabilities and processes for generating text
Documents are ranked based on the probability that
they generated the query
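The ranking rule just stated, scoring a document by the probability that it generated the query, might be sketched as follows. The unigram model and the add-one smoothing are assumptions made for this illustration and are not specified in the slides. The names query_log_likelihood, docs and scores are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, vocab_size):
    """Log-probability of the query under a unigram model of the document.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    query terms from driving the probability to zero.
    """
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in query_terms
    )


# Hypothetical toy collection of tokenized documents.
docs = {
    "d1": "cheap hotel in rio brazil".split(),
    "d2": "hawaii surf report".split(),
}
vocab = {t for terms in docs.values() for t in terms}
query = "hotel rio".split()

scores = {
    d: query_log_likelihood(query, terms, len(vocab))
    for d, terms in docs.items()
}
best = max(scores, key=scores.get)  # document most likely to generate the query
```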
Invented by George Boole (1815-1864)
He devised a system of symbolic logic in which he used
three operators (+, ×, -) to combine statements.
John Venn later represented these relationships with diagrams. The operators of Boolean logic
are the logical sum (+), logical product (×), and logical
difference (-).
IR systems allow users to express their queries by
using these operators.
Each index term is either present or absent
Documents are either Relevant or Not Relevant(no
A document is represented as a set of keywords.
Queries are Boolean expressions of
keywords, connected by AND, OR, and
NOT, including the use of brackets to indicate scope.
[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
Output: Document is relevant or not. No partial
matches or ranking.
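A minimal sketch of this exact-match behaviour, assuming each document is a set of keywords and the query is written as a nested tuple tree. The tuple encoding, the matches function and the three toy documents are hypothetical choices for illustration. The query is the Rio/Hilo example above, read as ((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton.

```python
def matches(query, keywords):
    """Return True if the keyword set satisfies the Boolean query tree."""
    if isinstance(query, str):  # a bare term: is the keyword present?
        return query in keywords
    op, *args = query
    if op == "AND":
        return all(matches(a, keywords) for a in args)
    if op == "OR":
        return any(matches(a, keywords) for a in args)
    if op == "NOT":
        return not matches(args[0], keywords)
    raise ValueError(f"unknown operator: {op}")


query = (
    "AND",
    ("OR", ("AND", "rio", "brazil"), ("AND", "hilo", "hawaii")),
    "hotel",
    ("NOT", "hilton"),
)

# Hypothetical documents, each represented as a set of keywords.
docs = {
    "d1": {"rio", "brazil", "hotel", "beach"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"rio", "hotel"},
}

# The output is an unranked set: a document is either in or out.
retrieved = sorted(d for d, kw in docs.items() if matches(query, kw))
```

Here d3 shares two of the query's terms but is rejected outright, which is what "no partial matches" means in practice.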
BOOLEAN RETRIEVAL MODEL
Popular retrieval model because:
Easy to understand for simple queries.
Boolean models can be extended to include ranking.
Reasonably efficient implementations possible for
Weights assigned to terms are either “0” or “1”
“0” represents “absence”: term isn’t in the document
“1” represents “presence”: term is in the document
Build queries by combining terms with Boolean operators:
AND, OR, NOT
The system returns all documents that satisfy the query
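One way to picture the 0/1 weights is a term-document incidence table in which each term owns one bit per document. The sketch below stores each term's row as a Python integer and answers a query with bitwise operators. The bit-vector layout, the postings helper and the three toy documents are assumptions for illustration only.

```python
docs = ["good party tonight", "party is over", "good food"]
n = len(docs)
all_docs = (1 << n) - 1  # bit i set means "document i"

# incidence[term] has bit i set to 1 if the term is present in document i,
# and 0 if it is absent.
incidence = {}
for i, text in enumerate(docs):
    for term in set(text.split()):
        incidence[term] = incidence.get(term, 0) | (1 << i)


def postings(term):
    """Bit vector of documents containing the term (0 if unseen)."""
    return incidence.get(term, 0)


# Query: good AND party AND NOT over
result = postings("good") & postings("party") & (all_docs & ~postings("over"))
matching = [i for i in range(n) if (result >> i) & 1]
```

AND, OR and NOT map directly onto `&`, `|` and `~`, which is part of why Boolean retrieval is cheap for the computer.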
Why Boolean Retrieval Works
Boolean operators approximate natural language
Find documents about a good party that is not over
AND can discover relationships between concepts
OR can discover alternate terminology
excellent party, wild party, etc.
NOT can discover alternate meanings
The Perfect Query Paradox
Every information need has a perfect set of documents
If not, there would be no sense doing retrieval
Every document set has a perfect query
AND every word in a document to get a query for it
Repeat for each document in the set
OR every document query to get the set query
But can users realistically be expected to formulate this query?
Boolean query formulation is hard!
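The construction above can be sketched mechanically: AND the words of each wanted document, then OR the resulting clauses. The function names perfect_query and retrieves and the toy collection are hypothetical. Note that the sketch needs the wanted documents as input, which is exactly what a real user does not have.

```python
def perfect_query(target_docs):
    """OR together one AND-of-all-words clause per target document."""
    clauses = ["(" + " AND ".join(sorted(words)) + ")" for words in target_docs]
    return " OR ".join(clauses)


def retrieves(target_docs, candidate):
    """True if the candidate contains every word of at least one target."""
    return any(words <= candidate for words in target_docs)


# Hypothetical collection; suppose d1 and d3 are the "perfect set".
collection = {
    "d1": {"good", "party"},
    "d2": {"party", "over"},
    "d3": {"good", "food"},
}
wanted = ["d1", "d3"]
targets = [collection[d] for d in wanted]

query_text = perfect_query(targets)
hits = sorted(d for d, words in collection.items() if retrieves(targets, words))
```

In this sketch, a document whose words are a superset of a wanted document would also match, so the query is exact only when no such document exists in the collection.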
Why Boolean Retrieval Fails
• Natural language is way more complex
• AND “discovers” nonexistent relationships
– Terms in different sentences, paragraphs, …
• Guessing terminology for OR is hard
– good, nice, excellent, outstanding, awesome, …
• Guessing terms to exclude is even harder!
– Democratic party, party to a lawsuit, …
Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Efficient for the computer
Users must learn Boolean logic
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many documents or none
When do you stop reading? All documents in the result set are
considered “equally good”
What about partial matches? Documents that “don’t quite match” the
query may be useful also
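The "too many or none" problem is easy to reproduce on a toy collection. In the sketch below, the documents, the query terms and the overlap count are all hypothetical; the overlap count is not part of the Boolean model and is shown only to make the missed partial matches visible.

```python
# Hypothetical documents, each a set of keywords.
docs = {
    "d1": {"good", "party", "music"},
    "d2": {"excellent", "party"},
    "d3": {"good", "food"},
    "d4": {"party", "lawsuit"},
}
terms = ["good", "party", "tonight"]

# AND: every term must be present, so one missing term empties the result.
and_hits = sorted(d for d, kw in docs.items() if all(t in kw for t in terms))

# OR: any term suffices, so nearly everything comes back, all "equally good".
or_hits = sorted(d for d, kw in docs.items() if any(t in kw for t in terms))

# Number of query terms each document contains. Boolean matching
# discards this information.
overlap = {d: sum(t in kw for t in terms) for d, kw in docs.items()}
```

With these inputs the AND query returns nothing and the OR query returns the whole collection, even though d1 contains two of the three terms and d4 is about a different sense of "party".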
No notion of ranking (exact matching only)
All index terms have equal weight
Very rigid: AND means all; OR means any.
Difficult to express complex user requests.
Difficult to control the number of documents retrieved.
All matched documents will be returned.
Difficult to rank output.
All matched documents logically satisfy the query.
Difficult to perform relevance feedback.
If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
ADVANTAGES & DISADVANTAGES
Results are predictable, relatively easy to explain
Many different features can be incorporated
Efficient processing since many documents can be
eliminated from search
Effectiveness depends entirely on user
Simple queries usually don’t work well
Complex queries are difficult.
The first relates to the formulation of search statements.
It has been noted that users are not able to formulate an exact search
statement by the combination of AND, OR and NOT operators,
especially when several query terms are involved.
In such cases the search statement becomes either too narrow or too broad.
The second limitation relates to the number of retrieved items.
It has been noted that users cannot predict a priori exactly how many
items are to be retrieved to satisfy a given query.
If the search statement is broad, the number of retrieved items may
run to several hundred, which can make it quite difficult to
find the exact information required.
The third limitation is that it identifies an item as relevant by finding
out whether a given query term is present or not in a given record in the