Chapter 4
IR Models
Outline
 Introduction of IR Models
 Boolean model
 Vector space model
 Probabilistic model
 Word evidence: Bag of words
 IR systems usually adopt index terms to index and
retrieve documents
 Each document is represented by a set of
representative keywords or index terms (called Bag
of Words)
An index term is a word from a document useful for remembering the document's main themes
 Not all terms are equally useful for representing the document
contents:
 less frequent terms allow identifying a narrower set
of documents
 But no ordering information is attached to the Bag of Words
identified from the document collection.
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting which documents are relevant
and which are not
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
• Documents appearing at the top of this
ordering are considered to be more likely to be
relevant
• Thus ranking algorithms are at the core of IR
systems
• The IR models determine the predictions of
what is relevant and what is not, based on the
notion of relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these
unique terms form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a
real-valued weight, wij.
– wij is a weight associated with the pair (ki, dj). If wij = 0,
it indicates that the term does not belong to
document dj
• The weight wij quantifies the importance of the
index term for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector
associated with the document dj
Mapping Documents & Queries
 Represent both documents and queries as N-
dimensional vectors in a term-document matrix, which
shows occurrence of terms in the document collection
or query
 An entry in the matrix corresponds to the “weight” of a
term in the document:
dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
–Document collection is mapped to
term-by-document matrix
–The documents are viewed as
vectors in multidimensional space
• “Nearby” vectors are related
–Normalize the weight as usual for
vector length to avoid the effect of
document length
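As a small illustration of this mapping, here is a minimal sketch (with made-up toy documents, not from the slides) that builds a term-by-document matrix using raw term counts as weights:

```python
# Minimal sketch: mapping a toy collection to a term-by-document matrix.
# The documents here are hypothetical examples, not from the slides.
docs = {
    "D1": "gold shipment arrived",
    "D2": "silver shipment",
}

vocabulary = sorted({t for text in docs.values() for t in text.split()})

# Rows = documents, columns = vocabulary terms; entry = term count in the document.
matrix = {
    d: [text.split().count(t) for t in vocabulary]
    for d, text in docs.items()
}

print(vocabulary)    # ['arrived', 'gold', 'shipment', 'silver']
print(matrix["D1"])  # [1, 1, 1, 0]
print(matrix["D2"])  # [0, 0, 1, 1]
```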
Weighting Terms in Vector Space
• The importance of the index terms is represented by
weights associated to them
• Problem: to show the importance of the index term for
describing the document/query contents, what weight can
we assign?
• Solution 1: Binary weights: wij = 1 if the term is present, 0 otherwise
– Similarity: number of terms in common
• Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
• Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j);   qk = (w1,k, w2,k, …, wN,k)
The Boolean Model
• Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for
deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if document satisfies the boolean query
0 otherwise
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
- Note that no weights between 0 and 1 are
assigned; only the values 0 or 1 are used
[Venn diagram: documents d1–d7 distributed over index terms k1, k2, k3]
• Generate the relevant documents retrieved by
the Boolean model for the query :
q = k1 ∧ (k2 ∨ ¬k3)
The Boolean Model: Example
• Given the following, determine the documents retrieved by a
Boolean-model-based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
The Boolean Model: Example
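A minimal sketch (using plain Python sets, not part of the original slides) that reproduces the answer above with set operations:

```python
# Boolean retrieval over the example collection above,
# evaluating q = K1 AND (K2 OR NOT K3) with plain set operations.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def containing(term):
    """Set of document ids whose bag of words contains the term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# K1 AND (K2 OR NOT K3)
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(answer))   # ['D1', 'D2', 'D6']
```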
Given the following three documents, construct the term–document
matrix and find the relevant documents retrieved by the
Boolean model for the given query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold ∧ (silver ∨ ¬truck)”
The table below shows the document–term (ti) matrix
The Boolean Model: Further Example
arrive damage deliver fire gold silver ship truck
D1
D2
D3
query
Also find the relevant
documents for the queries:
(a) “gold ∧ delivery”;
(b) ship ∧ ¬gold;
(c) “silver ∧ truck”
Exercise
Given the following three documents with
the following contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents
retrieved for the queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Doc No Term 1 Term 2 Term 3 Term 4
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
Exercise: What are the relevant documents retrieved for the query:
((chaucer OR milton) AND (swift OR shakespeare))
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
• As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
• Just changing a boolean operator from “AND” to “OR”
changes the result from intersection to union
Vector-Space Model (VSM)
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
• Ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, Documents and queries are mapped into term vector
space.
• Note that queries are considered as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors
• There are different weighting techniques; the most widely
used one is computing tf*idf for each term
• Third, similarity measurement is used to rank documents by
the closeness of their vectors to the query.
• Documents are ranked by closeness to the query.
Closeness is determined by a similarity score calculation
Term-document matrix.
• A collection of n documents and query can be represented
in the vector space model by a term-document matrix.
–An entry in the matrix corresponds to the “weight” of a
term in the document;
–zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij >
0 whenever ki ∈ dj
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
Computing weights
• How do we compute weights for term i in document j and
query q; wij and wiq ?
• A good weight must take into account two effects:
–Quantification of intra-document contents (similarity)
• The tf factor, the term frequency within a document
–Quantification of inter-documents separation
(dissimilarity)
• The idf factor, the inverse document frequency
–As a result, most IR systems use the tf*idf
weighting technique:
wij = tf(i,j) * idf(i)
 Let,
 N be the total number of documents in the collection
 ni be the number of documents which contain ki
 freq(i,j) raw frequency of ki within dj
 A normalized tf factor is given by
 f(i,j) = freq(i,j) / max(freq(l,j))
 where the maximum is computed over all terms which
occur within the document dj
 The idf factor is computed as
 idf(i) = log (N/ni)
 the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ki.
Computing Weights
 The best term-weighting schemes use tf*idf
weights which are given by
 wij = f(i,j) * log(N/ni)
 For the query term weights, a suggestion is
 wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) *
log(N/ni)
 The vector space model with tf*idf weights is
a good ranking strategy with general
collections
 The vector space model is usually as good as the
known ranking alternatives.
 It is also simple and fast to compute.
Computing weights
 A collection includes 10,000 documents
 The term A appears 20 times in a particular
document
 The maximum appearance of any term in this
document is 50
 The term A appears in 2,000 of the collection
documents.
 Compute TF*IDF weight?
 f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
 idf(i) = log(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
 wij = f(i,j) * log(N/ni) = 0.4 * 2.32 = 0.928
Example: Computing weights
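A minimal sketch of this computation (log base 2 is assumed here so that the result matches the idf value 2.32 in the example above):

```python
import math

def tf_idf_weight(freq_ij, max_freq_j, N, n_i):
    """w_ij = f(i,j) * log(N / n_i), with f(i,j) the max-normalized term frequency.
    Log base 2 is assumed to match the slide's idf of 2.32."""
    f_ij = freq_ij / max_freq_j       # normalized tf: 20/50 = 0.4
    idf_i = math.log2(N / n_i)        # log2(10,000/2,000) = log2(5) ≈ 2.32
    return f_ij * idf_i

print(round(tf_idf_weight(20, 50, 10_000, 2_000), 3))
# 0.929 (the slide rounds idf to 2.32, giving 0.928)
```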
Similarity Measure
• Sim(q,dj) = cos(θ)
• Since wij ≥ 0 and wiq ≥ 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query
terms only partially
[Figure: document vector dj and query vector q separated by angle θ]
sim(q, dj) = (dj • q) / (|dj| × |q|)
           = Σi=1..n (wi,j × wi,q) / ( sqrt(Σi=1..n wi,j²) × sqrt(Σi=1..n wi,q²) )
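A minimal sketch of this cosine formula, assuming sparse {term: weight} dictionaries as the vector representation (an illustrative choice, not prescribed by the slides):

```python
import math

def cosine_sim(q_weights, d_weights):
    """sim(q, dj) = (q . dj) / (|q| * |dj|) for sparse {term: weight} vectors."""
    dot = sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in d_weights.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```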
 Suppose we issue the query Q: “gold silver truck”.
The database collection consists of the following
three documents.
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
 Assume that all terms are used, including common
terms and stop words, and that no terms are
reduced to root forms.
 Show retrieval results in ranked order?
Vector-Space Model: Example
Terms | TF counts (Q, D1, D2, D3) | DF | IDF | Wi = TF*IDF (Q, D1, D2, D3)
a 0 1 1 1 3 0 0 0 0 0
arrived 0 0 1 1 2 0.176 0 0 0.176 0.176
damaged 0 1 0 0 1 0.477 0 0.477 0 0
delivery 0 0 1 0 1 0.477 0 0 0.477 0
fire 0 1 0 0 1 0.477 0 0.477 0 0
gold 1 1 0 1 2 0.176 0.176 0.176 0 0.176
in 0 1 1 1 3 0 0 0 0 0
of 0 1 1 1 3 0 0 0 0 0
silver 1 0 2 0 1 0.477 0.477 0 0.954 0
shipment 0 1 0 1 2 0.176 0 0.176 0 0.176
truck 1 0 1 1 2 0.176 0.176 0 0.176 0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute the similarity using the cosine measure, Sim(q,dj)
• First, for each document and query, compute all vector
lengths (zero terms ignored)
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
|d2| = sqrt(0.477² + 0.954² + 0.176² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
|q| = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute dot products (zero products ignored)
Q • d1 = 0.176*0.176 = 0.0310
Q • d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
Q • d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute similarity score
Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
• Exercise: using normalized TF, rank documents using
cosine similarity measure? Hint: Normalize TF of term i
in doc j using max frequency of a term k in document j.
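A minimal end-to-end sketch of the worked example above (raw tf, base-10 log for idf, cosine ranking); the same code can be adapted for the normalized-TF exercise:

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Raw term frequencies for each document and for the query.
doc_tfs = {d: Counter(text.split()) for d, text in docs.items()}
q_tf = Counter(query.split())

# Document frequency and idf (log base 10, as in the table above).
vocab = set().union(*doc_tfs.values())
df = {t: sum(1 for c in doc_tfs.values() if t in c) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

def weights(counts):
    # Raw tf * idf, matching the example (no tf normalization).
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

q_w = weights(q_tf)
scores = {d: cosine(q_w, weights(c)) for d, c in doc_tfs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 4))
# D2 ≈ 0.825, D3 ≈ 0.327, D1 ≈ 0.080 — the same ranking as above, up to rounding
```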
Vector-Space Model
• Advantages:
• term-weighting improves quality of the answer
set since it displays in ranked order
• partial matching allows retrieval of documents
that approximate the query conditions
• cosine ranking formula sorts documents
according to degree of similarity to the query
• Disadvantages:
• assumes independence of index terms (??)
Technical Memo Example
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response
time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error
measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Exercises
• Given the following documents, rank documents
according to their relevance to the query using cosine
similarity, Euclidean distance and inner product
measures?
docID words in document
1 Taipei Taiwan
2 Macao Taiwan Shanghai
3 Japan Sapporo
4 Sapporo Osaka Taiwan
Query: “Taiwan Sapporo”
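The cosine measure was sketched earlier; for the other two measures named in this exercise, a minimal sketch over sparse {term: weight} vectors (an illustrative representation) might look like:

```python
import math

def inner_product(a, b):
    """Dot product of two sparse {term: weight} vectors (larger = more similar)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def euclidean_distance(a, b):
    """Euclidean distance over the union of terms (smaller = more similar)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))
```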
Thank you