Chapter 4 : IR Models
Adama Science and Technology University
School of Electrical Engineering and
Computing
Department of CSE
Kibrom T
Word evidence:
IR systems usually adopt index terms to index and retrieve
documents.
Each document is represented by a set of representative keywords
or index terms (called Bag of Words).
An index term is a document word useful for remembering the
document main themes.
Not all terms are equally useful for representing the document
contents:
Less frequent terms allow identifying a narrower set of documents.
But no ordering information is attached to the Bag of Words
identified from the document collection.
IR Models - Basic Concepts
One central problem regarding IR systems is the issue of
predicting the degree of relevance of documents for a given
query.
Such a decision is usually dependent on a ranking algorithm
which attempts to establish a simple ordering of the documents
retrieved.
Documents appearing at the top of this ordering are considered
to be more likely to be relevant.
Thus ranking algorithms are at the core of IR systems.
The IR models determine the predictions of what is relevant and
what is not, based on the notion of relevance implemented by the
system.
What is the importance of IR Models?
 Ranking: IR models help in ranking the retrieved documents
according to their relevance to the query.
 Efficient Retrieval: IR models help in retrieving relevant documents or
information from a large collection of documents.
 Analysis: IR models can be used to analyse the content of the
documents, extract important information, and perform various other
analyses.
 Personalization: IR models can be used to personalize the search
results based on the user's preferences, search history, and other
factors.
Alternative IR models
[Figure: taxonomy of alternative IR models, including the probabilistic relevance model.]
IR Models - Basic Concepts
After preprocessing, N distinct terms (Bag of words) remain
which are unique terms that form the VOCABULARY. Let:
ki be an index term i & dj be a document j.
K = (k1, k2, …, kN) is the set of all index terms.
Each term, i, in a document or query j, is given a real-valued
weight, wij.
wij is a weight associated with the pair (ki, dj). If wij = 0, the
term ki does not belong to document dj.
The weight wij quantifies the importance of the index term for
describing the document contents.
vec(dj) = (w1j, w2j, …, wNj) is a weighted vector associated with the
document dj.
Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors
in a term-document matrix, which shows occurrence of terms in
the document collection or query.
– E.g.
An entry in the matrix corresponds to the “weight” of a term in the
document; zero means the term doesn’t exist in the document.
dj = (t1,j, t2,j, …, tN,j); qk = (t1,k, t2,k, …, tN,k)
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
Qi wi1 wi2 … wiN
 Document collection is mapped to term-by-
document matrix.
 View as vector in multidimensional space.
 Nearby vectors are related.
 Normalize for vector length to avoid the
effect of document length.
Weighting Terms in Vector Space
The importance of the index terms is represented by weights
associated to them.
Problem: what weight should we assign to show the importance of
an index term for describing the document/query contents?
Solution 1: Binary weights: wij = 1 if the term is present, 0 otherwise.
 Similarity: number of terms in common.
Problem: Not all terms equally interesting.
 E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights.
dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
How to evaluate Models?
We need to investigate what procedures they follow and what
techniques they use:
 Are they using binary or non-binary weighting for measuring the
importance of terms in documents?
 Are they using similarity measurements?
 Are they applying partial matching?
 Are they performing Exact matching or Best matching for
document retrieval?
 Any Ranking mechanism?
The Boolean Model
Boolean model is a simple model based on set theory.
 The Boolean model imposes a binary criterion for deciding
relevance. Query terms are linked by the logical operators
AND, OR and NOT;
Terms are either present or absent. Thus,
wij ∈ {0,1}
sim(q,dj) = 1, if document satisfies the boolean query,
0 otherwise
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
Note that no weights between 0 and 1 are assigned; only the
values 0 or 1 are used.
Given the following docs determine the documents retrieved by
the Boolean model based IR system.
 Index Terms: K1, …,K8.
 Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
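This set algebra can be sketched in Python (an illustrative sketch; the document sets are copied from the example above):

```python
# Boolean retrieval over the example collection: each document is a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def having(term):
    """Set of document ids containing the given index term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# Query: K1 AND (K2 OR NOT K3) -> AND is set intersection, OR is union, NOT is complement.
answer = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6']
```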
The Boolean Model: Example
 Given the following three documents, Construct Term – document matrix and
find the relevant documents retrieved by the Boolean model for given query.
 D1: “Shipment of gold damaged in a fire”
 D2: “Delivery of silver arrived in a silver truck”
 D3: “Shipment of gold arrived in a truck”
 Query: “gold silver truck”
The table below shows the document–term (ti) matrix.
The Boolean Model: Further Example
 Find the relevant
documents for
the queries:
(a) gold delivery
(b) ship gold
(c) silver truck
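A minimal sketch of the incidence matrix and conjunctive Boolean matching, assuming lowercasing, whitespace tokenization, and that a multi-word query is an implicit AND:

```python
# Build a term-document incidence matrix for the three example documents.
# Assumes no stop-word removal and no stemming.
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
tokens = {d: text.lower().split() for d, text in docs.items()}
terms = sorted({t for ts in tokens.values() for t in ts})
# Incidence matrix: 1 if the term occurs in the document, else 0.
incidence = {t: {d: int(t in tokens[d]) for d in docs} for t in terms}

def boolean_and(*query_terms):
    """Documents containing every query term (multi-word query read as implicit AND)."""
    return sorted(d for d in docs
                  if all(incidence.get(t, {}).get(d, 0) for t in query_terms))

print(boolean_and("silver", "truck"))   # ['D2']
print(boolean_and("gold", "delivery"))  # []
```

Note that without stemming, a query term like "ship" matches nothing, since only the exact token "shipment" occurs in the collection.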
Pros & Cons of the Boolean Model
 Pros : Simplicity: The Boolean Model is easy to understand and use.
 Precise Retrieval: It retrieves documents that strictly match the query
conditions.
 Fast Processing: The model can be processed quickly due to its
simple operations.
 Cons: Lack of Ranking, Lack of Flexibility: It has limited support for
complex relationships between terms or advanced search techniques.
 Vocabulary Mismatch: If there's a mismatch between query terms and
document terms, retrieval can be ineffective.
 No Partial Matches: The model only considers exact matches,
potentially missing relevant documents.
 Awkward query formulation: the information need has to be translated
into a Boolean expression, which most users find awkward.
Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of
partial matching,
No ranking of the documents is provided (absence of a grading
scale),
Information need has to be translated into a Boolean expression
which most users find awkward,
The Boolean queries formulated by the users are most often too
simplistic.
 As a consequence, the Boolean model frequently returns either
too few or too many documents in response to a user query.
Class Exercise
 Consider a document collection consisting of five documents: Each
document contains the following terms:
D1: apple, banana, cherry
D2: apple, banana, grape
D3: apple, cherry
D4: banana, cherry
D5: apple, grape
 Using the Boolean Model, answer the following queries:
Query1: apple AND banana. D1, D2
Query2: apple OR cherry D1, D2, D3, D4, D5
Query3: NOT banana. D3, D5
Query4: (apple OR cherry) AND grape D2, D5
 Please provide your answers based on the Boolean Model.
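These answers can be checked with set operations (a sketch; document sets as listed above):

```python
docs = {
    "D1": {"apple", "banana", "cherry"},
    "D2": {"apple", "banana", "grape"},
    "D3": {"apple", "cherry"},
    "D4": {"banana", "cherry"},
    "D5": {"apple", "grape"},
}

def having(t):
    """Documents containing term t."""
    return {d for d, ts in docs.items() if t in ts}

all_docs = set(docs)
q1 = having("apple") & having("banana")                      # AND
q2 = having("apple") | having("cherry")                      # OR
q3 = all_docs - having("banana")                             # NOT
q4 = (having("apple") | having("cherry")) & having("grape")  # composite
print(sorted(q1), sorted(q2), sorted(q3), sorted(q4))
```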
Class Exercise
Given the following four documents with the following contents:
 D1 = “computer information retrieval”
 D2 = “computer retrieval”
 D3 = “information”
 D4 = “computer information”
What are the relevant documents retrieved for the queries:
 Q1 = “information ∧ retrieval”
 Q2 = “information ∧ ¬computer”
 Q3 = “computer ∨ information”
Vector-Space Model
This is the most commonly used strategy for measuring
relevance of documents for a given query. This is because,
 Use of binary weights is too limiting,
 Non-binary weights provide consideration for partial matches.
These term weights are used to compute a degree of similarity
between a query and each document.
 Ranked set of documents provides for better matching.
The idea behind VSM is that:
 The meaning of a document is conveyed by the words used in that
document.
Vector-Space (similarity based)Model
To find relevant documents for a given query:
First, map documents and queries into term-document vector
space.
 Note that queries are treated as short documents.
Second, in the vector space, queries and documents are
represented as weighted vectors, wij.
 There are different weighting techniques; the most widely used one
is computing tf*idf for each term.
Third, similarity measurement is used to rank documents by the
closeness of their vectors to the query.
 Documents are ranked by closeness to the query.
 Closeness is determined by a similarity score calculation.
Term-document Matrix.
A collection of n documents and query can be represented in the
vector space model by a term-document matrix.
 An entry in the matrix corresponds to the “weight” of a term in
the document;
 Zero means the term has no significance in the document or it
simply doesn’t exist in the document. Otherwise, wij > 0
whenever ki ∈ dj.
T1 T2 …. TN
D1 w11 w21 … wN1
D2 w12 w22 … wN2
: : : :
: : : :
DM w1M w2M … wNM
Computing Weights
How to compute weights for term i in document j and query
q; wij and wiq?
A good weight must take into account two effects:
 Quantification of intra-document contents (similarity).
 tf factor, the term frequency within a document.
 Quantification of inter-documents separation (dissimilarity)
 idf factor, the inverse document frequency.
As a result of which most IR systems are using tf*idf
weighting technique:
wij = tf(i,j) * idf(i)
 Let:N be the total number of documents in the collection,
 ni be the number of documents which contain ti.
 freq(i,j) raw frequency of ti within dj.
 A normalized tf(i,j) factor is given by
tf(i,j) = freq(i,j) / max(freq(k,j))
 Where the maximum is computed over all terms which occur within the
document dj.
 The idf factor is computed as
idf(i) = log (N/ni)
 The log is used to make the values of tf and idf comparable. It can also
be interpreted as the amount of information associated with the term ti.
 A normalized tf*idf weight is given by:
wij = freq(i,j) / max(freq(k,j)) * log(N/ni)
Computing Weights
Query:
 Users query is typically treated as a document and also tf-idf weighted.
 For the query term weights, a suggestion is:
wiq = (0.5 + [0.5 * freq(i,q) / max(freq(k,q))]) * log(N/ni)
 The vector space model with tf*idf weights is a good ranking strategy
with general collections,
 The vector space model is usually as good as the known ranking
alternatives.
 It is also simple and fast to compute.
Example: Computing Weights
 A collection includes 10,000 documents:
 The term A appears 20 times in a particular document,
 The maximum appearance of any term in this document is 50,
 The term A appears in 2,000 of the collection documents.
 Compute TF*IDF weight?
 tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) ≈ 2.32
 wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928
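A small sketch of the tf*idf computation; the log base is a modelling choice (the value 2.32 above corresponds to base 2, while the later gold/silver/truck tables use base 10):

```python
import math

def tf(freq_ij, max_freq_j):
    """Normalized term frequency: raw frequency over the document's max term frequency."""
    return freq_ij / max_freq_j

def idf(N, n_i, base=2):
    """Inverse document frequency: log of total docs over docs containing the term."""
    return math.log(N / n_i, base)

def tf_idf(freq_ij, max_freq_j, N, n_i, base=2):
    return tf(freq_ij, max_freq_j) * idf(N, n_i, base)

# Worked example: term A appears 20 times, max term frequency is 50,
# N = 10,000 documents, and term A occurs in 2,000 of them.
w = tf_idf(20, 50, 10_000, 2_000)
print(round(w, 3))  # 0.4 * log2(5) ≈ 0.929
```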
Similarity Measure
A similarity measure is a function that computes the degree of
similarity between two vectors.
Using a similarity measure between the query and each
document:
 It is possible to rank the retrieved documents in the order of
presumed relevance.
 It is possible to enforce a certain threshold so that we can control
the size of the retrieved set of documents.
Similarity Measures
 Dot product (simple matching): sim(Q, D) = Q · D
 Dice’s coefficient: sim(Q, D) = 2|Q ∩ D| / (|Q| + |D|)
 Jaccard’s coefficient: sim(Q, D) = |Q ∩ D| / |Q ∪ D|
 Cosine coefficient: sim(Q, D) = |Q ∩ D| / (|Q|1/2 · |D|1/2)
 Overlap coefficient: sim(Q, D) = |Q ∩ D| / min(|Q|, |D|)
Similarity Measure
 Sim(q,dj) = cos()
 Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1.
 A document is retrieved even if it matches the query terms only
partially.
[Figure: document vector dj and query vector q separated by angle θ.]
sim(q, dj) = (dj · q) / (|dj| × |q|)
= Σi=1..n (wi,j × wi,q) / (√Σi=1..n wi,j² × √Σi=1..n wi,q²)
Vector Space with Term Weights
and Cosine Matching
[Figure: two-dimensional term space (Term A vs. Term B) showing document vectors D1 and D2 and query vector Q, with angles θ1 and θ2 between Q and each document.]
Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)
sim(Q, Di) = Σj=1..t (wqj × wdij) / (√Σj=1..t wqj² × √Σj=1..t wdij²)
Q = (0.4, 0.8); D1 = (0.8, 0.3); D2 = (0.2, 0.7)
sim(Q, D1) = (0.4×0.8 + 0.8×0.3) / √[(0.4² + 0.8²) × (0.8² + 0.3²)]
= 0.56 / √0.58 ≈ 0.74
sim(Q, D2) = (0.4×0.2 + 0.8×0.7) / √[(0.4² + 0.8²) × (0.2² + 0.7²)]
= 0.64 / √0.42 ≈ 0.98
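The two-dimensional example can be verified with a short cosine function (a sketch; exact arithmetic gives about 0.73 for D1, which the slides report as 0.74 by rounding the intermediate vector lengths):

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_sim(Q, D1), 2))  # 0.73
print(round(cosine_sim(Q, D2), 2))  # 0.98
```

D2 ranks above D1 because its vector points in nearly the same direction as the query, even though D1 has a larger weight on Term A.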
 Suppose user query for: Q = “gold silver truck”. The database
collection consists of three documents with the following content.
 D1: “Shipment of gold damaged in a fire”
 D2: “Delivery of silver arrived in a silver truck”
 D3: “Shipment of gold arrived in a truck”
 Show retrieval results in ranked order?
 Assume that full-text terms are used during indexing, without
removing common terms or stop words, and that no terms are
stemmed.
 Assume that content-bearing terms are selected during indexing.
 Also compare your result with or without normalizing term
frequency.
Vector-Space Model: Example
Each row below lists: the term; raw TF counts in Q, D1, D2, D3; DF; IDF; and the weights Wi = TF×IDF for Q, D1, D2, D3.
a 0 1 1 1 3 0 0 0 0 0
arrived 0 0 1 1 2 0.176 0 0 0.176 0.176
damaged 0 1 0 0 1 0.477 0 0.477 0 0
delivery 0 0 1 0 1 0.477 0 0 0.477 0
fire 0 1 0 0 1 0.477 0 0.477 0 0
gold 1 1 0 1 2 0.176 0.176 0.176 0 0.176
in 0 1 1 1 3 0 0 0 0 0
of 0 1 1 1 3 0 0 0 0 0
silver 1 0 2 0 1 0.477 0.477 0 0.954 0
shipment 0 1 0 1 2 0.176 0 0.176 0 0.176
truck 1 0 1 1 2 0.176 0.176 0 0.176 0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
 Compute the cosine similarity Sim(q,di) for each document.
 First, for each document and the query, compute the vector lengths (zero
terms ignored):
|d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.517 = 0.719
|d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.2001 = 1.095
|d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.124 = 0.352
|q| = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538
 Next, compute dot products (zero products ignored):
 Q·d1 = 0.176×0.176 = 0.0310
 Q·d2 = 0.954×0.477 + 0.176×0.176 = 0.4862
 Q·d3 = 0.176×0.176 + 0.176×0.176 = 0.0620
Vector-Space Model: Example
 Now, compute similarity score:
 Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
 Sim(q,d2) = (0.4862 ) / (0.538*1.095)= 0.8246
 Sim(q,d3) = (0.0620) / (0.538*0.352)= 0.3271
 Finally, we sort and rank documents in descending order according to
the similarity scores:
 Rank 1: Doc 2 = 0.8246
 Rank 2: Doc 3 = 0.3271
 Rank 3: Doc 1 = 0.0801
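The whole example can be reproduced end-to-end with a short sketch, assuming raw (unnormalized) term frequencies and idf = log10(N/ni) as in the tables above; tiny differences in the last digit come from rounding idf values to three decimals in the slides:

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"
N = len(docs)

# Raw term frequencies per document, document frequency, and idf = log10(N/ni).
counts = {name: Counter(text.split()) for name, text in docs.items()}
df = Counter(t for c in counts.values() for t in set(c))
idf = {t: math.log10(N / df[t]) for t in df}

def weights(counter):
    # tf*idf with raw term frequency, matching the slides' table.
    return {t: counter[t] * idf.get(t, 0.0) for t in counter}

def length(w):
    return math.sqrt(sum(v * v for v in w.values()))

q_w = weights(Counter(query.split()))
scores = {}
for name in docs:
    d_w = weights(counts[name])
    dot = sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)
    scores[name] = dot / (length(q_w) * length(d_w))

for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))
```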
 Exercise: using normalized TF, rank the documents by the cosine
similarity measure.
 Hint: normalize the TF of term i in document j by the maximum
frequency of any term k in document j.
Vector-Space Model
Advantages:
 Term-weighting improves quality of the answer set since it
displays in ranked order,
 Partial matching allows retrieval of documents that
approximate the query conditions,
 Cosine ranking formula sorts documents according to degree of
similarity to the query.
Disadvantages:
 Assumes index terms are mutually independent, ignoring any
relationships between terms.
Exercise 1
Consider these documents:
 Doc 1: breakthrough drug for schizophrenia
 Doc 2: new schizophrenia drug
 Doc 3: new approach for treatment of schizophrenia
 Doc 4: new hopes for schizophrenia patients
 Draw the term-document incidence matrix for this document
collection.
 Draw the inverted index representation for this collection.
For the document collection shown above, what are the returned
results for the queries:
 schizophrenia AND drug
 for AND NOT(drug OR approach)
Question & Answer
10/24/2023
Thank You !!!
10/24/2023 36

More Related Content

Similar to 4-IR Models_new.ppt

A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
Search Engines
Search EnginesSearch Engines
Search Enginesbutest
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Primya Tamil
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnIOSR Journals
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델guesta34d441
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델JUNGEUN KANG
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Information retrival system and PageRank algorithm
Information retrival system and PageRank algorithmInformation retrival system and PageRank algorithm
Information retrival system and PageRank algorithmRupali Bhatnagar
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
A Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without DictionariesA Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without Dictionaries鍾誠 陳鍾誠
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptxthenmozhip8
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델guesta34d441
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델JUNGEUN KANG
 

Similar to 4-IR Models_new.ppt (20)

A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Information retrival system and PageRank algorithm
Information retrival system and PageRank algorithmInformation retrival system and PageRank algorithm
Information retrival system and PageRank algorithm
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
G04124041046
G04124041046G04124041046
G04124041046
 
A Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without DictionariesA Document Similarity Measurement without Dictionaries
A Document Similarity Measurement without Dictionaries
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델
 
집합모델 확장불린모델
집합모델  확장불린모델집합모델  확장불린모델
집합모델 확장불린모델
 

More from BereketAraya

4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.pptBereketAraya
 
B.Heat Exchange-PPT.ppt
B.Heat Exchange-PPT.pptB.Heat Exchange-PPT.ppt
B.Heat Exchange-PPT.pptBereketAraya
 
6&7-Query Languages & Operations.ppt
6&7-Query Languages & Operations.ppt6&7-Query Languages & Operations.ppt
6&7-Query Languages & Operations.pptBereketAraya
 
H.Insulation-PPT.ppt
H.Insulation-PPT.pptH.Insulation-PPT.ppt
H.Insulation-PPT.pptBereketAraya
 
COLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptxCOLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptxBereketAraya
 
Product design (datan).pptx
Product design (datan).pptxProduct design (datan).pptx
Product design (datan).pptxBereketAraya
 

More from BereketAraya (7)

CH-II-I.pptx
CH-II-I.pptxCH-II-I.pptx
CH-II-I.pptx
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
B.Heat Exchange-PPT.ppt
B.Heat Exchange-PPT.pptB.Heat Exchange-PPT.ppt
B.Heat Exchange-PPT.ppt
 
6&7-Query Languages & Operations.ppt
6&7-Query Languages & Operations.ppt6&7-Query Languages & Operations.ppt
6&7-Query Languages & Operations.ppt
 
H.Insulation-PPT.ppt
H.Insulation-PPT.pptH.Insulation-PPT.ppt
H.Insulation-PPT.ppt
 
COLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptxCOLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptx
 
Product design (datan).pptx
Product design (datan).pptxProduct design (datan).pptx
Product design (datan).pptx
 

Recently uploaded

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 

Recently uploaded (20)

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 

4-IR Models_new.ppt

  • 1. Chapter 4 : IR Models Adama Science and Technology University School of Electrical Engineering and Computing Department of CSE Kibrom T
  • 2. Word evidence: IR systems usually adopt index terms to index and retrieve documents. Each document is represented by a set of representative keywords or index terms (called a Bag of Words). An index term is a document word useful for recalling the document's main themes. Not all terms are equally useful for representing the document contents: less frequent terms identify a narrower set of documents. Note that no ordering information is attached to the Bag of Words extracted from the document collection. IR Models - Basic Concepts
  • 3. IR Models - Basic Concepts One central problem regarding IR systems is the issue of predicting the degree of relevance of documents for a given query. Such a decision usually depends on a ranking algorithm, which attempts to establish a simple ordering of the documents retrieved. Documents appearing at the top of this ordering are considered more likely to be relevant. Thus ranking algorithms are at the core of IR systems. The IR models determine the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system.
  • 4. What is the importance of IR Models?  Ranking: IR models help in ranking the retrieved documents according to their relevance to the query.  Efficient Retrieval: IR models help in retrieving relevant documents or information from a large collection of documents.  Analysis: IR models can be used to analyse the content of the documents, extract important information, and perform various other analyses.  Personalization: IR models can be used to personalize the search results based on the user's preferences, search history, and other factors.
  • 6. IR Models - Basic Concepts After preprocessing, N distinct terms (the Bag of Words) remain; these unique terms form the VOCABULARY. Let: ki be index term i and dj be document j. K = (k1, k2, …, kN) is the set of all index terms. Each term i in a document or query j is given a real-valued weight wij, associated with the pair (ki, dj). If wij = 0, the term does not belong to document dj. The weight wij quantifies the importance of the index term for describing the document contents. vec(dj) = (w1j, w2j, …, wNj) is the weighted vector associated with document dj.
  • 7. Mapping Documents & Queries Represent both documents and queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query. – E.g. an entry in the matrix corresponds to the "weight" of a term in the document; zero means the term doesn't exist in the document. dj = (t1,j, t2,j, …, tN,j); qk = (t1,k, t2,k, …, tN,k)

          T1   T2   …  TN
      D1  w11  w12  …  w1N
      D2  w21  w22  …  w2N
      :   :    :       :
      DM  wM1  wM2  …  wMN
      Qi  wi1  wi2  …  wiN

   The document collection is mapped to a term-by-document matrix.  View each document as a vector in multidimensional space.  Nearby vectors are related.  Normalize for vector length to avoid the effect of document length.
  • 8. Weighting Terms in Vector Space The importance of the index terms is represented by weights associated with them. Problem: to show the importance of an index term for describing the document/query contents, what weight can we assign? Solution 1: binary weights: t = 1 if the term is present, 0 otherwise.  Similarity: number of terms in common. Problem: not all terms are equally interesting.  E.g. the vs. dog vs. cat. Solution: replace binary weights with non-binary weights: dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
  • 9. How to evaluate Models? We need to investigate what procedures they follow and what techniques they use:  Do they use binary or non-binary weighting to measure the importance of terms in documents?  Do they use similarity measurements?  Do they apply partial matching?  Do they perform exact matching or best matching for document retrieval?  Any ranking mechanism?
  • 10. The Boolean Model The Boolean model is a simple model based on set theory.  The Boolean model imposes a binary criterion for deciding relevance. Query terms are linked by the logical operators AND, OR and NOT; terms are either present or absent. Thus, wij ∈ {0,1}. sim(q,dj) = 1 if the document satisfies the Boolean query, 0 otherwise.

          T1   T2   …  TN
      D1  w11  w12  …  w1N
      D2  w21  w22  …  w2N
      :   :    :       :
      DM  wM1  wM2  …  wMN

  Note that no weights are assigned in between 0 and 1; the only possible values are 0 and 1.
  • 11. Given the following docs, determine the documents retrieved by the Boolean-model-based IR system.  Index terms: K1, …, K8.  Documents: 1. D1 = {K1, K2, K3, K4, K5} 2. D2 = {K1, K2, K3, K4} 3. D3 = {K2, K4, K6, K8} 4. D4 = {K1, K3, K5, K7} 5. D5 = {K4, K5, K6, K7, K8} 6. D6 = {K1, K2, K3, K4} • Query: K1 ∧ (K2 ∨ ¬K3) • Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6} The Boolean Model: Example
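  The set evaluation above can be reproduced in a few lines. This is a minimal sketch of Boolean retrieval with Python sets; the document contents and the query K1 ∧ (K2 ∨ ¬K3) follow the slide's example.

  ```python
  # Boolean retrieval over an inverted view of the collection, using set algebra.
  docs = {
      "D1": {"K1", "K2", "K3", "K4", "K5"},
      "D2": {"K1", "K2", "K3", "K4"},
      "D3": {"K2", "K4", "K6", "K8"},
      "D4": {"K1", "K3", "K5", "K7"},
      "D5": {"K4", "K5", "K6", "K7", "K8"},
      "D6": {"K1", "K2", "K3", "K4"},
  }

  def having(term):
      """Set of document ids that contain the given index term."""
      return {d for d, terms in docs.items() if term in terms}

  all_docs = set(docs)

  # Query: K1 AND (K2 OR NOT K3)
  # AND -> intersection, OR -> union, NOT -> complement w.r.t. the collection.
  result = having("K1") & (having("K2") | (all_docs - having("K3")))
  print(sorted(result))  # ['D1', 'D2', 'D6']
  ```

  The answer set matches the slide's evaluation: only D1, D2 and D6 satisfy the query.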
  • 12.  Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the given query.  D1: "Shipment of gold damaged in a fire"  D2: "Delivery of silver arrived in a silver truck"  D3: "Shipment of gold arrived in a truck"  Query: "gold silver truck" The table below shows the document-term (ti) incidence matrix:

      Term      D1  D2  D3
      a         1   1   1
      arrived   0   1   1
      damaged   1   0   0
      delivery  0   1   0
      fire      1   0   0
      gold      1   0   1
      in        1   1   1
      of        1   1   1
      shipment  1   0   1
      silver    0   1   0
      truck     0   1   1

  The Boolean Model: Further Example  Find the relevant documents for the queries: (a) gold delivery (b) ship gold (c) silver truck
  • 13. Pros & Cons of the Boolean Model  Pros: Simplicity: the Boolean model is easy to understand and use.  Precise retrieval: it retrieves documents that strictly match the query conditions.  Fast processing: the model can be processed quickly due to its simple operations.  Cons: Lack of ranking and lack of flexibility: it has limited support for complex relationships between terms or advanced search techniques.  Vocabulary mismatch: if there is a mismatch between query terms and document terms, retrieval can be ineffective.  No partial matches: the model only considers exact matches, potentially missing relevant documents.  The information need has to be translated into a Boolean expression, which most users find awkward.
  • 14. Drawbacks of the Boolean Model Retrieval based on binary decision criteria with no notion of partial matching, No ranking of the documents is provided (absence of a grading scale), Information need has to be translated into a Boolean expression which most users find awkward, The Boolean queries formulated by the users are most often too simplistic.  As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.
  • 15. Class Exercise  Consider a document collection consisting of five documents: Each document contains the following terms: D1: apple, banana, cherry D2: apple, banana, grape D3: apple, cherry D4: banana, cherry D5: apple, grape  Using the Boolean Model, answer the following queries: Query1: apple AND banana. D1, D2 Query2: apple OR cherry D1, D2, D3 Query3: NOT banana. D3, D5 Query4: (apple OR cherry) AND grape D2, D5  Please provide your answers based on the Boolean Model.
  • 16. Class Exercise Given the following four documents with the following contents:  D1 = “computer information retrieval”  D2 = “computer retrieval”  D3 = “information”  D4 = “computer information” What are the relevant documents retrieved for the queries:  Q1 = “information  retrieval”  Q2 = “information  ¬computer”  Q3= “computer  information”
  • 17. Vector-Space Model This is the most commonly used strategy for measuring relevance of documents for a given query. This is because,  Use of binary weights is too limiting,  Non-binary weights provide consideration for partial matches. These term weights are used to compute a degree of similarity between a query and each document.  Ranked set of documents provides for better matching. The idea behind VSM is that:  The meaning of a document is conveyed by the words used in that document.
  • 18. Vector-Space (similarity-based) Model To find relevant documents for a given query: First, map documents and queries into a term-document vector space.  Note that queries are treated as short documents. Second, in the vector space, queries and documents are represented as weighted vectors, wij.  There are different weighting techniques; the most widely used one computes tf*idf for each term. Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.  Documents are ranked by closeness to the query.  Closeness is determined by a similarity score calculation.
  • 19. Term-document Matrix A collection of n documents and a query can be represented in the vector space model by a term-document matrix.  An entry in the matrix corresponds to the "weight" of a term in the document;  zero means the term has no significance in the document or simply doesn't occur in it. Otherwise, wij > 0 whenever ki ∈ dj.

          T1   T2   …  TN
      D1  w11  w21  …  wN1
      D2  w12  w22  …  wN2
      :   :    :       :
      DM  w1M  w2M  …  wNM
  • 20. Computing Weights How do we compute the weights for term i in document j and in query q, i.e. wij and wiq? A good weight must take into account two effects:  Quantification of intra-document content (similarity):  the tf factor, the term frequency within a document.  Quantification of inter-document separation (dissimilarity):  the idf factor, the inverse document frequency. As a result, most IR systems use the tf*idf weighting technique: wij = tf(i,j) * idf(i)
  • 21.  Let: N be the total number of documents in the collection,  ni be the number of documents which contain ti,  freq(i,j) the raw frequency of ti within dj.  A normalized tf(i,j) factor is given by tf(i,j) = freq(i,j) / max(freq(k,j)),  where the maximum is computed over all terms which occur within the document dj.  The idf factor is computed as idf(i) = log(N/ni).  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ti.  The normalized tf*idf weight is given by: wij = [freq(i,j) / max(freq(k,j))] * log(N/ni) Computing Weights
  • 22. Query:  The user's query is typically treated as a document and is also tf-idf weighted.  For the query term weights, a common suggestion is: wiq = (0.5 + 0.5 * freq(i,q) / max(freq(k,q))) * log(N/ni)  The vector space model with tf*idf weights is a good ranking strategy for general collections.  The vector space model is usually as good as the known ranking alternatives.  It is also simple and fast to compute. Computing Weights
  • 23.  A collection includes 10,000 documents:  The term A appears 20 times in a particular document,  the maximum frequency of any term in this document is 50,  and the term A appears in 2,000 of the collection's documents.  Compute the TF*IDF weight.  tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4  idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32  wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928 Example: Computing Weights
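  The worked example can be checked in code. This is a small sketch assuming a base-2 logarithm, which is what reproduces the slide's idf value of 2.32 for N = 10,000 and ni = 2,000.

  ```python
  import math

  def tfidf_weight(freq_ij, max_freq_j, N, n_i):
      """Normalized tf * idf, following the slide's formulas.

      tf  = raw frequency / max frequency in the document
      idf = log2(N / ni)   (base 2 assumed, matching idf = 2.32 in the slide)
      """
      tf = freq_ij / max_freq_j
      idf = math.log2(N / n_i)
      return tf * idf

  w = tfidf_weight(freq_ij=20, max_freq_j=50, N=10_000, n_i=2_000)
  print(round(w, 2))  # 0.93 (the slide keeps rounded factors: 0.4 * 2.32 = 0.928)
  ```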
  • 24. Similarity Measure A similarity measure is a function that computes the degree of similarity between two vectors. Using a similarity measure between the query and each document:  It is possible to rank the retrieved documents in the order of presumed relevance.  It is possible to enforce a certain threshold so that we can control the size of the retrieved set of documents.
  • 25. Similarity Measures  Dot product (simple matching): |Q ∩ D|  Dice's coefficient: 2|Q ∩ D| / (|Q| + |D|)  Jaccard's coefficient: |Q ∩ D| / |Q ∪ D|  Cosine coefficient: |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))  Overlap coefficient: |Q ∩ D| / min(|Q|, |D|)
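  Treating the query Q and a document D as sets of terms, the coefficients above map directly onto set operations. The example sets below are illustrative (a query and document loosely based on the gold/silver/truck collection used later in the chapter).

  ```python
  # Set-based similarity coefficients from the slide, for term sets Q and D.
  def dice(q, d):
      return 2 * len(q & d) / (len(q) + len(d))

  def jaccard(q, d):
      return len(q & d) / len(q | d)

  def overlap(q, d):
      return len(q & d) / min(len(q), len(d))

  Q = {"gold", "silver", "truck"}        # query terms
  D = {"shipment", "gold", "truck"}      # document terms
  # Two terms in common out of 3 + 3 terms, 4 distinct terms overall:
  print(round(dice(Q, D), 3),     # 0.667
        round(jaccard(Q, D), 3),  # 0.5
        round(overlap(Q, D), 3))  # 0.667
  ```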
  • 26. Similarity Measure  sim(q,dj) = cos(θ) = (dj · q) / (|dj| |q|) = Σi=1..n (wi,j · wi,q) / (√(Σi=1..n wi,j²) · √(Σi=1..n wi,q²))  Since wij ≥ 0 and wiq ≥ 0, 0 <= sim(q,dj) <= 1.  A document is retrieved even if it matches the query terms only partially.
  • 27. Vector Space with Term Weights and Cosine Matching Di = (di1, wdi1; di2, wdi2; …; dit, wdit), Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit) sim(Q, Di) = Σj (wqj · wdij) / (√(Σj wqj²) · √(Σj wdij²)) Example with two terms, A and B: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7) sim(Q, D1) = (0.4·0.8 + 0.8·0.3) / √[(0.4² + 0.8²)(0.8² + 0.3²)] = 0.56 / √0.58 ≈ 0.73 sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / √[(0.4² + 0.8²)(0.2² + 0.7²)] = 0.64 / √0.42 ≈ 0.98
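  The two-term cosine computation can be verified with a short function; the vectors follow the slide's example.

  ```python
  import math

  def cosine_sim(q, d):
      """Cosine of the angle between weight vectors q and d."""
      dot = sum(qi * di for qi, di in zip(q, d))
      norm_q = math.sqrt(sum(qi * qi for qi in q))
      norm_d = math.sqrt(sum(di * di for di in d))
      return dot / (norm_q * norm_d)

  Q  = (0.4, 0.8)   # query weights on terms A and B
  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)

  print(round(cosine_sim(Q, D1), 2))  # 0.73
  print(round(cosine_sim(Q, D2), 2))  # 0.98
  ```

  D2 ranks above D1: its vector points in nearly the same direction as the query's.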
  • 28.  Suppose user query for: Q = “gold silver truck”. The database collection consists of three documents with the following content.  D1: “Shipment of gold damaged in a fire”  D2: “Delivery of silver arrived in a silver truck”  D3: “Shipment of gold arrived in a truck”  Show retrieval results in ranked order?  Assume that full text terms are used during indexing, without removing common terms, stop words, & also no terms are stemmed.  Assume that content-bearing terms are selected during indexing.  Also compare your result with or without normalizing term frequency. Vector-Space Model: Example
  • 29. Vector-Space Model: Example  Raw term frequencies (TF), document frequencies (DF), IDF = log10(3/DF), and weights Wi = TF*IDF for the query and each document:

      Term      | TF: Q  D1  D2  D3 | DF | IDF   | W: Q      D1     D2     D3
      a         |     0  1   1   1  | 3  | 0     |    0      0      0      0
      arrived   |     0  0   1   1  | 2  | 0.176 |    0      0      0.176  0.176
      damaged   |     0  1   0   0  | 1  | 0.477 |    0      0.477  0      0
      delivery  |     0  0   1   0  | 1  | 0.477 |    0      0      0.477  0
      fire      |     0  1   0   0  | 1  | 0.477 |    0      0.477  0      0
      gold      |     1  1   0   1  | 2  | 0.176 |    0.176  0.176  0      0.176
      in        |     0  1   1   1  | 3  | 0     |    0      0      0      0
      of        |     0  1   1   1  | 3  | 0     |    0      0      0      0
      silver    |     1  0   2   0  | 1  | 0.477 |    0.477  0      0.954  0
      shipment  |     0  1   0   1  | 2  | 0.176 |    0      0.176  0      0.176
      truck     |     1  0   1   1  | 2  | 0.176 |    0.176  0      0.176  0.176
  • 30. Vector-Space Model  Resulting weight vectors:

      Term      Q      D1     D2     D3
      a         0      0      0      0
      arrived   0      0      0.176  0.176
      damaged   0      0.477  0      0
      delivery  0      0      0.477  0
      fire      0      0.477  0      0
      gold      0.176  0.176  0      0.176
      in        0      0      0      0
      of        0      0      0      0
      silver    0.477  0      0.954  0
      shipment  0      0.176  0      0.176
      truck     0.176  0      0.176  0.176
  • 31. Vector-Space Model: Example  Compute similarity using the cosine measure, sim(q,di).  First, for each document and the query, compute the vector lengths (zero terms ignored): |d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.517 = 0.719 |d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.2001 = 1.095 |d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.124 = 0.352 |q| = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538  Next, compute the dot products (zero products ignored):  q·d1 = 0.176*0.176 = 0.0310  q·d2 = 0.954*0.477 + 0.176*0.176 = 0.4862  q·d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
  • 32. Vector-Space Model: Example  Now, compute similarity score:  Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801  Sim(q,d2) = (0.4862 ) / (0.538*1.095)= 0.8246  Sim(q,d3) = (0.0620) / (0.538*0.352)= 0.3271  Finally, we sort and rank documents in descending order according to the similarity scores:  Rank 1: Doc 2 = 0.8246  Rank 2: Doc 3 = 0.3271  Rank 3: Doc 1 = 0.0801  Exercise: using normalized TF, rank documents using cosine similarity measure?  Hint: Normalize TF of term i in doc j using max frequency of a term k in document j.
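  The whole worked example (indexing, tf*idf weighting, cosine ranking) can be sketched end-to-end. This illustration uses raw term frequency (not max-normalized, as the slide's tables do) and base-10 logs, which reproduces the slide's weights; the final scores agree with the slide to two decimals, since the slide rounds intermediate values.

  ```python
  import math

  docs = {
      "D1": "shipment of gold damaged in a fire",
      "D2": "delivery of silver arrived in a silver truck",
      "D3": "shipment of gold arrived in a truck",
  }
  query = "gold silver truck"

  vocab = sorted({t for text in docs.values() for t in text.split()})
  N = len(docs)
  # Document frequency and idf = log10(N / df) per term.
  df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
  idf = {t: math.log10(N / df[t]) for t in vocab}

  def weights(text):
      """Vector of raw tf * idf weights over the vocabulary."""
      toks = text.split()
      return [toks.count(t) * idf[t] for t in vocab]

  def cosine(u, v):
      dot = sum(a * b for a, b in zip(u, v))
      return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

  q = weights(query)
  scores = {name: cosine(q, weights(text)) for name, text in docs.items()}
  for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
      print(name, round(s, 2))
  # Ranking matches the slide: D2 (≈0.82) > D3 (≈0.33) > D1 (≈0.08)
  ```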
  • 33. Vector-Space Model Advantages:  Term-weighting improves quality of the answer set since it displays in ranked order,  Partial matching allows retrieval of documents that approximate the query conditions,  Cosine ranking formula sorts documents according to degree of similarity to the query. Disadvantages:  Assumes independence of index terms (it is the freedom from the control or influence of index terms)
  • 34. Exercise 1 Consider these documents:  Doc 1: breakthrough drug for schizophrenia  Doc 2: new schizophrenia drug  Doc 3: new approach for treatment of schizophrenia  Doc 4: new hopes for schizophrenia patients  Draw the term-document incidence matrix for this document collection.  Draw the inverted index representation for this collection. For the document collection shown above, what are the returned results for the queries:  schizophrenia AND drug  for AND NOT(drug OR approach)