Inception Probabilistic Approach to IR Data Basic Probability Theory Probability Ranking Principle Extension
A Probabilistic model of Information Retrieval
Harsh Thakkar
DA-IICT, Gandhinagar
2015-04-27
PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 1 / 59
Overview
1 Inception
2 Probabilistic Approach to IR
3 Data
4 Basic Probability Theory
5 Probability Ranking Principle
6 Extensions to BIM: Okapi
7 Performance measure
8 Comparison of Models
Why Probability in IR?
Queries are representations of a user’s information need
Relevance is binary
Retrieval is inherently uncertain, since users’ needs are vague in nature and change over time
Probability theory deals with uncertainty
It provides a good estimate of which documents to choose, and is hence more reliable
Relevance feedback
In relevance feedback, the user marks documents as relevant or nonrelevant
Given some known relevant and nonrelevant documents, we compute weights for non-query terms that indicate how likely they are to occur in relevant documents
The goal is to develop a probabilistic approach to relevance feedback, and from it a general probabilistic model of IR
Probabilistic Approach to Retrieval
Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query
An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query
Probability theory provides a principled foundation for such reasoning under uncertainty
Probabilistic models exploit this foundation to estimate how likely it is that a document is relevant to a query
Probabilistic IR Models at a Glance
Classical probabilistic retrieval model
Probability ranking principle
Binary Independence Model, BestMatch25 (Okapi)
Bayesian networks for text retrieval
Language model approach to IR
Probabilistic methods are among the oldest, yet currently among the most active, topics in IR
Datasets
The paper provides a common platform for a variety of performance results scattered over many other papers from 1992–1999
All collections from TREC 1–7 are used
Older results from datasets such as NPL, UKCIS, Cranfield, and the newly created TREC collection (described below) are also reproduced.
Cranfield collection: has short initial manual index descriptions based on the whole document
NPL: has short initial automatic descriptions from abstracts
UKCIS: has only the titles of the documents
TREC: has automatic initial descriptions in natural-text form, mostly from full documents but in some cases from abstracts. TREC is the largest of all the collections
TREC has a mixture of Long, Medium and Very short requests (L, M, V)
Datasets 2
Figure: Dataset descriptions from (Jones et al., 2000).
Datasets 3
Figure: Dataset statistics (Jones et al., 2000).
Basic Probability Theory
For events A and B:
Joint probability P(A ∩ B) of both events occurring
Conditional probability P(A|B) of event A occurring given that event B has occurred
Chain rule gives the fundamental relationship between joint and conditional probabilities:
P(AB) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Similarly for the complement Ā of an event A:
P(ĀB) = P(B|Ā)P(Ā)
Partition rule: if B can be divided into an exhaustive set of disjoint subcases, then P(B) is the sum of the probabilities of the subcases. A special case of this rule gives:
P(B) = P(AB) + P(ĀB)
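The chain rule and partition rule above can be checked numerically on a toy joint distribution (a sketch; the probabilities below are invented for illustration):

```python
# Toy joint distribution over two binary events A and B.
# The four probabilities are illustrative values, not from the paper.
P = {
    (True, True): 0.20,   # P(A and B)
    (True, False): 0.30,  # P(A and not-B)
    (False, True): 0.10,  # P(not-A and B)
    (False, False): 0.40, # P(not-A and not-B)
}

def p_a(a):
    """Marginal P(A = a), summing the joint over B."""
    return sum(v for (x, _), v in P.items() if x == a)

def p_b(b):
    """Marginal P(B = b), summing the joint over A."""
    return sum(v for (_, y), v in P.items() if y == b)

def p_b_given_a(b, a):
    """Conditional P(B = b | A = a) from the joint and the marginal."""
    return P[(a, b)] / p_a(a)

# Chain rule: P(AB) = P(B|A) P(A)
assert abs(P[(True, True)] - p_b_given_a(True, True) * p_a(True)) < 1e-12

# Partition rule (special case): P(B) = P(AB) + P(ĀB)
assert abs(p_b(True) - (P[(True, True)] + P[(False, True)])) < 1e-12
```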
Basic Probability Theory
Bayes’ Rule for inverting conditional probabilities:
P(A|B) = P(B|A)P(A) / P(B) = [ P(B|A) / Σ_{X ∈ {A, Ā}} P(B|X)P(X) ] P(A)
Can be thought of as a way of updating probabilities:
Start off with the prior probability P(A) (an initial estimate of how likely event A is in the absence of any other information)
Derive the posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold
Odds of an event provide a kind of multiplier for how probabilities change:
Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))
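Bayes’ rule and the odds transformation can be sketched in a few lines; the prior and likelihoods below are illustrative numbers, not from the paper:

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Posterior P(A|B) via Bayes' rule, expanding P(B) by the
    partition over the two cases A and not-A."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b

def odds(p):
    """Odds O(A) = P(A) / (1 - P(A))."""
    return p / (1.0 - p)

# Illustrative update: prior 0.1, evidence B is 8x more likely
# under A than under not-A.
posterior = bayes_posterior(0.1, 0.8, 0.1)

# The posterior odds equal the prior odds times the likelihood ratio,
# which is exactly the "multiplier" view of odds above.
assert abs(odds(posterior) - odds(0.1) * (0.8 / 0.1)) < 1e-9
```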
Basic Model
The probability of a document D being relevant (L) to a user is modeled as:
P(L|D) = P(D|L)P(L) / P(D)
We use the log-odds to quantify the change, since the odds can be derived from the probability by an order-preserving transformation
Thus, the equation becomes:
log [ P(L|D) / P(L̄|D) ] = log [ P(D|L)P(L) / ( P(D|L̄)P(L̄) ) ]
Basic Model - Matching Score
= log [ P(D|L) / P(D|L̄) ] + log [ P(L) / P(L̄) ]   (1)
Here we define the concept of a matching score, MS(D)
Since this function is the primitive form, it is named MS-PRIM(D), which is as follows:
MS-PRIM(D) = log [ P(L|D) / P(L̄|D) ] − log [ P(L) / P(L̄) ]   (2)
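Equations (1) and (2) can be verified numerically: the prior log-odds cancel, so MS-PRIM(D) reduces to the log-likelihood ratio of the document. A sketch with invented probabilities:

```python
import math

def log_odds(p):
    """log [ p / (1 - p) ]."""
    return math.log(p / (1.0 - p))

# Illustrative values, not from the paper:
p_l = 0.01                # prior P(L)
p_d_given_l = 0.002       # P(D|L)
p_d_given_not_l = 0.0005  # P(D|L̄)

# Posterior P(L|D) by Bayes' rule.
p_d = p_d_given_l * p_l + p_d_given_not_l * (1.0 - p_l)
p_l_given_d = p_d_given_l * p_l / p_d

# Equation (2): posterior log-odds minus prior log-odds.
ms_prim = log_odds(p_l_given_d) - log_odds(p_l)

# By equation (1) this equals log P(D|L) / P(D|L̄).
assert abs(ms_prim - math.log(p_d_given_l / p_d_given_not_l)) < 1e-9
```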
Basic Model - Independent attributes
The basic model of probabilistic IR has been built on the “Independence Assumption”, stated as:
“Given relevance, the attributes are statistically independent”
Thus, from the probability of statistically independent events, we have:
P(D|L) = Π_i P(A_i = a_i | L)
P(D|L̄) = Π_i P(A_i = a_i | L̄)
Here, A_i is the i-th attribute, with value equal to a_i for the specific document
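Under the independence assumption, the document likelihood is just a product of per-attribute probabilities. A minimal sketch (the probabilities are invented for illustration):

```python
def doc_likelihood(attr_probs):
    """P(D|L) under the independence assumption: the product of the
    per-attribute probabilities P(A_i = a_i | L). The same function
    gives P(D|L̄) when fed the non-relevant probabilities."""
    out = 1.0
    for p in attr_probs:
        out *= p
    return out

# Three illustrative attribute probabilities: 0.5 * 0.2 * 0.9 = 0.09
assert abs(doc_likelihood([0.5, 0.2, 0.9]) - 0.09) < 1e-9
```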
Basic Model - Independent attributes (2)
The MS-PRIM(D) derived earlier can now be written as:
MS-PRIM(D) = Σ_i log [ P(A_i = a_i | L) / P(A_i = a_i | L̄) ]   (3)
Thus, we can calculate a score for each document as the sum, over all independent attributes of the document D, of the log-likelihood ratios of their values.
This function takes into account both the attributes present in the document and those absent from it
For simplicity, we keep only the attributes that are present and normalise so that absent attributes (a_i = 0) contribute “zero”
We define a new measure, MS-BASIC(D), as follows:
MS-BASIC(D) = MS-PRIM(D) − Σ_i log [ P(A_i = 0 | L) / P(A_i = 0 | L̄) ]
Basic Model - Independent attributes (3)
Plugging in the value of MS-PRIM(D) from equation (3), we have:
MS-BASIC(D) = Σ_i ( log [ P(A_i = a_i | L) / P(A_i = a_i | L̄) ] − log [ P(A_i = 0 | L) / P(A_i = 0 | L̄) ] )   (4)
which can be further simplified to:
= Σ_i log [ P(A_i = a_i | L) P(A_i = 0 | L̄) / ( P(A_i = a_i | L̄) P(A_i = 0 | L) ) ]   (5)
Basic Model - Independent attributes (3)
We define this as our weight function, W(A_i = a_i). Thus,
MS-BASIC(D) = Σ_i W(A_i = a_i)
This function W provides a weight for each value of each attribute, and the matching score for a document is simply the sum of the weights.
W(A_i = 0) is always zero, i.e. for a randomly chosen term that is irrelevant to the query, we can reasonably assume the weight to be zero.
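The weight function and MS-BASIC can be sketched as follows; the helper names `weight` and `ms_basic`, and all the probabilities, are illustrative rather than from the paper:

```python
import math

def weight(p_rel, p_nonrel, p0_rel, p0_nonrel):
    """W(A_i = a_i) per equation (5): the log-likelihood ratio for
    the observed value, normalised by the ratio for the absent value
    a_i = 0, so that W(A_i = 0) is always exactly zero."""
    return math.log((p_rel * p0_nonrel) / (p_nonrel * p0_rel))

def ms_basic(attrs):
    """MS-BASIC(D): the sum of W over the document's attributes.
    `attrs` is a list of (p_rel, p_nonrel, p0_rel, p0_nonrel) tuples."""
    return sum(weight(*a) for a in attrs)

# An absent attribute (the same probabilities as the a_i = 0 case)
# contributes exactly zero, as the slide states:
assert weight(0.3, 0.6, 0.3, 0.6) == 0.0
```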
Basic Model-Term presence and absence
We can further simplify the above model by using the case
where attribute Ai is simply the presence or absence of a term
ti
We denote P(ti present|L) by pi and P(ti present|L) by pi
Substituting in previous eqation, the new weight formula
becomes:
wi = Σlog
pi (1 − pi )
pi (1 − pi )
(6)
Hence, the matching score for the document is just the sum
of the weights of the terms present.
This formula is later used in the BIM termed as RSV function.
PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 23 / 59
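As a concrete illustration, the weight of equation (6) and the resulting matching score can be sketched in a few lines of Python; the term names and probabilities below are invented for illustration, not estimates from any real collection:

```python
import math

def term_weight(p_i, p_bar_i):
    """Weight of a present term (equation 6): log odds ratio between
    relevance (p_i) and non-relevance (p_bar_i) probabilities."""
    return math.log((p_i * (1 - p_bar_i)) / (p_bar_i * (1 - p_i)))

def matching_score(present_terms, p, p_bar):
    """MS-BASIC: sum of weights over the terms present in the document."""
    return sum(term_weight(p[t], p_bar[t]) for t in present_terms)

# Hypothetical probabilities for two query terms (illustrative only)
p     = {"gold": 0.8, "silver": 0.6}   # P(t present | relevant)
p_bar = {"gold": 0.2, "silver": 0.5}   # P(t present | non-relevant)

score = matching_score(["gold", "silver"], p, p_bar)
```

Note that a term equally likely under relevance and non-relevance gets weight zero, matching the W(Ai = 0) convention above.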
The Document Ranking Problem
Ranked retrieval setup: given a collection of documents, the user issues a query, and an ordered list of documents is returned
Assume a binary notion of relevance: Rd,q is a dichotomous random variable, such that
Rd,q = 1 if document d is relevant w.r.t. query q
Rd,q = 0 otherwise
Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. the query: P(R = 1|d, q)
Assume that the relevance of each document is independent of the relevance of other documents
Probability Ranking Principle (PRP)
PRP in brief
If the retrieved documents (w.r.t. a query) are ranked decreasingly on their probability of relevance, then the effectiveness of the system will be the best that is obtainable
PRP in full
"If [the IR] system's response to each [query] is a ranking of the documents [...] in order of decreasing probability of relevance to the [query], where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data"
Binary Independence Model (BIM)
Traditionally used with the PRP
Assumptions:
'Binary' (equivalent to Boolean): documents and queries represented as binary term incidence vectors
E.g., document d represented by vector x = (x1, . . . , xM), where xt = 1 if term t occurs in d and xt = 0 otherwise
Different documents may have the same vector representation
'Independence': no association between terms (not true, but practically works - the 'naive' assumption of Naive Bayes models)
Binary incidence matrix

            Anthony and  Julius  The      Hamlet  Othello  Macbeth  . . .
            Cleopatra    Caesar  Tempest
Anthony          1          1       0        0        0        1
Brutus           1          1       0        1        0        0
Caesar           1          1       0        1        1        1
Calpurnia        0          1       0        0        0        0
Cleopatra        1          0       0        0        0        0
mercy            1          0       1        1        1        1
worser           1          0       1        1        1        0
. . .

Each document is represented as a binary vector ∈ {0, 1}^|V|.
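Such an incidence matrix can be built mechanically; this sketch constructs binary term incidence vectors from a toy corpus (the two miniature 'documents' are invented for illustration):

```python
def incidence_vectors(docs):
    """Map each document to a binary vector over the sorted vocabulary."""
    vocab = sorted({t for doc in docs.values() for t in doc.split()})
    return vocab, {
        name: [1 if t in doc.split() else 0 for t in vocab]
        for name, doc in docs.items()
    }

docs = {
    "d1": "anthony brutus caesar mercy",
    "d2": "caesar calpurnia brutus",
}
vocab, vectors = incidence_vectors(docs)
# Each vector lives in {0, 1}^|V|, with one coordinate per vocabulary term
```

As the slide notes, distinct documents can collapse to the same vector: the representation records only which terms occur, not how often or where.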
Binary Independence Model
To make a probabilistic retrieval strategy precise, we need to estimate how terms in documents contribute to relevance
Find measurable statistics (term frequency, document frequency, document length) that affect judgments about document relevance
Combine these statistics to estimate the probability P(R|d, q) of document relevance
Next: how exactly we can do this
Binary Independence Model
P(R|d, q) is modeled using term incidence vectors as P(R|x, q). By Bayes' rule:
P(R = 1|x, q) = P(x|R = 1, q) P(R = 1|q) / P(x|q)
P(R = 0|x, q) = P(x|R = 0, q) P(R = 0|q) / P(x|q)
P(x|R = 1, q) and P(x|R = 0, q): the probability that if a relevant or nonrelevant document is retrieved, then that document's representation is x
Use statistics about the document collection to estimate these probabilities
P(R = 1|q) and P(R = 0|q): the prior probability of retrieving a relevant or nonrelevant document for a query q
Estimate P(R = 1|q) and P(R = 0|q) from the percentage of relevant documents in the collection
Since a document is either relevant or nonrelevant to a query, we must have that:
P(R = 1|x, q) + P(R = 0|x, q) = 1
Deriving a Ranking Function for Query Terms (1)
Given a query q, ranking documents by P(R = 1|d, q) is modeled under BIM as ranking them by P(R = 1|x, q)
Easier: rank documents by their odds of relevance (gives the same ranking):
O(R|x, q) = P(R = 1|x, q) / P(R = 0|x, q)
          = [P(R = 1|q) P(x|R = 1, q) / P(x|q)] / [P(R = 0|q) P(x|R = 0, q) / P(x|q)]
          = [P(R = 1|q) / P(R = 0|q)] · [P(x|R = 1, q) / P(x|R = 0, q)]
P(R = 1|q) / P(R = 0|q) is a constant for a given query - can be ignored
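Ranking by odds gives the same order as ranking by probability because O = P/(1 − P) is strictly increasing in P. A quick numerical check, with arbitrary made-up probabilities:

```python
probs = {"d1": 0.9, "d2": 0.3, "d3": 0.6}           # arbitrary P(R = 1|x, q) values
odds = {d: p / (1 - p) for d, p in probs.items()}   # O(R|x, q) = P / (1 - P)

# Sorting by probability and by odds yields the same document ranking
rank_by_prob = sorted(probs, key=probs.get, reverse=True)
rank_by_odds = sorted(odds, key=odds.get, reverse=True)
```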
Deriving a Ranking Function for Query Terms (2)
It is at this point that we make the Naive Bayes conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):
P(x|R = 1, q) / P(x|R = 0, q) = ∏_{t=1}^{M} [P(xt|R = 1, q) / P(xt|R = 0, q)]
So:
O(R|x, q) = O(R|q) · ∏_{t=1}^{M} [P(xt|R = 1, q) / P(xt|R = 0, q)]
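Under this assumption the likelihood of a whole incidence vector factorizes into per-term likelihoods. A minimal sketch, with invented per-term probabilities:

```python
def vector_likelihood(x, per_term):
    # P(x|R, q) under term independence: multiply P(x_t|R, q) over all terms;
    # a present term contributes p_t, an absent term contributes 1 - p_t.
    prob = 1.0
    for x_t, p_t in zip(x, per_term):
        prob *= p_t if x_t == 1 else (1 - p_t)
    return prob

p_relevant = [0.8, 0.4, 0.1]   # invented P(x_t = 1 | R = 1, q) for three terms
x = [1, 0, 1]                  # a document's incidence vector
likelihood = vector_likelihood(x, p_relevant)   # 0.8 * (1 - 0.4) * 0.1
```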
Deriving a Ranking Function for Query Terms (3)
Since each xt is either 0 or 1, we can separate the terms:
O(R|x, q) = O(R|q) · ∏_{t: xt=1} [P(xt = 1|R = 1, q) / P(xt = 1|R = 0, q)] · ∏_{t: xt=0} [P(xt = 0|R = 1, q) / P(xt = 0|R = 0, q)]
Deriving a Ranking Function for Query Terms (4)
Let pt = P(xt = 1|R = 1, q) be the probability of a term appearing in a relevant document
Let ut = P(xt = 1|R = 0, q) be the probability of a term appearing in a nonrelevant document
These can be displayed as a contingency table:

                        relevant (R = 1)   nonrelevant (R = 0)
Term present (xt = 1)          pt                  ut
Term absent  (xt = 0)        1 − pt              1 − ut
Deriving a Ranking Function for Query Terms
Additional simplifying assumption: terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents
If qt = 0, then pt = ut
Now we need only consider terms in the products that appear in the query:
O(R|x, q) = O(R|q) · ∏_{t: xt=qt=1} (pt / ut) · ∏_{t: xt=0, qt=1} [(1 − pt) / (1 − ut)]
The left product is over query terms found in the document and the right product is over query terms not found in the document
Deriving a Ranking Function for Query Terms
Including the query terms found in the document into the right product, but simultaneously dividing by them in the left product, gives:
O(R|x, q) = O(R|q) · ∏_{t: xt=qt=1} [pt(1 − ut) / (ut(1 − pt))] · ∏_{t: qt=1} [(1 − pt) / (1 − ut)]
The left product is still over query terms found in the document, but the right product is now over all query terms, hence constant for a particular query and can be ignored.
→ The only quantity that needs to be estimated to rank documents w.r.t. a query is the left product
Hence the Retrieval Status Value (RSV) in this model:
RSVd = log ∏_{t: xt=qt=1} [pt(1 − ut) / (ut(1 − pt))] = Σ_{t: xt=qt=1} log [pt(1 − ut) / (ut(1 − pt))]
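Given estimates of pt and ut, the RSV is a sum over the query terms that occur in the document. A sketch with made-up estimates (the term names and probability values are hypothetical):

```python
import math

def rsv(query_terms, doc_terms, p, u):
    # RSV_d: sum of log[p_t(1 - u_t) / (u_t(1 - p_t))] over query terms present in d
    return sum(
        math.log((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
        for t in query_terms
        if t in doc_terms
    )

p = {"caesar": 0.7, "mercy": 0.5}   # hypothetical P(x_t = 1 | R = 1, q)
u = {"caesar": 0.2, "mercy": 0.5}   # hypothetical P(x_t = 1 | R = 0, q)

# 'mercy' carries zero weight (p_t = u_t); only 'caesar' contributes
score = rsv(["caesar", "mercy"], {"caesar", "mercy", "brutus"}, p, u)
```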
Deriving a Ranking Function for Query Terms
Equivalent: rank documents using the log odds ratios ct for the terms in the query:
ct = log [pt(1 − ut) / (ut(1 − pt))] = log [pt / (1 − pt)] − log [ut / (1 − ut)]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document is relevant (pt/(1 − pt)), and (ii) the odds of the term appearing if the document is nonrelevant (ut/(1 − ut))
ct = 0: the term has equal odds of appearing in relevant and nonrelevant docs
ct positive: higher odds of appearing in relevant documents
ct negative: higher odds of appearing in nonrelevant documents
Term weight ct in BIM
ct = log [pt / (1 − pt)] − log [ut / (1 − ut)] functions as a term weight.
Retrieval status value for document d: RSVd = Σ_{t: xt=qt=1} ct.
So BIM and the vector space model are identical on an operational level . . .
. . . except that the term weights are different.
In particular: we can use the same data structures (inverted index, etc.) for the two models.
How to compute probability estimates
For each term t in a query, estimate ct in the whole collection using a contingency table of counts of documents in the collection, where N is the total number of documents, n is the number of documents that contain term t, R is the number of relevant documents, and r is the number of relevant documents that contain t:

                        relevant      nonrelevant            Total
Term present (xt = 1)      r             n − r                 n
Term absent  (xt = 0)    R − r     (N − n) − (R − r)         N − n
Total                      R             N − R                 N

pt = r/R
ut = (n − r)/(N − R)
ct = K(N, n, R, r) = log [ (r/(R − r)) / ((n − r)/((N − n) − (R − r))) ]
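The table above translates directly into code; the counts below are invented. Note that this raw form is undefined whenever a cell of the table is zero, which motivates the smoothing discussed next:

```python
import math

def c_t(N, n, R, r):
    # K(N, n, R, r): log odds ratio estimated from raw document counts.
    # r/(R - r): odds of term presence among relevant documents;
    # (n - r)/((N - n) - (R - r)): odds among nonrelevant documents.
    return math.log((r / (R - r)) / ((n - r) / ((N - n) - (R - r))))

# Invented counts: 100 docs, term in 20 of them, 10 relevant, 8 of those contain it
weight = c_t(N=100, n=20, R=10, r=8)
```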
Avoiding zeros
If any of the counts is zero, then the term weight is not
well-defined.
Maximum likelihood estimates do not work for rare events.
To avoid zeros: add 0.5 to each count (expected likelihood
estimation = ELE)
For example, use R − r + 0.5 in place of R − r. The formula thus
becomes:
RSVd = Σ log [ ((r + 0.5)/(R − r + 0.5)) / ((n − r + 0.5)/((N − n) − (R − r) + 0.5)) ]   (7)
This was previously known as the relevance weight (RW) or
matching score relevance weight (MS-RW) (Robertson and
Sparck Jones, 1976)
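A sketch of the smoothed weight in Python, with hypothetical counts; note that the unsmoothed weight would be undefined at r = 0, while the ELE version stays finite:

```python
import math

def rsj_weight(N, R, n, r):
    # RSJ/ELE term weight of Eq. (7): add 0.5 to every count
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / ((N - n) - (R - r) + 0.5)))

# r = 0 (term occurs in no judged-relevant doc): still finite, just negative
w = rsj_weight(N=1000, R=20, n=100, r=0)
```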
Simplifying assumption
Assuming that relevant documents are a very small
percentage of the collection, approximate statistics for
nonrelevant documents by statistics from the whole collection
Hence, ut (the probability of term occurrence in nonrelevant
documents for a query) is n/N and
log[(1 − ut)/ut] = log[(N − n)/n] ≈ log N/n
The above approximation cannot easily be extended to
relevant documents
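A quick numeric check of the approximation (the numbers are arbitrary, chosen so that n is much smaller than N):

```python
import math

N, n = 1_000_000, 1_000
exact = math.log((N - n) / n)   # log[(N - n)/n]
approx = math.log(N / n)        # log N/n
# the gap equals log(N/(N - n)), roughly n/N, i.e. about 0.001 here
```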
Probability estimates in relevance feedback
Statistics of relevant documents (pt) in relevance feedback
can be estimated using maximum likelihood estimation or ELE
(add 0.5).
Use the frequency of term occurrence in known relevant
documents.
This is the basis of probabilistic approaches to relevance
feedback weighting in a feedback loop
What we just saw was a probabilistic relevance feedback
exercise since we were assuming the availability of relevance
judgments.
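One way to sketch the ELE estimate of pt from judged-relevant documents (representing documents as token sets is a hypothetical choice for illustration, not part of the model):

```python
def estimate_pt(term, relevant_docs):
    # ELE estimate of pt from judged-relevant documents:
    # add 0.5 to the term's count and 2 * 0.5 to the total
    VR = len(relevant_docs)
    VRt = sum(1 for doc in relevant_docs if term in doc)
    return (VRt + 0.5) / (VR + 1.0)

# 1 of 2 judged-relevant docs contains "okapi" -> (1 + 0.5)/(2 + 1) = 0.5
pt = estimate_pt("okapi", [{"okapi", "bm25"}, {"vector", "space"}])
```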
Probability estimates in ad hoc retrieval
Ad hoc retrieval: no user-supplied relevance judgments
available
In this case: assume that pt is constant over all terms xt in
the query and that pt = 0.5
Each term is equally likely to occur in a relevant document,
and so the pt and (1 − pt) factors cancel out in the expression
for RSV
Weak estimate, but doesn’t disagree violently with the
expectation that query terms appear in many but not all
relevant documents
Combining this method with the earlier approximation for ut,
the document ranking is determined simply by which query
terms occur in documents, scaled by their idf weighting
For short documents (titles or abstracts) in one-pass retrieval
situations, this estimate can be quite satisfactory
Outline
1 Inception
2 Probabilistic Approach to IR
3 Data
4 Basic Probability Theory
5 Probability Ranking Principle
6 Extensions to BIM: Okapi
7 Performance measure
8 Comparison of Models
History and summary of assumptions
Among the oldest formal models in IR
Maron & Kuhns, 1960: Since an IR system cannot predict with
certainty which document is relevant, we should deal with
probabilities
Assumptions for getting reasonable approximations of the
needed probabilities (in the BIM):
Boolean representation of documents/queries/relevance
Term independence
Out-of-query terms do not affect retrieval
Document relevance values are independent
How different are vector space and BIM?
They are not that different.
In either case you build an information retrieval scheme in the
exact same way.
For probabilistic IR, at the end, you score queries not by
cosine similarity and tf-idf in a vector space, but by a slightly
different formula motivated by probability theory.
Next: how to add term frequency and length normalization to
the probabilistic model.
Okapi BM25: Overview
Okapi BM25 is a probabilistic model that incorporates term
frequency (i.e., it’s nonbinary) and length normalization.
BIM was originally designed for short catalog records of fairly
consistent length, and it works reasonably in these contexts
For modern full-text search collections, a model should pay
attention to term frequency and document length
BestMatch25 (a.k.a. BM25 or Okapi) is sensitive to these
quantities
BM25 is one of the most widely used and robust retrieval
models
Okapi BM25: Starting point
The simplest score for document d is just idf weighting of the
query terms present in the document:
RSVd = Σ_{t∈q} log(N/n)
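A minimal sketch of this idf-only scoring over a toy collection (documents as term sets, a simplification in keeping with the binary model; the collection here is invented):

```python
import math
from collections import Counter

docs = [
    {"probabilistic", "retrieval", "model"},
    {"vector", "space", "model"},
    {"probabilistic", "ranking"},
]
N = len(docs)
df = Counter(t for d in docs for t in d)  # n per term (document frequency)

def rsv_idf(query, doc):
    # sum log(N/n) over the query terms present in the document
    return sum(math.log(N / df[t]) for t in query if t in doc)

query = {"probabilistic", "model"}
ranking = sorted(range(N), key=lambda i: rsv_idf(query, docs[i]), reverse=True)
```

The first document matches both query terms and therefore ranks first.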
Okapi BM25 Basic Weighting
Improve the idf term [log N/n] by factoring in term frequency and
document length:
RSVd = Σ_{t∈q} log(N/n) · [(k1 + 1) tftd] / [k1((1 − b) + b × (Ld/Lave)) + tftd]
tftd: term frequency in document d
Ld (Lave): length of document d (average document length in
the whole collection)
k1: tuning parameter controlling the document term
frequency scaling
b: tuning parameter controlling the scaling by document
length
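The formula translates directly into code. This is an illustrative sketch (the dictionary-based inputs are an arbitrary representation, and the defaults follow the parameter ranges discussed later):

```python
import math

def bm25_rsv(query, doc_tf, doc_len, avg_len, N, df, k1=1.2, b=0.75):
    # doc_tf: term -> tf in this document; df: term -> n (doc frequency)
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        length_norm = k1 * ((1 - b) + b * (doc_len / avg_len)) + tf
        score += idf * (k1 + 1) * tf / length_norm
    return score
```

With b = 0 the document-length normalization is switched off; with k1 = 0 the tf factor collapses to 1 and only the idf sum remains.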
Okapi BM25 weighting for long queries
For long queries, use similar weighting for query terms:
RSVd = Σ_{t∈q} log(N/n) · [(k1 + 1) tftd] / [k1((1 − b) + b × (Ld/Lave)) + tftd] · [(k3 + 1) tftq] / [k3 + tftq]
tftq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the
query
No length normalization of queries (because retrieval is being
done with respect to a single fixed query)
The tuning parameters should ideally be set to optimize
performance on a development test collection. In the absence
of such optimization, experiments suggest setting k1 and k3 to
a value between 1.2 and 2, and b = 0.75
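The full weighting, including the query-term factor, can be sketched as follows (again an illustrative representation, with default parameters chosen from the stated ranges):

```python
import math

def bm25_long_query(query_tf, doc_tf, doc_len, avg_len, N, df,
                    k1=1.2, b=0.75, k3=1.5):
    # query_tf: term -> tf in the (long) query; doc_tf: term -> tf in doc
    score = 0.0
    for t, tfq in query_tf.items():
        tfd = doc_tf.get(t, 0)
        if tfd == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        doc_part = (k1 + 1) * tfd / (k1 * ((1 - b) + b * doc_len / avg_len) + tfd)
        query_part = (k3 + 1) * tfq / (k3 + tfq)  # no query length normalization
        score += idf * doc_part * query_part
    return score
```

For a term that occurs once in the query (tftq = 1), the query factor equals (k3 + 1)/(k3 + 1) = 1, so short queries reduce to the basic weighting.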
Okapi at TREC-7
All the TREC-7 searches used varieties of Okapi BM25, such as:
RSVd = Σ_{t∈q} log(N/n) · [(k1 + 1) tftd] / [K + tftd] · [(k3 + 1) tftq] / [k3 + tftq] + k2 |Q| (Lave − Ld)/(Lave + Ld)
where K = k1((1 − b) + b × (Ld/Lave))
k2 is a parameter depending on the nature of the query and
possibly on the database
k2 was always zero in all searches of TREC-7, simplifying the
equation to:
RSVd = Σ_{t∈q} w1 · [(k1 + 1) tftd] / [K + tftd] · [(k3 + 1) tftq] / [k3 + tftq]
where w1 = log(N/n), a.k.a. the Robertson/Sparck Jones weight
(Robertson et al.,
Outline
1 Inception
2 Probabilistic Approach to IR
3 Data
4 Basic Probability Theory
5 Probability Ranking Principle
6 Extensions to BIM: Okapi
7 Performance measure
8 Comparison of Models
Performance measures used
Outline
1 Inception
2 Probabilistic Approach to IR
3 Data
4 Basic Probability Theory
5 Probability Ranking Principle
6 Extensions to BIM: Okapi
7 Performance measure
8 Comparison of Models
Probabilistic model vs. other models
Boolean model
Probabilistic models support ranking and thus are better than
the simple Boolean model.
Vector space model
The vector space model is also a formally defined model that
supports ranking.
Why would we want to look for an alternative to the vector
space model?
Probabilistic vs. vector space model
Vector space model: rank documents according to similarity
to query.
The notion of similarity does not translate directly into an
assessment of “is the document a good document to give to
the user or not?”
The most similar document can be highly relevant or
completely nonrelevant.
Probability theory is arguably a cleaner formalization of what
we really want an IR system to do: give relevant documents
to the user.
Inception Probabilistic Approach to IR Data Basic Probability Theory Probability Ranking Principle Extension
Which ranking model should I use?
I want something basic and simple → use vector space with tf-idf weighting.
I want a state-of-the-art ranking model with excellent performance → use language models or BM25 with tuned parameters.
In between: BM25 or language models with no or just one tuned parameter.
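The “tuned parameters” in the BM25 recommendation are typically k1 (term-frequency saturation) and b (document-length normalization). A minimal sketch of Okapi BM25 scoring, assuming the common idf variant and toy documents invented for illustration:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    k1 controls term-frequency saturation and b controls length
    normalization -- the two parameters one would tune per collection.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n        # average document length
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        df = sum(1 for d in docs if t in d)      # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

docs = [["probability", "ranking", "principle"],
        ["vector", "space", "model", "ranking"],
        ["okapi", "bm25", "probability"]]
query = ["probability", "ranking"]
ranked = sorted(range(len(docs)),
                key=lambda i: bm25_score(query, docs[i], docs),
                reverse=True)
```

With k1 = 1.2 and b = 0.75 as starting defaults, the document containing both query terms ranks first; tuning the two parameters per collection is what the “excellent performance” recommendation refers to.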
Resources
Chapter 11 of IIR
Resources at http://cislmu.org
Presentation on “Probabilistic Information Retrieval” by Dr. Suman Mitra at: http://www.irsi.res.in/winter-school/slides/ProbabilisticModel.ppt
PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 59 / 59

Probabilistic Information Retrieval

  • 1.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension A Probabilistic model of Information Retrieval Harsh Thakkar DA-IICT, Gandhinagar 2015-04-27 PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 1 / 59
  • 2.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Overview 1 Inception 2 Probabilistic Approach to IR 3 Data 4 Basic Probability Theory 5 Probability Ranking Principle 6 Extensions to BIM: Okapi 7 Performance measure 8 Comparision of Models PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 2 / 59
  • 3.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Outline 1 Inception 2 Probabilistic Approach to IR 3 Data 4 Basic Probability Theory 5 Probability Ranking Principle 6 Extensions to BIM: Okapi 7 Performance measure 8 Comparision of Models PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 3 / 59
  • 4.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Why Probability in IR? Queries are representations of user’s information need Relevance is binary Retrieval is inherently uncertain, since the needs of users are vague in nature i.e. change with time Probability deals with uncertainity Provides a good estimate of which documents to choose, hence more reliable PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 4 / 59
  • 5.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 5 / 59
  • 6.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Outline 1 Inception 2 Probabilistic Approach to IR 3 Data 4 Basic Probability Theory 5 Probability Ranking Principle 6 Extensions to BIM: Okapi 7 Performance measure 8 Comparision of Models PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 6 / 59
  • 7.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Relevance feedback PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 7 / 59
  • 8.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Relevance feedback In relevance feedback, the user marks documents as relevant/nonrelevant PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 7 / 59
  • 9.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Relevance feedback In relevance feedback, the user marks documents as relevant/nonrelevant Given some known relevant and nonrelevant documents, we compute weights for non-query terms that indicate how likely they will occur in relevant documents PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 7 / 59
  • 10.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Relevance feedback In relevance feedback, the user marks documents as relevant/nonrelevant Given some known relevant and nonrelevant documents, we compute weights for non-query terms that indicate how likely they will occur in relevant documents Develop a probabilistic approach for relevance feedback and also a general probabilistic model for IR PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 7 / 59
  • 11.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic Approach to Retrieval PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 8 / 59
  • 12.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic Approach to Retrieval Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 8 / 59
  • 13.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic Approach to Retrieval Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 8 / 59
  • 14.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic Approach to Retrieval Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query Probability theory provides a principled foundation for such reasoning under uncertainty PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 8 / 59
  • 15.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic Approach to Retrieval Given a user information need (represented as a query) and a collection of documents (transformed into document representations), a system must determine how well the documents satisfy the query An IR system has an uncertain understanding of the user query, and makes an uncertain guess of whether a document satisfies the query Probability theory provides a principled foundation for such reasoning under uncertainty Probabilistic models exploit this foundation to estimate how likely it is that a document is relevant to a query PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 8 / 59
  • 16.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 17.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 18.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model Probability ranking principle PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 19.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model Probability ranking principle Binary Independence Model, BestMatch25 (Okapi) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 20.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model Probability ranking principle Binary Independence Model, BestMatch25 (Okapi) Bayesian networks for text retrieval PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 21.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model Probability ranking principle Binary Independence Model, BestMatch25 (Okapi) Bayesian networks for text retrieval Language model approach to IR PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 22.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Probabilistic IR Models at a Glance Classical probabilistic retrieval model Probability ranking principle Binary Independence Model, BestMatch25 (Okapi) Bayesian networks for text retrieval Language model approach to IR Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 9 / 59
  • 23.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Outline 1 Inception 2 Probabilistic Approach to IR 3 Data 4 Basic Probability Theory 5 Probability Ranking Principle 6 Extensions to BIM: Okapi 7 Performance measure 8 Comparision of Models PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 10 / 59
  • 24.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 25.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 26.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 27.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. Cranfield collection: has short initial manual index descriptions based on the whole document PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 28.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. Cranfield collection: has short initial manual index descriptions based on the whole document NLP: has short initial automatic descriptions from abstracts PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 29.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. Cranfield collection: has short initial manual index descriptions based on the whole document NLP: has short initial automatic descriptions from abstracts UKCIS: has only the title of the documents PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 30.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. Cranfield collection: has short initial manual index descriptions based on the whole document NLP: has short initial automatic descriptions from abstracts UKCIS: has only the title of the documents TREC: has automatic initial descriptions in natural text form mostly from full documents but in some cases from abstracts. TREC is the largest of all the collections PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 31.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets The paper provides a common platform to a variety of performance scattered over many other papers from 1992-1999 All collections from TREC 1-7 are used Also older results from datasets such as NPL, UKCIS, Cranfield and a newly created TREC (described below) are reproduced. Cranfield collection: has short initial manual index descriptions based on the whole document NLP: has short initial automatic descriptions from abstracts UKCIS: has only the title of the documents TREC: has automatic initial descriptions in natural text form mostly from full documents but in some cases from abstracts. TREC is the largest of all the collections TREC had a mixture of Long, Medium and Very short requests (L,M,V) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 11 / 59
  • 32.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets 2 Figure : Dataset descriptions from (Jones et al., 2000). PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 12 / 59
  • 33.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Datasets 3 Figure : Dataset statistics (Jones et al., 2000). PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 13 / 59
  • 34.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Outline 1 Inception 2 Probabilistic Approach to IR 3 Data 4 Basic Probability Theory 5 Probability Ranking Principle 6 Extensions to BIM: Okapi 7 Performance measure 8 Comparision of Models PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 14 / 59
  • 35.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 36.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 37.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B Joint probability P(A ∩ B) of both events occurring PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 38.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B Joint probability P(A ∩ B) of both events occurring Conditional probability P(A|B) of event A occurring given that event B has occurred PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 39.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B Joint probability P(A ∩ B) of both events occurring Conditional probability P(A|B) of event A occurring given that event B has occurred Chain rule gives fundamental relationship between joint and conditional probabilities: P(AB) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 40.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B Joint probability P(A ∩ B) of both events occurring Conditional probability P(A|B) of event A occurring given that event B has occurred Chain rule gives fundamental relationship between joint and conditional probabilities: P(AB) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A) Similarly for the complement of an event P(A): P(AB) = P(B|A)P(A) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 41.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory For events A and B Joint probability P(A ∩ B) of both events occurring Conditional probability P(A|B) of event A occurring given that event B has occurred Chain rule gives fundamental relationship between joint and conditional probabilities: P(AB) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A) Similarly for the complement of an event P(A): P(AB) = P(B|A)P(A) Partition rule: if B can be divided into an exhaustive set of disjoint subcases, then P(B) is the sum of the probabilities of the subcases. A special case of this rule gives: P(B) = P(AB) + P(AB) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 15 / 59
  • 42.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 43.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) P(B) = P(B|A) X∈{A,A} P(B|X)P(X) P(A) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 44.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) P(B) = P(B|A) X∈{A,A} P(B|X)P(X) P(A) Can be thought of as a way of updating probabilities: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 45.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) P(B) = P(B|A) X∈{A,A} P(B|X)P(X) P(A) Can be thought of as a way of updating probabilities: Start off with prior probability P(A) (initial estimate of how likely event A is in the absence of any other information) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 46.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) P(B) = P(B|A) X∈{A,A} P(B|X)P(X) P(A) Can be thought of as a way of updating probabilities: Start off with prior probability P(A) (initial estimate of how likely event A is in the absence of any other information) Derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 47.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Probability Theory Bayes’ Rule for inverting conditional probabilities: P(A|B) = P(B|A)P(A) P(B) = P(B|A) X∈{A,A} P(B|X)P(X) P(A) Can be thought of as a way of updating probabilities: Start off with prior probability P(A) (initial estimate of how likely event A is in the absence of any other information) Derive a posterior probability P(A|B) after having seen the evidence B, based on the likelihood of B occurring in the two cases that A does or does not hold Odds of an event provide a kind of multiplier for how probabilities change: Odds: O(A) = P(A) P(A) = P(A) 1 − P(A) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 16 / 59
  • 48.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 49.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 50.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: P(L|D) = P(D|L)P(L) P(D) We use the Log-Odds to quantify the change since it can be derived from probabiliy by an oder-preserving transformation PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 51.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: P(L|D) = P(D|L)P(L) P(D) We use the Log-Odds to quantify the change since it can be derived from probabiliy by an oder-preserving transformation PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 52.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: P(L|D) = P(D|L)P(L) P(D) We use the Log-Odds to quantify the change since it can be derived from probabiliy by an oder-preserving transformation Thus, the equation becomes: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 53.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: P(L|D) = P(D|L)P(L) P(D) We use the Log-Odds to quantify the change since it can be derived from probabiliy by an oder-preserving transformation Thus, the equation becomes: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 54.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model The probability of a document (D) being relevant (L) to a user is modeled as: P(L|D) = P(D|L)P(L) P(D) We use the Log-Odds to quantify the change since it can be derived from probabiliy by an oder-preserving transformation Thus, the equation becomes: log P(L|D) P(L|D) = log P(D|L)P(L) P(D|L)P(L) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 17 / 59
  • 55.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Matching Score =log P(D|L) P(D|L) + log P(L) P(L) (1) Here we define the concept of matching score: MS(D) MS − PRIM(D) = log P(L|D) P(L|D) − log P(L) P(L) (2) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 18 / 59
  • 56.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Matching Score =log P(D|L) P(D|L) + log P(L) P(L) (1) Here we define the concept of matching score: MS(D) Since the function is primitive, thus named MS-PRIM(D), which is as follows: MS − PRIM(D) = log P(L|D) P(L|D) − log P(L) P(L) (2) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 18 / 59
  • 57.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes The basic model of probabilistic IR has been built on the ”Independence Assumption”, stated as: PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 19 / 59
  • 58.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes The basic model of probabilistic IR has been built on the ”Independence Assumption”, stated as: ”Given relevance, the attributes are statistically independent” PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 19 / 59
  • 59.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes The basic model of probabilistic IR has been built on the ”Independence Assumption”, stated as: ”Given relevance, the attributes are statistically independent” Thus, from the probability of statistically independent events, we have: P(D|L) = ΠP(Ai = ai |L) P(D|L) = ΠP(Ai = ai |L) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 19 / 59
  • 60.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes The basic model of probabilistic IR has been built on the ”Independence Assumption”, stated as: ”Given relevance, the attributes are statistically independent” Thus, from the probability of statistically independent events, we have: P(D|L) = ΠP(Ai = ai |L) P(D|L) = ΠP(Ai = ai |L) Here, Ai is the ith attribute with value equal to ai for that specific document PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 19 / 59
  • 61.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (2) The MS-PRIM(D) derived ealier can now be written as: MS − PRIM(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) (3) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 20 / 59
  • 62.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (2) The MS-PRIM(D) derived ealier can now be written as: MS − PRIM(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) (3) Thus, we could calculate a score for each document, based on the sum of the products of all independent attributes relating to rhe document (D). PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 20 / 59
  • 63.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (2) The MS-PRIM(D) derived ealier can now be written as: MS − PRIM(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) (3) Thus, we could calculate a score for each document, based on the sum of the products of all independent attributes relating to rhe document (D). This function takes into account both the relevant and non-relevant attributes PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 20 / 59
  • 64.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (2) The MS-PRIM(D) derived ealier can now be written as: MS − PRIM(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) (3) Thus, we could calculate a score for each document, based on the sum of the products of all independent attributes relating to rhe document (D). This function takes into account both the relevant and non-relevant attributes For simplicity we take into account the relevant attributes and take others as ”zero” PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 20 / 59
  • 65.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (2) The MS-PRIM(D) derived ealier can now be written as: MS − PRIM(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) (3) Thus, we could calculate a score for each document, based on the sum of the products of all independent attributes relating to rhe document (D). This function takes into account both the relevant and non-relevant attributes For simplicity we take into account the relevant attributes and take others as ”zero” We define a new measure called MS-BASIC(D) which is as: MS − BASIC(D) = MS − PRIM(D) − Σlog P(Ai = 0|L) P(Ai = 0|L) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 20 / 59
  • 66.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (3) Plugging in the value of MS-PRIM(D) from equation 4, we have: MS−BASIC(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) − log P(Ai = 0|L) P(Ai = 0|L) (4) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 21 / 59
  • 67.
    Inception Probabilistic Approachto IR Data Basic Probability Theory Probability Ranking Principle Extension Basic Model - Independent attributes (3) Plugging in the value of MS-PRIM(D) from equation 4, we have: MS−BASIC(D) = Σlog P(Ai = ai |L) P(Ai = ai |L) − log P(Ai = 0|L) P(Ai = 0|L) (4) which can be further simplified to: =Σlog P(Ai = ai |L)P(Ai = 0|L) P(Ai = ai |L)P(Ai = 0|L) (5) PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 21 / 59
Basic Model - Independent attributes (3)

We define this as our weight function, W(A_i = a_i). Thus,

MS-BASIC(D) = Σ_i W(A_i = a_i)

This function W provides a weight for each value of each attribute, and the matching score for a document is simply the sum of the weights.

W(A_i = 0) is always zero, i.e. for a randomly chosen term that is irrelevant to the query, we can reasonably assume the weight to be zero.
Basic Model - Term presence and absence

We can further simplify the above model by considering the case where attribute A_i is simply the presence or absence of a term t_i.

We denote P(t_i present | L) by p_i and P(t_i present | L̄) by p̄_i.

Substituting into the previous equation, the new weight formula becomes:

w_i = log [ p_i (1 − p̄_i) / ( p̄_i (1 − p_i) ) ]   (6)

Hence, the matching score for the document is just the sum of the weights of the terms present.

This formula is used later in the BIM, where it is termed the RSV function.
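The per-term weight and the summed matching score can be sketched in Python. This is a minimal illustration only; the term names and probability values below are hypothetical, not from the slides:

```python
import math

def term_weight(p_rel, p_nonrel):
    """w_i = log[ p(1 - p̄) / (p̄(1 - p)) ], where p = P(term present | relevant)
    and p̄ = P(term present | non-relevant)."""
    return math.log((p_rel * (1.0 - p_nonrel)) / (p_nonrel * (1.0 - p_rel)))

def matching_score(doc_terms, weights):
    """MS-BASIC(D): sum the weights of the weighted terms present in the document."""
    return sum(weights[t] for t in doc_terms if t in weights)

# Hypothetical per-term probabilities:
weights = {
    "probabilistic": term_weight(0.8, 0.3),  # commoner in relevant docs -> positive weight
    "retrieval":     term_weight(0.5, 0.5),  # equally likely in both -> weight 0
}
score = matching_score({"probabilistic", "retrieval", "the"}, weights)
```

Terms with no assigned weight (here "the") contribute nothing, matching the convention W(A_i = 0) = 0.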
Outline

1 Inception
2 Probabilistic Approach to IR
3 Data
4 Basic Probability Theory
5 Probability Ranking Principle
6 Extensions to BIM: Okapi
7 Performance measure
8 Comparison of Models
The Document Ranking Problem

Ranked retrieval setup: given a collection of documents, the user issues a query, and an ordered list of documents is returned.

Assume a binary notion of relevance: R_{d,q} is a random dichotomous variable, such that
  R_{d,q} = 1 if document d is relevant w.r.t. query q
  R_{d,q} = 0 otherwise

Probabilistic ranking orders documents decreasingly by their estimated probability of relevance w.r.t. the query: P(R = 1 | d, q).

Assume that the relevance of each document is independent of the relevance of other documents.
Probability Ranking Principle (PRP)

PRP in brief: If the retrieved documents (w.r.t. a query) are ranked decreasingly on their probability of relevance, then the effectiveness of the system will be the best that is obtainable.

PRP in full: "If [the IR] system's response to each [query] is a ranking of the documents [...] in order of decreasing probability of relevance to the [query], where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
Binary Independence Model (BIM)

Traditionally used with the PRP. Assumptions:

'Binary' (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x_1, ..., x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise. Different documents may have the same vector representation.

'Independence': no association between terms (not true, but it works in practice - the 'naive' assumption of Naive Bayes models).
Binary incidence matrix

              Anthony    Julius    The       Hamlet   Othello   Macbeth   ...
              and        Caesar    Tempest
              Cleopatra
Anthony          1          1         0         0         0         1
Brutus           1          1         0         1         0         0
Caesar           1          1         0         1         1         1
Calpurnia        0          1         0         0         0         0
Cleopatra        1          0         0         0         0         0
mercy            1          0         1         1         1         1
worser           1          0         1         1         1         0
...

Each document is represented as a binary vector ∈ {0, 1}^|V|.
Binary Independence Model

To make a probabilistic retrieval strategy precise, we need to estimate how terms in documents contribute to relevance:

Find measurable statistics (term frequency, document frequency, document length) that affect judgments about document relevance.

Combine these statistics to estimate the probability P(R | d, q) of document relevance.

Next: how exactly we can do this.
Binary Independence Model

P(R | d, q) is modeled using term incidence vectors as P(R | x, q):

P(R = 1 | x, q) = P(x | R = 1, q) P(R = 1 | q) / P(x | q)

P(R = 0 | x, q) = P(x | R = 0, q) P(R = 0 | q) / P(x | q)

P(x | R = 1, q) and P(x | R = 0, q): the probability that, if a relevant (resp. nonrelevant) document is retrieved, that document's representation is x.

Use statistics about the document collection to estimate these probabilities.
Binary Independence Model

P(R | d, q) is modeled using term incidence vectors as P(R | x, q):

P(R = 1 | x, q) = P(x | R = 1, q) P(R = 1 | q) / P(x | q)

P(R = 0 | x, q) = P(x | R = 0, q) P(R = 0 | q) / P(x | q)

P(R = 1 | q) and P(R = 0 | q): the prior probability of retrieving a relevant (resp. nonrelevant) document for a query q.

Estimate P(R = 1 | q) and P(R = 0 | q) from the percentage of relevant documents in the collection.

Since a document is either relevant or nonrelevant to a query, we must have:

P(R = 1 | x, q) + P(R = 0 | x, q) = 1
Deriving a Ranking Function for Query Terms (1)

Given a query q, ranking documents by P(R = 1 | d, q) is modeled under BIM as ranking them by P(R = 1 | x, q).

Easier: rank documents by their odds of relevance (gives the same ranking):

O(R | x, q) = P(R = 1 | x, q) / P(R = 0 | x, q)
            = [ P(R = 1 | q) P(x | R = 1, q) / P(x | q) ] / [ P(R = 0 | q) P(x | R = 0, q) / P(x | q) ]
            = [ P(R = 1 | q) / P(R = 0 | q) ] · [ P(x | R = 1, q) / P(x | R = 0, q) ]

P(R = 1 | q) / P(R = 0 | q) is a constant for a given query and can be ignored.
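A quick numerical check, with made-up probabilities, that the odds are a monotonic transform of the posterior, so ranking by either gives the same order:

```python
def posterior(p_x_given_rel, p_x_given_nonrel, prior_rel):
    """P(R = 1 | x, q) via Bayes' rule."""
    num = p_x_given_rel * prior_rel
    return num / (num + p_x_given_nonrel * (1.0 - prior_rel))

def odds(p_x_given_rel, p_x_given_nonrel, prior_rel):
    """O(R | x, q) = [P(R=1|q)/P(R=0|q)] * [P(x|R=1,q)/P(x|R=0,q)]."""
    return (prior_rel / (1.0 - prior_rel)) * (p_x_given_rel / p_x_given_nonrel)

# Two toy documents, each summarized by (P(x|R=1,q), P(x|R=0,q)):
docs = [(0.4, 0.1), (0.2, 0.3)]
prior = 0.05

by_prob = sorted(docs, key=lambda d: posterior(*d, prior), reverse=True)
by_odds = sorted(docs, key=lambda d: odds(*d, prior), reverse=True)
```

Since O = P/(1 − P) is strictly increasing in P, `by_prob` and `by_odds` always agree.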
Deriving a Ranking Function for Query Terms (2)

It is at this point that we make the Naive Bayes conditional independence assumption: the presence or absence of a word in a document is independent of the presence or absence of any other word (given the query):

P(x | R = 1, q) / P(x | R = 0, q) = ∏_{t=1}^{M} P(x_t | R = 1, q) / P(x_t | R = 0, q)

So:

O(R | x, q) = O(R | q) · ∏_{t=1}^{M} P(x_t | R = 1, q) / P(x_t | R = 0, q)
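A small sketch of this factorization in Python: the likelihood ratio of the whole vector is a product of per-term ratios, using P(x_t = 1 | ·) when the term is present and 1 − P(x_t = 1 | ·) when it is absent. The per-term probabilities are made up for illustration:

```python
p_rel    = [0.7, 0.2, 0.5]   # P(x_t = 1 | R = 1, q) for terms t = 1..3 (hypothetical)
p_nonrel = [0.3, 0.4, 0.5]   # P(x_t = 1 | R = 0, q)
x        = [1, 0, 1]         # term incidence vector of the document

ratio = 1.0
for xt, p, u in zip(x, p_rel, p_nonrel):
    # P(x_t | R=1, q) / P(x_t | R=0, q) for this term's observed value
    ratio *= (p if xt else 1.0 - p) / (u if xt else 1.0 - u)
```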
Deriving a Ranking Function for Query Terms (3)

Since each x_t is either 0 or 1, we can separate the terms:

O(R | x, q) = O(R | q) · ∏_{t: x_t = 1} P(x_t = 1 | R = 1, q) / P(x_t = 1 | R = 0, q) · ∏_{t: x_t = 0} P(x_t = 0 | R = 1, q) / P(x_t = 0 | R = 0, q)
Deriving a Ranking Function for Query Terms (4)

Let p_t = P(x_t = 1 | R = 1, q) be the probability of a term appearing in a relevant document.

Let p̄_t = P(x_t = 1 | R = 0, q) be the probability of a term appearing in a nonrelevant document.

This can be displayed as a contingency table:

                        document
                        relevant (R = 1)   nonrelevant (R = 0)
Term present  x_t = 1        p_t                  p̄_t
Term absent   x_t = 0      1 − p_t              1 − p̄_t
Deriving a Ranking Function for Query Terms

Additional simplifying assumption: terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: if q_t = 0, then p_t = p̄_t.

Now we need only consider the terms in the products that appear in the query:

O(R | x, q) = O(R | q) · ∏_{t: x_t = q_t = 1} p_t / p̄_t · ∏_{t: x_t = 0, q_t = 1} (1 − p_t) / (1 − p̄_t)

The left product is over query terms found in the document, and the right product is over query terms not found in the document.
Deriving a Ranking Function for Query Terms

Including the query terms found in the document in the right product, while simultaneously dividing by them in the left product, gives:

O(R | x, q) = O(R | q) · ∏_{t: x_t = q_t = 1} [ p_t (1 − p̄_t) ] / [ p̄_t (1 − p_t) ] · ∏_{t: q_t = 1} (1 − p_t) / (1 − p̄_t)

The left product is still over query terms found in the document, but the right product is now over all query terms; it is therefore constant for a particular query and can be ignored.

→ The only quantity that needs to be estimated to rank documents w.r.t. a query is the left product.

Hence the Retrieval Status Value (RSV) in this model:

RSV_d = log ∏_{t: x_t = q_t = 1} [ p_t (1 − p̄_t) ] / [ p̄_t (1 − p_t) ] = Σ_{t: x_t = q_t = 1} log [ p_t (1 − p̄_t) ] / [ p̄_t (1 − p_t) ]
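The RSV can be sketched directly in Python; the query, document, and probability values below are hypothetical:

```python
import math

def rsv(doc_terms, query_terms, p, p_bar):
    """RSV_d: sum, over query terms present in the document, of
    log[ p_t (1 - p̄_t) / (p̄_t (1 - p_t)) ]."""
    return sum(
        math.log(p[t] * (1 - p_bar[t]) / (p_bar[t] * (1 - p[t])))
        for t in query_terms
        if t in doc_terms
    )

p     = {"gold": 0.7, "silver": 0.6}   # P(x_t = 1 | R = 1, q), made up
p_bar = {"gold": 0.2, "silver": 0.4}   # P(x_t = 1 | R = 0, q), made up

# Only "gold" is both in the query and in the document, so only it contributes:
score = rsv({"gold", "truck"}, {"gold", "silver"}, p, p_bar)
```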
Deriving a Ranking Function for Query Terms

Equivalently: rank documents using the log odds ratio c_t for each term in the query:

c_t = log [ p_t (1 − p̄_t) ] / [ p̄_t (1 − p_t) ] = log [ p_t / (1 − p_t) ] − log [ p̄_t / (1 − p̄_t) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if the document is relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if the document is nonrelevant, p̄_t / (1 − p̄_t).

c_t = 0: the term has equal odds of appearing in relevant and nonrelevant documents.
c_t positive: higher odds of appearing in relevant documents.
c_t negative: higher odds of appearing in nonrelevant documents.
Term weight c_t in BIM

c_t = log p_t/(1 − p_t) − log u_t/(1 − u_t) functions as a term weight.
Retrieval status value for document d: RSV_d = Σ_{x_t = q_t = 1} c_t.
So BIM and the vector space model are identical on an operational level ... except that the term weights are different.
In particular: we can use the same data structures (inverted index etc.) for the two models.
How to compute probability estimates

For each term t in a query, estimate c_t in the whole collection using a contingency table of document counts, where N is the collection size, n the number of documents that contain term t, R the number of relevant documents, and r the number of relevant documents containing t:

                         relevant   nonrelevant         Total
Term present (x_t = 1)   r          n − r               n
Term absent  (x_t = 0)   R − r      (N − n) − (R − r)   N − n
Total                    R          N − R               N

p_t = r/R        u_t = (n − r)/(N − R)

c_t = K(N, n, R, r) = log [ (r/(R − r)) / ((n − r)/((N − n) − (R − r))) ]
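As a minimal sketch, the estimates and the resulting term weight can be computed directly from the four counts; the function name and the (N, n, R, r) argument layout are illustrative, following the K(N, n, R, r) notation above:

```python
from math import log

def c_t(N, n, R, r):
    """Term weight c_t from contingency counts (maximum-likelihood
    estimates; assumes none of the table cells is zero).
    N: total docs, n: docs containing t,
    R: relevant docs, r: relevant docs containing t."""
    p_t = r / R              # P(term present | relevant)
    u_t = (n - r) / (N - R)  # P(term present | nonrelevant)
    # log odds ratio: log[p_t/(1-p_t)] - log[u_t/(1-u_t)]
    return log(p_t / (1 - p_t)) - log(u_t / (1 - u_t))
```

For example, with N = 100, n = 20, R = 10, r = 8 this gives log 26 ≈ 3.26: the term has much higher odds of appearing in relevant documents.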
Avoiding zeros

If any of the counts is zero, then the term weight is not well-defined.
Maximum likelihood estimates do not work for rare events.
To avoid zeros: add 0.5 to each count (expected likelihood estimation = ELE).
For example, use R − r + 0.5 in place of R − r. The formula then becomes:

RSV_d = Σ log [ ((r + 0.5)/(R − r + 0.5)) / ((n − r + 0.5)/((N − n) − (R − r) + 0.5)) ]   (7)

This weight was previously known as the relevance weight (RW) or matching score relevance weight (MS-RW) (Robertson and Sparck Jones, 1976).
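A sketch of the smoothed weight of equation (7); the function name is illustrative:

```python
from math import log

def rsj_weight(N, n, R, r):
    """Robertson/Sparck Jones relevance weight: the c_t estimate with
    0.5 added to each count (ELE), so it stays defined when counts are 0."""
    return log(((r + 0.5) / (R - r + 0.5)) /
               ((n - r + 0.5) / ((N - n) - (R - r) + 0.5)))
```

Unlike the maximum-likelihood estimate, this stays finite even when r = 0: a query term that appears in no known relevant document simply gets a negative weight.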
Simplifying assumption

Assuming that relevant documents are a very small percentage of the collection, approximate statistics for nonrelevant documents by statistics from the whole collection.
Hence u_t (the probability of term occurrence in nonrelevant documents for a query) is n/N, and
log[(1 − u_t)/u_t] = log[(N − n)/n] ≈ log N/n
This approximation cannot easily be extended to relevant documents.
Probability estimates in relevance feedback

Statistics of relevant documents (p_t) in relevance feedback can be estimated using maximum likelihood estimation or ELE (add 0.5).
Use the frequency of term occurrence in known relevant documents.
This is the basis of probabilistic approaches to relevance feedback weighting in a feedback loop.
What we just saw was a probabilistic relevance feedback exercise, since we were assuming the availability of relevance judgments.
Probability estimates in ad-hoc retrieval

Ad-hoc retrieval: no user-supplied relevance judgments available.
In this case: assume that p_t is constant over all terms x_t in the query, and that p_t = 0.5.
Each term is then equally likely to occur in a relevant document, so the p_t and (1 − p_t) factors cancel out in the expression for RSV.
A weak estimate, but it does not disagree violently with the expectation that query terms appear in many but not all relevant documents.
Combining this method with the earlier approximation for u_t, the document ranking is determined simply by which query terms occur in documents, scaled by their idf weighting.
For short documents (titles or abstracts) in one-pass retrieval situations, this estimate can be quite satisfactory.
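Under these assumptions the score reduces to summing log N/n over the query terms a document contains — essentially idf weighting. A sketch, where the `df` dictionary (term → document frequency n) is an assumed precomputed structure:

```python
from math import log

def rsv_adhoc(query_terms, doc_terms, df, N):
    """Ad-hoc RSV with p_t = 0.5 (relevant-side factors cancel) and
    u_t approximated by n/N: each matching query term contributes
    its idf weight log(N/n)."""
    return sum(log(N / df[t]) for t in set(query_terms) if t in doc_terms)
```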
Extensions to BIM: Okapi
History and summary of assumptions

Among the oldest formal models in IR.
Maron & Kuhns, 1960: since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities.
Assumptions for getting reasonable approximations of the needed probabilities (in the BIM):
  Boolean representation of documents/queries/relevance
  Term independence
  Out-of-query terms do not affect retrieval
  Document relevance values are independent
How different are vector space and BIM?

They are not that different. In either case you build an information retrieval scheme in the exact same way.
For probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Next: how to add term frequency and length normalization to the probabilistic model.
Okapi BM25: Overview

Okapi BM25 is a probabilistic model that incorporates term frequency (i.e., it is nonbinary) and length normalization.
BIM was originally designed for short catalog records of fairly consistent length, and it works reasonably in those contexts.
For modern full-text search collections, a model should pay attention to term frequency and document length.
BestMatch25 (a.k.a. BM25 or Okapi) is sensitive to both of these quantities.
BM25 is one of the most widely used and robust retrieval models.
Okapi BM25: Starting point

The simplest score for document d is just idf weighting of the query terms present in the document:

RSV_d = Σ_{t∈q} log N/n
Okapi BM25 basic weighting

Improve the idf term log N/n by factoring in term frequency and document length:

RSV_d = Σ_{t∈q} log(N/n) · [(k1 + 1) tf_td] / [k1((1 − b) + b × (L_d/L_ave)) + tf_td]

tf_td: term frequency in document d
L_d (L_ave): length of document d (average document length in the whole collection)
k1: tuning parameter controlling the document term frequency scaling
b: tuning parameter controlling the scaling by document length
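The formula above can be sketched as a per-term scoring function (names and defaults are illustrative; k1 = 1.2 and b = 0.75 are the commonly cited defaults):

```python
from math import log

def bm25_weight(tf_td, N, n, L_d, L_ave, k1=1.2, b=0.75):
    """BM25 weight of one query term in document d: idf times a
    saturating tf factor with document-length normalization."""
    idf = log(N / n)
    norm = k1 * ((1 - b) + b * (L_d / L_ave))
    return idf * ((k1 + 1) * tf_td) / (norm + tf_td)
```

The tf factor grows with tf_td but saturates below (k1 + 1), and documents longer than average are penalized through the L_d/L_ave term.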
Okapi BM25 weighting for long queries

For long queries, use similar weighting for query terms:

RSV_d = Σ_{t∈q} log(N/n) · [(k1 + 1) tf_td] / [k1((1 − b) + b × (L_d/L_ave)) + tf_td] · [(k3 + 1) tf_tq] / [k3 + tf_tq]

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is done with respect to a single fixed query).
The tuning parameters should ideally be set to optimize performance on a development test collection. In the absence of such optimization, experiments have shown that reasonable values are k1 and k3 between 1.2 and 2, and b = 0.75.
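Putting the three factors together for a whole query can be sketched as follows; the dict-based term-frequency layout is an assumption for illustration:

```python
from math import log

def bm25_score(tf_q, tf_d, df, N, L_d, L_ave, k1=1.2, b=0.75, k3=1.5):
    """BM25 score of document d for query q, including the
    (k3 + 1)tf_tq / (k3 + tf_tq) query-term factor for long queries.
    tf_q/tf_d map terms to frequencies; df maps terms to doc frequency n."""
    score = 0.0
    for t, tftq in tf_q.items():
        tftd = tf_d.get(t, 0)
        if tftd == 0:
            continue                      # term absent from document
        idf = log(N / df[t])
        doc_factor = ((k1 + 1) * tftd) / (
            k1 * ((1 - b) + b * (L_d / L_ave)) + tftd)
        query_factor = ((k3 + 1) * tftq) / (k3 + tftq)
        score += idf * doc_factor * query_factor
    return score
```

For a short query where every tf_tq = 1, the query factor is (k3 + 1)/(k3 + 1) = 1 and the score reduces to the basic document-side weighting.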
Okapi at TREC-7

All the TREC-7 searches used varieties of Okapi BM25, such as:

RSV_d = Σ_{t∈q} log(N/n) · [(k1 + 1) tf_td] / [K + tf_td] · [(k3 + 1) tf_tq] / [k3 + tf_tq] + k2 · |Q| · (L_ave − L_d)/(L_ave + L_d)

where K = k1((1 − b) + b × (L_d/L_ave)).
k2 is a parameter depending on the nature of the query and possibly on the database.
k2 was always zero in all searches of TREC-7, simplifying the equation to:

RSV_d = Σ_{t∈q} w1 · [(k1 + 1) tf_td] / [K + tf_td] · [(k3 + 1) tf_tq] / [k3 + tf_tq]

where w1 = log N/n, a.k.a. the Robertson/Sparck Jones weight (Robertson et al.)
Performance measure
Performance measures used
Comparison of Models
Probabilistic model vs. other models

Boolean model: probabilistic models support ranking and thus are better than the simple Boolean model.
Vector space model: the vector space model is also a formally defined model that supports ranking. Why would we want to look for an alternative to it?
  • 209.
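To make the Boolean model's limitation concrete, here is a minimal sketch (names and the toy index are illustrative, not from the slides): a conjunctive Boolean query returns an unordered set of matching documents, so every match is treated as equally relevant and no ranking is possible.

```python
def boolean_and(query_terms, inverted_index):
    """Boolean AND retrieval: intersect the postings of every query term.

    Returns an unordered *set* of document ids -- the model has no way
    to say that one matching document is better than another.
    """
    postings = [set(inverted_index.get(t, ())) for t in query_terms]
    return set.intersection(*postings) if postings else set()

# Toy inverted index: term -> ids of documents containing it
index = {"probabilistic": [1, 2], "retrieval": [2, 3]}
matches = boolean_and(["probabilistic", "retrieval"], index)
```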
    Inception Probabilistic Approach to IR Data Basic Probability Theory Probability Ranking Principle Extension
    Probabilistic vs. vector space model
    Vector space model: rank documents according to similarity to query.
    The notion of similarity does not translate directly into an assessment of “is the document a good document to give to the user or not?”
    The most similar document can be highly relevant or completely nonrelevant.
    Probability theory is arguably a cleaner formalization of what we really want an IR system to do: give relevant documents to the user.
    PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 57 / 59
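The vector space model's similarity score is purely geometric, as the following sketch illustrates (raw term-frequency vectors are used here for brevity; the slides' tf-idf weighting is omitted). The cosine value measures the angle between query and document vectors, not the probability that the document satisfies the user's need.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Vector space ranking score: cosine of the angle between the
    query and document term-frequency vectors. A value near 1 means
    the vectors point the same way -- which need not mean the
    document is relevant to the user."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0
```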
    Inception Probabilistic Approach to IR Data Basic Probability Theory Probability Ranking Principle Extension
    Which ranking model should I use?
    I want something basic and simple → use vector space with tf-idf weighting.
    I want to use a state-of-the-art ranking model with excellent performance → use language models or BM25 with tuned parameters.
    In between: BM25 or language models with no or just one tuned parameter.
    PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 58 / 59
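The "tuned parameters" the slide refers to are BM25's k1 (term-frequency saturation) and b (length normalization). A minimal sketch of the standard BM25 scoring formula, assuming precomputed document statistics (the function and argument names are illustrative):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula.

    doc_tf: term -> frequency of the term in this document
    df:     term -> number of documents containing the term
    N:      total number of documents in the collection
    k1, b:  the tunable parameters mentioned on the slide
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        # Probabilistic idf, smoothed to stay non-negative
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        # Saturating tf component with document-length normalization
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```

Setting b = 0 disables length normalization and large k1 makes the score nearly linear in tf, which is why tuning these two values per collection matters for the "excellent performance" case.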
    Inception Probabilistic Approach to IR Data Basic Probability Theory Probability Ranking Principle Extension
    Resources
    Chapter 11 of IIR
    Resources at http://cislmu.org
    Presentation on “Probabilistic Information Retrieval” by Dr. Suman Mitra at: http://www.irsi.res.in/winter-school/slides/ProbabilisticModel.ppt
    PhD Comprehensive presentation Part 1: Probabilistic Information Retrieval 59 / 59