Probabilistic Retrieval Model

Baradhidasan P
2nd Year
Pondicherry University
INTRODUCTION
• Probability theory has been used as a principal
means for modeling the retrieval process in
mathematical terms.
• In conventional retrieval situations a
document is retrieved whenever the keyword
set attached to the document appears similar
in some sense to the query keywords.
• In this case the document is considered
relevant to the query.
Cont..
• The relevance of a document with respect to
a query is a matter of degree. It can therefore
be postulated that when the document and
query vectors are sufficiently similar, the
corresponding probability of relevance is large
enough to make it reasonable to retrieve the
document in response to the query.
• Applies the theory of probability
Why use Probabilities?
• Information Retrieval deals with uncertain
information
• Probability is a measure of uncertainty
• Probabilistic Ranking Principle

• provable
• minimization of risk
• Probabilistic Inference
• To justify your decision
Approach
• The basic underlying tenet of the probabilistic
approach to retrieval is that, for optimal
performance, documents should be ranked in
order of decreasing probability of relevance.
• Several models based on probabilistic
approaches have been advocated; here we
shall briefly look into three such models.
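The ranking tenet above can be illustrated with a minimal sketch. It assumes the probabilities of relevance have already been estimated by some model; the function name `rank_by_relevance`, the document ids and the numbers are hypothetical and only serve the illustration.

```python
def rank_by_relevance(prob_relevance):
    """Return document ids sorted by decreasing probability of relevance.

    prob_relevance: dict mapping a document id to its estimated
    probability of relevance (how it is estimated is outside this sketch).
    """
    # Sort on the probability in descending order; ties are broken by
    # document id so that the ordering is deterministic.
    return sorted(prob_relevance, key=lambda doc: (-prob_relevance[doc], doc))


# Made-up probabilities for three hypothetical documents.
example_probs = {"d1": 0.20, "d2": 0.75, "d3": 0.50}
ranking = rank_by_relevance(example_probs)
print(ranking)
```

The ranking step itself is only a sort; the hard part of any probabilistic model is producing the probabilities that are fed into it.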
Objectives
• Highlight influential work on probabilistic models for IR
• Provide a working understanding of the probabilistic
techniques through a set of common implementation
tricks
• Establish relationships between the popular
approaches: stress common ideas, explain differences
• Outline issues in extending the models to interactive,
cross-language, and multi-media retrieval
Maron and Kuhns
• Maron and Kuhns proposed a model for
probabilistic retrieval as early as 1960. They
advocated that the probability that a given
document would be relevant to a user can be
assessed by calculating, for each document in
the collection, the probability that a user
submitting a particular query would judge that
document relevant. Thus,
Cont..
• For a query consisting of only one term
(B), the probability that a particular document
(Dm) will be judged relevant is the ratio of
the number of users who submit query term (B)
and consider the document (Dm) to be relevant
to the number of users who submitted the
query term (B).
Cont..
• Adopting this approach, one has to employ
historical information to calculate the
probability of relevance: the number of times
users who submitted a particular query term (B)
judged a document (Dm) relevant, compared
with the total number of users who submitted
that particular query term (B).
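The ratio described above can be sketched in a few lines. This is only an illustration: the record layout (one `(query_term, relevant_docs)` pair per user submission), the function name `estimate_relevance` and the sample history are assumptions, not part of the original model description.

```python
def estimate_relevance(history, term, doc):
    """Estimate the probability that `doc` is judged relevant for `term`.

    history: iterable of (query_term, relevant_docs) pairs, one per user
    submission, where relevant_docs is the set of documents that user
    judged relevant. This layout is an assumption made for the sketch.
    """
    submitted = 0  # users who submitted the query term
    relevant = 0   # of those, users who judged the document relevant
    for query_term, relevant_docs in history:
        if query_term == term:
            submitted += 1
            if doc in relevant_docs:
                relevant += 1
    # With no history for the term the ratio is undefined.
    if submitted == 0:
        return None
    return relevant / submitted


# Hypothetical history: four users submitted term "B", one submitted "C".
history = [
    ("B", {"Dm", "Dk"}),
    ("B", {"Dm"}),
    ("B", set()),
    ("B", {"Dk"}),
    ("C", {"Dm"}),
]
p = estimate_relevance(history, "B", "Dm")  # 2 of the 4 "B" users
print(p)
```

Returning `None` when no user has submitted the term is a design choice of this sketch; it makes the dependence on historical information explicit instead of hiding it behind a default value.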
Salton approach
• The model suggested by Salton and McGill
takes a different approach. The essence of
this model is that if estimates for the
probability of occurrence of various terms in
relevant documents can be calculated, then the
probability that a document will be retrieved,
given that it is relevant, can be estimated.
• Several experiments have shown that the
probabilistic model can yield good results.
Two basic parameters
• The probability of relevance: Pr(rel)
• The probability of non-relevance: Pr(non-rel)
If relevance is considered a binary property,
then Pr(non-rel) = 1 - Pr(rel)
However, there are two cost parameters
associated with the process of retrieval
• a1: the loss associated with the retrieval of a
non-relevant record
Cont…
• a2: the loss associated with the non-retrieval
of a relevant record
• Because the retrieval of a non-relevant record
carries a loss of a1 {1 - Pr(rel)}, and the
rejection of a relevant item has an associated
loss of a2 Pr(rel), the total loss for a given
retrieval process will be minimized if an item
is retrieved whenever
a2 Pr(rel) ≥ a1 {1 - Pr(rel)}
Cont…
• A discriminant function g (or DISC) may be
defined, and an item may be retrieved whenever the
value of g is greater than or equal to zero, where
$$ g = \frac{\Pr(\mathrm{rel})}{1 - \Pr(\mathrm{rel})} - \frac{a_1}{a_2} $$
• The relevance properties of a record must be related to
the relevance properties of the various terms attached to
the record. The probabilities that a document is
relevant and not relevant, given that it has been
selected, are denoted Pr(rel | selected) and
Pr(non-rel | selected) respectively.
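The loss-based retrieval rule built from the two cost parameters can be sketched as follows. This is a minimal illustration assuming 0 <= Pr(rel) < 1 and a2 > 0; the function names and the sample values are hypothetical.

```python
def discriminant(p_rel, a1, a2):
    """g = Pr(rel) / (1 - Pr(rel)) - a1 / a2.

    Assumes 0 <= p_rel < 1 and a2 > 0 so that both divisions are defined.
    """
    return p_rel / (1.0 - p_rel) - a1 / a2


def should_retrieve(p_rel, a1, a2):
    """Retrieve when rejecting would cost at least as much as retrieving."""
    # a2 * Pr(rel): expected loss if the item is rejected but relevant.
    # a1 * (1 - Pr(rel)): expected loss if it is retrieved but non-relevant.
    return a2 * p_rel >= a1 * (1.0 - p_rel)


# Made-up costs: missing a relevant record is twice as costly as
# retrieving a non-relevant one.
a1, a2 = 1.0, 2.0
for p in (0.25, 0.5):
    print(p, should_retrieve(p, a1, a2), discriminant(p, a1, a2))
```

Under the stated assumptions the two functions express the same condition: dividing the inequality by a2 {1 - Pr(rel)} gives g >= 0, so either form can be used to decide whether an item is retrieved.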
Historical Background
• The first attempts to develop a probabilistic theory of
retrieval were made over 30 years ago [Maron and
Kuhns 1960; Miller 1971], and since then there has
been a steady development of the approach. There
are already several operational IR systems based upon
probabilistic or semi-probabilistic models.
• One major obstacle in probabilistic or
semi-probabilistic IR models is finding methods for
estimating the probabilities used to evaluate the
probability of relevance that are both theoretically
sound and computationally efficient.
Conclusion
