The document discusses probabilistic retrieval models in information retrieval. It introduces three influential probabilistic models: (1) Maron and Kuhns' 1960 model which calculates the probability of relevance based on historical user data; (2) Salton's model which estimates the probability of term occurrence in relevant documents; (3) A model that ranks documents by the probability of relevance and considers retrieval as a decision between costs of retrieving non-relevant vs. not retrieving relevant documents. The document provides background on the development of probabilistic IR models and challenges of estimating probabilities for evaluation.
2. INTRODUCTION
• Probability theory has been used as a principal
means for modeling the retrieval process in
mathematical terms.
• In conventional retrieval situations a
document is retrieved whenever the keyword
set attached to the document appears similar in some
sense to the query keywords.
• In this case the document is considered
relevant to the query.
3. Cont..
• Since the relevance of a document with
respect to a query is a matter of degree, it can
be postulated that when the document and
query vectors are sufficiently similar, the
corresponding probability of relevance is large
enough to make it reasonable to retrieve the
document in response to the query.
• Applies the theory of probability
4. Why use Probabilities?
• Information Retrieval deals with uncertain
information
• Probability is a measure of uncertainty
• Probabilistic Ranking Principle
• provable
• minimization of risk
• Probabilistic Inference
• To justify your decision
5. Approach
• The basic underlying tenet of the probabilistic
approach to Retrieval is that, for optimal
performance documents should be ranked in
order of decreasing probability of relevance.
• Several models based on probabilistic
approaches have been advocated; here we
shall briefly look into three such models.
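The ranking tenet on this slide can be illustrated with a minimal sketch. The document ids and probability values below are invented for illustration; how the probabilities are estimated is the subject of the models that follow.

```python
# Hypothetical probability-of-relevance estimates for four documents
# (illustrative numbers only, not taken from any real collection).
estimated_relevance = {"d1": 0.20, "d2": 0.85, "d3": 0.55, "d4": 0.05}


def rank_by_probability(scores):
    """Return document ids in order of decreasing probability of relevance."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_by_probability(estimated_relevance)
print(ranking)
```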
6. Objectives
• Highlight influential work on probabilistic models for IR
• Provide a working understanding of the probabilistic
techniques through a set of common implementation
tricks
• Establish relationships between the popular
approaches: stress common ideas, explain differences
• Outline issues in extending the models to interactive,
cross-language, multi-media retrieval
7. Maron and Kuhns
• Maron and Kuhns proposed a model for
probabilistic retrieval as early as 1960. They
advocated that the probability that a given
document would be relevant to a user can be
assessed by calculating, for each document in
the collection, the probability that a user
submitting a particular query would judge that
document relevant. Thus,
8. Cont..
• For a query consisting of only one term
(B), the probability that a particular document
(Dm) will be judged relevant is the ratio of the
number of users who submit query term (B) and
consider the document (Dm) to be relevant to
the number of users who submitted the query
term (B).
• Adopting this approach, one has to employ
historical information to calculate the
probability of relevance.
9. Cont..
• That is, one compares the number of times users
who submitted a particular query term (B)
judged a document (Dm) relevant with the total
number of users who submitted that particular
query term (B).
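The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log format (one entry per user submission, holding the query term and the set of documents that user judged relevant) and the function name `relevance_probability` are hypothetical, not part of the original model description.

```python
# Hypothetical usage history: one entry per user submission, as
# (query_term, set_of_documents_that_user_judged_relevant).
history = [
    ("B", {"Dm", "Dn"}),
    ("B", {"Dm"}),
    ("B", set()),
    ("B", {"Dn"}),
    ("C", {"Dm"}),
]


def relevance_probability(log, term, doc):
    """Estimate the probability that `doc` is judged relevant for `term`.

    Computed as: users who submitted `term` and judged `doc` relevant,
    divided by all users who submitted `term`. Returns None when the
    term has never been submitted, since no estimate is possible.
    """
    judgments = [judged for t, judged in log if t == term]
    if not judgments:
        return None
    favourable = sum(1 for judged in judgments if doc in judged)
    return favourable / len(judgments)


p = relevance_probability(history, "B", "Dm")
print(p)
```

In this toy log, four users submitted term B and two of them judged Dm relevant. The `None` case shows the practical weakness of the approach: without historical data for a term, no probability can be calculated.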
10. Salton approach
• The model suggested by Salton and McGill
takes a different approach. The essence of
this model is that if estimates for the
probability of occurrence of various terms in
relevant documents can be calculated, then the
probability that a document will be retrieved,
given that it is relevant, can be estimated.
Several experiments have shown that the
probabilistic model can yield good results.
11. Two basic parameters
• The probability of relevance: Pr(rel)
• The probability of non-relevance: Pr(non-rel)
If relevance is considered a binary property,
then Pr(non-rel) = 1 − Pr(rel).
However, there are two cost parameters
associated with the process of retrieval:
• a1: the loss associated with the retrieval of a
non-relevant record
12. Cont…
• a2: the loss associated with the non-retrieval
of a relevant record
• Because the retrieval of a non-relevant record
carries a loss of a1[1 − Pr(rel)], and the
rejection of a relevant item has an associated
loss of a2 Pr(rel), the total loss for a given
retrieval process will be minimized if an item is
retrieved whenever
a2 Pr(rel) ≥ a1[1 − Pr(rel)]
13. Cont…
• A discriminant function g (or DISC) may be
defined, and an item may be retrieved whenever the
value of g is greater than or equal to zero, where
• g = DISC = Pr(rel) / [1 − Pr(rel)] − a1 / a2
• The relevance properties of a record must be related to
the relevance properties of the various terms attached to
the record. The probabilities that a document is
relevant and not relevant, given that it has been
selected, are defined by P(rel | selected) and P(non-rel |
selected) respectively.
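The loss-based retrieval rule from the preceding slides (retrieve when a2 Pr(rel) ≥ a1[1 − Pr(rel)], or equivalently when g ≥ 0) can be sketched as follows. The function names and the sample cost values are assumptions made for illustration only.

```python
def discriminant(p_rel, a1, a2):
    """g = Pr(rel) / (1 - Pr(rel)) - a1 / a2.

    p_rel: estimated probability of relevance, with 0 <= p_rel < 1
    a1:    loss for retrieving a non-relevant record
    a2:    loss for not retrieving a relevant record (must be > 0)
    """
    if not 0.0 <= p_rel < 1.0:
        raise ValueError("p_rel must satisfy 0 <= p_rel < 1")
    if a2 <= 0:
        raise ValueError("a2 must be positive")
    return p_rel / (1.0 - p_rel) - a1 / a2


def should_retrieve(p_rel, a1, a2):
    """Retrieve when the expected loss of rejecting is at least the
    expected loss of retrieving: a2 * Pr(rel) >= a1 * (1 - Pr(rel)).

    This direct form avoids the division, so it also handles p_rel == 1.
    """
    return a2 * p_rel >= a1 * (1.0 - p_rel)


# Illustrative costs: missing a relevant record is twice as costly
# as retrieving a non-relevant one.
for p in (0.2, 0.4, 0.8):
    print(p, discriminant(p, a1=1.0, a2=2.0), should_retrieve(p, a1=1.0, a2=2.0))
```

With a1 = 1 and a2 = 2 the rule retrieves any item whose probability of relevance is at least 1/3, which shows how raising the cost of a missed relevant record lowers the retrieval threshold.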
14. Historical Background
The first attempts to develop a probabilistic theory of
retrieval were made over 30 years ago [Maron and
Kuhns 1960; Miller 1971], and since then there has
been a steady development of the approach. There
are already several operational IR systems based upon
probabilistic or semiprobabilistic models.
One major obstacle in probabilistic or
semiprobabilistic IR models is finding methods for
estimating the probabilities used to evaluate the
probability of relevance that are both theoretically
sound and computationally efficient.