Information Retrieval : 20
Divergence from Randomness
Prof Neeraj Bhargava
Vaibhav Khanna
Department of Computer Science
School of Engineering and Systems Sciences
Maharshi Dayanand Saraswati University Ajmer
Divergence from Randomness
• A distinct probabilistic model has been
proposed by Amati and van Rijsbergen
• The idea is to compute term weights by
measuring the divergence between a term
distribution produced by a random process
and the actual term distribution
• Thus, the name divergence from randomness
• The model is based on two fundamental
assumptions, as follows.
First assumption:
• Not all words are equally important for describing
the content of the documents
• Words that carry little information are assumed to
be randomly distributed over the whole
document collection C
• Given a term ki, its probability distribution over
the whole collection is referred to as P(ki|C)
• The amount of information associated with this
distribution is given by −log P(ki|C)
• By modifying this probability function, we can
implement distinct notions of term randomness
Second assumption:
• A complementary term distribution can be obtained by
considering just the subset of documents that contain
term ki
• This subset is referred to as the elite set
• The corresponding probability distribution, computed
with regard to document dj, is referred to as P(ki|dj)
• The smaller the probability of observing a term ki in a
document dj, the rarer and more important the term is
considered to be
• Thus, the amount of information associated with the
term in the elite set is defined as 1 − P(ki|dj)
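The slide text does not preserve how the two quantities are combined. In the usual presentation of the model they are multiplied to give the weight of term ki in document dj. The symbol w_{i,j} is notation assumed here, not taken from the text above:

```latex
w_{i,j} = \bigl(-\log P(k_i \mid C)\bigr) \times \bigl(1 - P(k_i \mid d_j)\bigr)
```

Under this reading, a term gets a high weight when it is unlikely under the random model of the collection and also has a low probability within the document's elite-set distribution.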
Random Distribution
• To compute the distribution of terms in the collection,
distinct probability models can be considered
• For instance, consider that Bernoulli trials are used to
model the occurrences of a term in the collection
• To illustrate, consider a collection with 1,000 documents
and a term ki that occurs 10 times in the collection
• Then, the probability of observing 4 occurrences of term
ki in a document is given by
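The expression itself is missing from the slide text. Assuming each of the 10 occurrences falls independently into a given document with probability p = 1/1000 (one document out of 1,000), the standard binomial form would be:

```latex
P(k_i \mid C) = \binom{10}{4} \left(\frac{1}{1000}\right)^{4} \left(1 - \frac{1}{1000}\right)^{6}
```

More generally, writing N for the number of documents, F_i for the number of occurrences of ki in the collection, and f_{i,j} for the number of occurrences of ki in dj (symbols assumed here for illustration), the same reasoning gives:

```latex
P(k_i \mid C) = \binom{F_i}{f_{i,j}} \, p^{f_{i,j}} \, (1 - p)^{F_i - f_{i,j}}, \qquad p = \frac{1}{N}
```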
Random Distribution
• Under these conditions, we can approximate
the binomial distribution by a Poisson process,
which yields
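Neither the conditions nor the resulting expression survive in the slide text. The standard Poisson limit assumes that p = 1/N tends to 0 as N grows while λ_i = p × F_i stays constant, which gives P(ki|C) = e^(−λ_i) λ_i^(f_{i,j}) / f_{i,j}!. The symbols N, F_i, f_{i,j} and λ_i, and the function names below, are assumptions introduced for illustration. A minimal Python sketch comparing the two models on the 1,000-document example:

```python
import math


def binomial_prob(f_ij: int, F_i: int, N: int) -> float:
    """Probability of f_ij occurrences in one document under Bernoulli trials.

    Assumes each of the F_i occurrences of the term lands in a given
    document independently with probability p = 1/N.
    """
    p = 1.0 / N
    return math.comb(F_i, f_ij) * p**f_ij * (1.0 - p) ** (F_i - f_ij)


def poisson_prob(f_ij: int, F_i: int, N: int) -> float:
    """Poisson approximation with rate lambda_i = F_i / N (assumed constant)."""
    lam = F_i / N
    return math.exp(-lam) * lam**f_ij / math.factorial(f_ij)


if __name__ == "__main__":
    N, F_i = 1000, 10  # the example from the slides
    for f_ij in (0, 1, 4):
        b = binomial_prob(f_ij, F_i, N)
        q = poisson_prob(f_ij, F_i, N)
        print(f"f_ij={f_ij}: binomial={b:.6e}  poisson={q:.6e}")
```

With only 10 occurrences in the collection, the two models agree closely for 0 or 1 occurrences in a document, but differ by roughly a factor of two for 4 occurrences. The approximation tightens as F_i grows with λ_i held fixed.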
Distribution over the Elite Set
Normalization
Assignment
• Explain the Information Retrieval Model of
Divergence from Randomness
