Probabilistic Retrieval

INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE
Alex Lin
Senior Architect
Intelligent Mining
alin at IntelligentMinining.com

Overview of Retrieval Models
  Boolean Retrieval
  Vector Space Model
  Probabilistic Model
  Language Model

Boolean Retrieval
  Example query: lincoln AND NOT (car AND automobile)
  The earliest model and still in use today
  The result is very easy to explain to users
  Highly efficient computationally
  The major drawback: it lacks a sophisticated ranking algorithm.
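
A minimal sketch of how the example query could be evaluated with set operations over an inverted index. The index contents are made up for illustration, and `postings` is a hypothetical helper name. Note that the result is an unordered set, which is exactly the missing-ranking drawback noted above.

```python
# Toy inverted index: term -> set of document ids (made-up data).
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5, 6},
}


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


# lincoln AND NOT (car AND automobile)
excluded = postings("car") & postings("automobile")  # AND is set intersection
result = postings("lincoln") - excluded              # AND NOT is set difference

print(sorted(result))
```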

Vector Space Model
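[Figure: Doc1, Doc2, and Query drawn as vectors in a term space; labeled axes include Term2 and Term3]

$$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}$$

  Major flaw: it lacks guidance on the details of how weighting and ranking algorithms are related to relevance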
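
The cosine measure can be sketched in a few lines. The vectors below are made-up term weights over a three-term vocabulary. How those weights are chosen is left open by the model, which is the flaw noted above.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    return dot / norm if norm else 0.0  # an all-zero vector matches nothing


# Made-up weights for illustration only.
doc1 = [0.5, 0.8, 0.3]
doc2 = [0.9, 0.4, 0.2]
query = [1.5, 1.0, 0.0]

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    [("Doc1", cosine(doc1, query)), ("Doc2", cosine(doc2, query))],
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```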

Probabilistic Retrieval Model

[Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]

Bayes' Rule:

$$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$

Likelihood Ratio
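  Likelihood ratio: classify D as relevant when

$$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$

p_i: the probability that term i occurs in a document from the relevant set
s_i: the probability that term i occurs in a document from the non-relevant set

$$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$

Taking logs and dropping the factor that is the same for every document gives a rank-equivalent score:

$$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$

Restricting the sum to query terms and estimating p_i and s_i from counts (with 0.5 smoothing) gives:

$$\sum_{i:d_i=q_i=1} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}$$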
N: total number of documents in the collection
n_i: number of documents that contain term i (so n_i − r_i is the number of non-relevant documents that contain it)
r_i: number of relevant documents that contain term i
R: total number of relevant documents
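
The smoothed term weight can be sketched as a small function. It assumes that N counts all documents in the collection and n_i all documents containing term i, which is what the differences n_i − r_i and N − n_i − R + r_i in the formula require. The function name `rsj_weight` and the counts below are illustrative.

```python
import math


def rsj_weight(r_i, n_i, R, N):
    """Smoothed log odds ratio for one term.

    r_i: relevant documents containing the term
    n_i: all documents containing the term
    R:   relevant documents in total
    N:   all documents in the collection
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    non_relevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


# Made-up counts: a term that appears in 8 of 10 relevant documents
# but only 10 of 1000 documents overall.
strong = rsj_weight(r_i=8, n_i=10, R=10, N=1000)
weak = rsj_weight(r_i=1, n_i=10, R=10, N=1000)

# With no relevance information (r_i = R = 0) the weight reduces to an
# IDF-like quantity, log((N - n_i + 0.5) / (n_i + 0.5)).
no_feedback = rsj_weight(r_i=0, n_i=10, R=0, N=1000)
print(strong, weak, no_feedback)
```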

Combine with BM25 Ranking Algorithm
  BM25 extends the scoring function of the binary independence model to include document and query term weights.
  It performs very well in TREC experiments
$$R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

$$K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)$$

k_1, k_2, b: tuning parameters
f_i: frequency of term i in the document
qf_i: frequency of term i in the query
dl: document length
avgdl: average document length in the data set
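
A minimal sketch of the BM25 score for one document, simplified to the case with no relevance information (r_i = R = 0). The defaults k1 = 1.2, k2 = 100, and b = 0.75 are commonly used values, not taken from the slides, and all counts in the example are made up.

```python
import math


def bm25_score(query_tf, doc_tf, dl, avgdl, doc_freq, N, k1=1.2, k2=100.0, b=0.75):
    """BM25 score of one document, assuming r_i = R = 0.

    query_tf: term -> frequency in the query (qf_i)
    doc_tf:   term -> frequency in the document (f_i)
    doc_freq: term -> number of documents containing the term (n_i)
    """
    K = k1 * ((1 - b) + b * dl / avgdl)  # document-length normalization
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq[term]
        # First factor of the formula with r_i = R = 0.
        idf_part = math.log((0.5 / 0.5) / ((n + 0.5) / (N - n + 0.5)))
        tf_part = (k1 + 1) * f / (K + f)
        qtf_part = (k2 + 1) * qf / (k2 + qf)
        score += idf_part * tf_part * qtf_part
    return score


# Made-up example: a two-term query against one document.
query_tf = {"president": 1, "lincoln": 1}
doc_tf = {"president": 15, "lincoln": 25}
doc_freq = {"president": 40000, "lincoln": 300}
score = bm25_score(query_tf, doc_tf, dl=90, avgdl=100, doc_freq=doc_freq, N=500000)
print(score)
```

In this example the rarer term contributes most of the score even though both terms occur many times in the document, because the term-frequency factor saturates as f_i grows.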

Weighted Fields Boolean Search
doc-id field0 field1 … text
1
2
3
…
n

$$R(q,D) = \sum_{i \in q} \sum_{f \in \text{fields}} w_f\, m_i$$
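
A sketch of the weighted-field score. The slides do not define m_i, so this assumes it is 1 when query term i occurs in field f and 0 otherwise. The field weights and document contents are made up.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """Sum w_f over every (query term, field) pair where the term matches.

    Assumption: the match indicator m_i is binary (term present in field).
    """
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            tokens = doc_fields.get(field, "").lower().split()
            if term.lower() in tokens:
                score += weight
    return score


# Hypothetical weights: a match in field0 counts more than a match in text.
field_weights = {"field0": 3.0, "field1": 2.0, "text": 1.0}
doc = {
    "field0": "Buzz Lightyear",
    "field1": "toys",
    "text": "Buzz Lightyear action figure from the movie",
}
score = weighted_field_score(["buzz", "lightyear"], doc, field_weights)
print(score)
```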

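Apply Probabilistic Knowledge into Fields
[Figure: a gradient bar over the field columns, labeled from "Higher" to "Lower"]

doc-id field0 field1 … Text
1        Lightyear Buzz
2
3
…
n

[Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]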

Use the Knowledge during Ranking
doc-id field0 field1 … Text
1        Lightyear Buzz
2
3
…
n

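  The goal is to approximate the log-likelihood of relevance, which ranks documents in the same order as the likelihood itself, with a weighted sum over fields:

$$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R), \qquad \log P(D \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$

(Slide annotation: Learnable)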
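
The slides do not say how the weights are learned. One possible sketch, offered purely as an assumption, reuses the smoothed log odds ratio from the likelihood-ratio slide at the field level: each logged result records which fields the query matched and whether the user clicked it, with a click treated as a relevance signal. The log format and the function name `learn_field_weights` are hypothetical.

```python
import math
from collections import Counter


def learn_field_weights(log, fields):
    """Estimate one weight per field from click data (illustrative only).

    log: iterable of (matched_fields, clicked) pairs, one per shown result.
    A click is treated as "relevant"; this is an assumption, not the
    author's stated method.
    """
    N = R = 0
    n = Counter()  # shown results where the query matched field f
    r = Counter()  # clicked results where the query matched field f
    for matched_fields, clicked in log:
        N += 1
        R += int(clicked)
        for f in matched_fields:
            n[f] += 1
            r[f] += int(clicked)
    weights = {}
    for f in fields:
        rel_odds = (r[f] + 0.5) / (R - r[f] + 0.5)
        non_rel_odds = (n[f] - r[f] + 0.5) / (N - n[f] - R + r[f] + 0.5)
        weights[f] = math.log(rel_odds / non_rel_odds)
    return weights


# Made-up search log.
log = [
    ({"field0", "text"}, True),
    ({"field0"}, True),
    ({"text"}, False),
    ({"text"}, False),
    ({"field0", "text"}, True),
    ({"text"}, True),
    ({"text"}, False),
    ({"field0"}, False),
]
weights = learn_field_weights(log, ["field0", "text"])
print(weights)
```

In this toy log, matches in field0 are mostly clicked while matches in text are clicked about as often as not, so field0 receives the larger weight.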

Comparison of Approaches
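TF-IDF:

$$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$

BM25 term-frequency factors:

$$R_{bm25}(q,D) = \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}, \qquad K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)$$

BM25 with the probabilistic term weight:

$$R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}}_{\text{TF}}$$

Weighted fields in place of the IDF factor:

$$R(q,D) = \sum_{i \in q} \sum_{f \in F} \underbrace{w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}}_{\text{TF}}$$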
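
The last formula can be sketched as follows. Several details are assumptions, since the slides leave them open: m_i is a binary match indicator per field, f_i is the term's total frequency across all fields, and dl is the total token count of the document. The parameter defaults and example data are illustrative.

```python
def field_weighted_bm25(query_tf, doc_field_tokens, field_weights, avgdl,
                        k1=1.2, k2=100.0, b=0.75):
    """Field-weight sum in the IDF position, BM25 factors in the TF position.

    query_tf:         term -> frequency in the query (qf_i)
    doc_field_tokens: field name -> list of tokens in that field
    field_weights:    field name -> weight w_f
    """
    dl = sum(len(tokens) for tokens in doc_field_tokens.values())
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term, qf in query_tf.items():
        f_i = sum(tokens.count(term) for tokens in doc_field_tokens.values())
        if f_i == 0:
            continue
        # Sum of w_f over the fields in which the term matches (m_i = 1).
        field_part = sum(
            weight
            for field, weight in field_weights.items()
            if term in doc_field_tokens.get(field, [])
        )
        tf_part = (k1 + 1) * f_i / (K + f_i)
        qtf_part = (k2 + 1) * qf / (k2 + qf)
        score += field_part * tf_part * qtf_part
    return score


# Two made-up documents of equal length; only the matching field differs.
field_weights = {"field0": 3.0, "text": 1.0}
doc_a = {"field0": ["buzz", "lightyear"], "text": ["toy", "figure"]}
doc_b = {"field0": ["toy", "figure"], "text": ["buzz", "lightyear"]}
query_tf = {"buzz": 1, "lightyear": 1}

score_a = field_weighted_bm25(query_tf, doc_a, field_weights, avgdl=4)
score_b = field_weighted_bm25(query_tf, doc_b, field_weights, avgdl=4)
print(score_a, score_b)
```

Because both documents have the same length and term frequencies, the difference in score comes only from the field weights.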

Other Considerations
  This is not a formal model
  Requires user relevance feedback (search log)
  Harder to handle real-time search queries
  How to prevent Love/Hate attacks
