
Probabilistic Retrieval TFIDF

To download slides please go here:
http://www.intelligentmining.com/category/knowledge-base/

Published in: Technology

  1. INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE
     Alex Lin, Senior Architect, Intelligent Mining
     alin at IntelligentMinining.com
  2. Overview of Retrieval Models
     • Boolean Retrieval
     • Vector Space Model
     • Probabilistic Model
     • Language Model
  3. Boolean Retrieval
     • lincoln AND NOT (car AND automobile)
     • The earliest model, and still in use today
     • The result is very easy to explain to users
     • Highly efficient computationally
     • The major drawback: lack of a sophisticated ranking algorithm
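The example query on this slide can be evaluated with plain set operations. The sketch below is only an illustration: the toy inverted index and its document ids are made up, and the function name `boolean_lincoln_query` is hypothetical.

```python
# Toy inverted index: term -> set of document ids.
# The postings are invented for illustration only.
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5},
}


def boolean_lincoln_query(index):
    """Evaluate: lincoln AND NOT (car AND automobile)."""
    lincoln = index.get("lincoln", set())
    car = index.get("car", set())
    automobile = index.get("automobile", set())
    # AND is set intersection; AND NOT is set difference.
    return lincoln - (car & automobile)


matches = boolean_lincoln_query(index)
print(sorted(matches))
```

Every matching document is returned as an equal member of the result set, which is the lack of ranking the slide points out.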
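  4. Vector Space Model
     [Figure: Doc1, Doc2 and Query drawn as vectors in term space (axes labelled Term2, Term3)]
     $$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$
     • Major flaw: it lacks guidance on the details of how weighting and ranking algorithms are related to relevance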
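A minimal Python sketch of the cosine measure on this slide, assuming documents and queries are already given as equal-length lists of term weights. The function name `cosine` and the zero-vector convention (return 0.0) are choices made for this example, not part of the slides.

```python
import math


def cosine(doc, query):
    """Cosine similarity between two equal-length term-weight vectors."""
    if len(doc) != len(query):
        raise ValueError("doc and query must have the same number of terms")
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    # Assumption: an all-zero vector is treated as having similarity 0.
    return dot / norm if norm else 0.0
```

How the weights in `doc` and `query` are chosen is left open here, which is the gap the slide identifies.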
  5. Probabilistic Retrieval Model
     [Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]
     Bayes' Rule:
     $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$
  6. Probabilistic Retrieval Model
     $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}$$
     • If $P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR)$, then classify D as relevant
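Because P(D) appears in both posteriors, the decision rule only needs the two numerators. A small sketch, assuming the likelihoods and the prior are supplied by the caller and that P(NR) = 1 - P(R); the function name `classify_relevant` is hypothetical.

```python
def classify_relevant(p_d_given_r, p_d_given_nr, p_r):
    """Return True when P(D|R)P(R) > P(D|NR)P(NR).

    Assumes relevance is binary, so P(NR) = 1 - P(R).
    P(D) is omitted because it cancels on both sides.
    """
    p_nr = 1.0 - p_r
    return p_d_given_r * p_r > p_d_given_nr * p_nr
```

For example, with a prior of 0.1 for relevance, a document needs a likelihood under R more than nine times its likelihood under NR to be classified as relevant.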
  7. Estimate P(D|R) and P(D|NR)
     • Define $D = (d_1, d_2, \ldots, d_t)$, then
       $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R) \qquad P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$
     • Binary Independence Model: term independence + binary features in documents
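  8. Likelihood Ratio
     • Likelihood ratio test: classify D as relevant if
       $$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$
     • $p_i$: the probability of term i occurring in a document from the relevant set
     • $s_i$: the probability of term i occurring in a document from the non-relevant set
     • Under the Binary Independence Model:
       $$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$
     • Taking the logarithm and dropping the factor that is the same for every document gives a rank-equivalent score:
       $$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$
     • Estimating $p_i$ and $s_i$ from relevance counts (with 0.5 added to each count) and restricting the sum to query terms:
       $$\sum_{i:d_i=q_i=1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$
     • N: total number of documents
     • $n_i$: number of documents that contain term i
     • $r_i$: number of relevant documents that contain term i
     • R: total number of relevant documents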
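The per-term weight in the last sum can be computed directly from the four counts. This is a sketch only, and it takes N as the total number of documents and n_i as the number of documents containing term i, which is the reading under which N - n_i - R + r_i counts the non-relevant documents without the term. The function name `rsj_weight` is hypothetical.

```python
import math


def rsj_weight(r_i, R, n_i, N):
    """Relevance weight of one term, with 0.5 added to each count.

    r_i: relevant documents containing the term
    R:   relevant documents in total
    n_i: documents containing the term (assumed: whole collection)
    N:   documents in total (assumed: whole collection)
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    non_relevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / non_relevant_odds)
```

With no relevance information (R = 0 and r_i = 0) the weight reduces to log((N - n_i + 0.5) / (n_i + 0.5)), which behaves like an IDF term.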
  9. Combine with BM25 Ranking Algorithm
     • BM25 extends the scoring function for the binary independence model to include document and query term weights
     • It performs very well in TREC experiments
     $$R(q,D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}$$
     $$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
     • $k_1$, $k_2$, b: tuning parameters
     • $f_i$: frequency of term i in the document
     • $qf_i$: frequency of term i in the query
     • dl: document length
     • avgdl: average document length in the data set
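A minimal sketch of the BM25 formula on this slide. The function names, the dictionary-based inputs, and the default values k1 = 1.2, k2 = 100 and b = 0.75 are assumptions chosen for this example (they are commonly quoted starting points, not values given in the slides). With no relevance information, r_i and R default to 0.

```python
import math


def bm25_term(f_i, qf_i, n_i, N, dl, avgdl,
              r_i=0, R=0, k1=1.2, k2=100.0, b=0.75):
    """Score contribution of a single query term."""
    K = k1 * ((1 - b) + b * dl / avgdl)
    relevance_weight = math.log(
        ((r_i + 0.5) / (R - r_i + 0.5))
        / ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5))
    )
    doc_tf = (k1 + 1) * f_i / (K + f_i)
    query_tf = (k2 + 1) * qf_i / (k2 + qf_i)
    return relevance_weight * doc_tf * query_tf


def bm25_score(query_tf, doc_tf, doc_freq, N, dl, avgdl, **params):
    """Sum the per-term contributions over all query terms.

    query_tf: term -> frequency in the query
    doc_tf:   term -> frequency in the document
    doc_freq: term -> number of documents containing the term
    """
    return sum(
        bm25_term(doc_tf.get(term, 0), qf, doc_freq.get(term, 0),
                  N, dl, avgdl, **params)
        for term, qf in query_tf.items()
    )


# Illustrative numbers only: a two-term query against one document.
score = bm25_score(
    query_tf={"president": 1, "lincoln": 1},
    doc_tf={"president": 15, "lincoln": 25},
    doc_freq={"president": 40000, "lincoln": 300},
    N=500000, dl=90, avgdl=100,
)
print(round(score, 2))
```

A query term that does not occur in the document has f_i = 0 and contributes nothing to the sum.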
  10. Weighted Fields Boolean Search
      [Table: one row per document (doc-id 1, 2, 3, …, n) with columns field0, field1, …, text]
      $$R(q,D) = \sum_{i \in q} \sum_{f \in fields} w_f\, m_i$$
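The slide does not define its symbols, so the sketch below rests on an assumption: w_f is a weight attached to field f, and m_i is read as a 0/1 indicator of whether query term i matches in that field. The function name, the field names and the weights are all hypothetical.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """Sum w_f * m over every (query term, field) pair.

    doc_fields:    field name -> set of terms present in that field
    field_weights: field name -> weight w_f
    Assumption: m is 1 if the term occurs in the field, else 0.
    """
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            m = 1 if term in doc_fields.get(field, ()) else 0
            score += weight * m
    return score


# Hypothetical document with two fields and made-up weights.
doc = {"title": {"buzz", "lightyear"}, "text": {"buzz", "toy"}}
weights = {"title": 3.0, "text": 1.0}
example_score = weighted_field_score(["buzz", "lightyear"], doc, weights)
print(example_score)
```

A match in a heavily weighted field moves a document further up the ranking than the same match in a lightly weighted one.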
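  11. Apply Probabilistic Knowledge into Fields
      [Table: one row per document (doc-id 1, 2, 3, …, n) with columns field0, field1, …, Text; the row for doc-id 2 contains the terms "Lightyear" and "Buzz"; the fields are marked on a gradient from Higher to Lower]
      [Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]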
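  12. Use the Knowledge during Ranking
      [Table: one row per document (doc-id 1, 2, 3, …, n) with columns field0, field1, …, Text; the row for doc-id 2 contains the terms "Lightyear" and "Buzz"]
      • The goal is:
        $$\log P(D \mid R) = \log \prod_{i=1}^{t} P(d_i \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$
      [Slide annotation: "Learnable", attached to the weighted-field sum]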
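  13. Comparison of Approaches
      • TF-IDF:
        $$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$
      • BM25 term-frequency part:
        $$R_{bm25}(q,D) = \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
      • BM25 with the relevance weight in the IDF role:
        $$R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$
      • Weighted fields in the IDF role:
        $$R(q,D) = \sum_{i \in q} \underbrace{\sum_{f \in F} w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$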
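The last formula on this slide replaces the IDF factor with the weighted-field sum while keeping the BM25 term-frequency factors. The sketch below is one possible reading, not the author's implementation. It assumes m_i is a 0/1 match indicator per field, that f_i counts occurrences of the term across all fields, and that dl is the total token count of the document. The function name, field names, weights and parameter defaults are hypothetical.

```python
def field_weighted_tf_score(query_tf, doc_fields, field_weights, avgdl,
                            k1=1.2, k2=100.0, b=0.75):
    """Weighted-field sum (in the IDF role) times the BM25 TF factors.

    query_tf:      term -> frequency in the query
    doc_fields:    field name -> list of tokens in that field
    field_weights: field name -> weight w_f
    """
    tokens = [tok for toks in doc_fields.values() for tok in toks]
    dl = len(tokens)  # assumption: document length = tokens over all fields
    K = k1 * ((1 - b) + b * dl / avgdl)
    score = 0.0
    for term, qf in query_tf.items():
        f = tokens.count(term)
        if f == 0:
            continue  # term absent from the document: no contribution
        # Assumption: m is 1 for each field that contains the term.
        field_part = sum(w for field, w in field_weights.items()
                         if term in doc_fields.get(field, ()))
        doc_tf = (k1 + 1) * f / (K + f)
        q_tf = (k2 + 1) * qf / (k2 + qf)
        score += field_part * doc_tf * q_tf
    return score


# Hypothetical document and weights, for illustration only.
toy_doc = {"title": ["buzz", "lightyear"], "text": ["buzz", "toy", "story"]}
toy_weights = {"title": 3.0, "text": 1.0}
toy_score = field_weighted_tf_score({"buzz": 1}, toy_doc, toy_weights, avgdl=5.0)
print(toy_score)
```

In this reading the field weights are the only free parameters besides k1, k2 and b, so they are the natural place to inject what is learned from relevance feedback.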
  14. Other Considerations
      • This is not a formal model
      • Requires user relevance feedback (search log)
      • Harder to handle real-time search queries
      • How to prevent Love/Hate attacks
  15. Thank you
