Probabilistic Retrieval TFIDF
Published on http://www.intelligentmining.com/category/knowledge-base/
Published in: Technology
1. Incorporating Probabilistic Retrieval Knowledge into TFIDF-Based Search Engine
   Alex Lin, Senior Architect, Intelligent Mining
   alin at IntelligentMinining.com
2. Overview of Retrieval Models
   - Boolean Retrieval
   - Vector Space Model
   - Probabilistic Model
   - Language Model
3. Boolean Retrieval
   - Example query: lincoln AND NOT (car AND automobile)
   - The earliest model, and still in use today
   - Results are very easy to explain to users
   - Highly efficient computationally
   - Major drawback: it lacks a sophisticated ranking algorithm
4. Vector Space Model
   - Documents and queries are vectors of term weights; ranking uses the cosine similarity:

     Cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}

   - Major flaw: it lacks guidance on how the details of the weighting and ranking algorithms are related to relevance
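The cosine formula above can be sketched directly; the term-weight vectors in the example are made-up values for illustration.

```python
import math

def cosine(doc_vec, query_vec):
    """Cos(D_i, Q) = dot(D_i, Q) / (||D_i|| * ||Q||)."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_d * norm_q)

# Parallel vectors score ~1.0; vectors with no shared terms score 0.0
print(cosine([1, 2, 0], [2, 4, 0]))
print(cosine([1, 0, 0], [0, 1, 0]))
```

Because the score depends only on the angle between the vectors, document length largely cancels out, but nothing in the model says how the weights d_ij should be chosen to reflect relevance.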
5. Probabilistic Retrieval Model
   - A document D is either relevant, with probability P(R|D), or non-relevant, with probability P(NR|D)
   - Bayes' rule:

     P(R|D) = \frac{P(D|R) P(R)}{P(D)}
6. Probabilistic Retrieval Model

     P(R|D) = \frac{P(D|R) P(R)}{P(D)} \qquad P(NR|D) = \frac{P(D|NR) P(NR)}{P(D)}

   - If P(D|R) P(R) > P(D|NR) P(NR), then classify D as relevant
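The decision rule can be sketched in a few lines. The probabilities passed in are hypothetical values, chosen only to show that P(D) cancels from both sides and never needs to be computed.

```python
def classify_relevant(p_d_given_r, p_r, p_d_given_nr, p_nr):
    """Relevant iff P(D|R)P(R) > P(D|NR)P(NR); the shared
    denominator P(D) cancels out of the comparison."""
    return p_d_given_r * p_r > p_d_given_nr * p_nr

# Hypothetical probabilities for illustration
print(classify_relevant(0.02, 0.1, 0.001, 0.9))  # -> True
```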
7. Estimate P(D|R) and P(D|NR)
   - Define D = (d_1, d_2, ..., d_t); then

     P(D|R) = \prod_{i=1}^{t} P(d_i|R)

     P(D|NR) = \prod_{i=1}^{t} P(d_i|NR)

   - Binary Independence Model: term independence + binary features in documents
8. Likelihood Ratio
   - Classify D as relevant when the likelihood ratio exceeds the prior odds:

     \frac{P(D|R)}{P(D|NR)} > \frac{P(NR)}{P(R)}

   - p_i: the probability that term i occurs in the relevant set
   - s_i: the probability that term i occurs in the non-relevant set

     \frac{P(D|R)}{P(D|NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}

     which is rank-equivalent to

     \sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}

   - Estimating p_i and s_i from relevance judgments, with 0.5 smoothing, and summing over terms in both the document and the query:

     \sum_{i:d_i=q_i=1} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}

   - N: total number of documents
   - n_i: number of documents that contain term i
   - R: total number of relevant documents
   - r_i: number of relevant documents that contain term i
9. Combine with the BM25 Ranking Algorithm
   - BM25 extends the scoring function of the binary independence model to include document and query term weights
   - It performs very well in TREC experiments

     R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1) f_i}{K + f_i} \cdot \frac{(k_2+1) qf_i}{k_2 + qf_i}

     K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)

   - k_1, k_2, b: tuning parameters
   - f_i: frequency of term i in the document
   - qf_i: frequency of term i in the query
   - dl: document length
   - avgdl: average document length in the data set
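A sketch of the scoring function above. The slides do not fix parameter values, so the defaults here (k1 = 1.2, k2 = 100, b = 0.75) are common choices, not the authors'; the function and argument names are hypothetical.

```python
import math

def bm25_score(query_terms, doc_tf, query_tf, df, N, dl, avgdl,
               k1=1.2, k2=100.0, b=0.75, rel_df=None, R=0):
    """BM25 with the smoothed relevance weight as the first factor.
    With no relevance information (R = r_i = 0) that factor reduces
    to an IDF-like weight log((N - n_i + 0.5) / (n_i + 0.5))."""
    rel_df = rel_df or {}                      # relevant-doc frequency per term
    K = k1 * ((1 - b) + b * dl / avgdl)        # length normalization
    score = 0.0
    for term in query_terms:
        f_i = doc_tf.get(term, 0)              # term frequency in the document
        qf_i = query_tf.get(term, 0)           # term frequency in the query
        n_i = df.get(term, 0)                  # docs containing the term
        r_i = rel_df.get(term, 0)              # relevant docs containing the term
        idf = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                       ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
        score += idf * ((k1 + 1) * f_i / (K + f_i)) \
                     * ((k2 + 1) * qf_i / (k2 + qf_i))
    return score

# Hypothetical single-term query against a 1000-document collection
score = bm25_score(query_terms=["lincoln"],
                   doc_tf={"lincoln": 3}, query_tf={"lincoln": 1},
                   df={"lincoln": 50}, N=1000, dl=100, avgdl=120)
print(round(score, 3))
```

The (k1+1)·f_i / (K+f_i) factor saturates: the second and third occurrences of a term add less than the first, and K penalizes documents longer than average.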
10. Weighted-Fields Boolean Search
    - Each document has multiple fields: doc-id, field0, field1, ..., text

      R(q,D) = \sum_{i \in q} \sum_{f \in fields} w_f m_i

      where w_f is the weight of field f and m_i indicates whether term i matches the field
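The double sum above can be sketched as follows; the document, its fields, and the field weights are made-up values, not from the slides.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """R(q, D) = sum over query terms i and fields f of w_f * m_i,
    where m_i is 1 when term i occurs in field f, else 0."""
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            text = doc_fields.get(field, "")
            if term in text.lower().split():   # binary match indicator m_i
                score += weight
    return score

# Hypothetical document with a weighted title and body field
doc = {"title": "Buzz Lightyear", "body": "toy story character buzz"}
weights = {"title": 3.0, "body": 1.0}
print(weighted_field_score(["buzz"], doc, weights))  # -> 4.0
```

A match in a high-weight field (the title here) contributes more than the same match in a low-weight field, which is exactly the per-field relevance gradient the next slide describes.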
11. Apply Probabilistic Knowledge into Fields
    - Fields are arranged along a gradient from higher to lower relevance (field0, field1, ..., text)
    - Example: a document with "Lightyear" and "Buzz" in different fields
    - As before, a document is relevant with probability P(R|D) and non-relevant with probability P(NR|D)
12. Use the Knowledge during Ranking
    - The goal is to approximate the probabilistic score with field weights:

      \log P(D|R) = \log \prod_{i=1}^{t} P(d_i|R) = \sum_{i=1}^{t} \log P(d_i|R) \approx \sum_{i \in q} \sum_{f \in F} w_f m_i

    - The field weights w_f are learnable
13. Comparison of Approaches

    TF-IDF:

      R_{TF-IDF} = tf_{ik} \cdot idf_i = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}

    BM25 term-frequency components:

      R_{bm25}(q,D) = \frac{(k_1+1) f_i}{K + f_i} \cdot \frac{(k_2+1) qf_i}{k_2 + qf_i}, \qquad K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)

    BM25 with the probabilistic weight:

      R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1) f_i}{K + f_i} \cdot \frac{(k_2+1) qf_i}{k_2 + qf_i}

    Field-weighted version:

      R(q,D) = \sum_{i \in q} \sum_{f \in F} w_f m_i \cdot \frac{(k_1+1) f_i}{K + f_i} \cdot \frac{(k_2+1) qf_i}{k_2 + qf_i}

    In the last two formulas, the log term (or w_f m_i) plays the IDF role and the remaining factors play the TF role.
14. Other Considerations
    - This is not a formal model
    - Requires user relevance feedback (a search log)
    - Harder to handle real-time search queries
    - How to prevent love/hate attacks
15. Thank you