• Save
Probabilistic Retrieval TFIDF
Upcoming SlideShare
Loading in...5
×
 

Probabilistic Retrieval TFIDF

on

  • 1,228 views

To download slides please go here:

To download slides please go here:
http://www.intelligentmining.com/category/knowledge-base/

Statistics

Views

Total Views
1,228
Views on SlideShare
1,173
Embed Views
55

Actions

Likes
0
Downloads
0
Comments
1

3 Embeds 55

http://www.intelligentmining.com 48
http://www.linkedin.com 5
https://www.linkedin.com 2

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • To download slides please go here:
    http://www.intelligentmining.com/category/knowledge-base/
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Probabilistic Retrieval TFIDF Probabilistic Retrieval TFIDF Presentation Transcript

    • INCORPORATINGPROBABILISTICRETRIEVALKNOWLEDGE INTOTFIDF-BASED SEARCHENGINEAlex LinSenior ArchitectIntelligent Miningalin at IntelligentMinining.com
    • Overview of Retrieval Models  Boolean Retrieval  Vector Space Model  Probabilistic Model  Language Model
    • Boolean Retrieval  lincolnAND NOT (car AND automobile)  The earliest model and still in use today  The result is very easy to explain to users  Highly efficient computationally  The major drawback – lack of sophisticated ranking algorithm.
    • Vector Space Model Term2 Doc1 Doc2 t Query ∑d ij *qj j=1 Cos(Di ,Q) = t t Term3 ∑ d * ∑q2 ij 2 j j=1 j=1 Major flaws: It lacks guidance on the details of € how weighting and ranking algorithms are related to relevance
    • Probabilistic Retrieval Model Relevant P(R|D) Document Non- Relevant P(NR|D) P(D | R)P(R) Bayes’ Rule P(R | D) = P(D) €
    • Probabilistic Retrieval Model P(D | R)P(R) P(D | NR)P(NR) P(R | D) = P(NR | D) = P(D) P(D)   IfP(D | R)P(R) > P(D | NR)P(NR)€ € then classify D as relevant €
    • Estimate P(D|R) and P(D|NR)  Define D = (d1,d2 ,...,dt ) t then P(D | R) = ∏ P(di | R) i=1 t € P(D | NR) = ∏ P(di | NR) i=1€   Binary Independence Model€ term independence + binary features in documents
    • Likelihood Ratio   Likelihood ratio: P(D | R) P(NR) > P(D | NR) P(R) si: in non-relevant set, the probability of term i occurring pi: in relevant set, the probability of term i occurring P(D | R) p 1− pi p (1− si ) =∏ i⋅ ∏ = ∑ log i€ P(D | NR) i:d i =1 si i:d i = 0 1− si i:d i =1 si (1− pi ) (ri + 0.5) /(R − ri + 0.5) = ∑ log (n i − ri + 0.5) /(N − n i − R + ri + 0.5) i:d i = q i =1€ N: total number of Non-relevant documents ni: number of non-relevant documents that contain a term ri: number of relevant documents that contain a term R: total number of Relevant documents €
    • Combine with BM25 Ranking Algorithm   BM25 extends the scoring function for the binary independence model to include document and query term weight.   It performs very well in TREC experiments (ri + 0.5) /(R − ri + 0.5) (k + 1) f i (k 2 + 1)qf i R(q,D) = ∑ log ⋅ i ⋅ i∈Q (n i − ri + 0.5) /(N − n i − R + ri + 0.5) K + f i k 2 + qf i dl K = k1 ((1− b) + b ⋅ ) avgdl€ k1 k2 b: tuning parameters dl: document length avgdl: average document length in data set € qf: term frequency in query terms
    • Weighted Fields Boolean Search doc-id field0 field1 … text 1 2 3 … n R(q,D) = ∑ ∑w f mi i∈q f ∈ fileds €
    • Apply Probabilistic Knowledgeinto Fields Higher gradient Lower doc-id field0 field1 … Text 1 2 Lightyear Buzz 3 … n Relevant P(R|D) Document Non- Relevant P(NR|D)
    • Use the Knowledge during Ranking doc-id field0 field1 … Text 1 2 Lightyear Buzz 3 … n   The goal is: t t P(D | R) = ∏ P(di | R) = ∑ log(P(di | R)) ≈ ∑ ∑ w f mi i=1 i=1 i∈q f ∈F Learnable€
    • Comparison of Approaches f ik N RTF −IDF = tf ik ⋅ idf i = t ⋅ log nk ∑f ij j=1 (k1 + 1) f i (k2 + 1)qf i dl Rbm 25 (q,D) = ⋅ K = k1 ((1− b) + b ⋅ ) K + fi k 2 + qf i avgdl€ (ri + 0.5) /(R − ri + 0.5) (k1 + 1) f i (k 2 + 1)qf i R(q,D) = ∑ log ⋅ ⋅ i∈Q (n i − ri + 0.5) /(N − n i − R + ri + 0.5) K + f i k 2 + qf i€ € IDF TF€ (k1 + 1) f i (k 2 + 1)qf i R(q,D) = ∑ ∑ w f mi ⋅ ⋅ i∈q f ∈F K + fi k 2 + qf i IDF TF€
    • Other Considerations  Thisis not a formal model  Require user relevance feedback (search log)  Harder to handle real-time search queries  How to prevent Love/Hate attacks
    • Thank you