Probabilistic Retrieval

From http://www.meetup.com/NYC-Search-and-Discovery/calendar/11745435/

Published in: Technology, Education
  • (Intelligent Mining Sr Architect) Alex Lin's presentation to NYC Search & Discovery Meetup group. Thanks to otisg for the opportunity.

Probabilistic Retrieval

  1. Incorporating Probabilistic Retrieval Knowledge into a TF-IDF-based Search Engine
     Alex Lin, Senior Architect, Intelligent Mining
     alin at IntelligentMinining.com
  2. Overview of Retrieval Models
     • Boolean Retrieval
     • Vector Space Model
     • Probabilistic Model
     • Language Model
  3. Boolean Retrieval
     • Example query: lincoln AND NOT (car AND automobile)
     • The earliest model, and still in use today
     • The result is very easy to explain to users
     • Highly efficient computationally
     • The major drawback: it lacks a sophisticated ranking algorithm
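The Boolean query on slide 3 can be sketched with set operations over an inverted index. This is only an illustration: the index contents and the `postings` helper are made up for the example and are not from the talk.

```python
# Toy inverted index: term -> set of document ids (all data here is made up).
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5, 6},
}

def postings(term):
    """Return the set of document ids containing the term (empty if unseen)."""
    return index.get(term, set())

# lincoln AND NOT (car AND automobile):
# AND is set intersection, AND NOT is set difference.
result = postings("lincoln") - (postings("car") & postings("automobile"))
print(sorted(result))
```

Every matching document is returned with equal standing, which is the missing-ranking drawback the slide points out.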
  4. Vector Space Model
     [Figure: Doc1, Doc2 and Query drawn as vectors in a space with axes Term1, Term2, Term3]
     $$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$
     • Major flaw: it gives no guidance on how the weighting and ranking algorithms relate to relevance
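As a minimal sketch of the cosine measure on slide 4, assuming documents and the query are already represented as equal-length lists of term weights (the weights below are invented for illustration):

```python
import math

def cosine(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm = math.sqrt(sum(d * d for d in doc_weights) *
                     sum(q * q for q in query_weights))
    # Treat an all-zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0

# Hypothetical weights over three terms (Term1, Term2, Term3).
doc1 = [0.5, 0.8, 0.3]
doc2 = [0.9, 0.4, 0.2]
query = [1.5, 1.0, 0.0]

ranked = sorted([("Doc1", cosine(doc1, query)), ("Doc2", cosine(doc2, query))],
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The function ranks documents, but nothing in it says how the weights themselves should be chosen, which is the flaw noted on the slide.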
  5. Probabilistic Retrieval Model
     [Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]
     • Bayes' Rule:
       $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$
  6. Probabilistic Retrieval Model
     $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}$$
     • If $P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR)$, then classify D as relevant
  7. Estimate P(D|R) and P(D|NR)
     • Define $D = (d_1, d_2, \ldots, d_t)$, then
       $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$
       $$P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$
     • Binary Independence Model: term independence + binary features in documents
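A rough sketch of how the Binary Independence Model on slide 7 feeds the decision rule on slide 6. The per-term probabilities and the prior below are invented numbers, and the function names are hypothetical. Logs are used so that the product becomes a sum and does not underflow.

```python
import math

def bim_log_likelihood(doc_terms, term_probs):
    """log P(D | class) under the Binary Independence Model.

    doc_terms:  set of terms present in the document (d_i = 1)
    term_probs: term -> P(d_i = 1 | class), each strictly between 0 and 1
    """
    total = 0.0
    for term, p in term_probs.items():
        # Present terms contribute p, absent terms contribute (1 - p).
        total += math.log(p if term in doc_terms else 1.0 - p)
    return total

def is_relevant(doc_terms, p_rel, p_nonrel, prior_rel):
    """Classify D as relevant if P(D|R)P(R) > P(D|NR)P(NR), compared in log space."""
    score_r = bim_log_likelihood(doc_terms, p_rel) + math.log(prior_rel)
    score_nr = bim_log_likelihood(doc_terms, p_nonrel) + math.log(1.0 - prior_rel)
    return score_r > score_nr

# Made-up probabilities for a three-term vocabulary.
p_rel = {"buzz": 0.8, "lightyear": 0.7, "car": 0.1}
p_nonrel = {"buzz": 0.1, "lightyear": 0.05, "car": 0.3}

print(is_relevant({"buzz", "lightyear"}, p_rel, p_nonrel, prior_rel=0.1))
print(is_relevant({"car"}, p_rel, p_nonrel, prior_rel=0.1))
```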
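  8. Likelihood Ratio
     • Likelihood ratio: classify D as relevant if
       $$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$
     • $p_i$: the probability of term i occurring in the relevant set
     • $s_i$: the probability of term i occurring in the non-relevant set
     • Under the Binary Independence Model:
       $$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$
     • Dropping the factor that is the same for every document and taking logs gives a rank-equivalent score:
       $$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$
     • Estimating $p_i$ and $s_i$ from relevance counts with 0.5 smoothing, and summing over query terms only:
       $$\sum_{i:d_i=q_i=1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$
     • N: total number of documents in the collection (so N - R is the number of non-relevant documents)
     • $n_i$: number of documents that contain term i (so $n_i - r_i$ is the number of non-relevant documents that contain it)
     • $r_i$: number of relevant documents that contain term i
     • R: total number of relevant documents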
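The term weight on slide 8 can be sketched as a small function. It assumes the usual reading of the counts, in which N is the size of the whole collection and n_i counts all documents containing the term, so that n_i - r_i and N - n_i - R + r_i are the non-relevant counts. The function name is hypothetical.

```python
import math

def rsj_weight(r_i, n_i, R, N):
    """Smoothed relevance weight for one term.

    r_i: relevant documents containing the term
    n_i: all documents containing the term
    R:   all relevant documents
    N:   all documents in the collection
    """
    odds_in_relevant = (r_i + 0.5) / (R - r_i + 0.5)
    odds_in_nonrelevant = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(odds_in_relevant / odds_in_nonrelevant)

# With no relevance information (R = r_i = 0) the weight reduces to an
# IDF-like quantity: log((N - n_i + 0.5) / (n_i + 0.5)).
print(rsj_weight(r_i=0, n_i=10, R=0, N=1000))

# A term seen in most of the known relevant documents gets a larger weight.
print(rsj_weight(r_i=8, n_i=10, R=10, N=1000))
```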
  9. Combine with BM25 Ranking Algorithm
     • BM25 extends the scoring function of the binary independence model to include document and query term weights
     • It performs very well in TREC experiments
       $$R(q, D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}$$
       $$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
     • $k_1$, $k_2$, $b$: tuning parameters
     • $dl$: document length
     • $avgdl$: average document length in the data set
     • $qf_i$: frequency of term i in the query
     • $f_i$: frequency of term i in the document
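A minimal sketch of the BM25 score on slide 9, written directly from the formula. The defaults k1 = 1.2, k2 = 100 and b = 0.75 are commonly used values chosen here as an assumption; the slides do not give specific settings. The function name, argument layout and example counts are also hypothetical.

```python
import math

def bm25_score(query_tf, doc_tf, dl, avgdl, n, N, r=None, R=0,
               k1=1.2, k2=100.0, b=0.75):
    """Score one document against a query.

    query_tf: term -> frequency in the query (qf_i)
    doc_tf:   term -> frequency in the document (f_i)
    n:        term -> number of documents containing the term (n_i)
    r:        term -> number of relevant documents containing the term (r_i)
    N, R:     collection size and number of known relevant documents
    """
    r = r or {}
    K = k1 * ((1 - b) + b * dl / avgdl)  # document-length normalisation
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # the TF factor is zero, so the term adds nothing
        n_i, r_i = n[term], r.get(term, 0)
        idf_part = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                            ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
        tf_part = (k1 + 1) * f / (K + f)
        qtf_part = (k2 + 1) * qf / (k2 + qf)
        score += idf_part * tf_part * qtf_part
    return score

# Illustrative counts, with no relevance information (R = 0).
score = bm25_score(
    query_tf={"president": 1, "lincoln": 1},
    doc_tf={"president": 15, "lincoln": 25},
    dl=90, avgdl=100,
    n={"president": 40000, "lincoln": 300},
    N=500000,
)
print(round(score, 2))
```

In this example the rarer term "lincoln" contributes most of the score, because its IDF-like factor is much larger than that of "president".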
  10. Weighted Fields Boolean Search
      [Table: columns doc-id, field0, field1, …, text; rows 1, 2, 3, …, n]
      $$R(q, D) = \sum_{i \in q} \sum_{f \in \text{fields}} w_f\, m_i$$
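The weighted-field score on slide 10 might look like the sketch below. The slide does not define m_i, so this assumes it is a 0/1 indicator of whether query term i matches field f. The field names, weights and document are invented for illustration.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """Sum w_f * m over query terms and fields.

    Assumption: m is 1 when the term occurs in the field and 0 otherwise.
    """
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            tokens = doc_fields.get(field, "").lower().split()
            m = 1 if term.lower() in tokens else 0
            score += weight * m
    return score

# Hypothetical document with two fields and hand-picked weights.
doc = {"title": "Buzz Lightyear", "text": "Buzz is a toy space ranger"}
weights = {"title": 3.0, "text": 1.0}

print(weighted_field_score(["buzz", "lightyear"], doc, weights))
```

Here the weights are fixed by hand, so a match in a highly weighted field counts for more than a match in the body text.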
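  11. Apply Probabilistic Knowledge into Fields
      [Figure: index table with columns doc-id, field0, field1, …, Text and rows 1, 2, 3, …, n, containing the example terms "Lightyear" and "Buzz"; a gradient labelled from Higher to Lower runs across the fields. A Document is classified as Relevant with P(R|D) or Non-Relevant with P(NR|D).]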
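  12. Use the Knowledge during Ranking
      [Table: columns doc-id, field0, field1, …, Text; rows 1, 2, 3, …, n, containing the example terms "Lightyear" and "Buzz"]
      • The goal is to approximate the log-likelihood of relevance with the weighted-field score, whose right-hand side is marked on the slide as learnable:
        $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$
        $$\log P(D \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$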
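  13. Comparison of Approaches
      • TF-IDF:
        $$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$
      • BM25 term-frequency components:
        $$R_{bm25}(q, D) = \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}, \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$
      • BM25 with the probabilistic term weight:
        $$R(q, D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$
      • BM25 with the weighted-field score in place of the IDF part:
        $$R(q, D) = \sum_{i \in q} \underbrace{\sum_{f \in F} w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$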
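For comparison with the BM25 variants on slide 13, the plain TF-IDF weight can be sketched as below. The function name and the example counts are made up. The symbols follow the slide: f_ik is the count of term k in document i, N is the collection size, and n_k is the number of documents containing term k.

```python
import math

def tf_idf(f_ik, doc_total_terms, N, n_k):
    """TF-IDF weight of term k in document i.

    f_ik:            occurrences of term k in document i
    doc_total_terms: sum of all term counts in document i
    N:               number of documents in the collection
    n_k:             number of documents containing term k
    """
    tf = f_ik / doc_total_terms
    idf = math.log(N / n_k)
    return tf * idf

# Made-up counts: the term occurs 3 times in a 10-term document
# and appears in 10 of 1000 documents.
weight = tf_idf(f_ik=3, doc_total_terms=10, N=1000, n_k=10)
print(weight)
```

Unlike BM25, this weight has no tuning parameters and no saturation of the term-frequency factor, which is one way to read the side-by-side comparison on the slide.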
  14. Other Considerations
      • This is not a formal model
      • Requires user relevance feedback (search log)
      • Harder to handle real-time search queries
      • How to prevent Love/Hate attacks
  15. Thank you
