BM25 Scoring for Lucene: From Academia to Industry

5,377 views

Published on

Slides from a talk about the BM25 library given at the Meetup session of Apache Lucene Eurocon 2010 in Prague on May 20th, 2010.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
5,377
On SlideShare
0
From Embeds
0
Number of Embeds
70
Actions
Shares
0
Downloads
45
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

BM25 Scoring for Lucene: From Academia to Industry

  1. 1. BM25 Scoring for Lucene: From Academia to Industry Yuval Feinstein Answers Corporation Apache Lucene EuroCon 2010 Meetup Prague, May 2010
  2. 2. Overview  Answers.com  A Relevance problem  BM25F - a possible solution  Joaquin’s Implementation  Productization  Future directions 2
  3. 3. Answers.com  Mission - Provide best answers about anything.  A popular web site (according to comScore, March 2010):  #33 worldwide, with 75.8 million unique users  #18 in US, with 51.2 million unique users  WikiAnswers – community Q&A site (UGC)  ReferenceAnswers – editorial content  Atlas – internal search engine  Implicit search example: find similar 3 questions
  4. 4. Similar Questions 4
  5. 5. Case 31136 5
  6. 6. Enter BM25F  Query Q = (t1, t2, …, tm)  Document D  Term frequency tfi similarity Q , D    w i tf i  tQ  D  How much should tfi influence similarity?  Determine similarity by choosing weights  BM25F: saturation, soft length normalization, idf weights and field weights.
  7. 7. Saturation Frequency Saturation 1 0.9 0.8 0.7 0.6 Saturated 0.5 Weight, tf/(2+tf) 0.4 0.3 0.2 0.1 0 0 5 10 15 20 25 30 Term Frequency tf Replace tf by tf/(k1+tf)
  8. 8. Soft Length Normalization length normalization 2 1.8 1.6 1.4 1.2 normalized 1 frequency 0.8 0.6 0.4 0.2 0 0 5 10 15 20 25 30 document length tf tf '  Replace tf by  dl   1  b   b   avdl 
  9. 9. Inverse Document Frequency (IDF) IDF weighting 2.5 2 1.5 IDF weight (wi) 1 0.5 0 0 20 40 60 80 100 120 num docs with term (ni) N  n i  0 .5  log IDF wi n i  0 .5
  10. 10. Field Weights Every field has a different b (length verbosity parameter) and a different v (field value parameer) 10
  11. 11. The BM25F Formula S ~ tf si v Field weighting tf i  s s 1 Bs  sl s  Field length normalization B s   1  b s   b s   avsl  ~ tf i  BM 25 F IDF Saturation and IDF w i ~ w i k1  f i
  12. 12. Joaquin’s Implementation  Joaquín Pérez Iglesias of UNED, Madrid, Spain implemented a BM25F library for Lucene, with the class BM25BooleanQuery  Algorithm:  Collect documents with query terms  Score individual terms using BM25F  Combine scores using addition to get Boolean query score 12
  13. 13. BM25F Usefulness for Our Case  Short texts  Term repetitions hurt relevance for short texts  Want to combine different fields (in the future, different information sources)  Initial Experiments showed nice relevance, but…. 13
  14. 14. Feeling Safe to make Changes  How can we be sure not to break anything?  Added Unit Tests  (This is almost a Lucene standard, but not in Academia…) 14
  15. 15. Production Challenges – Performance Can this library handle 10M queries daily? Initial Runtimes: Average Median Runtime Runtime mSec mSec Standard 161 119 Lucene Scoring BM25F 273 209 Difference 68% 75% 15
  16. 16. Improving Performance Addressed using:  Benchmarking  Profiling  Refactoring, to give Average Median Runtime Runtime mSec mSec Standard 93 65 Lucene Scoring BM25F 92 70 16 Difference -1% 8%
  17. 17. Production Challenges – Robustness  Lots of users  strange inputs e.g. //////////////////////////////////////// ;-) fdsfdsdfsdffssssssfsfsfs  Addressed using more careful tokenization
  18. 18. Production Challenges – Integration and Interoperability  Needs data not currently in Lucene index:  Average Field Lengths  Document-level IDF  We calculated the first externally and approximated the second using longest field IDF  Library does not play nicely with others – not recursive  BM25 Library supports BooleanQuery, not phrases, prefix, etc.
  19. 19. Remember case 31136? Well, She’s mostly pleased…  BM25 runs in our production environment  Supporting 10s of millions of queries daily
  20. 20. Future Work  LUCENE-2091 – Our suggested contrib patch  LUCENE-2392 – Current work on making Lucene scoring more flexible, to incorporate BM25 as well as other models  We want to incorporate BM25 scoring into Solr  Could this be faster as well? 20
  21. 21. References  Integrating the Probabilistic Model BM25/BM25F into Lucene – Joaquin Perez Iglesias  The Probabilistic Relevance Framework: BM25 and Beyond – Stephen Robertson and Hugo Zaragoza  Working Effectively with Legacy Code – Michael Feathers

×