BM25 Scoring for Lucene:
From Academia to Industry

             Yuval Feinstein
             Answers Corporation




              Apache Lucene EuroCon 2010 Meetup
              Prague, May 2010
Overview

       Answers.com
       A Relevance problem
       BM25F - a possible solution
       Joaquin’s Implementation
       Productization
       Future directions




Answers.com

       Mission - Provide best answers about anything.
       A popular web site (according to comScore,
        March 2010):
          #33 worldwide, with 75.8 million unique users
          #18 in US, with 51.2 million unique users
       WikiAnswers – community Q&A site (UGC)
       ReferenceAnswers – editorial content
       Atlas – internal search engine
       Implicit search example: find similar questions
Similar Questions




Case 31136




Enter BM25F

   Query Q = (t1, t2, …, tm)
   Document D
   Term frequency tfi
    $$\mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \, tf_i$$

   How much should tfi influence similarity?
   Determine similarity by choosing weights
   BM25F: saturation, soft length normalization, idf
    weights and field weights.
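
The weighted-sum similarity on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `similarity` and the dictionary inputs are assumptions made for the example, not part of the talk or of Lucene. It shows why raw term frequency is a problem: a single repeated term can outweigh a document that matches more of the query.

```python
def similarity(query_terms, doc_tf, weights):
    """Sum w_i * tf_i over the terms that occur in both query and document."""
    return sum(
        weights.get(term, 0.0) * doc_tf[term]
        for term in set(query_terms)
        if term in doc_tf
    )


# Hypothetical toy data: term -> raw term frequency in each document.
doc_a = {"lucene": 1, "scoring": 1}   # matches both query terms once
doc_b = {"lucene": 10}                # repeats one query term ten times
w = {"lucene": 1.0, "scoring": 1.0}   # uniform weights, for illustration only

score_a = similarity(["lucene", "scoring"], doc_a, w)
score_b = similarity(["lucene", "scoring"], doc_b, w)
```

With raw tf, `doc_b` scores five times higher than `doc_a` even though it matches only one of the two query terms.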
Saturation

 [Figure: Frequency Saturation. x-axis: term frequency tf (0 to 30); y-axis: saturated weight, tf/(2+tf) (0 to 1)]




 Replace tf by tf/(k1+tf)
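
A minimal sketch of the saturation step, assuming k1 = 2 as in the plotted curve tf/(2+tf). The function name `saturate` is chosen for this example only.

```python
def saturate(tf, k1=2.0):
    """Map a raw term frequency to a bounded weight in [0, 1)."""
    return tf / (k1 + tf)


# The weight grows quickly for the first few occurrences, then flattens out.
curve = [(tf, round(saturate(tf), 3)) for tf in (0, 1, 2, 5, 10, 30)]
```

The weight reaches 0.5 when tf equals k1 and never reaches 1, so additional repetitions contribute less and less.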
Soft Length Normalization

 [Figure: Length normalization. x-axis: document length (0 to 30); y-axis: normalized frequency (0 to 2)]




 Replace tf by $$tf' = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avdl}}$$
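
The length normalization can be sketched as follows. The function name `normalize_tf` is hypothetical, and the default b = 0.75 is a commonly used value assumed here, not one stated in the talk.

```python
def normalize_tf(tf, dl, avdl, b=0.75):
    """Scale tf by document length dl relative to the average length avdl.

    b = 0 disables normalization; b = 1 normalizes fully by dl / avdl.
    """
    return tf / ((1.0 - b) + b * dl / avdl)


same_as_average = normalize_tf(3, dl=10, avdl=10)      # length equals average
twice_as_long = normalize_tf(3, dl=20, avdl=10, b=1.0)  # full normalization
no_normalization = normalize_tf(3, dl=20, avdl=10, b=0.0)
```

A document of average length keeps its tf unchanged, longer documents are penalized, and b controls how strongly ("soft" normalization).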
Inverse Document Frequency (IDF)

 [Figure: IDF weighting. x-axis: number of docs with term, n_i (0 to 120); y-axis: IDF weight w_i (0 to 2.5)]



    $$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$$
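
A small sketch of this IDF weight. The slide does not state the base of the logarithm; the natural log is assumed here, and the function name `idf` is chosen for the example.

```python
import math


def idf(num_docs, doc_freq):
    """IDF weight for a term that occurs in doc_freq of num_docs documents."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


rare = idf(100, 1)       # rare term: large weight
common = idf(100, 10)    # more common term: smaller weight
half = idf(100, 50)      # term in exactly half the documents
majority = idf(100, 90)  # term in most documents
```

Note that with this form the weight is zero for a term in exactly half of the documents and negative for a term in more than half of them.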
Field Weights




     Every field has a different b (length verbosity parameter) and a different v
     (field value parameter)
The BM25F Formula

 Field weighting:             $$\tilde{tf}_i = \sum_{s=1}^{S} v_s \frac{tf_{si}}{B_s}$$

 Field length normalization:  $$B_s = (1 - b_s) + b_s \frac{sl_s}{avsl}$$

 Saturation and IDF:          $$w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \, w_i^{IDF}$$
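
The three steps of the formula can be combined into one function, sketched below. All names (`bm25f_term_weight`, the dictionary layout, the field names and parameter values in the example) are assumptions for illustration; in particular, the average length is taken per field here, which is one reading of avsl. This is not the API of the library discussed in the talk.

```python
def bm25f_term_weight(field_tfs, field_lengths, avg_field_lengths,
                      field_params, k1, idf_weight):
    """BM25F weight of one query term in one document.

    field_tfs:         field -> tf of the term in that field (tf_si)
    field_lengths:     field -> length of that field in this document (sl_s)
    avg_field_lengths: field -> average length of that field (assumed per field)
    field_params:      field -> (v_s, b_s)
    """
    tf_tilde = 0.0
    for field, tf in field_tfs.items():
        v, b = field_params[field]
        # Field length normalization B_s.
        norm = (1.0 - b) + b * field_lengths[field] / avg_field_lengths[field]
        # Field weighting: accumulate v_s * tf_si / B_s.
        tf_tilde += v * tf / norm
    # Saturation and IDF.
    return tf_tilde / (k1 + tf_tilde) * idf_weight


# Hypothetical two-field document whose fields have exactly average length.
weight = bm25f_term_weight(
    field_tfs={"title": 1, "body": 2},
    field_lengths={"title": 5, "body": 100},
    avg_field_lengths={"title": 5, "body": 100},
    field_params={"title": (2.0, 0.5), "body": (1.0, 0.75)},
    k1=2.0,
    idf_weight=1.0,
)
```

The field frequencies are combined before saturation, so a term repeated across several fields is saturated once rather than once per field.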
Joaquin’s Implementation

        Joaquín Pérez Iglesias of UNED, Madrid, Spain
         implemented a BM25F library for Lucene,
         with the class BM25BooleanQuery
        Algorithm:
          Collect documents with query terms
          Score individual terms using BM25F
          Combine scores using addition to get Boolean query
           score
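
The three algorithm steps above can be sketched over a toy in-memory index. This is not the code of `BM25BooleanQuery`: the function `score_boolean_query`, the postings layout, and the stand-in term scorer are hypothetical, and the per-term score is passed in as a callable so that any BM25F implementation could be plugged in.

```python
from collections import defaultdict


def score_boolean_query(query_terms, postings, term_score):
    """Collect matching documents and add up their per-term scores.

    postings:   term -> {doc_id: tf}
    term_score: callable (term, doc_id) -> score of that term in that document
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Step 1: collect documents containing the term.
        for doc_id in postings.get(term, {}):
            # Steps 2 and 3: score the term, then combine by addition.
            scores[doc_id] += term_score(term, doc_id)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index and a simple saturation scorer standing in for BM25F.
postings = {"bm25": {1: 3, 2: 1}, "lucene": {2: 2}}


def toy_score(term, doc_id):
    tf = postings[term][doc_id]
    return tf / (2.0 + tf)


ranked = score_boolean_query(["bm25", "lucene"], postings, toy_score)
```

Document 2 ranks first because it matches both terms, even though document 1 contains "bm25" more often.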




BM25F Usefulness for Our Case

        Short texts
        Term repetitions hurt relevance for short texts
        Want to combine different fields (in the future,
         different information sources)

        Initial experiments showed nice relevance, but…




Feeling Safe to make Changes

        How can we be sure not to break anything?



        Added Unit Tests
        (This is almost a Lucene standard, but not in
         Academia…)




Production Challenges –
     Performance

     Can this library handle 10M queries daily?
     Initial Runtimes:


                                  Average        Median
                                  runtime (ms)   runtime (ms)
        Standard Lucene scoring   161            119
        BM25F                     273            209
        Difference                68%            75%

Improving Performance

     Addressed using:
      Benchmarking

      Profiling

      Refactoring, to give


                                  Average        Median
                                  runtime (ms)   runtime (ms)
        Standard Lucene scoring   93             65
        BM25F                     92             70
        Difference                -1%            8%
Production Challenges –
Robustness

   Lots of users lead to strange inputs, e.g.:
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs

   Addressed using more careful tokenization
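
The talk does not say what the tokenization changes were. The following is one hypothetical way to make a tokenizer tolerant of such inputs: keep only alphanumeric runs and drop implausibly long tokens. The names `safe_tokens` and `MAX_TOKEN_LENGTH`, and the limit of 20 characters, are assumptions for this sketch.

```python
import re

# Runs of letters and digits; punctuation and underscores act as separators.
TOKEN_RE = re.compile(r"[^\W_]+")

# Hypothetical cut-off: tokens longer than this are treated as noise.
MAX_TOKEN_LENGTH = 20


def safe_tokens(text):
    """Return lowercase alphanumeric tokens, skipping over-long ones."""
    return [
        token.lower()
        for token in TOKEN_RE.findall(text)
        if len(token) <= MAX_TOKEN_LENGTH
    ]


slashes = safe_tokens("////////////////////////////////////////")
smiley = safe_tokens(";-)")
normal = safe_tokens("What is BM25?")
```

A query that yields no tokens can then be rejected before it reaches the scorer, rather than failing inside it.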
Production Challenges –
Integration and Interoperability

   Needs data not currently in Lucene index:
     Average Field Lengths
     Document-level IDF
   We calculated the first externally and
    approximated the second using longest field IDF

   Library does not play nicely with others – not
    recursive
   BM25 Library supports BooleanQuery, not
    phrases, prefix, etc.
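
Computing average field lengths outside the index, as mentioned above, could look like the sketch below. It assumes documents are available as dictionaries of already-tokenized fields and that the average is taken over the documents that contain each field; both are assumptions, and `average_field_lengths` is a name made up for this example.

```python
from collections import defaultdict


def average_field_lengths(docs):
    """Return field -> mean token count over the documents having that field."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            totals[field] += len(tokens)
            counts[field] += 1
    return {field: totals[field] / counts[field] for field in totals}


# Hypothetical two-document collection.
docs = [
    {"title": ["a", "b"], "body": ["x"] * 10},
    {"title": ["c", "d", "e", "f"], "body": ["y"] * 20},
]
averages = average_field_lengths(docs)
```

The resulting table would have to be recomputed or updated incrementally as the collection changes, since it lives outside the index.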
Remember case 31136?



Well, she’s mostly pleased…

   BM25 runs in our production environment
   Supporting 10s of millions of queries daily
Future Work

        LUCENE-2091 – Our suggested contrib patch
        LUCENE-2392 – Current work on making Lucene
         scoring more flexible, to incorporate BM25 as well
         as other models
        We want to incorporate BM25 scoring into Solr
        Could this be faster as well?




References

   Integrating the Probabilistic Model BM25/BM25F
    into Lucene – Joaquin Perez Iglesias
   The Probabilistic Relevance Framework: BM25
    and Beyond – Stephen Robertson and Hugo
    Zaragoza
   Working Effectively with Legacy Code – Michael
    Feathers
