BM25 Scoring for Lucene:
From Academia to Industry

             Yuval Feinstein
             Answers Corporation




              Apache Lucene EuroCon 2010 Meetup
              Prague, May 2010
Overview

       Answers.com
       A Relevance problem
       BM25F - a possible solution
       Joaquin’s Implementation
       Productization
       Future directions




Answers.com

       Mission - Provide best answers about anything.
       A popular web site (according to comScore,
        March 2010):
          #33 worldwide, with 75.8 million unique users
          #18 in US, with 51.2 million unique users
       WikiAnswers – community Q&A site (UGC)
       ReferenceAnswers – editorial content
       Atlas – internal search engine
       Implicit search example: find similar questions
Similar Questions




Case 31136




Enter BM25F

   Query Q = (t1, t2, …, tm)
   Document D
   Term frequency tfi
   $$ similarity(Q, D) = \sum_{t_i \in Q \cap D} w_i \, tf_i $$

   How much should tfi influence similarity?
   Determine similarity by choosing weights
   BM25F: saturation, soft length normalization, idf
    weights and field weights.
Saturation

   [Figure: Frequency Saturation. Saturated weight tf/(2+tf) (y-axis, 0 to 1) against term frequency tf (x-axis, 0 to 30).]

   Replace tf by tf/(k1+tf)
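
A minimal Python sketch of the saturation step, assuming k1 = 2 as in the plotted curve tf/(2+tf). The function name `saturate` is illustrative and not part of any library.

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to the bounded weight tf / (k1 + tf)."""
    if tf < 0 or k1 <= 0:
        raise ValueError("tf must be non-negative and k1 must be positive")
    return tf / (k1 + tf)


if __name__ == "__main__":
    # The weight climbs quickly for the first few occurrences,
    # then flattens out and never reaches 1.
    for tf in (0, 1, 2, 10, 30):
        print(tf, round(saturate(tf), 3))
```

With this shape, the step from one to two occurrences adds far more weight than the step from nine to ten, which is the point of saturation.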
Soft Length Normalization

   [Figure: Length normalization. Normalized frequency (y-axis, 0 to 2) against document length (x-axis, 0 to 30).]

   Replace tf by
   $$ tf' = \frac{tf}{(1 - b) + b \, \frac{dl}{avdl}} $$
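
The effect of b is easiest to see with numbers. The sketch below assumes dl and avdl are token counts and uses b = 0.75 as a default purely for illustration; the slides do not state which value was used, and the name `normalize_tf` is hypothetical.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Return tf' = tf / ((1 - b) + b * dl / avdl)."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if avdl <= 0:
        raise ValueError("avdl must be positive")
    return tf / ((1.0 - b) + b * dl / avdl)


if __name__ == "__main__":
    # b = 0 switches normalization off; b = 1 divides by the full
    # length ratio dl / avdl; values in between give "soft" normalization.
    for b in (0.0, 0.5, 1.0):
        print(b, normalize_tf(tf=3, dl=20, avdl=10, b=b))
```

A document of exactly average length keeps its raw tf for any b; longer documents are discounted and shorter ones boosted.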
Inverse Document Frequency (IDF)

   [Figure: IDF weighting. IDF weight w_i (y-axis, 0 to 2.5) against number of documents containing the term, n_i (x-axis, 0 to 120).]

   $$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} $$
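
A small sketch of this IDF weight in Python. The slide does not state the base of the logarithm, so the natural logarithm here is an assumption (a different base only rescales every weight by the same constant). The name `idf_weight` is illustrative.

```python
import math


def idf_weight(n_docs: int, doc_freq: int) -> float:
    """Return log((N - n_i + 0.5) / (n_i + 0.5)) for a term found in n_i of N docs."""
    if not 0 <= doc_freq <= n_docs:
        raise ValueError("doc_freq must lie between 0 and n_docs")
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Rare terms get large weights; the weight reaches zero when the term
    # occurs in exactly half the documents and turns negative beyond that.
    for n_i in (1, 10, 50, 90):
        print(n_i, round(idf_weight(100, n_i), 3))
```

The negative weights for very common terms follow directly from the formula, so an implementation has to decide how to treat them, for example by clamping at zero.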
Field Weights




     Every field has a different b (length verbosity parameter) and a different v
     (field value parameter)
The BM25F Formula

   Field weighting:
   $$ \tilde{tf}_i = \sum_{s=1}^{S} v_s \, \frac{tf_{si}}{B_s} $$

   Field length normalization:
   $$ B_s = (1 - b_s) + b_s \, \frac{sl_s}{avsl} $$

   Saturation and IDF:
   $$ w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \, w_i^{IDF} $$
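
The three steps can be combined into a short Python sketch for a single query term. This is an illustration of the formula, not the library's code: `FieldParams`, `bm25f_term_weight`, the field names, and the default k1 = 2 are assumptions, and avsl is taken to be a per-field average length.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FieldParams:
    v: float        # field weight v_s
    b: float        # length normalization strength b_s
    avg_len: float  # average length of this field over the collection (avsl)


def bm25f_term_weight(
    tf_by_field: Dict[str, float],
    len_by_field: Dict[str, float],
    params: Dict[str, FieldParams],
    idf: float,
    k1: float = 2.0,
) -> float:
    """Return the BM25F weight of one term in one document."""
    tf_tilde = 0.0
    for name, tf in tf_by_field.items():
        p = params[name]
        # Field length normalization B_s.
        b_s = (1.0 - p.b) + p.b * len_by_field[name] / p.avg_len
        # Field weighting: accumulate v_s * tf_si / B_s.
        tf_tilde += p.v * tf / b_s
    # Saturation applied once to the combined frequency, then scaled by IDF.
    return tf_tilde / (k1 + tf_tilde) * idf


if __name__ == "__main__":
    params = {
        "title": FieldParams(v=2.0, b=0.5, avg_len=10),
        "body": FieldParams(v=1.0, b=0.5, avg_len=100),
    }
    w = bm25f_term_weight(
        {"title": 1, "body": 2}, {"title": 10, "body": 100}, params, idf=3.0
    )
    print(w)
```

Saturation is applied after the per-field frequencies are merged, so a term repeated across several fields saturates once rather than once per field.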
Joaquin’s Implementation

        Joaquín Pérez Iglesias of UNED, Madrid, Spain
         implemented a BM25F library for Lucene,
         with the class BM25BooleanQuery
        Algorithm:
          Collect documents with query terms
          Score individual terms using BM25F
          Combine scores using addition to get Boolean query
           score
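
The three algorithm steps can be sketched over a toy in-memory index. This is not the BM25BooleanQuery code: the `Postings` structure, the function names, and the choice to count a repeated query term only once are all assumptions, and the per-term scorer is passed in as a callable so that any term-weighting formula can be plugged in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# term -> {doc_id -> term frequency}; a toy stand-in for a real index.
Postings = Dict[str, Dict[int, int]]


def boolean_or_scores(
    query_terms: Iterable[str],
    postings: Postings,
    score_term: Callable[[str, int, int], float],
) -> Dict[int, float]:
    """Collect documents containing any query term and sum their term scores."""
    scores: Dict[int, float] = defaultdict(float)
    for term in set(query_terms):  # assumption: duplicate query terms count once
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += score_term(term, doc_id, tf)
    return dict(scores)


def rank(scores: Dict[int, float]) -> List[Tuple[int, float]]:
    """Order documents by descending score, breaking ties by doc id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    postings = {"lucene": {1: 2, 2: 1}, "bm25": {2: 3}}
    # Placeholder scorer: saturation only, with k1 = 2.
    scorer = lambda term, doc_id, tf: tf / (2 + tf)
    print(rank(boolean_or_scores(["lucene", "bm25"], postings, scorer)))
```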




BM25F Usefulness for Our Case

        Short texts
        Term repetitions hurt relevance for short texts
        Want to combine different fields (in the future,
         different information sources)

        Initial experiments showed good relevance, but…




Feeling Safe to Make Changes

        How can we be sure not to break anything?



        Added Unit Tests
        (This is almost a Lucene standard, but not in
         Academia…)




Production Challenges – Performance

     Can this library handle 10M queries daily?
     Initial Runtimes:


                                   Average Runtime (ms)   Median Runtime (ms)
        Standard Lucene Scoring    161                    119
        BM25F                      273                    209
        Difference                 68%                    75%
Improving Performance

     Addressed using:
      Benchmarking

      Profiling

      Refactoring, to give


                                   Average Runtime (ms)   Median Runtime (ms)
        Standard Lucene Scoring    93                     65
        BM25F                      92                     70
        Difference                 -1%                    8%
Production Challenges – Robustness

   Lots of users means strange inputs, e.g.:
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs

   Addressed using more careful tokenization
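
The slides do not describe the actual tokenization fix. As one plausible illustration only, a defensive tokenizer might keep alphanumeric runs, drop implausibly long tokens, and cap the number of query terms. The limits and the ASCII-only pattern below are hypothetical choices, not the production rules.

```python
import re
from typing import List

_TOKEN = re.compile(r"[A-Za-z0-9]+")  # assumption: ASCII alphanumerics only
MAX_TOKEN_LEN = 20                    # hypothetical cap on token length
MAX_TOKENS = 32                       # hypothetical cap on query size


def safe_tokens(query: str) -> List[str]:
    """Extract lowercased alphanumeric tokens, discarding degenerate input."""
    tokens = [t.lower() for t in _TOKEN.findall(query)]
    return [t for t in tokens if len(t) <= MAX_TOKEN_LEN][:MAX_TOKENS]


if __name__ == "__main__":
    for q in ("/" * 40, ";-)", "What is BM25?"):
        print(repr(q), "->", safe_tokens(q))
```

An empty token list can then be answered with no results instead of being sent to the scorer.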
Production Challenges – Integration and Interoperability

   Needs data not currently in Lucene index:
     Average Field Lengths
     Document-level IDF
   We calculated the first externally and
    approximated the second using longest field IDF

   Library does not play nicely with others – not
    recursive
   BM25 Library supports BooleanQuery, not
    phrases, prefix, etc.
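
Computing average field lengths outside the index can be as simple as one pass over the documents. The sketch below is an assumption about how that might be done, not the method used at Answers.com: it measures length in whitespace-separated tokens and averages each field only over the documents that contain it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def average_field_lengths(docs: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the mean token count per field over documents that have the field."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for doc in docs:
        for field, text in doc.items():
            totals[field] += len(text.split())
            counts[field] += 1
    return {field: totals[field] / counts[field] for field in totals}


if __name__ == "__main__":
    docs = [
        {"title": "bm25 scoring", "body": "bm25 scoring for lucene"},
        {"title": "lucene in production", "body": "short text"},
    ]
    print(average_field_lengths(docs))
```

The result has to be recomputed or updated as the collection changes, since the averages feed into every length-normalized score.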
Remember case 31136?



Well, she’s mostly pleased…

   BM25 runs in our production environment
   Supporting tens of millions of queries daily
Future Work

        LUCENE-2091 – Our suggested contrib patch
        LUCENE-2392 – Current work on making Lucene
         scoring more flexible, to incorporate BM25 as well
         as other models
        We want to incorporate BM25 scoring into Solr
        Could this be faster as well?




References

   Integrating the Probabilistic Model BM25/BM25F
    into Lucene – Joaquin Perez Iglesias
   The Probabilistic Relevance Framework: BM25
    and Beyond – Stephen Robertson and Hugo
    Zaragoza
   Working Effectively with Legacy Code – Michael
    Feathers
