Tweaking the Base Score: Lucene/Solr Similarities Explained

This talk was given at Activate Conference 2019. Lucene has many options for configuring similarity, and Solr inherits them. Similarity forms the base of your relevancy score: how similar is this document to the query? The default similarity (BM25) is a good start, but you may need to tweak it for your use-case. In this session, you will learn how BM25 works and how you may want to change its parameters. Then, we'll move on to the other similarity classes: DFR, DFI, IB and LM. You will learn the thinking behind them, how that thinking translates into the similarity score, and which parameters let you tweak how the score evolves based on things like term frequency or document length. By the end, you'll have a good understanding of which similarity options are likely to work well for your use-case. You'll know which tunables are available and whether you need to implement a custom similarity class. As an example, we'll focus on e-commerce, where you often end up ignoring term frequency altogether.

Key Takeaways
1) What the built-in Lucene/Solr similarities are and what they do
2) Which similarity to use for which use-case
3) How to use a custom similarity class in Solr

Learn more about search relevance and similarity: sematext.com/blog/search-relevance-solr-elasticsearch-similarity

Published in: Engineering

  1. 1. Tweaking the Base Score: Lucene/Solr Similarities Explained Demo: github.com/sematext/activate/tree/master/2019 More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity Radu Gheorghe Rafał Kuć www.sematext.com
  2. 2. Agenda: BM25 - Best Match: the default; DFR - Divergence From Randomness framework; DFI - Divergence From Independence; IB - Information-Based models; LM - Language Models; Custom similarity; Putting it all together
  3. 3. TF*IDF You know, for historical reasons
  4. 4. BM25 - the TF part freq / (freq + k1 * (1 - b + b * dl / avgdl)) Best for Most 😁
  5. 5. BM25 tunables freq / (freq + k1 * (1 - b + b * dl / avgdl)) k1 - raise or lower ceiling
  6. 6. BM25 tunables freq / (freq + k1 * (1 - b + b * dl / avgdl)) doc length normalization
  7. 7. BM25 demo yes, that’s how we look when we give demos
  8. 8. BM25 Good default. You can tune the weight of freq and docLength.
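
The BM25 term-frequency formula from the slides above can be tried out in isolation. The following is a minimal sketch in plain Java, not Lucene code: the class name, method name and sample values are made up for illustration, and 1.2 and 0.75 are used for k1 and b because they are the Lucene defaults. In this form the value never exceeds 1; k1 controls how quickly it gets there, and b controls how strongly documents longer than average are penalized.

```java
// Illustrative sketch only: class and method names are made up for this example
// and are not part of the Lucene or Solr API.
final class Bm25TfSketch {

    // freq / (freq + k1 * (1 - b + b * dl / avgdl)), as written on the slides.
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        double lengthNorm = 1 - b + b * dl / avgdl; // equals 1 for an average-length document
        return freq / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double avgdl = 100; // assumed average field length
        double[] freqs = {1, 2, 4, 8, 16};
        for (double freq : freqs) {
            System.out.printf(
                    "freq=%2.0f  k1=1.2: %.3f  k1=2.0: %.3f  b=0, dl=400: %.3f  b=0.75, dl=400: %.3f%n",
                    freq,
                    tf(freq, 1.2, 0.75, avgdl, avgdl),
                    tf(freq, 2.0, 0.75, avgdl, avgdl),
                    tf(freq, 1.2, 0.0, 400, avgdl),
                    tf(freq, 1.2, 0.75, 400, avgdl));
        }
    }
}
```

Running `main` prints one row per term frequency, which makes it easy to see the saturation effect and to compare b = 0 (document length ignored) against b = 0.75 for a document four times longer than average.
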
  9. 9. Divergence From Randomness. Basic Model: G, I(n), I(ne), I(F). After Effect: L, B. Normalization: H1, H2, H3, Z, none
  10. 10. tf * c * avgFieldLength / docFieldLength Divergence From Randomness - H1
  11. 11. Divergence From Randomness - H1 No normalization, and H1 with c == 1, 3, 5, 7
  12. 12. tf * log2 (1 + c * (avgFieldLength / docFieldLength)) Divergence From Randomness - H2
  13. 13. Divergence From Randomness - H2 No normalization, and H2 with c == 1, 3, 5, 7
  14. 14. tf * (avgFieldLength / docFieldLength)^z Divergence From Randomness - Z
  15. 15. Divergence From Randomness - Z No normalization, and Z with z == 0.1, 0.2, 0.3, 0.4
  16. 16. (tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu Divergence From Randomness - H3
  17. 17. Divergence From Randomness - H3 No normalization, and H3 with mu == 1, 3, 5, 7
  18. 18. DFR demo Only one, I promise
  19. 19. DFR Framework. Tunable: choose algorithm and tune parameters for both IDF* and docLength. * generic name for importance of this term
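
The four DFR length normalizations shown in the slides above can be compared side by side. This is a hedged sketch in plain Java rather than Lucene code: the class and method names are invented for the example, and the H3 form used here, (tf + mu * p) / (docLen + mu) * mu with p = (totalTermFreq + 1) / (fieldTokens + 1), is my reading of Lucene's `NormalizationH3`, so treat it as an assumption.

```java
// Illustrative sketch only: names are made up and are not Lucene API.
final class DfrNormalizationSketch {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // H1: tf * c * avgFieldLength / docFieldLength
    static double h1(double tf, double c, double avgLen, double docLen) {
        return tf * c * (avgLen / docLen);
    }

    // H2: tf * log2(1 + c * (avgFieldLength / docFieldLength))
    static double h2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * (avgLen / docLen));
    }

    // Z: tf * (avgFieldLength / docFieldLength)^z
    static double z(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    // H3 (assumed form): Dirichlet-style smoothing with the collection-level term probability.
    static double h3(double tf, double mu, double totalTermFreq, double fieldTokens, double docLen) {
        double p = (totalTermFreq + 1) / (fieldTokens + 1);
        return (tf + mu * p) / (docLen + mu) * mu;
    }

    public static void main(String[] args) {
        double tf = 2, avgLen = 100, docLen = 200; // a document twice the average length
        for (double c : new double[] {1, 3, 5, 7}) {
            System.out.printf("c=%.0f  H1=%.3f  H2=%.3f%n",
                    c, h1(tf, c, avgLen, docLen), h2(tf, c, avgLen, docLen));
        }
        for (double zParam : new double[] {0.1, 0.2, 0.3, 0.4}) {
            System.out.printf("z=%.1f  Z=%.3f%n", zParam, z(tf, zParam, avgLen, docLen));
        }
        for (double mu : new double[] {1, 3, 5, 7}) {
            System.out.printf("mu=%.0f  H3=%.3f%n", mu, h3(tf, mu, 9, 99, docLen));
        }
    }
}
```

The parameter values in `main` mirror the ones plotted in the slides (c and mu of 1, 3, 5, 7; z of 0.1 to 0.4); the term and collection statistics are arbitrary sample numbers.
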
  20. 20. Divergence From Independence expected frequency
  21. 21. Divergence From Independence docLength*totalTermFrequency/numberOfFieldTokens expected frequency
  22. 22. DFI: Standardized (actual - expected)/sqrt(expected)
  23. 23. DFI demo Oh, but don’t remove stopwords*! 1) arbitrarily chops field length 2) stopwords aren’t always stopwords ;)
  24. 24. DFI Simple. Parameterless. Flexible: works well with various datasets.
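
The two DFI formulas from the slides above, the expected frequency and the standardized measure, fit in a few lines. The sketch below is plain Java with invented names and sample numbers; it only reproduces the formulas as written on the slides and does not attempt to mirror how Lucene turns the measure into a final score.

```java
// Illustrative sketch only: names and sample values are made up, not Lucene API.
final class DfiSketch {

    // Expected frequency of a term in a document if it were spread evenly:
    // docLength * totalTermFrequency / numberOfFieldTokens
    static double expected(double docLength, double totalTermFrequency, double numberOfFieldTokens) {
        return docLength * totalTermFrequency / numberOfFieldTokens;
    }

    // Standardized measure: (actual - expected) / sqrt(expected)
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }

    public static void main(String[] args) {
        // Assumed collection: 10,000 tokens in the field, 50 of them are our term.
        double exp = expected(100, 50, 10_000);
        for (double actual : new double[] {1, 2, 3, 5}) {
            System.out.printf("actual=%.0f expected=%.2f measure=%.3f%n",
                    actual, exp, standardized(actual, exp));
        }
    }
}
```

Because the only inputs are index statistics, there is nothing to tune, which is what the "parameterless" point on the summary slide refers to.
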
  25. 25. Information Based how much information we get from this term?
  26. 26. Information Based. Distribution: Log-Logistic, Smoothed Power-Law. Lambda: DF, TTF. Normalization: H1, H2, H3, Z, none
  27. 27. Information Based - Log-Logistic log( tfn / lambda + 1 )
  28. 28. Information Based - Log-Logistic lambda: 0.1 (red), 0.3 (black), 0.8 (blue)
  29. 29. Information Based - Retrieval Function the average of the document information brought by each query term
  30. 30. Information Based - Retrieval Function - DF number of matching documents (docFrequency + 1) / (numberOfDocuments + 1)
  31. 31. Information Based - Retrieval Function - TTF total number of term occurrences (totalTermFrequency + 1) / (numberOfDocuments + 1)
  32. 32. IB demo
  33. 33. IB Framework, like DFR. Even has the same normalization options. But newer and, in the paper, better.
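
To see how the IB pieces from the slides above combine, here is a small plain-Java sketch with invented names. It assumes the log-logistic form log(tfn / lambda + 1), which is equivalent to the -log(lambda / (tfn + lambda)) that I believe Lucene's `DistributionLL` computes, and it uses the two lambda definitions from the slides (DF and TTF). The sample statistics are arbitrary.

```java
// Illustrative sketch only: names and sample values are made up, not Lucene API.
final class IbSketch {

    // Lambda DF: (docFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaDf(long docFrequency, long numberOfDocuments) {
        return (docFrequency + 1.0) / (numberOfDocuments + 1.0);
    }

    // Lambda TTF: (totalTermFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaTtf(long totalTermFrequency, long numberOfDocuments) {
        return (totalTermFrequency + 1.0) / (numberOfDocuments + 1.0);
    }

    // Log-logistic information (assumed form): log(tfn / lambda + 1)
    static double logLogistic(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }

    public static void main(String[] args) {
        // The same lambda values that are plotted in the slides.
        for (double lambda : new double[] {0.1, 0.3, 0.8}) {
            for (double tfn : new double[] {1, 2, 4, 8}) {
                System.out.printf("lambda=%.1f tfn=%.0f info=%.3f%n",
                        lambda, tfn, logLogistic(tfn, lambda));
            }
        }
        // Lambda derived from assumed index statistics: 9 matching docs out of 99.
        System.out.printf("lambdaDf=%.3f%n", lambdaDf(9, 99));
    }
}
```

A smaller lambda means a rarer term, and the sketch shows that a rarer term contributes more information for the same normalized term frequency.
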
  34. 34. Language Models probability of a term being our term
  35. 35. Language Models totalTermFreq/totalFieldTokens probability of a term being our term
  36. 36. Language Models: Jelinek-Mercer log( 1 + ((1-λ) * tf / docLength) / (λ * probability) )
  37. 37. LM demo feat. Jelinek-Mercer
  38. 38. LM Two probabilistic models. Similar approach to DFI, but tunable.
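
The Jelinek-Mercer score from the slides above can be sketched as follows. This is plain Java with invented names, not Lucene code; it assumes the form log(1 + ((1 - λ) * tf / docLength) / (λ * probability)), which is how I understand Lucene's `LMJelinekMercerSimilarity`, with the collection probability taken as totalTermFreq / totalFieldTokens as on the slides.

```java
// Illustrative sketch only: names and sample values are made up, not Lucene API.
final class JelinekMercerSketch {

    // Probability of a random token in the field being our term.
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    // Assumed form: log(1 + ((1 - lambda) * tf / docLength) / (lambda * probability))
    static double score(double tf, double docLength, double lambda, double probability) {
        double documentModel = (1 - lambda) * tf / docLength;
        double collectionModel = lambda * probability;
        return Math.log(1 + documentModel / collectionModel);
    }

    public static void main(String[] args) {
        double p = collectionProbability(100, 10_000); // assumed statistics
        for (double lambda : new double[] {0.1, 0.5, 0.7}) {
            System.out.printf("lambda=%.1f  tf=1: %.3f  tf=2: %.3f  tf=4: %.3f%n",
                    lambda,
                    score(1, 100, lambda, p),
                    score(2, 100, lambda, p),
                    score(4, 100, lambda, p));
        }
    }
}
```

Lambda is the tunable here: a lower value puts more weight on the document's own term frequency, while a higher value leans on the collection-wide probability and flattens the differences between documents.
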
  39. 39. Custom Similarity compute a similarity score using custom code
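  40. 40. Custom Similarity - Activate Similarity Factory
        import org.apache.lucene.search.similarities.Similarity;
        import org.apache.solr.common.params.SolrParams;
        import org.apache.solr.schema.SimilarityFactory;

        // Solr instantiates this factory and asks it for the Similarity to use.
        public class ActivateSimilarityFactory extends SimilarityFactory {
          private volatile Similarity similarity;

          @Override
          public void init(SolrParams params) {
            super.init(params);
          }

          @Override
          public Similarity getSimilarity() {
            // Create the similarity lazily on first use.
            if (similarity == null) {
              similarity = new ActivateSimilarity();
            }
            return similarity;
          }
        }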
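  43. 43. Custom Similarity - Similarity
        import org.apache.lucene.index.FieldInvertState;
        import org.apache.lucene.search.CollectionStatistics;
        import org.apache.lucene.search.TermStatistics;
        import org.apache.lucene.search.similarities.Similarity;

        public class ActivateSimilarity extends Similarity {
          public ActivateSimilarity() {}

          @Override
          public long computeNorm(FieldInvertState state) {
            // Same norm for every document: field length is not encoded.
            return 1;
          }

          @Override
          public Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats,
                                             TermStatistics... termStats) {
            // Boost and index statistics are ignored by this scorer.
            return new ActivateSimScorer();
          }
        }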
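  46. 46. Custom Similarity - SimScorer
        import org.apache.lucene.search.similarities.Similarity;

        public class ActivateSimScorer extends Similarity.SimScorer {
          @Override
          public float score(float freq, long norm) {
            // The score is the raw term frequency; the norm is ignored.
            return freq;
          }
        }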
  48. 48. Custom Similarity demo
  49. 49. Custom When you need something special, like disregarding term frequency.
  50. 50. Multiple similarities demo
  51. 51. THANK YOU
