Tweaking the Base Score:
Lucene/Solr Similarities Explained
Demo: github.com/sematext/activate/tree/master/2019

This talk was given at Activate Conference 2019. Lucene has a lot of options for configuring similarity, and Solr inherits them. Similarity forms the base of your relevancy score: how similar is this document to the query? The default similarity (BM25) is a good start, but you may need to tweak it for your use-case. In this session, you will learn how BM25 works and how you may want to change its parameters. Then, we'll move to other similarity classes: DFR, DFI, IB and LM. You will learn the thinking behind them, how that thinking translates to the similarity score, and which parameters allow you to tweak how the score evolves based on things like term frequency or document length. By the end, you’ll have a good understanding of which similarity options are likely to work well for your use-case. You'll know which tunables are available and whether you need to implement a custom similarity class. As an example, we’ll focus on e-commerce, where you often end up ignoring term frequency altogether.

Key Takeaways
1) What the built-in Lucene/Solr similarities are and what they do
2) Which similarity to use for which use-case
3) How to use a custom similarity class in Solr

Learn more about search relevance and similarity: sematext.com/blog/search-relevance-solr-elasticsearch-similarity

Published in: Engineering

  1. Tweaking the Base Score: Lucene/Solr Similarities Explained. Demo: github.com/sematext/activate/tree/master/2019; More info: sematext.com/blog/search-relevance-solr-elasticsearch-similarity; Radu Gheorghe, Rafał Kuć, www.sematext.com
  2. Agenda: BM25 - Best Match: the default; DFR - Divergence From Randomness framework; DFI - Divergence From Independence; IB - Information-Based models; LM - Language Models; Custom similarity; Putting it all together
  3. TF*IDF: You know, for historical reasons
  4. BM25 - the TF part: freq / (freq + k1 * (1 - b + b * dl / avgdl)). Best for Most 😁
  5. BM25 tunables: freq / (freq + k1 * (1 - b + b * dl / avgdl)). k1: raise or lower ceiling
  6. BM25 tunables: freq / (freq + k1 * (1 - b + b * dl / avgdl)). b: doc length normalization
  7. BM25 demo: yes, that’s how we look when we give demos
  8. BM25: Good default. You can tune the weight of freq and docLength.
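The TF part from slides 4-6 can be sketched in plain Java as below. This is an illustration only, not Lucene's BM25Similarity code: the class and method names are made up, and the document lengths are arbitrary. The values k1 = 1.2 and b = 0.75 are Lucene's defaults. The loop shows the value flattening out as freq grows, with k1 changing how quickly that happens.

```java
final class Bm25TfDemo {
    // TF part as written on the slides:
    // freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double avgdl = 100; // hypothetical average field length
        for (double freq : new double[] {1, 2, 5, 10, 100}) {
            System.out.printf(
                "freq=%.0f  k1=1.2: %.4f  k1=2.0: %.4f  long doc (dl=400): %.4f%n",
                freq,
                tf(freq, 1.2, 0.75, 100, avgdl),
                tf(freq, 2.0, 0.75, 100, avgdl),
                tf(freq, 1.2, 0.75, 400, avgdl));
        }
    }
}
```

With b = 0 the document length drops out of the formula entirely; with b closer to 1, longer-than-average documents are penalized more.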
  9. Divergence From Randomness. Basic Model: G, I(n), I(ne), I(F). After Effect: L, B. Normalization: H1, H2, H3, Z, none
  10. Divergence From Randomness - H1: tf * c * avgFieldLength / docFieldLength
  11. Divergence From Randomness - H1: No normalization, and H1 with c == 1, 3, 5, 7
  12. Divergence From Randomness - H2: tf * log2(1 + c * (avgFieldLength / docFieldLength))
  13. Divergence From Randomness - H2: No normalization, and H2 with c == 1, 3, 5, 7
  14. Divergence From Randomness - Z: tf * (avgFieldLength / docFieldLength)^z
  15. Divergence From Randomness - Z: No normalization, and Z with z == 0.1, 0.2, 0.3, 0.4
  16. Divergence From Randomness - H3: (tf + mu * ((totalTermFreq + 1) / (#fieldTokens + 1))) / (docFieldLength + mu) * mu
  17. Divergence From Randomness - H3: No normalization, and H3 with mu == 1, 3, 5, 7
  18. DFR demo: Only one, I promise
  19. DFR: Framework. Tunable: choose algorithm and tune parameters for both IDF* and docLength. (* generic name for importance of this term)
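The four DFR length normalizations can be compared side by side with a small plain-Java sketch. The class and method names are hypothetical and the statistics in main are invented. H1, H2 and Z follow the slide formulas; H3 is written as Dirichlet-style smoothing, (tf + mu * p) / (docFieldLength + mu) * mu, which is how Lucene's NormalizationH3 computes it as far as I know.

```java
final class DfrNormalizationDemo {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // H1: tf * c * avgFieldLength / docFieldLength
    static double h1(double tf, double c, double avgLen, double docLen) {
        return tf * c * avgLen / docLen;
    }

    // H2: tf * log2(1 + c * avgFieldLength / docFieldLength)
    static double h2(double tf, double c, double avgLen, double docLen) {
        return tf * log2(1 + c * avgLen / docLen);
    }

    // Z: tf * (avgFieldLength / docFieldLength)^z
    static double z(double tf, double z, double avgLen, double docLen) {
        return tf * Math.pow(avgLen / docLen, z);
    }

    // H3 (assumed form): (tf + mu * p) / (docFieldLength + mu) * mu,
    // where p = (totalTermFreq + 1) / (fieldTokens + 1)
    static double h3(double tf, double mu, double totalTermFreq,
                     double fieldTokens, double docLen) {
        double p = (totalTermFreq + 1) / (fieldTokens + 1);
        return (tf + mu * p) / (docLen + mu) * mu;
    }

    public static void main(String[] args) {
        double tf = 3, avgLen = 100;               // made-up values
        double totalTermFreq = 500, fieldTokens = 100_000;
        for (double docLen : new double[] {50, 100, 200}) {
            System.out.printf("docLen=%.0f  H1=%.3f  H2=%.3f  Z=%.3f  H3=%.3f%n",
                docLen,
                h1(tf, 1, avgLen, docLen),
                h2(tf, 1, avgLen, docLen),
                z(tf, 0.3, avgLen, docLen),
                h3(tf, 5, totalTermFreq, fieldTokens, docLen));
        }
    }
}
```

In every variant the normalized tf shrinks as the document gets longer than average; c, z and mu control how strongly.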
  20. Divergence From Independence: expected frequency
  21. Divergence From Independence: expected frequency = docLength * totalTermFrequency / numberOfFieldTokens
  22. DFI: Standardized: (actual - expected) / sqrt(expected)
  23. DFI demo: Oh, but don’t remove stopwords*! (* 1) it arbitrarily chops field length; 2) stopwords aren’t always stopwords ;)
  24. DFI: Simple. Parameterless. Flexible: works well with various datasets.
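The two DFI formulas from slides 21-22 fit in a few lines of plain Java. This is a sketch with hypothetical names and invented numbers, not Lucene's DFISimilarity itself; as far as I know, Lucene additionally returns 0 when the term does not occur more often than expected and passes the measure through log2(measure + 1).

```java
final class DfiDemo {
    // Expected frequency if the term were spread evenly over the field:
    // docLength * totalTermFrequency / numberOfFieldTokens
    static double expected(double docLength, double totalTermFreq, double fieldTokens) {
        return docLength * totalTermFreq / fieldTokens;
    }

    // Standardized measure: (actual - expected) / sqrt(expected)
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }

    public static void main(String[] args) {
        // Made-up statistics: the term occurs 50 times in a field of 10,000 tokens
        double exp = expected(100, 50, 10_000);
        for (double actual : new double[] {1, 3, 10}) {
            System.out.printf("actual=%.0f  expected=%.2f  standardized=%.4f%n",
                actual, exp, standardized(actual, exp));
        }
    }
}
```

There is nothing to tune here: the only inputs are index statistics, which is also why removing stopwords (and so changing field lengths) shifts the result.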
  25. Information Based: how much information do we get from this term?
  26. Information Based. Distribution: Log-Logistic, Smoothed Power-Law. Lambda: DF, TTF. Normalization: H1, H2, H3, Z, none
  27. Information Based - Log-Logistic: log(tfn / lambda + 1)
  28. Information Based - Log-Logistic: lambda: 0.1 (red), 0.3 (black), 0.8 (blue)
  29. Information Based - Retrieval Function: the average of the document information brought by each query term
  30. Information Based - Retrieval Function - DF: number of matching documents. (docFrequency + 1) / (numberOfDocuments + 1)
  31. Information Based - Retrieval Function - TTF: total number of term occurrences. (totalTermFrequency + 1) / (numberOfDocuments + 1)
  32. IB demo
  33. IB: Framework, like DFR. Even has the same normalization options. But newer and, in the paper, better.
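A plain-Java sketch of the IB pieces: the two lambda estimates from slides 30-31 and the log-logistic distribution. Names and numbers are hypothetical. The log-logistic is written as log(tfn / lambda + 1), which matches Lucene's DistributionLL as I understand it; tfn stands for the term frequency after one of the length normalizations.

```java
final class IbDemo {
    // Lambda from document frequency: (docFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaDf(double docFreq, double numDocs) {
        return (docFreq + 1) / (numDocs + 1);
    }

    // Lambda from total term frequency: (totalTermFrequency + 1) / (numberOfDocuments + 1)
    static double lambdaTtf(double totalTermFreq, double numDocs) {
        return (totalTermFreq + 1) / (numDocs + 1);
    }

    // Log-logistic distribution (assumed form): log(tfn / lambda + 1)
    static double logLogistic(double tfn, double lambda) {
        return Math.log1p(tfn / lambda);
    }

    public static void main(String[] args) {
        // Same lambda values as the plot on slide 28
        for (double lambda : new double[] {0.1, 0.3, 0.8}) {
            System.out.printf("lambda=%.1f:", lambda);
            for (double tfn : new double[] {1, 2, 5, 10}) {
                System.out.printf("  tfn=%.0f -> %.4f", tfn, logLogistic(tfn, lambda));
            }
            System.out.println();
        }
    }
}
```

A smaller lambda means a rarer term, so each occurrence carries more information and the score is higher for the same tfn.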
  34. Language Models: probability of a term being our term
  35. Language Models: probability of a term being our term = totalTermFreq / totalFieldTokens
  36. Language Models: Jelinek-Mercer: log( ((1 - λ) * tf / docLength) / (λ * probability) )
  37. LM demo: feat. Jelinek-Mercer
  38. LM: Two probabilistic models. Similar approach to DFI, but tunable.
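Jelinek-Mercer smoothing can be sketched in plain Java as follows. The names and statistics are hypothetical. One assumption to flag: the sketch adds 1 inside the logarithm, which is what Lucene's LMJelinekMercerSimilarity does as far as I know, so a term that is absent from the document scores 0 instead of negative infinity.

```java
final class JelinekMercerDemo {
    // Collection-level probability of the term: totalTermFreq / totalFieldTokens
    static double collectionProbability(double totalTermFreq, double totalFieldTokens) {
        return totalTermFreq / totalFieldTokens;
    }

    // log(1 + ((1 - lambda) * tf / docLength) / (lambda * probability))
    // The "1 +" is an assumption about Lucene's variant, see the note above.
    static double score(double tf, double docLength, double lambda, double probability) {
        return Math.log(1 + ((1 - lambda) * tf / docLength) / (lambda * probability));
    }

    public static void main(String[] args) {
        // Made-up statistics: 500 occurrences in a field of 100,000 tokens
        double p = collectionProbability(500, 100_000);
        for (double lambda : new double[] {0.1, 0.7}) {
            for (double tf : new double[] {1, 3, 10}) {
                System.out.printf("lambda=%.1f  tf=%.0f  score=%.4f%n",
                    lambda, tf, score(tf, 100, lambda, p));
            }
        }
    }
}
```

A higher λ gives more weight to the collection-wide probability, so the in-document frequency matters less.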
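  39. Custom Similarity: compute a similarity score using custom code
  40. Custom Similarity - Activate Similarity Factory

      import org.apache.lucene.search.similarities.Similarity;
      import org.apache.solr.common.params.SolrParams;
      import org.apache.solr.schema.SimilarityFactory;

      public class ActivateSimilarityFactory extends SimilarityFactory {
        private volatile Similarity similarity;

        @Override
        public void init(SolrParams params) {
          super.init(params);
        }

        @Override
        public Similarity getSimilarity() {
          // Lazily create the similarity on first use
          if (similarity == null) {
            similarity = new ActivateSimilarity();
          }
          return similarity;
        }
      }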
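  43. Custom Similarity - Similarity

      import org.apache.lucene.index.FieldInvertState;
      import org.apache.lucene.search.CollectionStatistics;
      import org.apache.lucene.search.TermStatistics;
      import org.apache.lucene.search.similarities.Similarity;

      public class ActivateSimilarity extends Similarity {
        public ActivateSimilarity() {}

        @Override
        public long computeNorm(FieldInvertState state) {
          // Constant norm: field length is ignored
          return 1;
        }

        @Override
        public Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats,
                                           TermStatistics... termStats) {
          // Boost and index statistics are not used by this scorer
          return new ActivateSimScorer();
        }
      }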
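  46. Custom Similarity - SimScorer

      import org.apache.lucene.search.similarities.Similarity;

      public class ActivateSimScorer extends Similarity.SimScorer {
        @Override
        public float score(float freq, long norm) {
          // The score is the raw term frequency; the norm is ignored
          return freq;
        }
      }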
  48. Custom Similarity demo
  49. Custom: When you need something special, like disregarding term frequency.
  50. Multiple similarities demo
  51. THANK YOU
