Hybrid Search with Apache Solr
Speaker: Alessandro Benedetti, Director @ Sease
COMMUNITY OVER CODE EU 2024 - 03/06/2024
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about Semantic search, NLP and Machine Learning technologies
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
‣ Headquarters in London/distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
Hot Trends :
● Large Language Models Applications
● Vector-based (Neural) Search
● Natural Language Processing
● Learning To Rank
● Document Similarity
● Search Quality Evaluation
● Relevance Tuning
SEArch SErvices
www.sease.io
Limitations of Vector-Based Search
Hybrid Search
Hybrid Retrieval in Apache Solr
Hybrid Ranking in Apache Solr
Future Works
Overview
Vector-based Search
Training Indexing Searching
Labeled Samples Text to Vectors Query to Vector
Lookup in Index
Neural Search Workflow
Similarity between a Query and a Document is translated to distance in a vector space
Distance Between Vectors -> Similarity
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
● Specify the relationship metric
between elements in the dataset
● use-case dependent
○ experiment which one
works better for you!
● In Information Retrieval Cosine
similarity proved to work quite well (it’s
a normalised inner product)
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
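A minimal sketch of the "normalised inner product" remark above, in plain Python. The vectors are invented two-dimensional examples (real embeddings have hundreds of dimensions), and the helper names are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors.
query = [1.0, 0.0]
doc_same_direction = [3.0, 0.0]   # same direction, different length
doc_orthogonal = [0.0, 2.0]       # unrelated direction

print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))
print(euclidean_distance(query, doc_same_direction))
```

Cosine similarity ignores vector length, so the first document scores as a perfect match even though its Euclidean distance from the query is not zero. This is one reason the choice of metric is use-case dependent.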
Low Explainability
● High Dimensionality - vectors are long sequences of numerical values (768,
1536 …)
● Dimensions - each feature (element in the vector) has no clear semantics in
many cases (slightly different from sparse vectors or explicit feature vectors)
● Values - It’s not obvious how to estimate the impact of a single value on relevance
(higher is better?)
● Similarity - To explain why a search result is retrieved in the top-K the vector
distance is the only info you have
Research is happening but it’s still an open problem
Lexical matches?
● Search users still expect lexical matches to happen
● You can’t enforce that with Vector-based search
“Why the document with the keyword in the title
is not coming up?” cit.
Low Diversity
● By nature vector-based search just returns the top-k
ordered by vector similarity
● Unless you add more logic on top, you would expect low
diversity by definition
Hybrid Search
● Mitigation of current vector-search problems - Is it here to stay?
● Combine traditional keyword-based (lexical) search with vector-based (neural)
search
● Retrieval of two sets of candidates:
○ one set of results coming from lexical matches with the query keywords
○ a set of results coming from the K-Nearest Neighbours search with the query
vector
● Ranking of the candidates
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA ISSUES
https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search
Apache Solr implementation
Retrieval Stage
The hybrid candidate result set is the union of the results coming from the two models:
● the top-K results coming from the K-Nearest Neighbours search and the <numFound>
results coming from the lexical (keyword-based) search.
● The cardinality of the combined result set is <= (K + NumFound).
● The result set doesn’t include any duplicates.
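The union described above can be sketched as a set of Solr request parameters built in Python. The field names `text_field` and `vector`, the keyword `term1` and the short query vector are hypothetical placeholders; the nested `{!bool should=...}` form is one way to express the union, not necessarily the only one.

```python
def union_retrieval_params(keywords, query_vector, top_k=10):
    """Request params whose main query is lexical OR K-Nearest Neighbours."""
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        # Two optional (should) clauses: a document matching either one is returned.
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=text_field}" + keywords,
        "vectorQuery": "{!knn f=vector topK=%d}%s" % (top_k, vector_literal),
    }

params = union_retrieval_params("term1", [0.001, -0.422, -0.284], top_k=10)
for name, value in params.items():
    print(name, "=", value)

# Toy simulation of the cardinality bound, with invented document ids.
knn_top_k = {"doc1", "doc2", "doc3"}       # K = 3
lexical_matches = {"doc3", "doc4"}         # numFound = 2
hybrid = knn_top_k | lexical_matches       # set union removes the duplicate doc3
print(sorted(hybrid))
```

In the toy simulation the hybrid set holds 4 documents, below the bound K + numFound = 5, because `doc3` is matched by both models.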
Retrieval Stage
The hybrid candidate result set is the intersection of the results coming from the two models:
● only the top-K results coming from the K-Nearest Neighbours search that satisfy the lexical query
are returned.
● The cardinality of the combined result set is <= K .
● This is effectively post-filtering the K-Nearest Neighbours results, except that the lexical query also affects the score
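A sketch of the intersection variant, again as request parameters built in Python. Field names, keyword and vector are hypothetical. The toy simulation assumes that a boolean query adds up the scores of its mandatory clauses, which is how the lexical match ends up affecting the final score.

```python
def intersection_retrieval_params(keywords, query_vector, top_k=10):
    """Request params whose main query is lexical AND K-Nearest Neighbours."""
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        # Two mandatory (must) clauses: a document has to match both.
        "q": "{!bool must=$lexicalQuery must=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=text_field}" + keywords,
        "vectorQuery": "{!knn f=vector topK=%d}%s" % (top_k, vector_literal),
    }

params = intersection_retrieval_params("term1", [0.001, -0.422, -0.284])
print(params["q"])

# Toy simulation with invented ids and scores.
knn_scores = {"doc1": 0.5, "doc2": 0.75, "doc3": 0.25}   # top-K, K = 3
lexical_scores = {"doc2": 1.25, "doc3": 2.0, "doc9": 4.0}

# Keep only top-K documents that also match lexically; sum the two scores.
hybrid = {doc: knn_scores[doc] + lexical_scores[doc]
          for doc in knn_scores if doc in lexical_scores}
print(hybrid)
```

`doc9` has the best lexical score but is dropped because it is not among the top-K neighbours, which is why the result set can never exceed K documents.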
Bonus Point: PRE-FILTERING VS POST-FILTERING
● Before 9.1 -> FQ were post-filters
● From 9.1 -> FQ are pre-filters (to run a post-filter, see
https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter)
● From 9.6 -> More flexibility with additional !knn query parser params
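A toy illustration of why the distinction matters. This is not how Solr executes the search: a brute-force exact scan stands in for HNSW, and the documents and the `in_stock` flag are invented.

```python
def knn(query, docs, top_k):
    """Exact nearest neighbours by squared Euclidean distance (brute force)."""
    def distance(doc):
        return sum((q - v) ** 2 for q, v in zip(query, doc["vector"]))
    return sorted(docs, key=distance)[:top_k]

def post_filter(query, docs, top_k, predicate):
    """Search first, then drop the neighbours that fail the filter."""
    return [d for d in knn(query, docs, top_k) if predicate(d)]

def pre_filter(query, docs, top_k, predicate):
    """Restrict the candidates first, then search among the survivors."""
    return knn(query, [d for d in docs if predicate(d)], top_k)

docs = [
    {"id": "a", "vector": [0.0, 0.0], "in_stock": False},
    {"id": "b", "vector": [0.1, 0.0], "in_stock": False},
    {"id": "c", "vector": [0.5, 0.5], "in_stock": True},
    {"id": "d", "vector": [0.9, 0.9], "in_stock": True},
]
query = [0.0, 0.0]
in_stock = lambda d: d["in_stock"]

post_ids = [d["id"] for d in post_filter(query, docs, 2, in_stock)]
pre_ids = [d["id"] for d in pre_filter(query, docs, 2, in_stock)]
print("post-filter:", post_ids)
print("pre-filter: ", pre_ids)
```

With post-filtering the two nearest neighbours are both out of stock, so nothing is returned; with pre-filtering the search still fills its top-K from the documents that pass the filter.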
9.6 PRE-FILTERING VS POST-FILTERING
More flexibility with additional !knn query parser params
○ preFilter -> Specifies an explicit list of Pre-Filter query strings to use.
N.B. if you specify this, FQs will all become post-filters
○ includeTags -> Indicates that only fq filters with the specified tag should be
considered for implicit Pre-Filtering. Must not be combined with preFilter.
○ excludeTags -> Indicates that fq filters with the specified tag should be
excluded from consideration for implicit Pre-Filtering. Must not be combined with
preFilter.
DEFAULT
● knn as the main query (q) -> same as from 9.1: every FQ that is not a post-filter becomes a pre-filter automatically
○ includeTags and excludeTags may be used to limit the set of fq filters used in the Pre-Filter.
● knn as an fq param, or as a subquery clause in a larger query -> no implicit Pre-Filter is used.
○ includeTags and excludeTags must not be used in these situations.
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
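The parameters above might be used as in the following sketch, which only builds and prints request parameters. The field names (`vector`, `inStock`, `category`), the tag `for_knn` and the query vector are hypothetical; `knn_query` is a helper written for this example, not a Solr API.

```python
from urllib.parse import urlencode

QUERY_VECTOR = "[1.0, 2.0, 3.0, 4.0]"

def knn_query(field, top_k, vector, **local_params):
    """Build a {!knn} query string; extra local params are appended in order."""
    extras = "".join(" %s=%s" % (name, value) for name, value in local_params.items())
    return "{!knn f=%s topK=%d%s}%s" % (field, top_k, extras, vector)

# Explicit pre-filter: only inStock:true is applied before the KNN search;
# the fq is no longer used as an implicit pre-filter.
explicit = [
    ("q", knn_query("vector", 10, QUERY_VECTOR, preFilter="inStock:true")),
    ("fq", "category:AAA"),
]

# Implicit pre-filter limited by tag: only the fq tagged for_knn is a pre-filter.
tagged = [
    ("q", knn_query("vector", 10, QUERY_VECTOR, includeTags="for_knn")),
    ("fq", "{!tag=for_knn}category:AAA"),
    ("fq", "inStock:true"),
]

print(explicit[0][1])
print(tagged[0][1])
print(urlencode(tagged))  # URL-encoded form, ready to append to a /select URL
```

A list of pairs is used instead of a dictionary because `fq` can appear several times in one request.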
9.6 PRE-FILTERING VS POST-FILTERING
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
Some use cases where includeTags and/or excludeTags may be more useful than
an explicit preFilter parameter:
● You have some fq parameters that are re-used on many requests (even when
you don’t use the knn parser) that you wish to be used as KNN Pre-Filters when
you do use the knn query parser.
● You typically want all fq params to be used as KNN Pre-Filters, but when users
"drill down" on Facets, you want the fq parameters you add to be excluded from
the KNN Pre-Filtering so that the result set gets smaller, instead of just computing
a new topK set.
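The facet drill-down case could look like the sketch below. The tag name `facet`, the field names and the filter values are hypothetical, and `drill_down_request` is a helper invented for this example.

```python
def drill_down_request(vector, base_filters, facet_filters, top_k=10):
    """Base filters stay implicit pre-filters; facet filters are tagged and excluded."""
    params = [("q", "{!knn f=vector topK=%d excludeTags=facet}%s" % (top_k, vector))]
    params += [("fq", f) for f in base_filters]
    # Tagged filters are skipped by the KNN pre-filtering, so they narrow
    # the existing top-K instead of triggering a new top-K computation.
    params += [("fq", "{!tag=facet}" + f) for f in facet_filters]
    return params

request = drill_down_request(
    "[1.0, 2.0, 3.0, 4.0]",
    base_filters=["inStock:true"],
    facet_filters=["brand:acme"],
)
for name, value in request:
    print(name, "=", value)
```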
The Ranking Problem
● We need a score that reflects the best relevance ordering
● KNN candidates present a score [-1 … 1]
● Lexical candidates present an unbounded score (potentially on a completely
different scale)
N.B. combining Lexical scores with vector-based similarity (and potentially other
features) is not a solved problem
What options do we have right now in Apache Solr?
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0 and 1 and it is added to the K-Nearest
Neighbours score.
This simple linear combination of the scores could be a good starting point.
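The three bullets above can be sketched as request parameters plus a toy re-computation of the score in Python. Field names, keyword, vector and all scores are hypothetical. The simulation assumes that a document missing from one model contributes a score of 0 for that model.

```python
# Request params: filter builds the candidate set, must assigns the score.
params = {
    "q": "{!bool filter=$retrievalStage must=$rankingStage}",
    "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",
    "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
    "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
    "lexicalQuery": "{!type=edismax qf=text_field}term1",
    "vectorQuery": "{!knn f=vector topK=10}[0.001, -0.422, -0.284]",
}

def min_max_scale(scores, low, high):
    """Linearly map the scores so the minimum becomes low and the maximum high."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: high for doc in scores}
    return {doc: low + (s - lo) * (high - low) / (hi - lo) for doc, s in scores.items()}

# Invented scores; 0.0 means the document did not match that model.
lexical = {"doc1": 8.0, "doc2": 2.0, "doc3": 4.0, "doc4": 0.0}
knn = {"doc1": 0.25, "doc2": 0.0, "doc3": 0.875, "doc4": 0.5}

scaled = min_max_scale(lexical, 0.0, 1.0)
combined = {doc: scaled[doc] + knn[doc] for doc in lexical}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

In this toy data `doc1` wins on lexical score alone, but `doc3` ranks first once the vector similarity is added.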
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0.1 and 1 and it is multiplied by the K-Nearest
Neighbours score.
● Better? No evidence
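A sketch of the multiplicative variant under the same hypothetical names and invented scores. Only the ranking stage and the scaling range change; a missing score is again assumed to be 0.

```python
# Only these two params differ from the additive version.
params = {
    "rankingStage": "{!func}product(query($normalisedLexicalQuery),query($vectorQuery))",
    "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0.1,1)",
}

def min_max_scale(scores, low, high):
    """Linearly map the scores so the minimum becomes low and the maximum high."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: high for doc in scores}
    return {doc: low + (s - lo) * (high - low) / (hi - lo) for doc, s in scores.items()}

# Invented scores; 0.0 means the document did not match that model.
lexical = {"doc1": 8.0, "doc2": 2.0, "doc3": 4.0, "doc4": 0.0}
knn = {"doc1": 0.25, "doc2": 0.0, "doc3": 0.875, "doc4": 0.5}

# A lower bound of 0.1 keeps the weakest lexical score from zeroing the product.
scaled = min_max_scale(lexical, 0.1, 1.0)
combined = {doc: scaled[doc] * knn[doc] for doc in lexical}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

The 0.1 floor protects `doc4`, which has no lexical match, but nothing protects `doc2`: it is outside the top-K, its vector score is 0, and so its product is 0 whatever its lexical score.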
Learning To Rank
● Multiple factors (features) affect ranking
○ no quick answer to a mathematical function to combine them -> what do you want
to optimise?
○ Sum? Normalised Sum? Multiplication? linear or non-linear function?
○ Rather than manual trial/error let’s use Machine Learning -> LTR
● Apache Solr supports Learning To Rank since 6.4
○ from 9.3 Sease contributed the first support for vector similarity as a feature
○ First step -> sponsor us or contribute improvements!
● Build a training set <query, document> -> rating
○ <query, document> is described by a vector of numerical features, one of them can
be your vector similarity, others may be lexical scores or business rules
https://sease.io/category/learning-to-rank
Features.json
model.json
LTR Query
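The contents of these three slides are not reproduced here. As a hypothetical illustration only, a feature store, a linear model and a re-ranking query that use vector similarity as a feature might look like the sketch below. Feature names, the model name, field names and weights are all invented; in practice the weights come from training, and the exact feature definitions should be checked against the Solr Learning To Rank documentation.

```python
import json

# Hypothetical features: one lexical score and one vector similarity.
# ${keywords} and ${queryVector} are filled from efi.* params at query time.
features = [
    {
        "name": "lexicalScore",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{!edismax qf=text_field}${keywords}"},
    },
    {
        "name": "vectorSimilarity",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{!func}vectorSimilarity(FLOAT32, COSINE, vector, ${queryVector})"},
    },
]

# Hypothetical linear model; the weights are placeholders, not learned values.
model = {
    "class": "org.apache.solr.ltr.model.LinearModel",
    "name": "hybridModel",
    "features": [{"name": f["name"]} for f in features],
    "params": {"weights": {"lexicalScore": 0.3, "vectorSimilarity": 0.7}},
}

# Re-rank the top 100 retrieved documents with the model.
ltr_params = {
    "q": "{!type=edismax qf=text_field}term1",
    "rq": "{!ltr model=hybridModel reRankDocs=100 "
          "efi.keywords='term1' efi.queryVector='[1.0, 2.0, 3.0, 4.0]'}",
    "fl": "id,score,[features]",
}

print(json.dumps(features, indent=2))
print(json.dumps(model, indent=2))
print(ltr_params["rq"])
```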
Future Works
● Reciprocal Rank Fusion (https://github.com/apache/solr/pull/2489)
● Better scaling (min-max on theoretical max)
● And More!
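For reference, Reciprocal Rank Fusion combines ranked lists using only positions, so no score normalisation is needed. The sketch below is a generic Python version of the technique, not the implementation proposed in the pull request; the constant k=60 is a commonly used default and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

lexical_ranking = ["doc1", "doc2", "doc3"]
knn_ranking = ["doc3", "doc1", "doc4"]

fused = reciprocal_rank_fusion([lexical_ranking, knn_ranking])
print(fused)
```

Documents that appear near the top of both lists (`doc1`, `doc3`) rise above documents found by a single model.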
The AI Side Of Apache Lucene/Solr - Training
Additional Resources
● Blog: https://sease.io/2023/12/hybrid-search-with-apache-solr.html
THANK YOU!
@seaseltd @sease-ltd
@seaseltd @sease_ltd
