The document discusses hybrid search techniques using Apache Solr, focusing on integrating traditional keyword-based and vector-based search methods to improve result relevance and diversity. It highlights the limitations of vector-based search, including issues with explainability and relevance scoring, and explains how hybrid ranking can mitigate these problems. Future developments and support for features like learning to rank and reciprocal rank fusion are also mentioned.
Hybrid Search with Apache Solr
Speaker: Alessandro Benedetti, Director @ Sease
COMMUNITY OVER CODE EU 2024 - 03/06/2024
2.
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about Semantic search, NLP and Machine Learning technologies
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
3.
‣ Headquartered in London/distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
Hot Trends :
● Large Language Models Applications
● Vector-based (Neural) Search
● Natural Language Processing
● Learning To Rank
● Document Similarity
● Search Quality Evaluation
● Relevance Tuning
SEArch SErvices
www.sease.io
4.
Limitations of Vector-Based Search
Hybrid Search
Hybrid Retrieval in Apache Solr
Hybrid Ranking in Apache Solr
Future Works
Overview
Distance Between Vectors -> Similarity
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
● Specify the relationship metric between elements in the dataset
● Use-case dependent
○ experiment to find which one works better for you!
● In Information Retrieval, cosine similarity has proved to work quite well (it’s a normalised inner product)
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
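As a minimal sketch of the "normalised inner product" remark above, cosine similarity can be computed as the inner product of two vectors divided by the product of their lengths. The function name and the sample vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b, divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative vectors: same direction, orthogonal, opposite direction.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because only the angle between the vectors matters, scaling either vector by a positive constant leaves the result unchanged.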
9.
Low Explainability
● High Dimensionality - vectors are long sequences of numerical values (768, 1536 …)
● Dimensions - each feature (element in the vector) has no clear semantic meaning in many cases (slightly different from sparse vectors or explicit feature vectors)
● Values - It’s not obvious how a single value impacts relevance (higher is better?)
● Similarity - To explain why a search result is retrieved in the top-K, the vector distance is the only info you have
Research is happening but it’s still an open problem
10.
Lexical matches?
● Search users still expect lexical matches to happen
● You can’t enforce that with Vector-based search
“Why is the document with the keyword in the title not coming up?” cit.
11.
Low Diversity
● By nature, vector-based search just returns the top-K ordered by vector similarity
● Unless you add more logic on top, you would expect low diversity by definition
13.
Hybrid Search
● Mitigation of current vector-search problems - Is it here to stay?
● Combine traditional keyword-based (lexical) search with vector-based (neural)
search
● Retrieval of two sets of candidates:
○ one set of results coming from lexical matches with the query keywords
○ a set of results coming from the K-Nearest Neighbours search with the query
vector
● Ranking of the candidates
15.
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA ISSUES
https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search
Apache Solr implementation
16.
Retrieval Stage
The hybrid candidate result set is the union of the results coming from the two models:
● the top-K results coming from the K-Nearest Neighbours search and the <numFound>
results coming from the lexical (keyword-based) search.
● The cardinality of the combined result set is <= (K + numFound).
● The result set doesn’t include any duplicates.
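One way to express this union in Solr is a Boolean query with two optional (should) clauses, one lexical and one KNN. The sketch below only builds the request parameters in Python. It assumes Solr's bool, edismax and knn query parsers, and the field names text_field and vector, the helper name and the sample values are hypothetical.

```python
def hybrid_union_params(keywords, query_vector, top_k=10,
                        text_field="text_field", vector_field="vector"):
    """Build Solr request parameters for a union (should/should) hybrid query.

    text_field and vector_field are hypothetical field names.
    """
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        # A document matching either clause becomes a candidate (union).
        "q": "{!bool should=$lexicalQuery should=$vectorQuery}",
        "lexicalQuery": "{!type=edismax qf=%s}%s" % (text_field, keywords),
        "vectorQuery": "{!knn f=%s topK=%d}%s" % (vector_field, top_k, vector_literal),
    }


params = hybrid_union_params("term1", [0.1, -0.4, 0.3, 0.9])
for name, value in params.items():
    print(name, "=", value)
```

A document matched by both clauses appears once in the result set, which is why the combined cardinality can be lower than K + numFound.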
17.
Retrieval Stage
The hybrid candidate result set is the intersection of the results coming from the two models:
● only the top-K results coming from the K-Nearest Neighbours search that satisfy the lexical query are returned.
● The cardinality of the combined result set is <= K.
● This is effectively post-filtering the K-Nearest Neighbours results, while also affecting the score
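The intersection semantics can be illustrated with a small in-memory model, independent of Solr. Document ids and scores are made up, and summing the two scores is an assumption that mirrors how a Boolean query adds up the scores of its matching scoring clauses.

```python
def hybrid_intersection(knn_top_k, lexical_matches):
    """Keep only the KNN candidates that also satisfy the lexical query.

    knn_top_k: list of (doc_id, knn_score), already limited to the top K.
    lexical_matches: dict of doc_id -> lexical score for the lexical matches.
    The combined score is the sum of both scores (an assumption).
    """
    combined = [
        (doc_id, knn_score + lexical_matches[doc_id])
        for doc_id, knn_score in knn_top_k
        if doc_id in lexical_matches
    ]
    return sorted(combined, key=lambda pair: pair[1], reverse=True)


knn_top_k = [("d1", 0.92), ("d2", 0.85), ("d3", 0.80)]  # K = 3
lexical_matches = {"d2": 4.1, "d3": 7.5, "d9": 2.0}     # d9 is outside the top K
results = hybrid_intersection(knn_top_k, lexical_matches)
print(results)
```

Here d1 is dropped because it has no lexical match, and d9 never appears because it was not among the K nearest neighbours, so the result can never exceed K documents.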
18.
Bonus Point: PRE-FILTERING VS POST-FILTERING
● < 9.1 -> FQs were post-filters
● >= 9.1 -> FQs are pre-filters (to run a post-filter, see
https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#cache-local-parameter)
● >= 9.6 -> More flexibility with additional !knn query parser params
19.
9.6 PRE-FILTERING VS POST-FILTERING
More flexibility with additional !knn query parser params
○ preFilter -> Specifies an explicit list of Pre-Filter query strings to use.
N.B. if you specify this, FQs will all become post-filters
○ includeTags -> Indicates that only fq filters with the specified tag should be
considered for implicit Pre-Filtering. Must not be combined with preFilter.
○ excludeTags -> Indicates that fq filters with the specified tag should be
excluded from consideration for implicit Pre-Filtering. Must not be combined with
preFilter.
DEFAULT
Main Q -> same as >= 9.1: FQs that are not post-filters become pre-filters automatically
○ includeTags and excludeTags may be used to limit the set of fq filters used in the Pre-Filter.
As an fq param, or as a subquery clause in a larger query: No implicit Pre-Filter is used.
○ includeTags and excludeTags must not be used in these situations.
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
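The three behaviours described above (implicit pre-filtering, an explicit preFilter, and includeTags) could be expressed as request parameters along these lines. This is a sketch that only builds the parameter lists. The helper name, the field names (vector, category, inStock) and the tag name for_knn are hypothetical.

```python
def knn_query(query_vector, vector_field="vector", top_k=10, **local_params):
    """Build a {!knn} query string; extra keyword arguments become local params.

    Values containing spaces would need quoting, which this sketch does not handle.
    """
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    extras = "".join(" %s=%s" % (name, value) for name, value in local_params.items())
    return "{!knn f=%s topK=%d%s}%s" % (vector_field, top_k, extras, vector_literal)


vector = [1.0, 2.0, 3.0, 4.0]

# Default: every fq that is not a post-filter is used as an implicit pre-filter.
implicit = [("q", knn_query(vector)), ("fq", "category:AAA"), ("fq", "inStock:true")]

# Explicit preFilter: only inStock:true pre-filters the KNN search,
# and the remaining fq is applied as a post-filter.
explicit = [("q", knn_query(vector, preFilter="inStock:true")), ("fq", "category:AAA")]

# includeTags: only the fq tagged for_knn is considered for implicit pre-filtering.
tagged = [
    ("q", knn_query(vector, includeTags="for_knn")),
    ("fq", "{!tag=for_knn}category:AAA"),
    ("fq", "inStock:true"),
]

for request in (implicit, explicit, tagged):
    print(request)
```

Lists of pairs are used instead of dictionaries because the fq parameter can be repeated in a single request.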
20.
9.6 PRE-FILTERING VS POST-FILTERING
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html#explicit-knn-pre-filtering
Some use cases where includeTags and/or excludeTags may be more useful than an explicit preFilter parameter:
● You have some fq parameters that are re-used on many requests (even when you don’t use the knn parser) that you wish to be used as KNN Pre-Filters when you do use the knn query parser.
● You typically want all fq params to be used as KNN Pre-Filters, but when users "drill down" on Facets, you want the fq parameters you add to be excluded from the KNN Pre-Filtering so that the result set gets smaller, instead of just computing a new topK set.
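The facet drill-down case could look like the sketch below: base filters stay untagged so they remain implicit pre-filters, while drill-down filters carry a tag that the knn parser is told to exclude. The helper name, the tag name drilldown and the fields (vector, inStock, brand) are hypothetical.

```python
from urllib.parse import urlencode


def drill_down_params(query_vector, base_filters, drill_down_filters,
                      vector_field="vector", top_k=10, tag="drilldown"):
    """Build request parameters where drill-down filters narrow the top-K set.

    Untagged base filters remain implicit KNN pre-filters; filters tagged
    with `tag` are excluded from pre-filtering via excludeTags.
    """
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    params = [("q", "{!knn f=%s topK=%d excludeTags=%s}%s"
               % (vector_field, top_k, tag, vector_literal))]
    params += [("fq", f) for f in base_filters]
    params += [("fq", "{!tag=%s}%s" % (tag, f)) for f in drill_down_filters]
    return params


params = drill_down_params([1.0, 2.0, 3.0, 4.0],
                           base_filters=["inStock:true"],
                           drill_down_filters=["brand:acme"])
print(urlencode(params))  # URL-encoded query string for the request
```

With this shape, clicking a facet value shrinks the already computed top-K set rather than triggering a fresh search for K new neighbours inside the facet.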
22.
The Ranking Problem
● We need a score that reflects the best relevance ordering
● KNN candidates present a score [-1 … 1]
● Lexical candidates present an unbounded score (potentially on a completely different scale)
N.B. combining Lexical scores with vector-based similarity (and potentially other
features) is not a solved problem
What options do we have right now in Apache Solr?
23.
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0 and 1, and it is added to the K-Nearest Neighbours score.
This simple linear combination of the scores could be a good starting point.
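The filter/must structure described above might translate into request parameters like the following. It is a sketch that assumes Solr's bool query parser together with the scale, sum and query function queries. The parameter names, the field names (text_field, vector) and the helper name are hypothetical.

```python
def hybrid_sum_ranking_params(keywords, query_vector, top_k=10,
                              text_field="text_field", vector_field="vector"):
    """Build request parameters: the filter clause retrieves, the must clause scores."""
    vector_literal = "[" + ", ".join(str(v) for v in query_vector) + "]"
    return {
        # filter builds the candidate set without scoring; must assigns the score.
        "q": "{!bool filter=$retrievalStage must=$rankingStage}",
        "retrievalStage": "{!bool should=$lexicalQuery should=$vectorQuery}",
        # Score = min-max normalised lexical score + KNN score.
        "rankingStage": "{!func}sum(query($normalisedLexicalQuery),query($vectorQuery))",
        "normalisedLexicalQuery": "{!func}scale(query($lexicalQuery),0,1)",
        "lexicalQuery": "{!type=edismax qf=%s}%s" % (text_field, keywords),
        "vectorQuery": "{!knn f=%s topK=%d}%s" % (vector_field, top_k, vector_literal),
    }


params = hybrid_sum_ranking_params("term1", [0.1, -0.4, 0.3, 0.9])
for name, value in params.items():
    print(name, "=", value)
```

Keeping retrieval and ranking in separate clauses means the combination function can be changed without touching how candidates are collected.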
24.
Ranking Stage
● The filter component ignores any scoring and just builds the hybrid result set.
● The must clause is responsible for assigning the score, using the appropriate function query.
● The lexical score is min-max normalised to be scaled between 0.1 and 1, and it is multiplied by the K-Nearest Neighbours score.
● Better? No evidence
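The arithmetic of the multiplicative variant can be tried on a few made-up scores. The sketch below is plain Python rather than Solr, and the document ids and score values are invented. With a product, a lower bound of 0.1 instead of 0 keeps the document with the lowest lexical score from being forced to a combined score of zero.

```python
def min_max_scale(values, low, high):
    """Linearly rescale values so that the minimum maps to low and the maximum to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Degenerate case: mapping everything to high is an arbitrary choice.
        return [high for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


# Invented scores for three candidates present in both result sets.
lexical = {"d1": 12.0, "d2": 3.0, "d3": 7.5}
knn = {"d1": 0.60, "d2": 0.90, "d3": 0.80}

doc_ids = list(lexical)
scaled = dict(zip(doc_ids, min_max_scale([lexical[d] for d in doc_ids], 0.1, 1.0)))
combined = {d: scaled[d] * knn[d] for d in doc_ids}
ranking = sorted(doc_ids, key=lambda d: combined[d], reverse=True)
print(ranking)
```

In this toy example d2 has the best KNN score but the weakest lexical score, so the product pushes it to the bottom; a sum would penalise it far less.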
25.
Learning To Rank
● Multiple factors (features) affect ranking
○ there is no quick answer on which mathematical function should combine them -> what do you want to optimise?
○ Sum? Normalised sum? Multiplication? Linear or non-linear function?
○ Rather than manual trial and error, let’s use Machine Learning -> LTR
● Apache Solr has supported Learning To Rank since 6.4
○ in 9.3 Sease contributed the first support for vector similarity as a feature
○ First step -> sponsor us or contribute improvements!
● Build a training set <query, document> -> rating
○ each <query, document> pair is described by a vector of numerical features: one of them can be your vector similarity, others may be lexical scores or business rules
https://sease.io/category/learning-to-rank
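A training set of this shape can be sketched in a few lines. The feature names, ids, values and ratings below are invented for illustration, and the output line follows the SVMlight/LETOR-style text layout commonly used for Learning To Rank training data; it is not taken from the talk.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical feature order: one vector similarity, one lexical score, one business rule.
FEATURE_NAMES = ["vector_similarity", "lexical_title_score", "is_promoted"]


@dataclass
class TrainingSample:
    query_id: int
    document_id: str
    features: List[float]  # ordered as FEATURE_NAMES
    rating: int            # relevance label, e.g. 0 (irrelevant) to 4 (perfect)


def to_letor_line(sample):
    """Format a sample as '<rating> qid:<id> 1:<f1> 2:<f2> ... # <doc id>'."""
    feature_part = " ".join(
        "%d:%s" % (index, value) for index, value in enumerate(sample.features, start=1)
    )
    return "%d qid:%d %s # %s" % (
        sample.rating, sample.query_id, feature_part, sample.document_id
    )


samples = [
    TrainingSample(1, "d1", [0.91, 12.3, 0.0], 4),
    TrainingSample(1, "d2", [0.42, 0.0, 1.0], 1),
    TrainingSample(2, "d7", [0.77, 5.6, 0.0], 3),
]
lines = [to_letor_line(s) for s in samples]
print("\n".join(lines))
```

The model trained on such rows learns how to weigh vector similarity against the other features, replacing the hand-picked sum or product.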