Dense Retrieval with Apache Solr Neural Search


Alessandro Benedetti, CEO
29th June 2022
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch expert
‣ Passionate about semantic, NLP and machine learning technologies
‣ Beach Volleyball player and Snowboarder
Who We Are
Alessandro Benedetti
‣ Headquartered in London / distributed team
‣ Open Source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
SEArch SErvices
www.sease.io
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
How many people live in Rome?
Document A: "Rome's population is 4.3 million"
Document B: "Hundreds of people queuing for live music in Rome"
Vocabulary mismatch problem: False Positive
(Document B shares the query terms "people", "live" and "Rome", but does not answer the question)
How big is a tiger?
Document A: "The tiger is the biggest member of the Felidae family"
Document B: "Panthera tigris can reach 390cm nose to tail"
Vocabulary mismatch problem: False Negative
(Document B answers the question, but shares almost no terms with it)
Semantic Similarity
https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Vector representation for query/documents

Sparse
● e.g. Bag-of-words approach
● each term in the corpus dictionary -> one vector dimension
● number of dimensions -> term dictionary cardinality
● the vector for any given document contains mostly zeroes
D = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

Dense
● fixed number of dimensions
● normally much lower than the term dictionary cardinality
● the vector for any given document contains mostly non-zeroes
D = [0.7, 0.9, 0.4, 0.6, 1, 0.4, 0.7, 0.8, 1, 0.9]

How can you generate such dense vectors?
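To make the contrast concrete, here is a minimal sketch in plain Python (toy corpus and made-up dense values, not the output of any real model):

# Toy illustration of sparse (bag-of-words) vs dense vectors.
corpus = ["the tiger is big", "rome has many people", "people queue in rome"]

# Sparse: one dimension per term in the corpus dictionary.
dictionary = sorted({term for doc in corpus for term in doc.split()})

def sparse_vector(text):
    terms = text.split()
    # mostly zeroes: only the dimensions of the terms present in the text are set
    return [terms.count(term) for term in dictionary]

print(len(dictionary))                       # number of dimensions == dictionary cardinality
print(sparse_vector("people live in rome"))  # mostly zeroes

# Dense: a fixed, small number of dimensions, mostly non-zero values
# (in practice produced by an encoder such as BERT, covered later).
dense_vector = [0.7, 0.9, 0.4, 0.6, 1.0, 0.4, 0.7, 0.8, 1.0, 0.9]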
Neural Search Workflow
Training: Labeled Samples
Indexing: Text to Vectors
Searching: Query to Vector -> Lookup in Index
Similarity between a Query and a Document is translated to distance in a vector space
Distance Between Vectors
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
● Specify the relationship metric between elements in the dataset
● use-case dependent
○ experiment to find which one works better for you!
● In Information Retrieval, Cosine similarity has proven to work quite well (it is a normalised inner product)
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
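As a quick reference, a small sketch of these metrics in plain Python (illustration only, not Lucene's implementation); it also shows that cosine similarity is just the inner product of the normalised vectors:

import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # the inner product of the two vectors divided by the product of their norms
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

q = [0.4, 0.5, 0.3, 0.6]
d = [0.5, 0.4, 0.2, 0.7]
print(euclidean_distance(q, d), dot_product(q, d), cosine_similarity(q, d))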
Nearest Neighbour Retrieval (KNN)
The relevance score is calculated with the vector similarity distance metric.
Closer vectors mean higher semantic similarity.
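Exact (brute-force) KNN is easy to write down, which also shows why it is expensive: every document vector is scored against the query. A minimal sketch reusing the cosine_similarity helper from the previous snippet:

def knn(query_vector, document_vectors, k, similarity):
    # score every document one by one, then keep the k most similar: O(N) per query
    scored = [(doc_id, similarity(query_vector, vector))
              for doc_id, vector in document_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

documents = {
    "1": [1.0, 2.5, 3.7, 4.1],
    "2": [1.5, 5.5, 6.7, 65.1],
}
print(knn([1.0, 2.0, 3.0, 4.0], documents, k=1, similarity=cosine_similarity))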
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (one-by-one vector distance computation)
● it's fine to lose some accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are quantized (compressed) and then modelled as:
○ Trees - partitioning of the vector space (k-d tree)
https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search
○ Hashes - reducing high dimensionality while preserving differences and grouping similar objects
https://towardsdatascience.com/locality-sensitive-hashing-for-music-search-f2f1940ace23
○ Graphs - HNSW
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW)
graphs are among the top-performing
index-time data structures for approximate
nearest neighbor search (ANN).
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors, closer vertices are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum); see the sketch below
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
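The sketch below illustrates only the greedy walk on a single layer (the "closest friend" search): a toy adjacency list and a walk that stops at a local minimum. It is an illustration of the idea, not Lucene's HNSW code; the distance function could be the euclidean_distance helper shown earlier.

def greedy_search(graph, vectors, entry_point, query, distance):
    # Walk the proximity graph: move to the neighbour closest to the query
    # until no neighbour improves the distance (a local minimum).
    current = entry_point
    best = distance(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbour in graph[current]:
            d = distance(vectors[neighbour], query)
            if d < best:
                best, current, improved = d, neighbour, True
    return current, best

# In the hierarchical case, the result of each layer becomes the entry point
# of the layer below, which uses shorter edges and refines the minimum.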
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
JIRA ISSUES
https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
Apache Lucene implementation
org.apache.lucene.index.VectorSimilarityFunction

/** Euclidean distance */
EUCLIDEAN

/**
 * Dot product. NOTE: this similarity is intended as an optimized way to perform cosine
 * similarity. In order to use it, all vectors must be of unit length, including both document and
 * query vectors. Using dot product with vectors that are not unit length can result in errors or
 * poor search results.
 */
DOT_PRODUCT

/**
 * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
 * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You
 * should only use this function if you need to preserve the original vectors and cannot normalize
 * them in advance.
 */
COSINE
Apache Lucene - indexing
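Because DOT_PRODUCT assumes unit-length vectors, embeddings are typically normalised before they are indexed and before they are used as query vectors. A minimal sketch of the normalisation step (plain Python, illustration only):

import math

def normalise(vector):
    # scale the vector to unit length so that dot product equals cosine similarity
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

document_vector = normalise([1.0, 2.5, 3.7, 4.1])
query_vector = normalise([1.0, 2.0, 3.0, 4.0])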
org.apache.lucene.document.KnnVectorField

private static Field randomKnnVectorField(Random random, String fieldName) {
  VectorSimilarityFunction similarityFunction =
      RandomPicks.randomFrom(random, VectorSimilarityFunction.values());
  float[] values = new float[randomIntBetween(1, 10)];
  for (int i = 0; i < values.length; i++) {
    values[i] = randomFloat();
  }
  return new KnnVectorField(fieldName, values, similarityFunction);
}

Document doc = new Document();
doc.add(
    new KnnVectorField(
        "field", new float[] {j, j}, VectorSimilarityFunction.EUCLIDEAN));
Apache Lucene - indexing
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsWriter

Lucene91HnswVectorsReader.OffHeapVectorValues offHeapVectors =
    new Lucene91HnswVectorsReader.OffHeapVectorValues(
        vectors.dimension(), docsWithField.cardinality(), null, vectorDataInput);
OnHeapHnswGraph graph =
    offHeapVectors.size() == 0
        ? null
        : writeGraph(offHeapVectors, fieldInfo.getVectorSimilarityFunction());

org.apache.lucene.util.hnsw.HnswGraphBuilder

public Lucene91Codec(Mode mode) {
  super("Lucene91");
  this.storedFieldsFormat =
      new Lucene90StoredFieldsFormat(Objects.requireNonNull(mode).storedMode);
  this.defaultPostingsFormat = new Lucene90PostingsFormat();
  this.defaultDVFormat = new Lucene90DocValuesFormat();
  this.defaultKnnVectorsFormat = new Lucene91HnswVectorsFormat();
}
Apache Lucene - indexing
org.apache.lucene.search.KnnVectorQuery

/**
 * Find the <code>k</code> nearest documents to the <code>target</code> vector according to the
 * vectors in the given field.
 *
 * @param field a field that has been indexed as a {@link KnnVectorField}.
 * @param target the target of the search
 * @param k the number of documents to find
 * @param filter a filter applied before the vector search
 * @throws IllegalArgumentException if <code>k</code> is less than 1
 */
public KnnVectorQuery(String field, float[] target, int k, Query filter) {
  this.field = field;
  this.target = target;
  this.k = k;
  if (k < 1) {
    throw new IllegalArgumentException("k must be at least 1, got: " + k);
  }
  this.filter = filter;
}
Apache Lucene - searching
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA ISSUES
https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search
Apache Solr implementation
Apache Solr 9.0 - Schema
DenseVectorField
The dense vector field gives the possibility of indexing and searching dense vectors of float elements.
For example:
[1.0, 2.5, 3.7, 4.1]
Here’s how DenseVectorField should be configured in the schema:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
vectorDimension
The dimension of the dense vector to pass in.
Accepted values: any integer <= 1024.

similarityFunction
Vector similarity function; used in search to return the top K most similar vectors to a target vector.
Accepted values: euclidean, dot_product or cosine.
● euclidean: Euclidean distance
● dot_product: Dot product
● cosine: Cosine similarity
solrconfig.xml:
<config>
  <codecFactory class="solr.SchemaCodecFactory"/>
</config>

schema.xml:
<fieldType name="knn_vector"
           class="solr.DenseVectorField"
           vectorDimension="4"
           similarityFunction="cosine"
           codecFormat="Lucene90HnswVectorsFormat"
           hnswMaxConnections="10"
           hnswBeamWidth="40"/>
hnswMaxConnections
Optional. Default: 16
(advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format:
it controls how many of the nearest neighbor candidates are connected to the new node.
It has the same meaning as M from the 2018 paper.
Accepted values: any integer.

hnswBeamWidth
Optional. Default: 100
(advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format:
it is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node.
It has the same meaning as efConstruction from the 2018 paper.
Accepted values: any integer.
Apache Solr 9.0 - Schema
Apache Solr 9.0 - Indexing
JSON
[
  { "id": "1", "vector": [1.0, 2.5, 3.7, 4.1] },
  { "id": "2", "vector": [1.5, 5.5, 6.7, 65.1] }
]
XML
<add>
<doc>
<field name="id">1</field>
<field name="vector">1.0</field>
<field name="vector">2.5</field>
<field name="vector">3.7</field>
<field name="vector">4.1</field>
</doc>
<doc>
<field name="id">2</field>
<field name="vector">1.5</field>
<field name="vector">5.5</field>
<field name="vector">6.7</field>
<field name="vector">65.1</field>
</doc>
</add>
SolrJ
final SolrClient client = getSolrClient();

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));

final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

client.add(Arrays.asList(d1, d2));
N.B. from an indexing and storing perspective, a dense vector field is no different from an array of float elements.
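Besides SolrJ, the same JSON documents can be pushed to the update endpoint with any HTTP client. A minimal sketch with Python and the requests library; the host and collection name ("dense") are placeholders:

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/dense/update?commit=true"

docs = [
    {"id": "1", "vector": [1.0, 2.5, 3.7, 4.1]},
    {"id": "2", "vector": [1.5, 5.5, 6.7, 65.1]},
]

# POST the JSON array of documents to the update handler
response = requests.post(SOLR_UPDATE_URL, json=docs)
response.raise_for_status()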
Apache Solr 9.0 - Searching
knn Query Parser
The knn (k-nearest neighbors) query parser finds the k nearest documents to the target vector, according to the indexed dense vectors in the given field.

f
The DenseVectorField to search in.
Required. Default: none

topK
How many k-nearest results to return.
Optional. Default: 10

Here's how to run a KNN search:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
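The same query issued from Python with requests (host and collection name are placeholders; in a real setup the query vector would come from the same encoder used at indexing time):

import requests

SOLR_SELECT_URL = "http://localhost:8983/solr/dense/select"

query_vector = [1.0, 2.0, 3.0, 4.0]
params = {
    "q": "{!knn f=vector topK=10}" + str(query_vector),
    "fl": "id,score",
}
results = requests.get(SOLR_SELECT_URL, params=params).json()
for doc in results["response"]["docs"]:
    print(doc["id"], doc["score"])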
Apache Solr 9.0 - Searching with Filter Queries (fq)
When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr:
the ranked list of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.
e.g.
Ranked list from q = [ID1, ID4, ID2, ID10] <intersects> set from fq = {ID3, ID2, ID9, ID4} = [ID4, ID2]
Usage with Filter Queries
The knn query parser can be used in filter queries:
&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The knn query parser can be used with filter queries:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)
N.B. this behaves as 'post filtering'; 'pre filtering' is coming in a future release
Apache Solr 9.0 - Reranking
When using knn in re-ranking, pay attention to the topK parameter.
The second-pass score (deriving from knn) is calculated only if document d from the first pass is within the k nearest neighbors (in the whole index) of the target vector to search.
This means the second-pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results will have the first-pass score (main query q) added to the second-pass score (the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight).
Details about using the ReRank Query Parser can be found in the Query Re-Ranking section.
Usage as Re-Ranking Query
The knn query parser can be used to rerank first pass query results:
&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
N.B. pure re-scoring is coming in a later release
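To make the score combination concrete, a worked example of the arithmetic described above (hypothetical numbers, not Solr code):

# A document matched by the first-pass query q and also within the topK
# nearest neighbours of the target vector:
first_pass_score = 2.0        # score from the main query q
knn_score = 0.85              # second-pass score from {!knn ...}
re_rank_weight = 1.0          # reRankWeight

final_score = first_pass_score + re_rank_weight * knn_score
print(final_score)            # 2.85

# A first-pass document that is NOT within the topK nearest neighbours
# keeps only its first-pass score.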
Apache Solr 9.0 - Hybrid Dense/Sparse search
…/solr/dense/select?indent=true&q={!bool should=$clause1 should=$clause2}
&clause1={!type=field f=id v='901'}
&clause2={!knn f=vector topK=10}[0.4,0.5,0.3,0.6,0.8]
&fl=id,score,vector
"debug":{
"rawquerystring":"{!bool should=$clause1 should=$clause2}",
"querystring":"{!bool should=$clause1 should=$clause2}",
"parsedquery":"id:901
KnnVectorQuery(KnnVectorQuery:vector[0.4,...][10])",
"parsedquery_toString":"id:901 KnnVectorQuery:vector[0.4,...][10]",
Apache Solr 9.0 - An Initial Benchmark
N.B. this is a quick benchmark; it doesn't necessarily reflect bigger volumes linearly.
[Chart: query time comparison between KNN and lexical text queries]
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
● Large language models need to be trained on a lot of data to work well
● … which is normally difficult for the average enterprise/project
● Transformers are pre-trained in an unsupervised way on large corpora (Wikipedia, web …)
● Pre-training captures meaning, topics and word patterns in the language (rather than the specific domain)
● Shares similarities with word2vec (vectors per word) -> the main difference is encoding words/sentences in context
● Pre-trained models can be fine-tuned for specific tasks (dense retrieval, essay generation, translation, text summarization…)
How to produce vectors?
Masked Language Modeling
Models are pre-trained using the masked language modeling technique:
take a text, mask random words and try to predict the masked words.
This technique makes it possible to capture interactions between words within their context.
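Masked-word prediction can be tried directly with the Hugging Face transformers fill-mask pipeline; a small sketch with the standard pre-trained (not fine-tuned) BERT base checkpoint:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# the model predicts the [MASK] token from its surrounding context
for prediction in unmasker("The tiger is the biggest member of the [MASK] family."):
    print(prediction["token_str"], prediction["score"])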
Bidirectional Encoder Representations from Transformers
Large model (many dimensions and layers – base: 12 layers and
768 dim.)
Special tokens:
[CLS] Classification token, used as pooling operator to get a
single vector per sequence
[MASK] Used in the masked language model, to predict this
word
[SEP] Used to separate sentences
Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
BERT to the rescue!
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
Training / Fine-Tuning
Pre-training: self-supervised on ∞ training data.
Fine-tuning: supervised on few labeled examples.
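A minimal fine-tuning sketch with the sentence-transformers training API (the query/passage pairs below are made-up placeholders; a real setup would use a labeled dataset such as MS MARCO):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")

# a few supervised (query, relevant passage) pairs -- placeholders
train_examples = [
    InputExample(texts=["how many people live in rome",
                        "Rome's population is 4.3 million"], label=1.0),
    InputExample(texts=["how big is a tiger",
                        "Panthera tigris can reach 390cm nose to tail"], label=1.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)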
From Text to Vectors

from sentence_transformers import SentenceTransformer
import torch
import sys

BATCH_SIZE = 50
MODEL_NAME = 'msmarco-distilbert-base-dot-prod-v3'

model = SentenceTransformer(MODEL_NAME)
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))

def main():
    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    create_embeddings(input_filename, output_filename)

https://pytorch.org
https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
https://www.sbert.net
https://www.sbert.net/docs/pretrained_models.html
From Text to Vectors

def create_embeddings(input_filename, output_filename):
    with open(input_filename, 'r', encoding="utf-8") as f:
        with open(output_filename, 'w') as out:
            processed = 0
            sentences = []
            for line in f:
                sentences.append(line)
                if len(sentences) == BATCH_SIZE:
                    processed += 1
                    if processed % 1000 == 0:
                        print("processed {} documents".format(processed))
                    vectors = process(sentences)
                    for v in vectors:
                        out.write(','.join([str(i) for i in v]))
                        out.write('\n')
                    sentences = []

Each line in the input file is a sentence.
In this example we consider each sentence a separate document to be indexed in Apache Solr.
From Text to Vectors

def process(sentences):
    embeddings = model.encode(sentences, show_progress_bar=True)
    return embeddings

embeddings is an array of vectors; each vector represents a sentence.
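Tying the pieces together, a sketch of how the produced embeddings could be paired with their sentences and sent to the Solr collection shown earlier (hypothetical helper; host, collection name and the extra stored "text" field are assumptions, only "id" and "vector" appear in the earlier schema):

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/dense/update?commit=true"

def index_sentences(sentences):
    # encode the batch and pair each sentence with its vector
    vectors = process(sentences)
    docs = [
        {"id": str(i), "text": sentence, "vector": [float(x) for x in vector]}
        for i, (sentence, vector) in enumerate(zip(sentences, vectors))
    ]
    requests.post(SOLR_UPDATE_URL, json=docs).raise_for_status()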
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Future Works: [Solr] Codec Agnostic
May 2022 - Apache Lucene 9.1 in Apache Solr
https://issues.apache.org/jira/browse/SOLR-16204
https://issues.apache.org/jira/browse/SOLR-16245
Currently it is possible to configure the codec format to use for the HNSW implementation in Lucene.
The proposal is to change this so that the user only configures the algorithm (HNSW now and default; maybe IVFFlat in the future? https://issues.apache.org/jira/browse/LUCENE-9136)

<fieldType name="knn_vector"
           class="solr.DenseVectorField"
           vectorDimension="4"
           similarityFunction="cosine"
           codecFormat="Lucene90HnswVectorsFormat"
           hnswMaxConnections="10"
           hnswBeamWidth="40"/>
Future Works: [Solr] Pre-filtering
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
https://issues.apache.org/jira/browse/SOLR-16246
This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.
Future Works: [Lucene] VectorSimilarityFunction simplification
https://issues.apache.org/jira/browse/LUCENE-10593
The proposal in this Pull Request aims to:
1) make the Euclidean similarity just return the score, in line with the other similarities, using the formula currently in place
2) simplify the code, removing the bound checker that is no longer necessary
3) refactor here and there to be in line with the simplification
4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or a MAX_HEAP, so debugging is much easier and understanding the HNSW code is much more intuitive
Future Works: [Solr] BERT Integration
We are planning to contribute:
1) an update request processor that takes a BERT model as input and performs the inference (vectorization) at indexing time
2) a query parser that takes a BERT model as input and performs the inference (vectorization) at query time
Apache Solr 9.0 - Additional Resources
● Blog: https://sease.io/2022/01/apache-solr-neural-search.html
● Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html
● Blog: https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
● Blog: https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
● Blog: https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html
● Blog: https://sease.io/2022/04/have-neural-networks-killed-the-inverted-index.html
Thanks!
Special thanks to:
● Apache Lucene community for all the HNSW goodies
● Elia Porciani - active contributor of code for Apache Solr Neural Search components
● Christine Poerschke for the accurate review
● Cassandra Targett for all the documentation corrections
● Michael Gibney for the discussion on dense vectors, and everyone involved in the review process!
