Answers to some questions about Natural Language Search, Language Modelling (Google BERT, OpenAI GPT-3), Neural Search and Learning to Rank asked during our London Information Retrieval Meetup (December).
3. General Considerations
Language Models
https://en.wikipedia.org/wiki/BERT_(language_model)
https://en.wikipedia.org/wiki/GPT-3
https://rajpurkar.github.io/SQuAD-explorer/
• Pre-trained on large corpora (expensive)
• Ad-hoc fine-tuning to solve Natural Language Tasks (inexpensive)
• Ability to encode terms and sentences as high-dimensional vectors
e.g.
https://github.com/google-research/bert#pre-trained-models
https://github.com/hanxiao/bert-as-service/
BERT vectors for the sentences ['access the bank', 'walking by the street', 'tigers are big cats']:
[[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174 -0.31721866]
 [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355  -0.11345179]
 [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536  0.07640187]]
4. General Considerations
Language Models in Search
• Indexing Time: encode sentences (or full field contents) and store the vectors
• Search Time: encode the query
• Score each query-document vector pair by computing a vector distance/similarity:
Euclidean distance
Cosine Similarity
…
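The two measures above can be sketched in plain Python (an illustrative sketch with made-up vectors, not tied to any specific search engine):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|) -- higher means more similar
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # L2 distance -- lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.2, 0.1, 0.9]
documents = {"doc1": [0.2, 0.1, 0.8], "doc2": [0.9, 0.6, 0.1]}

# Rank documents by descending cosine similarity to the query vector
ranking = sorted(documents,
                 key=lambda d: cosine_similarity(query, documents[d]),
                 reverse=True)
```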
Limitations
• Rank the entire corpus of documents? Or apply an (Approximate) Nearest Neighbour approach?
• Performance of embedding extraction?
• Unintuitive results -> should be combined with Traditional Information Retrieval
• Explainability
5. Apache Lucene
Ideally you want to avoid scoring all documents of your corpus for your query.
The algorithms for vector retrieval can be roughly classified into four categories:
1. Tree-based algorithms, such as KD-tree;
2. Hashing methods, such as LSH (Locality-Sensitive Hashing);
3. Product quantization based algorithms, such as IVFFlat;
4. Graph-based algorithms, such as HNSW, SSG, NSG.
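Category 2 can be illustrated with a minimal random-hyperplane LSH sketch (a toy illustration of the idea, not the implementation used by any of the libraries below): vectors that land in the same bucket share a bit signature, so only bucket members need exact scoring.

```python
import random

def make_lsh_hasher(dim, n_planes, seed=42):
    # Random-hyperplane LSH: each random plane contributes one bit
    # of the signature (which side of the plane the vector falls on).
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def signature(vector):
        # Bit i is 1 if the vector lies on the positive side of plane i
        return "".join(
            "1" if sum(p * v for p, v in zip(plane, vector)) >= 0 else "0"
            for plane in planes)

    return signature

sig = make_lsh_hasher(dim=4, n_planes=8)
# Identical vectors always collide in the same bucket;
# similar vectors collide with high probability.
a = sig([0.9, 0.1, 0.0, 0.2])
b = sig([0.9, 0.1, 0.0, 0.2])
```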
Specific File Format (Nov 2020)
•https://issues.apache.org/jira/browse/LUCENE-9004
Hierarchical Navigable Small World Graphs - DONE
•https://issues.apache.org/jira/browse/LUCENE-9322
Unified Vector Format - DONE
•https://issues.apache.org/jira/browse/LUCENE-9136
IVFFlat - In Progress
6. Apache Lucene
Follow-ups
- reducing heap usage during graph construction
- adding a Query implementation
- exposing index hyper-parameters
- benchmarks
- testing on public datasets
- implementing a diversity heuristic for neighbour selection during graph construction
- making the graph hierarchical
- exploring more efficient search across multiple per-segment graphs…
Keep an eye on Lucene JIRA!
https://issues.apache.org/jira/browse/LUCENE-9004
7. Apache Solr
Status of Deep Learning Vector Based Search
• Lucene's latest codecs and file format are not used yet
https://issues.apache.org/jira/browse/SOLR-14397 -> develop an official out-of-the-box solution
https://issues.apache.org/jira/browse/SOLR-12890 -> summary
Ready to use Approaches
• Vector Scoring using Streaming Expressions (Point Fields)
• Available Solr Vector Search Plugin - https://github.com/saaay71/solr-vector-scoring (Payloads)
https://medium.com/@dmitry.kan/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
• Available Solr Vector Search Plugin with LSH Hashing (Payloads)
Limitations
• Generally slow solutions
• They reuse existing data structures rather than ad hoc codecs/file formats
• Generally support only one vector per field
8. Apache Solr - Streaming Expressions
Index Time
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
Sample Docs:
curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/food_collection/update?commit=true --data-binary '
[
{"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
{"id": "2", "name_s":"apple juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
…
]
https://www.elastic.co/blog/lucene-points-6.0
org.apache.solr.schema.PointField
Multi Valued Field
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
9. Apache Solr - Streaming Expressions
Streaming Expression:
sort(
select(
search(food_collection,
q="*:*",
fl="id,vector_fs",
sort="id asc",
rows=3),
cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim,
id),
by="sim desc")
Response:
{
"result-set": {
"docs": [
{ "sim": 0.99996111, "id": "1" },
{ "sim": 0.98590279, "id": "10" },
{ "sim": 0.55566643, "id": "2" },
{ "EOF": true, "RESPONSE_TIME": 10 }
]
}
}
https://lucene.apache.org/solr/guide/8_7/vector-math.html
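The expression above can be composed programmatically from a query vector before being posted to Solr's /stream endpoint; a small sketch (the collection and field names are the ones from the example above):

```python
def build_similarity_expr(collection, field, query_vector, rows=3):
    # Compose a Streaming Expression that scores stored vectors
    # against the query vector with cosineSimilarity, then sorts by it.
    vec = ",".join(str(v) for v in query_vector)
    return (
        f'sort(select(search({collection},q="*:*",fl="id,{field}",'
        f'sort="id asc",rows={rows}),'
        f'cosineSimilarity({field}, array({vec})) as sim,id),'
        f'by="sim desc")'
    )

expr = build_similarity_expr("food_collection", "vector_fs",
                             [5.1, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0])
# expr can now be sent as the "expr" parameter of the /stream endpoint
```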
Drawbacks:
1) It doesn't apply to normal search -> you need to use Streaming Expressions
2) Requires traversing all vectors and scoring them
3) No support for multiple vectors per field - SOLR-11077
11. Apache Solr - Solr Vector Search Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0" cosine=false}
N.B. Adding the parameter cosine=false makes the plugin calculate the dot product instead of the cosine similarity
"response":{"numFound":6,"start":0,"maxScore":40.1675,"docs":[
{
"name":["example 3"],
"vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "],
"score":40.1675},
{
"name":["example 0"],
"vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "],
"score":30.180502},
…
]}
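The top score above can be reproduced by hand: with cosine=false the plugin scores by dot product, and the dot product between the query vector and the stored vector of "example 3" matches the returned score.

```python
query = [0.1, 4.75, 0.3, 1.2, 0.7, 4.0]          # vector from the request above
example_3 = [0.06, 4.73, 0.29, 1.27, 0.69, 3.9]  # stored vector of "example 3"

# With cosine=false the plugin scores by dot product
score = sum(q * d for q, d in zip(query, example_3))
# score is approximately 40.1675, matching the response above
```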
Drawbacks:
1) Payloads are used for storing vectors -> slow
2) Requires traversing all vectors and scoring them
3) Support for multiple vectors per field must be investigated
N.B. https://github.com/DmitryKey/solr-vector-scoring is a fork with an Apache Solr 8.6 port
13. Apache Solr - LSH Hashing Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
"response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[
{
"id": "1",
"vector":"1.55,3.53,2.3,0.7,3.44,2.33",
"_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
"_lsh_hash_":["0_8",
"1_35",
"2_7",
…
"49_43"],
"score":36.65736}
]
Drawbacks:
1) Performance must be investigated; binary fields with encoded vectors are used
2) Latest commit: October 2018
14. Elasticsearch
Status of Deep Learning Vector Based Search
• Lucene's latest codecs and file format are not used yet
https://github.com/elastic/elasticsearch/issues/42326 - work in progress to cover Approximate Nearest Neighbour techniques
Ready to use Approaches
• X-Pack enterprise features - https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
• https://github.com/alexklibisz/elastiknn
• https://github.com/opendistro-for-elasticsearch/k-NN
Limitations
• Performance must be investigated ( https://elastiknn.com/performance/ )
• They reuse existing data structures rather than ad hoc codecs/file formats
• Support only one vector per field
15. Elasticsearch - X-Pack
Index Time
PUT my-index-000001
{
"mappings": {
"properties": {
"my_dense_vector": {
"type": "dense_vector",
"dims": 3
},
"status" : {
"type" : "keyword"
}
}
}
}
PUT my-index-000001/_doc/1
{
"my_dense_vector": [0.5, 10, 6],
"status" : "published"
}
PUT my-index-000001/_doc/2
{
"my_dense_vector": [-0.5, 10, 10],
"status" : "published"
}
• N.B. Lucene's latest codecs and file format are not used yet; vectors are stored as binary doc values.
16. Elasticsearch - X-Pack
Query Time
N.B. Various distance functions are supported
Drawbacks:
1) Requires traversing all vectors returned by the initial query and scoring them
2) No support for multiple vectors per field
GET my-index-000001/_search
{
"query": {
"script_score": {
"query" : {
"bool" : {
"filter" : {
"term" : {
"status" : "published"
}
}
}
},
"script": {
"source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
"params": {
"query_vector": [4, 3.4, -0.2]
}
}
}
}
}
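The script above can be replicated offline to see how the two sample documents would rank (a plain-Python sketch of the score formula, not the Elasticsearch implementation):

```python
import math

def cosine_score(query, doc):
    # Mirrors "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0":
    # the +1.0 keeps scores non-negative, as Elasticsearch requires.
    dot = sum(q * d for q, d in zip(query, doc))
    return dot / (math.sqrt(sum(q * q for q in query)) *
                  math.sqrt(sum(d * d for d in doc))) + 1.0

query_vector = [4, 3.4, -0.2]
doc1 = [0.5, 10, 6]    # my_dense_vector of _doc/1
doc2 = [-0.5, 10, 10]  # my_dense_vector of _doc/2

s1, s2 = cosine_score(query_vector, doc1), cosine_score(query_vector, doc2)
# _doc/1 ranks above _doc/2 for this query vector
```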
17. Next Steps
● Keep an eye on our Blog: https://sease.io/blog, as more is coming!
20. Ranklib
Overview
https://sourceforge.net/p/lemur/wiki/RankLib/
RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been
implemented:
• MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6]
• RankNet [1]
• RankBoost [2]
• AdaRank [3]
• Coordinate Ascent [4]
• LambdaMART [5]
• ListNet [7]
• Random Forests [8]
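All of these algorithms consume training data in the LETOR/SVMLight text format, one judged query-document pair per line; a sketch of generating such input with made-up judgments and feature values:

```python
# Each line: <relevance> qid:<query id> <feature index>:<value> ...
# Judgments and feature values below are invented for illustration.
samples = [
    (3, 1, [2.5, 0.0, 7.1]),  # highly relevant document for query 1
    (0, 1, [0.4, 1.1, 0.2]),  # irrelevant document for query 1
    (2, 2, [1.9, 0.3, 4.4]),  # relevant document for query 2
]

lines = [
    f"{rel} qid:{qid} " + " ".join(f"{i}:{v}" for i, v in enumerate(feats, start=1))
    for rel, qid, feats in samples
]
# e.g. "3 qid:1 1:2.5 2:0.0 3:7.1"
```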
21. Ranklib
Our Experience
https://sourceforge.net/p/lemur/wiki/RankLib/
• Multiple learning-to-rank algorithms supported, including LambdaMART
• Relatively easy to use
• Command Line Interface application -> not meant to be integrated with other apps
• Java code, minimal Test Coverage
• Svn (there's an unofficial GitHub port: https://github.com/codelibs/ranklib )
• Small Community
22. XGBoost
Overview
https://github.com/dmlc/xgboost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
• It implements machine learning algorithms under the Gradient Boosting framework.
• XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
• The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
23. XGBoost
Our Experience
• Multiple learning-to-rank algorithms supported, including LambdaMART
• Relatively easy to use
• Library easy to integrate
• C++ core with Python bindings; huge project; test coverage seems fair
• Github (https://github.com/dmlc/xgboost )
• Extremely popular
• Huge Community
24. Learning to Rank Libraries
Limitations:
‣ Developed for small data sets
‣ Limited support for Sparse Features
‣ Require extensive Feature Engineering
‣ Do not support the recent advances in Unbiased Learning-to-rank
The TensorFlow Ranking library addresses these gaps
25. TensorFlow Ranking
Overview
‣ Open source library for solving large-scale ranking problems in a deep learning framework
‣ Developed by Google’s AI department
‣ Fast and easy to use
‣ Flexible and highly configurable
‣ Supports Pointwise, Pairwise, and Listwise losses
‣ Supports popular ranking metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG)
GitHub: https://github.com/tensorflow/ranking
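The two metrics mentioned above can be sketched in a few lines of plain Python (standard textbook formulations, not TF-Ranking's own implementations):

```python
import math

def mrr(relevances):
    # Reciprocal rank of the first relevant result (0 if none is relevant)
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def ndcg(relevances):
    # DCG with the common (2^rel - 1) / log2(rank + 1) gain,
    # normalised by the DCG of the ideal (sorted) ordering.
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(rank + 1)
                   for rank, r in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```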
26. TensorFlow Ranking
Additional components:
‣ Fully integrated with the rest of the TensorFlow ecosystem
‣ Can handle textual features using Text Embeddings
‣ Multi-item (also known as Groupwise) scoring functions
‣ LambdaLoss implementation
‣ Unbiased Learning-to-Rank
TF-Ranking Article: https://arxiv.org/abs/1812.00073
27. XGBoost vs TensorFlow
Main Differences:
XGBoost                 | TensorFlow
Tree-based Ranker       | Neural Ranker
Handles Missing Values  | Handles Missing Values
Runs Efficiently on CPU | Runs Efficiently on CPU
Large Scale Training    | Large Scale Training