Players in Vector Search:
Algorithms, Software and
Use Cases
Dmitry Kan
London IR Meetup
July’22
About me
● PhD in NLP, Co-founder of Muves.io
● Senior Product Manager, Search at TomTom
● Contributor to and user of Quepid – query rating tool
● 16+ years of experience in developing search engines for
start-ups and multinational technology giants
● Host of the Vector Podcast
https://www.youtube.com/c/VectorPodcast
● Blogging about vector search on Medium:
https://dmitry-kan.medium.com/
Outline
● Report on 1st season of Vector Podcast
● Vector Search algorithms
● Vector Search pyramid
● Use cases
● Where things are going
Vector Podcast stats
Launched on October 13, 2021
11 interviews
2 lectures
2 webinars
1 LIVE recording
with 4 guests
To be continued…
Topics covered
● Vector DBs: Weaviate, Qdrant, Milvus,
Vespa, Apache Solr
● Neural search: Jina, ZIR.AI
● Algorithms: HNSW, doc2query,
bi-encoders
● Embedding layers: Mighty
● Sparse search & ML
Keyword search
● Examples: Elasticsearch, OpenSearch, Solr
● Relies on matching of search terms to text
in an inverted index
● Makes it difficult to find items with similar
meaning but containing different keywords
● Not directly suitable for multimodal or
multilingual search
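To make the inverted-index point concrete, here is a minimal Python sketch (not taken from any of the engines named above; the documents and the `build_inverted_index` / `keyword_search` helpers are illustrative assumptions). It shows why a query only matches documents that share its exact terms.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Toy corpus, invented for illustration.
docs = {
    1: "a bear eating a fish by a river",
    2: "heron catches salmon in a stream",
    3: "bear cub by the river",
}
index = build_inverted_index(docs)
print(keyword_search(index, "bear river"))   # {1, 3}
print(keyword_search(index, "fish stream"))  # set()
```

Document 2 is about the same scene as the second query but shares only one of its two keywords, so term matching alone does not retrieve it.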
Vector search
● Utilises neural network models to represent
objects (like text and images) and queries as
high-dimensional vectors
● Ranking based on vector similarity
● Allows finding items with similar meaning or
of different modality
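Ranking by vector similarity can be sketched in a few lines of Python. This is a toy illustration only: the 3-dimensional vectors are made up (real encoders produce hundreds of dimensions), and cosine similarity is assumed as the measure, although engines may also use dot product or Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, items, top_k=3):
    """Score every item against the query and return the top_k (score, name) pairs."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return scored[:top_k]


# Hypothetical embeddings, invented for illustration.
items = {
    "heron eating a fish": [0.9, 0.1, 0.3],
    "city skyline at night": [-0.2, 0.8, 0.1],
    "bear by a river": [0.8, 0.2, 0.4],
}
query = [0.85, 0.15, 0.35]
for score, name in rank_by_similarity(query, items, top_k=2):
    print(f"{score:.3f}  {name}")
```

This exhaustive scan compares the query with every item; the ANN algorithms discussed later exist to avoid that cost at scale.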
EXAMPLE
Query: A bear eating a fish by a river
Result: heron eating a fish
EXAMPLE
Query: A bear eating a fish by a river
Query vector: [0.072893, -0.277076, 0.201384, …]
Result vector: [0.004142, -0.022811, 0.019714 …]
Result:
Muves in the context of Vector Search Ecosystem
Vector Databases: Milvus, Weaviate, Pinecone, GSI, Qdrant, Vespa, Vald, Elastiknn…
Neural frameworks: Haystack, Jina.AI, ZIR.AI, Hebbia.AI, Featureform…
KNN / ANN algorithms: HNSW, PQ, IVF, LSH, Zoom, DiskANN, BuddyPQ …
Encoders: Transformers, Clip, GPT3… + Mighty
Application business logic: neural / BM25,
symbolic filters, ranking
user interface
Vector Search Pyramid
Algorithms: Big players in the game
● Spotify: ANNOY
● Microsoft (Bing team): Zoom, DiskANN, SPTAG
○ Azure Cognitive Search
● Amazon: KNN based on HNSW in OpenSearch
● Google: ScaNN
● Yahoo! Japan: NGT
● Facebook: FAISS, PQ (CPU & GPU)
● Baidu: IPDG (Baidu Cloud)
● Yandex
● NVIDIA
● Intel
ANN algorithms
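As a rough intuition for graph-based ANN methods, the sketch below walks a neighbour graph greedily towards the query. It is a single-layer simplification of the idea that indexes such as HNSW build on, not HNSW itself; the graph, the vectors and the `greedy_graph_search` helper are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_graph_search(vectors, neighbors, query, entry_point):
    """Move to whichever neighbour is closest to the query until none is closer.

    Returns (node_id, distance). The walk stops at a local minimum, so the
    answer is approximate: it may not be the true nearest neighbour.
    """
    current = entry_point
    current_dist = euclidean(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors[current]:
            d = euclidean(vectors[candidate], query)
            if d < best_dist:
                best, best_dist = candidate, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


# Five points on a line, each linked to its immediate neighbours (toy graph).
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0], 4: [4.0, 0.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
node, dist = greedy_graph_search(vectors, neighbors, query=[3.2, 0.0], entry_point=0)
print(node)  # 3
```

Only the nodes along the walk are compared with the query, which is where the speed-up over an exhaustive scan comes from.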
BigANN Competition @ NeurIPS’21
● Max Irwin
● Alex Semenov
● Aarne Talman
● Leo Joffe
● Alex Klibisz
● Dmitry Kan
Billion-Scale ANN Algorithm Challenge
https://bit.ly/3ApqYYQ
BuddyPQ increases recall by 12% over the baseline FAISS
Credit: The Pinecone Vector Database System (Edo Liberty)
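For readers unfamiliar with the recall metric quoted here, a minimal sketch of recall@k follows. The id lists are invented and the `recall_at_k` helper is an assumption for illustration; it is not the evaluation code used in the challenge.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth) if truth else 0.0


exact = [7, 3, 9, 1, 4]    # ground truth from an exhaustive search (made up)
approx = [7, 9, 2, 3, 8]   # ids returned by an ANN index (made up)
print(recall_at_k(approx, exact, k=5))  # 0.6
```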
BuddyPQ: improving over FAISS
BIGANN dataset Kolmogorov-Smirnov dimension test matrix for the first
100000 points. A higher number indicates a less similar distribution
Variance Inflation Factor (Multicollinearity) (~2.25)
Motivation for vector search
● Text search can be solved* with inverted index and BM25 or
TF-IDF ranking
● But what if your data is images, audio, video?
● Can you find images with textual queries?
Motivation for vector search
With inverted index you can represent documents as a sparse vector:
[1, 1, 0, 1, 0, 0, 0, 0, …, 0]
Problem 1: this bag-of-words representation does not take into account semantic context in
which query words appear.
“Capital” as a main city OR as some monetary value?
Problem 2: Does not respect word order:
Visa from Finland to Canada
Visa from Canada to Finland
Credit: https://www.youtube.com/watch?v=qzQPbIcQu9Q
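Problem 2 can be checked directly. In this small Python sketch (the six-term vocabulary and the `bag_of_words` helper are illustrative assumptions), the two visa queries map to exactly the same sparse vector.

```python
def bag_of_words(text, vocabulary):
    """Binary bag-of-words: 1 if the vocabulary term occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


vocabulary = ["canada", "capital", "finland", "from", "to", "visa"]
a = bag_of_words("Visa from Finland to Canada", vocabulary)
b = bag_of_words("Visa from Canada to Finland", vocabulary)
print(a)       # [1, 0, 1, 1, 1, 1]
print(a == b)  # True
```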
Vector search in a nutshell
Vector (= neural) search is
● a way to represent and search your objects
(documents, songs, images) in a geometric space
● the space is usually high-dimensional
● each object is stored as a dense embedding,
i.e. a vector of numbers: [0.9, -0.1, 0.15, …]
Credit: Weaviate V1.0 release - virtual meetup
Credit: https://www.youtube.com/watch?v=qzQPbIcQu9Q
Smaller Vector DB players: 71% are Open Source
Company  | Product                                  | Cloud | Open Source: Y/N    | Algorithms
SeMI     | Weaviate                                 | Y     | Y (Go)              | custom HNSW
Pinecone | Pinecone                                 | Y     | N                   | FAISS + own
GSI      | APU chip for Elasticsearch / OpenSearch  | N     | N                   | Neural hashing / Hamming distance
Qdrant   | Qdrant                                   | N     | Y (Rust)            | HNSW (graph)
Yahoo!   | Vespa                                    | Y     | Y (Java, C++)       | HNSW (graph)
Zilliz   | Milvus                                   | N     | Y (Go, C++, Python) | FAISS, HNSW
Yahoo!   | Vald                                     | N     | Y (Go)              | NGT
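The Hamming-distance entry in the table refers to comparing compact binary codes rather than float vectors. A minimal sketch of that comparison, assuming 8-bit codes that are made up here (how a neural hashing model produces real codes is outside this sketch):

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")


def nearest_by_hamming(query_code, codes, top_k=2):
    """Return the names of the top_k codes closest to the query, ties broken by name."""
    ranked = sorted(
        codes.items(),
        key=lambda item: (hamming_distance(query_code, item[1]), item[0]),
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical 8-bit hash codes for three documents.
codes = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001101,
}
query_code = 0b10110110
print(nearest_by_hamming(query_code, codes))  # ['doc_a', 'doc_b']
```

XOR plus a bit count is cheap compared with floating-point arithmetic, which is what makes binary codes attractive for hardware acceleration.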
Semantic search frameworks: 57% Open Source
Company     | Product                     | Open Source: Y/N | Focus
Deepset.ai  | Haystack                    | Y                | NLP, neural search
Jina.AI     | Jina, Hub, Finetuner        | Y                | NLP, CV, ASR
Featureform | Feature store, EmbeddingHub | Y                | All AI verticals
ZIR.AI      | AI search platform          | N                | NLP
Hebbia.AI   | Knowledge Base              | N                | NLP -> Finance
Rasa.ai     | Virtual assistants          | Y                | NLP
Muves.io    | Multilingual vector search  | N                | Multilingual search, multimodality
Use cases
Specific:
● Image similarity (CLIP retrieval with
KNN search)
● Multilingual search (Muves)
● Question answering (Techcrunch
Weaviate demo, Deepset)
● Recommenders: Facebook news
feed
● Google Talk to Books
● Car image search (Classic.com)
● E-commerce: multimodal search
(Muves+GSI APU)
Broad:
● Metric learning: use case for
Vector DBs
● Semantic search: ride-sharing
● Anomaly detection (Pinecone)
● Classification
● Multi-stage ranking
Demo: Muves books search muves.io
Distiluse-base-multilingual-cased-v2
FAISS
1M books
Demo: Muves multilingual & multimodal search muves.io
For the demo we used a multilingual CLIP model from
Hugging Face and a 10M subset of the LAION-400M dataset
Multilingual CLIP model for image search
Multilingual Sentence Transformers for text
embeddings
Where things are going
● BM25 / Vector search: how to combine?
○ Vespa’s talk on BBUZZ’22
○ Jina demos with text/image blending
● Efficient embeddings
○ Mighty
○ GPU vs CPU (cost)
○ Latency at scale
● Going multimodal
○ Muves: multimodal search + hardware
acceleration
○ Search inside audio and blend with
text-based matching
○ Search complex objects, like 3D mesh,
polygon similarity
● Model fine-tuning
○ BEIR paper
○ Play with Jina Finetuner
○ Pre-trained models on your domain,
like CLIP
● Choose strategy for your vector search
○ Add new vector DB or enable a
dense_vector in Elasticsearch /
OpenSearch / Solr
○ Doc2query – no vector search is
involved!
○ Precise vs Explorative search
○ Use cases in your product
○ MLOps
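On the open question of combining BM25 and vector results, one commonly used option is reciprocal rank fusion (RRF). The sketch below is an assumption brought in for illustration, not a method from this talk or from the Vespa and Jina material referenced above; the hit lists are invented and k=60 is simply the conventional constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids: each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["d1", "d2", "d3"]    # hypothetical keyword ranking
vector_hits = ["d3", "d1", "d4"]  # hypothetical vector ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and vector similarities live on different scales.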
Get practical
● Code: https://github.com/DmitryKey/bert-solr-search
● Supported Engines: Solr, Elasticsearch, OpenSearch, GSI, [hnswlib]
● Supported LMs: BERT, SBERT, [theoretically any]
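For the Elasticsearch route, the request bodies involved look roughly like the sketch below. It only builds JSON payloads and sends nothing; the field name `text_vector`, the 768 dimensions (assuming a BERT-base sized encoder) and both helper functions are illustrative assumptions, and this is not the code from the repository above. The shapes follow the Elasticsearch 7.x `dense_vector` mapping and `script_score` query; OpenSearch and Solr use different field types and query syntax.

```python
import json


def dense_vector_mapping(field, dims):
    """Index mapping that declares one dense_vector field."""
    return {"mappings": {"properties": {field: {"type": "dense_vector", "dims": dims}}}}


def script_score_query(field, query_vector, size=10):
    """Exhaustive re-scoring of all documents by cosine similarity to the query vector."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


mapping = dense_vector_mapping("text_vector", dims=768)
query = script_score_query("text_vector", query_vector=[0.0] * 768, size=5)
print(json.dumps(mapping, indent=2))
```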
Thank you! 🧡
twitter.com/DmitryKan
youtube.com/c/VectorPodcast
https://spoti.fi/3sRXcdn
Links
1. Martin Fowler’s talk on NoSQL databases: https://www.youtube.com/watch?v=qI_g07C_Q5I
2. Neural search vs Inverted search
https://trends.google.com/trends/explore?date=all&q=neural%20search,inverted%20search
3. Not All Vector Databases Are Made Equal
https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696
4. HN thread: https://news.ycombinator.com/item?id=28727816
5. A survey of PQ and related methods: https://faiss.ai/
6. Vector Podcast on YouTube: https://www.youtube.com/c/VectorPodcast
7. Vector Podcast on Spotify: https://open.spotify.com/show/13JO3vhMf7nAqcpvlIgOY6
8. Vector Podcast on Apple Podcasts:
https://podcasts.apple.com/us/podcast/vector-podcast/id1587568733
9. BERT, Solr, Elasticsearch, OpenSearch, HNSWlib – in Python:
https://github.com/DmitryKey/bert-solr-search
10. Speeding up BERT search in Elasticsearch:
https://towardsdatascience.com/speeding-up-bert-search-in-elasticsearch-750f1f34f455
Links
1. Merging Knowledge Graphs with Language Models:
https://www.microsoft.com/en-us/research/publication/jaket-joint-pre-training-of-knowledge-graph-and-language-understanding/
2. Fusing Knowledge into Language Models: https://cs.stanford.edu/people/cgzhu/ in particular:
https://drive.google.com/file/d/1-jvKce5rcBp_DdkPv8ijLeFW5N2XrapS/view

Slack Application Development 101 Slides
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

London IR Meetup - Players in Vector Search_ algorithms, software and use cases

  • 1. Players in Vector Search: Algorithms, Software and Use Cases Dmitry Kan London IR Meetup July’22
  • 2. About me ● PhD in NLP, Co-founder of Muves.io ● Senior Product Manager, Search at TomTom ● Contributor to and user of Quepid – query rating tool ● 16+ years of experience in developing search engines for start-ups and multinational technology giants ● Host of the Vector Podcast https://www.youtube.com/c/VectorPodcast ● Blogging about vector search on Medium: https://dmitry-kan.medium.com/
  • 3. Outline ● Report on 1st season of Vector Podcast ● Vector Search algorithms ● Vector Search pyramid ● Use cases ● Where things are going
  • 4. Vector Podcast stats Launched on October 13, 2021 11 interviews 2 lectures 2 webinars 1 LIVE recording with 4 guests To be continued…
  • 5. Topics covered ● Vector DBs: Weaviate, Qdrant, Milvus, Vespa, Apache Solr ● Neural search: Jina, ZIR.AI ● Algorithms: HNSW, doc2query, bi-encoders ● Embedding layers: Mighty ● Sparse search & ML
  • 6. Keyword search ● Examples: Elasticsearch, OpenSearch, Solr ● Relies on matching of search terms to text in an inverted index ● Makes it difficult to find items with similar meaning but containing different keywords ● Not directly suitable for multimodal or multilingual search Vector search ● Utilises neural network models to represent objects (like text and images) and queries as high-dimensional vectors ● Ranking based on vector similarity ● Allows finding items with similar meaning or of different modality EXAMPLE Query: A bear eating a fish by a river Result: heron eating a fish EXAMPLE Query: A bear eating a fish by a river Query vector: [0.072893, -0.277076, 0.201384, …] Result vector: [0.004142, -0.022811, 0.019714 …] Result:
  • 7. Vector Search Pyramid: Muves in the context of Vector Search Ecosystem Vector Databases: Milvus, Weaviate, Pinecone, GSI, Qdrant, Vespa, Vald, Elastiknn… Neural frameworks: Haystack, Jina.AI, ZIR.AI, Hebbia.AI, Featureform… KNN / ANN algorithms: HNSW, PQ, IVF, LSH, Zoom, DiskANN, BuddyPQ … Encoders: Transformers, Clip, GPT3… + Mighty Application business logic: neural / BM25, symbolic filters, ranking user interface
  • 8. Algorithms: Big players in the game ● Spotify: ANNOY ● Microsoft (Bing team): Zoom, DiskANN, SPTAG ○ Azure Cognitive Search ● Amazon: KNN based on HNSW in OpenSearch ● Google: ScaNN ● Yahoo! Japan: NGT ● Facebook: FAISS, PQ (CPU & GPU) ● Baidu: IPDG (Baidu Cloud) ● Yandex ● NVIDIA ● Intel
  • 10. BigANN Competition @ NeurIPS’21 ● Max Irwin ● Alex Semenov ● Aarne Talman ● Leo Joffe ● Alex Klibisz ● Dmitry Kan Billion-Scale ANN Algorithm Challenge
  • 12. Credit: The Pinecone Vector Database System (Edo Liberty)
  • 13. BuddyPQ: improving over FAISS. BIGANN dataset: Kolmogorov-Smirnov dimension test matrix for the first 100000 points (a higher number indicates a less similar distribution). Variance Inflation Factor (Multicollinearity): ~2.25
  • 14. Motivation for vector search ● Text search can be solved* with inverted index and BM25 or TF-IDF ranking ● But what if your data is images, audio, video? ● Can you find images with textual queries?
  • 15. Motivation for vector search With inverted index you can represent documents as a sparse vector: [1, 1, 0, 1, 0, 0, 0, 0, …, 0] Problem 1: this bag-of-words representation does not take into account semantic context in which query words appear. “Capital” as a main city OR as some monetary value? Problem 2: Does not respect word order: Visa from Finland to Canada Visa from Canada to Finland
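The word-order problem on slide 15 can be checked directly. The following is a minimal sketch, assuming a toy whitespace tokenizer and a vocabulary built only from the two example queries (the helper name `bag_of_words` is hypothetical and not part of the talk): both visa queries map to the same bag-of-words vector, so an index built on such vectors cannot tell them apart.

```python
def bag_of_words(text, vocabulary):
    """Return a term-presence vector: 1 if the vocabulary term occurs in text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


queries = ["Visa from Finland to Canada", "Visa from Canada to Finland"]

# Toy vocabulary (an assumption for illustration): all distinct lowercased tokens, sorted.
vocabulary = sorted({token for q in queries for token in q.lower().split()})

vec_a = bag_of_words(queries[0], vocabulary)
vec_b = bag_of_words(queries[1], vocabulary)

print(vocabulary)
print(vec_a, vec_b)
print("identical representation:", vec_a == vec_b)
```

A real inverted index would use a far larger vocabulary and weighted terms (TF-IDF or BM25), but the loss of word order is the same.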
  • 17. Vector search in a nutshell Vector (=Neural) search is ● a way to represent and ● search your objects (documents, songs, images) in a geometric space ● usually of high-dimension in the form of a dense embedding ● a vector of numbers: [0.9, -0.1, 0.15, …] Credit: Weaviate V1.0 release - virtual meetup
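To make the geometric picture on slide 17 concrete, here is a small sketch of exact (brute-force) nearest-neighbour search with cosine similarity. The three-dimensional vectors, document ids, and function names are made up for illustration; real embeddings come from an encoder model and have hundreds of dimensions, and the ANN algorithms named in the talk (HNSW, PQ, IVF and others) exist to avoid this full scan at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query, index, k=2):
    """Score every indexed vector against the query and return the top k (id, score) pairs."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical dense embeddings, invented for this example.
index = {
    "bear_fishing": [0.9, -0.1, 0.15],
    "heron_fishing": [0.8, 0.0, 0.3],
    "city_traffic": [-0.2, 0.9, 0.1],
}
query = [0.85, -0.05, 0.2]

results = knn_search(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.4f}")
```

With these toy vectors the two fishing documents rank above the unrelated one, which mirrors the bear/heron example earlier in the deck.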
  • 26. Vector Search Pyramid: Muves in the context of Vector Search Ecosystem Vector Databases: Milvus, Weaviate, Pinecone, GSI, Qdrant, Vespa, Vald, Elastiknn… Neural frameworks: Haystack, Jina.AI, ZIR.AI, Hebbia.AI, Featureform… KNN / ANN algorithms: HNSW, PQ, IVF, LSH, Zoom, DiskANN, BuddyPQ … Encoders: Transformers, Clip, GPT3… + Mighty Application business logic: neural / BM25, symbolic filters, ranking user interface
  • 27.
  • 28. Smaller Vector DB players: 71% are Open Source
      Company | Product | Cloud | Open Source: Y/N | Algorithms
      SeMI | Weaviate | Y | Y (Go) | custom HNSW
      Pinecone | Pinecone | Y | N | FAISS + own
      GSI | APU chip for Elasticsearch / OpenSearch | N | N | Neural hashing / Hamming distance
      Qdrant | Qdrant | N | Y (Rust) | HNSW (graph)
      Yahoo! | Vespa | Y | Y (Java, C++) | HNSW (graph)
      Zilliz | Milvus | N | Y (Go, C++, Python) | FAISS, HNSW
      Yahoo! | Vald | N | Y (Go) | NGT
  • 29. Semantic search frameworks: 57% Open Source
      Company | Product | Open Source: Y/N | Focus
      Deepset.ai | Haystack | Y | NLP, neural search
      Jina.AI | Jina, Hub, Finetuner | Y | NLP, CV, ASR
      Featureform | Feature store, EmbeddingHub | Y | All AI verticals
      ZIR.AI | AI search platform | N | NLP
      Hebbia.AI | Knowledge Base | N | NLP -> Finance
      Rasa.ai | Virtual assistants | Y | NLP
      Muves.io | Multilingual vector search | N | Multilingual search, multimodality
  • 30. Use cases Specific: ● Image similarity (CLIP retrieval with KNN search) ● Multilingual search (Muves) ● Question answering (Techcrunch Weaviate demo, Deepset) ● Recommenders: Facebook news feed ● Google Talk to Books ● Car image search (Classic.com) ● E-commerce: multimodal search (Muves+GSI APU) Broad: ● Metric learning: use case for Vector DBs ● Semantic search: ride-sharing ● Anomaly detection (Pinecone) ● Classification ● Multi-stage ranking
  • 31. Demo: Muves books search muves.io Distiluse-base-multilingual-cased-v2 FAISS 1M books
  • 32. For the demo we used a multilingual CLIP from Huggingface and a 10M subset from LAION-400M dataset Multilingual CLIP model for image search Multilingual Sentence Transformers for text embeddings
  • 33. Demo: Muves multilingual & multimodal search muves.io
  • 34. Where things are going ● BM25 / Vector search: how to combine? ○ Vespa’s talk on BBUZZ’22 ○ Jina demos with text/image blending ● Efficient embeddings ○ Mighty ○ GPU vs CPU (cost) ○ Latency at scale ● Going multimodal ○ Muves: multimodal search + hardware acceleration ○ Search inside audio and blend with text-based matching ○ Search complex objects, like 3D mesh, polygon similarity ● Model fine-tuning ○ BEIR paper ○ Play with Jina Finetuner ○ Pre-trained models on your domain, like CLIP ● Choose strategy for your vector search ○ Add new vector DB or enable a dense_vector in Elasticsearch / OpenSearch / Solr ○ Doc2query – no vector search is involved! ○ Precise vs Explorative search ○ Use cases in your product ○ MLOps
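Slide 34 leaves "BM25 / Vector search: how to combine?" as an open question. One commonly used option, not named in the talk, is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions and so avoids normalising BM25 scores against vector similarities. The sketch below assumes two already-computed result lists; the document ids are invented, and `k=60` is the conventional RRF constant rather than a value from the slides.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a keyword engine and a vector engine.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents found by both engines rise to the top, while documents found by only one still appear further down, which suits a blend of precise and explorative search.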
  • 35. Get practical ● Code: https://github.com/DmitryKey/bert-solr-search ● Supported Engines: Solr, Elasticsearch, OpenSearch, GSI, [hnswlib] ● Supported LMs: BERT, SBERT, [theoretically any]
  • 37. Links 1. Martin Fowler’s talk on NoSQL databases: https://www.youtube.com/watch?v=qI_g07C_Q5I 2. Neural search vs Inverted search https://trends.google.com/trends/explore?date=all&q=neural%20search,inverted%20search 3. Not All Vector Databases Are Made Equal https://towardsdatascience.com/milvus-pinecone-vespa-weaviate-vald-gsi-what-unites-these-buzz-words-and-what-makes-each-9c65a3bd0696 4. HN thread: https://news.ycombinator.com/item?id=28727816 5. A survey of PQ and related methods: https://faiss.ai/ 6. Vector Podcast on YouTube: https://www.youtube.com/c/VectorPodcast 7. Vector Podcast on Spotify: https://open.spotify.com/show/13JO3vhMf7nAqcpvlIgOY6 8. Vector Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/vector-podcast/id1587568733 9. BERT, Solr, Elasticsearch, OpenSearch, HNSWlib – in Python: https://github.com/DmitryKey/bert-solr-search 10. Speeding up BERT search in Elasticsearch: https://towardsdatascience.com/speeding-up-bert-search-in-elasticsearch-750f1f34f455
  • 38. Links 1. Merging Knowledge Graphs with Language Models: https://www.microsoft.com/en-us/research/publication/jaket-joint-pre-training-of-knowledge-graph-and-language-understanding/ 2. Fusing Knowledge into Language Models: https://cs.stanford.edu/people/cgzhu/ in particular: https://drive.google.com/file/d/1-jvKce5rcBp_DdkPv8ijLeFW5N2XrapS/view