Toria Gibbs
@scarletdrive
How Search Engines Work
(A Thing I Didn’t Learn in University)
Agenda
1 Introduction / who is this lady?
2 Text Search / inverted indexes
a Performance
b Quality
3 Relevance / quality part 2: ranking the results
4 Open Source Tools / free search engines!
5 Conclusion / bye
Who is this lady?
Bachelor of Computer Science 2010
2010 → 2020
😱
Raise your hand if you learned
about search engines in university
Story time!
📖
Design Search
Search index?
🤔
Database index!
💡
They hired me!
😁
They hired me!
😬
(even though I was wrong)
🙋🏽 Hey Toria, can I get some help?
💁🏻‍♀ Heck yes you can, buddy!
🙋🏽 How much disk space will my new field use?
🤷🏻‍♀ ...............???
2 / Text Search
torias-pet-emporium.myshopify.com
Assume we have a database...
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Query: cat
SELECT *
FROM items
WHERE title LIKE '%cat%'
2 / Text Search
A / Performance
n = items in database
m = max length of title strings
Scanning every title costs O(n·m)
With m fixed (m = 256), that is O(n)
n n · m (m=256)
10 2 560
100 25 600
1 000 256 000
10 000 2 560 000
100 000 25 600 000
1 000 000 256 000 000
10 000 000 2 560 000 000
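The table above is just n multiplied by m. A minimal sketch of that arithmetic, assuming a naive scan that may touch every character of every title (the function name is made up for illustration):

```python
def worst_case_chars_scanned(n_items: int, max_title_len: int = 256) -> int:
    """Upper bound on characters a LIKE '%term%' style scan may examine."""
    return n_items * max_title_len


for n in (10, 1_000, 1_000_000):
    print(f"{n:>9,} items -> {worst_case_chars_scanned(n):>13,} characters")
```

The point is the growth rate, not the exact count: ten times more items means ten times more work per query.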
Let’s make it faster
We can look up an item by its
ID in constant time.
term ids
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
Inverted Index
Map from words to
sets of IDs of records
which contained
those words
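A minimal sketch of building such a map in Python. It assumes plain whitespace splitting and no stemming, so "cats" and "mittens" stay separate terms here, unlike the cleaned-up index on the slide; `build_index` is an illustrative name.

```python
from collections import defaultdict

titles = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def build_index(docs):
    """Map each word to the set of record IDs whose title contains it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.split():
            index[term].add(doc_id)
    return index


index = build_index(titles)
print(sorted(index["cat"]))  # records whose title contains the exact word "cat"
print(sorted(index["hat"]))
```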
Query: cat
Look up the term in the index: O(1)
● Assumes perfect hash function
● Trade-offs: storage, pre-processing, complexity
● Additional lookup step still required
Query: cat → ids [1, 3, 5]
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
Fetch the matching records: O(r)
r = number of results found
...but we usually only ask for a fixed
number of results at a time
O(25) → O(1)
Search engines provide faster results
than a database for text search
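The two steps (term lookup, then fetching a fixed-size page of records) can be sketched as follows. The dictionaries stand in for the index and the item table, and `page_size` is an assumed parameter name.

```python
# Toy stand-ins for the inverted index and the items table.
index = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
}
items = {
    1: "red cat mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
}


def search(term, page_size=25):
    ids = index.get(term, [])  # one hash lookup, O(1) on average
    # Fetch at most page_size records, so this step is bounded by a constant.
    return [items[i] for i in ids[:page_size]]


print(search("cat"))
print(search("hat", page_size=2))
```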
2 / Text Search
B / Quality
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
SELECT *
FROM items
WHERE title LIKE '%cat%'
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”
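The same problems show up with any plain substring match. A small sketch that mimics the LIKE '%...%' queries in Python (the `like_search` helper is hypothetical, and all titles are already lowercase):

```python
titles = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def like_search(needle):
    """Return IDs of titles containing needle anywhere, like LIKE '%needle%'."""
    return [doc_id for doc_id, title in titles.items() if needle in title]


print(like_search("cat"))   # [1, 3, 4, 5]: 4 matches because of "va-cat-ion"
print(like_search("cats"))  # [3]: misses "cat hat" and "red cat mittens"
```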
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
How did we do this?
Analyzers
1. Tokenizers
2. Normalizers (a.k.a. filters)
○ Stemmers
○ Lowercase, character filters
○ Stop words
○ Synonyms
Tokenization
string: “cat hat”
array: [“cat”, “hat”]
Stemming
“dogs” → “dog”
“walking” → “walk”
“fetched” → “fetch”
“ran” → “run”
Lowercase
“Toria” → “toria”
“WOW” → “wow”
Character Filters
“résumé” → “resume”
Stop Words
Remove “the”, “and”,
“or”, “but”, etc...
Synonyms
“colour” → “color”
“lb” → “pound”
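The analyzer steps above can be chained into one small function. This is a sketch under loud assumptions: the stemmer is a toy suffix stripper (it cannot handle irregular forms like “ran” → “run”), the stop word and synonym lists are tiny hand-picked samples, and all names are illustrative rather than any search engine's API.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and", "or", "but", "for"}                 # sample list only
SYNONYMS = {"kitten": "cat", "colour": "color", "lb": "pound"}  # sample list only


def strip_accents(text):
    """Character filter: 'résumé' -> 'resume'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Toy stemmer: strip a few common suffixes, keeping at least 3 letters."""
    for suffix in ("ies", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    text = strip_accents(text.lower())                   # lowercase + character filter
    tokens = re.findall(r"[a-z0-9]+", text)              # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    tokens = [stem(t) for t in tokens]                   # stem
    return [SYNONYMS.get(t, t) for t in tokens]          # swap synonyms


print(analyze("Blue hat for CATS"))
print(analyze("kitten mittens"))
```

A real deployment would use the stemmers and filters that ship with the search engine instead of a hand-rolled suffix list.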
Quality Problems
1. “cat” search returned “vacation hat for dog”
id title price
4 vacation hat for dog 12.99
5 cat hat 5.00
Tokenize it
[“vacation”, “hat”, “for”, “dog”]
[“cat”, “hat”]
Remove stop words
[“vacation”, “hat”, “dog”]
[“cat”, “hat”]
term ids
cat [5]
hat [4, 5]
dog [4]
vacation [4]
Query: cat → ids [5]
● Search for “cat” does not return “vacation hat for dog” due to tokenization
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat”
id title price
3 blue hat for cats 8.00
5 cat hat 5.00
Tokenize it
[“blue”, “hat”, “for”, “cats”]
[“cat”, “hat”]
Stem it
[“blue”, “hat”, “for”, “cat”]
[“cat”, “hat”]
Remove stop words
[“blue”, “hat”, “cat”]
[“cat”, “hat”]
term ids
blue [3]
cat [3, 5]
hat [3, 5]
Query: cats → ??? (the index has no term “cats”)
All transformations performed on
the input data for the index
are also performed on the query
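A minimal sketch of that rule: the query is passed through the same `analyze` function used at index time. The analyzer here is deliberately tiny (lowercase, two stop words, a toy plural stripper), and treating a multi-word query as an OR over its terms is an assumption made for illustration.

```python
from collections import defaultdict

STOP_WORDS = {"for", "and"}  # sample list only


def analyze(text):
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]  # toy stemmer: "cats" -> "cat"
        terms.append(token)
    return terms


titles = {3: "blue hat for cats", 5: "cat hat"}

# Index time: analyze every title.
index = defaultdict(set)
for doc_id, title in titles.items():
    for term in analyze(title):
        index[term].add(doc_id)


def search(query):
    # Query time: run the query through the same analyzer.
    results = set()
    for term in analyze(query):
        results |= index.get(term, set())
    return sorted(results)


print(sorted(index.get("cats", set())))  # raw term lookup finds nothing
print(search("cats"))                    # analyzed query finds both records
```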
Query: cats
Stem it
cat → ids [3, 5]
● Search for “cats” does return
“cat hat” due to stemming
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens”
id title price
5 cat hat 5.00
7 kitten mittens 11.99
Tokenize it
[“cat”, “hat”]
[“kitten”, “mittens”]
Stem it
[“cat”, “hat”]
[“kitten”, “mitten”]
Swap synonyms
[“cat”, “hat”]
[“cat”, “mitten”]
term ids
cat [5, 7]
hat [5]
mitten [7]
Query: cat → ids [5, 7]
● Search for “cat” returns all
items with “cat” or “kitten” due
to synonyms
Query: kitten
Swap synonym
cat → ids [5, 7]
● Search for “kitten” returns all
items with “cat” or “kitten” due
to synonyms
Quality Problems
1. “cat” search returned “vacation hat for dog” ✓
2. “cats” search does not return “cat hat” ✓
3. “cat” search does not return “kitten mittens” ✓
Search engines provide faster and
better quality results than a
database for text search
🙋🏽
How much disk space
will my new field use?
👩🏻‍🎓
I learned things!
I can help!
🙋🏽
It’s a string field, but it’s only going
to be 100 characters long, max.
👩🏻‍🎓
Can you tell me anything about the
characteristics of these strings?
id title
1 red cat mittens
2 blue dog mittens
3 blue hat for cats
4 vacation hat for dog
5 cat hat
6 red and blue dog hat
7 kitten mittens
8 dog booties
term ids
red [1, 6]
cat [1, 3, 5, 7]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
boot [8]
8 rows in the table, 8 rows in the index
id text
1 good dog
2 bad dog
3 good dog
4 bad dog
5 good dog
6 bad dog
7 good dog
8 bad dog
...
100 bad dog
term ids
good [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, … 99]
bad [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, … 100]
dog [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … 99, 100]
100 rows in the table
3 rows in the index
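The size of the index depends on how many distinct terms the field holds, not only on how many records there are. A sketch that counts index rows (distinct terms) and postings (term-to-ID entries), assuming whitespace tokenization; `index_shape` is an illustrative helper:

```python
def index_shape(docs):
    """Return (number of distinct terms, total number of postings)."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return len(index), sum(len(ids) for ids in index.values())


# 100 records that only ever say "good dog" or "bad dog".
repetitive = {i: ("good dog" if i % 2 else "bad dog") for i in range(1, 101)}
# 100 records that each hold one unique value.
unique = {i: f"item{i}" for i in range(1, 101)}

print(index_shape(repetitive))  # few terms, long ID lists
print(index_shape(unique))      # one term per record
```

This is why the answer to "how much disk space?" starts with a question about what the strings look like.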
🙋🏽
Why yes, they are categories
which are static and well-defined.
👩🏻‍🎓
AWESOME.
categories
Pet Accessories
Pet Beds
term ids
pet ?
accessory ?
bed ?
Pause for cat pictures
3 / Relevance
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are
many results, what
order should we
display them in?
Relevance
tf-idf
term frequency
inverse document frequency
Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0
Result order = [1, 3, 2]
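The TF values and the result order can be reproduced with a few lines of Python. This sketch assumes lowercasing, letter-only tokens, and a toy plural stripper so that “Cats” counts as “cat”. Every document contains “cat”, so IDF(cat) is 0 and the ordering comes from TF alone.

```python
import math
import re


def terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    # Toy stemmer: "cats" -> "cat"; short words like "is" are left alone.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def tf(term, text):
    doc_terms = terms(text)
    return doc_terms.count(term) / len(doc_terms)


docs = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}

for doc_id, text in docs.items():
    print(doc_id, round(tf("cat", text), 2))

idf_cat = math.log(3 / 3)  # every document contains "cat"
ranked = sorted(docs, key=lambda d: tf("cat", docs[d]), reverse=True)
print(idf_cat, ranked)
```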
Query: cat
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0
Result order = [2, 1, 3]
Query: orange cat
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them)
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
TF(cat, doc1) = 2/7 = 0.29
TF(orange, doc1) = 1/7 = 0.14
TF(cat, doc2) = 1/5 = 0.20
TF(orange, doc2) = 1/5 = 0.20
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
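The same calculation as a short Python sketch. The corpus statistics (100 documents, all containing “cat”, 2 containing “orange”) are taken from the example rather than computed from real data, and the function names are illustrative.

```python
import math


def tf(term, doc_terms):
    return doc_terms.count(term) / len(doc_terms)


def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)


doc1 = "the orange cat is a good cat".split()
doc2 = "my cat ate an orange".split()

# Assumed corpus statistics from the example above.
idf_cat = idf(100, 100)    # 0.0: "cat" is everywhere, so it carries no signal
idf_orange = idf(100, 2)   # about 3.9: "orange" is rare, so it matters


def score(doc_terms):
    return tf("cat", doc_terms) * idf_cat + tf("orange", doc_terms) * idf_orange


scores = {1: score(doc1), 2: score(doc2)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Without intermediate rounding the scores come out near 0.56 and 0.78; the 0.55 above results from rounding TF and IDF before multiplying. The order is the same either way.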
Better Relevance
● Phrase matching
● Fuzzy matching, spelling correction
● User factors: location, language
● Other factors: quality, recency, randomness
bm25
is the cool new thing
RIP tf-idf
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
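For comparison with tf-idf, here is a rough sketch of the per-term BM25 score in the form commonly described for Lucene-based engines. The defaults k1 = 1.2 and b = 0.75 and the corpus numbers in the usage lines are assumptions for illustration; check the linked post for the authoritative description.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, total_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one query term against one document."""
    # Rare terms get a higher IDF, as with tf-idf.
    idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents are penalised; b controls how strongly.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * freq * (k1 + 1) / (freq + k1 * length_norm)


one = bm25_term_score(freq=1, doc_len=8, avg_doc_len=8, total_docs=100, docs_with_term=2)
four = bm25_term_score(freq=4, doc_len=8, avg_doc_len=8, total_docs=100, docs_with_term=2)
print(round(one, 2), round(four, 2))  # four occurrences score higher, but not 4x higher
```

The saturation is the main practical difference from the plain TF used earlier: repeating “cat cat cat” helps less and less.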
4 / Open Source Tools
● Inverted index
● Basic tokenization and normalization
● Ranking
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization and normalization
● Advanced ranking
● Plugins
Which one should I pick?
It doesn’t matter
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
~ ~ WARNING: Toria’s personal opinion ~ ~
Better for advanced customization
Easier to learn, faster to start up, better docs
5 / Conclusion
Recap
● Inverted index for text search
○ Faster than a database
○ Better quality than a database
● Ranking for relevance with tf-idf (or bm25)
● Solr and Elasticsearch are great open source solutions
Thank you!
careers.shopify.com
engineering.shopify.com (blog)

Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

How Search Engines Work (A Thing I Didn't Learn in University)

  • 1. Toria Gibbs @scarletdrive How Search Engines Work (A Thing I Didn’t Learn in University)
  • 2. Agenda
      1 Introduction / who is this lady?
      2 Text Search / inverted indexes
        a Performance
        b Quality
      3 Relevance / quality part 2: ranking the results
      4 Open Source Tools / free search engines!
      5 Conclusion / bye
  • 3. Who is this lady? Toria Gibbs @scarletdrive, Bachelor of Computer Science 2010
  • 5. Raise your hand if you learned about search engines in university
  • 11. They hired me! 😬 (even though I was wrong)
  • 12. 🙋🏽 Hey Toria, can I get some help? 💁🏻‍♀ Heck yes you can, buddy!
  • 13. 🙋🏽 How much disk space will my new field use?
  • 14. 🙋🏽 How much disk space will my new field use? 🤷🏻‍♀ ...............???
  • 15. 2 / Text Search
  • 17. Assume we have a database...
      id  title                  price
      1   red cat mittens        14.99
      2   blue dog mittens       24.99
      3   blue hat for cats      8.00
      4   vacation hat for dog   12.99
      5   cat hat                5.00
      6   red and blue dog hat   10.49
      7   kitten mittens         11.99
      8   dog booties            11.99
  • 18. Search: cat (against the items table above)
      SELECT * FROM items WHERE title LIKE '%cat%'
  • 20. 2 / Text Search: A / Performance
  • 21. n = items in database, m = max length of title strings. Cost: n·m
  • 22. n = items in database, m = max length of title strings = 256. Cost: O(n)
  • 23. n            n·m (m = 256)
      10           2 560
      100          25 600
      1 000        256 000
      10 000       2 560 000
      100 000      25 600 000
      1 000 000    256 000 000
      10 000 000   2 560 000 000
  • 24. Let’s make it faster. We can look up an item by its ID in constant time. (items table as on slide 17)
  • 25. Let’s make it faster (items table as on slide 17, plus a table of terms)
      term      ids
      red       [1, 6]
      cat       [1, 3, 5]
      mitten    [1, 2, 7]
      blue      [2, 3, 6]
      hat       [3, 4, 5, 6]
      dog       [2, 4, 6, 8]
      vacation  [4]
      kitten    [7]
      boot      [8]
  • 26. Inverted Index: map from words to sets of IDs of records which contained those words (the term table above)
  • 27. Search: cat, looked up in the term table above
  • 29. O(1)
      ● Assumes perfect hash function
      ● Trade-offs: storage, pre-processing, complexity
      ● Additional lookup step still required
  • 30. Search: cat, then fetch the matching rows by ID
      id  title              price
      1   red cat mittens    14.99
      3   blue hat for cats  8.00
      5   cat hat            5.00
  • 32. ...but we usually only ask for a fixed number of results at a time: O(25) → O(1)
  • 33. Search engines provide faster results than a database for text search
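The lookup path from the slides above (term table, dictionary lookup, fetch rows by ID, fixed page size) can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine stores its index: the names `ITEMS`, `build_index`, and `search` are made up here, and tokenization is assumed to be plain lowercase whitespace splitting with no stemming, so "cats" and "mittens" remain separate terms and the postings differ slightly from the slide.

```python
from collections import defaultdict

# Item titles from the example database, keyed by ID.
ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def build_index(items):
    """Map each term to the sorted list of item IDs whose title contains it."""
    index = defaultdict(list)
    for item_id in sorted(items):
        # Assumption: naive tokenization (lowercase + whitespace split).
        for term in set(items[item_id].lower().split()):
            index[term].append(item_id)
    return dict(index)


def search(index, items, term, limit=25):
    """Dictionary lookup for the term, then fetch at most `limit` rows by ID."""
    ids = index.get(term.lower(), [])[:limit]
    return [(item_id, items[item_id]) for item_id in ids]


INDEX = build_index(ITEMS)
print(search(INDEX, ITEMS, "cat"))  # [(1, 'red cat mittens'), (5, 'cat hat')]
```

The index is built once, ahead of time; each query then costs one dictionary lookup plus up to `limit` row fetches, regardless of how many items are in the table.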
  • 34. 2 / Text Search: B / Quality
  • 35. SELECT * FROM items WHERE title LIKE '%cat%' (items table as on slide 17)
  • 36. SELECT * FROM items WHERE title LIKE '%cat%'
      ● Search for “cat” incorrectly returns “vacation hat for dog”
  • 37. SELECT * FROM items WHERE title LIKE '%cat%'
      ● Search for “cat” incorrectly returns “vacation hat for dog”
      ● Search for “cat” doesn’t return “kitten mittens”
  • 38. SELECT * FROM items WHERE title LIKE '%cats%'
      ● Search for “cat” incorrectly returns “vacation hat for dog”
      ● Search for “cat” doesn’t return “kitten mittens”
      ● Search for “cats” doesn’t return “cat hat” or “red cat mittens”
  • 39. SELECT * FROM items
      WHERE title LIKE 'cat' OR title LIKE 'cats'
      OR title LIKE 'cat %' OR title LIKE 'cats %'
      OR title LIKE '% cat' OR title LIKE '% cats'
      OR title LIKE '% cat %' OR title LIKE '% cats %'
      OR title LIKE '% cat.%' OR title LIKE '% cats.%'
      OR title LIKE '%.cat %' OR title LIKE '%.cats %'
      OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
      OR title LIKE '% cat,%' OR title LIKE '% cats,%'
      OR title LIKE '%,cat %' OR title LIKE '%,cats %'
      OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
      OR title LIKE '% cat-%' OR title LIKE '% cats-%'
      OR title LIKE '%-cat %' OR title LIKE '%-cats %'
      OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
      ...
  • 40. How did we do this? (from the items table on slide 17 to the term table on slide 25)
  • 41. Analyzers
      1. Tokenizers
      2. Normalizers (a.k.a. filters)
         ○ Stemmers
         ○ Lowercase, character filters
         ○ Stop words
         ○ Synonyms
  • 43. Stemming: “dogs” → “dog”, “walking” → “walk”, “fetched” → “fetch”, “ran” → “run”
  • 44. Lowercase: “Toria” → “toria”, “WOW” → “wow”. Character Filters: “résumé” → “resume”
  • 45. Stop Words: remove “the”, “and”, “or”, “but”, etc... Synonyms: “colour” → “color”, “lb” → “pound”
  • 46. Quality Problems
      1. “cat” search returned “vacation hat for dog”
  • 47. id  title                 price
      4   vacation hat for dog  12.99
      5   cat hat               5.00
      Tokenize it:        [“vacation”, “hat”, “for”, “dog”]   [“cat”, “hat”]
      Remove stop words:  [“vacation”, “hat”, “dog”]          [“cat”, “hat”]
      term      ids
      cat       [5]
      hat       [4, 5]
      dog       [4]
      vacation  [4]
  • 48. Search: cat (same two items and index as slide 47)
      ● Search for “cat” does not return “vacation hat for dog” due to tokenization
  • 49. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat”
  • 50. id  title              price
      3   blue hat for cats  8.00
      5   cat hat            5.00
      Tokenize it:        [“blue”, “hat”, “for”, “cats”]   [“cat”, “hat”]
      Stem it:            [“blue”, “hat”, “for”, “cat”]    [“cat”, “hat”]
      Remove stop words:  [“blue”, “hat”, “cat”]           [“cat”, “hat”]
      term  ids
      blue  [3]
      cat   [3, 5]
      hat   [3, 5]
  • 51. Search: cats → ??? (same two items and index as slide 50)
  • 52. All transformations performed on the input data for the index are also performed on the query
  • 53. Search: cats → Stem it → cat (same two items and index as slide 50)
      ● Search for “cats” does return “cat hat” due to stemming
  • 54. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat” ✓
      3. “cat” search does not return “kitten mittens”
  • 55. id  title           price
      5   cat hat         5.00
      7   kitten mittens  11.99
      Tokenize it:    [“cat”, “hat”]   [“kitten”, “mittens”]
      Stem it:        [“cat”, “hat”]   [“kitten”, “mitten”]
      Swap synonyms:  [“cat”, “hat”]   [“cat”, “mitten”]
      term    ids
      cat     [5, 7]
      hat     [5]
      mitten  [7]
  • 56. Search: cat (same two items and index as slide 55)
      ● Search for “cat” returns all items with “cat” or “kitten” due to synonyms
  • 57. Search: kitten → Swap synonym → cat (same two items and index as slide 55)
      ● Search for “kitten” returns all items with “cat” or “kitten” due to synonyms
  • 58. Quality Problems
      1. “cat” search returned “vacation hat for dog” ✓
      2. “cats” search does not return “cat hat” ✓
      3. “cat” search does not return “kitten mittens” ✓
  • 59. Search engines provide faster and better quality results than a database for text search
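The analyzer steps above (tokenize, stem, remove stop words, swap synonyms, and run the query through the same chain) can be sketched as follows. This is a toy under stated assumptions, not a real analyzer: `STEMS` is a hand-written lookup table standing in for a stemming algorithm, `STOP_WORDS` and `SYNONYMS` contain only what the example needs, and treating a multi-word query as an OR of its terms is a choice made for this sketch.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "or", "but", "for"}
# Assumption: a tiny lookup table stands in for a real stemming algorithm.
STEMS = {"cats": "cat", "mittens": "mitten", "booties": "boot"}
SYNONYMS = {"kitten": "cat"}

TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def analyze(text):
    """Tokenize, lowercase, stem, drop stop words, then swap synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [STEMS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]


def build_index(titles):
    """Inverted index over analyzed terms: term -> sorted list of IDs."""
    index = defaultdict(list)
    for item_id in sorted(titles):
        for term in set(analyze(titles[item_id])):
            index[term].append(item_id)
    return dict(index)


def search(index, query):
    """Analyze the query exactly like the titles, then OR the postings."""
    ids = set()
    for term in analyze(query):
        ids.update(index.get(term, []))
    return sorted(ids)


INDEX = build_index(TITLES)
print(search(INDEX, "cat"))     # [1, 3, 5, 7]
print(search(INDEX, "cats"))    # [1, 3, 5, 7]
print(search(INDEX, "kitten"))  # [1, 3, 5, 7]
```

All three quality problems disappear at once: "vacation" is indexed as a whole token, "cats" stems to "cat" on both sides, and "kitten" is rewritten to "cat" in both the index and the query.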
  • 60. 🙋🏽 How much disk space will my new field use? 👩🏻‍🎓 I learned things! I can help!
  • 61. 👩🏻‍🎓 Can you tell me anything about the characteristics of these strings? 🙋🏽 It’s a string field, but it’s only going to be 100 characters long, max.
  • 62. id  title
      1   red cat mittens
      2   blue dog mittens
      3   blue hat for cats
      4   vacation hat for dog
      5   cat hat
      6   red and blue dog hat
      7   kitten mittens
      8   dog booties
      (8 rows)
      term      ids
      red       [1, 6]
      cat       [1, 3, 5, 7]
      mitten    [1, 2, 7]
      blue      [2, 3, 6]
      hat       [3, 4, 5, 6]
      dog       [2, 4, 6, 8]
      vacation  [4]
      boot      [8]
      (8 rows)
  • 63. id   text
      1    good dog
      2    bad dog
      3    good dog
      4    bad dog
      5    good dog
      6    bad dog
      7    good dog
      8    bad dog
      ...
      100  bad dog
      (100 rows)
      term  ids
      good  [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, … 99]
      bad   [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, … 100]
      dog   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, … 99, 100]
      (3 rows)
  • 64. 🙋🏽 Why yes, they are categories which are static and well-defined. 👩🏻‍🎓 AWESOME.
      categories: Pet Accessories, Pet Beds
      term       ids
      pet        ?
      accessory  ?
      bed        ?
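The point of the two tables above is that the number of rows in the inverted index follows the number of distinct terms, not the number of documents or the maximum field length. A minimal sketch that counts both, assuming plain lowercase whitespace tokenization (no stemming) and using the made-up helper name `index_shape`:

```python
def index_shape(texts):
    """Return (distinct terms, total postings) for a list of field values."""
    postings = {}
    for doc_id, text in enumerate(texts, start=1):
        # Assumption: naive tokenization (lowercase + whitespace split).
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    total = sum(len(ids) for ids in postings.values())
    return len(postings), total


# 100 documents alternating "good dog" / "bad dog", as in the table above.
REVIEWS = ["good dog" if i % 2 else "bad dog" for i in range(1, 101)]
print(index_shape(REVIEWS))  # (3, 200)

# A field holding a small, fixed set of category names.
CATEGORIES = ["Pet Accessories", "Pet Beds"]
print(index_shape(CATEGORIES))  # (3, 4)
```

The term count stays small for a static vocabulary, while the postings lists still grow with the number of documents, so both numbers matter when estimating space.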
  • 65. Pause for cat pictures
  • 67. Agenda (as on slide 2)
  • 69. Relevance: when there are many results, what order should we display them in?
      id   title                    price
      1    red cat mittens          14.99
      3    blue hat for cats        8.00
      5    cat hat                  5.00
      22   feather cat toy          7.99
      124  cat and mouse t-shirt    24.50
      128  cat t-shirt              31.80
      329  “cats rule” sticker      0.99
      420  catnip joint for cats    5.99
      455  cat toy                  7.00
      ...  ...                      ...
  • 70. tf-idf: term frequency, inverse document frequency
  • 71. Relevance with tf-idf
      TF(term) = # times this term appears in doc / total # terms in doc
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Query: cat
      1. The orange cat is a very good cat.  TF(cat) = 2/8 = 0.25
      2. My cat ate an orange.  TF(cat) = 1/5 = 0.20
      3. Cats are the best and I will give every cat a special cat toy.  TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3)
      Result order = [1, 3, 2]
  • 72. Relevance with tf-idf (TF and IDF as defined on slide 71)
      Query: cat
      1. The orange cat is a very good cat.  TF(cat) = 2/8 = 0.25
      2. My cat ate an orange. Cat cat cat!  TF(cat) = 4/8 = 0.50
      3. Cats are the best and I will give every cat a special cat toy.  TF(cat) = 3/14 = 0.21
      IDF(cat) = log_e(3/3)
      Result order = [2, 1, 3]
  • 73. Relevance with tf-idf (TF and IDF as defined on slide 71)
      Query: orange cat
      1. The orange cat is a good cat.
      2. My cat ate an orange.
      (assume 100 records which all contain “cat” in them)
  • 74. Query: orange cat (same two documents as slide 73)
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1)
      score(doc2) = score(cat, doc2) + score(orange, doc2)
  • 75. Query: orange cat (same two documents as slide 73)
      IDF(cat) = log_e(100/100) = 0.0
      IDF(orange) = log_e(100/2) = 3.9
      score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 76. Query: orange cat (same two documents as slide 73)
      TF(orange, doc1) = 1/7 = 0.14
      TF(orange, doc2) = 1/5 = 0.20
      score(doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
      score(doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
      Result order = [2, 1]
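The "orange cat" example can be reproduced with a short script. This is a sketch of the TF and IDF definitions given in the slides, nothing more: the helper names are made up, the tokenizer is a simple regex with no stemming, and the document frequencies are hard-coded from the slide's assumption of 100 records that all contain "cat", of which only these two contain "orange".

```python
import math
import re

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}
# Assumption taken from the slides: a 100-document collection in which every
# document contains "cat" and only the two documents above contain "orange".
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}


def tokenize(text):
    """Lowercase and split on non-letters (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def tf(term, tokens):
    """Times the term appears in the doc / total number of terms in the doc."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Natural log of (total docs / docs containing the term)."""
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, text):
    """Sum of TF * IDF over the query terms."""
    tokens = tokenize(text)
    return sum(tf(term, tokens) * idf(term) for term in tokenize(query))


SCORES = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)
print(RANKING)  # [2, 1]
```

Because every document contains "cat", its IDF is zero and it contributes nothing; the ranking is decided entirely by "orange". The unrounded scores come out near 0.56 and 0.78; the slide's 0.55 results from rounding TF and IDF before multiplying.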
  • 77. Better Relevance
      ● Phrase matching
      ● Fuzzy matching, spelling correction
      ● User factors: location, language
      ● Other factors: quality, recency, randomness
  • 78. bm25 is the cool new thing (RIP tf-idf)
      https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
  • 79. 4 / Open Source Tools
  • 81. ● Inverted index
      ● Basic tokenization and normalization
      ● Ranking
      ● Replication, sharding, and distribution
      ● Caching and warming
      ● Advanced tokenization and normalization
      ● Advanced ranking
      ● Plugins
  • 82. Which one should I pick? It doesn’t matter
  • 83. Which one should I pick?
      ● Most projects work well with either
      ● Getting configuration right is most important
      ● Test with your own data, your own queries
      Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
      https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
      https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
      Solr vs. Elasticsearch by Kelvin Tan
      http://solr-vs-elasticsearch.com/
  • 84. Which one should I pick? (~ ~ WARNING: Toria’s personal opinion ~ ~)
      ● Better for advanced customization
      ● Easier to learn, faster to start up, better docs
  • 86. Recap
      ● Inverted index for text search
        ○ Faster than a database
        ○ Better quality than a database
      ● Ranking for relevance with tf-idf (or bm25)
      ● Solr and Elasticsearch are great open source solutions