A Search Index is not a Database Index
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive
Story time!
Search Index
Database Index
They hired me! (even though I was wrong)
Agenda
0: Terminology
1: Text Search
2: Numeric Range Search
3: Storage
Terminology
Database
Table
Schema
Column
Row

Table:
id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog

Schema:
id: integer
name: string
breed: string
Terminology
Database
Table
Schema
Column
Row

pets
id   name      breed
001  Momo      Cat
002  Naga      Cat
003  Sullivan  Dog
Schema: id: integer, name: string, breed: string

humans
id   name
001  Toria
002  Colleen
Schema: id: integer, name: string

owners
human_id  pet_id
001       001
001       002
002       003
Schema: human_id: int, pet_id: int
Terminology
Database        Search Engine
Table           Search Index
Schema          Schema
Column          Field
Row             Document
Database Index  ?

Terminology
Database        Search Engine
Table           Search Index
Schema          Schema
Column          Field
Row             Document
Database Index  Inverted Index
Text Search
Part 1
Secret Santa for Cats
Find all the cat-related items in a database
github.com/toriagibbs/SecretSanta
By Rebecca Davis (pawsomecrochet.etsy.com)
id   title           description                                          price   quantity
001  Cat hat         A very good hat for very good cats                   $15.00  4
002  Vacation hat    Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats   A set of three hats for the most extreme cat people  $25.00  1
004  Kitten hat      This is a very small hat, for kittens particularly   $11.00  2
005  Kitten mittens  Finally! An elegant, comfortable mitten for cats     $25.97  18
SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';
Database Performance
O(n * m)
n = number of rows in the database
m = length of strings
Treating the string length m as bounded by a constant, this simplifies to:
O(n)
n = number of rows in the database
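To make the O(n) cost concrete, here is a minimal Python sketch of what a LIKE '%cat%' scan has to do. It is an illustration only, not how any particular database implements LIKE; the names LISTINGS and like_scan are made up for this example, and the match is case-sensitive to mirror the way the slides treat the query.

```python
# Toy rows mirroring the listings table from the slides.
LISTINGS = [
    {"id": "001", "title": "Cat hat",
     "description": "A very good hat for very good cats"},
    {"id": "002", "title": "Vacation hat",
     "description": "Wear this hat to the beach maybe"},
    {"id": "003", "title": "Hats for cats",
     "description": "A set of three hats for the most extreme cat people"},
    {"id": "004", "title": "Kitten hat",
     "description": "This is a very small hat, for kittens particularly"},
    {"id": "005", "title": "Kitten mittens",
     "description": "Finally! An elegant, comfortable mitten for cats"},
]


def like_scan(rows, needle):
    """Emulate WHERE title LIKE '%needle%' OR description LIKE '%needle%'."""
    matches = []
    for row in rows:  # n iterations: every row is visited
        # Each substring test costs time proportional to the string length m.
        if needle in row["title"] or needle in row["description"]:
            matches.append(row["id"])
    return matches


print(like_scan(LISTINGS, "cat"))
```

Every row is inspected no matter how few rows match, which is where the n in O(n) comes from.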
CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id)
);
id   title
001  Cat hat
002  Vacation hat
003  Hats for cats
004  Kitten hat
005  Kitten mittens

title     id
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]
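The mapping from terms to ids can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: normalize is a crude stand-in (lowercase, drop one trailing "s") for the analysis a real search engine performs, and the names TITLES, normalize and build_inverted_index are invented for this example.

```python
from collections import defaultdict

TITLES = {
    "001": "Cat hat",
    "002": "Vacation hat",
    "003": "Hats for cats",
    "004": "Kitten hat",
    "005": "Kitten mittens",
}


def normalize(word):
    """Crude stand-in for real analysis: lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def build_inverted_index(docs):
    """Map each normalized term to the sorted list of ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            postings[normalize(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


title_index = build_inverted_index(TITLES)
print(title_index["cat"])
```

The work of splitting and normalizing happens once, when a document is added, so a query only needs a dictionary lookup.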
q=cat

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

title
key       value
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]

description
key    value
very   [001, 004]
good   [001]
hat    [001, 002, 003, 004]
cat    [001, 003, 005]
wear   [002]
beach  [002]
...    ...
Search Index Performance
O(1)
2 hash lookups (one per queried field) = constant time
O(1) + retrieval
Retrieving the matching documents takes time proportional to the number of results:
O(r)
r = number of results found
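A hedged sketch of the query side, assuming the two per-field indexes are plain Python dictionaries. The entries are copied from the slides (the description index is truncated there, so only some of its entries appear here); the function name search is hypothetical and this is not Solr's implementation.

```python
TITLE_INDEX = {
    "cat": ["001", "003"],
    "hat": ["001", "002", "003", "004"],
    "vacation": ["002"],
    "for": ["003"],
    "kitten": ["004", "005"],
    "mitten": ["005"],
}

# Only some of the description entries shown on the slides.
DESCRIPTION_INDEX = {
    "hat": ["001", "002", "003", "004"],
    "cat": ["001", "003", "005"],
    "wear": ["002"],
    "beach": ["002"],
}


def search(term, field_indexes):
    """Union the posting lists for one term across the queried fields."""
    result = set()
    for index in field_indexes:
        # One hash lookup per field, independent of how many documents exist.
        result.update(index.get(term, []))
    # Collecting and sorting touches only the r matching ids.
    return sorted(result)


print(search("cat", [TITLE_INDEX, DESCRIPTION_INDEX]))
```

The lookups are constant time; the remaining cost grows with the number of ids returned, not with the size of the collection.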
Text Search Quality
Part 1 ½
Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: SQL "LOWER"
SELECT * FROM listings
WHERE LOWER(title) LIKE '%cat%'
OR LOWER(description) LIKE '%cat%';
Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

Solution: check punctuation & whitespace for every word form
SELECT * FROM listings
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
...
Problem: missed relevant item
id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

SELECT * FROM listings
WHERE title LIKE '%cat%'
OR description LIKE '%cat%';

SELECT * FROM listings
WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats'
OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %'
OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %'
OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%'
OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %'
OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%'
OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
...
OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats'
OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
...
Let's solve it with a search index
Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4
q=cat

Solution: everything is lowercase
q=cat

title (case preserved)
key  value
cat  [003]
Cat  [001]

title (lowercased)
key  value
cat  [001, 003]
Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1
q=cat

Solution: tokenization & stemming
Tokenization: "Vacation hat" → ["vacation", "hat"]
Stemming:
"hats" → "hat"
"cats" → "cat"
"catlike" → "cat"
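Lowercasing, tokenization and stemming can be sketched as a small analysis function. The suffix list here is a hypothetical toy chosen only to reproduce the slide examples; real stemmers use far richer rules, and the names tokenize, stem and analyze are invented for this sketch.

```python
import re

# Hypothetical suffix list, just enough for the slide examples.
SUFFIXES = ("like", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full chain: lowercase, tokenize, then stem each token."""
    return [stem(token) for token in tokenize(text)]


print(analyze("Vacation hat"))
print(analyze("Hats for cats"))
```

Because "Vacation hat" becomes the whole tokens "vacation" and "hat", a query for "cat" no longer matches the letters hidden inside "vacation".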
Problem: missed relevant item
id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
q=cat

Solution: synonyms
q=cat

title (without synonyms)
key     value
cat     [001, 003]
kitten  [004, 005]

title (with synonyms)
key  value
cat  [001, 003, 004, 005]
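One way to get that merged posting list is to expand synonyms while indexing. The sketch below assumes a hand-written, hypothetical synonym map (SYNONYMS) and already-analyzed title tokens; it illustrates the idea only and does not reflect how any specific search engine configures synonyms.

```python
from collections import defaultdict

# Hypothetical synonym map: documents mentioning the key are also
# indexed under each listed term.
SYNONYMS = {"kitten": ["cat"]}

# Title tokens after lowercasing and stemming, as on the slides.
TITLE_TOKENS = {
    "001": ["cat", "hat"],
    "002": ["vacation", "hat"],
    "003": ["hat", "for", "cat"],
    "004": ["kitten", "hat"],
    "005": ["kitten", "mitten"],
}


def index_with_synonyms(docs_tokens, synonyms):
    """Build an inverted index, adding each token's synonyms at index time."""
    postings = defaultdict(set)
    for doc_id, tokens in docs_tokens.items():
        for token in tokens:
            postings[token].add(doc_id)
            for extra in synonyms.get(token, []):
                postings[extra].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


synonym_index = index_with_synonyms(TITLE_TOKENS, SYNONYMS)
print(synonym_index["cat"])
```

The query stays a single lookup for "cat"; the extra work and the extra postings are paid for once, when documents are indexed.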
TRADE-OFFS

Database
  O(n) text search
  Poor quality due to case sensitivity, substring mismatches, and missing terms

Search Engine
  O(r) text search (where r <= n)
  High quality due to case insensitivity, tokenization, stemming, and synonyms
  More disk space
  Do work at "index time"
Numeric Range Search
Part 2
Secret Santa for Cats
Find all the cat-related items under $15 in a database
github.com/toriagibbs/SecretSanta
By Rebecca Davis (pawsomecrochet.etsy.com)
SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;
CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);
Database Index

Rows in table order:
id     001    002    003    004    005
price  15.00  49.99  25.00  11.00  25.97

Index entries ordered by price:
id=004  id=001  id=003  id=005  id=002

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;
Database Performance
O(log n)
Log base 2 for a binary tree
Log base B for a B-tree
O(log n) + retrieval
O(log n + r)
n = number of rows in the database
r = number of results found
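The O(log n + r) behaviour can be illustrated with a sorted list and a binary search. This is a simplified stand-in for a B-tree, not a description of how a database stores its index; PRICE_INDEX and ids_with_price_at_most are names made up for this sketch, and prices are plain floats purely for readability.

```python
from bisect import bisect_right

# (price, id) pairs kept sorted by price, standing in for the index order.
PRICE_INDEX = sorted([
    (15.00, "001"),
    (49.99, "002"),
    (25.00, "003"),
    (11.00, "004"),
    (25.97, "005"),
])
PRICES = [price for price, _ in PRICE_INDEX]


def ids_with_price_at_most(limit):
    """Return ids with price <= limit, cheapest first."""
    end = bisect_right(PRICES, limit)  # O(log n) binary search
    # O(r): copy out only the r matching entries.
    return [doc_id for _, doc_id in PRICE_INDEX[:end]]


print(ids_with_price_at_most(15))
```

The binary search finds the boundary in logarithmic time, and everything before the boundary is a result, so no non-matching row is ever examined.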
n          log2(n)
10         3.32
100        6.64
1 000      9.97
10 000     13.29
100 000    16.61
1 000 000  19.93
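The figures in this table can be reproduced with Python's standard math.log2; the helper name log2_table is made up for this sketch.

```python
import math


def log2_table(sizes):
    """Return (n, log2(n) rounded to two decimals) for each table size."""
    return [(n, round(math.log2(n), 2)) for n in sizes]


for n, depth in log2_table([10, 100, 1_000, 10_000, 100_000, 1_000_000]):
    print(f"{n:>9,}  {depth:.2f}")
```

Multiplying the data size by ten adds only about 3.32 to the logarithm, which is why tree lookups stay cheap as tables grow.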
SIDEBAR
Why didn't we do this for text fields?!

Prefix Tree (Trie)
car
cat
ham
hat

Prefix Tree (Trie)
"car cat ham hat"

Database indexes for string fields can only search prefixes
Unless you declare a "full text" index like:
FULLTEXT (description)
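A small trie sketch shows why an ordinary index over whole strings answers prefix queries but not "word in the middle" queries. The classes and method names below are illustrative only and are not taken from any database.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.is_key = False  # True if a stored key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def keys_with_prefix(self, prefix):
        """Return every stored key that starts with the given prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_key:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


words = Trie()
for word in ["car", "cat", "ham", "hat"]:
    words.insert(word)

phrase = Trie()
phrase.insert("car cat ham hat")  # the whole field value is one key

print(words.keys_with_prefix("ca"))
print(phrase.keys_with_prefix("car"))
print(phrase.keys_with_prefix("cat"))
```

When the whole string "car cat ham hat" is a single key, it can be found by its prefix "car" but not by "cat", which sits in the middle of the string.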
SIDEBAR

Database
  O(r) text search
  Poor quality due to case sensitivity, substring mismatches, and missing terms

Search Engine
  O(r) text search
  High quality due to case insensitivity, tokenization, stemming, and synonyms
Back to numeric searching...
By Lacey Smith (hungupokanagan.etsy.com)

price
key    value
11.00  [004]
15.00  [001]
25.00  [003]
25.97  [005]
49.99  [002]
q=cat & fq=price:[* TO 15]

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

price
key    value
11.00  [004]
15.00  [001]
25.00  [003]
25.97  [005]
49.99  [002]

With one key per exact price, the range filter expands into a lookup for every possible value:
price=0.00 OR price=0.01 OR
price=0.02 OR price=0.03 OR
price=0.04 OR price=0.05 OR
price=0.06 OR price=0.07 OR
price=0.08 OR price=0.09 OR
…
price=14.93 OR price=14.94 OR
price=14.95 OR price=14.96 OR
price=14.97 OR price=14.98 OR
price=14.99 OR price=15.00
price
key                value
0 - 24.99          [001, 004]
  0 - 12.49        [004]
    11.00          [004]
  12.50 - 24.99    [001]
    15.00          [001]
25.00 - 49.99      [002, 003, 005]
  25.00 - 37.49    [003, 005]
    25.00          [003]
    25.97          [005]
  37.50 - 49.99    [002]
    49.99          [002]

fq=price:[25 TO 50]
  price(25.00 - 49.99)
  U price(50.00)

fq=price:[* TO 40]
  price(0 - 24.99)
  U price(25.00 - 37.49)
  U price(37.50)
  U price(37.51)
  U price(37.52)
  ...
  U price(40.00)
price
key                  value
0 - 24.99            [001, 004]
  0 - 12.49          [004]
    ...              ...
    11.00            [004]
  12.50 - 24.99      [001]
    12.50 - 12.99
    13.00 - 13.49
    ...              ...
    15.00 - 15.49    [001]
      15.00          [001]
    ...              ...

fq=price:[* TO 15]
  price(0 - 12.49)
  U price(12.50 - 12.99)
  U price(13.00 - 13.49)
  U price(13.50 - 13.99)
  U price(14.00 - 14.49)
  U price(14.50 - 14.99)
  U price(15.00)
Search Index Performance
O(log (max-min))
For the max and min values of the field
O(1)
The number of buckets doesn't change with the size of the data
O(r)
r = number of results found
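The bucketing idea can be sketched as follows. This is an illustration of the approach on the slides, not Solr's actual numeric field implementation; the bucket widths (in cents) and the names WIDTHS, build_bucket_index, cover and range_search are assumptions chosen so that the example reproduces the decomposition shown for fq=price:[* TO 15].

```python
from collections import defaultdict

# Bucket widths in cents, widest first. Each width divides the one before
# it, so narrower buckets nest cleanly inside wider ones.
WIDTHS = (2500, 1250, 50, 1)

PRICES_CENTS = {"001": 1500, "002": 4999, "003": 2500, "004": 1100, "005": 2597}


def build_bucket_index(values):
    """Index every value under one bucket per width (done at index time)."""
    index = defaultdict(set)
    for doc_id, cents in values.items():
        for width in WIDTHS:
            index[(width, cents // width)].add(doc_id)
    return index


def cover(lo, hi):
    """Greedily split the inclusive range [lo, hi] into aligned buckets."""
    buckets = []
    pos = lo
    while pos <= hi:
        for width in WIDTHS:
            # Take the widest bucket that starts at pos and fits in the range;
            # width 1 always fits, so the loop always advances.
            if pos % width == 0 and pos + width - 1 <= hi:
                buckets.append((width, pos // width))
                pos += width
                break
    return buckets


def range_search(index, lo, hi):
    """Union the posting sets of the buckets that cover [lo, hi]."""
    hits = set()
    for bucket in cover(lo, hi):
        hits |= index.get(bucket, set())
    return sorted(hits)


bucket_index = build_bucket_index(PRICES_CENTS)
print(cover(0, 1500))
print(range_search(bucket_index, 0, 1500))
```

For the range 0 to 15.00 the cover consists of seven buckets (0 - 12.49, five 50-cent buckets, and the exact value 15.00) however many listings are indexed, so the query cost is dominated by collecting the r results.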
                      Database       Search Engine
Text search           O(n)           O(r) (where r <= n)
Quality               Poor quality   High quality
Numeric range search  O(log n + r)   O(r)
Storage
Part 3
CREATE TABLE listings (
id bigint(20),
title varchar(1024),
description longtext,
price decimal(10,2),
quantity int(8),
PRIMARY KEY (id),
KEY (price)
);
SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
AND price <= 15;
<schema name="listings">
  <fields>
    <field name="id" type="int20" required="true" indexed="true" stored="true"/>
    <field name="title" type="text" required="true" indexed="true" stored="false"/>
    <field name="description" type="text" required="true" indexed="true" stored="false"/>
    <field name="price" type="long" required="true" indexed="true" stored="false"/>
    <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
  </fields>
</schema>

q=cat & fq=price:[* TO 15]

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>
Only the id is stored:
<schema name="listings">
  <fields>
    <field name="id" type="int20" stored="true"/>
    <field name="title" type="text" stored="false"/>
    <field name="description" type="text" stored="false"/>
    <field name="price" type="long" stored="false"/>
    <field name="quantity" type="int8" stored="false"/>
  </fields>
</schema>

Every field is stored:
<schema name="listings">
  <fields>
    <field name="id" type="int20" stored="true"/>
    <field name="title" type="text" stored="true"/>
    <field name="description" type="text" stored="true"/>
    <field name="price" type="long" stored="true"/>
    <field name="quantity" type="int8" stored="true"/>
  </fields>
</schema>
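The difference between an indexed field and a stored field can be sketched with a toy class. TinyIndex and its methods are hypothetical and greatly simplified: every field is indexed for lookup, but only the fields named in stored_fields keep their original values for returning with results.

```python
class TinyIndex:
    """Toy index: all fields are searchable, only some are stored."""

    def __init__(self, stored_fields):
        self.stored_fields = set(stored_fields)
        self.postings = {}  # (field, token) -> set of document ids
        self.stored = {}    # id -> {field: original value}

    def add(self, doc):
        doc_id = doc["id"]
        for field, value in doc.items():
            for token in str(value).lower().split():
                self.postings.setdefault((field, token), set()).add(doc_id)
        # Keep original values only for the fields marked as stored.
        self.stored[doc_id] = {
            field: value for field, value in doc.items()
            if field in self.stored_fields
        }

    def search(self, field, token):
        ids = sorted(self.postings.get((field, token), set()))
        return [self.stored[doc_id] for doc_id in ids]


DOC = {"id": "001", "title": "Cat hat", "price": 1500}

ids_only = TinyIndex(stored_fields=["id"])
ids_only.add(DOC)

everything = TinyIndex(stored_fields=["id", "title", "price"])
everything.add(DOC)

print(ids_only.search("title", "cat"))
print(everything.search("title", "cat"))
```

With only the id stored, a search returns ids and the full record has to come from somewhere else; with every field stored, the index holds a complete copy of the data.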
A search index is not a database index
But a search engine can totally be a database
Don't do it
By Darcy Quinn (riotcakes.etsy.com)
                      Database           Search Engine
Text search           O(n)               O(r) (where r <= n) ✓
Quality               Poor quality       High quality ✓
Numeric range search  O(log n + r)       O(r) ✓
Storage               Good at storage ✓  'Meh' at storage
By Ashley Fehribach
furballfanatic.etsy.com
@nerdymathlete
Thank you
Toria Gibbs
Senior Software Engineer @ Etsy
@scarletdrive

A Search Index is Not a Database Index - Full Stack Toronto 2017

  • 1. A Search Index is not A Database Index Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive
  • 2.
  • 6. They hired me! 6 (even though I was wrong)
  • 7. Agenda 0: Terminology 1: Text Search 2: Numeric Range Search 3: Storage
  • 8. Terminology Database Table Schema Column Row 8 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 9. Terminology Database Table Schema Column Row 9 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 10. Terminology Database Table Schema Column Row 10 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 11. Terminology Database Table Schema Column Row 11 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog id: integer name: string Breed: string
  • 12. Terminology Database Table Schema Column Row 12 id name breed 001 Momo Cat 002 Naga Cat 003 Sullivan Dog pets id: integer name: string Breed: string id name 001 Toria 002 Colleen humans id: integer name: string human_id pet_id 001 001 001 002 002 003 owners human_id: int pet_id: int
  • 13. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document 13
  • 14. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index 14 ?
  • 15. Terminology Database Search Engine Table Search Index Schema Schema Column Field Row Document Database Index Inverted Index 15
  • 16. 16
  • 18. By Rebecca Davis pawsomecrochet.etsy.com Secret Santa for Cats Find all the cat-related items in a database github.com/toriagibbs/SecretSanta
  • 19. 19 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 20. 20 SELECT * FROM listings WHERE title LIKE “%cat%” OR description LIKE “%cat%”;
  • 21. Database Performance n*m 21 n = number of rows in the database m = length of strings
  • 22. Database Performance O(n) n = number of rows in the database 22
  • 23. 23 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 004 Kitten hat This is a very small hat, for kittens particularly $11.00 2 005 Kitten mittens Finally! An elegant, comfortable mitten for cats $25.97 18
  • 24. 24 CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  • 25. 25 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens
  • 26. 26 id title 001 Cat hat 002 Vacation hat 003 Hats for cats 004 Kitten hat 005 Kitten mittens title id cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005]
  • 27. 27 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 28. 28 key value cat [001, 003] hat [001, 002, 003, 004] vacation [002] for [003] kitten [004, 005] mitten [005] key value very [001] good [001] hat [001, 002, 003, 004] cat [001, 003, 005] wear [002] beach [002] ... ... q=cat <requestHandler name=”myHandler” default=true> <lst name=”defaults”> <str name=”qf”>title description</str> </lst> </requestHandler> title description
  • 29. Search Index Performance O(1) 2 hash lookups = constant time 29
  • 30. Search Index Performance O(1) + retrieval 2 hash lookups = constant time 30
  • 31. Search Index Performance O(r) r = number of results found 31
  • 33. 33 id title description price quantity 001 Cat hat A very good hat for very good cats $15.00 4 Problem: case sensitivity SELECT * FROM listings WHERE title LIKE “%cat%” OR description LIKE “%cat%”;
  • 34. SELECT * FROM listings WHERE LOWER(title) LIKE “%cat%” OR LOWER(description) LIKE “%cat%”; 34 Solution: SQL “LOWER”
  • 35. id title description price quantity 002 Vacation hat Wear this hat to the beach maybe $49.99 22 003 Hats for cats A set of three hats for the most extreme cat people $25.00 1 35 Problem: hidden substring SELECT * FROM listings WHERE title LIKE “%cat%” OR description LIKE “%cat%”;
  • 36. Solution: check punctuation & whitespace for every word form
        SELECT * FROM listings WHERE
          title LIKE 'cat' OR title LIKE 'cats'
          OR title LIKE 'cat %' OR title LIKE 'cats %'
          OR title LIKE '% cat' OR title LIKE '% cats'
          OR title LIKE '% cat %' OR title LIKE '% cats %'
          OR title LIKE '% cat.%' OR title LIKE '% cats.%'
          OR title LIKE '%.cat %' OR title LIKE '%.cats %'
          ...
  • 37. Problem: missed relevant item
        SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%';
        id   title       description                                         price   quantity
        004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
  • 38. SELECT * FROM listings WHERE
          LOWER(title) = 'cat' OR LOWER(title) = 'cats' OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
          OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %' OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
          OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %' OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
          OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%' OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
          OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %' OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
          OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%' OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
          ...
          OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats' OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
          ...
  • 39. Let’s solve it with a search index
  • 40. Problem: case sensitivity
        q=cat
        id   title    description                         price   quantity
        001  Cat hat  A very good hat for very good cats  $15.00  4
  • 41. Solution: everything is lowercase
        q=cat
        title index, before lowercasing:
        key  value
        cat  [003]
        Cat  [001]
        title index, after lowercasing:
        key  value
        cat  [001, 003]
  • 42. Problem: hidden substring
        q=cat
        id   title          description                                          price   quantity
        002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
        003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1
  • 43. Solution: tokenization & stemming
        Tokenization: “Vacation hat” → [“vacation”, “hat”]
        Stemming: “hats” → “hat”, “cats” → “cat”, “catlike” → “cat”
  • 44. Problem: missed relevant item
        q=cat
        id   title       description                                         price   quantity
        004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
  • 45. Solution: synonyms
        q=cat
        title index, before synonyms:
        key     value
        cat     [001, 003]
        kitten  [004, 005]
        title index, after synonyms:
        key     value
        cat     [001, 003, 004, 005]
  • 46. Database: O(n) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms.
        Search Engine: O(r) text search (where r <= n). High quality due to case insensitivity, tokenization, stemming, and synonyms.
  • 47. TRADE-OFFS: more disk space; do work at “index time”
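The "index time" work from Part 1 (lowercasing, tokenization, stemming, synonyms) might look roughly like the sketch below. The stemming rule and the synonym table are deliberately crude stand-ins chosen for this example; production analyzers use far more careful algorithms (Porter-style stemmers, curated synonym files). `analyze` and `build_index` are hypothetical names.

```python
import re

TITLES = {
    "001": "Cat hat",
    "002": "Vacation hat",
    "003": "Hats for cats",
    "004": "Kitten hat",
    "005": "Kitten mittens",
}

# Assumed synonym table: map "kitten" onto "cat" at index time.
SYNONYMS = {"kitten": "cat"}


def stem(token):
    """Naive stemmer: drop one trailing 's' from tokens longer than 3 letters."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, tokenize on letters/digits, stem, then apply synonyms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(stem(t), stem(t)) for t in tokens]


def build_index(docs):
    """Inverted index over analyzed tokens; all the work happens here, once."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


index = build_index(TITLES)
query_term = analyze("Cats")[0]   # the query goes through the same analysis
print(sorted(index[query_term]))  # ['001', '003', '004', '005']
```

This is the trade-off in miniature: the index costs extra space and extra processing when documents are added, and in exchange the query is a single lookup with no `LIKE` patterns.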
  • 49. By Rebecca Davis pawsomecrochet.etsy.com Secret Santa for Cats Find all the cat-related items under $15 in a database github.com/toriagibbs/SecretSanta
  • 50. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  • 51. CREATE TABLE listings (
          id bigint(20),
          title varchar(1024),
          description longtext,
          price decimal(10,2),
          quantity int(8),
          PRIMARY KEY (id)
        );
  • 52. CREATE TABLE listings (
          id bigint(20),
          title varchar(1024),
          description longtext,
          price decimal(10,2),
          quantity int(8),
          PRIMARY KEY (id),
          KEY (price)
        );
  • 53. Database Index
        price  id
        15.00  001
        49.99  002
        25.00  003
        11.00  004
        25.97  005
        Index nodes (ordered by price): id=004, id=001, id=003, id=005, id=002
  • 54. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
        price  id
        15.00  001
        49.99  002
        25.00  003
        11.00  004
        25.97  005
        Index nodes (ordered by price): id=004, id=001, id=003, id=005, id=002
  • 55. Database Performance: O(log n). Log base 2 for a binary tree; log base B for a B-tree.
  • 56. Database Performance: O(log n) + retrieval. Log base 2 for a binary tree; log base B for a B-tree.
  • 57. Database Performance: O(log n + r), where n = number of rows in the database and r = number of results found.
  • 58. n          log2 n
        10         3.32
        100        6.64
        1 000      9.97
        10 000     13.29
        100 000    16.61
        1 000 000  19.93
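As a rough illustration of where O(log n + r) comes from, the sketch below uses a sorted list plus binary search as a stand-in for the ordered B-tree index on `price`. A real database index is a disk-backed tree, not a Python list; the prices are the ones shown on the slides, and `ids_with_price_at_most` is a hypothetical helper.

```python
from bisect import bisect_right

# Prices from the slides, keyed by listing id.
PRICES = {"001": 15.00, "002": 49.99, "003": 25.00, "004": 11.00, "005": 25.97}

# Sorted (price, id) pairs stand in for the ordered entries of the index.
price_index = sorted((price, doc_id) for doc_id, price in PRICES.items())
sorted_prices = [price for price, _ in price_index]


def ids_with_price_at_most(limit):
    """Binary search for the cut-off (O(log n)), then read r entries (O(r))."""
    end = bisect_right(sorted_prices, limit)
    return [doc_id for _, doc_id in price_index[:end]]


print(ids_with_price_at_most(15))  # ['004', '001']
```

The binary search is the log n term and the slice is the r term; as the table above shows, the log n part stays small even for a million rows.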
  • 59. SIDEBAR: Why didn’t we do this for text fields?!
  • 61. SIDEBAR: Prefix Tree (Trie), “car cat ham hat”
  • 62. SIDEBAR: Database indexes for string fields can only search prefixes, unless you declare a “full text” index like: FULLTEXT (description)
  • 63. SIDEBAR
        Database: O(r) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms.
        Search Engine: O(r) text search. High quality due to case insensitivity, tokenization, stemming, and synonyms.
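To make the sidebar's point about prefixes concrete, here is a small trie over the four words from the slide. It is a minimal sketch of the data structure only, not of how any particular database stores string indexes; `TrieNode`, `insert`, and `words_with_prefix` are names made up for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def words_with_prefix(root, prefix):
    """Walk down the prefix, then collect every word below that node."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.is_word:
            found.append(text)
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(found)


root = TrieNode()
for word in "car cat ham hat".split():
    insert(root, word)

print(words_with_prefix(root, "ca"))  # ['car', 'cat']
print(words_with_prefix(root, "at"))  # [] : "at" is a suffix, not a prefix
```

The second call shows the limitation: the structure can only be entered from the first character, so a pattern like `'%at'` or `'%cat%'` gets no help from it.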
  • 65. price index:
        key    value
        11.00  [004]
        15.00  [001]
        25.00  [003]
        25.97  [005]
        49.99  [002]
  • 66. q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
        price index:
        key    value
        11.00  [004]
        15.00  [001]
        25.00  [003]
        25.97  [005]
        49.99  [002]
  • 67. q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
        The range expands to one lookup per possible value:
        price=0.00 OR price=0.01 OR price=0.02 OR price=0.03 OR price=0.04 OR price=0.05 OR price=0.06 OR price=0.07 OR price=0.08 OR price=0.09 OR … price=14.93 OR price=14.94 OR price=14.95 OR price=14.96 OR price=14.97 OR price=14.98 OR price=14.99 OR price=15.00
        price index:
        key    value
        11.00  [004]
        15.00  [001]
        25.00  [003]
        25.97  [005]
        49.99  [002]
  • 68. price index:
        key                value
        0 - 24.99          [001, 004]
          0 - 12.49        [004]
            11.00          [004]
          12.50 - 24.99    [001]
            15.00          [001]
        25.00 - 49.99      [002, 003, 005]
          25.00 - 37.49    [003, 005]
            25.00          [003]
            25.97          [005]
          37.50 - 49.99    [002]
            49.99          [002]
        fq=price:[25 TO 50] → price(25.00 - 49.99) U price(50.00)
        fq=price:[* TO 40] → price(0 - 24.99) U price(25.00 - 37.49) U price(37.50) U price(37.51) U price(37.52) ... U price(40.00)
  • 69. price index:
        key                  value
        0 - 24.99            [001, 004]
          0 - 12.49          [004]
            ...              ...
            11.00            [004]
          12.50 - 24.99      [001]
            12.50 - 12.99
            13.00 - 13.49
            ...              ...
            15.00 - 15.49    [001]
              15.00          [001]
        ...                  ...
        fq=price:[* TO 15] → price(0 - 12.49) U price(12.50 - 12.99) U price(13.00 - 13.49) U price(13.50 - 13.99) U price(14.00 - 14.49) U price(14.50 - 14.99) U price(15.00)
  • 70. price index:
        key                  value
        0 - 24.99            [001, 004]
          0 - 12.49          [004]
            ...              ...
            11.00            [004]
          12.50 - 24.99      [001]
            12.50 - 12.99
            13.00 - 13.49
            ...              ...
            15.00 - 15.49    [001]
              15.00          [001]
        ...                  ...
  • 71. Search Index Performance: O(log (max-min)), for the max and min values of the field.
  • 72. Search Index Performance: O(1). The number of buckets doesn’t change with the size of the data.
  • 73. Search Index Performance: O(r), where r = number of results found.
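The bucketed price index can be sketched as follows. Every price is indexed under several nested bucket widths, and a range query is covered greedily with the widest buckets that fit, so the number of lookups depends on the value range rather than on how many documents exist. The widths (25.00, 12.50, 0.50, and 0.01, expressed in cents) are assumptions read off the slides' example, and `build_bucket_index`, `cover`, and `range_search` are hypothetical names; this is not how any specific search engine lays out its numeric fields.

```python
# Prices from the slides, converted to integer cents to avoid float issues.
PRICES_IN_CENTS = {"001": 1500, "002": 4999, "003": 2500, "004": 1100, "005": 2597}

# Assumed bucket widths in cents, widest first. Each width divides the previous one.
WIDTHS = [2500, 1250, 50, 1]


def build_bucket_index(prices_in_cents):
    """Index every document under one bucket per width: (width, bucket start)."""
    index = {}
    for doc_id, cents in prices_in_cents.items():
        for width in WIDTHS:
            start = cents - cents % width
            index.setdefault((width, start), set()).add(doc_id)
    return index


def cover(lo, hi):
    """Cover the inclusive range [lo, hi] with the widest aligned buckets that fit."""
    buckets = []
    pos = lo
    while pos <= hi:
        for width in WIDTHS:
            if pos % width == 0 and pos + width - 1 <= hi:
                buckets.append((width, pos))
                pos += width
                break  # width 1 always fits, so the loop always advances
    return buckets


def range_search(index, lo, hi):
    """Union the posting lists of the covering buckets."""
    hits = set()
    for bucket in cover(lo, hi):
        hits |= index.get(bucket, set())
    return sorted(hits)


index = build_bucket_index(PRICES_IN_CENTS)
print(len(cover(0, 1500)))            # 7 buckets for price:[* TO 15]
print(range_search(index, 0, 1500))   # ['001', '004']
```

With these widths, `cover(0, 1500)` yields the same seven buckets listed for `fq=price:[* TO 15]` on slide 69, and `cover(2500, 5000)` yields the two buckets shown for `fq=price:[25 TO 50]` on slide 68.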
  • 74. Database                           Search Engine
        O(n) text search                   O(r) text search (where r <= n)
        Poor quality                       High quality
  • 75. Database                           Search Engine
        O(n) text search                   O(r) text search (where r <= n)
        Poor quality                       High quality
        O(log n + r) numeric range search
  • 76. Database                           Search Engine
        O(n) text search                   O(r) text search (where r <= n)
        Poor quality                       High quality
        O(log n + r) numeric range search  O(r) numeric range search
  • 78. CREATE TABLE listings (
          id bigint(20),
          title varchar(1024),
          description longtext,
          price decimal(10,2),
          quantity int(8),
          PRIMARY KEY (id),
          KEY (price)
        );
        SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  • 79. <schema name="listings">
          <fields>
            <field name="id" type="int20" required="true" indexed="true" stored="true"/>
            <field name="title" type="text" required="true" indexed="true" stored="false"/>
            <field name="description" type="text" required="true" indexed="true" stored="false"/>
            <field name="price" type="long" required="true" indexed="true" stored="false"/>
            <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
          </fields>
        </schema>
        q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
  • 80. <schema name="listings">
          <fields>
            <field name="id" type="int20" stored="true"/>
            <field name="title" type="text" stored="false"/>
            <field name="description" type="text" stored="false"/>
            <field name="price" type="long" stored="false"/>
            <field name="quantity" type="int8" stored="false"/>
          </fields>
        </schema>
  • 81. <schema name="listings">
          <fields>
            <field name="id" type="int20" stored="true"/>
            <field name="title" type="text" stored="true"/>
            <field name="description" type="text" stored="true"/>
            <field name="price" type="long" stored="true"/>
            <field name="quantity" type="int8" stored="true"/>
          </fields>
        </schema>
  • 82. A search index is not a database index. But a search engine can totally be a database.
  • 83. Don’t do it By Darcy Quinn riotcakes.etsy.com
  • 84. Database                           Search Engine
        O(n) text search                   O(r) text search (where r <= n)
        Poor quality                       High quality
        O(log n + r) numeric range search  O(r) numeric range search
        Good at storage                    ‘Meh’ at storage
        ✓ ✓ ✓ ✓
  • 87. Thank you Toria Gibbs Senior Software Engineer @ Etsy @scarletdrive