A Search Index is Not a Database Index - Full Stack Toronto 2017
A search engine is not a database. Search systems are optimized for fast search using an internal data structure called an inverted index. Databases have a similar feature to allow quick access, also called an index, but it’s a totally different thing!

In this talk, Toria Gibbs will take you on a tour of the internals of a search index, comparing it to common implementations of indexing in relational databases. We’ll see how search engines can outperform databases and discuss the tradeoffs in implementing and maintaining such a system. No prior knowledge of database or search index implementations required; experience creating or querying database tables will be helpful.


  1. A Search Index is not A Database Index. Toria Gibbs, Senior Software Engineer @ Etsy, @scarletdrive
  2. Story time!
  3. Search Index. Database Index.
  4. They hired me!
  5. They hired me! (even though I was wrong)
  6. Agenda. 0: Terminology. 1: Text Search. 2: Numeric Range Search. 3: Storage.
  7. Terminology: Database, Table, Schema, Column, Row

      id   name      breed
      001  Momo      Cat
      002  Naga      Cat
      003  Sullivan  Dog

      Schema: id: integer, name: string, breed: string
  11. Terminology: Database, Table, Schema, Column, Row

      pets (id: integer, name: string, breed: string)
      id   name      breed
      001  Momo      Cat
      002  Naga      Cat
      003  Sullivan  Dog

      humans (id: integer, name: string)
      id   name
      001  Toria
      002  Colleen

      owners (human_id: int, pet_id: int)
      human_id  pet_id
      001       001
      001       002
      002       003
  12. Terminology

      Database  Search Engine
      Table     Search Index
      Schema    Schema
      Column    Field
      Row       Document
  13. Terminology: the same mapping, plus one open question

      Database Index  ?
  14. Terminology: the same mapping, with the question answered

      Database Index  Inverted Index
  16. Text Search, Part 1
  17. Secret Santa for Cats: find all the cat-related items in a database. github.com/toriagibbs/SecretSanta. By Rebecca Davis, pawsomecrochet.etsy.com
  18. Example listings:

      id   title           description                                          price   quantity
      001  Cat hat         A very good hat for very good cats                   $15.00  4
      002  Vacation hat    Wear this hat to the beach maybe                     $49.99  22
      003  Hats for cats   A set of three hats for the most extreme cat people  $25.00  1
      004  Kitten hat      This is a very small hat, for kittens particularly   $11.00  2
      005  Kitten mittens  Finally! An elegant, comfortable mitten for cats     $25.97  18
  19. SELECT * FROM listings
      WHERE title LIKE '%cat%'
      OR description LIKE '%cat%';
  20. Database Performance: n*m, where n = number of rows in the database and m = length of strings
  21. Database Performance: O(n), where n = number of rows in the database
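
The step from n*m to O(n) on slides 20 and 21 can be pictured as a full scan. The following is a minimal Python sketch, not how any particular database executes LIKE; `LISTINGS` and `like_scan` are hypothetical names, and the rows are copied from slide 18.

```python
# Rows from slide 18 as (id, title, description); an in-memory stand-in for the table.
LISTINGS = [
    ("001", "Cat hat", "A very good hat for very good cats"),
    ("002", "Vacation hat", "Wear this hat to the beach maybe"),
    ("003", "Hats for cats", "A set of three hats for the most extreme cat people"),
    ("004", "Kitten hat", "This is a very small hat, for kittens particularly"),
    ("005", "Kitten mittens", "Finally! An elegant, comfortable mitten for cats"),
]


def like_scan(rows, needle):
    """Emulate WHERE title LIKE '%needle%' OR description LIKE '%needle%'."""
    matches = []
    for row_id, title, description in rows:  # n iterations: every row is visited
        # Each substring test walks a string of length up to m.
        if needle in title or needle in description:
            matches.append(row_id)
    return matches


print(like_scan(LISTINGS, "cat"))
```

With the slide 18 data this returns 001, 002, 003 and 005. Row 002 matches only because "Vacation" contains "cat", and row 001 matches through its description rather than its capitalized title.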
  23. CREATE TABLE listings (
        id bigint(20),
        title varchar(1024),
        description longtext,
        price decimal(10,2),
        quantity int(8),
        PRIMARY KEY (id)
      );
  24. Titles only:

      id   title
      001  Cat hat
      002  Vacation hat
      003  Hats for cats
      004  Kitten hat
      005  Kitten mittens
  25. The same titles, followed by the inverted index built from them:

      id   title
      001  Cat hat
      002  Vacation hat
      003  Hats for cats
      004  Kitten hat
      005  Kitten mittens

      title     id
      cat       [001, 003]
      hat       [001, 002, 003, 004]
      vacation  [002]
      for       [003]
      kitten    [004, 005]
      mitten    [005]
  26. q=cat

      <requestHandler name="myHandler" default="true">
        <lst name="defaults">
          <str name="qf">title description</str>
        </lst>
      </requestHandler>

      title index:
      key       value
      cat       [001, 003]
      hat       [001, 002, 003, 004]
      vacation  [002]
      for       [003]
      kitten    [004, 005]
      mitten    [005]

      description index:
      key    value
      very   [001]
      good   [001]
      hat    [001, 002, 003, 004]
      cat    [001, 003, 005]
      wear   [002]
      beach  [002]
      ...    ...
  28. Search Index Performance: O(1). 2 hash lookups = constant time
  29. Search Index Performance: O(1) + retrieval. 2 hash lookups = constant time
  30. Search Index Performance: O(r), where r = number of results found
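
The "2 hash lookups" on slides 28 to 30 can be sketched with plain Python dictionaries. This is only an illustration of the idea, not the search engine's implementation; the postings are copied from slide 26 (the description index is partial there, and stays partial here), and `search` is a hypothetical helper.

```python
# Inverted indexes from slide 26: term -> list of listing ids (postings).
TITLE_INDEX = {
    "cat": ["001", "003"],
    "hat": ["001", "002", "003", "004"],
    "vacation": ["002"],
    "for": ["003"],
    "kitten": ["004", "005"],
    "mitten": ["005"],
}
DESCRIPTION_INDEX = {  # partial, as on the slide
    "very": ["001"],
    "good": ["001"],
    "hat": ["001", "002", "003", "004"],
    "cat": ["001", "003", "005"],
    "wear": ["002"],
    "beach": ["002"],
}


def search(term, indexes):
    """Look the term up once per queried field and merge the postings."""
    hits = set()
    for index in indexes:  # one hash lookup per field: constant time each
        hits.update(index.get(term, []))  # collecting postings costs O(r)
    return sorted(hits)  # sorted only to make the output stable


# q=cat with qf="title description"
print(search("cat", [TITLE_INDEX, DESCRIPTION_INDEX]))
```

The lookups themselves do not depend on how many listings exist; the remaining cost is proportional to the number of ids returned.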
  31. Text Search Quality, Part 1 ½
  32. Problem: case sensitivity

      id   title    description                         price   quantity
      001  Cat hat  A very good hat for very good cats  $15.00  4

      SELECT * FROM listings
      WHERE title LIKE '%cat%'
      OR description LIKE '%cat%';
  33. Solution: SQL "LOWER"

      SELECT * FROM listings
      WHERE LOWER(title) LIKE '%cat%'
      OR LOWER(description) LIKE '%cat%';
  34. Problem: hidden substring

      id   title          description                                          price   quantity
      002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
      003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

      SELECT * FROM listings
      WHERE title LIKE '%cat%'
      OR description LIKE '%cat%';
  35. Solution: check punctuation & whitespace for every word form

      SELECT * FROM listings
      WHERE title LIKE 'cat' OR title LIKE 'cats'
      OR title LIKE 'cat %' OR title LIKE 'cats %'
      OR title LIKE '% cat' OR title LIKE '% cats'
      OR title LIKE '% cat %' OR title LIKE '% cats %'
      OR title LIKE '% cat.%' OR title LIKE '% cats.%'
      OR title LIKE '%.cat %' OR title LIKE '%.cats %'
      ...
  36. Problem: missed relevant item

      SELECT * FROM listings
      WHERE title LIKE '%cat%'
      OR description LIKE '%cat%';

      id   title       description                                         price   quantity
      004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
  37. SELECT * FROM listings
      WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats'
      OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
      OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %'
      OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
      OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %'
      OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
      OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%'
      OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
      OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %'
      OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
      OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%'
      OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
      ...
      OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats'
      OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
      ...
  38. Let’s solve it with a search index
  39. Problem: case sensitivity. q=cat

      id   title    description                         price   quantity
      001  Cat hat  A very good hat for very good cats  $15.00  4
  40. Solution: everything is lowercase. q=cat

      title index before:
      key  value
      cat  [003]
      Cat  [001]

      title index after:
      key  value
      cat  [001, 003]
  41. Problem: hidden substring. q=cat

      id   title          description                                          price   quantity
      002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
      003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1
  42. Solution: tokenization & stemming

      Tokenization: “Vacation hat” → [“vacation”, “hat”]
      Stemming: “hats” → “hat”, “cats” → “cat”, “catlike” → “cat”
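
The index-time analysis on slide 42 can be sketched in a few lines of Python. The stemmer below is a toy that covers only the slide's examples (a trailing "like" and a trailing plural "s"); it is an assumption for illustration, not the algorithm a real search engine ships, and `stem` and `analyze` are hypothetical names.

```python
import re


def stem(token):
    """Toy stemmer: strips a trailing 'like', then a trailing plural 's'."""
    if token.endswith("like") and len(token) > 4:
        token = token[:-4]
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token


def analyze(text):
    """Lowercase, split into alphanumeric tokens, and stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens]


print(analyze("Vacation hat"))
print(analyze("Hats for cats"))
print(stem("catlike"))
```

Because "Vacation hat" is indexed as the whole tokens "vacation" and "hat", a query for "cat" no longer matches it, while "Hats for cats" is indexed under "cat".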
  43. Problem: missed relevant item. q=cat

      id   title       description                                         price   quantity
      004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2
  44. Solution: synonyms. q=cat

      title index before:
      key     value
      cat     [001, 003]
      kitten  [004, 005]

      title index after:
      key  value
      cat  [001, 003, 004, 005]
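
One way to arrive at the "after" index on slide 44 is to map synonyms onto a single term while indexing. The sketch below assumes that approach; the slide does not say whether synonyms are applied at index time or at query time. `ANALYZED_TITLES`, `SYNONYMS` and `build_index` are hypothetical names, and the token lists match the title index on slide 25.

```python
from collections import defaultdict

# Titles after lowercasing, tokenization and stemming (see slide 25).
ANALYZED_TITLES = {
    "001": ["cat", "hat"],
    "002": ["vacation", "hat"],
    "003": ["hat", "for", "cat"],
    "004": ["kitten", "hat"],
    "005": ["kitten", "mitten"],
}

# Assumed synonym table: every key is indexed under its value instead.
SYNONYMS = {"kitten": "cat"}


def build_index(docs, synonyms):
    """Build term -> ids, replacing each token by its canonical synonym."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set avoids listing a document twice under the same term.
        for term in {synonyms.get(token, token) for token in docs[doc_id]}:
            index[term].append(doc_id)
    return dict(index)


index = build_index(ANALYZED_TITLES, SYNONYMS)
print(index["cat"])
```

The work happens once, when a listing is indexed, so the query for "cat" stays a single lookup.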
  45. Database: O(n) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms.
      Search Engine: O(r) text search (where r <= n). High quality due to case insensitivity, tokenization, stemming, and synonyms.
  46. TRADE-OFFS: More disk space. Do work at “index time”.
  47. Numeric Range Search, Part 2
  48. Secret Santa for Cats: find all the cat-related items under $15 in a database. github.com/toriagibbs/SecretSanta. By Rebecca Davis, pawsomecrochet.etsy.com
  49. SELECT * FROM listings
      WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
      AND price <= 15;
  50. CREATE TABLE listings (
        id bigint(20),
        title varchar(1024),
        description longtext,
        price decimal(10,2),
        quantity int(8),
        PRIMARY KEY (id)
      );
  51. CREATE TABLE listings (
        id bigint(20),
        title varchar(1024),
        description longtext,
        price decimal(10,2),
        quantity int(8),
        PRIMARY KEY (id),
        KEY (price)
      );
  52. Database Index

      price  15.00  49.99  25.00  11.00  25.97
      id     001    002    003    004    005

      Index entries, ordered by price: id=004, id=001, id=003, id=005, id=002
  53. The same index, queried with:

      SELECT * FROM listings
      WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
      AND price <= 15;
  54. Database Performance: O(log n). Log base 2 for a binary tree, log base B for a B-tree.
  55. Database Performance: O(log n) + retrieval. Log base 2 for a binary tree, log base B for a B-tree.
  56. Database Performance: O(log n + r), where n = number of rows in the database and r = number of results found
  57. n          log2 n
      10         3.32
      100        6.64
      1 000      9.97
      10 000     13.29
      100 000    16.61
      1 000 000  19.93
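
The O(log n + r) cost on slide 56 can be illustrated with a binary search over sorted prices. A sorted Python list stands in for the tree here, which is a simplification and not how a database stores a B-tree; `PRICE_INDEX` and `ids_with_price_at_most` are hypothetical names, and the entries follow the price order on slide 52.

```python
from bisect import bisect_right

# (price, id) pairs in index order, as on slide 52.
PRICE_INDEX = [
    (11.00, "004"),
    (15.00, "001"),
    (25.00, "003"),
    (25.97, "005"),
    (49.99, "002"),
]
PRICES = [price for price, _ in PRICE_INDEX]


def ids_with_price_at_most(limit):
    """Return ids for price <= limit: one binary search, then read r entries."""
    end = bisect_right(PRICES, limit)  # O(log n) to find the cut-off position
    return [doc_id for _, doc_id in PRICE_INDEX[:end]]  # O(r) to collect results


print(ids_with_price_at_most(15))
```

The text conditions in the query on slide 53 would still have to be checked against each of the r rows that the price index returns.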
  58. SIDEBAR: Why didn’t we do this for text fields?!
  59. SIDEBAR: Prefix Tree (Trie) containing car, cat, ham, hat
  60. SIDEBAR: Prefix Tree (Trie) built from “car cat ham hat”
  61. SIDEBAR: Database indexes for string fields can only search prefixes, unless you declare a “full text” index like: FULLTEXT (description)
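
The prefix-only limitation on slide 61 shows up as soon as strings are kept in sorted order. In this sketch a sorted list plays the role of the string index (an assumption for illustration; the slides draw a trie), and `with_prefix` is a hypothetical helper.

```python
from bisect import bisect_left

WORDS = sorted(["car", "cat", "ham", "hat"])
TITLES = sorted(
    title.lower()
    for title in ["Cat hat", "Vacation hat", "Hats for cats", "Kitten hat", "Kitten mittens"]
)


def with_prefix(sorted_strings, prefix):
    """Return all entries that start with prefix, using two binary searches."""
    start = bisect_left(sorted_strings, prefix)
    # The highest code point sorts after every possible continuation of prefix.
    end = bisect_left(sorted_strings, prefix + "\U0010ffff")
    return sorted_strings[start:end]


print(with_prefix(WORDS, "ca"))
print(with_prefix(TITLES, "cat"))
```

The second call finds "cat hat" but not "hats for cats": sorted order helps only when the search term is at the start of the indexed string.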
  62. SIDEBAR
      Database: O(r) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms.
      Search Engine: O(r) text search. High quality due to case insensitivity, tokenization, stemming, and synonyms.
  63. Back to numeric searching... By Lacey Smith, hungupokanagan.etsy.com
  64. price index:

      key    value
      11.00  [004]
      15.00  [001]
      25.00  [003]
      25.97  [005]
      49.99  [002]
  65. q=cat & fq=price:[* TO 15]

      <requestHandler name="myHandler" default="true">
        <lst name="defaults">
          <str name="qf">title description</str>
        </lst>
      </requestHandler>

      price index:
      key    value
      11.00  [004]
      15.00  [001]
      25.00  [003]
      25.97  [005]
      49.99  [002]
  66. q=cat & fq=price:[* TO 15], with the same request handler and price index, expanded term by term:

      price=0.00 OR price=0.01 OR price=0.02 OR price=0.03 OR price=0.04
      OR price=0.05 OR price=0.06 OR price=0.07 OR price=0.08 OR price=0.09
      OR …
      price=14.93 OR price=14.94 OR price=14.95 OR price=14.96
      OR price=14.97 OR price=14.98 OR price=14.99 OR price=15.00
  67. price index with range buckets:

      key                value
      0 - 24.99          [001, 004]
        0 - 12.49        [004]
          11.00          [004]
        12.50 - 24.99    [001]
          15.00          [001]
      25.00 - 49.99      [002, 003, 005]
        25.00 - 37.49    [003, 005]
          25.00          [003]
          25.97          [005]
        37.50 - 49.99    [002]
          49.99          [002]

      fq=price:[25 TO 50]
        price(25.00 - 49.99) U price(50.00)
      fq=price:[* TO 40]
        price(0 - 24.99) U price(25.00 - 37.49) U price(37.50) U price(37.51) U price(37.52) ... U price(40.00)
  68. price index with finer buckets:

      key                value
      0 - 24.99          [001, 004]
        0 - 12.49        [004]
          ...            ...
          11.00          [004]
        12.50 - 24.99    [001]
          12.50 - 12.99
          13.00 - 13.49
          ...            ...
          15.00 - 15.49  [001]
            15.00        [001]
          ...            ...

      fq=price:[* TO 15]
        price(0 - 12.49) U price(12.50 - 12.99) U price(13.00 - 13.49) U price(13.50 - 13.99) U price(14.00 - 14.49) U price(14.50 - 14.99) U price(15.00)
  69. The same price index with finer buckets, shown without the query.
  70. Search Index Performance: O(log (max-min)), for the max and min values of the field
  71. Search Index Performance: O(1). The number of buckets doesn’t change with the size of the data.
  72. Search Index Performance: O(r), where r = number of results found
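
The bucket decomposition on slides 67 and 68 can be reproduced with a small greedy cover. This is a sketch that mirrors the slides, not the search engine's actual numeric field implementation; prices are held in cents, and `WIDTHS`, `build_price_index`, `cover` and `range_search` are hypothetical names chosen so that the bucket sizes match the slide (25.00, 12.50, 0.50 and exact values).

```python
from collections import defaultdict

# Bucket widths in cents, widest first; each width divides the one before it.
WIDTHS = [2500, 1250, 50, 1]

PRICES_CENTS = {"001": 1500, "002": 4999, "003": 2500, "004": 1100, "005": 2597}


def build_price_index(prices):
    """Index time: file every listing under one bucket per width."""
    index = defaultdict(list)
    for doc_id, cents in sorted(prices.items()):
        for width in WIDTHS:
            index[(width, cents - cents % width)].append(doc_id)
    return index


def cover(lo, hi):
    """Query time: cover [lo, hi] with the widest aligned buckets that fit."""
    buckets = []
    pos = lo
    while pos <= hi:
        # Width 1 always fits, so next() always finds a bucket.
        width = next(w for w in WIDTHS if pos % w == 0 and pos + w - 1 <= hi)
        buckets.append((width, pos))
        pos += width
    return buckets


def range_search(index, lo, hi):
    """Union the postings of the covering buckets."""
    hits = set()
    for bucket in cover(lo, hi):
        hits.update(index.get(bucket, []))
    return sorted(hits)


index = build_price_index(PRICES_CENTS)
print(cover(0, 1500))
print(range_search(index, 0, 1500))
```

For fq=price:[* TO 15] the cover is the seven buckets listed on slide 68, regardless of how many listings are indexed, which is the point of slide 71.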
  73. Database: O(n) text search. Poor quality.
      Search Engine: O(r) text search (where r <= n). High quality.
  74. Database: O(n) text search. Poor quality. O(log n + r) numeric range search.
      Search Engine: O(r) text search (where r <= n). High quality.
  75. Database: O(n) text search. Poor quality. O(log n + r) numeric range search.
      Search Engine: O(r) text search (where r <= n). High quality. O(r) numeric range search.
  76. Storage, Part 3
  77. CREATE TABLE listings (
        id bigint(20),
        title varchar(1024),
        description longtext,
        price decimal(10,2),
        quantity int(8),
        PRIMARY KEY (id),
        KEY (price)
      );

      SELECT * FROM listings
      WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
      AND price <= 15;
  78. <schema name="listings">
        <fields>
          <field name="id" type="int20" required="true" indexed="true" stored="true"/>
          <field name="title" type="text" required="true" indexed="true" stored="false"/>
          <field name="description" type="text" required="true" indexed="true" stored="false"/>
          <field name="price" type="long" required="true" indexed="true" stored="false"/>
          <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
        </fields>
      </schema>

      q=cat & fq=price:[* TO 15]

      <requestHandler name="myHandler" default="true">
        <lst name="defaults">
          <str name="qf">title description</str>
        </lst>
      </requestHandler>
  79. Only the id is stored:

      <schema name="listings">
        <fields>
          <field name="id" type="int20" stored="true"/>
          <field name="title" type="text" stored="false"/>
          <field name="description" type="text" stored="false"/>
          <field name="price" type="long" stored="false"/>
          <field name="quantity" type="int8" stored="false"/>
        </fields>
      </schema>
  80. Every field is stored:

      <schema name="listings">
        <fields>
          <field name="id" type="int20" stored="true"/>
          <field name="title" type="text" stored="true"/>
          <field name="description" type="text" stored="true"/>
          <field name="price" type="long" stored="true"/>
          <field name="quantity" type="int8" stored="true"/>
        </fields>
      </schema>
  81. A search index is not a database index. But a search engine can totally be a database.
  82. Don’t do it. By Darcy Quinn, riotcakes.etsy.com
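
One reading of slides 79 to 82 is that the search engine answers "which ids match?" and the database remains the place where rows live. The sketch below assumes that division of labour; the slides do not spell it out. `TITLE_INDEX`, `LISTINGS_TABLE`, `search_ids` and `fetch_rows` are hypothetical names, with data taken from slides 18 and 25.

```python
# Search side: title is indexed but not stored, so a lookup yields ids only.
TITLE_INDEX = {
    "cat": ["001", "003"],
    "kitten": ["004", "005"],
}

# Storage side: a dict standing in for the listings table in the database.
LISTINGS_TABLE = {
    "001": {"title": "Cat hat", "price": "15.00", "quantity": 4},
    "003": {"title": "Hats for cats", "price": "25.00", "quantity": 1},
    "004": {"title": "Kitten hat", "price": "11.00", "quantity": 2},
    "005": {"title": "Kitten mittens", "price": "25.97", "quantity": 18},
}


def search_ids(term):
    """Ask the search index which listings match."""
    return TITLE_INDEX.get(term, [])


def fetch_rows(ids):
    """Load the full rows from the database by primary key."""
    return [LISTINGS_TABLE[doc_id] for doc_id in ids]


for row in fetch_rows(search_ids("cat")):
    print(row["title"], row["price"])
```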
  83. Database: O(n) text search. Poor quality. O(log n + r) numeric range search. Good at storage.
      Search Engine: O(r) text search (where r <= n). High quality. O(r) numeric range search. ‘Meh’ at storage.
      ✓ ✓ ✓ ✓
  84. By Ashley Fehribach, furballfanatic.etsy.com
  85. @nerdymathlete
  86. Thank you. Toria Gibbs, Senior Software Engineer @ Etsy, @scarletdrive