A search engine is not a database. Search systems are optimized for fast search using an internal data structure called an inverted index. Databases have a similar feature to allow quick access, also called an index, but it’s a totally different thing!
In this talk, Toria Gibbs will take you on a tour of the internals of a search index, comparing it to common implementations of indexing in relational databases. We’ll see how search engines can outperform databases and discuss the tradeoffs in implementing and maintaining such a system. No prior knowledge of database or search index implementations required; experience creating or querying database tables will be helpful.
19.
id   title           description                                          price   quantity
001  Cat hat         A very good hat for very good cats                   $15.00  4
002  Vacation hat    Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats   A set of three hats for the most extreme cat people  $25.00  1
004  Kitten hat      This is a very small hat, for kittens particularly   $11.00  2
005  Kitten mittens  Finally! An elegant, comfortable mitten for cats     $25.97  18
20.
SELECT * FROM listings
WHERE title LIKE '%cat%'
   OR description LIKE '%cat%';
24.
CREATE TABLE listings (
  id bigint(20),
  title varchar(1024),
  description longtext,
  price decimal(10,2),
  quantity int(8),
  PRIMARY KEY (id)
);
25.
id   title
001  Cat hat
002  Vacation hat
003  Hats for cats
004  Kitten hat
005  Kitten mittens
26.
id   title
001  Cat hat
002  Vacation hat
003  Hats for cats
004  Kitten hat
005  Kitten mittens

title     id
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]
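
The title index above can be reproduced in a few lines of Python. This is a minimal sketch rather than how any particular search engine builds its index: the `normalize` helper and its naive rule of stripping a trailing "s" are assumptions chosen only so that the output matches the slide. Real engines use configurable analyzers and stemmers.

```python
from collections import defaultdict

# Titles from the listings table, keyed by id.
TITLES = {
    "001": "Cat hat",
    "002": "Vacation hat",
    "003": "Hats for cats",
    "004": "Kitten hat",
    "005": "Kitten mittens",
}


def normalize(word):
    """Lowercase a word and strip a trailing 's' (a naive stand-in for stemming)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word


def build_inverted_index(docs):
    """Map each normalized term to the sorted list of ids whose text contains it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # A set avoids listing the same id twice when a term repeats in one title.
        for term in {normalize(w) for w in docs[doc_id].split()}:
            index[term].append(doc_id)
    return dict(index)


title_index = build_inverted_index(TITLES)
for term in sorted(title_index):
    print(term, title_index[term])
```

Looking up a term is then a single dictionary access, which is what makes the structure attractive for text search.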
27.
q=cat

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

title
key       value
cat       [001, 003]
hat       [001, 002, 003, 004]
vacation  [002]
for       [003]
kitten    [004, 005]
mitten    [005]

description
key    value
very   [001]
good   [001]
hat    [001, 002, 003, 004]
cat    [001, 003, 005]
wear   [002]
beach  [002]
...    ...
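
As a rough illustration of what `q=cat` with `qf` set to `title description` amounts to, the sketch below looks the term up in each field's index and unions the postings lists. It is an assumption-laden simplification: the `search` function is hypothetical, scoring is omitted entirely, and the description index contains only the entries visible on the slide.

```python
# Per-field inverted indexes copied from the slide.
TITLE_INDEX = {
    "cat": ["001", "003"],
    "hat": ["001", "002", "003", "004"],
    "vacation": ["002"],
    "for": ["003"],
    "kitten": ["004", "005"],
    "mitten": ["005"],
}

# Only the description entries shown on the slide; the rest are elided there.
DESCRIPTION_INDEX = {
    "very": ["001"],
    "good": ["001"],
    "hat": ["001", "002", "003", "004"],
    "cat": ["001", "003", "005"],
    "wear": ["002"],
    "beach": ["002"],
}


def search(term, field_indexes):
    """Return the sorted ids that contain `term` in any of the given fields."""
    matches = set()
    for index in field_indexes:
        matches.update(index.get(term.lower(), []))
    return sorted(matches)


results = search("cat", [TITLE_INDEX, DESCRIPTION_INDEX])
print(results)
```

The work done is proportional to the size of the postings lists that are read, not to the number of rows in the table.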
33. Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

SELECT * FROM listings
WHERE title LIKE '%cat%'
   OR description LIKE '%cat%';
34. Solution: SQL “LOWER”
SELECT * FROM listings
WHERE LOWER(title) LIKE '%cat%'
   OR LOWER(description) LIKE '%cat%';
35. Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

SELECT * FROM listings
WHERE title LIKE '%cat%'
   OR description LIKE '%cat%';
36. Solution: check punctuation & whitespace for every word form
SELECT * FROM listings
WHERE title LIKE 'cat' OR title LIKE 'cats'
   OR title LIKE 'cat %' OR title LIKE 'cats %'
   OR title LIKE '% cat' OR title LIKE '% cats'
   OR title LIKE '% cat %' OR title LIKE '% cats %'
   OR title LIKE '% cat.%' OR title LIKE '% cats.%'
   OR title LIKE '%.cat %' OR title LIKE '%.cats %'
   ...
37. Problem: missed relevant item
id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

SELECT * FROM listings
WHERE title LIKE '%cat%'
   OR description LIKE '%cat%';
38.
SELECT * FROM listings
WHERE LOWER(title) = 'cat' OR LOWER(title) = 'cats'
   OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens'
   OR LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %'
   OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %'
   OR LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %'
   OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %'
   OR LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%'
   OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%'
   OR LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %'
   OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %'
   OR LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%'
   OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%'
   ...
   OR LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats'
   OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens'
   ...
40. Problem: case sensitivity
id   title    description                         price   quantity
001  Cat hat  A very good hat for very good cats  $15.00  4

q=cat
41. Solution: everything is lowercase
q=cat

title (original case)
key  value
cat  [003]
Cat  [001]

title (lowercased)
key  value
cat  [001, 003]
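
The two versions of the title index can be compared with a short sketch. The `index_titles` function and its `lowercase` flag are hypothetical names, and the trailing-"s" stripping is only a naive stand-in for stemming so that "cats" and "Cat" can be compared at all.

```python
# The two titles that contain some form of "cat".
TITLES = {
    "001": "Cat hat",
    "003": "Hats for cats",
}


def index_titles(titles, lowercase):
    """Build a term -> ids index, optionally lowercasing each term first."""
    index = {}
    for doc_id in sorted(titles):
        for word in titles[doc_id].split():
            term = word.lower() if lowercase else word
            # Naive plural stripping (an assumption, not a real stemmer).
            if term.endswith("s"):
                term = term[:-1]
            ids = index.setdefault(term, [])
            if doc_id not in ids:
                ids.append(doc_id)
    return index


case_sensitive = index_titles(TITLES, lowercase=False)
lowercased = index_titles(TITLES, lowercase=True)

print(case_sensitive["cat"], case_sensitive["Cat"])
print(lowercased["cat"])
```

The query term must be lowercased in the same way at search time; otherwise `q=Cat` would miss the lowercased index entirely.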
42. Problem: hidden substring
id   title          description                                          price   quantity
002  Vacation hat   Wear this hat to the beach maybe                     $49.99  22
003  Hats for cats  A set of three hats for the most extreme cat people  $25.00  1

q=cat
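
A small sketch of why tokenization avoids the hidden-substring problem: a substring test finds "cat" inside "Vacation", while a lookup against whole tokens does not. The regex-based `tokenize` function is an assumption for illustration, not the tokenizer of any specific engine.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


title = "Vacation hat"
description = "A set of three hats for the most extreme cat people"

# Substring matching, as with LIKE '%cat%': a false positive on the title.
substring_hit = "cat" in title.lower()

# Token matching, as with an inverted index: no match on the title,
# but a correct match on the description.
title_token_hit = "cat" in tokenize(title)
description_token_hit = "cat" in tokenize(description)

print(substring_hit, title_token_hit, description_token_hit)
```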
44. Problem: missed relevant item
id   title       description                                         price   quantity
004  Kitten hat  This is a very small hat, for kittens particularly  $11.00  2

q=cat
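
One way a search engine can surface the kitten listings for `q=cat` is synonym expansion. The sketch below expands the query term at search time; the `SYNONYMS` table and the `search_with_synonyms` function are hypothetical and not taken from the talk, and engines may instead apply synonyms at index time.

```python
# Title index from the earlier slide.
TITLE_INDEX = {
    "cat": ["001", "003"],
    "hat": ["001", "002", "003", "004"],
    "vacation": ["002"],
    "for": ["003"],
    "kitten": ["004", "005"],
    "mitten": ["005"],
}

# Hypothetical synonym table (an assumption for illustration).
SYNONYMS = {"cat": ["kitten"]}


def search_with_synonyms(term, index, synonyms):
    """Union the postings of the term and of every synonym configured for it."""
    terms = [term] + synonyms.get(term, [])
    matches = set()
    for t in terms:
        matches.update(index.get(t, []))
    return sorted(matches)


plain = sorted(TITLE_INDEX.get("cat", []))
expanded = search_with_synonyms("cat", TITLE_INDEX, SYNONYMS)

print(plain)
print(expanded)
```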
46.
Database                         Search Engine
O(n) text search                 O(r) text search (where r <= n)
Poor quality                     High quality
due to case sensitivity,         due to case insensitivity,
substring mismatches, and        tokenization, stemming, and
missing terms                    synonyms
54.
price  15.00  49.99  25.00  11.00  25.97
id     001    002    003    004    005

id=004  id=001  id=003  id=005  id=002

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
  AND price <= 15;
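
The price-ordered list of ids above behaves like a sorted index: a binary search finds where the range ends, and the matching ids are read off in order. The sketch below uses Python's `bisect` on a sorted list as a stand-in for a database B-tree (an assumption for illustration); prices are stored in cents to avoid floating-point comparisons, and `ids_with_price_at_most` is a hypothetical helper.

```python
from bisect import bisect_right

# Prices in cents, sorted ascending, with the listing ids in the same order.
PRICES_CENTS = [1100, 1500, 2500, 2597, 4999]
IDS = ["004", "001", "003", "005", "002"]


def ids_with_price_at_most(limit_cents):
    """Return ids whose price is <= limit_cents, cheapest first."""
    # Binary search: index of the first price strictly greater than the limit.
    end = bisect_right(PRICES_CENTS, limit_cents)
    return IDS[:end]


cheap = ids_with_price_at_most(1500)
print(cheap)
```

The text condition in the query still has to be checked against each of these candidate rows separately.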
62. SIDEBAR
Database indexes for string fields can only search prefixes.
Unless you declare a “full text” index like:
FULLTEXT (description)
63. SIDEBAR
Database                         Search Engine
O(r) text search                 O(r) text search
Poor quality                     High quality
due to case sensitivity,         due to case insensitivity,
substring mismatches, and        tokenization, stemming, and
missing terms                    synonyms
67.
q=cat & fq=price:[* TO 15]

<requestHandler name="myHandler" default="true">
  <lst name="defaults">
    <str name="qf">title description</str>
  </lst>
</requestHandler>

price=0.00 OR price=0.01 OR
price=0.02 OR price=0.03 OR
price=0.04 OR price=0.05 OR
price=0.06 OR price=0.07 OR
price=0.08 OR price=0.09 OR
…
price=14.93 OR price=14.94 OR
price=14.95 OR price=14.96 OR
price=14.97 OR price=14.98 OR
price=14.99 OR price=15.00

price
key    value
11.00  [004]
15.00  [001]
25.00  [003]
25.97  [005]
49.99  [002]
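
To make the cost of this expansion concrete, the sketch below answers the range filter against a plain inverted index by looking up every possible price between 0.00 and the limit. The `naive_range` function is a hypothetical illustration of the approach on the slide, not something a real engine would do.

```python
# Price index from the slide: one term per distinct price.
PRICE_INDEX = {
    "11.00": ["004"],
    "15.00": ["001"],
    "25.00": ["003"],
    "25.97": ["005"],
    "49.99": ["002"],
}


def naive_range(max_cents):
    """Look up every price from 0.00 to max_cents, one term at a time."""
    matches = []
    lookups = 0
    for cents in range(0, max_cents + 1):
        key = f"{cents // 100}.{cents % 100:02d}"  # e.g. 1500 -> "15.00"
        lookups += 1
        matches.extend(PRICE_INDEX.get(key, []))
    return sorted(matches), lookups


ids, lookups = naive_range(1500)
print(ids, lookups)
```

Two matching listings cost 1501 term lookups here, which is why a flat term-per-value index is a poor fit for range queries.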
68.
price
key                value
0 - 24.99          [001, 004]
  0 - 12.49        [004]
    11.00          [004]
  12.50 - 24.99    [001]
    15.00          [001]
25.00 - 49.99      [002, 003, 005]
  25.00 - 37.49    [003, 005]
    25.00          [003]
    25.97          [005]
  37.50 - 49.99    [002]
    49.99          [002]

fq=price:[25 TO 50]
price(25.00 - 49.99)
U price(50.00)

fq=price:[* TO 40]
price(0 - 24.99)
U price(25.00 - 37.49)
U price(37.50)
U price(37.51)
U price(37.52)
...
U price(40.00)
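
The nested ranges can be modeled as a small tree in which every node stores the ids that fall inside its range. The sketch below is a simplified illustration under several assumptions: the `Node` class and function names are hypothetical, prices are in cents, and a partially covered bucket is handled by descending into its existing children rather than by enumerating every possible price as the slide does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    low: int    # inclusive lower bound, in cents
    high: int   # inclusive upper bound, in cents
    ids: list   # ids of all listings priced within [low, high]
    children: list = field(default_factory=list)


def leaf(cents, ids):
    """A single-price entry."""
    return Node(cents, cents, ids)


# The bucket hierarchy from the slide.
TREE = [
    Node(0, 2499, ["001", "004"], [
        Node(0, 1249, ["004"], [leaf(1100, ["004"])]),
        Node(1250, 2499, ["001"], [leaf(1500, ["001"])]),
    ]),
    Node(2500, 4999, ["002", "003", "005"], [
        Node(2500, 3749, ["003", "005"], [leaf(2500, ["003"]), leaf(2597, ["005"])]),
        Node(3750, 4999, ["002"], [leaf(4999, ["002"])]),
    ]),
]


def cover(nodes, lo, hi, used):
    """Collect the largest buckets that lie entirely inside [lo, hi]."""
    for node in nodes:
        if node.high < lo or node.low > hi:
            continue  # no overlap with the query range
        if lo <= node.low and node.high <= hi:
            used.append(node)  # fully covered: one lookup serves the whole bucket
        else:
            cover(node.children, lo, hi, used)  # partial overlap: refine
    return used


def range_query(lo, hi):
    used = cover(TREE, lo, hi, [])
    ids = sorted({doc_id for node in used for doc_id in node.ids})
    return ids, [(node.low, node.high) for node in used]


print(range_query(2500, 5000))
print(range_query(0, 4000))
```

Wide ranges are answered from a handful of precomputed buckets, so the number of lookups depends on how well the range aligns with bucket boundaries rather than on how many distinct prices it spans.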
69.
price
key                  value
0 - 24.99            [001, 004]
  0 - 12.49          [004]
    ...              ...
      11.00          [004]
  12.50 - 24.99      [001]
    12.50 - 12.99
    13.00 - 13.49
    ...              ...
    15.00 - 15.49    [001]
      15.00          [001]
    ...              ...

fq=price:[* TO 15]
price(0 - 12.49)
U price(12.50 - 12.99)
U price(13.00 - 13.49)
U price(13.50 - 13.99)
U price(14.00 - 14.49)
U price(14.50 - 14.99)
U price(15.00)
76.
Database                             Search Engine
O(n) text search                     O(r) text search (where r <= n)
Poor quality                         High quality
O(log n + r) numeric range search    O(r) numeric range search
78.
CREATE TABLE listings (
  id bigint(20),
  title varchar(1024),
  description longtext,
  price decimal(10,2),
  quantity int(8),
  PRIMARY KEY (id),
  KEY (price)
);

SELECT * FROM listings
WHERE (title LIKE '%cat%' OR description LIKE '%cat%')
  AND price <= 15;
84.
Database                             Search Engine
O(n) text search                     O(r) text search (where r <= n)
Poor quality                         High quality
O(log n + r) numeric range search    O(r) numeric range search
Good at storage                      ‘Meh’ at storage
✓
✓
✓
✓