A Search Index is Not a Database Index - Full Stack Toronto 2017

A search engine is not a database. Search systems are optimized for fast search using an internal data structure called an inverted index. Databases have a similar feature to allow quick access, also called an index, but it’s a totally different thing!

In this talk, Toria Gibbs will take you on a tour of the internals of a search index, comparing it to common implementations of indexing in relational databases. We’ll see how search engines can outperform databases and discuss the tradeoffs in implementing and maintaining such a system. No prior knowledge of database or search index implementations required; experience creating or querying database tables will be helpful.

Published in: Technology

  1. 1. A Search Index is Not a Database Index. Toria Gibbs, Senior Software Engineer @ Etsy, @scarletdrive
  2. 2. Story time!
  3. 3. Search Index / Database Index
  4. 4. They hired me!
  5. 5. They hired me! (even though I was wrong)
  6. 6. Agenda. 0: Terminology. 1: Text Search. 2: Numeric Range Search. 3: Storage.
  7. 7. Terminology (Database): Table, Schema, Column, Row
        Schema: id: integer, name: string, breed: string
        id | name | breed
        001 | Momo | Cat
        002 | Naga | Cat
        003 | Sullivan | Dog
  11. 11. Terminology (Database): Table, Schema, Column, Row
        pets (id: integer, name: string, breed: string)
        id | name | breed
        001 | Momo | Cat
        002 | Naga | Cat
        003 | Sullivan | Dog
        humans (id: integer, name: string)
        id | name
        001 | Toria
        002 | Colleen
        owners (human_id: int, pet_id: int)
        human_id | pet_id
        001 | 001
        001 | 002
        002 | 003
  12. 12. Terminology: Database → Search Engine. Table → Search Index; Schema → Schema; Column → Field; Row → Document
  13. 13. Terminology: Database → Search Engine. Table → Search Index; Schema → Schema; Column → Field; Row → Document; Database Index → ?
  14. 14. Terminology: Database → Search Engine. Table → Search Index; Schema → Schema; Column → Field; Row → Document; Database Index → Inverted Index
  15. 15.
  16. 16. Part 1: Text Search
  17. 17. Secret Santa for Cats: find all the cat-related items in a database. github.com/toriagibbs/SecretSanta (By Rebecca Davis, pawsomecrochet.etsy.com)
  18. 18. listings
        id | title | description | price | quantity
        001 | Cat hat | A very good hat for very good cats | $15.00 | 4
        002 | Vacation hat | Wear this hat to the beach maybe | $49.99 | 22
        003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
        004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
        005 | Kitten mittens | Finally! An elegant, comfortable mitten for cats | $25.97 | 18
  19. 19. SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%';
  20. 20. Database Performance: n*m, where n = number of rows in the database and m = length of strings
  21. 21. Database Performance: O(n), where n = number of rows in the database (treating the string length m as a constant)
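
The n*m cost on the two slides above can be seen in a small sketch. This is plain Python standing in for a table scan, not how any particular database engine is implemented; the rows are copied from the listings slide, and `like_scan` is a hypothetical helper name.

```python
# Rows copied from the listings table on the slides: (id, title, description).
LISTINGS = [
    (1, "Cat hat", "A very good hat for very good cats"),
    (2, "Vacation hat", "Wear this hat to the beach maybe"),
    (3, "Hats for cats", "A set of three hats for the most extreme cat people"),
    (4, "Kitten hat", "This is a very small hat, for kittens particularly"),
    (5, "Kitten mittens", "Finally! An elegant, comfortable mitten for cats"),
]


def like_scan(rows, needle):
    """Mimic WHERE title LIKE '%needle%' OR description LIKE '%needle%'.

    Every row is visited (n rows) and each substring test walks the
    string (length m), which is where n*m comes from.
    Hypothetical helper: a stand-in for a scan, not a real database API.
    """
    matches = []
    for row_id, title, description in rows:
        if needle in title or needle in description:
            matches.append(row_id)
    return matches


if __name__ == "__main__":
    print(like_scan(LISTINGS, "cat"))
```

With these rows the scan returns ids 1, 2, 3 and 5. Row 2 matches only because "Vacation" contains "cat", and row 4 (the kitten hat) is missed, which are the quality problems the talk returns to in Part 1½.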
  23. 23. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  24. 24. listings (id, title)
        id | title
        001 | Cat hat
        002 | Vacation hat
        003 | Hats for cats
        004 | Kitten hat
        005 | Kitten mittens
  25. 25. listings (id, title) and the inverted index built from it
        id | title
        001 | Cat hat
        002 | Vacation hat
        003 | Hats for cats
        004 | Kitten hat
        005 | Kitten mittens
        title (term) | id
        cat | [001, 003]
        hat | [001, 002, 003, 004]
        vacation | [002]
        for | [003]
        kitten | [004, 005]
        mitten | [005]
  26. 26. q=cat
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
        title index
        key | value
        cat | [001, 003]
        hat | [001, 002, 003, 004]
        vacation | [002]
        for | [003]
        kitten | [004, 005]
        mitten | [005]
        description index
        key | value
        very | [001]
        good | [001]
        hat | [001, 002, 003, 004]
        cat | [001, 003, 005]
        wear | [002]
        beach | [002]
        ... | ...
  28. 28. Search Index Performance: O(1). 2 hash lookups = constant time
  29. 29. Search Index Performance: O(1) + retrieval. 2 hash lookups = constant time
  30. 30. Search Index Performance: O(r), where r = number of results found
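
The title and description indexes from the slides can be rebuilt in a few lines. This is a minimal sketch, assuming a toy analyzer (lowercase, split on non-letters, strip a plural "s"); the function names are hypothetical and a real search engine does considerably more.

```python
import re
from collections import defaultdict

# Documents copied from the listings table on the slides.
LISTINGS = {
    1: {"title": "Cat hat", "description": "A very good hat for very good cats"},
    2: {"title": "Vacation hat", "description": "Wear this hat to the beach maybe"},
    3: {"title": "Hats for cats",
        "description": "A set of three hats for the most extreme cat people"},
    4: {"title": "Kitten hat",
        "description": "This is a very small hat, for kittens particularly"},
    5: {"title": "Kitten mittens",
        "description": "Finally! An elegant, comfortable mitten for cats"},
}


def analyze(text):
    """Toy analyzer (an assumption, not a real stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def build_index(docs, field):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(analyze(docs[doc_id][field])):
            index[term].append(doc_id)
    return index


def search(query, indexes):
    """One hash lookup per index per term, then collect the r results."""
    found = set()
    for term in analyze(query):
        for index in indexes:
            found.update(index.get(term, []))
    return found


title_index = build_index(LISTINGS, "title")
description_index = build_index(LISTINGS, "description")

if __name__ == "__main__":
    print(title_index["cat"], description_index["cat"])
    print(search("cat", [title_index, description_index]))
```

The lookup cost does not depend on how many documents are indexed; only gathering the matching ids grows with the number of results, which is the O(r) on the last slide.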
  31. 31. Part 1½: Text Search Quality
  32. 32. Problem: case sensitivity
        SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%';
        id | title | description | price | quantity
        001 | Cat hat | A very good hat for very good cats | $15.00 | 4
  33. 33. Solution: SQL “LOWER”
        SELECT * FROM listings WHERE LOWER(title) LIKE '%cat%' OR LOWER(description) LIKE '%cat%';
  34. 34. Problem: hidden substring
        SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%';
        id | title | description | price | quantity
        002 | Vacation hat | Wear this hat to the beach maybe | $49.99 | 22
        003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
  35. 35. Solution: check punctuation & whitespace for every word form
        SELECT * FROM listings WHERE
        title LIKE 'cat' OR title LIKE 'cats' OR
        title LIKE 'cat %' OR title LIKE 'cats %' OR
        title LIKE '% cat' OR title LIKE '% cats' OR
        title LIKE '% cat %' OR title LIKE '% cats %' OR
        title LIKE '% cat.%' OR title LIKE '% cats.%' OR
        title LIKE '%.cat %' OR title LIKE '%.cats %' ...
  36. 36. Problem: missed relevant item
        SELECT * FROM listings WHERE title LIKE '%cat%' OR description LIKE '%cat%';
        id | title | description | price | quantity
        004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
  37. 37. SELECT * FROM listings WHERE
        LOWER(title) = 'cat' OR LOWER(title) = 'cats' OR LOWER(title) = 'kitten' OR LOWER(title) = 'kittens' OR
        LOWER(title) LIKE 'cat %' OR LOWER(title) LIKE 'cats %' OR LOWER(title) LIKE 'kitten %' OR LOWER(title) LIKE 'kittens %' OR
        LOWER(title) LIKE '% cat %' OR LOWER(title) LIKE '% cats %' OR LOWER(title) LIKE '% kitten %' OR LOWER(title) LIKE '% kittens %' OR
        LOWER(title) LIKE '% cat.%' OR LOWER(title) LIKE '% cats.%' OR LOWER(title) LIKE '% kitten.%' OR LOWER(title) LIKE '% kittens.%' OR
        LOWER(title) LIKE '%.cat %' OR LOWER(title) LIKE '%.cats %' OR LOWER(title) LIKE '%.kitten %' OR LOWER(title) LIKE '%.kittens %' OR
        LOWER(title) LIKE '%.cat.%' OR LOWER(title) LIKE '%.cats.%' OR LOWER(title) LIKE '%.kitten.%' OR LOWER(title) LIKE '%.kittens.%' ... OR
        LOWER(title) LIKE '% cat' OR LOWER(title) LIKE '% cats' OR LOWER(title) LIKE '% kitten' OR LOWER(title) LIKE '% kittens' ...
  38. 38. Let’s solve it with a search index
  39. 39. Problem: case sensitivity. q=cat
        id | title | description | price | quantity
        001 | Cat hat | A very good hat for very good cats | $15.00 | 4
  40. 40. Solution: everything is lowercase. q=cat
        title index, before: cat → [003]; Cat → [001]
        title index, after: cat → [001, 003]
  41. 41. Problem: hidden substring. q=cat
        id | title | description | price | quantity
        002 | Vacation hat | Wear this hat to the beach maybe | $49.99 | 22
        003 | Hats for cats | A set of three hats for the most extreme cat people | $25.00 | 1
  42. 42. Solution: tokenization & stemming
        “Vacation hat” → [“vacation”, “hat”]
        “hats” → “hat”; “cats” → “cat”; “catlike” → “cat”
  43. 43. Problem: missed relevant item. q=cat
        id | title | description | price | quantity
        004 | Kitten hat | This is a very small hat, for kittens particularly | $11.00 | 2
  44. 44. Solution: synonyms. q=cat
        title index, before: cat → [001, 003]; kitten → [004, 005]
        title index, after: cat → [001, 003, 004, 005]
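
The three fixes above (lowercasing, tokenization plus stemming, synonyms) all happen in one analysis step that runs over text before it is indexed. A rough sketch follows, assuming a deliberately naive suffix-stripping stemmer and a one-entry synonym map; every function name here is hypothetical, and real analyzers are far more careful.

```python
import re

# Assumed synonym map for illustration: index "kitten" under "cat".
SYNONYMS = {"kitten": "cat"}


def tokenize(text):
    """Lowercase, then split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Naive stemmer (an assumption): strip "like" or a plural "s"."""
    for suffix in ("like", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Full chain: lowercase -> tokenize -> stem -> synonyms."""
    return [SYNONYMS.get(stem(t), stem(t)) for t in tokenize(text)]


def build_index(docs):
    """Inverted index over already-analyzed terms."""
    index = {}
    for doc_id in sorted(docs):
        for term in dict.fromkeys(analyze(docs[doc_id])):
            index.setdefault(term, []).append(doc_id)
    return index


TITLES = {1: "Cat hat", 2: "Vacation hat", 3: "Hats for cats",
          4: "Kitten hat", 5: "Kitten mittens"}
title_index = build_index(TITLES)

if __name__ == "__main__":
    print(analyze("Vacation hat"))
    print(title_index["cat"])
```

Because "Vacation" is indexed as the whole token "vacation", it no longer matches q=cat, while the kitten listings now do. The query string has to pass through the same `analyze` chain before lookup, otherwise "Cats" would never find the term "cat".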
  45. 45. Database vs. Search Engine
        Database: O(n) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms
        Search Engine: O(r) text search (where r <= n). High quality due to case insensitivity, tokenization, stemming, and synonyms
  46. 46. Trade-offs: more disk space; do work at “index time”
  47. 47. Part 2: Numeric Range Search
  48. 48. Secret Santa for Cats: find all the cat-related items under $15 in a database. github.com/toriagibbs/SecretSanta (By Rebecca Davis, pawsomecrochet.etsy.com)
  49. 49. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  50. 50. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id) );
  51. 51. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id), KEY (price) );
  52. 52. Database Index
        id | price
        001 | 15.00
        002 | 49.99
        003 | 25.00
        004 | 11.00
        005 | 25.97
        Index leaves, in price order: id=004, id=001, id=003, id=005, id=002
  53. 53. SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
        id | price
        001 | 15.00
        002 | 49.99
        003 | 25.00
        004 | 11.00
        005 | 25.97
        Index leaves, in price order: id=004, id=001, id=003, id=005, id=002
  54. 54. Database Performance: O(log n). Log base 2 for a binary tree; log base B for a B-tree
  55. 55. Database Performance: O(log n) + retrieval. Log base 2 for a binary tree; log base B for a B-tree
  56. 56. Database Performance: O(log n + r), where n = number of rows in the database and r = number of results found
  57. 57. n | log2 n
        10 | 3.32
        100 | 6.64
        1 000 | 9.97
        10 000 | 13.29
        100 000 | 16.61
        1 000 000 | 19.93
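
The O(log n + r) behaviour of the price index can be imitated with a sorted list and binary search. This is only a sketch: a sorted Python list stands in for the leaves of the database's tree index, prices are held in cents to avoid floating-point comparisons, and `ids_with_price_at_most` is a hypothetical name.

```python
from bisect import bisect_right

# Prices from the slides, in cents, keyed by listing id.
PRICES_CENTS = {1: 1500, 2: 4999, 3: 2500, 4: 1100, 5: 2597}

# Sorted (price, id) pairs play the role of the index leaves.
INDEX = sorted((price, row_id) for row_id, price in PRICES_CENTS.items())
KEYS = [price for price, _ in INDEX]


def ids_with_price_at_most(limit_cents):
    """Answer "price <= limit" using the sorted index.

    bisect_right finds the cut-off position in O(log n); slicing off
    everything before it costs O(r) for the r matching rows.
    """
    end = bisect_right(KEYS, limit_cents)
    return [row_id for _, row_id in INDEX[:end]]


if __name__ == "__main__":
    print(ids_with_price_at_most(1500))
```

For `price <= 15` this yields ids 4 and 1 without looking at the three more expensive listings. In the slide's query the text condition still has to be checked against each of those r rows afterwards.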
  58. 58. SIDEBAR: Why didn’t we do this for text fields?!
  59. 59. SIDEBAR: Prefix Tree (Trie): car, cat, ham, hat
  60. 60. SIDEBAR: Prefix Tree (Trie): “car cat ham hat”
  61. 61. SIDEBAR: Database indexes for string fields can only search prefixes, unless you declare a “full text” index like: FULLTEXT (description)
  62. 62. SIDEBAR: Database vs. Search Engine
        Database: O(r) text search. Poor quality due to case sensitivity, substring mismatches, and missing terms
        Search Engine: O(r) text search. High quality due to case insensitivity, tokenization, stemming, and synonyms
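
The sidebar's point about prefixes can be made concrete with a small prefix tree over the four words on the slide. This is an illustrative sketch with a hypothetical `Trie` class, not the structure any specific database uses for its string indexes.

```python
class Trie:
    """Minimal prefix tree: one node per character."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with prefix, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


words = Trie()
for w in ["car", "cat", "ham", "hat"]:
    words.insert(w)

if __name__ == "__main__":
    print(words.with_prefix("ca"))
    print(words.with_prefix("at"))
```

Walking down from the root finds "car" and "cat" for the prefix "ca", but there is no path that starts with "at", so the tree cannot help with a pattern like '%at' even though "cat" and "hat" both end that way.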
  63. 63. Back to numeric searching... (By Lacey Smith, hungupokanagan.etsy.com)
  64. 64. price index
        key | value
        11.00 | [004]
        15.00 | [001]
        25.00 | [003]
        25.97 | [005]
        49.99 | [002]
  65. 65. q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
        price index
        key | value
        11.00 | [004]
        15.00 | [001]
        25.00 | [003]
        25.97 | [005]
        49.99 | [002]
  66. 66. q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
        Naive expansion of the range: price=0.00 OR price=0.01 OR price=0.02 OR price=0.03 OR price=0.04 OR price=0.05 OR price=0.06 OR price=0.07 OR price=0.08 OR price=0.09 OR … price=14.93 OR price=14.94 OR price=14.95 OR price=14.96 OR price=14.97 OR price=14.98 OR price=14.99 OR price=15.00
        price index
        key | value
        11.00 | [004]
        15.00 | [001]
        25.00 | [003]
        25.97 | [005]
        49.99 | [002]
  67. 67. price index with range buckets
        key | value
        0 - 24.99 | [001, 004]
        0 - 12.49 | [004]
        11.00 | [004]
        12.50 - 24.99 | [001]
        15.00 | [001]
        25.00 - 49.99 | [002, 003, 005]
        25.00 - 37.49 | [003, 005]
        25.00 | [003]
        25.97 | [005]
        37.50 - 49.99 | [002]
        49.99 | [002]
        fq=price:[25 TO 50] → price(25.00 - 49.99) U price(50.00)
        fq=price:[* TO 40] → price(0 - 24.99) U price(25.00 - 37.49) U price(37.50) U price(37.51) U price(37.52) ... U price(40.00)
  68. 68. price index with finer range buckets
        key | value
        0 - 24.99 | [001, 004]
        0 - 12.49 | [004]
        ... | ...
        11.00 | [004]
        12.50 - 24.99 | [001]
        12.50 - 12.99 |
        13.00 - 13.49 |
        ... | ...
        15.00 - 15.49 | [001]
        15.00 | [001]
        ... | ...
        fq=price:[* TO 15] → price(0 - 12.49) U price(12.50 - 12.99) U price(13.00 - 13.49) U price(13.50 - 13.99) U price(14.00 - 14.49) U price(14.50 - 14.99) U price(15.00)
  69. 69. price index with finer range buckets
        key | value
        0 - 24.99 | [001, 004]
        0 - 12.49 | [004]
        ... | ...
        11.00 | [004]
        12.50 - 24.99 | [001]
        12.50 - 12.99 |
        13.00 - 13.49 |
        ... | ...
        15.00 - 15.49 | [001]
        15.00 | [001]
        ... | ...
  70. 70. Search Index Performance: O(log (max-min)), for the max and min values of the field
  71. 71. Search Index Performance: O(1). The number of buckets doesn’t change with the size of the data
  72. 72. Search Index Performance: O(r), where r = number of results found
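
The bucket idea from the preceding slides can be sketched as follows. This is an assumption-laden illustration, not the engine's actual implementation: prices are integers in cents, bucket widths are powers of two (`STEP` and `MAX_SHIFT` are made-up parameters) rather than the decimal halves drawn on the slides, and all function names are hypothetical.

```python
from collections import defaultdict

STEP = 4        # each level's buckets are 2**STEP times wider than the last
MAX_SHIFT = 12  # the widest buckets cover 2**12 cents


def build_price_index(prices_cents):
    """Index every price under one bucket per level (work done at index time)."""
    index = defaultdict(list)
    for doc_id, cents in sorted(prices_cents.items()):
        for shift in range(0, MAX_SHIFT + 1, STEP):
            index[(shift, cents >> shift)].append(doc_id)
    return index


def range_terms(lo, hi):
    """Cover [lo, hi] exactly with as-wide-as-possible aligned buckets."""
    terms = []
    while lo <= hi:
        shift = 0
        while shift + STEP <= MAX_SHIFT:
            size = 1 << (shift + STEP)
            if lo % size != 0 or lo + size - 1 > hi:
                break  # next level up is misaligned or overshoots hi
            shift += STEP
        terms.append((shift, lo >> shift))
        lo += 1 << shift
    return terms


def range_search(index, lo, hi):
    """Union the postings of the few buckets that cover the range."""
    found = set()
    for term in range_terms(lo, hi):
        found.update(index.get(term, []))
    return found


PRICES_CENTS = {1: 1500, 2: 4999, 3: 2500, 4: 1100, 5: 2597}
price_index = build_price_index(PRICES_CENTS)

if __name__ == "__main__":
    print(range_search(price_index, 0, 1500))
    print(len(range_terms(0, 1500)))
```

The range 0 to 1500 cents is covered by a few dozen bucket lookups instead of 1,501 exact-price lookups, and that count depends only on the range and the bucket layout, not on how many listings are indexed. The price is paid up front: every value is written into several buckets, which is the extra disk space and index-time work mentioned under trade-offs.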
  73. 73. Database vs. Search Engine
        Database: O(n) text search; poor quality
        Search Engine: O(r) text search (where r <= n); high quality
  74. 74. Database vs. Search Engine
        Database: O(n) text search; poor quality; O(log n + r) numeric range search
        Search Engine: O(r) text search (where r <= n); high quality
  75. 75. Database vs. Search Engine
        Database: O(n) text search; poor quality; O(log n + r) numeric range search
        Search Engine: O(r) text search (where r <= n); high quality; O(r) numeric range search
  76. 76. Part 3: Storage
  77. 77. CREATE TABLE listings ( id bigint(20), title varchar(1024), description longtext, price decimal(10,2), quantity int(8), PRIMARY KEY (id), KEY (price) );
        SELECT * FROM listings WHERE (title LIKE '%cat%' OR description LIKE '%cat%') AND price <= 15;
  78. 78. <schema name="listings">
          <fields>
            <field name="id" type="int20" required="true" indexed="true" stored="true"/>
            <field name="title" type="text" required="true" indexed="true" stored="false"/>
            <field name="description" type="text" required="true" indexed="true" stored="false"/>
            <field name="price" type="long" required="true" indexed="true" stored="false"/>
            <field name="quantity" type="int8" required="true" indexed="true" stored="false"/>
          </fields>
        </schema>
        q=cat & fq=price:[* TO 15]
        <requestHandler name="myHandler" default="true">
          <lst name="defaults">
            <str name="qf">title description</str>
          </lst>
        </requestHandler>
  79. 79. <schema name="listings">
          <fields>
            <field name="id" type="int20" stored="true"/>
            <field name="title" type="text" stored="false"/>
            <field name="description" type="text" stored="false"/>
            <field name="price" type="long" stored="false"/>
            <field name="quantity" type="int8" stored="false"/>
          </fields>
        </schema>
  80. 80. <schema name="listings">
          <fields>
            <field name="id" type="int20" stored="true"/>
            <field name="title" type="text" stored="true"/>
            <field name="description" type="text" stored="true"/>
            <field name="price" type="long" stored="true"/>
            <field name="quantity" type="int8" stored="true"/>
          </fields>
        </schema>
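
The difference between the last two schemas, `stored=false` versus `stored=true` on every field, can be pictured with a toy engine. `TinyEngine` and its methods are hypothetical and only index the title; they are not the search engine's real API.

```python
class TinyEngine:
    """Toy engine: every title is searchable, only chosen fields are stored."""

    def __init__(self, stored_fields):
        self.stored_fields = set(stored_fields)
        self.postings = {}  # term -> list of doc ids (the inverted index)
        self.stored = {}    # doc id -> copy of the stored fields only

    def add(self, doc):
        for term in doc["title"].lower().split():
            ids = self.postings.setdefault(term, [])
            if doc["id"] not in ids:
                ids.append(doc["id"])
        self.stored[doc["id"]] = {
            name: value for name, value in doc.items()
            if name in self.stored_fields
        }

    def search(self, term):
        """Return whatever was stored for each matching document."""
        return [self.stored[i] for i in self.postings.get(term.lower(), [])]


DOC = {"id": 1, "title": "Cat hat", "price": "15.00"}

ids_only = TinyEngine(stored_fields={"id"})
ids_only.add(DOC)

everything = TinyEngine(stored_fields={"id", "title", "price"})
everything.add(DOC)

if __name__ == "__main__":
    print(ids_only.search("cat"))
    print(everything.search("cat"))
```

With only `id` stored, a search hands back ids and the full rows still have to be fetched from the database. With every field stored, the engine can return complete documents by itself, at the cost of keeping a second copy of all the data.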
  81. 81. A search index is not a database index. But a search engine can totally be a database
  82. 82. Don’t do it (By Darcy Quinn, riotcakes.etsy.com)
  83. 83. Database vs. Search Engine
        Database: O(n) text search; poor quality; O(log n + r) numeric range search; good at storage
        Search Engine: O(r) text search (where r <= n); high quality; O(r) numeric range search; ‘Meh’ at storage
        ✓ ✓ ✓ ✓
  84. 84. By Ashley Fehribach, furballfanatic.etsy.com
  85. 85. @nerdymathlete
  86. 86. Thank you. Toria Gibbs, Senior Software Engineer @ Etsy, @scarletdrive