
Search Engines: How They Work and Why You Need Them


Given at self.conference 2019 in Detroit, MI by Toria Gibbs.

Published in: Technology


  1. 1. Search Engines: How They Work and Why You Need Them
  2. 2. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  5. 5. What do you even do all day? We have Google. @scarletdrive
  6. 6. Not all search engines are web search engines. @scarletdrive
  7. 7. google.com vs. potatoparcel.com. google.com: large scope (entire internet); no control over content; many use cases. potatoparcel.com: small scope (just a few potatoes); total control over content; optimize for selling potatoes.
  8. 8. Most websites have a custom search engine. @scarletdrive
  9. 9. Why build search engines? ● Keep it local and customize it
  10. 10. Let’s try to search my store. @scarletdrive
  11. 11. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99
  12. 12. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. Search: “cat”. SELECT * FROM items WHERE title LIKE '%cat%'
  14. 14. n = items in database; m = max length of title strings; cost of the scan: n·m
  15. 15. n = items in database; m = max length of title strings = 250 (a constant), so O(n·m) = O(n)
  16. 16. n → n·m (m = 250): 10 → 2 500; 100 → 25 000; 1 000 → 250 000; 10 000 → 2 500 000; 100 000 → 25 000 000; 1 000 000 → 250 000 000
  17. 17. Why build search engines? ● Keep it local and customize it ● Improve performance
  18. 18. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. SELECT * FROM items WHERE title LIKE '%cat%'
  19. 19. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. ● Search for “cat” incorrectly returns “vacation hat for dog”. SELECT * FROM items WHERE title LIKE '%cat%'
  20. 20. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens”. SELECT * FROM items WHERE title LIKE '%cat%'
  21. 21. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” ● Search for “cats” doesn’t return “cat hat” or “red cat mittens”. SELECT * FROM items WHERE title LIKE '%cats%'
  22. 22. SELECT * FROM items WHERE title LIKE 'cat' OR title LIKE 'cats' OR title LIKE 'cat %' OR title LIKE 'cats %' OR title LIKE '% cat' OR title LIKE '% cats' OR title LIKE '% cat %' OR title LIKE '% cats %' OR title LIKE '% cat.%' OR title LIKE '% cats.%' OR title LIKE '%.cat %' OR title LIKE '%.cats %' OR title LIKE '%.cat.%' OR title LIKE '%.cats.%' OR title LIKE '% cat,%' OR title LIKE '% cats,%' OR title LIKE '%,cat %' OR title LIKE '%,cats %' OR title LIKE '%,cat,%' OR title LIKE '%,cats,%' OR title LIKE '% cat-%' OR title LIKE '% cats-%' OR title LIKE '%-cat %' OR title LIKE '%-cats %' OR title LIKE '%-cat-%' OR title LIKE '%-cats-%' ...
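The three problems listed above can be reproduced with a few lines of Python. This is only a sketch: `like_search` is a hypothetical stand-in for `LIKE '%…%'` (a plain substring scan over every title), not how a database actually executes the query. The titles are the eight items from the slide.

```python
# Item titles from the slides, keyed by id.
ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def like_search(needle):
    """Mimic title LIKE '%needle%': check every title for the substring."""
    return [item_id for item_id, title in ITEMS.items() if needle in title]


# "cat" matches item 4 ("vaCATion") and misses item 7 ("kitten mittens").
print(like_search("cat"))   # [1, 3, 4, 5]
# "cats" misses items 1 and 5, which only contain the singular.
print(like_search("cats"))  # [3]
```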
  23. 23. Why build search engines? ● Keep it local and customize it ● Improve performance ● Improve quality of results
  24. 24. But how? @scarletdrive
  25. 25. Agenda 1. Why build search engines? ✓ 2. Search indexes 3. Open source tools 4. Interesting challenges
  26. 26. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99
  27. 27. Inverted Index: red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]
  28. 28. Terminology ● A document is a single searchable unit. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. 7 | kitten mittens | 11.99
  29. 29. Terminology ● A document is a single searchable unit ● A field is a defined value in a document. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. id | title | price: 7 | kitten mittens | 11.99
  30. 30. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. id | title | price: 7 | kitten mittens | 11.99
  31. 31. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]
  32. 32. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs ● An index is a collection of documents (including many inverted indexes). title: red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], ... price: 5.00 [5], 8.00 [3], 0-10.00 [3, 5], 11.99 [7, 8], ... id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; ...
  33. 33. Terminology ● A search index can have many inverted indexes ● A search engine can have many search indexes. Example: items index (title inverted index, price inverted index); blog-posts index (title inverted index, post inverted index)
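As a rough illustration of a second inverted index on the same documents, the sketch below builds a price index holding both exact prices and range buckets, like the "0-10.00" entry on the slide. The function name `build_price_index`, the bucket width of 10, and the bucket label format are assumptions made for this example; the prices are the ones from the items table.

```python
from collections import defaultdict

# Prices from the items table, keyed by document id.
PRICES = {1: 14.99, 2: 24.99, 3: 8.00, 4: 12.99,
          5: 5.00, 6: 10.49, 7: 11.99, 8: 11.99}


def build_price_index(prices, bucket_width=10):
    """Map exact prices and half-open range buckets [low, low + width) to ids."""
    index = defaultdict(list)
    for doc_id, price in prices.items():
        index[f"{price:.2f}"].append(doc_id)                 # exact-price term
        low = int(price // bucket_width) * bucket_width
        index[f"{low}-{low + bucket_width}"].append(doc_id)  # range term
    return dict(index)


# One search index ("items") holding one inverted index per field.
items_index = {"price": build_price_index(PRICES)}

print(items_index["price"]["11.99"])  # [7, 8]
print(items_index["price"]["0-10"])   # [3, 5]
```

Precomputing range buckets at index time trades extra space for a single lookup at query time, the same trade-off the talk describes for the title index.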
  34. 34. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ● Improve quality of results
  35. 35. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. Query: cat
  36. 36. O(1)
  37. 37. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. Query: cat. Results: id | title | price: 1 | red cat mittens | 14.99; 3 | blue hat for cats | 8.00; 5 | cat hat | 5.00
  38. 38. r = number of results found; total cost: O(1 + r)
  39. 39. ...but we usually only ask for a fixed number of results at a time: O(25) → O(1)
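The lookup cost described above can be sketched with a Python dict standing in for the inverted index. The names `INDEX`, `TITLES`, `search`, and the `page_size` parameter are assumptions for illustration; the posting lists follow the slide, except that document 1 ("red cat mittens") is also listed under "mitten".

```python
TITLES = {
    1: "red cat mittens", 2: "blue dog mittens", 3: "blue hat for cats",
    4: "vacation hat for dog", 5: "cat hat", 6: "red and blue dog hat",
    7: "kitten mittens", 8: "dog booties",
}

# term -> posting list of document ids
INDEX = {
    "red": [1, 6], "cat": [1, 3, 5], "mitten": [1, 2, 7],
    "blue": [2, 3, 6], "hat": [3, 4, 5, 6], "dog": [2, 4, 6, 8],
    "vacation": [4], "kitten": [7], "boot": [8],
}


def search(term, page_size=25):
    """Return at most page_size (id, title) pairs for a single term."""
    postings = INDEX.get(term, [])   # average-case O(1) hash lookup
    page = postings[:page_size]      # fetch at most page_size results
    return [(doc_id, TITLES[doc_id]) for doc_id in page]


print(search("cat"))
# [(1, 'red cat mittens'), (3, 'blue hat for cats'), (5, 'cat hat')]
print(search("hat", page_size=2))
# [(3, 'blue hat for cats'), (4, 'vacation hat for dog')]
```

Because the page size is capped, the work per query stays bounded no matter how many documents match, which is the O(25) → O(1) step on the slide.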
  40. 40. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ● Improve quality of results
  41. 41. But at what cost? @scarletdrive
  42. 42. Trade-offs ● Space ● System complexity ● Pre-processing time
  43. 43. Query time: O(1). Index time: O(n·m·p)
  44. 44. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results
  45. 45. Let’s talk about how we build it. @scarletdrive
  46. 46. red [1, 6], cat [1, 3, 5], mitten [1, 2, 7], blue [2, 3, 6], hat [3, 4, 5, 6], dog [2, 4, 6, 8], vacation [4], kitten [7], boot [8]. id | title | price: 1 | red cat mittens | 14.99; 2 | blue dog mittens | 24.99; 3 | blue hat for cats | 8.00; 4 | vacation hat for dog | 12.99; 5 | cat hat | 5.00; 6 | red and blue dog hat | 10.49; 7 | kitten mittens | 11.99; 8 | dog booties | 11.99. How did we do this??
  47. 47. Step 1: Tokenization. string: “cat hat” → array: [“cat”, “hat”] (Image from aliexpress.com)
  48. 48. Step 2: Normalization ● Stemming ○ “cats” → “cat” ○ “walking” → “walk” ● Stop words ○ Remove “the”, “and”, “to”, etc. (Image from aliexpress.com)
  49. 49. Step 3: Filters ● Lowercase ○ “Dog” → “dog” ● Synonyms ○ “colour” → “color” ○ “t-shirt” → “tshirt” ○ “canadian” → “canada” ○ “kitten” → “cat” (Image from aliexpress.com)
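The three steps above can be sketched as a small analysis pipeline that feeds an index builder. Everything here is a simplified assumption: `stem` is a toy suffix stripper rather than a real stemming algorithm, the stop-word list is a guess beyond the three words on the slide, and lowercasing is done during tokenization (earlier than on the slides) so that the stop-word and synonym lookups see lowercase tokens.

```python
import re

STOP_WORDS = {"the", "and", "to", "for", "a"}   # assumed list; slide names "the", "and", "to"
SYNONYMS = {"colour": "color", "t-shirt": "tshirt",
            "canadian": "canada", "kitten": "cat"}  # pairs from the slide


def tokenize(text):
    """Step 1: split into lowercase word tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def stem(token):
    """Toy suffix stripper for illustration only."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop words, stem, then map synonyms."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        token = stem(token)
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(titles):
    """Build term -> list of document ids from analyzed titles."""
    index = {}
    for doc_id, title in titles.items():
        for term in analyze(title):
            postings = index.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return index


TITLES = {
    1: "red cat mittens", 2: "blue dog mittens", 3: "blue hat for cats",
    4: "vacation hat for dog", 5: "cat hat", 6: "red and blue dog hat",
    7: "kitten mittens", 8: "dog booties",
}
index = build_index(TITLES)
print(analyze("kitten mittens"))  # ['cat', 'mitten']
print(index["cat"])               # [1, 3, 5, 7]
```

With the "kitten" → "cat" synonym applied at index time, document 7 lands under "cat" and there is no separate "kitten" entry.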
  50. 50. Quality Problems 1. “cat” search returned “vacation hat for dog”
  51. 51. Quality Problems 1. “cat” search returned “vacation hat for dog” id title price 4 vacation hat for dog 12.99 cat [1, 3, 5] hat [4] dog [4] vacation [4]
  52. 52. Quality Problems 1. “cat” search returned “vacation hat for dog” cat [1, 3, 5] hat [4] dog [4] vacation [4] cat id title price 4 vacation hat for dog 12.99
  53. 53. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens”
  54. 54. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] →
  55. 55. All transformations performed on the input data for the index are also performed on the query
  56. 56. Quality Problems 2. “cats” search does not return “red cat mittens”. id | title | price: 1 | red cat mittens | 14.99. red [1], cat [1], mitten [1]. Query: cats → cat
  57. 57. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens” 3. “cat” search does not return “kitten mittens”
  58. 58. Quality Problems 3. “cat” search does not return “kitten mittens” id title price 7 kitten mittens 11.99 cat [7] mitten [7]
  59. 59. Quality Problems 3. “cat” search does not return “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 cat
  60. 60. Quality Problems 3½. Search for “kitten” still returns “kitten mittens”. cat [7], mitten [7]. id | title | price: 7 | kitten mittens | 11.99. Query: kitten → cat
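To see why running the same transformations on the query fixes all three quality problems, here is a compact sketch. It uses a hypothetical lookup table (`NORMALIZE`) as a stand-in for real stemming and synonym handling, and it assumes multi-word queries require every term to match (AND semantics), which the slides do not specify.

```python
# Stand-in for stemming + synonyms; a real engine would use proper analyzers.
NORMALIZE = {"cats": "cat", "kitten": "cat", "mittens": "mitten"}
STOP_WORDS = {"for", "and"}

TITLES = {1: "red cat mittens", 4: "vacation hat for dog", 7: "kitten mittens"}


def analyze(text):
    """Shared by indexing and querying so both sides produce the same terms."""
    return [NORMALIZE.get(tok, tok)
            for tok in text.lower().split() if tok not in STOP_WORDS]


# Index time: analyze every title.
index = {}
for doc_id, title in TITLES.items():
    for term in analyze(title):
        index.setdefault(term, set()).add(doc_id)


def search(query):
    """Query time: analyze the query the same way, then intersect postings."""
    result = None
    for term in analyze(query):
        ids = index.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])


print(search("cat"))     # [1, 7]  no "vacation hat for dog"
print(search("cats"))    # [1, 7]  plural now finds "red cat mittens"
print(search("kitten"))  # [1, 7]  "kitten mittens" is still found
```

One side effect of the synonym choice is visible in the last line: a search for "kitten" now also returns "red cat mittens".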
  61. 61. Did we solve it? ● Keep it local ✓ and customize it ✓ ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results ✓ ○ By performing special pre-processing steps
  62. 62. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools 4. Interesting challenges
  63. 63. I want a search engine... do I have to build it myself? @scarletdrive
  64. 64. ● Inverted index ● Basic tokenization, normalization, and filters ● Replication, sharding, and distribution ● Caching and warming ● Advanced tokenization, normalization, and filters ● Plugins ● ...and more!
  65. 65. Which one should I pick? It doesn’t matter
  66. 66. Which one should I pick? ● Most projects work well with either ● Getting configuration right is most important ● Test with your own data, your own queries. References: “Side by Side with Elasticsearch and Solr” by Rafał Kuć and Radu Gheorghe, https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr and https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability; “Solr vs. Elasticsearch” by Kelvin Tan, http://solr-vs-elasticsearch.com/
  67. 67. Which one should I pick? Better for advanced customization Easier to learn, faster to start up, better docs ~ ~ WARNING: Toria’s personal opinion ~ ~
  68. 68. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges
  69. 69. Interesting Challenge: Scalability
  70. 70. Too much traffic? Replication
  71. 71. Too much traffic? Replication update
  72. 72. Too much data? Sharding Distribution
  73. 73. Replication, Sharding, and Distribution: 8 shards (A, B, C, D, E, F, G, H), 3 replicas each, 6 servers
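The numbers on this slide work out to 8 × 3 = 24 shard copies, or 4 per server across 6 servers. The sketch below shows one hypothetical way to lay them out (simple round-robin); the function name `place` is an assumption, and the actual layout pictured on the slide, or chosen by Elasticsearch or Solr, may differ.

```python
from collections import defaultdict

SHARDS = list("ABCDEFGH")
REPLICAS = 3
SERVERS = 6


def place(shards, replicas, servers):
    """Assign shard copies to servers round-robin.

    Consecutive copies go to consecutive servers, so as long as
    replicas <= servers no server holds two copies of the same shard.
    """
    if replicas > servers:
        raise ValueError("need at least as many servers as replicas")
    layout = defaultdict(list)
    copy_number = 0
    for shard in shards:
        for _ in range(replicas):
            layout[copy_number % servers].append(shard)
            copy_number += 1
    return dict(layout)


layout = place(SHARDS, REPLICAS, SERVERS)
for server, held in sorted(layout.items()):
    print(server, held)
```

Replication spreads query traffic over the copies of a shard, while sharding splits the data so that no single server has to hold all of it.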
  75. 75. Interesting Challenge: Relevance
  76. 76. When there are many results, what order should we display them in? id | title | price: 1 | red cat mittens | 14.99; 3 | blue hat for cats | 8.00; 5 | cat hat | 5.00; 22 | feather cat toy | 7.99; 124 | cat and mouse t-shirt | 24.50; 128 | cat t-shirt | 31.80; 329 | “cats rule” sticker | 0.99; 420 | catnip joint for cats | 5.99; 455 | cat toy | 7.00; ...
  77. 77. tf-idf
  78. 78. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = ln(total number of docs / # docs which contain this term). Query: “cat”. Docs: 1. The orange cat is a very good cat. 2. My cat ate an orange. 3. Cats are the best and I will give every cat a special cat toy. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 1/5 = 0.20; doc 3 = 3/14 = 0.21. IDF(cat) = ln(3/3) = 0. Result order by TF = [1, 3, 2]
  79. 79. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = ln(total number of docs / # docs which contain this term). Query: “cat”. Docs: 1. The orange cat is a very good cat. 2. My cat ate an orange. Cat cat cat! 3. Cats are the best and I will give every cat a special cat toy. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 4/8 = 0.50; doc 3 = 3/14 = 0.21. IDF(cat) = ln(3/3) = 0. Result order by TF = [2, 1, 3]
  80. 80. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = ln(total number of docs / # docs which contain this term). Query: “orange cat”. Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. (Assume 100 records which all contain “cat” in them.) IDF(cat) = ln(100/100) = 0.0; IDF(orange) = ln(100/2) = 3.9
  81. 81. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = ln(total number of docs / # docs which contain this term). Query: “orange cat”. Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. IDF(cat) = ln(100/100) = 0.0; IDF(orange) = ln(100/2) = 3.9. score(doc 1) = score(cat, doc 1) + score(orange, doc 1) = 0.29*0.0 + 0.14*3.9 = 0.55. score(doc 2) = score(cat, doc 2) + score(orange, doc 2) = 0.20*0.0 + 0.20*3.9 = 0.78
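A short sketch of the TF and IDF formulas as written on these two slides. Note that IDF(cat) is ln(3/3) = 0 here, so multiplying by it would tie every document at zero; the orderings on the slides follow TF alone, and that is what `rank_by_tf` reproduces. The helper names and the toy plural-stripping in `terms` (needed so "Cats" counts as "cat") are assumptions for this example.

```python
import math
import re


def terms(text):
    """Lowercase word tokens with a toy plural strip ("cats" -> "cat")."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def tf(term, text):
    """Occurrences of term divided by total number of terms in the doc."""
    doc_terms = terms(text)
    return doc_terms.count(term) / len(doc_terms)


def idf(term, docs):
    """ln(total docs / docs containing term); assumes at least one match."""
    containing = sum(1 for text in docs.values() if term in terms(text))
    return math.log(len(docs) / containing)


def rank_by_tf(term, docs):
    return sorted(docs, key=lambda doc_id: tf(term, docs[doc_id]), reverse=True)


docs = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}
print(rank_by_tf("cat", docs))  # [1, 3, 2]

# Keyword stuffing: padding doc 2 with "cat" pushes it to the top.
stuffed = dict(docs)
stuffed[2] = "My cat ate an orange. Cat cat cat!"
print(rank_by_tf("cat", stuffed))  # [2, 1, 3]
print(idf("cat", docs))            # 0.0
```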
  82. 82. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = ln(total number of docs / # docs which contain this term). Query: “orange cat”. Docs: 1. The orange cat is a good cat. 2. My cat ate an orange. IDF(cat) = ln(100/100) = 0.0; IDF(orange) = ln(100/2) = 3.9. score(doc 1) = score(cat, doc 1) + score(orange, doc 1) = 0.29*0.0 + 0.14*3.9 = 0.55. score(doc 2) = score(cat, doc 2) + score(orange, doc 2) = 0.20*0.0 + 0.20*3.9 = 0.78. Result order = [2, 1]. Slide annotations: 3/7 = 0.43 and 2/5 = 0.40 (share of terms matching either query word in doc 1 and doc 2); 1/7 = 0.14 and 1/5 = 0.20 (TF(orange) in doc 1 and doc 2)
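The scoring on this slide can be checked with a few lines of Python. The document frequencies are taken from the slide's assumption (100 records, all containing "cat", 2 containing "orange") rather than computed from a real corpus, and the names `N_DOCS`, `DOC_FREQ`, and `score` are made up for this sketch. At full precision doc 1 scores about 0.56 rather than the slide's 0.55, because the slide rounds TF and IDF before multiplying.

```python
import math
import re

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}
N_DOCS = 100                           # assumed corpus size from the slide
DOC_FREQ = {"cat": 100, "orange": 2}   # assumed document frequencies


def tf(term, text):
    words = re.findall(r"[a-z]+", text.lower())
    return words.count(term) / len(words)


def score(query, text):
    """Sum of TF * IDF over the query terms."""
    return sum(tf(term, text) * math.log(N_DOCS / DOC_FREQ[term])
               for term in query.lower().split())


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [2, 1]
```

Because every record contains "cat", that term contributes nothing, and the rarer term "orange" decides the order.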
  83. 83. tf-idf bm25 https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
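For comparison with plain tf-idf, here is a minimal sketch of a per-term BM25 score in its commonly published form, the one discussed in the linked Elastic post. The function name and argument names are assumptions; `k1 = 1.2` and `b = 0.75` are commonly cited defaults. It is an illustration of the formula, not the code any particular engine runs.

```python
import math


def bm25_term(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document.

    tf       raw count of the term in the document
    doc_len  number of terms in the document
    avg_len  average document length in the corpus
    n_docs   total number of documents
    doc_freq number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Repeating a term helps less and less: 4 occurrences score well under
# 4 times the score of a single occurrence.
once = bm25_term(tf=1, doc_len=8, avg_len=8, n_docs=100, doc_freq=2)
four = bm25_term(tf=4, doc_len=8, avg_len=8, n_docs=100, doc_freq=2)
print(once, four, four / once)
```

The saturation in term frequency is one reason BM25 is less sensitive than raw tf-idf to the keyword stuffing shown earlier in the talk.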
  84. 84. Relevance Challenges ● Prevent keyword stuffing or other “gaming the system” ● Phrase matching ● Fuzzy matching ● User factors: language, location ● Other factors: quality, recency, randomness, diversity
  85. 85. Interesting Challenges. We covered: ● Scalability ● Relevance. We did not cover: ● Query understanding ● Numeric range search ● Faceted search ● Autocomplete
  86. 86. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges ✓
  87. 87. Thanks!
