
Search Engine Basics - Ruben Ortega

A non-programmer's introduction to search and search engines.

Published in: Education, Business

  1. 1. Search Engine Basics Ruben Ortega
  2. 2. What is covered? • A non-programmer's introduction to: • Why do we have search engines? • How search works across a page, a book, thousands of books, and millions of books. • How to get a good search result.
  3. 3. Speaker Background • 10+ years working on search engines • Amazon, A9.com, Mechanical Turk, Trusera.com • 13 patents -- Helping people find anything • Billions of dollars of revenue • Millions of searches per hour
  4. 4. Have you searched a book for your name? • Wonder how many times your name was mentioned in your High School yearbook? • Find your name across all your High School and college yearbooks? • Which would be the “best result” if I searched for your name in those yearbooks?
  5. 5. Success of Search Engines
  6. 6. Search engines not taught before the web • Not taught because there was no demand. • Why no demand? • Machines had 10-20MB of disk. • $100 per MB of disk --> Disk quotas • Limited networking --> Limited information
  7. 7. What does 1 Megabyte of space hold? • Book Page -- 2.5 Kilobytes of text • 1 Megabyte == 400 pages ~ 1 thick book
  8. 8. Is it worth it to store a book? • If disk space costs $100 per MB, it had better be worth it! • Copying a $20 book into $100 of disk space is not cost effective.
  9. 9. Why has Search grown so quickly? • Lots and lots of fantastically cheap disk space!
  10. 10. Inexpensive Disk! [Two charts, 1984 to 2008: Cost per Megabyte of Disk (axis 0 to 100) and Megabytes per dollar of disk (axis 0 to 10000).]
  11. 11. So what happens? • Information blossoms. • Quotas are gone -- Never have to delete! • Email • Web Pages • Books, Image data, Music
  12. 12. Demand for search skyrocketed • Cheaper disks == more data to search. • More data means: • Demand for better search techniques • Different handling of items indexed • Better user interfaces • Reminder: There is no magic in search!
  13. 13. How does search work? • Let’s run through a text search example
  14. 14. Simple Searching • How do you search for the word “coyness” in the following string: • “Had we but world enough and time thy coyness lady would be no crime.”
  15. 15. Find the first “c” coyness to his coy mistress had we but world enough and time thy coyness lady
  16. 16. coyness to his coy mistress had we but world enough and time thy coyness lady
  17. 17. coyness to his coy mistress had we but world enough and time thy coyness lady
  18. 18. coyness to his coy mistress had we but world enough and time thy coyness lady
  19. 19. coyness to his coy mistress had we but world enough and time thy coyness lady
  20. 20. coyness to his coy mistress had we but world enough and time thy coyness lady
  21. 21. coyness to his coy mistress had we but world enough and time thy coyness lady
  22. 22. coyness to his coy mistress had we but world enough and time thy coyness lady
  23. 23. coyness to his coy mistress had we but world enough and time thy coyness lady
  24. 24. coyness to his coy mistress had we but world enough and time thy coyness lady
  25. 25. No match. coyness to his coy mistress had we but world enough and time thy coyness lady
  26. 26. coyness to his coy mistress had we but world enough and time thy coyness lady
  27. 27. coyness to his coy mistress had we but world enough and time thy coyness lady
  28. 28. coyness to his coy mistress had we but world enough and time thy coyness lady
  29. 29. coyness to his coy mistress had we but world enough and time thy coyness lady
  30. 30. coyness to his coy mistress had we but world enough and time thy coyness lady
  31. 31. coynes to his coy mistress s had we but world enough and time thy coyness lady
  32. 32. coyne to his coy mistress ss had we but world enough and time thy coyness lady
  33. 33. coyn to his coy mistress ess had we but world enough and time thy coyness lady
  34. 34. coy to his coy mistress ness had we but world enough and time thy coyness lady
  35. 35. co to his coy mistress yness had we but world enough and time thy coyness lady
  36. 36. c to his coy mistress oyness had we but world enough and time thy coyness lady
  37. 37. to his coy mistress coyness had we but world enough and time thy coyness lady
  38. 38. to his coy mistress coyness had we but world enough and time thy coyness lady
  39. 39. to his coy mistress coyness had we but world enough and time thy coyness lady
  40. 40. to his coy mistress coyness had we but world enough and time thy coyness lady
  41. 41. to his coy mistress coyness had we but world enough and time thy coyness lady
  42. 42. to his coy mistress coyness had we but world enough and time thy coyness lady
  43. 43. to his coy mistress coyness had we but world enough and time thy coyness lady
  44. 44. to his coy mistress coyness had we but world enough and time thy coyness lady
  45. 45. to his coy mistress coyness had we but world enough and time thy coyness lady
  46. 46. to his coy mistress coyness had we but world enough and time thy coyness lady
  47. 47. to his coy mistress coyness had we but world enough and time thy coyness lady
  48. 48. to his coy mistress coyness had we but world enough and time thy coyness lady
  49. 49. to his coy mistress coyness had we but world enough and time thy coyness lady
  50. 50. to his coy mistress coyness had we but world enough and time thy coyness lady
  51. 51. to his coy mistress coyness had we but world enough and time thy coyness lady
  52. 52. to his coy mistress coyness had we but world enough and time thy coyness lady
  53. 53. to his coy mistress coyness had we but world enough and time thy coyness lady
  54. 54. to his coy mistress coynes had we but world enough s and time thy coyness lady
  55. 55. to his coy mistress coyne had we but world enough ss and time thy coyness lady
  56. 56. to his coy mistress coyn had we but world enough ess and time thy coyness lady
  57. 57. to his coy mistress coy had we but world enough ness and time thy coyness lady
  58. 58. to his coy mistress co had we but world enough yness and time thy coyness lady
  59. 59. to his coy mistress c had we but world enough oyness and time thy coyness lady
  60. 60. to his coy mistress had we but world enough coyness and time thy coyness lady
  61. 61. to his coy mistress had we but world enough coyness and time thy coyness lady
  62. 62. to his coy mistress had we but world enough coyness and time thy coyness lady
  63. 63. to his coy mistress had we but world enough coyness and time thy coyness lady
  64. 64. to his coy mistress had we but world enough coyness and time thy coyness lady
  65. 65. to his coy mistress had we but world enough coyness and time thy coyness lady
  66. 66. to his coy mistress had we but world enough coyness and time thy coyness lady
  67. 67. to his coy mistress had we but world enough coyness and time thy coyness lady
  68. 68. to his coy mistress had we but world enough coyness and time thy coyness lady
  69. 69. to his coy mistress had we but world enough coyness and time thy coyness lady
  70. 70. to his coy mistress had we but world enough coyness and time thy coyness lady
  71. 71. to his coy mistress had we but world enough coyness and time thy coyness lady
  72. 72. to his coy mistress had we but world enough coyness and time thy coyness lady
  73. 73. to his coy mistress had we but world enough coyness and time thy coyness lady
  74. 74. to his coy mistress had we but world enough coyness and time thy coyness lady
  75. 75. to his coy mistress had we but world enough coyness and time thy coyness lady
  76. 76. to his coy mistress had we but world enough coyness and time thy coyness lady
  77. 77. to his coy mistress had we but world enough coyness and time thy coyness lady
  78. 78. to his coy mistress had we but world enough coyness and time thy coyness lady
  79. 79. Matched! to his coy mistress had we but world enough coyness and time thy coyness lady
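
The walk-through above can be sketched in a few lines of Python. This is a minimal illustration of the naive approach, not the speaker's code; the function name `naive_find` is made up for this example. It slides the word along the text one position at a time and compares character by character.

```python
def naive_find(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    for start in range(n - m + 1):
        # Compare the word against the text one character at a time.
        matched = 0
        while matched < m and text[start + matched] == word[matched]:
            matched += 1
        if matched == m:
            return start  # every character lined up: matched!
    return -1  # ran off the end of the text without a match


text = "to his coy mistress had we but world enough and time thy coyness lady"
position = naive_find(text, "coyness")
```

Note that the search first stops at "coy" in "coy mistress", gets three characters in, fails on the space, and moves on. That false start is the "No match" step in the slides.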
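  80. 80. Can we find it faster? • Yes! • Boyer-Moore-Horspool. • Compare starting from the end of the word. • On a mismatch, look at the text character under the end of the word: if it occurs in the word, shift the word forward to line up with it; if not, skip the whole length of the word.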
  81. 81. coyness to his coy mistress had we but world enough and time thy coyness lady
  82. 82. coyness to his coy mistress had we but world enough and time thy coyness lady
  83. 83. No match, skip. coyness to his coy mistress had we but world enough and time thy coyness lady
  84. 84. coyne to his coy mistress ss had we but world enough and time thy coyness lady
  85. 85. to his coy mistress coyness had we but world enough and time thy coyness lady
  86. 86. to his coy mistress coyness had we but world enough and time thy coyness lady
  87. 87. to his coy mistress coyness had we but world enough and time thy coyness lady
  88. 88. to his coy mistress had we but world enough coyness and time thy coyness lady
  89. 89. Doesn’t match but C is a letter in our word to his coy mistress had we but world enough coyness and time thy coyness lady
  90. 90. Jump 7 spaces to his coy mistress had we but world enough coyness and time thy coyness lady
  91. 91. to his coy mistress had we but world enough coyness and time thy coyness lady
  92. 92. to his coy mistress had we but world enough coyness and time thy coyness lady
  93. 93. to his coy mistress had we but world enough coyness and time thy coyness lady
  94. 94. to his coy mistress had we but world enough coyness and time thy coyness lady
  95. 95. Matched! to his coy mistress had we but world enough coyness and time thy coyness lady
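
Here is a hedged sketch of the Boyer-Moore-Horspool idea in Python. The names `build_shift_table` and `horspool_find` are invented for this example. The table records, for each character of the word except the last, how far the word can slide so that character lines up; any character not in the table lets the word jump its full length.

```python
def build_shift_table(word: str) -> dict[str, int]:
    """For each character except the last, how far the word may slide."""
    m = len(word)
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    return {ch: m - 1 - i for i, ch in enumerate(word[:-1])}


def horspool_find(text: str, word: str) -> int:
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    if m == 0:
        return 0
    shift = build_shift_table(word)
    start = 0
    while start <= n - m:
        # Compare from the end of the word backwards.
        j = m - 1
        while j >= 0 and text[start + j] == word[j]:
            j -= 1
        if j < 0:
            return start
        # Slide according to the text character under the word's last letter.
        start += shift.get(text[start + m - 1], m)
    return -1


text = "to his coy mistress had we but world enough and time thy coyness lady"
position = horspool_find(text, "coyness")
```

For "coyness" the table gives a shift of 6 for "c" and 1 for "s"; a character that is not in the word at all, such as a space, allows a jump of the full 7.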
  96. 96. Simple Search works! • The naive algorithm works quickly for documents you have never seen before and don’t want to bother keeping around. • Boyer-Moore-Horspool works even faster, with a little extra overhead for building a table. • But what if I have extra disk space to store a book and want to go even faster?
  97. 97. Build an index! Image by Dan Taylor: http://www.flickr.com/photos/dantaylor/1145628275/
  98. 98. Indexes are not new • Indexes created in the 10th century to find words in books. • Card catalogs in libraries provide indexes to books. • What is new is how much information can be stored in a single place.
  99. 99. Indexing is simple • For each word in a book • Store which page in the book it is on.
  100. 100. Partial index • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
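
Building such an index might look like the sketch below. The three-page "book" is made-up sample text and `build_index` is a hypothetical helper name; a real indexer would handle punctuation, word forms, and so on more carefully.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of page numbers it appears on."""
    index = defaultdict(set)
    for page_number, page_text in pages.items():
        # Lowercase the page and pull out runs of letters as words.
        for word in re.findall(r"[a-z']+", page_text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


# Made-up sample pages, for illustration only.
pages = {
    1: "A cat sat in the sun.",
    2: "The hat was in a box.",
    3: "The cat wore the hat.",
}
index = build_index(pages)
```

Common words such as "the" end up with long page lists, while rarer words such as "cat" get short ones, just as in the partial index on the slide.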
  101. 101. Indexes use more disk space • A complete index takes about 33% as much space as the text indexed. • In 1984, a book plus its index would cost about $133 in disk space.
  102. 102. • In 2008, $133 of disk can store and index 1 million books.
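
The arithmetic behind these figures can be checked with a short calculation. The inputs come from the slides (2.5 KB per page, 400 pages per book, $100 per MB, 33% index overhead); the implied 2008 price per megabyte is derived here and is not a number the speaker states.

```python
PAGE_KB = 2.5              # text per book page, from the slides
PAGES_PER_BOOK = 400       # one thick book, from the slides
INDEX_OVERHEAD = 0.33      # index size as a fraction of the text
COST_PER_MB_1984 = 100.0   # dollars per megabyte, from the slides

# Treat 1 MB as 1000 KB, as the slides' "1 Megabyte == 400 pages" implies.
book_mb = PAGE_KB * PAGES_PER_BOOK / 1000
stored_mb = book_mb * (1 + INDEX_OVERHEAD)   # book text plus its index
cost_1984 = stored_mb * COST_PER_MB_1984     # about $133 per book

# If the same $133 holds a million indexed books in 2008, the implied
# price per megabyte is (derived, not stated in the talk):
implied_cost_per_mb_2008 = cost_1984 / (1_000_000 * stored_mb)
```

Under these assumptions the implied 2008 price works out to roughly one hundredth of a cent per megabyte, a millionfold drop from 1984.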
  103. 103. How do you search with an index? • Step 1: Pick the words you are looking for from the index. • Step 2: Return all the pages that the word appears on.
  104. 104. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  105. 105. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  106. 106. Search for “coyness” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  107. 107. Search for “Cat in the Hat” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
  108. 108. Search for “Cat in the Hat” • a -- 1,2,3,4,5,6,7,8,9,10,.... • cat -- 20, 45, 56, 58, 93, 84, 85 • coyness -- 70, 152, 425 • hat -- 6, 10, 35, 58, 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58,...... • the -- 1,2,3,4,5,6,7,8,9,10,....58,....
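
The two steps can be sketched as follows, using only the page numbers visible on the slide. The slide elides most of the pages for "a", "in" and "the"; this example keeps just the visible ones and does not try to fill in the rest. The function name `search` is made up for this example. For a multi-word query, one simple approach (assumed here) is to keep only the pages that every word appears on.

```python
# Partial index copied from the slide; elided page numbers are left out.
index = {
    "a": list(range(1, 11)),
    "cat": [20, 45, 56, 58, 93, 84, 85],
    "coyness": [70, 152, 425],
    "hat": [6, 10, 35, 58, 89, 105],
    "in": list(range(1, 11)) + [58],
    "the": list(range(1, 11)) + [58],
}


def search(index: dict[str, list[int]], query: str) -> list[int]:
    """Return the pages on which every word of the query appears."""
    words = query.lower().split()
    if not words:
        return []
    # Step 1: pick the page list for each word from the index.
    pages = set(index.get(words[0], []))
    # Step 2: keep only pages shared by all of the words.
    for word in words[1:]:
        pages &= set(index.get(word, []))
    return sorted(pages)


coyness_pages = search(index, "coyness")
cat_in_the_hat_pages = search(index, "Cat in the Hat")
```

With this partial index, "coyness" returns its three pages directly, and "Cat in the Hat" narrows down to page 58, the only page all four words share.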
  109. 109. Phrase Search for “Cat in the Hat” • a -- page 1(3, 12, 15,18),2( 12, 54,56).... • cat -- page 20(45), 56(5), 58(3), 93(23).... • coyness -- 70(56, 82), 152(45), 425(12) • hat -- 6, 10, 35, 58(6), 89,105 • in -- 1,2,3,4,5,6,7,8,9,10,....58(4),...... • the -- 1,2,3,4,5,6,7,8,9,10,....58(5),.... Added page position in ()
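
A phrase search needs the words to sit next to each other, which is what the added positions make possible. The sketch below uses the positions shown on the slide for page 58 and for "cat"; the page 93 entries for "in", "the" and "hat" are hypothetical, added only to show a page that contains all the words without containing the phrase. `phrase_search` is an invented name.

```python
# word -> {page: [positions on that page]}
positional_index = {
    "cat": {20: [45], 56: [5], 58: [3], 93: [23]},  # from the slide
    "in": {58: [4], 93: [2]},     # page 93 entry is hypothetical
    "the": {58: [5], 93: [40]},   # page 93 entry is hypothetical
    "hat": {58: [6], 93: [7]},    # page 93 entry is hypothetical
}


def phrase_search(index: dict, phrase: str) -> list[int]:
    """Return pages where the words appear consecutively, in order."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    hits = []
    for page, starts in index[words[0]].items():
        for start in starts:
            # Each following word must sit exactly one position further on.
            if all(start + offset in index[word].get(page, [])
                   for offset, word in enumerate(words[1:], start=1)):
                hits.append(page)
                break
    return sorted(hits)


phrase_pages = phrase_search(positional_index, "Cat in the Hat")
```

On page 58 the words occupy positions 3, 4, 5 and 6, so the phrase matches. Page 93 has all four words, but scattered, so it is rejected.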
  110. 110. How about Searching 1000's of books? • Leverage the same tools we used before • Create an index over multiple books • Perform a search returning books and pages
  111. 111. Multiple books for “Cat in the Hat” • cat -- [Dr. Seuss] 20, 45, 56, 58, [Pet Health Dictionary] 5, 25, 68 • hat -- [Harry Potter] 6, 92, [Dr. Seuss] 35, 58, 89,105 • in -- [Twilight]1,2,...[Dr. Seuss],1,2,3,...58,... • the -- [Programming Perl] 1,2,3,4,5, .... [Dr. Seuss]...58,.... Added Book titles in []
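
Extending the index to many books mostly means recording the book alongside the page. A minimal sketch, using only the titles and page numbers visible on the slide (elided entries are left out rather than guessed) and an invented function name `search_books`:

```python
# word -> {book title: [pages]}; only entries visible on the slide.
multi_book_index = {
    "cat": {"Dr. Seuss": [20, 45, 56, 58],
            "Pet Health Dictionary": [5, 25, 68]},
    "hat": {"Harry Potter": [6, 92],
            "Dr. Seuss": [35, 58, 89, 105]},
    "in": {"Twilight": [1, 2],
           "Dr. Seuss": [1, 2, 3, 58]},
    "the": {"Programming Perl": [1, 2, 3, 4, 5],
            "Dr. Seuss": [58]},
}


def search_books(index: dict, query: str) -> list[tuple[str, int]]:
    """Return (book, page) pairs on which every query word appears."""
    words = query.lower().split()
    if not words:
        return []
    result = None
    for word in words:
        hits = {(book, page)
                for book, pages in index.get(word, {}).items()
                for page in pages}
        # Keep only the (book, page) pairs shared with the earlier words.
        result = hits if result is None else result & hits
    return sorted(result)


matches = search_books(multi_book_index, "Cat in the Hat")
```

The search now returns books and pages together; with this data the only full match is page 58 of the Dr. Seuss book.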
  112. 112. How do you search Millions of Books? • Similar to finding all the Aces in a deck of cards. • 1 person -- 30 seconds if deck is unsorted • 1 person -- 3 seconds if deck is sorted • 26 people -- 1 second if each has 2 cards.
  113. 113. How do you search Millions of Books? [Diagram: Website, Search Service, Query Collector, and a row of Index Servers holding the Book Indexes. Caption: Search across many machines and return best results.]
  114. 114. Millions of books to millions of customers. [Diagram: Website, Search Service, three Query Collectors, and many Index Servers.]
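
The diagrams describe a fan-out: a query collector sends the query to every index server, each server searches only its own slice of the books, and the collector merges the answers. The sketch below imitates that on one machine with threads. The class names, the split of books across servers, and the plain sorted merge are all assumptions made for illustration; the talk does not describe the real implementation, and ranking for "best results" is left out here.

```python
from concurrent.futures import ThreadPoolExecutor


class IndexServer:
    """Holds the index for one slice of the books."""

    def __init__(self, index: dict):
        self.index = index  # word -> {book title: [pages]}

    def search(self, word: str) -> list[tuple[str, int]]:
        return [(book, page)
                for book, pages in self.index.get(word, {}).items()
                for page in pages]


class QueryCollector:
    """Sends a query to every index server and merges the results."""

    def __init__(self, servers: list[IndexServer]):
        self.servers = servers

    def search(self, word: str) -> list[tuple[str, int]]:
        # Ask all servers at once, like the 26 people with 2 cards each.
        with ThreadPoolExecutor(max_workers=len(self.servers)) as pool:
            partial_results = list(
                pool.map(lambda server: server.search(word), self.servers))
        return sorted(hit for part in partial_results for hit in part)


# Hypothetical split of the slide's books across two servers.
server_a = IndexServer({"cat": {"Dr. Seuss": [20, 45, 56, 58]},
                        "hat": {"Dr. Seuss": [35, 58, 89, 105]}})
server_b = IndexServer({"cat": {"Pet Health Dictionary": [5, 25, 68]},
                        "hat": {"Harry Potter": [6, 92]}})
collector = QueryCollector([server_a, server_b])
cat_hits = collector.search("cat")
```

Adding more index servers lets each one hold fewer books, and adding more query collectors lets more customers search at once, which is the step from the first diagram to the second.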
  115. 115. Which is the best result? • Should a search for “cat in the hat” return: • The book by Dr. Seuss, • A book about all the Dr. Seuss books, • A story in which a mother reads the book to her child? • Did you get what the customer wanted?
  116. 116. Relevancy (It depends) • TF/IDF -- Prefer results with rare words versus results with common words • Amazon -- Biases towards what people are searching and buying recently. • Google -- Biases towards user activity, PageRank, and other factors. • Depends on what the customer intends and how they ask the question.
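
To make the TF/IDF bullet concrete, here is one common textbook form of the score: term frequency (how often a word appears in a document) times inverse document frequency (the log of how rare the word is across documents). The three tiny documents are invented, and real engines use many refinements; this is a sketch, not how Amazon or Google rank.

```python
import math
from collections import Counter


def tf_idf_scores(documents: dict[str, str], query: str) -> dict[str, float]:
    """Score each document against the query; rarer words count for more."""
    tokenized = {name: text.lower().split()
                 for name, text in documents.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at all.
            doc_freq = sum(1 for w in tokenized.values() if term in w)
            if doc_freq == 0 or not words:
                continue
            tf = counts[term] / len(words)
            idf = math.log(n_docs / doc_freq)  # 0 when every doc has it
            score += tf * idf
        scores[name] = score
    return scores


# Made-up documents, for illustration only.
documents = {
    "seuss": "the cat in the hat came back",
    "pets": "the cat and the dog in the yard",
    "perl": "the hash in the program",
}
scores = tf_idf_scores(documents, "cat in the hat")
```

Because "the" and "in" appear in every document they contribute nothing, while "hat", which appears in only one, contributes the most. That is the sense in which rare words are preferred over common ones.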
  117. 117. Last step: Get the text snippet. • You have searched across millions of books, • You have found the “Best” books with the words “cat in the hat” • You have spent 50 msec across 100’s of machines to get the right result. • How do you find the “snippet” on the page?
  118. 118. Snippets Excerpts
  119. 119. Get snippet using simple search • Fetch the book page from a different disk. • Use a simple linear search like Naive or Boyer-Moore to get snippet and surrounding text. • Simple techniques applied across more machines.
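
Once the page is fetched, pulling out the snippet is a small job. The sketch below uses Python's built-in `str.find` as the linear search (standing in for the naive or Boyer-Moore search from earlier) and returns the match with some surrounding text. The function name `snippet` and the `context` parameter are assumptions for this example.

```python
from typing import Optional


def snippet(page_text: str, phrase: str, context: int = 30) -> Optional[str]:
    """Return the phrase with up to `context` characters on each side."""
    # Case-insensitive linear search; assumes lowercasing keeps the
    # text the same length, which holds for plain English text.
    position = page_text.lower().find(phrase.lower())
    if position == -1:
        return None
    start = max(0, position - context)
    end = min(len(page_text), position + len(phrase) + context)
    # Mark where the page text was cut off.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(page_text) else ""
    return prefix + page_text[start:end] + suffix


page = "Had we but world enough and time thy coyness lady would be no crime."
excerpt = snippet(page, "coyness", context=10)
```

The index found the right page quickly; the simple search only has to scan that one page, so its linear cost is small.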
  120. 120. Future Trends • Disk space costs dropping --> More data • More networked devices --> More sharing • What would you do with: • All the web on your cell phone • All your family/friends instantly available
  121. 121. Just scratching the surface. • Lucene search engine -- Open source. How to index and search results. http://lucene.apache.org/ • Google -- Presentations and research notes. http://research.google.com/video.html • http://www.searchenginehistory.com/
  122. 122. Questions?
