Introduction to Search Systems - ScaleConf Colombia 2017

Often when a new user arrives on your website, the first place they go to find information is the search box! Whether they are searching for hotels on your travel site, products on your e-commerce site, or friends to connect with on your social media site, it is important to have fast, effective search in order to engage the user.

Published in: Technology

  1. 1. Introduction to Search Systems. Toria Gibbs, Senior Software Engineer @ Etsy, @scarletdrive
  2. 2. 2
  3. 3. LEONOR. Macrame wall hanging, $145.00 USD, AncestralStore. Bread your Cat Costume for Cats, $12.00 USD, MissMaddyMakes.
  4. 4. 45M items for sale as of December 31, 2016
  5. 5. 5
  6. 6. Agenda
      ● Why Build Search Systems?
      ● Search Indexes
      ● Open Source Tools
      ● Interesting Challenges in Search
  7. 7. Why build search systems?
  8. 8. “Isn’t search a solved problem? We have Google!” (All my friends.) Photo by Alissa, loveherbyalissa.etsy.com
  9. 9. Google compared with Etsy:
      Google                      Etsy
      Very very large scope       Medium scope
      No control over content     Some control over content
      High intent                 Low intent
      Optimize for Google users   Optimize for Etsy users
  10. 10. Why build search systems? 1. Customize the solution (your users, your data, your algorithms)
  11. 11. Database Example
      id   description        price
      001  red cat mittens    40.00
      002  blue mittens       19.99
      003  blue hat for cats  12.50
      004  cat hat            25.00
      005  red and blue hat   30.00
      q="cat"
      SELECT * FROM items WHERE description LIKE '%cat%'
  12. 12. Substring search: O(n·m), where n = items in database and m = length of string
  13. 13. Database Scalability (m=25)
      n         n·m
      10        250
      100       2500
      1000      25000
      10000     250000
      100000    2500000
      1000000   25000000
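
The substring scan behind the `LIKE '%cat%'` query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ITEMS` list copies the toy table from the Database Example slide, and `like_search` is a hypothetical helper name, not part of any database API. Every row is inspected on every query, which is where the O(n·m) cost comes from.

```python
# Toy item table copied from the "Database Example" slide.
ITEMS = [
    ("001", "red cat mittens", 40.00),
    ("002", "blue mittens", 19.99),
    ("003", "blue hat for cats", 12.50),
    ("004", "cat hat", 25.00),
    ("005", "red and blue hat", 30.00),
]


def like_search(items, needle):
    """Return ids of rows whose description contains `needle`.

    Mimics WHERE description LIKE '%needle%': all n rows are scanned,
    and each substring test touches up to m characters of the description.
    """
    return [item_id for item_id, description, _price in items
            if needle in description]


print(like_search(ITEMS, "cat"))  # ['001', '003', '004']
```

Because nothing is precomputed, the work grows with the number of rows for every single query, as the scalability table on slide 13 shows.
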
  14. 14. Why build search systems? 1. Customize the solution (your users, your data, your algorithms) 2. Improve performance
  15. 15. SELECT * FROM items WHERE description LIKE '%cat%'
      ✓ cat hat
      ✓ blue hat for cats
      ✓ vacation hat
      ? kitten hat
      By Laura Solarte, floflyco.etsy.com
  16. 16. Why build search systems? 1. Customize the solution (your users, your data, your algorithms) 2. Improve performance 3. Improve quality of results
  17. 17. Search Index
  18. 18. Inverted Index
      Documents:
      001  red cat mittens
      002  blue mittens
      003  blue hat for cats
      004  cat hat
      005  red and blue hat
      Index:
      red     [001, 005]
      blue    [002, 003, 005]
      cat     [001, 003, 004]
      hat     [003, 004, 005]
      mitten  [001, 002]
  19. 19. Terminology ● A document is a single searchable unit. Example: 001 red cat mittens 40.00. Index: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
  20. 20. Terminology ● A document is a single searchable unit ● A field is a defined value in a document. Example fields: id = 001, description = red cat mittens, price = 40.00. Index: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
  21. 21. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the inverted index. Example fields: id = 001, description = red cat mittens, price = 40.00. Index: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
  22. 22. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the inverted index ● An inverted index is an internal data structure that maps terms of a field to document ids. Index: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
  23. 23. Terminology
      ● A document is a single searchable unit
      ● A field is a defined value in a document
      ● A term is a value extracted from the source in order to build the inverted index
      ● An inverted index is an internal data structure that maps terms of a field to document ids
      ● An index is a collection of documents
      Documents: 001 red cat mittens 40.00, 002 blue mittens 19.99, ...
      Inverted index on description: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
      Inverted index on price: 12.50 [003], 19.99 [002], 25.00 [004], 30.00 [005], 40.00 [001]
  24. 24. How did we do this? Documents: 001 red cat mittens, 002 blue mittens, 003 blue hat for cats, 004 cat hat, 005 red and blue hat. Index: red [001, 005], blue [002, 003, 005], cat [001, 003, 004], hat [003, 004, 005], mitten [001, 002]
  25. 25. Tokenization: string “cat hat” → array [“cat”, “hat”]. By Meredith Langley, iheartneedlework.etsy.com
  26. 26. Stemming: “cats” → “cat”, “walking” → “walk”, “painting” → “paint” (?). By Paradise Crow, ParadiseCrow.etsy.com
  27. 27. Bonus: Synonyms ✓ [“cat”, “kitten”] ✓ [“color”, “colour”] ✓ [“Canada”, “Canadian”, “canuck”] ✗ [“Poland”, “Polish”]. By Dina Castellano, mamaslilsugarcrochet.etsy.com
  28. 28. =(
  29. 29. Building an Inverted Index ● Stemming: ✓ hat for cats ● Tokenization: ✗ vacation ● Synonyms: ✓ kitten hat. By Ludwinus van den Arend, circuszoo.etsy.com
  30. 30. Index time: O(n·m·p). Query time: O(1). n = items in database, m = length of string, p = preprocessing steps
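
The index-time steps above (tokenization, stemming, synonyms) and the constant-time lookup can be sketched roughly as follows. This is a toy illustration, not how Solr or Elasticsearch implement analysis: the plural-stripping `stem` function, the `STOP_WORDS` set (chosen so that "for" and "and" are left out, as in the slide's index), and the `SYNONYMS` map are all assumptions made for this example.

```python
from collections import defaultdict

STOP_WORDS = {"for", "and"}   # assumption: words the slide's index leaves out
SYNONYMS = {"kitten": "cat"}  # assumption: map a synonym to one canonical term


def tokenize(text):
    """Split a string into lowercase tokens: "cat hat" -> ["cat", "hat"]."""
    return text.lower().split()


def stem(token):
    """Naive stemmer: strip one trailing "s" (real stemmers do much more)."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Run the preprocessing steps: tokenize, drop stop words, stem, map synonyms."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        term = stem(token)
        terms.append(SYNONYMS.get(term, term))
    return terms


def build_index(docs):
    """Index time: visit every document once and map each term to document ids."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in dict.fromkeys(analyze(text)):  # de-duplicate, keep order
            index[term].append(doc_id)
    return dict(index)


def search(index, query):
    """Query time: analyze the query, then do a single dictionary lookup."""
    terms = analyze(query)
    return index.get(terms[0], []) if terms else []


DOCS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}
INDEX = build_index(DOCS)
print(INDEX["cat"])            # ['001', '003', '004']
print(search(INDEX, "mittens"))  # ['001', '002']

# The four items from the earlier LIKE '%cat%' slide.
HATS = {1: "cat hat", 2: "blue hat for cats", 3: "vacation hat", 4: "kitten hat"}
print(search(build_index(HATS), "cat"))  # [1, 2, 4]
```

With whole-token matching, "vacation hat" no longer matches "cat", and the synonym map lets "kitten hat" match. The cost has moved to index time, where every document passes through every preprocessing step.
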
  31. 31. 31 By Lisa Van Riper humbleelephant.etsy.com
  32. 32. Documents: 1. “big data” 2. “small data” 3. “big data” 4. “small data” 5. “big data” 6. “small data” 7. “big data” 8. “small data” 9. “big data” 10. “small data” 11. “bigger data” 12. “biggest data”
      Index: data=[1,2,3,4,5,6,7,8,9,10,11,12] big=[1,3,5,7,9,11,12] small=[2,4,6,8,10]
  33. 33. Documents: 1. “Carlos Vives is the greatest singer alive” 2. “Shakira is the best dancer in the world” 3. “Sophía Vergara is the most famous Colombian in the United States”
      Index: carlos=[1] vives=[1] is=[1,2,3] the=[1,2,3] great=[1] singer=[1] alive=[1] shakira=[2] best=[2] dancer=[2] in=[2,3] world=[2] sophia=[3] vergara=[3] most=[3] famous=[3] colombia=[3] unite=[3] states=[3]
  34. 34. Did we solve it? ✓ Customize the solution (your users, your data, your algorithms) ✓ Improve performance ✓ Improve quality of results
  35. 35. Agenda
      ✓ Why Build Search Systems?
      ✓ Search Indexes
      ● Open Source Tools
      ● Interesting Challenges in Search
  36. 36. Open Source Tools
  37. 37. 37
  38. 38. ● Inverted index ● Field data (uninverted index) ● Basic stemming, tokenizing, faceting ● Advanced stemming, tokenizing, faceting ● Plugins ● Caching, warming ● Replication ● Sharding, distribution ● ...and more!
  39. 39. Which one should I pick? IT DOESN’T MATTER
  40. 40. It Doesn’t Matter ● Most projects work well with either ● Getting configuration right is more important ● Test with your own data and your own queries
      Source: Side by Side with Elasticsearch and Solr, by Rafał Kuć and Radu Gheorghe
      https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
      https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
      See also: http://solr-vs-elasticsearch.com/ by Kelvin Tan
  41. 41. Solr schema:
      <schema name="items" version="1.6">
        <types>
          <fieldType name="long" class="solr.TrieLongField"/>
          <fieldType name="int" class="solr.TrieField" type="integer"/>
          <fieldType name="tdate" class="solr.TrieDateField"/>
          <fieldType name="text" class="solr.TextField"/>
        </types>
        <fields>
          <field name="item_id" type="long" stored="true" required="true"/>
          <field name="description" type="text"/>
          <field name="quantity" type="int"/>
          <field name="price" type="long"/>
          <field name="update_date" type="tdate"/>
        </fields>
        <defaultSearchField>description</defaultSearchField>
        <uniqueKey>item_id</uniqueKey>
      </schema>
      Elasticsearch mapping:
      "item" : {
        "properties" : {
          "item_id": { "type": "long", "store": true },
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "price": { "type": "long" },
          "update_date": { "type": "date" }
        }
      }
  42. 42. Which one should I pick? Just pick one and get started :)
  43. 43. Interesting Challenges
  44. 44. Interesting Challenges: Scalability, Relevance, Query Understanding
  45. 45. Data, Users. By Bekki, TresorsDesPyrenees.etsy.com
  46. 46. Replication
  47. 47. Replication: update
  48. 48. Sharding, Distribution
  49. 49. 49
  50. 50. 50
  51. 51. Interesting Challenges: ✓ Scalability, Relevance, Query Understanding
  52. 52. TF·IDF
  53. 53. TF-IDF
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Documents:
      1. The orange cat is a very good cat
      2. My cat ate an orange
      3. Cats are the best and I will give every cat a special cat toy
      1. TF(cat) = 2/8
      2. TF(cat) = 1/5
      3. TF(cat) = 3/14
      IDF(cat) = log_e(3/3)
      “cat” → [1, 3, 2]
  54. 54. TF-IDF
      TF(term) = (# times this term appears in doc) / (total # terms in doc)
      IDF(term) = log_e(total number of docs / # docs which contain this term)
      Documents:
      1. The orange cat is a very good cat
      2. My cat ate an orange
      3. Cats are the best and I will give every cat a special cat toy cat cat cat cat cat
      1. TF(cat) = 2/8
      2. TF(cat) = 1/5
      3. TF(cat) = 8/19
      IDF(cat) = log_e(3/3)
      “cat” → [3, 1, 2]
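
The TF and IDF arithmetic on these two slides can be reproduced with a short Python sketch. It assumes a whitespace tokenizer plus a naive plural-stripping stemmer (so that "Cats" counts as "cat"). The helper names `analyze`, `tf`, `idf`, and `rank_by_tf` are made up for this example.

```python
import math


def analyze(text):
    """Lowercase, split on whitespace, and strip one trailing "s" (toy stemmer)."""
    terms = []
    for token in text.lower().split():
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        terms.append(token)
    return terms


def tf(term, doc_terms):
    """# times the term appears in the doc / total # terms in the doc."""
    return doc_terms.count(term) / len(doc_terms)


def idf(term, all_doc_terms):
    """log_e(total number of docs / # docs which contain the term)."""
    containing = sum(1 for terms in all_doc_terms if term in terms)
    if containing == 0:
        return 0.0  # assumption: treat unseen terms as carrying no weight
    return math.log(len(all_doc_terms) / containing)


def rank_by_tf(term, docs):
    """Order document ids by descending term frequency."""
    scores = {doc_id: tf(term, analyze(text)) for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


DOCS = {
    1: "The orange cat is a very good cat",
    2: "My cat ate an orange",
    3: "Cats are the best and I will give every cat a special cat toy",
}
print(rank_by_tf("cat", DOCS))  # [1, 3, 2]

# The same collection after stuffing five extra "cat" tokens into document 3.
STUFFED = dict(DOCS)
STUFFED[3] += " cat cat cat cat cat"
print(rank_by_tf("cat", STUFFED))  # [3, 1, 2]

ALL_TERMS = [analyze(text) for text in DOCS.values()]
print(idf("cat", ALL_TERMS))  # 0.0, because every document contains "cat"
```

Because "cat" occurs in all three documents, IDF(cat) is zero and the ordering on the slides comes from TF alone. The second ranking shows how repeating a keyword pushes a document to the top when nothing else contributes to the score.
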
  55. 55. TF·IDF
  56. 56. IDF·Q·R
  57. 57. Quality ● User reviews ● Clicks ● Favorites ● Adds to shopping cart ● Purchases ● Dwell (time spent viewing the item) ● ...and more! By Lisa, airfriend.etsy.com
  58. 58. Recency ● Ensure that each visit is new and fresh ● New items have a chance to be seen. By Olya, foxberrystudio.etsy.com
  59. 59. Diversity
  60. 60. Interesting Challenges: ✓ Scalability, ✓ Relevance, Query Understanding
  61. 61. Query Understanding ● Tokenization and stemming ● Language identification ● Spelling correction ● Query rewriting (scoping, expansion, relaxation). For more information: http://queryunderstanding.com/ by Daniel Tunkelang
  62. 62. Query Scoping
      q=“red mittens” → q=“mittens” & color=red
      q=“pizza restaurants in Medellin” → q=“pizza restaurant” & location=“Medellin”
      q=“necklace under $20” → q=“necklace” & price<20
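
A rule-based version of query scoping might look like the sketch below. It is deliberately simplistic and rests on assumptions: the `COLORS` vocabulary, the two regular expressions, and the `scope_query` helper are hypothetical, and production systems typically rely on richer signals than hand-written patterns. Unlike the slide, this sketch does not stem the remaining keywords, so "restaurants" stays plural.

```python
import re

COLORS = {"red", "blue", "green"}  # hypothetical attribute vocabulary
PRICE_PATTERN = re.compile(r"\bunder \$(\d+(?:\.\d+)?)")  # e.g. "under $20"
LOCATION_PATTERN = re.compile(r"\bin ([A-Z]\w+)$")        # e.g. "in Medellin"


def scope_query(raw):
    """Split a raw query into free-text keywords and structured filters."""
    text = raw.strip()
    filters = {}

    # Pull out a price ceiling and remove it from the text.
    price = PRICE_PATTERN.search(text)
    if price:
        filters["price_lt"] = float(price.group(1))
        text = (text[:price.start()] + text[price.end():]).strip()

    # Pull out a trailing "in <Place>" phrase.
    location = LOCATION_PATTERN.search(text)
    if location:
        filters["location"] = location.group(1)
        text = text[:location.start()].strip()

    # Treat the first known color word as a filter, the rest as keywords.
    keywords = []
    for word in text.split():
        if word.lower() in COLORS and "color" not in filters:
            filters["color"] = word.lower()
        else:
            keywords.append(word)
    return " ".join(keywords), filters


print(scope_query("red mittens"))
# ('mittens', {'color': 'red'})
print(scope_query("pizza restaurants in Medellin"))
# ('pizza restaurants', {'location': 'Medellin'})
print(scope_query("necklace under $20"))
# ('necklace', {'price_lt': 20.0})
```

The returned filters could then be applied to structured fields such as price, while the remaining keywords go to the text index.
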
  63. 63. By Amanda Ellis GreenChickens.etsy.com
  64. 64. How Etsy Uses Thermodynamics to Help You Search for “Geeky” by Fiona Condon http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky
  65. 65. Interesting Challenges: ✓ Scalability, ✓ Relevance, ✓ Query Understanding
  66. 66. Agenda
      ✓ Why Build Search Systems?
      ✓ Search Indexes
      ✓ Open Source Tools
      ✓ Interesting Challenges in Search
  67. 67. Follow me on Twitter! @scarletdrive Thanks!
  68. 68. We Covered ● Stemming ● Tokenization ● Synonyms ● Replication, distribution, and sharding ● Ranking for relevance ● Query understanding
      We Did Not Cover ● Faceting ● Field data ● Internationalization ● Spelling correction ● Autocomplete suggestions
