Solr, Lucene and Hadoop @ Etsy

Presented by David Giffin, Etsy. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items, and people search for them more than 13 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges, including the evolution of indexing at Etsy, how HBase and Hadoop have taken indexing from hours to minutes, how and why we use BitTorrent for Solr replication, how we track search performance, and our approach to shaving crucial milliseconds off every search. We'll also give an overview of our continuous deployment strategy, web/search config integration, A/B testing, and analytics.

Published in: Technology

  1. Solr, Lucene & Hadoop @ Etsy (Monday, May 14, 12)
  2. david@etsy · 4 Years Lucene and Solr @ Etsy
  3. History of Search @ Etsy · Hadoop + HBase Indexing (in development) · Replication
  4. About Us
  5. (no text on slide)
  6. (no text on slide)
  7. (no text on slide)
  8. 13MM Listings · 39MM Unique Visitors · 880K Shops / 150 Countries · 100+ Engineers
  9. Architecture Overview
  10. Overview: Search (+n slaves) · Web (+n webs) · Database (+n db shards) · Memcached (+n caches)
  11. Thrift: Search (slave, slave, +n slaves) · Web (web, web, +n webs) · query = hats for cats · result = 402, 283, 837
  12. Hydration: Database (shard, shard, +n shards) · Web (web, web, +n webs) · Memcached (cache, cache, +n caches)
  13. The Results
  14. History of Search
  15. History of Search, 2007: • 1 Million Listings • A Single “Master” Postgres Database • PHP > Twisted > Stored Proc > TSearch
  16. History of Search, 2008: • 2 Million Listings • A Single “Master” Postgres Database • PHP > Solr • 4 Solr Slaves + 2
  17. History of Search, 2009: • 4 Million Listings • A Single “Master” Postgres Database • PHP > Solr • 6 Solr Slaves + 2
  18. History of Search, 2010: • 7 Million Listings • A Single “Master” Postgres Database • PHP > Thrift > Solr • 10 Solr Slaves + 1
  19. History of Search, 2011: • 10 Million Listings • “Master” Postgres Database + DB SHARDS! • PHP > Thrift > Solr • 24 Solr Slaves + 1
  20. Future of Search, 2012: • ?? Million Listings • MORE DB SHARDS! • PHP > Thrift > Solr • ?? Solr Slaves + 1 Master
  21. What Did We Learn?
  22. Lucene + Solr > TSearch: http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/
  23. Love Lucene + Solr Trunk!
  24. Run, Don’t Walk...
  25. Deployinator. Fork it: https://github.com/etsy/deployinator
  26. Smoker
  27. StatsD, Graph Everything! Fork it: https://github.com/etsy/statsd
  28. (no text on slide)
  29. 95th Percentile
  30. start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
  31. Solr Top Level Cache > Memcached
  32. etsy-index.properties
      $ cat /search/data/person/index/etsy-index.properties
      #Tue Mar 27 13:05:51 EDT 2012
      max_update_time=2012-03-27T17:05:51.955Z
  33. Check Index Size: Don’t Install if < 50% of Current Size
  34. Check if Index is Too Old: Don’t Update if > 10 Days Old
  35. What Did We Learn? Store Nothing
  36. Keep Denormalized Data
  37. DB Shard (×3) · PHP Denormalizer · JSON · Search Database
  38. Full Reindex · Apply Incremental · Install
  39. Full Reindex · Apply Incremental · Apply Incremental · Install
  40. r Database exe Ind (slide text garbled in extraction)
  41. HBase + Hadoop
  42. HBase + Hadoop: Why HBase?
  43. HBase + Hadoop: DB Shard (×3) · PHP Denormalizer · JSON · HBase
  44. HBase + Hadoop: listings_denormalized
      {NAME => listings_denormalized, FAMILIES => [{NAME => listing_data, BLOOMFILTER => ROW, REPLICATION_SCOPE => 0, COMPRESSION => SNAPPY, VERSIONS => 1, TTL => -1, BLOCKSIZE => 65536, IN_MEMORY => false, BLOCKCACHE =>
  45. HBase + Hadoop: listings_denormalized_modified_index
      {NAME => listings_denormalized_modified_index, FAMILIES => [{NAME => pks, BLOOMFILTER => ROW, REPLICATION_SCOPE => 0, COMPRESSION => SNAPPY, VERSIONS => 1, TTL => -1, BLOCKSIZE => 65536,
  46. HBase + Hadoop: SOLR-1301, https://issues.apache.org/jira/browse/SOLR-1301
  47. HBase + Hadoop: Disk • Solr Solr Output Format Document HDFS Converter • Solr Requires
  48. HBase + Hadoop: • Not Great with Multi-Core Configs • Added Solr Multi-Core Support • Solr Config Issues • Added ENV support
  49. HBase + Hadoop: SolrInputDocumentWritable
      public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
  50. HBase + Hadoop: Oozie
  51. HBase + Hadoop: Oozie + HBase?
  52. HBase + Hadoop: ScanStringGenerator, http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
  53. HBase + Hadoop: Hadoop Indexer (diagram labels, in transcript order): Oozie Start Map HBase Copy Reduce HDFS Merge Solr Disk Install Output
  54. HBase + Hadoop: IndexerActionMain
  55. HBase + Hadoop: Deployinator
  56. HBase + Hadoop: IndexCompare
  57. HBase + Hadoop:
      $ ./compare
      ERROR: please provide two index directories
      example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
      options:
        -p --percent= percent of the index to check
        -i --id= primary key id field in the index
        -h --hash= comparison or hash field in the index
        <index> <index>
  58. HBase + Hadoop:
      $ ./compare /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
      id field: user_id
      hash field: hash
      percentage: 0.0010
      files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
      /search/data/person/index-1332867952588 contains 1515512 docs
      /search/data/person/index-1335378487672 contains 14837972 docs
      1516 of 1516 documents are the same
  59. HBase + Hadoop: Copy and Merge
  60. HBase + Hadoop: Open Source
  61. Replication
  62. Replication
  63. Replication: Master · Slaves (+n slaves)
  64. (no text on slide)
  65. BitTorrent Replication
  66. BitTorrent: Using BitTornado:
  67. Replication: BitTorrent + Solr
  68. Replication: BitTorrent + Solr
  69. (no text on slide)
  70. (no text on slide)
  71. Replication: Fork of ttorrent: https://github.com/etsy/ttorrent · Multi-File Support · Large File Support · Fork BitTorrent: Coming Soon
  72. Need a job?
  73. (no text on slide)
  74. Thanks!
  75. david@etsy
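
Slides 32 to 34 name two guards applied before a new index is installed or updated: a size check (skip anything under 50% of the current size) and an age check (skip anything more than 10 days old), alongside an etsy-index.properties file that records max_update_time. The deck does not show how the checks are coded, so the following is only a minimal sketch under stated assumptions. The function names are hypothetical, sizes are passed in as byte counts, and treating max_update_time as the timestamp used for the age check is a guess.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_RATIO = 0.5          # slide 33: don't install if < 50% of current size
MAX_AGE = timedelta(days=10)  # slide 34: don't update if > 10 days old


def parse_max_update_time(properties_text):
    """Return max_update_time from etsy-index.properties text as an aware datetime."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip the timestamp comment and blank lines
        key, value = line.split("=", 1)
        if key.strip() == "max_update_time":
            # The slide shows a UTC value such as 2012-03-27T17:05:51.955Z
            parsed = datetime.strptime(value.strip(), "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def size_ok(candidate_bytes, current_bytes):
    """True when the candidate index is at least half the size of the live one."""
    return candidate_bytes >= MIN_SIZE_RATIO * current_bytes


def fresh_enough(max_update_time, now):
    """True when the index's last update is no more than 10 days before `now`."""
    return now - max_update_time <= MAX_AGE


if __name__ == "__main__":
    # Property file contents copied from slide 32.
    props = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
    updated = parse_max_update_time(props)
    print(size_ok(60, 100), fresh_enough(updated, updated + timedelta(days=3)))
```

Keeping both checks as pure functions of sizes and timestamps makes them easy to test without touching a real index directory. A deployment script would supply the directory sizes and the current time.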

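
Slides 56 to 58 show IndexCompare checking a sample of documents in two index directories by an id field and a hash field. The tool's source is not in the deck, so this sketch only illustrates the sampling idea. Each index is modelled as a plain dict from id to hash value (a stand-in for reading fields from a Lucene index), and compare_indexes, its parameters, and the rounding rule are all assumptions. The rounding happens to agree with slide 58, where 0.0010 of 1515512 docs gives 1516 checked.

```python
import random


def compare_indexes(old_index, new_index, fraction, seed=None):
    """Compare a random sample of documents between two indexes.

    old_index / new_index: dict mapping primary-key id -> hash field value.
    fraction: share of old_index to check, e.g. 0.001 for 0.1 percent.
    Returns (matching, checked).
    """
    rng = random.Random(seed)
    ids = sorted(old_index)  # sorted so a fixed seed gives a repeatable sample
    if not ids:
        return 0, 0
    # Check at least one document, and never more than the index holds.
    sample_size = min(len(ids), max(1, round(len(ids) * fraction)))
    sample = rng.sample(ids, sample_size)
    matching = sum(
        1
        for doc_id in sample
        if doc_id in new_index and new_index[doc_id] == old_index[doc_id]
    )
    return matching, len(sample)


if __name__ == "__main__":
    old = {user_id: "hash-%d" % user_id for user_id in range(10000)}
    new = dict(old)
    matching, checked = compare_indexes(old, new, 0.001, seed=42)
    print("%d of %d documents are the same" % (matching, checked))
```

Sampling keeps the comparison cheap on large indexes, at the cost of only catching differences that are widespread. A mismatch in a handful of documents can slip past a small sample.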