Solr, Lucene and Hadoop @ Etsy

Presented by David Giffin, Software Engineer, Etsy - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Search at Etsy poses significant challenges. Our marketplace is filled with millions of unique, short-lived items, and people search for them over 13 million times a day. In this session we'll discuss many of the solutions we've engineered to meet these challenges, including the evolution of indexing at Etsy, how HBase and Hadoop have taken indexing from hours to minutes, how and why we use BitTorrent for Solr replication, how we track search performance, our approach to shaving crucial milliseconds off every search, and an overview of our continuous deployment strategy, web/search config integration, A/B testing, and analytics.

Solr, Lucene and Hadoop @ Etsy: Presentation Transcript

  • Solr, Lucene & Hadoop @ Etsy (Thursday, May 10, 2012)
  • david@etsy.com · 4 Years Lucene and Solr @ Etsy
  • History of Search @ Etsy · Hadoop + HBase Indexing (in development) · Replication
  • About Us
  • 13MM Listings · 39MM Unique Visitors · 880K Shops / 150 Countries · 100+ Engineers
  • Architecture Overview
  • Overview (diagram): Search (+n slaves) · Web (+n webs) · Database (+n db shards) · Memcached (+n caches)
  • Thrift (diagram): Web to Search, query = hats for cats · Search to Web, result = 402, 283, 837 · +n slaves · +n webs
  • Hydration (diagram): Database (+n shards) · Web (+n webs) · Memcached (+n caches)
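
The Thrift and Hydration slides suggest a two-step flow: the search tier returns only listing IDs, and the web tier then fills in the full records. The following is a minimal sketch of that idea, not Etsy's implementation. The function names, the cache-before-database lookup order, and the modulo shard selection are all assumptions made for illustration; plain dictionaries stand in for Solr, Memcached, and the database shards.

```python
def search(query, index):
    """Stand-in for the search tier: return matching listing IDs only."""
    return index.get(query, [])


def hydrate(listing_ids, cache, shards):
    """Turn listing IDs into full records.

    Assumption: check the cache first, then fall back to a database shard
    chosen by listing_id modulo the shard count (hypothetical scheme).
    """
    listings = []
    for listing_id in listing_ids:
        record = cache.get(listing_id)
        if record is None:
            shard = shards[listing_id % len(shards)]
            record = shard.get(listing_id)
            if record is not None:
                cache[listing_id] = record  # populate the cache for next time
        if record is not None:
            listings.append(record)
    return listings


# Toy data: the query and IDs are the ones shown on the Thrift slide.
index = {"hats for cats": [402, 283, 837]}
shards = [{402: {"id": 402}}, {283: {"id": 283}, 837: {"id": 837}}]
cache = {}

ids = search("hats for cats", index)
results = hydrate(ids, cache, shards)
```

Keeping the search response down to IDs keeps the payload between tiers small, at the cost of a second lookup on the web side.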
  • The Results
  • History of Search at Etsy
  • History of Search 2007
      • 1 Million Listings
      • A Single “Master” Postgres Database
      • PHP > Twisted > Stored Proc > TSearch
      • 18 “Baby” Postgres Databases
      • Baby Replicator
  • History of Search 2008
      • 2 Million Listings
      • A Single “Master” Postgres Database
      • PHP > Solr
      • 4 Solr Slaves + 2 Masters
      • Baby Replicator + DIH for Reindexing
  • History of Search 2009
      • 4 Million Listings
      • A Single “Master” Postgres Database
      • PHP > Solr
      • 6 Solr Slaves + 2 Masters
      • Webs > ActiveMQ > Solr
  • History of Search 2010
      • 7 Million Listings
      • A Single “Master” Postgres Database
      • PHP > Thrift > Solr
      • 10 Solr Slaves + 1 Master
      • Custom Import Handler
  • History of Search 2011
      • 10 Million Listings
      • “Master” Postgres Database + DB SHARDS!
      • PHP > Thrift > Solr
      • 24 Solr Slaves + 1 Master
      • Custom Import Handler
  • Future of Search 2012
      • ?? Million Listings
      • MORE DB SHARDS!
      • PHP > Thrift > Solr
      • ?? Solr Slaves + 1 Master
      • HBase + Hadoop Indexers
  • What Did We Learn?
  • Lucene + Solr > TSearch: http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/
  • Love Lucene + Solr Trunk!
  • Run, Don’t Walk...
  • Deployinator. Fork it: https://github.com/etsy/deployinator
  • Smoker
  • StatsD, Graph Everything! Fork it: https://github.com/etsy/statsd
  • 95th Percentile
  • start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
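
The slides above name per-phase timers and a 95th percentile view. As a rough illustration of what that measurement means, here is a small sketch that records durations per phase and reports a nearest-rank percentile. It does not use the StatsD client or wire protocol; the `PhaseTimings` class and its methods are hypothetical, and only the phase names come from the slide.

```python
from collections import defaultdict


class PhaseTimings:
    """Collect per-phase durations in milliseconds (illustrative only)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, phase, millis):
        self._samples[phase].append(millis)

    def percentile(self, phase, pct=95):
        """Nearest-rank percentile: the smallest sample such that at least
        pct percent of all samples are less than or equal to it."""
        values = sorted(self._samples[phase])
        if not values:
            raise ValueError(f"no samples recorded for {phase!r}")
        # Integer ceiling of pct * n / 100, avoiding float rounding.
        rank = max(1, (pct * len(values) + 99) // 100)
        return values[rank - 1]


timings = PhaseTimings()
for millis in range(1, 101):          # pretend durations of 1..100 ms
    timings.record("perform_search", millis)

p95 = timings.percentile("perform_search")
```

A high percentile is often graphed alongside the mean because a handful of slow requests can hide behind an average that looks healthy.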
  • Solr Top Level Cache > Memcached
  • etsy-index.properties
        $ cat /search/data/person/index/etsy-index.properties
        #Tue Mar 27 13:05:51 EDT 2012
        max_update_time=2012-03-27T17:05:51.955Z
  • Check Index Size: Don’t Install if < 50% Current Size
  • Check if Index is Too Old: Don’t Update if > 10 Days Old
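
The two safety checks above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, index sizes are taken as plain numbers, and the index age is assumed to be measured from the `max_update_time` value in `etsy-index.properties`. The slides do not say how Etsy actually measures either quantity.

```python
from datetime import datetime, timedelta, timezone

MIN_SIZE_RATIO = 0.5            # "Don't Install if < 50% Current Size"
MAX_AGE = timedelta(days=10)    # "Don't Update if > 10 Days Old"


def read_max_update_time(properties_text):
    """Extract max_update_time from the properties file contents."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "max_update_time":
            parsed = datetime.strptime(value.strip(), "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def size_ok(new_size, current_size):
    """Reject a new index that is less than half the size of the current one."""
    return new_size >= MIN_SIZE_RATIO * current_size


def fresh_enough(max_update_time, now):
    """Reject an index whose last update is more than ten days old."""
    return now - max_update_time <= MAX_AGE


props = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
updated = read_max_update_time(props)
```

Both checks guard against silently replacing a good index with a truncated or stale one, which is cheaper to prevent than to detect in production.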
  • What Did We Learn? Store Nothing
  • Keep Denormalized Data
  • (diagram) DB Shards · PHP Denormalizer · JSON · Search Database
  • (diagram) Full Reindex · Apply Incremental · Install
  • (diagram) Full Reindex · Apply Incremental · Apply Incremental · Install
  • (diagram) Indexer · Database
  • HBase + Hadoop Indexing
  • HBase + Hadoop Indexing: Why HBase?
  • HBase + Hadoop Indexing (diagram): DB Shards · PHP Denormalizer · JSON · HBase
  • HBase + Hadoop Indexing: listings_denormalized
        {NAME => listings_denormalized, FAMILIES => [{NAME => listing_data,
         BLOOMFILTER => ROW, REPLICATION_SCOPE => 0, COMPRESSION => SNAPPY,
         VERSIONS => 1, TTL => -1, BLOCKSIZE => 65536, IN_MEMORY => false,
         BLOCKCACHE => false}]}
  • HBase + Hadoop Indexing: listings_denormalized_modified_index
        {NAME => listings_denormalized_modified_index, FAMILIES => [{NAME => pks,
         BLOOMFILTER => ROW, REPLICATION_SCOPE => 0, COMPRESSION => SNAPPY,
         VERSIONS => 1, TTL => -1, BLOCKSIZE => 65536, IN_MEMORY => false,
         BLOCKCACHE => false}]}
  • HBase + Hadoop Indexing: SOLR-1301, https://issues.apache.org/jira/browse/SOLR-1301
  • HBase + Hadoop Indexing (diagram labels: Solr Disk, HDFS)
      • Solr Document Converter Output Format
      • Solr Requires Posix Disk
      • Index Copied Back to HDFS
  • HBase + Hadoop Indexing
      • Not Great with Multi-Core Configs
      • Added Solr Multi-Core Support
      • Solr Config Issues
      • Added ENV support for Configs
      • Uses “new” style Hadoop API
      • Added Support for both Old and New
  • HBase + Hadoop Indexing: SolrInputDocumentWritable
        public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
  • HBase + Hadoop Indexing: Oozie
  • HBase + Hadoop Indexing: Oozie + HBase?
  • HBase + Hadoop Indexing: ScanStringGenerator, http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
  • HBase + Hadoop Indexing: Hadoop Indexer (diagram labels): Oozie · Start · Map · HBase · Copy · Reduce · HDFS · Merge · Solr Disk · Install · Output
  • HBase + Hadoop Indexing: IndexerActionMain
  • HBase + Hadoop Indexing: Deployinator
  • HBase + Hadoop Indexing: IndexCompare
  • HBase + Hadoop Indexing
        $ ./compare
        ERROR: please provide two index directories
        example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
        options:
          -p --percent=   percent of the index to check
          -i --id=        primary key id field in the index
          -h --hash=      comparison or hash field in the index
          <index> <index>
  • HBase + Hadoop Indexing
        $ ./compare /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
        id field: user_id
        hash field: hash
        percentage: 0.0010
        files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672
        /search/data/person/index-1332867952588 contains 1515512 docs
        /search/data/person/index-1335378487672 contains 14837972 docs
        1516 of 1516 documents are the same
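
The compare output above suggests a sampling check: pick a percentage of documents from one index and confirm that each has the same hash value in the other. Here is a minimal sketch of that idea, with each index modeled as a dictionary from primary key to hash. The `compare_indexes` function, its parameters, and the round-up sampling rule are assumptions for illustration; the real tool reads Lucene indexes and its internals are not shown in the slides.

```python
import math
import random


def compare_indexes(first, second, percent, seed=None):
    """Sample `percent` (0.0 to 1.0) of the ids in `first` and count how many
    have an identical hash in `second`. Returns (same, checked)."""
    if not first:
        return 0, 0
    # Assumption: round the sample size up and always check at least one doc.
    sample_size = min(len(first), max(1, math.ceil(len(first) * percent)))
    rng = random.Random(seed)
    sampled_ids = rng.sample(sorted(first), sample_size)
    same = sum(
        1
        for doc_id in sampled_ids
        if doc_id in second and second[doc_id] == first[doc_id]
    )
    return same, sample_size


old_index = {user_id: f"hash-{user_id}" for user_id in range(1000)}
new_index = dict(old_index)
new_index[42] = "changed"             # one document differs

same, checked = compare_indexes(old_index, new_index, percent=1.0, seed=1)
print(f"{same} of {checked} documents are the same")
```

Sampling trades certainty for speed: a small percentage catches systematic indexing bugs quickly but can miss a problem that affects only a few documents.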
  • HBase + Hadoop Indexing: Copy and Merge
  • HBase + Hadoop Indexing: Open Source
  • Replication
  • Replication (diagram): Master · Slaves (+n slaves)
  • BitTorrent Replication
  • BitTorrent Using BitTornado:
  • Replication: BitTorrent + Solr
  • Replication
      • Fork of ttorrent: https://github.com/etsy/ttorrent
      • Multi-File Support
      • Large File Support
      • Fork BitTorrent: Coming Soon
  • Need a job?
  • Thanks!
  • david@etsy.com