Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lucene/Solr Revolution 2015

OCTOBER 13-16, 2015 • AUSTIN, TX
Lessons from Sharding Solr at Etsy
Gregg Donovan
@greggdonovan
Senior Software Engineer, etsy.com
• 5.5 Years Solr & Lucene at Etsy.com
• 3 Years Solr & Lucene at TheLadders.com
• Speaker at LuceneRevolution 2011 & 2013
Jeff Dean, Challenges in Building Large-Scale Information Retrieval Systems
1.5 Million Active Shops
32 Million Items Listed
21.7 Million Active Buyers
Agenda
• Sharding Solr at Etsy V0 — No sharding
• Sharding Solr at Etsy V1 — Local sharding
• Sharding Solr at Etsy V2 (*) — Distributed sharding
• Questions

* What we're about to launch.
Sharding V0 — Not Sharding
• Why do we shard?
  • Data size grows beyond RAM on a single box
    • Lucene can handle this, but there's a performance cost
  • Data size grows beyond local disk
  • Latency requirements
• Not sharding allowed us to avoid many problems we'll discuss later.
Sharding V0 — Not Sharding
• How to keep data size small enough for one host?
  • Don't store anything other than IDs
    • fl=pk_id,fk_id,score
  • Keep materialized objects in memcached (hydration sketch below)
  • Only index fields needed
    • Prune index after experiments add fields
  • Get more RAM
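The ID-only pattern implies a hydration step after search: Solr returns primary keys, and the full listing objects come from memcached. A rough sketch of that step, assuming a spymemcached client and a hypothetical "listing:<pk_id>" key scheme (both are assumptions, not details from the talk):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import net.spy.memcached.MemcachedClient;

public class ListingHydrator {
  private final MemcachedClient memcached;

  public ListingHydrator(String host, int port) throws IOException {
    this.memcached = new MemcachedClient(new InetSocketAddress(host, port));
  }

  // Multi-get the materialized listings for the pk_ids Solr returned, in one round trip.
  public Map<String, Object> hydrate(List<Long> pkIds) {
    List<String> keys = pkIds.stream()
        .map(id -> "listing:" + id) // hypothetical cache key scheme
        .collect(Collectors.toList());
    return memcached.getBulk(keys);
  }
}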
Sharding V0 — Not Sharding
• How does it fail?
  • GC
• Solution
  • "Banner" protocol
  • Client-side load balancer
  • Client connects and waits for 4 bytes — 0xC0DEA5CF — from the server within 1-10ms before sending the query. Otherwise, try another server.
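A minimal sketch of the client side of that check, assuming a plain TCP socket; the 4-byte value comes from the slide, but the method and timeout handling are illustrative rather than Etsy's actual load balancer code:

import java.io.DataInputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BannerCheck {
  private static final int BANNER = 0xC0DEA5CF; // 4-byte magic value from the slide

  // Returns true only if the server volunteers the banner within the deadline.
  // A server stuck in a GC pause won't, so the client moves on to another host.
  // (In the real protocol the query is then sent on this same connection.)
  public static boolean serverLooksHealthy(String host, int port, int timeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMs);
      socket.setSoTimeout(timeoutMs); // e.g. somewhere in the 1-10ms range
      int received = new DataInputStream(socket.getInputStream()).readInt();
      return received == BANNER;
    } catch (Exception e) {
      return false; // timeout, connection refused, etc.; try another server
    }
  }
}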
Sharding V1 — Local Sharding
• Motivations
  • Better latency
  • Smaller JVMs
    • Tough to open a 31GB heap dump on your laptop
• Working set still fit in RAM on one box.
• What's the simplest system we can build?
Sharding V1 — Local Sharding
• Lucene parallelism
  • Shikhar Bhushan at Etsy experimented with segment-level parallelism
  • See Search-time Parallelism at Lucene Revolution 2014
  • Made its way into LUCENE-6294 (Generalize how IndexSearcher parallelizes collection execution). Committed in Lucene 5.1.
• Ended up with eight Solr shards per host, each in its own small JVM
• Moved query generation and re-ranking to a separate process: the "mixer"
Sharding V1 — Local Sharding
• Based on Solr distributed search
• By default, Solr does two-pass distributed search
  • First pass gets top IDs
  • Second pass fetches stored fields for each top document
• Implemented distrib.singlePass mode (SOLR-5768); see the sketch below
  • Does not make sense if individual documents are expensive to fetch
• Basic request tracing via HTTP headers (SOLR-5969)
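A minimal sketch of turning that on from SolrJ, assuming documents are cheap enough to fetch in the first pass (the query string is just an example):

import org.apache.solr.client.solrj.SolrQuery;

public class SinglePassExample {
  // With distrib.singlePass=true each shard returns its documents on the first
  // round trip, so the usual second "fetch stored fields by ID" pass is skipped.
  public static SolrQuery singlePassQuery(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.set("distrib", true);            // fan out to the configured shards
    q.set("distrib.singlePass", true); // SOLR-5768: one round trip instead of two
    return q;
  }
}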
Sharding V1 — Local Sharding
• Required us to fetch 1000+ results from each shard for the reranking layer
• How to efficiently fetch 1000 documents per shard?
  • Use Solr's field syntax to fetch data from the FieldCache
  • e.g. fl=pk_id:field(pk_id),fk_id:field(fk_id),score (see the sketch below)
  • When all fields are "pseudo" fields, there's no need to fetch stored fields per document.
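A small sketch of that request from SolrJ, assuming pk_id and fk_id are FieldCache/docValues-backed fields; because every entry in fl is a function ("pseudo") field or the score, no stored fields are read for the 1000 hits:

import org.apache.solr.client.solrj.SolrQuery;

public class RerankCandidatesExample {
  public static SolrQuery rerankCandidatesQuery(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.setFields("pk_id:field(pk_id)", "fk_id:field(fk_id)", "score");
    q.setRows(1000); // large candidate set for the reranking layer
    return q;
  }
}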
Sharding V1 — Local Sharding
• Result
  • Very large latency win
  • Easy system to manage
  • Well understood failure and recovery
  • Avoided solving many distributed systems issues
Sharding V2 — Distributed Sharding
• Motivation
  • Further latency improvements
  • Prepare for data to exceed a single node's capacity
• Significant latency improvements require finer sharding, more CPUs per request
• Requires a real distributed system and sophisticated RPC
• Before proceeding, stop what you're doing and read everything by Google's Jeff Dean and Twitter's Marius Eriksen
Sharding V2 — Distributed Sharding
• New problems
  • Partial failures
  • Lagging shards
  • Synchronizing cluster state and configuration
  • Network partitions
    • Jepsen
  • Distributed IDF issues exacerbated
Solving Distributed IDF
• Inverse Document Frequency (IDF) now varies across shards, biasing ranking
• Calculate IDF offline in Hadoop
  • IDFReplacedSimilarityFactory
  • Offline data populates a cache of Map<BytesRef,Float> (term -> score)
  • Override SimilarityFactory#idfExplain (see the sketch below)
  • Cache misses are given a rare-document constant
• Can be extended to solve i18n IDF issues
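A hedged sketch of the idea (not Etsy's IDFReplacedSimilarityFactory itself): a Similarity, produced by a custom SimilarityFactory, that answers idfExplain from the offline map and falls back to a rare-term constant on a miss. Class and field names are invented, and exact Lucene signatures vary by version:

import java.util.Map;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.util.BytesRef;

public class OfflineIdfSimilarity extends ClassicSimilarity {
  private final Map<BytesRef, Float> idfByTerm; // populated from the offline Hadoop job
  private final float rareTermIdf;              // constant used on cache misses

  public OfflineIdfSimilarity(Map<BytesRef, Float> idfByTerm, float rareTermIdf) {
    this.idfByTerm = idfByTerm;
    this.rareTermIdf = rareTermIdf;
  }

  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    // Ignore the per-shard docFreq and use the globally computed IDF instead,
    // so every shard ranks with the same term weights.
    Float idf = idfByTerm.get(termStats.term());
    float value = (idf != null) ? idf : rareTermIdf;
    return Explanation.match(value, "offline idf, term=" + termStats.term().utf8ToString());
  }
}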
Sharding V2 — Distributed Sharding
• ShardHandler
  • Solr's abstraction for fanning out queries to shards
  • Ships with default implementation (HttpShardHandler) based on HTTP 1.1
  • Does fanout (distrib=true) and processes requests coming from other Solr nodes (distrib=false)
  • Reads shards.rows and shards.start parameters
ShardHandler API

Solr's SearchHandler calls submit for each shard and then either takeCompletedIncludingErrors or takeCompletedOrError, depending on partial-results tolerance.

public abstract class ShardHandler {
  public abstract void checkDistributed(ResponseBuilder rb);
  public abstract void submit(ShardRequest sreq, String shard, ModifiableSolrParams params);
  public abstract ShardResponse takeCompletedIncludingErrors();
  public abstract ShardResponse takeCompletedOrError();
  public abstract void cancelAll();
  public abstract ShardHandlerFactory getShardHandlerFactory();
}
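For context, a rough sketch of the calling pattern this API implies (simplified, not Solr's actual SearchHandler code, and assuming partial results are not tolerated):

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.handler.component.ShardHandler;
import org.apache.solr.handler.component.ShardRequest;
import org.apache.solr.handler.component.ShardResponse;

public class FanoutSketch {
  // Submit one request per shard, then drain completed responses; on the first
  // shard error, cancel everything still in flight and give up.
  static void fanout(ShardHandler handler, ShardRequest sreq,
                     String[] shards, ModifiableSolrParams params) {
    for (String shard : shards) {
      handler.submit(sreq, shard, params);
    }
    ShardResponse rsp;
    while ((rsp = handler.takeCompletedOrError()) != null) {
      if (rsp.getException() != null) {
        handler.cancelAll();
        throw new RuntimeException(rsp.getException());
      }
      // merge rsp.getSolrResponse() into the combined result here
    }
  }
}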
Sharding V2 — Distributed Sharding
Distributed query requirements
• Distributed tracing
  • E.g.: Google's Dapper, Twitter's Zipkin, Etsy's CrossStitch
  • See Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
• Handle node failures, slowness
Better Know Your Switches
Have a clear understanding of your networking requirements and whether your hardware meets them.
• Prefer line-rate switches
• Prefer cut-through to store-and-forward switching
  • No buffering: just read the packet header and move the packet to its destination
• Track and graph switch statistics in the same dashboard that displays your search latency stats
  • Errors, retransmits, etc.
Sharding V2 — Distributed Sharding
First experiment: Twitter's Finagle
• Built on Netty
• Mux RPC multiplexing protocol
• See Your Server as a Function by Marius Eriksen
• Built-in support for Zipkin distributed tracing
• Served as inspiration for Facebook's futures-based RPC library, Wangle
• Implemented a FinagleShardHandler
Sharding V2 — Distributed Sharding
Second experiment: a custom Thrift-based protocol
• Blocking I/O is easier to integrate with the SolrJ API
• Able to integrate our own distributed tracing
• LZ4 compression via a custom Thrift TTransport
Sharding V2 — Distributed Sharding
Future experiment: HTTP/2
• One TCP connection for all requests between two servers
• Libraries
  • Square's OkHttp
  • Google's gRPC
  • Jetty client in 9.3+ — appears to be Solr's choice
Sharding V2 — Distributed Sharding
Implementation note
• Separated fanout from individual request processing
  • SolrJ client via an EmbeddedSolrServer containing an empty RAM directory
  • Saves a network hop
  • Makes shards easier to profile and tune
  • Can return results to SolrJ without sending merged results over the network
Sharding V2 — Distributed Sharding
• Good
  • Individual shard times demonstrate very low average latency
• Bad
  • Overall p95 and p99 are nowhere near the averages
  • Why? Lagging shards due to GC, filterCache misses, etc.
  • More shards means more chances to hit outliers
Sharding V2 — Distributed Sharding
• Solutions
  • See The Tail at Scale by Jeff Dean, CACM 2013.
  • Eliminate all sources of inter-host variability
    • No filter or other cache misses
    • No GC
    • Eliminate OS pauses, networking hiccups, deploys, restarts, etc.
    • Not realistic
Sharding V2 — Distributed Sharding
• Backup Requests
• Methods
  • Brute force — send two copies of every request to different hosts, take the fastest response
  • Less crude — wait X milliseconds for the first server to respond, then send a backup request (see the sketch below)
  • Adaptive — choose X based on the first Y% of responses to return
  • Cancellation — cancel the slow request to save CPU once you're sure you don't need it
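To make the "less crude" variant concrete, here is a minimal sketch using plain Java futures. It is invented for this writeup (not Etsy's implementation), with backupDelayMs playing the role of X:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class BackupRequester {
  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

  // Ask the primary replica first; only if it has not answered within backupDelayMs,
  // send the same query to a backup replica, and complete with whichever answers first.
  public <T> CompletableFuture<T> query(Supplier<T> primary, Supplier<T> backup, long backupDelayMs) {
    CompletableFuture<T> result = new CompletableFuture<>();
    CompletableFuture.supplyAsync(primary, pool)
        .whenComplete((value, err) -> { if (err == null) result.complete(value); });

    timer.schedule(() -> {
      if (!result.isDone()) { // the primary is lagging: hedge with a backup request
        CompletableFuture.supplyAsync(backup, pool)
            .whenComplete((value, err) -> { if (err == null) result.complete(value); });
      }
    }, backupDelayMs, TimeUnit.MILLISECONDS);

    // Error handling and cancellation are elided: a real client would complete the
    // future exceptionally if both calls fail, and would propagate cancellation to
    // the slower replica to save its CPU once the faster answer is in.
    return result;
  }
}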
Sharding V2 — Distributed Sharding
• "Good enough"
  • Return results to the user after X% of results return, if there are enough results. Don't issue backup requests; just cancel laggards.
  • Only applicable in certain domains
  • Poses questions:
    • Should you cache partial results?
    • How is paging affected?
Resilience Testing
Now you own a distributed system. How do you know it works?
• "The Troublemaker"
  • Inspired by Netflix's Chaos Monkey
  • Authored by Etsy's Toria Gibbs
• Make sure humans can operate it
  • Failure simulation — don't wait until 3am
  • Gameday exercises and Runbooks
Bonus material!
Better Know Your Kernel
A lesson not about sharding, learned while sharding…
• Linux's futex_wait() was broken in CentOS 6.6
  • Needed patches backported from Linux 3.18
• Future direction: make kernel updates independent of distribution updates
  • Plenty of good stuff (e.g. networking improvements, kernel introspection [see @brendangregg]) landed between 3.10 and 4.2+, but it won't come to CentOS for years
  • Updating the kernel alone is easier to roll out
What else are we working on?
• Mesos for cluster orchestration
• GPUs for massive increases in per-query computational capacity
Thanks for coming.
gregg@etsy.com
@greggdonovan
Questions?
@greggdonovan
gregg@etsy.com
