
Call me maybe: Jepsen and flaky networks


In the big data world, our data stores communicate over an asynchronous, unreliable network to provide a facade of consistency. However, to really understand the guarantees of these systems, we must understand the realities of networks and test our data stores against them.

Jepsen is a tool which simulates network partitions in data stores and helps us understand the guarantees of our systems and their failure modes. In this talk, I will help you understand why you should care about network partitions and how we can test data stores against partitions using Jepsen. I will explain what Jepsen is, how it works, and the kinds of tests it lets you create. We will try to understand the subtleties of distributed consensus and the CAP theorem, and demonstrate how different data stores such as MongoDB, Cassandra, Elastic and Solr behave under network partitions. Finally, I will describe the results of the tests I wrote using Jepsen for Apache Solr and discuss the kinds of rare failures which were found by this excellent tool.



  1. 1. Call me maybe: Jepsen and flaky networks Shalin Shekhar Mangar @shalinmangar Lucidworks Inc.
  2. 2. Typical first year for a new cluster — Jeff Dean, Google • ~5 racks out of 30 go wonky (50% packet loss) • ~8 network maintenances (4 might cause ~30-minute random connectivity losses) • ~3 router failures (have to immediately pull traffic for an hour) LADIS 2009
  3. 3. Reliable networks are a myth • GC pause • Process crash • Scheduling delays • Network maintenance • Faulty equipment
  4. 4. Network (diagram: a five-node cluster, n1 to n5, connected over a network)
  5. 5. Network partition (diagram: the same five nodes split into two groups that cannot reach each other)
  6. 6. Messages can be lost, delayed, reordered and duplicated (timeline diagrams between n1 and n2: Drop, Delay, Duplicate, Reorder)
  7. 7. CAP recap • Consistency (Linearizability): A total order on all operations such that each operation looks as if it were completed at a single instant. • Availability: Every request received by a non-failing node in the system must result in a response. • Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
  8. 8. Have you planned for these? During and after a partition: • Errors • Connection timeouts • Hung requests (read timeouts) • Stale results • Dirty results • Data lost forever! (diagram: both availability and consistency crossed out)
  9. 9. Jepsen: Testing systems under stress • Network partitions • Random process crashes • Slow networks • Clock skew http://github.com/aphyr/jepsen
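Jepsen injects partitions by manipulating firewall rules on the nodes themselves. The sketch below (Python rather than Jepsen's own Clojure) only illustrates that idea; the node hostnames, the passwordless-SSH setup and the helper functions are assumptions made for illustration.

```python
# Hedged sketch of a Jepsen-style partition nemesis: drop packets between
# two halves of the cluster with iptables, then heal. Hostnames are made up.
import subprocess

NODES = ["n1", "n2", "n3", "n4", "n5"]   # assumed hostnames

def ssh(node, command):
    """Run a shell command on a node (assumes passwordless SSH as root)."""
    subprocess.run(["ssh", node, command], check=True)

def partition(group_a, group_b):
    """Drop all traffic between the two groups, in both directions."""
    for a in group_a:
        for b in group_b:
            ssh(a, f"iptables -A INPUT -s {b} -j DROP")
            ssh(b, f"iptables -A INPUT -s {a} -j DROP")

def heal():
    """Flush the injected rules so the cluster can reconverge."""
    for node in NODES:
        ssh(node, "iptables -F INPUT")

partition(NODES[:2], NODES[2:])   # e.g. {n1, n2} | {n3, n4, n5}
# ... keep clients writing while the partition is in place ...
heal()
```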
  10. 10. Anatomy of a Jepsen test • Automated DB setup • Test definitions a.k.a Client • Partition types a.k.a Nemesis • Scheduler of operations (client & nemesis) • History of operations • Consistency checker (legend: some parts are data store specific for Mongo/Solr/Elastic, the rest is provided by Jepsen)
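As a rough outline of how those pieces fit together, here is a Python sketch of the test loop; the `db`, `client`, `nemesis` and `checker` objects are hypothetical stand-ins for the Clojure protocols Jepsen actually defines.

```python
# Sketch only: drive client operations and the nemesis, record everything
# in a history, then hand the history to a consistency checker.
import random
import time

def run_test(db, client, nemesis, checker, duration=200):
    db.setup()                        # datastore-specific: install, configure, start
    history = []                      # log of every operation and its outcome
    deadline = time.time() + duration
    while time.time() < deadline:
        if random.random() < 0.05:    # scheduler occasionally fires the nemesis
            history.append(nemesis.invoke())
        op = client.generate_op()     # e.g. {"f": "cas", "value": (2, 4)}
        result = client.invoke(op)    # "ok", "fail", or "info" (unknown / timed out)
        history.append((time.time(), op, result))
    nemesis.heal()
    return checker.check(history)     # e.g. verify no acknowledged write was lost
```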
  11. 11. (diagram: clients c1, c2, c3 issue operations against datastore nodes n1, n2, n3; each response of OK, failure (X) or unknown (?) is recorded in the history)
  12. 12. nem·e·sis: the inescapable agent of someone’s downfall
  13. 13. Nemesis (diagrams: partition-random-node, kill-random-node and clock-scrambler acting on nodes n1 to n5)
  14. 14. Nemesis (diagrams: partition-halves, partition-random-halves and bridge splitting nodes n1 to n5)
  15. 15. A set of integers: cas-set-client • S = {1, 2, 3, 4, 5, …} • Stored as a single document containing all the integers • Update using compare-and-set • Multiple clients try to update concurrently • Create and restore partitions • Finally, read the set of integers and verify consistency
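A minimal sketch of such a cas-set-client, assuming a hypothetical `store` object whose `read()` returns the current set plus a version and whose `cas(version, new_value)` reports success or failure:

```python
# Each worker thread repeatedly reads the document and tries to CAS in one
# integer, retrying on conflict; acknowledged integers are remembered so the
# final read can be checked against them.
import threading

acknowledged = []                 # integers whose CAS was confirmed
lock = threading.Lock()

def add_element(store, i):
    while True:
        current, version = store.read()
        if store.cas(version, current | {i}):
            with lock:
                acknowledged.append(i)
            return

def check(store):
    """After the test: every acknowledged integer must be in the final set."""
    final, _ = store.read()
    lost = [i for i in acknowledged if i not in final]
    return {"lost": lost, "ok": not lost}
```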
  16. 16. Compare and Set client (timeline diagram, two clients: cas({}, 1) → {1}, cas(1, 2) → {1, 2}, then the conflicting cas(1, 3) and cas(2, 4) fail, and cas(2, 5) → {1, 2, 5})
  17. 17. Compare and Set client (the same timeline, recorded as a history: History = [(t, op, result)])
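Given such a history and the final read, a checker can classify every write: acknowledged writes must appear in the final set, failed writes must not, and indeterminate ones (timeouts) may go either way. A hedged sketch, assuming the `(t, op, result)` tuple format above:

```python
# Classify history entries and compare against the final read of the set.
# An acknowledged write missing from the final read is a lost update; an
# "info" (timed out) write may legitimately be present or absent.
def check_cas_set(history, final_read):
    acked  = {op["value"] for (_, op, res) in history
              if op["f"] == "add" and res == "ok"}
    unsure = {op["value"] for (_, op, res) in history
              if op["f"] == "add" and res == "info"}
    lost       = acked - final_read
    unexpected = final_read - acked - unsure
    return {"lost": sorted(lost),
            "unexpected": sorted(unexpected),
            "valid": not lost and not unexpected}
```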
  18. 18. Solr • Search server built on Lucene • Lucene index + transaction log • Optimistic concurrency, linearizable CAS ops • Synchronous replication to all ‘live’ nodes • ZooKeeper for ‘consensus’ • http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
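Solr's optimistic concurrency works through the `_version_` field: resending a document with the last seen `_version_` makes Solr reject the update with HTTP 409 if the document has changed in the meantime. A rough sketch against a hypothetical core named `jepsen` (error handling and retries elided):

```python
# Sketch of the CAS primitive the Solr test relies on: read a document with
# its _version_, modify it, and write it back conditionally.
import requests

SOLR = "http://localhost:8983/solr/jepsen"     # hypothetical core

def read_doc(doc_id):
    resp = requests.get(f"{SOLR}/select",
                        params={"q": f"id:{doc_id}", "wt": "json"}).json()
    return resp["response"]["docs"][0]          # stored fields include _version_

def cas_update(doc):
    # Because the document carries its last seen _version_, Solr rejects the
    # update with HTTP 409 if another writer got there first.
    resp = requests.post(f"{SOLR}/update", params={"commit": "true"}, json=[doc])
    return resp.status_code != 409

doc = read_doc("ints")
doc["values"] = doc.get("values", []) + [42]
ok = cas_update(doc)                            # a real client retries on conflict
```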
  19. 19. Add an integer every second, partition network every 30 seconds for 200 seconds
  20. 20. Solr - Are we safe? • Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected) • Some writes ‘hang’ for a long time on partition. Timeouts are essential. (unexpected) • Final reads under CAS are consistent but we haven’t proved linearizability (good!) • Loss of availability for writes in the minority partition. (expected) • No data loss (yet!) which is great!
  21. 21. Solr - Bugs, bugs & bugs • SOLR-6530: Commits under network partition can put any node into ‘down’ state. • SOLR-6583: Resuming connection with ZK causes log replay • SOLR-6511: Request threads hang under network partition • SOLR-7636: A flaky cluster status API - times out during partitions • SOLR-7109: Indexing threads stuck under network partition can mark leader as down
  22. 22. Elastic • Search server built on Lucene • It has a Lucene index and a transaction log • Consistent single doc reads, writes & updates • Eventually consistent search but a flush/commit should ensure that changes are visible
  23. 23. Elastic • Optimistic concurrency control a.k.a CAS linearizability • Synchronous acknowledgement from a majority of nodes • “Instantaneous” promotion under a partition • Homegrown ‘ZenDisco’ consensus
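In the Elasticsearch 1.x releases the talk tested, that optimistic concurrency is exposed as a per-document version: writing with `?version=N` fails with HTTP 409 if the document has already moved past that version. A hedged sketch, with the index and document names made up:

```python
# Read a document with its _version, modify it, and write it back
# conditionally; a version conflict comes back as HTTP 409.
import requests

ES = "http://localhost:9200/jepsen/doc/ints"    # index/type/id are placeholders

def read():
    body = requests.get(ES).json()
    return body["_source"], body["_version"]

def cas_write(source, expected_version):
    # ?version=N asks Elasticsearch to refuse the write (HTTP 409,
    # a version conflict) if the document is already past that version.
    resp = requests.put(ES, params={"version": expected_version}, json=source)
    return resp.status_code != 409

source, version = read()
source.setdefault("values", []).append(42)
ok = cas_write(source, version)                 # retry from read() on conflict
```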
  24. 24. Elastic - Are we safe? • “Instantaneous” promotion is not. 90-second timeouts to elect a new primary (worse in <1.5.0) • Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0. Better in 1.5.0, only 22/897 lost. • Isolated primaries: 209/947 updates lost • Repeated pauses (simulating GC): 200/2143 updates lost • Getting better but not quite there. Good documentation on resiliency problems.
  25. 25. MongoDB • Document-oriented database • Replica set has a single primary which accepts writes • Primary asynchronously replicates writes to secondaries • Replicas decide among themselves when to promote/demote primaries • Applies to 2.4.3 and 2.6.7
  26. 26. MongoDB • Claims atomic writes per document and consistent reads • But strict consistency only when reading from primaries • Eventual consistency when reading from secondaries
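Those knobs map to write concern and read preference in the client. A small pymongo sketch, with the connection string and collection name as placeholders, showing why reads from secondaries are only eventually consistent:

```python
# Majority-acknowledged writes can still be followed by stale reads when the
# read goes to a secondary, because secondaries replicate asynchronously.
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://n1,n2,n3/?replicaSet=jepsen")
db = client.test

# Writes acknowledged by a majority of the replica set...
majority = db.get_collection("ints",
                             write_concern=WriteConcern(w="majority"))
majority.insert_one({"value": 42})

# ...are not guaranteed to be visible yet on a secondary.
secondary = db.get_collection("ints",
                              read_preference=ReadPreference.SECONDARY)
doc = secondary.find_one({"value": 42})   # may be None right after the write
```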
  27. 27. MongoDB - Are we safe? Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
  28. 28. MongoDB - Are we really safe? • Inconsistent reads are possible even with majority write concern • Read-uncommitted isolation • A minority partition will allow both stale reads and dirty reads
  29. 29. Conclusion • Network communication is flaky! Plan for it. • Hacker News driven development (HDD) is not a good way of choosing data stores! • Test the guarantees of your data stores. • Help me find more Solr bugs!
  30. 30. References • Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen • Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/ • Jepsen on github: github.com/aphyr/jepsen • Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
  31. 31. Solr/Lucene Meetup on 25th July 2015 Venue: Target Corporation, Manyata Embassy Business Park Time: 9:30am to 1pm Talks: Crux of eCommerce Search and Relevancy Creating Search Analytics Dashboards Signup at http://meetu.ps/2KnJHM
  32. 32. Thank you shalin@apache.org @shalinmangar
