ZooKeeper Futures


Presented at the post-ApacheCon Hadoop meetup in Oakland on November 5th, 2009.

  1. Thursday, 5 November 2009
  2. ZooKeeper Futures: Expanding the Menagerie
     Henry Robinson, Software engineer @ Cloudera
     Hadoop meetup - 11/5/2009
  3. Upcoming features for ZooKeeper
     ▪ Observers
     ▪ Dynamic ensembles
     ▪ ZooKeeper in Cloudera’s Distribution for Hadoop
  4. Observers
     ▪ ZOOKEEPER-368
     ▪ Problem:
     ▪ Every node in a ZooKeeper cluster has to vote
     ▪ So increasing the cluster size increases the cost of write operations
     ▪ But increasing the cluster size is currently the only way to get client scalability
     ▪ False tension between number of clients and performance
     ▪ Should only increase size of voting cluster to improve reliability
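As a rough illustration of why write cost grows with the voting cluster, here is a minimal sketch. It assumes a write commits once a strict majority of voting servers has acknowledged it; the function name `quorum_size` is illustrative and not part of ZooKeeper.

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters` voting servers."""
    return voters // 2 + 1


# The number of acknowledgements each write must wait for grows with the
# size of the voting cluster, even if the extra servers were only added
# to give clients more places to connect.
for voters in (3, 5, 7, 9, 15):
    print(f"{voters:>2} voting servers -> {quorum_size(voters)} acks per write")
```

Under this assumption, adding servers purely for client capacity still raises the number of votes every write needs, which is the false tension the slide describes.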
  5. Observers
     ▪ It’s worse than that
     ▪ Since clients are given a list of servers in the ensemble to connect to, the cluster is not isolated from swamping due to the number of clients
     ▪ That is, if a swarm of clients connects to one server and kills it, they’ll move on to another and do the same.
     ▪ Now we are sharing the same number of clients amongst fewer servers!
     ▪ So if these were enough clients originally to down a server, the prognosis is not good for those remaining
     ▪ Only about n/2 servers have to die (⌈n/2⌉, enough to lose the majority) before the cluster is no longer live
  6. Cascading Failures
  7. Cascading Failures
  8. Cascading Failures
  9. Cascading Failures
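The cascade from slide 5 can be sketched with a toy model. Everything here is an assumption made for illustration: clients spread evenly over the live servers, each server dies once its share exceeds a fixed `capacity`, and the names `simulate_cascade` and `has_quorum` are hypothetical.

```python
def simulate_cascade(servers: int, clients: int, capacity: int) -> int:
    """Return how many servers are still alive once the cascade stops.

    Toy model: clients are shared evenly among live servers. While the
    per-server share exceeds `capacity`, one server dies and its clients
    reconnect to the survivors, raising the load on each of them.
    """
    alive = servers
    while alive > 0 and clients / alive > capacity:
        alive -= 1
    return alive


def has_quorum(alive: int, servers: int) -> bool:
    """True if a strict majority of the original ensemble is still alive."""
    return alive >= servers // 2 + 1


if __name__ == "__main__":
    for clients in (1000, 1200):
        alive = simulate_cascade(servers=5, clients=clients, capacity=200)
        print(clients, "clients:", alive, "alive, quorum:", has_quorum(alive, 5))
```

In this model the load per survivor only ever goes up, so a client population large enough to kill the first server takes down every remaining one as well.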
  10. Observers
     ▪ Simple way to attack this problem: non-voting cluster members
     ▪ Act as a fan-in point for client connections by proxying requests to the inner voting ensemble
     ▪ Doesn’t matter if they die (in the sense that liveness is preserved) - cluster is still available for writes
     ▪ Write throughput stays roughly constant as number of Observers increases
     ▪ So we can freely scale the number of Observers to meet the requirements of the number of clients
  11. Observers: More benefits
     ▪ Voting ensemble members must meet strict latency contracts in order to not be considered ‘failed’
     ▪ Therefore distributing ZooKeeper across many racks, or even datacenters, is problematic.
     ▪ No such requirements made of Observers
     ▪ So deploy the voting ensemble for reliability and low-latency communication, and everywhere you need a client, add an Observer
     ▪ Reads get served locally, so wide distribution isn’t too painful for some workloads
     ▪ Likelihood of partition increases with how widely the voting ensemble is distributed, so keeping the voters close together increases availability in some cases
     ▪ Good integration point for publish-subscribe, and for specific optimisations
  12. Observers: Current state
     ▪ This patch required a lot of structural work
     ▪ Hoping to get it into 3.3
     ▪ One major refactor patch committed
     ▪ Core patch up on ZOOKEEPER-368
     ▪ Check it out and add comments!
     ▪ Fully functional - you can apply the patch, update your configuration and start using Observers today
     ▪ Benchmarks show expected (and pleasing!) performance improvements
     ▪ To come in future JIRAs - performance tweaking (batching)
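For a sense of what "update your configuration" involves, the fragment below uses the syntax documented for Observers in the ZooKeeper releases that later shipped the feature. Whether the patch as it stood at the time of this talk used exactly this syntax is an assumption, and the hostnames and ports are placeholders.

```
# In the configuration file of the Observer itself:
peerType=observer

# In the configuration file of every server, voters and Observers alike.
# The trailing ":observer" marks server 4 as a non-voting member.
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=obs1.example.com:2888:3888:observer
```

Clients connect to an Observer exactly as they would to any other server in the ensemble.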
  13. Dynamic Ensembles
     ▪ ZOOKEEPER-107
     ▪ Problem:
     ▪ What if you really do want to change the membership of your cluster?
     ▪ Downtime is problematic for a ‘highly-available’ service
     ▪ But failures occur and machines get repurposed or upgraded
  14. Dynamic Ensembles
     ▪ We would like to be able to add or remove machines from the cluster without stopping the world
     ▪ Conceptually, this is reasonably easy - we have a mechanism for updating information on every server synchronously, and in order
     ▪ (it’s called ZooKeeper)
     ▪ In practice, this is rather involved:
     ▪ When is a new cluster ‘live’?
     ▪ Who votes on the cluster membership change?
     ▪ How do we deal with slow members?
     ▪ What happens when the leader changes?
     ▪ How do we find the cluster when it’s completely different?
  15. Dynamic Ensembles
     ▪ Getting all this right is hard
     ▪ (good!)
     ▪ A fundamental change in how ZooKeeper is designed - much of the code is predicated on a static view of the cluster membership
     ▪ Ideally, we want to prove that the resulting protocol is correct
     ▪ The key observation is that membership changes must be voted upon by both the old and the new configuration
     ▪ So this is no magic bullet if the cluster is down
     ▪ Need to keep track of old configurations so that each vote can be tallied with the right quorum
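The key observation on this slide can be sketched as follows. This is a minimal illustration under the assumption of simple majority quorums, not the protocol from ZOOKEEPER-107; the function names and the `configs` table keyed by configuration version are hypothetical.

```python
from typing import AbstractSet, Dict


def has_majority(acks: AbstractSet[str], members: AbstractSet[str]) -> bool:
    """True if the acks include a strict majority of `members`."""
    return len(acks & members) > len(members) // 2


def reconfig_accepted(acks: AbstractSet[str],
                      old: AbstractSet[str],
                      new: AbstractSet[str]) -> bool:
    """A membership change needs a majority of BOTH the old and new configuration."""
    return has_majority(acks, old) and has_majority(acks, new)


def vote_passes(acks: AbstractSet[str], version: int,
                configs: Dict[int, AbstractSet[str]]) -> bool:
    """Tally an ordinary vote against the configuration it was proposed under."""
    return has_majority(acks, configs[version])


if __name__ == "__main__":
    old = {"a", "b", "c"}
    new = {"c", "d", "e"}
    configs = {1: old, 2: new}  # old configurations are kept for tallying

    print(reconfig_accepted({"a", "b", "c"}, old, new))  # old majority only
    print(reconfig_accepted({"a", "c", "d"}, old, new))  # majority of both
    print(vote_passes({"d", "e"}, 2, configs))           # tallied against new
```

Because the old configuration must still be able to form a majority, this approach cannot rescue a cluster that has already lost its quorum, which matches the "no magic bullet" caveat above.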
  16. Dynamic Ensembles
     ▪ Lots of discussion on the JIRA
     ▪ although no public activity for a couple of months
     ▪ I have code that pretty much works
     ▪ But waiting until Observers gets committed before I move focus completely to this
     ▪ Current situation not *too* bad; there are upgrade workarounds that are a bit scary theoretically but in practice work ok.
  17. ZooKeeper Packages in CDH
     ▪ We maintain Cloudera’s Distribution for Hadoop
     ▪ Packages for Mapred, HDFS, HBase, Pig and Hive
     ▪ We see ZooKeeper as increasingly important to that stack, as well as having a wide variety of other applications
     ▪ Therefore, we’ve packaged ZooKeeper 3.2.1 and are making it a first-class part of CDH
     ▪ We’ll track the Apache releases, and also backport important patches
     ▪ Wrapped up in the service framework:
     ▪ /sbin/service zookeeper start
     ▪ RPMs and tarballs are done, DEBs to follow imminently
     ▪ Download RPMs at http://archive.cloudera.com/redhat/cdh/unstable/
  18. Thanks! Questions? henry@cloudera.com