ZooKeeper Futures
Expanding the Menagerie

Henry Robinson
Software engineer @ Cloudera
Hadoop meetup - 11/5/2009

Upcoming features for ZooKeeper



▪ Observers
▪ Dynamic ensembles
▪ ZooKeeper in Cloudera’s Distribution for Hadoop




Observers
▪ ZOOKEEPER-368
▪ Problem:
   ▪ Every node in a ZooKeeper cluster has to vote
   ▪ So increasing the cluster size increases the cost of write operations (see the quorum arithmetic sketch below)
   ▪ But increasing the cluster size is currently the only way to get client scalability
   ▪ False tension between the number of clients and performance
   ▪ You should only have to increase the size of the voting cluster to improve reliability
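To make the write-cost point concrete, here is a minimal quorum-arithmetic sketch (my illustration, not from the talk): a write only commits once a strict majority of the voting ensemble has acknowledged it, so every voter you add raises the number of acknowledgements the leader must wait for.

    // QuorumSize.java - illustrative only: majority size versus ensemble size
    public class QuorumSize {
        public static void main(String[] args) {
            for (int voters : new int[] {3, 5, 7, 9}) {
                int quorum = voters / 2 + 1;   // strict majority needed to commit a write
                System.out.println(voters + " voters -> " + quorum + " acks per write");
            }
        }
    }

Running this prints 2 acks for 3 voters but 5 acks for 9 voters - and the leader also waits on the slowest member of that majority, so bigger ensembles mean slower writes.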




Observers

▪ It’s worse than that
   ▪ Since clients are given a list of servers in the ensemble to connect to, the cluster is not isolated from swamping by a large number of clients
   ▪ That is, if a swarm of clients connects to one server and kills it, they’ll move on to another and do the same
   ▪ Now the same number of clients is shared amongst fewer servers!
   ▪ So if there were enough clients to down a server originally, the prognosis is not good for those that remain
   ▪ Once more than n/2 servers have died, the cluster is no longer live (with 5 voters, losing 3 ends liveness)



Cascading Failures

[Diagram sequence over four slides: a swarm of clients overwhelms one server; as it dies, the clients reconnect to the remaining servers and take them down in turn]
Observers
▪ Simple way to attack this problem: non-voting cluster members
▪ Act as a fan-in point for client connections by proxying requests to the inner voting ensemble
▪ Doesn’t matter if they die (in the sense that liveness is preserved) - the cluster is still available for writes
▪ Write throughput stays roughly constant as the number of Observers increases
▪ So we can freely scale the number of Observers to meet the requirements of the number of clients (see the configuration sketch below)
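A minimal configuration sketch of how Observers are set up under the ZOOKEEPER-368 patch (hostnames and ports here are made-up examples, not from the slides): the Observer marks itself as a non-voting peer, and its entry in the shared server list is tagged so the voters never wait on it.

    # zoo.cfg on the Observer itself: declare this peer a non-voting member
    peerType=observer

    # server list shared by every member, voters and Observers alike
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    server.4=observer1.example.com:2888:3888:observer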




Observers: More benefits
▪ Voting ensemble members must meet strict latency contracts in order to not be considered ‘failed’
▪ Therefore distributing ZooKeeper across many racks, or even datacenters, is problematic
▪ No such requirements are made of Observers
▪ So deploy the voting ensemble for reliability and low-latency communication, and everywhere you need a client, add an Observer (see the client sketch below)
▪ Reads get served locally, so wide distribution isn’t too painful for some workloads
▪ The likelihood of partition grows the more widely the ensemble is distributed, so keeping the voters compact and pushing Observers outwards increases availability in some cases
▪ Good integration point for publish-subscribe, and for specific optimisations
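As a sketch of what this looks like from the client side (my example; the hostname is hypothetical), a client in a remote datacenter simply points its connect string at the Observer running nearby, and its reads are answered locally:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // LocalObserverClient.java - illustrative only: connect to the nearby Observer
    public class LocalObserverClient {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("observer.local:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    // handle session events here
                }
            });
            // This read is served by the local Observer; only writes (forwarded to
            // the voting ensemble) have to pay the cross-datacenter latency.
            byte[] data = zk.getData("/some/znode", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }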
Observers: Current state
▪ This patch required a lot of structural work
▪ Hoping to get it into 3.3
▪ One major refactoring patch committed
▪ Core patch up on ZOOKEEPER-368
   ▪ Check it out and add comments!
▪ Fully functional - you can apply the patch, update your configuration and start using Observers today
▪ Benchmarks show expected (and pleasing!) performance improvements
▪ To come in future JIRAs - performance tweaking (batching)


Dynamic Ensembles
▪ ZOOKEEPER-107
▪ Problem:
   ▪ What if you really do want to change the membership of your cluster?
   ▪ Downtime is problematic for a ‘highly-available’ service
   ▪ But failures occur, and machines get repurposed or upgraded




Dynamic Ensembles
▪ We would like to be able to add or remove machines from the cluster without stopping the world
▪ Conceptually, this is reasonably easy - we have a mechanism for updating information on every server synchronously, and in order
   ▪ (it’s called ZooKeeper)
▪ In practice, this is rather involved:
   ▪ When is a new cluster ‘live’?
   ▪ Who votes on the cluster membership change?
   ▪ How do we deal with slow members?
   ▪ What happens when the leader changes?
   ▪ How do we find the cluster when it’s completely different?

Dynamic Ensembles
▪ Getting all this right is hard
   ▪ (good!)
▪ A fundamental change in how ZooKeeper is designed - much of the code is predicated on a static view of the cluster membership
▪ Ideally, we want to prove that the resulting protocol is correct
▪ The key observation is that membership changes must be voted upon by both the old and the new configuration (see the sketch below)
▪ So this is no magic bullet if the cluster is down
▪ Need to keep track of old configurations so that each vote can be tallied against the right quorum
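A minimal sketch (my illustration, not the ZOOKEEPER-107 code) of what "voted upon by both configurations" means when tallying acknowledgements: a membership change commits only once a strict majority of the old configuration and a strict majority of the new configuration have both acknowledged it.

    import java.util.Set;

    // DualQuorum.java - illustrative only: a reconfiguration needs a majority
    // of the OLD membership AND a majority of the NEW membership.
    public class DualQuorum {
        static boolean isQuorum(Set<Long> acks, Set<Long> members) {
            int count = 0;
            for (long id : members) {
                if (acks.contains(id)) {
                    count++;
                }
            }
            return count > members.size() / 2;   // strict majority of this configuration
        }

        static boolean reconfigCommitted(Set<Long> acks, Set<Long> oldMembers, Set<Long> newMembers) {
            return isQuorum(acks, oldMembers) && isQuorum(acks, newMembers);
        }
    }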



Dynamic Ensembles
▪ Lots of discussion on the JIRA
   ▪ although no public activity for a couple of months
▪ I have code that pretty much works
▪ But I’m waiting until Observers gets committed before moving my focus completely to this
▪ The current situation is not *too* bad; there are upgrade workarounds that are a bit scary theoretically but work OK in practice




ZooKeeper Packages in CDH
▪ We maintain Cloudera’s Distribution for Hadoop
   ▪ Packages for MapReduce, HDFS, HBase, Pig and Hive
▪ We see ZooKeeper as increasingly important to that stack, as well as having a wide variety of other applications
▪ Therefore, we’ve packaged ZooKeeper 3.2.1 and are making it a first-class part of CDH
▪ We’ll track the Apache releases, and also backport important patches
▪ Wrapped up in the service framework:
   ▪ /sbin/service zookeeper start
▪ RPMs and tarballs are done, DEBs to follow imminently
▪ Download RPMs at http://archive.cloudera.com/redhat/cdh/unstable/
Thanks! Questions?
henry@cloudera.com



