ZooKeeper wears a lot of hats. It can serve multiple purposes, all related to the coordination of a distributed system. More concisely, ZooKeeper is a coordination service for distributed applications.
A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. (Source: Wikipedia)
Digging a bit deeper:
Simply put, it backs the nodes of your distributed system through a tree structure of znodes (more in upcoming slides).
It acts as a central nervous system for your distributed application.
It is centralized and replicated across an odd number of hosts that make up an ensemble.
Data is kept in memory and is backed up to a log for reliability. By using memory, ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes.
It performs best with read-heavy applications.
Clients connect to any zk node in the ensemble and maintain that connection.
The API covers create, read, update, and delete (but also watches, more in later slides).
Replicated hosts make up an “ensemble”
Writes are forwarded to one elected leader. A quorum (a majority of the ensemble) must confirm each update, which is why the ensemble should have an odd number of instances: an even-sized ensemble tolerates no more failures than the odd-sized one just below it. Updates are submitted concurrently by clients and committed in FIFO order. After an update, incremental state changes are broadcast to replicas using ZooKeeper Atomic Broadcast (ZAB). Each state change is incremental with respect to the previous state, so there is an implicit dependence on the order of the state changes. This guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
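The quorum arithmetic behind the "odd number of instances" advice can be sketched as follows. This is illustrative arithmetic only, assuming a simple majority quorum; the class and method names are hypothetical and not part of the ZooKeeper API.

```java
// Illustrative arithmetic only; not part of the ZooKeeper API.
final class QuorumMath {
    /** Smallest number of servers that forms a strict majority of the ensemble. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** Number of servers that can fail while a majority can still be formed. */
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        // Print the quorum size and failure tolerance for small ensembles.
        for (int n = 3; n <= 7; n++) {
            System.out.printf("ensemble=%d quorum=%d toleratedFailures=%d%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Under this model, going from 5 to 6 instances raises the quorum size without raising the number of failures the ensemble can survive, so the extra instance buys nothing for availability.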
Znodes maintain a stat structure that includes version numbers for data changes and ACL changes. The stat structure also has timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases.
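To make the role of the data version concrete, here is a minimal in-memory sketch of a version-checked update. It is a model, not the ZooKeeper client: `VersionedNode` is a hypothetical class, and where the real client reports a version mismatch with an exception, this sketch simply returns false. Treating -1 as "match any version" mirrors the convention mentioned later in these notes.

```java
// In-memory model of a znode's data version; not the ZooKeeper client API.
final class VersionedNode {
    private byte[] data;
    private int version = 0;

    VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int version() {
        return version;
    }

    synchronized byte[] data() {
        return data.clone();
    }

    /**
     * Applies the update only if expectedVersion equals the current version,
     * or if expectedVersion is -1 (unconditional). Each successful update
     * increases the version by one.
     */
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            return false; // another writer changed the data first
        }
        data = newData.clone();
        version++;
        return true;
    }
}
```

A client that read version 0 and then loses a race sees its write rejected, re-reads the node, and retries with the new version, which is how optimistic updates are usually coordinated.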
Watches: Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification.
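The "trigger, then clear" behavior means a watch fires at most once and must be set again to receive further notifications. A minimal in-memory sketch of that semantics follows; `WatchedNode` and the event string are hypothetical stand-ins, not the ZooKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// In-memory model of one-time watches; not the ZooKeeper client API.
final class WatchedNode {
    private String data;
    private List<Consumer<String>> watchers = new ArrayList<>();

    WatchedNode(String data) {
        this.data = data;
    }

    /** Reads the data and, if watcher is non-null, registers a one-time watch. */
    String getData(Consumer<String> watcher) {
        if (watcher != null) {
            watchers.add(watcher);
        }
        return data;
    }

    /** Changes the data, notifies every registered watcher once, and clears them. */
    void setData(String newData) {
        data = newData;
        List<Consumer<String>> toNotify = watchers;
        watchers = new ArrayList<>(); // clear first: each watch fires only once
        for (Consumer<String> w : toNotify) {
            w.accept("NodeDataChanged");
        }
    }
}
```

A client that wants continuous updates therefore re-reads the node with a new watch inside its notification handler.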
Ephemeral Nodes: ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children.
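The lifecycle described above can be modeled in a few lines. This is an in-memory sketch under stated assumptions: `EphemeralModel` is hypothetical, session ids are plain longs, and an owner of 0 marks a persistent znode. It is not the ZooKeeper client API.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory model of ephemeral znodes; not the ZooKeeper client API.
final class EphemeralModel {
    static final long PERSISTENT = 0L; // owner value used here for persistent znodes

    // path -> owning session id (PERSISTENT for non-ephemeral znodes)
    private final Map<String, Long> owners = new TreeMap<>();

    /** Creates a znode owned by ownerSession; pass PERSISTENT for a persistent znode. */
    void create(String path, long ownerSession) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (!parent.isEmpty()) {
            Long parentOwner = owners.get(parent);
            if (parentOwner == null) {
                throw new IllegalStateException("parent does not exist: " + parent);
            }
            if (parentOwner != PERSISTENT) {
                throw new IllegalStateException("ephemeral znodes cannot have children: " + parent);
            }
        }
        owners.put(path, ownerSession);
    }

    boolean exists(String path) {
        return owners.containsKey(path);
    }

    /** Ends a session: every ephemeral znode it created is deleted. */
    void closeSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }
}
```

Because an ephemeral znode can vanish at any moment with its session, allowing children under it would leave orphans, which is why the model (like ZooKeeper) rejects them.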
Time in ZooKeeper: ZooKeeper tracks time multiple ways:
Zxid - Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid, and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
Version numbers - Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).
Ticks - When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout.
Real time - ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
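Two of these notions reduce to simple arithmetic, sketched below. The session timeout clamp follows the rule stated above (minimum of 2 times the tick time); an upper bound is not modeled. The zxid layout, with the leader epoch in the high 32 bits and a counter in the low 32 bits, is not stated in these notes and is included as an assumption based on the ZooKeeper internals documentation. `ZkTime` and its methods are hypothetical names.

```java
// Illustrative arithmetic only; not part of the ZooKeeper API.
final class ZkTime {
    /** Returns the timeout the server would grant: never below 2 * tickTime. */
    static int negotiateSessionTimeout(int requestedMs, int tickTimeMs) {
        int minimumMs = 2 * tickTimeMs;
        return Math.max(requestedMs, minimumMs);
    }

    /** Packs an epoch (high 32 bits) and a counter (low 32 bits) into a zxid. Assumed layout. */
    static long zxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static int epochOf(long zxid) {
        return (int) (zxid >>> 32);
    }

    static int counterOf(long zxid) {
        return (int) (zxid & 0xFFFFFFFFL);
    }

    /** True if the change stamped a was applied before the change stamped b. */
    static boolean happenedBefore(long a, long b) {
        return a < b;
    }
}
```

With a tick time of 2000 ms, a client asking for a 1000 ms session is granted 4000 ms. Under the assumed layout, any change from a later epoch compares greater than every change from an earlier one, regardless of the counter.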
A critical component of ZooKeeper is Zab, the ZooKeeper Atomic Broadcast algorithm, which is the protocol that manages atomic updates to the replicas. It is responsible for agreeing on a leader in the ensemble, synchronizing the replicas, managing update transactions to be broadcast, as well as recovering from a crashed state to a valid state.
Why not simply use a database? Because of the guarantees
Fast, especially in read-heavy workloads (it performs best when reads outnumber writes by roughly 10:1)
More performance data in backup slides
Service Discovery - watch through heartbeat
Let ELECTION be a path of choice of the application. To volunteer to be a leader:
1. Create znode z with path "ELECTION/guid-n_" with both SEQUENCE and EPHEMERAL flags;
2. Let C be the children of "ELECTION", and i be the sequence number of z;
3. Watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;
Upon receiving a notification of znode deletion:
1. Let C be the new set of children of ELECTION;
2. If z is the smallest node in C, then execute leader procedure;
3. Otherwise, watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;
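The decision each volunteer makes in this recipe (lead, or watch the node just ahead of me) can be sketched as a pure function over the list of children. This is a minimal sketch, assuming child names end in an underscore followed by the zero-padded sequence number that a SEQUENCE create appends; `ElectionStep` is a hypothetical helper, and the calls that would fetch children and set the watch are left out.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Pure decision logic for the election recipe; no ZooKeeper calls are made here.
final class ElectionStep {
    /** Parses the sequence suffix, e.g. "guid-n_0000000007" gives 7. */
    static int sequenceOf(String childName) {
        return Integer.parseInt(childName.substring(childName.lastIndexOf('_') + 1));
    }

    /**
     * Returns empty when self has the smallest sequence number (self is the leader).
     * Otherwise returns the child with the largest sequence number below self,
     * which is the only node self needs to watch.
     */
    static Optional<String> nodeToWatch(List<String> children, String self) {
        int mine = sequenceOf(self);
        return children.stream()
                .filter(child -> sequenceOf(child) < mine)
                .max(Comparator.comparingInt(ElectionStep::sequenceOf));
    }
}
```

Watching only the immediate predecessor, rather than the leader's node, means that a leader failure wakes a single client instead of every volunteer at once.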
When a Redis machine reaches either the memory or CPU thresholds, we split it either horizontally or vertically. Vertically sharding a Redis machine is simply cutting the number of running Redis instances on the machine by half. We bring up a new master as a slave of the existing master and once the slaving is complete, we make it the new master for half of the Redis instances, leaving the old master as the master for the other half.
The entire user id space is split into 8192 virtual shards. We place one virtual shard per Redis DB, and run multiple Redis instances (ranging from 8 to 32) on each machine depending on the memory and CPU consumption of the shards on those instances. Similarly, we run multiple Redis DBs per Redis instance.
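A lookup over this kind of layout might be sketched as below. Only the figure of 8192 virtual shards comes from the text; the modulo mapping from user id to virtual shard, the range-based assignment table, and the host names are all assumptions made for illustration. In the setup described in this deck, the assignment table would be the shard configuration that clients read from ZooKeeper and watch for changes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical routing sketch; the mapping scheme is an assumption, not Pinterest's code.
final class ShardRouter {
    static final int VIRTUAL_SHARDS = 8192;

    // first virtual shard of a range -> "host:port" of the instance serving that range
    private final TreeMap<Integer, String> rangeStartToInstance = new TreeMap<>();

    /** Assigns the range starting at firstShard (up to the next range start) to an instance. */
    void assign(int firstShard, String instance) {
        rangeStartToInstance.put(firstShard, instance);
    }

    /** Maps a user id to a virtual shard by modulo (assumed scheme). */
    static int virtualShard(long userId) {
        return (int) Math.floorMod(userId, (long) VIRTUAL_SHARDS);
    }

    /** Finds the instance whose range contains the user's virtual shard. */
    String instanceFor(long userId) {
        Map.Entry<Integer, String> entry = rangeStartToInstance.floorEntry(virtualShard(userId));
        if (entry == null) {
            throw new IllegalStateException("no instance assigned for user " + userId);
        }
        return entry.getValue();
    }
}
```

In this model, splitting a machine amounts to one extra `assign` call for the upper half of its range; clients that pick up the updated table start routing those shards to the new master.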
Once a part of Hadoop. Currently maintained by Yahoo! and the Apache Software Foundation.
HBase is the Hadoop database, a distributed, scalable, big data store. HBase supports hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
Currently, hbase clients find the cluster to connect to by asking zookeeper. The only configuration a client needs is the zk quorum to connect to. Masters and hbase slave nodes (regionservers) all register themselves with zk. If their znode evaporates, the master or regionserver is considered lost and repair begins. By default, HBase currently manages the zookeeper cluster itself. It does this in an attempt at not burdening users with yet another technology to figure out; things are bad enough for the hbase noob what with hbase, hdfs, and mapreduce. Part of hbase's management of zk includes being able to see zk configuration in the hbase configuration files. Anything that has the hbase.zookeeper prefix will have its suffix mapped to the corresponding zoo.cfg setting (HBase parses its config and feeds the relevant zk configurations to zk on start).
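The prefix-to-suffix mapping described above can be sketched as a small filter over a key/value configuration. This is not HBase's own code: `ZkConfigMapper` is a hypothetical helper, and the prefix is passed in as a parameter because the exact string may differ between HBase versions. The usage note assumes `hbase.zookeeper.property.` as the prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the mapping; not HBase's implementation.
final class ZkConfigMapper {
    /**
     * Keeps only the entries whose key starts with prefix and strips the prefix,
     * so that the remaining suffix can be used as a zoo.cfg setting name.
     */
    static Map<String, String> toZooCfg(Map<String, String> hbaseConf, String prefix) {
        Map<String, String> zooCfg = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : hbaseConf.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                zooCfg.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return zooCfg;
    }
}
```

For example, with the assumed prefix, an entry `hbase.zookeeper.property.clientPort = 2181` becomes `clientPort = 2181`, while unrelated keys such as `hbase.rootdir` are ignored.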
Using Java as it’s very readable and shows type information. The same can easily be achieved with Node.js, Perl, or your language of choice.
Create - First we create a persistent node to act as a group (parent) for child nodes. Join - Second we create and join ephemeral child nodes to that parent node.
Each example creates with the “unsafe acl” (OPEN_ACL_UNSAFE), so anyone who interacts with ZK can perform an update to that node. An alternative is CREATOR_ALL_ACL, where only the creator can control the node.
List - Think of use cases (active sessions, active system workers, etc).
Delete - Passing a version of -1 deletes unconditionally. You can also pass a specific version, in which case the node is deleted only if its current version matches. The deletion is broadcast to any clients watching that node.
Because operations on paths are so simple, each call must catch the error raised when the node does not exist.
Applications with Apache
Alex Ehrnschwender | Game Server Engineer at DeNA
What is Zookeeper?
“ZooKeeper is a centralized service for
maintaining configuration information,
naming, providing distributed
synchronization, and providing group
services.”
ZooKeeper: A Coordination Service for Distributed Applications
Coordination & synchronization for
Logical namespacing implemented by a
hierarchy (tree) of znodes
Replicated in-memory over multiple hosts
for reliability, availability, and performance
Simple API of CRUD & basic tree operations
for client integration
Zookeeper: Reliability & Consistency
Distributed ensemble with automatic leader
election through quorum
Replicated in-memory on every instance with
snapshot writes to disk
Client TCP connection maintained to any
node with failover support
Guaranteed atomicity & sequential consistency
Zookeeper: Watches & Ephemeral nodes
Underlying znodes have a data structure consisting of version numbers (version, cversion, aversion) & timestamps
● Client-initiated subscriptions to znodes
● Changes to a watched znode trigger notification to subscribed clients
● Backed by a client session and deleted when client session ends
● Cannot have children
Zookeeper: But… why?
“Because of the difficulty of implementing
these kinds of services, applications initially
usually skimp on them, which make them
brittle in the presence of change and
difficult to manage. Even when done
correctly, different implementations of
these services lead to management
complexity when the applications are deployed.”
Zookeeper: Advantages for Backing a Server Cluster
Server workers can become cluster-aware
So much out-of-the-box that would be duplicated with a custom solution
Extremely fast reads (performs best at read-to-write ratios of around 10:1)
Small footprint - An ensemble of only 5-7 zk instances can serve the
coordination needs of several large production applications
Centralized event broadcasting & failure detection (heartbeat)
Zookeeper: Common Use Cases
● Configuration Management
● Service Discovery
● Distributed Cloud-Based File Systems
● Internal DNS Management
● Master (Leader) Election and Voting
● Messaging Queue
● Event Broadcasting & Notification
ZK Use Case Example #1 - Pinterest
Pinterest stores their entire follower model inside sharded Redis instances (~9000 Redis shards, multiple instances per core)
Shard configuration is stored and managed by Zookeeper
Client lookups and watches for shard location & subsequent data retrieval
Master-slave failover triggers updates to znode representation (slave address replaces master)
Vertical splitting of data is broadcast to watching clients
Use Case Example #2 - HBase Cluster Configuration
Exhibitor: A ZK Monitoring & Administration Tool from Netflix
Centralization & externalization of zk ensemble configuration* (S3/remote FS)
Web UI & REST API for ease of management
Instance monitoring with automatic configuration updates
Rolling ensemble changes while maintaining quorum
Miscellaneous administration tasks (backup/restore, log & snapshot cleanup)
* Configuration management for a configuration manager.... so meta!
Zookeeper Atomic Broadcast (ZAB) Algorithm
● Protocol for managing atomic updates to replicas
● Responsible for:
  ○ Agreeing on an ensemble leader
  ○ Synchronizing replicas
  ○ Managing transactions and broadcasts
  ○ Recovery of state
● ZXIDs & transactional ordering
  ○ Local & global primary order
  ○ Primary integrity