Zookeeper Architecture

What is Zookeeper?

Coordination service:
A coordination service in a distributed world helps distributed applications
offload common challenges like
- Synchronization between the nodes of the cluster.
- Distributing a common configuration between the nodes of the cluster.
- Grouping and naming services for the nodes of the cluster.
- Leader election between the nodes.
Znodes:
Zookeeper(Zk) helps the nodes of the distributed applications coordinate with
each other by providing a common namespace.
Nodes can use this namespace to save and retrieve any shared info to help
coordination.
The namespace is hierarchical much like a tree d/s.
Each element in this namespace is called a znode associated with a name
separated by a (/) to indicate its hierarchical path from the root.
These namespace is stored in-memory and therefore provides faster access.
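As a rough illustration, here is how a client might build and read this namespace with the Apache ZooKeeper Java client; a minimal sketch, assuming a hypothetical local ensemble at localhost:2181 and made-up /app paths:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class NamespaceDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect; the watcher fires once the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                e -> { if (e.getState() == Watcher.Event.KeeperState.SyncConnected) connected.countDown(); });
        connected.await();

        // Build a small hierarchy: each znode is named by its path from the root.
        zk.create("/app", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "retries=3".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node of the distributed application can now read the shared info back.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);
        System.out.println(new String(data, StandardCharsets.UTF_8)
                + " (version " + stat.getVersion() + ")");
        zk.close();
    }
}
```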
Ensemble:
Similar to the distributed application clients that it servers, zookeeper itself is
distributed i.e, a set of zookeeper nodes work together to achieve its goal. This
group of zookeeper nodes is called an Ensemble.
Clients can talk to any node within the ensemble (via zk client lib). Clients
periodically send heartbeats to server & receive an ack back to reaffirm its
connectivity.
Each node in the ensemble is aware of and talks to other nodes to share info.
Zookeeper Data Model
Each znode stores a stat structure that contains the zxid (transaction ID),
version number, timestamps, and ACL. The client receives this stat structure at
the time of a read. The stat structure lets the client make its updates/deletes
conditional on the znode version.
With every create/update the stat structure is updated: the version number
increments, the zxid advances, the modification timestamp is reset, etc.
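A minimal sketch of such a version-checked update with the Java client's setData(); the /app/config path and the payload are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class ConditionalUpdate {
    // Read-modify-write guarded by the version from the stat structure:
    // setData() succeeds only if nobody updated the znode since our read.
    static void bumpConfig(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        byte[] old = zk.getData("/app/config", false, stat);    // the read fills in stat
        byte[] next = (new String(old, StandardCharsets.UTF_8) + ";updated=true")
                .getBytes(StandardCharsets.UTF_8);
        try {
            zk.setData("/app/config", next, stat.getVersion()); // conditional on version
        } catch (KeeperException.BadVersionException e) {
            // A concurrent writer got in first: re-read and retry.
        }
    }
}
```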
Types of ZNodes:
● Ephemeral:
○ Exist only as long as the session that created them
exists.
○ Cannot have children.
● Persistent:
○ Unlike ephemeral znodes, these persist across sessions.
● Sequential:
○ Contain a monotonically increasing number as part of their
name, which keeps the name unique. (See the sketch below.)
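In the Java client these types map to CreateMode flags (sequential combines with either lifetime, e.g. EPHEMERAL_SEQUENTIAL); a small sketch with hypothetical /jobs paths:

```java
import org.apache.zookeeper.*;

class ZnodeTypes {
    static void demo(ZooKeeper zk) throws Exception {
        // PERSISTENT survives the creating session; EPHEMERAL is tied to it
        // and cannot have children of its own.
        zk.create("/jobs", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/jobs/worker1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        // SEQUENTIAL: the server appends a 10-digit monotonically increasing counter.
        String seq = zk.create("/jobs/task-", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println(seq); // e.g. /jobs/task-0000000003
    }
}
```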
Zookeeper Sessions
- The client lib configuration contains the list of all Zookeeper servers.
- The client establishes a connection to a random server from the list.
- The connected server returns a session token upon successful
connection.
- Both client and connected server periodically exchange
heartbeats to confirm that they are each alive.
- If the client loses connectivity, the client lib will, upon timeout,
connect to some other server from the config list. This switch is
transparent to the client application.
- During reconnection, the token from the previous connection is
presented to re-attach to the lost session.
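A minimal connection sketch with the Java client; the server list zk1:2181,zk2:2181,zk3:2181 is hypothetical. Server switching and session re-attachment are handled inside the client lib; the watcher merely observes session state:

```java
import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionDemo implements Watcher {
    private ZooKeeper zk;

    public void connect() throws IOException {
        // The client lib is handed the whole server list and picks one;
        // on disconnect it transparently retries the others with the same session.
        zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, this);
    }

    @Override
    public void process(WatchedEvent event) {
        switch (event.getState()) {
            case SyncConnected:
                System.out.println("connected, session 0x" + Long.toHexString(zk.getSessionId()));
                break;
            case Disconnected:
                System.out.println("lost server; client lib is retrying the others");
                break;
            case Expired:
                System.out.println("session expired; a new client must be created");
                break;
            default: break;
        }
    }
}
```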
Zookeeper Watches
- Client ops like getData(), getChildren(), exits() etc.., has an
optional parameter to enable a watch on the target znode.
- Zookeeper servers notifies a single change event to the
watchers of the znode. Successive changes to the znode will
not be notify the watchers.
- There are 2 kinds of Watches
- Data watches: Watches for a change in data on a
znode.
getData(), exists() are set to watch for a change in
data.
Also create(), delete()
- Children watches: Watches for a add/deletion of a
child node for a parent znode.
getChildren() is set to watch for add/dele for child
znodes for a parent znode.
Also create(), delete()
- Servers A & B create ephemeral nodes 1 & 2
respectively.
- When A dies, its node 1 is removed and B, which is
watching 1, is notified.
- B can now leverage this info to take corrective
action.
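A sketch of that pattern with the Java client: B sets a one-shot watch on A's ephemeral node and reacts to its deletion (the path and the reaction are illustrative):

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class PeerWatch {
    // Watch a peer's ephemeral node; NodeDeleted means the peer's session died.
    static void watchPeer(ZooKeeper zk, String path) throws Exception {
        Stat stat = zk.exists(path, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                System.out.println(path + " is gone; peer likely died, take over its work");
            }
            // Watches are one-shot: call exists(path, ...) again here to keep observing.
        });
        if (stat == null) System.out.println(path + " does not exist yet");
    }
}
```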
Zookeeper Data Access
Read Path:
- In a Leader/Follower model, reads are eventually consistent.
- The client connects to one of the ZK servers and requests the znode at a
given path.
- The connected ZK server authenticates the client and serves the read from
its locally stored namespace.
- Since it is a local copy, it can be stale.
- ZK favors availability over consistency for reads; each server stores
its own copy of the namespace. (A client can force a fresh read, as
sketched below.)
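A client that cannot tolerate a stale read can call sync() before getData(); sync() asks the connected server to catch up with the leader before serving the read. A minimal sketch with the Java client (the path is the caller's choice):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

class FreshRead {
    // Reads are served from the connected server's local replica and may be stale.
    // sync() flushes the pipeline between that server and the leader first.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null);
    }
}
```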
Write Path:
- The client connects to one of the ZK servers and requests a create/delete of a
znode at a path in the namespace.
- Since all writes are handled by the leader, the connected ZK server
forwards the write to the leader.
- The leader persists the data, broadcasts the write to all followers in the
cluster, and awaits their responses.
- If a majority of them write it into their local namespace and respond,
we have a quorum and the write is a success.
- The initially connected ZK server then reports the write as a success to
the client.
Recipe - Barrier
Intent: Enforce a barrier while performing a critical
job.
● The client calls exists(/b, true) to check if the barrier
exists, and sets a watch.
● If barrier /b doesn't exist, the client creates an ephemeral
node and proceeds with its job:
create(/b, EPHEMERAL)
● If barrier /b exists, the client waits for the watch to
trigger. At this point there may be multiple
clients waiting and watching the same
barrier /b.
● Once its job is done, the client deletes the
barrier.
● The deletion of the barrier node triggers a notification
to all watchers.
● The other waiting clients can now retry with calls
to exists(/b, true).
Usage: Critical updates/housekeeping tasks that force other
processes to wait. (A sketch follows the notes below.)
(Flowchart: each client calls exists(/b, true); if /b exists, it waits for the
watch notification; otherwise it calls create(/b, EPHEMERAL), runs its job, then
deletes /b, which notifies all watchers.)
Create & delete are atomic ops performed by the leader upon agreement with a quorum.
The leader serializes concurrent create requests from different clients, so their
order is guaranteed in the event of a race.
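A sketch of the barrier with the Java client. It reorders the recipe slightly: it attempts create() first and falls back to waiting on NodeExistsException, which closes the check-then-act race; the leader's serialization of creates guarantees only one client wins:

```java
import org.apache.zookeeper.*;

public class Barrier implements Watcher {
    private final ZooKeeper zk;
    private final String path = "/b";
    private final Object monitor = new Object();

    Barrier(ZooKeeper zk) { this.zk = zk; }

    // Run the job while holding the barrier; other clients wait until /b is deleted.
    void runGuarded(Runnable job) throws Exception {
        while (true) {
            try {
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                break; // we own the barrier
            } catch (KeeperException.NodeExistsException e) {
                synchronized (monitor) {
                    // Set a watch; if /b is still there, sleep until NodeDeleted wakes us.
                    if (zk.exists(path, this) != null) monitor.wait();
                }
            }
        }
        try {
            job.run();
        } finally {
            zk.delete(path, -1); // deletion notifies every watcher
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            synchronized (monitor) { monitor.notifyAll(); }
        }
    }
}
```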
Recipe - Cluster Management
Intent: Notify nodes about the arrival or departure of other nodes in the
cluster.
● Create a PERSISTENT parent node /member.
● Each client sets a watch on the parent node /member:
exists(/member, true)
● Each client creates an EPHEMERAL child node under /member:
create(/member/host1, EPHEMERAL)
● Each client updates its status (CPU/memory/failure etc.) on its
node in the hierarchy.
● Watches are triggered for all watchers on a change to any child
node.
Usage: Cluster monitoring or management for elastic scaling. (A sketch
follows the diagram note below.)
(Diagram: clients c1, c2, c3 each watch the parent /member and create/update
/member/c1, /member/c2, /member/c3; ZK notifies the watchers. When client c3
creates /member/c3, ZK notifies the other watchers, viz. c1 and c2.)
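A sketch of the membership recipe with the Java client (host names and the status payload are illustrative). Note that the children watch is re-registered on every trigger, since watches are one-shot:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.*;

public class Membership {
    // Join: the ephemeral child disappears automatically if this client dies.
    static void join(ZooKeeper zk, String host) throws Exception {
        try {
            zk.create("/member", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignored) { /* parent already exists */ }
        zk.create("/member/" + host, "status=up".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // Monitor: list the members and re-arm the one-shot children watch each time it fires.
    static void monitor(ZooKeeper zk) throws Exception {
        List<String> members = zk.getChildren("/member", event -> {
            if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
                try { monitor(zk); } catch (Exception e) { /* log and retry */ }
            }
        });
        System.out.println("current members: " + members);
    }
}
```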
Recipe - Queues
Intent: Create ordered, FIFO data access.
● Create a PERSISTENT parent node /queue.
● Each client creates an EPHEMERAL & SEQUENTIAL child node
under /queue. Since it is sequential, the server appends a monotonically
increasing number to the name, e.g., /queue/X-00001, /queue/X-00002...
create(/queue/X-, EPHEMERAL|SEQUENTIAL)
● A client that wants to access the nodes in insertion order simply
fetches all the children:
getChildren(/queue, true)
By enabling the watch on the parent, the accessing client is
notified when a child is created or removed externally.
Usage: Producer/consumer work queues. (A sketch follows the diagram note
below.)
(Diagram: clients c1, c2, c3 create /queue/x-0001, /queue/x-0002, /queue/x-0003;
client c4 calls getChildren and watches for changes to the children.)
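A sketch of the queue with the Java client. The deck uses ephemeral & sequential children; this sketch deliberately uses PERSISTENT_SEQUENTIAL instead so that queued items outlive a disconnecting producer:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class ZkQueue {
    // Producer: the server appends a monotonically increasing counter to "X-".
    static void enqueue(ZooKeeper zk, byte[] item) throws Exception {
        zk.create("/queue/X-", item, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    }

    // Consumer: children come back unordered; sort by the sequence suffix for FIFO.
    static byte[] dequeue(ZooKeeper zk) throws Exception {
        List<String> kids = zk.getChildren("/queue", false);
        if (kids.isEmpty()) return null;
        Collections.sort(kids);                  // X-0000000001 < X-0000000002 ...
        String head = kids.get(0);
        byte[] data = zk.getData("/queue/" + head, false, null);
        zk.delete("/queue/" + head, -1);         // competing consumers should catch NoNodeException and retry
        return data;
    }
}
```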
Recipe - Locks
Intent: Avoid race conditions by enforcing a lock/key pattern.
1. Create a PERSISTENT parent node /lock.
2. Each client creates an EPHEMERAL & SEQUENTIAL child node
under /lock. Since it is sequential, the server appends a monotonically
increasing number to the name, e.g., /lock/X-00001, /lock/X-00002...
create(/lock/X-, EPHEMERAL|SEQUENTIAL)
3. Locks are granted in insertion order, from smallest to largest. A
client checks whether its znode is the lowest by invoking
getChildren(/lock, false)
4. If the 1st znode in the list of children is its very own, the lock is
acquired. The client proceeds to do its job and, upon completion,
releases the lock by deleting its znode:
delete(/lock/X-00001)
5. Else, it waits for its turn by adding a watch on its predecessor
znode. (If its immediate predecessor doesn't exist, look for the
one before it, and so on until you find one.)
exists(/lock/X-0000(n-1), true)
6. When its predecessor znode is deleted/updated, the client is
notified.
7. When a client receives this event, it goes back to step 3. (See the
sketch after the diagram note below.)
(Diagram: clients c1, c2, c3 create /lock/x-0001, /lock/x-0002, /lock/x-0003;
each calls getChildren() and watches its predecessor.)
Each client checks for the existence of its predecessor. It also needs to
re-check with the parent whether it is now the first surviving child, in case
its predecessor dies.
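A sketch of the lock with the Java client, including the re-check for the case where the immediate predecessor has already vanished:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ZkLock {
    private final ZooKeeper zk;
    private String myNode; // e.g. /lock/X-0000000004

    ZkLock(ZooKeeper zk) { this.zk = zk; }

    void acquire() throws Exception {
        myNode = zk.create("/lock/X-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        while (true) {
            List<String> kids = zk.getChildren("/lock", false);
            Collections.sort(kids);
            int i = kids.indexOf(myNode.substring("/lock/".length()));
            if (i == 0) return;              // lowest sequence number: lock acquired
            // Watch only the immediate predecessor (avoids a thundering herd).
            CountDownLatch gone = new CountDownLatch(1);
            Stat s = zk.exists("/lock/" + kids.get(i - 1),
                    e -> { if (e.getType() == Watcher.Event.EventType.NodeDeleted) gone.countDown(); });
            if (s == null) continue;         // predecessor already vanished: re-check from step 3
            gone.await();
        }
    }

    void release() throws Exception {
        zk.delete(myNode, -1);               // wakes our successor, if any
    }
}
```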
Recipe - Leader Election
Intent: Leader election.
1. Create a PERSISTENT parent node /election.
2. Each ZK server creates an EPHEMERAL & SEQUENTIAL child
node under /election. Since it is sequential, the server appends a
monotonically increasing number to the name, e.g., /election/X-00001,
/election/X-00002...
create(/election/X-, EPHEMERAL|SEQUENTIAL)
3. Each ZK server checks if its znode is the smallest among all children:
getChildren(/election, false)
4. If yes, it becomes the leader.
5. Else, it sets a watch on the znode just smaller than its own (the
closest predecessor):
exists(/election/X-0000(n-1), true)
6. If the leader dies, so does its ephemeral znode, triggering a watch
event only for its successor (the next in line, which is watching it).
7. When a node receives this event, it goes back to step 3. (See the
sketch after the diagram note below.)
(Diagram: Zk 1 (Leader), Zk 2, Zk 3 create /election/x-0001, /election/x-0002,
/election/x-0003; each calls getChildren() and watches its predecessor.)
Each node checks for the existence of its predecessor. It also needs to
re-check with the parent whether it is now the first surviving child, in case
its predecessor dies.
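A sketch of the election with the Java client; it is the lock pattern minus an explicit release, since the leader's znode disappears with its session:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class LeaderElection implements Watcher {
    private final ZooKeeper zk;
    private String myNode;

    LeaderElection(ZooKeeper zk) { this.zk = zk; }

    void volunteer() throws Exception {
        myNode = zk.create("/election/X-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        checkLeadership();
    }

    private void checkLeadership() throws Exception {
        List<String> kids = zk.getChildren("/election", false);
        Collections.sort(kids);
        int i = kids.indexOf(myNode.substring("/election/".length()));
        if (i == 0) { System.out.println(myNode + " is the leader"); return; }
        // Watch only the closest smaller znode; its deletion means it is our turn to re-check.
        if (zk.exists("/election/" + kids.get(i - 1), this) == null) {
            checkLeadership();               // predecessor already gone: retry
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { checkLeadership(); } catch (Exception e) { /* resign or retry */ }
        }
    }
}
```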
References
● Brief Architecture
https://data-flair.training/blogs/zookeeper-architecture/
● Data model
https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#ch_zkDataModel
● Zk API
https://www.tutorialspoint.com/zookeeper/zookeeper_api.htm
● Overview
https://www.slideshare.net/scottleber/apache-zookeeper
