Distributed Coordination
with ZooKeeper and Curator
Tibor Sulyán
tibor_sulyan@epam.com
April 25, 2015
2CONFIDENTIAL
CAP Theorem by Eric Brewer
• Consistency
• Availability
• Partition Tolerance
Introduction
[Diagram: replicas 1 and 2 receiving a write and a read]
3CONFIDENTIAL
Agenda
1. What is ZooKeeper?
2. ZooKeeper features
3. Coordination Recipes
4. Using Curator with ZooKeeper
5. Deploying ZooKeeper clusters
4CONFIDENTIAL
"ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services."
What is ZooKeeper about?
[Diagram: clients reach an ensemble of zk servers through the ZK API; the ensemble stores a zNode tree:
/root (P), /root/data (P), /root/state (P), /root/state/service-000000001 (ES)]
5CONFIDENTIAL
• Filesystem-like hierarchical structure
– Elements are called zNodes
• zNode operations
– Basic CRUD
– Transactional execution of multiple operations
– Watches
– Versioned changes
• zNode metadata
– Data
– Children
– Metadata (Stat structure)
• zNode types
ZooKeeper Data Model
P persistent
E ephemeral
PS persistent sequential
ES ephemeral sequential
6CONFIDENTIAL
• Ephemeral zNodes
– Session-scoped
– Exists as long as the ephemeral owner's
session is active
– Not persisted
– No children
• Sequence (sequential) zNodes
– Upon creation, zNode name is suffixed
by an integer value
– The value is unique in the zNode path
• Watches
– Can be set on read operations
(getData(), getChildren(), exists())
– One-time trigger when a zNode changes
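The one-time trigger behaviour of watches can be illustrated with a small in-memory sketch. It does not use the ZooKeeper client; the class OneTimeWatchSketch and its methods are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// In-memory illustration only; this is not the ZooKeeper client API.
class OneTimeWatchSketch {
    // path -> watchers waiting for the next change of that path
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Register a watch, as a read operation with a watcher would.
    void watch(String path, Consumer<String> watcher) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    // Simulate a change: every registered watcher fires once and is then dropped.
    void change(String path) {
        List<Consumer<String>> fired = watches.remove(path);
        if (fired != null) {
            for (Consumer<String> watcher : fired) {
                watcher.accept(path);
            }
        }
    }

    public static void main(String[] args) {
        OneTimeWatchSketch zk = new OneTimeWatchSketch();
        List<String> events = new ArrayList<>();
        zk.watch("/servers/server_A", events::add);
        zk.change("/servers/server_A"); // the watcher fires
        zk.change("/servers/server_A"); // no watcher left, nothing fires
        System.out.println(events.size()); // prints 1
    }
}
```

A client that wants to keep observing a zNode has to set the watch again on its next read, and changes that happen between the trigger and the re-registration are not reported individually.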
ZooKeeper Data Model
P
P
/
servers
E server_A
E server_B
P leader
ES server_A0000000001
ES server_B0000000002
7CONFIDENTIAL
// this class acts as the default watcher
class ZooKeeperClient implements Watcher {
    ...
    // connect to the ensemble; 'this' is registered as the default watcher
    ZooKeeper zooKeeper = new ZooKeeper("localhost:2181,localhost:2182,localhost:2183",
            30_000, this);

    @Override
    public void process(WatchedEvent event) {
        // receives zNode changes & connection state changes
        // can be invoked before the constructor returns!
    }
}
ZooKeeper API – connect, default watcher
8CONFIDENTIAL
// synchronous node creation (InterruptedException handling omitted)
try {
    // create() returns the actual path of the created zNode
    String path = zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL);
} catch (KeeperException e) {
    switch (e.code()) {
        case CONNECTIONLOSS:
            // retry operation
            break;
    }
}
// asynchronous node creation
zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL,
        new StringCallback() {
    @Override
    public void processResult(int rc, String path, Object ctx, String name) {
        switch (Code.get(rc)) {
            // handle errors (retry on CONNECTIONLOSS)
        }
    }
}, null /* no context passed to callback */);
ZooKeeper API – create operations &
recoverable errors
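The "retry on CONNECTIONLOSS" comments above can be fleshed out as a generic retry loop. This is a minimal sketch, assuming a stand-in exception type: RecoverableException, withRetry and backoffMs are hypothetical names, not part of the ZooKeeper API, and the doubling backoff is one possible policy.

```java
import java.util.concurrent.Callable;

// Generic retry helper; the exception type is a stand-in, not a ZooKeeper class.
class RetrySketch {
    // Hypothetical marker for errors that are safe to retry (e.g. a lost connection).
    static class RecoverableException extends Exception {
        RecoverableException(String message) {
            super(message);
        }
    }

    // Delay before the given retry (0-based): baseMs * 2^attempt, capped at maxMs.
    static long backoffMs(long baseMs, int attempt, long maxMs) {
        long delay = baseMs << Math.min(attempt, 20);
        return Math.min(delay, maxMs);
    }

    // Runs the operation, retrying only when it fails with RecoverableException.
    static <T> T withRetry(Callable<T> op, int maxRetries, long baseMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (RecoverableException e) {
                if (attempt >= maxRetries) {
                    throw e; // give up after maxRetries retries
                }
                Thread.sleep(backoffMs(baseMs, attempt, 10_000));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String path = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RecoverableException("connection loss");
            }
            return "/test";
        }, 5, 1);
        System.out.println(path + " after " + calls[0] + " attempts");
    }
}
```

A blind retry is not always safe: after a connection loss the client cannot tell whether the server applied the original request, so a retried create may find that the zNode already exists.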
9CONFIDENTIAL
// synchronous update: sets "newdata" for /test regardless of its current version (-1)
// error handling omitted
Stat stat = zooKeeper.setData("/test", "newdata".getBytes(), -1);
// sets "newerdata" only if the current data version is 5
zooKeeper.setData("/test", "newerdata".getBytes(), 5);
ZooKeeper API – versioned update operations
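The version argument turns setData into a compare-and-set. The following in-memory sketch models that rule for a single zNode; VersionedNodeSketch and BadVersionException are hypothetical names for illustration and are not ZooKeeper classes.

```java
// In-memory model of one zNode's data and version; not the ZooKeeper API.
class VersionedNodeSketch {
    private byte[] data = new byte[0];
    private int version = 0;

    // Stand-in for the error a client sees when the expected version is stale.
    static class BadVersionException extends RuntimeException {
        BadVersionException(int expected, int actual) {
            super("expected version " + expected + " but was " + actual);
        }
    }

    // Applies the update if expectedVersion matches, or unconditionally if it is -1.
    // Returns the new version.
    synchronized int setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            throw new BadVersionException(expectedVersion, version);
        }
        data = newData.clone();
        version++;
        return version;
    }

    synchronized byte[] getData() {
        return data.clone();
    }

    synchronized int getVersion() {
        return version;
    }

    public static void main(String[] args) {
        VersionedNodeSketch node = new VersionedNodeSketch();
        node.setData("newdata".getBytes(), -1);  // always succeeds, version becomes 1
        node.setData("newerdata".getBytes(), 1); // succeeds, version becomes 2
        try {
            node.setData("stale".getBytes(), 1); // fails: the version is now 2
        } catch (BadVersionException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A typical read-modify-write loop reads the data together with its version, computes the new value, and repeats the whole cycle if the versioned write is rejected.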
10CONFIDENTIAL
// check if zNode exists; 'false' means no watch is set
// error handling omitted
Stat stat = zooKeeper.exists("/parent/child1", false);
// get data & set the default watcher; 'stat' is filled with the zNode's metadata
stat = new Stat();
byte[] data = zooKeeper.getData("/parent/child1", true, stat);
// use a separate Watcher
stat = zooKeeper.exists("/parent/child2", new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // react to node deletion
    }
});
ZooKeeper API – read operations & setting
watches
11CONFIDENTIAL
• Atomic updates
• Sequential Consistency
• Single System Image
• Timeliness
• Reliability
• Availability
ZooKeeper Guarantees
zk 5
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
12CONFIDENTIAL
propagate, propose, vote, commit, notify
Sequential Consistency
client
follower
leader
follower
follower
setData
sync return
callback called
watch triggered
time
13CONFIDENTIAL
propagate, propose, commit, notify
Timeliness
client 2
follower
leader
follower
follower
client 1
setData (v2)
v2
v2
time
14CONFIDENTIAL
• ZooKeeper process failures are tolerated if
a quorum is present
• Simplest quorum: majority-based
• Avoids split-brain scenarios
Availability
zk 5
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
behaviour on follower failures
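The arithmetic behind a majority quorum is short enough to write down. This sketch assumes every server has one equal vote; the class and method names are illustrative.

```java
// Majority-quorum arithmetic for an ensemble of equally weighted voting servers.
class MajorityQuorumSketch {
    // Smallest number of servers that forms a majority.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of server failures the ensemble survives while keeping a quorum.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " servers: quorum " + quorumSize(n)
                    + ", tolerates " + toleratedFailures(n) + " failures");
        }
    }
}
```

Two disjoint sets of servers can never both be majorities, which is why a partitioned ensemble cannot end up with two active sides. The numbers also show that an even-sized ensemble tolerates no more failures than the next smaller odd-sized one.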
15CONFIDENTIAL
• ZooKeeper process failures are tolerated if
a quorum is present
• Simplest quorum: majority-based
• Avoids split-brain scenarios
Availability
zk 5
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
behaviour on leader failure
zk 1
zk 2
16CONFIDENTIAL
ZooKeeper Recipes
17CONFIDENTIAL
Distributed Coordination Recipes
Shared Data Group Membership
P
P
/
serviceInstances
E serverA
E serverB
Service Discovery
P
P
/
service
E serviceInfo
Lock
Mutex
Leader Election
P
P
/
service
ES service_0000000001
ES service_0000000002
18CONFIDENTIAL
Leader Election Recipe
P
P
/
service
ES service_0000000001
ES service_0000000002
zk 5
zk 1
zk 2 zk 3
service service service
zk 4
ES service_0000000003
service
watch
service_0000000001
watch
service_0000000001
watch
service_0000000001
watch
service_0000000001
service
watch
service_0000000002
n-1 watches are set on the same node
Improvement: each node watches the node immediately before it in sequence order
instead of the first one
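The choice of which node to watch is plain string logic on the child names. The sketch below makes no ZooKeeper calls and assumes the child names end in the 10-digit sequence suffix shown in the diagrams; ElectionSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Pure string logic for the election recipe; no ZooKeeper calls are made here.
class ElectionSketch {
    // Sequence number = the 10 digits appended to a sequential zNode name.
    static int sequenceOf(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    static List<String> sortedBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(ElectionSketch::sequenceOf));
        return sorted;
    }

    // The participant with the lowest sequence number is the leader.
    static String leader(List<String> children) {
        return sortedBySequence(children).get(0);
    }

    // Node to watch: the one just before ours in sequence order, or null if we lead.
    static String nodeToWatch(List<String> children, String ownNode) {
        List<String> sorted = sortedBySequence(children);
        int index = sorted.indexOf(ownNode);
        if (index < 0) {
            throw new IllegalArgumentException("own node is not among the children: " + ownNode);
        }
        return index == 0 ? null : sorted.get(index - 1);
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
                "service_0000000002", "service_0000000001", "service_0000000003");
        System.out.println(leader(children));                            // service_0000000001
        System.out.println(nodeToWatch(children, "service_0000000003")); // service_0000000002
    }
}
```

With this scheme each zNode has at most one watcher, so a leader failure wakes up a single participant. When the watched node disappears, a participant should re-read the children and recompute, because its predecessor may have failed without ever being the leader.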
19CONFIDENTIAL
Improved Leader Election Recipe
P
P
/
service
ES service_0000000001
ES service_0000000002
zk 5
zk 1
zk 2 zk 3
service service service
zk 4
ES service_0000000003
service
watch
service_0000000001
watch
service_0000000002
watch
service_0000000001
service
20CONFIDENTIAL
• Higher level Client API to
ZooKeeper
• Hides most of the complexity of
communicating with ZK ensemble
• Implemented recipes
Curator and ZooKeeper
zk
zk
zk zk
client client client
zk
Curator
ZK API
Curator
ZK API
Curator
ZK API
21CONFIDENTIAL
// create & start framework instance
CuratorFramework framework =
        CuratorFrameworkFactory.newClient("localhost:2181,localhost:2182,localhost:2183",
                new ExponentialBackoffRetry(1000, 20)); // base sleep 1000 ms, at most 20 retries
framework.start();
// foreground operation
Stat stat = framework.setData().forPath("/a/b/c/d", "testdata".getBytes());
// background operation
framework.setData().inBackground().forPath("/a/b/c/d/e", "testdata".getBytes());
Curator API
22CONFIDENTIAL
Protected EPHEMERAL_SEQUENTIAL nodes
Curator Features
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
P
P
/
cluster
framework.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath("/cluster/service");
ES service_0000000002
ES service_0000000003
zk 5
connection loss – reconnect attempt begins
reconnect successful within session timeout – retrying path creation
23CONFIDENTIAL
Protected EPHEMERAL_SEQUENTIAL nodes
Curator Features
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
P
P
/
cluster
framework.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).withProtection().forPath("/cluster/service");
ES _c_16c39a25-87b4-4a54-bd05-1666a3e718de_service_0000000002
zk 5
connection loss – reconnect attempt begins
reconnect successful within session timeout – checking for a zNode with the same GUID
no extra zNode created
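The GUID check after a reconnect can be sketched as a lookup over the parent's children. This is an illustration of the idea only, not Curator's implementation; it assumes the `_c_<GUID>_` name prefix visible in the zNode name above, and ProtectedNodeSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustration of the GUID lookup; this is not Curator's actual code.
class ProtectedNodeSketch {
    // Name prefix in the style shown on the slide: _c_<GUID>_<base name><sequence>
    static String protectedPrefix(String guid) {
        return "_c_" + guid + "_";
    }

    // After a reconnect: did an earlier create attempt with this GUID already succeed?
    static Optional<String> findExisting(List<String> children, String guid) {
        String prefix = protectedPrefix(guid);
        return children.stream().filter(name -> name.startsWith(prefix)).findFirst();
    }

    public static void main(String[] args) {
        String guid = "16c39a25-87b4-4a54-bd05-1666a3e718de";
        List<String> children = Arrays.asList(
                "_c_16c39a25-87b4-4a54-bd05-1666a3e718de_service_0000000002");
        // If a matching child is found, the client reuses it instead of creating another one.
        System.out.println(findExisting(children, guid).isPresent()); // prints true
    }
}
```

Because the sequence suffix is assigned by the server, the client cannot know the final name in advance; the GUID gives it something stable to search for.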
24CONFIDENTIAL
• Performance Considerations
• Using Observers to scale
• Using Hierarchical Quorums for multi-datacenter setup
• Surviving network partition with read-only mode
ZooKeeper in the real world
25CONFIDENTIAL
• Replicated data is kept entirely in memory by ZooKeeper processes
– a full GC can drop a server out of the ensemble
• Synchronous filesystem writes in the commit phase
– can take seconds on an overloaded storage device
– use a dedicated device for ZooKeeper transaction logs
• Maximum zNode size is 1 MB by default
– data + metadata should fit in
– configurable using a system property, but increasing it is not recommended
• Watches and performance
– too many watches on a single node: herd effect
– too many watches overall: increased memory footprint
Performance considerations
26CONFIDENTIAL
propagate, propose, vote, commit, notify
Using Observers to scale
client
follower
leader
follower
follower
setData
sync return
callback called
watch triggered
observer
observers:
• no proposals
• no votes
• can’t be leaders
time
27CONFIDENTIAL
Hierarchical Quorums
zk5
zk4
zk6
zk8
zk7
zk9
zk2
zk1
zk3
Majority quorums:
• any 4 zk failures are tolerated
A datacenter goes down
• remaining ensemble becomes
much less resilient
Hierarchical quorums:
• Disjoint groups are formed
• Quorum requires majority of votes
from the majority of groups
• up to 5 failures can be tolerated (e.g. one whole group plus one server in each remaining group)
• Better for clusters spanning
multiple datacenters
group 1 group 2
group 3
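The hierarchical quorum rule can be checked with a few lines of code. This sketch assumes equal weights and a grouping of {zk1, zk2, zk3}, {zk4, zk5, zk6}, {zk7, zk8, zk9}; the exact group membership is an assumption, since the diagram does not spell it out, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Unweighted hierarchical quorum check; the group membership below is assumed.
class HierarchicalQuorumSketch {
    // True if 'alive' holds a majority of the servers in a majority of the groups.
    static boolean hasQuorum(List<Set<String>> groups, Set<String> alive) {
        int groupsWithMajority = 0;
        for (Set<String> group : groups) {
            long aliveInGroup = group.stream().filter(alive::contains).count();
            if (aliveInGroup > group.size() / 2) {
                groupsWithMajority++;
            }
        }
        return groupsWithMajority > groups.size() / 2;
    }

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        List<Set<String>> groups = Arrays.asList(
                setOf("zk1", "zk2", "zk3"),
                setOf("zk4", "zk5", "zk6"),
                setOf("zk7", "zk8", "zk9"));
        // Group 3 is down and one server is lost in each other group: 5 failures.
        System.out.println(hasQuorum(groups, setOf("zk1", "zk2", "zk4", "zk5"))); // true
        // Five servers alive, but only group 1 keeps its majority.
        System.out.println(hasQuorum(groups, setOf("zk1", "zk2", "zk3", "zk4", "zk7"))); // false
    }
}
```

The second case shows the trade-off: unlike a plain majority quorum, the hierarchical rule depends on where the failures fall, not only on how many there are.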
28CONFIDENTIAL
Read-only mode
zk5
zk4
zk6
zk8
zk7
zk9
zk2
zk1
zk3
Network partitions,
a datacenter gets detached
Partitioned zookeepers can operate
in read-only mode
• not connected to the ensemble
• no writes allowed
• read requests are still served
By default read-only mode is disabled
zk2
zk1
zk3
29CONFIDENTIAL
• ACLs
• Quota support
• Authentication support
• Transaction logging
• Connection state handling
• Weighted hierarchical quorums
• Configuration
• Dynamic reconfiguration
• ...
• More info:
• ZooKeeper documentation
http://zookeeper.apache.org/doc/trunk/index.html
• Curator resources
http://curator.apache.org
• ZAB protocol in detail
http://web.stanford.edu/class/cs347/reading/zab.pdf
http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/zab.totally-ordered-broadcast-protocol.2008.pdf
• ZooKeeper book
http://shop.oreilly.com/product/0636920028901.do
Topics not covered
30CONFIDENTIAL
THANK YOU!

GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 

Tech Talks_25.04.15_Session 3_Tibor Sulyan_Distributed coordination with zookeeper

  • 1. Distributed Coordination with ZooKeeper and Curator Tibor Sulyán tibor_sulyan@epam.com April 25, 2015
  • 2. Introduction
      CAP Theorem by Eric Brewer
      • Consistency
      • Availability
      • Partition Tolerance
      [Diagram: write, read]
  • 3. Agenda
      1. What is ZooKeeper?
      2. ZooKeeper features
      3. Coordination Recipes
      4. Using Curator with ZooKeeper
      5. Deploying ZooKeeper clusters
  • 4. What is ZooKeeper about?
      "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."
      [Diagram: clients use the ZK API to talk to an ensemble of zk servers; example zNodes: /root (P), /root/data (P), /root/state (P), /root/state/service-000000001 (ES)]
  • 5. ZooKeeper Data Model
      • Filesystem-like hierarchical structure
          – Elements are called zNodes
      • zNode operations
          – Basic CRUD
          – Transactional execution of multiple operations
          – Watches
          – Versioned changes
      • zNode contents
          – Data
          – Children
          – Metadata (Stat structure)
      • zNode types
          – P: persistent
          – E: ephemeral
          – PS: persistent sequential
          – ES: ephemeral sequential
  • 6. ZooKeeper Data Model
      • Ephemeral zNodes
          – Session-scoped
          – Exist as long as the ephemeral owner's session is active
          – Not persisted
          – No children
      • Sequence (sequential) zNodes
          – Upon creation, the zNode name is suffixed with an integer value
          – The value is unique under the parent zNode
      • Watches
          – Can be set on read operations (getData(), getChildren(), exists())
          – One-time trigger when a zNode changes
      [Diagram: zNode tree with / (P), servers (P), server_A (E), server_B (E), leader (P), server_A0000000001 (ES), server_B0000000002 (ES)]
  • 7. ZooKeeper API – connect, default watcher
      import org.apache.zookeeper.WatchedEvent;
      import org.apache.zookeeper.Watcher;
      import org.apache.zookeeper.ZooKeeper;

      // this class will act as the default watcher
      class ZooKeeperClient implements Watcher {
          ...
          // connect to the ensemble; 'this' is registered as the default watcher
          ZooKeeper zooKeeper = new ZooKeeper("localhost:2181,localhost:2182,localhost:2183", 30_000, this);

          @Override
          public void process(WatchedEvent event) {
              // receives zNode changes & connection state changes
              // can be invoked before the constructor returns!
          }
      }
  • 8. ZooKeeper API – create operations & recoverable errors
      // synchronous node creation
      try {
          // create() returns the actual path of the created zNode
          String path = zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      } catch (KeeperException e) {
          switch (e.code()) {
              case CONNECTIONLOSS:
                  // retry operation
                  break;
          }
      }
      // asynchronous node creation
      zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL,
          new StringCallback() {
              @Override
              public void processResult(int rc, String path, Object ctx, String name) {
                  switch (Code.get(rc)) {
                      // handle errors (retry on CONNECTIONLOSS)
                  }
              }
          }, null /* no context passed to callback */);
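The "retry on CONNECTIONLOSS" comments above can be sketched as a small retry loop. This is a minimal stand-alone sketch, not ZooKeeper client code: `RecoverableException`, `RetryingCaller`, and `callWithRetry` are hypothetical names, with the exception standing in for a `KeeperException` whose code is `CONNECTIONLOSS`.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a recoverable error such as CONNECTIONLOSS.
class RecoverableException extends RuntimeException {
    RecoverableException(String message) { super(message); }
}

final class RetryingCaller {
    // Runs the operation and retries it up to maxRetries times when it fails
    // with a recoverable error. Any other exception propagates immediately.
    static <T> T callWithRetry(Supplier<T> operation, int maxRetries, long sleepMs) {
        int attempt = 0;
        while (true) {
            try {
                return operation.get();
            } catch (RecoverableException e) {
                if (attempt++ >= maxRetries) {
                    throw e; // give up after the configured number of retries
                }
                try {
                    Thread.sleep(sleepMs); // fixed pause between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated operation: fails twice with a recoverable error, then succeeds.
        String path = callWithRetry(() -> {
            if (++calls[0] < 3) throw new RecoverableException("connection loss");
            return "/test";
        }, 5, 10L);
        System.out.println(path + " created after " + calls[0] + " attempts");
    }
}
```

A create that failed with a connection loss may still have been applied on the server, so a blind retry is only safe for operations that tolerate being applied twice.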
  • 9. ZooKeeper API – versioned update operations
      // synchronous update: sets "newdata" for /test regardless of its current version (-1)
      // error handling omitted
      Stat stat = zooKeeper.setData("/test", "newdata".getBytes(), -1);
      // sets "newerdata" only if the data version is 5
      zooKeeper.setData("/test", "newerdata".getBytes(), 5);
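The version argument of `setData` behaves like an optimistic lock. The following in-memory sketch only illustrates that rule; `VersionedNode` is a hypothetical class, and it throws `IllegalStateException` where the real client would report a `KeeperException.BadVersionException`.

```java
final class VersionedNode {
    private byte[] data;
    private int version; // starts at 0 and grows by one on every successful update

    VersionedNode(byte[] initialData) {
        this.data = initialData.clone();
    }

    // Applies the update only if expectedVersion is -1 (match any version)
    // or equals the current version. Returns the new version.
    synchronized int setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            throw new IllegalStateException(
                "bad version: expected " + expectedVersion + ", actual " + version);
        }
        data = newData.clone();
        return ++version;
    }

    synchronized byte[] data() {
        return data.clone();
    }

    synchronized int version() {
        return version;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("data".getBytes());
        node.setData("newdata".getBytes(), -1);                 // unconditional update
        node.setData("newerdata".getBytes(), node.version());   // conditional update
        System.out.println(new String(node.data()) + " @ version " + node.version());
    }
}
```

Reading the version, computing the new value, and writing with that version gives a compare-and-set loop: a writer that loses the race gets the version error and has to re-read.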
  • 10. ZooKeeper API – read operations & setting watches
      // check if the zNode exists without setting a watch (pass true to use the default watcher)
      // error handling omitted
      Stat stat = zooKeeper.exists("/parent/child1", false);
      // get data & set the default watcher; stat is filled in with the zNode's metadata
      stat = new Stat();
      byte[] data = zooKeeper.getData("/parent/child1", true, stat);
      // use a separate Watcher
      stat = zooKeeper.exists("/parent/child2", new Watcher() {
          @Override
          public void process(WatchedEvent event) {
              // react to node deletion
          }
      });
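Watches set this way fire once and must be set again to receive later changes. The sketch below models only that one-time trigger behaviour in memory; `OneShotWatchRegistry` and its methods are hypothetical names, not part of the ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class OneShotWatchRegistry {
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Registers a watcher for a single future change of the given path.
    void watch(String path, Consumer<String> watcher) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    // Notifies and removes every watcher registered for the path.
    // Returns how many watchers were notified.
    int nodeChanged(String path) {
        List<Consumer<String>> fired = watches.remove(path);
        if (fired == null) {
            return 0;
        }
        for (Consumer<String> watcher : fired) {
            watcher.accept(path);
        }
        return fired.size();
    }

    public static void main(String[] args) {
        OneShotWatchRegistry registry = new OneShotWatchRegistry();
        registry.watch("/parent/child2", path -> System.out.println("changed: " + path));
        registry.nodeChanged("/parent/child2"); // the watcher is notified
        registry.nodeChanged("/parent/child2"); // nothing happens: the watch was consumed
    }
}
```

Because a watch is consumed when it fires, a client that re-registers inside its callback can still miss changes that happen between the notification and the new read.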
  • 11. ZooKeeper Guarantees
      • Atomic updates
      • Sequential Consistency
      • Single System Image
      • Timeliness
      • Reliability
      • Availability
      [Diagram: clients 1 to 3 connected to an ensemble of zk 1 to zk 5]
  • 13. Timeliness
      [Sequence diagram over time. Participants: client 1, client 2, leader, followers. client 1 issues setData (v2); phases: propagate, propose, commit, notify]
  • 14. Availability
      • ZooKeeper process failures are tolerated if a quorum is present
      • Simplest quorum: majority-based
      • Avoids split-brain scenarios
      [Diagram: behaviour on follower failures; clients 1 to 3, zk 1 to zk 5]
  • 15. Availability
      • ZooKeeper process failures are tolerated if a quorum is present
      • Simplest quorum: majority-based
      • Avoids split-brain scenarios
      [Diagram: behaviour on leader failure; clients 1 to 3, zk 1 to zk 5]
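The majority rule on the Availability slides comes down to simple arithmetic. A minimal sketch, where `MajorityQuorum` and its method names are hypothetical:

```java
final class MajorityQuorum {
    // Smallest number of servers that forms a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        if (ensembleSize <= 0) {
            throw new IllegalArgumentException("ensemble size must be positive");
        }
        return ensembleSize / 2 + 1;
    }

    // Number of servers that may fail while a majority is still available.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    // True if the servers that are still alive form a majority.
    static boolean hasQuorum(int ensembleSize, int aliveServers) {
        return aliveServers >= quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 9; n++) {
            System.out.println(n + " servers: quorum " + quorumSize(n)
                + ", tolerated failures " + toleratedFailures(n));
        }
    }
}
```

Two disjoint partitions can never both hold a strict majority, which is why this rule avoids split-brain. It also shows why even ensemble sizes add little: 4 servers tolerate one failure, the same as 3.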
  • 17. Distributed Coordination Recipes
      • Shared Data
      • Group Membership: / (P), serviceInstances (P), serverA (E), serverB (E)
      • Service Discovery: / (P), service (P), serviceInfo (E)
      • Lock
      • Mutex
      • Leader Election: / (P), service (P), service_0000000001 (ES), service_0000000002 (ES)
  • 18. Leader Election Recipe
      • zNodes: / (P), service (P), service_0000000001 (ES), service_0000000002 (ES), service_0000000003 (ES)
      • Every non-leader service instance watches service_0000000001, so n-1 watches are set on the same node
      • Improvement: watch the immediately preceding sequence node instead of the first one
      [Diagram: service instances connected to zk 1 to zk 5]
  • 19. Improved Leader Election Recipe
      • zNodes: / (P), service (P), service_0000000001 (ES), service_0000000002 (ES), service_0000000003 (ES)
      [Diagram: service instances connected to zk 1 to zk 5; watches are set on service_0000000001 and service_0000000002]
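The improved recipe can be sketched as pure list logic: sort the sequential children, treat the lowest one as leader, and have every other instance watch the node just before its own. `LeaderElection` and its methods are hypothetical names. The sketch assumes the 10-digit zero-padded suffix that ZooKeeper appends to sequential zNodes and leaves out the calls that fetch children and set watches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class LeaderElection {
    // Parses the 10-digit sequence suffix of a sequential zNode name.
    static int sequenceOf(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    // Returns a copy of the children ordered by sequence number.
    static List<String> sortedBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(LeaderElection::sequenceOf));
        return sorted;
    }

    // The instance owning the lowest sequence number is the leader.
    static String leader(List<String> children) {
        return sortedBySequence(children).get(0);
    }

    // The node to watch is the one immediately before ownNode.
    // The leader has no predecessor, so the result is empty for it.
    static Optional<String> nodeToWatch(List<String> children, String ownNode) {
        List<String> sorted = sortedBySequence(children);
        int index = sorted.indexOf(ownNode);
        if (index < 0) {
            throw new IllegalArgumentException("own node is not among the children: " + ownNode);
        }
        return index == 0 ? Optional.empty() : Optional.of(sorted.get(index - 1));
    }

    public static void main(String[] args) {
        List<String> children = List.of(
            "service_0000000003", "service_0000000001", "service_0000000002");
        System.out.println("leader: " + leader(children));
        System.out.println("service_0000000003 watches: "
            + nodeToWatch(children, "service_0000000003").orElse("nothing"));
    }
}
```

With this scheme each node has at most one watcher, so the failure of one instance wakes up exactly one other instance rather than all of them.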
  • 20. Curator and ZooKeeper
      • Higher-level client API for ZooKeeper
      • Hides most of the complexity of communicating with the ZK ensemble
      • Implemented recipes
      [Diagram: each client uses Curator on top of the ZK API to talk to the zk ensemble]
  • 21. Curator API
      // create & start framework instance
      CuratorFramework framework = CuratorFrameworkFactory.newClient(
          "localhost:2181,localhost:2182,localhost:2183",
          new ExponentialBackoffRetry(1000, 20));
      framework.start();
      // foreground operation
      Stat stat = framework.setData().forPath("/a/b/c/d", "testdata".getBytes());
      // background operation
      framework.setData().inBackground().forPath("/a/b/c/d/e", "testdata".getBytes());
  • 22. Curator Features: protected EPHEMERAL_SEQUENTIAL nodes
      framework.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath("/cluster/service");
      • Connection loss: reconnect attempt begins
      • Reconnect successful within session timeout: path creation is retried
      • Without protection the retry can leave two zNodes behind: service_0000000002 (ES) and service_0000000003 (ES)
      [Diagram: clients 1 to 3, zk 1 to zk 5; zNodes / (P), cluster (P)]
  • 23. Curator Features: protected EPHEMERAL_SEQUENTIAL nodes
      framework.create().withProtection().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath("/cluster/service");
      • Connection loss: reconnect attempt begins
      • Reconnect successful within session timeout: checking for a zNode with the same GUID
      • No extra zNode is created: _c_16c39a25-87b4-4a54-bd05-1666a3e718de_service_0000000002 (ES)
      [Diagram: clients 1 to 3, zk 1 to zk 5; zNodes / (P), cluster (P)]
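The GUID check on this slide can be illustrated with plain string matching over the parent's children. This is a sketch of the idea only, not Curator's internal code: `ProtectedNodeLookup` and its methods are hypothetical, and the `_c_<GUID>_` prefix format is taken from the node name shown on the slide.

```java
import java.util.List;
import java.util.Optional;

final class ProtectedNodeLookup {
    // Builds the name prefix that marks a node as created with the given protection id.
    static String protectedPrefix(String protectionId) {
        return "_c_" + protectionId + "_";
    }

    // After a reconnect, looks for a child already carrying our protection id.
    // If one is found, the earlier create succeeded and must not be repeated.
    static Optional<String> findExisting(List<String> children, String protectionId) {
        String prefix = protectedPrefix(protectionId);
        return children.stream()
            .filter(child -> child.startsWith(prefix))
            .findFirst();
    }

    public static void main(String[] args) {
        String id = "16c39a25-87b4-4a54-bd05-1666a3e718de";
        List<String> children = List.of("_c_" + id + "_service_0000000002");
        System.out.println(findExisting(children, id)
            .map(name -> "already created: " + name)
            .orElse("not found, safe to retry the create"));
    }
}
```

The sequence suffix is assigned by the server, so the client cannot know the final node name in advance. A client-chosen GUID in the name is what lets it recognise its own node after a connection loss.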
  • 24. ZooKeeper in the real world
      • Performance considerations
      • Using Observers to scale
      • Using Hierarchical Quorums for multi-datacenter setups
      • Surviving network partitions with read-only mode
  • 25. Performance considerations
      • Replicated data is kept entirely in memory by ZooKeeper processes
          – A full GC can drop a server out of the ensemble
      • Synchronous filesystem writes in the commit phase
          – Can take seconds on an overloaded storage device
          – Use a dedicated device for ZooKeeper transaction logs
      • Maximum zNode size is 1 MB by default
          – Data + metadata should fit within it
          – Configurable using a system property, but increasing it is not recommended
      • Watches and performance
          – Too many watches on a single node: herd effect
          – Too many watches overall: increased memory footprint
  • 26. Using Observers to scale
      Observers:
      • no proposals
      • no votes
      • can't be leaders
      [Sequence diagram over time. Participants: client, leader, followers, observer. Phases: propagate, propose, vote, commit, notify. Events: setData, sync return, callback called, watch triggered]
  • 27. Hierarchical Quorums
      Majority quorums:
      • any 4 zk failures are tolerated
      A datacenter goes down:
      • remaining ensemble becomes much less resilient
      Hierarchical quorums:
      • Disjoint groups are formed
      • Quorum requires a majority of votes from a majority of groups
      • Up to 5 failures can be tolerated, as long as two groups each keep a majority
      • Better for clusters spanning multiple datacenters
      [Diagram: nine servers zk1 to zk9 in three groups: group 1, group 2, group 3]
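The rule "majority of votes from a majority of groups" can be checked with a few lines of set logic. A minimal sketch, assuming every server has a weight of 1. `HierarchicalQuorum` is a hypothetical name, and the assignment of zk1 to zk9 to the three groups in the example is an assumption.

```java
import java.util.List;
import java.util.Set;

final class HierarchicalQuorum {
    // A quorum exists when a strict majority of groups each have
    // a strict majority of their own members alive.
    static boolean hasQuorum(List<Set<String>> groups, Set<String> alive) {
        int groupsWithMajority = 0;
        for (Set<String> group : groups) {
            long aliveInGroup = group.stream().filter(alive::contains).count();
            if (aliveInGroup > group.size() / 2) {
                groupsWithMajority++;
            }
        }
        return groupsWithMajority > groups.size() / 2;
    }

    public static void main(String[] args) {
        List<Set<String>> groups = List.of(
            Set.of("zk1", "zk2", "zk3"),
            Set.of("zk4", "zk5", "zk6"),
            Set.of("zk7", "zk8", "zk9"));

        // Five failures: group 3 is down entirely, plus one server in each other group.
        Set<String> alive = Set.of("zk1", "zk2", "zk4", "zk5");
        System.out.println("quorum with 4 of 9 alive: " + hasQuorum(groups, alive));
    }
}
```

The same four survivors would not form a plain majority of nine, which is the gain the slide describes. Five failures spread so that only one group keeps its majority still break the quorum.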
  • 28. Read-only mode
      Network partitions, a datacenter gets detached
      Partitioned ZooKeeper servers can operate in read-only mode:
      • not connected to the ensemble
      • no writes allowed
      • read requests are still served
      By default, read-only mode is disabled
      [Diagram: zk1 to zk9; zk1, zk2, zk3 detached from the rest]
  • 29. Topics not covered
      • ACLs
      • Quota support
      • Authentication support
      • Transaction logging
      • Connection state handling
      • Weighted hierarchical quorums
      • Configuration
      • Dynamic reconfiguration
      • ...
      More info:
      • ZooKeeper documentation: http://zookeeper.apache.org/doc/trunk/index.html
      • Curator resources: http://curator.apache.org
      • ZAB protocol in detail: http://web.stanford.edu/class/cs347/reading/zab.pdf and http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/zab.totally-ordered-broadcast-protocol.2008.pdf
      • ZooKeeper book: http://shop.oreilly.com/product/0636920028901.do