SlideShare a Scribd company logo
1 of 23
Building Distributed
Applications with Apache
Zookeeper
Alex Ehrnschwender | Game Server Engineer at DeNA
What is Zookeeper?
“ZooKeeper is a centralized service for
maintaining configuration information,
naming, providing distributed
synchronization, and providing group
services.”
Zookeeper Wiki
ZooKeeper: A Coordination Service for Distributed Applications
Coordination & synchronization for
distributed processes
Logical namespacing implemented by a
hierarchy (tree) of znodes
Replicated in-memory over multiple hosts
for reliability, availability, and performance
Simple API of CRUD & basic tree operations
for client integration
Zookeeper: Reliability & Consistency
Distributed ensemble with automatic leader
election through quorum
Replicated in-memory on every instance with
snapshot writes to disk
Client TCP connection maintained to any
node with failover support
Guaranteed atomicity & sequential
consistency
Zookeeper: Watches & Ephemeral nodes
Underlying znodes have a data structure consisting of version numbers (cversion, aversion) &
timestamps
Watches
● Client-initiated subscriptions to znodes
● Changes to a watched znode trigger notification to subscribed clients
Ephemeral Nodes
● Backed by a client session and deleted when client session ends
● Cannot have children
Zookeeper: But… why?
“Because of the difficulty of implementing
these kinds of services, applications initially
usually skimp on them, which make them
brittle in the presence of change and
difficult to manage. Even when done
correctly, different implementations of
these services lead to management
complexity when the applications are
deployed.”
Zookeeper Wiki
Zookeeper: Advantages for Backing a Server Cluster
Server workers can become cluster-aware
So much out-of-the-box that would be duplicated with a custom solution
Extremely fast reads (10:1 performance against writes)
Small footprint - An ensemble of only 5-7 zk instances can serve the
coordination needs of several large production applications
Centralized event broadcasting & failure detection (heartbeat)
Zookeeper: Common Use Cases
● Configuration Management
● Service Discovery
● Distributed Cloud-Based File Systems
● Internal DNS Management
● Master (Leader) Election and Voting
● Messaging Queue
● Event Broadcasting & Notification
Use Case Example #1 - Managing Redis Shards
ZK Use Case Example #1 - Pinterest
Pinterest stores their entire follower model inside sharded Redis instances (
~9000 Redis shards, multiple instances per core)
Shard configuration is stored and managed by Zookeeper
Client lookups and watches for shard location & subsequent data retrieval
Master-slave failover triggers updates to znode representation (slave address replaces master)
Vertical splitting of data broadcasted to watching clients
Use Case Example #2 - HBase Cluster Configuration
Code Examples
public void join(String groupName, String memberName)
throws KeeperException, InterruptedException {
String path = "/" + groupName + "/" + memberName;
String createdPath = zk.create(path,
null /* data */,
ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
System.out.println("Created " + createdPath);
}
public void create(String groupName)
throws KeeperException, InterruptedException {
String path = "/" + groupName;
String createdPath = zk.create(path,
null /* data */,
ZooDefs.Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
System.out.println("Created " + createdPath);
}
Code Examples (cont.)
public void delete(String groupName)
throws KeeperException, InterruptedException {
String path = "/" + groupName;
try {
List<String> children = zk.getChildren(path, false);
for(String child : children) {
zk.delete(path + "/" + child, -1); /* child */
}
zk.delete(path, -1); /* parent */
} catch (KeeperException.NoNodeException e) {
System.out.printf("Group %s does not existn", groupName);
}
}
public void list(String groupName)
throws KeeperException, InterruptedException {
String path = "/" + groupName;
try {
List<String> children = zk.getChildren(path, false);
for(String child : children) {
System.out.println(child);
}
} catch (KeeperException.NoNodeException e) {
System.out.printf("Group %s does not existn",
groupName);
}
}
Performance
Standalone ops/sec 3-Node Ensemble (ops/sec)
Reference:
https://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview
Sample Configuration (zoo.cfg)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
Exhibitor: A ZK Monitoring & Administration Tool from Netflix
Centralization & externalization of zk ensemble configuration* (S3/remote FS)
Web UI & REST API for ease of management
Instance monitoring with automatic configuration updates
Rolling ensemble changes while maintaining quorum
Miscellaneous administration tasks (backup/restore, log & snapshot cleanup)
* Configuration management for a configuration manager.... so meta!
Questions?
Appendix
Zookeeper Atomic Broadcast (ZAB) Algorithm
● Protocol for managing atomic updates to replicas
● Responsible for:
o Agreeing on an ensemble leader
o Synchronizing replicas
o Managing transactions and broadcasts
o Recovery of state
● ZXIDs & transactional ordering
● Guarantees:
o Local & global primary order
o Primary integrity
Performance
Performance
Standalone ops/sec 3-Node Ensemble (ops/sec)
Reference:
https://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview
Sample Configuration (zoo.cfg)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
References
● http://engineering.pinterest.com/post/55272557617/building-a-follower-model-from-scratch
● http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
● http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html
● https://github.com/Netflix/exhibitor/wiki
● http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf
● http://web.stanford.edu/class/cs347/reading/zab.pdf
● http://highscalability.com/blog/2008/7/15/zookeeper-a-reliable-scalable-distributed-coordination-
syste.html
● https://wiki.apache.org/solr/SolrCloud
● http://www.slideshare.net/scottleber/apache-zookeeper

More Related Content

What's hot

Apache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesdayApache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesday
Andrei Savu
 
zookeeperProgrammers
zookeeperProgrammerszookeeperProgrammers
zookeeperProgrammers
Hiroshi Ono
 
How to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr clusterHow to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr cluster
lucenerevolution
 

What's hot (20)

Apache Zookeeper
Apache ZookeeperApache Zookeeper
Apache Zookeeper
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Zookeeper In Action
Zookeeper In ActionZookeeper In Action
Zookeeper In Action
 
Apache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesdayApache ZooKeeper TechTuesday
Apache ZooKeeper TechTuesday
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Curator intro
Curator introCurator intro
Curator intro
 
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Apache ZooKeeper | Big Data Hadoop Spark Tutorial | CloudxLab
 
Zookeeper Introduce
Zookeeper IntroduceZookeeper Introduce
Zookeeper Introduce
 
A Python Petting Zoo
A Python Petting ZooA Python Petting Zoo
A Python Petting Zoo
 
Call me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networksCall me maybe: Jepsen and flaky networks
Call me maybe: Jepsen and flaky networks
 
zookeeperProgrammers
zookeeperProgrammerszookeeperProgrammers
zookeeperProgrammers
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
 
Distributed Coordination with Python
Distributed Coordination with PythonDistributed Coordination with Python
Distributed Coordination with Python
 
Python and cassandra
Python and cassandraPython and cassandra
Python and cassandra
 
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on DockerRunning High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
How to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr clusterHow to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr cluster
 
Introduction to .Net Driver
Introduction to .Net DriverIntroduction to .Net Driver
Introduction to .Net Driver
 
How to Run Solr on Docker and Why
How to Run Solr on Docker and WhyHow to Run Solr on Docker and Why
How to Run Solr on Docker and Why
 

Similar to Distributed Applications with Apache Zookeeper

Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed Gaps
GilHecht
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
grim_radical
 
Php Site Optimization
Php Site OptimizationPhp Site Optimization
Php Site Optimization
Amit Kejriwal
 

Similar to Distributed Applications with Apache Zookeeper (20)

Apache ZooKeeper
Apache ZooKeeperApache ZooKeeper
Apache ZooKeeper
 
Continuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed GapsContinuity Software 4.3 Detailed Gaps
Continuity Software 4.3 Detailed Gaps
 
PuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into OperationsPuppetDB: Sneaking Clojure into Operations
PuppetDB: Sneaking Clojure into Operations
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015
 
MYSQL
MYSQLMYSQL
MYSQL
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Cloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the CloudCloud Meetup - Automation in the Cloud
Cloud Meetup - Automation in the Cloud
 
Building and Scaling Node.js Applications
Building and Scaling Node.js ApplicationsBuilding and Scaling Node.js Applications
Building and Scaling Node.js Applications
 
Php Site Optimization
Php Site OptimizationPhp Site Optimization
Php Site Optimization
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL Environment
 
Copper: A high performance workflow engine
Copper: A high performance workflow engineCopper: A high performance workflow engine
Copper: A high performance workflow engine
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Clustering van IT-componenten
Clustering van IT-componentenClustering van IT-componenten
Clustering van IT-componenten
 
Data Grids with Oracle Coherence
Data Grids with Oracle CoherenceData Grids with Oracle Coherence
Data Grids with Oracle Coherence
 
Dataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice WayDataservices: Processing (Big) Data the Microservice Way
Dataservices: Processing (Big) Data the Microservice Way
 
An Engineer's Intro to Oracle Coherence
An Engineer's Intro to Oracle CoherenceAn Engineer's Intro to Oracle Coherence
An Engineer's Intro to Oracle Coherence
 
Microservices observability
Microservices observabilityMicroservices observability
Microservices observability
 
Kerberizing spark. Spark Summit east
Kerberizing spark. Spark Summit eastKerberizing spark. Spark Summit east
Kerberizing spark. Spark Summit east
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 

Recently uploaded

The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Recently uploaded (20)

WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 

Distributed Applications with Apache Zookeeper

  • 1. Building Distributed Applications with Apache Zookeeper Alex Ehrnschwender | Game Server Engineer at DeNA
  • 2. What is Zookeeper? “ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” Zookeeper Wiki
  • 3. ZooKeeper: A Coordination Service for Distributed Applications Coordination & synchronization for distributed processes Logical namespacing implemented by a hierarchy (tree) of znodes Replicated in-memory over multiple hosts for reliability, availability, and performance Simple API of CRUD & basic tree operations for client integration
  • 4. Zookeeper: Reliability & Consistency Distributed ensemble with automatic leader election through quorum Replicated in-memory on every instance with snapshot writes to disk Client TCP connection maintained to any node with failover support Guaranteed atomicity & sequential consistency
  • 5. Zookeeper: Watches & Ephemeral nodes Underlying znodes have a data structure consisting of version numbers (cversion, aversion) & timestamps Watches ● Client-initiated subscriptions to znodes ● Changes to a watched znode trigger notification to subscribed clients Ephemeral Nodes ● Backed by a client session and deleted when client session ends ● Cannot have children
  • 6. Zookeeper: But… why? “Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.” Zookeeper Wiki
  • 7. Zookeeper: Advantages for Backing a Server Cluster Server workers can become cluster-aware So much out-of-the-box that would be duplicated with a custom solution Extremely fast reads (10:1 performance against writes) Small footprint - An ensemble of only 5-7 zk instances can serve the coordination needs of several large production applications Centralized event broadcasting & failure detection (heartbeat)
  • 8. Zookeeper: Common Use Cases ● Configuration Management ● Service Discovery ● Distributed Cloud-Based File Systems ● Internal DNS Management ● Master (Leader) Election and Voting ● Messaging Queue ● Event Broadcasting & Notification
  • 9. Use Case Example #1 - Managing Redis Shards
  • 10. ZK Use Case Example #1 - Pinterest Pinterest stores their entire follower model inside sharded Redis instances ( ~9000 Redis shards, multiple instances per core) Shard configuration is stored and managed by Zookeeper Client lookups and watches for shard location & subsequent data retrieval Master-slave failover triggers updates to znode representation (slave address replaces master) Vertical splitting of data broadcasted to watching clients
  • 11. Use Case Example #2 - HBase Cluster Configuration
  • 12. Code Examples public void join(String groupName, String memberName) throws KeeperException, InterruptedException { String path = "/" + groupName + "/" + memberName; String createdPath = zk.create(path, null /* data */, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL); System.out.println("Created " + createdPath); } public void create(String groupName) throws KeeperException, InterruptedException { String path = "/" + groupName; String createdPath = zk.create(path, null /* data */, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); System.out.println("Created " + createdPath); }
  • 13. Code Examples (cont.) public void delete(String groupName) throws KeeperException, InterruptedException { String path = "/" + groupName; try { List<String> children = zk.getChildren(path, false); for(String child : children) { zk.delete(path + "/" + child, -1); /* child */ } zk.delete(path, -1); /* parent */ } catch (KeeperException.NoNodeException e) { System.out.printf("Group %s does not existn", groupName); } } public void list(String groupName) throws KeeperException, InterruptedException { String path = "/" + groupName; try { List<String> children = zk.getChildren(path, false); for(String child : children) { System.out.println(child); } } catch (KeeperException.NoNodeException e) { System.out.printf("Group %s does not existn", groupName); } }
  • 14. Performance Standalone ops/sec 3-Node Ensemble (ops/sec) Reference: https://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview
  • 16. Exhibitor: A ZK Monitoring & Administration Tool from Netflix Centralization & externalization of zk ensemble configuration* (S3/remote FS) Web UI & REST API for ease of management Instance monitoring with automatic configuration updates Rolling ensemble changes while maintaining quorum Miscellaneous administration tasks (backup/restore, log & snapshot cleanup) * Configuration management for a configuration manager.... so meta!
  • 19. Zookeeper Atomic Broadcast (ZAB) Algorithm ● Protocol for managing atomic updates to replicas ● Responsible for: o Agreeing on an ensemble leader o Synchronizing replicas o Managing transactions and broadcasts o Recovery of state ● ZXIDs & transactional ordering ● Guarantees: o Local & global primary order o Primary integrity
  • 21. Performance Standalone ops/sec 3-Node Ensemble (ops/sec) Reference: https://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview
  • 23. References ● http://engineering.pinterest.com/post/55272557617/building-a-follower-model-from-scratch ● http://zookeeper.apache.org/doc/trunk/zookeeperOver.html ● http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html ● https://github.com/Netflix/exhibitor/wiki ● http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf ● http://web.stanford.edu/class/cs347/reading/zab.pdf ● http://highscalability.com/blog/2008/7/15/zookeeper-a-reliable-scalable-distributed-coordination- syste.html ● https://wiki.apache.org/solr/SolrCloud ● http://www.slideshare.net/scottleber/apache-zookeeper

Editor's Notes

  1. Wears a lot of hats. Can serve multiple purposes, all related to coordination of a distributed system. More concisely, Zookeeper is a coordination service for distributed applications. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Wikipedia
  2. Digging a bit deeper: Simply put, it backs the nodes of your distributed system through a tree structure of znodes (more in upcoming slides) Central nervous system for your distributed application Centralized and replicated across an odd number of hosts to make up an ensemble Data is kept in memory and is backed up to a log for reliability. By using memory ZooKeeper is very fast and can handle the high loads typically seen in chatty coordination protocols across huge numbers of processes. Prefers read-heavy based applications Clients connect to any zk node in the ensemble and maintain that connection API of create, read, update, delete (but also watches, more in later slides) Replicated hosts make up an “ensemble”
  3. Writes are bubbled up to one elected leader (which is why there should be an odd # of instances). A quorum must confirm the update. Updates are submitted concurrently by clients and committed by FIFO order. After update, incremental state changes are broadcast to replicas using Zookeeper Atomic Broadcast (ZAB). Each state change is incremental with respect to the previous state, so there is an implicit dependence on the order of the state changes. This guarantees: Sequential Consistency - Updates from a client will be applied in the order that they were sent. Atomicity - Updates either succeed or fail. No partial results.
  4. Znodes maintain a stat structure that includes version numbers for data changes, acl changes. The stat structure also has timestamps. The version number, together with the timestamp allow ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. Watches Clients can set watches on znodes. Changes to that znode trigger the watch and then clear the watch. When a watch triggers, ZooKeeper sends the client a notification. Ephemeral Nodes ZooKeeper also has the notion of ephemeral nodes. These znodes exists as long as the session that created the znode is active. When the session ends the znode is deleted. Because of this behavior ephemeral znodes are not allowed to have children. Time in ZooKeeper ZooKeeper tracks time multiple ways: Zxid Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2. Version numbers Every change to a a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode). Ticks When using multi-server ZooKeeper, servers use ticks to define timing of events such as status uploads, session timeouts, connection timeouts between peers, etc. The tick time is only indirectly exposed through the minimum session timeout (2 times the tick time); if a client requests a session timeout less than the minimum session timeout, the server will tell the client that the session timeout is actually the minimum session timeout. Real time ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification. A critical component of ZooKeeper is Zab, the ZooKeeper Atomic Broadcast algorithm, which is the protocol that manages atomic updates to the replicas. It is responsible for agreeing on a leader in the ensemble, synchronizing the replicas, managing update transactions to be broadcast, as well as recovering from a crashed state to a valid state.
  5. Why not simply use a database? Because of the guarantees
  6. Fast especially in read-heavy workload (10:1 performance against writes) More performance data in backup slides
  7. Service Discovery - watch thru heartbeat Let ELECTION be a path of choice of the application. To volunteer to be a leader: Create znode z with path "ELECTION/guid-n_" with both SEQUENCE and EPHEMERAL flags; Let C be the children of "ELECTION", and i be the sequence number of z; Watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C; Upon receiving a notification of znode deletion: Let C be the new set of children of ELECTION; If z is the smallest node in C, then execute leader procedure; Otherwise, watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;
  8. http://engineering.pinterest.com/post/55272557617/building-a-follower-model-from-scratch When a Redis machine reaches either the memory or CPU thresholds, we split it either horizontally orvertically. Vertical sharding a Redis machine is simply cutting the number of running Redis instances on the machine by half. We bring up a new master as a slave of the existing master and once the slaving is complete, we make it the new master for half of the Redis instances leaving the old master as the master for the other half. The entire user id space is split into 8192 virtual shards. We place one virtual shard per Redis DB, and run multiple Redis instances (ranging from 8 to 32) on each machine depending on the memory and CPU consumption of the shards on those instances. Similarly, we run multiple Redis DBs per Redis instance.
  9. Once a part of Hadoop Currently maintained by Yahoo! and the Apache Software Foundation HBase is the Hadoop database, a distributed, scalable, big data store. HBase supports hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Currently, hbase clients find the cluster to connect to by asking zookeeper. The only configuration a client needs is the zk quorum to connect to. Masters and hbase slave nodes (regionservers) all register themselves with zk. If their znode evaporates, the master or regionserver is consided lost and repair begins. HBase currently will default to manage the zookeeper cluster. It does this in an attempt at not burdening users with yet another technology to figure; things are bad enough for the hbase noob what with hbase, hdfs, and mapreduce. Part of hbase's management of zk includes being able to see zk configuration in the hbase configuration files. Anything that has the hbase.zookeeper prefix will have its suffix mapped to the corresponding zoo.cfg setting (HBase parses its config. and feeds the relevant zk configurations to zk on start).
  10. Using Java as it’s very readable and shows type information. The same can easily be achieved with Node.js, Perl, or your language of choice. Create - First we create a persistent node to act as a group (parent) for child nodes. Join - Second we create and join ephemeral child nodes to that parent node. Each example creates with “unsafe acl”, so anyone who interacts with ZK can perform an update to that node. An alternative is CREATOR_ALL_ACL where only the creator can control the node.
  11. List Think of use cases (active sessions, active system workers, etc) Delete -1 deletes unconditionally. Can also put a version in here to delete, so it will not delete a node that does not match the version specified. Will broadcast to any clients watching that node. Because of the simplicity of operations on paths, each must catch errors if that node does not exist.
  12. http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf http://web.stanford.edu/class/cs347/reading/zab.pdf