Improving Write Throughput Scalability of ZooKeeper by
Partitioning Namespace
CSE 223B Term Project, Spring 2014
Pramod Biligiri
UC San Diego
psubbara@ucsd.edu
Anita Kar
UC San Diego
ankar@ucsd.edu
Anjali Kanak
UC San Diego
akanak@ucsd.edu
_____________________________________________________________________________________
Abstract: We demonstrate that the write throughput of a ZooKeeper cluster can be increased by
partitioning its data. By enabling a single ZooKeeper cluster to have multiple quorums, each with its
own leader, client requests that access different partitions of the namespace can be processed in
parallel, relaxing the ordering guarantees across partitions. Experimental evaluation of our
implementation shows an approximately 14% increase in throughput. The only extra resource required
for each node in the partitioned ZooKeeper is one or more additional hard disks to cope with the
increased throughput.
1. Introduction
ZooKeeper provides a distributed configuration service, synchronization service, and naming registry
for large distributed systems [1]. ZooKeeper stores its data in a hierarchical namespace of nodes (called
znodes), much like a file system or a trie data structure. Clients can read from and write to these nodes,
and in this way share a configuration service.
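To make the client model concrete, the short Java sketch below (illustrative only; the connection string, paths and values are not from our project) uses the standard ZooKeeper client API to create a znode and read it back.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZnodeExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; connection string and timeout are illustrative.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

            // Create a parent znode and a child holding a small configuration value
            // (assumes a fresh namespace; parents are not created automatically).
            zk.create("/app1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app1/config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Read it back; any connected client sees the same data.
            Stat stat = new Stat();
            byte[] data = zk.getData("/app1/config", false, stat);
            System.out.println(new String(data) + " (version " + stat.getVersion() + ")");
            zk.close();
        }
    }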
Partitioned ZooKeeper
ZooKeeper follows a primary-backup replication scheme. Currently, the leader node in a ZooKeeper
ensemble processes all incoming client requests sequentially, thus ensuring strictly ordered access to
the znodes.
We created a partitioned version of ZooKeeper with the goal of improving write throughput scalability.
This was first suggested on the ZooKeeper wiki [2] and discussed further in [3]. In our version, the user
can specify a set of paths that can be updated independently of each other. This relaxes ordering across
the root znodes, where each root znode corresponds to one such path; for any particular path, however,
requests are still ordered. Hence, if independent applications use the same ZooKeeper ensemble,
partitioning should lead to increased throughput by removing ordering constraints across independent
operations and thereby allowing processing and disk writes to proceed in parallel.
One existing suggestion is to implement partitioning by creating completely separate clusters from the
available nodes. Our approach differs from this in that we use all of the nodes while letting independent
operations proceed in parallel. To enable this, we create multiple quorums over the same set of available
nodes. All nodes are members of all the quorums. Each quorum can be viewed as a logical abstraction
that has its own leader and is responsible for handling requests corresponding to a particular path. As
every node is part of multiple quorums, it is still involved in processing all of the operations and its
resources are fully utilized. This makes our approach more economical than the aforementioned one, in
which resources such as memory and CPU may remain underutilized. Moreover, our implementation
requires fewer nodes, because each quorum comprises all of the nodes, in contrast to the suggested
approach where the nodes must be distributed across multiple clusters.
2. Motivation
The ZooKeeper service is used by many companies to provide coordination, synchronization and a
hierarchical namespace of data registers for various distributed applications. Typically, one ZooKeeper
cluster is used by multiple applications. These applications may be unrelated, but all requests are still
written sequentially to the transaction log on disk, since they are ordered and proposed by a single
ZooKeeper leader node. This raises the possibility of the disk at each node becoming the bottleneck for
cluster throughput. The ZooKeeper administrator’s guide [4] recognizes this issue, noting that the write
to the transaction log is the most performance critical part of ZooKeeper. Hence, for separate
applications, the data can be handled and written separately.
Though one option is to have separate clusters for individual applications, it may end up being
expensive and lead to underutilization of resources. Hence, an ideal scenario would be to use the same
ZooKeeper cluster for multiple applications while removing the ordering guarantees between unrelated
requests coming from these separate applications. Then the independent sets of requests can be written
to different transaction logs which, as discussed earlier, can improve the performance. This is the issue
we tried to address through our implementation of partitioned ZooKeeper.
3. Design
Each ZooKeeper node is part of a quorum of nodes, where it participates in voting on proposals and
leader election. A ZooKeeper node can also service client requests. A client request is processed in
multiple stages. Each stage reads events from an incoming queue, performs its actions and passes the
output to the incoming queue of the next stage. The use of stages and queues is designed to avoid
blocking of incoming requests due to network and disk operations.
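The sketch below illustrates this staged, queue-based pipeline in simplified form; the class and method names are ours and do not correspond to ZooKeeper's actual request processors. Each stage drains its own queue on a dedicated thread and hands the result to the next stage, so a slow disk or network operation in one stage does not block request intake.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Simplified, hypothetical illustration of a staged request pipeline.
    abstract class Stage implements Runnable {
        private final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        private Stage next;

        void setNext(Stage next) { this.next = next; }

        void submit(String request) { inbox.add(request); }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String req = inbox.take();          // block until work arrives
                    String out = process(req);          // stage-specific action (validate, log, ack, ...)
                    if (next != null) next.submit(out); // pass output to the next stage's queue
                } catch (InterruptedException e) {
                    return;
                }
            }
        }

        abstract String process(String request);
    }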
In ZooKeeper’s current design, a cluster can have only one leader and an associated quorum. We
modified this design to enable multiple leaders to exist within a cluster, each with their own quorum.
Each node can be part of multiple quorums. A node can be a leader in one quorum and a follower in
others, or a follower in all the quorums that it is part of. Having a node be the leader for multiple
quorums would have no benefit for the performance gains we are trying to obtain, so we did not
experiment with that option.
The namespace of nodes is partitioned among the leaders, with routing logic in each follower to
forward requests to the appropriate leader. Figure 1 below shows the current and modified designs of
ZooKeeper.
Figure 1: ZooKeeper cluster with multiple quorums
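To make the routing concrete, here is a minimal sketch of the follower-side routing idea; the class and method names are ours for illustration and do not correspond to the actual patch. The first path component of a request is mapped to the quorum (and hence the leader) that owns that partition.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical router that picks a quorum based on the top-level path of a request.
    class PartitionRouter {
        private final Map<String, Integer> pathToQuorum = new HashMap<>();
        private final int defaultQuorum;

        PartitionRouter(int defaultQuorum) { this.defaultQuorum = defaultQuorum; }

        void assign(String rootPath, int quorumId) { pathToQuorum.put(rootPath, quorumId); }

        // "/app1/config/x" maps to the quorum owning "/app1";
        // unknown paths fall back to a default quorum.
        int quorumFor(String path) {
            int secondSlash = path.indexOf('/', 1);
            String root = (secondSlash == -1) ? path : path.substring(0, secondSlash);
            return pathToQuorum.getOrDefault(root, defaultQuorum);
        }
    }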
Whenever a node receives a proposal from the leader, it appends the request to a transaction log on
disk and sends back an acknowledgement. This transaction log is used for recovery in case of leader
failure. The write to disk constitutes a performance critical path of the request processing flow.
Therefore, the ZooKeeper administrator's guide [4] recommends keeping the transaction log on a
separate device from any other reads and writes on the system. Since each node can now be part of
multiple quorums, the transaction log for each quorum needs to be on its own device. We have
modified ZooKeeper to support this. Figure 2 shows this design.
Figure 2: Node operation as a part of multiple quorums
4. Implementation
Our initial approach [5] focused on partitioning incoming requests and handling them in parallel. For
each distinct top-level path that comes in (/app1, /app2 etc.), we create a new chain of request
processing stages, and all subsequent requests on that path are routed to that chain. Though this gave us
a performance speedup in the request processing stage, the strict ordering guarantees of ZooKeeper’s
atomic broadcast protocol within a quorum meant that the requests were getting serialized before being
forwarded to the leader. This led us to the current design of having multiple quorums, in order to relax
the ordering guarantees across different request paths. We were able to reuse a significant amount of
the routing logic from the initial implementation.
Figure 3 shows the class diagram for the multiple quorum approach [6]. A separate thread is created
corresponding to each quorum to handle the request flow and the inter-node communication. Each
QuorumPeerThread can be allocated one or more partitions of the request namespace by the routing
logic, which determines the quorum by examining the request.
Figure 3 Class diagram for the multiple quorum approach
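The sketch below shows, with hypothetical names, the threading idea: one worker thread is started per configured quorum, each given its own partition set and its own transaction-log directory. The actual classes in our patch are more involved, but follow the same pattern.

    import java.io.File;
    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration: one peer thread per quorum, each with its own
    // namespace partitions and its own transaction-log device.
    class MultiQuorumStarter {
        static void start(Map<Integer, List<String>> partitionsByQuorum,
                          Map<Integer, File> txnLogDirByQuorum) {
            for (Integer quorumId : partitionsByQuorum.keySet()) {
                List<String> partitions = partitionsByQuorum.get(quorumId);
                File txnLogDir = txnLogDirByQuorum.get(quorumId);
                Thread t = new Thread(() -> runQuorumPeer(quorumId, partitions, txnLogDir),
                                      "quorum-peer-" + quorumId);
                t.start(); // each quorum handles its request flow and inter-node traffic independently
            }
        }

        static void runQuorumPeer(int quorumId, List<String> partitions, File txnLogDir) {
            // In the real system this would run leader election, atomic broadcast and the
            // request processor chain for this quorum; omitted here.
        }
    }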
Challenges during implementation
The original ZooKeeper codebase has the following core assumptions:
i. Total ordering of all requests, by a single primary
ii. All requests are treated alike
iii. Each node is part of only one quorum
Our design required changing each of these core assumptions, without compromising on performance
or safety. Further, the code uses multithreading extensively, and has optimized usage patterns for disk
and network I/O. Thus it was a challenge to modify the code-base to suit our needs.
Below we cite a couple of examples:
i. Request-specific processing: We introduced a new interface called RequestFlushable, along
the lines of the existing Flushable interface (a sketch follows this list). This is implemented by
SendAckRequestProcessor, which now needs to examine the request to decide which leader to send an
acknowledgement to.
ii. Existence of multiple quorums: The new class QuorumPeerMulti abstracts the existence of
multiple quorums by implementing the same interface as the existing QuorumPeer. It is used
selectively by multi-leader-aware parts of the code to access the new functionality.
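As an illustration only (the paper does not spell out the exact signatures, so these are our assumptions), RequestFlushable can be thought of as a request-aware variant of java.io.Flushable, so that the acknowledgement processor knows which request, and therefore which leader, a flush belongs to:

    import java.io.IOException;

    // Assumed shape of the new interface; the actual signature in the patch may differ.
    interface RequestFlushable {
        // Flush pending work for a specific request so the correct leader can be acknowledged.
        void flush(Request request) throws IOException;
    }

    // Hypothetical stand-in for ZooKeeper's internal Request class, for illustration only.
    class Request {
        final String path;
        final int quorumId;
        Request(String path, int quorumId) { this.path = path; this.quorumId = quorumId; }
    }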
5. Performance Evaluation
We measure the performance implications of the proposed design in terms of write throughput and
compare the results with those of ZooKeeper 3.4.6, the current stable version of ZooKeeper
available from Apache. We wrote a test program in Java to measure the total time taken for a given
number of operations. We have documented our results for the create command and expect similar
results for get, set and delete. The following sections describe the experimental setup and discuss the
results from our testing.
Experimental Setup
We ran the experiments on a five-node cluster on Amazon EC2. Each instance ran Ubuntu 14.04 LTS
with the ZooKeeper software installed. Every node was assigned 4 vCPUs, 15 GB of RAM and two
40 GB SSDs. Two additional volumes of 10 GB each were attached to each instance to study
ZooKeeper performance with various disk configurations.
Separate disks are configured for snapshots and transaction logs. The path to the respective directories
is specified in the ZooKeeper configuration file. ZooKeeper service is started on all instances. The tests
are run with globalOutstandingLimit set to 1000 (the default) and snapCount set to 500,000. With a low
snapCount, snapshots are taken too frequently; hence, to eliminate deviations in measurements due to
frequent snapshotting, we set this value to a higher number.
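For reference, these settings live in ZooKeeper's configuration file. The excerpt below shows the standard parameters we tuned (device paths are illustrative; any per-quorum transaction-log directories are an extension of our modified build, not stock ZooKeeper):

    # zoo.cfg (excerpt; paths are illustrative)
    dataDir=/mnt/ssd1/zookeeper/snapshots    # snapshots on one device
    dataLogDir=/mnt/ssd2/zookeeper/txnlog    # transaction log on a separate device
    globalOutstandingLimit=1000              # default
    snapCount=500000                         # large value to avoid frequent snapshotting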
We wrote a performance test script that creates a specified number of ZooKeeper nodes under a given
namespace, by connecting to ZooKeeper using the client API and using the asynchronous create
primitive. A warm-up phase of 10,000 node creations is executed so that the system is tested in a stable
state. The results tabulated below are averages over five test runs.
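A minimal sketch of such a test driver, using ZooKeeper's asynchronous create API, is shown below; the connection string, paths and request count are illustrative, and our actual script also performs the warm-up phase described above.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class CreateThroughputTest {
        public static void main(String[] args) throws Exception {
            int numOps = 200_000;                        // illustrative request count
            CountDownLatch done = new CountDownLatch(numOps);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

            // Ensure the parent namespace exists (ignore if it is already there).
            try {
                zk.create("/app1", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }

            long start = System.currentTimeMillis();
            for (int i = 0; i < numOps; i++) {
                // Asynchronous create: the callback fires when the server responds.
                zk.create("/app1/node" + i, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
                        (rc, path, ctx, name) -> done.countDown(), null);
            }
            done.await();                                // wait for all responses
            long elapsedMs = System.currentTimeMillis() - start;
            System.out.printf("%d creates in %d ms (%.2f ops/sec)%n",
                    numOps, elapsedMs, numOps * 1000.0 / elapsedMs);
            zk.close();
        }
    }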
Results
The first experiment issues 400,000 create commands from one client: 200,000 for /app1/* and
200,000 for /app2/*.
ZK Setup         # Requests      Time (ms)    Ops/Sec
Version 3.4.6    400K            37455        10679.48
Multi-Leader     200K + 200K     24594        17176.92
Table 1 Comparison of throughput between ZooKeeper 3.4.6 and partitioned ZooKeeper
Table 1 records the total time and throughput. With ZooKeeper 3.4.6, this operation takes 37.45
seconds. With our code changes, splitting the requests between two different leaders reduces the total
time to approximately 24.6 seconds, roughly a 1.5x improvement.
ZK Setup         # Requests    Client2 Time (ms)    Client2 Ops/Sec    Client4 Time (ms)    Client4 Ops/Sec
Version 3.4.6    200K          27034                7398.091           26282                7609.771
Version 3.4.6    300K          40745                7362.867           40024                7495.503
Version 3.4.6    400K          53843                7429.007           53077                7536.221
Multi-Leader     200K          24594                8132.065           22112                9044.862
Multi-Leader     300K          38520                7788.162           32800                9146.342
Multi-Leader     400K          50600                7905.139           43338                9229.775
Table 2 Comparison of throughput between ZooKeeper 3.4.6 and partitioned ZooKeeper
In our second experiment, we use two separate clients and issue 200K, 300K and 400K create requests
from each client in separate runs. Table 2 summarizes the results of this experiment. We observe that
with two clients issuing the requests, we overcome the overhead involved in a single client dispatching
all requests to the leader.
Figure 4 Comparative throughput measurement on two different clients in a 5-node cluster
While ZooKeeper 3.4.6 shows considerable improvement in this setting, repeating the experiment with
our code changes shows that throughput improves further by approximately 14%. This can be attributed
to the fact that with ZooKeeper 3.4.6 all requests are totally ordered, whereas in our multi-leader design
the different namespaces are processed in parallel.
Table 3 Overall percentage improvement in throughput
# Requests    Overall improvement (%)
200K          14.35
300K          13.95
400K          14.2
6. Future Work
In the current design of ZooKeeper, a number of messages are exchanged between the leader and
the followers. By introducing two leaders, we do not increase the total number of network messages,
since the requests are split between the leaders. Hence, no additional latency is introduced by the
partitioned ZooKeeper design. Network latency, though, could be one area of interest for future work.
In our current implementation, the leaders for the quorums are statically chosen at system startup. We
have not handled leader failure or the election of multiple leaders. The routing code we have is quite
basic and could be extended to be user configurable. We have also not studied the impact of network
latency on larger quorums with multiple leaders. For the purpose of performance evaluation, we chose
to implement only one of the primitives (create); this follows the rationale given in the original
ZooKeeper paper ([7], Section 5.1) for its performance evaluations.
7. Conclusion
Using the same set of servers, we have implemented multiple quorums, each of which is
responsible for processing requests for a particular request path. Each quorum has its own leader which
orders only the requests targeted to the path that quorum is associated with. This relaxes the ordering
constraint across operations related to different paths while ensuring that for a particular path, the
ordering guarantee is still maintained. In order to parallelize disk writes, we assigned a separate disk
to each node for every quorum it was a part of. We demonstrated that write throughput of ZooKeeper
can be improved by separating the znode namespace and by having different leaders process
independent paths. We see an approximate improvement of 14% using this method.
References
[1] "Apache_ZooKeeper," [Online]. Available: http://en.wikipedia.org/wiki/Apache_ZooKeeper.
[2] "Partitioned ZooKeeper," [Online]. Available:
http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZooKeeper. [Last edited by F. Junqueira, 18
May 2010].
[3] D. Williams. [Online]. Available: http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/.
[4] "ZooKeeper Administrator's Guide," [Online]. Available:
http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_commonProblems.
[5] "Bitbucket Link to Older Approach (zk-Partition)," [Online]. Available:
https://bitbucket.org/pramodbiligiri/zk-partition.
[6] "Bitbucket link to New approach (zk-Multileader)," [Online]. Available:
https://bitbucket.org/pramodbiligiri/zk-multileader.
[7] P. Hunt et al., "ZooKeeper: Wait-free coordination for Internet-scale systems," in Proceedings of the
2010 USENIX Annual Technical Conference (USENIX ATC '10), June 2010.