I have examined the performance of two databases, HBase and Cassandra, in terms of their scalability, security, and performance, and compared the results obtained through different operations on the Ubuntu interface.
The project is focused on a comparison between HBase and Cassandra using YCSB. It is a data storage and management project performed at the National College of Ireland.
Data Storage and Management Project Report (Tushar Dalvi)
This paper aims at evaluating the performance of random reads and random writes in HBase and Cassandra and comparing the results obtained through various operations on Ubuntu.
Devise and implement a test strategy in order to perform a comparative analysis of the capabilities of two database management systems (Cassandra and HBase) in terms of performance.
Approach: Installation and implementation of instances of the two data storage and management systems. The Yahoo Cloud Serving Benchmark is used to compare the performances of HBase and Cassandra. Average latency and throughput were considered for analyzing the comparison of the two databases. The results obtained from YCSB are then analyzed and visualized with the help of Tableau.
Findings: HBase performs insertion, reading, and updating of records faster than Cassandra, but only when the operation count is low. At heavier loads, Cassandra performs better than HBase.
Tools: HBase, Cassandra, Hadoop, Tableau, YCSB
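The two YCSB metrics named in the approach, throughput and average latency, can be derived from raw per-operation timings. A minimal sketch with hypothetical numbers (not actual YCSB output):

```python
# Throughput and average latency as YCSB-style summary metrics.
# All timing values below are hypothetical, for illustration only.

def throughput(ops_count, total_runtime_s):
    """Operations completed per second over the whole run."""
    return ops_count / total_runtime_s

def average_latency_us(latencies_us):
    """Mean per-operation latency in microseconds."""
    return sum(latencies_us) / len(latencies_us)

# Hypothetical run: 1000 operations completed in 2 seconds.
print(throughput(1000, 2.0))                 # 500.0 ops/sec
print(average_latency_us([100, 200, 300]))   # 200.0 us
```

Higher throughput with lower average latency is the pattern the comparison in this report looks for at each workload size.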
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
http://tyfs.rocks
Hadoop World 2011: Advanced HBase Schema Design (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. Since the amount of data collected and analyzed in enterprises has increased severalfold in the volume, variety, and velocity of its generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of “Big Data”. In this paper we trace the origin of a new class of system, called Hadoop, built to handle Big Data.
Big Data Fundamentals in the Emerging New Data World (Jongwook Woo)
I talk about the fundamentals of Big Data, including Hadoop, data-intensive computing, and the NoSQL databases that have received attention for computing and storing Big Data, which is usually greater than a petabyte in size. I also introduce case studies that use Hadoop and NoSQL databases.
Apache HBase™ is the Hadoop database: a distributed, scalable, big data store. It is a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This presentation is all about the difference between SQL and NoSQL databases, because the question of which parameters differentiate the two arises for almost everyone. After viewing this presentation, your doubts and confusion about SQL versus NoSQL should be cleared up.
CASSANDRA: A DISTRIBUTED NOSQL DATABASE FOR A HOTEL MANAGEMENT SYSTEM (IJCI Journal)
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. Cassandra resembles a database and shares many design and implementation strategies with one, but it does not support a full relational data model. This paper discusses an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
Comparison between RDBMS, Hadoop and Apache based on parameters like Data Variety, Data Storage, Querying, Cost, Schema, Speed, Data Objects, Hardware Profile, and Use Cases. It also mentions benefits and limitations.
EVALUATING CASSANDRA AND MONGODB-LIKE NOSQL DATASETS USING HADOOP STREAMING (IJIERT)
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format, i.e. with a proper schema, whereas unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. For processing unstructured data, the existing system feeds it to a Cassandra dataset. In the present system, alongside the Cassandra dataset, MongoDB is to be implemented, as MongoDB provides a flexible data model and a large number of options for querying unstructured data. Cassandra, by contrast, models data so as to minimize the total number of queries through more careful planning and denormalization. It offers basic secondary indexes, but for the best performance it is recommended to model the data so as to use them infrequently. So to process
Benchmarking Scalability and Elasticity of Distributed Database Systems (jasoninnes20)
Jörn Kuhlenkamp, Technische Universität Berlin, Information Systems Engineering Group, Berlin, Germany
Markus Klems, Technische Universität Berlin, Information Systems Engineering Group, Berlin, Germany
Oliver Röss, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Proceedings of the VLDB Endowment, Vol. 7, No. 13 (2014)
ABSTRACT
Distributed database system performance benchmarks are an important source of information for decision makers who must select the right technology for their data management problems. Since important decisions rely on trustworthy experimental data, it is necessary to reproduce experiments and verify the results. We reproduce performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compare the results. The scope of our reproduced experiments is extended with a performance evaluation of Cassandra on different Amazon EC2 infrastructure configurations, and an evaluation of Cassandra and HBase elasticity by measuring scaling speed and performance impact while scaling.
1. INTRODUCTION
Modern distributed database systems, such as HBase, Cassandra, MongoDB, Redis, Riak, etc., have become popular choices for solving a variety of data management challenges. Since these systems are optimized for different types of workloads, decision makers rely on performance benchmarks to select the right data management solution for their problems. Furthermore, for many applications, it is not sufficient to only evaluate the performance of one particular system setup; scalability and elasticity must also be taken into consideration. Scalability measures how much performance increases when resource capacity is added to a system, or how much performance decreases when resource capacity is removed, respectively. Elasticity measures how efficiently a system can be scaled at runtime, in terms of scaling speed and performance impact on the concurrent workloads.
Experiment reproduction. In section 4, we reproduce performance and scalability benchmarking experiments that were originally conducted by Rabl, et al. [14] for evaluating distributed database systems in the context of Enterprise Application Performance Management (APM) on virtualized infrastructure. In section 5, we discuss the problem of selec ...
Comparative study of NoSQL document and column-store databases and evaluation o... (IJDMS)
In the last decade, rapid growth in mobile applications, web technologies, and social media generating unstructured data has led to the advent of various NoSQL data stores. Demands for web scale are increasing every day, and NoSQL databases are evolving to meet stern big data requirements. The purpose of this paper is to explore NoSQL technologies and present a comparative study of document and column-store NoSQL databases such as Cassandra, MongoDB, and HBase across various attributes of relational and distributed database system principles. A detailed study and analysis of the architecture and internal workings of Cassandra, MongoDB, and HBase is done theoretically, and the core concepts are depicted. This paper also presents an evaluation of Cassandra for an industry-specific use case, and the results are published.
Performance Analysis of HBase and MongoDB (Kaushik Rajan)
Comparison of different NoSQL databases, namely HBase and MongoDB, at different workloads using the Yahoo Cloud Serving Benchmark (YCSB).
Tools used:
> HBase, MongoDB, Shell Scripting, YCSB, Hadoop Environment
> Tableau for Visualization
> LaTeX for documentation
Big Data Frameworks: Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase Clients – Examples – Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – Developing and Testing Pig Latin Scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries
A NOVEL APPROACH FOR A HOTEL MANAGEMENT SYSTEM USING CASSANDRA (ijfcstjournal)
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. Cassandra resembles a database and shares many design and implementation strategies with one, but it does not support a full relational data model. This paper discusses an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
DSM Project: HBase and Cassandra
Comparison of HBase and Cassandra: The Two NoSQL Databases
Shantanu Deshpande
x18125514
Abstract:
The recent years have seen a rapid growth in the digital world, and it has resulted in increased data complexity in terms of volume, velocity, and variety, termed Big Data. For instance, nowadays, social media websites generate terabytes and petabytes of information on a daily basis, which needs to be collected and effectively managed in real time. The rate at which read-write operations are performed is immense, with expectations of even faster retrieval and loading. Traditional methods like SQL are incapable of processing this new generation of data because they lack the high scalability and elasticity that such data demands. Of late, NoSQL has surged in popularity, as it is claimed to perform better than the traditional methods. Two widely popular NoSQL databases are HBase and Cassandra. In this paper, we examine the performance of these two databases and compare the results obtained through different operations on the Ubuntu interface.
Keywords: Big Data, SQL, NoSQL, HBase, Cassandra, Ubuntu
1) Introduction:
Due to the advent of digital age and the growing number of internet users worldwide, there
has been an astounding increase in the data across the globe. One such example is Internet
of Things which performs real-time analysis and continuously gathers data through its
sensors.
Managing all this data is a complex task and a challenge for the companies that own it
and need it processed further. Previously, organizations could maintain their data with
the help of relational database management systems; however, as the load kept
increasing, processing time grew significantly, resulting in high latency in query
processing, the data transmission rate went down significantly, and horizontal
scalability remained poor. This had an adverse impact on the associated cost of data
processing, increasing company overheads while still delivering poor performance. As a
result of these drawbacks of relational database systems, NoSQL was introduced around a
decade ago. Its characteristics include design simplicity, simpler "horizontal" scaling
to machine clusters and improved performance owing to its node-to-node architecture.
In terms of structured storage, the relational SQL database can be viewed as a subset of
NoSQL. Unlike the vertical scalability scheme of traditional databases, NoSQL's
horizontal scaling results in lower maintenance costs. (Anon., n.d.)
The four types of NoSQL databases are –
Column - There is only one column of data in each storage block. Ex. Cassandra and HBase
Document - The document-oriented system relies on the document's internal structure
to extract metadata for further optimization. Ex. MongoDB
Graph - A database that depicts and stores data with nodes, edges and properties using
semantic graph structures. Ex. Neo4j
Key-Value - A storage, retrieval and management paradigm for associative arrays,
commonly known as hash tables. Ex. Amazon S3.
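The key-value paradigm above can be sketched in a few lines of Python; the class and method names here are ours, purely for illustration:

```python
# Minimal illustration of the key-value paradigm: an associative
# array (hash table) mapping opaque keys to opaque values.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # a Python dict is itself a hash table

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Alice"})
print(store.get("user:42"))  # {'name': 'Alice'}
```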
2) Key Characteristics:
2.1 HBase:
HBase is an open-source project built on top of Hadoop file system. It is a distributed
column-oriented database and is horizontally scalable. HBase is not a relational data store
hence it does not support a structured query language like SQL. Much like a traditional
database, HBase also comprises tables that contain rows and columns, and a table must
define one element as the primary key.
The key characteristics of HBase are-
• Consistency: HBase is suitable for high-speed requirements, as it provides
consistent read-write operations.
• Sharding: It is the process of dividing a logical database into smaller, more
manageable parts called data shards. This process reduces I/O time and
overhead. The split can be done either automatically or manually at a threshold size.
• Atomic Read and Write: While the system is processing one read or write operation,
all other processes are prevented from performing another read or write operation.
This is known as atomic read/write. HBase performs this on a row level.
• High Availability: HBase supports recovery and failover across both LAN and WAN.
Basically, at the core it has a master server, which handles the metadata for the
cluster as well as monitors the region servers.
• It also has an effortless Java API for the client.
Based on the above characteristics, HBase is ideal wherever write-heavy operations are
required. It is also used where quick random access to the available data needs to
be provided.
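The atomic row-level read/write behaviour described above can be mimicked with a per-row lock. This is an illustrative Python sketch of the idea, not HBase code; all names are ours:

```python
import threading

# Sketch of per-row atomicity: operations on the SAME row are
# serialized by that row's lock, while operations on different
# rows can proceed in parallel.
class RowStore:
    def __init__(self):
        self._rows = {}
        self._locks = {}
        self._registry_lock = threading.Lock()

    def _lock_for(self, row_key):
        # Create each row's lock exactly once.
        with self._registry_lock:
            return self._locks.setdefault(row_key, threading.Lock())

    def put(self, row_key, column, value):
        with self._lock_for(row_key):  # atomic write at row level
            self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        with self._lock_for(row_key):  # atomic read at row level
            return dict(self._rows.get(row_key, {}))
```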
2.2 Cassandra:
Cassandra is an open-source, distributed and decentralized system that has been
designed to manage humongous amounts of data. It provides a highly available service
with no single point of failure.
The key characteristics of Cassandra are-
• Always on architecture: It does not have a single point of failure thus ensuring that
no critical business application fails.
• Flexible data storage: Cassandra can accommodate any possible type of data i.e. the
data can be either structured, semi-structured or unstructured. According to the
requirement it can accommodate changes to the data structures.
• Data distribution: Data is replicated across multiple data centres; Cassandra thus
provides the flexibility to distribute data as and where it is required.
• Elastic scalability: It is one of the key characteristics of Cassandra. It is possible to
easily scale-up or scale-down the cluster, as it provides the flexibility for deletion and
addition of any number of nodes without any disruptions.
• Faster linear-scale performance: Cassandra achieves and maintains quick response
times because throughput increases as the number of nodes increases.
• Tunable Consistency: Cassandra offers two types of consistency, strong consistency
and eventual consistency. Under eventual consistency, a write is acknowledged to the
client as soon as the cluster accepts it. Strong consistency, on the other hand,
ensures that any update is transmitted to all nodes or machines holding the
affected data.
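The two consistency modes can be contrasted with a toy ack-counting model. This is our illustrative sketch, not Cassandra driver code; the function names are ours, while the level names ONE, QUORUM and ALL mirror Cassandra's terminology:

```python
# Toy model of tunable consistency: a write is acknowledged to
# the client once enough replicas have accepted it.
def required_acks(level, replication_factor):
    """Replica acks needed for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError("unknown level: " + level)

def write_succeeds(level, replication_factor, live_replicas):
    """A write succeeds iff enough replicas are alive to ack it."""
    return live_replicas >= required_acks(level, replication_factor)

# With replication factor 3 and one node down:
print(write_succeeds("QUORUM", 3, 2))  # True  (2 of 3 acks suffice)
print(write_succeeds("ALL", 3, 2))     # False (all 3 acks required)
```

This trade-off is exactly why stronger levels reduce availability: the more acks a write must wait for, the fewer node failures it can tolerate.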
3) Architecture:
3.1 HBase:
There are three important components in HBase architecture, HMaster, Zookeeper and
Region Server.
HMaster: HBase HMaster does the task of assigning the regions to region servers in the
Hadoop cluster for uniform load balancing.
Region Server: They are the worker nodes that handle the transactional queries like read,
write, update and delete from the clients. This process runs on every node within the
Hadoop cluster.
ZooKeeper: It is a centralized monitoring server that performs region assignment and
recovers from region server crashes by reassigning the affected regions to other
working region servers.
3.2 Cassandra:
Cassandra is designed to handle large data workloads across multiple nodes with no
single point of failure. The architecture is based on the understanding that both
hardware and system failures do occur. Cassandra addresses the issue of failures by
using a peer-to-peer distributed system across homogeneous nodes, where data is
distributed among all cluster nodes. All nodes within a cluster play a similar role.
Each node is interconnected with the other nodes yet is also independent.
Key Structure:
Node: Here the data is stored and is the basic infrastructure component of Cassandra.
Datacentre: A collection of related nodes is termed as datacentre.
Cluster: It contains one or more datacentres.
Commit log: Complete data is first written on the commit log. Once the data is transferred
to SSTable, then it is either deleted, archived or recycled.
SSTable: A sorted string table (SSTable) is an unchangeable data file that Cassandra
periodically writes memtables to.
CQL Table: A collection of columns ordered by table row. A table is made up of
columns and has a primary key.
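The commit log / SSTable interplay described above can be summarized with a simplified write-path sketch. The class name and flush threshold are ours, purely for illustration:

```python
# Simplified write path: append to the commit log, update the
# in-memory memtable, and flush the memtable to an immutable,
# sorted SSTable once it grows large enough.
class TinyWritePath:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # durable, append-only log
        self.memtable = {}     # recent writes, in memory
        self.sstables = []     # immutable sorted runs on "disk"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first
        self.memtable[key] = value            # 2. then memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the memtable as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # log entries can now be recycled

wp = TinyWritePath()
for k in ["c", "a", "b"]:
    wp.write(k, k.upper())
print(wp.sstables)  # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
```

Note how the keys come out sorted regardless of insertion order; that sorted, immutable property is what makes SSTables cheap to merge and scan.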
4) Comparison between the two:
For the purpose of designing distributed database systems, the CAP theorem makes
designers aware of the trade-offs that need to be considered beforehand. The
theorem applies to distributed systems that store data, and CAP stands for Consistency,
Availability and Partition tolerance. Its key point is that such a database can
achieve at most two of the three properties at once. Here, we have compared HBase and
Cassandra in terms of their scalability, availability, reliability and security.
Scalability:
A database's scalability is characterized by its capacity to handle large amounts of
information while maintaining high execution efficiency. HBase is highly scalable, as
data is distributed evenly across the tables as it grows in the database; this is
supported well because HBase is modelled on Google's Bigtable. Dynamic table
distribution can be observed in HBase. Horizontal scalability is achieved over the
region servers, which act as slaves in the cluster. The region is termed the
basic unit of horizontal scalability in HBase.
Regions are a subset of data from a table: a contiguous, sorted range of rows that
are stored together. Initially, a table has only one region. Once the number of rows
increases and the region becomes large, it is split into two at the middle key,
thereby creating two almost equal halves.
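The middle-key split can be sketched as follows; `max_rows` stands in for HBase's size threshold and is our own simplification:

```python
# Illustrative region split: a region is a contiguous, sorted
# range of row keys; past a threshold it splits at the middle
# key into two nearly equal halves.
def split_region(rows, max_rows):
    """rows must be sorted; returns one or two regions."""
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2          # the middle key's position
    return [rows[:mid], rows[mid:]]

region = ["row%03d" % i for i in range(10)]
left, right = split_region(region, max_rows=8)
print(left[-1], right[0])  # row004 row005
```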
In the case of Cassandra, the database is linearly scalable: scalability can be
increased simply by adding new nodes. Cassandra can be scaled horizontally, by adding
more nodes, or vertically, by upgrading the hardware of existing nodes.
Availability:
Availability of a database means that any request given to the database as an input
should receive a response from the system, whether success or failure. It also refers
to the accessibility of data even in case of failure of server or data nodes in the
cluster. If the database has high availability, the client faces fewer interruptions
in the event of server failure. HBase has a master-slave relationship just like HDFS;
however, it can also run multiple HMasters, so that even if one of the masters fails
to communicate, data transmission is not halted. This would no doubt create
inconsistency in the data but, as explained above, under the CAP theorem it is fine
to proceed as long as any two of the three parameters are fulfilled. Cassandra, in
contrast, has no master-slave relationship: all nodes are equal and there is no master
node controlling the others, which avoids a single point of failure. Cassandra also
provides replication, meaning that even if a node within the cluster goes down, one or
more copies are available on different machines within the cluster. Source: (Anon., n.d.)
Reliability:
Reliability of a database is measured by its performance in terms of its deliverables,
which should ideally match the defined specifications. A highly reliable system is one
that shows the same or better performance even in the event of environment changes or
faults in the system. For HBase, ZooKeeper assures reliability: znodes act as
subordinates, and once ZooKeeper receives a request from the client it is served
across the region servers, with data stored across various levels. Various experiments
have also observed that HBase's performance efficiency increases as the workload
increases, thus assuring higher reliability. In the case of Cassandra, the distributed
ring structure and replication across nodes likewise make it a reliable system.
Security:
HBase:
The key security features available in HBase, according to (Anon., n.d.) are-
1. Authentication:
For gaining secure access to a database, it is essential that clients authenticate
with the server to establish credentials. The various options for authentication are-
• Client authentication: There are numerous security protocols for allowing clients to
authenticate with the database. For HBase they are - Kerberos, SSL.
• Server Authentication: Database servers must also authenticate
with each other to ensure a secure operating environment. In HBase,
a shared keyfile is one such method.
2. Role-Based Security:
Role-based security considerably simplifies the administration and operation of
security. HBase provides various security role features, such as custom roles
and default roles, to support ease of administration. It is also important to
define the scope of roles, which is useful for systems that normally handle
extremely sensitive data.
3. Database Security:
HBase supports database encryption and it is highly important to encrypt the data in
sensitive application domains. Logging is also essential for recording all the activities
and interaction of clients with the system for auditing and detailed investigations.
The administrator is able to define which security groups are to be logged. In HBase,
fixed event logging and configurable event logging are the supported logging options.
Cassandra:
According to (Anon., n.d.) the three main components of the security features furnished by
Cassandra are –
1. TLS/SSL encryption for client and inter-node communication.
Cassandra offers two encryption options, which are managed separately and must be
configured independently: client-to-node encryption and node-to-node encryption. When
encryption is enabled, the JVM defaults for supported protocols and cipher suites are
used. Although these can be overridden through settings, this is not recommended
unless specific settings need to be configured as per a certain policy.
2. Client Authentication
Authentication is configured in Cassandra using the ‘authenticator’ setting in
cassandra.yaml. Under the default settings, the system performs no
authentication checks and thereby requires no credentials. A PasswordAuthenticator,
which stores encrypted credentials, is also included in the package.
3. Authorization
As with authentication, there are two options for authorization. By default, no check
is performed, thus allowing all permissions to all roles. Cassandra also includes
CassandraAuthorizer, which provides full permission management, with the related
data stored in Cassandra system tables.
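The settings discussed above live in cassandra.yaml; a sketch of the relevant excerpt is shown below. Keystore paths and passwords are placeholders, and defaults may differ across Cassandra versions:

```yaml
# authentication / authorization (defaults are AllowAll*)
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

# node-to-node encryption
server_encryption_options:
  internode_encryption: all
  keystore: conf/.keystore          # placeholder path
  keystore_password: changeit       # placeholder

# client-to-node encryption (managed separately, as noted above)
client_encryption_options:
  enabled: true
  keystore: conf/.keystore          # placeholder path
  keystore_password: changeit       # placeholder
```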
5 Learning’s from Literature Review:
With the development of the Internet and cloud computing, databases need to be able
to effectively store and process big data, demanding high performance when reading
and writing, while the traditional relational database confronts many new challenges.
(Han, 2011)
Especially in large-scale and highly competitive applications such as search engines
and SNS, the relational database appeared inadequate for storing and querying dynamic
user data. The NoSQL database was created for this case. With the exponential growth
in global data generation, the demands on database technology grew significantly:
reading and writing simultaneously with low latency, efficient support for large data
storage and access, improved scalability and high availability, and lower operating
and management costs. These were some of the key limitations of traditional
relational databases. To overcome them, NoSQL has emerged as an alternative
paradigm for this new non-relational data schema (Dede, 2013). NoSQL database features
described above are common; in reality, each product is compliant with the various data
models and the CAP theorem. CAP theorem stands for Consistency, Availability and
tolerance of network Partition. The core idea of CAP theorem is that a distributed system
cannot simultaneously meet the three needs but can only meet two (Han, 2011). Depending
on the project requirements, different storages offer different consistency levels. These
options enable users to choose various trade-offs like availability, latency and consistency.
(Kumar, 2014). Therefore, in order to understand which system would be better, it is
essential to assess the performance of each storage system so as to judge the appropriate
storage type for a particular application. In this paper (Abubakar, 2014), the author
introduces YCSB, an open-source tool provided by Yahoo that allows
benchmarking multiple systems and comparing them by creating workloads. Distributed
systems are often more complicated than their single-node counterparts due to the
trade-offs that need to be balanced as per the application's requirements. The author
attempts to extend YCSB so that it can measure stale reads in real time. One can use
the model created in this paper to calculate the trade-offs between availability,
latency and consistency.
According to (Dede, 2013), Internet applications are rapidly increasing, generating
enormous amounts of data. To store such humongous amounts of data, NoSQL database
systems like HBase and Cassandra are widely used by many organizations as their
storage solution. The author tests the Cassandra database based on its performance,
discussing how Cassandra's different features, such as replication and data
partitioning, affect the performance of Apache Hadoop. A test model is then
introduced that carries out testing on the basis of the system's performance and
ensures that the architecture and its business context are considered while
conducting the tests. Finally, these tests are applied at the architecture level
based on performance, covering performance-based elements such as the column-oriented
data model, the split-mechanism data model and the data replication factor. A test
procedure is performed and a test scenario designed for each performance element. Due
to the continuous development of cloud computing, non-structural data storage is also
steadily increasing. The schema evaluation was divided into a separate unit known as
a schema analyser, which therefore does not have to rely on web applications and can
be connected to visual tools.
In another study (Tang, 2016), the performance of five NoSQL databases, including
Cassandra and HBase, was compared using YCSB (Yahoo Cloud Serving Benchmark).
The experiment involved three different workloads: Workload A (50% read and 50% write),
Workload C (100% read) and Workload H (100% write). These workloads were
performed with around 10,000 operations out of the 100,000 loaded operations. Of the
two experiments conducted, the first measured the total time taken by these databases
against all three workloads. Redis turned out to be superior to the other databases,
as it took less time to load and execute the data; compared with Cassandra and HBase,
it was 1.43 and 3.61 times faster respectively. The second experiment measured
throughput. Notably, all five databases showed a similar trend in this experiment,
and here as well, Redis performed significantly better than the others. In this case,
Cassandra achieved a greater throughput than HBase. The experiments showed that the
Redis database is better suited to loading and executing the workloads, and this
study thereby served as a motivation for our own. In the following section, we will
examine how these experiments relate to the study performed in this paper.
6) Performance Test Plan:
For the execution of the process and the subsequent comparison of the two databases,
we first created an instance on OpenStack, which is hosted on the cloud. We then
assigned a floating IP to this instance to get access to the Ubuntu system, and a
keypair was generated with authorized keys in the ssh directory. Next, we installed
Hadoop along with HBase. To initiate the Hadoop installation, we first installed Java
version 8 and created a group named hadoop and a user named hduser. We then disabled
IPv6, downloaded Hadoop, unzipped the file and assigned hduser to the Hadoop
directory by creating a symbolic link. The various configuration files, namely
hadoop-env.sh, core-site.xml and hdfs-site.xml, were edited according to the manual.
Thereafter, we formatted the name node and started dfs and yarn.
After the successful installation of Hadoop, we installed HBase. Similar to the Hadoop
process, we downloaded HBase from the website, unzipped it and established a symbolic
link. We then edited the hbase-env.sh file, started HBase and created a user table.
YCSB, a benchmarking tool, was then installed on the system by referring to the lab
manual. A test harness had already been created in YCSB specifying the workload type,
operation count, database type, etc., and the files in the test harness were updated
as per our requirements. The workload types considered were Workload A and Workload C,
and three operation counts were considered, 100000, 150000 and 200000, for both HBase
and Cassandra. Workload A is a combination of 50% reads and 50% writes, whereas
Workload C is 100% reads. The process was run three times using the command
runtest.sh. Following this, Cassandra was downloaded and installed on the system by
following the guidelines given in an online manual (Anon., n.d.). The files in the
test harness were modified as required for Cassandra and a similar activity was
performed. After the successful completion of both HBase and Cassandra runs, the
average of the output was evaluated.
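For reference, a YCSB core-workload properties file matching Workload A at the first operation count looks roughly like this; the property names are YCSB's own, while the exact values mirror our setup and are otherwise illustrative:

```
recordcount=100000
operationcount=100000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```

Such a file is typically passed to YCSB with `bin/ycsb load <binding> -P <file>` to load the records, followed by `bin/ycsb run <binding> -P <file>` to execute the operations, where `<binding>` names the database connector in use.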
Device Specifications:
• Sony Vaio Fit 14 SVF14A15SNB
• 8GB RAM
• Intel Core I5 (3rd Generation)
• 1.8 GHz with Turbo Boost up to 2.7 GHz
• 1TB HDD
Databases:
• HBase
• Cassandra
Workload Type:
• Workload A: 50% read and 50% write
• Workload C: 100% read
Operating Environment:
OpenStack
• Name: m1.medium
• VCPU’s: 2
• RAM: 4GB
• Disk size: 40GB
• MSc data-net
7) Evaluation and Results:
Here, we have performed two workload tests, Workload A and Workload C against our two
databases, HBase and Cassandra using YCSB as the benchmarking tool. Following are the
test specifications:
Workload A:
1. Read: 50 %
2. Update: 50 %
Workload C:
1. Read: 100 %
7.1 Workload A Results:
7.1.1 Average Insert latency vs. overall throughput
Database            Workload A Count   [OVERALL] Throughput (ops/sec)   [INSERT] Average Latency (us)
Cassandra Count 1   100000             1830.161054                      471.04802
Cassandra Count 2   150000             2207.667967                      405.07366
Cassandra Count 3   200000             2472.157328                      361.238945
HBase Count 1       100000             1907.632437                      432.87972
HBase Count 2       150000             2395.821687                      394.2167
HBase Count 3       200000             2363.256094                      395.25737
• Here, we are comparing the average insert latency with the overall throughput.
• If the database latency is lower, then we can say that the database performance is
good.
• The latency of HBase is less than Cassandra for a lower count but as the data size
increases, the latency rate of Cassandra drops below that of HBase.
7.1.2 Average Update Latency vs. Update operations
Database            Workload A Count   [UPDATE] Operations   [UPDATE] Average Latency (us)
Cassandra Count 1   100000             50118                 405.3865677
Cassandra Count 2   150000             75256                 394.719039
Cassandra Count 3   200000             99707                 335.3669451
HBase Count 1       100000             49972                 387.6640519
HBase Count 2       150000             75369                 373.7879101
HBase Count 3       200000             100005                383.7894005
• Here, we are comparing average update latency with update operations.
• As the workload increases and the number of update operations grows, the latency
of HBase increases whereas that of Cassandra decreases significantly.
• This shows Cassandra performs better for update operations when the workload is
high.
7.1.3 Read operations vs. Avg. Read latency
Database            Workload A Count   [READ] Operations   [READ] Average Latency (us)
Cassandra Count 1   100000             49882               499.271621
Cassandra Count 2   150000             74744               526.7530905
Cassandra Count 3   200000             100293              439.4176164
HBase Count 1       100000             50028               334.307208
HBase Count 2       150000             74631               314.4751645
HBase Count 3       200000             99995               325.520106
• Here, we compare the Read operations with the average read latency.
• From this graph, we can interpret that the average latency for HBase is consistent
even with the increase in workload whereas for Cassandra, as the workload
increases beyond count 2, the latency rate drops significantly.
7.2 Workload C
7.2.1 Average Insert latency vs. overall throughput
Database            Workload C Count   [OVERALL] Throughput (ops/sec)   [INSERT] Average Latency (us)
Cassandra Count 1   100000             2072.023538                      412.90974
Cassandra Count 2   150000             2202.610828                      406.38106
Cassandra Count 3   200000             2521.718299                      360.39823
HBase Count 1       100000             2312.726936                      401.83924
HBase Count 2       150000             2421.893921                      395.4957067
HBase Count 3       200000             2276.789272                      410.009585
• Here, we compare Average Insert latency with overall throughput for Workload C.
• It can be observed from the graph that the latency for Cassandra is lower than
that of HBase in all three cases, and that it decreases as the workload increases.
7.2.2 Average Read Latency(us) vs. Overall Throughput(ops/sec)
Database            Workload C Count   [OVERALL] Throughput (ops/sec)   [READ] Average Latency (us)
Cassandra Count 1   100000             2058.248431                      416.41129
Cassandra Count 2   150000             2350.581377                      378.94206
Cassandra Count 3   200000             2494.636531                      366.025005
HBase Count 1       100000             3037.667072                      283.80343
HBase Count 2       150000             3758.833258                      254.5179933
HBase Count 3       200000             4144.734115                      221.76259
• Here, we have compared Average Read Latency with Overall Throughput for
Workload C.
• From the graph it is visible that the start count has the maximum latency rate for both the
databases, HBase and Cassandra, although as the workload increases, the latency rate for
both the databases drops significantly.
8) Conclusions and Discussion:
In this paper, we have explained the underlying concepts of the HBase and Cassandra
databases. The benchmarking tool used for the comparison is the Yahoo! Cloud Serving
Benchmark (YCSB), employed to determine which database performed better under
different workload scenarios. The same operation counts were provided to each
database: 100000, 150000 and 200000. We used two types of workloads, A and C:
Workload A comprises 50% read and 50% write operations, while Workload C comprises
100% read operations. Upon visualizing the data in Tableau, we found that the latency
behaviour of HBase differs from that of Cassandra. Although in both databases the
latency rate decreases as the workload increases, this rate of decrease is greater
for Cassandra than for HBase. In Workload A, as the update operations increase, the
average latency of Cassandra drops below the HBase latency rate. Overall, we can
observe that for higher workloads the performance of Cassandra is better than that of
HBase, and we can recommend Cassandra for higher workload requirements. Also, since
all the benchmarking parameters were available in the YCSB tool, we can say that it
is a great tool for benchmarking several NoSQL databases in a cloud environment.
Bibliography
Abubakar, Y., 2014. Performance Evaluation of NoSQL Systems using YCSB in a Resource
Austere Environment. ResearchGate.
Anon., n.d. An Evaluation of Cassandra for Hadoop. [Online]
Available at: http://sci-hub.tw/https://ieeexplore.ieee.org/abstract/document/6676732
[Accessed 2019].
Anon., n.d. Cassandra. [Online]
Available at: https://www.rapidvaluesolutions.com/tech_blog/cassandra-the-right-data-
store-for-scalability-performance-availability-and-maintainability/
[Accessed 2019].
Anon., n.d. Cassandra Installation. [Online]
Available at: https://www.vultr.com/docs/how-to-install-apache-cassandra-3-11-x-on-
ubuntu-16-04-lts
[Accessed 2019].
Anon., n.d. Cassandra-Security. [Online]
Available at: http://cassandra.apache.org/doc/latest/operating/security.html
[Accessed 2019].
Anon., n.d. HBase security features. [Online]
Available at: https://quabase.sei.cmu.edu/mediawiki/index.php/HBase_Security_Features
[Accessed 2019].
Dede, E., 2013. An Evaluation of Cassandra for Hadoop. IEEE.
Han, J., 2011. Survey on NoSQL database. IEEE.
Kumar, S. P., 2014. Evaluating consistency on the fly using YCSB. IEEE.
Tang, E., 2016. Performance Comparison between Five NoSQL Databases. IEEE.