I have examined the performance of two databases, HBase and Cassandra, in terms of their scalability, security, and performance, and compared the results obtained through different operations on the Ubuntu interface.
The project is focused on a comparison between HBase and Cassandra using YCSB. It is a data storage and management project performed at the National College of Ireland.
Data Storage and Management Project Report (Tushar Dalvi)
This paper aims at evaluating the performance of random reads and random writes in HBase and Cassandra and comparing the results obtained through various operations on Ubuntu.
Devise and implement a test strategy in order to perform a comparative analysis of the capabilities of two database management systems (Cassandra and HBase) in terms of performance.
Approach: Installation and implementation of instances of the two data storage and management systems. The Yahoo Cloud Serving Benchmark is used to compare the performances of HBase and Cassandra. Average latency and throughput were considered for analyzing the comparison of the two databases. The results obtained from YCSB are then analyzed and visualized with the help of Tableau.
Findings: HBase performs insertion, reading, and updating of records faster than Cassandra, but only when the operation count is low. At heavier loads, Cassandra performs better than HBase.
Tools: HBase, Cassandra, Hadoop, Tableau, YCSB
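The two YCSB metrics named in the approach, throughput and average latency, can be derived from raw per-operation timings. A minimal sketch with hypothetical numbers (not actual YCSB output):

```python
# Throughput and average latency as YCSB-style summary metrics.
# All timing values below are hypothetical, for illustration only.

def throughput(ops_count, total_runtime_s):
    """Operations completed per second over the whole run."""
    return ops_count / total_runtime_s

def average_latency_us(latencies_us):
    """Mean per-operation latency in microseconds."""
    return sum(latencies_us) / len(latencies_us)

# Hypothetical run: 1000 operations completed in 2 seconds.
print(throughput(1000, 2.0))                 # 500.0 ops/sec
print(average_latency_us([100, 200, 300]))   # 200.0 us
```

Higher throughput with lower average latency is the pattern the comparison in this report looks for at each workload size.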
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
http://tyfs.rocks
Hadoop World 2011: Advanced HBase Schema Design (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. Since the amount of data collected and analyzed in enterprises has increased severalfold in the volume, variety, and velocity of its generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of “Big Data”. In this paper we trace the origin of a new class of system, called Hadoop, built to handle Big Data.
Big Data Fundamentals in the Emerging New Data World (Jongwook Woo)
I talk about the fundamentals of Big Data, including Hadoop, data-intensive computing, and the NoSQL databases that have received attention for computing and storing Big Data, which is usually greater than a petabyte in size. I also introduce case studies that use Hadoop and NoSQL databases.
Apache HBase™ is the Hadoop database: a distributed, scalable, big data store. It is a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This presentation is all about the difference between SQL and NoSQL databases, because the question of which parameters differentiate the two arises for almost everyone. After viewing this presentation, your doubts and confusion about SQL versus NoSQL should be cleared up.
CASSANDRA: A DISTRIBUTED NOSQL DATABASE FOR A HOTEL MANAGEMENT SYSTEM (IJCI Journal)
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. Cassandra resembles a database and shares many design and implementation strategies with one, but it does not support a full relational data model. This paper discusses an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
Comparison between RDBMS, Hadoop and Apache based on parameters like Data Variety, Data Storage, Querying, Cost, Schema, Speed, Data Objects, Hardware Profile, and Use Cases. It also mentions benefits and limitations.
EVALUATING CASSANDRA AND MONGODB-LIKE NOSQL DATASETS USING HADOOP STREAMING (IJIERT)
Unstructured data poses challenges for storage. Experts estimate that 80 to 90 percent of the data in any organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing. Structured data exists in table format, i.e. with a proper schema, whereas unstructured data is schema-less, which directly signifies the importance of the NoSQL storage model and the MapReduce platform. For processing unstructured data, the existing system feeds it to a Cassandra dataset. In the present system, alongside the Cassandra dataset, MongoDB is to be implemented, as MongoDB provides a flexible data model and a large number of options for querying unstructured data. Cassandra, by contrast, models data so as to minimize the total number of queries through more careful planning and denormalization. It offers basic secondary indexes, but for the best performance it is recommended to model the data so as to use them infrequently. So to process
Benchmarking Scalability and Elasticity of Distributed Database Systems (jasoninnes20)
Jörn Kuhlenkamp, Technische Universität Berlin, Information Systems Engineering Group, Berlin, Germany
Markus Klems, Technische Universität Berlin, Information Systems Engineering Group, Berlin, Germany
Oliver Röss, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Proceedings of the VLDB Endowment, Vol. 7, No. 13 (2014)
ABSTRACT
Distributed database system performance benchmarks are an important source of information for decision makers who must select the right technology for their data management problems. Since important decisions rely on trustworthy experimental data, it is necessary to reproduce experiments and verify the results. We reproduce performance and scalability benchmarking experiments of HBase and Cassandra that have been conducted by previous research and compare the results. The scope of our reproduced experiments is extended with a performance evaluation of Cassandra on different Amazon EC2 infrastructure configurations, and an evaluation of Cassandra and HBase elasticity by measuring scaling speed and performance impact while scaling.
1. INTRODUCTION
Modern distributed database systems, such as HBase, Cassandra, MongoDB, Redis, Riak, etc., have become popular choices for solving a variety of data management challenges. Since these systems are optimized for different types of workloads, decision makers rely on performance benchmarks to select the right data management solution for their problems. Furthermore, for many applications, it is not sufficient to only evaluate the performance of one particular system setup; scalability and elasticity must also be taken into consideration. Scalability measures how much performance increases when resource capacity is added to a system, or how much performance decreases when resource capacity is removed, respectively. Elasticity measures how efficiently a system can be scaled at runtime, in terms of scaling speed and performance impact on the concurrent workloads.
Experiment reproduction. In section 4, we reproduce performance and scalability benchmarking experiments that were originally conducted by Rabl, et al. [14] for evaluating distributed database systems in the context of Enterprise Application Performance Management (APM) on virtualized infrastructure. In section 5, we discuss the problem of selec ...
Comparative study of NoSQL document and column-store databases and evaluation o... (IJDMS)
In the last decade, rapid growth in mobile applications, web technologies, and social media generating unstructured data has led to the advent of various NoSQL data stores. Demands for web scale are increasing every day, and NoSQL databases are evolving to meet stern big data requirements. The purpose of this paper is to explore NoSQL technologies and present a comparative study of document and column-store NoSQL databases such as Cassandra, MongoDB, and HBase across various attributes of relational and distributed database system principles. A detailed study and analysis of the architecture and internal workings of Cassandra, MongoDB, and HBase is done theoretically, and the core concepts are depicted. This paper also presents an evaluation of Cassandra for an industry-specific use case, and the results are published.
Performance Analysis of HBase and MongoDB (Kaushik Rajan)
Comparison of different NoSQL databases, namely HBase and MongoDB, at different workloads using the Yahoo Cloud Serving Benchmark (YCSB).
Tools used:
> HBase, MongoDB, Shell Scripting, YCSB, Hadoop Environment
> Tableau for Visualization
> LaTeX for documentation
Big Data Frameworks: Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase Clients – Examples – Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – Developing and Testing Pig Latin Scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries
A NOVEL APPROACH FOR A HOTEL MANAGEMENT SYSTEM USING CASSANDRA (ijfcstjournal)
Apache Cassandra is a distributed storage system for managing very large amounts of structured data. Cassandra provides a highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes, possibly spread across different data centers, in which small and large components fail continuously. Cassandra manages persistent state in the face of these failures, which drives the reliability and scalability of the software systems built on it. Cassandra resembles a database and shares many design and implementation strategies with one, but it does not support a full relational data model. This paper discusses an implementation of Cassandra as a hotel management system application. The Cassandra system was designed to run on cheap commodity hardware, and it provides high write throughput and read efficiency.
DSM Project: HBase and Cassandra
Comparison of HBase and Cassandra: The Two NoSQL Databases
Shantanu Deshpande
x18125514
Abstract:
The recent years have seen a rapid growth in the digital world, and it has resulted in increased data complexity in terms of volume, velocity, and variety, termed Big Data. For instance, nowadays, social media websites generate terabytes and petabytes of information on a daily basis, which needs to be collected and effectively managed in real time. The rate at which read-write operations are performed is immense, with expectations of even faster retrieval and loading. Traditional methods like SQL are incapable of processing this new generation of data because they lack the high scalability and elasticity that such data demands. Of late, NoSQL has surged in popularity, as it is claimed to perform better than the traditional methods. Two widely popular NoSQL databases are HBase and Cassandra. In this paper, we examine the performance of these two databases and compare the results obtained through different operations on the Ubuntu interface.
Keywords: Big Data, SQL, NoSQL, HBase, Cassandra, Ubuntu
1) Introduction:
Due to the advent of digital age and the growing number of internet users worldwide, there
has been an astounding increase in the data across the globe. One such example is Internet
of Things which performs real-time analysis and continuously gathers data through its
sensors.
Managing all this data is a complex task and a challenge for the companies that own it
and need it processed further. Previously, organizations could maintain their data with
the help of relational database management systems; however, as the load kept
increasing, processing time grew significantly, resulting in high latency in query
processing, the data transmission rate went down significantly, and horizontal
scalability remained poor. This had an adverse impact on the associated cost of data
processing, increasing company overheads while still delivering poor performance. As a
result of these drawbacks of relational database systems, NoSQL was introduced around a
decade ago. Its characteristics include design simplicity, simpler "horizontal" scaling
to machine clusters and improved performance owing to its node-to-node architecture.
In terms of structured storage, the relational SQL database can be viewed as a subset of
NoSQL. Unlike the vertical scalability scheme of traditional databases, NoSQL's
horizontal scaling results in lower maintenance costs. (Anon., n.d.)
The four types of NoSQL databases are –
Column - There is only one column of data in each storage block. Ex. Cassandra and HBase
Document - The document-oriented system relies on the document's internal structure
to extract metadata for further optimization. Ex. MongoDB
Graph - A database that depicts and stores data with nodes, edges and properties using
semantic graph structures. Ex. Neo4j
Key-Value - A storage, retrieval and management paradigm for associative arrays,
commonly known as hash tables. Ex. Amazon S3.
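The key-value paradigm above can be sketched in a few lines of Python; the class and method names here are ours, purely for illustration:

```python
# Minimal illustration of the key-value paradigm: an associative
# array (hash table) mapping opaque keys to opaque values.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # a Python dict is itself a hash table

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Alice"})
print(store.get("user:42"))  # {'name': 'Alice'}
```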
2) Key Characteristics:
2.1 HBase:
HBase is an open-source project built on top of Hadoop file system. It is a distributed
column-oriented database and is horizontally scalable. HBase is not a relational data store
hence it does not support a structured query language like SQL. Much like a traditional
database, HBase also comprises tables that contain rows and columns, and a table must
define one element as the primary key.
The key characteristics of HBase are-
• Consistency: HBase is suitable for high-speed requirements, as it provides
consistent read-write operations.
• Sharding: It is the process of dividing a logical database into smaller, more
manageable parts called data shards. This process reduces I/O time and
overhead. The split can be done either automatically or manually at a threshold size.
• Atomic Read and Write: While the system is processing one read or write operation,
all other processes are prevented from performing another read or write operation.
This is known as atomic read/write. HBase performs this on a row level.
• High Availability: HBase supports recovery and failover across both LAN and WAN.
Basically, at the core it has a master server, which handles the metadata for the
cluster as well as monitors the region servers.
• It also has an effortless Java API for the client.
Based on the above characteristics, HBase is ideal wherever write-heavy operations are
required. It is also used where quick random access to the available data needs to
be provided.
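The atomic row-level read/write behaviour described above can be mimicked with a per-row lock. This is an illustrative Python sketch of the idea, not HBase code; all names are ours:

```python
import threading

# Sketch of per-row atomicity: operations on the SAME row are
# serialized by that row's lock, while operations on different
# rows can proceed in parallel.
class RowStore:
    def __init__(self):
        self._rows = {}
        self._locks = {}
        self._registry_lock = threading.Lock()

    def _lock_for(self, row_key):
        # Create each row's lock exactly once.
        with self._registry_lock:
            return self._locks.setdefault(row_key, threading.Lock())

    def put(self, row_key, column, value):
        with self._lock_for(row_key):  # atomic write at row level
            self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        with self._lock_for(row_key):  # atomic read at row level
            return dict(self._rows.get(row_key, {}))
```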
2.2 Cassandra:
Cassandra is an open-source, distributed and decentralized system that has been
designed to manage humongous amounts of data. It provides a highly available service
with no single point of failure.
The key characteristics of Cassandra are-
• Always on architecture: It does not have a single point of failure thus ensuring that
no critical business application fails.
• Flexible data storage: Cassandra can accommodate any possible type of data i.e. the
data can be either structured, semi-structured or unstructured. According to the
requirement it can accommodate changes to the data structures.
• Data distribution: Data is replicated across multiple data centres; Cassandra thus
provides the flexibility to distribute data as and where it is required.
• Elastic scalability: It is one of the key characteristics of Cassandra. It is possible to
easily scale-up or scale-down the cluster, as it provides the flexibility for deletion and
addition of any number of nodes without any disruptions.
• Faster linear-scale performance: Cassandra achieves and maintains quick response
times because throughput increases as the number of nodes increases.
• Tunable Consistency: Cassandra offers two types of consistency, strong consistency
and eventual consistency. Under eventual consistency, a write is acknowledged to the
client as soon as the cluster accepts it. Strong consistency, on the other hand,
ensures that any update is transmitted to all nodes or machines holding the
affected data.
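The two consistency modes can be contrasted with a toy ack-counting model. This is our illustrative sketch, not Cassandra driver code; the function names are ours, while the level names ONE, QUORUM and ALL mirror Cassandra's terminology:

```python
# Toy model of tunable consistency: a write is acknowledged to
# the client once enough replicas have accepted it.
def required_acks(level, replication_factor):
    """Replica acks needed for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError("unknown level: " + level)

def write_succeeds(level, replication_factor, live_replicas):
    """A write succeeds iff enough replicas are alive to ack it."""
    return live_replicas >= required_acks(level, replication_factor)

# With replication factor 3 and one node down:
print(write_succeeds("QUORUM", 3, 2))  # True  (2 of 3 acks suffice)
print(write_succeeds("ALL", 3, 2))     # False (all 3 acks required)
```

This trade-off is exactly why stronger levels reduce availability: the more acks a write must wait for, the fewer node failures it can tolerate.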
3) Architecture:
3.1 HBase:
There are three important components in HBase architecture, HMaster, Zookeeper and
Region Server.
HMaster: HBase HMaster does the task of assigning the regions to region servers in the
Hadoop cluster for uniform load balancing.
Region Server: They are the worker nodes that handle the transactional queries like read,
write, update and delete from the clients. This process runs on every node within the
Hadoop cluster.
ZooKeeper: It is a centralized monitoring server that performs region assignment and
recovers from region server crashes by reassigning the affected regions to other
working region servers.
3.2 Cassandra:
Cassandra is designed to handle large data workloads across multiple nodes with no
single point of failure. The architecture is based on the understanding that both
hardware and system failures do occur. Cassandra addresses the issue of failures by
using a peer-to-peer distributed system across homogeneous nodes, where data is
distributed among all cluster nodes. All nodes within a cluster play a similar role.
Each node is interconnected with the other nodes yet is also independent.
Key Structure:
Node: Here the data is stored and is the basic infrastructure component of Cassandra.
Datacentre: A collection of related nodes is termed as datacentre.
Cluster: It contains one or more datacentres.
Commit log: Complete data is first written on the commit log. Once the data is transferred
to SSTable, then it is either deleted, archived or recycled.
SSTable: A sorted string table (SSTable) is an unchangeable data file that Cassandra
periodically writes memtables to.
CQL Table: A collection of columns ordered by table row. A table is made up of
columns and has a primary key.
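The commit log / SSTable interplay described above can be summarized with a simplified write-path sketch. The class name and flush threshold are ours, purely for illustration:

```python
# Simplified write path: append to the commit log, update the
# in-memory memtable, and flush the memtable to an immutable,
# sorted SSTable once it grows large enough.
class TinyWritePath:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # durable, append-only log
        self.memtable = {}     # recent writes, in memory
        self.sstables = []     # immutable sorted runs on "disk"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first
        self.memtable[key] = value            # 2. then memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. persist the memtable as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # log entries can now be recycled

wp = TinyWritePath()
for k in ["c", "a", "b"]:
    wp.write(k, k.upper())
print(wp.sstables)  # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
```

Note how the keys come out sorted regardless of insertion order; that sorted, immutable property is what makes SSTables cheap to merge and scan.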
4) Comparison between the two:
For the purpose of designing distributed database systems, the CAP theorem makes
designers aware of the trade-offs that need to be considered beforehand. The
theorem applies to distributed systems that store data, and CAP stands for Consistency,
Availability and Partition tolerance. Its key point is that such a database can
achieve at most two of the three properties at once. Here, we have compared HBase and
Cassandra in terms of their scalability, availability, reliability and security.
Scalability:
A database's scalability is characterized by its capacity to handle large amounts of
information while maintaining high execution efficiency. HBase is highly scalable, as
data is distributed evenly across the tables as it grows in the database; this is
supported well because HBase is modelled on Google's Bigtable. Dynamic table
distribution can be observed in HBase. Horizontal scalability is achieved over the
region servers, which act as slaves in the cluster. The region is termed the
basic unit of horizontal scalability in HBase.
Regions are a subset of data from a table: a contiguous, sorted range of rows that
are stored together. Initially, a table has only one region. Once the number of rows
increases and the region becomes large, it is split into two at the middle key,
thereby creating two almost equal halves.
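The middle-key split can be sketched as follows; `max_rows` stands in for HBase's size threshold and is our own simplification:

```python
# Illustrative region split: a region is a contiguous, sorted
# range of row keys; past a threshold it splits at the middle
# key into two nearly equal halves.
def split_region(rows, max_rows):
    """rows must be sorted; returns one or two regions."""
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2          # the middle key's position
    return [rows[:mid], rows[mid:]]

region = ["row%03d" % i for i in range(10)]
left, right = split_region(region, max_rows=8)
print(left[-1], right[0])  # row004 row005
```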
In the case of Cassandra, the database is linearly scalable: scalability can be
increased simply by adding new nodes. Cassandra can be scaled horizontally, by adding
more nodes, or vertically, by upgrading the hardware of existing nodes.
Availability:
Availability of a database means that any request given to the database as an input
should receive a response from the system, whether success or failure. It also refers
to the accessibility of data even in case of failure of server or data nodes in the
cluster. If the database has high availability, the client faces fewer interruptions
in the event of server failure. HBase has a master-slave relationship just like HDFS;
however, it can also run multiple HMasters, so that even if one of the masters fails
to communicate, data transmission is not halted. This would no doubt create
inconsistency in the data but, as explained above, under the CAP theorem it is fine
to proceed as long as any two of the three parameters are fulfilled. Cassandra, in
contrast, has no master-slave relationship: all nodes are equal and there is no master
node controlling the others, which avoids a single point of failure. Cassandra also
provides replication, meaning that even if a node within the cluster goes down, one or
more copies are available on different machines within the cluster. Source: (Anon., n.d.)
Reliability:
Reliability of a database is measured by its performance in terms of its deliverables,
which should ideally match the defined specifications. A highly reliable system is one
that shows the same or better performance even in the event of environment changes or
faults in the system. For HBase, ZooKeeper assures reliability: znodes act as
subordinates, and once ZooKeeper receives a request from the client it is served
across the region servers, with data stored across various levels. Various experiments
have also observed that HBase's performance efficiency increases as the workload
increases, thus assuring higher reliability. In the case of Cassandra, the distributed
ring structure and replication across nodes likewise make it a reliable system.
Security:
HBase:
The key security features available in HBase, according to (Anon., n.d.) are-
1. Authentication:
For gaining secure access to a database, it is essential that clients authenticate
with the server to establish credentials. The various options for authentication are-
• Client authentication: There are numerous security protocols for allowing clients to
authenticate with the database. For HBase they are - Kerberos, SSL.
• Server Authentication: Database servers must also authenticate
with each other to ensure a secure operating environment. In HBase,
a shared keyfile is one such method.
2. Role-Based Security:
Role-based security considerably simplifies the administration and operation of
security. HBase provides various security role features, such as custom roles
and default roles, to support ease of administration. It is also important to
define the scope of roles, which is useful for systems that normally handle
extremely sensitive data.
3. Database Security:
HBase supports database encryption and it is highly important to encrypt the data in
sensitive application domains. Logging is also essential for recording all the activities
and interaction of clients with the system for auditing and detailed investigations.
The administrator is able to define which security groups are to be logged. In HBase,
fixed event logging and configurable event logging are the supported logging options.
Cassandra:
According to (Anon., n.d.) the three main components of the security features furnished by
Cassandra are –
1. TLS/SSL encryption for client and inter-node communication.
Cassandra offers two encryption options, which are managed separately and must be
configured independently: client-to-node encryption and node-to-node encryption. When
encryption is enabled, the JVM defaults for supported protocols and cipher suites are
used. Although these can be overridden through settings, this is not recommended
unless specific settings need to be configured as per a certain policy.
2. Client Authentication
Authentication is configured in Cassandra using the ‘authenticator’ setting in
cassandra.yaml. Under the default settings, the system performs no
authentication checks and thereby requires no credentials. A PasswordAuthenticator,
which stores encrypted credentials, is also included in the package.
3. Authorization
As with authentication, there are two options for authorization. By default, no check
is performed, thus allowing all permissions to all roles. Cassandra also includes
CassandraAuthorizer, which provides full permission management, with the related
data stored in Cassandra system tables.
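The settings discussed above live in cassandra.yaml; a sketch of the relevant excerpt is shown below. Keystore paths and passwords are placeholders, and defaults may differ across Cassandra versions:

```yaml
# authentication / authorization (defaults are AllowAll*)
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

# node-to-node encryption
server_encryption_options:
  internode_encryption: all
  keystore: conf/.keystore          # placeholder path
  keystore_password: changeit       # placeholder

# client-to-node encryption (managed separately, as noted above)
client_encryption_options:
  enabled: true
  keystore: conf/.keystore          # placeholder path
  keystore_password: changeit       # placeholder
```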
5 Learning’s from Literature Review:
With the development of the Internet and cloud computing, databases need to be able
to effectively store and process big data, demanding high performance when reading
and writing, while the traditional relational database confronts many new challenges.
(Han, 2011)
Especially in large-scale and highly competitive applications such as search engines
and SNS, the relational database appeared inadequate for storing and querying dynamic
user data. The NoSQL database was created for this case. With the exponential growth
in global data generation, the demands on database technology grew significantly:
reading and writing simultaneously with low latency, efficient support for large data
storage and access, improved scalability and high availability, and lower operating
and management costs. These were some of the key limitations of traditional
relational databases. To overcome them, NoSQL has emerged as an alternative
paradigm for this new non-relational data schema (Dede, 2013). NoSQL database features
described above are common; in reality, each product is compliant with the various data
models and the CAP theorem. CAP theorem stands for Consistency, Availability and
tolerance of network Partition. The core idea of CAP theorem is that a distributed system
cannot simultaneously meet the three needs but can only meet two (Han, 2011). Depending
on the project requirements, different storages offer different consistency levels. These
options enable users to choose various trade-offs like availability, latency and consistency.
(Kumar, 2014). Therefore, in order to understand which system would be better, it is
essential to assess the performance of each storage system so as to judge the appropriate
storage type for a particular application. In this paper (Abubakar, 2014), the author
introduces YCSB, an open-source tool provided by Yahoo that allows
benchmarking multiple systems and comparing them by creating workloads. Distributed
systems are often more complicated than their single-node counterparts due to the
trade-offs that need to be balanced as per the application's requirements. The author
attempts to extend YCSB so that it can measure stale reads in real time. One can use
the model created in this paper to calculate the trade-offs between availability,
latency and consistency.
According to (Dede, 2013), Internet applications are rapidly increasing, generating
enormous amounts of data. To store such humongous amounts of data, NoSQL database
systems like HBase and Cassandra are widely used by many organizations as their
storage solution. The author tests the Cassandra database based on its performance,
discussing how Cassandra's different features, such as replication and data
partitioning, affect the performance of Apache Hadoop. A test model is then
introduced that carries out testing on the basis of the system's performance and
ensures that the architecture and its business context are considered while
conducting the tests. Finally, these tests are applied at the architecture level
based on performance, covering performance-based elements such as the column-oriented
data model, the split-mechanism data model and the data replication factor. A test
procedure is performed and a test scenario designed for each performance element. Due
to the continuous development of cloud computing, non-structural data storage is also
steadily increasing. The schema evaluation was divided into a separate unit known as
a schema analyser, which therefore does not have to rely on web applications and can
be connected to visual tools.
In another study (Tang, 2016), the performance of five NoSQL databases, including
Cassandra and HBase, was compared using YCSB (Yahoo Cloud Serving Benchmark).
The experiment involved three different workloads: Workload A (50% read and 50% write),
Workload C (100% read) and Workload H (100% write). These workloads were
performed with around 10,000 operations out of the 100,000 loaded operations. Of the
two experiments conducted, the first measured the total time taken by these databases
against all three workloads. Redis turned out to be superior to the other databases,
as it took less time to load and execute the data; compared with Cassandra and HBase,
it was 1.43 and 3.61 times faster respectively. The second experiment measured
throughput. Notably, all five databases showed a similar trend in this experiment,
and here as well, Redis performed significantly better than the others. In this case,
Cassandra achieved a greater throughput than HBase. The experiments showed that the
Redis database is better suited to loading and executing the workloads, and this
study thereby served as a motivation for our own. In the following section, we will
examine how these experiments relate to the study performed in this paper.
6) Performance Test Plan:
For the execution of the process and the subsequent comparison of the two databases,
we first created an instance on OpenStack, which is hosted on the cloud. We then
assigned a floating IP to this instance to get access to the Ubuntu system, and a
keypair was generated with authorized keys in the ssh directory. Next, we installed
Hadoop along with HBase. To initiate the Hadoop installation, we first installed Java
version 8 and created a group named hadoop and a user named hduser. We then disabled
IPv6, downloaded Hadoop, unzipped the file and assigned hduser to the Hadoop
directory by creating a symbolic link. The various configuration files, namely
hadoop-env.sh, core-site.xml and hdfs-site.xml, were edited according to the manual.
Thereafter, we formatted the name node and started dfs and yarn.
After the successful installation of Hadoop, we installed HBase. Similar to the Hadoop
process, we downloaded HBase from the website, unzipped it and established a symbolic
link. We then edited the hbase-env.sh file, started HBase and created a user table.
YCSB, a benchmarking tool, was then installed on the system by referring to the lab
manual. A test harness had already been created in YCSB specifying the workload type,
operation count, database type, etc., and the files in the test harness were updated
as per our requirements. The workload types considered were Workload A and Workload C,
and three operation counts were considered, 100000, 150000 and 200000, for both HBase
and Cassandra. Workload A is a combination of 50% reads and 50% writes, whereas
Workload C is 100% reads. The process was run three times using the command
runtest.sh. Following this, Cassandra was downloaded and installed on the system by
following the guidelines given in an online manual (Anon., n.d.). The files in the
test harness were modified as required for Cassandra and a similar activity was
performed. After the successful completion of both HBase and Cassandra runs, the
average of the output was evaluated.
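For reference, a YCSB core-workload properties file matching Workload A at the first operation count looks roughly like this; the property names are YCSB's own, while the exact values mirror our setup and are otherwise illustrative:

```
recordcount=100000
operationcount=100000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```

Such a file is typically passed to YCSB with `bin/ycsb load <binding> -P <file>` to load the records, followed by `bin/ycsb run <binding> -P <file>` to execute the operations, where `<binding>` names the database connector in use.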
Device Specifications:
• Sony Vaio Fit 14 SVF14A15SNB
• 8GB RAM
• Intel Core I5 (3rd Generation)
• 1.8 GHz with Turbo Boost up to 2.7 GHz
• 1TB HDD
Databases:
• HBase
• Cassandra
Workload Type:
• Workload A: 50% read and 50% write
• Workload C: 100% read
Operating Environment:
OpenStack
• Name: m1.medium
• VCPU’s: 2
• RAM: 4GB
• Disk size: 40GB
• MSc data-net
7) Evaluation and Results:
Here, we have performed two workload tests, Workload A and Workload C against our two
databases, HBase and Cassandra using YCSB as the benchmarking tool. Following are the
test specifications:
Workload A:
1. Read: 50 %
2. Update: 50 %
Workload C:
1. Read: 100 %
7.1 Workload A Results:
7.1.1 Average Insert latency vs. overall throughput
Database            Workload A Count   [OVERALL] Throughput (ops/sec)   [INSERT] Average Latency (us)
Cassandra Count 1   100000             1830.161054                      471.04802
Cassandra Count 2   150000             2207.667967                      405.07366
Cassandra Count 3   200000             2472.157328                      361.238945
HBase Count 1       100000             1907.632437                      432.87972
HBase Count 2       150000             2395.821687                      394.2167
HBase Count 3       200000             2363.256094                      395.25737
• Here, we are comparing the average insert latency with the overall throughput.
• If the database latency is lower, then we can say that the database performance is
good.
• The latency of HBase is less than Cassandra for a lower count but as the data size
increases, the latency rate of Cassandra drops below that of HBase.
7.1.2 Average Update Latency vs. Update operations
Database            Workload A Count   [UPDATE] Operations   [UPDATE] Average Latency (us)
Cassandra Count 1   100000             50118                 405.3865677
Cassandra Count 2   150000             75256                 394.719039
Cassandra Count 3   200000             99707                 335.3669451
HBase Count 1       100000             49972                 387.6640519
HBase Count 2       150000             75369                 373.7879101
HBase Count 3       200000             100005                383.7894005
• Here, we are comparing average update latency with update operations.
• As the workload increases and the number of update operations grows, the latency
of HBase increases whereas that of Cassandra decreases significantly.
• This shows Cassandra performs better for update operations when the workload is
high.
7.1.3 Read operations vs. Avg. Read latency
Database            Workload A Count   [READ] Operations   [READ] Average Latency (us)
Cassandra Count 1   100000             49882               499.271621
Cassandra Count 2   150000             74744               526.7530905
Cassandra Count 3   200000             100293              439.4176164
HBase Count 1       100000             50028               334.307208
HBase Count 2       150000             74631               314.4751645
HBase Count 3       200000             99995               325.520106
• Here, we compare the Read operations with the average read latency.
• From this graph, we can interpret that the average latency for HBase is consistent
even with the increase in workload whereas for Cassandra, as the workload
increases beyond count 2, the latency rate drops significantly.
7.2 Workload C
7.2.1 Average Insert latency vs. overall throughput
Database            Workload C Count   [OVERALL] Throughput (ops/sec)   [INSERT] Average Latency (us)
Cassandra Count 1   100000             2072.023538                      412.90974
Cassandra Count 2   150000             2202.610828                      406.38106
Cassandra Count 3   200000             2521.718299                      360.39823
HBase Count 1       100000             2312.726936                      401.83924
HBase Count 2       150000             2421.893921                      395.4957067
HBase Count 3       200000             2276.789272                      410.009585
• Here, we compare Average Insert latency with overall throughput for Workload C.
• It can be observed from the graph that the latency for Cassandra is lower than
that of HBase in all three cases, and that it decreases as the workload increases.
7.2.2 Average Read Latency(us) vs. Overall Throughput(ops/sec)
Database            Workload C Count   [OVERALL] Throughput (ops/sec)   [READ] Average Latency (us)
Cassandra Count 1   100000             2058.248431                      416.41129
Cassandra Count 2   150000             2350.581377                      378.94206
Cassandra Count 3   200000             2494.636531                      366.025005
HBase Count 1       100000             3037.667072                      283.80343
HBase Count 2       150000             3758.833258                      254.5179933
HBase Count 3       200000             4144.734115                      221.76259
• Here, we have compared Average Read Latency with Overall Throughput for
Workload C.
• From the graph it is visible that the start count has the maximum latency rate for both the
databases, HBase and Cassandra, although as the workload increases, the latency rate for
both the databases drops significantly.
8) Conclusions and Discussion:
In this paper, we have explained the underlying concepts of the HBase and Cassandra
databases. The benchmarking tool used for the comparison is the Yahoo! Cloud Serving
Benchmark (YCSB), employed to determine which database performed better under
different workload scenarios. The same operation counts were provided to each
database: 100000, 150000 and 200000. We used two types of workloads, A and C:
Workload A comprises 50% read and 50% write operations, while Workload C comprises
100% read operations. Upon visualizing the data in Tableau, we found that the latency
behaviour of HBase differs from that of Cassandra. Although in both databases the
latency rate decreases as the workload increases, this rate of decrease is greater
for Cassandra than for HBase. In Workload A, as the update operations increase, the
average latency of Cassandra drops below the HBase latency rate. Overall, we can
observe that for higher workloads the performance of Cassandra is better than that of
HBase, and we can recommend Cassandra for higher workload requirements. Also, since
all the benchmarking parameters were available in the YCSB tool, we can say that it
is a great tool for benchmarking several NoSQL databases in a cloud environment.
Bibliography
Abubakar, Y., 2014. Performance Evaluation of NoSQL Systems using YCSB in a Resource
Austere Environment. ResearchGate.
Anon., n.d. An Evaluation of Cassandra for Hadoop. [Online]
Available at: http://sci-hub.tw/https://ieeexplore.ieee.org/abstract/document/6676732
[Accessed 2019].
Anon., n.d. Cassandra. [Online]
Available at: https://www.rapidvaluesolutions.com/tech_blog/cassandra-the-right-data-
store-for-scalability-performance-availability-and-maintainability/
[Accessed 2019].
Anon., n.d. Cassandra Installation. [Online]
Available at: https://www.vultr.com/docs/how-to-install-apache-cassandra-3-11-x-on-
ubuntu-16-04-lts
[Accessed 2019].
Anon., n.d. Cassandra-Security. [Online]
Available at: http://cassandra.apache.org/doc/latest/operating/security.html
[Accessed 2019].
Anon., n.d. HBase security features. [Online]
Available at: https://quabase.sei.cmu.edu/mediawiki/index.php/HBase_Security_Features
[Accessed 2019].
Dede, E., 2013. An Evaluation of Cassandra for Hadoop. IEEE.
Han, J., 2011. Survey on NoSQL database. IEEE.
Kumar, S. P., 2014. Evaluating consistency on the fly using YCSB. IEEE.
Tang, E., 2016. Performance Comparison between Five NoSQL Databases. IEEE.