Cassandra is an open source, distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability and performance, and a flexible schema. Large companies such as Facebook, Netflix and eBay use Cassandra because it scales and performs well under heavy load. However, it may not suit applications that require many joins, transactions or strong consistency guarantees.
Agenda
• Introduction
• Where Did Cassandra Come From?
• Why Cassandra?
• Data Model
• CQL (Cassandra Query Language)
• Who Uses Cassandra?
• MySQL Comparison
• Strengths
• Weaknesses
Introduction
Open source distributed database management system for handling huge amounts of data across many commodity systems.
Cassandra is a “NoSQL” or “Non-Relational” database and can be described as:
• Scalable, fault-tolerant, and consistent.
• A column-oriented database.
Where Did Cassandra Come From?
• Cassandra was initially created at Facebook.
• It combines ideas from Google Bigtable and Amazon Dynamo.
• It was created to power the “Inbox Search” feature.
• Cassandra was released as open source in July of 2008.
• It became an Apache Incubator project in February of 2009 and a top-level project a year after that.
Why Cassandra?
• Gigabyte to petabyte scalability
• No single point of failure
• Data distribution and decentralization
• Data replication
• High performance
• Elastic scalability
• Fault tolerant
• Flexible schema design
• Data compression
• CQL language (like SQL)
• No need for special hardware or software
Distributed & Decentralized
● Distributed: Capable of running on multiple machines
● Decentralized: No single point of failure
● No master-slave issues due to the peer-to-peer architecture (nodes communicate using the "gossip" protocol)
● Read and write requests can be sent to any node
Elastic Scalability
● Cassandra scales horizontally by adding more machines.
● Adding nodes increases performance throughput linearly.
● The node count can be decreased or increased seamlessly.
● Scales linearly to terabytes and petabytes of data
High Availability & Fault Tolerance
● Multiple networked computers operate in a cluster.
● Cassandra uses the Gossip Protocol to detect node failures.
● Requests to a failed node fail over to another part of the system.
● No single point of failure due to the peer-to-peer architecture
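The gossip-based failure detection described above can be illustrated with a small sketch. It is a simplification under stated assumptions: the class `GossipNode`, its methods, and the fixed `timeout` are hypothetical, and Cassandra's own failure detector is more sophisticated than a fixed timeout.

```python
from dataclasses import dataclass, field


@dataclass
class GossipNode:
    """Toy node that tracks the latest heartbeat it has seen for each peer."""
    name: str
    # peer name -> (heartbeat counter, local time the counter last advanced)
    view: dict = field(default_factory=dict)

    def beat(self, now: float) -> None:
        """Advance this node's own heartbeat counter."""
        counter, _ = self.view.get(self.name, (0, now))
        self.view[self.name] = (counter + 1, now)

    def receive(self, remote_view: dict, now: float) -> None:
        """Merge a peer's view, keeping the highest counter per node."""
        for peer, (counter, _) in remote_view.items():
            known, _ = self.view.get(peer, (-1, now))
            if counter > known:
                self.view[peer] = (counter, now)

    def suspected_down(self, now: float, timeout: float) -> list:
        """Peers whose heartbeat has not advanced within `timeout` seconds."""
        return sorted(
            peer for peer, (_, seen) in self.view.items()
            if peer != self.name and now - seen > timeout
        )


a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")

# t=0: every node beats; b and c gossip their state to a.
for node in (a, b, c):
    node.beat(now=0.0)
a.receive(b.view, now=0.0)
a.receive(c.view, now=0.0)

# t=5: c has crashed and stays silent; a and b keep beating and gossiping.
a.beat(now=5.0)
b.beat(now=5.0)
a.receive(b.view, now=5.0)

# Assumed fixed timeout of 3 seconds (hypothetical value).
down = a.suspected_down(now=6.0, timeout=3.0)
print(down)
```

Because every node keeps its own view and merges what peers tell it, no central coordinator is needed to notice that a node has gone quiet.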
Data Replication
● In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
● No single point of failure due to replicated data
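One way to picture how several nodes come to act as replicas for a piece of data is a token ring: each node owns a position on the ring, and a key is stored on the first node at or after its own position plus the next few nodes clockwise. The sketch below assumes hypothetical node names, an MD5-based `token` function chosen only for illustration, and a simple clockwise placement; Cassandra's own partitioner and replica placement strategies differ in detail.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring with one token per node."""

    def __init__(self, nodes, replication_factor):
        self.replication_factor = replication_factor
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str) -> list:
        """Return the nodes that hold copies of `partition_key`."""
        start = bisect.bisect_left(self.tokens, token(partition_key))
        count = min(self.replication_factor, len(self.entries))
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(count)
        ]


nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
ring = Ring(nodes, replication_factor=3)
owners = ring.replicas("user:42")
print(owners)
```

With a replication factor of 3, losing any single node still leaves two copies of every key reachable.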
Components of Cassandra
The key components of Cassandra are as follows −
Node − The place where data is stored.
Data center − A collection of related nodes.
Cluster − A component that contains one or more data centers.
Commit log − A crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
Mem-table − A memory-resident data structure. After the commit log, the data is written to the mem-table.
SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
Bloom filter − A quick, probabilistic test of whether an element is a member of a set; it acts as a special kind of cache. Bloom filters are checked on reads to skip SSTables that cannot contain the requested data.
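The way these components cooperate on a write and a read can be sketched as a toy in-memory store. Everything here is a simplification under stated assumptions: `ToyStore`, `BloomFilter`, the threshold of two entries and the filter sizes are hypothetical, and "SSTables" are plain dictionaries rather than disk files.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size: int = 256, hashes: int = 3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class ToyStore:
    def __init__(self, flush_threshold: int = 2):
        self.flush_threshold = flush_threshold
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in-memory key -> value
        self.sstables = []    # (bloom filter, frozen dict) pairs, oldest first

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. log for crash recovery
        self.memtable[key] = value            # 2. update the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush at the threshold

    def _flush(self) -> None:
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first; the Bloom filter lets most misses be skipped.
        for bloom, data in reversed(self.sstables):
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None


store = ToyStore(flush_threshold=2)
store.write("k1", "v1")
store.write("k2", "v2")  # reaches the threshold and triggers a flush
store.write("k3", "v3")  # stays in the mem-table
print(store.read("k1"), store.read("k3"), store.read("missing"))
```

The Bloom filter check is cheap compared with searching an SSTable, which is why it sits in front of the lookup.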
Data Model
• Cluster: A Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster.
• Keyspace: The outermost container for data within a cluster.
• Column Families: Represent the structure of data. Each keyspace has at least one and often many column families.
• Two types of Column Families
– Simple
– Super (nested Column Families)
• Column: The basic data structure of Cassandra, with three values.
• Each Column has
– Key (the column name)
– Value
– Timestamp
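The nesting of keyspace, column family, row and column can be sketched with plain dictionaries. The names `Column`, `put` and the sample data are hypothetical. The sketch also shows the usual role of the timestamp, which is to decide which write is the newest; resolving ties in favour of the later call is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    key: str        # column name
    value: str
    timestamp: int  # write time supplied by the writer


def put(keyspace: dict, family: str, row_key: str, column: Column) -> None:
    """Store a column, keeping the copy with the newest timestamp.

    Layout: keyspace -> column family -> row key -> column key -> Column
    """
    row = keyspace.setdefault(family, {}).setdefault(row_key, {})
    current = row.get(column.key)
    if current is None or column.timestamp >= current.timestamp:
        row[column.key] = column


app = {}  # a keyspace
put(app, "users", "u1", Column("name", "Ada", timestamp=100))
put(app, "users", "u1", Column("phone", "555-0100", timestamp=100))
put(app, "users", "u1", Column("name", "Ada L.", timestamp=200))
put(app, "users", "u1", Column("name", "stale", timestamp=150))  # older, ignored

print(app["users"]["u1"]["name"].value)
```

Rows in the same column family do not need to carry the same set of columns, which is what makes the model flexible.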
CQL (Cassandra Query Language)
• CQL is very similar to SQL (Structured Query Language) in terms of syntax and commands.
• CQL treats the database (Keyspace) as a container of tables. All statements end with a semicolon.
• cqlsh: a prompt for working with CQL; CQL can also be used through separate application language drivers. Using cqlsh, we can:
– define a schema,
– insert data, and
– execute a query.
Relational Model    Cassandra Model
Database            Keyspace
Table               Column Family (CF)
Primary key         Row key
Column name         Column name/key
Column value        Column value
Examples Using CQL
The following slides demonstrate different cases with different CQL interfaces such as DDL, DML, etc.
Example tables:
User
• Id
• Name
• Phone
• Age
Emails
• Id
• email
Interface DDL
CREATE, ALTER and DROP each apply to:
• Type
• Keyspace, Table
• Index, Trigger
CREATE KEYSPACE - Creates a keyspace in Cassandra.
USE - Connects to a created keyspace.
ALTER KEYSPACE - Changes the properties of a keyspace.
DROP KEYSPACE - Removes a keyspace.
CREATE TABLE - Creates a table in a keyspace.
ALTER TABLE - Modifies the column properties of a table.
DROP TABLE - Removes a table.
TRUNCATE - Removes all the data from a table.
CREATE INDEX - Defines a new index on a single column of a table.
DROP INDEX - Deletes a named index.
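The DDL commands can be seen together in a short script. The sketch below only holds the CQL text as Python strings and does not connect to a cluster; the statements would be run in cqlsh or through a driver. The keyspace name `demo`, the table and index names, the column types and the replication settings are all assumptions chosen for illustration.

```python
# Each entry is one CQL statement; every statement ends with a semicolon.
DDL_STATEMENTS = [
    "CREATE KEYSPACE demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1};",
    "USE demo;",
    "ALTER KEYSPACE demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 3};",
    "CREATE TABLE users (id int PRIMARY KEY, name text, phone text, age int);",
    "CREATE TABLE emails (id int, email text, PRIMARY KEY (id, email));",
    "ALTER TABLE users ADD city text;",
    "CREATE INDEX users_name_idx ON users (name);",
    "DROP INDEX users_name_idx;",
    "TRUNCATE users;",
    "DROP TABLE emails;",
    "DROP KEYSPACE demo;",
]

# Join into a script that could be pasted into cqlsh.
script = "\n".join(DDL_STATEMENTS)
print(script)
```

In the assumed `emails` table, `id` is the partition key and `email` is a clustering column, so one user id can hold several email rows.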
Interface DML
DML statements: SELECT, INSERT, UPDATE, DELETE
INSERT - Adds columns for a row in a table.
UPDATE - Updates a column of a row.
DELETE - Deletes data from a table.
BATCH - Executes multiple DML statements at once.
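As an illustration of the DML commands, the sketch below keeps a few CQL statements as Python strings and wraps them in a batch. It assumes a hypothetical `users` table with columns `id`, `name`, `phone` and `age`; the helper `as_batch` is likewise hypothetical, and nothing is sent to a database.

```python
DML_STATEMENTS = [
    # INSERT adds columns for a row.
    "INSERT INTO users (id, name, phone, age) VALUES (1, 'Ada', '555-0100', 36);",
    # UPDATE changes a column of a row.
    "UPDATE users SET phone = '555-0199' WHERE id = 1;",
    # DELETE can remove a single column...
    "DELETE phone FROM users WHERE id = 1;",
    # ...or the whole row.
    "DELETE FROM users WHERE id = 1;",
]


def as_batch(statements):
    """Wrap DML statements in BEGIN BATCH ... APPLY BATCH."""
    body = "\n".join("  " + s for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;"


batch = as_batch(DML_STATEMENTS[:2])
print(batch)
```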
CQL Clauses
SELECT - Reads data from a table.
WHERE - Used along with SELECT to read specific data.
ORDER BY - Used along with SELECT to read specific data in a specific order.
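The three clauses combine as in the sketch below, which assembles CQL text from its parts. The helper `select` and the tables are hypothetical: `users` is assumed to have partition key `id`, and `emails` is assumed to have partition key `id` with clustering column `email`. In CQL, ORDER BY applies to clustering columns within a partition selected by the WHERE clause. Real applications would normally bind values through a driver rather than build strings.

```python
def select(table, columns, where=None, order_by=None):
    """Assemble a CQL SELECT statement from its clauses."""
    parts = [f"SELECT {', '.join(columns)} FROM {table}"]
    if where:
        parts.append(f"WHERE {where}")
    if order_by:
        parts.append(f"ORDER BY {order_by}")
    return " ".join(parts) + ";"


all_users = select("users", ["id", "name"])
one_user = select("users", ["name", "age"], where="id = 1")
sorted_emails = select("emails", ["email"], where="id = 1", order_by="email DESC")

for statement in (all_users, one_user, sorted_emails):
    print(statement)
```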
MySQL Comparison
                 Cassandra    MySQL
Average Write    0.12 ms      ~300 ms
Average Read     15 ms        ~350 ms
Statistics based on 50 GB of data.
Stats provided by the authors using Facebook data.
Strengths
● Flexible data model
Supports modern data types with fast writes and reads.
● Peer-to-peer architecture
Cassandra follows a peer-to-peer architecture instead of a master-slave architecture.
● Schema-free/Schema-less
In Cassandra, columns can be created at will within the rows. The Cassandra data model is also known as a schema-optional data model.
● AP-CAP
Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra.
● Linear scale performance
The ability to add nodes without failures leads to predictable increases in performance.
● Supports multiple languages
Python, C#/.NET, C++, Ruby, Java, Go, and many more…
● Operational and developmental simplicity
There are no complex software tiers to be managed, so administration duties are greatly simplified.
● Ability to deploy across data centers
Cassandra can be deployed across multiple, geographically dispersed data centers.
● Cloud availability
Cassandra can be installed in cloud environments.
Weaknesses
Use cases where it is better to avoid using Cassandra:
● If too many joins are required to retrieve the data.
● For storing configuration data.
● During compaction, things slow down and throughput degrades.
● Aggregation support is limited; older versions lack basic aggregation operators such as SUM and AVG.
● Range queries on the partition key are not supported.
● If there is transactional data that requires 100% consistency.
● Cassandra can update and delete data, but it is not designed for workloads dominated by updates and deletes.