Apache Cassandra
Database
Agenda
• Introduction
• Where Did Cassandra Come From?
• Why Cassandra?
• Data Model
• CQL (Cassandra Query Language)
• Who Uses Cassandra?
• MySQL Comparison
• Strengths
• Weaknesses
Introduction
An open source distributed database
management system for handling
huge amounts of data across
many commodity servers.
Cassandra is a “NoSQL” or “Non-Relational”
database and can be
described as:
● Scalable, fault-tolerant, and
tunably consistent.
● A column-oriented (wide-column) database.
Where Did Cassandra Come From?
• Cassandra was initially created at Facebook.
• It combines ideas from Google Bigtable and Amazon
Dynamo.
• It was created to power the “Inbox Search”
feature.
• Cassandra was released as open source in
July of 2008.
• It became an Apache Incubator project in
February of 2009 and a top-level Apache
project a year after that.
Why Cassandra?
Gigabyte to Petabyte scalability
No single point of failure
Data distribution & decentralization
Data replication
High performance
Elastic scalability
Fault tolerant
Flexible schema design
Data Compression
CQL (an SQL-like query language)
No need for special hardware or software
Distributed & Decentralized
● Distributed: Capable of
running on multiple machines
● Decentralized: No single point
of failure
● No master-slave issues, because
nodes are peers that exchange state
through the gossip protocol
Read and write requests
can be sent to any node
Elastic Scalability
● Cassandra scales horizontally
by adding more machines.
● Adding nodes increases
throughput roughly
linearly.
● Nodes can be added or removed
without taking the cluster offline.
Scales linearly to terabytes
and petabytes of data
High Availability & Fault Tolerance
● Multiple networked computers
operate together as a cluster.
● Cassandra uses the gossip
protocol to detect node
failures.
● Requests to a failed node fail over
to another part of the system.
No single point of failure
due to the peer-to-peer
architecture
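The gossip-based failure detection described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the `Node` class, the `simulate` function, the ring-order choice of gossip partner, and the fixed timeout are all assumptions made for illustration (Cassandra itself uses an accrual-style failure detector rather than a fixed timeout).

```python
class Node:
    """One peer; every node runs the same code, so there is no master."""

    def __init__(self, name, cluster_names):
        self.name = name
        # Latest heartbeat seen per node: name -> (counter, time it was learned)
        self.seen = {n: (0, 0) for n in cluster_names}

    def beat(self, now):
        """Bump this node's own heartbeat counter."""
        counter, _ = self.seen[self.name]
        self.seen[self.name] = (counter + 1, now)

    def gossip_with(self, peer, now):
        """Exchange state both ways, keeping the higher counter per node."""
        for src, dst in ((self, peer), (peer, self)):
            for name, (counter, _) in list(src.seen.items()):
                if dst.seen[name][0] < counter:
                    dst.seen[name] = (counter, now)

    def suspected_down(self, now, timeout):
        """Nodes whose heartbeat has not advanced within `timeout` rounds."""
        return sorted(name for name, (_, learned) in self.seen.items()
                      if name != self.name and now - learned > timeout)


def simulate(rounds=20, fail_at=5, timeout=6):
    """Run a 4-node cluster in which n3 crashes silently at round `fail_at`."""
    names = [f"n{i}" for i in range(4)]
    nodes = [Node(name, names) for name in names]
    alive = list(nodes)
    for now in range(1, rounds + 1):
        if now == fail_at:
            alive = [n for n in alive if n.name != "n3"]
        for i, node in enumerate(alive):
            node.beat(now)
            # Hypothetical partner choice: the next live node in list order.
            node.gossip_with(alive[(i + 1) % len(alive)], now)
    return nodes[0].suspected_down(rounds, timeout)


if __name__ == "__main__":
    print(simulate())
```

No node is told that n3 failed; n0 infers it only because n3's heartbeat counter stops advancing in the state it receives from its peers.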
Data Replication
● In Cassandra, one or more of the
nodes in a cluster act as replicas
for a given piece of data.
No single point of failure
due to replicated data
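As a rough illustration of how replicas for a piece of data might be chosen, the sketch below places nodes on a hash ring and walks clockwise from the key's position. The `Ring` class, the `token` helper, the use of MD5, and the one-token-per-node layout are assumptions for this example; it mirrors the idea of a simple "next nodes on the ring" placement, not Cassandra's actual partitioner or replication strategies.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is used purely for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, node_names, replication_factor=3):
        self.rf = replication_factor
        # Each node owns one position on the ring, sorted by token.
        self.ring = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next `rf` nodes."""
        start = bisect.bisect_right(self.tokens, token(key))
        count = min(self.rf, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


if __name__ == "__main__":
    names = [f"node{i}" for i in range(1, 7)]
    ring = Ring(names, replication_factor=3)
    owners = ring.replicas("user:42")
    print("replicas:", owners)

    # Lose one replica: the other two still hold the data.
    smaller = Ring([n for n in names if n != owners[1]], replication_factor=3)
    print("after failure:", smaller.replicas("user:42"))
```

Because the surviving replicas keep their place on the ring, losing one node leaves two copies of the data reachable, which is the "no single point of failure" property the slide refers to.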
Components of Cassandra
The key components of Cassandra are as follows −
Node − The place where data is stored.
Data center − A collection of related nodes.
Cluster − A component that contains one or more data centers.
Commit log − A crash-recovery mechanism in Cassandra. Every write
operation is written to the commit log.
Mem-table − A memory-resident data structure. After the commit log, the
data is written to the mem-table.
SSTable − A disk file to which the data is flushed from the mem-table
when its contents reach a threshold value.
Bloom filter − A fast, probabilistic structure for testing whether an
element is a member of a set. It is consulted on reads to skip SSTables
that cannot contain the requested key.
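The write path through these components (commit log, then mem-table, then flush to an SSTable) can be sketched as follows. `TinyStore` and its methods are hypothetical names for a toy, single-node model: real SSTables are on-disk files, and the plain `key in table` check here stands in for the Bloom filter lookup.

```python
class TinyStore:
    """Toy single-node model of the commit log / mem-table / SSTable path."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of unflushed writes
        self.memtable = {}        # in-memory view of the most recent writes
        self.sstables = []        # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log first, for recovery
        self.memtable[key] = value             # 2. then update memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. threshold reached

    def flush(self):
        """Freeze the mem-table into a sorted, immutable table."""
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            if key in table:
                return table[key]
        return None

    def recover(self):
        """After a crash the mem-table is gone; rebuild it from the log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value


if __name__ == "__main__":
    store = TinyStore(memtable_limit=2)
    store.write("a", 1)
    store.write("b", 2)   # second write reaches the limit and triggers a flush
    store.write("a", 3)   # newer value lives only in the log and mem-table
    store.memtable = {}   # simulate a crash that wipes memory
    store.recover()
    print(store.read("a"), store.read("b"))
```

The ordering is the point of the design: because the commit log is written before the mem-table, a crash that wipes memory loses no acknowledged write.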
Data Model
• Cluster: A Cassandra database is distributed over
several machines that operate together. The
outermost container is known as the cluster.
• Keyspace: The outermost container
for data within a cluster.
• Column families: Represent the structure of
data. Each keyspace has at least one and often
many column families.
• Two types of column families
– Simple
– Super (nested column families)
• Column: The basic data structure of Cassandra,
holding three values.
• Each column has
– Key (the column name)
– Value
– Timestamp
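To make the role of the three column values concrete, here is a minimal Python sketch of a simple column family in which the timestamp decides which write wins. The `Column` and `ColumnFamily` classes are hypothetical, and the tie-breaking rule (an equal timestamp keeps the existing value) is a simplification chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str        # the column key
    value: object
    timestamp: int   # supplied by the writer, e.g. microseconds since epoch


class ColumnFamily:
    """Simple column family: row key -> {column name -> Column}."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Last write wins: only a strictly newer timestamp replaces a value.
        if current is None or column.timestamp > current.timestamp:
            row[column.name] = column

    def read(self, row_key, name):
        column = self.rows.get(row_key, {}).get(name)
        return None if column is None else column.value


if __name__ == "__main__":
    users = ColumnFamily()
    users.write("user1", Column("phone", "555-0100", timestamp=10))
    users.write("user1", Column("phone", "555-0199", timestamp=5))  # stale
    users.write("user2", Column("age", 41, timestamp=7))  # different columns
    print(users.read("user1", "phone"), users.read("user2", "age"))
```

Note that the two rows hold different sets of columns, which is what makes the schema flexible, and that the stale write with the older timestamp is ignored even though it arrives later.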
Figures: simple column family; super column family
CQL (Cassandra Query Language)
• CQL is very similar to SQL (Structured Query
Language) in terms of syntax and commands.
• CQL treats the database (keyspace) as a container of
tables. All statements end with a semicolon.
• cqlsh: an interactive prompt for working with CQL; applications
can instead use a separate language driver. Using cqlsh, we can:
• define a schema,
• insert data, and
• execute a query.
Relational Model    Cassandra Model
Database            Keyspace
Table               Column Family (CF)
Primary key         Row key
Column name         Column name/key
Column value        Column value
Examples Using CQL
The following slides
demonstrate different cases with
the CQL interfaces, such as DDL
and DML.
User
• Id
• Name
• Phone
• Age
Emails
• Id
• email
Interface DDL
DROP
• Type
• Keyspace, Table
• Index, Trigger
CREATE
• Type
• Keyspace, Table
• Index, Trigger
ALTER
• Type
• Keyspace, Table
• Index, Trigger
CREATE KEYSPACE - Creates a keyspace in Cassandra.
USE - Connects to an existing keyspace.
ALTER KEYSPACE - Changes the properties of a keyspace.
DROP KEYSPACE - Removes a keyspace.
CREATE TABLE - Creates a table in a keyspace.
ALTER TABLE - Modifies the column properties of a table.
DROP TABLE - Removes a table.
TRUNCATE - Removes all the data from a table.
CREATE INDEX - Defines a new index on a single column of a table.
DROP INDEX - Deletes a named index.
Interface DML
SELECT, INSERT, UPDATE, DELETE
INSERT - Adds columns for a row in a table.
UPDATE - Updates a column of a row.
DELETE - Deletes data from a table.
BATCH - Executes multiple DML statements at once.
CQL Clauses
SELECT - Reads data from a table.
WHERE - Used along with SELECT to read
specific data.
ORDER BY - Used along with SELECT to read
specific data in a specific order.
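The DDL, DML, and clause keywords listed above might be combined as in the following sketch, which keeps the CQL as plain strings inside Python. The keyspace name `demo`, the table layouts (loosely based on the User and Emails entities), the sample values, the index name, and the `run_all` and `RecordingSession` helpers are all assumptions for illustration. The strings are meant to be pasted into cqlsh one at a time; `run_all` only requires an object with an `execute(statement)` method.

```python
CQL_STATEMENTS = [
    # DDL: keyspace, tables, and an index
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};",
    "USE demo;",
    "CREATE TABLE IF NOT EXISTS users "
    "(id int PRIMARY KEY, name text, phone text, age int);",
    # `email` is a clustering column so that ORDER BY can be used on it.
    "CREATE TABLE IF NOT EXISTS emails "
    "(id int, email text, PRIMARY KEY (id, email));",
    "CREATE INDEX IF NOT EXISTS users_age_idx ON users (age);",
    # DML: insert, update, delete
    "INSERT INTO users (id, name, phone, age) VALUES (1, 'Alice', '555-0100', 30);",
    "INSERT INTO emails (id, email) VALUES (1, 'alice@example.com');",
    "UPDATE users SET age = 31 WHERE id = 1;",
    "DELETE phone FROM users WHERE id = 1;",
    # Clauses: SELECT with WHERE and ORDER BY
    "SELECT name, age FROM users WHERE id = 1;",
    "SELECT email FROM emails WHERE id = 1 ORDER BY email DESC;",
    # DDL: cleanup
    "TRUNCATE users;",
    "DROP INDEX IF EXISTS users_age_idx;",
    "DROP TABLE users;",
]


class RecordingSession:
    """Stand-in for a driver session; it only records what it is given."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)


def run_all(session, statements=CQL_STATEMENTS):
    """Send each statement, in order, to anything with an execute() method."""
    for statement in statements:
        session.execute(statement)


if __name__ == "__main__":
    session = RecordingSession()
    run_all(session)
    for statement in session.executed:
        print(statement)
```

The `WHERE` clauses filter on the partition key `id`, and `ORDER BY` is applied to a clustering column, which is where CQL permits ordering.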
Who Uses Cassandra?
• Facebook
• WalmartLabs
• Constant Contact
• Digg
• AppScale
• Netflix
• Twitter
• Zoho
• IBM
• FormSpring
• Cisco WebEx
• Rackspace
• OpenX
• Adobe
• Comcast
• eBay
MySQL Comparison
                Cassandra   MySQL
Average Write   0.12 ms     ~300 ms
Average Read    15 ms       ~350 ms
Statistics based on 50 GB of data.
Stats provided by the authors using Facebook data.
Strengths
● Flexible data model
Supports modern data types with fast writes and reads.
● Peer-to-peer architecture
Cassandra follows a peer-to-peer architecture instead of a
master-slave architecture.
● Schema-free/Schema-less
In Cassandra, columns can be created at will within the rows.
The Cassandra data model is also known as a schema-optional
data model.
● AP in CAP terms
Cassandra is typically classified as an AP system, meaning that
availability and partition tolerance are generally considered to be
more important than consistency in Cassandra.
● Linear scale performance
The ability to add nodes without failures leads to predictable increases in
performance.
● Supports multiple languages
Python, C#/.NET, C++, Ruby, Java, Go, and many more.
● Operational and developmental simplicity
There are no complex software tiers to be managed, so administration
duties are greatly simplified.
● Ability to deploy across data centers
Cassandra can be deployed across multiple, geographically dispersed data
centers.
● Cloud availability
Can be installed in cloud environments.
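The AP classification above is a default rather than a fixed property, since the number of replicas that must acknowledge a read or a write can be chosen per request. The arithmetic behind that trade-off is sketched below; the function names are hypothetical, and the rule shown is the general quorum-overlap condition (R + W > N) rather than anything specific to one Cassandra version.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("quorum for N=3:", q)
    print("quorum reads + quorum writes:", reads_see_latest_write(n, q, q))
    print("one replica each:", reads_see_latest_write(n, 1, 1))
```

With three replicas, quorum reads and quorum writes always share at least one replica, while reading and writing a single replica favors availability and latency at the cost of possibly stale reads.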
Weaknesses
Use cases where it is better to avoid Cassandra
● If many joins are required to retrieve the data (CQL has no joins).
● To store configuration data.
● During compaction, things slow down and throughput
degrades.
● Basic operations such as aggregation have only limited
support.
● Range queries on the partition key are not supported efficiently.
● If there is transactional data that requires 100%
consistency.
● Cassandra can update and delete data, but it is not
optimized for update- and delete-heavy workloads.
