Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centres, with asynchronous masterless replication allowing low-latency operations for all clients.
2. What are we going to learn today?
New Problems which can’t be handled by traditional RDBMS
Tradeoff between Consistency, Availability, Partition Tolerance (CAP theorem)
What are the different solutions available?
What is Cassandra?
Use-Cases for Cassandra
Cassandra Features – Tunable Consistency, P2P Architecture, Elastic Scalability, Column Orientation
Data Model for Cassandra
Demo Application using Cassandra
8. So, What Is Common?
Huge Data
Fast Random access
Variable Schema
Need for Compression
High Availability
Need for Consistency
Need for Distribution (Sharding)
12. Using Cassandra
[Diagram: a web application (5000 TPS, 300 ~ 500 SQL per transaction) and applications changing data (1000 TPS, 100 ~ 200 SQL per transaction), both served by Cassandra with elastic scale.]
13. Cassandra Use Case - Summary
E-Commerce (Travel Portal)
Both B2B & B2C Consumers
High volume of shopping transactions (> 500 Million Visits / Day)
High volume of supply changes (Manual & System generated)
Huge Inventory Database (Millions of hotels)
High Read/Write rate (Thousands of Reads & Writes/Second)
Application has to be 99.99% Available
Fault Tolerant & Reliable
Fast & Quick Shopping Experience
Elastic Scale
Innovative Recommendations & Algorithms
Should be fast to adapt to new changes
Should be cost effective to maintain
Development Approaches
Legacy Way (Pure RDBMS)
Augmented (RDBMS + Caching, Heavy Database Hardware)
Using Cassandra
14. What is Apache Cassandra?
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database.
Cassandra Features: Open Source, Column Oriented, Decentralized, Tunably Consistent, Elastically Scalable, Distributed, Highly Available, Fault Tolerant
15. Distributed and Decentralized
[Diagram: post office analogy. Centralised: separate counters for currency (CCY) exchange, stationery and letters/couriers. Decentralised: every counter handles CCY, stationery and letters/couriers.]
16. Distributed and Decentralized
Every Node Is Identical.
Nodes communicate peer to peer and use a Gossip Protocol to keep the list of nodes in sync.
No Special Host to Coordinate Activities.
No Single Point of Failure.
Easier to Operate and Maintain because all nodes are the same.
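The gossip idea on this slide can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not Cassandra's actual gossip implementation: the node names, the single seed node, and the `gossip_round` / `run_gossip` helpers are all hypothetical. Each node repeatedly picks one peer it already knows about and the two merge their membership lists.

```python
import random

def gossip_round(views, rng):
    """One round: each node merges membership lists with one peer it knows."""
    for node in sorted(views):
        known_peers = sorted(views[node] - {node})
        if not known_peers:
            continue  # this node knows nobody yet; it waits to be contacted
        peer = rng.choice(known_peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)

def run_gossip(node_names, seed=42, max_rounds=100):
    """Gossip until every node knows every other node (or give up)."""
    rng = random.Random(seed)
    # Assumption: each node starts out knowing only itself and one seed node.
    views = {name: {name, node_names[0]} for name in node_names}
    everyone = set(node_names)
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no, views
    return None, views

rounds, views = run_gossip(["n1", "n2", "n3", "n4", "n5"])
print("converged after rounds:", rounds)
```

No node plays a special coordinating role here: the seed node is only a first contact point, and after a few rounds every node holds the same list.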
17. Elastic Scalability
Types of Scalability
Vertical Scalability
Horizontal Scalability
What is Elastic Scalability?
This is a special property of Horizontal Scalability.
The cluster can seamlessly scale up and scale back down without major disruption.
18. Elastic Scalability
Cluster must accept new nodes without major disruption or reconfiguration.
Process should not have to be restarted
Do not have to change application queries
Don’t have to manually rebalance data
ADD A NODE AND MOVE ON!!
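Why adding a node does not force a full reshuffle of data can be sketched with a consistent-hash ring. This is an illustration under assumptions, not Cassandra's token allocation: the `Ring` class, the node names, the number of virtual nodes, and the use of MD5 as the hash are all stand-ins chosen for the example.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes=64):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._tokens = []      # sorted ring positions
        self._owners = {}      # ring position -> node name

    def add_node(self, node):
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owners[t] = node

    def owner(self, key):
        # A key belongs to the first ring position after its own token.
        idx = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]

ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"hotel:{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

Only the keys that now fall into the new node's ranges change owner; everything else stays where it was, which is what lets a cluster grow without a stop-the-world rebalance.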
19. High Availability and Fault Tolerance
Highly Available
No Downtime
21. High Performance
Cassandra was designed specifically from the ground up to take full advantage of multiprocessor/multicore machines, and to run across many dozens of these machines housed in multiple data centres.
It scales consistently and seamlessly to hundreds of terabytes.
Shows exceptional performance under heavy loads.
Consistently shows very high write throughput (writes per second) on a basic commodity workstation.
22. Where to Use Cassandra?
Use if your application has:
Big Data (Billions of Records: Rows & Columns)
Very High Velocity Random Reads & Writes
Flexible Sparse / Wide Column Requirements
No Multiple Secondary Index Needs
Low Latency
Use Cases:
eCommerce Inventory Cache Use Cases
Time Series / Events Use Cases
Feed Based Activities / Use Cases
23. Where NOT to Use Cassandra?
Don’t Use if your application has:
Secondary Indexes.
Relational Data.
Transactional (Rollback, Commit)
Primary & Financial Records.
Stringent Security & Authorization Needs On Data
Dynamic Queries on Columns.
Searching Column Data
Low Latency
24. Data Model
RDBMS vs Cassandra
In RDBMS:
Define Schema
Define tables with defined columns
The table defines the column names and their data types
Add rows conforming to that schema: each row contains the same fixed set of columns.
In Cassandra:
Define Keyspaces
Define column families/tables
Column families can define metadata about the columns
Each row can have a different set of columns
25. Data Model – Column Families
Designing Column Families/Tables
Static Column Families:
Static set of column names
Similar to a relational database table
Rows are not required to have all of the columns defined
Dynamic Column Families:
Use arbitrary column names to store data
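The difference between the two styles can be sketched with plain dictionaries. This is a conceptual illustration only, not Cassandra's storage format; `USER_COLUMNS`, `put_static` and `put_dynamic` are hypothetical names invented for the example.

```python
# Hypothetical static schema: a fixed set of allowed column names.
USER_COLUMNS = {"name", "email", "phone"}

def put_static(table, row_key, **columns):
    """Static column family: only known column names, but rows may omit some."""
    unknown = set(columns) - USER_COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    table.setdefault(row_key, {}).update(columns)

def put_dynamic(table, row_key, column_name, value):
    """Dynamic column family: any column name is accepted."""
    table.setdefault(row_key, {})[column_name] = value

users = {}
put_static(users, "u1", name="Ann", email="ann@example.com")
put_static(users, "u2", name="Bob")  # no email or phone: still a valid row

page_views = {}
put_dynamic(page_views, "home", "2014-01-01T10:00", 12)
put_dynamic(page_views, "home", "2014-01-01T11:00", 30)  # column name is data
```

In the dynamic case the column names themselves (here, timestamps) carry information, so each row can grow its own set of columns.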
26. Data Model - Keys
Types of Keys
Primary Key
CREATE TABLE test ( key text PRIMARY KEY, data text );
Composite Primary (or Compound) Key
CREATE TABLE test ( key_part_one text, key_part_two int, data text, PRIMARY KEY (key_part_one, key_part_two) );
In the above, the first part of the key (key_part_one) is called the Partition Key and the second part (key_part_two) is the Clustering Key
The Partition Key is responsible for data distribution across your nodes
The Clustering Key is responsible for data sorting within the partition
The Primary Key is equivalent to the Partition Key in a single-field-key table
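The two roles of a compound key can be sketched as follows. This is a simplified model, not how Cassandra places data: the node list, the `node_for` helper, the modulo placement and the `Table` class are all assumptions made for illustration. The partition key picks the node; the clustering key keeps rows sorted inside the partition.

```python
import bisect
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def node_for(partition_key):
    """Partition key decides which node stores the row (simplified: hash mod N)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

class Table:
    def __init__(self):
        # partition key -> list of (clustering key, data), kept sorted
        self.partitions = defaultdict(list)

    def insert(self, key_part_one, key_part_two, data):
        rows = self.partitions[key_part_one]
        idx = bisect.bisect_left(rows, (key_part_two,))
        if idx < len(rows) and rows[idx][0] == key_part_two:
            rows[idx] = (key_part_two, data)          # same primary key: overwrite
        else:
            rows.insert(idx, (key_part_two, data))    # keep clustering order

table = Table()
table.insert("alice", 3, "third")
table.insert("alice", 1, "first")
table.insert("alice", 2, "second")
table.insert("bob", 1, "only")

print(node_for("alice"), table.partitions["alice"])
```

All rows for "alice" land in one partition on one node and come back ordered by key_part_two, regardless of insertion order.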
27. Data Model - Columns
Types of Columns
Standard Columns
A tuple containing a name, a value and a timestamp
Expiring Columns
Have an optional expiration period called TTL (time to live)
TTL is defined in seconds
Counter Columns
Store a number that incrementally counts the occurrences of an event
Example: Page Views
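The three column types can be modelled in a few lines. This is a conceptual sketch, not Cassandra's internal representation; the `Column` and `Counter` classes and the `newest` helper are hypothetical. Time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    name: str
    value: object
    timestamp: float            # write time in seconds
    ttl: Optional[int] = None   # time to live in seconds; None means no expiry

    def is_live(self, now):
        """An expiring column stops being visible once its TTL has elapsed."""
        return self.ttl is None or now < self.timestamp + self.ttl

def newest(a, b):
    """When two versions of a column exist, the later timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

class Counter:
    """Counter column: a number that is only ever changed by increments."""
    def __init__(self):
        self.value = 0

    def increment(self, by=1):
        self.value += by
        return self.value

session = Column("session_token", "abc123", timestamp=1000.0, ttl=60)
old_email = Column("email", "ann@old.example", timestamp=1000.0)
new_email = Column("email", "ann@new.example", timestamp=2000.0)

page_views = Counter()
for _ in range(3):
    page_views.increment()
```

The timestamp on every standard column is what allows replicas to agree on the latest value without coordinating at write time.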
28. Data Model - Columns
Types of Columns
Composite Columns
CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar, PRIMARY KEY (user_id, tweet_id) );
29. Data Model – Data Types
Data Types (Comparators & Validators)
The data type for a column (or row key) value is called a validator
The data type for a column name is called a comparator
Data types can be defined when creating column family schemas, but this is not required.
Internally, Cassandra stores column names and values as hex byte arrays (BytesType).
30. Data Model - Indexes
Types of Indexes
Primary Indexes
The primary index for a column family is the index of its row keys
Each node maintains this index for the data it manages
Secondary Indexes
Indexes on column values
Implemented as a hidden table, separate from the main table
Do not use secondary indexes to query a huge volume of records for a small number of results
In that case it is more efficient to manually maintain a lookup column family than to use a secondary index
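A manually maintained lookup column family amounts to writing each record twice: once under its row key and once under the value you want to search by. The sketch below models both column families as dictionaries; `UserStore` and its method names are hypothetical, and a real application would issue two writes to Cassandra instead.

```python
class UserStore:
    def __init__(self):
        self.users = {}            # main column family: user_id -> row
        self.users_by_email = {}   # lookup column family: email -> user_id

    def save(self, user_id, email, name):
        old = self.users.get(user_id)
        if old is not None and old["email"] != email:
            # Keep the lookup in step when the indexed value changes.
            self.users_by_email.pop(old["email"], None)
        self.users[user_id] = {"email": email, "name": name}
        self.users_by_email[email] = user_id

    def find_by_email(self, email):
        """Two key lookups instead of scanning every row for a matching value."""
        user_id = self.users_by_email.get(email)
        return None if user_id is None else self.users[user_id]

store = UserStore()
store.save("u1", "ann@example.com", "Ann")
store.save("u2", "bob@example.com", "Bob")
store.save("u1", "ann@work.example", "Ann")  # email changed
```

The cost is that the application owns the consistency of the lookup data; the benefit is that every read is a direct key access.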
31. Cassandra – Writes/Reads
Writes in Cassandra
Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable.
There is very minimal disk I/O at the time of write.
Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).
Memtables and SSTables are maintained per column family.
Memtables are organized in sorted order by row key and flushed to SSTables sequentially
Reads in Cassandra
At read time, a row must be combined from all SSTables on disk (as well as unflushed memtables) to produce the requested data.
Each SSTable has a Bloom filter associated with it that checks whether a requested row key might exist in the SSTable before doing any disk seeks.
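The write and read paths above can be sketched as a toy in-memory model. This is an illustration under heavy simplification, not Cassandra's storage engine: the class names, the flush threshold, the Bloom filter sizing and the "newer table wins" merge rule (instead of per-column timestamps) are all assumptions.

```python
import hashlib

class BloomFilter:
    """Answers 'definitely not present' or 'possibly present' for a key."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

class SSTable:
    """Immutable, sorted-by-row-key snapshot of a flushed memtable."""
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

class ColumnFamilyStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []     # append-only record of every write
        self.memtable = {}       # row key -> columns, held in memory
        self.sstables = []       # oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, columns):
        self.commit_log.append((key, columns))            # 1. durability
        self.memtable.setdefault(key, {}).update(columns)  # 2. memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(SSTable(self.memtable))       # 3. sequential flush
        self.memtable = {}

    def read(self, key):
        row = {}
        for sstable in self.sstables:
            if sstable.bloom.might_contain(key):  # skip tables that cannot match
                row.update(sstable.rows.get(key, {}))
        row.update(self.memtable.get(key, {}))    # unflushed data is newest
        return row or None

store = ColumnFamilyStore(memtable_limit=2)
store.write("u1", {"name": "Ann"})
store.write("u2", {"name": "Bob"})       # reaches the limit, triggers a flush
store.write("u1", {"city": "Pune"})      # stays in the new memtable
print(store.read("u1"))
```

The read for "u1" has to merge one SSTable and the current memtable, which is why reads are generally more expensive than writes in this design.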
32. Application Demo
Cassandra Installation & Configuration
conf/cassandra.yaml
Tools
Keyspace Setup
Column Family / Data Model Setup
Key
Columns & Data Types
Indexes (Primary & Secondary)
Programmatic Consistency
Thrift Hector API
CQL3 API