Learning Cassandra NoSQL

Cassandra NoSQL
- Pankaj Khattar

What are we going to learn today?
• New problems that traditional RDBMSs can't handle
• Trade-off between Consistency, Availability, and Partition Tolerance (CAP theorem)
• What are the different solutions available?
• What is Cassandra?
• Use cases for Cassandra
• Cassandra features: tunable consistency, P2P architecture, elastic scalability, column orientation
• Data model for Cassandra
• Demo application using Cassandra




Twitter – Massive Scale, High Availability

Travel Booking – Scale and Availability

Movie Booking – Consistency and Scale

Facebook Graph Search – Fast, Complex Querying

Facebook Messenger – Consistency and Scale

So, What Is Common?
• Huge data
• Fast random access
• Variable schema
• Need for compression
• High availability
• Need for consistency
• Need for distribution (sharding)

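Brewer’s CAP Theorem
[Figure: CAP triangle with corners Consistency, Availability, and Partition Tolerance]
• CA (Consistency + Availability): RDBMS
• CP (Consistency + Partition Tolerance): MongoDB, HBase, Redis
• AP (Availability + Partition Tolerance): CouchDB, Cassandra, DynamoDB, Riak
Source: http://www.w3resource.com/mongodb/nosql.php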

NoSQL Landscape
• Big Table clones: BigTable (Google), Cassandra, HBase, Hypertable
• Key-value stores: Dynamo (Amazon), Voldemort (LinkedIn), Citrusleaf, Membase, Riak, Tokyo Cabinet
• Document databases: CouchOne, MongoDB, Terrastore, OrientDB
• Graph databases: FlockDB (Twitter), AllegroGraph, DEX, InfoGrid, Neo4J, Sones
[Figure labels: Performance; Query and Navigational Complexity; Scalability & Speed]

Cassandra Use Case – Deep Dive
[Diagram: architecture built on RDBMS]
• Web application: 5000 TPS, 300 ~ 500 SQL per transaction
• Caching layer (elastic scale)
• Applications changing data: 1000 TPS, 100 ~ 200 SQL per transaction
• RDBMS 1, RDBMS 2

Using Cassandra
[Diagram: architecture built on Cassandra]
• Web application (elastic scale): 5000 TPS, 300 ~ 500 SQL per transaction
• Cassandra (elastic scale)
• Applications changing data: 1000 TPS, 100 ~ 200 SQL per transaction

Cassandra Use Case - Summary
E-Commerce (Travel Portal)
• Both B2B & B2C consumers
• High volume of shopping transactions (> 500 million visits/day)
• High volume of supply changes (manually and system generated)
• Huge inventory database (millions of hotels)
• High read/write load (thousands of reads & writes per second)
• Application has to be 99.99% available
• Fault tolerant & reliable
• Fast shopping experience
• Elastic scale
• Innovative recommendations & algorithms
• Should be fast to adapt to new changes
• Should be cost effective to maintain
Development Approaches
• Legacy way (pure RDBMS)
• Augmented (RDBMS + caching, heavy database hardware)
• Using Cassandra










What is Apache Cassandra?
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database.
Cassandra Features
• Open source
• Column oriented
• Decentralized
• Tuneably consistent
• Elastically scalable
• Distributed
• Highly available
• Fault tolerant

Distributed and Decentralized
[Figure: post office analogy. Centralised: separate counters for currency (CCY) exchange, stationery, and letters/couriers. Decentralised: every counter handles currency, stationery, and letters/couriers.]

Distributed and Decentralized
• Every node is identical.
• Nodes talk peer to peer and use the gossip protocol to keep the list of nodes in sync.
• No special host to coordinate activities.
• No single point of failure.
• Easier to operate and maintain because all nodes are the same.
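The idea of keeping the node list in sync by gossip can be sketched with a toy simulation. This is only an illustration under simplifying assumptions, not Cassandra's actual gossip implementation: node ids are integers, every node initially knows itself and one seed node, and the names `gossip_round` and `run_gossip` are hypothetical.

```python
import random


def gossip_round(views, rng):
    """One push-pull round: each node merges its membership view with one known peer."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue  # this node does not know anybody else yet
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def run_gossip(node_count, seed_node=0, max_rounds=50, rng=None):
    """Gossip until every node knows every other node; return (rounds, views)."""
    rng = rng or random.Random(42)
    # Assumption: each node starts out knowing only itself and the seed node.
    views = {n: {n, seed_node} for n in range(node_count)}
    everyone = set(range(node_count))
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_no, views
    return None, views


if __name__ == "__main__":
    rounds, views = run_gossip(6)
    print("rounds needed:", rounds)
```

No node plays a coordinating role after start-up: each one only exchanges what it knows with a peer, and the full membership list spreads from there.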

Elastic Scalability
Types of Scalability
• Vertical scalability
• Horizontal scalability
What is Elastic Scalability?
• It is a special property of horizontal scalability.
• The cluster can seamlessly scale up and scale back down without major disruption.

Elastic Scalability
• The cluster must accept new nodes without major disruption or reconfiguration.
• Processes do not have to be restarted.
• Application queries do not have to change.
• Data does not have to be rebalanced manually.

ADD A NODE AND MOVE ON!!
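Why adding a node need not disturb the rest of the cluster can be illustrated with a consistent-hashing ring. The sketch below is a simplified assumption of how data placement might work (one MD5-derived token per node, no replication, no virtual nodes); the class `Ring` and the node and key names are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes=()):
        self._tokens = []   # sorted node tokens
        self._owners = {}   # token -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def owner(self, key):
        """Return the first node at or after the key's token, wrapping around the ring."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


def moved_keys(keys, before, after):
    """Keys whose owner changed, mapped to (old owner, new owner)."""
    return {k: (before[k], after[k]) for k in keys if before[k] != after[k]}


if __name__ == "__main__":
    keys = ["hotel-%d" % i for i in range(1000)]
    ring = Ring(["node-a", "node-b", "node-c"])
    before = {k: ring.owner(k) for k in keys}
    ring.add_node("node-d")
    after = {k: ring.owner(k) for k in keys}
    moved = moved_keys(keys, before, after)
    print(len(moved), "of", len(keys), "keys moved, all to the new node")
```

Only the keys that fall into the new node's slice of the ring change owner; every other key stays where it was, so the application keeps issuing the same queries.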

High Availability and Fault Tolerance
Highly Available
• No downtime

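Tunable Consistency
• Cassandra lets us choose the consistency level per application requirement: each read or write states how many replicas must respond, trading consistency against latency and availability.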
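A common rule of thumb for picking levels is that reads see the latest write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The sketch below checks that rule for three of Cassandra's consistency levels (ONE, QUORUM, ALL); the function names are hypothetical helpers, not part of any driver API.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %s" % level)


def is_strongly_consistent(write_level, read_level, replication_factor):
    """True when R + W > N, i.e. every read overlaps the latest write on some replica."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        print(write_level, read_level, is_strongly_consistent(write_level, read_level, 3))
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap on at least one replica, while ONE plus ONE favours latency and availability over consistency.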

High Performance
• Cassandra was designed from the ground up to take full advantage of multiprocessor/multicore machines, and to run across many dozens of these machines housed in multiple data centres.
• It scales consistently and seamlessly to hundreds of terabytes.
• It shows exceptional performance under heavy loads.
• It consistently shows very fast write throughput on a basic commodity workstation.

Where to Use Cassandra?
Use if your application has:
• Big data (billions of records: rows & columns)
• Very high velocity random reads & writes
• Flexible sparse / wide column requirements
• No need for multiple secondary indexes
• Low latency
Use Cases:
• eCommerce inventory cache
• Time series / events
• Feed-based activities

Where NOT to Use Cassandra?
Don’t use if your application has:
• Secondary indexes
• Relational data
• Transactional needs (rollback, commit)
• Primary & financial records
• Stringent security & authorization needs on data
• Dynamic queries on columns
• Searching column data
• Low latency

Data Model
RDBMS vs Cassandra
• In an RDBMS:
  - Define a schema
  - Define tables with defined columns
  - The table defines the column names and their data types
  - Add rows conforming to that schema: each row contains the same fixed set of columns
• In Cassandra:
  - Define keyspaces
  - Define column families/tables
  - Column families can define metadata about the columns
  - Each row can have a different set of columns

Data Model – Column Families
Designing Column Families/Tables
• Static column families:
  - Static set of column names
  - Similar to a relational database table
  - Rows are not required to have all of the columns defined
• Dynamic column families:
  - Use arbitrary column names to store data
Data Model - Keys
Type of Keys
• Primary key
CREATE TABLE test ( key text PRIMARY KEY, data text );
• Composite primary (or compound) key
CREATE TABLE test ( key_part_one text, key_part_two int, data text, PRIMARY KEY (key_part_one, key_part_two) );
  - In the statement above, the first part of the key (key_part_one) is the partition key and the second part (key_part_two) is the clustering key
  - The partition key is responsible for data distribution across your nodes
  - The clustering key is responsible for data sorting within the partition
  - In a table with a single-field key, the primary key is the partition key
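The two roles of a compound key can be sketched as follows. This is a toy model under simplifying assumptions: the node is chosen by a plain hash-modulo of the partition key rather than a token ring, and `ToyTable` and the node names are hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict


class ToyTable:
    """Rows are placed by partition key and kept sorted by clustering key."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # node -> partition key -> list of (clustering key, data), kept sorted
        self.storage = {n: defaultdict(list) for n in self.nodes}

    def node_for(self, partition_key):
        # Assumption: simple hash-modulo placement, for illustration only.
        digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def insert(self, key_part_one, key_part_two, data):
        partition = self.storage[self.node_for(key_part_one)][key_part_one]
        clustering_keys = [ck for ck, _ in partition]
        i = bisect.bisect_left(clustering_keys, key_part_two)
        if i < len(partition) and partition[i][0] == key_part_two:
            partition[i] = (key_part_two, data)   # same primary key: overwrite
        else:
            partition.insert(i, (key_part_two, data))

    def select(self, key_part_one):
        """All rows of one partition, in clustering-key order."""
        return list(self.storage[self.node_for(key_part_one)].get(key_part_one, []))


if __name__ == "__main__":
    table = ToyTable(["node-a", "node-b", "node-c"])
    table.insert("a", 3, "x")
    table.insert("a", 1, "y")
    table.insert("a", 2, "z")
    print(table.node_for("a"), table.select("a"))
```

Every row sharing the partition key lands on the same node, and reads of that partition come back already ordered by the clustering key.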

Data Model - Columns
Type of Columns
• Standard columns
  - A tuple containing a name, a value, and a timestamp
• Expiring columns
  - Carry an optional expiration time called TTL (time to live)
  - TTL is defined in seconds
• Counter columns
  - Store a number that incrementally counts occurrences
  - Example: page views
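The three column types can be modelled in a few lines. This is a conceptual sketch only: timestamps are passed in explicitly as plain seconds to keep the example deterministic, and the classes `Column` and `Row` are hypothetical.

```python
class Column:
    """A standard column: name, value, write timestamp, and an optional TTL in seconds."""

    def __init__(self, name, value, timestamp, ttl=None):
        self.name = name
        self.value = value
        self.timestamp = timestamp
        self.ttl = ttl

    def is_live(self, now):
        # A column without TTL never expires.
        return self.ttl is None or now < self.timestamp + self.ttl


class Row:
    def __init__(self):
        self.columns = {}
        self.counters = {}

    def put(self, name, value, timestamp, ttl=None):
        current = self.columns.get(name)
        # The write with the newest timestamp wins.
        if current is None or timestamp >= current.timestamp:
            self.columns[name] = Column(name, value, timestamp, ttl)

    def get(self, name, now):
        column = self.columns.get(name)
        if column is None or not column.is_live(now):
            return None
        return column.value

    def increment(self, name, delta=1):
        """Counter column: only increments are applied, never a direct set."""
        self.counters[name] = self.counters.get(name, 0) + delta
        return self.counters[name]


if __name__ == "__main__":
    row = Row()
    row.put("session", "abc", timestamp=100, ttl=30)
    print(row.get("session", now=120), row.get("session", now=130))
    for _ in range(3):
        row.increment("page_views")
    print(row.counters["page_views"])
```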

Data Model - Columns
Type of Columns
• Composite columns
CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar, PRIMARY KEY (user_id, tweet_id) );

Data Model – Data Types
Data Types (Comparators & Validators)
• The data type for a column (or row key) value is called a validator.
• The data type for a column name is called a comparator.
• Data types can be defined when creating a column family schema, but this is not required.
• Internally, column names and values are stored as hex byte arrays (BytesType).

Data Model - Indexes
Type of Indexes
• Primary indexes
  - The primary index for a column family is the index of its row keys
  - Each node maintains this index for the data it manages
• Secondary indexes
  - Indexes on column values
  - Implemented as a hidden table, separate from the main table
  - Do not use secondary indexes to query a huge volume of records for a small number of results
  - It is more efficient to manually maintain a lookup column family than to use a secondary index
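Maintaining a lookup column family by hand means writing to two places on every update. The sketch below shows the idea with plain dictionaries standing in for column families; the class `UsersWithLookup`, the `city` column, and the `users_by_city` lookup are hypothetical examples.

```python
class UsersWithLookup:
    """Main column family plus a manually maintained lookup column family."""

    def __init__(self):
        self.users = {}           # row key (user_id) -> columns
        self.users_by_city = {}   # lookup: city -> set of user_ids

    def save(self, user_id, name, city):
        old = self.users.get(user_id)
        if old is not None and old["city"] != city:
            # Keep the lookup in step with the main data when the city changes.
            self.users_by_city[old["city"]].discard(user_id)
        self.users[user_id] = {"name": name, "city": city}
        self.users_by_city.setdefault(city, set()).add(user_id)

    def find_by_city(self, city):
        """One lookup by row key instead of scanning every user row."""
        return sorted(self.users_by_city.get(city, set()))


if __name__ == "__main__":
    store = UsersWithLookup()
    store.save("u1", "Ann", "Delhi")
    store.save("u2", "Bob", "Pune")
    store.save("u1", "Ann", "Pune")   # u1 moves city
    print(store.find_by_city("Pune"), store.find_by_city("Delhi"))
```

The cost is the extra write and the bookkeeping on updates; the benefit is that the query becomes a direct read of one row.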

Cassandra – Writes/Reads
Writes in Cassandra
• Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable.
• There is very minimal disk I/O at the time of write.
• Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table).
• Memtables and SSTables are maintained per column family.
• Memtables are organized in sorted order by row key and flushed to SSTables sequentially.
Reads in Cassandra
• At read time, a row must be combined from all SSTables on disk (as well as unflushed memtables) to produce the requested data.
• Each SSTable has a Bloom filter associated with it that checks if a requested row key exists in the SSTable before doing any disk seeks.
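The write and read paths above can be sketched as a small in-memory model. It is a simplified illustration, not Cassandra's storage engine: "disk" is just a Python list, the flush threshold counts rows, and the names `BloomFilter`, `SSTable`, and `ToyNode` are hypothetical.

```python
import hashlib


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))   # written once, sorted by row key
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                          # skip the simulated disk seek
        return self.rows.get(key)


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def write(self, key, columns, timestamp):
        self.commit_log.append((key, columns, timestamp))   # durability first
        row = self.memtable.setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(SSTable(self.memtable))
            self.memtable = {}

    def read(self, key):
        """Merge the row from every SSTable and the memtable; newest timestamp wins."""
        merged = {}
        fragments = [t.get(key) for t in self.sstables] + [self.memtable.get(key)]
        for fragment in fragments:
            for name, (value, ts) in (fragment or {}).items():
                if name not in merged or ts > merged[name][1]:
                    merged[name] = (value, ts)
        return {name: value for name, (value, ts) in merged.items()}
```

A read for a row that was written before and after a flush has to combine both fragments, which is why reads touch more structures than writes do.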

Application Demo
Cassandra Installation & Configuration
• conf/cassandra.yaml
• Tools
Key Space Setup
Column Family / Data Model Setup
• Key
• Columns & data types
• Indexes (primary & secondary)
• Programmatic consistency
Thrift Hector API
CQL3 API





