Cassandra NoSQL
- Pankaj Khattar
What are we going to learn today?
 New problems that traditional RDBMSs can’t handle
 Trade-off between Consistency, Availability, and Partition Tolerance (CAP theorem)
 What are the different solutions available?
 What is Cassandra?
 Use cases for Cassandra
 Cassandra Features – Tunable Consistency, P2P Architecture, Elastic Scalability, Column Orientation
 Data Model for Cassandra
 Demo Application using Cassandra



Twitter – Massive Scale, High Availability
Travel Booking – Scale and Availability
Movie Booking – Consistency and Scale
Facebook Graph Search – Fast, Complex Querying
Facebook Messenger – Consistency and Scale
So, What Is Common?
 Huge Data
 Fast Random Access
 Variable Schema
 Need for Compression
 High Availability
 Need for Consistency
 Need for Distribution (Sharding)
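Brewer’s CAP Theorem
[Figure: CAP triangle of Consistency, Availability, and Partition Tolerance]
 CA (Consistency + Availability): RDBMS
 CP (Consistency + Partition Tolerance): MongoDB, HBase, Redis
 AP (Availability + Partition Tolerance): CouchDB, Cassandra, DynamoDB, Riak
Source: http://www.w3resource.com/mongodb/nosql.php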
NoSQL Landscape
 Big Table Clones: BigTable (Google), Cassandra, HBase, Hypertable
 Key-Value Stores: Dynamo (Amazon), Voldemort (LinkedIn), Citrusleaf, Membase, Riak, Tokyo Cabinet
 Document Databases: CouchOne, MongoDB, Terrastore, OrientDB
 Graph Databases: FlockDB (Twitter), AllegroGraph, DEX, InfoGrid, Neo4J, Sones
[Figure axes: Performance, Scalability & Speed; Query and Navigational Complexity]
Cassandra Use Case – Deep Dive
[Diagram labels:]
 Web application: 5000 TPS, 300 ~ 500 SQL Transaction
 Caching layer
 Elastic scale
 Applications changing data: 1000 TPS, 100 ~ 200 SQL Transaction
 RDBMS 1, RDBMS 2
Using Cassandra
[Diagram labels:]
 Web application (elastic scale): 5000 TPS, 300 ~ 500 SQL Transaction
 Cassandra (elastic scale)
 Applications changing data: 1000 TPS, 100 ~ 200 SQL Transaction
Cassandra Use Case – Summary
E-Commerce (Travel Portal):
 Both B2B & B2C consumers
 High volume of shopping transactions (> 500 million visits/day)
 High volume of supply changes (manually and system generated)
 Huge inventory database (millions of hotels)
 High read/write rate (thousands of reads and writes per second)
 Application has to be 99.99% available
 Fault tolerant and reliable
 Fast shopping experience
 Elastic scale
 Innovative recommendations and algorithms
 Should be fast to adapt to new changes
 Should be cost effective to maintain
Development Approaches:
 Legacy way (pure RDBMS)
 Augmented (RDBMS + caching, heavy database hardware)
 Using Cassandra
What is Apache Cassandra?
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available,
fault-tolerant, tunably consistent, column-oriented database.
Cassandra Features
 Open Source
 Column Oriented
 Decentralized
 Tunably Consistent
 Elastically Scalable
 Distributed
 Highly Scalable
 Fault Tolerant
Distributed and Decentralized
[Figure: post office analogy]
 Centralised: separate counters for currency (CCY) exchange, stationery, and letters/couriers
 Decentralised: every counter handles CCY, stationery, and letters/couriers
Distributed and Decentralized
 Every node is identical.
 Nodes communicate peer to peer and use a gossip protocol to keep the list of nodes in sync.
 No special host coordinates activities.
 No single point of failure.
 Easier to operate and maintain because all nodes are the same.
Elastic Scalability
Types of Scalability
 Vertical Scalability
 Horizontal Scalability
What is Elastic Scalability?
This is a special property of horizontal scalability.
 The cluster can seamlessly scale up and scale back down without major disruption.
Elastic Scalability
 The cluster must accept new nodes without major disruption or reconfiguration.
 The process does not have to be restarted.
 Application queries do not have to change.
 Data does not have to be rebalanced manually.
ADD A NODE AND MOVE ON!!
High Availability and Fault Tolerance
Highly Available
 No Downtime
Tunable Consistency
 Cassandra enables us to define consistency as per application requirements
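One way to picture what "tuning" consistency means: with N replicas, a read that contacts R replicas must overlap the latest acknowledged write on W replicas whenever R + W > N. The sketch below is illustrative Python only, not Cassandra's API. The function names and the replication factor are assumptions made for this example; ONE, QUORUM and ALL are names of Cassandra consistency levels.

```python
# Illustrative sketch only: models the R + W > N overlap rule, not Cassandra's real API.

def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")


def is_strongly_consistent(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must overlap every acknowledged write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
        print(write_level, read_level, is_strongly_consistent(write_level, read_level, rf))
```

Lower levels trade this overlap guarantee for lower latency and higher availability, which is the choice an application makes per operation.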
High Performance
 Cassandra was designed specifically from the ground up to take full advantage
of multiprocessor/multicore machines, and to run across many dozens of
these machines housed in multiple data centres.
 It scales consistently and seamlessly to hundreds of terabytes.
 Shows exceptional performance under heavy loads.
 Consistently shows very fast throughput for writes per second on a basic
commodity workstation.
Where to Use Cassandra?
Use if your application has:
 Big data (billions of records: rows and columns)
 Very high velocity random reads and writes
 Flexible sparse / wide column requirements
 No need for multiple secondary indexes
 Low latency
Use Cases:
 eCommerce inventory cache
 Time series / events
 Feed-based activities
Where NOT to Use Cassandra?
Don’t Use if your application has:
 Secondary indexes
 Relational data
 Transactional needs (rollback, commit)
 Primary and financial records
 Stringent security and authorization needs on data
 Dynamic queries on columns
 Searching column data
 Low latency
Data Model
RDBMS vs Cassandra
 In RDBMS,
 Define Schema
 Define tables with defined columns
 The table defines the column names and their data types
 Add rows conforming to that schema: each row contains the same fixed set of
columns.
 In Cassandra,
 Define Keyspaces
 Define column families/tables
 Column families can define metadata about the columns
 Each row can have a different set of columns
Data Model – Column Families
Designing Column Families/Tables
 Static Column Families
 Static set of column names
 Similar to a relational database table
 Rows are not required to have all of the columns defined
 Dynamic Column Families
 Use arbitrary column names to store data
Data Model - Keys
Type of Keys
 Primary Key
create table test ( key text PRIMARY KEY, data text );
 Composite Primary (or Compound) Key
create table test ( key_part_one text, key_part_two int, data text, PRIMARY
KEY(key_part_one, key_part_two) );
 In the example above, the first part of the key (key_part_one) is the Partition Key and the
second part (key_part_two) is the Clustering Key
 The Partition Key is responsible for data distribution across your nodes
 The Clustering Key is responsible for data sorting within the partition
 The Primary Key is equivalent to the Partition Key in a single-field-key table
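The two roles of the key can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: the node names are hypothetical, placement uses a plain modulo over an MD5 hash instead of a token ring (Cassandra's default partitioner is Murmur3-based), and `Table` is an in-memory stand-in introduced only for this example.

```python
# Illustrative sketch: the partition key picks a node, the clustering key orders rows.
# Simplified on purpose: MD5 + modulo instead of Cassandra's token ring.
import hashlib
from bisect import insort
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node by hashing it."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


class Table:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key -> sorted list of (clustering key, data)
        self.partitions = defaultdict(list)

    def insert(self, key_part_one: str, key_part_two: int, data: str) -> None:
        insort(self.partitions[key_part_one], (key_part_two, data))

    def read_partition(self, key_part_one: str):
        return list(self.partitions.get(key_part_one, []))


if __name__ == "__main__":
    table = Table()
    table.insert("alice", 3, "c")
    table.insert("alice", 1, "a")
    table.insert("alice", 2, "b")
    # All rows for "alice" live together and come back ordered by key_part_two.
    print(node_for("alice"), table.read_partition("alice"))
```

Because every row with the same partition key lands on the same node, reading one partition in clustering order is a single-node, sequential operation.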
Data Model - Columns
Type of Columns
 Standard Columns
 A tuple containing a name, a value, and a timestamp
 Expiring Columns
 Carry an optional expiration period called TTL (time to live)
 TTL is defined in seconds
 Counter Columns
 Store a number that incrementally counts occurrences
 Example: page views
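A column as described above can be modelled as a small record. The sketch below is illustrative Python, not Cassandra code: the `Column` class and `reconcile` helper are names invented for this example. It assumes that when two versions of a column exist, the one with the newer timestamp wins, and that a TTL counts from the write time.

```python
# Illustrative sketch: a column as (name, value, timestamp) with an optional TTL in seconds.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: float           # write time, in seconds
    ttl: Optional[int] = None  # time to live in seconds; None means the column never expires

    def is_expired(self, now: float) -> bool:
        """An expiring column is dead once its TTL has elapsed since the write."""
        return self.ttl is not None and now >= self.timestamp + self.ttl


def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the newest timestamp (last write wins)."""
    return a if a.timestamp >= b.timestamp else b


if __name__ == "__main__":
    old = Column("email", "old@example.com", timestamp=100.0)
    new = Column("email", "new@example.com", timestamp=200.0, ttl=60)
    print(reconcile(old, new).value, new.is_expired(now=300.0))
```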
Data Model - Columns
Type of Columns
 Composite Columns
CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar,
PRIMARY KEY (user_id, tweet_id) );
Data Model – Data Types
Data Types (Comparators & Validators)
 The data type for a column (or row key) value is called a validator
 The data type for a column name is called a comparator
 Data types can be defined when creating the column family schema, but this is not required.
 Internally, Cassandra stores column names and values as hex byte arrays (BytesType).
Data Model - Indexes
Type of Indexes
 Primary Indexes
 The primary index for a column family is the index of its row keys
 Each node maintains this index for the data it manages
 Secondary Indexes
 Indexes on column values
 Implemented as a hidden table, separate from the main table
 Do not use secondary indexes to query a huge volume of records for a small number
of results
 In that case it is more efficient to manually maintain a lookup column family than to use a
secondary index
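The idea of a manually maintained lookup column family can be sketched as follows. This is illustrative Python in which plain dictionaries stand in for two column families; the names `users`, `users_by_email`, `save_user` and `find_by_email` are hypothetical and chosen only for this example.

```python
# Illustrative sketch: maintaining a lookup table by hand instead of a secondary index.
# Dicts stand in for two column families; all names are hypothetical.

users = {}           # user_id -> {"email": ..., "name": ...}
users_by_email = {}  # email   -> user_id (the manually maintained lookup)


def save_user(user_id, email, name):
    """Write the row and keep the lookup table in step with it."""
    previous = users.get(user_id)
    if previous is not None and previous["email"] != email:
        users_by_email.pop(previous["email"], None)  # drop the stale lookup entry
    users[user_id] = {"email": email, "name": name}
    users_by_email[email] = user_id


def find_by_email(email):
    """Two key lookups instead of scanning every row."""
    user_id = users_by_email.get(email)
    return users.get(user_id) if user_id is not None else None


if __name__ == "__main__":
    save_user("u1", "ann@example.com", "Ann")
    print(find_by_email("ann@example.com"))
```

The cost of this approach is that the application writes twice and is responsible for keeping both tables consistent.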
Cassandra – Writes/Reads
Writes in Cassandra
 Cassandra writes are first written to a commit log (for durability), and then to an
in-memory table structure called a memtable.
 There is very minimal disk I/O at the time of write.
 Writes are batched in memory and periodically written to disk to a persistent table
structure called an SSTable (sorted string table).
 Memtables and SSTables are maintained per column family.
 Memtables are organized in sorted order by row key and flushed to SSTables sequentially.
Reads in Cassandra
 At read time, a row must be combined from all SSTables on disk (as well as unflushed
memtables) to produce the requested data.
 Each SSTable has a Bloom filter associated with it that checks if a requested row key exists
in the SSTable before doing any disk seeks.
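The write and read paths described above can be sketched as a toy store. This is illustrative Python under simplifying assumptions, not Cassandra's implementation: everything lives in memory, the `Store` and `BloomFilter` classes are invented for this example, the flush threshold is a row count, and columns are merged in SSTable age order rather than by per-column timestamps.

```python
# Illustrative sketch of the write and read path: commit log, memtable, SSTables, Bloom filter.
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a row key."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


class Store:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # row key -> columns, held in memory
        self.sstables = []    # (bloom filter, rows sorted by key), oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, columns):
        self.commit_log.append((key, columns))  # durability first
        self.memtable[key] = {**self.memtable.get(key, {}), **columns}
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into an SSTable sorted by row key."""
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key):
        """Combine the row from every SSTable that may hold it, then the memtable."""
        result = {}
        for bloom, rows in self.sstables:
            if bloom.might_contain(key) and key in rows:  # skip tables that cannot hold the key
                result.update(rows[key])
        result.update(self.memtable.get(key, {}))
        return result or None


if __name__ == "__main__":
    store = Store(memtable_limit=2)
    store.write("k1", {"a": 1})
    store.write("k2", {"b": 2})  # second row triggers a flush
    store.write("k1", {"c": 3})  # lands in the new memtable
    print(store.read("k1"))
```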
Application Demo
Cassandra Installation & Configuration
 conf/cassandra.yaml
 Tools
Keyspace Setup
Column Family / Data Model Setup
 Key
 Columns & Data Types
 Indexes (Primary & Secondary)
 Programmatic Consistency
Thrift Hector API
CQL3 API





Questions?
Thanks
