Introduction to Apache Cassandra

Apache Cassandra
Harshit Daga
Software Consultant
Knoldus Software LLP

Agenda
● What is Cassandra
● Gossip communication protocol
● Cassandra- Data Model
● Cassandra- Architecture
● Reading from/writing to a node
● Data consistency

Cassandra
● Cassandra is a massively scalable, distributed NoSQL database
with a flexible schema.
● Open-source database, released under the Apache License.
● Originally developed by Facebook for inbox search.
● Data model based upon Google’s BigTable.
● Distributed design is based upon Amazon’s Dynamo.
● Promoted heavily by DataStax.

Gossip Communication Protocol
● Peer-to-peer communication protocol.
● Nodes are arranged in a ring.
● Data is replicated to multiple nodes.
● Nodes periodically exchange state information about themselves
and about the other nodes they know of.
● Each message carries a version, so newer information replaces
older information.
● There is no master-slave concept, and hence no single point of failure.
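
The version-based exchange described above can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual gossip implementation: the `Node` class, its `state` dictionary, and the `(version, status)` tuple are all hypothetical names chosen for the example.

```python
import random

class Node:
    """Hypothetical gossip participant (illustrative, not Cassandra's classes)."""

    def __init__(self, name):
        self.name = name
        # Known cluster state: node name -> (version, status)
        self.state = {name: (1, "UP")}

    def heartbeat(self):
        # Bump our own version so peers can tell this entry is newer.
        version, status = self.state[self.name]
        self.state[self.name] = (version + 1, status)

    def gossip_with(self, peer):
        # Exchange state both ways; for each node keep the higher version.
        merged = dict(self.state)
        for name, entry in peer.state.items():
            if name not in merged or entry[0] > merged[name][0]:
                merged[name] = entry
        self.state = dict(merged)
        peer.state = dict(merged)

def run_round(nodes, rng):
    # Every node refreshes its own entry, then gossips with one random peer.
    for node in nodes:
        node.heartbeat()
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)

nodes = [Node(f"node{i}") for i in range(4)]
rng = random.Random(0)
for _ in range(10):
    run_round(nodes, rng)
```

Because every node talks to arbitrary peers and merges by version, no single node has to act as a master for the cluster state to spread.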

Cassandra- Data Model
● Column data is stored as a key/value pair.
● A collection of columns makes a row.
● A column family is a collection of rows.
● In an RDBMS, every column of a row must hold a value or NULL,
but that is not the case in Cassandra.

● Consider the following example,
● Now insert a new row that leaves out some of the columns:
● The insertion above does not fail.

● This means data is stored as a multi-dimensional sparse array.
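
As a rough illustration of the sparse model, a column family can be pictured as a nested map from row key to column name to value. The `users` data and the `get_column` helper below are hypothetical, made up only for this sketch.

```python
# Column family modeled as: row key -> {column name -> value}.
# Hypothetical "users" data, for illustration only.
users = {
    "u1": {"name": "Alice", "email": "alice@example.com", "city": "Delhi"},
    "u2": {"name": "Bob", "email": "bob@example.com", "city": "Pune"},
}

# Insert a row that supplies only one column; nothing is stored for the rest.
users["u3"] = {"name": "Carol"}

def get_column(column_family, row_key, column):
    """Return the column value, or None when the row has no such column."""
    return column_family.get(row_key, {}).get(column)
```

Here `get_column(users, "u3", "email")` returns `None` because the column was never written, not because a NULL placeholder was stored.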

Cassandra- Architecture
● A ring has several nodes.
● Each node is assigned a range of partition tokens.
● Data placement is based on the partition key.
● When a client sends a request to a node, that node becomes the
coordinator for the request.
● The coordinator determines which node(s) in the ring should
process the request.

● Virtual Nodes (Vnodes)
– Each node owns many small token ranges instead of one large
range.
– Tokens are automatically calculated and assigned to each
node.
– Cluster rebalancing happens automatically when nodes join or leave.

● Which node gets which data depends on the partition key.
● Cassandra computes a hash value (token) for each partition key.
● The data goes to the node that owns that hash value.
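
The mapping from partition key to node can be sketched as a small token ring. This is a simplified illustration under stated assumptions: `TokenRing`, `token_for`, and `vnodes_per_node` are hypothetical names, and MD5 is used only as a stand-in hash from the standard library (Cassandra's default partitioner uses Murmur3).

```python
import bisect
import hashlib

def token_for(partition_key):
    # Stand-in hash (assumption): MD5 is used here only because it ships
    # with Python; Cassandra's default partitioner uses Murmur3.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class TokenRing:
    """Hypothetical ring: each node owns several tokens (virtual nodes)."""

    def __init__(self, node_names, vnodes_per_node=8):
        self._ring = sorted(
            (token_for(f"{name}#{i}"), name)
            for name in node_names
            for i in range(vnodes_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key):
        # The owner is the first vnode whose token is >= the key's token,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]

ring = TokenRing(["node1", "node2", "node3"])
owner = ring.owner("user:42")
```

The same key always hashes to the same token, so any coordinator can compute the owner without asking a central directory.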

● How a write request gets fulfilled:

Data Replication
● Replication strategies
– SimpleStrategy
● Used for a cluster in a single data center.
– NetworkTopologyStrategy
● Used for a cluster that spans multiple data centers.
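
For the simple strategy, replica placement can be sketched as walking the ring clockwise from the node that owns the key. The function name and its arguments are hypothetical; the sketch ignores racks and data centers, which is what the network topology strategy adds.

```python
def simple_strategy_replicas(ring_nodes, primary_index, replication_factor):
    """Pick replicas by walking the ring clockwise from the primary node.

    ring_nodes: node names in ring (token) order (illustrative input).
    primary_index: position of the node that owns the partition key.
    """
    if replication_factor > len(ring_nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return [
        ring_nodes[(primary_index + i) % len(ring_nodes)]
        for i in range(replication_factor)
    ]

replicas = simple_strategy_replicas(["n1", "n2", "n3", "n4"], 3, 3)
```

With four nodes, a primary at position 3, and a replication factor of 3, the walk wraps around the ring and selects `n4`, `n1`, and `n2`.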

Writing data in a Node
● Write an entry to the commit log.
● Write the data to the memtable.
● When the memtable is full, flush its data to disk as an SSTable.
● SSTables are immutable data structures.
● Writes also support a TTL (time to live).
Writes are fast because they only append to the commit log and
update the in-memory memtable.
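
The write path above can be sketched as follows. This is a toy model, not Cassandra's storage engine: `StorageNode`, `memtable_limit`, and the in-memory lists standing in for the commit log and on-disk SSTables are all assumptions made for illustration.

```python
import time
from types import MappingProxyType

class StorageNode:
    """Hypothetical single-node write path (illustrative only)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only log, stands in for the on-disk log
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # flushed, read-only tables (newest last)
        self.memtable_limit = memtable_limit

    def write(self, key, value, ttl_seconds=None, now=None):
        now = time.time() if now is None else now
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, value, now, expires_at))
        # 2. Apply the write to the memtable.
        self.memtable[key] = (value, now, expires_at)
        # 3. Flush once the memtable is "full".
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified after creation;
        # MappingProxyType gives a read-only view to model immutability.
        sstable = MappingProxyType(dict(sorted(self.memtable.items())))
        self.sstables.append(sstable)
        self.memtable = {}

node = StorageNode(memtable_limit=3)
for i, key in enumerate(["b", "a", "c"]):
    node.write(key, f"value-{key}", now=float(i))
```

Note that a write never reads or rewrites existing SSTables, which is the reason the write path stays cheap.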

Reading data from a Node
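● First, check the memtable.
● Then check the SSTables, using each SSTable's Bloom filter to
skip the ones that cannot contain the requested key.
● Merge what was found and send the most recent data as the response.
Cassandra may write many versions of the same row, so how is the
latest one identified? By the write timestamp: the most recent one wins.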
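
One way to picture the read path is the sketch below, which assumes every stored value carries a write timestamp and that the latest timestamp wins. `BloomFilter`, `SSTable`, and `read` are hypothetical names, and the Bloom filter is a deliberately small toy version.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    def __init__(self, rows):
        # rows: key -> (value, write_timestamp)
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

def read(key, memtable, sstables):
    """Collect every stored version of the key and return the newest value."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # The Bloom filter lets us skip SSTables that cannot hold the key.
        if sstable.bloom.might_contain(key) and key in sstable.rows:
            candidates.append(sstable.rows[key])
    if not candidates:
        return None
    return max(candidates, key=lambda versioned: versioned[1])[0]
```

The Bloom filter only rules SSTables out; a positive answer still requires looking the key up in the table itself.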

Update/Delete data from Node
● Data is not deleted immediately.
● Instead, it is marked as deleted/updated by a new write to the memtable.
● The marker for a deleted record is called a tombstone.
● Compaction runs periodically in the background.
● During compaction, Cassandra merges SSTables, keeps the latest
version of each record, drops the records marked for deletion,
and discards the old SSTables.
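
A simplified view of deletion markers and the merge step is sketched below. The `TOMBSTONE` sentinel, `delete`, and `compact` are hypothetical names; the sketch also drops tombstones immediately, whereas a real cluster keeps them for a grace period before purging them.

```python
# Sentinel standing in for a deletion marker (illustrative only).
TOMBSTONE = object()

def delete(memtable, key, timestamp):
    # A delete is just another write: it records a marker, not a removal.
    memtable[key] = (TOMBSTONE, timestamp)

def compact(sstables):
    """Merge SSTables (each a dict of key -> (value, timestamp)) into one.

    Keeps the newest version of every key and drops keys whose newest
    version is a tombstone. Simplification: no grace period is modeled.
    """
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return {k: v for k, v in merged.items() if v[0] is not TOMBSTONE}

older = {"a": ("1", 1), "b": ("2", 1)}
newer = {"b": ("3", 3)}
delete(newer, "a", 2)
compacted = compact([older, newer])
```

After the merge, key `a` is gone because its newest version is a tombstone, and key `b` keeps only its latest value.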

Data Consistency
● Data is not necessarily on every node all the time.
● The consistency level sets how many replicas must respond to a request:
– ONE
– QUORUM
– ALL
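● The consistency level has a major impact on performance: the more
replicas must respond, the higher the latency.
● For strong consistency:
R + W > N
where R is the number of replicas read, W is the number of replicas
written, and N is the replication factor.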
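
The rule can be checked with a little arithmetic. The helper names below are hypothetical; `quorum` uses the usual majority formula, floor(N / 2) + 1.

```python
def quorum(replication_factor):
    # A quorum is a majority of the replicas.
    return replication_factor // 2 + 1

def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    # R + W > N guarantees the read set and the write set overlap
    # in at least one replica, so a read sees the latest write.
    return read_replicas + write_replicas > replication_factor

n = 3
q = quorum(n)
quorum_both = is_strongly_consistent(q, q, n)
one_both = is_strongly_consistent(1, 1, n)
```

With N = 3, reading and writing at QUORUM gives 2 + 2 > 3, so the condition holds; reading and writing at ONE gives 1 + 1 = 2, which does not exceed 3.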

