Apache Cassandra
Harshit Daga
Software Consultant
Knoldus Software LLP
Agenda
● What is Cassandra
● Gossip communication protocol
● Cassandra- Data Model
● Cassandra- Architecture
● Reading from/writing to a node
● Data consistency
Cassandra
● Cassandra is a massively scalable, schemaless database.
● Open source database, licensed under the Apache License.
● Originally developed by Facebook for inbox search.
● Data model is based on Google’s BigTable.
● Distributed design is based on Amazon’s Dynamo.
● Promoted heavily by DataStax.
Gossip Communication Protocol
● Peer to peer communication protocol.
● Nodes are arranged in a ring.
● Data is replicated to multiple nodes.
● Nodes periodically exchange the state information they hold about other nodes.
● Nodes also exchange their own state.
● Each message carries a version, so newer information replaces older information.
● No master-slave concept, and hence no single point of failure.
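The versioned exchange described above can be sketched in a few lines. This is a toy model, not Cassandra's actual gossip implementation: the node names, the `(version, status)` tuple, and the `merge`/`gossip_round` helpers are all assumptions made for illustration.

```python
import random

def merge(local, remote):
    """Copy into `local` every entry that is new or has a higher version."""
    for node, (version, status) in remote.items():
        if node not in local or local[node][0] < version:
            local[node] = (version, status)

def gossip_round(views, rng):
    """Each node picks one random peer and the two exchange their views."""
    names = list(views)
    for name in names:
        peer = rng.choice([n for n in names if n != name])
        merge(views[peer], views[name])
        merge(views[name], views[peer])

# Every node starts out knowing only its own state: {node: (version, status)}.
views = {n: {n: (1, "UP")} for n in ["A", "B", "C", "D"]}
rng = random.Random(0)
for _ in range(50):  # upper bound; a small ring usually converges in a few rounds
    if all(len(view) == len(views) for view in views.values()):
        break
    gossip_round(views, rng)
```

There is no coordinator in the loop: every node runs the same code, which is why losing any single node does not stop the exchange.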
Cassandra- Data Model
● Each column is stored as a key/value pair.
● A collection of columns makes a row.
● A column family is a collection of rows.
● In an RDBMS, every column in a row must hold a value or NULL;
Cassandra has no such requirement.
Cassandra- Data Model
● Consider the following example:
● Now insert a new row:
● The insertion above would not fail.
Cassandra- Data Model
● This means data is stored as a multi-dimensional sparse array.
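A minimal sketch of the sparse-row idea, modelling a column family as a dictionary of rows where each row is its own dictionary of columns. The `users` data, the column names, and the `get_column` helper are hypothetical and only serve the illustration.

```python
# A column family modelled as {row_key: {column_name: value}}.
users = {
    "u1": {"name": "Alice", "email": "alice@example.com", "city": "Delhi"},
    "u2": {"name": "Bob", "email": "bob@example.com", "city": "Pune"},
}

# Insert a row that sets only one column; nothing is stored for the others.
users["u3"] = {"name": "Carol"}

def get_column(column_family, row_key, column):
    """Return the column's value, or None when the row has no such column."""
    return column_family.get(row_key, {}).get(column)
```

Row `u3` carries no placeholder for `email` or `city`; the columns are simply absent, which is what makes the structure sparse.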
Cassandra- Architecture
● A ring has several nodes.
● Each node is assigned a partition value (token).
● Data placement is based on the partition key.
● When a client sends a request to a node, that node becomes the
coordinator for the request.
● The coordinator determines which node in the ring should
process the request.
Cassandra- Architecture
● Virtual Nodes (Vnodes)
– Let each node own many small partition token ranges.
– Tokens are automatically calculated and assigned to each
node.
– Cluster rebalancing is done automatically.
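The effect of vnodes can be sketched as follows: every physical node receives several random tokens, and each token owns the ring range that ends at it. The ring size, the node names, and both helper functions are assumptions for illustration; Cassandra's real token allocation is more involved.

```python
import random

def assign_tokens(nodes, num_tokens, rng, ring_size=2**16):
    """Give every node `num_tokens` distinct random positions (vnodes) on the ring."""
    ring = {}
    for node in nodes:
        assigned = 0
        while assigned < num_tokens:
            token = rng.randrange(ring_size)
            if token not in ring:
                ring[token] = node
                assigned += 1
    return ring

def ownership(ring, ring_size=2**16):
    """Fraction of the ring owned by each node; a token owns the range ending at it."""
    tokens = sorted(ring)
    previous = [tokens[-1] - ring_size] + tokens[:-1]  # wrap around the ring
    owned = {}
    for prev, token in zip(previous, tokens):
        owned[ring[token]] = owned.get(ring[token], 0) + (token - prev)
    return {node: size / ring_size for node, size in owned.items()}

ring = assign_tokens(["n1", "n2", "n3"], 8, random.Random(1))
shares = ownership(ring)
```

Because each node holds many small ranges scattered around the ring, a node that joins or leaves trades ranges with many peers instead of just its two neighbours.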
Cassandra- Architecture
● Which node gets which data depends on the partition key.
● Cassandra computes a hash value (token) for each partition key.
● The data goes to the node that owns that hash value.
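A small sketch of the lookup a coordinator performs: hash the partition key to a token, then find the first node whose token is greater than or equal to it, wrapping around the ring. MD5 is only a stand-in hash here (Cassandra's default partitioner uses Murmur3), and the token values and node names are made up.

```python
import bisect
import hashlib

def token_for(partition_key, ring_size=2**32):
    """Hash the partition key to a position on the ring (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ring_size

def owner(ring_tokens, ring_nodes, token):
    """Return the first node whose token is >= `token`, wrapping past the end."""
    index = bisect.bisect_left(ring_tokens, token) % len(ring_tokens)
    return ring_nodes[index]

# Hypothetical three-node ring; tokens must be sorted in ascending order.
ring_tokens = [1_000_000_000, 2_500_000_000, 4_000_000_000]
ring_nodes = ["node-a", "node-b", "node-c"]
node = owner(ring_tokens, ring_nodes, token_for("user:42"))
```

The same key always hashes to the same token, so any coordinator can compute the owner without consulting a central directory.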
Cassandra- Architecture
● How a write request gets fulfilled:
Data Replication
● Data replication
– Simple Strategy
● Used for a single data center.
– Network Topology Strategy
● Used for clusters that span multiple data centers.
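The two strategies can be contrasted with a simplified sketch. Both walk the ring clockwise from the position that owns the data; the second one fills a separate replica quota per data center. Rack awareness is deliberately ignored, and the ring layout, node names, and function names are assumptions for illustration.

```python
def simple_strategy(ring, start, rf):
    """Walk clockwise from `start` and take the next `rf` distinct nodes."""
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas

def network_topology_strategy(ring, datacenter_of, start, rf_per_dc):
    """Same clockwise walk, but with a replica quota per data center (racks ignored)."""
    replicas = {dc: [] for dc in rf_per_dc}
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        dc = datacenter_of[node]
        if dc in replicas and node not in replicas[dc] and len(replicas[dc]) < rf_per_dc[dc]:
            replicas[dc].append(node)
    return replicas

# Hypothetical ring order and data-center membership.
ring = ["a1", "b1", "a2", "b2", "a3"]
datacenter_of = {"a1": "dc1", "a2": "dc1", "a3": "dc1", "b1": "dc2", "b2": "dc2"}
```

With `start=0`, `network_topology_strategy(ring, datacenter_of, 0, {"dc1": 2, "dc2": 1})` places two replicas in `dc1` and one in `dc2`, whereas `simple_strategy` pays no attention to where the nodes live.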
Writing data in a Node
● Write an entry to the commit log.
● Write the data to the memtable.
● When the memtable is full, flush the data to disk as an SSTable.
● SSTables are immutable data structures.
● Writes also support a TTL (time to live).
Cassandra is optimized for fast writes: each write is a sequential commit-log append plus an in-memory update.
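The write steps above can be sketched with plain Python containers. The `Node` class, its `memtable_limit`, and the tuple layout are hypothetical; a real node persists the commit log and SSTables to disk, which this in-memory model does not attempt.

```python
import time

class Node:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only log, replayed after a crash
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit

    def write(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires_at = None if ttl is None else now + ttl
        self.commit_log.append((key, value, now, expires_at))  # 1. log first
        self.memtable[key] = (value, now, expires_at)          # 2. then memory
        if len(self.memtable) >= self.memtable_limit:          # 3. flush when full
            self.flush()

    def flush(self):
        # A tuple sorted by key stands in for an immutable on-disk SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying
```

Neither step requires reading existing data first, which is the main reason the write path is cheap.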
Reading data from a Node
● First, checks the memtable.
● If found, the data is sent as the response.
● Else, fetches the data from the SSTables, using Bloom filters to skip
SSTables that cannot contain the key.
Cassandra may write many versions of the same row, so
how does it identify the latest one?
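A toy version of this read path, including one way to answer the question of the latest version: Cassandra stores a write timestamp with the data, and the newest timestamp wins. The `BloomFilter` and `SSTable` classes and the `read` function are simplified assumptions; an actual read also merges memtable and SSTable contents rather than stopping at the first hit.

```python
import hashlib

class BloomFilter:
    """May report false positives, never false negatives."""
    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)  # {key: (value, write_timestamp)}
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)

def read(key, memtable, sstables):
    """Memtable first; otherwise the newest version among matching SSTables."""
    if key in memtable:
        return memtable[key][0]
    versions = [table.rows[key] for table in sstables
                if table.bloom.might_contain(key) and key in table.rows]
    return max(versions, key=lambda v: v[1])[0] if versions else None
```

The Bloom filter check is cheap and in memory, so an SSTable that definitely lacks the key is skipped without being searched.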
Update/Delete data from Node
● Data is not immediately deleted.
● It is marked as deleted/updated in the memtable.
● The deletion marker is called a tombstone.
● Compaction runs at a configured interval.
● During each run, it merges the SSTables, applies the marked
changes, and discards the old SSTables.
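The mark-then-clean-up behaviour can be sketched as a delete that writes a marker, followed by a merge that keeps only the newest version of each key. The `TOMBSTONE` sentinel and both functions are illustrative assumptions; real Cassandra also keeps tombstones for a grace period (`gc_grace_seconds`) before dropping them, which this sketch omits.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"

def delete(memtable, key, timestamp):
    """A delete is just another write: it records a tombstone for the key."""
    memtable[key] = (TOMBSTONE, timestamp)

def compact(sstables):
    """Merge SSTables into one: the newest version per key wins,
    and keys whose newest version is a tombstone are dropped."""
    latest = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in latest or timestamp > latest[key][1]:
                latest[key] = (value, timestamp)
    return {k: v for k, v in latest.items() if v[0] is not TOMBSTONE}

# Two hypothetical SSTables: the second updates "a" and deletes "b".
older = {"a": ("1", 1), "b": ("x", 1)}
newer = {"a": ("2", 2)}
delete(newer, "b", 3)
merged = compact([older, newer])
```

After compaction only the merged table is kept, so the superseded value of `a` and every trace of `b` stop occupying disk space.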
Data Consistency
● Data is not necessarily up to date on every replica at all times.
● The consistency level sets how many replicas must respond:
– ONE
– QUORUM
– ALL
● The consistency level has a major impact on performance.
● For strong consistency:
R + W > N (R = replicas read, W = replicas written, N = replication factor)
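The rule can be checked mechanically. In the sketch below, R and W are the number of replicas that must answer a read and a write at the chosen consistency level, and N is the replication factor. The function names are assumptions; only the three levels listed above are modelled.

```python
def quorum(replication_factor):
    """A majority of the replicas."""
    return replication_factor // 2 + 1

def replicas_required(level, replication_factor):
    """How many replicas must respond at the given consistency level."""
    return {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]

def is_strongly_consistent(read_level, write_level, replication_factor):
    """True when every read set must overlap every write set: R + W > N."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With N = 3, reading and writing at QUORUM gives 2 + 2 > 3, so at least one replica in every read has seen the latest write; ONE with ONE gives 1 + 1 > 3, which is false, so a read may return stale data.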
References
● O’Reilly: Cassandra: The Definitive Guide
● https://cassandra.apache.org/doc/latest/
● http://docs.datastax.com/en/cassandra/3.0/
Thank You !!
