Cassandra is an open-source, distributed, highly scalable, and fault-tolerant database. It is a strong choice for managing large volumes of structured, semi-structured, or unstructured data.
2. Agenda
● What is Cassandra
● Gossip communication protocol
● Cassandra- Data Model
● Cassandra- Architecture
● Reading from and writing to a node
● Data consistency
3. Cassandra
● Cassandra is a massively scalable, schemaless database.
● Open-source database, licensed under the Apache License.
● Originally developed by Facebook for inbox search.
● Data model based on Google’s Bigtable.
● Distributed design based on Amazon’s Dynamo.
● Heavily promoted by DataStax.
4. Gossip Communication Protocol
● Peer-to-peer communication protocol.
● Nodes are arranged in a ring.
● Data is replicated to multiple nodes.
● Nodes periodically exchange the state information they hold
about other nodes.
● Nodes also exchange their own state information.
● Each message carries a version, so newer information replaces
older information.
● No master-slave concept, and hence no single point of failure.
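The version-based exchange can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's actual gossip implementation: the `Node` class, the `status` field, and the "higher version wins" merge rule are assumptions made for the example.

```python
# Minimal gossip sketch: each node keeps a versioned view of every node's state.
# Node names, the "status" field, and the merge rule are illustrative assumptions.

class Node:
    def __init__(self, name):
        self.name = name
        # view maps node name -> (version, status)
        self.view = {name: (1, "UP")}

    def update_own_state(self, status):
        """Bump this node's own version whenever its state changes."""
        version, _ = self.view[self.name]
        self.view[self.name] = (version + 1, status)

    def gossip_with(self, peer):
        """Exchange views in both directions; the higher version wins."""
        merged = dict(self.view)
        for name, (version, status) in peer.view.items():
            if name not in merged or version > merged[name][0]:
                merged[name] = (version, status)
        self.view = dict(merged)
        peer.view = dict(merged)


a, b, c = Node("a"), Node("b"), Node("c")
a.gossip_with(b)   # a and b now know about each other
b.gossip_with(c)   # c learns about a through b, without contacting a
a.update_own_state("LEAVING")
a.gossip_with(b)   # b picks up the newer version of a's state
```

After the last exchange, `b` holds the new state of `a`, while `c` still holds the older version until its next gossip round. No node acts as a master in any of these steps.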
5. Cassandra- Data Model
● Each column is stored as a key/value pair.
● A collection of columns makes a row.
● A column family is a collection of rows.
● In an RDBMS, every column in a row must hold a value or NULL;
in Cassandra, a row can simply omit columns it does not use.
6. Cassandra- Data Model
● Consider the following example:
● Now insert a new row:
● The insertion above does not fail.
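A rough way to picture sparse rows is a dictionary of dictionaries. The sketch below is only an analogy in Python, with made-up row keys and column names; it does not use a Cassandra driver.

```python
# A column family modelled as: row key -> {column name: value}.
# The row keys and column names below are made up for illustration.
users = {
    "user1": {"name": "Alice", "email": "alice@example.com", "city": "Pune"},
    "user2": {"name": "Bob", "email": "bob@example.com", "city": "Delhi"},
}

# Insert a row that omits "email" and "city" and adds a new "phone" column.
# Nothing fails, and no NULL placeholders are stored for the missing columns.
users["user3"] = {"name": "Carol", "phone": "555-0100"}


def get_column(column_family, row_key, column):
    """Return the column value, or None if the row does not store it."""
    return column_family[row_key].get(column)
```

Here `get_column(users, "user3", "email")` returns `None` because the row never stored that column, not because a NULL was written.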
8. Cassandra- Architecture
● A ring has several nodes.
● Each node is assigned a range of partition token values.
● Requests are routed based on the partition key.
● When a client sends a request to a node, that node becomes the
coordinator for the request.
● The coordinator determines which nodes in the ring should
handle the request.
9. Cassandra- Architecture
● Virtual Nodes (Vnodes)
– Each node owns many small token ranges instead of one large
range.
– Tokens are automatically calculated and assigned to each node.
– The cluster is rebalanced automatically when nodes join or leave.
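The idea of many tokens per node can be sketched as follows. This is a simplified illustration: the 32-bit token space, the MD5-based token derivation, and the helper names `vnode_tokens` and `build_ring` are assumptions for the example, not how Cassandra actually generates tokens.

```python
import hashlib

RING_SIZE = 2 ** 32  # illustrative token space; real clusters use a larger one


def vnode_tokens(node, num_tokens):
    """Derive num_tokens pseudo-random tokens for a node (hypothetical scheme)."""
    tokens = []
    for i in range(num_tokens):
        digest = hashlib.md5(f"{node}:{i}".encode()).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % RING_SIZE)
    return tokens


def build_ring(nodes, num_tokens=8):
    """Return a sorted list of (token, node) pairs covering all vnodes."""
    ring = [(t, n) for n in nodes for t in vnode_tokens(n, num_tokens)]
    return sorted(ring)


ring = build_ring(["node1", "node2", "node3"])
bigger_ring = build_ring(["node1", "node2", "node3", "node4"])
```

Adding `node4` leaves every existing token in place and only inserts new ones between them, so the new node takes small slices from several neighbours rather than half of one node's range.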
10. Cassandra- Architecture
● The partition key determines which node stores a given row.
● Cassandra hashes each partition key to a token.
● The row is stored on the node whose token range contains that
token.
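The lookup from partition key to node can be sketched with a sorted token list. The ring layout, node names, and the use of MD5 truncated to 32 bits are assumptions for this example; a coordinator would perform an equivalent lookup before forwarding a request.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns the token range ending at its token.
RING = [(2 ** 30, "node1"), (2 ** 31, "node2"), (3 * 2 ** 30, "node3")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key):
    """Hash a partition key to a 32-bit token (MD5 is an illustrative choice)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")


def owner_of_token(token):
    """Find the first node whose token is >= the given token, wrapping around."""
    index = bisect.bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]


def owner_of(partition_key):
    """Route a partition key to the node that owns its token."""
    return owner_of_token(token_for(partition_key))
```

Tokens larger than the last node's token wrap around to the first node, which is what makes the layout a ring.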
12. Data Replication
● Replication strategies
– Simple Strategy
● Used for a single data center.
– Network Topology Strategy
● Used for a cluster that spans multiple data centers.
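The difference between the two strategies can be sketched as two ways of walking the ring. The ring order, node names, and data-center labels are hypothetical, and rack awareness is deliberately left out, so treat this as a simplified picture of replica placement rather than the real algorithm.

```python
# Ring order and data-center labels are hypothetical.
RING = [("node1", "dc1"), ("node2", "dc2"), ("node3", "dc1"),
        ("node4", "dc2"), ("node5", "dc1"), ("node6", "dc2")]


def simple_strategy(start, replication_factor):
    """Take the next replication_factor nodes clockwise, ignoring topology."""
    return [RING[(start + i) % len(RING)][0] for i in range(replication_factor)]


def network_topology_strategy(start, replicas_per_dc):
    """Walk the ring clockwise, filling each data center's quota (racks ignored)."""
    remaining = dict(replicas_per_dc)
    chosen = []
    for i in range(len(RING)):
        node, dc = RING[(start + i) % len(RING)]
        if remaining.get(dc, 0) > 0:
            chosen.append(node)
            remaining[dc] -= 1
    return chosen
```

For example, `network_topology_strategy(0, {"dc1": 2, "dc2": 1})` selects two nodes from `dc1` and one from `dc2`, whereas `simple_strategy` picks neighbours without looking at the data center.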
13. Writing data to a Node
● Append an entry to the commit log.
● Write the data to the memtable.
● When the memtable is full, flush its data to disk as an SSTable.
● SSTables are immutable data structures.
● Writes can also carry a TTL (time to live).
Writes are sequential appends, which makes Cassandra very fast at
write operations.
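The write steps can be sketched as a small in-memory model. The class name `StorageNode`, the tiny `memtable_limit`, and the use of Python lists and dicts in place of on-disk files are all assumptions for illustration.

```python
import time
from types import MappingProxyType


class StorageNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only log, used for recovery
        self.memtable = {}        # in memory: key -> (value, timestamp, expires_at)
        self.sstables = []        # flushed snapshots that are never modified
        self.memtable_limit = memtable_limit

    def write(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        expires_at = None if ttl is None else now + ttl
        entry = (value, now, expires_at)
        self.commit_log.append((key, entry))   # 1. commit log first
        self.memtable[key] = entry             # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when full

    def flush(self):
        # A read-only mapping stands in for an immutable SSTable on disk.
        self.sstables.append(MappingProxyType(dict(self.memtable)))
        self.memtable = {}


node = StorageNode(memtable_limit=2)
node.write("k1", "v1", now=100.0)
node.write("k2", "v2", ttl=60, now=101.0)   # second write fills the memtable
```

Neither step reads existing data before writing, which is the main reason the write path stays cheap.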
14. Reading data from a Node
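● First, check the memtable.
● If the data is found there, it is sent as the response.
● Otherwise, fetch the data from the SSTables, using each SSTable's
Bloom filter to skip those that cannot contain the key.
Cassandra may write many versions of the same row, so how does
it identify the latest one? Every write carries a timestamp, and
the version with the newest timestamp wins.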
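One way to picture the answer: every stored version carries a write timestamp, and the reader keeps the newest one. The sketch below is a simplified model with a toy Bloom filter in front of each SSTable. The class names, filter size, and hash scheme are assumptions, and unlike the simplified flow in the bullets, it gathers versions from both the memtable and the SSTables before choosing.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = dict(rows)          # key -> (value, timestamp)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read(key, memtable, sstables):
    """Collect every stored version of key and return the newest value."""
    versions = []
    if key in memtable:
        versions.append(memtable[key])
    for table in sstables:
        # Skip SSTables whose Bloom filter rules the key out.
        if table.bloom.might_contain(key) and key in table.rows:
            versions.append(table.rows[key])
    if not versions:
        return None
    return max(versions, key=lambda version: version[1])[0]
```

The Bloom filter only saves work: a "no" answer is always correct, so skipping that SSTable never loses data.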
15. Updating/Deleting data on a Node
● Data is not deleted immediately.
● The record is marked as deleted or updated in the memtable.
● The deletion marker is called a tombstone.
● Compaction runs periodically in the background.
● Each compaction merges SSTables, applies the marked updates and
deletions, and discards the old SSTables.
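The mark-then-merge behaviour can be sketched as follows. The `TOMBSTONE` sentinel, the plain dicts standing in for SSTables, and the immediate removal of deleted keys are simplifying assumptions; a real system keeps tombstones for a grace period before dropping them.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def delete(memtable, key, timestamp):
    """Deleting writes a tombstone instead of removing data in place."""
    memtable[key] = (TOMBSTONE, timestamp)


def compact(sstables):
    """Merge SSTables: the newest timestamp wins and tombstoned keys drop out."""
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Drop deleted records; this sketch skips the grace period entirely.
    return {k: v for k, v in merged.items() if v[0] is not TOMBSTONE}
```

The result of `compact` replaces the input SSTables, which can then be discarded. The old files are never edited in place.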
16. Data Consistency
● At any given moment, not every replica necessarily has the
latest data.
● The consistency level sets how many replicas must respond to a
request:
– ONE
– QUORUM
– ALL
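● The consistency level has a major impact on performance.
● For strong consistency:
R + W > N
where R is the number of replicas read, W the number of replicas
written, and N the replication factor.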
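The rule is easy to check with a little arithmetic. The helper names below are hypothetical, and only the three levels listed above are handled; QUORUM is taken as a majority of replicas, that is floor(N / 2) + 1.

```python
def replicas_required(level, replication_factor):
    """Map a consistency level name to the number of replicas that must respond."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def is_strongly_consistent(read_level, write_level, replication_factor):
    """Check R + W > N, so every read set overlaps every write set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With N = 3, QUORUM reads and QUORUM writes give 2 + 2 > 3, so at least one replica in every read has seen the latest write. ONE and ONE give 1 + 1 = 2, which is not greater than 3, so a read may miss it.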