Overview of Cassandra architecture. Learn how data is read from and written to a Cassandra cluster, how the internal gossip protocol works, and some key data structures Cassandra uses, such as bloom filters and consistent hashing.
2. Distributed: Peer to Peer
● There is no leader/follower.
● Each node is aware of the key ranges held by other nodes and coordinates with the owning node to fetch the data.
● Depending on the replication factor & consistency level, the coordinator talks to one or more nodes before returning the response to the client.
● Every table defines a partition key.
● Data is distributed across the various nodes in the cluster using a hash of the partition key, via the consistent hashing algorithm (see the sketch after this list).
● Partitions are replicated across multiple nodes to prevent a single point of failure.
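A minimal sketch of that placement, assuming a toy ring, an illustrative hash, and "owner plus the next RF-1 nodes" replica selection (the positions and hash here are made up; this is not Cassandra's actual Murmur3 partitioner):

```python
import bisect
import hashlib

# Hypothetical ring: (token, node name), kept sorted by token.
RING = sorted([(25, "node-A"), (92, "node-B"), (160, "node-C"), (230, "node-D"), (310, "node-E")])
TOKENS = [t for t, _ in RING]

def token_for(partition_key, ring_size=360):
    """Map a partition key to a position on the ring (illustrative hash)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % ring_size

def replicas_for(partition_key, rf=3):
    """Owner = first node at or after the key's token (wrapping around),
    then the next rf-1 nodes on the ring hold the replicas."""
    idx = bisect.bisect_left(TOKENS, token_for(partition_key)) % len(RING)
    return [RING[(idx + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("user:42"))   # prints the 3 nodes that hold this partition
```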
3. Replication
Replication copies the data across multiple nodes within/across the DCs.
Replication Factor (RF) denotes the number of copies. It is set at the keyspace level.
Snitch: a strategy to identify the DC and rack a node belongs to. This identity can be shared across all nodes manually or via gossip.
The coordinator is aware of the RF per keyspace and coordinates the writes up to that factor to the various nodes within/across DCs.
Hinted Handoff: while a replica node is down, the coordinator defers transmission to that node by persisting the data locally as a hint. It retransmits the hint once the replica node is back online.
Cassandra configuration sets the duration for which such data is held before handoff (see the sketch below).
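A toy sketch of the hinted-handoff idea (the hint store, window value, and function names are invented for illustration; the actual hold duration comes from Cassandra's configuration):

```python
import time

HINT_WINDOW_SECS = 3 * 60 * 60   # illustrative hold duration for hints

def send(replica, mutation):
    print(f"sending {mutation!r} to {replica}")

class Coordinator:
    def __init__(self):
        self.hints = {}   # replica -> list of (timestamp, mutation)

    def write(self, replica, mutation, replica_is_up):
        if replica_is_up:
            send(replica, mutation)                      # normal path
        else:
            # Replica is down: persist the mutation locally as a hint.
            self.hints.setdefault(replica, []).append((time.time(), mutation))

    def on_replica_back_online(self, replica):
        # Replay only hints that are still within the hint window.
        for ts, mutation in self.hints.pop(replica, []):
            if time.time() - ts <= HINT_WINDOW_SECS:
                send(replica, mutation)
```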
Replication & Consistency
Consistency is the agreed number of nodes that must acknowledge a read/write for it to be accepted.
Consistency can be set for both reads and writes.
Consistency levels (CL) can be set from low to high (ONE, LOCAL_QUORUM, QUORUM, ALL).
CL is a trade-off between consistency and availability (see the sketch below).
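For example, the number of replica acknowledgements each CL requires can be computed as below (a simplified sketch; LOCAL_QUORUM is shown counting only the local DC's replicas):

```python
def acks_required(cl, rf, local_rf=None):
    """How many replicas must acknowledge a read/write for the given CL."""
    return {
        "ONE": 1,
        "LOCAL_QUORUM": ((local_rf or rf) // 2) + 1,   # quorum within the local DC
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }[cl]

# With RF=3: ONE -> 1, LOCAL_QUORUM -> 2, QUORUM -> 2, ALL -> 3.
for cl in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL"):
    print(cl, acks_required(cl, rf=3))
```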
Read Repair: the coordinator performs a read repair on some/all of the replicas that have trailing versions. Depending on the CL, this can be done asynchronously during a read request.
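A minimal sketch of the read-repair idea (replica names, values, and timestamps are hypothetical):

```python
def read_with_repair(replica_responses):
    """replica_responses: dict of replica -> (value, write_timestamp).
    Returns the newest value and repairs replicas holding older versions."""
    _, (newest_value, newest_ts) = max(replica_responses.items(), key=lambda kv: kv[1][1])
    stale = [r for r, (_, ts) in replica_responses.items() if ts < newest_ts]
    for replica in stale:
        print(f"repairing {replica} with value {newest_value!r} (ts={newest_ts})")
    return newest_value

read_with_repair({"r1": ("v2", 170), "r2": ("v1", 120), "r3": ("v2", 170)})
```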
4. Gossip
Each node stores info about itself and every other node in its knowledge base.
Each node initiates gossip every second with 2 or 3 other nodes to share its knowledge base.
Knowledge Base:
Each node increments its heartbeat version every second.
When a node receives gossip from another node, it checks each node's heartbeat version and updates its own entry if the received version is newer.
Optimization to reduce message bandwidth during gossiping:
Gossip is initiated with a SYN to the receiving node.
SYN: just a digest - no AppState included.
The receiving node ACKs back to the sender.
ACK: a digest for the versions the receiver is trailing on & detailed state (includes AppState) for the versions it is leading on.
The sender applies the detailed updates, then acks back with the detailed info for the trailing versions requested by the other end.
Knowledge Base (per node):
EndPt State: <IP of a node>
  HeartBeat State:
    Generation: 10
    Version: 34
  Application State:
    Status: Normal/Removed/Arrived…
    DataCenter:
    Rack:
    Load:
    Severity:
    ….
EndPt State: <IP of a node>…
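A rough sketch of how those heartbeat versions are compared when gossip arrives (field names and IPs are simplified stand-ins, not Cassandra's actual classes):

```python
# Local knowledge base: endpoint IP -> {"generation", "version", "app_state"}.
knowledge_base = {
    "10.0.0.1": {"generation": 10, "version": 34, "app_state": {"status": "Normal", "rack": "r1"}},
    "10.0.0.2": {"generation": 7, "version": 12, "app_state": {"status": "Normal", "rack": "r2"}},
}

def make_digest(kb):
    """SYN payload: just (generation, version) per endpoint, no AppState."""
    return {ep: (st["generation"], st["version"]) for ep, st in kb.items()}

def merge_gossip(kb, remote_digest, remote_detail):
    """On receiving gossip: adopt entries whose (generation, version) is newer,
    and collect the endpoints we are trailing on so we can request their details."""
    trailing = []
    for ep, (gen, ver) in remote_digest.items():
        local = kb.get(ep)
        if local is None or (gen, ver) > (local["generation"], local["version"]):
            if ep in remote_detail:
                kb[ep] = remote_detail[ep]      # newer, detailed state received
            else:
                trailing.append(ep)             # only a digest: ask the peer for details
    return trailing
```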
5. Write Path
The client writes to both the commit log and the memtable. In the event of a node failure, the memtable can be reconstructed from the commit log.
The commit log is append-only and does not maintain any order.
The memtable is partitioned by partition key and ordered by the clustering columns.
Eventually the memtable grows too large and is flushed to disk as an SSTable. SSTables are immutable, so each flush creates a new SSTable file.
An SSTable holds data grouped by partition.
Compaction is the process of merging numerous SSTable files into one. It relies on the timestamp of each row to resolve duplicates (see the sketch below).
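A toy sketch of how compaction resolves duplicate rows by timestamp (the sample rows mimic the diagram below; tombstone handling is omitted):

```python
def compact(sstables):
    """Merge several SSTables into one, keeping the latest version of each row.
    Each SSTable is a list of (row_key, value, write_timestamp)."""
    merged = {}
    for sstable in sstables:
        for row_key, value, ts in sstable:
            if row_key not in merged or ts > merged[row_key][1]:
                merged[row_key] = (value, ts)
    # Emit rows sorted by key, like a fresh SSTable on disk.
    return sorted((k, v, ts) for k, (v, ts) in merged.items())

sstable_1 = [(("23", "USA"), 4, 100), (("23", "Mexico"), 7, 100)]
sstable_2 = [(("23", "USA"), 8, 200), (("55", "Korea"), 9, 150)]
print(compact([sstable_1, sstable_2]))   # ('23', 'USA') keeps value 8, the newer write
```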
[Diagram: the memtable in memory is flushed to immutable SSTable files on disk; compaction merges several SSTables into one, keeping only the latest version of each duplicated row.]
6. Read Path
[Diagram: a read on a replica node flows from the memtable through per-SSTable bloom filters, the key cache (LRU), the summary index and the partition index, down to the SSTables on disk.]
Order of search during a Read:
The coordinator node calls one of the replica nodes for the requested partition key.
The replica node first looks in the memtable. If not found, it follows the path below until the key is found.
Bloom filters help determine one of two things: the key definitely doesn't exist in the SSTable, or the key may exist in the SSTable.
Key Cache: an LRU cache whose key is the partition key and whose value is the offset of the partition in the SSTable file.
Summary Index: a range-based index over the keys in the partition index and their offsets.
Partition Index: an indexed lookup from the partition key to the offset of the partition in the SSTable file.
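Put together, the per-SSTable lookup order can be sketched roughly as below (the data structures are simplified stand-ins, not Cassandra classes; the summary index step is folded into the index lookup):

```python
def read_partition(key, memtable, sstables):
    """Illustrative read path on a replica: memtable first, then each SSTable
    guarded by its bloom filter, key cache and partition index."""
    if key in memtable:
        return memtable[key]
    for ss in sstables:
        if key not in ss["bloom"]:          # stand-in for a bloom-filter check:
            continue                        # "definitely not here" -> skip this SSTable
        offset = ss["key_cache"].get(key)   # LRU key cache: partition key -> offset
        if offset is None:
            offset = ss["partition_index"].get(key)   # the summary index would narrow this lookup
        if offset is not None:
            ss["key_cache"][key] = offset
            return ss["data"][offset]
    return None

sstable = {
    "bloom": {"user:1", "user:7"},          # a plain set standing in for a bloom filter
    "key_cache": {},
    "partition_index": {"user:1": 0, "user:7": 1},
    "data": [{"name": "Ada"}, {"name": "Linus"}],
}
print(read_partition("user:7", memtable={}, sstables=[sstable]))   # {'name': 'Linus'}
```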
9. Consistent Hashing
Given a set of key/value pairs, hashing is a strategy to spread the pairs as evenly as possible, so that we can fetch them in near-constant time by their key.
Consistent hashing is one such hashing strategy for spreading the keys in a distributed environment.
The hashes of the keys are conceptually spread on a ring. The position a key takes on the ring can be anywhere between 0 and 360, based on the hash of the key (usually a mod on the hash).
The stores/servers that host these keys are also given positions on the ring (e.g., A, B, C…).
The key is stored on the first server found while traversing the ring in the anti-clockwise direction from the key's position.
E.g., key Steve @ 352.3 finds server C @ 81.7
If we maintain a sorted list of servers and their positions, a quick binary search will point us to the server where the key can be found, eliminating the need to query all servers (see the sketch below).
Keys can be replicated on the succeeding servers to avoid a single point of failure (SPOF).
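A small sketch of that lookup (server positions are hypothetical; the slide's anti-clockwise rule is modelled here as "the server with the largest position at or below the key's, wrapping around the ring"):

```python
import bisect

# Hypothetical server positions on a 0-360 ring (the slide's actual picture may differ).
servers = sorted([(30.0, "A"), (55.0, "B"), (81.7, "C")])
positions = [p for p, _ in servers]

def position_of(key, ring_size=360):
    """Place a key on the ring, e.g. via a mod on its hash (simplified)."""
    return hash(key) % ring_size

def server_for(key):
    """Anti-clockwise traversal: the server with the largest position at or
    below the key's position, wrapping past 0 back to the largest position."""
    idx = bisect.bisect_right(positions, position_of(key)) - 1   # binary search on the sorted list
    return servers[idx][1]                                       # idx == -1 wraps to the last server

print(server_for("Steve"))   # with these positions, any key above 81.7 lands on C
```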
10. Consistent Hashing
Although the keys are spread over several servers, the distribution may not be even, due to uneven clustering of the keys in the real world (names starting with certain letters may be more common).
In such scenarios, to overcome the load on an individual server, we define virtual servers: we give the same server multiple positions on the ring, simulating multiple instances of it across the ring.
With reference to the picture here, the refined sorted list of servers will now have virtual instances of servers a1, a2, b2, c3, etc., thereby distributing the load on C to B and A as well (see the sketch below).
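Extending the previous sketch, virtual servers just mean each physical server appears at several positions on the ring (positions again hypothetical):

```python
import bisect

# Each physical server gets multiple positions ("virtual instances") on the ring.
virtual_servers = sorted([
    (30.0, "A"), (140.0, "A"), (260.0, "A"),     # a1, a2, a3
    (55.0, "B"), (190.0, "B"), (300.0, "B"),     # b1, b2, b3
    (81.7, "C"), (220.0, "C"), (340.0, "C"),     # c1, c2, c3
])
positions = [p for p, _ in virtual_servers]

def server_for(key_position):
    """Same anti-clockwise lookup as before, now over the virtual positions."""
    idx = bisect.bisect_right(positions, key_position) - 1
    return virtual_servers[idx][1]

# Key positions that previously all fell to C are now spread over A, B and C.
print([server_for(p) for p in (90.0, 150.0, 200.0, 270.0, 352.3)])
```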
11. Bloom Filters
It's a probabilistic data structure to determine whether an element is present in a set or not.
It consists of a set of n bits & a collection of independent hash functions, each of which returns a number between 0 and n-1, identifying one of the n bits.
Writes:
A key is run through the collection of hash functions, and each resulting bit is flipped on to mark the element's presence.
Reads:
A key is run through the collection of hash functions. If all the resulting bits are turned on, the key MAY be present in the underlying set. If even one of them is not flipped on, we can GUARANTEE that the key is not present.
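A minimal bloom-filter sketch along those lines (the bit-array size and the way the hash functions are derived are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=64, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, key):
        """Derive n_hashes bit positions from md5 of (seed + key)."""
        for seed in range(self.n_hashes):
            digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True              # flip the bit on to record presence

    def might_contain(self, key):
        # True  -> the key MAY be present (false positives are possible)
        # False -> the key is GUARANTEED absent
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("partition-23")
print(bf.might_contain("partition-23"))   # True (may be present)
print(bf.might_contain("partition-99"))   # very likely False (definitely absent)
```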