Cassandra Architecture
Distributed Peer to Peer
● There is no leader/follower; every node plays the same role.
● Each node knows which keys the other nodes hold and coordinates with those nodes to fetch the data.
● Depending on the replication factor and consistency level, the coordinator talks to one or more nodes before returning the response to the client.
● Every table defines a partition key.
● Data is distributed across the nodes in the cluster using a hash of the partition key (consistent hashing).
● Partitions are replicated across multiple nodes to prevent a single point of failure.
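The hash-based placement and replication in the bullets above can be sketched as follows. This is a toy model: the ring, token values, and node names are invented, and the md5 stand-in is not Cassandra's actual Murmur3 partitioner.

```python
import hashlib
from bisect import bisect_right

# Hypothetical token ring: (token, node). Real Cassandra assigns Murmur3
# token ranges (often many vnodes per node); this only shows the idea.
RING = [(100, "n1"), (200, "n2"), (300, "n3"), (400, "n4")]
TOKENS = [t for t, _ in RING]

def token(partition_key: str) -> int:
    """Stand-in hash of the partition key onto a small 0..499 ring."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 500

def replicas(partition_key: str, rf: int = 3) -> list[str]:
    """Primary owner (first node past the token) plus rf-1 successors."""
    start = bisect_right(TOKENS, token(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]
```

Any node can run this lookup, which is why there is no leader: every node can tell which replicas own a given key.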
Replication copies the data across multiple nodes within/across the data centers (DCs). The Replication Factor (RF) denotes the number of copies and is set at the keyspace level.
Snitch: a strategy to identify the DC and rack a node belongs to. This identity can be shared across all nodes manually or via gossiping.
The coordinator knows the RF of each keyspace and coordinates writes up to that factor to the various nodes within/across DCs.
Hinted Handoff: while a replica node is down, the coordinator delays transmission to that node by persisting the data locally as a hint, and retransmits it once the replica is back online. Cassandra configuration sets how long such hints are retained.
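Hinted handoff can be sketched as below. The class and method names are illustrative, not Cassandra's internals; the hint window default is an arbitrary stand-in for the configured retention.

```python
import time

class Coordinator:
    """Toy hinted-handoff: writes to a down replica are stored locally as
    hints and replayed when the replica returns, unless the hint is older
    than the hint window."""

    def __init__(self, hint_window_s: float = 3 * 3600):
        self.hint_window_s = hint_window_s
        self.hints = {}   # replica -> [(timestamp, mutation), ...]
        self.sent = []    # (replica, mutation) pairs actually delivered

    def write(self, replica: str, alive: bool, mutation: bytes) -> None:
        if alive:
            self.sent.append((replica, mutation))
        else:
            self.hints.setdefault(replica, []).append((time.time(), mutation))

    def on_replica_up(self, replica: str) -> None:
        now = time.time()
        for ts, mutation in self.hints.pop(replica, []):
            if now - ts <= self.hint_window_s:  # stale hints are dropped
                self.sent.append((replica, mutation))
```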
Replication & Consistency

Consistency is the agreement required across the nodes before a read/write is accepted. Consistency can be set separately for reads and writes.

Consistency levels (CL) can be set from low to high (ONE, LOCAL_QUORUM, QUORUM, ALL). The CL is a trade-off between consistency and availability.
Read Repair: the coordinator performs a read repair on some/all of the replicas that hold trailing versions. Depending on the CL, this can be done asynchronously during a read request.
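The quorum arithmetic and the timestamp-based repair decision can be made concrete with a toy model (not Cassandra's implementation; the response shape is invented for illustration):

```python
def quorum(rf: int) -> int:
    """Replicas that must respond for QUORUM at replication factor rf."""
    return rf // 2 + 1

def read_repair(responses: dict[str, tuple[int, str]]) -> tuple[str, list[str]]:
    """Pick the value with the newest timestamp and list the stale
    replicas the coordinator would repair. responses: node -> (ts, value)."""
    winner_ts, winner = max(responses.values())
    stale = [n for n, (ts, _) in responses.items() if ts < winner_ts]
    return winner, stale
```

So at RF=3, QUORUM needs 2 replicas for both reads and writes, which is why a QUORUM write followed by a QUORUM read always overlaps on at least one up-to-date replica.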
Gossip

Each node stores info about itself and every other node in its knowledge base. Every second, each node initiates gossip with 2 or 3 other nodes to share its knowledge base.
Knowledge Base:
Each node increments its own heartbeat version every second. When it receives gossip from another node, it checks each node's heartbeat version and updates its entry if the incoming version is newer.
Optimization to reduce message bandwidth during gossiping:
Gossip is initiated with a SYN to the receiving node. SYN: just a digest, no AppState included.
The receiving node ACKs back to the sender. ACK: a digest requesting the versions where the receiver trails, plus detailed info (including AppState) for the versions where it leads.
The sender applies the newer versions and ACKs back with detailed info for the trailing versions requested by the other end.
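The three-step digest exchange can be sketched as follows. The knowledge base is reduced to endpoint → (heartbeat version, app state), and generation numbers are omitted; function and field names are simplifications, not Cassandra's message types.

```python
# Toy model of one gossip round between two knowledge bases (kb):
# kb maps endpoint -> (heartbeat_version, app_state).

def syn(kb):
    """SYN: a digest only — endpoint and heartbeat version, no app state."""
    return {ep: ver for ep, (ver, _) in kb.items()}

def ack(kb, digest):
    """ACK: ask for endpoints where the sender is ahead, and include full
    state (with app state) for endpoints where the receiver is ahead."""
    request = [ep for ep, ver in digest.items()
               if ver > kb.get(ep, (0, None))[0]]
    send = {ep: st for ep, st in kb.items() if st[0] > digest.get(ep, 0)}
    return request, send

def ack2(kb, request):
    """Final ack: full state for the endpoints the other side asked about."""
    return {ep: kb[ep] for ep in request if ep in kb}

def merge(kb, updates):
    """Apply only strictly newer versions to the knowledge base."""
    for ep, (ver, app) in updates.items():
        if ver > kb.get(ep, (0, None))[0]:
            kb[ep] = (ver, app)
```

After one round both sides converge on the newest version of every endpoint, while full state travels only for entries that were actually stale.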
Example endpoint state stored in the knowledge base:

EndPt State: <IP of a node>
    HeartBeat State:
        Generation: 10
        Version: 34
    Application State:
        Status: Normal/Removed/Arrived…
        DataCenter:
        Rack:
        Load:
        Severity:
        …
EndPt State: <IP of a node>…
Write Path

The client writes to both the commit log and the memtable. In the event of a node failure, the memtable can be reconstructed from the commit log. The commit log is append-only and maintains no ordering. The memtable is partitioned by partition key and ordered by clustering columns.

Eventually the memtable grows too large and is flushed to disk as an SSTable. SSTables are immutable, so each flush creates a new SSTable file. An SSTable stores its data partition by partition.

Compaction is the process of merging numerous SSTable files into one. It relies on the timestamp of each row to resolve duplicates.
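Compaction's timestamp-based duplicate resolution can be sketched like this (a toy merge over dicts, not the real SSTable format):

```python
def compact(*sstables):
    """Merge SSTables into one: for each row key, keep the version with
    the newest write timestamp. Each sstable: dict key -> (ts, value)."""
    merged = {}
    for table in sstables:  # input order does not matter
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged
```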
[Diagram: the client writes to the commit log and memtable in memory; memtables are flushed to immutable SSTables on disk, and compaction merges several SSTables into one, keeping the latest version of each row (e.g., duplicate rows for (23, USA) are collapsed to the newest). The coordinator forwards writes to the replica nodes.]
Read Path

[Diagram: client → coordinator → replica node; on the replica, a read consults the memtable, then per-SSTable bloom filters, the key cache (LRU), the summary index, and the partition index before reading the SSTable on disk.]
Order of search during a read:
The coordinator node calls one of the replica nodes for the requested partition key. The replica node first looks in the memtable; if the key is not found, it follows the path below until the key is found.
Bloom filters determine one of two things: the key definitely does not exist in the SSTable, or the key may exist in the SSTable.
Key Cache: an LRU cache mapping a partition key to the offset of that partition in the SSTable file.
Summary Index: a range-based index over the keys in the partition index and their offsets.
Partition Index: an indexed lookup from the partition key to the offset of the partition in the SSTable file.
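The lookup order above can be sketched as a chain of checks. Plain dicts and sets stand in for the real bloom filter, key cache, and indexes, and the summary-index step is folded into the partition-index lookup for brevity:

```python
def read_partition(key, memtable, sstables):
    """Illustrative read path: memtable first, then per-SSTable
    bloom filter -> key cache -> (summary +) partition index -> disk."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:
        if key not in sst["bloom"]:           # definitely not in this file
            continue
        offset = sst["key_cache"].get(key)    # LRU cache of key -> offset
        if offset is None:
            offset = sst["partition_index"].get(key)
        if offset is not None:
            return sst["data"][offset]        # read the partition at offset
    return None
```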
Consistent Hashing

Given a set of key/value pairs, hashing is a strategy to spread the pairs as evenly as possible, so that we can fetch them in near-constant time by key. Consistent hashing is one such strategy for spreading keys in a distributed environment.

The hashes of the keys are conceptually spread over a ring. The position a key takes on the ring can be anywhere between 0 and 360, based on the hash of the key (typically a mod of the hash). The stores/servers that host these keys are also given positions on the ring (e.g., A, B, C…).

The key is stored on the first server found while traversing the ring anti-clockwise from the key's position. E.g., the key Steve @ 352.3 finds server C @ 81.7.

If we maintain a sorted list of servers and their positions, a quick binary search points us to the server where the key can be found, eliminating the need to query all servers. Keys can be replicated on succeeding servers to avoid a single point of failure (SPOF).
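The sorted-list-plus-binary-search lookup can be sketched as follows, matching the anti-clockwise traversal described above; the server names and positions are invented:

```python
from bisect import bisect_right

class Ring:
    """Sorted (position, server) list; lookup walks anti-clockwise:
    the first server at or below the key's position, wrapping past 0."""

    def __init__(self, servers: dict[str, float]):
        self.entries = sorted((pos, name) for name, pos in servers.items())
        self.positions = [pos for pos, _ in self.entries]

    def lookup(self, key_position: float) -> str:
        idx = bisect_right(self.positions, key_position) - 1
        return self.entries[idx][1]  # idx == -1 wraps to the last server
```

The lookup costs O(log n) in the number of servers, regardless of how many keys the ring holds.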
Consistent Hashing

Although the keys are spread over several servers, the distribution may not be even, due to the uneven clustering of keys in the real world (names starting with certain letters may be more common).

In such scenarios, to overcome the load on an individual server, we define virtual servers: we give the same server multiple positions on the ring, simulating multiple instances of it.

With reference to the picture here, the refined sorted list of servers now holds virtual instances a1, a2, b2, c3, etc., thereby distributing the load on C to B and A as well.
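Virtual servers can be derived by hashing the server name with an index suffix; the hash choice, vnode count, and ring size here are arbitrary illustrations:

```python
import hashlib

def vnode_positions(server: str, vnodes: int = 8, ring_size: int = 360) -> list[int]:
    """Give one physical server several ring positions by hashing
    'server#i' for each virtual instance i."""
    return sorted(
        int(hashlib.md5(f"{server}#{i}".encode()).hexdigest(), 16) % ring_size
        for i in range(vnodes)
    )

def build_ring(servers: list[str], vnodes: int = 8) -> list[tuple[int, str]]:
    """Combined sorted list of every server's virtual positions."""
    return sorted((pos, s) for s in servers for pos in vnode_positions(s, vnodes))
```

More vnodes per server smooths the load further and also spreads the rebalancing work when a server joins or leaves.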
Bloom Filters

A bloom filter is a probabilistic data structure for determining whether an element is present in a set or not. It consists of a set of n bits and a collection of independent hash functions, each of which returns a number between 0 and n-1, identifying one of the n bits.

Writes:
A key is run through the collection of hash functions, and each resulting bit is flipped on to mark the element's presence.

Reads:
A key is run through the collection of hash functions. If and only if all the resulting bits are turned on, the key MAY be present in the underlying set. If even one of them is not flipped on, we can GUARANTEE that the key is not present.
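A minimal sketch of the write and read paths described above; the bit count, hash count, and salted-md5 hashing are illustrative choices, not what Cassandra uses internally:

```python
import hashlib

class BloomFilter:
    """n bits plus k hash functions derived by salting one digest.
    False positives are possible; false negatives are not."""

    def __init__(self, n_bits: int = 1024, k: int = 3):
        self.n = n_bits
        self.k = k
        self.bits = 0  # the n-bit array, stored as one big int

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos  # flip each resulting bit on

    def might_contain(self, key: str) -> bool:
        # True only if ALL k bits are on; one off bit guarantees absence.
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```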

More Related Content

What's hot

Locks In Disributed Systems
Locks In Disributed SystemsLocks In Disributed Systems
Locks In Disributed Systems
mridul mishra
 
Istanbul BFT
Istanbul BFTIstanbul BFT
Istanbul BFT
Yu-Te Lin
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14
宇帆 盛
 
Cassandra consistency
Cassandra consistencyCassandra consistency
Cassandra consistency
zqhxuyuan
 
GopherCon 2017 - Writing Networking Clients in Go: The Design & Implementati...
GopherCon 2017 -  Writing Networking Clients in Go: The Design & Implementati...GopherCon 2017 -  Writing Networking Clients in Go: The Design & Implementati...
GopherCon 2017 - Writing Networking Clients in Go: The Design & Implementati...
wallyqs
 
Cassandra basic
Cassandra basicCassandra basic
Cassandra basic
zqhxuyuan
 
Stephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateStephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large State
Flink Forward
 
Dynamic Resource Management In a Massively Parallel Stream Processing Engine
 Dynamic Resource Management In a Massively Parallel Stream Processing Engine Dynamic Resource Management In a Massively Parallel Stream Processing Engine
Dynamic Resource Management In a Massively Parallel Stream Processing Engine
Kasper Grud Skat Madsen
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
Pratik Tambekar
 
Logging Last Resource Optimization for Distributed Transactions in Oracle We...
Logging Last Resource Optimization for Distributed Transactions in  Oracle We...Logging Last Resource Optimization for Distributed Transactions in  Oracle We...
Logging Last Resource Optimization for Distributed Transactions in Oracle We...
Gera Shegalov
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward
 
Elixir concurrency 101
Elixir concurrency 101Elixir concurrency 101
Elixir concurrency 101
Rafael Antonio Gutiérrez Turullols
 
Cassandra Internals Overview
Cassandra Internals OverviewCassandra Internals Overview
Cassandra Internals Overview
beobal
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
Matthew Dennis
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency Control
Dilum Bandara
 
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Kévin LOVATO
 
Everything You Thought You Already Knew About Orchestration
Everything You Thought You Already Knew About OrchestrationEverything You Thought You Already Knew About Orchestration
Everything You Thought You Already Knew About Orchestration
Laura Frank Tacho
 
Chapter 12 transactions and concurrency control
Chapter 12 transactions and concurrency controlChapter 12 transactions and concurrency control
Chapter 12 transactions and concurrency control
AbDul ThaYyal
 
M|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleM|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScale
MariaDB plc
 
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie SatterlySeattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
btoddb
 

What's hot (20)

Locks In Disributed Systems
Locks In Disributed SystemsLocks In Disributed Systems
Locks In Disributed Systems
 
Istanbul BFT
Istanbul BFTIstanbul BFT
Istanbul BFT
 
Pulsar connector on flink 1.14
Pulsar connector on flink 1.14Pulsar connector on flink 1.14
Pulsar connector on flink 1.14
 
Cassandra consistency
Cassandra consistencyCassandra consistency
Cassandra consistency
 
GopherCon 2017 - Writing Networking Clients in Go: The Design & Implementati...
GopherCon 2017 -  Writing Networking Clients in Go: The Design & Implementati...GopherCon 2017 -  Writing Networking Clients in Go: The Design & Implementati...
GopherCon 2017 - Writing Networking Clients in Go: The Design & Implementati...
 
Cassandra basic
Cassandra basicCassandra basic
Cassandra basic
 
Stephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateStephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large State
 
Dynamic Resource Management In a Massively Parallel Stream Processing Engine
 Dynamic Resource Management In a Massively Parallel Stream Processing Engine Dynamic Resource Management In a Massively Parallel Stream Processing Engine
Dynamic Resource Management In a Massively Parallel Stream Processing Engine
 
Distributed System by Pratik Tambekar
Distributed System by Pratik TambekarDistributed System by Pratik Tambekar
Distributed System by Pratik Tambekar
 
Logging Last Resource Optimization for Distributed Transactions in Oracle We...
Logging Last Resource Optimization for Distributed Transactions in  Oracle We...Logging Last Resource Optimization for Distributed Transactions in  Oracle We...
Logging Last Resource Optimization for Distributed Transactions in Oracle We...
 
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
Flink Forward SF 2017: Till Rohrmann - Redesigning Apache Flink’s Distributed...
 
Elixir concurrency 101
Elixir concurrency 101Elixir concurrency 101
Elixir concurrency 101
 
Cassandra Internals Overview
Cassandra Internals OverviewCassandra Internals Overview
Cassandra Internals Overview
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
Transactions and Concurrency Control
Transactions and Concurrency ControlTransactions and Concurrency Control
Transactions and Concurrency Control
 
Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014Building your own Distributed System The easy way - Cassandra Summit EU 2014
Building your own Distributed System The easy way - Cassandra Summit EU 2014
 
Everything You Thought You Already Knew About Orchestration
Everything You Thought You Already Knew About OrchestrationEverything You Thought You Already Knew About Orchestration
Everything You Thought You Already Knew About Orchestration
 
Chapter 12 transactions and concurrency control
Chapter 12 transactions and concurrency controlChapter 12 transactions and concurrency control
Chapter 12 transactions and concurrency control
 
M|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScaleM|18 Architectural Overview: MariaDB MaxScale
M|18 Architectural Overview: MariaDB MaxScale
 
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie SatterlySeattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
Seattle Cassandra Meetup - Cassandra 1.2 - Eddie Satterly
 

Similar to Cassandra Architecture

Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandra
Wu Liang
 
Cassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User GroupCassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User Group
Adam Hutson
 
Samsung DeepSort
Samsung DeepSortSamsung DeepSort
Samsung DeepSort
Ryo Jin
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
Alex Thompson
 
Lab Seminar 2009 12 01 Message Drop Reduction And Movement
Lab Seminar 2009 12 01  Message Drop Reduction And MovementLab Seminar 2009 12 01  Message Drop Reduction And Movement
Lab Seminar 2009 12 01 Message Drop Reduction And Movement
tharindanv
 
Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
Vineet Gupta
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
Luis Galárraga
 
Technical presentation
Technical presentationTechnical presentation
Technical presentation
Siddharth Singh
 
2.communcation in distributed system
2.communcation in distributed system2.communcation in distributed system
2.communcation in distributed system
Gd Goenka University
 
Lab Seminar 2009 06 17 Description Based Ad Hoc Networks
Lab Seminar 2009 06 17  Description Based Ad Hoc NetworksLab Seminar 2009 06 17  Description Based Ad Hoc Networks
Lab Seminar 2009 06 17 Description Based Ad Hoc Networks
tharindanv
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking VN
 
OPEN SHORTEST PATH FIRST (OSPF)
OPEN SHORTEST PATH FIRST (OSPF)OPEN SHORTEST PATH FIRST (OSPF)
OPEN SHORTEST PATH FIRST (OSPF)
Ann Joseph
 
MySQL HA with PaceMaker
MySQL HA with  PaceMakerMySQL HA with  PaceMaker
MySQL HA with PaceMaker
Kris Buytaert
 
Session 7 Tp 7
Session 7 Tp 7Session 7 Tp 7
Session 7 Tp 7
githe26200
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
Open shortest path first (ospf)
Open shortest path first (ospf)Open shortest path first (ospf)
Open shortest path first (ospf)
Respa Peter
 
1 ddbms jan 2011_u
1 ddbms jan 2011_u1 ddbms jan 2011_u
1 ddbms jan 2011_u
betheperformer
 
No sql
No sqlNo sql
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
university of Gujrat, pakistan
 
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
university of Gujrat, pakistan
 

Similar to Cassandra Architecture (20)

Dynamo cassandra
Dynamo cassandraDynamo cassandra
Dynamo cassandra
 
Cassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User GroupCassandra & Python - Springfield MO User Group
Cassandra & Python - Springfield MO User Group
 
Samsung DeepSort
Samsung DeepSortSamsung DeepSort
Samsung DeepSort
 
The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Lab Seminar 2009 12 01 Message Drop Reduction And Movement
Lab Seminar 2009 12 01  Message Drop Reduction And MovementLab Seminar 2009 12 01  Message Drop Reduction And Movement
Lab Seminar 2009 12 01 Message Drop Reduction And Movement
 
Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
 
Distributed Coordination
Distributed CoordinationDistributed Coordination
Distributed Coordination
 
Technical presentation
Technical presentationTechnical presentation
Technical presentation
 
2.communcation in distributed system
2.communcation in distributed system2.communcation in distributed system
2.communcation in distributed system
 
Lab Seminar 2009 06 17 Description Based Ad Hoc Networks
Lab Seminar 2009 06 17  Description Based Ad Hoc NetworksLab Seminar 2009 06 17  Description Based Ad Hoc Networks
Lab Seminar 2009 06 17 Description Based Ad Hoc Networks
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
OPEN SHORTEST PATH FIRST (OSPF)
OPEN SHORTEST PATH FIRST (OSPF)OPEN SHORTEST PATH FIRST (OSPF)
OPEN SHORTEST PATH FIRST (OSPF)
 
MySQL HA with PaceMaker
MySQL HA with  PaceMakerMySQL HA with  PaceMaker
MySQL HA with PaceMaker
 
Session 7 Tp 7
Session 7 Tp 7Session 7 Tp 7
Session 7 Tp 7
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
Open shortest path first (ospf)
Open shortest path first (ospf)Open shortest path first (ospf)
Open shortest path first (ospf)
 
1 ddbms jan 2011_u
1 ddbms jan 2011_u1 ddbms jan 2011_u
1 ddbms jan 2011_u
 
No sql
No sqlNo sql
No sql
 
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
 
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
 

Recently uploaded

20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 

Recently uploaded (20)

20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 

Cassandra Architecture

  • 1.
  • 2. Distributed Peer to Peer Client ● There is no leader/follower. ● Each node is aware of keys held by other nodes and coordinates with that node to fetch the data. ● Depending on the replication factor & consistency level the coordinator talks to one of more nodes before returning the response to the client. ● Every table defines a partition key. ● Data is distributed across the various nodes in the cluster using the hash on the partition key. Uses Consistent hashing algo. ● Partitions are replicated across multiple nodes to prevent single point of failure.
  • 3. Replication copies of the data across multiple nodes within/across the DCs. Replication Factor (RF) denotes the no of copies. Set at the keyspace level. Snitch: Is a strategy to identify the DC and Rack the node belongs to. This identity can be manually shared across all nodes or via Gossiping. Coordinator is aware of the RF/keyspace and coordinates the writes upto that factor to the various nodes within/across DCs. Hinted Handoff - While the replica node is down the coordinator will delay the transmission to that node by persisting that data locally. It can retransmits it once that replica node is back online. Cassandra configuration sets the duration for holding such data before handoff. Replication & Consistency Consistency is an agreeable factor across the nodes that ensures the acceptance of a read/write. Consistency can be set for both read/writes. Consistency levels (CL) can be set from low to high (ONE, LOCAL_QUOROUM, QUORUM, ALL) CL is a trade off b/w consistency and availability. Read Repair: Coordinator performs a read repair on some/all of the replicas that have trailing versions. Depending on the CL this can be done async during a read request.
  • 4. Gossip ● Each node stores info about itself and every other node in its knowledge base. ● Each node initiates gossip every second with 2 or 3 other nodes to share its knowledge base. ● Each node increments its heartbeat version every second. When it receives gossip from another node, it checks each node's heartbeat version and updates its entry if the received version is newer. ● Optimization to reduce message bandwidth during gossiping: gossip is initiated with a SYN to the receiving node. SYN: just a digest, no AppState included. The receiving node ACKs back to the sender. ACK: a digest for the versions where it trails, and detailed state (including AppState) for the versions where it leads. The sender updates its trailing versions and ACKs back with detailed info for the trailing versions requested by the other end. ● Knowledge Base entry example: EndPt State: &lt;IP of a node&gt; → HeartBeat State (Generation: 10, Version: 34); Application State (Status: Normal/Removed/Arrived…, DataCenter, Rack, Load, Severity, …).
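The knowledge-base merge rule above boils down to: for each endpoint, the higher (generation, heartbeat-version) pair wins. A minimal sketch, with dict-of-tuples standing in for Cassandra's internal endpoint-state classes:

```python
# Sketch of the gossip knowledge-base merge. Each endpoint maps to a
# (generation, heartbeat_version) pair; a fresher pair replaces a staler one.
def merge(local: dict, remote_digest: dict) -> dict:
    merged = dict(local)
    for endpoint, (gen, ver) in remote_digest.items():
        if endpoint not in merged or (gen, ver) > merged[endpoint]:
            merged[endpoint] = (gen, ver)  # remote has fresher state
    return merged

local = {"10.0.0.1": (10, 34), "10.0.0.2": (7, 5)}
remote = {"10.0.0.2": (7, 9), "10.0.0.3": (3, 1)}
merged = merge(local, remote)  # keeps 10.0.0.1, updates .2, learns .3
```

Comparing generation before version matters: a restarted node bumps its generation, so its fresh state always beats stale high-version entries from before the restart.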
  • 5. Write Path ● The client writes to both the commit log and the memtable. In the event of node failure, the memtable can be reconstructed from the commit log. ● The commit log is append-only and does not maintain any order. The memtable is partitioned by partition key and ordered by clustering columns. ● Eventually the memtable grows out of size and is flushed to disk as an SSTable. SSTables are immutable, so each flush creates a new SSTable file. An SSTable holds each partition's data. ● Compaction is the process of merging numerous SSTable files into one. It relies on the timestamp of each row to resolve duplicates. (Diagram: client → commit log + memtable in memory; flushing to SSTables on disk; compaction merging SSTables, keeping only the latest-timestamped version of each row.)
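The timestamp-based duplicate resolution during compaction can be shown with rows shaped like the slide's example, (partition_key, value, timestamp). This is a sketch of the merge rule only, not Cassandra's actual compaction strategies:

```python
# Sketch of SSTable compaction: merge several immutable SSTables into one,
# resolving duplicate partition keys by keeping the latest-timestamped row.
def compact(*sstables):
    latest = {}
    for table in sstables:
        for pk, value, ts in table:
            if pk not in latest or ts > latest[pk][1]:
                latest[pk] = (value, ts)  # newer timestamp wins
    # emit one merged, sorted SSTable
    return sorted((pk, v, ts) for pk, (v, ts) in latest.items())

# Three flushes over time; partition 23 was updated twice, 55 three times.
t1 = [(23, "USA", 4), (55, "Korea", 9)]
t2 = [(23, "Mexico", 7), (55, "China", 20)]
t3 = [(23, "USA", 8), (55, "China", 40)]
merged = compact(t1, t2, t3)  # one row per partition, latest write wins
```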
  • 6. Read Path ● The coordinator node calls one of the replica nodes for the requested partition key. ● The replica node first looks in the memtable. If not found, it follows the path below until the key is found. ● Bloom filters help determine one of two things: the key definitely does not exist in the SSTable, or the key may exist in it. ● Key Cache: an LRU cache whose key is the partition key and whose value is the offset of the partition in the SSTable file. ● Summary Index: a range-based index over the keys in the partition index and their offsets. ● Partition Index: an indexed lookup from the partition key to the offset of the partition in the SSTable file. (Diagram: memtable in memory; then per-SSTable bloom filter → key cache → summary index → partition index → SSTable on disk.)
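The lookup order can be sketched as plain data structures. This is a simplification under stated assumptions: a Python set stands in for the bloom filter (so it has no false positives here), a dict for the key cache and partition index, and the summary index is folded into the index lookup.

```python
# Sketch of a replica's read path: memtable first, then for each SSTable
# consult the bloom filter, the key cache, and finally the partition index.
def read(key, memtable, sstables):
    if key in memtable:
        return memtable[key]            # freshest data lives in memory
    for sst in sstables:                # newest SSTable first in practice
        if key not in sst["bloom"]:
            continue                    # GUARANTEED absent from this file
        offset = sst["key_cache"].get(key)          # LRU cache hit?
        if offset is None:
            offset = sst["partition_index"].get(key)  # via summary index
        if offset is not None:
            return sst["data"][offset]  # seek to the partition's offset
    return None

sst = {"bloom": {"k1"}, "key_cache": {}, "partition_index": {"k1": 0},
       "data": ["v1"]}
value = read("k1", {}, [sst])  # bloom says "maybe", index gives the offset
```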
  • 8. References: ● https://academy.datastax.com ● https://www.youtube.com/watch?v=s1xc1HVsRk0&list=PLalrWAGybpB-L1PGA-NfFu2uiWHEsdscD&index=1 ● https://www.toptal.com/big-data/consistent-hashing ● https://www.baeldung.com/cassandra-data-modeling
  • 9. Consistent Hashing ● Given a set of key/value pairs, hashing is a strategy to spread the pairs as evenly as possible, so that we can fetch them in near-constant time by key. ● Consistent hashing is one such strategy for spreading keys in a distributed environment. The hashes of the keys are conceptually placed on a ring. ● The position a key takes on the ring can be anywhere between 0 and 360, based on the hash of the key (usually a mod on the hash). The servers that host these keys are also given positions on the ring (e.g., A, B, C…). ● A key is stored on the first server found while traversing the ring in the anti-clockwise direction from the key's position. E.g., key Steve @ 352.3 finds server C @ 81.7. ● If we maintain a sorted list of servers and their positions, a quick binary search points us to the server where a key can be found, eliminating the need to query all servers. ● Keys can be replicated on succeeding servers to avoid a single point of failure (SPF).
  • 10. Consistent Hashing ● Although the keys are spread over several servers, the distribution may not be even due to uneven clustering of keys in the real world (names starting with certain letters may be more common). ● To relieve the load on an individual server in such scenarios, we define virtual servers: the same server is given multiple positions, simulating multiple instances of that server across the ring. ● With reference to the figure, the refined sorted list of servers now contains virtual instances a1, a2, b2, c3, etc., thereby distributing the load on C to B and A as well.
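The ring with virtual nodes and binary-search lookup described over the last two slides can be sketched as follows. Assumptions: MD5 mod 2^32 as the position function (the slides use 0-360 degrees; a larger space avoids collisions), 8 vnodes per server, and traversal toward increasing positions with wraparound (the direction is just a convention).

```python
import bisect
import hashlib

# Sketch of consistent hashing with virtual nodes: each physical server
# occupies several positions on the ring, and a binary search over the
# sorted positions finds the server for any key.
class ConsistentHashRing:
    def __init__(self, servers, vnodes=8):
        # each server contributes `vnodes` positions like "A#0", "A#1", ...
        self.ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.positions = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        # deterministic position on the ring (MD5 is illustrative)
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

    def server_for(self, key: str) -> str:
        # binary search for the next server position, wrapping around
        i = bisect.bisect(self.positions, self._h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"], vnodes=16)
server = ring.server_for("Steve")  # same key always lands on the same server
```

Adding or removing a server only remaps the keys adjacent to its vnode positions, which is the property that makes this scheme attractive for a cluster where nodes join and leave.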
  • 11. Bloom Filters ● A probabilistic data structure to determine whether an element is present in a set or not. ● It consists of a set of n bits and a collection of independent hash functions, each of which returns a number between 0 and n−1, identifying one of the n bits. ● Writes: a key is run through the collection of hash functions, and each resulting bit is flipped on to mark the element's presence. ● Reads: a key is run through the collection of hash functions. Only if all the resulting bits are on can we say the key MAY be present in the underlying set. If even one of them is not on, we can GUARANTEE that the key is not present.
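A minimal bloom filter following the write/read rules above. The sizes (1024 bits, 3 hash functions) are illustrative and untuned, and the k "independent" hashes are derived by salting a single MD5, which is a common shortcut rather than the only way to do it:

```python
import hashlib

# Minimal bloom filter sketch: n bits plus k salted hash functions.
class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.n = n_bits
        self.k = k
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # k pseudo-independent positions in [0, n)
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p  # flip the p-th bit on

    def might_contain(self, key):
        # True  -> key MAY be present (false positives are possible)
        # False -> key is GUARANTEED absent
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("steve")
maybe_there = bf.might_contain("steve")  # always True for added keys
```

This is why the read path consults the bloom filter first: a False answer lets the replica skip an SSTable without touching disk, at the cost of occasionally reading a file that turns out not to contain the key.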