Apache Cassandra
By Markus Klems
Big Data & NoSQL
Why do we need NoSQL Databases?
● Modern Internet application requirements

low-latency CRUD operations
elastic scalability
high availability
reliable and durable storage
geographic distribution
flexible schema

● Less prioritized
○ Transactions, ACID guarantees

... but some form of data consistency is still desirable

○ SQL support

… but some SQL features are still desirable
What is Big Data?

near realtime










Scalability & High Availability
● Workload and data volume grow ->
Partition and distribute data across multiple
servers (horizontal scaling)
○ How to determine partition boundaries?
○ Can partitions be changed dynamically?
○ How to route requests to the right server?

● High Availability -> replicate data

What kind of failures can we deal with?
Sync or async replication?
Local or geo replication?
Consistency model?
NoSQL Databases categorized by
System Architecture



Ring (P2P)

All nodes are equal. Each node
stores a data partition +
replicated data. Eventually consistent.

Amazon DynamoDB


Data partitioned across slaves.
Each slave stores a data
partition + replicated data.
Strong consistency guarantees.

Yahoo! Sherpa,


Bi-directional, incremental
replication between all nodes.
Each node stores all data.
Eventually consistent.

NoSQL Databases categorized by
Data Model and Storage

Each row stores a flexible
number of columns. Data is
partitioned by row key.

Amazon DynamoDB


Storage and retrieval of
structured data in the form of



Row-oriented data storage of
simple (key,value) pairs in a flat

Yahoo! Sherpa


Storage and retrieval of data
that is stored as nodes and links
of graphs in a graph-space.
Cassandra Background:
Amazon Dynamo +
Google BigTable
Dynamo: Amazon's highly available
key-value store
● Amazon Dynamo Paper
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani,
Gunavardhan Kakulapati, Avinash Lakshman, Alex
Pilchin, Swaminathan Sivasubramanian, Peter Vosshall,
and Werner Vogels. 2007. Dynamo: Amazon's highly
available key-value store. In Proceedings of twenty-first
ACM SIGOPS symposium on Operating systems
principles (SOSP '07). ACM, New York, NY, USA, 205-220.
● Download:
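Amazon Dynamo: Techniques (1)
● Problem: Partitioning
○ Technique: Consistent Hashing
○ Advantage: Incremental scalability
● Problem: High Availability for writes
○ Technique: Vector clocks with reconciliation during reads
○ Advantage: Version size is decoupled from update rates.
● Problem: Handling temporary failures
○ Technique: Sloppy Quorum and hinted handoff
○ Advantage: Provides high availability and durability guarantee
when some of the replicas are not available.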
Amazon Dynamo: Techniques (2)
● Problem: Recovering from permanent failures
○ Technique: Anti-entropy using Merkle trees
○ Advantage: Synchronizes divergent replicas in the background.
● Problem: Membership and failure detection
○ Technique: Gossip-based membership protocol and failure detection.
○ Advantage: Preserves symmetry and avoids having a centralized registry
for storing membership and node liveness information.
Dynamo: Incremental scalability
● Simple key/value API
● Consistent hashing
○ Cryptographic MD5 hash of key generates a 128-bit identifier.
○ The largest value wraps around the smallest one.
The keyspace is a "ring".
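The ring idea above can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo's actual implementation: the node names, the `HashRing` class, and the choice to place each node at the MD5 hash of its name are all assumptions made for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # an MD5 digest is a 128-bit identifier


def md5_token(key: str) -> int:
    """Map a key to a position on the ring via its MD5 digest."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical helper: nodes sit on the ring at the hash of their name."""

    def __init__(self, nodes):
        self._entries = sorted((md5_token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        """First node at or after the key's position; wraps past the largest token."""
        idx = bisect.bisect_left(self._tokens, md5_token(key))
        return self._entries[idx % len(self._entries)][1]


keys = ["user:1", "user:2", "user:3", "user:4"]
ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = {k: ring.owner(k) for k in keys}

# Adding a node only reassigns keys that fall into the new node's range.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if bigger.owner(k) != owner_before[k]]
```

Any key listed in `moved` is now owned by `node-d`; all other keys stay where they were, which is the incremental-scalability property.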
Dynamo: Incremental scalability

Source: Amazon Dynamo paper
Dynamo: Data versioning
● Optimistic replication: updates may travel
asynchronously. This can lead to inconsistencies.
○ How to identify inconsistencies?
○ Data versions, vector clocks!
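A rough sketch of how vector clocks let a replica tell "newer version" apart from "concurrent, conflicting versions". The clock representation (a dict from node name to counter) and the node names `sx`, `sy`, `sz` are assumptions for illustration only.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a is newer"
    if b_covers_a:
        return "b is newer"
    return "conflict"  # concurrent updates: reconciliation is needed


v1 = increment({}, "sx")   # first write handled by sx
v2 = increment(v1, "sx")   # second write handled by sx
v3 = increment(v2, "sy")   # update of v2 handled by sy
v4 = increment(v2, "sz")   # concurrent update of v2 handled by sz
```

Here `compare(v3, v2)` reports that `v3` is newer, while `compare(v3, v4)` reports a conflict because neither version has seen the other's update.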
Dynamo: Sloppy Quorum
● (N,R,W) quorum configuration
○ N replicas
○ R votes for a successful READ operation
○ W votes for a successful WRITE operation

● Quorum intersection invariants
○ N < R+W
○ (W > N/2)

● Sloppy quorum
○ If a node goes down, save the data temporarily as
"Hinted Handoffs" on another node.
○ Thus avoiding unavailability.
○ Hinted Handoffs must be resolved after some time.
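The intersection invariant can be checked by brute force for small N. The sketch below is illustrative only; the function names are made up for this example. It enumerates every possible read set and write set and confirms that they always share at least one replica exactly when R + W > N.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """The invariant from the slide: N < R + W."""
    return r + w > n


def always_intersect(n: int, r: int, w: int) -> bool:
    """Brute force: does every R-sized read set meet every W-sized write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


# Compare the formula with the brute-force check for all small configurations.
results = {
    (n, r, w): (quorums_overlap(n, r, w), always_intersect(n, r, w))
    for n in range(1, 6)
    for r in range(1, n + 1)
    for w in range(1, n + 1)
}
```

For example, (N, R, W) = (3, 2, 2) guarantees that a read sees the latest successful write, whereas (3, 1, 1) does not.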
Dynamo: Merkle Trees
● Each server node calculates one Merkle
Tree for one owned keyrange.
● Merkle Tree = A tree of hash values of parts
of the keyrange.
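A small sketch of the idea: hash each part of the key range, hash pairs of hashes up to a single root, and compare two replicas top-down so that only differing subtrees are inspected. The helper names, the use of SHA-256, and the power-of-two number of parts are assumptions for this example.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(parts):
    """Return the tree as a list of levels: leaf hashes first, root last.

    Assumes len(parts) is a power of two to keep the sketch short.
    """
    level = [h(p) for p in parts]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_parts(a, b):
    """Walk both trees from the root and return indices of leaves that differ."""
    suspects = [0]
    for depth in range(len(a) - 1, 0, -1):
        children = []
        for i in suspects:
            if a[depth][i] != b[depth][i]:
                children += [2 * i, 2 * i + 1]
        suspects = children
    return [i for i in suspects if a[0][i] != b[0][i]]


replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=v2", b"k3=STALE", b"k4=v4"]
out_of_sync = differing_parts(build_tree(replica_a), build_tree(replica_b))
```

If the two roots match, the replicas are in sync for that key range and nothing below the root needs to be exchanged.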



Dynamo: Anti Entropy Protocol
● Anti-entropy protocol

Exchange Merkle Trees with replica server nodes.
Lazy loading of Merkle Trees
Compare the tree and track down inconsistencies.
Trigger conflict resolution (last write wins)
Send Hash


Send Hash(Part1)

Dynamo: Membership & failure
● Gossip-based protocol
○ All nodes (eventually) share the same view of the
○ Information is exchanged via gossip between peers







Dynamo: Summary
● CAP theorem: in a widely distributed
system, strong consistency and high
availability (+ low latency) cannot be
achieved at the same time.
● Optimistic replication: improve availability
+ latency at the cost of inconsistencies
● Requires conflict resolution:
○ when to resolve conflicts?
○ who resolves conflicts?
Bigtable: A Distributed Storage System for
Structured Data
● Google Bigtable Paper
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C.
Hsieh, Deborah A. Wallach, Mike Burrows, Tushar
Chandra, Andrew Fikes, and Robert E. Gruber. 2008.
Bigtable: A Distributed Storage System for Structured
Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June
2008), 26 pages.
Google Bigtable
"A Bigtable is a sparse, distributed, persistent
multi-dimensional sorted map. The map is
indexed by a row key, column key, and a
timestamp; each value in the map is an
uninterpreted array of bytes."

Source: Bigtable paper
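To make the definition concrete, here is a toy in-memory version of such a map. It is only a sketch: a plain Python dict keyed by (row key, column key, timestamp), with hypothetical helper functions `put`, `get_latest`, and `scan_rows`. The sample row and column names are illustrative.

```python
# (row key, column key, timestamp) -> uninterpreted bytes
table = {}


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value


def get_latest(row: str, column: str):
    """Return the value with the highest timestamp for this cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None


def scan_rows(start: str, end: str):
    """Return all keys with start <= row key < end, in sorted order."""
    return sorted(k for k in table if start <= k[0] < end)


put("com.cnn.www", "contents:", 3, b"<html>v3")
put("com.cnn.www", "contents:", 5, b"<html>v5")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
put("org.example", "contents:", 1, b"<html>other")

latest = get_latest("com.cnn.www", "contents:")
com_keys = scan_rows("com", "org")
```

The map is sparse (a row only stores the columns it actually has) and each cell can hold several timestamped versions.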
Google Bigtable: Data Model

Source: Bigtable paper
Google Bigtable: Structure and
query model
● A table contains a number of rows and is
broken down into tablets which each contain a
subset of the rows
● Tablets are the unit of distribution and load balancing
● Rows are sorted lexicographically by row key
○ Subsequent row keys are within a tablet
○ Allows efficient range queries for small numbers of rows

● Operations: Read, write and delete items +
batch writes
● Supports single-row transactions
Google Bigtable: Local persistence
● Tablet servers store updates in commit logs
written in so-called SSTables
● Fresh updates are kept in memory
(memtable), old updates are stored in GFS
● Minor compactions flush the memtable into a new SSTable
● Major compactions merge SSTables into just
one new SSTable
Google Bigtable: Local persistence

Source: Bigtable paper
Cassandra Architecture and
Main Features
Cassandra: a decentralized structured
storage system
● Cassandra Paper
Avinash Lakshman and Prashant Malik. 2010.
Cassandra: a decentralized structured storage system.
SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40.
● URL: http://www.cs.cornell.
Architecture Topics
● Data Partitioning & Distribution
○ Partitioners
○ Virtual Nodes

● Data Replication
● Network Topology (Snitches)
● Server-to-Server Communication (Gossip)
○ Membership
○ Failure detection

● Scaling a Cluster
● Client-Server Communication
● Local Persistence
Data Partitioning & Distribution


Server [Keyrange]









[-922, -461]

[0, 460]


[461, 922]
● ByteOrderedPartitioner (not recommended)
○ Plus: You can scan across lexically ordered keys
○ Minus: bad load balancing, hotspots, etc.

● RandomPartitioner (default before 1.2)
○ The RandomPartitioner distributes data evenly across the nodes
using an MD5 hash value of the row key. The possible range of
hash values is from 0 to 2^127 - 1.

● Murmur3Partitioner (default since 1.2)
○ The Murmur3Partitioner uses the MurmurHash function. This
hashing function creates a 64-bit hash value of the row key.
The possible range of hash values is from -2^63 to +2^63 - 1.

Data Partitioning & Distribution
● Virtual Nodes (Vnodes)
● Since Cassandra 1.2: Virtual Nodes for
○ better load balancing
○ easier scaling with differently sized servers
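A simplified sketch of why vnodes help with differently sized servers: each server takes several positions on the ring, and a larger server can simply take more of them. The random token choice, the server names, and the helper functions below are assumptions for illustration and do not reproduce Cassandra's token allocation code.

```python
import bisect
import random
from collections import Counter

MIN_TOKEN, MAX_TOKEN = -2 ** 63, 2 ** 63 - 1  # 64-bit token range


def assign_tokens(tokens_per_server: dict, seed: int = 0):
    """Give every server as many random ring positions as requested."""
    rng = random.Random(seed)
    ring = []
    for server, num_tokens in tokens_per_server.items():
        for _ in range(num_tokens):
            ring.append((rng.randint(MIN_TOKEN, MAX_TOKEN), server))
    ring.sort()
    return ring


def owner(ring, token: int) -> str:
    """Server holding the first ring position at or after the token."""
    positions = [t for t, _ in ring]
    idx = bisect.bisect_left(positions, token)
    return ring[idx % len(ring)][1]


# Hypothetical cluster: the larger machine takes twice as many vnodes.
ring = assign_tokens({"small-server": 3, "large-server": 6}, seed=42)
ranges_per_server = Counter(server for _, server in ring)
```

With many small ranges per server, a joining or leaving server exchanges data with many peers instead of one or two ring neighbors.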
Virtual Nodes
Example with num_tokens: 3

Server [Keyrange]




[882, -854]
[-110, -26]
[372, 469]

[675, 882]
[227, 298]
[298, 372]

[-798, -743]
[-364, -110]
[469, 675]

[-854, -798]
[-743, -364]
[-26, 227]



Data Replication


Server [Keyrange]









[461, -461]

[-460, 460]


[0, 922]
Data Replication
● Replication for high availability and data durability
○ Replication factor N: Each row is replicated at N nodes.
○ Each row key k is assigned to a coordinator node.
○ The coordinator is responsible for replicating the
rows within its key range.
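One simple way to picture replica placement is to start at the node that owns the key and keep walking clockwise around the ring until N distinct servers have been collected. The sketch below assumes a toy four-server ring with tokens in the range [-922, 922]; the server letters and the `replicas_for` helper are hypothetical, and topology-aware placement across racks or data centers is not modeled.

```python
import bisect

# Toy ring: (upper end of the server's primary key range, server name).
ring = [(-461, "A"), (-1, "B"), (460, "C"), (922, "D")]


def replicas_for(ring, token: int, n: int):
    """Owner of the token plus the next servers clockwise, n distinct in total."""
    positions = [t for t, _ in ring]
    start = bisect.bisect_left(positions, token) % len(ring)
    chosen = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == n:
            break
    return chosen


replicas = replicas_for(ring, token=-333, n=3)
wrapped = replicas_for(ring, token=900, n=3)
```

With N = 3, the row with token -333 is stored on B and on the next two servers, C and D; a token near the top of the ring wraps around to A and B.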
Multi-DC Data Replication







0 -1




0 -1



[461,461] [-922,-1][-460,460] [0,922]



[0,922] [-922,-1]
Network Topology (Snitches)
● Selected Snitches

SimpleSnitch (default)

● Set the endpoint_snitch property in
● SimpleSnitch does not recognize data center
or rack information
● Only useful for small single-DC deployments
● Assumes the network topology from the
node's IP address
DC Rack Server
● Uses conf/ file
to infer data center and rack information
● Useful if cluster layout is not matched by IP
addresses or if you have complex grouping
● Example properties file:
# Data Center One
# Data Center Two
● Each node sets its own data center and rack
info via conf/cassandra-rackdc.properties
● The info is propagated to other nodes via
gossip. Fits nicely the P2P style of
● Example properties file:
Dynamic Snitching
● Dynamic snitching avoids routing requests to
badly performing nodes.
● Properties in the cassandra.yaml
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
Server-to-Server Communication:
● Cassandra uses a gossip protocol to
exchange information between servers in a
cluster in a peer-to-peer fashion
○ The gossip process runs every second on each
Cassandra server
○ Each server sends its state in a message to other
servers in the cluster
○ Each gossip message has a version. Old gossip
state information on a server is overwritten.
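The "newer version overwrites older state" rule can be sketched as a merge of two state maps. This is a deliberately reduced model: the addresses, the state strings, and the `merge_gossip` function are hypothetical, and the actual message exchange between Cassandra nodes is more involved than a single merge.

```python
def merge_gossip(local: dict, remote: dict) -> dict:
    """Per server, keep whichever (version, state) entry has the higher version."""
    merged = dict(local)
    for server, (version, state) in remote.items():
        if server not in merged or merged[server][0] < version:
            merged[server] = (version, state)
    return merged


# What this server currently believes about the cluster.
local_view = {
    "10.0.0.1": (5, "NORMAL"),
    "10.0.0.2": (2, "NORMAL"),
}
# State received in a gossip message from a peer.
remote_view = {
    "10.0.0.2": (7, "LEAVING"),
    "10.0.0.3": (1, "JOINING"),
}

merged_view = merge_gossip(local_view, remote_view)
```

Because the merge only depends on version numbers, both peers end up with the same view regardless of who initiated the exchange.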
Server-to-Server Communication:
● Seeds
○ The list of seed addresses in the cassandra.yaml file is
only used during the initial bootstrapping
of a new server in the cluster.
○ The bootstrapping server establishes gossip
communication with the servers in the seeds list.

● You should use the same seeds list on all
servers to prevent split-brain partitions in
gossip communication.
● In a Multi-DC setup, the seeds list must
include at least one server from each DC.
Server-to-Server Communication
● Delete gossip state on a server
○ You can delete the gossip state of server by adding
the following in your file:

● This is necessary in certain situations when
you restart one or more servers, such as
○ You restart a server after its IP address has been changed
○ You restart all servers with a new cluster_name
Server-to-Server Communication:
Failure Detection
● Cassandra implements a Gossip-based
accrual failure detector that adapts the time
interval based on historic latency data.
○ Every Cassandra node maintains a sliding window of
inter-arrival times of Gossip messages.
○ The Cassandra failure detector assumes an
exponential distribution of inter-arrival times.
○ The failure detector can be configured with the cassandra.yaml
parameter phi_convict_threshold
Tip: You can make the failure detector less sensitive to latency
variability, for example during times of network congestion or in
Multi-DC setups, by increasing the phi_convict_threshold
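Under the exponential assumption, the suspicion level phi has a simple closed form: the probability that the next heartbeat is still outstanding after time t is exp(-t / mean), and phi is the negative base-10 logarithm of that probability. The class below is a sketch of this calculation only; its name, the window size, and the method names are assumptions, not Cassandra's code.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window_size: int = 1000):
        # Sliding window of inter-arrival times of heartbeat messages.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):      # one heartbeat per second
    detector.heartbeat(t)

phi_after_one_second = detector.phi(4.0)
phi_after_long_silence = detector.phi(3.0 + 8 * math.log(10))
```

A node is considered down once phi exceeds the configured threshold. With a mean interval of one second, a threshold of 8 is reached after roughly 18.4 seconds of silence in this model; raising the threshold makes the detector more tolerant of slow or bursty networks.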
Heartbeat Failure Detection
● Heartbeat failure detector

Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama.
2004. The Phi Accrual Failure Detector. In Proceedings of the 23rd IEEE
International Symposium on Reliable Distributed Systems (SRDS '04). IEEE
Computer Society, Washington, DC, USA, 66-78.
Accrual Failure Detection
● Accrual failure detector

Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama.
2004. The Phi Accrual Failure Detector. In Proceedings of the 23rd IEEE
International Symposium on Reliable Distributed Systems (SRDS '04). IEEE
Computer Society, Washington, DC, USA, 66-78.
Scaling a Cluster
● Bootstrapping managed via command line
admin tool
● A node starts for the first time


○ Choose position on the ring (via token)
○ Join cluster: connect with seed node(s) and start
data streaming from neighboring node.
○ Bootstrapping/joining phase is completed when all
data for the newly split keyrange has been streamed
to the new node.
Basically the same process for a cluster with Vnodes,
however, a server with Vnodes has multiple positions in
the ring and streams from multiple neighboring servers.
Add Server










[-922, -461]

[0, 460]


[461, 922]
Add a Server







[-922, -712]


[-711, -461]





[0, 460]


[461, 922]
Remove a Server







[-922, -712]


[-711, -461]





[0, 460]


[461, 922]
Client-Server Communication



-461 [-922, -461]


Request record with
primary key -333

Client-Server Communication



-461 [-922, -461]


Request record with
primary key -333

from B

Client-Server Communication



-461 [-922, -461]


return record


Local Persistence - Write
● Write
1. Append to commit log for durability (recoverability)
2. Update of in-memory, per-column-family Memtable

● If Memtable crosses a threshold
1. Sequential write to disk (SSTable).
2. Merge SSTables from time to time (compactions)
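The write path above can be mimicked with ordinary Python containers. This is a toy model under stated assumptions: the `Node` class, the entry-count threshold, and the in-memory lists standing in for the commit log and SSTable files are all hypothetical simplifications.

```python
class Node:
    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []    # stands in for the append-only log on disk
        self.memtable = {}      # in-memory buffer of recent writes
        self.sstables = []      # immutable, sorted tables (oldest first)
        self.memtable_limit = memtable_limit

    def write(self, key, value) -> None:
        self.commit_log.append((key, value))   # 1. append to commit log
        self.memtable[key] = value             # 2. update memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as one sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self) -> None:
        """Merge all SSTables into one; newer tables override older ones."""
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]


node = Node(memtable_limit=3)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    node.write(key, value)      # third write crosses the threshold
node.write("a", 9)              # newer value for an existing key
node.write("d", 4)
node.flush()
node.compact()
```

Writes never modify an existing SSTable; the newer value for `a` only replaces the older one when compaction merges the tables.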
Local Persistence - Write (2)
Local Persistence - Write
Example (1)
Local Persistence - Write
Example (2)
Local Persistence - Write
Example (3)
Local Persistence - CommitLog
# commitlog_sync may be either "periodic" or "batch."
# When in batch mode, Cassandra won't ack writes until the
# commit log has been fsynced to disk. It will wait up to
# commitlog_sync_batch_window_in_ms milliseconds for other
# writes, before performing the sync.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50
# the other option is "periodic" where writes may be acked
# immediately and the CommitLog is simply synced every
# commitlog_sync_period_in_ms milliseconds.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
Local Persistence - Read
● Read
○ Query in-memory Memtable
○ Check in-memory bloom filter
■ Used to prevent unnecessary disk access.
■ A bloom filter summarizes the keys in a file.
■ False Positives are possible
○ Check column index to jump to the columns on disk
as fast as possible.
■ Index for every 256K chunk.
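A minimal Bloom filter shows why false positives are possible but false negatives are not. The bit-array size, the number of hash functions, and the way positions are derived from SHA-256 are arbitrary choices for this sketch and are not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        """Derive num_hashes bit positions from salted digests of the key."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means the file must be checked."""
        return all((self.bits >> p) & 1 for p in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("row-1", "row-2", "row-3"):
    sstable_filter.add(row_key)

must_read_file = sstable_filter.might_contain("row-2")
```

A read consults the filter of each SSTable first and skips the disk access whenever `might_contain` returns `False`.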
Local Persistence - Read (2)
Local Persistence - Read (3)
Local Persistence - Read (4)
Compactions in Cassandra
Compactions in Cassandra (2)
Size-Tiered Compaction Strategy
Leveled Compaction Strategy
Cassandra: Summary
● Scalability: Peer-to-Peer architecture
● High availability:
○ replication
○ quorum-based replica control
○ failure detection and recovery

● High performance (particularly for WRITE
operations): Bigtable-style storage engine
○ All writes to disk are sequential. Files are written
once (immutable) and not updated thereafter

CouchBase The Complete NoSql Solution for Big Data
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of data
distributed system lab materials about ad
distributed system lab materials about addistributed system lab materials about ad
distributed system lab materials about ad
Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Hadoop Training Tutorial for Freshers
Hadoop Training Tutorial for FreshersHadoop Training Tutorial for Freshers
Hadoop Training Tutorial for Freshers
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop

Cassandra background-and-architecture

  • 2. Big Data & NoSQL
  • 3. Why do we need NoSQL Databases? ● Modern Internet application requirements ○ low-latency CRUD operations ○ elastic scalability ○ high availability ○ reliable and durable storage ○ geographic distribution ○ flexible schema ● Less prioritized ○ Transactions, ACID guarantees ■ ... but some form of data consistency is still desirable ○ SQL support ■ ... but some SQL features are still desirable
  • 4. What is Big Data? [Figure: the three Vs of Big Data. Velocity: batch, periodic, interactive, near realtime, realtime. Volume: MB, GB, TB, PB. Variety: structured data, unstructured data.] c.f. /forum/topics/the-3vs-that-define-big-data
  • 5. Scalability & High Availability ● Workload and data volume grow -> partition and distribute data across multiple servers (horizontal scaling) ○ How to determine partition boundaries? ○ Can partitions be changed dynamically? ○ How to route requests to the right server? ● High Availability -> replicate data ○ What kind of failures can we deal with? ○ Sync or async replication? ○ Local or geo replication? ○ Consistency model?
  • 6. NoSQL Databases categorized by System Architecture (Architecture | Techniques | Systems) ● Dynamo-style Ring (P2P) | All nodes are equal. Each node stores a data partition + replicated data. Eventually consistent. | Cassandra, Riak, Voldemort, Amazon DynamoDB ● Master-Slave | Data partitioned across slaves. Each slave stores a data partition + replicated data. Strong consistency guarantees. | HBase, MongoDB, Redis, Yahoo! Sherpa, Neo4j ● Full replication | Bi-directional, incremental replication between all nodes. Each node stores all data. Eventually consistent. | CouchDB
  • 7. NoSQL Databases categorized by Data Model and Storage ● Wide-Column | Each row stores a flexible number of columns. Data is partitioned by row key. | Cassandra, HBase, Amazon DynamoDB ● Document | Storage and retrieval of structured data in the form of JSON, YAML, or RDF documents. | CouchDB, MongoDB ● Key-value | Row-oriented data storage of simple (key,value) pairs in a flat namespace. | Riak, Redis, Voldemort, Yahoo! Sherpa ● Graph | Storage and retrieval of data that is stored as nodes and links of graphs in a graph-space. | Neo4j
  • 9. Dynamo: Amazon's highly available key-value store ● Amazon Dynamo Paper Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the twenty-first ACM SIGOPS symposium on Operating systems principles (SOSP '07). ACM, New York, NY, USA, 205-220. ● Download: edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
  • 10. Amazon Dynamo: Techniques (1) (Problem | Technique | Advantage) ● Partitioning | Consistent Hashing | Incremental Scalability ● High Availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates. ● Handling temporary failures | Sloppy Quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available.
  • 11. Amazon Dynamo: Techniques (2) (Problem | Technique | Advantage) ● Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background. ● Membership and failure detection | Gossip-based membership protocol and failure detection. | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.
  • 12. Dynamo: Incremental scalability ● Simple key/value API ● Consistent hashing ○ Cryptographic MD5 hash of key generates a 128-bit identifier ○ The largest value wraps around the smallest one. The keyspace is a "ring".
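The ring idea above can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo's implementation: the node names, the `HashRing` class, and the rule that a key belongs to the first node clockwise from its hash are assumptions made for the example.

```python
import bisect
import hashlib


def md5_token(key: str) -> int:
    """Map a key to a 128-bit identifier using MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical ring: each node sits at the hash of its name."""

    def __init__(self, nodes):
        self._ring = sorted((md5_token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # First node clockwise from the key's position; past the largest
        # token the lookup wraps around to the smallest one.
        i = bisect.bisect_left(self._tokens, md5_token(key))
        return self._ring[i % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
moved = [k for k in keys if ring.owner(k) != bigger.owner(k)]
```

Adding `node-d` only takes keys away from its clockwise successor; every key in `moved` is now owned by `node-d`, which is what makes scaling incremental.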
  • 14. Dynamo: Data versioning ● Optimistic replication: updates may travel asynchronously. This can lead to inconsistencies. ○ How to identify inconsistencies? ○ Data versions, vector clocks!
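A vector clock comparison might look like the following sketch. The function names and the server labels `sx`, `sy`, `sz` are hypothetical; a clock is modeled as a plain dict from node name to counter.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a is newer"
    if b_ge:
        return "b is newer"
    return "conflict"  # concurrent versions: reconciliation is needed


v1 = increment({}, "sx")
v2 = increment(v1, "sx")
v3 = increment(v2, "sy")  # update handled by server sy
v4 = increment(v2, "sz")  # concurrent update handled by server sz
```

Here `v3` and `v4` both descend from `v2` but neither descends from the other, so `compare(v3, v4)` reports a conflict that has to be reconciled.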
  • 15. Dynamo: Sloppy Quorum ● (N,R,W) quorum configuration ○ N replicas ○ R votes for a successful READ operation ○ W votes for a successful WRITE operation ● Quorum intersection invariants ○ N < R+W ○ (W > N/2) ● Sloppy quorum ○ If a node goes down, save the data temporarily as "Hinted Handoffs" on another node. ○ This avoids unavailability. ○ Hinted Handoffs must be resolved after some time.
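Why N < R+W matters can be checked by brute force for small N: the invariant holds exactly when every possible read set shares at least one replica with every possible write set. The helper names below are made up for this sketch.

```python
from itertools import combinations


def satisfies_invariants(n: int, r: int, w: int) -> bool:
    """The two quorum intersection invariants from the slide."""
    return n < r + w and w > n / 2


def read_always_overlaps_write(n: int, r: int, w: int) -> bool:
    """Enumerate all read sets of size r and write sets of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

For example, with N=3 the configuration R=2, W=2 always overlaps, while R=1, W=1 can read from a replica that missed the latest write.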
  • 16. Dynamo: Merkle Trees ● Each server node calculates one Merkle Tree for one owned keyrange. ● Merkle Tree = A tree of hash values of parts of the keyrange. [Figure: Hash(Keyrange) at the root with children Hash(Part1) and Hash(Part2); Hash(Part1) has children Hash(Part1-1) and Hash(Part1-2); Hash(Part2) has children Hash(Part2-1) and Hash(Part2-2).]
  • 17. Dynamo: Anti Entropy Protocol ● Anti-entropy protocol ○ Exchange Merkle Trees with replica server nodes. ○ Lazy loading of Merkle Trees ○ Compare the trees and track down inconsistencies. ○ Trigger conflict resolution (last write wins) [Figure: Server A sends Hash(Keyrange) to Server B; when the comparison fails, it sends Hash(Part1), and so on.]
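The tree comparison can be sketched as follows. This is a simplified illustration under stated assumptions: both replicas hold the same set of keys, the tree is built by halving a sorted list of rows, SHA-256 is an arbitrary choice of hash, and each node keeps its rows in memory for brevity.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(rows):
    """Build a Merkle tree over (key, value) rows sorted by key."""
    if len(rows) <= 1:
        payload = b"".join(f"{k}={v};".encode() for k, v in rows)
        return {"hash": h(payload), "rows": rows, "children": []}
    mid = len(rows) // 2
    left, right = build(rows[:mid]), build(rows[mid:])
    return {"hash": h(left["hash"] + right["hash"]),
            "rows": rows, "children": [left, right]}


def diff(a, b):
    """Descend only into subtrees whose hashes differ; return suspect keys."""
    if a["hash"] == b["hash"]:
        return []  # identical subtree: nothing to exchange
    if not a["children"] or not b["children"]:
        return sorted({k for k, _ in a["rows"]} | {k for k, _ in b["rows"]})
    return [k for ca, cb in zip(a["children"], b["children"])
            for k in diff(ca, cb)]


server_a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
server_b = [("k1", "v1"), ("k2", "v2"), ("k3", "stale"), ("k4", "v4")]
out_of_sync = diff(build(server_a), build(server_b))
```

Only the branch containing `k3` is descended, so the replicas would need to exchange a handful of hashes rather than the whole key range.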
  • 18. Dynamo: Membership & failure detection ● Gossip-based protocol ○ All nodes (eventually) share the same view of the system. ○ Information is exchanged via gossip between peers [Figure: peer nodes A to H exchanging gossip messages]
  • 19. Dynamo: Summary ● CAP theorem: in a widely distributed system, strong consistency and high availability (+ low latency) cannot be achieved at the same time. ● Optimistic replication: improve availability + latency at the cost of inconsistencies ● Requires conflict resolution: ○ when to resolve conflicts? ○ who resolves conflicts?
  • 20. Bigtable: A Distributed Storage System for Structured Data ● Google Bigtable Paper Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.
  • 21. Google Bigtable "A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes." Source: Bigtable paper
  • 22. Google Bigtable: Data Model Source: Bigtable paper
  • 23. Google Bigtable: Structure and query model ● A table contains a number of rows and is broken down into tablets which each contain a subset of the rows ● Tablets are the unit of distribution and load balancing ● Rows are sorted lexicographically by row key ○ Subsequent row keys are within a tablet ○ Allows efficient range queries for small numbers of rows ● Operations: Read, write and delete items + batch writes ● Supports single-row transactions
  • 24. Google Bigtable: Local persistence ● Tablet servers record updates in commit logs; persisted data is written in so-called SSTables ● Fresh updates are kept in memory (memtable), old updates are stored in GFS ● Minor compactions flush the memtable into a new SSTable ● Major compactions merge SSTables into just one new SSTable
  • 25. Google Bigtable: Local persistence Source: Bigtable paper
  • 27. Cassandra: a decentralized structured storage system ● Cassandra Paper Avinash Lakshman and Prashant Malik. 2010. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (April 2010), 35-40. ● URL: http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF
  • 28. Architecture Topics ● Data Partitioning & Distribution ○ Partitioners ○ Virtual Nodes ● Data Replication ● Network Topology (Snitches) ● Server-to-Server Communication (Gossip) ○ Membership ○ Failure detection ● Scaling a Cluster ● Client-Server Communication ● Local Persistence
  • 30. Data Partitioning & Distribution [Figure: keyspace ring from -922 to 922, split between four servers.] (Server | Keyrange) ● A | [-922, -461] ● B | [-460, -1] ● C | [0, 460] ● D | [461, 922]
  • 31. Partitioners ● ByteOrderedPartitioner (not recommended) ○ Plus: You can scan across lexically ordered keys ○ Minus: bad load balancing, hotspots, etc. ● RandomPartitioner (default before 1.2) ○ The RandomPartitioner distributes data evenly across the nodes using an MD5 hash value of the row key. The possible range of hash values is from 0 to 2^127 - 1. ● Murmur3Partitioner (default since 1.2) ○ The Murmur3Partitioner uses the MurmurHash function. This hashing function creates a 64-bit hash value of the row key. The possible range of hash values is from -2^63 to +2^63 - 1.
  • 32. Data Partitioning & Distribution ● Virtual Nodes (Vnodes) ● Since Cassandra 1.2: Virtual Nodes for ○ better load balancing ○ easier scaling with differently sized servers
  • 33. Virtual Nodes Example with num_tokens: 3 [Figure: keyspace ring with token positions 882, -854, -798, -743, -364, -110, -26, 277, 298, 372, 469, 675.] (Server | Keyranges) ● A | [882, -854] [-110, -26] [372, 469] ● B | [675, 882] [277, 298] [298, 372] ● C | [-798, -743] [-364, -110] [469, 675] ● D | [-854, -798] [-743, -364] [-26, 277]
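A lookup over the example ring can be sketched like this. The sketch assumes 277 and 675 are the intended boundaries between the ranges around those tokens, that each range is owned by the server listed for it, and that a token equal to a range end belongs to that range; `owner` and `ring_share` are hypothetical helper names.

```python
import bisect

# (range end token, server) pairs read off the example ring.
VNODES = sorted([
    (-854, "A"), (-26, "A"), (469, "A"),
    (882, "B"), (298, "B"), (372, "B"),
    (-743, "C"), (-110, "C"), (675, "C"),
    (-798, "D"), (-364, "D"), (277, "D"),
])
END_TOKENS = [t for t, _ in VNODES]


def owner(token: int) -> str:
    """Server owning the range that ends at or after the token (with wrap)."""
    i = bisect.bisect_left(END_TOKENS, token)
    return VNODES[i % len(VNODES)][1]


def ring_share(server: str, lo: int = -922, hi: int = 922) -> int:
    """Total length of the ranges a server owns on the ring."""
    total = 0
    for i, (end, name) in enumerate(VNODES):
        start = VNODES[i - 1][0]  # previous token; index -1 wraps the ring
        if name == server:
            total += (end - start) % (hi - lo)
    return total


shares = {s: ring_share(s) for s in "ABCD"}
```

With only three tokens per server the shares in this toy ring are still uneven; in practice a much larger `num_tokens` is what smooths the distribution.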
  • 36. Data Replication ● Replication for high availability and data durability ○ Replication factor N: Each row is replicated at N nodes. ○ Each row key k is assigned to a coordinator node. ○ The coordinator is responsible for replicating the rows within its key range.
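One simple placement rule can be sketched as follows: the coordinator is the server whose range contains the key, and the remaining replicas are the next servers clockwise on the ring. This successor rule is an assumption for illustration (actual placement depends on the configured replication strategy and snitch); the ring uses the four-server key ranges from the partitioning example.

```python
import bisect

# (range end token, server) for the four-server example ring.
RING = [(-461, "A"), (-1, "B"), (460, "C"), (922, "D")]
TOKENS = [t for t, _ in RING]


def replicas(key_token: int, n: int) -> list:
    """Coordinator first, then its clockwise successors, n servers in total."""
    start = bisect.bisect_left(TOKENS, key_token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(min(n, len(RING)))]
```

For a row key with token -333 and N=3, this yields coordinator B followed by C and D; a key with token 700 and N=2 wraps around the ring to D and A.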
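  • 37. Multi-DC Data Replication [Figure: keyspace ring from -922 to 922 replicated across two data centers with replication settings DC1:2 and DC2:1. Key ranges shown for servers A, B, C, D: [461,461], [-922,-1], [-460,460], [0,922]. Key ranges shown for servers E, F: [0,922], [-922,-1].]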
  • 39. Network Topology (Snitches) ● Selected Snitches ○ SimpleSnitch (default) ○ RackInferringSnitch ○ PropertyFileSnitch ○ GossipingPropertyFileSnitch ● Set the endpoint_snitch property in cassandra.yaml
  • 40. SimpleSnitch ● SimpleSnitch does not recognize data center or rack information ● Only useful for small single-DC deployments
  • 41. RackInferringSnitch ● Infers the network topology from the node's IP address [Figure: an IP address whose octets are labeled DC, Rack, and Server]
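The inference can be illustrated with a small sketch, assuming the usual convention that the second octet identifies the data center, the third the rack, and the fourth the server. The function name and the sample address are made up for the example.

```python
def infer_location(ip: str) -> dict:
    """Split an IPv4 address into data center, rack, and server parts."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return {"dc": octets[1], "rack": octets[2], "server": octets[3]}


location = infer_location("10.20.30.40")
```

Two nodes at `10.20.30.40` and `10.20.31.7` would be treated as the same data center but different racks, which only works if the addressing scheme really mirrors the physical layout.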
  • 42. PropertyFileSnitch ● Uses conf/ file to infer data center and rack information ● Useful if cluster layout is not matched by IP addresses or if you have complex grouping requirements ● Example properties file: # Data Center One # Data Center Two
  • 43. GossipingPropertyFileSnitch ● Each node sets its own data center and rack info via the conf/cassandra-rackdc.properties file. ● The info is propagated to other nodes via gossip. This fits the P2P style of Cassandra nicely. ● Example properties file: dc=DC1 rack=RAC1
  • 44. Dynamic Snitching ● Dynamic snitching avoids routing requests to badly performing nodes. ● Properties in the cassandra.yaml dynamic_snitch_update_interval_in_ms: 100 dynamic_snitch_reset_interval_in_ms: 600000 dynamic_snitch_badness_threshold: 0.1
  • 46. Server-to-Server Communication: Gossip ● Cassandra uses a gossip protocol to exchange information between servers in a cluster in a peer-to-peer fashion ○ The gossip process runs every second on each Cassandra server ○ Each server sends its state in a message to other servers in the cluster ○ Each gossip message has a version. Old gossip state information on a server is overwritten.
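The "newer version overwrites older state" rule can be sketched as a merge of two state maps. This is a deliberately simplified model: each server's state is a hypothetical `(version, status)` pair, whereas the real protocol tracks more detail per endpoint.

```python
def merge(local: dict, remote: dict) -> dict:
    """Keep, for every server, the state carrying the highest version."""
    merged = dict(local)
    for server, (version, status) in remote.items():
        if server not in merged or version > merged[server][0]:
            merged[server] = (version, status)
    return merged


view_on_a = {"A": (5, "UP"), "B": (2, "UP")}
view_on_b = {"B": (3, "LEAVING"), "C": (1, "UP")}
after_gossip = merge(view_on_a, view_on_b)
```

After one exchange, server A learns about C and replaces its outdated entry for B; repeated exchanges between random peers are what let all nodes converge on the same view.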
  • 47. Server-to-Server Communication: Seeds ● Seeds ○ The list of seed addresses in the cassandra.yaml file is only used during initial bootstrapping of a new server in the cluster. ○ The bootstrapping server establishes gossip communication with the servers in the seeds list. ● You should use the same seeds list on all servers to prevent split-brain partitions in gossip communication. ● In a Multi-DC setup, the seeds list must include at least one server from each DC.
  • 48. Server-to-Server Communication ● Delete gossip state on a server ○ You can delete the gossip state of a server by adding the following in your file: -Dcassandra.load_ring_state=false ● This is necessary in certain situations when you restart one or more servers, such as ○ You restart a server after its IP address has been changed ○ You restart all servers with a new cluster_name
  • 49. Server-to-Server Communication: Failure Detection ● Cassandra implements a Gossip-based accrual failure detector that adapts the time interval based on historic latency data. ○ Every Cassandra node maintains a sliding window of inter-arrival times of Gossip messages. ○ The Cassandra failure detector assumes an exponential distribution of inter-arrival times. ○ The failure detector can be configured with the cassandra.yaml parameter phi_convict_threshold. ● Tip: You can make the failure detector less sensitive to latency variability, for example during times of network congestion or in Multi-DC setups, by increasing the phi_convict_threshold value.
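Under the exponential assumption, the suspicion level phi has a closed form: the probability that the next message arrives later than `elapsed` is exp(-elapsed / mean), and phi is the negative base-10 logarithm of that probability. The class below is a sketch of that calculation, not Cassandra's code; the class name, window size, and timestamps are illustrative.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window_size: int = 1000):
        # Sliding window of inter-arrival times of gossip messages.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)
```

A node would be convicted once `detector.phi(now)` exceeds `phi_convict_threshold`; with a mean interval of one second, a threshold of 8 corresponds to roughly 18 seconds of silence, and a higher threshold tolerates longer gaps.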
  • 50. Heartbeat Failure Detection ● Heartbeat failure detector Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama. 2004. The Phi Accrual Failure Detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS '04). IEEE Computer Society, Washington, DC, USA, 66-78.
  • 51. Accrual Failure Detection ● Accrual failure detector Naohiro Hayashibara, Xavier Defago, Rami Yared, and Takuya Katayama. 2004. The Phi Accrual Failure Detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS '04). IEEE Computer Society, Washington, DC, USA, 66-78.
  • 53. Scaling a Cluster ● Bootstrapping is managed via a command line admin tool ● A node starts for the first time ○ Choose position on the ring (via token) ○ Join cluster: connect with seed node(s) and start data streaming from neighboring node. ○ Bootstrapping/joining phase is completed when all data for the newly split keyrange has been streamed to the new node. ● Basically the same process applies to a cluster with Vnodes; however, a server with Vnodes has multiple positions in the ring and streams from multiple neighboring servers.
  • 55. Add a Server [Figure: keyspace ring from -922 to 922 after server Z is added.] (Server | Keyrange) ● Z | [-922, -712] ● A | [-711, -461] ● B | [-460, -1] ● C | [0, 460] ● D | [461, 922]
  • 56. Remove a Server [Figure: keyspace ring from -922 to 922 while server C is removed.] (Server | Keyrange) ● Z | [-922, -712] ● A | [-711, -461] ● B | [-460, 460] ● C | [0, 460] ● D | [461, 922]
  • 58. Client-Server Communication [Figure: part of the keyspace ring with server A [-922, -461] and server B [-460, -1]. The Cassandra client sends "Request record with primary key -333".]
  • 59. Client-Server Communication [Figure: same ring. After the client's "Request record with primary key -333", the step "Request record from B" follows.]
  • 60. Client-Server Communication [Figure: same ring. Two "return record" steps lead back to the Cassandra client.]
  • 62. Local Persistence - Write ● Write 1. Append to commit log for durability (recoverability) 2. Update of in-memory, per-column-family Memtable ● If Memtable crosses a threshold 1. Sequential write to disk (SSTable). 2. Merge SSTables from time to time (compactions)
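The write path above can be modeled in memory as a rough sketch. Everything here is a toy stand-in: the class name, the tiny flush threshold, lists in place of files, and clearing the commit log on flush are assumptions made to keep the example short.

```python
class ColumnFamilyStore:
    def __init__(self, memtable_threshold: int = 3):
        self.commit_log = []   # append-only, used for recovery
        self.memtable = {}     # in-memory, per column family
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_threshold = memtable_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the in-memory update
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # One sequential write of the sorted memtable; never modified later.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def compact(self):
        merged = {}
        for sstable in self.sstables:  # newer tables overwrite older values
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest table wins
            for k, v in sstable:
                if k == key:
                    return v
        return None


store = ColumnFamilyStore()
for key, value in [("k1", "old"), ("k2", "b"), ("k3", "c"),
                   ("k1", "new"), ("k4", "d"), ("k5", "e")]:
    store.write(key, value)
```

After these six writes there are two SSTables that both contain `k1`; a read returns the newer value, and `store.compact()` merges them into a single table without changing what reads return.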
  • 63. Local Persistence - Write (2)
  • 64. Local Persistence - Write Example (1)
  • 65. Local Persistence - Write Example (2)
  • 66. Local Persistence - Write Example (3)
  • 67. Local Persistence - CommitLog (cassandra.yaml)
        # commitlog_sync may be either "periodic" or "batch."
        # When in batch mode, Cassandra won't ack writes until the
        # commit log has been fsynced to disk. It will wait up to
        # commitlog_sync_batch_window_in_ms milliseconds for other
        # writes, before performing the sync.
        # commitlog_sync: batch
        # commitlog_sync_batch_window_in_ms: 50
        # the other option is "periodic" where writes may be acked
        # immediately and the CommitLog is simply synced every
        # commitlog_sync_period_in_ms milliseconds.
        commitlog_sync: periodic
        commitlog_sync_period_in_ms: 10000
  • 68. Local Persistence - Read ● Read ○ Query in-memory Memtable ○ Check in-memory bloom filter ■ Used to prevent unnecessary disk access. ■ A bloom filter summarizes the keys in a file. ■ False Positives are possible ○ Check column index to jump to the columns on disk as fast as possible. ■ Index for every 256K chunk.
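A bloom filter of the kind described can be sketched as follows. The sizes, the use of SHA-256 with a seed prefix as the hash family, and the one-byte-per-bit storage are choices made for readability, not a description of Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
for row_key in ("alice", "bob", "carol"):
    sstable_filter.add(row_key)
```

A read consults the filter before touching the file: if `might_contain` returns False the SSTable can be skipped, and a rare false positive only costs one unnecessary disk access.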
  • 76. Cassandra: Summary ● Scalability: Peer-to-Peer architecture ● High availability: ○ replication ○ quorum-based replica control ○ failure detection and recovery ● High performance (particularly for WRITE operations): Bigtable-style storage engine ○ All writes to disk are sequential. Files are written once (immutable) and not updated thereafter.