Cassandra
A Decentralized Structured Storage System
By Sameera Nelson
Outline …
 Introduction
 Data Model
 System Architecture
 Failure Detection & Recovery
 Local Persistence
 Performance
 Statistics
What is Cassandra?
 Distributed Storage System
 Manages Structured Data
 Highly available, no single point of failure (SPoF)
 Not a relational data model
 Handles high write throughput
◦ With no impact on read efficiency
Motivation
 Operational Requirements in Facebook
◦ Performance
◦ Reliability/ Dealing with Failures
◦ Efficiency
◦ Continuous Growth
 Application
◦ Inbox Search Problem, Facebook
Similar Work
 Google File System
◦ Distributed FS, single master with chunkserver slaves
 Ficus/ Coda
◦ Distributed FS
 Farsite
◦ Distributed FS, No centralized server
 Bayou
◦ Distributed Relational DB System
 Dynamo
◦ Distributed Storage system
Data Model
Data Model
Figure from Eben Hewitt’s slides.
Supported Operations
 insert(table, key, rowMutation)
 get(table, key, columnName)
 delete(table, key, columnName)
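As a rough illustration only (not the real Thrift interface), the three calls can be pictured against a toy in-memory column-family map; SimpleStore and all names in this Python sketch are hypothetical:

from collections import defaultdict

class SimpleStore:
    """Toy stand-in for the three-call API (illustration, not Cassandra code)."""
    def __init__(self):
        # table -> row key -> column name -> value
        self.tables = defaultdict(lambda: defaultdict(dict))

    def insert(self, table, key, row_mutation):
        # row_mutation: dict of column name -> value applied to the row
        self.tables[table][key].update(row_mutation)

    def get(self, table, key, column_name):
        return self.tables[table][key].get(column_name)

    def delete(self, table, key, column_name):
        self.tables[table][key].pop(column_name, None)

store = SimpleStore()
store.insert("users", "1745", {"fname": "john", "lname": "smith"})
print(store.get("users", "1745", "fname"))   # -> john
store.delete("users", "1745", "lname")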
Query Language
CREATE TABLE users (
  user_id int PRIMARY KEY,
  fname text,
  lname text
);
INSERT INTO users (user_id, fname, lname)
VALUES (1745, 'john', 'smith');
SELECT * FROM users;
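The same statements can be issued from the DataStax Python driver (cassandra-driver); this sketch assumes a node is running on localhost and that the keyspace (here called demo, an example name) and the users table from the CQL above already exist:

# pip install cassandra-driver
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])       # contact point: assumed local node
session = cluster.connect('demo')      # 'demo' is an assumed keyspace name

session.execute(
    "INSERT INTO users (user_id, fname, lname) VALUES (%s, %s, %s)",
    (1745, 'john', 'smith'))

for row in session.execute("SELECT * FROM users"):
    print(row.user_id, row.fname, row.lname)

cluster.shutdown()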
Data Structure
 Log-Structured Merge Tree
System
Architecture
Architecture
Fully Distributed …
 No Single Point of Failure
Cassandra Architecture
 Partitioning
◦ Data distribution across nodes
 Replication
◦ Data duplication across nodes
 Cluster Membership
◦ Node management in the cluster
◦ Adding / deleting nodes
Partitioning
 The Token Ring
Partitioning
 Partitions data across nodes using consistent hashing (sketch below)
Partitioning
 Each key is assigned to the relevant partition on the ring
Partitioning, Vnodes
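A minimal Python sketch of consistent hashing with virtual nodes: each physical node owns several tokens on the ring, and a key belongs to the first vnode found walking clockwise from the key's token. The MD5 token function (as in the RandomPartitioner), the node names, and the vnode count are illustrative assumptions:

import bisect
import hashlib

def token(value: str) -> int:
    # Map a key or vnode label onto the ring (MD5, like the RandomPartitioner)
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node owns several tokens (virtual nodes) on the ring
        self.ring = sorted((token(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes_per_node))
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's token to the first vnode token
        i = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("user:1745"))   # the node that owns this key's partition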
Replication
 Based on configured replication factor
Replication
 Different Replication Policies
◦ Rack Unaware
◦ Rack Aware
◦ Datacenter Aware
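A sketch of the simplest ("Rack Unaware") placement: starting from the key's position on the ring, walk clockwise and take the next N distinct nodes as replicas. The hard-coded four-node ring and replication factor of 3 are assumptions for illustration:

import bisect
import hashlib

def token(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Tiny stand-in ring: (token, node) pairs, sorted by token
ring = sorted((token(n), n) for n in ["node-a", "node-b", "node-c", "node-d"])
tokens = [t for t, _ in ring]

def replicas(key: str, rf: int = 3) -> list:
    # Rack-unaware placement: the next rf distinct nodes clockwise from the key
    start = bisect.bisect_right(tokens, token(key))
    chosen = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen

print(replicas("user:1745"))   # three distinct replica nodes for this key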
Cluster Membership
 Based on Scuttlebutt
 Efficient gossip-based mechanism
 Inspired by real-life rumor spreading
 Anti-entropy protocol
◦ Repairs replicated data by comparing & reconciling differences
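A rough sketch of one gossip round in a Scuttlebutt-like protocol: every node keeps a version number per peer, periodically picks a random peer, and the two merge views by keeping the higher version seen for each node. The three-node state below is made up, and this is a simplification rather than Cassandra's actual wire protocol:

import random

# Each node's view: peer name -> highest heartbeat version it has heard of
views = {
    "node-a": {"node-a": 5, "node-b": 2, "node-c": 7},
    "node-b": {"node-a": 4, "node-b": 9, "node-c": 6},
    "node-c": {"node-a": 5, "node-b": 8, "node-c": 7},
}

def gossip_round(views):
    # Every node picks one random peer; both keep the newer version of each entry
    for node in views:
        peer = random.choice([n for n in views if n != node])
        for name in set(views[node]) | set(views[peer]):
            newest = max(views[node].get(name, 0), views[peer].get(name, 0))
            views[node][name] = views[peer][name] = newest

gossip_round(views)
print(views["node-a"])   # converges toward the freshest versions cluster-wide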
Cluster Membership
Gossip Based
Failure Detection
& Recovery
Failure Detection
 Track state
◦ Directly, Indirectly
 Accrual Detection mechanism
 Permanent node changes
◦ An admin must explicitly add or remove nodes
 Hints
◦ Data to be replayed to a replica once it becomes reachable again
◦ Saved in the system.hints table
Accrual Failure Detector
• If a node is faulty, the suspicion level Φ(t) increases monotonically with time
• Φ(t) ≥ k
• k – threshold variable
• If a node is correct
• Φ(t) = 0
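A hedged sketch of how Φ can be computed: if heartbeat inter-arrival times are assumed to follow an exponential distribution with the observed mean (a simplification close to Cassandra's detector), then Φ(t) = -log10 P(gap ≥ elapsed) = elapsed / (mean · ln 10). The window size and sample timings below are illustrative:

import math
from collections import deque

class PhiAccrualDetector:
    # Suspicion level grows with time since the last heartbeat; Phi >= k -> suspect
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)   # recent heartbeat gaps (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential assumption: P(gap > elapsed) = exp(-elapsed / mean),
        # so Phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):       # heartbeats arriving about once per second
    d.heartbeat(t)
print(round(d.phi(3.5), 2))          # small Phi: node probably alive
print(round(d.phi(11.0), 2))         # large Phi: node likely failed (compare to k)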
Local
Persistence
Write Request
Write Operation
Write Operation
 Logging data in the commit log and in the memtable (sketch below)
 Flushing data from the memtable
◦ Triggered when a configurable threshold is reached
 Storing the flushed data on disk in SSTables
 Deletes are marked with a tombstone
 Compaction
◦ Removes deleted data, sorts and merges data, consolidates SSTables
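A minimal sketch of this write path: append to a commit log, apply to an in-memory memtable, and flush the memtable to an immutable, sorted "SSTable" once a threshold is crossed. The tiny threshold and in-memory stand-ins are assumptions for illustration:

import json

commit_log = []        # stand-in for the append-only on-disk commit log
memtable = {}          # in-memory map: row key -> latest columns
sstables = []          # each flush produces one immutable, sorted table

MEMTABLE_LIMIT = 2     # tiny threshold so the example actually flushes

def write(key, columns):
    commit_log.append(json.dumps({key: columns}))   # 1. durable, sequential append
    memtable.setdefault(key, {}).update(columns)    # 2. apply to the memtable
    if len(memtable) >= MEMTABLE_LIMIT:             # 3. flush on threshold
        sstables.append(sorted(memtable.items()))   #    sorted + immutable = SSTable
        memtable.clear()

write("1745", {"fname": "john"})
write("2001", {"fname": "jane"})
print(len(sstables), "SSTable(s) on disk,", len(memtable), "rows in memtable")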
Write Operation
 Compaction
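A sketch of compaction over two sorted runs: merge by key, keep only the newest version of each entry, and drop anything whose newest version is a tombstone. The (timestamp, value) layout is an assumption for illustration:

TOMBSTONE = object()   # marker value meaning "this entry was deleted"

# Two SSTables: key -> (timestamp, value); 'newer' was written later
older = {"1745": (10, "john"), "2001": (12, "jane"), "3000": (11, "bob")}
newer = {"1745": (20, "johnny"), "3000": (25, TOMBSTONE)}

def compact(*tables):
    # Merge tables, keep the newest version per key, then purge tombstones
    merged = {}
    for table in tables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return {k: v for k, v in sorted(merged.items()) if v[1] is not TOMBSTONE}

print(compact(older, newer))   # {'1745': (20, 'johnny'), '2001': (12, 'jane')}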
Read Request
 Direct read / Background (read repair)
Read Operation
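A sketch of the read path and of read repair: a direct read checks the memtable first, then SSTables from newest to oldest; background read repair compares replica versions and overwrites stale copies with the newest one. All structures and the timestamped values are illustrative assumptions:

memtable = {"2001": "jane"}
sstables = [{"1745": "john"}, {"1745": "jon", "3000": "bob"}]   # newest first

def read(key):
    # Direct read: memtable first, then SSTables newest-to-oldest; first hit wins
    if key in memtable:
        return memtable[key]
    for table in sstables:
        if key in table:
            return table[key]
    return None

def read_repair(replica_values):
    # Background repair: all replicas converge on the newest (timestamp, value)
    newest = max(replica_values, key=lambda tv: tv[0])
    return [newest for _ in replica_values]

print(read("1745"))                            # 'john' (newest SSTable wins)
print(read_repair([(5, "jon"), (9, "john")]))  # stale replica overwritten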
Delete Operation
 Data is not removed immediately
 Only a tombstone is written
 Physically deleted during compaction
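A small sketch of the tombstone idea: a delete is just another write whose value marks the entry as gone, so reads hide it immediately while the physical removal only happens at compaction. The marker and names are illustrative:

TOMBSTONE = "<<deleted>>"   # illustrative marker; real tombstones carry a timestamp

memtable = {"1745": {"fname": "john", "lname": "smith"}}

def delete(key, column):
    # Nothing is removed yet: the tombstone is written like any other value
    memtable.setdefault(key, {})[column] = TOMBSTONE

def get(key, column):
    value = memtable.get(key, {}).get(column)
    return None if value == TOMBSTONE else value

delete("1745", "lname")
print(get("1745", "lname"))   # None: hidden now, physically removed at compaction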
Additional Features
 Compression
◦ Snappy compression
 Secondary index support
 SSL support
◦ Client/ Node
◦ Node/ Node
 Rolling commit logs
 SSTable data file merging
Performance
Performance
 High Throughput & Low Latency
◦ No in-place modification of data on disk
◦ Eliminates erase-block cycles
◦ No locking for concurrency control
◦ No integrity maintenance of existing on-disk data required (SSTables are immutable)
 High Availability
 Linear Scalability
 Fault Tolerant
Statistics
Stats from Netflix
 Linear scalability
Stats from Netflix
Some users
Thank you
Editor's Notes

  • #25 Run repair tool
  • #34 Must run regular node repair on every node in the cluster (by default, every 10 days)
  • #35 Not maximum compression. Very high speeds and reasonable compression
  • #37 Cassandra updates the bytes and rewrites the entire sector back out instead of modifying the data on disk