Introduction to Cassandra: Replication and Consistency
A short introduction to replication and consistency in the Cassandra distributed database. Delivered April 28th, 2010 at the Seattle Scalability Meetup.
Dynamo: cluster management, replication, fault tolerance
BigTable: sparse, columnar data model; storage architecture
Cassandra
Dynamo-like Features
Symmetric, P2P architecture
No special nodes/SPOFs
Gossip-based cluster management
Distributed hash table for data placement
Pluggable partitioning
Pluggable topology discovery
Pluggable placement strategies
Tunable, eventual consistency
BigTable-like Features
Sparse, “columnar” data model
Optional, 2-level maps called Super Column Families
SSTable disk storage
Append-only commit log
Memtable (buffer and sort)
Immutable SSTable files
Hadoop integration
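Those storage pieces fit together in a simple write path: append to the commit log for durability, buffer and sort in the memtable, and flush to an immutable SSTable file once the memtable is large enough. The sketch below is an illustration only; the class and method names are made up, not Cassandra's internals.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Simplified write path: commit log -> memtable -> immutable SSTable.
public class WritePathSketch {
    private final FileWriter commitLog;                                 // append-only log for durability
    private final TreeMap<String, String> memtable = new TreeMap<>();   // buffers and sorts by key
    private static final int MEMTABLE_THRESHOLD = 1000;                 // flush trigger (illustrative)
    private int sstableGeneration = 0;

    public WritePathSketch() throws IOException {
        this.commitLog = new FileWriter("commitlog.txt", true);
    }

    public void write(String key, String value) throws IOException {
        commitLog.write(key + "\t" + value + "\n");   // 1. append to the commit log
        commitLog.flush();
        memtable.put(key, value);                     // 2. buffer in the sorted memtable
        if (memtable.size() >= MEMTABLE_THRESHOLD) {
            flush();                                  // 3. spill to an immutable SSTable
        }
    }

    private void flush() throws IOException {
        String name = "sstable-" + (sstableGeneration++) + ".txt";
        try (FileWriter sstable = new FileWriter(name)) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {   // keys come out sorted
                sstable.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
        }
        memtable.clear();   // the SSTable on disk is never modified after this point
    }
}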
Consistency Level
How many replicas must respond to declare success?
(Diagram: example with W=2, R=2)
CL Options

WRITE                                 READ
Level   Description                   Level   Description
ZERO    Cross fingers
ANY     1st response (including HH)
ONE     1st response                  ONE     1st response        (WEAK)
QUORUM  N/2 + 1 replicas              QUORUM  N/2 + 1 replicas    (STRONG)
ALL     All replicas                  ALL     All replicas        (STRONG)
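To make the table concrete, here is a small sketch (assumed names, not Cassandra's API) of how a coordinator could translate a consistency level into the number of replica responses it must collect before declaring success. Note that QUORUM is computed from the replication factor N, not from the cluster size.

// Map a consistency level to the number of replica responses required,
// given the replication factor N. Illustrative only.
public class ConsistencyLevelSketch {
    enum Level { ZERO, ANY, ONE, QUORUM, ALL }

    static int requiredResponses(Level level, int replicationFactor) {
        switch (level) {
            case ZERO:   return 0;                            // "cross fingers": fire and forget
            case ANY:    return 1;                            // any response, hinted handoff counts
            case ONE:    return 1;                            // first replica response
            case QUORUM: return replicationFactor / 2 + 1;    // N/2 + 1
            case ALL:    return replicationFactor;            // every replica
            default:     throw new IllegalArgumentException();
        }
    }

    public static void main(String[] args) {
        int n = 3;   // replication factor, regardless of how many nodes are in the cluster
        System.out.println("QUORUM with N=3 needs " + requiredResponses(Level.QUORUM, n) + " responses");
        System.out.println("ALL    with N=3 needs " + requiredResponses(Level.ALL, n) + " responses");
    }
}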
A Side Note on CL
Consistency Level is based on Replication Factor (N), not on the number of nodes in the system.
A Question of Time
(Diagram: a row with several columns, each holding a value and a timestamp)
All columns have a value and a timestamp
Timestamps provided by clients
usec resolution by convention
Latest timestamp wins
Vector clocks may be introduced in 0.7
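Since the latest timestamp wins, reconciling two versions of a column comes down to comparing the client-supplied timestamps. A minimal sketch, with made-up names:

// Last-write-wins reconciliation of two column versions by client-supplied
// timestamp (microseconds by convention). Illustrative, not Cassandra code.
public class ColumnReconcileSketch {
    record Column(String name, String value, long timestampMicros) {}

    static Column reconcile(Column a, Column b) {
        // The version with the higher timestamp wins; ties are broken arbitrarily here.
        return (a.timestampMicros() >= b.timestampMicros()) ? a : b;
    }

    public static void main(String[] args) {
        Column stale = new Column("email", "old@example.com", 1_272_000_000_000_000L);
        Column fresh = new Column("email", "new@example.com", 1_272_000_000_000_001L);
        System.out.println(reconcile(stale, fresh).value());   // prints new@example.com
    }
}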
Read Repair
Query all replicas on every read
Data from one replica
Checksums/timestamps from all others
If there is a mismatch:
  Pull all data and merge
  Write back to out-of-sync replicas
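A sketch of that read-repair flow, under the assumption that one replica returns the full data and the others return cheap digests; the interface and names here are illustrative, not Cassandra internals.

import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Simplified read repair: compare full data from one replica against digests
// from the others, and push a repaired value to any replica that is out of sync.
public class ReadRepairSketch {
    interface Replica {
        String readData(String key);            // full value
        int readDigest(String key);             // cheap checksum of the value
        void write(String key, String value);   // repair write
    }

    static String read(String key, List<Replica> replicas) {
        Replica primary = replicas.get(0);
        String data = primary.readData(key);
        int expectedDigest = Objects.hashCode(data);

        List<Replica> outOfSync = new ArrayList<>();
        for (Replica r : replicas.subList(1, replicas.size())) {
            if (r.readDigest(key) != expectedDigest) {
                outOfSync.add(r);
            }
        }
        // In real life the coordinator pulls full data from every replica, merges by
        // timestamp, and writes the merged row back; here we just push the primary's value.
        for (Replica r : outOfSync) {
            r.write(key, data);
        }
        return data;
    }
}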
Weak vs. Strong
Weak Consistency (reads): perform repair after returning results
Strong Consistency (reads): perform repair before returning results
R+W>N
Please imagine this inequality has huge fangs, dripping with the
blood of innocent, enterprise developers so you can best appreciate
the terror it inspires.
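The reason R + W > N matters: the R replicas you read from and the W replicas that acknowledged the write must overlap in at least one node, so a read is guaranteed to see the latest acknowledged write. A tiny sketch of the check (illustrative names):

// Check whether a read/write consistency combination guarantees that the read
// and write replica sets overlap (strong consistency). Illustrative only.
public class StrongConsistencyCheck {
    static boolean isStronglyConsistent(int r, int w, int n) {
        return r + w > n;   // read set and write set must share at least one replica
    }

    public static void main(String[] args) {
        int n = 3;   // replication factor
        System.out.println(isStronglyConsistent(2, 2, n));   // QUORUM read, QUORUM write: true
        System.out.println(isStronglyConsistent(1, 1, n));   // ONE read, ONE write: false
        System.out.println(isStronglyConsistent(1, 3, n));   // ONE read, ALL write: true
    }
}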
Tokens
A TOKEN is a partitioner-dependent element on the ring
Each NODE has a single, unique TOKEN
Each NODE claims a RANGE of the ring from its TOKEN to the token of the previous node on the ring
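A sketch of how that range ownership resolves a key's token to a node: take the first node whose token is greater than or equal to the key's token, wrapping around to the lowest token at the end of the ring. The class and names are made up for illustration.

import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

// Token ring ownership: each node owns the range from the previous node's
// token (exclusive) up to its own token (inclusive). Illustrative sketch.
public class TokenRingSketch {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();   // token -> node

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    String ownerOf(BigInteger keyToken) {
        // First node whose token is >= the key's token...
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(keyToken);
        // ...wrapping around to the lowest token if we fell off the end of the ring.
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        ring.addNode("node-a", BigInteger.valueOf(0));
        ring.addNode("node-b", BigInteger.valueOf(100));
        ring.addNode("node-c", BigInteger.valueOf(200));
        System.out.println(ring.ownerOf(BigInteger.valueOf(150)));   // node-c
        System.out.println(ring.ownerOf(BigInteger.valueOf(250)));   // wraps to node-a
    }
}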
Partitioning
Map from Key Space to Token
RandomPartitioner
Tokens are integers in the range 0 to 2^127
MD5(Key) -> Token
Good: even key distribution; Bad: inefficient range queries
OrderPreservingPartitioner
Tokens are UTF8 strings in the range ‘’ to ∞
Key -> Token
Good: efficient range queries; Bad: uneven key distribution
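As an illustration of a RandomPartitioner-style key-to-token mapping, the sketch below hashes the key with MD5 and interprets the digest as a non-negative integer in the 0 to 2^127 range; names are assumptions, not Cassandra's code.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// RandomPartitioner-style key -> token mapping via MD5. Illustrative sketch.
public class Md5TokenSketch {
    static BigInteger token(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        // Interpret the 128-bit digest as an integer; the absolute value keeps
        // tokens in the 0..2^127 range described above.
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(token("user:42"));   // evenly distributed, but key order is lost
        System.out.println(token("user:43"));   // adjacent keys land far apart on the ring
    }
}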
Snitching
Map from Nodes to Physical Location
EndpointSnitch: guess at rack and datacenter based on IP address octets.
DatacenterEndpointSnitch: specify IP subnets for racks, grouped per datacenter.
PropertySnitch: specify arbitrary mappings from individual IP addresses to racks and datacenters.
Or write your own!
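As a rough illustration of what a snitch does, here is a toy in the spirit of EndpointSnitch: infer datacenter and rack from the IP address octets. The class and methods are made up for this example and do not follow Cassandra's real snitch interface.

// A toy snitch: guess datacenter and rack from IP octets, e.g. 10.<dc>.<rack>.<host>.
public class OctetSnitchSketch {
    static String datacenterOf(String ip) {
        return "dc-" + ip.split("\\.")[1];     // 2nd octet -> datacenter
    }

    static String rackOf(String ip) {
        return "rack-" + ip.split("\\.")[2];   // 3rd octet -> rack
    }

    public static void main(String[] args) {
        String node = "10.1.2.37";
        System.out.println(datacenterOf(node) + " / " + rackOf(node));   // dc-1 / rack-2
    }
}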
Placement
Map from Token Space to Nodes
The first replica is always placed on the node that claims the range in which the token falls.
Strategies determine where the rest of the replicas are placed.
RackUnaware
Place replicas on the N-1 subsequent nodes around the ring, ignoring topology.
(Diagram: datacenter A and datacenter B, each with rack 1 and rack 2)
RackAware
Place the second replica in another datacenter, and the remaining N-2 replicas on nodes in other racks in the same datacenter.
DatacenterShard
Place M of the N replicas in another datacenter, and the remaining N - (M + 1) replicas on nodes in other racks in the same datacenter.
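To illustrate how a placement strategy works end to end, here is a sketch of the RackUnaware approach: start at the node that owns the range the token falls in, then take the next N-1 nodes walking clockwise around the ring, ignoring racks and datacenters. Names are illustrative only.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// RackUnaware-style placement: first replica on the range owner, the
// remaining N-1 on the subsequent nodes around the ring. Illustrative sketch.
public class RackUnawareSketch {
    static List<String> replicasFor(BigInteger token, TreeMap<BigInteger, String> ring, int n) {
        List<String> replicas = new ArrayList<>();
        // Start at the node owning the range the token falls in (wrap if needed).
        BigInteger start = (ring.ceilingKey(token) != null) ? ring.ceilingKey(token) : ring.firstKey();
        BigInteger current = start;
        while (replicas.size() < n && replicas.size() < ring.size()) {
            replicas.add(ring.get(current));
            // Step to the next node clockwise around the ring, wrapping at the end.
            BigInteger next = ring.higherKey(current);
            current = (next != null) ? next : ring.firstKey();
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = new TreeMap<>(Map.of(
                BigInteger.valueOf(0), "node-a",
                BigInteger.valueOf(100), "node-b",
                BigInteger.valueOf(200), "node-c",
                BigInteger.valueOf(300), "node-d"));
        System.out.println(replicasFor(BigInteger.valueOf(150), ring, 3));   // [node-c, node-d, node-a]
    }
}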
Amazon Dynamo
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Google BigTable
http://labs.google.com/papers/bigtable.html
Facebook Cassandra
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf