Cassandra EU 2012 - Highly Available: The Cassandra Distribution Model by Sam Overton

Highly Available: The Cassandra Distribution Model
Sam Overton

Cassandra Europe 2012


Cassandra is:
● built for scalability

● built to tolerate failure

In this talk:
● Cassandra distribution overview

● Partitioning and placement

● Replication

● Consistency



Overview

● High availability
● Partition tolerant

● Tunable consistency

● Scalable

● Replication

● No single point of failure



Partitioning and placement

Should...
● Assign data to hosts

● Have no single point of failure (SPOF) for routing clients to data

● Balance load

● Allow scaling without moving too much data



Consistent Hashing



Consistent Hashing

(k2, v2)

(k1, v1)

(k3, v3)



Consistent Hashing

● partitioner maps key to ring token
● hosts' tokens determine placement of keys

● and proportion of data assigned to each host

● each row is stored on one host

● wide rows can cause hot-spotting!
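The placement rule above can be sketched as a sorted list of host tokens searched with a binary search. This is a minimal illustration, not Cassandra's code: the host names, the toy 0..99 token space and the `owner` helper are assumptions made for the example. Each host is taken to own the range ending at its token, so a key's token goes to the first host token at or after it, wrapping around the ring.

```python
from bisect import bisect_left

# Hypothetical ring: token -> host, in a toy token space of 0..99.
RING = {10: "host-a", 40: "host-b", 75: "host-c"}
TOKENS = sorted(RING)

def owner(token):
    """Return the host whose token is the first one >= `token` (wrapping)."""
    index = bisect_left(TOKENS, token) % len(TOKENS)
    return RING[TOKENS[index]]

print(owner(5))   # host-a
print(owner(40))  # host-b
print(owner(90))  # host-a (wraps around the ring)
```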

So how does it scale?



Consistent Hashing

Bootstrapping a
new node



Consistent Hashing

Range is
transferred from old
host to new host
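To see why bootstrapping moves only one range, the sketch below compares token ownership before and after a new host joins. Host names, tokens and the `owner` helper are hypothetical, and the token space is a toy 0..99.

```python
from bisect import bisect_left

def owner(ring, token):
    """Host owning `token`: first host token >= token, wrapping around."""
    tokens = sorted(ring)
    return ring[tokens[bisect_left(tokens, token) % len(tokens)]]

before = {25: "host-a", 50: "host-b", 75: "host-c", 99: "host-d"}
after = dict(before)
after[60] = "host-new"  # new node bootstraps with token 60

moved = [t for t in range(100) if owner(before, t) != owner(after, t)]
# Only tokens 51..60 change owner, and all of them come from host-c.
print(moved[0], moved[-1], {owner(before, t) for t in moved})
```

Every other host keeps exactly the data it had, which is the property that lets the cluster scale without reshuffling everything.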



Consistent Hashing

Decommission is
the reverse process



Consistent Hashing

● Tokens can be assigned manually, automatically
or randomly
● Every node has full knowledge of placement

● Client connects to any node, max 1 hop to data

● Node status is gossiped



Partitioners

● A partitioner converts a row key (from client data) into a
token on the ring
● RandomPartitioner

● Order Preserving Partitioner



Partitioners

Random Partitioner
● token = hash(key)

● good load balancing

● no range queries across row keys
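As a rough illustration of `token = hash(key)`, the sketch below derives a token from an MD5 digest. RandomPartitioner is MD5-based, but the exact conversion used here (digest read as an unsigned integer, reduced modulo 2**127) is an assumption for the example and is not meant to reproduce Cassandra's real token values.

```python
import hashlib

TOKEN_SPACE = 2 ** 127  # assumed size of the token space for this sketch

def random_token(key: bytes) -> int:
    """Map a row key to a pseudo-random position on the ring."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE

# Adjacent keys land far apart, which balances load but loses key order.
for key in (b"user:1", b"user:2", b"user:3"):
    print(key, random_token(key))
```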



Partitioners

Order Preserving Partitioner
● token = key

● requires manual load balancing

● careful selection of tokens around the ring

● allows range queries across row keys
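With `token = key`, keys that sort together sit together on the ring, so a range of row keys is one contiguous slice. The sketch below models that with a sorted list; the key names and the `key_range` helper are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Row keys kept in token order; with an order-preserving partitioner
# the token is the key itself, so token order equals key order.
keys = sorted(["apple", "banana", "cherry", "damson", "elder"])

def key_range(start, end):
    """Return all row keys with start <= key <= end as one contiguous slice."""
    return keys[bisect_left(keys, start):bisect_right(keys, end)]

print(key_range("b", "d"))  # ['banana', 'cherry']
```

The same contiguity is why load balancing becomes manual: if most keys start with the same prefix, they all fall on the same part of the ring.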



Partitioners

● Get it right first time!
● Design data model for RP

● Custom partitioners are possible if necessary



Replication

● For availability
● For redundancy

● Can increase read bandwidth



Replication

● Replication Factor (RF) is number of copies of
data
● Defined per-keyspace

● Can be changed (eg. if data becomes more/less
valuable)
● Determines how many failures can be tolerated



Replication Strategy

● Determines which hosts store the replicas of
each row
● Defined per keyspace (like RF)

● SimpleStrategy

● NetworkTopologyStrategy

● Custom strategies can be written



Replication Strategy : Simple Strategy

(k1, v1)

eg. RF=3

(k2, v2)
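SimpleStrategy places the first replica on the host that owns the key's token and the remaining replicas on the next hosts clockwise around the ring. A minimal sketch with RF=3, using a hypothetical five-host ring and a made-up helper name:

```python
from bisect import bisect_left

RING = {0: "h1", 20: "h2", 40: "h3", 60: "h4", 80: "h5"}  # token -> host
TOKENS = sorted(RING)

def simple_replicas(token, rf=3):
    """Owner of `token` plus the next rf-1 hosts clockwise (rf <= ring size)."""
    start = bisect_left(TOKENS, token)
    return [RING[TOKENS[(start + i) % len(TOKENS)]] for i in range(rf)]

print(simple_replicas(35))  # ['h3', 'h4', 'h5']
print(simple_replicas(70))  # ['h5', 'h1', 'h2']
```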



Replication Strategy : Network Topology Strategy



Replication Strategy : Network Topology Strategy
Multi-datacentre support

DC1 DC2
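A simplified sketch of multi-datacentre placement: walk the ring clockwise from the key's token and take hosts until each datacentre has its configured number of replicas. The real strategy also spreads replicas across racks, which is omitted here; host names, tokens and the per-DC counts are invented for the example.

```python
from bisect import bisect_left

# token -> (host, datacentre); hosts alternate between two DCs.
RING = {0: ("a1", "DC1"), 15: ("b1", "DC2"), 30: ("a2", "DC1"),
        45: ("b2", "DC2"), 60: ("a3", "DC1"), 75: ("b3", "DC2")}
TOKENS = sorted(RING)

def nts_replicas(token, replicas_per_dc):
    """Walk clockwise, filling each DC's quota (rack awareness omitted)."""
    remaining = dict(replicas_per_dc)
    chosen = []
    start = bisect_left(TOKENS, token)
    for i in range(len(TOKENS)):
        host, dc = RING[TOKENS[(start + i) % len(TOKENS)]]
        if remaining.get(dc, 0) > 0:
            chosen.append(host)
            remaining[dc] -= 1
    return chosen

print(nts_replicas(20, {"DC1": 2, "DC2": 1}))  # ['a2', 'b2', 'a3']
```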



Snitches

● Enables routing of requests according to node
proximity
● Used by replication strategy to determine rack
and DC membership
● Custom snitches can be written



SimpleSnitch

● Every host is in the same rack & DC with equal
proximity

RackInferringSnitch

● Infers the rack & DC from IP address of host
● Eg. 123.8.2.100: 2nd octet (8) = DC, 3rd octet (2) = rack,
4th octet (100) = host
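RackInferringSnitch takes the DC from the second octet of a host's IP address and the rack from the third. A short sketch of that mapping; the function name is hypothetical:

```python
def infer_location(ip: str):
    """Return (dc, rack) inferred from an IPv4 address string."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    return octets[1], octets[2]

print(infer_location("123.8.2.100"))  # ('8', '2')
```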


Ec2Snitch

● DC = EC2 region
● Rack = EC2 availability zone

PropertyFileSnitch

● Rack and DC membership read from
configuration file



DynamicSnitch

● Wraps each of the other snitches
● Records latency stats from read operations

● Avoids routing to slow hosts

● Configurable update intervals
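The idea of latency-aware routing can be sketched as a wrapper that records read latencies per host and orders replicas by their recent average. This is an illustration only: the class, its method names and the fixed-size sample window are assumptions, not the DynamicSnitch implementation.

```python
from collections import defaultdict, deque

class LatencyRanker:
    """Tracks recent read latencies and ranks hosts fastest-first."""

    def __init__(self, window=100):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, host, latency_ms):
        self.samples[host].append(latency_ms)

    def score(self, host):
        values = self.samples.get(host)
        # Hosts with no samples yet score 0.0 so they still get tried.
        return sum(values) / len(values) if values else 0.0

    def sort_by_proximity(self, hosts):
        return sorted(hosts, key=self.score)

ranker = LatencyRanker()
for host, ms in [("h1", 5), ("h2", 40), ("h3", 2), ("h2", 60)]:
    ranker.record(host, ms)
print(ranker.sort_by_proximity(["h1", "h2", "h3"]))  # ['h3', 'h1', 'h2']
```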



Consistency

● Replication and failures/partitions cause
inconsistency
● Old versions of data can be returned

Timestamps:
● Chosen by the client

● Can be used to avoid read-modify-write
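Because every write carries a client-chosen timestamp, conflicting versions can be reconciled by keeping the one with the highest timestamp, without first reading the old value. A minimal last-write-wins sketch; the tuple layout and function name are assumptions for illustration:

```python
def reconcile(versions):
    """Pick the (value, timestamp) pair with the newest timestamp."""
    return max(versions, key=lambda version: version[1])

# Two replicas hold different versions of the same column.
replica_a = ("v1", 1000)   # older write
replica_b = ("v2", 1005)   # newer write that replica_a missed
print(reconcile([replica_a, replica_b]))  # ('v2', 1005)
```

Since the client picks the timestamp, clients with skewed clocks can make an older write win, so clock synchronisation matters.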



Consistency

● Cassandra allows a trade-off between partition-
tolerance and consistency

For strong consistency:
● R + W > N
● Eg. with 5 replicas (RF = N = 5): write to 3, read from 3
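Why R + W > N gives strong consistency: any set of W replicas that acknowledged the write must share at least one replica with any set of R replicas that answer the read. The check below brute-forces this for N = 5; it is a sketch for intuition rather than anything Cassandra runs.

```python
from itertools import combinations

def always_overlap(n, w, r):
    """True if every write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

print(always_overlap(5, 3, 3))  # True:  3 + 3 > 5
print(always_overlap(5, 2, 3))  # False: 2 + 3 = 5, a read can miss the write
```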


Consistency


● Write: R + W > N (RF = N = 5), write to 3


Consistency


● Read: R + W > N (RF = N = 5), write to 3


Consistency Level

● ANY (only for writes)
● ONE, TWO, THREE

● QUORUM (floor(N/2) + 1)
● LOCAL_QUORUM

● ALL

● Relax strong consistency for partition tolerance
● To tolerate 1 node failure with strong consistency
use RF=3 with CL=QUORUM
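The quorum arithmetic behind that recommendation, with N/2 taken as integer division. The helper names are made up for this sketch.

```python
def quorum(rf):
    """Replicas needed for QUORUM: floor(rf / 2) + 1."""
    return rf // 2 + 1

def tolerated_failures(rf):
    """Replicas that can be down while QUORUM reads and writes still succeed."""
    return rf - quorum(rf)

for rf in (1, 2, 3, 5):
    print(rf, quorum(rf), tolerated_failures(rf))
# RF=3 gives a quorum of 2 and tolerates 1 failure; RF=2 tolerates none.
```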


Increasing Consistency

● Read repair
● Hinted hand-off

● Anti-entropy repair



Read Repair



Hinted Hand-off
(k1, v1)

eg. RF=2

(k1, v1)



Hinted Hand-off
(k1, v1)

eg. RF=2

(k1, v1)

Write (k1, v2)



Hinted Hand-off
(k1, v1)

eg. RF=2

(k1, v1)

Write (k1, v2)

(k1, v2)


Hinted Hand-off
(k1, v2)

eg. RF=2

(k1, v1)

(k1,


Hinted Hand-off
(k1, v2)

eg. RF=2

(k1, v2)

(k1,


Hinted Hand-off

● Hinted writes do not count towards the chosen
consistency level
● … except with CL=ANY which succeeds even if
all replicas are down
● Don't rely on hints: hints cannot be read!
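A toy model of the hinted hand-off sequence: when a replica is down, the coordinator keeps a hint and replays it once the replica returns. All class and method names are hypothetical, and hint storage is simplified to an in-memory list.

```python
class Replica:
    def __init__(self, name):
        self.name, self.up, self.data = name, True, {}

    def write(self, key, value):
        self.data[key] = value

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (target replica, key, value) awaiting delivery

    def write(self, key, value):
        """Write to live replicas; store a hint for each one that is down."""
        acks = 0
        for replica in self.replicas:
            if replica.up:
                replica.write(key, value)
                acks += 1
            else:
                self.hints.append((replica, key, value))
        return acks  # hints are not counted as acknowledgements

    def replay_hints(self):
        """Deliver hints whose target is back up; keep the rest."""
        pending = []
        for replica, key, value in self.hints:
            if replica.up:
                replica.write(key, value)
            else:
                pending.append((replica, key, value))
        self.hints = pending

a, b = Replica("a"), Replica("b")          # RF=2
coordinator = Coordinator([a, b])
coordinator.write("k1", "v1")
b.up = False
print(coordinator.write("k1", "v2"))       # 1: only replica a acknowledged
b.up = True
coordinator.replay_hints()
print(a.data, b.data)                      # both now hold v2
```

Until the hint is replayed, a read served by replica b alone would still return v1, which is why hints are no substitute for an adequate consistency level.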



Anti-entropy repair

● Manual maintenance process
● Compares all data stored on a host with the
replicas
● Differences are streamed to restore consistency

● Must be run at least once every gc_grace_seconds
(10 days by default) to ensure tombstones are replicated
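The compare-and-stream idea can be sketched with per-bucket digests: each replica hashes its rows by bucket, and only buckets whose digests differ need to be exchanged. Cassandra builds Merkle trees for this; the flat bucket scheme, the bucket count and the function names below are simplifying assumptions.

```python
import hashlib

BUCKETS = 4  # assumed number of ranges compared per repair

def bucket_of(key):
    return hashlib.md5(key.encode()).digest()[0] % BUCKETS

def digests(rows):
    """One digest per bucket over rows of the form {key: (value, timestamp)}."""
    hashers = [hashlib.sha256() for _ in range(BUCKETS)]
    for key in sorted(rows):
        hashers[bucket_of(key)].update(repr((key, rows[key])).encode())
    return [h.hexdigest() for h in hashers]

def repair(local, remote):
    """Exchange rows only in buckets whose digests differ; newest timestamp wins."""
    differing = {i for i, (a, b) in enumerate(zip(digests(local), digests(remote)))
                 if a != b}
    streamed = 0
    for key in sorted(set(local) | set(remote)):
        if bucket_of(key) not in differing:
            continue
        newest = max((rows[key] for rows in (local, remote) if key in rows),
                     key=lambda version: version[1])
        for rows in (local, remote):
            if rows.get(key) != newest:
                rows[key] = newest
                streamed += 1
    return streamed

replica_a = {"k1": ("v2", 1005), "k2": ("x", 900)}
replica_b = {"k1": ("v1", 1000), "k2": ("x", 900)}
print(repair(replica_a, replica_b))  # 1
print(replica_a == replica_b)        # True
```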



Cassandra is:
● built for scalability

● built to tolerate failure

In this talk:
● Cassandra distribution overview

● Partitioning and placement

● Replication

● Consistency

fin.