Highly Available: The Cassandra Distribution Model
Sam Overton

  Cassandra Europe 2012


Cassandra is:
● built for scalability

● built to tolerate failure




 In this talk:
● Cassandra distribution overview

● Partitioning and placement

● Replication

● Consistency




Overview

● High availability
● Partition tolerant

● Tunable consistency

● Scalable

● Replication

● No single point of failure




Partitioning and placement

Should...
● Assign data to hosts

● Have no single point of failure (SPOF) for routing clients to data

● Balance load

● Allow scaling without moving too much data




Consistent Hashing

[Figure: (k1, v1), (k2, v2) and (k3, v3) placed on the ring]

● The partitioner maps each key to a token on the ring
● Hosts' tokens determine the placement of keys and the proportion of data assigned to each host
● Each row is stored on one host
● Wide rows can cause hot-spotting!
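
The placement rule in the bullets above can be sketched as follows. This is a minimal illustration, not Cassandra's implementation: the MD5-derived token, the size of the token space, the host names and the evenly spaced tokens are all assumptions made for the example. A key is assigned to the first host whose token is at or after the key's token, wrapping around the end of the ring.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # size of the token space (assumed for this sketch)


def token_for(key: str) -> int:
    """Partitioner: map a row key to a token on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    def __init__(self, host_tokens):
        # host_tokens maps a (hypothetical) host name to its assigned token.
        self._entries = sorted((token, host) for host, token in host_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def host_for(self, key: str) -> str:
        # First host whose token is >= the key's token, wrapping past the end.
        index = bisect.bisect_left(self._tokens, token_for(key))
        return self._entries[index % len(self._entries)][1]


# Four hosts with evenly spaced tokens, then a fifth host added to the ring.
hosts = {f"host{i}": i * RING_SIZE // 4 for i in range(4)}
old_ring = Ring(hosts)
new_ring = Ring({**hosts, "host4": RING_SIZE // 8})

keys = [f"k{i}" for i in range(1000)]
before = {key: old_ring.host_for(key) for key in keys}
after = {key: new_ring.host_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed host")
```

Only keys in the range taken over by the new host change owner; every other key stays where it was, which is what lets the ring grow without moving too much data.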




So how does it scale?

Bootstrapping a new node: a range is transferred from the old host to the new host.

Decommission is the reverse process.

● Tokens can be assigned manually, automatically or randomly
● Every node has full knowledge of placement

● Client connects to any node, max 1 hop to data

● Node status is gossiped




Partitioners

● A partitioner converts a row key (from client data) into a token on the ring
● RandomPartitioner

● Order Preserving Partitioner




Random Partitioner
● token = hash(key)

● good load balancing

● no range queries across row keys
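
A rough sketch of why hashing balances load but rules out range queries over row keys. The MD5-based token function, the token space size and the four evenly spaced host tokens are assumptions for illustration only.

```python
import bisect
import hashlib
from collections import Counter

TOKEN_SPACE = 2 ** 127  # assumed size of the token space


def random_partitioner_token(key: bytes) -> int:
    """token = hash(key), here an MD5 digest reduced to the token space."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % TOKEN_SPACE


# Four hypothetical hosts (0..3) with evenly spaced tokens.
host_tokens = [i * TOKEN_SPACE // 4 for i in range(4)]


def owner(token: int) -> int:
    """Index of the first host whose token is >= the given token (with wrap)."""
    return bisect.bisect_left(host_tokens, token) % len(host_tokens)


# Sequential, highly similar keys still spread evenly over the hosts.
keys = [f"user{i:05d}".encode() for i in range(10_000)]
tokens = [random_partitioner_token(key) for key in keys]
load = Counter(owner(token) for token in tokens)
print(dict(load))

# Token order does not follow key order, so a scan over a token range
# cannot answer "all keys between user00010 and user00020".
tokens_follow_key_order = tokens == sorted(tokens)
print(tokens_follow_key_order)
```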




Order Preserving Partitioner
● token = key

● requires manual load balancing

● careful selection of tokens around the ring

● allows range queries across row keys
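
By contrast, a sketch of order-preserving placement, where the token is the key itself. The hand-picked string tokens and the sample keys are hypothetical; keys are assumed to sort below the last token so that no wrap-around is needed.

```python
import bisect

# Hypothetical hand-picked tokens for four hosts (0..3), in key order.
host_tokens = ["g", "n", "t", "~"]


def owner(key: str) -> int:
    """token = key: the owner is the first host whose token is >= the key."""
    return bisect.bisect_left(host_tokens, key) % len(host_tokens)


stored_keys = ["apple", "banana", "cherry", "kiwi", "mango", "peach", "plum"]


def range_query(start: str, end: str):
    """Keys in [start, end] together with the host that owns each one."""
    return [(key, owner(key)) for key in sorted(stored_keys) if start <= key <= end]


# A key range maps to a contiguous part of the ring: one host answers this.
fruit_a_to_d = range_query("a", "d")
print(fruit_a_to_d)

# The flip side: keys sharing a prefix all land on the same host, so tokens
# must be chosen carefully to balance load.
user_key_owners = {owner(f"user{i}") for i in range(100)}
print(user_key_owners)
```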




Partitioners

● Get it right the first time!
● Design your data model for RandomPartitioner (RP)

● Custom partitioners are possible if necessary




Replication

● For availability
● For redundancy

● Can increase read bandwidth




● Replication Factor (RF) is the number of copies of the data
● Defined per keyspace
● Can be changed (e.g. if data becomes more or less valuable)
● Determines how many failures can be tolerated




Replication Strategy

● Determines how replicas are assigned to hosts
● Defined per keyspace (like RF)

● SimpleStrategy

● NetworkTopologyStrategy

● Custom strategies can be written




Replication Strategy: Simple Strategy

[Figure: ring showing (k1, v1) and (k2, v2), e.g. RF=3]
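
A minimal sketch of SimpleStrategy-style placement, assuming the usual rule that the first replica goes to the host owning the key's token and the remaining RF - 1 replicas go to the next hosts clockwise around the ring, ignoring racks and datacentres. The small integer tokens and single-letter host names are hypothetical.

```python
import bisect


def simple_strategy_replicas(ring, key_token, rf):
    """Return the hosts holding a key's replicas.

    ring is a list of (token, host) pairs sorted by token.
    """
    if rf > len(ring):
        raise ValueError("replication factor exceeds the number of hosts")
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, key_token)
    # Walk clockwise from the owning host, wrapping around the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


ring = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]
replicas_k1 = simple_strategy_replicas(ring, key_token=10, rf=3)  # B, C, D
replicas_k2 = simple_strategy_replicas(ring, key_token=60, rf=3)  # D, A, B
print(replicas_k1, replicas_k2)
```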

Replication Strategy: Network Topology Strategy

● Multi-datacentre support

[Figure: two datacentres, DC1 and DC2]

Snitches

● Enable routing of requests according to node proximity
● Used by the replication strategy to determine rack and DC membership
● Custom snitches can be written




SimpleSnitch

● Every host is in the same rack & DC with equal proximity

RackInferringSnitch

● Infers the rack & DC from the IP address of the host
● E.g. 123.8.2.100: second octet (8) = DC, third octet (2) = rack, fourth octet (100) = host
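
The octet convention above can be sketched in a few lines. The function names and the three-level proximity score are made up for this illustration and are not the snitch's actual API.

```python
import ipaddress
from typing import NamedTuple


class Location(NamedTuple):
    dc: str
    rack: str
    host: str


def infer_location(ip: str) -> Location:
    """Read DC, rack and host from the 2nd, 3rd and 4th octets of an IPv4 address."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return Location(dc=octets[1], rack=octets[2], host=octets[3])


def proximity(local_ip: str, other_ip: str) -> int:
    """0 = same rack, 1 = same DC but different rack, 2 = different DC."""
    local, other = infer_location(local_ip), infer_location(other_ip)
    if local.dc != other.dc:
        return 2
    if local.rack != other.rack:
        return 1
    return 0


location = infer_location("123.8.2.100")
candidates = ["123.9.1.5", "123.8.3.7", "123.8.2.101"]
nearest_first = sorted(candidates, key=lambda ip: proximity("123.8.2.100", ip))
print(location, nearest_first)
```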
EC2Snitch

● DC = EC2 region
● Rack = EC2 availability zone




PropertyFileSnitch

● Rack and DC membership are read from a configuration file

DynamicSnitch

● Wraps each of the other snitches
● Records latency stats from read operations

● Avoids routing to slow hosts

● Configurable update intervals




Consistency

● Replication and failures/partitions cause inconsistency
● Old versions of data can be returned




 Timestamps:
● Chosen by the client

● Can be used to avoid read-modify-write
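
A minimal sketch of how client-chosen timestamps let replicas converge without a read-modify-write cycle: each write carries its own timestamp and the newest one wins, whatever order the writes arrive in. The `Column` type and the tie-break on value are assumptions of this sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    value: str
    timestamp: int  # chosen by the client, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Keep the column with the higher timestamp.

    Ties are broken by value here so that the result does not depend on
    argument order (an assumption of this sketch).
    """
    return max(a, b, key=lambda column: (column.timestamp, column.value))


def apply_write(replica: dict, key: str, column: Column) -> None:
    """Blind write: no read by the client, the replica merges by timestamp."""
    current = replica.get(key)
    replica[key] = column if current is None else reconcile(current, column)


older = Column("v1", timestamp=100)
newer = Column("v2", timestamp=200)

replica_a, replica_b = {}, {}
apply_write(replica_a, "k1", older)
apply_write(replica_a, "k1", newer)
apply_write(replica_b, "k1", newer)  # same writes, delivered in the other order
apply_write(replica_b, "k1", older)
print(replica_a == replica_b, replica_a["k1"].value)
```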




Consistency

● Cassandra allows a trade-off between partition-tolerance and consistency
● For strong consistency: R + W > N (R = replicas read, W = replicas written, N = number of replicas)
● E.g. with 5 replicas (RF = N = 5): write to 3, read from 3

[Figure: five replicas at version 1; a write brings three of them to version 2; a read follows]
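
The R + W > N rule can be checked by brute force for small clusters: if every possible set of W written replicas shares at least one member with every possible set of R read replicas, a read always sees the latest write. The function name below is made up for this sketch.

```python
from itertools import combinations


def always_overlaps(n: int, w: int, r: int) -> bool:
    """True if every W-subset of n replicas intersects every R-subset."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


strong = always_overlaps(n=5, w=3, r=3)  # 3 + 3 > 5
weak = always_overlaps(n=5, w=2, r=3)    # 2 + 3 = 5, not > 5
print(strong, weak)
```

With W = 3 and R = 3 out of 5, any read set must include at least one of the three replicas that took the write, because only two replicas are left outside the write set.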
Consistency Level

● ANY (only for writes)
● ONE, TWO, THREE
● QUORUM (N/2 + 1, with N/2 rounded down)
● LOCAL_QUORUM
● ALL

● Relax strong consistency for partition tolerance
● To tolerate 1 node failure with strong consistency, use RF=3 with CL=QUORUM
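
A short worked calculation of the quorum size and of how many replica failures a quorum read or write can survive. The helper names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Quorum size for a given replication factor: N/2 + 1 with integer division."""
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Replicas that can be down while a QUORUM operation still succeeds."""
    return rf - quorum(rf)


for rf in range(1, 6):
    print(f"RF={rf}: QUORUM={quorum(rf)}, tolerates {tolerated_failures(rf)} failure(s)")

# Reading and writing at QUORUM always satisfies R + W > N.
quorum_is_strong = all(2 * quorum(rf) > rf for rf in range(1, 100))
```

RF=3 is the smallest replication factor for which QUORUM operations survive the loss of one replica; with RF=2 the quorum is 2, so a single failure blocks them.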
Increasing Consistency

● Read repair
● Hinted hand-off

● Anti-entropy repair




Read Repair

[Figure: read repair, illustrated over four slides]

Hinted Hand-off

[Figure sequence, e.g. RF=2: both replicas hold (k1, v1). A write of (k1, v2) arrives and a copy of (k1, v2) is held at a third location. One replica is updated to (k1, v2), then the other.]

● Hinted writes do not count towards the chosen consistency level
● … except with CL=ANY, which succeeds even if all replicas are down
● Don't rely on hints: hints cannot be read!
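
The rules above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not Cassandra's code: the `Node` class, the `write` and `replay_hints` functions and the use of `0` to stand for CL=ANY are all invented for the example, and timestamps are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value, what reads are served from
        self.hints = []  # (target node, key, value); never served to reads


ANY = 0  # stands in for CL=ANY in this sketch


def write(coordinator, replicas, key, value, required_acks):
    """Write to live replicas and keep a hint for each replica that is down."""
    live = [node for node in replicas if node.up]
    if required_acks != ANY and len(live) < required_acks:
        return False  # hints do not count towards the consistency level
    for node in replicas:
        if node.up:
            node.data[key] = value
        else:
            coordinator.hints.append((node, key, value))
    return True  # with ANY, a stored hint alone is enough


def replay_hints(coordinator):
    """Deliver hints whose target is back up; keep the rest for later."""
    pending = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            pending.append((target, key, value))
    coordinator.hints = pending


a, b, c = Node("A"), Node("B"), Node("C")  # A and B are the replicas (RF=2)
for replica in (a, b):
    replica.data["k1"] = "v1"

b.up = False
ok_one = write(c, [a, b], "k1", "v2", required_acks=1)
state_while_down = (a.data["k1"], b.data["k1"], len(c.hints))

b.up = True
replay_hints(c)
state_after_replay = (a.data["k1"], b.data["k1"], len(c.hints))

a.up = b.up = False
ok_one_all_down = write(c, [a, b], "k1", "v3", required_acks=1)
ok_any_all_down = write(c, [a, b], "k1", "v3", required_acks=ANY)
```

While both replicas are down, the CL=ANY write is accepted but exists only as hints on the coordinator, so no read can return it until a replica comes back and the hints are replayed.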




Anti-entropy repair

● Manual maintenance process
● Compares all data stored on a host with the replicas
● Differences are streamed to restore consistency
● Must be run at least every 10 days (the default gc_grace_seconds) to ensure tombstones are replicated
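
The compare-then-stream idea can be sketched as below. Cassandra compares replicas using Merkle trees; this simplified sketch swaps in a flat list of per-bucket hashes instead, and the function names, the bucket count and the `(value, timestamp)` layout (with `None` standing for a tombstone) are assumptions of the example.

```python
import hashlib


def bucket_of(key: str, buckets: int) -> int:
    """Assign a key to one of a fixed number of buckets."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big") % buckets


def bucket_hashes(data: dict, buckets: int = 8) -> list:
    """One digest per bucket over its keys, values and timestamps."""
    hashers = [hashlib.sha256() for _ in range(buckets)]
    for key in sorted(data):
        value, timestamp = data[key]
        hashers[bucket_of(key, buckets)].update(f"{key}={value}@{timestamp};".encode())
    return [hasher.hexdigest() for hasher in hashers]


def repair(local: dict, remote: dict, buckets: int = 8) -> set:
    """Exchange only the buckets whose hashes differ; newest timestamp wins."""
    local_hashes = bucket_hashes(local, buckets)
    remote_hashes = bucket_hashes(remote, buckets)
    mismatched = {i for i in range(buckets) if local_hashes[i] != remote_hashes[i]}
    for key in set(local) | set(remote):
        if bucket_of(key, buckets) in mismatched:
            candidates = [replica[key] for replica in (local, remote) if key in replica]
            newest = max(candidates, key=lambda column: column[1])
            local[key] = remote[key] = newest
    return mismatched


# value None stands for a tombstone (a deletion marker) in this sketch.
local = {"k1": ("v2", 200), "k2": ("x", 50)}
remote = {"k1": ("v1", 100), "k2": ("x", 50), "k3": (None, 300)}

first_pass = repair(local, remote)
second_pass = repair(local, remote)
print(local == remote, first_pass, second_pass)
```

The tombstone for k3 reaches the replica that missed the delete, which is the case the last bullet is concerned with.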



fin.
