2. I: Fat Clients are Expensive
II: Availability vs. Consistency
III: Strategies for Eventual Consistency
Cassandra: Strategies for Distributed Data Storage
3. I: Fat Clients are Expensive
4. In the Beginning...
Web → Thin Data API → DB
Simple: 1 web server, 1 database
5. Your Data Grows...
Web → Data API → DB (user), DB (item)
Move tables to different DBs.
6. A table grows too large...
Web → Data API → DB (item 0), DB (item 1), DB (item 2), ...
Shard table by PK ranges.
PK ranges: [0, 10k), [10k, 20k), [20k, 30k), ...
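The routing logic the data API now needs can be sketched as follows (a minimal Python sketch; the shard names and range boundaries are illustrative, not from the talk):

```python
# Hypothetical shard map: each item shard owns a half-open PK range.
SHARD_RANGES = [
    (0, 10_000, "db0"),       # item shard 0: PK in [0, 10k)
    (10_000, 20_000, "db1"),  # item shard 1: PK in [10k, 20k)
    (20_000, 30_000, "db2"),  # item shard 2: PK in [20k, 30k)
]

def shard_for(pk: int) -> str:
    """Return the shard holding the given primary key."""
    for lo, hi, shard in SHARD_RANGES:
        if lo <= pk < hi:
            return shard
    raise KeyError(f"no shard covers PK {pk}")
```

This is exactly the data store-specific logic that ends up in the client layer, which is the problem the next slides name.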
8. Are there other trade-offs?
9. II: Availability vs. Consistency
10. Why consistency vs. availability?
CAP Theorem
11. CAP Theorem
You can have at most two of these properties
in a shared-data system:
Consistency
Availability
Partition-Tolerance
12. Problem:
Sharded DB Cluster Favors C over A.
Web → Data API → DB shards
Each shard is a SPOF: no replication.
13. Slightly better with master-slave replication...
Web → Data API
Write: DB shard master (SPOF, bottlenecked)
Read: DB shard slaves (replicated)
14. Availability Arguments
Avoid SPOFs
Distribute Writes to All Nodes in Replica Set
15. Availability
Easy: Write
coord. → write → replica A (value: "x")
replicas B and C are in the same replica set
16. Availability
Harder: Consistency Across Replicas
coord. → replica A (value: "x"), replica B (value: "x"), replica C (value: "x")
17. So, how do we achieve consistency?
18. III: Strategies for Eventual Consistency
22. Hinted Hand-Off
Problem
Write to an Unavailable Node
23. Hinted Hand-Off
Solution
1) “hinted” write to a live node
2) deliver hints when node is reachable
24. Hinted Hand-Off
Step 1: “hinted” write to a live node
part of the replica set is available
coord. → “hinted” write → replica B (nearest live replica)
replica A (target, dead); replica C
25. Hinted Hand-Off
Step 1: “hinted” write to a live node
all replica nodes unreachable
coord. (closest live node) keeps the “hinted” write locally
replicas A, B, C (all dead)
26. Hinted Hand-Off
Step 2: deliver hints when node is reachable
node with hinted writes → deliver → target replica (now available)
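The two hinted hand-off steps above can be sketched with a toy in-memory cluster model (the class and function names are hypothetical, not Cassandra's internals):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}    # key -> value
        self.hints = []   # (target_name, key, value) pending delivery

def hinted_write(coord, replicas, key, value):
    """Step 1: write to live replicas; leave hints for dead ones
    on the nearest live replica (or on the coordinator if none)."""
    live = [n for n in replicas if n.alive]
    dead = [n for n in replicas if not n.alive]
    for n in live:
        n.data[key] = value
    hint_holder = live[0] if live else coord
    for n in dead:
        hint_holder.hints.append((n.name, key, value))

def deliver_hints(holder, cluster):
    """Step 2: deliver hints whose target node is reachable again."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = cluster[target_name]
        if target.alive:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining
```

In the slide's scenario, replica A (the target) is dead, so the hint lands on B; once A comes back, B delivers the hinted write.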
27. How does a node learn when
another node is available?
29. Gossip
Problem
Each node cannot scalably ping every other node.
8 nodes: 8² = 64
100 nodes: 100² = 10,000
30. Gossip
Solution
I: Anti-Entropy Gossip Protocol
II: Phi-Accrual Failure Detector
32. Gossip
Phi-Accrual Failure Detector
Dynamically adjusts its “suspicion” level of another node,
based on inter-arrival times of gossip messages.
33. Read-Related Strategies
I: Read-Repair
II: Anti-Entropy Service
35. Read-Repair
Problem
A Write Has Not Propagated to
All Replicas
36. Read-Repair
Solution
Repair Outdated Replicas
After Read
37. Read-Repair
Example
Quorum Read
Replication Factor: 3
38. Read-Repair
Steps
1) do digest-based read (done if all digests match)
2) otherwise, do full read and repair replicas
39. Read-Repair
Step 1: do digest-based read
one full read; other reads are digests
coord. → replica A (full read: F)
coord. → replica B (digest: D)
coord. → replica C (digest: D)
40. Read-Repair
Step 1: do digest-based read
wait for 2 replies (where one is the full read)
replica A returns F; replica B returns D; replica C has not replied
41. Read-Repair
Step 1: do digest-based read
return value to client (if all digests match)
D == digest(F)
coord. returns value to client
42. Read-Repair
Step 2: do full read and repair replicas
full read from all replicas
coord. → replica A (F), replica B (F), replica C (F)
43. Read-Repair
Step 2: do full read and repair replicas
wait for 2 replies
replica A returns F; replica B returns F; replica C has not replied
44. Read-Repair
Step 2: do full read and repair replicas
calculate newest value from replies
            value  timestamp
replica A:  “x”    t0
replica B:  “y”    t1
reconciled: “y”    t1
45. Read-Repair
Step 2: do full read and repair replicas
return newest value to client
coord.
return
reconciled value
to client
46. Read-Repair
Step 2: do full read and repair replicas
calculate repair mutations for each replica
diff(reconciled value, replica value)
= repair mutation
Repair for Replica A: diff(“y” @ t1, “x” @ t0) = “y” @ t1
Repair for Replica B: diff(“y” @ t1, “y” @ t1) = null
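The reconciliation and repair-mutation calculation from the last two slides can be sketched as follows (a simplified whole-value model with illustrative names; real reconciliation is more granular):

```python
def reconcile(replies):
    """Pick the (value, timestamp) pair with the newest timestamp."""
    return max(replies.values(), key=lambda vt: vt[1])

def repair_mutations(replies):
    """diff(reconciled value, replica value) = repair mutation,
    or None when the replica already holds the newest value."""
    newest = reconcile(replies)
    return {replica: (newest if vt != newest else None)
            for replica, vt in replies.items()}

# the example from the slides: A holds "x" @ t0, B holds "y" @ t1
replies = {"A": ("x", 0), "B": ("y", 1)}
```

Here `reconcile(replies)` yields `("y", 1)`, replica A gets a repair mutation, and replica B gets none.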
47. Read-Repair
Step 2: do full read and repair replicas
send repair mutation to each replica
coord. → repair mutation (R) → replica A
replica B needs no repair
48. What about values that
have not been read?
51. Anti-Entropy Service
Solution
1) detect inconsistency via Merkle Trees
2) repair inconsistent data
52. Anti-Entropy Service
Merkle Tree
a tree where a node’s hash summarizes
the hashes of its children
        A          root node hash: summarizes its children’s hashes
      B   C        node hash: summarizes its children’s hashes
     D E F G       leaf hash: hash of a data block
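A Merkle tree like the one above can be built bottom-up in a few lines (a sketch assuming the number of data blocks is a power of two; the hash choice is illustrative):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(blocks):
    """Build a Merkle tree bottom-up: each level pairs up the hashes
    of the level below. Returns levels, leaves first, root last."""
    level = [h(b) for b in blocks]          # leaf hashes (D E F G)
    levels = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def root(blocks):
    return build_merkle(blocks)[-1][0]
```

Comparing two trees top-down finds inconsistent data cheaply: if the roots match, the replicas (almost certainly) agree; if not, descend to the mismatching leaf to find the inconsistent block.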
53. Anti-Entropy Service
Step 1: detect inconsistency
create Merkle Trees on all replicas
replica A creates its local Merkle Tree and requests
Merkle Tree creation from replicas B and C
54. Anti-Entropy Service
Step 1: detect inconsistency
exchange Merkle Trees between replicas
replicas A, B, and C exchange Merkle Trees with each other
55. Anti-Entropy Service
Step 1: detect inconsistency
compare local and remote Merkle Trees
Replica A’s tree vs. Replica B’s tree:
most node hashes match; leaf F mismatches,
so the data block hashed by F is inconsistent
56. Anti-Entropy Service
Step 2: repair inconsistent data
send repair to remote replica
replica A → replica B:
send repair for the data hashed by node F
Kelvin Kakugawa
infrastructure engineer @ Digg
working on extending Cassandra (can talk about this more at the end of the session)
3 parts of my talk
let’s go through the journey of a typical web developer
so, we can understand why certain properties of Cassandra may be attractive
just a web server and a database; nothing special
so, your data starts growing
what do you do?
move your tables to different DB servers
ok, so, now what happens when one table grows too large?
shard DB cluster
problem:
data access API just got fatter
now, client needs to know which shard to hit for a given read/write
problem:
now, you’re pushing data store-specific logic up into your client layer
not the best abstraction
the problem gets compounded w/ multiple client languages
what do you do?
1) replicate the logic in all languages?
2) write a C library w/ bindings for every language?
[5m]
examples:
consistency:
when you write a value to the cluster
on the next read, will you get the most up-to-date value
availability:
if a subset of nodes goes down
are you still able to write or read a given key
so, let’s think back to the sharded DB example
when you write to a shard, you’ll get the most recent value on the next read
however:
the shard is SPOF
no replication
reads are now replicated
however, writes still have:
SPOF
bottlenecked on 1 server (can’t write to any node in the replica set)
avoid SPOFs:
machines fail
depending on your use case, it may be advantageous to be able to write to multiple nodes in the replica set
if you’re read-bound, then this probably doesn’t matter
but, if you’re write-bound, it’s important
so, how do we achieve availability?
it’s easy to think about writes
pretty straightforward
write to one of the replicas in the replica set
it’s harder to propagate that write to the other nodes in the replica set
non-trivial
[10m]
separate into 2 sections
first situation:
part of replica set is still available
second situation:
all nodes in replica set are down
so, what happens?
let’s first talk about the distinction between coordinator and nodes in the replica set.
basically, a client can talk to any node in the cassandra cluster
and that node will then become the coordinator for that client
making the appropriate calls to other nodes in the cassandra cluster that are part of the replica set for a given key
so, getting back, what happens when all of the replica nodes are down?
in this case, the coordinator node is the closest node, so it’ll write the hint locally
and, naturally, when a node w/ hinted writes learns that the target node is back up
it’ll deliver the hinted writes it has for the target
great for the virality of your product
bad for your network load
gossip protocols (in general):
randomly choose a node to exchange state w/
expectation: updates spread in logarithmic time w/ the # of nodes in the cluster
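that logarithmic spread can be checked with a toy simulation (a sketch, not Cassandra’s gossip implementation): each informed node contacts one random peer per round.

```python
import random

def gossip_rounds(n_nodes: int, seed: int = 0) -> int:
    """Rounds until a single update reaches every node, when each
    informed node gossips to one random peer per round."""
    random.seed(seed)
    informed = {0}          # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        for _ in list(informed):
            informed.add(random.randrange(n_nodes))
        rounds += 1
    return rounds
```

for 100 nodes this converges in a number of rounds on the order of log(100), far fewer than the 10,000 pairwise pings from the earlier slide.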
anti-entropy protocol:
gossip information until it’s made obsolete by newer info
compare: rumor-mongering protocol:
only gossips state for a limited amount of time, long enough that the state change has likely been propagated to all nodes in the cluster
note:
it’s important to note that cassandra uses an anti-entropy protocol, because of the failure detector
failure detector:
acts as an oracle for the node
node consults FD
FD returns a suspicion-level of whether a given node is up/down
FD:
maintains a sliding window of the most recent heart beats from a given node
sliding window used to estimate arrival time of next heartbeat
distribution of past samples used as an approximation for the probabilistic distribution of future heartbeat messages
(cassandra uses an exponential distribution)
as the next heartbeat message takes longer and longer to arrive, the suspicion level of that node being down increases
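a minimal sketch of that idea, assuming (as noted above for Cassandra) an exponential distribution of heartbeat inter-arrival times; the function name and interface are made up for illustration:

```python
import math

def phi(time_since_last: float, intervals: list) -> float:
    """Phi-accrual suspicion level for a node, from the sliding
    window of past heartbeat inter-arrival times.
    phi = -log10(P(next heartbeat arrives even later than now))."""
    mean = sum(intervals) / len(intervals)
    # exponential model: P(T > t) = exp(-t / mean)
    p_later = math.exp(-time_since_last / mean)
    return -math.log10(p_later)
```

the longer the next heartbeat takes relative to the observed mean, the smaller p_later gets and the higher phi climbs; the node acts on phi crossing a threshold rather than on a fixed timeout.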
quorum consistency level:
quorum = majority (here: 2)
requires:
quorum write
so that quorum read will catch at least 1 node w/ the most recent value
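the quorum arithmetic can be stated directly: with replication factor N, a read of R replicas and a write acknowledged by W replicas must overlap on at least one replica whenever R + W > N. a sketch:

```python
def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    """True if a read of r replicas must include at least one replica
    that acknowledged a write to w replicas (out of n total)."""
    return r + w > n

N = 3                # replication factor from the example
quorum = N // 2 + 1  # quorum = majority: 2 of 3
```

quorum reads plus quorum writes give 2 + 2 > 3, so the read always catches at least one node with the most recent value.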
example:
let’s say we receive the merkle trees from two different replicas
if the root node’s hash from both trees match
we can be reasonably sure that both replicas are consistent
each node creates a merkle tree,
then exchanges them w/ the other replicas
from the initiating replica, replica A
we compare the MT from replica B
replica A will send the inconsistent data to replica B
(note: replica B will compare the MT from A and send the same range of keys to A)
implementation detail:
actually creates a repair SSTable for replica B (that only includes the inconsistent keys)
then streams it over to replica B
replica B will drop the streamed SSTable directly onto disk
possible:
talk about the relationship between Memtable and SSTables and how cassandra writes / reads data