2. I: Fat Clients are Expensive
II: Availability vs. Consistency
III: Strategies for Eventual Consistency
Cassandra: Strategies for Distributed Data Storage
3. I: Fat Clients are Expensive
4. In the Beginning...
Web → Thin Data API → DB
Simple: 1 web server, 1 database
5. Your Data Grows...
Web → Data API → DB (user), DB (item)
Move tables to different DBs.
6. A table grows too large...
Web → Data API → DB (item 0), DB (item 1), DB (item 2), ...
Shard table by PK ranges.
PK ranges: [0, 10k), [10k, 20k), [20k, 30k), ...
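The routing logic the data API now needs can be sketched as follows (a minimal Python sketch; the shard names and range boundaries are illustrative, not from the talk):

```python
# Hypothetical shard map: each item shard owns a half-open PK range.
SHARD_RANGES = [
    (0, 10_000, "db0"),       # item shard 0: PK in [0, 10k)
    (10_000, 20_000, "db1"),  # item shard 1: PK in [10k, 20k)
    (20_000, 30_000, "db2"),  # item shard 2: PK in [20k, 30k)
]

def shard_for(pk: int) -> str:
    """Return the shard holding the given primary key."""
    for lo, hi, shard in SHARD_RANGES:
        if lo <= pk < hi:
            return shard
    raise KeyError(f"no shard covers PK {pk}")
```

This is exactly the data store-specific logic that ends up in the client layer, which is the problem the next slides name.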
8. Are there other trade-offs?
9. II: Availability vs. Consistency
10. Why consistency vs. availability?
CAP Theorem
11. CAP Theorem
You can have at most two of these properties
in a shared-data system:
Consistency
Availability
Partition-Tolerance
12. Problem:
Sharded DB Cluster Favors C over A.
Web → Data API → DB shards
Each shard is a SPOF: no replication.
13. Slightly better with master-slave replication...
Web → Data API
Write: DB shard master (SPOF, bottlenecked)
Read: DB shard slaves (replicated)
14. Availability Arguments
Avoid SPOFs
Distribute Writes to All Nodes in Replica Set
15. Availability
Easy: Write
coord. → write → replica A (value: "x")
replicas B and C are in the same replica set
16. Availability
Harder: Consistency Across Replicas
coord. → replica A (value: "x"), replica B (value: "x"), replica C (value: "x")
17. So, how do we achieve consistency?
18. III: Strategies for Eventual Consistency
22. Hinted Hand-Off
Problem
Write to an Unavailable Node
23. Hinted Hand-Off
Solution
1) “hinted” write to a live node
2) deliver hints when node is reachable
24. Hinted Hand-Off
Step 1: “hinted” write to a live node
part of the replica set is available
coord. → “hinted” write → replica B (nearest live replica)
replica A (target, dead); replica C
25. Hinted Hand-Off
Step 1: “hinted” write to a live node
all replica nodes unreachable
coord. (closest live node) keeps the “hinted” write locally
replicas A, B, C (all dead)
26. Hinted Hand-Off
Step 2: deliver hints when node is reachable
node with hinted writes → deliver → target replica (now available)
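The two hinted hand-off steps above can be sketched with a toy in-memory cluster model (the class and function names are hypothetical, not Cassandra's internals):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}    # key -> value
        self.hints = []   # (target_name, key, value) pending delivery

def hinted_write(coord, replicas, key, value):
    """Step 1: write to live replicas; leave hints for dead ones
    on the nearest live replica (or on the coordinator if none)."""
    live = [n for n in replicas if n.alive]
    dead = [n for n in replicas if not n.alive]
    for n in live:
        n.data[key] = value
    hint_holder = live[0] if live else coord
    for n in dead:
        hint_holder.hints.append((n.name, key, value))

def deliver_hints(holder, cluster):
    """Step 2: deliver hints whose target node is reachable again."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = cluster[target_name]
        if target.alive:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining
```

In the slide's scenario, replica A (the target) is dead, so the hint lands on B; once A comes back, B delivers the hinted write.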
27. How does a node learn when
another node is available?
29. Gossip
Problem
Each node cannot scalably ping every other node.
8 nodes: 8² = 64
100 nodes: 100² = 10,000
30. Gossip
Solution
I: Anti-Entropy Gossip Protocol
II: Phi-Accrual Failure Detector
32. Gossip
Phi-Accrual Failure Detector
Dynamically adjusts its “suspicion” level of another node,
based on inter-arrival times of gossip messages.
33. Read-Related Strategies
I: Read-Repair
II: Anti-Entropy Service
35. Read-Repair
Problem
A Write Has Not Propagated to
All Replicas
36. Read-Repair
Solution
Repair Outdated Replicas
After Read
37. Read-Repair
Example
Quorum Read
Replication Factor: 3
38. Read-Repair
Steps
1) do digest-based read (done if all digests match)
2) otherwise, do full read and repair replicas
39. Read-Repair
Step 1: do digest-based read
one full read; other reads are digests
coord. → replica A (full read: F)
coord. → replica B (digest: D)
coord. → replica C (digest: D)
40. Read-Repair
Step 1: do digest-based read
wait for 2 replies (where one is the full read)
replica A returns F; replica B returns D; replica C has not replied
41. Read-Repair
Step 1: do digest-based read
return value to client (if all digests match)
D == digest(F)
coord. returns value to client
42. Read-Repair
Step 2: do full read and repair replicas
full read from all replicas
coord. → replica A (F), replica B (F), replica C (F)
43. Read-Repair
Step 2: do full read and repair replicas
wait for 2 replies
replica A returns F; replica B returns F; replica C has not replied
44. Read-Repair
Step 2: do full read and repair replicas
calculate newest value from replies
            value  timestamp
replica A:  “x”    t0
replica B:  “y”    t1
reconciled: “y”    t1
45. Read-Repair
Step 2: do full read and repair replicas
return newest value to client
coord.
return
reconciled value
to client
46. Read-Repair
Step 2: do full read and repair replicas
calculate repair mutations for each replica
diff(reconciled value, replica value)
= repair mutation
Repair for Replica A: diff(“y” @ t1, “x” @ t0) = “y” @ t1
Repair for Replica B: diff(“y” @ t1, “y” @ t1) = null
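The reconciliation and repair-mutation calculation from the last two slides can be sketched as follows (a simplified whole-value model with illustrative names; real reconciliation is more granular):

```python
def reconcile(replies):
    """Pick the (value, timestamp) pair with the newest timestamp."""
    return max(replies.values(), key=lambda vt: vt[1])

def repair_mutations(replies):
    """diff(reconciled value, replica value) = repair mutation,
    or None when the replica already holds the newest value."""
    newest = reconcile(replies)
    return {replica: (newest if vt != newest else None)
            for replica, vt in replies.items()}

# the example from the slides: A holds "x" @ t0, B holds "y" @ t1
replies = {"A": ("x", 0), "B": ("y", 1)}
```

Here `reconcile(replies)` yields `("y", 1)`, replica A gets a repair mutation, and replica B gets none.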
47. Read-Repair
Step 2: do full read and repair replicas
send repair mutation to each replica
coord. → repair mutation (R) → replica A
replica B needs no repair
48. What about values that
have not been read?
51. Anti-Entropy Service
Solution
1) detect inconsistency via Merkle Trees
2) repair inconsistent data
52. Anti-Entropy Service
Merkle Tree
a tree where a node’s hash summarizes
the hashes of its children
        A          root node hash: summarizes its children’s hashes
      B   C        node hash: summarizes its children’s hashes
     D E F G       leaf hash: hash of a data block
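A Merkle tree like the one above can be built bottom-up in a few lines (a sketch assuming the number of data blocks is a power of two; the hash choice is illustrative):

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_merkle(blocks):
    """Build a Merkle tree bottom-up: each level pairs up the hashes
    of the level below. Returns levels, leaves first, root last."""
    level = [h(b) for b in blocks]          # leaf hashes (D E F G)
    levels = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def root(blocks):
    return build_merkle(blocks)[-1][0]
```

Comparing two trees top-down finds inconsistent data cheaply: if the roots match, the replicas (almost certainly) agree; if not, descend to the mismatching leaf to find the inconsistent block.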
53. Anti-Entropy Service
Step 1: detect inconsistency
create Merkle Trees on all replicas
replica A creates its local Merkle Tree and requests
Merkle Tree creation from replicas B and C
54. Anti-Entropy Service
Step 1: detect inconsistency
exchange Merkle Trees between replicas
replicas A, B, and C exchange Merkle Trees with each other
55. Anti-Entropy Service
Step 1: detect inconsistency
compare local and remote Merkle Trees
Replica A’s tree vs. Replica B’s tree:
most node hashes match; leaf F mismatches,
so the data block hashed by F is inconsistent
56. Anti-Entropy Service
Step 2: repair inconsistent data
send repair to remote replica
replica A → replica B:
send repair for the data hashed by node F
Kelvin Kakugawa
infrastructure engineer @ Digg
working on extending Cassandra (can talk about this more at the end of the session)
3 parts of my talk
let’s go through the journey of a typical web developer
so, we can understand why certain properties of Cassandra may be attractive
just a web server and a database; nothing special
so, your data starts growing
what do you do?
move your tables to different DB servers
ok, so, now what happens when one table grows too large?
shard DB cluster
problem:
data access API just got fatter
now, client needs to know which shard to hit for a given read/write
problem:
now, you’re pushing data store-specific logic up into your client layer
not the best abstraction
the problem gets compounded w/ multiple client languages
what do you do?
1) replicate the logic in all languages?
2) write a C library w/ bindings for every language?
[5m]
examples:
consistency:
when you write a value to the cluster
on the next read, will you get the most up-to-date value
availability:
if a subset of nodes goes down
are you still able to write or read a given key
so, let’s think back to the sharded DB example
when you write to a shard, you’ll get the most recent value on the next read
however:
the shard is SPOF
no replication
reads are now replicated
however, writes still have:
SPOF
bottlenecked on 1 server (can’t write to any node in the replica set)
avoid SPOFs:
machines fail
depending on your use case, it may be advantageous to be able to write to multiple nodes in the replica set
if you’re read-bound, then this probably doesn’t matter
but, if you’re write-bound, it’s important
so, how do we achieve availability?
it’s easy to think about writes
pretty straightforward
write to one of the replicas in the replica set
it’s harder to propagate that write to the other nodes in the replica set
non-trivial
[10m]
separate into 2 sections
first situation:
part of replica set is still available
second situation:
all nodes in replica set are down
so, what happens?
let’s first talk about the distinction between coordinator and nodes in the replica set.
basically, a client can talk to any node in the cassandra cluster
and that node will then become the coordinator for that client
making the appropriate calls to other nodes in the cassandra cluster that are part of the replica set for a given key
so, getting back, what happens when all of the replica nodes are down?
in this case, the coordinator node is the closest node, so it’ll write the hint locally
and, naturally, when a node w/ hinted writes learns that the target node is back up
it’ll deliver the hinted writes it has for the target
great for the virality of your product
bad for your network load
gossip protocols (in general):
randomly choose a node to exchange state w/
expectation: updates spread in logarithmic time w/ the # of nodes in the cluster
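that logarithmic spread can be checked with a toy simulation (a sketch, not Cassandra’s gossip implementation): each informed node contacts one random peer per round.

```python
import random

def gossip_rounds(n_nodes: int, seed: int = 0) -> int:
    """Rounds until a single update reaches every node, when each
    informed node gossips to one random peer per round."""
    random.seed(seed)
    informed = {0}          # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        for _ in list(informed):
            informed.add(random.randrange(n_nodes))
        rounds += 1
    return rounds
```

for 100 nodes this converges in a number of rounds on the order of log(100), far fewer than the 10,000 pairwise pings from the earlier slide.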
anti-entropy protocol:
gossip information until it’s made obsolete by newer info
compare: rumor-mongering protocol:
only gossips state for a limited amount of time, long enough that the state change has likely been propagated to all nodes in the cluster
note:
it’s important to note that cassandra uses an anti-entropy protocol, because of the failure detector
failure detector:
acts as an oracle for the node
node consults FD
FD returns a suspicion-level of whether a given node is up/down
FD:
maintains a sliding window of the most recent heart beats from a given node
sliding window used to estimate arrival time of next heartbeat
distribution of past samples used as an approximation for the probabilistic distribution of future heartbeat messages
(cassandra uses an exponential distribution)
as the next heartbeat message takes longer and longer to arrive, the suspicion level of that node being down increases
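a minimal sketch of that idea, assuming (as noted above for Cassandra) an exponential distribution of heartbeat inter-arrival times; the function name and interface are made up for illustration:

```python
import math

def phi(time_since_last: float, intervals: list) -> float:
    """Phi-accrual suspicion level for a node, from the sliding
    window of past heartbeat inter-arrival times.
    phi = -log10(P(next heartbeat arrives even later than now))."""
    mean = sum(intervals) / len(intervals)
    # exponential model: P(T > t) = exp(-t / mean)
    p_later = math.exp(-time_since_last / mean)
    return -math.log10(p_later)
```

the longer the next heartbeat takes relative to the observed mean, the smaller p_later gets and the higher phi climbs; the node acts on phi crossing a threshold rather than on a fixed timeout.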
quorum consistency level:
quorum = majority (here: 2)
requires:
quorum write
so that quorum read will catch at least 1 node w/ the most recent value
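the quorum arithmetic can be stated directly: with replication factor N, a read of R replicas and a write acknowledged by W replicas must overlap on at least one replica whenever R + W > N. a sketch:

```python
def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    """True if a read of r replicas must include at least one replica
    that acknowledged a write to w replicas (out of n total)."""
    return r + w > n

N = 3                # replication factor from the example
quorum = N // 2 + 1  # quorum = majority: 2 of 3
```

quorum reads plus quorum writes give 2 + 2 > 3, so the read always catches at least one node with the most recent value.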
example:
let’s say we receive the merkle trees from two different replicas
if the root node’s hash from both trees match
we can be reasonably sure that both replicas are consistent
each node creates a merkle tree,
then exchanges them w/ the other replicas
from the initiating replica, replica A
we compare the MT from replica B
replica A will send the inconsistent data to replica B
(note: replica B will compare the MT from A and send the same range of keys to A)
implementation detail:
actually creates a repair SSTable for replica B (that only includes the inconsistent keys)
then streams it over to replica B
replica B will drop the streamed SSTable directly onto disk
possible:
talk about the relationship between Memtable and SSTables and how cassandra writes / reads data