Distributed Counters in Cassandra (Cassandra Summit 2010)

Distributed Counters
in Cassandra

Friday, August 13, 2010

I: Goal
II: Design
III: Implementation

I: Goal



Goal

Low Latency,
Highly Available
Counters



II: Design



I: Traditional Counter Design
II: Abstract Strategy
III: Distributed Counter Design



Design

I: Traditional Counter Design



Traditional Counter Design
Atomic Counters

1. single machine
2. one order of execution
3. strongly consistent



Problems

1. SPOF / single master
2. high latency
3. manually sharded



Question

What constraints can we relax?



Design

II: Abstract Strategy



Abstract Strategy
Constraints to Relax

1. one order of execution
2. strong consistency



Abstract Strategy
Relax: One Order of Execution

commutative operation:
- operations must be re-orderable



Abstract Strategy
Relax: Strong Consistency

partitioned work:
- each op must occur once
- unique partition identifier
idempotent repair:
- recognize ops from other partitions



Design

III: Distributed Counter Design



Distributed Counter Design
Requirements

1. commutative operation
2. partitioned work
3. idempotent repair



Commutative Operation

addition:
- commutative operation
- sum ops performed by all replicas
- a + b = b + a



Partitioned Work

each op assigned to a replica:
- every replica sums all of its ops
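A minimal sketch of why partitioned work relies on commutativity: each op is handed to exactly one replica, in an arbitrary order, and the counter value is the sum of the per-replica totals. The replica names, the deltas, and the helper `partition_and_sum` are hypothetical, chosen only for illustration.

```python
import random

ops = [1, 4, 2, 7, 3, 5]          # illustrative deltas, not from the talk
replicas = ["A", "B", "C"]        # illustrative replica ids

def partition_and_sum(ops, replicas, seed):
    """Assign each op to exactly one replica, in a shuffled order,
    and return the per-replica partial sums."""
    rng = random.Random(seed)
    partial = {replica: 0 for replica in replicas}
    shuffled = ops[:]
    rng.shuffle(shuffled)
    for delta in shuffled:
        partial[rng.choice(replicas)] += delta
    return partial

# The total is the same for every assignment and every ordering.
totals = [sum(partition_and_sum(ops, replicas, seed).values())
          for seed in range(5)]
```

Because addition is commutative and associative, the total does not depend on which replica handled which op, only on each op being counted once.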



Idempotent Repair

save counts from remote replicas:
- keep highest count seen
prevent multiple execution:
- do not transfer the target replica’s count
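The repair rule above can be sketched as a merge over per-replica counts. This is an illustration under two assumptions that are not stated on the slide: counts only grow, so the highest count seen is the most recent one, and counts are held in a plain dict keyed by replica id. The function name `apply_repair` is hypothetical.

```python
def apply_repair(known_counts, incoming_counts, own_id):
    """Merge counts received from another replica.

    For every remote replica keep the highest count seen. The receiving
    replica's own count is ignored, so its ops are never applied twice.
    """
    merged = dict(known_counts)
    for replica_id, count in incoming_counts.items():
        if replica_id == own_id:
            continue  # our own count is authoritative locally
        if replica_id not in merged or count > merged[replica_id]:
            merged[replica_id] = count
    return merged

known = {"A": 5, "B": 2}
incoming = {"A": 9, "B": 4, "C": 1}
once = apply_repair(known, incoming, own_id="A")
twice = apply_repair(once, incoming, own_id="A")
```

Applying the same repair a second time leaves the counts unchanged, which is what makes the repair idempotent.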



III: Implementation



I: Data Structure
II: Single Node
III: Eventual Consistency



I: Data Structure



Data Structure
Requirements

local counts:
- incrementally update
remote counts:
- independently track partitions



Data Structure
Context Format

list of (replica id, count) tuples:
[(replica A, count), (replica B, count), ...]



Data Structure
Context Mutations

local write:
sum the local count with the write's delta
note: memtable
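A minimal sketch of the local-write mutation on a context held as a list of (replica id, count) tuples. The function name `apply_local_write` and the string replica ids are assumptions for illustration; the talk does not show the actual encoding.

```python
def apply_local_write(context, local_id, delta):
    """Return a new context with delta added to the local replica's count.

    Counts belonging to other replicas are copied through unchanged.
    """
    updated = []
    found = False
    for replica_id, count in context:
        if replica_id == local_id:
            updated.append((replica_id, count + delta))
            found = True
        else:
            updated.append((replica_id, count))
    if not found:
        # First write seen by this replica for this column.
        updated.append((local_id, delta))
    return updated

after_write = apply_local_write([("A", 5), ("B", 2)], "A", 3)
first_write = apply_local_write([], "A", 3)
```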



Data Structure
Context Mutations

remote repair:
for each replica,
keep highest count seen
(local or from repair)



II: Single Node



Single Node
Write Path

client
1. construct column
- value: delta (big-endian long)
- clock: empty
2. thrift: insert / batch_mutate
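As an illustration of the column value the client builds, a delta can be packed as an 8-byte big-endian signed long with the standard `struct` module. The dict layout and the column name below are hypothetical stand-ins, not the Thrift structures themselves.

```python
import struct

def encode_delta(delta):
    """Pack a counter delta as an 8-byte big-endian signed long."""
    return struct.pack(">q", delta)

def decode_delta(raw):
    """Unpack an 8-byte big-endian signed long back into an int."""
    return struct.unpack(">q", raw)[0]

# Hypothetical stand-in for the column sent by the client:
# the value carries the delta and the clock is left empty.
column = {"name": b"page_views", "value": encode_delta(3), "clock": b""}
```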



Single Node
Write Path

coordinator
1. choose partition
- choose target replica
- requirement: ConsistencyLevel.ONE
2. construct clock
- context format: [(target replica id, count delta)]



Single Node
Write Path

target replica
insert:
1. memtable does not contain column
2. insert column into memtable



Single Node
Write Path
target replica
update:
1. memtable contains column
2. retrieve existing column
3. create new column
- context: sum local count w/ delta from write
4. replace column in ConcurrentSkipListMap
5. if failed to replace column, go to step 2.
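The retry loop in steps 2 to 5 is an optimistic compare-and-replace. The sketch below mimics it in Python: `CasMap` is a hypothetical, lock-based stand-in for the atomic replace that a concurrent map provides, and the column is reduced to a bare local count for brevity.

```python
import threading

class CasMap:
    """Tiny stand-in for a concurrent map with atomic insert and replace."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put_if_absent(self, key, value):
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def replace(self, key, expected, new):
        """Swap in new only if the current value still equals expected."""
        with self._lock:
            if key in self._data and self._data[key] == expected:
                self._data[key] = new
                return True
            return False

def add_delta(memtable, column_name, delta):
    """Insert the column, or sum its count with delta, retrying on conflict."""
    while True:
        existing = memtable.get(column_name)       # retrieve existing column
        if existing is None:
            if memtable.put_if_absent(column_name, delta):
                return                             # insert path
            continue                               # lost the race, retry
        if memtable.replace(column_name, existing, existing + delta):
            return                                 # update path
        # replace failed: another writer got in first, go back and re-read

memtable = CasMap()

def worker():
    for _ in range(500):
        add_delta(memtable, "hits", 1)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_count = memtable.get("hits")
```

No writer holds a lock across the read-modify-write; a writer that loses the race simply re-reads and tries again, so no increment is lost.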



Single Node
Write Path
Interesting Note:
memtables (MTs) are serialized to SSTables (SSTs) as-is
- each SST encapsulates only the updates made while it was an MT
- the local count total must be aggregated across the MT and all SSTs



Single Node
Read Path
target replica
read:
1. construct collating iterator over:
- frozen snapshot of MT
- all relevant SSTs
2. resolve column
- local counts: sum
- remote counts: keep max
3. construct value
- sum local and remote counts (big-endian long)
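The read-path resolution can be sketched as follows, assuming each source (the memtable snapshot or an SSTable) yields one context fragment for the column as a list of (replica id, count) tuples. The names `resolve` and `counter_value` are hypothetical.

```python
import struct

def resolve(fragments, local_id):
    """Merge the context fragments of one column.

    Local counts are summed, because each fragment holds only the deltas
    written while it was the memtable. Remote counts keep the maximum,
    because each one is already a total reported by that replica.
    """
    resolved = {}
    for context in fragments:
        for replica_id, count in context:
            if replica_id == local_id:
                resolved[replica_id] = resolved.get(replica_id, 0) + count
            elif replica_id not in resolved or count > resolved[replica_id]:
                resolved[replica_id] = count
    return resolved

def counter_value(resolved):
    """Sum local and remote counts and encode as a big-endian long."""
    return struct.pack(">q", sum(resolved.values()))

fragments = [
    [("A", 3), ("B", 10)],   # memtable snapshot
    [("A", 4), ("B", 7)],    # older SSTable
    [("A", 1), ("C", 2)],    # oldest SSTable
]
resolved = resolve(fragments, local_id="A")
value = counter_value(resolved)
```

Under the same assumptions, compaction would apply the same per-column resolution and write the resolved contexts back out.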



Single Node
Compaction

replica
compaction:
1. construct collating iterator over all SSTs
2. resolve every column in the CF
- local counts: sum
- remote counts: keep max
3. write out resolved CF



III: Eventual Consistency



Eventual Consistency
Read Repair

coordinator / replica
read repair:
1. calculate resolved (superset) CF
- resolve every column (local: sum, remote: max)
2. return resolved CF to client



Read Repair

coordinator / replica
read repair:
1. calculate repair CF for each replica
- calculate diff CF between resolved and received
- modify columns to remove target replica’s counts
2. send repair CF to each replica
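A rough sketch of building a per-replica repair, simplified to a single column. It assumes each replica's response is a dict of replica id to count in which that replica has already summed its own local count, so the coordinator can resolve by taking the maximum per id. That simplification and the function names are assumptions, not the talk's code.

```python
def resolve_responses(responses):
    """Resolve the superset context: highest count seen per replica id."""
    resolved = {}
    for context in responses.values():
        for replica_id, count in context.items():
            if replica_id not in resolved or count > resolved[replica_id]:
                resolved[replica_id] = count
    return resolved

def build_repair(resolved, received, target_id):
    """Return the counts the target replica is missing.

    The target's own count is stripped: the target sums its local counts
    when resolving, so sending it back would count those ops twice.
    """
    repair = {}
    for replica_id, count in resolved.items():
        if replica_id == target_id:
            continue
        if replica_id not in received or count > received[replica_id]:
            repair[replica_id] = count
    return repair

responses = {
    "A": {"A": 8, "B": 4},
    "B": {"A": 5, "B": 6, "C": 1},
}
resolved = resolve_responses(responses)
repairs = {target: build_repair(resolved, received, target)
           for target, received in responses.items()}
```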



Anti-Entropy Service

sending replica
AES:
1. follow normal AES code path
- calculate repair SST based on shared ranges
- send repair SST



Anti-Entropy Service

receiving replica
AES:
1. post-process streamed SST
- re-build streamed SST
- note: strip out local replica’s counts
2. remove temporary descriptor
3. add to SSTableTracker



Questions?



More Information
Issues:
#580: Vector Clocks
#1072: Distributed Counters

Related Work:
Helland and Campbell, Building on Quicksand, CIDR (2009),
Sections 5 & 6.

My email address:
kakugawa@gmail.com


