durability, durability, durability

Apache Cassandra
Durability, Durability, Durability ...

Matthew F. Dennis // @mdennis
CassandraSF
August 8, 2012

Cassandra Data Model

Keyspaces
Column Families
Rows
Columns (tuples): Name, Value, Timestamp, TTL

A Banking Application
(the canonical example)

credit_account(acct, delta)
get_account_balance(acct)
xfer_funds(from, to, delta)

“Work Backwards From Your Queries” --everyone

The only “query” we really have is
get_account_balance

accounts column family

row key: acctX

column name                 value
all_balls (00:00:00 etc)    $123.45 (current “base” total)
xact_id0                    “details”
xact_id1                    “details”
...                         ...

xact_id0 = hash(“details”), a unique, one-one idempotent id for xact0

“details” = from, to, delta, sessionId, timestamp, amount,
check number, order number, et cetera
as a JSON/ProtoBuf/XML blob
(i.e. everything about a “change” to the acct)
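The idempotent column name can be sketched roughly as follows. The slides only say hash(“details”); the use of SHA-256, the canonical JSON serialization, the field values, and the helper name `xact_id` are all assumptions made for illustration.

```python
import hashlib
import json

def xact_id(details: dict) -> str:
    """Derive a deterministic column name from the full transaction details."""
    # Canonical serialization (sorted keys, fixed separators) so that the same
    # details always produce the same bytes and therefore the same id.
    blob = json.dumps(details, sort_keys=True, separators=(",", ":"))
    # SHA-256 is an assumption; the slides do not name a hash function.
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# Hypothetical details blob; field values are made up for the example.
details = {"from": "acctX", "to": "acctY", "delta": "25.00",
           "sessionId": "s-1", "timestamp": "2012-08-08T10:00:00Z"}

row = {}  # in-memory stand-in for one row of the accounts column family
row[xact_id(details)] = details
row[xact_id(dict(details))] = details  # a retry of the same write
```

Because the column name is derived from the content, a retried or replayed write lands on the same column instead of creating a second delta.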


get_account_balance(acct)

● read the entire row

● apply deltas to “base”
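A minimal sketch of the two steps above, using a plain dict as a stand-in for the account row. It assumes each “details” value carries a signed `delta` stored as a decimal string; that encoding is an assumption, not something the slides specify.

```python
from decimal import Decimal

BASE = "all_balls"  # column holding the current "base" total, as named in the slides

def get_account_balance(row: dict) -> Decimal:
    """Read the entire row and apply every delta to the base."""
    balance = Decimal(row.get(BASE, "0"))
    for name, details in row.items():
        if name != BASE:
            # assumed: "delta" is signed, negative for debits
            balance += Decimal(details["delta"])
    return balance

row = {
    BASE: "123.45",
    "xact_id0": {"delta": "-20.00"},
    "xact_id1": {"delta": "6.55"},
}
print(get_account_balance(row))  # 110.00
```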

credit_account(acct, delta)

● write “details” to accounts CF


● obviously the row for each account
grows unbounded

● need to safely recalculate the
“base” to avoid unbounded growth

● a standard single master setup for
consolidating works well here
(because the system is slower, not
broken, while the master is down)
● let's be clear on that; the master
being up or down is independent of
the correctness of the system
(otherwise master => SPOF => bad)

(consolidation, WOT form)
● pick number of processors, hash acctId mod num_consolidators
(clearly other options exist to assign accounts to consolidators)
● only the assigned processor can update the base
● read row for account, calculate new base, write new base + delete
columns that went into the base
● read at CL.ALL the first time an account is seen by the processor after boot
(track which accounts have been seen in memory, BDB, etc)
● on failure of a write at CL.Q, do not continue processing for that account
until a read at CL.ALL for that account has completed
● adding consolidators is easy and requires no downtime; shut them down,
reconfigure with a new number, start them up
● essentially checkpointing
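The assignment rule and the CL.ALL bookkeeping above might look something like this. The choice of CRC32 as the hash and the `ReadLevelTracker` class are hypothetical; the slides only say “hash acctId mod num_consolidators” and that the seen-accounts state can live in memory, BDB, etc.

```python
import zlib

def consolidator_for(acct_id: str, num_consolidators: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() on str
    return zlib.crc32(acct_id.encode("utf-8")) % num_consolidators

class ReadLevelTracker:
    """Tracks which accounts still need a CL.ALL read before consolidating."""

    def __init__(self):
        self._verified = set()  # empty after boot, so every account starts at ALL

    def read_level(self, acct_id: str) -> str:
        return "QUORUM" if acct_id in self._verified else "ALL"

    def read_succeeded(self, acct_id: str, level: str) -> None:
        # only a completed CL.ALL read clears the account for CL.Q reads
        if level == "ALL":
            self._verified.add(acct_id)

    def write_failed(self, acct_id: str) -> None:
        # a failed CL.Q write forces the next read for this account back to CL.ALL
        self._verified.discard(acct_id)
```

A consolidator would ask `read_level` before each read and call `write_failed` whenever a base update does not reach CL.Q.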

(consolidation)

acctX

column    value
base      {x}    (current “base” total)
xact0     {y}
xact1     {z}
...       ...

if “base” was written at CL.Q, then a read at CL.Q will return
the most current version.
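The reason a CL.Q read sees a CL.Q write is that any two quorums share at least one replica. A small brute-force check of that overlap, with hypothetical helper names, assuming a replication factor of 3 in the example:

```python
from itertools import combinations

def quorum(replication_factor: int) -> int:
    """Size of a quorum: a strict majority of the replicas."""
    return replication_factor // 2 + 1

def every_read_sees_write(replication_factor: int, w: int, r: int) -> bool:
    """True if every set of r replicas overlaps every set of w replicas."""
    nodes = range(replication_factor)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

rf = 3
q = quorum(rf)
print(q, every_read_sees_write(rf, q, q))   # 2 True
print(every_read_sees_write(rf, 1, q))      # False
```

The guarantee only applies when the write actually succeeded on a quorum; a write that reached a single replica gives the second result.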

(consolidation)

acctX

column    value
base      {x, y, z}    (new “base” total)
xact0     tombstone
xact1     tombstone
...       ...

processor calculates new base and deletes corresponding deltas

(consolidation)

acctX

column    value
base      {x, y, z}    (new “base” total)
xact0     tombstone
xact1     tombstone
...       ...

row level isolation guarantees that if the base includes the delta
then the delta is absent from the delta list. Likewise, if the base does not
include the delta then the delta is in the delta list.
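A rough sketch of consolidation as a single row mutation, using a dict as a stand-in for the row. The function names and the signed decimal-string `delta` field are assumptions; the point is that the base write and the deletes are applied together, so the balance a reader computes is the same before and after.

```python
from decimal import Decimal

BASE = "all_balls"  # name of the "base" column, taken from the slides

def balance(row):
    """Base plus every delta still present in the row."""
    return Decimal(row.get(BASE, "0")) + sum(
        (Decimal(v["delta"]) for k, v in row.items() if k != BASE), Decimal("0"))

def build_consolidation(row):
    """Return (new_base, folded): the new base and the delta columns it absorbed."""
    folded = sorted(k for k in row if k != BASE)
    return balance(row), folded

def apply_row_mutation(row, new_base, folded):
    """Stand-in for one row mutation: base write and deletes become visible together."""
    updated = {k: v for k, v in row.items() if k not in folded}
    updated[BASE] = str(new_base)
    return updated

row = {BASE: "100.00", "xact0": {"delta": "5.00"}, "xact1": {"delta": "-2.50"}}
new_base, folded = build_consolidation(row)

# A credit that arrives after the consolidator's read is not in `folded`,
# so it survives the mutation and is still counted.
row["xact2"] = {"delta": "1.00"}
after = apply_row_mutation(row, new_base, folded)
```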

(durability, consistency)
● writes can be at any
consistency level that meets your
durability, consistency and
availability requirements (CL.Q?)
● base updates and the queries to
calculate them must be at CL.Q
with the initial read at CL.ALL

why the CL.ALL read?

in each cell, left set is base, right is delta list

step                                       node0       node1       node2
initial state                              {} {}       {} {}       {} {}
concurrent write for x                     {} {x}      {} {}       {} {}
CL.Q response from node0 and node1,
calculate base as {x}, CL.Q write fails    {x} {}      {} {}       {} {}
concurrent write for y                     {x} {}      {} {}       {} {y}
CL.Q response from node1 and node2,
calculate base as {y}                      {x} {}      {} {}       {y} {}
node2 propagates base={y},
node0 propagates deltas={}                 {y} {}      {y} {}      {y} {}

resulting in x missing from base *and* from deltas
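The sequence above can be replayed with a toy model of three replicas. This is only a sketch: it assumes last-write-wins per column by timestamp, represents tombstones as `None`, and models the base as the set of deltas folded into it. None of the function names come from Cassandra.

```python
def merge(*replicas):
    """Last-write-wins per column: keep the (timestamp, value) with the highest timestamp."""
    merged = {}
    for replica in replicas:
        for col, cell in replica.items():
            if col not in merged or cell[0] > merged[col][0]:
                merged[col] = cell
    return merged

def write(replica, ts, cells):
    """Apply cells to one replica unless it already holds a newer version."""
    for col, value in cells.items():
        if col not in replica or ts > replica[col][0]:
            replica[col] = (ts, value)

def consolidate(view):
    """Build the mutation: new base plus tombstones (None) for the folded-in deltas."""
    base = view.get("base", (0, frozenset()))[1]
    live = {c for c, (_, v) in view.items() if c != "base" and v is not None}
    mutation = {"base": base | live}
    mutation.update({c: None for c in live})
    return mutation

def run(second_read_all):
    n0, n1, n2 = {}, {}, {}
    write(n0, 1, {"x": "x"})               # concurrent write for x lands on node0
    write(n0, 2, consolidate(merge(n0, n1)))  # base={x}; CL.Q write fails, only node0 has it
    write(n2, 3, {"y": "y"})               # concurrent write for y lands on node2
    view = merge(n0, n1, n2) if second_read_all else merge(n1, n2)
    write(n2, 4, consolidate(view))        # next base update, first seen on node2
    final = merge(n0, n1, n2)              # replicas eventually converge
    live = {c for c, (_, v) in final.items() if c != "base" and v is not None}
    return final["base"][1], live

print(run(second_read_all=False))  # base is {y} with no live deltas: x is lost
print(run(second_read_all=True))   # base contains both x and y
```

With the CL.ALL read after the failed write, the consolidator sees node0's base and tombstone and folds y into {x} instead of starting over from an empty base.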

(xfers)
● clearly source account and
destination account can be on
different nodes
● so, how do you maintain
consistency across them when
doing transfers?

(xfers)
● the common approach is to use a
transaction log (go go wikipedia)
● Oracle uses one
● PGSQL uses one
● C* uses one
● we should have one too!

the xact_log column family

row key: node_token
(randomly chosen from set of nodes, or from a known range, e.g. 0-100)

column name        value
timeuuid(xact0)    “details”
timeuuid(xact1)    “details”
timeuuid(xact2)    “details”

timeuuid(xact0) = timeuuid of when xact0 occurred
“details” = same “complete” details as previously

the xact_log column family

● a durable (e.g. multi node) place to write changes
● a write to xact_log CF ~= “commit”
● each node runs a cron job that periodically (e.g.
every minute +/- 15 seconds) queries a slice of its
corresponding row(s) and those of its neighbor
(could improve on polling)
● node replays any messages found in their entirety
and deletes the column
● normally, the query returns no results
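The polling schedule described above could be sketched like this. The helper names are hypothetical, and treating “neighbor” as the next node on the ring is an assumption; the slides do not say which neighbor is meant.

```python
import random

def next_poll_delay(base=60.0, jitter=15.0, rng=random):
    """Seconds until the next poll: every minute +/- 15 seconds."""
    return base + rng.uniform(-jitter, jitter)

def rows_to_poll(node_tokens, me):
    """Row keys a node checks: its own, plus its neighbor's (assumed: next on the ring)."""
    i = node_tokens.index(me)
    return [me, node_tokens[(i + 1) % len(node_tokens)]]

ring = ["node0", "node1", "node2"]
print(rows_to_poll(ring, "node2"))  # ['node2', 'node0']
```

Polling a neighbor's row as well means a logged transfer still gets replayed when the node that owns the row is down.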

xfer_funds(from, to, delta)
(the interesting one)

● write “details” to xact_log CF
● in parallel, write “details” for from
and to account rows
● delete “details” from xact_log CF
(could be done after client response)
● failures?

xfer_funds(from, to, delta)
(failures)
● before insert
● after insert
● after from xor to is applied
● after from and to are applied
● after delete from xact_log CF
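To see why each of these failure points is survivable, here is a toy version of xfer_funds with a crash injected at every step, followed by a replay of whatever is left in the log. The dicts stand in for the xact_log and accounts column families; `crash_at`, `Crash`, `replay`, and the integer-cents `delta` are all hypothetical.

```python
class Crash(Exception):
    """Simulated client or node failure."""

def xfer_funds(xact_log, accounts, details, crash_at=None):
    xid = details["id"]

    def step(n):
        if crash_at == n:
            raise Crash(n)

    step(0)                                   # before insert
    xact_log[xid] = details                   # write "details" to xact_log (~= commit)
    step(1)                                   # after insert
    accounts.setdefault(details["from"], {})[xid] = -details["delta"]
    step(2)                                   # after from xor to is applied
    accounts.setdefault(details["to"], {})[xid] = details["delta"]
    step(3)                                   # after from and to are applied
    del xact_log[xid]
    step(4)                                   # after delete from xact_log

def replay(xact_log, accounts):
    """Replay every logged transfer in its entirety, then delete the log column."""
    for xid in list(xact_log):
        d = xact_log[xid]
        # same column names as the original writes, so re-applying is harmless
        accounts.setdefault(d["from"], {})[xid] = -d["delta"]
        accounts.setdefault(d["to"], {})[xid] = d["delta"]
        del xact_log[xid]

def outcome(crash_at):
    xact_log, accounts = {}, {}
    details = {"id": "xact0", "from": "acctX", "to": "acctY", "delta": 500}
    try:
        xfer_funds(xact_log, accounts, details, crash_at)
    except Crash:
        pass
    replay(xact_log, accounts)
    return accounts

for n in (0, 1, 2, 3, 4, None):
    print(n, outcome(n))
```

A crash before the log insert leaves no trace of the transfer; a crash at any later point ends, after replay, with both sides applied exactly once.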

consistency?
(eventually)
● partitions between data centers?
● failures for xacts in flight?
● maintenance?
● upgrades?
● you have requirements, be honest about
what they are …
● do not page your ops team at 4am unless
required (which *should* be rare)

accounts column family settings
● normal gc_grace_seconds
● row cache friendly*
● key cache friendly (~everything is)
● leveled compaction strategy
(IO “now” or IO “later”?)
● should probably use
commitlog_sync: batch
(not a per CF setting)

* in general you should probably just avoid the row cache altogether

xact_log column family settings

● gc_grace_seconds = 0
● row cache unfriendly
● key cache friendly, but not needed
● leveled compaction strategy
(or size-tiered with min_threshold=2)

other uses

● “base” and “deltas” need not
represent money
● character inventory/trading
● portfolios
● escrow exchanges
● anything combinable
(you control the consolidate code)

is this the best way?

● not always, of course not, depends on
your requirements and goals
● could use C* for xact_log, Oracle for
balances
● could use zookeeper instead of CL.Q
and CL.ALL for consolidators
● C* solutions favor availability,
scalability and durability over other
desirable traits

Q?
Matthew F. Dennis // @mdennis

Thank You!
(now go prep for your lightning talk)
