Apache Cassandra
Durability, Durability, Durability ...

Matthew F. Dennis // @mdennis
CassandraSF
August 8, 2012
Keyspaces
                       Column Families
Cassandra Data Model   Rows
                       Columns (tuples)
                       Name, Value, Timestamp, TTL
Yeah Matt, you told us ...
credit_account(acct, delta)
 A Banking Application
                          get_account_balance(acct)
(the canonical example)   xfer_funds(from, to, delta)
“Work Backwards From Your Queries” --everyone




     The only “query” we really have is
          get_account_balance
accounts column family


00:00:00 etc                                         hash(“details”) - a unique, one-one
                                                          idempotent id for xact0


                           all_balls     xact_id0     xact_id1        ...
                acctX
                           $123.45       “details”     “details”      ...

current “base” total




from, to, delta, sessionId, timestamp, amount,
     check number, order number, et cetera
          as a JSON/ProtoBuf/XML blob
  (i.e. everything about a “change” to the acct)
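A minimal sketch of the idempotent id idea, assuming the "details" blob is JSON and that SHA-256 over a canonical serialization is an acceptable hash (the slide only says hash("details"); the function name `xact_id` and the sample fields are illustrative). Because the column name is derived from the content, a retried write lands on the same column instead of creating a second delta.

```python
import hashlib
import json


def xact_id(details):
    """Derive a deterministic column name from the transaction details."""
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(details, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


details = {"from": "acctX", "to": "acctY", "delta": "-25.00", "sessionId": "s-1"}

row = {}                          # stand-in for one row of the accounts CF
row[xact_id(details)] = details   # first attempt
row[xact_id(details)] = details   # client retry after a timeout
```

After both writes the row still holds a single column for this transaction.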
get_account_balance(acct)


●   read the entire row

●   apply deltas to “base”
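The two steps above can be sketched as follows. This is an in-memory illustration only: the row is a plain dict, the base lives under the `all_balls` column as in the diagram, and each delta column is assumed to carry a `delta` field holding a decimal string.

```python
from decimal import Decimal

BASE = "all_balls"  # column holding the current "base" total


def get_account_balance(row):
    """Read the entire row and apply every delta to the base."""
    base = Decimal(row.get(BASE, "0"))
    deltas = (Decimal(col["delta"]) for name, col in row.items() if name != BASE)
    return base + sum(deltas, Decimal("0"))


row = {
    "all_balls": "123.45",
    "xact_id0": {"delta": "10.00"},
    "xact_id1": {"delta": "-3.45"},
}
balance = get_account_balance(row)
```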
credit_account(acct, delta)


●   write “details” to accounts CF
get_account_balance(acct)


●   obviously the row for each account
    grows unbounded

●   need to safely recalculate the
    “base” to avoid unbounded growth
accounts column family
●   a standard single master setup for
    consolidating works well here
    (because the system is slower, not
    broken, while the master is down)
●   let's be clear on that; the master
    being up or down is independent of
    the correctness of the system
    (otherwise master => SPOF => bad)
accounts column family
            (consolidation, WOT form)
●   pick number of processors, hash acctId mod num_consolidators
    (clearly other options exist to assign accounts to consolidators)
●   only the assigned processor can update the base
●   read row for account, calculate new base, write new base + delete
    columns that went into the base
●   read at CL.ALL the first time an account is seen by the processor after boot
    (in memory, BDB, etc)
●   on failure of a write at CL.Q, do not continue processing for that account
    until a read at CL.ALL for that account has completed
●   adding consolidators is easy and requires no down time; shut them down,
    reconfigure with a new number, start them up
●   essentially checkpointing
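A rough sketch of the assignment and consolidation steps, under stated assumptions: CRC32 stands in for "hash", rows are in-memory dicts, and the names `consolidator_for`, `build_consolidation` and `apply_consolidation` are made up for illustration. In Cassandra the new base and the deletes would go out as one mutation against the row; here they are simply applied together.

```python
import zlib
from decimal import Decimal

BASE = "all_balls"


def consolidator_for(acct_id, num_consolidators):
    """hash(acctId) mod num_consolidators; CRC32 is an arbitrary stable hash."""
    return zlib.crc32(acct_id.encode("utf-8")) % num_consolidators


def build_consolidation(row):
    """Return the new base and the names of the delta columns folded into it."""
    folded = [name for name in row if name != BASE]
    total = sum((Decimal(row[n]["delta"]) for n in folded), Decimal("0"))
    return Decimal(row.get(BASE, "0")) + total, folded


def apply_consolidation(row, new_base, folded):
    """Write the new base and delete only the columns that went into it."""
    row[BASE] = str(new_base)
    for name in folded:
        row.pop(name, None)


row = {"all_balls": "100.00", "a": {"delta": "5.00"}, "b": {"delta": "-2.50"}}
new_base, folded = build_consolidation(row)
row["c"] = {"delta": "1.00"}      # a delta that arrives after the read
apply_consolidation(row, new_base, folded)
```

The late delta `c` survives because only the columns that were read are deleted, so it is picked up by the next consolidation pass.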
accounts column family
                            (consolidation)


                                                     xact0


                           all_balls   xact_id0     xact_id1         ...
               acctX
                              {x}         {y}          {z}           ...

current “base” total                      xact1



             if “base” was written at CL.Q, then a read at CL.Q will return
             the most current version.
accounts column family
                       (consolidation)


                                                 xact0 tombstone


                      all_balls   xact_id0   xact_id1        ...
           acctX
                      {x, y, z}                              ...

new “base” total           xact1 tombstone



       processor calculates new base and deletes corresponding deltas
accounts column family
                        (consolidation)


                                                     xact0 tombstone


                       all_balls    xact_id0     xact_id1        ...
           acctX
                       {x, y, z}                                 ...

new “base” total            xact1 tombstone



         row level isolation guarantees that if the base includes the delta
         then delta is absent from the delta list. Likewise, if the base does not
         include the delta then the delta is in the delta list
accounts column family
        (durability, consistency)
●   writes can be at any
    consistency level that meets your
    durability, consistency and
    availability requirements (CL.Q?)
●   base updates and the queries to
    calculate them must be at CL.Q
    with the initial read at CL.ALL
why the CL.ALL read?
                                                     node0    node1    node2

left set is base, right is delta list                 {} {}    {} {}    {} {}


concurrent write for x                               {} {x}    {} {}    {} {}

CL.Q response from node0 and node1
calculate base as {x}, CL.Q write fails              {x} {}    {} {}    {} {}


concurrent write for y                               {x} {}    {} {}   {} {y}

CL.Q response from node1 and node2
calculate base as {y}                                {x} {}    {} {}   {y} {}

node2 propagates base={y}
node0 propagates deltas={}                           {y} {}   {y} {}   {y} {}
resulting in x missing from base *and* from deltas
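The sequence above can be replayed with a toy model, offered as a sketch rather than a description of Cassandra internals: each node is a dict of column -> (timestamp, value), reconciliation is last-write-wins per column, a tombstone is a value of `None`, and the base is modeled as the set of delta ids it includes. All function names are hypothetical.

```python
BASE = "all_balls"


def merge(*replicas):
    """Last-write-wins merge; each replica maps column -> (timestamp, value)."""
    merged = {}
    for replica in replicas:
        for col, cell in replica.items():
            if col not in merged or cell[0] > merged[col][0]:
                merged[col] = cell
    return merged


def write(replica, ts, columns):
    """Apply a mutation to one replica, keeping the newest cell per column."""
    for col, value in columns.items():
        if col not in replica or ts > replica[col][0]:
            replica[col] = (ts, value)


def consolidate(view):
    """Build the mutation: new base plus tombstones (None) for folded deltas."""
    base = view.get(BASE, (0, frozenset()))[1]
    deltas = {c for c, (_, v) in view.items() if c != BASE and v is not None}
    mutation = {BASE: base | deltas}
    mutation.update({c: None for c in deltas})
    return mutation


def run(read_all_after_failure):
    n0, n1, n2 = {}, {}, {}
    write(n0, 1, {"x": "delta"})           # write for x reaches node0 only
    write(n0, 2, consolidate(merge(n0, n1)))  # CL.Q write fails; only node0 has it
    write(n2, 3, {"y": "delta"})           # write for y reaches node2 only
    nodes = (n0, n1, n2) if read_all_after_failure else (n1, n2)
    mutation = consolidate(merge(*nodes))
    for node in (n0, n1, n2):              # stands in for eventual propagation
        write(node, 4, mutation)
    final = merge(n0, n1, n2)
    live = {c for c, (_, v) in final.items() if c != BASE and v is not None}
    return final[BASE][1], live


lost_base, lost_deltas = run(read_all_after_failure=False)
safe_base, safe_deltas = run(read_all_after_failure=True)
```

With the quorum read, `x` ends up in neither the base nor the delta list; with the CL.ALL read after the failed write, the base includes both `x` and `y`.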
accounts column family
                (xfers)
●   clearly source account and
    destination account can be on
    different nodes
●   so, how do you maintain
    consistency across them when
    doing transfers?
accounts column family
                (xfers)
●   the common approach is to use a
    transaction log (go go wikipedia)
●   Oracle uses one
●   PGSQL uses one
●   C* uses one
●   we should have one too!
the xact_log column family
                                   randomly chosen from set of nodes
                                   (or from a known range, e.g. 0-100)

                                                    timeuuid of when
                                                      xact0 occurred

                        timeuuid(xact0)   timeuuid(xact1)   timeuuid(xact2)
       node_token
                           “details”         “details”         “details”




 same “complete”
details as previously
the xact_log column family

●   a durable (e.g. multi node) place to write changes
●   a write to xact_log CF ~= “commit”
●   each node runs a cron job that periodically (e.g.
    every minute +/- 15 seconds) queries a slice of its
    corresponding row(s) and those of its neighbor
    (could improve on polling)
●   node replays any messages found in their entirety
    and deletes the column
●   normally, the query returns no results
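A sketch of the polling job under simplifying assumptions: the log is a dict of row key -> {timeuuid: details}, row keys are integers from a known range (`NUM_TOKENS` is a made-up value), "neighbor" means the next key in that range, and each details blob carries a precomputed `xact_id`. None of these names come from the slides.

```python
import random
import uuid

NUM_TOKENS = 4  # hypothetical size of the known range of row keys


def log_xact(xact_log, details, rng=random):
    """Write details under a randomly chosen row, keyed by a timeuuid column."""
    token = rng.randrange(NUM_TOKENS)
    column = uuid.uuid1()
    xact_log.setdefault(token, {})[column] = details
    return token, column


def poll(xact_log, accounts, my_token):
    """Replay everything in this node's row and its neighbor's, then delete it."""
    replayed = 0
    for token in (my_token, (my_token + 1) % NUM_TOKENS):
        row = xact_log.get(token, {})
        for column in sorted(row, key=lambda u: u.time):  # oldest first
            details = row[column]
            # Replay the message in its entirety: both account rows.
            for acct in (details["from"], details["to"]):
                accounts.setdefault(acct, {})[details["xact_id"]] = details
            del row[column]
            replayed += 1
    return replayed


xact_log, accounts = {}, {}
details = {"xact_id": "id0", "from": "acctX", "to": "acctY", "delta": "25.00"}
log_xact(xact_log, details)
first_pass = sum(poll(xact_log, accounts, t) for t in range(NUM_TOKENS))
second_pass = sum(poll(xact_log, accounts, t) for t in range(NUM_TOKENS))
```

The second pass finds nothing to do, which matches the normal case where the query returns no results.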
xfer_funds(from, to, delta)
                (the interesting one)


●   write “details” to xact_log CF
●   in parallel, write “details” for from
    and to account rows
●   delete “details” from xact_log CF
    (could be done after client response)
●   failures?
xfer_funds(from, to, delta)
                    (failures)
●   before insert
●   after insert
●   after from xor to is applied
●   after from and to are applied
●   after delete from xact_log CF
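One way to reason about the failure points listed above is to inject a crash after each step and then run the replay. This is a sketch with an in-memory log and accounts; the `Crash` exception, the step names and the `xact_id` field are illustrative assumptions. The point it tries to show is that once the log write has happened, replay converges to the same end state no matter where the crash occurred, because every write is keyed by the transaction id.

```python
class Crash(Exception):
    """Simulated process death at a chosen point."""


def xfer_funds(log, accounts, details, crash_after=None):
    xid = details["xact_id"]

    def step(name, action):
        action()
        if crash_after == name:
            raise Crash(name)

    if crash_after == "before_insert":
        raise Crash("before_insert")
    step("insert", lambda: log.update({xid: details}))
    step("from", lambda: accounts.setdefault(details["from"], {}).update({xid: details}))
    step("to", lambda: accounts.setdefault(details["to"], {}).update({xid: details}))
    step("delete", lambda: log.pop(xid, None))


def replay(log, accounts):
    """What the periodic job does: reapply every logged xact, then delete it."""
    for xid, details in list(log.items()):
        for acct in (details["from"], details["to"]):
            accounts.setdefault(acct, {})[xid] = details
        del log[xid]


def outcome(crash_after):
    log, accounts = {}, {}
    details = {"xact_id": "id0", "from": "A", "to": "B", "delta": "25.00"}
    try:
        xfer_funds(log, accounts, details, crash_after)
    except Crash:
        pass
    replay(log, accounts)
    return log, accounts


results = {p: outcome(p) for p in ("before_insert", "insert", "from", "to", "delete", None)}
```

A crash before the insert leaves nothing behind (the transfer was never committed); every later crash point ends with both account rows holding the transaction exactly once and an empty log.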
consistency?
                  (eventually)
●   partitions between data centers?
●   failures for xacts in flight?
●   maintenance?
●   upgrades?
●   you have requirements, be honest about
    what they are …
●   do not page your ops team at 4am unless
    required (which *should* be rare)
accounts column family settings
●   normal gc_grace_seconds
●   row cache friendly*
●   key cache friendly (~everything is)
●   leveled compaction strategy
    (IO “now” or IO “later”?)
●   should probably use
    commitlog_sync=batch
    (not a per CF setting)

* in general you should probably just avoid the row cache altogether
xact_log column family settings

●   gc_grace_seconds = 0
●   row cache unfriendly
●   key cache friendly, but not needed
●   leveled compaction strategy
    (or size-tiered with min_threshold=2)
other uses

●   “base” and “deltas” need not
    represent money
●   character inventory/trading
●   portfolios
●   escrow exchanges
●   anything combinable
    (you control the consolidate code)
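As an illustration of "anything combinable", here is a sketch where the base is a character inventory rather than a balance. The `consolidate` helper and the sample items are hypothetical; the only requirement assumed is a function that combines a base with one delta.

```python
import operator
from collections import Counter
from functools import reduce


def consolidate(base, deltas, combine):
    """Fold deltas into the base using whatever combine function fits the data."""
    return reduce(combine, deltas, base)


inventory = Counter({"sword": 1})
trades = [Counter({"potion": 2}), Counter({"sword": -1})]

# Counter addition sums per-item counts and drops items that reach zero.
new_inventory = consolidate(inventory, trades, operator.add)
```

Swapping `operator.add` for another combine function changes what "base plus deltas" means without touching the rest of the scheme.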
is this the best way?

●   not always, of course not, depends on
    your requirements and goals
●   could use C* for xact_log, Oracle for
    balances
●   could use zookeeper instead of CL.Q
    and CL.ALL for consolidators
●   C* solutions favor availability,
    scalability and durability over other
    desirable traits
Q?
Matthew F. Dennis // @mdennis
Thank You!
 (now go prep for your lightning talk)
