C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves Inconsistent Data by Jason Brown

This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.

  1. When Bad Things Happen to Good Data: Understanding Anti-Entropy in Cassandra
     Jason Brown
     @jasobrown | jasedbrown@gmail.com
  2. About me
     •  Senior Software Engineer @ Netflix
     •  Apache Cassandra committer
     •  E-Commerce Architect, Major League Baseball Advanced Media
     •  Wireless developer (J2ME and BREW)
  3. Maintaining consistent state is hard in a distributed system
     The CAP theorem works against you
  4. Inconsistencies creep in
     •  Node is down
     •  Network partition
     •  Dropped mutations
     •  Process crash before commit log flush
     •  File corruption
     Cassandra trades C for AP
  5. Anti-Entropy Overview
     •  write time
        o  tunable consistency
        o  atomic batches
        o  hinted handoff
     •  read time
        o  consistent reads
        o  read repair
     •  maintenance time
        o  node repair
  6. Write Time
  7. Cassandra Writes Basics
     •  determine all replica nodes in all DCs
     •  send to replicas in the local DC
     •  send to one replica node in each remote DC
        o  it will forward to its peers
     •  all respond back to the original coordinator
  8. Writes - request path
  9. Writes - response path
 10. Writes - Tunable consistency
     The coordinator blocks until the specified count of replicas respond
     •  consistency level
        o  ALL
        o  EACH_QUORUM
        o  LOCAL_QUORUM
        o  ONE / TWO / THREE
        o  ANY
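     A minimal client-side sketch of the same idea, using the 2.x/3.x-era DataStax
     Java driver; the contact point, keyspace, and "users" table are hypothetical:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.ConsistencyLevel;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class TunableConsistencyExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // The coordinator will not acknowledge this write until a quorum of
                // replicas in the local DC have responded.
                SimpleStatement write = new SimpleStatement(
                        "INSERT INTO users (user_id, name) VALUES (42, 'jason')");
                write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                session.execute(write);

                cluster.close();
            }
        }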
 11. Hinted handoff
     Save a copy of the write for down nodes, and replay later
     hint = target replica + mutation data
 12. Hinted handoff - storing
     •  on the coordinator, store a hint for any replica node not currently up
     •  if a replica doesn't respond within write_request_timeout_in_ms, store a hint
     •  max_hint_window_in_ms - the maximum amount of time a dead host will have hints generated for it
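     A minimal sketch, not Cassandra's actual code, of the decision the coordinator
     makes; the constants stand in for the cassandra.yaml settings named above:

        import java.util.concurrent.TimeUnit;

        public class HintDecisionSketch {

            // Stand-ins for write_request_timeout_in_ms and max_hint_window_in_ms.
            static final long WRITE_REQUEST_TIMEOUT_MS = 2_000;
            static final long MAX_HINT_WINDOW_MS = TimeUnit.HOURS.toMillis(3);

            // Store a hint if the replica is down (but not past the hint window),
            // or if it failed to acknowledge the write within the request timeout.
            static boolean shouldStoreHint(boolean replicaUp, long replicaDownSinceMs,
                                           long ackLatencyMs, long nowMs) {
                if (!replicaUp)
                    return (nowMs - replicaDownSinceMs) <= MAX_HINT_WINDOW_MS;
                return ackLatencyMs > WRITE_REQUEST_TIMEOUT_MS;
            }

            public static void main(String[] args) {
                long now = System.currentTimeMillis();
                // Down for an hour: inside the hint window, so a hint is stored.
                System.out.println(shouldStoreHint(false, now - TimeUnit.HOURS.toMillis(1), 0, now));
                // Up but slow to acknowledge: a hint is stored as well.
                System.out.println(shouldStoreHint(true, 0, 5_000, now));
            }
        }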
 13. Hinted handoff - replay
     •  try to send stored hints to their target nodes
     •  runs every ten minutes
     •  multithreaded (as of 1.2)
     •  throttleable (kb per second)
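     A rough, illustrative sketch of what a throttled replay loop looks like; this is
     not Cassandra's implementation, and the constant stands in for the
     hinted_handoff_throttle_in_kb setting:

        import java.util.List;

        public class HintReplaySketch {

            static final int THROTTLE_KB_PER_SECOND = 1024;

            static void replay(List<byte[]> hintsForReplica) throws InterruptedException {
                long bytesThisSecond = 0;
                long windowStart = System.currentTimeMillis();
                for (byte[] mutation : hintsForReplica) {
                    sendToReplica(mutation);                 // deliver the stored mutation
                    bytesThisSecond += mutation.length;
                    if (bytesThisSecond >= THROTTLE_KB_PER_SECOND * 1024L) {
                        long elapsed = System.currentTimeMillis() - windowStart;
                        if (elapsed < 1000)
                            Thread.sleep(1000 - elapsed);    // stay under the kb/s cap
                        bytesThisSecond = 0;
                        windowStart = System.currentTimeMillis();
                    }
                }
            }

            static void sendToReplica(byte[] mutation) { /* network send elided */ }
        }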
 14. Hinted Handoff - R2 down
     R2 down, coordinator (R1) stores hint
 15. Hinted handoff - replay
     R2 comes back up, R1 plays hints for it
 16. What if the coordinator dies?
 17. Atomic Batches
     •  coordinator stores the incoming mutation on two peers in the same DC
        o  deletes it from the peers on successful completion
     •  peers will replay the batch if it is not deleted
        o  replay runs every 60 seconds
     •  with 1.2, all mutations use atomic batches
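     A minimal sketch of issuing a logged (atomic) batch from a client with the
     DataStax Java driver; the two denormalized tables are hypothetical:

        import com.datastax.driver.core.BatchStatement;
        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class AtomicBatchExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // LOGGED batches go through the batchlog described above, so both
                // inserts will eventually apply even if the coordinator dies mid-write.
                BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
                batch.add(new SimpleStatement(
                        "INSERT INTO users (user_id, name) VALUES (42, 'jason')"));
                batch.add(new SimpleStatement(
                        "INSERT INTO users_by_name (name, user_id) VALUES ('jason', 42)"));
                session.execute(batch);

                cluster.close();
            }
        }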
 18. Read Time
 19. Cassandra Reads - setup
     •  determine the endpoints to invoke
        o  consistency level vs. read repair
     •  the first data node sends back the full data set, other nodes return only a digest
     •  wait for the CL number of nodes to return
 20. LOCAL_QUORUM read
     Pink nodes contain the requested row key
 21. Consistent reads
     •  compare the digests of the returned data sets
     •  if there are any mismatches, send the request again to the same CL data nodes
        o  this time no digests, full data sets
     •  compare the full data sets, send updates to out-of-date replicas
     •  block until those fixes are acknowledged
     •  return the data to the caller
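     An illustrative sketch, not Cassandra's code, of the digest comparison: one
     replica returns the full row, the others return only a hash of theirs, and any
     mismatch forces the full-data round trip described above:

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.Arrays;
        import java.util.List;

        public class DigestReadSketch {

            static byte[] digest(String row) throws Exception {
                return MessageDigest.getInstance("MD5")
                                    .digest(row.getBytes(StandardCharsets.UTF_8));
            }

            public static void main(String[] args) throws Exception {
                String fullDataFromOneReplica = "user_id=42,name=jason,ts=100";
                List<byte[]> digestsFromOtherReplicas = List.of(
                        digest("user_id=42,name=jason,ts=100"),   // in sync
                        digest("user_id=42,name=jason,ts=90"));   // stale replica

                byte[] expected = digest(fullDataFromOneReplica);
                boolean mismatch = digestsFromOtherReplicas.stream()
                        .anyMatch(d -> !Arrays.equals(d, expected));

                // On a mismatch the coordinator re-reads full data, reconciles by
                // timestamp, writes the winning version back to stale replicas,
                // and only then returns the result to the client.
                System.out.println("needs full-data read and repair: " + mismatch);
            }
        }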
 22. Read Repair
     •  synchronizes the client-requested data amongst all replicas
     •  piggy-backs on normal reads, but waits for all replicas to respond asynchronously
     •  then, just like consistent reads, compares the digests and fixes any mismatches
 23. Read Repair
     green lines = LOCAL_QUORUM nodes
     blue lines = nodes for read repair
 24. Read Repair - configuration
     •  setting per column family
     •  percentage of all calls to the CF
     •  local DC vs. global chance
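     A sketch of setting those chances per table via CQL (table name hypothetical);
     in the C* versions discussed here, read_repair_chance is the global probability
     and dclocal_read_repair_chance the local-DC one:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;

        public class ReadRepairChanceExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // 10% of reads trigger a read repair across all DCs; 50% trigger one
                // limited to replicas in the coordinator's local DC.
                session.execute("ALTER TABLE users WITH read_repair_chance = 0.1"
                        + " AND dclocal_read_repair_chance = 0.5");

                cluster.close();
            }
        }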
 25. Read repair fixes data that is actually requested...
     but what about data that isn't requested?
 26. Node Repair - introduction
     •  repairs inconsistencies across all replicas for a given range
     •  nodetool repair
        o  repairs the ranges the node contains
        o  one or more column families (within the same keyspace)
        o  can choose the local datacenter only (c* 1.2)
 27. Node Repair - cautions
     •  should be part of standard operations maintenance for c*, especially if you delete data
        o  ensures tombstones are propagated and avoids resurrected data
     •  repair is IO and CPU intensive
 28. Node Repair - details 1
     •  determine peer nodes with matching ranges
     •  triggers a major (validation) compaction on the peer nodes
        o  read and generate a hash for every row in the CF
        o  add the result to a Merkle tree
        o  return the tree to the initiator
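     An illustrative sketch, not Cassandra's MerkleTree implementation, of the idea
     behind the validation compaction: hash every row, then fold the hashes pairwise
     so an entire token range can be summarized by a single root hash:

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.ArrayList;
        import java.util.List;

        public class MerkleSketch {

            static byte[] md5(byte[]... parts) throws Exception {
                MessageDigest md = MessageDigest.getInstance("MD5");
                for (byte[] p : parts)
                    md.update(p);
                return md.digest();
            }

            // Builds one level up: each parent hash covers two adjacent child ranges.
            static List<byte[]> combine(List<byte[]> level) throws Exception {
                List<byte[]> parents = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    byte[] left = level.get(i);
                    byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                    parents.add(md5(left, right));
                }
                return parents;
            }

            public static void main(String[] args) throws Exception {
                // One leaf hash per row (or per sub-range) in the range being repaired.
                List<byte[]> level = new ArrayList<>();
                for (String row : new String[] {"row1", "row2", "row3", "row4"})
                    level.add(md5(row.getBytes(StandardCharsets.UTF_8)));
                while (level.size() > 1)
                    level = combine(level);                   // fold up to the root
                System.out.println("root hash has " + level.get(0).length + " bytes");
            }
        }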
 29. Node Repair - details 2
     •  initiator awaits trees from all nodes
     •  compares each tree to every other tree
     •  if any differences exist, the two nodes exchange the conflicting ranges
        o  these ranges get written out as new, local sstables
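     A toy sketch of the comparison step, assuming each tree has been reduced to one
     hash per sub-range; only sub-ranges whose hashes disagree need to be streamed
     between the two replicas:

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class TreeDiffSketch {

            // Each array holds one hash per sub-range of the token range under repair.
            static List<Integer> mismatchedRanges(byte[][] treeA, byte[][] treeB) {
                List<Integer> ranges = new ArrayList<>();
                for (int i = 0; i < treeA.length; i++) {
                    if (!Arrays.equals(treeA[i], treeB[i]))
                        ranges.add(i);
                }
                return ranges;
            }

            public static void main(String[] args) {
                byte[][] nodeA = { {1, 2}, {3, 4}, {5, 6} };
                byte[][] nodeB = { {1, 2}, {9, 9}, {5, 6} };
                // Only sub-range 1 differs, so only that slice of data is exchanged.
                System.out.println("ranges to stream: " + mismatchedRanges(nodeA, nodeB));
            }
        }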
 30. ABC node is the repair initiator
 31. Nodes sharing range A
 32. Nodes sharing range B
 33. Nodes sharing range C
 34. Five nodes participating in repair
 35. Anti-Entropy wrap-up
     •  the CAP Theorem lives; tradeoffs must be made
     •  C* contains processes to make diverging data sets consistent
     •  tunable controls exist at write and read time, as well as on demand
 36. Thank you!
     Q & A time
     @jasobrown
 37. Notes from Netflix
     •  carefully tune RR_chance
     •  schedule repair operations
     •  tickler
     •  store more hints vs. running repair
