C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves Inconsistent Data by Jason Brown

This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.

  1. When Bad Things Happen to Good Data: Understanding Anti-Entropy in Cassandra
     Jason Brown
     @jasobrown | jasedbrown@gmail.com
  2. About me
     •  Senior Software Engineer @ Netflix
     •  Apache Cassandra committer
     •  E-Commerce Architect, Major League Baseball Advanced Media
     •  Wireless developer (J2ME and BREW)
  3. Maintaining consistent state is hard in a distributed system
     The CAP theorem works against you
  4. Inconsistencies creep in
     •  Node is down
     •  Network partition
     •  Dropped mutations
     •  Process crash before commit log flush
     •  File corruption
     Cassandra trades C for AP
  5. Anti-Entropy Overview
     •  write time
        o  tunable consistency
        o  atomic batches
        o  hinted handoff
     •  read time
        o  consistent reads
        o  read repair
     •  maintenance time
        o  node repair
  6. Write Time
  7. Cassandra Writes Basics
     •  determine all replica nodes in all DCs
     •  send to replicas in the local DC
     •  send to one replica node in each remote DC
        o  it will forward to its peers
     •  all respond back to the original coordinator
  8. Writes - request path
  9. Writes - response path
 10. Writes - Tunable consistency
     The coordinator blocks until the specified count of replicas respond
     •  consistency level
        o  ALL
        o  EACH_QUORUM
        o  LOCAL_QUORUM
        o  ONE / TWO / THREE
        o  ANY
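     A minimal client-side sketch of the same idea, using the 2.x/3.x-era DataStax
     Java driver; the contact point, keyspace, and "users" table are hypothetical:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.ConsistencyLevel;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class TunableConsistencyExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // The coordinator will not acknowledge this write until a quorum of
                // replicas in the local DC have responded.
                SimpleStatement write = new SimpleStatement(
                        "INSERT INTO users (user_id, name) VALUES (42, 'jason')");
                write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                session.execute(write);

                cluster.close();
            }
        }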
 11. Hinted handoff
     Save a copy of the write for down nodes, and replay later
     hint = target replica + mutation data
 12. Hinted handoff - storing
     •  on the coordinator, store a hint for any replica node not currently up
     •  if a replica doesn't respond within write_request_timeout_in_ms, store a hint
     •  max_hint_window_in_ms - the maximum amount of time a dead host will have hints generated for it
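     A minimal sketch, not Cassandra's actual code, of the decision the coordinator
     makes; the constants stand in for the cassandra.yaml settings named above:

        import java.util.concurrent.TimeUnit;

        public class HintDecisionSketch {

            // Stand-ins for write_request_timeout_in_ms and max_hint_window_in_ms.
            static final long WRITE_REQUEST_TIMEOUT_MS = 2_000;
            static final long MAX_HINT_WINDOW_MS = TimeUnit.HOURS.toMillis(3);

            // Store a hint if the replica is down (but not past the hint window),
            // or if it failed to acknowledge the write within the request timeout.
            static boolean shouldStoreHint(boolean replicaUp, long replicaDownSinceMs,
                                           long ackLatencyMs, long nowMs) {
                if (!replicaUp)
                    return (nowMs - replicaDownSinceMs) <= MAX_HINT_WINDOW_MS;
                return ackLatencyMs > WRITE_REQUEST_TIMEOUT_MS;
            }

            public static void main(String[] args) {
                long now = System.currentTimeMillis();
                // Down for an hour: inside the hint window, so a hint is stored.
                System.out.println(shouldStoreHint(false, now - TimeUnit.HOURS.toMillis(1), 0, now));
                // Up but slow to acknowledge: a hint is stored as well.
                System.out.println(shouldStoreHint(true, 0, 5_000, now));
            }
        }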
 13. Hinted handoff - replay
     •  try to send stored hints to their target nodes
     •  runs every ten minutes
     •  multithreaded (as of 1.2)
     •  throttleable (kb per second)
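     A rough, illustrative sketch of what a throttled replay loop looks like; this is
     not Cassandra's implementation, and the constant stands in for the
     hinted_handoff_throttle_in_kb setting:

        import java.util.List;

        public class HintReplaySketch {

            static final int THROTTLE_KB_PER_SECOND = 1024;

            static void replay(List<byte[]> hintsForReplica) throws InterruptedException {
                long bytesThisSecond = 0;
                long windowStart = System.currentTimeMillis();
                for (byte[] mutation : hintsForReplica) {
                    sendToReplica(mutation);                 // deliver the stored mutation
                    bytesThisSecond += mutation.length;
                    if (bytesThisSecond >= THROTTLE_KB_PER_SECOND * 1024L) {
                        long elapsed = System.currentTimeMillis() - windowStart;
                        if (elapsed < 1000)
                            Thread.sleep(1000 - elapsed);    // stay under the kb/s cap
                        bytesThisSecond = 0;
                        windowStart = System.currentTimeMillis();
                    }
                }
            }

            static void sendToReplica(byte[] mutation) { /* network send elided */ }
        }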
 14. Hinted Handoff - R2 down
     R2 down, coordinator (R1) stores hint
 15. Hinted handoff - replay
     R2 comes back up, R1 plays hints for it
 16. What if the coordinator dies?
 17. Atomic Batches
     •  coordinator stores the incoming mutation on two peers in the same DC
        o  deletes it from the peers on successful completion
     •  peers will replay the batch if it is not deleted
        o  replay runs every 60 seconds
     •  with 1.2, all mutations use atomic batches
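     A minimal sketch of issuing a logged (atomic) batch from a client with the
     DataStax Java driver; the two denormalized tables are hypothetical:

        import com.datastax.driver.core.BatchStatement;
        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;

        public class AtomicBatchExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // LOGGED batches go through the batchlog described above, so both
                // inserts will eventually apply even if the coordinator dies mid-write.
                BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
                batch.add(new SimpleStatement(
                        "INSERT INTO users (user_id, name) VALUES (42, 'jason')"));
                batch.add(new SimpleStatement(
                        "INSERT INTO users_by_name (name, user_id) VALUES ('jason', 42)"));
                session.execute(batch);

                cluster.close();
            }
        }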
 18. Read Time
 19. Cassandra Reads - setup
     •  determine the endpoints to invoke
        o  consistency level vs. read repair
     •  the first data node sends back the full data set, other nodes return only a digest
     •  wait for the CL number of nodes to return
 20. LOCAL_QUORUM read
     Pink nodes contain the requested row key
 21. Consistent reads
     •  compare the digests of the returned data sets
     •  if there are any mismatches, send the request again to the same CL data nodes
        o  this time no digests, full data sets
     •  compare the full data sets, send updates to out-of-date replicas
     •  block until those fixes are acknowledged
     •  return the data to the caller
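     An illustrative sketch, not Cassandra's code, of the digest comparison: one
     replica returns the full row, the others return only a hash of theirs, and any
     mismatch forces the full-data round trip described above:

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.Arrays;
        import java.util.List;

        public class DigestReadSketch {

            static byte[] digest(String row) throws Exception {
                return MessageDigest.getInstance("MD5")
                                    .digest(row.getBytes(StandardCharsets.UTF_8));
            }

            public static void main(String[] args) throws Exception {
                String fullDataFromOneReplica = "user_id=42,name=jason,ts=100";
                List<byte[]> digestsFromOtherReplicas = List.of(
                        digest("user_id=42,name=jason,ts=100"),   // in sync
                        digest("user_id=42,name=jason,ts=90"));   // stale replica

                byte[] expected = digest(fullDataFromOneReplica);
                boolean mismatch = digestsFromOtherReplicas.stream()
                        .anyMatch(d -> !Arrays.equals(d, expected));

                // On a mismatch the coordinator re-reads full data, reconciles by
                // timestamp, writes the winning version back to stale replicas,
                // and only then returns the result to the client.
                System.out.println("needs full-data read and repair: " + mismatch);
            }
        }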
 22. Read Repair
     •  synchronizes the client-requested data amongst all replicas
     •  piggy-backs on normal reads, but waits for all replicas to respond asynchronously
     •  then, just like consistent reads, compares the digests and fixes any mismatches
 23. Read Repair
     green lines = LOCAL_QUORUM nodes
     blue lines = nodes for read repair
 24. Read Repair - configuration
     •  setting per column family
     •  percentage of all calls to the CF
     •  local DC vs. global chance
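     A sketch of setting those chances per table via CQL (table name hypothetical);
     in the C* versions discussed here, read_repair_chance is the global probability
     and dclocal_read_repair_chance the local-DC one:

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Session;

        public class ReadRepairChanceExample {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                Session session = cluster.connect("my_keyspace");

                // 10% of reads trigger a read repair across all DCs; 50% trigger one
                // limited to replicas in the coordinator's local DC.
                session.execute("ALTER TABLE users WITH read_repair_chance = 0.1"
                        + " AND dclocal_read_repair_chance = 0.5");

                cluster.close();
            }
        }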
 25. Read repair fixes data that is actually requested...
     but what about data that isn't requested?
 26. Node Repair - introduction
     •  repairs inconsistencies across all replicas for a given range
     •  nodetool repair
        o  repairs the ranges the node contains
        o  one or more column families (within the same keyspace)
        o  can choose the local datacenter only (c* 1.2)
 27. Node Repair - cautions
     •  should be part of standard operations maintenance for c*, especially if you delete data
        o  ensures tombstones are propagated and avoids resurrected data
     •  repair is IO and CPU intensive
 28. Node Repair - details 1
     •  determine peer nodes with matching ranges
     •  triggers a major (validation) compaction on the peer nodes
        o  read and generate a hash for every row in the CF
        o  add the result to a Merkle tree
        o  return the tree to the initiator
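     An illustrative sketch, not Cassandra's MerkleTree implementation, of the idea
     behind the validation compaction: hash every row, then fold the hashes pairwise
     so an entire token range can be summarized by a single root hash:

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.ArrayList;
        import java.util.List;

        public class MerkleSketch {

            static byte[] md5(byte[]... parts) throws Exception {
                MessageDigest md = MessageDigest.getInstance("MD5");
                for (byte[] p : parts)
                    md.update(p);
                return md.digest();
            }

            // Builds one level up: each parent hash covers two adjacent child ranges.
            static List<byte[]> combine(List<byte[]> level) throws Exception {
                List<byte[]> parents = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    byte[] left = level.get(i);
                    byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                    parents.add(md5(left, right));
                }
                return parents;
            }

            public static void main(String[] args) throws Exception {
                // One leaf hash per row (or per sub-range) in the range being repaired.
                List<byte[]> level = new ArrayList<>();
                for (String row : new String[] {"row1", "row2", "row3", "row4"})
                    level.add(md5(row.getBytes(StandardCharsets.UTF_8)));
                while (level.size() > 1)
                    level = combine(level);                   // fold up to the root
                System.out.println("root hash has " + level.get(0).length + " bytes");
            }
        }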
 29. Node Repair - details 2
     •  initiator awaits trees from all nodes
     •  compares each tree to every other tree
     •  if any differences exist, the two nodes exchange the conflicting ranges
        o  these ranges get written out as new, local sstables
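     A toy sketch of the comparison step, assuming each tree has been reduced to one
     hash per sub-range; only sub-ranges whose hashes disagree need to be streamed
     between the two replicas:

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class TreeDiffSketch {

            // Each array holds one hash per sub-range of the token range under repair.
            static List<Integer> mismatchedRanges(byte[][] treeA, byte[][] treeB) {
                List<Integer> ranges = new ArrayList<>();
                for (int i = 0; i < treeA.length; i++) {
                    if (!Arrays.equals(treeA[i], treeB[i]))
                        ranges.add(i);
                }
                return ranges;
            }

            public static void main(String[] args) {
                byte[][] nodeA = { {1, 2}, {3, 4}, {5, 6} };
                byte[][] nodeB = { {1, 2}, {9, 9}, {5, 6} };
                // Only sub-range 1 differs, so only that slice of data is exchanged.
                System.out.println("ranges to stream: " + mismatchedRanges(nodeA, nodeB));
            }
        }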
 30. ABC node is the repair initiator
 31. Nodes sharing range A
 32. Nodes sharing range B
 33. Nodes sharing range C
 34. Five nodes participating in repair
 35. Anti-Entropy wrap-up
     •  the CAP Theorem lives; tradeoffs must be made
     •  C* contains processes to make diverging data sets consistent
     •  tunable controls exist at write and read time, as well as on demand
 36. Thank you!
     Q & A time
     @jasobrown
 37. Notes from Netflix
     •  carefully tune RR_chance
     •  schedule repair operations
     •  tickler
     •  store more hints vs. running repair
