Understanding AntiEntropy in Cassandra

Introduction to the anti-entropy mechanisms in cassandra. Covers write and read paths as well as node repair.

Published in: Technology, Business

Transcript of "Understanding AntiEntropy in Cassandra"

  1. When Bad Things Happen to Good Data: Understanding Anti-Entropy in Cassandra
     Jason Brown, @jasobrown, jasedbrown@gmail.com
     #cassandra13
  2. About me
     • Senior Software Engineer, Netflix
     • Apache Cassandra committer
     • E-commerce Architect, Major League Baseball Advanced Media
     • Wireless Developer (J2ME and BREW)
  3. Maintaining consistent state is hard in a distributed system
     CAP theorem is working against you
  4. Inconsistencies creep in
     • Node is down
     • Network partition
     • Dropped mutations
     • Process crash before flush
     • File corruption
  5. Anti-Entropy Overview
     • Write time
       • Tunable consistency
       • Atomic batches
       • Hinted handoff
     • Read time
       • Consistent reads
       • Read repair
     • Maintenance time
       • Node repair
  6. Write Time
  7. C* Write Basics
     • Determine all replica nodes, in all DCs
     • Send to all replicas in local DC
     • Send to one replica in remote DCs
       • It will forward to peers
     • All respond back to coordinator
  8. Writes – request path
  9. Writes – response path
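
The write fan-out described on the "C* Write Basics" slide can be sketched roughly as follows. This is a toy model, not Cassandra's code: `plan_write`, the node names, and the choice of the first remote replica as the forwarder are all assumptions made for illustration.

```python
def plan_write(replicas_by_dc, local_dc):
    """Plan who the coordinator messages for one write.

    replicas_by_dc: dict mapping DC name -> list of replica node names.
    Returns (direct, forwards):
      direct   - nodes the coordinator sends the mutation to itself
      forwards - remote node -> peers in its DC it must relay the write to
    """
    # Every replica in the local DC gets the write directly.
    direct = list(replicas_by_dc.get(local_dc, []))
    forwards = {}
    for dc, nodes in replicas_by_dc.items():
        if dc == local_dc or not nodes:
            continue
        # One replica per remote DC receives the write and forwards it.
        # Picking nodes[0] is an arbitrary choice for this sketch.
        head, rest = nodes[0], list(nodes[1:])
        direct.append(head)
        forwards[head] = rest
    return direct, forwards


replicas = {"us-east": ["a1", "a2", "a3"], "eu-west": ["b1", "b2", "b3"]}
direct, forwards = plan_write(replicas, "us-east")
```

Sending a single cross-DC message per remote datacenter keeps WAN traffic down; every replica still acknowledges back to the coordinator, as the last bullet of the slide states.
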
  10. Tunable consistency
      Coordinator blocks for specified count of replicas to respond
      Consistency levels:
      • ANY
      • ONE / TWO / THREE
      • LOCAL_QUORUM
      • EACH_QUORUM
      • ALL
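
As a rough illustration of "blocks for specified count of replicas", the sketch below computes how many acknowledgements a coordinator would wait for at each level listed on the slide. The function names and the per-DC replication-factor dictionary are hypothetical; the quorum formula (half the replicas, rounded down, plus one) is the standard one.

```python
def quorum(rf):
    """Majority of rf replicas."""
    return rf // 2 + 1


def blocked_for(level, rf_by_dc, local_dc):
    """Number of responses the coordinator waits for (toy model).

    rf_by_dc: dict mapping DC name -> replication factor in that DC.
    """
    if level == "ANY":
        # A single acknowledgement is enough; for ANY a stored hint can count.
        return 1
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum is required in every datacenter.
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return sum(rf_by_dc.values())
    raise ValueError(f"unknown consistency level: {level}")


rf = {"us-east": 3, "eu-west": 3}
for level in ("ANY", "ONE", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
    print(level, blocked_for(level, rf, "us-east"))
```
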
  11. Hinted Handoff
      Save a copy of the write for down nodes, and replay later
      Hint = target replica ID + mutation data
  12. Hinted Handoff - storing
      • On coordinator, store hint for nodes not up
      • Also, if a replica doesn’t respond within write_request_timeout_in_ms, store a hint
      • max_hint_window_in_ms: max time a node will create hints for a dead node
  13. Hinted Handoff - replay
      • Try to send hints to nodes
      • Runs every ten minutes
      • Multithreaded (C* 1.2)
      • Throttleable (KB per second)
  14. Hinted Handoff – down node
  15. Hinted Handoff – replay
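
The storing and replay rules from the hinted handoff slides can be modelled with a small in-memory class. This is only a sketch under stated assumptions: `HintStore`, its methods, and the callbacks are hypothetical names, the window value is an arbitrary example, and the periodic scheduling, threading, and throttling mentioned on the slides are left out.

```python
class HintStore:
    """Toy coordinator-side hint store (not Cassandra's implementation)."""

    def __init__(self, max_hint_window_ms):
        self.max_hint_window_ms = max_hint_window_ms
        self.hints = []  # each hint = (target replica ID, mutation data)

    def maybe_store(self, target, mutation, down_for_ms):
        """Store a hint unless the target has been down longer than the window."""
        if down_for_ms > self.max_hint_window_ms:
            return False
        self.hints.append((target, mutation))
        return True

    def replay(self, is_alive, send):
        """Try to deliver every hint; keep the ones that cannot be delivered.

        is_alive(target) -> bool and send(target, mutation) -> bool are
        caller-supplied callbacks. Returns the number of hints delivered.
        """
        remaining = []
        delivered = 0
        for target, mutation in self.hints:
            if is_alive(target) and send(target, mutation):
                delivered += 1
            else:
                remaining.append((target, mutation))
        self.hints = remaining
        return delivered


store = HintStore(max_hint_window_ms=60_000)  # example value only
store.maybe_store("node3", "INSERT k=1", down_for_ms=5_000)
delivered = store.replay(is_alive=lambda n: True, send=lambda n, m: True)
```
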
  16. What if coordinator dies?
  17. Atomic Batches
      • Coordinator stores incoming mutation to two peers in same DC
      • Deletes batch from peers on successful completion
      • Peers will play batch if not deleted
        • Runs every 60 seconds
      • With C* 1.2, all mutations use atomic batches
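
The batch protocol above answers the "what if coordinator dies?" question: the batch is parked on two peers before it is applied, so a peer can finish the job. A minimal simulation, assuming hypothetical names (`Peer`, `run_batch`, `replay_leftovers`) and ignoring the 60-second timer:

```python
class Peer:
    """A node that temporarily holds copies of in-flight batches."""

    def __init__(self):
        self.batchlog = {}  # batch_id -> list of mutations


def run_batch(batch_id, mutations, peers, apply, coordinator_survives=True):
    """Coordinator side: log to peers, apply, then clean up."""
    for peer in peers:
        peer.batchlog[batch_id] = list(mutations)
    if not coordinator_survives:
        # Simulated crash after logging but before applying anything.
        return False
    for mutation in mutations:
        apply(mutation)
    for peer in peers:
        del peer.batchlog[batch_id]
    return True


def replay_leftovers(peer, apply):
    """Peer side: play any batch the coordinator never deleted."""
    replayed = 0
    for batch_id in list(peer.batchlog):
        for mutation in peer.batchlog.pop(batch_id):
            apply(mutation)
            replayed += 1
    return replayed


applied = set()  # a set, because replaying a mutation must be idempotent
peers = [Peer(), Peer()]
run_batch("b1", ["m1", "m2"], peers, applied.add, coordinator_survives=False)
replay_leftovers(peers[0], applied.add)
```

Both peers may replay the same leftover batch, which is harmless in this sketch only because applying a mutation twice has the same effect as applying it once.
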
  18. Read time
  19. Cassandra reads - setup
      • Determine replicas to invoke
        • consistency level vs. read repair
      • First data node responds with full data set, others send a digest
      • Coordinator waits for consistency_level nodes to respond
  20. LOCAL_QUORUM read
  21. Consistent reads
      • Compare digests
      • If any mismatches
        • re-request to same nodes (full data set)
        • compare full data sets, send updates
        • block until out-of-date replicas respond successfully
      • Return merged data set to client
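
The digest comparison and merge steps can be sketched as below. The row layout (column mapped to a value and a write timestamp), the helper names, and the use of MD5 are assumptions for illustration; the merge rule is last-write-wins by timestamp, and tie-breaking on equal timestamps is deliberately omitted.

```python
import hashlib


def digest(row):
    """Hash of a row, where row maps column -> (value, timestamp)."""
    h = hashlib.md5()
    for col in sorted(row):
        value, ts = row[col]
        h.update(f"{col}={value}@{ts};".encode())
    return h.hexdigest()


def merge(rows):
    """Keep, for every column, the value with the newest timestamp."""
    merged = {}
    for row in rows:
        for col, (value, ts) in row.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    return merged


def consistent_read(responses):
    """responses: replica -> full row.

    Returns (row for the client, repairs), where repairs maps each
    out-of-date replica to the columns it needs to be sent.
    """
    if len({digest(row) for row in responses.values()}) == 1:
        return next(iter(responses.values())), {}  # digests match
    merged = merge(responses.values())
    repairs = {}
    for replica, row in responses.items():
        diff = {c: v for c, v in merged.items() if row.get(c) != v}
        if diff:
            repairs[replica] = diff
    return merged, repairs


responses = {"n1": {"name": ("ann", 10)}, "n2": {"name": ("anne", 20)}}
row, repairs = consistent_read(responses)
```

In this toy version every replica has already returned its full row; in the flow on the slide, full data is re-requested only after a digest mismatch.
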
  22. Read repair
      • Synchronizes the client-requested data amongst all replicas
      • Piggy-backs on normal reads, but waits for all replicas to respond (asynchronously)
      • Compares the digests and follows the same algorithm as a consistent read
  23. Read Repair
      Green lines = LOCAL_QUORUM nodes
      Blue lines = nodes for read repair
  24. Read repair configuration
      • Setting per column family
      • Percentage of all reads to CF
      • Local DC vs. Global
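
One way to picture the "percentage of all reads" and "local DC vs. global" settings is a single random roll per read, checked against two per-column-family probabilities. The parameter names below mirror the `read_repair_chance` and `dclocal_read_repair_chance` table options as they are assumed to exist in Cassandra 1.2; the function itself and the returned labels are hypothetical, and the exact decision order in Cassandra may differ.

```python
import random


def read_repair_decision(read_repair_chance, dclocal_read_repair_chance, rng=random):
    """Decide, for one read, how widely to read-repair (toy model).

    Both chances are probabilities between 0.0 and 1.0.
    """
    roll = rng.random()  # uniform in [0.0, 1.0)
    if read_repair_chance > roll:
        return "GLOBAL"    # involve replicas in all datacenters
    if dclocal_read_repair_chance > roll:
        return "DC_LOCAL"  # involve only replicas in the local datacenter
    return "NONE"          # plain read at the requested consistency level


# Roughly 10% of reads repair everywhere, a further 40% repair the local DC.
decision = read_repair_decision(0.1, 0.5)
```

Because the same roll is reused, the second probability acts as an upper bound: with the values above, about half of all reads trigger some form of read repair.
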
  25. Read repair fixes data that is actually requested,
      …but what about data that isn’t requested?
  26. Node repair - introduction
      • Repairs inconsistencies across all replicas for a given range
      • nodetool repair
        • repairs the ranges the node contains
        • one or more column families (within the same keyspace)
        • can choose local datacenter only (C* 1.2)
  27. Node Repair - cautions
      • Should be part of standard C* operations
        • Especially if you delete data
      • Repair is IO and CPU intensive
  28. Node Repair – details, 1
      • Determine peer nodes with matching ranges
      • Triggers a major (validation) compaction on peer nodes
        • read and generate hash for every row in CF
        • add result to a Merkle tree
        • return tree to initiator
  29. Node Repair – details, 2
      • Initiator awaits trees from participating nodes
      • Compares every tree to every other tree
      • If any differences detected, the differing nodes exchange conflicting range(s)
      • Written out as new, local SSTables
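
The tree-building and tree-comparison steps can be illustrated with a small Merkle tree over hashed rows. This is a simplified sketch: rows are assigned to a fixed number of leaf buckets by hashing the key, whereas Cassandra builds its trees over token ranges; `bucket_of`, `build_tree`, and `differing_leaves` are hypothetical names, and SHA-256 is an arbitrary hash choice.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def bucket_of(key, leaves):
    """Leaf index for a row key (stand-in for a token range)."""
    return int.from_bytes(_h(key.encode())[:4], "big") % leaves


def build_tree(rows, leaves=8):
    """rows: key -> value. leaves must be a power of two.

    Returns a list of levels: levels[0] holds leaf hashes, levels[-1] the root.
    """
    buckets = [[] for _ in range(leaves)]
    for key in sorted(rows):
        buckets[bucket_of(key, leaves)].append(f"{key}={rows[key]}".encode())
    level = [_h(b"\x00".join(bucket)) for bucket in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b):
    """Walk both trees from the root, descending only where hashes differ."""
    suspects = [0]
    for depth in range(len(a) - 1, 0, -1):
        suspects = [child
                    for i in suspects if a[depth][i] != b[depth][i]
                    for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if a[0][i] != b[0][i]]


node_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
node_b = dict(node_a, k2="stale")
ranges_to_exchange = differing_leaves(build_tree(node_a), build_tree(node_b))
```

Only the leaves that differ need to be exchanged between the nodes, which is why comparing trees is cheaper than shipping every row.
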
  30. Read Repair – example
  31-34. (Diagram-only slides; no text in the transcript)
  35. Anti-Entropy – Wrap Up
      • CAP Theorem lives, tradeoffs must be understood and made
      • C* contains processes to make diverging data sets consistent
      • Tunable controls exist at write and read times, as well as on demand
  36. Thank you!
      Q & A time
      @jasobrown
