Strategies for Distributed Data Storage

  • Kelvin Kakugawa
    infrastructure engineer @ Digg
    working on extending Cassandra (can talk about this more at the end of the session)
  • 3 parts of my talk
  • let’s go through the journey of a typical web developer,
    so we can understand why certain properties of Cassandra may be attractive
  • just a web server and a database; nothing special
  • so, your data starts growing
    what do you do?
    move your tables to different DB servers
  • ok, so what happens when one table grows too large?
    shard the table across the DB cluster (see the routing sketch below)

    problem:
    the data access API just got fatter
    now the client needs to know which shard to hit for a given read/write

    problem:
    you’re pushing data store-specific logic up into your client layer
    not the best abstraction
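
A minimal sketch, in Python, of the shard-routing logic that ends up in the client layer, using the PK ranges from the slide; the hostnames and the shard_for helper are hypothetical, just to make the point concrete:

```python
# Hypothetical shard-routing logic that every client now has to carry.
# The PK ranges come from the slide; the hostnames are made up.
SHARDS = [
    # (lower bound inclusive, upper bound exclusive, DB host)
    (0,      10_000, "db-item-0"),
    (10_000, 20_000, "db-item-1"),
    (20_000, 30_000, "db-item-2"),
]

def shard_for(primary_key: int) -> str:
    """Return the DB host that owns this primary key."""
    for lower, upper, host in SHARDS:
        if lower <= primary_key < upper:
            return host
    raise KeyError(f"no shard owns key {primary_key}")

# every client, in every language, repeats this routing step
# before it can issue a single read or write
print(shard_for(12_345))   # -> "db-item-1"
```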

  • the problem gets compounded w/ multiple client languages

    what do you do?
    1) replicate the logic in all languages?
    2) write a C library w/ bindings for every language?
  • [5m]


  • examples:

    consistency:
    when you write a value to the cluster,
    will the next read return the most up-to-date value?

    availability:
    if a subset of nodes goes down,
    can you still read or write a given key?
  • so, let’s think back to the sharded DB example
    when you write to a shard, you’ll get the most recent value on the next read

    however:
    each shard is a SPOF
    no replication
  • with master-slave replication, reads are now replicated

    however, writes still have:
    a SPOF
    a bottleneck on 1 server (you can’t write to just any node in the replica set)
  • avoid SPOFs:
    machines fail

    depending on your use case, it may be advantageous to be able to write to multiple nodes in the replica set
    if you’re read-bound, then this probably doesn’t matter
    but, if you’re write-bound, it’s important
  • so, how do we achieve availability?
    it’s easy to think about writes
    pretty straightforward
    write to one of the replicas in the replica set

  • it’s harder to propagate that write to the other nodes in the replica set
    non-trivial
  • [10m]

  • the eventual consistency strategies separate into 2 sections:
    write-related and read-related




  • first situation:
    part of replica set is still available
  • second situation:
    all nodes in the replica set are down

    so, what happens?

    let’s first talk about the distinction between the coordinator and the nodes in the replica set.
    basically, a client can talk to any node in the cassandra cluster
    and that node becomes the coordinator for that request,
    making the appropriate calls to the other nodes in the cluster that are part of the replica set for the given key

    so, getting back: what happens when all of the replica nodes are down?
    in this case, the coordinator node is the closest live node, so it’ll write the hint locally
  • and, naturally, when a node holding hinted writes learns that the target node is back up,
    it’ll deliver the hinted writes it has for that target (see the sketch below)
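
A toy Python sketch of the two hinted hand-off situations described above; the classes and in-memory structures are illustrative, not Cassandra’s actual code paths:

```python
# Toy model of hinted hand-off: write to live replicas, hint for dead ones.
from collections import defaultdict

class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}                    # key -> value
        self.hints = defaultdict(list)    # target node name -> [(key, value), ...]

def coordinator_write(coordinator, replicas, key, value):
    """Write to the live replicas; keep a hint for each dead one."""
    live = [r for r in replicas if r.up]
    for replica in live:
        replica.data[key] = value
    for dead in (r for r in replicas if not r.up):
        # situation 1: part of the replica set is up -> a live replica holds the hint
        # situation 2: the whole replica set is down -> the coordinator holds it itself
        holder = live[0] if live else coordinator
        holder.hints[dead.name].append((key, value))

def deliver_hints(holder, target):
    """When the holder learns the target is back up, replay its hinted writes."""
    if target.up:
        for key, value in holder.hints.pop(target.name, []):
            target.data[key] = value
```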


  • more nodes: great for the virality of your product,
    bad for your network load (all-to-all pings scale quadratically w/ cluster size)

  • gossip protocols (in general):
    randomly choose a node to exchange state w/ (see the sketch below)
    expectation: updates spread in time logarithmic in the # of nodes in the cluster

    anti-entropy protocol:
    gossip information until it’s made obsolete by newer info

    compare: rumor-mongering protocol:
    only gossips a piece of state for a limited amount of time, long enough that the change has likely propagated to all nodes in the cluster

    note:
    cassandra uses an anti-entropy protocol, because of the failure detector
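
A minimal sketch of one anti-entropy gossip round, assuming each node’s state is a map of key to (value, version); the function names and state layout are illustrative:

```python
import random

def merge(local, remote):
    """Anti-entropy merge: keep the entry with the higher version for each key."""
    for key, (value, version) in remote.items():
        if key not in local or local[key][1] < version:
            local[key] = (value, version)

def gossip_round(states):
    """Each node exchanges state with one randomly chosen peer."""
    nodes = list(states)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merge(states[node], states[peer])
        merge(states[peer], states[node])

# an update is expected to reach all N nodes in O(log N) rounds
states = {n: {} for n in ["a", "b", "c", "d"]}
states["a"]["heartbeat:a"] = ("alive", 1)   # node a knows its own heartbeat
gossip_round(states)
```
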
  • failure detector:
    acts as an oracle for the node
    the node consults the FD
    the FD returns a suspicion level of whether a given node is up or down

    FD:
    maintains a sliding window of the most recent heartbeats from a given node
    the sliding window is used to estimate the arrival time of the next heartbeat
    the distribution of past samples is used as an approximation of the probabilistic distribution of future heartbeat messages
    (cassandra uses an exponential distribution)
    as the next heartbeat takes longer and longer to arrive, the suspicion level that the node is down increases (see the sketch below)
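
A sketch of the phi-accrual idea, assuming (as noted above) an exponential distribution over heartbeat inter-arrival times; the class name and window size are illustrative:

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Suspicion level grows the longer the next heartbeat is overdue."""

    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)   # sliding window of inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat arrival time (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Return the suspicion level for this node."""
        if not self.intervals:
            return 0.0
        mean_interval = sum(self.intervals) / len(self.intervals)
        overdue = now - self.last_heartbeat
        # probability that the next heartbeat arrives even later than this,
        # under an exponential distribution fit to the observed mean interval
        p_later = math.exp(-overdue / mean_interval)
        return -math.log10(p_later)

fd = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    fd.heartbeat(t)
print(fd.phi(now=3.5))   # small: the heartbeat is only slightly overdue
print(fd.phi(now=9.0))   # larger: suspicion grows as the heartbeat gets more overdue
```

In practice a node compares phi against some configured threshold to decide how strongly to suspect that the other node is down.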




  • quorum consistency level:
    quorum = a majority of the replicas (here: 2 of 3)

    requires:
    a quorum write

    so that a quorum read will catch at least 1 node w/ the most recent value (see the worked example below)
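
A small worked example of the quorum arithmetic, in Python; the values and timestamps mirror the read-repair example later in the deck:

```python
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1   # majority: 2 when the replication factor is 3

N = 3
W = R = quorum(N)        # write acknowledged by 2 replicas, read from 2 replicas
assert R + W > N         # so the read set and write set overlap in at least 1 replica

# e.g. one replica still holds the old value, another holds the new one:
replies = [("x", 0), ("y", 1)]                  # (value, timestamp) from a quorum read
newest = max(replies, key=lambda reply: reply[1])
print(newest)            # -> ('y', 1): the read is guaranteed to see the latest write
```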














  • example:
    let’s say we receive the merkle trees from two different replicas
    if the root hashes of both trees match,
    we can be reasonably sure that both replicas are consistent (see the sketch below)
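
A minimal Merkle-tree sketch in Python, assuming the data is split into fixed blocks; the helper names are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Hash each data block, then hash pairs of hashes up to a single root."""
    level = [h(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last hash if the count is odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# matching roots -> the replicas' data is (with overwhelming probability) identical
print(merkle_root([b"key1:x", b"key2:y"]) == merkle_root([b"key1:x", b"key2:y"]))   # -> True
```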


  • each node creates a merkle tree,
    then exchanges it w/ the other replicas
  • on the initiating replica (replica A),
    we compare the local MT against the MT from replica B
  • replica A will send the inconsistent data to replica B (see the diff sketch below)
    (note: replica B will compare A’s MT the same way and send the same range of keys back to A)

    implementation detail:
    replica A actually builds a repair SSTable for replica B (that only includes the inconsistent keys)
    then streams it over to replica B
    replica B drops the streamed SSTable directly onto disk

    possible:
    talk about the relationship between Memtables and SSTables and how cassandra writes / reads data
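
A sketch of walking two Merkle trees to find the leaf ranges that need repair; the nested-tuple tree representation is illustrative, not Cassandra’s:

```python
def diff_leaves(local, remote, path=()):
    """Return the positions of leaves whose hashes differ between two trees.

    Each tree node is (hash, children); a leaf has an empty children list.
    """
    local_hash, local_children = local
    remote_hash, remote_children = remote
    if local_hash == remote_hash:
        return []                       # whole subtree is consistent; skip it
    if not local_children:              # mismatched leaf: this key range needs repair
        return [path]
    mismatches = []
    for i, (lc, rc) in enumerate(zip(local_children, remote_children)):
        mismatches.extend(diff_leaves(lc, rc, path + (i,)))
    return mismatches

# only the second leaf's hashes differ, so only that key range would be
# streamed (as a repair SSTable, per the note above) to the other replica
tree_a = ("root-a", [("hash-d", []), ("hash-f", [])])
tree_b = ("root-b", [("hash-d", []), ("hash-g", [])])
print(diff_leaves(tree_a, tree_b))      # -> [(1,)]
```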


  • Strategies for Distributed Data Storage

    1. Cassandra: Strategies for Distributed Data Storage
    2. I: Fat Clients are Expensive / II: Availability vs. Consistency / III: Strategies for Eventual Consistency
    3. I: Fat Clients are Expensive
    4. In the Beginning... Simple: 1 web server, 1 database [diagram: Web -> thin Data API -> DB]
    5. Your Data Grows... Move tables to different DBs. [diagram: Web -> Data API -> user DB, item DB]
    6. A table grows too large... Shard the table by PK ranges. [diagram: item shards 0, 1, 2 with PK ranges [0, 10k), [10k, 20k), [20k, 30k)]
    7. Problem: Multiple Client Languages [diagram: python, ruby, and java clients each carrying their own Data API]
    8. Are there other trade-offs?
    9. II: Availability vs. Consistency
    10. Why consistency vs. availability? CAP Theorem
    11. CAP Theorem: you can have at most two of these properties in a shared-data system: Consistency, Availability, Partition-Tolerance
    12. Problem: a sharded DB cluster favors C over A. [diagram: each DB shard is a SPOF, no replication]
    13. Slightly better with master-slave replication... [diagram: writes still hit a SPOF, bottlenecked master; reads are replicated to slaves]
    14. Availability Arguments: avoid SPOFs; distribute writes to all nodes in the replica set
    15. Availability, Easy: Write [diagram: coordinator writes value "x" to one of replicas A, B, C]
    16. Availability, Harder: Consistency Across Replicas [diagram: value "x" must end up on replicas A, B, and C]
    17. So, how do we achieve consistency?
    18. III: Strategies for Eventual Consistency
    19. I: Write-Related Strategies / II: Read-Related Strategies
    20. Write-Related Strategies: I: Hinted Hand-Off / II: Gossip
    21. I: Hinted Hand-Off
    22. Hinted Hand-Off, Problem: Write to an Unavailable Node
    23. Hinted Hand-Off, Solution: 1) "hinted" write to a live node 2) deliver hints when the node is reachable
    24. Hinted Hand-Off, Step 1: "hinted" write to a live node (part of the replica set is available) [diagram: target replica A is dead; the coordinator sends the "hinted" write to B, the nearest live replica]
    25. Hinted Hand-Off, Step 1: "hinted" write to a live node (all replica nodes unreachable) [diagram: replicas A, B, C are all dead; the coordinator is the closest node and keeps the "hinted" write itself]
    26. Hinted Hand-Off, Step 2: deliver hints when the node is reachable [diagram: the node holding hints delivers the "hinted" writes to the target replica, now available]
    27. How does a node learn when another node is available?
    28. II: Gossip
    29. Gossip, Problem: each node cannot scalably ping every other node. 8 nodes: 8² = 64; 100 nodes: 100² = 10,000
    30. Gossip, Solution: I: Anti-Entropy Gossip Protocol / II: Phi-Accrual Failure Detector
    31. Gossip: Anti-Entropy Gossip Protocol [diagram: two nodes exchanging state]
    32. Gossip: Phi-Accrual Failure Detector: dynamically adjusts its "suspicion" level of another node, based on inter-arrival times of gossip messages.
    33. Read-Related Strategies: I: Read-Repair / II: Anti-Entropy Service
    34. I: Read-Repair
    35. Read-Repair, Problem: A Write Has Not Propagated to All Replicas
    36. Read-Repair, Solution: Repair Outdated Replicas After Read
    37. Read-Repair, Example: Quorum Read, Replication Factor: 3
    38. Read-Repair, Steps: 1) do a digest-based read (if the digests match, return the value) 2) otherwise, do a full read and repair the replicas
    39. Read-Repair, Step 1: do a digest-based read; one full read, the other reads are digests [diagram: the coordinator gets the full value F from replica A and digests D from replicas B and C]
    40. Read-Repair, Step 1: do a digest-based read; wait for 2 replies (where one is the full read)
    41. Read-Repair, Step 1: do a digest-based read; return the value to the client (if all digests match): D == digest(F)
    42. Read-Repair, Step 2: do a full read and repair the replicas; full read from all replicas
    43. Read-Repair, Step 2: do a full read and repair the replicas; wait for 2 replies
    44. Read-Repair, Step 2: calculate the newest value from the replies: replica A: "x" @ t0; replica B: "y" @ t1; reconciled: "y" @ t1
    45. Read-Repair, Step 2: return the newest value to the client (the coordinator returns the reconciled value)
    46. Read-Repair, Step 2: calculate the repair mutation for each replica: diff(reconciled value, replica value) = repair mutation. Repair for replica A: diff("y" @ t1, "x" @ t0) = "y" @ t1. Repair for replica B: diff("y" @ t1, "y" @ t1) = null
    47. Read-Repair, Step 2: send the repair mutation to each replica
    48. What about values that have not been read?
    49. II: Anti-Entropy Service
    50. Anti-Entropy Service, Problem: How to Repair Unread Values
    51. Anti-Entropy Service, Solution: 1) detect inconsistency via Merkle Trees 2) repair the inconsistent data
    52. Anti-Entropy Service: Merkle Tree: a tree where a node's hash summarizes the hashes of its children [diagram: root A summarizes its children's hashes; internal nodes B, C summarize their children's hashes; leaves D, E, F, G are hashes of data blocks]
    53. Anti-Entropy Service, Step 1: detect inconsistency; create Merkle Trees on all replicas [diagram: A requests Merkle Tree creation; each replica creates a local Merkle Tree]
    54. Anti-Entropy Service, Step 1: detect inconsistency; exchange Merkle Trees across all replicas
    55. Anti-Entropy Service, Step 1: detect inconsistency; compare the local and remote Merkle Trees [diagram: replica A vs. replica B; leaves D and E match, leaf F mismatches]
    56. Anti-Entropy Service, Step 2: repair the inconsistent data; send the repair to the remote replica [diagram: A sends B the repair for the data hashed by leaf F]
    57. Any Questions?
    58. More Information: Cassandra site: http://cassandra.apache.org/ My email address: kakugawa@gmail.com
