Dynamo cassandra



An introduction to Amazon's Dynamo architecture and Apache Cassandra

Published in: Education, Technology


  1. 1. Cassandra<br />chengxiaojun<br />1<br />
  2. 2. Backend<br />Amazon Dynamo<br />Facebook Cassandra (Dynamo 2.0)<br />Inbox search<br />Apache<br />2<br />
  3. 3. Cassandra<br />Dynamo-like features<br />Symmetric, P2P architecture<br />Gossip-based cluster management<br />DHT<br />Eventual consistency<br />Bigtable-like features<br />Column family<br />SSTable disk storage<br />Commit log<br />Memtable<br />Immutable SSTable files<br />3<br />
  4. 4. Data Model(1/2)<br />A table is a distributed multi dimensional map indexed by a key<br />Keyspace<br />Column<br />Super Column<br />Column Family Types<br />4<br />
  5. 5. Data Model(2/2)<br />5<br />
  6. 6. APIs<br />Paper:<br />insert(table; key; rowMutation)<br />get(table; key; columnName)<br />delete(table; key; columnName)<br />Wiki:<br />http://wiki.apache.org/cassandra/API<br />6<br />
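The three calls from the paper can be pictured as operations on a nested map. The following is a minimal in-memory sketch, not Cassandra's implementation: `ToyTable` is a hypothetical name, the explicit `timestamp` argument is an assumption added for illustration, and `delete` simply drops the column rather than writing a tombstone.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in: row key -> column name -> (value, timestamp)."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, row_mutation, timestamp):
        # row_mutation is a dict of column name -> value.
        for column, value in row_mutation.items():
            current = self.rows[key].get(column)
            # Keep the cell with the newest timestamp.
            if current is None or timestamp >= current[1]:
                self.rows[key][column] = (value, timestamp)

    def get(self, key, column_name):
        cell = self.rows.get(key, {}).get(column_name)
        return None if cell is None else cell[0]

    def delete(self, key, column_name):
        # Simplification: remove the cell outright instead of marking it deleted.
        self.rows.get(key, {}).pop(column_name, None)
```

A write with an older timestamp does not overwrite a newer cell, which is the behaviour the sketch is meant to show.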
  7. 7. Architecture Layers<br />7<br />
  8. 8. Partition(1/3)<br />Consistent Hash Table<br />8<br />
  9. 9. Partition(2/3)<br />Problems:<br />The random position assignment of each node on the ring leads to non-uniform data and load distribution.<br />The basic algorithm is oblivious to the heterogeneity in the performance of nodes.<br />Two Ways:<br />Dynamo<br />One node is assigned to multiple positions in the circle<br />Cassandra<br />Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes.<br />9<br />
  10. 10. Partition(3/3)<br />Each Cassandra server [node] is assigned a unique Token that determines what keys it is the first replica for.<br />Choice<br />InitialToken: assigned explicitly<br />RandomPartitioner: Tokens are integers from 0 to 2**127. Keys are converted to this range by MD5 hashing for comparison with Tokens.<br />NetworkTopologyStrategy: calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on.<br />10<br />
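A rough sketch of how a key finds its first replica on a token ring. `token_for`, `Ring`, `owner_of_token` and `primary` are hypothetical names, and reducing the MD5 digest modulo 2**127 is a simplifying assumption, not necessarily how RandomPartitioner maps digests into its range.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 127


def token_for(key: str) -> int:
    # MD5 yields 128 bits; fold it into the token range (assumed simplification).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: dict of node name -> unique token.
        self.entries = sorted((tok, node) for node, tok in node_tokens.items())
        self.tokens = [tok for tok, _ in self.entries]

    def owner_of_token(self, token):
        # First node whose token is >= the given token, wrapping past the end.
        i = bisect.bisect_left(self.tokens, token) % len(self.tokens)
        return self.entries[i][1]

    def primary(self, key):
        return self.owner_of_token(token_for(key))
```

With nodes at tokens 10, 20 and 30, a key hashing to 15 lands on the node at 20, and a key hashing to 31 wraps around to the node at 10.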
  11. 11. Replication(1/4)<br />high availability and durability<br />replication_factor:N<br />11<br />
  12. 12. Replication(2/4)<br />Strategy<br />Rack Unaware<br />Rack Aware<br />Datacenter Aware<br />…<br />12<br />
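For the Rack Unaware strategy, the Cassandra paper describes picking the N-1 successors of the coordinator on the ring. A minimal sketch of that idea follows; `rack_unaware_replicas` is a hypothetical helper that takes the nodes already sorted in token order.

```python
def rack_unaware_replicas(ring_nodes, primary_index, n):
    """Return the primary plus its next n-1 successors, wrapping around the ring.

    ring_nodes: node names in token order (assumed input shape).
    """
    count = min(n, len(ring_nodes))  # cannot place more replicas than nodes
    return [ring_nodes[(primary_index + i) % len(ring_nodes)] for i in range(count)]
```

Rack-aware and datacenter-aware strategies would replace the plain "next node" rule with one that skips nodes in an already-used rack or datacenter; that logic is not shown here.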
  13. 13. Replication(3/4)<br />The Cassandra system elects a leader amongst its nodes using a system called ZooKeeper<br />All nodes, on joining the cluster, contact the leader, who tells them which ranges they are replicas for<br />The leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring.<br />The metadata about the ranges a node is responsible for is cached locally at each node and in a fault-tolerant manner inside ZooKeeper<br />This way a node that crashes and comes back up knows what ranges it was responsible for.<br />13<br />
  14. 14. Replication(4/4)<br />Cassandra provides durability guarantees in the presence of node failures and network partitions by relaxing the quorum requirements<br />14<br />
  15. 15. Data Versioning<br />Vector clocks<br />15<br />
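A small sketch of how two vector clocks can be compared to decide whether one version supersedes the other or whether they are concurrent. Clocks are modelled as plain dicts of node name to counter; `descends` and `compare` are hypothetical helper names.

```python
def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    a_ge_b = descends(a, b)
    b_ge_a = descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a supersedes b"
    if b_ge_a:
        return "b supersedes a"
    # Neither clock dominates: the versions are causally unrelated.
    return "concurrent"
```

Two versions that come back as concurrent are the ones a client would need to reconcile.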
  16. 16. Consistency<br />W + R > N<br />16<br />
  17. 17. Consistency<br />put():<br />The coordinator generates the vector clock for the new version and writes the new version locally.<br />The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes.<br />If at least W-1 nodes respond then the write is considered successful.<br />get():<br />The coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and then waits for R responses before returning the result to the client.<br />If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled, and the reconciled version superseding the current versions is written back.<br />17<br />
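The W + R > N condition guarantees that every read quorum shares at least one replica with every write quorum. A brute-force check over small replica sets illustrates this; `quorums_always_overlap` is a hypothetical helper written only for this demonstration.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every set of w replicas intersects every set of r replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

For N = 3, the choice W = 2, R = 2 always overlaps, while W = 1, R = 1 does not, so a read may miss the latest write.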
  18. 18. Handling Temporary Failures<br />Hinted handoff<br />If node A is temporarily down or unreachable during a write operation, then a replica that would normally have lived on A will now be sent to node D.<br />The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A).<br />Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A.<br />Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.<br />
  19. 19. Handling permanent failures<br />Replica synchronization: anti-entropy<br />To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees.<br />
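The hinted-handoff flow above can be sketched with two toy nodes. Everything here (`Node`, `write_replica`, `deliver_hints`, the `up` flag) is hypothetical scaffolding for illustration; a real system detects recovery through its failure detector rather than a boolean.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # separate store: (intended node name, key, value)


def write_replica(key, value, intended, fallback):
    if intended.up:
        intended.data[key] = value
    else:
        # Park the replica on the fallback node, tagged with its real owner.
        fallback.hints.append((intended.name, key, value))


def deliver_hints(holder, nodes_by_name):
    """Periodic scan: hand hinted replicas back to owners that have recovered."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = nodes_by_name[target_name]
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining  # delivered hints are dropped from the local store
```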
  20. 20. Cassandra Consistency For Read<br />20<br />
  21. 21. Cassandra Consistency For Write<br />21<br />
  22. 22. Cassandra Read Repair<br />Cassandra repairs data in two ways:<br />Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.<br />Anti-Entropy: when nodetool repair is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out-of-sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk I/O and CPU, because it scans ALL the data on the machine (but it is very network efficient).<br />22<br />
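To illustrate why comparing Merkle trees is network efficient, here is a minimal sketch that builds a hash tree over a power-of-two number of data blocks and descends only into subtrees whose hashes differ. The function names and the use of SHA-256 are assumptions for the example, not Cassandra's actual tree layout.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """leaves: list of bytes whose length is a power of two.

    Returns a list of levels: level 0 holds leaf hashes, the last level the root.
    """
    levels = [[_h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([_h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(a, b):
    """Indices of leaves that differ between two trees of the same shape."""
    top = len(a) - 1
    if a[top][0] == b[top][0]:
        return []  # identical roots: nothing to exchange
    suspects = [0]
    for level in range(top - 1, -1, -1):
        suspects = [
            child
            for i in suspects
            for child in (2 * i, 2 * i + 1)
            if a[level][child] != b[level][child]
        ]
    return suspects
```

Only the hashes along differing paths need to be compared, so two replicas that agree on almost everything exchange very little.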
  23. 23. Bootstrapping<br />New node<br />Position<br />specify an InitialToken<br />pick a Token that will give it half the keys from the node with the most disk space used<br />Note:<br />You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap<br />Relating to point 1, one can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster.<br />As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node.<br />When bootstrapping a new node, existing nodes have to divide the key space before beginning replication.<br />During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool<br />Bootstrap can take many hours when a lot of data is involved<br />23<br />
  24. 24. Moving or Removing nodes<br />Remove nodes<br />Live node: nodetool decommission<br />the data will stream from the decommissioned node<br />Dead node: nodetool removetoken<br />the data will stream from the remaining replicas<br />Move nodes<br />nodetool move: decommission + bootstrap<br />Load balancing<br />If you add nodes to your cluster, your ring will be unbalanced, and the only way to get perfect balance is to compute new tokens for every node and assign them to each node manually using the nodetool move command.<br />24<br />
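A commonly used way to compute evenly spaced tokens for a RandomPartitioner ring is to divide the 0 to 2**127 range into equal steps. The helper below is a sketch under that assumption; `balanced_tokens` is a hypothetical name, and the resulting values would still have to be applied node by node with nodetool move.

```python
def balanced_tokens(node_count):
    """Evenly spaced tokens over the 0..2**127 range, one per node."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]
```

For a four-node cluster this yields 0, 2**125, 2**126 and 3 * 2**125.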
  25. 25. Membership<br />Scuttlebutt<br />Based on Gossip<br />efficient CPU utilization <br />efficient utilization of the gossip channel<br />anti-entropy Gossip<br />Paper:Efficient Reconciliation and Flow Control for Anti-Entropy Protocols<br />25<br />
  26. 26. Failure Detection<br />The φ Accrual Failure Detector<br />Idea: the failure detection module doesn't emit a Boolean value stating a node is up or down. Instead, the failure detection module emits a value which represents a suspicion level for each of the monitored nodes<br />26<br />
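A sketch of how a suspicion level can be computed. In the accrual detector, φ is defined as -log10 of the probability that a heartbeat will still arrive later than the time already elapsed. The code below assumes heartbeat inter-arrival times follow an exponential distribution, which gives a closed form; `phi` and its parameters are hypothetical names.

```python
import math


def phi(time_since_last_heartbeat, intervals):
    """Suspicion level under an exponential inter-arrival assumption.

    P(next heartbeat arrives later than t) = exp(-t / mean), so
    phi = -log10(exp(-t / mean)) = t / (mean * ln 10).
    """
    mean = sum(intervals) / len(intervals)
    return time_since_last_heartbeat / (mean * math.log(10))
```

The value grows the longer a node stays silent, and the application picks a threshold above which it treats the node as down.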
  27. 27. Local Persistence(1/4)<br />Write Operation:<br />1. write into a commit log<br />2. an update into an in-memory data structure<br />3. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk<br />Read Operation:<br />1. query the in-memory data structure<br />2. look into the files on disk in the order of newest to oldest<br />3. combine<br />27<br />
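The write and read steps above can be sketched with a toy store. `ToyStore` and its `flush_threshold` (counted in keys here, for simplicity) are hypothetical, and the read path returns the newest whole value rather than combining columns from several files.

```python
class ToyStore:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory structure
        self.sstables = []     # flushed tables, oldest first, never modified
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log first
        self.memtable[key] = value             # 2. then the in-memory structure
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                       # 3. dump to "disk" past the threshold

    def flush(self):
        # Store the memtable contents sorted by key, then start afresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()

    def read(self, key):
        if key in self.memtable:               # 1. in-memory structure
            return self.memtable[key]
        for table in reversed(self.sstables):  # 2. files, newest to oldest
            if key in table:
                return table[key]
        return None
```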
  28. 28. Local Persistence(2/4)<br />Commit log<br />all writes into the commit log are sequential<br />Fixed size <br />Create/delete<br />Durability and recoverability<br />28<br />
  29. 29. Local Persistence(3/4)<br />Memtable<br />Per column family<br />a write-back cache of data rows that can be looked up by key<br />sorted by key<br />29<br />
  30. 30. Local Persistence(4/4)<br />SSTable<br />Flushing<br />Once flushed, SSTable files are immutable; no further writes may be done.<br />Compaction<br />merging multiple old SSTable files into a single new one<br />Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random I/O.<br />Once compaction is finished, the old SSTable files may be deleted<br />Discard tombstones<br />Index<br />All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file<br />In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory.<br />In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval.<br />30<br />
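Because every SSTable is sorted by key, compaction can be written as a streaming merge. The sketch below uses Python's `heapq.merge` and a hypothetical `TOMBSTONE` marker; it discards tombstones immediately, which is a simplification, since a real system keeps them for a grace period before dropping them.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker for a deleted key


def compact(sstables):
    """Merge key-sorted tables (oldest first) into one sorted table.

    Each table is a list of (key, value) pairs with unique keys.
    """
    # Tag entries with the negated table age so that, for equal keys,
    # the entry from the newest table comes out of the merge first.
    streams = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(sstables)
    ]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams):
        if key == last_key:
            continue  # an older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

The merge reads each input once from start to end, which is why no random I/O is needed.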
  31. 31. Facebook inbox search<br />31<br />Key: userN<br />
  32. 32. Reference<br />http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf<br />http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf<br />http://wiki.apache.org/cassandra/FrontPage<br />32<br />