Dynamo Cassandra
An introduction to Amazon's Dynamo architecture and Apache Cassandra

Presentation Transcript

  • Cassandra
    chengxiaojun
    1
  • Backend
    Amazon Dynamo
    Facebook Cassandra (Dynamo 2.0)
    Inbox search
    Apache
    2
  • Cassandra
    Dynamo-like features
    Symmetric, P2P architecture
    Gossip-based cluster management
    DHT
    Eventual consistency
    Bigtable-like features
    Column family
    SSTable disk storage
    Commit log
    Memtable
    Immutable SSTable files
    3
  • Data Model(1/2)
    A table is a distributed multi-dimensional map indexed by a key
    Keyspace
    Column
    Super Column
    Column Family Types
    4
  • Data Model(2/2)
    5
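
    The data model above can be pictured as nested maps. A minimal Python sketch, purely for illustration (the keyspace and column-family names, the (value, timestamp) tuples and the "Super" grouping are invented and are not Cassandra's real storage classes):

    from time import time

    keyspace = {
        "Inbox": {                        # column family of type Super
            "user_17": {                  # row key
                "conversations": {        # super column grouping sub-columns
                    "msg_001": ("hello", time()),       # column: (value, timestamp)
                    "msg_002": ("re: hello", time()),
                },
            },
        },
        "Users": {                        # column family of type Standard
            "user_17": {
                "name": ("Alice", time()),
                "email": ("alice@example.com", time()),
            },
        },
    }

    # A lookup walks the maps: keyspace -> column family -> row key -> column.
    value, ts = keyspace["Users"]["user_17"]["name"]
    print(value, ts)
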
  • APIs
    Paper:
    insert(table; key; rowMutation)
    get(table; key; columnName)
    delete(table; key; columnName)
    Wiki:
    http://wiki.apache.org/cassandra/API
    6
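
    A toy, dict-backed sketch of the three calls quoted from the paper; the real API is exposed over Thrift, so this stand-in only mimics the call shapes, and the helper bodies are invented:

    store = {}   # {table: {key: {columnName: value}}}

    def insert(table, key, row_mutation):
        """Apply a rowMutation, represented here simply as a dict of column -> value."""
        store.setdefault(table, {}).setdefault(key, {}).update(row_mutation)

    def get(table, key, column_name):
        return store.get(table, {}).get(key, {}).get(column_name)

    def delete(table, key, column_name):
        store.get(table, {}).get(key, {}).pop(column_name, None)

    insert("Users", "user_17", {"name": "Alice"})
    print(get("Users", "user_17", "name"))    # Alice
    delete("Users", "user_17", "name")
    print(get("Users", "user_17", "name"))    # None
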
  • Architecture Layers
    7
  • Partition(1/3)
    Consistent Hash Table
    8
  • Partition(2/3)
    Problems:
    the random position assignment of each node on the ring leads to non-uniform data and load distribution
    the basic algorithm is oblivious to the heterogeneity in the performance of nodes.
    Two Ways:
    Dynamo
    One node is assigned to multiple positions in the circle
    Cassandra
    Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes.
    9
  • Partition(3/3)
    Each Cassandra server [node] is assigned a unique Token that determines what keys it is the first replica for.
    Choice
    InitialToken: assigned
    RandomPartitioner: Tokens are integers from 0 to 2**127. Keys are converted to this range by MD5 hashing for comparison with Tokens.
    NetworkTopologyStrategy: calculate the tokens for the nodes in each DC independently. Tokens still need to be unique, so you can add 1 to the tokens in the 2nd DC, add 2 in the 3rd, and so on.
    10
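
    A rough sketch of RandomPartitioner-style placement under the assumptions above: node Tokens are integers in [0, 2**127), keys are mapped into the same range with MD5, and the first replica for a key is found by a clockwise search; the three evenly spaced nodes are hypothetical:

    import hashlib
    from bisect import bisect_left

    RING_SIZE = 2 ** 127

    def key_to_token(key):
        # RandomPartitioner compares keys with Tokens via their MD5 hash.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

    # Hypothetical 3-node cluster with evenly spaced InitialTokens.
    ring = sorted((i * RING_SIZE // 3, f"node{i}") for i in range(3))
    positions = [token for token, _ in ring]

    def first_replica(key):
        """A node's Token makes it the first replica for keys hashing into
        (previous Token, its Token]; find that node with a clockwise search."""
        idx = bisect_left(positions, key_to_token(key)) % len(ring)
        return ring[idx][1]

    print(first_replica("user_17"))
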
  • Replication(1/4)
    high availability and durability
    replication_factor:N
    11
  • Replication(2/4)
    Strategy
    Rack Unaware
    Rack Aware
    Datacenter Aware

    12
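
    A sketch of the Rack Unaware strategy under the same clockwise-walk model as the partitioning sketch above: the key's first replica plus the next N-1 distinct nodes on the ring; Rack Aware and Datacenter Aware would additionally skip nodes until another rack or data center is covered. The 5-node ring and its tokens are made up:

    from bisect import bisect_left

    def replicas_for(key_token, ring, n):
        """Rack Unaware placement: walk clockwise from the key's position and
        take the next n distinct nodes as replicas."""
        positions = [token for token, _ in ring]
        start = bisect_left(positions, key_token) % len(ring)
        return [ring[(start + i) % len(ring)][1] for i in range(min(n, len(ring)))]

    ring = [(0, "A"), (20, "B"), (40, "C"), (60, "D"), (80, "E")]
    print(replicas_for(35, ring, 3))   # ['C', 'D', 'E'] with replication_factor N = 3
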
  • Replication(3/4)
    The Cassandra system elects a leader amongst its nodes using a system called Zookeeper.
    All nodes, on joining the cluster, contact the leader, who tells them what ranges they are replicas for.
    The leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring.
    The metadata about the ranges a node is responsible for is cached locally at each node and, in a fault-tolerant manner, inside Zookeeper.
    This way a node that crashes and comes back up knows what ranges it was responsible for.
    13
  • Replication(4/4)
    Cassandra provides durability guarantees in the presence of node failures and network partitions by relaxing the quorum requirements
    14
  • Data Versioning
    Vector clocks
    15
  • Consistency
    W + R > N
    e.g. with N = 3 replicas, a write quorum W = 2 and a read quorum R = 2 guarantee that every read overlaps at least one up-to-date replica
    16
  • Consistency
    put():
    the coordinator generates the vector clock for the new version and writes the new version locally.
    The coordinator then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes.
    If at least W-1 nodes respond then the write is considered successful.
    get()
    the coordinator requests all existing versions of data for that key from the N highest-ranked reachable nodes in the preference list for that key, and waits for R responses before returning the result to the client.
    If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated. The divergent versions are then reconciled and the reconciled version superseding the current versions is written back.
    17
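
    A compressed sketch of the coordinator behaviour described above. It is synchronous and uses a plain version number instead of a vector clock, so it only illustrates the W/R quorum counting, not Dynamo's actual reconciliation:

    N, W, R = 3, 2, 2                       # W + R > N, so read and write quorums overlap

    replicas = [dict() for _ in range(N)]   # stand-ins for the N highest-ranked nodes

    def put(key, value, version):
        acks = 0
        for node in replicas:
            node[key] = (value, version)    # send the new version to each replica
            acks += 1
        return acks >= W                    # successful once at least W replicas acknowledged

    def get(key):
        responses = [node[key] for node in replicas if key in node][:R]
        if len(responses) < R:
            raise RuntimeError("fewer than R replicas answered")
        # Reconciliation stand-in: return the value with the highest version seen.
        return max(responses, key=lambda vv: vv[1])[0]

    put("user_17", "inbox-v1", version=1)
    put("user_17", "inbox-v2", version=2)
    print(get("user_17"))                   # inbox-v2
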
  • Handling Temporary Failures
    Hinted handoff
    if node A is temporarily down or unreachable during a write operation then a replica that would normally have lived on A will now be sent to node D.
    The replica sent to D will have a hint in its metadata that suggests which node was the intended recipient of the replica (in this case A).
    Nodes that receive hinted replicas will keep them in a separate local database that is scanned periodically. Upon detecting that A has recovered, D will attempt to deliver the replica to A.
    Once the transfer succeeds, D may delete the object from its local store without decreasing the total number of replicas in the system.
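
    A sketch of the hinted-handoff bookkeeping described above, with invented Node objects: D stores the replica together with a hint naming A in a separate local store, and hands it back once A is reachable again:

    class Node:
        def __init__(self, name):
            self.name = name
            self.up = True
            self.data = {}     # normal replicas
            self.hints = []    # (intended_recipient, key, value), kept in a separate local store

        def write(self, key, value, hint_for=None):
            if hint_for is None:
                self.data[key] = value
            else:
                self.hints.append((hint_for, key, value))

        def deliver_hints(self):
            """Periodic scan: hand each hinted replica back to its intended recipient."""
            for target, key, value in list(self.hints):
                if target.up:
                    target.data[key] = value
                    self.hints.remove((target, key, value))   # safe to drop after the transfer

    a, d = Node("A"), Node("D")
    a.up = False                        # A is temporarily unreachable during the write
    d.write("row1", "value", hint_for=a)
    a.up = True                         # A recovers
    d.deliver_hints()
    print(a.data)                       # {'row1': 'value'}
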
  • Handling permanent failures
    Replica synchronization: anti-entropy
    Merkle trees are used to detect inconsistencies between replicas faster and to minimize the amount of transferred data
  • Cassandra Consistency For Read
    20
  • Cassandra Consistency For Write
    21
  • Cassandra Read Repair
    Cassandra repairs data in two ways:
    Read Repair: every time a read is performed, Cassandra compares the versions at each replica (in the background, if a low consistency level was requested by the reader to minimize latency), and the newest version is sent to any out-of-date replicas.
    Anti-Entropy: when nodetool repair is run, Cassandra computes a Merkle tree for each range of data on that node, and compares it with the versions on other replicas, to catch any out-of-sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is very network efficient).
    22
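
    A rough sketch of the Merkle-tree comparison behind anti-entropy: each replica hashes its key ranges into leaves and the leaves into a root, and only ranges whose leaf hashes differ need further comparison or streaming. The bucketing, depth and hash choices are simplified stand-ins for what nodetool repair really builds:

    import hashlib

    def h(*parts):
        return hashlib.md5("|".join(parts).encode()).hexdigest()

    def bucket(key, buckets):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

    def merkle(rows, buckets=4):
        """Hash each key range into a leaf, then hash the leaves into a root."""
        leaves = ["" for _ in range(buckets)]
        for key in sorted(rows):
            b = bucket(key, buckets)
            leaves[b] = h(leaves[b], key, rows[key])
        return h(*leaves), leaves

    replica1 = {"k1": "v1", "k2": "v2", "k3": "v3"}
    replica2 = {"k1": "v1", "k2": "v2-stale", "k3": "v3"}

    root1, leaves1 = merkle(replica1)
    root2, leaves2 = merkle(replica2)
    if root1 != root2:                  # roots differ, so drill down into the leaves
        stale = [i for i, (x, y) in enumerate(zip(leaves1, leaves2)) if x != y]
        print("key ranges needing repair:", stale)
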
  • Bootstrapping
    New node
    Position
    specify an InitialToken
    pick a Token that will give it half the keys from the node with the most disk space used
    Note:
    You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap
    Relating to the previous point, one can only bootstrap N nodes at a time with automatic token picking, where N is the size of the existing cluster.
    As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node.
    When bootstrapping a new node, existing nodes have to divide the key space before beginning replication.
    During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool
    Bootstrap can take many hours when a lot of data is involved
    23
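
    A sketch of the automatic token-picking heuristic quoted above: give the bootstrapping node the midpoint of the most heavily loaded node's range, so it takes roughly half of that node's keys. The loads and tokens are invented, and ring wrap-around is ignored:

    load = {"A": 10_000, "B": 90_000, "C": 30_000}   # bytes of disk used per node
    token = {"A": 0, "B": 60, "C": 120}              # each node's current Token

    heaviest = max(load, key=load.get)                        # "B", the node with most disk used
    tokens_sorted = sorted(token.values())                    # [0, 60, 120]
    prev_token = tokens_sorted[tokens_sorted.index(token[heaviest]) - 1]
    new_token = (prev_token + token[heaviest]) // 2           # midpoint of B's range (0, 60]
    print(f"bootstrap with InitialToken {new_token}")         # 30 -> roughly half of B's keys
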
  • Moving or Removing nodes
    Remove nodes
    Live node: nodetool decommission
    the data will stream from the decommissioned node
    Dead node: nodetool removetoken
    the data will stream from the remaining replicas
    Move nodes
    nodetool move: decommission + bootstrap
    Load balancing
    If you add nodes to your cluster, your ring will be unbalanced; the only way to get perfect balance is to compute new tokens for every node and assign them manually using the nodetool move command.
    24
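
    A sketch of the "compute new tokens for every node" step: the usual even spacing for a RandomPartitioner ring, whose values would then be applied node by node with nodetool move:

    def balanced_tokens(num_nodes, ring_size=2 ** 127):
        """Evenly spaced tokens that perfectly balance a RandomPartitioner ring."""
        return [i * ring_size // num_nodes for i in range(num_nodes)]

    for node, tok in enumerate(balanced_tokens(4)):
        print(f"node {node}: nodetool move {tok}")   # applied manually, one node at a time
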
  • Membership
    Scuttlebutt
    Based on Gossip
    efficient CPU utilization
    efficient utilization of the gossip channel
    anti-entropy Gossip
    Paper: Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
    25
  • Failure Detection
    The φ Accrual Failure Detector
    Idea: the failure detection module doesn't emit a Boolean value stating a node is up or down. Instead, the failure detection module emits a value which represents a suspicion level for each of the monitored nodes
    26
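
    A condensed sketch of the accrual idea: emit a suspicion level phi that grows with the time since the last heartbeat, scaled by the historically observed heartbeat intervals. The exponential model over the mean interval is a common simplification; the class below is illustrative only:

    import math, time

    class PhiAccrualDetector:
        def __init__(self):
            self.intervals = []    # observed gaps between heartbeats
            self.last = None

        def heartbeat(self, now=None):
            now = time.time() if now is None else now
            if self.last is not None:
                self.intervals.append(now - self.last)
            self.last = now

        def phi(self, now=None):
            """Suspicion level: higher means 'more likely down'. Not a Boolean."""
            now = time.time() if now is None else now
            if not self.intervals:
                return 0.0
            mean = sum(self.intervals) / len(self.intervals)
            p_later = math.exp(-(now - self.last) / mean)   # P(heartbeat still to come)
            return -math.log10(p_later)

    d = PhiAccrualDetector()
    for t in (0.0, 1.0, 2.0, 3.0):      # heartbeats arriving every second
        d.heartbeat(now=t)
    print(round(d.phi(now=3.5), 2))     # small: node recently heard from
    print(round(d.phi(now=13.0), 2))    # large: strongly suspected to be down
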
  • Local Persistence(1/4)
    Write Operation:
    1. write into a commit log
    2. an update into an in-memory data structure
    3. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk
    Read Operation:
    1. query the in-memory data structure
    2. look into the files on disk in the order of newest to oldest
    3. combine
    27
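
    A compressed sketch of the write and read paths listed above; the flush threshold and in-memory structures are toy stand-ins for the commit log, memtable and SSTables:

    commit_log = []
    memtable = {}
    sstables = []            # flushed files, newest first
    FLUSH_THRESHOLD = 3      # stand-in for the size / object-count threshold

    def write(key, value):
        commit_log.append((key, value))       # 1. sequential append to the commit log
        memtable[key] = value                 # 2. update the in-memory structure
        if len(memtable) >= FLUSH_THRESHOLD:  # 3. dump to disk once the threshold is crossed
            sstables.insert(0, dict(memtable))
            memtable.clear()

    def read(key):
        if key in memtable:                   # 1. query the in-memory structure first
            return memtable[key]
        for sstable in sstables:              # 2. then the files on disk, newest to oldest
            if key in sstable:
                return sstable[key]
        return None                           # 3. (combining column results is omitted here)

    for i in range(4):
        write(f"k{i}", f"v{i}")
    print(read("k0"), read("k3"))             # k0 comes from an SSTable, k3 from the memtable
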
  • Local Persistence(2/4)
    Commit log
    all writes into the commit log are sequential
    Fixed size
    Create/delete
    Durability and recoverability
    28
  • Local Persistence(3/4)
    Memtable
    Per column family
    a write-back cache of data rows that can be looked up by key
    sorted by key
    29
  • Local Persistence(4/4)
    SSTable
    Flushing
    Once flushed, SSTable files are immutable; no further writes may be done. 
    Compaction
    merging multiple old SSTable files into a single new one
    Since the input SSTables are all sorted by key, merging can be done efficiently, still requiring no random i/o.
    Once compaction is finished, the old SSTable files may be deleted
    Discard tombstones
    index
    All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file
    In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory.
    In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval.
    30
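
    A sketch of compaction as a sorted merge: because each input SSTable is already sorted by key, the files can be merged in one sequential pass, keeping the newest value per key and discarding tombstones. heapq.merge stands in for the streaming merge over on-disk files:

    import heapq

    TOMBSTONE = object()   # marker left behind by a delete

    def compact(*sstables):
        """Merge key-sorted SSTables (lists of (key, timestamp, value)) into one,
        keeping the newest value per key and dropping tombstones."""
        merged = {}
        for key, ts, value in heapq.merge(*sstables):     # single sequential pass, no random i/o
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
        return [(k, ts, v) for k, (ts, v) in sorted(merged.items()) if v is not TOMBSTONE]

    old1 = [("a", 1, "a1"), ("b", 1, "b1"), ("c", 1, "c1")]
    old2 = [("a", 2, "a2"), ("b", 3, TOMBSTONE)]          # a newer write for "a", a deletion of "b"
    print(compact(old1, old2))                            # [('a', 2, 'a2'), ('c', 1, 'c1')]
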
  • Facebook inbox search
    Key: userN
    31
  • Reference
    http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
    http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
    http://wiki.apache.org/cassandra/FrontPage
    32