Introduction To Cassandra

    张元丰 Cestbon Zhang
      Hanborq Inc.
       2012.7.12
Agenda
•   Introduction
•   Data Model
•   Architecture
•   Gossip
•   Consistency
•   I/O (Read And Write)
•   Case Study
Introduction
Apache Cassandra is an open
source distributed database management system. It is
an Apache Software Foundation top-level
project designed to handle very large amounts of data
spread out across many commodity servers while
providing a highly available service with no single point
of failure. It is a NoSQL solution that was initially
developed by Facebook and powered their Inbox
Search feature until late 2010. Jeff Hammerbacher,
who led the Facebook Data team at the time, has
described Cassandra as a BigTable data model running
on an Amazon Dynamo-like infrastructure.

                                           wikipedia.org
Data Model
• Key
   – RowKey: the identity of a row;
• Cluster
   – the machines (nodes) in a logical Cassandra instance. Clusters can
     contain multiple keyspaces;
• Keyspace
   – a namespace for ColumnFamilies, typically one per application;
• ColumnFamilies
   – contain multiple columns, each of which has a name, a value, and a
     timestamp, and which are referenced by row keys;
• Column
   – the lowest/smallest increment of data. It is a tuple (triplet)
     containing a name, a value, and a timestamp;
• SuperColumns
   – can be thought of as columns that themselves have subcolumns.
Data Model
• Keyspaces
  – The container for column families;
  – Keyspaces are of roughly the same
    granularity as a schema or database (i.e. a
    logical collection of tables) in the RDBMS
    world;
  – They are the configuration and
    management point for column families,
    and are also the structure on which batch
    inserts are applied.
Data Model
• Column Families
  – A column family is a container for rows;
  – Analogous to a table in a relational system;
  – Each row in a column family can be referenced
    by its key;
  – Each column family is stored in a separate file,
    and the file is sorted in row (i.e. key) major
    order;
  – Related columns, those that you'll access
    together, should be kept within the same
    column family.
Data Model
• Columns
  struct Column {
      1: binary name,
      2: binary value,
      3: i64 timestamp,
  }

  {
      "name": "emailAddress",
      "value": "foo@bar.com",
      "timestamp": 123456789
  }
Data Model
• Row
  {
      "mccv": {
        "Users": {
          "emailAddress": {
            "name": "emailAddress", "value": "foo@bar.com"
          },
          "webSite": {
            "name": "webSite", "value": "http://bar.com"
          }
        },
        "Stats": {
          "visits": {
            "name": "visits", "value": "243"
          }
        }
      }
  }
Data Model
• Super Column
  {
    "mccv": {
      "Tags": {
        "cassandra": {
          "incubator": {
            "incubator": "http://incubator.apache.org/cassandra/"
          },
          "jira": {
            "jira": "http://issues.apache.org/jira/browse/CASSANDRA"
          }
        },
        "thrift": {
          "jira": {
            "jira": "http://issues.apache.org/jira/browse/THRIFT"
          }
        }
      }
    }
  }
Data Model
• Example column families. Column families are declared upfront;
  columns (and SuperColumns) are added and modified dynamically:

  ColumnFamily1  Name: MailList  Type: Simple  Sort: Name
    KEY -> { Name: tid1, Value: <Binary>, TimeStamp: t1 }
           { Name: tid2, Value: <Binary>, TimeStamp: t2 }
           { Name: tid3, Value: <Binary>, TimeStamp: t3 }
           { Name: tid4, Value: <Binary>, TimeStamp: t4 }

  ColumnFamily2  Name: WordList  Type: Super  Sort: Time
    KEY -> SuperColumn "aloha": columns C1/V1/T1, C2/V2/T2, C3/V3/T3, C4/V4/T4
           SuperColumn "dude":  columns C2/V2/T2, C6/V6/T6

  ColumnFamily3  Name: System  Type: Super  Sort: Name
    KEY -> SuperColumn "hint1": <Column List>
           SuperColumn "hint2": <Column List>
           SuperColumn "hint3": <Column List>
           SuperColumn "hint4": <Column List>
Architecture
• Layered architecture, from top to bottom:
  – Cassandra API / Tools
  – Storage Layer
  – Partitioner / Replicator
  – Failure Detector / Cluster Membership
  – Messaging Layer
Partition
• (Figure: a consistent hash ring over the interval [0, 1), with nodes
  A-F placed at their token positions; h(key1) and h(key2) mark where
  two keys hash onto the ring. With N=3, a key is stored on the first
  N nodes walking clockwise from its hash position.)
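
A minimal sketch of replica placement on such a ring, assuming plain
long tokens in place of the partitioner's real key hashes; the Ring
class and replicasFor method are illustrative names, not Cassandra's
API:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.SortedMap;
  import java.util.TreeMap;

  // Illustrative ring: maps each node's token (position on the ring)
  // to the node that owns the range ending there.
  class Ring {
      private final TreeMap<Long, String> tokenToNode = new TreeMap<>();

      void addNode(long token, String node) { tokenToNode.put(token, node); }

      // The first N distinct nodes clockwise from h(key) hold the replicas.
      List<String> replicasFor(long keyHash, int n) {
          List<String> replicas = new ArrayList<>();
          SortedMap<Long, String> tail = tokenToNode.tailMap(keyHash);
          for (String node : tail.values()) {          // walk clockwise...
              if (replicas.size() == n) return replicas;
              replicas.add(node);
          }
          for (String node : tokenToNode.values()) {   // ...and wrap around
              if (replicas.size() == n) return replicas;
              if (!replicas.contains(node)) replicas.add(node);
          }
          return replicas;
      }
  }

For example, with nodes at tokens 0, 100 and 200 and N=3, a key hashing
to 150 is stored on the nodes at 200, 0 and 100.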
Gossip
• The Gossip protocol is used for cluster
  membership.
• Super lightweight, with mathematically
  provable properties.
• Every second, each member
  increments its heartbeat counter and
  selects one other member to send its
  state list to.
• The receiving member merges that list
  with its own. (See the sketch below.)
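
A minimal sketch of one gossip round under these rules; the class is
illustrative (not Cassandra's Gossiper API) and peers is assumed to
exclude the local node:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  // Illustrative gossiper: each node keeps a heartbeat version per
  // endpoint and, once per second, pushes its view to one random peer.
  class Gossiper {
      final String self;
      final Map<String, Long> heartbeats = new HashMap<>(); // endpoint -> version
      final Random random = new Random();

      Gossiper(String self) {
          this.self = self;
          heartbeats.put(self, 0L);
      }

      // Called every second by a timer.
      void gossipRound(List<Gossiper> peers) {
          heartbeats.merge(self, 1L, Long::sum);        // bump own heartbeat
          Gossiper peer = peers.get(random.nextInt(peers.size()));
          peer.receive(heartbeats);                     // send our list
      }

      // Merge the incoming list with our own: higher version wins per endpoint.
      void receive(Map<String, Long> remote) {
          remote.forEach((ep, v) -> heartbeats.merge(ep, v, Math::max));
      }
  }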
Gossip
• Example: node 10.0.0.1's local state before a gossip round with
  10.0.0.2.

EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 325
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 61
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
EndPointState 10.0.0.3
  HeartBeatState: generation 1259912238, version 5
  ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
  HeartBeatState: generation 1259912942, version 18
  ApplicationState "load-information": 6.7, generation 1259912942, version 3
  ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Gossip
• Node 10.0.0.2's local state at the same moment: its view of 10.0.0.1
  is slightly stale (heartbeat version 324 vs. 325), it holds an older
  generation for 10.0.0.3, and it has no entry at all for 10.0.0.4.

EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 324
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 63
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
  ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
  HeartBeatState: generation 1259812143, version 2142
  ApplicationState "load-information": 16.0, generation 1259812143, version 1803
  ApplicationState "normal": W2U1XYUC3wMppcY7, generation 1259812143, version 6
Gossip
• 10.0.0.1 starts the round by sending 10.0.0.2 a digest of its view
  (endpoint:generation:max-version). 10.0.0.2 replies with the states
  10.0.0.1 is missing, plus digests for what it wants back (version 0
  means "send me everything you have for this endpoint"):

10.0.0.1                 GOSSIP_DIGEST_SYN                       10.0.0.2
                       10.0.0.1:1259909635:325
                       10.0.0.2:1259911052:61
                       10.0.0.3:1259912238:5
                       10.0.0.4:1259912942:18


10.0.0.1               GOSSIP_DIGEST_SYN_ACK                     10.0.0.2
           10.0.0.1:1259909635:324
           10.0.0.3:1259912238:0
           10.0.0.4:1259912942:0
           10.0.0.2:
             [ApplicationState "normal":
                 AujDMftpyUvebtnn,
                 generation 1259911052,
                 version 62],
             [HeartBeatState: generation 1259911052, version 63]
Gossip
• After applying the reply, 10.0.0.1's view of 10.0.0.2 is up to date
  (heartbeat version 63, "normal" state at version 62):

EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 325
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 63
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
  ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
  HeartBeatState: generation 1259912238, version 5
  ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
  HeartBeatState: generation 1259912942, version 18
  ApplicationState "load-information": 6.7, generation 1259912942, version 3
  ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Gossip
• 10.0.0.1 then sends back the full states 10.0.0.2 asked for, and the
  exchange completes:

10.0.0.1                 GOSSIP_DIGEST_SYN2                      10.0.0.2
           10.0.0.1:
             [HeartBeatState:
                 generation 1259909635, version 325]
           10.0.0.3:
             [ApplicationState
                 "load-information": 12.0,
                 generation 1259912238, version 3],
             [HeartBeatState:
                 generation 1259912238, version 5]
           10.0.0.4:
             [ApplicationState
                 "load-information": 6.7,
                 generation 1259912942, version 3],
             [ApplicationState
                 "normal": bj05IVc0lvRXw2xH,
                 generation 1259912942, version 7],
             [HeartBeatState: generation 1259912942, version 18]


10.0.0.1              GOSSIP_DIGEST_SYN_ACK2                     10.0.0.2
Gossip
• After the exchange, both nodes converge on the same view:

 EndPointState 10.0.0.1
   HeartBeatState: generation 1259909635, version 325
   ApplicationState "load-information": 5.2, generation 1259909635, version 45
   ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
   ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
 EndPointState 10.0.0.2
   HeartBeatState: generation 1259911052, version 63
   ApplicationState "load-information": 2.7, generation 1259911052, version 2
   ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
   ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
 EndPointState 10.0.0.3
   HeartBeatState: generation 1259912238, version 5
   ApplicationState "load-information": 12.0, generation 1259912238, version 3
 EndPointState 10.0.0.4
   HeartBeatState: generation 1259912942, version 18
   ApplicationState "load-information": 6.7, generation 1259912942, version 3
   ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Failure Detection
• Valuable for system management,
  replication, load balancing, etc.
• Cassandra uses an accrual failure
  detector: one that outputs a continuous
  value, PHI, for each monitored process.
• Also known as adaptive failure detectors,
  designed to adapt to changing network
  conditions.
• The output value, PHI, represents a
  suspicion level.
• Applications set an appropriate threshold
  on PHI, trigger suspicions, and perform
  appropriate actions.
Failure Detection
• PHI estimation is done in three phases:
   – Inter-arrival times of gossip messages from each member are
     stored in a sampling window.
   – The distribution of these inter-arrival times is estimated;
     gossip arrivals are assumed to follow an exponential
     distribution.
   – The value of PHI is then computed as:

       phi = -log10( 1 - (1 - e^( -(t_now - t_last) / mean_interval )) )

     where t_last is the arrival time of the last heartbeat, and
     mean_interval is the average inter-arrival time over the
     sampling window. The inner term 1 - e^(-t/mean_interval) is the
     CDF of the exponential distribution.
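
A compact sketch of the estimator under the exponential assumption; the
class and windowing are illustrative. The closed form used below follows
from phi = -log10(e^(-t / mean_interval)) = t / (mean_interval * ln 10):

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Illustrative phi accrual failure detector: phi grows as the silence
  // since the last heartbeat exceeds the mean inter-arrival time
  // observed in the sampling window.
  class PhiFailureDetector {
      private final Deque<Long> intervals = new ArrayDeque<>(); // sampling window
      private final int windowSize;
      private long lastArrival = -1;

      PhiFailureDetector(int windowSize) { this.windowSize = windowSize; }

      void heartbeat(long nowMillis) {
          if (lastArrival >= 0) {
              intervals.addLast(nowMillis - lastArrival);
              if (intervals.size() > windowSize) intervals.removeFirst();
          }
          lastArrival = nowMillis;
      }

      double phi(long nowMillis) {
          if (intervals.isEmpty()) return 0.0;
          double mean = intervals.stream()
                                 .mapToLong(Long::longValue)
                                 .average().orElse(1.0);
          double t = nowMillis - lastArrival;
          return t / (mean * Math.log(10.0)); // phi = -log10(e^(-t/mean))
      }
  }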
I/O
• A client issues a write request to a
  random node in the Cassandra cluster.
• The "Partitioner" determines the
  nodes responsible for the data.
• Locally, write operations are first logged
  to the commit log and then applied to an
  in-memory table (the memtable).
• The commit log is stored on a dedicated
  disk local to the machine.
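
A minimal sketch of this local write path (log first, then memtable);
the class is illustrative, not Cassandra's implementation:

  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.concurrent.ConcurrentSkipListMap;

  // Illustrative local write path: durability comes from the sequential
  // commit-log append; the sorted in-memory table absorbs the write.
  class LocalWritePath {
      private final FileOutputStream commitLog;
      private final ConcurrentSkipListMap<String, byte[]> memtable =
              new ConcurrentSkipListMap<>(); // kept sorted by row key

      LocalWritePath(String commitLogPath) throws IOException {
          this.commitLog = new FileOutputStream(commitLogPath, true); // append-only
      }

      synchronized void write(String rowKey, byte[] value) throws IOException {
          // 1. Log first: a sequential append on a dedicated disk is cheap.
          commitLog.write((rowKey + "\n").getBytes());
          commitLog.write(value);
          commitLog.flush();
          // 2. Then apply to the memtable; it is flushed to an SSTable later.
          memtable.put(rowKey, value);
      }
  }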
I/O
• (Figure: a read/write operation arrives at one node on the ring,
  which coordinates with the replicas holding the data.)
I/O
• Write path on disk:
  – Commit Log: binary-serialized writes, keyed by Key (CF1, CF2,
    CF3), on a dedicated disk.
  – One Memtable per column family: Memtable (CF1), Memtable (CF2),
    Memtable (CF3).
  – A memtable flush is triggered by:
    • Data size
    • Lifetime
  – Each flush produces two files:
    • An index file on disk: <Key> Offset pairs; a Bloom filter of
      the keys and sparse index samples (K128 Offset, K256 Offset,
      K384 Offset, ...) are kept in memory.
    • A data file on disk: <Size> <Index> <Serialized Cells> per row.

The storage architecture borrows techniques from Google's Bigtable and
other databases; it is similar to Bigtable, but its index scheme is
different.
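
A sketch of how the in-memory sparse index narrows a key lookup to a
short scan of the on-disk index file; the 128-key interval is from the
figure above, the class itself is illustrative:

  import java.util.Map;
  import java.util.TreeMap;

  // Illustrative sparse index: holds every 128th key with its position
  // in the on-disk index file; a lookup seeks to the nearest preceding
  // sample and then scans at most 128 index entries.
  class SparseIndex {
      private final TreeMap<String, Long> samples = new TreeMap<>();

      void addSample(String key, long indexFileOffset) {
          samples.put(key, indexFileOffset);
      }

      // Index-file offset to start scanning from, or -1 if the key sorts
      // before the first sampled key.
      long scanStart(String key) {
          Map.Entry<String, Long> e = samples.floorEntry(key);
          return e == null ? -1 : e.getValue();
      }
  }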
I/O
• Compaction merge-sorts several sorted SSTables into one:
  – Inputs: e.g. {K1, K2, K3}, {K2, K10, K30}, {K4, K5, K10}, each a
    sorted run of <Key, Serialized data> entries; entries marked
    DELETED are dropped during the merge.
  – Output: one sorted data file {K1, K2, K3, K4, K5, K10, K30},
    plus a new index file (sparse offsets: K1 Offset, K5 Offset,
    K30 Offset) and a Bloom filter, loaded in memory.
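
A minimal streaming merge in the spirit of the figure (index and
Bloom-filter rebuilding omitted); all types are illustrative, with
deleted cells modeled as null values:

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  // Illustrative streaming compaction: merge several key-sorted SSTables
  // into one sorted run; on duplicate keys the newest table wins, and
  // deleted cells (value == null) are dropped from the output.
  class Compaction {
      record Cell(String key, String value) {}  // value == null marks DELETED

      static class Cursor {
          final List<Cell> table;   // one sorted input SSTable
          final int rank;           // higher rank = newer SSTable
          int pos = 0;
          Cursor(List<Cell> table, int rank) { this.table = table; this.rank = rank; }
          Cell peek() { return table.get(pos); }
          boolean done() { return pos >= table.size(); }
      }

      static List<Cell> merge(List<List<Cell>> tablesOldestFirst) {
          // Heap orders cursors by current key; ties put the newest table first.
          PriorityQueue<Cursor> heap = new PriorityQueue<>(
              Comparator.comparing((Cursor c) -> c.peek().key())
                        .thenComparing(c -> -c.rank));
          for (int i = 0; i < tablesOldestFirst.size(); i++)
              if (!tablesOldestFirst.get(i).isEmpty())
                  heap.add(new Cursor(tablesOldestFirst.get(i), i));

          List<Cell> out = new ArrayList<>();
          String lastKey = null;
          while (!heap.isEmpty()) {
              Cursor c = heap.poll();
              Cell cell = c.peek();
              if (!cell.key().equals(lastKey)) {   // first copy seen is newest
                  lastKey = cell.key();
                  if (cell.value() != null) out.add(cell); // drop DELETED cells
              }
              c.pos++;
              if (!c.done()) heap.add(c);
          }
          return out;
      }
  }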
I/O
• A read lookup passes through four levels of indexing:
  – Index Level-1: the consistent hash ring (as in the Partition
    figure, N=3) maps a range of the hash space to a node, taking
    h(key) to the nodes holding the row.
  – Index Level-2: per SSTable, a sparse block index (key interval =
    128, changeable) held in memory and binary-searched (K0, K128,
    K256, K384, ...), together with a BloomFilter of the keys on the
    SSTable and a KeyCache.
  – Index Level-3: the key-position map in the index file (on disk,
    cachable), a sorted mirror of the data, giving each key's row
    position in the data file (on disk).
  – Index Level-4: within a row, a BloomFilter of the columns on the
    row, then a block index (binary search) mapping column blocks to
    positions (Columns Block 0 -> Position, Columns Block 1 ->
    Position, ...); column blocks are 64KB (changeable).
• In total, 4 levels of indexing, and the indexes are relatively
  small.
• Very well suited to storing per-individual data, such as per-user
  records; good for CDR data serving.
I/O
• Read path with read repair:
  – The client sends a query to the Cassandra cluster and receives a
    result.
  – The closest replica (Replica A) returns the full result.
  – Digest queries go to the other replicas (Replica B, Replica C),
    which return digest responses.
  – If the digests differ, a read repair is performed.
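
A sketch of the digest comparison that decides whether a read repair is
needed; the choice of MD5 and all names here are illustrative:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.Arrays;
  import java.util.List;

  // Illustrative digest check: the closest replica returns the data,
  // the others return only a hash; any mismatch triggers a repair.
  class DigestCheck {
      static byte[] digest(byte[] row) throws NoSuchAlgorithmException {
          return MessageDigest.getInstance("MD5").digest(row);
      }

      static boolean needsRepair(byte[] dataFromClosest,
                                 List<byte[]> digestsFromOthers)
              throws NoSuchAlgorithmException {
          byte[] expected = digest(dataFromClosest);
          for (byte[] d : digestsFromOthers)
              if (!Arrays.equals(expected, d))
                  return true; // a replica disagrees -> read repair
          return false;
      }
  }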
I/O
• Consistency Level
  – Write
    • ZERO
       – Ensures nothing. The write happens asynchronously in the
         background.
    • ONE
       – Ensures that the write has been written to at least one
         node's commit log and memtable before responding to the
         client.
    • QUORUM
       – Ensures that the write has been written to
         <ReplicationFactor> / 2 + 1 nodes before responding
         to the client.
    • ALL
       – Ensures that the write is written to all <ReplicationFactor>
         nodes before responding to the client.
I/O
• Consistency Level
  – Read
    • ZERO
       – Not supported, because it doesn't make sense.
    • ONE
       – Returns the record returned by the first node to
         respond. A consistency check is always done in a
         background thread to fix any consistency issues when
         ConsistencyLevel.ONE is used. This means subsequent
         calls will have correct data even if the initial read gets
         an older value. (This is called read repair.)
    • QUORUM
       – Queries all storage nodes and returns the record with
         the most recent timestamp once at least a majority of
         replicas has reported. Again, the remaining replicas are
         checked in the background.
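
The number of replica acknowledgements each level waits for, as a small
illustrative sketch (ZERO maps to 0 because nothing is awaited):

  // Illustrative mapping from consistency level to how many replicas a
  // coordinator waits for before answering the client.
  enum ConsistencyLevel {
      ZERO, ONE, QUORUM, ALL;

      int blockFor(int replicationFactor) {
          switch (this) {
              case ZERO:   return 0;                         // fire and forget
              case ONE:    return 1;
              case QUORUM: return replicationFactor / 2 + 1; // majority
              case ALL:    return replicationFactor;
              default:     throw new AssertionError();
          }
      }
  }

With ReplicationFactor = 3, QUORUM waits for 2 replicas on both reads
and writes, so a QUORUM read always overlaps a QUORUM write.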
Case Study
• CASE: CDR Query
• TIME: 2010
Case Study
• Layout: one row per user, one column family per day:

  Key       Date(Day) as CF (20101020)  ......  Date(Day) as CF (20101024)
  User ID   CDR CDR CDR ... CDR         ......  CDR CDR CDR CDR ... CDR
            (cells sorted by timestamp)

• Schema
   – Key: the User ID (phone number), string
   – ColumnFamily: the date (day) name, string
   – Column: a CDR, in Thrift (or Protocol Buffers)
     compact encoding
• Semantics
   – Each user's everyday CDRs are sorted by
     timestamp, and stored together.
• Stored Files
   – The SSTable files are separated by
     ColumnFamilies.
• Data Patterns
   – A short set of temporal data that tends to be volatile.
   – An ever-growing set of data that rarely gets accessed.
• Flexible and applicable to various CDR
  structures. (See the sketch below.)
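
A sketch of how one CDR maps onto this layout; using the call timestamp
as a byte-comparable column name is what keeps each user's CDRs
time-sorted inside a day's column family. All names are illustrative:

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.time.LocalDate;
  import java.time.format.DateTimeFormatter;

  // Illustrative mapping of one CDR onto the schema above:
  //   row key       = phone number
  //   column family = day bucket, e.g. "20101020"
  //   column name   = call timestamp (columns then sort chronologically)
  //   column value  = the serialized CDR (~200 bytes, Thrift-encoded)
  class CdrMapping {
      static byte[] rowKey(String phoneNumber) {
          return phoneNumber.getBytes(StandardCharsets.UTF_8);
      }

      static String columnFamily(LocalDate day) {
          return day.format(DateTimeFormatter.BASIC_ISO_DATE); // "20101020"
      }

      static byte[] columnName(long callTimestampMillis) {
          // Big-endian longs compare bytewise in timestamp order.
          return ByteBuffer.allocate(Long.BYTES)
                           .putLong(callTimestampMillis).array();
      }
  }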
Case Study
• Hardware
   – Cluster with 9 nodes
        • 5 nodes
               – DELL PowerEdge R710
               – CPU: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz, cache
                 size=8192 KB
               – Cores: 2x 4-core CPU, HyperThreading => 16 cores
               – RAM: 16GB
               – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
        • 4 nodes
               – DELL PowerEdge 2970
               – CPU: Quad-Core AMD Opteron(tm) Processor 2378,
                 cache size=512 KB
               – Cores: 2x 4-core CPU => 8 cores
               – RAM: 16GB
               – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
   – In total
        • 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB)
   – Network: a single 1Gbps switch.
• Linux: RedHat EL 5.3, Kernel=2.6.18-128.el5
• File System: Ext3
• JDK: Sun Java 1.6.0_20-b02

Note: the existing testbed and configuration are not ideal for
performance. Preferred:
   – Commit Log: a dedicated hard disk.
   – File system: XFS/EXT4.
   – More memory to cache more indexes and metadata.
Case Study
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-
  numbers, and puts them into Cassandra one by one.
   – Key Space: 50 million
   – Size of a CDR: Thrift-compact encoding, ~200 bytes

• Throughput: average ~80K ops/s; per-node: average ~9K ops/s
• Latency: average ~0.5ms
• Bottleneck: network (and memory)
Case Study
• Each node runs 8 clients (threads), 72 clients in total.
• Each client randomly picks a user-id/phone-number out of the 50-
  million space, and gets its most recent 20 CDRs (one page) from
  Cassandra.
• All clients read CDRs of the same day/bucket.

   ------------------------------------------------------------------------------------

• The 1st run:
    – Before compaction.
    – An average of 8 SSTables on each node per day.

• The 2nd run:
    – After compaction.
    – Only one SSTable on each node per day.
Case Study
(Charts: read-latency histograms for the 1st run, percentage of read
ops per 100ms latency bucket, for the whole cluster (9 nodes) and for
one node.)

• Throughput: average ~140 ops/s; per-node: average ~16 ops/s
• Latency: average ~500ms, 97% < 2s (SLA)
• Bottleneck: disk IO (random seeks); CPU load is very low
Case Study
(Charts: read-latency histograms for the 2nd run, after compaction,
percentage of read ops per 100ms latency bucket, for the whole cluster
(9 nodes) and for one node.)

• Compaction of ~8 SSTables, ~200GB. Time: 1:40 on a 16-core node,
  2:25 on an 8-core node
• Throughput: average ~1.1K ops/s; per-node: average ~120 ops/s
• Latency: average ~60ms, 95% < 500ms (SLA)
• Bottleneck: disk IO (random seeks); CPU load is very low
Summary
• Inspired by Dynamo;
• Partitions keys with a Consistent Hash
  Ring;
• Gossip: automatic node/failure
  detection;
• Storage: local (different from HBase);
• IO: fast writes and slower reads;
• Maintenance: not very easy.
Thanks !
 Q&A
