Introduction To Cassandra

    张元丰 Cestbon Zhang
      Hanborq Inc.
       2012.7.12
Agenda
•   Introduction
•   Data Model
•   Architecture
•   Gossip
•   Consistency
•   I/O (Read And Write)
•   Case Study
Introduction
Apache Cassandra is an open
source distributed database management system. It is
an Apache Software Foundation top-level
project designed to handle very large amounts of data
spread out across many commodity servers while
providing a highly available service with no single point
of failure. It is a NoSQL solution that was initially
developed by Facebook and powered their Inbox
Search feature until late 2010. Jeff Hammerbacher,
who led the Facebook Data team at the time, has
described Cassandra as a BigTable data model running
on an Amazon Dynamo-like infrastructure.

                                           wikipedia.org
Data Model
• Key
   – RowKey: Identity of a ROW;
• Cluster
   – the machines (nodes) in a logical Cassandra instance. Clusters can
     contain multiple keyspaces;
• Keyspace
   – a namespace for ColumnFamilies, typically one per application;
• ColumnFamilies
   – contain multiple columns, each of which has a name, value, and a
     timestamp, and which are referenced by row keys;
• Column
   – the lowest/smallest increment of data. It's a tuple (triplet) that
     contains a name, a value and a timestamp;
• SuperColumns
   – can be thought of as columns that themselves have subcolumns;
Data Model
• Keyspaces
  – The container for column families;
  – Keyspaces are of roughly the same
    granularity as a schema or database (i.e. a
    logical collection of tables) in the RDBMS
    world;
  – They are the configuration and
    management point for column families,
    and are also the structure on which batch
    inserts are applied.
Data Model
• Column Families
  – A column family is a container for rows;
  – Analogous to the table in a relational system;
  – Each row in a column family can be referenced
    by its key;
  – Each column family is stored in a separate file,
    and the file is sorted in row (i.e. key) major
    order;
  – Related columns, those that you'll access
    together, should be kept within the same
    column family.
Data Model
• Columns
  struct   Column {
      1:   binary name,
      2:   binary value,
      3:   i64 timestamp,
  }


  {
      "name": "emailAddress",
      "value": "foo@bar.com",
      "timestamp": 123456789
  }
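The timestamp is what lets two versions of the same column be settled into one. A minimal Python sketch of that idea, assuming last-write-wins reconciliation; the `Column` class and the `reconcile` helper are illustrative names, not part of Cassandra's API, and the tie-breaking rule is an assumption of this sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Illustrative mirror of the Thrift struct: name, value, timestamp."""
    name: bytes
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column.

    Assumption: the version with the higher timestamp wins; on a tie
    this sketch simply keeps the first argument.
    """
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return b if b.timestamp > a.timestamp else a


old = Column(b"emailAddress", b"foo@bar.com", 123456789)
new = Column(b"emailAddress", b"foo@baz.com", 123456790)  # hypothetical update
winner = reconcile(old, new)
```

Because the winner depends only on the timestamps, the result is the same whichever order the two versions arrive in.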
Data Model
• Row
  {
      "mccv":{
        "Users":{
          "emailAddress":{
            "name":"emailAddress", "value":"foo@bar.com"
          },
          "webSite":{
            "name":"webSite", "value":"http://bar.com"
          }
        },
        "Stats":{
          "visits":{
            "name":"visits", "value":"243"
          }
        }
      }
  }
Data Model
• Super Column
  {
    "mccv": {
      "Tags": {
        "cassandra": {
          "incubator": {
            "incubator":
              "http://incubator.apache.org/cassandra/"
          },
          "jira": {
            "jira":
              "http://issues.apache.org/jira/browse/CASSANDRA"
          }
        },
        "thrift": {
          "jira": {
            "jira":
              "http://issues.apache.org/jira/browse/THRIFT"
          }
        }
      }
    }
  }
Data Model
• Column families are declared upfront;
• SuperColumns are added and modified dynamically;
• Columns are added and modified dynamically.
• ColumnFamily1   Name : MailList   Type : Simple   Sort : Name
   – KEY -> columns:
      Name : tid1   Value : <Binary>   TimeStamp : t1
      Name : tid2   Value : <Binary>   TimeStamp : t2
      Name : tid3   Value : <Binary>   TimeStamp : t3
      Name : tid4   Value : <Binary>   TimeStamp : t4
• ColumnFamily2   Name : WordList   Type : Super   Sort : Time
   – SuperColumn Name : aloha
      columns C1, C2, C3, C4 with values V1, V2, V3, V4 and timestamps T1, T2, T3, T4
   – SuperColumn Name : dude
      columns C2, C6 with values V2, V6 and timestamps T2, T6
• ColumnFamily3   Name : System   Type : Super   Sort : Name
   – SuperColumns Name : hint1, hint2, hint3, hint4, each holding a <Column List>
Architecture

Cassandra API                     Tools

                Storage Layer

  Partitioner                   Replicator

Failure Detector         Cluster Membership

             Messaging Layer
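Partition
[Figure: consistent-hash ring covering the range 0 to 1 (1/2 at the opposite point), with nodes A, B, C, D, E and F placed on it. h(key1) and h(key2) mark the positions of two keys; N=3 replicas.]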
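The ring in the figure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Cassandra's partitioner: the `Ring` class, the evenly spaced tokens and the use of MD5 for `h` are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def h(key: str) -> float:
    """Map a key to a position on the unit ring (MD5 is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") / 2 ** 128


class Ring:
    def __init__(self, tokens: dict, n: int = 3):
        # tokens maps a node name to its position on the ring.
        self.n = n
        self.positions = sorted((pos, node) for node, pos in tokens.items())

    def replicas(self, key: str) -> list:
        """Walk clockwise from h(key) and return the first n nodes."""
        points = [pos for pos, _ in self.positions]
        # First node at or after the key position; wrap around past 1.0.
        start = bisect.bisect_left(points, h(key)) % len(points)
        count = min(self.n, len(points))
        return [self.positions[(start + i) % len(points)][1] for i in range(count)]


# Six evenly spaced nodes; these token assignments are hypothetical.
ring = Ring({name: i / 6 for i, name in enumerate("ABCDEF")}, n=3)
owners = ring.replicas("key1")
```

With N=3, each key lands on the node that follows its hash position and on the next two nodes clockwise, so adding or removing one node only affects its neighbours' ranges.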
Gossip
[Figure: ring of six nodes, A to F.]
Gossip
• Gossip protocol is used for cluster
  membership.
• Super lightweight with mathematically
  provable properties.
• Every second, each member
  increments its heartbeat counter and
  selects one other member to send its
  list to.
• A member merges the received list with
  its own list.
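The round described in these bullets can be simulated with a small Python sketch. It only models heartbeat counters; the `Member` class and its method names are hypothetical, and the peer is chosen uniformly at random, which is an assumption of the sketch.

```python
import random


class Member:
    """One cluster member holding a heartbeat list: address -> counter."""

    def __init__(self, address: str):
        self.address = address
        self.heartbeats = {address: 0}

    def tick(self, peers: list, rng: random.Random) -> None:
        """One gossip round: bump own heartbeat, send the list to one peer."""
        self.heartbeats[self.address] += 1
        if peers:
            rng.choice(peers).merge(self.heartbeats)

    def merge(self, remote: dict) -> None:
        """Keep the highest heartbeat seen for every address."""
        for address, counter in remote.items():
            if counter > self.heartbeats.get(address, -1):
                self.heartbeats[address] = counter


rng = random.Random(0)
members = [Member(f"10.0.0.{i}") for i in range(1, 5)]
for _ in range(20):  # 20 simulated one-second rounds
    for m in members:
        m.tick([p for p in members if p is not m], rng)
```

A counter that stops increasing in everyone's list is the signal that the failure detector later turns into a suspicion level.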
Gossip
  10.0.0.1                                                                   10.0.0.2



EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 325
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 61
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
EndPointState 10.0.0.3
  HeartBeatState: generation 1259912238, version 5
  ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
  HeartBeatState: generation 1259912942, version 18
  ApplicationState "load-information": 6.7, generation 1259912942, version 3
  ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Gossip
10.0.0.1                                                                  10.0.0.2



EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 324
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 63
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
  ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
  HeartBeatState: generation 1259812143, version 2142
  ApplicationState "load-information": 16.0, generation 1259812143, version 1803
  ApplicationState "normal": W2U1XYUC3wMppcY7, generation 1259812143, version 6
Gossip
10.0.0.1                   GOSSIP_DIGEST_SYN                       10.0.0.2
                       10.0.0.1:1259909635:325
                       10.0.0.2:1259911052:61
                       10.0.0.3:1259912238:5
                       10.0.0.4:1259912942:18


10.0.0.1                GOSSIP_DIGEST_SYN_ACK                      10.0.0.2
           10.0.0.1:1259909635:324
           10.0.0.3:1259912238:0
           10.0.0.4:1259912942:0
           10.0.0.2:
             [ApplicationState "normal":
                 AujDMftpyUvebtnn,
                 Generation 1259911052,
                 version 62],
             [HeartBeatState, generation 1259911052, version 63]
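How 10.0.0.2 could arrive at this reply can be sketched in Python. The comparison rules below (newer generation: request everything; same generation: compare the highest version) are inferred from the example on this slide, and `EndpointState` and `build_ack` are hypothetical names, so treat this as an illustration rather than Cassandra's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointState:
    generation: int
    heartbeat: int                                   # heartbeat version
    app_states: dict = field(default_factory=dict)   # name -> (value, version)

    def max_version(self) -> int:
        return max([self.heartbeat] + [v for _, v in self.app_states.values()])

    def newer_than(self, version: int) -> dict:
        """Everything known locally with a version above `version`."""
        delta = {n: (val, v) for n, (val, v) in self.app_states.items() if v > version}
        if self.heartbeat > version:
            delta["HeartBeatState"] = (None, self.heartbeat)
        return delta


def build_ack(local: dict, syn: list):
    """Answer a SYN: return (digests to request, state deltas to send)."""
    requests, deltas = [], {}
    for endpoint, generation, version in syn:
        known = local.get(endpoint)
        if known is None or generation > known.generation:
            requests.append((endpoint, generation, 0))       # ask for everything
        elif generation < known.generation:
            deltas[endpoint] = known.newer_than(0)           # sender is stale
        elif version > known.max_version():
            requests.append((endpoint, generation, known.max_version()))
        elif version < known.max_version():
            deltas[endpoint] = known.newer_than(version)
    return requests, deltas


# State held by 10.0.0.2 and the SYN sent by 10.0.0.1, copied from the slides.
node2 = {
    "10.0.0.1": EndpointState(1259909635, 324, {
        "load-information": ("5.2", 45),
        "bootstrapping": ("bxLpassF3XD8Kyks", 56),
        "normal": ("bxLpassF3XD8Kyks", 87)}),
    "10.0.0.2": EndpointState(1259911052, 63, {
        "load-information": ("2.7", 2),
        "bootstrapping": ("AujDMftpyUvebtnn", 31),
        "normal": ("AujDMftpyUvebtnn", 62)}),
    "10.0.0.3": EndpointState(1259812143, 2142, {
        "load-information": ("16.0", 1803),
        "normal": ("W2U1XYUC3wMppcY7", 6)}),
}
syn = [("10.0.0.1", 1259909635, 325), ("10.0.0.2", 1259911052, 61),
       ("10.0.0.3", 1259912238, 5), ("10.0.0.4", 1259912942, 18)]
requests, deltas = build_ack(node2, syn)
```

Run on the slide's data, `requests` holds the three digests listed in the reply (versions 324, 0 and 0) and `deltas` holds the "normal" state at version 62 plus the heartbeat at version 63 for 10.0.0.2.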
Gossip
  10.0.0.1                                                                   10.0.0.2


EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 325
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
  ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 63
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
  ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
  HeartBeatState: generation 1259912238, version 5
  ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
  HeartBeatState: generation 1259912942, version 18
  ApplicationState "load-information": 6.7, generation 1259912942, version 3
  ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Gossip
10.0.0.1                  GOSSIP_DIGEST_SYN2                       10.0.0.2
           10.0.0.1:
             [HeartBeatState,
                 generation 1259909635, version 325]
           10.0.0.3:
             [ApplicationState
                 "load-information": 12.0,
                 generation 1259912238, version 3],
             [ HeartBeatState:
                 generation 1259912238, version 5]
           10.0.0.4:
             [ApplicationState
                 "load-information": 6.7,
                 generation 1259912942, version 3],
             [ApplicationState
                 "normal": bj05IVc0lvRXw2xH,
                 generation 1259912942, version 7],
             [HeartBeatState: generation 1259912942, version 18]


10.0.0.1               GOSSIP_DIGEST_SYN_ACK2                      10.0.0.2
Gossip
10.0.0.1                                                                  10.0.0.2


 EndPointState 10.0.0.1
   HeartBeatState: generation 1259909635, version 325
   ApplicationState "load-information": 5.2, generation 1259909635, version 45
   ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
   ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
 EndPointState 10.0.0.2
   HeartBeatState: generation 1259911052, version 63
   ApplicationState "load-information": 2.7, generation 1259911052, version 2
   ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
   ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
 EndPointState 10.0.0.3
   HeartBeatState: generation 1259912238, version 5
   ApplicationState "load-information": 12.0, generation 1259912238, version 3
 EndPointState 10.0.0.4
   HeartBeatState: generation 1259912942, version 18
   ApplicationState "load-information": 6.7, generation 1259912942, version 3
   ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
Failure Detection
• Valuable for system management,
  replication, load balancing etc.
• Defined as a failure detector that outputs
  a value, PHI, associated with each process.
• Also known as Adaptive Failure detectors
  - designed to adapt to changing network
  conditions.
• The value output, PHI, represents a
  suspicion level.
• Applications set an appropriate threshold,
  trigger suspicions and perform
  appropriate actions.
Failure Detection
• PHI estimation is done in three phases:
   – Inter-arrival times for each member are stored in a sampling
     window.
   – The distribution of these inter-arrival times is estimated;
     gossip inter-arrival times follow an exponential distribution.
   – The value of PHI is then computed as follows, where interval / size
     is the mean of the sampled inter-arrival times:

$$\mathrm{phi} = -\log\left(1 - \left(1 - e^{-\frac{t_{Now} - t_{Last}}{\mathrm{interval}/\mathrm{size}}}\right)\right)$$
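A small Python sketch of this estimate follows. Since 1 − (1 − e^x) = e^x, the formula reduces to (tNow − tLast) / mean divided by ln 10 when the logarithm is base 10; the base, the window size of 1000 and the `PhiEstimator` name are assumptions of the sketch.

```python
import math
from collections import deque


class PhiEstimator:
    """Tracks heartbeat inter-arrival times for one member."""

    def __init__(self, window_size: int = 1000):
        self.intervals = deque(maxlen=window_size)  # sampling window
        self.last_arrival = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows linearly with the time since the last heartbeat."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)  # interval / size
        if mean <= 0:
            return 0.0
        # -log10(e^(-t / mean)) simplifies to t / (mean * ln 10).
        return (now - self.last_arrival) / mean / math.log(10)


detector = PhiEstimator()
for t in range(11):            # heartbeats at t = 0, 1, ..., 10 seconds
    detector.heartbeat(float(t))
suspicion = detector.phi(11.0)  # one mean interval after the last heartbeat
```

An application would compare `phi` against its own threshold; a higher threshold tolerates longer gaps before a member is suspected.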
I/O
• A client issues a write request to a
  random node in the Cassandra cluster.
• The “Partitioner” determines the
  nodes responsible for the data.
• Locally, write operations are logged
  and then applied to an in-memory
  version.
• Commit log is stored on a dedicated
  disk local to the machine.
I/O
[Figure: a Read/Write Op arriving at one node of the ring of nodes A to F.]
I/O
• Write path
   – A write for Key (CF1, CF2, CF3) is appended to the Commit Log
     (binary serialized, on a dedicated disk).
   – It is then applied to Memtable (CF1), Memtable (CF2) and
     Memtable (CF3).
• Memtable flush is triggered by:
   – Data size
   – Lifetime
• A flush produces:
   – Data file on disk: <Size> <Index> <Serialized Cells> for each key
   – Index file on disk: <Key> Offset entries
   – Bloom filter of keys
   – Sparse index in memory: K128 Offset, K256 Offset, K384 Offset

The storage architecture draws on techniques from Google and other databases.
It is similar to Bigtable, but its indexing scheme is different.
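The write path on these slides (log first, then memtable, then flush) can be sketched as follows. The `Node` class, the entry-count flush threshold and the in-memory lists standing in for files are all simplifications assumed for the example.

```python
class Node:
    """Hypothetical single-node write path: commit log first, then memtable."""

    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in-memory version: key -> (value, timestamp)
        self.sstables = []     # each flush produces one sorted, immutable run
        self.flush_threshold = flush_threshold

    def write(self, key: str, value: str, timestamp: int) -> None:
        self.commit_log.append((key, value, timestamp))      # 1. log the write
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)          # 2. apply in memory
        if len(self.memtable) >= self.flush_threshold:       # "data size" trigger
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a key-sorted run and start a new one."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()   # flushed data no longer needs replay


node = Node(flush_threshold=2)
node.write("k2", "b", 1)
node.write("k1", "a", 2)   # second distinct key reaches the threshold and flushes
```

Both steps of a write are sequential appends or memory updates, which is why the deck later summarizes the design as fast writes and slower reads.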
I/O
[Figure: compaction by merge sort.]
• Input: three sorted SSTables, each entry a key followed by < Serialized data >
   – K1, K2, K3, ...
   – K2, K10, K30, ...
   – K4, K5, K10, ...
   – the figure carries a DELETED label among the inputs
• MERGE SORT
• Output Data File (sorted): K1, K2, K3, K4, K5, K10, K30
• Output Index File (loaded in memory): K1 Offset, K5 Offset, K30 Offset, Bloom Filter
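The merge step can be sketched with Python's `heapq.merge`. The rules used here are assumptions of the sketch: for a key that appears in several inputs the highest timestamp wins, and a key whose winning value is a tombstone is dropped (a real system would typically keep tombstones for a grace period first). Integer keys stand in for K1, K2 and so on.

```python
import heapq

TOMBSTONE = None  # stands in for a DELETED marker


def compact(sstables):
    """Merge key-sorted SSTables into one key-sorted run.

    Each SSTable is a list of (key, timestamp, value) tuples sorted by key.
    """
    merged = []
    for key, timestamp, value in heapq.merge(*sstables, key=lambda row: row[0]):
        if merged and merged[-1][0] == key:
            # Same key seen again: keep whichever version is newer.
            if timestamp > merged[-1][1]:
                merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    # Simplification: drop keys whose newest version is a tombstone.
    return [row for row in merged if row[2] is not TOMBSTONE]


# Hypothetical contents for three inputs with the key layout of the figure.
t1 = [(1, 1, "a"), (2, 1, "b"), (3, 1, "c")]
t2 = [(2, 2, TOMBSTONE), (10, 2, "j"), (30, 2, "z")]
t3 = [(4, 3, "d"), (5, 3, "e"), (10, 3, "J")]
compacted = compact([t1, t2, t3])
```

Because every input is already sorted, the merge reads each file once from start to end, so compaction is sequential I/O rather than random seeks.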
I/O
• Index Level-1: Consistent Hash
   – Range of hash to node (ring of nodes A to F, N=3)
• Index Level-2: Block Index, B-Tree (binary search)
   – Sparse block index (key interval = 128, changeable) [in memory]:
     K0, K128, K256, K384
   – BloomFilter of keys on SSTable
   – KeyCache
• Index Level-3: Sorted Map, mirror of data
   – Key position maps in the index file [on disk, cachable]:
     K0, K128, K256, K384
   – Data rows in the data file [on disk]
• Index Level-4: Block Index, B-Tree (binary search)
   – Row layout: Key, Columns Index, Columns Block 0, Columns Block 1, ...,
     Columns Block N
   – Columns block size: 64KB (changeable)
   – BloomFilter of columns on row
   – Columns Block 0 -> Position, Columns Block 1 -> Position, ...,
     Columns Block N -> Position
• 4 levels of indexing in total.
• Indexes are relatively small.
• Well suited to storing data of individuals, such as users.
• Good for CDR data serving.
I/O
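• Read path (figure)
   – The client sends a Query to the Cassandra cluster.
   – The query goes to the closest replica (Replica A), which returns the Result.
   – A Digest Query goes to Replica B and Replica C, and each returns a
     Digest Response.
   – Read repair is performed if the digests differ.
   – The cluster returns the Result to the client.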
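The digest comparison can be sketched in Python. Replicas are modeled as plain dictionaries, MD5 over the record's `repr` is an arbitrary digest choice, and the repair runs synchronously here for simplicity; all of these are assumptions of the sketch, as are the names `digest` and `read`.

```python
import hashlib


def digest(record) -> str:
    """Hash of a (value, timestamp) record; MD5 is an assumption."""
    return hashlib.md5(repr(record).encode("utf-8")).hexdigest()


def read(key, replicas):
    """Read `key`; replicas is a list of dicts (key -> (value, timestamp)),
    ordered closest first."""
    closest, others = replicas[0], replicas[1:]
    result = closest.get(key)
    if all(digest(r.get(key)) == digest(result) for r in others):
        return result  # every digest matches the closest replica's data
    # Digests differ: compare full records, keep the newest, repair the rest.
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions, key=lambda rec: rec[1])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # read repair
    return newest


replica_a = {"user1": ("old", 1)}
replica_b = {"user1": ("new", 2)}   # only B saw the latest write
replica_c = {"user1": ("old", 1)}
value = read("user1", [replica_a, replica_b, replica_c])
```

Sending digests instead of full records keeps the cross-replica traffic small in the common case where all replicas already agree.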
I/O
• Consistency Level
  – Write
    • ZERO
       – Ensure nothing. The write happens asynchronously in the
         background.
    • ONE
       – Ensure that the write has been written to at least one
         node's commit log and memory table before
         responding to the client.
    • QUORUM
       – Ensure that the write has been written to
         <ReplicationFactor> / 2 + 1 nodes before responding
         to the client.
    • ALL
       – Ensure that the write is written to <ReplicationFactor>
         nodes before responding to the client.
I/O
• Consistency Level
  – Read
    • ZERO
       – Not supported, because it doesn't make sense.
    • ONE
       – Will return the record returned by the first node to
         respond. A consistency check is always done in a
         background thread to fix any consistency issues when
         ConsistencyLevel.ONE is used. This means subsequent
         calls will have correct data even if the initial read gets
         an older value. (This is called read repair.)
    • QUORUM
       – Will query all replicas and return the record with
         the most recent timestamp once at least a
         majority of replicas have reported. Again, the remaining
         replicas will be checked in the background.
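The arithmetic behind these levels is short enough to write out. In the Python sketch below, `<ReplicationFactor> / 2 + 1` is read as integer division, and the helper names are hypothetical. The last function encodes the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the write and read replica counts add up to more than the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """<ReplicationFactor> / 2 + 1, using integer division."""
    return replication_factor // 2 + 1


def acks_required(level: str, replication_factor: int) -> int:
    """Replica responses to wait for at a given consistency level.

    ZERO is only meaningful for writes; it is listed so both tables fit here.
    """
    table = {
        "ZERO": 0,
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def read_sees_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return acks_required(write_level, rf) + acks_required(read_level, rf) > rf


rf = 3
quorum_nodes = quorum(rf)                                        # 2 of 3 replicas
strong = read_sees_latest_write("QUORUM", "QUORUM", rf)          # 2 + 2 > 3
weak = read_sees_latest_write("ONE", "ONE", rf)                  # 1 + 1 <= 3
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap in at least one replica, while ONE plus ONE relies on read repair to converge.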
Case Study
• CASE: CDR Query
• TIME: 2010
Case Study
[Figure: row layout. Key = User ID; one ColumnFamily per date (day), from 20101020 to 20101024; each holds that day's CDR cells, sorted by timestamp.]
•   Schema
     –     Key: The User ID (Phone Number), string
     –     ColumnFamily: The date (day) name, string
     –     Column: CDR, Thrift (or ProtocolBuffer) compact encoding
•   Semantics
     –     Each user's CDRs for a given day are sorted by
           timestamp and stored together.
•   Stored Files
     –     The SSTable files are separated by
           ColumnFamilies.
•   Flexible and applicable to various CDR
    structures.
•   Data Patterns
     –     A short set of temporal data
           that tends to be volatile.
     –     An ever-growing set of data
           that rarely gets accessed.
Case Study
•   Hardware
     – Cluster with 9 nodes
          • 5 nodes
                 – DELL PowerEdge R710
                 – CPU: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz, cache
                   size=8192 KB
                 – Core: 2x 4core CPU, HyperThread, => 16 cores
                 – RAM: 16GB
                 – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
          • 4 nodes
                 – DELL PowerEdge 2970
                 – CPU: Quad-Core AMD Opteron (tm) Processor 2378,
                   cache size=512 KB
                 – Core: 2x 4core CPU, => 8 cores
                 – RAM: 16GB
                 – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
     – Total
          • 9 nodes, 112 cores, 144GB RAM, 18 (18TB) Hard Disks
     – Network: within a single 1Gbps switch.
•   Linux: RedHat EL 5.3, Kernel=2.6.18-128.el5
•   File System: Ext3
•   JDK: Sun Java 1.6.0_20-b02
•   The existing testbed and configuration are not
    ideal for performance. Preferred:
     – Commit Log: dedicated hard disk.
     – File system: XFS/EXT4.
     – More memory to cache more indexes and
       metadata.
Case Study
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-numbers
  and puts them into Cassandra one by one.
   – Key Space: 50 million
   – Size of a CDR: Thrift-compacted encoding, ~200 bytes




 Throughput: average ~80K ops/s; per-node: average ~9K ops/s
 Latency: average ~0.5ms
 Bottleneck: network (and memory)
Case Study
• Each node runs 8 clients (threads), 72 clients in total.
• Each client picks a random user-id/phone-number out of the 50-million
  space and gets its 20 most recent CDRs (one page) from Cassandra.
• All clients read CDRs of the same day/bucket.

• The 1st run:
    – Before compaction.
    – On average 8 SSTables on each node for each day.

• The 2nd run:
    – After compaction.
    – Only one SSTable on each node for each day.
Case Study
[Figure: two read-latency histograms, one of the cluster (9 nodes) and one of one node. y-axis: percentage of read ops (0% to 25%); x-axis: latency in units of 100ms (1 to 61).]

 Throughput: average ~140 ops/s; per-node: average ~16 ops/s
 Latency: average ~500ms, 97% < 2s (SLA)
 Bottleneck: disk IO (random seek) (CPU load is very low)
Case Study
[Figure: two read-latency histograms, one of the cluster (9 nodes) and one of one node. y-axis: percentage of read ops (0% to 100%); x-axis: latency in units of 100ms (1 to 33).]

   Compaction of ~8 SSTables, ~200GB. Time:16core node: 1:40; 8core node: 2:25
   Throughput: average ~1.1K ops/s; per-node: average ~120 ops/s
   Latency: average ~60ms, 95% < 500ms (SLA)
   Bottleneck: disk IO (random seek) (CPU load is very low)
Summary
• Inspired by Dynamo;
• Partition keys with Consistent Hash
  Ring;
• Gossip: Automatic node/failure
  detection;
• Storage: LOCAL (different from HBase);
• IO: Fast Write and Slower Read;
• Maintenance: Not Very Easy.
Thanks !
 Q&A

More Related Content

What's hot

All Your Base
All Your BaseAll Your Base
All Your BaseAcunu
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012Chris Richardson
 
MariaDB and Cassandra Interoperability
MariaDB and Cassandra InteroperabilityMariaDB and Cassandra Interoperability
MariaDB and Cassandra InteroperabilityColin Charles
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shaintrihug
 
Plongée profonde dans les technos de haute disponibilité d’Exchange 2010 par...
Plongée profonde  dans les technos de haute disponibilité d’Exchange 2010 par...Plongée profonde  dans les technos de haute disponibilité d’Exchange 2010 par...
Plongée profonde dans les technos de haute disponibilité d’Exchange 2010 par...Microsoft Technet France
 
Storing and manipulating graphs in HBase
Storing and manipulating graphs in HBaseStoring and manipulating graphs in HBase
Storing and manipulating graphs in HBaseDan Lynn
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­TimeSeven Nguyen
 
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)Shun Nakamura
 
Top five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solutionTop five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solutionjbellis
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Lucidworks (Archived)
 
MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataSteven Francia
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1jbellis
 

What's hot (12)

All Your Base
All Your BaseAll Your Base
All Your Base
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
 
MariaDB and Cassandra Interoperability
MariaDB and Cassandra InteroperabilityMariaDB and Cassandra Interoperability
MariaDB and Cassandra Interoperability
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shain
 
Plongée profonde dans les technos de haute disponibilité d’Exchange 2010 par...
Plongée profonde  dans les technos de haute disponibilité d’Exchange 2010 par...Plongée profonde  dans les technos de haute disponibilité d’Exchange 2010 par...
Plongée profonde dans les technos de haute disponibilité d’Exchange 2010 par...
 
Storing and manipulating graphs in HBase
Storing and manipulating graphs in HBaseStoring and manipulating graphs in HBase
Storing and manipulating graphs in HBase
 
Making Big Data Analytics Interactive and Real-­Time
 Making Big Data Analytics Interactive and Real-­Time Making Big Data Analytics Interactive and Real-­Time
Making Big Data Analytics Interactive and Real-­Time
 
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)
読み出し性能と書き込み性能を両立させるクラウドストレージ (SACSIS2011-A6-1)
 
Top five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solutionTop five questions to ask when choosing a big data solution
Top five questions to ask when choosing a big data solution
 
Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues Column Stride Fields aka. DocValues
Column Stride Fields aka. DocValues
 
MongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous DataMongoDB, Hadoop and Humongous Data
MongoDB, Hadoop and Humongous Data
 
Cassandra 1.1
Cassandra 1.1Cassandra 1.1
Cassandra 1.1
 

Introduction to Cassandra

  • 1. Introduction To Cassandra 张元丰 Cestbon Zhang Hanborq Inc. 2012.7.12
  • 2. Agenda • Introduction • Data Model • Architecture • Gossip • Consistency • I/O (Read and Write) • Case Study
  • 3. Introduction Apache Cassandra is an open source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure. (Source: wikipedia.org)
  • 4. Data Model
      • Key
         – RowKey: the identity of a row;
      • Cluster
         – the machines (nodes) in a logical Cassandra instance. Clusters can contain multiple keyspaces;
      • Keyspace
         – a namespace for ColumnFamilies, typically one per application;
      • ColumnFamilies
         – contain multiple columns, each of which has a name, a value, and a timestamp, and which are referenced by row keys;
      • Column
         – the smallest increment of data. It is a tuple (triplet) that contains a name, a value, and a timestamp;
      • SuperColumns
         – can be thought of as columns that themselves have subcolumns.
  • 5. Data Model
      • Keyspaces
         – The container for column families;
         – Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world;
         – They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.
  • 6. Data Model
      • Column Families
         – A column family is a container for rows;
         – Analogous to the table in a relational system;
         – Each row in a column family can be referenced by its key;
         – Each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order;
         – Related columns, those that you'll access together, should be kept within the same column family.
  • 7. Data Model
      • Columns

          struct Column {
            1: binary name,
            2: binary value,
            3: i64 timestamp,
          }

          {
            "name": "emailAddress",
            "value": "foo@bar.com",
            "timestamp": 123456789
          }
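The column triplet above can be mirrored in a few lines of Python. This is only an illustrative sketch: the `Column` tuple and the `reconcile` helper are hypothetical names, the second e-mail address is made up, and the "newest timestamp wins" rule is assumed from the read-path slides later in the deck rather than taken from Cassandra's code.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """The (name, value, timestamp) triplet from the Thrift struct."""
    name: bytes
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Pick between two versions of the same column: the newer timestamp wins."""
    return a if a.timestamp >= b.timestamp else b


old = Column(b"emailAddress", b"foo@bar.com", 123456789)
new = Column(b"emailAddress", b"foo@example.com", 123456790)  # hypothetical update
print(reconcile(old, new).value)
```

Keeping the timestamp inside the column is what lets replicas agree on a winner without coordinating at write time.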
  • 8. Data Model
      • Row

          {
            "mccv": {
              "Users": {
                "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
                "webSite": {"name": "webSite", "value": "http://bar.com"}
              },
              "Stats": {
                "visits": {"name": "visits", "value": "243"}
              }
            }
          }
  • 9. Data Model
      • Super Column

          {
            "mccv": {
              "Tags": {
                "cassandra": {
                  "incubator": {"incubator": "http://incubator.apache.org/cassandra/"},
                  "jira": {"jira": "http://issues.apache.org/jira/browse/CASSANDRA"}
                },
                "thrift": {
                  "jira": {"jira": "http://issues.apache.org/jira/browse/THRIFT"}
                }
              }
            }
          }
  • 10. Data Model (figure: three column families under one row KEY)
      • ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name)
         – Columns tid1, tid2, tid3, tid4, each with Value: <Binary> and TimeStamp: t1, t2, t3, t4
         – Columns are added and modified dynamically
      • ColumnFamily2 (Name: WordList, Type: Super, Sort: Time)
         – SuperColumns aloha and dude; columns C1, C2, C3, C4 and C2, C6 with values V1, V2, V3, V4 and V2, V6 and timestamps T1, T2, T3, T4 and T2, T6
         – SuperColumns are added and modified dynamically
         – Columns are added and modified dynamically
      • ColumnFamily3 (Name: System, Type: Super, Sort: Name)
         – SuperColumns hint1, hint2, hint3, hint4, each holding a <Column List>
      • Column families are declared upfront
  • 11. Architecture (figure: layer diagram) Cassandra API, Tools; Storage Layer; Partitioner, Replicator; Failure Detector, Cluster Membership; Messaging Layer
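  • 12. Partition (figure: consistent-hashing ring over the range 0 to 1, with 1/2 at the midpoint; nodes A, B, C, D, E, F on the ring; keys placed at h(key1) and h(key2); N=3)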
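A rough sketch of what the ring figure suggests: each node owns a position in [0, 1), a key is hashed onto the ring, and the N=3 replicas are the next three nodes clockwise. The MD5-based `h`, the way node positions are derived, and the `replicas` helper are all assumptions made for illustration; the slide does not say how Cassandra's partitioner computes them.

```python
import bisect
import hashlib


def h(key: str) -> float:
    """Hash a string to a ring position in [0, 1). MD5 is an assumed choice."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.
    return (int.from_bytes(digest, "big") >> 75) / 2**53


def replicas(ring, position, n=3):
    """ring: sorted list of (position, node). Walk clockwise from `position`."""
    positions = [p for p, _ in ring]
    start = bisect.bisect_left(positions, position) % len(ring)  # wrap past 1
    return [ring[(start + i) % len(ring)][1] for i in range(min(n, len(ring)))]


ring = sorted((h(name), name) for name in "ABCDEF")  # hypothetical placement
print(replicas(ring, h("key1")))
```

Because only ring neighbours are affected, adding or removing one node moves a small share of the keys instead of reshuffling everything.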
  • 13. Gossip (figure: cluster nodes A, B, C, D, E, F)
  • 14. Gossip
      • The gossip protocol is used for cluster membership.
      • It is very lightweight, with mathematically provable properties.
      • Every second, each member increments its heartbeat counter and selects one other member to send its list to.
      • The receiving member merges that list with its own.
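The heartbeat round described on this slide can be sketched as follows. It is a toy model under stated assumptions: each member's "list" is taken to be a dict of member name to heartbeat counter, merging keeps the larger counter, and the names `merge` and `tick` are hypothetical.

```python
import random


def merge(mine: dict, received: dict) -> None:
    """Merge a received heartbeat list into our own, keeping the newest counters."""
    for member, beat in received.items():
        if beat > mine.get(member, -1):
            mine[member] = beat


def tick(node: str, views: dict, rng: random.Random) -> str:
    """One gossip round for `node`: bump its heartbeat, send its list to one peer."""
    views[node][node] = views[node].get(node, 0) + 1
    peer = rng.choice([m for m in views if m != node])
    merge(views[peer], views[node])
    return peer


# views[x] is the heartbeat list that member x currently holds.
views = {"A": {"A": 3}, "B": {"B": 1}, "C": {"C": 7}}
rng = random.Random(0)
for _ in range(5):
    for member in list(views):
        tick(member, views, rng)
print(views)
```

A member whose counter stops increasing in everyone else's list is the raw signal a failure detector can work from.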
  • 15. Gossip (nodes 10.0.0.1 and 10.0.0.2)
      EndPointState 10.0.0.1
        HeartBeatState: generation 1259909635, version 325
        ApplicationState "load-information": 5.2, generation 1259909635, version 45
        ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
        ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
      EndPointState 10.0.0.2
        HeartBeatState: generation 1259911052, version 61
        ApplicationState "load-information": 2.7, generation 1259911052, version 2
        ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
      EndPointState 10.0.0.3
        HeartBeatState: generation 1259912238, version 5
        ApplicationState "load-information": 12.0, generation 1259912238, version 3
      EndPointState 10.0.0.4
        HeartBeatState: generation 1259912942, version 18
        ApplicationState "load-information": 6.7, generation 1259912942, version 3
        ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  • 16. Gossip (nodes 10.0.0.1 and 10.0.0.2)
      EndPointState 10.0.0.1
        HeartBeatState: generation 1259909635, version 324
        ApplicationState "load-information": 5.2, generation 1259909635, version 45
        ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
        ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
      EndPointState 10.0.0.2
        HeartBeatState: generation 1259911052, version 63
        ApplicationState "load-information": 2.7, generation 1259911052, version 2
        ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
        ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
      EndPointState 10.0.0.3
        HeartBeatState: generation 1259812143, version 2142
        ApplicationState "load-information": 16.0, generation 1259812143, version 1803
        ApplicationState "normal": W2U1XYUC3wMppcY7, generation 1259812143, version 6
  • 17. Gossip
      10.0.0.1   GOSSIP_DIGEST_SYN   10.0.0.2
        10.0.0.1:1259909635:325
        10.0.0.2:1259911052:61
        10.0.0.3:1259912238:5
        10.0.0.4:1259912942:18
      10.0.0.1   GOSSIP_DIGEST_SYN_ACK   10.0.0.2
        10.0.0.1:1259909635:324
        10.0.0.3:1259912238:0
        10.0.0.4:1259912942:0
        10.0.0.2: [ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62], [HeartBeatState, generation 1259911052, version 63]
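The numbers on this slide are consistent with a simple comparison rule: for every endpoint:generation:version digest in the SYN, the receiver either asks for what it is missing or offers what it has that is newer. The sketch below is inferred from the slide's values only; the function name `examine_syn` and the tuple layout are hypothetical, and Cassandra's real gossip code may differ.

```python
def examine_syn(syn, local):
    """Compare SYN digests with local knowledge.

    syn, local: {endpoint: (generation, max_version)}.
    Returns (requests, to_send):
      requests[ep] = (generation, version): "send me states newer than version"
      to_send[ep]  = (generation, version): "I will send you states newer than version"
    """
    requests, to_send = {}, {}
    for ep, (r_gen, r_ver) in syn.items():
        l_gen, l_ver = local.get(ep, (0, 0))  # unknown endpoint: we know nothing
        if r_gen > l_gen:
            requests[ep] = (r_gen, 0)       # newer generation: ask for everything
        elif r_gen < l_gen:
            to_send[ep] = (l_gen, 0)        # sender is a generation behind
        elif r_ver > l_ver:
            requests[ep] = (r_gen, l_ver)   # same generation, sender is ahead
        elif r_ver < l_ver:
            to_send[ep] = (l_gen, r_ver)    # same generation, we are ahead
    return requests, to_send


# Digests sent by 10.0.0.1 and the state held by 10.0.0.2, as on the slides.
syn = {
    "10.0.0.1": (1259909635, 325),
    "10.0.0.2": (1259911052, 61),
    "10.0.0.3": (1259912238, 5),
    "10.0.0.4": (1259912942, 18),
}
local = {
    "10.0.0.1": (1259909635, 324),
    "10.0.0.2": (1259911052, 63),
    "10.0.0.3": (1259812143, 2142),
}
requests, to_send = examine_syn(syn, local)
print(requests)
print(to_send)
```

With these inputs the requests match the three digests in the SYN_ACK, and the single `to_send` entry corresponds to the 10.0.0.2 states with versions 62 and 63 that the slide shows being shipped back.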
  • 18. Gossip (nodes 10.0.0.1 and 10.0.0.2)
      EndPointState 10.0.0.1
        HeartBeatState: generation 1259909635, version 325
        ApplicationState "load-information": 5.2, generation 1259909635, version 45
        ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
        ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
      EndPointState 10.0.0.2
        HeartBeatState: generation 1259911052, version 63
        ApplicationState "load-information": 2.7, generation 1259911052, version 2
        ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
        ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
      EndPointState 10.0.0.3
        HeartBeatState: generation 1259912238, version 5
        ApplicationState "load-information": 12.0, generation 1259912238, version 3
      EndPointState 10.0.0.4
        HeartBeatState: generation 1259912942, version 18
        ApplicationState "load-information": 6.7, generation 1259912942, version 3
        ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  • 19. Gossip
      10.0.0.1   GOSSIP_DIGEST_SYN2   10.0.0.2
        10.0.0.1: [HeartBeatState, generation 1259909635, version 325]
        10.0.0.3: [ApplicationState "load-information": 12.0, generation 1259912238, version 3], [HeartBeatState: generation 1259912238, version 5]
        10.0.0.4: [ApplicationState "load-information": 6.7, generation 1259912942, version 3], [ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7], [HeartBeatState: generation 1259912942, version 18]
      10.0.0.1   GOSSIP_DIGEST_SYN_ACK2   10.0.0.2
  • 20. Gossip (nodes 10.0.0.1 and 10.0.0.2)
      EndPointState 10.0.0.1
        HeartBeatState: generation 1259909635, version 325
        ApplicationState "load-information": 5.2, generation 1259909635, version 45
        ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
        ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
      EndPointState 10.0.0.2
        HeartBeatState: generation 1259911052, version 63
        ApplicationState "load-information": 2.7, generation 1259911052, version 2
        ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
        ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
      EndPointState 10.0.0.3
        HeartBeatState: generation 1259912238, version 5
        ApplicationState "load-information": 12.0, generation 1259912238, version 3
      EndPointState 10.0.0.4
        HeartBeatState: generation 1259912942, version 18
        ApplicationState "load-information": 6.7, generation 1259912942, version 3
        ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  • 21. Failure Detection
      • Valuable for system management, replication, load balancing, etc.
      • Defined as a failure detector that outputs a value, PHI, associated with each process.
      • Also known as an adaptive failure detector: it is designed to adapt to changing network conditions.
      • The output value, PHI, represents a suspicion level.
      • Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions.
  • 22. Failure Detection
      • PHI estimation is done in three phases:
         – Inter-arrival times for each member are stored in a sampling window.
         – The distribution of these inter-arrival times is estimated. Gossip follows an exponential distribution.
         – The value of PHI is then computed as follows:

           $$\varphi = -\log\left(1 - \left(1 - e^{-\frac{t_{Now} - t_{Last}}{\mathit{interval}/\mathit{size}}}\right)\right)$$
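A small numeric sketch of the PHI computation, under three assumptions that the slide does not state outright: the garbled formula reads phi = -log(1 - (1 - e^(-(tNow - tLast) / mean))), "interval / size" is the mean inter-arrival time over the sampling window, and the logarithm is base 10. Since 1 - (1 - x) = x, the expression reduces to elapsed / (mean * ln 10), which the code uses directly so that very long silences do not underflow.

```python
import math


def phi(t_now: float, t_last: float, intervals: list) -> float:
    """Suspicion level for a member last heard from at t_last.

    intervals: recent heartbeat inter-arrival times (the sampling window).
    Assumes exponentially distributed arrivals and a base-10 logarithm.
    """
    mean = sum(intervals) / len(intervals)
    elapsed = t_now - t_last
    # -log10(e^(-elapsed / mean)) simplifies to elapsed / (mean * ln 10).
    return elapsed / (mean * math.log(10))


window = [1.0, 1.0, 1.0]        # heartbeats roughly once per second
print(phi(10.0, 0.0, window))   # about 4.34 after 10 s of silence
print(phi(20.0, 0.0, window))   # suspicion keeps growing with the silence
```

An application would compare the result against a threshold of its choosing; the threshold value itself is not given on the slides.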
  • 23. I/O
      • A client issues a write request to a random node in the Cassandra cluster.
      • The "Partitioner" determines the nodes responsible for the data.
      • Locally, write operations are logged and then applied to an in-memory version.
      • The commit log is stored on a dedicated disk local to the machine.
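The local part of the write path (log first, then apply in memory) can be sketched like this. The `Node` class, its in-memory list standing in for the commit-log file, and the `recover` helper are hypothetical; the recovery step is an assumption about why the log is written first and does not appear on the slide.

```python
class Node:
    """Toy storage node: an append-only commit log plus an in-memory table."""

    def __init__(self):
        self.commit_log = []  # stand-in for a file on a dedicated disk
        self.memtable = {}    # row key -> {column name: (value, timestamp)}

    def write(self, key, column, value, timestamp):
        # 1. Log the operation first so it can be replayed later.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Apply it to the in-memory version.
        self._apply(key, column, value, timestamp)

    def _apply(self, key, column, value, timestamp):
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)


def recover(commit_log):
    """Rebuild a node's memtable by replaying a commit log (assumed use of the log)."""
    node = Node()
    for key, column, value, timestamp in commit_log:
        node._apply(key, column, value, timestamp)
    node.commit_log = list(commit_log)
    return node


node = Node()
node.write("mccv", "emailAddress", "foo@bar.com", 123456789)
print(node.memtable)
print(recover(node.commit_log).memtable == node.memtable)
```

Appending to a log is sequential I/O, which is one reason to give the commit log a disk of its own.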
  • 24. I/O (figure: a Read/Write Op arriving at the ring of nodes A, B, C, D, E, F)
  • 25. I/O (figure: write path and on-disk layout)
      • A write for Key (CF1, CF2, CF3) goes to the Commit Log (binary serialized, on a dedicated disk) and to one Memtable per column family: Memtable (CF1), Memtable (CF2), Memtable (CF3).
      • Flush is triggered by: data size; lifetime.
      • Data file on disk: <Key>, <Size>, <Index>, <Serialized Cells>.
      • Index file on disk: <Key> Offset entries (K128 Offset, K256 Offset, K384 Offset, ...), with a Bloom filter of keys (sparse indexes in memory).
      • The storage architecture draws on techniques from Google and other databases. It is similar to Bigtable, but its index scheme is different.
  • 26. I/O (figure: compaction by merge sort)
      • Input: three sorted files of <key, serialized data> entries: (K2, K10, K30), (K4, K5, K10), (K1, K2, K3), with a DELETED marker.
      • MERGE SORT produces one sorted data file: K1, K2, K3, K4, K5, K10, K30.
      • Index file, loaded in memory: K1 Offset, K5 Offset, K30 Offset; plus a Bloom filter.
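The merge-sort step in this figure can be approximated with Python's `heapq.merge`. This is a simplified sketch, not Cassandra's compaction: each input is a sorted list of hypothetical `(key, timestamp, value)` tuples, the newest version of a key wins, and a `None` value stands in for the DELETED marker and is dropped immediately.

```python
import heapq

TOMBSTONE = None  # stand-in for the slide's DELETED marker


def compact(sstables):
    """Merge sorted SSTables into one sorted list, newest version per key."""
    merged = []
    # Order by key, then newest timestamp first, so the first hit per key wins.
    for key, ts, value in heapq.merge(*sstables, key=lambda e: (e[0], -e[1])):
        if merged and merged[-1][0] == key:
            continue  # an older version of a key we already emitted
        merged.append((key, ts, value))
    return [entry for entry in merged if entry[2] is not TOMBSTONE]


# Key layout from the slide (K2, K10, K30 / K4, K5, K10 / K1, K2, K3);
# timestamps and values are made up.
a = [(2, 1, "a2"), (10, 1, "a10"), (30, 1, "a30")]
b = [(4, 2, "b4"), (5, 2, "b5"), (10, 2, "b10")]
c = [(1, 3, "c1"), (2, 3, "c2"), (3, 3, "c3")]
print([key for key, _, _ in compact([a, b, c])])
```

Because every input is already sorted, the merge reads each file once from start to end, which keeps compaction sequential on disk.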
  • 27. I/O (figure: four levels of indexing)
      • Index Level-1: consistent hash, hash to node (ring with nodes A, B, C, D, E, F; N=3).
      • Index Level-2: block index, B-Tree (binary search); BloomFilter of keys on SSTable; KeyCache.
         – Sparse block index (key interval = 128, changeable) [in memory]: K0, K128, K256, K384
         – Key position maps in the index file [on disk, cachable]
         – Data rows in the data file [on disk]
      • Index Level-3: sorted map, BloomFilter of columns on a row; column blocks of 64KB (changeable): Columns Block 0, Block 1, ... Block N.
      • Index Level-4: block index, B-Tree (binary search): Columns Block 0 -> Position, Columns Block 1 -> Position, ... Columns Block N -> Position.
      • 4 levels of indexing in total.
      • Indexes are relatively small.
      • Well suited to storing data of individuals, such as users.
      • Good for CDR data serving.
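The sparse block index (one entry every 128 keys, searched by binary search) can be illustrated with `bisect`. In this sketch a list position stands in for a byte offset in the data file, and `build_sparse_index` and `lookup` are hypothetical helper names.

```python
import bisect


def build_sparse_index(sorted_keys, interval=128):
    """Keep every `interval`-th key with its position (stand-in for a file offset)."""
    return [(key, i) for i, key in enumerate(sorted_keys) if i % interval == 0]


def lookup(sorted_keys, index, key, interval=128):
    """Binary-search the sparse index, then scan at most one block of keys."""
    slot = bisect.bisect_right([k for k, _ in index], key) - 1
    if slot < 0:
        return None  # key sorts before the first indexed key
    start = index[slot][1]
    for i in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[i] == key:
            return i
        if sorted_keys[i] > key:
            break
    return None


keys = [2 * i for i in range(500)]   # 500 sorted keys: 0, 2, 4, ...
index = build_sparse_index(keys)     # entries for positions 0, 128, 256, 384
print(index)
print(lookup(keys, index, 300))
```

The trade-off is memory versus scan length: a larger interval shrinks the in-memory index but lengthens the scan inside a block.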
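  • 28. I/O (figure: read path)
      • The client sends a Query to the Cassandra cluster and receives the Result.
      • The closest replica (Replica A) returns the Result.
      • Replica B and Replica C receive a Digest Query and return a Digest Response.
      • Read repair is performed if the digests differ.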
  • 29. I/O
      • Consistency Level – Write
         • ZERO – Ensure nothing. The write happens asynchronously in the background.
         • ONE – Ensure that the write has been written to at least one node's commit log and memory table before responding to the client.
         • QUORUM – Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes before responding to the client.
         • ALL – Ensure that the write has been written to <ReplicationFactor> nodes before responding to the client.
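The acknowledgement counts behind these levels are simple arithmetic, sketched below. The function names are hypothetical, `/` in the QUORUM rule is taken to be integer division, and the `overlaps` check (write acks + read acks > replication factor) is a general quorum property added for illustration, not something stated on the slide.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge before the client gets a reply."""
    table = {
        "ZERO": 0,
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return table[level]


def overlaps(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return w + r > replication_factor


for level in ("ZERO", "ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, 3))
print(overlaps("QUORUM", "QUORUM", 3))
```

With a replication factor of 3, QUORUM needs 2 acknowledgements, so one replica can be down without blocking writes.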
  • 30. I/O
      • Consistency Level – Read
         • ZERO – Not supported, because it doesn't make sense.
         • ONE – Returns the record from the first node to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called read repair.)
         • QUORUM – Queries all storage nodes and returns the record with the most recent timestamp once at least a majority of replicas have reported. Again, the remaining replicas are checked in the background.
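Timestamp-based resolution and read repair can be sketched in a few lines. The `Response` tuple, the `resolve` helper, and the sample values are hypothetical; the sketch only shows the "most recent timestamp wins, stale replicas get fixed" idea from the slide, not how Cassandra schedules the background work.

```python
from typing import NamedTuple


class Response(NamedTuple):
    replica: str
    value: str
    timestamp: int


def resolve(responses):
    """Return the newest value and the replicas that hold an older version."""
    newest = max(responses, key=lambda r: r.timestamp)
    needs_repair = [r.replica for r in responses if r.timestamp < newest.timestamp]
    return newest.value, needs_repair


responses = [
    Response("A", "foo@bar.com", 100),
    Response("B", "foo@example.com", 120),  # hypothetical newer write
    Response("C", "foo@bar.com", 100),
]
value, needs_repair = resolve(responses)
print(value, needs_repair)
```

The replicas listed in `needs_repair` are the ones a background read repair would overwrite with the newest version.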
  • 31. Case Study • CASE: CDR Query • TIME: 2010
  • 32. Case Study (figure: row Key = User ID; one column family per date (day), e.g. 20101020 ... 20101024; each holds CDR cells sorted by timestamp)
      • Schema
         – Key: the User ID (phone number), string
         – ColumnFamily: the date (day) name, string
         – Column: CDR, Thrift (or ProtocolBuffer) compact encoding
      • Semantics
         – Each user's daily CDRs are sorted by timestamp and stored together.
      • Stored Files
         – The SSTable files are separated by ColumnFamily.
      • Data Patterns
         – A short set of temporal data that tends to be volatile.
         – An ever-growing set of data that rarely gets accessed.
      • Flexible and applicable to various CDR structures.
  • 33. Case Study • Hardware – Cluster with 9 nodes • 5 nodes – DELL PowerEdge R710 – CPU: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz, cache size=8192 KB – Core: 2x 4-core CPU, HyperThreading => 16 cores – RAM: 16GB – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0 • 4 nodes – DELL PowerEdge 2970 – CPU: Quad-Core AMD Opteron(tm) Processor 2378, cache size=512 KB – Core: 2x 4-core CPU => 8 cores – RAM: 16GB – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0 – Total • 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB) – Network: within a single 1Gbps switch. • Linux: RedHat EL 5.3, Kernel=2.6.18-128.el5 • File System: Ext3 • JDK: Sun Java 1.6.0_20-b02 • Note: The existing testbed and configuration are not ideal for performance. Preferred: – Commit Log: dedicated hard disk. – File system: XFS/EXT4. – More memory to cache more indexes and metadata.
  • 34. Case Study • Each node runs 6 clients (threads), 54 clients in total. • Each client generates random CDRs for 50 million users/phone-numbers, and puts them into Cassandra one by one. – Key Space: 50 million – Size of a CDR: Thrift compact encoding, ~200 bytes • Throughput: average ~80K ops/s; per-node: average ~9K ops/s • Latency: average ~0.5ms • Bottleneck: network (and memory)
  • 35. Case Study • Each node runs 8 clients (threads), 72 clients in total. • Each client randomly picks a user-id/phone-number out of the 50-million space and gets its most recent 20 CDRs (one page) from Cassandra. • All clients read CDRs of the same day/bucket. • The 1st run: – Before compaction. – On average 8 SSTables on each node for each day. • The 2nd run: – After compaction. – Only one SSTable on each node for each day.
  • 36. Case Study [Chart, one node of the 9-node cluster: percentage of read ops (y-axis, 0% to 25%) by latency bucket in units of 100ms (x-axis, 1 to 61).] • Throughput: average ~140 ops/s; per-node: average ~16 ops/s • Latency: average ~500ms, 97% < 2s (SLA) • Bottleneck: disk IO (random seek) (CPU load is very low)
  • 37. Case Study [Chart, one node of the 9-node cluster: percentage of read ops (y-axis, 0% to 100%) by latency bucket in units of 100ms (x-axis, 1 to 33).] • Compaction of ~8 SSTables, ~200GB. Time: 16-core node: 1:40; 8-core node: 2:25 • Throughput: average ~1.1K ops/s; per-node: average ~120 ops/s • Latency: average ~60ms, 95% < 500ms (SLA) • Bottleneck: disk IO (random seek) (CPU load is very low)
  • 38. Summary • Inspired by Dynamo; • Partitions keys with a Consistent Hash Ring; • Gossip: automatic node/failure detection; • Storage: LOCAL (different from HBase); • IO: fast writes and slower reads; • Maintenance: not very easy.