Introduction to Cassandra

Introduction to Cassandra, for training.
  1. Introduction To Cassandra, 张元丰 (Cestbon Zhang), Hanborq Inc., 2012.7.12
  2. Agenda
  • Introduction
  • Data Model
  • Architecture
  • Gossip
  • Consistency
  • I/O (Read and Write)
  • Case Study
  3. Introduction
  Apache Cassandra is an open-source distributed database management system. It is an Apache Software Foundation top-level project designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powered their Inbox Search feature until late 2010. Jeff Hammerbacher, who led the Facebook Data team at the time, has described Cassandra as a BigTable data model running on an Amazon Dynamo-like infrastructure. (wikipedia.org)
  4. Data Model
  • Key – RowKey: the identity of a row;
  • Cluster – the machines (nodes) in a logical Cassandra instance. Clusters can contain multiple keyspaces;
  • Keyspace – a namespace for ColumnFamilies, typically one per application;
  • ColumnFamilies – contain multiple columns, each of which has a name, a value, and a timestamp, and which are referenced by row keys;
  • Column – the lowest/smallest increment of data. It's a tuple (triplet) that contains a name, a value, and a timestamp;
  • SuperColumns – can be thought of as columns that themselves have subcolumns;
  5. Data Model
  • Keyspaces
    – The container for column families;
    – Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world;
    – They are the configuration and management point for column families, and are also the structure on which batch inserts are applied.
  6. Data Model
  • Column Families
    – A column family is a container for rows;
    – Analogous to the table in a relational system;
    – Each row in a column family can be referenced by its key;
    – Each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order;
    – Related columns, those that you'll access together, should be kept within the same column family.
  7. Data Model
  • Columns

    struct Column {
      1: binary name,
      2: binary value,
      3: i64 timestamp,
    }

    { "name": "emailAddress", "value": "foo@bar.com", "timestamp": 123456789 }
  8. Data Model
  • Row

    {
      "mccv": {
        "Users": {
          "emailAddress": { "name": "emailAddress", "value": "foo@bar.com" },
          "webSite":      { "name": "webSite", "value": "http://bar.com" }
        },
        "Stats": {
          "visits": { "name": "visits", "value": "243" }
        }
      }
    }
  9. Data Model
  • Super Column

    {
      "mccv": {
        "Tags": {
          "cassandra": {
            "incubator": { "incubator": "http://incubator.apache.org/cassandra/" },
            "jira":      { "jira": "http://issues.apache.org/jira/browse/CASSANDRA" }
          },
          "thrift": {
            "jira": { "jira": "http://issues.apache.org/jira/browse/THRIFT" }
          }
        }
      }
    }
  10. Data Model
  [Figure: example column families]
    ColumnFamily1 "MailList" (Type: Simple, Sort: Name): for a KEY, columns tid1..tid4, each with a Name, a Value <Binary>, and a TimeStamp t1..t4.
    ColumnFamily2 "WordList" (Type: Super, Sort: Time): super columns "aloha" and "dude", each holding subcolumns (C1..C6 with values V1..V6 and timestamps T1..T6).
    ColumnFamily3 "System" (Type: Super, Sort: Name): super columns hint1..hint4, each holding a <Column List>.
  Column families are declared upfront; columns and SuperColumns are added and modified dynamically.
  11. Architecture
    Cassandra API / Tools
    Storage Layer
    Partitioner / Replicator
    Failure Detector / Cluster Membership
    Messaging Layer
  12. Partition
  [Figure: a consistent hash ring covering the interval 0..1, with nodes A..F placed on it; h(key1) and h(key2) map keys onto the ring, and each key is stored on the next N=3 nodes clockwise.]
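The ring on this slide can be sketched as follows. This is a minimal illustration, not Cassandra's actual partitioner: the MD5 hash, the node names, and the clockwise replica walk are assumptions for the example.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Keys and nodes are hashed onto the same ring; a key is owned by
    the first node at or after its position, and the next N-1 nodes
    clockwise hold the replicas."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Place each node on the ring by hashing its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, key):
        """Return the N nodes responsible for `key`, walking clockwise."""
        positions = [p for p, _ in self.ring]
        start = bisect_right(positions, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(min(self.rf, len(self.ring)))]

ring = ConsistentHashRing(["A", "B", "C", "D", "E", "F"])
print(ring.replicas("key1"))  # three consecutive nodes on the ring
```

Because only the successor relation matters, adding or removing a node moves only the keys adjacent to it on the ring, which is the point of the scheme.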
  13. Gossip
  [Figure: six nodes A..F, each periodically exchanging state with a randomly chosen peer.]
  14. Gossip
  • The Gossip protocol is used for cluster membership.
  • Super lightweight, with mathematically provable properties.
  • Every second, each member increments its heartbeat counter and selects one other member to send its state list to.
  • A member merges the received list with its own list.
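The merge rule in the last bullet can be sketched like this. The dict-of-tuples state representation is a deliberate simplification of the EndPointState dumps on the following slides: each endpoint maps to its newest known (generation, version) pair, and the higher pair wins.

```python
def merge_gossip(local, remote):
    """Merge two gossip views: for each endpoint, keep the state with
    the higher (generation, version) pair. Tuples compare
    lexicographically, so a newer generation (node restart) always
    beats a higher version from an older generation."""
    merged = dict(local)
    for ep, state in remote.items():
        if ep not in merged or state > merged[ep]:
            merged[ep] = state
    return merged

# Values taken from the EndPointState examples on the next slides.
a = {"10.0.0.1": (1259909635, 325), "10.0.0.3": (1259912238, 5)}
b = {"10.0.0.1": (1259909635, 324), "10.0.0.2": (1259911052, 63)}
print(merge_gossip(a, b))
```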
  15. Gossip (state known to 10.0.0.1 before the exchange with 10.0.0.2)

  EndPointState 10.0.0.1
    HeartBeatState: generation 1259909635, version 325
    ApplicationState "load-information": 5.2, generation 1259909635, version 45
    ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
    ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
  EndPointState 10.0.0.2
    HeartBeatState: generation 1259911052, version 61
    ApplicationState "load-information": 2.7, generation 1259911052, version 2
    ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
  EndPointState 10.0.0.3
    HeartBeatState: generation 1259912238, version 5
    ApplicationState "load-information": 12.0, generation 1259912238, version 3
  EndPointState 10.0.0.4
    HeartBeatState: generation 1259912942, version 18
    ApplicationState "load-information": 6.7, generation 1259912942, version 3
    ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  16. Gossip (state known to 10.0.0.2 before the exchange)

  EndPointState 10.0.0.1
    HeartBeatState: generation 1259909635, version 324
    ApplicationState "load-information": 5.2, generation 1259909635, version 45
    ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
    ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
  EndPointState 10.0.0.2
    HeartBeatState: generation 1259911052, version 63
    ApplicationState "load-information": 2.7, generation 1259911052, version 2
    ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
    ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
  EndPointState 10.0.0.3
    HeartBeatState: generation 1259812143, version 2142
    ApplicationState "load-information": 16.0, generation 1259812143, version 1803
    ApplicationState "normal": W2U1XYUC3wMppcY7, generation 1259812143, version 6
  17. Gossip

  10.0.0.1 -> 10.0.0.2  GOSSIP_DIGEST_SYN
    10.0.0.1:1259909635:325
    10.0.0.2:1259911052:61
    10.0.0.3:1259912238:5
    10.0.0.4:1259912942:18

  10.0.0.2 -> 10.0.0.1  GOSSIP_DIGEST_SYN_ACK
    10.0.0.1:1259909635:324
    10.0.0.3:1259912238:0
    10.0.0.4:1259912942:0
    10.0.0.2: [ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62],
              [HeartBeatState: generation 1259911052, version 63]
  18. Gossip (state on 10.0.0.1 after applying the GOSSIP_DIGEST_SYN_ACK)

  EndPointState 10.0.0.1
    HeartBeatState: generation 1259909635, version 325
    ApplicationState "load-information": 5.2, generation 1259909635, version 45
    ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
    ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
  EndPointState 10.0.0.2
    HeartBeatState: generation 1259911052, version 63
    ApplicationState "load-information": 2.7, generation 1259911052, version 2
    ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
    ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
  EndPointState 10.0.0.3
    HeartBeatState: generation 1259912238, version 5
    ApplicationState "load-information": 12.0, generation 1259912238, version 3
  EndPointState 10.0.0.4
    HeartBeatState: generation 1259912942, version 18
    ApplicationState "load-information": 6.7, generation 1259912942, version 3
    ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  19. Gossip

  10.0.0.1 -> 10.0.0.2  GOSSIP_DIGEST_SYN2
    10.0.0.1: [HeartBeatState: generation 1259909635, version 325]
    10.0.0.3: [ApplicationState "load-information": 12.0, generation 1259912238, version 3],
              [HeartBeatState: generation 1259912238, version 5]
    10.0.0.4: [ApplicationState "load-information": 6.7, generation 1259912942, version 3],
              [ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7],
              [HeartBeatState: generation 1259912942, version 18]

  10.0.0.2 -> 10.0.0.1  GOSSIP_DIGEST_SYN_ACK2
  20. Gossip (state on 10.0.0.2 after the round; both nodes are now in sync)

  EndPointState 10.0.0.1
    HeartBeatState: generation 1259909635, version 325
    ApplicationState "load-information": 5.2, generation 1259909635, version 45
    ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
    ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
  EndPointState 10.0.0.2
    HeartBeatState: generation 1259911052, version 63
    ApplicationState "load-information": 2.7, generation 1259911052, version 2
    ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
    ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
  EndPointState 10.0.0.3
    HeartBeatState: generation 1259912238, version 5
    ApplicationState "load-information": 12.0, generation 1259912238, version 3
  EndPointState 10.0.0.4
    HeartBeatState: generation 1259912942, version 18
    ApplicationState "load-information": 6.7, generation 1259912942, version 3
    ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
  21. Failure Detection
  • Valuable for system management, replication, load balancing, etc.
  • Defined as a failure detector that outputs a value, PHI, associated with each process.
  • Also known as an accrual (adaptive) failure detector, designed to adapt to changing network conditions.
  • The output value, PHI, represents a suspicion level.
  • Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions.
  22. Failure Detection
  • PHI estimation is done in three phases:
    – Inter-arrival times for each member are stored in a sampling window.
    – The distribution of these inter-arrival times is estimated; gossip inter-arrival times follow an exponential distribution.
    – The value of PHI is then computed as:

      phi = -log10(1 - (1 - e^(-(tNow - tLast) / mean_interval)))

    where mean_interval is the mean of the sampled inter-arrival times (the window's interval sum divided by its size).
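The three phases above can be sketched as follows. This is a simplified illustration under the slide's exponential-distribution assumption, not a port of Cassandra's actual FailureDetector; the window size and the heartbeat cadence are made up for the example.

```python
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=1000):
        # Phase 1: inter-arrival times kept in a bounded sampling window.
        self.intervals = deque(maxlen=window)
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        # Phase 2: estimate the mean of the exponential distribution.
        mean = sum(self.intervals) / len(self.intervals)
        # Phase 3: phi = -log10(P(a heartbeat arrives this late)),
        # which for the exponential model is -log10(e^(-t/mean)).
        t = now - self.last
        return -math.log10(math.exp(-t / mean))

d = PhiAccrualDetector()
for t in range(10):
    d.heartbeat(float(t))   # heartbeats arriving every 1.0s
print(d.phi(10.0))          # low suspicion right on schedule
print(d.phi(20.0))          # phi grows as the silence lengthens
```

Note that phi rises smoothly with elapsed silence instead of flipping a binary up/down flag, which is what lets applications pick their own threshold.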
  23. I/O
  • A client issues a write request to a random node in the Cassandra cluster.
  • The "Partitioner" determines the nodes responsible for the data.
  • Locally, write operations are logged and then applied to an in-memory version.
  • The commit log is stored on a dedicated disk local to the machine.
  24. I/O
  [Figure: a Read/Write Op arrives at node A on the ring of nodes A..F and is forwarded to the replicas responsible for the key.]
  25. I/O
  [Figure: the write path]
    Key (CF1, CF2, CF3) -> Commit Log (binary serialized, on a dedicated disk)
    Key (CF1, CF2, CF3) -> Memtable (CF1), Memtable (CF2), Memtable (CF3)
    Flush is triggered by: data size, lifetime.
    On flush, each memtable is written to disk as a data file plus an index file:
      Data file: <Size> <Index> <Serialized Cells> per row
      Index file: <Key, Offset> pairs, with sparse indexes (K128 -> offset, K256 -> offset, K384 -> offset, ...) and a Bloom filter of keys kept in memory.
  The storage architecture refers to relevant techniques of Google and other databases. It is similar to Bigtable, but its index scheme is different.
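The write path in the diagram can be sketched as follows. This is a toy model under assumed simplifications (one column family, a tiny count-based flush threshold instead of data size/lifetime, in-memory lists standing in for files):

```python
class WritePath:
    """A write is appended to the commit log first, then applied to the
    in-memory memtable; when the memtable grows past a threshold it is
    flushed to an immutable, key-sorted "SSTable"."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # on a dedicated disk in the slide
        self.memtable = {}     # in-memory, per column family
        self.sstables = []     # immutable sorted files on disk

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value            # then the in-memory version
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Serialize in sorted key order, as the data files are.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

wp = WritePath()
wp.write("K2", "a")
wp.write("K1", "b")
print(wp.sstables)  # one flushed, key-sorted table
```

The key property the sketch preserves is that every on-disk table is sorted and immutable; only the memtable ever changes in place.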
  26. I/O
  [Figure: compaction. Several key-sorted SSTables (containing keys such as K1..K30, one entry marked DELETED) are MERGE-SORTed into a single sorted data file, with a fresh index file (K1 -> offset, K5 -> offset, K30 -> offset, loaded in memory) and a fresh Bloom filter.]
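The merge-sort in the diagram can be sketched as follows. The (key, timestamp, value) triples and the use of None as a tombstone are assumptions for the example, not Cassandra's on-disk format:

```python
import heapq

def compact(*sstables):
    """Merge several key-sorted SSTables into one, keeping the
    newest value per key and dropping deleted rows."""
    merged = {}
    # heapq.merge streams the sorted inputs in order without
    # loading them all into memory at once.
    for key, ts, value in heapq.merge(*sstables):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)  # newest timestamp wins
    # Drop tombstones and emit in sorted key order.
    return [(k, v) for k, (ts, v) in sorted(merged.items())
            if v is not None]

old = [("K1", 1, "a"), ("K2", 1, "b"), ("K10", 1, "x")]
new = [("K2", 2, "b2"), ("K10", 2, None)]  # None marks a tombstone
print(compact(sorted(old), sorted(new)))
```

Because every input is already sorted, the merge is sequential I/O, which is why compaction is cheap per byte even though it rewrites whole files.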
  27. I/O
  [Figure: the four index levels on the read path]
    Level-1: consistent hashing maps the key onto the ring (h(key1), h(key2)) to pick the node (N=3 replicas).
    Level-2: a sparse block index in the Index file (one entry per 128 keys: K0, K128, K256, K384, ...; interval changeable; binary search; on disk, cacheable), plus a Bloom filter of the keys on the SSTable and a KeyCache, maps the key to a row position in the Data file.
    Level-3: within a row, a sorted map with a Bloom filter over the row's columns, split into 64KB (changeable) column blocks (a mirror of the column data on the row).
    Level-4: a block index (B-tree, binary search) maps each column block (Columns Block 0 ... Block N) to its position.
  Notes:
    – Totally 4 levels of indexing.
    – Indexes are relatively small.
    – A very good fit for storing per-individual data, such as users', etc.
    – Good for CDR data serving.
  28. I/O
  [Figure: the read path. The client sends a query to the Cassandra cluster; the full result is read from the closest replica (Replica A), while digest queries go to Replica B and Replica C. If the digest responses differ, a read repair is performed.]
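The digest comparison in the diagram can be sketched like this. Plain dicts stand in for replicas and MD5-of-repr stands in for the real digest; both are assumptions for the example:

```python
import hashlib

def read_with_digests(replicas, key):
    """Fetch the full value from the closest replica, digests from the
    rest, and flag a read repair when any digest differs."""
    def digest(value):
        return hashlib.md5(repr(value).encode()).hexdigest()

    closest, *others = replicas
    result = closest.get(key)
    needs_repair = any(digest(r.get(key)) != digest(result)
                       for r in others)
    return result, needs_repair

a = {"mccv": "v2"}
b = {"mccv": "v2"}
c = {"mccv": "v1"}  # a stale replica
print(read_with_digests([a, b, c], "mccv"))
```

Sending digests instead of full rows keeps the extra replica traffic small; the full value crosses the network only from the closest replica, and again only if a repair is needed.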
  29. I/O
  • Consistency Level (Write)
    • ZERO
      – Ensures nothing. The write happens asynchronously in the background.
    • ONE
      – Ensures that the write has been written to at least one node's commit log and memory table before responding to the client.
    • QUORUM
      – Ensures that the write has been written to <ReplicationFactor> / 2 + 1 nodes before responding to the client.
    • ALL
      – Ensures that the write is written to all <ReplicationFactor> nodes before responding to the client.
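The QUORUM arithmetic above is worth making concrete, because it is also why QUORUM writes combined with QUORUM reads (next slide) behave consistently: the two sets must overlap in at least one replica.

```python
def quorum(replication_factor):
    """The QUORUM rule from the slide: RF / 2 + 1 nodes."""
    return replication_factor // 2 + 1

# A quorum write plus a quorum read always overlap in at least
# one replica: W + R > RF.
for rf in (1, 3, 5):
    w = r = quorum(rf)
    assert w + r > rf
    print(f"RF={rf}: quorum={quorum(rf)}")
```

With the common RF=3, QUORUM therefore means 2 acknowledgements, so one slow or dead replica does not block the operation.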
  30. I/O
  • Consistency Level (Read)
    • ZERO
      – Not supported, because it doesn't make sense.
    • ONE
      – Will return the record returned by the first node to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called read repair.)
    • QUORUM
      – Will query all storage nodes and return the record with the most recent timestamp once at least a majority of replicas have reported. Again, the remaining replicas will be checked in the background.
  31. Case Study
  • CASE: CDR (Call Detail Record) Query
  • TIME: 2010
  32. Case Study
  [Figure: row layout. Key = User ID; one ColumnFamily per date (day), e.g. 20101020 ... 20101024; each holds that day's CDRs as cells sorted by timestamp.]
  • Schema
    – Key: the User ID (phone number), string
    – ColumnFamily: the date (day) name, string
    – Column: CDR, Thrift (or ProtocolBuffer) compacted encoding
  • Semantics
    – Each user's everyday CDRs are sorted by timestamp and stored together.
  • Stored Files
    – The SSTable files are separated by ColumnFamilies.
  • Data Patterns
    – A short set of temporal data that tends to be volatile.
    – An ever-growing set of data that rarely gets accessed.
  • Flexible and applicable to various CDR structures.
  33. Case Study
  • Hardware
    – Cluster with 9 nodes
      • 5 nodes: DELL PowerEdge R710
        – CPU: Intel(R) Xeon(R) E5520 @ 2.27GHz, cache size 8192 KB
        – Cores: 2x 4-core CPUs, HyperThreading => 16 cores
        – RAM: 16GB
        – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
      • 4 nodes: DELL PowerEdge 2970
        – CPU: Quad-Core AMD Opteron(tm) 2378, cache size 512 KB
        – Cores: 2x 4-core CPUs => 8 cores
        – RAM: 16GB
        – Hard Disk: 2x 1TB SATA 7.2k rpm, RAID0
      • Totally: 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB)
    – Network: within a single 1Gbps switch.
  • Linux: RedHat EL 5.3, kernel 2.6.18-128.el5
  • File System: Ext3
  • JDK: Sun Java 1.6.0_20-b02
  Note: the existing testbed and configuration are not ideal for performance. Preferred: a dedicated hard disk for the commit log; XFS/EXT4 file system; more memory to cache more indexes and metadata.
  34. Case Study
  • Each node runs 6 clients (threads), 54 clients in total.
  • Each client generates random CDRs for 50 million users/phone-numbers and puts them into Cassandra one by one.
    – Key Space: 50 million
    – Size of a CDR: Thrift-compacted encoding, ~200 bytes
  Throughput: average ~80K ops/s; per-node: average ~9K ops/s
  Latency: average ~0.5ms
  Bottleneck: network (and memory)
  35. Case Study
  • Each node runs 8 clients (threads), 72 clients in total.
  • Each client randomly picks a user-id/phone-number out of the 50-million space and gets its most recent 20 CDRs (one page) from Cassandra.
  • All clients read CDRs of the same day/bucket.
  • The 1st run: before compaction; on average 8 SSTables on each node per day.
  • The 2nd run: after compaction; only one SSTable on each node per day.
  36. Case Study (1st run, before compaction)
  [Figure: read-latency histogram, percentage of read ops per 100ms bucket (1..61), for one node and for the whole cluster (9 nodes).]
  Throughput: average ~140 ops/s; per-node: average ~16 ops/s
  Latency: average ~500ms, 97% < 2s (SLA)
  Bottleneck: disk IO (random seek); CPU load is very low
  37. Case Study (2nd run, after compaction)
  [Figure: read-latency histogram, percentage of read ops per 100ms bucket (1..33), for one node and for the whole cluster (9 nodes).]
  Compaction of ~8 SSTables, ~200GB. Time: 16-core node: 1:40; 8-core node: 2:25
  Throughput: average ~1.1K ops/s; per-node: average ~120 ops/s
  Latency: average ~60ms, 95% < 500ms (SLA)
  Bottleneck: disk IO (random seek); CPU load is very low
  38. Summary
  • Inspired by Dynamo;
  • Partitions keys with a Consistent Hash Ring;
  • Gossip: automatic node/failure detection;
  • Storage: local (different from HBase);
  • I/O: fast writes and slower reads;
  • Maintenance: not very easy.
  39. Thanks! Q&A
