2. Agenda
• Introduction
• Data Model
• Architecture
• Gossip
• Consistency
• I/O (Read And Write)
• Case Study
3. Introduction
Apache Cassandra is an open
source distributed database management system. It is
an Apache Software Foundation top-level
project designed to handle very large amounts of data
spread out across many commodity servers while
providing a highly available service with no single point
of failure. It is a NoSQL solution that was initially
developed by Facebook and powered their Inbox
Search feature until late 2010. Jeff Hammerbacher,
who led the Facebook Data team at the time, has
described Cassandra as a BigTable data model running
on an Amazon Dynamo-like infrastructure.
wikipedia.org
4. Data Model
• Key
– RowKey: Identity of a ROW;
• Cluster
– the machines (nodes) in a logical Cassandra instance. Clusters can
contain multiple keyspaces;
• Keyspace
– a namespace for ColumnFamilies, typically one per application;
• ColumnFamilies
– contain multiple columns, each of which has a name, value, and a
timestamp, and which are referenced by row keys;
• Column
– the lowest/smallest increment of data. It's a tuple (triplet) that
contains a name, a value and a timestamp;
• SuperColumns
– can be thought of as columns that themselves have subcolumns;
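To make the nesting concrete, here is a minimal illustrative sketch in Python (all keyspace, column-family, and row names are made up; this is not a client API): a keyspace holds column families, a column family maps row keys to columns, and each column is a (name, value, timestamp) triplet.

```python
import time

# Illustrative sketch of the Cassandra data model as nested maps.
# Names like "UserData" and "row-key-1" are invented for this example.

def make_column(name, value):
    """A column is a (name, value, timestamp) triplet."""
    return {"name": name, "value": value, "timestamp": time.time()}

cluster = {
    "UserData": {                      # keyspace: one per application
        "MailList": {                  # column family: container for rows
            "row-key-1": [             # row, addressed by its row key
                make_column("tid1", b"..."),
                make_column("tid2", b"..."),
            ],
        },
        "WordList": {                  # a "super" column family:
            "row-key-2": {             # columns that have subcolumns
                "aloha": [make_column("C1", b"V1"), make_column("C2", b"V2")],
                "dude":  [make_column("C2", b"V2"), make_column("C6", b"V6")],
            },
        },
    },
}
```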
5. Data Model
• Keyspaces
– The container for column families;
– Keyspaces are of roughly the same
granularity as a schema or database (i.e. a
logical collection of tables) in the RDBMS
world;
– They are the configuration and
management point for column families,
and are also the structure on which batch
inserts are applied.
6. Data Model
• Column Families
– A column family is a container for rows;
– Analogous to the table in a relational system;
– Each row in a column family can be referenced
by its key;
– Each column family is stored in a separate file,
and the file is sorted in row (i.e. key) major
order;
– Related columns, those that you'll access
together, should be kept within the same
column family.
10. Data Model
[Figure: example keyspace layout]
• Column families are declared up front; columns and SuperColumns are added and modified dynamically.
• ColumnFamily1 — Name: MailList, Type: Simple, Sort: Name. A row holds columns tid1..tid4, each with a binary value and a timestamp (t1..t4).
• ColumnFamily2 — Name: WordList, Type: Super, Sort: Time. A row holds SuperColumns "aloha" and "dude", each containing its own columns (e.g. C1..C4 with values V1..V4 and timestamps T1..T4).
• ColumnFamily3 — Name: System, Type: Super, Sort: Name. A row holds SuperColumns hint1..hint4, each containing a column list.
14. Gossip
• A gossip protocol is used for cluster
membership.
• Super lightweight, with mathematically
provable properties.
• Every second, each member increments
its heartbeat counter and selects one
other member to send its state list to.
• The receiving member merges the
incoming list with its own.
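A rough sketch of one gossip round as described above, with invented data structures (endpoint mapped to a (generation, heartbeat version) pair; addresses echo the example slides that follow); the receiver keeps whichever entry is fresher.

```python
import random
import time

# Hypothetical in-memory state: endpoint -> (generation, heartbeat version).
SELF = "10.0.0.1"
local_state = {SELF: (1259909635, 325)}
members = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]

def gossip_round(send):
    """One round: bump our own heartbeat, ship our state list to one member."""
    gen, version = local_state[SELF]
    local_state[SELF] = (gen, version + 1)      # increment heartbeat counter
    peer = random.choice(members)               # pick one other member
    send(peer, dict(local_state))               # send a copy of our state list

def merge(remote_state):
    """Receiver side: keep the fresher entry (higher generation, then version)."""
    for endpoint, remote in remote_state.items():
        if endpoint not in local_state or remote > local_state[endpoint]:
            local_state[endpoint] = remote

def run(send):
    while True:                                 # one gossip round every second
        gossip_round(send)
        time.sleep(1)
```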
15. Gossip
(Gossip between 10.0.0.1 and 10.0.0.2:)
EndPointState 10.0.0.1
HeartBeatState: generation 1259909635, version 325
ApplicationState "load-information": 5.2, generation 1259909635, version 45
ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
HeartBeatState: generation 1259911052, version 61
ApplicationState "load-information": 2.7, generation 1259911052, version 2
ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
EndPointState 10.0.0.3
HeartBeatState: generation 1259912238, version 5
ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
HeartBeatState: generation 1259912942, version 18
ApplicationState "load-information": 6.7, generation 1259912942, version 3
ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
16. Gossip
(Gossip between 10.0.0.1 and 10.0.0.2:)
EndPointState 10.0.0.1
HeartBeatState: generation 1259909635, version 324
ApplicationState "load-information": 5.2, generation 1259909635, version 45
ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
HeartBeatState: generation 1259911052, version 63
ApplicationState "load-information": 2.7, generation 1259911052, version 2
ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
HeartBeatState: generation 1259812143, version 2142
ApplicationState "load-information": 16.0, generation 1259812143, version 1803
ApplicationState "normal": W2U1XYUC3wMppcY7, generation 1259812143, version 6
18. Gossip
(Gossip between 10.0.0.1 and 10.0.0.2:)
EndPointState 10.0.0.1
HeartBeatState: generation 1259909635, version 325
ApplicationState "load-information": 5.2, generation 1259909635, version 45
ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
HeartBeatState: generation 1259911052, version 63
ApplicationState "load-information": 2.7, generation 1259911052, version 2
ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
HeartBeatState: generation 1259912238, version 5
ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
HeartBeatState: generation 1259912942, version 18
ApplicationState "load-information": 6.7, generation 1259912942, version 3
ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
19. Gossip
(10.0.0.1 → 10.0.0.2: GOSSIP_DIGEST_SYN)
10.0.0.1:
[HeartBeatState,
generation 1259909635, version 325]
10.0.0.3:
[ApplicationState
"load-information": 12.0,
generation 1259912238, version 3],
[ HeartBeatState:
generation 1259912238, version 5]
10.0.0.4:
[ApplicationState
"load-information": 6.7,
generation 1259912942, version 3],
[ApplicationState
"normal": bj05IVc0lvRXw2xH,
generation 1259912942, version 7],
[HeartBeatState: generation 1259912942, version 18]
(GOSSIP_DIGEST_SYN_ACK2 reply between 10.0.0.1 and 10.0.0.2)
20. Gossip
(Gossip between 10.0.0.1 and 10.0.0.2:)
EndPointState 10.0.0.1
HeartBeatState: generation 1259909635, version 325
ApplicationState "load-information": 5.2, generation 1259909635, version 45
ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
ApplicationState "normal": bxLpassF3XD8Kyks, generation 1259909635, version 87
EndPointState 10.0.0.2
HeartBeatState: generation 1259911052, version 63
ApplicationState "load-information": 2.7, generation 1259911052, version 2
ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
ApplicationState "normal": AujDMftpyUvebtnn, generation 1259911052, version 62
EndPointState 10.0.0.3
HeartBeatState: generation 1259912238, version 5
ApplicationState "load-information": 12.0, generation 1259912238, version 3
EndPointState 10.0.0.4
HeartBeatState: generation 1259912942, version 18
ApplicationState "load-information": 6.7, generation 1259912942, version 3
ApplicationState "normal": bj05IVc0lvRXw2xH, generation 1259912942, version 7
21. Failure Detection
• Valuable for system management,
replication, load balancing etc.
• Defined as a failure detector that outputs
a value, PHI, associated with each process.
• Also known as Adaptive Failure detectors
- designed to adapt to changing network
conditions.
• The value output, PHI, represents a
suspicion level.
• Applications set an appropriate threshold,
trigger suspicions and perform
appropriate actions.
22. Failure Detection
• PHI estimation is done in three phases:
– Inter-arrival times for each member are stored in a sampling
window.
– The distribution of these inter-arrival times is estimated (gossip
inter-arrival times are assumed to follow an exponential distribution).
– The value of PHI is then computed as follows:
phi = -log( 1 - ( 1 - e^( -(tNow - tLast) / interval ) ) )

where (tNow - tLast) is the time since the last heartbeat and interval is the mean inter-arrival time in the sampling window.
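Under the exponential assumption, PHI reduces to a simple function of the time since the last heartbeat and the mean inter-arrival time in the window. A hedged sketch (base-10 log is an assumption; the slide does not specify the base, and the variable names are invented):

```python
import math

def phi(t_now, t_last, arrival_window):
    """Accrual failure detector suspicion level, assuming the inter-arrival
    times in `arrival_window` follow an exponential distribution."""
    mean_interval = sum(arrival_window) / len(arrival_window)
    t = t_now - t_last                        # time since the last heartbeat
    p_later = math.exp(-t / mean_interval)    # P(next heartbeat still to come)
    return -math.log10(p_later)               # -log(1 - F(t)), F(t) = 1 - e^(-t/mean)

# Heartbeats arriving about once a second; 5 s of silence gives
# phi = 5 / ln(10) ≈ 2.2, crossing a low suspicion threshold.
print(phi(t_now=105.0, t_last=100.0, arrival_window=[1.0, 0.9, 1.1, 1.0]))
```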
23. I/O
• A client issues a write request to a
random node in the Cassandra cluster.
• The “Partitioner” determines the
nodes responsible for the data.
• Locally, write operations are logged
and then applied to an in-memory
version.
• Commit log is stored on a dedicated
disk local to the machine.
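A simplified sketch of that local write path (file name and record format are made up, not Cassandra's actual on-disk format): append the mutation to the commit log first, then apply it to the in-memory table.

```python
import json
import time

class LocalWritePath:
    """Illustrative local write path: log the mutation, then apply it to the
    in-memory table."""

    def __init__(self, commit_log_path="commitlog.bin"):
        self.commit_log = open(commit_log_path, "ab")   # ideally a dedicated disk
        self.memtable = {}                              # in-memory version

    def write(self, row_key, column, value):
        record = {"key": row_key, "col": column, "value": value, "ts": time.time()}
        self.commit_log.write((json.dumps(record) + "\n").encode())
        self.commit_log.flush()                         # a real log would also fsync
        self.memtable.setdefault(row_key, {})[column] = (value, record["ts"])
        return "ack"

# node = LocalWritePath(); node.write("user-42", "tid1", "hello")
```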
25. I/O
[Figure: local write and flush path]
• A write for a key touches its ColumnFamilies (CF1, CF2, CF3): it is binary-serialized into the commit log (kept on a dedicated disk) and applied to the per-ColumnFamily memtables (Memtable CF1/CF2/CF3).
• Memtables are flushed to disk, triggered by data size or lifetime.
• A flush produces two files: a data file (<size> <index> <serialized cells> per row) and an index file mapping keys to offsets (e.g. K128, K256, K384 — sparse indexes kept in memory), plus a Bloom filter of the keys.
The storage architecture borrows techniques from Google and other databases; it is similar to Bigtable, but its index scheme is different.
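A minimal sketch of the flush step described above, assuming a plain dict memtable and JSON serialization (the real row format is binary and the real index is sparse):

```python
import json

def serialize(key, columns):
    # Stand-in for the binary row format (<size> <index> <serialized cells>).
    return (json.dumps({"key": key, "columns": columns}) + "\n").encode()

def flush_memtable(memtable, data_path, index_path):
    """Illustrative flush: write rows in key order, record key -> offset,
    and remember every key (a set stands in for the Bloom filter)."""
    index, key_filter = {}, set()
    with open(data_path, "wb") as data:
        for key in sorted(memtable):        # data file is sorted in key order
            index[key] = data.tell()        # the real index is sparse; dense here
            data.write(serialize(key, memtable[key]))
            key_filter.add(key)
    with open(index_path, "w") as idx:
        for key, offset in index.items():
            idx.write(f"{key}\t{offset}\n")
    return key_filter
```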
26. I/O
K2 < Serialized data > K4 < Serialized data >
K1 < Serialized data >
K10 < Serialized data > K5 < Serialized data >
K2 < Serialized data >
K30 < Serialized data > K10 < Serialized data >
K3 < Serialized data >
DELETED
-- --
--
Sorted -- Sorted --
Sorted --
-- --
--
MERGE SORT
Index File
K1 < Serialized data >
Loaded in memory K2 < Serialized data >
K3 < Serialized data >
K1 Offset
K4 < Serialized data >
K5 Offset Sorted
K5 < Serialized data >
K30 Offset
K10 < Serialized data >
Bloom Filter
K30 < Serialized data >
Data File
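A sketch of that merge-sort compaction, using plain Python lists as stand-ins for SSTables; tombstone handling is simplified to "drop a row whose newest version is a deletion".

```python
import heapq

def compact(sstables):
    """Illustrative compaction: merge-sort several key-sorted SSTables,
    keeping only the newest version of each row. Each sstable is a
    key-sorted list of (key, timestamp, row_or_None); None marks a deletion."""
    merged, current_key, newest = [], None, None
    for key, ts, row in heapq.merge(*sstables, key=lambda e: (e[0], e[1])):
        if key != current_key:
            if current_key is not None and newest[1] is not None:
                merged.append((current_key,) + newest)
            current_key, newest = key, (ts, row)
        elif ts > newest[0]:
            newest = (ts, row)
    if current_key is not None and newest[1] is not None:
        merged.append((current_key,) + newest)
    return merged

# K02 was deleted in the newer SSTable, so it disappears from the merged file.
old = [("K01", 1, "cdr-a"), ("K02", 1, "cdr-b"), ("K03", 1, "cdr-c")]
new = [("K02", 2, None), ("K10", 2, "cdr-d")]
print(compact([old, new]))   # [('K01', 1, 'cdr-a'), ('K03', 1, 'cdr-c'), ('K10', 2, 'cdr-d')]
```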
27. I/O
Index Level-1
Consistent Hash Index Level-3
1 0 h(key1)
Sorted Map, BloomFilter
E 64KB (Changeable)
A mirror of data of Columns on Row
N=3
C K0 K0
h(key2) F Columns Columns Columns Columns
Key Index Block 0 Block 1
...
Block N
B
D
1/2
Index Level-4
Block Index
Range of B-Tree
K128 K128
Hash to Node
(Binary Search)
Columns Block 0 -> Position
BloomFilter Columns Block 1-> Position
of Keys on SSTable ...
K256 K256 Columns Block N -> Position
KeyCache
Inde Level-2
Block Index
B-Tree
K384 K384
(Binary Search)
K0
K128
K256 Totally 4 levels of indexing.
K384
Indexes are relatively small.
Key Position Maps Data Rows
Sparse Block Index
(Key interval = 128,
in Index file
[on disk, cachable]
in Data File Very fit to store data of a individuals,
[on disk]
changeable)
[in memory]
such as users, etc.
Good for CDR data serving.
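A sketch of a read walking through these index levels, with in-memory stand-ins for the on-disk structures (a set for the Bloom filter, a list for the data file); level 1, routing to a node, is assumed to have already happened.

```python
import bisect

def lookup(key, key_filter, sparse_index, rows):
    """key_filter : set of keys (stand-in for the SSTable's Bloom filter)
    sparse_index : sorted list of (key, position) pairs, one per 128 keys
    rows         : key-sorted list of (key, columns) pairs (the data file)"""
    if key not in key_filter:                       # level 2: Bloom filter
        return None
    index_keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(index_keys, key) - 1    # level 3: sparse block index
    if i < 0:
        return None
    pos = sparse_index[i][1]
    # Scan at most 128 entries from the indexed position (level 4 in the real
    # layout is a per-row block index over 64KB column blocks).
    for k, columns in rows[pos:pos + 128]:
        if k == key:
            return columns
    return None

rows = [(f"K{i:03d}", {"col": i}) for i in range(300)]
sparse = [(rows[i][0], i) for i in range(0, 300, 128)]       # K000, K128, K256
print(lookup("K200", {k for k, _ in rows}, sparse, rows))    # {'col': 200}
```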
28. I/O
Client
Query Result
Cassandra Cluster
Closest replica Result Read repair if
digests differ
Replica A
Digest Query
Digest Response Digest Response
Replica B Replica C
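A sketch of this coordinator-side read, with plain dicts standing in for replicas and MD5 digests of the value; the repair rule shown (newest timestamp wins, pushed to all replicas) is a simplification.

```python
import hashlib

def coordinator_read(key, replicas):
    """Full read from the closest replica, digest reads from the others,
    read repair if the digests differ. `replicas` is a list of dicts."""
    closest, others = replicas[0], replicas[1:]
    value = closest.get(key)                                # data read
    digest = hashlib.md5(repr(value).encode()).hexdigest()
    for replica in others:                                  # digest queries
        other = replica.get(key)
        if hashlib.md5(repr(other).encode()).hexdigest() != digest:
            repair(key, value, other, replicas)             # read repair
            break
    return value

def repair(key, a, b, replicas):
    # Simplified: push the newer of the two versions (by timestamp) everywhere.
    newest = max((v for v in (a, b) if v is not None),
                 key=lambda v: v["timestamp"], default=None)
    if newest is not None:
        for replica in replicas:
            replica[key] = newest

a = {"k1": {"value": "new", "timestamp": 2}}
b = {"k1": {"value": "old", "timestamp": 1}}
print(coordinator_read("k1", [a, b, {}]))   # returns a's value, repairs the others
```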
29. I/O
• Consistency Level
– Write
• ZERO
– Ensure nothing; the write happens asynchronously in
the background.
• ONE
– Ensure that the write has been written to at least one
node's commit log and memtable before
responding to the client.
• QUORUM
– Ensure that the write has been written to
<ReplicationFactor> / 2 + 1 nodes before responding
to the client.
• ALL
– Ensure that the write is written to <ReplicationFactor>
nodes before responding to the client.
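These write levels boil down to how many replica acknowledgements the coordinator waits for before answering the client; a small sketch:

```python
def required_acks(level, replication_factor):
    """Replica acknowledgements to wait for, per the write levels above."""
    return {
        "ZERO":   0,                              # fire and forget
        "ONE":    1,
        "QUORUM": replication_factor // 2 + 1,    # e.g. RF=3 -> 2
        "ALL":    replication_factor,
    }[level]

print(required_acks("QUORUM", 3))   # 2
```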
30. I/O
• Consistency Level
– Read
• ZERO
– Not supported, because it doesn't make sense.
• ONE
– Will return the record returned by the first node to
respond. A consistency check is always done in a
background thread to fix any consistency issues when
ConsistencyLevel.ONE is used. This means subsequent
calls will have correct data even if the initial read gets
an older value. (This is called read repair.)
• QUORUM
– Will query all storage nodes and return the record with
the most recent timestamp once it has at least a
majority of replicas reported. Again, the remaining
replicas will be checked in the background.
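A sketch of the QUORUM read resolution described above: wait for a majority of replicas, then return the value with the newest timestamp (the background check of the remaining replicas is omitted).

```python
def quorum_read(responses, replication_factor):
    """Illustrative QUORUM read: take responses as they arrive, stop once a
    majority has answered, return the value with the newest timestamp."""
    needed = replication_factor // 2 + 1
    received = []
    for timestamp_and_value in responses:        # (timestamp, value) pairs
        received.append(timestamp_and_value)
        if len(received) >= needed:
            break
    if len(received) < needed:
        raise TimeoutError("quorum not reached")
    return max(received, key=lambda r: r[0])[1]  # newest timestamp wins

print(quorum_read([(5, "old"), (9, "newest"), (7, "older")], replication_factor=3))
```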
32. Case Study
[Figure: row layout — Key (User ID) -> one Date(Day) ColumnFamily per day (e.g. 20101020 ... 20101024), each holding CDR cells sorted by timestamp]
• Schema
– Key: the User ID (phone number), string
– ColumnFamily: the date (day) name, string
– Column: CDR, Thrift (or ProtocolBuffer) compacted encoding
• Semantics
– Each user's CDRs for a given day are sorted by timestamp and stored together.
• Stored Files
– The SSTable files are separated by ColumnFamilies.
• Data Patterns
– A short set of temporal data that tends to be volatile.
– An ever-growing set of data that rarely gets accessed.
• Flexible and applicable to various CDR structures (a sketch of this schema follows below).
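A minimal sketch of this schema with plain Python dicts (no client library; JSON stands in for the Thrift/ProtocolBuffer encoding): row key = phone number, one column family per day, columns keyed by CDR timestamp.

```python
import json
import time

store = {}   # {day_cf: {phone_number: {timestamp: encoded_cdr}}}

def put_cdr(phone, cdr):
    """Store one CDR under its day's column family, keyed by timestamp."""
    day_cf = time.strftime("%Y%m%d", time.gmtime(cdr["ts"]))
    row = store.setdefault(day_cf, {}).setdefault(phone, {})
    row[cdr["ts"]] = json.dumps(cdr).encode()   # compact encoding stand-in

def recent_cdrs(phone, day_cf, count=20):
    """One 'page': the newest `count` CDRs of a user for a given day,
    relying on the columns being sorted by timestamp."""
    row = store.get(day_cf, {}).get(phone, {})
    return [row[ts] for ts in sorted(row)[-count:]]

put_cdr("13900001111", {"ts": 1287543000, "callee": "13800002222", "secs": 42})
print(recent_cdrs("13900001111", "20101020"))
```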
33. Case Study
• Hardware
– Cluster with 9 nodes
• 5 nodes: DELL PowerEdge R710
– CPU: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz, cache size = 8192 KB
– Cores: 2x quad-core CPUs with HyperThreading => 16 cores
– RAM: 16GB
– Hard disk: 2x 1TB SATA 7.2k rpm, RAID0
• 4 nodes: DELL PowerEdge 2970
– CPU: Quad-Core AMD Opteron(tm) Processor 2378, cache size = 512 KB
– Cores: 2x quad-core CPUs => 8 cores
– RAM: 16GB
– Hard disk: 2x 1TB SATA 7.2k rpm, RAID0
• Total: 9 nodes, 112 cores, 144GB RAM, 18 hard disks (18TB)
– Network: a single 1Gbps switch
• Linux: RedHat EL 5.3, kernel 2.6.18-128.el5
• File system: Ext3
• JDK: Sun Java 1.6.0_20-b02
Note: the existing testbed and configuration are not ideal for performance. Preferred: a dedicated hard disk for the commit log; XFS/EXT4 file system; more memory to cache more indexes and metadata.
34. Case Study
• Each node runs 6 clients (threads), 54 clients in total.
• Each client generates random CDRs for 50 million users/phone-
numbers, and puts them into Cassandra one by one.
– Key Space: 50 million
– Size of a CDR: Thrift-compacted encoding, ~200 bytes
Throughput: average ~80K ops/s; per-node: average ~9K ops/s
Latency: average ~0.5ms
Bottleneck: network (and memory)
35. Case Study
• Each node runs 8 clients (threads), 72 clients in total.
• Each client picks a random user-id/phone-number out of the 50-million
space and gets its most recent 20 CDRs (one page) from Cassandra.
• All clients read CDRs from the same day/bucket.
------------------------------------------------------------------------------------
• The 1st run:
– Before compaction.
– On average, 8 SSTables per node per day.
• The 2nd run:
– After compaction.
– Only one SSTable per node per day.
36. Case Study
[Figure: read-latency distribution — percentage of read ops per 100ms bucket, for one node and for the whole cluster (9 nodes); 1st run, before compaction]
Throughput: average ~140 ops/s; per-node: average ~16 ops/s
Latency: average ~500ms, 97% < 2s (SLA)
Bottleneck: disk I/O (random seeks); CPU load is very low
37. Case Study
[Figure: read-latency distribution — percentage of read ops per 100ms bucket, for one node and for the whole cluster (9 nodes); 2nd run, after compaction]
Compaction of ~8 SSTables (~200GB) took 1:40 on a 16-core node and 2:25 on an 8-core node.
Throughput: average ~1.1K ops/s; per-node: average ~120 ops/s
Latency: average ~60ms, 95% < 500ms (SLA)
Bottleneck: disk I/O (random seeks); CPU load is very low
38. Summary
• Inspired by Dynamo;
• Partition keys with a Consistent Hash
Ring;
• Gossip: Automatic node/failure
detection;
• Storage: LOCAL (different from HBase);
• IO: Fast Write and Slower Read;
• Maintenance: Not Very Easy.