Cassandra under the hood
2017
Who I am
Andriy Rymar
• Java Software Engineer @ Lohika
• More than 7 years of experience
What we won’t
• Learn how to use Cassandra
• Learn about performance tuning
• Learn how to manage a cluster
• Learn how to interact with Cassandra
What we will
• Learn what Cassandra is
Content
• General overview
• Data model
• Architecture
• Read & Write operations
Preface
RDBMS
• RDBMS is not bad
• RDBMS has been successful for the last 40 years
RDBMS
Issues
• Slow queries due to complex joins, and reindexing data takes a long time
• Expensive vertical scaling and problems with horizontal scaling
• When you try to replicate the database, you hurt the availability of the system
CAP
• Consistency
• Availability
• Partition tolerance
[Figure: CAP triangle showing where RDBMS and NoSQL systems sit]
CA, CP, AP
• Consistency & Availability
• Consistency & Partition-tolerance
• Availability & Partition-tolerance
Eventual consistency
[Figure: replicas moving from value V0 to V1 in an eventually consistent system without any failures]
[Figure: replicas moving from value V0 to V1 in an eventually consistent system with failures]
Solution
• Google BigTable: 2004
• Cassandra: 2008 (2010, 2013)
• Amazon DynamoDB: 2012
Cassandra
General Overview
Cassandra cluster
Tokens & Seed node & Ring representation
• Tokens determine the position of a node in the ring and the portion of data it is responsible for
[Figure: ring of nodes N1, N2, N3 with tokens A, G, R]
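Cassandra cluster
• Each node owns a token range: N1 A-F, N2 G-Q, N3 R-Z
• Replication Factor (RF) = 2
• Example row: pk: «Taras», message: «Hello»
[Figure: ring of N1, N2, N3; with RF = 2 each node also stores a second range (G-Q, R-Z, A-F)]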
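A minimal sketch of how the row from this slide could be placed, assuming (this is an assumption, not something the slide states) that N1 owns A-F, N2 owns G-Q, N3 owns R-Z, and that extra replicas go to the next nodes clockwise on the ring. The class name LetterRing and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: three nodes own first-letter ranges, as on the slide.
final class LetterRing {
    // First letter of each node's range, in ring order (assumed: N1 = A-F, N2 = G-Q, N3 = R-Z).
    static final char[] RANGE_START = {'A', 'G', 'R'};
    static final String[] NODES = {"N1", "N2", "N3"};

    // Returns the owner of the key plus the next rf - 1 nodes clockwise.
    static List<String> replicasFor(String partitionKey, int rf) {
        char first = Character.toUpperCase(partitionKey.charAt(0));
        int owner = 0;
        for (int i = 0; i < RANGE_START.length; i++) {
            if (first >= RANGE_START[i]) {
                owner = i; // last range whose start is not after the key's first letter
            }
        }
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < rf; i++) {
            replicas.add(NODES[(owner + i) % NODES.length]);
        }
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println(replicasFor("Taras", 2)); // prints [N3, N1]
    }
}
```

With RF = 2 the key «Taras» falls into R-Z, so under these assumptions it lives on N3 and on the next node, N1.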
Tokens
Issues
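• Initial token values have to be managed manually for all nodes
• Big overhead when restoring node data
import java.math.BigInteger;
// Initial token of node i: (2^64 / CLUSTER_SIZE) * i - 2^63
BigInteger step = BigInteger.TWO.pow(64).divide(BigInteger.valueOf(CLUSTER_SIZE));
for (int i = 0; i < CLUSTER_SIZE; i++) {
    System.out.println(step.multiply(BigInteger.valueOf(i)).subtract(BigInteger.TWO.pow(63)));
}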
[Figure: ring of N1, N2, N3 with Replication Factor (RF) = 2 and a new N2 restoring its data]
Virtual Nodes
[Figure: ring split into 12 virtual nodes (1 to 12) distributed across Server1, Server2, Server3, Server4]
Virtual Nodes
Data restoring
[Figure: servers S1, S2, S3, S4 with vnode = 3 and RF = 2]
V-nodes
Summary
• Rebalancing a cluster is no longer necessary when adding or removing nodes
• More powerful machines can have more v-nodes, which makes it possible to build a heterogeneous Cassandra ring
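The summary above can be illustrated with a small sketch. It assumes that every v-node is simply a random token on the ring and that a token is owned by the first v-node at or after it; the class VnodeRing and its methods are hypothetical names, not Cassandra's API.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of a ring built from virtual nodes.
final class VnodeRing {
    // token -> server that owns the v-node placed at this token
    final TreeMap<Long, String> ring = new TreeMap<>();
    private final Random random;

    VnodeRing(long seed) {
        this.random = new Random(seed);
    }

    // Adds a server with the given number of v-nodes, each at a random token.
    void addServer(String name, int vnodes) {
        int added = 0;
        while (added < vnodes) {
            if (ring.putIfAbsent(random.nextLong(), name) == null) {
                added++;
            }
        }
    }

    // Owner of a token: first v-node at or after it, wrapping around the ring.
    // Assumes at least one server has been added.
    String ownerOf(long token) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    long vnodeCount(String name) {
        return ring.values().stream().filter(name::equals).count();
    }

    public static void main(String[] args) {
        VnodeRing ring = new VnodeRing(42L);
        ring.addServer("Server1", 3);
        ring.addServer("Server2", 3);
        ring.addServer("Server3", 3);
        ring.addServer("Server4", 6); // a more powerful machine takes more v-nodes
        System.out.println(ring.ring.size() + " v-nodes, token 0 is owned by " + ring.ownerOf(0L));
    }
}
```

Adding a server only inserts new tokens; the tokens of the existing servers stay where they are, which is why no manual rebalancing step appears in the sketch.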
Cassandra
Data model
Introduction to the data model
[Figure: a KEYSPACE contains a Table (column family); each row has a partition key and its own columns. Example columns: column1, column2, column3; age, email (test@test.com), name]
Column family
• RDBMS table «user»
name | email | title | age
Taras | tm@gm.com | Staff Engineer | -
Andriy | ar@gm.com | - | 27
…
• Column family «user»
key: Taras, value: email: tm@gm.com, title: Staff Engineer
key: Andriy, value: email: ar@gm.com, age: 27
…
Column family
"user" : {
    "Taras" : {
        "email" : "tm@gm.com",
        "title" : "Staff Engineer"
    },
    "Andriy" : {
        "email" : "ar@gm.com",
        "age" : "27"
    }
}
Other differences
• No relations (No Joins)
• Tuples (key-value pairs) are naturally sorted
• You may want to denormalize the data model
• No transactions
Type of keys
• Primary key
• Composite key
• Partition key
• Clustering key
• Composite partition key
Example 1
CREATE TABLE album (
    id uuid,
    name text,
    PRIMARY KEY (id)
);
id is the primary key and the partition key at the same time
Composite key
Example 2
CREATE TABLE author_book (
    author text,
    book text,
    population int,
    PRIMARY KEY (author, book)
);
(author, book) is the primary key; author is the partition key and book is the clustering key
Example 3
Key with composite partition & clustering keys
CREATE TABLE teacher_lesson (
    teacher text,
    lesson text,
    topic text,
    duration int,
    PRIMARY KEY ((teacher, lesson), topic, duration)
);
(teacher, lesson) is the composite partition key; topic and duration are the clustering keys
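To make the roles of the two key parts in Example 3 concrete, here is a hedged in-memory sketch: the composite partition key selects a partition, and rows inside a partition stay sorted by the clustering keys. The class TeacherLessonSketch and the sample values are hypothetical; real storage is far more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory model of the teacher_lesson table.
final class TeacherLessonSketch {
    // Clustering key (topic, duration), compared in that order.
    static final class Clustering implements Comparable<Clustering> {
        final String topic;
        final int duration;

        Clustering(String topic, int duration) {
            this.topic = topic;
            this.duration = duration;
        }

        @Override
        public int compareTo(Clustering other) {
            int byTopic = topic.compareTo(other.topic);
            return byTopic != 0 ? byTopic : Integer.compare(duration, other.duration);
        }

        @Override
        public String toString() {
            return topic + ":" + duration;
        }
    }

    // (teacher, lesson) -> rows of that partition, sorted by clustering key
    final Map<List<String>, TreeSet<Clustering>> partitions = new HashMap<>();

    void insert(String teacher, String lesson, String topic, int duration) {
        partitions.computeIfAbsent(List.of(teacher, lesson), k -> new TreeSet<>())
                  .add(new Clustering(topic, duration));
    }

    // Rows of one partition in clustering order.
    List<String> rows(String teacher, String lesson) {
        List<String> out = new ArrayList<>();
        for (Clustering c : partitions.getOrDefault(List.of(teacher, lesson), new TreeSet<>())) {
            out.add(c.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        TeacherLessonSketch table = new TeacherLessonSketch();
        table.insert("Ann", "Math", "Sets", 45);
        table.insert("Ann", "Math", "Algebra", 60);
        table.insert("Ann", "Math", "Algebra", 30);
        System.out.println(table.rows("Ann", "Math")); // prints [Algebra:30, Algebra:60, Sets:45]
    }
}
```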
Row vs Partition
[Figure: Rows: a grid of rows numbered 1 to 12]
[Figure: Partitions: Node 1 holds 1234:user, 1234:address, 1234:details; Node 2 holds 5678:user, 5678:address, 5678:details]
Coffee break
Cassandra
Architecture
Cassandra components
• API, Tools
• Storage layer
• Partitioner, Replicator
• Failure detector, Compaction Manager
• Messaging layer
Messaging service
• Each node keeps 2 open socket connections with every other node
• In a cluster of 5 nodes, each node therefore has 8 open socket connections
Gossip
How does Cassandra initiate sessions?
• One session with a random live node
• One session with a random unreachable node
• If the node in point 1 is not a seed node, one more session with a random seed node
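The three rules above can be sketched as a target-selection routine. This follows the slide literally and leaves out any probabilistic details a real implementation may apply; GossipRound and pickTargets are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of choosing gossip partners for one round.
final class GossipRound {
    static List<String> pickTargets(List<String> live, List<String> unreachable,
                                    Set<String> seeds, Random random) {
        List<String> targets = new ArrayList<>();

        // Rule 1: one session with a random live node.
        String liveTarget = null;
        if (!live.isEmpty()) {
            liveTarget = live.get(random.nextInt(live.size()));
            targets.add(liveTarget);
        }

        // Rule 2: one session with a random unreachable node.
        if (!unreachable.isEmpty()) {
            targets.add(unreachable.get(random.nextInt(unreachable.size())));
        }

        // Rule 3: if the node from rule 1 is not a seed, also talk to a random seed.
        boolean talkedToSeed = liveTarget != null && seeds.contains(liveTarget);
        if (!talkedToSeed && !seeds.isEmpty()) {
            List<String> seedList = new ArrayList<>(seeds);
            targets.add(seedList.get(random.nextInt(seedList.size())));
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> targets = pickTargets(
                List.of("N2", "N3"), List.of("N4"), Set.of("N1"), new Random());
        System.out.println(targets);
    }
}
```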
Gossip
Session between N1 and N2
1. N1 to N2: GossipDigestSynMessage
2. N2 to N1: GossipDigestAckMessage
3. N1 to N2: GossipDigestAck2Message
Failure detection
ϕ accrual failure detector
• Doesn’t answer with a plain TRUE / FALSE
• Provides a continuous value instead
• This value is called «ϕ»
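A simplified sketch of how such a continuous value can be computed. It assumes heartbeat intervals are exponentially distributed, so the probability that the next heartbeat is still to come after waiting t is exp(-t / mean) and ϕ = -log10 of that probability. The class PhiAccrual is hypothetical and keeps only a running mean rather than a sliding window of samples.

```java
// Hypothetical, simplified phi accrual failure detector.
final class PhiAccrual {
    private double meanIntervalMs; // running mean of observed heartbeat intervals
    private long lastHeartbeatMs;
    private int samples;

    PhiAccrual(long firstHeartbeatMs) {
        this.lastHeartbeatMs = firstHeartbeatMs;
    }

    // Records a heartbeat and updates the mean interval.
    void heartbeat(long nowMs) {
        long interval = nowMs - lastHeartbeatMs;
        samples++;
        meanIntervalMs += (interval - meanIntervalMs) / samples;
        lastHeartbeatMs = nowMs;
    }

    // phi = -log10(exp(-t / mean)) = t / (mean * ln 10).
    // Requires at least one recorded interval.
    double phi(long nowMs) {
        double sinceLast = nowMs - lastHeartbeatMs;
        return sinceLast / (meanIntervalMs * Math.log(10.0));
    }

    public static void main(String[] args) {
        PhiAccrual detector = new PhiAccrual(0L);
        detector.heartbeat(1000L);
        detector.heartbeat(2000L);
        detector.heartbeat(3000L);
        System.out.println(detector.phi(4000L));  // small: one normal interval has passed
        System.out.println(detector.phi(13000L)); // ten times larger: node looks suspicious
    }
}
```

The caller compares ϕ against a threshold of its choosing, so the suspicion level grows smoothly with silence instead of flipping between two states.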
Failure detection
ϕ accrual failure detector
[Figure: timeline of sessions 1 to 5 with 1s and 2s marks]
Failure detection
The ϕ accrual failure detector was proposed by Naohiro Hayashibara, Xavier Défago et al. in 2004
https://goo.gl/xS0kB0
Partitioner
[Figure: all terabytes of data are split across nodes N1 to N8]
Partitioner
• Murmur3Partitioner
• RandomPartitioner
• ByteOrderedPartitioner
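A rough sketch of what a hash-based partitioner does, in the spirit of RandomPartitioner: the partition key is hashed with MD5, the digest is read as a non-negative big integer token, and the token is looked up on the ring. The class HashPartitionerSketch, the evenly spaced ring and the node names are assumptions made for illustration only.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a hash partitioner over a token ring.
final class HashPartitionerSketch {
    // token -> node owning the range that ends at this token
    final TreeMap<BigInteger, String> ring = new TreeMap<>();

    // Builds a ring where nodes N1..Nn own equal slices of [0, 2^127].
    static HashPartitionerSketch evenRing(int nodes) {
        HashPartitionerSketch p = new HashPartitionerSketch();
        BigInteger max = BigInteger.ONE.shiftLeft(127);
        for (int i = 1; i <= nodes; i++) {
            BigInteger token = max.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodes));
            p.ring.put(token, "N" + i);
        }
        return p;
    }

    // Token = absolute value of the MD5 digest read as a signed big integer.
    static BigInteger token(String partitionKey) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(md5).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Node responsible for the key: first ring token at or after the key's token.
    String nodeFor(String partitionKey) {
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(token(partitionKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashPartitionerSketch partitioner = evenRing(8);
        System.out.println("Taras -> " + partitioner.nodeFor("Taras"));
    }
}
```

Because the hash spreads keys uniformly, equal slices of the token space receive roughly equal amounts of data, which is the point of the figure with N1 to N8.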
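Replicator
• Replication factor = 3: the write request is stored on N1, N2, N3
• Consistency Level = 2: the write is acknowledged once N1 and N2 have confirmed it
[Figure: a write data request sent to a cluster of N1, N2, N3, N4]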
Consistency level
• ZERO (write only): push and forget
• ANY (write only): success even if only a hint for the write was stored
• ONE: the first replica returned successfully
• QUORUM: N/2 + 1 replicas succeeded, where N is the replication factor
• ALL: all replicas succeeded
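The levels above translate into a number of replica acknowledgements the coordinator waits for. A small sketch, with the hypothetical names ConsistencyLevels and requiredAcks, and with N taken to be the replication factor:

```java
// Hypothetical sketch: how many replica acknowledgements each level needs.
final class ConsistencyLevels {
    enum Level { ZERO, ANY, ONE, QUORUM, ALL }

    static int requiredAcks(Level level, int replicationFactor) {
        switch (level) {
            case ZERO:
                return 0; // push and forget
            case ANY:
                return 1; // a stored hint also counts as this one acknowledgement
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1; // integer division
            case ALL:
                return replicationFactor;
            default:
                throw new IllegalArgumentException("Unknown level: " + level);
        }
    }

    public static void main(String[] args) {
        System.out.println(requiredAcks(Level.QUORUM, 3)); // prints 2
        System.out.println(requiredAcks(Level.ALL, 3));    // prints 3
    }
}
```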
Replicator
Inconsistency
• 5 node cluster
• Replication factor 3
• Consistency level 1
[Figure: nodes N1 to N5 with replicas on N2, N3, N4; with consistency level 1 a write and a read can hit different replicas]
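Replicator
Tuning
• Use consistency levels that give at least 1 node of overlap between writes and reads (Quorum)
• Replication factor = 3, Write CL = 2, Read CL = 2
[Figure: nodes N1 to N5 with replicas on N2, N3, N4; the write and the read share at least one replica]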
Replicator
Tuning
• Tune read and write CL separately to reach high performance
• Fast write: Write CL = 1, Read CL = ALL
• Fast read: Write CL = ALL, Read CL = 1
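The tuning advice on the Replicator slides rests on one inequality: a read is guaranteed to touch at least one replica that saw the latest write when Write CL + Read CL > Replication factor. A minimal sketch with hypothetical names:

```java
// Hypothetical helper for the read/write overlap rule.
final class ReadWriteOverlap {
    // Smallest possible number of replicas shared by any write set and any read set.
    static int minOverlap(int writeCl, int readCl, int replicationFactor) {
        return Math.max(0, writeCl + readCl - replicationFactor);
    }

    static boolean overlaps(int writeCl, int readCl, int replicationFactor) {
        return minOverlap(writeCl, readCl, replicationFactor) >= 1;
    }

    public static void main(String[] args) {
        int rf = 3;
        System.out.println(overlaps(1, 1, rf)); // prints false: stale reads are possible
        System.out.println(overlaps(2, 2, rf)); // prints true: quorum writes and reads
        System.out.println(overlaps(1, 3, rf)); // prints true: fast write, read ALL
        System.out.println(overlaps(3, 1, rf)); // prints true: write ALL, fast read
    }
}
```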
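Storage layer
[Figure: write path. The Client sends a Mutation Request; it is appended to the Commit log (hdd) and added / updated in the MemTable (mem); the MemTable is flushed to an SSTable (hdd), followed by cleanup of the Commit log]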
SSTable
• On-disk representation of a flushed MemTable
• Immutable
• Eventually gets merged into larger SSTable files (compaction)
• Has the following components:
    • Bloom filter
    • Index file
    • Data file
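A toy model of the storage layer described here: append to the commit log, add / update the MemTable, flush it into an immutable SSTable, then clean the commit log up. Everything lives in memory in this sketch, and StorageLayerSketch with its size-based flush threshold is a hypothetical simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory model of commit log, MemTable and SSTables.
final class StorageLayerSketch {
    final List<String> commitLog = new ArrayList<>();                    // append-only; on disk in reality
    TreeMap<String, String> memTable = new TreeMap<>();                  // sorted, in memory
    final List<SortedMap<String, String>> ssTables = new ArrayList<>();  // immutable; on disk in reality
    private final int flushThreshold;

    StorageLayerSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. append to the commit log
        memTable.put(key, value);         // 2. add / update in the MemTable
        if (memTable.size() >= flushThreshold) {
            flush();
        }
    }

    void flush() {
        // 3. the MemTable becomes a read-only SSTable
        ssTables.add(Collections.unmodifiableSortedMap(memTable));
        memTable = new TreeMap<>();
        // 4. cleanup: flushed mutations no longer need the commit log
        commitLog.clear();
    }

    public static void main(String[] args) {
        StorageLayerSketch storage = new StorageLayerSketch(2);
        storage.write("Taras", "Hello");
        storage.write("Andriy", "Hi");
        System.out.println(storage.ssTables); // prints [{Andriy=Hi, Taras=Hello}]
    }
}
```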
SSTable
Bloom filter
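• The Bloom filter is used to decide whether an SSTable may contain the requested key
• A Bloom filter can return false positives, but never false negatives
• Stored in memory (on heap in early versions, off-heap since Cassandra 1.2)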
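A minimal Bloom filter sketch showing why the answer is either "definitely not here" or "maybe here". It uses String.hashCode with double hashing as a stand-in for murmur3, which is an assumption of this example; BloomFilterSketch is a hypothetical class.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilterSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // i-th bit position for a key, derived from two simple hashes (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false: the key was definitely never added; true: it may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(17, 3);
        filter.add("Taras");
        System.out.println(filter.mightContain("Taras")); // prints true
    }
}
```

On a read, an SSTable whose filter answers false can be skipped without touching the disk.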
SSTable
Bloom filter
[Figure: Bloom filter array shown empty (all 0), with bits set, and with per-position counters; example hash murmur3(`key`) = 15]
SSTable
Index file
• Contains all row keys and their offsets in the data file
• Every 128th key from the file is kept in memory
• Binary search over the in-memory sample finds the right position in the index file
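The sampling idea above can be sketched as follows: only every 128th key is kept in memory, a binary search finds the last sampled key that is not greater than the requested one, and the on-disk index is scanned from that position. Keys are plain long values here for simplicity, and SampledIndex is a hypothetical class.

```java
import java.util.Arrays;

// Hypothetical sketch of a sampled (in-memory) index over a sorted on-disk index.
final class SampledIndex {
    static final int SAMPLE_EVERY = 128;
    final long[] sampledKeys; // every 128th key of the sorted index file

    SampledIndex(long[] sortedIndexKeys) {
        int count = (sortedIndexKeys.length + SAMPLE_EVERY - 1) / SAMPLE_EVERY;
        sampledKeys = new long[count];
        for (int i = 0; i < count; i++) {
            sampledKeys[i] = sortedIndexKeys[i * SAMPLE_EVERY];
        }
    }

    // Position in the index file where the scan for the key should start,
    // or -1 if the key is smaller than every key in the file.
    int scanStart(long key) {
        int pos = Arrays.binarySearch(sampledKeys, key);
        if (pos < 0) {
            pos = -pos - 2; // index of the last sample that is <= key
        }
        return pos < 0 ? -1 : pos * SAMPLE_EVERY;
    }

    public static void main(String[] args) {
        long[] indexFile = new long[400];
        for (int i = 0; i < indexFile.length; i++) {
            indexFile[i] = i; // keys 0..399, so position and key coincide
        }
        SampledIndex index = new SampledIndex(indexFile);
        System.out.println(index.scanStart(201)); // prints 128
    }
}
```

Looking up key 201 therefore reads the index file from position 128 onward, at most 128 entries, instead of scanning it from the start.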
SSTable
Index file
[Figure: lookup of key 201. In memory: the Bloom filter (BF) and the Sampled Index (1, 128, 256, 384). On hdd: the Index file (…, 126, 127, 128, …, 199, 200, 201, 202, 203, …)]
Compaction
• Merges SSTables
• There are two compaction strategies
• size-tiered
• leveled
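Independently of the strategy, the core of compaction is a merge of several SSTables into one. A hedged sketch, assuming each cell carries a write timestamp and the newest version of a key wins; deletions (tombstones) are left out, and CompactionSketch is a hypothetical class.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of merging SSTables during compaction.
final class CompactionSketch {
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Merges the tables into one; for a key present in several tables
    // the cell with the highest timestamp is kept.
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> ssTables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : ssTables) {
            for (Map.Entry<String, Cell> entry : table.entrySet()) {
                merged.merge(entry.getKey(), entry.getValue(),
                        (oldCell, newCell) -> newCell.timestamp > oldCell.timestamp ? newCell : oldCell);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, Cell> first = new TreeMap<>();
        first.put("A", new Cell("a1", 1L));
        first.put("B", new Cell("b1", 1L));
        SortedMap<String, Cell> second = new TreeMap<>();
        second.put("A", new Cell("a2", 2L));
        second.put("C", new Cell("c1", 2L));
        SortedMap<String, Cell> merged = compact(List.of(first, second));
        System.out.println(merged.keySet() + " A=" + merged.get("A").value); // prints [A, B, C] A=a2
    }
}
```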
Compaction
Size-tiered (Minor compaction)
[Figure: SSTables A, B, C, D, E, F being merged in size tiers; Compaction Level = 2]
Compaction
Size-tiered (Major compaction)
[Figure: SSTables A, B, C, D, E, F in a major compaction]
Data repair
• Hinted handoff
• Read repair
• Anti-entropy
Coffee break
Cassandra
Read & Write operations
Write
[Figure: a Write Request arrives at Node1, whose StorageProxy forwards it to Node2, where it passes through the Commit Log, MemTable and SSTable]
Write
Replication & Consistency
• Replication Factor = 3
• Consistency Level = 2
• Hinted handoff
• Anti-entropy / Read repair
[Figure: ring of N1, N2, N3, N4]
Read
Snitch Function
[Figure: a Read request arrives at Node1; its StorageProxy has to decide which node to ask (?)]
Read
Snitch Function
• SimpleSnitch
• DynamicSnitch
• PropertyFileSnitch
• GossipingPropertyFileSnitch
• RackInferringSnitch
…
Read
Snitch functions
SimpleSnitch
[Figure: ring of nodes N1 to N7]
Read
Snitch functions
DynamicSnitch
[Figure: nodes N1, N2, N3 ranked 1, 2, 3 by response times 0.6ms, 0.4ms, 0.9ms]
Read
Snitch functions
GossipingPropertyFileSnitch
$CASSANDRA_HOME/conf/cassandra-rackdc.properties
# indicate the rack and dc for this node
dc=DC1
rack=RAC1
Read
In action
• 7 nodes • RF = 4 • CL = 3
[Figure: the Read request reaches the StorageProxy on Node1, which asks one replica to read data and two replicas to get a digest; the ring contains Node 2 to Node 7]
Read on node
[Figure: each SSTable is consulted through its Bloom filter (B) and index (I)]
Almost the end
Throughput comparison
[Figure: throughput comparison chart]
Source: Nishant Neeraj, «Mastering Apache Cassandra - Second Edition»
Publication
A Real Comparison Of NoSQL Databases HBase, Cassandra & MongoDB
https://goo.gl/z5abRu
Summary
The end
Any questions or problems?
Resources
• Nishant Neeraj, «Mastering Apache Cassandra - Second Edition», 2015
• http://docs.datastax.com/en/cassandra/3.x/cassandra/cassandraAbout.html
Thank you
