Introduction to Cassandra 
DuyHai DOAN, Technical Advocate
@doanduyhai 
About me! 
2 
Duy Hai DOAN 
Cassandra technical advocate 
• talks, meetups, confs 
• open-source devs (Achilles, …) 
• OSS technical point of contact in Europe 
☞ duy_hai.doan@datastax.com 
• production troubleshooting
@doanduyhai 3 
Datastax! 
• Founded in April 2010 
• We drive Apache Cassandra™ 
• Home to Cassandra chair & most committers (≈80%) 
• Headquartered in the San Francisco Bay Area 
• EU headquarters in London, offices in France and Germany 
• Datastax Enterprise = OSS Cassandra + features
@doanduyhai 
Agenda! 
4 
• Architecture 
• Replication 
• Consistency 
• Read/write path 
• Data Model 
• Failure handling
What is Cassandra!
@doanduyhai 
Cassandra history! 
6 
NoSQL database 
• created at Facebook 
• open-sourced since 2008 
• current version = 2.1 
• column-oriented ☞ distributed table
Cassandra 5 key facts! 
@doanduyhai 
7 
Linear scalability 
Small & "huge" scale 
• 3 → 1,000+ node clusters 
• 3 GB → PB+
Cassandra 5 key facts! 
@doanduyhai 
8 
Continuous availability (≈100% up-time) 
• resilient architecture (Dynamo) 
• rolling upgrades 
• data format backward-compatible between versions n and n+1
Cassandra 5 key facts! 
@doanduyhai 
9 
Multi-data centers 
• out-of-the-box (config only) 
• AWS conf for multi-region DCs 
• GCE/CloudStack support 
• resilience, work-load segregation 
• virtual data-centers
Cassandra 5 key facts! 
@doanduyhai 
10 
Operational simplicity 
• 1 node = 1 process + 1 config file 
• deployment automation 
• OpsCenter for monitoring
Cassandra 5 key facts! 
@doanduyhai 
11 
Analytics combo 
• Cassandra + Spark = awesome ! 
• realtime streaming
Cassandra architecture!
Cassandra architecture! 
@doanduyhai 
13 
API (CQL & RPC) 
CLUSTER (DYNAMO) 
DATA STORE (BIG TABLES) 
DISKS 
Node1 
Client request 
API (CQL & RPC) 
CLUSTER (DYNAMO) 
DATA STORE (BIG TABLES) 
DISKS 
Node2
@doanduyhai 
Data distribution! 
14 
Random: hash of #partition → token = hash(#p) 
Hash range: ]-X, X] 
X = huge number (2⁶⁴/2 = 2⁶³) 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8
@doanduyhai 
Token Ranges! 
15 
A: ]0, X/8] 
B: ] X/8, 2X/8] 
C: ] 2X/8, 3X/8] 
D: ] 3X/8, 4X/8] 
E: ] 4X/8, 5X/8] 
F: ] 5X/8, 6X/8] 
G: ] 6X/8, 7X/8] 
H: ] 7X/8, X] 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
A 
B 
C 
D 
E 
F 
G 
H
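The range assignment above can be sketched in Python. This is an illustration only: a salted MD5 stands in for Cassandra's real Murmur3 partitioner, and the 8 equal ranges over ]-X, X] with X = 2⁶³ follow the slides.

```python
# Sketch: map a partition key to one of 8 equal token ranges over ]-X, X].
# MD5 is an illustrative stand-in for Cassandra's Murmur3 partitioner.
import hashlib

X = 2 ** 63  # Murmur3 tokens live in ]-2^63, 2^63]

def token(partition_key: str) -> int:
    """Stand-in hash: deterministic signed 64-bit token."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def owner(tok: int, num_nodes: int = 8) -> int:
    """Index of the node whose range ]lower, upper] contains the token."""
    range_size = (2 * X) // num_nodes
    idx = (tok + X) // range_size  # shift into [0, 2X[ then bucket
    return min(int(idx), num_nodes - 1)
```

With 8 nodes, every key lands deterministically in exactly one of the ranges A..H from the slide, which is what makes adding nodes a matter of splitting ranges.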
8 nodes 10 nodes 
@doanduyhai 
Linear scalability! 
16 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
n1 
n2 
n3 n4 
n5 
n6 
n7 
n9 n8 
n10
Q & A
Replication & Consistency Model!
Failure tolerance by replication! 
@doanduyhai 
19 
Replication Factor (RF) = 3 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
{B, A, H} 
{C, B, A} 
{D, C, B} 
A 
B 
C 
D 
E 
F 
G 
H
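A minimal sketch of this placement, assuming a SimpleStrategy-style rule (the primary owner of a range plus the next RF-1 nodes walking the ring clockwise; node names follow the slide):

```python
# Sketch: replica placement with RF = 3 on an 8-node ring.
# Assumption: SimpleStrategy-style rule, primary owner + next RF-1 nodes.

def replicas(primary_idx: int, ring: list, rf: int = 3) -> list:
    """Replica set for the range owned by ring[primary_idx]."""
    return [ring[(primary_idx + i) % len(ring)] for i in range(rf)]

ring = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
```

Each node thereby ends up storing its own range plus copies of two neighboring ranges, which is why the slide shows three range letters per node.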
@doanduyhai 
Coordinator node! 
Incoming requests (read/write) 
Coordinator node handles the request 
n1 
Every node can be coordinator → masterless 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
Client request
@doanduyhai 
Consistency! 
21 
Tunable at runtime 
• ONE 
• QUORUM (strict majority w.r.t. RF) 
• ALL 
Apply both to read & write
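The ack counts behind each level can be written down directly. A sketch; `required_acks` is an illustrative helper, not a driver API:

```python
# Sketch: acks required per consistency level, given the Replication Factor.
# QUORUM is a strict majority with respect to RF.

def required_acks(level: str, rf: int) -> int:
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1  # strict majority, e.g. 2 when RF = 3
    if level == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {level}")
```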
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
• wait for ONE ack before returning to client 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
• wait for ONE ack before returning to client 
• other acks later, asynchronously 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs 
10 μs 
120 μs
Write consistency! 
Write QUORUM 
• write request to all replicas in parallel 
• wait for QUORUM acks before returning to client 
• other acks later, asynchronously 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs 
10 μs 
120 μs
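The latency the client sees under each level follows from the slide's illustrative 5/10/120 μs replica ack times, sketched here:

```python
# Sketch: the coordinator sends to all replicas in parallel and returns to
# the client after the required number of acks; the rest arrive later,
# asynchronously. Latencies are the illustrative figures from the slides.

def client_latency_us(ack_latencies_us, acks_needed: int) -> int:
    """Client-observed latency = time of the acks_needed-th fastest ack."""
    return sorted(ack_latencies_us)[acks_needed - 1]
```

With acks at 5, 10 and 120 μs, Write ONE returns after the fastest replica, QUORUM after the second, ALL only after the slowest.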
Read consistency! 
Read ONE 
• read from one node among all replicas 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read ONE 
• read from one node among all replicas 
• contact the fastest node (based on latency stats) 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
@doanduyhai 
Read consistency! 
Read QUORUM 
• read from one fastest node 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
• repair if digest mismatch 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
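The reconciliation step can be sketched as follows, assuming each replica returns a (value, timestamp) cell; this illustrates the outcome, not the actual digest wire protocol:

```python
# Sketch: QUORUM read reconciliation. On digest mismatch the coordinator
# returns the cell with the highest write timestamp and repairs the stale
# replicas. Cells are (value, timestamp) tuples, an assumption for
# illustration.

def reconcile(replica_cells):
    """Return the most recent cell and the indices of stale replicas."""
    newest = max(replica_cells, key=lambda cell: cell[1])
    stale = [i for i, cell in enumerate(replica_cells) if cell != newest]
    return newest, stale
```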
Consistency in action! 
@doanduyhai 
32 
RF = 3, Write ONE, Read ONE 
B A A 
B A A 
Read ONE: A 
data replication in progress … 
Write ONE: B
Consistency in action! 
data replication in progress … 
@doanduyhai 
33 
RF = 3, Write ONE, Read QUORUM 
B A A 
Write ONE: B 
Read QUORUM: A 
B A A
Consistency in action! 
data replication in progress … 
@doanduyhai 
34 
RF = 3, Write ONE, Read ALL 
B A A 
Read ALL: B 
B A A 
Write ONE: B
Consistency in action! 
data replication in progress … 
@doanduyhai 
35 
RF = 3, Write QUORUM, Read ONE 
B B A 
Write QUORUM: B 
Read ONE: A 
B B A
Consistency in action! 
data replication in progress … 
@doanduyhai 
36 
RF = 3, Write QUORUM, Read QUORUM 
B B A 
Read QUORUM: B 
B B A 
Write QUORUM: B
@doanduyhai 
Consistency level! 
37 
ONE 
Fast, may not read latest written value
@doanduyhai 
Consistency level! 
38 
QUORUM 
Strict majority w.r.t. Replication Factor 
Good balance
@doanduyhai 
Consistency level! 
39 
ALL 
Paranoid 
Slow, no high availability
Consistency summary! 
ONE Read + ONE Write 
☞ available for read/write even with (RF-1) replicas down 
QUORUM Read + QUORUM Write 
☞ available for read/write even with 1 replica down (⌊(RF-1)/2⌋ in general) 
@doanduyhai 40
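The summary follows from the classic overlap rule, sketched here: a read is guaranteed to see the latest write when the read and write replica sets must intersect, i.e. R + W > RF.

```python
# Sketch: the overlap rule behind tunable consistency.
# A read sees the latest write when read acks + write acks > RF.

def strongly_consistent(read_acks: int, write_acks: int, rf: int) -> bool:
    return read_acks + write_acks > rf
```

With RF = 3: ONE + ONE gives 1 + 1 = 2, no guaranteed overlap; QUORUM + QUORUM gives 2 + 2 = 4 > 3, so the read always touches at least one replica that acked the write.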
Q & A
Read & Write Path!
Cassandra Write Path! 
@doanduyhai 
43 
Commit log1 
. . . 
1 
Commit log2 
Commit logn 
Memory
Cassandra Write Path! 
@doanduyhai 
44 
Memory 
MemTable 
Table1 
Commit log1 
. . . 
1 
Commit log2 
Commit logn 
MemTable 
Table2 
MemTable 
TableN 
2 
. . .
Cassandra Write Path! 
Table2 Table3 
SStable2 SStable3 3 
@doanduyhai 
45 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Memory 
. . .
Cassandra Write Path! 
MemTable . . . Memory 
Table1 
@doanduyhai 
46 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Table2 Table3 
SStable2 SStable3 
MemTable 
Table2 
MemTable 
TableN 
. . .
Cassandra Write Path! 
SStable3 . . . 
@doanduyhai 
47 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Memory 
Table2 Table3 
SStable2 SStable3 
SStable1 
SStable2
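The three steps above, append to a commit log for durability, update the in-memory MemTable, flush it to an immutable SSTable, can be sketched as (class name and flush threshold are illustrative):

```python
# Sketch of the write path: commit log append (1), MemTable update (2),
# flush to an immutable, sorted SSTable when the MemTable fills up (3).

class WritePath:
    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []        # sequential, append-only (step 1)
        self.memtable = {}          # in-memory structure (step 2)
        self.sstables = []          # immutable sorted files (step 3)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```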
Cassandra Read Path! 
@doanduyhai 
48 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
SStable1 SStable2 SStable3
Cassandra Read Path! 
Bloom Filter 
Bloom Filter 
? ? ? 
@doanduyhai 
49 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
2 Bloom Filter 
SStable1 SStable2 SStable3
Bloom filters in action! 
Write #partition = foo, #partition = bar 
• each key is hashed by h1, h2, h3 and the matching bits are set to 1 
1 0 0 1* 0 0 1 0 1 1 
Read #partition = qux 
• the same bit positions are checked; any 0 means the key is definitely absent 
@doanduyhai 
50 
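A toy version of the 10-bit, 3-hash filter on the slide; salted MD5 hashes stand in for the real hash functions:

```python
# Sketch: a tiny Bloom filter. A False answer is definitive, so the read
# path can safely skip that SSTable; a True answer may be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 10, hashes: int = 3):
        self.bits = [0] * size
        self.size = size
        self.hashes = hashes

    def _positions(self, key: str):
        for salt in range(self.hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("foo")
bf.add("bar")
```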
Partition Key Cache 
Cassandra Read Path! 
2 Bloom Filter 
Bloom Filter 
Bloom Filter 
× ✓ × 
@doanduyhai 
51 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 
SStable1 SStable2 SStable3
Partition Key Cache! 
Partition Key Cache SStable 
@doanduyhai 
52 
#partition001 
data 
data 
…. 
#partition002 
#partition350 
data 
data 
… 
#Partition Offset … 
#partition001 0x0 … 
#partition002 0x153 … 
#partition350 0x5464321 …
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
53 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
SStable2 
Bloom Filter
@doanduyhai 
Key Index Sample! 
54 
SStable 
#partition001 
data 
data 
…. 
#partition128 
#partition350 
data 
data 
… 
Key Index Sample 
Sample Offset 
#partition001 0x0 
#partition128 0x4500 
#partition256 0x851513 
#partition512 0x5464321
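The lookup can be sketched as a binary search over the sample: find the greatest sampled key not after the requested key, then scan forward from its offset. Offsets are the illustrative values on the slide.

```python
# Sketch: Key Index Sample lookup. Only every Nth partition key is kept
# with its offset; a read binary-searches the sample for a start position.
import bisect

SAMPLE = [("#partition001", 0x0),
          ("#partition128", 0x4500),
          ("#partition256", 0x851513),
          ("#partition512", 0x5464321)]

def nearest_offset(partition_key: str) -> int:
    """Offset of the greatest sampled key <= partition_key."""
    keys = [k for k, _ in SAMPLE]
    i = bisect.bisect_right(keys, partition_key) - 1
    return SAMPLE[max(i, 0)][1]
```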
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
55 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
5 Compression Offset 
SStable2 
Bloom Filter
Compression Offset! 
@doanduyhai 
56 
Compression Offset 
Normal offset Compressed offset 
0x0 0x0 
0x4500 0x100 
0x851513 0x1353 
0x5464321 0x21245
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
57 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
5 Compression Offset 
6 SStable2 
7 MemTable 
Σ 
Bloom Filter
Data model! 
Last Write Win! 
CQL basics! 
From SQL to CQL!
Last Write Win (LWW)! 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
@doanduyhai 
59 
jdoe 
age 
name 
33 John DOE 
#partition
Last Write Win (LWW)! 
@doanduyhai 
jdoe 
age (t1) name (t1) 
33 John DOE 
60 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
auto-generated timestamp (μs) 
Last Write Win (LWW)! 
SSTable1 SSTable2 
@doanduyhai 
61 
UPDATE users SET age = 34 WHERE login = 'jdoe'; 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
SSTable1 SSTable2 SSTable3 
@doanduyhai 
62 
DELETE age FROM users WHERE login = 'jdoe'; 
tombstone 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
@doanduyhai 
63 
SELECT age FROM users WHERE login = 'jdoe'; 
? ? ? 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
@doanduyhai 
64 
SELECT age FROM users WHERE login = 'jdoe'; 
✕ ✕ ✓ 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
@doanduyhai 
Compaction! 
65 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34 
New SSTable 
jdoe 
age (t3) name (t1) 
ý John DOE
@doanduyhai 
CRUD operations! 
66 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
UPDATE users SET age = 34 WHERE login = 'jdoe'; 
DELETE age FROM users WHERE login = 'jdoe'; 
SELECT age FROM users WHERE login = 'jdoe';
@doanduyhai 
DDL Statements! 
67 
Keyspace (tables container) 
• CREATE/ALTER/DROP 
Table 
• CREATE/ALTER/DROP 
Type (custom type) 
• CREATE/ALTER/DROP
User management & permissions! 
@doanduyhai 
68 
User 
• CREATE/ALTER/DROP 
Permissions 
• GRANT <xxx> PERMISSION ON <resource> TO <user_name> 
• REVOKE <xxx> PERMISSION ON <resource> FROM <user_name>
@doanduyhai 
Simple Table! 
69 
CREATE TABLE users ( 
login text, 
name text, 
age int, 
… 
PRIMARY KEY(login)); 
partition key (#partition)
@doanduyhai 
Clustered table! 
70 
CREATE TABLE mailbox ( 
login text, 
message_id timeuuid, 
interlocutor text, 
message text, 
PRIMARY KEY((login), message_id)); 
partition key + clustering column ☞ uniqueness
@doanduyhai 
Queries! 
71 
Get message by user and message_id (date) 
SELECT * FROM mailbox WHERE login = 'jdoe' 
and message_id = '2014-09-25 16:00:00'; 
Get message by user and date interval 
SELECT * FROM mailbox WHERE login = 'jdoe' 
and message_id <= '2014-09-25 16:00:00' 
and message_id >= '2014-09-20 16:00:00';
@doanduyhai 
Queries! 
72 
Get message by message_id only (#partition not provided) 
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00'; 
Get message by date interval (#partition not provided) 
SELECT * FROM mailbox WHERE 
message_id <= '2014-09-25 16:00:00' 
and message_id >= '2014-09-20 16:00:00';
Get message by user range (range query on #partition) 
Get message by user pattern (non exact match on #partition) 
@doanduyhai 
Queries! 
73 
SELECT * FROM mailbox WHERE login >= 'hsue' and login <= 'jdoe'; 
SELECT * FROM mailbox WHERE login like '%doe%';
WHERE clause restrictions! 
@doanduyhai 
74 
All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition 
Only exact match (=) on #partition; range queries (<, ≤, >, ≥) not allowed 
• ☞ they would require a full cluster scan 
On clustering columns, only range queries (<, ≤, >, ≥) and exact match 
WHERE clause only possible on columns defined in PRIMARY KEY
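These restrictions can be sketched as a tiny checker; this is an illustration of the rules above, not the real CQL grammar:

```python
# Sketch: WHERE clause rules. The partition key needs an exact match,
# clustering columns allow = and range operators, anything else is rejected.

ALLOWED_OPS = ("=", "<", "<=", ">", ">=")

def check_where(restrictions, partition_key, clustering_cols):
    """restrictions: {column: operator}. True if the query is allowed."""
    if restrictions.get(partition_key) != "=":
        return False  # missing or non-exact #partition
    for col, op in restrictions.items():
        if col == partition_key:
            continue
        if col not in clustering_cols:
            return False  # only PRIMARY KEY columns in WHERE
        if op not in ALLOWED_OPS:
            return False
    return True
```

Against the mailbox table: a query on login = … plus a message_id range passes, while a query on message_id alone, a range on login, or a filter on interlocutor is rejected.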
Collections & maps! 
@doanduyhai 
75 
CREATE TABLE users ( 
login text, 
name text, 
age int, 
friends set<text>, 
hobbies list<text>, 
languages map<int, text>, 
… 
PRIMARY KEY(login));
Collections & maps! 
@doanduyhai 
76 
All elements loaded at each access 
• ☞ low cardinality only (≤ 1000 elements) 
Should not be used for 1-N, N-N relations
User Defined Type (UDT)! 
@doanduyhai 
77 
Instead of 
CREATE TABLE users ( 
login text, 
… 
street_number int, 
street_name text, 
postcode int, 
country text, 
… 
PRIMARY KEY(login));
User Defined Type (UDT)! 
@doanduyhai 
78 
CREATE TYPE address ( 
street_number int, 
street_name text, 
postcode int, 
country text); 
CREATE TABLE users ( 
login text, 
… 
location frozen <address>, 
… 
PRIMARY KEY(login));
@doanduyhai 
UDT in action! 
79 
INSERT INTO users(login, name, location) VALUES ( 
'jdoe', 
'John DOE', 
{ 
street_number: 124, 
street_name: 'Congress Avenue', 
postcode: 95054, 
country: 'USA' 
});
@doanduyhai 
From SQL to CQL! 
80 
Remember…
@doanduyhai 
From SQL to CQL! 
81 
Remember… 
CQL is not SQL
@doanduyhai 
From SQL to CQL! 
82 
Remember… 
there is no join 
(do you want to scale ?)
@doanduyhai 
From SQL to CQL! 
83 
Remember… 
there is no integrity constraint 
(do you want to read-before-write ?)
@doanduyhai 
From SQL to CQL! 
84 
1-1, N-1 : normalized 
User 
1 
n 
Comment 
CREATE TABLE comments ( 
article_id uuid, 
comment_id timeuuid, 
author_id uuid, // typical join id 
content text, 
PRIMARY KEY((article_id), comment_id));
@doanduyhai 
From SQL to CQL! 
85 
1-1, N-1 : de-normalized 
User 
1 
n 
Comment 
CREATE TABLE comments ( 
article_id uuid, 
comment_id timeuuid, 
author person, // person is UDT 
content text, 
PRIMARY KEY((article_id), comment_id));
@doanduyhai 
From SQL to CQL! 
86 
N-1, N-N : normalized 
n 
User 
n 
CREATE TABLE friends( 
login text, 
friend_login text, // typical join id 
PRIMARY KEY((login), friend_login));
@doanduyhai 
From SQL to CQL! 
87 
N-1, N-N : de-normalized 
n 
User 
n 
CREATE TABLE friends( 
login text, 
friend_login text, 
friend person, // person is UDT 
PRIMARY KEY((login), friend_login));
Data modeling best practices! 
@doanduyhai 
88 
Start by queries 
• identify core functional read paths 
• 1 read scenario ≈ 1 SELECT
Data modeling best practices! 
@doanduyhai 
89 
Start by queries 
• identify core functional read paths 
• 1 read scenario ≈ 1 SELECT 
Denormalize 
• wisely, only duplicate necessary & immutable data 
• functional/technical trade-off
Data modeling best practices! 
@doanduyhai 
90 
Person UDT 
- firstname/lastname 
- date of birth 
- sex 
- mood 
- location
Data modeling best practices! 
@doanduyhai 
91 
John DOE, male 
birthdate: 21/02/1981 
subscribed since 03/06/2011 
☉ San Mateo, CA 
’’Impossible is not John DOE’’ 
Full detail read from 
User table on click
Q & A
Failure Handling!
Failure Handling! 
Failure occurs when 
• node hardware failure (disk, …) 
• network issues 
• heavy load, node flapping 
• JVM long GCs 
@doanduyhai
@doanduyhai 
Hinted Handoff! 
When 1 replica down 
• coordinator stores Hints 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
× 
Hint
Hinted Handoff! 
When replica up 
• coordinator forward stored Hints 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
Hints
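Both hinted-handoff slides together can be sketched as an in-memory model (illustrative; not the real hint storage format):

```python
# Sketch: hinted handoff. While a replica is down the coordinator stores a
# hint per missed write, and forwards the stored hints when it comes back.

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas   # {name: {"up": bool, "data": []}}
        self.hints = {}            # {replica_name: [mutations]}

    def write(self, mutation):
        for name, node in self.replicas.items():
            if node["up"]:
                node["data"].append(mutation)
            else:
                self.hints.setdefault(name, []).append(mutation)

    def on_replica_up(self, name):
        node = self.replicas[name]
        node["up"] = True
        for mutation in self.hints.pop(name, []):
            node["data"].append(mutation)  # forward stored hints
```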
Consistent read! 
Read at QUORUM or more 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
• repair if digest mismatch 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read Repair! 
Read at CL < QUORUM 
• read from one least-loaded node 
• every x reads, request digests from other replicas 
• compare digests & repair asynchronously 
@doanduyhai 
• read_repair_chance (10%) 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Manual repair! 
Should be part of cluster operations 
• scheduled 
• I/O intensive 
@doanduyhai 
Use incremental repair (new in 2.1)
Q & A
Training Day | December 3rd 
Beginner Track 
• Introduction to Cassandra 
• Introduction to Spark, Shark, Scala and 
Cassandra 
Advanced Track 
• Data Modeling 
• Performance Tuning 
Conference Day | December 4th 
Cassandra Summit Europe 2014 will be the single 
largest gathering of Cassandra users in Europe. 
Learn how the world's most successful companies are 
transforming their businesses and growing faster than 
ever using Apache Cassandra. 
http://bit.ly/cassandrasummit2014 
@doanduyhai 101
Thank You 
@doanduyhai 
duy_hai.doan@datastax.com 
https://academy.datastax.com/

Cassandra introduction (ApacheCon 2014, Budapest)
