Introduction to Cassandra 
DuyHai DOAN, Technical Advocate
@doanduyhai 
About me! 
2 
Duy Hai DOAN 
Cassandra technical advocate 
• talks, meetups, confs 
• open-source devs (Achilles, …) 
• OSS technical point of contact in Europe 
☞ duy_hai.doan@datastax.com 
• production troubleshooting
@doanduyhai 3 
Datastax! 
• Founded in April 2010 
• We drive Apache Cassandra™ 
• Home to Cassandra chair & most committers (≈80%) 
• Headquartered in the San Francisco Bay Area 
• EU headquarters in London, offices in France and Germany 
• Datastax Enterprise = OSS Cassandra + features
@doanduyhai 
Agenda! 
4 
• Architecture 
• Replication 
• Consistency 
• Read/write path 
• Data Model 
• Failure handling
What is Cassandra!
@doanduyhai 
Cassandra history! 
6 
NoSQL database 
• created at Facebook 
• open-sourced since 2008 
• current version = 2.1 
• column-oriented ☞ distributed table
Cassandra 5 key facts! 
@doanduyhai 
7 
Linear scalability 
Small & "huge" scale 
• 3 → 1,000+ node clusters 
• 3 GB → PB+
Cassandra 5 key facts! 
@doanduyhai 
8 
Continuous availability (≈100% up-time) 
• resilient architecture (Dynamo) 
• rolling upgrades 
• data format backward-compatible between versions n and n+1
Cassandra 5 key facts! 
@doanduyhai 
9 
Multi-data centers 
• out-of-the-box (config only) 
• AWS conf for multi-region DCs 
• GCE/CloudStack support 
• resilience, work-load segregation 
• virtual data-centers
Cassandra 5 key facts! 
@doanduyhai 
10 
Operational simplicity 
• 1 node = 1 process + 1 config file 
• deployment automation 
• OpsCenter for monitoring
Cassandra 5 key facts! 
@doanduyhai 
11 
Analytics combo 
• Cassandra + Spark = awesome ! 
• realtime streaming
Cassandra architecture!
Cassandra architecture! 
@doanduyhai 
13 
API (CQL & RPC) 
CLUSTER (DYNAMO) 
DATA STORE (BIG TABLES) 
DISKS 
Node1 
Client request 
API (CQL & RPC) 
CLUSTER (DYNAMO) 
DATA STORE (BIG TABLES) 
DISKS 
Node2
@doanduyhai 
Data distribution! 
14 
Random: hash of #partition → token = hash(#p) 
Hash range: ]-X, X] 
X = huge number (2⁶⁴/2 = 2⁶³) 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8
@doanduyhai 
Token Ranges! 
15 
A: ]0, X/8] 
B: ] X/8, 2X/8] 
C: ] 2X/8, 3X/8] 
D: ] 3X/8, 4X/8] 
E: ] 4X/8, 5X/8] 
F: ] 5X/8, 6X/8] 
G: ] 6X/8, 7X/8] 
H: ] 7X/8, X] 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
A 
B 
C 
D 
E 
F 
G 
H
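The range assignment above can be sketched in Python. This is an illustration only: a salted MD5 stands in for Cassandra's real Murmur3 partitioner, and the 8 equal ranges over ]-X, X] with X = 2⁶³ follow the slides.

```python
# Sketch: map a partition key to one of 8 equal token ranges over ]-X, X].
# MD5 is an illustrative stand-in for Cassandra's Murmur3 partitioner.
import hashlib

X = 2 ** 63  # Murmur3 tokens live in ]-2^63, 2^63]

def token(partition_key: str) -> int:
    """Stand-in hash: deterministic signed 64-bit token."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def owner(tok: int, num_nodes: int = 8) -> int:
    """Index of the node whose range ]lower, upper] contains the token."""
    range_size = (2 * X) // num_nodes
    idx = (tok + X) // range_size  # shift into [0, 2X[ then bucket
    return min(int(idx), num_nodes - 1)
```

With 8 nodes, every key lands deterministically in exactly one of the ranges A..H from the slide, which is what makes adding nodes a matter of splitting ranges.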
8 nodes 10 nodes 
@doanduyhai 
Linear scalability! 
16 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
n1 
n2 
n3 n4 
n5 
n6 
n7 
n9 n8 
n10
Q & A
Replication & Consistency Model!
Failure tolerance by replication! 
@doanduyhai 
19 
Replication Factor (RF) = 3 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
{B, A, H} 
{C, B, A} 
{D, C, B} 
A 
B 
C 
D 
E 
F 
G 
H
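A minimal sketch of this placement, assuming a SimpleStrategy-style rule (the primary owner of a range plus the next RF-1 nodes walking the ring clockwise; node names follow the slide):

```python
# Sketch: replica placement with RF = 3 on an 8-node ring.
# Assumption: SimpleStrategy-style rule, primary owner + next RF-1 nodes.

def replicas(primary_idx: int, ring: list, rf: int = 3) -> list:
    """Replica set for the range owned by ring[primary_idx]."""
    return [ring[(primary_idx + i) % len(ring)] for i in range(rf)]

ring = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
```

Each node thereby ends up storing its own range plus copies of two neighboring ranges, which is why the slide shows three range letters per node.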
@doanduyhai 
Coordinator node! 
Incoming requests (read/write) 
Coordinator node handles the request 
n1 
Every node can be coordinator → masterless 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
Client request
@doanduyhai 
Consistency! 
21 
Tunable at runtime 
• ONE 
• QUORUM (strict majority w.r.t. RF) 
• ALL 
Apply both to read & write
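The ack counts behind each level can be written down directly. A sketch; `required_acks` is an illustrative helper, not a driver API:

```python
# Sketch: acks required per consistency level, given the Replication Factor.
# QUORUM is a strict majority with respect to RF.

def required_acks(level: str, rf: int) -> int:
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1  # strict majority, e.g. 2 when RF = 3
    if level == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {level}")
```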
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
• wait for ONE ack before returning to client 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs
Write consistency! 
Write ONE 
• write request to all replicas in parallel 
• wait for ONE ack before returning to client 
• other acks later, asynchronously 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs 
10 μs 
120 μs
Write consistency! 
Write QUORUM 
• write request to all replicas in parallel 
• wait for QUORUM acks before returning to client 
• other acks later, asynchronously 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
5 μs 
10 μs 
120 μs
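The latency the client sees under each level follows from the slide's illustrative 5/10/120 μs replica ack times, sketched here:

```python
# Sketch: the coordinator sends to all replicas in parallel and returns to
# the client after the required number of acks; the rest arrive later,
# asynchronously. Latencies are the illustrative figures from the slides.

def client_latency_us(ack_latencies_us, acks_needed: int) -> int:
    """Client-observed latency = time of the acks_needed-th fastest ack."""
    return sorted(ack_latencies_us)[acks_needed - 1]
```

With acks at 5, 10 and 120 μs, Write ONE returns after the fastest replica, QUORUM after the second, ALL only after the slowest.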
Read consistency! 
Read ONE 
• read from one node among all replicas 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read ONE 
• read from one node among all replicas 
• contact the fastest node (based on latency stats) 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
@doanduyhai 
Read consistency! 
Read QUORUM 
• read from one fastest node 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read consistency! 
Read QUORUM 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
• repair if digest mismatch 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
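The reconciliation step can be sketched as follows, assuming each replica returns a (value, timestamp) cell; this illustrates the outcome, not the actual digest wire protocol:

```python
# Sketch: QUORUM read reconciliation. On digest mismatch the coordinator
# returns the cell with the highest write timestamp and repairs the stale
# replicas. Cells are (value, timestamp) tuples, an assumption for
# illustration.

def reconcile(replica_cells):
    """Return the most recent cell and the indices of stale replicas."""
    newest = max(replica_cells, key=lambda cell: cell[1])
    stale = [i for i, cell in enumerate(replica_cells) if cell != newest]
    return newest, stale
```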
Consistency in action! 
@doanduyhai 
32 
RF = 3, Write ONE, Read ONE 
B A A 
B A A 
Read ONE: A 
data replication in progress … 
Write ONE: B
Consistency in action! 
data replication in progress … 
@doanduyhai 
33 
RF = 3, Write ONE, Read QUORUM 
B A A 
Write ONE: B 
Read QUORUM: A 
B A A
Consistency in action! 
data replication in progress … 
@doanduyhai 
34 
RF = 3, Write ONE, Read ALL 
B A A 
Read ALL: B 
B A A 
Write ONE: B
Consistency in action! 
data replication in progress … 
@doanduyhai 
35 
RF = 3, Write QUORUM, Read ONE 
B B A 
Write QUORUM: B 
Read ONE: A 
B B A
Consistency in action! 
data replication in progress … 
@doanduyhai 
36 
RF = 3, Write QUORUM, Read QUORUM 
B B A 
Read QUORUM: B 
B B A 
Write QUORUM: B
@doanduyhai 
Consistency level! 
37 
ONE 
Fast, may not read latest written value
@doanduyhai 
Consistency level! 
38 
QUORUM 
Strict majority w.r.t. Replication Factor 
Good balance
@doanduyhai 
Consistency level! 
39 
ALL 
Paranoid 
Slow, no high availability
Consistency summary! 
ONE Read + ONE Write 
☞ available for read/write even with (RF-1) replicas down 
QUORUM Read + QUORUM Write 
☞ available for read/write even with 1 replica down (⌊(RF-1)/2⌋ in general) 
@doanduyhai 40
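The summary follows from the classic overlap rule, sketched here: a read is guaranteed to see the latest write when the read and write replica sets must intersect, i.e. R + W > RF.

```python
# Sketch: the overlap rule behind tunable consistency.
# A read sees the latest write when read acks + write acks > RF.

def strongly_consistent(read_acks: int, write_acks: int, rf: int) -> bool:
    return read_acks + write_acks > rf
```

With RF = 3: ONE + ONE gives 1 + 1 = 2, no guaranteed overlap; QUORUM + QUORUM gives 2 + 2 = 4 > 3, so the read always touches at least one replica that acked the write.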
Q & A
Read & Write Path!
Cassandra Write Path! 
@doanduyhai 
43 
Commit log1 
. . . 
1 
Commit log2 
Commit logn 
Memory
Cassandra Write Path! 
@doanduyhai 
44 
Memory 
MemTable 
Table1 
Commit log1 
. . . 
1 
Commit log2 
Commit logn 
MemTable 
Table2 
MemTable 
TableN 
2 
. . .
Cassandra Write Path! 
Table2 Table3 
SStable2 SStable3 3 
@doanduyhai 
45 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Memory 
. . .
Cassandra Write Path! 
MemTable . . . Memory 
Table1 
@doanduyhai 
46 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Table2 Table3 
SStable2 SStable3 
MemTable 
Table2 
MemTable 
TableN 
. . .
Cassandra Write Path! 
SStable3 . . . 
@doanduyhai 
47 
Commit log1 
Commit log2 
Commit logn 
Table1 
SStable1 
Memory 
Table2 Table3 
SStable2 SStable3 
SStable1 
SStable2
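The three steps above, append to a commit log for durability, update the in-memory MemTable, flush it to an immutable SSTable, can be sketched as (class name and flush threshold are illustrative):

```python
# Sketch of the write path: commit log append (1), MemTable update (2),
# flush to an immutable, sorted SSTable when the MemTable fills up (3).

class WritePath:
    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []        # sequential, append-only (step 1)
        self.memtable = {}          # in-memory structure (step 2)
        self.sstables = []          # immutable sorted files (step 3)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```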
Cassandra Read Path! 
@doanduyhai 
48 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
SStable1 SStable2 SStable3
Cassandra Read Path! 
Bloom Filter 
Bloom Filter 
? ? ? 
@doanduyhai 
49 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
2 Bloom Filter 
SStable1 SStable2 SStable3
Bloom filters in action! 
Write #partition = foo, #partition = bar 
• each key is hashed by h1, h2, h3 and the matching bits are set to 1 
1 0 0 1* 0 0 1 0 1 1 
Read #partition = qux 
• the same bit positions are checked; any 0 means the key is definitely absent 
@doanduyhai 
50 
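A toy version of the 10-bit, 3-hash filter on the slide; salted MD5 hashes stand in for the real hash functions:

```python
# Sketch: a tiny Bloom filter. A False answer is definitive, so the read
# path can safely skip that SSTable; a True answer may be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 10, hashes: int = 3):
        self.bits = [0] * size
        self.size = size
        self.hashes = hashes

    def _positions(self, key: str):
        for salt in range(self.hashes):
            digest = hashlib.md5(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("foo")
bf.add("bar")
```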
Partition Key Cache 
Cassandra Read Path! 
2 Bloom Filter 
Bloom Filter 
Bloom Filter 
× ✓ × 
@doanduyhai 
51 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 
SStable1 SStable2 SStable3
Partition Key Cache! 
Partition Key Cache SStable 
@doanduyhai 
52 
#partition001 
data 
data 
…. 
#partition002 
#partition350 
data 
data 
… 
#Partition Offset … 
#partition001 0x0 … 
#partition002 0x153 … 
#partition350 0x5464321 …
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
53 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
SStable2 
Bloom Filter
@doanduyhai 
Key Index Sample! 
54 
SStable 
#partition001 
data 
data 
…. 
#partition128 
#partition350 
data 
data 
… 
Key Index Sample 
Sample Offset 
#partition001 0x0 
#partition128 0x4500 
#partition256 0x851513 
#partition512 0x5464321
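The lookup can be sketched as a binary search over the sample: find the greatest sampled key not after the requested key, then scan forward from its offset. Offsets are the illustrative values on the slide.

```python
# Sketch: Key Index Sample lookup. Only every Nth partition key is kept
# with its offset; a read binary-searches the sample for a start position.
import bisect

SAMPLE = [("#partition001", 0x0),
          ("#partition128", 0x4500),
          ("#partition256", 0x851513),
          ("#partition512", 0x5464321)]

def nearest_offset(partition_key: str) -> int:
    """Offset of the greatest sampled key <= partition_key."""
    keys = [k for k, _ in SAMPLE]
    i = bisect.bisect_right(keys, partition_key) - 1
    return SAMPLE[max(i, 0)][1]
```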
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
55 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
5 Compression Offset 
SStable2 
Bloom Filter
Compression Offset! 
@doanduyhai 
56 
Compression Offset 
Normal offset Compressed offset 
0x0 0x0 
0x4500 0x100 
0x851513 0x1353 
0x5464321 0x21245
Cassandra Read Path! 
4 Key Index Sample 
2 Bloom Filter 
Bloom Filter 
@doanduyhai 
57 
1 Row Cache 
Off Heap 
JVM Heap 
SELECT .. FROM … WHERE #partition =…; 
3 Partition Key Cache 
5 Compression Offset 
6 SStable2 
7 MemTable 
Σ 
Bloom Filter
Data model! 
Last Write Win! 
CQL basics! 
From SQL to CQL!
Last Write Win (LWW)! 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
@doanduyhai 
59 
jdoe 
age 
name 
33 John DOE 
#partition
Last Write Win (LWW)! 
@doanduyhai 
jdoe 
age (t1) name (t1) 
33 John DOE 
60 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
auto-generated timestamp (μs) 
Last Write Win (LWW)! 
SSTable1 SSTable2 
@doanduyhai 
61 
UPDATE users SET age = 34 WHERE login = 'jdoe'; 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
SSTable1 SSTable2 SSTable3 
@doanduyhai 
62 
DELETE age FROM users WHERE login = 'jdoe'; 
tombstone 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
@doanduyhai 
63 
SELECT age FROM users WHERE login = 'jdoe'; 
? ? ? 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
Last Write Win (LWW)! 
@doanduyhai 
64 
SELECT age FROM users WHERE login = 'jdoe'; 
✕ ✕ ✓ 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34
@doanduyhai 
Compaction! 
65 
SSTable1 SSTable2 SSTable3 
jdoe 
age (t3) 
✕ 
jdoe 
age (t1) name (t1) 
33 John DOE 
jdoe 
age (t2) 
34 
New SSTable 
jdoe 
age (t3) name (t1) 
ý John DOE
@doanduyhai 
CRUD operations! 
66 
INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33); 
UPDATE users SET age = 34 WHERE login = 'jdoe'; 
DELETE age FROM users WHERE login = 'jdoe'; 
SELECT age FROM users WHERE login = 'jdoe';
@doanduyhai 
DDL Statements! 
67 
Keyspace (tables container) 
• CREATE/ALTER/DROP 
Table 
• CREATE/ALTER/DROP 
Type (custom type) 
• CREATE/ALTER/DROP
User management & permissions! 
@doanduyhai 
68 
User 
• CREATE/ALTER/DROP 
Permissions 
• GRANT <xxx> PERMISSION ON <resource> TO <user_name> 
• REVOKE <xxx> PERMISSION ON <resource> FROM <user_name>
@doanduyhai 
Simple Table! 
69 
CREATE TABLE users ( 
login text, 
name text, 
age int, 
… 
PRIMARY KEY(login)); 
partition key (#partition)
@doanduyhai 
Clustered table! 
70 
CREATE TABLE mailbox ( 
login text, 
message_id timeuuid, 
interlocutor text, 
message text, 
PRIMARY KEY((login), message_id)); 
partition key + clustering column ☞ uniqueness
@doanduyhai 
Queries! 
71 
Get message by user and message_id (date) 
SELECT * FROM mailbox WHERE login = 'jdoe' 
and message_id = '2014-09-25 16:00:00'; 
Get message by user and date interval 
SELECT * FROM mailbox WHERE login = 'jdoe' 
and message_id <= '2014-09-25 16:00:00' 
and message_id >= '2014-09-20 16:00:00';
@doanduyhai 
Queries! 
72 
Get message by message_id only (#partition not provided) 
SELECT * FROM mailbox WHERE message_id = '2014-09-25 16:00:00'; 
Get message by date interval (#partition not provided) 
SELECT * FROM mailbox WHERE 
message_id <= '2014-09-25 16:00:00' 
and message_id >= '2014-09-20 16:00:00';
Get message by user range (range query on #partition) 
Get message by user pattern (non exact match on #partition) 
@doanduyhai 
Queries! 
73 
SELECT * FROM mailbox WHERE login >= 'hsue' and login <= 'jdoe'; 
SELECT * FROM mailbox WHERE login like '%doe%';
WHERE clause restrictions! 
@doanduyhai 
74 
All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition 
Only exact match (=) on #partition; range queries (<, ≤, >, ≥) not allowed 
• ☞ they would require a full cluster scan 
On clustering columns, only range queries (<, ≤, >, ≥) and exact match 
WHERE clause only possible on columns defined in PRIMARY KEY
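These restrictions can be sketched as a tiny checker; this is an illustration of the rules above, not the real CQL grammar:

```python
# Sketch: WHERE clause rules. The partition key needs an exact match,
# clustering columns allow = and range operators, anything else is rejected.

ALLOWED_OPS = ("=", "<", "<=", ">", ">=")

def check_where(restrictions, partition_key, clustering_cols):
    """restrictions: {column: operator}. True if the query is allowed."""
    if restrictions.get(partition_key) != "=":
        return False  # missing or non-exact #partition
    for col, op in restrictions.items():
        if col == partition_key:
            continue
        if col not in clustering_cols:
            return False  # only PRIMARY KEY columns in WHERE
        if op not in ALLOWED_OPS:
            return False
    return True
```

Against the mailbox table: a query on login = … plus a message_id range passes, while a query on message_id alone, a range on login, or a filter on interlocutor is rejected.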
Collections & maps! 
@doanduyhai 
75 
CREATE TABLE users ( 
login text, 
name text, 
age int, 
friends set<text>, 
hobbies list<text>, 
languages map<int, text>, 
… 
PRIMARY KEY(login));
Collections & maps! 
@doanduyhai 
76 
All elements loaded at each access 
• ☞ low cardinality only (≤ 1000 elements) 
Should not be used for 1-N, N-N relations
User Defined Type (UDT)! 
@doanduyhai 
77 
Instead of 
CREATE TABLE users ( 
login text, 
… 
street_number int, 
street_name text, 
postcode int, 
country text, 
… 
PRIMARY KEY(login));
User Defined Type (UDT)! 
@doanduyhai 
78 
CREATE TYPE address ( 
street_number int, 
street_name text, 
postcode int, 
country text); 
CREATE TABLE users ( 
login text, 
… 
location frozen <address>, 
… 
PRIMARY KEY(login));
@doanduyhai 
UDT in action! 
79 
INSERT INTO users(login, name, location) VALUES ( 
'jdoe', 
'John DOE', 
{ 
street_number: 124, 
street_name: 'Congress Avenue', 
postcode: 95054, 
country: 'USA' 
});
@doanduyhai 
From SQL to CQL! 
80 
Remember…
@doanduyhai 
From SQL to CQL! 
81 
Remember… 
CQL is not SQL
@doanduyhai 
From SQL to CQL! 
82 
Remember… 
there is no join 
(do you want to scale ?)
@doanduyhai 
From SQL to CQL! 
83 
Remember… 
there is no integrity constraint 
(do you want to read-before-write ?)
@doanduyhai 
From SQL to CQL! 
84 
1-1, N-1 : normalized 
User 
1 
n 
Comment 
CREATE TABLE comments ( 
article_id uuid, 
comment_id timeuuid, 
author_id uuid, // typical join id 
content text, 
PRIMARY KEY((article_id), comment_id));
@doanduyhai 
From SQL to CQL! 
85 
1-1, N-1 : de-normalized 
User 
1 
n 
Comment 
CREATE TABLE comments ( 
article_id uuid, 
comment_id timeuuid, 
author person, // person is UDT 
content text, 
PRIMARY KEY((article_id), comment_id));
@doanduyhai 
From SQL to CQL! 
86 
N-1, N-N : normalized 
n 
User 
n 
CREATE TABLE friends( 
login text, 
friend_login text, // typical join id 
PRIMARY KEY((login), friend_login));
@doanduyhai 
From SQL to CQL! 
87 
N-1, N-N : de-normalized 
n 
User 
n 
CREATE TABLE friends( 
login text, 
friend_login text, 
friend person, // person is UDT 
PRIMARY KEY((login), friend_login));
Data modeling best practices! 
@doanduyhai 
88 
Start by queries 
• identify core functional read paths 
• 1 read scenario ≈ 1 SELECT
Data modeling best practices! 
@doanduyhai 
89 
Start by queries 
• identify core functional read paths 
• 1 read scenario ≈ 1 SELECT 
Denormalize 
• wisely, only duplicate necessary & immutable data 
• functional/technical trade-off
Data modeling best practices! 
@doanduyhai 
90 
Person UDT 
- firstname/lastname 
- date of birth 
- sex 
- mood 
- location
Data modeling best practices! 
@doanduyhai 
91 
John DOE, male 
birthdate: 21/02/1981 
subscribed since 03/06/2011 
☉ San Mateo, CA 
’’Impossible is not John DOE’’ 
Full detail read from 
User table on click
Q & A
Failure Handling!
Failure Handling! 
Failure occurs when 
• node hardware failure (disk, …) 
• network issues 
• heavy load, node flapping 
• JVM long GCs 
@doanduyhai
@doanduyhai 
Hinted Handoff! 
When 1 replica down 
• coordinator stores Hints 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
× 
Hint
Hinted Handoff! 
When replica up 
• coordinator forward stored Hints 
@doanduyhai 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator 
Hints
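Both hinted-handoff slides together can be sketched as an in-memory model (illustrative; not the real hint storage format):

```python
# Sketch: hinted handoff. While a replica is down the coordinator stores a
# hint per missed write, and forwards the stored hints when it comes back.

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas   # {name: {"up": bool, "data": []}}
        self.hints = {}            # {replica_name: [mutations]}

    def write(self, mutation):
        for name, node in self.replicas.items():
            if node["up"]:
                node["data"].append(mutation)
            else:
                self.hints.setdefault(name, []).append(mutation)

    def on_replica_up(self, name):
        node = self.replicas[name]
        node["up"] = True
        for mutation in self.hints.pop(name, []):
            node["data"].append(mutation)  # forward stored hints
```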
Consistent read! 
Read at QUORUM or more 
• read from one fastest node 
• AND request digest from other replicas to reach QUORUM 
@doanduyhai 
• return most up-to-date data to client 
• repair if digest mismatch 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Read Repair! 
Read at CL < QUORUM 
• read from one least-loaded node 
• every x reads, request digests from other replicas 
• compare digests & repair asynchronously 
@doanduyhai 
• read_repair_chance (10%) 
n1 
n2 
n3 
n4 
n5 
n6 
n7 
n8 
1 
2 
3 
coordinator
Manual repair! 
Should be part of cluster operations 
• scheduled 
• I/O intensive 
@doanduyhai 
Use incremental repair (new in 2.1)
Q & A
Training Day | December 3rd 
Beginner Track 
• Introduction to Cassandra 
• Introduction to Spark, Shark, Scala and 
Cassandra 
Advanced Track 
• Data Modeling 
• Performance Tuning 
Conference Day | December 4th 
Cassandra Summit Europe 2014 will be the single 
largest gathering of Cassandra users in Europe. 
Learn how the world's most successful companies are 
transforming their businesses and growing faster than 
ever using Apache Cassandra. 
http://bit.ly/cassandrasummit2014 
@doanduyhai 101
Thank You 
@doanduyhai 
duy_hai.doan@datastax.com 
https://academy.datastax.com/

Cassandra introduction (ApacheCon 2014, Budapest)
