Cassandra 2.0
Lyuben Todorov
Cassandra Developer, DataStax
New Core Value
• Massive Scalability
• Performance
• Reliability
• Ease of Use
Binary Protocol
• Paging
• Batching prepared statements
• Improved parameterized statements
Cursors
Before - manual paging
• Have to keep track of position
• Complex for compound primary keys

CREATE TABLE timeline (
    ...
    PRIMARY KEY (uid, event_id)
);

SELECT * FROM timeline
WHERE (uid = :last_key AND event_id > :last_event)
   OR token(uid) > token(:last_key)
LIMIT 100;
Cursors
Now
Statement s = new SimpleStatement("SELECT * FROM timeline");
s.setFetchSize(100);
// transparently fetches next pages
ResultSet result = session.execute(s);
for (Row row : result) {
    doStuff(row);
}
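
To make "transparently fetches next pages" concrete, here is a minimal sketch of a lazily paging iterator. It is an illustration only and not the driver's ResultSet: the class PagedIterator, the page() helper, and the simulated 250-row result are all hypothetical names and data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical illustration: an iterator that pulls the next page only when
// the current one is exhausted, so the caller just writes a plain loop.
class PagedIterator<T> implements Iterator<T> {
    private final IntFunction<List<T>> fetchPage; // page number -> rows (empty = no more data)
    private Iterator<T> current = Collections.emptyIterator();
    private int nextPage = 0;
    private boolean exhausted = false;
    int pagesFetched = 0; // counts simulated round trips

    PagedIterator(IntFunction<List<T>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    @Override
    public boolean hasNext() {
        // Fetch further pages until one has rows or the source is drained.
        while (!current.hasNext() && !exhausted) {
            List<T> page = fetchPage.apply(nextPage++);
            pagesFetched++;
            if (page.isEmpty()) {
                exhausted = true;
            } else {
                current = page.iterator();
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    // Simulated server: returns rows [pageNo * fetchSize, (pageNo + 1) * fetchSize) capped at total.
    static List<Integer> page(int pageNo, int fetchSize, int total) {
        List<Integer> rows = new ArrayList<>();
        int end = Math.min(total, (pageNo + 1) * fetchSize);
        for (int i = pageNo * fetchSize; i < end; i++) {
            rows.add(i);
        }
        return rows;
    }

    public static void main(String[] args) {
        PagedIterator<Integer> it = new PagedIterator<>(p -> page(p, 100, 250));
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println(count + " rows in " + it.pagesFetched + " fetches");
    }
}
```

The caller never tracks a position: the only paging state lives inside the iterator, which is the point of the fetch-size approach compared with the manual token query.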
Batching PreparedStatement
PreparedStatement ps = session.prepare(
        "INSERT INTO timeline (id, event) VALUES (?, ?)");
BatchStatement batch = new BatchStatement();
batch.add(ps.bind(UUID.randomUUID(), event_1));
batch.add(ps.bind(UUID.randomUUID(), event_2));
session.execute(batch);
Simplified Parameterized Statements
session.execute(String cql, Object... values);
session.executeAsync(String cql, Object... values);

session.execute("INSERT INTO timeline (uid, event) VALUES (?, ?)",
        UUID.randomUUID(), event_x);
CQL3
• SELECT DISTINCT partition_key;
• CREATE TABLE IF NOT EXISTS tbl_name;
• Aliases
  SELECT event_id, dateOf(created_at) AS creation_date;
• ALTER TABLE tbl DROP column_name;
Lightweight Transactions
User exists? If not, create user.

SESSION 1                        | SESSION 2
---------------------------------+---------------------------------
SELECT * FROM users              | SELECT * FROM users
WHERE username = 'lyuben';       | WHERE username = 'lyuben';
                                 |
[empty resultset]                | [empty resultset]
                                 |
INSERT INTO users (...)          | INSERT INTO users (...)
VALUES ('lyuben', ...)           | VALUES ('lyuben', ...)

Last write wins.
Why not locking?
[Diagram: Client (locks) -> request -> Coordinator -> internal request -> Replica. The replica is unavailable (X), the coordinator stores the write for replay, and the client receives a timeout response.]
Paxos
"A family of protocols for solving consensus in a network of unreliable processors." (Wikipedia)
Paxos
• Immediate Consistency
• QUORUM based operations
• Unfinished operations’ states sent to leader during prepare phase
• Paxos state is durable
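
The bullets above can be sketched with a tiny single-decree acceptor. This is a hypothetical in-memory illustration, not Cassandra's implementation: the class and field names are made up, and a real acceptor would persist its state before replying (the "durable" bullet). It shows two things: a prepare reply carries any previously accepted, possibly unfinished value back to the new leader, and a quorum is a strict majority of replicas.

```java
// Hypothetical, in-memory single-decree Paxos acceptor for illustration only.
class PaxosAcceptor {
    /** Reply to a prepare request. */
    static final class Promise {
        final boolean promised;     // true if the acceptor promised this ballot
        final long acceptedBallot;  // ballot of a previously accepted value, or -1
        final String acceptedValue; // previously accepted (possibly unfinished) value, or null

        Promise(boolean promised, long acceptedBallot, String acceptedValue) {
            this.promised = promised;
            this.acceptedBallot = acceptedBallot;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedBallot = -1;
    private long acceptedBallot = -1;
    private String acceptedValue = null;

    // Phase 1: promise to ignore older ballots and report what was already accepted.
    synchronized Promise prepare(long ballot) {
        boolean ok = ballot > promisedBallot;
        if (ok) {
            promisedBallot = ballot;
        }
        return new Promise(ok, acceptedBallot, acceptedValue);
    }

    // Phase 2: accept the value unless a newer ballot has been promised.
    synchronized boolean accept(long ballot, String value) {
        if (ballot < promisedBallot) {
            return false;
        }
        promisedBallot = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }

    // A quorum is a strict majority of the replicas.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    public static void main(String[] args) {
        PaxosAcceptor acceptor = new PaxosAcceptor();
        acceptor.prepare(1);
        acceptor.accept(1, "create user lyuben");
        Promise p = acceptor.prepare(2); // a new leader takes over
        System.out.println("in-progress value seen by new leader: " + p.acceptedValue);
        System.out.println("stale accept allowed: " + acceptor.accept(1, "other"));
        System.out.println("quorum of 3 replicas: " + quorum(3));
    }
}
```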
LWT – Use when Appropriate
• Expensive – x4 round trips
• Eventual consistency is your friend
Lightweight Transactions – CQL
INSERT INTO users (username, email ...)
VALUES ('lyuben', 'ltodorov@datastax.com', ... )
IF NOT EXISTS;
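
The difference between a plain INSERT and INSERT ... IF NOT EXISTS can be sketched on a single JVM with a map. This is only an analogy under stated assumptions: the class IfNotExistsDemo and the example.com addresses are hypothetical, and ConcurrentHashMap.putIfAbsent stands in for the compare-and-set that Cassandra performs across replicas.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical single-process analogy for last-write-wins vs. IF NOT EXISTS.
class IfNotExistsDemo {
    // Plain write: a later write silently replaces the earlier one.
    static void blindInsert(ConcurrentHashMap<String, String> users, String username, String email) {
        users.put(username, email);
    }

    // Conditional write: returns true only for the caller that actually created the row.
    static boolean insertIfNotExists(ConcurrentHashMap<String, String> users, String username, String email) {
        return users.putIfAbsent(username, email) == null;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> blind = new ConcurrentHashMap<>();
        blindInsert(blind, "lyuben", "first@example.com");
        blindInsert(blind, "lyuben", "second@example.com");
        System.out.println("blind insert keeps: " + blind.get("lyuben"));

        ConcurrentHashMap<String, String> guarded = new ConcurrentHashMap<>();
        boolean first = insertIfNotExists(guarded, "lyuben", "first@example.com");
        boolean second = insertIfNotExists(guarded, "lyuben", "second@example.com");
        System.out.println("applied: " + first + ", " + second + "; kept: " + guarded.get("lyuben"));
    }
}
```

With the conditional form, the second session learns that its write was not applied instead of silently overwriting the first user.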
Triggers
CREATE TRIGGER <name> ON <table> USING '<classname>';
Triggers
class MyTrigger implements ITrigger
{
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
    {
        ...
    }
}
Tracing
• Detailed view of what’s going on
• Great for debugging queries

cqlsh:test> TRACING ON;
Now tracing requests.
Tracing insert
cqlsh:test> INSERT INTO example (i, j) VALUES ('key', 7);
Tracing session: 69fc9cb0-4fb3-11e3-84ae-612d9c5d36d9

 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
 Determining replicas for mutation   | 18:09:34,722 | 127.0.0.1 |           3507
 Sending message to /127.0.0.2       | 18:09:34,724 | 127.0.0.1 |           5720
 Acquiring switchLock read lock      | 18:09:34,732 | 127.0.0.2 |           6043
 Appending to commitlog              | 18:09:34,732 | 127.0.0.2 |           6305
 Adding to example memtable          | 18:09:34,732 | 127.0.0.2 |           6373
 Enqueuing response to /127.0.0.1    | 18:09:34,733 | 127.0.0.2 |           6978
 Message received from /127.0.0.2    | 18:09:34,737 | 127.0.0.1 |          19055
 Processing response from /127.0.0.2 | 18:09:34,738 | 127.0.0.1 |          19993
 Request complete                    | 18:09:34,739 | 127.0.0.1 |          20322
Tracing tombstone
cqlsh:test> SELECT * FROM example;
Tracing session: 79d55380-4fb7-11e3-9ac8-612d9c5d36d9

 activity                                 | timestamp    | source    | source_elapsed
------------------------------------------+--------------+-----------+----------------
 ...
 Sending message to /127.0.0.2            | 18:38:39,291 | 127.0.0.1 |            601
 Read 10 live and 100000 tombstoned cells | 18:38:39,291 | 127.0.0.2 |          31655
 Scanned 1 rows and matched 1             | 18:38:39,292 | 127.0.0.2 |          31693
 Message received from /127.0.0.2         | 18:38:39,292 | 127.0.0.1 |          33150
 Enqueuing response to /127.0.0.1         | 18:38:39,292 | 127.0.0.2 |          33724
 Processing response from /127.0.0.2      | 18:38:39,292 | 127.0.0.1 |          34704
 Sending message to /127.0.0.1            | 18:38:39,292 | 127.0.0.2 |          35220
 ...
Row Marker
CREATE TABLE tbl (
    key int,
    a text,
    b text,
    PRIMARY KEY (key)
);

UPDATE tbl SET a=null, b=null WHERE key=1;
DELETE FROM tbl WHERE key=2;
Row Marker (JSON format)
[
{"key": "00000001","columns": [["","",1384716960173000],
["a","52891aa0",1384716960173000,"d"],
["b","52891aa0",1384716960173000,"d"]]
},
{"key": "00000002","columns": []}
]

Retrieved via sstable2json
Rapid Read Protection
• Configurable per-table.
• Reduces read timeouts caused by overloaded or crashed replicas.
• Enabled by default in 2.0.2.
Rapid Read Protection
Configuring rapid read protection
-- retry if the request takes longer than 10ms
ALTER TABLE example WITH speculative_retry = '10ms';
-- retry if the request takes longer than the 99th percentile of request latency
ALTER TABLE example WITH speculative_retry = '99percentile';
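
As a rough illustration of what a '99percentile' threshold means, the sketch below derives a retry threshold from a set of observed latencies with a nearest-rank percentile. It is an assumption-laden example rather than Cassandra's internal latency tracking: the class SpeculativeRetryThreshold, its methods, and the sample latencies are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: derive a speculative-retry threshold from observed latencies.
class SpeculativeRetryThreshold {
    // Nearest-rank percentile over latencies in milliseconds (pct in 1..100).
    static long percentile(long[] latenciesMs, int pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100; // ceil(pct / 100 * n) in integer math
        return sorted[Math.max(0, rank - 1)];
    }

    // A second replica would be tried once the first has been silent longer than the threshold.
    static boolean shouldRetry(long elapsedMs, long thresholdMs) {
        return elapsedMs > thresholdMs;
    }

    public static void main(String[] args) {
        long[] observed = {3, 4, 5, 4, 3, 6, 5, 4, 80, 4}; // made-up sample latencies
        long p99 = percentile(observed, 99);
        System.out.println("99th percentile threshold: " + p99 + "ms");
        System.out.println("retry after 12ms with fixed 10ms threshold: " + shouldRetry(12, 10));
        System.out.println("retry after 12ms with percentile threshold: " + shouldRetry(12, p99));
    }
}
```

A fixed value such as '10ms' is predictable, while a percentile adapts to the table's actual latency profile.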
Going Off Heap
Managed by GC: Stack, Heap
Not managed by GC: Native
Going Off Heap
• Bloom Filters
  1 - 2GB per billion entries
• Compression Offsets
  1 - 3GB per TB of compressed data
• Partition Summary
  Depends on # of rows per partition
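
The "1 - 2GB per billion entries" figure for Bloom filters is consistent with the textbook sizing formula m = -n ln(p) / (ln 2)^2 bits for n entries at false-positive rate p. The sketch below evaluates it; the false-positive rates used are assumptions chosen for illustration, and the class BloomFilterSizing is hypothetical.

```java
// Hypothetical sketch: textbook Bloom filter sizing, m = -n * ln(p) / (ln 2)^2 bits.
class BloomFilterSizing {
    // Optimal number of bits for the given entry count and false-positive rate.
    static double bitsNeeded(long entries, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return -entries * Math.log(falsePositiveRate) / (ln2 * ln2);
    }

    // Convert bits to decimal gigabytes.
    static double gigabytes(double bits) {
        return bits / 8 / 1e9;
    }

    public static void main(String[] args) {
        long entries = 1_000_000_000L;
        // Assumed false-positive rates; actual table settings may differ.
        for (double p : new double[] {0.01, 0.001}) {
            System.out.printf("p=%.3f -> %.2f GB%n", p, gigabytes(bitsNeeded(entries, p)));
        }
    }
}
```

At these rates a billion entries need a little over 1GB to just under 2GB, which is why moving the filters off the GC-managed heap matters.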
Compaction
[Figure: row fragments under Leveled vs. Size Tiered compaction]
Leveled Compaction Strategy
[Figures: Happy LCS; Sad LCS; STCS in Level 0]
Coming up
• Secondary indexes for collections (4511)
CREATE TABLE image (
    id UUID,
    tags set<text>,
    PRIMARY KEY (id)
);

SELECT * FROM image WHERE tags CONTAINS 'sunny';

 id                                   | tags
--------------------------------------+-------------------------------
 1a3ab520-5177-11e3-91ae-612d9c5d36d9 | {'beach', 'holiday', 'sunny'}
 2617d120-5177-11e3-91ae-612d9c5d36d9 | {'mountains', 'sunny'}
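
Conceptually, a CONTAINS lookup on a set column can be served by an inverted index from each element to the rows that hold it. The sketch below shows that idea only; it makes no claim about how Cassandra implements the feature, and the class TagIndex and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an inverted index from tag to the ids of rows containing it.
class TagIndex {
    private final Map<String, Set<String>> idsByTag = new HashMap<>();

    // Index every element of the row's tag set.
    void put(String id, Set<String> tags) {
        for (String tag : tags) {
            idsByTag.computeIfAbsent(tag, k -> new TreeSet<>()).add(id);
        }
    }

    // Equivalent in spirit to: SELECT id FROM image WHERE tags CONTAINS ?
    Set<String> contains(String tag) {
        return idsByTag.getOrDefault(tag, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        TagIndex index = new TagIndex();
        index.put("1a3ab520-5177-11e3-91ae-612d9c5d36d9",
                new HashSet<>(Arrays.asList("beach", "holiday", "sunny")));
        index.put("2617d120-5177-11e3-91ae-612d9c5d36d9",
                new HashSet<>(Arrays.asList("mountains", "sunny")));
        System.out.println("sunny: " + index.contains("sunny"));
        System.out.println("beach: " + index.contains("beach"));
    }
}
```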
• More efficient repairs (5351)
[Figures: repair and compaction under STCS; repair under LCS across levels L0, L1, L2]
• Custom types (5590)
CREATE TYPE address (
    street text,
    city text,
    zip_code int,
    phones set<text>
);

CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    addresses map<int, address>
);

[Diagram labels: Key, UTF8Type, UTF8Type, Int32Type, MapType(Int32Type, Address)]
• CQL aggregate functions (4914)
  – AVG, MIN, MAX, MEAN, SUM, etc.

SELECT sum(salary) FROM employee WHERE country='UK';
• Many more!
DataStax Ac*ademy
Free online Cassandra training!
https://datastaxacademy.elogiclearning.com/