CASSANDRASUMMIT2013
Jonathan Ellis | DataStax CTO | Project Chair, Apache Cassandra
Five Years of Cassandra
[Timeline: Jul-08, Jul-09, May-10, Feb-11, Dec-11, Oct-12, Jul-13. Releases 0.1, 0.3, 0.6, 0.7, 1.0, 1.2, ..., 2.0, plus DSE.]
Core Values
*Massive scalability
*High performance
*Reliability / Availability
VLDB Benchmark (RWS)
[Chart: throughput (ops/sec) vs. number of nodes for Cassandra, HBase, VoltDB, Redis, MySQL; Cassandra highlighted]
Endpoint Benchmark (RW)
[Chart comparing Cassandra, HBase, MongoDB; x-axis ticks 1 to 32, y-axis 0 to 35000; Cassandra highlighted]
Vox Populi
#Cassandra13
*Massive scalability
*High performance
*Reliability / Availability
*Ease of use
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date int
);
CREATE INDEX ON users(state);
SELECT * FROM users
WHERE state='Texas'
AND birth_date > 1950;
New Core Value
CQL is working
"Coming from a relational database background we found
the transition to Cassandra to be very straightforward. There are a
few simple key concepts one must grasp at first but ever since it's
been smooth sailing for us."
Boris Wolf, Comcast
*Key concepts?
*The next Top Data Model (Tomorrow, 11:00, Festival)
*The State of CQL (Tomorrow, 3:10, Marina)
1.2 for Developers
*CQL3
  Thrift compatibility
  Collections
  Data dictionary
  Auth support
  Hadoop support
  Native drivers
*Tracing
*Atomic batches
CQL/Thrift compatibility
*http://www.datastax.com/dev/blog/cql3-for-cassandra-experts
*http://www.datastax.com/dev/blog/thrift-to-cql3
*http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
*TLDR: Yes
Collections
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date int
);
CREATE TABLE users_addresses (
user_id uuid REFERENCES users,
email text
);
SELECT *
FROM users NATURAL JOIN users_addresses;
X (CQL has no REFERENCES constraints or joins)
Collections
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date int,
email_addresses set<text>
);
UPDATE users
SET email_addresses = email_addresses +
{'jbellis@gmail.com', 'jbellis@datastax.com'};
Data Dictionary
cqlsh:system> use system;
cqlsh:system> select columnfamily_name from schema_columnfamilies
where keyspace_name = 'system';
columnfamily_name
-----------------------
batchlog
hints
local
peer_events
peers
schema_columnfamilies
schema_columns
schema_keyspaces
Authentication
[cassandra.yaml]
authenticator: PasswordAuthenticator
# DSE offers KerberosAuthenticator as well
CREATE USER robin WITH PASSWORD 'manager' SUPERUSER;
ALTER USER cassandra WITH PASSWORD 'newpassword';
LIST USERS;
DROP USER cassandra;
Authorization
[cassandra.yaml]
authorizer: CassandraAuthorizer
GRANT select ON audit TO jonathan;
GRANT modify ON users TO robin;
GRANT all ON ALL KEYSPACES TO lara;
Native Drivers
*CQL native protocol: efficient, lightweight, asynchronous
*Java (GA): https://github.com/datastax/java-driver
*.NET (Beta): https://github.com/datastax/csharp-driver
*Coming soon: Python, PHP, Ruby
*Java and .NET Client Drivers (Tomorrow, 4:10, Marina)
Tracing
cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
activity | timestamp | source | source_elapsed
-------------------------------------+--------------+-----------+----------------
Determining replicas for mutation | 00:02:37,015 | 127.0.0.1 | 540
Sending message to /127.0.0.2 | 00:02:37,015 | 127.0.0.1 | 779
Message received from /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 63
Applying mutation | 00:02:37,016 | 127.0.0.2 | 220
Acquiring switchLock | 00:02:37,016 | 127.0.0.2 | 250
Appending to commitlog | 00:02:37,016 | 127.0.0.2 | 277
Adding to memtable | 00:02:37,016 | 127.0.0.2 | 378
Enqueuing response to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 710
Sending message to /127.0.0.1 | 00:02:37,016 | 127.0.0.2 | 888
Message received from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2334
Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 | 2550
Tracing an Antipattern
CREATE TABLE queues (
id text,
created_at timeuuid,
value blob,
PRIMARY KEY (id, created_at)
);
 id      | created_at | value
---------+------------+--------------
 myqueue | 3092e86f   | 9b0450d30de9
 myqueue | 0867f47c   | fc7aee5f6a66
 myqueue | 5fc74be0   | 668fdb3a2196
10000 events, 9999 dequeued
cqlsh:foo> SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9
activity | timestamp | source | source_elapsed
------------------------------------------+--------------+-----------+---------------
execute_cql3_query | 19:31:05,650 | 127.0.0.1 | 0
Sending message to /127.0.0.3 | 19:31:05,651 | 127.0.0.1 | 541
Message received from /127.0.0.1 | 19:31:05,651 | 127.0.0.3 | 39
Executing single-partition query | 19:31:05,652 | 127.0.0.3 | 943
Acquiring sstable references | 19:31:05,652 | 127.0.0.3 | 973
Merging memtable contents | 19:31:05,652 | 127.0.0.3 | 1020
Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 | 1081
Read 1 live cells and 19998 tombstoned | 19:31:05,686 | 127.0.0.3 | 35072
Enqueuing response to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35220
Sending message to /127.0.0.1 | 19:31:05,687 | 127.0.0.3 | 35314
Message received from /127.0.0.3 | 19:31:05,687 | 127.0.0.1 | 36908
Processing response from /127.0.0.3 | 19:31:05,688 | 127.0.0.1 | 37650
Request complete | 19:31:05,688 | 127.0.0.1 | 38047
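The trace above shows why a queue is an antipattern here: deleted rows leave tombstones that a read must still step over before it reaches live data. The following is a minimal Python sketch of that effect, assuming a toy in-memory model rather than Cassandra's storage engine. The names `Partition`, `enqueue`, `delete`, and `peek` are hypothetical and exist only for this illustration.

```python
class Partition:
    """Toy model of one partition whose rows are kept in clustering order.

    Deleting a row does not remove it. The row is marked as a tombstone
    that later reads still have to scan past.
    """

    def __init__(self):
        self.rows = []  # each row is [value, is_tombstone]

    def enqueue(self, value):
        self.rows.append([value, False])
        return len(self.rows) - 1  # position stands in for created_at

    def delete(self, position):
        self.rows[position][1] = True  # write a tombstone, keep the slot

    def peek(self):
        """Model SELECT ... LIMIT 1: return (value, tombstones_scanned)."""
        scanned = 0
        for value, is_tombstone in self.rows:
            if is_tombstone:
                scanned += 1
            else:
                return value, scanned
        return None, scanned


queue = Partition()
positions = [queue.enqueue(f"event-{i}") for i in range(10000)]
for position in positions[:9999]:  # 9999 events dequeued
    queue.delete(position)

value, tombstones_scanned = queue.peek()
print(value, tombstones_scanned)
```

The sketch counts rows, while the trace on the slide counts cells, so the absolute numbers differ. The shape of the problem is the same: one live result, thousands of tombstones read to find it.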
1.2 for Operators
*Concurrent CREATE TABLE
*Virtual nodes
*“Fat node” support (5-10TB)
*JBOD improvements
Off-heap bloom filters, compression metadata
Improved compaction throttle
Parallel leveled compaction
Memory Usage
[Diagram: within the Java process (JVM), memory is split into the on-heap Java heap (managed by GC) and off-heap native memory (not managed by GC)]
Read Path (per SSTable)
[Diagram, built up step by step. In memory: bloom filter, partition key cache, partition summary, compression offsets. On disk: partition index, data.]
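The lookup order in the read path diagram can be sketched in Python. This is a toy model under stated assumptions, not Cassandra's classes: a plain `set` stands in for the bloom filter (so it has no false positives), the compression offsets step is omitted, and `SSTable`, `get`, and `summary_interval` are hypothetical names.

```python
import bisect


class SSTable:
    """Toy per-SSTable read path: bloom filter, key cache, summary, index, data."""

    def __init__(self, rows, summary_interval=2):
        keys = sorted(rows)
        # "On disk": the data file and a partition index of (key, data position).
        self.data = [rows[k] for k in keys]
        self.index = [(k, pos) for pos, k in enumerate(keys)]
        # "In memory": stand-in bloom filter, sampled summary, and key cache.
        self.bloom = set(keys)
        self.summary = [(self.index[i][0], i)
                        for i in range(0, len(self.index), summary_interval)]
        self.key_cache = {}

    def get(self, key):
        """Return (value, steps) where steps lists the structures consulted."""
        steps = ["bloom filter"]
        if key not in self.bloom:
            return None, steps  # definitely absent: no disk access at all
        steps.append("partition key cache")
        pos = self.key_cache.get(key)
        if pos is None:
            # The summary narrows down where to start scanning the index.
            steps.append("partition summary")
            sampled = [k for k, _ in self.summary]
            start = self.summary[bisect.bisect_right(sampled, key) - 1][1]
            steps.append("partition index")
            for k, p in self.index[start:]:
                if k == key:
                    pos = p
                    break
            self.key_cache[key] = pos
        steps.append("data")
        return self.data[pos], steps


table = SSTable({"alice": 1, "bob": 2, "carol": 3, "dave": 4})
first = table.get("carol")    # cold read walks summary and index
second = table.get("carol")   # warm read is served via the key cache
missing = table.get("zed")    # rejected by the bloom filter
print(first, second, missing)
```

A second read of the same key skips the summary and index because the key cache already holds its position, which is the point of keeping that cache in memory.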
Off Heap in 1.2+
*Partition key bloom filter
1-2GB per billion partitions
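The "1-2GB per billion partitions" figure is consistent with the textbook sizing formula for an optimally configured bloom filter, m = -n ln(p) / (ln 2)^2 bits. A small Python check follows. The false-positive rates 0.01 and 0.001 are assumptions chosen for illustration, not values taken from the slide, and `bloom_filter_bytes` is a hypothetical helper.

```python
import math


def bloom_filter_bytes(num_keys, false_positive_rate):
    """Size in bytes of an optimal bloom filter: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8


# Assumed false-positive rates; one billion partition keys.
sizes_gb = {p: bloom_filter_bytes(1_000_000_000, p) / 1e9 for p in (0.01, 0.001)}
for p, gb in sizes_gb.items():
    print(f"p={p}: {gb:.2f} GB")
```

Under these assumptions the result is roughly 1.2 GB and 1.8 GB, which is why moving this structure off the Java heap matters for nodes holding billions of partitions.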
Off Heap in 1.2+
*Compression metadata
~1-3GB per TB compressed
Not off Heap until 2.0
*Partition index summary
(Size cut in ~half in 1.2.5+)
Compaction Throttling
[Two charts of MB/s over time while compacting one 10000-row partition and several 1000-row partitions: throttling on partition boundaries vs. throttling using a constant RateLimiter]
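One way to read the two throttling charts is as a difference in burst size. The Python sketch below is a toy model under assumptions, not Cassandra's compaction code: sizes are abstract units, `rate` is units per second, and the function names and the `chunk` parameter are hypothetical. It assumes that boundary-based throttling writes a whole partition at full speed and then pauses, while a constant rate limiter pauses after every small chunk.

```python
def throttle_on_partition_boundaries(partition_sizes, rate):
    """Write each partition in one burst, then pause to restore the target rate.

    Returns a list of (burst_size, pause_seconds).
    """
    return [(size, size / rate) for size in partition_sizes]


def throttle_with_rate_limiter(partition_sizes, rate, chunk=64):
    """Acquire permits chunk by chunk, pausing briefly after each chunk."""
    schedule = []
    for size in partition_sizes:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            schedule.append((n, n / rate))
            remaining -= n
    return schedule


partitions = [10000, 1000, 1000, 1000]
bursty = throttle_on_partition_boundaries(partitions, rate=1000)
smooth = throttle_with_rate_limiter(partitions, rate=1000)

print("largest burst, partition boundaries:", max(b for b, _ in bursty))
print("largest burst, rate limiter:", max(b for b, _ in smooth))
print("total pause (s):", sum(p for _, p in bursty), sum(p for _, p in smooth))
```

Both schedules spend about the same total time paused, so the average rate matches. The difference is that the boundary-based schedule lets the largest partition through in a single burst, while the rate-limited one never exceeds one chunk at a time.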
DSE 3.1
*Cassandra 1.2 shipping in DataStax Enterprise 3.1 on June 30
*Updated with CQL and composite column support for Hive and Solr
*Includes Solr 4.3
DataStax DevCenter
Cassandra 2.0
Removed in 2.0
*Token range bisection on bootstrap
*Supercolumns (only internally)
public List<ColumnOrSuperColumn> get_slice(...)
*Disk compatibility for < 1.2.5
*Network compatibility for < 1.2
New in 2.0
*CAS (Compare-and-set = lightweight transactions)
*Eager retries
*Improved compaction
*Triggers (experimental)
*CQL cursors
CAS: The Problem
Session 1
SELECT * FROM users
WHERE username = 'jbellis'
[empty resultset]
INSERT INTO users (...)
VALUES ('jbellis', ...)
Session 2
SELECT * FROM users
WHERE username = 'jbellis'
[empty resultset]
INSERT INTO users (...)
VALUES ('jbellis', ...)
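The race between the two sessions can be reproduced with a toy Python model. This is a sketch under assumptions, not a Cassandra client: `Users` is a hypothetical single-process table, and `insert_if_not_exists` only illustrates the compare-and-set idea of making the check and the write one atomic step.

```python
class Users:
    """Toy table keyed by username."""

    def __init__(self):
        self.rows = {}

    def select(self, username):
        return self.rows.get(username)

    def insert(self, username, owner):
        self.rows[username] = owner  # plain write: last writer wins

    def insert_if_not_exists(self, username, owner):
        """Compare-and-set: the existence check and the write are one step."""
        if username in self.rows:
            return False
        self.rows[username] = owner
        return True


# Check-then-insert: both sessions read before either one writes.
users = Users()
seen_by_1 = users.select("jbellis")
seen_by_2 = users.select("jbellis")
if seen_by_1 is None:
    users.insert("jbellis", "session 1")
if seen_by_2 is None:
    users.insert("jbellis", "session 2")
racy_owner = users.select("jbellis")  # session 1's row was silently overwritten

# Compare-and-set: only one of the two sessions can succeed.
users = Users()
applied_1 = users.insert_if_not_exists("jbellis", "session 1")
applied_2 = users.insert_if_not_exists("jbellis", "session 2")
cas_owner = users.select("jbellis")

print(racy_owner, applied_1, applied_2, cas_owner)
```

In the first half both sessions believe they created the account. In the second half the losing session is told its write was not applied.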
Why Locking Doesn’t Work
[Diagram, built up step by step: the client (holding locks) sends a request to the coordinator, which sends an internal request to a replica. The internal request fails (X), the coordinator stores a hint, and the client gets a timeout response.]
Paxos
*All operations are quorum-based
*Each replica sends information about unfinished operations to the leader during prepare
*Paxos Made Simple
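The prepare step described on this slide can be illustrated with textbook single-decree Paxos. The Python sketch below is an assumption-laden toy, not Cassandra's implementation: there is no networking, durability, or failure handling, and `Acceptor` and `propose` are hypothetical names. It shows a new leader learning about an unfinished operation during prepare and completing it instead of its own value.

```python
class Acceptor:
    """One replica's Paxos state."""

    def __init__(self):
        self.promised = 0      # highest ballot this replica has promised
        self.accepted = None   # (ballot, value) accepted but possibly unfinished

    def prepare(self, ballot):
        """Promise to ignore older ballots and report any unfinished proposal."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(acceptors, ballot, value):
    """Run prepare and accept against all replicas; return the chosen value or None."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [accepted for ok, accepted in replies if ok]
    if len(granted) < quorum:
        return None
    unfinished = [accepted for accepted in granted if accepted is not None]
    if unfinished:
        # Finish the most recent unfinished operation before anything else.
        value = max(unfinished, key=lambda pair: pair[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None


replicas = [Acceptor() for _ in range(3)]

# An earlier leader (ballot 1) reached only one replica before failing.
for replica in replicas[:2]:
    replica.prepare(1)
replicas[0].accept(1, "session 1")

chosen = propose(replicas, 2, "session 2")  # learns of "session 1" during prepare
stale = propose(replicas, 1, "late")        # older ballot is rejected by the quorum
print(chosen, stale)
```

Because every step needs a quorum, any new leader's prepare overlaps with at least one replica that saw the earlier proposal, which is what lets it pick up the unfinished work.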
CAS Details
*3 round trips vs 1 for normal updates
*Paxos state is durable
*Immediate consistency with no leader election or failover
*ConsistencyLevel.SERIAL
Use with Caution
*Great for 1% of your application
*Eventual consistency is your friend
Eventual Consistency != Hopeful Consistency (Today, 1:30, Golden Gate)
Using CAS
UPDATE USERS
SET email = 'jonathan@datastax.com', ...
WHERE username = 'jbellis'
IF email = 'jbellis@datastax.com';
INSERT INTO USERS (username, email, ...)
VALUES ('jbellis', 'jbellis@datastax.com', ... )
IF NOT EXISTS;
Triggers
CREATE TRIGGER <name> ON <table> EXECUTE <classname>;
Trigger Implementation
class MyTrigger implements ITrigger
{
    public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
    {
        ...
    }
}
Experimental!
*Relies on internal RowMutation, ColumnFamily classes
*[partition] key is a ByteBuffer
*Expect changes in 2.1
Follow Up Discussion
*After What were they Thinking? (DataStax Lounge)
*Meet the Experts (Today, 3:00, C370)
*Happy Hour (Tonight, 6:15)
Thank You

Cassandra Summit 2013 Keynote