Massively Scalable NoSQL with Apache Cassandra

Massively scalable NoSQL
with Apache Cassandra!
Jonathan Ellis
Project Chair, Apache Cassandra
CTO, DataStax
@spyced

Big data

Analytics (Hadoop)
?
Realtime (“NoSQL”)

©2012 DataStax

Some Cassandra users

eBay
Application/Use Case
• Social Signals: like/want/own features for
eBay product and item pages
• Hunch taste graph for eBay users and items
• Many time series use cases

Why Cassandra?
• Multi-datacenter
• Scalable
• Write performance
• Distributed counters
• Hadoop support

Time series data

Multi-datacenter support

Distributed counters

Hadoop support

Disney
• Meet the data management needs of user-facing
applications across The Walt Disney
Company with a single platform

Why Cassandra?
• DataStax Enterprise can tackle real-time
and search functions in the same cluster
• Scalability
• 24x7 uptime

Multitenancy

Enterprise search

SimpleReach
• SimpleReach tracks social actions for
content creators, from Twitter and
Facebook to Pinterest and Reddit, to deliver
detailed insights and clear metrics around
social behavior.

Why Cassandra?
• Very high velocity data ingest rate and
large data volumes
• Workload separation between realtime
and batch applications

SourceNinja
• SourceNinja notifies you of performance,
security, and bug fixes for the software you
depend on

Why Cassandra?
• Previous database system could not
handle the load; HBase had too many points
of failure and was too slow
• Fast real time capabilities, batch analytics
on that data, and enterprise search

Netflix
• General-purpose backend for large-scale,
highly available, cloud-based web services
supporting Netflix Streaming

Why Cassandra?
• Highly available, highly robust and no
schema change downtime
• Highly scalable, optimized for SSD
• Much lower cost than previous Oracle and
SimpleDB implementations
• Flexible data model
• Ability to directly influence/implement
OSS feature set
• Supports local and wide area distributed
operations, spanning US and Europe

Optimized for SSD

Open source

Use case patterns
• Massively scalable
• High performance
• Reliable/Available

[Chart: reads/s and writes/s, y-axis 0 to 35,000, comparing Cassandra 0.6 and Cassandra 1.0]

Classic partitioning with SPOF
partition 1 partition 2 partition 3 partition 4

router

client

Availability
• “High availability implies that a single fault will not bring
down your system. Not ‘we’ll recover quickly.’”
-- Ben Coverston: DataStax

• “The biggest problem with failover is that you're almost
never using it until it really hurts. It's like backups that
you never test.”
-- Rick Branson: Instagram

Fully distributed, no SPOF
[Diagram: client connected to peer nodes; partition labels p3, p6, and p1, with p1 shown three times]

Partitioning

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru gender: F

johnny age: 12 gender: M

suzy age: 10 gender: F

Partitioning
Primary key determines placement*

jim age: 36 car: camaro gender: M

carol age: 37 car: subaru gender: F

johnny age: 12 gender: M

suzy age: 10 gender: F

PK      MD5 hash

jim     5e02739678...
carol   a9a0198010...
johnny  f4eb27cea7...
suzy    78b421309e...

MD5* hash operation yields a 128-bit number for keys of any size.

The “token ring”

Node A Node B

Node D Node C

Start End
A 0xc000000000..1 0x0000000000..0

B 0x0000000000..1 0x4000000000..0

C 0x4000000000..1 0x8000000000..0

D 0x8000000000..1 0xc000000000..0

jim 5e02739678...

carol a9a0198010...

johnny f4eb27cea7...

suzy 78b421309e...
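
The lookup implied by the table above can be sketched in Python. This is a minimal illustration, not Cassandra's implementation: it follows the slides' simplified picture of an unsigned 128-bit MD5 ring with four evenly spaced nodes, and the names `NODE_TOKENS`, `token_for_key`, `owner_for_token`, and `owner_for_key` are hypothetical. Cassandra's real partitioner differs in details.

```python
import hashlib
from bisect import bisect_left

# Ring from the table above: each node owns the range that ENDS at its own
# token (inclusive). Node A's range wraps around past the top of the ring.
NODE_TOKENS = [
    (0x00 << 120, "A"),
    (0x40 << 120, "B"),
    (0x80 << 120, "C"),
    (0xC0 << 120, "D"),
]
TOKENS = [token for token, _ in NODE_TOKENS]


def token_for_key(key: str) -> int:
    """Hash a key with MD5 and read the digest as an unsigned 128-bit integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def owner_for_token(token: int) -> str:
    """Return the node whose range contains the token."""
    # First node whose end token is >= the token; past the last node, wrap to A.
    index = bisect_left(TOKENS, token) % len(NODE_TOKENS)
    return NODE_TOKENS[index][1]


def owner_for_key(key: str) -> str:
    return owner_for_token(token_for_key(key))
```

Applied to the hash prefixes listed on the slide, this places jim (5e...) and suzy (78...) on node C, carol (a9...) on node D, and johnny (f4...) on node A.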

Replication

Node A Node B

Node D Node C

carol a9a0198010...
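
One simple way to choose replicas is to start at the node that owns the token and continue clockwise around the ring. The Python sketch below assumes that policy and a replication factor of 3; neither is stated on the slide, and the names `RING` and `replicas_for_token` are hypothetical. It ignores racks and datacenters, so it does not model the "smart" placement mentioned elsewhere in the deck.

```python
from bisect import bisect_left

# Four-node ring from the token ring table: (end token of the node's range, node name).
RING = [
    (0x00 << 120, "A"),
    (0x40 << 120, "B"),
    (0x80 << 120, "C"),
    (0xC0 << 120, "D"),
]


def replicas_for_token(token: int, replication_factor: int = 3) -> list:
    """Return the owner of the token followed by the next nodes clockwise."""
    tokens = [t for t, _ in RING]
    first = bisect_left(tokens, token) % len(RING)
    # Never return more replicas than there are nodes.
    count = min(replication_factor, len(RING))
    return [RING[(first + i) % len(RING)][1] for i in range(count)]
```

Under these assumptions, carol's token (a9...) is owned by node D, so the three copies land on D, A, and B. All three are equal peers; none is a "primary".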

Highlights
• Adding capacity is application-transparent and requires
no downtime
• No SPOF, not even temporarily
• No “primary” replica

• Configurable synchronous/asynchronous replication
• Tolerates node failure; never have to restart replication
“from scratch”
• “Smart” replication avoids correlated failures

CQL: You got SQL in my NoSQL!
CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date int
);

CREATE INDEX ON users(state);

SELECT * FROM users WHERE state='Texas' AND birth_date > 1950;

Strictly “realtime” focused
• No joins
• No subqueries
• No aggregation functions* or GROUP BY
• ORDER BY?

Clustering in CQL3
CREATE TABLE sblocks (
    block_id uuid,
    subblock_id uuid,
    data blob,
    PRIMARY KEY (block_id, subblock_id)
);

block_id  subblock_id  data
Block1    subblock A   data A
Block1    subblock B   data B
...       ...          ...

Block2    subblock C   data C
Block2    subblock D   data D
...       ...          ...

Block3    subblock E   data E
Block3    subblock F   data F
...       ...          ...
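
With a compound primary key like this one, the first column (block_id) picks the partition and the second (subblock_id) orders rows inside it. The Python sketch below is a toy in-memory model of that layout, not a description of Cassandra's storage engine; the class `SBlocks` and its methods are hypothetical names.

```python
from bisect import insort
from collections import defaultdict


class SBlocks:
    """Toy model: block_id selects a partition, subblock_id sorts rows within it."""

    def __init__(self):
        # block_id -> list of (subblock_id, data), kept sorted by subblock_id
        self._partitions = defaultdict(list)

    def insert(self, block_id, subblock_id, data):
        rows = self._partitions[block_id]
        # Writing an existing (block_id, subblock_id) overwrites the old row.
        rows[:] = [row for row in rows if row[0] != subblock_id]
        insort(rows, (subblock_id, data))

    def select(self, block_id):
        """Return every row of one block, already in subblock_id order."""
        return list(self._partitions.get(block_id, []))
```

Because all subblocks of a block sit together and in order, reading a whole block touches a single partition and needs no sort at query time.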

Collections
CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date int
);

CREATE TABLE users_addresses (
    user_id uuid REFERENCES users,
    email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

Collections
X (the relational schema above, with REFERENCES and NATURAL JOIN, is crossed out; CQL has no joins)

Collections
CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date int,
    email_addresses set<text>
);

UPDATE users
SET email_addresses = email_addresses + {'jbellis@gmail.com',
                                         'jbellis@datastax.com'};
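
The `+` in the UPDATE above adds elements to the set, and a set holds each value only once. The Python sketch below models just that behavior; `add_to_set` is a hypothetical helper, not a driver or server API.

```python
def add_to_set(current, new_items):
    """Model `email_addresses = email_addresses + {...}`.

    `current` may be None, standing in for a row that has no set stored yet.
    """
    return set(current or ()) | set(new_items)


emails = add_to_set(None, {"jbellis@gmail.com", "jbellis@datastax.com"})
# Adding an address that is already present leaves the set unchanged.
emails_again = add_to_set(emails, {"jbellis@gmail.com"})
```

Here `emails_again` equals `emails`, so repeating the same update is harmless.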

Better Hadoop than Hadoop
• “Vanilla” Hadoop
    • 8+ services to set up, monitor, back up, and recover
      (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker,
      Zookeeper, Region Server, ...)
    • Single points of failure
    • Can't separate online and offline processing

• DataStax Enterprise
    • Single, simplified component
    • Self-organizes based on workload
    • Peer to peer
    • JobTracker failover

Enterprise search with Solr
SELECT title FROM solr WHERE solr_query='title:natio*';

title
--------------------------------------------------------------------------
Bolivia national football team 2002
List of French born footballers who have played for other national teams
Lithuania national basketball team at Eurobasket 2009
Kenya national under-20 football team
Israel men's national inline hockey team

Questions?
• http://www.datastax.com/docs
• http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-1
• http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
• http://www.datastax.com/products/enterprise
