Overview of Cassandra and Doradus

Overview of Cassandra and the Doradus OSS Project
Randy Guck
Principal Engineer, Dell Software

Overview
•  What is NoSQL?
– Common RDB roadblocks
– NoSQL database types
•  Overview of Cassandra
– What's unique
– Limitations
•  Doradus
– Architecture
– Features
– The OLAP and Spider storage managers
– What each is good for
– Where to get Doradus

Why RDB Apps Look for Something Else
•  Performance
– B-trees
– Locking
– One writable copy of each record
•  Scaling costs
– RDBs scale "up"
– Big boxes, SANs, Fibre Channel, etc.
•  What if you want...
– Distributed access
– No single points of failure
– Instant failover
– Sharding
– Replication

NoSQL Data Models
Data Model             Examples                             Elastic?  Queries?  Relationships?
Key–Value              LevelDB, Kyoto Cabinet, Redis        No        No        No
Distributed Key–Value  Dynamo, MemcacheDB, Riak, Voldemort  Yes       No        No
Column-Oriented        Accumulo, Cassandra, HBase           Yes       Some      No
Document-Oriented      Couchbase, Elasticsearch, MongoDB    Yes       Yes       Some
Graph                  Neo4J, OrientDB, Titan               No        Yes       Yes
Elastic? = sharding + replication
Queries? = AND/OR/ranges/etc.
Relationships? = built-in support

NoSQL Data Models (cont.)
(Same table as on the previous slide, with one added callout: Doradus goals)

NoSQL Common Traits
•  Distributed cluster of nodes
– Commodity, shared-nothing servers
– Scales horizontally
– Expands elastically
•  Replication
– Performant local access
– Automatic failover
•  De-normalized data model
•  Schemaless/dynamic columns
•  Eventual consistency
(Diagram: ring of N=5 nodes with replication factor RF=3)
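The N=5, RF=3 picture can be sketched in a few lines. This is a simplified consistent-hashing illustration, not Cassandra's actual partitioner or replication strategy: the node names, the MD5-based `token` function, and the "next nodes clockwise" placement rule are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left


def token(value: str) -> int:
    """Hash a string to a ring position (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key: str, nodes: list[str], rf: int) -> list[str]:
    """Return the rf nodes holding `key`: the first node at or after the
    key's token on the ring, followed by its clockwise successors."""
    ring = sorted((token(n), n) for n in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(key)) % len(ring)  # wrap past the last node
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


nodes = [f"node{i}" for i in range(1, 6)]  # N=5 (hypothetical node names)
owners = replicas("62c36", nodes, rf=3)    # RF=3: three distinct nodes hold the key
print(owners)
```

With RF=3, any single node can fail and two copies of every key remain reachable, which is what makes automatic failover possible.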

Is NoSQL Catching On?
Source: db-engines.com

Overview of Cassandra
•  Wide column NoSQL database
•  Open sourced by Facebook
•  Apache Project with active community
•  Commercially supported by DataStax, Acunu, and others
•  Used by 1,500+ companies
•  "Pure peer" architecture
•  Largest known Cassandra cluster: 300+ TB of data on 400+ machines

What is Cassandra best for?
•  Continuous data streams
– Logs, events, audit records, measurements, ...
– Fast data ingestion
– Predictable read performance
•  Partitionable data
– "1,000's of little databases in one"
•  Elastic scalability
– Expand/upgrade/repair without downtime
•  Not good for:
– Blob store
– Persistent queue
– OLTP transactions

CQL Static Table
CREATE TABLE songs (
    id      uuid PRIMARY KEY,
    title   text,
    album   text,
    artist  text,
    data    blob
);

CREATE INDEX ON songs (artist);

Row Key   Columns: "<column name>"="<column value>"
62c36...  "album"="90125"  "artist"="Yes"  "data"=<audio>  "title"="Changes"
837a2...  "album"="Crystal Ball"  "artist"="Styx"  "data"=<audio>  "title"="Put Me On"
2de83...  "album"="Nevermind"  "artist"="Nirvana"  "data"=<audio>  "title"="Breed"
...

CQL Clustered Table
CREATE TABLE playlists (
    id          uuid,
    song_order  int,
    song_id     uuid,   // copied from songs.id
    title       text,   // copied from songs.title
    album       text,   // copied from songs.album
    artist      text,   // copied from songs.artist
    PRIMARY KEY (id, song_order)   // compound key
);

Row Key   Columns: "<song_order>:<column name>"="<column value>"
28d23...  "1:"=""  "1:album"="90125"  "1:artist"="Yes"  "1:song_id"="62c36..."  "1:title"="Changes"
          "2:"=""  "2:album"="Nevermind"  "2:artist"="Nirvana"  "2:song_id"="2de83..."  "2:title"="Breed"
          "3:"=""  ...
2ed91...  "1:"=""  "1:album"="Crystal Ball"  "1:artist"="Styx"  "1:song_id"="837a2..."  "1:title"="Put Me On"
          "2:"=""  ...
...

CQL Clustered Table (cont.)
CQL "Rows"
•  Each playlist id is one storage row (same playlists table and layout as on the previous slide)
•  Columns that share a <song_order> prefix ("1:...", "2:...") together form one CQL row
•  The empty-named column (e.g., "1:"="") marks that the CQL row exists
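The clustered-table layout can be imitated with a small function. This is only a sketch of the naming scheme shown on the slide ("<song_order>:<column name>"), not Cassandra's storage engine; the function name `to_storage_columns` and the dictionary representation of rows are assumptions made for illustration.

```python
def to_storage_columns(cql_rows, partition_key="id", clustering_key="song_order"):
    """Flatten the CQL rows of one partition into composite-named columns.

    Each CQL row contributes a marker column "<order>:" plus one column
    "<order>:<name>" per remaining field. Columns are sorted by
    (song_order, column name), matching the order shown on the slide.
    """
    cells = {}
    for row in cql_rows:
        order = row[clustering_key]
        cells[(order, "")] = ""  # row marker column
        for name, value in row.items():
            if name not in (partition_key, clustering_key):
                cells[(order, name)] = value
    return {f"{order}:{name}": value for (order, name), value in sorted(cells.items())}


playlist_rows = [
    {"id": "28d23...", "song_order": 2, "song_id": "2de83...",
     "title": "Breed", "album": "Nevermind", "artist": "Nirvana"},
    {"id": "28d23...", "song_order": 1, "song_id": "62c36...",
     "title": "Changes", "album": "90125", "artist": "Yes"},
]
columns = to_storage_columns(playlist_rows)
for name, value in columns.items():
    print(f'"{name}"="{value}"')
```

Because all songs of a playlist live in one storage row sorted by song_order, reading a playlist in order is a single sequential slice.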

Can we make Cassandra more appealing?
•  Data Model
– No direct support for relationships
•  Indexing
– Secondary indexes: single column only
– Hash table only: no range searching
•  Searching
– No joins, embedded queries
– No aggregate queries
– Limited equalities (e.g., SELECT * WHERE <key> IN (<list>))
– No full text search
– No OR clauses
– ...

What is Doradus?
•  Java service that enhances Cassandra
•  Adds features:
– REST API (JSON and XML)
– Multi-tenancy
– Graph model
– Multi-field/full text query language
– Automatic data aging
– OLAP and Spider storage services
•  Compatible with NoSQL tenets such as idempotent updates
•  Under development for ~3 years
•  Open source: Apache 2.0 License

Doradus Graph Model
•  A cluster hosts one or more applications
•  An application owns tables, which store objects
•  An object consists of single- and multi-valued fields
•  A pair of link fields forms a bi-directional relationship
(Diagram: example Email application)
Tables and fields:
Message {Size, SendDate}
Participant {ReceiptDate}
Address {Name}
Person {Name, Department}
Attachment {Size, Extension}
Link pairs (each link and its inverse):
Manager ↔ Employees
Person ↔ Address
Attachments ↔ Message
Recipients ↔ MessageAsRecipient
Address ↔ Participants
Sender ↔ MessageAsSender

Example Object and Aggregate Queries
•  Lucene full text query
GET /Email/Person/_query?q=FirstName:j* AND NOT Office:[q TO z]

•  Link path with filtering
GET /Email/Message/_query?q=Sender.WHERE(ReceiptDate>'2010-06-01').Address.Name="*.com"

•  Quantifiers
GET /Email/Message/_aggregate?m=COUNT(*)
    &q=ANY(Recipients).ALL(Address).NONE(Person).Department:sales
    &f=Tags,TOP(3,TRUNCATE(SendDate,DAY))

•  Transitive links
GET /Email/Person/_query?q=DirectReports^(3).LastName=wilson
    &f=DirectReports(Name,DirectReports(Name))
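The queries above are written unencoded for readability; a client has to percent-encode them before sending. A minimal sketch using only the Python standard library is shown here. The host and port in `BASE_URL` and the helper name `doradus_uri` are assumptions; only the `/<application>/<table>/<command>?q=...` shape comes from the examples above.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://localhost:1123"  # hypothetical Doradus host and port


def doradus_uri(application, table, command, **params):
    """Build a percent-encoded URI of the form /<application>/<table>/<command>?..."""
    path = "/".join(quote(part, safe="") for part in (application, table, command))
    query = urlencode(params, quote_via=quote)  # encode spaces as %20, not '+'
    return f"{BASE_URL}/{path}?{query}"


query_text = "FirstName:j* AND NOT Office:[q TO z]"
uri = doradus_uri("Email", "Person", "_query", q=query_text)
print(uri)
```

The resulting string can be passed to any HTTP client as a GET request.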

Doradus: Architecture
(Diagram) Application -> (REST API) -> Doradus -> (Thrift or CQL) -> Cassandra -> Data and log files

Doradus: Multi-Data Center Clusters
(Diagram: one cluster spanning two data centers)
Rack 1, Data Center 1: Node 1, Node 2, Node 3
Rack 1, Data Center 2: Node 4, Node 5, Node 6
Cassandra runs on every node; Doradus instances run alongside Cassandra in each data center
Applications: one group per data center
DC=2, N=6, RF=3

Doradus: Internal Architecture
(Diagram: components of a Doradus instance)
Clients: Apps (REST), Monitor App (JMX)
REST: Embedded Jetty Server
Storage services: Spider Storage Service, OLAP Storage Service
Cassandra Interface (to the Cassandra Cluster)
Configuration: doradus.yaml
Doradus OLAP Service
•  Borrows from online analytical processing
– Sharding as data "cubes"
– Columnar storage
•  Very dense storage
– No indexes!
– Value arrays are compressed
•  Fast load time
– Up to 500,000 objects/second/node
– Small "data lag" time
•  Very fast queries
– Searches millions of objects/second
– Full DQL object and aggregate query support

OLAP Data Loading
(Diagram) Sources -> Segments -> Shards -> OLAP Store
Sources: Events, People, Computers, Domains
Segments: T1, T2, T3, T4, ... (changes in last n minutes)
Shards: date-based (2013-03-01, 2013-02-28, 2013-02-27, ...)
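The date-based sharding step can be sketched as grouping a batch of objects by the day of a timestamp field. This is an illustration of the idea only, not the Doradus loader; the field name `Timestamp` and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def shard_name(timestamp: datetime) -> str:
    """Name a one-day shard after the calendar date, e.g. '2013-03-01'."""
    return timestamp.strftime("%Y-%m-%d")


def group_by_shard(batch, date_field="Timestamp"):
    """Split one batch of objects into per-shard lists keyed by shard name."""
    shards = defaultdict(list)
    for obj in batch:
        shards[shard_name(obj[date_field])].append(obj)
    return dict(shards)


batch = [
    {"EventID": 1, "Timestamp": datetime(2013, 3, 1, 9, 30)},
    {"EventID": 2, "Timestamp": datetime(2013, 2, 28, 23, 59)},
    {"EventID": 3, "Timestamp": datetime(2013, 3, 1, 17, 5)},
]
by_shard = group_by_shard(batch)
print(sorted(by_shard))
```

Queries restricted to a date range then only need to touch the shards whose names fall inside that range.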

OLAP Use Case
•  Data: Windows Events
– 115M events
•  Test parameters
– Server: Quad Xeon CPUs, 32GB memory, 3 disks
– Cassandra memory: 1GB
– Load app/embedded Doradus memory: 4GB
– Load threads: 5
– Batch size: 5,000 events
– Shard size: 1 day (860 shards total)
•  Test results
– Total objects loaded: ~1 billion
– Total time: 32 minutes, 56 seconds
– Load rate: 502,991 objects/second
– Final database size: ~2GB
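The reported figures are mutually consistent, which a quick check confirms. All inputs are taken from the test results above; the objects-per-event ratio is derived arithmetic, not a number stated in the deck.

```python
elapsed_seconds = 32 * 60 + 56        # total time: 32 minutes, 56 seconds
load_rate = 502_991                   # objects/second
events = 115_000_000                  # 115M events

objects_loaded = load_rate * elapsed_seconds
objects_per_event = objects_loaded / events

print(elapsed_seconds)                # 1976
print(objects_loaded)                 # 993910216, i.e. ~1 billion
print(round(objects_per_event, 1))    # 8.6
```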

Doradus Spider Service
•  Analogous to Lucene + NoSQL
•  Fully inverted field indexing
– Configurable analyzers
– Stored-only (non-indexed) fields
•  Unique features:
– Automatic table-level sharding
– Statistics
– Pre-computed aggregate queries
– Refreshed in background
– Object-level data aging
•  Use case example:
– Indexing a massive number of documents

OLAP and Spider: When to Use
•  Spider is best for:
– Unstructured/variable-structure data
– Configurable indexing
– Fine-grained updates with immediate indexing
– Document storage and searching
– Emphasis on full-text/multi-field searching
•  OLAP is best for:
– High-volume data streams
– High-performance analytic queries
– Dense data storage
– Immutable/semi-mutable data
– Data that can be loaded in batches
– Data that can be partitioned (e.g., time-sharded)

Summary
•  What's cool about Doradus?
– Bi-directional links with referential integrity
– Link paths: simpler than joins
– Idempotent updates
– Partial object updates
– Simple transitive searching
– OLAP: dense storage and fast queries
– It's free!

Thank you!
Doradus is available at:
https://github.com/dell-oss/Doradus
Contact me:
randy.guck@dell.software.com
