Cassandra Data Modelling with CQL (OSCON 2015)

DATA MODELING
CASSANDRA USING CQL3
MIKE BIGLAN & ELIJAH HAMOVITZ

ABOUT US
Mike Biglan, M.S.
• Twenty Ideas, Inc – twentyideas.com
• Analytic Spot – spo.td
Elijah Hamovitz
• code.org
• Analytic Spot – spo.td

SIMPLIFIED REPRESENTATION OF REALITY

A DATA MODEL MODELS DATA
• Framework to store and organize data
• Models things, their differences, and relationships between them
• Things can be real or virtual

IDEAL DATA MODEL PROPERTIES
• Easy to create
• Easy to interface with
• Quick, flexible querying
• Writing is direct and simple
• Easily understandable
• Scalable: can read, write, and store huge amounts of data safely

SCALE AWAY, SCALE AWAY, SCALE AWAY
• Availability: fault tolerance, redundancy, supports multiple data centers
• Consistency: strong or tunable
• Huge amounts of data (that won’t fit on a single server)
• High rate of incoming and/or accessed data

COLUMNAR DATABASE
• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase, Dynamo) are both “NoSQL”
• But modeling in Document-Store is quite different from Columnar
• Atomic unit of data storage
  • Document-Store: document
  • Relational Database: row
  • Columnar: column
CASSANDRA COLUMN:
- NAME
- VALUE (OR TOMBSTONE)
- TIMESTAMP
- TTL
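
A minimal Python sketch of the four column fields listed above. The names `Column`, `TOMBSTONE`, `reconcile`, and `is_live` are hypothetical, and the tie-breaking in `reconcile` is simplified; this only illustrates why a timestamp, a tombstone, and a TTL are carried per column, and is not Cassandra's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel standing in for a deleted value.
TOMBSTONE = object()


@dataclass(frozen=True)
class Column:
    name: str
    value: object               # the stored value, or TOMBSTONE if deleted
    timestamp: int              # write time in microseconds
    ttl: Optional[int] = None   # seconds until expiry; None means no expiry


def reconcile(a: Column, b: Column) -> Column:
    """Last write wins: keep the version with the newer timestamp.

    Simplification: ties go to `a`.
    """
    return a if a.timestamp >= b.timestamp else b


def is_live(col: Column, now_us: int) -> bool:
    """A column is readable if it is not a tombstone and has not expired."""
    if col.value is TOMBSTONE:
        return False
    if col.ttl is not None and now_us >= col.timestamp + col.ttl * 1_000_000:
        return False
    return True
```

Because every write carries its own timestamp, a delete is just another write (a tombstone) that wins over older values when versions are reconciled.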

CASSANDRA
Highly Available Distributed Columnar Datastore that’s:
• Near-linearly scalable
• Fault tolerant, no master
• Tunable consistency
• Performant, especially for writes – don’t read before write

GETTING ON THE SCALE
Phase 1 Install Cassandra
Phase 2
Phase 3 Scale!

SQL/CQL STATEMENT FLOW
• SQL execution is complex
• CQL execution is relatively simple, hence the tiny subset of syntax
• Much of the complexity of a CQL query is in which node(s) to fetch, write, or confirm data from/to
• So denormalize!

Relational Database       | Cassandra
--------------------------------------------------
SQL Statement             | CQL Statement
Syntax/Semantic Check     | Syntax/Semantic Check
Query Plan & Optimization | -
Query Execution           | Query Execution
Data Store                | Data Store
Result                    | Result

CQL != SQL
SELECT <col1>, <col2>, …
FROM <table>
LEFT JOIN <table2>…
WHERE <where-clause>
GROUP BY <colx> HAVING …
ORDER BY <order-clause>
SEVERELY LIMITED
SEVERELY LIMITED
• CQL syntax is a small subset of SQL
mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

SO WHY THE LIMITATIONS?
If you think of Cassandra as a relational database, it’s hard to understand:
• what is easy
• what is hard
• what is impossible

“Language serves not only to
express thought but to make
possible thoughts which could not
exist without it.”
— Bertrand Russell

THE DISTORTION OF CQL
• Broken mental model hinders optimal modeling
• CQL falsely implies a relational data model and the design patterns that go with it
• To model Cassandra well, know the underlying data structure

DATA MODELING IN SQL
(NO SHARDING)
1. What are the data?
2. What is the normalized data model?
… months pass …
3. How are the data going to be queried?
4. Optimize any slow areas and/or bottlenecks
• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc.

DATA MODELING IN CQL
1. What are the data?
2. What read-queries are needed?
3. How to denormalize during writes?
• on initial write, or use external tools to make this sane
(Some) “premature” optimization is inherent and unavoidable
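
A rough Python sketch of step 3, assuming two read queries are needed: "employees of a company" and "employees with a given role". The in-memory dicts and the names `employees_by_company`, `employees_by_role`, and `write_employee` are hypothetical stand-ins for two Cassandra tables; the point is only that one logical write fans out to one table per read query.

```python
from collections import defaultdict

# One "table" per read query (hypothetical names, in-memory stand-ins).
employees_by_company = defaultdict(dict)  # company -> {name: record}
employees_by_role = defaultdict(dict)     # role -> {(company, name): record}


def write_employee(company, name, age, role):
    """Denormalize on write: store the same record once per query table."""
    record = {"company": company, "name": name, "age": age, "role": role}
    employees_by_company[company][name] = record
    employees_by_role[role][(company, name)] = record


write_employee("Foo, inc", "Fred", 31, "coder")
write_employee("Foo, inc", "Sara", 39, "boss")
write_employee("BarCo", "Jane", 20, "coder")

# Each read is now a single lookup by partition key, with no join.
coders = sorted(employees_by_role["coder"])
```

The cost is that every update must touch all of the tables, which is why the read queries have to be known up front.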

DATA ECOSYSTEM
To fully (and efficiently) enable everything from SQL that you are used to, you must rely on the big(ish) data ecosystem:
• ElasticSearch, Solr, Sphinx
• Redis, Memcached
• Spark
• Spark Streaming or Storm (and Kafka)

COMPLEXITY OF INITIAL DATA MODEL
• Modeling with Relational DB
  • Items & their relationships
• Modeling with Cassandra
  • Items & their relationships
  • How/Where they are stored (sharding and hot spots)
  • What data we want to read
  • How (and how often) we write data into those models
CAN OPTIMIZE LATER

TO MODEL, OPEN THE BLACK BOX
• Goal of a good black box is you can do a lot without knowing much about what’s inside
• CQL DOES NOT allow you to ignore what’s inside Cassandra

IN THE BEGINNING: THRIFT
• Around Cassandra version 0.8, Thrift started getting replaced with CQL
• Thrift was too low-level, but the interface had a close mapping to the underlying Cassandra data structure

THRIFT & CQL TERMINOLOGY
THRIFT                                          | CQL
-----------------------------------------------------------
COLUMN FAMILY                                   | TABLE
ROW                                             | PARTITION
COLUMN                                          | CELL
[CELL NAME COMPONENT OR VALUE]                  | COLUMN
[GROUP OF CELLS WITH SHARED COMPONENT PREFIXES] | ROW
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

CQL TABLE
CREATE TABLE employees (
  company text,
  name text,
  age int,
  role text,
  PRIMARY KEY ((company), name)
);

VS HOW IT’S STORED
employees = {
  "Foo, inc" : {
    "Fred:age" : 31,
    "Fred:role" : "coder",
    "Sara:age" : 39,
    "Sara:role" : "boss"
  },
  "BarCo" : {
    "Bill:age" : 50,
    "Bill:role" : "SQL guru",
    "Jane:age" : 20,
    "Jane:role" : "hotshot"
  }
}
CREATE TABLE employees (
  company text,
  name text,
  age int,
  role text,
  PRIMARY KEY ((company), name)
);
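
The mapping from CQL rows to the stored layout above can be sketched in Python. The function `to_storage` is hypothetical and handles only a single partition key column and a single clustering column; sorting the joined cell names as strings is a simplification that happens to give the same order for this data.

```python
def to_storage(rows, partition_key, clustering_key):
    """Group CQL-style rows into {partition: {"<clustering>:<column>": value}}."""
    table = {}
    for row in rows:
        partition = table.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns become part of the names, not values
            partition[f"{row[clustering_key]}:{column}"] = value
    # Cells inside a partition are kept in sorted order by name.
    return {pk: dict(sorted(cells.items())) for pk, cells in table.items()}


rows = [
    {"company": "Foo, inc", "name": "Sara", "age": 39, "role": "boss"},
    {"company": "Foo, inc", "name": "Fred", "age": 31, "role": "coder"},
    {"company": "BarCo", "name": "Bill", "age": 50, "role": "SQL guru"},
]
employees = to_storage(rows, "company", "name")
```

Each CQL row turns into one cell per non-key column, and all rows sharing a partition key end up in the same ordered dictionary.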

WIDE PARTITIONS
(FORMERLY WIDE “ROWS”)
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
• Based on Thrift “rows”, so actually wide “partitions”
• Columns are the clustering key values with the column-name suffix
• Up to 2 billion columns per partition (but don’t do this)

SETS, MAPS, AND LISTS: OH MY
• Sets, maps, and lists are still stored at the column level
• Enables schemaless modeling, but can result in long column names
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)
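
A small Python sketch of why collections lead to long column names, assuming the layout described in the linked slides: each map entry or set element becomes its own cell, with the key or element appended to the cell name. The function `flatten_collection` and the `phones` and `skills` columns are hypothetical; lists are left out of this sketch.

```python
def flatten_collection(clustering, column, value):
    """Flatten one collection column into {cell_name: cell_value}."""
    if isinstance(value, dict):
        # Map: one cell per key; the map key is part of the cell name.
        return {f"{clustering}:{column}:{k}": v for k, v in value.items()}
    if isinstance(value, (set, frozenset)):
        # Set: the element itself lives in the cell name; the value is empty.
        return {f"{clustering}:{column}:{e}": "" for e in sorted(value)}
    raise TypeError("this sketch only handles maps and sets")


phone_cells = flatten_collection(
    "Fred", "phones", {"home": "555-0100", "work": "555-0199"}
)
skill_cells = flatten_collection("Fred", "skills", {"cql", "python"})
```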

PARTITION AND CLUSTERING KEY
CQL Where-Clause variants
• None (kind of)
• key1 & key2
• key1 & key2 & key3
• key1 & key2 & key3 & key4

company | year | month | day | employee | reason
-------------------------------------------------
Foo, Inc | 2014 | 11 | 27 | Fred | Thanksgiving
Foo, Inc | 2014 | 12 | 25 | Fred | Christmas
Foo, Inc | 2014 | 12 | 25 | Sara | Christmas
Foo, Inc | 2014 | 12 | 26 | Sara | Boxing day
CREATE TABLE breaks (
  company text,
  year int,
  month int,
  day int,
  employee text,
  reason text,
  PRIMARY KEY ((company, year), month, day, employee)
);
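
The where-clause variants listed above follow one rule for the `breaks` table: restrict the whole partition key (key1 & key2), then optionally a leading prefix of the clustering columns (key3, key4, ...). A hedged Python sketch of that rule follows; the function `is_supported_where` is hypothetical, covers equality restrictions only, and ignores secondary indexes, `ALLOW FILTERING`, and range restrictions.

```python
PARTITION_KEYS = ["company", "year"]
CLUSTERING_KEYS = ["month", "day", "employee"]


def is_supported_where(restricted, partition_keys, clustering_keys):
    """Check whether equality restrictions on these columns form a usable query."""
    restricted = set(restricted)
    if not restricted:
        return True  # "None (kind of)": a scan over the whole table
    if not set(partition_keys) <= restricted:
        return False  # every partition key column must be given
    rest = restricted - set(partition_keys)
    prefix_len = 0
    for column in clustering_keys:
        if column not in rest:
            break
        prefix_len += 1
    # Everything left must be a gap-free prefix of the clustering columns.
    return prefix_len == len(rest)


ok = is_supported_where(
    ["company", "year", "month"], PARTITION_KEYS, CLUSTERING_KEYS
)
skips_month = is_supported_where(
    ["company", "year", "day"], PARTITION_KEYS, CLUSTERING_KEYS
)
```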

COMPOSITE KEYS
CREATE TABLE breaks (
  company text,
  year int,
  month int,
  day int,
  employee text,
  reason text,
  PRIMARY KEY ((company, year), month, day, employee)
);
breaks = {
  "Foo, Inc:2014" : {
    "11" : {
      "27" : {
        "Fred:reason" : "Thanksgiving"
      }
    },
    "12" : {
      "25" : {
        "Fred:reason" : "Christmas",
        "Sara:reason" : "Christmas"
      },
      "26" : {
        "Sara:reason" : "Boxing day"
      }
    }
  }
}

THEN WHAT IS EASY?
With a dictionary of ordered dictionaries:
• Grabbing the data (or a subset) for a partition key
• Getting a slice of data (uses linear search) based on a partition key
breaks = {
  "Foo, Inc:2014" : {
    "11" : {
      "27" : {
        "Fred:reason" : "Thanksgiving"
      }
    },
    "12" : {
      "25" : {
        "Fred:reason" : "Christmas",
        "Sara:reason" : "Christmas"
      },
      "26" : {
        "Sara:reason" : "Boxing day"
      }
    }
  }
}
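
A Python sketch of the two easy reads, with the partition modeled as a list of cells kept sorted by clustering key (month, day, employee). The names `breaks_cells` and `read_slice` are hypothetical. The slice is a linear scan that stops once the matching run ends, which works only because the cells are sorted.

```python
# partition key -> cells sorted by (month, day, employee)
breaks_cells = {
    ("Foo, Inc", 2014): [
        ((11, 27, "Fred"), "Thanksgiving"),
        ((12, 25, "Fred"), "Christmas"),
        ((12, 25, "Sara"), "Christmas"),
        ((12, 26, "Sara"), "Boxing day"),
    ],
}


def read_slice(table, partition_key, prefix=()):
    """Return the cells of one partition whose clustering key starts with prefix."""
    prefix = tuple(prefix)
    result = []
    for clustering, value in table.get(partition_key, []):
        if clustering[:len(prefix)] == prefix:
            result.append((clustering, value))
        elif result:
            break  # sorted cells: matches are contiguous, so stop early
    return result


whole_partition = read_slice(breaks_cells, ("Foo, Inc", 2014))
december = read_slice(breaks_cells, ("Foo, Inc", 2014), prefix=(12,))
christmas = read_slice(breaks_cells, ("Foo, Inc", 2014), prefix=(12, 25))
```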

WHAT IS HARD?
(I.E. COMMON SQL PATTERNS)
• Unique and Group by
• Ordered
• Inverted Index

GROUP-BY & COUNTERS
• Often group-by is used for counting
• Use counter columns or other tools (e.g. elasticsearch)
CREATE TABLE employee_break_counts (
  company text,
  employee text,
  break_counts counter,
  PRIMARY KEY ((company), employee)
);
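
A Python sketch of replacing a read-time group-by with a counter maintained at write time. The in-memory `Counter` stands in for the `employee_break_counts` table, and `record_break` is a hypothetical helper; in CQL the increment would be an `UPDATE ... SET break_counts = break_counts + 1` on the counter column.

```python
from collections import Counter

# Stand-in for the employee_break_counts table, keyed by (company, employee).
employee_break_counts = Counter()


def record_break(company, employee):
    """Bump the counter when a break is written, so no GROUP BY is needed later."""
    employee_break_counts[(company, employee)] += 1


# The four rows of the breaks table shown earlier.
for company, employee in [
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Fred"),
    ("Foo, Inc", "Sara"),
    ("Foo, Inc", "Sara"),
]:
    record_break(company, employee)

fred_breaks = employee_break_counts[("Foo, Inc", "Fred")]
```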

ORDERING OR INVERTED-INDEX
• Redundant table, but ordered by new “column”
• Depending on needs, this can store the order-field and lookup key OR some/all of the other data in that table
• If a read will generate more than a few subsequent child reads, then some/all of the other data should be included
CREATE TABLE employees_by_age (
  company text,
  id int,
  age int,
  name text,
  role text,
  PRIMARY KEY ((company), age, id)
);
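
A Python sketch of the redundant, age-ordered table. Each partition is a list kept sorted by (age, id), mirroring the clustering key above, so an age range becomes a contiguous slice. The helper names and the employee ids are hypothetical; this version stores all of the other data (name, role) alongside the order field.

```python
import bisect
from collections import defaultdict

# company -> list of (age, emp_id, name, role), kept sorted by (age, emp_id)
employees_by_age = defaultdict(list)


def insert_employee(company, emp_id, age, name, role):
    """Insert into the partition while preserving clustering order."""
    bisect.insort(employees_by_age[company], (age, emp_id, name, role))


def employees_in_age_range(company, min_age, max_age):
    """Return rows with min_age <= age <= max_age as one contiguous slice."""
    rows = employees_by_age.get(company, [])
    lo = bisect.bisect_left(rows, (min_age,))
    hi = bisect.bisect_left(rows, (max_age + 1,))
    return rows[lo:hi]


insert_employee("Foo, inc", 2, 39, "Sara", "boss")
insert_employee("Foo, inc", 1, 31, "Fred", "coder")
thirties = employees_in_age_range("Foo, inc", 30, 39)
```

Storing name and role here avoids a second lookup per result, at the price of writing the data twice.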

C* MODELING ANTI-PATTERNS
C* GUIDELINES           | SQL GUIDELINES
-----------------------------------------------------
WRITES ARE CHEAP/FAST   | MINIMIZE WRITES
STORAGE IS CHEAP        | MINIMIZE DUPLICATION OF DATA
PARTITIONS ARE INHERENT | SHARD AT YOUR OWN RISK
STRICT COMPOSITE KEYS   | FLEXIBLE SECONDARY INDEXES
SIMPLE QUERIES          | COMPLEX QUERIES

C* MODELING PATTERNS
C* GUIDELINES                           | C* PATTERNS
----------------------------------------------------------------------
WRITES ARE CHEAP/FAST, STORAGE IS CHEAP | DUPLICATE YOUR DATA
PARTITIONS ARE INHERENT                 | AVOID HOT SPOTS
STRICT COMPOSITE KEYS, SIMPLE QUERIES   | DESIGN TABLES AROUND QUERIES
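
One way to reason about "avoid hot spots" is to count how many rows each candidate partition key would put in a single partition. The Python sketch below uses made-up data (300 rows for one company over three years) and a hypothetical helper `partition_sizes`; it compares partitioning by company alone with the (company, year) key used by the `breaks` table.

```python
from collections import Counter


def partition_sizes(rows, partition_columns):
    """Count rows per partition for a candidate partition key."""
    return Counter(tuple(row[c] for c in partition_columns) for row in rows)


# Made-up data: 100 break rows per year for a single company.
rows = [
    {"company": "Foo, Inc", "year": year, "day": day}
    for year in (2012, 2013, 2014)
    for day in range(100)
]

by_company = partition_sizes(rows, ["company"])
by_company_year = partition_sizes(rows, ["company", "year"])
```

With company alone, everything lands in one ever-growing partition; adding year bounds each partition and spreads the data over more of them.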

Questions?
Mike Biglan
mike@twentyideas.com
@twentyideas
Elijah Hamovitz
elijah@code.org
