DATA MODELING
CASSANDRA USING CQL3
MIKE BIGLAN & ELIJAH HAMOVITZ
ABOUT US
Mike Biglan, M.S.
• Twenty Ideas, Inc – twentyideas.com
• Analytic Spot – spo.td
Elijah Hamovitz
• code.org
• Analytic Spot – spo.td
SIMPLIFIED REPRESENTATION OF REALITY
A DATA MODEL MODELS DATA
• Framework to store and organize data
• Models things, their differences, and
relationships between them
• Things can be real or virtual
IDEAL DATA MODEL PROPERTIES
• Easy to create
• Easy to interface with
• Quick, flexible querying
• Writing is direct and simple
• Easily understandable
• Scalable: can read, write, and store huge amounts of data safely
SCALE AWAY, SCALE AWAY, SCALE AWAY
• Availability: fault tolerance, redundancy, supports
multiple data centers
• Consistency: strong or tunable
• Huge amounts of data (that won’t fit on a single server)
• High velocity of incoming and/or accessed data
COLUMNAR DATABASE
• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase,
Dynamo) are both “NoSQL”
• But modeling in a Document-Store is quite different from modeling in a Columnar store
• Atomic unit of data storage
• Document-Store: document
• Relational Database: row
• Columnar: column
CASSANDRA COLUMN:
- NAME
- VALUE (OR TOMBSTONE)
- TIMESTAMP
- TTL
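The four fields above can be pictured as a small record. What follows is a minimal sketch, not Cassandra's actual implementation: the `Cell` class and `reconcile` function are hypothetical names, a `None` value stands in for a tombstone, and tie-breaking on equal timestamps is simplified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Illustrative stand-in for one stored column (hypothetical class)."""
    name: str                  # column name
    value: Optional[str]       # None marks a tombstone (a deleted value)
    timestamp: int             # write time, e.g. microseconds since the epoch
    ttl: Optional[int] = None  # seconds until expiry; None means no expiry

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


def reconcile(a: Cell, b: Cell) -> Cell:
    """Return the surviving version of a column: the newer timestamp wins."""
    return a if a.timestamp >= b.timestamp else b


old = Cell("Fred:role", "coder", timestamp=100)
new = Cell("Fred:role", "boss", timestamp=200)
deleted = Cell("Fred:role", None, timestamp=300)

latest = reconcile(old, new)              # the write at timestamp 200 survives
after_delete = reconcile(latest, deleted)  # the later tombstone hides the value
```

Because every write carries its own timestamp, a write can simply be appended and resolved later, which fits the "don’t read before write" point made about Cassandra.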
CASSANDRA
Highly Available Distributed Columnar Datastore that’s:
• Near-linearly scalable
• Fault tolerant, no master
• Tunable consistency
• Performant, especially for writes – don’t read before write
GETTING ON THE SCALE
Phase 1 Install Cassandra
Phase 2
Phase 3 Scale!
CQL != SQL
SQL/CQL STATEMENT FLOW
• SQL execution is complex
• CQL execution is relatively simple, hence the tiny subset of syntax
• Much of the complexity of a CQL query lies in which node(s) to fetch data from, write data to, or confirm data with
• So denormalize!
Relational Database: SQL Statement -> Syntax/Semantic Check -> Query Plan & Optimization -> Query Execution -> Data Store -> Result
Cassandra: CQL Statement -> Syntax/Semantic Check -> Query Execution -> Data Store -> Result
CQL != SQL
SELECT <col1>, <col2>, …
FROM <table>
LEFT JOIN <table2>…
WHERE <where-clause>
GROUP BY <colx> HAVING …
ORDER BY <order-clause>
SEVERELY LIMITED
• CQL syntax is a small subset of SQL
mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/
SO WHY THE LIMITATIONS?
If you think of Cassandra as a relational database, it’s hard to understand:
• what is easy
• what is hard
• what is impossible
“Language serves not only to
express thought but to make
possible thoughts which could not
exist without it.”
— Bertrand Russell
THE DISTORTION OF CQL
• Broken mental model hinders optimal modeling
• CQL falsely implies a relational data model and
the design patterns that go with it
• To model Cassandra well, know the underlying
data structure
DATA MODELING IN SQL (NO SHARDING)
1. What are the data?
2. What is the normalized data model?
… months pass …
3. How are the data going to be queried?
4. Optimize any slow areas and/or bottlenecks
• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc
DATA MODELING IN CQL
1. What are the data?
2. What read-queries are needed?
3. How to denormalize during writes?
• on initial write, or use external tools to make this sane
(Some) “premature” optimization is inherent and unavoidable
DATA ECOSYSTEM
To fully (and efficiently) enable everything you are used to in SQL, you must rely on the big(ish) data ecosystem:
• ElasticSearch, Solr, Sphinx
• Redis, Memcached
• Spark
• Spark Streaming or Storm (and Kafka)
COMPLEXITY OF INITIAL DATA MODEL
• Modeling with Relational DB
• Items & their relationships
• Modeling with Cassandra
• Items & their relationships
• How/Where they are stored
(sharding and hot spots)
• What data we want to read
• How (and how often) we write
data into those models
CAN OPTIMIZE LATER
TO MODEL, OPEN THE BLACK BOX
• The goal of a good black box is that you can do a lot without knowing much about what’s inside
• CQL DOES NOT allow you to ignore what’s inside Cassandra
IN THE BEGINNING: THRIFT
• Around Cassandra version 0.8, Thrift started getting replaced with CQL
• Thrift was too low-level, but its interface mapped closely to the underlying Cassandra data structure
THRIFT & CQL TERMINOLOGY
THRIFT | CQL
-------------------------------------------------------
COLUMN FAMILY | TABLE
ROW | PARTITION
COLUMN | CELL
[CELL NAME COMPONENT OR VALUE] | COLUMN
[GROUP OF CELLS WITH SHARED COMPONENT PREFIXES] | ROW
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
CQL TABLE
CREATE TABLE employees (
    company text,
    name text,
    age int,
    role text,
    PRIMARY KEY ((company), name)
);
AND CQL SELECT
> SELECT * FROM employees;
company  | name | age | role
---------------------------------
Foo, inc | Fred | 31  | coder
Foo, inc | Sara | 39  | boss
BarCo    | Bill | 50  | SQL guru
BarCo    | Jane | 20  | hotshot
CREATE TABLE employees (
    company text,
    name text,
    age int,
    role text,
    PRIMARY KEY ((company), name)
);
VS HOW IT’S STORED
employees = {
    "Foo, inc" : {
        "Fred:age" : 31,
        "Fred:role" : "coder",
        "Sara:age" : 39,
        "Sara:role" : "boss"
    },
    "BarCo" : {
        "Bill:age" : 50,
        "Bill:role" : "SQL guru",
        "Jane:age" : 20,
        "Jane:role" : "hotshot"
    }
}
CREATE TABLE employees (
    company text,
    name text,
    age int,
    role text,
    PRIMARY KEY ((company), name)
);
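The mapping from CQL rows to the stored layout above can be sketched in a few lines of Python. This is an illustration only: `to_storage` is a hypothetical helper, and sorting the composed `name:column` strings is a simplification of how Cassandra orders cells by clustering key.

```python
PARTITION_KEY = "company"   # from PRIMARY KEY ((company), name)
CLUSTERING_KEY = "name"


def to_storage(rows):
    """Group rows by partition key and flatten each row into named cells."""
    storage = {}
    for row in rows:
        partition = storage.setdefault(row[PARTITION_KEY], {})
        for column, value in row.items():
            if column in (PARTITION_KEY, CLUSTERING_KEY):
                continue  # key columns become part of the cell name instead
            partition[f"{row[CLUSTERING_KEY]}:{column}"] = value
    # Within a partition, cells are kept in sorted order by cell name.
    return {pk: dict(sorted(cells.items())) for pk, cells in storage.items()}


rows = [
    {"company": "Foo, inc", "name": "Sara", "age": 39, "role": "boss"},
    {"company": "Foo, inc", "name": "Fred", "age": 31, "role": "coder"},
    {"company": "BarCo", "name": "Jane", "age": 20, "role": "hotshot"},
    {"company": "BarCo", "name": "Bill", "age": 50, "role": "SQL guru"},
]
employees = to_storage(rows)
```

Note that the rows were inserted out of order, yet each partition ends up ordered by employee name, which is what makes clustering-key range reads cheap.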
WIDE PARTITIONS (FORMERLY WIDE “ROWS”)
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
• Based on Thrift “rows”, so actually wide “partitions”
• Column names are the clustering key values with the column name as a suffix
• Up to 2 billion cells per partition (but don’t do this)
SETS, MAPS, AND LISTS: OH MY
• Sets, maps, and lists are still stored at the column level
• This enables schemaless modeling, but can result in long column names
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)
PARTITION AND CLUSTERING KEY
CQL Where-Clause variants
• None (kind of)
• key1 & key2
• key1 & key2 & key3
• key1 & key2 & key3 & key4
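The pattern behind these variants can be sketched as a small check, using the `breaks` table from the next slide and assuming key1 and key2 are its partition key columns. The function name `is_supported_where` is hypothetical, and the sketch covers only equality restrictions without secondary indexes or ALLOW FILTERING.

```python
PARTITION_KEYS = ["company", "year"]            # ((company, year), ...)
CLUSTERING_KEYS = ["month", "day", "employee"]  # ..., month, day, employee)


def is_supported_where(restricted):
    """Return True if the restricted columns form an efficient WHERE clause.

    The rule sketched here: either no restriction at all, or the whole
    partition key plus a leading prefix of the clustering columns.
    """
    restricted = set(restricted)
    if not restricted:
        return True  # "None (kind of)": accepted, but it scans every partition
    if not set(PARTITION_KEYS) <= restricted:
        return False  # the partition key must be fully specified
    remaining = restricted - set(PARTITION_KEYS)
    prefix_len = 0
    for column in CLUSTERING_KEYS:
        if column not in remaining:
            break
        prefix_len += 1
    # Anything left over is a skipped clustering column or a non-key column.
    return remaining == set(CLUSTERING_KEYS[:prefix_len])
```

For example, restricting `company`, `year`, and `month` passes, while restricting `company`, `year`, and `day` does not, because `month` is skipped.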
COMPOSITE KEYS
company  | year | month | day | employee | reason
-------------------------------------------------------
Foo, Inc | 2014 | 11    | 27  | Fred     | Thanksgiving
Foo, Inc | 2014 | 12    | 25  | Fred     | Christmas
Foo, Inc | 2014 | 12    | 25  | Sara     | Christmas
Foo, Inc | 2014 | 12    | 26  | Sara     | Boxing day
CREATE TABLE breaks (
    company text,
    year int,
    month int,
    day int,
    employee text,
    reason text,
    PRIMARY KEY ((company, year), month, day, employee)
);
CREATE TABLE breaks (
    company text,
    year int,
    month int,
    day int,
    employee text,
    reason text,
    PRIMARY KEY ((company, year), month, day, employee)
);
breaks = {
    "Foo, Inc:2014" : {
        "11" : {
            "27" : {
                "Fred:reason" : "Thanksgiving"
            }
        },
        "12" : {
            "25" : {
                "Fred:reason" : "Christmas",
                "Sara:reason" : "Christmas"
            },
            "26" : {
                "Sara:reason" : "Boxing day"
            }
        }
    }
}
THEN WHAT IS EASY?
With a dictionary of ordered
dictionaries:
• Grabbing the data (or subset)
from a partition-key
• Getting a slice of data (uses linear
search) based on a partition-key
breaks = {
    "Foo, Inc:2014" : {
        "11" : {
            "27" : {
                "Fred:reason" : "Thanksgiving"
            }
        },
        "12" : {
            "25" : {
                "Fred:reason" : "Christmas",
                "Sara:reason" : "Christmas"
            },
            "26" : {
                "Sara:reason" : "Boxing day"
            }
        }
    }
}
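Both easy operations can be sketched against a flattened version of the structure above. This is a rough illustration, not how Cassandra reads data from disk: `read_slice` is a hypothetical helper, the partition is held as a list sorted by clustering key, and the slice is found with a plain linear scan, as the slide describes.

```python
# Partition key -> rows sorted by clustering key (month, day, employee).
breaks = {
    ("Foo, Inc", 2014): [
        ((11, 27, "Fred"), "Thanksgiving"),
        ((12, 25, "Fred"), "Christmas"),
        ((12, 25, "Sara"), "Christmas"),
        ((12, 26, "Sara"), "Boxing day"),
    ],
}


def read_slice(partition_key, clustering_prefix=()):
    """Return rows of one partition whose clustering key starts with the prefix.

    An empty prefix returns the whole partition.
    """
    prefix = tuple(clustering_prefix)
    rows = breaks.get(partition_key, [])  # one dictionary lookup per partition
    return [(ck, reason) for ck, reason in rows if ck[:len(prefix)] == prefix]


whole_year = read_slice(("Foo, Inc", 2014))
december = read_slice(("Foo, Inc", 2014), (12,))
christmas = read_slice(("Foo, Inc", 2014), (12, 25))
```

Every read starts from a single partition key, and because rows are sorted, each matching slice is a contiguous run of rows.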
WHAT IS HARD? (I.E. COMMON SQL PATTERNS)
• Unique and Group by
• Ordered
• Inverted Index
GROUP-BY & COUNTERS
• Often group-by is used for
counting
• Use counter columns or other
tools (e.g. elasticsearch)
CREATE TABLE employee_break_counts (
    company text,
    employee text,
    break_counts counter,
    PRIMARY KEY ((company), employee)
);
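The idea is to maintain the count at write time instead of grouping at read time. Here is a minimal in-memory sketch of that pattern: `record_break` is a hypothetical helper standing in for a CQL counter update of the form `UPDATE employee_break_counts SET break_counts = break_counts + 1 WHERE ...`, and the sample writes mirror the `breaks` rows shown earlier.

```python
from collections import defaultdict

# company (partition key) -> employee (clustering key) -> break_counts
employee_break_counts = defaultdict(lambda: defaultdict(int))


def record_break(company, employee):
    """Increment the counter as part of every write of a break."""
    employee_break_counts[company][employee] += 1


# One call per row written to the breaks table.
for company, employee in [
    ("Foo, Inc", "Fred"),  # Thanksgiving
    ("Foo, Inc", "Fred"),  # Christmas
    ("Foo, Inc", "Sara"),  # Christmas
    ("Foo, Inc", "Sara"),  # Boxing day
]:
    record_break(company, employee)

# Reading the "group by" result is now a plain partition read.
counts_for_foo = dict(employee_break_counts["Foo, Inc"])
```

The trade-off is an extra write per break in exchange for a read that needs no aggregation.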
ORDERING OR INVERTED-INDEX
• Redundant table, but ordered by new
“column”
• Depending on needs, this can store
the order-field and lookup key OR
some/all of the other data in that table
• If a read will generate more than a
few subsequent child reads then
some/all the other data should be
included
CREATE TABLE employees_by_age (
    company text,
    id int,
    age int,
    name text,
    role text,
    PRIMARY KEY ((company), age, id)
);
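The redundant-table approach amounts to writing each employee twice, once per query. A rough in-memory sketch follows, assuming the denormalized variant that copies all fields into the ordered table. The helper names and the `emp_id` values are hypothetical, and `bisect.insort` stands in for the ordering that Cassandra maintains by clustering key.

```python
import bisect

employees = {}         # company -> {name: row}, keyed like the employees table
employees_by_age = {}  # company -> list of (age, emp_id, name, role), kept sorted


def add_employee(company, emp_id, name, age, role):
    """Write the same employee to both tables (denormalize on write)."""
    employees.setdefault(company, {})[name] = {"id": emp_id, "age": age, "role": role}
    bisect.insort(employees_by_age.setdefault(company, []), (age, emp_id, name, role))


def names_ordered_by_age(company):
    """Read the ordered table directly; no sort and no child reads needed."""
    return [name for _age, _emp_id, name, _role in employees_by_age.get(company, [])]


add_employee("BarCo", 1, "Bill", 50, "SQL guru")
add_employee("BarCo", 2, "Jane", 20, "hotshot")
youngest_first = names_ordered_by_age("BarCo")
```

Storing only `(age, emp_id)` in the second table would save space, but each result would then need a follow-up read against `employees`, which is the child-read cost the slide warns about.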
C* MODELING ANTI-PATTERNS
C* GUIDELINES | SQL GUIDELINES
-----------------------------------------------------
WRITES ARE CHEAP/FAST | MINIMIZE WRITES
STORAGE IS CHEAP | MINIMIZE DUPLICATION OF DATA
PARTITIONS ARE INHERENT | SHARD AT YOUR OWN RISK
STRICT COMPOSITE KEYS | FLEXIBLE SECONDARY INDEXES
SIMPLE QUERIES | COMPLEX QUERIES
C* MODELING PATTERNS
C* GUIDELINES | C* PATTERNS
-----------------------------------------------------
WRITES ARE CHEAP/FAST; STORAGE IS CHEAP | DUPLICATE YOUR DATA
PARTITIONS ARE INHERENT | AVOID HOT SPOTS
STRICT COMPOSITE KEYS; SIMPLE QUERIES | DESIGN TABLES AROUND QUERIES
Questions?
Mike Biglan
mike@twentyideas.com
@twentyideas
Elijah Hamovitz
elijah@code.org

Cassandra Data Modelling with CQL (OSCON 2015)
