Introduction to Data Modeling with
Apache Cassandra
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
1 Relational Modeling vs. Cassandra
2 The Basics
3 CQL Collections
4 Relationships
5 Time Series Use Case
Relational Modeling vs. Cassandra
The Good ol’ Relational Database
• Been around a long time (first proposed in 1970)
• Data modeling is well understood (typically 3NF or higher)
• ACID guarantees are easy for developers to reason about
• SQL is ubiquitous and allows flexible querying
– JOINs, Sub SELECTs, etc.
Relational Data Modeling
• Five normal forms
• Foreign Keys
• Joins at read time
– Example SQL: Get employee
and department for user id 5
(Helena Edelson)
Employees
Id  First   Last     DeptId
1   Luke    Tillman  201
2   Jon     Haddad   201
5   Helena  Edelson  205
Departments
Id   Dept
201  Evangelists
205  Engineering
SELECT e.First, e.Last, d.Dept
FROM Employees e
JOIN Departments d
ON e.DeptId = d.Id
WHERE e.Id = 5
Relational Data Modeling Thought Process
Cassandra Data Modeling Thought Process
• CQL syntax is similar to SQL in many cases, but...
• No Joins
• No Aggregations
• Combine table columns into single view at write time
• No joins necessary
Employees
Id  First   Last     Dept
1   Luke    Tillman  Evangelists
2   Jon     Haddad   Evangelists
5   Helena  Edelson  Engineering
SELECT First, Last, Dept
FROM Employees
WHERE Id = 5
Sequences and Auto-Incrementing Ids
• Great for letting the RDBMS handle auto-generating Ids
• Guaranteed to be unique
• Needs ACID to work (uh oh)
INSERT INTO Employees (Id, First, Last)
VALUES (seq.NEXTVAL, 'Patrick', 'McFadin');
No More Sequences
• Almost impossible in a distributed system like Cassandra
• Couple of great choices instead:
– Natural Keys: Unique values like Email
– Surrogate Key: UUID (or GUID for MS folks)
• UUID: Universally Unique Identifier
– 128-bit number represented in character form
– Can be generated easily on the client side
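As a minimal sketch of client-side key generation, the snippet below uses Python's standard `uuid` module. The variable name `user_id` is only illustrative; any driver language has an equivalent UUID library.

```python
import uuid

# Generate a surrogate key on the client; no round trip to the database is needed.
user_id = uuid.uuid4()  # version 4 = random UUID

print(user_id)             # 36-character text form; the value differs on every run
print(len(user_id.bytes))  # 16 bytes = 128 bits
```

Because the key is created before the INSERT is sent, the application never has to ask the cluster for "the next id".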
The Basics
Cassandra Data Modeling Thought Process
• Start with your application and the queries it needs to run
• Then build models to satisfy those queries
Entity Table
• Query: Find user by id
• Simple view of a single user
• UUID used for ID
• Simple primary key
CREATE TABLE users (
  userid uuid,
  firstname text,
  lastname text,
  email text,
  created_date timestamp,
  PRIMARY KEY (userid)
);
SELECT firstname, lastname
FROM users
WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
Entity Table – A reminder on Partition Keys
• First part of Primary Key is the
Partition Key
CREATE TABLE users (
  userid uuid,
  firstname text,
  lastname text,
  email text,
  created_date timestamp,
  PRIMARY KEY (userid)
);
[Diagram: the rows for Luke, Jon, and Patrick, each located by its userid partition key (689d56e5-…, 93357d73-…, d978b136-…)]
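The role of the partition key can be sketched with a toy model: hash the key, then use the hash to pick a node. Everything here is an assumption made for illustration. The node names are hypothetical, and MD5 with a modulo stands in for Cassandra's real partitioner (Murmur3 by default) and its token ranges.

```python
import hashlib
import uuid

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def node_for(partition_key: uuid.UUID) -> str:
    """Map a partition key to a node by hashing it (toy stand-in for a partitioner)."""
    digest = hashlib.md5(partition_key.bytes).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]

userid = uuid.UUID("99051fe9-6a9c-46c2-b949-38ef78858dd0")
print(node_for(userid))  # the same key always maps to the same node
```

The point of the sketch is only that the mapping is deterministic: given the partition key, a reader can go straight to the node that holds the row.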
More Complicated Primary Keys
• Query: Find comments for a video (most recent first)
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
SELECT commentid, userid, comment
FROM comments_by_video
WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
Let's Break This Down
• TimeUUID: a UUID with a timestamp component
• Ordering by a TimeUUID is like ordering by its timestamp
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Example: eeaca440-c745-11e4-8830-0800200c9a66 encodes 03/10/2015 16:53:09 GMT
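To see the timestamp component concretely, here is a small sketch that decodes it with Python's standard library. The function name `timeuuid_to_datetime` is made up for this example; the offset constant is the number of 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-nanosecond intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC timestamp embedded in a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # UUID.time counts 100 ns ticks; integer-divide by 10 to get microseconds.
    microseconds = (value.time - UUID_EPOCH_OFFSET) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=microseconds)

commentid = uuid.UUID("eeaca440-c745-11e4-8830-0800200c9a66")
print(timeuuid_to_datetime(commentid))  # 2015-03-10 16:53:09.892000+00:00
```

The decoded value matches the date shown for this TimeUUID on the slide.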
Let's Break This Down
• The Primary Key uniquely identifies a row, so a comment is
uniquely identified by its videoid and commentid
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• The first part of the Primary Key is the Partition Key, so
comments for a given video will be stored together in a partition
• When we query for a given videoid, we only need to talk to
one partition (and thus one node), which is fast
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• The second part of the Primary Key is the Clustering Column(s)
• Inside a partition, comments for a given video will be ordered
by commentid
• Remember ordering by TimeUUID is ordering by timestamp
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• We can specify a default clustering order when creating the
table which will affect the ordering of the data stored on disk
• Since our query was to get the latest comments for a video, we
order by commentid descending
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
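[Diagram: inside the partition, rows are stored newest first: (10/1/2014 9:36AM), then (9/17/2014 7:55AM)]
This query will be fast
[Diagram: the same partition, newest row (10/1/2014 9:36AM) first, then (9/17/2014 7:55AM)]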
SELECT commentid, userid, comment
FROM comments_by_video
WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
1. Locate the partition
2. Single seek on disk
3. Slice 10 latest rows and return
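The three steps can be mimicked with a toy in-memory model. This is only a sketch of the idea, not how Cassandra stores data: a dictionary stands in for the partitions, rows are ordered by the TimeUUID timestamp alone (ties are ignored), and `make_timeuuid`, `insert_comment`, and `latest_comments` are hypothetical helper names.

```python
import uuid
from collections import defaultdict

# Toy stand-in for comments_by_video: one list of rows per partition key (videoid).
partitions = defaultdict(list)

def make_timeuuid(ticks: int) -> uuid.UUID:
    """Build a version 1 UUID whose timestamp field is `ticks` (demo helper)."""
    return uuid.UUID(fields=(ticks & 0xFFFFFFFF, (ticks >> 32) & 0xFFFF,
                             ((ticks >> 48) & 0x0FFF) | 0x1000,
                             0x80, 0x00, 0x0800200C9A66))

def insert_comment(videoid, commentid, userid, comment):
    """Add a row and keep the partition sorted by commentid timestamp, newest first."""
    rows = partitions[videoid]
    rows.append((commentid, userid, comment))
    rows.sort(key=lambda row: row[0].time, reverse=True)

def latest_comments(videoid, limit=10):
    """1. locate the partition, 2. start at its first row, 3. slice `limit` rows."""
    return partitions.get(videoid, [])[:limit]

video = uuid.UUID("0fe6ab76-cf17-4664-abcc-4e363cee273f")
user = uuid.UUID("99051fe9-6a9c-46c2-b949-38ef78858dd0")
for ticks, text in [(100, "first"), (300, "third"), (200, "second")]:
    insert_comment(video, make_timeuuid(ticks), user, text)

print([comment for _, _, comment in latest_comments(video, limit=2)])  # ['third', 'second']
```

Because the rows are already stored newest first, reading the latest comments is a plain slice from the start of the partition with no sorting at read time.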
Getting the most from queries
• Queries on Partition Key are fast
– Querying inside a single partition should be the goal
– Always specify a value for partition key when querying
• Queries on Partition Key and one or more Clustering Column(s)
are fast
– Again, inside a single partition should be the goal
– Use default ordering when creating the table to optimize if applicable
• Cassandra will give you errors if you try to stray
More than one way to query the same data
• New Query: Find comments made by a user (most recent first)
CREATE TABLE comments_by_user (
  userid uuid,
  commentid timeuuid,
  videoid uuid,
  comment text,
  PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
SELECT commentid, videoid, comment
FROM comments_by_user
WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
More than one way to query the same data
• Two views of the same data
• Use a batch when inserting to both tables
• Denormalize at write time to do efficient queries at read time
CREATE TABLE comments_by_user (
  userid uuid,
  commentid timeuuid,
  videoid uuid,
  comment text,
  PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
CREATE TABLE comments_by_video (
  videoid uuid,
  commentid timeuuid,
  userid uuid,
  comment text,
  PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
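A sketch of what "use a batch when inserting to both tables" could look like from application code follows. The CQL text is a logged batch with `?` bind markers; the Python names `INSERT_COMMENT_BATCH` and `comment_batch_params` are hypothetical, and actually sending the statement (preparing it, choosing a driver) is deliberately left out.

```python
import uuid

# CQL for a logged batch that writes the same comment to both tables.
# The ? bind markers are placeholders to be filled by a driver.
INSERT_COMMENT_BATCH = """\
BEGIN BATCH
  INSERT INTO comments_by_video (videoid, commentid, userid, comment) VALUES (?, ?, ?, ?);
  INSERT INTO comments_by_user (userid, commentid, videoid, comment) VALUES (?, ?, ?, ?);
APPLY BATCH;
"""

def comment_batch_params(videoid, userid, commentid, comment):
    """Return the bind values in the order the two INSERTs expect them (demo helper)."""
    return (videoid, commentid, userid, comment,
            userid, commentid, videoid, comment)

params = comment_batch_params(
    videoid=uuid.UUID("0fe6ab76-cf17-4664-abcc-4e363cee273f"),
    userid=uuid.UUID("99051fe9-6a9c-46c2-b949-38ef78858dd0"),
    commentid=uuid.uuid1(),  # version 1 UUIDs are what CQL calls timeuuid
    comment="Nice video!",
)
print(INSERT_COMMENT_BATCH)
print(len(params))  # 8
```

The same commentid goes into both INSERTs, so the two tables stay two views of one comment.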
CQL Collections
CQL Collection Basics
• Store a collection of related things in a column
• Meant to be dynamic part of a table
• Update syntax is very different from insert
• Reads require all of the collection to be read
CQL Set
• No duplicates, sorted by CQL type's comparator
INSERT INTO collections_example (id, set_example)
VALUES (1, {'Patrick', 'Jon', 'Luke'});
set_example set<text>
(set_example = collection/column name, set = collection type, text = CQL type of the elements)
CQL Set
• Adding an element to a set
UPDATE collections_example
SET set_example = set_example + {'Rebecca'}
WHERE id = 1;
• Removing an element from a set
UPDATE collections_example
SET set_example = set_example - {'Luke'}
WHERE id = 1;
CQL List
• Allows duplicates, sorted by insertion order
• Use with caution
INSERT INTO collections_example (id, list_example)
VALUES (1, ['Patrick', 'Jon', 'Luke']);
list_example list<text>
(list_example = collection/column name, list = collection type, text = CQL type of the elements)
CQL List
• Adding an element to the end of a list
UPDATE collections_example
SET list_example = list_example + ['Rebecca']
WHERE id = 1;
• Adding an element to the beginning of a list
UPDATE collections_example
SET list_example = ['Rebecca'] + list_example
WHERE id = 1;
• Removing an element from a list
UPDATE collections_example
SET list_example = list_example - ['Luke']
WHERE id = 1;
CQL Map
• Key and value, sorted by key's CQL type comparator
INSERT INTO collections_example (id, map_example)
VALUES (1, { 'Patrick' : 72, 'Jon' : 33, 'Luke' : 34 });
map_example map<text, int>
(map_example = collection/column name, map = collection type, text = key CQL type, int = value CQL type)
CQL Map
• Adding an element to a map
UPDATE collections_example
SET map_example['Rebecca'] = 29
WHERE id = 1;
• Updating an existing element in a map
UPDATE collections_example
SET map_example['Jon'] = 34
WHERE id = 1;
• Removing an element from a map
DELETE map_example['Luke']
FROM collections_example
WHERE id = 1;
Relationships
Revisiting our One-to-Many Relationship
Employees
Id        First   Last     DeptId
7bc7a...  Luke    Tillman  5078c...
d7463...  Jon     Haddad   5078c...
8c26b...  Helena  Edelson  1d0f3...
Departments
Id        Dept
5078c...  Evangelists
1d0f3...  Engineering
[Diagram: Department has Employee (one-to-many)]
Revisiting our One-to-Many Relationship
• Query: Get an employee and their department by employee id
– Denormalize department data
First   Last     Dept
Luke    Tillman  Evangelists
Jon     Haddad   Evangelists
Helena  Edelson  Engineering
CREATE TABLE employees (
  id uuid,
  first text,
  last text,
  dept text,
  PRIMARY KEY (id)
);
SELECT first, last, dept
FROM employees
WHERE id = 7bc7a...
What about the other side of the relationship?
• Query: Get all the employees for a given department
CREATE TABLE employees_by_dept (
  dept_id uuid,
  emp_id uuid,
  first text,
  last text,
  dept text,
  PRIMARY KEY (dept_id, emp_id)
);
SELECT first, last, dept
FROM employees_by_dept
WHERE dept_id = 5078c...
What about the other side of the relationship?
CREATE TABLE employees_by_dept (
  dept_id uuid,
  emp_id uuid,
  first text,
  last text,
  dept text,
  PRIMARY KEY (dept_id, emp_id)
);
Static Columns
• Department name (dept)
will be the same across all
rows in the partition
• This is a good candidate
for a static column
CREATE TABLE employees_by_dept (
  dept_id uuid,
  emp_id uuid,
  first text,
  last text,
  dept text,
  PRIMARY KEY (dept_id, emp_id)
);
Static Columns
• For data that is shared across
all rows in a partition, use
static columns
• Updates to the value will
affect all rows in the partition
CREATE TABLE employees_by_dept (
  dept_id uuid,
  emp_id uuid,
  first text,
  last text,
  dept text STATIC,
  PRIMARY KEY (dept_id, emp_id)
);
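The behaviour of a static column can be pictured with a toy model in which the department name is stored once per partition and repeated on every row that is read back. This is an illustration only: the class `DeptPartition`, the ids `emp-1` and `emp-2`, and the new department name are all made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DeptPartition:
    """Toy model of one employees_by_dept partition."""
    dept: str                                      # static: stored once per partition
    employees: dict = field(default_factory=dict)  # emp_id -> (first, last)

    def rows(self):
        """Rows as a query would return them: the static value repeats on every row."""
        return [(first, last, self.dept) for first, last in self.employees.values()]

evangelists = DeptPartition(dept="Evangelists")
evangelists.employees["emp-1"] = ("Luke", "Tillman")  # made-up ids for the demo
evangelists.employees["emp-2"] = ("Jon", "Haddad")

evangelists.dept = "Developer Relations"  # one update to the static value
print(evangelists.rows())
# [('Luke', 'Tillman', 'Developer Relations'), ('Jon', 'Haddad', 'Developer Relations')]
```

A single write to the static value is visible on every row in the partition, which is the effect described in the bullets above.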
Time Series Use Case
Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
Weather Station
Needed Queries
• Get all data for one weather station
• Get data for a single date
and time
• Get data for a range of dates
and times
Data Model for Queries
• Store data per weather station
• Store time series in order:
first to last
Weather Station
• Weather station id and
time are unique
• Store as many as needed
CREATE TABLE temperatures (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  PRIMARY KEY (weather_station, year, month, day, hour)
);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
Storage Model: Logical View
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
Storage Model: Disk Layout
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
Merged, Sorted, and Stored Sequentially
Query Patterns
• Range queries
• "Slice" operation on disk
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10
Partition key for locality
Single seek on disk
Query Patterns
• Programmers like this
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10;
weather_station  hour  temperature
10010:99999      7     -5.6
10010:99999      8     -5.1
10010:99999      9     -4.9
10010:99999      10    -5.3
(Sorted in time order)
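The "slice" idea behind a range query can be sketched with a sorted list and a binary search. This is a toy model under stated assumptions, not Cassandra's storage engine: a dictionary stands in for the partitions, the rows are the four readings inserted earlier, and `slice_hours` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for the temperatures table: per weather station, rows kept sorted
# by the clustering columns (year, month, day, hour).
partitions = {
    "10010:99999": [
        ((2005, 12, 1, 7), -5.6),
        ((2005, 12, 1, 8), -5.1),
        ((2005, 12, 1, 9), -4.9),
        ((2005, 12, 1, 10), -5.3),
    ]
}

def slice_hours(station, year, month, day, first_hour, last_hour):
    """Return the contiguous run of rows between two hours (inclusive)."""
    rows = partitions[station]                # partition key gives locality
    keys = [key for key, _ in rows]
    start = bisect_left(keys, (year, month, day, first_hour))
    end = bisect_right(keys, (year, month, day, last_hour))
    return rows[start:end]                    # one contiguous slice, already in time order

print(slice_hours("10010:99999", 2005, 12, 1, 8, 9))
# [((2005, 12, 1, 8), -5.1), ((2005, 12, 1, 9), -4.9)]
```

Because the rows are stored in clustering order, the range needs no sorting or filtering at read time, only a start position and an end position.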
Takeaway: Goals of Cassandra Data Modeling
• Spread data evenly around the cluster
– Choose a good Primary Key (particularly, the Partition Key portion)
• Minimize the number of partitions read for a given query
– Remember: Partitions are spread out around the cluster
• Do not worry about:
– Minimizing the number of writes: Cassandra is really fast at writes
– Minimizing data duplication: this is not 3NF from RDBMS, disk is cheap
Follow me for updates or to ask questions later: @LukeTillman

Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024Stephanie Beckett
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...Product School
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano

Recently uploaded (20)

IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups

Introduction to Data Modeling with Apache Cassandra

  • 1. Introduction to Data Modeling with Apache Cassandra Luke Tillman (@LukeTillman) Language Evangelist at DataStax
  • 2. 1 Relational Modeling vs. Cassandra
       2 The Basics
       3 CQL Collections
       4 Relationships
       5 Time Series Use Case
  • 4. The Good ol’ Relational Database
      • Been around a long time (first proposed in 1970)
      • Data modeling is well understood (typically 3NF or higher)
      • ACID guarantees are easy for developers to reason about
      • SQL is ubiquitous and allows flexible querying
        – JOINs, Sub SELECTs, etc.
  • 5. Relational Data Modeling
      • Five normal forms
      • Foreign Keys
      • Joins at read time
        – Example SQL: Get employee and department for user id 5 (Helena Edelson)

      Employees
        Id  First   Last     DeptId
        1   Luke    Tillman  201
        2   Jon     Haddad   201
        5   Helena  Edelson  205

      Departments
        Id   Dept
        201  Evangelists
        205  Engineering

        SELECT e.First, e.Last, d.Dept
        FROM Employees e
        JOIN Departments d ON e.DeptId = d.Id
        WHERE e.Id = 5
  • 6. Relational Data Modeling Thought Process: Data → Models → Application
  • 7. Cassandra Data Modeling Thought Process: Application → Models → Data
  • 8. CQL vs SQL
      • Similar syntax in many cases, but...
      • No Joins
      • No Aggregations

      Employees (Id, First, Last, DeptId) and Departments (Id, Dept), with the same rows as on slide 5

        SELECT e.First, e.Last, d.Dept
        FROM Employees e
        JOIN Departments d ON e.DeptId = d.Id
        WHERE e.Id = 5
  • 9. Denormalization
      • Combine table columns into single view at write time
      • No joins necessary

      Employees
        Id  First   Last     Dept
        1   Luke    Tillman  Evangelists
        2   Jon     Haddad   Evangelists
        5   Helena  Edelson  Engineering

        SELECT First, Last, Dept
        FROM Employees
        WHERE Id = 5
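
The denormalization step above can be sketched in a few lines of Python. This is only an illustration with in-memory dictionaries standing in for the tables on the slides; the function name `denormalize` is made up for this example and is not part of any Cassandra API.

```python
# Hypothetical in-memory stand-ins for the two relational tables on slide 5.
employees = {
    1: ("Luke", "Tillman", 201),
    2: ("Jon", "Haddad", 201),
    5: ("Helena", "Edelson", 205),
}
departments = {201: "Evangelists", 205: "Engineering"}


def denormalize(employees, departments):
    """Resolve each department name once, at write time, into a single view."""
    return {
        emp_id: {"First": first, "Last": last, "Dept": departments[dept_id]}
        for emp_id, (first, last, dept_id) in employees.items()
    }


employees_view = denormalize(employees, departments)

# The read is now a single key lookup, with no join needed.
row = employees_view[5]
print(row["First"], row["Last"], row["Dept"])
```

The cost moves to the write path: renaming a department means rewriting every employee row that carries the name.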
  • 10. Sequences and Auto-Incrementing Ids
      • Great for letting the RDBMS handle auto-generating Ids
      • Guaranteed to be unique
      • Needs ACID to work (uh oh)

        INSERT INTO Employees (Id, First, Last)
        VALUES (seq.nextVal(), 'Patrick', 'McFadin')
  • 11. No More Sequences
      • Almost impossible in a distributed system like Cassandra
      • Couple of great choices instead:
        – Natural Keys: Unique values like Email
        – Surrogate Key: UUID (or GUID for MS folks)
      • UUID: Universally Unique Identifier
        – 128-bit number represented in character form
        – Can be generated easily on the client side
        – Example: 99051fe9-6a9c-46c2-b949-38ef78858dd0
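
As a small illustration of generating a surrogate key on the client side, the sketch below uses Python's standard `uuid` module. The helper name `new_surrogate_key` is hypothetical, and the slides do not prescribe a particular UUID version; a random (version 4) UUID is assumed here.

```python
import uuid


def new_surrogate_key() -> uuid.UUID:
    """Create a random (version 4) UUID locally, with no round trip to the database."""
    return uuid.uuid4()


key = new_surrogate_key()
text_form = str(key)  # 32 hex digits plus 4 hyphens, like the example on the slide

print(text_form)
print(len(key.bytes) * 8, "bits")
```

Because no coordination is needed, any number of clients can create keys concurrently without a shared sequence.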
  • 13. Cassandra Data Modeling Thought Process (Application → Models → Data)
      • Start with your application and the queries it needs to run
      • Then build models to satisfy those queries
  • 14. Entity Table
      • Query: Find user by id
      • Simple view of a single user
      • UUID used for ID
      • Simple primary key

        CREATE TABLE users (
          userid uuid,
          firstname text,
          lastname text,
          email text,
          created_date timestamp,
          PRIMARY KEY (userid)
        );

        SELECT firstname, lastname
        FROM users
        WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
  • 15. Entity Table – A reminder on Partition Keys
      • First part of Primary Key is the Partition Key
      • In the users table from slide 14, PRIMARY KEY (userid) makes userid the Partition Key

        userid       firstname  ...
        689d56e5- …  Luke       ...
        93357d73- …  Jon        ...
        d978b136- …  Patrick    ...
  • 16. More Complicated Primary Keys
      • Query: Find comments for a video (most recent first)

        CREATE TABLE comments_by_video (
          videoid uuid,
          commentid timeuuid,
          userid uuid,
          comment text,
          PRIMARY KEY (videoid, commentid)
        ) WITH CLUSTERING ORDER BY (commentid DESC);

        SELECT commentid, userid, comment
        FROM comments_by_video
        WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
        LIMIT 10
  • 17. Let's Break This Down (comments_by_video table from slide 16)
      • TimeUUID: a UUID with a timestamp component
      • Ordering by a TimeUUID is like ordering by its timestamp
      • Example: eeaca440-c745-11e4-8830-0800200c9a66 corresponds to 03/10/2015 16:53:09 GMT
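
To make the timestamp component concrete, here is a minimal sketch that decodes the example TimeUUID from the slide with Python's standard `uuid` module. The helper name `timeuuid_to_datetime` is an assumption for this example; the epoch offset is the standard gap between the UUID epoch (1582-10-15) and the Unix epoch, counted in 100-nanosecond intervals.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-nanosecond intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Extract the embedded timestamp from a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    microseconds = (value.time - UUID_EPOCH_OFFSET) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=microseconds)


comment_id = uuid.UUID("eeaca440-c745-11e4-8830-0800200c9a66")
created = timeuuid_to_datetime(comment_id)
print(created.strftime("%m/%d/%Y %H:%M:%S GMT"))
```

Since the timestamp is recoverable from the id, a separate "created at" column is often unnecessary for ordering purposes.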
  • 18. Let's Break This Down (comments_by_video table from slide 16)
      • The Primary Key uniquely identifies a row, so a comment is uniquely identified by its videoid and commentid
  • 19. Let's Break This Down (comments_by_video table from slide 16)
      • The first part of the Primary Key is the Partition Key, so comments for a given video will be stored together in a partition
      • When we query for a given videoid, we only need to talk to one partition (and thus one node), which is fast
  • 20. Let's Break This Down (comments_by_video table from slide 16)
      • The second part of the Primary Key is the Clustering Column(s)
      • Inside a partition, comments for a given video will be ordered by commentid
      • Remember ordering by TimeUUID is ordering by timestamp
  • 21. Let's Break This Down (comments_by_video table from slide 16)
      • We can specify a default clustering order when creating the table which will affect the ordering of the data stored on disk
      • Since our query was to get the latest comments for a video, we order by commentid descending
  • 22. Let's Break This Down (one partition of the comments_by_video table from slide 16)

        videoid='0fe6a...'
          commentid='82be1...' (10/1/2014 9:36AM)   userid='ac346...'  comment='Awesome!'
          commentid='765ac...' (9/17/2014 7:55AM)   userid='f89d3...'  comment='Garbage!'
  • 23. This query will be fast

        SELECT commentid, userid, comment
        FROM comments_by_video
        WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
        LIMIT 10

      1. Locate single partition
      2. Single seek on disk
      3. Slice 10 latest rows and return
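
The three steps above can be mimicked with a toy in-memory model. This is a sketch of the idea only, not how Cassandra stores data: the class `CommentsByVideo` is hypothetical, a plain `datetime` stands in for the timeuuid clustering column, and the truncated ids are kept exactly as they appear on the slide.

```python
from collections import defaultdict
from datetime import datetime


class CommentsByVideo:
    """Toy model: one list of rows per partition key, kept newest first."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, videoid, commented_at, userid, comment):
        rows = self._partitions[videoid]
        rows.append((commented_at, userid, comment))
        # Mimic CLUSTERING ORDER BY (commentid DESC): newest row first.
        rows.sort(key=lambda row: row[0], reverse=True)

    def latest(self, videoid, limit=10):
        # 1. locate the single partition, 2. start at its beginning, 3. slice.
        return self._partitions.get(videoid, [])[:limit]


table = CommentsByVideo()
table.insert("0fe6a...", datetime(2014, 9, 17, 7, 55), "f89d3...", "Garbage!")
table.insert("0fe6a...", datetime(2014, 10, 1, 9, 36), "ac346...", "Awesome!")

latest = table.latest("0fe6a...", limit=10)
print([comment for _, _, comment in latest])
```

The sorting work happens on insert, so the read never has to sort or scan other partitions.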
  • 24. Getting the most from queries
      • Queries on Partition Key are fast
        – Querying inside a single partition should be the goal
        – Always specify a value for partition key when querying
      • Queries on Partition Key and one or more Clustering Column(s) are fast
        – Again, inside a single partition should be the goal
        – Use default ordering when creating the table to optimize if applicable
      • Cassandra will give you errors if you try to stray
  • 25. More than one way to query the same data
      • New Query: Find comments made by a user (most recent first)

        CREATE TABLE comments_by_user (
          userid uuid,
          commentid timeuuid,
          videoid uuid,
          comment text,
          PRIMARY KEY (userid, commentid)
        ) WITH CLUSTERING ORDER BY (commentid DESC);

        SELECT commentid, videoid, comment
        FROM comments_by_user
        WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
        LIMIT 10
  • 26. More than one way to query the same data
      • Two views of the same data: comments_by_user (slide 25) and comments_by_video (slide 16)
      • Use a batch when inserting to both tables
      • Denormalize at write time to do efficient queries at read time
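
A rough sketch of "write once, into both views" follows. Two dictionaries stand in for the two tables, and the function `add_comment` is a hypothetical helper; in Cassandra itself the two INSERT statements would be grouped in a CQL batch, which this toy code does not attempt to reproduce.

```python
def add_comment(by_video, by_user, videoid, userid, commentid, comment):
    """Write one comment into both query tables (denormalize at write time)."""
    by_video.setdefault(videoid, {})[commentid] = {"userid": userid, "comment": comment}
    by_user.setdefault(userid, {})[commentid] = {"videoid": videoid, "comment": comment}


comments_by_video = {}
comments_by_user = {}

add_comment(
    comments_by_video,
    comments_by_user,
    videoid="0fe6a...",
    userid="ac346...",
    commentid="82be1...",
    comment="Awesome!",
)

# Each query is answered from the table whose partition key matches it.
print(comments_by_video["0fe6a..."])
print(comments_by_user["ac346..."])
```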
  • 28. CQL Collection Basics
      • Store a collection of related things in a column
      • Meant to be dynamic part of a table
      • Update syntax is very different from insert
      • Reads require all of the collection to be read
  • 29. CQL Set
      • No duplicates, sorted by CQL type's comparator
      • Column definition: set_example set<text>
        – set_example: collection name (column name)
        – set: collection type
        – text: CQL type

        INSERT INTO collections_example (id, set_example)
        VALUES (1, {'Patrick', 'Jon', 'Luke'});
  • 30. CQL Set
      • Adding an element to a set

        UPDATE collections_example
        SET set_example = set_example + {'Rebecca'}
        WHERE id = 1

      • Removing an element from a set

        UPDATE collections_example
        SET set_example = set_example - {'Luke'}
        WHERE id = 1
  • 31. CQL List
      • Allows duplicates, sorted by insertion order
      • Use with caution
      • Column definition: list_example list<text>
        – list_example: collection name (column name)
        – list: collection type
        – text: CQL type

        INSERT INTO collections_example (id, list_example)
        VALUES (1, ['Patrick', 'Jon', 'Luke']);
  • 32. CQL List
      • Adding an element to the end of a list

        UPDATE collections_example
        SET list_example = list_example + ['Rebecca']
        WHERE id = 1

      • Adding an element to the beginning of a list

        UPDATE collections_example
        SET list_example = ['Rebecca'] + list_example
        WHERE id = 1

      • Removing an element from a list

        UPDATE collections_example
        SET list_example = list_example - ['Luke']
        WHERE id = 1
  • 33. CQL Map
      • Key and value, sorted by key's CQL type comparator
      • Column definition: map_example map<text, int>
        – map_example: collection name (column name)
        – map: collection type
        – text: key CQL type
        – int: value CQL type

        INSERT INTO collections_example (id, map_example)
        VALUES (1, { 'Patrick' : 72, 'Jon' : 33, 'Luke' : 34 });
  • 34. CQL Map
      • Adding an element to a map

        UPDATE collections_example
        SET map_example['Rebecca'] = 29
        WHERE id = 1

      • Updating an existing element in a map

        UPDATE collections_example
        SET map_example['Jon'] = 34
        WHERE id = 1

      • Removing an element from a map

        DELETE map_example['Luke']
        FROM collections_example
        WHERE id = 1
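
For readers who think in a general-purpose language, the collection updates above map loosely onto Python's built-in `set`, `list`, and `dict`. This is an analogy only, using the values from the slides. Two assumptions are worth flagging: Python sets and dicts are not stored sorted, so `sorted()` is used to imitate the comparator ordering, and the list removal is written to drop every matching element, which is how CQL list subtraction is understood to behave.

```python
# Set: no duplicates; sorted() imitates ordering by the element type's comparator.
set_example = {"Patrick", "Jon", "Luke"}
set_example = set_example | {"Rebecca"}   # set_example + {'Rebecca'}
set_example = set_example - {"Luke"}      # set_example - {'Luke'}
set_result = sorted(set_example)

# List: duplicates allowed, insertion order preserved.
list_example = ["Patrick", "Jon", "Luke"]
list_example = list_example + ["Rebecca"]                    # append
list_example = ["Rebecca"] + list_example                    # prepend
list_example = [name for name in list_example if name != "Luke"]  # remove matches

# Map: key/value pairs; sorted() imitates ordering by key.
map_example = {"Patrick": 72, "Jon": 33, "Luke": 34}
map_example["Rebecca"] = 29   # add an element
map_example["Jon"] = 34       # update an existing element
del map_example["Luke"]       # remove an element
map_result = sorted(map_example.items())

print(set_result)
print(list_example)
print(map_result)
```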
  • 36. Revisiting our One-to-Many Relationship
      • Department (1) has (n) Employee

      Employees
        Id        First   Last     DeptId
        7bc7a...  Luke    Tillman  5078c...
        d7463...  Jon     Haddad   5078c...
        8c26b...  Helena  Edelson  1d0f3...

      Departments
        Id        Dept
        5078c...  Evangelists
        1d0f3...  Engineering
  • 37. Revisiting our One-to-Many Relationship
      • Query: Get an employee and his/her department by employee id
        – Denormalize department data

      Employees
        Id        First   Last     Dept
        7bc7a...  Luke    Tillman  Evangelists
        d7463...  Jon     Haddad   Evangelists
        8c26b...  Helena  Edelson  Engineering

        CREATE TABLE employees (
          id uuid,
          first text,
          last text,
          dept text,
          PRIMARY KEY (id)
        );

        SELECT first, last, dept
        FROM employees
        WHERE id = 7bc7a...
  • 38. What about the other side of the relationship?
      • Query: Get all the employees for a given department

        CREATE TABLE employees_by_dept (
          dept_id uuid,
          emp_id uuid,
          first text,
          last text,
          dept text,
          PRIMARY KEY (dept_id, emp_id)
        );

        SELECT first, last, dept
        FROM employees_by_dept
        WHERE dept_id = 5078c...
  • 39. What about the other side of the relationship? (one partition of employees_by_dept from slide 38)

        dept_id='5078c...'
          emp_id='7bc7a...'  dept='Evangelists'  first='Luke'  last='Tillman'
          emp_id='d7463...'  dept='Evangelists'  first='Jon'   last='Haddad'
  • 40. Static Columns
      • Department name (dept) will be the same across all rows in the partition (see the partition on slide 39)
      • This is a good candidate for a static column
  • 41. Static Columns
      • For data that is shared across all rows in a partition, use static columns
      • Updates to the value will affect all rows in the partition

        CREATE TABLE employees_by_dept (
          dept_id uuid,
          emp_id uuid,
          first text,
          last text,
          dept text STATIC,
          PRIMARY KEY (dept_id, emp_id)
        );

        dept_id='5078c...'  dept='Evangelists'
          emp_id='7bc7a...'  first='Luke'  last='Tillman'
          emp_id='d7463...'  first='Jon'   last='Haddad'
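
The effect of a static column can be pictured with a toy partition object: one value stored once per partition and attached to every row on read. The class `DepartmentPartition` and the renamed department 'Developer Relations' are both hypothetical, introduced only to show an update touching all rows.

```python
class DepartmentPartition:
    """Toy model of one employees_by_dept partition with a static column."""

    def __init__(self, dept_id):
        self.dept_id = dept_id
        self.static = {}   # stored once for the whole partition
        self.rows = {}     # emp_id -> per-row columns

    def set_static(self, **columns):
        self.static.update(columns)

    def upsert_employee(self, emp_id, first, last):
        self.rows[emp_id] = {"first": first, "last": last}

    def select_all(self):
        # Every row read back carries the partition's static values.
        return [
            {"emp_id": emp_id, **self.rows[emp_id], **self.static}
            for emp_id in sorted(self.rows)
        ]


partition = DepartmentPartition("5078c...")
partition.set_static(dept="Evangelists")
partition.upsert_employee("7bc7a...", "Luke", "Tillman")
partition.upsert_employee("d7463...", "Jon", "Haddad")
before = [row["dept"] for row in partition.select_all()]

# One write to the static column changes what every row in the partition returns.
partition.set_static(dept="Developer Relations")  # hypothetical new name
after = [row["dept"] for row in partition.select_all()]

print(before, after)
```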
  • 42. Time Series Use Case
  • 43. Weather Station
      • Weather station collects data
      • Cassandra stores in sequence
      • Application reads in sequence
  • 44. Weather Station
      Needed Queries
      • Get all data for one weather station
      • Get data for a single date and time
      • Get data for a range of dates and times
      Data Model for Queries
      • Store data per weather station
      • Store time series in order: first to last
  • 45. Weather Station
      • Weather station id and time are unique
      • Store as many as needed

        CREATE TABLE temperatures (
          weather_station text,
          year int,
          month int,
          day int,
          hour int,
          temperature double,
          PRIMARY KEY (weather_station, year, month, day, hour)
        );

        INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
        VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
        INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
        VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);
        INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
        VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);
        INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
        VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
  • 46. Storage Model: Logical View

        SELECT weather_station, hour, temperature
        FROM temperatures
        WHERE weather_station = '10010:99999'

        weather_station  hour  temperature
        10010:99999      7     -5.6
        10010:99999      8     -5.1
        10010:99999      9     -4.9
        10010:99999      10    -5.3
  • 47. Storage Model: Disk Layout (same query as slide 46)

        10010:99999
          2005:12:1:7   -5.6
          2005:12:1:8   -5.1
          2005:12:1:9   -4.9
          2005:12:1:10  -5.3
  • 48. Storage Model: Disk Layout (same query as slide 46)
      • Merged, Sorted, and Stored Sequentially

        10010:99999
          2005:12:1:7   -5.6
          2005:12:1:8   -5.1
          2005:12:1:9   -4.9
          2005:12:1:10  -5.3
          2005:12:1:11
  • 49. Query Patterns
      • Range queries
      • "Slice" operation on disk
      • Partition key for locality
      • Single seek on disk

        SELECT weather_station, hour, temperature
        FROM temperatures
        WHERE weather_station = '10010:99999'
          AND year = 2005 AND month = 12 AND day = 1
          AND hour >= 7 AND hour <= 10

        10010:99999
          2005:12:1:7   -5.6
          2005:12:1:8   -5.1
          2005:12:1:9   -4.9
          2005:12:1:10  -5.3
          2005:12:1:11
  • 50. Query Patterns (same query as slide 49)
      • Range queries
      • "Slice" operation on disk

        weather_station  hour  temperature
        10010:99999      7     -5.6
        10010:99999      8     -5.1
        10010:99999      9     -4.9
        10010:99999      10    -5.3
  • 51. Query Patterns (same query and result as slide 50)
      • Programmers like this
      • Sorted in time order
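
The "slice" idea can be sketched with a sorted list and binary search. This is a simplified model, not Cassandra's storage engine: the class `TemperaturePartition` is hypothetical and represents a single weather station's partition, with clustering keys held as `(year, month, day, hour)` tuples and the readings taken from slide 45.

```python
import bisect


class TemperaturePartition:
    """Toy model of one weather station's partition, sorted by clustering key."""

    def __init__(self):
        self._keys = []     # sorted (year, month, day, hour) tuples
        self._values = []   # temperatures, parallel to _keys

    def insert(self, year, month, day, hour, temperature):
        key = (year, month, day, hour)
        index = bisect.bisect_left(self._keys, key)
        if index < len(self._keys) and self._keys[index] == key:
            self._values[index] = temperature   # same primary key: overwrite
        else:
            self._keys.insert(index, key)
            self._values.insert(index, temperature)

    def slice(self, start, end):
        """Return readings with start <= key <= end as one contiguous run."""
        low = bisect.bisect_left(self._keys, start)
        high = bisect.bisect_right(self._keys, end)
        return list(zip(self._keys[low:high], self._values[low:high]))


station = TemperaturePartition()
# Inserted out of order on purpose; the partition keeps them sorted.
station.insert(2005, 12, 1, 9, -4.9)
station.insert(2005, 12, 1, 7, -5.6)
station.insert(2005, 12, 1, 10, -5.3)
station.insert(2005, 12, 1, 8, -5.1)

readings = station.slice((2005, 12, 1, 7), (2005, 12, 1, 10))
print([(key[3], temperature) for key, temperature in readings])
```

Because the rows are already in time order, a range query is one lookup for the start position followed by a sequential read.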
  • 52. Takeaway: Goals of Cassandra Data Modeling
      • Spread data evenly around the cluster
        – Choose a good Primary Key (particularly, the Partition Key portion)
      • Minimize the number of partitions read for a given query
        – Remember: Partitions are spread out around the cluster
      • Do not worry about:
        – Minimizing the number of writes: Cassandra is really fast at writes
        – Minimizing data duplication: this is not 3NF from RDBMS, disk is cheap
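
To illustrate why the Partition Key choice drives how evenly data spreads, the sketch below hashes partition keys onto a small set of nodes. Everything here is an assumption made for illustration: the node names are invented, the station ids are generated, MD5 stands in for the cluster's real partitioner, and a simple modulo replaces the token ranges a real cluster would use.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def node_for(partition_key: str, nodes=NODES) -> str:
    """Map a partition key to a node by hashing it (simplified placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]


# Hypothetical weather station ids in the style used on the slides.
station_ids = [f"{10000 + i}:99999" for i in range(1000)]
placement = Counter(node_for(station_id) for station_id in station_ids)

print(dict(placement))
```

A partition key with many distinct values, such as a station id, spreads across all nodes; a key with only a handful of values would pile data onto a few of them.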
  • 53. Questions? Follow me for updates or to ask questions later: @LukeTillman