Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra

Artem Chebotko, Ph.D.
November, 2019

Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 2

“Manage massive amounts of data,
fast, without losing sleep”
cassandra.apache.org

Cassandra Use Cases
Slide 4
“Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.”
“Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).”
• Personalization
• Fraud detection
• Messaging
• Playlists
• Internet of Things

Many Reasons to Choose Cassandra
Slide 5
• High Availability – Always On
• No single point of failure
• Fault-tolerance via replication and tunable consistency
• Best-in-class multi-datacenter support
• No downtime or interruption due to node maintenance
• Performance and Scalability
• Very fast writes
• Fast reads
• Linear scalability
• Elasticity

How Cassandra Organizes Data
Slide 6
CREATE KEYSPACE library
WITH replication = {'class':
'NetworkTopologyStrategy',
'DC-West': '3',
'DC-East': '5'};
[Figure: keyspace library]

Slide 7
CREATE TABLE library.venues_by_year (
  year INT,
  name TEXT,
  country TEXT,
  homepage TEXT,
  PRIMARY KEY (year, name));
[Figure: keyspace library containing tables artifacts and venues_by_year]

Slide 8
[Figure: a table in keyspace library stored as partitions, one partition per year]
year | name | country | …
2019 | Apache Cassandra Summit | USA | …
2019 | Data Modeling Zone | USA | …
2019 | DataStax Accelerate | USA | …
year | name | country | …
2015 | A … | … | …
2015 | B … | … | …
2015 | C … | … | …

Slides 9 to 11
[Figure: partitions of keyspace library replicated across two datacenters, DC-West (RF=3) and DC-East (RF=5)]

Cassandra Query Language (CQL)
• Data Definition
• CREATE KEYSPACE, CREATE TABLE
• CREATE INDEX, CREATE CUSTOM INDEX
• CREATE MATERIALIZED VIEW
• Data Manipulation
• SELECT
• INSERT, UPDATE, DELETE
Slide 12

CQL CREATE TABLE
Slide 13
CREATE TABLE name (
  column type STATIC,
  column type STATIC,
  ...,
  PRIMARY KEY ( (column, ...), column, ... )
) WITH CLUSTERING ORDER BY (clustering_key_column (ASC|DESC), ...);
• name: table name
• column type STATIC: column names, types, optional STATIC designation
• (column, ...): partition key
• column, ...: optional clustering key
• WITH CLUSTERING ORDER BY: row ordering in a partition

Table with Single-Row Partitions
Slide 14
CREATE TABLE users (
id UUID,
name TEXT,
email TEXT,
PRIMARY KEY (id)
);
id | email | name
a7e78478-0a54-4949-90f3-14ec4cbea40c | jbellis@datastax.com | Jonathan
67657da3-4443-46ab-b60a-510a658fc7bb | achebotko@datastax.com | Artem
3b1f62b1-386b-46e3-b55d-00f1abbafb2b | patrick@datastax.com | Patrick
Users
id UUID K
name TEXT
email TEXT

Table with Multi-Row Partitions
Slide 15
CREATE TABLE artifacts_by_venue (
  venue TEXT,
  year INT,
  artifact TEXT,
  title TEXT,
  country TEXT STATIC,
  PRIMARY KEY ((venue, year), artifact)
);
Partition venue = DataStax Accelerate, year = 2019 (static country = USA)
artifact | title
A… | Linear Scalability …
… | …
Z… | Building Cloud …
Partition venue = Data Modeling Zone, year = 2019 (static country = USA)
artifact | title
A… | New approach to …
… | …
Artifacts_by_venue
venue TEXT K
year INT K
artifact TEXT C↑
title TEXT
country TEXT S

CQL SELECT
Slide 16
SELECT selectors
FROM table_name
WHERE primary_key_conditions
AND index_conditions
GROUP BY primary_key_columns
ORDER BY clustering_key_columns ( ASC | DESC )
LIMIT N
ALLOW FILTERING ;
• selectors: columns, aggregates, functions
• FROM: one table per query
• WHERE, GROUP BY, ORDER BY: restricted to primary key columns
• ALLOW FILTERING: danger

Sample CQL Queries
Slide 17
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ?;
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND artifact = ?;
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND
      artifact > ? AND artifact < ?
ORDER BY artifact DESC;
Artifacts_by_venue
venue TEXT K
year INT K
artifact TEXT C↑
title TEXT
country TEXT S

Invalid CQL Queries
Slide 18
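SELECT *
FROM artifacts_by_venue
followed by any of these WHERE clauses is invalid:
WHERE venue = ?;    -- partition key column year is not restricted
WHERE venue = ? AND artifact = ?;    -- partition key column year is not restricted
WHERE artifact > ? AND artifact < ?;    -- no partition key column is restricted
WHERE venue = ? AND year = ? AND title = ?;    -- title is neither a primary key column nor indexed
WHERE country = ?;    -- country is neither a primary key column nor indexed
Artifacts_by_venue
venue TEXT K
year INT K
artifact TEXT C↑
title TEXT
country TEXT S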

Important Implications for Data Modeling
• Data
• Primary keys define data uniqueness
• Partition keys define data distribution
• Partition keys affect partition sizes
• Clustering keys define row ordering
• Query
• Primary keys define how data is retrieved
• Partition keys allow equality predicates
• Clustering keys allow inequality predicates and ordering
• Only one table per query, no joins
Slide 19


Data Modeling
• Collection and analysis of data requirements
• Identification of participating entities and relationships
• Identification of data access patterns
• A particular way of organizing and structuring data
• Design and specification of a database schema
• Schema optimization and data indexing techniques
Slide 21
Data quality: completeness consistency accuracy
Data access: queryability efficiency scalability

Key Data Modeling Steps
• Understand the data
• Identify access patterns
• Apply a query-first approach
• Optimize and implement
Slide 22
[Figure: Conceptual Data Model and Application Workflow, followed by Logical Data Model, followed by Physical Data Model]

The Data Modeling Framework
Slide 23
[Figure: Conceptual Data Model + Application Workflow → (map) → Logical Data Model → (optimize) → Physical Data Model]
Defines models and transitions


Conceptual Data Modeling
• Conceptual data model
• High-level view of data – entity and relationship types
• Technology-independent or technology-agnostic
• Not specific to Cassandra or any other database system
• Purpose: understanding data
• The scope of what needs to be accomplished
• Essential components, concepts, entities, relationships
• Keys and cardinality constraints are absolutely essential
Slide 25

Advantages of Conceptual Data Modeling
• Complex data modeling problems are more manageable
• Less chance of producing an incorrect or incomplete model
• Saves time in the long run
• Improves data, business process, and risk management
• Readable by both technical and non-technical people
• Good for data governance
• Understanding of a data modeling problem is documented
• Can be shared, reviewed and agreed on
• Improves understanding and eliminates ambiguity
Slide 26

Conceptual Data Modeling Techniques
• Entity-relationship modeling
• Entities and relationships that can exist among them
• Unified Modeling Language (UML) class diagrams
• Classes and associations that can exist among them
• Object-role modeling or fact-oriented modeling
• Entities and facts that define relationships among them
• Dimensional modeling
• Dimensions and facts that relate them
• Ontological modeling
• Concepts and relations that can exist among them
Slide 27

ER Model
• Chen’s notation
• Simple and original
• Truly technology-independent
• Other notations may be influenced by relational databases
• Crow’s foot, Barker’s notation, Information engineering, IDEF1X
• Many hybrids exist
Slide 28

ER Model Basics
• Entity – object that is involved in an information system
• Example: Jonathan Ellis (an author), “The State of Cassandra, 2014” (a presentation)
• Entity type – set of similar objects
• Example: Author, Presentation
• Relationship – relates two or more entities
• Example: Jonathan Ellis creates “The State of Cassandra, 2014”
• Relationship type – set of similar relationships
• Example: Author creates Presentation
Slide 29

Entity Type Example
• Name – usually a noun
• Attributes – atomic, set-valued, composite, derived
• Key – minimum set of attributes that uniquely identify an entity
Slide 30
[Figure: entity type User with attributes id, name (first_name, last_name), DOB, age, emails]

Relationship Type Example
• Name – usually a verb
• Attributes – can be atomic, set-valued, composite, derived
• Roles – each role names a related entity
• Key – minimum set of roles and attributes that uniquely identify a relationship
• Cardinality constraints – how many times an entity can participate in a relationship
Slide 31
[Figure: Venue (name, year, country, homepage) features Artifact (id, title, authors, keywords), cardinality 1:n; the diagram also shows an attribute date]

How is a Key of a Relationship Derived?
• 1:1 relationship (Author has Bio): key is fname, lname
• 1:n relationship (Venue features Artifact): key is id
• m:n relationship (Author creates Artifact): key is fname, lname, id
Slide 32
[Figure: ER diagrams for Author has Bio (1:1), Venue features Artifact (1:n), and Author creates Artifact (m:n)]

Entity Type Hierarchy Example
• Attribute inheritance: id, title, authors, and keywords are inherited by Article and Presentation
• Disjoint – cannot have an entity that is both Article and Presentation
• Covering – cannot have a Digital Artifact to be anything but Article or Presentation
Slide 33
[Figure: Digital Artifact (id, title, keywords, authors) IsA Article or Presentation, disjoint and covering; subtype attributes shown: url, abstract, doi]

Conceptual Data Model for Digital Library
Slide 34
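[Figure: ER diagram with entity types User (id, name, email), Digital Artifact (id, title, keywords, authors; disjoint, covering IsA hierarchy), Venue (name, year, country, homepage), and Review (id, title, body, rating); relationship types features, posts, and likes with 1:n and m:n cardinalities; the diagram also shows an attribute timestamp]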

Application Workflow Model
• High-level application design
• Tasks, causal dependencies, access patterns
• Technology-independent or technology-agnostic
• Not specific to Cassandra or any other database system
• Purpose: understand data access patterns
• Each application has a workflow
• Data-driven tasks access a database
• A sequence of tasks defines a sequence of data access patterns
Slide 35

Application Workflow for Digital Library
Slide 36
Tasks and causal dependencies
• Search for artifacts by a venue, author, title, or keyword
• Display information for a venue
• Display a rating of an artifact
• Display reviews for an artifact
• Display likes for an artifact
• Find information for an artifact with a given id
• Show information about a user
• Show likes for a review
• Show reviews by a user

Application Workflow for Digital Library
Slide 37
ACCESS PATTERNS
Q1: Find artifacts for a specified venue ...
Q2: Find artifacts for a specified author ...
Q3: Find artifacts with a specified title ...
Q4: Find artifacts with a specified keyword ...
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact ...
...
Data access patterns
• Search for artifacts by a venue, author, title, or keyword: Q1, Q2, Q3, Q4
• Display information for a venue: Q5
• Display a rating of an artifact: Q6
• Display reviews for an artifact: Q7
• Display likes for an artifact: Q8
• Show reviews by a user: Q9
• Show information about a user: Q10
• Show likes for a review: Q11
• Find information for an artifact with a given id: Q12


Logical Data Modeling
• Logical data model
• Sketch data model for Cassandra
• Chebotko Diagrams for visualization
• Purpose: sound, query-driven design
• Data organization into tables according to the queries
• Correctness of primary key design
• Denormalization, nesting, duplication
Slide 39

Chebotko Diagrams
• Visual representation of a logical data model
• Tables are represented by rectangles and have names and columns
• Columns may optionally be designated as K (partition key column), C (clustering key column), S (static column), and IDX (indexed column)
• Access patterns and their ordering are represented by query-labeled connections
Slide 40
[Figure: excerpt of a Chebotko Diagram with tables Venues, Artifacts_by_venue, Artifacts_by_author, Artifacts_by_title, Artifacts_by_keyword, Artifacts, Ratings_by_artifact, Reviews_by_artifact, and Likes_by_artifact, connected by access patterns Q1 to Q8]

Conceptual-to-Logical Mapping
Slide 41
[Figure: Conceptual Data Model + Application Workflow → (map) → Logical Data Model]
• Mapping rules
• Mapping patterns

Mapping Rules
• Mapping rule 1: “Entities and relationships”
• Entity and relationship types map to tables
• Mapping rule 2: “Key attributes”
• Key attributes map to primary key columns
• Mapping rule 3: “Equality search attributes”
• Equality search attributes map to partition key columns
• Mapping rule 4: “Inequality search attributes”
• Inequality search attributes map to clustering columns
• Mapping rule 5: “Ordering attributes”
• Ordering attributes map to clustering columns
Slide 42
[Figure: the rules take the Conceptual Data Model and the Application Workflow as input]

Example
• Conceptual data model
• Query
• Find artifacts that appeared in a particular venue after a specified year; order results by year (desc) and title (asc)
• Query predicate (equality and inequality): name = ? AND year > ?
• Ordering attributes: year (DESC), title (ASC)
Slide 43
[Figure: Venue (name, year, country, homepage) features Digital Artifact (id, title, keywords, authors), cardinality 1:n]

Applying Mapping Rules 1 and 2
• Mapping rule 1: “Entities and relationships”
• The query only concerns a part of the diagram
• Data should be organized based on the relationship type
• Mapping rule 2: “Key attributes”
• Key of the relationship is id
• id must map to a primary key column (say column artifact)
Slide 44
[Figure: Venue (name, year, country) features Digital Artifact (id, title), cardinality 1:n]
Artifacts_by_venue
...
artifact C↑
...

Applying Mapping Rule 3
• Mapping rule 3: “Equality search attributes”
• Equality search attribute name=? maps to the 1st column of the primary key
• It must be a partition key column (say column venue)
Slide 45
Artifacts_by_venue
venue K
...
artifact C↑
...

Applying Mapping Rule 4
• Mapping rule 4: “Inequality search attributes”
• Inequality search attribute year>? maps to a clustering column
Slide 46
Artifacts_by_venue
venue K
year C↑
...
artifact C↑
...

Applying Mapping Rule 5
• Mapping rule 5: “Ordering attributes”
• Ordering attributes year (DESC) and title (ASC) map to clustering columns
• year is already part of the schema but its order should be reversed to DESC
• title is added next
Slide 47
Artifacts_by_venue (after reversing the order of year)
venue K
year C↓
...
artifact C↑
...
Artifacts_by_venue (after adding title)
venue K
year C↓
title C↑
artifact C↑
...

Final Result
Slide 48
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year > ?
ORDER BY year DESC, title ASC
Artifacts_by_venue
venue K
year C↓
title C↑
artifact C↑
type
authors (list)
keywords (set)
How did we get column type?
[Figure: Venue (name, year, country) features Digital Artifact (id, title), cardinality 1:n, together with the disjoint, covering IsA hierarchy of Digital Artifact]
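The final Chebotko Diagram above can be written down as CQL. The following is only a sketch: the diagram gives column roles but not data types, so the types here are assumptions borrowed from the physical data model later in the deck, and the keyspace name library is taken from the earlier slides.

```cql
-- Sketch of the table derived by mapping rules 1 to 5.
-- Column types are assumed; the logical diagram does not specify them.
CREATE TABLE library.artifacts_by_venue (
    venue    TEXT,
    year     INT,
    title    TEXT,
    artifact TEXT,
    type     TEXT,
    authors  LIST<TEXT>,
    keywords SET<TEXT>,
    -- venue: equality search attribute -> partition key (rule 3)
    -- year: inequality search attribute -> clustering column (rule 4)
    -- title: ordering attribute -> clustering column (rule 5)
    -- artifact: key of the relationship -> guarantees uniqueness (rule 2)
    PRIMARY KEY ((venue), year, title, artifact)
) WITH CLUSTERING ORDER BY (year DESC, title ASC, artifact ASC);

-- The query from the example; the ORDER BY matches the clustering order,
-- so rows are returned in their stored order.
SELECT *
FROM library.artifacts_by_venue
WHERE venue = ? AND year > ?
ORDER BY year DESC, title ASC;
```

Because the clustering order already matches the requested ordering, the ORDER BY clause could be omitted without changing the result order.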

Mapping Patterns
• Semi-formal definitions of common mapping use cases
• Graphical rather than mathematical representation
• Use clustering columns as the data nesting mechanism
• Do not take ordering of results into consideration
• Guide schema design
• Ensure correctness and efficiency
• Enable automation
Slide 49

Common Mapping Patterns
• Entity patterns
• 1:1 relationship patterns
• 1:n relationship patterns
• m:n relationship patterns
• Hierarchical patterns
Slide 50

1:n relationship mapping pattern 3.1
• Search attributes = key attributes
Slide 51
[Figure: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) RT (attr) ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3), cardinality 1:n]
ACCESS PATTERN
search attributes: key1.1 (=), key1.2 (= or >)
PRIMARY KEY: All search attributes, followed by all key attributes of RT
STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key
ET2_by_ET1_key (equality on key1.2)
key1.1 K
key1.2 K
key2.1 C↑
key2.2 C↑
attr1.1 S
attr1.2 S
attr1.3 (collection) S
attr2.1
attr2.2
attr2.3 (collection)
attr
ET2_by_ET1_key (inequality on key1.2)
key1.1 K
key1.2 C↑
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr
What if we add green attributes to the above table?

1:n relationship mapping pattern 3.1 (example)
• Search attributes = key attributes
Slide 52
[Figure: Venue (name, year, country, homepage) features Artifact (id, title, authors, keywords), cardinality 1:n]
ACCESS PATTERN
search attributes: name (=), year (= or >)
PRIMARY KEY: All search attributes, followed by all key attributes of features
STATIC COLUMNS: Non-key attributes of Venue, iff all key attributes of Venue are part of the partition key
Artifacts_by_venue (equality on year)
venue (=name) K
year K
artifact (= id) C↑
country S
homepage S
title
authors (list)
keywords (set)
Artifacts_by_venue (inequality on year)
venue (=name) K
year C↓
title
authors (list)
keywords (set)
What about country and homepage?

1:n relationship mapping pattern 3.2
• Search attributes ≠ key attributes
Slide 53
[Figure: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) RT (attr) ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3), cardinality 1:n]
ACCESS PATTERN
search attributes: attr1.1 (=), attr1.2 (= or >)
PRIMARY KEY: All search attributes, followed by all key attributes of RT
ET2_by_ET1_non-key (equality on attr1.2)
attr1.1 K
attr1.2 K
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr
ET2_by_ET1_non-key (inequality on attr1.2)
attr1.1 K
attr1.2 C↑
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr
All ET1's attributes can be added at the cost of duplicating them for every entity of type 2

1:n relationship mapping pattern 3.2 (example)
• Search attributes ≠ key attributes
Slide 54
[Figure: Venue (name, year, country, homepage) features Artifact (id, title, authors, keywords), cardinality 1:n]
ACCESS PATTERN
search attributes: country (=), year (= or >)
PRIMARY KEY: All search attributes, followed by all key attributes of features
Artifacts_by_country (equality on year)
country K
year K
title
authors (list)
keywords (set)
Artifacts_by_country (inequality on year)
country K
year C↓
title
authors (list)
keywords (set)
name and homepage can be added at the cost of duplicating them for every artifact

Logical Data Model for Digital Library
Slide 55
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of ‘likes’ for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of ‘likes’ for a specified review.
Q12: Find information for a specified artifact.
...
Venues (Q5): name K, year K, country IDX, homepage
Artifacts_by_venue (Q1): venue K, year C↓, artifact C↑, type, title, authors (list), keywords (set)
Artifacts_by_author (Q2): author K, year C↓, artifact C↑, type, title, authors (list), keywords (set), venue
Artifacts_by_title (Q3): title K, year C↓, artifact C↑, type, authors (list), keywords (set), venue
Artifacts_by_keyword (Q4): keyword K, year C↓, artifact C↑, type, title, authors (list), keywords (set), venue
Ratings_by_artifact (Q6): artifact K, num_ratings (counter), sum_ratings (counter)
Reviews_by_artifact (Q7): artifact K, review (timeuuid) C↓, rating IDX, title, body, user
Likes_by_artifact (Q8): artifact K, num_likes (counter)
Reviews_by_user (Q9): user K, review (timeuuid) C↓, rating, title, body, artifact_id, artifact_title, artifact_authors (list), user_name S, user_email S
Users (Q10): id K, name, email
Likes_by_review (Q11): review K, num_likes (counter)
Artifacts (Q12): artifact K, type, title, authors (list), keywords (set), venue, year

Agenda
• Physical Data Modeling and Optimization Techniques
Slide 56

Physical Data Modeling
• Physical data model
• Complete data model for Cassandra
• Chebotko Diagrams for visualization
• CQL for a database schema
• Purpose: efficient, implementation-ready design
• Data model efficiency analysis and validation
• Schema design optimizations
• Techniques for concurrent data access
Slide 57

Logical Data Model – Correctness and Efficiency
• But …
• Database engine has limitations
• Resources are finite
• Some operations may require special considerations
Slide 58

Physical Data Model – More Efficiency
• Physical data model takes into account …
• Partition sizes
• Data duplication factors
• Data types, indexes, materialized views
• Concurrent data access requirements
Slide 59

Partition Size Limits
• Theoretical limits
• 2 billion values
• Node disk size
• Practical limits for Cassandra
Slide 60
Practical limit | Cassandra 2 | Cassandra 3
Values per partition | 100K | Up to 10x
Partition size | 100MB | Up to 10x

Estimating a Partition Size (Cassandra 3)
Slide 61
Nv – number of values in a partition
Ncv – number of clustering column values in a partition
Nrv – number of regular column values in a partition
Nsv – number of static column values in a table definition
Nr – number of rows in a partition
Ncc – number of clustering columns in a table definition
Nrc – number of regular columns in a table definition
Nsc – number of static columns in a table definition
Sp – size of a partition in bytes
sizeOf – size (in bytes) of a CQL data type
ck – partition key column in a table definition
cs – static column in a table definition
cr – regular column in a table definition
cc – clustering column in a table definition
sizeOf(tavg) – average size (in bytes) of a timestamp delta associated with a value
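The formulas that these symbols belong to did not survive extraction. A reconstruction that is consistent with the definitions above and with the worked example on the next two slides (3002 values, 96064 bytes) might look as follows. It assumes that every regular column holds a value in every row, and the form of the timestamp term is inferred from the example's 2002 x 8 bytes rather than taken from the source.

```latex
\begin{aligned}
N_{cv} &= N_r \times N_{cc} \\
N_{rv} &= N_r \times N_{rc} \\
N_{sv} &= N_{sc} \\
N_v    &= N_{cv} + N_{rv} + N_{sv} \\[4pt]
S_p    &= \sum_{i} \mathrm{sizeOf}(c_{k_i})
        + \sum_{j} \mathrm{sizeOf}(c_{s_j})
        + N_r \times \Big( \sum_{k} \mathrm{sizeOf}(c_{r_k})
        + \sum_{l} \mathrm{sizeOf}(c_{c_l}) \Big) \\
       &\quad + \mathrm{sizeOf}(t_{avg}) \times (N_{rv} + N_{sv})
\end{aligned}
```

For the Reviews_by_user example, N_r = 1000, N_cc = 1, N_rc = 2, and N_sc = 2, which gives N_v = 1000 + 2000 + 2 = 3002.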

Simplified Example: Values
• User with 1000 reviews = 1000 rows in a partition
Slide 62
Reviews_by_user
user UUID K
review TIMEUUID C↓: 1000 values
rating FLOAT: 1000 values
title TEXT: 1000 values
user_name TEXT S: 1 value
user_email TEXT S: 1 value
Total: 1000 + 1000 + 1000 + 1 + 1 = 3002 values

Simplified Example: Bytes
• User with 1000 reviews = 1000 rows in a partition
Slide 63
Reviews_by_user
user UUID K: 16 bytes
review TIMEUUID C↓: 1000 x 16 bytes
rating FLOAT: 1000 x 4 bytes
title TEXT: 1000 x 60 bytes
user_name TEXT S: 12 bytes
user_email TEXT S: 20 bytes
Subtotal: 16 + 16000 + 4000 + 60000 + 12 + 20 = 80048 bytes
Timestamp deltas: 2002 x 8 = 16016 bytes
Total: 80048 + 16016 = 96064 bytes

Getting a Partition Size Empirically
Slide 64
$ nodetool flush library reviews_by_user
$ nodetool tablestats -H library.reviews_by_user
Keyspace : library
...
Table: reviews_by_user
SSTable count: 1
...
Number of partitions (estimate): 1
...
Compacted partition minimum bytes: 88149
Compacted partition maximum bytes: 105778
Compacted partition mean bytes: 105778
...

Splitting Large Partitions
• Solution: introduce an additional column to a partition key
• Use an existing column – convenience
• Use an artificial “bucket” column – more control
• Cons: supported access patterns may change
• Example:
• Millions of artifacts with the same keyword across different venues and years
Slide 65
Artifacts_by_keyword
keyword K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue

Using an Existing Column
Slide 66
Artifacts_by_keyword (before)
keyword TEXT K
year INT C↓
artifact TEXT C↑
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Artifacts_by_keyword (after: year becomes a partition key column)
keyword TEXT K
year INT K
artifact TEXT C↑
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
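As a rough illustration of the split above, the table with year promoted to the partition key could be declared as follows. The keyspace name library and the table name artifacts_by_keyword are assumptions carried over from earlier slides; the column types are the ones shown on this slide.

```cql
-- Sketch: year moves from a clustering column into the partition key,
-- so each (keyword, year) pair becomes its own, smaller partition.
CREATE TABLE library.artifacts_by_keyword (
    keyword  TEXT,
    year     INT,
    artifact TEXT,
    type     TEXT,
    title    TEXT,
    authors  LIST<TEXT>,
    keywords SET<TEXT>,
    venue    TEXT,
    PRIMARY KEY ((keyword, year), artifact)
);

-- The supported access pattern changes: year must now be specified.
SELECT *
FROM library.artifacts_by_keyword
WHERE keyword = ? AND year = ?;

-- Several years can still be read with IN, at the cost of
-- touching one partition per listed year.
SELECT *
FROM library.artifacts_by_keyword
WHERE keyword = ? AND year IN (2019, 2018, 2017);
```

This is the trade-off named under "Cons": the original query by keyword alone, ordered by year, is no longer answered by a single partition.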

Using an Artificial Column
Slide 67
Artifacts_by_keyword (before)
keyword TEXT K
year INT K
artifact TEXT C↑
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Artifacts_by_keyword (after: artificial bucket column added to the partition key)
keyword TEXT K
year INT K
bucket INT K
artifact TEXT C↑
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
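A minimal sketch of the bucketed variant follows. How bucket values are assigned is not stated on the slide; the example assumes, purely for illustration, that the application assigns each artifact to one of four buckets (0 to 3), for instance by hashing the artifact identifier. The keyspace and table names are likewise assumptions.

```cql
-- Sketch: an artificial bucket column spreads one (keyword, year)
-- combination over several partitions.
CREATE TABLE library.artifacts_by_keyword (
    keyword  TEXT,
    year     INT,
    bucket   INT,   -- hypothetical: 0..3, chosen by the application
    artifact TEXT,
    type     TEXT,
    title    TEXT,
    authors  LIST<TEXT>,
    keywords SET<TEXT>,
    venue    TEXT,
    PRIMARY KEY ((keyword, year, bucket), artifact)
);

-- Reading everything for a keyword and year now means reading
-- every bucket, either with IN or with one query per bucket.
SELECT *
FROM library.artifacts_by_keyword
WHERE keyword = ? AND year = ? AND bucket IN (0, 1, 2, 3);
```

The number of buckets gives direct control over partition size, which is the "more control" advantage mentioned earlier; raising it later requires rewriting existing data.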

Data Duplication Considerations
• Data duplication is necessary
• Data duplication vs. data replication
• Data duplication factor
• Data duplication and data consistency
Slide 68

Duplication Across Tables
• Each artifact is stored once in Artifacts
• Each artifact is stored once in Artifacts_by_venue
• Duplication factor = 2
Slide 69
Artifacts_by_venue
venue K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year

Duplication Across Partitions
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5
Slide 70
Artifacts_by_author
author K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue

Duplication Across Rows
• An artifact with 5 keywords is stored in 5 rows of the same partition
• Duplication factor = 5 x 5 = 25
Slide 71
Artifacts_by_author
author K
keyword C↑
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue

Beware of Non-constant Duplication Factors
• Users can add new keywords to artifacts
• There is no limit on the number of keywords per artifact
• An artifact with n keywords is stored in n rows of the same partition
• Duplication factor = 5 x n
• Do things differently
• Place reasonable limits that can be gradually increased
Slide 72
Artifacts_by_author
author K
keyword C↑
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue

Keeping Up with Data Duplication and Data Consistency
Slide 73
• Insert a new artifact
BEGIN BATCH
INSERT INTO artifacts ...
INSERT INTO artifacts_by_title ...
APPLY BATCH;
• Update an existing artifact title
BEGIN BATCH
UPDATE artifacts ...
INSERT INTO artifacts_by_title ...
DELETE FROM artifacts_by_title ...
APPLY BATCH;
Artifacts_by_title
title K
year C↓
artifact C↑
type
authors (list)
keywords (set)
venue
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year

Keeping Up with Data Duplication and Data Consistency
How About Materialized Views?
Slide 74
Base table
CREATE TABLE artifacts (
  artifact TEXT,
  type TEXT,
  title TEXT,
  authors LIST<TEXT>,
  keywords SET<TEXT>,
  venue TEXT,
  year INT,
  PRIMARY KEY (artifact)
);
Materialized view
CREATE MATERIALIZED VIEW artifacts_by_title AS
  SELECT title, year, artifact, type, authors, keywords, venue
  FROM artifacts
  WHERE title IS NOT NULL AND artifact IS NOT NULL
  PRIMARY KEY (title, artifact);
Artifacts_by_title
title K
year C↓
artifact C↑
type
authors (list)
keywords (set)
venue
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year

Selecting Column Data Types
Slide 75
ASCII
BIGINT
BLOB
BOOLEAN
COUNTER
DATE
DECIMAL
DOUBLE
FLOAT
INET
INT
LIST
MAP
SET
SMALLINT
TEXT
TIME
TIMESTAMP
TIMEUUID
TINYINT
TUPLE
UUID
VARCHAR
VARINT
CREATE TYPE library.ADDRESS (
street TEXT,
city TEXT,
state TEXT,
postal_code TEXT
);
CREATE TABLE library.users (
id UUID PRIMARY KEY,
name TEXT,
other_names SET<TEXT>,
phones MAP<TEXT,TEXT>,
current_address ADDRESS,
past_addresses LIST<FROZEN<ADDRESS>>
);

Indexing Options
• Local indexes
• Secondary indexes
• SSTable-attached secondary indexes (SASI)
• Distributed indexes
• Materialized views
Slide 76

When to Use a Secondary Index?
• Queries on low-cardinality columns
• Mostly analytical queries with larger result sets
• Generally, these are expensive queries
• Queries that involve both partition key and indexed column
• Searching within a large partition
• Efficient queries
Slide 77
Venues
name TEXT K
year INT K
country TEXT IDX
homepage TEXT
Reviews_by_artifact
artifact TEXT K
review TIMEUUID C↓
rating INT IDX
title TEXT
body TEXT
user UUID
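The two indexed columns above correspond to the two use cases in the bullets. A sketch in CQL, assuming the tables live in the library keyspace and using index names (venues_country_idx, reviews_rating_idx) that are invented for this example:

```cql
-- Tables as drawn in the diagrams above.
CREATE TABLE library.venues (
    name     TEXT,
    year     INT,
    country  TEXT,
    homepage TEXT,
    PRIMARY KEY ((name, year))
);

CREATE TABLE library.reviews_by_artifact (
    artifact TEXT,
    review   TIMEUUID,
    rating   INT,
    title    TEXT,
    body     TEXT,
    user     UUID,
    PRIMARY KEY ((artifact), review)
) WITH CLUSTERING ORDER BY (review DESC);

-- Case 1: low-cardinality column, no partition key in the query.
-- The query may have to consult every node, so it is expensive.
CREATE INDEX venues_country_idx ON library.venues (country);
SELECT * FROM library.venues WHERE country = ?;

-- Case 2: partition key plus indexed column.
-- The search stays inside one partition, so it is efficient.
CREATE INDEX reviews_rating_idx ON library.reviews_by_artifact (rating);
SELECT * FROM library.reviews_by_artifact
WHERE artifact = ? AND rating = ?;
```

The second query is the form used for access pattern Q7, reviews for a specified artifact, possibly with a specified rating.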

When to Use a Materialized View?
• Queries on higher-cardinality columns
• Similar advantages as those of regular tables
• Convenience of automatic view maintenance
• Reads are as fast as for regular tables
• Important limitations
• Restrictions on how PRIMARY KEY is constructed
• Slower writes to the base table
• Base-view inconsistencies
Slide 78

Concurrent Data Access and Data Consistency
• Implementing a voting system for artifacts
• Two users submit their votes concurrently for the same artifact
User 1: read(votes:10) write(votes:11)
User 2: read(votes:10) write(votes:11)
• Both users write votes:11, so one vote is lost (incorrect)
Votes_by_artifact
artifact TEXT K
votes INT
Slide 79

Lightweight Transactions and Concurrent Data Access
• LWTs guarantee correctness
• Expensive – four coordinator-replica round trips
• Failed LWTs must be repeated – can become a bottleneck
UPDATE votes_by_artifact
SET votes = 11
WHERE artifact = 'conf/cassandra/Ellis11(1)'
IF votes = 10;
User 1: read(votes:10) LWT-write(votes:11)
User 2: read(votes:10) LWT-write(votes:11)
Slide 80

COUNTERs and Concurrent Data Access
• COUNTERs may give slightly inaccurate results
• Expensive – mutexed read-before-write
• Limited to integral columns and operations of addition or subtraction
UPDATE votes_by_artifact
SET votes = votes + 1
WHERE artifact = 'conf/cassandra/Ellis11(1)';
Votes_by_artifact
artifact TEXT K
votes COUNTER
Slide 81

Eliminating the Need for Concurrent Data Access
• Isolating computation by isolating data
• Each vote is stored separately – writes are fast
• Data aggregation could be a bit more expensive
Slide 82
Votes_by_artifact
artifact TEXT K
user UUID C↑
or
Votes_by_artifact
artifact TEXT K
user UUID C↑
vote INT
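The second variant above, with an explicit vote column, might be used as follows. This is a sketch only: the keyspace name library is an assumption, and the artifact identifier is the one used in the earlier UPDATE examples.

```cql
-- Sketch: one row per (artifact, user), so concurrent voters never
-- update the same cell and no read-before-write is needed.
CREATE TABLE library.votes_by_artifact (
    artifact TEXT,
    user     UUID,
    vote     INT,
    PRIMARY KEY ((artifact), user)
);

-- Each user writes only their own row.
INSERT INTO library.votes_by_artifact (artifact, user, vote)
VALUES ('conf/cassandra/Ellis11(1)', ?, 1);

-- Aggregation happens at read time over a single partition.
SELECT COUNT(*) AS num_votes, SUM(vote) AS sum_votes
FROM library.votes_by_artifact
WHERE artifact = 'conf/cassandra/Ellis11(1)';
```

A repeated vote by the same user simply overwrites that user's row, which also prevents double counting; the cost moves to the aggregate read, as the third bullet notes.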

Adding Additional Columns To a Table
• Avoiding querying multiple tables or partitions
• Storing aggregate values for faster access
©2014 DataStax Training. Use only with permission. Slide 83
Before
Artifacts
artifact TEXT K
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
year INT
Ratings_by_artifact
artifact TEXT K
num_ratings COUNTER
sum_ratings COUNTER
After
Artifacts
artifact TEXT K
avg_rating FLOAT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
year INT

Physical Data Model for Digital Library
Slide 84
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of ‘likes’ for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of ‘likes’ for a specified review.
Q12: Find information for a specified artifact.
...
Venues (Q5): name TEXT K, year INT K, country TEXT IDX, homepage TEXT
Artifacts_by_venue (Q1, Q6): venue TEXT K, year INT C↓, artifact TEXT C↑, avg_rating FLOAT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>
Artifacts_by_author (Q2, Q6): author TEXT K, year INT C↓, artifact TEXT C↑, avg_rating FLOAT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT
Artifacts_by_title (Q3, Q6): title TEXT K, year INT C↓, artifact TEXT C↑, avg_rating FLOAT, type TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT
Artifacts_by_keyword (Q4, Q6): keyword TEXT K, year INT K, artifact TEXT C↑, avg_rating FLOAT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT
Ratings_by_artifact (Q6): artifact TEXT K, num_ratings COUNTER, sum_ratings COUNTER
Reviews_by_artifact (Q7): artifact TEXT K, review TIMEUUID C↓, rating INT IDX, title TEXT, body TEXT, user UUID
Likes_by_artifact (Q8): artifact TEXT K, num_likes COUNTER
Reviews_by_user (Q9): user UUID K, review TIMEUUID C↓, rating FLOAT, title TEXT, body TEXT, artifact_id TEXT, artifact_title TEXT, artifact_authors LIST<TEXT>, user_name TEXT S, user_email TEXT S
Users (Q10): id UUID K, name TEXT, email TEXT
Likes_by_review (Q11): review TIMEUUID K, num_likes COUNTER
Artifacts (Q12, Q6): artifact TEXT K, avg_rating FLOAT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT, year INT


Sensor Networks and Time Series
• Data description
• Multiple sensor networks are deployed over non-overlapping regions
• A sensor network is identified by a unique number
• A sensor belongs to exactly one network
• A sensor has a unique identifier, location, and characteristics (e.g., accuracy, interface, size)
• A sensor records new measurements (e.g., temperature, humidity, pressure) every second
Slide 86

Conceptual Data Model
• Keys
• has: id
• records and Measurement: id, timestamp, parameter
Slide 87
[Figure: Network (number, region, description, #of_sensors) has Sensor (id, location, characteristics), cardinality 1:n; Sensor records Measurement (timestamp, parameter, value), cardinality 1:n]

Application Workflow and Access Patterns
• Access patterns
• Q1: Find information about all networks
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
• Q3: Find information about all sensors in a specified network
• Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 88
[Figure: application workflow with tasks Networks (Q1), Heatmap (Q2), Sensors (Q3), and Raw data (Q4)]

Logical Data Model
• Access patterns
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
Slide 89
Networks (Q1): number K, description, region, n_sensors
Temperatures_by_network (Q2): network K, date C↓, hour C↓, sensor C↑, avg_temp, location, region S
Sensors_by_network (Q3): network K, sensor C↑, location, characteristics (map)
Measurements_by_sensor (Q4): sensor K, timestamp C↓, parameter C↑, value

Analysis and Optimization
• Table Networks
• Partition size
• Single-row partitions
• Small partitions
• Optimization
• Merge small partitions into a larger partition
• “Partition per query” access pattern
• One small partition
Slide 90
Before: Networks
number K
description
region
n_sensors
After: Networks
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
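The effect of the bucket column on Q1 can be illustrated with a toy model of partitioning. This is only a sketch in plain Python, not Cassandra code: the helper group_by_partition and the sample rows are hypothetical, and the constant bucket value 1 is simply taken from the Q1 query shown on Slide 97.

```python
from collections import defaultdict

def group_by_partition(rows, partition_key):
    """Group rows the way a partition key would: one list of rows per key value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)

# Hypothetical sample data: five networks.
networks = [{"number": n, "region": f"region-{n}"} for n in range(1, 6)]

# Original design: number is the partition key, so each network is its own
# single-row partition and Q1 would have to touch all of them.
by_number = group_by_partition(networks, lambda row: row["number"])

# Optimized design: a constant bucket is the partition key, so all networks
# share one small multi-row partition and Q1 reads exactly one partition.
by_bucket = group_by_partition(networks, lambda row: 1)

print(len(by_number), len(by_bucket))
```

With the sample data, the first grouping yields five partitions and the second yields one, which is the "partition per query" pattern named above.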

• Table Temperatures_by_network
• Partition size
• Multi-row partitions
• Large partitions
• Optimization – split partitions
Slide 91
Original: Temperatures_by_network
network K
date C↓
hour C↓
sensor C↑
avg_temp
location
region S
Split by week: Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date TIMESTAMP C↓
hour INT C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Note: date and hour can be combined into one column
Final: Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
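Once week_first_day is part of the partition key, the application has to derive it for every write, and a Q2 date range that spans several weeks has to be answered with one query per weekly partition. The sketch below shows one way this could be done on the client side. It assumes that weeks start on Monday (the slides do not say which day is used), it uses Python date values in place of CQL TIMESTAMP values for brevity, and the function names are hypothetical.

```python
from datetime import date, timedelta

def week_first_day(d: date) -> date:
    """Return the first day of the week containing d.

    Assumption: weeks start on Monday (date.weekday() returns 0 for Monday).
    """
    return d - timedelta(days=d.weekday())

def week_partitions(start: date, end: date) -> list:
    """List the week_first_day values of all weekly partitions overlapping [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    weeks = []
    current = week_first_day(start)
    while current <= end:
        weeks.append(current)
        current += timedelta(days=7)
    return weeks

# A date range from Wednesday 2019-11-06 to Tuesday 2019-11-19 overlaps three weeks.
for first_day in week_partitions(date(2019, 11, 6), date(2019, 11, 19)):
    print(first_day.isoformat())
```

Each value returned by week_partitions would be bound to week_first_day in a separate execution of the Q2 query, with the date_hour bounds restricting the rows inside the first and last partitions.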

• Table Sensors_by_network
• Partition size
• Small partitions (assuming 1,000 sensors per network)
• Optimization
• None
Slide 92
Logical: Sensors_by_network
network K
sensor C↑
location
characteristics (map)
Physical: Sensors_by_network
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>

• Table Measurements_by_sensor
• Partition size
• Large partitions
• Optimization – split partitions
Slide 93
Original: Measurements_by_sensor
sensor K
timestamp C↓
parameter C↑
value
Split by date: Measurements_by_sensor
sensor TEXT K
date TIMESTAMP K
timestamp TIMESTAMP C↓
parameter TEXT C↑
value FLOAT
Note: because all timestamps in a partition have the same date, we can store the number of seconds elapsed since midnight instead
Final: Measurements_by_sensor
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT
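Replacing the timestamp clustering column with a seconds-since-midnight counter means the application converts in both directions: it splits a timestamp on write and rebuilds it on read. The following is a minimal sketch of that conversion. It assumes naive datetime values that are all in the same time zone (for example UTC), and the function names are hypothetical.

```python
from datetime import datetime, timedelta

def split_timestamp(ts: datetime):
    """Split a timestamp into (date at midnight, whole seconds elapsed since midnight)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight, int((ts - midnight).total_seconds())

def join_timestamp(day: datetime, second: int) -> datetime:
    """Rebuild the original timestamp from the date and second columns."""
    return day + timedelta(seconds=second)

ts = datetime(2019, 11, 6, 13, 45, 10)
day, second = split_timestamp(ts)
print(day.date().isoformat(), second)
```

Because a sensor records one measurement per second, the second column ranges from 0 to 86,399, so a daily partition holds at most 86,400 rows per measured parameter.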

• Duplication
• How many times is region stored per network?
• Once in table Networks
• Once per partition in table Temperatures_by_network
• A static column value is stored only once in a partition
Slide 94
Networks:
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
Temperatures_by_network:
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S

• Duplication
• How many times is location stored per sensor?
• Once in table Sensors_by_network
• 24 × 7 = 168 times in each partition in table Temperatures_by_network
• Duplication across partitions
• Duplication across rows in a partition
Slide 95
Temperatures_by_network:
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Sensors_by_network:
network INT K
sensor TEXT C↑
location TEXT
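The duplication counts follow from simple arithmetic on the weekly partition of Temperatures_by_network, which holds one row per sensor per hour. The sketch below spells this out, using the assumption of 1,000 sensors per network stated on Slide 92. The function names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7

def location_copies_per_partition() -> int:
    """Rows per sensor in one weekly partition; each row repeats the sensor's location."""
    return HOURS_PER_DAY * DAYS_PER_WEEK

def rows_per_weekly_partition(sensors_per_network: int) -> int:
    """Total rows in one (network, week_first_day) partition."""
    return sensors_per_network * location_copies_per_partition()

print(location_copies_per_partition())   # copies of location per sensor per partition
print(rows_per_weekly_partition(1000))   # rows per partition with 1,000 sensors
```

Under these assumptions a partition holds 168,000 rows and repeats each sensor's location 168 times, while the static region column is still stored only once per partition.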

Physical Data Model
• Access patterns
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
Slide 96
Networks (Q1):
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
Temperatures_by_network (Q2):
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Sensors_by_network (Q3):
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>
Measurements_by_sensor (Q4):
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT

Physical Data Model
Slide 97
CREATE TABLE networks (
  bucket INT,
  number INT,
  description TEXT,
  region TEXT,
  n_sensors INT,
  PRIMARY KEY (bucket, number)
);
-- Q1
SELECT *
FROM networks
WHERE bucket = 1;
Networks:
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT

Physical Data Model
Slide 98
CREATE TABLE temperatures_by_network (
  network INT,
  week_first_day TIMESTAMP,
  date_hour TIMESTAMP,
  sensor TEXT,
  avg_temp FLOAT,
  location TEXT,
  region TEXT STATIC,
  PRIMARY KEY ((network, week_first_day), date_hour, sensor)
) WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC);
-- Q2
SELECT * FROM temperatures_by_network
WHERE network = ? AND week_first_day = ?
  AND date_hour >= ? AND date_hour <= ?;
Temperatures_by_network:
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S

Physical Data Model
Slide 99
CREATE TABLE sensors_by_network (
  network INT,
  sensor TEXT,
  location TEXT,
  characteristics MAP<TEXT,TEXT>,
  PRIMARY KEY (network, sensor)
);
-- Q3
SELECT * FROM sensors_by_network
WHERE network = ?;
Sensors_by_network:
network INT K
sensor TEXT C↑
location TEXT

Physical Data Model
Slide 100
CREATE TABLE measurements_by_sensor (
  sensor TEXT,
  date TIMESTAMP,
  second INT,
  parameter TEXT,
  value FLOAT,
  PRIMARY KEY ((sensor, date), second, parameter)
) WITH CLUSTERING ORDER BY (second DESC, parameter ASC);
-- Q4
SELECT * FROM measurements_by_sensor
WHERE sensor = ? AND date = ?;
Measurements_by_sensor:
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT

Agenda
Slide 101

Summary
Slide 102
Conceptual Data Model + Application Workflow → (map) → Logical Data Model → (optimize) → Physical Data Model

Want to Learn More?
Slide 103
academy.datastax.com
kdm.dataview.org
cassandra.apache.org

Thank You
Slide 104
Artem Chebotko, Ph.D.
achebotko@datastax.com
www.linkedin.com/in/artemchebotko
