Using the Chebotko Method to
Design Sound and Scalable Data Models for
Apache Cassandra™
Artem Chebotko, Ph.D.
November, 2019
Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 2
Slide 3
“Manage massive amounts of data,
fast, without losing sleep”
cassandra.apache.org
Cassandra Use Cases
Slide 4
“Cassandra is in use at Constant
Contact, CERN, Comcast, eBay,
GitHub, GoDaddy, Hulu, Instagram,
Intuit, Netflix, Reddit, The Weather
Channel, and over 1500 more
companies that have large, active
data sets.”
“Some of the largest production
deployments include Apple's, with
over 75,000 nodes storing over 10
PB of data, Netflix (2,500 nodes, 420
TB, over 1 trillion requests per day),
Chinese search engine Easou (270
nodes, 300 TB, over 800 million
requests per day), and eBay (over
100 nodes, 250 TB).”
Personalization, Fraud detection, Messaging, Playlists, Internet of Things
• High Availability – Always On
• No single point of failure
• Fault-tolerance via replication
and tunable consistency
• Best-in-class multi-datacenter
support
• No downtime or interruption
due to node maintenance
• Performance and Scalability
• Very fast writes
• Fast reads
• Linear scalability
• Elasticity
Many Reasons to Choose Cassandra
Slide 5
How Cassandra Organizes Data
Slide 6
CREATE KEYSPACE library
WITH replication = {'class':
'NetworkTopologyStrategy',
'DC-West': '3',
'DC-East': '5'};
library
How Cassandra Organizes Data
Slide 7
CREATE KEYSPACE library
WITH replication = {'class':
'NetworkTopologyStrategy',
'DC-West': '3',
'DC-East': '5'};
CREATE TABLE
library.venues_by_year (
year INT,
name TEXT,
country TEXT,
homepage TEXT,
PRIMARY KEY (year,name));
artifacts venues_by_year
library
How Cassandra Organizes Data
Slide 8
CREATE TABLE
library.venues_by_year (
...
PRIMARY KEY (year,name));
artifacts venues_by_year
library
year name country …
2019 Apache Cassandra Summit USA …
2019 Data Modeling Zone USA …
2019 DataStax Accelerate USA …
year name country …
2015 A … … …
2015 B … … …
2015 C … … …
How Cassandra Organizes Data
Slide 9
CREATE TABLE
library.venues_by_year (
...
PRIMARY KEY (year,name));
artifacts venues_by_year
library
DC-West DC-East
RF=3 RF=5
Cassandra Query Language (CQL)
• Data Definition
• CREATE KEYSPACE, CREATE TABLE
• CREATE INDEX, CREATE CUSTOM INDEX
• CREATE MATERIALIZED VIEW
• Data Manipulation
• SELECT
• INSERT, UPDATE, DELETE
Slide 12
CQL CREATE TABLE
Slide 13
CREATE TABLE name (
  column type STATIC,
  column type STATIC,
  ...,
  PRIMARY KEY ( (column, ...), column, ... )
) WITH CLUSTERING ORDER BY (clustering_key_column (ASC|DESC), ...);
• name: table name
• column type STATIC: column names, types, optional STATIC designation
• (column, ...): partition key
• column, ...: optional clustering key
• WITH CLUSTERING ORDER BY: row ordering in a partition
Table with Single-Row Partitions
Slide 14
CREATE TABLE users (
id UUID,
name TEXT,
email TEXT,
PRIMARY KEY (id)
);
id email name
a7e78478-0a54-4949-90f3-14ec4cbea40c jbellis@datastax.com Jonathan
67657da3-4443-46ab-b60a-510a658fc7bb achebotko@datastax.com Artem
3b1f62b1-386b-46e3-b55d-00f1abbafb2b patrick@datastax.com Patrick
Users
id K UUID
name TEXT
email TEXT
Table with Multi-Row Partitions
Slide 15
CREATE TABLE artifacts_by_venue (
venue TEXT, year INT,
artifact TEXT,
title TEXT,
country TEXT STATIC,
PRIMARY KEY ((venue, year), artifact)
);
venue                year  artifact  title                 country
DataStax Accelerate  2019  A…        Linear Scalability …  USA
                           …         …
                           Z…        Building Cloud …
Data Modeling Zone   2019  A…        New approach to …     USA
                           …         …
Artifacts_by_venue
venue K TEXT
year K INT
artifact C↑ TEXT
title TEXT
country S TEXT
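A minimal sketch of how the STATIC column in artifacts_by_venue behaves, assuming the table definition from this slide. The artifact ids and titles below are hypothetical placeholders, not data from the deck. A static column holds one value per partition, so writing it once makes it visible on every row of that (venue, year) partition.

```cql
-- Hypothetical sample rows; only the schema comes from the slide.
INSERT INTO artifacts_by_venue (venue, year, artifact, title, country)
VALUES ('DataStax Accelerate', 2019, 'A1', 'Talk A', 'USA');

-- The second row does not set country at all.
INSERT INTO artifacts_by_venue (venue, year, artifact, title)
VALUES ('DataStax Accelerate', 2019, 'Z1', 'Talk Z');

-- Both rows of the partition report the same static country value.
SELECT artifact, title, country
FROM artifacts_by_venue
WHERE venue = 'DataStax Accelerate' AND year = 2019;
```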
CQL SELECT
Slide 16
SELECT selectors
FROM table_name
WHERE primary_key_conditions
AND index_conditions
GROUP BY primary_key_columns
ORDER BY clustering_key_columns ( ASC | DESC )
LIMIT N
ALLOW FILTERING ;
• selectors: columns, aggregates, functions
• FROM: one table per query
• WHERE: restricted to primary key columns
• ALLOW FILTERING: danger
Sample CQL Queries
Slide 17
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ?;

SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND artifact = ?;

SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND
artifact > ? AND artifact < ?
ORDER BY artifact DESC;

Artifacts_by_venue
venue K TEXT
year K INT
artifact C↑ TEXT
title TEXT
country S TEXT
Invalid CQL Queries
Slide 18
SELECT * FROM artifacts_by_venue WHERE venue = ?;
SELECT * FROM artifacts_by_venue WHERE venue = ? AND artifact = ?;
SELECT * FROM artifacts_by_venue WHERE artifact > ? AND artifact < ?;
SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND title = ?;
SELECT * FROM artifacts_by_venue WHERE country = ?;

Artifacts_by_venue
venue K TEXT
year K INT
artifact C↑ TEXT
title TEXT
country S TEXT
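The annotated sketch below suggests why the queries on this slide are rejected, assuming the artifacts_by_venue table with PRIMARY KEY ((venue, year), artifact). The rejected statements are left as comments; the `?` markers are bind placeholders as used in the slides. Whether the last statement is accepted depends on the Cassandra version supporting filtering on regular columns, which is an assumption here.

```cql
-- Rejected: only part of the composite partition key (venue, year) is restricted.
--   SELECT * FROM artifacts_by_venue WHERE venue = ?;
--   SELECT * FROM artifacts_by_venue WHERE venue = ? AND artifact = ?;

-- Rejected: no partition key restriction at all.
--   SELECT * FROM artifacts_by_venue WHERE artifact > ? AND artifact < ?;
--   SELECT * FROM artifacts_by_venue WHERE country = ?;

-- Rejected: title is a regular column that is neither a key nor indexed.
--   SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND title = ?;

-- Accepted: the full partition key is restricted by equality.
SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ?;

-- Typically accepted on recent versions, but the partition is scanned and
-- filtered, which is why the deck marks ALLOW FILTERING as "danger".
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND title = ?
ALLOW FILTERING;
```

The query-first alternative is a separate table whose primary key matches the predicate, which is what the later slides develop.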
Important Implications for Data Modeling
• Data
• Primary keys define data uniqueness
• Partition keys define data distribution
• Partition keys affect partition sizes
• Clustering keys define row ordering
• Query
• Primary keys define how data is retrieved
• Partition keys allow equality predicates
• Clustering keys allow inequality predicates and ordering
• Only one table per query, no joins
Slide 19
Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 20
Data Modeling
• Collection and analysis of data requirements
• Identification of participating entities and relationships
• Identification of data access patterns
• A particular way of organizing and structuring data
• Design and specification of a database schema
• Schema optimization and data indexing techniques
Slide 21
Data quality: completeness, consistency, accuracy
Data access: queryability, efficiency, scalability
Key Data Modeling Steps
• Understand the data
• Identify access patterns
• Apply a query-first approach
• Optimize and implement
Slide 22
[Diagram: Conceptual Data Model and Application Workflow lead to the Logical Data Model, which leads to the Physical Data Model]
The Data Modeling Framework
Slide 23
[Diagram: Conceptual Data Model and Application Workflow → (map) → Logical Data Model → (optimize) → Physical Data Model]
Defines models and transitions
Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 24
Conceptual Data Modeling
• Conceptual data model
• High-level view of data – entity and relationship types
• Technology-independent or technology-agnostic
• Not specific to Cassandra or any other database system
• Purpose: understanding data
• The scope of what needs to be accomplished
• Essential components, concepts, entities, relationships
• Keys and cardinality constraints are absolutely essential
Slide 25
Advantages of Conceptual Data Modeling
• Complex data modeling problems are more manageable
• Less chance of producing an incorrect or incomplete model
• Saves time in the long run
• Improves data, business process, and risk management
• Readable by both technical and non-technical people
• Good for data governance
• Understanding of a data modeling problem is documented
• Can be shared, reviewed and agreed on
• Improves understanding and eliminates ambiguity
Slide 26
Conceptual Data Modeling Techniques
• Entity-relationship modeling
• Entities and relationships that can exist among them
• Unified Modeling Language (UML) class diagrams
• Classes and associations that can exist among them
• Object-role modeling or fact-oriented modeling
• Entities and facts that define relationships among them
• Dimensional modeling
• Dimensions and facts that relate them
• Ontological modeling
• Concepts and relations that can exist among them
Slide 27
ER Model
• Chen’s notation
• Simple and original
• Truly technology-independent
• Other notations may be influenced by relational databases
• Crow’s foot, Barker’s notation, Information engineering, IDEF1X
• Many hybrids exist
Slide 28
ER Model Basics
• Entity – object that is involved in an information system
• Example: Jonathan Ellis (an author), “The State of Cassandra, 2014” (a presentation)
• Entity type – set of similar objects
• Example: Author, Presentation
• Relationship – relates two or more entities
• Example: Jonathan Ellis creates “The State of Cassandra, 2014”
• Relationship type – set of similar relationships
• Example: Author creates Presentation
Slide 29
Entity Type Example
• Name – usually a noun
• Attributes – atomic, set-valued, composite, derived
• Key – minimum set of attributes that uniquely identify an entity
Slide 30
[Diagram: entity type User with attributes id, name (first_name, last_name), DOB, age, emails]
Relationship Type Example
• Name – usually a verb
• Attributes – can be atomic, set-valued, composite, derived
• Roles – each role names a related entity
• Key – minimum set of roles and attributes that uniquely identify a relationship
• Cardinality constraints – how many times an entity can participate in a relationship
Slide 31
[Diagram: relationship type features (1:n) between Venue (name, year, country, homepage) and Artifact (id, title, authors, keywords); an attribute date also appears in the diagram]
How is a Key of a Relationship Derived?
• 1:1 relationship (Author has Bio): the key of either entity type identifies the relationship; key: fname, lname
• 1:n relationship (Venue features Artifact): the key of the entity type on the "n" side identifies the relationship; key: id
• m:n relationship (Author creates Artifact): the keys of both entity types are needed; key: fname, lname, id
Slide 32
Entity Type Hierarchy Example
• Attribute inheritance: id, title, authors, and keywords are inherited by Article and
Presentation
• Disjoint – cannot have an entity that is both Article and Presentation
• Covering – cannot have a Digital Artifact to be anything but Article or Presentation
Slide 33
[Diagram: Digital Artifact (id, title, authors, keywords) with an IsA hierarchy, disjoint and covering, to Article and Presentation; subtype attributes url, abstract, doi]
Conceptual Data Model for Digital Library
Slide 34
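[Diagram: ER model with entity types User (id, name, email), Digital Artifact (id, title, authors, keywords) with disjoint, covering IsA subtypes Article and Presentation, Venue (name, year, country, homepage), and Review (id, title, body, rating); relationship types: Venue features Digital Artifact (1:n), User posts Review (with timestamp), Review features Digital Artifact, User likes Digital Artifact, User likes Review]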
Application Workflow Model
• High-level application design
• Tasks, causal dependencies, access patterns
• Technology-independent or technology-agnostic
• Not specific to Cassandra or any other database system
• Purpose: understand data access patterns
• Each application has a workflow
• Data-driven tasks access a database
• A sequence of tasks defines a sequence of data access patterns
Slide 35
Application Workflow for Digital Library
Slide 36
Tasks and causal dependencies
• Search for artifacts by a venue, author, title, or keyword
• Display information for a venue
• Display a rating of an artifact
• Display reviews for an artifact
• Display likes for an artifact
• Find information for an artifact with a given id
• Show information about a user
• Show likes for a review
• Show reviews by a user
Application Workflow for Digital Library
Slide 37
ACCESS PATTERNS
Q1: Find artifacts for a specified venue ...
Q2: Find artifacts for a specified author ...
Q3: Find artifacts with a specified title ...
Q4: Find artifacts with a specified keyword ...
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact ...
...
Data access patterns
• Search for artifacts by a venue, author, title, or keyword: Q1, Q2, Q3, Q4
• Display information for a venue: Q5
• Display a rating of an artifact: Q6
• Display reviews for an artifact: Q7
• Display likes for an artifact: Q8
• Show reviews by a user: Q9
• Show information about a user: Q10
• Show likes for a review: Q11
• Find information for an artifact with a given id: Q12
Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 38
Logical Data Modeling
• Logical data model
• Sketch data model for Cassandra
• Chebotko Diagrams for visualization
• Purpose: sound, query-driven design
• Data organization into tables according to the queries
• Correctness of primary key design
• Denormalization, nesting, duplication
Slide 39
Chebotko Diagrams
• Visual representation of a logical data model
• Tables are represented by rectangles and
have names and columns
• Columns may optionally be designated as K
(partition key column), C (clustering key
column), S (static column), and IDX (indexed
column)
• Access patterns and their ordering are
represented by query-labeled connections
Slide 40
[Diagram: partial Chebotko diagram with tables Venues, Artifacts_by_venue, Artifacts_by_author, Artifacts_by_title, Artifacts_by_keyword, Artifacts, Ratings_by_artifact, Reviews_by_artifact, and Likes_by_artifact, connected by query labels Q1 to Q8]
Conceptual-to-Logical Mapping
Slide 41
[Diagram: Conceptual Data Model and Application Workflow → (map) → Logical Data Model]
Mapping rules
Mapping patterns
Mapping Rules
• Mapping rule 1: “Entities and relationships”
• Entity and relationship types map to tables
• Mapping rule 2: “Key attributes”
• Key attributes map to primary key columns
• Mapping rule 3: “Equality search attributes”
• Equality search attributes map to partition key columns
• Mapping rule 4: “Inequality search attributes”
• Inequality search attributes map to clustering columns
• Mapping rule 5: “Ordering attributes”
• Ordering attributes map to clustering columns
Slide 42
Conceptual
Data Model
Application
Workflow
Example
• Conceptual data model
• Query
• Find artifacts that appeared in a particular venue after a specified year;
order results by year (desc) and title (asc)
• Query predicate (equality and inequality): name = ? AND year > ?
• Ordering attributes: year (DESC), title (ASC)
Slide 43
[Diagram: Venue (name, year, country, homepage) features (1:n) Digital Artifact (id, title, keywords, authors)]
Applying Mapping Rules 1 and 2
• Mapping rule 1: “Entities and relationships”
• The query only concerns a part of the diagram
• Data should be organized based on the relationship type
• Mapping rule 2: “Key attributes”
• Key of the relationship is id
• id must map to a primary key column (say column artifact)
Slide 44
[Diagram: Venue (name, year, country, homepage) features (1:n) Digital Artifact (id, title, keywords, authors)]
Artifacts_by_venue
...
artifact C↑
...
Applying Mapping Rule 3
• Mapping rule 3: “Equality search attributes”
• Equality search attribute name=? maps to
the 1st column of the primary key
• It must be a partition key column (say column venue)
Slide 45
Artifacts_by_venue
venue K
...
artifact C↑
...
Applying Mapping Rule 4
• Mapping rule 4: “Inequality search attributes”
• Inequality search attribute year>? maps to
a clustering column
Slide 46
Artifacts_by_venue
venue K
year C↑
...
artifact C↑
...
Applying Mapping Rule 5
• Mapping rule 5: “Ordering attributes”
• Ordering attributes year (DESC) and title (ASC) map to
clustering columns
• year is already part of the schema but
its order should be reversed to DESC
• title is added next
Slide 47
Artifacts_by_venue
venue K
year C↓
...
artifact C↑
...
Artifacts_by_venue
venue K
year C↓
title C↑
artifact C↑
...
Final Result
How did we get column type?
Slide 48
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year > ?
ORDER BY year DESC, title ASC
Artifacts_by_venue
venue K
year C↓
title C↑
artifact C↑
type
authors (list)
keywords (set)
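[Diagram: Venue (name, year, country, homepage) features (1:n) Digital Artifact (id, title, keywords, authors)]
[Diagram: Digital Artifact IsA hierarchy with disjoint, covering subtypes Article and Presentation; the type column records which subtype a row represents]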
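The final Chebotko table above could be written in CQL roughly as follows. This is a sketch, not the author's schema: the slide gives only column names and key roles, so the data types are assumptions (chosen to match types used later in the deck).

```cql
-- Assumed types; key roles follow the diagram: venue K, year C↓, title C↑, artifact C↑.
CREATE TABLE artifacts_by_venue (
  venue    TEXT,
  year     INT,
  title    TEXT,
  artifact TEXT,
  type     TEXT,
  authors  LIST<TEXT>,
  keywords SET<TEXT>,
  PRIMARY KEY ((venue), year, title, artifact)
) WITH CLUSTERING ORDER BY (year DESC, title ASC, artifact ASC);

-- Rows within a partition are returned in clustering order,
-- so the result is already sorted by year DESC, then title ASC.
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year > ?;
```

Because the clustering order is declared in the table, the ORDER BY clause from the slide becomes optional for this query.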
Mapping Patterns
• Semi-formal definitions of common mapping use cases
• Graphical rather than mathematical representation
• Use clustering columns as the data nesting mechanism
• Do not take ordering of results into consideration
• Guide schema design
• Ensure correctness and efficiency
• Enable automation
Slide 49
Common Mapping Patterns
• Entity patterns
• 1:1 relationship patterns
• 1:n relationship patterns
• m:n relationship patterns
• Hierarchical patterns
Slide 50
1:n relationship mapping pattern 3.1
• Search attributes = key attributes
Slide 51
[Diagram: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) RT (attr) 1:n ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3)]
ACCESS PATTERN
search attributes: key1.1 (=), key1.2 (>)
ET2_by_ET1_key (both search attributes in the partition key, for equality on key1.2)
key1.1 K
key1.2 K
key2.1 C↑
key2.2 C↑
attr1.1 S
attr1.2 S
attr1.3 (collection) S
attr2.1
attr2.2
attr2.3 (collection)
attr
ET2_by_ET1_key (key1.2 as a clustering column, for inequality on key1.2)
key1.1 K
key1.2 C↑
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr2.3 (collection)
attr
PRIMARY KEY: All search attributes, followed by all key attributes of RT
STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key
What if we add green attributes to the above table?
1:n relationship mapping pattern 3.1 (example)
• Search attributes = key attributes
Slide 52
[Diagram: Venue (name, year, country, homepage) features (1:n) Artifact (id, title, authors, keywords)]
ACCESS PATTERN
search attributes: name (=), year (>)
Artifacts_by_venue (year in the partition key, for equality on year)
venue (=name) K
year K
artifact (= id) C↑
country S
homepage S
title
authors (list)
keywords (set)
Artifacts_by_venue (year as a clustering column, for inequality on year)
venue (=name) K
year C↓
artifact (= id) C↑
title
authors (list)
keywords (set)
PRIMARY KEY: All search attributes, followed by all key attributes of features
STATIC COLUMNS: Non-key attributes of Venue, iff all key attributes of Venue are part of the partition key
What about country and homepage?
1:n relationship mapping pattern 3.2
• Search attributes ≠ key attributes
Slide 53
[Diagram: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) RT (attr) 1:n ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3)]
ACCESS PATTERN
search attributes: attr1.1 (=), attr1.2 (>)
ET2_by_ET1_non-key (both search attributes in the partition key, for equality on attr1.2)
attr1.1 K
attr1.2 K
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr2.3 (collection)
attr
ET2_by_ET1_non-key (attr1.2 as a clustering column, for inequality on attr1.2)
attr1.1 K
attr1.2 C↑
key2.1 C↑
key2.2 C↑
attr2.1
attr2.2
attr2.3 (collection)
attr
PRIMARY KEY: All search attributes, followed by all key attributes of RT
All ET1's attributes can be added at the cost of duplicating them for every entity of type 2
1:n relationship mapping pattern 3.2 (example)
• Search attributes ≠ key attributes
Slide 54
[Diagram: Venue (name, year, country, homepage) features (1:n) Artifact (id, title, authors, keywords)]
ACCESS PATTERN
search attributes: country (=), year (>)
Artifacts_by_country (year in the partition key, for equality on year)
country K
year K
artifact (= id) C↑
title
authors (list)
keywords (set)
Artifacts_by_country (year as a clustering column, for inequality on year)
country K
year C↓
artifact (= id) C↑
title
authors (list)
keywords (set)
PRIMARY KEY: All search attributes, followed by all key attributes of features
name and homepage can be added at the cost of duplicating them for every artifact
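A possible CQL rendering of the inequality variant of Artifacts_by_country is sketched below. Data types are assumptions, and the venue and homepage columns illustrate the slide's remark that Venue attributes can be added only as regular columns that are repeated for every artifact row.

```cql
-- Assumed types. country is not the key of Venue, so venue attributes
-- cannot be STATIC here; they are duplicated on each artifact row.
CREATE TABLE artifacts_by_country (
  country  TEXT,
  year     INT,
  artifact TEXT,
  title    TEXT,
  authors  LIST<TEXT>,
  keywords SET<TEXT>,
  venue    TEXT,   -- duplicated per artifact
  homepage TEXT,   -- duplicated per artifact
  PRIMARY KEY ((country), year, artifact)
) WITH CLUSTERING ORDER BY (year DESC, artifact ASC);

SELECT *
FROM artifacts_by_country
WHERE country = ? AND year > ?;
```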
Logical Data Model for Digital Library
Slide 55
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of ‘likes’ for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of ‘likes’ for a specified review.
Q12: Find information for a specified artifact.
...
Venues
name K
year K
country IDX
homepage
Q5
Artifacts_by_venue
venue K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
Artifacts_by_author
author K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Artifacts_by_title
title K
year C↓
artifact C↑
type
authors (list)
keywords (set)
venue
Artifacts_by_keyword
keyword K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Users
id K
name
email
Ratings_by_artifact
artifact K
num_ratings (counter)
sum_ratings (counter)
Reviews_by_user
user K
review (timeuuid) C↓
rating
title
body
artifact_id
artifact_title
artifact_authors (list)
user_name S
user_email S
Reviews_by_artifact
artifact K
review (timeuuid) C↓
rating IDX
title
body
user
Likes_by_artifact
artifact K
num_likes (counter)
Likes_by_review
review K
num_likes (counter)
Q1 Q2 Q3 Q4
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year
Q8Q6 Q7
Q11
Q10
Q9
Q12
Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 56
Physical Data Modeling
• Physical data model
• Complete data model for Cassandra
• Chebotko Diagrams for visualization
• CQL for a database schema
• Purpose: efficient, implementation-ready design
• Data model efficiency analysis and validation
• Schema design optimizations
• Techniques for concurrent data access
Slide 57
Logical Data Model – Correctness and Efficiency
• But …
• Database engine has limitations
• Resources are finite
• Some operations may require special considerations
Slide 58
Physical Data Model – More Efficiency
• Physical data model takes into account …
• Partition sizes
• Data duplication factors
• Data types, indexes, materialized views
• Concurrent data access requirements
Slide 59
Partition Size Limits
• Theoretical limits
• 2 billion values
• Node disk size
• Practical limits for Cassandra
Slide 60
                      Cassandra 2   Cassandra 3
Values per partition  100K          Up to 10x
Partition size        100MB         Up to 10x
Estimating a Partition Size (Cassandra 3)
Slide 61
Nv – number of values in a partition
Ncv – number of clustering column values in a partition
Nrv – number of regular column values in a partition
Nsv – number of static column values in a table definition
Nr – number of rows in a partition
Ncc – number of clustering columns in a table definition
Nrc – number of regular columns in a table definition
Nsc – number of static columns in a table definition
Sp – size of a partition in bytes
sizeOf – size (in bytes) of a CQL data type
ck – partition key column in a table definition
cs – static column in a table definition
cr – regular column in a table definition
cc – clustering column in a table definition
Nr – number of rows in a partition
sizeOf(tavg) – average size (in bytes) of a timestamp delta
associated with a value
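The formulas themselves were images on the slide and do not appear in the text. A reconstruction that agrees with the symbol definitions above and with the worked example on the following two slides (3002 values, 96064 bytes) might read as follows. Treat it as an assumption rather than the author's exact notation; in particular, the timestamp term is written over regular and static values only because that is what reproduces the 2002 x 8 term in the example.

```latex
% Number of values in a partition
N_v = N_{cv} + N_{rv} + N_{sv}
    = N_r \cdot N_{cc} + N_r \cdot N_{rc} + N_{sc}

% Size of a partition in bytes
S_p = \sum_i \mathrm{sizeOf}(c_{k_i})
    + \sum_j \mathrm{sizeOf}(c_{s_j})
    + N_r \cdot \Bigl( \sum_k \mathrm{sizeOf}(c_{r_k}) + \sum_l \mathrm{sizeOf}(c_{c_l}) \Bigr)
    + (N_{rv} + N_{sv}) \cdot \mathrm{sizeOf}(t_{avg})

% Check against the Reviews_by_user example (N_r = 1000):
% N_v = 1000 \cdot 1 + 1000 \cdot 2 + 2 = 3002
% S_p = 16 + (12 + 20) + 1000 \cdot (4 + 60 + 16) + 2002 \cdot 8 = 96064
```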
Simplified Example: Values
• User with 1000 reviews = 1000 rows in a partition
Slide 62
Reviews_by_user   type       values
user K            UUID
review C↓         TIMEUUID   1000
rating            FLOAT      1000
title             TEXT       1000
user_name S       TEXT       1
user_email S      TEXT       1
                             ------
                             3002 values
Simplified Example: Bytes
• User with 1000 reviews = 1000 rows in a partition
Slide 63
Reviews_by_user   type       bytes
user K            UUID       16
review C↓         TIMEUUID   1000x16
rating            FLOAT      1000x4
title             TEXT       1000x60
user_name S       TEXT       12
user_email S      TEXT       20
                             ---------
column data                  80048
timestamp deltas             2002x8
                             ---------
total                        96064 bytes
Getting a Partition Size Empirically
Slide 64
$ nodetool flush library reviews_by_user
$ nodetool tablestats -H library.reviews_by_user
Keyspace : library
...
Table: reviews_by_user
SSTable count: 1
...
Number of partitions (estimate): 1
...
Compacted partition minimum bytes: 88149
Compacted partition maximum bytes: 105778
Compacted partition mean bytes: 105778
...
Splitting Large Partitions
• Solution: introduce an additional column to a partition key
• Use an existing column – convenience
• Use an artificial “bucket” column – more control
• Cons: supported access patterns may change
• Example:
• Millions of artifacts with the same keyword
across different venues and years
Slide 65
Artifacts_by_keyword
keyword K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Using an Existing Column
Slide 66
Before: Artifacts_by_keyword
keyword K TEXT
year C↓ INT
artifact C↑ TEXT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
After: Artifacts_by_keyword
keyword K TEXT
year K INT
artifact C↑ TEXT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Using an Artificial Column
Slide 67
Before: Artifacts_by_keyword
keyword K TEXT
year K INT
artifact C↑ TEXT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
After: Artifacts_by_keyword
keyword K TEXT
year K INT
bucket K INT
artifact C↑ TEXT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
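As a sketch of the artificial bucket column in CQL: the table below follows the diagram with bucket in the partition key. How a bucket number is assigned is not stated in the deck; a fixed number of buckets (four here) chosen by the application, for example from a hash of the artifact id, is an assumption for illustration.

```cql
CREATE TABLE artifacts_by_keyword (
  keyword  TEXT,
  year     INT,
  bucket   INT,          -- assumed to be computed by the application, 0..3 here
  artifact TEXT,
  type     TEXT,
  title    TEXT,
  authors  LIST<TEXT>,
  keywords SET<TEXT>,
  venue    TEXT,
  PRIMARY KEY ((keyword, year, bucket), artifact)
);

-- Reading all artifacts for a keyword and year now touches every bucket:
-- either one query per bucket, or an IN restriction on the bucket column.
SELECT *
FROM artifacts_by_keyword
WHERE keyword = ? AND year = ? AND bucket IN (0, 1, 2, 3);
```

This illustrates the "supported access patterns may change" caveat: the reader must now know the bucket range, and one logical partition is spread over several physical ones.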
Data Duplication Considerations
• Data duplication is necessary
• Data duplication vs. data replication
• Data duplication factor
• Data duplication and data consistency
Slide 68
Duplication Across Tables
• Each artifact is stored once in Artifacts
• Each artifact is stored once in Artifacts_by_venue
• Duplication factor = 2
Slide 69
Artifacts_by_venue
venue K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year
Duplication Across Partitions
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5
Slide 70
Artifacts_by_author
author K
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Artifacts_by_author
author K
keyword C↑
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Duplication Across Rows
• An artifact with 5 keywords is stored in 5 rows of the same partition
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5 x 5 = 25
Slide 71
Artifacts_by_author
author K
keyword C↑
year C↓
artifact C↑
type
title
authors (list)
keywords (set)
venue
Beware of Non-constant Duplication Factors
• Users can add new keywords to artifacts
• There is no limit on the number of keywords per artifact
• An artifact with n keywords is stored in n rows of the same partition
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5 x n
• Do things differently
• Place reasonable limits that can be gradually increased
Slide 72
Keeping Up with Data Duplication and Data Consistency
• Insert a new artifact
BEGIN BATCH
INSERT INTO artifacts ...
INSERT INTO artifacts_by_title ...
APPLY BATCH;
• Update an existing artifact title
BEGIN BATCH
UPDATE artifacts ...
INSERT INTO artifacts_by_title ...
DELETE FROM artifacts_by_title ...
APPLY BATCH;
Slide 73
Artifacts_by_title
title K
year C↓
artifact C↑
type
authors (list)
keywords (set)
venue
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year
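To make the title-update case concrete, here is a sketch of such a batch, assuming the Artifacts and Artifacts_by_title tables as drawn above with PRIMARY KEY (artifact) and PRIMARY KEY ((title), year, artifact) respectively. All values are hypothetical. Because title is the partition key of artifacts_by_title, it cannot be updated in place: the row is written under the new title and deleted under the old one.

```cql
BEGIN BATCH
  -- Base table: title is a regular column, so a plain UPDATE is enough.
  UPDATE artifacts
  SET title = 'New Title'
  WHERE artifact = 'conf/example/Doe19';

  -- Query table: write the row under the new partition key...
  INSERT INTO artifacts_by_title (title, year, artifact, type, authors, keywords, venue)
  VALUES ('New Title', 2019, 'conf/example/Doe19', 'presentation',
          ['Jane Doe'], {'cassandra'}, 'Example Conference');

  -- ...and remove the row stored under the old title.
  DELETE FROM artifacts_by_title
  WHERE title = 'Old Title' AND year = 2019 AND artifact = 'conf/example/Doe19';
APPLY BATCH;
```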
Keeping Up with Data Duplication and Data Consistency
How About Materialized Views?
Slide 74
Base table
CREATE TABLE artifacts (
artifact TEXT,
type TEXT,
title TEXT,
authors LIST<TEXT>,
keywords SET<TEXT>,
venue TEXT,
year INT,
PRIMARY KEY (artifact)
);
Materialized view
CREATE MATERIALIZED VIEW
artifacts_by_title AS
SELECT title, year,
artifact, type,
authors, keywords,
venue
FROM artifacts
WHERE title IS NOT NULL
AND artifact IS NOT NULL
PRIMARY KEY (title,
artifact);
Artifacts_by_title
title K
year C↓
artifact C↑
type
authors (list)
keywords (set)
venue
Artifacts
artifact K
type
title
authors (list)
keywords (set)
venue
year
Selecting Column Data Types
Slide 75
ASCII
BIGINT
BLOB
BOOLEAN
COUNTER
DATE
DECIMAL
DOUBLE
FLOAT
INET
INT
LIST
MAP
SET
SMALLINT
TEXT
TIME
TIMESTAMP
TIMEUUID
TINYINT
TUPLE
UUID
VARCHAR
VARINT
CREATE TYPE library.ADDRESS (
street TEXT,
city TEXT,
state TEXT,
postal_code TEXT
);
CREATE TABLE library.users (
id UUID PRIMARY KEY,
name TEXT,
other_names SET<TEXT>,
phones MAP<TEXT,TEXT>,
current_address ADDRESS,
past_addresses LIST<FROZEN<ADDRESS>>
);
Indexing Options
• Local indexes
• Secondary indexes
• SSTable-attached secondary indexes (SASI)
• Distributed indexes
• Materialized views
Slide 76
When to Use a Secondary Index?
• Queries on low-cardinality columns
• Mostly analytical queries with larger result sets
• Generally, these are expensive queries
• Queries that involve both partition key and indexed column
• Searching within a large partition
• Efficient queries
Slide 77
Venues
name K TEXT
year K INT
country IDX TEXT
homepage TEXT
Reviews_by_artifact
artifact K TEXT
review C↓ TIMEUUID
rating IDX INT
title TEXT
body TEXT
user UUID
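The two IDX columns above correspond to the two use cases on this slide. A sketch in CQL, assuming the tables exist as drawn; the index names are hypothetical.

```cql
-- Low-cardinality column, no partition key in the query:
-- the request fans out across nodes, so it is comparatively expensive.
CREATE INDEX IF NOT EXISTS venues_country_idx ON venues (country);

SELECT * FROM venues WHERE country = ?;

-- Partition key plus indexed column (Q7 with a specified rating):
-- the search stays within one partition and is efficient.
CREATE INDEX IF NOT EXISTS reviews_by_artifact_rating_idx ON reviews_by_artifact (rating);

SELECT * FROM reviews_by_artifact WHERE artifact = ? AND rating = ?;
```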
When to Use a Materialized View?
• Queries on higher-cardinality columns
• Advantages similar to those of regular tables
• Convenience of automatic view maintenance
• Reads are as fast as for regular tables
• Important limitations
• Restrictions on how PRIMARY KEY is constructed
• Slower writes to the base table
• Base-view inconsistencies
Slide 78
Concurrent Data Access and Data Consistency
• Implementing a voting system for artifacts
• Two users submit their votes concurrently for the same artifact
User 1: read(votes:10) write(votes:11)
User 2: read(votes:10) write(votes:11)
Result over time: votes = 11, which is incorrect
Votes_by_artifact
artifact K TEXT
votes INT
Slide 79
Lightweight Transactions and Concurrent Data Access
• LWTs guarantee correctness
• Expensive – four coordinator-replica round trips
• Failed LWTs must be repeated – can become a bottleneck
UPDATE votes_by_artifact
SET votes = 11
WHERE artifact = 'conf/cassandra/Ellis11(1)'
IF votes = 10;
User 1: read(votes:10) LWT-write(votes:11)
User 2: read(votes:10) LWT-write(votes:11)
Slide 80
COUNTERs and Concurrent Data Access
• COUNTERs may give slightly inaccurate results
• Expensive – mutexed read-before-write
• Limited to integral columns and operations of addition or subtraction
UPDATE votes_by_artifact
SET votes = votes + 1
WHERE artifact = 'conf/cassandra/Ellis11(1)';
Votes_by_artifact
artifact K TEXT
votes COUNTER
Slide 81
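The same counter technique underlies the Ratings_by_artifact table from the logical model (num_ratings and sum_ratings). A sketch follows, with the artifact type assumed to be TEXT and a hypothetical rating value of 4; the average for Q6 is then computed by the application as sum_ratings / num_ratings.

```cql
-- A counter table may contain only primary key columns and counters.
CREATE TABLE ratings_by_artifact (
  artifact    TEXT PRIMARY KEY,
  num_ratings COUNTER,
  sum_ratings COUNTER
);

-- Record one new rating of 4: counters are only ever incremented or decremented.
UPDATE ratings_by_artifact
SET num_ratings = num_ratings + 1,
    sum_ratings = sum_ratings + 4
WHERE artifact = 'conf/cassandra/Ellis11(1)';

SELECT num_ratings, sum_ratings
FROM ratings_by_artifact
WHERE artifact = 'conf/cassandra/Ellis11(1)';
```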
Eliminating the Need for Concurrent Data Access
• Isolating computation by isolating data
• Each vote is stored separately – writes are fast
• Data aggregation could be a bit more expensive
Slide 82
Votes_by_artifact
artifact K TEXT
user C↑ UUID
or
Votes_by_artifact
artifact K TEXT
user C↑ UUID
vote INT
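A sketch of the second variant in CQL, assuming vote holds +1 or -1 (the deck does not say what values it takes). Each user writes only their own row, so no read-before-write is needed, and the tally is computed at read time.

```cql
CREATE TABLE votes_by_artifact (
  artifact TEXT,
  user     UUID,
  vote     INT,          -- assumed: +1 or -1
  PRIMARY KEY ((artifact), user)
);

-- Concurrent voters write different rows; a repeated vote by the same user
-- simply overwrites that user's row.
INSERT INTO votes_by_artifact (artifact, user, vote)
VALUES ('conf/cassandra/Ellis11(1)', 67657da3-4443-46ab-b60a-510a658fc7bb, 1);

-- Aggregation happens on read, within a single partition.
SELECT COUNT(*) AS num_votes, SUM(vote) AS score
FROM votes_by_artifact
WHERE artifact = 'conf/cassandra/Ellis11(1)';
```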
Adding Additional Columns To a Table
• Avoiding querying multiple tables or partitions
• Storing aggregate values for faster access
©2014 DataStax Training. Use only with permission. Slide 83
Artifacts
artifact K
type
title
authors
keywords
venue
year
LIST<TEXT>
SET<TEXT>
INT
TEXT
TEXT
TEXT
TEXT
Ratings_by_artifact
artifact K
num_ratings
sum_ratings
TEXT
COUNTER
COUNTER
Artifacts
artifact K
avg_rating
type
title
authors
keywords
venue
year
LIST<TEXT>
SET<TEXT>
FLOAT
INT
TEXT
TEXT
TEXT
TEXT
Physical Data Model for Digital Library
Slide 84
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of ‘likes’ for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of ‘likes’ for a specified review.
Q12: Find information for a specified artifact.
...
Venues (Q5)
name TEXT K
year INT K
country TEXT IDX
homepage TEXT
Artifacts_by_venue (Q1, Q6)
venue TEXT K
year INT C↓
artifact TEXT C↑
avg_rating FLOAT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
Artifacts_by_author (Q2, Q6)
author TEXT K
year INT C↓
artifact TEXT C↑
avg_rating FLOAT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Artifacts_by_title (Q3, Q6)
title TEXT K
year INT C↓
artifact TEXT C↑
avg_rating FLOAT
type TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Artifacts_by_keyword (Q4, Q6)
keyword TEXT K
year INT K
artifact TEXT C↑
avg_rating FLOAT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
Users (Q10)
id UUID K
name TEXT
email TEXT
Ratings_by_artifact (Q6)
artifact TEXT K
num_ratings COUNTER
sum_ratings COUNTER
Reviews_by_user (Q9)
user UUID K
review TIMEUUID C↓
rating
title TEXT
body TEXT
artifact_id TEXT
artifact_title TEXT
artifact_authors LIST<TEXT>
user_name TEXT S
user_email TEXT S
Reviews_by_artifact (Q7)
artifact TEXT K
review TIMEUUID C↓
rating IDX
title TEXT
body TEXT
user UUID
Likes_by_artifact (Q8)
artifact TEXT K
num_likes COUNTER
Likes_by_review (Q11)
review TIMEUUID K
num_likes COUNTER
Artifacts (Q6, Q12)
artifact TEXT K
avg_rating FLOAT
type TEXT
title TEXT
authors LIST<TEXT>
keywords SET<TEXT>
venue TEXT
year INT
Sensor Networks and Time Series
• Data description
• Multiple sensor networks are deployed over non-overlapping regions
• A sensor network is identified by a unique number
• A sensor belongs to exactly one network
• A sensor has a unique identifier, location, and characteristics (e.g., accuracy, interface, size)
• A sensor records new measurements (e.g., temperature, humidity, pressure) every second
Slide 86
Conceptual Data Model
• Keys
• has: id
• records and Measurement: id, timestamp, parameter
Slide 87
ER diagram: Network (number, region, description, #of_sensors) has 1:n Sensor (id, location, characteristics); Sensor records 1:n Measurement (timestamp, parameter, value)
Application Workflow and Access Patterns
• Access patterns
• Q1: Find information about all networks
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
• Q3: Find information about all sensors in a specified network
• Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 88
Application workflow diagram: Q1 Networks, Q2 Heatmap, Q3 Sensors, Q4 Raw data
Logical Data Model
• Access patterns
• Q1: Find information about all networks
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
• Q3: Find information about all sensors in a specified network
• Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 89
Networks (Q1)
number K
description
region
n_sensors
Temperatures_by_network (Q2)
network K
date C↓
hour C↓
sensor C↑
avg_temp
location
region S
Sensors_by_network (Q3)
network K
sensor C↑
location
characteristics (map)
Measurements_by_sensor (Q4)
sensor K
timestamp C↓
parameter C↑
value
Analysis and Optimization
• Table Networks
• Partition size
• Single-row partitions
• Small partitions
• Optimization
• Merge small partitions into a larger partition
• “Partition per query” access pattern
• One small partition
Slide 90
Networks
number K
description
region
n_sensors
Networks
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
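The effect of the bucket column can be sketched in plain Python by grouping rows on their partition key. The sample rows are made up (only the column names come from the Networks table), and group_into_partitions is a hypothetical helper, not a Cassandra API.

```python
from collections import defaultdict

# Made-up sample rows; only the column names come from the Networks table.
networks = [
    {"number": 3, "description": "coastal", "region": "west", "n_sensors": 1000},
    {"number": 1, "description": "forest", "region": "north", "n_sensors": 1000},
    {"number": 2, "description": "desert", "region": "south", "n_sensors": 1000},
]


def group_into_partitions(rows, partition_key):
    """Group rows by the value that partition_key computes for each row."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return dict(partitions)


# Original design: number is the partition key, so each network is its own partition.
one_row_each = group_into_partitions(networks, lambda row: row["number"])

# Optimized design: a constant bucket value puts every network in one partition,
# with rows ordered by the clustering column number (ascending).
single_bucket = group_into_partitions(networks, lambda row: 1)
single_bucket[1].sort(key=lambda row: row["number"])

print(len(one_row_each))   # 3
print(len(single_bucket))  # 1
print([row["number"] for row in single_bucket[1]])  # [1, 2, 3]
```

With the first layout, Q1 has to visit every partition; with the second, it reads exactly one small partition, which is the "partition per query" pattern named on the slide.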
Analysis and Optimization
• Table Temperatures_by_network
• Partition size
• Multi-row partitions
• Large partitions
• Optimization – split partitions
Slide 91
Temperatures_by_network
network K
date C↓
hour C↓
sensor C↑
avg_temp
location
region S
Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date TIMESTAMP C↓
hour INT C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
date and hour can be combined into one column
Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
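A minimal sketch of how the two derived columns might be computed, and how large a weekly partition becomes. Two things here are assumptions rather than statements from the slide: that a week starts on Monday, and that each network has 1,000 sensors (the figure the deck uses when sizing Sensors_by_network). The function names are hypothetical.

```python
from datetime import datetime, timedelta

SENSORS_PER_NETWORK = 1_000  # assumption: figure used in the deck for Sensors_by_network
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def week_first_day(ts: datetime) -> datetime:
    """Midnight of the Monday that starts the week containing ts (Monday start is an assumption)."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


def date_hour(ts: datetime) -> datetime:
    """Truncate ts to the hour, combining the former date and hour columns."""
    return ts.replace(minute=0, second=0, microsecond=0)


# One row per sensor per hour, for one week of one network.
rows_per_weekly_partition = SENSORS_PER_NETWORK * HOURS_PER_DAY * DAYS_PER_WEEK

ts = datetime(2019, 11, 14, 9, 30)  # a Thursday
print(week_first_day(ts))           # 2019-11-11 00:00:00
print(date_hour(ts))                # 2019-11-14 09:00:00
print(rows_per_weekly_partition)    # 168000
```

Without week_first_day in the partition key, the same partition would keep growing by 24,000 rows per day under these assumptions; with it, the size is bounded at one week of data.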
Analysis and Optimization
• Table Sensors_by_network
• Partition size
• Multi-row partitions
• Small partitions (assuming 1,000 sensors per network)
• Optimization
• None
Slide 92
Sensors_by_network
network K
sensor C↑
location
characteristics (map)
Sensors_by_network
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>
Analysis and Optimization
• Table Measurements_by_sensor
• Partition size
• Multi-row partitions
• Large partitions
• Optimization – split partitions
Slide 93
Measurements_by_sensor
sensor K
timestamp C↓
parameter C↑
value
Measurements_by_sensor
sensor TEXT K
date TIMESTAMP K
timestamp TIMESTAMP C↓
parameter TEXT C↑
value FLOAT
because all timestamps have the same date in a partition, we can store a number of seconds elapsed since midnight
Measurements_by_sensor
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT
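The split of a timestamp into the date partition-key column and the second clustering column can be sketched as follows. The helper names are hypothetical, and the sketch uses naive datetimes and a Python date for simplicity, whereas the table stores date as a TIMESTAMP; time-zone handling is left out.

```python
from datetime import date, datetime, time, timedelta

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 possible values of `second` per daily partition


def split_timestamp(ts: datetime) -> tuple[date, int]:
    """Return (date, seconds elapsed since midnight) for a measurement timestamp."""
    midnight = datetime.combine(ts.date(), time.min)
    return ts.date(), int((ts - midnight).total_seconds())


def join_timestamp(day: date, second: int) -> datetime:
    """Rebuild the original timestamp from the two stored columns."""
    return datetime.combine(day, time.min) + timedelta(seconds=second)


ts = datetime(2019, 11, 14, 9, 30, 15)
day, second = split_timestamp(ts)
print(day, second)                  # 2019-11-14 34215
print(join_timestamp(day, second))  # 2019-11-14 09:30:15
```

If a sensor reports, say, three parameters every second, a daily partition holds at most 3 x 86,400 = 259,200 rows, and the INT column replaces a second full TIMESTAMP in every row.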
Analysis and Optimization
• Duplication
• How many times is region stored per network?
• Once in table Networks
• Once in table Temperatures_by_network
• Static column value is stored only once in a partition
Slide 94
Networks
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Analysis and Optimization
• Duplication
• How many times is location stored per sensor?
• Once in table Sensors_by_network
• 24 x 7 times in each partition in table Temperatures_by_network
• Duplication across partitions
• Duplication across rows in a partition
Slide 95
Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Sensors_by_network
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>
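The duplication count on this slide is simple arithmetic, worked through below. The function name is hypothetical, and the sketch assumes a sensor produces an hourly average for every hour of every stored week.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7

# One row per sensor per hour in a weekly Temperatures_by_network partition,
# and every one of those rows repeats the sensor's location.
copies_per_weekly_partition = HOURS_PER_DAY * DAYS_PER_WEEK


def location_copies_per_sensor(weeks_of_data: int) -> int:
    """Total stored copies of one sensor's location across both tables."""
    in_sensors_by_network = 1
    in_temperatures_by_network = copies_per_weekly_partition * weeks_of_data
    return in_sensors_by_network + in_temperatures_by_network


print(copies_per_weekly_partition)     # 168
print(location_copies_per_sensor(52))  # 8737
```

Unlike region, location cannot be made a static column in Temperatures_by_network, because it varies by sensor within a partition rather than being one value per partition.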
Physical Data Model
• Access patterns
• Q1: Find information about all networks
• Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
• Q3: Find information about all sensors in a specified network
• Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 96
Networks (Q1)
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
Temperatures_by_network (Q2)
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
Sensors_by_network (Q3)
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>
Measurements_by_sensor (Q4)
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT
Physical Data Model
Slide 97
CREATE TABLE networks (
bucket INT,
number INT,
description TEXT,
region TEXT,
n_sensors INT,
PRIMARY KEY (bucket, number)
);
-- Q1
SELECT *
FROM networks
WHERE bucket = 1;
Networks
bucket INT K
number INT C↑
description TEXT
region TEXT
n_sensors INT
Physical Data Model
Slide 98
CREATE TABLE temperatures_by_network (
network INT,
week_first_day TIMESTAMP,
date_hour TIMESTAMP,
sensor TEXT,
avg_temp FLOAT,
location TEXT,
region TEXT STATIC,
PRIMARY KEY ((network, week_first_day), date_hour, sensor)
) WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC);
-- Q2
SELECT * FROM temperatures_by_network
WHERE network = ? AND week_first_day = ?
AND date_hour >= ? AND date_hour <= ?;
Temperatures_by_network
network INT K
week_first_day TIMESTAMP K
date_hour TIMESTAMP C↓
sensor TEXT C↑
avg_temp FLOAT
location TEXT
region TEXT S
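The Q2 statement reads one weekly partition, so a date range that crosses a week boundary needs one execution per week. The sketch below computes the bind values for each execution. It is an illustration only: the helper names are hypothetical, weeks are assumed to start on Monday, and date_hour values are assumed to fall exactly on the hour.

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)


def week_first_day(ts: datetime) -> datetime:
    """Midnight of the Monday that starts the week containing ts (Monday start is an assumption)."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


def q2_bind_values(network: int, start: datetime, end: datetime):
    """Return one (network, week_first_day, lo, hi) tuple per weekly partition overlapping [start, end]."""
    if start > end:
        return []
    values = []
    week = week_first_day(start)
    while week <= end:
        last_hour_of_week = week + WEEK - timedelta(hours=1)
        lo = max(start, week)             # clip the range to this partition
        hi = min(end, last_hour_of_week)
        values.append((network, week, lo, hi))
        week += WEEK
    return values


# 14 Nov 2019 is a Thursday; the range runs into the following week.
for row in q2_bind_values(7, datetime(2019, 11, 14), datetime(2019, 11, 19, 23)):
    print(row)
```

Each tuple lines up with the four placeholders in the Q2 statement (network, week_first_day, lower bound, upper bound). Executing the tuples newest week first would preserve the descending order that Q2 asks for across partitions.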
Physical Data Model
Slide 99
CREATE TABLE sensors_by_network (
network INT,
sensor TEXT,
location TEXT,
characteristics MAP<TEXT,TEXT>,
PRIMARY KEY (network, sensor)
);
-- Q3
SELECT * FROM sensors_by_network
WHERE network = ?;
Sensors_by_network
network INT K
sensor TEXT C↑
location TEXT
characteristics MAP<TEXT,TEXT>
Physical Data Model
Slide 100
CREATE TABLE measurements_by_sensor (
sensor TEXT,
date TIMESTAMP,
second INT,
parameter TEXT,
value FLOAT,
PRIMARY KEY ((sensor, date), second, parameter)
) WITH CLUSTERING ORDER BY (second DESC, parameter ASC);
-- Q4
SELECT * FROM measurements_by_sensor
WHERE sensor = ? AND date = ?;
Measurements_by_sensor
sensor TEXT K
date TIMESTAMP K
second INT C↓
parameter TEXT C↑
value FLOAT
Summary
Slide 102
Conceptual Data Model + Application Workflow → (map) → Logical Data Model → (optimize) → Physical Data Model
Want to Learn More?
Slide 103
academy.datastax.com
kdm.dataview.org
cassandra.apache.org
Thank You
Slide 104
Artem Chebotko, Ph.D.
achebotko@datastax.com
www.linkedin.com/in/artemchebotko
Entity Mapping Patterns
Slide 105
1:1 Relationship Mapping Patterns
Slide 106
1:n Relationship Mapping Patterns
Slide 107
m:n Relationship Mapping Patterns
Slide 108
Hierarchical Mapping Pattern
Slide 109

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra

  • 1. Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra™ Artem Chebotko, Ph.D. November, 2019
  • 2. Agenda • Introduction to Apache Cassandra™ and CQL • The Data Modeling Framework • Conceptual Data Modeling and Application Workflows • Logical Data Modeling and Chebotko Diagrams • Physical Data Modeling and Optimization Techniques • Time Series Modeling Example • Summary and Resources Slide 2
  • 3. Slide 3 “Manage massive amounts of data, fast, without losing sleep” cassandra.apache.org
  • 4. Cassandra Use Cases Slide 4 “Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.” “Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).” Use cases: Personalization, Fraud detection, Messaging, Playlists, Internet of Things
  • 5. Many Reasons to Choose Cassandra Slide 5 • High Availability – Always On • No single point of failure • Fault-tolerance via replication and tunable consistency • Best-in-class multi-datacenter support • No downtime or interruption due to node maintenance • Performance and Scalability • Very fast writes • Fast reads • Linear scalability • Elasticity
  • 6. How Cassandra Organizes Data Slide 6 CREATE KEYSPACE library WITH replication = {'class': 'NetworkTopologyStrategy', 'DC-West': '3', 'DC-East': '5'}; library
  • 7. How Cassandra Organizes Data Slide 7 CREATE KEYSPACE library WITH replication = {'class': 'NetworkTopologyStrategy', 'DC-West': '3', 'DC-East': '5'}; CREATE TABLE library.venues_by_year ( year INT, name TEXT, country TEXT, homepage TEXT, PRIMARY KEY (year,name)); artifacts venues_by_year library
  • 8. How Cassandra Organizes Data Slide 8 CREATE TABLE library.venues_by_year ( ... PRIMARY KEY (year,name)); artifacts venues_by_year library year name country … 2019 Apache Cassandra Summit USA … 2019 Data Modeling Zone USA … 2019 DataStax Accelerate USA … year name country … 2015 A … … … 2015 B … … … 2015 C … … …
  • 9. How Cassandra Organizes Data Slide 9 CREATE TABLE library.venues_by_year ( ... PRIMARY KEY (year,name)); artifacts venues_by_year library DC-West DC-East RF=3 RF=5
  • 10. How Cassandra Organizes Data Slide 10 CREATE TABLE library.venues_by_year ( ... PRIMARY KEY (year,name)); artifacts venues_by_year library DC-West DC-East RF=3 RF=5
  • 11. How Cassandra Organizes Data Slide 11 CREATE TABLE library.venues_by_year ( ... PRIMARY KEY (year,name)); artifacts venues_by_year library DC-West DC-East RF=3 RF=5
  • 12. Cassandra Query Language (CQL) • Data Definition • CREATE KEYSPACE, CREATE TABLE • CREATE INDEX, CREATE CUSTOM INDEX • CREATE MATERIALIZED VIEW • Data Manipulation • SELECT • INSERT, UPDATE, DELETE Slide 12
  • 13. CQL CREATE TABLE Slide 13 CREATE TABLE name ( column type STATIC, column type STATIC, ..., PRIMARY KEY ( (column, ...), column, ... ) ) WITH CLUSTERING ORDER BY (clustering_key_column (ASC|DESC), ...); Annotations: name is the table name; column type STATIC gives column names, types, and the optional STATIC designation; (column, ...) is the partition key; column, ... is the optional clustering key; WITH CLUSTERING ORDER BY sets row ordering in a partition
  • 14. Table with Single-Row Partitions Slide 14 CREATE TABLE users ( id UUID, name TEXT, email TEXT, PRIMARY KEY (id) ); Sample rows (id, email, name): a7e78478-0a54-4949-90f3-14ec4cbea40c, jbellis@datastax.com, Jonathan; 67657da3-4443-46ab-b60a-510a658fc7bb, achebotko@datastax.com, Artem; 3b1f62b1-386b-46e3-b55d-00f1abbafb2b, patrick@datastax.com, Patrick. Chebotko notation: Users (id UUID K, name TEXT, email TEXT)
  • 15. Table with Multi-Row Partitions Slide 15 CREATE TABLE artifacts_by_venue ( venue TEXT, year INT, artifact TEXT, title TEXT, country TEXT STATIC, PRIMARY KEY ((venue, year), artifact) ); Sample partitions (venue, year, artifact, title, country): DataStax Accelerate 2019 A… Linear Scalability … USA… … Z… Building Cloud … Data Modeling Zone 2019 A… New approach to … USA … … Chebotko notation: Artifacts_by_venue (venue TEXT K, year INT K, artifact TEXT C↑, title TEXT, country TEXT S)
  • 16. CQL SELECT Slide 16 SELECT selectors FROM table_name WHERE primary_key_conditions AND index_conditions GROUP BY primary_key_columns ORDER BY clustering_key_columns ( ASC | DESC ) LIMIT N ALLOW FILTERING ; Annotations: selectors are columns, aggregates, functions; one table per query; restricted to primary key columns; ALLOW FILTERING: danger
  • 17. Sample CQL Queries Slide 17 Table: Artifacts_by_venue (venue TEXT K, year INT K, artifact TEXT C↑, title TEXT, country TEXT S). Valid queries: SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ?; SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND artifact = ?; SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND artifact > ? AND artifact < ? ORDER BY artifact DESC;
  • 18. Invalid CQL Queries Slide 18 Table: Artifacts_by_venue (venue TEXT K, year INT K, artifact TEXT C↑, title TEXT, country TEXT S). Invalid queries: SELECT * FROM artifacts_by_venue WHERE venue = ?; SELECT * FROM artifacts_by_venue WHERE venue = ? AND artifact = ?; SELECT * FROM artifacts_by_venue WHERE artifact > ? AND artifact < ?; SELECT * FROM artifacts_by_venue WHERE venue = ? AND year = ? AND title = ?; SELECT * FROM artifacts_by_venue WHERE country = ?;
  • 19. Important Implications for Data Modeling • Data • Primary keys define data uniqueness • Partition keys define data distribution • Partition keys affect partition sizes • Clustering keys define row ordering • Query • Primary keys define how data is retrieved • Partition keys allow equality predicates • Clustering keys allow inequality predicates and ordering • Only one table per query, no joins Slide 19
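The query implications above can be sketched as a small checker. This is a simplified illustration, not Cassandra's actual validation logic: the `Table` class and `is_valid_where` function are hypothetical names, and the sketch ignores `IN`, secondary indexes, `token()`, and `ALLOW FILTERING`. It assumes every partition key column needs an equality predicate and that clustering columns must be restricted in declared order, with a range allowed only on the last restricted one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    partition_key: tuple   # partition key columns
    clustering_key: tuple  # clustering columns, in declared order


def is_valid_where(table, predicates):
    """Check a WHERE clause given as {column: '=' or 'range'}."""
    primary_key = set(table.partition_key) | set(table.clustering_key)
    # Only primary key columns may be restricted (no index, no filtering).
    if not set(predicates) <= primary_key:
        return False
    # Every partition key column needs an equality predicate.
    if any(predicates.get(col) != "=" for col in table.partition_key):
        return False
    # Clustering columns: no gaps, and nothing may follow a range.
    closed = False
    for col in table.clustering_key:
        op = predicates.get(col)
        if op is None:
            closed = True
        elif closed:
            return False
        elif op == "range":
            closed = True
    return True


artifacts_by_venue = Table(("venue", "year"), ("artifact",))
print(is_valid_where(artifacts_by_venue, {"venue": "=", "year": "="}))
print(is_valid_where(artifacts_by_venue, {"venue": "="}))
```

Under these assumptions the checker accepts the three queries from slide 17 and rejects the five from slide 18.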
  • 20. Agenda • Introduction to Apache Cassandra™ and CQL • The Data Modeling Framework • Conceptual Data Modeling and Application Workflows • Logical Data Modeling and Chebotko Diagrams • Physical Data Modeling and Optimization Techniques • Time Series Modeling Example • Summary and Resources Slide 20
  • 21. Data Modeling • Collection and analysis of data requirements • Identification of participating entities and relationships • Identification of data access patterns • A particular way of organizing and structuring data • Design and specification of a database schema • Schema optimization and data indexing techniques Slide 21 Data quality: completeness consistency accuracy Data access: queryability efficiency scalability
  • 22. Key Data Modeling Steps • Understand the data (Conceptual Data Model) • Identify access patterns (Application Workflow) • Apply a query-first approach (Logical Data Model) • Optimize and implement (Physical Data Model) Slide 22
  • 23. The Data Modeling Framework Slide 23 Conceptual Data Model and Application Workflow → (map) → Logical Data Model → (optimize) → Physical Data Model. Defines models and transitions
  • 24. Agenda • Introduction to Apache Cassandra™ and CQL • The Data Modeling Framework • Conceptual Data Modeling and Application Workflows • Logical Data Modeling and Chebotko Diagrams • Physical Data Modeling and Optimization Techniques • Time Series Modeling Example • Summary and Resources Slide 24
  • 25. Conceptual Data Modeling • Conceptual data model • High-level view of data – entity and relationship types • Technology-independent or technology-agnostic • Not specific to Cassandra or any other database system • Purpose: understanding data • The scope of what needs to be accomplished • Essential components, concepts, entities, relationships • Keys and cardinality constraints are absolutely essential Slide 25
  • 26. Advantages of Conceptual Data Modeling • Complex data modeling problems are more manageable • Less chance of producing an incorrect or incomplete model • Saves time in the long run • Improves data, business process, and risk management • Readable by both technical and non-technical people • Good for data governance • Understanding of a data modeling problem is documented • Can be shared, reviewed and agreed on • Improves understanding and eliminates ambiguity Slide 26
  • 27. Conceptual Data Modeling Techniques • Entity-relationship modeling • Entities and relationships that can exist among them • Unified Modeling Language (UML) class diagrams • Classes and associations that can exist among them • Object-role modeling or fact-oriented modeling • Entities and facts that define relationships among them • Dimensional modeling • Dimensions and facts that relate them • Ontological modeling • Concepts and relations that can exist among them Slide 27
  • 28. ER Model • Chen’s notation • Simple and original • Truly technology-independent • Other notations may be influenced by relational databases • Crow’s foot, Barker’s notation, Information engineering, IDEF1X • Many hybrids exist Slide 28
  • 29. ER Model Basics • Entity – object that is involved in an information system • Example: Jonathan Ellis (an author), “The State of Cassandra, 2014” (a presentation) • Entity type – set of similar objects • Example: Author, Presentation • Relationship – relates two or more entities • Example: Jonathan Ellis creates “The State of Cassandra, 2014” • Relationship type – set of similar relationships • Example: Author creates Presentation Slide 29
  • 30. Entity Type Example • Name – usually a noun • Attributes – atomic, set-valued, composite, derived • Key – minimum set of attributes that uniquely identify an entity Slide 30 Example: User with attributes id, name, first_name, last_name, DOB, age, emails
  • 31. Relationship Type Example • Name – usually a verb • Attributes – can be atomic, set-valued, composite, derived • Roles – each role names a related entity • Key – minimum set of roles and attributes that uniquely identify a relationship • Cardinality constraints – how many times an entity can participate in a relationship Slide 31 Venue year country homepage features 1 n name Artifact id title authors keywords date
  • 32. How is a Key of a Relationship Derived? • 1:1 relationship • 1:n relationship • m:n relationship Slide 32 Author lname affiliation has date 1 1 fname Bio fname bio lnameinterests references Venue year country homepage features 1 n name Artifact id title authors keywords date Author lname affiliation creates m n fname Artifact id title keywords interests fname, lname id fname, lname, id
  • 33. Entity Type Hierarchy Example • Attribute inheritance: id, title, authors, and keywords are inherited by Article and Presentation • Disjoint – cannot have an entity that is both Article and Presentation • Covering – cannot have a Digital Artifact to be anything but Article or Presentation Slide 33 Digital Artifact IsA Article Presentation disjoint covering id title keywords authors url abstract doi
  • 34. Conceptual Data Model for Digital Library Slide 34 User Digital Artifact Venue likes n m features 1 n IsA Article Presentation disjoint covering posts id title keywords authors 1 n year country name homepage id name email timestamp title Review likes features n m id body n 1 rating
  • 35. Application Workflow Model • High-level application design • Tasks, causal dependencies, access patterns • Technology-independent or technology-agnostic • Not specific to Cassandra or any other database system • Purpose: understand data access patterns • Each application has a workflow • Data-driven tasks access a database • A sequence of tasks defines a sequence of data access patterns Slide 35
  • 36. Application Workflow for Digital Library Slide 36 Search for artifacts by a venue, author, title, or keyword Display information for a venue Display a rating of an artifact Display reviews for an artifact Display likes for an artifact Find information for an artifact with a given id Show information about a user Show likes for a review Show reviews by a user Tasks and causal dependencies
  • 37. Application Workflow for Digital Library Slide 37 ACCESS PATTERNS Q1: Find artifacts for a specified venue ... Q2: Find artifacts for a specified author ... Q3: Find artifacts with a specified title ... Q4: Find artifacts with a specified keyword ... Q5: Find information for a specified venue. Q6: Find an average rating for a specified artifact. Q7: Find reviews for a specified artifact ... ... Q5 Q1,Q2,Q3,Q4 Q8Q6 Q7 Search for artifacts by a venue, author, title, or keyword Display information for a venue Display a rating of an artifact Display reviews for an artifact Display likes for an artifact Find information for an artifact with a given id Show information about a user Show likes for a review Show reviews by a user Q9 Q10 Q12 Q11 Data access patterns
  • 38. Agenda • Introduction to Apache Cassandra™ and CQL • The Data Modeling Framework • Conceptual Data Modeling and Application Workflows • Logical Data Modeling and Chebotko Diagrams • Physical Data Modeling and Optimization Techniques • Time Series Modeling Example • Summary and Resources Slide 38
  • 39. Logical Data Modeling • Logical data model • Sketch data model for Cassandra • Chebotko Diagrams for visualization • Purpose: sound, query-driven design • Data organization into tables according to the queries • Correctness of primary key design • Denormalization, nesting, duplication Slide 39
  • 40. Chebotko Diagrams • Visual representation of a logical data model • Tables are represented by rectangles and have names and columns • Columns may optionally be designated as K (partition key column), C (clustering key column), S (static column), and IDX (indexed column) • Access patterns and their ordering are represented by query-labeled connections Slide 40 [Figure: cropped excerpt of the digital library Chebotko Diagram, showing tables Venues, Artifacts_by_venue, Artifacts_by_author, Artifacts_by_title, Artifacts_by_keyword, Artifacts, Ratings_by_artifact, Reviews_by_artifact, and Likes_by_artifact connected by queries Q1 to Q8]
  • 41. Conceptual-to-Logical Mapping Slide 41 Conceptual Data Model Application Workflow Logical Data Model map Mapping rules Mapping patterns
  • 42. Mapping Rules • Mapping rule 1: “Entities and relationships” • Entity and relationship types map to tables • Mapping rule 2: “Key attributes” • Key attributes map to primary key columns • Mapping rule 3: “Equality search attributes” • Equality search attributes map to partition key columns • Mapping rule 4: “Inequality search attributes” • Inequality search attributes map to clustering columns • Mapping rule 5: “Ordering attributes” • Ordering attributes map to clustering columns Slide 42 Inputs: Conceptual Data Model, Application Workflow
  • 43. Example • Conceptual data model • Venue (name, year, country, homepage) features Digital Artifact (id, title, keywords, authors), cardinality 1:n • Query • Find artifacts that appeared in a particular venue after a specified year; order results by year (desc) and title (asc) • Query predicate (equality and inequality): name = ? AND year > ? • Ordering attributes: year (DESC), title (ASC) Slide 43
  • 44. Applying Mapping Rules 1 and 2 • Mapping rule 1: “Entities and relationships” • The query only concerns a part of the diagram: Venue (name, year, country, homepage) features Digital Artifact (id, title, keywords, authors), cardinality 1:n • Data should be organized based on the relationship type • Mapping rule 2: “Key attributes” • Key of the relationship is id • id must map to a primary key column (say column artifact) Slide 44 Resulting table sketch: Artifacts_by_venue ( ..., artifact C↑, ... )
  • 45. Applying Mapping Rule 3 • Mapping rule 3:“Equality search attributes” • Equality search attribute name=? maps to the 1st column of the primary key • It must be a partition key column (say column venue) Slide 45 Artifacts_by_venue venue K ... artifact C↑ ...
  • 46. Applying Mapping Rule 4 • Mapping rule 4:“Inequality search attributes” • Inequality search attribute year>? maps to a clustering column Slide 46 Artifacts_by_venue venue K year C↑ ... artifact C↑ ...
  • 47. Applying Mapping Rule 5 • Mapping rule 5:“Ordering attributes” • Ordering attributes year (DESC) and title (ASC) map to clustering columns • year is already part of the schema but its order should be reversed to DESC • title is added next Slide 47 Artifacts_by_venue venue K year C↓ ... artifact C↑ ... Artifacts_by_venue venue K year C↓ title C↑ artifact C↑ ...
  • 48. Final Result How did we get column type? Slide 48 SELECT * FROM artifacts_by_venue WHERE venue = ? AND year > ? ORDER BY year DESC, title ASC Artifacts_by_venue venue K year C↓ title C↑ artifact C↑ type authors (list) keywords (set) Digital Artifact Venue features 1 n idyear country name title homepage keywords authors Digital Artifact IsA Article Presentation disjoint covering
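The walk-through of mapping rules 2 to 5 on slides 44 to 48 can be sketched as a short function. This is only an illustration of the order in which the rules were applied in the example; the function name `derive_primary_key` and its argument shapes are assumptions made here, not part of the methodology's definition.

```python
def derive_primary_key(equality, inequality, ordering, key):
    """Return (partition key columns, [(clustering column, order), ...]).

    equality   -- columns with equality predicates (rule 3)
    inequality -- columns with inequality predicates (rule 4)
    ordering   -- (column, 'ASC' or 'DESC') pairs (rule 5)
    key        -- key columns of the relationship (rule 2)
    """
    partition_key = list(equality)
    clustering = {}  # dicts keep insertion order, which is the column order
    for col in inequality:
        clustering.setdefault(col, "ASC")
    for col, direction in ordering:
        clustering[col] = direction  # may reverse an existing column's order
    for col in key:
        if col not in partition_key:
            clustering.setdefault(col, "ASC")  # guarantees uniqueness
    return partition_key, list(clustering.items())


pk, ck = derive_primary_key(
    equality=["venue"],
    inequality=["year"],
    ordering=[("year", "DESC"), ("title", "ASC")],
    key=["artifact"],
)
print(pk, ck)
```

For the example query this yields venue as the partition key and year (DESC), title (ASC), artifact (ASC) as clustering columns, matching the Artifacts_by_venue table on slide 48.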
  • 49. Mapping Patterns • Semi-formal definitions of common mapping use cases • Graphical rather than mathematical representation • Use clustering columns as the data nesting mechanism • Do not take ordering of results into consideration • Guide schema design • Ensure correctness and efficiency • Enable automation Slide 49
  • 50. Common Mapping Patterns • Entity patterns • 1:1 relationship patterns • 1:n relationship patterns • m:n relationship patterns • Hierarchical patterns Slide 50
  • 51. 1:n relationship mapping pattern 3.1 • Search attributes = key attributes Slide 51 ET1 key1.2 attr1.1 attr1.2 ET2_by_ET1_key key1.1 K key1.2 K key2.1 C↑ key2.2 C↑ attr1.1 S attr1.2 S attr1.3 (collection) S attr2.1 attr2.2 attr2.3 (collection) attr RT attr 1 n key1.1 ET2 key2.1 attr2.1 attr2.2 key2.2 attr2.3 attr1.3 ACCESS PATTERN search attributes: key1.1 key1.2 ET2_by_ET1_key key1.1 K key1.2 C↑ key2.1 C↑ key2.2 C↑ attr2.1 attr2.2 attr2.3 (collection) attr = > PRIMARY KEY: All search attributes, followed by all key attributes of RT STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key What if we add green attributes to the above table?
  • 52. 1:n relationship mapping pattern 3.1 (example) • Search attributes = key attributes Slide 52 Venue year country homepage Artifacts_by_venue venue (=name) K year K artifact (= id) C↑ country S homepage S title authors (list) keywords (set) features 1 n name Artifact id title ACCESS PATTERN search attributes: name year = > PRIMARY KEY: All search attributes, followed by all key attributes of features STATIC COLUMNS: Non-key attributes of Venue, iff all key attributes of Venue are part of the partition key What about country and homepage? authors keywords Artifacts_by_venue venue (=name) K year C↓ artifact (= id) C↑ title authors (list) keywords (set)
  • 53. 1:n relationship mapping pattern 3.2 • Search attributes ≠ key attributes Slide 53 ET1 key1.2 attr1.1 attr1.2 ET2_by_ET1_non-key attr1.1 K attr1.2 K key2.1 C↑ key2.2 C↑ attr2.1 attr2.2 attr2.3 (collection) attr RT attr 1 n key1.1 ET2 key2.1 attr2.1 attr2.2 key2.2 attr2.3 attr1.3 ACCESS PATTERN search attributes: attr1.1 attr1.2 ET2_by_ET1_non-key attr1.1 K attr1.2 C↑ key2.1 C↑ key2.2 C↑ attr2.1 attr2.2 attr2.3 (collection) attr = > PRIMARY KEY: All search attributes, followed by all key attributes of RT All ET1's attributes can be added at the cost of duplicating them for every entity of type 2
  • 54. 1:n relationship mapping pattern 3.2 (example) • Search attributes ≠ key attributes Slide 54 Venue year country homepage Artifacts_by_country country K year K artifact (= id) C↑ title authors (list) keywords (set) features 1 n name Artifact id title ACCESS PATTERN search attributes: country year = > PRIMARY KEY: All search attributes, followed by all key attributes of features authors keywords name and homepage can be added at the cost of duplicating them for every artifact Artifacts_by_country country K year C↓ artifact (= id) C↑ title authors (list) keywords (set)
  • 55. Logical Data Model for Digital Library Slide 55 ACCESS PATTERNS Q1: Find artifacts for a specified venue; order by year (DESC). Q2: Find artifacts for a specified author; order by year (DESC). Q3: Find artifacts with a specified title; order by year (DESC). Q4: Find artifacts with a specified keyword; order by year (DESC). Q5: Find information for a specified venue. Q6: Find an average rating for a specified artifact. Q7: Find reviews for a specified artifact, possibly with a specified rating . Q8: Find a number of ‘likes’ for a specified artifact. Q9: Find reviews for a specified user; order by review timestamp (DESC). Q10: Find a user with a specified id. Q11: Find a number of ‘likes’ for a specified review. Q12: Find information for a specified artifact. ... Venues name K year K country IDX homepage Q5 Artifacts_by_venue venue K year C↓ artifact C↑ type title authors (list) keywords (set) Artifacts_by_author author K year C↓ artifact C↑ type title authors (list) keywords (set) venue Artifacts_by_title title K year C↓ artifact C↑ type authors (list) keywords (set) venue Artifacts_by_keyword keyword K year C↓ artifact C↑ type title authors (list) keywords (set) venue Users id K name email Ratings_by_artifact artifact K num_ratings (counter) sum_ratings (counter) Reviews_by_user user K review (timeuuid) C↓ rating title body artifact_id artifact_title artifact_authors (list) user_name S user_email S Reviews_by_artifact artifact K review (timeuuid) C↓ rating IDX title body user Likes_by_artifact artifact K num_likes (counter) Likes_by_review review K num_likes (counter) Q1 Q2 Q3 Q4 Artifacts artifact K type title authors (list) keywords (set) venue year Q8Q6 Q7 Q11 Q10 Q9 Q12
  • 56. Agenda • Introduction to Apache Cassandra™ and CQL • The Data Modeling Framework • Conceptual Data Modeling and Application Workflows • Logical Data Modeling and Chebotko Diagrams • Physical Data Modeling and Optimization Techniques • Time Series Modeling Example • Summary and Resources Slide 56
  • 57. Physical Data Modeling • Physical data model • Complete data model for Cassandra • Chebotko Diagrams for visualization • CQL for a database schema • Purpose: efficient, implementation-ready design • Data model efficiency analysis and validation • Schema design optimizations • Techniques for concurrent data access Slide 57
  • 58. Logical Data Model – Correctness and Efficiency • But … • Database engine has limitations • Resources are finite • Some operations may require special considerations Slide 58
  • 59. Physical Data Model – More Efficiency • Physical data model takes into account … • Partition sizes • Data duplication factors • Data types, indexes, materialized views • Concurrent data access requirements Slide 59
  • 60. Partition Size Limits • Theoretical limits • 2 billion values • Node disk size • Practical limits for Cassandra • Cassandra 2: 100K, 100MB • Cassandra 3: up to 10x, up to 10x Slide 60
  • 61. Estimating a Partition Size (Cassandra 3) Slide 61 Notation: Nv – number of values in a partition; Ncv – number of clustering column values in a partition; Nrv – number of regular column values in a partition; Nsv – number of static column values in a table definition; Nr – number of rows in a partition; Ncc – number of clustering columns in a table definition; Nrc – number of regular columns in a table definition; Nsc – number of static columns in a table definition; Sp – size of a partition in bytes; sizeOf – size (in bytes) of a CQL data type; ck – partition key column in a table definition; cs – static column in a table definition; cr – regular column in a table definition; cc – clustering column in a table definition; sizeOf(tavg) – average size (in bytes) of a timestamp delta associated with a value
  • 62. Simplified Example: Values • User with 1000 reviews = 1000 rows in a partition Slide 62 Reviews_by_user (user UUID K, review TIMEUUID C↓, rating FLOAT, title TEXT, user_name TEXT S, user_email TEXT S) Values per column: review 1000, rating 1000, title 1000, user_name 1, user_email 1. Total: 1000 + 1000 + 1000 + 1 + 1 = 3002 values
  • 63. Simplified Example: Bytes • User with 1000 reviews = 1000 rows in a partition Slide 63 Reviews_by_user (user UUID K, review TIMEUUID C↓, rating FLOAT, title TEXT, user_name TEXT S, user_email TEXT S) Bytes per column: user 16, review 1000x16, rating 1000x4, title 1000x60, user_name 12, user_email 20. Subtotal: 16 + 1000x16 + 1000x4 + 1000x60 + 12 + 20 = 80048. Timestamp deltas: 2002x8. Total: 80048 + 2002x8 = 96064 bytes
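The arithmetic on slides 62 and 63 can be reproduced with a few lines of code. This is a sketch of the slide's simplified example only, not the full estimation formula from slide 61: the per-column sizes (60, 12, and 20 bytes for the TEXT columns) and the 8-byte timestamp delta are the figures used on the slide, and the function names are made up for illustration.

```python
ROWS = 1000  # a user with 1000 reviews

# (column, kind, size in bytes), sizes as used on slide 63
COLUMNS = [
    ("user", "partition", 16),
    ("review", "clustering", 16),
    ("rating", "regular", 4),
    ("title", "regular", 60),
    ("user_name", "static", 12),
    ("user_email", "static", 20),
]
TIMESTAMP_DELTA = 8  # bytes per regular or static value


def occurrences(kind, rows):
    # Clustering and regular values repeat per row;
    # partition key and static values are stored once per partition.
    return rows if kind in ("clustering", "regular") else 1


def count_values(columns, rows):
    # The partition key is not counted as a value on slide 62.
    return sum(occurrences(k, rows) for _, k, _ in columns if k != "partition")


def partition_bytes(columns, rows, timestamp_delta):
    data = sum(size * occurrences(k, rows) for _, k, size in columns)
    timestamped = sum(
        occurrences(k, rows) for _, k, _ in columns if k in ("regular", "static")
    )
    return data + timestamped * timestamp_delta


print(count_values(COLUMNS, ROWS))                      # 3002
print(partition_bytes(COLUMNS, ROWS, TIMESTAMP_DELTA))  # 96064
```

Changing `ROWS` shows how the estimate scales roughly linearly with the number of reviews per user.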
  • 64. Getting a Partition Size Empirically Slide 64 $ nodetool flush library reviews_by_user $ nodetool tablestats -H library.reviews_by_user Keyspace : library ... Table: reviews_by_user SSTable count: 1 ... Number of partitions (estimate): 1 ... Compacted partition minimum bytes: 88149 Compacted partition maximum bytes: 105778 Compacted partition mean bytes: 105778 ...
  • 65. Splitting Large Partitions • Solution: introduce an additional column to a partition key • Use an existing column – convenience • Use an artificial “bucket” column – more control • Cons: supported access patterns may change • Example: • Millions of artifacts with the same keyword across different venues and years Slide 65 Artifacts_by_keyword keyword K year C↓ artifact C↑ type title authors (list) keywords (set) venue
  • 66. Using an Existing Column Slide 66 Artifacts_by_keyword keyword K year C↓ artifact C↑ type title authors keywords venue LIST<TEXT> SET<TEXT> INT TEXT TEXT TEXT TEXT TEXT Artifacts_by_keyword keyword K year K artifact C↑ type title authors keywords venue LIST<TEXT> SET<TEXT> INT TEXT TEXT TEXT TEXT TEXT
  • 67. Using an Artificial Column Slide 67 Artifacts_by_keyword keyword K year K bucket K artifact C↑ type title authors keywords venue LIST<TEXT> SET<TEXT> INT TEXT TEXT TEXT TEXT TEXT INT Artifacts_by_keyword keyword K year K artifact C↑ type title authors keywords venue LIST<TEXT> SET<TEXT> INT TEXT TEXT TEXT TEXT TEXT
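The slides do not say how the value of the artificial bucket column is chosen. One common option, shown here purely as an assumption, is to hash the artifact identifier into a fixed number of buckets. The names `bucket_for` and `partitions_to_read` and the bucket count are hypothetical.

```python
import zlib

NUM_BUCKETS = 10  # assumed; chosen to keep each partition within size limits


def bucket_for(artifact, num_buckets=NUM_BUCKETS):
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(artifact.encode("utf-8")) % num_buckets


def partitions_to_read(keyword, year, num_buckets=NUM_BUCKETS):
    # Reading all artifacts for (keyword, year) now touches every bucket.
    return [(keyword, year, bucket) for bucket in range(num_buckets)]


print(bucket_for("conf/cassandra/Ellis11(1)"))
print(len(partitions_to_read("cassandra", 2019)))
```

The second function illustrates the cost noted on slide 65: the access pattern changes, because a query by keyword and year must now visit one partition per bucket.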
  • 68. Data Duplication Considerations • Data duplication is necessary • Data duplication vs. data replication • Data duplication factor • Data duplication and data consistency Slide 68
  • 69. Duplication Across Tables • Each artifact is stored once in Artifacts • Each artifact is stored once in Artifacts_by_venue • Duplication factor = 2 Slide 69 Artifacts_by_venue venue K year C↓ artifact C↑ type title authors (list) keywords (set) Artifacts artifact K type title authors (list) keywords (set) venue year
  • 70. Duplication Across Partitions • An artifact with 5 authors is stored in 5 different partitions • Duplication factor = 5 Slide 70 Artifacts_by_author author K year C↓ artifact C↑ type title authors (list) keywords (set) venue
  • 71. Artifacts_by_author author K keyword C↑ year C↓ artifact C↑ type title authors (list) keywords (set) venue Duplication Across Rows • An artifact with 5 keywords is stored in 5 rows of the same partition • An artifact with 5 authors is stored in 5 different partitions • Duplication factor = 5 x 5 = 25 Slide 71
  • 72. Artifacts_by_author author K keyword C↑ year C↓ artifact C↑ type title authors (list) keywords (set) venue Beware of Non-constant Duplication Factors • Users can add new keywords to artifacts • There is no limit on the number of keywords per artifact • An artifact with n keywords is stored in n rows of the same partition • An artifact with 5 authors is stored in 5 different partitions • Duplication factor = 5 x n • Do things differently • Place reasonable limits that can be gradually increased Slide 72
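The duplication factors on slides 70 to 72 follow from counting how many (author, keyword) rows a single artifact produces. The sketch below uses a hypothetical helper, `rows_for_artifact`, and placeholder author and keyword names.

```python
def rows_for_artifact(authors, keywords):
    # One row per (author partition, keyword clustering value) pair.
    return [(author, keyword) for author in authors for keyword in keywords]


authors = [f"author{i}" for i in range(5)]

for n in (5, 50, 500):
    keywords = [f"keyword{i}" for i in range(n)]
    factor = len(rows_for_artifact(authors, keywords))
    print(n, factor)  # factor is 5 x n
```

With 5 keywords the factor is 25, as on slide 71; with an unbounded number of keywords it grows without limit, which is why slide 72 suggests placing a reasonable cap on them.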
  • 73. Keeping Up with Data Duplication and Data Consistency Slide 73 • Insert a new artifact: BEGIN BATCH INSERT INTO artifacts ... INSERT INTO artifacts_by_title ... APPLY BATCH; • Update an existing artifact title: BEGIN BATCH UPDATE artifacts ... INSERT INTO artifacts_by_title ... DELETE FROM artifacts_by_title ... APPLY BATCH; Tables: Artifacts (artifact K, type, title, authors (list), keywords (set), venue, year); Artifacts_by_title (title K, year C↓, artifact C↑, type, authors (list), keywords (set), venue)
  • 74. Keeping Up with Data Duplication and Data Consistency: How About Materialized Views? Slide 74 Base table: CREATE TABLE artifacts ( artifact TEXT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT, year INT, PRIMARY KEY (artifact) ); Materialized view: CREATE MATERIALIZED VIEW artifacts_by_title AS SELECT title, year, artifact, type, authors, keywords, venue FROM artifacts WHERE title IS NOT NULL AND artifact IS NOT NULL PRIMARY KEY (title, artifact); Tables: Artifacts (artifact K, type, title, authors (list), keywords (set), venue, year); Artifacts_by_title (title K, year C↓, artifact C↑, type, authors (list), keywords (set), venue)
  • 75. Selecting Column Data Types Slide 75 ASCII BIGINT BLOB BOOLEAN COUNTER DATE DECIMAL DOUBLE FLOAT INET INT LIST MAP SET SMALLINT TEXT TIME TIMESTAMP TIMEUUID TINYINT TUPLE UUID VARCHAR VARINT CREATE TYPE library.ADDRESS ( street TEXT, city TEXT, state TEXT, postal_code TEXT ); CREATE TABLE library.users ( id UUID PRIMARY KEY, name TEXT, other_names SET<TEXT>, phones MAP<TEXT,TEXT>, current_address ADDRESS, past_addresses LIST<FROZEN<ADDRESS>> );
  • 76. Indexing Options • Local indexes • Secondary indexes • SSTable-attached secondary indexes (SASI) • Distributed indexes • Materialized views Slide 76
  • 77. When to Use a Secondary Index? • Queries on low-cardinality columns • Mostly analytical queries with larger result sets • Generally, these are expensive queries • Queries that involve both partition key and indexed column • Searching within a large partition • Efficient queries Slide 77 Venues name K year K country IDX homepage INT TEXT TEXT TEXT Reviews_by_artifact artifact K review C↓ rating IDX title body user INT TEXT TEXT TEXT TIMEUUID UUID
  • 78. When to Use a Materialized View? • Queries on higher-cardinality columns • Similar advantages to those of regular tables • Convenience of automatic view maintenance • Reads are as fast as for regular tables • Important limitations • Restrictions on how PRIMARY KEY is constructed • Slower writes to the base table • Base-view inconsistencies
  • 79. Concurrent Data Access and Data Consistency • Implementing a voting system for artifacts • Two users submit their votes concurrently for the same artifact • Timeline: User 1: read(votes:10), write(votes:11); User 2: read(votes:10), write(votes:11); the resulting count of 11 is incorrect • Table: Votes_by_artifact (artifact TEXT K, votes INT)
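The lost update in the voting scenario above can be replayed step by step. The following is a minimal sketch that uses a plain Python dictionary as a stand-in for the Votes_by_artifact table; no Cassandra driver is involved, and the interleaving of the two users is fixed by hand for illustration.

```python
# In-memory stand-in for Votes_by_artifact (artifact -> votes).
# Illustration only: this is not how a Cassandra client is written.
artifact = "conf/cassandra/Ellis11(1)"
votes_by_artifact = {artifact: 10}

# Both users read the current count before either one writes.
user1_read = votes_by_artifact[artifact]  # 10
user2_read = votes_by_artifact[artifact]  # 10

# Each user writes back "the value I read, plus one".
votes_by_artifact[artifact] = user1_read + 1  # 11
votes_by_artifact[artifact] = user2_read + 1  # 11 again: one vote is lost

final_votes = votes_by_artifact[artifact]
expected_votes = 10 + 2  # two votes were submitted
print(final_votes, expected_votes)
```

Two votes were cast, but the stored count advances by only one, because the second write is based on a stale read.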
  • 80. Lightweight Transactions and Concurrent Data Access • LWTs guarantee correctness • Expensive – four coordinator-replica round trips • Failed LWTs must be repeated – can become a bottleneck • UPDATE votes_by_artifact SET votes = 11 WHERE artifact = 'conf/cassandra/Ellis11(1)' IF votes = 10; • Timeline: User 1: read(votes:10), LWT-write(votes:11); User 2: read(votes:10), LWT-write(votes:11)
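The retry behavior implied by "Failed LWTs must be repeated" can be sketched as a compare-and-set loop. The code below is a simulation under stated assumptions: `compare_and_set` and `vote_with_retry` are hypothetical helpers that mimic the `IF votes = ...` condition against an in-memory dictionary, not calls from any Cassandra driver.

```python
def compare_and_set(table, key, expected, new_value):
    """Hypothetical stand-in for: UPDATE ... SET votes = new_value IF votes = expected."""
    if table.get(key) == expected:
        table[key] = new_value
        return True   # the conditional write was applied
    return False      # the condition failed; nothing was written


def vote_with_retry(table, key, first_read, max_attempts=5):
    """Add one vote, re-reading and retrying whenever the condition fails."""
    current = first_read
    for attempt in range(1, max_attempts + 1):
        if compare_and_set(table, key, current, current + 1):
            return attempt
        current = table[key]  # value changed underneath us: re-read, then retry
    raise RuntimeError("too much contention on " + key)


artifact = "conf/cassandra/Ellis11(1)"
table = {artifact: 10}

# Both users read 10 before either conditional write is issued.
user1_attempts = vote_with_retry(table, artifact, first_read=10)
user2_attempts = vote_with_retry(table, artifact, first_read=10)
print(table[artifact], user1_attempts, user2_attempts)
```

User 2's first conditional write is rejected because the stored value is no longer 10, so the vote is applied only after a re-read. Under heavy contention these repeated attempts are what can turn into the bottleneck the slide mentions.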
  • 81. COUNTERs and Concurrent Data Access • COUNTERs may give slightly inaccurate results • Expensive – mutexed read-before-write • Limited to integral columns and operations of addition or subtraction • UPDATE votes_by_artifact SET votes = votes + 1 WHERE artifact = 'conf/cassandra/Ellis11(1)'; • Table: Votes_by_artifact (artifact TEXT K, votes COUNTER)
  • 82. Eliminating the Need for Concurrent Data Access • Isolating computation by isolating data • Each vote is stored separately – writes are fast • Data aggregation could be a bit more expensive • Table options: Votes_by_artifact (artifact TEXT K, user UUID C↑) or Votes_by_artifact (artifact TEXT K, user UUID C↑, vote INT)
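Storing each vote as its own row removes the read-modify-write step entirely. As a rough sketch, the dictionary below plays the role of the second Votes_by_artifact variant, keyed by (artifact, user); the helper names `cast_vote` and `count_votes` are made up for this illustration.

```python
import uuid

# Stand-in for Votes_by_artifact with primary key (artifact, user) and a vote column.
votes = {}


def cast_vote(artifact, user, vote=1):
    # Like a CQL INSERT, writing the same primary key again overwrites the row,
    # so no prior read is needed and repeated submissions do not double count.
    votes[(artifact, user)] = vote


def count_votes(artifact):
    # Aggregation happens at read time by scanning the artifact's rows.
    return sum(v for (a, _user), v in votes.items() if a == artifact)


artifact = "conf/cassandra/Ellis11(1)"
user1, user2 = uuid.uuid4(), uuid.uuid4()

cast_vote(artifact, user1)
cast_vote(artifact, user2)
cast_vote(artifact, user1)  # a repeated submission from the same user

total = count_votes(artifact)
print(total)
```

Writes never conflict because each user touches only their own row; the cost moves to the read side, which now has to aggregate over the partition.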
  • 83. Adding Additional Columns To a Table • Avoiding querying multiple tables or partitions • Storing aggregate values for faster access • Before: Artifacts (artifact TEXT K, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT, year INT); Ratings_by_artifact (artifact TEXT K, num_ratings COUNTER, sum_ratings COUNTER) • After: Artifacts (artifact TEXT K, avg_rating FLOAT, type TEXT, title TEXT, authors LIST<TEXT>, keywords SET<TEXT>, venue TEXT, year INT)
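The stored aggregate is simply the ratio of the two counters. A minimal sketch, assuming the application derives avg_rating from num_ratings and sum_ratings (the slide does not say when or how the value is refreshed); the function name is hypothetical.

```python
def avg_rating(num_ratings, sum_ratings):
    """Average rating from the two Ratings_by_artifact counters, or None if unrated."""
    if num_ratings == 0:
        return None  # no ratings yet; avoid dividing by zero
    return sum_ratings / num_ratings


# Three ratings of 5, 4, and 3 leave the counters at num_ratings=3, sum_ratings=12.
ratings = [5, 4, 3]
result = avg_rating(len(ratings), sum(ratings))
print(result)
```

Keeping this value in Artifacts means a query for an artifact no longer has to read Ratings_by_artifact as well.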
  • 84. Physical Data Model for Digital Library • ACCESS PATTERNS • Q1: Find artifacts for a specified venue; order by year (DESC). • Q2: Find artifacts for a specified author; order by year (DESC). • Q3: Find artifacts with a specified title; order by year (DESC). • Q4: Find artifacts with a specified keyword; order by year (DESC). • Q5: Find information for a specified venue. • Q6: Find an average rating for a specified artifact. • Q7: Find reviews for a specified artifact, possibly with a specified rating. • Q8: Find a number of ‘likes’ for a specified artifact. • Q9: Find reviews for a specified user; order by review timestamp (DESC). • Q10: Find a user with a specified id. • Q11: Find a number of ‘likes’ for a specified review. • Q12: Find information for a specified artifact. ... • TABLES (diagram; column types omitted) • Q5: Venues (name K, year K, country IDX, homepage) • Q1, Q6: Artifacts_by_venue (venue K, year C↓, artifact C↑, avg_rating, type, title, authors, keywords) • Q2, Q6: Artifacts_by_author (author K, year C↓, artifact C↑, avg_rating, type, title, authors, keywords, venue) • Q3, Q6: Artifacts_by_title (title K, year C↓, artifact C↑, avg_rating, type, authors, keywords, venue) • Q4, Q6: Artifacts_by_keyword (keyword K, year K, artifact C↑, avg_rating, type, title, authors, keywords, venue) • Q12, Q6: Artifacts (artifact K, avg_rating, type, title, authors, keywords, venue, year) • Q6: Ratings_by_artifact (artifact K, num_ratings, sum_ratings) • Q7: Reviews_by_artifact (artifact K, review C↓, rating IDX, title, body, user) • Q8: Likes_by_artifact (artifact K, num_likes) • Q9: Reviews_by_user (user K, review C↓, rating, title, body, artifact_id, artifact_title, artifact_authors, user_name S, user_email S) • Q10: Users (id K, name, email) • Q11: Likes_by_review (review K, num_likes)
  • 86. Sensor Networks and Time Series • Data description • Multiple sensor networks are deployed over non-overlapping regions • A sensor network is identified by a unique number • A sensor belongs to exactly one network • A sensor has a unique identifier, location, and characteristics (e.g., accuracy, interface, size) • A sensor records new measurements (e.g., temperature, humidity, pressure) every second Slide 86
  • 87. Conceptual Data Model • Entities: Network (number, region, description, #of_sensors); Sensor (id, location, characteristics); Measurement (timestamp, parameter, value) • Relationships: Network has Sensor (1:n); Sensor records Measurement (1:n) • Keys • has: id • records and Measurement: id, timestamp, parameter
  • 88. Application Workflow and Access Patterns • Access patterns • Q1: Find information about all networks • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC) • Q3: Find information about all sensors in a specified network • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC) • Workflow diagram: Q1 Networks; Q2 Heatmap; Q3 Sensors; Q4 Raw data
  • 89. Logical Data Model • Access patterns • Q1: Find information about all networks • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC) • Q3: Find information about all sensors in a specified network • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC) • Tables • Q1: Networks (number K, description, region, n_sensors) • Q2: Temperatures_by_network (network K, date C↓, hour C↓, sensor C↑, avg_temp, location, region S) • Q3: Sensors_by_network (network K, sensor C↑, location, characteristics (map)) • Q4: Measurements_by_sensor (sensor K, timestamp C↓, parameter C↑, value)
  • 90. Analysis and Optimization • Table Networks • Partition size • Single-row partitions • Small partitions • Optimization • Merge small partitions into a larger partition • “Partition per query” access pattern • One small partition • Before: Networks (number K, description, region, n_sensors) • After: Networks (bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT)
  • 91. Analysis and Optimization • Table Temperatures_by_network • Partition size • Multi-row partitions • Large partitions • Optimization – split partitions • Before: Temperatures_by_network (network K, date C↓, hour C↓, sensor C↑, avg_temp, location, region S) • Split by week: Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date TIMESTAMP C↓, hour INT C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S) • date and hour can be combined into one column: Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S)
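Splitting by week means the application has to derive week_first_day and date_hour from each reading's timestamp. The sketch below shows one way to do that, assuming weeks start on Monday and all timestamps are in UTC; neither assumption is stated on the slide, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def week_first_day(ts):
    """Midnight of the Monday that starts ts's week (Monday start is an assumption)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())


def date_hour(ts):
    """Truncate a timestamp to the hour, combining date and hour in one value."""
    return ts.replace(minute=0, second=0, microsecond=0)


reading = datetime(2019, 11, 14, 15, 42, 7, tzinfo=timezone.utc)  # a Thursday
partition_week = week_first_day(reading)
clustering_hour = date_hour(reading)

# Upper bound on rows per partition: 24 hours x 7 days x sensors per network,
# using the 1,000 sensors per network assumed elsewhere in the deck.
rows_per_partition = 24 * 7 * 1000
print(partition_week, clustering_hour, rows_per_partition)
```

With a weekly bucket, a partition's size is bounded no matter how long the network keeps reporting, at the cost of one extra key component in every query.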
  • 92. Analysis and Optimization • Table Sensors_by_network • Partition size • Multi-row partitions • Small partitions (assuming 1,000 sensors per network) • Optimization • None • Before: Sensors_by_network (network K, sensor C↑, location, characteristics (map)) • After: Sensors_by_network (network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>)
  • 93. Analysis and Optimization • Table Measurements_by_sensor • Partition size • Multi-row partitions • Large partitions • Optimization – split partitions • Before: Measurements_by_sensor (sensor K, timestamp C↓, parameter C↑, value) • Split by date: Measurements_by_sensor (sensor TEXT K, date TIMESTAMP K, timestamp TIMESTAMP C↓, parameter TEXT C↑, value FLOAT) • Because all timestamps have the same date in a partition, we can store a number of seconds elapsed since midnight: Measurements_by_sensor (sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT)
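Replacing the full timestamp with seconds since midnight is a small computation on the client side. A minimal sketch, assuming UTC timestamps and a hypothetical helper name; the count of three parameters per second is taken from the temperature, humidity, and pressure examples in the data description.

```python
from datetime import datetime, timezone


def split_timestamp(ts):
    """Split a timestamp into (date at midnight, seconds elapsed since midnight)."""
    date = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    second = int((ts - date).total_seconds())
    return date, second


measured_at = datetime(2019, 11, 14, 15, 42, 7, tzinfo=timezone.utc)
date, second = split_timestamp(measured_at)

# Upper bound on rows per (sensor, date) partition: one row per second per parameter.
seconds_per_day = 24 * 60 * 60
max_rows_per_partition = seconds_per_day * 3
print(date, second, max_rows_per_partition)
```

The original timestamp can always be rebuilt as date plus second, so no information is lost by storing the smaller INT in each row.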
  • 94. Analysis and Optimization • Duplication • How many times is region stored per network? • Once in table Networks • Once in table Temperatures_by_network • Static column value is stored only once in a partition • Tables: Networks (bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT); Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S)
  • 95. Analysis and Optimization • Duplication • How many times is location stored per sensor? • Once in table Sensors_by_network • 24 x 7 = 168 times in each partition in table Temperatures_by_network • Duplication across partitions • Duplication across rows in a partition • Tables: Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S); Sensors_by_network (network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>)
  • 96. Physical Data Model • Access patterns • Q1: Find information about all networks • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC) • Q3: Find information about all sensors in a specified network • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC) • Tables • Q1: Networks (bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT) • Q2: Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S) • Q3: Sensors_by_network (network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>) • Q4: Measurements_by_sensor (sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT)
  • 97. Physical Data Model • CREATE TABLE networks ( bucket INT, number INT, description TEXT, region TEXT, n_sensors INT, PRIMARY KEY (bucket, number) ); • -- Q1 SELECT * FROM networks WHERE bucket = 1; • Diagram: Networks (bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT)
  • 98. Physical Data Model • CREATE TABLE temperatures_by_network ( network INT, week_first_day TIMESTAMP, date_hour TIMESTAMP, sensor TEXT, avg_temp FLOAT, location TEXT, region TEXT STATIC, PRIMARY KEY ((network, week_first_day), date_hour, sensor) ) WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC); • -- Q2 SELECT * FROM temperatures_by_network WHERE network = ? AND week_first_day = ? AND date_hour >= ? AND date_hour <= ?; • Diagram: Temperatures_by_network (network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S)
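Because week_first_day is part of the partition key, a Q2 date range that crosses a week boundary has to be answered with one query per week. The sketch below lists the partition keys to query, newest first so that concatenated results keep the descending order; it assumes Monday-based weeks in UTC, and `partitions_for_range` is a hypothetical helper, not part of any driver.

```python
from datetime import datetime, timedelta, timezone


def week_first_day(ts):
    """Midnight of the Monday that starts ts's week (Monday start is an assumption)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())


def partitions_for_range(network, start, end):
    """(network, week_first_day) keys covering [start, end], newest week first."""
    keys = []
    week = week_first_day(start)
    while week <= end:
        keys.append((network, week))
        week += timedelta(days=7)
    keys.reverse()  # Q2 orders by date DESC, so read the latest week first
    return keys


start = datetime(2019, 11, 8, tzinfo=timezone.utc)
end = datetime(2019, 11, 20, 23, tzinfo=timezone.utc)
keys = partitions_for_range(42, start, end)

for network, week in keys:
    # Each pair would be bound to the Q2 statement together with the same
    # date_hour bounds (start and end).
    print(network, week.date())
```

Each key can be queried independently, so the per-week requests could also be issued in parallel and merged by the application.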
  • 99. Physical Data Model • CREATE TABLE sensors_by_network ( network INT, sensor TEXT, location TEXT, characteristics MAP<TEXT,TEXT>, PRIMARY KEY (network, sensor) ); • -- Q3 SELECT * FROM sensors_by_network WHERE network = ?; • Diagram: Sensors_by_network (network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>)
  • 100. Physical Data Model • CREATE TABLE measurements_by_sensor ( sensor TEXT, date TIMESTAMP, second INT, parameter TEXT, value FLOAT, PRIMARY KEY ((sensor, date), second, parameter) ) WITH CLUSTERING ORDER BY (second DESC, parameter ASC); • -- Q4 SELECT * FROM measurements_by_sensor WHERE sensor = ? AND date = ?; • Diagram: Measurements_by_sensor (sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT)
  • 103. Want to Learn More? • academy.datastax.com • kdm.dataview.org • cassandra.apache.org
  • 104. Thank You • Artem Chebotko, Ph.D. • achebotko@datastax.com • www.linkedin.com/in/artemchebotko
  • 106. 1:1 Relationship Mapping Patterns Slide 106
  • 107. 1:n Relationship Mapping Patterns Slide 107
  • 108. m:n Relationship Mapping Patterns Slide 108