Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra
1. Using the Chebotko Method to Design Sound and Scalable Data Models for Apache Cassandra™
Artem Chebotko, Ph.D.
November, 2019
2. Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
4. Cassandra Use Cases
"Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets."
"Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB)."
Personalization, Fraud detection, Messaging, Playlists, Internet of Things
5. Many Reasons to Choose Cassandra
• High Availability - Always On
  • No single point of failure
  • Fault-tolerance via replication and tunable consistency
  • Best-in-class multi-datacenter support
  • No downtime or interruption due to node maintenance
• Performance and Scalability
  • Very fast writes
  • Fast reads
  • Linear scalability
  • Elasticity
6. How Cassandra Organizes Data
CREATE KEYSPACE library
WITH replication = {'class': 'NetworkTopologyStrategy',
                    'DC-West': '3',
                    'DC-East': '5'};
[Diagram: keyspace library]
7. How Cassandra Organizes Data
CREATE KEYSPACE library
WITH replication = {'class': 'NetworkTopologyStrategy',
                    'DC-West': '3',
                    'DC-East': '5'};
CREATE TABLE library.venues_by_year (
  year INT,
  name TEXT,
  country TEXT,
  homepage TEXT,
  PRIMARY KEY (year, name));
[Diagram: keyspace library containing tables artifacts and venues_by_year]
8. How Cassandra Organizes Data
CREATE TABLE library.venues_by_year (
  ...
  PRIMARY KEY (year, name));
[Diagram: keyspace library containing tables artifacts and venues_by_year; venues_by_year is shown as two partitions]
year | name | country | ...
2019 | Apache Cassandra Summit | USA | ...
2019 | Data Modeling Zone | USA | ...
2019 | DataStax Accelerate | USA | ...

year | name | country | ...
2015 | A ... | ... | ...
2015 | B ... | ... | ...
2015 | C ... | ... | ...
12. Cassandra Query Language (CQL)
• Data Definition
  • CREATE KEYSPACE, CREATE TABLE
  • CREATE INDEX, CREATE CUSTOM INDEX
  • CREATE MATERIALIZED VIEW
• Data Manipulation
  • SELECT
  • INSERT, UPDATE, DELETE
13. CQL CREATE TABLE
CREATE TABLE name (                           -- table name
  column type STATIC,                         -- column names, types, optional STATIC designation
  column type STATIC,
  ...,
  PRIMARY KEY ( (column, ...), column, ... )  -- partition key, optional clustering key
)
WITH CLUSTERING ORDER BY (clustering_key_column (ASC|DESC), ...) ;  -- row ordering in a partition
14. Table with Single-Row Partitions
CREATE TABLE users (
  id UUID,
  name TEXT,
  email TEXT,
  PRIMARY KEY (id)
);
id | email | name
a7e78478-0a54-4949-90f3-14ec4cbea40c | jbellis@datastax.com | Jonathan
67657da3-4443-46ab-b60a-510a658fc7bb | achebotko@datastax.com | Artem
3b1f62b1-386b-46e3-b55d-00f1abbafb2b | patrick@datastax.com | Patrick
[Chebotko diagram] Users: id K (UUID), name (TEXT), email (TEXT)
15. Table with Multi-Row Partitions
CREATE TABLE artifacts_by_venue (
  venue TEXT, year INT,
  artifact TEXT,
  title TEXT,
  country TEXT STATIC,
  PRIMARY KEY ((venue, year), artifact)
);
venue | year | artifact | title | country
DataStax Accelerate | 2019 | A... | Linear Scalability ... | USA
DataStax Accelerate | 2019 | ... | ... | USA
DataStax Accelerate | 2019 | Z... | Building Cloud ... | USA
Data Modeling Zone | 2019 | A... | New approach to ... | USA
Data Modeling Zone | 2019 | ... | ... | USA
[Chebotko diagram] Artifacts_by_venue: venue K (TEXT), year K (INT), artifact C↑ (TEXT), title (TEXT), country S (TEXT)
16. CQL SELECT
SELECT selectors                 -- columns, aggregates, functions
FROM table_name                  -- one table per query
WHERE primary_key_conditions
  AND index_conditions
GROUP BY primary_key_columns
ORDER BY clustering_key_columns ( ASC | DESC )
LIMIT N
ALLOW FILTERING ;                -- danger
Further annotation on the slide: restricted to primary key columns
17. Sample CQL Queries
[Chebotko diagram] Artifacts_by_venue: venue K (TEXT), year K (INT), artifact C↑ (TEXT), title (TEXT), country S (TEXT)
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND year = ?;
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND artifact = ?;
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND
      artifact > ? AND artifact < ?
ORDER BY artifact DESC;
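18. Invalid CQL Queries
[Chebotko diagram] Artifacts_by_venue: venue K (TEXT), year K (INT), artifact C↑ (TEXT), title (TEXT), country S (TEXT)
SELECT * FROM artifacts_by_venue
WHERE venue = ?;
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND artifact = ?;
SELECT * FROM artifacts_by_venue
WHERE artifact > ? AND artifact < ?;
SELECT * FROM artifacts_by_venue
WHERE venue = ? AND year = ? AND title = ?;
SELECT * FROM artifacts_by_venue
WHERE country = ?;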
19. Important Implications for Data Modeling
• Data
  • Primary keys define data uniqueness
  • Partition keys define data distribution
  • Partition keys affect partition sizes
  • Clustering keys define row ordering
• Query
  • Primary keys define how data is retrieved
  • Partition keys allow equality predicates
  • Clustering keys allow inequality predicates and ordering
  • Only one table per query, no joins
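The query rules above can be sketched as a small checker. This is a simplified illustration, not Cassandra's actual validation logic: it ignores secondary indexes, IN predicates, and ALLOW FILTERING, and the names Table and is_supported are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    partition_key: tuple   # columns marked K
    clustering_key: tuple  # columns marked C, in declared order


def is_supported(table, equalities, ranges=()):
    """Simplified check: can the predicate run without an index or ALLOW FILTERING?"""
    eq, rng = set(equalities), set(ranges)
    partition = set(table.partition_key)
    # Only primary key columns may be restricted.
    if not (eq | rng) <= partition | set(table.clustering_key):
        return False
    # Every partition key column needs an equality predicate.
    if rng & partition or not partition <= eq:
        return False
    # Clustering columns must be restricted left to right; a range
    # (or an unrestricted column) closes the prefix.
    prefix_open = True
    for column in table.clustering_key:
        restricted = column in eq or column in rng
        if restricted and not prefix_open:
            return False
        if column not in eq:
            prefix_open = False
    return True


# The table used on the sample and invalid query slides.
artifacts_by_venue = Table(("venue", "year"), ("artifact",))
```

Under these assumptions, is_supported(artifacts_by_venue, ["venue", "year"], ["artifact"]) accepts the range query from the sample slide, while restricting only venue, or the regular column title, is rejected.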
21. Data Modeling
• Collection and analysis of data requirements
• Identification of participating entities and relationships
• Identification of data access patterns
• A particular way of organizing and structuring data
• Design and specification of a database schema
• Schema optimization and data indexing techniques
Data quality: completeness, consistency, accuracy
Data access: queryability, efficiency, scalability
22. Key Data Modeling Steps
• Understand the data
• Identify access patterns
• Apply a query-first approach
• Optimize and implement
[Diagram: Conceptual Data Model, Application Workflow, Logical Data Model, Physical Data Model]
23. The Data Modeling Framework
[Diagram: Conceptual Data Model + Application Workflow -(map)-> Logical Data Model -(optimize)-> Physical Data Model]
Defines models and transitions
25. Conceptual Data Modeling
• Conceptual data model
  • High-level view of data - entity and relationship types
  • Technology-independent or technology-agnostic
  • Not specific to Cassandra or any other database system
• Purpose: understanding data
  • The scope of what needs to be accomplished
  • Essential components, concepts, entities, relationships
  • Keys and cardinality constraints are absolutely essential
26. Advantages of Conceptual Data Modeling
• Complex data modeling problems are more manageable
• Less chance of producing an incorrect or incomplete model
• Saves time in the long run
• Improves data, business process, and risk management
• Readable by both technical and non-technical people
• Good for data governance
• Understanding of a data modeling problem is documented
• Can be shared, reviewed and agreed on
• Improves understanding and eliminates ambiguity
27. Conceptual Data Modeling Techniques
• Entity-relationship modeling
  • Entities and relationships that can exist among them
• Unified Modeling Language (UML) class diagrams
  • Classes and associations that can exist among them
• Object-role modeling or fact-oriented modeling
  • Entities and facts that define relationships among them
• Dimensional modeling
  • Dimensions and facts that relate them
• Ontological modeling
  • Concepts and relations that can exist among them
28. ER Model
• Chen's notation
  • Simple and original
  • Truly technology-independent
• Other notations may be influenced by relational databases
  • Crow's foot, Barker's notation, Information engineering, IDEF1X
  • Many hybrids exist
29. ER Model Basics
• Entity - object that is involved in an information system
  • Example: Jonathan Ellis (an author), "The State of Cassandra, 2014" (a presentation)
• Entity type - set of similar objects
  • Example: Author, Presentation
• Relationship - relates two or more entities
  • Example: Jonathan Ellis creates "The State of Cassandra, 2014"
• Relationship type - set of similar relationships
  • Example: Author creates Presentation
30. Entity Type Example
• Name - usually a noun
• Attributes - atomic, set-valued, composite, derived
• Key - minimum set of attributes that uniquely identify an entity
[ER diagram: entity type User with attributes id, name (first_name, last_name), DOB, age, emails]
31. Relationship Type Example
• Name - usually a verb
• Attributes - can be atomic, set-valued, composite, derived
• Roles - each role names a related entity
• Key - minimum set of roles and attributes that uniquely identify a relationship
• Cardinality constraints - how many times an entity can participate in a relationship
[ER diagram: Venue (name, year, country, homepage) -1- features -n- Artifact (id, title, authors, keywords); attribute date also shown]
32. How is a Key of a Relationship Derived?
• 1:1 relationship: Author -1- has -1- Bio; key: fname, lname
• 1:n relationship: Venue -1- features -n- Artifact; key: id
• m:n relationship: Author -m- creates -n- Artifact; key: fname, lname, id
[ER diagrams: Author (fname, lname, affiliation, interests), Bio (fname, lname, bio, references), Venue (name, year, country, homepage), Artifact (id, title, authors, keywords); attribute date also shown]
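The three cases on this slide follow the usual ER rule for deriving a relationship key from its cardinality. A minimal sketch, assuming a binary relationship with no key attributes of its own (the function name relationship_key is hypothetical):

```python
def relationship_key(cardinality, key1, key2):
    """Derive the key of a binary relationship between entity types 1 and 2.

    key1 and key2 are the key attributes of the two entity types.
    """
    if cardinality == "1:1":
        # Either side identifies the relationship; this sketch picks side 1.
        return tuple(key1)
    if cardinality == "1:n":
        # Each entity on the "n" side takes part at most once.
        return tuple(key2)
    if cardinality == "m:n":
        # Both sides are needed to identify one relationship instance.
        return tuple(key1) + tuple(key2)
    raise ValueError(f"unsupported cardinality: {cardinality}")


author_key = ("fname", "lname")
print(relationship_key("1:1", author_key, author_key))  # Author has Bio
print(relationship_key("1:n", ("name", "year"), ("id",)))  # Venue features Artifact
print(relationship_key("m:n", author_key, ("id",)))  # Author creates Artifact
```

The Venue key (name, year) used in the second call is an assumption based on the Venues table later in the deck; it does not affect the result, because the 1:n case uses only the Artifact key.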
33. Entity Type Hierarchy Example
• Attribute inheritance: id, title, authors, and keywords are inherited by Article and Presentation
• Disjoint - no entity can be both an Article and a Presentation
• Covering - a Digital Artifact cannot be anything but an Article or a Presentation
[ER diagram: Digital Artifact (id, title, keywords, authors) IsA Article, Presentation; disjoint, covering; further attributes shown: url, abstract, doi]
34. Conceptual Data Model for Digital Library
[ER diagram: entity types User (id, name, email), Venue (name, year, country, homepage), Digital Artifact (id, title, keywords, authors; IsA Article or Presentation, disjoint and covering), and Review (id, title, body, rating); relationship types posts, likes (shown twice), and features (shown twice), with 1:n and m:n cardinalities; attribute timestamp also shown]
35. Application Workflow Model
• High-level application design
  • Tasks, causal dependencies, access patterns
  • Technology-independent or technology-agnostic
  • Not specific to Cassandra or any other database system
• Purpose: understand data access patterns
  • Each application has a workflow
  • Data-driven tasks access a database
  • A sequence of tasks defines a sequence of data access patterns
36. Application Workflow for Digital Library
[Workflow diagram: tasks and causal dependencies]
• Search for artifacts by a venue, author, title, or keyword
• Display information for a venue
• Display a rating of an artifact
• Display reviews for an artifact
• Display likes for an artifact
• Find information for an artifact with a given id
• Show information about a user
• Show likes for a review
• Show reviews by a user
37. Application Workflow for Digital Library
ACCESS PATTERNS
Q1: Find artifacts for a specified venue ...
Q2: Find artifacts for a specified author ...
Q3: Find artifacts with a specified title ...
Q4: Find artifacts with a specified keyword ...
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact ...
...
[Workflow diagram: data access patterns]
• Search for artifacts by a venue, author, title, or keyword (Q1, Q2, Q3, Q4)
• Display information for a venue (Q5)
• Display a rating of an artifact (Q6)
• Display reviews for an artifact (Q7)
• Display likes for an artifact (Q8)
• Show reviews by a user (Q9)
• Show information about a user (Q10)
• Show likes for a review (Q11)
• Find information for an artifact with a given id (Q12)
39. Logical Data Modeling
• Logical data model
  • Sketch data model for Cassandra
  • Chebotko Diagrams for visualization
• Purpose: sound, query-driven design
  • Data organization into tables according to the queries
  • Correctness of primary key design
  • Denormalization, nesting, duplication
40. Chebotko Diagrams
• Visual representation of a logical data model
• Tables are represented by rectangles and have names and columns
• Columns may optionally be designated as K (partition key column), C (clustering key column), S (static column), and IDX (indexed column)
• Access patterns and their ordering are represented by query-labeled connections
[Chebotko diagram, partially cropped: tables Venues, Artifacts_by_venue, Artifacts_by_author, Artifacts_by_title, Artifacts_by_keyword, Artifacts, Ratings_by_artifact, Reviews_by_artifact, and Likes_by_artifact, connected by access patterns Q1 to Q8]
42. Mapping Rules
• Mapping rule 1: "Entities and relationships"
  • Entity and relationship types map to tables
• Mapping rule 2: "Key attributes"
  • Key attributes map to primary key columns
• Mapping rule 3: "Equality search attributes"
  • Equality search attributes map to partition key columns
• Mapping rule 4: "Inequality search attributes"
  • Inequality search attributes map to clustering columns
• Mapping rule 5: "Ordering attributes"
  • Ordering attributes map to clustering columns
[Diagram: Conceptual Data Model, Application Workflow]
43. Example
• Conceptual data model
• Query
  • Find artifacts that appeared in a particular venue after a specified year; order results by year (desc) and title (asc)
  • Query predicate (equality and inequality): name = ? AND year > ?
  • Ordering attributes: year (DESC), title (ASC)
[ER diagram: Venue (name, year, country, homepage) -1- features -n- Digital Artifact (id, title, keywords, authors)]
44. Applying Mapping Rules 1 and 2
• Mapping rule 1: "Entities and relationships"
  • The query only concerns a part of the diagram
  • Data should be organized based on the relationship type
• Mapping rule 2: "Key attributes"
  • Key of the relationship is id
  • id must map to a primary key column (say column artifact)
[ER diagram: Venue (name, year, country, homepage) -1- features -n- Digital Artifact (id, title, keywords, authors)]
[Chebotko diagram] Artifacts_by_venue: ..., artifact C, ...
45. Applying Mapping Rule 3
• Mapping rule 3: "Equality search attributes"
  • Equality search attribute name=? maps to the 1st column of the primary key
  • It must be a partition key column (say column venue)
[Chebotko diagram] Artifacts_by_venue: venue K, ..., artifact C, ...
46. Applying Mapping Rule 4
• Mapping rule 4: "Inequality search attributes"
  • Inequality search attribute year>? maps to a clustering column
[Chebotko diagram] Artifacts_by_venue: venue K, year C↑, ..., artifact C, ...
47. Applying Mapping Rule 5
• Mapping rule 5: "Ordering attributes"
  • Ordering attributes year (DESC) and title (ASC) map to clustering columns
  • year is already part of the schema but its order should be reversed to DESC
  • title is added next
[Chebotko diagram, before] Artifacts_by_venue: venue K, year C↑, ..., artifact C, ...
[Chebotko diagram, after] Artifacts_by_venue: venue K, year C↓, title C↑, artifact C, ...
48. Final Result
SELECT *
FROM artifacts_by_venue
WHERE venue = ? AND year > ?
ORDER BY year DESC, title ASC
[Chebotko diagram] Artifacts_by_venue: venue K, year C↓, title C↑, artifact C, type, authors (list), keywords (set)
How did we get column type?
[ER diagrams: Venue (name, year, country, homepage) -1- features -n- Digital Artifact (id, title, keywords, authors); Digital Artifact IsA Article, Presentation (disjoint, covering)]
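The walk through mapping rules 2 to 5 can be sketched as one function that assembles a primary key from a query's search and ordering attributes. This is a simplified illustration rather than the method's formal definition; the function name derive_primary_key and the default ASC order for columns without a stated ordering are assumptions.

```python
def derive_primary_key(equality, inequality, ordering, relationship_key):
    """Return (partition key columns, [(clustering column, order), ...])."""
    # Rule 3: equality search attributes become partition key columns.
    partition_key = list(equality)
    # Rule 4: inequality search attributes become clustering columns.
    clustering = {column: "ASC" for column in inequality}
    # Rule 5: ordering attributes become clustering columns; a column that is
    # already present keeps its position and takes the requested order.
    for column, order in ordering:
        clustering[column] = order
    # Rule 2: remaining key attributes complete the primary key.
    for column in relationship_key:
        if column not in partition_key and column not in clustering:
            clustering[column] = "ASC"
    return partition_key, list(clustering.items())


# Query from the example: name = ? AND year > ?, ordered by year DESC, title ASC.
# Column names venue and artifact stand for attributes name and id.
partition, clustering = derive_primary_key(
    equality=["venue"],
    inequality=["year"],
    ordering=[("year", "DESC"), ("title", "ASC")],
    relationship_key=["artifact"],
)
print(partition)   # ['venue']
print(clustering)  # [('year', 'DESC'), ('title', 'ASC'), ('artifact', 'ASC')]
```

The result matches the column order of the final Artifacts_by_venue table: venue as the partition key, followed by year, title, and artifact as clustering columns.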
49. Mapping Patterns
• Semi-formal definitions of common mapping use cases
  • Graphical rather than mathematical representation
  • Use clustering columns as the data nesting mechanism
  • Do not take ordering of results into consideration
• Guide schema design
  • Ensure correctness and efficiency
  • Enable automation
51. 1:n relationship mapping pattern 3.1
• Search attributes = key attributes
[ER diagram: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) -1- RT (attr) -n- ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3)]
ACCESS PATTERN
search attributes: key1.1, key1.2 (predicates shown: =, >)
[Chebotko diagram] ET2_by_ET1_key: key1.1 K, key1.2 K, key2.1 C, key2.2 C, attr1.1 S, attr1.2 S, attr1.3 (collection) S, attr2.1, attr2.2, attr2.3 (collection), attr
[Chebotko diagram] ET2_by_ET1_key: key1.1 K, key1.2 C, key2.1 C, key2.2 C, attr2.1, attr2.2, attr2.3 (collection), attr
PRIMARY KEY: All search attributes, followed by all key attributes of RT
STATIC COLUMNS: Non-key attributes of ET1, iff all key attributes of ET1 are part of the partition key
What if we add green attributes to the above table?
52. 1:n relationship mapping pattern 3.1 (example)
• Search attributes = key attributes
[ER diagram: Venue (name, year, country, homepage) -1- features -n- Artifact (id, title, authors, keywords)]
ACCESS PATTERN
search attributes: name, year (predicates shown: =, >)
[Chebotko diagram] Artifacts_by_venue: venue (= name) K, year K, artifact (= id) C, country S, homepage S, title, authors (list), keywords (set)
[Chebotko diagram] Artifacts_by_venue: venue (= name) K, year C, artifact (= id) C, title, authors (list), keywords (set)
PRIMARY KEY: All search attributes, followed by all key attributes of features
STATIC COLUMNS: Non-key attributes of Venue, iff all key attributes of Venue are part of the partition key
What about country and homepage?
53. 1:n relationship mapping pattern 3.2
• Search attributes ≠ key attributes
[ER diagram: ET1 (key1.1, key1.2, attr1.1, attr1.2, attr1.3) -1- RT (attr) -n- ET2 (key2.1, key2.2, attr2.1, attr2.2, attr2.3)]
ACCESS PATTERN
search attributes: attr1.1, attr1.2 (predicates shown: =, >)
[Chebotko diagram] ET2_by_ET1_non-key: attr1.1 K, attr1.2 K, key2.1 C, key2.2 C, attr2.1, attr2.2, attr2.3 (collection), attr
[Chebotko diagram] ET2_by_ET1_non-key: attr1.1 K, attr1.2 C, key2.1 C, key2.2 C, attr2.1, attr2.2, attr2.3 (collection), attr
PRIMARY KEY: All search attributes, followed by all key attributes of RT
All ET1's attributes can be added at the cost of duplicating them for every entity of type 2
54. 1:n relationship mapping pattern 3.2 (example)
• Search attributes ≠ key attributes
[ER diagram: Venue (name, year, country, homepage) -1- features -n- Artifact (id, title, authors, keywords)]
ACCESS PATTERN
search attributes: country, year (predicates shown: =, >)
[Chebotko diagram] Artifacts_by_country: country K, year K, artifact (= id) C, title, authors (list), keywords (set)
[Chebotko diagram] Artifacts_by_country: country K, year C, artifact (= id) C, title, authors (list), keywords (set)
PRIMARY KEY: All search attributes, followed by all key attributes of features
name and homepage can be added at the cost of duplicating them for every artifact
55. Logical Data Model for Digital Library
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of "likes" for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of "likes" for a specified review.
Q12: Find information for a specified artifact.
...
[Chebotko diagram: tables and the access patterns they serve]
Venues (Q5): name K, year K, country IDX, homepage
Artifacts_by_venue (Q1): venue K, year C↓, artifact C, type, title, authors (list), keywords (set)
Artifacts_by_author (Q2): author K, year C↓, artifact C, type, title, authors (list), keywords (set), venue
Artifacts_by_title (Q3): title K, year C↓, artifact C, type, authors (list), keywords (set), venue
Artifacts_by_keyword (Q4): keyword K, year C↓, artifact C, type, title, authors (list), keywords (set), venue
Ratings_by_artifact (Q6): artifact K, num_ratings (counter), sum_ratings (counter)
Reviews_by_artifact (Q7): artifact K, review (timeuuid) C, rating IDX, title, body, user
Likes_by_artifact (Q8): artifact K, num_likes (counter)
Reviews_by_user (Q9): user K, review (timeuuid) C↓, rating, title, body, artifact_id, artifact_title, artifact_authors (list), user_name S, user_email S
Users (Q10): id K, name, email
Likes_by_review (Q11): review K, num_likes (counter)
Artifacts (Q12): artifact K, type, title, authors (list), keywords (set), venue, year
57. Physical Data Modeling
• Physical data model
  • Complete data model for Cassandra
  • Chebotko Diagrams for visualization
  • CQL for a database schema
• Purpose: efficient, implementation-ready design
  • Data model efficiency analysis and validation
  • Schema design optimizations
  • Techniques for concurrent data access
58. Logical Data Model - Correctness and Efficiency
• But ...
  • Database engine has limitations
  • Resources are finite
  • Some operations may require special considerations
59. Physical Data Model - More Efficiency
• Physical data model takes into account ...
  • Partition sizes
  • Data duplication factors
  • Data types, indexes, materialized views
  • Concurrent data access requirements
60. Partition Size Limits
• Theoretical limits
  • 2 billion values
  • Node disk size
• Practical limits for Cassandra
Practical limit | Cassandra 2 | Cassandra 3
Values per partition | 100K | Up to 10x
Partition size | 100MB | Up to 10x
61. Estimating a Partition Size (Cassandra 3)
Nv - number of values in a partition
Ncv - number of clustering column values in a partition
Nrv - number of regular column values in a partition
Nsv - number of static column values in a table definition
Nr - number of rows in a partition
Ncc - number of clustering columns in a table definition
Nrc - number of regular columns in a table definition
Nsc - number of static columns in a table definition
Sp - size of a partition in bytes
sizeOf - size (in bytes) of a CQL data type
ck - partition key column in a table definition
cs - static column in a table definition
cr - regular column in a table definition
cc - clustering column in a table definition
sizeOf(tavg) - average size (in bytes) of a timestamp delta associated with a value
62. Simplified Example: Values
• User with 1000 reviews = 1000 rows in a partition
Reviews_by_user
column | key | type | values
user | K | UUID |
review | C↓ | TIMEUUID | 1000
rating | | FLOAT | 1000
title | | TEXT | 1000
user_name | S | TEXT | 1
user_email | S | TEXT | 1
Total: 1000 + 1000 + 1000 + 1 + 1 = 3002 values
63. Simplified Example: Bytes
• User with 1000 reviews = 1000 rows in a partition
Reviews_by_user
column | key | type | bytes
user | K | UUID | 16
review | C↓ | TIMEUUID | 1000 x 16
rating | | FLOAT | 1000 x 4
title | | TEXT | 1000 x 60
user_name | S | TEXT | 12
user_email | S | TEXT | 20
Subtotal: 80048 bytes
Plus 2002 x 8 bytes
Total: 96064 bytes
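The arithmetic of the two simplified examples can be reproduced with a short sketch. The function name estimate_partition is hypothetical, and the rule that only regular and static values carry the 8-byte per-value overhead is an assumption read off the "2002 x 8" term on the slide, not a statement of Cassandra's storage format.

```python
def estimate_partition(rows, partition_key, clustering, regular, static,
                       timestamp_bytes=8):
    """Return (number of values, estimated size in bytes) for one partition.

    partition_key, clustering, regular, and static are lists of per-column
    value sizes in bytes.
    """
    # One value per clustering and regular column in every row,
    # plus one value per static column in the partition.
    values = rows * (len(clustering) + len(regular)) + len(static)
    data_bytes = (sum(partition_key) + sum(static)
                  + rows * (sum(clustering) + sum(regular)))
    # Assumption from the slide: regular and static values carry the overhead.
    timestamped_values = rows * len(regular) + len(static)
    return values, data_bytes + timestamped_values * timestamp_bytes


# Reviews_by_user with 1000 reviews, using the column sizes from the slide.
values, size = estimate_partition(
    rows=1000,
    partition_key=[16],   # user UUID
    clustering=[16],      # review TIMEUUID
    regular=[4, 60],      # rating FLOAT, title TEXT (60 bytes assumed on the slide)
    static=[12, 20],      # user_name, user_email
)
print(values, size)  # 3002 96064
```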
64. Getting a Partition Size Empirically
$ nodetool flush library reviews_by_user
$ nodetool tablestats -H library.reviews_by_user
Keyspace : library
...
Table: reviews_by_user
SSTable count: 1
...
Number of partitions (estimate): 1
...
Compacted partition minimum bytes: 88149
Compacted partition maximum bytes: 105778
Compacted partition mean bytes: 105778
...
65. Splitting Large Partitions
• Solution: introduce an additional column to a partition key
  • Use an existing column - convenience
  • Use an artificial "bucket" column - more control
• Cons: supported access patterns may change
• Example:
  • Millions of artifacts with the same keyword across different venues and years
[Chebotko diagram] Artifacts_by_keyword: keyword K, year C↓, artifact C, type, title, authors (list), keywords (set), venue
66. Using an Existing Column
[Chebotko diagram, before] Artifacts_by_keyword: keyword K (TEXT), year C↓ (INT), artifact C (TEXT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
[Chebotko diagram, after] Artifacts_by_keyword: keyword K (TEXT), year K (INT), artifact C (TEXT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
67. Using an Artificial Column
[Chebotko diagram, without bucket] Artifacts_by_keyword: keyword K (TEXT), year K (INT), artifact C (TEXT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
[Chebotko diagram, with bucket] Artifacts_by_keyword: keyword K (TEXT), year K (INT), bucket K (INT), artifact C (TEXT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
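One way an application might fill an artificial bucket column is to hash the clustering value into a fixed number of buckets. The sketch below is an assumption for illustration: the deck does not say how bucket values are chosen, and the names bucket_for, partition_keys, and NUM_BUCKETS are hypothetical.

```python
import zlib

NUM_BUCKETS = 10  # hypothetical; chosen so each partition stays within practical limits


def bucket_for(artifact_id, num_buckets=NUM_BUCKETS):
    """Map an artifact id to a stable bucket number in [0, num_buckets)."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(artifact_id.encode("utf-8")) % num_buckets


def partition_keys(keyword, year, num_buckets=NUM_BUCKETS):
    """List every (keyword, year, bucket) partition a reader has to query."""
    return [(keyword, year, bucket) for bucket in range(num_buckets)]


# A write goes to exactly one bucket; a read for a keyword and year fans out
# over all buckets, which is how the supported access patterns change.
print(bucket_for("conf/cassandra/Ellis11(1)"))
print(len(partition_keys("cassandra", 2019)))
```

The trade-off is the one named on the slide: more control over partition size at the price of a changed access pattern, since one logical read becomes NUM_BUCKETS partition reads.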
68. Data Duplication Considerations
• Data duplication is necessary
• Data duplication vs. data replication
• Data duplication factor
• Data duplication and data consistency
69. Duplication Across Tables
• Each artifact is stored once in Artifacts
• Each artifact is stored once in Artifacts_by_venue
• Duplication factor = 2
[Chebotko diagram] Artifacts_by_venue: venue K, year C↓, artifact C, type, title, authors (list), keywords (set)
[Chebotko diagram] Artifacts: artifact K, type, title, authors (list), keywords (set), venue, year
70. Duplication Across Partitions
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5
[Chebotko diagram] Artifacts_by_author: author K, year C↓, artifact C, type, title, authors (list), keywords (set), venue
71. Duplication Across Rows
• An artifact with 5 keywords is stored in 5 rows of the same partition
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5 x 5 = 25
[Chebotko diagram] Artifacts_by_author: author K, keyword C, year C, artifact C, type, title, authors (list), keywords (set), venue
72. Beware of Non-constant Duplication Factors
• Users can add new keywords to artifacts
• There is no limit on the number of keywords per artifact
• An artifact with n keywords is stored in n rows of the same partition
• An artifact with 5 authors is stored in 5 different partitions
• Duplication factor = 5 x n
• Do things differently
• Place reasonable limits that can be gradually increased
[Chebotko diagram] Artifacts_by_author: author K, keyword C, year C, artifact C, type, title, authors (list), keywords (set), venue
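The duplication factors on the last few slides are simple products, which a short sketch makes explicit. The function name duplication_factor and the cap MAX_KEYWORDS are hypothetical; the cap illustrates the "reasonable limits" advice rather than any value given in the deck.

```python
MAX_KEYWORDS = 20  # hypothetical application-level limit


def duplication_factor(num_authors, num_keywords=1):
    """Copies of one artifact in Artifacts_by_author.

    One partition per author; when keyword is a clustering column,
    one row per keyword inside each of those partitions.
    """
    return num_authors * num_keywords


def add_keyword(keywords, keyword, limit=MAX_KEYWORDS):
    """Add a keyword unless the artifact already reached the limit."""
    if keyword not in keywords and len(keywords) >= limit:
        raise ValueError("keyword limit reached")
    return keywords | {keyword}


print(duplication_factor(5))      # 5: duplication across partitions
print(duplication_factor(5, 5))   # 25: duplication across partitions and rows
```

With a limit in place, the worst case for an artifact with 5 authors is bounded by 5 x MAX_KEYWORDS instead of growing with n.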
73. Keeping Up with Data Duplication and Data Consistency
• Insert a new artifact
BEGIN BATCH
  INSERT INTO artifacts ...
  INSERT INTO artifacts_by_title ...
APPLY BATCH;
• Update an existing artifact title
BEGIN BATCH
  UPDATE artifacts ...
  INSERT INTO artifacts_by_title ...
  DELETE FROM artifacts_by_title ...
APPLY BATCH;
[Chebotko diagram] Artifacts_by_title: title K, year C↓, artifact C, type, authors (list), keywords (set), venue
[Chebotko diagram] Artifacts: artifact K, type, title, authors (list), keywords (set), venue, year
74. Keeping Up with Data Duplication and Data Consistency: How About Materialized Views?
Base table:
CREATE TABLE artifacts (
  artifact TEXT,
  type TEXT,
  title TEXT,
  authors LIST<TEXT>,
  keywords SET<TEXT>,
  venue TEXT,
  year INT,
  PRIMARY KEY (artifact)
);
Materialized view:
CREATE MATERIALIZED VIEW artifacts_by_title AS
  SELECT title, year, artifact, type, authors, keywords, venue
  FROM artifacts
  WHERE title IS NOT NULL AND artifact IS NOT NULL
  PRIMARY KEY (title, artifact);
[Chebotko diagram] Artifacts_by_title: title K, year C, artifact C, type, authors (list), keywords (set), venue
[Chebotko diagram] Artifacts: artifact K, type, title, authors (list), keywords (set), venue, year
75. Selecting Column Data Types
ASCII, BIGINT, BLOB, BOOLEAN, COUNTER, DATE, DECIMAL, DOUBLE, FLOAT, INET, INT, LIST, MAP, SET, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, TUPLE, UUID, VARCHAR, VARINT
CREATE TYPE library.ADDRESS (
  street TEXT,
  city TEXT,
  state TEXT,
  postal_code TEXT
);
CREATE TABLE library.users (
  id UUID PRIMARY KEY,
  name TEXT,
  other_names SET<TEXT>,
  phones MAP<TEXT,TEXT>,
  current_address ADDRESS,
  past_addresses LIST<FROZEN<ADDRESS>>
);
77. When to Use a Secondary Index?
• Queries on low-cardinality columns
  • Mostly analytical queries with larger result sets
  • Generally, these are expensive queries
• Queries that involve both partition key and indexed column
  • Searching within a large partition
  • Efficient queries
[Chebotko diagram] Venues: name K (TEXT), year K (INT), country IDX (TEXT), homepage (TEXT)
[Chebotko diagram] Reviews_by_artifact: artifact K (TEXT), review C (TIMEUUID), rating IDX (INT), title (TEXT), body (TEXT), user (UUID)
78. When to Use a Materialized View?
• Queries on higher-cardinality columns
  • Similar advantages as those of regular tables
  • Convenience of automatic view maintenance
  • Reads are as fast as for regular tables
• Important limitations
  • Restrictions on how PRIMARY KEY is constructed
  • Slower writes to the base table
  • Base-view inconsistencies
79. Concurrent Data Access and Data Consistency
• Implementing a voting system for artifacts
• Two users submit their votes concurrently for the same artifact
[Timeline]
User 1: read(votes:10) write(votes:11)
User 2: read(votes:10) write(votes:11)
Result: incorrect
[Chebotko diagram] Votes_by_artifact: artifact K (TEXT), votes (INT)
80. Lightweight Transactions and Concurrent Data Access
• LWTs guarantee correctness
• Expensive - four coordinator-replica round trips
• Failed LWTs must be repeated - can become a bottleneck
UPDATE votes_by_artifact
SET votes = 11
WHERE artifact = 'conf/cassandra/Ellis11(1)'
IF votes = 10;
[Timeline]
User 1: read(votes:10) LWT-write(votes:11)
User 2: read(votes:10) LWT-write(votes:11)
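The difference between the plain read-then-write and the conditional write can be sketched with an in-memory stand-in for the table. This is a single-process illustration of the interleaving, not driver code: VotesTable and its methods are hypothetical, and write_if only mimics the "IF votes = ..." condition of the UPDATE above.

```python
class VotesTable:
    """In-memory stand-in for votes_by_artifact (illustration only)."""

    def __init__(self, votes):
        self.votes = dict(votes)

    def read(self, artifact):
        return self.votes[artifact]

    def write(self, artifact, value):
        self.votes[artifact] = value

    def write_if(self, artifact, expected, value):
        """Apply the write only if the stored value still equals expected."""
        if self.votes[artifact] != expected:
            return False
        self.votes[artifact] = value
        return True


ARTIFACT = "conf/cassandra/Ellis11(1)"


def plain_interleaving():
    table = VotesTable({ARTIFACT: 10})
    seen_by_user1 = table.read(ARTIFACT)
    seen_by_user2 = table.read(ARTIFACT)        # reads before user 1 writes
    table.write(ARTIFACT, seen_by_user1 + 1)
    table.write(ARTIFACT, seen_by_user2 + 1)    # overwrites user 1's vote
    return table.read(ARTIFACT)


def conditional_interleaving():
    table = VotesTable({ARTIFACT: 10})
    seen_by_user1 = table.read(ARTIFACT)
    seen_by_user2 = table.read(ARTIFACT)
    table.write_if(ARTIFACT, seen_by_user1, seen_by_user1 + 1)
    # User 2's condition fails, so the read and write are repeated.
    while not table.write_if(ARTIFACT, seen_by_user2, seen_by_user2 + 1):
        seen_by_user2 = table.read(ARTIFACT)
    return table.read(ARTIFACT)


print(plain_interleaving())        # 11: one vote is lost
print(conditional_interleaving())  # 12: both votes are counted
```

The retry loop is where the bottleneck mentioned on the slide comes from: under contention, every failed conditional write costs another read and another attempt.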
81. COUNTERs and Concurrent Data Access
• COUNTERs may give slightly inaccurate results
• Expensive - mutexed read-before-write
• Limited to integral columns and operations of addition or subtraction
UPDATE votes_by_artifact
SET votes = votes + 1
WHERE artifact = 'conf/cassandra/Ellis11(1)';
[Chebotko diagram] Votes_by_artifact: artifact K (TEXT), votes (COUNTER)
82. Eliminating the Need for Concurrent Data Access
• Isolating computation by isolating data
• Each vote is stored separately - writes are fast
• Data aggregation could be a bit more expensive
[Chebotko diagram] Votes_by_artifact: artifact K (TEXT), user C (UUID)
or
[Chebotko diagram] Votes_by_artifact: artifact K (TEXT), user C (UUID), vote (INT)
83. Adding Additional Columns To a Table
• Avoiding querying multiple tables or partitions
• Storing aggregate values for faster access
[Chebotko diagram, before] Artifacts: artifact K (TEXT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT), year (INT)
[Chebotko diagram] Ratings_by_artifact: artifact K (TEXT), num_ratings (COUNTER), sum_ratings (COUNTER)
[Chebotko diagram, after] Artifacts: artifact K (TEXT), avg_rating (FLOAT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT), year (INT)
84. Physical Data Model for Digital Library
ACCESS PATTERNS
Q1: Find artifacts for a specified venue; order by year (DESC).
Q2: Find artifacts for a specified author; order by year (DESC).
Q3: Find artifacts with a specified title; order by year (DESC).
Q4: Find artifacts with a specified keyword; order by year (DESC).
Q5: Find information for a specified venue.
Q6: Find an average rating for a specified artifact.
Q7: Find reviews for a specified artifact, possibly with a specified rating.
Q8: Find a number of "likes" for a specified artifact.
Q9: Find reviews for a specified user; order by review timestamp (DESC).
Q10: Find a user with a specified id.
Q11: Find a number of "likes" for a specified review.
Q12: Find information for a specified artifact.
...
[Chebotko diagram: tables, column types, and the access patterns they serve]
Venues (Q5): name K (TEXT), year K (INT), country IDX (TEXT), homepage (TEXT)
Artifacts_by_venue (Q1, Q6): venue K (TEXT), year C↓ (INT), artifact C (TEXT), avg_rating (FLOAT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>)
Artifacts_by_author (Q2, Q6): author K (TEXT), year C↓ (INT), artifact C (TEXT), avg_rating (FLOAT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
Artifacts_by_title (Q3, Q6): title K (TEXT), year C↓ (INT), artifact C (TEXT), avg_rating (FLOAT), type (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
Artifacts_by_keyword (Q4, Q6): keyword K (TEXT), year K (INT), artifact C (TEXT), avg_rating (FLOAT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT)
Ratings_by_artifact (Q6): artifact K (TEXT), num_ratings (COUNTER), sum_ratings (COUNTER)
Reviews_by_artifact (Q7): artifact K (TEXT), review C (TIMEUUID), rating IDX (INT), title (TEXT), body (TEXT), user (UUID)
Likes_by_artifact (Q8): artifact K (TEXT), num_likes (COUNTER)
Reviews_by_user (Q9): user K (UUID), review C↓ (TIMEUUID), rating (FLOAT), title (TEXT), body (TEXT), artifact_id (TEXT), artifact_title (TEXT), artifact_authors (LIST<TEXT>), user_name S (TEXT), user_email S (TEXT)
Users (Q10): id K (UUID), name (TEXT), email (TEXT)
Likes_by_review (Q11): review K (TIMEUUID), num_likes (COUNTER)
Artifacts (Q12, Q6): artifact K (TEXT), avg_rating (FLOAT), type (TEXT), title (TEXT), authors (LIST<TEXT>), keywords (SET<TEXT>), venue (TEXT), year (INT)
85. Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 85
86. Sensor Networks and Time Series
• Data description
  • Multiple sensor networks are deployed over non-overlapping regions
  • A sensor network is identified by a unique number
  • A sensor belongs to exactly one network
  • A sensor has a unique identifier, location, and characteristics (e.g., accuracy, interface, size)
  • A sensor records new measurements (e.g., temperature, humidity, pressure) every second
Slide 86
87. Conceptual Data Model
• Keys
  • has: id
  • records and Measurement: id, timestamp, parameter
Slide 87
[ER diagram]
Network (number, region, description, #of_sensors) has Sensor (id, location, characteristics); cardinality 1:n
Sensor records Measurement (timestamp, parameter, value); cardinality 1:n
88. Application Workflow and Access Patterns
• Access patterns
  • Q1: Find information about all networks
  • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
  • Q3: Find information about all sensors in a specified network
  • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 88
[Application workflow diagram] Q1: Networks; Q2: Heatmap; Q3: Sensors; Q4: Raw data
89. Logical Data Model
• Access patterns
  • Q1: Find information about all networks
  • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
  • Q3: Find information about all sensors in a specified network
  • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 89
Networks (Q1): number K, description, region, n_sensors
Temperatures_by_network (Q2): network K, date C↓, hour C↓, sensor C↑, avg_temp, location, region S
Sensors_by_network (Q3): network K, sensor C↑, location, characteristics (map)
Measurements_by_sensor (Q4): sensor K, timestamp C↓, parameter C↑, value
90. Analysis and Optimization
• Table Networks
  • Partition size
    • Single-row partitions
    • Small partitions
  • Optimization
    • Merge small partitions into a larger partition
    • “Partition per query” access pattern
    • One small partition
Slide 90
Before: Networks: number K, description, region, n_sensors
After: Networks: bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT
91. Analysis and Optimization
• Table Temperatures_by_network
  • Partition size
    • Multi-row partitions
    • Large partitions
  • Optimization: split partitions
Slide 91
Before: Temperatures_by_network: network K, date C↓, hour C↓, sensor C↑, avg_temp, location, region S
Split by week: Temperatures_by_network: network INT K, week_first_day TIMESTAMP K, date TIMESTAMP C↓, hour INT C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
Note: date and hour can be combined into one column
After: Temperatures_by_network: network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
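The split above implies that the application derives week_first_day and date_hour from each measurement time before writing a row. A minimal sketch of that derivation follows; the helper names, the choice of Monday as the first day of the week, and the use of UTC are assumptions for illustration, not details given on the slide.

```python
from datetime import datetime, timedelta, timezone

def week_first_day(ts: datetime) -> datetime:
    """Midnight (UTC) of the Monday starting the week that contains ts.
    Assumption: weeks start on Monday; ts is timezone-aware."""
    day = ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=day.weekday())

def date_hour(ts: datetime) -> datetime:
    """Truncate ts to the start of its hour (UTC), merging date and hour into one value."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)

def temperature_key(network: int, sensor: str, ts: datetime):
    """Return (partition key, clustering key) for one row of temperatures_by_network."""
    return (network, week_first_day(ts)), (date_hour(ts), sensor)

# Hypothetical sample reading taken on a Thursday afternoon.
reading_time = datetime(2019, 11, 14, 15, 42, 7, tzinfo=timezone.utc)
partition_key, clustering_key = temperature_key(1, "s1001", reading_time)
```

All readings of one network that fall within the same week map to the same partition key, so a partition holds at most one week of hourly rows per sensor.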
92. Analysis and Optimization
• Table Sensors_by_network
  • Partition size
    • Multi-row partitions
    • Small partitions (assuming 1,000 sensors per network)
  • Optimization
    • None
Slide 92
Before: Sensors_by_network: network K, sensor C↑, location, characteristics (map)
After: Sensors_by_network: network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>
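The "small partitions" assessment can be checked with a rough count of values stored per partition. The sketch below uses the estimate Nv = Nr x (Nc - Npk - Ns) + Ns from the Chebotko methodology paper (rows, columns, primary key columns, static columns); this section does not restate that formula, so it and the helper name partition_values should be read as assumptions. The map column is counted as a single value here, which understates its real size.

```python
def partition_values(rows: int, columns: int, key_columns: int, static_columns: int = 0) -> int:
    """Estimate values in a partition: rows * (columns - key - static) + static."""
    regular_columns = columns - key_columns - static_columns
    return rows * regular_columns + static_columns

SENSORS_PER_NETWORK = 1_000  # assumption stated on the slide

# sensors_by_network: 4 columns, 2 primary key columns (network, sensor), no static columns
sensors_by_network_values = partition_values(SENSORS_PER_NETWORK, columns=4, key_columns=2)
print(sensors_by_network_values)
```

A count in the low thousands is consistent with the slide's conclusion that no optimization is needed for this table.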
93. Analysis and Optimization
• Table Measurements_by_sensor
  • Partition size
    • Multi-row partitions
    • Large partitions
  • Optimization: split partitions
Slide 93
Before: Measurements_by_sensor: sensor K, timestamp C↓, parameter C↑, value
Split by date: Measurements_by_sensor: sensor TEXT K, date TIMESTAMP K, timestamp TIMESTAMP C↓, parameter TEXT C↑, value FLOAT
Note: because all timestamps have the same date in a partition, we can store a number of seconds elapsed since midnight
After: Measurements_by_sensor: sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT
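Replacing timestamp with second means the application splits each measurement time into a date and a seconds-since-midnight offset on write, and recombines them on read. A minimal sketch of that conversion is shown below; the function names and the use of UTC are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def split_timestamp(ts: datetime) -> tuple[datetime, int]:
    """Split a timezone-aware timestamp into (date at midnight UTC, seconds since midnight)."""
    ts = ts.astimezone(timezone.utc)
    date = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    second = int((ts - date).total_seconds())
    return date, second

def join_timestamp(date: datetime, second: int) -> datetime:
    """Rebuild the measurement time from the date and second columns."""
    return date + timedelta(seconds=second)

# Hypothetical sample measurement time.
measured_at = datetime(2019, 11, 14, 15, 42, 7, tzinfo=timezone.utc)
date, second = split_timestamp(measured_at)
```

Because the offset is truncated to whole seconds, the round trip is exact only for measurements recorded at one-second resolution, which matches the data description.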
94. Analysis and Optimization
• Duplication
  • How many times is region stored per network?
    • Once in table Networks
    • Once in table Temperatures_by_network
  • Static column value is stored only once in a partition
Slide 94
Networks: bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT
Temperatures_by_network: network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
95. Analysis and Optimization
• Duplication
  • How many times is location stored per sensor?
    • Once in table Sensors_by_network
    • 24 x 7 times in each partition in table Temperatures_by_network
  • Duplication across partitions
  • Duplication across rows in a partition
Slide 95
Temperatures_by_network: network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
Sensors_by_network: network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>
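The duplication counts on the last two slides can be reproduced with simple arithmetic. The sketch below assumes every sensor reports in every hour of the week; the function and variable names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7

def copies_per_partition(rows_per_sensor: int, static: bool) -> int:
    """A static column is stored once per partition; a regular column once per row."""
    return 1 if static else rows_per_sensor

# One hourly row per sensor for each hour of a weekly partition.
rows_per_sensor = HOURS_PER_DAY * DAYS_PER_WEEK

location_copies = copies_per_partition(rows_per_sensor, static=False)
region_copies = copies_per_partition(rows_per_sensor, static=True)

def total_location_copies(weeks: int) -> int:
    """Copies of one sensor's location: one in sensors_by_network plus
    24 x 7 in each weekly partition of temperatures_by_network."""
    return 1 + weeks * location_copies
```

The contrast between location_copies and region_copies is the reason region is declared static while location, which differs per sensor, cannot be.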
96. Physical Data Model
• Access patterns
  • Q1: Find information about all networks
  • Q2: Find hourly average temperatures for every sensor in a specified network for a specified date range; order by date (DESC) and hour (DESC)
  • Q3: Find information about all sensors in a specified network
  • Q4: Find raw measurements for a particular sensor; order by timestamp (DESC)
Slide 96
Networks (Q1): bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT
Temperatures_by_network (Q2): network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
Sensors_by_network (Q3): network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>
Measurements_by_sensor (Q4): sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT
97. Physical Data Model
Slide 97
CREATE TABLE networks (
  bucket INT,
  number INT,
  description TEXT,
  region TEXT,
  n_sensors INT,
  PRIMARY KEY (bucket, number)
);
-- Q1
SELECT *
FROM networks
WHERE bucket = 1;
Networks: bucket INT K, number INT C↑, description TEXT, region TEXT, n_sensors INT
98. Physical Data Model
Slide 98
CREATE TABLE temperatures_by_network (
  network INT,
  week_first_day TIMESTAMP,
  date_hour TIMESTAMP,
  sensor TEXT,
  avg_temp FLOAT,
  location TEXT,
  region TEXT STATIC,
  PRIMARY KEY ((network, week_first_day), date_hour, sensor)
) WITH CLUSTERING ORDER BY (date_hour DESC, sensor ASC);
-- Q2
SELECT * FROM temperatures_by_network
WHERE network = ? AND week_first_day = ?
  AND date_hour >= ? AND date_hour <= ?;
Temperatures_by_network: network INT K, week_first_day TIMESTAMP K, date_hour TIMESTAMP C↓, sensor TEXT C↑, avg_temp FLOAT, location TEXT, region TEXT S
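The Q2 statement reads a single weekly partition, so a date range that crosses a week boundary needs one execution per week. The sketch below lists the week_first_day values to bind, newest first so that results keep the descending order; the same date_hour bounds can be passed to every execution because the partition key already limits each one to its week. The helper names, Monday as the first day of the week, and UTC are assumptions.

```python
from datetime import datetime, timedelta, timezone

def week_first_day(ts: datetime) -> datetime:
    """Midnight (UTC) of the Monday starting the week that contains ts (assumed convention)."""
    day = ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    return day - timedelta(days=day.weekday())

def week_partitions(start: datetime, end: datetime) -> list[datetime]:
    """week_first_day values of all partitions overlapping [start, end], newest first."""
    week = week_first_day(end)
    first = week_first_day(start)
    weeks = []
    while week >= first:
        weeks.append(week)
        week -= timedelta(days=7)
    return weeks

# Hypothetical range from a Friday to the following Thursday: two weekly partitions.
range_start = datetime(2019, 11, 8, tzinfo=timezone.utc)
range_end = datetime(2019, 11, 14, 23, tzinfo=timezone.utc)
partitions = week_partitions(range_start, range_end)
```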
99. Physical Data Model
Slide 99
CREATE TABLE sensors_by_network (
  network INT,
  sensor TEXT,
  location TEXT,
  characteristics MAP<TEXT,TEXT>,
  PRIMARY KEY (network, sensor)
);
-- Q3
SELECT * FROM sensors_by_network
WHERE network = ?;
Sensors_by_network: network INT K, sensor TEXT C↑, location TEXT, characteristics MAP<TEXT,TEXT>
100. Physical Data Model
Slide 100
CREATE TABLE measurements_by_sensor (
  sensor TEXT,
  date TIMESTAMP,
  second INT,
  parameter TEXT,
  value FLOAT,
  PRIMARY KEY ((sensor, date), second, parameter)
) WITH CLUSTERING ORDER BY (second DESC, parameter ASC);
-- Q4
SELECT * FROM measurements_by_sensor
WHERE sensor = ? AND date = ?;
Measurements_by_sensor: sensor TEXT K, date TIMESTAMP K, second INT C↓, parameter TEXT C↑, value FLOAT
101. Agenda
• Introduction to Apache Cassandra™ and CQL
• The Data Modeling Framework
• Conceptual Data Modeling and Application Workflows
• Logical Data Modeling and Chebotko Diagrams
• Physical Data Modeling and Optimization Techniques
• Time Series Modeling Example
• Summary and Resources
Slide 101