Discover how to model data for wide column databases such as ScyllaDB and Apache Cassandra. Contrast this with traditional RDBMS data modeling, moving from a normalized “schema first” design to a denormalized “query first” design. Also learn how to use advanced features like secondary indexes and materialized views to get the answers you need from the same base table.
Modeling Data and Queries for Wide Column NoSQL
1. High Performance NoSQL Masterclass
Modeling Data and Queries
for Wide Column NoSQL
Allan Mason
2. High Performance NoSQL Masterclass
Allan Mason
■ Lead Database Consultant.
■ Privileged to mentor and lead a skilled team of DBAs at Pythian.
■ Senior SW Engineer - video games to DB tools.
■ International Speaker.
■ Fiction Writer.
■ Father of Two.
3. Pythian Services Inc | 3
Data is in our DNA.
We do not just provide database advice and consulting. We are your database partner.
■ 25 Years in Business
■ 420+ Experts across every Data Domain & Technology
■ 400+ Global Customers
4. Pythian Services Inc | 4
Data Estate Planning, Professional Services, Managed Services
■ RDBMS: Oracle, Oracle Exadata, Microsoft SQL Server, MySQL, Postgres, DB2, Informix, HANA, MaxDB, Vertica DB
■ NoSQL: MongoDB, Cassandra, HBase, Scylla
■ Cloud databases: OCI DBCS, OCI ADB, Amazon RDS, Amazon Aurora, MS Azure SQL Database, MS Azure Cosmos DB, Google Cloud Datastore, Google Cloud Spanner, Google Cloud SQL, Google Cloud Bigtable
■ Data lakes/data warehouse: Hadoop/Spark, Amazon Redshift, MS Synapse Analytics, MS Azure Data Lake Storage, Google BigQuery, Oracle Exadata, Oracle Autonomous Database, Snowflake Cloud Data WH
Database Monitoring and Alerting Proprietary Tool: AvailX
Cloud Migration
6. High Performance NoSQL Masterclass
Agenda
This talk is about Wide Column NoSQL (ScyllaDB and Cassandra).
■ Wide Column NoSQL Overview
■ Pros and Cons
■ Data Modeling RDBMS vs Wide Column NoSQL
■ Data Modeling Rules
■ Queries
■ Additional Resources
7. High Performance NoSQL Masterclass
Wide Column NoSQL Overview
■ Cassandra and ScyllaDB are both Wide Column NoSQL DBs.
■ ScyllaDB has a fully compatible API with Cassandra.
○ ScyllaDB is written in C++.
○ Cassandra is written in Java.
■ They are powerful tools, well designed to scale to millions of
operations per second over geographically distributed locations
operating in a highly available manner.
■ Log Structured Merge Tree engine (LSM Tree).
8. High Performance NoSQL Masterclass
Overview - Pros
■ Stability - Pythian has worked on Cassandra clusters that have
operated for years without interruption.
○ High Availability
○ Self-Healing and Automation
○ No SPOF when correctly configured (RF, CL, etc.).
■ Scalability
○ Horizontal Scaling (Sharding) built-in.
○ Linear scaling to hundreds+ of nodes is very realistic if you need it.
■ The LSM Tree engine makes writes and reads very performant.
■ Vendor Independent FOSS.
9. High Performance NoSQL Masterclass
Overview - Cons
■ High Disk Usage - The LSM Tree engine requires "Compaction."
○ The Compaction process can eat a lot of disk space.
○ Look into the Compaction choices.
■ Poor engine performance when reads greatly outnumber writes.
■ Far fewer community open-source tools compared to
MySQL/MariaDB and PostgreSQL.
○ https://cassandra.apache.org/_/ecosystem.html
■ Great at what it does, but not a generic, do-it-all DB.
10. High Performance NoSQL Masterclass
Overview - Cassandra Specific Cons
■ Java
○ Limits on the Heap
○ JMX
○ Garbage Collection
■ Significant CPU spikes during Compaction and Garbage
Collection.
■ Default settings can significantly impact performance.
12. High Performance NoSQL Masterclass
Relational DB Management Systems (RDBMS)
Schema-First Design
■ Type of database management system based on the relational
model invented by Edgar F. Codd.
○ Entirely driven by data.
■ Normalization – Process of structuring a DB to reduce data
redundancy and improve data integrity
○ Central role in relational design.
○ The goal is to store an entity in a single location, to minimize application
management of INSERT, UPDATE, and DELETE changes.
○ Duplicated data makes ensuring data integrity a challenge.
■ After data is normalized based on tables and their relationships,
queries are written based on them.
13. High Performance NoSQL Masterclass
Wide Column NoSQL - Query-First Design
■ Data in WC NoSQL is structured differently.
■ The key goal is very fast data access, without the joins that
schema normalization makes necessary.
■ Driven by the queries, not by the data.
○ Identify the expected queries and design the tables around them.
○ Achieves more efficient reads.
○ Data duplication is not considered a problem.
14. High Performance NoSQL Masterclass
No Referential Integrity
RDBMS
■ Referential integrity is an RDBMS feature ensuring relationships
between tables in a database remain accurate by applying
constraints (e.g., foreign keys).
■ This prevents applications or users from writing inaccurate data
or references pointing to data that does not exist.
○ Relationships between data linked by keys remain consistent.
○ Important in relational DBs, as queries often combine data from multiple tables.
Wide Column NoSQL
■ WC NoSQL - No referential integrity (foreign key) support.
○ Cascading deletes are not supported.
15. High Performance NoSQL Masterclass
Atomicity
■ Atomicity: Either all of a Transaction's operations are completed,
or none of them are.
■ WC NoSQL write operations are atomic at the partition level;
inserts, updates, or deletes of two or more rows in the same
partition are treated as one write operation.
○ E.g., when writing with a Consistency Level of QUORUM and an RF (Replication Factor) of
3, the coordinator sends the write to all 3 replica nodes and waits for
acknowledgement from 2 of them.
○ If the write succeeds on only one replica, quorum is not reached and a failure is
reported back to the client.
○ However, the write that succeeded on that one replica is not
automatically removed, as there is no roll-back support.
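The quorum arithmetic and the missing roll-back can be sketched in a few lines of Python. This is a toy in-memory model, not driver or server code: `quorum`, `quorum_write`, and the `failing` parameter are names made up for illustration, and each replica is simply a dict.

```python
def quorum(replication_factor: int) -> int:
    """Number of replica acknowledgements a QUORUM operation waits for."""
    return replication_factor // 2 + 1


def quorum_write(replicas, key, value, failing):
    """Send the write to every replica; succeed only if a quorum acknowledges.

    `failing` is a hypothetical set of replica indexes that reject the write.
    """
    acks = 0
    for index, replica in enumerate(replicas):
        if index in failing:
            continue  # this replica never applies the write
        replica[key] = value  # applied immediately; nothing undoes it later
        acks += 1
    return acks >= quorum(len(replicas))


replicas = [{}, {}, {}]  # RF = 3, so QUORUM needs 2 acknowledgements
ok = quorum_write(replicas, "post:1", "hello", failing={1, 2})
copies = sum("post:1" in replica for replica in replicas)
print(ok, copies)
```

Here the write is reported as failed, yet one replica still holds the value, which is the situation the slide warns about.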
16. High Performance NoSQL Masterclass
Denormalization
■ WC NoSQL is optimized for writes.
■ Writing data in multiple locations incurs a minimal penalty.
■ Unlike relational databases, data duplication is treated as a good
thing.
○ This provides different preset combinations of the data to serve different
queries.
■ With RF=3 and CL=QUORUM for both reads and writes, we expect strong
consistency, because every read quorum overlaps every write quorum.
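Why RF=3 with QUORUM gives strong consistency can be checked with the usual overlap rule: a read is guaranteed to touch at least one replica that acknowledged the latest write when read acknowledgements plus write acknowledgements exceed RF. The rule is a general rule of thumb rather than something stated on the slide, and the function names below are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replica count required by the QUORUM consistency level."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when any read set must overlap any write set (R + W > RF)."""
    return write_acks + read_acks > replication_factor


rf = 3
print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))  # QUORUM writes + QUORUM reads
print(read_sees_latest_write(rf, 1, 1))                    # ONE writes + ONE reads
```

With RF=3, QUORUM is 2, and 2 + 2 > 3, so the sets overlap. Writing and reading at ONE gives 1 + 1, which does not exceed 3, so a read may miss the latest write.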
17. High Performance NoSQL Masterclass
Sorting
■ Relational DBs often return rows in the order in which they were
written, but that order is not guaranteed.
○ Use ORDER BY to sort the records returned by a query.
■ In WC NoSQL, sorting is a design decision. Sort order is based on the
CLUSTERING ORDER specified in the table definition.
CREATE TABLE blogpostsbyuseryear (
  userid bigint,
  posttime timestamp,
  postid uuid,
  postcontent text,
  year bigint,
  PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
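To make "sorting is a design decision" concrete, here is a small Python model of the table above. It is only a sketch: a dict stands in for the partitions, `insert_post` and `select_posts` are hypothetical helpers, and the sort happens at write time, which loosely mirrors how rows are kept in clustering order inside a partition.

```python
from collections import defaultdict
from datetime import datetime

# (userid, year) partition key -> rows kept in clustering order (posttime DESC)
partitions = defaultdict(list)


def insert_post(userid, posttime, postcontent):
    rows = partitions[(userid, posttime.year)]
    rows.append((posttime, postcontent))
    # Keep the partition ordered newest first, as CLUSTERING ORDER BY (posttime DESC) would.
    rows.sort(key=lambda row: row[0], reverse=True)


def select_posts(userid, year):
    # Reads return rows in stored order; no ORDER BY step is needed.
    return [content for _, content in partitions.get((userid, year), [])]


insert_post(42, datetime(2022, 3, 1, 9, 0), "first")
insert_post(42, datetime(2022, 11, 9, 18, 30), "third")
insert_post(42, datetime(2022, 6, 15, 12, 0), "second")
print(select_posts(42, 2022))
```

The rows come back newest first even though they were inserted out of order, because the order was fixed by the table design rather than by the query.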
18. High Performance NoSQL Masterclass
Aggregation in Wide Column NoSQL
■ Aggregation functions (count, sum, etc.) are supported.
○ However, for performance reasons, it is recommended to use an external tool,
such as Spark, for these jobs.
○ If only one partition is accessed, though, performance should be fine.
■ For data analytics, WC NoSQL is more likely to be integrated with
Spark to do query aggregations and data analysis.
○ Spark can move large chunks of data in and out for data analysis.
■ Integrate with Solr or Elasticsearch to search and index large data
sets (big data ecosystem).
19. High Performance NoSQL Masterclass
Keyspaces
■ Keyspaces are similar to a relational DB schema.
■ They are the highest level of the data model.
■ Usually contain many tables.
○ A grouping of tables.
■ Define how many replicas the data is copied to, and in which DCs.
■ Define options that apply to all included tables.
■ Keyspaces are created using the CREATE KEYSPACE command.
20. High Performance NoSQL Masterclass
Batches
■ Batches to a single partition are applied as a single mutation and are
recommended.
■ Batch statements with mutations to several partitions simultaneously are
strongly discouraged.
○ Individual mutations are better for performance than such a batch.
■ Only modification statements (INSERT, UPDATE, or DELETE) are allowed.
■ Batches are atomic, i.e., everything succeeds, or nothing does. No Rollbacks.
■ Isolation is guaranteed at a partition level but not across all involved partitions.
The mutation might become available in some partitions, but other mutations
from the same batch might not have been applied yet to the other partitions.
■ Batches are not transactional, but they can include LWTs. If multiple LWTs are
used, they need to target the same partition.
■ In order to update counters, a "counter batch" is required.
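One way an application might follow the single-partition advice is to group pending mutations by partition key before sending them, so that each batch touches exactly one partition. The helper below is a hypothetical sketch that only does the grouping; statements are plain strings and nothing is sent to a database.

```python
from collections import defaultdict


def group_by_partition(mutations):
    """Group (partition_key, statement) pairs into one batch per partition."""
    batches = defaultdict(list)
    for partition_key, statement in mutations:
        batches[partition_key].append(statement)
    return dict(batches)


mutations = [
    ((42, 2022), "INSERT post A"),
    ((7, 2022), "INSERT post B"),
    ((42, 2022), "INSERT post C"),
]
batches = group_by_partition(mutations)
for partition_key, statements in batches.items():
    print(partition_key, statements)
```

Each resulting list could be sent as its own batch, while partitions with a single statement can simply be written individually.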
21. High Performance NoSQL Masterclass
Materialized Views - Wide Column NoSQL (1 of 2)
■ MV is a view of a “base table”.
○ ScyllaDB creates it as a separate (read-only) table.
○ Takes up space (table) in the cluster.
■ MV might be on a different node than the base table, based on
Partition Key.
○ The view itself exists across the entire cluster.
■ Automatically updated when the base table is updated.
■ All of the original table's Primary Key components MUST also
appear in the MV’s key.
■ A view can have some or all of the base table's columns and use
different sorting orders.
22. High Performance NoSQL Masterclass
Materialized Views - Wide Column NoSQL (2 of 2)
Cassandra
■ Cassandra's Materialized Views are considered experimental due
to their instability.
■ Not recommended for Production workloads in Cassandra.
ScyllaDB
■ Production Ready.
24. High Performance NoSQL Masterclass
Intro to Wide Column NoSQL Data Modeling
■ Data Model Design - "Measure Twice, Cut Once".
○ Spend time on Requirements and Design.
○ Especially Important when working on the Data Model.
○ It guides design of the rest of the solution.
○ Tech Debt - Difficult to change later.
■ How will the data be distributed, accessed, and used?
■ Design for common as well as uncommon usage.
■ Avoid "Hot Partitions"
■ Avoid Large Partitions
25. High Performance NoSQL Masterclass
Data Modeling Goals
The goal of data modeling is to design a DB cluster that is performant,
complete, and organized. It should provide the following:
■ Data is evenly distributed across nodes, ideally.
■ Minimize number of nodes / partitions accessed in a read query.
○ Ideally, only one partition.
○ E.G. Avoid Range Queries
26. High Performance NoSQL Masterclass
Data Modeling Process for Wide-Column NoSQL
■ Consider the Conceptual Data Model.
■ Application Workflow right behind it.
○ Start thinking about queries.
■ Logical Data Model will come from those.
○ Partition and Clustering Key selection is critical here.
■ Physical Data Model - create the actual DB using CQL.
■ Review, Test, and Optimize the model.
27. High Performance NoSQL Masterclass
Model Tables Around Query Patterns
■ Identify the most common query patterns, then design the tables
around them.
■ All the data must be available for the query, without joins to other
tables, and in the order it needs to be returned.
■ If any joins or other sorts are needed, they have to be done by the
application.
28. High Performance NoSQL Masterclass
Model Tables Around Query Patterns (Example 1)
■ For the query "Give me the post content for userID #____ in the year _____, sorted by
time of post," the table would be designed like this:
CREATE TABLE blogpostsbyuseryear (
  userid bigint,
  posttime timestamp,
  postid uuid,
  postcontent text,
  year bigint,
  PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
■ Notice that we broke out "year" from the posttime timestamp.
SELECT postcontent FROM blogpostsbyuseryear WHERE userid=N AND year=2022;
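Breaking "year" out of the timestamp is work the application does before each write. A minimal sketch of that step follows, assuming timestamps are timezone-aware and that the year bucket is taken in UTC; both are assumptions, and `partition_key_for` is a hypothetical helper.

```python
from datetime import datetime, timezone


def partition_key_for(userid: int, posttime: datetime) -> tuple:
    """Build the (userid, year) partition key from a post's timestamp."""
    # Normalize to UTC first so every writer puts a post in the same year bucket.
    return (userid, posttime.astimezone(timezone.utc).year)


posttime = datetime(2022, 11, 9, 14, 30, tzinfo=timezone.utc)
print(partition_key_for(42, posttime))
```

Writers and readers must derive the bucket the same way, otherwise a query for a given year looks in the wrong partition.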
29. High Performance NoSQL Masterclass
Model Tables Around Query Patterns (Example 2)
■ For the query "Give me the names of users who posted today, sorted by last name,"
we would need another table:
CREATE TABLE blogpostsbyusertoday (
  postdate date,        -- day bucket, so one query matches a whole day
  userid bigint,
  userfirstname text,
  userlastname text,
  posttime timestamp,
  year bigint,
  -- userid keeps users who share a last name from overwriting each other
  PRIMARY KEY ((postdate), userlastname, userid)
) WITH CLUSTERING ORDER BY (userlastname ASC, userid ASC);
SELECT userfirstname, userlastname FROM blogpostsbyusertoday WHERE postdate = '2022-11-09';
30. High Performance NoSQL Masterclass
Model Tables Around Query Patterns (Example 3)
■ For the query "How many users posted today?", we could use a Counter type to
avoid aggregation for performance reasons:
CREATE TABLE blogpostcounttoday (
  postdate date,
  counter_value counter,
  PRIMARY KEY (postdate)
);
SELECT counter_value FROM blogpostcounttoday WHERE postdate = '2022-11-09';
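The idea behind the counter table is to pay a small cost on every write so that the read is a single lookup instead of an aggregation. A Python stand-in is sketched below, with `collections.Counter` playing the role of the counter column and hypothetical helper names. Note that a counter incremented per post counts posts, not distinct users.

```python
from collections import Counter
from datetime import date

post_counts = Counter()  # postdate -> number of posts recorded that day


def record_post(postdate: date) -> None:
    """Increment the day's counter at write time."""
    post_counts[postdate] += 1


def posts_on(postdate: date) -> int:
    """Read the precomputed count; no scan over the posts is needed."""
    return post_counts[postdate]


for _ in range(3):
    record_post(date(2022, 11, 9))
record_post(date(2022, 11, 8))
print(posts_on(date(2022, 11, 9)))
```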
31. High Performance NoSQL Masterclass
Conceptual Data Modeling
■ What are the business requirements?
■ What data is available and what is to be stored?
■ Working out the basic concepts here.
32. High Performance NoSQL Masterclass
Partitions
■ Don't let partitions get too large.
■ Don't let one or more partitions get too large or busy.
■ Hot Partitions - Very busy partitions that don't spread the load.
○ These will have a serious impact on performance.
○ E.g., a LIFO design, where the latest data is what everyone wants (like sports
scores), could lead to a Hot Partition.
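A quick way to spot a potential hot partition during design or testing is to tally how often each partition key appears in a sample of requests. The sketch below is illustrative only: `hot_partitions` and the 50% threshold are assumptions, not a rule from the talk.

```python
from collections import Counter


def hot_partitions(accessed_keys, threshold=0.5):
    """Return partition keys that receive more than `threshold` of all requests."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items() if n / total > threshold}


# Hypothetical request sample: most readers want today's scores.
sample = ["scores:2022-11-09"] * 8 + ["scores:2022-11-08", "scores:2022-11-07"]
print(hot_partitions(sample))
```

In this sample one key takes 8 of 10 requests, so a single partition, and the nodes that own it, would carry most of the load.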
33. High Performance NoSQL Masterclass
Minimize Number of Partitions to be Read
■ Queries should read from as few partitions as possible.
■ Fewer fetched partitions mean faster queries.
■ The reason for this is that each partition can be stored on a
different node.
■ When you issue a query, the coordinator generally will need to
issue the command to several nodes.
■ This results in additional overhead and increases the standard
deviation in latency.
■ Even if all partitions are stored on a single node, because of the way rows
are stored in WC NoSQL, it is cheaper to read data from a single
partition than from multiple ones at the same time.
34. High Performance NoSQL Masterclass
Spread Data Evenly Across the Cluster
■ Try to spread data evenly across the cluster, avoiding data
hotspots that put pressure on specific nodes.
■ The partition key – the first element of the primary key –
determines which node stores the data.
■ It is responsible for data distribution across the cluster.
○ It is thus of the utmost importance to choose the partition key wisely.
35. High Performance NoSQL Masterclass
Design for Storage
■ Data duplication is expected in WC NoSQL, as data is stored in
multiple tables to support multiple queries.
■ In modern times, data storage is considered cheaper than other
server resources, and expectations are high that queries return
quickly, even for very large datasets.
■ Therefore, planning storage requirements is necessary.
■ After you have created the logical and physical models of your
schema, then calculate storage needs by looking at space
required by the individual data types in each table.
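A back-of-the-envelope version of that calculation is sketched below. It sums fixed-width CQL type sizes (bigint and timestamp are 8 bytes, uuid is 16), uses an assumed average size for text, and multiplies by row count and replication factor. It deliberately ignores per-row overhead, compression, and compaction headroom, so treat the result as a rough lower bound; all function names and input figures are hypothetical.

```python
FIXED_SIZES = {"bigint": 8, "timestamp": 8, "uuid": 16}  # bytes per value


def row_bytes(columns, avg_text_bytes):
    """Approximate payload size of one row from its column types."""
    total = 0
    for cql_type in columns.values():
        total += avg_text_bytes if cql_type == "text" else FIXED_SIZES[cql_type]
    return total


def table_bytes(columns, rows, avg_text_bytes, replication_factor):
    """Raw data size across the cluster, counting every replica."""
    return row_bytes(columns, avg_text_bytes) * rows * replication_factor


columns = {
    "userid": "bigint",
    "posttime": "timestamp",
    "postid": "uuid",
    "postcontent": "text",
    "year": "bigint",
}
# Assumed inputs: 1 million posts averaging 500 bytes of text, RF = 3.
print(table_bytes(columns, rows=1_000_000, avg_text_bytes=500, replication_factor=3))
```

Because each query table duplicates data, the same estimate has to be repeated for every table in the model and then summed.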
36. High Performance NoSQL Masterclass
Logical Data Modeling - Keys
■ Proper Key selection is critical to performance, data distribution,
and sorting capabilities.
■ The partition key is hashed to a token, which is placed on the token
ring; automatic sharding then determines which node owns the
data and which nodes it is replicated to, according to the
replication factor.
○ Automatically shards and distributes data across the cluster.
■ The clustering Key (AKA Sort Key) sorts the data within a given
partition.
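The token ring can be illustrated with a simplified model: hash the partition key to a token, find the first node at or after that token, and take the next nodes around the ring as additional replicas. The sketch makes several assumptions that differ from a real cluster: MD5 stands in for the actual partitioner hash, each node owns a single token (no virtual nodes), and replica placement ignores racks and data centers. `Ring` and `token` are hypothetical names.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a 64-bit token (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        self.replication_factor = replication_factor
        # One token per node, sorted to form the ring.
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key: str):
        """Owner is the first node at or after the key's token; then walk the ring."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("42:2022"))
```

The same key always maps to the same replicas, and different keys land on different starting nodes, which is how a well-chosen partition key spreads data across the cluster.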
37. High Performance NoSQL Masterclass
Secondary Indexes
Secondary Indexes in a Relational DB
■ Alternate access path to rows.
■ Filtering based on values.
■ Can be very effective when designed and used correctly.
ScyllaDB Secondary Indexes
■ Implemented differently from Cassandra, where they are only
Local.
■ ScyllaDB Secondary Indexes can be either Global or Local.
■ In ScyllaDB, Global Secondary Indexes are built on top of
Materialized Views.
39. High Performance NoSQL Masterclass
Queries - Performance Recommendations
■ The data model is a critical part of a highly performant DB.
○ Expect an order of magnitude difference in performance between poorly
designed and well-designed data models.
○ Queries need to be an early part of the data modeling process.
■ Use Prepared Statements, with placeholders.
■ Don't use "SELECT *" if you don't need all of the columns for the
rows returned.
■ Avoid full table scans.
■ Avoid large Mutations (insert/update/delete), as they can create a
performance bottleneck.
40. High Performance NoSQL Masterclass
Queries - Prepared Statements - Overview
An application will often reuse the same queries with different values
(parameters). This is where prepared statements shine.
■ Repeating CQL statements can be prepared and saved in a
PreparedStatement object.
■ Every time the statement is executed, the parameters are passed in
as arguments.
■ Before a query is executed the DB parses it. This is a costly operation,
and it is better to prepare the statement once and reuse it.
■ Prepare a query string once, and reuse it with different values. More
efficient than simple statements for queries that are used often.
PreparedStatement prodS1 = session.prepare(
    "SELECT sku FROM product WHERE sku = ?");
41. High Performance NoSQL Masterclass
■ Prevents CQL Injection (Security).
■ Query are only Parsed once, saves time every time it's later
executed.
■ Routing Key (token awareness) metadata is cached.
■ Future optimizations are still to come, some already planned.
■ Reduces data sent over the network.
■ Use them where possible; there are a lot of great reasons.
Queries - Prepared Statements - Advantages
43
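The injection-safety point can be demonstrated without a cluster. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in: it is a different database, but its `?` placeholders bind values in the same spirit as the prepared statement shown earlier. The `product` table and `find_sku` helper are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO product (sku) VALUES (?)", [("A-1",), ("B-2",)])

# One query string with a placeholder, reused for every lookup.
QUERY = "SELECT sku FROM product WHERE sku = ?"


def find_sku(value):
    """The value is bound as data, never spliced into the query text."""
    return [row[0] for row in conn.execute(QUERY, (value,))]


hostile = "A-1' OR '1'='1"
print(find_sku("A-1"))
print(find_sku(hostile))
```

The hostile string is compared as a literal value and matches nothing. Had it been concatenated into the query text, the `OR '1'='1'` clause would have matched every row.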
42. High Performance NoSQL Masterclass
Queries - Lightweight Transactions (LWTs)
■ Based on Paxos Consensus Algorithm.
■ LWTs are a feature of WC NoSQL that allows atomic, conditional updates
(e.g., INSERT ... IF NOT EXISTS or UPDATE ... IF) in a single query.
■ In a relational DB, this would be a transaction.
■ NoSQL DBs are not ACID compliant, but LWTs are a step in that
direction.
■ No rollback is possible; if the LWT fails, the query fails, and no data
is updated.
■ Expensive - Four round trips; therefore, use only when necessary.
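The behavior an LWT provides is a compare-and-set: the write is applied only if a condition on the current data holds, and otherwise nothing changes. The single-process sketch below models only that outcome. It does not implement Paxos or any of the round trips, and the function names are hypothetical.

```python
def insert_if_not_exists(table: dict, key, row: dict) -> bool:
    """Model of INSERT ... IF NOT EXISTS: apply only when the key is absent."""
    if key in table:
        return False  # condition failed; existing data is left untouched
    table[key] = dict(row)
    return True


def update_if(table: dict, key, column, expected, new_value) -> bool:
    """Model of UPDATE ... IF column = expected."""
    row = table.get(key)
    if row is None or row.get(column) != expected:
        return False
    row[column] = new_value
    return True


users = {}
print(insert_if_not_exists(users, "allan", {"email": "a@example.com"}))
print(insert_if_not_exists(users, "allan", {"email": "other@example.com"}))
print(update_if(users, "allan", "email", "a@example.com", "b@example.com"))
```

The second insert is rejected rather than overwriting the first, which is the guarantee a plain write cannot give when two clients race for the same key.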
43. High Performance NoSQL Masterclass
Change Data Capture (CDC) Overview
■ When enabled, you can query a table's current data or its history of changes.
■ CDC uses disk space for each enabled table.
○ When the CDC space for a table fills up, the related table no longer accepts writes.
■ CDC's free space needs to be managed on all nodes.
○ More work for larger clusters (think hundreds of servers or more).
■ Ensure you have a cleanup mechanism.
Example Use Cases
■ Replication between heterogeneous DBs.
○ E.G. Replication to ElasticSearch.
■ Implementing a Notification System.
■ Fraud Detection - In-flight analytics, looking for (abnormal) patterns
in the changes.
45. High Performance NoSQL Masterclass
Additional Resources
Learning
■ ScyllaDB University - university.scylladb.com/
■ Pythian's Blog - blog.pythian.com/technical-track/
■ Check out Cassandra materials
■ github.com/Anant/awesome-cassandra
Professional Services
■ Pythian Services - pythian.com/
■ ScyllaDB - www.scylladb.com/
46. High Performance NoSQL Masterclass
What We Covered
This talk is about Wide Column NoSQL (ScyllaDB and Cassandra).
■ Wide Column NoSQL Overview
■ Pros and Cons
■ Data Modeling RDBMS vs Wide Column NoSQL
■ Data Modeling Rules
■ Queries
■ Additional Resources
47. High Performance NoSQL Masterclass
Keep in Touch!
Allan Mason
Lead Database Consultant.
Pythian Services
■ amason@pythian.com
■ @_digitalknight
■ linkedin.com/in/allan-mason-7b50b426
Editor's Notes
Welcome. In this part of our class today, I'll be discussing Modeling Data and Queries for Wide Column NoSQL.
A little background about me. I'm currently a Lead DB Consultant at Pythian. I feel privileged to be leading a very skilled team of DBAs. In a former life, I was a Senior SW Engineer. I'm also a speaker, and I enjoy writing. I'm the father of two great young adults.
A quick overview of who I work for. Pythian started as a database services company and continues to deliver top-of-the-line database services 25 years later. We are proud to have served 400+ customers globally, with 420+ experts across every data domain and technology, and counting! Pythian maximizes the value of your database by delivering advanced on-prem, hybrid, cloud, and multi-cloud solutions, solving your toughest data and analytics challenges.
From database design, migrations, capacity planning, upgrades, performance tuning, backup and recovery, to round-the-clock monitoring, problem detection and resolution, our teams help you keep your mission-critical systems operating flawlessly and in an optimized fashion - without having to worry about hiring, covering vacations, sick days and so on.
Our Database Services cover 25+ technologies and platforms. We have database experts who have the experience you need.
In this talk, I'll be focused on Wide Column NoSQL, specifically for ScyllaDB and Cassandra. We'll go into an overview, some pros and cons. Data modeling and how it compares to relational data modeling. We'll cover some data modeling rules, talk about how queries are an important part of the modeling process, and then wrap up with some additional links.
Cassandra and ScyllaDB are both wide column NoSQL databases. ScyllaDB has a fully compatible API with Cassandra.
ScyllaDB is written in C++.
Cassandra is written in Java.
They are powerful tools, well designed to scale to millions of operations per second over geographically distributed locations operating in a highly available manner.
They use a Log Structured Merge Tree engine (LSM Tree), which is the core of how Wide Column NoSQL works.
Let's review some of the pros about Wide Column NoSQL.
It's very stable. Pythian has worked on some Cassandra clusters that have operated for years without interruption.
It's got amazing High Availability.
Self-Healing and Automation help contribute to the stability.
There is No single point of failure when correctly configured - think replication factor, and consistency level, among others.
It's easily scalable
Horizontal Scaling, AKA Sharding, is built-in, which is really nice.
It is very realistic to have linear cluster scaling, to hundreds or more nodes if you need it.
The LSM Tree engine makes writes and reads very performant.
It is vendor-independent, free and open-source software. It's mostly platform independent, except that ScyllaDB only runs on Linux.
Some of the cons against wide column NoSQL.
Using an LSM Tree engine requires "Compaction." This results in High Disk Usage.
The Compaction process needs a lot of free disk space.
Look into the different Compaction choices to see what suits your needs.
Another con is poor engine performance if your reads greatly outnumber your writes. The LSM tree isn't ideal in this case.
There are also far fewer community open-source tools compared to MySQL, MariaDB and PostgreSQL. Mainly because they've been around a lot longer.
Here's a link to some of those tools and resources that are available: https://cassandra.apache.org/_/ecosystem.html
Wide column is great at what it does, but it's not a generic, do it all DB.
Some of the cons specific to Cassandra, which mostly revolve around the choice of Java as the programming language.
There are some tricky catch 22 limits on the heap, it can't be too big or too small.
JMX
And the famous Java Garbage collection
There can be Significant CPU spikes during the Compaction and Garbage Collection processes.
Also, default settings can significantly impact performance. Just ensure you always review and tune your settings.
Let's talk about data modeling and how it differs from relational DB data modeling.
Relational databases are very much a schema-first design.
Relational databases are a type of database management system based on the relational model invented by Edgar F. Codd.
They are Entirely driven by data.
Normalization is a part of the process of structuring a relational database in order to reduce data redundancy and improve data integrity.
This plays a central role in relational design.
The goal is to store an entity in a single location, to minimize application management of changes such as INSERTs, UPDATEs, and DELETEs.
Duplicated data makes ensuring data integrity a challenge.
After the data is normalized into a schema based on tables and their relationships, queries are then written based on it.
On the other hand, Wide Column NoSQL is very much a query-first design.
Data in Wide Column NoSQL is structured differently.
The key goal here is very fast data access without any joins which become necessary as a result of schema normalization in the relational DB world.
Wide Column NoSQL is driven by the queries, not by the data.
Identify all of your expected queries and design the tables around them.
This will help you achieve more efficient reads.
Data duplication is not considered a problem.
(Relational)
Referential integrity is a relational DB feature that ensures relationships between the data in database tables remain accurate by applying constraints (e.g., foreign keys).
This prevents applications or users from writing inaccurate data or references pointing to data that does not exist.
This ensures Relationships between data linked by keys remain consistent.
This is an Important part of relational DBs, where queries often combine data from multiple tables.
(Wide Column NoSQL)
In WC NoSQL - there is No referential integrity (foreign key) support.
As a result, things like Cascading deletes are not supported.
Atomicity means that either all of a transaction's operations are completed, or none of them are.
In WC NoSQL, write operations are atomic at the partition level; inserts, updates, or deletes of two or more rows in the same partition are treated as one write operation.
For example, when writing with a Consistency Level of QUORUM and a Replication Factor of 3, the coordinator sends the write to all 3 replicas and waits for acknowledgement from 2 of them.
If the write succeeds on only one replica, quorum is not reached and a failure is reported back to the client.
However, the write that succeeded on that one replica is not automatically removed, as there is no roll-back support. This is important to remember.
Writes are practically free with Wide Column NoSQL databases.
Denormalization
WC NoSQL is optimized for writes.
Writing data in multiple locations incurs a very minimal penalty.
Unlike relational databases, data duplication is actually considered a good thing.
This provides different preset combinations of the data to serve different queries.
If we have a replication factor of 3 and a Consistency Level of QUORUM for both reads and writes, we expect strong consistency.
Sorting
Relational DBs often return rows of data in the order in which they were written, but that order is not guaranteed.
To control this, use ORDER BY in the query to sort the records it returns.
In WC NoSQL, sorting is actually a design decision. Sort order is based on the CLUSTERING ORDER which is specified in the table definition.
Here we see an example of sorting by the posttime column, as it's specified in the table definition.
CREATE TABLE blogpostsbyuseryear (
  userid bigint,
  posttime timestamp,
  postid uuid,
  postcontent text,
  year bigint,
  PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
Aggregation functions such as count, sum, etc. are supported in wide column NoSQL.
However, for performance reasons, it is usually recommended to use an external tool, such as Spark, for these jobs.
Though if you're only accessing one partition, the performance should be fine.
For data analytics, WC NoSQL is more likely to be integrated with something like Spark to do query aggregations and data analysis.
Spark can handle moving large chunks of data in and out of the DB for analysis.
One might also integrate wide column NoSQL with something like Solr or Elasticsearch to search and index large data sets (think big data ecosystem).
Keyspaces
Keyspaces are similar to a DB schema in a relational DB.
They are the highest level of the data model.
They Usually contain many tables, and can also be thought of as a grouping of tables.
Keyspaces define how many replicas the data is copied to, and in which DCs.
They also define options that apply to all included tables.
Keyspaces are created using the CREATE KEYSPACE command.
Batches to a single partition are applied as a single mutation and are recommended.
Batch statements with mutations to several partitions simultaneously are strongly discouraged.
Individual mutations are better for performance than such a batch.
Only modification statements (INSERT, UPDATE, or DELETE) are allowed.
Batches are atomic, i.e., everything succeeds, or nothing does. There are No Rollbacks.
Isolation is guaranteed at a partition level but not across all involved partitions. The mutation might become available in some partitions, but other mutations from the same batch might not have been applied yet to the other partitions.
Batches are not transactional, but they can include LWT. If multiple LWTs are used, they need to target the same partition.
In order to update counters, a special type of batch, called a "counter batch" is required.
Materialized Views in WC NoSQL
A Materialized View is a view of a “base table”.
ScyllaDB implements a Materialized View as a separate (read-only) table.
These naturally take up space in the cluster, like any other table.
A Materialized View might be on a different node than the base table, based on Partition Key.
The view itself though exists across the entire cluster.
A Materialized View is Automatically updated whenever the base table is updated.
All of the original table's Primary Key components MUST also appear in the MV’s key.
A view can have some or all of the base table's columns and use different sorting orders.
Cassandra's Materialized Views are considered experimental due to their instability.
Therefore they are Not recommended for Production workloads in Cassandra.
In ScyllaDB they are Production Ready.
Some data modeling rules for Wide Column NoSQL
The Data Model Design step is very much a "Measure Twice, Cut Once" exercise.
Make sure to spend as much time as you need on the Requirements and Design.
This is especially important when working on the Data Model, as it usually has the biggest impact on performance, good or bad.
The data model guides the design of the rest of the solution.
Tech Debt is where shortcuts are taken. These are very Difficult and hence costly to change later, and the longer those bad implementation choices remain, the harder it is to fix them, as more and more of your application and infrastructure is built and expects things a certain way.
Consider: how will the data be distributed, accessed, and used?
Design the model for common as well as uncommon usage.
Avoid "Hot Partitions"
And Avoid Large Partitions
The goal of data modeling is to design a DB cluster that is performant, complete, and organized. It should provide the following:
Data that is ideally evenly distributed across nodes.
Minimize the number of nodes / partitions accessed in a given read query.
Ideally, there should be only one partition in each case.
E.g., avoid range queries.
Data modeling process for WC NoSQL
Consider the Conceptual Data Model. This is all about the beginning basics of your solution.
Then the Application Workflow follows right behind it.
Here's where you Start thinking about the queries.
Logical Data Model will come out of those steps.
Partition and Clustering Key selection is critical at this point.
Next comes the Physical Data Model - where you will create the actual DB using CQL commands.
And of course, Review, Test, and Optimize your model.
Model Tables Around Query Patterns
Identify the most common query patterns, then design the tables around them.
All the data must be available for the query, without joins to other tables, and in the order it needs to be returned.
If any joins or other sorts are needed, they have to be done by the application.
Model Tables Around Query Patterns (Example 1)
For the query "Give me the post content for userID #____ in the year _____ sorted by time of post," the table would be designed like this:
CREATE TABLE blogpostsbyuseryear (
userid bigint,
posttime timestamp,
postid uuid,
postcontent text,
year bigint,
PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
Notice that we broke out "year" from the posttime timestamp.
SELECT postcontent FROM blogpostsbyuseryear WHERE userid=N and year=2022;
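To make the effect of that key choice more concrete, here is a minimal in-memory sketch, assuming rows are grouped by the partition key (userid, year) and kept in posttime-descending order inside each partition. The `insert_post` and `select_posts` helpers and the sample data are hypothetical; a real cluster does this work in its storage engine.

```python
from collections import defaultdict
from datetime import datetime

# In-memory stand-in for blogpostsbyuseryear: partition key -> rows.
table = defaultdict(list)


def insert_post(userid, posttime, postcontent):
    # The application derives "year" from posttime when it writes the row.
    partition_key = (userid, posttime.year)
    table[partition_key].append((posttime, postcontent))
    # Keep rows in clustering order: posttime DESC.
    table[partition_key].sort(key=lambda row: row[0], reverse=True)


def select_posts(userid, year):
    # A single-partition read: both partition key columns are supplied.
    return [content for _, content in table[(userid, year)]]


insert_post(42, datetime(2022, 3, 1, 9, 0), "spring post")
insert_post(42, datetime(2022, 11, 9, 18, 30), "autumn post")
insert_post(42, datetime(2021, 6, 15, 12, 0), "older post")

print(select_posts(42, 2022))
```

The read touches exactly one partition and returns rows already sorted, which is the point of designing the table around the query.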
For the query "Give me the names of users who posted today, sorted by last name," we would need another table:
CREATE TABLE blogpostsbyusertoday (
userid bigint,
userfirstname text,
userlastname text,
posttime timestamp,
year bigint,
PRIMARY KEY (posttime, userlastname)
) WITH CLUSTERING ORDER BY (userlastname ASC);
SELECT userfirstname, userlastname FROM blogpostsbyusertoday WHERE posttime = '2022-11-09';
For the query "How many users posted today?" we could use a counter type to avoid aggregation at read time, for performance reasons:
CREATE TABLE blogpostcounttoday (
counter_value counter,
postdate date,
PRIMARY KEY (postdate));
SELECT counter_value FROM blogpostcounttoday WHERE postdate= '2022-11-09';
Conceptual Data Modeling
What are the business requirements?
What data is available and what is to be stored?
Working out the basic concepts here.
Partitions
Don't let partitions get too large.
Don't let any one partition get too busy, either.
Hot Partitions - Very busy partitions that don't spread the load.
These will have a serious impact on performance.
E.g., a LIFO design, where the latest data is what everyone wants (like sports scores), could lead to a Hot Partition.
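One way to spot this pattern is to measure how much of the traffic lands on the busiest partition. The sketch below does that for a made-up read log; the partition key names, the request counts, and the `hottest_partition` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical read log: each entry is the partition key a request touched.
# Most readers want the latest game, so one partition dominates.
reads = (
    ["game-2022-11-09"] * 90
    + ["game-2022-11-08"] * 7
    + ["game-2022-11-07"] * 3
)


def hottest_partition(reads):
    """Return the busiest partition key and its share of all reads."""
    counts = Counter(reads)
    key, hits = counts.most_common(1)[0]
    return key, hits / len(reads)


key, share = hottest_partition(reads)
print(key, share)
```

A share close to 1.0 means a single partition, and therefore only its replica nodes, is serving almost all of the load.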
Minimize Number of Partitions to be Read
Spread Data Evenly Across the Cluster
Try to spread data evenly across the cluster, avoiding data hotspots that put pressure on specific nodes.
The partition key – the first element of the primary key – determines which node stores the data.
It is responsible for data distribution across the cluster.
It is thus of the utmost importance to choose the partition key wisely.
Design for Storage
Data duplication is expected in WC NoSQL, as data is stored in multiple tables to support multiple queries.
In modern times, data storage is considered cheaper than other server resources, and expectations are high that queries return quickly, even for very large datasets.
Therefore, planning storage requirements is necessary.
After you have created the logical and physical models of your schema, then calculate storage needs by looking at space required by the individual data types in each table.
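A back-of-the-envelope version of that calculation might look like the sketch below. It uses the fixed value sizes of bigint and timestamp (8 bytes) and uuid (16 bytes). The average text length, the row count, the replication factor, and the `estimate_bytes` helper are assumptions for illustration, and per-row metadata overhead and compression are deliberately ignored.

```python
# Fixed value sizes, in bytes, for a few CQL types.
TYPE_SIZES = {"bigint": 8, "timestamp": 8, "uuid": 16}

# Columns of blogpostsbyuseryear; "text" has no fixed size, so an average
# length is assumed (hypothetical figure).
columns = {
    "userid": "bigint",
    "year": "bigint",
    "posttime": "timestamp",
    "postid": "uuid",
    "postcontent": "text",
}
AVG_TEXT_BYTES = 500  # assumption


def estimate_bytes(columns, row_count, replication_factor):
    """Raw value bytes only: ignores metadata overhead and compression."""
    row_bytes = sum(TYPE_SIZES.get(t, AVG_TEXT_BYTES) for t in columns.values())
    return row_bytes * row_count * replication_factor


total = estimate_bytes(columns, row_count=1_000_000, replication_factor=3)
print(total)
```

Repeat the estimate for every table that duplicates the data, since each query-specific table adds its own copy.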
Logical Data Modeling - Keys
Proper Key selection is critical to performance, data distribution, and sorting capabilities.
The partition key is hashed to a token, which is placed on the token ring. Automatic sharding then determines which node owns the data and which nodes it is replicated to, according to the replication factor.
The database automatically shards and distributes data across the cluster.
The Clustering Key (AKA Sort Key) sorts the data within a given partition.
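The token ring idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a hypothetical three-node ring with one token per node, and MD5 as a stand-in hash (real clusters use a Murmur3-based partitioner). The node names and the `token_for` and `replicas_for` helpers are invented for the example.

```python
import bisect
import hashlib

# Hypothetical three-node ring with one token per node (no vnodes).
RING = [(0, "node-a"), (2**63, "node-b"), (2**64 - 2**62, "node-c")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key):
    # Stand-in hash; real clusters use a Murmur3-based partitioner.
    digest = hashlib.md5(repr(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def replicas_for(partition_key, replication_factor):
    """Walk clockwise from the key's token, collecting the next nodes."""
    start = bisect.bisect_left(TOKENS, token_for(partition_key)) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(replication_factor)]


print(replicas_for((42, 2022), replication_factor=2))
```

The same partition key always maps to the same replicas, which is what lets a token-aware client send a query straight to a node that owns the data.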
Secondary Indexes
Secondary Indexes in a Relational DB
Alternate access path to rows.
Filtering based on values.
Can be very effective when designed and used correctly.
ScyllaDB Secondary Indexes
Implemented differently from Cassandra, where they are only Local.
ScyllaDB Secondary Indexes can be either Global or Local.
In ScyllaDB, Global Secondary Indexes are built on top of Materialized Views.
Cassandra Secondary Indexes are not global, so they are not especially useful as alternate paths into the data. Using them this way in Cassandra will not scale and is generally a bad idea.
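The scaling difference between local and global indexes can be illustrated with a toy model. In the sketch below, a query on a local index has to ask every node, while a global index is a single lookup structure keyed by the indexed value. The cluster layout, the "users" data, and both query helpers are hypothetical, and the node counts cover base-table nodes only, not the index read itself.

```python
# Hypothetical cluster: each node holds some rows of a "users" table,
# partitioned by userid.
nodes = {
    "node-a": [{"userid": 1, "country": "CA"}, {"userid": 4, "country": "US"}],
    "node-b": [{"userid": 2, "country": "US"}],
    "node-c": [{"userid": 3, "country": "FR"}],
}


def query_local_index(country):
    """Local index: every node must check its own index."""
    contacted, hits = 0, []
    for rows in nodes.values():
        contacted += 1
        hits += [r["userid"] for r in rows if r["country"] == country]
    return sorted(hits), contacted


# Global index: one lookup structure keyed by the indexed value.
global_index = {}
for node, rows in nodes.items():
    for r in rows:
        global_index.setdefault(r["country"], []).append((node, r["userid"]))


def query_global_index(country):
    """Global index: one index read, then only the nodes that own matches."""
    entries = global_index.get(country, [])
    base_nodes = len({node for node, _ in entries})
    return sorted(uid for _, uid in entries), base_nodes


print(query_local_index("FR"))
print(query_global_index("FR"))
```

With three nodes the gap is small, but the local-index fan-out grows with cluster size while the global lookup does not.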
Queries - Performance Recommendations
The data model is a critical part of a highly performant DB.
Expect an order of magnitude difference in performance between poorly designed and well-designed data models.
Queries need to be an early part of the data modeling process.
Use Prepared Statements, with placeholders.
Don't use "SELECT *" if you don't need all of the columns for the rows returned.
Avoid full table scans.
Avoid large Mutations (insert/update/delete), as they can create a performance bottleneck.
An application will often reuse the same queries with different values (parameters). This is where prepared statements shine.
Repeating CQL statements can be prepared and saved in a PreparedStatement object.
Every time the statement is executed, the parameters are passed in as arguments.
Before a query is executed the DB parses it. This is a costly operation, and it is better to prepare the statement once and reuse it.
Prepare a query string once, and reuse it with different values. More efficient than simple statements for queries that are used often.
PreparedStatement prodS1 = session.prepare("SELECT sku FROM product WHERE sku = ?");
Prevents CQL Injection (Security).
Queries are parsed only once, which saves time on every later execution.
Routing Key (token awareness) metadata is cached.
Future optimizations are still to come, some already planned.
Reduces data sent over the network.
Use them where possible; there are a lot of great reasons.
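The "parse once, execute many" benefit can be shown with a toy model. The `ToySession` class below is a hypothetical stand-in, not a real driver API: it only counts how often a query string is parsed, to contrast simple statements with prepared ones.

```python
class ToySession:
    """Hypothetical stand-in for a driver session; not a real driver API."""

    def __init__(self):
        self.parse_count = 0
        self._prepared = {}

    def _parse(self, query):
        # Parsing is the costly step we want to do only once per query string.
        self.parse_count += 1
        return query.split("?")

    def execute_simple(self, query_with_values):
        self._parse(query_with_values)  # parsed on every call

    def prepare(self, query):
        if query not in self._prepared:
            self._prepared[query] = self._parse(query)
        return query

    def execute_prepared(self, prepared, values):
        parts = self._prepared[prepared]  # no re-parse, only values are bound
        assert len(values) == len(parts) - 1
        return parts, values


session = ToySession()
stmt = session.prepare("SELECT sku FROM product WHERE sku = ?")
for sku in ("A-1", "B-2", "C-3"):
    session.execute_prepared(stmt, [sku])
print(session.parse_count)
```

Running the same three lookups through `execute_simple` would parse three times, and the gap widens with every repeated execution.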
Lightweight Transactions (LWTs)
Based on the Paxos consensus algorithm.
LWTs are a feature of WC NoSQL that allows atomic, conditional updates in a single query.
In a relational DB, this would be a transaction.
NoSQL DBs are not ACID compliant, but LWTs are a step in that direction.
No rollback is possible; if the LWT fails, the query fails, and no data is updated.
Expensive - Four round trips; therefore, use only when necessary.
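The conditional behavior can be sketched without a cluster. The Python below mimics the outcome of `INSERT ... IF NOT EXISTS` and `UPDATE ... IF <column> = <value>` on an in-memory dictionary. It models only the "applied or not" result, not the Paxos rounds behind it, and the table, column names, and helper functions are hypothetical.

```python
# Hypothetical in-memory table keyed by username.
users = {}


def insert_if_not_exists(username, email):
    """Mimics INSERT ... IF NOT EXISTS: returns (applied, current_row)."""
    if username in users:
        return False, users[username]  # condition failed, nothing written
    users[username] = {"email": email}
    return True, users[username]


def update_if(username, expected_email, new_email):
    """Mimics UPDATE ... IF email = expected: a compare-and-set."""
    row = users.get(username)
    if row is None or row["email"] != expected_email:
        return False
    row["email"] = new_email
    return True


print(insert_if_not_exists("allan", "a@example.com"))
print(insert_if_not_exists("allan", "other@example.com"))
print(update_if("allan", "a@example.com", "new@example.com"))
```

As on the slide, a failed condition means nothing is written; the caller checks the applied flag and decides what to do next.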
Change Data Capture (CDC)
When enabled, CDC lets you query a table's current data or the history of changes.
CDC uses disk space for each enabled table.
When the CDC space for a table fills up, the related table no longer accepts writes.
CDC's free space needs to be managed on all nodes.
More work for larger clusters (think hundreds of servers or more).
Ensure you have a cleanup mechanism.
Example Use Cases
Replication between heterogeneous DBs.
E.G. Replication to ElasticSearch.
Implementing a Notification System.
Fraud Detection - In-flight analytics, looking for (abnormal) patterns in the changes.
Heterogeneous database replication: applying captured changes to another database or table. The other database may use a different schema (or no schema at all), better suited for some specific workloads. An example is replication to ElasticSearch for efficient text searches.
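As a small sketch of that replication use case, the Python below replays an ordered change log into a target with a different shape: a word-to-SKU map that loosely resembles a text search index. The change log format, the sample rows, and the `replay` helper are hypothetical and do not reflect any real CDC log layout.

```python
# Hypothetical change log, as a CDC consumer might read it: oldest first.
change_log = [
    {"op": "insert", "sku": "A-1", "description": "red wool scarf"},
    {"op": "insert", "sku": "B-2", "description": "blue wool hat"},
    {"op": "update", "sku": "A-1", "description": "red silk scarf"},
    {"op": "delete", "sku": "B-2"},
]


def replay(change_log):
    """Apply changes in order, then build a word -> set-of-skus map."""
    current = {}
    for change in change_log:
        if change["op"] == "delete":
            current.pop(change["sku"], None)
        else:
            current[change["sku"]] = change["description"]
    index = {}
    for sku, description in current.items():
        for word in description.split():
            index.setdefault(word, set()).add(sku)
    return index


search_index = replay(change_log)
print(sorted(search_index))
```

Order matters: applying the same changes out of sequence could leave a deleted or stale row in the target.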
There was a lot we didn't cover
Learning
ScyllaDB University - university.scylladb.com/
Pythian's Blog - blog.pythian.com/technical-track/
Check out Cassandra materials
github.com/Anant/awesome-cassandra
Professional Services
Pythian Services - pythian.com/
ScyllaDB - www.scylladb.com/