High Performance NoSQL Masterclass
Modeling Data and Queries
for Wide Column NoSQL
Allan Mason
Allan Mason
■ Lead Database Consultant.
■ Privileged to Mentor and Lead a skilled team
of DBAs at Pythian.
■ Senior SW Engineer - video games to DB
tools.
■ International Speaker.
■ Fiction Writer.
■ Father of Two.
Pythian Services Inc
Data is in our DNA.
We do not just provide database advice and consulting. We are your database partner.
■ 25 years in business
■ 420+ experts across every data domain and technology
■ 400+ global customers
Data Estate Planning, Professional Services, Managed Services
■ RDBMS: Oracle, Oracle Exadata, Microsoft SQL Server, MySQL, Postgres, DB2, Informix, HANA, MaxDB, Vertica DB
■ NoSQL: MongoDB, Cassandra, HBase, Scylla
■ Cloud databases: OCI DBCS, OCI ADB, Amazon RDS, Amazon Aurora, MS Azure SQL Database, MS Azure Cosmos DB, Google Cloud Datastore, Google Cloud Spanner, Google Cloud SQL, Google Cloud Bigtable
■ Data lakes/data warehouse: Hadoop/Spark, Amazon Redshift, MS Synapse Analytics, MS Azure Data Lake Storage, Google BigQuery, Oracle Exadata, Oracle Autonomous Database, Snowflake Cloud Data WH
■ Database Monitoring and Alerting Proprietary Tool: AvailX
■ Cloud Migration
Overview
Agenda
This talk is about Wide Column NoSQL (ScyllaDB and Cassandra).
■ Wide Column NoSQL Overview
■ Pros and Cons
■ Data Modeling RDBMS vs Wide Column NoSQL
■ Data Modeling Rules
■ Queries
■ Additional Resources
Wide Column NoSQL Overview
■ Cassandra and ScyllaDB are both Wide Column NoSQL DBs.
■ ScyllaDB exposes a Cassandra-compatible API.
○ ScyllaDB is written in C++.
○ Cassandra is written in Java.
■ Both are designed to scale to millions of operations per second across geographically distributed locations while remaining highly available.
■ Both use a Log-Structured Merge Tree (LSM Tree) storage engine.
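To make the LSM Tree idea concrete, here is a minimal, toy sketch. It is not how Cassandra or ScyllaDB implement their engines; the class name `TinyLSM` and the `memtable_limit` parameter are assumptions made for illustration. Writes only touch an in-memory memtable, full memtables are flushed as immutable sorted runs, reads check the newest data first, and compaction merges the runs.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted runs (SSTables)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}       # newest writes, held in memory
        self.sstables = []       # sorted lists of (key, value), oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # a write never rewrites an old run
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):        # newest run wins
            i = bisect.bisect_left(run, (key,))    # binary search in the sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        merged = {}
        for run in self.sstables:   # oldest first, so newer values overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)                     # memtable is full: flushed as run 1
db.put("a", 3)
db.put("c", 4)                     # flushed as run 2, "a" now exists in two runs
runs_before = len(db.sstables)
db.compact()                       # merge the runs, keeping the newest "a"
```

Until `compact()` runs, the old and new versions of "a" both occupy space, which is why compaction and disk usage come up again under the cons.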
Overview - Pros
■ Stability - Pythian has worked on Cassandra clusters that have operated for years without interruption.
○ High Availability
○ Self-Healing and Automation
○ No single point of failure (SPOF) when correctly configured (RF, CF, etc.).
■ Scalability
○ Horizontal Scaling (Sharding) is built in.
○ Linear scaling to hundreds of nodes or more is very realistic if you need it.
■ The LSM Tree engine makes writes and reads very performant.
■ Vendor-independent FOSS.
Overview - Cons
■ High Disk Usage - The LSM Tree engine requires "Compaction."
○ The Compaction process can temporarily consume a lot of disk space.
○ Review the available Compaction strategies.
■ Poor engine performance when reads exceed writes by a large margin.
■ Far fewer community open-source tools compared to MySQL/MariaDB and PostgreSQL.
○ https://cassandra.apache.org/_/ecosystem.html
■ Great at what it does, but not a generic do-it-all DB.
Overview - Cassandra Specific Cons
■ Java
○ Limits on the Heap
○ JMX
○ Garbage Collection
■ Significant CPU spikes during Compaction and Garbage Collection.
■ Default settings can significantly hurt performance.
Data Modeling
Wide Column vs. RDBMS
Relational DB Management Systems (RDBMS)
Schema-First Design
■ A type of database management system based on the relational model invented by Edgar F. Codd.
○ Entirely driven by data.
■ Normalization – the process of structuring a DB to reduce data redundancy and improve data integrity.
○ Plays a central role in relational design.
○ The goal is to store each entity in a single location, minimizing the INSERT, UPDATE, and DELETE changes the application has to manage.
○ Duplicated data makes ensuring data integrity a challenge.
■ After the data is normalized into tables and their relationships, queries are written against them.
Wide Column NoSQL - Query-First Design
■ Data in WC NoSQL is structured differently.
■ The key goal is very fast data access, without the joins that schema normalization makes necessary.
■ Driven by the queries, not by the data.
○ Identify the expected queries and design the tables around them.
○ Achieves more efficient reads.
○ Data duplication is not considered a problem.
No Referential Integrity
RDBMS
■ Referential integrity is an RDBMS feature that keeps relationships between tables accurate by applying constraints (e.g., foreign keys).
■ This prevents applications or users from writing inaccurate data or references that point to data that does not exist.
○ Relationships between data linked by keys remain consistent.
○ Important in relational DBs, as queries often combine data from multiple tables.
Wide Column NoSQL
■ No referential integrity (foreign key) support.
○ Cascading deletes are not supported.
Atomicity
■ Atomicity: Either all of a transaction's operations are completed, or none of them are.
■ WC NoSQL write operations are atomic at the partition level; inserts, updates, or deletes of two or more rows in the same partition are treated as one write operation.
○ E.g., when writing with a Consistency Level of Quorum and an RF (Replication Factor) of 3, the write is sent to all 3 replicas, and the coordinator waits for acknowledgement from 2 of them.
○ If fewer than 2 replicas acknowledge the write, the coordinator reports the write as failed.
○ However, a copy of the write that succeeded on a replica is not automatically removed, as there is no roll-back support.
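The quorum arithmetic and the missing rollback can be sketched in a few lines. This is a simplified, single-process illustration only: the replicas are plain dictionaries and the function names are hypothetical, not a driver or server API.

```python
def quorum(replication_factor):
    """Replicas that must acknowledge at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def quorum_write(replicas, reachable, key, value):
    """Send the write to every replica; succeed only if a quorum acknowledges.

    replicas:  one dict per replica (stand-in for a node's storage)
    reachable: one bool per replica, False simulates a node that misses the write
    """
    acks = 0
    for store, is_up in zip(replicas, reachable):
        if is_up:
            store[key] = value   # applied locally and never rolled back
            acks += 1
    return acks >= quorum(len(replicas))


# RF = 3: two of three replicas acknowledge, so the write succeeds.
ok_nodes = [{}, {}, {}]
ok = quorum_write(ok_nodes, [True, True, False], "post:1", "hello")

# Only one replica acknowledges: the write is reported as failed,
# yet that replica keeps its copy because there is no rollback.
failed_nodes = [{}, {}, {}]
failed = quorum_write(failed_nodes, [True, False, False], "post:1", "hello")
```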
Denormalization
■ WC NoSQL is optimized for writes.
■ Writing data in multiple locations incurs a minimal penalty.
■ Unlike in relational databases, data duplication is treated as a good thing.
○ It provides different preset combinations of the data to serve different queries.
■ With RF=3 and CL=Quorum for both reads and writes, we expect strong consistency.
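The reasoning behind the last bullet is the usual overlap rule: if the number of replicas read plus the number of replicas written exceeds RF, every read set shares at least one replica with every write set. A small sketch, with hypothetical function names:

```python
def quorum(replication_factor):
    """Replicas involved at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
quorum_both = reads_see_latest_write(quorum(rf), quorum(rf), rf)   # 2 + 2 > 3
one_both = reads_see_latest_write(1, 1, rf)                        # 1 + 1 > 3 does not hold
```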
Sorting
■ Relational DBs generally return rows of data in the order in which they were written, but that order is not guaranteed.
○ Use ORDER BY to sort the records returned by a query.
■ In WC NoSQL, sorting is a design decision. Sort order is based on the CLUSTERING ORDER specified in the table definition.
CREATE TABLE blogpostsbyuseryear (
    userid bigint,
    posttime timestamp,
    postid uuid,
    postcontent text,
    year bigint,
    PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
Aggregation in Wide Column NoSQL
■ Aggregation functions (count, sum, etc.) are supported.
○ However, for performance reasons, it is recommended to use an external tool, such as Spark, for these jobs.
○ If the query only accesses one partition, performance should be fine.
■ For data analytics, WC NoSQL is more likely to be integrated with Spark to do query aggregations and data analysis.
○ Spark can move large chunks of data in and out for data analysis.
■ Integrate with Solr or Elasticsearch to search and index large data sets (big data ecosystem).
Keyspaces
■ Keyspaces are similar to a relational DB schema.
■ They are the highest level of the data model.
■ Usually contain many tables.
○ A grouping of tables.
■ Define how many replicas the data has and which DCs it is replicated to.
■ Define options that apply to all included tables.
■ Keyspaces are created using the CREATE KEYSPACE command.
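As an illustration of how a keyspace carries the per-DC replication settings, the helper below builds a CREATE KEYSPACE statement. The helper name, the keyspace name `blog`, and the data center names `dc1` and `dc2` are hypothetical; the statement would still need to be executed through a driver or cqlsh.

```python
def create_keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replica count per data center."""
    dc_options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dc_options}}};"
    )


statement = create_keyspace_cql("blog", {"dc1": 3, "dc2": 2})
```

Every table created in `blog` would then inherit these replication settings.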
Batches
■ Batches to a single partition are applied as a single mutation and are recommended.
■ Batch statements with mutations to several partitions at once are strongly discouraged.
○ Individual mutations perform better than such a batch.
■ Only modification statements (INSERT, UPDATE, or DELETE) are allowed.
■ Batches are atomic, i.e., everything succeeds, or nothing does. No Rollbacks.
■ Isolation is guaranteed at the partition level but not across all involved partitions. A mutation might become visible in some partitions while other mutations from the same batch have not yet been applied to the other partitions.
■ Batches are not transactional, but they can include LWTs. If multiple LWTs are used, they need to target the same partition.
■ To update counters, a "counter batch" is required.
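One way an application could follow the single-partition recommendation is to group pending mutations by partition key before batching, so that each batch touches exactly one partition. A minimal sketch, assuming mutations are plain dictionaries and the partition key is (userid, year) as in the blog post table:

```python
from collections import defaultdict


def group_by_partition(mutations, partition_key):
    """Group mutations so each group targets exactly one partition."""
    groups = defaultdict(list)
    for mutation in mutations:
        groups[partition_key(mutation)].append(mutation)
    return dict(groups)


mutations = [
    {"userid": 1, "year": 2022, "postcontent": "a"},
    {"userid": 2, "year": 2022, "postcontent": "b"},
    {"userid": 1, "year": 2022, "postcontent": "c"},
]
# Each value in `batches` is a candidate single-partition batch.
batches = group_by_partition(mutations, lambda m: (m["userid"], m["year"]))
```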
Materialized Views - Wide Column NoSQL (1 of 2)
■ An MV is a view of a "base table".
○ ScyllaDB creates it as a separate (read-only) table.
○ It takes up space in the cluster, like any table.
■ An MV row might be on a different node than its base table row, based on the Partition Key.
○ The view itself exists across the entire cluster.
■ Automatically updated when the base table is updated.
■ All of the base table's Primary Key components MUST also appear in the MV's key.
■ A view can have some or all of the base table's columns and use different sorting orders.
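The key rule is easier to see in a toy model. The sketch below is not how either database maintains views; the class, the `category` column, and the sample rows are assumptions for illustration. It shows a base table keyed by (userid, postid) and a view keyed by (category, userid, postid): because the view key contains every base key column, the stale view row can be located and replaced when a base row changes.

```python
class PostsWithView:
    """Base table keyed by (userid, postid) plus a view keyed by category."""

    def __init__(self):
        self.base = {}   # (userid, postid) -> row
        self.view = {}   # (category, userid, postid) -> row

    def upsert(self, userid, postid, category, title):
        old = self.base.get((userid, postid))
        if old is not None:
            # The base key inside the view key identifies the stale view row.
            del self.view[(old["category"], userid, postid)]
        row = {"userid": userid, "postid": postid,
               "category": category, "title": title}
        self.base[(userid, postid)] = row
        self.view[(category, userid, postid)] = row   # view updated with the base

    def titles_in(self, category):
        """Read through the view: rows for one view partition, in key order."""
        return [self.view[k]["title"] for k in sorted(self.view) if k[0] == category]


posts = PostsWithView()
posts.upsert(1, 10, "databases", "LSM trees")
posts.upsert(2, 11, "databases", "Token rings")
posts.upsert(1, 10, "storage", "LSM trees")   # same base row moves to another view partition
```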
Materialized Views - Wide Column NoSQL (2 of 2)
Cassandra
■ Cassandra's Materialized Views are considered experimental due to their instability.
■ Not recommended for Production workloads in Cassandra.
ScyllaDB
■ Production Ready.
Data Modeling Rules
Intro to Wide Column NoSQL Data Modeling
■ Data Model Design - "Measure Twice, Cut Once".
○ Spend time on Requirements and Design.
○ Especially important when working on the Data Model.
○ It guides the design of the rest of the solution.
○ Tech Debt - Difficult to change later.
■ How will the data be distributed, accessed, and used?
■ Design for common as well as uncommon usage.
■ Avoid "Hot Partitions".
■ Avoid Large Partitions.
Data Modeling Goals
The goal of data modeling is to design a DB cluster that is performant, complete, and organized. It should provide the following:
■ Data is distributed as evenly as possible across nodes.
■ Each read query accesses a minimal number of nodes / partitions.
○ Ideally, only one partition.
○ E.g., avoid range queries that span partitions.
Data Modeling Process for Wide-Column NoSQL
■ Start with the Conceptual Data Model.
■ The Application Workflow comes right behind it.
○ Start thinking about queries.
■ The Logical Data Model comes from those two.
○ Primary Key selection (Partition and Clustering Keys) is critical here.
■ Physical Data Model - create the actual DB using CQL.
■ Review, Test, and Optimize the model.
Model Tables Around Query Patterns
■ Identify the most common query patterns, then design the tables
around them.
■ All the data must be available for the query, without joins to other
tables, and in the order it needs to be returned.
■ If any joins or other sorts are needed, they have to be done by the
application.
Model Tables Around Query Patterns (Example 1)
■ For the query "Give me the post content for userID #____ in the year _____ sorted by time of post," the table would be designed like this:
CREATE TABLE blogpostsbyuseryear (
    userid bigint,
    posttime timestamp,
    postid uuid,
    postcontent text,
    year bigint,
    PRIMARY KEY ((userid, year), posttime)
) WITH CLUSTERING ORDER BY (posttime DESC);
■ Notice that we broke out "year" from the posttime timestamp.
SELECT postcontent FROM blogpostsbyuseryear WHERE userid = N AND year = 2022;
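A toy model of this table may help show why the year is broken out and why the rows come back already sorted. It is a sketch with in-memory dictionaries, not driver code; the function names and the sample posts are assumptions.

```python
from datetime import datetime

# (userid, year) -> list of (posttime, postcontent), kept newest first
partitions = {}


def insert_post(userid, posttime, postcontent):
    """Write one post; the year bucket is derived from the timestamp."""
    rows = partitions.setdefault((userid, posttime.year), [])
    rows.append((posttime, postcontent))
    rows.sort(key=lambda row: row[0], reverse=True)   # mirrors posttime DESC


def posts_for(userid, year):
    """Single-partition read: rows are returned in their stored order."""
    return [content for _, content in partitions.get((userid, year), [])]


insert_post(42, datetime(2022, 3, 1, 9, 0), "spring post")
insert_post(42, datetime(2022, 11, 9, 18, 30), "autumn post")
insert_post(42, datetime(2021, 12, 31, 23, 59), "last year's post")
```

Bucketing by year also keeps any one user's partition from growing without bound.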
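Model Tables Around Query Patterns (Example 2)
■ For the query "Give me the names of users who posted today, sorted by last name," we would need another table:
CREATE TABLE blogpostsbyusertoday (
    postdate date,
    userlastname text,
    userfirstname text,
    userid bigint,
    posttime timestamp,
    year bigint,
    PRIMARY KEY (postdate, userlastname, userid)
) WITH CLUSTERING ORDER BY (userlastname ASC, userid ASC);
■ The partition key is a date column (postdate) rather than the full timestamp, so that one equality predicate selects the whole day. userid is part of the clustering key so that users who share a last name do not overwrite each other.
SELECT userfirstname, userlastname FROM blogpostsbyusertoday WHERE postdate = '2022-11-09';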
Model Tables Around Query Patterns (Example 3)
■ For the query "How many users posted today?" we could use a Counter type to avoid aggregation for performance reasons:
CREATE TABLE blogpostcounttoday (
    postdate date,
    counter_value counter,
    PRIMARY KEY (postdate)
);
SELECT counter_value FROM blogpostcounttoday WHERE postdate = '2022-11-09';
Conceptual Data Modeling
■ What are the business requirements?
■ What data is available and what is to be stored?
■ Work out the basic concepts here.
Partitions
■ Don't let partitions get too large.
■ Don't let one or more partitions get too busy.
■ Hot Partitions - Very busy partitions that don't spread the load.
○ These will have a serious impact on performance.
○ E.g., a LIFO design, where the latest data is what everyone wants (like sports scores), could lead to a Hot Partition.
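A quick way to reason about hot partitions is to look at how requests are spread over partition keys. The sketch below assumes you already have a sample of accessed partition keys (for example from application logs); the function name and the sample keys are hypothetical.

```python
from collections import Counter


def hottest_partition(accessed_keys):
    """Return the busiest partition key and its share of all requests."""
    counts = Counter(accessed_keys)
    key, hits = counts.most_common(1)[0]
    return key, hits / len(accessed_keys)


# Sports-score style workload: almost everyone asks for the latest game.
requests = ["game-42"] * 8 + ["game-1", "game-2"]
key, share = hottest_partition(requests)
```

A share this lopsided means one partition, and therefore one set of replicas, absorbs most of the load.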
Minimize Number of Partitions to be Read
■ Queries should read from as few partitions as possible.
■ Fewer fetched partitions mean faster queries.
■ The reason for this is that each partition can be stored on a different node.
■ When you issue a query that touches several partitions, the coordinator generally needs to send the command to several nodes.
■ This results in additional overhead and increases the standard deviation in latency.
■ Even if all partitions are stored on a single node, the way rows are stored in WC NoSQL makes it cheaper to read data from a single partition than from several at the same time.
Spread Data Evenly Across the Cluster
■ Try to spread data evenly across the cluster, avoiding data hotspots that put pressure on specific nodes.
■ The partition key – the first element of the primary key – determines which node stores the data.
■ It is responsible for data distribution across the cluster.
○ It is thus of the utmost importance to choose the partition key wisely.
Design for Storage
■ Data duplication is expected in WC NoSQL, as data is stored in multiple tables to support multiple queries.
■ Today, data storage is considered cheaper than other server resources, and expectations are high that queries return quickly, even for very large datasets.
■ Duplication still multiplies the space used, so planning storage requirements is necessary.
■ After you have created the logical and physical models of your schema, calculate storage needs by looking at the space required by the individual data types in each table.
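A rough, back-of-the-envelope version of that calculation is sketched below for the blog post table. It is only a first approximation: it uses the fixed sizes of bigint (8 bytes), timestamp (8 bytes), and uuid (16 bytes), takes an assumed average size for text, multiplies by the replication factor, and ignores per-row overhead, compression, and compaction headroom. The row count and average text size are made-up inputs.

```python
FIXED_BYTES = {"bigint": 8, "timestamp": 8, "uuid": 16}


def estimate_table_bytes(column_types, rows, replication_factor, avg_text_bytes):
    """Raw data size: per-row bytes x rows x replicas (no overhead, no compression)."""
    row_bytes = 0
    for cql_type in column_types.values():
        if cql_type == "text":
            row_bytes += avg_text_bytes          # assumption: average text length
        else:
            row_bytes += FIXED_BYTES[cql_type]
    return row_bytes * rows * replication_factor


blogposts = {
    "userid": "bigint",
    "posttime": "timestamp",
    "postid": "uuid",
    "postcontent": "text",
    "year": "bigint",
}
estimate = estimate_table_bytes(
    blogposts, rows=1_000_000, replication_factor=3, avg_text_bytes=1_000
)
```

Repeating the estimate for each denormalized table shows what the duplication costs in total.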
Logical Data Modeling - Keys
■ Proper Key selection is critical to performance, data distribution, and sorting capabilities.
■ The partition key is hashed to a token, which is placed on the token ring. Automatic sharding then determines which node owns the data and which nodes it is replicated to, according to the replication factor.
○ Data is automatically sharded and distributed across the cluster.
■ The Clustering Key (AKA Sort Key) sorts the data within a given partition.
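The token ring can be sketched as follows. This is a simplified illustration, not the databases' actual placement logic: MD5 stands in for the real partitioner hash, each node owns a single token, and replicas are simply the next nodes clockwise on the ring. The class, node names, and token values are assumptions.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: one token per node, replicas are the next nodes clockwise."""

    def __init__(self, node_tokens, replication_factor):
        self.ring = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.ring]
        self.rf = replication_factor

    @staticmethod
    def token_for(partition_key):
        # Stand-in hash: first 8 bytes of MD5 as an unsigned 64-bit token.
        digest = hashlib.md5(repr(partition_key).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def replicas_for_token(self, token):
        # The owner is the first node whose token is >= the key's token (wrapping).
        first = bisect.bisect_left(self.tokens, token) % len(self.ring)
        return [self.ring[(first + i) % len(self.ring)][1] for i in range(self.rf)]

    def replicas(self, partition_key):
        return self.replicas_for_token(self.token_for(partition_key))


ring = TokenRing(
    {"n1": 0, "n2": 2**62, "n3": 2**63, "n4": 3 * 2**62},
    replication_factor=3,
)
owners = ring.replicas((42, 2022))   # partition key (userid, year)
```

The same partition key always hashes to the same token, so every coordinator agrees on which nodes hold a given partition.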
Secondary Indexes
Secondary Indexes in a Relational DB
■ Alternate access path to rows.
■ Filtering based on values.
■ Can be very effective when designed and used correctly.
ScyllaDB Secondary Indexes
■ Implemented differently from Cassandra, where they are only Local.
■ ScyllaDB Secondary Indexes can be either Global or Local.
■ ScyllaDB builds its Secondary Indexes on top of Materialized Views.
Queries
Queries - Performance Recommendations
■ The data model is a critical part of a highly performant DB.
○ Expect an order of magnitude difference in performance between poorly designed and well-designed data models.
○ Queries need to be an early part of the data modeling process.
■ Use Prepared Statements, with placeholders.
■ Don't use "SELECT *" if you don't need all of the columns for the rows returned.
■ Avoid full table scans.
■ Avoid large Mutations (insert/update/delete), as they can create a performance bottleneck.
Queries - Prepared Statements - Overview
An application will often reuse the same queries with different values (parameters). This is where prepared statements shine.
■ Repeated CQL statements can be prepared and saved in a PreparedStatement object.
■ Every time the statement is executed, the parameters are passed in as arguments.
■ Before a query is executed, the DB parses it. This is a costly operation, so it is better to prepare the statement once and reuse it.
■ This is more efficient than simple statements for queries that are used often.
PreparedStatement prodS1 = session.prepare("SELECT sku FROM product WHERE sku = ?");
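The parse-once benefit can be illustrated without a cluster. The sketch below uses a hypothetical `StubSession` class that only loosely mirrors the prepare/execute shape of a driver session; it is a stand-in that counts parses, not a real driver API, and the sample rows are made up.

```python
class StubSession:
    """Stand-in for a driver session; counts how often a statement is parsed."""

    def __init__(self):
        self.parse_count = 0
        self.rows = {"A-1": {"sku": "A-1"}, "B-2": {"sku": "B-2"}}

    def prepare(self, cql):
        self.parse_count += 1          # parsed once, at prepare time
        return ("prepared", cql)

    def execute(self, statement, parameters):
        if isinstance(statement, str):
            self.parse_count += 1      # simple statements are parsed on every call
        (sku,) = parameters            # values travel separately from the query text
        row = self.rows.get(sku)
        return [row] if row else []


query = "SELECT sku FROM product WHERE sku = ?"

prepared_session = StubSession()
prepared = prepared_session.prepare(query)
for sku in ["A-1", "B-2", "A-1"]:
    prepared_session.execute(prepared, [sku])

simple_session = StubSession()
for sku in ["A-1", "B-2", "A-1"]:
    simple_session.execute(query, [sku])
```

After three executions the prepared path has parsed the query once, while the simple path has parsed it three times.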
Queries - Prepared Statements - Advantages
■ Prevents CQL Injection (Security).
■ Queries are only parsed once, which saves time on every later execution.
■ Routing Key (token awareness) metadata is cached.
■ Further optimizations are still to come, some already planned.
■ Reduces data sent over the network.
■ Use them where possible; there are a lot of great reasons.
Queries - Lightweight Transactions (LWTs)
■ Based on the Paxos Consensus Algorithm.
■ LWTs are a feature of WC NoSQL that applies a write only if a condition holds (e.g., IF NOT EXISTS), as one atomic compare-and-set within a single partition.
■ In a relational DB, this would be a transaction.
■ NoSQL DBs are not ACID compliant, but LWTs are a step in that direction.
■ No rollback is possible; if the LWT fails, the query fails, and no data is updated.
■ Expensive - Four round trips; therefore, use only when necessary.
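The compare-and-set behaviour of an LWT can be sketched in a single process. This leaves out the expensive part entirely (the Paxos rounds between replicas) and only shows the semantics: the write is applied if the condition holds, otherwise nothing changes. The function names, the `users` table, and the sample values are hypothetical.

```python
def insert_if_not_exists(table, key, row):
    """Mimics INSERT ... IF NOT EXISTS: returns (applied, current_row)."""
    if key in table:
        return False, table[key]
    table[key] = row
    return True, row


def update_if(table, key, column, expected, new_value):
    """Mimics UPDATE ... IF column = expected: a compare-and-set on one row."""
    row = table.get(key)
    if row is None or row.get(column) != expected:
        return False, row            # condition failed: no data is updated
    row[column] = new_value
    return True, row


users = {}
first = insert_if_not_exists(users, "user-1", {"email": "a@example.com"})
second = insert_if_not_exists(users, "user-1", {"email": "b@example.com"})
changed = update_if(users, "user-1", "email", "a@example.com", "c@example.com")
stale = update_if(users, "user-1", "email", "a@example.com", "d@example.com")
```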
Change Data Capture (CDC) Overview
■ When enabled, you can query a table's current data or the history of its changes.
■ CDC uses disk space for each enabled table.
○ When the CDC space for a table fills up, the related table no longer accepts writes.
■ CDC's free space needs to be managed on all nodes.
○ More work for larger clusters (think hundreds of servers or more).
■ Ensure you have a cleanup mechanism.
Example Use Cases
■ Replication between heterogeneous DBs.
○ E.g., replication to Elasticsearch.
■ Implementing a Notification System.
■ Fraud Detection - In-flight analytics, looking for (abnormal) patterns in the changes.
What We Covered
Additional Resources
Learning
■ ScyllaDB University - university.scylladb.com/
■ Pythian's Blog - blog.pythian.com/technical-track/
■ Check out Cassandra materials
○ github.com/Anant/awesome-cassandra
Professional Services
■ Pythian Services - pythian.com/
■ ScyllaDB - www.scylladb.com/
What We Covered
This talk is about Wide Column NoSQL (ScyllaDB and Cassandra).
■ Wide Column NoSQL Overview
■ Pros and Cons
■ Data Modeling RDBMS vs Wide Column NoSQL
■ Data Modeling Rules
■ Queries
■ Additional Resources
Keep in Touch!
Allan Mason
Lead Database Consultant.
Pythian Services
■ amason@pythian.com
■ @_digitalknight
■ linkedin.com/in/allan-mason-7b50b426
54

More Related Content

What's hot

MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB plc
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinDataStax Academy
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxNeoClova
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!ScyllaDB
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Altinity Ltd
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Mydbops
 
Query Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLQuery Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLChristian Antognini
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevAltinity Ltd
 
Linux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQLLinux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQLYoshinori Matsunobu
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Performance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabasePerformance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabaseTung Nguyen Thanh
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
 
ProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewRené Cannaò
 
Optimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceOptimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceMariaDB plc
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides Altinity Ltd
 
MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개rockplace
 

What's hot (20)

Cassandra 101
Cassandra 101Cassandra 101
Cassandra 101
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & Optimization
 
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadinC* Summit 2013: The World's Next Top Data Model by Patrick McFadin
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin
 
MySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptxMySQL8.0_performance_schema.pptx
MySQL8.0_performance_schema.pptx
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0
 
Query Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLQuery Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQL
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander ZaitsevClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
ClickHouse in Real Life. Case Studies and Best Practices, by Alexander Zaitsev
 
Linux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQLLinux and H/W optimizations for MySQL
Linux and H/W optimizations for MySQL
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Performance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL DatabasePerformance Tuning And Optimization Microsoft SQL Database
Performance Tuning And Optimization Microsoft SQL Database
 
Building Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and Kafka
 
ProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management OverviewProxySQL High Avalability and Configuration Management Overview
ProxySQL High Avalability and Configuration Management Overview
 
Optimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performanceOptimizing MariaDB for maximum performance
Optimizing MariaDB for maximum performance
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides A Day in the Life of a ClickHouse Query Webinar Slides
A Day in the Life of a ClickHouse Query Webinar Slides
 
MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개MySQL InnoDB Cluster 소개
MySQL InnoDB Cluster 소개
 

Similar to Modeling Data and Queries for Wide Column NoSQL

The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017Alex Robinson
 
MySQL NDB Cluster 8.0
MySQL NDB Cluster 8.0MySQL NDB Cluster 8.0
MySQL NDB Cluster 8.0Ted Wennmark
 
NoSQL Intro with cassandra
NoSQL Intro with cassandraNoSQL Intro with cassandra
NoSQL Intro with cassandraBrian Enochson
 
Sql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explainedSql vs NO-SQL database differences explained
Sql vs NO-SQL database differences explainedSatya Pal
 
Chapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choicesChapter1: NoSQL: It’s about making intelligent choices
Chapter1: NoSQL: It’s about making intelligent choicesMaynooth University
 
ClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale outClustrixDB: how distributed databases scale out
ClustrixDB: how distributed databases scale outMariaDB plc
 
Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?Why no sql ? Why Couchbase ?
Why no sql ? Why Couchbase ?Ahmed Rashwan
 
NoSql Data Management
NoSql Data ManagementNoSql Data Management
NoSql Data Managementsameerfaizan
 
Introduction to ClustrixDB
Introduction to ClustrixDBIntroduction to ClustrixDB
Introduction to ClustrixDBI Goo Lee
 
my no sql introductiobkjhikjhkjhkhjhgchjvbbnn.ppt
my no sql introductiobkjhikjhkjhkhjhgchjvbbnn.pptmy no sql introductiobkjhikjhkjhkhjhgchjvbbnn.ppt
my no sql introductiobkjhikjhkjhkhjhgchjvbbnn.pptwondimagegndesta
 
The No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelThe No SQL Principles and Basic Application Of Casandra Model
The No SQL Principles and Basic Application Of Casandra ModelRishikese MR
 
Vote NO for MySQL
Vote NO for MySQLVote NO for MySQL
Vote NO for MySQLUlf Wendel
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
Modeling Data and Queries for Wide Column NoSQL

  • 1. High Performance NoSQL Masterclass Modeling Data and Queries for Wide Column NoSQL Allan Mason
  • 2. Allan Mason ■ Lead Database Consultant. ■ Privileged to Mentor and Lead a skilled team of DBAs at Pythian. ■ Senior SW Engineer - video games to DB tools. ■ International Speaker. ■ Fiction Writer. ■ Father of Two.
  • 3. Pythian Services Inc | 3 Data is in our DNA. We do not just provide database advice and consulting. We are your database partner. 25 Years in Business 420+ Experts across every Data Domain & Technology 400+ Global Customers
  • 4. Pythian Services Inc | 4 Data Estate Planning, Professional Services, Managed Services RDBMS NoSQL Cloud databases Oracle Data lakes/data warehouse Oracle Exadata Microsoft SQL Server MySQL Postgres DB2 Informix HANA MaxDB Vertica DB MongoDB Cassandra HBase Scylla OCI DBCS OCI ADB Amazon RDS Amazon Aurora MS Azure SQL Database MS Azure Cosmos DB Google Cloud Datastore Google Cloud Spanner Google Cloud SQL Google Cloud Bigtable Hadoop/Spark Amazon Redshift MS Synapse Analytics MS Azure Data Lake Storage Google BigQuery Oracle Exadata Oracle Autonomous Database Snowflake Cloud Data WH Database Monitoring and Alerting Proprietary Tool: AvailX Cloud Migration
  • 5. Overview
  • 6. Agenda This talk is about Wide Column NoSQL (ScyllaDB and Cassandra). ■ Wide Column NoSQL Overview ■ Pros and Cons ■ Data Modeling RDBMS vs Wide Column NoSQL ■ Data Modeling Rules ■ Queries ■ Additional Resources
  • 7. Wide Column NoSQL Overview ■ Cassandra and ScyllaDB are both Wide Column NoSQL DBs. ■ ScyllaDB has a fully compatible API with Cassandra. ○ ScyllaDB is written in C++. ○ Cassandra is written in Java. ■ They are powerful tools, well designed to scale to millions of operations per second across geographically distributed locations while remaining highly available. ■ Both use a Log-Structured Merge Tree (LSM Tree) storage engine.
  • 8. Overview - Pros ■ Stability - Pythian has worked on Cassandra clusters that have operated for years without interruption. ○ High Availability ○ Self-Healing and Automation ○ No single point of failure (SPOF) when correctly configured (RF, CF, etc.). ■ Scalability ■ Horizontal Scaling (Sharding) built-in. ○ Linear scaling to hundreds of nodes or more is very realistic if you need it. ■ The LSM Tree engine makes writes and reads very performant. ■ Vendor Independent FOSS.
  • 9. Overview - Cons ■ High Disk Usage - The LSM Tree engine requires "Compaction." ○ The Compaction process can eat a lot of disk space. ○ Review the available compaction strategies. ■ Poor engine performance when reads exceed writes by a large margin. ■ Far fewer community open-source tools compared to MySQL/MariaDB and PostgreSQL. ○ https://cassandra.apache.org/_/ecosystem.html ■ Great at what it does, but not a generic do-it-all DB.
  • 10. Overview - Cassandra Specific Cons ■ Java ○ Limits on the Heap ○ JMX ○ Garbage Collection ■ Significant CPU spikes during Compaction and Garbage Collection. ■ Default settings can significantly impact performance.
  • 11. Data Modeling Wide Column vs. RDBMS
  • 12. Relational DB Management Systems (RDBMS) Schema-First Design ■ A type of database management system based on the relational model invented by Edgar F. Codd. ○ Entirely driven by data. ■ Normalization – the process of structuring a DB to reduce data redundancy and improve data integrity. ○ Central role in relational design. ○ The goal is to store each entity in a single location, minimizing the application's management of INSERT, UPDATE, and DELETE changes. ○ Duplicated data makes ensuring data integrity a challenge. ■ After the data is normalized into tables and their relationships, queries are written against them.
  • 13. Wide Column NoSQL - Query-First Design ■ Data in WC NoSQL is structured differently. ■ The key goal is very fast data access without the joins that schema normalization makes necessary. ■ Driven by the queries, not by the data. ○ Identify the expected queries and design the tables around them. ○ Achieves more efficient reads. ○ Data duplication is not considered a problem.
  • 14. No Referential Integrity RDBMS ■ Referential integrity is an RDBMS feature that keeps relationships between tables accurate by applying constraints (e.g., foreign keys). ■ This prevents applications or users from writing inaccurate data or references that point to data that does not exist. ○ Relationships between data linked by keys remain consistent. ○ Important in relational DBs, as queries often combine data from multiple tables. Wide Column NoSQL ■ WC NoSQL - No referential integrity (foreign key) support. ○ Cascading deletes are not supported.
  • 15. Atomicity ■ Atomicity: Either all of a Transaction's operations are completed, or none of them are. ■ WC NoSQL write operations are atomic at the partition level; inserts, updates, or deletes of two or more rows in the same partition are treated as one write operation. ○ E.g., when writing with a Consistency Level (CL) of QUORUM and a Replication Factor (RF) of 3, the coordinator sends the write to all 3 replicas and waits for acknowledgement from 2 of them. ○ If the write fails on one of the replicas but succeeds on another, a failure to replicate the write on that replica is reported back. ○ However, the replicated write that succeeded on the other replica is not automatically removed, as there is no roll-back support.
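The quorum arithmetic in the example above can be sketched in a few lines of Python. This is a minimal model, not driver code: the names `quorum` and `write_result` are hypothetical, and the sketch only captures how many acknowledgements a QUORUM write waits for and that replicas which applied a write keep it even when the write is reported as failed.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge at CL=QUORUM."""
    return replication_factor // 2 + 1


def write_result(replication_factor: int, acks: int) -> dict:
    """Model the outcome of one QUORUM write (hypothetical helper).

    `acks` is how many replicas applied the write. Those replicas keep
    the data even if the write is reported as failed: no rollback.
    """
    needed = quorum(replication_factor)
    return {
        "needed": needed,
        "success": acks >= needed,
        "replicas_holding_write": acks,
    }


print(write_result(3, 2))  # quorum reached
print(write_result(3, 1))  # reported as failed, one replica still holds the write
```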
  • 16. Denormalization ■ WC NoSQL is optimized for writes. ■ Writing data in multiple locations incurs a minimal penalty. ■ Unlike in relational databases, data duplication is treated as a good thing. ○ This provides different preset combinations of the data to serve different queries. ■ With RF=3 and CL=QUORUM for both reads and writes, we expect strong consistency.
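The strong-consistency expectation follows from a commonly used overlap rule: if the number of replicas read plus the number of replicas written exceeds RF, every read touches at least one replica that has the latest write. The check below is a sketch of that rule; `is_strongly_consistent` is an illustrative name, not an API of Cassandra or ScyllaDB.

```python
def quorum(replication_factor: int) -> int:
    """Replicas required at CL=QUORUM."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when the read set and the write set must share a replica."""
    return read_replicas + write_replicas > rf


rf = 3
q = quorum(rf)
print(is_strongly_consistent(q, q, rf))  # QUORUM reads with QUORUM writes
print(is_strongly_consistent(1, q, rf))  # CL=ONE reads with QUORUM writes
```

With RF=3 the first call returns `True` (2 + 2 > 3) and the second returns `False` (1 + 2 is not greater than 3), which is why the consistency level of the reads matters as much as that of the writes.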
  • 17. Sorting ■ Relational DBs often return rows in the order in which they were written, but guarantee no particular order by default. ○ Use ORDER BY to sort the records returned by a query. ■ In WC NoSQL, sorting is a design decision. Sort order is based on the CLUSTERING ORDER specified in the table definition. CREATE TABLE blogpostsbyuseryear ( userid bigint, posttime timestamp, postid uuid, postcontent text, year bigint, PRIMARY KEY ((userid, year), posttime)) WITH CLUSTERING ORDER BY (posttime DESC);
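To illustrate why sorting is a design decision, the following Python sketch mimics a partition whose rows are kept in clustering order at write time, so a read returns them already sorted. It is a toy in-memory model under stated assumptions (the `insert_post` and `read_posts` helpers are hypothetical), not how the storage engine is implemented.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime

# Partition key (userid, year) -> rows kept sorted by posttime, newest first.
partitions = defaultdict(list)


def insert_post(userid: int, posttime: datetime, content: str) -> None:
    key = (userid, posttime.year)
    # Negating the timestamp makes an ascending insert yield DESC order.
    insort(partitions[key], (-posttime.timestamp(), content))


def read_posts(userid: int, year: int) -> list:
    # No sort at read time: rows are already stored in clustering order.
    return [content for _, content in partitions[(userid, year)]]


insert_post(1, datetime(2022, 3, 1), "march")
insert_post(1, datetime(2022, 11, 9), "november")
insert_post(1, datetime(2022, 1, 5), "january")
print(read_posts(1, 2022))
```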
  • 18. Aggregation in Wide Column NoSQL ■ Aggregation functions (count, sum, etc.) are supported. ○ However, for performance reasons, it is recommended to use an external tool, such as Spark, for these jobs. ○ If only one partition is accessed, though, performance should be fine. ■ For data analytics, WC NoSQL is more likely to be integrated with Spark to do query aggregations and data analysis. ○ Spark can move large chunks of data in and out for data analysis. ■ Integrate with Solr or Elasticsearch to search and index large data sets (big data ecosystem).
  • 19. Keyspaces ■ Keyspaces are similar to a relational DB schema. ■ They are the highest level of the data model. ■ Usually contain many tables. ○ A grouping of tables. ■ Define how many replicas the data has and which DCs it is replicated to. ■ Define options that apply to all included tables. ■ Keyspaces are created using the CREATE KEYSPACE command.
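As a rough illustration of what a keyspace definition carries, this Python helper renders a CREATE KEYSPACE statement with a per-data-center replication factor. The helper name `create_keyspace_cql`, the keyspace name `blog`, and the data center names `dc1` and `dc2` are assumptions made for the example.

```python
def create_keyspace_cql(name: str, replicas_per_dc: dict) -> str:
    """Render a CREATE KEYSPACE statement (hypothetical helper).

    `replicas_per_dc` maps a data center name to its replication factor.
    """
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    body = ", ".join(options)
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = " + "{" + body + "};"


print(create_keyspace_cql("blog", {"dc1": 3, "dc2": 3}))
```

The generated statement keeps three replicas in each of the two data centers; every table created in the keyspace inherits that replication setting.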
  • 20. Batches ■ Batches to a single partition are applied as a single mutation and are recommended. ■ Batch statements that mutate several partitions at once are strongly discouraged. ○ Individual mutations perform better than such a batch. ■ Only modification statements (INSERT, UPDATE, or DELETE) are allowed. ■ Batches are atomic, i.e., everything succeeds, or nothing does. No Rollbacks. ■ Isolation is guaranteed at the partition level but not across all involved partitions. A mutation might become visible in some partitions while other mutations from the same batch have not yet been applied to the other partitions. ■ Batches are not transactional, but they can include LWTs (lightweight transactions). If multiple LWTs are used, they need to target the same partition. ■ In order to update counters, a "counter batch" is required.
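One way to follow the single-partition recommendation is to group pending mutations by partition key before batching them. The sketch below does that in plain Python and renders each group as a BEGIN BATCH ... APPLY BATCH block; `group_by_partition` and `render_batch` are hypothetical helpers, and the statements are illustrative strings rather than output of a real driver.

```python
from collections import defaultdict


def group_by_partition(mutations):
    """Group (partition_key, cql_statement) pairs so that each batch
    touches exactly one partition."""
    groups = defaultdict(list)
    for partition_key, statement in mutations:
        groups[partition_key].append(statement)
    return dict(groups)


def render_batch(statements):
    """Wrap the statements for one partition in a CQL batch."""
    body = "\n".join(f"  {s}" for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;"


pending = [
    ((1, 2022), "INSERT INTO blogpostsbyuseryear (userid, year, posttime, postcontent) VALUES (1, 2022, '2022-11-09 10:00:00', 'a');"),
    ((2, 2022), "INSERT INTO blogpostsbyuseryear (userid, year, posttime, postcontent) VALUES (2, 2022, '2022-11-09 10:05:00', 'b');"),
    ((1, 2022), "INSERT INTO blogpostsbyuseryear (userid, year, posttime, postcontent) VALUES (1, 2022, '2022-11-09 11:00:00', 'c');"),
]

for key, statements in group_by_partition(pending).items():
    print(key)
    print(render_batch(statements))
```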
  • 21. Materialized Views - Wide Column NoSQL (1 of 2) ■ An MV (Materialized View) is a view of a “base table”. ○ ScyllaDB creates it as a separate (read-only) table. ○ Takes up space (table) in the cluster. ■ An MV row might be on a different node than the corresponding base table row, based on Partition Key. ○ The view itself exists across the entire cluster. ■ Automatically updated when the base table is updated. ■ All of the original table's Primary Key components MUST also appear in the MV’s key. ■ A view can have some or all of the base table's columns and use different sorting orders.
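The key rule above (every primary key column of the base table must appear in the view's key) is easy to check mechanically. This small Python function is a sketch of such a check; the function name is hypothetical and the column names are taken from the `blogpostsbyuseryear` example table.

```python
def missing_key_columns(base_primary_key, view_primary_key):
    """Return base-table primary key columns absent from the view's key."""
    view_columns = set(view_primary_key)
    return [col for col in base_primary_key if col not in view_columns]


base_pk = ["userid", "year", "posttime"]
print(missing_key_columns(base_pk, ["year", "userid", "posttime"]))  # acceptable view key
print(missing_key_columns(base_pk, ["postid", "userid"]))            # view key missing columns
```

An empty list means the proposed view key satisfies the rule; any column returned would have to be added to the view's key.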
  • 22. Materialized Views - Wide Column NoSQL (2 of 2) Cassandra ■ Cassandra's Materialized Views are considered experimental due to their instability. ■ Not recommended for Production workloads in Cassandra. ScyllaDB ■ Production Ready.
  • 23. Data Modeling Rules
  • 24. Intro to Wide Column NoSQL Data Modeling ■ Data Model Design - "Measure Twice, Cut Once". ○ Spend time on Requirements and Design. ○ Especially Important when working on the Data Model. ○ It guides design of the rest of the solution. ○ Tech Debt - Difficult to change later. ■ How will the data be distributed, accessed, and used? ■ Design for common as well as uncommon usage. ■ Avoid "Hot Partitions". ■ Avoid Large Partitions.
  • 25. Data Modeling Goals The goal of data modeling is to design a DB cluster that is performant, complete, and organized. It should provide the following: ■ Data is evenly distributed across nodes, ideally. ■ Minimize the number of nodes / partitions accessed in a read query. ○ Ideally, only one partition. ○ E.g., avoid range queries.
  • 26. Data Modeling Process for Wide-Column NoSQL ■ Consider the Conceptual Data Model. ■ The Application Workflow comes right behind it. ○ Start thinking about queries. ■ The Logical Data Model will come from those. ○ Primary Key (Partition Key and Clustering Key) selection is critical here. ■ Physical Data Model - create the actual DB using CQL. ■ Review, Test, and Optimize the model.
  • 27. Model Tables Around Query Patterns ■ Identify the most common query patterns, then design the tables around them. ■ All the data must be available for the query, without joins to other tables, and in the order it needs to be returned. ■ If any joins or other sorts are needed, they have to be done by the application.
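The idea of one table per query pattern can be sketched with two in-memory structures that are both written on every post, so that each query reads a single pre-built partition. This is a toy Python model; the table-like names and the `save_post` helper are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

posts_by_user_year = defaultdict(list)  # partition key: (userid, year)
posters_by_day = defaultdict(set)       # partition key: calendar date


def save_post(userid, lastname, posttime, content):
    """Write the same event to every table that serves a query."""
    posts_by_user_year[(userid, posttime.year)].append((posttime, content))
    posters_by_day[posttime.date()].add((lastname, userid))


save_post(7, "Mason", datetime(2022, 11, 9, 10, 0), "hello")
save_post(9, "Adams", datetime(2022, 11, 9, 12, 30), "hi")

# Each query reads exactly one pre-built partition; no join is needed.
print(posts_by_user_year[(7, 2022)])
print(sorted(posters_by_day[datetime(2022, 11, 9).date()]))
```

The duplication is deliberate: the write path does a little more work so that neither read needs a join or an application-side sort across tables.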
  • 28. Model Tables Around Query Patterns (Example 1) ■ For the query "Give me the post content for userID #____ in the year _____ sorted by time of post," the table would be designed like this: CREATE TABLE blogpostsbyuseryear ( userid bigint, posttime timestamp, postid uuid, postcontent text, year bigint, PRIMARY KEY ((userid, year), posttime)) WITH CLUSTERING ORDER BY (posttime DESC); ■ Notice that we broke out "year" from the posttime timestamp. SELECT postcontent FROM blogpostsbyuseryear WHERE userid=N AND year=2022;
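Breaking "year" out of the timestamp also bounds how large any one partition can grow. The following Python sketch compares partition sizes for a single, hypothetical user who posts once a day for three years, first with `userid` alone as the partition key and then with `(userid, year)`.

```python
from collections import Counter
from datetime import datetime, timedelta

# One post per day for three years by a single (hypothetical) user.
start = datetime(2020, 1, 1)
posts = [(42, start + timedelta(days=d)) for d in range(3 * 365)]

# Rows per partition when the partition key is userid only.
by_user = Counter(userid for userid, _ in posts)
# Rows per partition when the partition key is (userid, year).
by_user_year = Counter((userid, t.year) for userid, t in posts)

print(max(by_user.values()))       # every row lands in one partition
print(max(by_user_year.values()))  # the largest partition holds one year of rows
print(len(by_user_year))           # number of partitions with the year bucket
```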
  • 29. Model Tables Around Query Patterns (Example 2) ■ For the query "Give me the names of users who posted today, sorted by last name," we would need another table: CREATE TABLE blogpostsbyusertoday ( userid bigint, userfirstname text, userlastname text, postdate date, year bigint, PRIMARY KEY (postdate, userlastname)) WITH CLUSTERING ORDER BY (userlastname ASC); SELECT userfirstname, userlastname FROM blogpostsbyusertoday WHERE postdate = '2022-11-09';
  • 30. Model Tables Around Query Patterns (Example 3) ■ "How many users posted today?" We could use a Counter type to avoid aggregation for performance reasons: CREATE TABLE blogpostcounttoday ( postdate date, counter_value counter, PRIMARY KEY (postdate) ); SELECT counter_value FROM blogpostcounttoday WHERE postdate = '2022-11-09';
  • 31. Conceptual Data Modeling ■ What are the business requirements? ■ What data is available and what is to be stored? ■ Work out the basic concepts here.
  • 32. Partitions ■ Don't let partitions get too large. ■ Don't let one or more partitions get too busy. ■ Hot Partitions: very busy partitions that don't spread the load. ○ These will have a serious impact on performance. ○ e.g., a LIFO design, where the latest data is what everyone wants (like sports scores), could lead to a Hot Partition.
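One simple way to reason about hot partitions is to look at how accesses are spread over partition keys. The sketch below is a minimal, assumed example: the function `hot_partitions`, the 50% threshold, and the access log are all hypothetical, chosen only to mirror the sports-scores case.

```python
from collections import Counter


def hot_partitions(accesses, threshold=0.5):
    """Return partition keys that receive more than `threshold` of all accesses."""
    counts = Counter(accesses)
    total = sum(counts.values())
    return sorted(key for key, n in counts.items() if n / total > threshold)


# Hypothetical access log: most readers want today's scores.
log = ["scores:2022-11-09"] * 8 + ["scores:2022-11-08", "scores:2022-11-07"]
hot = hot_partitions(log)
```

Here a single date-keyed partition takes 8 of 10 reads, which is the kind of skew the slide warns about.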
  • 33. Minimize Number of Partitions to be Read ■ Queries should read from as few partitions as possible. ■ Fewer fetched partitions mean faster queries. ■ The reason for this is that each partition can be stored on a different node. ■ When you issue a query, the coordinator generally will need to issue the command to several nodes. ■ This results in additional overhead and increases the standard deviation in latency. ■ Even if all partitions are stored on a single node, because of the way rows are stored in WC NoSQL, it is cheaper to read data from a single partition than from multiple ones at the same time.
  • 34. Spread Data Evenly Across the Cluster ■ Try to spread data evenly across the cluster, avoiding data hotspots that put pressure on specific nodes. ■ The partition key (the first element of the primary key) determines which node stores the data. ■ It is responsible for data distribution across the cluster. ○ It is thus of the utmost importance to choose the partition key wisely.
  • 35. Design for Storage ■ Data duplication is expected in WC NoSQL, as data is stored in multiple tables to support multiple queries. ■ In modern times, data storage is considered cheaper than other server resources, and expectations are high that queries return quickly, even for very large datasets. ■ Therefore, planning storage requirements is necessary. ■ After you have created the logical and physical models of your schema, then calculate storage needs by looking at space required by the individual data types in each table.
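A back-of-the-envelope version of that calculation might look like the sketch below. The fixed sizes are the on-disk widths of the CQL types (bigint and timestamp 8 bytes, uuid 16, int 4); the average text size, row count, and replication factor are assumptions, and the estimate deliberately ignores per-row overhead, compression, and compaction headroom.

```python
# Fixed-width CQL types, in bytes.
FIXED_SIZES = {"bigint": 8, "timestamp": 8, "uuid": 16, "int": 4}


def row_bytes(columns, avg_text_bytes):
    """Rough raw size of one row; `text` columns use an assumed average size."""
    total = 0
    for cql_type in columns.values():
        total += avg_text_bytes if cql_type == "text" else FIXED_SIZES[cql_type]
    return total


# Column types of the blogpostsbyuseryear table from Example 1.
blogpostsbyuseryear = {
    "userid": "bigint",
    "posttime": "timestamp",
    "postid": "uuid",
    "postcontent": "text",
    "year": "bigint",
}

per_row = row_bytes(blogpostsbyuseryear, avg_text_bytes=2000)  # assumed 2 KB posts
raw_total = per_row * 1_000_000 * 3  # assumed 1M rows, replication factor 3
```

Repeating this per table matters because every query-specific copy of the data adds to the total.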
  • 36. Logical Data Modeling - Keys ■ Proper Key selection is critical to performance, data distribution, and sorting capabilities. ■ The partition key is hashed to a token, which is placed on the token ring. Automatic sharding then determines which node owns the data and which nodes it is replicated to, according to the replication factor. ○ Data is automatically sharded and distributed across the cluster. ■ The Clustering Key (AKA Sort Key) sorts the data within a given partition.
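The token ring idea can be sketched in a few lines. This is a simplified model under stated assumptions: MD5 stands in for the Murmur3 hash that Cassandra and ScyllaDB use by default, each node owns a single token (no vnodes), and the node names and the `Ring` class are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to a 64-bit token (MD5 used here only as a stand-in)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        self.rf = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)  # one token per node
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and take the next `rf` nodes."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("42:2022")  # partition key (userid, year) as a string
```

The same partition key always maps to the same replicas, which is why a single-partition read can be routed to a small, fixed set of nodes.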
  • 37. Secondary Indexes Secondary Indexes in a Relational DB ■ Alternate access path to rows. ■ Filtering based on values. ■ Can be very effective when designed and used correctly. ScyllaDB Secondary Indexes ■ Implemented differently from Cassandra, where they are only Local. ■ ScyllaDB Secondary Indexes can be either Global or Local. ■ In ScyllaDB, Secondary Indexes are built on top of Materialized Views.
  • 38. Queries
  • 39. Queries - Performance Recommendations ■ The data model is a critical part of a highly performant DB. ○ Expect an order of magnitude difference in performance between poorly designed and well-designed data models. ○ Queries need to be an early part of the data modeling process. ■ Use Prepared Statements, with placeholders. ■ Don't use "SELECT *" if you don't need all of the columns for the rows returned. ■ Avoid full table scans. ■ Avoid large Mutations (insert/update/delete), as they can create a performance bottleneck.
  • 40. Queries - Prepared Statements - Overview An application will often reuse the same queries with different values (parameters). This is where prepared statements shine. ■ Repeating CQL statements can be prepared and saved in a PreparedStatement object. ■ Every time the statement is executed, the parameters are passed in as arguments. ■ Before a query is executed, the DB parses it. This is a costly operation, so it is better to prepare the statement once and reuse it with different values. ■ This is more efficient than simple statements for queries that are used often. PreparedStatement prodS1 = session.prepare("SELECT sku FROM product WHERE sku = ?");
  • 41. Queries - Prepared Statements - Advantages ■ Prevents CQL Injection (Security). ■ Queries are parsed only once, which saves time on every later execution. ■ Routing Key (token awareness) metadata is cached. ■ Future optimizations are still to come, some already planned. ■ Reduces data sent over the network. ■ Use them where possible; there are a lot of great reasons.
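The parse-once and injection points can be illustrated with a toy model. The `ToySession` class below is a hypothetical stand-in written for this sketch; it is not a real driver API and only imitates the shape of prepare and execute calls.

```python
class ToySession:
    """Toy stand-in for a driver session (hypothetical, not a real driver API)."""

    def __init__(self):
        self.parse_count = 0   # how many times a query string was parsed
        self._prepared = {}    # query string -> number of placeholders

    def prepare(self, query):
        # Parse only the first time a given query string is seen.
        if query not in self._prepared:
            self.parse_count += 1
            self._prepared[query] = query.count("?")
        return query

    def execute(self, prepared, params):
        # Values travel separately from the query text, so a value can never
        # change the structure of the statement.
        if len(params) != self._prepared[prepared]:
            raise ValueError("wrong number of bound values")
        return (prepared, tuple(params))


session = ToySession()
for sku in ["A-100", "B-200", "x' OR '1'='1"]:
    stmt = session.prepare("SELECT sku FROM product WHERE sku = ?")
    result = session.execute(stmt, [sku])
```

Three executions cost one parse, and the suspicious third value stays a plain bound value instead of becoming part of the query text.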
  • 42. Queries - Lightweight Transactions (LWTs) ■ Based on the Paxos Consensus Algorithm. ■ LWTs are a feature of WC NoSQL that allows conditional (compare-and-set) writes, such as INSERT ... IF NOT EXISTS or UPDATE ... IF, applied atomically within a single partition. ■ In a relational DB, this would be a transaction. ■ NoSQL DBs are not ACID compliant, but LWTs are a step in that direction. ■ No rollback is possible; if the LWT's condition is not met, the write is not applied and no data is updated. ■ Expensive: four round trips; therefore, use only when necessary.
  • 43. Change Data Capture (CDC) Overview ■ When enabled, you can query a table's current data or the history of changes. ■ CDC uses disk space for each enabled table. ○ When the CDC space for a table fills up, the related table no longer accepts writes. ■ CDC's free space needs to be managed on all nodes. ○ More work for larger clusters (think hundreds of servers or more). ■ Ensure you have a cleanup mechanism. Example Use Cases ■ Replication between heterogeneous DBs. ○ e.g., replication to Elasticsearch. ■ Implementing a Notification System. ■ Fraud Detection: in-flight analytics, looking for (abnormal) patterns in the changes.
  • 44. What We Covered
  • 45. Additional Resources Learning ■ ScyllaDB University - university.scylladb.com/ ■ Pythian's Blog - blog.pythian.com/technical-track/ ■ Check out Cassandra materials ■ github.com/Anant/awesome-cassandra Professional Services ■ Pythian Services - pythian.com/ ■ ScyllaDB - www.scylladb.com/
  • 46. What We Covered This talk is about Wide Column NoSQL (ScyllaDB and Cassandra). ■ Wide Column NoSQL Overview ■ Pros and Cons ■ Data Modeling RDBMS vs Wide Column NoSQL ■ Data Modeling Rules ■ Queries ■ Additional Resources
  • 47. Keep in Touch! Allan Mason, Lead Database Consultant, Pythian Services ■ amason@pythian.com ■ @_digitalknight ■ linkedin.com/in/allan-mason-7b50b426

Editor's Notes

  1. Welcome. In this part of our class today, I'll be discussing Modeling Data and Queries for Wide Column NoSQL.
  2. A little background about me. I'm currently a Lead DB Consultant at Pythian. I feel privileged to be leading a very skilled team of DBAs. In a former life, I was a Senior SW Engineer. I'm also a speaker, and I enjoy writing. I'm the father of two great young adults.
  3. A quick overview of who I work for. Pythian started as a database service company, and continues to deliver top-of-the-line database services 25 years later. We are proud to have served more than 400 customers globally, with more than 420 experts across every data domain and technology, and counting! Pythian maximizes the value of your database by delivering advanced on-prem, hybrid, cloud, and multi-cloud solutions, solving your toughest data and analytics challenges. From database design, migrations, capacity planning, upgrades, performance tuning, and backup and recovery to round-the-clock monitoring, problem detection, and resolution, our teams help you keep your mission-critical systems operating flawlessly and in an optimized fashion, without having to worry about hiring, covering vacations, sick days, and so on.
  4. Our Database Services cover more than 25 technologies and platforms. We have database experts who have the experience you need.
  5. In this talk, I'll be focused on Wide Column NoSQL, specifically ScyllaDB and Cassandra. We'll go through an overview and some pros and cons, then data modeling and how it compares to relational data modeling. We'll cover some data modeling rules, talk about how queries are an important part of the modeling process, and then wrap up with some additional links.
  6. Cassandra and ScyllaDB are both wide column NoSQL databases. ScyllaDB has a fully compatible API with Cassandra. ScyllaDB is written in C++. Cassandra is written in Java. They are powerful tools, well designed to scale to millions of operations per second over geographically distributed locations operating in a highly available manner. They use a Log Structured Merge Tree engine (LSM Tree), which is the core of how Wide Column NoSQL works.
  7. Let's review some of the pros of Wide Column NoSQL. It's very stable. Pythian has worked on some Cassandra clusters that have operated for years without interruption. It's got amazing High Availability. Self-Healing and Automation help contribute to the stability. There is no single point of failure when correctly configured: think replication factor and consistency level, among others. It's easily scalable. Horizontal Scaling, AKA Sharding, is built in, which is really nice. It is very realistic to have linear cluster scaling, to hundreds of nodes or more if you need it. The LSM Tree engine makes writes and reads very performant. It is vendor-independent, free, open software. It's mostly platform independent, except that ScyllaDB only runs on Linux.
  8. Some of the cons of wide column NoSQL. Using an LSM Tree engine requires "Compaction." This results in High Disk Usage. The Compaction process needs a lot of free disk space. Look into the different Compaction choices to see what suits your needs. Another con is poor engine performance if your reads exceed writes by a large margin. The LSM tree isn't ideal in this case. There are also far fewer community open-source tools compared to MySQL, MariaDB, and PostgreSQL, mainly because those have been around a lot longer. Here's a link to some of the tools and resources that are available: https://cassandra.apache.org/_/ecosystem.html Wide column is great at what it does, but it's not a generic, do-it-all DB.
  9. Some of the cons are specific to Cassandra, and mostly revolve around the choice of Java as the programming language. There are some tricky catch-22 limits on the heap: it can't be too big or too small. There is JMX, and the famous Java Garbage Collection. There can be significant CPU spikes during the Compaction and Garbage Collection processes. Also, default settings can significantly impact performance. Just ensure you always review and tune your settings.
  10. Let's talk about data modeling and how it differs from relational DB data modeling.
  11. Relational databases are very much a schema-first design. Relational databases are a type of database management system based on the relational model invented by Edgar F. Codd. They are Entirely driven by data. Normalization is a part of the process of structuring a relational database in order to reduce data redundancy and improve data integrity. This plays a central role in relational design. The goal is to store an entity in a single location, to minimize application management of changes such as INSERTs, UPDATEs, and DELETEs. Duplicated data makes ensuring data integrity a challenge. After the data is normalized into a schema based on tables and their relationships, queries are then written based on it.
  12. On the other hand, Wide Column NoSQL is very much a query-first design. Data in Wide Column NoSQL is structured differently. The key goal here is very fast data access without any joins which become necessary as a result of schema normalization in the relational DB world. Wide Column NoSQL is driven by the queries, not by the data. Identify all of your expected queries and design the tables around them. This will help you achieve more efficient reads. Data duplication is not considered a problem.
  13. (Relational) Referential integrity is a relational DB feature that ensures relationships between the data in database tables remain accurate by applying constraints (e.g., foreign keys). This prevents applications or users from writing inaccurate data or references pointing to data that does not exist. It ensures that relationships between data linked by keys remain consistent. This is an important part of relational DBs, where queries often combine data from multiple tables. (Wide Column NoSQL) In WC NoSQL, there is no referential integrity (foreign key) support. As a result, things like cascading deletes are not supported.
  14. Atomicity means that either all of a transaction's operations are completed, or none of them are. In WC NoSQL, write operations are atomic at the partition level; inserts, updates, or deletes of two or more rows in the same partition are treated as one write operation. For example, if writing with a Consistency Level of Quorum and a Replication Factor of 3, the coordinator will send a given write to all three replicas, and wait for acknowledgement from 2 of them. If the write fails on one of the nodes but succeeds on another, it reports back a failure to replicate the write on that node. However, the replicated write that succeeded on the other node is not automatically removed, as there is no rollback support. This is important to remember.
  15. Denormalization. Writes are practically free with Wide Column NoSQL databases. WC NoSQL is optimized for writes. Writing data in multiple locations incurs a very minimal penalty. Unlike relational databases, data duplication is actually considered a good thing. This provides different preset combinations of the data to serve different queries. If we have a Replication Factor of 3, and use a Consistency Level of Quorum for both reads and writes, we expect strong consistency.
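The strong-consistency claim follows from simple arithmetic: a read is guaranteed to overlap the latest write when the number of replicas acknowledging the write plus the number answering the read exceeds the replication factor. A minimal sketch, with function names chosen here for illustration:

```python
def quorum(replication_factor):
    """Number of replicas in a quorum: a strict majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_acks + read_acks > replication_factor


rf = 3
q = quorum(rf)                                   # 2 of 3 replicas
quorum_both = is_strongly_consistent(rf, q, q)   # 2 + 2 > 3
one_both = is_strongly_consistent(rf, 1, 1)      # 1 + 1 > 3 does not hold
```

With RF 3, Quorum reads and Quorum writes always share at least one replica, while reading and writing at consistency level One gives no such guarantee.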
  16. Sorting. Relational DBs do not guarantee the order of returned rows unless you ask for one. To control it, use ORDER BY in the query to sort the records returned by that query. In WC NoSQL, sorting is actually a design decision. Sort order is based on the CLUSTERING ORDER, which is specified in the table definition. Here we see an example of sorting by the posttime column, as it's specified in the table definition. CREATE TABLE blogpostsbyuseryear ( userid bigint, posttime timestamp, postid uuid, postcontent text, year bigint, PRIMARY KEY ((userid, year), posttime) ) WITH CLUSTERING ORDER BY (posttime DESC);
  17. Aggregation functions such as count, sum, etc. are supported in wide column NoSQL. However, for performance reasons, it is usually recommended to use an external tool, such as Spark, for these jobs. If you're only accessing one partition, though, the performance should be fine. For data analytics, WC NoSQL is more likely to be integrated with something like Spark to do query aggregations and data analysis. Spark can handle moving large chunks of data in and out of the DB for analysis. One might also integrate wide column NoSQL with something like Solr or Elasticsearch to search and index large data sets (think big data ecosystem).
  18. Keyspaces. Keyspaces are similar to a DB schema in a relational DB. They are the highest level of the data model. They usually contain many tables, and can also be thought of as a grouping of tables. Keyspaces define how many replicas and which DCs the data will be replicated to. They also define options that apply to all included tables. Keyspaces are created using the CREATE KEYSPACE command.
  19. Batches to a single partition are applied as a single mutation and are recommended. Batch statements with mutations to several partitions simultaneously are strongly discouraged. Individual mutations are better for performance than such a batch. Only modification statements (INSERT, UPDATE, or DELETE) are allowed. Batches are atomic, i.e., everything succeeds, or nothing does. There are no rollbacks. Isolation is guaranteed at a partition level but not across all involved partitions. The mutation might become available in some partitions, but other mutations from the same batch might not have been applied yet to the other partitions. Batches are not transactional, but they can include LWTs. If multiple LWTs are used, they need to target the same partition. In order to update counters, a special type of batch, called a "counter batch," is required.
  20. Materialized Views in WC NoSQL. A Materialized View is a view of a “base table”. ScyllaDB implements a Materialized View as a separate (read-only) table. These naturally take up space in the cluster, as each one is a table. A Materialized View's data might be on a different node than the base table's, based on the Partition Key. The view itself, though, exists across the entire cluster. A Materialized View is automatically updated whenever the base table is updated. All of the original table's Primary Key components MUST also appear in the MV’s key. A view can have some or all of the base table's columns and use different sorting orders.
  21. Cassandra's Materialized Views are considered experimental due to their instability. Therefore, they are not recommended for Production workloads in Cassandra. In ScyllaDB, they are Production Ready.
  22. Some data modeling rules for Wide Column NoSQL
  23. The Data Model Design step is very much a case of "Measure Twice, Cut Once". Make sure to spend as much time as you need on the Requirements and Design. This is especially important when working on the Data Model, as it usually has the biggest impact on performance, good or bad. The data model guides the design of the rest of the solution. Tech Debt is where shortcuts are taken. These are very difficult and hence costly to change later, and the longer those bad implementation choices remain, the harder it is to fix them, as more and more of your application and infrastructure is built around them and expects things a certain way. Consider how the data will be distributed, accessed, and used. Design the model for common as well as uncommon usage. Avoid "Hot Partitions" and avoid Large Partitions.
  24. The goal of data modeling is to design a DB cluster that is performant, complete, and organized. It should provide the following: data that is ideally evenly distributed across nodes, and a minimal number of nodes / partitions accessed in a given read query. Ideally, there should be only one partition in each case, e.g., avoid range queries.
  25. Data modeling process for WC NoSQL Consider the Conceptual Data Model. This is all about the beginning basics of your solution. Then the Application Workflow follows right behind it. Here's where you Start thinking about the queries. Logical Data Model will come out of those steps. Primary and Clustering Key selection is critical at this point. Next comes the Physical Data Model - where you will create the actual DB using CQL commands. And of course, Review, Test, and Optimize your model.
  26. Model Tables Around Query Patterns Identify the most common query patterns, then design the tables around them. All the data must be available for the query, without joins to other tables, and in the order it needs to be returned. If any joins or other sorts are needed, they have to be done by the application.
  27. Model Tables Around Query Patterns (Example 1) For the query "Give me the post content for userID #____ in the year _____, sorted by time of post," the table would be designed like this: CREATE TABLE blogpostsbyuseryear ( userid bigint, posttime timestamp, postid uuid, postcontent text, year bigint, PRIMARY KEY ((userid, year), posttime) ) WITH CLUSTERING ORDER BY (posttime DESC); Notice that we broke out "year" from the posttime timestamp. SELECT postcontent FROM blogpostsbyuseryear WHERE userid = N AND year = 2022;
  28. For the query "Give me the names of users who posted today, sorted by last name," we would need another table: CREATE TABLE blogpostsbyusertoday ( userid bigint, userfirstname text, userlastname text, posttime timestamp, year bigint, PRIMARY KEY (posttime, userlastname) ) WITH CLUSTERING ORDER BY (userlastname ASC); SELECT userfirstname, userlastname FROM blogpostsbyusertoday WHERE posttime = '2022-11-09';
  29. "How many users posted today?" We could use a Counter type to avoid aggregation for performance reasons: CREATE TABLE blogpostcounttoday ( postdate date, counter_value counter, PRIMARY KEY (postdate) ); SELECT counter_value FROM blogpostcounttoday WHERE postdate = '2022-11-09';
  30. Conceptual Data Modeling What are the business requirements ? What data is available and what is to be stored ? Working out the basic concepts here.
  31. Partitions. Don't let partitions get too large. Don't let one or more partitions get too busy. Hot Partitions are very busy partitions that don't spread the load. These will have a serious impact on performance. For example, a LIFO design, where the latest data is what everyone wants (like sports scores), could lead to a Hot Partition.
  32. Minimize Number of Partitions to be Read
  33. Spread Data Evenly Across the Cluster. Try to spread data evenly across the cluster, avoiding data hotspots that put pressure on specific nodes. The partition key (the first element of the primary key) determines which node stores the data. It is responsible for data distribution across the cluster. It is thus of the utmost importance to choose the partition key wisely.
  34. Design for Storage Data duplication is expected in WC NoSQL, as data is stored in multiple tables to support multiple queries. In modern times, data storage is considered cheaper than other server resources, and expectations are high that queries return quickly, even for very large datasets. Therefore, planning storage requirements is necessary. After you have created the logical and physical models of your schema, then calculate storage needs by looking at space required by the individual data types in each table.
  35. Logical Data Modeling - Keys. Proper Key selection is critical to performance, data distribution, and sorting capabilities. The partition key is hashed to a token, which is placed on the token ring. Automatic sharding then determines which node owns the data and which nodes it is replicated to, according to the replication factor. Data is automatically sharded and distributed across the cluster. The Clustering Key (AKA Sort Key) sorts the data within a given partition.
  36. Secondary Indexes. Secondary Indexes in a Relational DB: Alternate access path to rows. Filtering based on values. Can be very effective when designed and used correctly. ScyllaDB Secondary Indexes: Implemented differently from Cassandra, where they are only Local. ScyllaDB Secondary Indexes can be either Global or Local. In ScyllaDB, Secondary Indexes are built on top of Materialized Views. Cassandra Secondary Indexes are not global, so it is not especially useful to use them as alternate paths into the data. Trying to use secondary indexes in Cassandra this way will not scale and is generally a bad idea.
  37. Queries - Performance Recommendations The data model is a critical part of a highly performant DB. Expect an order of magnitude difference in performance between poorly designed and well-designed data models. Queries need to be an early part of the data modeling process. Use Prepared Statements, with placeholders. Don't use "SELECT *" if you don't need all of the columns for the rows returned. Avoid full table scans. Avoid large Mutations (insert/update/delete), as they can create a performance bottleneck.
  38. An application will often reuse the same queries with different values (parameters). This is where prepared statements shine. Repeating CQL statements can be prepared and saved in a PreparedStatement object. Every time the statement is executed, the parameters are passed in as arguments. Before a query is executed, the DB parses it. This is a costly operation, so it is better to prepare the statement once and reuse it with different values. This is more efficient than simple statements for queries that are used often. PreparedStatement prodS1 = session.prepare("SELECT sku FROM product WHERE sku = ?");
  39. Prevents CQL Injection (Security). Queries are parsed only once, which saves time on every later execution. Routing Key (token awareness) metadata is cached. Future optimizations are still to come, some already planned. Reduces data sent over the network. Use them where possible; there are a lot of great reasons.
  40. Based on the Paxos Consensus Algorithm. LWTs are a feature of WC NoSQL that allows conditional (compare-and-set) writes, such as INSERT ... IF NOT EXISTS or UPDATE ... IF, applied atomically within a single partition. In a relational DB, this would be a transaction. NoSQL DBs are not ACID compliant, but LWTs are a step in that direction. No rollback is possible; if the LWT's condition is not met, the write is not applied and no data is updated. Expensive: four round trips; therefore, use only when necessary.
  41. When enabled, you can query a table's current data or the history of changes. CDC uses disk space for each enabled table. When the CDC space for a table fills up, the related table no longer accepts writes. CDC's free space needs to be managed on all nodes. This is more work for larger clusters (think hundreds of servers or more). Ensure you have a cleanup mechanism. Example Use Cases: Implementing a Notification System. Fraud Detection: in-flight analytics, looking for (abnormal) patterns in the changes. Heterogeneous database replication: applying captured changes to another database or table. The other database may use a different schema (or no schema at all), better suited for some specific workloads. An example is replication to Elasticsearch for efficient text searches.
  42. There was a lot we didn't cover Learning ScyllaDB University - university.scylladb.com/ Pythian's Blog - blog.pythian.com/technical-track/ Check out Cassandra materials github.com/Anant/awesome-cassandra Professional Services Pythian Services - pythian.com/ ScyllaDB - www.scylladb.com/