Migrating SQL Schemas for ScyllaDB: Data Modeling Best Practices
Pascal Desmarets
Founder & CEO
Pascal Desmarets
■ Married, father of 2 boys in business school
■ Passionate about data, technology, and doing things right
■ Avid sailboat racer, preferably offshore
Why is Data Modeling a key success factor?
Data Modeling is a Key Success Factor
Data models and schemas are perhaps the
most important part of developing software,
because they have such a profound effect:
■ not only on how the software is written,
■ but also on how we think about the
problem that we are solving.
Martin Kleppmann,
Designing Data-Intensive Applications
Data Modeling for ScyllaDB
The ideal ScyllaDB application has the following characteristics
■ Writes exceed reads by a large margin
■ Data is rarely updated, and when updates are made, they are idempotent (the
result of a successfully performed operation is independent of the number of
times it is executed)
■ Read access is by a known primary key
■ Data can be partitioned via a key that allows the database to be spread evenly
across multiple nodes
■ There is no need for joins or aggregates
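A minimal sketch of the idempotency point above, using plain Python dictionaries as a stand-in for tables (this is not a ScyllaDB client, and the names `upsert_status` and `increment_views` are hypothetical). Setting an absolute value can be retried safely; incrementing cannot.

```python
# Toy in-memory stand-ins for two tables keyed by primary key.
orders = {}
page_views = {}

def upsert_status(order_id, status):
    """Idempotent: writes an absolute value, so a retry leaves the same state."""
    orders[order_id] = {"status": status}

def increment_views(page_id):
    """Not idempotent: every retry changes the stored result."""
    page_views[page_id] = page_views.get(page_id, 0) + 1

# Simulate the same request being delivered three times (e.g. client retries).
for _ in range(3):
    upsert_status("order-1", "shipped")
    increment_views("page-1")

print(orders["order-1"])     # same state as after a single call
print(page_views["page-1"])  # differs from a single call
```

The upsert ends in the same state no matter how often it is replayed, which is why retry-heavy write paths fit this style of update.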
Excellent ScyllaDB Use Cases
■ Transaction logging: purchases, test scores, movies watched, and the latest position in each movie
■ Recommendation and personalization engines
■ Fraud detection
■ Tracking pretty much anything, including order status, packages, etc.
■ Storing time series data (as long as you do your own aggregates)
• Health tracker data
• Weather service history
• Internet of things status and event history
• Sensor data in general
■ Messaging systems: chats, collaboration, instant messaging apps, etc.
It may be misleading that…
■ ScyllaDB tables look like RDBMS tables
■ CQL looks like SQL
Differences between ScyllaDB and relational databases
■ Denormalization is expected
■ Writes are (almost) free
■ No DB-level joins
■ No referential integrity
■ Indexing is useful in specific circumstances
Mindshift from application-agnostic to application-specific modeling
■ Relational: Data → Data Model → Application
■ NoSQL: Application design, access patterns & queries → Data Model → Data
ScyllaDB Data Model Principles (1 of 3)
■ Keyspace: container for tables in a ScyllaDB data model
■ Table: container for an ordered collection of rows
■ Rows: made of a primary key plus an ordered set of columns, themselves
made of name/value pairs.
■ No need to store a value for every column each time a new row is stored.
ScyllaDB Data Model Principles (2 of 3)
■ Primary key: a composite made of a partition key plus an optional set of
clustering columns.
• Partition key: responsible for data distribution across the nodes. It determines which node
will store a given row. It can be one or more columns.
• Clustering columns: responsible for sorting the rows within the partition. There can be zero or
more of them.
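A simplified sketch of the two roles described above. It assumes a hypothetical three-node cluster and uses an MD5 hash with a modulo as a stand-in for the real token ring, so it only illustrates the idea: the partition key decides where a row lives, and the clustering column decides the row order inside the partition.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def node_for(partition_key: str) -> str:
    """Pick a node from a hash of the partition key (simplified placement)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# partition key -> rows, each row being (clustering value, payload)
partitions = defaultdict(list)

def insert(sensor_id: str, ts: int, value: float) -> None:
    rows = partitions[sensor_id]
    rows.append((ts, value))
    rows.sort(key=lambda row: row[0])  # keep rows ordered by the clustering column

insert("sensor-1", 30, 20.5)
insert("sensor-1", 10, 19.0)
insert("sensor-2", 20, 7.2)
insert("sensor-1", 20, 19.8)

for key, rows in partitions.items():
    print(key, "->", node_for(key), [ts for ts, _ in rows])
```

All rows for `sensor-1` land on the same node and come back already sorted by `ts`, which is what makes single-partition range reads cheap.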
ScyllaDB Data Model Principles (3 of 3)
■ Data type: defined to constrain the values stored in a column. Data types include character and
numeric types, collections, and user-defined types. A column also has other attributes:
timestamps and time-to-live.
■ Secondary index: an index on any column that is not part of the primary key. Secondary indexes
are not recommended on columns with high cardinality or very low cardinality, or on columns that
are frequently updated or deleted.
■ Joins: cannot be performed at the database level. If there is need for a join, either it must be
performed at the application level, or preferably, the data model should be adapted to create a
denormalized table that represents the join results.
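A toy sketch of the denormalization alternative to a join, with dictionaries standing in for tables. The table name `pets_by_owner` and the sample data are assumptions made for illustration: the owner name is copied into each pet row at write time so that one partition read returns the joined result.

```python
from collections import defaultdict

# Entity table, keyed by owner_id.
owners = {"o1": {"name": "Alice"}}

# Query table, partitioned by owner_id; it stores the "join result" directly.
pets_by_owner = defaultdict(list)

def add_pet(owner_id: str, pet_id: str, pet_name: str) -> None:
    """Write the pet together with the owner name it will be displayed with."""
    pets_by_owner[owner_id].append({
        "pet_id": pet_id,
        "pet_name": pet_name,
        "owner_name": owners[owner_id]["name"],  # duplicated on purpose
    })

add_pet("o1", "p1", "Rex")
add_pet("o1", "p2", "Milo")

# A single partition read answers "pets with their owner's name"; no join needed.
result = pets_by_owner["o1"]
print(result)
```

The cost is extra writes and duplicated data: if an owner is renamed, the application has to rewrite the copies.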
Data modeling for ScyllaDB is a
balancing act
■ Two primary rules of data modeling in ScyllaDB:
• each partition should hold roughly the same amount of data
• read operations should access as few partitions as possible, ideally only one
■ The two data modeling principles often conflict, so you have to find a
balance between the two based on domain understanding and business needs
■ Anticipate growth: a data model that makes sense at a particular
transaction volume may no longer make sense when that volume is multiplied 100x or 1000x
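A small sketch of why the two rules pull against each other, using made-up event data (900 events from one country, 100 from another). The `skew` metric, largest partition divided by average partition size, is an illustrative choice and not a ScyllaDB metric.

```python
from collections import Counter

# Hypothetical workload: 1000 events, heavily concentrated in one country.
rows = [{"country": "US", "user": f"u{i}"} for i in range(900)]
rows += [{"country": "FR", "user": f"u{i}"} for i in range(900, 1000)]

def partition_sizes(rows, key):
    """Count rows per partition for a candidate partition key."""
    return Counter(key(r) for r in rows)

def skew(sizes):
    """Largest partition divided by the average partition size (1.0 = even)."""
    return max(sizes.values()) / (sum(sizes.values()) / len(sizes))

by_country = partition_sizes(rows, lambda r: r["country"])
by_user = partition_sizes(rows, lambda r: r["user"])

# Partitions touched by the query "all events for the US".
reads_by_country = 1
reads_by_user = sum(1 for r in rows if r["country"] == "US")

print("by country:", skew(by_country), "partitions read:", reads_by_country)
print("by user:   ", skew(by_user), "partitions read:", reads_by_user)
```

Partitioning by country serves the query from one partition but is uneven and grows without bound; partitioning by user is perfectly even but scatters the same query across hundreds of partitions.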
Data modeling in practice
5 steps to a data model
■ Step 1: Build the application workflow
■ Step 2: Model the queries required by the application
■ Step 3: Create the tables
■ Step 4: Get the primary key right
■ Step 5: Use data types effectively
■ Example derived from
https://care-pet.docs.scylladb.com/master/design_and_data_model.html
Step 1: Build the application workflow
Step 2a: Model the queries required by the application
Step 2b: Identify attributes for each entity
Step 3: Create the tables
■ In ScyllaDB, tables can be grouped into two distinct categories:
• Tables with single-row partitions:
  • tables for which the primary key is also the partition key
  • used to store entities, and usually normalized
  • should be named after the entity for clarity (e.g., pet or owner)
• Tables with multi-row partitions:
  • tables with primary keys composed of partition and clustering keys
  • used to store relationships and related entities (remember: ScyllaDB doesn’t support joins,
    so developers need to structure tables to support queries that relate to multiple data items)
  • should have meaningful names so that people examining the schema can understand the
    purpose of each table (e.g., sensor, measurement)
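The two table categories can be sketched with a small helper that renders CQL from a table description. The `Table` class is hypothetical, and the column lists are illustrative assumptions loosely inspired by the care-pet example rather than its actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    columns: dict                 # column name -> CQL type
    partition_key: list
    clustering: list = field(default_factory=list)

    def kind(self) -> str:
        """No clustering columns means each partition holds exactly one row."""
        return "multi-row partitions" if self.clustering else "single-row partitions"

    def to_cql(self) -> str:
        cols = ", ".join(f"{c} {t}" for c, t in self.columns.items())
        key = ", ".join(["(" + ", ".join(self.partition_key) + ")"] + self.clustering)
        return f"CREATE TABLE {self.name} ({cols}, PRIMARY KEY ({key}));"

owner = Table(
    "owner",
    {"owner_id": "uuid", "name": "text", "address": "text"},
    partition_key=["owner_id"],
)
measurement = Table(
    "measurement",
    {"sensor_id": "uuid", "ts": "timestamp", "value": "float"},
    partition_key=["sensor_id"],
    clustering=["ts"],
)

for table in (owner, measurement):
    print(table.kind(), "->", table.to_cql())
```

Here `owner` stores one entity per partition, while `measurement` groups all readings of a sensor in one partition, ordered by `ts`.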
Step 4: Get the primary key right
■ The primary key is made up of
• a partition key. For most applications, this should be a unique key (UUID or custom)
• followed by zero or more clustering columns that control how rows are laid out within a
ScyllaDB partition
■ Getting the primary key right for each table is one of the most crucial aspects
of designing a good data model
■ Remember the two primary rules of data modeling in ScyllaDB:
• each partition should hold roughly the same amount of data
• read operations should access as few partitions as possible, ideally only one
Step 5: Use data types effectively
■ String: ascii, text, varchar, inet
■ Numeric: int, bigint, smallint, tinyint, varint,
counter, decimal, double, float
■ UUIDs: uuid, timeuuid
■ Miscellaneous: boolean, blob
■ Date/time: timestamp, date, time, duration
■ Geospatial
■ Collections: list, map, set, tuple, nested
■ User-Defined Types (UDT)
Collections
■ List: ordered collection of one or more elements
■ Set: unordered collection of one or more unique elements
■ Map: collection of arbitrary key-value pairs
■ Tuple: holds fixed-length sets of typed positional fields
■ Frozen: serialization of multiple components into a single value; updates to
individual fields are not possible; the value is treated as a blob, which makes it
possible to nest collections
■ User-Defined Type: re-usable set of multiple fields of related information,
e.g. an address
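As a loose analogy only (this is Python, not CQL), a frozen dataclass behaves much like the frozen value described above: individual fields cannot be changed in place, so an update replaces the whole value. The `Address` type and its fields are hypothetical.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Address:          # stand-in for a frozen user-defined type
    street: str
    city: str

row = {"owner_id": "o1", "address": Address("1 Main St", "Paris")}

try:
    row["address"].city = "Lyon"      # field-level update is rejected
    updated_in_place = True
except FrozenInstanceError:
    updated_in_place = False

# The whole value is written again instead.
row["address"] = replace(row["address"], city="Lyon")

print(updated_in_place, row["address"])
```

The trade-off mirrors the slide: a frozen value is simple to store and nest, at the price of rewriting it entirely on every change.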
Best Practices
■ A single table per query
■ Use denormalization to avoid joins
■ Ensure that the choice of primary key guarantees uniqueness
■ Break up large partitions into buckets
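A minimal sketch of bucketing, assuming time-series rows and a one-day bucket (both are assumptions; the right bucket size depends on write volume). The functions `bucketed_partition_key` and `buckets_between` are hypothetical helpers, not part of any driver.

```python
from datetime import datetime, timedelta, timezone

def bucketed_partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Composite partition key (sensor_id, day) caps a partition at one day of data."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))

def buckets_between(start: datetime, end: datetime) -> list:
    """Day buckets a range query has to visit, in order."""
    day, last, out = start.date(), end.date(), []
    while day <= last:
        out.append(day.isoformat())
        day += timedelta(days=1)
    return out

late = datetime(2022, 2, 9, 23, 59, tzinfo=timezone.utc)
early = datetime(2022, 2, 10, 0, 1, tzinfo=timezone.utc)

k1 = bucketed_partition_key("sensor-1", late)
k2 = bucketed_partition_key("sensor-1", early)
print(k1, k2)
print(buckets_between(late, early))
```

Two readings a couple of minutes apart fall into different partitions here, so a read spanning midnight touches two partitions; that is the price paid for keeping each partition bounded.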
Migrating relational database structures to ScyllaDB
[Diagram: RDBMS → ScyllaDB]
Benefits of data modeling
■ While traditional data modeling may be perceived to get in
the way of development and take too much time…
■ Next-gen data modeling tools such as Hackolade are
recognized to:
• facilitate Agile development
• reduce development time
• increase application quality
• implement consistent definitions of data
• improve data quality
• enable better data governance and compliance
• facilitate documentation and communication
To leverage the dynamic schema of ScyllaDB, data
modeling turns out to be even more important than
with relational databases
Thank you!
Stay in touch
Pascal Desmarets
@Hackolade
pascal.desmarets@hackolade.com

Scylla Summit 2022: Migrating SQL Schemas for ScyllaDB: Data Modeling Best Practices
