Cassandra Data
Modelling
Agenda
● Relational Data Model
● Cassandra Table Structure
● Cassandra Keys
● Cassandra Data Modeling
● Materialized Views
“ Relational Databases have been one of the most successful
applications in history. It’s used everywhere from individuals to
small firms, and in giant multinational corporations with clusters
of hundreds of finely tuned instances representing multi-terabyte
data warehouses. Relational databases store invoices, customer
records, product catalogues, accounting ledgers, user
authentication schemes—the very world, it might appear. There
is no question that the relational database is a key facet of the
modern technology and business landscape, and one that will be
with us in its various forms for many years to come “
RDBMS
Relational Data Model
● SQL (Standard) for DDL and DML.
● Normalized form of Data
● Allows user to represent complex relationships with the data.
● Supports ACID Transactions
● “Joins” a really powerful way to extract meaningful data.
RDBMS
RDBMS Data ModelRDBMS Data Model
Why NoSql Then ?
● Scalability (ACID for transaction support needs locking).
● Vertical Scalability easy but horizontal scalability a huge
problem.
● Schema Oriented (What if our data structure is not fixed)
● Joins Performance becomes a bottleneck.
● Distributed Design is harder.
RDBMS Problems
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Simplest Data StoreSimplest Data Store
Name 1 Name 2 Name 3
Value 1 Value 2 Value 2
Simplest Data StoreSimplest Data Store
ROW
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY
ROW 1
Column 1 Column 2 Column 3
Value 1 Value 2 Value 2
PRIMARY KEY
ROW 2
Column 1 Column 3
Value 1 Value 2
PRIMARY KEY
TABLE
Wide Row
PARTITION
KEY
TABLE
STATIC COLUMN
VALUE
Clustering
Key Column
Value Value
Clustering
Key Column
Value Value
Wide Row
PARTITION
KEY
STATIC COLUMN
VALUE
Clustering
Key
Column
Value Value
Column
Value
Column
Value
CREATE TABLE raw_weather_data (
weatherId text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY ((weatherId), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC,
hour DESC);
Practical Example
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘1000001’,2005,12,1,7,-5.3);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,8,-4.8);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,9,-5.1);
INSERT INTO
raw_weather_data(weatherId,year,month,day,hour,temperature)
VALUES (‘‘1000001’’,2005,12,1,10,-5.6);
Practical Example
Clustering Order
● The column, which is a name/value pair
● The row, which is a container for columns referenced by a
primary key
● The table, which is a container for rows
● The keyspace, which is a container for tables
● The cluster, which is a container for keyspaces that spans
one or more nodes
Cassandra Data Structures
CREATE TABLE playlists (
id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order ) );
Cassandra Keys
Cassandra Keys
PRIMARY KEY (id ,
song_order ) ;
Partition Key Clustering Key
The partition key determines which
node stores the data.It is responsible
for data distribution across the
nodes.
The additional columns determine
per-partition clustering. Clustering is
a storage engine process that sorts
data within the partition.
Domain - Hotel Reservations
The conceptual domain includes :
● Hotels
● Guests
● Rooms
● Room Rates
● Reservations
● Point of Interest
● Geolocation data for hotels and POI
Conceptual Data Modeling
ER Diagram
● No Joins
● No Referential Integrity
● Denormalization
● Query First Design
● Designing for Optimal Storage
● Sorting is a Design Decision
Key Considerations
We do not have Joins in Cassandra, now what to do ?
2 Options -
● Do the work on client side and handle it through code :-)
● Create a new denormalized table which represents the
Join data (Preferred Approach)
No Joins
Cassandra does not enforce referential integrity but, it is
still a common design requirement to store IDs related to
other entities in your tables, but operations such as
cascading deletes are not available
No Referential Integrity
Denormalization is not evil.
Even in RDBMS as the data goes bigger and bigger the Join
performance becomes really slow and we have to
denormalize to reduce the number of joins to achieve a
reasonable query performance.
We may also have use cases where we have to keep it in
denormalized form (e.g. Invoice Data)
Space is cheaper and Cassandra performs best when the
data is in denormalized form.
Denormalization
Cassandra you don’t start with the data model; you start
with the query model.
● Instead of modeling the data first and then writing
queries, with Cassandra you model the queries and let
the data be organized around them.
● Think of the most common query paths your application
will use, and then create the tables that you need to
support them.
Query First Design
We have to think about how we are storing the data in
Cassandra as its tables are each stored as a separate files
on Disk.
● Keep the related columns defined together in same
table. You can also think of keeping frequently used
columns in one table and less used in other table.
● Always remember we have to minimize the number of
partitions read to get the query results.
Designing for Optimal Storage
In RDBMS you do not have a default sort order and you can
do sorting on any column or list of columns by Order By
clause. In Cassandra it is a design decision :
● The sort order available on queries is fixed and is
determined entirely by Clustering Columns defined
during create statement.
● CQL SELECT statement does support ORDER BY
semantics, but only in the order specified by the
clustering columns.
Sorting is a Design Decision
Queries Related to Hotel :
Q1. Find hotels near a given point of interest.
Q2. Find information about a given hotel, such as its
name and location.
Q3. Find points of interest near a given hotel.
Q4. Find an available room in a given date range.
Q5. Find the rate and amenities for a room.
Lets Focus on Application Queries
Queries Related to Reservation :
Q6. Lookup a reservation by confirmation number.
Q7. Lookup a reservation by hotel, date, and guest
name.
Q8. Lookup all reservations by guest name.
Q9. View guest details.
Lets Focus on Application Queries
Application Queries
Approach for Modeling
1. Identify the Primary Entity that will be needed to satisfy
the Query e.g. for Q1 we need to find hotels near POI so
we have to query hotels and POI.
2. Identify the Primary Key based on the Query (so for Q1 it
seems like ID of the POI entity) also identify the
clustering keys to guarantee unique records and sorting.
3. Identify the other attributes that will be be needed to
satisfy our need (additional columns)
Approach for Modeling
Lets focus on Q1 : View Hotels near POI
- Since we need to search hotels near POI there will be
two entities involved Hotel and POI
- Now we have to select hotel for a POI we will surely need
to add POI Name as a primary Key.
- For a POI there can be multiple hotels , so we need
HotelID as well. Hence we need it a part of our key as a
clustering column.
- We need to show Hotels to the user so Name, Address,
Phone are additional attributes that we will want to fetch
as a part of Q1.
First table
Table for Q1
hotels_by_poi
poi_name K
hotel_id C
name
phone
address
Hotel Logical Data Model
Reservation Logical Data Model
Hotel Physical Data Model
Reservation Physical Data Model
Materialized Views
In Cassandra you can not add a where clause for a column
that is not part of the key, here we can either create
secondary indexes or materialized views.
Materialized views are preconfigured views that support
queries on additional columns which are not part of the
original clustering key.
They simplify application development: instead of the
application having to keep multiple denormalized tables in
sync, Cassandra takes on the responsibility of updating
views in order to keep them consistent with the base table.
Important Aspects
Partition Size
- Number of Columns (2 billion columns)
Formula : Nv = Nr( Nc − Npk − Ns ) + Ns
The number of values (or cells) in the partition (Nv) is equal
to the number of static columns (Ns) plus the product of the
number of rows (Nr) and the number of of values per row.
The number of values per row is defined as the number of
columns (Nc) minus the number of primary key columns (Npk)
and static columns (Ns).
References
1. Cassandra: The DefinitiveGuide, 2nd Edition
Distributed Dataat Web Scale
By Eben Hewitt, Jeff Carpenter
2. Introduction to DataModeling withApacheCassandra
Patrick McFadin
Chief Evangelist for ApacheCassandra
Thank You

Cassandra Data Modelling

  • 1.
  • 2.
    Agenda ● Relational DataModel ● Cassandra Table Structure ● Cassandra Keys ● Cassandra Data Modeling ● Materialized Views
  • 3.
    “ Relational Databaseshave been one of the most successful applications in history. It’s used everywhere from individuals to small firms, and in giant multinational corporations with clusters of hundreds of finely tuned instances representing multi-terabyte data warehouses. Relational databases store invoices, customer records, product catalogues, accounting ledgers, user authentication schemes—the very world, it might appear. There is no question that the relational database is a key facet of the modern technology and business landscape, and one that will be with us in its various forms for many years to come “ RDBMS
  • 4.
  • 6.
    ● SQL (Standard)for DDL and DML. ● Normalized form of Data ● Allows user to represent complex relationships with the data. ● Supports ACID Transactions ● “Joins” a really powerful way to extract meaningful data. RDBMS
  • 7.
  • 8.
  • 9.
    ● Scalability (ACIDfor transaction support needs locking). ● Vertical Scalability easy but horizontal scalability a huge problem. ● Schema Oriented (What if our data structure is not fixed) ● Joins Performance becomes a bottleneck. ● Distributed Design is harder. RDBMS Problems
  • 10.
  • 11.
  • 12.
    Simplest Data StoreSimplestData Store Name 1 Name 2 Name 3 Value 1 Value 2 Value 2
  • 13.
    Simplest Data StoreSimplestData Store ROW Column 1 Column 2 Column 3 Value 1 Value 2 Value 2 PRIMARY KEY
  • 14.
    ROW 1 Column 1Column 2 Column 3 Value 1 Value 2 Value 2 PRIMARY KEY ROW 2 Column 1 Column 3 Value 1 Value 2 PRIMARY KEY TABLE
  • 15.
    Wide Row PARTITION KEY TABLE STATIC COLUMN VALUE Clustering KeyColumn Value Value Clustering Key Column Value Value Wide Row PARTITION KEY STATIC COLUMN VALUE Clustering Key Column Value Value Column Value Column Value
  • 16.
    CREATE TABLE raw_weather_data( weatherId text, year int, month int, day int, hour int, temperature double, PRIMARY KEY ((weatherId), year, month, day, hour) ) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC); Practical Example
  • 17.
    INSERT INTO raw_weather_data(weatherId,year,month,day,hour,temperature) VALUES (‘1000001’,2005,12,1,7,-5.3); INSERTINTO raw_weather_data(weatherId,year,month,day,hour,temperature) VALUES (‘‘1000001’’,2005,12,1,8,-4.8); INSERT INTO raw_weather_data(weatherId,year,month,day,hour,temperature) VALUES (‘‘1000001’’,2005,12,1,9,-5.1); INSERT INTO raw_weather_data(weatherId,year,month,day,hour,temperature) VALUES (‘‘1000001’’,2005,12,1,10,-5.6); Practical Example
  • 18.
  • 19.
    ● The column,which is a name/value pair ● The row, which is a container for columns referenced by a primary key ● The table, which is a container for rows ● The keyspace, which is a container for tables ● The cluster, which is a container for keyspaces that spans one or more nodes Cassandra Data Structures
  • 20.
    CREATE TABLE playlists( id uuid, song_order int, song_id uuid, title text, album text, artist text, PRIMARY KEY (id, song_order ) ); Cassandra Keys
  • 21.
    Cassandra Keys PRIMARY KEY(id , song_order ) ; Partition Key Clustering Key The partition key determines which node stores the data.It is responsible for data distribution across the nodes. The additional columns determine per-partition clustering. Clustering is a storage engine process that sorts data within the partition.
  • 22.
    Domain - HotelReservations The conceptual domain includes : ● Hotels ● Guests ● Rooms ● Room Rates ● Reservations ● Point of Interest ● Geolocation data for hotels and POI Conceptual Data Modeling
  • 23.
  • 25.
    ● No Joins ●No Referential Integrity ● Denormalization ● Query First Design ● Designing for Optimal Storage ● Sorting is a Design Decision Key Considerations
  • 26.
    We do nothave Joins in Cassandra, now what to do ? 2 Options - ● Do the work on client side and handle it through code :-) ● Create a new denormalized table which represents the Join data (Preferred Approach) No Joins
  • 27.
    Cassandra does notenforce referential integrity but, it is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available No Referential Integrity
  • 28.
    Denormalization is notevil. Even in RDBMS as the data goes bigger and bigger the Join performance becomes really slow and we have to denormalize to reduce the number of joins to achieve a reasonable query performance. We may also have use cases where we have to keep it in denormalized form (e.g. Invoice Data) Space is cheaper and Cassandra performs best when the data is in denormalized form. Denormalization
  • 29.
    Cassandra you don’tstart with the data model; you start with the query model. ● Instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them. ● Think of the most common query paths your application will use, and then create the tables that you need to support them. Query First Design
  • 30.
    We have tothink about how we are storing the data in Cassandra as its tables are each stored as a separate files on Disk. ● Keep the related columns defined together in same table. You can also think of keeping frequently used columns in one table and less used in other table. ● Always remember we have to minimize the number of partitions read to get the query results. Designing for Optimal Storage
  • 31.
    In RDBMS youdo not have a default sort order and you can do sorting on any column or list of columns by Order By clause. In Cassandra it is a design decision : ● The sort order available on queries is fixed and is determined entirely by Clustering Columns defined during create statement. ● CQL SELECT statement does support ORDER BY semantics, but only in the order specified by the clustering columns. Sorting is a Design Decision
  • 32.
    Queries Related toHotel : Q1. Find hotels near a given point of interest. Q2. Find information about a given hotel, such as its name and location. Q3. Find points of interest near a given hotel. Q4. Find an available room in a given date range. Q5. Find the rate and amenities for a room. Lets Focus on Application Queries
  • 33.
    Queries Related toReservation : Q6. Lookup a reservation by confirmation number. Q7. Lookup a reservation by hotel, date, and guest name. Q8. Lookup all reservations by guest name. Q9. View guest details. Lets Focus on Application Queries
  • 34.
  • 35.
    Approach for Modeling 1.Identify the Primary Entity that will be needed to satisfy the Query e.g. for Q1 we need to find hotels near POI so we have to query hotels and POI. 2. Identify the Primary Key based on the Query (so for Q1 it seems like ID of the POI entity) also identify the clustering keys to guarantee unique records and sorting. 3. Identify the other attributes that will be be needed to satisfy our need (additional columns)
  • 36.
    Approach for Modeling Letsfocus on Q1 : View Hotels near POI - Since we need to search hotels near POI there will be two entities involved Hotel and POI - Now we have to select hotel for a POI we will surely need to add POI Name as a primary Key. - For a POI there can be multiple hotels , so we need HotelID as well. Hence we need it a part of our key as a clustering column. - We need to show Hotels to the user so Name, Address, Phone are additional attributes that we will want to fetch as a part of Q1.
  • 37.
    First table Table forQ1 hotels_by_poi poi_name K hotel_id C name phone address
  • 38.
  • 39.
  • 40.
  • 41.
  • 42.
    Materialized Views In Cassandrayou can not add a where clause for a column that is not part of the key, here we can either create secondary indexes or materialized views. Materialized views are preconfigured views that support queries on additional columns which are not part of the original clustering key. They simplify application development: instead of the application having to keep multiple denormalized tables in sync, Cassandra takes on the responsibility of updating views in order to keep them consistent with the base table.
  • 43.
    Important Aspects Partition Size -Number of Columns (2 billion columns) Formula : Nv = Nr( Nc − Npk − Ns ) + Ns The number of values (or cells) in the partition (Nv) is equal to the number of static columns (Ns) plus the product of the number of rows (Nr) and the number of of values per row. The number of values per row is defined as the number of columns (Nc) minus the number of primary key columns (Npk) and static columns (Ns).
  • 44.
    References 1. Cassandra: TheDefinitiveGuide, 2nd Edition Distributed Dataat Web Scale By Eben Hewitt, Jeff Carpenter 2. Introduction to DataModeling withApacheCassandra Patrick McFadin Chief Evangelist for ApacheCassandra
  • 45.