Jim Hatcher gave a presentation on introducing data modeling with Apache Cassandra. He discussed how Cassandra works by distributing data across nodes and replicating data. He also covered CQL for querying Cassandra, embracing denormalization in data modeling, using an appropriate key structure, and some advanced Cassandra techniques. The presentation provided an overview of modeling data in Cassandra and resources for further learning.
Introduction to Data Modeling in Cassandra
1. Jim Hatcher
DFW Cassandra Users - Meetup
7/12/2016
Introduction to Data Modeling with Apache Cassandra
2. Agenda
• Introduction
• How does Cassandra work?
• What is CQL?
• Embracing Denormalization
• Key Structure
• Advanced Techniques
• Resources
3. Introduction
Jim Hatcher
james_hatcher@hotmail.com
At IHS, we take raw data and turn it into information and insights for our customers.
Automotive Systems (CarFax)
Defense Systems (Jane’s)
Oil & Gas Systems (Petra)
Maritime Systems
Technology & Media Systems (Electronic Parts Database, Root Metrics)
Sources of Raw Data
Structure Data
Add Value
Customer-facing Systems
4. How does Cassandra work?
CREATE KEYSPACE orders
WITH replication =
{
    'class': 'SimpleStrategy',
    'replication_factor': 3
};
CREATE TABLE orders.customer
(
    customer_id uuid,
    customer_name varchar,
    customer_age int,
    PRIMARY KEY ( customer_id )
);
INSERT INTO orders.customer (customer_id, customer_name, customer_age)
VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);
SELECT customer_name, customer_age FROM orders.customer WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
[Diagram: a Client connected to a Cassandra Cluster of six nodes, A through F. The token ring is divided into six ranges, one per node:]
-9223372036854775808 through -6148914691236517207
-6148914691236517206 through -3074457345618258605
-3074457345618258604 through -3
-2 through 3074457345618258599
3074457345618258600 through 6148914691236517201
6148914691236517202 through 9223372036854775807
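The token ranges on this slide can be reproduced with a little arithmetic. The following is a minimal Python sketch, not Cassandra's actual implementation: it splits the 64-bit signed token space into equal ranges and picks replicas by walking clockwise around the ring, as SimpleStrategy does. Which node owns which range is not recoverable from the slide, so the node order in `NODES` is an assumption, and the sketch takes a token directly instead of hashing a partition key.

```python
# Sketch only: equal-sized token ranges and clockwise replica selection.
MIN_TOKEN = -2**63       # lowest token in the 64-bit signed token space
MAX_TOKEN = 2**63 - 1    # highest token

def token_ranges(node_count):
    """Split the token space into node_count contiguous (start, end) ranges."""
    size = 2**64 // node_count
    ranges = []
    start = MIN_TOKEN
    for i in range(node_count):
        # The last range absorbs the remainder so the whole space is covered.
        end = MAX_TOKEN if i == node_count - 1 else start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def replicas_for(token, nodes, replication_factor):
    """Return the node owning the token plus the next nodes clockwise."""
    ranges = token_ranges(len(nodes))
    owner = next(i for i, (lo, hi) in enumerate(ranges) if lo <= token <= hi)
    return [nodes[(owner + k) % len(nodes)] for k in range(replication_factor)]

# Assumed node order around the ring (hypothetical; the slide does not say).
NODES = ["A", "B", "C", "D", "E", "F"]

if __name__ == "__main__":
    for node, (lo, hi) in zip(NODES, token_ranges(len(NODES))):
        print(node, lo, "through", hi)
    print(replicas_for(0, NODES, 3))
```

With a replication factor of 3, as in the `orders` keyspace, each row is stored on the owning node and on the next two nodes around the ring.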
5. CQL
Cassandra Query Language
Standard interface for working with Cassandra
Very similar to standard SQL, with a few notable exceptions:
• No JOIN clauses
• No GROUP BY / HAVING clauses
• Restricted WHERE clauses
• You can only query by key fields in prescribed ways
CQL Data Types
CQL Type   Description
bigint     64-bit signed long
boolean    true or false
decimal    Variable-precision decimal
double     64-bit IEEE-754 floating point
float      32-bit IEEE-754 floating point
int        32-bit signed integer
text       UTF-8 encoded string
timestamp  Date plus time, encoded as 8 bytes since epoch
timeuuid   Type 1 UUID only
uuid       A UUID in standard UUID format
Others:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cql_data_types_c.html
6. Normalization
In relational databases, we start with understanding how the data relates together. We create a conceptual model.
Our physical model often looks identical to the conceptual model.
Student: StudentID (PK), FirstName, LastName, DateOfBirth
Course: CourseID (PK), CourseCode, CourseNumber, CourseName, CourseDescription, Department
Class: ClassID (PK), CourseID (FK), Semester, Section, Professor, Classroom, DayAndTime
StudentClassSchedule: StudentID (PK), ClassID (PK), Grade
7. Normalization
Course
CourseID  CourseCode  CourseNumber  CourseName           Department
C-AAA     ENGL        101           American Literature  Humanities
C-BBB     MATH        203           Linear Algebra       Mathematics
C-CCC     BIOL        201           Molecular Biology    Science
C-DDD     HIST        108           World History        History
C-EEE     ENGL        102           British Literature   Humanities
Class
ClassID           CourseID  Semester     Section  Professor           Classroom           DayAndTime
SP16-ENGL-101-01  C-AAA     Spring 2016  01       Mark Twain          XYZ Hall, Room 212  MWF 8:00 AM
SP16-MATH-203-01  C-BBB     Spring 2016  01       Isaac Newton        XYZ Hall, Room 212  TuTh 9:30 AM
FA16-BIOL-201-04  C-CCC     Fall 2016    04       Charles Darwin      XYZ Hall, Room 210  MWF 9:00 AM
FA16-HIST-108-03  C-DDD     Fall 2016    03       Napoleon Bonaparte  XYZ Hall, Room 317  TuTh 12:00 PM
FA16-ENGL-102-04  C-EEE     Fall 2016    04       Virginia Woolf      XYZ Hall, Room 184  MWF 10:00 AM
FA16-ENGL-102-04  C-EEE     Fall 2016    05       Jane Austen         XYZ Hall, Room 185  TuTh 2:00 PM
Every piece of data lives in one and only one place.
We use our data-layer to enforce referential integrity.
Student
StudentID  FirstName  LastName  DateOfBirth
S-111      Joe        Smith     1/1/1970
S-222      Jill       Jones     2/2/1972
S-333      Betty      Williams  3/3/1973
StudentClassSchedule
StudentID  ClassID           Grade
S-111      SP16-ENGL-101-01  A
S-111      SP16-MATH-203-01  C
S-111      FA16-BIOL-201-04  <null>
S-111      FA16-HIST-108-03  <null>
S-111      FA16-ENGL-102-04  <null>
S-222      FA16-HIST-108-03  <null>
8. Normalization
To satisfy a query, we join tables together.
To give a student his/her schedule, we might use this query:
SELECT Course.CourseCode, Course.CourseNumber, Course.CourseName, Class.ClassID, Class.Section,
    Class.Classroom, Class.DayAndTime
FROM StudentClassSchedule
INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
INNER JOIN Course ON Class.CourseID = Course.CourseID
WHERE StudentClassSchedule.StudentID = 'S-111'
AND Class.Semester = 'Fall 2016';
To give a professor a class roster, we might use this query:
SELECT Student.FirstName, Student.LastName, Class.Classroom, Class.DayAndTime
FROM Student
INNER JOIN StudentClassSchedule ON Student.StudentID = StudentClassSchedule.StudentID
INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
WHERE Class.ClassID = 'FA16-HIST-108-03';
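To see the schedule join run end to end, here is a minimal Python sketch that uses the standard-library `sqlite3` module as a stand-in relational database. It loads only a subset of the sample rows and columns from the previous slide, and the `schedule` helper name is an assumption made for illustration.

```python
import sqlite3

# In-memory relational database holding a subset of the sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Course (CourseID TEXT PRIMARY KEY, CourseCode TEXT,
                     CourseNumber INTEGER, CourseName TEXT);
CREATE TABLE Class (ClassID TEXT PRIMARY KEY,
                    CourseID TEXT REFERENCES Course(CourseID),
                    Semester TEXT, Section TEXT, Classroom TEXT, DayAndTime TEXT);
CREATE TABLE StudentClassSchedule (StudentID TEXT,
                                   ClassID TEXT REFERENCES Class(ClassID),
                                   Grade TEXT,
                                   PRIMARY KEY (StudentID, ClassID));
""")
conn.executemany("INSERT INTO Course VALUES (?, ?, ?, ?)", [
    ("C-AAA", "ENGL", 101, "American Literature"),
    ("C-CCC", "BIOL", 201, "Molecular Biology"),
    ("C-DDD", "HIST", 108, "World History"),
])
conn.executemany("INSERT INTO Class VALUES (?, ?, ?, ?, ?, ?)", [
    ("SP16-ENGL-101-01", "C-AAA", "Spring 2016", "01", "XYZ Hall, Room 212", "MWF 8:00 AM"),
    ("FA16-BIOL-201-04", "C-CCC", "Fall 2016", "04", "XYZ Hall, Room 210", "MWF 9:00 AM"),
    ("FA16-HIST-108-03", "C-DDD", "Fall 2016", "03", "XYZ Hall, Room 317", "TuTh 12:00 PM"),
])
conn.executemany("INSERT INTO StudentClassSchedule VALUES (?, ?, ?)", [
    ("S-111", "SP16-ENGL-101-01", "A"),
    ("S-111", "FA16-BIOL-201-04", None),
    ("S-111", "FA16-HIST-108-03", None),
    ("S-222", "FA16-HIST-108-03", None),
])

def schedule(student_id, semester):
    """Join three normalized tables to build one student's schedule."""
    return conn.execute("""
        SELECT Course.CourseCode, Course.CourseNumber, Class.ClassID, Class.Classroom
        FROM StudentClassSchedule
        INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
        INNER JOIN Course ON Class.CourseID = Course.CourseID
        WHERE StudentClassSchedule.StudentID = ? AND Class.Semester = ?
        ORDER BY Class.ClassID
    """, (student_id, semester)).fetchall()

if __name__ == "__main__":
    for row in schedule("S-111", "Fall 2016"):
        print(row)
```

Every read pays for two joins here; that cost is what the denormalized Cassandra tables later in the deck trade away.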
9. Denormalization
Queries:
• Student Schedule for a Given Semester
• Student Roster for a Given Class
Tables:
• student_schedule
• class_roster
If updates happen to “core data,” we need a mechanism to propagate them.
For instance, if a class is relocated to a new classroom, we now have to update the classroom field in both of the tables above.
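The classroom example can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra client code: the two tables are plain dictionaries, the columns stored in `class_roster` are guesses (the slide does not list them), and `enroll`, `relocate_class`, and "Room 101" are hypothetical names and values used only for illustration.

```python
# Two query-specific tables, modelled as dicts keyed by partition key.
student_schedule = {}   # student_id -> {(semester, class_id): row}
class_roster = {}       # class_id   -> {student_id: row}

def enroll(student_id, first_name, last_name, semester, class_id, classroom):
    """Write the same facts into both tables, one copy per query."""
    student_schedule.setdefault(student_id, {})[(semester, class_id)] = {
        "classroom": classroom,
    }
    class_roster.setdefault(class_id, {})[student_id] = {
        "first_name": first_name,
        "last_name": last_name,
        "semester": semester,
        "classroom": classroom,
    }

def relocate_class(class_id, new_classroom):
    """A change to core data has to be applied to every copy of it."""
    for student_id, roster_row in class_roster.get(class_id, {}).items():
        roster_row["classroom"] = new_classroom
        key = (roster_row["semester"], class_id)
        student_schedule[student_id][key]["classroom"] = new_classroom

if __name__ == "__main__":
    enroll("S-111", "Joe", "Smith", "FALL 2016", "FA16-HIST-108-03", "XYZ Hall, Room 317")
    enroll("S-222", "Jill", "Jones", "FALL 2016", "FA16-HIST-108-03", "XYZ Hall, Room 317")
    relocate_class("FA16-HIST-108-03", "XYZ Hall, Room 101")
    print(student_schedule["S-111"])
    print(class_roster["FA16-HIST-108-03"])
```

The roster doubles as the lookup for which schedule rows need the update, which is one possible mechanism among several.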
10. Key Structure
Primary Key
CREATE TABLE student_schedule
(
    student_id text,
    semester text,
    class_id text,
    course_code text,
    course_number int,
    section text,
    classroom text,
    day_and_time text,
    PRIMARY KEY ( (student_id), semester, class_id )
);
The primary key is the combination of
1. the partitioning key, and
2. the clustering columns
As in a relational database, it uniquely identifies the row.
The values in the primary key cannot be NULL.
The first value in the PRIMARY KEY clause is the partitioning key. Any subsequent values are clustering columns. To specify a multi-column partitioning key, wrap its columns in an extra set of parentheses.
11. Key Structure
PRIMARY KEY ( (student_id), semester, class_id )
Partitioning Key: student_id
Clustering Columns: semester, class_id
The partitioning key is responsible for distributing data across the cluster. It separates data.
Within a given partition, clustering columns are responsible for clustering data values together. They connect data.
This is a representation of how Cassandra stores data on disk:
Partition (student_id = S-111)
FALL 2016 : FA16-ENGL-102-04 : course_code = ENGL
SPRING 2016 : SP16-ENGL-101-01 : course_code = ENGL
….
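The split between the two kinds of key columns can be modelled roughly in Python. This is a sketch under simplifying assumptions, not Cassandra's storage engine: each partition is a dictionary keyed by the tuple of clustering values, and reads return rows in clustering order. `PartitionedTable` and its methods are hypothetical names.

```python
class PartitionedTable:
    """Toy model: the partition key separates data, clustering columns order it."""

    def __init__(self):
        # partition key -> {clustering tuple: column values}
        self.partitions = {}

    def insert(self, partition_key, clustering, values):
        """Place the row in its partition under its clustering key."""
        self.partitions.setdefault(partition_key, {})[clustering] = values

    def read_partition(self, partition_key, prefix=()):
        """Return rows of one partition in clustering order, optionally
        limited to clustering keys that start with the given prefix."""
        rows = self.partitions.get(partition_key, {})
        return [
            (clustering, values)
            for clustering, values in sorted(rows.items(), key=lambda item: item[0])
            if clustering[:len(prefix)] == prefix
        ]

if __name__ == "__main__":
    table = PartitionedTable()
    table.insert("S-111", ("SPRING 2016", "SP16-MATH-203-01"), {"course_code": "MATH"})
    table.insert("S-111", ("FALL 2016", "FA16-ENGL-102-04"), {"course_code": "ENGL"})
    table.insert("S-111", ("SPRING 2016", "SP16-ENGL-101-01"), {"course_code": "ENGL"})
    for clustering, values in table.read_partition("S-111"):
        print(clustering, values)
```

All of a student's rows live in one partition, so a schedule read touches a single place in the cluster, and the rows come back already sorted by semester and class.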
12. Querying with CQL
When you access Cassandra data via CQL, you retrieve CQL Rows.
A “CQL Row” can be (and usually is) different from the physical structure (a partition) in which the data is stored within the Cassandra cluster.
Partitioning Key:
• Must be queried using an equality expression (i.e., = or IN).
• If you have a multi-field partitioning key, you must specify all the fields in the partition key to query the data.
Clustering Columns:
• Can be queried with an inequality (i.e., <, >) or an equality.
• If you have multiple clustering columns, you don’t have to specify all of them, but you do have to specify them in order (i.e., you can’t specify clustering column #2 unless you also supply clustering column #1).
SELECT * FROM student_schedule;
student_schedule
Primary Key: student_id (Partitioning Key); semester, class_id (Clustering Columns)
student_id  semester     class_id          course_code  course_number  section  classroom           day_and_time
S-111       FALL 2016    FA16-ENGL-102-04  ENGL         102            04       XYZ Hall, Room 185  TuTh 2:00 PM
S-111       SPRING 2016  SP16-ENGL-101-01  ENGL         101            01       XYZ Hall, Room 212  MWF 8:00 AM
S-111       SPRING 2016  SP16-MATH-203-01  MATH         203            01       XYZ Hall, Room 212  TuTh 9:30 AM
13. CQL
Acceptable Queries:
SELECT * FROM student_schedule
WHERE student_id = 'S-111';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016'
AND class_id = 'SP16-ENGL-101-01';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016'
AND class_id >= 'SP16-ENGL-101-01'
AND class_id < 'SP16-ENGL-999-99';
UN-acceptable Queries:
-- Non-key field
SELECT * FROM student_schedule
WHERE course_code = 'ENGL';
-- Non-equality condition against Partitioning Key
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
OR student_id = 'S-222';
-- Specifying a clustering column but not in order
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND class_id = 'SP16-ENGL-101-01';
Note: Yes, I know I could mention secondary indexes and the ALLOW FILTERING clause at this point; but they’re anti-patterns, so don’t use them.
student_schedule
Primary Key: student_id (Partitioning Key); semester, class_id (Clustering Columns)
student_id  semester     class_id          course_code  course_number  section  classroom           day_and_time
S-111       FALL 2016    FA16-ENGL-102-04  ENGL         102            04       XYZ Hall, Room 185  TuTh 2:00 PM
S-111       SPRING 2016  SP16-ENGL-101-01  ENGL         101            01       XYZ Hall, Room 212  MWF 8:00 AM
S-111       SPRING 2016  SP16-MATH-203-01  MATH         203            01       XYZ Hall, Room 212  TuTh 9:30 AM
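The WHERE-clause rules for student_schedule can be written down as a small checker. This Python sketch is an illustration, not Cassandra's query planner: it only models conditions joined by AND (CQL has no OR), it ignores secondary indexes and ALLOW FILTERING, and it adds one rule the slides do not spell out, namely that no clustering column may be restricted after one that used an inequality. `check_where` is a hypothetical helper.

```python
PARTITION_KEY = ["student_id"]
CLUSTERING = ["semester", "class_id"]
EQUALITY = ("=", "IN")

def check_where(conditions):
    """conditions: list of (column, operator) pairs joined by AND.
    Returns None when the query is allowed, otherwise a reason string."""
    ops = {}
    for column, op in conditions:
        ops.setdefault(column, []).append(op)

    # Only key columns may appear in the WHERE clause.
    for column in ops:
        if column not in PARTITION_KEY + CLUSTERING:
            return "non-key field: " + column

    # Every partition key column must be present, with an equality.
    for column in PARTITION_KEY:
        if column not in ops:
            return "missing partition key column: " + column
        if any(op not in EQUALITY for op in ops[column]):
            return "non-equality condition against partition key"

    # Clustering columns must be supplied in order, with no gaps.
    gap_seen = False
    range_seen = False
    for column in CLUSTERING:
        if column not in ops:
            gap_seen = True
            continue
        if gap_seen or range_seen:
            return "clustering column out of order: " + column
        if any(op not in EQUALITY for op in ops[column]):
            range_seen = True  # assumption beyond the slides, see lead-in
    return None

if __name__ == "__main__":
    print(check_where([("student_id", "="), ("semester", "=")]))
    print(check_where([("course_code", "=")]))
    print(check_where([("student_id", "="), ("class_id", "=")]))
```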
14. Key Structure
Partitioning Key - Considerations:
1. Spread data adequately across the cluster so that you don’t create hotspots.
2. Minimize the number of partition reads. Ideally, you can get all your data out of one partition.
3. Updates that happen within the same partition have some atomicity guarantees.
Clustering Columns - Considerations:
1. A partition can contain a maximum of 2 billion cells (column values).
2. As a rule of thumb, a partition should not hold more than about 100 MB of data.
GETTING THE KEY STRUCTURE CORRECT IS THE KEY TO GOOD DATA MODELING
15. Key Structure
CREATE TABLE student_schedule_v1
(
    student_id text,
    semester text,
    class_id text,
    course_code text,
    …,
    PRIMARY KEY ( (student_id), semester, class_id )
);
v1: Allows for queries 1) by the student_id only, OR 2) by the student_id and semester.
CREATE TABLE student_schedule_v2
(
    student_id text,
    semester text,
    class_id text,
    course_code text,
    …,
    PRIMARY KEY ( (student_id, semester), class_id )
);
v2: Minimizes the number of partition reads. I consider this the winner.
CREATE TABLE student_schedule_v3
(
    student_id text,
    semester text,
    class_id text,
    course_code text,
    …,
    PRIMARY KEY ( (student_id, semester, class_id) )
);
v3: Requires that a field be passed to satisfy the query that we don’t necessarily have in our app.
CREATE TABLE student_schedule_v4
(
    semester text,
    student_id text,
    class_id text,
    course_code text,
    …,
    PRIMARY KEY ( (semester), student_id, class_id )
);
v4: Creates a potential hotspot.
SELECT *
FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'Fall 2016';
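One way to compare the four designs is to group the same rows by each candidate partitioning key and look at how the data spreads out. The Python sketch below does that for a handful of rows adapted from the sample data; it is a back-of-the-envelope model, and `DESIGNS`, `partition_sizes`, and `partitions_read` are hypothetical names.

```python
from collections import defaultdict

COLUMNS = ("student_id", "semester", "class_id")
ROWS = [  # adapted from the sample data in the earlier slides
    ("S-111", "FALL 2016", "FA16-ENGL-102-04"),
    ("S-111", "SPRING 2016", "SP16-ENGL-101-01"),
    ("S-111", "SPRING 2016", "SP16-MATH-203-01"),
    ("S-222", "FALL 2016", "FA16-HIST-108-03"),
]

# Partitioning key columns of each candidate table.
DESIGNS = {
    "v1": ("student_id",),
    "v2": ("student_id", "semester"),
    "v3": ("student_id", "semester", "class_id"),
    "v4": ("semester",),
}

def partition_sizes(partition_columns):
    """Group rows by the chosen partitioning key: {partition key: row count}."""
    sizes = defaultdict(int)
    for row in ROWS:
        record = dict(zip(COLUMNS, row))
        sizes[tuple(record[c] for c in partition_columns)] += 1
    return dict(sizes)

def partitions_read(partition_columns, student_id, semester):
    """Count the partitions that hold one student's rows for one semester."""
    keys = set()
    for row in ROWS:
        record = dict(zip(COLUMNS, row))
        if record["student_id"] == student_id and record["semester"] == semester:
            keys.add(tuple(record[c] for c in partition_columns))
    return len(keys)

if __name__ == "__main__":
    for name, columns in DESIGNS.items():
        print(name, partition_sizes(columns),
              partitions_read(columns, "S-111", "SPRING 2016"))
```

Under v4 every student enrolled in a semester lands in the same partition, which is the hotspot; under v3 the schedule is scattered one row per partition, so the app would need each class_id up front.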
16. Upserts
On writes, Cassandra always does an upsert (i.e., update if the record exists, and insert if the record doesn’t exist).
Suppose you picked a poor key for your table (one that doesn’t make the rows unique) and then inserted the following data.
CREATE TABLE student_schedule_BAD_PK
(
    student_id text,
    semester text,
    class_id text,
    course_code text,
    …,
    PRIMARY KEY ( student_id )
);
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'FALL 2016', 'FA16-ENGL-102-04', … );
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'SPRING 2016', 'SP16-ENGL-101-01', … );
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'SPRING 2016', 'SP16-MATH-203-01', … );
SELECT * FROM student_schedule_BAD_PK WHERE student_id = 'S-111';
Result?
Accidental upserting is a common issue early in your data model testing. It can be tough to track down because it doesn’t throw an error.
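The effect of the three inserts can be mimicked with a dictionary keyed by the primary key. This Python sketch is only a model of upsert semantics: it assumes the writes arrive in the order shown, whereas Cassandra resolves conflicting writes per column by write timestamp. The `upsert` helper and the `good_pk` comparison table are hypothetical.

```python
def upsert(table, primary_key_columns, row):
    """Insert the row, or silently overwrite the row with the same key."""
    key = tuple(row[c] for c in primary_key_columns)
    table.setdefault(key, {}).update(row)

ROWS = [
    {"student_id": "S-111", "semester": "FALL 2016", "class_id": "FA16-ENGL-102-04"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-ENGL-101-01"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-MATH-203-01"},
]

bad_pk = {}    # PRIMARY KEY ( student_id )
good_pk = {}   # PRIMARY KEY ( (student_id), semester, class_id )
for row in ROWS:
    upsert(bad_pk, ("student_id",), row)
    upsert(good_pk, ("student_id", "semester", "class_id"), row)

if __name__ == "__main__":
    print(len(bad_pk), bad_pk)
    print(len(good_pk))
```

With the poor key, the three writes collapse into a single row holding the values of the last insert, and nothing signals that the two earlier rows were overwritten.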
17. Advanced Techniques
1. CQL Collections (sets, lists, maps)
2. User Defined Types
3. Tuples
4. Static Columns