Jim Hatcher
DFW Cassandra Users - Meetup
7/12/2016
Introduction to Data Modeling with Apache Cassandra
Agenda
• Introduction
• How does Cassandra work?
• What is CQL?
• Embracing Denormalization
• Key Structure
• Advanced Techniques
• Resources
Introduction
Jim Hatcher
james_hatcher@hotmail.com
At IHS, we take raw data and turn it into information and insights for our customers.
Automotive Systems (CarFax)
Defense Systems (Jane’s)
Oil & Gas Systems (Petra)
Maritime Systems
Technology & Media Systems (Electronic Parts Database, Root Metrics)
Sources of Raw Data → Structure Data → Add Value → Customer-facing Systems
How does Cassandra work?
CREATE KEYSPACE orders
WITH replication =
{
'class': 'SimpleStrategy',
'replication_factor': 3
};
CREATE TABLE orders.customer
(
customer_id uuid,
customer_name varchar,
customer_age int,
PRIMARY KEY ( customer_id )
);
INSERT INTO orders.customer (customer_id, customer_name, customer_age)
VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);
SELECT customer_name, customer_age FROM orders.customer
WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;
Cassandra Cluster
[Diagram: a client connects to a ring of six nodes, A through F. Each node owns one of the following token ranges.]
-9223372036854775808 through -6148914691236517207
-6148914691236517206 through -3074457345618258605
-3074457345618258604 through -3
-2 through 3074457345618258599
3074457345618258600 through 6148914691236517201
6148914691236517202 through 9223372036854775807
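The token ranges above can be turned into a small lookup that shows how a token is matched to the node that owns it. This is only a sketch under stated assumptions: the slide does not say which node owns which range, so the A-to-F ordering below is assumed, and `owner_for_token` is a hypothetical helper. A real cluster derives the token by hashing the partition key and also places replicas on additional nodes, which this sketch does not model.

```python
from bisect import bisect_left

# Inclusive upper bound of each token range from the slide.
# Assumption: node A owns the first range, B the second, and so on.
RANGE_UPPER_BOUNDS = [
    -6148914691236517207,
    -3074457345618258605,
    -3,
    3074457345618258599,
    6148914691236517201,
    9223372036854775807,  # 2**63 - 1, the largest signed 64-bit value
]
NODES = ["A", "B", "C", "D", "E", "F"]

MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def owner_for_token(token: int) -> str:
    """Return the node whose token range contains the given token."""
    if not MIN_TOKEN <= token <= MAX_TOKEN:
        raise ValueError("token outside the signed 64-bit range")
    # bisect_left returns the index of the first upper bound that is >= token.
    return NODES[bisect_left(RANGE_UPPER_BOUNDS, token)]


print(owner_for_token(0))
```

Because the ranges are contiguous and sorted, a binary search over the upper bounds is enough; no range needs to store its lower bound.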
CQL
Cassandra Query Language
Standard interface for working with Cassandra
Very similar to standard SQL, with a few notable exceptions:
• No JOIN clauses
• No GROUP BY / HAVING clauses
• Restricted WHERE clauses
• You can only query by key fields in prescribed ways
CQL Data Types
CQL Type    Description
bigint      64-bit signed long
boolean     true or false
decimal     Variable-precision decimal
double      64-bit IEEE-754 floating point
float       32-bit IEEE-754 floating point
int         32-bit signed integer
text        UTF-8 encoded string
timestamp   Date plus time, encoded as 8 bytes since epoch
timeuuid    Type 1 UUID only
uuid        A UUID in standard UUID format
Others: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cql_data_types_c.html
Normalization
In relational databases, we start by understanding how the pieces of data relate to each other. We create a conceptual model.
Our physical model often looks identical to the conceptual model.
Student: StudentID (PK), FirstName, LastName, DateOfBirth
Course: CourseID (PK), CourseCode, CourseNumber, CourseName, CourseDescription, Department
Class: ClassID (PK), CourseID (FK), Semester, Section, Professor, Classroom, DayAndTime
StudentClassSchedule: StudentID (PK), ClassID (PK), Grade
Normalization
Course
CourseID  CourseCode  CourseNumber  CourseName           Department
C-AAA     ENGL        101           American Literature  Humanities
C-BBB     MATH        203           Linear Algebra       Mathematics
C-CCC     BIOL        201           Molecular Biology    Science
C-DDD     HIST        108           World History        History
C-EEE     ENGL        102           British Literature   Humanities
Class
ClassID           CourseID  Semester     Section  Professor           Classroom           DayAndTime
SP16-ENGL-101-01  C-AAA     Spring 2016  01       Mark Twain          XYZ Hall, Room 212  MWF 8:00 AM
SP16-MATH-203-01  C-BBB     Spring 2016  01       Isaac Newton        XYZ Hall, Room 212  TuTh 9:30 AM
FA16-BIOL-201-04  C-CCC     Fall 2016    04       Charles Darwin      XYZ Hall, Room 210  MWF 9:00 AM
FA16-HIST-108-03  C-DDD     Fall 2016    03       Napoleon Bonaparte  XYZ Hall, Room 317  TuTh 12:00 PM
FA16-ENGL-102-04  C-EEE     Fall 2016    04       Virginia Woolf      XYZ Hall, Room 184  MWF 10:00 AM
FA16-ENGL-102-04  C-EEE     Fall 2016    05       Jane Austen         XYZ Hall, Room 185  TuTh 2:00 PM
Every piece of data lives in one and only one place.
We use our data-layer to enforce referential integrity.
Student
StudentID  FirstName  LastName  DateOfBirth
S-111      Joe        Smith     1/1/1970
S-222      Jill       Jones     2/2/1972
S-333      Betty      Williams  3/3/1973
StudentClassSchedule
StudentID  ClassID           Grade
S-111      SP16-ENGL-101-01  A
S-111      SP16-MATH-203-01  C
S-111      FA16-BIOL-201-04  <null>
S-111      FA16-HIST-108-03  <null>
S-111      FA16-ENGL-102-04  <null>
S-222      FA16-HIST-108-03  <null>
Normalization
To satisfy a query, we join tables together.
To give a student his/her schedule, we might use this query:
SELECT Course.CourseCode, Course.CourseNumber, Course.CourseName, Class.ClassID, Class.Section,
Class.Classroom, Class.DayAndTime
FROM StudentClassSchedule
INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
INNER JOIN Course ON Class.CourseID = Course.CourseID
WHERE StudentClassSchedule.StudentID = 'S-111'
AND Class.Semester = 'Fall 2016';
To give a professor a class roster, we might use this query:
SELECT Student.FirstName, Student.LastName, Class.Classroom, Class.DayAndTime
FROM Student
INNER JOIN StudentClassSchedule ON Student.StudentID = StudentClassSchedule.StudentID
INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
WHERE Class.ClassID = 'FA16-HIST-108-03';
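To see the join-based schedule query run end to end, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in relational database. SQLite is an assumption made purely for convenience (the slides do not name a relational engine), and only a subset of the sample rows and columns is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Course (CourseID TEXT PRIMARY KEY, CourseCode TEXT,
                     CourseNumber INTEGER, CourseName TEXT);
CREATE TABLE Class (ClassID TEXT PRIMARY KEY, CourseID TEXT REFERENCES Course(CourseID),
                    Semester TEXT, Section TEXT, Classroom TEXT, DayAndTime TEXT);
CREATE TABLE StudentClassSchedule (StudentID TEXT, ClassID TEXT REFERENCES Class(ClassID),
                                   Grade TEXT, PRIMARY KEY (StudentID, ClassID));
""")
conn.executemany("INSERT INTO Course VALUES (?, ?, ?, ?)", [
    ("C-AAA", "ENGL", 101, "American Literature"),
    ("C-DDD", "HIST", 108, "World History"),
])
conn.executemany("INSERT INTO Class VALUES (?, ?, ?, ?, ?, ?)", [
    ("SP16-ENGL-101-01", "C-AAA", "Spring 2016", "01", "XYZ Hall, Room 212", "MWF 8:00 AM"),
    ("FA16-HIST-108-03", "C-DDD", "Fall 2016", "03", "XYZ Hall, Room 317", "TuTh 12:00 PM"),
])
conn.executemany("INSERT INTO StudentClassSchedule VALUES (?, ?, ?)", [
    ("S-111", "SP16-ENGL-101-01", "A"),
    ("S-111", "FA16-HIST-108-03", None),
    ("S-222", "FA16-HIST-108-03", None),
])

# The schedule query from the slide, trimmed to four columns and parameterized.
schedule = conn.execute("""
    SELECT Course.CourseCode, Course.CourseNumber, Class.ClassID, Class.Classroom
    FROM StudentClassSchedule
    INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
    INNER JOIN Course ON Class.CourseID = Course.CourseID
    WHERE StudentClassSchedule.StudentID = ? AND Class.Semester = ?
""", ("S-111", "Fall 2016")).fetchall()
print(schedule)
```

The classroom is read from the single `Class` row at query time, which is exactly the work a Cassandra data model moves to write time instead.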
Denormalization
We create one table per query:
Query: Student Schedule for a Given Semester. Table: student_schedule
Query: Student Roster for a Given Class. Table: class_roster
If updates happen to “core data,” we need a mechanism to propagate them to every copy.
For instance, if a class is relocated to a new classroom, we now have to update
the classroom field in both of the tables above.
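The classroom example can be sketched as a fan-out update: one logical change has to be written to every table that holds a copy of the value. The lists of dicts below are in-memory stand-ins for the two tables, not a Cassandra client, and `relocate_class` and the new room number are hypothetical.

```python
# In-memory stand-ins for the two denormalized tables.
student_schedule = [
    {"student_id": "S-111", "class_id": "FA16-HIST-108-03", "classroom": "XYZ Hall, Room 317"},
    {"student_id": "S-222", "class_id": "FA16-HIST-108-03", "classroom": "XYZ Hall, Room 317"},
]
class_roster = [
    {"class_id": "FA16-HIST-108-03", "student_id": "S-111", "classroom": "XYZ Hall, Room 317"},
    {"class_id": "FA16-HIST-108-03", "student_id": "S-222", "classroom": "XYZ Hall, Room 317"},
]


def relocate_class(class_id: str, new_classroom: str) -> int:
    """Update the classroom in every copy of the class; return the number of rows touched."""
    touched = 0
    for table in (student_schedule, class_roster):
        for row in table:
            if row["class_id"] == class_id:
                row["classroom"] = new_classroom
                touched += 1
    return touched


# Hypothetical new room, used only for illustration.
rows_updated = relocate_class("FA16-HIST-108-03", "XYZ Hall, Room 101")
print(rows_updated)
```

One relocation turns into four row updates here. That write amplification is the price paid for reads that need no joins.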
Key Structure
CREATE TABLE student_schedule
(
student_id text,
semester text,
class_id text,
course_code text,
course_number int,
section text,
classroom text,
day_and_time text,
PRIMARY KEY ( (student_id), semester, class_id )
);
The primary key is the combination of
1. the partitioning key, and
2. the clustering columns
As in a relational database, it uniquely identifies the row.
The values in the primary key cannot be NULL.
The first value in the PRIMARY KEY clause is the partitioning key. Any subsequent values are clustering columns. To specify a multi-column partitioning key, wrap it in parentheses.
Key Structure
Primary Key
PRIMARY KEY ( (student_id), semester, class_id )
Partitioning Key: (student_id). The partitioning key is responsible for distributing data across the cluster. Separates data.
Clustering Columns: semester, class_id. Within a given partition, clustering columns are responsible for clustering data values together. Connects data.
This is a representation of how Cassandra stores data on disk:
Partition student_id = S-111
FALL 2016 : FA16-ENGL-102-04 : course_code = ENGL
SPRING 2016 : SP16-ENGL-101-01 : course_code = ENGL
….
When you access Cassandra data via CQL, you retrieve CQL Rows.
A “CQL Row” can be (and usually is) different from the physical structure (a partition) in which the data is stored within the Cassandra cluster.
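The difference between a partition and the CQL rows it yields can be sketched with nested mappings: one entry per partition key, holding one entry per combination of clustering values. This layout is only an illustration of the idea, not Cassandra's actual on-disk format, and `cql_rows` is a hypothetical helper.

```python
# partition key -> (semester, class_id) -> remaining columns
partitions = {
    "S-111": {
        ("FALL 2016", "FA16-ENGL-102-04"): {"course_code": "ENGL", "course_number": 102},
        ("SPRING 2016", "SP16-ENGL-101-01"): {"course_code": "ENGL", "course_number": 101},
        ("SPRING 2016", "SP16-MATH-203-01"): {"course_code": "MATH", "course_number": 203},
    }
}


def cql_rows(partitions: dict) -> list:
    """Flatten each partition into one row per clustering-key combination."""
    rows = []
    for student_id, clustered in partitions.items():
        # Rows inside a partition are ordered by the clustering columns.
        for (semester, class_id), columns in sorted(clustered.items()):
            rows.append({"student_id": student_id, "semester": semester,
                         "class_id": class_id, **columns})
    return rows


rows = cql_rows(partitions)
print(len(rows))
```

A single stored partition for student S-111 comes back as three CQL rows, each repeating the partition key.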
Partitioning Key:
Must be queried using an equality expression (i.e., = or IN).
If you have a multi-field partitioning key, you must specify all the fields in the partitioning key to query the data.
Clustering Columns:
Can be queried with an inequality (i.e., <, >) or an equality.
If you have multiple clustering columns, you don’t have to specify all of them, but you do have to specify them in order (i.e., you can’t specify clustering column #2 unless you also supply clustering column #1).
student_schedule
Primary Key: student_id (Partitioning Key); semester, class_id (Clustering Columns)
student_id  semester     class_id          course_code  course_number  section  classroom           day_and_time
S-111       FALL 2016    FA16-ENGL-102-04  ENGL         102            04       XYZ Hall, Room 185  TuTh 2:00 PM
S-111       SPRING 2016  SP16-ENGL-101-01  ENGL         101            01       XYZ Hall, Room 212  MWF 8:00 AM
S-111       SPRING 2016  SP16-MATH-203-01  MATH         203            01       XYZ Hall, Room 212  TuTh 9:30 AM
Querying with CQL
Acceptable Queries:
SELECT * FROM student_schedule;
SELECT * FROM student_schedule
WHERE student_id = 'S-111';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016'
AND class_id = 'SP16-ENGL-101-01';
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'SPRING 2016'
AND class_id >= 'SP16-ENGL-101-01'
AND class_id < 'SP16-ENGL-999-99';
UN-acceptable Queries:
-- Non-key field
SELECT * FROM student_schedule
WHERE course_code = 'ENGL';
-- Non-equality condition against Partitioning Key
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
OR student_id = 'S-222';
-- Specifying a clustering column but not in order
SELECT * FROM student_schedule
WHERE student_id = 'S-111'
AND class_id = 'SP16-ENGL-101-01';
Note: Yes, I know I could mention secondary indexes and the ALLOW
FILTERING clause at this point; but they’re anti-patterns, so don’t use
them.
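The rules behind the acceptable and unacceptable queries can be written as a small checker. This is a simplified sketch, not Cassandra's real validation logic: `is_acceptable` is a hypothetical function, each column carries a single operator, and the OR case is not modeled because the sketch only describes AND-ed restrictions.

```python
PARTITION_KEY = ["student_id"]
CLUSTERING_COLUMNS = ["semester", "class_id"]
EQUALITY_OPS = ("=", "IN")


def is_acceptable(restrictions: dict) -> bool:
    """restrictions maps a column name to its operator, e.g. {"student_id": "="}."""
    key_columns = PARTITION_KEY + CLUSTERING_COLUMNS
    if any(col not in key_columns for col in restrictions):
        return False  # non-key field in the WHERE clause
    if not restrictions:
        return True   # no WHERE clause at all
    # Every partitioning-key column must be restricted by an equality.
    for col in PARTITION_KEY:
        if restrictions.get(col) not in EQUALITY_OPS:
            return False
    # Clustering columns must be supplied in order, with no gaps,
    # and nothing may follow an inequality.
    seen_gap = False
    seen_range = False
    for col in CLUSTERING_COLUMNS:
        op = restrictions.get(col)
        if op is None:
            seen_gap = True
            continue
        if seen_gap or seen_range:
            return False
        if op not in EQUALITY_OPS:
            seen_range = True
    return True


print(is_acceptable({"student_id": "=", "semester": "="}))
print(is_acceptable({"student_id": "=", "class_id": "="}))
```

The second call is rejected for the same reason as the last unacceptable query: `class_id` is supplied without `semester`, the clustering column that precedes it.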
student_schedule
Primary Key: student_id (Partitioning Key); semester, class_id (Clustering Columns)
student_id  semester     class_id          course_code  course_number  section  classroom           day_and_time
S-111       FALL 2016    FA16-ENGL-102-04  ENGL         102            04       XYZ Hall, Room 185  TuTh 2:00 PM
S-111       SPRING 2016  SP16-ENGL-101-01  ENGL         101            01       XYZ Hall, Room 212  MWF 8:00 AM
S-111       SPRING 2016  SP16-MATH-203-01  MATH         203            01       XYZ Hall, Room 212  TuTh 9:30 AM
Key Structure
Partitioning Key - Considerations:
1. Spread data adequately across the cluster so that you don’t create hotspots.
2. Minimize the number of partition reads. Ideally, you can get all your data out of one partition.
3. Updates that happen within the same partition have some atomicity guarantees.
Clustering Columns - Considerations:
1. A partition can contain a maximum of 2 billion cells (column values).
2. A partition should not hold more than about 100 MB of data.
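A back-of-the-envelope check against these two limits might look like the sketch below. The cell and size formulas are deliberately rough assumptions (one cell per non-key column per row, and a guessed average row size), and the workload numbers are hypothetical.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000     # hard limit quoted on the slide
TARGET_PARTITION_BYTES = 100 * 1024 * 1024  # 100 MB guideline quoted on the slide


def estimate_partition(rows: int, non_key_columns: int, avg_row_bytes: int) -> dict:
    """Rough estimate: one cell per non-key column per row, size = rows * average row size."""
    cells = rows * non_key_columns
    size_bytes = rows * avg_row_bytes
    return {
        "cells": cells,
        "size_bytes": size_bytes,
        "within_cell_limit": cells <= MAX_CELLS_PER_PARTITION,
        "within_size_guideline": size_bytes <= TARGET_PARTITION_BYTES,
    }


# Hypothetical workload: one student, 60 classes, 5 non-key columns, about 200 bytes per row.
estimate = estimate_partition(rows=60, non_key_columns=5, avg_row_bytes=200)
print(estimate)
```

A per-student partition is tiny by this estimate. The same arithmetic with every student in a semester as the row count shows how quickly a partition keyed only by semester would grow.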
GETTING THE KEY STRUCTURE CORRECT IS THE KEY TO GOOD DATA MODELING
Key Structure
The query we need to satisfy:
SELECT *
FROM student_schedule
WHERE student_id = 'S-111'
AND semester = 'FALL 2016';
v1: Allows for queries 1) by the student_id only, OR 2) by the student_id and semester.
CREATE TABLE student_schedule_v1
(
student_id text,
semester text,
class_id text,
course_code text,
…,
PRIMARY KEY ( (student_id), semester, class_id )
);
v2: Minimizes the number of partition reads. I consider this the winner.
CREATE TABLE student_schedule_v2
(
student_id text,
semester text,
class_id text,
course_code text,
…,
PRIMARY KEY ( (student_id, semester), class_id )
);
v3: Requires that a field be passed to satisfy the query that we don’t necessarily have in our app.
CREATE TABLE student_schedule_v3
(
student_id text,
semester text,
class_id text,
course_code text,
…,
PRIMARY KEY ( (student_id, semester, class_id) )
);
v4: Creates a potential hotspot.
CREATE TABLE student_schedule_v4
(
semester text,
student_id text,
class_id text,
course_code text,
…,
PRIMARY KEY ( (semester), student_id, class_id )
);
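To make the trade-offs between the four key designs concrete, the sketch below groups a few sample rows by each candidate partitioning key and counts the resulting partitions. Plain Python tuples stand in for partition keys (no hashing or cluster is involved), and `group_by_partition` is a hypothetical helper.

```python
from collections import defaultdict

rows = [
    {"student_id": "S-111", "semester": "FALL 2016", "class_id": "FA16-ENGL-102-04"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-ENGL-101-01"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-MATH-203-01"},
    {"student_id": "S-222", "semester": "FALL 2016", "class_id": "FA16-HIST-108-03"},
]

# Partitioning-key columns of each table version from the slide.
DESIGNS = {
    "v1": ("student_id",),
    "v2": ("student_id", "semester"),
    "v3": ("student_id", "semester", "class_id"),
    "v4": ("semester",),
}


def group_by_partition(rows: list, partition_key: tuple) -> dict:
    """Group class_ids by the values of the partitioning-key columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[col] for col in partition_key)].append(row["class_id"])
    return dict(partitions)


partition_counts = {name: len(group_by_partition(rows, key)) for name, key in DESIGNS.items()}
print(partition_counts)
```

With v4, every student enrolled in a semester lands in the same partition, which is the hotspot risk. With v3, each row sits in its own partition, so a schedule cannot be read without already knowing every class_id.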
Upserts
On writes, Cassandra always does an upsert (i.e., update if the record exists, and insert if the record doesn’t exist).
Suppose you picked a poor key for your table (one that doesn’t make the rows unique) and then you inserted the following data.
CREATE TABLE student_schedule_BAD_PK
(
student_id text,
semester text,
class_id text,
course_code text,
…,
PRIMARY KEY ( student_id )
);
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'FALL 2016', 'FA16-ENGL-102-04', … );
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'SPRING 2016', 'SP16-ENGL-101-01', … );
INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …)
VALUES ( 'S-111', 'SPRING 2016', 'SP16-MATH-203-01', … );
SELECT * FROM student_schedule_BAD_PK WHERE student_id = 'S-111';
Result? A single row: each INSERT overwrote the one before it, so only the values from the last INSERT remain.
Accidental upserting is a common issue early in your data model testing. It can be tough to track down because it doesn’t throw an error.
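The accidental upsert can be reproduced with a dict keyed by the primary key. This is an in-memory model of the behavior, not a Cassandra client, and `insert` is a hypothetical helper. Real Cassandra resolves conflicting writes per column by write timestamp, which gives the same outcome for sequential inserts like these.

```python
# A table keyed only by student_id, modeled as primary key -> row.
student_schedule_bad_pk = {}


def insert(table: dict, primary_key: str, row: dict) -> None:
    """Mimic an upsert: an existing key has its columns overwritten, and no error is raised."""
    table.setdefault(primary_key, {}).update(row)


insert(student_schedule_bad_pk, "S-111",
       {"semester": "FALL 2016", "class_id": "FA16-ENGL-102-04"})
insert(student_schedule_bad_pk, "S-111",
       {"semester": "SPRING 2016", "class_id": "SP16-ENGL-101-01"})
insert(student_schedule_bad_pk, "S-111",
       {"semester": "SPRING 2016", "class_id": "SP16-MATH-203-01"})

print(student_schedule_bad_pk)
```

Three inserts leave one row behind, and nothing signals that two classes were lost. Adding semester and class_id to the primary key gives each class its own row.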
Advanced Techniques
1. CQL Collections (sets, lists, maps)
2. User Defined Types
3. Tuples
4. Static Columns
Resources
1. DataStax Academy – Self-paced course
https://academy.datastax.com/courses/ds220-data-modeling
2. KillrVideo
https://academy.datastax.com/resources/datastax-reference-application-killrvideo/

More Related Content

Viewers also liked

AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppDynamics
 
Bbc jan13 ftth_households
Bbc jan13 ftth_householdsBbc jan13 ftth_households
Bbc jan13 ftth_householdsBailey White
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadAll Things Open
 
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...Daniel Bryant
 
Ecce de-gids nl
Ecce de-gids nlEcce de-gids nl
Ecce de-gids nlswaipnew
 
LXC - kontener pingwinów
LXC - kontener pingwinówLXC - kontener pingwinów
LXC - kontener pingwinówgnosek
 
Performance testing for web-scale
Performance testing for web-scalePerformance testing for web-scale
Performance testing for web-scaleIzzet Mustafaiev
 
Resume -Resume -continous monitoring
Resume -Resume -continous monitoringResume -Resume -continous monitoring
Resume -Resume -continous monitoringTony Kenny
 
Amazon Elastic Block Store for Application Storage
Amazon Elastic Block Store for Application StorageAmazon Elastic Block Store for Application Storage
Amazon Elastic Block Store for Application StorageAmazon Web Services
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Amazon Web Services
 
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entity
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entitySpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entity
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entityjordigilnieto
 
Java management extensions (jmx)
Java management extensions (jmx)Java management extensions (jmx)
Java management extensions (jmx)Tarun Telang
 

Viewers also liked (20)

Unit I.fundamental of Programmable DSP
Unit I.fundamental of Programmable DSPUnit I.fundamental of Programmable DSP
Unit I.fundamental of Programmable DSP
 
AppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance ChallengesAppSphere 15 - Containers and Microservices Create New Performance Challenges
AppSphere 15 - Containers and Microservices Create New Performance Challenges
 
Bbc jan13 ftth_households
Bbc jan13 ftth_householdsBbc jan13 ftth_households
Bbc jan13 ftth_households
 
Incident Response in the wake of Dear CEO
Incident Response in the wake of Dear CEOIncident Response in the wake of Dear CEO
Incident Response in the wake of Dear CEO
 
Watering hole attacks case study analysis
Watering hole attacks case study analysisWatering hole attacks case study analysis
Watering hole attacks case study analysis
 
Regex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language InsteadRegex Considered Harmful: Use Rosie Pattern Language Instead
Regex Considered Harmful: Use Rosie Pattern Language Instead
 
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...
LJC Mashup "Building Java Microservices for the Cloud && Chuck Norris Doesn't...
 
"Mini Texts"
"Mini Texts" "Mini Texts"
"Mini Texts"
 
Ecce de-gids nl
Ecce de-gids nlEcce de-gids nl
Ecce de-gids nl
 
114 Numalliance
114 Numalliance114 Numalliance
114 Numalliance
 
LXC - kontener pingwinów
LXC - kontener pingwinówLXC - kontener pingwinów
LXC - kontener pingwinów
 
Performance testing for web-scale
Performance testing for web-scalePerformance testing for web-scale
Performance testing for web-scale
 
Resume -Resume -continous monitoring
Resume -Resume -continous monitoringResume -Resume -continous monitoring
Resume -Resume -continous monitoring
 
Amazon Elastic Block Store for Application Storage
Amazon Elastic Block Store for Application StorageAmazon Elastic Block Store for Application Storage
Amazon Elastic Block Store for Application Storage
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Distributed cat herding
Distributed cat herdingDistributed cat herding
Distributed cat herding
 
EVOLVE'16 | Enhance | Gordon Pike | Rev Up Your Marketing Engine
EVOLVE'16 | Enhance | Gordon Pike | Rev Up Your Marketing EngineEVOLVE'16 | Enhance | Gordon Pike | Rev Up Your Marketing Engine
EVOLVE'16 | Enhance | Gordon Pike | Rev Up Your Marketing Engine
 
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entity
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entitySpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entity
SpringIO 2016 - Spring Cloud MicroServices, a journey inside a financial entity
 
Automating interactions with Zabbix (Raymond Kuiper / 12-02-2015)
Automating interactions with Zabbix (Raymond Kuiper / 12-02-2015)Automating interactions with Zabbix (Raymond Kuiper / 12-02-2015)
Automating interactions with Zabbix (Raymond Kuiper / 12-02-2015)
 
Java management extensions (jmx)
Java management extensions (jmx)Java management extensions (jmx)
Java management extensions (jmx)
 

Similar to Introduction to Data Modeling in Cassandra

Micro project project co 3i
Micro project project co 3iMicro project project co 3i
Micro project project co 3iARVIND SARDAR
 
2. DBMS Experiment - Lab 2 Made in SQL Used
2. DBMS Experiment - Lab 2 Made in SQL Used2. DBMS Experiment - Lab 2 Made in SQL Used
2. DBMS Experiment - Lab 2 Made in SQL UsedTheVerse1
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAmazon Web Services
 
New fordevelopersinsql server2008
New fordevelopersinsql server2008New fordevelopersinsql server2008
New fordevelopersinsql server2008Aaron Shilo
 
What's new in MariaDB TX 3.0
What's new in MariaDB TX 3.0What's new in MariaDB TX 3.0
What's new in MariaDB TX 3.0MariaDB plc
 
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docx
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docxCharles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docx
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docxchristinemaritza
 
Introduction to Dating Modeling for Cassandra
Introduction to Dating Modeling for CassandraIntroduction to Dating Modeling for Cassandra
Introduction to Dating Modeling for CassandraDataStax Academy
 
Queuing Sql Server: Utilise queues to increase performance in SQL Server
Queuing Sql Server: Utilise queues to increase performance in SQL ServerQueuing Sql Server: Utilise queues to increase performance in SQL Server
Queuing Sql Server: Utilise queues to increase performance in SQL ServerNiels Berglund
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB plc
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraLuke Tillman
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Structured Query Language for Data Management 2 Sructu.docx
Structured Query Language for Data Management      2 Sructu.docxStructured Query Language for Data Management      2 Sructu.docx
Structured Query Language for Data Management 2 Sructu.docxjohniemcm5zt
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's Newdpcobb
 

Similar to Introduction to Data Modeling in Cassandra (20)

Micro project project co 3i
Micro project project co 3iMicro project project co 3i
Micro project project co 3i
 
4. DML.pdf
4. DML.pdf4. DML.pdf
4. DML.pdf
 
2. DBMS Experiment - Lab 2 Made in SQL Used
2. DBMS Experiment - Lab 2 Made in SQL Used2. DBMS Experiment - Lab 2 Made in SQL Used
2. DBMS Experiment - Lab 2 Made in SQL Used
 
unit 1 ppt.pptx
unit 1 ppt.pptxunit 1 ppt.pptx
unit 1 ppt.pptx
 
AWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing PerformanceAWS July Webinar Series: Amazon Redshift Optimizing Performance
AWS July Webinar Series: Amazon Redshift Optimizing Performance
 
New fordevelopersinsql server2008
New fordevelopersinsql server2008New fordevelopersinsql server2008
New fordevelopersinsql server2008
 
Sql analytic queries tips
Sql analytic queries tipsSql analytic queries tips
Sql analytic queries tips
 
What's new in MariaDB TX 3.0
What's new in MariaDB TX 3.0What's new in MariaDB TX 3.0
What's new in MariaDB TX 3.0
 
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docx
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docxCharles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docx
Charles WilliamsCS362Unit 3 Discussion BoardStructured Query Langu.docx
 
Introduction to Dating Modeling for Cassandra
Introduction to Dating Modeling for CassandraIntroduction to Dating Modeling for Cassandra
Introduction to Dating Modeling for Cassandra
 
Queuing Sql Server: Utilise queues to increase performance in SQL Server
Queuing Sql Server: Utilise queues to increase performance in SQL ServerQueuing Sql Server: Utilise queues to increase performance in SQL Server
Queuing Sql Server: Utilise queues to increase performance in SQL Server
 
Sqlserver 2008 r2
Sqlserver 2008 r2Sqlserver 2008 r2
Sqlserver 2008 r2
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Structured Query Language for Data Management 2 Sructu.docx
Structured Query Language for Data Management      2 Sructu.docxStructured Query Language for Data Management      2 Sructu.docx
Structured Query Language for Data Management 2 Sructu.docx
 
Sql 2016 - What's New
Sql 2016 - What's NewSql 2016 - What's New
Sql 2016 - What's New
 
Sql wksht-2
Sql wksht-2Sql wksht-2
Sql wksht-2
 
Sql
SqlSql
Sql
 

Recently uploaded

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Recently uploaded (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Introduction to Data Modeling in Cassandra

  • 1. Jim Hatcher DFW Cassandra Users - Meetup 7/12/2016 Introduction to Data Modeling with Apache Cassandra
  • 2. Agenda
    • Introduction
    • How does Cassandra work?
    • What is CQL?
    • Embracing Denormalization
    • Key Structure
    • Advanced Techniques
    • Resources
  • 3. Introduction
    Jim Hatcher
    james_hatcher@hotmail.com
    At IHS, we take raw data and turn it into information and insights for our customers.
    • Automotive Systems (CarFax)
    • Defense Systems (Jane’s)
    • Oil & Gas Systems (Petra)
    • Maritime Systems
    • Technology & Media Systems (Electronic Parts Database, Root Metrics)
    Data flow: Sources of Raw Data → Structure Data → Add Value → Customer-facing Systems
  • 4. How does Cassandra work?
    CREATE KEYSPACE orders
    WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };

    CREATE TABLE orders.customer (
      customer_id uuid,
      customer_name varchar,
      customer_age int,
      PRIMARY KEY ( customer_id )
    );

    INSERT INTO customer (customer_id, customer_name, customer_age)
    VALUES (feb2b9e6-613b-4e6b-b470-981dc4d42525, 'Bob', 35);

    SELECT customer_name, customer_age
    FROM customer
    WHERE customer_id = feb2b9e6-613b-4e6b-b470-981dc4d42525;

    [Diagram: a client connected to a Cassandra cluster of six nodes, A through F. The token space is divided into six ranges, one per node:]
    -9223372036854775808 through -6148914691236517207
    -6148914691236517206 through -3074457345618258605
    -3074457345618258604 through -3
    -2 through 3074457345618258599
    3074457345618258600 through 6148914691236517201
    6148914691236517202 through 9223372036854775807
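The six token ranges on this slide can be reproduced with a little arithmetic. The sketch below is a toy calculation, not Cassandra code: it assumes the tokens are signed 64-bit integers (as with the default Murmur3 partitioner) and that each node owns exactly one evenly sized range, as drawn on the slide. The function name `token_ranges` is made up for illustration. Real clusters using virtual nodes assign many smaller ranges per node.

```python
# Toy calculation: split the signed 64-bit token space evenly across nodes.
# Assumption: one contiguous range per node, as in the slide's diagram.

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_ranges(node_count):
    """Return a list of (first_token, last_token) pairs, one per node."""
    step = 2**64 // node_count  # width of each range, rounded down
    starts = [MIN_TOKEN + i * step for i in range(node_count)]
    ranges = []
    for i, start in enumerate(starts):
        # Each range ends just before the next one starts; the last range
        # absorbs the rounding remainder and ends at MAX_TOKEN.
        end = starts[i + 1] - 1 if i + 1 < node_count else MAX_TOKEN
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    for first, last in token_ranges(6):
        print(f"{first} through {last}")
```

A row is then stored on the node whose range contains the token computed from the row's partitioning key (plus replicas, per the keyspace's replication settings).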
  • 5. CQL
    Cassandra Query Language
    Standard interface for working with Cassandra
    Very similar to standard SQL, with a few notable exceptions:
    • No JOIN clauses
    • No GROUP BY / HAVING clauses
    • Restricted WHERE clauses
      • You can only query by key fields in prescribed ways

    CQL Data Types
    CQL Type    Description
    bigint      64-bit signed long
    boolean     true or false
    decimal     Variable-precision decimal
    double      64-bit IEEE-754 floating point
    float       32-bit IEEE-754 floating point
    int         32-bit signed integer
    text        UTF-8 encoded string
    timestamp   Date plus time, encoded as 8 bytes since epoch
    timeuuid    Type 1 UUID only
    uuid        A UUID in standard UUID format
    Others: http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cql_data_types_c.html
  • 6. Normalization
    In relational databases, we start by understanding how the data relates together. We create a conceptual model. Our physical model often looks identical to the conceptual model.
    [Diagram: entity-relationship model with four tables]
    Student: StudentID (PK), FirstName, LastName, DateOfBirth
    Course: CourseID (PK), CourseCode, CourseNumber, CourseName, CourseDescription, Department
    Class: ClassID (PK), CourseID (FK), Semester, Section, Professor, Classroom, DayAndTime
    StudentClassSchedule: StudentID (PK), ClassID (PK), Grade
  • 7. Normalization
    Every piece of data lives in one and only one place. We use our data layer to enforce referential integrity.

    Course
    CourseID  CourseCode  CourseNumber  CourseName           Department
    C-AAA     ENGL        101           American Literature  Humanities
    C-BBB     MATH        203           Linear Algebra       Mathematics
    C-CCC     BIOL        201           Molecular Biology    Science
    C-DDD     HIST        108           World History        History
    C-EEE     ENGL        102           British Literature   Humanities

    Class
    ClassID           CourseID  Semester     Section  Professor           Classroom           DayAndTime
    SP16-ENGL-101-01  C-AAA     Spring 2016  01       Mark Twain          XYZ Hall, Room 212  MWF 8:00 AM
    SP16-MATH-203-01  C-BBB     Spring 2016  01       Isaac Newton        XYZ Hall, Room 212  TuTh 9:30 AM
    FA16-BIOL-201-04  C-CCC     Fall 2016    04       Charles Darwin      XYZ Hall, Room 210  MWF 9:00 AM
    FA16-HIST-108-03  C-DDD     Fall 2016    03       Napoleon Bonaparte  XYZ Hall, Room 317  TuTh 12:00 PM
    FA16-ENGL-102-04  C-EEE     Fall 2016    04       Virginia Woolf      XYZ Hall, Room 184  MWF 10:00 AM
    FA16-ENGL-102-04  C-EEE     Fall 2016    05       Jane Austen         XYZ Hall, Room 185  TuTh 2:00 PM

    Student
    StudentID  FirstName  LastName  DateOfBirth
    S-111      Joe        Smith     1/1/1970
    S-222      Jill       Jones     2/2/1972
    S-333      Betty      Williams  3/3/1973

    StudentClassSchedule
    StudentID  ClassID           Grade
    S-111      SP16-ENGL-101-01  A
    S-111      SP16-MATH-203-01  C
    S-111      FA16-BIOL-201-04  <null>
    S-111      FA16-HIST-108-03  <null>
    S-111      FA16-ENGL-102-04  <null>
    S-222      FA16-HIST-108-03  <null>
  • 8. Normalization
    To satisfy a query, we join tables together.

    To give a student his/her schedule, we might use this query:
    SELECT Course.CourseCode, Course.CourseNumber, Course.CourseName,
           Class.ClassID, Class.Section, Class.Classroom, Class.DayAndTime
    FROM StudentClassSchedule
    INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
    INNER JOIN Course ON Class.CourseID = Course.CourseID
    WHERE StudentClassSchedule.StudentID = 'S-111'
      AND Class.Semester = 'Fall 2016';

    To give a professor a class roster, we might use this query:
    SELECT Student.FirstName, Student.LastName, Class.Classroom, Class.DayAndTime
    FROM Student
    INNER JOIN StudentClassSchedule ON Student.StudentID = StudentClassSchedule.StudentID
    INNER JOIN Class ON StudentClassSchedule.ClassID = Class.ClassID
    WHERE Class.ClassID = 'FA16-HIST-108-03';
  • 9. Denormalization
    Queries → Tables
    Student Schedule for a Given Semester → student_schedule
    Student Roster for a Given Class → class_roster
    If updates happen to “core data,” we have to have a mechanism to deal with it. For instance, if a class is relocated to a new classroom, we now have to update the classroom field in both of these tables.
  • 10. Key Structure: Primary Key
    CREATE TABLE student_schedule (
      student_id text,
      semester text,
      class_id text,
      course_code text,
      course_number int,
      section text,
      classroom text,
      day_and_time text,
      PRIMARY KEY ( (student_id), semester, class_id )
    );
    The primary key is the combination of
    1. the partitioning key, and
    2. the clustering columns.
    As in a relational database, it uniquely identifies the row. The values in the primary key cannot be NULL.
    The first value in the PRIMARY KEY clause is the partitioning key. Any subsequent values are clustering columns. To specify a multi-column partitioning key, wrap it in parentheses.
  • 11. Key Structure
    PRIMARY KEY ( (student_id), semester, class_id )
    Partitioning key: (student_id). Clustering columns: semester, class_id.
    The partitioning key is responsible for distributing data across the cluster. It separates data.
    Within a given partition, clustering columns are responsible for clustering data values together. They connect data.
    [Diagram: a representation of how Cassandra stores data on disk]
    Partition student_id = S-111
      FALL 2016 : FA16-ENGL-102-04 : course_code = ENGL
      SPRING 2016 : SP16-ENGL-101-01 : course_code = ENGL
      ….
  • 12. Querying with CQL
    When you access Cassandra data via CQL, you retrieve CQL rows. A “CQL row” can be (and usually is) different from the physical structure (a partition) in which the data is stored within the Cassandra cluster.
    Partitioning Key:
    • Must be queried using an equality expression (i.e., = or IN).
    • If you have a multi-field partitioning key, you must specify all the fields in the partitioning key to query the data.
    Clustering Columns:
    • Can be queried with an inequality (i.e., <, >) or an equality.
    • If you have multiple clustering columns, you don’t have to specify all of them, but you do have to specify them in order (i.e., you can’t specify clustering column #2 unless you also supply clustering column #1).

    SELECT * FROM student_schedule;

    student_schedule (primary key: partitioning key student_id; clustering columns semester, class_id)
    student_id  semester     class_id          course_code  course_number  section  classroom           day_and_time
    S-111       FALL 2016    FA16-ENGL-102-04  ENGL         102            04       XYZ Hall, Room 185  TuTh 2:00 PM
    S-111       SPRING 2016  SP16-ENGL-101-01  ENGL         101            01       XYZ Hall, Room 212  MWF 8:00 AM
    S-111       SPRING 2016  SP16-MATH-203-01  MATH         203            01       XYZ Hall, Room 212  TuTh 9:30 AM
  • 13. CQL
    Acceptable queries:
    SELECT * FROM student_schedule WHERE student_id = 'S-111';
    SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016';
    SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016' AND class_id = 'SP16-ENGL-101-01';
    SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'SPRING 2016' AND class_id >= 'SP16-ENGL-101-01' AND class_id < 'SP16-ENGL-999-99';

    Unacceptable queries:
    SELECT * FROM student_schedule WHERE course_code = 'ENGL';
      (Non-key field)
    SELECT * FROM student_schedule WHERE student_id = 'S-111' OR student_id = 'S-222';
      (Non-equality condition against the partitioning key)
    SELECT * FROM student_schedule WHERE student_id = 'S-111' AND class_id = 'SP16-ENGL-101-01';
      (Specifying a clustering column, but not in order)

    Note: Yes, I know I could mention secondary indexes and the ALLOW FILTERING clause at this point, but they’re anti-patterns, so don’t use them.
    (The slide repeats the student_schedule sample rows shown on slide 12.)
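The rules behind the acceptable and unacceptable queries above can be written down as a small checker. This is a simplified model for illustration only, not how Cassandra parses or validates CQL. The helper `check_query` and its input format (a dict from column name to `"="`, `"IN"`, or `"range"`) are hypothetical, and it ignores secondary indexes and ALLOW FILTERING. It also cannot represent OR, which CQL's WHERE clause does not offer.

```python
# Simplified model of the WHERE-clause rules for the student_schedule table.
# Hypothetical helper for illustration; not part of any Cassandra driver.

PARTITION_KEY = ["student_id"]
CLUSTERING_COLUMNS = ["semester", "class_id"]


def check_query(restrictions):
    """Check a WHERE clause given as {column: operator}.

    Operators are '=', 'IN', or 'range' (any mix of <, <=, >, >=).
    Returns 'ok' or a short reason the query would be rejected.
    """
    key_columns = PARTITION_KEY + CLUSTERING_COLUMNS
    for column in restrictions:
        if column not in key_columns:
            return f"non-key field: {column}"

    # Every partitioning key column must be present with an equality.
    for column in PARTITION_KEY:
        if restrictions.get(column) not in ("=", "IN"):
            return f"partitioning key column {column} needs = or IN"

    # Clustering columns must be restricted in declaration order. Once one
    # is skipped, or restricted by a range, later ones may not be restricted.
    may_restrict_next = True
    for column in CLUSTERING_COLUMNS:
        if column in restrictions:
            if not may_restrict_next:
                return f"clustering column {column} specified out of order"
            may_restrict_next = restrictions[column] in ("=", "IN")
        else:
            may_restrict_next = False
    return "ok"


if __name__ == "__main__":
    print(check_query({"student_id": "=", "semester": "=", "class_id": "range"}))
    print(check_query({"course_code": "="}))
    print(check_query({"student_id": "=", "class_id": "="}))
```

Running the three calls at the bottom mirrors the fourth acceptable query and the first and third unacceptable queries from the slide.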
  • 14. Key Structure
    Partitioning Key - Considerations:
    1. Spread data adequately across the cluster so that you don’t create hotspots.
    2. Minimize the number of partition reads. Ideally, you can get all your data out of one partition.
    3. Updates that happen within the same partition have some atomicity guarantees.
    Clustering Columns - Considerations:
    1. A partition can contain a maximum of 2 billion values.
    2. A partition should not hold more than 100 MB of data.
    GETTING THE KEY STRUCTURE CORRECT IS THE KEY TO GOOD DATA MODELING
  • 15. Key Structure
    CREATE TABLE student_schedule_v1 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id), semester, class_id ) )
    CREATE TABLE student_schedule_v2 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id, semester), class_id ) )
    CREATE TABLE student_schedule_v3 ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( (student_id, semester, class_id) ) )
    CREATE TABLE student_schedule_v4 ( semester text, student_id text, class_id text, course_code text, …, PRIMARY KEY ( (semester), student_id, class_id ) )

    Target query:
    SELECT * FROM student_schedule WHERE student_id = 'S-111' AND semester = 'Fall 2016';

    Callouts on the slide (in transcript order):
    • Creates a potential hotspot.
    • Allows for queries: 1) by the student_id only, OR 2) by the student_id and semester.
    • Minimizes the number of partition reads. I consider this the winner.
    • Requires that a field be passed to satisfy the query that we don’t necessarily have in our app.
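One way to compare candidate partitioning keys is to count how many rows would land in each partition. The sketch below does that for the four key choices above, using the six enrollments from the StudentClassSchedule sample data on slide 7 (with semesters taken from the Class table). It is a toy grouping in plain Python, not a measurement of a real cluster, and `partition_sizes` is a hypothetical helper name.

```python
from collections import Counter

# Enrollment rows from the slide 7 sample data:
# (student_id, semester, class_id)
COLUMNS = ("student_id", "semester", "class_id")
ENROLLMENTS = [
    ("S-111", "Spring 2016", "SP16-ENGL-101-01"),
    ("S-111", "Spring 2016", "SP16-MATH-203-01"),
    ("S-111", "Fall 2016", "FA16-BIOL-201-04"),
    ("S-111", "Fall 2016", "FA16-HIST-108-03"),
    ("S-111", "Fall 2016", "FA16-ENGL-102-04"),
    ("S-222", "Fall 2016", "FA16-HIST-108-03"),
]


def partition_sizes(partition_key):
    """Count rows per partition for a given tuple of partitioning key columns."""
    indexes = [COLUMNS.index(column) for column in partition_key]
    return Counter(tuple(row[i] for i in indexes) for row in ENROLLMENTS)


if __name__ == "__main__":
    candidates = {
        "v1": ("student_id",),
        "v2": ("student_id", "semester"),
        "v3": ("student_id", "semester", "class_id"),
        "v4": ("semester",),
    }
    for name, key in candidates.items():
        print(name, dict(partition_sizes(key)))
```

With only `semester` as the partitioning key, every enrollment for a semester shares one partition, so that partition grows with the whole student body. With all three columns in the partitioning key, every row is its own partition, so reading one student's semester means reading several partitions and knowing each class_id up front.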
  • 16. Upserts
    On writes, Cassandra always does an upsert (i.e., update if the record exists, and insert if the record doesn’t exist).
    Suppose you picked a poor key for your table (one that doesn’t make the rows unique), and then you inserted the following data.
    CREATE TABLE student_schedule_BAD_PK ( student_id text, semester text, class_id text, course_code text, …, PRIMARY KEY ( student_id ) )

    INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'FALL 2016', 'FA16-ENGL-102-04', … );
    INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'SPRING 2016', 'SP16-ENGL-101-01', … );
    INSERT INTO student_schedule_BAD_PK ( student_id, semester, class_id, …) VALUES ( 'S-111', 'SPRING 2016', 'SP16-MATH-203-01', … );

    SELECT * FROM student_schedule_BAD_PK WHERE student_id = 'S-111';
    Result?
    Accidental upserting is a common issue early in your data model testing. It can be tough to track down because it doesn’t throw an error.
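The effect of those three inserts can be illustrated with a dictionary keyed by the primary key. This is a toy model under stated assumptions, not Cassandra's storage engine: the `upsert` helper is hypothetical, writes are applied in call order (Cassandra resolves conflicts by write timestamp), and only the three columns shown on the slide are included.

```python
# Toy model of upsert semantics: a table is a dict keyed by the primary key.
# Writing a row whose key already exists overwrites the supplied columns.


def upsert(table, key_columns, row):
    """Insert the row, or update the existing row that has the same key."""
    key = tuple(row[column] for column in key_columns)
    table.setdefault(key, {}).update(row)


ROWS = [
    {"student_id": "S-111", "semester": "FALL 2016", "class_id": "FA16-ENGL-102-04"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-ENGL-101-01"},
    {"student_id": "S-111", "semester": "SPRING 2016", "class_id": "SP16-MATH-203-01"},
]

# PRIMARY KEY ( student_id ): all three writes hit the same key.
bad_pk_table = {}
for row in ROWS:
    upsert(bad_pk_table, ["student_id"], row)

# PRIMARY KEY ( (student_id), semester, class_id ): each write has its own key.
good_pk_table = {}
for row in ROWS:
    upsert(good_pk_table, ["student_id", "semester", "class_id"], row)

if __name__ == "__main__":
    print(len(bad_pk_table), bad_pk_table)
    print(len(good_pk_table))
```

In this model the table keyed only by `student_id` ends up with a single row holding the values from the last write, while the table keyed by all three columns keeps three rows. No error is raised in either case, which is why the problem is easy to miss.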
  • 17. Advanced Techniques
    1. CQL Collections (sets, lists, maps)
    2. User Defined Types
    3. Tuples
    4. Static Columns
  • 18. Resources
    1. DataStax Academy: Self-paced course
       https://academy.datastax.com/courses/ds220-data-modeling
    2. KillrVideo
       https://academy.datastax.com/resources/datastax-reference-application-killrvideo/