Cassandra Data Modelling
CQL is not SQL
Querying simple tables + CQL TRACE (your new best friend)
C* columns and disk storage
C* column nesting (clustering)
Querying clustering columns
RDBMS data modelling: normalize tables then define queries
C* data modelling: define queries then define denormalized tables
C* data modelling: use-case
Where is my data?
Where is my data?
[Diagram: a ring of eight nodes, each owning one token range: 01->09, 10->19, 20->29, 30->39, 40->49, 50->59, 60->69, 70->79]
CREATE keyspace...
CREATE TABLE users (
username text,
password text,
address text,
PRIMARY KEY(username)
);
username        | password | address
john@site.com   | xxx      | 35 Arthur St
bill@yahoo.com  | xxx      | 21 Jump St
james@gmail.com | xxx      | 18 Smith St
Where is my data?
Each node owns a range of tokens, and the ring ALWAYS forms a complete token range.
hash(primary key) -> token
The token produced always falls between the lower bound and upper bound of the complete token range (0->79)*
*It doesn't matter whether the PK is a string, int, float, GUID or blob: the token always falls within the token range.
The hash output looks random, so the token for hash(john@site.com) could be any number between 0 and 79, but the same key will always produce the same token.
-> consistent hashing
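The key-to-token step can be sketched in a few lines of Python. This is a toy model under stated assumptions: MD5 reduced modulo 80 stands in for the real partitioner, and the names token_for, owning_range and RANGE_UPPER_BOUNDS are made up for the illustration. The tokens it prints will not match the example values used on the slides.

```python
import hashlib
from bisect import bisect_left

TOKEN_RANGE = 80  # toy ring: tokens 0..79, as on the slide

# Inclusive upper bound of the token range owned by each of the eight nodes.
RANGE_UPPER_BOUNDS = [9, 19, 29, 39, 49, 59, 69, 79]


def token_for(partition_key):
    """Hash a key to a token in 0..79 (MD5 is a stand-in, not Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_RANGE


def owning_range(token):
    """Return the (lower, upper) bounds of the range that contains the token."""
    index = bisect_left(RANGE_UPPER_BOUNDS, token)
    upper = RANGE_UPPER_BOUNDS[index]
    lower = 0 if index == 0 else RANGE_UPPER_BOUNDS[index - 1] + 1
    return lower, upper


for key in ("john@site.com", "bill@yahoo.com", "james@gmail.com"):
    token = token_for(key)
    print(key, "-> token", token, "-> range", owning_range(token))
```

Running it twice gives the same tokens both times, which is the property that lets any node work out where a key lives without asking the others.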
Where is my data?
[Diagram: the ring with each row placed in the range that contains its token: james@gmail.com in 01->09, john@site.com in 20->29, bill@yahoo.com in 70->79]
token = hash(primary key)
eg
hash(john@site.com) = 26
hash(bill@yahoo.com) = 79
hash(james@gmail.com) = 5
What is the difference between:
SQL> SELECT * FROM users;
and
CQL> SELECT * FROM users;
Answer : Where is my data?
CQL is not SQL
Username (PK) | Password | Address
aaaaaaaa      | xxx      | xxx
aaaaaaab      | xxx      | xxx
bbbbbbbb      | xxx      | xxx
ccccccccc     | xxx      | xxx
zzzzzzzzz     | xxx      | xxx
Token range: 40->49
SELECT * FROM users;
1. query node 3
2. query node 8
3. query node 1
4. query node 2
This is called a table scan in C* parlance and it is a performance anti-pattern: the query has to visit node after node, so on a cluster of any size it will time out very quickly.
Proper design means this query is unnecessary.
Test all queries by running a TRACE in DevCenter or at the CQLSH prompt as a sanity check.
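The difference in fan-out between a key lookup and a table scan can be sketched with a toy cluster. This is an illustration only: it ignores replication, numbers the nodes 0 to 7 (a choice made for the sketch), and reuses the example tokens 26, 79 and 5 from the slides. The function names are hypothetical.

```python
# Toy cluster: eight nodes, node n owning tokens n*10 .. n*10+9.
NODE_RANGES = {n: (n * 10, n * 10 + 9) for n in range(8)}

# Example tokens taken from the slides.
EXAMPLE_TOKENS = {"john@site.com": 26, "bill@yahoo.com": 79, "james@gmail.com": 5}

rows_by_node = {n: {} for n in NODE_RANGES}
for username, token in EXAMPLE_TOKENS.items():
    rows_by_node[token // 10][username] = {"password": "xxx"}


def select_by_key(username):
    """SELECT ... WHERE username = ? : the token names the single node to ask."""
    node = EXAMPLE_TOKENS[username] // 10
    return rows_by_node[node].get(username), 1  # (row, nodes contacted)


def select_all():
    """SELECT * FROM users : every node has to be asked in turn."""
    rows, contacted = {}, 0
    for node in sorted(rows_by_node):
        contacted += 1
        rows.update(rows_by_node[node])
    return rows, contacted


print(select_by_key("john@site.com"))
print(select_all()[1], "nodes contacted for the scan")
```

The lookup touches one node however large the cluster is, while the scan grows with the number of nodes.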
Where is my data?
[Diagram: ring of eight nodes, numbered 1 to 8, each owning one of the token ranges 01->09 to 70->79; james@gmail.com sits in 01->09, john@site.com in 20->29 and bill@yahoo.com in 70->79]
C* columns and disk storage
C* is very slow at scanning down a list of partition_keys because they are distributed over many partitions / nodes.
C* is very fast at scanning across the columns for a specific partition_key because they are all on a single partition.
[Diagram: partition_key aaaaaaab with col1 | col2 | ... | col16 laid out in one row; scanning across the columns is the fast scan, scanning down the partition_keys is the slow scan]
OK, I get that the partition_keys are spread out on different nodes and that's why scanning down them is slow, but that doesn't explain why scanning across columns is fast for a specific partition_key (e.g. john@site.com).
It all comes down to how the columns for a specific partition_key are stored on disk: they are laid out contiguously.
[Diagram: on disk, aaaaaaab is followed directly by col1 | col2 | ... | col16]
C* columns and disk storage
efficient disk scan, query and slice semantics
(partition_key:aaaaaaab)
(column=col1, value=xx, timestamp=1357866010549000)
(column=col2, value=xx, timestamp=1357866010549000)
(column=col3, value=xx, timestamp=1357866010549000)
(column=col4, value=xx, timestamp=1357866010549000)
(column=col5, value=xx, timestamp=1357866010549000)
(column=col6, value=xx, timestamp=1357866010549000)
(column=col7, value=xx, timestamp=1357866010549000)
(column=col8, value=xx, timestamp=1357866010549000)
(column=col9, value=xx, timestamp=1357866010549000)
(column=col10, value=xx, timestamp=1357866010549000)
(column=col11, value=xx, timestamp=1357866010549000)
(column=col12, value=xx, timestamp=1357866010549000)
(column=col13, value=xx, timestamp=1357866010549000)
(column=col14, value=xx, timestamp=1357866010549000)
(column=col15, value=xx, timestamp=1357866010549000)
(column=col16, value=xx, timestamp=1357866010549000)
(partition_key:xxxxxxxx)
(column=col1, value=xx, timestamp=1357866010543000)
(column=col2, value=xx, timestamp=1357866010543000)
(column=col3, value=xx, timestamp=1357866010543000)
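Why a sorted, contiguous layout makes slices cheap can be sketched as follows. This is a simplified in-memory model, not the on-disk format: the partitions dict and slice_columns function are hypothetical, and the column names are zero-padded so that string order matches numeric order.

```python
from bisect import bisect_left, bisect_right

# One partition: cells kept sorted by column name, as in the listing above.
partitions = {
    "aaaaaaab": [(f"col{i:02d}", "xx", 1357866010549000) for i in range(1, 17)],
}


def slice_columns(partition_key, first, last):
    """Return the cells whose column name lies in [first, last] for one partition."""
    cells = partitions[partition_key]        # one lookup finds the partition
    names = [name for name, _, _ in cells]
    start = bisect_left(names, first)        # binary search to the first cell
    end = bisect_right(names, last)
    return cells[start:end]                  # then one contiguous read


result = slice_columns("aaaaaaab", "col03", "col06")
print([name for name, _, _ in result])
```

Because the cells are already in order, a slice is one seek plus a sequential read rather than a search through unrelated data.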
C* column nesting (clustering)
CREATE TABLE sessions (
username text,
session_id text,
url text,
time_spent int,
browser text,
PRIMARY KEY(username, session_id)
);
PRIMARY KEY(<partition_key>, <clustering column>)
john@site.com
session2
url xxx
time_spent xxx
browser xxx
session3 ...
SELECT * FROM sessions WHERE username='john@site.com' AND session_id='session2';
<partition_key>: dictates what node the data is stored on
<clustering column>: dictates how data is stored and sorted under the partition key
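[Diagram: ring of three nodes (1, 2, 3) owning the token ranges 01->29, 30->59 and 60->89; each partition listed here lives on one node]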
john@site.com
session1
url xxx
time_spent xxx
browser xxx
session2 ...
bill@yahoo.com
session1
url xxx
time_spent xxx
browser xxx
session2 ...
james@gmail.com
session1
url xxx
time_spent xxx
browser xxx
session2 ...
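The nesting shown in the diagram can be mimicked with a dictionary of dictionaries. This is a rough mental model only, assuming equality lookups; the helper names and the sample url, time_spent and browser values are made up for the sketch.

```python
# partition key -> clustering column value -> remaining columns
sessions = {}


def insert_session(username, session_id, url, time_spent, browser):
    """Store a row under its partition key, keyed by its clustering column."""
    rows = sessions.setdefault(username, {})
    rows[session_id] = {"url": url, "time_spent": time_spent, "browser": browser}


def select_sessions(username, session_id=None):
    """WHERE username = ? [AND session_id = ?], rows returned in clustering order."""
    rows = sessions.get(username, {})
    if session_id is not None:
        return [(session_id, rows[session_id])] if session_id in rows else []
    return sorted(rows.items(), key=lambda item: item[0])  # sorted by session_id


# Hypothetical sample rows, inserted out of order on purpose.
insert_session("john@site.com", "session2", "/b", 12, "firefox")
insert_session("john@site.com", "session1", "/a", 30, "chrome")
print([sid for sid, _ in select_sessions("john@site.com")])
```

The partition key picks the outer entry (and so the node); the clustering column only orders and addresses rows inside it.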
C* column nesting (clustering) - queries
CREATE TABLE sessions (
username text,
session_id text,
url text,
time_spent int,
browser text,
PRIMARY KEY(username, session_id)
);
GOOD:
SELECT * FROM sessions WHERE username='john@site.com';
SELECT * FROM sessions WHERE username='john@site.com' AND session_id='session2';
WRONG:
SELECT * FROM sessions WHERE session_id='session2';
RULE: The partition_key, and every clustering column that comes before the most granular clustering column you use, must be present in the query.
C* column nesting (clustering) - queries
CREATE TABLE timeline (
day text,
hour int,
min int,
sec int,
reading text,
PRIMARY KEY (day, hour, min, sec)
);
PRIMARY KEY(<partition_key>, <cl. column>, <cl. column>, <cl. column>)
day1
hour1
min1
sec1
reading
sec2
reading
day 2 ...
SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1 AND sec=sec1;
C* column nesting (clustering) - queries
CREATE TABLE timeline (
day text,
hour int,
min int,
sec int,
reading text,
PRIMARY KEY (day, hour, min, sec)
);
GOOD:
SELECT * FROM timeline WHERE day=day1;
SELECT * FROM timeline WHERE day=day1 AND hour=hour1;
SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1;
SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1 AND sec=sec1;
WRONG:
SELECT * FROM timeline WHERE day=day1 AND min=min1;
RULE: Clustering columns must be present in the query in the same order as the PRIMARY KEY
CAREFUL: be aware how much data you are returning !!
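The GOOD and WRONG cases above follow a prefix rule that can be checked mechanically. The sketch below is a simplification that assumes equality restrictions only (no ranges, IN, secondary indexes or ALLOW FILTERING); is_valid_where is a hypothetical helper, not part of any driver.

```python
def is_valid_where(partition_key, clustering_columns, restricted):
    """Check equality restrictions against PRIMARY KEY(partition_key, *clustering_columns).

    Simplified rule from the slides: the partition key must be restricted, and the
    restricted clustering columns must form an unbroken prefix of the declared order.
    """
    restricted = set(restricted)
    if partition_key not in restricted:
        return False
    remaining = restricted - {partition_key}
    if not remaining <= set(clustering_columns):
        return False  # restriction on a non-key column
    prefix_length = len(remaining)
    return set(clustering_columns[:prefix_length]) == remaining


TIMELINE = ("day", ["hour", "min", "sec"])

print(is_valid_where(*TIMELINE, ["day"]))                  # True
print(is_valid_where(*TIMELINE, ["day", "hour", "min"]))   # True
print(is_valid_where(*TIMELINE, ["day", "min"]))           # False: hour is skipped
print(is_valid_where(*TIMELINE, ["hour"]))                 # False: no partition key
```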
C* column nesting (clustering) - queries
Notes and limitations
(partition_key:aaaaaaab)
(column=col1, value=xx, timestamp=1357866010549000)
(column=col2, value=xx, timestamp=1357866010549000)
(column=col3, value=xx, timestamp=1357866010549000)
(column=col4, value=xx, timestamp=1357866010549000)
(column=col5, value=xx, timestamp=1357866010549000)
(column=col6, value=xx, timestamp=1357866010549000)
(column=col7, value=xx, timestamp=1357866010549000)
(column=col8, value=xx, timestamp=1357866010549000)
(column=col9, value=xx, timestamp=1357866010549000)
(column=col10, value=xx, timestamp=1357866010549000)
(column=col11, value=xx, timestamp=1357866010549000)
(column=col12, value=xx, timestamp=1357866010549000)
(column=col13, value=xx, timestamp=1357866010549000)
(column=col14, value=xx, timestamp=1357866010549000)
(column=col15, value=xx, timestamp=1357866010549000)
(column=col16, value=xx, timestamp=1357866010549000)
RULE: Always design your tables so that the amount of data stored under a single partition_key stays within the in_memory_compaction_limit_in_mb that is set in cassandra.yaml (default 64 MB).
Why? Compaction (which we will cover later) needs to be able to process a complete partition_key and all its underlying data in memory; swapping to and from disk introduces serious performance degradation and poor JVM GC behaviour.
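A back-of-the-envelope check against the 64 MB figure might look like this. Everything except the limit is an assumption for illustration: the 15-byte per-cell overhead is a round figure, and the one-reading-per-second timeline with 20-byte names and 50-byte values is hypothetical.

```python
LIMIT_BYTES = 64 * 1024 * 1024  # the 64 MB default quoted above


def estimated_partition_bytes(cells_per_partition, avg_name_bytes, avg_value_bytes,
                              per_cell_overhead_bytes=15):
    """Rough size: each cell stores its name, its value and some fixed overhead.

    The 15-byte overhead (timestamp, flags, lengths) is an assumed round figure.
    """
    per_cell = avg_name_bytes + avg_value_bytes + per_cell_overhead_bytes
    return cells_per_partition * per_cell


# Hypothetical timeline table: one reading per second, one partition per day.
cells_per_day = 24 * 60 * 60  # 86,400 readings
size = estimated_partition_bytes(cells_per_day, avg_name_bytes=20, avg_value_bytes=50)
print(size, "bytes;", "within limit" if size <= LIMIT_BYTES else "over limit")
```

If the estimate comes out over the limit, the usual remedy is a finer-grained partition key (for example day plus hour instead of day alone).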
Where is my data?
CQL TRACE will show you where your data is,
how costly it is to get it in terms of time and
how many node hops it is going to take to get it.
Sane queries on simple tables + introducing indexes
CREATE TABLE users (
username text,
password text,
address text,
age int,
PRIMARY KEY(username)
);
SELECT * FROM users WHERE username='john@site.com';
SELECT address, age FROM users WHERE username='bill@yahoo.com';
CREATE INDEX age_key ON users(age);
SELECT * FROM users WHERE age=35;
CAREFUL: think about how much data you are returning and where that data is coming from. If you don't know, or can't work it out, run a TRACE at the CQLSH console or run the query under DevCenter 1.3...
CQL TRACE - your new best friend
TRACE provides a description of each step it takes to satisfy the request, the names of the nodes that are affected, the time for each step, and the total time for the request. TRACE is the most powerful tool in a data modeller's hands. The example below is the trace of an INSERT.
activity | timestamp | source | source_elapsed (microseconds)
-------------------------------------+--------------+-----------+----------------
execute_cql3_query | 16:41:00,754 | 127.0.0.1 | 0
Parsing statement | 16:41:00,754 | 127.0.0.1 | 48
Preparing statement | 16:41:00,755 | 127.0.0.1 | 658
Determining replicas for mutation | 16:41:00,755 | 127.0.0.1 | 979
Message received from /127.0.0.1 | 16:41:00,756 | 127.0.0.3 | 37
Acquiring switchLock read lock | 16:41:00,756 | 127.0.0.1 | 1848
Sending message to /127.0.0.3 | 16:41:00,756 | 127.0.0.1 | 1853
Appending to commitlog | 16:41:00,756 | 127.0.0.1 | 1891
Sending message to /127.0.0.2 | 16:41:00,756 | 127.0.0.1 | 1911
Adding to emp memtable | 16:41:00,756 | 127.0.0.1 | 1997
Acquiring switchLock read lock | 16:41:00,757 | 127.0.0.3 | 395
Message received from /127.0.0.1 | 16:41:00,757 | 127.0.0.2 | 42
Appending to commitlog | 16:41:00,757 | 127.0.0.3 | 432
Acquiring switchLock read lock | 16:41:00,757 | 127.0.0.2 | 168
Adding to emp memtable | 16:41:00,757 | 127.0.0.3 | 522
Appending to commitlog | 16:41:00,757 | 127.0.0.2 | 211
Adding to emp memtable | 16:41:00,757 | 127.0.0.2 | 359
Enqueuing response to /127.0.0.1 | 16:41:00,758 | 127.0.0.3 | 1282
Enqueuing response to /127.0.0.1 | 16:41:00,758 | 127.0.0.2 | 1024
Sending message to /127.0.0.1 | 16:41:00,758 | 127.0.0.3 | 1469
Sending message to /127.0.0.1 | 16:41:00,758 | 127.0.0.2 | 1179
Message received from /127.0.0.2 | 16:41:00,765 | 127.0.0.1 | 10966
Message received from /127.0.0.3 | 16:41:00,765 | 127.0.0.1 | 10966
Processing response from /127.0.0.2 | 16:41:00,765 | 127.0.0.1 | 11063
Processing response from /127.0.0.3 | 16:41:00,765 | 127.0.0.1 | 11066
Request complete | 16:41:00,765 | 127.0.0.1 | 11139
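A trace like the one above is easier to read once it is grouped by node. The sketch below parses a few rows excerpted from that trace and reports, per source node, the largest source_elapsed value seen. The summarize_trace helper is hypothetical and assumes the pipe-delimited layout shown here.

```python
# A few rows excerpted from the trace above: activity | timestamp | source | source_elapsed
TRACE_TEXT = """\
execute_cql3_query | 16:41:00,754 | 127.0.0.1 | 0
Determining replicas for mutation | 16:41:00,755 | 127.0.0.1 | 979
Message received from /127.0.0.1 | 16:41:00,756 | 127.0.0.3 | 37
Appending to commitlog | 16:41:00,757 | 127.0.0.3 | 432
Message received from /127.0.0.1 | 16:41:00,757 | 127.0.0.2 | 42
Adding to emp memtable | 16:41:00,757 | 127.0.0.2 | 359
Request complete | 16:41:00,765 | 127.0.0.1 | 11139
"""


def summarize_trace(text):
    """Return {source node: largest source_elapsed seen, in microseconds}."""
    slowest = {}
    for line in text.strip().splitlines():
        _activity, _timestamp, source, elapsed = (part.strip() for part in line.split("|"))
        slowest[source] = max(slowest.get(source, 0), int(elapsed))
    return slowest


summary = summarize_trace(TRACE_TEXT)
print(len(summary), "nodes involved:", summary)
```

The number of distinct sources is the number of nodes the request touched, which is the first thing to look at when a query is slower than expected.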
CQL TRACE - how do I invoke it?
Option 1: CQLSH
All C* installs come with a command-line client called cqlsh. You can run any CQL command against a Cassandra cluster using cqlsh. To invoke TRACE:
cqlsh> TRACING ON;
cqlsh> SELECT * FROM mytable WHERE id=1;
After running the query, cqlsh returns both the query results and the TRACE results.
Option 2: DevCenter 1.3+
For the GUI inclined (like me), DevCenter automatically runs a TRACE on every query in a tab behind the execution/results screen; you can see the formatted results there.
RDBMS data modelling
1. Design normalised tables
2. Define SQL queries
3. Build consuming application
JOINS -> normalize tables -> queries last
Cassandra data modelling
1. Define CQL queries
2. Design de-normalised tables for each query
3. Build consuming application
no JOINS -> queries first -> then denormalize tables
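The "queries first, then denormalize" workflow means the application writes the same fact once per query table. A minimal in-memory sketch, with dictionaries standing in for tables; the add_album helper and the sample album are made up for the illustration.

```python
from collections import defaultdict

# One table per query, keyed by what that query filters on.
albums_by_performer = defaultdict(list)   # query: albums for a performer
albums_by_genre = defaultdict(list)       # query: albums for a genre


def add_album(title, year, performer, genre):
    """Write the same album once per query table (denormalization, no JOINs)."""
    album = {"title": title, "year": year, "performer": performer, "genre": genre}
    albums_by_performer[performer].append(album)
    albums_by_genre[genre].append(album)


# Hypothetical sample data.
add_album("title1", 2014, "Blondie", "rock")
add_album("title2", 1999, "Blondie", "pop")

print([a["title"] for a in albums_by_performer["Blondie"]])
print([a["title"] for a in albums_by_genre["rock"]])
```

Each read then hits exactly one key in exactly one table; the cost is paid at write time, which is the trade-off the slide summarises.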
Data modelling use-case #1 - Music data
Data modelling use-case #1 - Music data
Q1
CREATE TABLE performers_by_style (
style TEXT,
name TEXT,
PRIMARY KEY(style, name)
)
WITH CLUSTERING ORDER BY (name ASC);
(partition_key:style1)
(column=name1:, value=, timestamp=1357866010549000)
(column=name2:, value=, timestamp=1357866010549000)
(column=name3:, value=, timestamp=1357866010549000)
(partition_key:style2)
(column=name4:, value=, timestamp=1357866010549000)
(column=name5:, value=, timestamp=1357866010549000)
(column=name6:, value=, timestamp=1357866010549000)
SELECT * FROM performers_by_style WHERE style='rock';
Data modelling use-case #1 - Music data
Q2
CREATE TABLE performer (
name TEXT,
type TEXT,
country TEXT,
style LIST<TEXT>,
founded INT,
born INT,
died TEXT,
PRIMARY KEY (name)
);
SELECT * FROM performer WHERE name='someName';
(partition_key:someName)
(column=type, value=, timestamp=1357866010549000)
...
Data modelling use-case #1 - Music data
Q3
CREATE TABLE album (
title TEXT,
year INT,
performer TEXT,
genre TEXT,
tracks map<INT,TEXT>,
PRIMARY KEY((title,year))
);
(partition_key:myTitle:2014)
(column=performer, value=Blondie, timestamp=1357866010549000)
(column=genre, value=rock, timestamp=1357866010549000)
(column=tracks, value={1:track1, 2:track2, 3:track3}, timestamp=1357866010549000)
(partition_key:title56:1999)
...
SELECT * FROM album WHERE title='myTitle' AND year=2014;
Data modelling use-case #1 - Music data
Q4
CREATE TABLE albums_by_performer (
performer TEXT,
year INT,
title TEXT,
genre TEXT,
PRIMARY KEY(performer, year, title)
)
WITH CLUSTERING ORDER BY (year DESC, title ASC);
SELECT * FROM albums_by_performer WHERE performer='myPerformer';
(partition_key:myPerformer)
(column=year1:, value=, timestamp=1357866010549000)
(column=title1:, value=, timestamp=1357866010549000)
(column=genre, value=rock, timestamp=1357866010549000)
(partition_key:performer2)
Data modelling use-case #1 - Music data
Q5
CREATE TABLE albums_by_genre (
genre TEXT,
performer TEXT,
year INT,
title TEXT,
PRIMARY KEY(genre, performer, year, title)
)
WITH CLUSTERING ORDER BY (performer ASC, year DESC, title ASC);
(partition_key:myGenre)
(column=performer1, value=, timestamp=1357866010549000)
(column=year1, value=, timestamp=1357866010549000)
(column=title1, value=, timestamp=1357866010549000)
(partition_key:genre2)
SELECT * FROM albums_by_genre WHERE genre='myGenre';
Data modelling use-case #1 - Music data
Q6
CREATE TABLE albums_by_track (
track TEXT,
performer TEXT,
year INT,
title TEXT,
PRIMARY KEY(track, performer, year, title)
)
WITH CLUSTERING ORDER BY (performer ASC, year DESC, title ASC);
(partition_key:myTrack)
(column=performer1, value=, timestamp=1357866010549000)
(column=year1:, value=, timestamp=1357866010549000)
(column=title1, value=, timestamp=1357866010549000)
(partition_key:track2)
SELECT * FROM albums_by_track WHERE track='myTrack';
Data modelling use-case #1 - Music data
Q7
CREATE TABLE tracks_by_album (
album TEXT,
year INT,
number INT,
performer TEXT,
genre TEXT,
title TEXT,
PRIMARY KEY((album, year), number)
)
WITH CLUSTERING ORDER BY (number ASC);
(partition_key:myAlbum:2014)
(column=number1, value=, timestamp=1357866010549000)
(column=performer, value=performer1, timestamp=1357866010549000)
(column=genre, value=genre1, timestamp=1357866010549000)
(column=title, value=title1, timestamp=1357866010549000)
(column=number2, value=, timestamp=1357866010549000)
(column=performer, value=performer1, timestamp=1357866010549000)
(column=genre, value=genre1, timestamp=1357866010549000)
(column=title, value=title1, timestamp=1357866010549000)
SELECT title, year FROM tracks_by_album WHERE album='myAlbum' AND year=2014;
Data modelling use-case #1 - Music data
Cassandra is not an RDBMS. Cassandra is vastly more powerful than any RDBMS in existence, with the proven ability to run 1000+ node clusters in production.
But as a Cassandra data modeller you need to *think different*: you need to think distributed and denormalized, and ultimately you need to ask the question:
“Where is my data?”

More Related Content

What's hot

The Ring programming language version 1.5.2 book - Part 45 of 181
The Ring programming language version 1.5.2 book - Part 45 of 181The Ring programming language version 1.5.2 book - Part 45 of 181
The Ring programming language version 1.5.2 book - Part 45 of 181
Mahmoud Samir Fayed
 
Tracking Data Updates in Real-time with Change Data Capture
Tracking Data Updates in Real-time with Change Data CaptureTracking Data Updates in Real-time with Change Data Capture
Tracking Data Updates in Real-time with Change Data Capture
ScyllaDB
 
MySQL 5.7 NF – JSON Datatype 활용
MySQL 5.7 NF – JSON Datatype 활용MySQL 5.7 NF – JSON Datatype 활용
MySQL 5.7 NF – JSON Datatype 활용
I Goo Lee
 
MySQL Functions
MySQL FunctionsMySQL Functions
MySQL Functions
Compare Infobase Limited
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
Altinity Ltd
 
Cassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUGCassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUG
Duyhai Doan
 
Enter The Matrix
Enter The MatrixEnter The Matrix
Enter The Matrix
Mike Anderson
 
Intro to my sql
Intro to my sqlIntro to my sql
Intro to my sql
MusTufa Nullwala
 
Introduction To MySQL Lecture 1
Introduction To MySQL Lecture 1Introduction To MySQL Lecture 1
Introduction To MySQL Lecture 1
Ajay Khatri
 
Quick reference for Grafana
Quick reference for GrafanaQuick reference for Grafana
Quick reference for Grafana
Rajkumar Asohan, PMP
 
04 Reports
04 Reports04 Reports
04 Reports
Hadley Wickham
 
Intro to my sql
Intro to my sqlIntro to my sql
Intro to my sql
Alamgir Bhuyan
 
Mysql devops-to
Mysql devops-toMysql devops-to
Mysql devops-to
lxfontes
 
Rsm notes f14
Rsm notes f14Rsm notes f14
Rsm notes f14
Tam Minh Le
 
Lecture2 mysql by okello erick
Lecture2 mysql by okello erickLecture2 mysql by okello erick
Lecture2 mysql by okello erick
okelloerick
 

What's hot (16)

The Ring programming language version 1.5.2 book - Part 45 of 181
The Ring programming language version 1.5.2 book - Part 45 of 181The Ring programming language version 1.5.2 book - Part 45 of 181
The Ring programming language version 1.5.2 book - Part 45 of 181
 
Tracking Data Updates in Real-time with Change Data Capture
Tracking Data Updates in Real-time with Change Data CaptureTracking Data Updates in Real-time with Change Data Capture
Tracking Data Updates in Real-time with Change Data Capture
 
MySQL 5.7 NF – JSON Datatype 활용
MySQL 5.7 NF – JSON Datatype 활용MySQL 5.7 NF – JSON Datatype 활용
MySQL 5.7 NF – JSON Datatype 활용
 
MySQL Functions
MySQL FunctionsMySQL Functions
MySQL Functions
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
ClickHouse Unleashed 2020: Our Favorite New Features for Your Analytical Appl...
 
Cassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUGCassandra introduction @ ParisJUG
Cassandra introduction @ ParisJUG
 
Enter The Matrix
Enter The MatrixEnter The Matrix
Enter The Matrix
 
Intro to my sql
Intro to my sqlIntro to my sql
Intro to my sql
 
Introduction To MySQL Lecture 1
Introduction To MySQL Lecture 1Introduction To MySQL Lecture 1
Introduction To MySQL Lecture 1
 
Quick reference for Grafana
Quick reference for GrafanaQuick reference for Grafana
Quick reference for Grafana
 
04 Reports
04 Reports04 Reports
04 Reports
 
Intro to my sql
Intro to my sqlIntro to my sql
Intro to my sql
 
Mysql devops-to
Mysql devops-toMysql devops-to
Mysql devops-to
 
Rsm notes f14
Rsm notes f14Rsm notes f14
Rsm notes f14
 
Lecture2 mysql by okello erick
Lecture2 mysql by okello erickLecture2 mysql by okello erick
Lecture2 mysql by okello erick
 

Similar to Apache Cassandra - Data modelling

Dun ddd
Dun dddDun ddd
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
StampedeCon
 
1 Dundee - Cassandra 101
1 Dundee - Cassandra 1011 Dundee - Cassandra 101
1 Dundee - Cassandra 101
Christopher Batey
 
Harder Faster Stronger
Harder Faster StrongerHarder Faster Stronger
Harder Faster Stronger
snyff
 
DEF CON 27 -OMER GULL - select code execution from using sq lite
DEF CON 27 -OMER GULL - select code execution from using sq liteDEF CON 27 -OMER GULL - select code execution from using sq lite
DEF CON 27 -OMER GULL - select code execution from using sq lite
Felipe Prado
 
Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3
Eric Evans
 
Meetup cassandra for_java_cql
Meetup cassandra for_java_cqlMeetup cassandra for_java_cql
Meetup cassandra for_java_cql
zznate
 
SQLQueries
SQLQueriesSQLQueries
SQLQueries
karunakar81987
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
HighLoad2009
 
Cassandra Summit 2013 Keynote
Cassandra Summit 2013 KeynoteCassandra Summit 2013 Keynote
Cassandra Summit 2013 Keynote
jbellis
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
cookie1969
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
Patrick McFadin
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
DataStax Academy
 
Postgres index types
Postgres index typesPostgres index types
Postgres index types
Louise Grandjonc
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
DataStax
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
aaronmorton
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
maclean liu
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
it-people
 
SQL and PLSQL features for APEX Developers
SQL and PLSQL features for APEX DevelopersSQL and PLSQL features for APEX Developers
SQL and PLSQL features for APEX Developers
Connor McDonald
 

Similar to Apache Cassandra - Data modelling (20)

Dun ddd
Dun dddDun ddd
Dun ddd
 
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ...
 
1 Dundee - Cassandra 101
1 Dundee - Cassandra 1011 Dundee - Cassandra 101
1 Dundee - Cassandra 101
 
Harder Faster Stronger
Harder Faster StrongerHarder Faster Stronger
Harder Faster Stronger
 
DEF CON 27 -OMER GULL - select code execution from using sq lite
DEF CON 27 -OMER GULL - select code execution from using sq liteDEF CON 27 -OMER GULL - select code execution from using sq lite
DEF CON 27 -OMER GULL - select code execution from using sq lite
 
Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3Cassandra By Example: Data Modelling with CQL3
Cassandra By Example: Data Modelling with CQL3
 
Meetup cassandra for_java_cql
Meetup cassandra for_java_cqlMeetup cassandra for_java_cql
Meetup cassandra for_java_cql
 
SQLQueries
SQLQueriesSQLQueries
SQLQueries
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
 
Cassandra Summit 2013 Keynote
Cassandra Summit 2013 KeynoteCassandra Summit 2013 Keynote
Cassandra Summit 2013 Keynote
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax EnterpriseA Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
 
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent...
 
Postgres index types
Postgres index typesPostgres index types
Postgres index types
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
 
SQL and PLSQL features for APEX Developers
SQL and PLSQL features for APEX DevelopersSQL and PLSQL features for APEX Developers
SQL and PLSQL features for APEX Developers
 

More from Alex Thompson

The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
Alex Thompson
 
Apache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveApache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep dive
Alex Thompson
 
Apache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoringApache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoring
Alex Thompson
 
Deconstructing Apache Cassandra
Deconstructing Apache CassandraDeconstructing Apache Cassandra
Deconstructing Apache Cassandra
Alex Thompson
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
Alex Thompson
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
Alex Thompson
 

More from Alex Thompson (6)

The Apache Cassandra ecosystem
The Apache Cassandra ecosystemThe Apache Cassandra ecosystem
The Apache Cassandra ecosystem
 
Apache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep diveApache Cassandra - Drivers deep dive
Apache Cassandra - Drivers deep dive
 
Apache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoringApache Cassandra - Diagnostics and monitoring
Apache Cassandra - Diagnostics and monitoring
 
Deconstructing Apache Cassandra
Deconstructing Apache CassandraDeconstructing Apache Cassandra
Deconstructing Apache Cassandra
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
 


Apache Cassandra - Data modelling

  • 1. Cassandra Data Modelling CQL is not SQL Querying simple tables + CQL TRACE (your new best friend) C* columns and disk storage C* column nesting (clustering) Querying clustering columns RDBMS data modelling: normalize tables then define queries C* data modelling: define queries then define denormalized tables C* data modelling: use-case
  • 2. Where is my data?
  • 3. Where is my data? Token ranges owned by the nodes on the ring: 01->09, 10->19, 20->29, 30->39, 40->49, 50->59, 60->69, 70->79. CREATE keyspace... CREATE TABLE users ( username text, password text, address text, PRIMARY KEY(username) ); Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St
  • 4. Where is my data? Token ranges owned by the nodes on the ring: 01->09, 10->19, 20->29, 30->39, 40->49, 50->59, 60->69, 70->79. Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St. Each node owns a range of tokens, and the ring ALWAYS forms a complete token range. hash(primary key) -> token. The token produced always falls between the lower bound and the upper bound of the complete token range (0->79)*. *It doesn't matter if the PK is a string, int, float, GUID or blob: the token always falls within the token range. The hash output looks random, so hash(john@site.com) could be any number between 0 and 79, but the same key will always produce the same number -> consistent hashing
  • 5. Where is my data? Key placement on the ring: 01->09 james@gmail.com; 20->29 john@site.com; 70->79 bill@yahoo.com; the ranges 10->19, 30->39, 40->49, 50->59 and 60->69 hold none of these keys. token = hash(primary key), e.g. hash(john@site.com) = 26, hash(bill@yahoo.com) = 79, hash(james@gmail.com) = 5. Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St
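The key-to-range lookup on this slide can be sketched in a few lines of Python. This is only a toy model of the slide's 0->79 ring: `toy_token`, `owning_range` and the use of MD5 are illustrative assumptions, not Cassandra's actual partitioner, and the first range is treated as 0->9 so that the ring covers every token.

```python
import bisect
import hashlib

# Upper bound of each token range on the slide's toy ring (tokens 0..79).
RANGE_UPPER_BOUNDS = [9, 19, 29, 39, 49, 59, 69, 79]
RING_SIZE = 80


def toy_token(partition_key: str) -> int:
    """Map a key to a token in 0..79.

    MD5 is only a stand-in here (an assumption for illustration); the point
    is that the same key always yields the same token.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owning_range(token: int) -> tuple:
    """Return the (lower, upper) token range that owns this token."""
    if not 0 <= token < RING_SIZE:
        raise ValueError("token outside the ring")
    # First range whose upper bound is >= token.
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, token)
    upper = RANGE_UPPER_BOUNDS[index]
    return (upper - 9, upper)


# The token values quoted on the slide.
slide_tokens = {"john@site.com": 26, "bill@yahoo.com": 79, "james@gmail.com": 5}
placement = {key: owning_range(token) for key, token in slide_tokens.items()}
```

With the slide's tokens, john@site.com lands in 20->29, bill@yahoo.com in 70->79 and james@gmail.com in the first range, which is why a query for one key only needs to reach the node that owns that range.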
  • 6. What is the difference between: SQL> SELECT * FROM users; and CQL> SELECT * FROM users; Answer : Where is my data? CQL is not SQL
  • 7. SELECT * FROM users; 1. query node 3. Where is my data? (Ring diagram: nodes 8 2 5 7 3 1 6 4; 01->09 james@gmail.com, 10->19, 20->29 john@site.com, 30->39, 40->49, 50->59, 60->69, 70->79 bill@yahoo.com.) Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St
  • 8. SELECT * FROM users; 1. query node 3 2. query node 8. Where is my data? (Ring diagram: nodes 8 2 5 7 3 1 6 4; 01->09 james@gmail.com, 10->19, 20->29 john@site.com, 30->39, 40->49, 50->59, 60->69, 70->79 bill@yahoo.com.) Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St
  • 9. SELECT * FROM users; 1. query node 3 2. query node 8 3. query node 1. Where is my data? (Ring diagram: nodes 8 2 5 7 3 1 6 4; 01->09 james@gmail.com, 10->19, 20->29 john@site.com, 30->39, 40->49, 50->59, 60->69, 70->79 bill@yahoo.com.) Rows (username | password | address): john@site.com | xxx | 35 Arthur St; bill@yahoo.com | xxx | 21 Jump St; james@gmail.com | xxx | 18 Smith St
  • 10. SELECT * FROM users; 1. query node 3 2. query node 8 3. query node 1 4. query node 2. Sample rows (Username (PK) | Password | Address), shown for token range 40->49: aaaaaaaa | xxx | xxx; aaaaaaab | xxx | xxx; bbbbbbbb | xxx | xxx; ccccccccc | xxx | xxx; zzzzzzzzz | xxx | xxx. This is called a table scan in C* parlance and is a performance anti-pattern: you can see that the query will time out very quickly. Proper design means this query is unnecessary. Test all queries by running a TRACE in DevCenter or at the CQLSH prompt as a sanity check. Where is my data? (Ring diagram: nodes 8 2 5 7 3 1 6 4; 01->09 james@gmail.com, 10->19, 20->29 john@site.com, 30->39, 40->49, 50->59, 60->69, 70->79 bill@yahoo.com.)
  • 11. C* columns and disk storage. C* is very slow at scanning down a list of partition_keys because they are distributed over many partitions / nodes. C* is very fast at scanning across columns for a specific partition_key because they are on a single partition. (Diagram: partition_key aaaaaaab -> col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 | col13 | col14 | col15 | col16…; scanning across the columns is the fast scan, scanning down the partition_keys is the slow scan.) OK, I get that the partition_keys are spread out on different nodes and that's why scanning down them is slow, but that doesn't explain why scanning across columns is fast for a specific partition_key (e.g. john@site.com). It all comes down to how the columns for a specific partition_key are stored on disk.
  • 12. aaaaaaab col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 | col11 | col12 | col13 | col14 | col15 | col16….aaaaaaab disk C* columns and disk storage efficient disk scan, query and slice semantics (partition_key:aaaaaaab) (column=col1, value=xx, timestamp=1357866010549000) (column=col2, value=xx, timestamp=1357866010549000) (column=col3, value=xx, timestamp=1357866010549000) (column=col4, value=xx, timestamp=1357866010549000) (column=col5, value=xx, timestamp=1357866010549000) (column=col6, value=xx, timestamp=1357866010549000) (column=col7, value=xx, timestamp=1357866010549000) (column=col8, value=xx, timestamp=1357866010549000) (column=col9, value=xx, timestamp=1357866010549000) (column=col10, value=xx, timestamp=1357866010549000) (column=col11, value=xx, timestamp=1357866010549000) (column=col12, value=xx, timestamp=1357866010549000) (column=col13, value=xx, timestamp=1357866010549000) (column=col14, value=xx, timestamp=1357866010549000) (column=col15, value=xx, timestamp=1357866010549000) (column=col16, value=xx, timestamp=1357866010549000) (partition_key:xxxxxxxx) (column=col1, value=xx, timestamp=1357866010543000) (column=col2, value=xx, timestamp=1357866010543000) (column=col3, value=xx, timestamp=1357866010543000)
  • 13. C* column nesting (clustering) CREATE TABLE sessions ( username text, session_id text, url text, time_spent int, browser text, PRIMARY KEY(username, session_id) ); PRIMARY KEY(<partition_key>, <clustering column>). The partition key dictates what node the data is stored on; the clustering column dictates how data is stored and sorted under the partition key. Layout: john@site.com -> session2 (url xxx, time_spent xxx, browser xxx), session3 ... SELECT * FROM sessions WHERE username='john@site.com' AND session_id='session2';
  • 14. (Ring diagram: nodes 1, 2, 3; token ranges 01->29, 30->59, 60->89.) Partitions: john@site.com -> session1 (url xxx, time_spent xxx, browser xxx), session2 ...; bill@yahoo.com -> session1 (url xxx, time_spent xxx, browser xxx), session2 ...; james@gmail.com -> session1 (url xxx, time_spent xxx, browser xxx), session2 ... CREATE TABLE sessions ( username text, session_id text, url text, time_spent int, browser text, PRIMARY KEY(username, session_id) ); PRIMARY KEY(<partition_key>, <clustering column>)
  • 15. C* column nesting (clustering) - queries CREATE TABLE sessions ( username text, session_id text, url text, time_spent int, browser text, PRIMARY KEY(username, session_id) ); GOOD: SELECT * FROM sessions WHERE username='john@site.com'; SELECT * FROM sessions WHERE username='john@site.com' AND session_id='session2'; WRONG: SELECT * FROM sessions WHERE session_id='session2'; RULE: The partition_key, and every clustering column that comes before the most granular clustering column you filter on, must be present in the query.
  • 16. C* column nesting (clustering) - queries CREATE TABLE timeline ( day text, hour int, min int, sec int, reading text, PRIMARY KEY (day, hour, min, sec) ); PRIMARY KEY(<partition_key>, <cl. column>, <cl. column>, <cl. column>) day1 hour1 min1 sec1 reading sec2 reading day 2 ... SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1 AND sec=sec1;
  • 17. C* column nesting (clustering) - queries CREATE TABLE timeline ( day text, hour int, min int, sec int, value text, PRIMARY KEY (day, hour, min, sec) ); GOOD: SELECT * FROM timeline WHERE day=day1; SELECT * FROM timeline WHERE day=day1 AND hour=hour1; SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1; SELECT * FROM timeline WHERE day=day1 AND hour=hour1 AND min=min1 AND sec=sec1; WRONG: SELECT * FROM timeline WHERE day=day1 AND min=min1; RULE: Clustering columns must be present in the query in the same order as the PRIMARY KEY. CAREFUL: be aware how much data you are returning !!
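The ordering rule for clustering columns can be checked mechanically. The Python sketch below is a simplified model under stated assumptions: only equality restrictions, a single-column partition key, and no secondary indexes or ALLOW FILTERING. The function name `follows_primary_key_order` is hypothetical and not part of any Cassandra API.

```python
from typing import Sequence, Set


def follows_primary_key_order(primary_key: Sequence[str], restricted: Set[str]) -> bool:
    """Return True if the columns restricted in WHERE form a prefix of the PRIMARY KEY.

    primary_key: column names in PRIMARY KEY order, partition key first.
    restricted:  columns that appear with '=' in the WHERE clause.
    """
    # The partition key must be present, and only key columns may be used.
    if not restricted or not restricted <= set(primary_key):
        return False
    # The restricted columns must be exactly the first len(restricted) key columns,
    # i.e. no clustering column may be skipped.
    return set(primary_key[: len(restricted)]) == restricted


TIMELINE_KEY = ["day", "hour", "min", "sec"]

good = follows_primary_key_order(TIMELINE_KEY, {"day", "hour"})
wrong = follows_primary_key_order(TIMELINE_KEY, {"day", "min"})  # skips hour
```

Here `good` is True and `wrong` is False, matching the GOOD and WRONG queries on the timeline table.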
  • 18. C* column nesting (clustering) - queries Notes and limitations (partition_key:aaaaaaab) (column=col1, value=xx, timestamp=1357866010549000) (column=col2, value=xx, timestamp=1357866010549000) (column=col3, value=xx, timestamp=1357866010549000) (column=col4, value=xx, timestamp=1357866010549000) (column=col5, value=xx, timestamp=1357866010549000) (column=col6, value=xx, timestamp=1357866010549000) (column=col7, value=xx, timestamp=1357866010549000) (column=col8, value=xx, timestamp=1357866010549000) (column=col9, value=xx, timestamp=1357866010549000) (column=col10, value=xx, timestamp=1357866010549000) (column=col11, value=xx, timestamp=1357866010549000) (column=col12, value=xx, timestamp=1357866010549000) (column=col13, value=xx, timestamp=1357866010549000) (column=col14, value=xx, timestamp=1357866010549000) (column=col15, value=xx, timestamp=1357866010549000) (column=col16, value=xx, timestamp=1357866010549000) RULE: Always design your tables so that you limit the amount of data stored in a single partition_key to the size of the in_memory_compaction_limit_in_mb that is set in cassandra.yaml (default 64mb) Why? Compaction (which we will cover later) needs to be able to process a complete partition_key and all its underlying data in memory, swapping to and from disk introduces serious performance degradation and poor JVM GC behaviour. RULE: Clustering columns must be present in the query in the same order as the PRIMARY KEY CAREFUL: be aware how much data you are returning !!
  • 19. Where is my data? CQL TRACE will show you where your data is, how costly it is to get it in terms of time and how many node hops it is going to take to get it.
  • 20. Sane queries on simple tables + introducing indexes CREATE TABLE users ( username text, password text, address text, age int, PRIMARY KEY(username) ); SELECT * FROM users WHERE username='john@site.com'; SELECT address, age FROM users WHERE username='bill@yahoo.com'; CREATE INDEX age_key ON users(age); SELECT * FROM users WHERE age=35; CAREFUL: think about how much data you are returning and where that data is coming from. If you don't know, or can't work it out, run a TRACE at the CQLSH console or run the query under DevCenter 1.3...
  • 21. CQL TRACE - your new best friend. TRACE provides a description of each step it takes to satisfy the request, the names of nodes that are affected, the time for each step, and the total time for the request. TRACE is the most powerful tool in a data modeller's hands. Example trace (INSERT):
      activity                            | timestamp    | source    | source_elapsed (microseconds)
      ------------------------------------+--------------+-----------+------------------------------
      execute_cql3_query                  | 16:41:00,754 | 127.0.0.1 | 0
      Parsing statement                   | 16:41:00,754 | 127.0.0.1 | 48
      Preparing statement                 | 16:41:00,755 | 127.0.0.1 | 658
      Determining replicas for mutation   | 16:41:00,755 | 127.0.0.1 | 979
      Message received from /127.0.0.1    | 16:41:00,756 | 127.0.0.3 | 37
      Acquiring switchLock read lock      | 16:41:00,756 | 127.0.0.1 | 1848
      Sending message to /127.0.0.3       | 16:41:00,756 | 127.0.0.1 | 1853
      Appending to commitlog              | 16:41:00,756 | 127.0.0.1 | 1891
      Sending message to /127.0.0.2       | 16:41:00,756 | 127.0.0.1 | 1911
      Adding to emp memtable              | 16:41:00,756 | 127.0.0.1 | 1997
      Acquiring switchLock read lock      | 16:41:00,757 | 127.0.0.3 | 395
      Message received from /127.0.0.1    | 16:41:00,757 | 127.0.0.2 | 42
      Appending to commitlog              | 16:41:00,757 | 127.0.0.3 | 432
      Acquiring switchLock read lock      | 16:41:00,757 | 127.0.0.2 | 168
      Adding to emp memtable              | 16:41:00,757 | 127.0.0.3 | 522
      Appending to commitlog              | 16:41:00,757 | 127.0.0.2 | 211
      Adding to emp memtable              | 16:41:00,757 | 127.0.0.2 | 359
      Enqueuing response to /127.0.0.1    | 16:41:00,758 | 127.0.0.3 | 1282
      Enqueuing response to /127.0.0.1    | 16:41:00,758 | 127.0.0.2 | 1024
      Sending message to /127.0.0.1       | 16:41:00,758 | 127.0.0.3 | 1469
      Sending message to /127.0.0.1       | 16:41:00,758 | 127.0.0.2 | 1179
      Message received from /127.0.0.2    | 16:41:00,765 | 127.0.0.1 | 10966
      Message received from /127.0.0.3    | 16:41:00,765 | 127.0.0.1 | 10966
      Processing response from /127.0.0.2 | 16:41:00,765 | 127.0.0.1 | 11063
      Processing response from /127.0.0.3 | 16:41:00,765 | 127.0.0.1 | 11066
      Request complete                    | 16:41:00,765 | 127.0.0.1 | 11139
  • 22. CQL TRACE - how do I invoke it? Option 1: CQLSH. All C* installs come with a command-line client called cqlsh, which can run any CQL command against a Cassandra cluster. To turn tracing on: cqlsh> TRACING ON; cqlsh> SELECT * FROM mytable WHERE id=1; After running the query, cqlsh returns both the query results and the TRACE results. Option 2: DevCenter 1.3+. For the GUI inclined (like me), DevCenter automatically runs a TRACE on every query in a tab behind the execution/results screen; you can see the formatted results there.
  • 23. RDBMS data modelling 1. Design normalised tables. 2. Define SQL queries 3. Build consuming application JOINS -> normalize tables -> queries last
  • 24. Cassandra data modelling 1. Define CQL queries 2. Design de-normalised tables for each query. 3. Build consuming application no JOINS -> queries first -> then denormalize tables
  • 25. Data modelling use-case #1 - Music data
  • 26. Data modelling use-case #1 - Music data Q1 CREATE TABLE performers_by_style ( style TEXT, name TEXT, PRIMARY KEY(style, name) ) WITH CLUSTERING ORDER BY (name ASC); (partition_key:style1) (column=name1:, value=, timestamp=1357866010549000) (column=name2:, value=, timestamp=1357866010549000) (column=name3:, value=, timestamp=1357866010549000) (partition_key:style2) (column=name4:, value=, timestamp=1357866010549000) (column=name5:, value=, timestamp=1357866010549000) (column=name6:, value=, timestamp=1357866010549000) SELECT * FROM performers_by_style WHERE style='rock';
  • 27. Data modelling use-case #1 - Music data Q2 CREATE TABLE performer ( name TEXT, type TEXT, country TEXT, style LIST<TEXT>, founded INT, born INT, died TEXT, PRIMARY KEY (name) ); SELECT * FROM performer WHERE name='someName'; (partition_key:someName) (column=type, value=, timestamp=1357866010549000) ...
  • 28. Data modelling use-case #1 - Music data Q3 CREATE TABLE album ( title TEXT, year INT, performer TEXT, genre TEXT, tracks map<INT,TEXT>, PRIMARY KEY((title,year)) ); (partition_key:myTitle:2014) (column=performer, value=Blondie, timestamp=1357866010549000) (column=genre, value=rock, timestamp=1357866010549000) (column=tracks, value={1:track1, 2:track2, 3:track3}, timestamp=1357866010549000) (partition_key:title56:1999) ... SELECT * FROM album WHERE title='myTitle' AND year=2014;
  • 29. Data modelling use-case #1 - Music data Q4 CREATE TABLE albums_by_performer ( performer TEXT, year INT, title TEXT, genre TEXT, PRIMARY KEY(performer, year, title) ) WITH CLUSTERING ORDER BY (year DESC, title ASC); SELECT * FROM albums_by_performer WHERE performer='myPerformer'; (partition_key:myPerformer) (column=year1:, value=, timestamp=1357866010549000) (column=title1:, value=, timestamp=1357866010549000) (column=genre, value=rock, timestamp=1357866010549000) (partition_key:performer2)
  • 30. Data modelling use-case #1 - Music data Q5 CREATE TABLE albums_by_genre ( genre TEXT, performer TEXT, year INT, title TEXT, PRIMARY KEY(genre, performer, year, title) ) WITH CLUSTERING ORDER BY (performer ASC, year DESC, title ASC); (partition_key:myGenre) (column=performer1, value=, timestamp=1357866010549000) (column=year1, value=, timestamp=1357866010549000) (column=title1, value=, timestamp=1357866010549000) (partition_key:genre2) SELECT * FROM albums_by_genre WHERE genre='myGenre';
  • 31. Data modelling use-case #1 - Music data Q6 CREATE TABLE albums_by_track ( track TEXT, performer TEXT, year INT, title TEXT, PRIMARY KEY(track, performer, year, title) ) WITH CLUSTERING ORDER BY (performer ASC, year DESC, title ASC); (partition_key:myTrack) (column=performer1, value=, timestamp=1357866010549000) (column=year1:, value=, timestamp=1357866010549000) (column=title1, value=, timestamp=1357866010549000) (partition_key:track2) SELECT * FROM albums_by_track WHERE track='myTrack';
  • 32. Data modelling use-case #1 - Music data Q7 CREATE TABLE tracks_by_album ( album TEXT, year INT, number INT, performer TEXT, genre TEXT, title TEXT, PRIMARY KEY((album, year), number) ) WITH CLUSTERING ORDER BY (number ASC); (partition_key:myAlbum:2014) (column=number1, value=, timestamp=1357866010549000) (column=performer, value=performer1, timestamp=1357866010549000) (column=genre, value=genre1, timestamp=1357866010549000) (column=title, value=title1, timestamp=1357866010549000) (column=number2, value=, timestamp=1357866010549000) (column=performer, value=performer1, timestamp=1357866010549000) (column=genre, value=genre1, timestamp=1357866010549000) (column=title, value=title1, timestamp=1357866010549000) SELECT title, year FROM tracks_by_album WHERE album='myAlbum' AND year=2015;
  • 33. Data modelling use-case #1 - Music data
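The "one denormalized table per query" idea from the music use-case can be illustrated with a small Python sketch. It only builds plain dictionaries shaped like rows of the Q4 to Q7 tables; the helper name `denormalize_album` is hypothetical, the sample album reuses the placeholder values from the Q3 slide, and a real application would issue one INSERT per table instead.

```python
from typing import Dict, List


def denormalize_album(album: dict) -> Dict[str, List[dict]]:
    """Fan one album record out into rows for the query tables Q4 to Q7."""
    performer, year = album["performer"], album["year"]
    title, genre = album["title"], album["genre"]
    tracks = sorted(album["tracks"].items())  # (track number, track title) pairs
    return {
        "albums_by_performer": [
            {"performer": performer, "year": year, "title": title, "genre": genre}
        ],
        "albums_by_genre": [
            {"genre": genre, "performer": performer, "year": year, "title": title}
        ],
        # One row per track, so an album can be found by any of its tracks.
        "albums_by_track": [
            {"track": name, "performer": performer, "year": year, "title": title}
            for _, name in tracks
        ],
        # One row per track, clustered by track number under (album, year).
        "tracks_by_album": [
            {"album": title, "year": year, "number": number,
             "performer": performer, "genre": genre, "title": name}
            for number, name in tracks
        ],
    }


# Placeholder values taken from the Q3 slide.
album = {
    "title": "myTitle",
    "year": 2014,
    "performer": "Blondie",
    "genre": "rock",
    "tracks": {1: "track1", 2: "track2", 3: "track3"},
}
rows = denormalize_album(album)
```

The same album data is written several times, once per query it has to serve. That duplication is the price of never needing a JOIN or a table scan at read time.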
  • 34. Cassandra is not an RDBMS. Cassandra is vastly more powerful than any RDBMS in existence, with the proven ability in production to run 1000+ node clusters. But as a Cassandra Data Modeller you need to *think different*: you need to think distributed and denormalized, and ultimately you need to ask the question: “Where is my data?”