CASSANDRA
DATA MODELING
INTRODUCTION
A little bit about me, a little bit more about AddThis
WHY CASSANDRA?
▸ Scalability
▸ Fault Tolerant
▸ Optimized for Big Data
▸ Queryable
QUERYABLE, NOW
WE'RE TALKIN!
STORING DATA IN CASSANDRA
▸ Before you can build a house,
you need to know what tools you have.
▸ Before you can cook a meal,
you need to know what ingredients you have.
▸ Before you can data model with Cassandra,
you need to know what types of data you can store.
BASIC TYPES
CQL Type  | Constants    | Description
--------- | ------------ | -----------
ascii     | strings      | US-ASCII character string
bigint    | integers     | 64-bit signed long
blob      | blobs        | Arbitrary bytes
boolean   | booleans     | true or false
counter   | integers     | Distributed counter value (64-bit long)
decimal   | ints/floats  | Variable-precision decimal
float     | ints/floats  | 32-bit floating point
inet      | strings      | IP address string
int       | integers     | 32-bit signed integer
text      | strings      | UTF-8 encoded string
timestamp | ints/strings | Date plus time
timeuuid  | uuids        | Type 1 UUID only
tuple     | n/a          | A group of 2-3 fields (Cassandra 2.1+)
uuid      | uuids        | A UUID in standard UUID format
varchar   | strings      | UTF-8 encoded string
varint    | integers     | Arbitrary-precision integer
HOW CASSANDRA STORES DATA
TABLE CONSTRUCT
CREATE TABLE users (
user_id uuid,
first_name text,
last_name text,
company text,
PRIMARY KEY (user_id)
);
COLLECTIONS
SETS
emails set<text>
--- insert set into table
INSERT INTO users (user_id, first_name, last_name, emails)
VALUES(uuid(), 'Ben', 'Knear', {'ben@addthis.com', 'ben.kn@gmail.com'});
--- add to set, even if set was never instantiated
--- note, it will re-sort the collection after adding
UPDATE users
SET emails = emails + {'ben.kn@yahoo.com'}
WHERE user_id = X;
--- remove from set
UPDATE users
SET emails = emails - {'ben.kn@yahoo.com'}
WHERE user_id = X;
LISTS
priority_emails list<text>
--- insert list into table
INSERT INTO users (user_id, first_name, last_name, priority_emails)
VALUES(uuid(), 'Ben', 'Knear', ['ben@addthis.com', 'ben.kn@gmail.com']);
--- add to list, even if list was never instantiated
UPDATE users
SET priority_emails = priority_emails + ['ben.kn@yahoo.com']
WHERE user_id = X;
--- remove from list
UPDATE users
SET priority_emails = priority_emails - ['ben.kn@yahoo.com']
WHERE user_id = X;
MAPS
contact_info map<text, text>
--- insert map into table
INSERT INTO users (user_id, first_name, last_name, contact_info)
VALUES(uuid(), 'Ben', 'Knear',
{ 'work_email' : 'ben@addthis.com',
'home_email' : 'ben.kn@gmail.com' });
--- delete from a map
DELETE contact_info['work_email']
FROM users WHERE user_id = X;
MAPS
contact_info map<text, text>
--- add to map, even if map was never instantiated
UPDATE users
SET contact_info['other_email'] = 'ben.kn@yahoo.com'
WHERE user_id = X;
--- remove from map by key (pass a set of keys, not a list of values)
UPDATE users
SET contact_info = contact_info - {'other_email'}
WHERE user_id = X;
JSON OR MAP
▸ Remember all values in a Map must be the same type
▸ What will you do with the value?
▸ Remember: Values of items in collections are limited to 64K.
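To make the trade-off concrete, here is a minimal Python sketch of what a map<text, text> forces on mixed-type values: each one has to be serialized to text (JSON here) and stay under the per-item limit. This is plain Python, not driver code. The helper `to_text_map`, the constant `MAX_COLLECTION_VALUE_BYTES`, and the sample attributes are hypothetical, and the limit is an approximation of the 64K figure above.

```python
import json

# Assumption: the "64K" limit from the slide, approximated as 64 * 1024 bytes.
MAX_COLLECTION_VALUE_BYTES = 64 * 1024


def to_text_map(attributes):
    """Encode mixed-type values as JSON strings so they fit a map<text, text>."""
    encoded = {}
    for key, value in attributes.items():
        text = json.dumps(value)
        if len(text.encode("utf-8")) > MAX_COLLECTION_VALUE_BYTES:
            raise ValueError(f"value for {key!r} exceeds the collection item limit")
        encoded[key] = text
    return encoded


# Hypothetical attributes with three different value types.
contact_info = to_text_map({
    "work_email": "ben@addthis.com",
    "login_count": 3,
    "tags": ["cassandra", "cql"],
})
# Every value is now a string; a reader must json.loads() it to recover the type.
print(contact_info)
```

The cost shows up on the read side: Cassandra only sees opaque text, so any filtering or typing of the values has to happen in the application.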
HOW CASSANDRA STORES DATA
Row Keys, Column Families
| row key | columns |
|---------|----------------------------------------|
| | "first_name" | "last_name" | "company" |
| UUID | 'Ben' | 'Knear' | 'AddThis' |
|---------|----------------------------------------|
CASSANDRA
LIMITS
▸ Maximum number of columns per row is 2 billion.
▸ Maximum size for the name of a column is 64 KB.
▸ Maximum size for a value in a column is 2 GB.
▸ Collection values may not be larger than 64 KB.
APPROACHING
DATA MODEL
You must ask:
▸ What do we want to store?
▸ What are the relationships within the data?
▸ How do we plan to access it?
In a relational model, you would focus most on questions one and
two. But for Cassandra, you must focus most on the third.
Cassandra data modeling starts with how you will use the data.
Denormalization
Optimizing the read performance of a database by adding redundant
data or by grouping data. In Cassandra, this process is accomplished by
duplicating data in multiple tables, grouping data for queries.
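As a rough illustration of "duplicating data in multiple tables", the sketch below uses two Python dictionaries as stand-ins for two Cassandra tables, each keyed for one query. The table names `movies_by_id` and `movies_by_genre`, the helper `save_movie`, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Two in-memory stand-ins for Cassandra tables, each keyed the way one query needs.
movies_by_id = {}
movies_by_genre = defaultdict(dict)


def save_movie(movie_id, name, genre):
    """Write the same movie to both tables; the duplication is deliberate."""
    row = {"id": movie_id, "name": name, "genre": genre}
    movies_by_id[movie_id] = row
    movies_by_genre[genre][movie_id] = row


save_movie("m1", "Example Movie A", "drama")
save_movie("m2", "Example Movie B", "drama")
save_movie("m3", "Example Movie C", "comedy")

# Each read is a single key lookup, with no join and no scan.
print(movies_by_id["m2"]["name"])
print(sorted(movies_by_genre["drama"]))
```

The write path does more work so that each read path stays a direct lookup, which is the trade denormalization makes.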
So how will we get data from
Cassandra?
CASSANDRA
QUERYING
PRIMARY KEY
Defines a unique value to identify the row, and also drives partitioning.
When multiple columns are defined in the primary key, the first column
is the partition key and the rest are clustering columns.
Partitioning is important because all rows that share a partition key are
stored together on the same node, and data grouped this way is read and written faster.
EXAMPLE
CREATE TABLE movies (
id uuid,
name text,
genre text,
cast set<uuid>,
company text,
PRIMARY KEY (id)
);
Good if I only know the ID when I retrieve
COMPOSITE PRIMARY KEY ALTERNATIVE
CREATE TABLE movies (
id uuid,
name text,
genre text,
cast set<uuid>,
company text,
PRIMARY KEY (genre, id)
);
Good if I know the genre and ID when I retrieve, or to get all of a genre
QUERYING EXAMPLES
CREATE TABLE user_addresses (
state text,
city text,
username text,
address text,
PRIMARY KEY (state, city, username)
);
-- insert a value
INSERT INTO user_addresses (state, city, username, address)
VALUES ('VA', 'Vienna', 'AddThis', 'Spring Hill Rd');
-- FAILURES
SELECT * FROM user_addresses WHERE city = 'Vienna';
SELECT * FROM user_addresses
WHERE city = 'Vienna' AND username = 'AddThis';
-- SUCCESSES
SELECT * FROM user_addresses WHERE state = 'VA';
SELECT * FROM user_addresses
WHERE state = 'VA' AND city = 'Vienna';
SELECT * FROM user_addresses
WHERE state = 'VA' AND city = 'Vienna'
AND username = 'AddThis';
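The pattern behind these failures and successes can be sketched as a small Python check. It is a simplification that covers only equality restrictions and ignores secondary indexes and ALLOW FILTERING; the function name `can_query_without_filtering` is hypothetical. The last case in the loop (state plus username, skipping city) is not on the slide and is added for illustration.

```python
def can_query_without_filtering(partition_key, clustering_columns, restricted):
    """Return True if equality restrictions on `restricted` columns need no filtering.

    Simplified rule: every partition key column must be restricted, and the
    restricted clustering columns must form a prefix of the clustering order.
    """
    restricted = set(restricted)
    if not set(partition_key) <= restricted:
        return False
    remaining = restricted - set(partition_key)
    for column in clustering_columns:
        if column in remaining:
            remaining.remove(column)
        else:
            break  # the first unrestricted clustering column ends the prefix
    # Anything left is a non-key column or a clustering column after a gap.
    return not remaining


# user_addresses: PRIMARY KEY (state, city, username)
pk, ck = ["state"], ["city", "username"]
for where in (["city"], ["city", "username"], ["state"],
              ["state", "city"], ["state", "city", "username"],
              ["state", "username"]):
    print(where, can_query_without_filtering(pk, ck, where))
```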
ERROR
Filtering by just clustering columns will give you this response:
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want to
execute this query despite the performance unpredictability, use ALLOW
FILTERING
WORK AROUND
Expensive query, so use LIMIT if you must:
SELECT * FROM user_addresses WHERE username = 'AddThis' LIMIT 10 ALLOW FILTERING;
At absolute worst (there are no addresses with username = 'AddThis'),
it will have to look through the entire table.
COMPOSITE PARTITION KEYS
With a composite partition key, all of its columns are hashed together: rows that share
every partition key value land on the same node, while rows that share only the first
value may land on different nodes.
CREATE TABLE cars (
lot_id int,
make text,
model text,
color text,
PRIMARY KEY ((lot_id, make), model)
);
INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Ford', 'Explorer', 'Black');
INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Cadillac', 'CT6', 'Black');
INSERT INTO cars (lot_id, make, model, color) VALUES (2, 'BMW', 'M8', 'Red');
COMPOSITE PARTITION KEYS
-- FAILS
SELECT * FROM cars WHERE lot_id = 1;
SELECT * FROM cars WHERE make = 'Ford';
-- SUCCESS
SELECT * FROM cars WHERE lot_id = 1 AND make = 'Ford';
Generally, Cassandra will store rows having the same lot id but a
different make on different nodes, and rows having the same lot id
and make on the same node.
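A minimal Python sketch of that placement behaviour follows. It assumes a hypothetical three-node cluster and uses SHA-256 modulo the node count as a stand-in for Cassandra's partitioner and token ring, so the node names and the exact placement are illustrative only. The point is that the whole (lot_id, make) pair is hashed, never lot_id alone. The 'Focus' row is an extra hypothetical row added to show two models sharing one partition.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key_values):
    """Map the full partition key to one node via a stable hash.

    Assumption: SHA-256 modulo the node count stands in for Cassandra's
    partitioner and token ring; only the determinism matters here.
    """
    joined = ":".join(str(value) for value in partition_key_values)
    digest = hashlib.sha256(joined.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# cars: PRIMARY KEY ((lot_id, make), model)
for lot_id, make, model in [(1, "Ford", "Explorer"), (1, "Ford", "Focus"),
                            (1, "Cadillac", "CT6"), (2, "BMW", "M8")]:
    print((lot_id, make, model), "->", node_for((lot_id, make)))
```

Because the node can only be computed from the full pair, a query that supplies lot_id alone gives Cassandra nothing to hash, which is why it is rejected.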
INDEXES
You can also add indexes at any time:
CREATE INDEX genre_idx ON movies (genre);
Or even on a collection field (Cassandra 2.1+):
CREATE INDEX cast_idx ON movies (cast);
For maps, you can index either the values (the default) or the keys:
CREATE INDEX ON users (general);
CREATE INDEX ON users (KEYS(general));
RULES FOR INDEXES
Similar to relational databases, the more unique values in the index, the
larger it'll be, and the longer it'll take to read it.
▸ Do not index counter columns.
▸ Do not index high cardinality columns.
▸ Do not index on a frequently updated or deleted column (tombstone
issues)
▸ Do not index on a largely partitioned field (which requires
communicating with more servers to retrieve the information)
SIDEBAR ON TOMBSTONES
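Tombstones are markers written in place of deleted values, so that the deletion can be replicated to every node.
▸ Tombstones are kept for a grace period (gc_grace_seconds) before garbage collection
▸ Avoid writing nulls: inserting a null creates a tombstone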
FILTERING WITH INDEX
An index on a basic data column filters the same way as a primary key column.
For collections, use CONTAINS:
SELECT * FROM users WHERE email_addresses CONTAINS 'ben@addthis.com';
SELECT * FROM users WHERE user_attr_map CONTAINS 'Software Engineer';
SELECT * FROM users WHERE user_attr_map CONTAINS KEY 'Job Title';
RANGE QUERIES
Especially useful for timeseries tables, you can select a range of values
SELECT * FROM weather_reports WHERE report_time >= '2014-05-17 00:00:00-0000' AND report_time < '2014-05-18 00:00:00-0000';
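Range queries are cheap when the rows of a partition are kept sorted by the column being ranged over, which is the role a clustering column plays. The Python sketch below assumes report_time is such a column (the weather_reports schema is not shown here) and uses hypothetical sample rows. The helper `range_query` finds both ends of the range by binary search and returns the slice in between.

```python
from bisect import bisect_left
from datetime import datetime

# One partition's rows, kept sorted by the clustering column (report_time).
reports = [
    (datetime(2014, 5, 16, 23, 0), "clear"),
    (datetime(2014, 5, 17, 6, 0), "rain"),
    (datetime(2014, 5, 17, 18, 0), "cloudy"),
    (datetime(2014, 5, 18, 0, 0), "clear"),
]
times = [report_time for report_time, _ in reports]


def range_query(start, end):
    """Return rows with start <= report_time < end using two binary searches."""
    lo = bisect_left(times, start)
    hi = bisect_left(times, end)
    return reports[lo:hi]


for row in range_query(datetime(2014, 5, 17), datetime(2014, 5, 18)):
    print(row)
```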
DATA MODEL
WORKSHOP
REFERENCING Netflix
USER QUEUE
RELATIONAL MODEL
▸ User table with auto-inc ID
▸ Movie table with auto-inc ID
▸ User_Movie table with auto-inc ID, foreign keys user_id and
movie_id, plus a timestamp
Completely inefficient for Cassandra
BAD PARTITIONING
CREATE TABLE user_queue (
id uuid,
user_id uuid,
video_id uuid,
added timestamp,
PRIMARY KEY (id)
);
CREATE INDEX user_id_idx ON user_queue (user_id);
Terrible.
BETTER PARTITIONING
Including the video info as a JSON blob
CREATE TABLE user_queue (
user_id uuid,
video_id uuid,
video_info text,
added timestamp,
PRIMARY KEY (user_id, video_id)
);
BETTER MODEL
Utilize the columns
CREATE TABLE user_queue (
user_id uuid,
queue map<text, text>,
PRIMARY KEY (user_id)
);
▸ queue map will contain 'movieId' -> json blob of movie
▸ Read will pull all movies for a user
ADDING TO QUEUE
UPDATE user_queue
SET queue['newMovieId'] = 'json about movie'
WHERE user_id = X;
VIEW HISTORY
USER REVIEWS
NEW RELEASES
METRICS
THANK YOU

Cassandra Data Modeling for Optimal Query Performance
