Cassandra Data Modeling for Optimal Query Performance

INTRODUCTION
A little bit about me, a little bit more about AddThis

WHY CASSANDRA?
▸ Scalability
▸ Fault Tolerant
▸ Optimized for Big Data
▸ Queryable

STORING DATA IN CASSANDRA
▸ Before we can talk about building a house,
you have to know what kind of tools you have.
▸ Before you can cook a meal,
you need to know what ingredients you have.
▸ Before you can data model with Cassandra,
you need to know what types of data you can store.

BASIC TYPES
CQL Type | Constants | Description
-------- | --------- | -----------
ascii | strings | US-ASCII character string
bigint | integers | 64-bit signed long
blob | blob | Arbitrary bytes
boolean | booleans | true or false
counter | integers | Distributed counter value (64-bit long)
decimal | ints/floats| Variable-precision decimal
float | ints/floats| 32-bit floating point
inet | strings | IP address string
int | integers | 32-bit signed integer
text | strings | UTF-8 encoded string
timestamp| ints/strings | Date plus time
timeuuid | uuids | Type 1 UUID only
tuple | n/a | A group of 2-3 fields (Cassandra 2.1+)
uuid | uuids | A UUID in standard UUID format
varchar | strings | UTF-8 encoded string
varint | integers | Arbitrary-precision integer

HOW CASSANDRA STORES DATA
TABLE CONSTRUCT
CREATE TABLE users (
user_id uuid,
first_name text,
last_name text,
company text,
PRIMARY KEY (user_id)
);

SETS
emails set<text>
--- insert set into table
INSERT INTO users (user_id, first_name, last_name, emails)
VALUES(uuid(), 'Ben', 'Knear', {'ben@addthis.com', 'ben.kn@gmail.com'});
--- add to set, even if set was never instantiated
--- note, it will re-sort the collection after adding
UPDATE users
SET emails = emails + {'ben.kn@yahoo.com'}
WHERE user_id = X;
--- remove from set
UPDATE users
SET emails = emails - {'ben.kn@yahoo.com'}
WHERE user_id = X;

LISTS
priority_emails list<text>
--- insert list into table
INSERT INTO users (user_id, first_name, last_name, priority_emails)
VALUES(uuid(), 'Ben', 'Knear', ['ben@addthis.com', 'ben.kn@gmail.com']);
--- add to list, even if list was never instantiated
UPDATE users
SET priority_emails = priority_emails + ['ben.kn@yahoo.com']
WHERE user_id = X;
--- remove from list
UPDATE users
SET priority_emails = priority_emails - ['ben.kn@yahoo.com']
WHERE user_id = X;

MAPS
contact_info map<text, text>
--- insert map into table
INSERT INTO users (user_id, first_name, last_name, contact_info)
VALUES(uuid(), 'Ben', 'Knear',
{ 'work_email' : 'ben@addthis.com',
'home_email' : 'ben.kn@gmail.com' });
--- delete from a map
DELETE contact_info['work_email']
FROM users WHERE user_id = X;

MAPS
contact_info map<text, text>
--- add to map, even if map was never instantiated
UPDATE users
SET contact_info['other_email'] = 'ben.kn@yahoo.com'
WHERE user_id = X;
--- remove from map
UPDATE users
SET contact_info = contact_info - ['ben.kn@yahoo.com']
WHERE user_id = X;

JSON OR MAP
▸ Remember all values in a Map must be the same type
▸ What will you do with the value?
▸ Remember: Values of items in collections are limited to 64K.

HOW CASSANDRA STORES DATA
Row Keys, Column Families
| row key | columns |
|---------|----------------------------------------|
| | "first_name" | "last_name" | "company" |
| UUID | 'Ben' | 'Knear' | 'AddThis' |
|---------|----------------------------------------|

▸ Maximum number of columns per row is 2 billion.
▸ Maximum size for the name of a column is 64 KB.
▸ Maximum size for a value in a column is 2 GB.
▸ Collection values may not be larger than 64 KB.

You must ask:
▸ What do we want to store?
▸ What are the relationships within the data?
▸ How do we plan to access it?
In a relational model, you would focus most on the questions one and
two. But for Cassandra, you must focus most on the third.

You must ask:
▸ What do we want to store?
▸ What are the relationships within the data?
▸ How do we plan to access it?
In a relational model, you would focus most on the questions one and
two. But for Cassandra, you must focus most on the third.
Cassandra data modeling starts with how will you use the data.

Denormalization
Optimizing the read performance of a database by adding redundant
data or by grouping data. In Cassandra, this process is accomplished by
duplicating data in multiple tables, grouping data for queries.

So how will we get data from
Cassandra?

PRIMARY KEY
Defines a unique value to identify the row, and also drives partitioning.
When multiple columns are defined in the primary key, the first column
defines the partition key, the rest are clustered columns.
Partitioning is important because rows and columns are grouped on
nodes, and grouped data is read and written faster.

EXAMPLE
CREATE TABLE movies (
id uuid,
name text,
genre text,
cast set<uuid>,
company text,
PRIMARY KEY (id)
);
Good if I only know the ID when I retrieve

COMPOSITE PRIMARY KEY ALTERNATIVE
CREATE TABLE movies (
id uuid,
name text,
genre text,
cast set<uuid>,
company text,
PRIMARY KEY (genre, id)
);
Good if I know the genre and ID when I retrieve, or to get all of a genre

QUERYING EXAMPLES
CREATE TABLE user_addresses (
state text,
city text,
username text,
address text,
PRIMARY KEY (state, city, username)
);
-- insert a value
INSERT INTO user_addresses (state, city, username, address)
VALUES ('VA', 'Vienna', 'AddThis', 'Spring Hill Rd');

-- FAILURES
SELECT * FROM user_addresses WHERE city = 'Vienna';
SELECT * FROM user_addresses
WHERE city = 'Vienna' AND username = 'AddThis';
-- SUCCESSES
SELECT * FROM user_addresses WHERE state = 'VA';
WHERE state = 'VA' AND city = 'Vienna';
WHERE state = 'VA' AND city = 'Vienna'
AND username = 'AddThis';

ERROR
Filtering by just clustering columns will give you this response:
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want to
execute this query despite the performance unpredictability, use ALLOW
FILTERING

WORK AROUND
Expensive query, so use LIMIT if you must:
SELECT * FROM user_addresses WHERE username = 'AddThis' LIMIT 10 ALLOW FILTERING;
At absolute worst (there are no addresses with username = 'AddThis'),
it will have to look through the entire table.

COMPOSITE PARTITION KEYS
Composite partition keys will group the first value on the same node,
though the second value may be on a different node.
CREATE TABLE cars (
lot_id int,
make text,
model text,
color text,
PRIMARY KEY ((lot_id, make), model)
);
INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Ford', 'Explorer', 'Black');
INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Cadillac', 'CT6', 'Black');
INSERT INTO cars (lot_id, make, model, color) VALUES (2, 'BMW', 'M8', 'Red');

COMPOSITE PARTITION KEYS
-- FAILS
SELECT * FROM cars WHERE lot_id = 1;
SELECT * FROM cars WHERE make = 'Ford';
-- SUCCESS
SELECT * FROM cars WHERE lot_id = 1 AND make = 'Ford';
Generally, Cassandra will store columns having the same lot id but a
different make on different nodes, and columns having the same lot id
and make on the same node.

INDEXES
You can also add indices at any time:
CREATE INDEX genre_idx ON movie (genre);
Or even on a collection field (Cassandra 2.1+)
CREATE INDEX cast_idx ON movie (cast);
But for maps, you can index either the keys or the values.
CREATE INDEX ON users (general);
CREATE INDEX ON users (KEYS(general));

RULES FOR INDEXES
Similar to relational databases, the more unique values in the index, the
larger it'll be, and the longer it'll take to read it.

RULES FOR INDEXES
▸ Do not index counter columns.
▸ Do not index high cardinality columns.
▸ Do not index on a frequently updated or deleted column (tombstone
issues)
▸ Do not index on a largely partitioned field (which requires
communicating with more servers to retrieve the information)

SIDEBAR ON TOMBSTONES
Tombstones are relics from deleted values, used in data replication.
▸ Grace period for garbage collection
▸ Avoid nulls

FILTERING WITH INDEX
Indexes on a basic data column filters the same as a primary key.
For collections you will use CONTAINS:
SELECT * FROM users WHERE email_addresses CONTAINS 'ben@addthis.com';
SELECT * FROM users WHERE user_attr_map CONTAINS 'Software Engineer';
SELECT * FROM users WHERE user_attr_map CONTAINS KEY 'Job Title';

RANGE QUERIES
Especially useful for timeseries tables, you can select a range of values
SELECT * FROM weather_reports WHERE report_time >= '2014-05-17 00:00:00-0000' AND report_time < '2014-05-18 00:00:00-0000';

DATA MODEL
WORKSHOP
REFERENCING Netﬂix

RELATIONAL MODEL
▸ User table with auto-inc ID
▸ Movie table with auto-inc ID
▸ User_Movie table with auto-inc ID, foreign keys user_id and
movie_id, plus a timestamp

RELATIONAL MODEL
▸ User table with auto-inc ID
▸ Movie table with auto-inc ID
▸ User_Movie table with auto-inc ID, foreign keys user_id and
movie_id, plus a timestamp
Completely inefficient for Cassandra

BAD PARTITIONING
CREATE TABLE user_queue (
id uuid,
user_id uuid,
video_id uuid,
added timestamp,
PRIMARY KEY (id)
);
CREATE INDEX user_id_idx ON user_queue (user_id);
Terrible.

BETTER PARTITIONING
Including the video info as a JSON blob
user_id uuid,
video_id uuid,
video_info text,
added timestamp,
);

BETTER MODEL
Utilize the columns
user_id uuid,
queue map<text, text>,
);
▸ queue map will contain 'movieId' -> json blob of movie
▸ Read will pull all movies for a user

ADDING TO QUEUE
UPDATE TABLE user
SET queue['newMovieId'] = 'json about movie'
WHERE user_id = X;

Cassandra Data Modeling for Optimal Query Performance

Recommended

Recommended

More Related Content

Similar to Cassandra Data Modeling for Optimal Query Performance

Similar to Cassandra Data Modeling for Optimal Query Performance (20)

Recently uploaded

Recently uploaded (20)

Cassandra Data Modeling for Optimal Query Performance