Meetup Crash Course: Cassandra Data Modelling

Melbourne Cassandra Meetup
© 2015 DataStax. Use only with permission.
Crash Course: Cassandra Data Modelling
Erick Ramirez

DataStax Engineering

@flightc
27 August 2015

Welcome
• Modelling crash course

• Forget everything you know

• Informal session

• Please ask me questions

A refresher
• Gossip

• Partitions & hashing

• Replicas & snitches

• Client & coordinator

• Consistency level

A cluster
• Node - a Cassandra instance

• Rack - a logical group of nodes

• DC - a logical group of racks

• Cluster - a full set of nodes

Gossip
• New node gossips with seed
nodes

• Happens every second

• Learns about other nodes

• Up/down status

• Node locations

Partitions & hashing
• Data is partitioned

• Partition key is hashed

hash(“DataStax”) = 9b036bd16dbe90073a
hash(“@flightc”) = 1668bf314257609f04
• Partition (token) range is -2^63 to 2^63 - 1

• Each node owns token [range]*

* vnodes = multiple owned tokens
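
The hashing and ownership bullets above can be sketched in a few lines of Python. This is only an illustration: MD5 truncated to 8 bytes stands in for Cassandra's Murmur3 hash, and the four-node ring with its tokens and node names is hypothetical.

```python
import bisect
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def token_for(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token.

    Assumption: MD5 truncated to 8 bytes is a stand-in for Murmur3.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical ring, sorted by token. Each node owns the range that
# ends at its token (exclusive of the previous node's token).
RING = [
    (-2**62, "node1"),
    (0, "node2"),
    (2**62, "node3"),
    (MAX_TOKEN, "node4"),
]
TOKENS = [token for token, _ in RING]

def owner_of(token: int) -> str:
    """Return the node whose token range contains `token`."""
    index = bisect.bisect_left(TOKENS, token)
    # Wrap around to the first node if the token is past the last one.
    return RING[index % len(RING)][1]

if __name__ == "__main__":
    for key in ("DataStax", "@flightc"):
        token = token_for(key)
        print(key, token, owner_of(token))
```

With vnodes, each node would simply appear several times in the ring with different tokens.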

Replicas & snitches
• A replica is a copy of a partition

• 1st replica is token owner

• Next replica is “next” node

• A snitch tells Cassandra the
topology (DC and rack of each node)
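
Replica placement as described above (token owner first, then the "next" nodes) can be sketched as a clockwise walk around the ring. This is a simplified sketch that ignores racks and data centres, so it assumes nothing about what a snitch would report; the ring, its tiny tokens, and the function name are hypothetical.

```python
import bisect

# Hypothetical ring with small tokens for readability, sorted by token.
RING = [(-6, "node1"), (-2, "node2"), (2, "node3"), (6, "node4")]

def replicas_for(token, ring, replication_factor):
    """Return the nodes holding replicas of the partition at `token`.

    The first replica is the token owner; the rest are the next nodes
    clockwise. Rack and DC awareness is deliberately left out.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

if __name__ == "__main__":
    print(replicas_for(3, RING, 3))
```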

Client & coordinator
• C* driver (client) chooses node

- seed nodes

- load-balancing policy

• Chosen node for request is
coordinator

• Coordinator manages
replication to the replica nodes

• Each write is timestamped
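
Write timestamps matter because Cassandra resolves conflicting versions of a value by keeping the most recent one (last write wins). A minimal sketch of that rule, with a hypothetical `Cell` type and simplified tie-breaking:

```python
from typing import NamedTuple

class Cell(NamedTuple):
    """A column value plus its write timestamp (hypothetical type)."""
    value: str
    timestamp_micros: int

def reconcile(a: Cell, b: Cell) -> Cell:
    """Return the newer of two versions of the same cell.

    Assumption: ties simply favour `a`; real tie-breaking is more involved.
    """
    return a if a.timestamp_micros >= b.timestamp_micros else b

if __name__ == "__main__":
    older = Cell("old@example.com", 1_000)
    newer = Cell("new@example.com", 2_000)
    print(reconcile(older, newer).value)
```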

Consistency level
• Number of replicas which must
acknowledge a read or write

• Can vary per request

• Possible CLs include ANY (writes only), ONE,
QUORUM, LOCAL_QUORUM, ALL
• For writes, data is written to
disk (commitlog)

• For reads, nodes send most
recent copy of data
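
To make the consistency levels concrete, here is a small sketch of how many replica acknowledgements each level needs for a given replication factor (RF). It covers only ONE, QUORUM and ALL; ANY and LOCAL_QUORUM are left out because they depend on hints and per-DC replication. The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for a request at this level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]

def reads_see_latest_write(read_acks: int, write_acks: int, replication_factor: int) -> bool:
    """Read and write replica sets are guaranteed to overlap when R + W > RF."""
    return read_acks + write_acks > replication_factor

if __name__ == "__main__":
    rf = 3
    r = required_acks("QUORUM", rf)
    w = required_acks("QUORUM", rf)
    print(r, w, reads_see_latest_write(r, w, rf))
```

With RF = 3, QUORUM reads plus QUORUM writes overlap on at least one replica, while ONE plus ONE does not.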

Modelling Cassandra
• CQL

• Tables & column families

• Rows & partitions

Modelling is a science
• Use tested methodologies

• Predictable results

Modelling is an art
• Sometimes, you need to
improvise

• Massage schema to optimise

Data Modelling
• Collect & analyse data
requirements

• Identify entities & relationships

• Identify queries

• Design schema

• Optimise!

Goals
• Very fast queries

• De-normalise

• Nest data

• Duplicate data

• Query-driven model

Modelling Cassandra
• Use Cassandra Query
Language (CQL)

• SQL-like syntax

• DDL - CREATE, ALTER, DROP
• DML - SELECT, INSERT,
UPDATE, DELETE
CREATE TABLE users (
    userid uuid,
    name text,
    email text,
    PRIMARY KEY (userid)
);

Tables & column families
• Table is a two-dimensional view
of data

• A set of rows with a similar
structure

• Table schema defines a set of
columns and a primary key

• PK is a sequence of columns
which uniquely identify a row
• Column family is a multi-dimensional
data structure

• Rows are organised into
partitions

• A partition has 1 or more rows

• Partition key is part of primary
key used to uniquely identify a
partition
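
The relationship between rows and partitions can be sketched with plain dictionaries: rows that share a partition key land in the same partition and are kept sorted by their clustering columns. The helper name and the sample rows are made up for illustration.

```python
from collections import defaultdict

def build_partitions(rows, partition_key, clustering_columns):
    """Group rows by partition key and sort each group by clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in partition_key)
        partitions[key].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: tuple(r[c] for c in clustering_columns))
    return dict(partitions)

# Made-up sample data: two albums, one with two tracks.
ROWS = [
    {"album_title": "Album A", "year": 2001, "number": 2, "track_title": "Second"},
    {"album_title": "Album A", "year": 2001, "number": 1, "track_title": "First"},
    {"album_title": "Album B", "year": 2005, "number": 1, "track_title": "Only"},
]

PARTITIONS = build_partitions(ROWS, ["album_title", "year"], ["number"])

if __name__ == "__main__":
    for key, partition_rows in PARTITIONS.items():
        print(key, [r["track_title"] for r in partition_rows])
```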

Example - Table with single-row partitions

Example - Table with multi-row partitions

Keys, composites & clustering columns
• A simple partition key

PRIMARY KEY ( userid )
• Composite partition key

PRIMARY KEY ( (album_name, year) )
• Simple partition key with clustering columns

PRIMARY KEY ( userid, name, email )
• Composite partition key with clustering columns

PRIMARY KEY ( (album_name, year), title)
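
The four PRIMARY KEY forms above follow one rule: the first element is the partition key (a parenthesised group makes it composite) and everything after it is a clustering column. A small sketch of that rule, using Python tuples to mirror the CQL clause; the function name is hypothetical.

```python
def split_primary_key(primary_key):
    """Split a PRIMARY KEY definition into (partition key, clustering columns).

    `primary_key` mirrors the CQL clause: the first element is either a
    column name or a tuple of names (composite partition key); the
    remaining elements are clustering columns.
    """
    first, *clustering = primary_key
    partition_key = list(first) if isinstance(first, (tuple, list)) else [first]
    return partition_key, clustering

if __name__ == "__main__":
    examples = [
        ("userid",),
        (("album_name", "year"),),
        ("userid", "name", "email"),
        (("album_name", "year"), "title"),
    ]
    for primary_key in examples:
        print(primary_key, "->", split_primary_key(primary_key))
```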

Examples
Composite partition key
Composite partition key with clustering columns

Column families
• Distributed

• Sparse

Storage
[Figure: fast scan vs. slow scan]

Physical storage layout

On-disk layout to 2D representation

Sizes
• Column family size is limited
only by the size of the cluster

• Linear scaling - partitions are
distributed
• Largest partition must fit on
disk on a single node

• A single partition does not span
multiple nodes

• Max cells per partition is 2 billion

• Max data size per cell (column
value) is 2GB
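
A quick way to check a design against the 2 billion cell limit is to estimate the cells in a partition. The sketch below assumes the commonly quoted estimate: rows × (columns − primary key columns − static columns) + static columns. Treat it as a rough guide, not an exact accounting; the function names and the row counts are hypothetical.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000

def estimate_cells(rows, columns, primary_key_columns, static_columns=0):
    """Estimate the number of cells (stored values) in one partition.

    Assumed formula: rows * (columns - pk columns - statics) + statics.
    """
    regular_columns = columns - primary_key_columns - static_columns
    return rows * regular_columns + static_columns

def fits_in_partition(rows, columns, primary_key_columns, static_columns=0):
    """True if the estimated cell count stays under the partition limit."""
    cells = estimate_cells(rows, columns, primary_key_columns, static_columns)
    return cells <= MAX_CELLS_PER_PARTITION

if __name__ == "__main__":
    # A table with 6 columns, 3 of them in the primary key, 12 rows per partition.
    print(estimate_cells(rows=12, columns=6, primary_key_columns=3))
    print(fits_in_partition(rows=12, columns=6, primary_key_columns=3))
```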

Query-driven modelling
• Find all performers and albums
for a given track title

CREATE TABLE albums_by_track (
    track_title TEXT,
    performer TEXT,
    year INT,
    album_title TEXT,
    PRIMARY KEY ( track_title, performer, year, album_title )
);
• Find performer, genre & titles
for a given album title & year

CREATE TABLE tracks_by_album (
    album_title TEXT,
    year INT,
    performer TEXT,
    genre TEXT,
    number INT,
    track_title TEXT,
    PRIMARY KEY ( (album_title, year), number )
);

Partition per query
• Most efficient access pattern

• Query accesses only 1 partition

• Partition can be 1 or more rows

Partition+ per query
• Less efficient

• Not necessarily bad

• Query accesses 1+ partitions

Table scan, multi-table
• Not efficient at all - avoid!

• Query accesses all partitions in
one or more tables
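
The three access patterns (one partition, several partitions, full scan) differ mainly in how many partitions a query has to read. A toy sketch that models a table as a dictionary keyed by partition key and reports that count; the table contents and the `read` helper are made up.

```python
# Toy table: partition key (album_title, year) -> rows in that partition.
TRACKS_BY_ALBUM = {
    ("Album A", 2001): ["First", "Second"],
    ("Album B", 2005): ["Only"],
    ("Album C", 2010): ["Intro", "Outro"],
}

def read(table, partition_keys=None):
    """Return (rows, partitions_read).

    One key is a partition-per-query read, several keys a multi-partition
    read, and None a full table scan over every partition.
    """
    if partition_keys is None:
        keys = list(table)
    else:
        keys = [key for key in partition_keys if key in table]
    rows = [row for key in keys for row in table[key]]
    return rows, len(keys)

if __name__ == "__main__":
    print(read(TRACKS_BY_ALBUM, [("Album A", 2001)]))
    print(read(TRACKS_BY_ALBUM, [("Album A", 2001), ("Album B", 2005)]))
    print(read(TRACKS_BY_ALBUM))
```

In a real cluster each extra partition may live on a different node, which is why the cost grows with the partition count.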

Nest data
• More efficient to get to partition and iterate through rows

Duplicate data
• Better than doing an expensive join

• Results are pre-computed & materialised
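
Duplication means the join work happens at write time: each new fact is written to every table that serves a query, so reads become single-partition lookups. A sketch with two in-memory tables named after the ones in this deck; the `insert_track` helper and the sample values are hypothetical, and the genre column is left out for brevity.

```python
# One dictionary per query table, keyed by that table's partition key.
albums_by_track = {}
tracks_by_album = {}

def insert_track(track_title, performer, year, album_title, number):
    """Write the same track into both query tables (denormalised)."""
    albums_by_track.setdefault(track_title, []).append(
        (performer, year, album_title)
    )
    tracks_by_album.setdefault((album_title, year), []).append(
        (number, track_title, performer)
    )

# Made-up sample data.
insert_track("First", "Some Band", 2001, "Album A", 1)
insert_track("Second", "Some Band", 2001, "Album A", 2)

if __name__ == "__main__":
    # Each query reads exactly one pre-computed partition; no join needed.
    print(albums_by_track["First"])
    print(tracks_by_album[("Album A", 2001)])
```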

Query-driven model
• Each query has a
corresponding table

• Tables are optimised for queries

• Tables return data in correct
order

This is the beginning

Get trained
• Free instructor-led courses

• Free self-paced learning

• Free online resources

• Go to academy.datastax.com

Cassandra Summit 2015
• 5 reasons to join me in SF
buff.ly/1JHl6Kw

• September 22-24

• Free general passes still
available!

Thank you
Erick Ramirez @flightc
