Melbourne Cassandra Meetup
© 2015 DataStax. Use only with permission.
Crash Course: Cassandra Data Modelling
Erick Ramirez

DataStax Engineering

@flightc
27 August 2015
Welcome
• Modelling crash course

• Forget everything you know

• Informal session

• Please ask me questions
A refresher
• Gossip

• Partitions & hashing

• Replicas & snitches

• Client & coordinator

• Consistency level
A cluster
• Node - a Cassandra instance

• Rack - a logical group of nodes

• DC - a logical group of racks

• Cluster - a full set of nodes
Gossip
• New node gossips with seed nodes

• Happens every second

• Learns about other nodes

• Up/down status

• Node locations
Partitions & hashing
• Data is partitioned

• Partition key is hashed

hash("DataStax") = 9b036bd16dbe90073a
hash("@flightc") = 1668bf314257609f04

• Partition range is -2^63 to 2^63 - 1

• Each node owns a token [range]*

* vnodes = multiple owned tokens
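The hashing and ownership bullets above can be sketched in a few lines of Python. This is a rough illustration, not Cassandra's implementation: MD5 stands in for the Murmur3 hash (which is not in the Python standard library), and the three-node `RING` with evenly spaced tokens is hypothetical.

```python
import bisect
import hashlib
from typing import Dict

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(partition_key: str) -> int:
    """Hash a partition key to a signed 64-bit token.

    Assumption: MD5 is used here only as a stand-in hash function.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Keep the first 8 bytes so the result fits the signed 64-bit token range.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def owner_of(token: int, ring: Dict[int, str]) -> str:
    """Return the node that owns `token`.

    `ring` maps each node's token to the node name. A node owns the range
    from the previous token (exclusive) up to its own token (inclusive).
    """
    tokens = sorted(ring)
    idx = bisect.bisect_left(tokens, token)
    # Tokens past the last node wrap around to the first node on the ring.
    return ring[tokens[idx % len(tokens)]]


# Hypothetical three-node ring with evenly spaced tokens.
RING = {
    MIN_TOKEN + 2**64 // 3: "node1",
    MIN_TOKEN + 2 * (2**64 // 3): "node2",
    MAX_TOKEN: "node3",
}

if __name__ == "__main__":
    for key in ("DataStax", "@flightc"):
        token = token_for(key)
        print(key, token, owner_of(token, RING))
```

With vnodes, each node would simply appear under several tokens in `RING` instead of one.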
Replicas & snitches
• A replica is a copy of a partition

• 1st replica is the token owner

• Next replica is the "next" node on the ring

• A snitch tells Cassandra the topology (which rack and DC each node is in)
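A minimal sketch of the "token owner, then next node" rule, assuming the simplest placement where replicas are the next distinct nodes clockwise on the ring. It ignores racks and DCs, which a topology-aware strategy would take from the snitch. The four-node `RING` and its tokens are made up for illustration.

```python
import bisect
from typing import Dict, List


def replicas_for(token: int, ring: Dict[int, str], rf: int) -> List[str]:
    """Pick up to `rf` distinct nodes, walking clockwise from the token owner."""
    tokens = sorted(ring)
    # The first replica is the owner: the first node token >= the given token.
    start = bisect.bisect_left(tokens, token) % len(tokens)
    replicas: List[str] = []
    for step in range(len(tokens)):
        node = ring[tokens[(start + step) % len(tokens)]]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


# Hypothetical ring: token -> node name.
RING = {-100: "node1", 0: "node2", 100: "node3", 200: "node4"}

if __name__ == "__main__":
    print(replicas_for(50, RING, rf=3))
```

If `rf` is larger than the number of nodes, the function returns every node once, since a node cannot hold two replicas of the same partition.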
Client & coordinator
• C* driver (client) chooses node

- seed nodes

- load-balancing policy

• The node chosen for a request is the coordinator

• Coordinator sends the request to the replicas, as set by the replication factor

• Each write is timestamped
Consistency level
• Number of replicas which must acknowledge a read or write

• Can vary per request

• Possible CLs include ANY (writes only), ONE, QUORUM, LOCAL_QUORUM, ALL

• For writes, data is written to disk (commitlog)

• For reads, nodes send their most recent copy of the data
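As a rough illustration of the bullets above, the sketch below computes how many replica acknowledgements ONE, QUORUM and ALL need for a given replication factor, and picks the most recent copy among read responses by write timestamp. ANY and LOCAL_QUORUM are left out because they depend on hints and per-DC replication settings. The `Cell` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    value: str
    write_timestamp: int  # e.g. microseconds since the epoch


def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for ONE, QUORUM or ALL."""
    needed = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return needed[level]


def most_recent(responses: List[Cell]) -> Cell:
    """Resolve a read: the copy with the latest write timestamp wins."""
    return max(responses, key=lambda cell: cell.write_timestamp)


if __name__ == "__main__":
    print(required_acks("QUORUM", 3))
    print(most_recent([Cell("old", 1000), Cell("new", 2000)]).value)
```

With a replication factor of 3, reading and writing at QUORUM means 2 + 2 > 3, so every read overlaps at least one replica that acknowledged the latest write.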
Modelling Cassandra
• CQL

• Tables & column families

• Rows & partitions
Modelling is a science
• Use tested methodologies

• Predictable results
Modelling is an art
• Sometimes, you need to improvise

• Massage schema to optimise
Data Modelling
• Collect & analyse data requirements

• Identify entities & relationships

• Identify queries

• Design schema

• Optimise!
Goals
• Very fast queries

• De-normalise

• Nest data

• Duplicate data

• Query-driven model
Modelling Cassandra
• Use Cassandra Query Language (CQL)

• SQL-like syntax

• DDL - CREATE, ALTER, DROP

• DML - SELECT, INSERT, UPDATE, DELETE

CREATE TABLE users (
    userid uuid,
    name text,
    email text,
    PRIMARY KEY (userid)
);
Tables & column families
• Table is a two-dimensional view of data

• A set of rows with a similar structure

• Table schema defines a set of columns and a primary key

• PK is a sequence of columns which uniquely identifies a row

• Column family is a multi-dimensional data structure

• Rows are organised into partitions

• A partition has 1 or more rows

• Partition key is the part of the primary key used to uniquely identify a partition
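To make "rows are organised into partitions" concrete, here is a small sketch that groups rows by the values of their partition key columns. The sample rows are made up for illustration. Using the whole primary key as the partition key gives single-row partitions, while a shorter partition key gives multi-row partitions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def group_into_partitions(
    rows: List[Row], partition_key: Sequence[str]
) -> Dict[Tuple[object, ...], List[Row]]:
    """Group rows by the values of their partition key columns."""
    partitions: Dict[Tuple[object, ...], List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in partition_key)
        partitions[key].append(row)
    return dict(partitions)


# Made-up sample rows, for illustration only.
TRACKS = [
    {"album_title": "Album A", "year": 2001, "number": 1, "track_title": "Intro"},
    {"album_title": "Album A", "year": 2001, "number": 2, "track_title": "Second"},
    {"album_title": "Album B", "year": 2005, "number": 1, "track_title": "Opening"},
]

# Partition key (album_title, year): rows of the same album share a partition.
multi_row = group_into_partitions(TRACKS, ["album_title", "year"])

# Partition key equal to the full primary key: one row per partition.
single_row = group_into_partitions(TRACKS, ["album_title", "year", "number"])

if __name__ == "__main__":
    for key, rows in multi_row.items():
        print(key, len(rows))
```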
Example - Table with single-row partitions
Example - Table with multi-row partitions
Keys, composites & clustering columns
• A simple partition key

PRIMARY KEY ( userid )
• Composite partition key

PRIMARY KEY ( (album_name, year) )
• Simple partition key with clustering columns

PRIMARY KEY ( userid, name, email )
• Composite partition key with clustering columns

PRIMARY KEY ( (album_name, year), title)
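The four PRIMARY KEY forms above follow one rule: the first element (a single column or a parenthesised group) is the partition key, and everything after it is a clustering column. The helper below is a hypothetical illustration of that rule, not part of any Cassandra driver, and it only handles plain column names.

```python
import re
from typing import List, Tuple


def split_primary_key(clause: str) -> Tuple[List[str], List[str]]:
    """Split a PRIMARY KEY clause into (partition key, clustering columns)."""
    match = re.fullmatch(
        r"PRIMARY\s+KEY\s*\((.*)\)", clause.strip(), flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("not a PRIMARY KEY clause: %r" % clause)
    body = match.group(1).strip()
    if body.startswith("("):
        # Composite partition key: the columns inside the inner parentheses.
        close = body.index(")")
        partition = [col.strip() for col in body[1:close].split(",")]
        rest = body[close + 1:]
    else:
        # Simple partition key: the first column only.
        first, _, rest = body.partition(",")
        partition = [first.strip()]
    clustering = [col.strip() for col in rest.split(",") if col.strip()]
    return partition, clustering


if __name__ == "__main__":
    for clause in (
        "PRIMARY KEY ( userid )",
        "PRIMARY KEY ( (album_name, year) )",
        "PRIMARY KEY ( userid, name, email )",
        "PRIMARY KEY ( (album_name, year), title)",
    ):
        print(clause, "->", split_primary_key(clause))
```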
Examples
Composite partition key
Composite partition key with clustering columns
Column families
• Distributed

• Sparse
Storage
[Figure: storage diagram with the labels FAST SCAN and SLOW SCAN]
Physical storage layout
On-disk layout to 2D representation
Sizes
• Column family size is limited only by the size of the cluster

• Linear scaling - partitions are distributed

• Largest partition must fit on disk on a single node

• A single partition does not span multiple nodes

• Max cells per partition is 2 billion

• Max data size per cell (column value) is 2GB
Query-driven modelling
• Find all performers and albums for a given track title

CREATE TABLE albums_by_track (
    track_title TEXT,
    performer TEXT,
    year INT,
    album_title TEXT,
    PRIMARY KEY ( track_title, performer, year, album_title )
);

• Find performer, genre & titles for a given album title & year

CREATE TABLE tracks_by_album (
    album_title TEXT,
    year INT,
    performer TEXT,
    genre TEXT,
    number INT,
    track_title TEXT,
    PRIMARY KEY ( (album_title, year), number )
);
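The two tables above hold the same facts, each keyed for its own query. A rough in-memory sketch of that idea follows: two dicts stand in for the tables, every track is written to both, and each query becomes a single lookup by partition key. The helper names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each dict stands in for one table: partition key -> rows in that partition.
albums_by_track: Dict[str, List[dict]] = defaultdict(list)
tracks_by_album: Dict[Tuple[str, int], List[dict]] = defaultdict(list)


def add_track(album_title: str, year: int, performer: str, genre: str,
              number: int, track_title: str) -> None:
    """Write the same fact to both tables (deliberate duplication)."""
    albums_by_track[track_title].append(
        {"performer": performer, "year": year, "album_title": album_title}
    )
    tracks_by_album[(album_title, year)].append(
        {"performer": performer, "genre": genre,
         "number": number, "track_title": track_title}
    )


def find_albums_for_track(track_title: str) -> List[dict]:
    """Query 1: performers and albums for a given track title."""
    return list(albums_by_track.get(track_title, []))


def find_tracks_for_album(album_title: str, year: int) -> List[dict]:
    """Query 2: tracks of an album, ordered by track number."""
    rows = tracks_by_album.get((album_title, year), [])
    return sorted(rows, key=lambda row: row["number"])


# Made-up sample data, for illustration only.
add_track("Album A", 2001, "Performer X", "Rock", 2, "Second")
add_track("Album A", 2001, "Performer X", "Rock", 1, "Intro")
add_track("Album B", 2005, "Performer Y", "Jazz", 1, "Intro")

if __name__ == "__main__":
    print(find_albums_for_track("Intro"))
    print(find_tracks_for_album("Album A", 2001))
```

The cost of this design is extra writes and storage. The benefit is that neither query needs a join or a scan.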
Partition per query
• Most efficient access pattern

• Query accesses only 1 partition

• Partition can be 1 or more rows
Partition+ per query
• Less efficient

• Not necessarily bad

• Query accesses 1+ partitions
Table scan, multi-table
• Not efficient at all - avoid!

• Query accesses all partitions in one or more tables
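The three access patterns (one partition, several partitions, table scan) differ mainly in how many partitions a query has to read. The sketch below counts them on a toy table. `run_query` and `TABLE` are hypothetical, and in a real cluster each partition read may also land on a different node.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Table = Dict[str, List[str]]  # partition key -> rows in that partition


def run_query(
    table: Table, partition_keys: Optional[Iterable[str]] = None
) -> Tuple[List[str], int]:
    """Return (rows, partitions_read).

    Passing no partition keys models a table scan: every partition is read.
    """
    keys = list(table) if partition_keys is None else list(partition_keys)
    rows: List[str] = []
    for key in keys:
        rows.extend(table.get(key, []))
    return rows, len(keys)


# Toy table with three partitions.
TABLE: Table = {"p1": ["r1", "r2"], "p2": ["r3"], "p3": ["r4"]}

if __name__ == "__main__":
    print(run_query(TABLE, ["p1"]))        # partition per query
    print(run_query(TABLE, ["p1", "p3"]))  # partition+ per query
    print(run_query(TABLE))                # table scan
```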
Nest data
• More efficient to get to partition and iterate through rows
Duplicate data
• Better than doing an expensive join

• Results are pre-computed & materialised
Query-driven model
• Each query has a corresponding table

• Tables are optimised for queries

• Tables return data in the correct order
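"Tables return data in the correct order" because rows inside a partition are kept sorted by their clustering columns at write time, so reads need no sort step. A minimal sketch of that behaviour, using a hypothetical `Partition` class and made-up rows:

```python
import bisect
from typing import List, Tuple


class Partition:
    """Rows of one partition, kept sorted by clustering key on write."""

    def __init__(self) -> None:
        self._keys: List[Tuple] = []
        self._rows: List[dict] = []

    def insert(self, clustering_key: Tuple, row: dict) -> None:
        idx = bisect.bisect_left(self._keys, clustering_key)
        if idx < len(self._keys) and self._keys[idx] == clustering_key:
            # Same primary key: the new write replaces the old row (upsert).
            self._rows[idx] = row
        else:
            self._keys.insert(idx, clustering_key)
            self._rows.insert(idx, row)

    def rows(self) -> List[dict]:
        """Rows come back already ordered by clustering key."""
        return list(self._rows)


if __name__ == "__main__":
    album = Partition()
    # Writes arrive out of order; track number is the clustering column.
    album.insert((3,), {"number": 3, "track_title": "Third"})
    album.insert((1,), {"number": 1, "track_title": "Intro"})
    album.insert((2,), {"number": 2, "track_title": "Second"})
    print([row["track_title"] for row in album.rows()])
```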
This is the beginning
Get trained
• Free instructor-led courses

• Free self-paced learning

• Free online resources

• Go to academy.datastax.com
Cassandra Summit 2015
• 5 reasons to join me in SF: buff.ly/1JHl6Kw

• September 22-24

• Free general passes still available!
Thank you
Erick Ramirez @flightc
