The C* Developer Training
Chuck Droukas, Systems Engineer – Datastax
Apache Cassandra Developer Training Slide Deck

This course is designed to be a “fast start” on the basics of data modeling with Cassandra. We will cover some basic administration information up front that is important to understand as you choose your data model. It is still important to take a proper Admin class if you are responsible for a production instance. This course focuses on CQL3, but Thrift shall not be ignored.

Published in: Technology
Transcript of "Apache Cassandra Developer Training Slide Deck"

  1. The C* Developer Training
     Chuck Droukas, Systems Engineer – Datastax
  2. Disclaimers
     • This course is designed to be a “fast start” on the basics of data modeling with Cassandra.
     • We will cover some basic administration information up front that is important to understand as you choose your data model.
     • It is still important to take a proper Admin class if you are responsible for a production instance.
     • This course focuses on CQL3, but Thrift shall not be ignored.
     • Please ask questions and interrupt me. It makes the day go faster for both of us.
  3. Agenda
     • Architecture Overview
       - Ring Topology
       - Write Path
       - Read Path
       - Updates and Deletes
     • Break
     • Columns and their components
     • Column Families
     • Lunch
     • Keyspaces
     • Complex Queries
     • Break
     • Timeseries Example
     • User Activity Example
     • Shopping Cart Example
     • Logging Example
  4. The Cassandra Schema
     Consists of:
     • Column
     • Column Family (aka Table)
     • Keyspace (aka Database)
     • Cluster
  5. High Level Overview
     Keyspace
     Column Family / Table
     Rows
     Columns
  6. Components of the Column
     The column is the fundamental data type in Cassandra and includes:
     • Column name
     • Column value
     • Timestamp
     • TTL (Optional)
  7. The Column
     Name | Value | Timestamp
     (Name: “firstName”, Value: “Engelbert”, Timestamp: 1363106500)
  8. Column Name
     • Can be any value
     • Can be any type
     • Not optional
     • Must be unique
     • Stored with every value
  9. Column Value
     • Any value
     • Any type
     • Can be empty – but is required
  10. Column Names and Values
      • The data type for a column (or row key) value is called a validator.
      • The data type for a column name is called a comparator.
      • Cassandra validates the data type of the row keys.
      • Columns are sorted, and stored in sorted order on disk, so you have to specify a comparator for columns. This can be reversed… more on this later.
  11. Data Types
  12. Column TimeStamp
      • 64-bit integer
      • Best Practice
        – Should be created in a consistent manner by all your clients
      • Required
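
The timestamp matters because it decides which version of a column survives when two writes touch the same column. The sketch below is a toy model of that "highest timestamp wins" rule, not Cassandra's actual code; the `Column` class, the `reconcile` helper, and the second sample value are assumptions made for illustration, as is the tie-break rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Toy column: name, value, and a client-supplied write timestamp."""
    name: str
    value: str
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Pick the surviving version of one column: the higher timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    # Assumption for this sketch: on a timestamp tie, compare values so the
    # result does not depend on argument order.
    return max(a, b, key=lambda c: (c.timestamp, c.value))


old = Column("firstName", "Engelbert", 1363106500)
new = Column("firstName", "Arnold", 1363106600)  # hypothetical later write
winner = reconcile(old, new)
print(winner.value)
```

If clients generate timestamps inconsistently (for example, seconds on one client and microseconds on another), the "newer" write can lose, which is why the slide asks for a consistent approach across all clients.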
  13. Column TTL
      • Defined on INSERT
      • Positive delay (in seconds)
      • After time expires it is marked for deletion
  14. Special Types of Columns
      • Super
      • Counter
      • Collections
  15. Counters
      • Allows for addition / subtraction
      • 64-bit value
      • No timestamp
      • Deletion does not require a timestamp
  16. Collections
      • New in 1.2!
      • Set, Map, List
  17. SET Example
  18. The Cassandra Schema
      Consists of:
      • Column
      • Column Family
      • Keyspace
      • Cluster
  19. Column Families / Tables
      • Same as tables
        - Groupings of Rows
        - AcID
        - Eventual Consistency
      • De-Normalization
        - To avoid I/O
        - Simplify the Read Path
      • Static or Dynamic
  20. Static Column Families
      • Are the most similar to a relational table
      • Most rows have the same column names
      • Columns in rows can be different

      Row Key  | Columns
      jbellis  | Name: Jonathan | Email: jb@ds.com | Address: 123 main | State: TX
      dhutch   | Name: Daria | Email: dh@ds.com | Address: 45 2nd St. | State: CA
      egilmore | Name: eric | Email: eg@ds.com
  21. Dynamic Column Families
      • Also called “wide rows”
      • Structured so a query into the row will answer a question

      Row Key  | Columns (Subscribers)
      jbellis  | dhutch | egilmore | datastax | mzcassie
      dhutch   | egilmore
      egilmore | datastax | mzcassie
  22. Dynamic Table CQL3 Example
      CREATE TABLE timeline (
        user_id varchar,
        tweet_id uuid,
        author varchar,
        body varchar,
        PRIMARY KEY (user_id, tweet_id)
      )
  23. Clustering Order
      • Sorts columns on disk by default
      • Can change the order
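
To make the idea of a partition key plus a sorted clustering column concrete, here is a small in-memory sketch. It is only a toy model under stated assumptions: the `Table` class is hypothetical, and plain integers stand in for the `tweet_id` uuid values of the `timeline` table so the ordering is easy to see.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy wide-row table: partition key -> rows kept sorted by clustering key."""

    def __init__(self, descending=False):
        self.descending = descending
        self.partitions = defaultdict(list)  # key -> [(clustering_key, row), ...]

    def insert(self, partition_key, clustering_key, row):
        rows = self.partitions[partition_key]
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, row)       # same key: overwrite in place
        else:
            rows.insert(i, (clustering_key, row))  # keep the row sorted

    def read(self, partition_key):
        rows = self.partitions.get(partition_key, [])
        return list(reversed(rows)) if self.descending else list(rows)


timeline = Table()
newest_first = Table(descending=True)
for tweet_id, body in [(3, "third"), (1, "first"), (2, "second")]:
    timeline.insert("jbellis", tweet_id, body)
    newest_first.insert("jbellis", tweet_id, body)

print(timeline.read("jbellis"))
print(newest_first.read("jbellis"))
```

Rows come back ordered by the clustering key regardless of insertion order, and the `descending` flag plays the role of changing the clustering order.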
  24. The Cassandra Schema
      Consists of:
      • Column
      • Column Family
      • Keyspace
      • Cluster
  25. Keyspaces
      • Are groupings of Column Families
      • Replication strategies
      • Replication factor

      CREATE KEYSPACE videodb
        WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }

      In production you would use NetworkTopologyStrategy for multiple DCs.

      CREATE KEYSPACE "Excalibur"
        WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2 };
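
As a rough illustration of what a replication factor means on a ring, the sketch below places replicas on the owning node and then the next nodes clockwise, which is the general idea behind SimpleStrategy. The four-node ring, its token values, and the `replicas_for` helper are all made up for this example; it ignores racks and data centers, which NetworkTopologyStrategy takes into account.

```python
import bisect


def replicas_for(token, ring, replication_factor):
    """Return the node names holding a token on a toy ring.

    ring is a list of (node_token, node_name) sorted by token. The first
    replica is the node whose token is the first one >= the key's token
    (wrapping around), and the rest are the next nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical ring with evenly spaced tokens.
ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]

print(replicas_for(30, ring, 1))
print(replicas_for(30, ring, 3))
print(replicas_for(80, ring, 2))
```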
  26. Complex Queries
      Partitioning and Indexing
  27. Partitioners
      • Partitioner Types
        - RandomPartitioner / Murmur3Partitioner
        - ByteOrderedPartitioner
      • Random means that your tokens are random → your ordering is random
      • Ordered means your K → T is a no-op and ordering is lexical
        - For each node
        - And for the ring
  28. Partitioners (cont'd)
      • SELECT * FROM test WHERE token(k) > token(42);
  29. Primary Index Overview
      • Index for all of your row keys
      • Per-node index
      • Partitioner + placement manages which node
      • Keys are just kept in ordered buckets
      • Partitioner determines how K → Token
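
The difference between the two partitioner families can be sketched in a few lines. This is an illustration only: `random_token` uses MD5 in the spirit of RandomPartitioner, `ordered_token` returns the key bytes unchanged, and the sample keys are invented. It does not reproduce Cassandra's exact token ranges or the Murmur3 hash.

```python
import hashlib


def random_token(key: bytes) -> int:
    """Hash-based token: neighbouring keys land at unrelated ring positions."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


def ordered_token(key: bytes) -> bytes:
    """Byte-ordered token: K -> T is a no-op, so ring order is lexical."""
    return key


keys = [b"alice", b"bob", b"carol", b"dave"]

by_ordered = sorted(keys, key=ordered_token)  # same as lexical key order
by_random = sorted(keys, key=random_token)    # order follows the hash instead

print(by_ordered)
print(by_random)
```

With hashed tokens, a range scan over keys is not meaningful, which is why the slide's query compares `token(k)` values rather than the keys themselves.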
  30. Natural Keys
      • Examples:
        - An email address
        - A user id
      • Easy to make the relationship
      • Less de-normalization
      • More risk of an 'UPSERT'
      • Changing the key requires a bulk copy operation
  31. Surrogate Keys
      • Example:
        - UUID
      • Independently generated
      • Allows you to store multiple versions of a user
      • Relationship is now indirect
      • Changing the key requires the creation of a new row, or column
  32. Compound (Composite) Primary Keys
  33. Sorting
      • It's Free!
      • Like Open Source is free
      • ONLY on the second column in compound Primary Key
  34. Secondary Indexes
      • Need for an easy way to do limited ad-hoc queries
      • Supports multiple per row
      • Single clause can support multiple selectors
      • Implemented as a hash map, not B-Tree
      • Low cardinality ONLY
  35. Secondary Indexes
  36. Conditional Operators
  37. Data Modeling
  38. The Basics of C* Modeling
      • Work backwards
        - What does your application do?
        - What are the access patterns?
      • Now design your data model
  39. Procedures
      Consider use case requirements
      • What data?
      • Ordering?
      • Filtering?
      • Grouping?
      • Events in chronological order?
      • Does the data expire?
  40. De-Normalization
      • The New Black: De-Normalization
        - Forget everything you've learned about normalization…then forget it again!!!
      • The Ugly:
        - Resource contention
        - Latency
        - Client-side joins
      • Avoid them in your C* code
  41. Foreign Keys
      • There are no foreign keys
      • No server-side joins
  42. What now?
      • Ideally each query will be one row
        - Compared to other resources, disk space is cheap
      • Reduce disk seeks
      • Reduce network traffic
  43. Workload Preference
      • High level of de-normalization means you may have to write the same data many times
      • Cassandra handles large numbers of writes well
  44. Concurrent Writes
      • A row is always referenced by a Key
      • Keys are just bytes
      • They must be unique within a CF
      • Primary keys are unique
        - But Cassandra will not enforce uniqueness
        - If you are not careful you will accidentally [UPSERT] the whole thing
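
The accidental upsert is easy to reproduce in a toy model. In the sketch below, a plain dictionary stands in for a column family and `insert` is a hypothetical helper that, like a Cassandra write, never checks whether the row key already exists. The second write and its value are invented for illustration.

```python
table = {}  # row key -> {column name: value}


def insert(row_key, columns):
    """Write columns for a row key with no existence check (an upsert)."""
    table.setdefault(row_key, {}).update(columns)


# Two different clients believe they are creating a brand-new row "jbellis".
insert("jbellis", {"name": "Jonathan", "state": "TX"})
insert("jbellis", {"name": "Someone Else"})  # silently overwrites the name

print(table["jbellis"])
```

Neither write fails. The second one replaces `name` and leaves `state` from the first, so the row ends up as a mix of both writes.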
  45. Let's Review Some Examples…
  46. Relational Concept - De-normalization
      • To combine relations into a single row
      • Used in relational modeling to avoid complex joins

      Employees
      id | First   | Last
      1  | Edgar   | Codd
      2  | Raymond | Boyce

      Department
      id | Dept
      1  | Engineering
      2  | Math

      SELECT e.First, e.Last, d.Dept
      FROM Department d, Employees e
      WHERE 1 = e.id AND e.id = d.id

      Take this and then...
  47. Relational Concept - De-normalization
      • Combine table columns into a single view
      • No joins
      • All in how you set the data for fast reads

      Employees
      id | First   | Last  | Dept
      1  | Edgar   | Codd  | Engineering
      2  | Raymond | Boyce | Math

      SELECT First, Last, Dept
      FROM employees
      WHERE id = '1'
  48. Cassandra Concept - One-to-Many
      • Relationship without being relational
      • Users have many videos
      • Wait? Where is the foreign key?

      Users
      username | firstname | lastname | email
      tcodd    | Edgar     | Codd     | tcodd@relational.com
      rboyce   | Raymond   | Boyce    | rboyce@relational.com

      Videos
      videoid  | videoname    | username | description            | tags
      99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
      b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol
  49. Cassandra Concept - One-to-Many
      • Static table to store videos
      • UUID for unique video id
      • Add username to denormalize

      CREATE TABLE videos (
        videoid uuid,
        videoname varchar,
        username varchar,
        description varchar,
        tags varchar,
        upload_date timestamp,
        PRIMARY KEY (videoid)
      );
  50. Cassandra Concept - One-to-Many
      • Lookup video by username
      • Write in two tables at once for fast lookups

      CREATE TABLE username_video_index (
        username varchar,
        videoid uuid,
        upload_date timestamp,
        video_name varchar,
        PRIMARY KEY (username, videoid)
      );

      SELECT video_name
      FROM username_video_index
      WHERE username = 'tcodd' AND videoid = '99051fe9'

      Creates a wide row!
  51. Cassandra concept - Many-to-many
      • Users and videos have many comments

      Users
      username | firstname | lastname | email
      tcodd    | Edgar     | Codd     | tcodd@relational.com
      rboyce   | Raymond   | Boyce    | rboyce@relational.com

      Videos
      videoid  | videoname    | username | description            | tags
      99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
      b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol

      Comments
      username | videoid  | comment
      tcodd    | 99051fe9 | Sweet!
      rboyce   | b3a76c6b | Boring :(
  52. Cassandra concept - Many-to-many
      • Model both sides of the view
      • Insert both when comment is created
      • View from either side

      CREATE TABLE comments_by_user (
        username varchar,
        videoid uuid,
        comment_ts timestamp,
        comment varchar,
        PRIMARY KEY (username, videoid)
      );

      CREATE TABLE comments_by_video (
        videoid uuid,
        username varchar,
        comment_ts timestamp,
        comment varchar,
        PRIMARY KEY (videoid, username)
      );
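
"Insert both when comment is created" can be sketched as a single application-side function that writes to both views. This is a toy in-memory model: the two dictionaries stand in for the `comments_by_user` and `comments_by_video` tables, `add_comment` is a hypothetical helper, and the `comment_ts` column is left out to keep the example short.

```python
from collections import defaultdict

comments_by_user = defaultdict(dict)   # username -> {videoid: comment}
comments_by_video = defaultdict(dict)  # videoid  -> {username: comment}


def add_comment(username, videoid, comment):
    """Write the same comment to both views so either side can be read alone."""
    comments_by_user[username][videoid] = comment
    comments_by_video[videoid][username] = comment


add_comment("tcodd", "99051fe9", "Sweet!")
add_comment("rboyce", "b3a76c6b", "Boring :(")

print(comments_by_user["tcodd"])       # all comments written by one user
print(comments_by_video["b3a76c6b"])   # all comments on one video
```

Each read touches exactly one row in one table, at the cost of writing the comment twice.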
  53. Time Series Data
      • Sensors
        - CPU
        - Network Card
        - Wave-Form
        - Resource Utilization
      • Clickstream data
      • Historical trends
      • Anything that varies on a temporal basis
  54. Timeseries Example
  55. Single Device Per Row
      Single device per row - Time Series Pattern 1
      • The simplest model for storing time series data is creating a wide row of data for each source.
      • The timestamp of the reading will be the column name and the temperature the column value.
      • Since each column is dynamic, our row will grow as needed to accommodate the data.
      • We will also get the built-in sorting of Cassandra to keep everything in order.
      http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling#!pc
  56. Single Device Per Row
      CREATE TABLE temperature (
        weatherstation_id text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id, event_time)
      );
  57. Slice Query
      SELECT temperature
      FROM temperature
      WHERE weatherstation_id='1234ABCD'
        AND event_time > '2013-04-03 07:01:00'
        AND event_time < '2013-04-03 07:04:00';
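
Because the readings in a row are already sorted by `event_time`, a slice query amounts to finding two positions in a sorted list and returning what lies between them. The sketch below shows that idea with `bisect`; the `slice_query` helper and the in-memory list are assumptions for illustration, using the same four readings that the deck inserts later.

```python
import bisect
from datetime import datetime

# One weather station's row: (event_time, temperature), sorted by event_time.
readings = [
    (datetime(2013, 4, 3, 7, 1), "73F"),
    (datetime(2013, 4, 3, 7, 2), "73F"),
    (datetime(2013, 4, 3, 7, 3), "72F"),
    (datetime(2013, 4, 3, 7, 4), "74F"),
]


def slice_query(rows, after, before):
    """Return rows with after < event_time < before (both bounds exclusive)."""
    times = [t for t, _ in rows]
    lo = bisect.bisect_right(times, after)   # first row strictly after `after`
    hi = bisect.bisect_left(times, before)   # first row at or past `before`
    return rows[lo:hi]


result = slice_query(
    readings,
    after=datetime(2013, 4, 3, 7, 1),
    before=datetime(2013, 4, 3, 7, 4),
)
print([temp for _, temp in result])
```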
  58. Partitioning to limit row size
      Partitioning to limit row size – Time Series Pattern 2
      • Cassandra can store up to 2 billion columns per row, but if we're storing data every millisecond you wouldn't even get a month's worth of data.
      • The solution is to use a pattern called row partitioning by adding data to the row key to limit the number of columns you get per device.
      • Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id.
      • This will give us a row per day, per weather station, and an easy way to find the data.
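
The "not even a month" claim can be checked with quick arithmetic, assuming exactly one column is written per millisecond into a single row. The variable names below are made up for this check.

```python
MAX_COLUMNS_PER_ROW = 2_000_000_000       # limit quoted on the slide
WRITES_PER_DAY = 1000 * 60 * 60 * 24      # one column per millisecond

days_until_full = MAX_COLUMNS_PER_ROW / WRITES_PER_DAY
print(f"{WRITES_PER_DAY:,} columns per day, row full after {days_until_full:.1f} days")
```

At 86.4 million columns per day, a single row reaches the limit in a little over 23 days.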
  59. Partitioning to limit row size
      CREATE TABLE temperature_by_day (
        weatherstation_id text,
        date text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY ((weatherstation_id, date), event_time)
      );
  60. Get all the weather data for a single day..
      SELECT *
      FROM temperature_by_day
      WHERE weatherstation_id='1234ABCD'
        AND date='2013-04-03';
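
On the client side, the extra `date` part of the row key has to be derived from each event before it is written. A minimal sketch of that step is below; `partition_key` is a hypothetical helper, and the `YYYY-MM-DD` format is assumed from the date literal in the query above.

```python
from datetime import datetime


def partition_key(weatherstation_id: str, event_time: datetime):
    """Build the (weatherstation_id, date) row key for temperature_by_day."""
    return (weatherstation_id, event_time.strftime("%Y-%m-%d"))


morning = partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1))
evening = partition_key("1234ABCD", datetime(2013, 4, 3, 23, 59))
next_day = partition_key("1234ABCD", datetime(2013, 4, 4, 0, 0))

print(morning, evening, next_day)
```

All readings from the same station and calendar day share one row, and the first reading after midnight starts a new one.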
  61. Reverse Order Time Series/Expiring Columns
      Reverse order timeseries with expiring columns – Time Series Pattern 3
      • Imagine we are using this data for a dashboard application and we only want to show the last 10 temperature readings.
      • Older data is no longer useful, so can be purged eventually.
      • We can take advantage of a feature called expiring columns to have our data quietly disappear after a set number of seconds.
  62. Partitioning to limit row size
      CREATE TABLE latest_temperatures (
        weatherstation_id text,
        event_time timestamp,
        temperature text,
        PRIMARY KEY (weatherstation_id, event_time)
      ) WITH CLUSTERING ORDER BY (event_time DESC);
  63. Insert Data With TTLs
      INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
        VALUES ('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20;
      INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
        VALUES ('1234ABCD','2013-04-03 07:02:00','73F') USING TTL 20;
      INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
        VALUES ('1234ABCD','2013-04-03 07:01:00','73F') USING TTL 20;
      INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
        VALUES ('1234ABCD','2013-04-03 07:04:00','74F') USING TTL 20;
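
The combined effect of descending order and a 20-second TTL can be sketched with a toy model. The `ExpiringRow` class is hypothetical and uses an explicit `now` argument (in seconds) instead of a real clock so the behaviour is reproducible. Unlike Cassandra, which marks expired columns for deletion and purges them later, this model simply hides them on read.

```python
class ExpiringRow:
    """Toy row of readings that expire `ttl` seconds after they are written."""

    def __init__(self):
        self.cells = {}  # event_time -> (value, expires_at)

    def insert(self, event_time, value, ttl, now):
        self.cells[event_time] = (value, now + ttl)

    def latest(self, limit, now):
        """Return up to `limit` live readings, newest event_time first."""
        live = [(t, v) for t, (v, expires_at) in self.cells.items() if expires_at > now]
        live.sort(reverse=True)
        return live[:limit]


row = ExpiringRow()
row.insert("2013-04-03 07:03:00", "72F", ttl=20, now=0)
row.insert("2013-04-03 07:02:00", "73F", ttl=20, now=0)
row.insert("2013-04-03 07:01:00", "73F", ttl=20, now=0)
first_view = row.latest(limit=2, now=10)

row.insert("2013-04-03 07:04:00", "74F", ttl=20, now=15)
later_view = row.latest(limit=10, now=25)

print(first_view)
print(later_view)
```

Ten seconds in, the dashboard sees the two newest readings. By second 25 the first three writes have expired and only the 07:04 reading is still visible.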
  64. Shopping cart use case
      • Store shopping cart data reliably
      • Minimize (or eliminate) downtime. Multi-DC
      • Scale for the “Cyber Monday” problem
      The bad
      • Every minute off-line is lost $$
      • Online shoppers want speed!
  65. Shopping Cart Example
      * Un-ashamedly ripped off from Patrick McFadin's Cassandra Summit 2013 presentation
  66. The 5 C* Commandments for Developers
      1. Start with queries. Don't data model for data modeling's sake. That is sooo turn of the century.
      2. It's ok to duplicate data. Really. Get over it.
      3. C* is designed to read and write sequentially. Great for rotational disk, awesome for SSDs, awful for NAS. So don't do it. Ever.
      4. Secondary indexes are not a band-aid for a poor data model.
      5. Embrace wide rows and de-normalization.
  67. …and Cassandra will not ask if your “wallet is open.”