Apache Cassandra Developer Training Slide Deck
 


This course is designed to be a “fast start” on the basics of data modeling with Cassandra. We will cover some basic Administration information upfront that is important to understand as you choose your data model. It is still important to take a proper Admin class if you are responsible for a production instance. This course focuses on CQL3, but Thrift shall not be ignored.

    Presentation Transcript

    • The C* Developer Training Chuck Droukas, Systems Engineer – DataStax
    • Disclaimers • This course is designed to be a “fast start” on the basics of data modeling with Cassandra. • We will cover some basic Administration information upfront that is important to understand as you choose your data model • It is still important to take a proper Admin class if you are responsible for a production instance • This course focuses on CQL3, but Thrift shall not be ignored • Please ask questions and interrupt me. It makes the day go faster for both of us.
    • Agenda • Architecture Overview - Ring Topology - Write Path - Read Path - Updates and Deletes • Break • Columns and their components • Column Families • Lunch • Keyspaces • Complex Queries • Break • Timeseries Example • User Activity Example • Shopping Cart Example • Logging Example
    • The Cassandra Schema Consists of: •Column •Column Family (aka Table) •Keyspace (aka Database) •Cluster
    • High Level Overview Keyspace Column Family /Table Rows Columns
    • Components of the Column The column is the fundamental data type in Cassandra and includes: • Column name • Column value • Timestamp • TTL (Optional)
    • The Column Name Value Timestamp (Name: “firstName”, Value: “Engelbert”, Timestamp: 1363106500)
    • Column Name • Can be any value • Can be any type • Not optional • Must be unique • Stored with every value
    • Column Value • Any value • Any type • Can be empty – but is required
    • Column Names and Values • The data type for a column (or row key) value is called a validator. • The data type for a column name is called a comparator. • Cassandra validates the data type of row keys. • Columns are sorted, and stored in sorted order on disk, so you have to specify a comparator for columns. This can be reversed… more on this later
    • Data Types
    • Column Timestamp • 64-bit integer • Best Practice – Should be created in a consistent manner by all your clients • Required
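The reason client timestamps need to be consistent can be sketched in a few lines of Python. This is a minimal model, not Cassandra's implementation: it assumes that when two versions of the same column exist, the one with the higher timestamp wins. The `Column` class, the `reconcile` helper, the value "Bert", and the tie-breaking rule are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # 64-bit integer supplied by the writing client


def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the higher timestamp.

    Ties keep `a` here; that is a simplification for this sketch.
    """
    return b if b.timestamp > a.timestamp else a


# Two writes to the same column from two different clients.
old = Column("firstName", "Engelbert", 1363106500)
new = Column("firstName", "Bert", 1363106501)  # hypothetical later write

winner = reconcile(old, new)
print(winner.value)
```

If one client generated timestamps in seconds and another in microseconds, the comparison above would always favor the microsecond client, regardless of which write actually happened last.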
    • Column TTL • Defined on INSERT • Positive delay (in seconds) • After time expires it is marked for deletion
    • Special Types of Columns • Super • Counter • Collections
    • Counters • Allows for addition / subtraction • 64-bit value • No timestamp • Deletion does not require a timestamp
    • Collections • New in 1.2! • Set, Map, List
    • SET Example
    • The Cassandra Schema Consists of: •Column •Column Family •Keyspace •Cluster
    • Column Families / Tables •Same as tables -Groupings of Rows - AcID -Eventual Consistency •De-Normalization -To avoid I/O -Simplify the Read Path •Static or Dynamic
    • Static Column Families • Are the most similar to a relational table • Most rows have the same column names • Columns in rows can be different
      Row Key  | Columns
      jbellis  | Name: Jonathan, Email: jb@ds.com, Address: 123 main, State: TX
      dhutch   | Name: Daria, Email: dh@ds.com, Address: 45 2nd St., State: CA
      egilmore | Name: eric, Email: eg@ds.com
    • Dynamic Column Families • Also called “wide rows” • Structured so a query into the row will answer a question jbellis dhutch egilmore datastax mzcassie dhutch egilmore egilmore datastax mzcassie Row Key Columns Subscribers
    • Dynamic Table CQL3 Example CREATE TABLE timeline ( user_id varchar, tweet_id uuid, author varchar, body varchar, PRIMARY KEY (user_id, tweet_id) )
    • Clustering Order • Sorts columns on disk by default • Can change the order
    • The Cassandra Schema Consists of: •Column •Column Family •Keyspace •Cluster
    • Keyspaces •Are groupings of Column Families •Replication strategies •Replication factor CREATE KEYSPACE videodb WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }; In production you would use NetworkTopologyStrategy for multiple DCs. CREATE KEYSPACE "Excalibur" WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
    • Complex Queries Partitioning and Indexing
    • Partitioners • Partitioner Types - RandomPartitioner / Murmur3Partitioner - ByteOrderedPartitioner • Random means that your tokens are random → your ordering is random • Ordered means your K → T is a no-op and ordering is lexical - For each node - And for the ring
    • Partitioners (cont'd) •SELECT * FROM test WHERE token(k) > token(42);
    • Primary Index Overview •Index for all of your row keys •Per-node index •Partitioner + placement manages which node •Keys are just kept in ordered buckets •Partitioner determines how K → Token
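The K → Token → node path described on the partitioner slides can be sketched as follows. This is a simplified stand-in, not the real partitioner: it assumes an MD5-based token over the full 128-bit range and a ring where a key belongs to the first node whose token is greater than or equal to the key's token, wrapping around at the end. The `token_for` function, the `Ring` class, and the node names are hypothetical.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 128  # size of the token range assumed in this sketch


def token_for(key: str) -> int:
    """Hash a row key to a token (MD5 here, as a stand-in for K -> Token)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, node_tokens):
        # Keep (token, node) pairs sorted by token, like ordered buckets.
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def node_for(self, key: str) -> str:
        # First node whose token is >= the key's token; wrap to the start.
        i = bisect.bisect_left(self._tokens, token_for(key))
        return self._entries[i % len(self._entries)][1]


# Three hypothetical nodes with evenly spaced tokens.
ring = Ring({
    "node-a": 0,
    "node-b": TOKEN_SPACE // 3,
    "node-c": 2 * TOKEN_SPACE // 3,
})

for key in ("jbellis", "dhutch", "egilmore"):
    print(key, "->", ring.node_for(key))
```

Because the token comes from a hash, keys that are adjacent lexically land at unrelated places on the ring, which is why ordering is random under a random partitioner.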
    • Natural Keys •Examples: -An email address -A user id •Easy to make the relationship •Less de-normalization •More risk of an 'UPSERT' •Changing the key requires a bulk copy operation
    • Surrogate Keys •Example: -UUID •Independently generated •Allows you to store multiple versions of a user •Relationship is now indirect •Changing the key requires the creation of a new row, or column
    • Compound (Composite) Primary Keys
    • Sorting •It's Free! •Like Open Source is free •ONLY on the second column in compound Primary Key
    • Secondary Indexes •Need for an easy way to do limited ad-hoc queries •Supports multiple per row •Single clause can support multiple selectors •Implemented as a hash map, not B-Tree •Low cardinality ONLY
    • Secondary Indexes
    • Conditional Operators
    • Data Modeling
    • The Basics of C* Modeling •Work backwards -What does your application do? -What are the access patterns? •Now design your data model
    • Procedures Consider use case requirements •What data? •Ordering? •Filtering? •Grouping? •Events in chronological order? •Does the data expire?
    • De-Normalization •The New Black: De-Normalization -Forget everything you've learned about normalization…then forget it again!!! •The Ugly: -Resource contention -Latency -Client-side joins •Avoid them in your C* code
    • Foreign Keys •There are no foreign keys •No server-side joins
    • What now? •Ideally each query will be one row -Compared to other resources, disk space is cheap •Reduce disk seeks •Reduce network traffic
    • Workload Preference •High level of de-normalization means you may have to write the same data many times •Cassandra handles large numbers of writes well
    • Concurrent Writes •A row is always referenced by a Key •Keys are just bytes •They must be unique within a CF •Primary keys are unique -But Cassandra will not enforce uniqueness -If you are not careful you will accidentally [UPSERT] the whole thing
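The accidental upsert can be illustrated with a small in-memory model. This is a sketch only, assuming that an insert on an existing key overwrites the columns it names and leaves the others in place, with no uniqueness check. The `ColumnFamily` class and the sample values are hypothetical.

```python
class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, columns):
        # No "does this key already exist?" check: the write is merged
        # into whatever row is already stored under the key.
        self.rows.setdefault(key, {}).update(columns)


users = ColumnFamily()
users.insert("jbellis", {"name": "Jonathan", "state": "TX"})

# A second client registers the same natural key. Nothing fails;
# the name is silently replaced.
users.insert("jbellis", {"name": "Someone Else"})

print(users.rows["jbellis"])
```

In this model the second insert does not raise an error, so the application has to guard against key collisions itself if they matter.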
    • Let's Review Some Examples…
    • Relational Concept - De-normalization • To combine relations into a single row • Used in relational modeling to avoid complex joins SELECT e.First, e.Last, d.Dept FROM Department d, Employees e WHERE 1 = e.id AND e.id = d.id Take this and then... (Thursday, May 2, 13)
      Employees
      id | First   | Last
      1  | Edgar   | Codd
      2  | Raymond | Boyce
      Department
      id | Dept
      1  | Engineering
      2  | Math
    • Relational Concept - De-normalization • Combine table columns into a single view • No joins • All in how you set the data for fast reads SELECT First, Last, Dept FROM employees WHERE id = '1'
      Employees
      id | First   | Last  | Dept
      1  | Edgar   | Codd  | Engineering
      2  | Raymond | Boyce | Math
    • Cassandra Concept - One-to-Many • Relationship without being relational • Users have many videos • Wait? Where is the foreign key?
      Users
      username | firstname | lastname | email
      tcodd    | Edgar     | Codd     | tcodd@relational.com
      rboyce   | Raymond   | Boyce    | rboyce@relational.com
      Videos
      videoid  | videoname    | username | description            | tags
      99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
      b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol
    • Cassandra Concept - One-to-many • Static table to store videos • UUID for unique video id • Add username to denormalize CREATE TABLE videos ( videoid uuid, videoname varchar, username varchar, description varchar, tags varchar, upload_date timestamp, PRIMARY KEY(videoid) );
    • Cassandra Concept - One-to-Many • Lookup video by username • Write in two tables at once for fast lookups CREATE TABLE username_video_index ( username varchar, videoid uuid, upload_date timestamp, video_name varchar, PRIMARY KEY (username, videoid) ); SELECT video_name FROM username_video_index WHERE username = 'tcodd' AND videoid = '99051fe9' Creates a wide row!
    • Cassandra concept - Many-to-many • Users and videos have many comments
      Users
      username | firstname | lastname | email
      tcodd    | Edgar     | Codd     | tcodd@relational.com
      rboyce   | Raymond   | Boyce    | rboyce@relational.com
      Videos
      videoid  | videoname    | username | description            | tags
      99051fe9 | My funny cat | tcodd    | My cat plays the piano | cats,piano,lol
      b3a76c6b | Math         | tcodd    | Now my dog plays       | dogs,piano,lol
      Comments
      username | videoid  | comment
      tcodd    | 99051fe9 | Sweet!
      rboyce   | b3a76c6b | Boring :(
    • Cassandra concept - Many-to-many • Model both sides of the view • Insert both when comment is created • View from either side CREATE TABLE comments_by_user ( username varchar, videoid uuid, comment_ts timestamp, comment varchar, PRIMARY KEY (username,videoid) ); CREATE TABLE comments_by_video ( videoid uuid, username varchar, comment_ts timestamp, comment varchar, PRIMARY KEY (videoid,username) );
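"Insert both when comment is created" can be sketched as a single application-side function that writes to both views. This is a toy in-memory model, not driver code: the two dictionaries stand in for the comments_by_user and comments_by_video tables, and the `add_comment` helper and the timestamps are hypothetical.

```python
from collections import defaultdict

# username -> {videoid: (comment_ts, comment)}
comments_by_user = defaultdict(dict)
# videoid -> {username: (comment_ts, comment)}
comments_by_video = defaultdict(dict)


def add_comment(username, videoid, comment_ts, comment):
    """Write the same comment to both views so either side can be read directly."""
    comments_by_user[username][videoid] = (comment_ts, comment)
    comments_by_video[videoid][username] = (comment_ts, comment)


add_comment("tcodd", "99051fe9", "2013-05-02 10:00:00", "Sweet!")
add_comment("rboyce", "b3a76c6b", "2013-05-02 10:05:00", "Boring :(")

# Read from either side without a join.
print(comments_by_user["tcodd"])
print(comments_by_video["b3a76c6b"])
```

The model mirrors the primary keys on the slide, so a second comment by the same user on the same video would replace the first one; storing several would need the key to include something like the comment timestamp.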
    • Time Series Data • Sensors - CPU - Network Card - Wave-Form - Resource Utilization • Clickstream data • Historical trends • Anything that varies on a temporal basis
    • Timeseries Example
    • Single Device Per Row Single device per row - Time Series Pattern 1 • The simplest model for storing time series data is creating a wide row of data for each source. • The timestamp of the reading will be the column name and the temperature the column value • Since each column is dynamic, our row will grow as needed to accommodate the data. • We will also get the built-in sorting of Cassandra to keep everything in order. http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling#!pc
    • Single Device Per Row CREATE TABLE temperature ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id,event_time) );
    • Slice Query SELECT temperature FROM temperature WHERE weatherstation_id='1234ABCD' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';
    • Partitioning to limit row size Partitioning to limit row size – Time Series Pattern 2 • Cassandra can store up to 2 billion columns per row, but if we're storing data every millisecond you wouldn't even get a month's worth of data. • The solution is to use a pattern called row partitioning by adding data to the row key to limit the number of columns you get per device. • Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id. • This will give us a row per day, per weather station, and an easy way to find the data.
    • Partitioning to limit row size CREATE TABLE temperature_by_day ( weatherstation_id text, date text, event_time timestamp, temperature text, PRIMARY KEY ((weatherstation_id,date),event_time) );
    • Get all the weather data for a single day.. SELECT * FROM temperature_by_day WHERE weatherstation_id='1234ABCD' AND date='2013-04-03';
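The arithmetic behind "not even a month" and the row-per-day key can be checked with a short Python sketch. The 2 billion column limit and the one-reading-per-millisecond rate come from the slides; the `partition_key` helper is a hypothetical illustration of deriving the (weatherstation_id, date) pair on the client.

```python
from datetime import datetime

MAX_COLUMNS_PER_ROW = 2_000_000_000      # limit quoted on the slide
MS_PER_DAY = 24 * 60 * 60 * 1000         # 86,400,000 readings per day at 1 per ms

# Days until a single-device row is full at one reading per millisecond.
days_until_full = MAX_COLUMNS_PER_ROW / MS_PER_DAY
print(round(days_until_full, 1))


def partition_key(weatherstation_id: str, event_time: datetime):
    """Build the composite row key: station id plus the date part of the event."""
    return (weatherstation_id, event_time.strftime("%Y-%m-%d"))


print(partition_key("1234ABCD", datetime(2013, 4, 3, 7, 1, 0)))
```

With the date in the key, each row holds at most one day of readings, which at one per millisecond is 86.4 million columns, well under the limit.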
    • Reverse Order Time Series/Expiring Columns Reverse order timeseries with expiring columns – Time Series Pattern 3 • Imagine we are using this data for a dashboard application and we only want to show the last 10 temperature readings. • Older data is no longer useful, so can be purged eventually. • We can take advantage of a feature called expiring columns to have our data quietly disappear after a set number of seconds.
    • Partitioning to limit row size CREATE TABLE latest_temperatures ( weatherstation_id text, event_time timestamp, temperature text, PRIMARY KEY (weatherstation_id,event_time) ) WITH CLUSTERING ORDER BY (event_time DESC);
    • Insert Data With TTLs INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES ('1234ABCD','2013-04-03 07:03:00','72F') USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES ('1234ABCD','2013-04-03 07:02:00','73F') USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES ('1234ABCD','2013-04-03 07:01:00','73F') USING TTL 20; INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature) VALUES ('1234ABCD','2013-04-03 07:04:00','74F') USING TTL 20;
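How descending order and TTLs combine for the dashboard case can be modeled in memory. This is a sketch under simplifying assumptions: expired readings are just filtered out on read, time is passed in as a plain number of seconds, and the `LatestTemperatures` class is hypothetical. It reuses the four readings and the 20-second TTL from the inserts on the slide.

```python
class LatestTemperatures:
    """Toy model of a reverse-ordered row with expiring columns."""

    def __init__(self):
        # station -> {event_time: (temperature, expires_at)}
        self._rows = {}

    def insert(self, station, event_time, temperature, ttl, now):
        self._rows.setdefault(station, {})[event_time] = (temperature, now + ttl)

    def latest(self, station, now, limit=10):
        # Keep only readings whose TTL has not elapsed, newest first.
        live = [
            (event_time, temperature)
            for event_time, (temperature, expires_at) in self._rows.get(station, {}).items()
            if expires_at > now
        ]
        live.sort(reverse=True)  # these timestamp strings sort chronologically
        return live[:limit]


table = LatestTemperatures()
readings = [
    ("2013-04-03 07:03:00", "72F"),
    ("2013-04-03 07:02:00", "73F"),
    ("2013-04-03 07:01:00", "73F"),
    ("2013-04-03 07:04:00", "74F"),
]
for event_time, temperature in readings:
    table.insert("1234ABCD", event_time, temperature, ttl=20, now=0)

print(table.latest("1234ABCD", now=10))  # all four, newest first
print(table.latest("1234ABCD", now=20))  # TTL elapsed, nothing left
```

The readings were inserted out of order, yet the read returns them newest first, which is the effect the descending clustering order is meant to give the dashboard.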
    • Shopping cart use case *Store shopping cart data reliably *Minimize (or eliminate) downtime. Multi-dc *Scale for the “Cyber Monday” problem The bad *Every minute off-line is lost $$ *Online shoppers want speed!
    • Shopping Cart Example * Un-ashamedly ripped off from Patrick McFadin's Cassandra Summit 2013 presentation
    • The 5 C* Commandments for Developers 1. Start with queries. Don't data model for data modeling's sake. That is sooo turn of the century. 2. It's ok to duplicate data. Really. Get over it. 3. C* is designed to read and write sequentially. Great for rotational disk, awesome for SSDs, awful for NAS. So don't do it. Ever. 4. Secondary indexes are not a band-aid for a poor data model. 5. Embrace wide rows and de-normalization
    • …and Cassandra will not ask if your “wallet is open.”