Cassandra NYC 2011 Data ModelingPresentation Transcript
Data Modeling ExamplesMatthew F. Dennis // @mdennis
Overview● general guiding goals for Cassandra data models● Interesting and/or common examples/questions to get us started● Should be plenty of time at the end for questions, so bring them up if you have them !
Data Modeling Goals● Keep data queried together on disk together● In a more general sense think about the efficiency of querying your data and work backward from there to a model in Cassandra● Usually, you shouldnt try to normalize your data (contrary to many use cases in relational databases)● Usually better to keep a record that something happened as opposed to changing a value (not always the best approach though)
Time Series Data● Easily the most common use of Cassandra ● Financial tick data ● Click streams ● Sensor data ● Performance metrics ● GPS data ● Event logs ● etc, etc, etc ...● All of the above are essentially the same as far as C* is concerned
Time Series Thought Model● Things happen in some timestamp ordered stream and consist of values associated with the given timestamp (i.e. “data points”) – Every 30 seconds record location, speed, heading and engine temp – Every 5 minutes record CPU, IO and Memory usage● We are interested in recreating, aggregating and/or analyzing arbitrary time slices of the stream – Where was agent:007 and what was he doing between 11:21am and 2:38pm yesterday? – What are the last N actions foo did on my site?
Data Points Defined● Each data point has 1-N values● Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5 th minute of 17th hour on some date)
Data Points Mapped to Cassandra● Row Key is id of the data point stream bucketed by time – e.g. plane01:jan_2011 or plane01:jan_01_2011 for month or day buckets respectively● Column Name is TimeUUID(timestamp of date point)● Column Value is serialized data point – JSON, XML, pickle, msgpack, thrift, protobuf, avro, BSON, WTFe● Bucketing – Avoids always requiring multiple seeks when only small slices of the stream are requested (e.g. stream is 5 years old but Im on only interested in Jan 5 th 3 years ago and/or yesterday between 2pm and 3pm). – Make it easy to lazily aggregate old stream activity – Reduces compaction overhead since old rows will never have to be merged again (until you “back fill” and/or delete something)
A Slightly More Concrete Example● Sensor data from airplanes● Every 30 seconds each plane sends latitude+longitude, altitude and wine remaining in mdennis glass.
The Visual plane5:jan_2011 TimeUUID0 TimeUUID1 TimeUUID2 p5:j11 28.90, 124.30 45K feet 28.85, 124.25 44K feet 28.81, 124.22 44K feet 70% 50% 95% Middle of the ocean and half a glass of wine at 44K feet● Row Key is the id of stream being recorded (e.g. plane5:jan_2011)● Column Name is timestamp (or TimeUUID) associated with the data point● Column Value is the value of the event (e.g. protobuf serialized lat/long+alt+wine_level)
Querying● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call● Or use a empty start and/or end along with a count
Bucket Sizes?● Depends greatly on ● Average size of time slice queried ● Average data point size ● Write rate of data points to a stream ● IO capacity of the nodes
So... Bucket Sizes?● No Bigger than a few GB per row ● bucket_size * write_rate * sizeof(avg_data_point)● Bucket size >= average size of time slice queried● No more than maybe 10M entries per row● No more than a month if you have lots of different streams● NB: there are exceptions to all of the above, which are really nothing more than guidelines
Ordering● In cases where the most recent data is the most interesting (e.g. last N events for entity foo or last hour of events for entity bar), you can reverse the comparator (i.e. sort descending instead of ascending) ● http://thelastpickle.com/2011/10/03/Reverse-Comparators/ ● https://issues.apache.org/jira/browse/CASSANDRA-2355
Spanning Buckets● If your time slice spans buckets, youll need to construct all the row keys in question (i.e. number of unique row keys = spans+1)● If you want all the results between the dates, pass all the row keys to multiget_slice with the start and end of the desired time slice● If you only want the first N results within your time slice, lowest latency comes from multiget_slice as above but best efficiency comes from serially paging one row key at a time until your desired count is reached
Expiring Streams (e.g. “I only care about the past year”)● Just set the TTL to the age you want to keep● yeah, thats pretty much it ...
Counters● Sometimes youre only interested in counting things that happened within some time slice● Minor adaptation to the previous content to use counters (be aware they are not idempotent) ● Column names become buckets ● Values become counters
Example: Counting User Logins user3:system5:logins:by_day 20110107 ... 20110523 U3:S5:L:D 2 ... 7 2 logins on Jan 7th 2011 7 logins on May 23rd 2011 for user 3 on system 5 for user 3 on system 5 user3:system5:logins:by_hour 2011010710 ... 2011052316 U3:S5:L:H 1 ... 7one login for user 3 on system 5 2 logins for user 3 on system 5on Jan 7th 2011 for the 10th hour on May 23rd 2011 for the 16th hour
Eventually Atomic● In a legacy RDBMS atomicity is “easy”● Attempting full ACID compliance in distributed systems is a bad idea (and actually impossible in the strictest sense)● However, consistency is important and can certainly be achieved in C*● Many approaches / alternatives● I like a transaction log approach, especially in the context of C*
Transaction Logs (in this context)● Records what is going to be performed before it is actually performed● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense which is usually what people mean when they say isolation)● Marks that the actions were performed
In Cassandra● Serialize all actions that need to be performed in a single column – JSON, XML, YAML (yuck!), pickle, JSO, msgpack, protobuf, et cetera ● Row Key = randomly chosen C* node token ● Column Name = TimeUUID(nowish)● Perform actions● Delete Column
Configuration Details● Short gc_grace_seconds on the XACT_LOG Column Family (e.g. 5 minutes)● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability ● if it fails with an unavailable exception, pick a different node token and/or node and try again (gives same semantics as a relational DB in terms of knowing the state of your transaction)
Failures● Before insert into the XACT_LOG● After insert, before actions● After insert, in middle of actions● After insert, after actions, before delete● After insert, after actions, after delete
Recovery● Each C* has a crond job offset from every other by some time period● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period (the “recovery period”)● Any columns need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working)
XACT_LOG Comments● Idempotent writes are awesome (thats why this works so well)● Doesnt work so well for counters (theyre not idempotent)● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway)
Q?Cassandra Data Modeling Examples Matthew F. Dennis // @mdennis