Cassandra Data Modeling Workshop
  Matthew F. Dennis // @mdennis
Overview
●   Hopefully interactive
●   Use cases submitted via Google Moderator,
    email, IRC, etc
●   Interesting and/or common requests in the
    slides to get us started
●   Bring up others if you have them!
Data Modeling Goals
●   Keep data queried together on disk together
●   In a more general sense think about the
    efficiency of querying your data and work
    backward from there to a model in Cassandra
●   Don't try to normalize your data (contrary to
    many use cases in relational databases)
●   Usually better to keep a record that something
    happened as opposed to changing a value (not
    always advisable or possible)
ClickStream Data
                     (use case #1)

●   A ClickStream (in this context) is the sequence
    of actions a user of an application performs
●   Usually this refers to clicking links in a WebApp
●   Useful for ad selection, error recording, UI/UX
    improvement, A/B testing, debugging, et cetera
●   Not a lot of detail in the Google Moderator
    request on what the purpose of collecting the
    ClickStream data was – so I made some up
ClickStream Data Defined
●   Record actions of a user within a session for
    debugging purposes if app/browser/page/server
    crashes
Recording Sessions
●   CF for sessions a user has had
    ●   Row Key is user name/id
    ●   Column Name is session id (TimeUUID)
    ●   Column Value is empty (or length of session, or some
        aggregated details about the session after it ended)
●   CF for actual sessions
    ●   Row Key is TimeUUID session id
    ●   Column Name is timestamp/TimeUUID of each click
    ●   Column Value is details about that click (serialized)
UserSessions Column Family
              Session_01    Session_02    Session_03
    userId    (TimeUUID)    (TimeUUID)    (TimeUUID)
              (empty/agg)   (empty/agg)   (empty/agg)


●   Most recent session
●   All sessions for a given time period
Sessions Column Family
                 timestamp_01 timestamp_02 timestamp_03
 SessionId
(TimeUUID)          ClickData      ClickData      ClickData
                 (json/xml/etc) (json/xml/etc) (json/xml/etc)



●   Retrieve entire session's ClickStream (row)
●   Order of clicks/events preserved
●   Retrieve ClickStream for a slice of time within the session
●   First action taken in a session
●   Most recent action taken in a session
●   Why JSON/XML/etc?
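The two column families above can be mimicked in memory to see which reads each one serves. This is only a sketch under stated assumptions: plain Python dicts stand in for column families, small integers stand in for click timestamps, and every name here (`user_sessions`, `sessions`, `start_session`, `record_click`, `clickstream`) is hypothetical. None of it is Cassandra client code.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two column families:
# row key -> {column name -> column value}
user_sessions = defaultdict(dict)   # row key: user id, column name: session id
sessions = defaultdict(dict)        # row key: session id, column name: click time


def start_session(user_id):
    """Create a session id (a version 1, time-based UUID) and file it under the user."""
    session_id = uuid.uuid1()
    user_sessions[user_id][session_id] = b""   # empty for now; aggregates could go here
    return session_id


def record_click(session_id, click_time, details):
    """Store one click; the value is serialized, so click details need no fixed schema."""
    sessions[session_id][click_time] = json.dumps(details)


def clickstream(session_id, start=None, end=None):
    """Return clicks in time order, optionally limited to the slice [start, end]."""
    row = sessions[session_id]
    return [
        (t, json.loads(v))
        for t, v in sorted(row.items())
        if (start is None or t >= start) and (end is None or t <= end)
    ]


sid = start_session("user42")
record_click(sid, 3, {"page": "/checkout"})   # inserted out of order on purpose
record_click(sid, 1, {"page": "/home"})
record_click(sid, 2, {"page": "/cart"})

whole = clickstream(sid)                      # entire session (the row)
first_action = whole[0]                       # first action taken in the session
latest_action = whole[-1]                     # most recent action in the session
middle = clickstream(sid, start=2, end=2)     # slice of time within the session
```

Because column names are sorted within a row, "first", "most recent" and "slice of time" are all reads over one contiguous range of a single row.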
Alternatives?
Of Course
         (depends on what you want to do)
●   Secondary Indexes
●   All Sessions in one row
●   Track by time of activity instead of session
Secondary Indexes Applied
●   Drop UserSessions CF and use secondary
    indexes
●   Uses a “well known” column to record the user
    in the row; secondary index is created on that
    column
●   Doesn't work so well when storing aggregates
    about sessions in the UserSessions CF
●   Better when you want to retrieve all sessions a
    user has had
All Sessions In One Row Applied
●   Row Key is userId
●   Column Name is composite of timestamp and
    sessionId
●   Can efficiently request activity of a user across
    all sessions within a specific time range
●   Rows could potentially grow quite large, be
    careful
●   Reads will almost always require at least two
    seeks on disk
Time Period Partitioning Applied
●   Row Key is composite of userId and time “bucket”
    ●   e.g. jan_2011 or jan_01_2011 for month or day buckets respectively
●   Column Name is TimeUUID of click
●   Column Value is serialized click data
●   Avoids always requiring multiple seeks when the user has old
    data but only recent data is requested
●   Easy to lazily aggregate old activity
●   Can still efficiently request activity of a user across all
    sessions within a specific time range
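Building the bucketed row keys is mostly string formatting. A minimal sketch, assuming a `:` separator between userId and bucket (the slides do not specify one) and hypothetical helper names; the bucket spellings follow the `jan_2011` / `jan_01_2011` examples above.

```python
from datetime import date

# Spelled out rather than taken from strftime("%b") so the result
# does not depend on the process locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_bucket(d):
    """Month bucket name, e.g. jan_2011."""
    return f"{MONTHS[d.month - 1]}_{d.year}"


def day_bucket(d):
    """Day bucket name, e.g. jan_01_2011."""
    return f"{MONTHS[d.month - 1]}_{d.day:02d}_{d.year}"


def row_key(user_id, bucket):
    """Composite row key; the ':' separator is an assumption of this sketch."""
    return f"{user_id}:{bucket}"


def month_row_keys(user_id, start, end):
    """Row keys for every month bucket touched by the range [start, end]."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(row_key(user_id, month_bucket(date(year, month, 1))))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


keys = month_row_keys("user42", date(2010, 11, 20), date(2011, 1, 5))
```

A request for recent activity touches only the newest bucket or two, which is what avoids reading through rows holding old data.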
Rolling Time Window Of Data Points
                    (use case #2)
●   The example given was something similar to RRDTool
●   Essentially store a series of data points within a
    rolling window
●   Common request from Cassandra users for this
    and/or similar
Data Points Defined
●   Each data point has a value (or multiple values)
●   Each data point corresponds to a specific point
    in time or an interval/bucket (e.g. 5th minute of
    17th hour on some date)
Time Window Model
              System7:RenderTime

               TimeUUID0   TimeUUID1     TimeUUID2

    s7:rt        0.051       0.014          0.173

                                     Some request took 0.014 seconds to render


●   Row Key is the id of the time window data you are
    tracking (e.g. server7:render_time)
●   Column Name is timestamp (or TimeUUID) the event
    occurred at
●   Column Value is the value of the event (e.g. 0.051)
The Details
●   Cassandra TTL values are key here
    ●   When you insert each data point set the TTL to the max time
        range you will ever request; there is very little overhead to
        expiring columns
●   When querying, construct TimeUUIDs for the min/max of
    the time range in question and use them as the start/end
    in your get_slice call
●   Consider partitioning the rows by a known time period
    (e.g. “year”) if you plan on keeping a long history of data
    (NB: requires slightly more complex logic in the app if a
    time range spans such a period)
●   Very efficient queries for any window of time
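Constructing the start/end TimeUUIDs for a time range can be sketched with the standard `uuid` module. Assumptions, stated loudly: `uuid_timestamp` and `time_uuid` are hypothetical helpers, the inputs are timezone-aware datetimes, and filling the clock-sequence and node fields with all zeros or all ones is just one plausible way to pick a "lowest" and "highest" UUID for a timestamp. How those non-timestamp bits actually sort depends on the comparator in use, so check that before relying on this at the edges of a range.

```python
import uuid
from datetime import datetime, timezone

# Number of 100 ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid_timestamp(dt):
    """Timezone-aware datetime -> count of 100 ns intervals since the UUID epoch."""
    delta = dt - datetime(1970, 1, 1, tzinfo=timezone.utc)
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 10 + UUID_EPOCH_OFFSET


def time_uuid(dt, lowest=True):
    """Build a version 1 UUID carrying exactly the timestamp of dt."""
    ts = uuid_timestamp(dt)
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000   # top nibble = version 1
    if lowest:
        clock_seq_hi_variant, clock_seq_low, node = 0x80, 0x00, 0x000000000000
    else:
        clock_seq_hi_variant, clock_seq_low, node = 0xBF, 0xFF, 0xFFFFFFFFFFFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


range_start = datetime(2011, 1, 7, 10, 0, tzinfo=timezone.utc)
range_end = datetime(2011, 1, 7, 11, 0, tzinfo=timezone.utc)
slice_start = time_uuid(range_start, lowest=True)
slice_end = time_uuid(range_end, lowest=False)
```

`slice_start` and `slice_end` would then be passed as the start and end column names of the slice.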
Rolling Window Of Counters
                (use case #3)
●   “How to model rolling time window that contains counters with time
    buckets of monthly (12 months), weekly (4 weeks), daily (7 days),
    hourly (24 hours)? Example would be; how many times user logged
    into a system in last 24 hours, last 7 days ...”
●   Timezones and the “rolling window” are what make this interesting
Rolling Time Window Details
●   One row for every granularity you want to track
    (e.g. day, hour)
●   Row Key consists of the granularity, metric, user
    and system
●   Column Name is a “fixed” time bucket on UTC time
●   Column Values are counts of the logins in that
    bucket
●   get_slice calls return multiple counters, which
    are then summed up
Rolling Time Window Counter Model
                     user3:system5:logins:by_day

                                     20110107          ...          20110523
            U3:S5:L:D
                                        2              ...               7

    2 logins on Jan 7th 2011           7 logins on May 23rd 2011
    for user 3 on system 5               for user 3 on system 5


                    user3:system5:logins:by_hour

                                    2011010710         ...         2011052316
            U3:S5:L:H
                                        1              ...               7

one login for user 3 on system 5     7 logins for user 3 on system 5
on Jan 7th 2011 for the 10th hour   on May 23rd 2011 for the 16th hour
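The write side of this model is one increment per granularity. A minimal in-memory sketch, assuming `collections.Counter` objects stand in for counter rows and `record_login` is a hypothetical helper; the row keys and bucket formats follow the diagram above.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

# row key -> {time bucket -> count}; an in-memory stand-in for counter rows
counters = defaultdict(Counter)


def record_login(user, system, when):
    """Bump the day and hour buckets for one login; buckets are fixed on UTC."""
    when = when.astimezone(timezone.utc)
    counters[f"{user}:{system}:logins:by_day"][when.strftime("%Y%m%d")] += 1
    counters[f"{user}:{system}:logins:by_hour"][when.strftime("%Y%m%d%H")] += 1


utc = timezone.utc
record_login("user3", "system5", datetime(2011, 1, 7, 10, 5, tzinfo=utc))
record_login("user3", "system5", datetime(2011, 1, 7, 10, 40, tzinfo=utc))
record_login("user3", "system5", datetime(2011, 1, 7, 23, 0, tzinfo=utc))

# A login reported in local time (UTC-5) lands in the next UTC day's bucket.
local = timezone(timedelta(hours=-5))
record_login("user3", "system5", datetime(2011, 1, 7, 20, 0, tzinfo=local))

by_day = counters["user3:system5:logins:by_day"]
by_hour = counters["user3:system5:logins:by_hour"]
```

Keeping the buckets on UTC means the stored data never depends on who is asking; the timezone only matters at query time.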
Rolling Time Window Queries
●   Time window is rolling and there are other
    timezones besides UTC
    ●   one get_slice for the “middle” counts
    ●   one get_slice for the “left end”
    ●   one get_slice for the “right end”
Example: logins for the past 7 days
●   Determine date/time boundaries
●   Determine UTC days that are wholly contained
    within your boundaries to select and sum
●   Select and sum counters for the remaining hours
    on either side of the UTC days
●   O(1) queries (3 in this case), can be requested
    from C* in parallel
●   NB: some timezones are annoying (e.g. 15-minute
    or 30-minute offsets); I try to ignore them
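The boundary arithmetic for the three slices can be sketched as follows. Assumptions: window boundaries fall on the hour (so the annoying fractional-offset timezones are ignored here too), inputs are timezone-aware datetimes, and `split_window`, `hour_buckets` and `day_buckets` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)


def hour_buckets(start, end):
    """Hourly column names (YYYYMMDDHH) covering [start, end)."""
    names = []
    t = start
    while t < end:
        names.append(t.strftime("%Y%m%d%H"))
        t += HOUR
    return names


def day_buckets(start, end):
    """Daily column names (YYYYMMDD) covering [start, end)."""
    names = []
    t = start
    while t < end:
        names.append(t.strftime("%Y%m%d"))
        t += DAY
    return names


def split_window(start, end):
    """Split [start, end) into left-end hours, whole UTC days, right-end hours."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    first_day = midnight if midnight == start else midnight + DAY
    last_day = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_day >= last_day:
        # No whole UTC day fits in the window: read hourly counters only.
        return hour_buckets(start, end), [], []
    return (hour_buckets(start, first_day),
            day_buckets(first_day, last_day),
            hour_buckets(last_day, end))


# "Past 7 days" as seen by a user at UTC-5, ending at 16:00 local time.
local = timezone(timedelta(hours=-5))
window_end = datetime(2011, 5, 23, 16, tzinfo=local)
left, middle, right = split_window(window_end - timedelta(days=7), window_end)
```

Each of the three lists is contiguous, so its first and last entries can serve as the start and end of one slice: `left` and `right` against the by_hour row, `middle` against the by_day row.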
Alternatives?
                         (of course)
●   If you're counting logins and each user doesn't log
    in hundreds of times a day, just have one row per
    user with a TimeUUID column name for the time the
    login occurred
●   Supports any timezone/range/granularity easily
●   More expensive for large ranges (e.g. year)
    regardless of granularity, so cache results (in C*)
    lazily.
●   NB: caching results for rolling windows is not usually
    helpful (because, well it's rolling and always changes)
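With one column per login, counting becomes a range lookup over sorted column names. A minimal sketch, assuming a sorted Python list of timezone-aware datetimes stands in for a row of TimeUUID columns; `logins`, `record_login` and `count_logins` are hypothetical names.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import datetime, timedelta, timezone

# row key (user id) -> sorted list of login times; stands in for a row
# whose column names are TimeUUIDs
logins = {}


def record_login(user, when):
    """Insert a login time, keeping the row sorted."""
    insort(logins.setdefault(user, []), when)


def count_logins(user, start, end):
    """Count logins in [start, end] for any range, granularity or timezone."""
    row = logins.get(user, [])
    return bisect_right(row, end) - bisect_left(row, start)


now = datetime(2011, 5, 23, 12, tzinfo=timezone.utc)
for hours_ago in (1, 5, 30, 200):
    record_login("user3", now - timedelta(hours=hours_ago))

last_24_hours = count_logins("user3", now - timedelta(hours=24), now)
last_7_days = count_logins("user3", now - timedelta(days=7), now)
```

The cost grows with the number of logins inside the range, which is why large ranges are the case worth caching.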
Eventually Atomic
                            (use case #4)
●   “When there are many to many or one to many relations involved how
    to model that and also keep it atomic? for eg: one user can upload
    many pictures and those pictures can somehow be related to other
    users as well.”
●   Attempting full ACID compliance in distributed systems is a bad idea
    (and impossible in the general sense)
●   However, consistency is important and can certainly be achieved in
    C*
●   Many approaches / alternatives
●   I like the transaction log approach, especially in the context of C*
Transaction Logs
                   (in this context)
●   Records what is going to be performed before it
    is actually performed
●   Performs the actions that need to be atomic (in
    the indivisible sense, not the all at once sense)
●   Marks that the actions were performed
In Cassandra
●   Serialize all actions that need to be performed
    in a single column – JSON, XML, YAML (yuck!),
    cpickle, JSO, et cetera
    ●   Row Key = randomly chosen C* node token
    ●   Column Name = TimeUUID
●   Perform actions
●   Delete Column
Configuration Details
●   Short GC_Grace on the XACT_LOG Column
    Family (e.g. 1 hour)
●   Write to XACT_LOG at CL.QUORUM or
    CL.LOCAL_QUORUM for durability (if it fails
    with an unavailable exception, pick a different
    node token and/or node and try again; same
    semantics as a traditional relational DB)
●   1M memtable ops, 1 hour memtable flush time
Failures
●   Before insert into the XACT_LOG
●   After insert, before actions
●   After insert, in middle of actions
●   After insert, after actions, before delete
●   After insert, after actions, after delete
Recovery
●   Each C* node has a cron job offset from every other
    by some time period
●   Each job runs the same code: multiget_slice for
    all node tokens for all columns older than some
    time period
●   Any columns found need to be replayed in their
    entirety and are deleted after replay (normally
    there are no columns because normally things
    are working normally)
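The log, act, delete cycle and its recovery job can be sketched in memory. Assumptions: dicts stand in for the XACT_LOG column family and for the data being modified, a `(time, random id)` tuple stands in for a TimeUUID column name, plain numbers stand in for clock time, and `run_transaction`, `recover` and `crash_after` are hypothetical names used only to simulate a failure in the middle of the actions.

```python
import json
import uuid

xact_log = {}   # row key (node token) -> {column name -> serialized actions}
data = {}       # stand-in for the column families the actions modify


def apply_actions(actions):
    """Idempotent writes: setting a key to a value can safely be repeated."""
    for key, value in actions:
        data[key] = value


def run_transaction(token, actions, now, crash_after=None):
    """Record what will be done, do it, then delete the record."""
    column = (now, uuid.uuid4().hex)          # stand-in for a TimeUUID column name
    xact_log.setdefault(token, {})[column] = json.dumps(actions)
    if crash_after is not None:
        apply_actions(actions[:crash_after])  # simulate dying partway through
        return
    apply_actions(actions)
    del xact_log[token][column]


def recover(now, min_age):
    """Replay, in full, every log entry older than min_age, then delete it."""
    replayed = 0
    for row in xact_log.values():
        for column in sorted(row):
            logged_at, _ = column
            if now - logged_at >= min_age:
                apply_actions(json.loads(row[column]))
                del row[column]
                replayed += 1
    return replayed


run_transaction("token-a", [("pic:1:owner", "user3"), ("user3:latest", "pic:1")], now=100)
run_transaction("token-b", [("pic:2:owner", "user3"), ("user3:latest", "pic:2")],
                now=105, crash_after=1)

latest_before = data["user3:latest"]                    # second write never happened
pending_before = sum(len(row) for row in xact_log.values())
replayed = recover(now=200, min_age=60)
pending_after = sum(len(row) for row in xact_log.values())
```

Replaying the whole entry, including the action that already succeeded, is harmless only because each action is an idempotent write.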
XACT_LOG Comments
●   Idempotent writes are awesome (that's why this
    works so well)
●   Doesn't work so well for counters (they're not
    idempotent)
●   Clients must be able to deal with temporarily
    inconsistent data (they have to do this anyway)
●   Could use a reliable queuing service (e.g. SQS)
    instead of polling – push to SQS first, then
    XACT log.
Q?
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Cassandra Data Modeling Workshop

  • 1. Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis
  • 2. Overview ● Hopefully interactive ● Use cases submitted via Google Moderator, email, IRC, etc ● Interesting and/or common requests in the slides to get us started ● Bring up others if you have them !
  • 3. Data Modeling Goals ● Keep data queried together on disk together ● In a more general sense think about the efficiency of querying your data and work backward from there to a model in Cassandra ● Don't try to normalize your data (contrary to many use cases in relational databases) ● Usually better to keep a record that something happened as opposed to changing a value (not always advisable or possible)
  • 4. ClickStream Data (use case #1) ● A ClickStream (in this context) is the sequence of actions a user of an application performs ● Usually this refers to clicking links in a WebApp ● Useful for ad selection, error recording, UI/UX improvement, A/B testing, debugging, et cetera ● Not a lot of detail in the Google Moderator request on what the purpose of collecting the ClickStream data was – so I made some up
  • 5. ClickStream Data Defined ● Record actions of a user within a session for debugging purposes if app/browser/page/server crashes
  • 6. Recording Sessions ● CF for sessions a user has had ● Row Key is user name/id ● Column Name is session id (TimeUUID) ● Column Value is empty (or length of session, or some aggregated details about the session after it ended) ● CF for actual sessions ● Row Key is TimeUUID session id ● Column Name is timestamp/TimeUUID of each click ● Column Value is details about that click (serialized)
  • 7. UserSessions Column Family ● Row Key: userId ● Column Names: Session_01, Session_02, Session_03 (each a TimeUUID) ● Column Values: empty/agg ● Most recent session ● All sessions for a given time period
  • 8. Sessions Column Family ● Row Key: SessionId (TimeUUID) ● Column Names: timestamp_01, timestamp_02, timestamp_03 ● Column Values: ClickData (json/xml/etc) ● Retrieve entire session's ClickStream (row) ● Order of clicks/events preserved ● Retrieve ClickStream for a slice of time within the session ● First action taken in a session ● Most recent action taken in a session ● Why JSON/XML/etc?
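
The two column families above can be sketched in memory to show why the listed queries are cheap. This is only an illustration, not Cassandra client code: the `Row` class, `record_click`, and the integer ids standing in for TimeUUIDs are all assumptions made for the example. The point is that columns are kept sorted by name, so "first", "most recent", and "slice of time" are all range reads on one row.

```python
import bisect

class Row:
    """Stand-in for one Cassandra row: columns are kept sorted by name."""
    def __init__(self):
        self._names = []
        self._values = {}

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start=None, finish=None, reverse=False, count=None):
        # Inclusive range over the sorted column names, like a column slice.
        lo = 0 if start is None else bisect.bisect_left(self._names, start)
        hi = len(self._names) if finish is None else bisect.bisect_right(self._names, finish)
        names = self._names[lo:hi]
        if reverse:
            names = names[::-1]
        if count is not None:
            names = names[:count]
        return [(n, self._values[n]) for n in names]

user_sessions = {}  # row key: user id    -> Row of session ids (empty values)
sessions = {}       # row key: session id -> Row of serialized clicks

def record_click(user_id, session_id, ts, click):
    # Integers stand in for TimeUUIDs here (assumption for the sketch).
    user_sessions.setdefault(user_id, Row()).insert(session_id, "")
    sessions.setdefault(session_id, Row()).insert(ts, click)

record_click("user1", 1000, 1001, '{"page": "/home"}')
record_click("user1", 1000, 1005, '{"page": "/cart"}')
record_click("user1", 2000, 2003, '{"page": "/home"}')

most_recent_session = user_sessions["user1"].slice(reverse=True, count=1)[0][0]
first_action = sessions[1000].slice(count=1)[0]
last_action = sessions[1000].slice(reverse=True, count=1)[0]
clicks_in_window = sessions[1000].slice(start=1002, finish=1010)
```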
  • 10. Of Course (depends on what you want to do) ● Secondary Indexes ● All Sessions in one row ● Track by time of activity instead of session
  • 11. Secondary Indexes Applied ● Drop UserSessions CF and use secondary indexes ● Uses a “well known” column to record the user in the row; secondary index is created on that column ● Doesn't work so well when storing aggregates about sessions in the UserSessions CF ● Better when you want to retrieve all sessions a user has had
  • 12. All Sessions In One Row Applied ● Row Key is userId ● Column Name is composite of timestamp and sessionId ● Can efficiently request activity of a user across all sessions within a specific time range ● Rows could potentially grow quite large, be careful ● Reads will almost always require at least two seeks on disk
  • 13. Time Period Partitioning Applied ● Row Key is composite of userId and time “bucket” ● e.g. jan_2011 or jan_01_2011 for month or day buckets respectively ● Column Name is TimeUUID of click ● Column Value is serialized click data ● Avoids always requiring multiple seeks when the user has old data but only recent data is requested ● Easy to lazily aggregate old activity ● Can still efficiently request activity of a user across all sessions within a specific time range
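
A minimal sketch of the bucketed row keys described above, assuming a `:` separator between the userId and the bucket (the slides only give the bucket names `jan_2011` and `jan_01_2011`). The helper names are hypothetical. `row_keys_for_range` lists the rows an application would have to read when a requested time range spans more than one bucket.

```python
from datetime import date, timedelta

_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]

def bucket_name(day, granularity="month"):
    """Return e.g. 'jan_2011' (month bucket) or 'jan_01_2011' (day bucket)."""
    month = _MONTHS[day.month - 1]
    if granularity == "month":
        return f"{month}_{day.year}"
    if granularity == "day":
        return f"{month}_{day.day:02d}_{day.year}"
    raise ValueError(f"unknown granularity: {granularity}")

def row_key(user_id, day, granularity="month"):
    # The ':' separator is an assumption; any unambiguous composite works.
    return f"{user_id}:{bucket_name(day, granularity)}"

def row_keys_for_range(user_id, first_day, last_day, granularity="month"):
    """All row keys that may hold clicks between first_day and last_day (inclusive)."""
    keys = []
    day = first_day
    while day <= last_day:
        key = row_key(user_id, day, granularity)
        if not keys or keys[-1] != key:
            keys.append(key)
        day += timedelta(days=1)
    return keys

keys = row_keys_for_range("user42", date(2010, 12, 30), date(2011, 2, 2))
```

A request for only recent activity touches a single small row, which is the "avoids multiple seeks" benefit the slide mentions.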
  • 14. Rolling Time Window Of Data Points (use case #2) ● The example given was something similar to RRDTool ● Essentially store a series of data points within a rolling window ● A common request from Cassandra users, for this and/or similar
  • 15. Data Points Defined ● Each data point has a value (or multiple values) ● Each data point corresponds to a specific point in time or an interval/bucket (e.g. 5th minute of 17th hour on some date)
  • 16. Time Window Model ● Example row System7:RenderTime (s7:rt): columns TimeUUID0 = 0.051, TimeUUID1 = 0.014, TimeUUID2 = 0.173 ● Some request took 0.014 seconds to render ● Row Key is the id of the time window data you are tracking (e.g. server7:render_time) ● Column Name is timestamp (or TimeUUID) the event occurred at ● Column Value is the value of the event (e.g. 0.051)
  • 17. The Details ● Cassandra TTL values are key here ● When you insert each data point set the TTL to the max time range you will ever request; there is very little overhead to expiring columns ● When querying, construct TimeUUIDs for the min/max of the time range in question and use them as the start/end in your get_slice call ● Consider partitioning the rows by a known time period (e.g. “year”) if you plan on keeping a long history of data (NB: requires slightly more complex logic in the app if a time range spans such a period) ● Very efficient queries for any window of time
  • 18. Rolling Window Of Counters (use case #3) ● “How to model rolling time window that contains counters with time buckets of monthly (12 months), weekly (4 weeks), daily (7 days), hourly (24 hours)? Example would be; how many times user logged into a system in last 24 hours, last 7 days ...” ● Timezones and the “rolling window” are what make this interesting
  • 19. Rolling Time Window Details ● One row for every granularity you want to track (e.g. day, hour) ● Row Key consists of the granularity, metric, user and system ● Column Name is a “fixed” time bucket on UTC time ● Column Values are counts of the logins in that bucket ● get_slice calls to return multiple counters which are then summed up
  • 20. Rolling Time Window Counter Model ● Row user3:system5:logins:by_day (U3:S5:L:D): column 20110107 = 2 (2 logins on Jan 7th 2011 for user 3 on system 5), ..., column 20110523 = 7 (7 logins on May 23rd 2011 for user 3 on system 5) ● Row user3:system5:logins:by_hour (U3:S5:L:H): column 2011010710 = 1 (one login for user 3 on system 5 on Jan 7th 2011 for the 10th hour), ..., column 2011052316 = 7 (annotated as 2 logins for user 3 on system 5 on May 23rd 2011 for the 16th hour)
  • 21. Rolling Time Window Queries ● Time window is rolling and there are other timezones besides UTC ● one get_slice for the “middle” counts ● one get_slice for the “left end” ● one get_slice for the “right end”
  • 22. Example: logins for the past 7 days ● Determine date/time boundaries ● Determine UTC days that are wholly contained within your boundaries to select and sum ● Select and sum counters for the remaining hours on either side of the UTC days ● O(1) queries (3 in this case), can be requested from C* in parallel ● NB: some timezones are annoying (e.g. 15 or 30 minute offsets); I try to ignore them
  • 23. Alternatives? (of course) ● If you're counting logins and each user doesn't log in hundreds of times a day, just have one row per user with a TimeUUID column name for the time the login occurred ● Supports any timezone/range/granularity easily ● More expensive for large ranges (e.g. year) regardless of granularity, so cache results (in C*) lazily. ● NB: caching results for rolling windows is not usually helpful (because, well it's rolling and always changes)
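
The three-slice query from the "logins for the past 7 days" example can be sketched as follows. The column name formats (`YYYYMMDD` for the by_day row, `YYYYMMDDHH` for the by_hour row) follow the counter model slide; the function names and the plain dicts standing in for the two counter rows are assumptions, and the window is assumed to be aligned to whole hours (so the 15 and 30 minute offsets are ignored here as well).

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)
DAY = timedelta(days=1)

def _names(start, end, step, fmt):
    """Column names for every bucket in [start, end)."""
    names = []
    t = start
    while t < end:
        names.append(t.strftime(fmt))
        t += step
    return names

def split_window(start, end):
    """Split [start, end) into (left hours, whole UTC days, right hours)."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_midnight < start:
        first_midnight += DAY
    last_midnight = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_midnight >= last_midnight:
        # No whole UTC day fits inside the window: hourly counters only.
        return _names(start, end, HOUR, "%Y%m%d%H"), [], []
    return (_names(start, first_midnight, HOUR, "%Y%m%d%H"),
            _names(first_midnight, last_midnight, DAY, "%Y%m%d"),
            _names(last_midnight, end, HOUR, "%Y%m%d%H"))

def count_logins(by_day_row, by_hour_row, start, end):
    """Sum the 'middle' daily counters plus the hourly counters at both ends."""
    left, days, right = split_window(start, end)
    return (sum(by_hour_row.get(h, 0) for h in left + right)
            + sum(by_day_row.get(d, 0) for d in days))

# Past 7 days for a user at UTC-7, ending at 10:00 local time on May 23rd 2011.
tz = timezone(timedelta(hours=-7))
start = datetime(2011, 5, 16, 10, tzinfo=tz)
end = datetime(2011, 5, 23, 10, tzinfo=tz)
left, days, right = split_window(start, end)

by_day = {"20110517": 3, "20110523": 7}
by_hour = {"2011051616": 4, "2011051618": 1, "2011052316": 2, "2011052317": 5}
total = count_logins(by_day, by_hour, start, end)
```

In a real deployment each of the three name ranges would become one slice request (first and last name as start and end), and the three requests could be issued in parallel.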
  • 24. Eventually Atomic (use case #4) ● “When there are many to many or one to many relations involved how to model that and also keep it atomic? for eg: one user can upload many pictures and those pictures can somehow be related to other users as well.” ● Attempting full ACID compliance in distributed systems is a bad idea (and impossible in the general sense) ● However, consistency is important and can certainly be achieved in C* ● Many approaches / alternatives ● I like transaction log approach, especially in the context of C*
  • 25. Transaction Logs (in this context) ● Records what is going to be performed before it is actually performed ● Performs the actions that need to be atomic (in the indivisible sense, not the all at once sense) ● Marks that the actions were performed
  • 26. In Cassandra ● Serialize all actions that need to be performed in a single column – JSON, XML, YAML (yuck!), cpickle, JSO, et cetera ● Row Key = randomly chosen C* node token ● Column Name = TimeUUID ● Perform actions ● Delete Column
  • 27. Configuration Details ● Short GC_Grace on the XACT_LOG Column Family (e.g. 1 hour) ● Write to XACT_LOG at CL.QUORUM or CL.LOCAL_QUORUM for durability (if it fails with an unavailable exception, pick a different node token and/or node and try again; same semantics as a traditional relational DB) ● 1M memtable ops, 1 hour memtable flush time
  • 28. Failures ● Before insert into the XACT_LOG ● After insert, before actions ● After insert, in middle of actions ● After insert, after actions, before delete ● After insert, after actions, after delete
  • 29. Recovery ● Each C* node has a cron job offset from every other by some time period ● Each job runs the same code: multiget_slice for all node tokens for all columns older than some time period ● Any columns found need to be replayed in their entirety and are deleted after replay (normally there are no columns because normally things are working normally)
  • 30. XACT_LOG Comments ● Idempotent writes are awesome (that's why this works so well) ● Doesn't work so well for counters (they're not idempotent) ● Clients must be able to deal with temporarily inconsistent data (they have to do this anyway) ● Could use a reliable queuing service (e.g. SQS) instead of polling – push to SQS first, then XACT log.
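
The log, perform, delete cycle and the recovery job can be sketched in memory. Nothing here talks to Cassandra: the dicts standing in for the XACT_LOG column family and the data, the function names, the `crash_after` switch, and the explicit write time stored next to each entry (instead of reading it from the TimeUUID) are all assumptions made to keep the example self-contained. It shows the "after insert, in middle of actions" failure being repaired by a replay, which only works because the writes are idempotent.

```python
import json
import uuid

xact_log = {}  # (node token, TimeUUID) -> (write time, serialized actions)
data = {}      # stand-in for the column families the actions write to

def apply_actions(actions):
    """Idempotent writes: applying them twice leaves the same state."""
    for row_key, column, value in actions:
        data.setdefault(row_key, {})[column] = value

def run_transaction(actions, token, now, crash_after=None):
    """Log what will be done, do it, then delete the log entry."""
    entry = (token, uuid.uuid1())
    xact_log[entry] = (now, json.dumps(actions))
    if crash_after is not None:
        # Simulated client failure part-way through the actions.
        apply_actions(actions[:crash_after])
        return entry
    apply_actions(actions)
    del xact_log[entry]
    return entry

def recover(now, older_than):
    """Replay, in their entirety, log entries older than the threshold."""
    replayed = 0
    for entry, (written_at, payload) in list(xact_log.items()):
        if now - written_at >= older_than:
            apply_actions(json.loads(payload))
            del xact_log[entry]
            replayed += 1
    return replayed

actions = [["user1:pictures", "pic9", "uploaded"],
           ["user2:tagged_in", "pic9", "user1"]]

run_transaction(actions, token="token_a", now=0, crash_after=1)
pending_after_crash = len(xact_log)
half_applied = "user2:tagged_in" not in data

too_early = recover(now=10, older_than=60)   # entry is still young, skipped
replayed = recover(now=120, older_than=60)   # entry is replayed and deleted
```

The age threshold keeps the recovery job from replaying transactions that a healthy client is still in the middle of performing.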
  • 31. Q? Cassandra Data Modeling Workshop Matthew F. Dennis // @mdennis