Modeling Data In Cassandra
     Conceptual Differences Versus RDBMS
    Matthew F. Dennis, DataStax // @mdennis




June 27, 2012
Cassandra Is Not Relational
get out of the relational mindset when working
  with Cassandra (or really any NoSQL DB)
Work Backwards From Queries
   Think in terms of queries, not in terms of
normalizing the data; in fact, you often want to
  denormalize (already common in the data
    warehousing world, even in RDBMS)
OK great, but how do I do that?
Well, you need to know how Cassandra models
     data (it borrows from Google's Bigtable)

   research.google.com/archive/bigtable-osdi06.pdf



   Go Read It!
In Cassandra:
   ➔ data is organized into Keyspaces (usually one per app)

   ➔ each Keyspace can have multiple Column Families

   ➔ each Column Family can have many Rows

   ➔ each Row has a Row Key and a variable number of Columns

   ➔ each Column consists of a Name, Value and Timestamp
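The hierarchy above can be pictured as nested maps. The following is only a mental-model sketch in Python, not how Cassandra stores data internally; the keyspace name `demo_app`, the column family name `users`, and the `put` helper are hypothetical names chosen for illustration.

```python
from collections import namedtuple

# A column is a (name, value, timestamp) triple, as listed above.
Column = namedtuple("Column", ["name", "value", "timestamp"])

# keyspace -> column family -> row key -> {column name: Column}
# "demo_app" and "users" are made-up names for this sketch.
keyspaces = {"demo_app": {"users": {}}}

def put(keyspace, column_family, row_key, name, value, timestamp):
    """Write one column; each row may hold a different set of columns."""
    row = keyspaces[keyspace][column_family].setdefault(row_key, {})
    existing = row.get(name)
    # Assumption for this sketch: the write with the higher timestamp wins.
    if existing is None or timestamp >= existing.timestamp:
        row[name] = Column(name, value, timestamp)

put("demo_app", "users", "thepaul", "office", "Austin", 1)
put("demo_app", "users", "thepaul", "OS", "OSX", 1)
put("demo_app", "users", "thobbs", "office", "Austin", 1)

users = keyspaces["demo_app"]["users"]
print(sorted(users["thepaul"]))  # column names present in one row
print(sorted(users["thobbs"]))   # another row with fewer columns
```

Note that the two rows hold different numbers of columns, which is the "variable number of Columns" point above.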
In Cassandra, Keyspaces:
   ➔ are similar in concept to a “database” in some RDBMSs

   ➔ are stored in separate directories on disk

   ➔ are usually one-to-one with applications

   ➔ are usually the administrative unit for things related to ops

   ➔ contain multiple column families
In Cassandra, In Keyspaces, Column Families:
   ➔ are similar in concept to a “table” in most RDBMSs

   ➔ are stored in separate files on disk (many per CF)

   ➔ are usually approximately one-to-one with query type

   ➔ are usually the administrative unit for things related to your data

   ➔ can contain many (~billion* per node) rows




* for a good sized node
(you can always add nodes)
In Cassandra, In Keyspaces, In Column Families ...
Rows

 thepaul   office: Austin      OS: OSX          twitter: thepaul0


 mdennis    office: UA         OS: Linux        twitter: mdennis


  thobbs   office: Austin   twitter: tylhobbs




Row Keys
thepaul   office: Austin       OS: OSX          twitter: thepaul0


mdennis    office: UA          OS: Linux        twitter: mdennis


thobbs    office: Austin    twitter: tylhobbs




                           Columns
Column Names

thepaul   office: Austin      OS: OSX          twitter: thepaul0


mdennis    office: UA         OS: Linux        twitter: mdennis


thobbs    office: Austin   twitter: tylhobbs
Column Values

thepaul   office: Austin      OS: OSX          twitter: thepaul0


mdennis    office: UA         OS: Linux        twitter: mdennis


thobbs    office: Austin   twitter: tylhobbs
thepaul   office: Austin       OS: OSX          twitter: thepaul0


mdennis    office: UA          OS: Linux        twitter: mdennis


thobbs    office: Austin    twitter: tylhobbs




                           Rows Are Randomly Ordered
                             (if using the RandomPartitioner)
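A rough way to see why row order looks random: the RandomPartitioner places rows by an MD5 hash of the row key rather than by the key itself. The sketch below is an approximation in Python; the `token` function is a hypothetical stand-in, and the exact token arithmetic inside Cassandra differs.

```python
import hashlib

def token(row_key: str) -> int:
    """Hash a row key to an integer (MD5 here, as a stand-in for the partitioner)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

row_keys = ["mdennis", "thepaul", "thobbs"]

# Rows are laid out by token, so the result bears no relation to key order.
by_token = sorted(row_keys, key=token)
print(by_token)
```

The practical consequence is that range scans over row keys are not meaningful under this partitioner, while lookups by exact row key stay cheap.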
thepaul   office: Austin           OS: OSX          twitter: thepaul0


mdennis    office: UA              OS: Linux        twitter: mdennis


thobbs    office: Austin        twitter: tylhobbs




                  Columns Are Ordered by Name
                           (by a configurable comparator)
Columns are ordered because
 doing so allows very efficient
implementations of useful and
     common operations

        (e.g. merge join)
In particular, within a row
columns with a given name can
    be located very quickly.
(ordered names => log(n) binary search)
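As a minimal sketch of that log(n) lookup, assuming a row is held as two parallel lists sorted by column name (a simplification, not Cassandra's storage format), Python's `bisect` module does the binary search. The `find_column` helper is a hypothetical name.

```python
from bisect import bisect_left

def find_column(names, values, wanted):
    """Binary-search the sorted column names; return the value or None."""
    i = bisect_left(names, wanted)
    if i < len(names) and names[i] == wanted:
        return values[i]
    return None

# One row from the example above, with names already in sorted order.
names = ["OS", "office", "twitter"]
values = ["OSX", "Austin", "thepaul0"]

print(find_column(names, values, "office"))  # Austin
print(find_column(names, values, "email"))   # None
```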
More importantly, I can query for a
      slice between a start and end

                 Row Key

RK   ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...


 start                                                                         end
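A slice over sorted column names can be sketched with two binary searches, one for each bound. This assumes integer timestamps as column names and inclusive bounds on both ends; `get_slice` is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_left, bisect_right

def get_slice(names, values, start, end):
    """Return the (name, value) pairs with start <= name <= end."""
    lo = bisect_left(names, start)   # first name >= start
    hi = bisect_right(names, end)    # one past the last name <= end
    return list(zip(names[lo:hi], values[lo:hi]))

names = [0, 10, 20, 30, 40, 50]          # sorted column names (timestamps)
values = ["a", "b", "c", "d", "e", "f"]

print(get_slice(names, values, 15, 40))  # [(20, 'c'), (30, 'd'), (40, 'e')]
```

Everything between the two bounds is contiguous, so the read is one seek followed by a sequential scan.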
Why does that matter?
Because the columns within a row don't have to be static!
    (and random disk seeks are teh evil)
The Column Name Can Be Part of Your Data

  INTC     ts0: $25.20         ts1: $25.25             ...


  AMR       ts0: $6.20          ts9: $0.26             ...


  CRDS      ts0: $1.05          ts5: $6.82             ...




                  Columns Are Ordered by Name
                   (in this case by a TimeUUID Comparator)
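A small sketch of the write side of this pattern: the timestamp is the column name, so each new tick is simply another column in the ticker's row, kept in name order. Integer timestamps stand in for TimeUUIDs here, prices are integer cents, and `record_tick` is a hypothetical helper.

```python
from bisect import insort
from collections import defaultdict

# row key (ticker) -> list of (timestamp, price_in_cents), kept sorted
ticks = defaultdict(list)

def record_tick(ticker, ts, price_cents):
    """Add one column to the ticker's row, preserving order by timestamp."""
    insort(ticks[ticker], (ts, price_cents))

# Ticks may arrive out of order; the row stays sorted by column name.
record_tick("INTC", 1, 2525)
record_tick("INTC", 0, 2520)
record_tick("AMR", 9, 26)
record_tick("AMR", 0, 620)

print(ticks["INTC"])     # [(0, 2520), (1, 2525)]
print(ticks["AMR"][-1])  # most recent AMR tick: (9, 26)
```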
Turns Out That Pattern Comes Up A Lot
  ➔ stock ticks
  ➔ event logs

  ➔ ad clicks/views

  ➔ sensor records

  ➔ access/error logs

  ➔ plane/truck/person/“entity” locations

  ➔…
OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
How it Looks In a RDBMS
                    ticker   timestamp   bid   ask   ...
                    AMR      ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
                    CRDS     ts0         ...   ...   ...
                    ...      ...         ...   ...   ...
Data I Care About   ...      ts0         ...   ...   ...
                    AMR      ts1         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ...         ...   ...   ...
                    ...      ts1         ...   ...   ...
                    AMR      ts2         ...   ...   ...
                    ...      ts2         ...   ...   ...
How it Looks In a RDBMS
             ticker     timestamp   bid   ask   ...
             AMR        ts0         ...   ...   ...



                      Larger Than Your Page Size
Disk Seeks
             AMR        ts1         ...   ...   ...


                      Larger Than Your Page Size

             AMR        ts2         ...   ...   ...
             ...        ts2         ...   ...   ...
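The cost difference can be estimated with a toy model that counts how many distinct pages hold the rows for one ticker. The page size, row counts, and the `pages_touched` helper are all assumptions made for illustration; real page sizes and row widths vary by database.

```python
def pages_touched(rows, page_size, wanted):
    """Count distinct pages that contain at least one row for `wanted`."""
    return len({i // page_size for i, ticker in enumerate(rows) if ticker == wanted})

tickers = ["AMR", "CRDS", "INTC"]
timestamps = range(100)
PAGE_SIZE = 3  # rows per page; deliberately tiny for the sketch

# Arrival order: at each timestamp every ticker writes a row, so tickers interleave.
arrival_order = [t for _ in timestamps for t in tickers]

# Clustered order: all rows for one ticker sit next to each other.
clustered_order = sorted(arrival_order)

print(pages_touched(arrival_order, PAGE_SIZE, "AMR"))    # 100
print(pages_touched(clustered_order, PAGE_SIZE, "AMR"))  # 34
```

In the interleaved layout every AMR row lands on its own page, so reading 100 rows can cost up to 100 seeks; in the clustered layout the same rows fill 34 adjacent pages.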
OK, but what about ...
   ➔ PostgreSQL CLUSTER command?

   ➔ MySQL [InnoDB] clustered indexes?

   ➔ Oracle index-organized tables?

   ➔ SQL Server clustered indexes?

    Meh ...
The on-disk management of that
        clustering results in tons of IO …

In the case of PostgreSQL:

   ➔ clustering is a one time operation
       (implies you must periodically rewrite the entire table)

   ➔ new data is *not* written in clustered order
       (which is often the data you care most about)
OK, so just partition the tables ...
Not a bad idea, except in MySQL there is a limit of
 1024 partitions, and generally fewer if using NDB

 (you should probably still do it if using MySQL though)

  http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
OK fine, I agree storing data that is queried
       together on disk together is a good thing but
          what's that have to do with modeling in
                        Cassandra?

        Seek To Here


 RK    ts0   ts1   ...   ...   tsM ...   ...   ...   ...   tsN ...   ...   ...   ...   ...



                                  Read Precisely My Data *



* more on some caveats later
Well, that's what is meant by “work backwards
from your queries” or “think in terms of queries”

(NB: this concept, in general, applies to RDBMS
 at scale as well; it is not specific to Cassandra)
An Example From Fraud Detection
  To calculate risk it is common to need to know all the
 emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
In a normalized model that usually translates to a
          table for each type of entity being tracked

                id          name         ...           id          device         ...
                1           guy          ...           1000        0xdead         ...
                2           gal          ...           2000        0xb33f         ...
                ...         ...          ...           ...         ...            ...


id       dest         ...          id          email         ...            id          origin    ...
15       USA          ...          100         guy@          ...            150         USA       ...
25       Finland      ...          200         gal@          ...            250         Nigeria   ...
...      ...          ...          ...         ...           ...            ...         ...       ...
The problem is that at scale that also means
        a disk seek for each one …
    (even with perfect index-organized tables et al., if the data is spread across multiple tables)




➔ Previous emails? That's a seek …
➔ Previous devices? That's a seek …

➔ Previous destinations? That's a seek …
But In Cassandra I Store The Data I Query
           Together On Disk Together
               (remember, column names need not be static)


  Data I Care About

acctY    ...          ...          ...       ...        ...      ...         ...
acctX    dest21       dev2         dev7        email3   email9   orig4       ...
acctZ    ...          ...          ...       ...        ...      ...         ...



                            email:cassandra@mailinator.com = dateEmailWasLastUsed




                            Column Name                                  Column Value
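One way to picture the account row: each column name is "<kind>:<value>" and the column value is the date it was last used, so all entries of one kind sit next to each other in sorted order. The sketch below is illustrative only; the dates, the kind prefixes other than `email:`, and the `columns_with_prefix` helper are assumptions.

```python
from bisect import bisect_left

# One wide row for an account, sorted by column name.
# The values (last-used dates) are invented for this sketch.
acct_x = sorted([
    ("dest:Finland", "2012-05-01"),
    ("dev:0xdead", "2012-06-11"),
    ("dev:0xb33f", "2012-06-20"),
    ("email:cassandra@mailinator.com", "2012-06-25"),
    ("orig:USA", "2012-04-02"),
])

def columns_with_prefix(row, prefix):
    """Seek to the first name >= prefix, then scan while the prefix matches."""
    names = [name for name, _ in row]
    out = []
    for name, value in row[bisect_left(names, prefix):]:
        if not name.startswith(prefix):
            break
        out.append((name, value))
    return out

print(columns_with_prefix(acct_x, "dev:"))
print(len(acct_x))  # the whole risk profile is one contiguous row
```

Fetching every email, device, destination and origin for the account is then a read of one row rather than one lookup per table.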
Don't treat Cassandra (or any DB) as a black box
  ➔ Understand how your DBs (and data structures) work

  ➔ Understand the building blocks they provide

  ➔ Understand the work complexity (“big O”) of queries

  ➔ For data sets > memory, goal is to minimize seeks *




* on a related note, SSDs are awesome
Q?

More Related Content

What's hot

Guaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in RustGuaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in Rust
nikomatsakis
 
Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22
nikomatsakis
 
8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages
The World of Smalltalk
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)
nikomatsakis
 
11 bytecode
11 bytecode11 bytecode
Better Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworkingBetter Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworking
Guillermo Gonzalez
 
Senten500.c
Senten500.cSenten500.c
Senten500.c
albertinous
 
Introduction to Rust
Introduction to RustIntroduction to Rust
Introduction to Rust
Jean Carlo Machado
 
Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)
Angel Boy
 
Rust "Hot or Not" at Sioux
Rust "Hot or Not" at SiouxRust "Hot or Not" at Sioux
Rust "Hot or Not" at Sioux
nikomatsakis
 
12 virtualmachine
12 virtualmachine12 virtualmachine
12 virtualmachine
The World of Smalltalk
 
The State of NoSQL
The State of NoSQLThe State of NoSQL
The State of NoSQL
Ben Scofield
 
07 bestpractice
07 bestpractice07 bestpractice
07 bestpractice
The World of Smalltalk
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! 
aleks-f
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
HyeonSeok Choi
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013
aleks-f
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) Exploitation
Angel Boy
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
Alex Miller
 
Python lec4
Python lec4Python lec4
Python lec4
Swarup Ghosh
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performance
aaronmorton
 

What's hot (20)

Guaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in RustGuaranteeing Memory Safety in Rust
Guaranteeing Memory Safety in Rust
 
Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22Rust tutorial from Boston Meetup 2015-07-22
Rust tutorial from Boston Meetup 2015-07-22
 
8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages8 - OOP - Syntax & Messages
8 - OOP - Syntax & Messages
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)
 
11 bytecode
11 bytecode11 bytecode
11 bytecode
 
Better Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworkingBetter Web Clients with Mantle and AFNetworking
Better Web Clients with Mantle and AFNetworking
 
Senten500.c
Senten500.cSenten500.c
Senten500.c
 
Introduction to Rust
Introduction to RustIntroduction to Rust
Introduction to Rust
 
Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)Windows 10 Nt Heap Exploitation (Chinese version)
Windows 10 Nt Heap Exploitation (Chinese version)
 
Rust "Hot or Not" at Sioux
Rust "Hot or Not" at SiouxRust "Hot or Not" at Sioux
Rust "Hot or Not" at Sioux
 
12 virtualmachine
12 virtualmachine12 virtualmachine
12 virtualmachine
 
The State of NoSQL
The State of NoSQLThe State of NoSQL
The State of NoSQL
 
07 bestpractice
07 bestpractice07 bestpractice
07 bestpractice
 
Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! Look Ma, “update DB to HTML5 using C++”, no hands! 
Look Ma, “update DB to HTML5 using C++”, no hands! 
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
 
Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013Dynamic C++ ACCU 2013
Dynamic C++ ACCU 2013
 
MacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) ExploitationMacOS memory allocator (libmalloc) Exploitation
MacOS memory allocator (libmalloc) Exploitation
 
Clojure: The Art of Abstraction
Clojure: The Art of AbstractionClojure: The Art of Abstraction
Clojure: The Art of Abstraction
 
Python lec4
Python lec4Python lec4
Python lec4
 
Apache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and PerformanceApache Cassandra in Bangalore - Cassandra Internals and Performance
Apache Cassandra in Bangalore - Cassandra Internals and Performance
 

Viewers also liked

Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-Patterns
Matthew Dennis
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patterns
Matthew Dennis
 
BigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsBigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current Trends
Matthew Dennis
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
Matthew Dennis
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
Matthew Dennis
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
Dave Gardner
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
Matthew Dennis
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
ebenhewitt
 
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSECassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
DataStax Academy
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in Cassandra
Jim Hatcher
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
jbellis
 
C*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache CassandraC*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache Cassandra
DataStax
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
Matthew Dennis
 
Introduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandraIntroduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandra
Patrick McFadin
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
Dave Gardner
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
Dave Gardner
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
Dave Gardner
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
Patricia Gorla
 
From rdbms to cassandra without a hitch
From rdbms to cassandra without a hitchFrom rdbms to cassandra without a hitch
From rdbms to cassandra without a hitch
Duyhai Doan
 
Cassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsCassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patterns
Duyhai Doan
 

Viewers also liked (20)

Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-Patterns
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patterns
 
BigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current TrendsBigData as a Platform: Cassandra and Current Trends
BigData as a Platform: Cassandra and Current Trends
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
Cassandra Data Modeling
Cassandra Data ModelingCassandra Data Modeling
Cassandra Data Modeling
 
Cassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patternsCassandra concepts, patterns and anti-patterns
Cassandra concepts, patterns and anti-patterns
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
Cassandra Data Model
Cassandra Data ModelCassandra Data Model
Cassandra Data Model
 
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSECassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
Cassandra Day Chicago 2015: Top 5 Tips/Tricks with Apache Cassandra and DSE
 
Introduction to Data Modeling in Cassandra
Introduction to Data Modeling in CassandraIntroduction to Data Modeling in Cassandra
Introduction to Data Modeling in Cassandra
 
Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
 
C*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache CassandraC*ollege Credit: An Introduction to Apache Cassandra
C*ollege Credit: An Introduction to Apache Cassandra
 
Cassandra On EC2
Cassandra On EC2Cassandra On EC2
Cassandra On EC2
 
Introduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandraIntroduction to data modeling with apache cassandra
Introduction to data modeling with apache cassandra
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)Cabs, Cassandra, and Hailo (at Cassandra EU)
Cabs, Cassandra, and Hailo (at Cassandra EU)
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
Introduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and HadoopIntroduction to Real-Time Analytics with Cassandra and Hadoop
Introduction to Real-Time Analytics with Cassandra and Hadoop
 
From rdbms to cassandra without a hitch
From rdbms to cassandra without a hitchFrom rdbms to cassandra without a hitch
From rdbms to cassandra without a hitch
 
Cassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patternsCassandra nice use cases and worst anti patterns
Cassandra nice use cases and worst anti patterns
 

Similar to DZone Cassandra Data Modeling Webinar

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
Mike Acton
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Fact
mediumdata
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
Mike Acton
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
DataStax
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
Altinity Ltd
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
Ruben Verborgh
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
Nathan Milford
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
DataStax
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
aaronmorton
 
Intro to riak
Intro to riakIntro to riak
Intro to riak
Jaseem Abid
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
Jacek Lewandowski
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Heroku
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
Rick Copeland
 
Taming Cassandra
Taming CassandraTaming Cassandra
Taming Cassandra
Dmitry Buzdin
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Cody Ray
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
aaronmorton
 
Cassandra Client Tutorial
Cassandra Client TutorialCassandra Client Tutorial
Cassandra Client Tutorial
Joe McTee
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
Wim Godden
 
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperTokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Connor McDonald
 
Web_Alg_Project
Web_Alg_ProjectWeb_Alg_Project
Web_Alg_Project
Giuseppe Filingeri
 

Similar to DZone Cassandra Data Modeling Webinar (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Apache Cassandra Opinion and Fact
Apache Cassandra Opinion and FactApache Cassandra Opinion and Fact
Apache Cassandra Opinion and Fact
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
 
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentri...
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
 
Cassandra for Sysadmins
Cassandra for SysadminsCassandra for Sysadmins
Cassandra for Sysadmins
 
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
 
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2Cassandra Community Webinar  - Introduction To Apache Cassandra 1.2
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2
 
Intro to riak
Intro to riakIntro to riak
Intro to riak
 
Spark Streaming with Cassandra
Spark Streaming with CassandraSpark Streaming with Cassandra
Spark Streaming with Cassandra
 
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceRob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of Indifference
 
Scaling with MongoDB
Scaling with MongoDBScaling with MongoDB
Scaling with MongoDB
 
Taming Cassandra
Taming CassandraTaming Cassandra
Taming Cassandra
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Cassandra Client Tutorial
Cassandra Client TutorialCassandra Client Tutorial
Cassandra Client Tutorial
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
Tokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java DeveloperTokyo APAC Groundbreakers tour - The Complete Java Developer
Tokyo APAC Groundbreakers tour - The Complete Java Developer
 
Web_Alg_Project
Web_Alg_ProjectWeb_Alg_Project
Web_Alg_Project
 

Recently uploaded

June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

DZone Cassandra Data Modeling Webinar

  • 1. Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis June 27, 2012
  • 2. Cassandra Is Not Relational get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
  • 3. Work Backwards From Queries Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
  • 4. OK great, but how do I do that? Well, you need to know how Cassandra models data (e.g. Google Big Table) research.google.com/archive/bigtable-osdi06.pdf Go Read It!
  • 5. In Cassandra: ➔ data is organized into Keyspaces (usually one per app) ➔ each Keyspace can have multiple Column Families ➔ each Column Family can have many Rows ➔ each Row has a Row Key and a variable number of Columns ➔ each Column consists of a Name, Value and Timestamp
  • 6. In Cassandra, Keyspaces: ➔ are similar in concept to a “database” in some RDBMSs ➔ are stored in separate directories on disk ➔ are usually one-to-one with applications ➔ are usually the administrative unit for things related to ops ➔ contain multiple column families
  • 7. In Cassandra, In Keyspaces, Column Families: ➔ are similar in concept to a “table” in most RDBMSs ➔ are stored in separate files on disk (many per CF) ➔ are usually approximately one-to-one with query type ➔ are usually the administrative unit for things related to your data ➔ can contain many (~billion* per node) rows * for a good sized node (you can always add nodes)
  • 8. In Cassandra, In Keyspaces, In Column Families ...
  • 9. Rows thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Row Keys
  • 10. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Columns
  • 11. Column Names thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs
  • 12. Column Values thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs
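The hierarchy walked through in slides 5 to 12 can be pictured as nested maps. The following is only a minimal in-memory sketch, not how Cassandra stores data: the column family name `Users`, the `insert` helper, and the microsecond timestamp convention are assumptions made for illustration.

```python
import time

# Hypothetical in-memory stand-in for the hierarchy in the slides:
# keyspace -> column family -> row key -> {column name: (value, timestamp)}.
keyspace = {"Users": {}}  # "Users" is an assumed column family name


def insert(column_family, row_key, name, value, timestamp=None):
    """Store one column (name, value, timestamp) under the given row key."""
    if timestamp is None:
        # Microseconds since the epoch; an assumed convention for this sketch.
        timestamp = int(time.time() * 1_000_000)
    row = keyspace[column_family].setdefault(row_key, {})
    row[name] = (value, timestamp)


insert("Users", "thepaul", "office", "Austin")
insert("Users", "thepaul", "OS", "OSX")
insert("Users", "thepaul", "twitter", "thepaul0")
insert("Users", "mdennis", "office", "UA")
insert("Users", "mdennis", "OS", "Linux")
insert("Users", "mdennis", "twitter", "mdennis")
insert("Users", "thobbs", "office", "Austin")
insert("Users", "thobbs", "twitter", "tylhobbs")

# Rows in the same column family may carry different sets of columns.
column_names = {key: sorted(row) for key, row in keyspace["Users"].items()}
```

Here `thobbs` simply has no `OS` column; nothing like a NULL placeholder is stored for it.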
  • 13. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Rows Are Randomly Ordered (if using the RandomPartitioner)
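Why rows come back in an apparently random order under the RandomPartitioner (slide 13) can be sketched as follows. The RandomPartitioner places rows by an MD5 hash of the row key; the `token` helper below is a simplified, hypothetical stand-in for that computation and is not Cassandra's exact token arithmetic.

```python
import hashlib


def token(row_key: str) -> int:
    """Illustrative token: the MD5 digest of the row key read as an integer."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


row_keys = ["mdennis", "thepaul", "thobbs"]

# Rows are placed (and iterated) by token, so the order is unrelated to the
# alphabetical order of the keys themselves.
by_token = sorted(row_keys, key=token)
```

The same key always hashes to the same token, so placement is deterministic even though the resulting order carries no meaning for range queries over row keys.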
  • 14. thepaul office: Austin OS: OSX twitter: thepaul0 mdennis office: UA OS: Linux twitter: mdennis thobbs office: Austin twitter: tylhobbs Columns Are Ordered by Name (by a configurable comparator)
  • 15. Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge join)
  • 16. In particular, within a row columns with a given name can be located very quickly. (ordered names => log(n) binary search)
  • 17. More importantly, I can query for a slice between a start and end [diagram: one row with Row Key RK and columns ts0 ts1 ... tsM ... tsN ...; the slice runs from the "start" column to the "end" column]
  • 18. Why does that matter? Because columns within a row don’t have to be static! (and random disk seeks are teh evil)
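Slides 14 to 17 can be illustrated with a sorted list standing in for one row. This is a sketch under assumptions: integer column names, parallel `names`/`values` lists, and inclusive slice bounds are choices made here, and `get` and `get_slice` are hypothetical helpers rather than any Cassandra API.

```python
from bisect import bisect_left, bisect_right

# One row: column names kept in sorted order, values in a parallel list.
names = [10, 20, 30, 40, 50, 60]  # e.g. timestamps used as column names
values = ["a", "b", "c", "d", "e", "f"]


def get(name):
    """Locate a single column by name with a binary search, O(log n)."""
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return values[i]
    return None


def get_slice(start, end):
    """Return all columns whose names fall in [start, end] (inclusive here)."""
    lo = bisect_left(names, start)
    hi = bisect_right(names, end)
    # The matching columns are contiguous, so this is one sequential read.
    return list(zip(names[lo:hi], values[lo:hi]))
```

For example, `get(30)` finds the column in a few comparisons, and `get_slice(20, 45)` returns the three adjacent columns named 20, 30 and 40 without touching the rest of the row.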
  • 19. The Column Name Can Be Part of Your Data INTC ts0: $25.20 ts1: $25.25 ... AMR ts0: $6.20 ts9: $0.26 ... CRDS ts0: $1.05 ts5: $6.82 ... Columns Are Ordered by Name (in this case by a TimeUUID Comparator)
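The pattern on slide 19, where the column name itself carries data, might look like this in miniature. Assumptions: plain integer timestamps replace the TimeUUID column names from the slide, and `record_tick` and `ticks_between` are hypothetical helpers, not a driver API.

```python
from bisect import bisect_left, bisect_right, insort

# Row key (ticker) -> sorted list of (timestamp, price) columns.
ticks = {}


def record_tick(ticker, timestamp, price):
    """Add a column named by its timestamp, keeping the row sorted."""
    # Simplification: a real column with the same name would be overwritten;
    # this sketch keeps both entries if a timestamp repeats.
    insort(ticks.setdefault(ticker, []), (timestamp, price))


def ticks_between(ticker, start, end):
    """Slice one ticker's row between two timestamps, inclusive."""
    row = ticks.get(ticker, [])
    lo = bisect_left(row, (start,))
    hi = bisect_right(row, (end, float("inf")))
    return row[lo:hi]


# Ticks arrive in any order; each row stays ordered by column name.
record_tick("INTC", 1, 25.25)
record_tick("INTC", 0, 25.20)
record_tick("AMR", 9, 0.26)
record_tick("AMR", 0, 6.20)
```

Each ticker grows as one wide row, so "all AMR ticks between two times" is a single contiguous slice rather than a scan over rows belonging to other tickers.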
  • 20. Turns Out That Pattern Comes Up A Lot ➔ stock ticks ➔ event logs ➔ ad clicks/views ➔ sensor records ➔ access/error logs ➔ plane/truck/person/“entity” locations ➔ ...
  • 21. OK, but I can do that in SQL Not efficiently at scale, at least not easily ...
  • 22. How it Looks In a RDBMS [table with columns ticker, timestamp, bid, ask, ...; the AMR rows for ts0, ts1 and ts2, labeled "Data I Care About", are interleaved with rows for other tickers such as CRDS]
  • 23. How it Looks In a RDBMS [same table with columns ticker, timestamp, bid, ask, ...; the gaps between the AMR rows for ts0, ts1 and ts2 are each labeled "Larger Than Your Page Size", so reading them costs "Disk Seeks"]
  • 24. OK, but what about ... ➔ PostgreSQL Cluster Command? ➔ MySQL Cluster Indexes? ➔ Oracle Index Organized Tables? ➔ SQLServer Clustered Index?
  • 25. OK, but what about ... ➔ PostgreSQL Cluster Using? ➔ MySQL [InnoDB] Cluster Indexes? ➔ Oracle Index Organized Table? ➔ SQLServer Clustered Index? Meh ...
  • 26. The on-disk management of that clustering results in tons of IO ... In the case of PostgreSQL: ➔ clustering is a one time operation (implies you must periodically rewrite the entire table) ➔ new data is *not* written in clustered order (which is often the data you care most about)
  • 27. OK, so just partition the tables ...
  • 28. Not a bad idea, except in MySQL there is a limit of 1024 partitions and generally less if using NDB (you should probably still do it if using MySQL though) http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
  • 29. OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling in Cassandra? [diagram: one row RK with columns ts0 ts1 ... tsM ... tsN ...; "Seek To Here" marks the start of the slice, followed by "Read Precisely My Data" *] * more on some caveats later
  • 30. Well, that's what is meant by “work backwards from your queries” or “think in terms of queries” (NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
  • 31. An Example From Fraud Detection To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone numbers, et cetera ever used for the account in question
  • 32. In a normalized model that usually translates to a table for each type of entity being tracked [five tables, further columns and rows elided: (id, name): 1 guy, 2 gal; (id, device): 1000 0xdead, 2000 0xb33f; (id, dest): 15 USA, 25 Finland; (id, email): 100 guy@, 200 gal@; (id, origin): 150 USA, 250 Nigeria]
  • 33. The problem is that at scale that also means a disk seek for each one ... (even for perfect IOT et al if across multiple tables) ➔ Previous emails? That's a seek ... ➔ Previous devices? That's a seek ... ➔ Previous destinations? That's a seek ...
  • 34. But In Cassandra I Store The Data I Query Together On Disk Together (remember, column names need not be static) [diagram: rows acctY, acctX, acctZ; row acctX, labeled "Data I Care About", holds columns dest21 dev2 dev7 email3 email9 orig4 ...] Column Name: email:cassandra@mailinator.com = Column Value: dateEmailWasLastUsed
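The account row on slide 34 can be sketched as one sorted list of columns whose names carry a type prefix. Everything concrete below is an assumption for illustration: the `type:value` naming beyond the slide's `email:` example, the made-up dates, and the `columns_with_prefix` helper.

```python
from bisect import bisect_left

# Hypothetical wide row for one account: (column name, date last used).
# All values are invented sample data.
account_row = sorted([
    ("dest:Finland", "2012-05-01"),
    ("dev:0xdead", "2012-06-20"),
    ("dev:0xb33f", "2012-04-11"),
    ("email:cassandra@mailinator.com", "2012-06-26"),
    ("orig:USA", "2012-06-01"),
])
names = [name for name, _ in account_row]


def columns_with_prefix(prefix):
    """Slice the row for every column whose name starts with the prefix."""
    lo = bisect_left(names, prefix)
    # "\uffff" sorts after any character expected in these names.
    hi = bisect_left(names, prefix + "\uffff")
    return account_row[lo:hi]
```

Reading `account_row` whole returns every email, device, destination and origin for the account in one contiguous pass, while `columns_with_prefix("dev:")` narrows that to the devices without visiting a separate table.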
  • 35. Don't treat Cassandra (or any DB) as a black box ➔ Understand how your DBs (and data structures) work ➔ Understand the building blocks they provide ➔ Understand the work complexity (“big O”) of queries ➔ For data sets > memory, goal is to minimize seeks * (* on a related note, SSDs are awesome)
  • 36. Q? Modeling Data In Cassandra Conceptual Differences Versus RDBMS Matthew F. Dennis, DataStax // @mdennis