SlideShare a Scribd company logo
1 of 45
Download to read offline
Five factors to consider when
choosing a big data solution!
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra
how do I



 my application?
                 model

©2012 DataStax
Popular options
  • Key/value
  • Tabular
  • Document
  • Graph?




©2012 DataStax
Schema is your friend

{
         "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
         "name": "jbellis",
         "state": "TX",
         "birthdate": "1/1/1976",
         "email_addresses": ["jbellis@gmail", "jbellis@datastax.com"],
}




    ©2012 DataStax
SQL can be your friend too

 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date
 );



 CREATE INDEX ON users(state);

 SELECT * FROM users
 WHERE state=‘Texas’ AND birth_date > ‘1950-01-01’;




©2012 DataStax
Collections
 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date
 );

 CREATE TABLE users_addresses (
    user_id uuid REFERENCES users,
    email text
 );

 SELECT *
 FROM users NATURAL JOIN users_addresses;




©2012 DataStax
Collections
 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,




                 X
    birth_date date
 );

 CREATE TABLE users_addresses (
    user_id uuid REFERENCES users,
    email text
 );

 SELECT *
 FROM users NATURAL JOIN users_addresses;




©2012 DataStax
Collections
 CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date date,
    email_addresses set<text>
 );

 UPDATE users
 SET email_addresses = email_addresses + {‘jbellis@gmail.com’,
 ‘jbellis@datastax.com’};




©2012 DataStax
Joins don’t scale
  • No joins
  • No subqueries
  • No aggregation functions* or GROUP BY
  • ORDER BY?




©2012 DataStax
SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers
                  WHERE user_id = ’driftx’)

                       followers




                  ?




 ©2012 DataStax
                                    tweets
Clustering in Cassandra
CREATE TABLE timeline (     user_id   tweet_id   _author    _body
  user_id uuid,
  tweet_id timeuuid,        jbellis   3290f9da.. rbranson   lorem
  tweet_author uuid,        jbellis   3895411a..   tjake    ipsum
   tweet_body text,           ...         ...        ...
  PRIMARY KEY (user_id,
                tweet_id)   driftx    3290f9da.. rbranson   lorem
);
                            driftx    71b46a84.. yzhang     dolor
                              ...         ...       ...


                            yukim     3290f9da.. rbranson   lorem
                            yukim     e451dd42..   tjake     amet
                              ...         ...        ...



 ©2012 DataStax
Clustering in Cassandra
CREATE TABLE timeline (     user_id   tweet_id   _author    _body
  user_id uuid,
  tweet_id timeuuid,        jbellis   3290f9da.. rbranson   lorem
  tweet_author uuid,        jbellis   3895411a..   tjake    ipsum
   tweet_body text,           ...         ...        ...
  PRIMARY KEY (user_id,
                tweet_id)   driftx    3290f9da.. rbranson   lorem
);
                            driftx    71b46a84.. yzhang     dolor
                              ...         ...       ...
SELECT * FROM timeline
WHERE user_id = ’driftx’;   yukim     3290f9da.. rbranson   lorem
                            yukim     e451dd42..   tjake     amet
                              ...         ...        ...



 ©2012 DataStax
how does it

                 perform?

©2012 DataStax
Larger than memory datasets




©2012 DataStax
Locking




©2012 DataStax
Efficiency




©2012 DataStax
UPDATE users
 SET email_addresses = email_addresses + {...}
 WHERE user_id = ‘jbellis’;




©2012 DataStax
Durability




©2012 DataStax
C* storage engine very briefly
           write( k1 , c1:v1 )

                                              Memory




                                 Memtable




         Commit log


©2012 DataStax                              Hard drive
write( k1 , c1:v1 )

                                                         Memory
                                 k1 c1:v1




                                            Memtable



                 k1 c1:v1




         Commit log


©2012 DataStax                                         Hard drive
write( k1 , c2:v2 )

                                                    Memory
                                 k1 c1:v1 c2:v2




                 k1 c1:v1
                 k1 c2:v2




©2012 DataStax                                    Hard drive
write(        k2   ,   c1:v1 c2:v2   )

                                                                        Memory
                                                     k1 c1:v1 c2:v2

                                                     k2 c1:v1 c2:v2




                   k1 c1:v1
                   k1 c2:v2
                 k2 c1:v1 c2:v2




©2012 DataStax                                                        Hard drive
write(        k1   ,   c1:v4 c3:v3   )

                                                                              Memory
                                                     k1 c1:v4 c2:v2 c3:v3

                                                     k2 c1:v1 c2:v2




                   k1 c1:v1
                   k1 c2:v2
                 k2 c1:v1 c2:v2
             k1 c1:v4 c3:v3




©2012 DataStax                                                              Hard drive
Memory




                           flush




                                  index
                 cleanup    k1 c1:v4 c2:v2 c3:v3

                            k2 c1:v1 c2:v2


                                                   SSTable




©2012 DataStax                                               Hard drive
No random writes




©2012 DataStax
reads/s            writes/s

                                                                       35000



                                                                      30000


                                                                     25000


                                                                    20000


                                                                   15000


                                                                   10000

                                                               5000
                 Cassandra 0.6
                                                               0
©2012 DataStax
                                           Cassandra 1.0
how does it handle

                 failure?

©2012 DataStax
Classic partitioning with SPOF
                 partition 1   partition 2      partition 3   partition 4




                                         router


                                             client
©2012 DataStax
Availability
  • “High availability implies that a single fault will not bring
            down your system. Not ‘we’ll recover quickly.’”
            -- Ben Coverston: DataStax

     •      “The biggest problem with failover is that you're almost
            never using it until it really hurts. It's like backups that
            you never test.”
            -- Rick Branson: Instagram




©2012 DataStax
Fully distributed, no SPOF
                 client




                          p3
                                p6        p1
                           p1




                                     p1




©2012 DataStax
Multiple datacenters




©2012 DataStax
©2012 DataStax
how does it

                 scale?

©2012 DataStax
Scaling antipatterns
  • Metadata servers
  • Router bottlenecks
  • Overloading existing nodes when adding capacity




©2012 DataStax
©2012 DataStax
how


 is it?
                 flexible

©2012 DataStax
36
Data model: Realtime
     LiveStocks      stock       last
                    GOOG        $95.52
                     AAPL      $186.10
                    AMZN       $112.98


       Portfolios    user       stock       shares
                    jbellis     GOOG          80
                    jbellis     LNKD          20
                    yukim       AMZN         100

      StockHist     stock        date       price
                    GOOG      2011-01-01    $8.23
                    GOOG      2011-01-02    $6.14
                    GOOG      2011-001-03   $7.78
©2012 DataStax
Data model: Analytics
 HistLoss                     worst_date    loss
                 Portfolio1   2011-07-23   -$34.81
                 Portfolio2   2011-03-11 -$11432.24
                 Portfolio3   2011-05-21 -$1476.93




©2012 DataStax
Data model: Analytics
  10dayreturns
          stock      rdate     return
          GOOG    2011-07-25   $8.23
          GOOG    2011-07-24   $6.14
          GOOG    2011-07-23   $7.78
          AAPL    2011-07-25   $15.32
          AAPL    2011-07-24   $12.68


     INSERT OVERWRITE TABLE 10dayreturns
     SELECT a.stock,
            b.date as rdate,
            b.price - a.price
     FROM StockHist a
     JOIN StockHist b
     ON (a.stock = b.stock
         AND date_add(a.date, 10) = b.date);

©2012 DataStax
Data model: Analytics
  portfolio_returns
            portfolio       rdate      preturn
            Portfolio1   2011-07-25    $118.21
            Portfolio1   2011-07-24     $60.78
            Portfolio1   2011-07-23    -$34.81
            Portfolio2   2011-07-25   $2143.92
            Portfolio3   2011-07-24    -$10.19


       INSERT OVERWRITE TABLE portfolio_returns
       SELECT portfolio,
              rdate,
              SUM(b.return)
       FROM portfolios a JOIN 10dayreturns b
       ON (a.stock = b.stock)
       GROUP BY portfolio, rdate;

©2012 DataStax
Data model: Analytics
  HistLoss
                       worst_date    loss
          Portfolio1   2011-07-23   -$34.81
          Portfolio2   2011-03-11 -$11432.24
          Portfolio3   2011-05-21 -$1476.93



    INSERT OVERWRITE TABLE HistLoss
    SELECT a.portfolio, rdate, minp
    FROM (
      SELECT portfolio, min(preturn) as minp
      FROM portfolio_returns
      GROUP BY portfolio
    ) a
    JOIN portfolio_returns b
    ON (a.portfolio = b.portfolio and a.minp = b.preturn);

©2012 DataStax
42
Some Cassandra users




©2012 DataStax
Questions?

Image credits
•    http://www.flickr.com/photos/26817893@N05/2573006312/

•    http://www.flickr.com/photos/rowanbank/7686239548

•    http://www.flickr.com/photos/mervtheswerve/6081933265

•    http://www.flickr.com/photos/dg_pics/2526208830

•    http://www.flickr.com/photos/wainwright/351684037

•    http://www.flickr.com/photos/mikeneilson/1606662529

•    http://www.flickr.com/photos/sbisson/3852905534

•    http://www.flickr.com/photos/breadnbadger/2674928517

More Related Content

What's hot

Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQLEvan Weaver
 
Advanced Windows Debugging
Advanced Windows DebuggingAdvanced Windows Debugging
Advanced Windows DebuggingBala Subra
 
Cassandra summit keynote 2014
Cassandra summit keynote 2014Cassandra summit keynote 2014
Cassandra summit keynote 2014jbellis
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectMorningstar Tech Talks
 
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17Aaron Benton
 
Tokyo cassandra conference 2014
Tokyo cassandra conference 2014Tokyo cassandra conference 2014
Tokyo cassandra conference 2014jbellis
 
Deployment in Oracle SOA Suite and in Oracle BPM Suite
Deployment in Oracle SOA Suite and in Oracle BPM SuiteDeployment in Oracle SOA Suite and in Oracle BPM Suite
Deployment in Oracle SOA Suite and in Oracle BPM SuiteLonneke Dikmans
 
Introduction to NoSQL and Couchbase
Introduction to NoSQL and CouchbaseIntroduction to NoSQL and Couchbase
Introduction to NoSQL and CouchbaseDipti Borkar
 
Akiban Technologies: Renormalize
Akiban Technologies: RenormalizeAkiban Technologies: Renormalize
Akiban Technologies: RenormalizeAriel Weil
 
The Native NDB Engine for Memcached
The Native NDB Engine for MemcachedThe Native NDB Engine for Memcached
The Native NDB Engine for MemcachedJohn David Duncan
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012Chris Richardson
 
Cassandra 2.1
Cassandra 2.1Cassandra 2.1
Cassandra 2.1jbellis
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
What You Need to Know to Move from a Relational to a NoSQL Database
What You Need to Know to Move from a Relational to a NoSQL DatabaseWhat You Need to Know to Move from a Relational to a NoSQL Database
What You Need to Know to Move from a Relational to a NoSQL DatabaseDATAVERSITY
 

What's hot (17)

Cassandra presentation at NoSQL
Cassandra presentation at NoSQLCassandra presentation at NoSQL
Cassandra presentation at NoSQL
 
Advanced Windows Debugging
Advanced Windows DebuggingAdvanced Windows Debugging
Advanced Windows Debugging
 
Cassandra summit keynote 2014
Cassandra summit keynote 2014Cassandra summit keynote 2014
Cassandra summit keynote 2014
 
Ben Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra ProjectBen Coverston - The Apache Cassandra Project
Ben Coverston - The Apache Cassandra Project
 
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17
Couchbase Overview - Monterey Bay Information Technologists Meetup 02.15.17
 
Tokyo cassandra conference 2014
Tokyo cassandra conference 2014Tokyo cassandra conference 2014
Tokyo cassandra conference 2014
 
Deployment in Oracle SOA Suite and in Oracle BPM Suite
Deployment in Oracle SOA Suite and in Oracle BPM SuiteDeployment in Oracle SOA Suite and in Oracle BPM Suite
Deployment in Oracle SOA Suite and in Oracle BPM Suite
 
Grails 2.0 Update
Grails 2.0 UpdateGrails 2.0 Update
Grails 2.0 Update
 
Introduction to NoSQL and Couchbase
Introduction to NoSQL and CouchbaseIntroduction to NoSQL and Couchbase
Introduction to NoSQL and Couchbase
 
Akiban Technologies: Renormalize
Akiban Technologies: RenormalizeAkiban Technologies: Renormalize
Akiban Technologies: Renormalize
 
The Native NDB Engine for Memcached
The Native NDB Engine for MemcachedThe Native NDB Engine for Memcached
The Native NDB Engine for Memcached
 
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012
 
Cassandra11
Cassandra11Cassandra11
Cassandra11
 
Cassandra 2.1
Cassandra 2.1Cassandra 2.1
Cassandra 2.1
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
What You Need to Know to Move from a Relational to a NoSQL Database
What You Need to Know to Move from a Relational to a NoSQL DatabaseWhat You Need to Know to Move from a Relational to a NoSQL Database
What You Need to Know to Move from a Relational to a NoSQL Database
 
Advanced queuinginternals
Advanced queuinginternalsAdvanced queuinginternals
Advanced queuinginternals
 

Similar to Top five questions to ask when choosing a big data solution

Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandrajbellis
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databasesjbellis
 
State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012jbellis
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Futurepcmanus
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversMichaël Figuière
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
Scaling DataStax in Docker
Scaling DataStax in DockerScaling DataStax in Docker
Scaling DataStax in DockerDataStax
 
Tungsten University: Replicate Between MySQL And Oracle
Tungsten University: Replicate Between MySQL And OracleTungsten University: Replicate Between MySQL And Oracle
Tungsten University: Replicate Between MySQL And OracleContinuent
 
Data day texas: Cassandra and the Cloud
Data day texas: Cassandra and the CloudData day texas: Cassandra and the Cloud
Data day texas: Cassandra and the Cloudjbellis
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersMichaël Figuière
 
Big Data Uses with Distributed Asynchronous Object Storage
Big Data Uses with Distributed Asynchronous Object StorageBig Data Uses with Distributed Asynchronous Object Storage
Big Data Uses with Distributed Asynchronous Object StorageIntel® Software
 
Scalability 09262012
Scalability 09262012Scalability 09262012
Scalability 09262012Mike Miller
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierKellyn Pot'Vin-Gorman
 
Breaking the-database-type-barrier-replicating-across-different-dbms
Breaking the-database-type-barrier-replicating-across-different-dbmsBreaking the-database-type-barrier-replicating-across-different-dbms
Breaking the-database-type-barrier-replicating-across-different-dbmsLinas Virbalas
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsServer Density
 

Similar to Top five questions to ask when choosing a big data solution (20)

Massively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache CassandraMassively Scalable NoSQL with Apache Cassandra
Massively Scalable NoSQL with Apache Cassandra
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
State of Cassandra 2012
State of Cassandra 2012State of Cassandra 2012
State of Cassandra 2012
 
DataStax 6 and Beyond
DataStax 6 and BeyondDataStax 6 and Beyond
DataStax 6 and Beyond
 
On Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and FutureOn Cassandra Development: Past, Present and Future
On Cassandra Development: Past, Present and Future
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra Drivers
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Scaling DataStax in Docker
Scaling DataStax in DockerScaling DataStax in Docker
Scaling DataStax in Docker
 
Tungsten University: Replicate Between MySQL And Oracle
Tungsten University: Replicate Between MySQL And OracleTungsten University: Replicate Between MySQL And Oracle
Tungsten University: Replicate Between MySQL And Oracle
 
Data day texas: Cassandra and the Cloud
Data day texas: Cassandra and the CloudData day texas: Cassandra and the Cloud
Data day texas: Cassandra and the Cloud
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for Developers
 
Big Data Uses with Distributed Asynchronous Object Storage
Big Data Uses with Distributed Asynchronous Object StorageBig Data Uses with Distributed Asynchronous Object Storage
Big Data Uses with Distributed Asynchronous Object Storage
 
Scalability 09262012
Scalability 09262012Scalability 09262012
Scalability 09262012
 
Multi-cluster k8ssandra
Multi-cluster k8ssandraMulti-cluster k8ssandra
Multi-cluster k8ssandra
 
Denver SQL Saturday The Next Frontier
Denver SQL Saturday The Next FrontierDenver SQL Saturday The Next Frontier
Denver SQL Saturday The Next Frontier
 
CouchDB
CouchDBCouchDB
CouchDB
 
Copy Data Management for the DBA
Copy Data Management for the DBACopy Data Management for the DBA
Copy Data Management for the DBA
 
Breaking the-database-type-barrier-replicating-across-different-dbms
Breaking the-database-type-barrier-replicating-across-different-dbmsBreaking the-database-type-barrier-replicating-across-different-dbms
Breaking the-database-type-barrier-replicating-across-different-dbms
 
Starburn
StarburnStarburn
Starburn
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & Analytics
 

More from jbellis

Cassandra Summit 2015
Cassandra Summit 2015Cassandra Summit 2015
Cassandra Summit 2015jbellis
 
Cassandra Summit EU 2013
Cassandra Summit EU 2013Cassandra Summit EU 2013
Cassandra Summit EU 2013jbellis
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0jbellis
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Javajbellis
 
Apache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterpriseApache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterprisejbellis
 
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)jbellis
 
Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011jbellis
 
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)jbellis
 
What python can learn from java
What python can learn from javaWhat python can learn from java
What python can learn from javajbellis
 
State of Cassandra, 2011
State of Cassandra, 2011State of Cassandra, 2011
State of Cassandra, 2011jbellis
 
Brisk: more powerful Hadoop powered by Cassandra
Brisk: more powerful Hadoop powered by CassandraBrisk: more powerful Hadoop powered by Cassandra
Brisk: more powerful Hadoop powered by Cassandrajbellis
 
PyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialPyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialjbellis
 
Cassandra 0.7, Los Angeles High Scalability Group
Cassandra 0.7, Los Angeles High Scalability GroupCassandra 0.7, Los Angeles High Scalability Group
Cassandra 0.7, Los Angeles High Scalability Groupjbellis
 
Cassandra devoxx 2010
Cassandra devoxx 2010Cassandra devoxx 2010
Cassandra devoxx 2010jbellis
 
Cassandra FrOSCon 10
Cassandra FrOSCon 10Cassandra FrOSCon 10
Cassandra FrOSCon 10jbellis
 
State of Cassandra, August 2010
State of Cassandra, August 2010State of Cassandra, August 2010
State of Cassandra, August 2010jbellis
 
Cassandra nosql eu 2010
Cassandra nosql eu 2010Cassandra nosql eu 2010
Cassandra nosql eu 2010jbellis
 
What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010jbellis
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamojbellis
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalabilityjbellis
 

More from jbellis (20)

Cassandra Summit 2015
Cassandra Summit 2015Cassandra Summit 2015
Cassandra Summit 2015
 
Cassandra Summit EU 2013
Cassandra Summit EU 2013Cassandra Summit EU 2013
Cassandra Summit EU 2013
 
London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0London + Dublin Cassandra 2.0
London + Dublin Cassandra 2.0
 
Pycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from JavaPycon 2012 What Python can learn from Java
Pycon 2012 What Python can learn from Java
 
Apache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterpriseApache Cassandra: NoSQL in the enterprise
Apache Cassandra: NoSQL in the enterprise
 
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)
 
Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011Cassandra at High Performance Transaction Systems 2011
Cassandra at High Performance Transaction Systems 2011
 
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
 
What python can learn from java
What python can learn from javaWhat python can learn from java
What python can learn from java
 
State of Cassandra, 2011
State of Cassandra, 2011State of Cassandra, 2011
State of Cassandra, 2011
 
Brisk: more powerful Hadoop powered by Cassandra
Brisk: more powerful Hadoop powered by CassandraBrisk: more powerful Hadoop powered by Cassandra
Brisk: more powerful Hadoop powered by Cassandra
 
PyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorialPyCon 2010 SQLAlchemy tutorial
PyCon 2010 SQLAlchemy tutorial
 
Cassandra 0.7, Los Angeles High Scalability Group
Cassandra 0.7, Los Angeles High Scalability GroupCassandra 0.7, Los Angeles High Scalability Group
Cassandra 0.7, Los Angeles High Scalability Group
 
Cassandra devoxx 2010
Cassandra devoxx 2010Cassandra devoxx 2010
Cassandra devoxx 2010
 
Cassandra FrOSCon 10
Cassandra FrOSCon 10Cassandra FrOSCon 10
Cassandra FrOSCon 10
 
State of Cassandra, August 2010
State of Cassandra, August 2010State of Cassandra, August 2010
State of Cassandra, August 2010
 
Cassandra nosql eu 2010
Cassandra nosql eu 2010Cassandra nosql eu 2010
Cassandra nosql eu 2010
 
What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010What every developer should know about database scalability, PyCon 2010
What every developer should know about database scalability, PyCon 2010
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
 
What Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database ScalabilityWhat Every Developer Should Know About Database Scalability
What Every Developer Should Know About Database Scalability
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Top five questions to ask when choosing a big data solution

  • 1. Five factors to consider when choosing a big data solution! Jonathan Ellis CTO, DataStax Project Chair, Apache Cassandra
  • 2. how do I my application? model ©2012 DataStax
  • 3. Popular options • Key/value • Tabular • Document • Graph? ©2012 DataStax
  • 4. Schema is your friend { "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c", "name": "jbellis", "state": "TX", "birthdate": "1/1/1976", "email_addresses": ["jbellis@gmail", "jbellis@datastax.com"], } ©2012 DataStax
  • 5. SQL can be your friend too CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date date ); CREATE INDEX ON users(state); SELECT * FROM users WHERE state=‘Texas’ AND birth_date > ‘1950-01-01’; ©2012 DataStax
  • 6. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date date ); CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text ); SELECT * FROM users NATURAL JOIN users_addresses; ©2012 DataStax
  • 7. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, X birth_date date ); CREATE TABLE users_addresses ( user_id uuid REFERENCES users, email text ); SELECT * FROM users NATURAL JOIN users_addresses; ©2012 DataStax
  • 8. Collections CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date date, email_addresses set<text> ); UPDATE users SET email_addresses = email_addresses + {‘jbellis@gmail.com’, ‘jbellis@datastax.com’}; ©2012 DataStax
  • 9. Joins don’t scale • No joins • No subqueries • No aggregation functions* or GROUP BY • ORDER BY? ©2012 DataStax
  • 10. SELECT * FROM tweets WHERE user_id IN (SELECT follower FROM followers WHERE user_id = ’driftx’) followers ? ©2012 DataStax tweets
  • 11. Clustering in Cassandra CREATE TABLE timeline ( user_id tweet_id _author _body   user_id uuid,   tweet_id timeuuid, jbellis 3290f9da.. rbranson lorem   tweet_author uuid, jbellis 3895411a.. tjake ipsum tweet_body text, ... ... ...   PRIMARY KEY (user_id, tweet_id) driftx 3290f9da.. rbranson lorem ); driftx 71b46a84.. yzhang dolor ... ... ... yukim 3290f9da.. rbranson lorem yukim e451dd42.. tjake amet ... ... ... ©2012 DataStax
  • 12. Clustering in Cassandra CREATE TABLE timeline ( user_id tweet_id _author _body   user_id uuid,   tweet_id timeuuid, jbellis 3290f9da.. rbranson lorem   tweet_author uuid, jbellis 3895411a.. tjake ipsum tweet_body text, ... ... ...   PRIMARY KEY (user_id, tweet_id) driftx 3290f9da.. rbranson lorem ); driftx 71b46a84.. yzhang dolor ... ... ... SELECT * FROM timeline WHERE user_id = ’driftx’; yukim 3290f9da.. rbranson lorem yukim e451dd42.. tjake amet ... ... ... ©2012 DataStax
  • 13. how does it perform? ©2012 DataStax
  • 14. Larger than memory datasets ©2012 DataStax
  • 17. UPDATE users SET email_addresses = email_addresses + {...} WHERE user_id = ‘jbellis’; ©2012 DataStax
  • 19. C* storage engine very briefly write( k1 , c1:v1 ) Memory Memtable Commit log ©2012 DataStax Hard drive
  • 20. write( k1 , c1:v1 ) Memory k1 c1:v1 Memtable k1 c1:v1 Commit log ©2012 DataStax Hard drive
  • 21. write( k1 , c2:v2 ) Memory k1 c1:v1 c2:v2 k1 c1:v1 k1 c2:v2 ©2012 DataStax Hard drive
  • 22. write( k2 , c1:v1 c2:v2 ) Memory k1 c1:v1 c2:v2 k2 c1:v1 c2:v2 k1 c1:v1 k1 c2:v2 k2 c1:v1 c2:v2 ©2012 DataStax Hard drive
  • 23. write( k1 , c1:v4 c3:v3 ) Memory k1 c1:v4 c2:v2 c3:v3 k2 c1:v1 c2:v2 k1 c1:v1 k1 c2:v2 k2 c1:v1 c2:v2 k1 c1:v4 c3:v3 ©2012 DataStax Hard drive
  • 24. Memory flush index cleanup k1 c1:v4 c2:v2 c3:v3 k2 c1:v1 c2:v2 SSTable ©2012 DataStax Hard drive
  • 26. reads/s writes/s 35000 30000 25000 20000 15000 10000 5000 Cassandra 0.6 0 ©2012 DataStax Cassandra 1.0
  • 27. how does it handle failure? ©2012 DataStax
  • 28. Classic partitioning with SPOF partition 1 partition 2 partition 3 partition 4 router client ©2012 DataStax
  • 29. Availability • “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston: DataStax • “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson: Instagram ©2012 DataStax
  • 30. Fully distributed, no SPOF client p3 p6 p1 p1 p1 ©2012 DataStax
  • 33. how does it scale? ©2012 DataStax
  • 34. Scaling antipatterns • Metadata servers • Router bottlenecks • Overloading existing nodes when adding capacity ©2012 DataStax
  • 36. how is it? flexible ©2012 DataStax
  • 37. 36
  • 38. Data model: Realtime LiveStocks stock last GOOG $95.52 AAPL $186.10 AMZN $112.98 Portfolios user stock shares jbellis GOOG 80 jbellis LNKD 20 yukim AMZN 100 StockHist stock date price GOOG 2011-01-01 $8.23 GOOG 2011-01-02 $6.14 GOOG 2011-001-03 $7.78 ©2012 DataStax
  • 39. Data model: Analytics HistLoss worst_date loss Portfolio1 2011-07-23 -$34.81 Portfolio2 2011-03-11 -$11432.24 Portfolio3 2011-05-21 -$1476.93 ©2012 DataStax
  • 40. Data model: Analytics 10dayreturns stock rdate return GOOG 2011-07-25 $8.23 GOOG 2011-07-24 $6.14 GOOG 2011-07-23 $7.78 AAPL 2011-07-25 $15.32 AAPL 2011-07-24 $12.68 INSERT OVERWRITE TABLE 10dayreturns SELECT a.stock, b.date as rdate, b.price - a.price FROM StockHist a JOIN StockHist b ON (a.stock = b.stock AND date_add(a.date, 10) = b.date); ©2012 DataStax
  • 41. Data model: Analytics portfolio_returns portfolio rdate preturn Portfolio1 2011-07-25 $118.21 Portfolio1 2011-07-24 $60.78 Portfolio1 2011-07-23 -$34.81 Portfolio2 2011-07-25 $2143.92 Portfolio3 2011-07-24 -$10.19 INSERT OVERWRITE TABLE portfolio_returns SELECT portfolio, rdate, SUM(b.return) FROM portfolios a JOIN 10dayreturns b ON (a.stock = b.stock) GROUP BY portfolio, rdate; ©2012 DataStax
  • 42. Data model: Analytics HistLoss worst_date loss Portfolio1 2011-07-23 -$34.81 Portfolio2 2011-03-11 -$11432.24 Portfolio3 2011-05-21 -$1476.93 INSERT OVERWRITE TABLE HistLoss SELECT a.portfolio, rdate, minp FROM ( SELECT portfolio, min(preturn) as minp FROM portfolio_returns GROUP BY portfolio ) a JOIN portfolio_returns b ON (a.portfolio = b.portfolio and a.minp = b.preturn); ©2012 DataStax
  • 43. 42
  • 45. Questions? Image credits • http://www.flickr.com/photos/26817893@N05/2573006312/ • http://www.flickr.com/photos/rowanbank/7686239548 • http://www.flickr.com/photos/mervtheswerve/6081933265 • http://www.flickr.com/photos/dg_pics/2526208830 • http://www.flickr.com/photos/wainwright/351684037 • http://www.flickr.com/photos/mikeneilson/1606662529 • http://www.flickr.com/photos/sbisson/3852905534 • http://www.flickr.com/photos/breadnbadger/2674928517