Top five questions to ask when choosing a big data solution
 

    Presentation Transcript

    • Five factors to consider when choosing a big data solution! Jonathan Ellis, CTO, DataStax; Project Chair, Apache Cassandra
    • How do I model my application? ©2012 DataStax
    • Popular options
      • Key/value
      • Tabular
      • Document
      • Graph?
    • Schema is your friend
        {
          "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
          "name": "jbellis",
          "state": "TX",
          "birthdate": "1/1/1976",
          "email_addresses": ["jbellis@gmail", "jbellis@datastax.com"]
        }
    • SQL can be your friend too
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date
        );
        CREATE INDEX ON users(state);
        SELECT * FROM users WHERE state='Texas' AND birth_date > '1950-01-01';
    • Collections: the relational approach, which the next build of the slide crosses out with an X
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date
        );
        CREATE TABLE users_addresses (
          user_id uuid REFERENCES users,
          email text
        );
        SELECT * FROM users NATURAL JOIN users_addresses;
    • Collections
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date,
          email_addresses set<text>
        );
        UPDATE users SET email_addresses = email_addresses + {'jbellis@gmail.com', 'jbellis@datastax.com'};
    • Joins don’t scale
      • No joins
      • No subqueries
      • No aggregation functions* or GROUP BY
      • ORDER BY?
    • SELECT * FROM tweets WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx')
      [Diagram: followers, ?, tweets]
    • Clustering in Cassandra
        CREATE TABLE timeline (
          user_id uuid,
          tweet_id timeuuid,
          tweet_author uuid,
          tweet_body text,
          PRIMARY KEY (user_id, tweet_id)
        );

        user_id  tweet_id    _author   _body
        jbellis  3290f9da..  rbranson  lorem
        jbellis  3895411a..  tjake     ipsum
        ...      ...         ...
        driftx   3290f9da..  rbranson  lorem
        driftx   71b46a84..  yzhang    dolor
        ...      ...         ...
        yukim    3290f9da..  rbranson  lorem
        yukim    e451dd42..  tjake     amet
        ...      ...         ...

        SELECT * FROM timeline WHERE user_id = 'driftx';
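One way to picture the clustering slide: rows that share a partition key (user_id) sit together, ordered by the clustering column (tweet_id), and the sample data suggests each tweet is copied into every follower's partition at write time. Reading one user's timeline then touches a single partition instead of joining followers to tweets. The Python sketch below is only a toy model under those assumptions. The `Timeline` class and the `fan_out` helper are hypothetical names, not part of Cassandra or any driver, and small integers stand in for timeuuid values.

```python
import bisect
from collections import defaultdict


class Timeline:
    """Toy model of a table with PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        # partition key -> list of (tweet_id, author, body), kept sorted by tweet_id
        self._partitions = defaultdict(list)

    def insert(self, user_id, tweet_id, author, body):
        # Keep each partition ordered by the clustering column as rows arrive.
        # (Real upsert semantics for a repeated tweet_id are not modeled here.)
        bisect.insort(self._partitions[user_id], (tweet_id, author, body))

    def select(self, user_id):
        # Similar in spirit to: SELECT * FROM timeline WHERE user_id = ?
        return list(self._partitions.get(user_id, []))


def fan_out(timeline, followers, tweet_id, author, body):
    """Denormalize at write time: copy the tweet into each follower's partition."""
    for follower in followers:
        timeline.insert(follower, tweet_id, author, body)


timeline = Timeline()
fan_out(timeline, ["jbellis", "driftx", "yukim"], 1, "rbranson", "lorem")
fan_out(timeline, ["driftx"], 3, "yzhang", "dolor")
fan_out(timeline, ["jbellis"], 2, "tjake", "ipsum")
print(timeline.select("driftx"))
```

The trade-off in this sketch is more work and more storage on write in exchange for a read that needs no join.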
    • How does it perform?
    • Larger-than-memory datasets
    • Locking
    • Efficiency
    • UPDATE users SET email_addresses = email_addresses + {...} WHERE user_id = 'jbellis';
    • Durability
    • C* storage engine very briefly
        write(k1, c1:v1)
        Memory: Memtable (empty)
        Hard drive: Commit log (empty)
    • write(k1, c1:v1)
        Memory (Memtable):       k1 | c1:v1
        Hard drive (Commit log): k1 c1:v1
    • write(k1, c2:v2)
        Memory (Memtable):       k1 | c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1
                                 k1 c2:v2
    • write(k2, c1:v1 c2:v2)
        Memory (Memtable):       k1 | c1:v1 c2:v2
                                 k2 | c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1
                                 k1 c2:v2
                                 k2 c1:v1 c2:v2
    • write(k1, c1:v4 c3:v3)
        Memory (Memtable):       k1 | c1:v4 c2:v2 c3:v3
                                 k2 | c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1
                                 k1 c2:v2
                                 k2 c1:v1 c2:v2
                                 k1 c1:v4 c3:v3
    • flush
        Memory: contents flushed to the hard drive
        Hard drive (SSTable):    k1 | c1:v4 c2:v2 c3:v3
                                 k2 | c1:v1 c2:v2
        Other diagram labels: index, cleanup
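The write path walked through in the slides above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: `ToyStorageEngine` and its attributes are hypothetical names, plain lists and dicts stand in for files on disk, and treating the slide's "cleanup" label as discarding commit log entries after a flush is an assumption.

```python
class ToyStorageEngine:
    """Toy write path: append to a commit log, update a memtable, flush to an SSTable."""

    def __init__(self):
        self.commit_log = []   # stands in for an append-only file on disk
        self.memtable = {}     # key -> {column: value}, held in memory
        self.sstables = []     # each flush adds one immutable, key-sorted table

    def write(self, key, columns):
        # 1. Sequential append for durability; nothing on disk is modified in place.
        self.commit_log.append((key, dict(columns)))
        # 2. Merge the new columns into the in-memory row.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out in key order as one immutable SSTable.
        sstable = tuple((key, dict(self.memtable[key])) for key in sorted(self.memtable))
        self.sstables.append(sstable)
        # Assumed "cleanup": the flushed data is on disk, so the log entries
        # covering it are no longer needed.
        self.memtable = {}
        self.commit_log = []
        return sstable


engine = ToyStorageEngine()
engine.write("k1", {"c1": "v1"})
engine.write("k1", {"c2": "v2"})
engine.write("k2", {"c1": "v1", "c2": "v2"})
engine.write("k1", {"c1": "v4", "c3": "v3"})
print(engine.memtable)         # merged rows, as on the last write slide
print(len(engine.commit_log))  # one log entry per write
print(engine.flush())
```

Both the log append and the flush are sequential writes, which is the point the "No random writes" slide goes on to make.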
    • No random writes
    • [Chart: reads/s and writes/s for Cassandra 0.6 and Cassandra 1.0; y-axis from 0 to 35,000]
    • How does it handle failure?
    • Classic partitioning with SPOF (single point of failure) [diagram: client, router, partitions 1 through 4]
    • Availability
      • “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston, DataStax
      • “The biggest problem with failover is that you’re almost never using it until it really hurts. It’s like backups that you never test.” -- Rick Branson, Instagram
    • Fully distributed, no SPOF [diagram: client and nodes labeled p3, p6, p1, p1, p1]
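The diagram on the slide above shows the same partition (p1) held by several nodes and no router in front of them. A minimal sketch of that idea, assuming a hash ring in which each key is stored on the next few nodes clockwise from its hash, might look like the following. The `Ring` class, the node names, the replication factor of 3, and the use of MD5 are all illustrative assumptions, not Cassandra's actual placement code.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is just an illustrative hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node owns one token; sort so the ring can be walked clockwise.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, down=()):
        """Return the live nodes among the replicas responsible for key."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(self.replication_factor, len(self._ring))
        owners = [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
        return [node for node in owners if node not in down]


ring = Ring(["node1", "node2", "node3", "node4", "node5", "node6"])
owners = ring.replicas("jbellis")
print(owners)
# With one replica down, the remaining replicas can still serve the key.
print(ring.replicas("jbellis", down={owners[0]}))
```

Because every node can compute the same placement from the key alone, no single router or metadata server has to sit on the request path in this sketch.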
    • Multiple datacenters
    • How does it scale?
    • Scaling antipatterns
      • Metadata servers
      • Router bottlenecks
      • Overloading existing nodes when adding capacity
    • How flexible is it?
    • Data model: Realtime
        LiveStocks
          stock  last
          GOOG   $95.52
          AAPL   $186.10
          AMZN   $112.98
        Portfolios
          user     stock  shares
          jbellis  GOOG   80
          jbellis  LNKD   20
          yukim    AMZN   100
        StockHist
          stock  date        price
          GOOG   2011-01-01  $8.23
          GOOG   2011-01-02  $6.14
          GOOG   2011-01-03  $7.78
    • Data model: Analytics
        HistLoss
          portfolio   worst_date  loss
          Portfolio1  2011-07-23  -$34.81
          Portfolio2  2011-03-11  -$11432.24
          Portfolio3  2011-05-21  -$1476.93
    • Data model: Analytics
        10dayreturns
          stock  rdate       return
          GOOG   2011-07-25  $8.23
          GOOG   2011-07-24  $6.14
          GOOG   2011-07-23  $7.78
          AAPL   2011-07-25  $15.32
          AAPL   2011-07-24  $12.68
        INSERT OVERWRITE TABLE 10dayreturns
        SELECT a.stock, b.date as rdate, b.price - a.price
        FROM StockHist a JOIN StockHist b
        ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);
    • Data model: Analytics
        portfolio_returns
          portfolio   rdate       preturn
          Portfolio1  2011-07-25  $118.21
          Portfolio1  2011-07-24  $60.78
          Portfolio1  2011-07-23  -$34.81
          Portfolio2  2011-07-25  $2143.92
          Portfolio3  2011-07-24  -$10.19
        INSERT OVERWRITE TABLE portfolio_returns
        SELECT portfolio, rdate, SUM(b.return)
        FROM portfolios a JOIN 10dayreturns b
        ON (a.stock = b.stock)
        GROUP BY portfolio, rdate;
    • Data model: Analytics
        HistLoss
          portfolio   worst_date  loss
          Portfolio1  2011-07-23  -$34.81
          Portfolio2  2011-03-11  -$11432.24
          Portfolio3  2011-05-21  -$1476.93
        INSERT OVERWRITE TABLE HistLoss
        SELECT a.portfolio, rdate, minp
        FROM (
          SELECT portfolio, min(preturn) as minp
          FROM portfolio_returns
          GROUP BY portfolio
        ) a
        JOIN portfolio_returns b
        ON (a.portfolio = b.portfolio and a.minp = b.preturn);
    • Some Cassandra users
    • Questions?
      Image credits
      • http://www.flickr.com/photos/26817893@N05/2573006312/
      • http://www.flickr.com/photos/rowanbank/7686239548
      • http://www.flickr.com/photos/mervtheswerve/6081933265
      • http://www.flickr.com/photos/dg_pics/2526208830
      • http://www.flickr.com/photos/wainwright/351684037
      • http://www.flickr.com/photos/mikeneilson/1606662529
      • http://www.flickr.com/photos/sbisson/3852905534
      • http://www.flickr.com/photos/breadnbadger/2674928517