Top five questions to ask when choosing a big data solution
Published in: Technology, Business

  1. Five factors to consider when choosing a big data solution! Jonathan Ellis, CTO, DataStax; Project Chair, Apache Cassandra
  2. How do I model my application? ©2012 DataStax
  3. Popular options
      • Key/value
      • Tabular
      • Document
      • Graph?
  4. Schema is your friend
      {
        "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
        "name": "jbellis",
        "state": "TX",
        "birthdate": "1/1/1976",
        "email_addresses": ["jbellis@gmail.com", "jbellis@datastax.com"]
      }
  5. SQL can be your friend too
      CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        state text,
        birth_date date
      );
      CREATE INDEX ON users(state);
      SELECT * FROM users WHERE state = 'Texas' AND birth_date > '1950-01-01';
  6. Collections
      CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        state text,
        birth_date date
      );
      CREATE TABLE users_addresses (
        user_id uuid REFERENCES users,
        email text
      );
      SELECT * FROM users NATURAL JOIN users_addresses;
  7. Collections (same schema and NATURAL JOIN query as slide 6, crossed out with an X)
  8. Collections
      CREATE TABLE users (
        id uuid PRIMARY KEY,
        name text,
        state text,
        birth_date date,
        email_addresses set<text>
      );
      UPDATE users SET email_addresses = email_addresses + {'jbellis@gmail.com', 'jbellis@datastax.com'};
  9. Joins don't scale
      • No joins
      • No subqueries
      • No aggregation functions* or GROUP BY
      • ORDER BY?
  10. SELECT * FROM tweets
      WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx')
      [Diagram: followers, ?, tweets]
  11. Clustering in Cassandra
      CREATE TABLE timeline (
        user_id uuid,
        tweet_id timeuuid,
        tweet_author uuid,
        tweet_body text,
        PRIMARY KEY (user_id, tweet_id)
      );

      user_id   tweet_id     _author    _body
      jbellis   3290f9da..   rbranson   lorem
      jbellis   3895411a..   tjake      ipsum
      ...       ...          ...
      driftx    3290f9da..   rbranson   lorem
      driftx    71b46a84..   yzhang     dolor
      ...       ...          ...
      yukim     3290f9da..   rbranson   lorem
      yukim     e451dd42..   tjake      amet
      ...       ...          ...
  12. Clustering in Cassandra (same table and rows as slide 11, with the query)
      SELECT * FROM timeline WHERE user_id = 'driftx';
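
The two clustering slides can be read this way: rows that share a partition key (user_id) are stored together and ordered by the clustering column (tweet_id), so the timeline query reads one partition instead of joining followers to tweets. Below is a minimal Python sketch of that reading. It assumes an in-memory dict stands in for the cluster and plain integers stand in for timeuuid values. `TimelineTable` and `fan_out` are hypothetical names, not Cassandra APIs, and writing one copy of each tweet per follower is inferred from the repeated rows on the slide rather than stated there.

```python
import bisect
from collections import defaultdict


class TimelineTable:
    """Toy model of a table declared with PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        # partition key -> rows (tweet_id, author, body), kept sorted by tweet_id
        self._partitions = defaultdict(list)

    def insert(self, user_id, tweet_id, author, body):
        # Insert into the user's partition while preserving clustering order.
        bisect.insort(self._partitions[user_id], (tweet_id, author, body))

    def select_by_user(self, user_id):
        # Stand-in for: SELECT * FROM timeline WHERE user_id = ?
        # One partition lookup, no join and no subquery.
        return list(self._partitions.get(user_id, []))


def fan_out(table, followers, tweet_id, author, body):
    """Write one copy of the tweet into each follower's timeline partition."""
    for follower in followers:
        table.insert(follower, tweet_id, author, body)
```

The cost moves to write time: each tweet is stored once per follower, which is the trade the denormalized table on the slide appears to make in exchange for join-free reads.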
  13. How does it perform?
  14. Larger than memory datasets
  15. Locking
  16. Efficiency
  17. UPDATE users SET email_addresses = email_addresses + {...} WHERE user_id = 'jbellis';
  18. Durability
  19. C* storage engine very briefly
      write(k1, c1:v1)
      [Diagram: Memtable in memory; commit log on hard drive]
  20. write(k1, c1:v1)
      Memtable (memory):        k1 c1:v1
      Commit log (hard drive):  k1 c1:v1
  21. write(k1, c2:v2)
      Memtable (memory):        k1 c1:v1 c2:v2
      Commit log (hard drive):  k1 c1:v1
                                k1 c2:v2
  22. write(k2, c1:v1 c2:v2)
      Memtable (memory):        k1 c1:v1 c2:v2
                                k2 c1:v1 c2:v2
      Commit log (hard drive):  k1 c1:v1
                                k1 c2:v2
                                k2 c1:v1 c2:v2
  23. write(k1, c1:v4 c3:v3)
      Memtable (memory):        k1 c1:v4 c2:v2 c3:v3
                                k2 c1:v1 c2:v2
      Commit log (hard drive):  k1 c1:v1
                                k1 c2:v2
                                k2 c1:v1 c2:v2
                                k1 c1:v4 c3:v3
  24. [Diagram labels: Memory, flush, index, cleanup, Hard drive]
      SSTable (hard drive):     k1 c1:v4 c2:v2 c3:v3
                                k2 c1:v1 c2:v2
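
The write sequence on slides 19 to 24 can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: `ToyStorageEngine` is a hypothetical name, a Python list stands in for the on-disk commit log, and clearing the log at flush time is an assumption about what "cleanup" on slide 24 refers to.

```python
class ToyStorageEngine:
    """Toy write path: append to a commit log, update a memtable, flush to an SSTable."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the log on the hard drive
        self.memtable = {}    # key -> {column: value}, held in memory
        self.sstables = []    # immutable, key-sorted snapshots

    def write(self, key, columns):
        # Every write is a sequential append plus an in-memory merge.
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out once, sorted by key and column, then reset
        # the memtable and the commit-log entries it made redundant.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log = []
        return sstable
```

Replaying the four writes from the slides leaves `k1` with `c1:v4 c2:v2 c3:v3` in the memtable, and a flush produces a single sorted snapshot. Existing data is never updated in place, which is the property slide 25 summarizes.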
  25. No random writes
  26. [Chart: reads/s and writes/s, y-axis 0 to 35000, Cassandra 0.6 vs. Cassandra 1.0]
  27. How does it handle failure?
  28. Classic partitioning with SPOF
      [Diagram: client, router, partitions 1 to 4]
  29. Availability
      • "High availability implies that a single fault will not bring down your system. Not 'we'll recover quickly.'" -- Ben Coverston: DataStax
      • "The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test." -- Rick Branson: Instagram
  30. Fully distributed, no SPOF
      [Diagram: client; nodes labeled p3, p6, and p1 (p1 appears three times)]
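
One way to read the "no SPOF" diagram is that every node can work out which nodes hold a key, so no router sits in the request path, and each partition lives on several nodes. The Python sketch below illustrates that reading under stated assumptions: node names are hashed with MD5 to get ring positions, a key's replicas are the next nodes clockwise, and the replication factor is 3 (suggested by p1 appearing three times on the slide). `Ring`, `replicas`, and `live_replicas` are hypothetical names, not Cassandra APIs.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is an arbitrary choice for this sketch)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: any client or node can compute replica placement locally."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        # First node clockwise from the key's token, then the following nodes.
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]

    def live_replicas(self, key, down=()):
        # Replicas still able to serve the key when some nodes are down.
        return [n for n in self.replicas(key) if n not in down]
```

With six nodes and three replicas, losing any single node still leaves two replicas for every key, which matches the Ben Coverston quote on slide 29: a single fault does not bring the system down.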
  31. Multiple datacenters
  32. [No text]
  33. How does it scale?
  34. Scaling antipatterns
      • Metadata servers
      • Router bottlenecks
      • Overloading existing nodes when adding capacity
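
The slide names the antipatterns without naming a remedy. As an illustration of the last bullet only, the Python sketch below compares how many keys change owner when a fifth node joins four, first with naive `hash % node_count` placement and then with a token ring that gives each node several positions. Everything here is an assumption made for illustration: the function names are hypothetical, MD5 is an arbitrary hash, and 64 positions per node is an arbitrary setting.

```python
import bisect
import hashlib


def token(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_owner(key, nodes):
    """Naive placement: the owner depends directly on the node count."""
    return nodes[token(key) % len(nodes)]


def build_ring(nodes, positions_per_node=64):
    """Give each node several ring positions; return parallel token/owner lists."""
    ring = sorted(
        (token(f"{n}#{i}"), n) for n in nodes for i in range(positions_per_node)
    )
    return [t for t, _ in ring], [n for _, n in ring]


def ring_owner(key, ring):
    tokens, owners = ring
    return owners[bisect.bisect_right(tokens, token(key)) % len(owners)]


def moved_fractions(keys, before, after):
    """Fraction of keys whose owner changes, as (modulo, ring)."""
    ring_before, ring_after = build_ring(before), build_ring(after)
    modulo_moved = sum(modulo_owner(k, before) != modulo_owner(k, after) for k in keys)
    ring_moved = sum(
        ring_owner(k, ring_before) != ring_owner(k, ring_after) for k in keys
    )
    return modulo_moved / len(keys), ring_moved / len(keys)
```

In the ring version, a key changes owner only when the new node takes over its range, so the data that moves goes to the new node and is drawn from the existing nodes in small pieces. With modulo placement most keys are reshuffled among the existing nodes as well.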
  35. [No text]
  36. How flexible is it?
  37. [No text]
  38. Data model: Realtime

      LiveStocks
      stock   last
      GOOG    $95.52
      AAPL    $186.10
      AMZN    $112.98

      Portfolios
      user      stock   shares
      jbellis   GOOG    80
      jbellis   LNKD    20
      yukim     AMZN    100

      StockHist
      stock   date         price
      GOOG    2011-01-01   $8.23
      GOOG    2011-01-02   $6.14
      GOOG    2011-01-03   $7.78
  39. Data model: Analytics

      HistLoss
      portfolio    worst_date   loss
      Portfolio1   2011-07-23   -$34.81
      Portfolio2   2011-03-11   -$11432.24
      Portfolio3   2011-05-21   -$1476.93
  40. Data model: Analytics

      10dayreturns
      stock   rdate        return
      GOOG    2011-07-25   $8.23
      GOOG    2011-07-24   $6.14
      GOOG    2011-07-23   $7.78
      AAPL    2011-07-25   $15.32
      AAPL    2011-07-24   $12.68

      INSERT OVERWRITE TABLE 10dayreturns
      SELECT a.stock, b.date as rdate, b.price - a.price
      FROM StockHist a JOIN StockHist b
      ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);
  41. Data model: Analytics

      portfolio_returns
      portfolio    rdate        preturn
      Portfolio1   2011-07-25   $118.21
      Portfolio1   2011-07-24   $60.78
      Portfolio1   2011-07-23   -$34.81
      Portfolio2   2011-07-25   $2143.92
      Portfolio3   2011-07-24   -$10.19

      INSERT OVERWRITE TABLE portfolio_returns
      SELECT portfolio, rdate, SUM(b.return)
      FROM portfolios a JOIN 10dayreturns b ON (a.stock = b.stock)
      GROUP BY portfolio, rdate;
  42. Data model: Analytics

      HistLoss
      portfolio    worst_date   loss
      Portfolio1   2011-07-23   -$34.81
      Portfolio2   2011-03-11   -$11432.24
      Portfolio3   2011-05-21   -$1476.93

      INSERT OVERWRITE TABLE HistLoss
      SELECT a.portfolio, rdate, minp
      FROM (
        SELECT portfolio, min(preturn) as minp
        FROM portfolio_returns
        GROUP BY portfolio
      ) a
      JOIN portfolio_returns b ON (a.portfolio = b.portfolio and a.minp = b.preturn);
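
To make the three analytics queries easier to follow, here is a small Python sketch of the same pipeline over in-memory rows. It mirrors the queries as written on the slides (self-join at a 10-day offset, sum per portfolio and date, then the minimum per portfolio). The function names and the `(portfolio, stock)` shape of the holdings input are assumptions for illustration, and this is not how Hive executes the queries.

```python
from collections import defaultdict
from datetime import timedelta


def ten_day_returns(stock_hist):
    """stock_hist: iterable of (stock, date, price).

    Mirrors the self-join ON a.stock = b.stock AND date_add(a.date, 10) = b.date.
    """
    price_at = {(stock, day): price for stock, day, price in stock_hist}
    returns = []
    for (stock, day), price in price_at.items():
        rdate = day + timedelta(days=10)
        later = price_at.get((stock, rdate))
        if later is not None:
            returns.append((stock, rdate, later - price))
    return returns


def portfolio_returns(holdings, returns):
    """holdings: iterable of (portfolio, stock).

    Mirrors JOIN ... ON a.stock = b.stock GROUP BY portfolio, rdate with SUM(return).
    """
    totals = defaultdict(float)
    for portfolio, stock in holdings:
        for r_stock, rdate, ret in returns:
            if r_stock == stock:
                totals[(portfolio, rdate)] += ret
    return dict(totals)


def hist_loss(p_returns):
    """Worst (lowest) return per portfolio, with the date it occurred.

    Unlike the Hive query, a tie keeps only the first date encountered.
    """
    worst = {}
    for (portfolio, rdate), ret in p_returns.items():
        if portfolio not in worst or ret < worst[portfolio][1]:
            worst[portfolio] = (rdate, ret)
    return worst
```

Like the slide's query, the sum here does not weight returns by the number of shares held; it only joins holdings to returns on the stock symbol.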
  43. [No text]
  44. Some Cassandra users
  45. Questions?
      Image credits
      • http://www.flickr.com/photos/26817893@N05/2573006312/
      • http://www.flickr.com/photos/rowanbank/7686239548
      • http://www.flickr.com/photos/mervtheswerve/6081933265
      • http://www.flickr.com/photos/dg_pics/2526208830
      • http://www.flickr.com/photos/wainwright/351684037
      • http://www.flickr.com/photos/mikeneilson/1606662529
      • http://www.flickr.com/photos/sbisson/3852905534
      • http://www.flickr.com/photos/breadnbadger/2674928517