Top five questions to ask when choosing a big data solution

Transcript

  • 1. Five factors to consider when choosing a big data solution! Jonathan Ellis, CTO, DataStax; Project Chair, Apache Cassandra
  • 2. how do I model my application? ©2012 DataStax
  • 3. Popular options • Key/value • Tabular • Document • Graph?
  • 4. Schema is your friend
        {
          "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
          "name": "jbellis",
          "state": "TX",
          "birthdate": "1/1/1976",
          "email_addresses": ["jbellis@gmail", "jbellis@datastax.com"],
        }
  • 5. SQL can be your friend too
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date
        );
        CREATE INDEX ON users(state);
        SELECT * FROM users
        WHERE state='Texas' AND birth_date > '1950-01-01';
  • 6. Collections
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date
        );
        CREATE TABLE users_addresses (
          user_id uuid REFERENCES users,
          email text
        );
        SELECT * FROM users NATURAL JOIN users_addresses;
  • 7. Collections [same two-table schema and NATURAL JOIN as slide 6, marked with an X]
  • 8. Collections
        CREATE TABLE users (
          id uuid PRIMARY KEY,
          name text,
          state text,
          birth_date date,
          email_addresses set<text>
        );
        UPDATE users
        SET email_addresses = email_addresses + {'jbellis@gmail.com', 'jbellis@datastax.com'};
  • 9. Joins don’t scale • No joins • No subqueries • No aggregation functions* or GROUP BY • ORDER BY?
  • 10. SELECT * FROM tweets
        WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx')
        [Diagram labels: followers, ?, tweets]
  • 11. Clustering in Cassandra
        CREATE TABLE timeline (
          user_id uuid,
          tweet_id timeuuid,
          tweet_author uuid,
          tweet_body text,
          PRIMARY KEY (user_id, tweet_id)
        );

        user_id   tweet_id     _author    _body
        jbellis   3290f9da..   rbranson   lorem
        jbellis   3895411a..   tjake      ipsum
        ...       ...          ...
        driftx    3290f9da..   rbranson   lorem
        driftx    71b46a84..   yzhang     dolor
        ...       ...          ...
        yukim     3290f9da..   rbranson   lorem
        yukim     e451dd42..   tjake      amet
        ...       ...          ...
  • 12. Clustering in Cassandra [same timeline table and rows as slide 11]
        SELECT * FROM timeline
        WHERE user_id = 'driftx';
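The clustering idea on the two slides above can be sketched in Python. This is a toy model, not Cassandra's implementation: the `Timeline` class, its method names, and the integer tweet ids (standing in for `timeuuid` values) are all assumptions made for illustration. It shows why the query needs no join: rows are grouped by the first primary key column and kept sorted by the second, so one partition lookup returns a ready-made timeline.

```python
import bisect
from collections import defaultdict


class Timeline:
    """Toy model of a table with PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        # Partition key (user_id) -> list of (tweet_id, author, body), kept sorted.
        self._partitions = defaultdict(list)

    def insert(self, user_id, tweet_id, author, body):
        # Rows in one partition stay ordered by the clustering column (tweet_id).
        # Simplification: no upsert handling for a repeated (user_id, tweet_id).
        bisect.insort(self._partitions[user_id], (tweet_id, author, body))

    def select(self, user_id):
        # One partition lookup returns the whole timeline, already sorted.
        return list(self._partitions.get(user_id, []))


timeline = Timeline()
# Denormalized on write: the same tweet is stored once per timeline it appears in.
timeline.insert("driftx", 2, "yzhang", "dolor")
timeline.insert("driftx", 1, "rbranson", "lorem")
timeline.insert("jbellis", 1, "rbranson", "lorem")

rows = timeline.select("driftx")
```

The trade-off is extra storage and extra writes in exchange for a read that touches a single partition.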
  • 13. how does it perform?
  • 14. Larger than memory datasets
  • 15. Locking
  • 16. Efficiency
  • 17. UPDATE users
        SET email_addresses = email_addresses + {...}
        WHERE user_id = 'jbellis';
  • 18. Durability
  • 19. C* storage engine very briefly
        write(k1, c1:v1)
        Memory: Memtable
        Hard drive: Commit log
  • 20. write(k1, c1:v1)
        Memtable (memory):
          k1 c1:v1
        Commit log (hard drive):
          k1 c1:v1
  • 21. write(k1, c2:v2)
        Memtable (memory):
          k1 c1:v1 c2:v2
        Commit log (hard drive):
          k1 c1:v1
          k1 c2:v2
  • 22. write(k2, c1:v1 c2:v2)
        Memtable (memory):
          k1 c1:v1 c2:v2
          k2 c1:v1 c2:v2
        Commit log (hard drive):
          k1 c1:v1
          k1 c2:v2
          k2 c1:v1 c2:v2
  • 23. write(k1, c1:v4 c3:v3)
        Memtable (memory):
          k1 c1:v4 c2:v2 c3:v3
          k2 c1:v1 c2:v2
        Commit log (hard drive):
          k1 c1:v1
          k1 c2:v2
          k2 c1:v1 c2:v2
          k1 c1:v4 c3:v3
  • 24. Memory: flush, index, cleanup
        SSTable (hard drive):
          k1 c1:v4 c2:v2 c3:v3
          k2 c1:v1 c2:v2
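The write path walked through on the slides above can be replayed with a small Python sketch. It is a simplification under stated assumptions: `ToyStore` and its attribute names are hypothetical, lists and dicts stand in for files on disk, and the "index" step from the flush slide is not modeled. What it does show is the sequence the slides describe: append to the commit log, merge into the memtable, then flush to a sorted, immutable SSTable.

```python
class ToyStore:
    """Toy write path: commit log + memtable, flushed to immutable SSTables."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # key -> {column: value}, held in memory
        self.sstables = []    # immutable sorted snapshots; stand in for files

    def write(self, key, columns):
        # 1. Append to the commit log for durability (sequential I/O only).
        self.commit_log.append((key, dict(columns)))
        # 2. Merge into the memtable; newer column values replace older ones.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out as one sorted, immutable table ...
        sstable = tuple(
            (key, dict(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # ... then clean up: drop the memtable and the log entries it covered.
        self.memtable = {}
        self.commit_log = []


store = ToyStore()
store.write("k1", {"c1": "v1"})
store.write("k1", {"c2": "v2"})
store.write("k2", {"c1": "v1", "c2": "v2"})
store.write("k1", {"c1": "v4", "c3": "v3"})
log_entries_before_flush = len(store.commit_log)
store.flush()
```

Nothing on disk is ever updated in place: the log is only appended to and each SSTable is written once, which is the property the next slide calls "No random writes".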
  • 25. No random writes
  • 26. [Chart: reads/s and writes/s for Cassandra 0.6 and Cassandra 1.0; axis from 0 to 35000]
  • 27. how does it handle failure?
  • 28. Classic partitioning with SPOF [Diagram: client, router, partition 1, partition 2, partition 3, partition 4]
  • 29. Availability • “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston, DataStax • “The biggest problem with failover is that you’re almost never using it until it really hurts. It’s like backups that you never test.” -- Rick Branson, Instagram
  • 30. Fully distributed, no SPOF [Diagram: client and nodes labeled p3, p6, p1, p1, p1]
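One way to picture the "no SPOF" layout is a hash ring in which every key has several replica nodes and any live replica can answer. The Python below is a rough sketch, not Cassandra's actual partitioner or replication strategy: the `Ring` class, the `token` helper, the node names, the use of MD5, and the replica count of 3 are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def token(value):
    # Hash a string to a position on the ring (MD5 chosen only for this sketch).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy hash ring: each key is stored on the next `replicas` nodes clockwise."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, key):
        # Walk clockwise from the key's token and take the next nodes in order.
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def read(self, key, down=()):
        # Any live replica can serve the key; there is no router in the path.
        live = [n for n in self.replicas_for(key) if n not in down]
        if not live:
            raise RuntimeError("all replicas for this key are down")
        return live[0]


ring = Ring(["n1", "n2", "n3", "n4", "n5", "n6"])
owners = ring.replicas_for("jbellis")
# Losing one replica does not make the key unavailable.
served_by = ring.read("jbellis", down={owners[0]})
```

Because every node can compute `replicas_for` itself, no single router or metadata server has to be consulted, in contrast to the classic partitioning slide.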
  • 31. Multiple datacenters
  • 33. how does it scale?
  • 34. Scaling antipatterns • Metadata servers • Router bottlenecks • Overloading existing nodes when adding capacity
  • 36. how flexible is it?
  • 38. Data model: Realtime
        LiveStocks
          stock   last
          GOOG    $95.52
          AAPL    $186.10
          AMZN    $112.98

        Portfolios
          user      stock   shares
          jbellis   GOOG    80
          jbellis   LNKD    20
          yukim     AMZN    100

        StockHist
          stock   date         price
          GOOG    2011-01-01   $8.23
          GOOG    2011-01-02   $6.14
          GOOG    2011-01-03   $7.78
  • 39. Data model: Analytics
        HistLoss
                       worst_date   loss
          Portfolio1   2011-07-23   -$34.81
          Portfolio2   2011-03-11   -$11432.24
          Portfolio3   2011-05-21   -$1476.93
  • 40. Data model: Analytics
        10dayreturns
          stock   rdate        return
          GOOG    2011-07-25   $8.23
          GOOG    2011-07-24   $6.14
          GOOG    2011-07-23   $7.78
          AAPL    2011-07-25   $15.32
          AAPL    2011-07-24   $12.68

        INSERT OVERWRITE TABLE 10dayreturns
        SELECT a.stock, b.date as rdate, b.price - a.price
        FROM StockHist a
        JOIN StockHist b ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);
  • 41. Data model: Analytics
        portfolio_returns
          portfolio    rdate        preturn
          Portfolio1   2011-07-25   $118.21
          Portfolio1   2011-07-24   $60.78
          Portfolio1   2011-07-23   -$34.81
          Portfolio2   2011-07-25   $2143.92
          Portfolio3   2011-07-24   -$10.19

        INSERT OVERWRITE TABLE portfolio_returns
        SELECT portfolio, rdate, SUM(b.return)
        FROM portfolios a
        JOIN 10dayreturns b ON (a.stock = b.stock)
        GROUP BY portfolio, rdate;
  • 42. Data model: Analytics
        HistLoss
                       worst_date   loss
          Portfolio1   2011-07-23   -$34.81
          Portfolio2   2011-03-11   -$11432.24
          Portfolio3   2011-05-21   -$1476.93

        INSERT OVERWRITE TABLE HistLoss
        SELECT a.portfolio, rdate, minp
        FROM (
          SELECT portfolio, min(preturn) as minp
          FROM portfolio_returns
          GROUP BY portfolio
        ) a
        JOIN portfolio_returns b ON (a.portfolio = b.portfolio and a.minp = b.preturn);
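The three analytics queries on the slides above form a small pipeline: 10-day returns per stock, summed per portfolio and date, then the worst date per portfolio. The Python below mirrors that logic in memory as a rough sketch. The sample prices and the `(portfolio, stock)` rows are hypothetical and not taken from the slides; like the slide's query, the sum ignores share counts.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data (not from the slides).
stock_hist = {
    ("GOOG", date(2011, 1, 1)): 100.0,
    ("GOOG", date(2011, 1, 2)): 100.0,
    ("GOOG", date(2011, 1, 11)): 90.0,
    ("GOOG", date(2011, 1, 12)): 105.0,
    ("AAPL", date(2011, 1, 1)): 50.0,
    ("AAPL", date(2011, 1, 11)): 55.0,
}
portfolios = [("Portfolio1", "GOOG"), ("Portfolio1", "AAPL")]

# Step 1 (10dayreturns): self-join StockHist on the same stock, dates 10 days apart.
ten_day_returns = {}
for (stock, day), price in stock_hist.items():
    rdate = day + timedelta(days=10)
    later_price = stock_hist.get((stock, rdate))
    if later_price is not None:
        ten_day_returns[(stock, rdate)] = later_price - price

# Step 2 (portfolio_returns): join on stock, then sum per (portfolio, rdate).
portfolio_returns = defaultdict(float)
for portfolio, stock in portfolios:
    for (ret_stock, rdate), ret in ten_day_returns.items():
        if ret_stock == stock:
            portfolio_returns[(portfolio, rdate)] += ret

# Step 3 (HistLoss): keep the date with the lowest return for each portfolio.
# Unlike the join on the slide, this keeps a single row when returns tie.
hist_loss = {}
for (portfolio, rdate), ret in portfolio_returns.items():
    if portfolio not in hist_loss or ret < hist_loss[portfolio][1]:
        hist_loss[portfolio] = (rdate, ret)
```

Each step reads the previous step's full output, which is why the deck treats this as a batch analytics workload rather than something served from the realtime tables directly.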
  • 44. Some Cassandra users
  • 45. Questions?
        Image credits
        • http://www.flickr.com/photos/26817893@N05/2573006312/
        • http://www.flickr.com/photos/rowanbank/7686239548
        • http://www.flickr.com/photos/mervtheswerve/6081933265
        • http://www.flickr.com/photos/dg_pics/2526208830
        • http://www.flickr.com/photos/wainwright/351684037
        • http://www.flickr.com/photos/mikeneilson/1606662529
        • http://www.flickr.com/photos/sbisson/3852905534
        • http://www.flickr.com/photos/breadnbadger/2674928517