Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)


  1. Cassandra 1.0 and the future of big data. Jonathan Ellis. Tuesday, October 4, 2011
  2. About me
     ✤ Project chair, Apache Cassandra
     ✤ Active since Dec 2008
     ✤ First non-Facebook committer
     ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
     ✤ Distributed systems background
     ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
     ✤ Founder and CTO, DataStax
  3. About DataStax
     ✤ Founded in April 2010
     ✤ Commercial leader in Apache Cassandra
     ✤ 100+ customers
     ✤ 30+ employees
     ✤ Home to Apache Cassandra Chair & most committers
     ✤ Headquartered in the San Francisco Bay Area, California
     ✤ Secured $11M in Series B funding in Sep 2011
  4. Job Trends (indeed.com)
  5. “Big Data” trend
  6. Big data: Analytics (Hadoop), Realtime (“NoSQL”), ?
  7. Some Cassandra users
     ✤ Financial
     ✤ Social Media
     ✤ Advertising
     ✤ Entertainment
     ✤ Energy
     ✤ E-tail
     ✤ Health care
     ✤ Government
  8. Common use cases
     ✤ Time series data
     ✤ Messaging
     ✤ Ad tracking
     ✤ Data mining
     ✤ User activity streams
     ✤ User sessions
     ✤ Anything requiring: scalable + performant + highly available
  9. Why people choose Cassandra
     ✤ Multi-master, multi-DC
     ✤ Linearly scalable
     ✤ Larger-than-memory datasets
     ✤ Best-in-class performance (not just writes!)
     ✤ Fully durable
     ✤ Integrated caching
     ✤ Tunable consistency
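
The "tunable consistency" bullet can be illustrated with the usual replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (N). The sketch below is only an illustration of that arithmetic, not Cassandra code. The function names are hypothetical, and only the ONE, QUORUM, and ALL consistency levels are modeled.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements a consistency level waits for."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when any read replica set must overlap any write replica set (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads combined with QUORUM writes overlap (2 + 2 > 3), while ONE reads combined with ONE writes do not (1 + 1 ≤ 3), which is the trade-off between latency and consistency that the bullet refers to.
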
  10. 0.7
      ✤ CREATE COLUMN FAMILY
      ✤ Expiring columns (TTL)
      ✤ Secondary (column) indexes
      ✤ Efficient streaming
      ✤ Efficient cross-datacenter writes
  11. 0.8
      ✤ CQL
      ✤ Counters
      ✤ Automatic memtable tuning
      ✤ New bulk load interface
  12. 1.0
      ✤ Compression
      ✤ Read performance
      ✤ LeveledCompactionStrategy
      ✤ CQL 2.0
  13. Compression
      ✤ Rows-per-block or blocks-per-row
  14. Classic size-tiered compaction
  15. Level-based Compaction
      ✤ SSTables are non-overlapping within a level
      ✤ Bounds the number of SSTables that can contain a given row
      L0: newly flushed
      L1: 100 MB
      L2: 1000 MB
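
To see why non-overlapping levels bound the number of SSTables a read has to consult, here is a minimal sketch. It assumes each SSTable can be summarized by an inclusive (first_key, last_key) range and that every level above L0 is kept sorted by first key. The `SSTable` alias and `sstables_for_key` function are hypothetical names for illustration, not Cassandra's API.

```python
from bisect import bisect_right
from typing import List, Tuple

SSTable = Tuple[str, str]  # (first_key, last_key), inclusive key range


def sstables_for_key(levels: List[List[SSTable]], key: str) -> List[Tuple[int, SSTable]]:
    """Return (level, sstable) pairs whose key range could contain `key`."""
    hits = []
    for level_no, tables in enumerate(levels):
        if level_no == 0:
            # L0 holds newly flushed sstables, which may overlap: check all of them.
            hits.extend((0, t) for t in tables if t[0] <= key <= t[1])
            continue
        # Higher levels are sorted and non-overlapping, so at most one
        # sstable per level can cover the key; find it by binary search.
        starts = [t[0] for t in tables]
        i = bisect_right(starts, key) - 1
        if i >= 0 and tables[i][0] <= key <= tables[i][1]:
            hits.append((level_no, tables[i]))
    return hits
```

For a store with L0, L1, and L2, a lookup touches every overlapping L0 table but never more than one table from L1 and one from L2, regardless of how many SSTables those levels hold.
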
  16. Read performance: maxtimestamp
      ✤ Sort sstables by maximum (client-provided) timestamp
      ✤ Only merge sstables until we have the columns requested
      ✤ Allows pre-merging highly fragmented rows without waiting for compaction
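
The maxtimestamp idea can be sketched as follows. This is an illustrative model, not Cassandra's implementation: the `SSTable` tuple and `read_columns` function are hypothetical, and each sstable is reduced to the columns it holds for a single row. The slide only says to merge "until we have the columns requested"; the extra comparison against each sstable's maximum timestamp is an assumption added here so that stopping early cannot miss a newer version of a column.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SSTable(NamedTuple):
    max_timestamp: int                   # newest timestamp of any column in the sstable
    columns: Dict[str, Tuple[int, str]]  # column name -> (timestamp, value) for one row


def read_columns(sstables: Iterable[SSTable],
                 wanted: List[str]) -> Tuple[Dict[str, Tuple[int, str]], int]:
    """Merge the wanted columns newest-sstable-first; return (result, sstables scanned)."""
    result: Dict[str, Tuple[int, str]] = {}
    scanned = 0
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # A column still needs this sstable if we have no version yet, or if the
        # sstable could hold a version newer than the one already found.
        pending = [c for c in wanted
                   if c not in result or result[c][0] < table.max_timestamp]
        if not pending:
            break  # every later sstable is older still, so nothing can change
        scanned += 1
        for name in pending:
            found = table.columns.get(name)
            if found is not None and (name not in result or found[0] > result[name][0]):
                result[name] = found
    return result, scanned
```

When the requested columns were written recently, the loop ends after the newest one or two sstables instead of merging every fragment of the row.
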
  17. Results
  18. CQL
      cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
             KEY | birth_date |         full_name | state |
      bsanderson |       1975 | Brandon Sanderson |    UT |
  19. CQL 2.0
      ✤ ALTER
      ✤ Counter support
      ✤ TTL support
      ✤ SELECT count(*)
  20. Post-1.0 features
      ✤ Ease Of Use
      ✤ CQL
      ✤ “Native” transport
      ✤ Composite columns
      ✤ Prepared statements
      ✤ Triggers
      ✤ Entity groups
      ✤ Smarter range queries
      ✤ Enables more-efficient analytics
  21. The evolution of Analytics: Analytics + Realtime
  22. The evolution of Analytics: replication (Analytics, Realtime)
  23. The evolution of Analytics: ETL
  24. Big data: Analytics (Hadoop), DataStax Enterprise, Realtime (Cassandra)
  25. DataStax Enterprise re-unifies realtime and analytics
  26. (no transcribed text)
  27. Data model: Realtime
      LiveStocks
                   last
        GOOG     $95.52
        AAPL    $186.10
        AMZN    $112.98
      Portfolios
                     GOOG  LNKD    P  AMZN  AAPL
        Portfolio1     80    20   40   100    20
      StockHist
               2011-01-01  2011-01-02  2011-01-03
        GOOG       $79.85      $75.23      $82.11
  28. Data model: Analytics
      HistLoss
                     worst_date        loss
        Portfolio1   2011-07-23     -$34.81
        Portfolio2   2011-03-11  -$11432.24
        Portfolio3   2011-05-21   -$1476.93
  29. Data model: Analytics
      10dayreturns
        ticker  rdate       return
        GOOG    2011-07-25   $8.23
        GOOG    2011-07-24   $6.14
        GOOG    2011-07-23   $7.78
        AAPL    2011-07-25  $15.32
        AAPL    2011-07-24  $12.68

      INSERT OVERWRITE TABLE 10dayreturns
      SELECT a.row_key ticker,
             b.column_name rdate,
             b.value - a.value
      FROM StockHist a
      JOIN StockHist b
        ON (a.row_key = b.row_key
            AND date_add(a.column_name, 10) = b.column_name);
  30. Data model: Analytics
      StockHist as a wide row:
               2011-01-01  2011-01-02  2011-01-03
        GOOG       $79.85      $75.23      $82.11
      The same column family as rows:
        row_key  column_name  value
        GOOG     2011-01-01   $8.23
        GOOG     2011-01-02   $6.14
        GOOG     2011-01-03   $7.78
  31. Data model: Analytics
      portfolio_returns
        portfolio   rdate        preturn
        Portfolio1  2011-07-25   $118.21
        Portfolio1  2011-07-24    $60.78
        Portfolio1  2011-07-23   -$34.81
        Portfolio2  2011-07-25  $2143.92
        Portfolio3  2011-07-24   -$10.19

      INSERT OVERWRITE TABLE portfolio_returns
      SELECT row_key portfolio,
             rdate,
             SUM(b.return)
      FROM portfolios a
      JOIN 10dayreturns b
        ON (a.column_name = b.ticker)
      GROUP BY row_key, rdate;
  32. Data model: Analytics
      HistLoss
                     worst_date        loss
        Portfolio1   2011-07-23     -$34.81
        Portfolio2   2011-03-11  -$11432.24
        Portfolio3   2011-05-21   -$1476.93

      INSERT OVERWRITE TABLE HistLoss
      SELECT a.portfolio, rdate, minp
      FROM (
        SELECT portfolio, min(preturn) AS minp
        FROM portfolio_returns
        GROUP BY portfolio
      ) a
      JOIN portfolio_returns b
        ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
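
The three Hive queries on the analytics slides form a small pipeline: StockHist to 10dayreturns, then to portfolio_returns, then to HistLoss. As a rough in-memory analogue (not the demo's code), the same steps can be sketched in Python. The function names and the dictionary layouts are assumptions made for illustration. Like the query on the slide, the portfolio step sums the returns of the tickers a portfolio holds and does not weight them by share count.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Tuple


def ten_day_returns(stock_hist: Dict[str, Dict[date, float]],
                    days: int = 10) -> Dict[Tuple[str, date], float]:
    """Self-join of StockHist: price on rdate minus price `days` earlier."""
    out = {}
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            rdate = start + timedelta(days=days)
            if rdate in prices:
                out[(ticker, rdate)] = prices[rdate] - start_price
    return out


def portfolio_returns(portfolios: Dict[str, Dict[str, int]],
                      returns: Dict[Tuple[str, date], float]) -> Dict[Tuple[str, date], float]:
    """Join portfolios to returns on ticker, then sum per (portfolio, rdate)."""
    totals: Dict[Tuple[str, date], float] = defaultdict(float)
    for name, holdings in portfolios.items():
        for (ticker, rdate), value in returns.items():
            if ticker in holdings:
                totals[(name, rdate)] += value
    return dict(totals)


def hist_loss(preturns: Dict[Tuple[str, date], float]) -> Dict[str, Tuple[date, float]]:
    """Per portfolio, the date and value of the smallest (worst) return."""
    worst: Dict[str, Tuple[date, float]] = {}
    for (name, rdate), value in preturns.items():
        if name not in worst or value < worst[name][1]:
            worst[name] = (rdate, value)
    return worst
```

One difference from the Hive version: when two dates tie for the minimum return, the join on the slide would emit both rows, whereas this sketch keeps only the first one it encounters.
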
  33. Portfolio Demo dataflow (diagram labels): Portfolios, Historical Prices, Live Prices for today, Intermediate Results, Largest loss
  34. Operations
      ✤ “Vanilla” Hadoop
        ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...)
        ✤ Single points of failure
        ✤ Can't separate online and offline processing
      ✤ DataStax Enterprise
        ✤ Single, simplified component
        ✤ Self-organizes based on workload
        ✤ Peer to peer
        ✤ JobTracker failover
        ✤ No additional Cassandra config
  35. OpsCenter
  36. Questions?
      ✤ http://datastax.com/dev/blog
      ✤ jonathan@datastax.com
  37. (no transcribed text)
