Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
 

    Presentation Transcript

    • Cassandra 1.0 and the future of big data. Jonathan Ellis, Tuesday, October 4, 2011
    • About me
        ✤ Project chair, Apache Cassandra
        ✤ Active since Dec 2008
        ✤ First non-Facebook committer
        ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
        ✤ Distributed systems background
        ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
        ✤ Founder and CTO, DataStax
    • About DataStax
        ✤ Founded in April 2010
        ✤ Commercial leader in Apache Cassandra
        ✤ 100+ customers
        ✤ 30+ employees
        ✤ Home to Apache Cassandra Chair & most committers
        ✤ Headquartered in San Francisco Bay Area, California
        ✤ Secured $11M in Series B funding in Sep 2011
    • Job Trends (indeed.com)
    • “Big Data” trend
    • Big data: Analytics (Hadoop); Realtime (“NoSQL”); ?
    • Some Cassandra users
        ✤ Financial
        ✤ Social Media
        ✤ Advertising
        ✤ Entertainment
        ✤ Energy
        ✤ E-tail
        ✤ Health care
        ✤ Government
    • Common use cases
        ✤ Time series data
        ✤ Messaging
        ✤ Ad tracking
        ✤ Data mining
        ✤ User activity streams
        ✤ User sessions
        ✤ Anything requiring: Scalable + performant + highly available
    • Why people choose Cassandra
        ✤ Multi-master, multi-DC
        ✤ Linearly scalable
        ✤ Larger-than-memory datasets
        ✤ Best-in-class performance (not just writes!)
        ✤ Fully durable
        ✤ Integrated caching
        ✤ Tuneable consistency
    • 0.7
        ✤ CREATE COLUMN FAMILY
        ✤ Expiring columns (TTL)
        ✤ Secondary (column) indexes
        ✤ Efficient streaming
        ✤ Efficient cross-datacenter writes
    • 0.8
        ✤ CQL
        ✤ Counters
        ✤ Automatic memtable tuning
        ✤ New bulk load interface
    • 1.0
        ✤ Compression
        ✤ Read performance
        ✤ LeveledCompactionStrategy
        ✤ CQL 2.0
    • Compression
        ✤ Rows-per-block or blocks-per-row
    • Classic size-tiered compaction
    • Level-based Compaction
        ✤ SSTables are non-overlapping within a level
        ✤ Bounds the number that can contain a given row
        L0: newly flushed
        L1: 100 MB
        L2: 1000 MB
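
The bound on sstables per row follows from the non-overlap property: within one level, at most one sstable's key range can cover a given row key. Below is a minimal Python sketch of that idea, not Cassandra's implementation. The `SSTable` class, the `candidates_for_key` function, and the sample key ranges are all hypothetical, and L0 is assumed to allow overlapping ranges because it holds newly flushed sstables.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    """Hypothetical summary of an sstable: a name and its key range."""
    name: str
    first_key: str
    last_key: str

    def may_contain(self, key: str) -> bool:
        return self.first_key <= key <= self.last_key


def candidates_for_key(l0, levels, key):
    """Return the sstables that could hold `key`.

    l0:     newly flushed sstables; ranges may overlap (assumption).
    levels: [L1, L2, ...], each sorted by first_key and non-overlapping.
    """
    # Every L0 sstable has to be checked individually.
    found = [t for t in l0 if t.may_contain(key)]
    for level in levels:
        # Non-overlapping ranges: binary search finds the only possible hit.
        starts = [t.first_key for t in level]
        i = bisect_right(starts, key) - 1
        if i >= 0 and level[i].may_contain(key):
            found.append(level[i])
    return found


# Hypothetical sample data.
l0 = [SSTable("l0-a", "a", "z"), SSTable("l0-b", "c", "p")]
l1 = [SSTable("l1-a", "a", "f"), SSTable("l1-b", "g", "m"), SSTable("l1-c", "n", "z")]
l2 = [SSTable("l2-a", "a", "c"), SSTable("l2-b", "d", "h"), SSTable("l2-c", "i", "z")]

hits = candidates_for_key(l0, [l1, l2], "h")
print([t.name for t in hits])
```

In this sketch a read touches at most one sstable per level beyond L0, however many sstables each level holds.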
    • Read performance: maxtimestamp
        ✤ Sort sstables by maximum (client-provided) timestamp
        ✤ Only merge sstables until we have the columns requested
        ✤ Allows pre-merging highly fragmented rows without waiting for compaction
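
The early-exit read described on this slide can be sketched roughly as follows. This is an illustrative Python model, not Cassandra's code: the `SSTable` class, the `read_columns` function, and the sample rows are hypothetical. The stopping rule also rests on an assumption the slide does not spell out: the read stops once every requested column has been found with a timestamp newer than the maximum timestamp of the next sstable.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Hypothetical in-memory stand-in for an sstable."""
    max_timestamp: int
    # row key -> column name -> (value, timestamp)
    rows: dict = field(default_factory=dict)


def read_columns(sstables, row_key, wanted):
    """Read the `wanted` column names for `row_key`.

    Returns (columns, number of sstables actually scanned).
    """
    result = {}
    scanned = 0
    # Newest sstables first.
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        have_all = len(result) == len(wanted)
        # Assumed stopping rule: nothing in this or any older sstable
        # can be newer than the values already collected.
        if have_all and all(ts > table.max_timestamp for _, ts in result.values()):
            break
        scanned += 1
        for name, (value, ts) in table.rows.get(row_key, {}).items():
            if name in wanted and (name not in result or ts > result[name][1]):
                result[name] = (value, ts)
    return result, scanned


# Hypothetical sample data: one row fragmented across three sstables.
tables = [
    SSTable(10, {"bsanderson": {"state": ("CA", 5), "full_name": ("B. Sanderson", 10)}}),
    SSTable(30, {"bsanderson": {"state": ("UT", 30)}}),
    SSTable(20, {"bsanderson": {"full_name": ("Brandon Sanderson", 20)}}),
]

columns, scanned = read_columns(tables, "bsanderson", {"state"})
print(columns, scanned)
```

Here a read of `state` is satisfied by the newest sstable alone, so the two older fragments are never merged.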
    • Results
    • CQL
        cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;

               KEY | birth_date |         full_name | state |
        bsanderson |       1975 | Brandon Sanderson |    UT |
    • CQL 2.0
        ✤ ALTER
        ✤ Counter support
        ✤ TTL support
        ✤ SELECT count(*)
    • Post-1.0 features
        ✤ Ease Of Use
        ✤ CQL
        ✤ “Native” transport
        ✤ Composite columns
        ✤ Prepared statements
        ✤ Triggers
        ✤ Entity groups
        ✤ Smarter range queries
        ✤ Enables more-efficient analytics
    • The evolution of Analytics: Analytics + Realtime
    • The evolution of Analytics: Analytics and Realtime, linked by replication
    • The evolution of Analytics: ETL
    • Big data: Analytics (Hadoop); DataStax Enterprise; Realtime (Cassandra)
    • DataStax Enterprise re-unifies realtime and analytics
    • Data model: Realtime
        LiveStocks
                   last
            GOOG   $95.52
            AAPL   $186.10
            AMZN   $112.98

        Portfolios
                         GOOG   LNKD   P    AMZN   AAPL
            Portfolio1   80     20     40   100    20

        StockHist
                   2011-01-01   2011-01-02   2011-01-03
            GOOG   $79.85       $75.23       $82.11
    • Data model: Analytics
        HistLoss
                         worst_date   loss
            Portfolio1   2011-07-23   -$34.81
            Portfolio2   2011-03-11   -$11432.24
            Portfolio3   2011-05-21   -$1476.93
    • Data model: Analytics
        10dayreturns
            ticker   rdate        return
            GOOG     2011-07-25   $8.23
            GOOG     2011-07-24   $6.14
            GOOG     2011-07-23   $7.78
            AAPL     2011-07-25   $15.32
            AAPL     2011-07-24   $12.68

        INSERT OVERWRITE TABLE 10dayreturns
        SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
        FROM StockHist a JOIN StockHist b
          ON (a.row_key = b.row_key AND date_add(a.column_name,10) = b.column_name);
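
To make the self-join easier to follow, here is a rough Python equivalent of the query's logic, offered as a sketch rather than as what Hive executes. The function name `ten_day_returns` is hypothetical. The first three GOOG prices come from the StockHist slide; the prices for 2011-01-11 and 2011-01-12 are invented for illustration. As with `date_add`, the offset is ten calendar days.

```python
from datetime import date, timedelta
from decimal import Decimal


def ten_day_returns(stock_hist):
    """Mirror of the self-join: pair each price with the price 10 days later.

    stock_hist: ticker -> {date: price}, like one wide StockHist row per ticker.
    Returns a list of (ticker, rdate, return) rows.
    """
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=10)  # date_add(a.column_name, 10)
            if end in prices:                 # the join condition
                rows.append((ticker, end, prices[end] - start_price))
    return rows


stock_hist = {
    "GOOG": {
        date(2011, 1, 1): Decimal("79.85"),
        date(2011, 1, 2): Decimal("75.23"),
        date(2011, 1, 3): Decimal("82.11"),
        # Hypothetical later prices, not from the slides.
        date(2011, 1, 11): Decimal("85.00"),
        date(2011, 1, 12): Decimal("80.00"),
    }
}

returns = ten_day_returns(stock_hist)
for row in returns:
    print(row)
```

Dates with no price ten days later simply produce no row, which matches the inner-join behaviour of the query.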
    • Data model: Analytics
                   2011-01-01   2011-01-02   2011-01-03
            GOOG   $79.85       $75.23       $82.11

            row_key   column_name   value
            GOOG      2011-01-01    $8.23
            GOOG      2011-01-02    $6.14
            GOOG      2011-01-03    $7.78
    • Data model: Analytics
        portfolio_returns
            portfolio    rdate        preturn
            Portfolio1   2011-07-25   $118.21
            Portfolio1   2011-07-24   $60.78
            Portfolio1   2011-07-23   -$34.81
            Portfolio2   2011-07-25   $2143.92
            Portfolio3   2011-07-24   -$10.19

        INSERT OVERWRITE TABLE portfolio_returns
        SELECT row_key portfolio, rdate, SUM(b.return)
        FROM portfolios a JOIN 10dayreturns b
          ON (a.column_name = b.ticker)
        GROUP BY row_key, rdate;
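
The join-and-group step can be sketched in Python as below. This is an illustration of the query's logic only. The function name `portfolio_returns` and the reduced holdings for Portfolio1 are hypothetical, so the totals will not match the figures on the slide. Like the query as written, the sketch sums per-ticker returns without weighting by the number of shares held.

```python
from collections import defaultdict
from decimal import Decimal


def portfolio_returns(portfolios, returns):
    """Sum 10-day returns per (portfolio, rdate).

    portfolios: portfolio -> {ticker: shares}  (column names are tickers)
    returns:    rows of (ticker, rdate, return) from 10dayreturns
    """
    totals = defaultdict(Decimal)  # Decimal() starts each sum at 0
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in returns:
            if ticker in holdings:                  # a.column_name = b.ticker
                totals[(portfolio, rdate)] += ret   # GROUP BY row_key, rdate
    return dict(totals)


# Hypothetical holdings (a subset of the tickers shown for Portfolio1).
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}

# Rows taken from the 10dayreturns slide.
ten_day = [
    ("GOOG", "2011-07-25", Decimal("8.23")),
    ("GOOG", "2011-07-24", Decimal("6.14")),
    ("AAPL", "2011-07-25", Decimal("15.32")),
    ("AAPL", "2011-07-24", Decimal("12.68")),
]

totals = portfolio_returns(portfolios, ten_day)
for key, value in sorted(totals.items()):
    print(key, value)
```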
    • Data model: Analytics
        HistLoss
                         worst_date   loss
            Portfolio1   2011-07-23   -$34.81
            Portfolio2   2011-03-11   -$11432.24
            Portfolio3   2011-05-21   -$1476.93

        INSERT OVERWRITE TABLE HistLoss
        SELECT a.portfolio, rdate, minp
        FROM (
          SELECT portfolio, min(preturn) as minp
          FROM portfolio_returns
          GROUP BY portfolio
        ) a JOIN portfolio_returns b
          ON (a.portfolio = b.portfolio and a.minp = b.preturn);
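
The query finds each portfolio's minimum return and joins back to recover the date on which it occurred. A single-pass Python sketch of the same result follows. The function name `worst_losses` is hypothetical, and the input uses only the Portfolio1 rows shown on the portfolio_returns slide. One difference from the query: on ties the query would emit one row per matching date, while this sketch keeps only the first.

```python
from decimal import Decimal


def worst_losses(rows):
    """Return portfolio -> (worst_date, loss) from portfolio_returns rows.

    rows: iterable of (portfolio, rdate, preturn).
    """
    worst = {}
    for portfolio, rdate, preturn in rows:
        # Keep the row with the smallest return seen so far.
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Portfolio1 rows as shown on the portfolio_returns slide.
rows = [
    ("Portfolio1", "2011-07-25", Decimal("118.21")),
    ("Portfolio1", "2011-07-24", Decimal("60.78")),
    ("Portfolio1", "2011-07-23", Decimal("-34.81")),
]

hist_loss = worst_losses(rows)
print(hist_loss)
```

For these rows the sketch reproduces the Portfolio1 entry of the HistLoss table.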
    • Portfolio Demo dataflow (diagram): Portfolios, Historical Prices, Live Prices for today, Intermediate Results, Largest loss
    • Operations
        ✤ “Vanilla” Hadoop
            ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server,...)
            ✤ Single points of failure
            ✤ Can't separate online and offline processing
        ✤ DataStax Enterprise
            ✤ Single, simplified component
            ✤ Self-organizes based on workload
            ✤ Peer to peer
            ✤ JobTracker failover
            ✤ No additional Cassandra config
    • OpsCenter
    • Questions?
        ✤ http://datastax.com/dev/blog
        ✤ jonathan@datastax.com