Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
1. Cassandra 1.0
and the future of big data
Jonathan Ellis
Tuesday, October 4, 2011
2. About me
✤ Project chair, Apache Cassandra
✤ Active since Dec 2008
✤ First non-Facebook committer
✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
✤ At Mozy, built a multi-petabyte, scalable storage system based on
Reed-Solomon encoding
✤ Founder and CTO, DataStax
3. About DataStax
✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in the San Francisco Bay Area, California
✤ Secured $11M in Series B funding in Sep 2011
6. Big data
Analytics (Hadoop)   ?   Realtime (“NoSQL”)
7. Some Cassandra users
✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government
8. Common use cases
✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring scalability, performance, and high availability
15. Level-based Compaction
✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row
L0: newly flushed
L1: 100 MB
L2: 1000 MB
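The bound in the bullets above can be illustrated with a minimal Python sketch. It assumes each sstable is summarized by a first and last row key, that L0 sstables may overlap, and that sstables in L1 and higher do not. The names `SSTable` and `candidates_for_key` are hypothetical and are not Cassandra APIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SSTable:
    """Hypothetical summary of an sstable: a name and its row-key range."""
    name: str
    first_key: str
    last_key: str

    def may_contain(self, key: str) -> bool:
        return self.first_key <= key <= self.last_key


def candidates_for_key(levels: Dict[int, List[SSTable]], key: str) -> List[SSTable]:
    """Return the sstables whose key range covers `key`, level by level."""
    hits: List[SSTable] = []
    for level in sorted(levels):
        matching = [t for t in levels[level] if t.may_contain(key)]
        if level > 0:
            # Ranges do not overlap within L1+, so at most one sstable matches.
            assert len(matching) <= 1, f"overlapping sstables in L{level}"
        hits.extend(matching)
    return hits


# Toy layout: two overlapping L0 sstables, then non-overlapping L1 and L2.
levels = {
    0: [SSTable("l0-a", "a", "z"), SSTable("l0-b", "c", "q")],
    1: [SSTable("l1-a", "a", "m"), SSTable("l1-b", "n", "z")],
    2: [SSTable("l2-a", "a", "f"), SSTable("l2-b", "g", "p"), SSTable("l2-c", "q", "z")],
}
found = candidates_for_key(levels, "k")
print([t.name for t in found])
```

Under these assumptions, a read for one row touches at most every L0 sstable plus one sstable per higher level, regardless of how many sstables each level holds.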
16. Read performance: maxtimestamp
✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without
waiting for compaction
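A rough Python sketch of the idea in the bullets above. The data layout and the exact stopping rule are assumptions made for illustration: each sstable is modeled as a maximum timestamp plus a map of column name to (timestamp, value), and the merge stops once every requested column has been found with a timestamp at least as new as anything the remaining sstables could hold. `read_columns` is a hypothetical helper, not Cassandra code.

```python
from typing import Dict, List, Set, Tuple

# Assumed model: (max_timestamp, {column_name: (timestamp, value)})
SSTable = Tuple[int, Dict[str, Tuple[int, str]]]


def read_columns(sstables: List[SSTable], wanted: Set[str]) -> Tuple[Dict[str, str], int]:
    """Merge sstables newest-first; return (column values, sstables scanned)."""
    result: Dict[str, Tuple[int, str]] = {}
    scanned = 0
    # Visit sstables in descending order of their maximum timestamp.
    for max_ts, columns in sorted(sstables, key=lambda t: t[0], reverse=True):
        have_all = wanted.issubset(result)
        # Remaining sstables cannot hold anything newer than max_ts.
        if have_all and all(result[c][0] >= max_ts for c in wanted):
            break
        scanned += 1
        for name in wanted.intersection(columns):
            ts, value = columns[name]
            if name not in result or ts > result[name][0]:
                result[name] = (ts, value)
    return {name: value for name, (ts, value) in result.items()}, scanned


sstables = [
    (100, {"full_name": (100, "B. Sanderson"), "birth_date": (90, "1975")}),
    (300, {"state": (300, "UT")}),
    (200, {"full_name": (200, "Brandon Sanderson"), "state": (150, "CA")}),
]
values, scanned = read_columns(sstables, {"state", "full_name"})
print(values, scanned)
```

In this toy run the oldest sstable is never read, because both requested columns were already found with timestamps newer than its maximum.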
18. CQL
cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
KEY | birth_date | full_name | state |
bsanderson | 1975 | Brandon Sanderson | UT |
19. CQL 2.0
✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)
20. Post-1.0 features
✤ Ease of use
✤ CQL
✤ “Native” transport
✤ Composite columns
✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
✤ Enables more-efficient analytics
21. The evolution of Analytics
Analytics + Realtime
22. The evolution of Analytics
replication
Analytics Realtime
27. Data model: Realtime
LiveStocks
last
GOOG $95.52
AAPL $186.10
AMZN $112.98
Portfolios
            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20
StockHist
      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11
28. Data model: Analytics
HistLoss
worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
29. Data model: Analytics
10dayreturns
ticker rdate return
GOOG 2011-07-25 $8.23
GOOG 2011-07-24 $6.14
GOOG 2011-07-23 $7.78
AAPL 2011-07-25 $15.32
AAPL 2011-07-24 $12.68
INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker,
b.column_name rdate,
b.value - a.value
FROM StockHist a
JOIN StockHist b
ON (a.row_key = b.row_key
AND date_add(a.column_name,10) = b.column_name);
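What the self-join above computes can be sketched in plain Python. This assumes StockHist is modeled as a dict of ticker to {date: price}; the 2011-01-11 price is a hypothetical value added so that one 10-day pair exists, and `ten_day_returns` is an illustrative helper rather than part of the demo.

```python
from datetime import date, timedelta
from decimal import Decimal
from typing import Dict, List, Tuple

StockHist = Dict[str, Dict[date, Decimal]]


def ten_day_returns(stock_hist: StockHist, days: int = 10) -> List[Tuple[str, date, Decimal]]:
    """Pair each price with the price `days` calendar days later, per ticker."""
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)  # mirrors date_add(a.column_name, 10)
            if end in prices:
                # Same columns as the query: ticker, rdate, b.value - a.value
                rows.append((ticker, end, prices[end] - start_price))
    return rows


stock_hist = {
    "GOOG": {
        date(2011, 1, 1): Decimal("79.85"),
        date(2011, 1, 2): Decimal("75.23"),
        date(2011, 1, 11): Decimal("88.08"),  # hypothetical price, not from the slides
    }
}
returns = ten_day_returns(stock_hist)
print(returns)
```

Like the inner join in the query, the sketch emits no row for a date that has no price exactly ten calendar days later.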
30. Data model: Analytics
      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11
row_key column_name value
GOOG 2011-01-01 $8.23
GOOG 2011-01-02 $6.14
GOOG 2011-01-03 $7.78
31. Data model: Analytics
portfolio_returns
portfolio rdate preturn
Portfolio1 2011-07-25 $118.21
Portfolio1 2011-07-24 $60.78
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-07-25 $2143.92
Portfolio3 2011-07-24 -$10.19
INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio,
rdate,
SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
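The join and GROUP BY above can be sketched in Python using the 10dayreturns rows shown on the earlier slide. The portfolio name `DemoPortfolio` and its holdings are hypothetical. As the query is written, returns are summed per ticker held and the share counts stored in the portfolio columns are not used, and the sketch follows that.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List, Set, Tuple

ReturnRow = Tuple[str, str, Decimal]  # (ticker, rdate, return)


def portfolio_returns(
    holdings: Dict[str, Set[str]], returns: List[ReturnRow]
) -> Dict[Tuple[str, str], Decimal]:
    """Sum 10-day returns over the tickers each portfolio holds, per date."""
    totals: Dict[Tuple[str, str], Decimal] = defaultdict(Decimal)
    for portfolio, tickers in holdings.items():
        for ticker, rdate, ret in returns:
            if ticker in tickers:  # join condition: a.column_name = b.ticker
                totals[(portfolio, rdate)] += ret  # GROUP BY portfolio, rdate
    return dict(totals)


returns = [
    ("GOOG", "2011-07-25", Decimal("8.23")),
    ("GOOG", "2011-07-24", Decimal("6.14")),
    ("GOOG", "2011-07-23", Decimal("7.78")),
    ("AAPL", "2011-07-25", Decimal("15.32")),
    ("AAPL", "2011-07-24", Decimal("12.68")),
]
holdings = {"DemoPortfolio": {"GOOG", "AAPL"}}  # hypothetical portfolio
totals = portfolio_returns(holdings, returns)
print(totals)
```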
32. Data model: Analytics
HistLoss
worst_date loss
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-03-11 -$11432.24
Portfolio3 2011-05-21 -$1476.93
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
SELECT portfolio, min(preturn) as minp
FROM portfolio_returns
GROUP BY portfolio
) a
JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);
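The min-then-join pattern in the query above can be sketched in Python against the five portfolio_returns rows shown on the earlier slide. `worst_losses` is a hypothetical helper. Because only five sample rows are used, the results for Portfolio2 and Portfolio3 differ from the HistLoss table, which was presumably computed over the full history.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

Row = Tuple[str, str, Decimal]  # (portfolio, rdate, preturn)


def worst_losses(rows: List[Row]) -> List[Row]:
    """For each portfolio, return the row(s) holding its minimum preturn."""
    # Inner query: min(preturn) grouped by portfolio.
    min_by_portfolio: Dict[str, Decimal] = {}
    for portfolio, _, preturn in rows:
        current = min_by_portfolio.get(portfolio)
        if current is None or preturn < current:
            min_by_portfolio[portfolio] = preturn
    # Join back to recover the date on which that minimum occurred.
    return [row for row in rows if row[2] == min_by_portfolio[row[0]]]


rows = [
    ("Portfolio1", "2011-07-25", Decimal("118.21")),
    ("Portfolio1", "2011-07-24", Decimal("60.78")),
    ("Portfolio1", "2011-07-23", Decimal("-34.81")),
    ("Portfolio2", "2011-07-25", Decimal("2143.92")),
    ("Portfolio3", "2011-07-24", Decimal("-10.19")),
]
losses = worst_losses(rows)
print(losses)
```

As with the join in the query, a portfolio whose minimum return occurs on more than one date yields one row per date.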
33. Portfolio Demo dataflow
Portfolios Portfolios
Historical Prices Live Prices for today
Intermediate Results
Largest loss Largest loss
34. Operations
✤ “Vanilla” Hadoop
✤ 8+ services to set up, monitor, back up, and recover
(NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, ZooKeeper, Region Server, ...)
✤ Single points of failure
✤ Can't separate online and offline processing
✤ DataStax Enterprise
✤ Single, simplified component
✤ Self-organizes based on workload
✤ Peer to peer
✤ JobTracker failover
✤ No additional Cassandra config