Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)

Cassandra 1.0
and the future of big data
Jonathan Ellis

Tuesday, October 4, 2011

About me

✤ Project chair, Apache Cassandra
✤ Active since Dec 2008
✤ First non-Facebook committer
✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
✤ At Mozy, built a multi-petabyte, scalable storage system based on
Reed-Solomon encoding
✤ Founder and CTO, DataStax


About DataStax

✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in San Francisco Bay area, California
✤ Secured $11M in Series B funding in Sep 2011


Job Trends (indeed.com)


“Big Data” trend


Big data

Analytics (Hadoop)
?
Realtime (“NoSQL”)


Some Cassandra users

✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government


Common use cases

✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything that must be scalable, performant, and highly available


Why people choose Cassandra

✤ Multi-master, multi-DC
✤ Linearly scalable
✤ Larger-than-memory datasets
✤ Best-in-class performance (not just writes!)
✤ Fully durable
✤ Integrated caching
✤ Tuneable consistency
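
As a rough illustration of the "tuneable consistency" bullet: with N replicas, a read that contacts R replicas and a write acknowledged by W replicas must overlap on at least one replica whenever R + W > N. The sketch below only checks that arithmetic. The function names are made up for illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n, r, w):
    """True when every set of r replicas must intersect every set of w replicas."""
    return r + w > n


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # Quorum reads plus quorum writes always overlap on one replica.
    print(reads_see_latest_write(n, r=q, w=q))
    # Reading and writing a single replica trades that guarantee for latency.
    print(reads_see_latest_write(n, r=1, w=1))
```

The point of making this tunable is that each request can pick its own position on the latency versus consistency trade-off.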


0.7

✤ CREATE COLUMN FAMILY
✤ Expiring columns (TTL)
✤ Secondary (column) indexes
✤ Efficient streaming
✤ Efficient cross-datacenter writes


0.8

✤ CQL
✤ Counters
✤ Automatic memtable tuning
✤ New bulk load interface


1.0

✤ Compression
✤ Read performance
✤ LeveledCompactionStrategy
✤ CQL 2.0


Compression

✤ Rows-per-block or blocks-per-row
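
One way to picture "rows-per-block or blocks-per-row" is block-level compression: data is cut into fixed-size blocks that are compressed independently, so several small rows share one block while a large row spans several. The sketch below is an assumption-laden toy, not Cassandra's implementation. `zlib` stands in for whatever compressor is configured, and the 64 KB block size and both function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress fixed-size blocks independently so each can be read alone."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def read_range(blocks, offset, length, block_size=BLOCK_SIZE):
    """Return data[offset:offset + length], decompressing only the blocks it spans."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    raw = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    start = offset - first * block_size
    return raw[start:start + length]
```

A small row costs one block decompression to read. A row larger than a block costs one decompression per block it covers, but never the whole file.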


Classic size-tiered compaction


Level-based Compaction

✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row

L0: newly flushed

L1: 100 MB

L2: 1000 MB
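
The two bullets above can be sketched as follows: because key ranges do not overlap within a level, a binary search finds at most one candidate sstable per level, and only the freshly flushed L0 tables need to be checked one by one. The data layout (lists of `(first_key, last_key)` tuples) and the function name are assumptions made for this illustration.

```python
from bisect import bisect_right


def sstables_for_key(l0, levels, key):
    """Return the key ranges of sstables that may contain `key`.

    l0:     list of (first_key, last_key) ranges that may overlap each other
    levels: list of levels (L1, L2, ...), each a sorted list of
            non-overlapping (first_key, last_key) ranges
    """
    # L0 is newly flushed, so every table whose range covers the key is a candidate.
    candidates = [r for r in l0 if r[0] <= key <= r[1]]
    for level in levels:
        starts = [first for first, _ in level]
        i = bisect_right(starts, key) - 1   # last range starting at or before key
        if i >= 0 and level[i][1] >= key:
            candidates.append(level[i])     # at most one hit per level
    return candidates
```

With each level ten times larger than the one before (100 MB, 1000 MB, ...), the number of levels grows only logarithmically with data size, which is what bounds the number of sstables a read has to consult.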


Read performance: maxtimestamp

✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without
waiting for compaction
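
A minimal sketch of the idea, under stated assumptions: each sstable records the maximum timestamp of anything it holds, the read visits sstables newest first, and it stops once every requested column has been found and no remaining sstable could hold a newer version. The `SSTable` class, the `read_columns` function, and the exact stopping rule are illustrative guesses rather than Cassandra's code, and deletions (tombstones) are ignored.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    max_timestamp: int
    columns: dict  # column name -> (timestamp, value) for a single row


def read_columns(sstables, wanted):
    """Return ({column: newest value}, number of sstables actually read)."""
    wanted = set(wanted)
    if not wanted:
        return {}, 0
    result = {}
    touched = 0
    for table in sorted(sstables, key=lambda t: t.max_timestamp, reverse=True):
        # Stop when all columns are found and nothing left can be newer.
        if len(result) == len(wanted) and \
                table.max_timestamp < min(ts for ts, _ in result.values()):
            break
        touched += 1
        for name in wanted:
            if name in table.columns:
                ts, value = table.columns[name]
                if name not in result or ts > result[name][0]:
                    result[name] = (ts, value)
    return {name: value for name, (ts, value) in result.items()}, touched
```

For a row fragmented across many sstables, a read of recently written columns touches only the newest few tables instead of merging all of them.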


Results


CQL 2.0

✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)


Post-1.0 features

✤ Ease Of Use
✤ CQL
✤ “Native” transport
✤ Composite columns
✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
✤ Enables more-efficient analytics

The evolution of Analytics

Analytics + Realtime



replication

Analytics Realtime



ETL


Big data

Analytics (Hadoop)
DataStax Enterprise
Realtime (Cassandra)


DataStax Enterprise re-unifies
realtime and analytics


Data model: Realtime
LiveStocks
last
GOOG $95.52
AAPL $186.10
AMZN $112.98

Portfolios
            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20

StockHist
      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11


Data model: Analytics
HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93


10dayreturns
ticker rdate return
GOOG 2011-07-25 $8.23
GOOG 2011-07-24 $6.14
GOOG 2011-07-23 $7.78
AAPL 2011-07-25 $15.32
AAPL 2011-07-24 $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker,
       b.column_name rdate,
       b.value - a.value
FROM StockHist a
JOIN StockHist b
  ON (a.row_key = b.row_key
      AND date_add(a.column_name, 10) = b.column_name);



      2011-01-01  2011-01-02  2011-01-03
GOOG  $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78
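
For readers less familiar with Hive, the self-join above can be approximated in plain Python: for every (ticker, date) price, look up the price of the same ticker ten days later and emit the difference. This is a sketch under assumptions. The nested-dict input shape and the function name are invented here, and rounding to cents is a convenience of the sketch, not something the query does.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """stock_hist: {ticker: {date: closing_price}}.

    Returns a list of (ticker, rdate, return) rows, like the 10dayreturns table.
    """
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)
            if end in prices:  # inner join: dates without a match are dropped
                rows.append((ticker, end, round(prices[end] - start_price, 2)))
    return rows
```

As in the query, `rdate` is the later of the two dates, and a date with no price ten days afterwards produces no row.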


portfolio_returns
portfolio rdate preturn
Portfolio1 2011-07-25 $118.21
Portfolio1 2011-07-24 $60.78
Portfolio1 2011-07-23 -$34.81
Portfolio2 2011-07-25 $2143.92
Portfolio3 2011-07-24 -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio,
       rdate,
       SUM(b.return)
FROM portfolios a
JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
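
The same join-and-group step, sketched in Python for illustration. The input shapes and the function name are assumptions. Like the query as written, the sketch sums the per-ticker returns of every ticker held in a portfolio and does not multiply by the number of shares.

```python
from collections import defaultdict


def portfolio_returns(portfolios, returns):
    """portfolios: {portfolio: {ticker: shares}}
    returns:    list of (ticker, rdate, ret) rows, as in 10dayreturns

    Returns {(portfolio, rdate): summed return}.
    """
    totals = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker, rdate, ret in returns:
            if ticker in holdings:               # join on column_name = ticker
                totals[(portfolio, rdate)] += ret  # GROUP BY portfolio, rdate
    return dict(totals)
```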


HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) AS minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
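
The query finds each portfolio's minimum return and then joins back to recover the date on which it occurred. A single pass does the same thing in Python. This is only a sketch: the function name and tuple layout are assumptions, and unlike the join, which returns every row tied for the minimum, it keeps only the first such row.

```python
def hist_loss(portfolio_returns):
    """portfolio_returns: list of (portfolio, rdate, preturn) rows.

    Returns {portfolio: (worst_date, loss)}.
    """
    worst = {}
    for portfolio, rdate, preturn in portfolio_returns:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst
```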


Portfolio Demo dataflow

Portfolios Portfolios
Historical Prices Live Prices for today
Intermediate Results
Largest loss Largest loss


Operations

✤ “Vanilla” Hadoop
✤ 8+ services to set up, monitor, back up, and recover
(NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, ZooKeeper,
Region Server, ...)
✤ Single points of failure
✤ Can't separate online and offline processing

✤ DataStax Enterprise
✤ Single, simplified component
✤ Self-organizes based on workload
✤ Peer to peer
✤ JobTracker failover
✤ No additional Cassandra config


OpsCenter


Questions?

✤ http://datastax.com/dev/blog
✤ jonathan@datastax.com

