OLAP WITH SPARK AND 
CASSANDRA 
#CassandraSummit 
EVAN CHAN 
SEPT 2014
WHO AM I? 
Principal Engineer, Socrata, Inc. 
@evanfchan 
http://github.com/velvia 
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE 
PEOPLE. 
data.edmonton.ca finances.worldbank.org data.cityofchicago.org 
data.seattle.gov data.oregon.gov data.wa.gov 
www.metrochicagodata.org data.cityofboston.gov 
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov 
data.nola.gov data.illinois.gov data.colorado.gov 
data.austintexas.gov data.undp.org www.opendatanyc.com 
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it 
data.montgomerycountymd.gov data.cityofnewyork.us 
data.acgov.org data.baltimorecity.gov data.energystar.gov 
data.somervillema.gov data.maryland.gov data.taxpayer.net 
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT SOCRATA 
Tens of thousands of datasets, each one up to 30 million rows 
Customer demand for billion row datasets 
Want to analyze across datasets
BIG DATA AT OOYALA 
2.5 billion analytics pings a day = almost a trillion events a 
year. 
Roll-up tables - 30 million rows per day
HOW CAN WE ALLOW CUSTOMERS TO QUERY A 
YEAR'S WORTH OF DATA? 
Flexible - complex queries included 
Sometimes you can't denormalize your data enough 
Fast - interactive speeds 
Near Real Time - can't make customers wait hours before 
querying new data
RDBMS? POSTGRES? 
Start hitting latency limits at ~10 million rows 
No robust, inexpensive solution for querying across shards 
No robust way to scale horizontally 
Postgres runs a query on a single thread unless you partition 
(painful!) 
Complex and expensive to improve performance (e.g. rollup 
tables, huge expensive servers)
OLAP CUBES? 
Materialize a summary for every possible dimension combination 
Too complicated and brittle 
Takes forever to compute - not for real time 
Explodes storage and memory
When in doubt, use brute force 
- Ken Thompson
CASSANDRA 
Horizontally scalable 
Very flexible data modelling (lists, sets, custom data types) 
Easy to operate 
No fear of number of rows or documents 
Best of breed storage technology, huge community 
BUT: Simple queries only
APACHE SPARK 
Horizontally scalable, in-memory queries 
Functional Scala transforms - map, filter, groupBy, sort, 
etc. (see the sketch below) 
SQL, machine learning, streaming, graph, R, many more plugins 
all on ONE platform - feed your SQL results to a logistic 
regression, easy! 
THE hottest big data platform, huge community, leaving 
Hadoop in the dust 
Developers love it
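To make the functional style concrete, here is a minimal Spark 1.x RDD sketch; the (country, count) event data and names are made up for illustration:

// "sc" is an existing SparkContext
val events = sc.parallelize(Seq(("CHN", 3), ("USA", 5), ("CHN", 7)))

val totals = events
  .filter { case (_, count) => count > 2 }  // keep only counts above 2
  .reduceByKey(_ + _)                       // sum counts per country code
  .sortByKey()                              // order results by key

totals.collect()  // Array(("CHN", 10), ("USA", 5))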
SPARK PROVIDES THE MISSING FAST, DEEP 
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA 
Scala solutions: 
Datastax integration (CQL-based; see the sketch below): 
https://github.com/datastax/spark-cassandra-connector 
Calliope
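A minimal sketch of reading a Cassandra table through the connector's CQL-based API (connector 1.x); the keyspace, table, and column names here are hypothetical:

// Assumes spark.cassandra.connection.host is set on the SparkConf
import com.datastax.spark.connector._

val crimes = sc.cassandraTable("my_keyspace", "crimes")  // RDD[CassandraRow]
val byOffense = crimes
  .map { row => (row.getString("offense"), row.getInt("count")) }
  .reduceByKey(_ + _)  // total count per offense type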
A bit more work: 
Use a traditional Cassandra client with RDDs 
Use an existing InputFormat, like CqlPagedInputFormat 
The only reason to go this route is probably that you are not on a 
CQL version of Cassandra, or you're using Shark/Hive.
A SPARK AND CASSANDRA 
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS 
Combine best of breed storage and query platforms 
Take full advantage of evolution of each 
Storage handles replication for availability 
Query can replicate data for scaling read concurrency - 
independent!
SCALE NODES, NOT 
DEVELOPER TIME!!
KEEPING IT SIMPLE 
Maximize row scan speed 
Columnar representation for efficiency 
Compressed bitmap indexes for fast algebra (sketch below) 
Functional transforms for easy memoization, testing, 
concurrency, composition
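To illustrate the bitmap-index bullet: keep one bitmap per indexed value, and a filter becomes cheap bitwise algebra. A toy sketch using Scala's plain BitSet (real engines use compressed formats such as EWAH or Roaring); the row numbers are made up:

import scala.collection.immutable.BitSet

// Hypothetical indexes: which row numbers contain each value
val burglaryRows = BitSet(0, 3, 7, 9)  // offense == "Burglary"
val year2014Rows = BitSet(3, 4, 9)     // year == 2014

// WHERE offense = 'Burglary' AND year = 2014  ==>  bitwise AND
val matches = burglaryRows & year2014Rows  // BitSet(3, 9)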
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
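In Spark 1.x, StorageLevel.OFF_HEAP stored RDD blocks in Tachyon, outside the JVM heap. A minimal sketch, assuming a Tachyon master at the URL below and some already-built RDD (someRDD is a placeholder):

import org.apache.spark.storage.StorageLevel

// e.g. conf.set("spark.tachyonStore.url", "tachyon://master:19998")
val cached = someRDD.persist(StorageLevel.OFF_HEAP)  // blocks live in Tachyon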
INITIAL ATTEMPTS 
val rows = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)
sc.parallelize(rows)
  .map { values => (values(0), values) }          // key by the first column
  .groupByKey
  .mapValues(_.map(_(2).asInstanceOf[Int]).sum)   // sum the third column per key
No existing generic query engine for Spark when we started 
(Shark was in infancy, had no indexes, etc.), so we built our own 
For every row, we need to extract the needed columns 
The ability to select arbitrary columns means using Seq[Any], 
no type safety 
Boxing makes integer aggregation very expensive and memory 
inefficient
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage 
approach is dead 
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE 
Same layout in memory and on disk: 
Name    | Age 
Barak   | 46 
Hillary | 66 
Each row is stored contiguously. All columns in row 2 come after 
row 1.
COLUMNAR STORAGE (MEMORY) 
Name column (dictionary codes) 
row:   0  1 
value: 0  1 
Dictionary: {0: "Barak", 1: "Hillary"} 
Age column 
row:   0  1 
value: 46 66
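A tiny Scala sketch of the layout above: each column is a flat array, strings become small integer codes plus one shared dictionary, and a row is only assembled on demand:

val dictionary = Array("Barak", "Hillary")       // code -> string
val nameCodes  = Array(0, 1)                     // Name column: one Int per row
val ages       = Array(46, 66)                   // Age column: plain Ints

// "Row-ify" only when a full row is actually needed:
val row1 = (dictionary(nameCodes(1)), ages(1))   // ("Hillary", 66)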
COLUMNAR STORAGE (CASSANDRA) 
Review: each physical row in Cassandra (e.g. a "partition key") 
stores its columns together on disk. 
Schema CF 
Rowkey | Type 
Name   | StringDict 
Age    | Int 

Data CF 
Rowkey | 0  | 1 
Name   | 0  | 1 
Age    | 46 | 66
ADVANTAGES OF COLUMNAR STORAGE 
Compression 
Dictionary compression - HUGE savings for low-cardinality 
string columns 
RLE (run-length encoding; sketch after this list) 
Reduce I/O 
Only columns needed for query are loaded from disk 
Can keep strong types in memory, avoid boxing 
Batch multiple rows in one cell for efficiency
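To make the RLE bullet concrete, a toy run-length encoder: runs of repeated values collapse to (value, run length) pairs, a big win for sorted or low-cardinality columns. This is illustrative only, not the engine's actual format:

def rle[A](xs: Seq[A]): Seq[(A, Int)] =
  xs.foldLeft(List.empty[(A, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest  // extend current run
    case (acc, x)                      => (x, 1) :: acc       // start a new run
  }.reverse

rle(Seq("CHN", "CHN", "CHN", "USA"))  // List(("CHN", 3), ("USA", 1))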
ADVANTAGES OF COLUMNAR QUERYING 
Cache locality for aggregating a column of data (sketch below) 
Take advantage of CPU/GPU vector instructions for ints / 
doubles 
Avoid row-ifying until the last possible moment 
Easy to derive computed columns 
Use vector data / linear math libraries
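A sketch of the cache-locality point: aggregating a primitive Array[Int] is a tight, allocation-free loop over contiguous memory, whereas the row-based Seq[Any] version boxes and unboxes every value:

val ages: Array[Int] = Array(46, 66 /* ... millions more ... */)

var total = 0
var i = 0
while (i < ages.length) { total += ages(i); i += 1 }  // contiguous, unboxed

// Row-based equivalent: rows.map(_(2).asInstanceOf[Int]).sum -- one box per value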
COLUMNAR QUERY ENGINE VS ROW-BASED IN 
SCALA 
Custom RDD of column-oriented blocks of data 
Uses ~10x less heap 
10-100x faster for group by's on a single node 
Scan speed in excess of 150M rows/sec/core for integer 
aggregations
SO, GREAT, OLAP WITH CASSANDRA AND 
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION 
Datastax Enterprise now comes with HA Spark 
HA master, that is. 
spark-cassandra-connector
SPARK SQL 
Appeared with Spark 1.0 
In-memory columnar store 
Can read from Parquet and JSON now; direct Cassandra 
integration coming 
Querying is not column-based (yet) 
No indexes 
Write custom functions in Scala... take that, Hive UDFs! 
Integrates well with MLBase, Scala/Java/Python
CACHING A SQL TABLE FROM CASSANDRA 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")
  .registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone FROM gdelt ORDER ...")
Remember: Spark is lazy, nothing is executed until the 
collect() 
In Spark 1.1+: registerTempTable
SOME PERFORMANCE NUMBERS 
GDELT dataset, 117 million rows, 57 columns, ~50GB 
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory 
Query                                                      | Avg time (sec) 
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN' | 0.49 
SELECT 4 columns, Top K                                    | 1.51 
SELECT Top countries by Avg Tone (Group By)                | 2.69
IMPORTANT - CACHING 
By default, queries read data from the source (Cassandra) 
every time 
Spark RDD caching: much faster, but a big waste of memory 
(row-oriented) 
Spark SQL table caching: fastest and memory-efficient (see the sketch below)
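A sketch of the three options, reusing the GDELT table from the earlier slide (Spark 1.0-era API, as in that slide's code):

val rdd = sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")

// 1. No caching: every SQL query re-reads the table from Cassandra.

// 2. RDD caching: fast, but stores boxed row objects on the JVM heap.
rdd.cache()

// 3. Spark SQL table caching: in-memory columnar, compact and fastest.
rdd.registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")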
WORK STILL NEEDED 
Indexes 
Columnar querying for fast aggregation 
Tachyon support for Cassandra/CQL 
Efficient reading from columnar storage formats
LESSONS 
Extremely fast distributed querying for these use cases 
Data doesn't change much (and only bulk changes) 
Analytical queries for subset of columns 
Focused on numerical aggregations 
Small numbers of group bys 
For fast query performance, cache your data using Spark SQL 
Concurrent queries are still a frontier with Spark. Use additional 
Spark contexts.
THANK YOU!
EXTRA SLIDES
EXAMPLE CUSTOM INTEGRATION USING 
ASTYANAX 
// "columnFamily" is assumed to be a prepared Astyanax row query;
// asScala comes from scala.collection.JavaConverters._
val cassRDD = sc.parallelize(rowkeys)
  .flatMap { rowkey =>
    columnFamily.get(rowkey).execute().asScala  // fetch one physical row's columns
  }
SOME COLUMNAR ALTERNATIVES 
MonetDB and Infobright - true columnar stores (storage + 
querying) 
Vertica and C-Store 
Google BigQuery - columnar cloud database, Dremel-based 
Amazon Redshift
