OLAP WITH SPARK AND 
CASSANDRA 
#CassandraSummit 
EVAN CHAN 
SEPT 2014
WHO AM I? 
Principal Engineer, Socrata, Inc. 
@evanfchan 
http://github.com/velvia 
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE 
PEOPLE. 
data.edmonton.ca finances.worldbank.org data.cityofchicago.org 
data.seattle.gov data.oregon.gov data.wa.gov 
www.metrochicagodata.org data.cityofboston.gov 
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov 
data.nola.gov data.illinois.gov data.colorado.gov 
data.austintexas.gov data.undp.org www.opendatanyc.com 
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it 
data.montgomerycountymd.gov data.cityofnewyork.us 
data.acgov.org data.baltimorecity.gov data.energystar.gov 
data.somervillema.gov data.maryland.gov data.taxpayer.net 
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT SOCRATA 
Tens of thousands of datasets, each one up to 30 million rows 
Customer demand for billion row datasets 
Want to analyze across datasets
BIG DATA AT OOYALA 
2.5 billion analytics pings a day = almost a trillion events a 
year. 
Roll-up tables - 30 million rows per day
HOW CAN WE ALLOW CUSTOMERS TO QUERY A 
YEAR'S WORTH OF DATA? 
Flexible - complex queries included 
Sometimes you can't denormalize your data enough 
Fast - interactive speeds 
Near Real Time - can't make customers wait hours before 
querying new data
RDBMS? POSTGRES? 
Start hitting latency limits at ~10 million rows 
No robust, inexpensive solution for querying across shards 
No robust way to scale horizontally 
Postgres runs a query on a single thread unless you partition 
(painful!) 
Complex and expensive to improve performance (e.g. rollup 
tables, huge expensive servers)
OLAP CUBES? 
Materialize a summary for every possible dimension combination 
Too complicated and brittle 
Takes forever to compute - not for real time 
Explodes storage and memory
When in doubt, use brute force 
- Ken Thompson
CASSANDRA 
Horizontally scalable 
Very flexible data modelling (lists, sets, custom data types) 
Easy to operate 
No fear of number of rows or documents 
Best of breed storage technology, huge community 
BUT: Simple queries only
APACHE SPARK 
Horizontally scalable, in-memory queries 
Functional Scala transforms - map, filter, groupBy, sort, 
etc. (see the sketch below) 
SQL, machine learning, streaming, graph, R, many more plugins 
all on ONE platform - feed your SQL results to a logistic 
regression, easy! 
THE hottest big data platform, huge community, leaving 
Hadoop in the dust 
Developers love it
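To make the functional style concrete, here is a minimal Spark 1.x RDD sketch; the (country, count) event data and names are made up for illustration:

// "sc" is an existing SparkContext
val events = sc.parallelize(Seq(("CHN", 3), ("USA", 5), ("CHN", 7)))

val totals = events
  .filter { case (_, count) => count > 2 }  // keep only counts above 2
  .reduceByKey(_ + _)                       // sum counts per country code
  .sortByKey()                              // order results by key

totals.collect()  // Array(("CHN", 10), ("USA", 5))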
SPARK PROVIDES THE MISSING FAST, DEEP 
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA 
Scala solutions: 
Datastax integration (CQL-based; see the sketch below): 
https://github.com/datastax/spark-cassandra-connector 
Calliope
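A minimal sketch of reading a Cassandra table through the connector's CQL-based API (connector 1.x); the keyspace, table, and column names here are hypothetical:

// Assumes spark.cassandra.connection.host is set on the SparkConf
import com.datastax.spark.connector._

val crimes = sc.cassandraTable("my_keyspace", "crimes")  // RDD[CassandraRow]
val byOffense = crimes
  .map { row => (row.getString("offense"), row.getInt("count")) }
  .reduceByKey(_ + _)  // total count per offense type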
A bit more work: 
Use a traditional Cassandra client with RDDs 
Use an existing InputFormat, like CqlPagedInputFormat 
The only reason to go this route is probably that you are not on a 
CQL version of Cassandra, or you're using Shark/Hive.
A SPARK AND CASSANDRA 
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS 
Combine best of breed storage and query platforms 
Take full advantage of evolution of each 
Storage handles replication for availability 
Query can replicate data for scaling read concurrency - 
independent!
SCALE NODES, NOT 
DEVELOPER TIME!!
KEEPING IT SIMPLE 
Maximize row scan speed 
Columnar representation for efficiency 
Compressed bitmap indexes for fast algebra (sketch below) 
Functional transforms for easy memoization, testing, 
concurrency, composition
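To illustrate the bitmap-index bullet: keep one bitmap per indexed value, and a filter becomes cheap bitwise algebra. A toy sketch using Scala's plain BitSet (real engines use compressed formats such as EWAH or Roaring); the row numbers are made up:

import scala.collection.immutable.BitSet

// Hypothetical indexes: which row numbers contain each value
val burglaryRows = BitSet(0, 3, 7, 9)  // offense == "Burglary"
val year2014Rows = BitSet(3, 4, 9)     // year == 2014

// WHERE offense = 'Burglary' AND year = 2014  ==>  bitwise AND
val matches = burglaryRows & year2014Rows  // BitSet(3, 9)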
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
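In Spark 1.x, StorageLevel.OFF_HEAP stored RDD blocks in Tachyon, outside the JVM heap. A minimal sketch, assuming a Tachyon master at the URL below and some already-built RDD (someRDD is a placeholder):

import org.apache.spark.storage.StorageLevel

// e.g. conf.set("spark.tachyonStore.url", "tachyon://master:19998")
val cached = someRDD.persist(StorageLevel.OFF_HEAP)  // blocks live in Tachyon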
INITIAL ATTEMPTS 
val rows = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)
sc.parallelize(rows)
  .map { values => (values(0), values) }          // key by the first column
  .groupByKey
  .mapValues(_.map(_(2).asInstanceOf[Int]).sum)   // sum the third column per key
No existing generic query engine for Spark when we started 
(Shark was in infancy, had no indexes, etc.), so we built our own 
For every row, we need to extract the needed columns 
The ability to select arbitrary columns means using Seq[Any], 
no type safety 
Boxing makes integer aggregation very expensive and memory 
inefficient
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage 
approach is dead 
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE 
Same layout in memory and on disk: 
Name    | Age 
Barak   | 46 
Hillary | 66 
Each row is stored contiguously. All columns in row 2 come after 
row 1.
COLUMNAR STORAGE (MEMORY) 
Name column (dictionary codes) 
row:   0  1 
value: 0  1 
Dictionary: {0: "Barak", 1: "Hillary"} 
Age column 
row:   0  1 
value: 46 66
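A tiny Scala sketch of the layout above: each column is a flat array, strings become small integer codes plus one shared dictionary, and a row is only assembled on demand:

val dictionary = Array("Barak", "Hillary")       // code -> string
val nameCodes  = Array(0, 1)                     // Name column: one Int per row
val ages       = Array(46, 66)                   // Age column: plain Ints

// "Row-ify" only when a full row is actually needed:
val row1 = (dictionary(nameCodes(1)), ages(1))   // ("Hillary", 66)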
COLUMNAR STORAGE (CASSANDRA) 
Review: each physical row in Cassandra (e.g. a "partition key") 
stores its columns together on disk. 
Schema CF 
Rowkey | Type 
Name   | StringDict 
Age    | Int 

Data CF 
Rowkey | 0  | 1 
Name   | 0  | 1 
Age    | 46 | 66
ADVANTAGES OF COLUMNAR STORAGE 
Compression 
Dictionary compression - HUGE savings for low-cardinality 
string columns 
RLE (run-length encoding; sketch after this list) 
Reduce I/O 
Only columns needed for query are loaded from disk 
Can keep strong types in memory, avoid boxing 
Batch multiple rows in one cell for efficiency
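To make the RLE bullet concrete, a toy run-length encoder: runs of repeated values collapse to (value, run length) pairs, a big win for sorted or low-cardinality columns. This is illustrative only, not the engine's actual format:

def rle[A](xs: Seq[A]): Seq[(A, Int)] =
  xs.foldLeft(List.empty[(A, Int)]) {
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest  // extend current run
    case (acc, x)                      => (x, 1) :: acc       // start a new run
  }.reverse

rle(Seq("CHN", "CHN", "CHN", "USA"))  // List(("CHN", 3), ("USA", 1))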
ADVANTAGES OF COLUMNAR QUERYING 
Cache locality for aggregating a column of data (sketch below) 
Take advantage of CPU/GPU vector instructions for ints / 
doubles 
Avoid row-ifying until the last possible moment 
Easy to derive computed columns 
Use vector data / linear math libraries
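A sketch of the cache-locality point: aggregating a primitive Array[Int] is a tight, allocation-free loop over contiguous memory, whereas the row-based Seq[Any] version boxes and unboxes every value:

val ages: Array[Int] = Array(46, 66 /* ... millions more ... */)

var total = 0
var i = 0
while (i < ages.length) { total += ages(i); i += 1 }  // contiguous, unboxed

// Row-based equivalent: rows.map(_(2).asInstanceOf[Int]).sum -- one box per value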
COLUMNAR QUERY ENGINE VS ROW-BASED IN 
SCALA 
Custom RDD of column-oriented blocks of data 
Uses ~10x less heap 
10-100x faster for group by's on a single node 
Scan speed in excess of 150M rows/sec/core for integer 
aggregations
SO, GREAT, OLAP WITH CASSANDRA AND 
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION 
Datastax Enterprise now comes with HA Spark 
HA master, that is. 
spark-cassandra-connector
SPARK SQL 
Appeared with Spark 1.0 
In-memory columnar store 
Can read from Parquet and JSON now; direct Cassandra 
integration coming 
Querying is not column-based (yet) 
No indexes 
Write custom functions in Scala... take that, Hive UDFs! 
Integrates well with MLBase, Scala/Java/Python
CACHING A SQL TABLE FROM CASSANDRA 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")
  .registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone FROM gdelt ORDER ...")
Remember: Spark is lazy, nothing is executed until the 
collect() 
In Spark 1.1+: registerTempTable
SOME PERFORMANCE NUMBERS 
GDELT dataset, 117 million rows, 57 columns, ~50GB 
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory 
Query                                                      | Avg time (sec) 
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN' | 0.49 
SELECT 4 columns, Top K                                    | 1.51 
SELECT Top countries by Avg Tone (Group By)                | 2.69
IMPORTANT - CACHING 
By default, queries read data from the source (Cassandra) 
every time 
Spark RDD caching: much faster, but a big waste of memory 
(row-oriented) 
Spark SQL table caching: fastest and memory-efficient (see the sketch below)
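A sketch of the three options, reusing the GDELT table from the earlier slide (Spark 1.0-era API, as in that slide's code):

val rdd = sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")

// 1. No caching: every SQL query re-reads the table from Cassandra.

// 2. RDD caching: fast, but stores boxed row objects on the JVM heap.
rdd.cache()

// 3. Spark SQL table caching: in-memory columnar, compact and fastest.
rdd.registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")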
WORK STILL NEEDED 
Indexes 
Columnar querying for fast aggregation 
Tachyon support for Cassandra/CQL 
Efficient reading from columnar storage formats
LESSONS 
Extremely fast distributed querying for these use cases 
Data doesn't change much (and only bulk changes) 
Analytical queries for subset of columns 
Focused on numerical aggregations 
Small numbers of group bys 
For fast query performance, cache your data using Spark SQL 
Concurrent queries are still a frontier with Spark. Use additional 
Spark contexts.
THANK YOU!
EXTRA SLIDES
EXAMPLE CUSTOM INTEGRATION USING 
ASTYANAX 
// "columnFamily" is assumed to be a prepared Astyanax row query;
// asScala comes from scala.collection.JavaConverters._
val cassRDD = sc.parallelize(rowkeys)
  .flatMap { rowkey =>
    columnFamily.get(rowkey).execute().asScala  // fetch one physical row's columns
  }
SOME COLUMNAR ALTERNATIVES 
MonetDB and Infobright - true columnar stores (storage + 
querying) 
Vertica and C-Store 
Google BigQuery - columnar cloud database, Dremel-based 
Amazon Redshift
