The document discusses using Apache Spark and Cassandra for online analytical processing (OLAP) of big data. It describes the challenges relational databases and OLAP cubes face at large scale, and how Spark can provide fast, distributed querying of data stored in Cassandra. The key points: Cassandra and Spark combine horizontally scalable storage with fast, in-memory analytics, and for optimal performance data should be cached in Spark SQL tables for column-oriented querying and aggregation.
5. BIG DATA AT SOCRATA
Tens of thousands of datasets, each one up to 30 million rows
Customer demand for billion row datasets
Want to analyze across datasets
6. BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Roll up tables - 30 million rows per day
7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA?
Flexible - complex queries included
Sometimes you can't denormalize your data enough
Fast - interactive speeds
Near Real Time - can't make customers wait hours before querying new data
8. RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Postgres runs queries on a single thread unless you partition (painful!)
Complex and expensive to improve performance (e.g. rollup tables, huge expensive servers)
9. OLAP CUBES?
Materialize a summary for every possible combination
Too complicated and brittle
Takes forever to compute - not for real time
Explodes storage and memory
12. CASSANDRA
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
No fear of number of rows or documents
Best of breed storage technology, huge community
BUT: Simple queries only
13. APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort, etc.
SQL, machine learning, streaming, graph, R, and many more plugins, all on ONE platform - feed your SQL results to a logistic regression, easy!
THE hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
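The functional transforms named above (map, filter, groupBy) have the same shape on plain Scala collections as on Spark RDDs. A toy sketch with made-up event data - local collections only, no cluster, so it runs anywhere:

```scala
// Toy data: (country code, tone) pairs, standing in for analytics events.
val events = Seq(("CHN", 1.2), ("USA", -0.5), ("CHN", 0.8), ("FRA", 0.3))

// map / filter / groupBy compose just like on a Spark RDD.
val avgToneByCountry = events
  .filter { case (_, tone) => tone > 0.0 }      // keep positive-tone events
  .groupBy { case (country, _) => country }     // group by country code
  .map { case (country, rows) =>
    country -> rows.map(_._2).sum / rows.size   // average tone per country
  }
// avgToneByCountry == Map("CHN" -> 1.0, "FRA" -> 0.3)
```

On an RDD the same chain would run partitioned across the cluster instead of on one local Seq.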
16. A bit more work:
Use a traditional Cassandra client with RDDs
Use an existing InputFormat, like CqlPagedInputFormat
The only reason to go here is probably that you're not on the CQL version of Cassandra, or you're using Shark/Hive.
18. SEPARATE STORAGE AND QUERY LAYERS
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query layer can replicate data to scale read concurrency - independently!
24. No existing generic query engine for Spark when we started (Shark was in its infancy, had no indexes, etc.), so we built our own
For every row, we need to extract the needed columns
Selecting arbitrary columns means using Seq[Any] - no type safety
Boxing makes integer aggregation very expensive and memory-inefficient
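The Seq[Any] problem can be sketched in a few lines of plain Scala (a hypothetical row, not the actual engine code): every column access needs a runtime cast, and primitive values sit boxed on the heap.

```scala
// Selecting arbitrary columns at runtime forces the row type down to Seq[Any].
val row: Seq[Any] = Seq("CHN", 46, 1.2)   // country code, count, tone

// No type safety: the compiler can't help, so every access is a cast,
// and the Int lives on the heap as a boxed java.lang.Integer.
val count = row(1).asInstanceOf[Int]

// Aggregating many such rows means a cast and an unboxing per value.
val rows = Seq.fill(3)(row)
val total = rows.map(_(1).asInstanceOf[Int]).sum   // 138
```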
27. TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:
Name     Age
Barack   46
Hillary  66
Each row is stored contiguously: all columns in row 2 come after row 1.
29. COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key")
stores its columns together on disk.
Schema CF:
  Rowkey  Type
  Name    StringDict
  Age     Int
Data CF:
  Rowkey  0   1
  Name    0   1
  Age     46  66
30. ADVANTAGES OF COLUMNAR STORAGE
Compression
Dictionary compression - HUGE savings for low-cardinality string columns
RLE (run-length encoding)
Reduce I/O
Only columns needed for query are loaded from disk
Can keep strong types in memory, avoid boxing
Batch multiple rows in one cell for efficiency
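Dictionary compression is simple to sketch in plain Scala: a low-cardinality string column is stored as small Int codes plus one lookup table. This is illustrative only, not Cassandra's actual on-disk encoding:

```scala
// A low-cardinality string column: only two distinct values.
val names = Seq("CHN", "USA", "CHN", "CHN", "USA")

// Build the dictionary once, then store only the small Int codes.
val dict: Map[String, Int] = names.distinct.zipWithIndex.toMap  // CHN -> 0, USA -> 1
val encoded: Seq[Int] = names.map(dict)                          // Seq(0, 1, 0, 0, 1)

// Decoding walks the reverse mapping.
val reverse = dict.map(_.swap)
val decoded = encoded.map(reverse)   // == names
```

With millions of rows and a handful of distinct strings, the codes (plus the tiny dictionary) are far smaller than the repeated strings, and they run-length-encode well too.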
31. ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating column of data
Take advantage of CPU/GPU vector instructions for ints/doubles
Avoid row-ifying until the last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
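The cache-locality point can be shown with a plain-Scala sketch (toy code, not the engine itself): a column held as a primitive Array[Int] aggregates in a tight loop over unboxed, contiguous values - the access pattern that makes fast columnar scans possible.

```scala
// The Age column as one primitive array: contiguous, unboxed Ints.
val ages: Array[Int] = Array(46, 66, 52, 39)

// A tight while-loop over the array - no boxing, no per-row object chasing.
var sum = 0
var i = 0
while (i < ages.length) {
  sum += ages(i)
  i += 1
}
// sum == 203
```

A row-based layout would instead touch one object per row and unbox each field, thrashing the cache on large scans.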
32. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA
Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group-bys on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
35. DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
HA master, that is.
spark-cassandra-connector
36. SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet and JSON now; direct Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala... take that, Hive UDFs!
Integrates well with MLBase, Scala/Java/Python
37. CACHING A SQL TABLE FROM CASSANDRA
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sc.cassandraTable[GDeltRow]("gdelt", "1979to2009")
  .registerAsTable("gdelt")
sqlContext.cacheTable("gdelt")
sqlContext.sql("SELECT Actor2Code, Actor2Name, Actor2CountryCode, AvgTone FROM gdelt ORDER ...")
  .collect()
Remember Spark is lazy - nothing is executed until the collect()
In Spark 1.1+: registerTempTable
38. SOME PERFORMANCE NUMBERS
GDELT dataset, 117 million rows, 57 columns, ~50GB
Spark 1.0.2, AWS 8 x c3.xlarge, cached in memory
Query                                                        Avg time (sec)
SELECT count(*) FROM gdelt WHERE Actor2CountryCode = 'CHN'   0.49
SELECT 4 columns Top K                                       1.51
SELECT Top countries by Avg Tone (Group By)                  2.69
39. IMPORTANT - CACHING
By default, queries read data from the source - Cassandra - every time
Spark RDD caching - much faster, but a big waste of memory (row-oriented)
Spark SQL table caching - fastest, memory efficient
40. WORK STILL NEEDED
Indexes
Columnar querying for fast aggregation
Tachyon support for Cassandra/CQL
Efficient reading from columnar storage formats
41. LESSONS
Extremely fast distributed querying for these use cases
Data doesn't change much (and only bulk changes)
Analytical queries for subset of columns
Focused on numerical aggregations
Small numbers of group bys
For fast query performance, cache your data using Spark SQL
Concurrent queries are a frontier with Spark. Use additional Spark contexts.