OLAP WITH SPARK AND
CASSANDRA
EVAN CHAN
JULY 2014
  1. OLAP WITH SPARK AND CASSANDRA
      - EVAN CHAN
      - JULY 2014
  2. WHO AM I?
      - Principal Engineer, Socrata, Inc.
      - @evanfchan
      - http://github.com/velvia
      - Creator of Spark Job Server
  3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.
      - data.edmonton.ca, finances.worldbank.org, data.cityofchicago.org
      - data.seattle.gov, data.oregon.gov, data.wa.gov
      - www.metrochicagodata.org, data.cityofboston.gov
      - info.samhsa.gov, explore.data.gov, data.cms.gov, data.ok.gov
      - data.nola.gov, data.illinois.gov, data.colorado.gov
      - data.austintexas.gov, data.undp.org, www.opendatanyc.com
      - data.mo.gov, data.nfpa.org, data.raleighnc.gov, dati.lombardia.it
      - data.montgomerycountymd.gov, data.cityofnewyork.us
      - data.acgov.org, data.baltimorecity.gov, data.energystar.gov
      - data.somervillema.gov, data.maryland.gov, data.taxpayer.net
      - bronx.lehman.cuny.edu
      - data.hawaii.gov, data.sfgov.org
  4. WE ARE SWIMMING IN DATA!
  5. BIG DATA AT OOYALA
      - 2.5 billion analytics pings a day = almost a trillion events a year.
      - Rollup tables: 30 million rows per day
  6. BIG DATA AT SOCRATA
      - Hundreds of datasets, each one up to 30 million rows
      - Customer demand for billion row datasets
  7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA?
      - Flexible: complex queries included
      - Sometimes you can't denormalize your data enough
      - Fast: interactive speeds
  8. RDBMS? POSTGRES?
      - Start hitting latency limits at ~10 million rows
      - No robust and inexpensive solution for querying across shards
      - No robust way to scale horizontally
      - Complex and expensive to improve performance (e.g. rollup tables)
  9. OLAP CUBES?
      - Materialize summary for every possible combination
      - Too complicated and brittle
      - Takes forever to compute
      - Explodes storage and memory
  10. "When in doubt, use brute force." - Ken Thompson
  11. CASSANDRA
      - Horizontally scalable
      - Very flexible data modelling (lists, sets, custom data types)
      - Easy to operate
      - No fear of number of rows or documents
      - Best of breed storage technology, huge community
      - BUT: Simple queries only
  12. APACHE SPARK
      - Horizontally scalable, in-memory queries
      - Functional Scala transforms: map, filter, groupBy, sort, etc.
      - SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform. Feed your SQL results to a logistic regression, easy!
      - THE hottest big data platform, huge community, leaving Hadoop in the dust
      - Developers love it
  13. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  14. INTEGRATING SPARK AND CASSANDRA
      - Scala solutions:
        - Datastax integration (CQL-based): https://github.com/datastax/cassandra-driver-spark
        - Calliope
  15. A bit more work:
      - Use traditional Cassandra client with RDDs
      - Use an existing InputFormat, like CqlPagedInputFormat
  16. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX

          val cassRDD = sc.parallelize(rowkeys).flatMap { rowkey =>
            columnFamily.get(rowkey).execute().asScala
          }
  17. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  18. SEPARATE STORAGE AND QUERY LAYERS
      - Combine best of breed storage and query platforms
      - Take full advantage of evolution of each
      - Storage handles replication for availability
      - Query can replicate data for scaling read concurrency - independent!
  19. SCALE NODES, NOT DEVELOPER TIME!!
  20. KEEPING IT SIMPLE
      - Maximize row scan speed
      - Columnar representation for efficiency
      - Compressed bitmap indexes for fast algebra
      - Functional transforms for easy memoization, testing, concurrency, composition
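
To make the "bitmap indexes for fast algebra" point concrete, here is a minimal sketch, not the engine described in the talk. It assumes one bitmap per distinct column value and uses the standard library's uncompressed `BitSet` in place of a compressed bitmap. The object name `BitmapIndex` and the sample values are hypothetical.

```scala
import scala.collection.immutable.BitSet

// Assumption: an uncompressed BitSet stands in for a compressed bitmap.
object BitmapIndex {
  // For each distinct value, the set of row numbers holding that value.
  def build(column: Seq[String]): Map[String, BitSet] =
    column.zipWithIndex
      .groupBy { case (value, _) => value }
      .map { case (value, hits) => value -> BitSet(hits.map(_._2): _*) }
}
```

With an index per column, a filter such as "crime is Burglary AND area is North" becomes a bitwise intersection, `crimeIndex("Burglary") & areaIndex("North")`, and an OR becomes `|`. Neither needs to touch the row data.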
  21. SPARK AS CASSANDRA'S CACHE
  22. EVEN BETTER: TACHYON OFF-HEAP CACHING
  23. INITIAL ATTEMPTS

          val rows = Seq(
            Seq("Burglary", "19xx Hurston", 10),
            Seq("Theft", "55xx Floatilla Ave", 5)
          )
          sc.parallelize(rows)
            .map { values => (values[0], values) }
            .groupByKey
            .reduce(_[2] + _[2])
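
The slide code above is shorthand rather than compilable Scala. As a rough illustration of the same row-based idea, the sketch below groups untyped rows by their first column and sums the third. It is an assumption-laden stand-in: plain Scala collections replace the RDD so it runs without a cluster, and the object name and the third row are hypothetical.

```scala
// Plain collections stand in for sc.parallelize(rows); no Spark required.
object RowBasedAggregation {
  // Each row is an untyped Seq[Any]: (crime type, block address, count).
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5),
    Seq("Burglary", "20xx Example St", 3) // hypothetical row, added so a group has two members
  )

  // Group by column 0 and sum column 2. Every access needs a cast,
  // and every Int stored in a Seq[Any] is a boxed java.lang.Integer.
  def sumByFirstColumn(data: Seq[Seq[Any]]): Map[String, Int] =
    data
      .groupBy(row => row(0).asInstanceOf[String])
      .map { case (key, group) =>
        key -> group.map(row => row(2).asInstanceOf[Int]).sum
      }
}
```

The casts compile regardless of what the rows actually contain, so a wrong column index or type only shows up at runtime.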
  24. - No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
      - For every row, need to extract out needed columns
      - Ability to select arbitrary columns means using Seq[Any], no type safety
      - Boxing makes integer aggregation very expensive and memory inefficient
  25. COLUMNAR STORAGE AND QUERYING
  26. "The traditional row-based data storage approach is dead." - Michael Stonebraker
  27. TRADITIONAL ROW-BASED STORAGE
      - Same layout in memory and on disk:

          Name      Age
          Barak     46
          Hillary   66

      - Each row is stored contiguously. All columns in row 2 come after row 1.
  28. COLUMNAR STORAGE (MEMORY)
      - Name column

          Row     0    1
          Code    0    1
          Dictionary: {0: "Barak", 1: "Hillary"}

      - Age column

          Row     0    1
          Value   46   66
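
The Name column above can be sketched as a dictionary plus one integer code per row. This is a minimal illustration under assumed names (`DictionaryColumn`, `Encoded`), not the talk's actual data structure.

```scala
object DictionaryColumn {
  // A string column stored as a dictionary plus one small integer code per row.
  final case class Encoded(dictionary: Vector[String], codes: Array[Int]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): Encoded = {
    val dictionary = values.distinct.toVector // codes follow first-seen order
    val codeOf = dictionary.zipWithIndex.toMap
    Encoded(dictionary, values.map(codeOf).toArray)
  }
}
```

Encoding `Seq("Barak", "Hillary")` yields the dictionary `Vector("Barak", "Hillary")` and the codes `0, 1`, matching the slide. A low-cardinality column with millions of rows still needs only one dictionary entry per distinct string.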
  29. COLUMNAR STORAGE (CASSANDRA)
      - Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.
      - SchemaCF

          Rowkey   Type
          Name     StringDict
          Age      Int

      - DataCF

          Rowkey   0    1
          Name     0    1
          Age      46   66
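
One way to picture a DataCF cell is as a chunk of column values packed into a single byte array. The talk does not specify a wire format, so the layout below (a 4-byte count followed by 4-byte big-endian ints) and the name `ColumnChunk` are purely assumptions for illustration.

```scala
import java.nio.ByteBuffer

// Hypothetical cell encoding: [count: Int][value 0: Int][value 1: Int]...
object ColumnChunk {
  def toBytes(values: Array[Int]): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + values.length * 4)
    buf.putInt(values.length)
    values.foreach(v => buf.putInt(v))
    buf.array()
  }

  def fromBytes(bytes: Array[Byte]): Array[Int] = {
    val buf = ByteBuffer.wrap(bytes)
    val count = buf.getInt()
    Array.fill(count)(buf.getInt()) // reads the values back in order
  }
}
```

Under this layout the Age column `46, 66` occupies 12 bytes, and a query that needs only Age reads that one cell.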
  30. ADVANTAGES OF COLUMNAR STORAGE
      - Compression
        - Dictionary compression: HUGE savings for low-cardinality string columns
        - RLE
      - Reduce I/O
        - Only columns needed for query are loaded from disk
      - Can keep strong types in memory, avoid boxing
      - Batch multiple rows in one cell for efficiency
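
For readers unfamiliar with RLE (run-length encoding), here is a minimal sketch over an integer column, for example a column of dictionary codes. The object name `RunLength` is hypothetical and the talk does not describe its own encoder.

```scala
object RunLength {
  // Collapse consecutive equal values into (value, runLength) pairs.
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((last, n)) if last == v => runs.init :+ ((last, n + 1))
        case _                            => runs :+ ((v, 1))
      }
    }

  // Expand (value, runLength) pairs back into the original sequence.
  def decode(runs: Seq[(Int, Int)]): Vector[Int] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```

RLE pays off when equal values sit next to each other, which is common in sorted or low-cardinality columns. A column with no repeated neighbors gets larger instead.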
  31. ADVANTAGES OF COLUMNAR QUERYING
      - Cache locality for aggregating column of data
      - Take advantage of CPU/GPU vector instructions for ints / doubles
      - Avoid row-ifying until last possible moment
      - Easy to derive computed columns
      - Use vector data / linear math libraries
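
A sketch of what aggregating over columns can look like: a GROUP BY sum computed directly on two primitive arrays, with no row objects built. It assumes the group column is already dictionary-encoded to codes in `0 until numGroups`. The name `ColumnarGroupBy` is hypothetical and this is not the talk's engine.

```scala
object ColumnarGroupBy {
  // Sum `values` per group, where groupCodes(i) is the dictionary code of row i.
  // Both inputs are primitive Int arrays, so the loop does no boxing or casting.
  def sumByCode(groupCodes: Array[Int], values: Array[Int], numGroups: Int): Array[Long] = {
    require(groupCodes.length == values.length, "columns must have the same row count")
    val sums = new Array[Long](numGroups)
    var i = 0
    while (i < values.length) {
      sums(groupCodes(i)) += values(i)
      i += 1
    }
    sums
  }
}
```

The loop walks two contiguous arrays in order, which is the cache-friendly access pattern the slide refers to. Mapping the codes back to strings is deferred until the result is returned.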
  32. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA
      - Custom RDD of column-oriented blocks of data
      - Uses ~10x less heap
      - 10-100x faster for group by's on a single node
      - Scan speed in excess of 150M rows/sec/core for integer aggregations
  33. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  34. DATASTAX: CASSANDRA SPARK INTEGRATION
      - Datastax Enterprise now comes with HA Spark
        - HA master, that is.
      - cassandra-driver-spark
  35. SPARK SQL
      - Appeared with Spark 1.0
      - In-memory columnar store
      - Can read from Parquet now; Cassandra integration coming
      - Querying is not column-based (yet)
      - No indexes
      - Write custom functions in Scala.... take that, Hive UDFs!!
      - Integrates well with MLBase, Scala/Java/Python
  36. WORK STILL NEEDED
      - Indexes
      - Columnar querying for fast aggregation
      - Efficient reading from columnar storage formats
  37. GETTING TO A BILLION ROWS / SEC
      - Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
      - 50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
      - ~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology
      - Above is a custom solution, NOT Spark SQL.
      - Look for integration with Spark/SQL for a proper solution
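
The core count above follows from dividing the target rate by the per-core rate. A small sketch of that arithmetic is below. The helper name `coresNeeded` is hypothetical, and it assumes throughput scales linearly with cores, which the slide's extra exchange budget suggests is only approximately true.

```scala
object ThroughputBudget {
  // Cores needed to reach a target scan rate, rounded up (assumes linear scaling).
  def coresNeeded(targetRowsPerSec: Long, rowsPerSecPerCore: Long): Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  // 1 billion rows/sec at the benchmarked 20 million rows/sec/core.
  val localGroupingCores: Long = coresNeeded(1000000000L, 20000000L)

  // Add the slide's 5-10 core budget for distributed exchange.
  val totalCoresLow: Long = localGroupingCores + 5
  val totalCoresHigh: Long = localGroupingCores + 10
}
```

With the slide's figures this gives 50 cores for local grouping and roughly 55 to 60 cores in total.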
  38. LESSONS
      - Extremely fast distributed querying for these use cases:
        - Data doesn't change much (and only bulk changes)
        - Analytical queries for subset of columns
        - Focused on numerical aggregations
        - Small numbers of group bys, limited network interchange of data
      - Spark a bit rough around edges, but evolving fast
      - Concurrent queries are a frontier with Spark. Use additional Spark contexts.
  39. THANK YOU!
  40. SOME COLUMNAR ALTERNATIVES
      - MonetDB and Infobright: true columnar stores (storage + querying)
      - cstore_fdw for Postgres: columnar storage only
      - VoltDB: in-memory distributed columnar database (but need to recompile for DDL changes)
      - Google BigQuery: columnar cloud database, Dremel based
      - Amazon Redshift