OLAP with Cassandra and Spark

How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.



  1. OLAP WITH SPARK AND CASSANDRA: EVAN CHAN, JULY 2014
  2. WHO AM I? Principal Engineer, Socrata, Inc.; @evanfchan; http://github.com/velvia; creator of Spark Job Server
  3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.ca, finances.worldbank.org, data.cityofchicago.org, data.seattle.gov, data.oregon.gov, data.wa.gov, www.metrochicagodata.org, data.cityofboston.gov, info.samhsa.gov, explore.data.gov, data.cms.gov, data.ok.gov, data.nola.gov, data.illinois.gov, data.colorado.gov, data.austintexas.gov, data.undp.org, www.opendatanyc.com, data.mo.gov, data.nfpa.org, data.raleighnc.gov, dati.lombardia.it, data.montgomerycountymd.gov, data.cityofnewyork.us, data.acgov.org, data.baltimorecity.gov, data.energystar.gov, data.somervillema.gov, data.maryland.gov, data.taxpayer.net, bronx.lehman.cuny.edu, data.hawaii.gov, data.sfgov.org
  4. WE ARE SWIMMING IN DATA!
  5. BIG DATA AT OOYALA: 2.5 billion analytics pings a day = almost a trillion events a year. Rollup tables: 30 million rows per day.
  6. BIG DATA AT SOCRATA: Hundreds of datasets, each one up to 30 million rows. Customer demand for billion-row datasets.
  7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible: complex queries included; sometimes you can't denormalize your data enough. Fast: interactive speeds.
  8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows. No robust and inexpensive solution for querying across shards. No robust way to scale horizontally. Complex and expensive to improve performance (e.g. rollup tables).
  9. OLAP CUBES? Materialize summary for every possible combination. Too complicated and brittle. Takes forever to compute. Explodes storage and memory.
  10. When in doubt, use brute force - Ken Thompson
  11. CASSANDRA: Horizontally scalable. Very flexible data modelling (lists, sets, custom data types). Easy to operate. No fear of number of rows or documents. Best of breed storage technology, huge community. BUT: simple queries only.
  12. APACHE SPARK: Horizontally scalable, in-memory queries. Functional Scala transforms: map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform: feed your SQL results to a logistic regression, easy! THE hottest big data platform, huge community, leaving Hadoop in the dust. Developers love it.
  13. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  14. INTEGRATING SPARK AND CASSANDRA: Scala solutions: Datastax integration (CQL-based), https://github.com/datastax/cassandra-driver-spark; Calliope.
  15. A bit more work: Use traditional Cassandra client with RDDs. Use an existing InputFormat, like CqlPagedInputFormat.
  16. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX:
        val cassRDD = sc.parallelize(rowkeys).
          flatMap { rowkey =>
            columnFamily.get(rowkey).execute().asScala
          }
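
The shape of the custom integration above (spread the row keys out, then flatMap each key to the columns it owns) can be sketched without a cluster. This is only an illustration under stated assumptions: plain Scala collections stand in for the RDD, and `Column`, `columnFamily` and `fetch` are hypothetical names, not the Astyanax or Spark API.

```scala
// Sketch only: plain Scala collections stand in for Spark and Astyanax.
object RowKeyFanOut {
  // Hypothetical column type; a real client returns its own column objects.
  final case class Column(name: String, value: String)

  // Hypothetical in-memory "column family" keyed by row key.
  val columnFamily: Map[String, Seq[Column]] = Map(
    "row1" -> Seq(Column("a", "1"), Column("b", "2")),
    "row2" -> Seq(Column("a", "3"))
  )

  // Stand-in for the per-key client lookup shown on the slide.
  def fetch(rowkey: String): Seq[Column] =
    columnFamily.getOrElse(rowkey, Seq.empty)

  // Same shape as the slide: one lookup per key, results flattened together.
  // With Spark, rowkeys would be an RDD and the lookups would run on executors.
  def readAll(rowkeys: Seq[String]): Seq[Column] =
    rowkeys.flatMap(fetch)
}
```

Each key turns into one client read, so the parallelism of this approach is bounded by how the row keys are partitioned.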
  17. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  18. SEPARATE STORAGE AND QUERY LAYERS: Combine best of breed storage and query platforms. Take full advantage of evolution of each. Storage handles replication for availability. Query can replicate data for scaling read concurrency - independent!
  19. SCALE NODES, NOT DEVELOPER TIME!!
  20. KEEPING IT SIMPLE: Maximize row scan speed. Columnar representation for efficiency. Compressed bitmap indexes for fast algebra. Functional transforms for easy memoization, testing, concurrency, composition.
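
To make "bitmap indexes for fast algebra" concrete, here is a minimal sketch of the idea. It assumes an uncompressed `scala.collection.immutable.BitSet` in place of the compressed bitmaps the slide mentions, and the column names and values are made up for illustration.

```scala
import scala.collection.immutable.BitSet

// Sketch only: uncompressed BitSets stand in for compressed bitmap indexes.
object BitmapIndexSketch {
  // Build one bitmap per distinct value: bit i is set when row i has that value.
  def index(column: Seq[String]): Map[String, BitSet] =
    column.zipWithIndex
      .groupBy(_._1)
      .map { case (value, pairs) => value -> BitSet(pairs.map(_._2): _*) }

  // Illustrative data, not from the talk.
  val crimeType = Seq("Burglary", "Theft", "Theft", "Burglary")
  val district  = Seq("North", "North", "South", "South")

  val byType     = index(crimeType)
  val byDistrict = index(district)

  // type = 'Theft' AND district = 'North' becomes a bitwise AND.
  val theftInNorth: BitSet = byType("Theft") & byDistrict("North")
  // type = 'Theft' OR district = 'North' becomes a bitwise OR.
  val theftOrNorth: BitSet = byType("Theft") | byDistrict("North")
}
```

The filter is resolved to a set of row positions before any column data is read, which is what makes this kind of index a good fit for a scan-oriented engine.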
  21. SPARK AS CASSANDRA'S CACHE
  22. EVEN BETTER: TACHYON OFF-HEAP CACHING
  23. INITIAL ATTEMPTS:
        val rows = Seq(
          Seq("Burglary", "19xx Hurston", 10),
          Seq("Theft", "55xx Floatilla Ave", 5)
        )
        sc.parallelize(rows)
          .map { values => (values(0), values) }
          .groupByKey
          .mapValues(_.map(_(2).asInstanceOf[Int]).sum)
  24. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own. For every row, need to extract out needed columns. Ability to select arbitrary columns means using Seq[Any], no type safety. Boxing makes integer aggregation very expensive and memory inefficient.
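
The cost of the row-based attempt is easier to see in a version that runs without Spark. The sketch below is an assumption-laden illustration using plain Scala collections; the third row is made up so that the grouping has something to combine.

```scala
// Sketch only: plain collections instead of an RDD.
object RowBasedSketch {
  // Each row is an untyped Seq[Any]: (crime type, block, count).
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5),
    Seq("Theft", "12xx Main St", 3) // made-up row for illustration
  )

  // Sum the count column per crime type. Every access needs a cast, and every
  // count lives in the row as a boxed Integer until it is unboxed to be added.
  def countsByType(data: Seq[Seq[Any]]): Map[String, Int] =
    data
      .groupBy(row => row(0).asInstanceOf[String])
      .map { case (crimeType, group) =>
        crimeType -> group.map(row => row(2).asInstanceOf[Int]).sum
      }
}
```

Nothing stops a caller from putting a String in position 2, so mistakes surface only at runtime as a `ClassCastException`.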
  25. COLUMNAR STORAGE AND QUERYING
  26. The traditional row-based data storage approach is dead - Michael Stonebraker
  27. TRADITIONAL ROW-BASED STORAGE: Same layout in memory and on disk:

        Name     Age
        Barak    46
        Hillary  66

      Each row is stored contiguously. All columns in row 2 come after row 1.
  28. COLUMNAR STORAGE (MEMORY):

        Name column
          row:    0    1
          code:   0    1
          Dictionary: {0: "Barak", 1: "Hillary"}

        Age column
          row:    0    1
          value:  46   66
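
The in-memory layout above can be sketched as follows. `DictColumn` and `encode` are hypothetical names chosen for this illustration, not the types used in the talk's engine.

```scala
// Sketch only: a dictionary-encoded string column next to a primitive int column.
object DictionaryColumnSketch {
  // Small integer codes plus a lookup table from code to string.
  final case class DictColumn(codes: Array[Int], dictionary: Vector[String]) {
    def apply(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toVector // codes follow first-seen order
    val codeOf     = dictionary.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, dictionary)
  }

  // The two columns from the slide.
  val names: DictColumn = encode(Seq("Barak", "Hillary"))
  val ages: Array[Int]  = Array(46, 66) // numeric column stays a primitive array
}
```

With only a few distinct strings, each row costs one small integer in the Name column regardless of how long the strings are.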
  29. COLUMNAR STORAGE (CASSANDRA): Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

        Schema CF
          Rowkey   Type
          Name     StringDict
          Age      Int

        Data CF
          Rowkey   0    1
          Name     0    1
          Age      46   66
  30. ADVANTAGES OF COLUMNAR STORAGE: Compression: dictionary compression (HUGE savings for low-cardinality string columns), RLE. Reduce I/O: only columns needed for query are loaded from disk. Can keep strong types in memory, avoid boxing. Batch multiple rows in one cell for efficiency.
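
RLE (run-length encoding) is named above without detail, so here is a minimal sketch of the technique. It is a generic textbook version and makes no claim about how the talk's engine encodes runs.

```scala
// Sketch only: generic run-length encoding of a column.
object RunLengthSketch {
  // Encode a column as (value, runLength) pairs, in column order.
  def encode[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, count) :: rest, next) if value == next => (value, count + 1) :: rest
      case (runs, next)                                    => (next, 1) :: runs
    }.reverse

  // Expand the runs back into the original column.
  def decode[A](runs: Seq[(A, Int)]): Seq[A] =
    runs.flatMap { case (value, count) => Seq.fill(count)(value) }
}
```

RLE pays off when equal values sit next to each other, which is common in a sorted or low-cardinality column and rare in a row-oriented layout.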
  31. ADVANTAGES OF COLUMNAR QUERYING: Cache locality for aggregating column of data. Take advantage of CPU/GPU vector instructions for ints / doubles. Avoid row-ifying until last possible moment. Easy to derive computed columns. Use vector data / linear math libraries.
  32. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA: Custom RDD of column-oriented blocks of data. Uses ~10x less heap. 10-100x faster for group by's on a single node. Scan speed in excess of 150M rows/sec/core for integer aggregations.
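
A group-by over columnar data can be sketched as one pass over two flat arrays. The example below is an illustration under assumptions: the data is made up, the names are hypothetical, and it shows the general technique rather than the custom RDD described in the talk.

```scala
// Sketch only: GROUP BY over primitive arrays, no per-row objects.
object ColumnarGroupBySketch {
  // Group key column, dictionary-encoded: codes index into `dictionary`.
  val dictionary: Array[String] = Array("Burglary", "Theft")
  val typeCodes: Array[Int]     = Array(0, 1, 1, 0, 1)
  // Measure column kept as a primitive array, so nothing is boxed in the loop.
  val counts: Array[Int]        = Array(10, 5, 3, 2, 1)

  // SELECT type, SUM(count) GROUP BY type: one pass over two flat arrays.
  def sumByCode(codes: Array[Int], values: Array[Int], cardinality: Int): Array[Long] = {
    val sums = new Array[Long](cardinality)
    var i = 0
    while (i < codes.length) {
      sums(codes(i)) += values(i)
      i += 1
    }
    sums
  }

  // Only the final, small result is turned back into rows.
  def result: Map[String, Long] =
    dictionary.zip(sumByCode(typeCodes, counts, dictionary.length)).toMap
}
```

The accumulator is indexed directly by dictionary code, so the inner loop does no hashing and touches memory sequentially.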
  33. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  34. DATASTAX: CASSANDRA SPARK INTEGRATION: Datastax Enterprise now comes with HA Spark (HA master, that is). cassandra-driver-spark.
  35. SPARK SQL: Appeared with Spark 1.0. In-memory columnar store. Can read from Parquet now; Cassandra integration coming. Querying is not column-based (yet). No indexes. Write custom functions in Scala.... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python.
  36. WORK STILL NEEDED: Indexes. Columnar querying for fast aggregation. Efficient reading from columnar storage formats.
  37. GETTING TO A BILLION ROWS / SEC: Benchmarked at 20 million rows/sec per core, GROUP BY on two columns, aggregating two more columns. 50 cores needed for parallel localized grouping throughput of 1 billion rows/sec. ~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology. Above is a custom solution, NOT Spark SQL. Look for integration with Spark/SQL for a proper solution.
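
The core budget above is simple arithmetic, worked through here as a small sketch. The figures come from the slide and the names are made up for illustration.

```scala
// Sketch only: the slide's core budget as arithmetic.
object ThroughputBudget {
  val rowsPerSecPerCore = 20000000L   // benchmark figure from the slide
  val targetRowsPerSec  = 1000000000L // one billion rows per second

  // Cores needed for the local grouping stage, rounded up.
  val groupingCores: Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  // The slide budgets roughly 5 to 10 extra cores for the distributed exchange.
  val totalCoresLow: Long  = groupingCores + 5
  val totalCoresHigh: Long = groupingCores + 10
}
```

This assumes throughput scales linearly with cores, which is the premise of the slide's estimate rather than a measured result.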
  38. LESSONS: Extremely fast distributed querying for these use cases: data doesn't change much (and only bulk changes); analytical queries for subset of columns; focused on numerical aggregations; small numbers of group bys, limited network interchange of data. Spark a bit rough around edges, but evolving fast. Concurrent queries is a frontier with Spark. Use additional Spark contexts.
  39. THANK YOU!
  40. SOME COLUMNAR ALTERNATIVES: Monetdb and Infobright: true columnar stores (storage + querying). Cstore-fdw for PostGres: columnar storage only. VoltDB: in-memory distributed columnar database (but need to recompile for DDL changes). Google BigQuery: columnar cloud database, Dremel based. Amazon RedShift.