OLAP with Spark and Cassandra

Transcript

  • 1. OLAP WITH SPARK AND CASSANDRA EVAN CHAN JULY 2014
  • 2. WHO AM I? Principal Engineer, Socrata, Inc. @evanfchan http://github.com/velvia Creator of Spark Job Server
  • 3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.ca finances.worldbank.org data.cityofchicago.org data.seattle.gov data.oregon.gov data.wa.gov www.metrochicagodata.org data.cityofboston.gov info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov data.nola.gov data.illinois.gov data.colorado.gov data.austintexas.gov data.undp.org www.opendatanyc.com data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it data.montgomerycountymd.gov data.cityofnewyork.us data.acgov.org data.baltimorecity.gov data.energystar.gov data.somervillema.gov data.maryland.gov data.taxpayer.net bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
  • 4. WE ARE SWIMMING IN DATA!
  • 5. BIG DATA AT OOYALA 2.5 billion analytics pings a day = almost a trillion events a year. Rollup tables - 30 million rows per day
  • 6. BIG DATA AT SOCRATA Hundreds of datasets, each one up to 30 million rows. Customer demand for billion row datasets
  • 7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible - complex queries included. Sometimes you can't denormalize your data enough. Fast - interactive speeds
  • 8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows. No robust and inexpensive solution for querying across shards. No robust way to scale horizontally. Complex and expensive to improve performance (e.g. rollup tables)
  • 9. OLAP CUBES? Materialize summary for every possible combination. Too complicated and brittle. Takes forever to compute. Explodes storage and memory
  • 10. When in doubt, use brute force - Ken Thompson
  • 11. CASSANDRA Horizontally scalable. Very flexible data modelling (lists, sets, custom data types). Easy to operate. No fear of number of rows or documents. Best of breed storage technology, huge community. BUT: simple queries only
  • 12. APACHE SPARK Horizontally scalable, in-memory queries. Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! THE hottest big data platform, huge community, leaving Hadoop in the dust. Developers love it
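The transform style the slide names can be sketched on plain Scala collections, which expose the same map/filter/groupBy shape as Spark RDDs (the event data here is invented for illustration):

```scala
// Spark-style functional transforms, sketched on plain Scala collections;
// RDDs expose an analogous map / filter / groupBy / sortBy API
val events = Seq(("Burglary", 10), ("Theft", 5), ("Burglary", 7))
val totals = events
  .filter { case (_, count) => count > 0 }                // keep non-empty buckets
  .groupBy { case (kind, _) => kind }                     // group by event type
  .map { case (kind, vs) => kind -> vs.map(_._2).sum }    // aggregate per group
// totals: Map("Burglary" -> 17, "Theft" -> 5)
```

The same pipeline, swapped onto an RDD, distributes across a cluster without changing its shape - which is the "developers love it" point.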
  • 13. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  • 14. INTEGRATING SPARK AND CASSANDRA Scala solutions: Datastax integration (CQL-based): https://github.com/datastax/cassandra-driver-spark Calliope
  • 15. A bit more work: Use a traditional Cassandra client with RDDs. Use an existing InputFormat, like CqlPagedInputFormat
  • 16. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX
    val cassRDD = sc.parallelize(rowkeys).flatMap { rowkey =>
      columnFamily.get(rowkey).execute().asScala
    }
  • 17. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  • 18. SEPARATE STORAGE AND QUERY LAYERS Combine best of breed storage and query platforms. Take full advantage of evolution of each. Storage handles replication for availability. Query can replicate data for scaling read concurrency - independent!
  • 19. SCALE NODES, NOT DEVELOPER TIME!!
  • 20. KEEPING IT SIMPLE Maximize row scan speed. Columnar representation for efficiency. Compressed bitmap indexes for fast algebra. Functional transforms for easy memoization, testing, concurrency, composition
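The "compressed bitmap indexes for fast algebra" idea can be illustrated with Scala's built-in BitSet - a plain, uncompressed bitmap; a real engine would use an RLE-compressed format. The column values are invented for illustration:

```scala
import scala.collection.immutable.BitSet

// Hypothetical index on one column: for each distinct value,
// a bitmap of the row positions where it occurs
val crimeType = Array("Burglary", "Theft", "Burglary", "Assault", "Theft")
val index: Map[String, BitSet] =
  crimeType.zipWithIndex
    .groupBy(_._1)
    .map { case (v, pairs) => v -> BitSet(pairs.map(_._2): _*) }

// "WHERE type = 'Burglary' OR type = 'Theft'" becomes a bitmap union;
// AND would be an intersection (&) - that is the "fast algebra"
val matching = index("Burglary") | index("Theft")
// matching holds row positions 0, 1, 2, 4
```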
  • 21. SPARK AS CASSANDRA'S CACHE
  • 22. EVEN BETTER: TACHYON OFF-HEAP CACHING
  • 23. INITIAL ATTEMPTS
    val rows = Seq(
      Seq("Burglary", "19xx Hurston", 10),
      Seq("Theft", "55xx Floatilla Ave", 5)
    )
    sc.parallelize(rows)
      .map { values => (values(0), values) }
      .groupByKey
      .reduce(_(2) + _(2))
  • 24. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own. For every row, need to extract out needed columns. Ability to select arbitrary columns means using Seq[Any] - no type safety. Boxing makes integer aggregation very expensive and memory inefficient
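The boxing cost is easy to see in a sketch: Seq[Any] rows like those on the previous slide force casts and boxed Integers, while a typed primitive column does not. This is a minimal illustration, not the talk's actual engine code:

```scala
// Row-based, untyped rows: every value is boxed and must be cast
val rows: Seq[Seq[Any]] = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)
val boxedSum = rows.map(r => r(2).asInstanceOf[Int]).sum  // casts + boxed Integers

// Columnar, typed: the count column is a primitive Int array - no boxing,
// the JIT can compile the sum into a tight loop over ints
val counts: Array[Int] = Array(10, 5)
val primitiveSum = counts.sum
// both sums are 15; the columnar version avoids per-element allocation
```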
  • 25. COLUMNAR STORAGE AND QUERYING
  • 26. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 27. TRADITIONAL ROW-BASED STORAGE Same layout in memory and on disk - Name, Age: (Barak, 46), (Hillary, 66). Each row is stored contiguously. All columns in row 2 come after row 1.
  • 28. COLUMNAR STORAGE (MEMORY) Name column (rows 0, 1): codes 0, 1. Dictionary: {0: "Barak", 1: "Hillary"}. Age column (rows 0, 1): 46, 66.
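A minimal sketch of the dictionary encoding shown on the slide, extending its two names to four rows to show the low-cardinality payoff:

```scala
// Dictionary-encode a low-cardinality string column (illustrative sketch):
// the column stores small integer codes; the dictionary maps code -> string
val names = Array("Barak", "Hillary", "Barak", "Hillary")
val dictionary: Array[String] = names.distinct              // code -> value
val codes: Array[Int] = names.map(n => dictionary.indexOf(n)) // stored column
// decoding is a cheap array lookup
val decoded: Array[String] = codes.map(i => dictionary(i))
// codes: Array(0, 1, 0, 1) - four strings stored as four small ints + 2 entries
```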
  • 29. COLUMNAR STORAGE (CASSANDRA) Review: each physical row in Cassandra (i.e. a "partition key") stores its columns together on disk. Schema CF (rowkey → type): Name → StringDict, Age → Int. Data CF (values at row positions 0, 1): Name → 0, 1; Age → 46, 66.
  • 30. ADVANTAGES OF COLUMNAR STORAGE Compression: dictionary compression - HUGE savings for low-cardinality string columns; RLE. Reduce I/O: only columns needed for query are loaded from disk. Can keep strong types in memory, avoid boxing. Batch multiple rows in one cell for efficiency
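Run-length encoding (RLE), mentioned above, can be sketched in a few lines - an illustrative toy encoder, not a production format:

```scala
// Run-length encoding sketch: collapse runs of repeated values into
// (value, runLength) pairs - effective on sorted or mostly-constant columns
def rle[A](xs: List[A]): List[(A, Int)] = xs match {
  case Nil => Nil
  case head :: _ =>
    val (run, rest) = xs.span(_ == head)
    (head, run.length) :: rle(rest)
}

val encoded = rle(List(0, 0, 0, 7, 7, 0))
// encoded: List((0,3), (7,2), (0,1)) - six values stored as three pairs
```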
  • 31. ADVANTAGES OF COLUMNAR QUERYING Cache locality for aggregating column of data. Take advantage of CPU/GPU vector instructions for ints / doubles. Avoid row-ifying until last possible moment. Easy to derive computed columns. Use vector data / linear math libraries
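A tiny sketch of why columnar aggregation is cache-friendly: a GROUP BY reduces to tight loops over primitive arrays. The column values here are invented for illustration:

```scala
// Columnar GROUP BY sketch: aggregate the count column keyed by a
// dictionary-encoded type column, scanning two primitive arrays
// sequentially instead of materializing row objects
val typeCodes = Array(0, 1, 0, 1)     // group-by column, dictionary codes 0 and 1
val groupCounts = Array(10, 5, 7, 3)  // column being summed
val sums = new Array[Int](2)          // one accumulator per dictionary code
var i = 0
while (i < typeCodes.length) {        // cache-friendly sequential column scans
  sums(typeCodes(i)) += groupCounts(i)
  i += 1
}
// sums(0) == 17, sums(1) == 8
```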
  • 32. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA Custom RDD of column-oriented blocks of data. Uses ~10x less heap. 10-100x faster for group by's on a single node. Scan speed in excess of 150M rows/sec/core for integer aggregations
  • 33. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  • 34. DATASTAX: CASSANDRA SPARK INTEGRATION Datastax Enterprise now comes with HA Spark. HA master, that is. cassandra-driver-spark
  • 35. SPARK SQL Appeared with Spark 1.0. In-memory columnar store. Can read from Parquet now; Cassandra integration coming. Querying is not column-based (yet). No indexes. Write custom functions in Scala... take that, Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 36. WORK STILL NEEDED Indexes. Columnar querying for fast aggregation. Efficient reading from columnar storage formats
  • 37. GETTING TO A BILLION ROWS / SEC Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core. 50 cores needed for parallel localized grouping throughput of 1 billion rows. ~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology. Above is a custom solution, NOT Spark SQL. Look for integration with Spark SQL for a proper solution
  • 38. LESSONS Extremely fast distributed querying for these use cases: data doesn't change much (and only bulk changes); analytical queries for a subset of columns; focused on numerical aggregations; small numbers of group bys, limited network interchange of data. Spark a bit rough around edges, but evolving fast. Concurrent queries is a frontier with Spark. Use additional Spark contexts.
  • 39. THANK YOU!
  • 40. SOME COLUMNAR ALTERNATIVES MonetDB and Infobright - true columnar stores (storage + querying). cstore_fdw for Postgres - columnar storage only. VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes). Google BigQuery - columnar cloud database, Dremel-based. Amazon Redshift