OLAP WITH SPARK AND
CASSANDRA
EVAN CHAN
JULY 2014
WHO AM I?
Principal Engineer, Socrata, Inc.
@evanfchan
http://github.com/velvia
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE
PEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Rollup tables - 30 million rows per day
BIG DATA AT SOCRATA
Hundreds of datasets, each one up to 30 million rows
Customer demand for billion row datasets
HOW CAN WE ALLOW CUSTOMERS TO QUERY A
YEAR'S WORTH OF DATA?
Flexible - complex queries included
Sometimes you can't denormalize your data enough
Fast - interactive speeds
RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Complex and expensive to improve performance (e.g. rollup tables)
OLAP CUBES?
Materialize summary for every possible combination
Too complicated and brittle
Takes forever to compute
Explodes storage and memory
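To make the "every possible combination" problem concrete, here is a rough sketch of how fast a fully materialized cube grows. The object and method names are illustrative assumptions, not part of any OLAP product; the formulas are the standard counts for a full cube (2^d group-by combinations, and an upper bound of the product of each dimension's cardinality plus one "ALL" value).

```scala
object CubeSize {
  // Number of distinct GROUP BY combinations (cuboids) over d dimensions: 2^d.
  def cuboidCount(dimensions: Int): BigInt = BigInt(2).pow(dimensions)

  // Upper bound on cells in a full cube: each dimension contributes its
  // distinct values plus one rolled-up "ALL" value.
  def maxCells(cardinalities: Seq[Int]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product
}
```

With only three dimensions of 9, 99 and 999 distinct values, the bound is already one million cells, and every added dimension doubles the number of summaries to maintain.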
When in doubt, use brute force
- Ken Thompson
CASSANDRA
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
No fear of number of rows or documents
Best of breed storage technology, huge community
BUT: Simple queries only
APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
SPARK PROVIDES THE MISSING FAST, DEEP
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA
Scala solutions:
Datastax integration: https://github.com/datastax/cassandra-driver-spark (CQL-based)
Calliope
A bit more work:
Use traditional Cassandra client with RDDs
Use an existing InputFormat, like CqlPagingInputFormat
EXAMPLE CUSTOM INTEGRATION USING
ASTYANAX
// Fetch each row key through the Astyanax client and flatten the results into one RDD
val cassRDD = sc.parallelize(rowkeys).
  flatMap { rowkey =>
    columnFamily.get(rowkey).execute().asScala
  }
A SPARK AND CASSANDRA
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency - independent!
SCALE NODES, NOT
DEVELOPER TIME!!
KEEPING IT SIMPLE
Maximize row scan speed
Columnar representation for efficiency
Compressed bitmap indexes for fast algebra
Functional transforms for easy memoization, testing, concurrency, composition
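As a rough illustration of "bitmap indexes for fast algebra", the sketch below builds one bitmap per distinct column value and answers a two-column filter with a bitwise AND. It uses the standard library's uncompressed `BitSet` as a stand-in; the talk refers to compressed bitmaps, which this does not implement. The object names and the sample columns are hypothetical.

```scala
import scala.collection.immutable.BitSet

object BitmapIndex {
  // One bitmap per distinct value: bit i is set when row i holds that value.
  def build(column: Seq[String]): Map[String, BitSet] =
    column.zipWithIndex
      .groupBy { case (value, _) => value }
      .map { case (value, hits) => value -> BitSet(hits.map(_._2): _*) }
}

object BitmapExample {
  // Hypothetical columns, row-aligned by position.
  val crimeType = Seq("Burglary", "Theft", "Theft", "Burglary")
  val ward      = Seq("A", "A", "B", "B")

  val typeIdx = BitmapIndex.build(crimeType)
  val wardIdx = BitmapIndex.build(ward)

  // WHERE type = 'Theft' AND ward = 'A' becomes a bitwise AND of two bitmaps.
  val matches: BitSet = typeIdx("Theft") & wardIdx("A")
}
```

OR and NOT filters map to `|` and `&~` in the same way, which is why filter predicates compose cheaply once the bitmaps exist.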
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
INITIAL ATTEMPTS
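val rows = Seq(
  Seq("Burglary", "19xxHurston", 10),
  Seq("Theft", "55xxFloatillaAve", 5)
)
sc.parallelize(rows)
  .map { values => (values(0), values) }          // key each row by its first column
  .groupByKey
  .mapValues(_.map(_(2).asInstanceOf[Int]).sum)   // sum the third column per key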
No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
For every row, need to extract out needed columns
Ability to select arbitrary columns means using Seq[Any], no type safety
Boxing makes integer aggregation very expensive and memory inefficient
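The type-safety and boxing problems can be seen without Spark at all. The following is a minimal plain-Scala sketch of the same row-based aggregation; the object name is an assumption, and the third row is a hypothetical addition so that one group has more than one member.

```scala
object RowBased {
  // Each row is a Seq[Any]; the compiler knows nothing about column types.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xxHurston", 10),
    Seq("Theft", "55xxFloatillaAve", 5),
    Seq("Theft", "19xxHurston", 7)        // hypothetical extra row
  )

  // GROUP BY column 0, SUM(column 2). Every value is a boxed object that
  // must be cast at runtime; a wrong index or type only fails when run.
  def sumByFirstColumn(data: Seq[Seq[Any]]): Map[String, Int] =
    data
      .groupBy(row => row(0).asInstanceOf[String])
      .map { case (key, group) => key -> group.map(_(2).asInstanceOf[Int]).sum }
}
```

Changing `_(2)` to `_(1)` still compiles and then throws a `ClassCastException` at runtime, which is the missing type safety the slide describes.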
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage approach is dead
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:
Name     Age
Barak    46
Hillary  66
Each row is stored contiguously. All columns in row 2 come after row 1.
COLUMNAR STORAGE (MEMORY)
Name column
Row    0   1
Code   0   1
Dictionary: {0: "Barak", 1: "Hillary"}
Age column
Row    0   1
Value  46  66
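The dictionary-encoded name column above could be built roughly as follows. This is a sketch only; `DictionaryColumn`, `Encoded` and `encode` are illustrative names, and assigning codes in order of first appearance is an assumption that happens to reproduce the slide's dictionary.

```scala
object DictionaryColumn {
  // Encoded string column: small integer codes plus a code -> string dictionary.
  final case class Encoded(codes: Array[Int], dictionary: Vector[String]) {
    def apply(row: Int): String = dictionary(codes(row))
  }

  def encode(values: Seq[String]): Encoded = {
    val dictionary = values.distinct.toVector        // code = position of first appearance
    val codeOf     = dictionary.zipWithIndex.toMap   // string -> code lookup
    Encoded(values.map(codeOf).toArray, dictionary)
  }
}
```

Each distinct string is stored once, and the column itself becomes a primitive `Array[Int]`.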
COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.
Schema CF
Rowkey  Type
Name    StringDict
Age     Int
Data CF
Rowkey  0   1
Name    0   1
Age     46  66
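One way to picture this layout is as a pivot: rows are turned into one sequence per column, and each sequence is what would be stored under that column's row key. The sketch below models only that pivot in memory, with hypothetical `Person` and `Columns` types; it does not talk to Cassandra.

```scala
object ColumnarPivot {
  final case class Person(name: String, age: Int)

  // One field per column; each holds that column's values ordered by row index.
  final case class Columns(name: Vector[String], age: Vector[Int])

  def pivot(rows: Seq[Person]): Columns =
    Columns(rows.map(_.name).toVector, rows.map(_.age).toVector)

  // Reading one column never touches the others.
  def averageAge(cols: Columns): Double =
    if (cols.age.isEmpty) 0.0 else cols.age.sum.toDouble / cols.age.size
}
```

A query such as the average age then reads only the `age` sequence, which corresponds to fetching a single physical row from the data column family.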
ADVANTAGES OF COLUMNAR STORAGE
Compression
Dictionary compression - HUGE savings for low-cardinality string columns
RLE (run-length encoding)
Reduce I/O
Only columns needed for query are loaded from disk
Can keep strong types in memory, avoid boxing
Batch multiple rows in one cell for efficiency
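For readers unfamiliar with RLE, here is a minimal run-length encoding sketch. It is a generic textbook version, not the encoding used in the talk's engine, and the names are illustrative.

```scala
object RunLength {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](values: Seq[A]): Vector[(A, Int)] =
    values.foldLeft(Vector.empty[(A, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((prev, n)) if prev == v => runs.updated(runs.length - 1, (v, n + 1))
        case _                            => runs :+ ((v, 1))
      }
    }

  // Expand the runs back into the original sequence.
  def decode[A](runs: Seq[(A, Int)]): Vector[A] =
    runs.flatMap { case (v, n) => Vector.fill(n)(v) }.toVector
}
```

RLE pays off most on sorted or low-cardinality columns, for example dictionary codes, where long runs of the same value are common.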
ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
COLUMNAR QUERY ENGINE VS ROW-BASED IN
SCALA
Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group by's on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
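A column-oriented group-by can be sketched as a tight loop over primitive arrays. This is not the custom RDD from the talk, only an illustration of the idea under two assumptions: the group key is already dictionary-encoded into codes `0 until numKeys`, and both columns are row-aligned. No performance figure is implied by this sketch.

```scala
object ColumnarGroupBy {
  // SELECT key, SUM(value) GROUP BY key, where `keyCodes` holds dictionary
  // codes in 0 until numKeys and `values` is row-aligned with it.
  def sumByKey(keyCodes: Array[Int], values: Array[Int], numKeys: Int): Array[Long] = {
    require(keyCodes.length == values.length, "columns must be row-aligned")
    val sums = new Array[Long](numKeys)
    var i = 0
    while (i < values.length) {      // primitive arrays only: no boxing, sequential reads
      sums(keyCodes(i)) += values(i)
      i += 1
    }
    sums
  }
}
```

The result is indexed by key code, so `sums(0)` is the total for dictionary entry 0.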
SO, GREAT, OLAP WITH CASSANDRA AND
SPARK. NOW WHAT?
DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
HA master, that is.
cassandra-driver-spark
SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet now; Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala.... take that Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python
WORK STILL NEEDED
Indexes
Columnar querying for fast aggregation
Efficient reading from columnar storage formats
GETTING TO A BILLION ROWS / SEC
Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology
Above is a custom solution, NOT Spark SQL.
Look for integration with Spark/SQL for a proper solution
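The core budget above is simple division; the sketch below spells it out using the slide's own figures (20 million rows/sec per core, a 1 billion rows/sec target, 5-10 extra cores for the exchange phase). The object and method names are illustrative, and linear scaling across cores is an assumption.

```scala
object CoreBudget {
  // Cores needed for the local grouping phase, rounded up.
  def localCores(targetRowsPerSec: Long, rowsPerSecPerCore: Long): Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  // Total range once the distributed exchange budget is added.
  def totalRange(local: Long, exchangeMin: Long, exchangeMax: Long): (Long, Long) =
    (local + exchangeMin, local + exchangeMax)
}
```

With the slide's numbers this gives 50 local cores and a total of roughly 55 to 60 cores.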
LESSONS
Extremely fast distributed querying for these use cases
Data doesn't change much (and only bulk changes)
Analytical queries for subset of columns
Focused on numerical aggregations
Small numbers of group bys, limited network interchange of data
Spark a bit rough around edges, but evolving fast
Concurrent queries is a frontier with Spark. Use additional Spark contexts.
THANK YOU!
SOME COLUMNAR
ALTERNATIVES
MonetDB and Infobright - true columnar stores (storage + querying)
cstore_fdw for Postgres - columnar storage only
VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
Google BigQuery - columnar cloud database, Dremel based
Amazon Redshift

 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureEvan Chan
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server TalkEvan Chan
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkEvan Chan
 

More from Evan Chan (10)

Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
Designing Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
 
Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
MIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
 
Spark Summit 2014: Spark Job Server Talk
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Real-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
 

OLAP with Cassandra and Spark

  • 1. OLAP WITH SPARK AND CASSANDRA EVAN CHAN JULY 2014
  • 2. WHO AM I? Principal Engineer, Socrata, Inc. @evanfchan http://github.com/velvia Creator of Spark Job Server
  • 3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.ca finances.worldbank.org data.cityofchicago.org data.seattle.gov data.oregon.gov data.wa.gov www.metrochicagodata.org data.cityofboston.gov info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov data.nola.gov data.illinois.gov data.colorado.gov data.austintexas.gov data.undp.org www.opendatanyc.com data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it data.montgomerycountymd.gov data.cityofnewyork.us data.acgov.org data.baltimorecity.gov data.energystar.gov data.somervillema.gov data.maryland.gov data.taxpayer.net bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
  • 4. WE ARE SWIMMING IN DATA!
  • 5. BIG DATA AT OOYALA 2.5 billion analytics pings a day = almost a trillion events a year. Rollup tables - 30 million rows per day
  • 6. BIG DATA AT SOCRATA Hundreds of datasets, each one up to 30 million rows Customer demand for billion row datasets
  • 7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible - complex queries included Sometimes you can't denormalize your data enough Fast - interactive speeds
  • 8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows No robust and inexpensive solution for querying across shards No robust way to scale horizontally Complex and expensive to improve performance (e.g. rollup tables)
  • 9. OLAP CUBES? Materialize summary for every possible combination Too complicated and brittle Takes forever to compute Explodes storage and memory
  • 10. When in doubt, use brute force - Ken Thompson
  • 11.
  • 12. CASSANDRA Horizontally scalable Very flexible data modelling (lists, sets, custom data types) Easy to operate No fear of number of rows or documents Best of breed storage technology, huge community BUT: Simple queries only
  • 13. APACHE SPARK Horizontally scalable, in-memory queries Functional Scala transforms - map, filter, groupBy, sort etc. SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! THE Hottest big data platform, huge community, leaving Hadoop in the dust Developers love it
  • 14. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  • 15. INTEGRATING SPARK AND CASSANDRA Scala solutions: Datastax integration: (CQL-based) https://github.com/datastax/cassandra-driver-spark Calliope
  • 16. A bit more work: Use traditional Cassandra client with RDDs Use an existing InputFormat, like CqlPagedInputFormat
  • 17. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX val cassRDD = sc.parallelize(rowkeys).flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
  • 18. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  • 19. SEPARATE STORAGE AND QUERY LAYERS Combine best of breed storage and query platforms Take full advantage of evolution of each Storage handles replication for availability Query can replicate data for scaling read concurrency - independent!
  • 21. KEEPING IT SIMPLE Maximize row scan speed Columnar representation for efficiency Compressed bitmap indexes for fast algebra Functional transforms for easy memoization, testing, concurrency, composition
  • 23. EVEN BETTER: TACHYON OFF-HEAP CACHING
  • 25. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own For every row, need to extract out needed columns Ability to select arbitrary columns means using Seq[Any], no type safety Boxing makes integer aggregation very expensive and memory inefficient
  • 27. The traditional row-based data storage approach is dead - Michael Stonebraker
  • 28. TRADITIONAL ROW-BASED STORAGE Same layout in memory and on disk: Name | Age; Barak | 46; Hillary | 66. Each row is stored contiguously. All columns in row 2 come after row 1.
  • 29. COLUMNAR STORAGE (MEMORY) Name column: row 0 -> 0, row 1 -> 1. Dictionary: {0: "Barak", 1: "Hillary"} Age column: row 0 -> 46, row 1 -> 66
  • 30. COLUMNAR STORAGE (CASSANDRA) Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk. Schema CF: Rowkey | Type; Name | StringDict; Age | Int. Data CF: Rowkey | 0 | 1; Name | 0 | 1; Age | 46 | 66
  • 31. ADVANTAGES OF COLUMNAR STORAGE Compression Dictionary compression - HUGE savings for low-cardinality string columns RLE Reduce I/O Only columns needed for query are loaded from disk Can keep strong types in memory, avoid boxing Batch multiple rows in one cell for efficiency
  • 32. ADVANTAGES OF COLUMNAR QUERYING Cache locality for aggregating column of data Take advantage of CPU/GPU vector instructions for ints / doubles avoid row-ifying until last possible moment easy to derive computed columns Use vector data / linear math libraries
  • 33. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA Custom RDD of column-oriented blocks of data Uses ~10x less heap 10-100x faster for group by's on a single node Scan speed in excess of 150M rows/sec/core for integer aggregations
  • 34. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  • 35.
  • 36. DATASTAX: CASSANDRA SPARK INTEGRATION Datastax Enterprise now comes with HA Spark HA master, that is. cassandra-driver-spark
  • 37. SPARK SQL Appeared with Spark 1.0 In-memory columnar store Can read from Parquet now; Cassandra integration coming Querying is not column-based (yet) No indexes Write custom functions in Scala.... take that Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 38. WORK STILL NEEDED Indexes Columnar querying for fast aggregation Efficient reading from columnar storage formats
  • 39. GETTING TO A BILLION ROWS / SEC Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core. 50 cores needed for parallel localized grouping throughput of 1 billion rows ~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology Above is a custom solution, NOT Spark SQL. Look for integration with Spark/SQL for a proper solution
  • 40. LESSONS Extremely fast distributed querying for these use cases Data doesn't change much (and only bulk changes) Analytical queries for subset of columns Focused on numerical aggregations Small numbers of group bys, limited network interchange of data Spark a bit rough around edges, but evolving fast Concurrent queries is a frontier with Spark. Use additional Spark contexts.
  • 42. SOME COLUMNAR ALTERNATIVES Monetdb and Infobright - true columnar stores (storage + querying) Cstore-fdw for PostGres - columnar storage only VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes) Google BigQuery - columnar cloud database, Dremel based Amazon RedShift
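
Slides 29 to 32 describe dictionary-encoded string columns and typed, unboxed numeric columns. The following is a minimal plain-Scala sketch of that idea, not the custom query engine from the talk: the names `ColumnarSketch`, `DictColumn`, `encode` and `sumByKey` are hypothetical, and the code uses no Spark or Cassandra at all. It only illustrates why a group-by over primitive arrays can avoid the `Seq[Any]` boxing mentioned on slide 25.

```scala
// Illustrative sketch only; every name here is hypothetical, not from the talk's code.
object ColumnarSketch {
  // A dictionary-encoded string column: one small Int code per row,
  // plus a lookup table from code to the original string.
  final case class DictColumn(codes: Array[Int], dictionary: Array[String])

  // Assign codes in order of first appearance, e.g. "Barak" -> 0, "Hillary" -> 1.
  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toArray
    val codeOf = dictionary.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, dictionary)
  }

  // GROUP BY key, SUM(value) over two parallel columns.
  // The loop touches only primitive arrays, so no row objects or boxed Ints are created.
  def sumByKey(keys: DictColumn, values: Array[Int]): Map[String, Long] = {
    require(keys.codes.length == values.length, "columns must have the same row count")
    val sums = new Array[Long](keys.dictionary.length)
    var i = 0
    while (i < values.length) {
      sums(keys.codes(i)) += values(i)
      i += 1
    }
    // Strings are materialized ("row-ified") only at the very end.
    keys.dictionary.zip(sums).toMap
  }
}
```

Under these assumptions, `ColumnarSketch.encode(Seq("Barak", "Hillary"))` would give the codes and dictionary shown on slide 29, and `sumByKey` aggregates by dictionary code rather than by string comparison.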