OLAP with Cassandra and Spark

Evan Chan, Software Engineer at Apple
OLAP WITH SPARK AND
CASSANDRA
EVAN CHAN
JULY 2014
WHO AM I?
Principal Engineer, Socrata, Inc.
@evanfchan
http://github.com/velvia
Creator of Spark Job Server
WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE
PEOPLE.
data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org
WE ARE SWIMMING IN DATA!
BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Rollup tables - 30 million rows per day
BIG DATA AT SOCRATA
Hundreds of datasets, each one up to 30 million rows
Customer demand for billion row datasets
HOW CAN WE ALLOW CUSTOMERS TO QUERY A
YEAR'S WORTH OF DATA?
Flexible - complex queries included
Sometimes you can't denormalize your data enough
Fast - interactive speeds
RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Complex and expensive to improve performance (e.g. rollup tables)
OLAP CUBES?
Materialize summary for every possible combination
Too complicated and brittle
Takes forever to compute
Explodes storage and memory
When in doubt, use brute force
- Ken Thompson
OLAP with Cassandra and Spark
CASSANDRA
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
No fear of number of rows or documents
Best of breed storage technology, huge community
BUT: Simple queries only
APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort, etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
SPARK PROVIDES THE MISSING FAST, DEEP
ANALYTICS PIECE OF CASSANDRA!
INTEGRATING SPARK AND CASSANDRA
Scala solutions:
Datastax integration (CQL-based): https://github.com/datastax/cassandra-driver-spark
Calliope
A bit more work:
Use traditional Cassandra client with RDDs
Use an existing InputFormat, like CqlPagedInputFormat
EXAMPLE CUSTOM INTEGRATION USING
ASTYANAX
    // One Cassandra read per row key, executed in parallel across the cluster
    val cassRDD = sc.parallelize(rowkeys).
      flatMap { rowkey =>
        columnFamily.get(rowkey).execute().asScala
      }
A SPARK AND CASSANDRA
OLAP ARCHITECTURE
SEPARATE STORAGE AND QUERY LAYERS
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency - independent!
SCALE NODES, NOT
DEVELOPER TIME!!
KEEPING IT SIMPLE
Maximize row scan speed
Columnar representation for efficiency
Compressed bitmap indexes for fast algebra
Functional transforms for easy memoization, testing, concurrency, composition
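The "bitmap indexes for fast algebra" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the uncompressed java.util.BitSet from the JDK rather than a compressed bitmap format, and the names BitmapIndexSketch, buildIndex and matching are hypothetical, not taken from the talk.

```scala
import java.util.BitSet

object BitmapIndexSketch {
  // Build one bitmap per distinct value: bit i is set when row i holds that value.
  def buildIndex(column: Array[String]): Map[String, BitSet] = {
    val index = scala.collection.mutable.Map.empty[String, BitSet]
    for (i <- column.indices) {
      index.getOrElseUpdate(column(i), new BitSet(column.length)).set(i)
    }
    index.toMap
  }

  // Row ids present in both bitmaps, i.e. a filter "a AND b" done as a bitwise AND.
  def matching(a: BitSet, b: BitSet): Seq[Int] = {
    val result = a.clone().asInstanceOf[BitSet] // BitSet.and mutates, so copy first
    result.and(b)
    Iterator.iterate(result.nextSetBit(0))(i => result.nextSetBit(i + 1))
      .takeWhile(_ >= 0)
      .toList
  }
}
```

A filter on two columns then becomes a single AND over two bitmaps instead of a scan that inspects every row.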
SPARK AS CASSANDRA'S CACHE
EVEN BETTER: TACHYON OFF-HEAP CACHING
INITIAL ATTEMPTS
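    // Each row is an untyped Seq[Any]: (crime type, address, count)
    val rows = Seq(
      Seq("Burglary", "19xx Hurston", 10),
      Seq("Theft", "55xx Floatilla Ave", 5)
    )

    sc.parallelize(rows)
      .map { values => (values(0), values) }   // key each row by crime type
      .groupByKey
      // sum the count column per key; the cast is needed because values are Any
      .mapValues(_.map(_(2).asInstanceOf[Int]).sum)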
No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
For every row, need to extract out needed columns
Ability to select arbitrary columns means using Seq[Any], no type safety
Boxing makes integer aggregation very expensive and memory inefficient
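To make the boxing point concrete, here is a small sketch in plain Scala (no Spark) contrasting the two representations. The object and method names are hypothetical; the sample rows mirror the slide above.

```scala
object BoxingSketch {
  // Row-oriented: each row is a Seq[Any], so every count is stored as a boxed
  // java.lang.Integer and has to be cast back at runtime.
  def sumRows(rows: Seq[Seq[Any]], col: Int): Int =
    rows.map(row => row(col).asInstanceOf[Int]).sum

  // Column-oriented: the same counts held in a primitive Array[Int];
  // no casts and no per-value object allocation.
  def sumColumn(counts: Array[Int]): Int = {
    var total = 0
    var i = 0
    while (i < counts.length) {
      total += counts(i)
      i += 1
    }
    total
  }
}
```

Both functions return the same total; the difference is in how the values are laid out on the heap and whether the compiler can check the column type.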
COLUMNAR STORAGE AND QUERYING
The traditional row-based data storage approach is dead
- Michael Stonebraker
TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:
Name     Age
Barak    46
Hillary  66
Each row is stored contiguously. All columns in row 2 come after row 1.
COLUMNAR STORAGE (MEMORY)
Name column
Index:  0   1
Value:  0   1
Dictionary: {0: "Barak", 1: "Hillary"}
Age column
Index:  0   1
Value:  46  66
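The dictionary-encoded Name column above can be sketched like this. It is a minimal illustration, assuming codes are handed out in first-seen order; DictionarySketch, encode and decode are hypothetical names.

```scala
object DictionarySketch {
  // Encode a string column as (dictionary, codes).
  // Codes are assigned in first-seen order, so the dictionary index is the code.
  def encode(column: Seq[String]): (Vector[String], Array[Int]) = {
    val codeOf = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    val codes = column.map(v => codeOf.getOrElseUpdate(v, codeOf.size)).toArray
    (codeOf.keys.toVector, codes)
  }

  // Look each code up in the dictionary to recover the original strings.
  def decode(dictionary: Vector[String], codes: Array[Int]): Seq[String] =
    codes.toSeq.map(dictionary)
}
```

For the two-row example, encode returns the dictionary Vector("Barak", "Hillary") and the codes 0, 1, matching the slide.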
COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.
Schema CF
Rowkey  Type
Name    StringDict
Age     Int
Data CF
Rowkey  0   1
Name    0   1
Age     46  66
ADVANTAGES OF COLUMNAR STORAGE
Compression
  Dictionary compression - HUGE savings for low-cardinality string columns
  RLE
Reduce I/O
  Only columns needed for query are loaded from disk
Can keep strong types in memory, avoid boxing
Batch multiple rows in one cell for efficiency
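Run-length encoding (RLE), listed above, is simple enough to show in a few lines. This is a generic sketch of the technique, not the encoding used in the talk's system; RleSketch and its methods are hypothetical names.

```scala
object RleSketch {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode[A](values: Seq[A]): List[(A, Int)] =
    values.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest // extend current run
      case (acc, x)                      => (x, 1) :: acc      // start a new run
    }.reverse

  // Expand each run back into its repeated values.
  def decode[A](runs: Seq[(A, Int)]): Seq[A] =
    runs.flatMap { case (v, n) => Seq.fill(n)(v) }
}
```

RLE pays off most on sorted or low-cardinality columns, for example a column of dictionary codes where long runs of the same code are common.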
ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until last possible moment
Easy to derive computed columns
Use vector data / linear math libraries
COLUMNAR QUERY ENGINE VS ROW-BASED IN
SCALA
Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group by's on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
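One way a column-oriented group-by can avoid hashing and boxing is to use dictionary codes directly as array indices. The sketch below assumes a dictionary-encoded key column and a primitive Int value column; it is an illustration of the general approach, not the custom RDD described in the talk, and the names are hypothetical.

```scala
object ColumnarGroupBySketch {
  // GROUP BY a dictionary-encoded key column and SUM a primitive Int column.
  // keyCodes(i) is the dictionary code of row i; numKeys is the dictionary size.
  def sumByKey(keyCodes: Array[Int], values: Array[Int], numKeys: Int): Array[Long] = {
    require(keyCodes.length == values.length, "columns must have equal length")
    val sums = new Array[Long](numKeys) // one accumulator slot per distinct key
    var i = 0
    while (i < keyCodes.length) {
      sums(keyCodes(i)) += values(i)
      i += 1
    }
    sums
  }
}
```

The inner loop touches only two primitive arrays, which is what makes cache-friendly, high rows/sec/core scan rates plausible for integer aggregations.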
SO, GREAT, OLAP WITH CASSANDRA AND
SPARK. NOW WHAT?
OLAP with Cassandra and Spark
DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
HA master, that is.
cassandra-driver-spark
SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet now; Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala... take that, Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python
WORK STILL NEEDED
Indexes
Columnar querying for fast aggregation
Efficient reading from columnar storage formats
GETTING TO A BILLION ROWS / SEC
Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology
Above is a custom solution, NOT Spark SQL.
Look for integration with Spark/SQL for a proper solution
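The core budget above is straightforward arithmetic: 1 billion rows/sec divided by 20 million rows/sec/core gives 50 cores, and adding the 5-10 exchange cores gives roughly 55-60 in total. A small sketch of that calculation, with hypothetical names:

```scala
object CoreBudgetSketch {
  // Cores needed to reach a target scan rate, rounding up to a whole core.
  def coresNeeded(targetRowsPerSec: Long, rowsPerSecPerCore: Long): Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore
}
```

This assumes the per-core rate scales linearly across cores, which the slide's own estimate also takes for granted.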
LESSONS
Extremely fast distributed querying for these use cases
  Data doesn't change much (and only bulk changes)
  Analytical queries for subset of columns
  Focused on numerical aggregations
  Small numbers of group bys, limited network interchange of data
Spark a bit rough around edges, but evolving fast
Concurrent queries is a frontier with Spark. Use additional Spark contexts.
THANK YOU!
SOME COLUMNAR
ALTERNATIVES
MonetDB and Infobright - true columnar stores (storage + querying)
cstore_fdw for PostgreSQL - columnar storage only
VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
Google BigQuery - columnar cloud database, Dremel based
Amazon Redshift
1 of 42

Recommended

Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A... by
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
12.6K views58 slides
Analyzing Time Series Data with Apache Spark and Cassandra by
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
19.6K views86 slides
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ... by
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
17.2K views45 slides
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark by
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkCassandra Day 2014: Interactive Analytics with Cassandra and Spark
Cassandra Day 2014: Interactive Analytics with Cassandra and SparkEvan Chan
11.1K views46 slides
Cassandra spark connector by
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
12.6K views40 slides
Analytics with Cassandra & Spark by
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & SparkMatthias Niehoff
1.3K views45 slides

More Related Content

What's hot

Big data analytics with Spark & Cassandra by
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra Matthias Niehoff
2.7K views72 slides
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 by
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016StampedeCon
2.9K views58 slides
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day by
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayMatthias Niehoff
2K views46 slides
Alpine academy apache spark series #1 introduction to cluster computing wit... by
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
34.4K views42 slides
Kafka spark cassandra webinar feb 16 2016 by
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
356 views48 slides
Owning time series with team apache Strata San Jose 2015 by
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
4.7K views151 slides

What's hot(20)

Big data analytics with Spark & Cassandra by Matthias Niehoff
Big data analytics with Spark & Cassandra Big data analytics with Spark & Cassandra
Big data analytics with Spark & Cassandra
Matthias Niehoff2.7K views
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 by StampedeCon
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon2.9K views
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day by Matthias Niehoff
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials DayAnalytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Matthias Niehoff2K views
Alpine academy apache spark series #1 introduction to cluster computing wit... by Holden Karau
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau34.4K views
Kafka spark cassandra webinar feb 16 2016 by Hiromitsu Komatsu
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
Hiromitsu Komatsu356 views
Owning time series with team apache Strata San Jose 2015 by Patrick McFadin
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
Patrick McFadin4.7K views
An Introduction to Distributed Search with Datastax Enterprise Search by Patricia Gorla
An Introduction to Distributed Search with Datastax Enterprise SearchAn Introduction to Distributed Search with Datastax Enterprise Search
An Introduction to Distributed Search with Datastax Enterprise Search
Patricia Gorla4.5K views
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... by Helena Edelson
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Helena Edelson86.1K views
Spark + Cassandra = Real Time Analytics on Operational Data by Victor Coustenoble
Spark + Cassandra = Real Time Analytics on Operational DataSpark + Cassandra = Real Time Analytics on Operational Data
Spark + Cassandra = Real Time Analytics on Operational Data
Victor Coustenoble10K views
Spark cassandra connector.API, Best Practices and Use-Cases by Duyhai Doan
Spark cassandra connector.API, Best Practices and Use-CasesSpark cassandra connector.API, Best Practices and Use-Cases
Spark cassandra connector.API, Best Practices and Use-Cases
Duyhai Doan8.2K views
Spark And Cassandra: 2 Fast, 2 Furious by Jen Aman
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
Jen Aman6.7K views
Spark Cassandra Connector Dataframes by Russell Spitzer
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
Russell Spitzer6.1K views
Akka in Production - ScalaDays 2015 by Evan Chan
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
Evan Chan53.3K views
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark by Evan Chan
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan5.8K views
700 Updatable Queries Per Second: Spark as a Real-Time Web Service by Evan Chan
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan993 views
Real Time Data Processing Using Spark Streaming by Hari Shreedharan
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan2.6K views
Spark with Cassandra by Christopher Batey by Spark Summit
Spark with Cassandra by Christopher BateySpark with Cassandra by Christopher Batey
Spark with Cassandra by Christopher Batey
Spark Summit3.5K views
Kafka Lambda architecture with mirroring by Anant Rustagi
Kafka Lambda architecture with mirroringKafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
Anant Rustagi1.1K views

Viewers also liked

Breakthrough OLAP performance with Cassandra and Spark by
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
8.8K views72 slides
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark by
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
6.7K views45 slides
BI, Reporting and Analytics on Apache Cassandra by
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
27K views39 slides
Real-Time Analytics with Apache Cassandra and Apache Spark by
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache SparkGuido Schmutz
9.1K views57 slides
TupleJump: Breakthrough OLAP performance on Cassandra and Spark by
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
56.3K views70 slides
Case Study: OLAP usability on Spark and Hadoop by
Case Study: OLAP usability on Spark and HadoopCase Study: OLAP usability on Spark and Hadoop
Case Study: OLAP usability on Spark and HadoopDataWorks Summit/Hadoop Summit
3.9K views47 slides

Viewers also liked(20)

Breakthrough OLAP performance with Cassandra and Spark by Evan Chan
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan8.8K views
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark by DataStax Academy
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkCassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark
DataStax Academy6.7K views
BI, Reporting and Analytics on Apache Cassandra by Victor Coustenoble
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
Victor Coustenoble27K views
Real-Time Analytics with Apache Cassandra and Apache Spark by Guido Schmutz
Real-Time Analytics with Apache Cassandra and Apache SparkReal-Time Analytics with Apache Cassandra and Apache Spark
Real-Time Analytics with Apache Cassandra and Apache Spark
Guido Schmutz9.1K views
TupleJump: Breakthrough OLAP performance on Cassandra and Spark by DataStax Academy
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
DataStax Academy56.3K views
Building a High-Performance Database with Scala, Akka, and Spark by Evan Chan
Building a High-Performance Database with Scala, Akka, and SparkBuilding a High-Performance Database with Scala, Akka, and Spark
Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan2.8K views
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive by Xu Jiang
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Xu Jiang33.1K views
BDM8 - Near-realtime Big Data Analytics using Impala by David Lauzon
BDM8 - Near-realtime Big Data Analytics using ImpalaBDM8 - Near-realtime Big Data Analytics using Impala
BDM8 - Near-realtime Big Data Analytics using Impala
David Lauzon2.5K views
Drilling into Data with Apache Drill by DataWorks Summit
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
DataWorks Summit3.9K views
Overiew of Cassandra and Doradus by randyguck
Overiew of Cassandra and DoradusOveriew of Cassandra and Doradus
Overiew of Cassandra and Doradus
randyguck1.4K views
Extending Cassandra with Doradus OLAP for High Performance Analytics by randyguck
Extending Cassandra with Doradus OLAP for High Performance AnalyticsExtending Cassandra with Doradus OLAP for High Performance Analytics
Extending Cassandra with Doradus OLAP for High Performance Analytics
randyguck1.1K views
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ... by randyguck
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
randyguck2.2K views
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ... by Luke Han
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
Luke Han3.1K views
インメモリーで超高速処理を実現する場合のカギ by Masaki Yamakawa
インメモリーで超高速処理を実現する場合のカギインメモリーで超高速処理を実現する場合のカギ
インメモリーで超高速処理を実現する場合のカギ
Masaki Yamakawa2.6K views
超高速処理とスケーラビリティを両立するApache GEODE by Masaki Yamakawa
超高速処理とスケーラビリティを両立するApache GEODE超高速処理とスケーラビリティを両立するApache GEODE
超高速処理とスケーラビリティを両立するApache GEODE
Masaki Yamakawa3.8K views
GemFire In Memory Data Grid by Dmitry Buzdin
GemFire In Memory Data GridGemFire In Memory Data Grid
GemFire In Memory Data Grid
Dmitry Buzdin7.3K views
NoSQL, Base VS ACID e Teorema CAP by Aricelio Souza
NoSQL, Base VS ACID e Teorema CAPNoSQL, Base VS ACID e Teorema CAP
NoSQL, Base VS ACID e Teorema CAP
Aricelio Souza14.6K views

Similar to OLAP with Cassandra and Spark

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
83.3K views77 slides
Big Data on the Cloud by
Big Data on the CloudBig Data on the Cloud
Big Data on the CloudSercan Karaoglu
286 views21 slides
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... by
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
3.7K views79 slides
A Tale of Two APIs: Using Spark Streaming In Production by
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
2.7K views44 slides
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一 by
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
3.7K views37 slides
Lightning fast analytics with Cassandra and Spark by
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and SparkVictor Coustenoble
5.2K views22 slides

Similar to OLAP with Cassandra and Spark(20)

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by Helena Edelson
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson83.3K views
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... by Helena Edelson
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Helena Edelson3.7K views
A Tale of Two APIs: Using Spark Streaming In Production by Lightbend
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
Lightbend2.7K views
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一 by scalaconfjp
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
scalaconfjp3.7K views
Lightning fast analytics with Cassandra and Spark by Victor Coustenoble
Lightning fast analytics with Cassandra and SparkLightning fast analytics with Cassandra and Spark
Lightning fast analytics with Cassandra and Spark
Victor Coustenoble5.2K views
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics by Miklos Christine
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine1.2K views
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala by Helena Edelson
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaLambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala
Helena Edelson75K views
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... by Edureka!
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!2K views
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home... by DataStax Academy
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home...
DataStax Academy1.6K views
Intro to Spark and Spark SQL by jeykottalam
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam51.2K views
Cassandra Workshop - Cassandra from scratch in one day by Carlos Alonso Pérez
Cassandra Workshop - Cassandra from scratch in one dayCassandra Workshop - Cassandra from scratch in one day
Cassandra Workshop - Cassandra from scratch in one day
5 Ways to Use Spark to Enrich your Cassandra Environment by Jim Hatcher
5 Ways to Use Spark to Enrich your Cassandra Environment5 Ways to Use Spark to Enrich your Cassandra Environment
5 Ways to Use Spark to Enrich your Cassandra Environment
Jim Hatcher699 views
Scalable Applications with Scala by Nimrod Argov
Scalable Applications with ScalaScalable Applications with Scala
Scalable Applications with Scala
Nimrod Argov1.7K views
Big Data Landscape 2019 by QAware GmbH
Big Data Landscape 2019Big Data Landscape 2019
Big Data Landscape 2019
QAware GmbH297 views
Efficient State Management With Spark 2.0 And Scale-Out Databases by Jen Aman
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
Jen Aman1.8K views
Efficient State Management With Spark 2.x And Scale-Out Databases by SnappyData
Efficient State Management With Spark 2.x And Scale-Out DatabasesEfficient State Management With Spark 2.x And Scale-Out Databases
Efficient State Management With Spark 2.x And Scale-Out Databases
SnappyData354 views
Lightning Fast Analytics with Cassandra and Spark by Tim Vincent
Lightning Fast Analytics with Cassandra and SparkLightning Fast Analytics with Cassandra and Spark
Lightning Fast Analytics with Cassandra and Spark
Tim Vincent564 views
Autoscaling Best Practices by Marc Cluet
Autoscaling Best PracticesAutoscaling Best Practices
Autoscaling Best Practices
Marc Cluet10.7K views

More from Evan Chan

Porting a Streaming Pipeline from Scala to Rust by
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustEvan Chan
7 views38 slides
Designing Stateful Apps for Cloud and Kubernetes by
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and KubernetesEvan Chan
222 views33 slides
Histograms at scale - Monitorama 2019 by
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019Evan Chan
1.5K views36 slides
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale by
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
1.1K views52 slides
Building Scalable Data Pipelines - 2016 DataPalooza Seattle by
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
5.7K views80 slides
Productionizing Spark and the Spark Job Server by
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
16.8K views72 slides

More from Evan Chan(10)

Porting a Streaming Pipeline from Scala to Rust by Evan Chan
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan7 views
Designing Stateful Apps for Cloud and Kubernetes by Evan Chan
Designing Stateful Apps for Cloud and KubernetesDesigning Stateful Apps for Cloud and Kubernetes
Designing Stateful Apps for Cloud and Kubernetes
Evan Chan222 views
Histograms at scale - Monitorama 2019 by Evan Chan
Histograms at scale - Monitorama 2019Histograms at scale - Monitorama 2019
Histograms at scale - Monitorama 2019
Evan Chan1.5K views
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale by Evan Chan
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleFiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan1.1K views
Building Scalable Data Pipelines - 2016 DataPalooza Seattle by Evan Chan
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan5.7K views
Productionizing Spark and the Spark Job Server by Evan Chan
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan16.8K views
MIT lecture - Socrata Open Data Architecture by Evan Chan
MIT lecture - Socrata Open Data ArchitectureMIT lecture - Socrata Open Data Architecture
MIT lecture - Socrata Open Data Architecture
Evan Chan3K views
Spark Summit 2014: Spark Job Server Talk by Evan Chan
Spark Summit 2014:  Spark Job Server TalkSpark Summit 2014:  Spark Job Server Talk
Spark Summit 2014: Spark Job Server Talk
Evan Chan4.4K views
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14) by Evan Chan
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Evan Chan11.8K views
Real-time Analytics with Cassandra, Spark, and Shark by Evan Chan
Real-time Analytics with Cassandra, Spark, and SharkReal-time Analytics with Cassandra, Spark, and Shark
Real-time Analytics with Cassandra, Spark, and Shark
Evan Chan26.8K views

OLAP with Cassandra and Spark

  • 1. OLAP WITH SPARK AND CASSANDRA EVAN CHAN JULY 2014
  • 2. WHO AM I? PrincipalEngineer, @evanfchan Creator of Socrata, Inc. http://github.com/velvia Spark Job Server
  • 3. WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE. data.edmonton.cafinances.worldbank.orgdata.cityofchicago.org data.seattle.govdata.oregon.govdata.wa.gov www.metrochicagodata.orgdata.cityofboston.gov info.samhsa.govexplore.data.govdata.cms.govdata.ok.gov data.nola.govdata.illinois.govdata.colorado.gov data.austintexas.govdata.undp.orgwww.opendatanyc.com data.mo.govdata.nfpa.orgdata.raleighnc.govdati.lombardia.it data.montgomerycountymd.govdata.cityofnewyork.us data.acgov.orgdata.baltimorecity.govdata.energystar.gov data.somervillema.govdata.maryland.govdata.taxpayer.net bronx.lehman.cuny.edu data.hawaii.govdata.sfgov.org
  • 4. WE ARE SWIMMING IN DATA!
  • 5. BIG DATA AT OOYALA: 2.5 billion analytics pings a day = almost a trillion events a year. Rollup tables: 30 million rows per day
  • 6. BIG DATA AT SOCRATA: Hundreds of datasets, each one up to 30 million rows; customer demand for billion row datasets
  • 7. HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA? Flexible - complex queries included; sometimes you can't denormalize your data enough. Fast - interactive speeds
  • 8. RDBMS? POSTGRES? Start hitting latency limits at ~10 million rows; no robust and inexpensive solution for querying across shards; no robust way to scale horizontally; complex and expensive to improve performance (e.g. rollup tables)
  • 9. OLAP CUBES? Materialize summary for every possible combination; too complicated and brittle; takes forever to compute; explodes storage and memory
  • 10. When in doubt, use brute force - Ken Thompson
  • 12. CASSANDRA: Horizontally scalable; very flexible data modelling (lists, sets, custom data types); easy to operate; no fear of number of rows or documents; best of breed storage technology, huge community. BUT: simple queries only
  • 13. APACHE SPARK: Horizontally scalable, in-memory queries; functional Scala transforms - map, filter, groupBy, sort etc.; SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy! THE hottest big data platform, huge community, leaving Hadoop in the dust; developers love it
  • 14. SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!
  • 15. INTEGRATING SPARK AND CASSANDRA: Scala solutions: Datastax integration (CQL-based): https://github.com/datastax/cassandra-driver-spark; Calliope
  • 16. A bit more work: use traditional Cassandra client with RDDs; use an existing InputFormat, like CqlPagedInputFormat
  • 17. EXAMPLE CUSTOM INTEGRATION USING ASTYANAX: val cassRDD = sc.parallelize(rowkeys).flatMap { rowkey => columnFamily.get(rowkey).execute().asScala }
  • 18. A SPARK AND CASSANDRA OLAP ARCHITECTURE
  • 19. SEPARATE STORAGE AND QUERY LAYERS: Combine best of breed storage and query platforms; take full advantage of evolution of each; storage handles replication for availability; query can replicate data for scaling read concurrency - independent!
  • 21. KEEPING IT SIMPLE: Maximize row scan speed; columnar representation for efficiency; compressed bitmap indexes for fast algebra; functional transforms for easy memoization, testing, concurrency, composition
  • 23. EVEN BETTER: TACHYON OFF-HEAP CACHING
  • 25. No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own. For every row, need to extract out needed columns. Ability to select arbitrary columns means using Seq[Any], no type safety. Boxing makes integer aggregation very expensive and memory inefficient
  • 27. The traditional row-based datastorage approach is dead - Michael Stonebraker
  • 28. TRADITIONAL ROW-BASED STORAGE: Same layout in memory and on disk:
        Name     Age
        Barak    46
        Hillary  66
    Each row is stored contiguously. All columns in row 2 come after row 1.
  • 29. COLUMNAR STORAGE (MEMORY):
        Name column   0   1
                      0   1
        Dictionary: {0: "Barak", 1: "Hillary"}
        Age column    0   1
                      46  66
  • 30. COLUMNAR STORAGE (CASSANDRA): Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.
        Schema CF:
        Rowkey  Type
        Name    StringDict
        Age     Int
        Data CF:
        Rowkey  0   1
        Name    0   1
        Age     46  66
  • 31. ADVANTAGES OF COLUMNAR STORAGE: Compression: dictionary compression - HUGE savings for low-cardinality string columns; RLE. Reduce I/O: only columns needed for query are loaded from disk. Can keep strong types in memory, avoid boxing. Batch multiple rows in one cell for efficiency
  • 32. ADVANTAGES OF COLUMNAR QUERYING: Cache locality for aggregating column of data; take advantage of CPU/GPU vector instructions for ints / doubles; avoid row-ifying until last possible moment; easy to derive computed columns; use vector data / linear math libraries
  • 33. COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA: Custom RDD of column-oriented blocks of data; uses ~10x less heap; 10-100x faster for group by's on a single node; scan speed in excess of 150M rows/sec/core for integer aggregations
  • 34. SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?
  • 36. DATASTAX: CASSANDRA SPARK INTEGRATION: Datastax Enterprise now comes with HA Spark (HA master, that is); cassandra-driver-spark
  • 37. SPARK SQL: Appeared with Spark 1.0; in-memory columnar store; can read from Parquet now, Cassandra integration coming; querying is not column-based (yet); no indexes; write custom functions in Scala... take that, Hive UDFs!! Integrates well with MLBase, Scala/Java/Python
  • 38. WORK STILL NEEDED: Indexes; columnar querying for fast aggregation; efficient reading from columnar storage formats
  • 39. GETTING TO A BILLION ROWS / SEC: Benchmarked at 20 million rows/sec per core: GROUP BY on two columns, aggregating two more columns. 50 cores needed for a parallel localized grouping throughput of 1 billion rows/sec; budget ~5-10 additional cores for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology. Above is a custom solution, NOT Spark SQL. Look for integration with Spark/SQL for a proper solution
  • 40. LESSONS: Extremely fast distributed querying for these use cases: data doesn't change much (and only bulk changes); analytical queries for subset of columns; focused on numerical aggregations; small numbers of group bys, limited network interchange of data. Spark a bit rough around edges, but evolving fast. Concurrent queries is a frontier with Spark. Use additional Spark contexts.
  • 42. SOME COLUMNAR ALTERNATIVES: MonetDB and Infobright - true columnar stores (storage + querying); cstore_fdw for Postgres - columnar storage only; VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes); Google BigQuery - columnar cloud database, Dremel based; Amazon Redshift
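
The dictionary-encoded Name column on slide 29 and the "avoid boxing" point on slides 25 and 31 can be illustrated with a small plain-Scala sketch. This is not the engine described in the talk: ColumnarSketch, DictColumn, encode and sumByKey are hypothetical names, and the code runs on one JVM with no Spark or Cassandra involved.

```scala
// Hypothetical sketch: names and structure are assumptions, not the talk's code.
object ColumnarSketch {
  // A dictionary-encoded string column: each distinct value is stored once,
  // and every row holds only a small integer code into the dictionary.
  final case class DictColumn(dictionary: Array[String], codes: Array[Int]) {
    def apply(row: Int): String = dictionary(codes(row))
  }

  // Codes are assigned in order of first appearance.
  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toArray
    val codeOf = dictionary.zipWithIndex.toMap
    DictColumn(dictionary, values.map(codeOf).toArray)
  }

  // Group-by-sum over primitive arrays: the loop touches only Int and Long
  // values, so no per-row boxing occurs (unlike a Seq[Any] row representation).
  def sumByKey(keys: DictColumn, measure: Array[Int]): Map[String, Long] = {
    val sums = new Array[Long](keys.dictionary.length)
    var i = 0
    while (i < measure.length) {
      sums(keys.codes(i)) += measure(i)
      i += 1
    }
    keys.dictionary.zip(sums).toMap
  }
}
```

Calling ColumnarSketch.encode(Seq("Barak", "Hillary")) gives the dictionary and codes shown on slide 29; sumByKey then aggregates an Int column grouped by those codes.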
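
Slide 21 mentions compressed bitmap indexes for fast algebra. The sketch below shows only the algebra part, using the JDK's uncompressed java.util.BitSet as a stand-in for a compressed bitmap; the talk does not say which bitmap implementation was used. BitmapIndexSketch and its methods are hypothetical names, and the column values in the usage note are made up for illustration.

```scala
import java.util.BitSet

// Hypothetical sketch: java.util.BitSet is an uncompressed stand-in for the
// compressed bitmaps mentioned on the slide.
object BitmapIndexSketch {
  // One bitmap per distinct column value; bit i is set when row i holds that value.
  def build(column: Array[String]): Map[String, BitSet] = {
    val index = scala.collection.mutable.Map.empty[String, BitSet]
    var row = 0
    while (row < column.length) {
      index.getOrElseUpdate(column(row), new BitSet(column.length)).set(row)
      row += 1
    }
    index.toMap
  }

  // "a AND b" over two predicates is a bitwise intersection; inputs are not modified.
  def and(a: BitSet, b: BitSet): BitSet = {
    val result = a.clone().asInstanceOf[BitSet]
    result.and(b)
    result
  }

  // Row ids whose bit is set, in ascending order.
  def rows(bits: BitSet): Seq[Int] = {
    val out = Seq.newBuilder[Int]
    var i = bits.nextSetBit(0)
    while (i >= 0) {
      out += i
      i = bits.nextSetBit(i + 1)
    }
    out.result()
  }
}
```

For example, with one index built over a city column and another over a year column, and(city("SEA"), year("2014")) yields the rows matching both filters without scanning either column again.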
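
The core count on slide 39 follows from dividing the target scan rate by the benchmarked per-core rate. A minimal worked version is below; ThroughputSketch and coresForScan are hypothetical names, and the only inputs are the figures quoted on the slide.

```scala
// Hypothetical helper reproducing the slide's arithmetic.
object ThroughputSketch {
  // Cores needed to reach a target scan rate, rounded up to a whole core.
  def coresForScan(targetRowsPerSec: Long, rowsPerSecPerCore: Long): Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  def main(args: Array[String]): Unit = {
    // 1 billion rows/sec target at 20 million rows/sec per core.
    val scanCores = coresForScan(1000000000L, 20000000L)
    // The slide budgets roughly 5-10 extra cores for the distributed exchange.
    println(s"scan cores: $scanCores, with exchange budget: ${scanCores + 5}-${scanCores + 10}")
  }
}
```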