Thing you didn't know you could do in Spark

ThingsYou Didn'tKnowYou Could
Do in Spark
Version 0.1 | © Snappydata Inc 2017
Strategic Advisor @ Snappydata
Assistant Prof. @Univ. of Mich
Barzan Mozafari
CTO | Co-founder
@ Snappydata
Jags Ramnarayan
COO | Co-founder
@ Snappydata
Sudhir Menon

2
Current
Market
Trends
Analytic workloads are moving to the cloud
Scale-outinfrastructure->IncreasinglyExpensive
Excessivedatashuffling -> Longer responsetimes
Real-time operational analytics are on the rise
MixedWorkloadscombine streams,operationaldataandHistoricaldata
Real-timeMatters
Concurrency is increasing
Sharedinfrustrure(10’sof usersand apps,100’sof queries)
Exploratoryanalytics
BIandvisualizationtools
Concurrentqueries
Data volumes are growing rapidly
IndustrialIoT(Internetof Things)
Datagrowing fasterthanMoore’slaw

3
Our
Observations
Acheiving interactive analytics is hard
Analystswantresponsetime <2 seconds
Hardwith concurrentqueriesin the cluster
Hardonananalyst’slaptop
Hardonlargedatasets
Hardwhenthereare networkshuffles
After 99.9%- accurate early results, the final answer is no
longer needed
200xqueryspeedup
200xreductionof clusterload, networkshuffles,andI/O waits
Lambda architecture is extremely complex (and slow!)
Dealingwith multipledataformats, disparateproductAPIs
Wastedresources
Querying multiple remote data sources is slow
Curateddatafrom multiplesourcesisslow

Agenda
• Mixed workloads
Structured Streaming in Spark 2.0
State Management in Spark 2.0
• Interactive Analytics in Spark 2.0
• The SnappyData Solution
• Q&A

Stream
Processing
Mixed workloads are way too common
Transactions
(point lookups,
small updates)
Interactive
Analytics
Analytics on
mutating data
Maintaining state or
counters while processing
streams
Correlating and
joining streams w/
large histories

Mixed Workload in IoT Analytics
- Map sensors to tags
- Monitor temperature
- Send alerts
Correlate current
temperature trend with
history….
Anomaly detection –
score against models
Interact using
dynamic
queries

How Mixed Workloads are Supported Today?
Lambda Architecture

Scalable Store
HDFS,
MPP DBData-in-motion
Analytics
Application
Streams
Alerts
Enrich
Reference DB
Lamba Architecure is Complex
Enrich – requires
reference data
join
State – manage
windows with
mutating state
Analytics – Joins to
history, trend patterns
Model maintenance
Models
Transaction
writes
Interactive OLAP
queries
Updates

Scalable Store
Updates
HDFS,
MPP DBData-in-motion
Analytics
Application
Streams
Alerts
Enrich
Reference DB
Storm, Spark
streaming,
Samza…
NoSQL – Cassandra,
Redis, Hbase ….
Models
Enterprise DB –
Oracle, Postgres..
OLAP SQL DB

• Complexity: learn and master
multiple products  data
models, disparate APIs, configs
• Slower
• Wasted resources

Can we simplify & optimize?
How about a single clustered DB that can manage stream state,
transactional data and run OLAP queries?
RDB
Tables
Txn
Framework for stream
processing, etc
Stream processing
HDFS
MPP DB
Scalable writes, point reads, OLAP queries
Apps

One idea: Start from Apache Spark
Spark Core
Spark
Streaming
Spark ML
Spark SQL
GraphX
Stream program can
Leverage ML, SQL, Graph
Why?
• Unified programming across
streaming, interactive, batch
• Multiple data formats, data sources
• Rich library for Streaming, SQL,
Machine Learning, …
But: Spark is a computational
engine not a store or a database

Structured Streaming
in Spark 2.0

Source: Databricks presentation

Source: Databricks presentation
read...stream() creates a
streaming DataFrame, does not
start any of the computation
write...startStream() defines where
& how to output the data and
starts the processing

New State Management API in Spark 2.0 – KV Store
• Preserve state across streaming aggregations across multiple
batches
• Fetch, store KV pairs
• Transactional
• Supports versioning (batch)
• Built in store that persists to HDFS

Deeper Look – Ad Impression Analytics
Ad Network Architecture – Analyze log impressions in real time

Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/

Bottlenecks in the write path
Stream micro batches in
parallel from Kafka to each
Spark executor
Filter and collect event for 1 minute.
Reduce to 1 event per Publisher,
Geo every minute
Execute GROUP BY … Expensive
Spark Shuffle …
Shuffle again in DB cluster …
data format changes …
serialization costs
val input :DataFrame= sqlContext.read
.options(kafkaOptions).format("kafkaSource")
.stream()
val result :DataFrame = input
.where("geo != 'UnknownGeo'")
.groupBy(
window("event-time", "1min"),
Publisher", "geo")
.agg(avg("bid"), count("geo"),countDistinct("cookie"))
val query = result.write.format("org.apache.spark.sql.cassandra")
.outputMode("append")
.startStream("dest-path”)

Bottlenecks in the Write Path
• Aggregations – GroupBy, MapReduce
• Joins with other streams, Reference data
Shuffle Costs (Copying, Serialization) Excessive copying in
Java based Scale out stores

Avoid bottleneck - Localize process with State
Spark Executor
Spark Executor
Kafka
queue
KV
KV
Publisher: 1,3,5
Geo: 2,4,6
Avoid shuffle by localizing writes Input queue,
Stream, KV
Store all share
the same
partitioning
strategy
Publisher: 2,4,6
Geo: 1,3,5
Publisher: 1,3,5
Geo: 2,4,6
Publisher: 2,4,6
Geo: 1,3,5
Spark 2.0
StateStore?

Impedance mismatch with KV stores
We want to run interactive “scan”-intensive queries:
- Find total uniques for Ads grouped on geography
- Impression trends for Advertisers(group by query)
Two alternatives: row-stores vs. column stores

Goal – Localize processingFast key-based lookup
But, too slow to run
aggregations, scan based
interactive queries
Fast scans, aggregations
Updates and random
writes are very difficult
Consume too
much memory

Why columnar storage in-memory?

Interactive Analytics
in Spark 2.0

But, are in-memory column-stores fast enough?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB

Why distributed, in-memory and columnar
formats not enough for interactive speeds?
1. Distributed shuffles can take minutes
2. Software overheads, e.g., (de)serialization
3. Concurrent queries competing for the same resources
So, how can we ensure interactive-speed
analytics?

• Most apps happy to tradeoff 1% accuracy
for 200x speedup!
• Can usually get a 99.9% accurate answer by only
looking at a tiny fraction of data!
• Often can make perfectly accurate
decisions without having perfectly
accurate answers!
• A/B Testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn’t necessarily mean
exact answers!
• Inference is probabilistic anyway
Use statistical techniques to shrink data!

SnappyData: A New Approach
A Single Unified Cluster: OLTP + OLAP + Streaming
for real-time analytics
Batch design, high throughput
Real-time design
Low latency, HA,
concurrency
Vision: Drastically reduce the cost and
complexity in modern big data
Rapidly Maturing Matured over 13 years

SnappyData: A New Approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real-time operational Analytics – TBs in memory
RDB
Rows
Txn
Columnar
API
Stream processing
ODBC,
JDBC, REST
Spark -
Scala, Java,
Python, R
HDFS
AQP
First commercial product with
Approximate Query Processing (AQP)
MPP DB
Index

SnappyData Data Model
Rows
Columnar
Stream processing
ODBC
JDBCProbabilistic
data
Snappy Data Server – Spark Executor + Store
Index
Spark
Program
SQL
Parser
RDD Table
Hybrid Storage
RDD
Cluster
Mgr
Scheduler
High Latency/OLAP
Low Latency
Add/remove Servers
Catalog

Features
- Deeply integrated database for Spark
- 100% compatible with Spark
- Extensions for Transactions (updates), SQL stream processing
- Extensions for High Availability
- Approximate query processing for interactive OLAP
- OLTP+OLAP Store
- Replicated and partitioned tables
- Tables can be Row or Column oriented (in-memory & on-disk)
- SQL extensions for compatibility with SQL Standard
- create table, view, indexes, constraints, etc

Time Series Analytics (kparc.com)
http://kparc.com/q4/readme.txt Trade and Quote Analytics

Time Series Analytics – kparc.com
http://kparc.com/q4/readme.txt
- Find last quoted price for Top 100 instruments for given date
"select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where
date='$d' group by quote.sym”
// Replaced with AVG as MemSQL doesn’t support LAST
- Max price for Top 100 Instruments across Exchanges for given date
"select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym)
where date='$d' group by trade.sym, ex”
- Find Average size of Trade in each hour for top 100 Instruments for given
date
"select trade.sym, hour(time), avg(size) from trade join S on (trade.sym =
S.sym) where date='$d' group by trade.sym, hour(time)"

Results for Small day (40million), Big day(640 million)
0
500
1000
1500
2000
2500
3000
3500
Q1 Q2 Q3
SnappyData 224 125 124
MemSQL 556 1884 3210
AxisTitle
Query Response time in
Millis
(Smaller is better)
0
10000
20000
30000
40000
50000
60000
SnappyData MemSQL Impala Greenplum
Q1 1769 8652 45000 56000
Q2 432 26034 2250 900
Q3 494 49614 1880 1000AxisTitle
Query Response time in Millis
(Smaller is better)

TPC-H: 10X-20X faster than Spark 2.0

Use Case Patterns
1. Analytics cache
- Caching for Analytics over any “big data” store (esp. MPP)
- Federate query between samples and backend’
2. Stream analytics for Spark
Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
Highly concurrent apps, SQL cache, OLTP + OLAP

Interactive-Speed Analytic Queries – Exact or Approximate
Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
Speed/Accuracy tradeoffError(%)
30 mins
Time to Execute on
Entire Dataset
Interactive
Queries
2 sec
Execution Time 42
100 secs
2 secs
1% Error

Approximate Query Processing Features
• Support for uniform sampling
• Support for stratified sampling
- Solutions exist for stored data (BlinkDB)
- SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses
- Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)

Uniform (Random) Sampling
ID Advertiser Geo Bid
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Uniform Sample
ID Advertiser Geo Bid Sampling
Rate
3 adv20 NY 0.0002 1/3
5 adv20 NY 0.0001 1/3
SELECT avg(bid)
FROM AdImpresssions
WHERE geo = ‘VT’
Original Table

Uniform (Random) Sampling
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Uniform Sample
Rate
3 adv20 NY 0.0002 2/3
5 adv20 NY 0.0001 2/3
1 adv10 NY 0.0001 2/3
2 adv10 VT 0.0005 2/3
SELECT avg(bid)
FROM AdImpresssions
Original Table Larger

Stratified Sampling
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Stratified Sample on Geo
Rate
3 adv20 NY 0.0002 1/4
2 adv10 VT 0.0005 1/2
SELECT avg(bid)
FROM AdImpresssions
Original Table

Sketching techniques
● Sampling not effective for outlier detection
○ MAX/MIN etc
● Other probabilistic structures like CMS, heavy hitters, etc
● SnappyData implements Hokusai
○ Capturing item frequencies in timeseries
● Design permits TopK queries over arbitrary trim intervals
(Top100 popular URLs)
SELECT pageURL, count(*) frequency FROM Table
WHERE …. GROUP BY ….
ORDER BY frequency DESC
LIMIT 100

Airline Analytics Demo
Zeppelin
Spark
Interpreter
(Driver)
Zeppelin
Server
Row cache
Columnar
compressed
Spark Executor JVM
Row cache
Columnar
compressed
Spark Executor JVM
Row cache
Columnar
compressed
Spark Executor JVM

Free Cloud trial service – Project iSight
● Free AWS/Azure credits for anyone to try out SnappyData
● One click launch of private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive description of concepts and
value
● Bring your own data sets to try ‘Immediate visualization’ using
Synopses data
http://www.snappydata.io/cloudbuilder --- URL for AWS Service
Send email to chomp@snappydata.io to be notified. Release expected next week

Snappy Spark Cluster Deployment topologies
• Snappy store and Spark
Executor share the JVM
memory
• Reference based access –
zero copy
• SnappyStore is isolated but
use the same COLUMN
FORMAT AS SPARK for high
throughput
Unified Cluster
Split Cluster

Simple API – Spark Compatible
● Access Table as DataFrame
Catalog is automatically recovered
● Store RDD[T]/DataFrame can be
stored in SnappyData tables
● Access from Remote SQL clients
● Addtional API for updates,
inserts, deletes
//Save a dataFrame using the Snappy or spark context …
context.createExternalTable(”T1", "ROW", myDataFrame.schema,
props );
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(pro
ps).saveAsTable(”T1");
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of DataFrame APIs … >

Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
<column definition>
) USING ‘JDBC | ROW | COLUMN ’
OPTIONS (
COLOCATE_WITH 'table_name', // Default none
PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default
REDUNDANCY '1' , // Manage HA
PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
// Empty string will map to default disk store.
OFFHEAP "true | false"
EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
…..
[AS select_statement];

Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
Create stream table AdImpressionLog
(<Columns>) using directkafka_stream options (
<socket endpoints>
"topics 'adnetwork-topic’ “,
"rowConverter ’ AdImpressionLogAvroDecoder’ )
streamingContext.registerCQ(
"select publisher, geo, avg(bid) as avg_bid, count(*) imps,
count(distinct(cookie)) uniques from AdImpressionLog
window (duration '2' seconds, slide '2' seconds)
where geo != 'unknown' group by publisher, geo”)// Register CQ
.foreachDataFrame(df => {
df.write.format("column").mode(SaveMode.Appen
d)
.saveAsTable("adImpressions")

AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL

AdImpression Ingest Performance
Loading from parquet files using 4 cores Stream ingest
(Kafka + Spark streaming to store using 1 core)

How do we extend Spark for Real Time?
• Spark Executors are long
running. Driver failure
doesn’t shutdown
Executors
• Driver HA – Drivers run
“Managed” with standby
secondary
• Data HA – Consensus based
clustering integrated for
eager replication

How do we extend Spark for Real Time?
• By pass scheduler for low
latency SQL
• Deep integration with
Spark Catalyst(SQL) –
collocation optimizations,
indexing use, etc
• Full SQL support –
Persistent Catalog,
Transaction, DML

Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar
form reduces volume and is much more responsive

www.snappydata.io
SnappyData is Open Source
● Ad Analytics example/benchmark -
https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com

Thing you didn't know you could do in Spark

More Related Content

What's hot

Viewers also liked

Similar to Thing you didn't know you could do in Spark

Recently uploaded

Thing you didn't know you could do in Spark

Editor's Notes