SnappyData
Getting Spark ready for real-time, operational analytics
Jags Ramnarayan
Oct 2015
snappydata.io
My Talk Today
 Key attributes for modern real-time stream processing and interactive analytics
 What is so exciting to me about Spark? What are some of the myths?
 What is missing in Spark for real time?
 SnappyData's mission – fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + stream processing + probabilistic data
Stream processing – what is required?
• Ingest in parallel from disparate sources; you cannot throttle the input
- Sockets, message buses, files, HDFS, ...
• Process in parallel – filter, transform, normalize
• Apply rules and trigger alerts and actions – SQL?
• State management with mutability
• HA semantics for state
– Input streams must be HA (reliable enqueue with "once and only once" processing)
– Processing may depend on reference data that must be HA
– Generated state must be HA
• Store raw and derived data in HDFS
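The per-batch work listed above – filter, transform, apply rules, mutate state – can be sketched in plain Python. The field names and the alert threshold are illustrative assumptions, not part of the talk:

```python
# Sketch of one pass through a stream pipeline:
# filter -> transform/normalize -> apply rules/alerts -> mutate state.
# Field names and the 0.9 alert threshold are illustrative assumptions.

state = {"totals": {}}   # mutable state that must survive failures (HA)
alerts = []

def process_batch(events):
    # Filter: drop malformed events
    valid = [e for e in events if "sensor" in e and "value" in e]
    # Transform/normalize: scale readings to a common unit
    normalized = [(e["sensor"], e["value"] / 100.0) for e in valid]
    for sensor, value in normalized:
        # Apply rules and trigger alerts
        if value > 0.9:
            alerts.append((sensor, value))
        # State management with mutability: running totals per sensor
        state["totals"][sensor] = state["totals"].get(sensor, 0.0) + value

process_batch([
    {"sensor": "a", "value": 95},
    {"sensor": "a", "value": 10},
    {"bad": True},
    {"sensor": "b", "value": 50},
])
```

A real processor runs this logic in parallel per partition; the sketch only shows the per-batch semantics.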
Stream Analytics – Applications
• Real time scoring of analytic models
• Incremental, online training of models
• Recommendations and targeting
• Personalized Ads
• Detect patterns and anomalies in massive quantities of machine data
• Stream analytics requires working with lots of historical and reference data
Stream Processing vs. Stream Analytics
• Popular stream processors are parallel processing frameworks with very limited support for deep analytic operators
Deeper analytic problems include:
 Report the topK popular URLs over the last hour or day, reporting every 5 seconds
 Correlate the energy consumption pattern over the last 10 minutes to similar time periods in the past
 Maintain a prediction model in real time, e.g. a model for fraud detection
Stream analytics poses the same optimization challenges as OLAP SQL – discovering trends, patterns and outliers may require incremental joins and aggregations on large quantities of historical or reference data
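The first of those problems – topK URLs over a trailing window, reported frequently – can be sketched with exact counts in plain Python; a deque of per-interval counters stands in for the windowed state a stream processor would keep (at scale, the counters would be replaced by sketches):

```python
from collections import Counter, deque
import heapq

class SlidingTopK:
    """Exact top-K over the last `window` intervals."""
    def __init__(self, window):
        self.buckets = deque(maxlen=window)   # one Counter per interval

    def add_interval(self, urls):
        # Each reporting interval contributes one bucket of counts;
        # the deque automatically expires the oldest bucket.
        self.buckets.append(Counter(urls))

    def topk(self, k):
        total = Counter()
        for b in self.buckets:
            total.update(b)
        return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

t = SlidingTopK(window=2)
t.add_interval(["/a", "/a", "/b"])
t.add_interval(["/b", "/c"])
t.add_interval(["/c", "/c", "/b"])   # the first interval falls out of the window
```

The memory cost here is linear in the number of distinct URLs per window, which is exactly why the later slides turn to probabilistic structures.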
Current Solutions Fall Short
• Leave it to the application developer – complex, sub-optimal; dynamic rules are difficult
• Join with history and reference data in an external DB – slow; complex management of multiple products
• Embed or integrate an in-memory row-oriented store – most are designed for OLTP; scans and aggregations are too slow
The New Real-Time Analytics Operational DB
[Diagram: streams flow through data-in-motion analytics (transform) into an in-memory DB serving interactive queries and updates, integrated with a deep-scale, high-volume MPP DB; applications receive alerts. Streaming analytics should be inside the data store: OLAP queries are CPU intensive and traditional methods are too slow.]
So, why do we like Spark?
• Blends streaming, interactive and batch analytics into a cohesive whole
• Appeals to Java developers as well as R and Python folks
• Succinct code – maybe the credit goes to Scala?
• Rich set of transformations and libraries (ML, Graph)
• RDDs and fault tolerance without replication
• Stream processing with high throughput (a pipeline of micro-batches)
Spark Myths
• It is a distributed in-memory database
– It is a computational framework with immutable caching
• It is highly available
– Fault tolerance is not the same as HA
• It is well suited for real-time, operational environments
– e.g. hundreds of concurrent clients running interactive queries
Spark Streaming Runtime Architecture
[Diagram: a client submits a stream app to the driver; Kafka queues feed Spark executors, each holding RDD partitions at t0, t1, t2 over time, with results flowing to Cassandra. The queue is buffered in the executor; the driver submits a batch job every second, which pushes a new RDD (the batch from the buffer) onto the stream. Short-term state is immutable; long-term state lives in an external DB.]
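The micro-batch mechanic described above – buffer in the receiver, cut an immutable batch every interval – can be sketched without Spark. The one-second cadence comes from the slide; everything else here is an illustrative reduction:

```python
# Minimal micro-batch sketch: a receiver buffers events; on every tick the
# "driver" cuts the buffer into an immutable batch (analogous to a new RDD).

buffer = []
batches = []   # each entry plays the role of an RDD for one interval

def receive(event):
    buffer.append(event)

def on_tick():
    # Called once per batch interval (e.g. every second).
    global buffer
    batch = tuple(buffer)   # immutable snapshot, like an RDD at time t
    batches.append(batch)
    buffer = []             # start buffering the next interval

receive(1); receive(2)
on_tick()
receive(3)
on_tick()
```

The immutability of each snapshot is what makes recomputation via lineage possible, and it is also why long-lived mutable state has to live somewhere else.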
Challenge 1: Spark Driver is NOT HA
• YARN can restart the driver, but state is still an issue
• Fault tolerance can be configured through write-ahead logging and checkpointing
• The bigger problem: a driver failure shuts down all executors, so all cached state (possibly TBs) must be recovered
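Checkpointing, as mentioned above, amounts to periodically persisting state so a restarted driver can resume instead of recomputing from scratch. A minimal sketch of the idea; the file name and JSON format are assumptions, not Spark's actual checkpoint layout:

```python
import json, os, tempfile

def checkpoint(state, path):
    # Write atomically: a crash mid-write must not corrupt the last checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)   # atomic rename on POSIX and Windows

def recover(path):
    # On restart, reload the last checkpoint (or start empty).
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

path = os.path.join(tempfile.mkdtemp(), "state.json")
checkpoint({"count": 42}, path)
restored = recover(path)
```

The slide's point stands even with checkpointing in place: recovery time scales with the size of the cached state, which is what distinguishes fault tolerance from HA.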
Challenge 2: External state management
• Joins to reference data, or joins across streams, may require remote access on every batch
• If you cache the data, you are working with stale reference data
• State must be colocated to maintain performance and avoid falling behind
• The built-in capability to update state as batches arrive requires iterating over the full data set:

newDStream = wordDstream.updateStateByKey[Int](newUpdateFunc, …)
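The limitation above is that updateStateByKey touches the full state on every batch. The incremental alternative – touch only the keys present in the batch – can be sketched with a plain dict (later Spark releases added mapWithState to provide exactly this):

```python
# Per-key state kept in a dict; each batch updates only the keys it contains,
# instead of iterating over the entire state as updateStateByKey does.

counts = {}

def update_batch(pairs):
    # `pairs` is one micro-batch of (word, increment) tuples.
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n

update_batch([("spark", 1), ("stream", 2)])
update_batch([("spark", 3)])   # only "spark" is touched in this batch
```

The cost per batch is then proportional to the batch size, not to the total number of keys accumulated over time.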
Challenge 3: Sharing state across clients (applications)
• RDDs cached within executors cannot be shared across different client applications
• For instance: App1 is a streaming app that stores counters in a DataFrame; App2 runs a SQL query and has no visibility into App1's state
Challenge 4: "Once and only once" is difficult in reality
• Example: update counters for each stream batch; a failure results in the batch being resent, yet the counters should reflect the correct state
• This works well if the state is managed only inside Spark
• Maintaining the correct state of counters in an external DB is left up to the user
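A standard way to keep counters correct in an external store despite batch replays is to make updates idempotent by recording the id of the last applied batch. This sketch illustrates that idea; it is not SnappyData's or Spark's implementation:

```python
# Idempotent counter updates: each batch carries an id; a replayed batch is
# detected and skipped, so counters stay correct under at-least-once delivery.

counters = {}
applied = set()   # ids of applied batches; in a real store this set is
                  # persisted atomically together with the counters

def apply_batch(batch_id, increments):
    if batch_id in applied:
        return False              # replay after a failure: ignore
    for key, n in increments:
        counters[key] = counters.get(key, 0) + n
    applied.add(batch_id)
    return True

apply_batch(1, [("clicks", 5)])
apply_batch(1, [("clicks", 5)])   # resent batch: no double count
apply_batch(2, [("clicks", 3)])
```

The hard part in practice is the comment in the code: the dedup set and the counters must be updated in one atomic step, which is exactly what an external DB outside the processing engine cannot guarantee on its own.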
Challenge 5: Always on – not just fault tolerance
• HA: if something fails, there is always a redundant copy that is fully in sync; failover is instantaneous
• Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage – this can take too long
Challenge 6: Interactive queries with high concurrency are too slow
• OLAP queries are CPU intensive, and traditional methods are too slow
What is SnappyData?
- A new Spark open-source project started by Pivotal GemFire founders and engineers
- Decades of in-memory data management experience
- Focus on real-time, operational analytics – Spark inside an OLTP+OLAP database
SnappyData: A New Approach to Real-Time Analytics
• Deep integration of Spark + GemFire: streaming analytics, probabilistic data, and distributed in-memory SQL in one unified, always-on, cloud-ready cluster for real-time analytics, integrating with deep-scale, high-volume MPP DBs
• Vision – drastically reduce the cost and complexity of modern big data, using a fraction of the resources: 10x better response time, 10x lower resource cost, 10x less complexity
SnappyData Platform – Key Features
• Data can be row-oriented (point updates to reference data) or column-oriented (compressed for high-density storage)
• Supports high write rates and scales out
— Streaming data goes through stages: queued streams (rows), intermediate storage (rows), and finally immutable compressed columns
• Leverages Spark Streaming for micro-batch streaming
— Not designed for ultra-low latency
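The staging above ends with immutable compressed columns. A toy sketch of that final stage – pivoting buffered rows into per-column arrays and run-length encoding them; RLE stands in here for whatever columnar compression the store actually uses:

```python
def rows_to_columns(rows):
    # Pivot row-oriented tuples into column-oriented lists.
    return [list(col) for col in zip(*rows)]

def rle_encode(column):
    # Run-length encode one column as [(value, run_length), ...];
    # sorted or low-cardinality columns compress especially well.
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

rows = [("NY", 1), ("NY", 2), ("SF", 3)]   # buffered row-oriented batch
cols = rows_to_columns(rows)
encoded = [rle_encode(c) for c in cols]
```

Scans and aggregations then read only the columns they need, in compressed form, which is the performance argument for the final stage.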
Key Features – Synopses Using Approximate Data
• Maintain exact data in columnar form (compressed)
• Maintain stratified samples
— Intelligent sampling to keep error bounds low
• Specialize for time series
— Decay accuracy over time → sub-linear growth
• Probabilistic data
— TopK for time series (using time-aggregation CMS and item aggregation)
— Histograms, HyperLogLog, Bloom filters, wavelets
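Of the structures listed, the Count-Min Sketch (CMS) is the workhorse for frequency estimation; it never underestimates a count and overestimates only by hash collisions. A minimal sketch, with illustrative sizes and a simple hash choice:

```python
import hashlib

class CountMinSketch:
    """Minimal CMS: estimates item frequencies in fixed memory.
    Estimates never underestimate; overestimation shrinks with width/depth."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent-ish hash per row, derived by salting with the row id.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-collided (tightest) estimate.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for url in ["/a"] * 100 + ["/b"] * 5:
    cms.add(url)
```

The "time aggregation CMS" on the slide (Hokusai, revisited later) keeps one such sketch per time interval, coarsening older intervals to decay accuracy over time.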
Key Differentiation – OLTP + OLAP with Synopses
[Diagram: user applications process events and issue interactive queries. A sliding window emits micro-batches into a pluggable micro-batch processing module; continuous-query (CQ) subscriptions and an OLAP query engine run over two stores: a Summary DB (time series with decay; TopK and frequency summary structures; counters; histograms; stratified samples; raw data windows) and an Exact DB (row + column oriented).]
Solving the Complexity and Volume Challenge
• Far fewer resources: a TB problem becomes a GB problem
— CPU contention drops
• Far less complex
— A single cluster for stream ingestion, continuous queries, interactive queries and machine learning
• Much faster
— Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
Not a Panacea, But Comes Close
• Synopses require prior workload knowledge
• Not all queries qualify – complex queries will result in high error rates
• Our strategy – be an adjunct to MPP databases
— First compute the error estimate; if the error is above tolerance, delegate to the exact store
Adjunct Store In Certain Scenarios
We are hiring!
http://www.snappydata.io/blog/careers-fall2015
Go to snappydata.io/blog for more info
Register at http://www.snappydata.io/register for the beta
Again, we are hiring!
http://www.snappydata.io/blog/careers-fall2015
EXTRAS
Speed/Accuracy Trade-off
[Chart: error vs. execution time (a function of sample size). Executing on the entire dataset takes ~30 minutes; interactive queries over samples answer in ~2 seconds with bounded error. Credit: Barzan; Berkeley AMPLab]
Stratified Sampling
● Random sampling has intuitive semantics
● But data is typically skewed, and our queries are multi-dimensional
● e.g. average sales order price for each product class in each geography
● Some products may have little to no sales
● Stratification ensures that each "group" (product class) is represented
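A minimal stratified-sampling sketch: sample each product class (stratum) independently, so a rarely sold class survives into the sample even when one class dominates the data. The data and the per-stratum cap are illustrative:

```python
import random

def stratified_sample(rows, key, per_stratum, seed=0):
    # Group rows by stratum, then sample up to `per_stratum` from each group,
    # so small groups (e.g. rarely sold products) are always represented.
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for group in strata.values():
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# 1000 widget orders dwarf a single gizmo order; uniform sampling at this
# rate would almost certainly miss the gizmo stratum entirely.
orders = [("widgets", p) for p in range(1000)] + [("gizmos", 5)]
s = stratified_sample(orders, key=lambda r: r[0], per_stratum=10)
```

Per-group aggregates (like the average price query above) then have bounded error in every group, not just the popular ones.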
Stratified Sampling Challenges
● Solutions exist for batch data (BlinkDB), but they are only a partial solution
● Our challenge is to make this work for infinite streams of time-series data
● Answer: combine stratification with other techniques such as Bernoulli/reservoir sampling
● Exponentially decay over time
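Classic reservoir sampling, mentioned above, maintains a fixed-size uniform sample over an unbounded stream in O(k) memory; a minimal sketch (the time-decayed variant is not shown):

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Algorithm R: after seeing n items, every item has probability k/n of
    # being in the reservoir, regardless of stream length.
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:
                reservoir[j] = item      # replace with probability k/n
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
```

Running one reservoir per stratum combines the two techniques; exponential decay then biases replacement toward recent items, which is the part the slide leaves as the open challenge.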
Dealing with Errors and Latency
● Well-known error techniques exist for "closed-form aggregations"
● Exploring other techniques – the Analytical Bootstrap (Barzan)
● The user can specify an error bound with a confidence interval
● The engine first determines whether it can satisfy the error bound
● If not, it delegates execution to an "exact" store (GPDB, etc.)
● Query execution can also be latency-bounded
● SELECT … FROM … WHERE … WITHIN 2 SECONDS

SELECT avg(sessionTime) FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
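The ERROR/CONFIDENCE clause above can be illustrated with a closed-form aggregate: estimate the mean from a sample, compute its confidence interval, and delegate to the exact store when the interval's half-width exceeds the requested bound. The numbers, the normal approximation, and the function names are illustrative, not SnappyData's engine:

```python
import math, random

def mean_with_error(sample, confidence_z=1.96):
    # Closed-form 95% CI for a mean: half-width = z * s / sqrt(n)
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, confidence_z * math.sqrt(var / n)

def answer(sample, error_bound):
    mean, half_width = mean_with_error(sample)
    if half_width <= error_bound:
        return ("approx", mean)                   # sample satisfies the bound
    return ("delegate_to_exact_store", None)      # e.g. GPDB

rng = random.Random(0)
sample = [rng.gauss(30.0, 5.0) for _ in range(400)]   # a stratum's sample
kind_loose, _ = answer(sample, error_bound=1.0)
kind_tight, _ = answer(sample, error_bound=0.01)
```

With 400 points and a standard deviation near 5, the half-width is roughly 0.5: a 1.0 bound is answerable from the sample, a 0.01 bound is not, so the second query would be pushed to the exact store.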
Sketching Techniques
● Sampling is not effective for outlier detection
● MAX, MIN, etc.
● Other probabilistic structures: CMS, heavy hitters, etc.
● We implemented Hokusai
● Captures frequencies of items in a time series
● The design permits TopK queries over arbitrary time intervals (e.g. the top-100 popular URLs)

SELECT pageURL, count(*) frequency FROM Table
WHERE … GROUP BY …
ORDER BY frequency DESC
LIMIT 100
DEMO
[Diagram: a Zeppelin server with a Spark interpreter (driver) connected to Spark executor JVMs, each holding a row cache and compressed columnar storage.]