SnappyData
Getting Spark ready for real-time, operational analytics
Jags Ramnarayan
Oct 2015
snappydata.io
My Talk Today
 Key attributes for modern real-time stream processing and interactive analytics
 What is so exciting to me about Spark? What are some of the myths?
 What is missing in Spark for real time?
 SnappyData's mission – fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + stream processing + probabilistic data
Stream processing – what is required?
• Ingest in parallel from disparate sources; you cannot throttle the input
- Sockets, message buses, files, HDFS, ...
• Process in parallel – filter, transform, normalize
• Apply rules and trigger alerts and actions – SQL?
• State management with mutability
• HA semantics for state
– Input streams must be HA (reliable enqueue with "once and only once" processing)
– Processing may depend on reference data that must be HA
– Generated state must be HA
• Store raw and derived data in HDFS
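The per-batch work listed above – filter, transform, apply rules, mutate state – can be sketched in plain Python. The field names and the alert threshold are illustrative assumptions, not part of the talk:

```python
# Sketch of one pass through a stream pipeline:
# filter -> transform/normalize -> apply rules/alerts -> mutate state.
# Field names and the 0.9 alert threshold are illustrative assumptions.

state = {"totals": {}}   # mutable state that must survive failures (HA)
alerts = []

def process_batch(events):
    # Filter: drop malformed events
    valid = [e for e in events if "sensor" in e and "value" in e]
    # Transform/normalize: scale readings to a common unit
    normalized = [(e["sensor"], e["value"] / 100.0) for e in valid]
    for sensor, value in normalized:
        # Apply rules and trigger alerts
        if value > 0.9:
            alerts.append((sensor, value))
        # State management with mutability: running totals per sensor
        state["totals"][sensor] = state["totals"].get(sensor, 0.0) + value

process_batch([
    {"sensor": "a", "value": 95},
    {"sensor": "a", "value": 10},
    {"bad": True},
    {"sensor": "b", "value": 50},
])
```

A real processor runs this logic in parallel per partition; the sketch only shows the per-batch semantics.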
Stream Analytics – Applications
• Real time scoring of analytic models
• Incremental, online training of models
• Recommendations and targeting
• Personalized Ads
• Detect patterns and anomalies in massive quantities of machine data
• Stream analytics requires working with lots of historical and reference data
Stream Processing vs. Stream Analytics
• Popular stream processors are parallel processing frameworks with very limited support for deep analytic operators
Deeper analytic problems include:
 Report the topK popular URLs over the last hour or day, reporting every 5 seconds
 Correlate the energy consumption pattern over the last 10 minutes to similar time periods in the past
 Maintain a prediction model in real time, e.g. a model for fraud detection
Stream analytics poses the same optimization challenges as OLAP SQL – discovering trends, patterns and outliers may require incremental joins and aggregations on large quantities of historical or reference data
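The first of those problems – topK URLs over a trailing window, reported frequently – can be sketched with exact counts in plain Python; a deque of per-interval counters stands in for the windowed state a stream processor would keep (at scale, the counters would be replaced by sketches):

```python
from collections import Counter, deque
import heapq

class SlidingTopK:
    """Exact top-K over the last `window` intervals."""
    def __init__(self, window):
        self.buckets = deque(maxlen=window)   # one Counter per interval

    def add_interval(self, urls):
        # Each reporting interval contributes one bucket of counts;
        # the deque automatically expires the oldest bucket.
        self.buckets.append(Counter(urls))

    def topk(self, k):
        total = Counter()
        for b in self.buckets:
            total.update(b)
        return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

t = SlidingTopK(window=2)
t.add_interval(["/a", "/a", "/b"])
t.add_interval(["/b", "/c"])
t.add_interval(["/c", "/c", "/b"])   # the first interval falls out of the window
```

The memory cost here is linear in the number of distinct URLs per window, which is exactly why the later slides turn to probabilistic structures.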
Current Solutions Fall Short
• Leave it to the application developer – complex, sub-optimal; dynamic rules are difficult
• Join with history and reference data in an external DB – slow; complex management of multiple products
• Embed or integrate an in-memory row-oriented store – most are designed for OLTP; scans and aggregations are too slow
The New Real-Time Analytics Operational DB
[Diagram: streams flow through data-in-motion analytics (transform) into an in-memory DB serving interactive queries and updates, integrated with a deep-scale, high-volume MPP DB; applications receive alerts. Streaming analytics should be inside the data store: OLAP queries are CPU intensive and traditional methods are too slow.]
So, why do we like Spark?
• Blends streaming, interactive and batch analytics into a cohesive whole
• Appeals to Java developers as well as R and Python folks
• Succinct code – maybe the credit goes to Scala?
• Rich set of transformations and libraries (ML, Graph)
• RDDs and fault tolerance without replication
• Stream processing with high throughput (a pipeline of micro-batches)
Spark Myths
• It is a distributed in-memory database
– It is a computational framework with immutable caching
• It is highly available
– Fault tolerance is not the same as HA
• It is well suited for real-time, operational environments
– e.g. hundreds of concurrent clients running interactive queries
Spark Streaming Runtime Architecture
[Diagram: a client submits a stream app to the driver; Kafka queues feed Spark executors, each holding RDD partitions at t0, t1, t2 over time, with results flowing to Cassandra. The queue is buffered in the executor; the driver submits a batch job every second, which pushes a new RDD (the batch from the buffer) onto the stream. Short-term state is immutable; long-term state lives in an external DB.]
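The micro-batch mechanic described above – buffer in the receiver, cut an immutable batch every interval – can be sketched without Spark. The one-second cadence comes from the slide; everything else here is an illustrative reduction:

```python
# Minimal micro-batch sketch: a receiver buffers events; on every tick the
# "driver" cuts the buffer into an immutable batch (analogous to a new RDD).

buffer = []
batches = []   # each entry plays the role of an RDD for one interval

def receive(event):
    buffer.append(event)

def on_tick():
    # Called once per batch interval (e.g. every second).
    global buffer
    batch = tuple(buffer)   # immutable snapshot, like an RDD at time t
    batches.append(batch)
    buffer = []             # start buffering the next interval

receive(1); receive(2)
on_tick()
receive(3)
on_tick()
```

The immutability of each snapshot is what makes recomputation via lineage possible, and it is also why long-lived mutable state has to live somewhere else.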
Challenge 1: Spark Driver is NOT HA
• YARN can restart the driver, but state is still an issue
• Fault tolerance can be configured through write-ahead logging and checkpointing
• The bigger problem: a driver failure shuts down all executors, so all cached state (possibly TBs) must be recovered
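Checkpointing, as mentioned above, amounts to periodically persisting state so a restarted driver can resume instead of recomputing from scratch. A minimal sketch of the idea; the file name and JSON format are assumptions, not Spark's actual checkpoint layout:

```python
import json, os, tempfile

def checkpoint(state, path):
    # Write atomically: a crash mid-write must not corrupt the last checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)   # atomic rename on POSIX and Windows

def recover(path):
    # On restart, reload the last checkpoint (or start empty).
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

path = os.path.join(tempfile.mkdtemp(), "state.json")
checkpoint({"count": 42}, path)
restored = recover(path)
```

The slide's point stands even with checkpointing in place: recovery time scales with the size of the cached state, which is what distinguishes fault tolerance from HA.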
Challenge 2: External state management
• Joins to reference data, or joins across streams, may require remote access on every batch
• If you cache the data, you are working with stale reference data
• State must be colocated to maintain performance and avoid falling behind
• The built-in capability to update state as batches arrive requires iterating over the full data set:

newDStream = wordDstream.updateStateByKey[Int](newUpdateFunc, …)
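The limitation above is that updateStateByKey touches the full state on every batch. The incremental alternative – touch only the keys present in the batch – can be sketched with a plain dict (later Spark releases added mapWithState to provide exactly this):

```python
# Per-key state kept in a dict; each batch updates only the keys it contains,
# instead of iterating over the entire state as updateStateByKey does.

counts = {}

def update_batch(pairs):
    # `pairs` is one micro-batch of (word, increment) tuples.
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n

update_batch([("spark", 1), ("stream", 2)])
update_batch([("spark", 3)])   # only "spark" is touched in this batch
```

The cost per batch is then proportional to the batch size, not to the total number of keys accumulated over time.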
Challenge 3: Sharing state across clients (applications)
• RDDs cached within executors cannot be shared across different client applications
• For instance: App1 is a streaming app that stores counters in a DataFrame; App2 runs a SQL query and has no visibility into App1's state
Challenge 4: "Once and only once" is difficult in reality
• Example: update counters for each stream batch; a failure results in the batch being resent, yet the counters should reflect the correct state
• This works well if the state is managed only inside Spark
• Maintaining the correct state of counters in an external DB is left up to the user
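A standard way to keep counters correct in an external store despite batch replays is to make updates idempotent by recording the id of the last applied batch. This sketch illustrates that idea; it is not SnappyData's or Spark's implementation:

```python
# Idempotent counter updates: each batch carries an id; a replayed batch is
# detected and skipped, so counters stay correct under at-least-once delivery.

counters = {}
applied = set()   # ids of applied batches; in a real store this set is
                  # persisted atomically together with the counters

def apply_batch(batch_id, increments):
    if batch_id in applied:
        return False              # replay after a failure: ignore
    for key, n in increments:
        counters[key] = counters.get(key, 0) + n
    applied.add(batch_id)
    return True

apply_batch(1, [("clicks", 5)])
apply_batch(1, [("clicks", 5)])   # resent batch: no double count
apply_batch(2, [("clicks", 3)])
```

The hard part in practice is the comment in the code: the dedup set and the counters must be updated in one atomic step, which is exactly what an external DB outside the processing engine cannot guarantee on its own.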
Challenge 5: Always on – not just fault tolerance
• HA: if something fails, there is always a redundant copy that is fully in sync; failover is instantaneous
• Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage – this can take too long
Challenge 6: Interactive queries with high concurrency are too slow
• OLAP queries are CPU intensive, and traditional methods are too slow
What is SnappyData?
- A new Spark open-source project started by Pivotal GemFire founders and engineers
- Decades of in-memory data management experience
- Focus on real-time, operational analytics – Spark inside an OLTP+OLAP database
SnappyData: A New Approach to Real-Time Analytics
• Deep integration of Spark + GemFire: streaming analytics, probabilistic data, and distributed in-memory SQL in one unified, always-on, cloud-ready cluster for real-time analytics, integrating with deep-scale, high-volume MPP DBs
• Vision – drastically reduce the cost and complexity of modern big data, using a fraction of the resources: 10x better response time, 10x lower resource cost, 10x less complexity
SnappyData Platform – Key Features
• Data can be row-oriented (point updates to reference data) or column-oriented (compressed for high-density storage)
• Supports high write rates and scales out
— Streaming data goes through stages: queued streams (rows), intermediate storage (rows), and finally immutable compressed columns
• Leverages Spark Streaming for micro-batch streaming
— Not designed for ultra-low latency
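The staging above ends with immutable compressed columns. A toy sketch of that final stage – pivoting buffered rows into per-column arrays and run-length encoding them; RLE stands in here for whatever columnar compression the store actually uses:

```python
def rows_to_columns(rows):
    # Pivot row-oriented tuples into column-oriented lists.
    return [list(col) for col in zip(*rows)]

def rle_encode(column):
    # Run-length encode one column as [(value, run_length), ...];
    # sorted or low-cardinality columns compress especially well.
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

rows = [("NY", 1), ("NY", 2), ("SF", 3)]   # buffered row-oriented batch
cols = rows_to_columns(rows)
encoded = [rle_encode(c) for c in cols]
```

Scans and aggregations then read only the columns they need, in compressed form, which is the performance argument for the final stage.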
Key Features – Synopses Using Approximate Data
• Maintain exact data in columnar form (compressed)
• Maintain stratified samples
— Intelligent sampling to keep error bounds low
• Specialize for time series
— Decay accuracy over time → sub-linear growth
• Probabilistic data
— TopK for time series (using time-aggregation CMS and item aggregation)
— Histograms, HyperLogLog, Bloom filters, wavelets
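Of the structures listed, the Count-Min Sketch (CMS) is the workhorse for frequency estimation; it never underestimates a count and overestimates only by hash collisions. A minimal sketch, with illustrative sizes and a simple hash choice:

```python
import hashlib

class CountMinSketch:
    """Minimal CMS: estimates item frequencies in fixed memory.
    Estimates never underestimate; overestimation shrinks with width/depth."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent-ish hash per row, derived by salting with the row id.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-collided (tightest) estimate.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for url in ["/a"] * 100 + ["/b"] * 5:
    cms.add(url)
```

The "time aggregation CMS" on the slide (Hokusai, revisited later) keeps one such sketch per time interval, coarsening older intervals to decay accuracy over time.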
Key Differentiation – OLTP + OLAP with Synopses
[Diagram: user applications process events and issue interactive queries. A sliding window emits micro-batches into a pluggable micro-batch processing module; continuous-query (CQ) subscriptions and an OLAP query engine run over two stores: a Summary DB (time series with decay; TopK and frequency summary structures; counters; histograms; stratified samples; raw data windows) and an Exact DB (row + column oriented).]
Solving the Complexity and Volume Challenge
• Far fewer resources: a TB problem becomes a GB problem
— CPU contention drops
• Far less complex
— A single cluster for stream ingestion, continuous queries, interactive queries and machine learning
• Much faster
— Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
Not a Panacea, But Comes Close
• Synopses require prior workload knowledge
• Not all queries qualify – complex queries will result in high error rates
• Our strategy – be an adjunct to MPP databases
— First compute the error estimate; if the error is above tolerance, delegate to the exact store
Adjunct Store In Certain Scenarios
We are hiring!
http://www.snappydata.io/blog/careers-fall2015
Go to snappydata.io/blog for more info
Register at http://www.snappydata.io/register for the beta
Again, we are hiring!
http://www.snappydata.io/blog/careers-fall2015
EXTRAS
Speed/Accuracy Trade-off
[Chart: error vs. execution time (a function of sample size). Executing on the entire dataset takes ~30 minutes; interactive queries over samples answer in ~2 seconds with bounded error. Credit: Barzan; Berkeley AMPLab]
Stratified Sampling
● Random sampling has intuitive semantics
● But data is typically skewed, and our queries are multi-dimensional
● e.g. average sales order price for each product class in each geography
● Some products may have little to no sales
● Stratification ensures that each "group" (product class) is represented
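A minimal stratified-sampling sketch: sample each product class (stratum) independently, so a rarely sold class survives into the sample even when one class dominates the data. The data and the per-stratum cap are illustrative:

```python
import random

def stratified_sample(rows, key, per_stratum, seed=0):
    # Group rows by stratum, then sample up to `per_stratum` from each group,
    # so small groups (e.g. rarely sold products) are always represented.
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for group in strata.values():
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample

# 1000 widget orders dwarf a single gizmo order; uniform sampling at this
# rate would almost certainly miss the gizmo stratum entirely.
orders = [("widgets", p) for p in range(1000)] + [("gizmos", 5)]
s = stratified_sample(orders, key=lambda r: r[0], per_stratum=10)
```

Per-group aggregates (like the average price query above) then have bounded error in every group, not just the popular ones.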
Stratified Sampling Challenges
● Solutions exist for batch data (BlinkDB), but they are only a partial solution
● Our challenge is to make this work for infinite streams of time-series data
● Answer: combine stratification with other techniques such as Bernoulli/reservoir sampling
● Exponentially decay over time
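Classic reservoir sampling, mentioned above, maintains a fixed-size uniform sample over an unbounded stream in O(k) memory; a minimal sketch (the time-decayed variant is not shown):

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Algorithm R: after seeing n items, every item has probability k/n of
    # being in the reservoir, regardless of stream length.
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:
                reservoir[j] = item      # replace with probability k/n
    return reservoir

sample = reservoir_sample(range(10_000), k=100)
```

Running one reservoir per stratum combines the two techniques; exponential decay then biases replacement toward recent items, which is the part the slide leaves as the open challenge.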
Dealing with Errors and Latency
● Well-known error techniques exist for "closed-form aggregations"
● Exploring other techniques – the Analytical Bootstrap (Barzan)
● The user can specify an error bound with a confidence interval
● The engine first determines whether it can satisfy the error bound
● If not, it delegates execution to an "exact" store (GPDB, etc.)
● Query execution can also be latency-bounded
● SELECT … FROM … WHERE … WITHIN 2 SECONDS

SELECT avg(sessionTime) FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
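The ERROR/CONFIDENCE clause above can be illustrated with a closed-form aggregate: estimate the mean from a sample, compute its confidence interval, and delegate to the exact store when the interval's half-width exceeds the requested bound. The numbers, the normal approximation, and the function names are illustrative, not SnappyData's engine:

```python
import math, random

def mean_with_error(sample, confidence_z=1.96):
    # Closed-form 95% CI for a mean: half-width = z * s / sqrt(n)
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, confidence_z * math.sqrt(var / n)

def answer(sample, error_bound):
    mean, half_width = mean_with_error(sample)
    if half_width <= error_bound:
        return ("approx", mean)                   # sample satisfies the bound
    return ("delegate_to_exact_store", None)      # e.g. GPDB

rng = random.Random(0)
sample = [rng.gauss(30.0, 5.0) for _ in range(400)]   # a stratum's sample
kind_loose, _ = answer(sample, error_bound=1.0)
kind_tight, _ = answer(sample, error_bound=0.01)
```

With 400 points and a standard deviation near 5, the half-width is roughly 0.5: a 1.0 bound is answerable from the sample, a 0.01 bound is not, so the second query would be pushed to the exact store.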
Sketching Techniques
● Sampling is not effective for outlier detection
● MAX, MIN, etc.
● Other probabilistic structures: CMS, heavy hitters, etc.
● We implemented Hokusai
● Captures frequencies of items in a time series
● The design permits TopK queries over arbitrary time intervals (e.g. the top-100 popular URLs)

SELECT pageURL, count(*) frequency FROM Table
WHERE … GROUP BY …
ORDER BY frequency DESC
LIMIT 100
DEMO
[Diagram: a Zeppelin server with a Spark interpreter (driver) connected to Spark executor JVMs, each holding a row cache and compressed columnar storage.]