SnappyData
My Talk today
• Key attributes for modern real-time stream processing and interactive analytics
• What is so exciting to me about Spark?
• What are some of the myths?
• What is missing in Spark for real time?
• SnappyData's mission – fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + stream processing + probabilistic data
Stream processing – what is required?
• Ingest in parallel from disparate sources; you cannot throttle the input
  – Sockets, message bus, files, HDFS, …
• Process in parallel – filter, transform, normalize
• Apply rules and trigger alerts, actions – SQL?
• State management with mutability
• HA semantics for state
  – Input streams must be HA (reliable enqueue with "once and only once" processing)
  – Processing may depend on reference data that must be HA
  – Generated state must be HA
• Store raw and derived data into HDFS
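The ingest-filter-transform-alert loop above can be sketched in a few lines. This is a minimal, illustrative stand-in for one micro-batch of processing, not any particular engine's API; the event shape and the alert threshold are assumptions.

```python
# Minimal sketch of one micro-batch through the pipeline:
# filter malformed input, normalize, apply an alert rule.
from dataclasses import dataclass

@dataclass
class Event:
    sensor: str
    value: float

def process_batch(batch, threshold=100.0):
    """Return (sensor, value) alerts for readings above the threshold."""
    alerts = []
    for e in batch:
        if e.value < 0:              # filter: drop malformed readings
            continue
        norm = e.value / threshold   # transform: normalize against threshold
        if norm > 1.0:               # rule: trigger an alert
            alerts.append((e.sensor, e.value))
    return alerts
```

A real system would run many such batches in parallel across partitions of the input stream.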
Stream Analytics – Applications
• Real time scoring of analytic models
• Incremental, online training of models
• Recommendations and targeting
• Personalized Ads
• Detect patterns and anomalies in massive quantities of machine data
• Stream analytics requires working with large amounts of historical and reference data
Stream Processing vs. Stream Analytics
• Popular stream processors are parallel processing frameworks with very limited support for deep analytic operators
• Deeper analytic class problems include:
  – Report the top-K popular URLs over the last hour or day, reporting every 5 seconds
  – Correlate the energy consumption pattern over the last 10 minutes to similar time periods in the past
  – Maintain a prediction model in real time, e.g. a model for fraud detection
• Stream analytics faces the same optimization challenges as OLAP SQL – discovering trends, patterns, and outliers may require incremental joins and aggregations on large quantities of historical or reference data
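The "top-K over the last hour, reported every few seconds" problem can be sketched with exact counts over a time window. This illustrative version keeps every hit in memory; the point of the synopses discussed later is precisely to avoid that. Window size and the data shape are assumptions.

```python
# Sliding-window top-K with exact counts: record (timestamp, url) hits,
# expire anything older than the window, report the K most frequent.
from collections import Counter, deque

class WindowedTopK:
    def __init__(self, window_secs):
        self.window = window_secs
        self.hits = deque()      # (timestamp, url), oldest first
        self.counts = Counter()

    def record(self, ts, url):
        self.hits.append((ts, url))
        self.counts[url] += 1

    def top_k(self, now, k):
        # drop hits that fell out of the window before answering
        while self.hits and self.hits[0][0] <= now - self.window:
            _, old = self.hits.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts.most_common(k)
```

Memory here grows with traffic volume, which is why production systems replace the exact `Counter` with probabilistic structures.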
Current Solutions Fall Short
• Leave it to the application developer – complex, sub-optimal; dynamic rules are difficult
• Join with history and reference data in an external DB – slow; complex management of multiple products
• Embed or integrate with an in-memory row-oriented store – most are designed for OLTP; scans and aggregations are too slow
The New Real-Time Analytics Operational DB
[Diagram: streams feed data-in-motion analytics inside an in-memory DB (interactive queries, updates), backed by a deep-scale, high-volume MPP DB; the application receives alerts and transformed data]
• Streaming analytics should be inside the data store
• OLAP queries are CPU intensive and traditional methods are too slow
So, Why do we like Spark?
• Blends streaming, interactive, and batch analytics into a cohesive whole
• Appeals to Java developers as well as R and Python folks
• Succinct code – maybe the credit goes to Scala?
• Rich set of transformations and libraries (ML, Graph)
• RDDs and fault tolerance without replication
• Stream processing with high throughput (pipeline of micro-batches)
Spark Myths
• It is a distributed in-memory database
  – It is a computational framework with immutable caching
• It is highly available
  – Fault tolerance is not the same as HA
• It is well suited for real-time, operational environments
  – e.g., hundreds of concurrent clients running interactive queries
Spark Streaming Runtime Architecture
[Diagram: a client submits a stream app to the driver; each executor (Spark engine) holds RDD partitions at t0, t1, t2 along a time axis; input arrives from a Kafka queue; output lands in Cassandra]
• The queue is buffered in the executor; the driver submits a batch job every second, which results in a new RDD pushed to the stream (one batch from the buffer)
• Short-term state is immutable; long-term state goes in an external DB
Challenge 1: Spark Driver is NOT HA
• YARN can restart the driver, but state is still an issue
• Fault tolerance can be configured through write-ahead logging and checkpointing
• The bigger problem is that a driver failure results in all executors shutting down – all cached state (could be TBs) will have to be recovered
Challenge 2: External state management
• Joins to reference data, or joins across streams, may require remote access for each batch
• If you cache the data, then you are working with stale reference data
• Colocation of state is needed to maintain performance and avoid falling behind
• newDStream = wordDstream.updateStateByKey[Int](newUpdateFunc, …)
  – The built-in capability to update state as batches arrive requires iterating over the full data set
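The kind of per-key running state that `updateStateByKey` maintains can be sketched in plain Python. This is an illustrative analog, not Spark code; note that the dict is updated only for keys present in the batch, which is the colocated-mutable-store behavior the slide argues for, whereas Spark's operator touches the full state.

```python
# Plain-Python analog of per-key streaming state: fold each micro-batch
# of (key, increment) pairs into a persistent dict of running counts.
def update_state(state, batch):
    """Merge one micro-batch into the state; only touched keys change."""
    for key, inc in batch:
        state[key] = state.get(key, 0) + inc
    return state
```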
Challenge 3: Sharing state across clients (applications)
• RDDs cached within executors cannot be shared across different client applications
• So, for instance: App1 is a streaming app that stores counters in a DataFrame; App2 runs a SQL query and has no visibility into the state from App1
Challenge 4: "Once and only once" in reality is difficult
• Example: update counters for each stream batch. Failures result in the batch being resent, yet the counters should reflect the correct state.
• This works well if the state is managed only inside Spark
• Maintaining the correct state of counters in an external DB is left up to the user
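One standard way the user ends up solving this is to make the external update idempotent, so a resent batch does not double-count. The sketch below is illustrative, not SnappyData's mechanism; the store layout and batch-id scheme are assumptions.

```python
# Idempotent counter updates: remember which batch ids were already
# applied, so a batch resent after a failure is ignored.
class IdempotentCounters:
    def __init__(self):
        self.counts = {}
        self.applied = set()   # batch ids already folded into counts

    def apply_batch(self, batch_id, increments):
        if batch_id in self.applied:   # duplicate delivery: no-op
            return
        for key, inc in increments:
            self.counts[key] = self.counts.get(key, 0) + inc
        self.applied.add(batch_id)
```

In a real external DB the "applied" set and the counter update would need to commit in one transaction to preserve the guarantee.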
Challenge 5: Always on – not just fault tolerance
• HA: if something fails, there is always a redundant copy that is fully in sync; failover is instantaneous
• Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage. This can take too long.
Challenge 6: Interactive queries with high concurrency are too slow
• OLAP queries are CPU intensive and traditional methods are too slow
What is SnappyData?
• A new Spark open source project started by Pivotal GemFire founders and engineers
• Decades of in-memory data management experience
• Focus on real-time, operational analytics – Spark inside an OLTP + OLAP database
SnappyData: A New Approach To Real-Time Analytics
[Diagram: deep integration of Spark + GemFire combines streaming analytics, probabilistic data, and distributed in-memory SQL in one unified, always-on, cloud-ready cluster for real-time analytics, integrating with a deep-scale, high-volume MPP DB]
• Vision: drastically reduce the cost and complexity of modern big data, using a fraction of the resources
• 10x better response time, 10x lower resource cost, 10x less complexity
SnappyData Platform – Key Features
• Data can be row oriented (point updates to reference data)
• Or column oriented (compressed for high-density storage)
• Supports high write rates, scalable
  – Streaming data goes through stages: queued streams (rows), intermediate storage (rows), finally immutable compressed columns
• Leverages Spark Streaming for micro-batch streaming
  – Not designed for ultra-low latency
Key Features – Synopses Using Approximate Data
• Maintain exact data in columnar form (compressed)
• Maintain stratified samples
  – Intelligent sampling to keep error bounds low
• Specialize for time series
  – Decay accuracy over time for sub-linear growth
• Probabilistic data
  – TopK for time series (using time-aggregation CMS, item aggregation)
  – Histograms, HLL, Bloom filters, wavelets
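The CMS (Count-Min Sketch) named above is worth a concrete look. This is a minimal plain-Python illustration, not SnappyData's implementation; the width, depth, and md5-based hashing are arbitrary choices made for brevity.

```python
# Minimal Count-Min Sketch: fixed-size counter table that estimates item
# frequencies; collisions can only inflate counts, never deflate them.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # one independent-ish hash per row, derived from md5
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # taking the min across rows gives the tightest overestimate
        return min(self.table[row][col] for row, col in self._buckets(item))
```

Memory is fixed at width × depth counters regardless of how many distinct items stream through, which is the property that makes such synopses viable for infinite streams.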
Key Differentiation – OLTP + OLAP with Synopses
[Diagram: user applications process events and issue interactive queries through CQ subscriptions and an OLAP query engine; a micro-batch processing module (plugins) consumes sliding windows that emit batches; state is kept in a Summary DB (time series with decay; TopK and frequency summary structures; counters; histograms; stratified samples; raw data windows) alongside an Exact DB (row + column oriented)]
Solving The Complexity And Volume Challenge
• Far fewer resources: TB problem becomes GB.
— CPU contention drops
• Far less complex
— Single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
• Much faster
— Compressed data managed in distributed memory in columnar form
reduces volume and is much more responsive
Not a Panacea, But Comes Close
• Synopses require prior workload knowledge
• Not all queries are supported – complex queries will result in high error rates
• Our strategy – be an adjunct to MPP databases
  – First compute the error estimate; if the error is above tolerance, delegate to the exact store
Stratified Sampling
• Random sampling has intuitive semantics
• But data is typically skewed, and our queries are multi-dimensional
  – e.g., average sales order price for each product class in each geography
  – Some products may have little to no sales
• Stratification ensures that each "group" (product class) is represented
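The point about rare groups can be made concrete with a per-stratum quota. This is a simplified batch sketch, not SnappyData's sampler; the fixed per-group quota and the grouping key are illustrative assumptions.

```python
# Stratified sampling sketch: partition rows into strata by a group key,
# then sample up to a fixed quota from each stratum, so rare groups
# (e.g. low-volume product classes) still appear in the sample.
import random

def stratified_sample(rows, group_key, per_group, seed=42):
    rnd = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(group_key(row), []).append(row)
    sample = []
    for group_rows in strata.values():
        k = min(per_group, len(group_rows))   # small strata kept whole
        sample.extend(rnd.sample(group_rows, k))
    return sample
```

A plain uniform sample of the same size would, with high probability, miss the rare stratum entirely.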
Stratified Sampling Challenges
• Solutions exist for batch data (BlinkDB), but only as a partial solution
• Our challenge is to make this work for infinite streams of time-series data
• Answer: use a combination of stratification with other techniques like Bernoulli/reservoir sampling
• Exponentially decay samples over time
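Reservoir sampling, one of the ingredients named above, keeps a uniform fixed-size sample of an unbounded stream in one pass. This is the classic Algorithm R, shown as a standalone sketch rather than as it would be wired into a stratified, decaying sampler.

```python
# Reservoir sampling (Algorithm R): after seeing n items, each item has
# probability k/n of being in the k-slot reservoir, using O(k) memory.
import random

def reservoir_sample(stream, k, seed=0):
    rnd = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rnd.randrange(n + 1)      # replace with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Running one reservoir per stratum gives a streaming stratified sampler; adding time decay biases the reservoir toward recent data.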
Dealing with Errors and Latency
• Well-known error techniques exist for "closed-form aggregations"
• Exploring other techniques – Analytical Bootstrap (Barzan)
• The user can specify an error bound with a confidence interval
  – The engine first determines whether it can satisfy the error bound
  – If not, it delegates execution to an "exact" store (GPDB, etc.)
• Query execution can also be latency bounded
  – SELECT … FROM … WHERE … WITHIN 2 SECONDS

SELECT avg(sessionTime) FROM Table
WHERE city = 'San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
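The "satisfy the bound or delegate" decision for the query above can be sketched with a standard confidence interval. This is an illustrative closed-form check, not the engine's logic; interpreting ERROR 0.1 as a 10% relative error and using z = 1.96 for 95% confidence are assumptions.

```python
# Error-bound gate for an approximate AVG: compute the sample mean and a
# 95% confidence half-width from the standard error; answer approximately
# only if the half-width is within the relative-error tolerance.
import math

def approx_avg(sample, tolerance):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)   # 95% CI half-width
    if half_width <= tolerance * abs(mean):
        return mean, "approximate"
    return None, "delegate to exact store"
```

A large, tight sample answers immediately from the synopsis; a small or high-variance one falls through to the exact store.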
Sketching Techniques
• Sampling is not effective for outlier detection
  – MAX, MIN, etc.
• Other probabilistic structures: CMS, heavy hitters, etc.
• We implemented Hokusai
  – Captures frequencies of items in time series
  – The design permits TopK queries over arbitrary time intervals (e.g., the top 100 popular URLs)

SELECT pageURL, count(*) frequency FROM Table
WHERE … GROUP BY …
ORDER BY frequency DESC
LIMIT 100
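To make the "heavy hitters" structure mentioned above concrete, here is the Misra-Gries algorithm in plain Python. Note this is a simplified stand-in, not the Hokusai algorithm, which additionally aggregates sketches across time intervals to answer queries over arbitrary ranges.

```python
# Misra-Gries heavy hitters: track candidate frequent items with at most
# `capacity` counters; any item occurring more than n/(capacity+1) times
# in a stream of n items is guaranteed to survive.
def heavy_hitters(stream, capacity):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # no free slot: decrement everything, evict counters at zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The surviving counters are candidates for the TopK answer; a second pass (or an exact store) can verify their true counts.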