SnappyData
Getting Spark ready for real time,
operational analytics
Jags Ramnarayan
Oct 2015
snappydata.io
My Talk Today
• Key attributes for modern real-time stream processing and interactive analytics
• What is so exciting to me about Spark? What are some of the myths?
• What is missing in Spark for real time?
• SnappyData’s mission: fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + stream processing + probabilistic data
Stream processing – what is required?
• Ingest in parallel from disparate sources; the input cannot be throttled
- Sockets, message bus, files, HDFS...
• Process in parallel: filter, transform, normalize
• Apply rules and trigger alerts, actions – SQL?
• State management with mutability
• HA semantics for state
– Input streams must be HA (reliable enqueue with “once and only once” processing)
– Processing may depend on reference data that must be HA
– Generated state must be HA
• Store raw, derived data into HDFS
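The processing steps listed above (filter, transform, normalize, then apply rules that trigger alerts) can be sketched as a small pipeline. This is an illustrative sketch in plain Python, not Spark or SnappyData code; the event fields (`sensor`, `temp_f`) and the threshold rule are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]

def keep_valid(events: Iterable[Event]) -> Iterator[Event]:
    """Filter: drop events that lack the fields the rules need."""
    for e in events:
        if "sensor" in e and "temp_f" in e:
            yield e

def normalize(events: Iterable[Event]) -> Iterator[Event]:
    """Transform/normalize: clean the sensor id, convert Fahrenheit to Celsius."""
    for e in events:
        yield {
            "sensor": str(e["sensor"]).strip().lower(),
            "temp_c": round((float(e["temp_f"]) - 32.0) * 5.0 / 9.0, 1),
        }

def alerts(events: Iterable[Event], max_temp_c: float) -> List[str]:
    """Rule: raise an alert for every reading above the threshold."""
    return [f"ALERT {e['sensor']}: {e['temp_c']}C" for e in events if e["temp_c"] > max_temp_c]

raw = [
    {"sensor": " S1 ", "temp_f": 212.0},
    {"sensor": "S2", "temp_f": 50.0},
    {"bad": "record"},  # filtered out: missing fields
]
fired = alerts(normalize(keep_valid(raw)), max_temp_c=80.0)
```

The stages are generators, so each event flows through all steps without materializing intermediate lists; a parallel engine would run the same stages per partition.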
Stream Analytics – Applications
• Real time scoring of analytic models
• Incremental, online training of models
• Recommendations and targeting
• Personalized Ads
• Detect patterns and anomalies in massive quantities of machine data
• Stream Analytics requires working with lots of history, reference data
Stream Processing vs. Stream Analytics
• Popular stream processors are parallel processing frameworks with very limited support for deep analytic operators
• Deeper analytic class problems include:
– Report the topK popular URLs over the last hour or day, refreshed every 5 seconds
– Correlate the energy consumption pattern over the last 10 mins to similar time periods in the past
– Maintain a prediction model in real time, e.g. a model for fraud detection
• Stream analytics faces the same optimization challenges as OLAP SQL: discovering trends, patterns and outliers may require incremental joins and aggregations on large quantities of historical or reference data
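As an illustration of the "topK popular URLs over the last hour" problem, here is a minimal exact sliding-window counter in plain Python. The class name `SlidingTopK` and its methods are hypothetical. Note that this exact approach keeps every event of the window in memory, which is the cost that approximate summaries aim to avoid.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple

class SlidingTopK:
    """Exact top-K counts over the last `window_secs` seconds of events."""

    def __init__(self, window_secs: float) -> None:
        self.window_secs = window_secs
        self.events: Deque[Tuple[float, str]] = deque()  # (timestamp, url), oldest first
        self.counts: Counter = Counter()

    def add(self, ts: float, url: str) -> None:
        self.events.append((ts, url))
        self.counts[url] += 1

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_secs:
            _, old_url = self.events.popleft()
            self.counts[old_url] -= 1
            if self.counts[old_url] == 0:
                del self.counts[old_url]

    def top(self, k: int, now: float) -> List[Tuple[str, int]]:
        self._expire(now)
        return self.counts.most_common(k)

window = SlidingTopK(window_secs=3600)
for ts, url in [(0, "/a"), (10, "/b"), (20, "/b"), (30, "/a"), (40, "/a"), (3000, "/c")]:
    window.add(ts, url)
early = window.top(2, now=3005)  # every event is still inside the hour
late = window.top(1, now=3615)   # events at t=0 and t=10 have expired
```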
Current Solutions Fall Short
Approaches:
• Leave to the application developer
• Join with history, reference data in an external DB
• Embed, integrate with an in-memory row-oriented store
Shortcomings:
• Slow; complex management of multiple products
• Complex, sub-optimal; dynamic rules difficult
• Most designed for OLTP; scans, aggregations too slow
The New Real-Time Analytics Operational DB
[Diagram labels: Streams; Transform; Data-in-motion analytics; Alerts; Application; In-memory DB (interactive queries, updates); MPP DB (deep scale, high volume)]
• Streaming analytics should be inside the data store
• OLAP queries are CPU intensive and traditional methods are too slow
So, why do we like Spark?
• Blends streaming, interactive and batch analytics into a cohesive whole
• Appeals to Java developers, R, Python folks
• Succinct code – maybe the credit goes to Scala?
• Rich set of transformations, and libraries (ML, Graph)
• RDD and fault tolerance without replication
• Stream processing with high throughput (pipeline of micro batches)
Spark Myths
• It is a distributed in-memory database
– it is a computational framework with immutable caching
• It is Highly Available
– Fault tolerance is not the same as HA
• Well suited for real time, operational environments
– e.g. hundreds of concurrent clients running interactive queries
Spark Streaming Runtime Architecture
[Diagram: a client submits the stream app; a driver; two executors (Spark engine), each holding RDD partitions at t0, t1 and t2 along a time axis; external systems: Kafka queue, Cassandra]
• The queue is buffered in the executor. The driver submits a batch job every second, which results in a new RDD pushed to the stream (a batch from the buffer).
• Short term: immutable state. Long term: in an external DB.
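The micro-batch model described above can be pictured with a small plain-Python sketch: buffered events are cut into one batch per interval, and each batch is immutable once formed. This only models the idea and is not how Spark is implemented; `micro_batches` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def micro_batches(
    events: Iterable[Tuple[float, str]], interval_secs: float
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Group (timestamp, payload) events into per-interval batches.

    Each batch is an immutable tuple, loosely mirroring how every batch
    interval yields a new, immutable RDD in the stream.
    """
    buffered: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        buffered[int(ts // interval_secs)].append(payload)
    return [(batch_id, tuple(buffered[batch_id])) for batch_id in sorted(buffered)]

batches = micro_batches(
    [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.9, "d")], interval_secs=1.0
)
```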
Challenge 1: Spark Driver is NOT HA
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue, Cassandra)]
• YARN can restart the driver, but state is still an issue
• Fault tolerance can be configured through write-ahead logging and checkpointing
• The bigger problem is that a driver failure results in all executors shutting down. All cached state (could be TBs) will have to be recovered
Challenge 2: External state management
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue, Cassandra)]
• Joins to reference data and joins across streams may require remote access for each batch
• If you cache the data, you are working with stale reference data
• State must be colocated to maintain performance and avoid falling behind
newDStream = wordDstream.updateStateByKey[Int](newUpdateFunc, …)
- The built-in capability to update state as batches arrive requires iterating over the full data set
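To make the last point concrete, the following plain-Python model imitates one `updateStateByKey` step: the update function runs for every key held in state, whether or not the batch touched it. This is a sketch of the semantics only, not the Spark API; `update_state_by_key` and `running_count` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple

UpdateFunc = Callable[[List[int], Optional[int]], Optional[int]]

def update_state_by_key(
    state: Dict[str, int], batch: List[Tuple[str, int]], update: UpdateFunc
) -> Dict[str, int]:
    """Plain-Python model of one updateStateByKey step (not the Spark API itself)."""
    new_values: Dict[str, List[int]] = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)

    next_state: Dict[str, int] = {}
    # Every key is visited: those already in state and those in the batch.
    for key in state.keys() | new_values.keys():
        result = update(new_values.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from state
            next_state[key] = result
    return next_state

calls = []  # records each invocation of the update function

def running_count(values: List[int], previous: Optional[int]) -> Optional[int]:
    calls.append((tuple(values), previous))
    return sum(values) + (previous or 0)

state = {"spark": 3, "kafka": 1, "hdfs": 7}
state = update_state_by_key(state, [("spark", 1)], running_count)
```

The batch carries a single key, yet the update function is invoked three times, once per key in state. With a large state, this per-batch full pass is the cost the slide refers to.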
Challenge 3: Sharing state across clients (Applications)
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue)]
RDDs cached within executors cannot be shared across different client applications. For instance:
• App1 is a streaming app that stores counters in a DataFrame.
• App2 runs a SQL query and has no visibility into the state from App1.
Challenge 4: “Once and only once” in reality is difficult
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue, Cassandra)]
• “Once and only once”: e.g. update counters for each stream batch. A failure would result in the batch being resent, yet the counters should still reflect the correct state.
• This works well if the state is only managed inside Spark.
• Maintaining the correct state of counters in an external DB is left up to the user.
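One common way for the user to keep external counters correct when a batch is resent is to make the update idempotent by recording which batch ids have already been applied. The sketch below uses an in-memory stand-in for the external DB; `CounterStore` and its methods are hypothetical, and this is one possible technique rather than anything the talk prescribes.

```python
from typing import Dict, Iterable, Set

class CounterStore:
    """Stand-in for an external DB that applies each batch at most once."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.applied_batches: Set[int] = set()

    def apply_batch(self, batch_id: int, keys: Iterable[str]) -> bool:
        # In a real database the counter updates and the batch-id record
        # would have to be committed in one transaction.
        if batch_id in self.applied_batches:
            return False  # replayed batch: ignore it
        for key in keys:
            self.counters[key] = self.counters.get(key, 0) + 1
        self.applied_batches.add(batch_id)
        return True

store = CounterStore()
store.apply_batch(1, ["ad-1", "ad-2", "ad-1"])
store.apply_batch(2, ["ad-2"])
replayed = store.apply_batch(2, ["ad-2"])  # batch 2 is resent after a failure
```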
Challenge 5: Always ON – Not Just Fault tolerance
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue)]
• HA: if something fails, there is always a redundant copy that is fully in sync. Failover is instantaneous.
• Fault tolerance in Spark: recover state from the original source or a checkpoint by tracking lineage. This can take too long.
Challenge 6: Interactive queries with high concurrency too slow
[Diagram: Spark Streaming runtime architecture (driver, executors with RDD partitions, Kafka queue)]
• OLAP queries are CPU intensive and traditional methods are too slow
What is SnappyData?
- New Spark open source project started by Pivotal GemFire founders and engineers
- Decades of in-memory data management experience
- Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
SnappyData: A New Approach To Real Time Analytics
[Diagram labels: Streaming analytics; Probabilistic data; Distributed in-memory SQL; deep integration of Spark + Gem; integrate with MPP DB (deep scale, high volume)]
• Unified cluster, AlwaysOn, cloud ready, for real-time analytics
• Vision: drastically reduce the cost and complexity in modern big data, using a fraction of the resources
• 10X better response time, drop resource cost 10X, reduce complexity 10X
SnappyData Platform – Key Features
• Data can be row oriented (point updates to reference data)
• Or column oriented (compressed for high density storage)
• Supports high write rates, scalable
– Streaming data goes through stages
– Queue streams (rows), intermediate storage (rows), finally immutable compressed columns
• Leverages Spark Streaming for micro-batch streaming
– Not designed for ultra low latency
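The staging of streaming data from rows to immutable columns can be illustrated with a toy structure: rows accumulate in a mutable buffer, and each full buffer is pivoted into an immutable column batch. This is a hedged sketch of the general idea only; `StagedTable` is hypothetical, compression is left out, and nothing here reflects SnappyData's actual storage layer.

```python
from typing import Dict, List, Tuple

class StagedTable:
    """Rows accumulate in a mutable buffer; full buffers become immutable column batches."""

    def __init__(self, columns: Tuple[str, ...], rows_per_batch: int) -> None:
        self.columns = columns
        self.rows_per_batch = rows_per_batch
        self.row_buffer: List[Tuple] = []
        self.column_batches: List[Dict[str, Tuple]] = []

    def insert(self, row: Tuple) -> None:
        self.row_buffer.append(row)
        if len(self.row_buffer) >= self.rows_per_batch:
            self._flush()

    def _flush(self) -> None:
        # Pivot rows into one tuple per column; tuples keep the batch immutable.
        pivoted = zip(*self.row_buffer)
        self.column_batches.append(dict(zip(self.columns, (tuple(c) for c in pivoted))))
        self.row_buffer = []

table = StagedTable(columns=("url", "hits"), rows_per_batch=2)
for row in [("/a", 1), ("/b", 5), ("/c", 2)]:
    table.insert(row)
```

After three inserts, two rows have moved into a column batch and one row is still waiting in the row buffer.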
Key Features – Synopses Using Approximate Data
• Maintain exact data in columnar form (compressed)
• Maintain stratified samples
– Intelligent sampling to keep error bounds low
• Specialize for time series
– Decay accuracy over time → sub-linear growth
• Probabilistic data
– TopK for time series (using time aggregation CMS, item aggregation)
– Histograms, HLL, Bloom filters, wavelets
Key Differentiation – OLTP + OLAP with Synopsis
[Diagram labels: user applications processing events and issuing interactive queries; CQ subscriptions; OLAP query engine; micro-batch processing module (plugins); sliding window emits batches]
Summary DB
• Time series with decay
• TopK, frequency summary structures
• Counters
• Histograms
• Stratified samples
• Raw data windows
Exact DB (row + column oriented)
Solving The Complexity And Volume Challenge
• Far fewer resources: a TB problem becomes a GB problem
– CPU contention drops
• Far less complex
– Single cluster for stream ingestion, continuous queries, interactive queries and machine learning
• Much faster
– Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
Not a Panacea, But Comes Close
• Synopses require prior workload knowledge
• Not all queries … complex queries will result in high error rates
– Single cluster for stream ingestion and analytic queries (both streaming and interactive)
• Our strategy: be an adjunct to MPP databases…
– First compute the error estimate; if the error is above tolerance, delegate to the exact store
Adjunct Store In Certain Scenarios
We are hiring!
http://www.snappydata.io/blog/careers-fall2015
Go to snappydata.io/blog for more info.
Register @ http://www.snappydata.io/register for Beta
Again, we are hiring!
http://www.snappydata.io/blog/careers-fall2015
EXTRAS
Speed/Accuracy Trade-off
[Figure: error vs. execution time (sample size). Labeled points: interactive queries at 2 sec; time to execute on the entire dataset at 30 mins]
Credit: Barzan; Berkeley AMPLab
Stratified Sampling
● Random sampling has intuitive semantics
● But data is typically skewed and our queries are multi-dimensional
  ● e.g. average sales order price for each product class for each geography
  ● Some products may have little to no sales
● Stratification ensures that each “group” (product class) is represented
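A minimal sketch of stratification on batch data, in plain Python: rows are grouped by stratum and each group contributes up to a fixed number of rows, so a rare product class cannot vanish from the sample. The function name, the `product_class` field and the data are hypothetical. To estimate aggregates across strata, each sampled row would also need a weight (stratum size divided by sample size), which is omitted here.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence

def stratified_sample(
    rows: Sequence[dict],
    stratum_of: Callable[[dict], Hashable],
    per_stratum: int,
    seed: int = 0,
) -> Dict[Hashable, List[dict]]:
    """Keep up to `per_stratum` uniformly chosen rows from every stratum."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        groups[stratum_of(row)].append(row)
    return {
        stratum: rng.sample(members, min(per_stratum, len(members)))
        for stratum, members in groups.items()
    }

# Skewed data: 1000 orders for one product class, 2 for another.
orders = [{"product_class": "phones", "price": 500 + i % 7} for i in range(1000)]
orders += [
    {"product_class": "pianos", "price": 9000},
    {"product_class": "pianos", "price": 9500},
]
sample = stratified_sample(orders, lambda r: r["product_class"], per_stratum=50)
```

A uniform 5% sample of the same data would most likely contain no "pianos" rows at all; here both classes are always present.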
Stratified sampling challenges
● Solutions exist for batch data (BlinkDB) (partial solution)
● Our challenge is to get this working for infinite streams of time series data
● Answer: use a combination of stratified sampling with other techniques like Bernoulli/reservoir sampling
● Exponentially decay over time
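For unbounded streams, one way to combine stratification with reservoir sampling is to keep a fixed-size reservoir per stratum. The sketch below uses the classic Algorithm R; it is an assumption about how the two techniques could be combined, not SnappyData's implementation, and it leaves out the exponential time decay. `StratifiedReservoir` is a hypothetical name.

```python
import random
from typing import Dict, Hashable, List

class StratifiedReservoir:
    """One fixed-size reservoir (Algorithm R) per stratum, fed one item at a time."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.samples: Dict[Hashable, List[object]] = {}
        self.seen: Dict[Hashable, int] = {}

    def offer(self, stratum: Hashable, item: object) -> None:
        n = self.seen.get(stratum, 0) + 1
        self.seen[stratum] = n
        reservoir = self.samples.setdefault(stratum, [])
        if len(reservoir) < self.capacity:
            reservoir.append(item)
        else:
            # Keep the new item with probability capacity / n.
            slot = self.rng.randrange(n)
            if slot < self.capacity:
                reservoir[slot] = item

res = StratifiedReservoir(capacity=10)
for i in range(10_000):
    res.offer("common", i)
res.offer("rare", "only-one")
```

Memory stays bounded by `capacity` per stratum regardless of stream length, and the rare stratum keeps its single item.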
Dealing with Errors and Latency
● Well-known error techniques for “closed form aggregations”
● Exploring other techniques – Analytical Bootstrap (Barzan)
● User can specify an error bound with a confidence interval
● The engine first determines whether it can satisfy the error bound
● If not, it delegates execution to an “exact” store (GPDB, etc.)
● Query execution can also be latency bounded
  ● SELECT … FROM .. WHERE … WITHIN 2 SECONDS
SELECT avg(sessionTime) FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
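For a closed-form aggregate such as AVG, the check "can the sample satisfy the error bound?" can be sketched with a normal-approximation confidence interval. The sketch assumes that the error bound is a relative error (confidence interval half-width divided by the estimate); the slides do not say whether it is relative or absolute. It does not implement the Analytical Bootstrap, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple

def estimate_avg(sample: Sequence[float], confidence: float) -> Tuple[float, float]:
    """Return (estimate, relative half-width of the confidence interval) for AVG."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    estimate = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return estimate, half_width / abs(estimate)

def answer(sample: Sequence[float], max_error: float, confidence: float) -> str:
    """Use the sample if it meets the bound; otherwise signal delegation."""
    estimate, rel_error = estimate_avg(sample, confidence)
    if rel_error <= max_error:
        return f"approximate avg={estimate:.2f}"
    return "delegate to exact store"

tight = [10.0, 10.5, 9.5, 10.2, 9.8] * 20  # low variance, 100 rows
noisy = [1.0, 50.0, 3.0, 120.0]            # high variance, 4 rows

tight_answer = answer(tight, max_error=0.1, confidence=0.95)
noisy_answer = answer(noisy, max_error=0.1, confidence=0.95)
```

The low-variance sample is answered approximately, while the small, noisy sample exceeds the tolerance and is handed to the exact store. A sample whose mean is zero would need an absolute bound instead.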
Sketching Techniques
● Sampling is not effective for outlier detection
  ● MAX, MIN, etc.
● Other probabilistic structures like CMS, heavy hitters, etc.
● We implemented Hokusai
  ● Captures frequencies of items in time series
  ● Design permits TopK queries over arbitrary time intervals (Top100 popular URLs)
SELECT pageURL, count(*) frequency FROM Table
WHERE …. GROUP BY ….
ORDER BY frequency DESC
LIMIT 100
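The CMS (Count-Min Sketch) mentioned above can be sketched in a few lines. This is a plain Count-Min Sketch only; it does not include the time aggregation that Hokusai adds on top, and the width and depth values are arbitrary illustrative choices.

```python
import hashlib
from typing import List

class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _columns(self, item: str) -> List[int]:
        # One hash per row, derived by salting BLAKE2b with the row index.
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # The minimum across rows is the cell least inflated by collisions.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(width=256, depth=4)
for url, hits in [("/home", 500), ("/cart", 40), ("/about", 3)]:
    cms.add(url, hits)
```

Memory is fixed at `width * depth` counters no matter how many distinct URLs arrive; the price is that collisions can only push an estimate above the true count, never below it.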
DEMO
[Diagram: Zeppelin server; Zeppelin Spark interpreter (driver); three Spark executor JVMs, each with a row cache and columnar compressed storage]

Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 

Jags Ramnarayan's presentation

  • 1. SnappyData: Getting Spark ready for real time, operational analytics. Jags Ramnarayan, Oct 2015. snappydata.io
  • 2. My Talk today • Key attributes for modern real time stream processing and interactive analytics • What is so exciting to me about Spark? What are some of the myths? • What is missing in Spark for real time? • SnappyData’s mission – fuse Spark with in-memory data management in one unified cluster to offer OLTP + OLAP + Stream processing + Probabilistic data
  • 3. Stream processing – what is required? • Ingest in parallel from disparate sources (sockets, message bus, files, HDFS); the input cannot be throttled • Process in parallel: filter, transform, normalize • Apply rules and trigger alerts, actions – SQL? • State management with mutability • HA semantics for state – input streams must be HA (reliable enqueue with “once and only once” processing) – processing may depend on reference data that must be HA – generated state must be HA • Store raw, derived data into HDFS
  • 4. Stream Analytics – Applications • Real time scoring of analytic models • Incremental, online training of models • Recommendations and targeting • Personalized Ads • Detect patterns and anomalies in massive quantities of machine data • Stream Analytics requires working with lots of history, reference data
  • 5. Stream Processing Stream Analytics • Popular stream processors are parallel processing frameworks with very limited support for deep analytic operators • Deeper analytic class problems include: report topK popular URLs over the last hour or day, reporting every 5 seconds; correlate energy consumption pattern over last 10 mins to similar time periods in the past; maintain a prediction model in real time, e.g. a model for fraud detection • Stream Analytics poses the same optimization challenges as OLAP SQL: discovering trends, patterns and outliers may require incremental joins and aggregations on large quantities of historical or reference data
  • 6. Current Solutions Fall Short. Approaches shown: leave to the application developer; join with history, ref data in external DB; embed, integrate with in-memory row oriented store. Problems shown: slow, complex management of multiple products; complex, sub-optimal, dynamic rules difficult; most designed for OLTP, scans and aggregations too slow.
  • 7. The New Real time analytics operational DB. [Diagram: application streams and alerts; transform; data-in-motion analytics; in-memory DB for interactive queries, updates; deep scale, high volume MPP DB.] Streaming Analytics should be inside the DataStore. OLAP queries are CPU intensive and traditional methods are too slow.
  • 8. So, Why do we like Spark? • Blends Streaming, interactive, batch analytics into cohesive whole • Appeals to Java developers, R, Python folks • Succinct code – maybe the credit goes to Scala? • Rich set of transformations, and libraries (ML, Graph) • RDD and fault tolerance without replication • Stream processing with high throughput (pipeline of micro batches)
  • 9. Spark Myths • It is a distributed in-memory database – it is a computational framework with immutable caching • It is Highly Available – fault tolerance is not the same as HA • Well suited for real time, operational environments – e.g. hundreds of concurrent clients running interactive queries
  • 10. Spark Streaming Runtime Architecture. [Diagram: client submits stream app to the driver; two executors (Spark engine), each holding RDD partitions @t0, @t1, @t2 along a time axis; Kafka queue as input; Cassandra as external store.] Queue is buffered in executor. Driver submits batch job every second. This results in a new RDD pushed to stream (batch from buffer). Short term: immutable state. Long term: in external DB.
  • 11. Challenge 1: Spark Driver is NOT HA. [Architecture diagram as on slide 10.] YARN can restart the driver, but state is still an issue. Fault tolerance can be configured through write-ahead logging, checkpointing. Bigger problem is driver failure results in all executors shutting down… All cached state (could be TBs) will have to be recovered.
  • 12. Challenge 2: External state management. [Architecture diagram as on slide 10.] Join to reference data, join across streams may require remote access for each batch. If you cache data then you are working with stale reference data. Colocation of state is needed to maintain performance and not fall behind. newDStream = wordDstream.updateStateByKey[Int](newUpdateFunc, … ): the built-in capability to update state as batches arrive requires iteration of the full data set.
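To make the cost noted on slide 12 concrete, here is a minimal plain-Python sketch (not Spark code) of updateStateByKey-style semantics as the slide describes them: on every batch the update function runs for each key held in state, including keys the batch never mentions. The names `update_state_by_key` and `running_count` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def update_state_by_key(
    state: Dict[str, int],
    batch: Iterable[Tuple[str, int]],
    update_func: Callable[[List[int], Optional[int]], Optional[int]],
) -> Tuple[Dict[str, int], int]:
    """Apply update_func to every key in state or batch; return (new_state, calls)."""
    new_values: Dict[str, List[int]] = defaultdict(list)
    for key, value in batch:
        new_values[key].append(value)

    new_state: Dict[str, int] = {}
    calls = 0
    # Keys with no new data are still visited, so work grows with total state size.
    for key in set(state) | set(new_values):
        calls += 1
        updated = update_func(new_values.get(key, []), state.get(key))
        if updated is not None:  # returning None drops the key
            new_state[key] = updated
    return new_state, calls


def running_count(new_values: List[int], previous: Optional[int]) -> Optional[int]:
    """Add the batch's values to the previous count."""
    return sum(new_values) + (previous or 0)


state: Dict[str, int] = {}
state, calls_1 = update_state_by_key(state, [("a", 1), ("b", 1), ("a", 1)], running_count)
state, calls_2 = update_state_by_key(state, [("c", 1)], running_count)
```

The second batch carries a single record, yet the update function is still invoked once per key in state, which is the scaling concern the slide raises.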
  • 13. Challenge 3: Sharing state across clients (Applications). [Architecture diagram as on slide 10, without Cassandra.] RDDs cached within executors cannot be shared across different client applications. So, for instance, App1 is a streaming app that stores counters in a DataFrame. App2 runs a SQL query and has no visibility into the state from App1.
  • 14. Challenge 4: “Once and only once” in reality is difficult. [Architecture diagram as on slide 10.] “Once and only once” – e.g. update counters for each stream batch. Failures would result in the batch being resent. The counters should reflect correct state. Works well if the state is only managed inside Spark. Maintaining correct state of counters in an external DB is left up to the user.
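One common way to keep external counters correct when a batch can be resent is to record which batch ids have already been applied and skip repeats. The sketch below is an assumption-laden illustration, not something from the talk: `CounterStore` is a hypothetical in-memory stand-in for an external DB, and a real store would need to make the id check and the counter update a single atomic transaction.

```python
from typing import Dict, Set


class CounterStore:
    """In-memory stand-in for an external DB (hypothetical; not a real client API)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}
        self.applied_batches: Set[int] = set()

    def apply_batch(self, batch_id: int, increments: Dict[str, int]) -> bool:
        """Apply a batch's increments once; return False if it was already applied."""
        if batch_id in self.applied_batches:
            return False  # resent batch: skip so counters are not double counted
        # In a real store, the two updates below would need to be one transaction.
        for key, delta in increments.items():
            self.counters[key] = self.counters.get(key, 0) + delta
        self.applied_batches.add(batch_id)
        return True


store = CounterStore()
first = store.apply_batch(1, {"clicks": 3})
resent = store.apply_batch(1, {"clicks": 3})  # simulated resend after a failure
second = store.apply_batch(2, {"clicks": 2})
```

The resend of batch 1 is ignored, so the counter reflects each batch exactly once even though delivery was at-least-once.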
  • 15. Challenge 5: Always ON – Not Just Fault tolerance. [Architecture diagram as on slide 10, without Cassandra.] HA: If something fails, there is always a redundant copy that is fully in sync. Failover is instantaneous. Fault tolerance in Spark: Recover state from the original source or checkpoint by tracking lineage. Can take too long.
  • 16. Challenge 6: Interactive queries with high concurrency too slow. [Architecture diagram as on slide 10, without Cassandra.] OLAP queries are CPU intensive and traditional methods are too slow.
  • 17. What is Snappy Data? - New Spark open source project started by Pivotal GemFire founders + engineers - Decades of in-memory data management experience - Focus on real-time, operational analytics - Spark inside an OLTP+OLAP database
  • 18. SnappyData: A New Approach To Real Time Analytics. [Diagram: Streaming Analytics; Probabilistic data; Distributed In-Memory SQL; deep integration of Spark + Gem; unified cluster, AlwaysOn, cloud ready, for real time analytics; integrate with deep scale, high volume MPP DB.] Vision – Drastically reduce the cost and complexity in modern big data, using a fraction of the resources: 10X better response time, drop resource cost 10X, reduce complexity 10X.
  • 19. Snappy Data Platform – Key Features • Data can be row oriented (point update to reference data) • Or column oriented (compressed for high density storage) • Support high write rates, scalable – Streaming data goes through stages – queue streams (rows), intermediate storage (rows), finally immutable compressed columns • Leverage Spark Streaming for micro-batch streaming – Not designed for ultra low latency
  • 20. Key Features – Synopses Using Approximate Data • Maintain exact data in columnar form (compressed) • Maintain stratified samples – Intelligent sampling to keep error bounds low • Specialize for time series – Decay accuracy over time → sub-linear growth • Probabilistic data – TopK for time series (using time aggregation CMS, item aggregation) – Histograms, HLL, Bloom filters, Wavelets
  • 21. Key Differentiation – OLTP + OLAP with Synopsis. [Diagram: user applications processing events and issuing interactive queries; CQ subscriptions; OLAP query engine; micro batch processing module (plugins); sliding window emits batches. Summary DB: time series with decay; TopK, frequency summary structures; counters; histograms; stratified samples; raw data windows. Exact DB (row + column oriented).]
  • 22. Solving The Complexity And Volume Challenge • Far fewer resources: TB problem becomes GB – CPU contention drops • Far less complex – Single cluster for stream ingestion, continuous queries, interactive queries and machine learning • Much faster – Compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 23. Not Panacea, But Comes Close • Synopses require prior workload knowledge • Not all queries … complex queries will result in high error rates – Single cluster for stream ingestion and analytic queries (both streaming and interactive) • Our Strategy – be adjunct to MPP databases… – First compute the error estimate; if error is above tolerance, delegate to exact store
  • 24. Adjunct Store In Certain Scenarios
  • 25. We are hiring! http://www.snappydata.io/blog/careers-fall2015 Go to snappydata.io/blog for more info. Register @ http://www.snappydata.io/register for Beta
  • 26. Again, We are hiring! http://www.snappydata.io/blog/careers-fall2015 EXTRAS
  • 27. Speed/Accuracy Trade-off. [Chart: error versus execution time (sample size); interactive queries at about 2 sec, compared with 30 mins to execute on the entire dataset.] Credit: Barzan; Berkeley AMPLab
  • 28. Stratified Sampling ●Random Sampling has intuitive semantics ●But, data is typically skewed and our queries are multi-dimensional ●Avg sales order price for each product class for each geography ●Some products may have little to no sales ●Stratification ensures that each “group” (product class) is represented
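The point of slide 28 can be sketched in a few lines of plain Python: sample a fixed number of rows from every group so that rare groups still appear. This is only an illustration of the general idea, not SnappyData's implementation; `stratified_sample`, the row layout and the data are all made up for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (product_class, order_price)


def stratified_sample(rows: List[Row], per_group: int, seed: int = 0) -> List[Row]:
    """Sample up to per_group rows from every product class."""
    rng = random.Random(seed)
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    sample: List[Row] = []
    for key in sorted(groups):
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Skewed, made-up data: one popular class and one rare class.
rows: List[Row] = [("popular", 10.0 + i % 5) for i in range(10_000)]
rows += [("rare", 500.0), ("rare", 520.0)]

sample = stratified_sample(rows, per_group=50)
classes_in_sample = {product_class for product_class, _ in sample}
```

A uniform random sample of the same size would usually contain no "rare" rows at all here, so a per-class average could not be answered for that class.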
  • 29. Stratified sampling challenges ●Solutions exist for batch data (BlinkDB) (partial solution) ●Our challenge is to get this working for infinite streams of time series data ●Answer: Use a combination of stratified sampling with other techniques like Bernoulli/reservoir sampling ●Exponentially decay over time
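As a rough illustration of combining stratification with reservoir sampling on an unbounded stream, the sketch below keeps one fixed-size reservoir per stratum using the classic "Algorithm R". It is a hedged example under stated assumptions: the class name `StratifiedReservoir` is hypothetical, and the exponential time decay mentioned on the slide is deliberately left out.

```python
import random
from typing import Dict, List


class StratifiedReservoir:
    """Keeps a fixed-size uniform sample per stratum over an unbounded stream."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.samples: Dict[str, List[float]] = {}
        self.seen: Dict[str, int] = {}

    def add(self, stratum: str, value: float) -> None:
        reservoir = self.samples.setdefault(stratum, [])
        count = self.seen.get(stratum, 0) + 1
        self.seen[stratum] = count
        if len(reservoir) < self.capacity:
            reservoir.append(value)
            return
        # Algorithm R: keep the new item with probability capacity / count.
        slot = self.rng.randrange(count)
        if slot < self.capacity:
            reservoir[slot] = value


reservoir = StratifiedReservoir(capacity=10)
for i in range(1000):
    reservoir.add("web", float(i))
for value in (1.0, 2.0, 3.0):
    reservoir.add("batch", value)
```

Memory stays bounded at `capacity` values per stratum no matter how long the stream runs, while a sparse stratum such as "batch" keeps every value it has seen.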
  • 30. Dealing with Errors and Latency ●Well known error techniques for “closed form aggregations” ●Exploring other techniques – Analytical Bootstrap (Barzan) ●User can specify error bound with a confidence interval, e.g. SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0% ●Engine would determine if it can satisfy the error bound first ●If not, delegate execution to an “exact” store (GPDB, etc) ●Query execution can also be latency bounded, e.g. SELECT … FROM .. WHERE … WITHIN 2 SECONDS
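The "check the error bound first, otherwise delegate" flow on slide 30 can be sketched for an average using the textbook closed-form estimate: a normal-approximation interval of half-width z · s / √n around the sample mean. Everything specific here is an assumption for illustration: the function names are hypothetical, ERROR is read as a relative error, z = 1.96 stands in for 95% confidence, and no finite-population correction is applied.

```python
import math
import statistics
from typing import List, Tuple


def approximate_avg(sample: List[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width of the ~95% normal-approximation interval)."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * std_err


def answer_or_delegate(sample: List[float], max_relative_error: float) -> str:
    """Answer from the sample if the bound holds, otherwise delegate."""
    mean, half_width = approximate_avg(sample)
    if mean != 0 and half_width / abs(mean) <= max_relative_error:
        return "approximate"
    return "delegate-to-exact-store"


tight_sample = [100.0, 101.0, 99.0, 100.5, 99.5] * 20  # low variance, n = 100
wide_sample = [1.0, 1000.0, 5.0, 2000.0]               # high variance, n = 4

tight_decision = answer_or_delegate(tight_sample, max_relative_error=0.1)
wide_decision = answer_or_delegate(wide_sample, max_relative_error=0.1)
```

The low-variance sample satisfies the 10% bound and is answered approximately; the small, high-variance sample does not, so the query would be handed to the exact store.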
  • 31. Sketching Techniques ●Sampling not effective for outlier detection (MAX, MIN, etc) ●Other probabilistic structures like CMS, Heavy hitters, etc ●We implemented Hokusai: captures frequencies of items in time series ●Design permits TopK queries over arbitrary time intervals (Top100 popular URLs): SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
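For readers unfamiliar with the CMS mentioned on slide 31, here is a minimal count-min sketch in plain Python. It is a basic, single-interval sketch and not Hokusai, which layers time aggregation on top of structures like this; the class name, table sizes and URLs are illustrative assumptions.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # A different salt per row gives one independent-looking hash per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


true_counts = {"/home": 50, "/pricing": 20, "/careers": 5}
sketch = CountMinSketch()
for url, hits in true_counts.items():
    for _ in range(hits):
        sketch.add(url)

estimates = {url: sketch.estimate(url) for url in true_counts}
```

Ranking candidate URLs by `estimate` gives an approximate TopK in memory that depends on `width * depth` rather than on the number of distinct URLs.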
  • 32. DEMO. [Diagram: Zeppelin server; Zeppelin Spark interpreter (driver); three Spark executor JVMs, each with a row cache and a columnar compressed store.]