ThingsYou Didn'tKnowYou Could
Do in Spark
Version 0.1 | © Snappydata Inc 2017
Strategic Advisor @ Snappydata
Assistant Prof. @Univ. of Mich
Barzan Mozafari
CTO | Co-founder
@ Snappydata
Jags Ramnarayan
COO | Co-founder
@ Snappydata
Sudhir Menon
2
Current
Market
Trends
Analytic workloads are moving to the cloud
Scale-outinfrastructure->IncreasinglyExpensive
Excessivedatashuffling -> Longer responsetimes
Real-time operational analytics are on the rise
MixedWorkloadscombine streams,operationaldataandHistoricaldata
Real-timeMatters
Concurrency is increasing
Sharedinfrustrure(10’sof usersand apps,100’sof queries)
Exploratoryanalytics
BIandvisualizationtools
Concurrentqueries
Data volumes are growing rapidly
IndustrialIoT(Internetof Things)
Datagrowing fasterthanMoore’slaw
3
Our
Observations
Acheiving interactive analytics is hard
Analystswantresponsetime <2 seconds
Hardwith concurrentqueriesin the cluster
Hardonananalyst’slaptop
Hardonlargedatasets
Hardwhenthereare networkshuffles
After 99.9%- accurate early results, the final answer is no
longer needed
200xqueryspeedup
200xreductionof clusterload, networkshuffles,andI/O waits
Lambda architecture is extremely complex (and slow!)
Dealingwith multipledataformats, disparateproductAPIs
Wastedresources
Querying multiple remote data sources is slow
Curateddatafrom multiplesourcesisslow
Agenda
• Mixed workloads
Structured Streaming in Spark 2.0
State Management in Spark 2.0
• Interactive Analytics in Spark 2.0
• The SnappyData Solution
• Q&A
Stream
Processing
Mixed workloads are way too common
Transactions
(point lookups,
small updates)
Interactive
Analytics
Analytics on
mutating data
Maintaining state or
counters while processing
streams
Correlating and
joining streams w/
large histories
Mixed Workload in IoT Analytics
- Map sensors to tags
- Monitor temperature
- Send alerts
Correlate current
temperature trend with
history….
Anomaly detection –
score against models
Interact using
dynamic
queries
How Mixed Workloads are Supported Today?
Lambda Architecture
Scalable Store
HDFS,
MPP DBData-in-motion
Analytics
Application
Streams
Alerts
Enrich
Reference DB
Lamba Architecure is Complex
Enrich – requires
reference data
join
State – manage
windows with
mutating state
Analytics – Joins to
history, trend patterns
Model maintenance
Models
Transaction
writes
Interactive OLAP
queries
Updates
Scalable Store
Updates
HDFS,
MPP DBData-in-motion
Analytics
Application
Streams
Alerts
Enrich
Reference DB
Lamba Architecure is Complex
Storm, Spark
streaming,
Samza…
NoSQL – Cassandra,
Redis, Hbase ….
Models
Enterprise DB –
Oracle, Postgres..
OLAP SQL DB
Lamba Architecure is Complex
• Complexity: learn and master
multiple products  data
models, disparate APIs, configs
• Slower
• Wasted resources
Can we simplify & optimize?
How about a single clustered DB that can manage stream state,
transactional data and run OLAP queries?
RDB
Tables
Txn
Framework for stream
processing, etc
Stream processing
HDFS
MPP DB
Scalable writes, point reads, OLAP queries
Apps
One idea: Start from Apache Spark
Spark Core
Spark
Streaming
Spark ML
Spark SQL
GraphX
Stream program can
Leverage ML, SQL, Graph
Why?
• Unified programming across
streaming, interactive, batch
• Multiple data formats, data sources
• Rich library for Streaming, SQL,
Machine Learning, …
But: Spark is a computational
engine not a store or a database
Structured Streaming
in Spark 2.0
Source: Databricks presentation
Source: Databricks presentation
read...stream() creates a
streaming DataFrame, does not
start any of the computation
write...startStream() defines where
& how to output the data and
starts the processing
Source: Databricks presentation
New State Management API in Spark 2.0 – KV Store
• Preserve state across streaming aggregations across multiple
batches
• Fetch, store KV pairs
• Transactional
• Supports versioning (batch)
• Built in store that persists to HDFS
State Management
in Spark 2.0
Deeper Look – Ad Impression Analytics
Ad Network Architecture – Analyze log impressions in real time
Ad Impression Analytics
Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
Bottlenecks in the write path
Stream micro batches in
parallel from Kafka to each
Spark executor
Filter and collect event for 1 minute.
Reduce to 1 event per Publisher,
Geo every minute
Execute GROUP BY … Expensive
Spark Shuffle …
Shuffle again in DB cluster …
data format changes …
serialization costs
val input :DataFrame= sqlContext.read
.options(kafkaOptions).format("kafkaSource")
.stream()
val result :DataFrame = input
.where("geo != 'UnknownGeo'")
.groupBy(
window("event-time", "1min"),
Publisher", "geo")
.agg(avg("bid"), count("geo"),countDistinct("cookie"))
val query = result.write.format("org.apache.spark.sql.cassandra")
.outputMode("append")
.startStream("dest-path”)
Bottlenecks in the Write Path
• Aggregations – GroupBy, MapReduce
• Joins with other streams, Reference data
Shuffle Costs (Copying, Serialization) Excessive copying in
Java based Scale out stores
Avoid bottleneck - Localize process with State
Spark Executor
Spark Executor
Kafka
queue
KV
KV
Publisher: 1,3,5
Geo: 2,4,6
Avoid shuffle by localizing writes Input queue,
Stream, KV
Store all share
the same
partitioning
strategy
Publisher: 2,4,6
Geo: 1,3,5
Publisher: 1,3,5
Geo: 2,4,6
Publisher: 2,4,6
Geo: 1,3,5
Spark 2.0
StateStore?
Impedance mismatch with KV stores
We want to run interactive “scan”-intensive queries:
- Find total uniques for Ads grouped on geography
- Impression trends for Advertisers(group by query)
Two alternatives: row-stores vs. column stores
Goal – Localize processingFast key-based lookup
But, too slow to run
aggregations, scan based
interactive queries
Fast scans, aggregations
Updates and random
writes are very difficult
Consume too
much memory
Why columnar storage in-memory?
Interactive Analytics
in Spark 2.0
But, are in-memory column-stores fast enough?
SELECT
SUBSTR(sourceIP, 1, X),
SUM(adRevenue)
FROM uservisits
GROUP BY SUBSTR(sourceIP, 1, X)
Berkeley AMPLab Big Data Benchmark
-- AWS m2.4xlarge ; total of 342 GB
Why distributed, in-memory and columnar
formats not enough for interactive speeds?
1. Distributed shuffles can take minutes
2. Software overheads, e.g., (de)serialization
3. Concurrent queries competing for the same resources
So, how can we ensure interactive-speed
analytics?
• Most apps happy to tradeoff 1% accuracy
for 200x speedup!
• Can usually get a 99.9% accurate answer by only
looking at a tiny fraction of data!
• Often can make perfectly accurate
decisions without having perfectly
accurate answers!
• A/B Testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn’t necessarily mean
exact answers!
• Inference is probabilistic anyway
Use statistical techniques to shrink data!
Our Solution:
SnappyData
SnappyData: A New Approach
A Single Unified Cluster: OLTP + OLAP + Streaming
for real-time analytics
Batch design, high throughput
Real-time design
Low latency, HA,
concurrency
Vision: Drastically reduce the cost and
complexity in modern big data
Rapidly Maturing Matured over 13 years
SnappyData: A New Approach
Single unified HA cluster: OLTP + OLAP +
Stream for real-time analytics
Batch design, high throughput
Real-time operational Analytics – TBs in memory
RDB
Rows
Txn
Columnar
API
Stream processing
ODBC,
JDBC, REST
Spark -
Scala, Java,
Python, R
HDFS
AQP
First commercial product with
Approximate Query Processing (AQP)
MPP DB
Index
SnappyData Data Model
Rows
Columnar
Stream processing
ODBC
JDBCProbabilistic
data
Snappy Data Server – Spark Executor + Store
Index
Spark
Program
SQL
Parser
RDD Table
Hybrid Storage
RDD
Cluster
Mgr
Scheduler
High Latency/OLAP
Low Latency
Add/remove Servers
Catalog
Features
- Deeply integrated database for Spark
- 100% compatible with Spark
- Extensions for Transactions (updates), SQL stream processing
- Extensions for High Availability
- Approximate query processing for interactive OLAP
- OLTP+OLAP Store
- Replicated and partitioned tables
- Tables can be Row or Column oriented (in-memory & on-disk)
- SQL extensions for compatibility with SQL Standard
- create table, view, indexes, constraints, etc
Time Series Analytics (kparc.com)
http://kparc.com/q4/readme.txt Trade and Quote Analytics
Time Series Analytics – kparc.com
http://kparc.com/q4/readme.txt
- Find last quoted price for Top 100 instruments for given date
"select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where
date='$d' group by quote.sym”
// Replaced with AVG as MemSQL doesn’t support LAST
- Max price for Top 100 Instruments across Exchanges for given date
"select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym)
where date='$d' group by trade.sym, ex”
- Find Average size of Trade in each hour for top 100 Instruments for given
date
"select trade.sym, hour(time), avg(size) from trade join S on (trade.sym =
S.sym) where date='$d' group by trade.sym, hour(time)"
Results for Small day (40million), Big day(640 million)
0
500
1000
1500
2000
2500
3000
3500
Q1 Q2 Q3
SnappyData 224 125 124
MemSQL 556 1884 3210
AxisTitle
Query Response time in
Millis
(Smaller is better)
0
10000
20000
30000
40000
50000
60000
SnappyData MemSQL Impala Greenplum
Q1 1769 8652 45000 56000
Q2 432 26034 2250 900
Q3 494 49614 1880 1000AxisTitle
Query Response time in Millis
(Smaller is better)
TPC-H: 10X-20X faster than Spark 2.0
Synopses Data Engine
Use Case Patterns
1. Analytics cache
- Caching for Analytics over any “big data” store (esp. MPP)
- Federate query between samples and backend’
2. Stream analytics for Spark
Process streams, transform, real-time scoring, store, query
3. In-memory transactional store
Highly concurrent apps, SQL cache, OLTP + OLAP
Interactive-Speed Analytic Queries – Exact or Approximate
Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1
Speed/Accuracy tradeoffError(%)
30 mins
Time to Execute on
Entire Dataset
Interactive
Queries
2 sec
Execution Time 42
100 secs
2 secs
1% Error
Approximate Query Processing Features
• Support for uniform sampling
• Support for stratified sampling
- Solutions exist for stored data (BlinkDB)
- SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses
- Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)
Uniform (Random) Sampling
ID Advertiser Geo Bid
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Uniform Sample
ID Advertiser Geo Bid Sampling
Rate
3 adv20 NY 0.0002 1/3
5 adv20 NY 0.0001 1/3
SELECT avg(bid)
FROM AdImpresssions
WHERE geo = ‘VT’
Original Table
Uniform (Random) Sampling
ID Advertiser Geo Bid
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Uniform Sample
ID Advertiser Geo Bid Sampling
Rate
3 adv20 NY 0.0002 2/3
5 adv20 NY 0.0001 2/3
1 adv10 NY 0.0001 2/3
2 adv10 VT 0.0005 2/3
SELECT avg(bid)
FROM AdImpresssions
WHERE geo = ‘VT’
Original Table Larger
Stratified Sampling
ID Advertiser Geo Bid
1 adv10 NY 0.0001
2 adv10 VT 0.0005
3 adv20 NY 0.0002
4 adv10 NY 0.0003
5 adv20 NY 0.0001
6 adv30 VT 0.0001
Stratified Sample on Geo
ID Advertiser Geo Bid Sampling
Rate
3 adv20 NY 0.0002 1/4
2 adv10 VT 0.0005 1/2
SELECT avg(bid)
FROM AdImpresssions
WHERE geo = ‘VT’
Original Table
Sketching techniques
● Sampling not effective for outlier detection
○ MAX/MIN etc
● Other probabilistic structures like CMS, heavy hitters, etc
● SnappyData implements Hokusai
○ Capturing item frequencies in timeseries
● Design permits TopK queries over arbitrary trim intervals
(Top100 popular URLs)
SELECT pageURL, count(*) frequency FROM Table
WHERE …. GROUP BY ….
ORDER BY frequency DESC
LIMIT 100
Airline Analytics Demo
Zeppelin
Spark
Interpreter
(Driver)
Zeppelin
Server
Row cache
Columnar
compressed
Spark Executor JVM
Row cache
Columnar
compressed
Spark Executor JVM
Row cache
Columnar
compressed
Spark Executor JVM
Free Cloud trial service – Project iSight
● Free AWS/Azure credits for anyone to try out SnappyData
● One click launch of private SnappyData cluster with Zeppelin
● Multiple notebooks with comprehensive description of concepts and
value
● Bring your own data sets to try ‘Immediate visualization’ using
Synopses data
http://www.snappydata.io/cloudbuilder --- URL for AWS Service
Send email to chomp@snappydata.io to be notified. Release expected next week
How SnappyData Extends
Spark
Snappy Spark Cluster Deployment topologies
• Snappy store and Spark
Executor share the JVM
memory
• Reference based access –
zero copy
• SnappyStore is isolated but
use the same COLUMN
FORMAT AS SPARK for high
throughput
Unified Cluster
Split Cluster
Simple API – Spark Compatible
● Access Table as DataFrame
Catalog is automatically recovered
● Store RDD[T]/DataFrame can be
stored in SnappyData tables
● Access from Remote SQL clients
● Addtional API for updates,
inserts, deletes
//Save a dataFrame using the Snappy or spark context …
context.createExternalTable(”T1", "ROW", myDataFrame.schema,
props );
//save using DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(pro
ps).saveAsTable(”T1");
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of DataFrame APIs … >
Extends Spark
CREATE [Temporary] TABLE [IF NOT EXISTS] table_name
(
<column definition>
) USING ‘JDBC | ROW | COLUMN ’
OPTIONS (
COLOCATE_WITH 'table_name', // Default none
PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default
REDUNDANCY '1' , // Manage HA
PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
// Empty string will map to default disk store.
OFFHEAP "true | false"
EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
…..
[AS select_statement];
Simple to Ingest Streams using SQL
Consume from stream
Transform raw data
Continuous Analytics
Ingest into in-memory Store
Overflow table to HDFS
Create stream table AdImpressionLog
(<Columns>) using directkafka_stream options (
<socket endpoints>
"topics 'adnetwork-topic’ “,
"rowConverter ’ AdImpressionLogAvroDecoder’ )
streamingContext.registerCQ(
"select publisher, geo, avg(bid) as avg_bid, count(*) imps,
count(distinct(cookie)) uniques from AdImpressionLog
window (duration '2' seconds, slide '2' seconds)
where geo != 'unknown' group by publisher, geo”)// Register CQ
.foreachDataFrame(df => {
df.write.format("column").mode(SaveMode.Appen
d)
.saveAsTable("adImpressions")
AdImpression Demo
Spark, SQL Code Walkthrough, interactive SQL
AdImpression Ingest Performance
Loading from parquet files using 4 cores Stream ingest
(Kafka + Spark streaming to store using 1 core)
Unified Cluster Architecture
How do we extend Spark for Real Time?
• Spark Executors are long
running. Driver failure
doesn’t shutdown
Executors
• Driver HA – Drivers run
“Managed” with standby
secondary
• Data HA – Consensus based
clustering integrated for
eager replication
How do we extend Spark for Real Time?
• By pass scheduler for low
latency SQL
• Deep integration with
Spark Catalyst(SQL) –
collocation optimizations,
indexing use, etc
• Full SQL support –
Persistent Catalog,
Transaction, DML
Unified OLAP/OLTP streaming w/ Spark
● Far fewer resources: TB problem becomes GB.
○ CPU contention drops
● Far less complex
○ single cluster for stream ingestion, continuous queries, interactive
queries and machine learning
● Much faster
○ compressed data managed in distributed memory in columnar
form reduces volume and is much more responsive
www.snappydata.io
SnappyData is Open Source
● Ad Analytics example/benchmark -
https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more www.snappydata.io/blog
● Connect:
○ twitter: www.twitter.com/snappydata
○ facebook: www.facebook.com/snappydata
○ slack: http://snappydata-slackin.herokuapp.com

Thing you didn't know you could do in Spark

  • 1.
    ThingsYou Didn'tKnowYou Could Doin Spark Version 0.1 | © Snappydata Inc 2017 Strategic Advisor @ Snappydata Assistant Prof. @Univ. of Mich Barzan Mozafari CTO | Co-founder @ Snappydata Jags Ramnarayan COO | Co-founder @ Snappydata Sudhir Menon
  • 2.
    2 Current Market Trends Analytic workloads aremoving to the cloud Scale-outinfrastructure->IncreasinglyExpensive Excessivedatashuffling -> Longer responsetimes Real-time operational analytics are on the rise MixedWorkloadscombine streams,operationaldataandHistoricaldata Real-timeMatters Concurrency is increasing Sharedinfrustrure(10’sof usersand apps,100’sof queries) Exploratoryanalytics BIandvisualizationtools Concurrentqueries Data volumes are growing rapidly IndustrialIoT(Internetof Things) Datagrowing fasterthanMoore’slaw
  • 3.
    3 Our Observations Acheiving interactive analyticsis hard Analystswantresponsetime <2 seconds Hardwith concurrentqueriesin the cluster Hardonananalyst’slaptop Hardonlargedatasets Hardwhenthereare networkshuffles After 99.9%- accurate early results, the final answer is no longer needed 200xqueryspeedup 200xreductionof clusterload, networkshuffles,andI/O waits Lambda architecture is extremely complex (and slow!) Dealingwith multipledataformats, disparateproductAPIs Wastedresources Querying multiple remote data sources is slow Curateddatafrom multiplesourcesisslow
  • 4.
    Agenda • Mixed workloads StructuredStreaming in Spark 2.0 State Management in Spark 2.0 • Interactive Analytics in Spark 2.0 • The SnappyData Solution • Q&A
  • 5.
    Stream Processing Mixed workloads areway too common Transactions (point lookups, small updates) Interactive Analytics Analytics on mutating data Maintaining state or counters while processing streams Correlating and joining streams w/ large histories
  • 6.
    Mixed Workload inIoT Analytics - Map sensors to tags - Monitor temperature - Send alerts Correlate current temperature trend with history…. Anomaly detection – score against models Interact using dynamic queries
  • 7.
    How Mixed Workloadsare Supported Today? Lambda Architecture
  • 8.
    Scalable Store HDFS, MPP DBData-in-motion Analytics Application Streams Alerts Enrich ReferenceDB Lamba Architecure is Complex Enrich – requires reference data join State – manage windows with mutating state Analytics – Joins to history, trend patterns Model maintenance Models Transaction writes Interactive OLAP queries Updates
  • 9.
    Scalable Store Updates HDFS, MPP DBData-in-motion Analytics Application Streams Alerts Enrich ReferenceDB Lamba Architecure is Complex Storm, Spark streaming, Samza… NoSQL – Cassandra, Redis, Hbase …. Models Enterprise DB – Oracle, Postgres.. OLAP SQL DB
  • 10.
    Lamba Architecure isComplex • Complexity: learn and master multiple products  data models, disparate APIs, configs • Slower • Wasted resources
  • 11.
    Can we simplify& optimize? How about a single clustered DB that can manage stream state, transactional data and run OLAP queries? RDB Tables Txn Framework for stream processing, etc Stream processing HDFS MPP DB Scalable writes, point reads, OLAP queries Apps
  • 12.
    One idea: Startfrom Apache Spark Spark Core Spark Streaming Spark ML Spark SQL GraphX Stream program can Leverage ML, SQL, Graph Why? • Unified programming across streaming, interactive, batch • Multiple data formats, data sources • Rich library for Streaming, SQL, Machine Learning, … But: Spark is a computational engine not a store or a database
  • 13.
  • 14.
  • 15.
    Source: Databricks presentation read...stream()creates a streaming DataFrame, does not start any of the computation write...startStream() defines where & how to output the data and starts the processing
  • 16.
  • 17.
    New State ManagementAPI in Spark 2.0 – KV Store • Preserve state across streaming aggregations across multiple batches • Fetch, store KV pairs • Transactional • Supports versioning (batch) • Built in store that persists to HDFS
  • 18.
  • 19.
    Deeper Look –Ad Impression Analytics Ad Network Architecture – Analyze log impressions in real time
  • 20.
    Ad Impression Analytics Ref- https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/
  • 21.
    Bottlenecks in thewrite path Stream micro batches in parallel from Kafka to each Spark executor Filter and collect event for 1 minute. Reduce to 1 event per Publisher, Geo every minute Execute GROUP BY … Expensive Spark Shuffle … Shuffle again in DB cluster … data format changes … serialization costs val input :DataFrame= sqlContext.read .options(kafkaOptions).format("kafkaSource") .stream() val result :DataFrame = input .where("geo != 'UnknownGeo'") .groupBy( window("event-time", "1min"), Publisher", "geo") .agg(avg("bid"), count("geo"),countDistinct("cookie")) val query = result.write.format("org.apache.spark.sql.cassandra") .outputMode("append") .startStream("dest-path”)
  • 22.
    Bottlenecks in theWrite Path • Aggregations – GroupBy, MapReduce • Joins with other streams, Reference data Shuffle Costs (Copying, Serialization) Excessive copying in Java based Scale out stores
  • 23.
    Avoid bottleneck -Localize process with State Spark Executor Spark Executor Kafka queue KV KV Publisher: 1,3,5 Geo: 2,4,6 Avoid shuffle by localizing writes Input queue, Stream, KV Store all share the same partitioning strategy Publisher: 2,4,6 Geo: 1,3,5 Publisher: 1,3,5 Geo: 2,4,6 Publisher: 2,4,6 Geo: 1,3,5 Spark 2.0 StateStore?
  • 24.
    Impedance mismatch withKV stores We want to run interactive “scan”-intensive queries: - Find total uniques for Ads grouped on geography - Impression trends for Advertisers(group by query) Two alternatives: row-stores vs. column stores
  • 25.
    Goal – LocalizeprocessingFast key-based lookup But, too slow to run aggregations, scan based interactive queries Fast scans, aggregations Updates and random writes are very difficult Consume too much memory
  • 26.
  • 27.
  • 28.
    But, are in-memorycolumn-stores fast enough? SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
  • 29.
    Why distributed, in-memoryand columnar formats not enough for interactive speeds? 1. Distributed shuffles can take minutes 2. Software overheads, e.g., (de)serialization 3. Concurrent queries competing for the same resources So, how can we ensure interactive-speed analytics?
  • 30.
    • Most appshappy to tradeoff 1% accuracy for 200x speedup! • Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data! • Often can make perfectly accurate decisions without having perfectly accurate answers! • A/B Testing, visualization, ... • The data itself is usually noisy • Processing entire data doesn’t necessarily mean exact answers! • Inference is probabilistic anyway Use statistical techniques to shrink data!
  • 31.
  • 32.
    SnappyData: A NewApproach A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics Batch design, high throughput Real-time design Low latency, HA, concurrency Vision: Drastically reduce the cost and complexity in modern big data Rapidly Maturing Matured over 13 years
  • 33.
    SnappyData: A NewApproach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time operational Analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial product with Approximate Query Processing (AQP) MPP DB Index
  • 34.
    SnappyData Data Model Rows Columnar Streamprocessing ODBC JDBCProbabilistic data Snappy Data Server – Spark Executor + Store Index Spark Program SQL Parser RDD Table Hybrid Storage RDD Cluster Mgr Scheduler High Latency/OLAP Low Latency Add/remove Servers Catalog
  • 35.
    Features - Deeply integrateddatabase for Spark - 100% compatible with Spark - Extensions for Transactions (updates), SQL stream processing - Extensions for High Availability - Approximate query processing for interactive OLAP - OLTP+OLAP Store - Replicated and partitioned tables - Tables can be Row or Column oriented (in-memory & on-disk) - SQL extensions for compatibility with SQL Standard - create table, view, indexes, constraints, etc
  • 36.
    Time Series Analytics(kparc.com) http://kparc.com/q4/readme.txt Trade and Quote Analytics
  • 37.
    Time Series Analytics– kparc.com http://kparc.com/q4/readme.txt - Find last quoted price for Top 100 instruments for given date "select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where date='$d' group by quote.sym” // Replaced with AVG as MemSQL doesn’t support LAST - Max price for Top 100 Instruments across Exchanges for given date "select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, ex” - Find Average size of Trade in each hour for top 100 Instruments for given date "select trade.sym, hour(time), avg(size) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, hour(time)"
  • 38.
    Results for Smallday (40million), Big day(640 million) 0 500 1000 1500 2000 2500 3000 3500 Q1 Q2 Q3 SnappyData 224 125 124 MemSQL 556 1884 3210 AxisTitle Query Response time in Millis (Smaller is better) 0 10000 20000 30000 40000 50000 60000 SnappyData MemSQL Impala Greenplum Q1 1769 8652 45000 56000 Q2 432 26034 2250 900 Q3 494 49614 1880 1000AxisTitle Query Response time in Millis (Smaller is better)
  • 39.
    TPC-H: 10X-20X fasterthan Spark 2.0
  • 40.
  • 41.
    Use Case Patterns 1.Analytics cache - Caching for Analytics over any “big data” store (esp. MPP) - Federate query between samples and backend’ 2. Stream analytics for Spark Process streams, transform, real-time scoring, store, query 3. In-memory transactional store Highly concurrent apps, SQL cache, OLTP + OLAP
  • 42.
    Interactive-Speed Analytic Queries– Exact or Approximate Select avg(Bid), Advertiser from T1 group by Advertiser Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1 Speed/Accuracy tradeoffError(%) 30 mins Time to Execute on Entire Dataset Interactive Queries 2 sec Execution Time 42 100 secs 2 secs 1% Error
  • 43.
    Approximate Query ProcessingFeatures • Support for uniform sampling • Support for stratified sampling - Solutions exist for stored data (BlinkDB) - SnappyData works for infinite streams of data too • Support for exponentially decaying windows over time • Support for synopses - Top-K queries, heavy hitters, outliers, ... • [future] Support for joins • Workload mining (http://CliffGuard.org)
  • 44.
    Uniform (Random) Sampling IDAdvertiser Geo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Uniform Sample ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 1/3 5 adv20 NY 0.0001 1/3 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table
  • 45.
    Uniform (Random) Sampling IDAdvertiser Geo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Uniform Sample ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 2/3 5 adv20 NY 0.0001 2/3 1 adv10 NY 0.0001 2/3 2 adv10 VT 0.0005 2/3 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table Larger
  • 46.
    Stratified Sampling ID AdvertiserGeo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Stratified Sample on Geo ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 1/4 2 adv10 VT 0.0005 1/2 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table
  • 47.
    Sketching techniques ● Samplingnot effective for outlier detection ○ MAX/MIN etc ● Other probabilistic structures like CMS, heavy hitters, etc ● SnappyData implements Hokusai ○ Capturing item frequencies in timeseries ● Design permits TopK queries over arbitrary trim intervals (Top100 popular URLs) SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
  • 48.
    Airline Analytics Demo Zeppelin Spark Interpreter (Driver) Zeppelin Server Rowcache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM
  • 49.
    Free Cloud trialservice – Project iSight ● Free AWS/Azure credits for anyone to try out SnappyData ● One click launch of private SnappyData cluster with Zeppelin ● Multiple notebooks with comprehensive description of concepts and value ● Bring your own data sets to try ‘Immediate visualization’ using Synopses data http://www.snappydata.io/cloudbuilder --- URL for AWS Service Send email to chomp@snappydata.io to be notified. Release expected next week
  • 50.
  • 51.
    Snappy Spark ClusterDeployment topologies • Snappy store and Spark Executor share the JVM memory • Reference based access – zero copy • SnappyStore is isolated but use the same COLUMN FORMAT AS SPARK for high throughput Unified Cluster Split Cluster
  • 52.
    Simple API –Spark Compatible ● Access Table as DataFrame Catalog is automatically recovered ● Store RDD[T]/DataFrame can be stored in SnappyData tables ● Access from Remote SQL clients ● Addtional API for updates, inserts, deletes //Save a dataFrame using the Snappy or spark context … context.createExternalTable(”T1", "ROW", myDataFrame.schema, props ); //save using DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(pro ps).saveAsTable(”T1"); val impressionLogs: DataFrame = context.table(colTable) val campaignRef: DataFrame = context.table(rowTable) val parquetData: DataFrame = context.table(parquetTable) <… Now use any of DataFrame APIs … >
  • 53.
    Extends Spark CREATE [Temporary]TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING ‘JDBC | ROW | COLUMN ’ OPTIONS ( COLOCATE_WITH 'table_name', // Default none PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default REDUNDANCY '1' , // Manage HA PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // Empty string will map to default disk store. OFFHEAP "true | false" EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. [AS select_statement];
  • 54.
    Simple to IngestStreams using SQL Consume from stream Transform raw data Continuous Analytics Ingest into in-memory Store Overflow table to HDFS Create stream table AdImpressionLog (<Columns>) using directkafka_stream options ( <socket endpoints> "topics 'adnetwork-topic’ “, "rowConverter ’ AdImpressionLogAvroDecoder’ ) streamingContext.registerCQ( "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques from AdImpressionLog window (duration '2' seconds, slide '2' seconds) where geo != 'unknown' group by publisher, geo”)// Register CQ .foreachDataFrame(df => { df.write.format("column").mode(SaveMode.Appen d) .saveAsTable("adImpressions")
  • 55.
    AdImpression Demo Spark, SQLCode Walkthrough, interactive SQL
  • 56.
    AdImpression Ingest Performance Loadingfrom parquet files using 4 cores Stream ingest (Kafka + Spark streaming to store using 1 core)
  • 57.
  • 58.
    How do weextend Spark for Real Time? • Spark Executors are long running. Driver failure doesn’t shutdown Executors • Driver HA – Drivers run “Managed” with standby secondary • Data HA – Consensus based clustering integrated for eager replication
  • 59.
    How do weextend Spark for Real Time? • By pass scheduler for low latency SQL • Deep integration with Spark Catalyst(SQL) – collocation optimizations, indexing use, etc • Full SQL support – Persistent Catalog, Transaction, DML
  • 60.
    Unified OLAP/OLTP streamingw/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  • 61.
    www.snappydata.io SnappyData is OpenSource ● Ad Analytics example/benchmark - https://github.com/SnappyDataInc/snappy-poc ● https://github.com/SnappyDataInc/snappydata ● Learn more www.snappydata.io/blog ● Connect: ○ twitter: www.twitter.com/snappydata ○ facebook: www.facebook.com/snappydata ○ slack: http://snappydata-slackin.herokuapp.com

Editor's Notes

  • #3  Algorithmic trading Fraud detection Systems monitoring Application performance monitoring Customer Relationship Management Demand sensing Dynamic pricing and yield management Data validation Operational intelligence and risk management Payments & cash monitoring Data security monitoring Supply chain optimization RFID/sensor network data analysis Workstreaming Call center optimization Enterprise Mashups and Mashup Dashboards Transportation industry
  • #9 Modern: Enrich (requires reference data join), State (maintain larger windows with mutating state), analytics (Joins to history, changes in trends), ML (incrementally maintain models, score), transactional writes to data sources (e.g. join with meta data like Turbine parts and expected temperature ranges for the various sensors ).
  • #10 Modern: Enrich (requires reference data join), State (maintain larger windows with mutating state), analytics (Joins to history, changes in trends), ML (incrementally maintain models, score), transactional writes to data sources (e.g. join with meta data like Turbine parts and expected temperature ranges for the various sensors ).
  • #11 Modern: Enrich (requires reference data join), State (maintain larger windows with mutating state), analytics (Joins to history, changes in trends), ML (incrementally maintain models, score), transactional writes to data sources (e.g. join with meta data like Turbine parts and expected temperature ranges for the various sensors ).
  • #13 event at a time too slow/too expensive for high velocity …. Lots of APIs for transformations and working with disparate data sources and formats
  • #24 And, of course, the whole point behind colocation is to linear scale with minimal or even no shuffling. So, for instance, when using Kafka, all three components - Kafka, native RDD in Spark and Table in Snappy can all share the same partitioning strategy. As an example, in our telco case, all records associated with a subscriber can be colocated onto the same node - the queue, spark procesing of partitions and related reference data in Snappy store.
  • #53 There is a reciprocal relationship with Spark RDDs/DataFrames. any table is visible as a DataFrame and vice versa. Hence, all the spark APIs, tranformations can also be applied to snappy managed tables. For instance, you can use the DataFrame data source API to save any arbitrary DataFrame into a snappy table like shown in the example. One cool aspect of Spark is its ability to take an RDD of objects (say with nested structure) and implicitly infer its schema. i.e. turn into into a DataFrame and store it.
  • #54 The SQL dialect will be Spark SQL ++. i.e. we are extending SQL to be much more compliant with standard SQL. A number of the extensions that dictate things like HA, disk persistence, etc are all specified through OPTIONS in spark SQL.
  • #55 CREATE HDFSSTORE streamingstore NameNode 'hdfs://gfxd1:8020' HomeDir 'stream-tables' BatchSize 10 BatchTimeInterval 2000 milliseconds QueuePersistent true MaxWriteOnlyFileSize 200 WriteOnlyFileRolloverInterval 1 minute;
  • #59 Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration
  • #60 Manage data(mutable) in spark executors (store memory mgr works with Block mgr) Make executors long lived Which means, spark drivers run de-coupled .. they can fail. - managed Drivers - Selective scheduling - Deeply integrate with query engine for optimizations - Full SQL support: including transactions, DML, catalog integration