SnappyData is a new open source project started by Pivotal GemFire founders to provide a unified platform for OLTP, OLAP and streaming analytics using Spark. It aims to simplify big data architectures by supporting mixed workloads in a single clustered database, allowing for real-time operational analytics on live data without copying to other systems. This provides faster insights than current approaches that require periodic data copying between different databases and analytics systems.
2.
Who Are We?
• New Spark-based open source project started by Pivotal GemFire founders + engineers
• Decades of in-memory data management experience
• Focus on real-time, operational analytics: Spark-based OLTP+OLAP database
• SnappyData: spinout funded by Pivotal, GE, GTD Capital
3. www.snappydata.io
The Big Data Market Is Facing Disruption (Again)
• Higher data volumes
• Growth in streaming workloads
• Analytics on live data
• Growth in unstructured data
• Machine learning/AI as first-class workloads
Need to reduce complexity and cloud costs
5.
Mixed Workloads Are Everywhere
• Stream processing
• Transaction (point lookups, small updates)
• Interactive analytics
• Analytics on mutating data
• Correlating and joining streams with large histories
• Maintaining state or counters while ingesting streams
6.
Time Value of Information – Why does it matter
• Elapsed time from event occurrence to event analytics matters
• Latency in using information for learning matters
• Concurrency matters
• Recovery time matters
• User-kernel crossings matter
In short, liveness of data matters when it comes to making decisions based on current information
7.
What Is Happening With Applications
• Applications that are intelligent, proactive, learn from past interactions, and are context aware in their decision making
• Fast and reliable ingestion capabilities
• Support high memory density
• Utilize memory to reduce response time
• Support high concurrency
• Work on live data
• Support data mutability
8.
Let's Discuss Some Use Cases
• Market Surveillance Systems (Trading exchanges, Market makers)
• Real Time Scoring Systems (Product recommendations, real time offers)
• Telco Analytics (Location based services, Predictive analytics)
• Sensor Analytics (Real time alerting for parking management, lighting etc.)
• Ad analytics + Ad placement systems
• Credit Card Fraud
• Detecting and Stopping Malware
9.
Mixed Workloads in Industrial IOT
[Diagram labels: IoT Devices; Event Stream]
• Anomaly detection: score against models
- Map sensors to tags
- Monitor temperature
- Send alerts
• Correlate current temperature trend with history
• Interact using dynamic queries
10.
Pain points we come across today
• Is Spark ready for real time? Enterprise?
• Lacks mutable state management
• Not designed for high concurrency and mixed workloads
• Inadequate SQL support; limited ODBC/JDBC access
• Fault tolerant, not HA
• Near impossible to do live analytics on NoSQL stores
• Pattern today: periodic copy of state into some analytic DB/Hadoop
• Stale insight, not continuous or real time
• Interactive dynamic aggregations not possible
• Data models make support for BI tools like Tableau difficult
• Most stream processors not capable of true analytics
• Deep stream analytics augmenting stream processing
11.
Pattern 1: Native data store for Spark … Fast, Simple
[Diagram labels: Scalable NoSQL; Spark Analytics (compute); Immutable Cache; NoSQL, Hadoop]
Too much copying … too slow for real-time, interactive analytics
[Diagram labels: Data sources; Scalable NoSQL; Spark Analytics (compute); SnappyData]
Multi-model, distributed in-memory data store natively designed for Spark
- 20X faster than Spark for analytic queries
- 1000X faster than Spark+NoSQL
- Mutable DataFrames, transactions, indexing
- Rich, complete SQL + all Spark APIs
- Highly available data, Spark Driver
- Enterprise-grade security
- Native support for ML/DL
12. Pattern 2: Interactive analytics on live data in NoSQL stores
Problem: Interactive analytics at scale and concurrency for LIVE data sets, e.g. sensor data, customer interactions
[Diagram labels: Scalable NoSQL (operational, live data); continuous updates; ETL; Hadoop, MPP DB; cubes, aggregations; Tableau extracts; Tableau; MPP]
- MongoDB BI connector
- Custom BI like SlamData
• Expensive, complex, batch
• Difficult to deal with semi-structured
• Read only, stale insight
• Inflexible, slow
13.
Pattern 2: Live Analytics on Polyglot NotOnlySQL stores
[Diagram labels: CDC streams; Sensor stream; Hadoop; NoSQL; Spark transform (data prep); window; Rich Spark APIs; SnappyData; In-memory row-column tables; Virtual tables; NoSQL connectors; SQL; Micro Service 1; Micro Service 2; Micro Service 3; Visualize on any tool]
- Live updates propagated to in-memory analytics cluster in SnappyData
14.
Pattern 3: Streaming Analytics, not just simple processing
Streaming in Flink, DataFlow, Apex …
[Diagram labels: Unbounded streams; Stream app; window; State update; KV store; Index; OLAP column table; Hadoop; NoSQL]
- KV stores offer little to no analytic operators
- Joins, aggregations across multiple DBs not possible
- Too slow
15.
Pattern 3: Deep integration of stream processing with OLAP
• Sensor streams
• CDC streams
• Transaction streams…
Streaming deeply integrated with analytics DB
[Diagram labels: Stream; Rich Spark APIs; In-memory row-column tables; NoSQL connectors; SQL; Pull history on demand; Continuously summarize; Tableau, Zeppelin]
- Continuous queries on stream + history + enterprise data
- Simple: build stream+analytics apps using a single model
- Much faster than stitching
16.
How Mixed Workloads Are Supported Today
[Diagram: Lambda architecture, numbered as on the slide]
1. New data
2. Batch layer: master dataset
3. Serving layer: batch views
4. Speed layer: real-time views
5. Query
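The numbered flow above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from any of the products named in this deck: the function names, the event shape, and the per-key counting workload are all assumptions chosen to show why a query has to consult two separately maintained views.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute counts per key from the full master dataset."""
    return Counter(event["key"] for event in master_dataset)

def realtime_view(recent_events):
    """Speed layer: incremental counts for events no batch run has absorbed yet."""
    view = Counter()
    for event in recent_events:
        view[event["key"]] += 1
    return view

def query(key, batch, realtime):
    """A query must merge both views to see all of the data."""
    return batch[key] + realtime[key]

# Hypothetical events: three already in the master dataset, two still in flight.
master = [{"key": "sensor-1"}, {"key": "sensor-1"}, {"key": "sensor-2"}]
recent = [{"key": "sensor-1"}, {"key": "sensor-3"}]
batch, speed = batch_view(master), realtime_view(recent)
total_for_sensor_1 = query("sensor-1", batch, speed)
```

The same aggregation logic is written twice, once per layer, and every query pays for the merge.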
17.
Lambda Architecture is Complex
[Diagram labels: SOURCE APPS; KAFKA; STORM; CASSANDRA; .....]
• Complexity: learn and master multiple products, data models, disparate APIs, configs
• Slower
• Wasted resources
19.
How about a single clustered DB that can manage stream state, transactional data & run OLAP queries?
[Diagram labels: Apps; Stream processing; Framework for Stream Processing, etc; Scalable writes, point reads, OLAP queries; Tables; Txn; RDB; MPP DB; HDFS]
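As a loose, single-process analogy for that idea, the sketch below uses Python's built-in sqlite3 module to keep stream state, serve point reads, and answer an aggregate query from one store. It is not SnappyData code and says nothing about clustering or scale; the table name, columns, and sample readings are assumptions made up for the illustration.

```python
import sqlite3

# Hypothetical schema: latest reading per sensor plus an update counter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor TEXT PRIMARY KEY, temperature REAL, updates INTEGER)"
)

def ingest(sensor, temperature):
    """Stream-state update: one small transactional upsert per incoming event."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE readings SET temperature = ?, updates = updates + 1 WHERE sensor = ?",
            (temperature, sensor),
        )
        if cur.rowcount == 0:
            conn.execute("INSERT INTO readings VALUES (?, ?, 1)", (sensor, temperature))

def point_read(sensor):
    """OLTP-style lookup of a single row by key."""
    return conn.execute(
        "SELECT temperature, updates FROM readings WHERE sensor = ?", (sensor,)
    ).fetchone()

def fleet_summary():
    """OLAP-style aggregate over the same live table, with no copy step."""
    return conn.execute("SELECT COUNT(*), AVG(temperature) FROM readings").fetchone()

for sensor, temp in [("s1", 20.0), ("s2", 30.0), ("s1", 22.0)]:
    ingest(sensor, temp)
```

Both read paths see each update as soon as its transaction commits, which is the property the slide asks for.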
21.
Our Solution
• Deep scale, high volume MPP DB
• Real-time design: low latency, HA, concurrency, replication-based, consensus-driven. Matured over 13 years
• Batch design: high throughput, lineage-based system. Rapidly maturing
Single Unified HA Cluster
OLTP + OLAP + Streaming for real-time analytics
22.
A Spark-Based Big Data Analytics Platform
[Diagram labels, top to bottom]
• Spark Jobs, Scala/Java/Python/R API, JDBC/ODBC, Object API (RDD, DataSets)
• Spark API (Streaming, ML, Graph); DataFrame, RDD, DataSets
• Transactions, Indexing; Full SQL; HA
• IN-MEMORY: Rows, Columnar, Spark Cache, Synopses (Samples)
• Unified Data Access (Virtual Tables); Unified Catalog; Native Store
• SNAPPYDATA
• HDFS/HBASE; S3; JSON, CSV, XML; SQL db; Cassandra; MPP DB; Stream sources
23.
We transform Spark from this…
[Diagram: USER 1 / APP 1 and USER 2 / APP 2 each have their own SPARK MASTER, Spark Execution (Worker), framework for streaming SQL, ML…, and Immutable CACHE, over HDFS, SQL, NoSQL and a deep scale, high volume MPP DB]
• Cannot update
• Repeated for each User/APP
• Bottleneck
24.
… Into "an always-on hybrid database"!
[Diagram labels: Spark Cluster; Spark Driver; Spark Job; JDBC; ODBC; Spark Execution (Worker) JVM, long running; Framework for streaming SQL, ML…; IN-Memory ROW + COLUMN Store; Start with; Indexing; Mutable, Transactional; Shared Nothing Persistence; HISTORY: HDFS, SQL, NoSQL; Deep Scale, High Volume MPP DB]
25.
Architecture
[Diagram labels: Cluster Manager & Scheduler; Snappy Data Server (Spark Executor + Store); Parser; Query Optimizer; OLAP; TXN; Synopsis Data Engine; Distributed Membership Service; HA; Stream Processing; Data Frame; RDD; Low Latency; High Latency; HYBRID Store (Probabilistic, Rows, Columns, Index); Add / Remove Server; Tables; ODBC/JDBC]
26.
Unified API
Spark's DataFrame API allows for:
• ML, graph, batch & streaming, SQL (selects)
SnappyData adds full SQL support and extends DataFrame and DataSource APIs for:
• Mutability semantics (DML & transactions)
• Indexing
• SQL-based streaming
27.
Can we use Statistical techniques to shrink data?
• Most apps are happy to trade off 1% accuracy for a 200x speedup!
• Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data!
• Often can make perfectly accurate decisions with imperfect answers!
• A/B testing, visualization, ...
• The data itself is usually noisy
• Processing entire data doesn't necessarily mean exact answers!
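A small experiment makes the sampling claim concrete. The sketch below is plain Python on synthetic data, not the SnappyData synopsis engine; the data distribution, the 0.5% sampling fraction, and the function name are assumptions, and it does not attempt to reproduce the 200x or 99.9% figures quoted above. It estimates a mean from a uniform random sample and reports a rough 95% error margin alongside it.

```python
import random
import statistics

def approximate_mean(values, fraction, seed=0):
    """Estimate the mean from a uniform random sample of the data.

    Returns (estimate, margin), where margin is a normal-approximation
    95% half-width computed from the sample itself.
    """
    rng = random.Random(seed)
    k = max(2, int(len(values) * fraction))
    sample = rng.sample(values, k)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / k ** 0.5
    return mean, 1.96 * stderr

# Synthetic "noisy measurements": 200,000 values around 100.
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

exact = statistics.fmean(data)                               # scans every value
estimate, margin = approximate_mean(data, fraction=0.005)    # scans 1,000 values
relative_error = abs(estimate - exact) / exact
```

The estimate comes with its own error bar, so an application can decide whether the answer is good enough or whether it needs a larger sample.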
28.
Probabilistic Store: Sketches + Uniform & Stratified Samples
1. Streaming CMS (Count-Min-Sketch)
Higher resolution for more recent time ranges
[Diagram: time axis with ranges [t1, t2), [t2, t3), [t3, t4), [t4, now); labels 4T, 2T, T, ≤T, ....]
2. Top-K Queries w/ Arbitrary Filters
Maintain a small sample at each CMS cell
[Diagram: Traditional CMS vs. CMS+Samples]
3. Fully Distributed Stratified Samples
Always include timestamp as a stratified column for streams
[Diagram labels: Streams; Aging; Row Store (In-memory); Column Store (Disk); timestamp]
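For readers unfamiliar with the Count-Min Sketch, here is a minimal textbook version in plain Python. It is not the SnappyData implementation: it omits the time-range resolution and the per-cell samples described above, and the class name, the width and depth defaults, and the salted BLAKE2 hashing are assumptions for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency table: estimates never undercount, may overcount."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate a cell, so the minimum over rows is the best bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Hypothetical stream: two frequent keys among 1,000 one-off keys.
sketch = CountMinSketch()
stream = ["card-A"] * 50 + ["card-B"] * 7 + [f"card-{i}" for i in range(1000)]
for item in stream:
    sketch.add(item)
```

Memory stays at width × depth counters no matter how many distinct keys arrive, which is what makes the structure usable on unbounded streams.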
30.
What is unique
Deep Fusion with Spark
Elastic, highly available in-memory store for OLTP fused with Spark's memory manager and the Catalyst/Tungsten engine. The store itself is exposed as native Spark data frames.
Extreme Speed thru CPU code gen, vectorization
Extend Spark's Tungsten engine with better code generation, colocation schemes, ..
Synopses Data Engine
Use statistical techniques to reduce data by 100-1000x. Answer queries in a fraction of the time and resources.
32. Dealing with Credit Card Fraud
[Diagram labels: Credit Card transaction stream; SnappyData Cluster (Streaming Application, User History, Prediction Model, Black Listed Cards, ……….); Data Lake; Notification to owner; Notification to merchant]
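The flow in this diagram can be sketched as a small streaming loop. Everything below is hypothetical plain Python rather than SnappyData code: the `Transaction` fields, the in-memory history dictionary, and especially the ratio rule, which stands in for the prediction model on the slide purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    card: str
    amount: float
    merchant: str

def score(txn, blacklisted_cards, user_history, ratio_threshold=5.0):
    """Return a reason string if the transaction looks suspicious, else None."""
    if txn.card in blacklisted_cards:
        return "blacklisted card"
    past = user_history.get(txn.card, [])
    if past:
        typical = sum(past) / len(past)
        # Stand-in rule (assumption): flag amounts far above the card's average.
        if txn.amount > ratio_threshold * typical:
            return "amount far above user history"
    return None

def process(stream, blacklisted_cards, user_history):
    """Score each transaction as it arrives; notify on fraud, else update history."""
    notifications = []
    for txn in stream:
        reason = score(txn, blacklisted_cards, user_history)
        if reason is not None:
            notifications.append(("owner", txn.card, reason))
            notifications.append(("merchant", txn.merchant, reason))
        else:
            user_history.setdefault(txn.card, []).append(txn.amount)
    return notifications

history = {"c1": [20.0, 30.0]}
stream = [
    Transaction("c1", 25.0, "m1"),
    Transaction("c1", 900.0, "m2"),
    Transaction("c9", 5.0, "m3"),
]
notifications = process(stream, blacklisted_cards={"c9"}, user_history=history)
```

The scoring step reads mutable state (history, blacklist) that the same loop also updates, which is the mixed workload the deck keeps returning to.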
33. Preventing Bill Shock, Real Time Upgrades
[Diagram labels: CDR Stream; SnappyData Cluster (Streaming Application, Customers Approaching Limit, Plan Info); Data Lake; Immediate SMS to customer; Schedule callback through call center]
The system detects approaching usage limits, notifies users and gives them a chance to buy a one-time upgrade or a new plan, increasing loyalty & revenue
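A minimal sketch of the detection step, in plain Python with made-up names: the record shape (customer, minutes), the 90% warning threshold, and the "sms" action label are assumptions, not details from the deck.

```python
def check_usage(cdr_stream, plan_limits, usage, warn_fraction=0.9):
    """Accumulate usage per customer and alert once when a plan limit is near.

    cdr_stream yields (customer, minutes) records; plan_limits maps each
    customer to the minutes included in their plan; usage is running state.
    """
    alerts = []
    warned = set()
    for customer, minutes in cdr_stream:
        usage[customer] = usage.get(customer, 0) + minutes
        limit = plan_limits[customer]
        if customer not in warned and usage[customer] >= warn_fraction * limit:
            warned.add(customer)
            alerts.append((customer, "sms", usage[customer], limit))
    return alerts

plan_limits = {"alice": 100, "bob": 500}
usage = {}
cdrs = [("alice", 50), ("bob", 100), ("alice", 45), ("alice", 10)]
alerts = check_usage(cdrs, plan_limits, usage)
```

Each call record both updates a counter and triggers a lookup against plan data, so the alert fires while the customer can still act on it.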
35. Connected Car Real Time Data Flow
[Diagram labels: Kafka Receiver; SnappyData Cluster (Streaming Application, Vehicle Time Series Data, Vehicle History, Driver History, System KPIs, Asset Metadata, ……….); HDFS, HBase Raw Data Store; Custom Summary Dashboard; Notification to owner]
36. Real Time Marketing Campaigns
[Diagram labels: Ingest Stream; REAL TIME OFFERS from Merchants; REAL TIME MATCHING ENGINE; Customer History; Historical Customer Profiles; User by Geo location; Notification Sub-system; Offline Analysis; PERSONALIZED CAMPAIGNS TO USERS]
A stream matching engine that uses customer history, their current location and relevant offers to effectively target users creates differentiation & generates revenue
37.
Sensor Analytics
• 000's data points/sec
• Emergency Shutdown
• Tuning & Optimization, Monitor & Control
• Continuous Real-time Analysis
• Maintenance
• Billing
38. RFQ Analytics
[Diagram labels: RFQs/Trades/Quotes streams; Message Bus; Stream Ingestion; Reference Data; ETL; SnappyData; Analytic Dashboards]
• OLAP and low-latency querying in SQL
• Machine learning in Spark