SnappyData Toronto Meetup Nov 2017

SnappyData

Building Continuous Applications Driven By Real Time Insights
Version 0.1 | © SnappyData Inc 2017
www.snappydata.io
Sudhir Menon, Founder, COO

Who Are We?
•  New Spark-based open source project started by Pivotal GemFire founders + engineers
•  Decades of in-memory data management experience
•  Focus on real-time, operational analytics: Spark based OLTP+OLAP database

Spinout: SnappyData
Funded by Pivotal, GE, GTD Capital

The Big Data Market Is Facing Disruption (Again)
•  Higher Data Volumes
•  Growth in Streaming Workloads
•  Analytics on Live Data
•  Growth in unstructured data
•  Machine Learning/AI as first class workloads
Need to reduce complexity and cloud costs

Spark – Key to Emerging Battleground in Analytics and BI
- Multi-model data
- Integrate Streaming data
[Chart: Phase 1, Phase 2, Phase 3 ?]
Source: Gartner

Mixed Workloads Are Everywhere
•  Stream Processing
•  Transaction (point lookups, small updates)
•  Interactive Analytics
•  Analytics on mutating data
•  Correlating and joining streams with large histories
•  Maintaining state or counters while ingesting streams

Time Value of Information – Why Does It Matter?
•  Elapsed time from event occurrence to event analytics matters
•  Latency in using information for learning matters
•  Concurrency matters
•  Recovery time matters
•  User-kernel crossings matter
In short, liveness of data matters when it comes to making decisions based on current information

What Is Happening With Applications
•  Applications that are intelligent, proactive, learn from past interactions, and are context aware in their decision making
•  Fast and reliable ingestion capabilities
•  Support high memory density
•  Utilize memory to reduce response time
•  Support high concurrency
•  Work on live data
•  Support data mutability

Let's Discuss Some Use Cases
•  Market Surveillance Systems (Trading exchanges, Market makers)
•  Real Time Scoring Systems (Product recommendations, real time offers)
•  Telco Analytics (Location based services, Predictive analytics)
•  Sensor Analytics (Real time alerting for parking management, lighting etc.)
•  Ad analytics + Ad placement systems
•  Credit Card Fraud
•  Detecting and Stopping Malware

Mixed Workloads in Industrial IOT
[Diagram: IOT Devices; Event Stream]
Anomaly detection – score against models
- Map sensors to tags
- Monitor temperature
- Send alerts
Correlate current temperature trend with history…
Interact using dynamic queries…

Pain points we come across today
•  Is Spark ready for real time? Enterprise?
•  Lacks mutable State management
•  Not designed for high concurrency and mixed workloads
•  Inadequate SQL support; Limited ODBC/JDBC access
•  Fault tolerant, not HA
•  Near impossible to do Live Analytics on NoSQL stores
•  Pattern today – periodic copy of state into some analytic DB/Hadoop
•  Stale Insight, not continuous or real time
•  Interactive dynamic aggregations not possible
•  Data models make support for BI tools like Tableau difficult
•  Most Stream processors not capable of true Analytics
•  Deep stream analytics augmenting stream processing

Pattern 1: Native data store for Spark … Fast, Simple
[Diagram: Data sources; Scalable NoSQL; NoSQL, Hadoop; Immutable Cache; Spark Analytics (compute); SnappyData]
Too much copying … too slow for real time, interactive analytics
Multi-model, distributed in-memory data store natively designed for Spark
- 20X faster than Spark for Analytic queries
- 1000X faster than Spark+NoSQL
- Mutable DataFrames, transactions, indexing
- Rich, complete SQL + All Spark APIs
- Highly Available data, Spark Driver
- Enterprise grade security
- Native support for ML/DL

Pattern 2: Interactive analytics on live data in NoSQL stores
Problem: Interactive Analytics at scale and concurrency for LIVE data sets, e.g. Sensor data, Customer interactions
[Diagram: Continuous updates; Operational, Live Data; Scalable NoSQL; ETL; Hadoop, MPP DB; MPP; Cubes, aggregations; Tableau Extracts; Tableau]
- MongoDB BI connector
- Custom BI like SlamData
- Expensive, complex, batch
- Difficult to deal with semi-structured
- Read only, stale Insight
- Inflexible, Slow

Pattern 2: Live Analytics on Polyglot NotOnlySQL stores
- Live updates propagated to in-memory Analytics Cluster in SnappyData
[Diagram: Sensor stream; CDC Streams; Hadoop; NoSQL; Micro Service 1, 2, 3; Rich Spark APIs; window; Spark Transform (Data Prep); SnappyData: In-memory Row-Column Tables, Virtual Tables, NoSQL Connectors, SQL; Visualize on any tool]

Pattern 3: Streaming Analytics not just simple processing
Streaming in Flink, DataFlow, Apex …
[Diagram: Unbounded Streams; Stream App; window; State Update; Index; KV Store; Hadoop; NoSQL; OLAP Column table]
- KV stores offer little to no analytic operators
- Joins, aggregations across multiple DBs not possible
- Too slow

Pattern 3: Deep integration of stream processing with OLAP
Streaming deeply integrated with Analytics DB
•  Sensor streams
•  CDC streams
•  Transaction Streams…
[Diagram: Stream; Rich Spark APIs; In-memory Row-Column Tables; NoSQL Connectors; SQL; Pull history on Demand; Continuously summarize; Tableau, Zeppelin]
- Continuous queries on stream + history + enterprise data
- Simple: Build stream+analytics apps using single model
- Much faster than stitching

How Mixed Workloads Are Supported Today
[Diagram: Lambda Architecture]
1. New Data
2. Batch layer: Master Dataset
3. Serving layer: Batch views
4. Speed layer: Real-time Views
5. Query
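
As a rough illustration of the query path in this layered design, the sketch below answers a query by merging a precomputed batch view with a real-time view. The view contents, key names, and the `query` and `on_new_event` helpers are hypothetical and are not taken from any specific product.

```python
from collections import Counter

# Hypothetical batch view: counts precomputed by the batch layer
# from the master dataset (complete up to the last batch run).
batch_view = Counter({"sensor-1": 100, "sensor-2": 40})

# Hypothetical real-time view: counts the speed layer has accumulated
# for events that arrived after the last batch run.
realtime_view = Counter({"sensor-1": 3, "sensor-3": 7})


def query(key: str) -> int:
    """Answer a query by combining the batch and real-time views."""
    return batch_view[key] + realtime_view[key]


def on_new_event(key: str) -> None:
    """The speed layer updates its view incrementally, one event at a time."""
    realtime_view[key] += 1


on_new_event("sensor-2")
print(query("sensor-1"), query("sensor-2"), query("sensor-3"))
```

Even in this toy form, the application has to keep two views in step and merge them on every query.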

Lambda Architecture is Complex
[Diagram: SOURCE APPS; KAFKA; STORM; CASSANDRA; .....]
•  Complexity: learn and master multiple products, data models, disparate APIs, configs
•  Slower
•  Wasted resources

Can We Simplify & Optimize?

How about a single clustered DB that can manage stream state, transactional data & run OLAP queries?
[Diagram: Apps; Stream processing; Framework for Stream Processing, etc; Scalable writes, point reads, OLAP queries; Tables; Txn; RDB; MPP DB; HDFS]

Our Solution: SnappyData

A Single Unified Cluster:
OLTP + OLAP + Streaming for real-time analytics

Our Solution
[Diagram: Deep Scale, High Volume MPP DB]
Batch design, high throughput, lineage based system (Rapidly Maturing)
Real-time design: Low latency, HA, Concurrency, replication based, consensus driven (Matured over 13 years)
Single Unified HA Cluster
OLTP + OLAP + Streaming for real-time analytics

A Spark Based Big Data Analytics Platform
[Diagram: SNAPPYDATA platform layers]
Spark Jobs, Scala/Java/Python/R API, JDBC/ODBC, Object API (RDD, DataSets)
Spark API (Streaming, ML, Graph); DataFrame, RDD, DataSets
Transactions, Indexing; Full SQL; HA
IN-MEMORY: Rows, Columnar, Spark Cache, Synopses (Samples)
Unified Data Access (Virtual Tables); Unified Catalog; Native Store
HDFS/HBASE; S3; JSON, CSV, XML; SQL db; Cassandra; MPP DB; Stream sources

We transform Spark from this…
[Diagram: USER 1 / APP 1 and USER 2 / APP 2, each with its own SPARK MASTER, Spark Execution (Worker), Framework for streaming SQL, ML…, and Immutable CACHE; HDFS; SQL; NoSQL; Deep Scale, High Volume MPP DB]
•  Cannot update
•  Repeated for each User/APP
Bottleneck

… Into “an always-on hybrid database”!
[Diagram: SPARK Cluster; Spark Driver; Spark Execution (Worker) JVM - Long running; Framework for streaming SQL, ML…; JDBC; ODBC; Spark Job; IN-Memory ROW + COLUMN; Start with; Indexing; Store]
-  Mutable
-  Transactional
Shared Nothing Persistence
HISTORY: Deep Scale, High Volume MPP DB; HDFS; SQL; NoSQL

Architecture
[Diagram: Snappy Data Server (Spark Executor + Store)]
Cluster Manager & Scheduler; Distributed Membership Service; HA; Add / Remove Server
Parser; Query Optimizer; OLAP; TXN; Synopsis Data Engine
Stream Processing; Data Frame; RDD; Low Latency; High Latency
HYBRID Store: Probabilistic, Rows, Columns, Index
Tables; ODBC/JDBC

Unified API
Spark’s DataFrame API allows for:
•  ML, graph, batch & streaming, SQL (selects)

SnappyData adds full SQL support and extends DataFrame and DataSource APIs for:
•  Mutability semantics (DML & transactions)
•  Indexing
•  SQL-based streaming

Can we use Statistical techniques to shrink data?
•  Most apps happy to tradeoff 1% accuracy for 200x speedup!
•  Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data!
•  Often can make perfectly accurate decisions with imperfect answers!
•  A/B Testing, visualization, ...
•  The data itself is usually noisy
•  Processing entire data doesn’t necessarily mean exact answers!
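
To make the "tiny fraction of data" point concrete, here is a minimal sketch that estimates an average from a 1% uniform random sample and reports the relative error. The data is synthetic and the sizes are assumptions chosen for illustration; it says nothing about how SnappyData samples internally.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical data: one million noisy measurements.
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Look at only a 1% uniform random sample of the data.
sample = random.sample(population, k=len(population) // 100)

exact = statistics.fmean(population)
approx = statistics.fmean(sample)
rel_error = abs(approx - exact) / abs(exact)

print(f"exact={exact:.3f} approx={approx:.3f} relative error={rel_error:.5%}")
```

The sample is 100 times smaller than the full data set, while the error of the mean depends mainly on the sample size and the spread of the data.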

Probabilistic Store: Sketches + Uniform & Stratified Samples
1. Streaming CMS (Count-Min-Sketch)
Higher resolution for more recent time ranges
[Diagram: Time axis with ranges [t1, t2), [t2, t3), [t3, t4), [t4, now) labeled 4T, 2T, T, ≤T]
2. Top-K Queries w/ Arbitrary Filters
Maintain a small sample at each CMS cell
[Diagram: Traditional CMS; CMS+Samples]
3. Fully Distributed Stratified Samples
Always include timestamp as a stratified column for streams
[Diagram: Streams; Aging; Row Store (In-memory); Column Store (Disk); timestamp]
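
For readers unfamiliar with the Count-Min Sketch named above, the following is a minimal sketch of the plain data structure only. It does not attempt the time-range resolution or the per-cell samples described on the slide, and the class name, default sizes, and hashing scheme are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: a depth x width grid of counters."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row: int, item: str) -> int:
        # One hash function per row, derived by salting with the row index.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(item.encode(), salt=salt).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._cell(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum across rows
        # never underestimates the true count.
        return min(
            self.table[row][self._cell(row, item)] for row in range(self.depth)
        )


cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"), cms.estimate("b"))
```

Memory use is fixed by `width * depth` regardless of how many distinct items arrive, which is what makes the structure attractive for unbounded streams.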

High-Level Accuracy Guarantees
[Diagram: STREAMS; Aging; SNAPPY STORE (Row store in Memory, Column Store on Disk, Stratified Samples); Interactive Query; Continuous Query; Query Engine with pipelined bootstrapped operator; HAC; Bias Estimate; Variance Estimate; Quality certified Approx Answers]
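
The slide names a bootstrapped operator that yields variance estimates alongside approximate answers. As a generic illustration of the bootstrap idea only (not SnappyData's operator), the sketch below resamples a sample with replacement to attach a standard error to an estimated mean. The data, the `bootstrap_mean` helper, and the resample count are assumptions.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical sample drawn from a much larger table.
sample = [random.expovariate(1 / 50.0) for _ in range(2_000)]


def bootstrap_mean(values, resamples=200):
    """Return (estimate, standard error) for the mean via the bootstrap."""
    n = len(values)
    means = []
    for _ in range(resamples):
        # Resample with replacement, same size as the original sample.
        resample = random.choices(values, k=n)
        means.append(statistics.fmean(resample))
    return statistics.fmean(values), statistics.stdev(means)


estimate, std_error = bootstrap_mean(sample)
# Rough 95% interval using a normal approximation.
low, high = estimate - 1.96 * std_error, estimate + 1.96 * std_error
print(f"mean ~ {estimate:.2f}, 95% interval ~ [{low:.2f}, {high:.2f}]")
```

Reporting the interval with the answer lets a caller decide whether the approximation is good enough or whether to fall back to the full data.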

What is unique

Deep Fusion with Spark
Elastic, highly available in-memory store for OLTP fused with Spark’s memory manager and the Catalyst/Tungsten engine. The store itself is exposed as native Spark data frames.

Extreme Speed thru CPU code gen, vectorization
Extend Spark’s Tungsten engine with better code generation, colocation schemes, ..

Synopses Data Engine
Use Statistical techniques to reduce data by 100-1000x. Answer queries in fraction of time and resources.

Dealing with Credit Card Fraud
[Diagram: Credit Card transaction stream; SnappyData Cluster: Streaming Application, User History, Prediction Model, Black Listed Cards, ……….; Data Lake; Notification to owner; Notification to merchant]

Preventing Bill Shock, Real Time Upgrades
The system detects approaching usage limits, notifies users and gives them a chance to buy a one time upgrade or a new plan, increasing loyalty & revenue
[Diagram: CDR Stream; SnappyData Cluster: Streaming Application, Customers Approaching Limit, Plan Info; Data Lake; Immediate SMS to customer; Schedule callback through call center]
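
The detection step in this use case can be sketched as a running per-customer counter that is updated as each call detail record arrives and compared against the plan limit. The plan data, the 90% threshold, and the `on_cdr` helper below are hypothetical and stand in for whatever the real streaming application would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    customer_id: str
    limit_mb: float


# Hypothetical plan info and alert threshold.
PLANS = {"c1": Plan("c1", 1000.0), "c2": Plan("c2", 500.0)}
ALERT_FRACTION = 0.9

usage_mb = {}      # running state kept while ingesting the stream
alerted = set()    # customers that have already been notified


def on_cdr(customer_id: str, used_mb: float) -> Optional[str]:
    """Update usage for one call detail record; return an alert at most once."""
    usage_mb[customer_id] = usage_mb.get(customer_id, 0.0) + used_mb
    plan = PLANS[customer_id]
    near_limit = usage_mb[customer_id] >= ALERT_FRACTION * plan.limit_mb
    if near_limit and customer_id not in alerted:
        alerted.add(customer_id)
        return (
            f"SMS to {customer_id}: "
            f"{usage_mb[customer_id]:.0f} of {plan.limit_mb:.0f} MB used"
        )
    return None


stream = [("c1", 400.0), ("c2", 100.0), ("c1", 520.0), ("c1", 50.0)]
alerts = [a for a in (on_cdr(c, mb) for c, mb in stream) if a is not None]
print(alerts)
```

The state here lives in process memory; the slide's point is that a store which handles stream ingestion, point updates, and lookups together can hold this state and the plan data in one place.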

Market Surveillance For Market Makers
•  Stream analytics
•  Insider detection
•  Apply Rules
•  Detect Market Manipulation
Alert & Notify Downstream Systems
Trigger Investigations
[Diagram: Messaging; Partitioned Stream Ingestion; Stream Ingestion; Reference Data; Spark Streaming; SQL Querying; Continuous Queries; Machine Learning; Summaries & Alerts]

Connected Car Real Time Data Flow
[Diagram: Kafka Receiver; Vehicle Time Series Data; SnappyData Cluster: Streaming Application, Vehicle History, Driver History, ……….; HDFS, HBase Raw Data Store; Custom Summary Dashboard; Notification to owner; System KPIs; Asset Metadata; Offline Analysis]

Real Time Marketing Campaigns
A stream matching engine that uses customer history, their current location and relevant offers to effectively target users creates differentiation & generates revenue
[Diagram: Ingest Stream; REAL TIME OFFERS from Merchants; REAL TIME MATCHING ENGINE; Customer History; Historical Customer Profiles; User by Geo location; Notification Sub-system; PERSONALIZED CAMPAIGNS TO USERS]

Sensor Analytics
[Diagram: 000’s data points/sec; Continuous Real-time Analysis; Emergency Shutdown; Tuning & Optimization, Monitor & Control; Maintenance; Billing]

RFQ Analytics
•  OLAP and Low Latency Querying in SQL
•  Machine Learning in Spark
[Diagram: RFQs/Trades/Quotes streams; Message Bus; Stream Ingestion; Reference Data; ETL; SnappyData; Analytic Dashboards]

Ad Analytics
1.5-2x faster ingestion, faster trx
7-142× faster analytics (at 300M records)

TPCH
Avg Latency (100 GB)
SnappyData   5.7s
MemSQL       12.0s
Spark        66.9s

THANK YOU !
Try it out: http://snappydata.io/download
Resources: http://www.snappydata.io/resources
