High performance Spark distribution on PKS by SnappyData

SnappyData Roadmap
In-Memory Data Platform based on Spark
for Interactive Analytics on LIVE Data
© Snappydata Inc 2017
www.Snappydata.io
SnappyData Team
Disclaimer – Dates can change, content can be reprioritized

ACCESS using Spark programming model, Core functions from Snappy
SnappyData – Unified Analytics Platform On Spark
Mutable In-memory database, HA,
High concurrency, Persist/recover, WAN…
Low latency predictions, Row-Column tables,
Approximate Query processing, Transactions…
600% faster OOTB than latest Spark version
Spark API
- Streaming
- Graph
- Map-reduce
- ML
- Spark DL pipelines
SQL: JDBC, ODBC
REST
Spark connectors
- NoSQL (Cassandra, HBase, Redis, Elastic)
- RDBMS
- CSV, S3
- HDFS
- Mainframes

SnappyData: An In-Memory Virtual Cloud Warehouse
CDC
On-Prem
Cloud
Enterprise
Security
Streaming
& ML
Querying
Subsystem
Intelligent Data Management
Inventory
Finance
On-prem mfg
m/c
Digital Click
Streams
Legacy
Operational
Systems
CDC
IoT streams
Cloud Native
Data Sources
Cloud native
streaming sources
SnappyData

2019 Plan
SnappyData Real Time Analytics Product Suite

A Spark Based Big Data Analytics Platform
Spark API
(Streaming, ML, Graph)
Transactions, Indexing
Full SQL HA
DataFrame,
RDD, DataSets
Rows, Columnar
IN-MEMORY
Spark Cache
Synopses
(Samples)
Unified Data Access
(Virtual Tables)
Unified Catalog
Native Store
SNAPPYDATA
HDFS/HBASE
S3
JSON, CSV, XML
SQL db Cassandra MPP DB
Stream
sources
Spark Jobs, Scala/Java/Python/R API, JDBC/ODBC, Object API (RDD, DataSets)
GemFire

We transform Spark from this…
Deep Scale,
High Volume
MPP DB
USER 1 / APP 1
SPARK
MASTER
Spark Execution (Worker)
Framework for
streaming
SQL, ML…
Immutable
CACHE
USER 2 / APP 2
SPARK
MASTER
Spark Execution (Worker)
Framework for
streaming
SQL, ML…
Immutable
CACHE
HDFS
SQL
NoSQL
• Cannot update
• Repeated for each
User/APP
Bottleneck

… Into an “always-on” hybrid database!
Deep Scale,
High Volume
MPP DB
HDFS
SQL
NoSQL
HISTORY
Spark Execution (Worker) JVM
- Long running
Framework for
streaming
SQL, ML…
Spark
Driver
IN-Memory
ROW + COLUMN
Start with
Indexing
Store
- Mutable,
- Transactional
SPARK
Cluster
JDBC
ODBC
Spark Job
Shared Nothing
Persistence

Architecture
Cluster Manager
& Scheduler
SnappyData Server (Spark Executor + Store)
Parser
OLAP
TXN
Synopsis Data Engine
Distributed Membership
Service
HA
Stream Processing
Data Frame
RDD
Low
Latency
High
Latency
HYBRID Store
Probabilistic Rows Columns
Index
Query
Optimizer
Add / Remove
Server
Tables ODBC/JDBC

Continuous
replication
Join with
Hadoop
NoSQL
Rich SPARK APIs
Stream window
Spark
Transform
(Data Prep)
- Apps/BI Clients execute ad-hoc Join/aggregation queries on multiple NoSQL stores
Live Analytics Without The Need For Pipelines
In-memory
Row-Column
Tables
Virtual Tables
NoSQL Connectors SQL
Pull history
on Demand
Continuously
summarize
- No need to do expensive pre-aggregations on large data sets
- Analytics on current, moving data
- Built-in Spark ETL to enrich data
20X faster than Spark, 100-1000X faster than Spark-Cassandra
Micro
Service 1
Micro
Service 2
Micro
Service 3
Session state
Profiles
Orders

Use-case Patterns
• Real-time Analytics operational DB
  • Move from traditional cubes to distributed in-memory for real-time
• Streaming with Interactive Analytics
  • Stream joins with history/context
  • Tableau/SpotFire/Zeppelin based interactive analytics
• Interactive exploratory analytics
  • Patterns, Top-K, Trends at Google-like speed

Snappy on PKS – Cloud Neutral Containerized Analytics Platform
• In-memory redundancy and HA provided by SnappyData
• Pod redundancy and restarts provided by Kubernetes
• VM redundancy and restarts provided by PKS

Steps To Launch A Snappy Cluster On PKS
# Connect to PKS cluster
• pks login -a https://api.pks.snappydata.io -u <uname> -p test123 -k
• pks get-credentials pks-cluster-01
• kubectl config use-context pks-cluster-01
# Update to the latest snappydata chart
• cd <spark-on-k8s-checkout>
• git fetch
• git checkout enable-hive-server
# Start SnappyData cluster and note the external IP addresses of lead and locator
• helm install --name snappydata --namespace snappy ./charts/snappydata/
• kubectl get services -n snappy | grep public
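The last command above filters the service listing for the public endpoints. As a minimal sketch in Python (not part of the SnappyData or PKS tooling), the same lookup can be done programmatically on the unfiltered output of kubectl get services -n snappy, assuming kubectl's default column layout (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE). The function name public_endpoints, the sample service names, and the sample addresses are hypothetical and used only for illustration.

```python
from typing import Dict

# Hypothetical sample of `kubectl get services -n snappy` output.
# Service names and addresses are illustrative, not taken from a real cluster.
SAMPLE = """\
NAME                        TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)          AGE
snappydata-leader-public    LoadBalancer   10.100.200.1   203.0.113.10   5050:30001/TCP   5m
snappydata-locator-public   LoadBalancer   10.100.200.2   203.0.113.11   1527:30002/TCP   5m
snappydata-server           ClusterIP      None           <none>         1527/TCP         5m
"""


def public_endpoints(kubectl_output: str) -> Dict[str, str]:
    """Map service name to EXTERNAL-IP for services whose name contains 'public'."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Locate the columns from the header row instead of hard-coding positions.
    header = lines[0].split()
    name_col = header.index("NAME")
    ip_col = header.index("EXTERNAL-IP")
    endpoints: Dict[str, str] = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(name_col, ip_col):
            continue  # skip malformed rows
        name, ip = fields[name_col], fields[ip_col]
        # Ignore services that have no external address assigned yet.
        if "public" in name and ip not in ("<none>", "<pending>"):
            endpoints[name] = ip
    return endpoints


if __name__ == "__main__":
    for name, ip in public_endpoints(SAMPLE).items():
        print(name, ip)
```

Reading the column positions from the header row keeps the sketch working if the column order differs, at the cost of requiring the header, so it should be fed the full listing rather than the grep-filtered one.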

Steps To Launch A Snappy Cluster On PKS
# Load data into the cluster
• <snappydata-product-dir>/bin/snappy
• snappy> connect client '<locator-public-ip>:1527';
• snappy> run '<path/to/attached>/load_CFPB_CC_Data.sql';
# Access SnappyData dashboard at <lead-public-ip>:5050
# Tableau workbook
• Point the workbook to the lead node.
• Launch the workbook by double-clicking it.
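Before opening the snappy shell or the dashboard, it can help to confirm that the two public endpoints accept TCP connections. The following is a minimal Python sketch using only the standard library. Ports 1527 (locator client port) and 5050 (lead dashboard) come from the steps above; the helper names port_open and check_cluster are assumptions made for illustration and are not part of any SnappyData tool.

```python
import socket
from typing import Dict


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_cluster(locator_ip: str, lead_ip: str, timeout: float = 3.0) -> Dict[str, bool]:
    """Check the locator client port and the lead dashboard port."""
    return {
        "locator_client_port_1527": port_open(locator_ip, 1527, timeout),
        "lead_dashboard_port_5050": port_open(lead_ip, 5050, timeout),
    }
```

A call such as check_cluster("<locator-public-ip>", "<lead-public-ip>") returns a dictionary of booleans. A False value only shows that the port did not accept a connection from the machine running the check; it does not say why.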

How We Beat The Competition
• Unified Analytics through deep integration into Apache Spark and its ecosystem
• High performance through in-memory design center
• Support for ETL free live data through CDC integration
• Scale and Performance using our Synopsis Data Engine
• Cloud neutral, lower TCO analytics platform based on Kubernetes
• Standards based approach with support for SQL, ML, & Streaming
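The Synopsis Data Engine point above, together with the earlier mentions of synopses (samples) and approximate query processing, rests on answering aggregate queries from a sample instead of the full data. The Python sketch below is a generic illustration of that idea using uniform random sampling. It is not a description of how SnappyData implements its engine, and the function name approx_sum and its parameters are hypothetical.

```python
import random
from typing import Iterable


def approx_sum(values: Iterable[float], fraction: float, seed: int = 0) -> float:
    """Estimate sum(values) from a uniform random sample.

    Each value is kept independently with probability `fraction`; the sampled
    total is then scaled by 1/fraction to estimate the full total.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sampled_total = sum(v for v in values if rng.random() < fraction)
    return sampled_total / fraction


if __name__ == "__main__":
    data = range(1, 100_001)
    exact = sum(data)
    estimate = approx_sum(data, fraction=0.1)
    print("exact:", exact)
    print("estimate:", estimate)
    print("relative error:", abs(estimate - exact) / exact)
```

The trade-off is that only about a tenth of the rows are aggregated here, in exchange for an answer that carries sampling error. A production engine would also report error bounds and choose samples more carefully than this uniform scheme.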

What Makes SnappyData Compelling
In-memory scale-out virtual cloud warehouse
• Apache Spark compatible data platform = Unified analytics in the cloud
• Analytics on any on-prem or cloud native data source = No expensive cloud data migration!!
• Cloud neutral portable real time analytics = No Amazon lock in!!
• High concurrency and BI tool support = Analytics for everyone!!
• Intelligent CDC integration = Unified analytics on live data!!

2019 Themes
Kubernetes for Multi-cloud support
• Multi-cloud certification
DevOps Simplification
• Cloud neutral managed cloud offering
Enterprise Readiness
• RLS, persistence to cloud, backup/restore using Parquet, Dashboard enhancements, improved performance using SIMD
Ecosystem support
• Support for Debezium, certified on major Spark distributions

Sampling of Customer Use Cases today

CDC
Streams
NoSQL
window
Spark Transform
(Data Prep)
In-memory
Row-Column
Tables
SpotFire & Tableau
Raw Data
Ingestion
& Prep
Rich SPARK APIs, NoSQL Connectors, SQL
SnappyData
Analytics Backbone For A Large Fortune 30 Company

Smart City – Parking, congestion management
● Sensors power lamp posts
● Optimize parking services
● Optimize energy consumption
● Congestion control
Challenge:
• Hundreds of thousands of sensor streams generating too much data
• Actionable intelligence requires analysis of streams with history
• Ad-hoc Interactive analytics on all this data

Smart City – Parking, Congestion Management
• Application built using SnappyData’s Unified Analytics API
• Reduced complexity due to fewer moving parts
• 20X better performance and far fewer resources

Performance Benchmark
600% faster than Apache Spark in TPC-H (Complex Analytical queries)
Up to 20X faster than Spark on complex joins, aggregations

Continue at www.snappydata.io/resources
