Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids

1
Hybrid Transactional/Analytics Processing
with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2017. All rights reserved.
Ali Hodroj
VP, Products and Strategy @ahodroj

2
GigaSpaces
Ultra-Low Latency / High Throughput Middleware
Direct customers: 500+
Headquarters: New York, NY
Established: 2001

4
We’re seeing more in our customer base

5
…a shift towards real-time
BI
Big Data
Fast Data

6
Sample Customer Use Cases
• Internet of Things: Edge Analytics (Operational Intelligence, Predictive Maintenance, Spatial Analytics)
• Omni-Channel: Predictive Analytics (Personalization, Recommendation)
• Operational Intelligence: Operational Analytics (Fraud Detection, Supply Chain Optimization)

7
In-Memory Computing
(not a new thing)
Rapid decline in RAM prices drives advanced data processing innovations
• Transactional (2001-present)
– In-Memory Databases
– In-Memory Data Grids
• Analytics (2012-present)
– In-Memory Data Processing Frameworks (Spark)
– In-Memory File Systems (Tachyon)

8
In-Memory Data Processing: Apache Spark

9
In-Memory Data Grid:
Online Transaction Processing at Low-Latency and High Throughput
A data grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing.
http://xap.github.io

10
In-Memory Data Grid 101
[Diagram: Feeder; Partitioned Data across three Virtual Machines]

11
In-Memory Data Grid 101: Execution Models
• Write
• Event-Driven / Reactive
• RPC / Master-Worker
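The two execution models named on this slide can be illustrated with a toy, single-process sketch. Everything here is hypothetical: `ToyGrid`, `on_write`, and `execute` are illustrative names, not the XAP API, and CRC32 stands in for whatever routing hash a real grid uses.

```python
import zlib
from typing import Any, Callable, Dict, List


class ToyGrid:
    """Hypothetical single-process stand-in for a partitioned data grid."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[Dict[str, Any]] = [{} for _ in range(num_partitions)]
        self.listeners: List[Callable[[str, Any], None]] = []

    def on_write(self, listener: Callable[[str, Any], None]) -> None:
        """Event-driven / reactive: register a callback fired on every write."""
        self.listeners.append(listener)

    def write(self, key: str, value: Any) -> None:
        # Route the entry to one partition, then notify the listeners.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index][key] = value
        for listener in self.listeners:
            listener(key, value)

    def execute(self, task: Callable[[Dict[str, Any]], Any],
                reducer: Callable[[List[Any]], Any]) -> Any:
        """RPC / master-worker: run the task on every partition, then reduce."""
        return reducer([task(partition) for partition in self.partitions])


grid = ToyGrid(num_partitions=3)
alerts: List[str] = []
# Reactive path: flag any written value above an assumed threshold of 100.
grid.on_write(lambda key, value: alerts.append(key) if value > 100 else None)

for key, value in [("a", 50), ("b", 150), ("c", 200)]:
    grid.write(key, value)

# Master-worker path: each partition sums its own values, the master adds them up.
total = grid.execute(lambda partition: sum(partition.values()), sum)
print(alerts, total)
```

In the reactive path the work is triggered by the write itself; in the master-worker path the caller ships a task to the partitions and combines the partial results.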

13
In-Memory Data Grid 101: Typical Deployment
[Diagram: typical deployment in a private or public cloud. HTML and REST clients connect over HTTP/S through a hardware load balancer (HW LB) to HTTPD load balancers, each with an LB agent and a GSA. Processing units are spread over Primary Sets 1 to 6 with matching Backup Sets 1 to 6, each hosted by a GSA. A mirror service, also hosted by a GSA, writes to the DB asynchronously.]

14
Host: Cisco UCS Server
CPU: Intel 16-core, 2.9 GHz
Concurrent threads: 2
Throughput: 200, 400, 800 ops/sec

16
Hybrid Transactional/Analytics Processing at Scale
Challenges
1. Provide a closed-loop analytics pipeline: from data to insight to action at sub-second latency
2. IoT and omni-channel require the convergence of many different data types
3. Blend of both real-time and historical data
Requirements
1. Bi-directional integration between transactional and analytical data stores
2. Ability to support POJO, JSON, GeoSpatial, and unstructured types through a unified API
3. Unified, scale-out real-time and historical data store

17
Our road towards HTAP:
SPARK + MICROSERVICES

18
What’s needed
• Large-scale distributed analytics framework
• Unified, scale-out, low-latency data store
• Transactional capabilities: ACID, event-driven, rich data modeling
• Microservices

19
Our approach to HTAP
Low-latency scale-out in-memory data grid
+
Large-scale distributed analytics framework

20
Why Spark?
• Unified & Concise API
• Highly Flexible Data Store Integration
• Massive Community and Adoption

21
Challenge 1: Provide a closed-loop analytics pipeline, from data to insight to action, at scale (sub-second)
Requirement 1: Bi-directional integration between transactional and analytical data stores

23
In-Memory Data Grid
In-Memory Store (RAM); Flash, SSD, Off-Heap Store
Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning
Transactional Tier: ACID-compliant, Strong Consistency
High Availability
Security & Management

24
Build a connector: Spark to IMDG
• Get Partitions: an array of partitions that a dataset is divided into
– IMDG: distributed query to get partitions and their hosts
• Compute: a compute function that runs a computation on a partition
– IMDG: iterator over a portion of the data
• Get Preferred Location: optional preferred locations, i.e. hosts for a partition where the data will be loaded
– IMDG: hosts from the distributed query
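The three-part contract on this slide corresponds to the `getPartitions`, `compute`, and `getPreferredLocations` methods of Spark's RDD class. Below is a minimal Python sketch of that shape against a toy grid. It is an illustration under stated assumptions, not the InsightEdge connector: `ToyGridCluster`, `GridPartition`, `GridRDD`, and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class GridPartition:
    index: int
    host: str


class ToyGridCluster:
    """Hypothetical stand-in for the data grid; not a real client API."""

    def __init__(self, rows_by_partition: Dict[int, List[dict]],
                 hosts: Dict[int, str]) -> None:
        self.rows_by_partition = rows_by_partition
        self.hosts = hosts

    def distributed_query(self) -> List[GridPartition]:
        # Ask the grid which partitions exist and where they live.
        return [GridPartition(i, self.hosts[i]) for i in sorted(self.rows_by_partition)]

    def read(self, index: int) -> Iterator[dict]:
        # Iterate over the portion of data held by one partition.
        return iter(self.rows_by_partition[index])


class GridRDD:
    """Mirrors the three-part contract listed on the slide."""

    def __init__(self, cluster: ToyGridCluster) -> None:
        self.cluster = cluster

    def get_partitions(self) -> List[GridPartition]:
        return self.cluster.distributed_query()

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        return self.cluster.read(partition.index)

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        return [partition.host]


cluster = ToyGridCluster(
    rows_by_partition={0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]},
    hosts={0: "node2", 1: "node3"},
)
rdd = GridRDD(cluster)
partitions = rdd.get_partitions()
rows = [row for p in partitions for row in rdd.compute(p)]
locations = [rdd.get_preferred_locations(p) for p in partitions]
print(rows, locations)
```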

25
Pattern #1: Data Locality (machine-level)
• node 1: Spark master, Grid master
• node 2: Spark worker, Grid Partition
• node 3: Spark worker, Grid Partition
• NoSQL Storage
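Machine-level data locality amounts to running each task on a worker that shares a host with the grid partition it reads. The sketch below is a simplified, hypothetical scheduler (`assign_tasks` is not a Spark API, and Spark's real scheduler has more locality levels and timeouts); it assumes each partition advertises a list of preferred hosts.

```python
from typing import Dict, List


def assign_tasks(preferred_hosts: Dict[int, List[str]],
                 worker_hosts: List[str]) -> Dict[int, str]:
    """Pick a co-located worker when one exists, else fall back round-robin."""
    assignments: Dict[int, str] = {}
    for fallback, (partition, hosts) in enumerate(sorted(preferred_hosts.items())):
        local = [host for host in hosts if host in worker_hosts]
        if local:
            assignments[partition] = local[0]  # machine-local read
        else:
            # No worker on a preferred host: the read goes over the network.
            assignments[partition] = worker_hosts[fallback % len(worker_hosts)]
    return assignments


assignments = assign_tasks(
    preferred_hosts={0: ["node2"], 1: ["node3"], 2: ["node9"]},
    worker_hosts=["node2", "node3"],
)
print(assignments)
```

Partitions 0 and 1 land on the worker that shares their host; partition 2 has no co-located worker in this made-up layout, so it falls back to a remote one.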

26
Pattern #2: Pushdown Predicates (Grid-side processing)
Implementing DataSource API
SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012
• Filtering and column pruning in Data Grid
• Aggregation in Spark
Spark SQL architecture:
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
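The split of work for the query on this slide can be sketched in plain Python. `grid_scan` is a hypothetical function that plays the role of the grid-side scan: it takes the required columns and the filters, similar in shape to `buildScan(requiredColumns, filters)` in Spark's `PrunedFilteredScan` trait. The sample rows are made up for illustration.

```python
import operator
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Made-up sample rows standing in for the grid's order data.
ORDERS: List[Dict[str, Any]] = [
    {"city": "NY", "year": 2013, "amount": 10.0},
    {"city": "NY", "year": 2012, "amount": 5.0},
    {"city": "DC", "year": 2015, "amount": 7.0},
    {"city": "NY", "year": 2016, "amount": 2.5},
]

OPS = {"=": operator.eq, ">": operator.gt}
Filter = Tuple[str, str, Any]  # (column, operator, value)


def grid_scan(rows: Iterable[Dict[str, Any]],
              required_columns: List[str],
              filters: List[Filter]) -> Iterator[Dict[str, Any]]:
    """Grid side: apply the filters, then keep only the requested columns."""
    for row in rows:
        if all(OPS[op](row[column], value) for column, op, value in filters):
            yield {column: row[column] for column in required_columns}


# Grid side: WHERE city = 'NY' AND year > 2012, projecting only `amount`.
pruned = list(grid_scan(ORDERS, ["amount"], [("city", "=", "NY"), ("year", ">", 2012)]))

# Spark side: SUM(amount) over the rows that survived the pushdown.
total = sum(row["amount"] for row in pruned)
print(pruned, total)
```

Only two single-column rows cross from the grid to the aggregation step, instead of four full rows.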

27
Pattern #3: Decouple Data Processing from Data Storage
• node 1: Spark master, Grid master
• node 2: Spark worker, Grid Partition
• node 3: Spark worker, Grid Partition
• Spark workers: lightweight workers, small JVMs
• Grid partitions: large JVMs, fast indexing
• NoSQL Storage

28
Push-down predicates performance
Traditional Spark filtering of 7MM records: 31 sec
vs
Grid-side + Spark filtering of 7MM records: 800 ms
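As a quick check on the two figures from this slide, the ratio can be worked out directly. Only the slide's own numbers are used; no other benchmark details are assumed.

```python
# Figures from the slide, both converted to milliseconds.
traditional_ms = 31_000   # traditional Spark filtering: 31 sec
grid_side_ms = 800        # grid-side + Spark filtering: 800 ms

speedup = traditional_ms / grid_side_ms
print(f"{speedup:.2f}x faster")
```

That is a speedup of a little under 39x for this workload.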

29
Challenge 2: IoT and omni-channel require the convergence of many different data types
Requirement 2: Ability to support POJO, JSON, GeoSpatial, and unstructured types through a unified API

In-Memory Data Grid + Spark Convergence
Geo-Spatial Full Text

POJO Domain Model to Spark (Event-Driven)

Geo-Spatial Data Frames
Geo-Spatial

Full Text Indexes + Lucene Analyzers
Full Text

37
Challenge 3: Blend of both real-time and historical data
Requirement 3: Unified, scale-out real-time and historical data store

38
In-Memory Data Grid Partitioning
hash(key) % #nodes
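The routing rule `hash(key) % #nodes` can be sketched as follows. CRC32 is used here only as an assumed stand-in for the grid's hash function, chosen because it is stable across processes (Python's built-in `hash` for strings is randomized per run); `partition_for` and the sample keys are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_nodes: int) -> int:
    """Route a key to a partition with hash(key) % #nodes."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


keys = [f"order-{i}" for i in range(1000)]
counts = Counter(partition_for(key, 3) for key in keys)
print(dict(counts))
```

The same key always maps to the same partition, so reads and writes for a key can go straight to its owner. Changing the number of nodes changes the mapping for most keys, which is one reason a simple modulo scheme needs rebalancing when the cluster is resized.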

39
In-Memory Data Grid Partitioning – With HA
hash(key) % #nodes
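With high availability, each primary partition has a backup on a different node. The sketch below assumes one simple placement rule (backup on the next node in the list) and a simple failover (promote the backup); both are illustrative assumptions, not a description of how XAP places or promotes instances.

```python
from typing import Dict, List, Optional

Layout = Dict[int, Dict[str, Optional[str]]]


def place(num_partitions: int, nodes: List[str]) -> Layout:
    """Assumed rule: primary on node p % n, backup on the next node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to host a backup elsewhere")
    return {
        p: {"primary": nodes[p % len(nodes)], "backup": nodes[(p + 1) % len(nodes)]}
        for p in range(num_partitions)
    }


def fail_node(layout: Layout, failed: str) -> Layout:
    """Promote the backup wherever the failed node held the primary."""
    result: Layout = {}
    for p, roles in layout.items():
        if roles["primary"] == failed:
            result[p] = {"primary": roles["backup"], "backup": None}
        elif roles["backup"] == failed:
            result[p] = {"primary": roles["primary"], "backup": None}
        else:
            result[p] = dict(roles)
    return result


layout = place(3, ["node1", "node2", "node3"])
after_failure = fail_node(layout, "node1")
print(after_failure)
```

Every partition still has a live primary after `node1` fails; the partitions that lost a copy would need a new backup provisioned elsewhere.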

40
Spark to Data Grid Partition Cardinality
• node 1: Spark executor, Spark Partition #1, direct connection to Grid Partition #1
• node 2: Spark executor, Spark Partition #2, direct connection to Grid Partition #2
• node 3: Spark executor, Spark Partition #3, direct connection to Grid Partition #3
Simple, but not enough parallelism for Spark

41
Pattern #4: Grid bucketing for higher throughput
• 1 Spark partition = M grid buckets
• 1 Grid partition = N Spark partitions
[Diagram: on node 1, a Spark executor reads Grid Primary #1, whose buckets 0 to 1023 are split between Spark Partition #1 and Spark Partition #2]
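The bucketing relation can be sketched as a split of one grid partition's buckets into contiguous ranges, one range per Spark partition. The 1024 buckets come from the diagram (0 to 1023); that they belong to a single grid partition, the contiguous split, and the `split_buckets` helper are assumptions made for illustration.

```python
from typing import List


def split_buckets(num_buckets: int, spark_partitions: int) -> List[range]:
    """Split buckets 0..num_buckets-1 into N contiguous ranges of about M each."""
    base, extra = divmod(num_buckets, spark_partitions)
    ranges: List[range] = []
    start = 0
    for i in range(spark_partitions):
        size = base + (1 if i < extra else 0)  # spread any remainder evenly
        ranges.append(range(start, start + size))
        start += size
    return ranges


# One grid partition with 1024 buckets read by N = 4 Spark partitions (M = 256).
ranges = split_buckets(1024, 4)
for index, bucket_range in enumerate(ranges, start=1):
    print(f"Spark Partition #{index}: buckets {bucket_range.start}..{bucket_range.stop - 1}")
```

Each Spark partition then queries only its own bucket range, so several Spark tasks can read one grid partition in parallel instead of sharing a single connection.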

42
Eventually, we productized this as
an open source Spark distribution

@InsightEdgeIO http://insightedge.io
Apache 2 License
http://insightedge.io/docs
http://insightedge.io/blog
http://github.com/InsightEdge

GigaSpaces InsightEdge
http://insightedge.io
High Performance Spark with OLTP Capabilities

Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read/write from SSD/Flash device
• Avoid filesystem hops and write amplification
[Diagram: Spark workers and application processing on top of the In-Memory Data Grid; primary instances replicate synchronously to backup instances, each backed by a storage array]

48
Point-of-Decision HTAP
XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
[Diagram: Transactions and Analytics clusters linked by In-Memory Data Grid realtime replication, exchanging:]
• Scoring models
• Trigger actions
• Events

49
Transportation / IoT: Connected Cars / Fleet Geo-Analytics
Challenge
• Stream data from 1,000s of taxis
• Actively monitor and generate real-time notifications
• Real-time route optimization and geo-fencing
Solution
• Leverage a unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, geo-fencing
[Diagram labels: Edge components, Data Sources]
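A circular geo-fence check of the kind mentioned above can be sketched with the haversine great-circle distance. This is a generic illustration, not the deployed solution: the fence centre, radius, and taxi positions are made-up values, and a production system would more likely rely on the grid's geospatial indexes.

```python
import math
from typing import Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def inside_fence(point: Tuple[float, float],
                 center: Tuple[float, float],
                 radius_km: float) -> bool:
    """True when the point lies within radius_km of the fence centre."""
    return haversine_km(*point, *center) <= radius_km


fence_center = (38.9072, -77.0369)   # hypothetical fence centred on Washington, DC
nearby_taxi = (38.9100, -77.0400)
distant_taxi = (40.7128, -74.0060)   # New York, NY

print(inside_fence(nearby_taxi, fence_center, 5.0))
print(inside_fence(distant_taxi, fence_center, 5.0))
```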
