1
Hybrid Transactional/Analytics Processing
with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2017. All rights reserved.
Ali Hodroj
VP, Products and Strategy @ahodroj
2
GigaSpaces
Ultra-Low Latency / High Throughput Middleware
Direct customers: 500+
Headquarters: New York, NY
Established: 2001
3
How we got HERE
4
We’re seeing more in our customer base
5
…a shift towards real-time
BI
Big Data
Fast Data
6
Sample Customer Use Cases
Internet of Things, Omni-Channel, Operational Intelligence
Operational Analytics
Predictive Analytics
Fraud Detection, Supply chain optimization
Personalization, Recommendation
Edge Analytics
Operational Intelligence, Predictive Maintenance, Spatial Analytics
7
In-Memory Computing
(not a new thing)
A rapid decline in RAM prices drives advanced data processing innovations
• Transactional (2001-present)
– In-Memory Databases
– In-Memory Data Grids
• Analytics (2012-present)
– In-Memory Data Processing Frameworks (Spark)
– In-Memory File Systems (Tachyon)
8
In-Memory Data Processing: Apache Spark
9
In-Memory Data Grid:
Online Transaction Processing at Low-Latency and High Throughput
A data grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing
http://xap.github.io
10
In-Memory Data Grid 101
Feeder
Virtual Machine, Virtual Machine, Virtual Machine
Partitioned Data
11
Write
Event-Driven / Reactive
In-Memory Data Grid 101: Execution Models
RPC / Master-Worker
13
In-Memory Data Grid 101: Typical Deployment
[Diagram labels: HTML and REST clients over HTTP/S; HW LB; HTTPD load balancers with LB agents; GSA on each machine; Processing units with Primary Sets 1-6 and Backup Sets 1-6; Mirror Service (async) to DB; Private or Public Cloud]
14
Host: Cisco UCS Server
CPU: Intel 16-core 2.9GHz
Concurrent Threads: 2
Throughput: 200, 400, 800 ops/sec
15
16
Hybrid Transactional/Analytics Processing at Scale
Requirements
1. Provide closed-loop analytics pipeline. Data, insight, to action at sub-second latency
2. IoT and Omni-channel require the convergence of many different data types
3. Blend of both real-time and historical data
Challenges
1. Bi-directional integration between transactional and analytical data stores
2. Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
3. Unified and scale-out real-time and historical data store
17
Our road towards HTAP:
SPARK + MICROSERVICES
18
What’s needed
Large-scale distributed analytics framework
Unified, scale-out, low-latency data store
Transactional capabilities: ACID, Event-Driven, Rich Data modeling
Microservices
19
Our approach to HTAP
Low-latency Scale-Out In-Memory Data Grid
+
Large-scale distributed analytics framework
20
Why Spark?
• Unified & Concise API
• Highly Flexible Data Store Integration
• Massive Community and Adoption
21
1
Bi-directional integration between transactional and analytical data stores
Provide closed-loop analytics pipeline. Data, insight, to action at scale (at sub-second latency)
22
23
In-Memory Data Grid
In-Memory Store (RAM), Flash, SSD, Off-Heap Store
Analytics Tier: Spark, Spark SQL, Spark Streaming, Machine Learning
Transactional Tier: ACID-compliant, Strong Consistency
High availability
Security & Management
24
Build a connector: Spark to IMDG
• Get Partitions: An array of partitions that a dataset is divided into
– IMDG Distributed Query to get partitions and their hosts
• Compute: A compute function to do a computation on partitions
– Iterator over portion of data
• Get Preferred Location: Optional preferred locations, i.e. hosts for a partition where the data will be loaded
– Hosts from Distributed Query
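A minimal sketch of the three connector hooks, written in plain Python rather than against the real Spark or grid APIs. ToyGrid, GridPartition, GridBackedDataset, the host names and the sample rows are all hypothetical stand-ins; the sketch only illustrates how a distributed query can feed the partition list, the per-partition iterator and the preferred hosts.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class GridPartition:
    """One grid partition as reported by a (hypothetical) distributed query."""
    index: int
    host: str


class ToyGrid:
    """Stand-in for an in-memory data grid: partition id -> (host, rows)."""

    def __init__(self, data: Dict[int, tuple]):
        self._data = data

    def distributed_query(self) -> List[GridPartition]:
        # Ask the grid which partitions exist and which host holds each one.
        return [GridPartition(i, host) for i, (host, _) in sorted(self._data.items())]

    def read(self, index: int) -> Iterator[dict]:
        # Stream the rows of a single partition.
        return iter(self._data[index][1])


class GridBackedDataset:
    """Mirrors the three hooks listed on the slide (names are illustrative)."""

    def __init__(self, grid: ToyGrid):
        self.grid = grid

    def get_partitions(self) -> List[GridPartition]:
        # Get Partitions: one dataset partition per grid partition.
        return self.grid.distributed_query()

    def compute(self, split: GridPartition) -> Iterator[dict]:
        # Compute: iterate over the portion of data held by this partition.
        return self.grid.read(split.index)

    def get_preferred_locations(self, split: GridPartition) -> List[str]:
        # Get Preferred Location: the host the distributed query reported.
        return [split.host]


grid = ToyGrid({
    0: ("node2", [{"id": 1, "city": "NY"}]),
    1: ("node3", [{"id": 2, "city": "DC"}, {"id": 3, "city": "NY"}]),
})
dataset = GridBackedDataset(grid)
rows_by_host = {
    dataset.get_preferred_locations(p)[0]: list(dataset.compute(p))
    for p in dataset.get_partitions()
}
```

In a real connector the scheduler, not the caller, would use the preferred hosts to place each task next to its data.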
25
Pattern #1: Data Locality (machine-level)
node 1: Spark master, Grid master
node 2: Spark worker, Grid Partition
node 3: Spark worker, Grid Partition
NoSQL Storage
26
Pattern #2: Pushdown Predicates (Grid-side processing)
Implementing DataSource API
SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012
Aggregation in Spark
Filtering and column pruning in Data Grid
Spark SQL architecture:
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
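The split of work in this pattern can be sketched in plain Python, assuming a toy order table and a hash index on city. The function names grid_scan and engine_sum and the sample rows are hypothetical, not part of Spark or the grid product; the point is only that filtering and column pruning run where the data lives, and the analytics engine aggregates whatever is shipped back.

```python
from collections import defaultdict

# Toy "order" table held by the grid (illustrative data only).
orders = [
    {"id": 1, "city": "NY", "year": 2013, "amount": 10.0},
    {"id": 2, "city": "NY", "year": 2011, "amount": 99.0},
    {"id": 3, "city": "DC", "year": 2014, "amount": 5.0},
    {"id": 4, "city": "NY", "year": 2015, "amount": 2.5},
]

# Grid side: a hash index on city, so the equality predicate avoids a full scan.
city_index = defaultdict(list)
for row in orders:
    city_index[row["city"]].append(row)


def grid_scan(city, min_year_exclusive, columns):
    """Grid-side work: index lookup, remaining filter, then column pruning."""
    for row in city_index.get(city, []):
        if row["year"] > min_year_exclusive:
            yield {c: row[c] for c in columns}


def engine_sum(rows, column):
    """Engine-side work: aggregate only the rows and columns that were shipped."""
    return sum(r[column] for r in rows)


# Equivalent of: SELECT SUM(amount) ... WHERE city = 'NY' AND year > 2012
shipped = list(grid_scan("NY", 2012, ["amount"]))
total = engine_sum(shipped, "amount")
```

Only two single-column rows cross the boundary here, instead of four full rows.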
27
Pattern #3: Decouple Data Processing from Data Storage
node 1: Spark master, Grid master
node 2: Spark worker, Grid Partition
node 3: Spark worker, Grid Partition
Lightweight workers, small JVMs
Large JVMs, Fast indexing
NoSQL Storage
28
Push-down Predicates performance
Traditional Spark filtering of 7MM records: 31 sec
vs
Grid-side + Spark filtering of 7MM records: 800 ms
29
2
Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API
IoT and Omni-channel require the convergence of many different data types
In-Memory Data Grid + Spark Convergence
Geo-Spatial Full Text
Simple K/V to RDD Mapping
POJO Domain Model to Spark
POJO Domain Model to Spark (Event-Driven)
JSON Domain Model to Spark
Geo-Spatial Data Frames
Geo-Spatial
Full Text Indexes + Lucene Analyzers
Full Text
37
3
Unified and scale-out real-time and historical data store
Blend of both real-time and historical data
38
In-Memory Data Grid Partitioning
hash(key) % #nodes
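The routing rule hash(key) % #nodes can be sketched as follows. The choice of CRC32 as the hash, the function name partition_for and the sample keys are assumptions made for illustration; the slides do not say which hash function the grid uses. A stable hash is used because Python's built-in hash() for strings varies between processes.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Route a key to a partition with hash(key) % #nodes (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Route 1,000 illustrative keys across 3 partitions and count the spread.
keys = [f"order-{i}" for i in range(1000)]
counts = [0, 0, 0]
for k in keys:
    counts[partition_for(k, 3)] += 1
```

The same key always lands on the same partition, which is what lets a client go straight to the node that owns it.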
39
In-Memory Data Grid Partitioning – With HA
hash(key) % #nodes
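One way to picture partitioning with high availability is a primary copy per partition plus a backup on a different node. The sketch below assumes one partition per node and places each backup on the next node in the ring; that placement rule, and the names placement and route, are illustrative assumptions, not the grid's documented behaviour.

```python
def placement(num_nodes: int) -> dict:
    """Map partition -> (primary node, backup node).

    Assumption for this sketch: one partition per node, backup on the next node.
    """
    return {p: (p, (p + 1) % num_nodes) for p in range(num_nodes)}


def route(partition: int, layout: dict, down_nodes: set) -> int:
    """Return the node that should serve the partition, failing over to the backup."""
    primary, backup = layout[partition]
    if primary not in down_nodes:
        return primary
    if backup not in down_nodes:
        return backup
    raise RuntimeError(f"partition {partition} has no live copy")


layout = placement(3)
healthy_target = route(0, layout, set())   # served by the primary
failover_target = route(0, layout, {0})    # primary is down, backup takes over
```

Keeping the backup off the primary's node is what allows a single node failure to be absorbed without losing the partition.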
40
Spark to Data Grid Partition Cardinality
node 1: Spark executor, Spark Partition #1, Grid Partition #1
node 2: Spark executor, Spark Partition #2, Grid Partition #2
node 3: Spark executor, Spark Partition #3, Grid Partition #3
Direct connection
Simple, but not enough parallelism for Spark
41
Pattern #4: Grid bucketing for higher throughput
node 1: Spark Executor, Grid Primary #1
Buckets 0, 1, 2, 3, 4, 5, ... 1023
Spark Partition #1, Spark Partition #2
1 Spark partition = M grid buckets
1 Grid partition = N Spark partitions
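A minimal sketch of the bucketing arithmetic, assuming 1,024 buckets per grid partition (buckets 0 to 1023, as drawn on the slide) and contiguous bucket ranges per Spark partition. The function names and the contiguous-range layout are assumptions for illustration.

```python
from typing import List

BUCKETS_PER_GRID_PARTITION = 1024  # buckets 0..1023, as drawn on the slide


def split_buckets(spark_partitions: int) -> List[range]:
    """Split one grid partition's buckets into N contiguous Spark partitions.

    Each returned range is the M buckets that one Spark partition reads.
    """
    n = spark_partitions
    if not 1 <= n <= BUCKETS_PER_GRID_PARTITION:
        raise ValueError("need between 1 and 1024 Spark partitions")
    base, extra = divmod(BUCKETS_PER_GRID_PARTITION, n)
    ranges = []
    start = 0
    for i in range(n):
        # Spread any remainder over the first `extra` Spark partitions.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def bucket_for(key_hash: int) -> int:
    """Assign an entry to a bucket inside its grid partition."""
    return key_hash % BUCKETS_PER_GRID_PARTITION


four_way = split_buckets(4)  # N = 4 Spark partitions, M = 256 buckets each
```

Raising N gives Spark more tasks per grid partition without changing how the grid itself is partitioned.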
42
Eventually, we productized this as an open source Spark distribution
@InsightEdgeIO http://insightedge.io
Apache 2 License
http://insightedge.io/docs
http://insightedge.io/blog
http://github.com/InsightEdge
GigaSpaces InsightEdge
http://insightedge.io
High Performance Spark with OLTP Capabilities
upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)
[Diagram labels: Application; Processing; Primary instances; Backup instances; Sync Replication; Storage Array (x2); In Memory Data Grid; Spark worker (x2)]
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read/write from SSD/Flash device
• Avoid filesystem hops and write amplification
46
REFERENCE ARCHITECTURES
47
In-Process HTAP
48
Point-of-Decision HTAP
In-Memory Data Grid
Transactions, Analytics
Realtime Replication
• Scoring models
• Trigger actions
• Events
XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
49
Transportation / IoT: Connected Cars / Fleet Geo-Analytics
Challenge
• Stream data from 1,000s of Taxis
• Actively monitor and generate real-time notifications
• Real-time Route Optimization and Geo-Fencing
Solution
• Leverage unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, Geo-fencing
Edge components
Data Sources
50
THANK YOU!
QUESTIONS?

Similar to Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
kgshukla
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
DataWorks Summit/Hadoop Summit
 
SnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark MeetupSnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark Meetup
SnappyData
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
Denodo
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
Selvaraj Kesavan
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
Red Hat Storage: Emerging Use Cases
Red Hat Storage: Emerging Use CasesRed Hat Storage: Emerging Use Cases
Red Hat Storage: Emerging Use Cases
Red_Hat_Storage
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Rizaldy Ignacio
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
Iulia Emanuela Iancuta
 
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Building a Single Logical Data Lake: For Advanced Analytics, Data Science, an...
Denodo
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
DATAVERSITY
 
Fom io t_to_bigdata_step_by_step-final
Fom io t_to_bigdata_step_by_step-finalFom io t_to_bigdata_step_by_step-final
Fom io t_to_bigdata_step_by_step-final
Luis Filipe Silva
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
DataStax
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
SingleStore
 
Continuous Intelligence - Intersecting Event-Based Business Logic and ML
Continuous Intelligence - Intersecting Event-Based Business Logic and MLContinuous Intelligence - Intersecting Event-Based Business Logic and ML
Continuous Intelligence - Intersecting Event-Based Business Logic and ML
Paris Carbone
 

Similar to Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids (20)

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Pivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream AnalyticsPivotal Real Time Data Stream Analytics
Pivotal Real Time Data Stream Analytics
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
SnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark MeetupSnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark Meetup
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids

  • 1. Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids. Copyright © GigaSpaces 2017. All rights reserved. Ali Hodroj, VP, Products and Strategy, @ahodroj
  • 2. GigaSpaces: Ultra-Low Latency / High Throughput Middleware. Direct customers: 500+. Headquarters: New York, NY. Established: 2001.
  • 4. We’re seeing more in our customer base
  • 5. …a shift towards real-time: BI, Big Data, Fast Data
  • 6. Sample Customer Use Cases: Internet of Things; Omni-Channel; Operational Intelligence; Operational Analytics; Predictive Analytics; Fraud Detection, Supply chain optimization; Personalization, Recommendation; Edge Analytics; Operational Intelligence, Predictive Maintenance, Spatial Analytics
  • 7. In-Memory Computing (not a new thing). A rapid decline in RAM prices drives advanced data processing innovations: • Transactional (2001-present) – In-Memory Databases – In-Memory Data Grids • Analytics (2012-present) – In-Memory Data Processing Frameworks (Spark) – In-Memory File Systems (Tachyon)
  • 9. In-Memory Data Grid: Online Transaction Processing at Low-Latency and High Throughput. A Data Grid is a cluster of machines that work together to create a resilient shared data fabric for low-latency data access and extreme transaction processing. http://xap.github.io
  • 10. In-Memory Data Grid 101 (diagram labels: Feeder; three Virtual Machines; Partitioned Data)
  • 11. In-Memory Data Grid 101: Execution Models (diagram labels: Write; Event-Driven / Reactive; RPC / Master-Worker)
  • 12. In-Memory Data Grid 101: Execution Models (diagram labels: Write; Event-Driven / Reactive; RPC / Master-Worker)
  • 13. In-Memory Data Grid 101: Typical Deployment (diagram labels: HTML and REST clients over HTTP/S; HW LB; HTTPD Load Balancers with LB Agents; GSA on each host; Processing units; Primary Sets 1 to 6; Backup Sets 1 to 6; Mirror Service; DB; Async; Private or Public Cloud)
  • 14. Host: Cisco UCS Server. CPU: Intel 16-core 2.9GHz. Concurrent Threads: 2. Throughput: 200, 400, 800 ops/sec
  • 16. Hybrid Transactional/Analytics Processing at Scale. Requirements: (1) Provide closed-loop analytics pipeline: data, insight, to action at sub-second latency; (2) IoT and Omni-channel require the convergence of many different data types; (3) Blend of both real-time and historical data. Challenges: (1) Bi-directional integration between transactional and analytical data stores; (2) Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API; (3) Unified and scale-out real-time and historical data store.
  • 18. What’s needed: Large-scale distributed analytics framework; Unified, scale-out, low-latency data store; Transactional capabilities: ACID, Event-Driven, Rich Data modeling; Microservices
  • 19. Our approach to HTAP: Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
  • 20. Why Spark? • Unified & Concise API • Highly Flexible Data Store Integration • Massive Community and Adoption
  • 21. (1) Bi-directional integration between transactional and analytical data stores. Provide closed-loop analytics pipeline: data, insight, to action at scale (at sub-seconds)
  • 23. Analytics Tier: Spark (Spark SQL, Spark Streaming, Machine Learning). Transactional Tier (ACID-compliant, Strong Consistency): In-Memory Data Grid with In-Memory Store (RAM) and Flash, SSD, Off-Heap Store. Across both tiers: High availability, Security & Management
  • 24. Build a connector: Spark to IMDG • Get Partitions: an array of partitions that a dataset is divided into (IMDG: Distributed Query to get partitions and their hosts) • Compute: a compute function to do a computation on partitions (IMDG: iterator over a portion of the data) • Get Preferred Location: optional preferred locations, i.e. hosts for a partition where the data will be loaded (IMDG: hosts from the Distributed Query)
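
The three hooks named on slide 24 can be sketched in plain Python without Spark. This is only an illustration of the contract: `ToyGrid`, `GridPartition`, `GridRDD` and `distributed_query` are hypothetical names, and the grid is a dictionary standing in for a real IMDG client.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class GridPartition:
    index: int  # grid partition id
    host: str   # machine that holds this partition (assumed known to the grid)


class ToyGrid:
    """Hypothetical stand-in for an IMDG: partition id -> (host, rows)."""

    def __init__(self, data: Dict[int, Tuple[str, List[dict]]]):
        self._data = data

    def distributed_query(self) -> List[GridPartition]:
        # Ask the grid which partitions exist and on which hosts they live.
        return [GridPartition(i, host) for i, (host, _) in sorted(self._data.items())]

    def read(self, partition: GridPartition) -> Iterator[dict]:
        # Iterate over the rows of a single partition.
        return iter(self._data[partition.index][1])


class GridRDD:
    """Mirrors the three hooks: partitions, compute, preferred locations."""

    def __init__(self, grid: ToyGrid):
        self.grid = grid

    def get_partitions(self) -> List[GridPartition]:
        return self.grid.distributed_query()

    def compute(self, partition: GridPartition) -> Iterator[dict]:
        return self.grid.read(partition)

    def get_preferred_locations(self, partition: GridPartition) -> List[str]:
        # The scheduler can use this hint to run the task next to the data.
        return [partition.host]


grid = ToyGrid({0: ("node2", [{"id": 1}, {"id": 2}]), 1: ("node3", [{"id": 3}])})
rdd = GridRDD(grid)
parts = rdd.get_partitions()
rows = [row for p in parts for row in rdd.compute(p)]
```

Returning the partition's host from `get_preferred_locations` is what makes the machine-level data locality of Pattern #1 possible.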
  • 25. Pattern #1: Data Locality (machine-level). Diagram: node 1 hosts the Spark master and Grid master; node 2 and node 3 each host a Spark worker and a Grid Partition; NoSQL Storage
  • 26. Pattern #2: Pushdown Predicates (Grid-side processing). Example: SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012, with aggregation in Spark and filtering and column pruning in the Data Grid. Spark SQL architecture, implementing the DataSource API: • Pushing down predicates to Data Grid • Leveraging indexes • Transparent to user • Enabling support for other languages - Python/R
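
A minimal sketch of the split described on slide 26, using the slide's example query. The function names `grid_scan` and `spark_sum` and the sample rows are hypothetical; a real connector would express this through Spark's DataSource API rather than plain functions.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def grid_scan(rows: Iterable[Row],
              predicate: Callable[[Row], bool],
              columns: List[str]) -> Iterator[Row]:
    """Grid side: apply the WHERE clause, then return only the needed columns."""
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in columns}


def spark_sum(partitions: Iterable[Iterable[Row]], column: str) -> float:
    """Spark side: aggregate rows that were already filtered and pruned."""
    return sum(row[column] for part in partitions for row in part)


# Made-up orders, split across two grid partitions.
orders_by_partition = [
    [{"city": "NY", "year": 2013, "amount": 10.0},
     {"city": "NY", "year": 2011, "amount": 99.0}],
    [{"city": "SF", "year": 2014, "amount": 5.0},
     {"city": "NY", "year": 2015, "amount": 2.5}],
]

# WHERE city = 'NY' AND year > 2012
where = lambda r: r["city"] == "NY" and r["year"] > 2012

pruned = [list(grid_scan(p, where, ["amount"])) for p in orders_by_partition]
total = spark_sum(pruned, "amount")
```

Only two single-column rows leave the grid here instead of four full rows, which is the effect the pattern relies on at larger scale.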
  • 27. Pattern #3: Decouple Data Processing from Data Storage. Diagram: node 1 hosts the Spark master and Grid master; node 2 and node 3 each host a Spark worker (lightweight workers, small JVMs) and a Grid Partition (large JVMs, fast indexing); NoSQL Storage
  • 28. Push-down Predicates performance: traditional Spark filtering of 7MM records, 31 sec, vs Grid-side + Spark filtering of 7MM records, 800 ms
  • 29. (2) Ability to support POJO, JSON, GeoSpatial, and Unstructured types through a unified API. IoT and Omni-channel require the convergence of many different data types
  • 30. In-Memory Data Grid + Spark Convergence: Geo-Spatial, Full Text
  • 31. Simple K/V to RDD Mapping
  • 32. POJO Domain Model to Spark
  • 33. POJO Domain Model to Spark (Event-Driven)
  • 34. JSON Domain Model to Spark
  • 36. Full Text Indexes + Lucene Analyzers
  • 37. (3) Unified and scale-out real-time and historical data store. Blend of both real-time and historical data
  • 38. In-Memory Data Grid Partitioning: hash(key) % #nodes
  • 39. In-Memory Data Grid Partitioning – With HA: hash(key) % #nodes
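
The routing rule on slides 38 and 39, `hash(key) % #nodes`, can be sketched as follows. Two things here are assumptions and not from the slides: CRC32 is used as the hash so the result is stable across runs, and the backup copy is placed on the next node, which is just one simple placement policy.

```python
import zlib


def partition_for(key: str, num_nodes: int) -> int:
    """Route a key to its primary partition: hash(key) % #nodes."""
    # CRC32 is an assumed stand-in for the grid's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def backup_node_for(primary: int, num_nodes: int) -> int:
    """Hypothetical HA policy: keep the backup copy on the next node."""
    return (primary + 1) % num_nodes


keys = ["order-1", "order-2", "order-3", "order-4"]
placement = {k: partition_for(k, 3) for k in keys}
backups = {k: backup_node_for(p, 3) for k, p in placement.items()}
```

Keeping the backup on a different node than the primary is what lets the grid survive the loss of a single machine.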
  • 40. Spark to Data Grid Partition Cardinality. Direct connection: node 1 (Spark executor, Spark Partition #1, Grid Partition #1); node 2 (Spark executor, Spark Partition #2, Grid Partition #2); node 3 (Spark executor, Spark Partition #3, Grid Partition #3). Simple, but not enough parallelism for Spark
  • 41. Pattern #4: Grid bucketing for higher throughput. Diagram: node 1 runs a Spark Executor and Grid Primary #1, whose data is divided into buckets 0 to 1023; Spark Partition #1 and Spark Partition #2 each cover a range of buckets. 1 Spark partition = M grid buckets; 1 Grid partition = N Spark partitions
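
The bucketing arithmetic on slide 41 can be sketched as below. The sketch assumes that each grid partition has its own 1024 buckets (ids 0 to 1023, as drawn on the slide) and that each Spark partition takes a contiguous bucket range; `bucket_ranges` and `spark_partitions` are hypothetical helper names.

```python
from typing import List, Tuple

BUCKETS = 1024  # bucket ids 0..1023 per grid partition, as drawn on the slide


def bucket_ranges(n: int) -> List[range]:
    """Split the buckets of one grid partition into n contiguous ranges."""
    base, extra = divmod(BUCKETS, n)
    ranges, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread any remainder evenly
        ranges.append(range(start, start + size))
        start += size
    return ranges


def spark_partitions(grid_partitions: int, n: int) -> List[Tuple[int, range]]:
    """1 grid partition = n Spark partitions, each covering M = ~1024/n buckets."""
    return [(g, r) for g in range(grid_partitions) for r in bucket_ranges(n)]


plan = spark_partitions(grid_partitions=3, n=4)
```

With 3 grid partitions and N = 4, Spark gets 12 partitions of M = 256 buckets each, instead of the 3 it would get from the one-to-one mapping on slide 40.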
  • 42. Eventually, we productized this as an open source Spark distribution
  • 43. @InsightEdgeIO http://insightedge.io Apache 2 License http://insightedge.io/docs http://insightedge.io/blog http://github.com/InsightEdge
  • 45. Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers) • Significant RAM TCO reduction in Spark clusters • Direct RDD/DataFrame read/write from SSD/Flash device • Avoid filesystem hops and write amplification. Diagram labels: Application; Processing; Primary instances; Backup instances; Sync Replication; Storage Arrays; In Memory Data Grid; Spark workers
  • 48. Point-of-Decision HTAP. XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication. Diagram labels: In-Memory Data Grid; Transactions; Analytics; Realtime Replication • Scoring models • Trigger actions • Events
  • 49. Transportation / IoT: Connected Cars / Fleet Geo-Analytics. Challenge: • Stream data from 1,000s of Taxis • Actively monitor and generate real-time notifications • Real-time Route Optimization and Geo-Fencing. Solution: • Leverage unified in-memory data fabric as middleware for geo-spatial analytics • Elastically scale stream processing and transactional apps together • Location-based tracking, Geo-fencing. Diagram labels: Edge components; Data Sources