SlideShare a Scribd company logo
1 of 15
Download to read offline
Real-Time Insights by
Leveraging Spark with
Aerospike
Aerospike Spark Connector
Zohar Elkayam, Aerospike
2 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
▪ Where is Aerospike Spark Connecter located in the EcoSystem
▪ A Quick Overview of Aerospike Spark Connector
▪ Some Code Example
▪ Scaling up: A Customer Story
Agenda
3 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
Data Warehouse Data Lake
Legacy RDBMS HDFS Based
Aerospike Simplifies Real-time Architecture at any Scale
Aerospike
Database
SoE Location 1
SoE Location 2
SoE Location 3
XDR
XDR
Transactional
Systems
Aerospike
Database
XDR
XDR
Enterprise Environment
Transactional
Systems
Legacy Database
(Mainframe)
RDBMS
Database
Delivering Extreme Scalability:
✓ Simplicity
✓ Maintainability
✓ Durability
✓ Strong Consistency
✓ Scalability
✓ Low Cost ($)
✓ Less Data Drag
XDR Legacy RDBMS
Data LakeReal-time Data Warehouse
System of Record Query &
Reporting Store
XDR
4 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Aerospike Connect for Spark
5 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
Aerospike Connect for Spark
Example Use Cases
✓ Fraud prevention: transaction data via
streaming and need to analyze based on
historical data in real time
✓ Recommendation Engines: Real-time
recommendations and targeting based on user
behavior
✓ Ad Tech: Ad Fraud and real-time retargeting
base on user behavior
✓ Digital Identity Management
✓ Industrial Internet of Things (IIoT): Real-time &
closed loop business decisions
6 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
• Spark connection for Aerospike – both loading the data and using it as dataframe (i.e.
Spark SQL) or by using it as streamed data
• Supports Scala (spark-shell) for all Aerospike’s Spark Operations
• Support Python (pyspark) for some operations – Dataset operations not supported
• Guide: https://www.aerospike.com/docs/connectors/enterprise/spark/index.html
Aerospike Connect for Spark
7 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
• Use SparkSQL to fetch data from Aerospike
• Aerospike Connect for Spark provides the capability to use Spark SQL in order to
query records from an Aerospike cluster.
• Load Aerospike data into Spark for processing
• Load data from Aerospike into DataFrames for processing
• The connector support Scan and Queries (secondary indexes)
• Save data from DataFrame back into Aerospike
• A DataFrame can be saved in Aerospike by specifying a column in the DataFrame as
the Primary Key or the Digest.
• Joins Data using Aerospike [Scala Only]
• Provides an AeroJoin function which allows you to read records from Aerospike given
a Dataset which contains keys to the records of interest.
• This operation takes advantage of Aerospike's batch read functionality.
Aerospike Spark Operations
8 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Aerospike Spark Example: Spark SQL
9 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Save DataFrame to Aerospike (by Key, with schema)
10 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Aerospike Spark Example: AeroJoin
11 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
• Spark partition data for workers, supervised by executor (one per spark node)
• Aerospike scan (pre-4.9) scans data by Aerospike node (one per Aerospike node)
• That means there is a mismatch in parallization between the number of cores on the spark
side and the number of nodes on Aerospike side
Customer Story: Is Scaling an Issue?
12 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Data is distributed evenly across nodes in a cluster using the Aerospike Smart
Partitions™ algorithm.
▪ Automatic Sharding
▪ 4096 Data Partitions
▪ Even distribution of
▪ Partitions across nodes
▪ Records across Partitions
▪ Data across Flash devices
▪ Primary and Replica Partitions
Aerospike Partitions: Even Data Distribution
13 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
• Customer Environment:
• 33 Aerospike nodes
• Over 10B objects, over 125TB unique data
• ~200 Spark Nodes with 36 core each (~7200 total cores/workers)
• The Problem: Less than 1 percent utilization on the spark side in data load operation.
• The Change: Aerospike 4.9 will allow scanning of partitions instead on nodes so 4096
partitions, Aerospike Spark Connector 2.0 Supports partition scan.
• The Result:
• The customer got a RC for Aerospike 4.9 + Spark Connector 2.0
• Using over 10B unique records (125TB unique data) was scanned, load and
filtered in ~45 minutes.
Customer Story: Scaling Things Up (With 4.9 RC Access)
14 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Time for Q&A!
15 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc
Thank You!
zelkayam@aerospike.com

More Related Content

What's hot

Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Alluxio, Inc.
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesAlluxio, Inc.
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Presto + Alluxio on steroids a romantic drama on Production with happy end
Presto + Alluxio on steroids a romantic drama on Production with happy endPresto + Alluxio on steroids a romantic drama on Production with happy end
Presto + Alluxio on steroids a romantic drama on Production with happy endAlluxio, Inc.
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()Steve Loughran
 
CtrlS - DR on Demand
CtrlS - DR on DemandCtrlS - DR on Demand
CtrlS - DR on DemandCTRLS
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike WayAerospike, Inc.
 
Infra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKInfra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKRob Mueller
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Matillion
 
Webinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift SpectrumWebinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift SpectrumMatillion
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 

What's hot (20)

Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with Kubernetes
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Presto + Alluxio on steroids a romantic drama on Production with happy end
Presto + Alluxio on steroids a romantic drama on Production with happy endPresto + Alluxio on steroids a romantic drama on Production with happy end
Presto + Alluxio on steroids a romantic drama on Production with happy end
 
PUT is the new rename()
PUT is the new rename()PUT is the new rename()
PUT is the new rename()
 
CtrlS - DR on Demand
CtrlS - DR on DemandCtrlS - DR on Demand
CtrlS - DR on Demand
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike Way
 
Infra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASKInfra space talk on Apache Spark - Into to CASK
Infra space talk on Apache Spark - Into to CASK
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Getting Started With Amazon Redshift
Getting Started With Amazon Redshift Getting Started With Amazon Redshift
Getting Started With Amazon Redshift
 
Webinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift SpectrumWebinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift Spectrum
 
Tame that Beast
Tame that BeastTame that Beast
Tame that Beast
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 

Similar to Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020

Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...HostedbyConfluent
 
C5 journey to_the_cloud_with_oracle_sparc
C5 journey to_the_cloud_with_oracle_sparcC5 journey to_the_cloud_with_oracle_sparc
C5 journey to_the_cloud_with_oracle_sparcDr. Wilfred Lin (Ph.D.)
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAmazon Web Services
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Configuring Aerospike - Part 2
Configuring Aerospike - Part 2 Configuring Aerospike - Part 2
Configuring Aerospike - Part 2 Aerospike, Inc.
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Web Services
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf
 
Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Wei Gong
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7MarketingArrowECS_CZ
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Databricks
 
Lift and shift to sparc cloud
Lift and shift to sparc cloudLift and shift to sparc cloud
Lift and shift to sparc cloudRiccardo Romani
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedOmid Vahdaty
 

Similar to Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020 (20)

Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
Distributed Data Storage & Streaming for Real-time Decisioning Using Kafka, S...
 
C5 journey to_the_cloud_with_oracle_sparc
C5 journey to_the_cloud_with_oracle_sparcC5 journey to_the_cloud_with_oracle_sparc
C5 journey to_the_cloud_with_oracle_sparc
 
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWSAWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
AWS April 2016 Webinar Series - Best Practices for Apache Spark on AWS
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 
Configuring Aerospike - Part 2
Configuring Aerospike - Part 2 Configuring Aerospike - Part 2
Configuring Aerospike - Part 2
 
Amazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration ServiceAmazon Aurora and AWS Database Migration Service
Amazon Aurora and AWS Database Migration Service
 
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
 
Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
 
Lift and shift to sparc cloud
Lift and shift to sparc cloudLift and shift to sparc cloud
Lift and shift to sparc cloud
 
Why_Oracle_Hardware.ppt
Why_Oracle_Hardware.pptWhy_Oracle_Hardware.ppt
Why_Oracle_Hardware.ppt
 
Oracle Cloud Infrastructure
Oracle Cloud InfrastructureOracle Cloud Infrastructure
Oracle Cloud Infrastructure
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
 

More from Aerospike

Aerospike-AppsFlyer COVID-19 Crisis Growth Elad Leev
Aerospike-AppsFlyer COVID-19 Crisis Growth Elad LeevAerospike-AppsFlyer COVID-19 Crisis Growth Elad Leev
Aerospike-AppsFlyer COVID-19 Crisis Growth Elad LeevAerospike
 
Contentsquare Aerospike Usage and COVID-19 Impact - Doron Hoffman
Contentsquare Aerospike Usage and COVID-19 Impact - Doron HoffmanContentsquare Aerospike Usage and COVID-19 Impact - Doron Hoffman
Contentsquare Aerospike Usage and COVID-19 Impact - Doron HoffmanAerospike
 
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...Aerospike
 
Aerospike Meetup - Introduction - Ami - 04 March 2020
Aerospike Meetup - Introduction - Ami - 04 March 2020Aerospike Meetup - Introduction - Ami - 04 March 2020
Aerospike Meetup - Introduction - Ami - 04 March 2020Aerospike
 
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020Aerospike
 
Aerospike Roadmap Overview - Meetup Dec 2019
Aerospike Roadmap Overview - Meetup Dec 2019Aerospike Roadmap Overview - Meetup Dec 2019
Aerospike Roadmap Overview - Meetup Dec 2019Aerospike
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike
 
Aerospike Data Modeling - Meetup Dec 2019
Aerospike Data Modeling - Meetup Dec 2019Aerospike Data Modeling - Meetup Dec 2019
Aerospike Data Modeling - Meetup Dec 2019Aerospike
 
JDBC Driver for Aerospike - Meetup Dec 2019
JDBC Driver for Aerospike - Meetup Dec 2019JDBC Driver for Aerospike - Meetup Dec 2019
JDBC Driver for Aerospike - Meetup Dec 2019Aerospike
 

More from Aerospike (9)

Aerospike-AppsFlyer COVID-19 Crisis Growth Elad Leev
Aerospike-AppsFlyer COVID-19 Crisis Growth Elad LeevAerospike-AppsFlyer COVID-19 Crisis Growth Elad Leev
Aerospike-AppsFlyer COVID-19 Crisis Growth Elad Leev
 
Contentsquare Aerospike Usage and COVID-19 Impact - Doron Hoffman
Contentsquare Aerospike Usage and COVID-19 Impact - Doron HoffmanContentsquare Aerospike Usage and COVID-19 Impact - Doron Hoffman
Contentsquare Aerospike Usage and COVID-19 Impact - Doron Hoffman
 
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...
Handling Increasing Load and Reducing Costs During COVID-19 Crisis - Oshrat &...
 
Aerospike Meetup - Introduction - Ami - 04 March 2020
Aerospike Meetup - Introduction - Ami - 04 March 2020Aerospike Meetup - Introduction - Ami - 04 March 2020
Aerospike Meetup - Introduction - Ami - 04 March 2020
 
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020
Aerospike Meetup - Nielsen Customer Story - Alex - 04 March 2020
 
Aerospike Roadmap Overview - Meetup Dec 2019
Aerospike Roadmap Overview - Meetup Dec 2019Aerospike Roadmap Overview - Meetup Dec 2019
Aerospike Roadmap Overview - Meetup Dec 2019
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019
 
Aerospike Data Modeling - Meetup Dec 2019
Aerospike Data Modeling - Meetup Dec 2019Aerospike Data Modeling - Meetup Dec 2019
Aerospike Data Modeling - Meetup Dec 2019
 
JDBC Driver for Aerospike - Meetup Dec 2019
JDBC Driver for Aerospike - Meetup Dec 2019JDBC Driver for Aerospike - Meetup Dec 2019
JDBC Driver for Aerospike - Meetup Dec 2019
 

Recently uploaded

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Recently uploaded (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020

  • 1. Real-Time Insights by Leveraging Spark with Aerospike Aerospike Spark Connector Zohar Elkayam, Aerospike
  • 2. 2 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc. ▪ Where is Aerospike Spark Connecter located in the EcoSystem ▪ A Quick Overview of Aerospike Spark Connector ▪ Some Code Example ▪ Scaling up: A Customer Story Agenda
  • 3. 3 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc. Data Warehouse Data Lake Legacy RDBMS HDFS Based Aerospike Simplifies Real-time Architecture at any Scale Aerospike Database SoE Location 1 SoE Location 2 SoE Location 3 XDR XDR Transactional Systems Aerospike Database XDR XDR Enterprise Environment Transactional Systems Legacy Database (Mainframe) RDBMS Database Delivering Extreme Scalability: ✓ Simplicity ✓ Maintainability ✓ Durability ✓ Strong Consistency ✓ Scalability ✓ Low Cost ($) ✓ Less Data Drag XDR Legacy RDBMS Data LakeReal-time Data Warehouse System of Record Query & Reporting Store XDR
  • 4. 4 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Aerospike Connect for Spark
  • 5. 5 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc. Aerospike Connect for Spark Example Use Cases ✓ Fraud prevention: transaction data via streaming and need to analyze based on historical data in real time ✓ Recommendation Engines: Real-time recommendations and targeting based on user behavior ✓ Ad Tech: Ad Fraud and real-time retargeting base on user behavior ✓ Digital Identity Management ✓ Industrial Internet of Things (IIoT): Real-time & closed loop business decisions
  • 6. 6 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc • Spark connection for Aerospike – both loading the data and using it as dataframe (i.e. Spark SQL) or by using it as streamed data • Supports Scala (spark-shell) for all Aerospike’s Spark Operations • Support Python (pyspark) for some operations – Dataset operations not supported • Guide: https://www.aerospike.com/docs/connectors/enterprise/spark/index.html Aerospike Connect for Spark
  • 7. 7 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc • Use SparkSQL to fetch data from Aerospike • Aerospike Connect for Spark provides the capability to use Spark SQL in order to query records from an Aerospike cluster. • Load Aerospike data into Spark for processing • Load data from Aerospike into DataFrames for processing • The connector support Scan and Queries (secondary indexes) • Save data from DataFrame back into Aerospike • A DataFrame can be saved in Aerospike by specifying a column in the DataFrame as the Primary Key or the Digest. • Joins Data using Aerospike [Scala Only] • Provides an AeroJoin function which allows you to read records from Aerospike given a Dataset which contains keys to the records of interest. • This operation takes advantage of Aerospike's batch read functionality. Aerospike Spark Operations
  • 8. 8 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Aerospike Spark Example: Spark SQL
  • 9. 9 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Save DataFrame to Aerospike (by Key, with schema)
  • 10. 10 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Aerospike Spark Example: AeroJoin
  • 11. 11 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc • Spark partition data for workers, supervised by executor (one per spark node) • Aerospike scan (pre-4.9) scans data by Aerospike node (one per Aerospike node) • That means there is a mismatch in parallization between the number of cores on the spark side and the number of nodes on Aerospike side Customer Story: Is Scaling an Issue?
  • 12. 12 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Data is distributed evenly across nodes in a cluster using the Aerospike Smart Partitions™ algorithm. ▪ Automatic Sharding ▪ 4096 Data Partitions ▪ Even distribution of ▪ Partitions across nodes ▪ Records across Partitions ▪ Data across Flash devices ▪ Primary and Replica Partitions Aerospike Partitions: Even Data Distribution
  • 13. 13 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc • Customer Environment: • 33 Aerospike nodes • Over 10B objects, over 125TB unique data • ~200 Spark Nodes with 36 core each (~7200 total cores/workers) • The Problem: Less than 1 percent utilization on the spark side in data load operation. • The Change: Aerospike 4.9 will allow scanning of partitions instead on nodes so 4096 partitions, Aerospike Spark Connector 2.0 Supports partition scan. • The Result: • The customer got a RC for Aerospike 4.9 + Spark Connector 2.0 • Using over 10B unique records (125TB unique data) was scanned, load and filtered in ~45 minutes. Customer Story: Scaling Things Up (With 4.9 RC Access)
  • 14. 14 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Time for Q&A!
  • 15. 15 A E R O S P I K E | Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc Thank You! zelkayam@aerospike.com