Aerospike Meetup - Real Time Insights using Spark with Aerospike - Zohar - 04 March 2020

Real-Time Insights by Leveraging Spark with Aerospike
Aerospike Spark Connector
Zohar Elkayam, Aerospike
2 Proprietary & Confidential | All rights reserved. © 2020 Aerospike Inc.
Agenda
▪ Where the Aerospike Spark Connector Is Located in the Ecosystem
▪ A Quick Overview of the Aerospike Spark Connector
▪ Some Code Examples
▪ Scaling Up: A Customer Story
Aerospike Simplifies Real-time Architecture at any Scale
[Architecture diagram. Labels: Aerospike Database at SoE Locations 1, 2 and 3; Aerospike Database in the Enterprise Environment; XDR links; Transactional Systems; Legacy Database (Mainframe); RDBMS Database; Legacy RDBMS Data Warehouse; HDFS-based Data Lake; System of Record; Real-time Data Warehouse; Query & Reporting Store]
Delivering Extreme Scalability:
✓ Simplicity
✓ Maintainability
✓ Durability
✓ Strong Consistency
✓ Scalability
✓ Low Cost ($)
✓ Less Data Drag
Aerospike Connect for Spark
Aerospike Connect for Spark
Example Use Cases
✓ Fraud prevention: transaction data arrives via streaming and must be analyzed against historical data in real time
✓ Recommendation engines: real-time recommendations and targeting based on user behavior
✓ Ad tech: ad fraud and real-time retargeting based on user behavior
✓ Digital identity management
✓ Industrial Internet of Things (IIoT): real-time, closed-loop business decisions
Aerospike Connect for Spark
• Connects Spark to Aerospike: load the data and use it as a DataFrame (i.e. with Spark SQL) or as streamed data
• Supports Scala (spark-shell) for all of Aerospike's Spark operations
• Supports Python (pyspark) for some operations; Dataset operations are not supported
• Guide: https://www.aerospike.com/docs/connectors/enterprise/spark/index.html
Aerospike Spark Operations
• Use Spark SQL to fetch data from Aerospike
  • Aerospike Connect for Spark lets you use Spark SQL to query records from an Aerospike cluster.
• Load Aerospike data into Spark for processing
  • Load data from Aerospike into DataFrames for processing.
  • The connector supports scans and queries (secondary indexes).
• Save data from a DataFrame back into Aerospike
  • A DataFrame can be saved to Aerospike by specifying one of its columns as the primary key or the digest.
• Join data using Aerospike [Scala only]
  • Provides an AeroJoin function that reads records from Aerospike given a Dataset containing the keys of the records of interest.
  • This operation takes advantage of Aerospike's batch read functionality.
Aerospike Spark Example: Spark SQL
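A minimal PySpark sketch of reading an Aerospike set into a DataFrame and querying it with Spark SQL. The data source name and the option keys (`aerospike.seedhost`, `aerospike.namespace`, `aerospike.set`) are assumptions recalled from the connector guide and may differ between connector versions; the helper names, host, namespace, set and query are hypothetical.

```python
# Sketch only: the format name and option keys are assumptions; check them
# against the connector guide for the version you run.
AEROSPIKE_FORMAT = "com.aerospike.spark.sql"  # assumed data source name


def aerospike_read_options(seed_host, namespace, set_name):
    """Build reader options (hypothetical helper, not part of the connector)."""
    return {
        "aerospike.seedhost": seed_host,
        "aerospike.namespace": namespace,
        "aerospike.set": set_name,
    }


def query_set(spark, options, view_name, sql):
    """Load a set as a DataFrame, register it as a temp view, run Spark SQL."""
    df = spark.read.format(AEROSPIKE_FORMAT).options(**options).load()
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)


# Hypothetical host, namespace and set names.
opts = aerospike_read_options("127.0.0.1", "test", "users")

# With a live SparkSession `spark` and the connector JAR on the classpath:
# adults = query_set(spark, opts, "users", "SELECT * FROM users WHERE age >= 18")
# adults.show()
```

The Spark session is passed in rather than created here, so the same function can be used from `pyspark` or from a submitted job.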
Save DataFrame to Aerospike (by Key, with schema)
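A minimal PySpark sketch of saving a DataFrame with an explicit schema, using one column as the record key. The `aerospike.updateByKey` option and the data source name are assumptions recalled from the connector guide; the helper names, column names and sample rows are hypothetical.

```python
# Sketch only: the format name and option keys are assumptions to verify
# against the connector guide for your version.
AEROSPIKE_FORMAT = "com.aerospike.spark.sql"  # assumed data source name


def build_users_df(spark):
    """Create a small DataFrame with an explicit schema (needs pyspark)."""
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    schema = StructType([
        StructField("user_id", StringType(), nullable=False),
        StructField("name", StringType(), nullable=True),
        StructField("age", IntegerType(), nullable=True),
    ])
    rows = [("u1", "Ann", 34), ("u2", "Bo", 27)]  # illustrative data
    return spark.createDataFrame(rows, schema)


def aerospike_write_options(seed_host, namespace, set_name, key_column):
    """Build writer options (hypothetical helper, not part of the connector)."""
    return {
        "aerospike.seedhost": seed_host,
        "aerospike.namespace": namespace,
        "aerospike.set": set_name,
        "aerospike.updateByKey": key_column,  # assumed: column used as primary key
    }


def save_by_key(df, options, mode="overwrite"):
    """Write the DataFrame to Aerospike through the DataFrameWriter API."""
    df.write.mode(mode).format(AEROSPIKE_FORMAT).options(**options).save()


write_opts = aerospike_write_options("127.0.0.1", "test", "users", "user_id")

# With a live SparkSession `spark` and the connector JAR on the classpath:
# save_by_key(build_users_df(spark), write_opts)
```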
Aerospike Spark Example: AeroJoin
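AeroJoin itself is available from Scala only, so the following is a conceptual Python sketch of the idea rather than the connector's API: given a collection of keys, fetch the matching records with batched reads. The in-memory `store` stands in for an Aerospike set, and `batches`, `join_by_key` and `fake_batch_get` are hypothetical names. Skipping keys that have no record is a choice made for this sketch, not confirmed connector behavior.

```python
from itertools import islice


def batches(keys, size):
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def join_by_key(keys, batch_get, batch_size=100):
    """Fetch the record for each key, one batch request per chunk of keys."""
    found = []
    for chunk in batches(keys, batch_size):
        # batch_get returns one entry per key, None where no record exists
        for key, record in zip(chunk, batch_get(chunk)):
            if record is not None:
                found.append((key, record))
    return found


# In-memory stand-in for an Aerospike set (illustrative data only).
store = {"u1": {"name": "Ann"}, "u3": {"name": "Cy"}}


def fake_batch_get(chunk):
    return [store.get(k) for k in chunk]


joined = join_by_key(["u1", "u2", "u3"], fake_batch_get, batch_size=2)
print(joined)
```

Grouping keys into batches means the number of round trips grows with the number of batches rather than with the number of keys, which is the benefit the batch read functionality provides.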
Customer Story: Is Scaling an Issue?
• Spark partitions data across workers, supervised by an executor (one per Spark node)
• An Aerospike scan (pre-4.9) scans data by Aerospike node (one scan per Aerospike node)
• That means there is a mismatch in parallelization between the number of cores on the Spark side and the number of nodes on the Aerospike side
Aerospike Partitions: Even Data Distribution
Data is distributed evenly across nodes in a cluster using the Aerospike Smart Partitions™ algorithm.
▪ Automatic Sharding
▪ 4096 Data Partitions
▪ Even distribution of
  ▪ Partitions across nodes
  ▪ Records across Partitions
  ▪ Data across Flash devices
▪ Primary and Replica Partitions
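The effect of hashing keys into 4096 partitions can be sketched in a few lines. Aerospike derives the partition from 12 bits of a RIPEMD-160 digest of the key; this sketch substitutes SHA-1 (RIPEMD-160 is not available in every Python build), and the round-robin `owner_node` mapping is a simplification, not the Smart Partitions algorithm. The key names and the 33-node cluster size are illustrative.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096  # partition count from the slide


def partition_id(key: str) -> int:
    """Map a key to a partition using 12 bits of a hash digest."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()  # stand-in for RIPEMD-160
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def owner_node(pid: int, n_nodes: int) -> int:
    """Naive round-robin partition placement (not Aerospike's real algorithm)."""
    return pid % n_nodes


keys = [f"user-{i}" for i in range(200_000)]  # synthetic keys
pids = [partition_id(k) for k in keys]
per_partition = Counter(pids)
per_node = Counter(owner_node(p, 33) for p in pids)

print("partitions used:", len(per_partition))
print("records per node, min/max:", min(per_node.values()), max(per_node.values()))
```

Because the hash spreads keys uniformly over partitions, any placement that gives each node a near-equal share of partitions also gives it a near-equal share of records.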
Customer Story: Scaling Things Up (With 4.9 RC Access)
• Customer Environment:
  • 33 Aerospike nodes
  • Over 10B objects, over 125TB of unique data
  • ~200 Spark nodes with 36 cores each (~7200 total cores/workers)
• The Problem: Less than 1 percent utilization on the Spark side during data load operations.
• The Change: Aerospike 4.9 will allow scanning by partition instead of by node, giving 4096 scan units, and Aerospike Spark Connector 2.0 supports partition scans.
• The Result:
  • The customer received a release candidate (RC) of Aerospike 4.9 and Spark Connector 2.0
  • Over 10B unique records (125TB of unique data) were scanned, loaded and filtered in ~45 minutes.
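The utilization figures follow from simple arithmetic on the numbers in this slide. The sketch below assumes one Spark task per scan unit (one per Aerospike node before 4.9, one per partition afterwards) and treats 1 TB as 10^12 bytes; both are simplifying assumptions, and the throughput values are rough averages derived from the rounded figures quoted.

```python
def busy_fraction(scan_units, total_cores):
    """Upper bound on the share of cores that can hold a scan task at once."""
    return min(scan_units, total_cores) / total_cores


spark_nodes, cores_per_node = 200, 36
total_cores = spark_nodes * cores_per_node          # 7200 cores

node_scan = busy_fraction(33, total_cores)          # one task per Aerospike node
partition_scan = busy_fraction(4096, total_cores)   # one task per partition

print(f"node scan:      {node_scan:.2%} of cores busy")
print(f"partition scan: {partition_scan:.2%} of cores busy")

# Rough average throughput for the reported run (assumes 1 TB = 1e12 bytes).
seconds = 45 * 60
bytes_per_second = 125e12 / seconds
records_per_second = 10e9 / seconds
print(f"{bytes_per_second / 1e9:.1f} GB/s, {records_per_second / 1e6:.1f}M records/s")
```

With 33 scan units against 7200 cores, at most about 0.46% of the cores can be busy, which matches the "less than 1 percent" figure; 4096 partitions raise that ceiling to roughly 57%.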
Time for Q&A!
Thank You!
zelkayam@aerospike.com