Davide Fantuzzi - Data Engineer
The Spark of Neo4j
25 February 2021 - GraphRM
WELCOME
Davide Fantuzzi
Data Engineer @ LARUS Business Automation
@utnaf
/in/davidefantuzzi/
ABOUT LARUS
Founded in 2004
HQ: Venice
Offices: Pescara, Rome, Milan
Global services
International projects
Data Engineer, Data Architect, Data Scientist, Big Data certified experts team
We help companies become insight-driven organizations
Leader in the development of data-driven applications based on NoSQL & Event Streaming Technologies.
LARUS: OUR SPECIALTIES
Big Data Platform Design & Development (Java, Scala, Python, JavaScript)
Data Engineering
Graph Data Visualization
Data Science
Strategic Advisory for Data-Driven Transformation Projects
Machine Learning and AI
graph based technology
LARUS NEO4J
LARUS: OUR PARTNERS
AGENDA
1. Spark & Neo4j
2. Challenges
3. Neo4j Connector for Apache Spark
4. Demo
SPARK & NEO4J
WHAT IS APACHE SPARK?
● Analytics engine for large-scale data processing
● Cluster of worker nodes that partition operations and execute them in parallel
● Supports SQL & Streaming
● Oriented around DataFrames, which for our purposes are effectively tables
● Polyglot: APIs for Scala, Java, Python, and R
WHEN TO USE SPARK?
● Very large datasets that have to be broken into pieces
● Complex pipelines with many sources & transformations
● Great for iterative algorithms (map, reduce, filter, sort)
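The idea of breaking a dataset into pieces and processing them in parallel can be sketched in plain Python. This is only an analogy under stated assumptions: the threads stand in for Spark worker nodes, and the helper names `partition` and `process_partition` are hypothetical, not part of any Spark API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, num_partitions):
    """Split data into roughly equal, contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_partition(chunk):
    """Per-partition work: keep even numbers (filter), square them (map), sum them."""
    return sum(x * x for x in chunk if x % 2 == 0)


data = list(range(1, 11))
chunks = partition(data, 3)

# Each chunk is handled independently, like a task on a worker node.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_partition, chunks))

# Combine the partial results (reduce).
total = reduce(lambda a, b: a + b, partials, 0)
print(total)
```

Spark applies the same split, process, and combine pattern, but across machines and with fault tolerance.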
OUR GOAL
● Avoid custom "hacky" solutions
● Continuous development
● Quick response to issues
● Leverage DataSource V2 APIs in order to be fully Spark Compliant
○ Polyglot
○ ETL
○ Graph-Driven Machine Learning
OUR SOLUTION
● Deprecation of the old connector
● Complete rewrite using DataSource API V2
● Official (enterprise) Neo4j support through Larus
● Dedicated Team
● Open-source
● Comprehensive Documentation
CHALLENGES
DOCUMENTATION AND EXAMPLES
● Both were lacking
● No official documentation on the DataSource V2 API
● Examples on the web were superficial
BREAKING CHANGES
Spark 2.3 -> (breaking changes) -> Spark 2.4 -> (breaking changes) -> Spark 3.0
WHICH VERSION TO START WITH?
Spark 2.3 -> Spark 2.4 -> Spark 3.0
Choice: Spark 2.4
● Spark 2.4 is supported
● Spark 3.0 support is in pre-release; the official release is planned for March
● Spark 2.3 support is on hold
VERSION FRAGMENTATION
● Spark 2.4 supports Scala 2.11 and Scala 2.12
● Spark 3.0 supports Scala 2.12 and removed support for Scala 2.11
● (Spark 3.0 for Scala 2.13 is not released yet)
● We need to release a JAR for each combination
neo4j-connector-apache-spark_2.12_3.0-4.0.0.jar
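To make the "one JAR per combination" point concrete, here is a minimal Python sketch that enumerates artifact names following the pattern shown above. The support matrix is an assumption that covers only the Scala/Spark combinations already released according to the bullets, and `jar_names` is a hypothetical helper, not part of the connector's build.

```python
# Assumed support matrix: Spark version -> Scala versions with a released build.
SCALA_VERSIONS_BY_SPARK = {
    "2.4": ["2.11", "2.12"],
    "3.0": ["2.12"],
}


def jar_names(connector_version):
    """Return one artifact name per (Scala, Spark) combination."""
    names = []
    for spark, scala_versions in SCALA_VERSIONS_BY_SPARK.items():
        for scala in scala_versions:
            names.append(
                f"neo4j-connector-apache-spark_{scala}_{spark}-{connector_version}.jar"
            )
    return names


names = jar_names("4.0.0")
for name in names:
    print(name)
```

Every new Spark or Scala version adds entries to this matrix, which is why the build needs a structure that scales with it.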
Maven modules to the rescue!
TABLES VS LABELS
(:Person)-[:ACTED_IN]->(:Movie)
We had to find a way to map graph entities into table columns
Person.id   Person.name
1           Keanu Reeves
2           Tom Hanks
ACTED_IN.source   ACTED_IN.target
1                 3
2                 4
Movie.id   Movie.title
3          The Matrix
4          Cloud Atlas
Bi-directional mapping allows reading from and writing to Neo4j: the same tables serve both READ and WRITE.
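The mapping between graph entities and table rows can be sketched in a few lines of Python. This is an illustration only: the in-memory dict layout and the function names are assumptions made for the example, not the connector's internal representation.

```python
# Toy in-memory graph; the dict layout is an assumption for illustration.
nodes = [
    {"id": 1, "label": "Person", "props": {"name": "Keanu Reeves"}},
    {"id": 2, "label": "Person", "props": {"name": "Tom Hanks"}},
    {"id": 3, "label": "Movie", "props": {"title": "The Matrix"}},
    {"id": 4, "label": "Movie", "props": {"title": "Cloud Atlas"}},
]
relationships = [
    {"type": "ACTED_IN", "source": 1, "target": 3},
    {"type": "ACTED_IN", "source": 2, "target": 4},
]


def nodes_to_rows(nodes, label):
    """READ direction: one row per node with the label; properties become columns."""
    return [{"id": n["id"], **n["props"]} for n in nodes if n["label"] == label]


def relationships_to_rows(relationships, rel_type):
    """READ direction: one row per relationship, keyed by source and target ids."""
    return [
        {"source": r["source"], "target": r["target"]}
        for r in relationships
        if r["type"] == rel_type
    ]


def rows_to_nodes(rows, label):
    """WRITE direction: one node per row; every column except id becomes a property."""
    return [
        {"id": row["id"], "label": label,
         "props": {k: v for k, v in row.items() if k != "id"}}
        for row in rows
    ]


person_rows = nodes_to_rows(nodes, "Person")
acted_in_rows = relationships_to_rows(relationships, "ACTED_IN")
```

Converting `person_rows` back with `rows_to_nodes` yields the original Person nodes, which is the round trip the READ and WRITE arrows describe.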
SCHEMA VS. SCHEMALESS
● Result flattening
Schema: name String, age Long, location String
Result:
name         age   location
“John Doe”   33    “Milan”
“Jane Doe”   24    “Rome”
● Property might not exist in every record
Schema: name String, age Long, location String
Result:
name         age   location
“John Doe”   33    “Milan”
“Jane Doe”   24    null
● Property might not have type consistency
Schema: name String, age String*, location String
Result:
name         age    location
“John Doe”   “33”   “Milan”
“Jane Doe”   “24”   “Rome”
* the column becomes a String, no matter the types involved
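The three rules above (flatten, fill missing properties with null, fall back to String on mixed types) can be sketched as follows. This is a simplified illustration using Python type names; `infer_schema` and `flatten` are hypothetical helpers, and the connector's actual inference logic may differ.

```python
def infer_schema(records):
    """Map each property name to a single type name across all records."""
    observed = {}  # column -> set of Python type names seen
    for record in records:
        for key, value in record.items():
            types = observed.setdefault(key, set())
            if value is not None:
                types.add(type(value).__name__)
    # One observed type is kept as-is; mixed (or unknown) types fall back to "str".
    return {
        key: (next(iter(types)) if len(types) == 1 else "str")
        for key, types in observed.items()
    }


def flatten(records, schema):
    """Produce one row per record with exactly the schema's columns."""
    rows = []
    for record in records:
        row = {}
        for column, type_name in schema.items():
            value = record.get(column)  # missing property -> None (null)
            if value is not None and type_name == "str":
                value = str(value)  # coerce to the fallback type
            row[column] = value
        rows.append(row)
    return rows


records = [
    {"name": "John Doe", "age": 33, "location": "Milan"},
    {"name": "Jane Doe", "age": "24"},  # age has a different type, location is missing
]
schema = infer_schema(records)
rows = flatten(records, schema)
```

Here `age` is seen as both a number and a string, so the whole column is treated as a string, and the missing `location` becomes null.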
NEO4J CONNECTOR FOR APACHE SPARK
READ DATA
● Labels
Scala
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Admin")
  .load()
Python
df = (spark.read.format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Admin")
  .load())
R
df <- read.df(source = "org.neo4j.spark.DataSource",
  labels = ":Person:Admin")
● Relationship
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("relationship", "ACTED_IN")
  .option("relationship.source.labels", "Person")
  .option("relationship.target.labels", "Movie")
  .load()
● Query
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("query", "MATCH (n:Person) RETURN n.name, n.age")
  .load()
WRITE DATA
● Labels
import spark.implicits._ // needed for toDF
val bandDf = Seq(
  (1, "Alex Lifeson"),
  (2, "Neil Peart"),
  (3, "Geddy Lee")
).toDF("id", "name")
bandDf.write
  .format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Musician")
  .save()
● Relationship
import spark.implicits._ // needed for toDF
val musicDf = Seq(
  (12, "John Bonham", "Drums"),
  (19, "John Mayer", "Guitar"),
  (32, "John Scofield", "Guitar"),
  (15, "John Butler", "Guitar")
).toDF("experience", "name", "instrument")
musicDf.write
  .format("org.neo4j.spark.DataSource")
  .option("relationship", "PLAYS")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Musician")
  .option("relationship.source.node.keys", "name:name")
  .option("relationship.target.labels", ":Instrument")
  .option("relationship.target.node.keys", "instrument:name")
  .save()
● Query
import spark.implicits._ // needed for toDF
val theTeam = Seq(
  ("David", "Allen"),
  ("Andrea", "Santurbano"),
  ("Davide", "Fantuzzi")
).toDF("name", "lastname")
theTeam.write
  .format("org.neo4j.spark.DataSource")
  .option(
    "query",
    "CREATE (n:Person) " +
      "SET n.fullName = event.name + event.lastname"
  )
  .save()
This will generate a query like:
UNWIND $events AS event
CREATE (n:Person) SET n.fullName = event.name + event.lastname
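The way a user query is wrapped for a batch of rows can be sketched like this. It is a simplified model based only on the generated query shown above: `build_write_statement` is a hypothetical helper, and the property is written as `n.fullName` so that the statement is valid Cypher.

```python
def build_write_statement(user_query, rows):
    """Prefix the user query with UNWIND and pass the rows as the $events parameter."""
    statement = "UNWIND $events AS event\n" + user_query
    parameters = {"events": [dict(row) for row in rows]}
    return statement, parameters


rows = [
    {"name": "David", "lastname": "Allen"},
    {"name": "Andrea", "lastname": "Santurbano"},
    {"name": "Davide", "lastname": "Fantuzzi"},
]

statement, parameters = build_write_statement(
    "CREATE (n:Person) SET n.fullName = event.name + event.lastname",
    rows,
)
print(statement)
```

Each DataFrame row becomes one `event` map, so column names such as `name` and `lastname` are available as `event.name` and `event.lastname` inside the user query.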
FEATURES
● Push Down Filters
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("labels", "Movie")
  .load()
df.where("title LIKE 'Matrix%'").show()
● Push Down Columns
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("labels", "Movie")
  .load()
df.select("title").show()
● Official Neo4j Driver [link]
● CypherDSL [link]
● GraphX / GraphFrames are not used
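Pushing filters and columns down means the filtering and projection run inside Neo4j rather than in Spark after loading everything. The sketch below shows the idea by building a Cypher string for a prefix filter (the equivalent of `LIKE 'Matrix%'`) and a column selection. `build_node_query` is a hypothetical helper and the Cypher the connector really generates may look different.

```python
def build_node_query(label, columns=None, starts_with=None):
    """Build a Cypher query that filters and projects on the database side.

    columns: property names to return (column pushdown).
    starts_with: property name -> required prefix (filter pushdown).
    """
    query = f"MATCH (n:`{label}`)"
    params = {}
    if starts_with:
        clauses = []
        for i, (prop, prefix) in enumerate(sorted(starts_with.items())):
            clauses.append(f"n.`{prop}` STARTS WITH $p{i}")
            params[f"p{i}"] = prefix  # values travel as parameters, not inline text
        query += " WHERE " + " AND ".join(clauses)
    if columns:
        query += " RETURN " + ", ".join(f"n.`{c}` AS `{c}`" for c in columns)
    else:
        query += " RETURN n"
    return query, params


query, params = build_node_query(
    "Movie", columns=["title"], starts_with={"title": "Matrix"}
)
print(query)
print(params)
```

Only the matching rows and the requested properties leave the database, which reduces the data transferred to Spark.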
COMMON USE CASES
● Data Source Integration
○ Connect any file format or database supported by Spark to Neo4j
● Extraction, Transformation, and Load (ETL) in both directions
○ Bulk insert for new databases
○ Ongoing nightly jobs
● Graph-driven Machine Learning
○ Use Spark to bring Graph Data Science into existing pipelines
DEMO
USEFUL LINKS
● GitHub https://github.com/neo4j-contrib/neo4j-spark-connector
● Documentation https://neo4j.com/developer/spark/
● Notebook for playing around with the connector
https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
● Article on Towards Data Science
https://towardsdatascience.com/using-neo4j-with-pyspark-on-databricks-eb3d127f2245
THANKS FOR YOUR ATTENTION
Davide Fantuzzi
Data Engineer
@utnaf
