The Spark of Neo4j

Davide Fantuzzi - Data Engineer
25 February 2021 - GraphRM

WELCOME
Davide Fantuzzi
Data Engineer @ LARUS Business Automation
@utnaf
/in/davidefantuzzi/

ABOUT LARUS
Founded in 2004
HQ: Venice
Offices: Pescara, Rome, Milan
Global services
International projects
Team of Data Engineers, Data Architects, Data Scientists, and certified Big Data experts
We help companies become insight-driven organizations
Leader in the development of data-driven applications based on NoSQL and event streaming technologies

LARUS: OUR SPECIALTIES
Big Data Platform Design & Development (Java, Scala, Python, JavaScript)
Data Engineering
Graph Data Visualization
Data Science
Strategic Advisory for Data-Driven Transformation Projects
Machine Learning and AI with graph-based technology

AGENDA
1. Spark & Neo4j
2. Challenges
3. Neo4j Connector for Apache Spark
4. Demo

WHAT IS APACHE SPARK?
● Analytics engine for large-scale data processing
● Runs on a cluster of worker nodes that partition operations and execute them in parallel
● Supports SQL & Streaming
● Oriented around DataFrames, which for our purposes are effectively tables
● Polyglot: APIs for Scala, Java, Python, and R

WHEN TO USE SPARK?
● Very large datasets that have to be broken into pieces
● Complex pipelines with many sources & transformations
● Great for iterative algorithms (map, reduce, filter, sort)
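As a rough, single-machine illustration of the bullets above (this is not Spark itself), the sketch below breaks a dataset into pieces, applies filter and map to each piece independently, and then reduces the partial results. The function names and the partition count are hypothetical.

```python
from functools import reduce


def partition(data, n):
    """Split data into n chunks (stand-ins for Spark partitions)."""
    return [data[i::n] for i in range(n)]


def process_partition(chunk):
    """Per-partition work: keep even numbers (filter), square them (map), sum them."""
    return sum(x * x for x in chunk if x % 2 == 0)


def run(data, n_partitions=4):
    # In a cluster each partition could go to a different worker; here we just loop.
    partials = [process_partition(chunk) for chunk in partition(data, n_partitions)]
    # Combine the per-partition results (reduce).
    return reduce(lambda a, b: a + b, partials, 0)
```

Because each partition is processed without looking at the others, the per-partition step is the part that a cluster can run in parallel.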

OUR GOAL
● Avoid custom "hacky" solutions
● Continuous development
● Quick response to issues
● Leverage the DataSource V2 API in order to be fully Spark-compliant
○ Polyglot
○ ETL
○ Graph-Driven Machine Learning

OUR SOLUTION
● Deprecation of the old connector
● Complete rewrite using DataSource API V2
● Official (enterprise) Neo4j support through LARUS
● Dedicated Team
● Open-source
● Comprehensive Documentation

DOCUMENTATION AND EXAMPLES
● Both were lacking
● No official documentation on the DataSource V2 API
● Examples on the web were superficial

BREAKING CHANGES
Spark 2.3 → (breaking changes) → Spark 2.4 → (breaking changes) → Spark 3.0

WHICH VERSION TO START WITH?
Candidates: Spark 2.3, Spark 2.4, Spark 3.0
Chosen: Spark 2.4

● Spark 2.4 is supported
● Spark 3.0 support is in pre-release; an official release is planned for March
● Spark 2.3 support is on hold

VERSION FRAGMENTATION
● Spark 2.4 supports Scala 2.11 and Scala 2.12
● Spark 3.0 supports Scala 2.12 and has dropped Scala 2.11; Scala 2.13 support is planned but not released yet
● We need to release a JAR for each combination, e.g.
neo4j-connector-apache-spark_2.12_3.0-4.0.0.jar
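To make the "one JAR per combination" point concrete, here is a small sketch that enumerates the Scala/Spark pairs from the bullets above and builds one artifact name for each. The naming pattern is extrapolated from the single file name shown on the slide, so treat it as an assumption rather than the project's documented convention; `SUPPORTED` and `jar_names` are hypothetical names.

```python
# Scala versions available per Spark version, as listed on the slide.
SUPPORTED = {
    "2.4": ["2.11", "2.12"],
    "3.0": ["2.12"],  # Scala 2.13 builds were not available at the time
}


def jar_names(connector_version, supported=SUPPORTED):
    """Build one artifact name per (Scala, Spark) combination.

    Pattern assumed from the slide's example:
    neo4j-connector-apache-spark_<scala>_<spark>-<connector>.jar
    """
    return [
        f"neo4j-connector-apache-spark_{scala}_{spark}-{connector_version}.jar"
        for spark, scalas in supported.items()
        for scala in scalas
    ]
```

Every new Spark or Scala version adds entries to this matrix, which is why the build has to be organized around it.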

VERSION FRAGMENTATION
Maven modules to the rescue!

TABLES VS LABELS
(:Person)-[:ACTED_IN]->(:Movie)
We had to find a way to map graph entities into table columns
Person.id  Person.name
1          Keanu Reeves
2          Tom Hanks
ACTED_IN.source  ACTED_IN.target
1                3
2                4
Movie.id  Movie.title
3         The Matrix
4         Cloud Atlas
TABLES VS LABELS
The mapping is bi-directional, so the same tables can be read from and written to Neo4j

SCHEMA VS. SCHEMALESS
● Result flattening
Schema:
name      String
age       Long
location  String
Result:
name        age  location
“John Doe”  33   “Milan”
“Jane Doe”  24   “Rome”

● A property might not exist in every record
Schema:
name      String
age       Long
location  String
Result:
name        age  location
“John Doe”  33   “Milan”
“Jane Doe”  24   null

● A property might not exist in every record
● A property might not have a consistent type across records
Schema:
name      String
age       String*
location  String
Result:
name        age   location
“John Doe”  “33”  “Milan”
“Jane Doe”  “24”  “Rome”
* When the types differ, String is used, no matter which types are involved
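The three cases above (flattening, missing properties, inconsistent types) can be illustrated with a small schema-inference sketch. It is not the connector's implementation, just one plausible way to get the behaviour described on the slides; `infer_schema` and `flatten` are hypothetical names.

```python
def infer_schema(records):
    """Infer a column -> type-name mapping from schemaless records.

    A column seen with more than one value type falls back to "String".
    """
    seen = {}
    for record in records:
        for key, value in record.items():
            if value is not None:
                seen.setdefault(key, set()).add(type(value).__name__)
    names = {"int": "Long", "str": "String", "float": "Double", "bool": "Boolean"}
    return {
        key: names.get(next(iter(types)), "String") if len(types) == 1 else "String"
        for key, types in seen.items()
    }


def flatten(records, schema):
    """Emit one row per record with every schema column.

    Missing properties become None; values in String columns are stringified.
    """
    rows = []
    for record in records:
        row = {}
        for column, type_name in schema.items():
            value = record.get(column)
            if value is not None and type_name == "String":
                value = str(value)
            row[column] = value
        rows.append(row)
    return rows
```

Falling back to String loses type information, but it guarantees that every value in the column can be represented.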

NEO4J CONNECTOR FOR APACHE SPARK

READ DATA
● Labels
Scala:
// connection options (e.g. "url") are omitted for brevity
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Admin")
  .load()
Python:
df = (spark.read.format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Admin")
  .load())
R:
df <- read.df(source = "org.neo4j.spark.DataSource",
              labels = ":Person:Admin")

READ DATA
● Labels
● Relationship
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("relationship", "ACTED_IN")
  .option("relationship.source.labels", "Person")
  .option("relationship.target.labels", "Movie")
  .load()

READ DATA
● Labels
● Relationship
● Query
val df = spark.read.format("org.neo4j.spark.DataSource")
  .option("query", "MATCH (n:Person) RETURN n.name, n.age")
  .load()
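The three read modes shown in these slides (labels, relationship, query) differ only in the options passed to the reader. The sketch below is a hypothetical helper, not part of the connector: it assumes the three modes are mutually exclusive and works with any object exposing a chainable `.option(key, value)` method, such as `spark.read.format("org.neo4j.spark.DataSource")` in PySpark.

```python
def apply_read_options(reader, labels=None, relationship=None, query=None, extra=None):
    """Set exactly one read mode, plus any extra options, on a Spark-like reader.

    `extra` is a dict for additional options such as
    "relationship.source.labels" or "relationship.target.labels".
    """
    modes = {"labels": labels, "relationship": relationship, "query": query}
    chosen = {key: value for key, value in modes.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError("set exactly one of: labels, relationship, query")
    for key, value in {**chosen, **(extra or {})}.items():
        reader = reader.option(key, value)
    return reader
```

With a real Spark session, the returned reader would then be finished with `.load()`.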

WRITE DATA
● Labels
import spark.implicits._ // needed for toDF
val bandDf = Seq(
  (1, "Alex Lifeson"),
  (2, "Neil Peart"),
  (3, "Geddy Lee")
).toDF("id", "name")
bandDf.write
  .format("org.neo4j.spark.DataSource")
  .option("labels", ":Person:Musician")
  .save()

WRITE DATA
● Labels
● Relationship
val musicDf = Seq(
  (12, "John Bonham", "Drums"),
  (19, "John Mayer", "Guitar"),
  (32, "John Scofield", "Guitar"),
  (15, "John Butler", "Guitar")
).toDF("experience", "name", "instrument")
musicDf.write
  .format("org.neo4j.spark.DataSource")
  .option("relationship", "PLAYS")
  .option("relationship.save.strategy", "keys")
  .option("relationship.source.labels", ":Musician")
  .option("relationship.source.node.keys", "name:name")
  .option("relationship.target.labels", ":Instrument")
  .option("relationship.target.node.keys", "instrument:name")
  .save()

WRITE DATA
● Labels
● Relationship
● Query
val theTeam = Seq(
  ("David", "Allen"),
  ("Andrea", "Santurbano"),
  ("Davide", "Fantuzzi")
).toDF("name", "lastname")
theTeam.write
  .format("org.neo4j.spark.DataSource")
  .option(
    "query",
    "CREATE (n:Person) " +
      "SET n.fullName = event.name + event.lastname"
  )
  .save()
This will generate a query like:
UNWIND $events AS event
CREATE (n:Person) SET n.fullName = event.name + event.lastname
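The generated query above suggests how a query-mode write works: the user's Cypher is prefixed with an UNWIND over a batch parameter, and each DataFrame row is exposed as `event`. The following sketch imitates that composition in plain Python; the function names are hypothetical and the connector's internals may differ.

```python
def build_write_query(user_query):
    """Prefix the user's Cypher with an UNWIND over the batch parameter."""
    return "UNWIND $events AS event\n" + user_query.strip()


def rows_to_events(columns, rows):
    """Turn tabular rows into the list of maps that would be bound to $events."""
    return [dict(zip(columns, row)) for row in rows]
```

Sending many rows in a single `$events` parameter means one round trip per batch rather than one per row.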

FEATURES
● Push Down Filters
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("labels", "Movie")
  .load()
df.where("title LIKE 'Matrix%'").show()

FEATURES
● Push Down Filters
● Push Down Columns
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("labels", "Movie")
  .load()
df.select("title").show()
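Pushing filters and columns down means that the database, not Spark, does the filtering and projection, so less data crosses the wire. The sketch below shows the general idea by turning a label, a column list, and prefix filters into a single Cypher query. The exact Cypher the connector generates is not shown in the slides, so this output is purely illustrative, and `build_pushdown_query` is a hypothetical name.

```python
def build_pushdown_query(label, columns=None, prefix_filters=None):
    """Build a Cypher query and its parameters for a label scan.

    columns: properties to return (column pushdown); None returns the whole node.
    prefix_filters: {property: prefix}, the equivalent of SQL LIKE 'prefix%'.
    Label and property names are interpolated as-is, so they must be trusted.
    """
    params = {}
    clauses = []
    for i, (prop, prefix) in enumerate((prefix_filters or {}).items()):
        params[f"p{i}"] = prefix
        clauses.append(f"n.{prop} STARTS WITH $p{i}")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    returned = ", ".join(f"n.{c} AS {c}" for c in columns) if columns else "n"
    return f"MATCH (n:{label}){where} RETURN {returned}", params
```

Without pushdown, the equivalent job would read every Movie node with all its properties and discard most of the data inside Spark.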

FEATURES
● Official Neo4j Driver

FEATURES
● CypherDSL
● GraphX / GraphFrames are not used

COMMON USE CASES
● Data Source Integration
○ Connect any file format or database supported by Spark to Neo4j
● Extract, Transform, Load (ETL) in both directions
○ Bulk insert for new databases
○ Ongoing nightly jobs
● Graph-driven Machine Learning
○ Use Spark to bring Graph Data Science into existing pipelines

USEFUL LINKS
● GitHub https://github.com/neo4j-contrib/neo4j-spark-connector
● Documentation https://neo4j.com/developer/spark/
● Notebooks for playing around with the connector https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
● Article on Towards Data Science https://towardsdatascience.com/using-neo4j-with-pyspark-on-databricks-eb3d127f2245

THANKS FOR YOUR ATTENTION
Davide Fantuzzi
Data Engineer
@utnaf
