Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.
Morpheus is an open-source library that is API-compatible with Spark Graph and extends its functionality with:
● A Property Graph catalog to manage multiple Property Graphs and Views
● Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
● Extended Cypher capabilities, including multiple graph support and graph construction
● Built-in support for the Neo4j Graph Algorithms library
In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus to help enterprise users integrate Spark Graph into their existing Spark and Neo4j installations.
We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.
6. #UnifiedDataAnalytics #SparkAISummit
The Property Graph Model
Node
● Represents an entity within the graph
● Can have labels
Relationship
● Connects a start node with an end node
● Has one type
Property
● Describes a node or relationship, e.g. name, age, weight
● Key-value pair: String key; typed value (string, number, list, ...)
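The three building blocks above can be sketched as plain Python data classes. This is only an illustration of the model, not the Spark Graph or Morpheus API; the class names, field names and the example data (`alice`, `bob`, `knows`) are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class Node:
    """An entity in the graph; it can carry zero or more labels."""
    id: int
    labels: FrozenSet[str] = frozenset()
    # Properties are key-value pairs: string key, typed value.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    """Connects a start node with an end node and has exactly one type."""
    id: int
    start: int  # id of the start node
    end: int    # id of the end node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example data, not taken from the talk.
alice = Node(1, frozenset({"Person"}), {"name": "Alice", "age": 42})
bob = Node(2, frozenset({"Person"}), {"name": "Bob", "age": 23})
knows = Relationship(10, start=alice.id, end=bob.id, rel_type="KNOWS",
                     properties={"since": 2015})
```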
Spark Project Improvement Proposal
● Defines a Cypher-compatible Property Graph type based on DataFrames
● Replaces GraphFrames querying with Cypher
● Reimplements GraphFrames/GraphX algos on the Property Graph type
● Running PoC: [SPARK-27299][GRAPH][WIP] Spark Graph API design proposal
https://git.io/fjqp6
SPIP: What are we trying to do?
● “Spark Cypher”
○ Run a Cypher query on a Property Graph returning a tabular result
● Implementation is based on Spark SQL
○ Property Graphs are composed of one or more DFs
● Provide Scala, Python and Java APIs
● Deep dive: Graph Features in Spark 3.0: Thursday 11AM, Room G104
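The idea that a Property Graph is composed of tables, and that a Cypher pattern match turns into joins over them, can be sketched without Spark. The lists of dicts below stand in for DataFrames; the column names (`id`, `source`, `target`), the `KNOWS` data and the `match_knows` helper are assumptions for illustration, not the layout the SPIP prescribes.

```python
# Node "table": one row per node, with an id column and property columns.
persons = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]

# Relationship "table": one row per relationship, with source and target ids.
knows = [
    {"id": 10, "source": 1, "target": 2},
    {"id": 11, "source": 2, "target": 3},
]


def match_knows(nodes, rels):
    """Evaluate MATCH (a)-[:KNOWS]->(b) RETURN a.name, b.name as two lookups
    (joins) from the relationship table into the node table."""
    by_id = {row["id"]: row for row in nodes}
    return [
        {"a.name": by_id[r["source"]]["name"],
         "b.name": by_id[r["target"]]["name"]}
        for r in rels
    ]


# The result is tabular: one row per matched pattern.
result = match_knows(persons, knows)
```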
SPIP: What are we not solving?
● Addresses the Cypher Property Graph Model
○ Does not deal with variants of that model (e.g. RDF)
● No multiple graph features
○ API is flexible to support this in future iterations
● No Property Graph Catalog
○ Also no Property Graph specific Data Sources
Spark and Neo4j
Spark is an immutable data processing engine
○ Spark SQL organizes data in tables (DataFrames)
○ DataFrames can be queried via SQL
○ Spark SQL programs are optimized by Catalyst
Neo4j is a native transactional CRUD database
○ Neo4j graphs use a native graph data representation
○ Neo4j graphs can be queried using Cypher
○ Neo4j has optimized, in-process, multi-threaded (MT) graph algorithms
Morpheus: SQL + Cypher in one session
Graphs and tables are both useful data models
○ Finding paths and subgraphs, and transforming graphs
○ Viewing, aggregating and ordering values
The Morpheus project parallels Spark SQL
○ PropertyGraph type (composed of DataFrames)
○ Catalog of graph data sources, named graphs and views
○ Cypher query language
A CypherSession adds graphs to a SparkSession
What is Morpheus used for?
Data integration
○ Integrate (non-)graphy data from multiple, heterogeneous data sources into one or more property graphs
Distributed Cypher execution
○ OLAP-style graph analytics
Data science
○ Integration with other Spark libraries
○ Feature extraction using Neo4j Graph Algorithms
Neo4j Graph Algorithms
https://bit.ly/2oUfnA5
neo4j.com/docs/graph-algorithms/current/
Pathfinding & Search
• Parallel Breadth First Search
• Parallel Depth First Search
• Shortest Path
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
Centrality / Importance
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
Community Detection
• Triangle Count
• Clustering Coefficients
• Connected Components (Union Find)
• Strongly Connected Components
• Label Propagation
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
Similarity
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
* Available in GraphFrames
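As a small illustration of one of the similarity measures listed above, Jaccard similarity compares two sets by dividing the size of their intersection by the size of their union. The function below is a plain-Python sketch, not the Neo4j Graph Algorithms implementation; returning 0.0 for two empty sets is a convention chosen for this sketch, and the business names are made up.

```python
def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)


# Hypothetical example: businesses reviewed by two users.
alice_reviewed = {"acme", "bistro", "cafe"}
bob_reviewed = {"bistro", "cafe", "diner"}
similarity = jaccard(alice_reviewed, bob_reviewed)  # 2 shared / 4 total = 0.5
```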
Cypher query language
Cypher 9 is the latest full version of openCypher
○ Implemented in Neo4j 3.5
○ Implemented in whole/part by six other vendors
○ Several other partial and research implementations
○ Cypher for Gremlin is another openCypher project
Cypher 9 in Morpheus and Spark Graph (SPIP)
Cypher is a full CRUD language
○ RETURNs only tabular results: not composable
○ Results can include graph elements (paths, relationships, nodes) or property values
Morpheus and SPIP implement most of read-only Cypher
○ No MERGE or DELETE
○ Spark immutable data + transformations
Cypher 10 in Morpheus - Multiple graphs
Cypher 10 proposes support for Multiple Graphs
○ Multiple Graph CIP: https://git.io/fjmrx
Allows for Cypher Query composition
○ Similar to chaining transformations on DataFrames
Support Graph Catalog for managing Graphs
○ Analogous to Spark SQL catalog
Query support for Graph Construction
Returning tabular data
Input: a property graph
Output: a table
FROM GRAPH socialNetwork
MATCH ({name: 'Dan'})-[:FRIEND*2]->(foaf)
RETURN toUpper(foaf.name) AS name
ORDER BY name DESC
Language features available in Morpheus
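What the friend-of-a-friend query above computes can be sketched in plain Python: follow exactly two FRIEND hops from Dan, upper-case the names, and sort them in descending order. The adjacency dict and its names other than Dan are hypothetical data; the sketch assumes no self-loops, so a two-hop path never reuses the same relationship.

```python
# Directed FRIEND relationships: name -> names that person points to.
# Hypothetical data for illustration.
friends = {
    "Dan": {"Ann", "Bob"},
    "Ann": {"Cleo"},
    "Bob": {"Cleo", "Eve"},
    "Cleo": set(),
    "Eve": set(),
}


def friends_of_friends(graph, start):
    """Mirror MATCH ({name: start})-[:FRIEND*2]->(foaf)
    RETURN toUpper(foaf.name) AS name ORDER BY name DESC."""
    rows = []
    for friend in graph.get(start, ()):
        for foaf in graph.get(friend, ()):
            # One row per two-hop path, so a name can appear more than once.
            rows.append(foaf.upper())
    return sorted(rows, reverse=True)


table = friends_of_friends(friends, "Dan")
# Paths: Dan->Ann->Cleo, Dan->Bob->Cleo, Dan->Bob->Eve
```

Cleo appears twice because she is reachable over two different paths; a `RETURN DISTINCT` in Cypher, or a set in this sketch, would collapse the duplicates.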
Constructing graphs
Input: a property graph
Output: a property graph
FROM GRAPH socialNetwork
MATCH (p:Person)-[:FRIEND*2]->(foaf)
WHERE NOT (p)-[:FRIEND]->(foaf)
CONSTRUCT
CREATE (p)-[:POSSIBLE_FRIEND]->(foaf)
RETURN GRAPH
Language features available in Morpheus
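The graph construction above derives new POSSIBLE_FRIEND relationships from two-hop paths that are not already direct friendships. A rough plain-Python equivalent is shown below; it returns only the set of new edges rather than a full property graph, treats every node as a Person, and uses hypothetical data.

```python
# Directed FRIEND relationships; hypothetical data for illustration.
friends = {
    "Dan": {"Ann", "Bob"},
    "Bob": {"Ann"},
    "Ann": {"Cleo"},
    "Cleo": set(),
}


def possible_friends(graph):
    """For every p -> friend -> foaf path, emit (p, foaf) unless p already
    has a direct FRIEND relationship to foaf."""
    new_edges = set()
    for p, direct in graph.items():
        for friend in direct:
            for foaf in graph.get(friend, ()):
                if foaf not in direct:  # WHERE NOT (p)-[:FRIEND]->(foaf)
                    new_edges.add((p, foaf))
    return new_edges


suggestions = possible_friends(friends)
```

Dan reaches Ann via Bob, but that pair is excluded because Dan and Ann are already friends.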
Querying multiple graphs
Input: property graphs
Output: a property graph
FROM GRAPH socialNetwork
MATCH (p:Person)
FROM GRAPH products
MATCH (c:Customer)
WHERE p.email = c.email
CONSTRUCT ON socialNetwork, products
CREATE (p)-[:IS]->(c)
RETURN GRAPH
Language features available in Morpheus
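The multiple-graph query above links a Person in one graph to a Customer in another when their email addresses match. In plain Python this amounts to an equi-join on `email`. The rows, ids and the `link_same_identity` helper are assumptions for this sketch; the result is a list of new IS relationships that would sit on top of both input graphs.

```python
# Person nodes from the social network graph (hypothetical data).
persons = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Customer nodes from the products graph (hypothetical data).
customers = [
    {"id": 100, "email": "alice@example.com"},
    {"id": 101, "email": "carol@example.com"},
]


def link_same_identity(persons, customers):
    """Join persons and customers on email and emit (person id, 'IS',
    customer id) for each match."""
    customers_by_email = {}
    for c in customers:
        customers_by_email.setdefault(c["email"], []).append(c)
    return [
        (p["id"], "IS", c["id"])
        for p in persons
        for c in customers_by_email.get(p["email"], [])
    ]


is_relationships = link_same_identity(persons, customers)
```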
Creating graph views
Input: property graphs
Output: a property graph
CATALOG CREATE VIEW youngFriends($inGraph){
FROM GRAPH $inGraph
MATCH (p1:Person)-[r]->(p2:Person)
WHERE p1.age < 25 AND p2.age < 25
CONSTRUCT
CREATE (p1)-[COPY OF r]->(p2)
RETURN GRAPH
}
Language features available in Morpheus
Using graph views
Input: property graphs
Output: table or graph
FROM youngFriends(socialNetwork)
MATCH (p:Person)-[r]->(o)
RETURN p, r, o
// and views over views
FROM youngFriends(europe(socialNetwork))
MATCH ...
Language features available in Morpheus
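A graph view behaves like a function from a graph to a graph, which is why views can be nested as in `youngFriends(europe(socialNetwork))`. The sketch below models that in plain Python. The graph representation (a dict of node properties plus a list of relationship triples), the `view` helper, the `region` property and the definition of `europe` are all assumptions for illustration; the slides do not define the `europe` view.

```python
def view(predicate):
    """Build a view: a function from graph to graph that keeps only the
    relationships whose start and end nodes both satisfy the predicate,
    together with the nodes those relationships touch."""
    def apply(graph):
        nodes, rels = graph
        kept = [(s, t, e) for (s, t, e) in rels
                if predicate(nodes[s]) and predicate(nodes[e])]
        ids = {s for s, _, _ in kept} | {e for _, _, e in kept}
        return {i: nodes[i] for i in ids}, kept
    return apply


young_friends = view(lambda props: props["age"] < 25)
europe = view(lambda props: props["region"] == "EU")  # hypothetical view

# Hypothetical input graph: node id -> properties, plus (start, type, end).
nodes = {
    1: {"name": "Ann", "age": 22, "region": "EU"},
    2: {"name": "Ben", "age": 24, "region": "EU"},
    3: {"name": "Cy", "age": 23, "region": "US"},
    4: {"name": "Di", "age": 40, "region": "EU"},
}
rels = [(1, "FRIEND", 2), (2, "FRIEND", 3), (1, "FRIEND", 4)]
social_network = (nodes, rels)

# Views over views: apply europe first, then young_friends.
young_europeans = young_friends(europe(social_network))
```

Because each view returns a graph of the same shape it accepts, the composition needs no special support, which is similar in spirit to chaining DataFrame transformations.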
Read from single Property Graph
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US” (Property Graph)
FROM social-net.US
MATCH (p:Person)
RETURN p
Read from multiple Property Graphs
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
WHERE p.email = c.email
RETURN p, c
Construct new Property Graphs
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
CATALOG CREATE GRAPH social-net.US_new {
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
WHERE p.email = c.email
CONSTRUCT ON social-net.US
CREATE (p)-[:SAME_AS]->(c)
RETURN GRAPH
}
Construct new Property Graphs (continued)
The CATALOG CREATE GRAPH statement above adds “US_new” to the catalog:
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
“US_new”
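The catalog can be pictured as a two-level namespace: a data source name such as social-net or products, and a graph name within it. The class below is a hypothetical sketch of that lookup and of storing a newly constructed graph under a qualified name; it is not the Morpheus catalog API, and the string placeholders stand in for real graphs.

```python
class PropertyGraphCatalog:
    """Hypothetical catalog: data source namespace -> graph name -> graph."""

    def __init__(self):
        self._sources = {}

    def register_source(self, namespace, graphs):
        """Register a data source together with the graphs it exposes."""
        self._sources[namespace] = dict(graphs)

    def graph(self, qualified_name):
        """Resolve a name such as 'social-net.US'."""
        namespace, name = qualified_name.split(".", 1)
        return self._sources[namespace][name]

    def store(self, qualified_name, graph):
        """Persist a constructed graph under a qualified name."""
        namespace, name = qualified_name.split(".", 1)
        self._sources[namespace][name] = graph

    def graph_names(self, namespace):
        return sorted(self._sources[namespace])


catalog = PropertyGraphCatalog()
catalog.register_source("social-net", {"US": "<US graph>", "EU": "<EU graph>"})
catalog.register_source("products", {"2018": "<2018 graph>", "2017": "<2017 graph>"})

# Stand-in for the result of the CONSTRUCT query.
catalog.store("social-net.US_new", "<US graph + SAME_AS relationships>")
```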
Create and query Graph Views
Cypher Session
Property Graph Catalog
“social-net” (Neo4j PGDS)
“US”
“EU”
...
CATALOG CREATE VIEW youngPeople($sn) {
FROM $sn
MATCH (p:Person)-[r]->(n)
WHERE p.age < 21
CONSTRUCT
CREATE (p)-[COPY OF r]->(n)
RETURN GRAPH
}
FROM youngPeople(social-net.US)
MATCH (p:Person)
RETURN p
Views: “youngPeople”
The Yelp Open Dataset
:Business {name: ACME, address: 123 ACME Rd., city: San Jose, state: CA}
:User {name: Alice, since: 2013, elite: [2014, 2016]}
:User {name: Bob, since: 2014, elite: null}
:REVIEWS {stars: 5, date: 2014-02-03}
:REVIEWS {stars: 4, date: 2014-08-03}
https://www.yelp.com
https://www.yelp.com/dataset
https://www.yelp.com/dataset/challenge
Yelp Demo Overview
Part 1: From JSON to Graph
Create persistent Property Graph from raw Yelp dataset
○ Read Yelp Data from JSON into DataFrames
○ Create Property Graph from DataFrames
○ Store Property Graph using Parquet
Part 2: A library of Graphs
Create a library of graph projections
○ Read Property Graph from Parquet
○ Create subgraph for a specific city
○ Project and persist city subgraph
Part 3: Federated queries
Integrate reviews with social network data
○ Define Graph Type and Mapping with Graph DDL
○ Load data from Hive and H2
○ Run analytical query on the integrated graph
Part 4: Neo4j Integration I
Find trending businesses
○ Load graph projections from library
○ Write graphs to Neo4j and run PageRank
○ Combine graphs in Morpheus and select trending businesses
Part 5: Neo4j Integration II
Recommend businesses to users
○ Load graph projections from library
○ Write graphs to Neo4j, run Louvain + Jaccard
○ Run analytical query in Morpheus to find recommendations
https://git.io/fjZ2b
Starting point: A Library of Graphs
2015 - 2018
(:User)-[:CO_REVIEWS]->(:User)
(:User)-[:REVIEWS]->(:Business)
(:User)-[:CO_REVIEWS]->(:User)
Construct graphs for each year
(:Business)-[:CO_REVIEWED]->(:Business)
https://git.io/fjZ25
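One way to read the projection from (:User)-[:REVIEWS]->(:Business) to (:User)-[:CO_REVIEWS]->(:User) is that two users are connected when they reviewed the same business. The sketch below works under that assumption, which the slide itself does not spell out; the review pairs are hypothetical, and emitting one direction per user pair is a choice made for this sketch.

```python
from itertools import combinations

# (user, business) pairs standing in for REVIEWS relationships.
# Hypothetical data for illustration.
reviews = [
    ("alice", "acme"),
    ("bob", "acme"),
    ("carol", "bistro"),
    ("alice", "bistro"),
]


def co_reviews(reviews):
    """Project REVIEWS relationships onto user pairs that share a business."""
    users_by_business = {}
    for user, business in reviews:
        users_by_business.setdefault(business, set()).add(user)
    pairs = set()
    for users in users_by_business.values():
        # Sorting makes the direction of each pair deterministic.
        for a, b in combinations(sorted(users), 2):
            pairs.add((a, b))
    return pairs


co_review_edges = co_reviews(reviews)
```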