Compiling openCypher graph queries with Spark Catalyst

Gábor Szárnyas
Researcher at Budapest University of Technology and Economics
pre-holiday Spark des fêtes @ Montréal
BACKGROUND
• PhD student @ Budapest Univ. of Tech. and Econ., Hungary
• Visiting researcher @ McGill University
RESEARCH TOPIC
Problem statement
• Large graph (100M+ nodes)
• Complex global graph queries
• Evaluate them in <1 sec
Approach
• We “cheat”
• Build a huge cache
• Maintain results: incremental views
RESEARCH OBJECTIVES
Create a scalable graph query engine with incremental views
1. graph queries
2. incremental views
3. making it scale
Graph queries
PROPERTY GRAPH DATABASES
NoSQL family
Data model:
vertices, edges
and properties
#1 query approach: graph pattern matching
Note. Spark GraphX is an engine for graph analytics.
CYPHER AND OPENCYPHER
Cypher: query language of the Neo4j graph database.
“Cypher is a declarative, SQL-inspired language for describing
patterns in graphs visually using an ascii-art syntax.”
MATCH
(p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup)
WHERE m.date = 'Monday, December 18, 2017'
RETURN p
“The openCypher project aims to deliver a full and open
specification of the industry’s most widely adopted graph
database query language: Cypher.” (late 2015)
OPENCYPHER SYSTEMS
• Increasing adoption
• Relational databases:
  ◦ SAP HANA
  ◦ AGENS Graph
• Research prototypes:
  ◦ Graphflow (University of Waterloo)
  ◦ ingraph (incremental graph engine)
(Source: Keynote talk @ GraphConnect NYC 2017)
LINKED DATA BENCHMARK COUNCIL
LDBC is a non-profit organization dedicated to establishing
benchmarks, benchmark practices and benchmark results for
graph data management software.
LDBC’s Social Network Benchmark is an industrial and academic
initiative, formed by principal actors in the field of graph-like
data management.
OVERVIEW OF GRAPH PROCESSING
• OLTP: local queries
• OLAP: global queries
• Analytics: global computations
OLTP: local queries
Example: “Friends’ recent likes”
MATCH (u:User {id: $userId})-[:FRIEND]-
(f:User)-[l:LIKES]->(p:Post)
RETURN f, p
ORDER BY l.timestamp DESC
LIMIT 10
Orri Erling et al.,
The LDBC Social Network Benchmark: Interactive Workload,
SIGMOD 2015
14 queries and 8 updates
OLAP: global queries
Example: “One-sided friendships”
MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post),
(u1)-[:AUTHOR_OF]->(p)
WITH u1, u2, count(l) AS likes
WHERE likes > 10
AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2)
RETURN u1, u2
Arnau Prat, Gábor Szárnyas, Alex Averbuch et al.,
The LDBC Social Network Benchmark: BI Workload,
Technical report available, peer-reviewed paper in 2018
25 queries with infrequent executions
Analytics: global computations
• PageRank
• Shortest paths
• Clustering coefficient
Example: “Find the most central individuals.”
Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
Alexandru Iosup et al.,
LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on
Parallel and Distributed Platforms,
VLDB 2016
One-time execution
Incremental view maintenance
CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL
Trailing the switch
Proximity detection
[Figure: railway network graph with segments a to g, a switch (div) and trains 1 and 2, connected by NEXT, STRAIGHT, TOP and ON edges]
PROXIMITY DETECTION
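[Figure: trains t1 and t2 are ON segments seg1 and seg2; seg2 is reachable from seg1 via one or two NEXT edges, i.e. with at most one segment in between]
MATCH (t1:Train)-[:ON]->(seg1:Segment)
      -[:NEXT*1..2]->(seg2:Segment)
      <-[:ON]-(t2:Train)
RETURN t1, t2, seg1, seg2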
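The variable-length pattern NEXT*1..2 can be illustrated with a minimal Python sketch. The toy railway below (the `NEXT` and `ON` dictionaries and the train names) is a hypothetical example, not data from the talk, and the matcher is a simplification that assumes each train is ON exactly one segment.

```python
from itertools import product

# Hypothetical toy railway: NEXT edges between segments, ON edges from trains.
NEXT = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
ON = {"t1": "a", "t2": "c", "t3": "e"}


def reachable(start, min_hops, max_hops):
    """Segments reachable from `start` via min_hops..max_hops NEXT edges."""
    frontier, found = {start}, set()
    for hops in range(1, max_hops + 1):
        frontier = {nxt for seg in frontier for nxt in NEXT.get(seg, [])}
        if hops >= min_hops:
            found |= frontier
    return found


def proximity(min_hops=1, max_hops=2):
    """Tuples (t1, t2, seg1, seg2) matching the proximity pattern."""
    matches = []
    for (t1, seg1), (t2, seg2) in product(ON.items(), repeat=2):
        if t1 != t2 and seg2 in reachable(seg1, min_hops, max_hops):
            matches.append((t1, t2, seg1, seg2))
    return sorted(matches)


print(proximity())
```

On this toy graph the sketch reports the pairs (t1, t2) and (t2, t3), since in both cases the second train is two NEXT hops ahead of the first.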
TRAILING THE SWITCH
[Figure: train t is ON segment seg; switch div has a STRAIGHT edge to seg]
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
Evaluate continuously
INCREMENTAL QUERIES
• Register a set of standing queries
• Continuously evaluate queries on changes
• Approach: build a cache and maintain its content
• First publication: 1974, the Rete algorithm
[Figure: the client registers queries and updates the graph; ingraph returns query results and change notifications]
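The idea of building a cache and maintaining its content can be sketched for the join of the “Trailing the switch” query. The class below is a simplified, hypothetical illustration in Python (it is not ingraph's implementation): it keeps a memory for each input, indexed by the shared segment, and computes only the change of the result for each edge insertion or deletion. The selection on sw.position is left out for brevity.

```python
from collections import defaultdict


class IncrementalJoin:
    """Caches both inputs, indexed by the join key (the segment)."""

    def __init__(self):
        self.on = defaultdict(set)        # segment -> trains ON it
        self.straight = defaultdict(set)  # segment -> switches with STRAIGHT to it
        self.results = set()              # maintained view: (train, segment, switch)

    def update_on(self, train, segment, insert=True):
        """Apply one ON edge change and return the delta of the view."""
        delta = {(train, segment, sw) for sw in self.straight[segment]}
        self._apply(self.on[segment], train, delta, insert)
        return delta

    def update_straight(self, switch, segment, insert=True):
        """Apply one STRAIGHT edge change and return the delta of the view."""
        delta = {(t, segment, switch) for t in self.on[segment]}
        self._apply(self.straight[segment], switch, delta, insert)
        return delta

    def _apply(self, memory, item, delta, insert):
        if insert:
            memory.add(item)
            self.results |= delta
        else:
            memory.discard(item)
            self.results -= delta


view = IncrementalJoin()
view.update_on(1, "a")
view.update_on(2, "e")
view.update_straight("div", "e")
print(view.results)
```

Moving train 2 away from segment e then amounts to one deletion followed by one insertion, each of which touches only the tuples that share the affected segment.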
[Figure, animated over several slides: the Rete network of the “Trailing the switch” query, drawn next to the railway graph. The input nodes ON and STRAIGHT feed a join (⋈), followed by the selection σ sw.position = 'diverging' and the projection π t.number, sw. The ON node receives the tuples (a, 1) and (e, 2), and the STRAIGHT node receives (e, div); the join produces (e, div, 2), and the tuple (div, 2) reaches the output. Train 2 then moves from segment e to segment d, which replaces the ON tuple (e, 2) with (d, 2).]
GRAPH RELATIONAL ALGEBRA
• Basic relational algebra
  ◦ projection, selection, join, left outer join, antijoin, union
• Common extensions
  ◦ aggregation (𝛾), duplicate-elimination (𝛿), sort (𝜏), top (𝜆)
• Graph-specific extensions
  ◦ get-vertices (◯)
  ◦ expand-out (↑), expand-in (↓), expand-both (↕)
J. Marton, G. Szárnyas, D. Varró:
Formalising openCypher Graph Queries in Relational Algebra,
ADBIS, Springer, 2017
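As a rough illustration of the graph-specific operators, the following Python sketch evaluates the “Trailing the switch” query with get-vertices, expand-out and expand-in over a tiny in-memory graph. The data, the function names and the dictionary-based tuple representation are assumptions made for this example; the operators in the cited paper are defined more generally (for instance, they can also bind the edge variable).

```python
# Hypothetical toy property graph.
VERTICES = {
    1: {"labels": {"Train"}, "props": {"number": 1}},
    2: {"labels": {"Train"}, "props": {"number": 2}},
    "a": {"labels": {"Segment"}, "props": {}},
    "e": {"labels": {"Segment"}, "props": {}},
    "div": {"labels": {"Switch"}, "props": {"position": "diverging"}},
}
EDGES = [(1, "ON", "a"), (2, "ON", "e"), ("div", "STRAIGHT", "e")]


def get_vertices(var, label):
    """get-vertices: one single-column tuple per vertex with the label."""
    return [{var: v} for v, data in VERTICES.items() if label in data["labels"]]


def expand_out(relation, src, edge_type, dst, dst_label):
    """expand-out: extend each tuple along outgoing edges of `edge_type`."""
    return [
        {**row, dst: target}
        for row in relation
        for source, etype, target in EDGES
        if source == row[src] and etype == edge_type
        and dst_label in VERTICES[target]["labels"]
    ]


def expand_in(relation, dst, edge_type, src, src_label):
    """expand-in: extend each tuple along incoming edges of `edge_type`."""
    return [
        {**row, src: source}
        for row in relation
        for source, etype, target in EDGES
        if target == row[dst] and etype == edge_type
        and src_label in VERTICES[source]["labels"]
    ]


def trailing_the_switch():
    rows = get_vertices("t", "Train")
    rows = expand_out(rows, "t", "ON", "seg", "Segment")
    rows = expand_in(rows, "seg", "STRAIGHT", "sw", "Switch")
    # Selection on sw.position, then projection to (t.number, sw).
    rows = [r for r in rows if VERTICES[r["sw"]]["props"]["position"] == "diverging"]
    return [(VERTICES[r["t"]]["props"]["number"], r["sw"]) for r in rows]


print(trailing_the_switch())
```

Note that the selection and the projection have to look up properties in the graph itself, which is the property access problem discussed later in the talk.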
THEORETICAL ISSUES WITH INCREMENTAL CYPHER
• Graph RA is not incrementally maintainable
  ◦ Expand operators
  ◦ Property access needs nested data structures (0NF)
  ◦ Ordering
  ◦ Weak schema
• Most incremental approaches work on flat relational algebra:
  ◦ Transform graph relational algebra to a flat one
  ◦ Optimize query
G. Szárnyas:
Incremental View Maintenance for Property Graph Queries,
arXiv preprint, 2017
PROPOSED WORKFLOW
• Parse
• Compile
• Evaluate query
[Figure: the openCypher query is parsed to an AST, which a “Magic” step turns into the deployed query]
QUERY “TRAILING THE SWITCH”
PROPERTY ACCESS
Assuming that x is a column of a
graph relation, we use the notation
“x.a” in selection conditions to
express the access to the
corresponding value of property a in
the property graph.
J. Hölsch, M. Grossniklaus:
An algebra and equivalences to transform graph patterns in Neo4j,
GraphQ @ EDBT 2016
SCHEMA INFERENCING
Each operator of the “Trailing the switch” plan is annotated with three schemas: 1. external schema, 2. extra attributes, 3. internal schema.
• π t.number, sw: 1. t.number, sw; 2. t.number; 3. t.number, sw
• σ sw.position = 'diverging': 1. t, seg, sw; 2. t.number, sw.position; 3. t, seg, t.number, sw, sw.position
• ⋈: 1. t, seg, sw; 2. t.number, sw.position; 3. t, seg, t.number, sw, sw.position
• (t:Train)−[:ON]−>(seg:Segment): 1. t, seg; 2. t.number; 3. t, seg, t.number
• (sw:Switch)−[:STRAIGHT]−>(seg:Segment): 1. sw, seg; 2. sw.position; 3. sw, seg, sw.position
This is the current implementation
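One way to read the schema inferencing step is as a top-down pass over the plan: every property access such as t.number is pushed down until it reaches the operator that binds the variable, and each operator on the way carries it as an extra attribute. The Python sketch below is a hypothetical reconstruction of that idea (the `Op` class and the function names are made up for this example and do not mirror ingraph's code).

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    external: list                                # columns the operator exposes
    needs: list = field(default_factory=list)     # properties it reads itself
    children: list = field(default_factory=list)
    extra: list = field(default_factory=list)     # filled in by infer()


def visible(op):
    """All variables bound in the subtree rooted at `op`."""
    names = set(op.external)
    for child in op.children:
        names |= visible(child)
    return names


def infer(op, required=()):
    """Push required property accesses ("x.a") down to the operators binding x."""
    wanted = list(dict.fromkeys(list(required) + op.needs))
    # Keep only the properties whose variable is available in this subtree.
    op.extra = [p for p in wanted if p.split(".")[0] in visible(op)]
    for child in op.children:
        infer(child, op.extra)
    return op


def internal(op):
    """Internal schema: external schema plus the extra attributes."""
    return op.external + [p for p in op.extra if p not in op.external]


on = Op("ON", ["t", "seg"])
straight = Op("STRAIGHT", ["sw", "seg"])
join = Op("join", ["t", "seg", "sw"], children=[on, straight])
select = Op("select", ["t", "seg", "sw"], needs=["sw.position"], children=[join])
project = Op("project", ["t.number", "sw"], needs=["t.number"], children=[select])
infer(project)

for op in (project, select, join, on, straight):
    print(op.name, op.extra, internal(op))
```

In this sketch the leaf for the ON edge ends up with the internal schema t, seg, t.number, and the leaf for the STRAIGHT edge with sw, seg, sw.position, so the property values travel with the tuples instead of being looked up in the graph later.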
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
[Figure: compilation pipeline from the openCypher query through the AST, Graph RA, Nested RA and Flat RA to the deployed query]
SPARK SQL
• “Spark SQL lets you query structured data inside Spark
  programs, using either SQL or a familiar DataFrame API.”
http://www.gatorsmile.io/sparksqloverview/
SPARK CATALYST
• Tree Manipulation Framework
  ◦ “Catalyst is an execution-agnostic framework to represent and
    manipulate a dataflow graph, i.e. trees of relational operators and
    expressions.”
• Optimizer (both cost-based and rule-based)
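The tree manipulation idea can be shown without Spark. The following Python sketch is only an analogy to a rule-based tree rewrite, assuming a tiny expression tree with hypothetical `Literal`, `Attribute` and `Add` node types; it does not use or reproduce the Catalyst API, where rules are typically written as Scala pattern matches passed to a tree's transform method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Rewrite the children first, then apply `rule` to the rebuilt node."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: replace the sum of two literals with a single literal."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return node


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
print(transform_up(tree, fold_constants))
```

Because the nodes are immutable, each rule returns a new tree and leaves the input untouched, which makes it easy to chain many small rules in an optimizer.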
SPARK CATALYST: OBSERVATIONS
• Strong community
• Well-written in general, but noisy here and there (Hive)
• Nice API docs… but not much else
CATALYST EXAMPLES: TREE TRANSFORMATION
CATALYST EXAMPLES: ATTRIBUTE RESOLVER
CATALYST FEATURES: CODE GENERATION
• Generates bytecode for performance
H. Karau, R. Warren:
High Performance Spark: Best Practices for Scaling and
Optimizing Apache Spark
O'Reilly Media, Inc., May 25, 2017
Scalable graph queries
MAKING IT SCALE
[Figure: the Rete network of the “Trailing the switch” query (π, σ and ⋈ over the ON and STRAIGHT inputs) with its cached tuples]
• Actors
• Async messages
G. Szárnyas et al.,
IncQuery-D: A distributed incremental model query framework in the cloud.
ACM/IEEE MODELS, 2014
ARCHITECTURE
[Figure: compilation pipeline: openCypher query, AST, Graph RA, Nested RA, Flat RA, deployed query]
Related work and summary
CAPS: CYPHER FOR APACHE SPARK
• An openCypher project
• “CAPS is built on top of the Spark DataFrames API and uses
  features such as the Catalyst optimizer.”
• Approach
  ◦ Compiles queries to operations on a custom dataflow graph
  ◦ Transforms the dataflow graph to queries on the DataFrames API
    (backed by Catalyst)
LESSONS LEARNT
• Simply extending the SQL model is insufficient
• Implemented new components from scratch
  ◦ Logical plans
  ◦ Attribute resolver
• Still reused a lot of components
  ◦ Data model
  ◦ Expressions
  ◦ Transformations
  ◦ Built-in methods: toString, output, etc.
FUTURE DIRECTIONS
• Cost-based optimizer
• Experiment with the LDBC Social Network Benchmark
• Transform queries to SQL
• Integrate the engine into Spark
G. Szárnyas, A. Prat, A. Averbuch et al.:
The LDBC Social Network Benchmark: BI Workload.
Technical report, peer-reviewed paper in 2018
RELATED RESOURCES
Ingraph github.com/ftsrg/ingraph
Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark
Slizaa openCypher github.com/slizaa/slizaa-opencypher-xtext
Mastering Apache Spark jaceklaskowski.gitbooks.io/mastering-apache-spark
Scala Days presentation people.apache.org/… | youtu.be/6bCpISym_0w
Deep dive blogpost databricks.com/blog/2015/04/13/deep-dive-…
Thanks to the ingraph team for their contributions.
Compiling openCypher graph queries with Spark Catalyst

  • 1. Compiling openCypher graph queries with Spark Catalyst Gábor Szárnyas pre-holiday Spark des fêtes @ Montréal
  • 2. BACKGROUND  PhD student @ Budapest Univ. of Tech. and Econ., Hungary  Visiting researcher @ McGill University
  • 3. RESEARCH TOPIC Problem statement  Large graph (100M+ nodes)  Complex global graph queries  Evaluate them in <1sec Approach  We “cheat”  Build a huge cache  Maintain results: incremental views
  • 4. RESEARCH OBJECTIVES Create a scalable graph query engine with incremental views 1. graph queries 2. incremental views 3. making it scale
  • 6. PROPERTY GRAPH DATABASES NoSQL family Data model: vertices, edges and properties #1 query approach: graph pattern matching Note. Spark GraphX is an engine for graph analytics.
  • 7. CYPHER AND OPENCYPHER Cypher: query language of the Neo4j graph database. „Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax.” MATCH (p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup) WHERE m.date = 'Monday, December 18, 2017' RETURN p „The openCypher project aims to deliver a full and open specification of the industry’s most widely adopted graph database query language: Cypher.” (late 2015)
  • 8. OPENCYPHER SYSTEMS  Increasing adoption  Relational databases: o SAP HANA o AGENS Graph  Research prototypes: o Graphflow (Univesity of Waterloo) o ingraph (incremental graph engine) (Source: Keynote talk @ GraphConnect NYC 2017)
  • 9. LINKED DATA BENCHMARK COUNCIL LDBC is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management software. LDBC’s Social Network Benchmark is an industrial and academic initiative, formed by principal actors in the field of graph-like data management.
  • 10. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  • 11. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Example: „Friends’ recent likes” MATCH (u:User {id: $userId})-[:FRIEND]- (f:User)-[l:LIKES]->(p:Post) RETURN f, p ORDER BY l.timestamp DESC LIMIT 10
  • 12. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Orri Erling et al., The LDBC Social Network Benchmark: Interactive Workload, SIGMOD 2015 14 queries and 8 updates
  • 13. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations OLAP global queries Example: „One-sided friendships” MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post), (u1)-[:AUTHOR_OF]->(p) WITH u1, u2, count(l) AS likes WHERE likes > 10 AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2) RETURN u1, u2
  • 14. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Arnau Prat, Gábor Szárnyas, Alex Averbuch et al., The LDBC Social Network Benchmark: BI Workload, Technical report available, peer-reviewed paper in 2018 OLAP global queries 25 queries with infrequent executions
  • 15. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations • PageRank • Shortest paths • Clustering coefficient Example: „Find the most central individuals.” Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
  • 16. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations Alexandru Iosup et al., LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms, VLDB 2016 One-time execution
  • 17. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  • 19. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL Trailing the switch Proximity detection
  • 20. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL c d e g fdiv 2 NEXT NEXT STRAIGHT TOP ON a b 1 NEXT ON NEXT
  • 22. TRAILING THE SWITCH seg div t STRAIGHT ON MATCH (t:Train)-[:ON]->(seg:Segment) <-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw Evaluate continuously
  • 23. INCREMENTAL QUERIES  Register a set of standing queries  Continuously evaluate queries on changes  Approach: build a cache and maintain its content  First publication: 1974, the Rete algorithm ingraphclient register queries query results change notifications update graph
  • 24. TRAILING THE SWITCH [Figure: the railway graph next to the query's Rete network: input nodes ON and STRAIGHT feed a join (⋈), followed by σ sw.position = 'diverging' and π t.number, sw]
  • 25. TRAILING THE SWITCH [Animation: the ON input node stores the tuples (a, 1) and (e, 2)]
  • 26. TRAILING THE SWITCH [Animation: the STRAIGHT input node stores the tuple (e, div)]
  • 27. TRAILING THE SWITCH [Animation: the join matches the two inputs on segment e and produces (e, div, 2)]
  • 28. TRAILING THE SWITCH [Animation: the joined tuple passes the selection σ sw.position = 'diverging']
  • 29. TRAILING THE SWITCH [Animation: the projection π t.number, sw emits the result (div, 2)]
  • 30. TRAILING THE SWITCH [Animation: the result (div, 2) remains at the output]
  • 31. TRAILING THE SWITCH [Animation: in the graph, train 2 moves from segment e to segment d]
  • 32. TRAILING THE SWITCH [Animation: the ON input node now stores (d, 2) and (a, 1)]
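The walk-through above can be sketched as a join node that caches both of its inputs and reports only the change in its output. This is an illustrative sketch, not ingraph's implementation: the class `JoinNode`, its method names, and the `(sign, tuple)` delta format are hypothetical, and the selection and projection steps are left out.

```python
class JoinNode:
    """Caches both inputs and returns result deltas on every change."""

    def __init__(self):
        self.on = set()        # (segment, train) tuples
        self.straight = set()  # (segment, switch) tuples
        self.results = set()   # (train, segment, switch) matches

    def _apply(self, matches, sign):
        # sign +1: new matches appear, sign -1: existing matches disappear
        if sign > 0:
            self.results |= matches
        else:
            self.results -= matches
        return [(sign, m) for m in sorted(matches)]

    def insert_on(self, seg, train):
        self.on.add((seg, train))
        new = {(train, seg, sw) for s, sw in self.straight if s == seg}
        return self._apply(new, +1)

    def delete_on(self, seg, train):
        self.on.discard((seg, train))
        gone = {(train, seg, sw) for s, sw in self.straight if s == seg}
        return self._apply(gone, -1)

    def insert_straight(self, seg, switch):
        self.straight.add((seg, switch))
        new = {(t, seg, switch) for s, t in self.on if s == seg}
        return self._apply(new, +1)


node = JoinNode()
node.insert_on("a", 1)
node.insert_on("e", 2)
appeared = node.insert_straight("e", "div")   # train 2 is on the straight segment

# Train 2 moves from e to d: one deletion and one insertion on the ON input.
disappeared = node.delete_on("e", 2)
moved = node.insert_on("d", 2)
```

Each update only touches the tuples that share the changed segment, so the cost depends on the size of the change rather than on the size of the graph.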
  • 33. GRAPH RELATIONAL ALGEBRA  Basic relational algebra o projection, selection, join, left outer join, antijoin, union  Common extensions o aggregation (𝛾), duplicate-elimination (𝛿), sort (𝜏), top (𝜆)  Graph-specific extensions o get-vertices (○) o expand-out (↑), expand-in (↓), expand-both (↕) J. Marton, G. Szárnyas, D. Varró: Formalising openCypher Graph Queries in Relational Algebra, ADBIS, Springer, 2017
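To make the graph-specific operators concrete, here is a small sketch of get-vertices, expand-out and expand-in over an in-memory property graph. The function names, their signatures, and the dict-based graph encoding are assumptions of this sketch and do not follow the formal definitions in the cited paper.

```python
# A tiny property graph; the dict-based encoding is an assumption of this sketch.
vertices = {
    "1": {"labels": {"Train"}, "props": {"number": 1}},
    "2": {"labels": {"Train"}, "props": {"number": 2}},
    "a": {"labels": {"Segment"}, "props": {}},
    "e": {"labels": {"Segment"}, "props": {}},
    "div": {"labels": {"Switch"}, "props": {"position": "diverging"}},
}
edges = [("1", "ON", "a"), ("2", "ON", "e"), ("div", "STRAIGHT", "e")]


def get_vertices(var, label):
    """Nullary operator: one single-column row per vertex with the label."""
    return [{var: v} for v, data in sorted(vertices.items())
            if label in data["labels"]]


def expand_out(rows, known, edge_type, new, new_label):
    """Extend each row along outgoing edges of the vertex bound to `known`."""
    return [{**row, new: dst} for row in rows for src, typ, dst in edges
            if src == row[known] and typ == edge_type
            and new_label in vertices[dst]["labels"]]


def expand_in(rows, known, edge_type, new, new_label):
    """Extend each row along incoming edges of the vertex bound to `known`."""
    return [{**row, new: src} for row in rows for src, typ, dst in edges
            if dst == row[known] and typ == edge_type
            and new_label in vertices[src]["labels"]]


# (t:Train)-[:ON]->(seg:Segment)<-[:STRAIGHT]-(sw:Switch)
rows = get_vertices("t", "Train")
rows = expand_out(rows, "t", "ON", "seg", "Segment")
rows = expand_in(rows, "seg", "STRAIGHT", "sw", "Switch")

# selection on sw.position, then projection to (t.number, sw)
selected = [r for r in rows
            if vertices[r["sw"]]["props"]["position"] == "diverging"]
result = [(vertices[r["t"]]["props"]["number"], r["sw"]) for r in selected]
```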
  • 36. THEORETICAL ISSUES WITH INCREMENTAL CYPHER  Graph RA is not incrementally maintainable o Expand operators o Property access needs nested data structures (0NF) o Ordering o Weak schema  Most incremental approaches work on flat relational algebra: o Transform graph relational algebra to a flat one o Optimize query G. Szárnyas: Incremental View Maintenance for Property Graph Queries, arXiv preprint, 2017
  • 37. PROPOSED WORKFLOW  Parse  Compile  Evaluate query [Figure: openCypher query → AST → "Magic" → Deployed query]
  • 39. PROPERTY ACCESS Assuming that x is a column of a graph relation, we use the notation “x.a” in selection conditions to express the access to the corresponding value of property a in the property graph. J. Hölsch, M. Grossniklaus: An algebra and equivalences to transform graph patterns in Neo4j, GraphQ @ EDBT 2016
  • 40. SCHEMA INFERENCING 1. external schema 2. extra attributes 3. internal schema This is the current implementation. [Figure: query plan π t.number, sw ← σ sw.position = 'diverging' ← ⋈ over the inputs (t:Train)-[:ON]->(seg:Segment) and (sw:Switch)-[:STRAIGHT]->(seg:Segment), annotated per operator. ON input: external schema (t, seg), internal schema (t, seg, t.number). STRAIGHT input: external schema (sw, seg), internal schema (sw, seg, sw.position). Join and selection: external schema (t, seg, sw), internal schema (t, seg, t.number, sw, sw.position). Projection: (t.number, sw). Extra attributes: t.number, sw.position]
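The three steps on this slide can be sketched as a top-down pass over the query plan that collects the property accesses used higher up and adds them to the schema of every operator that binds the corresponding variable. The `Op` class, its fields, and the `infer` function are hypothetical names for this sketch; the column order of the computed internal schema may differ from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    external: list                                 # 1. external schema
    uses: list = field(default_factory=list)       # property accesses evaluated here
    children: list = field(default_factory=list)
    internal: list = field(default_factory=list)   # 3. internal schema, set by infer()


def infer(op, required=()):
    """Push the extra attributes (step 2) down to the operators that can provide them."""
    needed = list(dict.fromkeys([*required, *op.uses]))  # de-duplicate, keep order
    # a property x.a can be provided here only if variable x is bound here
    extra = [p for p in needed
             if p.split(".")[0] in op.external and p not in op.external]
    op.internal = op.external + extra
    for child in op.children:
        infer(child, needed)
    return op


on = Op("ON", ["t", "seg"])
straight = Op("STRAIGHT", ["sw", "seg"])
join = Op("join", ["t", "seg", "sw"], children=[on, straight])
select = Op("select", ["t", "seg", "sw"], uses=["sw.position"], children=[join])
project = Op("project", ["t.number", "sw"], uses=["t.number"], children=[select])

infer(project)
for op in (project, select, join, on, straight):
    print(op.name, op.internal)
```

With the extra columns in place, every operator works on flat tuples and no longer needs to look up properties in the graph.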
  • 42. MATCH (t:Train)-[:ON]->(seg:Segment)<-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw [Figure: openCypher query → AST → Graph RA]
  • 43. [Figure: Graph RA → Nested RA → Flat RA → Deployed query]
  • 44. SPARK SQL  “Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.” http://www.gatorsmile.io/sparksqloverview/
  • 45. SPARK CATALYST  Tree Manipulation Framework o “Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.”  Optimizer (both cost-based and rule-based)
  • 46. SPARK CATALYST: OBSERVATIONS  Strong community  Well-written in general, but noisy here and there (Hive)  Nice API docs… but not much else
  • 47. CATALYST EXAMPLES: TREE TRANSFORMATION
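As a language-neutral illustration of rule-based tree transformation, the sketch below rewrites an expression tree bottom-up with a constant-folding rule. It is written in plain Python and does not use Catalyst's API: the classes `Literal`, `Attribute`, `Add` and the helper `transform_up` are stand-ins defined here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Rebuild the tree bottom-up, applying `rule` to every node.
    `rule` returns a replacement node, or None when it does not match."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule),
                   transform_up(node.right, rule))
    replacement = rule(node)
    return node if replacement is None else replacement


def fold_constants(node):
    """Rule: an addition of two literals becomes a single literal."""
    if (isinstance(node, Add) and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return None


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))   # x + (1 + 2)
folded = transform_up(tree, fold_constants)               # x + 3
print(folded)
```

The trees are immutable, so a rule never modifies a node in place; it returns a new node and the traversal rebuilds the path above it.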
  • 49. CATALYST FEATURES: CODE GENERATION  Generates bytecode for performance H. Karau, R. Warren: High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark O'Reilly Media, Inc., May 25, 2017
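The idea behind code generation can be illustrated by analogy: instead of interpreting an expression tree for every row, turn the tree into source code once and compile it. The sketch below does this in Python for the selection condition of the running example. Catalyst itself targets JVM bytecode, so this is only an analogy; the nested-tuple expression encoding and the function names are assumptions of this sketch.

```python
def to_source(expr):
    """Render a nested-tuple expression as Python source text."""
    kind = expr[0]
    if kind == "lit":
        return repr(expr[1])
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "eq":
        return f"({to_source(expr[1])} == {to_source(expr[2])})"
    raise ValueError(f"unknown expression kind: {kind}")


def generate_predicate(expr):
    """Generate the source once, compile it, and return a callable."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval")), source


# sw.position = 'diverging'
condition = ("eq", ("col", "sw.position"), ("lit", "diverging"))
predicate, source = generate_predicate(condition)

print(source)
print(predicate({"sw.position": "diverging"}))
```

The tree is walked only once, at generation time; evaluating a row afterwards runs the compiled function directly.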
  • 51. MAKING IT SCALE Actors, async messages [Figure: the Rete network of the running example, labelled "Actors" and "Async messages"] G. Szárnyas et al., IncQuery-D: A distributed incremental model query framework in the cloud. ACM/IEEE MODELS, 2014
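A rough sketch of the actor idea, with Python's asyncio tasks and queues standing in for an actor framework: each operator runs independently, owns an inbox, and communicates only through asynchronous messages. The function names, the `(sign, row)` message format, and the use of `None` as a shutdown message are assumptions of this sketch, not the cited system's design.

```python
import asyncio


async def selection_actor(inbox, outbox):
    """Forward only the rows whose switch is in the diverging position."""
    while True:
        msg = await inbox.get()
        if msg is None:                 # shutdown message: pass it on and stop
            await outbox.put(None)
            return
        sign, row = msg
        if row["sw.position"] == "diverging":
            await outbox.put((sign, row))


async def projection_actor(inbox, results):
    """Keep only t.number and sw, and record the change."""
    while True:
        msg = await inbox.get()
        if msg is None:
            return
        sign, row = msg
        results.append((sign, (row["t.number"], row["sw"])))


async def main():
    to_select, to_project = asyncio.Queue(), asyncio.Queue()
    results = []
    actors = [
        asyncio.create_task(selection_actor(to_select, to_project)),
        asyncio.create_task(projection_actor(to_project, results)),
    ]
    # messages that a join actor might send downstream
    await to_select.put((+1, {"t.number": 2, "sw": "div", "sw.position": "diverging"}))
    await to_select.put((+1, {"t.number": 1, "sw": "sw2", "sw.position": "straight"}))
    await to_select.put(None)
    await asyncio.gather(*actors)
    return results


results = asyncio.run(main())
print(results)
```

Because the operators share no state, they could in principle be placed on different machines, with the queues replaced by network messages.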
  • 52. ARCHITECTURE [Figure: openCypher query → AST → Graph RA → Nested RA → Flat RA → Deployed query]
  • 53. Related work and summary
  • 54. CAPS: CYPHER FOR APACHE SPARK  An openCypher project  “CAPS is built on top of the Spark DataFrames API and uses features such as the Catalyst optimizer.”  Approach o Compiles queries to operations in a custom dataflow graph o Transforms the dataflow graph to queries on the DataFrames API (backed by Catalyst)
  • 55. LESSONS LEARNT  Simply extending the SQL model is insufficient  Implemented new components from scratch o Logical plans o Attribute resolver  Still reused a lot of components o Data model o Expressions o Transformations o Built-in methods: toString, output, etc.
  • 56. FUTURE DIRECTIONS  Cost-based optimizer  Experiment with the LDBC Social Network Benchmark  Transform queries to SQL  Integrate the engine into Spark G. Szárnyas, A. Prat, A. Averbuch et al.: The LDBC Social Network Benchmark: BI Workload. Technical report; peer-reviewed paper in 2018
  • 57. RELATED RESOURCES Ingraph: github.com/ftsrg/ingraph; Cypher for Apache Spark: github.com/opencypher/cypher-for-apache-spark; Slizaa openCypher: github.com/slizaa/slizaa-opencypher-xtext; Mastering Apache Spark: jaceklaskowski.gitbooks.io/mastering-apache-spark; Scala Days presentation: people.apache.org/… | youtu.be/6bCpISym_0w; Deep dive blogpost: databricks.com/blog/2015/04/13/deep-dive-… Thanks to the ingraph team for their contributions.