Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Compiling openCypher graph queries
with Spark Catalyst
Gábor Szárnyas
pre-holiday Spark des fêtes @ Montréal
BACKGROUND
 PhD student @ Budapest Univ. of Tech. and Econ., Hungary
 Visiting researcher @ McGill University
RESEARCH TOPIC
Problem statement
 Large graph (100M+ nodes)
 Complex global graph queries
 Evaluate them in <1sec
Appro...
RESEARCH OBJECTIVES
Create a scalable graph query engine with incremental views
1. graph queries
2. incremental views
3. m...
Graph queries
PROPERTY GRAPH DATABASES
NoSQL family
Data model:
vertices, edges
and properties
#1 query approach: graph pattern matching...
CYPHER AND OPENCYPHER
Cypher: query language of the Neo4j graph database.
„Cypher is a declarative, SQL-inspired language ...
OPENCYPHER SYSTEMS
 Increasing adoption
 Relational databases:
o SAP HANA
o AGENS Graph
 Research prototypes:
o Graphfl...
LINKED DATA BENCHMARK COUNCIL
LDBC is a non-profit organization dedicated to establishing
benchmarks, benchmark practices ...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Example: „Friends’ recen...
OLAP global queries
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Orri Erling et al.,
The ...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
OLAP global queries
Example: „One-sided frie...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
local queries
global computations
Arnau Prat, Gábor Szárnyas, Alex Averbuch et...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
• PageRank
• Shortest pa...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
Alexandru Iosup et al.,
...
OVERVIEW OF GRAPH PROCESSING
OLTP
analytics
OLAP
local queries
global queries
global computations
Incremental view maintenance
CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL
Trailing the switch
Proximity detection
CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
ON
a b
1
NEXT
ON
NEXT
PROXIMITY DETECTION
seg
1
NEXT: 1..2
t1
ON
MATCH
(t1:Train)-[:ON]->(seg1:Segment)
-[:NEXT*1..2]->(seg2:Segment)
<-[:ON]-(t...
TRAILING THE SWITCH
seg div
t
STRAIGHT
ON
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position...
INCREMENTAL QUERIES
 Register a set of standing queries
 Continuously evaluate queries on changes
 Approach: build a ca...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON
ON
div
STRA...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e2
ON
a1
ON
c d e
g
fdiv
2
NEXT NEXT
STRAIGHT TOP
a b
1
NEXT NEXT
ON...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
c d e
g
fdiv
2
NEXT N...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
div
STRAIGHTON
e div
STRAIGHT
2
ON
c d e
...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
d...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
d...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
e2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
d...
πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
e div
STRAIGHT
2
ON
e div
STRAIGHT
2
ON
d...
GRAPH RELATIONAL ALGEBRA
 Basic relational algebra
o projection, selection, join, left outer join, antijoin, union
 Comm...
THEORETICAL ISSUES WITH INCREMENTAL CYPHER
 Graph RA is not incrementally maintainable
o Expand operators
o Property acce...
PROPOSED WORKFLOW
 Parse
 Compile
 Evaluate query
openCypher
query
Magic
Deployed
query
AST
QUERY “TRAILING THE SWITCH”
PROPERTY ACCESS
Assuming that x is a column of a
graph relation, we use the notation
“x.a” in selection conditions to
expr...
t, seg
t, seg, t.number
sw, seg
sw, seg, sw.position
t.number, sw.position
πt.number, sw
σsw.position = ′diverging′
⋈
(sw:...
MATCH (t:Train)-[:ON]->(seg:Segment)
<-[:STRAIGHT]-(sw:Switch)
WHERE sw.position = 'diverging'
RETURN t.number, sw
openCyp...
Graph RA Flat RANested RA
Deployed
query
SPARK SQL
 “Spark SQL lets you query structured data inside Spark
programs, using either SQL or a familiar DataFrame API....
SPARK CATALYST
 Tree Manipulation Framework
o “Catalyst is an execution-agnostic framework to represent and
manipulate a ...
SPARK CATALYST: OBSERVATIONS
 Strong community
 Well-written in general, but noisy here and there (Hive)
 Nice API docs...
CATALYST EXAMPLES: TREE TRANSFORMATION
CATALYST EXAMPLES: ATTRIBUTE RESOLVER
CATALYST FEATURES: CODE GENERATION
 Generates bytecode for performance
H. Karau, R. Warren:
High Performance Spark: Best ...
Scalable graph queries
MAKING IT SCALE πt.number, sw
σsw.position = ′diverging′
⋈
STRAIGHTON e div
STRAIGHT
d2
ON
a1
ON
div
STRAIGHT
ON
Actors
As...
openCypher
query
AST Graph RA
Graph RA Flat RANested RA
Deployed
query
ARCHITECTURE
Related work and summary
CAPS: CYPHER FOR APACHE SPARK
 An openCypher project
 “CAPS is built on top of the Spark DataFrames API and uses
feature...
LESSONS LEARNT
 Simply extending the SQL model is insufficient
 Implemented new components from scratch
o Logical plans
...
FUTURE DIRECTIONS
 Cost-based optimizer
 Experiment with the LDBC Social Network Benchmark
 Transform queries to SQL
 ...
RELATED RESOURCES
Ingraph github.com/ftsrg/ingraph
Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark
S...
Compiling openCypher graph queries with Spark Catalyst
Compiling openCypher graph queries with Spark Catalyst
Compiling openCypher graph queries with Spark Catalyst
Upcoming SlideShare
Loading in …5
×

Compiling openCypher graph queries with Spark Catalyst

518 views

Published on

Presented at the Montréal Spark meetup @ https://www.meetup.com/Montreal-Apache-Spark-Meetup/events/245707751/

Published in: Engineering
  • Be the first to comment

Compiling openCypher graph queries with Spark Catalyst

  1. 1. Compiling openCypher graph queries with Spark Catalyst Gábor Szárnyas pre-holiday Spark des fêtes @ Montréal
  2. 2. BACKGROUND  PhD student @ Budapest Univ. of Tech. and Econ., Hungary  Visiting researcher @ McGill University
  3. 3. RESEARCH TOPIC Problem statement  Large graph (100M+ nodes)  Complex global graph queries  Evaluate them in <1sec Approach  We “cheat”  Build a huge cache  Maintain results: incremental views
  4. 4. RESEARCH OBJECTIVES Create a scalable graph query engine with incremental views 1. graph queries 2. incremental views 3. making it scale
  5. 5. Graph queries
  6. 6. PROPERTY GRAPH DATABASES NoSQL family Data model: vertices, edges and properties #1 query approach: graph pattern matching Note. Spark GraphX is an engine for graph analytics.
  7. 7. CYPHER AND OPENCYPHER Cypher: query language of the Neo4j graph database. „Cypher is a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax.” MATCH (p:Person)-[:PRESENTER_OF]->(:Presentation)-[:AT]->(m:Meetup) WHERE m.date = 'Monday, December 18, 2017' RETURN p „The openCypher project aims to deliver a full and open specification of the industry’s most widely adopted graph database query language: Cypher.” (late 2015)
  8. 8. OPENCYPHER SYSTEMS  Increasing adoption  Relational databases: o SAP HANA o AGENS Graph  Research prototypes: o Graphflow (Univesity of Waterloo) o ingraph (incremental graph engine) (Source: Keynote talk @ GraphConnect NYC 2017)
  9. 9. LINKED DATA BENCHMARK COUNCIL LDBC is a non-profit organization dedicated to establishing benchmarks, benchmark practices and benchmark results for graph data management software. LDBC’s Social Network Benchmark is an industrial and academic initiative, formed by principal actors in the field of graph-like data management.
  10. 10. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  11. 11. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Example: „Friends’ recent likes” MATCH (u:User {id: $userId})-[:FRIEND]- (f:User)-[l:LIKES]->(p:Post) RETURN f, p ORDER BY l.timestamp DESC LIMIT 10
  12. 12. OLAP global queries OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Orri Erling et al., The LDBC Social Network Benchmark: Interactive Workload, SIGMOD 2015 14 queries and 8 updates
  13. 13. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations OLAP global queries Example: „One-sided friendships” MATCH (u1:User)-[:FRIEND]-(u2:User)-[l:LIKES]->(p:Post), (u1)-[:AUTHOR_OF]->(p) WITH u1, u2, count(l) AS likes WHERE likes > 10 AND NOT (u1)-[:LIKES]->(:Post)<-[:AUTHOR_OF]-(u2) RETURN u1, u2
  14. 14. OVERVIEW OF GRAPH PROCESSING OLTP analytics local queries global computations Arnau Prat, Gábor Szárnyas, Alex Averbuch et al., The LDBC Social Network Benchmark: BI Workload, Technical report available, peer-reviewed paper in 2018 OLAP global queries 25 queries with infrequent executions
  15. 15. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations • PageRank • Shortest paths • Clustering coefficient Example: „Find the most central individuals.” Spark: GraphX | Flink: Gelly | Neo4j: Graph Algorithms library
  16. 16. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations Alexandru Iosup et al., LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms, VLDB 2016 One-time execution
  17. 17. OVERVIEW OF GRAPH PROCESSING OLTP analytics OLAP local queries global queries global computations
  18. 18. Incremental view maintenance
  19. 19. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL Trailing the switch Proximity detection
  20. 20. CYBER-PHYSICAL SYSTEMS: LIVE RAILWAY MODEL c d e g fdiv 2 NEXT NEXT STRAIGHT TOP ON a b 1 NEXT ON NEXT
  21. 21. PROXIMITY DETECTION seg 1 NEXT: 1..2 t1 ON MATCH (t1:Train)-[:ON]->(seg1:Segment) -[:NEXT*1..2]->(seg2:Segment) <-[:ON]-(t2:Train) RETURN t1, t2, seg1, seg2 seg 2 t2 ON ≤ 𝟏 segments
  22. 22. TRAILING THE SWITCH seg div t STRAIGHT ON MATCH (t:Train)-[:ON]->(seg:Segment) <-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw Evaluate continuously
  23. 23. INCREMENTAL QUERIES  Register a set of standing queries  Continuously evaluate queries on changes  Approach: build a cache and maintain its content  First publication: 1974, the Rete algorithm ingraphclient register queries query results change notifications update graph
  24. 24. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON
  25. 25. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e2 ON a1 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON a 1 ON e 2 ON
  26. 26. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e2 ON a1 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e div STRAIGHT e div STRAIGHT
  27. 27. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e div STRAIGHT e2 ON e div 2 STRAIGHT ON
  28. 28. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT e2 ON a1 ON div STRAIGHTON e div STRAIGHT 2 ON c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON e2 div STRAIGHTON e2
  29. 29. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON div 2
  30. 30. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c d e g fdiv 2 NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON ON div STRAIGHT Trailing the switch ON div 2
  31. 31. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT e2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c e g fdiv NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON div STRAIGHT Trailing the switch ON div ON 2 d
  32. 32. πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT d2 ON a1 ON e div STRAIGHT 2 ON e div STRAIGHT 2 ON div2 c e g fdiv NEXT NEXT STRAIGHT TOP a b 1 NEXT NEXT ON div STRAIGHT Trailing the switch ON div ON 2 d
  33. 33. GRAPH RELATIONAL ALGEBRA  Basic relational algebra o projection, selection, join, left outer join, antijoin, union  Common extensions o aggregation (𝛾), duplicate-elimination (𝛿), sort (𝜏), top (𝜆)  Graph-specific extensions o get-vertices () o expand-out (↑), expand-in (↓), expand-both (↕) J. Marton, G. Szárnyas, D. Varró: Formalising openCypher Graph Queries in Relational Algebra, ADBIS, Springer, 2017
  34. 34. THEORETICAL ISSUES WITH INCREMENTAL CYPHER  Graph RA is not incrementally maintainable o Expand operators o Property access needs nested data structures (0NF) o Ordering o Weak schema  Most incremental approaches work on flat relational algebra: o Transform graph relational algebra to a flat one o Optimize query G. Szárnyas: Incremental View Maintenance for Property Graph Queries, arXiv preprint, 2017
  35. 35. PROPOSED WORKFLOW  Parse  Compile  Evaluate query openCypher query Magic Deployed query AST
  36. 36. QUERY “TRAILING THE SWITCH”
  37. 37. PROPERTY ACCESS Assuming that x is a column of a graph relation, we use the notation “x.a” in selection conditions to express the access to the corresponding value of property a in the property graph. J. Hölsch, M. Grossniklaus: An algebra and equivalences to transform graph patterns in Neo4j, GraphQ @ EDBT 2016
  38. 38. t, seg t, seg, t.number sw, seg sw, seg, sw.position t.number, sw.position πt.number, sw σsw.position = ′diverging′ ⋈ (sw:Switch)−[:STRAIGHT]−>(seg:Segment)(t:Train)−[:ON]−>(seg:Segment) t.number, sw t.number, sw t, seg, sw t, seg, t.number, sw, sw.position t, seg, sw t, seg, t.number, sw, sw.position t.number t.number, sw.position sw.positiont.number 2 1. external schema 2. extra attributes 3. internal schema This is the current implementation SCHEMA INFERENCING
  39. 39. MATCH (t:Train)-[:ON]->(seg:Segment) <-[:STRAIGHT]-(sw:Switch) WHERE sw.position = 'diverging' RETURN t.number, sw openCypher query AST Graph RA
  40. 40. Graph RA Flat RANested RA Deployed query
  41. 41. SPARK SQL  “Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API.” http://www.gatorsmile.io/sparksqloverview/ http://www.gatorsmile.io/sparksqloverview/
  42. 42. SPARK CATALYST  Tree Manipulation Framework o “Catalyst is an execution-agnostic framework to represent and manipulate a dataflow graph, i.e. trees of relational operators and expressions.”  Optimizer (both cost-based and rule-based) Catalyst
  43. 43. SPARK CATALYST: OBSERVATIONS  Strong community  Well-written in general, but noisy here and there (Hive)  Nice API docs… but not much else
  44. 44. CATALYST EXAMPLES: TREE TRANSFORMATION
  45. 45. CATALYST EXAMPLES: ATTRIBUTE RESOLVER
  46. 46. CATALYST FEATURES: CODE GENERATION  Generates bytecode for performance H. Karau, R. Warren: High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark O'Reilly Media, Inc., May 25, 2017
  47. 47. Scalable graph queries
  48. 48. MAKING IT SCALE πt.number, sw σsw.position = ′diverging′ ⋈ STRAIGHTON e div STRAIGHT d2 ON a1 ON div STRAIGHT ON Actors Async messages G. Szárnyas et al., IncQuery-D: A distributed incremental model query framework in the cloud. ACM/IEEE MODELS, 2014
  49. 49. openCypher query AST Graph RA Graph RA Flat RANested RA Deployed query ARCHITECTURE
  50. 50. Related work and summary
  51. 51. CAPS: CYPHER FOR APACHE SPARK  An openCypher project  “CAPS is built on top of the Spark DataFrames API and uses features such as the Catalyst optimizer.”  Approach o Compiles to operations to a custom dataflow graph o Transforms the dataflow graph to queries on the DataFrames API (backed by Catalyst)
  52. 52. LESSONS LEARNT  Simply extending the SQL model is insufficient  Implemented new components from scratch o Logical plans o Attribute resolver  Still reused a lot of components o Data model o Expressions o Transformations o Built-in methods: toString, output, etc.
  53. 53. FUTURE DIRECTIONS  Cost-based optimizer  Experiment with the LDBC Social Network Benchmark  Transform queries to SQL  Integrate engine to Spark G. Szárnyas, A. Prat, A. Averbuch et al.: The LDBC Social Network Benchmark: BI Workload. Technical report, peer-reviewed paper in 2018
  54. 54. RELATED RESOURCES Ingraph github.com/ftsrg/ingraph Cypher for Apache Spark github.com/opencypher/cypher-for-apache-spark Slizaa openCypher github.com/slizaa/slizaa-opencypher-xtext Mastering Apache Spark jaceklaskowski.gitbooks.io/mastering-apache-spark Scala Days presentation people.apache.org/… | youtu.be/6bCpISym_0w Deep dive blogpost databricks.com/blog/2015/04/13/deep-dive-… Thanks for the contributions to the ingraph team.

×