Writing a Cypher Engine in Clojure
Dávid Szakállas (BME)
Gábor Szárnyas (BME, MTA)
3rd Budapest Clojure meetup 2018
AGENDA
Gábor:
 Graph databases
 Cypher language
 Query evaluation
Dávid:
 Graph exploration strategies
 Optimization
 How to compile patterns
GRAPHS
𝐺 = 𝑉, 𝐸 𝐸 ⊆ 𝑉 × 𝑉
Textbook definition
Flavours:
 directed/undirected edges
 labels
 properties
GRAPH DATABASES
pattern
matching
NoSQL family
Data model: property graphs = vertices, edges, and properties
analytics
CYPHER
„Cypher is a declarative, SQL-inspired language for describing
patterns in graphs visually using an ascii-art syntax.”
MATCH (pers:Person)-[:PRESENTERS]->
(:Presentation)<-[:TOPIC_OF]->(proj:Project)
WHERE proj.name = 'ingraph'
RETURN pers.name
pers.name
David
Gabor
DBMS POPULARITY BY DATA MODEL
version 2.0 introduces
the Cypher query language
OPENCYPHER SYSTEMS
 Goal: deliver a full and open specification of Cypher
 Relational databases:
o SAP HANA
o AGENS Graph
 Research prototypes:
o Graphflow (Univesity of Waterloo)
o ingraph (incremental graph engine)
(Source: Keynote talk @ GraphConnect NYC 2017)
USE CASES
 Analytics
o IT security
o Investigative journalism
 “Real-time” execution
o Fraud detection
o Recommendation systems
INGRAPH
 MTA-BME project
 PoC query engine for openCypher
 Primarily written in Scala
 Goals:
o Provide continuous evaluation (incremental view maintenance)
o Cover standard openCypher constructs
o Does not work efficiently for one-time query execution
Szárnyas, G. et al.,
IncQuery-D: A distributed incremental model query framework in the cloud.
MODELS, 2014,
https://link.springer.com/chapter/10.1007/978-3-319-11653-2_40
GRAPH QUERY PROCESSING APPROACHES
 Relational: projection, selection, join
o SAP HANA
o Agens Graph (PostgreSQL + Cypher)
o ingraph
 Relational + expand
o Neo4j
 Constraint satisfaction
o VIATRA (BME project since 2002): “local search” engine
o Our proposed openCypher engine (this talk)
RELATIONAL MODEL
person parent
Alice Sarah
Alice Bob
a b
Bob Sarah
Parent
Married
Person
{ name: Alice }
Person
{ name: Bob }
Person
{ name: Sarah }
PARENT
PARENT
MARRIED
PROPERTY GRAPH
label
properties
edge type
EXPLORATION BASED METHODS
 given this graph  and query
Person
{ name: Alice }
Person
{ name: Bob }
Person
{ name: Sarah }
PARENT
PARENT
MARRIED
MATCH (u:Person)
-[:PARENT]->
(f:Person)
-[m:MARRIED]->
(m:Person)
RETURN m, f
find an execution plan
EXPLORATION BASED METHODS
Person
{ name: Alice }
Person
{ name: Bob }
Person
{ name: Sarah }
PARENT
PARENT
MARRIED
 DFT -> small memory footprint
 Small state-space -> less steps
EXPLORATION BASED METHODS
Goal: minimize state-space generated for a query.
 Cost based optimizations
 Constraint satisfaction problems
 SAT is an NP-full problem
 Greedy algorithm unsatisfactory
EXPLORATION BASED METHODS
Goal: minimize state-space generated for a query.
 Iterative Dynamic Programming (IDP):
oGeneralization of DP and greedy algorithms
oPolynomial complexity
oMultiple variants
oWidely researched in the RDMS area
-> use iterative dynamic programming
FULL VS GREEDY VS IDP
⋮
:e _ -[_]-> _
MATCH (a)-[:e]->(b)
RETURN a, b
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1,0 100,0
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1 100
a-[:e]->b a-[:e]->b
2 0.02
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1 100
_-[_]->_
a-[_]->_ _-[_]->b
1 100
a-[:e]->b a-[:e]->b
2 0.02
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1 100
_-[_]->_
a-[_]->_ _-[_]->b
1 100
a-[:e]->b a-[:e]->b
2 0.02
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1 100
a-[:e]->b
2
_-[_]->_
a-[_]->_ _-[_]->b
1 100
a-[:e]->b a-[:e]->b
2 0.02
FULL VS GREEDY VS IDP
⋮
:e _-[_]->_
MATCH (a)-[:e]->(b)
RETURN a, b
a-[_]->_ _-[_]->b
1 100
a-[:e]->b
2
_-[_]->_
a-[_]->_ _-[_]->b
1 100
a-[:e]->b a-[:e]->b
2 0.02
IDP: Something in between...
MODEL-SENSITIVE APPROACH
G. Varró, F. Deckwerth, M. Wieber, A. Schürr,
An algorithm for generating model-sensitive search plans for pattern
matching on EMF models,
Software and Systems Modeling, 2013
 Derives cost based on the graph elements
 Object-oriented approach
 Different goal: model validation
Idea: adapt to property graphs.
ADAPTING IT
Requirements
 Variables for edges and vertices
 Estimate based on the graph
 Map to specialized indexer operations
Difficulties
 No type information
 Original algorithm oblivious to edges with
properties
CONSTRAINT DEFINITIONS
 Constraints and
operations are
separate
 Constraints can
imply other
constraints
OPERATIONS - CONSTRAINTS
OPERATIONS - CONSTRAINTS
IMMEDIATE
OPERATIONS – COSTS
 indexer support (context sensitive)
 Based on variables referenced in the constraints
 Given a (HasLabels a t) and a (Vertex a) constraints
we can get the number of vertices for that label
SIMPLE PATTERNS
Conjunction of positive terms
NEGATIVE PATTERNS
Compiles to a single constraint with inner constraints.
Negation needs at least one bound variable
BENCHMARKING
design
execution
CONCLUSION
 Difficult to adapt this OO algorithm to property graphs
 Clojure is well fitted to the problem
 Subpar performance from the initial implementation
 Looking for maintainers
 Interested? Contact us: @szarnyasg @szdavid92
FTSRG/ingraph
Dávid Szakállas: Evaluation of openCypher Graph Queries with a Local Search-Based Algorithm.
Master’s thesis, Budapest University of Technology and Economics, 2017.

Writing a Cypher Engine in Clojure

  • 1.
    Writing a CypherEngine in Clojure Dávid Szakállas (BME) Gábor Szárnyas (BME, MTA) 3rd Budapest Clojure meetup 2018
  • 2.
    AGENDA Gábor:  Graph databases Cypher language  Query evaluation Dávid:  Graph exploration strategies  Optimization  How to compile patterns
  • 3.
    GRAPHS 𝐺 = 𝑉,𝐸 𝐸 ⊆ 𝑉 × 𝑉 Textbook definition Flavours:  directed/undirected edges  labels  properties
  • 4.
    GRAPH DATABASES pattern matching NoSQL family Datamodel: property graphs = vertices, edges, and properties analytics
  • 6.
    CYPHER „Cypher is adeclarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax.” MATCH (pers:Person)-[:PRESENTERS]-> (:Presentation)<-[:TOPIC_OF]->(proj:Project) WHERE proj.name = 'ingraph' RETURN pers.name pers.name David Gabor
  • 7.
    DBMS POPULARITY BYDATA MODEL version 2.0 introduces the Cypher query language
  • 8.
    OPENCYPHER SYSTEMS  Goal:deliver a full and open specification of Cypher  Relational databases: o SAP HANA o AGENS Graph  Research prototypes: o Graphflow (Univesity of Waterloo) o ingraph (incremental graph engine) (Source: Keynote talk @ GraphConnect NYC 2017)
  • 9.
    USE CASES  Analytics oIT security o Investigative journalism  “Real-time” execution o Fraud detection o Recommendation systems
  • 10.
    INGRAPH  MTA-BME project PoC query engine for openCypher  Primarily written in Scala  Goals: o Provide continuous evaluation (incremental view maintenance) o Cover standard openCypher constructs o Does not work efficiently for one-time query execution Szárnyas, G. et al., IncQuery-D: A distributed incremental model query framework in the cloud. MODELS, 2014, https://link.springer.com/chapter/10.1007/978-3-319-11653-2_40
  • 13.
    GRAPH QUERY PROCESSINGAPPROACHES  Relational: projection, selection, join o SAP HANA o Agens Graph (PostgreSQL + Cypher) o ingraph  Relational + expand o Neo4j  Constraint satisfaction o VIATRA (BME project since 2002): “local search” engine o Our proposed openCypher engine (this talk)
  • 14.
    RELATIONAL MODEL person parent AliceSarah Alice Bob a b Bob Sarah Parent Married Person { name: Alice } Person { name: Bob } Person { name: Sarah } PARENT PARENT MARRIED PROPERTY GRAPH label properties edge type
  • 15.
    EXPLORATION BASED METHODS given this graph  and query Person { name: Alice } Person { name: Bob } Person { name: Sarah } PARENT PARENT MARRIED MATCH (u:Person) -[:PARENT]-> (f:Person) -[m:MARRIED]-> (m:Person) RETURN m, f find an execution plan
  • 16.
    EXPLORATION BASED METHODS Person {name: Alice } Person { name: Bob } Person { name: Sarah } PARENT PARENT MARRIED  DFT -> small memory footprint  Small state-space -> less steps
  • 17.
    EXPLORATION BASED METHODS Goal:minimize state-space generated for a query.  Cost based optimizations  Constraint satisfaction problems  SAT is an NP-full problem  Greedy algorithm unsatisfactory
  • 18.
    EXPLORATION BASED METHODS Goal:minimize state-space generated for a query.  Iterative Dynamic Programming (IDP): oGeneralization of DP and greedy algorithms oPolynomial complexity oMultiple variants oWidely researched in the RDMS area -> use iterative dynamic programming
  • 19.
    FULL VS GREEDYVS IDP ⋮ :e _ -[_]-> _ MATCH (a)-[:e]->(b) RETURN a, b
  • 20.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1,0 100,0
  • 21.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1 100 a-[:e]->b a-[:e]->b 2 0.02
  • 22.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1 100 _-[_]->_ a-[_]->_ _-[_]->b 1 100 a-[:e]->b a-[:e]->b 2 0.02
  • 23.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1 100 _-[_]->_ a-[_]->_ _-[_]->b 1 100 a-[:e]->b a-[:e]->b 2 0.02
  • 24.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1 100 a-[:e]->b 2 _-[_]->_ a-[_]->_ _-[_]->b 1 100 a-[:e]->b a-[:e]->b 2 0.02
  • 25.
    FULL VS GREEDYVS IDP ⋮ :e _-[_]->_ MATCH (a)-[:e]->(b) RETURN a, b a-[_]->_ _-[_]->b 1 100 a-[:e]->b 2 _-[_]->_ a-[_]->_ _-[_]->b 1 100 a-[:e]->b a-[:e]->b 2 0.02 IDP: Something in between...
  • 26.
    MODEL-SENSITIVE APPROACH G. Varró,F. Deckwerth, M. Wieber, A. Schürr, An algorithm for generating model-sensitive search plans for pattern matching on EMF models, Software and Systems Modeling, 2013  Derives cost based on the graph elements  Object-oriented approach  Different goal: model validation Idea: adapt to property graphs.
  • 27.
    ADAPTING IT Requirements  Variablesfor edges and vertices  Estimate based on the graph  Map to specialized indexer operations Difficulties  No type information  Original algorithm oblivious to edges with properties
  • 28.
    CONSTRAINT DEFINITIONS  Constraintsand operations are separate  Constraints can imply other constraints
  • 29.
  • 30.
  • 31.
    OPERATIONS – COSTS indexer support (context sensitive)  Based on variables referenced in the constraints  Given a (HasLabels a t) and a (Vertex a) constraints we can get the number of vertices for that label
  • 32.
  • 33.
    NEGATIVE PATTERNS Compiles toa single constraint with inner constraints. Negation needs at least one bound variable
  • 34.
  • 35.
    CONCLUSION  Difficult toadapt this OO algorithm to property graphs  Clojure is well fitted to the problem  Subpar performance from the initial implementation  Looking for maintainers  Interested? Contact us: @szarnyasg @szdavid92 FTSRG/ingraph Dávid Szakállas: Evaluation of openCypher Graph Queries with a Local Search-Based Algorithm. Master’s thesis, Budapest University of Technology and Economics, 2017.