GraphREL: A Relational Framework for Processing Sub-graph Queries

GraphREL: A Decomposition-Based and
Selectivity-Aware Relational Framework for Processing
Sub-graph Queries

Sherif Sakr
School of Computer Science and Engineering
University of New South Wales
.
http://www.cse.unsw.edu.au/∼ssakr/

BIT Seminars ’09 - Free University of Bolzano, Italy

16 November 2009

S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 1 / 40

Outline

Previous Work: Pathﬁnder - Relational XQuery Compiler.

Current Work: GraphREL - General Graph Query Processor.

Future Work: Scalable Graph Query Processing for New Generation
of Database Applications.


Outline





Pathﬁnder: A Relational XQuery Processor

XQuery Expression

Pathfinder

Relational Algebra

MIL Code Generator SQL Code Generator

MIL Scripts SQL Scripts

Monet DBMS Conventional RDBMS

http://pathﬁnder-xquery.org/

Pathﬁnder: A Relational XQuery Processor
XML XQuery
Document
Expression

Pathfinder

Relational Algebra + Special Properties [VLDB’04]

Estimation Rules Translation Templates

[VLDB’08]
XPath Accelerator Cardinality Properties
Encoding Tuples XQuery Estimator SQL Generator
+
Statistical Guide [SIGMOD’07]
Statistical Guide [IJWIS’09] Cardinality Properties Aware
[JDM’09]
Statistical Histograms SQL Scripts

Relational
Results XML XML
Conventional RDBMS Serializer
Statistical Histograms

S. Sakr (CSE, UNSW) System Administrator
BIT Seminars’09 16 November 2009 5 / 40

Outline





GraphREL: Motivations

Graphs are among the most complicated and general form of data
structures.
Recently, they have been widely used to model many complex
structured and schemaless data such as social networks, chemical
compounds, biological pathways, spatial databases, semantic web and
business process models.
Retrieving related graphs containing a query graph from a large graph
database is a key performance issue in all of these graph-based
applications.
The success of any graph database application is directly dependent
on the efficiency of the graph indexing and query processing
mechanisms.
RDBMSs have repeatedly shown that they are very efficient, scalable
and successful in hosting different kinds of data.

Preliminaries: Graph Data Model

In labelled graphs, vertices and edges represent the entities and the
relationships between them respectively.
The attributes associated with these entities and relationships are
called labels.
A graph database D is a collection of member graphs
D = {g1 , g2 , ...gn } where each member graph gi is denoted as
(V , E , Lv , Le ).
V is the set of vertices.
E ⊆ V × V is the set of edges joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.
labelled graphs are classiﬁed according to the direction of their edges
into two main classes:
1 Directed-labelled graphs such as XML, RDF and traﬃc networks.
2 Undirected-labelled graphs such as social networks and chemical
compounds.

Preliminaries: Graph Queries

In principle, queries in graph databases can be broadly classified into the
following main categories:
Subgraph queries: this category searches for a specific pattern in the
graph database. The pattern can be either a small graph or a graph
where some parts of it are uncertain, e.g., vertices with wildcard
labels.

Supergraph queries: this category searches for the graph database
members of which their whole structures are contained in the input
query.

Similarity (Approximate Matching) queries: this category finds
graphs which are similar, but not necessarily isomorphic to a given
query graph.


Preliminaries: Subgraph Search Queries

Given a graph database D = {g1 , g2 , ..., gn } and a graph query q, it
returns the query answer set A = {gi |q ⊆ gi , gi ∈ D}.

A graph q is described as a sub-graph of another graph database
member gi if the set of vertices and edges of q form subset of the
vertices and edges of gi .

Formally, g1 (V1 , E1 , Lv 1 , Le1 ) is deﬁned as sub-graph of
g2 (V2 , E2 , Lv 2 , Le2 ) if and only if:
1 For every distinct vertex x ∈ V1 with a label vl ∈ Lv 1 , there is a
distinct vertex y ∈ V2 with a label vl ∈ Lv 2 .

2 For every distinct edge edge ab ∈ E1 with a label el ∈ Le1 , there is a
distinct edge ab ∈ E2 with a label el ∈ Le2 .


Preliminaries: Subgraph Search Queries

A A z
f e x x C A
m
A A A
B C n
m C f e A e
x x
x
C
z
A
n B mA xB x C n C A
n
y z n x x n e x n
x
C m C
D D Dn m
D
C m D D D A D
x x
f m
f m
Ax x D
A B B
g2 g1 g2 g3 g3 qq
(a) Sample graph database (b) Graph query

Figure: An example graph database and graph query


Our Approach: GraphREL

Relational encoding of graph data.

SQL translation of sub-graph search queries.
Filtering phase.

Optional veriﬁcation phase.

Partitioned B-tree Indexes.

Statistical Summaries.

Decomposition-Based and Selectivity-Aware SQL Translation.


Relational Encoding of Graph Data

The starting point of our relational framework is to ﬁnd an eﬃcient
and suitable encoding for each graph member gi in the graph
database D.

We use the Vertex-Edge mapping scheme for storing directed
labelled graphs with the following structure:

Vertices(graphID, vertexID, vertexLabel)

Edges(graphID, sVertex, dVertex, edgeLabel)


Relational Encoding of Graph Data

graphID vertexID vLabel graphID sVertex dVertex eLabel

m
A n
1 1 1 A 1 1 2 n

m 1 1 3 m
g1 6 B A 2
1 2 A
1 2 3 n
y z n 1 3 D
1 4 3 x
5 C D 3
1 4 A 1 5 4 x

1 5 C 1 6 5 y
x x
1 5 2 z
A 4 1 6 B
1 1 6 m
2 1 A
2 1 2 e

f
A 1
e
2 2 C 2 2 3 m

2 3 D 2 4 3 m
5 B C 2
g2
2 4 2 n
2 4 C
x n m 2 5 4 x
2 5 B
4 C m D 3 2 1 5 f

Vertices Table Edges Table


SQL Translation of Graph Queries

Filtering Phase: a sub-graph query q consists of a set of vertices
QV with size equal m and a set of edges QE equal n is evaluated
using the following SQL translation template:
SELECT DISTINCT V1 .graphID, Vi .vertexID
FROM Vertices as V1 ,..., Vertices as Vm , Edges as E1 ,..., Edges as En
WHERE
∀m (V1 .graphID = Vi .graphID)
i=2
AND ∀n (V1 .graphID = Ej .graphID)
j=1
AND ∀m (Vi .vertexLabel = QVi .vertexLabel)
i=1
AND ∀n (Ej .edgeLabel = QEj .edgeLabel)
j=1
AND ∀n (Ej .sVertex = Vf .vertexID AND Ej .dVertex = Vf .vertexID);
j=1

Veriﬁcation Phase: an optional phase which is used to verify that
each vertex in the set of ﬁltered vertices for each candidate graph is
distinct. It is applied only if more than one vertex of the set of query
vertices QV have the same label. This can be easily achieved using
their vertex ID.


Partitioned B-tree Indexes

Partitioned B-tree indexing is a slight variant of the B-tree indexing
structure.
The main idea is the use of low-selectivity leading columns to
maintain partitions within the associated B-tree.
In labelled graphs, it is generally the case that the number of distinct
vertices and edges labels are far less than the number of vertices and
edges respectively.
For example, having an index defined in terms of columns
(vertexLabel, graphID) can reduce the access cost of sub-graph query
with only one label to one disk page. On the contrary, an index
defined in terms of the two columns (graphID, vertexLabel) requires
scanning a large number of disk pages.
Having partitioned B-trees indexes of the high-selectivity attributes
achieves fixed execution times which are no longer dependent on the
size of the whole graph database.

Limitations of SQL-Based Translation Approach

An obvious problem of the SQL translation template is that it
involves a large number of conjunctive SQL predicates and join
operations between the encoding tables.

Most of relational query engines will certainly fail to execute the SQL
translation queries of medium size or large sub-graph queries because
they are too long and too complex (this does not mean they must
consequently be too expensive).

Therefore, we need a decomposition mechanism to divide this large
and complex SQL translation query into a sequence of intermediate
queries.
Applying this decomposition mechanism blindly may lead to ineﬃcient
execution plans with very large, non-required and expensive
intermediate results.
We use statistical summary information to achieve an eﬃcient
decomposition process.

Statistical Summaries

In general, one of the most eﬀective techniques for optimizing the
execution times of SQL queries is to select the relational execution
based on the accurate selectivity information of the query predicates.
We construct three Markov tables to store information about the
frequency of occurrence of the distinct labels of vertices, distinct
labels of edges and connection between pair of vertices (edges).
Vertex Label Frequency Edge Label Frequency Edge Label Frequency
Connection
A 100 a 40
ab 3
B 200 c 5
ac 15
C 38 e 28
ae 45
D 4 l 54
ec 14
E 50 m 140
em 103
L 6 n 3
la 5
M 10 o 20
pc 18
N 250 p 15
px 45
O 3 x 8
xy 25
P 40 y 60
xz 2
R 55 z 15
za 1
Markov Table summary of Markov Table summary of Markov Table summary of
vertices labels edges labels pair-wise edge connections


Decomposition-Based and Selectivity-Aware SQL
Translation

Identifying the pruning points.

Calculating the number of partitions.

Decomposed SQL translation.
Blindly Single-Level Decomposition.

Pruned Single-Level Decomposition.

Pruned Multi-Level Decomposition

Selectivity-aware Annotations.


Translation
Identifying the pruning points
Each vertex label, edge label or edge connection with low frequency is
considered as a pruning point in our relational evaluation mechanism.

Given a query graph q, we ﬁrst check the structure of q against our
summary Markov tables to identify the possible pruning points (NPP).

Calculating the number of partitions
Having a sub-graph query q requires NJP join operations.

Assuming that the relational query engine can evaluate up to number
of join operations equal to MJP in one query.

The number of partitions (NOP) is computed as: (NJP/MJP)


Translation
Blindly Single-Level Decomposition
If NPP = 0 ⇒ we blindly decompose the query q into NOP partitions.
Each partition is translated into an intermediate evaluation step Si .
The final evaluation step joins all intermediate evaluation steps and
adds the conjunctive conditions of the partition’s connectors.
Pruned Single-Level Decomposition
If NPP >= NOP ⇒ we distribute the pruning points across the
different intermediate NOP partitions.
It ensures a balanced effective pruning of all intermediate results.
Pruned Multi-Level Decomposition
if NPP < NOP ⇒ we distribute the pruning points across a first level
intermediate results of NOP partitions. An intermediate collective
pruned step IPS is constructed by joining all the pruned first level
intermediate results.
IPS is used as an entry pruning point for the rest (NOP − NPP)
non-pruned partitions in a hierarchical multi-level fashion .
Each pruning point can be used to prune more than one partition (if
possible).

Translation
S1
S1 S2 FES -
S2 FES -
SQL
SQL

S1 - S1 - S2 - S2 -
SQL SQLSQL SQL

(a) NPP > NOP
S2

FES -
SQL
S2 FES -
S1 S3 SQL
S1 - S2 - S3 -
S1 S3 SQL SQL SQL
S1 - S2 - S3 -
(b) NPP < NOP SQL SQL SQL

Figure: Selectivity-aware decomposition process


Translation
Selectivity-aware Annotations

For any given SQL query, there are a large number of alternative
execution plans. These alternative execution plans may differ
significantly in their use of system resources or response time.

We use the statistical summary information to give influencing hints for
the query optimizers by injecting additional selectivity information for
the individual query predicates into the SQL translations of the graph
queries.

SELECT fieldlist FROM tablelist
WHERE Pi SELECTIVITY Si


Experimental Results: Performance and Scalability

D2kV10E20L40M50 1MB
D10kV10E20L40M50 10MB
100000 10000
D50kV30E40L90M150 50MB
D100kV30E40L90M150
100MB

10000
1000
Execution Time (ms)

Execution Time (ms)
1000

100

100

10
10

1 1
Q4 Q8 Q12 Q16 Q20 Q4 Q8 Q12 Q16 Q20
Query Size Query Size

(a) Synthetic Dataset (b) DBLP Dataset

Figure: The scalability of GraphREL.


Experimental Results: The eﬀect of using Partitioned
B-tree Indexes and Selectivity Injections

Synthetic
Synthetic
DBLP
100 DBLP
40

90
35
80
Percentage of Improvement (%)

30
70

Execution Times (ms)
60 25

50 20

40
15
30
10
20

5
10

0 0
Q4 Q8 Q12 Q16 Q20 Q4 Q8 Q12 Q16 Q20
Query Size Query Size

(a) Partitioned B-tree indexes (b) Injection of selectivity annotations

Figure: The speedup improvement for the relational evaluation of sub-graph
queries using partitioned B-tree indexes and selectivity-aware annotations.


QBP: An Application of GraphREL

Many of today’s Information Systems are driven by explicit process
models.
A business process is a set of coordinated activities to achieve a
specific business objective.
With the rapid and incremental increase in the number of process
models, it becomes crucial for business process designers to be able to
look up their repository for models efficiently.
QBP is a query processor for business processes models.
QBP is based on a new visual query language for business processes
called BPMN-Q. The language addresses processes definitions and
extends the standard BPMN notations for modeling business
processes for its concrete syntax.
A BPMN-Q query is considered to be a graph which is going to be
matched with process graph(s).


Credit Rating Credit Rating
[accepted] [rejected] Offer loan protection
insurance

Check credit rating

Prepare contract

Const. Doc All OK
[valid]

Check real-estate Offer residence
Customer applies for
construction insurance
real-estate credit
document

Const. Doc.
[invalid] Reject application

Check land register
record

Record Record
[present] [absent]


QBP: Application Architecture

BPM-Q BPM- Q Query Semantic Query
Query Editor Expander

Semantically
expanded queries

Business Process Model Result Process Models SQL-Based
Designers Query Processor
Editor

Query Results SQL Script
Updates

RDBMS
Relational Business
Process Repository

Translation Middleware

UML
BPMN BPEL ………. EPC
ADs


BPMN-Q Query Constructs

Anonymous It is used to indicate unknown activities in a query. It resembles an
Activity activity but is distinguished by the @ sign in the beginning of the label.
Generic Node It indicates an unknown node in a process. It could evaluate to any node
type.
Generic Split It refers to any type of split gateways.

Generic Join It refers to any type of join gateways.

Negative It states that two nodes A and B are not directly related by sequence
Sequence Flow flow.
Path It states that there must be a path from A to B. A query usually returns
all paths.
Negative Path It states that there is not any path between two nodes A and B.



// Reject application
real-estate credit

(a) BPMN-Q Query Example

Check credit rating

Check real-estate
construction
real-estate credit
document

Check land register Reject application
record

(b) BPMN-Q Query Match


QBP: Use Cases

Searching the structure of the process models.

Compliance checking.

Detecting design anomalies.

Discovery of frequent process patterns/anti-patterns.



http://bpmnq.sourceforge.net/

Conclusions

GraphREL is a purely relational framework to store and query graph
data.

In principle GraphREL has the following advantages:
It can reside on any relational database system and exploits its well
known matured query optimization techniques as well as its efficient
and scalable query processing techniques.

It has no required time cost for offline or pre-processing steps.

It can handle static and dynamic (with frequent updates) graph
databases very well.

The selectivity annotations for the SQL evaluation scripts provide the
relational query optimizers with the ability to select the most efficient
execution plans and apply an efficient pruning for the non-required
graph database members.


Outline





Future Work: Large Scale Graph Query Processing
(e.g: Social Networks)


Future Work: Parallel Processing / MapReduce
(HadoopDB)


Future Work: Storing and Querying Hypergraphs


References

[CIDR’03] G. Graefe. Sorting And Indexing With Partitioned
B-Trees.

[VLDB’04] T. Grust, S. Sakr, and J. Teubner. XQuery on SQL
Hosts.

[SIGMOD’07] T. Grust, M. Mayr, J. Rittinger, S. Sakr, and J.
Teubner. A SQL:1999 Code Generator for the Pathﬁnder XQuery
Compiler.

[VLDB’08] J. Teubner, T. Grust, S. Maneth, and S. Sakr.
Dependable Cardinality Forecats for XQuery.


References

[IJWIS’08] S. Sakr. ”Algebraic-Based XQuery Cardinality
Estimation.

[DASFAA’09] S. Sakr. GraphREL: A Decomposition-Based and
Selectivity-Aware Relational Framework for Processing Sub-graph
Queries.

[UNISCON’09] S. Sakr. Storing and Querying Graph Data Using
Eﬃcient Relational Processing Techniques.

[JDM’09] S. Sakr. Purely Relational Implementation of an XQuery
Processor.


The End

Thank You


GraphREL: A Relational Framework for Processing Sub-graph Queries

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (20)

Similar to GraphREL: A Relational Framework for Processing Sub-graph Queries

Similar to GraphREL: A Relational Framework for Processing Sub-graph Queries (20)

More from University of New South Wales

More from University of New South Wales (10)

Recently uploaded

Recently uploaded (20)

GraphREL: A Relational Framework for Processing Sub-graph Queries