Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
GraphREL: A Relational Framework for Processing Sub-graph Queries
1. GraphREL: A Decomposition-Based and
Selectivity-Aware Relational Framework for Processing
Sub-graph Queries
Sherif Sakr
School of Computer Science and Engineering
University of New South Wales
.
http://www.cse.unsw.edu.au/∼ssakr/
BIT Seminars ’09 - Free University of Bolzano, Italy
16 November 2009
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 1 / 40
2. Outline
Previous Work: Pathfinder - Relational XQuery Compiler.
Current Work: GraphREL - General Graph Query Processor.
Future Work: Scalable Graph Query Processing for New Generation
of Database Applications.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 2 / 40
3. Outline
Previous Work: Pathfinder - Relational XQuery Compiler.
Current Work: GraphREL - General Graph Query Processor.
Future Work: Scalable Graph Query Processing for New Generation
of Database Applications.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 3 / 40
4. Pathfinder: A Relational XQuery Processor
XQuery Expression
Pathfinder
Relational Algebra
MIL Code Generator SQL Code Generator
MIL Scripts SQL Scripts
Monet DBMS Conventional RDBMS
http://pathfinder-xquery.org/
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 4 / 40
5. Pathfinder: A Relational XQuery Processor
XML XQuery
Document
Expression
Pathfinder
Relational Algebra + Special Properties [VLDB’04]
Estimation Rules Translation Templates
[VLDB’08]
XPath Accelerator Cardinality Properties
Encoding Tuples XQuery Estimator SQL Generator
+
Statistical Guide [SIGMOD’07]
Statistical Guide [IJWIS’09] Cardinality Properties Aware
[JDM’09]
Statistical Histograms SQL Scripts
Relational
Results XML XML
Conventional RDBMS Serializer
Statistical Histograms
S. Sakr (CSE, UNSW) System Administrator
BIT Seminars’09 16 November 2009 5 / 40
6. Outline
Previous Work: Pathfinder - Relational XQuery Compiler.
Current Work: GraphREL - General Graph Query Processor.
Future Work: Scalable Graph Query Processing for New Generation
of Database Applications.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 6 / 40
7. GraphREL: Motivations
Graphs are among the most complicated and general form of data
structures.
Recently, they have been widely used to model many complex
structured and schemaless data such as social networks, chemical
compounds, biological pathways, spatial databases, semantic web and
business process models.
Retrieving related graphs containing a query graph from a large graph
database is a key performance issue in all of these graph-based
applications.
The success of any graph database application is directly dependent
on the efficiency of the graph indexing and query processing
mechanisms.
RDBMSs have repeatedly shown that they are very efficient, scalable
and successful in hosting different kinds of data.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 7 / 40
8. Preliminaries: Graph Data Model
In labelled graphs, vertices and edges represent the entities and the
relationships between them respectively.
The attributes associated with these entities and relationships are
called labels.
A graph database D is a collection of member graphs
D = {g1 , g2 , ...gn } where each member graph gi is denoted as
(V , E , Lv , Le ).
V is the set of vertices.
E ⊆ V × V is the set of edges joining two distinct vertices.
Lv is the set of vertex labels.
Le is the set of edge labels.
labelled graphs are classified according to the direction of their edges
into two main classes:
1 Directed-labelled graphs such as XML, RDF and traffic networks.
2 Undirected-labelled graphs such as social networks and chemical
compounds.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 8 / 40
9. Preliminaries: Graph Queries
In principle, queries in graph databases can be broadly classified into the
follow- ing main categories:
Subgraph queries: this category searches for a specific pattern in the
graph database. The pattern can be either a small graph or a graph
where some parts of it are uncertain, e.g., vertices with wildcard
labels.
Supergraph queries: this category searches for the graph database
members of which their whole structures are contained in the input
query.
Similarity (Approximate Matching) queries: this category finds
graphs which are similar, but not necessarily isomorphic to a given
query graph.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 9 / 40
10. Preliminaries: Subgraph Search Queries
Given a graph database D = {g1 , g2 , ..., gn } and a graph query q, it
returns the query answer set A = {gi |q ⊆ gi , gi ∈ D}.
A graph q is described as a sub-graph of another graph database
member gi if the set of vertices and edges of q form subset of the
vertices and edges of gi .
Formally, g1 (V1 , E1 , Lv 1 , Le1 ) is defined as sub-graph of
g2 (V2 , E2 , Lv 2 , Le2 ) if and only if:
1 For every distinct vertex x ∈ V1 with a label vl ∈ Lv 1 , there is a
distinct vertex y ∈ V2 with a label vl ∈ Lv 2 .
2 For every distinct edge edge ab ∈ E1 with a label el ∈ Le1 , there is a
distinct edge ab ∈ E2 with a label el ∈ Le2 .
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 10 / 40
11. Preliminaries: Subgraph Search Queries
A A z
f e x x C A
m
A A A
B C n
m C f e A e
x x
x
C
z
A
n B mA xB x C n C A
n
y z n x x n e x n
x
C m C
D D Dn m
D
C m D D D A D
x x
f m
f m
Ax x D
A B B
g2 g1 g2 g3 g3 qq
(a) Sample graph database (b) Graph query
Figure: An example graph database and graph query
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 11 / 40
12. Our Approach: GraphREL
Relational encoding of graph data.
SQL translation of sub-graph search queries.
Filtering phase.
Optional verification phase.
Partitioned B-tree Indexes.
Statistical Summaries.
Decomposition-Based and Selectivity-Aware SQL Translation.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 12 / 40
13. Relational Encoding of Graph Data
The starting point of our relational framework is to find an efficient
and suitable encoding for each graph member gi in the graph
database D.
We use the Vertex-Edge mapping scheme for storing directed
labelled graphs with the following structure:
Vertices(graphID, vertexID, vertexLabel)
Edges(graphID, sVertex, dVertex, edgeLabel)
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 13 / 40
14. Relational Encoding of Graph Data
graphID vertexID vLabel graphID sVertex dVertex eLabel
m
A n
1 1 1 A 1 1 2 n
m 1 1 3 m
g1 6 B A 2
1 2 A
1 2 3 n
y z n 1 3 D
1 4 3 x
5 C D 3
1 4 A 1 5 4 x
1 5 C 1 6 5 y
x x
1 5 2 z
A 4 1 6 B
1 1 6 m
2 1 A
2 1 2 e
f
A 1
e
2 2 C 2 2 3 m
2 3 D 2 4 3 m
5 B C 2
g2
2 4 2 n
2 4 C
x n m 2 5 4 x
2 5 B
4 C m D 3 2 1 5 f
Vertices Table Edges Table
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 14 / 40
15. SQL Translation of Graph Queries
Filtering Phase: a sub-graph query q consists of a set of vertices
QV with size equal m and a set of edges QE equal n is evaluated
using the following SQL translation template:
SELECT DISTINCT V1 .graphID, Vi .vertexID
FROM Vertices as V1 ,..., Vertices as Vm , Edges as E1 ,..., Edges as En
WHERE
∀m (V1 .graphID = Vi .graphID)
i=2
AND ∀n (V1 .graphID = Ej .graphID)
j=1
AND ∀m (Vi .vertexLabel = QVi .vertexLabel)
i=1
AND ∀n (Ej .edgeLabel = QEj .edgeLabel)
j=1
AND ∀n (Ej .sVertex = Vf .vertexID AND Ej .dVertex = Vf .vertexID);
j=1
Verification Phase: an optional phase which is used to verify that
each vertex in the set of filtered vertices for each candidate graph is
distinct. It is applied only if more than one vertex of the set of query
vertices QV have the same label. This can be easily achieved using
their vertex ID.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 15 / 40
16. Partitioned B-tree Indexes
Partitioned B-tree indexing is a slight variant of the B-tree indexing
structure.
The main idea is the use of low-selectivity leading columns to
maintain partitions within the associated B-tree.
In labelled graphs, it is generally the case that the number of distinct
vertices and edges labels are far less than the number of vertices and
edges respectively.
For example, having an index defined in terms of columns
(vertexLabel, graphID) can reduce the access cost of sub-graph query
with only one label to one disk page. On the contrary, an index
defined in terms of the two columns (graphID, vertexLabel) requires
scanning a large number of disk pages.
Having partitioned B-trees indexes of the high-selectivity attributes
achieves fixed execution times which are no longer dependent on the
size of the whole graph database.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 16 / 40
17. Limitations of SQL-Based Translation Approach
An obvious problem of the SQL translation template is that it
involves a large number of conjunctive SQL predicates and join
operations between the encoding tables.
Most of relational query engines will certainly fail to execute the SQL
translation queries of medium size or large sub-graph queries because
they are too long and too complex (this does not mean they must
consequently be too expensive).
Therefore, we need a decomposition mechanism to divide this large
and complex SQL translation query into a sequence of intermediate
queries.
Applying this decomposition mechanism blindly may lead to inefficient
execution plans with very large, non-required and expensive
intermediate results.
We use statistical summary information to achieve an efficient
decomposition process.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 17 / 40
18. Statistical Summaries
In general, one of the most effective techniques for optimizing the
execution times of SQL queries is to select the relational execution
based on the accurate selectivity information of the query predicates.
We construct three Markov tables to store information about the
frequency of occurrence of the distinct labels of vertices, distinct
labels of edges and connection between pair of vertices (edges).
Vertex Label Frequency Edge Label Frequency Edge Label Frequency
Connection
A 100 a 40
ab 3
B 200 c 5
ac 15
C 38 e 28
ae 45
D 4 l 54
ec 14
E 50 m 140
em 103
L 6 n 3
la 5
M 10 o 20
pc 18
N 250 p 15
px 45
O 3 x 8
xy 25
P 40 y 60
xz 2
R 55 z 15
za 1
Markov Table summary of Markov Table summary of Markov Table summary of
vertices labels edges labels pair-wise edge connections
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 18 / 40
19. Decomposition-Based and Selectivity-Aware SQL
Translation
Identifying the pruning points.
Calculating the number of partitions.
Decomposed SQL translation.
Blindly Single-Level Decomposition.
Pruned Single-Level Decomposition.
Pruned Multi-Level Decomposition
Selectivity-aware Annotations.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 19 / 40
20. Decomposition-Based and Selectivity-Aware SQL
Translation
Identifying the pruning points
Each vertex label, edge label or edge connection with low frequency is
considered as a pruning point in our relational evaluation mechanism.
Given a query graph q, we first check the structure of q against our
summary Markov tables to identify the possible pruning points (NPP).
Calculating the number of partitions
Having a sub-graph query q requires NJP join operations.
Assuming that the relational query engine can evaluate up to number
of join operations equal to MJP in one query.
The number of partitions (NOP) is computed as: (NJP/MJP)
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 20 / 40
21. Decomposition-Based and Selectivity-Aware SQL
Translation
Blindly Single-Level Decomposition
If NPP = 0 ⇒ we blindly decompose the query q into NOP partitions.
Each partition is translated into an intermediate evaluation step Si .
The final evaluation step joins all intermediate evaluation steps and
adds the conjunctive conditions of the partition’s connectors.
Pruned Single-Level Decomposition
If NPP >= NOP ⇒ we distribute the pruning points across the
different intermediate NOP partitions.
It ensures a balanced effective pruning of all intermediate results.
Pruned Multi-Level Decomposition
if NPP < NOP ⇒ we distribute the pruning points across a first level
intermediate results of NOP partitions. An intermediate collective
pruned step IPS is constructed by joining all the pruned first level
intermediate results.
IPS is used as an entry pruning point for the rest (NOP − NPP)
non-pruned partitions in a hierarchical multi-level fashion .
Each pruning point can be used to prune more than one partition (if
possible).
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 21 / 40
23. Decomposition-Based and Selectivity-Aware SQL
Translation
Selectivity-aware Annotations
For any given SQL query, there are a large number of alternative
execution plans. These alternative execution plans may differ
significantly in their use of system resources or response time.
We use the statistical summary information to give influencing hints for
the query optimizers by injecting additional selectivity information for
the individual query predicates into the SQL translations of the graph
queries.
SELECT fieldlist FROM tablelist
WHERE Pi SELECTIVITY Si
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 23 / 40
24. Experimental Results: Performance and Scalability
D2kV10E20L40M50 1MB
D10kV10E20L40M50 10MB
100000 10000
D50kV30E40L90M150 50MB
D100kV30E40L90M150
100MB
10000
1000
Execution Time (ms)
Execution Time (ms)
1000
100
100
10
10
1 1
Q4 Q8 Q12 Q16 Q20 Q4 Q8 Q12 Q16 Q20
Query Size Query Size
(a) Synthetic Dataset (b) DBLP Dataset
Figure: The scalability of GraphREL.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 24 / 40
25. Experimental Results: The effect of using Partitioned
B-tree Indexes and Selectivity Injections
Synthetic
Synthetic
DBLP
100 DBLP
40
90
35
80
Percentage of Improvement (%)
30
70
Execution Times (ms)
60 25
50 20
40
15
30
10
20
5
10
0 0
Q4 Q8 Q12 Q16 Q20 Q4 Q8 Q12 Q16 Q20
Query Size Query Size
(a) Partitioned B-tree indexes (b) Injection of selectivity annotations
Figure: The speedup improvement for the relational evaluation of sub-graph
queries using partitioned B-tree indexes and selectivity-aware annotations.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 25 / 40
26. QBP: An Application of GraphREL
Many of today’s Information Systems are driven by explicit process
models.
A business process is a set of coordinated activities to achieve a
specific business objective.
With the rapid and incremental increase in the number of process
models, it becomes crucial for business process designers to be able to
look up their repository for models efficiently.
QBP is a query processor for business processes models.
QBP is based on a new visual query language for business processes
called BPMN-Q. The language addresses processes definitions and
extends the standard BPMN notations for modeling business
processes for its concrete syntax.
A BPMN-Q query is considered to be a graph which is going to be
matched with process graph(s).
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 26 / 40
27. QBP: An Application of GraphREL
Credit Rating Credit Rating
[accepted] [rejected] Offer loan protection
insurance
Check credit rating
Prepare contract
Const. Doc All OK
[valid]
Check real-estate Offer residence
Customer applies for
construction insurance
real-estate credit
document
Const. Doc.
[invalid] Reject application
Check land register
record
Record Record
[present] [absent]
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 27 / 40
28. QBP: Application Architecture
BPM-Q BPM- Q Query Semantic Query
Query Editor Expander
Semantically
expanded queries
Business Process Model Result Process Models SQL-Based
Designers Query Processor
Editor
Query Results SQL Script
Updates
RDBMS
Relational Business
Process Repository
Translation Middleware
UML
BPMN BPEL ………. EPC
ADs
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 28 / 40
29. BPMN-Q Query Constructs
Anonymous It is used to indicate unknown activities in a query. It resembles an
Activity activity but is distinguished by the @ sign in the beginning of the label.
Generic Node It indicates an unknown node in a process. It could evaluate to any node
type.
Generic Split It refers to any type of split gateways.
Generic Join It refers to any type of join gateways.
Negative It states that two nodes A and B are not directly related by sequence
Sequence Flow flow.
Path It states that there must be a path from A to B. A query usually returns
all paths.
Negative Path It states that there is not any path between two nodes A and B.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 29 / 40
30. QBP: An Application of GraphREL
Customer applies for
// Reject application
real-estate credit
(a) BPMN-Q Query Example
Check credit rating
Check real-estate
Customer applies for
construction
real-estate credit
document
Check land register Reject application
record
(b) BPMN-Q Query Match
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 30 / 40
31. QBP: Use Cases
Searching the structure of the process models.
Compliance checking.
Detecting design anomalies.
Discovery of frequent process patterns/anti-patterns.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 31 / 40
32. QBP: An Application of GraphREL
http://bpmnq.sourceforge.net/
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 32 / 40
33. Conclusions
GraphREL is a purely relational framework to store and query graph
data.
In principle GraphREL has the following advantages:
It can reside on any relational database system and exploits its well
known matured query optimization techniques as well as its efficient
and scalable query processing techniques.
It has no required time cost for offline or pre-processing steps.
It can handle static and dynamic (with frequent updates) graph
databases very well.
The selectivity annotations for the SQL evaluation scripts provide the
relational query optimizers with the ability to select the most efficient
execution plans and apply an efficient pruning for the non-required
graph database members.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 33 / 40
34. Outline
Previous Work: Pathfinder - Relational XQuery Compiler.
Current Work: GraphREL - General Graph Query Processor.
Future Work: Scalable Graph Query Processing for New Generation
of Database Applications.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 34 / 40
35. Future Work: Large Scale Graph Query Processing
(e.g: Social Networks)
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 35 / 40
36. Future Work: Parallel Processing / MapReduce
(HadoopDB)
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 36 / 40
37. Future Work: Storing and Querying Hypergraphs
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 37 / 40
38. References
[CIDR’03] G. Graefe. Sorting And Indexing With Partitioned
B-Trees.
[VLDB’04] T. Grust, S. Sakr, and J. Teubner. XQuery on SQL
Hosts.
[SIGMOD’07] T. Grust, M. Mayr, J. Rittinger, S. Sakr, and J.
Teubner. A SQL:1999 Code Generator for the Pathfinder XQuery
Compiler.
[VLDB’08] J. Teubner, T. Grust, S. Maneth, and S. Sakr.
Dependable Cardinality Forecats for XQuery.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 38 / 40
39. References
[IJWIS’08] S. Sakr. ”Algebraic-Based XQuery Cardinality
Estimation.
[DASFAA’09] S. Sakr. GraphREL: A Decomposition-Based and
Selectivity-Aware Relational Framework for Processing Sub-graph
Queries.
[UNISCON’09] S. Sakr. Storing and Querying Graph Data Using
Efficient Relational Processing Techniques.
[JDM’09] S. Sakr. Purely Relational Implementation of an XQuery
Processor.
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 39 / 40
40. The End
Thank You
S. Sakr (CSE, UNSW) BIT Seminars’09 16 November 2009 40 / 40