GraphREL: A Relational Graph Query Processor


  1. GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries. Sherif Sakr, School of Computer Science and Engineering, University of New South Wales. http://www.cse.unsw.edu.au/~ssakr/ BIT Seminars '09, Free University of Bolzano, Italy, 16 November 2009. S. Sakr (CSE, UNSW) BIT Seminars'09 16 November 2009 1 / 40
  2. Outline. Previous Work: Pathfinder - Relational XQuery Compiler. Current Work: GraphREL - General Graph Query Processor. Future Work: Scalable Graph Query Processing for New Generation of Database Applications.
  3. Outline. Previous Work: Pathfinder - Relational XQuery Compiler. Current Work: GraphREL - General Graph Query Processor. Future Work: Scalable Graph Query Processing for New Generation of Database Applications.
  4. Pathfinder: A Relational XQuery Processor. [Architecture diagram. Labels: XQuery Expression; Pathfinder; Relational Algebra; MIL Code Generator; SQL Code Generator; MIL Scripts; SQL Scripts; Monet DBMS; Conventional RDBMS.] http://pathfinder-xquery.org/
  5. Pathfinder: A Relational XQuery Processor. [Architecture diagram. Labels: XML Document; XQuery Expression; Pathfinder; Relational Algebra + Special Properties; XPath Accelerator Encoding; Estimation Rules; Translation Templates; Cardinality Properties; XQuery Estimator; SQL Generator; Statistical Guide; Cardinality Properties Aware; Statistical Histograms; Tuples; SQL Scripts; Conventional RDBMS; Relational Results; XML Serializer; System Administrator. Citations shown in the diagram: [VLDB'04], [VLDB'08], [SIGMOD'07], [IJWIS'09], [JDM'09].]
  6. Outline. Previous Work: Pathfinder - Relational XQuery Compiler. Current Work: GraphREL - General Graph Query Processor. Future Work: Scalable Graph Query Processing for New Generation of Database Applications.
  7. GraphREL: Motivations. Graphs are among the most complicated and general forms of data structures. Recently, they have been widely used to model complex structured and schemaless data such as social networks, chemical compounds, biological pathways, spatial databases, the semantic web and business process models. Retrieving the graphs that contain a query graph from a large graph database is a key performance issue in all of these graph-based applications. The success of any graph database application depends directly on the efficiency of its graph indexing and query processing mechanisms. RDBMSs have repeatedly shown that they are efficient, scalable and successful in hosting different kinds of data.
  8. Preliminaries: Graph Data Model. In labelled graphs, vertices and edges represent the entities and the relationships between them, respectively. The attributes associated with these entities and relationships are called labels. A graph database D is a collection of member graphs D = {g1, g2, ..., gn} where each member graph gi is denoted as (V, E, Lv, Le). V is the set of vertices. E ⊆ V × V is the set of edges joining two distinct vertices. Lv is the set of vertex labels. Le is the set of edge labels. Labelled graphs are classified according to the direction of their edges into two main classes: (1) directed labelled graphs such as XML, RDF and traffic networks; (2) undirected labelled graphs such as social networks and chemical compounds.
  9. Preliminaries: Graph Queries. In principle, queries in graph databases can be broadly classified into the following main categories. Subgraph queries: this category searches for a specific pattern in the graph database. The pattern can be either a small graph or a graph in which some parts are uncertain, e.g., vertices with wildcard labels. Supergraph queries: this category searches for the graph database members whose whole structures are contained in the input query. Similarity (approximate matching) queries: this category finds graphs which are similar, but not necessarily isomorphic, to a given query graph.
  10. Preliminaries: Subgraph Search Queries. Given a graph database D = {g1, g2, ..., gn} and a graph query q, a subgraph search query returns the answer set A = {gi | q ⊆ gi, gi ∈ D}. A graph q is described as a sub-graph of a graph database member gi if the vertices and edges of q form a subset of the vertices and edges of gi. Formally, g1(V1, E1, Lv1, Le1) is defined as a sub-graph of g2(V2, E2, Lv2, Le2) if and only if: (1) for every distinct vertex x ∈ V1 with a label vl ∈ Lv1, there is a distinct vertex y ∈ V2 with a label vl ∈ Lv2; (2) for every distinct edge ab ∈ E1 with a label el ∈ Le1, there is a distinct edge ab ∈ E2 with a label el ∈ Le2.
  11. Preliminaries: Subgraph Search Queries. [Figure: An example graph database and graph query. (a) Sample graph database with member graphs g1, g2 and g3. (b) Graph query q.]
  12. Our Approach: GraphREL. Relational encoding of graph data. SQL translation of sub-graph search queries (filtering phase; optional verification phase). Partitioned B-tree indexes. Statistical summaries. Decomposition-based and selectivity-aware SQL translation.
  13. Relational Encoding of Graph Data. The starting point of our relational framework is to find an efficient and suitable encoding for each graph member gi in the graph database D. We use the Vertex-Edge mapping scheme for storing directed labelled graphs with the following structure: Vertices(graphID, vertexID, vertexLabel) Edges(graphID, sVertex, dVertex, edgeLabel)
  14. Relational Encoding of Graph Data. Encoding of the member graphs g1 and g2 (the graph drawings are not reproduced here).
      Vertices Table
      graphID  vertexID  vLabel
      1        1         A
      1        2         A
      1        3         D
      1        4         A
      1        5         C
      1        6         B
      2        1         A
      2        2         C
      2        3         D
      2        4         C
      2        5         B
      Edges Table
      graphID  sVertex  dVertex  eLabel
      1        1        2        n
      1        1        3        m
      1        2        3        n
      1        4        3        x
      1        5        4        x
      1        6        5        y
      1        5        2        z
      1        1        6        m
      2        1        2        e
      2        2        3        m
      2        4        3        m
      2        4        2        n
      2        5        4        x
      2        1        5        f
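The Vertex-Edge mapping can be tried out in any relational engine. The following is a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in RDBMS; the rows are the g1 and g2 entries read from the Vertices and Edges tables on slide 14, and the helper name `load_graph_db` is hypothetical.

```python
import sqlite3

# Vertex-Edge mapping scheme; column names follow slide 13.
SCHEMA = """
CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT);
CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER, edgeLabel TEXT);
"""

# (graphID, vertexID, vertexLabel) rows for member graphs g1 and g2.
VERTICES = [
    (1, 1, "A"), (1, 2, "A"), (1, 3, "D"), (1, 4, "A"), (1, 5, "C"), (1, 6, "B"),
    (2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"), (2, 5, "B"),
]

# (graphID, sVertex, dVertex, edgeLabel) rows for member graphs g1 and g2.
EDGES = [
    (1, 1, 2, "n"), (1, 1, 3, "m"), (1, 2, 3, "n"), (1, 4, 3, "x"),
    (1, 5, 4, "x"), (1, 6, 5, "y"), (1, 5, 2, "z"), (1, 1, 6, "m"),
    (2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"), (2, 4, 2, "n"),
    (2, 5, 4, "x"), (2, 1, 5, "f"),
]


def load_graph_db():
    """Create an in-memory database holding the two encoding tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)", VERTICES)
    conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)", EDGES)
    return conn


conn = load_graph_db()
edge_counts = conn.execute(
    "SELECT graphID, COUNT(*) FROM Edges GROUP BY graphID ORDER BY graphID"
).fetchall()
print(edge_counts)  # [(1, 8), (2, 6)]
```

Every member graph shares the same two tables and is distinguished only by `graphID`, which is what lets a single SQL query filter the whole graph database at once.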
  15. SQL Translation of Graph Queries. Filtering Phase: a sub-graph query q consisting of a set of vertices QV of size m and a set of edges QE of size n is evaluated using the following SQL translation template:
      SELECT DISTINCT V1.graphID, Vi.vertexID
      FROM Vertices AS V1, ..., Vertices AS Vm, Edges AS E1, ..., Edges AS En
      WHERE ∀ i=2..m (V1.graphID = Vi.graphID)
      AND ∀ j=1..n (V1.graphID = Ej.graphID)
      AND ∀ i=1..m (Vi.vertexLabel = QVi.vertexLabel)
      AND ∀ j=1..n (Ej.edgeLabel = QEj.edgeLabel)
      AND ∀ j=1..n (Ej.sVertex = Vf.vertexID AND Ej.dVertex = Vt.vertexID);
      Here Vf and Vt denote the aliases of the source and destination vertices of query edge Ej.
      Verification Phase: an optional phase which verifies that each vertex in the set of filtered vertices for each candidate graph is distinct. It is applied only if more than one vertex in the set of query vertices QV has the same label. This can be easily achieved using their vertex IDs.
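The translation template can be instantiated mechanically from a query graph. The sketch below is one possible reading of it, not the GraphREL implementation: the function names, the query representation (a list of vertex labels plus 1-based `(source, destination, label)` edges) and the use of SQLite are assumptions made for illustration. It selects only the graph ID and folds the verification phase into the same query as vertex ID inequalities between same-label vertices.

```python
import sqlite3


def translate_subgraph_query(vertex_labels, edges):
    """Build filtering-phase SQL for a query graph.

    vertex_labels: label of query vertex i (1-based) at position i - 1.
    edges: (source_index, dest_index, edge_label) triples, 1-based indices.
    Returns (sql, params) with labels passed as bound parameters.
    """
    m, n = len(vertex_labels), len(edges)
    tables = [f"Vertices AS V{i}" for i in range(1, m + 1)]
    tables += [f"Edges AS E{j}" for j in range(1, n + 1)]
    conds, params = [], []
    # All aliases must refer to the same member graph.
    conds += [f"V1.graphID = V{i}.graphID" for i in range(2, m + 1)]
    conds += [f"V1.graphID = E{j}.graphID" for j in range(1, n + 1)]
    # Label predicates for vertices and edges.
    for i, label in enumerate(vertex_labels, start=1):
        conds.append(f"V{i}.vertexLabel = ?")
        params.append(label)
    for j, (_, _, label) in enumerate(edges, start=1):
        conds.append(f"E{j}.edgeLabel = ?")
        params.append(label)
    # Structural predicates: each edge connects its source and destination.
    for j, (src, dst, _) in enumerate(edges, start=1):
        conds.append(f"E{j}.sVertex = V{src}.vertexID")
        conds.append(f"E{j}.dVertex = V{dst}.vertexID")
    # Verification (assumed inline here): same-label vertices must be distinct.
    for a in range(1, m + 1):
        for b in range(a + 1, m + 1):
            if vertex_labels[a - 1] == vertex_labels[b - 1]:
                conds.append(f"V{a}.vertexID <> V{b}.vertexID")
    sql = ("SELECT DISTINCT V1.graphID FROM " + ", ".join(tables)
           + " WHERE " + " AND ".join(conds))
    return sql, params


# Rows of g1 and g2 from slide 14.
VERTICES = [
    (1, 1, "A"), (1, 2, "A"), (1, 3, "D"), (1, 4, "A"), (1, 5, "C"), (1, 6, "B"),
    (2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"), (2, 5, "B"),
]
EDGES = [
    (1, 1, 2, "n"), (1, 1, 3, "m"), (1, 2, 3, "n"), (1, 4, 3, "x"),
    (1, 5, 4, "x"), (1, 6, 5, "y"), (1, 5, 2, "z"), (1, 1, 6, "m"),
    (2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"), (2, 4, 2, "n"),
    (2, 5, 4, "x"), (2, 1, 5, "f"),
]


def run_query(vertex_labels, edges):
    """Evaluate a query graph against the sample database; return graph IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Vertices (graphID INTEGER, vertexID INTEGER, vertexLabel TEXT)")
    conn.execute("CREATE TABLE Edges (graphID INTEGER, sVertex INTEGER, dVertex INTEGER, edgeLabel TEXT)")
    conn.executemany("INSERT INTO Vertices VALUES (?, ?, ?)", VERTICES)
    conn.executemany("INSERT INTO Edges VALUES (?, ?, ?, ?)", EDGES)
    sql, params = translate_subgraph_query(vertex_labels, edges)
    return sorted(row[0] for row in conn.execute(sql, params))


# Query path B -x-> C -m-> D occurs only in g2.
print(run_query(["B", "C", "D"], [(1, 2, "x"), (2, 3, "m")]))  # [2]
# Query A -n-> A has two same-label vertices and occurs only in g1.
print(run_query(["A", "A"], [(1, 2, "n")]))  # [1]
```

A query with m vertices and n edges joins m + n table aliases, which is why the number of joins grows quickly with query size.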
  16. Partitioned B-tree Indexes. Partitioned B-tree indexing is a slight variant of the B-tree indexing structure. The main idea is the use of low-selectivity leading columns to maintain partitions within the associated B-tree. In labelled graphs, it is generally the case that the number of distinct vertex and edge labels is far smaller than the number of vertices and edges, respectively. For example, an index defined on the columns (vertexLabel, graphID) can reduce the access cost of a sub-graph query with only one label to one disk page. In contrast, an index defined on the two columns (graphID, vertexLabel) requires scanning a large number of disk pages. Partitioned B-tree indexes on the high-selectivity attributes achieve fixed execution times which no longer depend on the size of the whole graph database.
  17. Limitations of SQL-Based Translation Approach. An obvious problem of the SQL translation template is that it involves a large number of conjunctive SQL predicates and join operations between the encoding tables. Most relational query engines will fail to execute the SQL translations of medium-sized or large sub-graph queries because they are too long and too complex (this does not mean that they must consequently be too expensive). Therefore, we need a decomposition mechanism to divide this large and complex SQL translation query into a sequence of intermediate queries. Applying this decomposition mechanism blindly may lead to inefficient execution plans with very large, non-required and expensive intermediate results. We use statistical summary information to achieve an efficient decomposition process.
  18. Statistical Summaries. In general, one of the most effective techniques for optimizing the execution times of SQL queries is to select the relational execution plan based on accurate selectivity information for the query predicates. We construct three Markov tables to store information about the frequency of occurrence of the distinct vertex labels, the distinct edge labels and the connections between pairs of edges.
      Markov table summaries of vertex labels, edge labels and pair-wise edge connections:
      Vertex Label  Frequency  |  Edge Label  Frequency  |  Edge Connection  Frequency
      A             100        |  a           40         |  ab               3
      B             200        |  c           5          |  ac               15
      C             38         |  e           28         |  ae               45
      D             4          |  l           54         |  ec               14
      E             50         |  m           140        |  em               103
      L             6          |  n           3          |  la               5
      M             10         |  o           20         |  pc               18
      N             250        |  p           15         |  px               45
      O             3          |  x           8          |  xy               25
      P             40         |  y           60         |  xz               2
      R             55         |  z           15         |  za               1
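The three summaries are plain frequency counts over the encoding tables. The sketch below shows one way they could be collected; the function name is hypothetical, and the reading of an "edge connection" as an edge whose destination vertex is the source of another edge in the same graph is an assumption, since the slide does not define it precisely. The input rows are those of g2 from slide 14.

```python
from collections import Counter


def build_summaries(vertices, edges):
    """Return (vertex label, edge label, edge connection) frequency tables.

    vertices: (graphID, vertexID, label) rows.
    edges: (graphID, sVertex, dVertex, label) rows.
    """
    vertex_freq = Counter(label for _, _, label in vertices)
    edge_freq = Counter(label for _, _, _, label in edges)

    # Index outgoing edge labels by (graph, source vertex).
    outgoing = {}
    for graph_id, src, _, label in edges:
        outgoing.setdefault((graph_id, src), []).append(label)

    # Assumed definition: a connection "ab" is an edge labelled a whose
    # destination vertex has an outgoing edge labelled b.
    connection_freq = Counter()
    for graph_id, _, dst, label in edges:
        for next_label in outgoing.get((graph_id, dst), []):
            connection_freq[label + next_label] += 1
    return vertex_freq, edge_freq, connection_freq


G2_VERTICES = [(2, 1, "A"), (2, 2, "C"), (2, 3, "D"), (2, 4, "C"), (2, 5, "B")]
G2_EDGES = [
    (2, 1, 2, "e"), (2, 2, 3, "m"), (2, 4, 3, "m"),
    (2, 4, 2, "n"), (2, 5, 4, "x"), (2, 1, 5, "f"),
]

vertex_freq, edge_freq, connection_freq = build_summaries(G2_VERTICES, G2_EDGES)
print(vertex_freq["C"], edge_freq["m"], sorted(connection_freq))
# 2 2 ['em', 'fx', 'nm', 'xm', 'xn']
```

Because the tables are simple counters, they can be updated incrementally as graphs are inserted or deleted.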
  19. Decomposition-Based and Selectivity-Aware SQL Translation. Identifying the pruning points. Calculating the number of partitions. Decomposed SQL translation: Blindly Single-Level Decomposition; Pruned Single-Level Decomposition; Pruned Multi-Level Decomposition. Selectivity-aware Annotations.
  20. Decomposition-Based and Selectivity-Aware SQL Translation. Identifying the pruning points: each vertex label, edge label or edge connection with low frequency is considered a pruning point in our relational evaluation mechanism. Given a query graph q, we first check the structure of q against our summary Markov tables to identify the possible pruning points; NPP denotes their number. Calculating the number of partitions: a sub-graph query q requires NJP join operations. Assuming that the relational query engine can evaluate up to MJP join operations in one query, the number of partitions (NOP) is computed as NOP = NJP / MJP.
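Both quantities on this slide are straightforward to compute once the Markov tables exist. The sketch below is illustrative only: the function names are hypothetical, the slide does not say what counts as "low frequency" (so the cut-off is passed in as a `threshold` parameter), and rounding NJP / MJP up to a whole number of partitions is an assumption. The frequencies are taken from the tables on slide 18.

```python
import math


def find_pruning_points(vertex_labels, edge_labels, vertex_freq, edge_freq, threshold):
    """Return query labels whose frequency is at or below the threshold."""
    points = []
    for label in sorted(set(vertex_labels)):
        if vertex_freq.get(label, 0) <= threshold:
            points.append(("vertex", label))
    for label in sorted(set(edge_labels)):
        if edge_freq.get(label, 0) <= threshold:
            points.append(("edge", label))
    return points


def number_of_partitions(njp, mjp):
    """NOP = NJP / MJP, rounded up (the rounding is an assumption)."""
    return math.ceil(njp / mjp)


# Excerpt of the vertex and edge label frequencies from slide 18.
VERTEX_FREQ = {"A": 100, "B": 200, "C": 38, "D": 4}
EDGE_FREQ = {"m": 140, "n": 3, "x": 8}

points = find_pruning_points(["A", "D", "C"], ["m", "n"], VERTEX_FREQ, EDGE_FREQ, threshold=10)
print(points)                       # [('vertex', 'D'), ('edge', 'n')]
print(number_of_partitions(12, 5))  # 3
```

Here NPP is `len(points)`, which is 2 for this example query.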
  21. Decomposition-Based and Selectivity-Aware SQL Translation. Blindly Single-Level Decomposition: if NPP = 0, we blindly decompose the query q into NOP partitions. Each partition is translated into an intermediate evaluation step Si. The final evaluation step joins all intermediate evaluation steps and adds the conjunctive conditions of the partitions' connectors. Pruned Single-Level Decomposition: if NPP >= NOP, we distribute the pruning points across the NOP intermediate partitions. This ensures balanced and effective pruning of all intermediate results. Pruned Multi-Level Decomposition: if NPP < NOP, we distribute the pruning points across a first level of intermediate results. An intermediate collective pruned step (IPS) is constructed by joining all the pruned first-level intermediate results. IPS is used as an entry pruning point for the remaining (NOP − NPP) non-pruned partitions in a hierarchical multi-level fashion. Each pruning point can be used to prune more than one partition (if possible).
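The choice between the three decomposition strategies depends only on NPP and NOP. The following sketch encodes that case distinction; the function names are hypothetical, and the round-robin assignment of pruning points to partitions is an assumption, since the slide only states that the points are distributed across the partitions.

```python
def choose_decomposition(npp, nop):
    """Pick the decomposition strategy from the pruning point and partition counts."""
    if npp == 0:
        return "blindly single-level"
    if npp >= nop:
        return "pruned single-level"
    return "pruned multi-level"


def distribute_pruning_points(pruning_points, nop):
    """Assign pruning points to NOP partitions round-robin (assumed policy).

    When there are fewer points than partitions, the trailing partitions stay
    empty; these are the non-pruned partitions that the multi-level strategy
    evaluates against the joined, pruned first-level results.
    """
    partitions = [[] for _ in range(nop)]
    for k, point in enumerate(pruning_points):
        partitions[k % nop].append(point)
    return partitions


print(choose_decomposition(0, 3))  # blindly single-level
print(choose_decomposition(3, 2))  # pruned single-level
print(choose_decomposition(1, 3))  # pruned multi-level
print(distribute_pruning_points(["p1", "p2", "p3"], 2))  # [['p1', 'p3'], ['p2']]
print(distribute_pruning_points(["p1"], 3))              # [['p1'], [], []]
```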
  22. Decomposition-Based and Selectivity-Aware SQL Translation. [Figure: Selectivity-aware decomposition process. (a) NPP > NOP, with steps S1-SQL, S2-SQL and FES-SQL. (b) NPP < NOP, with steps S1-SQL, S2-SQL, S3-SQL and FES-SQL.]
  23. Decomposition-Based and Selectivity-Aware SQL Translation. Selectivity-aware Annotations: for any given SQL query, there are a large number of alternative execution plans. These alternative execution plans may differ significantly in their use of system resources or in response time. We use the statistical summary information to give hints to the query optimizer by injecting additional selectivity information for the individual query predicates into the SQL translations of the graph queries: SELECT fieldlist FROM tablelist WHERE Pi SELECTIVITY Si
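An annotation of the form `Pi SELECTIVITY Si` can be produced by dividing a label's frequency by the size of the table it filters. The sketch below assumes exactly that estimate; the function name and the table size of 1000 rows are hypothetical, while the frequency of 4 for vertex label D comes from slide 18. The `SELECTIVITY` clause is not standard SQL, so the generated text is only usable on engines that accept such a hint.

```python
def annotate_predicate(predicate, frequency, total_rows):
    """Append a selectivity hint, estimated as frequency / total_rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = frequency / total_rows
    return f"{predicate} SELECTIVITY {selectivity:.6f}"


# Vertex label D occurs 4 times (slide 18); 1000 rows is a made-up table size.
clause = annotate_predicate("V1.vertexLabel = 'D'", frequency=4, total_rows=1000)
print("SELECT V1.graphID FROM Vertices AS V1 WHERE " + clause)
# SELECT V1.graphID FROM Vertices AS V1 WHERE V1.vertexLabel = 'D' SELECTIVITY 0.004000
```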
  24. Experimental Results: Performance and Scalability. [Figure: The scalability of GraphREL. Panels: (a) Synthetic Dataset; (b) DBLP Dataset. X axis: Query Size (Q4, Q8, Q12, Q16, Q20). Y axis: Execution Time (ms). Legend entries: D2kV10E20L40M50 (1MB), D10kV10E20L40M50 (10MB), D50kV30E40L90M150 (50MB), D100kV30E40L90M150 (100MB).]
  25. Experimental Results: The Effect of Using Partitioned B-tree Indexes and Selectivity Injections. [Figure: The speedup improvement for the relational evaluation of sub-graph queries using partitioned B-tree indexes and selectivity-aware annotations. Panels: (a) Partitioned B-tree indexes; (b) Injection of selectivity annotations. X axis: Query Size (Q4, Q8, Q12, Q16, Q20). Axis labels in the plots: Percentage of Improvement (%), Execution Times (ms). Series: Synthetic, DBLP.]
  26. QBP: An Application of GraphREL. Many of today's information systems are driven by explicit process models. A business process is a set of coordinated activities that achieve a specific business objective. With the rapid growth in the number of process models, it becomes crucial for business process designers to be able to search their repository for models efficiently. QBP is a query processor for business process models. QBP is based on a new visual query language for business processes called BPMN-Q. The language addresses process definitions and extends the standard BPMN notation for modeling business processes for its concrete syntax. A BPMN-Q query is considered to be a graph which is matched against the process graph(s).
  27. QBP: An Application of GraphREL. [Figure: example business process model for a real-estate credit application. Activities: Customer applies for real-estate credit; Check credit rating; Check real-estate construction document; Check land register record; Offer loan protection insurance; Offer residence insurance; Prepare contract; Reject application. Other labels: Credit Rating [accepted]; Credit Rating [rejected]; Const. Doc [valid]; Const. Doc. [invalid]; Record [present]; Record [absent]; All OK.]
  28. QBP: Application Architecture. [Architecture diagram. Labels: Process Designers; BPMN-Q Query Editor; Semantic Query Expander; Semantically expanded queries; SQL-Based Query Processor; SQL Script; Query Results; Result Process Models; Business Process Model Editor; Updates; RDBMS; Relational Business Process Repository; Translation Middleware; UML ADs; BPMN; BPEL; EPC.]
  29. BPMN-Q Query Constructs.
      Anonymous Activity: it is used to indicate unknown activities in a query. It resembles an activity but is distinguished by the @ sign at the beginning of the label.
      Generic Node: it indicates an unknown node in a process. It could evaluate to any node type.
      Generic Split: it refers to any type of split gateway.
      Generic Join: it refers to any type of join gateway.
      Negative Sequence Flow: it states that two nodes A and B are not directly related by a sequence flow.
      Path: it states that there must be a path from A to B. A query usually returns all paths.
      Negative Path: it states that there is no path between two nodes A and B.
  30. QBP: An Application of GraphREL. [Figure: (a) BPMN-Q Query Example: "Customer applies for real-estate credit" // "Reject application". (b) BPMN-Q Query Match, with activities: Customer applies for real-estate credit; Check credit rating; Check real-estate construction document; Check land register record; Reject application.]
  31. QBP: Use Cases. Searching the structure of the process models. Compliance checking. Detecting design anomalies. Discovery of frequent process patterns/anti-patterns.
  32. QBP: An Application of GraphREL. http://bpmnq.sourceforge.net/
  33. Conclusions. GraphREL is a purely relational framework to store and query graph data. In principle, GraphREL has the following advantages. It can reside on any relational database system and exploits its well-known, mature query optimization techniques as well as its efficient and scalable query processing techniques. It requires no time for offline or pre-processing steps. It can handle both static and dynamic (frequently updated) graph databases very well. The selectivity annotations for the SQL evaluation scripts give the relational query optimizers the ability to select the most efficient execution plans and to prune the non-required graph database members efficiently.
  34. Outline. Previous Work: Pathfinder - Relational XQuery Compiler. Current Work: GraphREL - General Graph Query Processor. Future Work: Scalable Graph Query Processing for New Generation of Database Applications.
  35. Future Work: Large Scale Graph Query Processing (e.g., Social Networks)
  36. Future Work: Parallel Processing / MapReduce (HadoopDB)
  37. Future Work: Storing and Querying Hypergraphs
  38. References. [CIDR'03] G. Graefe. Sorting and Indexing with Partitioned B-Trees. [VLDB'04] T. Grust, S. Sakr, and J. Teubner. XQuery on SQL Hosts. [SIGMOD'07] T. Grust, M. Mayr, J. Rittinger, S. Sakr, and J. Teubner. A SQL:1999 Code Generator for the Pathfinder XQuery Compiler. [VLDB'08] J. Teubner, T. Grust, S. Maneth, and S. Sakr. Dependable Cardinality Forecasts for XQuery.
  39. References. [IJWIS'08] S. Sakr. Algebraic-Based XQuery Cardinality Estimation. [DASFAA'09] S. Sakr. GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries. [UNISCON'09] S. Sakr. Storing and Querying Graph Data Using Efficient Relational Processing Techniques. [JDM'09] S. Sakr. Purely Relational Implementation of an XQuery Processor.
  40. The End. Thank You.
