Query Processing Using Structure Index for RDF Data on the Web
1. Query Processing
Using Structure Index for RDF Data on the Web
Thanh Tran and Günter Ladwig
Institute AIFB, Karlsruhe Institute of Technology
ducthanh.tran@kit.edu, guenter.ladwig@kit.edu
KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)
2. Agenda
Problem Introduction
Approach
Structure Index for RDF Data
Structure-based Partitioning
Structure-aware Query Processing
Evaluation
Conclusion
3. RDF data
[Figure: example RDF data graph. Vertices 0 to 9 and the values KIT and MIT are connected by edges labelled AuthorOf, Supervises, WorksAt and Name.]
- Consists of triples <s,p,o>
- Triples form a graph, where vertices denote resources and their values, connected
by directed labelled edges representing properties (i.e., relations and attributes)
- URIs are used as labels of edges and vertices representing resources
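As a minimal sketch of the triple and graph view described above, the following Python snippet turns a list of <s, p, o> triples into adjacency maps. The triples are the ones that can be read off the tables on the later slides; vertex 5 is left out because its edges are not listed in the text, and the names `TRIPLES` and `build_graph` are illustrative assumptions, not part of the original work.

```python
from collections import defaultdict

# Triples recoverable from the tables on the later slides.
# Vertex 5 is omitted: its edges are not listed in the text.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def build_graph(triples):
    """Return (vertices, outgoing edges, incoming edges) for a triple list."""
    vertices = set()
    out_edges = defaultdict(list)  # subject -> [(property, object)]
    in_edges = defaultdict(list)   # object  -> [(property, subject)]
    for s, p, o in triples:
        vertices.update((s, o))
        out_edges[s].append((p, o))
        in_edges[o].append((p, s))
    return vertices, out_edges, in_edges


vertices, out_edges, in_edges = build_graph(TRIPLES)
```

Keeping both edge directions is a design choice made here because the later slides talk about outgoing and incoming edges of a vertex.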
4. Conjunctive Queries
[Figure: query graph over the variables x, y, z, u and the constant KIT, with edges labelled Supervises, WorksAt and Name.]
- Important fragment of widely used languages (SQL, SPARQL)
- Consisting of triple patterns p(s,o) where p is a predicate and s and o are variables
or constants
- Distinguished variables, e.g. x, vs. undistinguished variables
- Triple patterns constitute a query graph
5. Conjunctive Query Answering
[Figure: the query graph (variables x, y, z, u; constant KIT) matched against the example data graph (vertices 0 to 9, KIT, MIT; edges AuthorOf, Supervises, WorksAt, Name).]
- Graph pattern matching problem: a match of a query q on a graph G is a mapping h
from the variables of q to vertices of G such that the substitution of variables in
the graph-representation of q would yield a subgraph of G
- A match h is a homomorphism from the “query graph” to the data graph
- Query answering based on two basic operations: data loading and join
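The two basic operations can be sketched as follows: for each triple pattern, load the triples with the matching property, then join them with the bindings found so far. This is a naive nested-loop illustration, not the authors' implementation. The query is the running example as described in the notes (authors x working at the same place as their supervisors y, a place called KIT). The `?` prefix for variables, the function names, and the omission of vertex 5 (its edges are not listed in the text) are assumptions.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Triple patterns p(s, o); terms starting with "?" are variables.
QUERY = [
    ("Supervises", "?y", "?x"),
    ("WorksAt", "?x", "?u"),
    ("WorksAt", "?y", "?u"),
    ("Name", "?u", "KIT"),
    ("AuthorOf", "?x", "?z"),
]


def extend(binding, term, value):
    """Return the binding extended with term=value, or None on a conflict."""
    if not term.startswith("?"):
        return binding if term == value else None
    if binding.get(term, value) != value:
        return None
    return {**binding, term: value}


def answer(query, triples, distinguished):
    """Evaluate a conjunctive query by repeated data loading and joining."""
    bindings = [{}]
    for p, s, o in query:
        rows = [(ts, to) for ts, tp, to in triples if tp == p]  # data loading
        joined = []
        for b in bindings:                                      # join
            for ts, to in rows:
                b1 = extend(b, s, ts)
                b2 = extend(b1, o, to) if b1 is not None else None
                if b2 is not None:
                    joined.append(b2)
        bindings = joined
    # Project onto the distinguished variables.
    return sorted({tuple(b[v] for v in distinguished) for b in bindings})


answers = answer(QUERY, TRIPLES, ["?x"])
```

Each surviving binding is a homomorphism from the query graph into the data graph; the projection keeps only the distinguished variable x.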
6. State-of-the-art
Data Partitioning
Vertical partitioning (SW-Store)
Indexing
Sextuple indexing (Hexastore)
Materialization and indexing of entire join paths (GRIN)
Index Implementation
B+ tree
Inverted index (Semplore)
Index compression (RDF-3X)
Query processing
Sorted merge join based on vertical partitioning and indexing (SW-Store)
Join order optimization based on dynamic programming (RDF-3X)
A combination of different concepts makes up the state-of-the-art!
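To make the baseline concrete, here is a small sketch of vertical partitioning combined with a sorted merge join. It only illustrates the general idea (one sorted two-column table per property, joined on the subject column); it is not the SW-Store implementation, and all function names are assumptions.

```python
from collections import defaultdict


def vertical_partition(triples):
    """One (subject, object) table per property, each sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join_on_subject(left, right):
    """Merge join of two subject-sorted tables; yields (subject, obj_l, obj_r)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Emit the cross product of the two runs sharing this subject.
            for _, lo in left[i:i_end]:
                for _, ro in right[j:j_end]:
                    out.append((key, lo, ro))
            i, j = i_end, j_end
    return out


triples = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
]
vp = vertical_partition(triples)
authors_with_workplace = merge_join_on_subject(vp["AuthorOf"], vp["WorksAt"])
```

Because both tables are already sorted, the join scans each of them once, apart from the runs of equal subjects.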
7. Large Volume of RDF Data on the Web
- ~10 billion RDF triples (2009)
- Interlinked by ~10 million mappings (2009)
- Besides linked data, there are standalone ontologies, RDFa, etc.
8. Semi-structured RDF data on the Web
[Figure: example data graph (vertices 0 to 9, KIT, MIT) shown together with schema information: the classes Publication, PhD Student, Post Doc, Institute and String, related by AuthorOf, Supervises, WorksAt and Name.]
- RDF graph often contains both data and schema information
- Resources are linked with an rdfs:Class via rdf:type
- Schema information is often incomplete, especially for Web data and RDFa data
- RDF data might be schema-less, semi-structured data
9. Overview of Our Approach
Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing
Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP)
• Structure-aware query processing
Benefits
• Reduction of unions & joins as well as IO cost
10. Structure Index for RDF data on the Web
[Figure: example data graph and its structure index with extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, B6 = {5}, related by AuthorOf, WorksAt, Supervises and Name.]
Structure index is a graph
Is a structural description more fine-grained than a schema
Consists of classes (extensions) and relations between them
Resources in an extension exhibit the same structure, i.e., cannot be distinguished by
outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”
Parameterize bisimulation by two sets of edge labels
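The grouping into extensions can be sketched with a simple refinement loop: start with one block holding all vertices and keep splitting blocks until vertices in the same block agree on their outgoing and incoming edges into blocks. This is a naive signature-based refinement written for illustration, not the adapted coarsest-stable-refinement algorithm mentioned in the notes. The function name, the two label-set parameters and the omission of vertex 5 (its edges are not listed in the text) are assumptions.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
LABELS = {"AuthorOf", "WorksAt", "Supervises", "Name"}


def bisimulation_blocks(triples, forward_labels, backward_labels):
    """Partition vertices by forward/backward bisimilarity over the given labels."""
    vertices = sorted({v for s, _, o in triples for v in (s, o)})
    block_of = {v: 0 for v in vertices}  # a single block containing everything
    n_blocks = 1
    while True:
        signatures = {}
        for v in vertices:
            fwd = frozenset((p, block_of[o]) for s, p, o in triples
                            if s == v and p in forward_labels)
            bwd = frozenset((p, block_of[s]) for s, p, o in triples
                            if o == v and p in backward_labels)
            # Including the old block makes every step a refinement.
            signatures[v] = (block_of[v], fwd, bwd)
        ids = {}
        refined = {v: ids.setdefault(signatures[v], len(ids)) for v in vertices}
        if len(ids) == n_blocks:  # no block was split: the partition is stable
            break
        block_of, n_blocks = refined, len(ids)
    groups = {}
    for v, b in block_of.items():
        groups.setdefault(b, []).append(v)
    return sorted(sorted(g) for g in groups.values())


full = bisimulation_blocks(TRIPLES, LABELS, LABELS)
coarse = bisimulation_blocks(TRIPLES, {"AuthorOf"}, set())
```

With all labels in both directions, the sketch reproduces the extensions B1 to B5 of the slide on this reduced graph. Restricting the label sets, as in `coarse`, gives a smaller index, which is the effect of parameterization.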
11. Structure-based Partitioning
[Figure: structure index with extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, B6 = {5}, next to two example tables.]
SP B4 table (Sub, Property, Obj):
2 AuthorOf 0
4 AuthorOf 0
6 AuthorOf 1
2 WorksAt 8
4 WorksAt 8
6 WorksAt 9
VP AuthorOf table (Sub, Obj):
2 0
4 0
6 1
3 0
7 1
Whether a graph vertex instantiates a variable of a query depends on its
structure, so vertices are physically grouped based on structural similarity
Apply grouping captured by the structure index to the physical organization
Creating a physical group for every vertex of the index graph
Triples are in the same group when their subjects belong to the same extension
Triples of an SP table satisfy not only the property of a triple pattern but also
provide some structural guarantee, e.g., match the entire query structure
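A minimal sketch of this grouping, assuming the extensions B1 to B5 shown on the slides are already known: every triple goes into the table of the extension that contains its subject. The function name and the dictionary layout are illustrative assumptions; vertex 5 (extension B6) is left out because its edges are not listed in the text.

```python
from collections import defaultdict

TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}


def structure_partition(triples, extensions):
    """Group triples into one table per extension, keyed by the subject's extension."""
    block_of = {v: name for name, members in extensions.items() for v in members}
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[block_of[s]].append((s, p, o))
    return dict(tables)


sp = structure_partition(TRIPLES, EXTENSIONS)
```

Since the extensions partition the vertices, each triple lands in exactly one table, so the decomposition is exhaustive and free of redundancy.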
12. Structure-aware Query Processing
Proposition 1
A mapping of q into G exists only if it also exists into the
associated index graph G’.
The resulting extensions that match the nodes in q will
contain all data graph matches.
Two-step query processing
Index graph: find extensions Ei matching q
Data graph: combining data elements retrieved for Ei
13. Index Graph Matching
[Figure: the query graph (x, y, z, u, KIT) matched against the index graph (B1 to B6; edges WorksAt, Name, AuthorOf, Supervises). Two matches are shown: h1 = {B1, B2, B3, B4, B5} and h2 = {B2, B3, B4, B5, B6}.]
Retrieve index graph edges matching query edges (triple patterns)
Join index graph edges along query edges
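The two steps above can be sketched as follows: derive the index graph edges from the data triples and the extensions, retrieve the index edges matching each triple pattern, and join them along shared query vertices. The code is an illustration under assumptions (function names, `?` prefix for variables, constants mapped to the extension that contains them). Because vertex 5 is omitted (its edges are not listed in the text), only the match h1 is found here.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
QUERY = [
    ("Supervises", "?y", "?x"), ("WorksAt", "?x", "?u"), ("WorksAt", "?y", "?u"),
    ("Name", "?u", "KIT"), ("AuthorOf", "?x", "?z"),
]


def index_graph(triples, extensions):
    """Return (vertex -> extension, set of index edges between extensions)."""
    block_of = {v: b for b, members in extensions.items() for v in members}
    edges = {(block_of[s], p, block_of[o]) for s, p, o in triples}
    return block_of, edges


def index_matches(query, edges, block_of):
    """Join index edges along the query edges; returns one dict per match."""
    constants = {t for _, s, o in query for t in (s, o) if not t.startswith("?")}
    # A constant can only be mapped to the extension that contains it.
    matches = [{c: block_of[c] for c in constants}]
    for p, s, o in query:
        candidates = [(bs, bo) for bs, bp, bo in edges if bp == p]
        joined = []
        for m in matches:
            for bs, bo in candidates:
                if (m.get(s, bs) == bs and m.get(o, bo) == bo
                        and (s != o or bs == bo)):
                    joined.append({**m, s: bs, o: bo})
        matches = joined
    return matches


block_of, edges = index_graph(TRIPLES, EXTENSIONS)
h = index_matches(QUERY, edges, block_of)
```

The index graph has only six edges here, against fifteen data triples, which is why matching on it first is cheap.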
14. Query Pruning
Proposition 2
If a query is tree-shaped, and consists only of
undistinguished variables (besides the root), matches on
the structure index contain all and only data graph
matches.
Data elements contained in the extensions matching the
query root node represent all and only final query answers
Given such queries, no further processing is needed
Given more general queries, tree-shaped query parts can be
pruned away
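One simple way to approximate this pruning is to strip leaf variables repeatedly: a query edge is removed when one of its endpoints is an undistinguished variable that occurs in no other remaining edge. This is a simplified reading of the tree-shape condition and assumes the index was built with the bisimulation direction and labels that cover the pruned edges; it is not the formal definition from the paper. The function name and the `?` prefix are assumptions.

```python
from collections import Counter

QUERY = [
    ("Supervises", "?y", "?x"), ("WorksAt", "?x", "?u"), ("WorksAt", "?y", "?u"),
    ("Name", "?u", "KIT"), ("AuthorOf", "?x", "?z"),
]


def prune_query(query, distinguished):
    """Split the query into (remaining atoms, pruned atoms) by leaf stripping."""
    remaining, pruned = list(query), []
    changed = True
    while changed:
        changed = False
        degree = Counter()
        for _, s, o in remaining:
            degree[s] += 1
            degree[o] += 1
        for atom in remaining:
            _, s, o = atom
            # A prunable leaf is an undistinguished variable used by one edge only.
            if any(t.startswith("?") and t not in distinguished and degree[t] == 1
                   for t in (s, o)):
                remaining.remove(atom)
                pruned.append(atom)
                changed = True
                break  # degrees are stale now; recompute before continuing
    return remaining, pruned


remaining, pruned = prune_query(QUERY, {"?x"})
```

On the running example only AuthorOf(x, z) is pruned, since z is the only undistinguished leaf variable; the constant KIT is kept because it still has to be checked against the data.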
15. Query Pruning
[Figure: index graph match h1 = {B1, B2, B3, B4, B5} of the query (x, y, z, u, KIT) on the index graph B1 to B6, with edges WorksAt, Name, AuthorOf and Supervises.]
Elements in extensions are known to satisfy query structure
Elements in B4 are already known to be authors of some z
No further data processing is needed for this part
16. Data Graph Matching
[Figure: triples retrieved from the extensions matched by h1. B1: 3 WorksAt 8, 7 WorksAt 9, 3 Supervises 2, 3 Supervises 4, 7 Supervises 6, ... B3: 8 Name KIT, 9 Name MIT. B4: 2 WorksAt 8, 4 WorksAt 8, 6 WorksAt 9, ... One resulting data graph match: h'1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT}.]
Retrieve triples from matching extensions & join along query edges
Match class processing: group index graph matches to match classes to
avoid processing matches that partially overlap
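The retrieval step can be sketched like this: for every remaining query edge, only the SP table of the extension matched by the edge's subject is read, and the rows are joined along the query edges. The index match h1 and the pruned query are taken from the slides; the function names, the `?` prefix and the omission of vertex 5 (its edges are not listed in the text) are assumptions, and match-class grouping is not shown.

```python
from collections import defaultdict

TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
H1 = {"?u": "B3", "?x": "B4", "?y": "B1", "?z": "B2", "KIT": "B5"}
# Query after pruning AuthorOf(x, z).
PRUNED_QUERY = [
    ("Supervises", "?y", "?x"), ("WorksAt", "?x", "?u"),
    ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
]


def sp_tables(triples, extensions):
    """One table of triples per extension, keyed by the subject's extension."""
    block_of = {v: b for b, members in extensions.items() for v in members}
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[block_of[s]].append((s, p, o))
    return tables


def fits(binding, term, value):
    """True if term (variable or constant) is compatible with value."""
    if term.startswith("?"):
        return binding.get(term, value) == value
    return term == value


def data_matches(query, tables, index_match, distinguished):
    bindings = [{}]
    for p, s, o in query:
        # Only read the table of the extension matched by the subject.
        rows = [(ts, to) for ts, tp, to in tables[index_match[s]] if tp == p]
        joined = []
        for b in bindings:
            for ts, to in rows:
                if fits(b, s, ts) and fits(b, o, to) and (s != o or ts == to):
                    nb = dict(b)
                    nb.update({t: v for t, v in ((s, ts), (o, to))
                               if t.startswith("?")})
                    joined.append(nb)
        bindings = joined
    return sorted({tuple(b[v] for v in distinguished) for b in bindings})


tables = sp_tables(TRIPLES, EXTENSIONS)
result = data_matches(PRUNED_QUERY, tables, H1, ["?x"])
```

The WorksAt pattern on x reads three rows from the B4 table instead of the five rows a VP WorksAt table would hold for this graph.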
17. Evaluation
DBLP and several synthetic datasets created using the
Lehigh University Benchmark (LUBM)
30 queries categorized into five classes
Single-atom query (QDBLP1): SELECT ?x
  type(x, Person)
Entity query (QLUBM9): SELECT ?x ?m
  emailAddress(x, fp@edu)
  res.Interest(x, research24)
  telephone(x, xxx-xxx-xxxx)
Star query (QDBLP12): SELECT ?x, ?n
  type(x, Person)
  name(x, n)
  editor(y, x)
  author(z, x)
  cites(u, z)
Path query (QLUBM6): SELECT ?x ?y
  takesCourse(x, y)
  teacherOf(z, y)
  type(z, FullProfessor)
Graph-shaped query (QLUBM15): SELECT ?x ?a
  teacherOf(FullProfessor5, y)
  takesCourse(x, y)
  publicationAuthor(b, x)
  name(b, Publication7)
  memberOf(x, z)
  memberOf(a, z)
  advisor(x, a)
  telephone(a, xxx-xxx-xxxx)
18. Evaluation – Performance
[Figure: two charts for queries q1 to q15 on DBLP, with a logarithmic time axis from 0.1 to 100000 ms. One chart shows the total time in ms for SP and VP, including the mean. The other shows the time of the separate steps in ms (idx match, load(VP-SP), join(VP-SP)) and the number of pruned query nodes.]
Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
Total query processing times
Times of individual steps involved
SP is slightly slower w.r.t. simple queries (1-3)
SP is 8-9 times faster w.r.t. complex queries (4-15)
With more complex queries, the overhead incurred by answer space
matching can be outweighed by the accumulated gain for load and join
19. Conclusions
Structure index that can deal with general graph-
structured RDF data on the Web
Structure index can be leveraged for dealing with
semi-structured data on the Web
Structure index can be used for RDF data
partitioning & query processing, allowing complex
queries to be processed many times faster
Future work
Adopt existing concepts in XML data management for
structure index optimization & updates
Query optimization for structure-aware query processing
20. Thank you for your attention!
Structure Index for RDF Data on the Web
Duc Thanh Tran, AIFB Institute, KIT
E-Mail: ducthanh.tran@kit.edu
Web: http://sites.google.com/site/kimducthanh
21. State-of-the-art
Data Partitioning
Big table (Old versions of Oracle, Jena, Sesame)
Property tables (Jena)
Vertical partitioning (SW-Store)
Indexing
Multiple indexing (YARS)
Sextuple indexing (Hexastore)
Materialization and indexing of entire join paths (GRIN)
Index Implementation
B+ tree
Inverted index (Semplore)
Index compression (RDF-3X)
Query processing
Sorted merge join based on vertical partitioning and indexing (SW-Store)
Join order optimization based on dynamic programming (RDF-3X)
A combination of different concepts makes up the state-of-the-art!
22. Overview of Our Approach
Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing
Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP): triples with same structure are grouped
• Structure-aware query processing
• Use structure index to focus on data that satisfy the overall query structure
• Then retrieve data from the corresponding structure-based partitioned tables
Benefits
• Target data partitioning & query processing, i.e., complementary to other concepts
• Reduction of unions & joins as well as IO cost
23. Evaluation – Scalability
[Figure: two charts. One shows query times in ms (axis 0 to 25000) over DBLP, LUBM1, LUBM5, LUBM10 and LUBM50. The other shows processing times in ms (axis 0 to 10000) over LUBM1, LUBM5, LUBM10, LUBM20 and LUBM50. Legend entries as extracted: OSQP, VPQP-SQP, SQP, SQP idx match, load (VPQP-SQP), join (VPQP-SQP).]
Measured the average query performance for LUBM with varying size
Time increases with the size of the data
Gain for load and join increases in larger proportion than the overhead
incurred for index match
Match performance is determined by the size of the index graph
Size depends on structure but not on the size of the data graph
Match time does not necessarily increase when the data becomes larger
Positive effect of data filtering (IO reduction) and query pruning (load and
join) correlates with the data size
Editor's Notes
In recent years, the amount of structured data available on the Web has been increasing rapidly, especially RDF data consisting of triples of the form <s, p, o>, where s is the subject, p is a property, and o is the object. Such triples form a data graph G(V, L, E) where the vertices V denote resources and their values, which are connected by directed edges E, each endowed with a label from a label set L. One example is shown in Fig. 1.
This development of a data web opens new ways for addressing complex information needs. Search is no longer limited to matching keywords against documents, but instead, structured queries can be processed against web resources. In this regard, conjunctive queries represent an important fragment of widely used languages (SQL, SPARQL), which has been a focus of recent work on RDF data management [1, 10, 6]. Essentially, a query of this type consists of a set of triple patterns of the form p(s, o), where p is a predicate and s and o are variables ($Var_q$) or constants ($Con_q$). These conjunctive queries have high practical relevance because they are capable of expressing a large portion of relational queries. The vast majority of query languages used in practice fall into this fragment, including large parts of SQL and SPARQL, the standard language for querying RDF. Intuitively speaking, variables appearing in the SELECT clause are called distinguished variables ($Var_q^d$), otherwise undistinguished variables ($Var_q^u$). Triple patterns constitute a query graph, as illustrated in Fig. 2b.
A match of a conjunctive query q on a graph G is a mapping h from the variables of q to vertices of G such that the corresponding substitution of variables in the graph representation of q would yield a subgraph of G. Therefore a query match h can be interpreted as a certain type of homomorphism (i.e. a structure-preserving mapping) from the "query graph" to the data graph. Because the amount of data is enormous and steadily increasing, scalability of this graph pattern matching on the data web has become a key issue. Search complexity increases substantially with the size of the graph.
Data organization and indexing determine the efficiency of data loading; the efficiency of joins depends on the join implementation and join order optimization.
State-of-the-art: For this problem of matching a query graph pattern against the data graph, there are RDF stores, which retrieve data for every triple pattern and join it along the query edges. While the efficiency of retrieval depends on the physical data organization and indexing, the efficiency of join is largely determined by the join implementation and join order optimization strategies. We discuss these performance drivers that distinguish existing RDF stores:
Data Partitioning: Different schemes have been proposed to govern the ways data is physically organized and stored. A basic scheme is the triple-based organization, where one big three-column table is used to store all triples. To avoid the many self-joins on the giant table, property-based partitioning is suggested [2], where data is stored in several "property tables", each containing triples of one particular type of entities. Vertical partitioning (VP) has been proposed to decompose the data graph into n two-column tables, where n is the number of properties [1]. As this scheme allows entries to be sorted, fast merge joins can be performed.
Indexing Scheme: With multiple indexing, several indexes are created for supporting different lookup patterns. The scheme with the widest coverage of access patterns is used in YARS [3], where six indexes are proposed to cover 16 possible access patterns of quads (triple patterns plus one additional context element). In [4], sextuple indexing has been suggested, which generalizes the strategy in [3] such that for different access patterns, retrieved data comes in a sorted fashion. In fact, this work extends VP with the idea of multiple indexing to support fast merge joins on different and more complex query patterns. Thus, this indexing technique goes beyond triple lookup operations to support fast joins. Along this line, entire join paths have been materialized and indexed using suffix arrays [5]. A different path index, called GRIN and based on judiciously chosen "center nodes", has been proposed in [6].
Index Implementation: The B+-tree is most commonly used in current RDF stores. Recently, the inverted index typically used for IR tasks has been recognized as a viable choice for indexing large amounts of web data. It has been proposed to manage RDF data [7] and dataspaces [8]. Also, index compression techniques for RDF have been discussed [9].
Query Processing & Optimization: Executing joins during query processing can be greatly accelerated when the retrieved triples are already sorted. Through VP, retrieved data comes in sorted fashion, enabling fast merge joins [1]. This join implementation has near-linear complexity, resulting in best performance. Sextuple indexing takes this further to allow this join processing to be applied on many more query patterns, e.g. when the query contains unbound predicates such that p is a variable [4]. Further efficiency gains can be achieved by finding an optimal query plan [9], which leverages dynamic programming that also involves bushy plans.
It has been reported that there is no single system [10], but rather a combination of different concepts that makes up the state-of-the-art in RDF data management. In particular, VP [1] is the candidate for physical data organization, multiple indexes [3] enable fast lookup, and optimized query plans [9] result in fast performance for complex join processing.
We elaborate on concepts that improve the state-of-the-art in data partitioning and query processing:
- Parameterized Structure Index for RDF Data: Generalizing work on XML data such as dataguide [5], we propose an index called PIG that summarizes the structure of general graph-structured data like RDF. The size of this index can be controlled by means of parameters (e.g. derived from the workload).
- Structure-based Partitioning: Based on PIG, we propose a structure-based partitioning scheme, where triples about elements with the same structure are physically grouped. This is to obtain a contiguous storage of data that likely co-occurs in query answers.
- Structure-aware Query Processing: We propose to match the query against the structure index first, which is typically much smaller than the data graph (cf. examples in Fig. 2). This helps to focus on data that satisfy the overall structure of the query and, on this basis, to proceed with standard processing at the level of the data for only certain parts of the query.
Our solution is complementary to the concepts for indexing and query optimization [10, 8], and offers the following additional benefits:
- Reduction of I/O Costs: We do not simply retrieve all data that matches some given triple patterns but focus on the data that satisfies the entire query structure.
- Reduction of Unions and Joins: These operations are needed only for some parts of the query. In the extreme cases where no structure index matches can be found, we can skip data access and joins at the data level completely.
In a benchmark against the state-of-the-art techniques for data partitioning and query processing used in SW-Store [1], our approach is 7-8 times faster for a PIG that is parameterized according to the query workload.
Outline: We introduce PIG in Section 2. Partitioning, query processing and parameterization are discussed in Sections 3, 4 and 5. Experiments along with results are discussed in Section 6 before we review related work in Section 7 and conclude in Section 8. For more details, we refer interested readers to our technical report [2].
PIG is a special graph forming a compact representation of the data graph, whose vertices stand for groups of data graph elements that have a similar or equal structural "neighborhood". We capture the concept of equal structural neighborhood by the well-known notion of bisimulation originating from the theoretical analysis of state-based dynamic systems. We adopt this notion to capture both directions of edges. We consider graph nodes $v_1, v_2$ as bisimilar (written $v_1 \sim v_2$) if they cannot be distinguished by their outgoing and incoming edge trees. First, a bisimulation for $L_1$ and $L_2$ is calculated, using an adapted version of the algorithm for determining the coarsest stable refinement of a partitioning [9]. The algorithm starts with a partition consisting of a single block that contains all data, and splits it into smaller blocks until the partition is a forward bisimulation. In order to perform both backward and forward bisimulation for only the parameters $L_1$ and $L_2$, we essentially exploit the observation that $L_1$-forward-$L_2$-backward bisimulation on a data graph $G = (V, L, E)$ coincides with forward bisimulation on an altered data graph $G_{L_1 L_2} = (V, L_1 \cup \{l^- \mid l \in L_2\}, E_{L_1 L_2})$, where $E_{L_1 L_2} = \{l(x, y) \mid l(x, y) \in E, l \in L_1\} \cup \{l^-(y, x) \mid l(x, y) \in E, l \in L_2\}$. After having determined the bisimulation, the resulting blocks from the partition P are used to form vertices in the index graph according to Definition 1.
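The construction of the altered data graph can be illustrated with a short sketch: edges with a label in L1 are kept, and edges with a label in L2 are added in reverse direction under an inverse label. Writing the inverse label by appending `^-` to the label string is an assumption made for this example, as is the function name.

```python
def altered_graph(triples, l1, l2):
    """Build the edge set on which plain forward bisimulation is then computed."""
    forward = [(s, p, o) for s, p, o in triples if p in l1]
    # Reversed copies of the L2 edges, marked with an inverse label.
    inverted = [(o, p + "^-", s) for s, p, o in triples if p in l2]
    return forward + inverted


edges = altered_graph(
    [("3", "Supervises", "2"), ("2", "WorksAt", "8"), ("8", "Name", "KIT")],
    l1={"WorksAt"},
    l2={"Supervises"},
)
```

Here the Name edge is dropped entirely because its label is in neither parameter set, which is how the parameters control the size of the resulting index.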
Clearly, whether a graph vertex instantiates a variable of a query depends on its structural properties, i.e. the incoming and outgoing edges or paths. Therefore, if nodes are physically grouped together based on structural similarity, a group would contain more candidates for variable instantiations. Thus, we apply structure-based partitioning to the data graph by creating a physical group (e.g. a table) for every vertex of the index graph, i.e. one group for every extension. Every group contains the triples which "describe" elements contained in the corresponding extension. That is, triples are in the same group when they contain the same properties and their subjects belong to the same extension. Recall that extensions represent partitions of the data graph. Thus, grouping triples based on extensions guarantees an exhaustive and redundancy-free decomposition of the data graph.
Compared to VP, where triples with the same property are grouped together, SP applies to triples that are similar in structure. Using VP tables, triples retrieved from disk match the property of a single triple pattern. However, whether such a triple is also relevant for the entire query (i.e., contributes to the final results) depends on its structure. Since SP tables contain only triples that are similar in structure, they are likely to contain relatively more relevant triples when identified to be relevant for a query. In fact, triples of an SP table retrieved for a given query satisfy not only the property of a triple pattern of that query but also the entire query structure. Thus with SP, we can focus on relevant data. In effect, it reduces the amount of irrelevant data that might have to be retrieved from disk when using VP, and thus can reduce I/O costs.
Query processing in our scenario is essentially finding a homomorphism from the query graph $q = (V_{var} \uplus V_{con}, L, P)$ to elements of the data graph $G = (V, L, E)$. According to the following proposition, the structure index can be exploited to perform this task:
PROPOSITION 1. Let G be a data graph with associated index graph G' and let q be another graph such that there is a homomorphism h from q into G. Then h' with h'(v) = [h(v)] is a homomorphism from q into G'.
Intuitively speaking, a mapping into G exists only if it also exists into the associated index graph G'. Further, the resulting extensions $B_i$ = [h(v)] from V' in G' that match the nodes in q will contain the data graph matches h(v). Thus, the procedure for query processing can be decomposed into two steps: (1) finding matching extensions $B_i$ on the index graph G' first, (2) then combining data elements retrieved for $B_i$ to obtain the final data graph matches. In the following, $G'_{Idx}$ denotes the index used for the retrieval of elements from the index graph, and $G_{Idx}$ is used for accessing elements of the data graph.
Just like an answer, an index graph match is the result of a homomorphic mapping h' from the query graph $q(V_q, L_q, P)$ to the index graph G'(V', L, E'). Elements of an index graph match are vertices of G' that are assigned to variables and constants of q. For this computation, we propose a join procedure that returns a result table R containing all matches $h': V_q \to V'$. First, a set of index graph candidate edges $E_l$ is retrieved from G' for every query edge label l occurring in the query (using $G'_{Idx}$). Then, these candidate sets are joined along the vertices of q to obtain R. Figure 3 illustrates two matches.
The previous computation results in a set R of index graph matches $h': V_q \to V'$. Every element of these matches is an extension, which essentially is a set of vertices of the queried data graph G. According to Proposition 1, every match of the query against the data graph is "contained" in one of the index graph matches calculated so far (e.g. $h(v_1)$ is in $h'(v_1) = [h(v_1)]$). It suffices to focus on the index graph matches for the computation of the data graph matches because only data contained in them satisfy the overall query structure. Tree-shaped parts containing only undistinguished variables can even be pruned away entirely. This notion of tree-shaped query part is defined inductively. Given such a tree-shaped query part, a stronger property can be asserted for the index graph matches, i.e. they contain all and only data graph matches, such that no further processing at the level of the data graph is needed. In words: if the query is of the aforementioned tree shape, then every data node from any extension associated to the query root r by an index graph match is a data graph match for r. Hence, before computing data graph matches, the respective query parts can be removed.
Fig. 2b depicts a query which asks for authors x working at the same place as their supervisors y, namely a place called KIT. One match on the index graph is h1 = {u ↦ B3, x ↦ B4, y ↦ B1, z ↦ B2, KIT ↦ B5}. Based on this, we know that data elements belonging to extensions obtained from the index graph match satisfy the query structure, e.g. elements in B4 are authors of z, supervised by y, and work at some place u that has a name. A tree-like part that can be pruned is AuthorOf(x, z). It produces the index graph match <B4 AuthorOf B2>. Since 2, 4, and 6 in B4 are already known to be authors of some z, no further data processing is needed for this query part. However, we have to look at the data to verify that elements in B4 work at KIT, and are supervised by some y also working at KIT. For this, we need to retrieve and join the triple matches for <x WorksAt u>, <y WorksAt u>, <u Name KIT>, <y Supervises x>. Note that the query example here contains cycles. In practice, there are many queries exhibiting simpler structure, which offer greater potential for query pruning. In the extreme cases where no index graph matches can be found, we can skip the entire second step to avoid data access and joins completely.
After pruning the query, we use another join procedure to compute a result table where rows capture bindings to distinguished query variables. These bindings are data elements contained in the index graph matches h', which satisfy the structure as well as the concrete elements (i.e. constants and distinguished variables) mentioned in the query. Query edges are processed successively. At every iteration, triples are retrieved from $G_{Idx}$ and are joined with the (intermediate) result set. More precisely, given the query edge p(x, y), the triples <x ↦ s, p, y ↦ o> matching p(x, y) are considered. They are fetched from the corresponding block [s] of the structure-based partitioned data graph index $G_{Idx}$, where s ∈ [s] and h': x ↦ [s]. Intuitively speaking, only triples with subjects that are contained in the index match [s] are retrieved from disk. Thus, only subjects that are known to satisfy the query structure are considered. This is different from the standard approaches [1, 8], where all triples matching the query edge are taken into account, which might contain subjects not in [s].
The procedure presented in the previous section computes data matches for an index graph match h'. In order to compute all data matches, this has to be repeated for all index graph matches h' in R. However, the diverse matches might partially overlap. To formalize and computationally exploit this, we introduce the notion of match classes. In words, the associated proposition ensures that all data graph matches of a query can be obtained by a successive refinement of match classes and their associated data matches. Consequently, the optimized procedure for computing query data matches consists of two main parts: (1) update of match classes and (2) evaluation of match classes.
Match classes are defined w.r.t. query vertices. For the first part, match classes are thus created (updated) according to the query vertices that are added during the process of join processing. At first, there is only one initial match class R consisting of all index graph matches (line 1). During the processing of query atoms p(x, y), the set of classes MC becomes more and more "fine-grained", as any matches not coinciding on how x and y are mapped to V' will be distributed to different match classes (line 11). A hash map H, which associates pairs of index matches (x-y-instantiations) with match classes, is employed to check for overlaps. Note that during the processing of the atoms in P, the number of classes grows as high as the number of matches, i.e. every match constitutes its own class.
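The refinement of match classes with a hash map can be sketched as below. The two index matches are hypothetical: H1 follows the match h1 given in the notes, while the variable assignment used for H2 is an assumption made only to have a second, partially overlapping match. The function name and the list-of-lists representation are assumptions as well; the evaluation of the classes against the data is not shown.

```python
# Hypothetical index graph matches; the assignment in H2 is assumed.
H1 = {"?y": "B1", "?x": "B4", "?u": "B3", "?z": "B2", "KIT": "B5"}
H2 = {"?y": "B4", "?x": "B6", "?u": "B3", "?z": "B2", "KIT": "B5"}


def refine_match_classes(classes, x, y):
    """Split every class so that its matches agree on the images of x and y."""
    table = {}  # hash map: (old class, image of x, image of y) -> new class
    for class_id, members in enumerate(classes):
        for match in members:
            table.setdefault((class_id, match[x], match[y]), []).append(match)
    return list(table.values())


classes = [[H1, H2]]                                        # one initial class
after_name = refine_match_classes(classes, "?u", "KIT")     # Name(u, KIT)
after_works = refine_match_classes(after_name, "?x", "?u")  # WorksAt(x, u)
```

Both matches map u and KIT to the same extensions, so the atom Name(u, KIT) can be evaluated once for the shared class; they only separate when an atom involving x is processed.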
More optimized systems have been built that implement the concepts of indexing and query optimization [8, 10]. Since these aspects are orthogonal, we use the work in [1] as baseline, which is the state-of-the-art in, and is purely focused on, partitioning and query processing. We compare our work, called structure-based query processing, against this baseline.
We now summarize the experiment reported in detail in [2]. It is based on DBLP and several synthetic datasets containing several millions of triples created using the Lehigh University Benchmark. A set of 30 queries categorized into five classes, ranging from single-atom queries to complex graph-shaped queries, has been used. We use two parameterizations for the experiments: (1) $SP_B$ is based on $G'_B$, calculated using backward bisimulation only, and (2) $SP_{FB}$ uses $G'_{FB}$, a restricted backward and forward bisimulation adapted to the workload by setting $L_1, L_2$ to include only labels occurring in prunable query parts. $G'_{FB}$ is much smaller (4%-30%) and the indexes for $G'_B$ make up only a small percentage (0.08%-2%) of the data graph.
We have proposed techniques for RDF data partitioning and query processing that can exploit the underlying structure to improve the management of RDF data, based on a novel structure index called PIG. In a principled manner, we showed that this approach is faster than the state-of-the-art, especially for complex structured queries. As future work, we will elaborate on how existing work on RDF query optimization can be used for the proposed structure-based query processing technique. Further, strategies proposed for optimizing updates of XML structure indexes will be studied and adopted.
Data organization and indexing determine the efficiency of data loading; the efficiency of joins depends on the join implementation and on join order optimization.
State-of-the-art: For the problem of matching a query graph pattern against the data graph, there are RDF stores that retrieve data for every triple pattern and join it along the query edges. While the efficiency of retrieval depends on the physical data organization and indexing, the efficiency of joins is largely determined by the join implementation and the join order optimization strategies. We discuss these performance drivers, which distinguish existing RDF stores:
Data Partitioning: Different schemes have been proposed to govern how data is physically organized and stored. A basic scheme is the triple-based organization, where one big three-column table is used to store all triples. To avoid the many self-joins on this giant table, property-based partitioning is suggested [2], where data is stored in several “property tables”, each containing triples of one particular type of entity. Vertical partitioning (VP) has been proposed to decompose the data graph into n two-column tables, where n is the number of properties [1]. As this scheme allows entries to be sorted, fast merge joins can be performed.
Indexing Scheme: With multiple indexing, several indexes are created to support different lookup patterns. The scheme with the widest coverage of access patterns is used in YARS [3], where six indexes are proposed to cover 16 possible access patterns of quads (triple patterns plus one additional context element). In [4], sextuple indexing has been suggested, which generalizes the strategy in [3] such that, for different access patterns, retrieved data comes in sorted fashion. In fact, this work extends VP with the idea of multiple indexing to support fast merge joins on different and more complex query patterns. Thus, this indexing technique goes beyond triple lookup operations to support fast joins. Along this line, entire join paths have been materialized and indexed using suffix arrays [5]. A different path index, coined GRIN and based on judiciously chosen “center nodes”, has been proposed in [6].
Index Implementation: The B+-tree is most commonly used in current RDF stores. Recently, the inverted index typically used for IR tasks has been recognized as a viable choice for indexing large amounts of web data. It has been proposed for managing RDF data [7] and dataspaces [8]. Also, index compression techniques for RDF have been discussed [9].
Query Processing & Optimization: Executing joins during query processing can be greatly accelerated when the retrieved triples are already sorted. Through VP, retrieved data comes in sorted fashion, enabling fast merge joins [1]. This join implementation has near-linear complexity, resulting in the best performance. Sextuple indexing takes this further and allows this join processing to be applied to many more query patterns, e.g. when the query contains unbound predicates such that p is a variable [4]. Further efficiency gains can be achieved by finding an optimal query plan [9], which leverages dynamic programming that also involves bushy plans.
It has been reported that there is no single system [10], but rather a combination of different concepts, that makes up the state-of-the-art in RDF data management. In particular, VP [1] is the candidate for physical data organization, multiple indexes [3] enable fast lookup, and optimized query plans [9] result in fast performance for complex join processing.
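The interplay of vertical partitioning and merge joins described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names `vertical_partition` and `merge_join_on_subject`, the in-memory lists, and the toy triples are all assumptions made for illustration. Each property gets one two-column (subject, object) table sorted by subject, so two tables can be joined on the subject in a single pass.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Split (s, p, o) triples into one (s, o) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}

def merge_join_on_subject(left, right):
    """Join two subject-sorted (s, o) tables on s; returns (s, o_left, o_right) rows."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of rows sharing this subject on both sides.
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            for _, lo in left[i:i_end]:
                for _, ro in right[j:j_end]:
                    out.append((ls, lo, ro))
            i, j = i_end, j_end
    return out

# Hypothetical toy data, not from the evaluation.
triples = [
    ("p2", "worksAt", "i1"),
    ("p1", "worksAt", "i1"),
    ("p1", "supervises", "p3"),
    ("p2", "supervises", "p4"),
    ("p5", "supervises", "p6"),
]
tables = vertical_partition(triples)
print(merge_join_on_subject(tables["worksAt"], tables["supervises"]))
# [('p1', 'i1', 'p3'), ('p2', 'i1', 'p4')]
```

The join advances both cursors monotonically, which is why sorted retrieval matters: without the sort order, each pattern pair would need a hash table or a nested loop instead.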
We elaborate on concepts that improve the state-of-the-art in data partitioning and query processing:
- Parameterized Structure Index for RDF Data: Generalizing work on XML data such as the dataguide [5], we propose an index called PIG that summarizes the structure of general graph-structured data like RDF. The size of this index can be controlled by means of parameters (e.g. derived from the workload).
- Structure-based Partitioning: Based on PIG, we propose a structure-based partitioning scheme, where triples about elements with the same structure are physically grouped. The aim is to obtain contiguous storage of data that is likely to co-occur in query answers.
- Structure-aware Query Processing: We propose to match the query against the structure index first, which is typically much smaller than the data graph (cf. examples in Fig. 2). This helps to focus on data that satisfies the overall structure of the query and, on this basis, to proceed with standard processing at the level of the data for only certain parts of the query.
Our solution is complementary to the concepts for indexing and query optimization [10, 8], and offers the following additional benefits:
- Reduction of I/O Costs: We do not simply retrieve all data that matches some given triple patterns but focus on the data that satisfies the entire query structure.
- Reduction of Unions and Joins: These operations are needed only for some parts of the query. In the extreme case where no structure index match can be found, we can skip data access and joins at the data level completely.
In a benchmark against the state-of-the-art techniques for data partitioning and query processing used in SW-Store [1], our approach is 7-8 times faster for a PIG that is parameterized according to the query workload.
Outline: We introduce PIG in Section 2. Partitioning, query processing and parameterization are discussed in Sections 3, 4 and 5. Experiments along with results are discussed in Section 6 before we review related work in Section 7 and conclude in Section 8. For more details, we refer interested readers to our technical report [2].
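To make the three concepts concrete, the following Python sketch builds a small structure index, groups triples by it, and matches a query against the index before touching any data. It is not the PIG construction itself. As a simplifying assumption, two nodes are treated as having "the same structure" when iterative refinement over their incoming and outgoing edge labels cannot tell them apart (a bisimulation-style partition). All function names and the toy triples are hypothetical, parameterization is omitted, and queries are assumed to contain only variables.

```python
from collections import defaultdict

def build_structure_index(triples):
    """Assign each node a block id; nodes in one block have indistinguishable structure."""
    nodes = {n for s, _, o in triples for n in (s, o)}
    block = {n: 0 for n in nodes}
    num_blocks = 1
    while True:
        sig = {n: (set(), set()) for n in nodes}
        for s, p, o in triples:
            sig[s][0].add((p, block[o]))   # outgoing edge label + target block
            sig[o][1].add((p, block[s]))   # incoming edge label + source block
        ids, new_block = {}, {}
        for n in sorted(nodes):
            key = (block[n], frozenset(sig[n][0]), frozenset(sig[n][1]))
            new_block[n] = ids.setdefault(key, len(ids))
        if len(ids) == num_blocks:         # no block was split: partition is stable
            return new_block
        block, num_blocks = new_block, len(ids)

def partition_by_structure(triples, block):
    """Group triples by index edge (source block, property, target block)."""
    parts = defaultdict(list)
    for s, p, o in triples:
        parts[(block[s], p, block[o])].append((s, p, o))
    return dict(parts)

def match_index(query, index_edges, binding=None):
    """Return all mappings of query variables to blocks that satisfy every pattern."""
    binding = binding or {}
    if not query:
        return [binding]
    (s, p, o), rest = query[0], query[1:]
    results = []
    for bs, bp, bo in index_edges:
        if bp != p or (s == o and bs != bo):
            continue
        if binding.get(s, bs) != bs or binding.get(o, bo) != bo:
            continue
        results.extend(match_index(rest, index_edges, {**binding, s: bs, o: bo}))
    return results

def relevant_partitions(query, index_matches):
    """Partitions that data-level processing still has to load."""
    return {(m[s], p, m[o]) for m in index_matches for s, p, o in query}

# Hypothetical toy data.
triples = [
    ("a", "supervises", "b"), ("a", "worksAt", "i1"), ("i1", "name", "KIT"),
    ("c", "supervises", "d"), ("c", "worksAt", "i2"), ("i2", "name", "MIT"),
    ("e", "authorOf", "f"),
]
block = build_structure_index(triples)
parts = partition_by_structure(triples, block)
print(len(set(block.values())))  # 6 blocks for 10 nodes

q1 = [("x", "supervises", "y"), ("x", "worksAt", "z")]
q2 = [("x", "authorOf", "y"), ("x", "worksAt", "z")]
print(len(relevant_partitions(q1, match_index(q1, set(parts)))))  # 2
print(match_index(q2, set(parts)))  # [] -> data access and joins can be skipped
```

In this toy run, the second query has no match at the index level, which corresponds to the extreme case mentioned above: no partition is loaded and no join is executed. For the first query, only two of the four partitions are read.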
Scalability: We measured the average query performance for LUBM with varying size (i.e. generated for 1, 5, 10, 20 and 50 universities). We found that the performance of our approach improves with the size of the data. In particular, the gain for load and join increases in larger proportion than the overhead incurred for index matching. This is because match performance is determined by the size of the index graph, which depends on the structure but not on the size of the data graph. Thus, the match time does not necessarily increase when the data graph becomes larger. The positive effect of data filtering (I/O reduction) and query pruning (load and join), however, correlates with the data size.