Query Processing Using Structure Index for RDF Data on the Web

Thanh Tran and Günter Ladwig
Institute AIFB, Karlsruhe Institute of Technology
ducthanh.tran@kit.edu, guenter.ladwig@kit.edu

KIT – the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)

Agenda

- Problem Introduction
- Approach
- Structure Index for RDF Data
- Structure-based Partitioning
- Structure-aware Query Processing
- Evaluation
- Conclusion

RDF data
[Figure: example RDF data graph. Vertices 0 to 9 plus the values KIT and MIT, connected by edges labelled AuthorOf, Supervises, WorksAt and Name.]

- Consists of triples <s,p,o>
- Triples form a graph whose vertices denote resources and their values, connected
by directed, labelled edges representing properties (i.e., relations and attributes)
- URIs are used as labels of edges and of vertices representing resources

Conjunctive Queries
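[Figure: example query graph with the variables x, y, z, u and the constant KIT, connected by edges labelled Supervises, WorksAt and Name.]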

- An important fragment of widely used query languages (SQL, SPARQL)
- Consist of triple patterns p(s,o), where p is a predicate and s and o are variables
or constants
- Variables are either distinguished (returned in the answer, e.g., x) or undistinguished
- The triple patterns together constitute a query graph

Conjunctive Query Answering
[Figure: the example query graph (x, y, z, u, KIT) drawn over the example data graph (vertices 0 to 9, KIT, MIT) to illustrate how query nodes map to data vertices.]

- Graph pattern matching problem: a match of a query q on a graph G is a mapping h
from the variables of q to vertices of G such that substituting the variables in
the graph representation of q yields a subgraph of G
- A match h is a homomorphism from the query graph to the data graph
- Query answering is based on two basic operations: data loading and join
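
The matching definition above can be illustrated with a minimal Python sketch. The triples are the ones spelled out elsewhere in this deck (vertex 5 is left out because its edges are not listed in the text). The query, the "?" prefix for variables and the function names are assumptions made for this illustration, and the full scan per triple pattern is only a stand-in for real data loading.

```python
# Triples listed on the slides (vertex 5 omitted: its edges are not given in the text).
TRIPLES = [
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Hypothetical query (not taken from the slides); "?" marks a variable.
QUERY = [
    ("?x", "Supervises", "?y"),
    ("?y", "WorksAt", "?u"),
    ("?u", "Name", "KIT"),
]


def is_var(term):
    return term.startswith("?")


def bind(term, value, h):
    """Extend mapping h so that term maps to value; return None on a conflict."""
    if not is_var(term):
        return h if term == value else None
    if term in h:
        return h if h[term] == value else None
    return {**h, term: value}


def matches(query, triples):
    """Return all mappings h from query variables to vertices (homomorphisms)."""
    results = [{}]
    for s, p, o in query:                  # one join per triple pattern
        extended = []
        for h in results:
            for ts, tp, to in triples:     # "data loading" as a naive full scan
                if tp != p:
                    continue
                h1 = bind(s, ts, h)
                h2 = bind(o, to, h1) if h1 is not None else None
                if h2 is not None:
                    extended.append(h2)
        results = extended
    return results


# x and y play the role of distinguished variables; u is undistinguished.
answers = sorted({(h["?x"], h["?y"]) for h in matches(QUERY, TRIPLES)})
print(answers)  # [('3', '2'), ('3', '4')]
```

Every triple pattern costs one load and one join here, which is the cost the structure index introduced later tries to reduce.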


State-of-the-art
- Data Partitioning
  - Vertical partitioning (SW-Store)
- Indexing
  - Sextuple indexing (Hexastore)
  - Materialization and indexing of entire join paths (GRIN)
- Index Implementation
  - B+ tree
  - Inverted index (Semplore)
  - Index compression (RDF-3X)
- Query processing
  - Sorted merge join based on vertical partitioning and indexing (SW-Store)
  - Join order optimization based on dynamic programming (RDF-3X)
→ A combination of different concepts makes up the state of the art!

Large Volume of RDF Data on the Web

- ~10 billion RDF triples (2009)
- Interlinked by ~10 million mappings (2009)
- Besides linked data, there are standalone ontologies, RDFa, etc.

Semi-structured RDF data on the Web
[Figure: the example RDF data graph together with a schema graph whose classes Publication, PhD Student, Institute, Post Doc and String are connected by AuthorOf, Supervises, WorksAt and Name edges.]

- RDF graphs often contain both data and schema information
- Resources are linked to a class via rdf:type
- Schema information is often incomplete, especially for Web data and RDFa data
→ RDF data might be schema-less, i.e., semi-structured

Overview of Our Approach

Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing

Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP)
• Structure-aware query processing

Benefits
• Reduction of unions & joins as well as IO cost


Structure Index for RDF data on the Web
[Figure: the example data graph next to its structure index. Extensions: B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, B6 = {5}. Index edges are labelled AuthorOf, WorksAt, Supervises and Name.]

- The structure index is a graph
- It is a structural description that is more fine-grained than a schema
- It consists of classes (extensions) and relations between them
- Resources in an extension exhibit the same structure, i.e., they cannot be distinguished by
their outgoing (forward bisimilarity) and incoming (backward bisimilarity) "edge trees"
- Bisimulation is parameterized by two sets of edge labels
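
A minimal sketch of how such extensions could be computed, assuming plain signature-based partition refinement (a standard way to compute a bisimulation, not necessarily the authors' algorithm). The triples are those spelled out in this deck, without vertex 5, whose edges are not listed. The parameters `fwd_labels` and `bwd_labels` are a hypothetical reading of the "two sets of edge labels".

```python
from collections import defaultdict

# Triples listed on the slides (vertex 5 omitted: its edges are not given in the text).
TRIPLES = [
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def structure_index(triples, fwd_labels=None, bwd_labels=None):
    """Assign every vertex a block id by iterative signature refinement.

    fwd_labels / bwd_labels restrict which edge labels are considered for
    outgoing / incoming edges (None means all labels).
    """
    vertices = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    block = {v: 0 for v in vertices}            # start with one block for everything
    while True:
        sig = defaultdict(set)
        for s, p, o in triples:
            if fwd_labels is None or p in fwd_labels:
                sig[s].add(("out", p, block[o]))
            if bwd_labels is None or p in bwd_labels:
                sig[o].add(("in", p, block[s]))
        ids, refined = {}, {}
        for v in vertices:
            # Keeping the old block in the key makes each round a refinement.
            key = (block[v], frozenset(sig[v]))
            refined[v] = ids.setdefault(key, len(ids))
        if len(ids) == len(set(block.values())):  # no block was split: stable
            return refined
        block = refined


def extensions(block):
    """Group vertices by block id."""
    groups = defaultdict(set)
    for v, b in block.items():
        groups[b].add(v)
    return {frozenset(g) for g in groups.values()}


block = structure_index(TRIPLES)
# The index graph has one edge per (block, label, block) combination in the data.
index_edges = {(block[s], p, block[o]) for s, p, o in TRIPLES}

for ext in sorted(extensions(block), key=sorted):
    print(sorted(ext))
```

On these triples the sketch yields the groups {0, 1}, {2, 4, 6}, {3, 7}, {8, 9} and {KIT, MIT}, which correspond to B2, B4, B1, B3 and B5 on the slide. Passing smaller label sets gives coarser extensions.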


Structure-based Partitioning
[Figure: the structure index (extensions B1 to B6) next to a vertically partitioned (VP) table and a structure-based partitioned (SP) table.]

VP AuthorOf table
Sub  Obj
2    0
4    0
6    1
3    0
7    1

SP B4 table
Sub  Property  Obj
2    AuthorOf  0
4    AuthorOf  0
6    AuthorOf  1
2    WorksAt   8
4    WorksAt   8
6    WorksAt   9

- Whether a graph vertex instantiates a variable of a query depends on its
structure, so vertices are physically grouped based on structural similarity
- Apply the grouping captured by the structure index to the physical organization of the data
- Create a physical group for every vertex of the structure index (i.e., every extension)
- Triples are in the same group when their subjects belong to the same extension
- Triples in an SP table not only satisfy the property of a triple pattern but also
provide some structural guarantee, e.g., that they match the entire query structure
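
The difference between the two layouts can be sketched as follows. The extension assignment and the triples are taken from the slides; the function names and the in-memory dictionaries standing in for physical tables are assumptions made for this illustration.

```python
from collections import defaultdict

# Extension of every vertex, as shown on the structure index slide.
EXTENSION_OF = {
    "3": "B1", "7": "B1", "0": "B2", "1": "B2", "8": "B3", "9": "B3",
    "2": "B4", "4": "B4", "6": "B4", "KIT": "B5", "MIT": "B5", "5": "B6",
}

# Triples listed on the slides (the edges of vertex 5 are not given in the text).
TRIPLES = [
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def vertical_partitioning(triples):
    """VP: one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def structure_based_partitioning(triples, extension_of):
    """SP: one (subject, property, object) table per extension of the subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension_of[s]].append((s, p, o))
    return dict(tables)


vp = vertical_partitioning(TRIPLES)
sp = structure_based_partitioning(TRIPLES, EXTENSION_OF)

print(vp["AuthorOf"])  # all AuthorOf pairs, regardless of who the author is
print(sp["B4"])        # only triples whose subject is in extension B4
```

The VP AuthorOf table mixes authors from B1 and B4, while the SP table for B4 holds only triples of subjects that share the same structure.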

Structure-aware Query Processing

- Proposition 1
  - A mapping of q into G exists only if a mapping of q into the
associated index graph G' also exists.
  - The extensions that match the nodes of q
contain all data graph matches.

- Two-step query processing
  - Index graph: find the extensions Ei matching q
  - Data graph: combine the data elements retrieved for the Ei


Index Graph Matching
[Figure: the example query graph (z, x, y, u, KIT) matched against the index graph (extensions B1 to B6). Two index graph matches are shown: h1 = {B1, B2, B3, B4, B5} and h2 = {B2, B3, B4, B5, B6}.]

- Retrieve index graph edges matching query edges (triple patterns)
- Join index graph edges along query edges
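
The two steps above can be sketched in Python as follows. The extension assignment and triples come from the slides, and the index graph is derived from them. The query is a hypothetical one whose edge directions are assumed (the slide text does not spell them out), and `assign` and `match_index` are illustrative names.

```python
# Extension of every vertex, as shown on the structure index slide.
EXTENSION_OF = {
    "3": "B1", "7": "B1", "0": "B2", "1": "B2", "8": "B3", "9": "B3",
    "2": "B4", "4": "B4", "6": "B4", "KIT": "B5", "MIT": "B5", "5": "B6",
}

# Triples listed on the slides (the edges of vertex 5 are not given in the text).
TRIPLES = [
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Hypothetical query; "?" marks a variable, edge directions are assumed.
QUERY = [
    ("?x", "Supervises", "?y"),
    ("?y", "AuthorOf", "?z"),
    ("?y", "WorksAt", "?u"),
    ("?u", "Name", "KIT"),
]


def index_graph(triples, extension_of):
    """One index edge per (extension, label, extension) combination in the data."""
    return {(extension_of[s], p, extension_of[o]) for s, p, o in triples}


def assign(h, term, ext, extension_of):
    """Map a query term to an extension in h; return False on a conflict."""
    if not term.startswith("?"):          # a constant must lie in that extension
        return extension_of.get(term) == ext
    return h.setdefault(term, ext) == ext


def match_index(query, index_edges, extension_of):
    """Return all mappings from query variables to extensions."""
    matches = [{}]
    for s, p, o in query:
        # Step 1: retrieve the index edges matching this triple pattern.
        edges = [(bs, bo) for bs, bp, bo in index_edges if bp == p]
        # Step 2: join them with the partial matches along the query edge.
        joined = []
        for h in matches:
            for bs, bo in edges:
                h2 = dict(h)
                if assign(h2, s, bs, extension_of) and assign(h2, o, bo, extension_of):
                    joined.append(h2)
        matches = joined
    return matches


index_matches = match_index(QUERY, index_graph(TRIPLES, EXTENSION_OF), EXTENSION_OF)
print(index_matches)
```

For this query the only index match maps x to B1, y to B4, z to B2 and u to B3, with the constant KIT lying in B5. The join runs over six index edges instead of fifteen data triples.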

Query Pruning

- Proposition 2
  - If a query is tree-shaped and, apart from the root, contains only
undistinguished variables, then its matches on
the structure index contain all and only the data graph
matches.

- The data elements contained in the extensions matching the
query root node are exactly the final query answers
- For such queries, no further processing is needed
- For more general queries, tree-shaped query parts can be
pruned away
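
One way to picture the pruning step is the following simplified sketch. It repeatedly drops triple patterns that attach an undistinguished variable occurring nowhere else in the query. It assumes that the remaining variables are afterwards restricted to their matched extensions, and it never prunes constants. The query and the `prune` function are hypothetical and only approximate the procedure described on the slide.

```python
def prune(query, distinguished):
    """Drop patterns that hang an undistinguished leaf variable off the query.

    query: list of (subject, predicate, object); variables start with "?".
    distinguished: set of variables that must appear in the answer.
    Returns the patterns that still need data-level processing.
    """
    patterns = list(query)
    removed = True
    while removed:
        removed = False
        # Count in how many pattern positions each term occurs.
        degree = {}
        for s, _, o in patterns:
            for term in (s, o):
                degree[term] = degree.get(term, 0) + 1
        for pattern in patterns:
            s, _, o = pattern
            is_leaf = any(
                term.startswith("?") and term not in distinguished and degree[term] == 1
                for term in (s, o)
            )
            if is_leaf:
                patterns.remove(pattern)
                removed = True
                break                      # recompute degrees after each removal
    return patterns


# Hypothetical query; edge directions are assumed for illustration.
QUERY = [
    ("?x", "Supervises", "?y"),
    ("?y", "AuthorOf", "?z"),
    ("?y", "WorksAt", "?u"),
    ("?u", "Name", "KIT"),
]

print(prune(QUERY, {"?x", "?y"}))
# A tree-shaped query whose only distinguished variable is the root is pruned away
# completely: its answers are read directly from the extensions matched to ?x.
print(prune([("?x", "Supervises", "?y"), ("?y", "AuthorOf", "?z")], {"?x"}))
```

In the first call only the AuthorOf pattern is removed, because z is an undistinguished leaf. The Name pattern stays, since an extension such as {KIT, MIT} does not guarantee a match of the constant KIT.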

Query Pruning
[Figure: the example query matched against the index graph, with h1 = {B1, B2, B3, B4, B5}.]

- Elements in the matched extensions are known to satisfy the query structure
- Elements in B4 are already known to be authors of some z
- No further data processing is needed for this part

Data Graph Matching
[Figure: triples retrieved from the matching extensions. B1: 3 WorksAt 8, 7 WorksAt 9, 3 Supervises 2, 3 Supervises 4, 7 Supervises 6, ... B3: 8 Name KIT, 9 Name MIT. B4: 2 WorksAt 8, 4 WorksAt 8, 6 WorksAt 9. One resulting data graph match is h'1 = {3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT}.]

- Retrieve triples from the matching extensions and join them along query edges
- Match class processing: group index graph matches into match classes to
avoid redundant processing of matches that partially overlap
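
The retrieval and join step can be sketched as follows, assuming structure-based partitioned tables held in memory. The index match and the already pruned query are hypothetical inputs, and the function names are illustrative. Match class processing is not covered by this sketch.

```python
from collections import defaultdict

# Extension of every vertex, as shown on the structure index slide.
EXTENSION_OF = {
    "3": "B1", "7": "B1", "0": "B2", "1": "B2", "8": "B3", "9": "B3",
    "2": "B4", "4": "B4", "6": "B4", "KIT": "B5", "MIT": "B5", "5": "B6",
}

# Triples listed on the slides (the edges of vertex 5 are not given in the text).
TRIPLES = [
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]

# Hypothetical inputs: one index graph match and the query left after pruning.
INDEX_MATCH = {"?x": "B1", "?y": "B4", "?u": "B3"}
QUERY = [("?x", "Supervises", "?y"), ("?y", "WorksAt", "?u"), ("?u", "Name", "KIT")]


def is_var(term):
    return term.startswith("?")


def sp_tables(triples, extension_of):
    """One table per extension, holding the triples of its subjects."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[extension_of[s]].append((s, p, o))
    return tables


def bind(h, term, value):
    """Bind a query term to a data vertex in h; return False on a conflict."""
    if not is_var(term):
        return term == value
    return h.setdefault(term, value) == value


def data_matches(query, index_match, tables, extension_of):
    results = [{}]
    for s, p, o in query:
        # Load only the SP table of the extension matched to the subject and
        # keep only objects that fall into the extension matched to the object.
        ext_s = index_match[s] if is_var(s) else extension_of[s]
        loaded = [
            (ts, to) for ts, tp, to in tables.get(ext_s, [])
            if tp == p and (not is_var(o) or extension_of[to] == index_match[o])
        ]
        joined = []
        for h in results:                  # join along the query edge
            for ts, to in loaded:
                h2 = dict(h)
                if bind(h2, s, ts) and bind(h2, o, to):
                    joined.append(h2)
        results = joined
    return results


tables = sp_tables(TRIPLES, EXTENSION_OF)
answers = sorted({(h["?x"], h["?y"])
                  for h in data_matches(QUERY, INDEX_MATCH, tables, EXTENSION_OF)})
print(answers)  # [('3', '2'), ('3', '4')]
```

Filtering objects by their matched extension matters once patterns have been pruned: it keeps out vertices that would not satisfy the removed part of the query.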

Evaluation

- DBLP and several synthetic datasets created using the
Lehigh University Benchmark (LUBM)
- 30 queries categorized into five classes

Example queries for the five classes:

Single-atom query (QDBLP1)
  SELECT ?x
  type(x, Person)

Star query (QDBLP12)
  SELECT ?x, ?n
  type(x, Person)
  name(x, n)
  editor(y, x)
  author(z, x)
  cites(u, z)

Entity query (QLUBM9)
  SELECT ?x ?m
  emailAddress(x, fp@edu)
  res.Interest(x, research24)
  telephone(x, xxx-xxx-xxxx)

Path query (QLUBM6)
  SELECT ?x ?y
  takesCourse(x, y)
  teacherOf(z, y)
  type(z, FullProfessor)

Graph-shaped query (QLUBM15)
  SELECT ?x ?a
  teacherOf(FullProfessor5, y)
  takesCourse(x, y)
  publicationAuthor(b, x)
  name(b, Publication7)
  memberOf(x, z)
  memberOf(a, z)
  advisor(x, a)
  telephone(a, xxx-xxx-xxxx)

Evaluation – Performance
[Figure: two log-scale charts for queries q1 to q15. First chart: total time in ms on DBLP for SP and VP, including the mean. Second chart: time of the separate steps in ms (idx match, load(VP-SP), join(VP-SP)) and the number of removed (pruned) query nodes.]

- Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
  - Total query processing times
  - Times of the individual steps involved
- SP is slightly slower for simple queries (q1-q3)
- SP is 8-9 times faster for complex queries (q4-q15)
- With more complex queries, the overhead incurred by answer space
matching can be outweighed by the accumulated gain for load and join

Conclusions

- A structure index that can deal with general graph-structured
RDF data on the Web
- The structure index can be leveraged for dealing with
semi-structured data on the Web
- The structure index can be used for RDF data
partitioning and query processing, allowing complex
queries to be processed many times faster (8-9 times in the evaluation)
- Future work
  - Adopt existing concepts from XML data management for
structure index optimization and updates
  - Query optimization for structure-aware query processing

Thank you for your attention!

Structure Index for RDF Data on the Web
Duc Thanh Tran, AIFB Institute, KIT
E-Mail: ducthanh.tran@kit.edu
Web: http://sites.google.com/site/kimducthanh


State-of-the-art
- Data Partitioning
  - Big table (old versions of Oracle, Jena, Sesame)
  - Property tables (Jena)
  - Vertical partitioning (SW-Store)
- Indexing
  - Multiple indexing (YARS)
  - Sextuple indexing (Hexastore)
  - Materialization and indexing of entire join paths (GRIN)
- Index Implementation
  - B+ tree
  - Inverted index (Semplore)
  - Index compression (RDF-3X)
- Query processing
  - Sorted merge join based on vertical partitioning and indexing (SW-Store)
  - Join order optimization based on dynamic programming (RDF-3X)
→ A combination of different concepts makes up the state of the art!

Overview of Our Approach
Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing

Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP): triples with the same structure are grouped
• Structure-aware query processing
  • Use the structure index to focus on data that satisfy the overall query structure
  • Then retrieve data from the corresponding structure-based partitioned tables

Benefits
• Targets data partitioning & query processing, i.e., it is complementary to other concepts
• Reduction of unions & joins as well as IO cost

Evaluation – Scalability
[Figure: two charts over DBLP and LUBM datasets of increasing size (LUBM1 to LUBM50): query times in ms, and processing times in ms of the separate steps (SQP idx match, load (VPQP-SQP), join (VPQP-SQP)).]

- Measured the average query performance for LUBM with varying size
- Times increase with the size of the data
- The gain for load and join increases in larger proportion than the overhead
incurred for index matching
- Match performance is determined by the size of the index graph
  - Its size depends on the structure but not on the size of the data graph
  - Match time does not necessarily increase when the data becomes larger
- The positive effect of data filtering (IO reduction) and query pruning (load and
join) correlates with the data size
