The increasing number of RDF data sources that allow for querying Linked Data via Web services forms the basis for federated SPARQL query processing. Federated SPARQL query engines provide a unified view of a federation of RDF data sources, and rely on source descriptions for selecting the data sources over which unified queries will be executed. Albeit efficient, existing federated SPARQL query engines usually ignore the meaning of the data accessible from a data source, and describe sources only in terms of the vocabularies used in the data source. The lack of richer source descriptions may lead to the erroneous selection of data sources for a query, thus affecting the performance of query processing over the federation. We tackle the problem of federated SPARQL query processing and devise MULDER, a query engine for federations of RDF data sources. MULDER describes data sources in terms of RDF molecule templates, i.e., abstract descriptions of entities belonging to the same RDF class. Moreover, MULDER uses RDF molecule templates for source selection, and for query decomposition and optimization. We empirically study the performance of MULDER on existing benchmarks, and compare it with state-of-the-art federated SPARQL query engines. Experimental results suggest that RDF molecule templates empower MULDER federated query processing, and allow for the selection of RDF data sources that not only reduce execution time, but also increase answer completeness.
MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates
Kemele M. Endris, Mikhail Galkin, I. Lytra, M. Mami, M.E. Vidal, Sören Auer
DEXA 2017 - August 28-31, 2017
Lyon, France
MULDER: Architecture
● MULDER relies on RDF molecule templates for source selection, and query decomposition and optimization
MULDER: Source Description Model
• RDF Molecule Templates (RDF-MTs)
  • describe the set of properties shared by RDF molecules of the same type
  • RDF molecule: a set of triples that share the same subject
[Figure: example RDF-MTs for dbo:Fictional_Character, dbo:Person, dbo:City, and geo:Feature, with predicates such as dbo:occupation, dbo:series, dbo:portrayer, dbo:birthplace, and geo:population, and links via owl:sameAs]
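The two definitions above can be illustrated with a minimal Python sketch. It assumes triples are plain `(subject, predicate, object)` tuples and that a molecule's type comes from `rdf:type`; the function names and the `ex:` toy data are made up for illustration and are not taken from the MULDER code base.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def group_molecules(triples):
    """Group (s, p, o) triples into RDF molecules, i.e. by shared subject."""
    molecules = defaultdict(list)
    for s, p, o in triples:
        molecules[s].append((p, o))
    return dict(molecules)

def extract_templates(triples):
    """Map each rdf:type value to the set of predicates used by its molecules."""
    templates = defaultdict(set)
    for pairs in group_molecules(triples).values():
        types = [o for p, o in pairs if p == RDF_TYPE]
        predicates = {p for p, _ in pairs if p != RDF_TYPE}
        for rdf_class in types:
            templates[rdf_class] |= predicates
    return dict(templates)

# Toy data (hypothetical): two molecules of the same class.
triples = [
    ("ex:Mulder", "rdf:type", "dbo:Fictional_Character"),
    ("ex:Mulder", "dbo:series", "ex:The_X-Files"),
    ("ex:Mulder", "dbo:portrayer", "ex:David_Duchovny"),
    ("ex:Scully", "rdf:type", "dbo:Fictional_Character"),
    ("ex:Scully", "dbo:occupation", "ex:Physician"),
]
molecules = group_molecules(triples)
templates = extract_templates(triples)
```

In this sketch the template for `dbo:Fictional_Character` is the union of the predicates seen across both molecules; links between templates are left out.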
RDF Molecule Templates
• RDF-MTs define a community of RDF molecules that share the same characteristics, e.g., having the same rdf:type or wikidata:P31 (instance of)
• RDF-MTs are defined in terms of:
  • Web service API interfaces
  • Type of molecule
  • Set of predicates
  • Intra- and inter-dataset links between RDF-MTs
MULDER: Decomposition & Source Selection
• MULDER creates a query decomposition with service graph patterns (SGPs) of star-shaped subqueries built according to RDF-MTs
  • A star-shaped subquery (SSQ) is a set of triple patterns that share the same subject (variable or constant)
  • Goal: minimize execution time and maximize answer completeness by selecting only relevant sources
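The SSQ definition above amounts to grouping triple patterns by subject. A minimal sketch, assuming triple patterns are `(subject, predicate, object)` tuples with variables written as `?name`; the function name and the example query are illustrative only:

```python
from collections import defaultdict

def decompose_into_stars(triple_patterns):
    """Group triple patterns by subject; each group is one star-shaped subquery."""
    stars = defaultdict(list)
    for s, p, o in triple_patterns:
        stars[s].append((s, p, o))
    return dict(stars)

# Toy basic graph pattern (hypothetical query).
bgp = [
    ("?person", "dbo:occupation", "?job"),
    ("?person", "dbo:birthplace", "?city"),
    ("?city", "owl:sameAs", "?feature"),
    ("?feature", "geo:population", "?pop"),
]
stars = decompose_into_stars(bgp)
```

Here the four triple patterns yield three SSQs, rooted at `?person`, `?city`, and `?feature`.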
Source Selection
• The sources associated with each SSQ are determined from RDF-MT metadata
• An SSQ that matches more than one RDF-MT in the same dataset is sent to that dataset's single service endpoint
• An SSQ that has matching RDF-MTs in more than one dataset is sent to the service endpoint of each of those datasets
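These rules can be sketched as follows. The sketch assumes, as a simplification, that an SSQ matches an RDF-MT when all of the SSQ's predicates appear in the template; the catalog layout, the endpoint URLs, and the predicate assignments are hypothetical toy values loosely modeled on the example RDF-MTs.

```python
# Toy catalog: endpoint URL -> {RDF-MT type: set of predicates}.
# URLs and predicate sets are placeholders, not real metadata.
CATALOG = {
    "http://example.org/dbpedia/sparql": {
        "dbo:Fictional_Character": {"dbo:occupation", "dbo:series", "dbo:portrayer"},
        "dbo:Person": {"dbo:occupation", "dbo:series", "dbo:portrayer", "owl:sameAs"},
        "dbo:City": {"owl:sameAs", "dbo:birthplace"},
    },
    "http://example.org/geonames/sparql": {
        "geo:Feature": {"geo:population", "owl:sameAs"},
    },
}

def select_sources(ssq_predicates, catalog):
    """Return {endpoint: matching RDF-MT types} for one star-shaped subquery.

    Each endpoint appears at most once, however many of its RDF-MTs match.
    """
    needed = set(ssq_predicates)
    selected = {}
    for endpoint, templates in catalog.items():
        matches = sorted(t for t, preds in templates.items() if needed <= preds)
        if matches:
            selected[endpoint] = matches
    return selected

same_dataset = select_sources({"dbo:occupation", "dbo:series"}, CATALOG)
two_datasets = select_sources({"owl:sameAs"}, CATALOG)
```

The first call matches two RDF-MTs in one dataset and so selects one endpoint; the second matches RDF-MTs in both datasets and selects both endpoints.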
Query Planning
• MULDER implements a greedy heuristic-based approach to generate a bushy plan
  • leaves correspond to SSQs
  • the number of joins between SSQs is maximized while the plan height is minimized
[Figure: bushy plan with three JOIN operators over four SSQ leaves (DBpedia, DBpedia, Geonames, DBpedia) covering triple patterns t1 to t7]
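One way to picture a greedy bushy planner is the level-by-level pairing below. This is a generic sketch under stated assumptions, not MULDER's actual heuristic: leaves are `(label, variables)` pairs, two subplans are joinable when they share a variable, and all names are hypothetical.

```python
def build_bushy_plan(leaves):
    """Pair joinable subplans level by level to keep the plan tree low.

    leaves: non-empty list of (label, set_of_variables).
    Returns a nested tuple such as ("JOIN", left, right).
    """
    level = list(leaves)
    while len(level) > 1:
        next_level, pending = [], list(level)
        while pending:
            plan, plan_vars = pending.pop(0)
            # Greedy choice: first remaining subplan that shares a variable.
            partner = next(
                (i for i, (_, v) in enumerate(pending) if v & plan_vars), None
            )
            if partner is None:
                next_level.append((plan, plan_vars))
            else:
                other, other_vars = pending.pop(partner)
                next_level.append((("JOIN", plan, other), plan_vars | other_vars))
        if len(next_level) == len(level):
            # Nothing was joinable: combine the first two as a cross product.
            (a, a_vars), (b, b_vars) = next_level[0], next_level[1]
            next_level = [(("CROSS", a, b), a_vars | b_vars)] + next_level[2:]
        level = next_level
    return level[0][0]

def height(plan):
    """Height of a plan tree; a leaf label has height 0."""
    if isinstance(plan, str):
        return 0
    return 1 + max(height(plan[1]), height(plan[2]))

leaves = [
    ("SSQ1@DBpedia", {"?p", "?c"}),
    ("SSQ2@DBpedia", {"?c", "?f"}),
    ("SSQ3@Geonames", {"?f", "?n"}),
    ("SSQ4@DBpedia", {"?n", "?x"}),
]
plan = build_bushy_plan(leaves)
```

With four leaves the sketch produces a balanced tree of height 2 with three joins, whereas a left-linear plan over the same leaves would have height 3.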
Experimental Study
• Research Questions:
  RQ1) Do different source descriptions have an impact on query processing in terms of efficiency and effectiveness?
  RQ2) Are RDF-MT based query processing techniques able to improve query execution time and answer completeness?
Experimental Setup
• Metrics
  • Execution time: elapsed time between the submission of a query to an engine and the delivery of the answers (timeout: 300 sec)
  • Cardinality: number of answers returned
  • Completeness: percentage of query results w.r.t. the answers obtained from the union of all datasets
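The completeness metric can be written down as a short function. This is a sketch under assumptions the slides do not state: answers are compared as sets (duplicates ignored), and an empty reference result counts as 100% complete.

```python
def completeness(engine_answers, reference_answers):
    """Percentage of reference answers that the engine also returned."""
    reference = set(reference_answers)
    if not reference:
        return 100.0  # assumption: nothing to find means nothing is missing
    found = set(engine_answers) & reference
    return 100.0 * len(found) / len(reference)

# Toy answers (hypothetical): the engine finds 2 of the 4 reference answers.
reference = [("a",), ("b",), ("c",), ("d",)]
engine = [("a",), ("b",)]
score = completeness(engine, reference)
```

Cardinality in this setting is simply `len(engine)`, and execution time would be measured around the call that submits the query.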
Experiment I
➢ Goal: assess the query performance of MULDER with different source descriptions: RDF-MT, METIS, and SemEP based source descriptions
• METIS and SemEP are community detection algorithms used for graph partitioning
  • METIS and SemEP based molecule templates are composed of predicates with similar co-occurrence values
  • Each predicate is assigned to only one community
Experiment I: Benchmark
• BSBM [1]: Berlin SPARQL Benchmark
  • Built around an e-commerce use case where a set of products is offered by different vendors and consumers have posted reviews about products
  • Supports the creation of arbitrarily large datasets
  • Eight RDF classes
    • Product, ProductType, ProductFeature, Vendor, Person, Review, Publisher, and Offer
• Data generated:
  • 200M triples
• Queries:
  • 12 queries, with 20 query mixes
[1] http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/
Results: Exp I
Performance of MULDER source descriptions
★ RDF-MT based source descriptions allow MULDER to identify query decompositions and plans that speed up query processing, while answer completeness is not affected (RQ1)
○ RDF-MTs help MULDER reduce intermediate results by selecting only relevant sources
Experiment II
➢ Goal: compare MULDER with state-of-the-art federated query engines: FedX and ANAPSID
• FedX:
  • Sends ASK queries to discover the structure of data sources at query time
  • Implements blocking join operators, i.e., results are delivered after all data has been received from the sources
• ANAPSID:
  • Sources are described in terms of the set of RDF properties in each data source
  • Source descriptions are computed beforehand
  • Implements non-blocking operators, i.e., results are delivered as soon as they arrive from the sources
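The difference between blocking and non-blocking operators can be illustrated with a textbook symmetric hash join, which emits a joined row as soon as both matching inputs have arrived. This is a generic sketch, not the operator implemented by ANAPSID or FedX; the function name, the `(side, row)` arrival format, and the toy rows are assumptions.

```python
from collections import defaultdict

def symmetric_hash_join(arrivals, key):
    """Non-blocking join: yield each result as soon as its two inputs have arrived.

    arrivals: iterable of (side, row), side is "L" or "R", row is a dict.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, row in arrivals:
        other = "R" if side == "L" else "L"
        k = row[key]
        tables[side][k].append(row)
        # Probe the rows already received from the opposite source.
        for match in tables[other].get(k, []):
            left, right = (row, match) if side == "L" else (match, row)
            yield {**left, **right}

# Toy interleaved arrivals from two sources (hypothetical data).
arrivals = [
    ("L", {"?c": 1, "?p": "a"}),
    ("R", {"?c": 1, "?f": "x"}),
    ("R", {"?c": 2, "?f": "y"}),
    ("L", {"?c": 2, "?p": "b"}),
]
stream = symmetric_hash_join(iter(arrivals), "?c")
first = next(stream)   # available after only two arrivals
rest = list(stream)
```

A blocking operator would instead consume all of `arrivals` before producing `first`, which is why it delays the first answer until every source has finished.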
Experiment II: Benchmarks
• Setup I:
  • Data: BSBM 200M triples
  • Query: 12 queries, with 20 query mixes
• Setup II:
  • Data: FedBench - 10 datasets, in two collections:
    • Cross-domain: LinkedMDB, DBpedia, GeoNames, NYTimes, SWDF, Jamendo
    • Life Science: Drugbank, DBpedia, KEGG, ChEBI
  • Query: 35 queries
    • 25 FedBench queries, including Cross Domain (CD), Linked Data (LD), and Life Science (LS) queries
    • 10 complex (C) queries over FedBench datasets (M. Vidal et al.)
Results: Exp II-I
● Performance of MULDER compared to other federated engines on a synthetic dataset (BSBM)
● ANAPSID returns query answers fast, but at the cost of completeness
● FedX is slower than MULDER and ANAPSID
★ By utilizing RDF-MTs, MULDER identifies decompositions and plans that reduce execution time and increase answer completeness compared to FedX and ANAPSID! (RQ2)
Results: Exp II-II
● Performance of MULDER compared to other federated engines on the FedBench datasets
  ○ Answer completeness and execution time
● Quadrants:
  ○ I and III: indicate bad performance and incomplete results
  ○ II: (almost) complete results but slower execution time
  ○ IV: indicates the best execution time and (almost) complete results
★ MULDER with RDF-MTs performs better in terms of execution time and answer completeness compared to FedX and ANAPSID! (RQ2)
Conclusion
• MULDER is a query engine for federated access to SPARQL endpoints:
  • uses RDF-MTs to describe data source interfaces
  • RDF-MTs enable MULDER decomposition and planning methods to identify more efficient and effective query plans than METIS and SemEP based source descriptions (RQ1)
• MULDER significantly reduces query execution time and increases answer completeness compared to FedX and ANAPSID, by selecting relevant sources and creating better execution plans (RQ2)
Future Work
• Integrate additional Web access interfaces, such as TPFs and RESTful APIs
• Empower RDF-MTs with additional information, such as link selectivity, statistics, etc.
Thank you!!
Follow me @KemeleM
endris@cs.uni-bonn.de
kemele.endris@gmail.com
University of Bonn,
Fraunhofer IAIS
Germany
Full experimental data:
https://github.com/EIS-Bonn/MULDER