Federated SPARQL query processing over the Web of Data

Federated SPARQL Query Processing
Over the Web of Data
Muhammad Saleem, Axel-Cyrille Ngonga
Ngomo
Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig,
Germany, 25/11/2014

Agenda
• SPARQL Query Federation Approaches
• SPARQL Query Federation Optimization
– Query Rewriting
– Source Selection
– Data Integration Options
– Join Order Selection
– Join Order Optimization
– Join Implementations
• Performance Metrics and Discussion

SPARQL Query Federation Approaches
• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Distributed Hash Tables (DHTs)
• Hybrid of SEF+LDF

SPARQL Endpoint Federation Approaches
• Most commonly used approaches
• Make use of SPARQL endpoints URLs
• Fast query execution
• RDF data needs to be exposed via SPARQL
endpoints
• E.g., HiBISCus, FedX, SPLENDID, ANAPSID, LHD etc.

Linked Data Federation Approaches
• Data needs not be exposed via SPARQL endpoints
• Uses URI lookups at runtime
• Data should follow Linked Data principles
• Slower as compared to previous approaches
• E.g., LDQPS, SIHJoin, WoDQA etc.

Query federation on top of Distributed Hash Tables
• Uses DHT indexing to federate SPARQL queries
• Space efficient
• Cannot deal with whole LOD
• E.g., ATLAS

Hybrid of SEF+LDF
• Federation over SPARQL endpoints and Linked
Data
• Can potentially deal with whole LOD
• E.g., ADERIS-Hybrid

SPARQL Endpoint Federation
Parsing/Rewriting
Source Selection
Federator Optimzer
Integrator
S1 S2 S3 S4
RDF RDF RDF RDF
Rewrite query
and get Individual
Triple Patterns
Identify capable
source against
Individual Triple
Patterns
Generate
optimized sub-query
Exe. Plan
Execute sub-queries
Integrate sub-queries
results

SPARQL Query Rewriting
FedBench (LD3): Return for all US presidents their party
membership and news pages about them.
SELECT ?president ?party ?page
WHERE {
?president rdf:type dbpedia:President .
?president dbpedia:nationality ?nationality.
?president dbpedia:party ?party .
?x nyt:topicPage ?page .
?x owl:sameAs ?president .
Filter (?nationality = dbpedia:United_States )
}
WHERE {
?president dbpedia:nationality dbpedia:United_States .
}
Try to simplify/avoid SPARQL FILTER and REGEX expressions

Source Selection
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Source Selection Algorithm
Triple pattern-wise source selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
TP5 = S1 S2 S4-S9
Total triple pattern-wise sources selected =
Jamendo
RDF
TP2 = S1
1+1+1+1+8 => 12
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Types of Source Selection
• Index-free
– Using SPARQL ASK queries
– No index maintenance required
– Potentially ensures result set completeness
– SPARQL ASK queries can be expensive
– Can make use of the cache to store recent SPARQL ASK queries results
– E.g., FedX
• Index-only
– Only make use of Index/data summaries
– Less efficient but fast source selection
– Result set completeness is not ensured
– E.g., DARQ, LHD
• Hybrid
– Make use of index+SPARQL ASK
– Most efficient
– Result set completeness is not ensured
– Can make use of the cache to store recent SPARQL ASK queries results
– E.g., HiBISCuS, ANAPSID, SPLENDID

Index-free Source Selection
Input: SPARQL query Q , set of all data sources D
Output: Triple pattern to relevant data sources map M
for each triple pattern ti in SPARQL query Q
Ri = {}; // set of relevant data sources for triple pattern ti
for each data source di in D
if SPARQL ASK(di , ti) = true
Ri = Ri U {di};
end if
end for
M = M U {Ri};
end for
return M What is the total number of SPARQL ASK requests used?
total number of triple patterns * total number of data sources

Index-free
Source Selection
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-free
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-free
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-free
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-free
Source Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
TP5 = S1 S2
S4-S9
Total number of SPARQL ASK requests used = 45
Total triple pattern-wise sources selected = 12
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Index-only Source Selection (LHD)
Input: SPARQL query Q , set of all data sources D, data sources index I storing all distinct predicates for
all data sources in D
p = Pred(ti) // predicate of ti
if (bound (p))
Ri = Lookup (I, p) // index lookup for predicate of ti
else
Ri = D ; // all data sources are relevant
end if
M = M U {Ri} ;
end for
return M Why it is the less efficient approach (i.e., greatly overestimate relevant data sources)?
• Source selection is only based on predicate of triple patterns
• Simply select all data sources for triple patterns having unbound predicates

Index-only
Source Selection
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1-S9
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-only
Source Selection
TP1 =
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
S1-S9 TP2 = S1
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-only
Source Selection
TP1 =
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
S1-S9
TP3 = S1
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-only
Source Selection
TP1 =
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
S1-S9
TP3 = S1 TP4 = S4
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Index-only
Source Selection
TP1 =
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
S1-S9
TP3 = S1 TP4 = S4
TP5 = S1 S2 S4-S9
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Hybrid Source Selection
Input: SPARQL query Q , set of all data sources D, data sources index I storing all distinct predicates for all data
sources in D
s = Subj(ti) , p = Pred(ti) , o = Obj(ti) ; // subject, predicate, and object of ti
if (!bound (p) || bound (s) || bound (o) )
for each data source di in D
if SPARQL ASK(di , ti) = true
Ri = RiU {di};
end if
end for
else
Ri = Lookup (I, p) // index lookup for predicate of ti
end if
M = M U {Ri}
end for
return M
What is the total number of SPARQL ASK requests used?
total number of triple patterns with bound subject or bound object
or unbound predicate * total number of data sources

Hybrid Source
Selection
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Hybrid Source
Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Hybrid Source
Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
Hybrid Source
Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

WHERE {
}
Anything still needs
to be improved?
dbpedia
RDF
//TP3
//TP4
//TP5
Hybrid Source
Selection
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
TP5 = S1 S2
S4-S9
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Source Selection
• Triple pattern-wise source selection
– Ensures 100% recall
– Can over-estimate capable sources
– Can be expensive, e.g., total number of SPARQL ASK
requests used
– Performed by FedX, SPLENDID, LHD, DARQ, ADERIS etc.
• Join-aware triple-pattern wise source selection
– Ensures 100% recall
– May selects optimal/close to optimal capable sources
– Can be expensive, e.g., total number of SPARQL ASK
requests used
– Can significantly reduce the query execution time
– Performed by ANAPSID, HiBISCuS

HiBISCuS: Hypergraph-Based Source Selection for
SPARQL Endpoint Federation
• Hybrid source selection
• Join-aware triple-pattern wise source selection
• Makes use of the hypergraph representation of
SPARQL queries
• Makes use of the URI authorities
• Makes use of the cache to store recent SPARQL
ASK queries results

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
Jamendo
RDF
TP2 = S1
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
TP5 = S1 S2 S4 S5
Jamendo
RDF
TP2 = S1
S6 S7 S8 S9
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
TP3 = S1 TP4 = S4
TP5 = S1 S2 S4 S5
Total triple pattern-wise selected sources = 12
Total SPARQL ASK queries : 9*5 = 45
Jamendo
RDF
TP2 = S1
S6 S7 S8 S9
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Motivation
WHERE {
}
dbpedia
RDF
//TP3
//TP4
//TP5
TP1 = S1
TP3 = S1
TP2 = S1
TP4 = S4
TP5 = S1 S2 S4 S5
S6 S7 S8 S9
Optimal triple pattern-wise selected sources 5
KEGG
RDF
ChEBI
RDF
NYT
RDF
//TP1
SWDF
RDF
//TP2
LMDB
RDF
Jamendo
RDF
Geo
Names
RDF
DrugBank
RDF
S1 S2 S3 S4 S5 S6 S7 S8 S9

Problem Statement
• An overestimation of triple pattern-wise source selection can
be expensive
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• How do we perform join-aware triple pattern wise source
selection in time efficient way?

HiBISCuS: Key Concept
• Makes use of the URI’s authorities
http://dbpedia.org/ontology/party
Scheme Authority Path
For URI details: http://tools.ietf.org/html/rfc3986

HiBISCuS: SPARQL Query as Hypergraph
WHERE {
}
?president
rdf:type
dbpedia:
President

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_S
tates
dbpedia:
nationality

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_S
tates
dbpedia:
party
dbpedia:
nationality
?party

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_S
tates
dbpedia:
party
dbpedia:
nationality
?party
?x
nyt:topi
cPage
?page

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_S
tates
dbpedia:
party
dbpedia:
nationality
?party
?x
nyt:topi
cPage
?page
owl:
SameAs

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_S
tates
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party
?party
nyt:topi
cPage
?page
Star simple hybrid Tail of hyperedge

HiBISCuS: Data Summaries
[] a ds:Service ;
ds:endpointUrl <http://dbpedia.org/sparql> ;
ds:capability [
ds:predicate dbpedia:party ;
ds:sbjAuthority <http://dbpedia.org/> ;
ds:objAuthority <http://dbpedia.org/> ;
] ;
ds:capability [
ds:predicate rdf:type ;
ds:objAuthority owl:Thing, dbpedia:President; #we store all distinct
classes
] ;
ds:capability [
ds:predicate dbpedia:postalCode ;
#No objAuthority as the object value for dbpedia:postalCode is string
] ;

HiBISCuS: Triple Pattern-wise Source Selection
WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
dbpedia KEGG NYT SWDF LMDB Geo DrgBnk Jamendo

HiBISCuS: Triple Pattern-wise Source Pruning
WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
dbpedia KEGG NYT SWDF
DrgBnk LMDB Geo Jamendo
Obj.
auth.
dbpedia
Sbj. auth.
Sbj. auth.
KEGG
Sbj. auth.
NYT
Sbj. auth.
SWDF
Sbj. auth.
LMDB
Sbj. auth.
Geo
Sbj. auth.
DrgBnk
Sbj. auth.
Jamendo

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
dbpedia
Sbj. auth.
Sbj. auth.
KEGG
Sbj. auth.
NYT
Sbj. auth.
SWDF
Sbj. auth.
LMDB
Sbj. auth.
Geo
Sbj. auth.
DrgBnk
Sbj. auth.
Jamendo
Obj.
auth.

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
Obj.
auth.

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
NYT
Obj. auth.

WHERE {
}
?president
rdf:type
dbpedia:
President
dbpedia:
United_
States
dbpedia:
nationality
?x
owl:
SameAs
dbpedia:
party ?party
nyt:topi
cPage
?page
Total triple pattern-wise selected sources = 5
Total SPARQL ASK queries : 0

Complete Local Integration
• Triple patterns are individually and completely
evaluated against every endpoint
• Triple pattern results are locally integrated using
different join techniques, e.g., NLJ, Hash Join etc.
• Less efficient if query contains common
predicates such rdf:type and owl:sameAs
• Large amount of potentially irrelevant
intermediate results retrieval

Iterative Integration
• Evaluate query iteratively pattern by pattern
• Start with a single triple pattern
• Substitute mappings from previous triple pattern
in the subsequent evaluation
• Evaluate query in a NLJ fashion
• NLJ can cause many remote requests
• Block NLJ fashion minimize the remote requests

Join Order Selection
• Left-deep trees
– Joins take place in a left-to-right sequential order
– Result of the join is used as an outer input for the next join
– Used in FedX, DARQ
• Right-deep trees
– Joins take place in a right-to-left sequential order
– Result of the join is used as an inner input for the next join
• Bushy trees
– Joins take place in sub-tress both on left and right sides
– Used in ANAPSID
• Dynamic programming
– Used in SPLENDID

Join Order Selection Example
Compute Micronutrients using Drugbank and KEGG
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-cat:micronutrient. // TP1
?drug drugbank:casRegistryNumber ?id . // TP2
?keggDrug rdf:type kegg:Drug . // TP3
?keggDrug bio2rdf:xRef ?id . // TP4
?keggDrug dc:title ?title . // TP5
}
67
휋 ? 푑푟푢푔, ? 푡푖푡푙푒
TP1 TP2
TP3
TP4
TP5
Left-deep tree
TP1 TP2
TP3
TP4
TP5
Right-deep tree
Bushy tree
TP1 TP2
TP3 TP5
TP4
Goal: Execute smallest cardinality joins first

Join Order Optimization
• Exclusive Groups
– Group triple patterns with the same relevant data source
– Evaluation in a single (remote) sub-query
– Push join to the data source, i.e., endpoint
• Variable count-heuristic
– Iteratively determine the join order based on free variables
count of triple patterns and groups
– Consider “resolved ” variable mappings from earlier iteration
• Using Selectivities
– Store distinct predicates, avg. subject selectivities , and avg.
object selectivities for each predicate in index
– Use the predicate count, avg. subject selectivities , and avg.
object selectivities to estimate the join cardinality

Exclusive Groups
SELECT ?President ?Party ?TopicPage WHERE {
?President rdf:type dbpedia-yago:PresidentsOfTheUnitedStates .
?President dbpedia:party ?Party .
?nytPresident owl:sameAs ?President .
?nytPresident nytimes:topicPage ?TopicPage .
}
Source Selection
@ DBpedia
@ DBpedia
@ DBpedia, NYTimes
@ NYTimes
Exclusive Group
Advantage:
Delegate joins to the endpoint by forming exclusive groups (i.e. executing the
respective patterns in a single subquery)
70

Exclusive Groups Join Order Optimization
2 Unoptimized Internal Representation
1 SPARQL Query
Compute Micronutrients using Drugbank and KEGG
SELECT ?drug ?title WHERE {
?drug drugbank:drugCategory drugbank-cat:micronutrient .
?drug drugbank:casRegistryNumber ?id .
?keggDrug rdf:type kegg:Drug .
?keggDrug bio2rdf:xRef ?id .
?keggDrug dc:title ?title .
}
3 Optimized Internal Representation
4x Local Join
=
4x NLJ
Exlusive Group
 Remote Join
71

Selectivity Based Join Order Optimization
[] a sd:Service ;
sd:endpointUrl <http://localhost:8890/sparql> ;
sd:capability [
sd:predicate diseasome:name ;
sd:totalTriples 147 ; // Total number of triple patterns with predicate value sd:predicate
sd:avgSbjSel ``0.0068'' ; // 1/ distinct subjects with predicate value sd:predicate
sd:avgObjSel ``0.0069'' ; // 1/ distinct Objects with predicate value sd:predicate
] ;
sd:capability [
sd:predicate diseasome:chromosomalLocation ;
sd:totalTtriples 160 ;
sd:avgSbjSel ``0.0062'' ;
sd:avgObjSel ``0.0072'' ;
] ;
S1 P O1 .
S1 P O2 .
S2 P O1 .
S3 P O2 .
totalTriples = 4
avgSbjSel(p) = 1/3
avgObjSel(p) =1/2

Selectivity Based Join Order Optimization
• Triple pattern cardinality
• Join Cardinality
푝 = pred(tp) , 푇 = Total triple having predicate 푝
퐶(푡푝) =
푇 푖푓 푛푒푖푡ℎ푒푟 푠푢푏푗푒푐푡 푛표푟 표푏푗푒푐푡 푖푠 푏표푢푛푑
푇 × 푎푣푔푆푏푗푆푒푙 푝 푖푓 푠푢푏푗푒푐푡 푖푠 푏표푢푛푑
푇 × 푎푣푔푂푏푗푆푒푙 푝 푖푓표푏푗푒푐푡 푖푠 푏표푢푛푑
퐶(퐽 푡푝1, 푡푝2 ) =
퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푃푟푒푑퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푃푟푒푑퐽표푖푛푆푒푙 푡푝2 푖푓 푝 − 푝 푗표푖푛
퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝2 푖푓 푠 − 푠 푗표푖푛
퐶 푡푝1 × 퐶 푡푝2 × 푎푣푔푆푏푗퐽표푖푛푆푒푙 푡푝1 × 푎푣푔푂푏푗퐽표푖푛푆푒푙 푡푝2 푖푓 푠 − 표 푗표푖푛
How to calculate avgPredJoinSel, avgSbjJoinSel, and avgObjJoinSel?
DARQ selected 0.5 as the avgJoinSel value for all joins

Join Implementations
• Bound Joins
– Start with a single triple pattern (lowest cardinality)
– Substitute mappings from previous triple pattern in the
subsequent evaluation
– Bound Joins in NLJ fashion
• Execute bound joins in nested loop join fashion
• Too many remote requests
– Bound Joins in Block NLJ fashion
• Execute bound joins in block nested loop join fashion
• Make use of SPARQL UNION construct
• Remote requests are reduced by the block size
• Other Join techniques
– E.g, Hash Joins

Bound Joins in Block NLJ
SELECT ?President ?Party ?TopicPage WHERE {
?President rdf:type dbpedia:PresidentsOfTheUnitedStates .
?President dbpedia:party ?Party .
?nytPresident owl:sameAs ?President .
?nytPresident nytimes:topicPage ?TopicPage .
}
Assume that the following intermediate results have been computed as input for the last triple pattern
Block Input
“Barack Obama”
“George W. Bush”
…
Before (NLJ)
SELECT ?TopicPage WHERE { “Barack Obama” nytimes:topicPage ?TopicPage }
SELECT ?TopicPage WHERE { “George W. Bush” nytimes:topicPage ?TopicPage }
…
Now: Evaluation in a single remote request using a SPARQL UNION
construct + local post processing (SPARQL 1.0)
76

Parallelization and Pipelining
• Execute sub-queries concurrently on different data
sources
• Multithreaded worker pool to execute the joins
and UNION operators in parallel
• Pipelining approach for intermediate results
• See FedX and LHD implementations

Performance Metrics and Discussion

Performance Metrics
• Efficient source selection in terms of
– Total triple pattern-wise sources selected
– Total number of SPARQL ASK requests used during source
selection
– Source selection time
• Query execution time
• Results completeness and correctness
• Number of remote requests during query execution
• Index compression ratio (1- index size/datadump size)
• See https://code.google.com/p/bigrdfbench/

Evaluation Setup
• Local dedicated network
• Local SPARQL endpoints (One per machine)
• Run each query 10 times and present the average results
• Statistically analyzed the results, e.g., Wilcoxon signed rank
test, student T-test

SPARQL Query Federation Engines
• FedX
• SPLENDID
• HiBISCuS+FedX
• HiBISCuS+SPLENDID
• ANAPSID
• LHD
• DARQ
81

AKSW SPARQL Federation Publications
• HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation by Muhammad
Saleem and Axel-Cyrille Ngonga Ngomo, in (ESWC, 2014)
• DAW: Duplicate-AWare Federated Query Processing over the Web of Data by Muhammad
Saleem Axel-Cyrille Ngonga Ngomo, Josiane Xavier Parreira , Helena Deus , and Manfred Hauswirth
, in (ISWC 2013).
• TopFed: TCGA Tailored Federated Query Processing and Linking to LOD by Muhammad Saleem,
Shanmukha Sampath , Axel-Cyrille Ngonga Ngomo , Aftab Iqbal, Jonas Almeida , and Helena F. Deus
, in (Journal of Biomedical Semantics, 2014).
• A Fine-Grained Evaluation of SPARQL Endpoint Federation Systems by Muhammad Saleem, Yasar
Khan, Ali Hasnain, Ivan Ermilov, and Axel-Cyrille Ngonga Ngomo , in (Semantic Web Journal, 2014)
• BigRDFBench: A Billion Triples Benchmark for SPARQL Query Federation by Muhammad Saleem,
Ali Hasnain, Axel-Cyrille Ngonga Ngomo , in (submitted WWW, 2015).
• SAFE: Policy-Aware SPARQL Query Federation Over RDF Data Cubes
By Yasar Khan, Muhammed Saleem , Aftab Iqbal, Muntazir Mehdi, Aidan Hogan, Panagiotis
Hasapis, Axel-Cyrille Ngonga Ngomo, Stefan Decker, and Ratnesh Sahay, in (SWAT4LS, 2014)
• QFed: Query Set For Federated SPARQL Query Benchmark by Nur Aini Rakhmawati, Sarasi lithsena
, Muhammad Saleem , Stefan Decker, in (iiWAS, 2014)
82

Thanks
{saleem,ngonga}@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany

Federated SPARQL query processing over the Web of Data

More Related Content

What's hot

Similar to Federated SPARQL query processing over the Web of Data

More from Muhammad Saleem

Recently uploaded

Federated SPARQL query processing over the Web of Data