Polar Domain Discovery with Sparkler - EarthCube (Karanjeet Singh)
Polar Deep Insights with Domain Discovery and Sparkler (Spark Crawler). Presented at EarthCube All Hands Meeting 2017! #ECAHM2017 #USCDataScience #IRDS
This document discusses RDF stream processing and the role of semantics. It begins by outlining common sources of streaming data on the internet of things. It then discusses challenges of querying streaming data and existing approaches like CQL. Existing RDF stream processing systems are classified based on their query capabilities and use of time windows and reasoning. The role of linked data principles and HTTP URIs for representing streaming sensor data is discussed. Finally, requirements for reactive stream processing systems are outlined, including keeping data moving, integrating stored and streaming data, and responding instantaneously. The document argues that building relevant RDF stream processing systems requires going beyond existing requirements to address data heterogeneity, stream reasoning, and optimization.
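Window-based continuous querying, which the summary attributes to CQL-style approaches, can be illustrated without any RSP engine. Below is a minimal Python sketch of a time-based sliding window over timestamped triples; the triple layout and window length are illustrative assumptions, not details from the document.

```python
from collections import deque
import time

class SlidingWindow:
    """Keep only the triples that arrived within the last `width` seconds."""

    def __init__(self, width_seconds: float):
        self.width = width_seconds
        self.buffer = deque()  # (triple, timestamp) pairs, oldest first

    def push(self, triple, ts=None):
        ts = time.time() if ts is None else ts
        self.buffer.append((triple, ts))
        self._evict(ts)

    def _evict(self, now):
        # Drop elements that have fallen out of the window.
        while self.buffer and now - self.buffer[0][1] > self.width:
            self.buffer.popleft()

    def match(self, s=None, p=None, o=None):
        """Continuous-query-style pattern matching over the window contents."""
        for (subj, pred, obj), _ in self.buffer:
            if s in (None, subj) and p in (None, pred) and o in (None, obj):
                yield (subj, pred, obj)

# Example: report the sensor readings seen in the last 10 seconds.
w = SlidingWindow(10.0)
w.push((":sensor1", ":hasReading", "21.5"))
print(list(w.match(p=":hasReading")))
```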
Fostering Serendipity through Big Linked Data (Muhammad Saleem)
This document discusses fostering serendipity through linking large biomedical datasets. It linked over 30 billion triples from The Cancer Genome Atlas (TCGA) and over 23 million publications from PubMed. It developed an architecture called TopFed to continuously integrate new data through parallel querying. TopFed was evaluated against the FedX system and shown to have significantly better performance, with query runtimes over 75 times faster for some queries. A visualization interface was also created to explore the linked data.
This document discusses query rewriting in RDF stream processing. It presents StreamQR, a system that incorporates query rewriting techniques with an RDF stream processor (RSP) to answer queries over ontologies in streams. StreamQR rewrites queries using an ontology and registers the rewritten queries with an RSP. It achieves throughput comparable to no rewriting even for queries with many rewritings. StreamQR performance is evaluated under different workloads and compared to an approach using incremental reasoning. Query rewriting allows efficient query answering over ontologies in RSPs.
Many Task Applications for Grids and Supercomputers (Ian Foster)
The document discusses how new supercomputing applications are increasingly focused on "logistical" issues like executing many communication-intensive tasks over large shared datasets, rather than "heroic" computations of a single task. It argues that new programming models and tools are needed to efficiently manage large numbers of tasks, complex data dependencies, and failures at extreme scales of petascale and exascale computers. Examples of applications that could benefit include parameter studies, ensemble simulations, data analysis, and scientific workflows involving millions of tasks.
Achieving time effective federated information from scalable rdf data using s... (తేజ దండిభట్ల)
This document discusses achieving time-effective federated information from scalable RDF data using SPARQL queries. It aims to quickly retrieve federated data from heterogeneous databases, represented as a single RDF data file, via SPARQL queries exposed as a global web service. Key points include integrating data from different sources into RDF format, using SPARQL queries to access the federated RDF data, and analyzing response times for queries on large RDF datasets.
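To make the access pattern concrete, the following sketch issues a SPARQL query from Python with the SPARQLWrapper library; the public DBpedia endpoint and the query are stand-ins for the paper's own federated setup, not its actual code.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Any SPARQL endpoint exposing the integrated RDF data would work here;
# DBpedia is used only as a publicly reachable example.
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```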
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67... (GigaScience, BGI Hong Kong)
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
The document discusses querying live linked data from millions of diverse data sources on the web. It presents different approaches for source selection when querying over dynamic linked data, including using indexes, data summaries, and direct execution. Evaluation of the approaches shows that combining querying of static RDF stores and the live web through source selection dynamics can improve query time and return fresher results.
MapReduce can effectively scale three large-scale MSR studies to clusters with more machines. A software evolution study using J-REX saw a 9x speedup on an 18-machine cluster, and log analysis using JACK saw a 6x speedup. Code clone detection using CCFinder, which previously took 58 hours, likewise completed in a fraction of that time on an 18-machine cluster. Two main challenges of migrating MSR studies to MapReduce are the locality of the analysis (local, semi-local, or global) and the granularity of the analysis (fine-grained or coarse-grained). Other challenges include locating a suitable cluster, managing large amounts of data during analysis, and recovering from errors.
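For readers unfamiliar with the programming model, here is a framework-free Python rendering of the map/shuffle/reduce pattern applied to a toy log-analysis task, in the spirit of the JACK study; the log format is invented.

```python
from itertools import groupby
from operator import itemgetter

logs = [
    "2009-01-01 ERROR NullPointerException",
    "2009-01-01 INFO build ok",
    "2009-01-02 ERROR OutOfMemoryError",
]

# Map: emit (key, value) pairs, here one count per error line.
def mapper(line):
    date, level, _ = line.split(" ", 2)
    if level == "ERROR":
        yield (date, 1)

# Shuffle: group intermediate pairs by key (a real framework does this for you,
# distributed across machines).
pairs = sorted(p for line in logs for p in mapper(line))
grouped = groupby(pairs, key=itemgetter(0))

# Reduce: aggregate the values for each key.
for date, group in grouped:
    print(date, sum(v for _, v in group))
# -> 2009-01-01 1
#    2009-01-02 1
```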
Mining and Untangling Change Genealogies (PhD Defense Talk) (Kim Herzig)
The document discusses mining software repositories to analyze code history and detect patterns. It describes representing code changes as change operations like adding or removing method definitions. These are used to build change genealogies modeling dependencies between changes. Change genealogies can be model checked using CTL to extract rules describing likely cause-effect chains of changes. These rules are evaluated on projects to predict with over 60% precision which future changes may occur based on current changes. The approach ensures predictions are based on structural dependencies between changes.
This document provides an overview of RDF stream processing and existing RDF stream processing engines. It discusses RDF streams and how sensor data can be represented as RDF streams. It also summarizes some existing RDF stream processing query languages and systems, including C-SPARQL, and the features they support like continuous execution, operators, and time-based windows. The document is intended as a tutorial for developers on working with RDF stream processing.
Lightning fast genomics with Spark, Adam and Scala (Andy Petrella)
This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
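The population-clustering demo could be sketched in PySpark roughly as follows; the Parquet path and feature columns are hypothetical placeholders for ADAM-formatted 1000 Genomes data, not the talk's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("genomes-clustering").getOrCreate()

# Hypothetical ADAM/Parquet export: one row per sample, numeric genotype features.
df = spark.read.parquet("genotypes.parquet")

features = VectorAssembler(
    inputCols=["feat_1", "feat_2", "feat_3"],  # placeholder feature columns
    outputCol="features",
).transform(df)

# Cluster samples; with informative variants, clusters tend to track populations.
model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
model.transform(features).select("sample_id", "prediction").show()
```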
We are living in the world of “Big Data”. “Big Data” is mainly expressed with three Vs – Volume, Velocity and Variety. The presentation will discuss how Big Data impacts us and how SAS programmers can use their SAS skills in a Big Data environment.
The presentation will introduce Big Data storage solutions – Hadoop and NoSQL. For Hadoop, it will discuss two major capabilities - the Hadoop Distributed File System (HDFS) and Map/Reduce (parallel computing in Hadoop). It will show how SAS can work with Hadoop using the HDFS LIBNAME and FILENAME engines, SAS/ACCESS to Hadoop (Hive), and SAS Grid Manager with Hadoop YARN. It will also introduce the concepts of NoSQL databases for a big data solution.
The presentation will also introduce how SAS can work with a variety of data formats, especially XML and JSON. It will show a use case of converting XML documents to SAS datasets using the LIBNAME XMLV2 statement with an XMLMAP. It will also introduce REST APIs for extracting data over the internet and will demonstrate how SAS PROC HTTP can move data through a REST API.
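PROC HTTP itself is SAS, but the request-and-parse flow it implements is easy to preview in Python; this sketch shows the same pattern of pulling JSON from a REST API and flattening it into rows (the URL and field names are placeholders).

```python
import json
import urllib.request

# Placeholder REST endpoint; PROC HTTP would GET the same URL from SAS.
url = "https://example.org/api/readings"

with urllib.request.urlopen(url) as response:
    payload = json.load(response)

# Flatten the JSON records into tabular rows, the analogue of a SAS dataset.
rows = [(rec["id"], rec["value"]) for rec in payload["records"]]
for row in rows:
    print(row)
```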
This document provides a summary of the BigData ecosystem. It lists various distributed filesystems, NoSQL databases, data models, distributed programming frameworks, data ingestion tools, scheduling tools, system development tools, service programming tools, and machine learning tools that are part of the BigData ecosystem. It also defines the size of bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, and exabytes. Some related links on open data, NoSQL databases, traditional databases vs NoSQL, and the role of SQL in big data are also included.
Opportunities for X-Ray science in future computing architectures (Ian Foster)
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no signs of slowing down: the next 10 years will surely see exascale, new cloud offerings, and terabit networks. In this talk I review several of these developments and discuss their potential implications for X-ray science and X-ray facilities.
This document discusses using RESTdesc to enable automated composition of sensor web APIs. RESTdesc can be used to describe the functionality of web APIs for sensors like temperature, location, and pressure sensors. These descriptions are modeled as rules that can be chained together using semantic web reasoning. The author has tested this approach and found that RESTdesc composition scales well, with chains of over 500 APIs completing in under 2 seconds. This allows for automated composition of sensor web APIs to answer complex queries.
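The composition idea (each API description is a rule from preconditions to postconditions, and a reasoner chains rules toward a goal) can be sketched with a tiny forward-chaining loop in Python; the sensor APIs below are invented examples, and this is plain Python rather than RESTdesc's N3 syntax.

```python
# Each "API description" is modeled as a rule: given facts, it produces new facts.
RULES = [
    ("gps_api",     {"device_id"},   {"location"}),
    ("weather_api", {"location"},    {"temperature", "pressure"}),
    ("comfort_api", {"temperature"}, {"comfort_index"}),
]

def compose(goal, facts):
    """Forward-chain over the rules until the goal fact is derivable."""
    plan, facts = [], set(facts)
    changed = True
    while goal not in facts and changed:
        changed = False
        for name, pre, post in RULES:
            if pre <= facts and not post <= facts:
                plan.append(name)   # record the API call in the chain
                facts |= post       # its outputs become available facts
                changed = True
    return plan if goal in facts else None

# Which API calls answer "what is the comfort index for this device?"
print(compose("comfort_index", {"device_id"}))
# -> ['gps_api', 'weather_api', 'comfort_api']
```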
The Materials Project is an open initiative that makes calculated materials property data publicly available to accelerate materials innovation. It has calculated properties for over 30,000 materials using over 10 million CPU hours. The project provides a Python library and API to access and analyze materials data, as well as a workflow manager to run calculations on supercomputers. It aims to calculate all known inorganic materials and establish collaborations to develop new materials design tools.
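For context, access via the project's Python library looks roughly like this, assuming the legacy pymatgen MPRester client; mp-149 is a real material ID (silicon), while the API key is a placeholder.

```python
from pymatgen.ext.matproj import MPRester

# Requires a (free) Materials Project API key.
with MPRester("YOUR_API_KEY") as mpr:
    # Fetch the computed crystal structure for silicon (mp-149).
    structure = mpr.get_structure_by_material_id("mp-149")
    print(structure.composition.reduced_formula)
    print(structure.lattice)
```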
This document summarizes Jean-Paul Calbimonte's presentation on connecting stream reasoners on the web. It discusses representing data streams as RDF and using RDF stream processing systems. Key points include:
- RDF streams can be represented as sequences of timestamped RDF graphs.
- The W3C RSP community group is working to standardize RDF stream models and query languages.
- Producing RDF streams involves mapping live data sources to RDF and adding timestamps.
- Consuming RDF streams involves discovering stream metadata and endpoints to access the streams.
- Systems like TripleWave demonstrate approaches for spreading RDF streams on the web; a minimal sketch of producing one such timestamped stream element follows.
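The sketch below builds one stream element with rdflib: an RDF graph carrying its own generation timestamp. The sensor vocabulary is invented for illustration.

```python
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/stream#")

def stream_element(sensor: str, value: float) -> Graph:
    """Build one stream element: an RDF graph plus a generation timestamp."""
    g = Graph()
    now = datetime.now(timezone.utc)
    obs = URIRef(f"http://example.org/obs/{sensor}/{now.timestamp()}")
    g.add((obs, EX.sensor, EX[sensor]))
    g.add((obs, EX.value, Literal(value, datatype=XSD.double)))
    # The timestamp that makes this graph a *stream* element.
    g.add((obs, EX.generatedAt, Literal(now.isoformat(), datatype=XSD.dateTime)))
    return g

print(stream_element("sensor1", 21.5).serialize(format="turtle"))
```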
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
The document summarizes a system for integrating crop data and meteorological data using a standardized data exchange framework. The system uses a metadata database and broker service called MetBroker to provide consistent access to heterogeneous weather databases. Crop data from different sources can be uploaded and integrated into a central database. The system then allows users to query the integrated crop and weather data and analyze relationships to support applications like crop modeling.
Accelerating Science with Cloud Technologies in the ABoVE Science Cloud (Globus)
This document summarizes the use of the ABoVE Science Cloud (ASC) to support research for the Arctic-Boreal Vulnerability Experiment (ABoVE). The ASC provides researchers with large datasets, computing resources, and tools to process and analyze remote sensing and model data related to Alaska and northern Canada. Several examples are given of projects using the ASC, including analyzing satellite imagery to map forest structure, tracking surface water changes over time, characterizing fire history, and modeling future forest composition under climate change. The ASC aims to facilitate collaboration by allowing scientists to access common datasets and run computationally-intensive processes in the cloud without having to directly transfer large amounts of data.
Triplewave: a step towards RDF Stream Processing on the Web (Daniele Dell'Aglio)
The slides of my talk at INSIGHT Centre for Data Analytics (in NUI Galway) where I presented TripleWave (http://streamreasoning.github.io/TripleWave/), an open-source framework to create and publish streams of RDF data.
The document summarizes changes at The HDF Group, including new staff members and their roles. It also outlines recent and upcoming HDF software releases, new operating system and compiler support, tools for HDF and netCDF interoperability, ongoing research projects involving parallel I/O and analysis, and potential projects of interest involving scientific domains like plasma physics, particle accelerators, and digital twin technology.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
The PRP is a partnership of more than 50 institutions, led by researchers at UC San Diego and UC Berkeley, and includes the National Science Foundation, Department of Energy, and multiple research universities in the US and around the world. The PRP builds on the optical backbone of Pacific Wave, a joint project of CENIC and the Pacific Northwest GigaPOP (PNWGP), to create a seamless research platform that encourages collaboration on a broad range of data-intensive fields and projects.
A Data Ecosystem to Support Machine Learning in Materials Science (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
The document summarizes an open genomic data project called OpenFlyData that links and integrates gene expression data from multiple sources using semantic web technologies. It describes how RDF and SPARQL are used to query linked data from sources like FlyBase, BDGP and FlyTED. It also discusses applications built on top of the linked data as well as performance and challenges of the system.
This document discusses various approaches for building applications that consume linked data from multiple datasets on the web. It describes characteristics of linked data applications and generic applications like linked data browsers and search engines. It also covers domain-specific applications, faceted browsers, SPARQL endpoints, and techniques for accessing and querying linked data including follow-up queries, querying local caches, crawling data, federated query processing, and on-the-fly dereferencing of URIs. The advantages and disadvantages of each technique are discussed.
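Of these techniques, on-the-fly dereferencing is the simplest to show: fetch the URI with an RDF Accept header and parse the result. The sketch below uses a real DBpedia URI but is a generic pattern, not code from the document.

```python
import urllib.request
from rdflib import Graph

uri = "http://dbpedia.org/resource/Berlin"

# Content negotiation: ask the server for Turtle instead of HTML.
request = urllib.request.Request(uri, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    data = response.read()

g = Graph()
g.parse(data=data, format="turtle")
print(f"{len(g)} triples dereferenced from {uri}")
```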
Producing, publishing and consuming linked data - CSHALS 2013 (François Belleau)
This document discusses lessons learned from the Bio2RDF project for producing, publishing, and consuming linked data. It outlines three key lessons: 1) How to efficiently produce RDF using existing ETL tools like Talend to transform data formats into RDF triples; 2) How to publish linked data by designing URI patterns, offering SPARQL endpoints and associated tools, and registering data in public registries; 3) How to consume SPARQL endpoints by building semantic mashups using workflows to integrate data from multiple endpoints and then querying the mashup to answer questions.
Vinod Chachra discussed improving discovery systems through post-processing harvested data. He outlined key players like data providers, service providers, and users. The harvesting, enrichment, and indexing processes were described. Facets, knowledge bases, and branding were discussed as ways to enhance discovery. Chachra concluded that progress has been made but more work is needed, and data and service providers should collaborate on standards.
The document describes an approach and system called WIMU that indexes URIs and linked data sources to enable finding relevant RDF data sources for a given URI. WIMU indexes over 4 billion URIs and 668 thousand datasets. It ranks datasets based on the number of literals associated with a URI to determine where that URI is defined. The system was experimentally found to have high precision and provides a web interface and API for querying URI locations. Future work includes integrating WIMU with the LinkLion link discovery system.
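The ranking heuristic (prefer the dataset in which a URI occurs with the most literals, on the assumption that this is where the URI is defined) reduces to a counting exercise. The toy index below is invented to show the idea, not WIMU's actual data structures.

```python
# dataset -> number of triples where the URI appears with a literal object.
literal_counts = {
    "http://bio2rdf.org/sparql": 42,
    "http://dbpedia.org/sparql":  7,
    "http://example.org/sparql":  0,
}

def rank_datasets(counts: dict) -> list:
    """Rank candidate datasets by literal count, highest first (most likely
    the defining source), dropping datasets with no literals at all."""
    return sorted(
        (d for d, c in counts.items() if c > 0),
        key=counts.get,
        reverse=True,
    )

print(rank_datasets(literal_counts))
# -> ['http://bio2rdf.org/sparql', 'http://dbpedia.org/sparql']
```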
How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benc... (Muhammad Saleem)
Triplestores are data management systems for storing and querying RDF data. Over recent years, various benchmarks have been proposed to assess the performance of triplestores across different performance measures. However, choosing the most suitable benchmark for evaluating triplestores in practical settings is not a trivial task. This is because triplestores experience varying workloads when deployed in real applications. We address the problem of determining an appropriate benchmark for a given real-life workload by providing a fine-grained comparative analysis of existing triplestore benchmarks. In particular, we analyze the data and queries provided with the existing triplestore benchmarks in addition to several real-world datasets. Furthermore, we measure the correlation between the query execution time and various SPARQL query features and rank those features based on their significance levels. Our experiments reveal several interesting insights about the design of such benchmarks. With this fine-grained evaluation, we aim to support the design and implementation of more diverse benchmarks. Application developers can use our result to analyze their data and queries and choose a data management system.
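The correlation analysis described here can be reproduced in miniature with scipy: compute a rank correlation between each query feature and runtime, then order the features by correlation strength. The values below are invented toy inputs that exist only to make the snippet run.

```python
from scipy.stats import spearmanr

# Invented toy measurements: one entry per benchmark query.
features = {
    "triple_patterns": [1, 3, 5, 8, 12],
    "join_vertices":   [0, 1, 2, 4, 6],
    "result_size":     [10, 5, 80, 40, 200],
}
runtime_ms = [12, 30, 75, 160, 400]

# Rank features by the strength of their correlation with execution time.
ranked = sorted(
    ((name, spearmanr(vals, runtime_ms).correlation)
     for name, vals in features.items()),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
for name, rho in ranked:
    print(f"{name}: rho = {rho:.2f}")
```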
This document discusses Bio2RDF, a project that converts life science databases into RDF and makes them accessible via SPARQL endpoints. It provides background on the need for data integration, describes how Bio2RDF was implemented including the conversion process and architecture, and outlines future goals like adding more datasets and developing new services.
Sustainable queryable access to Linked Data (Ruben Verborgh)
This document discusses sustainable queryable access to Linked Data through the use of Triple Pattern Fragments (TPF). TPFs provide a low-cost interface that allows clients to query datasets through triple patterns. Intelligent clients can execute SPARQL queries over TPFs by breaking queries into triple patterns and aggregating the results. TPFs also enable federated querying across multiple datasets by treating them uniformly as fragments that can be retrieved. The document demonstrates federated querying over DBpedia, VIAF, and Harvard Library datasets using TPF interfaces.
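A TPF interface is essentially HTTP with subject/predicate/object parameters, so a client request can be sketched directly; the endpoint URL below is a placeholder for whichever TPF server hosts the dataset.

```python
import urllib.parse
import urllib.request
from rdflib import Graph

# Placeholder TPF endpoint; real servers expose the same parameter scheme.
fragment_url = "https://example.org/dataset?" + urllib.parse.urlencode({
    "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
    "object": '"Berlin"@en',
})

request = urllib.request.Request(fragment_url, headers={"Accept": "text/turtle"})
with urllib.request.urlopen(request) as response:
    g = Graph().parse(data=response.read(), format="turtle")

# The response mixes data triples with hydra paging/count metadata;
# a smart client uses the counts to plan SPARQL join orders.
print(f"fragment contains {len(g)} triples")
```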
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z... (CIARD Movement)
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
This document discusses how semantic web technologies like RDF and SPARQL can help navigate complex bioinformatics databases. It describes a three step method for building a semantic mashup: 1) transform data from sources into RDF, 2) load the RDF into a triplestore, and 3) explore and query the dataset. As an example, it details how Bio2RDF transformed various database cross-reference resources into RDF and loaded them into Virtuoso to answer questions about namespace usage.
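Steps 2 and 3 of that method, loading the RDF into a store and querying it, look like this with rdflib standing in for Virtuoso; the tiny dataset is invented, and a production setup would load a real triplestore instead.

```python
from rdflib import Graph

# Step 2 (miniature): load transformed RDF into a (here: in-memory) store.
g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/ns#> .
    ex:record1 ex:xref ex:uniprot_P01308 .
    ex:record2 ex:xref ex:pdb_1MSO .
""", format="turtle")

# Step 3: explore the dataset with SPARQL, e.g., count cross-references.
results = g.query("""
    PREFIX ex: <http://example.org/ns#>
    SELECT (COUNT(?x) AS ?n) WHERE { ?s ex:xref ?x . }
""")
for row in results:
    print(row.n)
```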
Finding knowledge, data and answers on the Semantic Web (ebiquity)
Web search engines like Google have made us all smarter by providing ready access to the world's knowledge whenever we need to look up a fact, learn about a topic or evaluate opinions. The W3C's Semantic Web effort aims to make such knowledge more accessible to computer programs by publishing it in machine understandable form.
As the volume of Semantic Web data grows, software agents will need their own search engines to help them find the relevant and trustworthy knowledge they need to perform their tasks. We will discuss the general issues underlying the indexing and retrieval of RDF-based information and describe Swoogle, a crawler-based search engine whose index contains information on over a million RDF documents.
We will illustrate its use in several Semantic Web related research projects at UMBC, including a distributed platform for constructing end-to-end use cases that demonstrate the semantic web’s utility for integrating scientific data. We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which searches the Semantic Web for data relevant to a given query. ELVIS functionality is exposed as a collection of web services, and all input and output data is expressed in OWL, thereby enabling its integration with Triple Shop and other semantic web resources.
Doctoral Examination at the Karlsruhe Institute of Technology (08.07.2016) (Dr.-Ing. Thomas Hartmann)
In this thesis, a validation framework is introduced that enables RDF-based constraint languages to be executed consistently on RDF data and constraints of any type to be formulated. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, consists of a small lightweight vocabulary, ensures consistency of validation results, and enables constraint transformations for each constraint type across RDF-based constraint languages.
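As one concrete instance of executing a constraint language on RDF data (the thesis treats many such languages generically), a SHACL check via the pyshacl library looks like this; the shape and data are invented examples, not material from the thesis.

```python
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
    @prefix ex: <http://example.org/ns#> .
    ex:alice a ex:Person .
""", format="turtle")

shapes = Graph().parse(data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/ns#> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:name ; sh:minCount 1 ] .
""", format="turtle")

# Every ex:Person must have at least one ex:name, so this data fails.
conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False
print(report_text)   # human-readable violation report
```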
The document discusses requirements and approaches for RDF stream processing (RSP). RSP aims to process continuous RDF streams to address scenarios like sensor data and social media; it involves querying streaming data, integrating streams with static data, and handling issues like imperfections. The document reviews existing RSP systems and languages, actor-based approaches, and the eight requirements for real-time stream processing, including keeping data moving, generating predictable outcomes, and responding instantaneously.
- The document discusses OpenFlyData, a project that integrates biological data from multiple sources using Semantic Web technologies like RDF and SPARQL. It describes applications that allow searching gene expression data across databases.
- Key challenges addressed are that biological data is scattered across sites and integration requires mapping heterogeneous identifiers. The architecture uses a SPARQL endpoint and mappings to expose data from sources like FlyBase, BDGP and FlyAtlas.
- Performance testing showed good query times for real-time user interaction, though some queries took seconds and text matching had issues without custom solutions. Future work aims to add sources and develop more applications.
FlyWeb is a project that integrates biological data from multiple sources using Semantic Web technologies. It allows users to search for gene expression images, sequences, publications and other data about genes. Key points include:
- FlyWeb integrates data from sources like FlyBase, BDGP and FlyTED about Drosophila genes, linking gene names, expressions images, sequences and publications.
- It uses Semantic Web tools to create a unified application, accessing data through SPARQL queries to different SPARQL endpoints for each source.
- Challenges include mapping different gene name vocabularies and improving the performance of case-insensitive text searches in SPARQL. Future work aims to add more data sources and develop further applications.
This document describes an RDF query reformulation algorithm. The algorithm takes as input a conjunctive RDF query and RDF Schema, and outputs a union of equivalent conjunctive queries. The code is written in Java and located on a source code repository. It contains several packages for reasoning, rules, signatures, and utilities. Libraries used include Jena and Xerces. Papers related to the algorithm are cited. Future work includes adding more robust tests of reformulated query content.
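The core rewriting step can be sketched independently of the Java code base: a triple pattern over a class expands into a union over its RDFS subclasses. The class hierarchy below is an invented toy schema.

```python
# rdfs:subClassOf edges: child -> parent (invented toy schema).
SUBCLASS_OF = {"ex:Student": "ex:Person", "ex:Professor": "ex:Person"}

def subclasses_of(cls):
    """All classes whose instances are also instances of `cls` (reflexive)."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def reformulate(var, cls):
    """Rewrite `?var a cls` into a union of equivalent conjunctive queries."""
    return [f"{var} a {c} ." for c in sorted(subclasses_of(cls))]

# A query for ex:Person instances becomes a union of three queries.
print(reformulate("?x", "ex:Person"))
# -> ['?x a ex:Person .', '?x a ex:Professor .', '?x a ex:Student .']
```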
Using Architectures for Semantic Interoperability to Create Journal Clubs for... (James Powell)
This document describes a system for creating digital journal clubs for emergency response. The system harvests and semantically maps bibliographic metadata from various sources to expose focused collections. It augments the metadata with information on author relationships, georeferences and concepts. Tools enable exploration of collections through visualizations and maps. Social features allow users to tag, comment and collaborate, stored as semantic triples to enable interoperability. The system aims to provide responders with timely access to vetted information and collaboration tools to help address emergency situations.
Re-using Media on the Web: Media fragment re-mixing and playout (MediaMixerCommunity)
A number of novel application ideas will be introduced based on the media fragment creation, specification and rights management technologies. Semantic search and retrieval allows us to organize sets of fragments by topical or conceptual relevance. These fragment sets can then be played out in a non-linear fashion to create a new media re-mix. We look at a server-client implementation supporting Media Fragments, before allowing the participants to take the sets of media they have selected and create their own re-mix.
Materials science experiments involve complex data that are often very heterogeneous and challenging to reproduce. These challenges were observed, for example, in a previous study on harnessing lightweight design potentials via the Materials Data Space, in which data from materials science and engineering experiments were generated using linked open data principles, e.g., the Resource Description Framework (RDF) as the standard model for data interchange on the Web. However, detailed knowledge of the query language SPARQL is necessary to query the data, and it was noticed that domain experts in materials science lack this knowledge. With this work, we aim to develop NaturalMSEQueries, an approach that lets materials science domain experts query the data with natural-language expressions, e.g., in English, instead of SPARQL queries. This will significantly improve the usability of Semantic Web approaches in materials science and lower the adoption threshold of these methods for domain experts. We plan to evaluate our approach with varying amounts of data from different sources, and to compare against synthetic data to assess the quality of our implementation.
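A first approximation of such natural-language querying is template-based: match an English pattern and fill a SPARQL skeleton. The property names below are invented placeholders for a materials science vocabulary, not part of NaturalMSEQueries.

```python
import re

# One invented template: "show all <class> with <property> above <value>"
PATTERN = re.compile(r"show all (\w+) with ([\w\s]+) above (\d+(?:\.\d+)?)")

def to_sparql(question: str) -> str:
    m = PATTERN.fullmatch(question.strip().lower())
    if not m:
        raise ValueError("no matching template")
    cls, prop, value = m.groups()
    prop = prop.strip().replace(" ", "_")
    return (
        f"PREFIX mse: <http://example.org/mse#>\n"
        f"SELECT ?s WHERE {{ ?s a mse:{cls} ; mse:{prop} ?v . "
        f"FILTER (?v > {value}) }}"
    )

print(to_sparql("show all steels with tensile strength above 500"))
```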
The document summarizes research at the intersection of materials science and engineering (MSE) and semantic web technologies (SWT). A literature review identified 20 key papers using SWT in MSE and found that ontologies and the conversion of tabular data to RDF were the most common applications. The document also presents several projects at the Federal Institute for Materials Research and Testing that apply SWT to MSE challenges such as visualizing methods, natural-language queries, and accelerating materials discovery. Overall, it aims to illustrate SWT's impact on MSE and identify open challenges at their intersection.
Andre Valdestilhas, Tommaso Soru, and Axel-Cyrille Ngonga Ngomo propose CEDAL, a time-efficient approach for detecting erroneous links in large-scale link repositories. CEDAL uses union-find and graph partitioning to scale to millions of links in O(m log n) time, improving over the state of the art, which has O(n^2) complexity. Experiments show CEDAL outperforms existing approaches and is able to parallelize processing across CPU and GPU cores. The authors conclude that CEDAL provides an efficient way to maintain link consistency in large knowledge bases.
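Union-find is the workhorse of this kind of analysis: every equivalence link merges two clusters, and the resulting clusters can be checked for anomalies. The sketch below uses one common error criterion for illustration (two distinct resources from the same dataset landing in a single equivalence class); it is not CEDAL's actual code.

```python
from collections import defaultdict

class UnionFind:
    """Near-linear disjoint sets via path compression and union by size."""
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Invented toy links; the prefixes stand for the source datasets.
links = [("dbp:Berlin", "wd:Q64"), ("dbp:Berlin", "dbp:Berlin_Germany")]
uf = UnionFind()
for a, b in links:
    uf.union(a, b)

# Group resources into equivalence classes and flag classes containing
# two resources from the same dataset, one common symptom of a bad link.
clusters = defaultdict(list)
for node in list(uf.parent):
    clusters[uf.find(node)].append(node)
for members in clusters.values():
    prefixes = [m.split(":")[0] for m in members]
    if len(prefixes) != len(set(prefixes)):
        print("suspicious cluster:", sorted(members))
```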
Presently, a growing number of publications in the Machine Learning and Data Mining communities contribute improved algorithms and methods to their respective fields. However, when it comes to publishing and sharing scientific results, we still face problems in searching and ranking these methods. Scouring the internet for state-of-the-art information about a specific context, such as Named Entity Recognition (NER), is often a time-consuming task. Moreover, this process can lead to an incomplete investigation, either because search engines return incomplete information or because keywords are not properly defined. To bridge this gap, we present WASOTA, a web portal specifically designed to share and readily present metadata about the state of the art in a specific domain, making the search for this information easier.
The document proves that the Most Frequent K Characters (MFKC) approach for measuring string similarity is both correct and complete. It does this by showing that the output of MFKC (A) is equal to the set of all string pairs with a similarity score above the threshold (A*). MFKC uses three filters (R1, R2, R3) to iteratively reduce the set of string pairs. It is shown that no pair discarded by the filters has a similarity above the threshold, proving completeness. Correctness follows from the definition of the final output A matching the definition of A*.
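To ground the proof summary, here is one common formulation of the MFKC measure itself: take each string's K most frequent characters and sum the combined frequencies of the characters both lists share. Treat this as an illustrative reading of the measure rather than the paper's exact definition, which may differ in details such as an additional limit parameter.

```python
from collections import Counter

def mfkc_similarity(s1: str, s2: str, k: int = 2) -> int:
    """Most Frequent K Characters: sum, over characters appearing in the
    top-k frequency lists of both strings, of their combined frequencies."""
    top1 = dict(Counter(s1).most_common(k))
    top2 = dict(Counter(s2).most_common(k))
    return sum(top1[ch] + top2[ch] for ch in top1.keys() & top2.keys())

# A threshold theta then splits pairs into similar and dissimilar,
# which is exactly the set A* that the proof reasons about.
print(mfkc_similarity("research", "seeking"))  # -> 4 (both lists contain 'e')
print(mfkc_similarity("night", "nacht"))       # -> 2 (both lists contain 'n')
```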
1) The document proposes reducing identifier heterogeneity in knowledge bases by developing a GUI that allows users to evaluate links between entities and suggest new links.
2) It presents a workflow involving importing data from multiple sources, normalizing identifiers, and allowing users to rate the quality and suggest improvements to links between entities.
3) Usability testing of the GUI indicated a high level of usability, and results from link ratings could be used in future work to further improve interlinking between knowledge bases.
1) The document proposes reducing identifier heterogeneity in knowledge bases by developing a GUI that allows users to evaluate links between entities and suggest new links.
2) It presents a workflow involving importing data from multiple sources, normalizing identifiers, and allowing users to rate the quality and suggest improvements to links between entities.
3) Results showed that 10.35% of links were transitive or redirects, and usability testing of the GUI indicated a high level of usability with an average SUS score of 82.
Emotion-oriented computing: Possible uses and resources (André Valdestilhas)
This article discusses the concepts of using digital television, affective computing and computer vision. The proposal combines techniques such as capturing facial expressions through a video camera and using accelerometers in ball and touch holograms to provide a certain level of interactivity with the viewer. Some example uses of the proposal are described, such as audience control and background content.
This document discusses using semiotic profiles to design graphical user interfaces for social media data on mobile phones. It begins by outlining the challenges of limited screen size on mobile devices. It then introduces semiotic profiles based on icons, indexes, and symbols to provide an intuitive interface. The document proposes that a semiotic profile can help organize large amounts of social media data on mobile phones. Future work is needed to analyze applications, assess mobility and usability, and develop prototypes using this approach.
Emotion-oriented computing: Possible uses and applications (André Valdestilhas)
This article discusses the concepts of using digital television, affective computing and computer vision. The proposal combines techniques such as capturing facial expressions through a video camera and using accelerometers in ball and touch holograms to provide a certain level of interactivity with the viewer. Some uses of the proposal are described, such as audience control and background content. The article highlights numerous benefits of the presented ideas, which can be applied in a broad context, for example for blind players of video games.
A study on the location of context-aware services for Television ... (André Valdestilhas)
The document discusses three approaches to providing context awareness in mobile digital television: Ginga-NCL, PlaceLab and ContexTV. Ginga-NCL enables context-aware applications on portable devices. PlaceLab uses Wi-Fi and Bluetooth signals to estimate the user's location. ContexTV uses wireless communication between devices and a server to deliver content personalized to the user's context.
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend... (Suzanne Lagerweij)
This is a workshop about communication and collaboration. We will experience how we can analyze the reasons for resistance to change (exercise 1) and practice how to improve our conversation style and be more in control and effective in the way we communicate (exercise 2).
This session will use Dave Gray’s Empathy Mapping, Argyris’ Ladder of Inference and The Four Rs from Agile Conversations (Squirrel and Fredrick).
Abstract:
Let’s talk about powerful conversations! We all know how to lead a constructive conversation, right? Then why is it so difficult to have those conversations with people at work, especially those in powerful positions that show resistance to change?
Learning to control and direct conversations takes understanding and practice.
We can combine our innate empathy with our analytical skills to gain a deeper understanding of complex situations at work. Join this session to learn how to prepare for difficult conversations and how to improve our agile conversations in order to be more influential without power. We will use Dave Gray’s Empathy Mapping, Argyris’ Ladder of Inference and The Four Rs from Agile Conversations (Squirrel and Fredrick).
In the session you will experience how preparing and reflecting on your conversation can help you be more influential at work. You will learn how to communicate more effectively with the people needed to achieve positive change. You will leave with a self-revised version of a difficult conversation and a practical model to use when you get back to work.
Come learn more on how to become a real influencer!
This presentation by OECD, OECD Secretariat, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
This presentation was uploaded with the author’s consent.
This presentation by Thibault Schrepel, Associate Professor of Law at Vrije Universiteit Amsterdam University, was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation was uploaded with the author’s consent.
Career goals.pptx and their importance in real life (artemacademy2)
Career goals serve as a roadmap for individuals, guiding them toward achieving long-term professional aspirations and personal fulfillment. Establishing clear career goals enables professionals to focus their efforts on developing specific skills, gaining relevant experience, and making strategic decisions that align with their desired career trajectory. By setting both short-term and long-term objectives, individuals can systematically track their progress, make necessary adjustments, and stay motivated. Short-term goals often include acquiring new qualifications, mastering particular competencies, or securing a specific role, while long-term goals might encompass reaching executive positions, becoming industry experts, or launching entrepreneurial ventures.
Moreover, having well-defined career goals fosters a sense of purpose and direction, enhancing job satisfaction and overall productivity. It encourages continuous learning and adaptation, as professionals remain attuned to industry trends and evolving job market demands. Career goals also facilitate better time management and resource allocation, as individuals prioritize tasks and opportunities that advance their professional growth. In addition, articulating career goals can aid in networking and mentorship, as it allows individuals to communicate their aspirations clearly to potential mentors, colleagues, and employers, thereby opening doors to valuable guidance and support. Ultimately, career goals are integral to personal and professional development, driving individuals toward sustained success and fulfillment in their chosen fields.
This presentation by OECD, OECD Secretariat, was made during the discussion “Competition and Regulation in Professions and Occupations” held at the 77th meeting of the OECD Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Why Psychological Safety Matters for Software Teams - ACE 2024 - Ben Linders.pdf (Ben Linders)
Psychological safety in teams is important; team members must feel safe and able to communicate and collaborate effectively to deliver value. It’s also necessary to build long-lasting teams since things will happen and relationships will be strained.
But, how safe is a team? How can we determine if there are any factors that make the team unsafe or have an impact on the team’s culture?
In this mini-workshop, we’ll play games for psychological safety and team culture utilizing a deck of coaching cards, The Psychological Safety Cards. We will learn how to use gamification to gain a better understanding of what’s going on in teams. Individuals share what they have learned from working in teams, what has impacted the team’s safety and culture, and what has led to positive change.
Different game formats will be played in groups in parallel. Examples are an ice-breaker to get people talking about psychological safety, a constellation where people take positions about aspects of psychological safety in their team or organization, and collaborative card games where people work together to create an environment that fosters psychological safety.
This presentation by the OECD Secretariat was made during the discussion “Pro-competitive Industrial Policy” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/pcip.
This presentation by the OECD Secretariat was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation by Nathaniel Lane, Associate Professor in Economics at Oxford University, was made during the discussion “Pro-competitive Industrial Policy” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/pcip.
The importance of sustainable and efficient computational practices in artificial intelligence (AI) and deep learning has become increasingly critical. This webinar focuses on the intersection of sustainability and AI, highlighting the significance of energy-efficient deep learning, innovative randomization techniques in neural networks, the potential of reservoir computing, and the cutting-edge realm of neuromorphic computing. It aims to connect theoretical knowledge with practical applications and to show how these innovative approaches can lead to more robust, efficient, and environmentally conscious AI systems.
Webinar Speaker: Prof. Claudio Gallicchio, Assistant Professor, University of Pisa
Claudio Gallicchio is an Assistant Professor at the Department of Computer Science of the University of Pisa, Italy. His research involves merging concepts from Deep Learning, Dynamical Systems, and Randomized Neural Systems, and he has co-authored over 100 scientific publications on the subject. He is the founder of the IEEE CIS Task Force on Reservoir Computing, and the co-founder and chair of the IEEE Task Force on Randomization-based Neural Networks and Learning Systems. He is an associate editor of IEEE Transactions on Neural Networks and Learning Systems (TNNLS).
This presentation by Yong Lim, Professor of Economic Law at Seoul National University School of Law, was made during the discussion “Artificial Intelligence, Data and Competition” held at the 143rd meeting of the OECD Competition Committee on 12 June 2024. More papers and presentations on the topic can be found at oe.cd/aicomp.
This presentation by Katharine Kemp, Associate Professor at the Faculty of Law & Justice at UNSW Sydney, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
This presentation by Tim Capel, Director of the UK Information Commissioner’s Office Legal Service, was made during the discussion “The Intersection between Competition and Data Privacy” held at the 143rd meeting of the OECD Competition Committee on 13 June 2024. More papers and presentations on the topic can be found at oe.cd/ibcdp.
XP 2024 presentation: A New Look to Leadership - samililja
Presentation slides from the XP 2024 conference in Bolzano, Italy. The slides describe a new view of leadership and combine it with anthro-complexity (aka Cynefin).
This presentation by Professor Alex Robson, Deputy Chair of Australia’s Productivity Commission, was made during the discussion “Competition and Regulation in Professions and Occupations” held at the 77th meeting of the OECD Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
1. More Complete Resultset Retrieval from Large Heterogeneous RDF Sources
Andre Valdestilhas, Tommaso Soru, Muhammad Saleem
AKSW Group, University of Leipzig, Germany
November 24, 2019
4. Motivation
Where to find RDF datasets?
- 9,960 raw RDF datasets
- 658,206 datasets (HDT files) from LOD Laundromat
- 559 SPARQL endpoints
- Different formats (serializations)
Which dataset? Querying them all means querying more than 221 billion triples (>5 terabytes).
5. Example
Where to find RDF datasets?
Authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench).
6. Example
Where to find RDF datasets?
For that same FedBench query (authors with a poster/demo paper in the ISWC 2008 proceedings), 4 HDT datasets (Semantic Web Dog Food, from the LOD Laundromat datasets) contain data that can answer the query.
7. Motivation
Approaches:
- Multiple SPARQL endpoints (+), but 90% of the datasets are only available as dump files (-).
- Dereferenceable URIs (+), but 43% of the URIs are non-dereferenceable (-).
- WIMU (Where is my URI?): provides data from non-dereferenceable URIs (+), but supports no SPARQL querying (-).
8. The approach
A hybrid SPARQL query engine that:
- collects data from multiple SPARQL endpoints,
- collects data from RDF dumps, including HDT files, and
- uses link traversal, obtaining data from non-dereferenceable URIs via WIMU (Where is my URI?, http://wimu.aksw.org/),
resulting in more complete results. Experiments were run with 3 state-of-the-art SPARQL query benchmarks: LargeRDFBench, FedBench and FEASIBLE. A minimal code sketch of the union idea follows below.
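To make the hybrid idea concrete, here is a minimal sketch in Python, assuming the SPARQLWrapper and rdflib packages. The endpoint URL, dump file and query are placeholders, and this illustrates only the union-of-sources idea, not the authors' actual wimuQ implementation (HDT files would additionally need a dedicated reader such as pyHDT).

from rdflib import Graph
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = "SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"

def query_endpoint(endpoint_url, query):
    # Run the query against a remote SPARQL endpoint; return (p, o) pairs.
    sw = SPARQLWrapper(endpoint_url)
    sw.setQuery(query)
    sw.setReturnFormat(JSON)
    bindings = sw.query().convert()["results"]["bindings"]
    return {(b["p"]["value"], b["o"]["value"]) for b in bindings}

def query_dump(dump_path, query):
    # Run the same query against a local RDF dump loaded into an in-memory graph.
    g = Graph()
    g.parse(dump_path, format="nt")  # plain N-Triples here; HDT needs its own reader
    return {(str(row.p), str(row.o)) for row in g.query(query)}

# The union of the per-processor result sets gives a more complete resultset.
results = query_endpoint("https://dbpedia.org/sparql", QUERY) | query_dump("dump.nt", QUERY)
for p, o in sorted(results):
    print(p, o)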
9. The approach
[Architecture diagram] Given a query such as SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }, the wimuQ query execution engine first extracts the URIs from the query, then sends them to WIMU for source filtering, which identifies the relevant sources: SPARQL endpoints, HDT files and RDF dumps (e.g. dump.bz2, file.rdf). The query is then executed by four processors in parallel (the data-dumps, traversal-based, SPARQL-a-lot and SPARQL-endpoint query processors), and the union of their results is returned as a single set of triples of the form <subject><predicate><object>. A rough sketch of the URI-extraction step follows below.
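As a rough illustration of that first step, the sketch below pulls every IRI written in angle brackets out of the query text. The regex is an assumption made for illustration; a real engine would walk the parsed query algebra instead.

import re

def extract_uris(sparql_query):
    # Collect every absolute IRI that appears in angle brackets in the query.
    return re.findall(r"<(https?://[^>\s]+)>", sparql_query)

# The extracted URIs are then handed to WIMU for source filtering.
print(extract_uris("SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"))  # ['http://uri.com']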
10. The approach
The source selection: identify the relevant datasets for each extracted URI via WIMU. A hedged sketch of such a lookup follows below.
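A hedged sketch of the WIMU lookup, assuming Python with the requests package; the /Find path and the JSON response shape below are assumptions for illustration only, so consult http://wimu.aksw.org/ for the actual API.

import requests

def wimu_datasets(uri):
    # Hypothetical REST call: ask WIMU which datasets mention the given URI.
    resp = requests.get("http://wimu.aksw.org/Find", params={"uri": uri}, timeout=30)
    resp.raise_for_status()
    # Assumed response shape: a JSON list of {"dataset": ...} entries.
    return [entry["dataset"] for entry in resp.json()]

for dataset in wimu_datasets("http://uri.com"):
    print(dataset)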
11. Evaluation
Hypothesis Identify automatically relevant sources from heterogeneous RDF
data, even with non-dereferenceable URIs, can improve the
resultset retrieval
Metrics Coverage and runtime
Approaches FedX (endpoints), SQUIN (Traversal-based), SPARQL-a-lot and
WIMU(dumps)
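The two metrics are straightforward; a small sketch, assuming per-query result counts and runtimes collected from the benchmark runs (all names here are illustrative):

def coverage(result_counts):
    # Fraction of benchmark queries that returned at least one result.
    return sum(1 for n in result_counts if n > 0) / len(result_counts)

def average_runtime(runtimes_seconds):
    # Mean runtime over all query executions, in seconds.
    return sum(runtimes_seconds) / len(runtimes_seconds)

# e.g. 415 queries, each executed 5 times, with runtimes averaged per query first.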
12. Evaluation
Experimental setup
Datasets: 221.7 billion triples (>5 terabytes)
Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE; each query executed 5 times
Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor
13. Evaluation
Coverage: overall, 76% of the queries returned results (zero results were caused by non-public endpoints/data or non-indexed data).
[Chart: average number of results on a log scale for the query groups CD, LS, LD, Simple, Comp, Large, Chs, DBpedia and SWDF of FedBench, LargeRDFBench and FEASIBLE, comparing EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.]
Best coverage per benchmark: FedBench 55% (endpoints), LargeRDFBench 81% (wimuDumps), FEASIBLE 98% (wimuDumps).
Observation: combining these query processing engines yields more complete resultset retrieval.
14. Evaluation
Number of datasets: discovering more datasets does not imply more results.
15. Evaluation
Runtime: the total average is 17 minutes across the 3 benchmarks (wimuDumps 2 min, Endpoints 13 min, SPARQL-a-lot 58 s, LinkTraversal-SQUIN 36 s).
[Chart: average runtime in minutes on a log scale for the same query groups and engines as above.]
Interestingly, wimuQ obtains 91% of its results from wimuDumps and only 7% from SPARQL endpoints. A possible reason is that SPARQL endpoint federation is split among multiple endpoints, so the network and the number of intermediate results influence the runtime.
16. Conclusion & Future work
Conclusion
A hybrid SPARQL query processing engine that executes SPARQL queries over a large amount of heterogeneous RDF data.
Evaluated on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE).
We present the first federated SPARQL query processing engine that executes SPARQL queries over a total of 221.7 billion triples.
Future work
Add more URIs to the WIMU index and use Triple Pattern Fragments.
A large-scale approach to study the relations and similarity among the datasets.
17. That's all, folks!
Thanks! Questions?
GitHub repository: https://github.com/firmao/wimuT
Prototype: https://w3id.org/wimuq/
Contact: valdestilhas@informatik.uni-leipzig.de
Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert