More Complete Resultset Retrieval from Large
Heterogeneous RDF Sources
Andre Valdestilhas Tommaso Soru Muhammad Saleem
AKSW Group, University of Leipzig, Germany
November 24, 2019
Valdestilhas et al. (AKSW) More Complete Resultset Retrieval from Large Heterogeneous RDF SourcesNovember 24, 2019 1 / 17
Outline
Motivation
Approach
Experiments
Conclusion & Future work
Motivation
The LOD Cloud in 2007, 2009, and 2019
[Figures: LOD Cloud diagrams for the three years]
Motivation
Where to find RDF datasets?
9,960 raw RDF datasets
658,206 datasets (HDT files) from LOD Laundromat
559 endpoints
Different formats (serializations)
Which dataset?
Query more than 221 billion triples (> 5 terabytes)
Example
Where to find RDF datasets?
Query (from FedBench): authors that have a paper of type poster/demo in the proceedings of ISWC 2008
4 HDT datasets (Semantic Web Dog Food, from the LOD Laundromat datasets) contain data that can answer the query
Motivation
Approaches
(+) Multiple SPARQL endpoints
(-) 90% are dump files
(+) Dereferenceable URIs
(-) 43% of the URIs are non-dereferenceable
[Figure: Endpoint, HDT file, file.rdf, Dump_file_2]
WIMU (Where is my URI?)
(+) Data from non-dereferenceable URIs
(-) No SPARQL query
The approach
A hybrid SPARQL query engine that
collects data from multiple SPARQL endpoints,
collects data from RDF dumps, including HDT files, and
uses Link Traversal, obtaining data from non-dereferenceable URIs using WIMU (Where is my URI?, http://wimu.aksw.org/)
Resulting in
More complete results
Experiments with 3 state-of-the-art SPARQL query benchmarks: LargeRDFBench, FedBench and FEASIBLE
The approach
[Figure: wimuQ query execution engine, six numbered steps. Components shown:]
Input query: SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }
Extract URIs: http://uri1.com, http://uri2.com, ..., http://uriN.com
WIMU and source filtering over the available sources: Endpoint, HDT file, dump.bz2, file.rdf, ...
Query processors: Data Dumps, Traversal Based, SPARQL-a-lot, SPARQL Endpoint
Union of the results
Results: <subject1> <predicate1> <object1>, <subject2> <predicate2> <object2>, ..., <subjectN> <predicateN> <objectN>
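The "union of the results" step can be sketched in Python. This is a minimal sketch, not the wimuQ implementation: the three processor functions are hypothetical stand-ins that return hard-coded triples, and skipping a failing source is an assumption made for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Processor = Callable[[str], Iterable[Triple]]


def union_of_results(query: str, processors: List[Processor]) -> Set[Triple]:
    """Run every processor on the query and merge the results without duplicates."""
    merged: Set[Triple] = set()
    for run in processors:
        try:
            merged.update(run(query))
        except Exception:
            # Assumption: a failing source should not abort the whole query.
            continue
    return merged


# Hypothetical stand-ins for the query processors (hard-coded toy data).
def endpoint_processor(query: str) -> Iterable[Triple]:
    return [("s1", "p1", "o1"), ("s2", "p2", "o2")]


def dump_processor(query: str) -> Iterable[Triple]:
    return [("s2", "p2", "o2"), ("s3", "p3", "o3")]


def unavailable_processor(query: str) -> Iterable[Triple]:
    raise ConnectionError("source not reachable")


QUERY = "SELECT ?p ?o WHERE { <http://uri.com> ?p ?o }"
results = union_of_results(
    QUERY, [endpoint_processor, dump_processor, unavailable_processor]
)
print(sorted(results))
```

Using a set means a triple returned by several processors appears only once in the merged resultset.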
The approach
The source selection
Identify relevant datasets from WIMU
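One way to picture source selection is the following Python sketch. It assumes that the URIs are written in full as <http://...> in the query (prefixed names are not handled), and it uses a hypothetical in-memory dictionary in place of the WIMU index. The function names and the toy data are illustrative only.

```python
import re
from typing import Dict, List, Set

# Matches IRIs written as <http://...> or <https://...> in a SPARQL query.
IRI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def extract_uris(query: str) -> List[str]:
    """Return the distinct IRIs in the query, in order of first appearance."""
    seen: Set[str] = set()
    uris: List[str] = []
    for uri in IRI_PATTERN.findall(query):
        if uri not in seen:
            seen.add(uri)
            uris.append(uri)
    return uris


def select_sources(query: str, index: Dict[str, List[str]]) -> List[str]:
    """Collect the datasets that the index lists for any URI in the query."""
    sources: List[str] = []
    for uri in extract_uris(query):
        for dataset in index.get(uri, []):
            if dataset not in sources:
                sources.append(dataset)
    return sources


# Hypothetical toy index (URI -> datasets); the real WIMU index is a service.
TOY_INDEX = {
    "http://uri1.com": ["file.rdf", "dump.bz2"],
    "http://uri2.com": ["dump.bz2", "dataset.hdt"],
}

QUERY = """
SELECT ?o WHERE {
  <http://uri1.com> <http://uri2.com> ?o .
  <http://uri1.com> <http://unknown.example/p> ?x .
}
"""
selected = select_sources(QUERY, TOY_INDEX)
print(selected)
```

URIs that the index does not know simply contribute no datasets, so the query still runs against whatever sources were found.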
Evaluation
Hypothesis: Automatically identifying relevant sources from heterogeneous RDF data, even with non-dereferenceable URIs, can improve resultset retrieval
Metrics: Coverage and runtime
Approaches: FedX (endpoints), SQUIN (traversal-based), SPARQL-a-lot and WIMU (dumps)
Evaluation
Experimental setup
Datasets: 221.7 billion triples (> 5 terabytes)
Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE; each query executed 5 times
Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor
Evaluation
Coverage: overall, 76% of the queries returned results (zero results = non-public endpoints or non-indexed data)
[Figure: average number of results (log scale) per benchmark category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: DBpedia, SWDF) for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ]
Approaches with the best coverage
FedBench: 55% (endpoints)
LargeRDFBench: 81% (wimuDumps)
FEASIBLE: 98% (wimuDumps)
Observation
Combining these query processing engines retrieves more results
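The coverage metric can be illustrated with a short Python sketch. It assumes coverage means the percentage of queries that return at least one result, which is how the slide words it. The counts below are hypothetical and are not the measurements from the talk.

```python
from typing import Dict


def coverage(result_counts: Dict[str, int]) -> float:
    """Percentage of queries that returned at least one result."""
    if not result_counts:
        return 0.0
    answered = sum(1 for count in result_counts.values() if count > 0)
    return 100.0 * answered / len(result_counts)


# Hypothetical result counts per query (not the measurements from the talk).
toy_counts = {"q1": 120, "q2": 0, "q3": 7, "q4": 1}
print(f"{coverage(toy_counts):.0f}% of queries returned results")
```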
Evaluation
Number of datasets
Discovering more datasets does not imply more results
Evaluation
Runtime
Total average: 17 minutes across the 3 benchmarks (wimuDumps 2 min, Endpoints 13 min, SPARQL-a-lot 58 sec, LinkTraversal-SQUIN 36 sec)
[Figure: average runtime in minutes (log scale) per benchmark category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: DBpedia, SWDF) for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ]
Interesting: wimuQ takes 91% of its results from wimuDumps and only 7% from SPARQL endpoints.
Possible reason: SPARQL endpoint federation splits the query among multiple endpoints, and the network and the number of intermediate results influence the runtime
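As a quick check of the runtime figures, the arithmetic can be written out in Python. This assumes the 17-minute total is the sum of the four per-engine averages, which the slide does not state explicitly.

```python
# Per-engine average runtimes from the slide, converted to seconds.
runtimes_sec = {
    "wimuDumps": 2 * 60,
    "Endpoints": 13 * 60,
    "SPARQL-a-lot": 58,
    "LinkTraversal-SQUIN": 36,
}

total_sec = sum(runtimes_sec.values())
total_min = total_sec / 60
# Share of the total runtime per engine, in percent.
share = {name: 100 * sec / total_sec for name, sec in runtimes_sec.items()}

print(f"total: {total_sec} s = {total_min:.1f} min")
for name, pct in share.items():
    print(f"{name}: {pct:.0f}% of the total")
```

Under that assumption the sum is 994 seconds, which rounds to 17 minutes, and the endpoint engine accounts for roughly 78% of it.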
Conclusion & Future work
Conclusion
A hybrid SPARQL query processing engine that executes SPARQL queries over a large amount of heterogeneous RDF data
Evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE)
We present the first federated SPARQL query processing engine that executes SPARQL queries over a total of 221.7 billion triples
Future work
Add more URIs to the WIMU index and use Triple Pattern Fragments
A large-scale approach to study the relations and similarity among the datasets
That’s all Folks!
Thanks!
Questions?
GitHub repository: https://github.com/firmao/wimuT
Prototype: https://w3id.org/wimuq/
Contact: valdestilhas@informatik.uni-leipzig.de
Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert