More Complete Resultset Retrieval from Large Heterogeneous RDF Sources

Presentation slides for the paper "More Complete Resultset Retrieval from Large Heterogeneous RDF Sources", presented at the K-CAP 2019 conference.

  1. More Complete Resultset Retrieval from Large Heterogeneous RDF Sources. Andre Valdestilhas, Tommaso Soru, Muhammad Saleem. AKSW Group, University of Leipzig, Germany. November 24, 2019.
  2. Outline: Motivation; Approach; Experiments; Conclusion & Future work.
  3. Motivation: The LOD Cloud. [Figure: the LOD Cloud in 2007, 2009 and 2019.]
  4. Motivation: Where to find RDF datasets? 9,960 raw RDF datasets; 658,206 datasets (HDT files) from LODLaundromat; 559 endpoints; different formats (serializations). Which dataset? Querying more than 221 billion triples (> 5 terabytes).
  5. Example: Where to find RDF datasets? Query: authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench).
  6. Example: Where to find RDF datasets? Query: authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench). 4 HDT datasets (Semantic Web Dog Food, from the LOD Laundromat datasets) contain data that can answer the query.
  7. Motivation: Approaches.
       (+) Multiple SPARQL endpoints; (-) 90% are dump files.
       (+) Dereferenceable URIs; (-) 43% of the URIs are non-dereferenceable.
       WIMU (Where is my URI?), covering endpoints, HDT files, file.rdf and dump files: (+) data from non-dereferenceable URIs; (-) no SPARQL query.
  8. The approach: a hybrid SPARQL query engine.
       Collects data from multiple SPARQL endpoints and from RDF dumps, including HDT files.
       Uses link traversal, obtaining data from non-dereferenceable URIs using WIMU (Where is my URI?, http://wimu.aksw.org/).
       Result: more complete results.
       Experiments with 3 state-of-the-art SPARQL query benchmarks: LargeRDFBench, FedBench and FEASIBLE.
  9. The approach. [Diagram: the wimuQ query execution engine.]
       Input query: Select ?p ?o Where {<http://uri.com> ?p ?o}
       Extract URIs from the query: http://uri1.com, http://uri2.com, ..., http://uriN.com
       WIMU maps the URIs to sources: endpoint, HDT file, dump.bz2, file.rdf, ...
       Source filtering.
       Query processors: data dumps, traversal-based, SPARQL-a-lot, SPARQL endpoint.
       Union of the results.
       Results: <subject1><predicate1><object1> <subject2><predicate2><object2> ... <subjectN><predicateN><objectN>
  10. The approach: the source selection. Identify relevant datasets from WIMU.
  11. Evaluation.
       Hypothesis: automatically identifying relevant sources from heterogeneous RDF data, even with non-dereferenceable URIs, can improve resultset retrieval.
       Metrics: coverage and runtime.
       Approaches: FedX (endpoints), SQUIN (traversal-based), SPARQL-a-lot and WIMU (dumps).
  12. Evaluation: experimental setup.
       Datasets: 221.7 billion triples (> 5 terabytes).
       Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE; each query executed 5 times.
       Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor.
  13. Evaluation: coverage. Overall, 76% of the queries returned results (zero results = non-public endpoints/data, non-indexed).
       [Figure: average number of results (log scale) per query category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: DBpedia, SWDF) for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.]
       Approaches with the best coverage: FedBench 55% (endpoints); LargeRDFBench 81% (wimuDumps); FEASIBLE 98% (wimuDumps).
       Observation: combining these query processing engines retrieves more results.
  14. Evaluation: number of datasets. Discovering more datasets does not imply more results.
  15. Evaluation: runtime. Total average of 17 minutes across the 3 benchmarks (wimuDumps 2 min, endpoints 13 min, SPARQL-a-lot 58 sec, LinkTraversal-SQUIN 36 sec).
       [Figure: average runtime in minutes (log scale) per query category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: DBpedia, SWDF) for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.]
       Interesting: wimuQ takes 91% of its results from wimuDumps and only 7% from SPARQL endpoints.
       Possible reason: SPARQL endpoint federation is split among multiple endpoints, and the network and the number of intermediate results influence the runtime.
  16. Conclusion & Future work.
       Conclusion: a hybrid SPARQL query processing engine to execute SPARQL queries over a large amount of heterogeneous RDF data. Evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE). We present the first federated SPARQL query processing engine that executes SPARQL queries over a total of 221.7 billion triples.
       Future work: add more URIs to the WIMU index and use Triple Pattern Fragments. A large-scale approach to study the relation and similarity among the datasets.
  17. That’s all Folks! Thanks! Questions?
       GitHub repository: https://github.com/firmao/wimuT
       Prototype: https://w3id.org/wimuq/
       Contact: valdestilhas@informatik.uni-leipzig.de
       Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert.
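
The pipeline on the approach slide (extract URIs from the query, map them to sources, run several query processors, take the union of the results) can be illustrated with a minimal Python sketch. This is not the wimuQ implementation: the function names, the in-memory `toy_index` standing in for WIMU, and the two toy processors are all hypothetical, and real processors would query endpoints, dumps or HDT files instead of returning fixed triples.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
# A processor takes the query and the selected sources and yields triples.
Processor = Callable[[str, List[str]], Iterable[Triple]]

# URIs written in angle brackets inside the query text.
URI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def extract_uris(query: str) -> List[str]:
    """Collect the URIs that appear in the query, in order, without duplicates."""
    uris: List[str] = []
    for uri in URI_PATTERN.findall(query):
        if uri not in uris:
            uris.append(uri)
    return uris


def select_sources(uris: List[str], index: Dict[str, List[str]]) -> List[str]:
    """Look up each URI in a URI-to-dataset index (hypothetical stand-in for WIMU)."""
    sources: List[str] = []
    for uri in uris:
        for dataset in index.get(uri, []):
            if dataset not in sources:
                sources.append(dataset)
    return sources


def run_hybrid(query: str, index: Dict[str, List[str]],
               processors: List[Processor]) -> Set[Triple]:
    """Run every processor on the selected sources and return the union of results."""
    sources = select_sources(extract_uris(query), index)
    results: Set[Triple] = set()
    for process in processors:
        results |= set(process(query, sources))
    return results


# Toy data and processors, for illustration only.
toy_index = {"http://uri.com": ["dump.bz2", "file.hdt"]}


def toy_dump_processor(query: str, sources: List[str]) -> List[Triple]:
    # Pretends the dump contains one matching triple.
    return [("http://uri.com", "p1", "o1")] if "dump.bz2" in sources else []


def toy_endpoint_processor(query: str, sources: List[str]) -> List[Triple]:
    # Pretends an endpoint returns an overlapping and a new triple.
    return [("http://uri.com", "p1", "o1"), ("http://uri.com", "p2", "o2")]


query = "Select ?p ?o Where {<http://uri.com> ?p ?o}"
results = run_hybrid(query, toy_index, [toy_dump_processor, toy_endpoint_processor])
```

Because the results are merged as a set, a triple returned by more than one processor is counted once, which is one simple way to read "union of the results" on the slide.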
