More Complete Resultset Retrieval from Large Heterogeneous RDF Sources

Presentation slides for the paper "More Complete Resultset Retrieval from Large Heterogeneous RDF Sources", presented at the K-CAP 2019 conference.

  1. More Complete Resultset Retrieval from Large Heterogeneous RDF Sources. Andre Valdestilhas, Tommaso Soru, Muhammad Saleem. AKSW Group, University of Leipzig, Germany. November 24, 2019. Valdestilhas et al. (AKSW), More Complete Resultset Retrieval from Large Heterogeneous RDF Sources, November 24, 2019, 1 / 17
  2. Outline: Motivation; Approach; Experiments; Conclusion & Future work
  3. Motivation: The LOD Cloud in 2007, 2009, and 2019 [figure]
  4. Motivation: Where to find RDF datasets? 9,960 raw RDF datasets; 658,206 datasets (HDT files) from LODLaundromat; 559 endpoints; different formats (serializations). Which dataset? Query more than 221 billion triples (> 5 terabytes).
  5. Example: Where to find RDF datasets? Query: authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench).
  6. Example: Where to find RDF datasets? Query: authors that have a paper of type poster/demo in the proceedings of ISWC 2008 (query from FedBench). 4 HDT datasets (Semantic Web Dog Food, from the LOD Laundromat datasets) contain data that can answer the query.
  7. Motivation: Approaches. (+) Multiple SPARQL endpoints, (-) 90% are dump files; (+) dereferenceable URIs, (-) 43% of the URIs are non-dereferenceable; WIMU (Where is my URI?): (+) data from non-dereferenceable URIs, (-) no SPARQL query. [Diagram labels: Endpoint, HDT file, file.rdf, Dump_file_2]
  8. The approach: A hybrid SPARQL query engine. It collects data from multiple SPARQL endpoints and from RDF dumps, including HDT files, and uses link traversal, obtaining data from non-dereferenceable URIs using WIMU (Where is my URI?, http://wimu.aksw.org/). This results in more complete results. Experiments with 3 state-of-the-art SPARQL query benchmarks: LargeRDFBench, FedBench and FEASIBLE.
  9. The approach: wimuQ query execution engine. [Diagram with steps numbered 1 to 6. Elements: input query, e.g. Select ?p ?o Where {<http://uri.com> ?p ?o}; Extract URIs (http://uri1.com, http://uri2.com, ..., http://uriN.com); WIMU; sources (Endpoint, hdt file, dump.bz2, file.rdf, ...); Source Filtering; query processors (Data Dumps, Traversal Based, SPARQL-a-lot, SPARQL Endpoint); Union of the results; Results (<subject1> <predicate1> <object1>, <subject2> <predicate2> <object2>, ..., <subjectN> <predicateN> <objectN>)]
  10. The approach: The source selection. Identify relevant datasets from WIMU.
  11. Evaluation. Hypothesis: Automatically identifying relevant sources from heterogeneous RDF data, even with non-dereferenceable URIs, can improve resultset retrieval. Metrics: coverage and runtime. Approaches: FedX (endpoints), SQUIN (traversal-based), SPARQL-a-lot and WIMU (dumps).
  12. Evaluation: Experimental setup. Datasets: 221.7 billion triples (> 5 terabytes). Queries: 415 queries from FedBench, LargeRDFBench and FEASIBLE; each query executed 5 times. Hardware: 200 GB HD, 8 GB RAM, 2.70 GHz single-core processor.
  13. Evaluation: Coverage. Overall, 76% of queries return results (zero results = non-public endpoints/data, or non-indexed). [Figure: average number of results on a log scale, by query category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: Dbpedia, SWDF), for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.] Approaches with the best coverage: FedBench 55% (endpoints); LargeRDFBench 81% (wimuDumps); FEASIBLE 98% (wimuDumps). Observation: Combining these query processing engines retrieves more results.
  14. Evaluation: Number of datasets. Discovering more datasets does not imply more results.
  15. Evaluation: Runtime. Total average of 17 minutes across the 3 benchmarks (wimuDumps 2 min, Endpoints 13 min, SPARQL-a-lot 58 sec, LinkTraversal-SQUIN 36 sec). [Figure: average runtime in minutes on a log scale, by query category (FedBench: CD, LS, LD; LargeRDFBench: Simple, Comp, Large, Chs; FEASIBLE: Dbpedia, SWDF), for EndPoints (FedX), SPARQL-a-lot, LinkTraversal (SQUIN), wimuDumps and wimuQ.] Interesting: wimuQ takes 91% of its results from wimuDumps and only 7% from SPARQL endpoints. Possible reason: SPARQL endpoint federation is split among multiple endpoints, and the network and the number of intermediate results influence the runtime.
  16. Conclusion & Future work. Conclusion: A hybrid SPARQL query processing engine to execute SPARQL queries over a large amount of heterogeneous RDF data. Evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE). We present the first federated SPARQL query processing engine that executes SPARQL queries over a total of 221.7 billion triples. Future work: Add more URIs to the WIMU index and use Triple Pattern Fragments. A large-scale approach to study the relation and similarity among the datasets.
  17. That's all, folks! Thanks! Questions? GitHub repository: https://github.com/firmao/wimuT Prototype: https://w3id.org/wimuq/ Contact: valdestilhas@informatik.uni-leipzig.de Special thanks to my PhD advisor, Prof. Dr. rer. nat. Thomas Riechert
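
The pipeline on slide 9 (extract the URIs from the query, look them up to find candidate sources, run several query processors, and take the union of their results) can be illustrated with a minimal Python sketch. This is not the wimuQ implementation: the in-memory `index` dictionary only stands in for a WIMU lookup, and the function names, the `processors` mapping, and the triple representation are hypothetical choices made for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
# Hypothetical processor signature: (query, candidate sources) -> triples.
Processor = Callable[[str, List[str]], Iterable[Triple]]

# Matches absolute http(s) URIs written in angle brackets. Note that this
# also picks up namespace URIs from PREFIX declarations.
URI_PATTERN = re.compile(r"<(https?://[^<>\s]+)>")


def extract_uris(query: str) -> List[str]:
    """Return the distinct URIs of a SPARQL query, in order of appearance."""
    uris: List[str] = []
    for uri in URI_PATTERN.findall(query):
        if uri not in uris:
            uris.append(uri)
    return uris


def select_sources(uris: List[str], index: Dict[str, List[str]]) -> List[str]:
    """Collect the datasets that the (stand-in) index lists for each URI."""
    sources: List[str] = []
    for uri in uris:
        for source in index.get(uri, []):
            if source not in sources:
                sources.append(source)
    return sources


def run_hybrid(query: str,
               index: Dict[str, List[str]],
               processors: Dict[str, Processor]) -> Set[Triple]:
    """Run every processor on the selected sources and union the results."""
    sources = select_sources(extract_uris(query), index)
    results: Set[Triple] = set()
    for processor in processors.values():
        # Set union removes triples returned by more than one processor.
        results |= set(processor(query, sources))
    return results
```

Using a set for the union means that a triple returned by several processors is counted once, which matters when result counts are compared across engines.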

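The two evaluation metrics (coverage and runtime) can be made concrete with a short Python sketch. It rests on two assumptions that the slides do not state explicitly: that coverage is the share of queries returning at least one result, as the "76% queries with results" figure suggests, and that the 17-minute total is the sum of the four per-engine averages. The function names and the dictionary are hypothetical.

```python
from typing import Dict, Sequence


def coverage(result_counts: Sequence[int]) -> float:
    """Share of queries with a non-empty result set (assumed definition)."""
    if not result_counts:
        return 0.0
    answered = sum(1 for count in result_counts if count > 0)
    return answered / len(result_counts)


def total_seconds(avg_seconds: Dict[str, int]) -> int:
    """Sum of the per-engine average runtimes, in seconds."""
    return sum(avg_seconds.values())


# Average runtimes as reported on the runtime slide, converted to seconds.
ENGINE_AVG_SECONDS = {
    "wimuDumps": 2 * 60,
    "Endpoints (FedX)": 13 * 60,
    "SPARQL-a-lot": 58,
    "LinkTraversal (SQUIN)": 36,
}

if __name__ == "__main__":
    total = total_seconds(ENGINE_AVG_SECONDS)
    print(f"total: {total} s = {total / 60:.1f} min")
```

Under the summation assumption, the four averages add up to 994 seconds, about 16.6 minutes, which is consistent with the reported 17 minutes after rounding.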