SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

ISWC'12 research track

Transcript

  • 1. Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. Olaf Görlitz, Matthias Thimm, Steffen Staab. Institute for Web Science and Technologies, University of Koblenz-Landau, Germany.
  • 2. Linked Data Federation: SPARQL Queries on the Linked Data Cloud. [Figure: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/] ISWC12, Boston, 11/15/2012. SPLODGE: Systematic LOD Benchmark Query Generation. Slide 2. Olaf Görlitz, Matthias Thimm, Steffen Staab
  • 3. The Problem: Why not use benchmark queries? [Diagram labels: distributed queries, federation implementation.]
  • 4. RDF Benchmarks. LUBM, BSBM, SP²B, ...: synthetic datasets; domain-specific; highly structured; sophisticated queries; centralized. FedBench (ISWC11): 10 Linked Data sets (~170M triples); 25 handpicked distributed queries; fixed. Goal: a scalable, flexible, expressive Linked Data benchmark.
  • 5. Overview: Benchmark Idea, Methodology, Evaluation.
  • 6. Linked Data Benchmark Features. Scalability: real Linked Data sets. Flexibility: customization. Expressiveness: typical + complex queries. Systematic SPARQL Benchmark Query Generator for Linked Open Data.
  • 7. Requirements. What we want: 1. Define query characteristics (customize benchmark). 2. Automatic query generation (random queries). 3. Query validation (#results > 0).
  • 8. Contribution: methodology and toolset for systematic query generation. Inputs: Linked Data, Config. Steps: Parameterization, Query Generation, Query Validation. Output: Benchmark Queries.
  • 9. Overview: Benchmark Idea, Methodology, Evaluation.
  • 10. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Define typical + challenging distributed queries. No federation query logs are available, so analyze the queries of existing benchmarks. Example (FedBench/LifeScience#5): SELECT ?drug ?keggUrl ?chebiImage WHERE {   ?drug rdf:type drugbank:drugs .   ?drug drugbank:keggCompoundId ?keggDrug .   ?keggDrug bio2rdf:url ?keggUrl .   ?drug drugbank:genericName ?drugBankName .   ?chebiDrug purl:title ?drugBankName .   ?chebiDrug chebi:image ?chebiImage . }
  • 11. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Algebra: query form (Select, Construct, ...); join type (conj. / disj. / left-join); result modifiers (limit, offset, order by). Structure: variable patterns (s, o, s+o, ...); join patterns (star, path); cross product. Cardinality: # data sources; # joins/patterns; # results.
  • 12. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Main query parameter: join structure. [Figure: FedBench queries drawn as path joins and star joins.]
  • 13. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Additional query parameters: # triple patterns, # data sources, result size, ... Path-join: n triple patterns, m sources (m≤n). Star-join: n triple patterns, anchor node (s/o).
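The two join structures on slides 12 and 13 can be illustrated with a small Python sketch. It is not taken from the SPLODGE toolset: the function names `path_join` and `star_join`, the `?anchor` variable, and the `http://example.org/` predicates in the usage note are assumptions made for illustration. The `?varN` naming follows the example query on slide 20.

```python
def path_join(predicates):
    """Chain patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return "SELECT * WHERE { " + " ".join(patterns) + " }"


def star_join(predicates, anchor="s"):
    """All patterns share one anchor variable, as subject ('s') or object ('o')."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"?var{i} <{p}> ?anchor .")
    return "SELECT * WHERE { " + " ".join(patterns) + " }"
```

For example, `path_join(["http://example.org/p1", "http://example.org/p2"])` joins `?var2` as the object of the first pattern and the subject of the second, while `star_join(..., anchor="o")` makes every pattern point at the same object.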
  • 14. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). [Figure: example query graph with predicates rdf:type, owl:sameAs, rdfs:label, foaf:knows.] Iteratively add random triple patterns. #results > 0? Need background knowledge: level of detail? Predicate combinations: how provided?
  • 15. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). [Figure: example query graph with predicates rdf:type, owl:sameAs, rdfs:label, foaf:knows.] Linked Predicates, e.g. (owl:sameAs → rdf:type): DBpedia → geonames (43, 58); freebase → DBpedia (86, 72); ... Characteristic Sets*, e.g. {rdfs:label, foaf:knows, …}: DBpedia (322), rdfs:label (437), foaf:knows (322); ... *[Neumann, Moerkotte, ICDE 2011]
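A minimal sketch of how the two kinds of statistics on slide 15 could be collected from quads of the form (subject, predicate, object, source). This is a simplified reading of the slide, not the toolset's code: the function names are hypothetical, characteristic sets are reduced to a count of subjects per predicate set, and a link is counted once per resource that is the object of one predicate and the subject of another.

```python
from collections import Counter, defaultdict


def characteristic_sets(quads):
    """Count, per source, how many subjects share each set of predicates."""
    predicates_of = defaultdict(set)  # (source, subject) -> predicates
    for s, p, o, source in quads:
        predicates_of[(source, s)].add(p)
    counts = Counter()
    for (source, _subject), preds in predicates_of.items():
        counts[(source, frozenset(preds))] += 1
    return counts


def linked_predicates(quads):
    """Count resources that are the object of p1 and the subject of p2."""
    incoming = defaultdict(set)  # resource -> {(source, predicate)} as object
    outgoing = defaultdict(set)  # resource -> {(source, predicate)} as subject
    for s, p, o, source in quads:
        outgoing[s].add((source, p))
        incoming[o].add((source, p))
    counts = Counter()
    for resource in incoming.keys() & outgoing.keys():
        for src1, p1 in incoming[resource]:
            for src2, p2 in outgoing[resource]:
                counts[(src1, p1, src2, p2)] += 1
    return counts
```

Both functions keep everything in memory, which is only workable for small samples; the pre-processing times on slide 27 suggest the real statistics are built in batch passes over sorted data.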
  • 16. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). [Figure: query graph with predicates p1, p2, p3, p4.] Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi). Characteristic Sets: {p1, p4}, {p1, p4, ...}.
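The idea of iteratively adding triple patterns along known predicate links (slides 14 to 16) can be sketched as a random walk. This is an illustrative assumption about the procedure, not the authors' algorithm: `generate_path` and the shape of `linked` (a mapping from a predicate to the predicates observed to follow it) are hypothetical, and characteristic sets and source selection are left out.

```python
import random


def generate_path(linked, length, rng=random):
    """Randomly extend a predicate path using only observed predicate links.

    `linked` maps a predicate to the predicates that were seen to follow it.
    Returns a list of `length` predicates, or None if the walk dead-ends.
    """
    path = [rng.choice(sorted(linked))]
    while len(path) < length:
        successors = sorted(linked.get(path[-1], ()))
        if not successors:
            return None  # no known continuation; the caller may retry
        path.append(rng.choice(successors))
    return path
```

Passing a seeded `random.Random` instance as `rng` makes a batch of generated queries reproducible.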
  • 17. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Verify generated queries (#results > 0). How to evaluate? Compute a confidence value: minimum join selectivity > ε.
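The slide does not define how the join selectivity is estimated. The sketch below assumes the textbook definition (number of joining pairs divided by the size of the cross product of the two patterns) and applies the threshold test from the slide. The names `min_join_selectivity`, `accept`, `triple_counts`, and `link_counts` are hypothetical.

```python
def min_join_selectivity(path, triple_counts, link_counts):
    """Smallest estimated selectivity over all consecutive joins in a path."""
    selectivities = []
    for p1, p2 in zip(path, path[1:]):
        joined = link_counts.get((p1, p2), 0)
        selectivities.append(joined / (triple_counts[p1] * triple_counts[p2]))
    # A single pattern has no join, so nothing can rule it out.
    return min(selectivities, default=1.0)


def accept(path, triple_counts, link_counts, epsilon=1e-4):
    """Keep a query only if its weakest join is still above the threshold."""
    return min_join_selectivity(path, triple_counts, link_counts) > epsilon
```

A higher `epsilon` rejects more candidate queries; the evaluation slides compare thresholds of 10⁻⁴, 10⁻³, and 10⁻².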
  • 18. Overview: Benchmark Idea, Methodology, Evaluation.
  • 19. Evaluation Objective. Verify generation of valid queries (#results > 0). Compare variations of query generation algorithms. Baseline: "random". SPLODGElite: background predicate knowledge. SPLODGE: + minimum join selectivity (> 10⁻⁴ / 10⁻³ / 10⁻²). Metrics: #queries with non-empty results; #results per query.
  • 20. Evaluation Setup. Real Linked Data: Billion Triple Challenge Dataset. Triple store: RDF3X. Random queries: path-joins across data sources; 3-6 patterns, bound predicates; 100 queries per batch. Example: SELECT * WHERE { ?var1 <http://dbpedia.org/property/description> ?var2 . ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 . ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 . ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 . ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6 }
  • 21. Evaluation Results. [Plot: #queries by number of joined triple patterns.]
  • 22. Evaluation Results. [Plot: #results by number of joined triple patterns.]
  • 23. Estimated vs. actual result size. [Plot: actual result size against estimated result size.]
  • 24. Predicate Occurrence in Queries. [Figure.]
  • 25. Conclusion. SPLODGE provides: flexible query characterization + parameterization; a methodology for systematic & scalable query generation; a toolset as open source (http://code.google.com/p/splodge/). Future work: create a LOD federation benchmark; interactive SPARQL query construction. Questions?
  • 26. SPLODGE Evaluation Setup. BTC 2011 dataset in RDF3X: pure triples, no context; 160 GB repository file (14 h loading, 200 GB tmp mem).
  • 27. SPLODGE Pre-Processing for BTC data. Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h. Replace quad context (reduce number of sources): 4.4 h. Sort quads + remove duplicates: 8.5 h. Build predicate/context dictionary: 1.0 h. Create resource in/out-link index: 9.7 h. Create linked predicate stats, compute characteristic sets: 1.6 h. File sizes shown beside the steps: 17 GB gzip, <1 MB gzip, 1.7 GB gzip.
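Two of the pre-processing steps on slide 27, replacing the quad context and sorting with duplicate removal, can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: the function names are hypothetical, and reducing a context URI to its host name is a simplification of "identify common domains", which the slide does not specify in detail.

```python
from urllib.parse import urlsplit


def context_to_domain(context):
    """Reduce a context URI to its host name, e.g. http://a.org/x/y -> a.org."""
    return urlsplit(context).hostname or context


def preprocess(quads):
    """Replace contexts by domains, then sort and drop duplicate quads."""
    replaced = {(s, p, o, context_to_domain(c)) for s, p, o, c in quads}
    return sorted(replaced)
```

Collapsing contexts first means that the same triple published on two pages of one site becomes a single quad. At the scale of the BTC data an external, disk-based sort would be needed instead of a Python set.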