SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

ISWC'12 research track

Transcript

  • 1. Institute for Web Science and Technologies, University of Koblenz ▪ Landau, Germany. Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. Olaf Görlitz, Matthias Thimm, Steffen Staab
  • 2. Linked Data Federation: SPARQL Queries on the Linked Data Cloud. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
    ISWC12, Boston, 11/15/2012. SPLODGE: Systematic LOD Benchmark Query Generation. Slide 2. Olaf Görlitz, Matthias Thimm, Steffen Staab
  • 3. The Problem: Why not use benchmark queries? [Diagram labels: distributed queries, federation implementation]
  • 4. RDF Benchmarks
    LUBM, BSBM, SP²B, ...: synthetic datasets; domain-specific; highly structured; sophisticated queries; centralized.
    FedBench (ISWC11): 10 Linked Data sets (~170M triples); 25 handpicked distributed queries; fixed.
    Scalable, Flexible, Expressive Linked Data Benchmark
  • 5. Overview: Benchmark Idea, Methodology, Evaluation
  • 6. Linked Data Benchmark Features
    Scalability: Real Linked Data Sets
    Flexibility: Customization
    Expressiveness: Typical + Complex Queries
    Systematic SPARQL Benchmark Query Generator for Linked Open Data
  • 7. Requirements. What we want:
    1. Define Query Characteristics: Customize Benchmark
    2. Automatic Query Generation: Random Queries
    3. Query Validation: #results > 0
  • 8. Contribution: Methodology and toolset for systematic query generation. [Diagram labels: Linked Data, Config, Benchmark Queries; steps: Parameterization, Query Generation, Query Validation]
  • 9. Overview: Benchmark Idea, Methodology, Evaluation
  • 10. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    Define typical + challenging distributed queries. No federation query logs available. Analyze queries of benchmarks.
    FedBench/LifeScience#5:
    SELECT ?drug ?keggUrl ?chebiImage WHERE {
      ?drug rdf:type drugbank:drugs .
      ?drug drugbank:keggCompoundId ?keggDrug .
      ?keggDrug bio2rdf:url ?keggUrl .
      ?drug drugbank:genericName ?drugBankName .
      ?chebiDrug purl:title ?drugBankName .
      ?chebiDrug chebi:image ?chebiImage .
    }
  • 11. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    Algebra: Query Form (Select, Construct, ...); Join Type (conj. / disj. / left-join); Result Modifiers (limit, offset, order by)
    Structure: Variable Patterns (s, o, s+o, ...); Join Patterns (star, path); Cross Product
    Cardinality: # Data Sources; # Joins/Patterns; # Results
  • 12. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    Main query parameter: join structure. [Figure labels: path join, FedBench queries, star join]
  • 13. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    Additional query parameters: # triple patterns, # data sources, result size, ...
    Path-join: n triple patterns, m sources (m≤n)
    Star-join: n triple patterns, anchor node (s/o)
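The two join structures named on slides 12 and 13 can be illustrated with a small sketch. This is not the SPLODGE toolset's code: the function names `path_join_query` and `star_join_query` and the `?anchor` variable are assumptions made for illustration, and the number of data sources (m) is not modeled here. The variable naming `?var1`, `?var2`, ... follows the sample query on slide 20.

```python
def path_join_query(predicates):
    """Build a path-shaped query: the object variable of pattern i
    is the subject variable of pattern i + 1."""
    if not predicates:
        raise ValueError("need at least one predicate")
    patterns = [
        f"  ?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def star_join_query(predicates, anchor="s"):
    """Build a star-shaped query: every pattern shares one anchor variable,
    placed in subject ('s') or object ('o') position."""
    if not predicates:
        raise ValueError("need at least one predicate")
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"  ?var{i} <{p}> ?anchor .")
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


if __name__ == "__main__":
    # Hypothetical predicate IRIs, used only to show the output shape.
    preds = ["http://example.org/p1", "http://example.org/p2"]
    print(path_join_query(preds))
    print(star_join_query(preds, anchor="o"))
```

Both builders use bound predicates and unbound subjects and objects, which matches the "bound predicates" setting mentioned in the evaluation setup.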
  • 14. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    [Figure: query path with predicates owl:sameAs, rdf:type, rdfs:label, foaf:knows]
    Iteratively add random triple pattern. #results > 0?
    Need background knowledge (level of detail?): predicate combinations (how provided?)
  • 15. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    [Figure: query path with predicates owl:sameAs, rdf:type, rdfs:label, foaf:knows]
    Linked Predicates: (owl:sameAs → rdf:type); DBpedia → geonames (43, 58); freebase → DBpedia (86, 72); ...
    Characteristic Sets*: {rdfs:label, foaf:knows, …}; DBpedia (322), rdfs:label (437), foaf:knows (322); ...
    *[Neumann, Moerkotte, ICDE 2011]
  • 16. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    [Figure: query with predicates p1, p2, p3, p4]
    Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi)
    Characteristic Sets: {p1, p4}, {p1, p4, ...}
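The two kinds of background statistics on slides 15 and 16 can be sketched on a toy triple list. Characteristic sets follow the cited definition (the set of predicates that occur with a given subject, counted over subjects). For linked predicates, the sketch assumes a pair (p1 → p2) is counted once per resource that is the object of a p1 triple and the subject of a p2 triple; the slides do not spell out the exact statistics (for example the number pairs next to "DBpedia → geonames"), and per-source bookkeeping is left out. Function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def characteristic_sets(triples):
    """Count how many subjects share each set of outgoing predicates."""
    preds_by_subject = defaultdict(set)
    for s, p, _o in triples:
        preds_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in preds_by_subject.values())


def linked_predicates(triples):
    """Count, per pair (p1, p2), the resources that are the object of a
    p1 triple and the subject of a p2 triple (assumed definition)."""
    in_preds = defaultdict(set)   # resource -> predicates pointing at it
    out_preds = defaultdict(set)  # resource -> predicates leaving it
    for s, p, o in triples:
        out_preds[s].add(p)
        in_preds[o].add(p)
    counts = Counter()
    for node, incoming in in_preds.items():
        for p1 in incoming:
            for p2 in out_preds.get(node, ()):
                counts[(p1, p2)] += 1
    return counts


# Toy data, invented for illustration only.
TOY_TRIPLES = [
    ("a", "owl:sameAs", "b"),
    ("b", "rdf:type", "T"),
    ("b", "rdfs:label", "B"),
    ("c", "rdfs:label", "C"),
    ("c", "rdf:type", "T"),
]

if __name__ == "__main__":
    print(characteristic_sets(TOY_TRIPLES))
    print(linked_predicates(TOY_TRIPLES))
```

A path query can then be extended only with predicate pairs that have a non-zero link count, which is one way to read the chaining (p1 → p2) ⊗ (p2 → p3) on slide 16.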
  • 17. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation)
    Verify generated queries (#results > 0). How to evaluate? Compute confidence value: minimum join selectivity > ε
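The validation criterion on slide 17 can be sketched as a threshold test over the joins of a path query. The slides do not define how join selectivity is computed, so the sketch assumes the textbook estimate: the number of joining pairs divided by the product of the two predicate cardinalities. The names `join_selectivity`, `min_join_selectivity`, `pred_card`, and `link_count` are hypothetical.

```python
def join_selectivity(join_count, card_left, card_right):
    """Textbook estimate: joining pairs / (|left| * |right|).
    This formula is an assumption, not taken from the slides."""
    if card_left <= 0 or card_right <= 0:
        return 0.0
    return join_count / (card_left * card_right)


def min_join_selectivity(path, pred_card, link_count):
    """Smallest selectivity over all consecutive predicate pairs of a path."""
    if len(path) < 2:
        raise ValueError("a path query needs at least two predicates to join")
    return min(
        join_selectivity(
            link_count.get((p1, p2), 0),
            pred_card.get(p1, 0),
            pred_card.get(p2, 0),
        )
        for p1, p2 in zip(path, path[1:])
    )


def accept_query(path, pred_card, link_count, epsilon):
    """Keep a generated path query only if every join is selective enough."""
    return min_join_selectivity(path, pred_card, link_count) > epsilon


if __name__ == "__main__":
    # Invented statistics, used only to exercise the functions.
    pred_card = {"p1": 100, "p2": 50, "p3": 10}
    link_count = {("p1", "p2"): 50, ("p2", "p3"): 0}
    print(accept_query(["p1", "p2"], pred_card, link_count, 1e-3))
    print(accept_query(["p1", "p2", "p3"], pred_card, link_count, 1e-3))
```

A pair with no recorded links yields selectivity 0, so any path containing it is rejected for every positive threshold.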
  • 18. Overview: Benchmark Idea, Methodology, Evaluation
  • 19. Evaluation Objective
    Verify generation of valid queries (#results > 0). Compare variations of query generation algorithms:
    Baseline: "random"
    SPLODGElite: background predicate knowledge
    SPLODGE: + minimum join selectivity (> 10⁻⁴ / 10⁻³ / 10⁻²)
    Metrics: #queries with non-empty results; #results per query
  • 20. Evaluation Setup
    Real Linked Data: Billion Triple Challenge Dataset
    Random queries: path-joins across data sources; 3-6 patterns, bound predicates; 100 queries per batch
    Triple Store: RDF3X
    SELECT * WHERE {
      ?var1 <http://dbpedia.org/property/description> ?var2 .
      ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
      ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
      ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
      ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
    }
  • 21. Evaluation Results [Plot: #queries vs. joined triple patterns]
  • 22. Evaluation Results [Plot: #results vs. joined triple patterns]
  • 23. Estimated vs. actual result size [Plot axes: actual result size, estimated result size]
  • 24. Predicate Occurrence in Queries
  • 25. Conclusion
    SPLODGE provides: flexible query characterization + parameterization; methodology for systematic & scalable query generation; toolset as open source (http://code.google.com/p/splodge/)
    Future Work: create a LOD Federation Benchmark; interactive SPARQL query construction
    Questions?
  • 26. SPLODGE Evaluation Setup
    BTC 2011 dataset in RDF3X: pure triples, no context; 160 GB repository file (14 h loading, 200 GB tmp mem)
  • 27. SPLODGE Pre-Processing for BTC data
    Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h
    Replace quad context (reduce number of sources): 4.4 h
    Sort quads + remove duplicates: 8.5 h
    Build predicate/context dictionary: 1.0 h
    Create resource in/out-link index: 9.7 h
    Create linked predicate stats
    Compute characteristic sets: 1.6 h
    [Data sizes shown alongside the steps: 17 GB gzip, <1 MB gzip, 1.7 GB gzip]
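Two of the pre-processing steps on slide 27, replacing the quad context to reduce the number of sources and sorting quads while removing duplicates, can be sketched as follows. The slide does not state how common domains are identified, so this sketch simply assumes the context URL is reduced to its host name. It also works in memory, which would not scale to the BTC data. The function names and sample quads are hypothetical.

```python
from urllib.parse import urlsplit


def context_to_source(context_url):
    """Map a quad's context URL to a coarser source label.
    Assumption: the host name stands in for the 'common domain'."""
    host = urlsplit(context_url).hostname
    return host if host else context_url


def normalize_quads(quads):
    """Replace each context with its source label, drop the duplicates
    this creates, and return the quads in sorted order."""
    replaced = ((s, p, o, context_to_source(c)) for s, p, o, c in quads)
    return sorted(set(replaced))


if __name__ == "__main__":
    # Invented quads: two contexts on the same host collapse into one source.
    sample = [
        ("s1", "p", "o", "http://example.org/a"),
        ("s1", "p", "o", "http://example.org/b"),
        ("s0", "p", "o", "http://other.example/x"),
    ]
    for quad in normalize_quads(sample):
        print(quad)
```

Collapsing contexts before deduplication matters because two quads that differ only in their document URL become identical once both map to the same source.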