SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data


ISWC'12 research track

  1. Institute for Web Science and Technologies, University of Koblenz ▪ Landau, Germany. Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. Olaf Görlitz, Matthias Thimm, Steffen Staab
  2. Linked Data Federation: SPARQL Queries on the Linked Data Cloud. (Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/) Slide footer: ISWC12, Boston, 11/15/2012. SPLODGE: Systematic LOD Benchmark Query Generation. Olaf Görlitz, Matthias Thimm, Steffen Staab
  3. The Problem: Why not use benchmark queries? (Diagram labels: distributed queries, federation implementation.)
  4. RDF Benchmarks. LUBM, BSBM, SP²B, ...: synthetic datasets; domain-specific; highly structured; sophisticated queries; centralized. FedBench (ISWC11): 10 Linked Data sets (~170M triples); 25 handpicked distributed queries; fixed. Scalable, Flexible, Expressive Linked Data Benchmark.
  5. Overview: Benchmark Idea, Methodology, Evaluation
  6. Linked Data Benchmark Features: Scalability (real Linked Data sets), Flexibility (customization), Expressiveness (typical + complex queries). Systematic SPARQL Benchmark Query Generator for Linked Open Data.
  7. Requirements. What we want: (1) define query characteristics (customize benchmark); (2) automatic query generation (random queries); (3) query validation (#results > 0).
  8. Contribution: Methodology and toolset for systematic query generation. Linked Data + Config → Parameterization → Query Generation → Query Validation → Benchmark Queries.
  9. Overview: Benchmark Idea, Methodology, Evaluation
  10. SPLODGE Methodology (Query Parameterization, Query Generation, Query Validation). Define typical + challenging distributed queries. No federation query logs available; analyze queries of benchmarks. Example, FedBench/LifeScience#5: SELECT ?drug ?keggUrl ?chebiImage WHERE { ?drug rdf:type drugbank:drugs . ?drug drugbank:keggCompoundId ?keggDrug . ?keggDrug bio2rdf:url ?keggUrl . ?drug drugbank:genericName ?drugBankName . ?chebiDrug purl:title ?drugBankName . ?chebiDrug chebi:image ?chebiImage . }
  11. SPLODGE Methodology: query characteristics. Algebra: query form (Select, Construct, ...); join type (conj. / disj. / left-join); result modifiers (limit, offset, order by). Structure: variable patterns (s, o, s+o, ...); join patterns (star, path); cross product. Cardinality: # data sources; # joins/patterns; # results.
  12. SPLODGE Methodology. Main query parameter: join structure, i.e. path join or star join (FedBench queries).
  13. SPLODGE Methodology. Additional query parameters: # triple patterns, # data sources, result size, ... Path-join: n triple patterns, m sources (m ≤ n). Star-join: n triple patterns, anchor node (s/o).
  14. SPLODGE Methodology. (Graph diagram with edges rdf:type, owl:sameAs, rdfs:label, foaf:knows.) Iteratively add random triple patterns. #results > 0? Need background knowledge (level of detail?): predicate combinations (how provided?).
  15. SPLODGE Methodology. (Graph diagram with edges rdf:type, owl:sameAs, rdfs:label, foaf:knows.) Linked Predicates, e.g. (owl:sameAs → rdf:type): DBpedia → geonames (43, 58); freebase → DBpedia (86, 72); ... Characteristic Sets*, e.g. {rdfs:label, foaf:knows, …}: DBpedia (322), rdfs:label (437), foaf:knows (322), ... *[Neumann, Moerkotte, ICDE 2011]
  16. SPLODGE Methodology. (Path diagram over predicates p1, p2, p3, p4.) Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi). Characteristic Sets: {p1, p4}, {p1, p4, ...}.
  17. SPLODGE Methodology. Verify generated queries (#results > 0). How to evaluate? Compute a confidence value: minimum join selectivity > ε.
  18. Overview: Benchmark Idea, Methodology, Evaluation
  19. Evaluation Objective: Verify generation of valid queries (#results > 0). Compare variations of query generation algorithms: Baseline ("random"); SPLODGElite (background predicate knowledge); SPLODGE (+ minimum join selectivity, > 10⁻⁴ / 10⁻³ / 10⁻²). Metrics: #queries with non-empty results; #results per query.
  20. Evaluation Setup. Real Linked Data: Billion Triple Challenge Dataset. Random queries: path-joins across data sources; 3-6 patterns, bound predicates; 100 queries per batch. Triple Store: RDF3X. Example query: SELECT * WHERE { ?var1 <http://dbpedia.org/property/description> ?var2 . ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 . ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 . ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 . ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6 }
  21. Evaluation Results. (Plot: #queries vs. joined triple patterns.)
  22. Evaluation Results. (Plot: #results vs. joined triple patterns.)
  23. Estimated vs. actual result size. (Plot: actual result size vs. estimated result size.)
  24. Predicate Occurrence in Queries. (Plot.)
  25. Conclusion. SPLODGE provides: flexible query characterization + parameterization; a methodology for systematic & scalable query generation; the toolset as Open Source (http://code.google.com/p/splodge/). Future Work: create a LOD Federation Benchmark; interactive SPARQL query construction. Questions?
  26. SPLODGE Evaluation Setup: BTC 2011 dataset in RDF3X; pure triples, no context; 160 GB repository file (14 h loading, 200 GB tmp mem).
  27. SPLODGE Pre-Processing for BTC data: identify common domains, e.g. jane08.lifejournal.com/home (3.0 h); replace quad context to reduce the number of sources (4.4 h); sort quads + remove duplicates (8.5 h); build predicate/context dictionary (1.0 h); create resource in/out-link index (9.7 h); create linked predicate stats; compute characteristic sets (1.6 h). Data sizes shown in the diagram: 17 GB gzip, <1 MB gzip, 1.7 GB gzip.
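
The path-join generation described in slides 14 to 17 (iteratively add a triple pattern, use linked-predicate statistics as background knowledge, require a minimum join selectivity) can be sketched roughly as follows. This is a minimal illustration, not the SPLODGE implementation: the function name, the dictionary layout, and all selectivity values are assumptions made up for the example; only the predicate URIs are taken from the example query on slide 20.

```python
import random

DESCRIPTION = "http://dbpedia.org/property/description"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DISJOINT = "http://www.w3.org/2002/07/owl#disjointWith"
MOD_DATE = "http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate"

# Hypothetical linked-predicate statistics: (p1, p2) -> estimated selectivity
# of the join "?a p1 ?b . ?b p2 ?c". The numbers are invented for illustration.
LINKED_PREDICATES = {
    (DESCRIPTION, RDF_TYPE): 0.02,
    (RDF_TYPE, DISJOINT): 0.005,
    (DISJOINT, DISJOINT): 0.01,
    (DISJOINT, MOD_DATE): 0.00005,
}


def generate_path_query(stats, num_patterns, min_selectivity, rng):
    """Build a path-join query with num_patterns bound-predicate patterns.

    Only predicate links whose estimated selectivity exceeds min_selectivity
    are used. Returns None if no such path of the requested length exists
    along the randomly chosen route.
    """
    links = sorted(pair for pair, sel in stats.items() if sel > min_selectivity)
    if num_patterns < 2 or not links:
        return None
    # Start with one random linked pair, then extend the path step by step.
    path = list(rng.choice(links))
    while len(path) < num_patterns:
        candidates = [p2 for (p1, p2) in links if p1 == path[-1]]
        if not candidates:
            return None  # dead end: no admissible successor predicate
        path.append(rng.choice(candidates))
    patterns = [f"?var{i} <{p}> ?var{i + 1}" for i, p in enumerate(path, start=1)]
    return "SELECT * WHERE { " + " . ".join(patterns) + " }"


if __name__ == "__main__":
    print(generate_path_query(LINKED_PREDICATES, 4, 1e-3, random.Random(1)))
```

Filtering links by a threshold before sampling mirrors the "minimum join selectivity" idea from slide 17; a real generator would still need to execute the query to confirm that #results > 0.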

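
For star-joins anchored at the subject, the slides name characteristic sets [Neumann, Moerkotte, ICDE 2011] as background knowledge. A rough sketch of how such sets could drive query generation is given below. It assumes each characteristic set is stored as a set of predicates together with the number of subjects that carry exactly those predicates; the function name, data layout, and counts are hypothetical and not taken from the SPLODGE toolset.

```python
import random

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical characteristic sets: (predicates, number of subjects that have
# exactly this predicate combination). Counts are illustrative only.
CHARACTERISTIC_SETS = [
    (frozenset({RDFS_LABEL, FOAF_KNOWS, RDF_TYPE}), 322),
    (frozenset({RDFS_LABEL, RDF_TYPE}), 437),
]


def generate_star_query(char_sets, num_patterns, rng):
    """Build a subject-anchored star-join query with num_patterns patterns.

    A characteristic set with at least num_patterns predicates is drawn with
    probability proportional to its subject count; the query predicates are a
    random subset of it. Returns None if no set is large enough.
    """
    candidates = [
        (sorted(preds), count)
        for preds, count in char_sets
        if len(preds) >= num_patterns and count > 0
    ]
    if not candidates:
        return None
    weights = [count for _, count in candidates]
    preds, _ = rng.choices(candidates, weights=weights, k=1)[0]
    chosen = rng.sample(preds, num_patterns)
    # All patterns share the anchor variable ?s in subject position.
    patterns = [f"?s <{p}> ?o{i}" for i, p in enumerate(chosen, start=1)]
    return "SELECT * WHERE { " + " . ".join(patterns) + " }"


if __name__ == "__main__":
    print(generate_star_query(CHARACTERISTIC_SETS, 2, random.Random(0)))
```

Because every predicate in the query comes from one characteristic set with a positive subject count, at least one subject matches all patterns, so the query is non-empty on the data the statistics were computed from.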