Strategies for Processing and Explaining Distributed Queries on Linked Data

  • Overview of distributed query processing, extracting interesting features from SPARQL queries for machine learning algorithms, and explaining SPARQL queries
  • The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph into the other. The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
  • RMSE represents the sample standard deviation of the differences between predicted values and observed values (the prediction errors) when computed out-of-sample
  • As the figures show, predictions for queries from templates 15, 22, 24, and 25 have large errors. Execution times of the queries from template 15 range from 1 ms to 21 ms, from template 22 from 1 ms to 22 ms, and from template 25 from 1 ms to 24 ms. For this reason, many data points in the range of 1 ms to 24 ms are distant from the perfect prediction line in Figure~\ref{fig:all_predictions}(a). Execution times of the queries from template 24 range from 86 ms to 110 ms. Many data points in this interval are even farther from the perfect prediction line than those in the 1 ms to 24 ms interval. This explains the highest error for queries belonging to template 24.
  • Refining training queries from query logs by considering the statistically significant characteristics. Bootstrapping: starting with an initial set of properties, resources, and literals, then generating training queries by permutations and combinations of the statistically significant features. Simplifying the pattern features (join features, triple pattern features, and pattern graph features) to represent query patterns.
  • Our goal is not to find the minimal parts of axioms in a justification which are required for an entailment to hold.
  • The triples in the virtual model from which the query result is derived are the causes.

    1. Strategies for Processing and Explaining Distributed Queries on Linked Data. Rakebul Hasan, Wimmics, Inria Sophia Antipolis
    2. Research Theme • Distributed Query Processing – Optimization techniques for querying Linked Data • Distributed Query Explanation – Query plan explanation • How query solving works – Query result explanation • Why • Where
    3. DISTRIBUTED QUERY PROCESSING
    4. Querying Linked Data • Bottom-up strategies – Discover sources during query processing by following links between sources • Top-down strategies – Sources are known
    5. Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { ?film dbpedia-owl:director ?director . ?director dbpedia-owl:nationality dbpedia:Italy . ?x owl:sameAs ?film . ?x linkedMDB:genre ?genre }
    6. Querying Linked Data FedBench Query CD5: Find the director and the genre of movies directed by Italians SELECT ?film ?director ?genre WHERE { SERVICE <http://dbpedia.org/sparql> { ?film dbpedia-owl:director ?director . ?director dbpedia-owl:nationality dbpedia:Italy . ?x owl:sameAs ?film . } SERVICE <http://data.linkedmdb.org/sparql> { ?x linkedMDB:genre ?genre } } SPARQL 1.1 SERVICE clause. Needs knowledge of which part of the query should be solved by which endpoint
    7. • Top-down approaches – Data warehousing approach – Virtual integration approach
    8. Querying Linked Data • Data warehousing approach – Collect the data in a central triple store – Process queries on that triple store • Disadvantages • Expensive preprocessing (data collection + integration) and maintenance • Data not up to date
    9. Querying Linked Data • Virtual integration approach – A query federation middleware • Parse and split into subqueries • Source selection for subqueries • Evaluate the subqueries against corresponding sources directly • Merge the results – Advantages • no preprocessing and maintenance • up to date data
    10. Running Example Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient drugbank http://www4.wiwiss.fu-berlin.de/drugbank/sparql SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug purl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql DBpedia http://data.linkedmdb.org/sparql
    11. Parsing and Source Selection drugbank Query LS6: Find KEGG drug names of all drugs in Drugbank belonging to category Micronutrient http://www4.wiwiss.fu-berlin.de/drugbank/sparql ?drug drugbank:drugCategory drugbank-category:micronutrient ?drug drugbank:casRegistryNumber ?id SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug purl:title ?title . } KEGG http://cu.bioportal.bio2rdf.org/sparql ?keggDrug rdf:type kegg:Drug ?keggDrug bio2rdf:xRef ?id Send ASK queries to all the sources ASK { ?drug drugbank:drugCategory drugbank-category:micronutrient } ASK { ?drug drugbank:casRegistryNumber ?id } ASK { ?keggDrug rdf:type kegg:Drug } ASK { ?keggDrug bio2rdf:xRef ?id } ASK { ?keggDrug purl:title ?title } DBpedia http://data.linkedmdb.org/sparql ?keggDrug purl:title ?title
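The ASK-based source selection on slide 11 can be sketched as follows. This is a minimal illustration, not the deck's implementation: the endpoint names and the `ask()` stub are assumptions, and a real engine would send an HTTP request such as `ASK { ?drug drugbank:casRegistryNumber ?id }` to each SPARQL endpoint and keep only the sources that answer true.

```python
TRIPLE_PATTERNS = [
    "?drug drugbank:drugCategory drugbank-category:micronutrient",
    "?drug drugbank:casRegistryNumber ?id",
    "?keggDrug rdf:type kegg:Drug",
    "?keggDrug bio2rdf:xRef ?id",
    "?keggDrug purl:title ?title",
]

# Simulated ASK answers: which endpoint holds data matching which pattern.
CAN_ANSWER = {
    "drugbank": set(TRIPLE_PATTERNS[:2]),
    "kegg": set(TRIPLE_PATTERNS[2:]),
}

def ask(endpoint, pattern):
    """Stand-in for sending 'ASK { <pattern> }' to a SPARQL endpoint."""
    return pattern in CAN_ANSWER[endpoint]

def select_sources(endpoints, patterns):
    """Map each triple pattern to the endpoints that can answer it."""
    return {p: [e for e in endpoints if ask(e, p)] for p in patterns}

selection = select_sources(["drugbank", "kegg"], TRIPLE_PATTERNS)
```

Each subquery is then routed only to the endpoints listed for its patterns, as on the slide.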
    12. Evaluating subqueries • Two options – All triple patterns are individually evaluated – Nested loop join (NLJ): evaluate iteratively pattern by pattern
    13. Evaluating subqueries • Example of NLJ SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug purl:title ?title . }
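The NLJ example can be illustrated with a toy in-memory evaluator. The data and the `evaluate()` matcher are illustrative stand-ins for remote subquery evaluation; variables are `?`-prefixed strings, and the simplified values (`d1`, `id1`, `k1`) are hypothetical.

```python
DATA = [
    ("d1", "drugCategory", "micronutrient"),
    ("d1", "casRegistryNumber", "id1"),
    ("k1", "xRef", "id1"),
    ("k1", "title", "Drug One"),
]

def evaluate(pattern):
    """Match one (s, p, o) pattern against DATA, returning solution mappings."""
    results = []
    for triple in DATA:
        mapping = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                mapping[term] = value
            elif term != value:
                break
        else:
            results.append(mapping)
    return results

def nested_loop_join(patterns):
    """Evaluate patterns one by one; for each solution gathered so far,
    substitute its bindings into the next pattern before evaluating it."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for mu in solutions:
            bound = tuple(mu.get(t, t) for t in pattern)
            for nu in evaluate(bound):
                next_solutions.append({**mu, **nu})
        solutions = next_solutions
    return solutions

rows = nested_loop_join([
    ("?drug", "drugCategory", "micronutrient"),
    ("?drug", "casRegistryNumber", "?id"),
    ("?keggDrug", "xRef", "?id"),
    ("?keggDrug", "title", "?title"),
])
```

Each iteration narrows the next pattern with the bindings already found, which is exactly what makes NLJ a sequence of many small probes.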
    14. Optimization techniques • Source selection – Indexing: characterization of RDF graphs – Statistics-based catalogue – Caching
    15. Optimization techniques • Exclusive grouping SELECT ?drug ?title WHERE { ?drug drugbank:drugCategory drugbank-category:micronutrient . ?drug drugbank:casRegistryNumber ?id . ?keggDrug rdf:type kegg:Drug . ?keggDrug bio2rdf:xRef ?id . ?keggDrug purl:title ?title . }
    16. Optimization techniques • Bound join – Effective for NLJ – Mappings of variable values from the intermediate result to the next subquery – SPARQL 1.1 VALUES clause
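A bound join can be sketched as string construction of the next subquery: the bindings from the intermediate result are inlined with a SPARQL 1.1 VALUES clause, so one request replaces many per-binding NLJ probes. The pattern and URIs below are hypothetical examples, not from the deck.

```python
def bound_join_query(subquery_pattern, var, uris):
    """Return a SELECT query with the intermediate bindings inlined
    via a SPARQL 1.1 VALUES clause."""
    values = " ".join(f"<{u}>" for u in uris)
    return (
        "SELECT * WHERE { "
        + subquery_pattern + " "
        + "VALUES " + var + " { " + values + " } }"
    )

q = bound_join_query(
    "?keggDrug purl:title ?title .",
    "?keggDrug",
    ["http://example.org/kegg/d1", "http://example.org/kegg/d2"],
)
```

The resulting query is sent once to the remote endpoint, which evaluates the pattern for both bindings in a single round trip.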
    17. Optimization techniques • Hash join – Hash table for intermediate mappings
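A minimal sketch of the hash join over solution mappings: build a hash table on the join variable from one input, then probe it with the other, so each input is scanned only once. The sample mappings are hypothetical.

```python
def hash_join(left, right, join_var):
    """Join two lists of solution mappings on join_var using a hash table."""
    table = {}
    for mu in left:
        table.setdefault(mu[join_var], []).append(mu)
    joined = []
    for nu in right:
        for mu in table.get(nu[join_var], []):
            joined.append({**mu, **nu})
    return joined

left = [{"?id": "id1", "?drug": "d1"}]
right = [{"?id": "id1", "?keggDrug": "k1"}, {"?id": "id9", "?keggDrug": "k9"}]
matches = hash_join(left, right, "?id")
```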
    18. Optimization techniques • FILTER optimization – Evaluating the subqueries with corresponding FILTERs – Reduces the number of intermediate results
    19. Optimization techniques • Parallelization – Effective for the individual subquery evaluation approach
    20. Optimization techniques • Join order – Selectivity estimation: join order heuristic based on the number of bound and free variables [1] – Statistics: based on cardinalities of the triple patterns [2] [1] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In Proceedings of the 17th International Conference on World Wide Web (WWW '08) [2] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the 19th International Conference on World Wide Web (WWW '10)
    21. Selectivity Estimation • Ideal for the Linked Data scenario – No need for statistics about the underlying data – Estimations can be wrong however • Example: BGP { ?v1 p1 ?v2 . ?v1 p2 ?v3 } has 5 results but selectivity estimation 2; BGP { ?v4 p3 c1 . ?v4 p4 ?v1 } has 10 results but selectivity estimation 1
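A sketch of the variable-counting heuristic in the spirit of [1]: a triple pattern with fewer unbound variables is assumed more selective and is evaluated first. This is a simplification for illustration; real implementations also weight the position (subject, predicate, object) of the bound terms.

```python
def unbound_count(pattern):
    """Number of unbound ('?'-prefixed) terms in an (s, p, o) pattern."""
    return sum(1 for term in pattern if term.startswith("?"))

def order_by_estimated_selectivity(patterns):
    """Evaluate patterns with more bound terms (assumed more selective) first."""
    return sorted(patterns, key=unbound_count)

ordered = order_by_estimated_selectivity([
    ("?v1", "p1", "?v2"),
    ("?v4", "p3", "c1"),
])
```

As on the slide, the heuristic here prefers the pattern with the constant object even though it actually returns more results, which is why estimations can be wrong.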
    22. Query Performance Prediction • Learn query performance from already executed queries • Statistics about the underlying data not required • Applications – Query optimization: join order – Workload management/scheduling
    23. Regression • Learn a mapping function f(X) = y – X = feature vector, vector representation of a SPARQL query – y = performance metric (e.g. latency, number of results) • Support vector machine with nu-SVR • k-nearest neighbors regression
    24. Feature Extraction • How can we represent SPARQL queries as vectors?
    25. SPARQL • SPARQL algebra features • Graph pattern features
    26. SPARQL Algebra http://www.w3.org/TR/sparql11-query/#sparqlQuery
    27. SPARQL Algebra Features
    28. Graph Pattern Features
    29. Graph Pattern Features • Clustering training queries – K-medoids clustering algorithm with approximated edit distance as the distance function • Selects data points as cluster centers • Arbitrary distance function
    30. Graph Pattern Features • Graph edit distance – Minimum amount of distortion needed to transform one graph to another – Compute similarity by inverting the distance
    31. Graph Pattern Features • Graph edit distance – Usually computed using A* search • Exponential running time – Bipartite-matching-based approximated graph edit distance • Previous research shows accurate results on classification problems
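To make the distance/similarity idea concrete, here is a deliberately crude stand-in for graph edit distance, assuming unit costs: count node and edge labels present in one graph but not the other (each difference is one insertion or deletion). This is NOT the bipartite-matching approximation from the slides, which is more accurate but needs an assignment solver; the example graphs are hypothetical.

```python
def crude_edit_distance(g1, g2):
    """Unit-cost label-set distance: a crude lower-accuracy proxy for GED.
    Each graph is (frozenset of node labels, frozenset of edge tuples)."""
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    return len(nodes1 ^ nodes2) + len(edges1 ^ edges2)

def similarity(g1, g2):
    """Similarity by inverting the distance, as on slide 30."""
    return 1.0 / (1.0 + crude_edit_distance(g1, g2))

g_a = (frozenset({"?film", "?director"}),
       frozenset({("?film", "director", "?director")}))
g_b = (frozenset({"?film", "?genre"}),
       frozenset({("?film", "genre", "?genre")}))
```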
    32. Experiment Setup • Triple store and hardware – Jena TDB 1.0.0 – 16 GB memory – 2.53 GHz CPU – 48 GB system RAM – Linux 2.6.32 operating system
    33. Experiment Setup • Datasets – Training, validation, and test datasets generated from 25 DBPSB query templates • 1260 training queries • 420 validation queries • 420 test queries – RDF data: DBpedia 3.5.1 with 100% scaling factor from the DBPSB framework
    34. Prediction Models • Support vector machine (SVM) with nu-SVR for regression • k-nearest neighbors (k-NN) regression – Euclidean distance as the distance function – k-dimensional tree (k-d tree) data structure to compute the nearest neighbors
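The k-NN model can be sketched in a few lines: the predicted execution time of a query is the mean over its k most similar training queries under Euclidean distance. A k-d tree, as on the slide, would speed up the neighbor search; here we scan linearly for clarity. The feature vectors and times below are hypothetical.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """k-NN regression: mean target of the k nearest training points."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    neighbors = [yi for _, yi in dists[:k]]
    return sum(neighbors) / len(neighbors)

# Hypothetical 2-dimensional query feature vectors and execution times (ms).
X = [(0.0, 1.0), (0.1, 1.1), (5.0, 5.0), (5.1, 5.2)]
y = [10.0, 12.0, 90.0, 110.0]
pred = knn_predict(X, y, (0.05, 1.05), k=2)
```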
    35. Evaluation Measures • Coefficient of determination • Root mean squared error (RMSE)
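The two evaluation measures have simple worked definitions, assuming plain lists of observed and predicted values:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over a test set."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor gives RMSE 0 and coefficient of determination 1; predicting the mean of the observed values gives a coefficient of 0.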
    36. Predicting Query Execution Time • SPARQL algebra features
    37. Predicting Query Execution Time • SPARQL algebra and graph pattern features
    38. What’s Next • Apply QPP in join order optimization • Benchmarking – FedBench: a benchmarking framework for federated query processing • Systematic generation of training queries – Bootstrapping – Refining training queries from query logs
    39. Statistical Analysis of Query Logs • Approach to systematic generation of training queries. Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries. 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW 2011)
    40. Summary • Distributed query processing • Optimization techniques • Query performance prediction – Join order optimization
    41. DISTRIBUTED QUERY EXPLANATION
    42. Query Explanation • Query plan explanation • Query result explanation • Motivation – Understanding – Transparency – Trust
    43. Query Explanation in the Semantic Web (table: query plan explanation vs. query result explanation for Jena, Sesame, Virtuoso*, BigOWLIM+) * Explanation of SPARQL to SQL translation + A debugging feature on the query engine side
    44. Distributed Query Explanation in the Semantic Web (table: query plan explanation vs. query result explanation for FedX, DARQ, SemWIQ, ADERIS, Anapsid)
    45. ADERIS Query Plan Explanation
    46. Related Work • Database: why, how, and where provenance • Semantic Web – Inference explanation • Provenance • Generating justifications
    47. Distributed Query Explanation: What We Provide • Query plan explanation – Work in progress – Prior to query execution • Includes predicted performance metrics – After query execution, with performance metrics • Query result explanation – Why provenance – Where provenance
    48. Query Result Explanation • Input: (RDF model, Query, Result) to the Result Explainer Plug-in • Can be any RDF model (RDF graph, SPARQL endpoint) • We generate result explanations by querying this model for why provenance
    49. Why Provenance • Triples in the virtual model from which the result is derived: <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" .
    50. Where Provenance • Keep the provenance of sources in the virtual model; the fedqld:prov graph records where the triples in each named graph come from: fedqld:source1 { <http://example.org/book/book8> <http://purl.org/dc/elements/1.1/creator> <http://www-sop.inria.fr/members/Alice> ; <http://purl.org/dc/elements/1.1/title> "Distributed Query Processing for Linked Data" . } fedqld:source2 { <http://www-sop.inria.fr/members/Charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" . <http://www-sop.inria.fr/members/Alice> <http://xmlns.com/foaf/0.1/knows> <http://www-sop.inria.fr/members/Charlie> ; <http://xmlns.com/foaf/0.1/name> "Alice" . } fedqld:prov { fedqld:source1 void:sparqlEndpoint <http://localhost:3030/books/query> . fedqld:source2 void:sparqlEndpoint <http://localhost:3031/person/query> . }
    51. What’s Next • Explanation user interfaces • Evaluating the impact of our explanations
    52. References • Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and Michael Schmidt. FedX: A federation layer for distributed query processing on linked open data. In ESWC, 2011 • Shinji Kikuchi, Shelly Sachdeva, Subhash Bhalla, Steven Lynden, Isao Kojima, Akiyoshi Matono, and Yusuke Tanimura. Adaptive integration of distributed semantic web data. Databases in Networked Information Systems • Michael Schmidt, Olaf Görlitz, Peter Haase, Günter Ladwig, Andreas Schwarte, and Thanh Tran. FedBench: a benchmark suite for federated semantic data query processing. In Proceedings of the 10th International Semantic Web Conference (ISWC 2011)
