Semantics and optimisation of the SPARQL1.1 federation extension

Presentation given at ESWC 2011 for the paper "Semantics and optimisation of the SPARQL1.1 federation extension". Buil-Aranda C, Arenas M, Corcho O. ESWC 2011, May 2011, Hersonissos, Greece


  1. 1. Semantics and optimization of the SPARQL 1.1 federation extension Carlos Buil Aranda (1), Marcelo Arenas (2), Oscar Corcho (1) cbuil@fi.upm.es, marenas@ing.puc.cl, [email_address] 11th November, 2010, Madrid, Spain (1) Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid (2) Departamento Ciencias de la Computación, Pontificia Universidad Católica de Chile
  2. 2. Introduction <ul><li>Example </li></ul><ul><ul><li>Using the Pubmed references obtained from the Geneid gene dataset, retrieve information about genes and their references in the Pubmed dataset. </li></ul></ul><ul><ul><li>From Pubmed we access the information in the National Library of Medicine’s controlled vocabulary thesaurus, stored at the MeSH endpoint, so that we have more complete information about those genes. </li></ul></ul><ul><ul><li>Finally, we also access the HHPID endpoint, which is the knowledge base for the HIV-1 protein. </li></ul></ul><ul><li>How many of you have needed to query distributed SPARQL endpoints? </li></ul>
  3. 3. Introduction [Diagram: the endpoints of the example and the triple patterns evaluated at each. HHPID: ?int <hhpid:elementGene2> ?gene1. GeneID: ?gene1 <geneid:pubmed_xref> ?pubmed. Pubmed: {?pubmed <pubmed:meshref> ?mesh . ?mesh <pubmed:descriptor> ?descriptor .} MESH: ?meshReference <owl:sameAs> ?descriptor]
  4. 4. Introduction <ul><li>Given SPARQL1.0: How do you do those queries? </li></ul><ul><li>Option 1 : Make local copies of all those graphs in your favourite triple store, separated into different named graphs / contexts, and evaluate a single query over the whole set of graphs. </li></ul><ul><li>Option 2 : Send individual queries to each SPARQL endpoint, and join the information programmatically on the client side. Highly inefficient. </li></ul><ul><li>Option 3 : Use one of the existing distributed query processing extensions : DARQ, NetworkedGraphs, ARQ, etc. </li></ul>SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference WHERE { ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 . ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed . ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh . ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor . ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .}
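To make Option 2 concrete, here is a minimal Python sketch of the client-side join it implies. The two result lists are hypothetical stand-ins for what the HHPID and GeneID endpoints might return; the function name `hash_join` and the sample values are illustrative assumptions, and the sketch assumes every row of a result set binds the same variables (no OPTIONAL).

```python
from collections import defaultdict


def hash_join(left, right):
    """Join two lists of solution mappings (dicts) on their shared variables."""
    if not left or not right:
        return []
    # Assumption: all rows in one result set bind the same variables.
    shared = sorted(set(left[0]) & set(right[0]))
    index = defaultdict(list)
    for mu in right:
        index[tuple(mu[v] for v in shared)].append(mu)
    joined = []
    for mu in left:
        for nu in index.get(tuple(mu[v] for v in shared), []):
            joined.append({**mu, **nu})
    return joined


# Hypothetical result sets, standing in for the answers of two endpoints.
hhpid_rows = [
    {"interaction": "i1", "gene1": "g1"},
    {"interaction": "i2", "gene1": "g2"},
]
geneid_rows = [
    {"gene1": "g1", "pubmed": "p10"},
    {"gene1": "g3", "pubmed": "p30"},
]

result = hash_join(hhpid_rows, geneid_rows)
print(result)
```

Every intermediate result has to travel to the client before the join can discard rows, which is one reason this option scales poorly.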
  5. 5. Introduction: SPARQL 1.1 Federation Extension <ul><li>Allows specifying queries over distributed SPARQL endpoints </li></ul><ul><ul><li>New operator: SERVICE a P </li></ul></ul><ul><li>We may combine local and remote SPARQL endpoints, depending on the characteristics of the data that we are handling </li></ul>SELECT ?pubmed ?gene1 ?mesh ?descriptor ?meshReference WHERE { SERVICE <http://quebec.hhpid.bio2rdf.org/sparql> { ?interaction <http://ontology.bio2rdf.org/hhpid:elementGene2> ?gene1 .} . SERVICE <http://127.0.0.1:2020/sparql-geneid> { ?gene1 <http://bio2rdf.org/geneid_resource:pubmed_xref> ?pubmed .} . SERVICE <http://pubmed.bio2rdf.org/sparql> { ?pubmed <http://bio2rdf.org/pubmed_resource:meshref> ?mesh . ?mesh <http://bio2rdf.org/pubmed_resource:descriptor> ?descriptor . } . SERVICE <http://127.0.0.1:2021/sparql-mesh> { ?meshReference <http://www.w3.org/2002/07/owl#sameAs> ?descriptor .} }
  6. 6. Table of Contents <ul><li>Introduction </li></ul><ul><li>Syntax and Semantics </li></ul><ul><li>SERVICE evaluation </li></ul><ul><li>Optimising Federated Queries </li></ul><ul><li>Implementation </li></ul><ul><li>Conclusions </li></ul>
  7. 7. Syntax & Semantics Preliminaries
  8. 8. Syntax & Semantics Preliminaries
  9. 9. Syntax & Semantics Preliminaries
  10. 10. SPARQL1.1 SERVICE syntax <ul><li>Queries are of the form: </li></ul><ul><li>SELECT * WHERE </li></ul><ul><li>{ </li></ul><ul><ul><li>P2 . </li></ul></ul><ul><ul><li>P3 . </li></ul></ul><ul><ul><li>SERVICE a {P1} . </li></ul></ul><ul><ul><li>... </li></ul></ul><ul><li>} </li></ul><ul><li>“ a ” is an IRI or a variable, so it can be: </li></ul><ul><ul><li>A predefined service endpoint: </li></ul></ul><ul><ul><ul><li>e.g., <http://quebec.hhpid.bio2rdf.org/sparql> </li></ul></ul></ul><ul><ul><li>A variable: SERVICE ?X {P1} </li></ul></ul>
  11. 11. SPARQL 1.1 SERVICE Semantics <ul><li>We extend [PAG09] with the semantics for SERVICE: </li></ul>[PAG09] J. Pérez, M. Arenas and C. Gutiérrez. Semantics and complexity of SPARQL. TODS 34(3), 2009.
  12. 12. SPARQL 1.1 SERVICE Semantics <ul><li>So, if we find SERVICE ?X P1 , do we have to send queries to every single endpoint in the world? </li></ul>
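The formal definition shown on these slides did not survive in this transcript, so the following Python sketch is only one plausible reading of it. It assumes that SERVICE with an IRI evaluates the pattern at that endpoint, that an unknown endpoint yields the single empty mapping, and that SERVICE with a variable takes the union over all known endpoints, binding the variable to each endpoint's IRI, which is what the question on this slide suggests. The endpoint table `ENDPOINTS`, the function names and the sample data are all hypothetical.

```python
def compatible(mu1, mu2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)


def join(omega1, omega2):
    """Join of two sets of mappings: merge every compatible pair."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


def match_triple(pattern, graph):
    """Evaluate one triple pattern (terms starting with '?' are variables)."""
    solutions = []
    for triple in graph:
        mu = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if mu.setdefault(term, value) != value:
                    break  # same variable matched two different values
            elif term != value:
                break  # constant does not match
        else:
            solutions.append(mu)
    return solutions


def eval_service(target, pattern, endpoints):
    """Evaluate (SERVICE target pattern) over a table of known endpoints."""
    if not target.startswith("?"):
        if target in endpoints:
            return match_triple(pattern, endpoints[target])
        return [{}]  # assumption: unknown endpoint gives the empty mapping
    results = []
    for iri, graph in endpoints.items():
        # Variable target: query every endpoint and bind the variable to it.
        results.extend(join(match_triple(pattern, graph), [{target: iri}]))
    return results


# Hypothetical endpoints, each holding a tiny graph.
ENDPOINTS = {
    "http://a.example/sparql": [("alice", "email", "alice@a.example")],
    "http://b.example/sparql": [("bob", "email", "bob@b.example")],
}
pattern = ("?N", "email", "?E")
fixed = eval_service("http://a.example/sparql", pattern, ENDPOINTS)
unknown = eval_service("http://nowhere.example/sparql", pattern, ENDPOINTS)
everywhere = eval_service("?Y", pattern, ENDPOINTS)
print(len(everywhere))
```

With an unbound variable the loop touches every endpoint in the table, which is exactly the cost the slide's question points at.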
  13. 13. BINDINGS Semantics <ul><li>We also define BINDINGS </li></ul><ul><ul><li>Previously in SPARQL 1.1 Federated Extension, now in main query document </li></ul></ul><ul><li>Given a list L = [?X_1,...,?X_l] of pairwise distinct variables (l>0) and a list A = [a_1,...,a_l] of values </li></ul>
  14. 14. Table of Contents <ul><li>Introduction </li></ul><ul><li>Syntax and Semantics </li></ul><ul><li>SERVICE evaluation </li></ul><ul><li>Optimising Federated Queries </li></ul><ul><li>Implementation </li></ul><ul><li>Conclusions </li></ul>
  15. 15. SERVICE Evaluation <ul><li>What happens when there is a variable ?X in SERVICE ?X P ? </li></ul><ul><ul><li>?X must be bound in order to evaluate the pattern </li></ul></ul><ul><ul><li>That is, ?X needs to have a value when the SERVICE operator is evaluated </li></ul></ul><ul><li>Examples: </li></ul><ul><ul><li>P1 = (SELECT {?X, ?N, ?E} WHERE {(?X, service_address, ?Y) AND (SERVICE ?Y {?N, email, ?E})}) </li></ul></ul><ul><ul><li>P2 = (SELECT {?X, ?N, ?E} WHERE {((?X, service_description, ?Z) UNION (?X, service_address, ?Y)) AND ((SERVICE ?Z {?N, email, ?E}) UNION (SERVICE ?Y {?N, email, ?E}))}) </li></ul></ul><ul><li>… In order to evaluate P1 and P2, we must enforce a specific evaluation order that guarantees a safe evaluation </li></ul>
  16. 16. SERVICE Evaluation – Ingredients and Informal Definitions <ul><li>Boundedness (of a variable in a query) </li></ul><ul><ul><li>?Y is bound in P1 = ((?X, service_address, ?Y) AND (SERVICE ?Y (?N, email, ?E))) </li></ul></ul><ul><ul><li>?Y is not bound in (((?X, service_address, ?Z) OPT (?X, service_desc, ?Y)) AND (SERVICE ?Y (?N, email, ?E))), because the OPT branch may leave ?Y without a value </li></ul></ul><ul><ul><li>However, checking this is undecidable </li></ul></ul><ul><li>Strong Boundedness (of a variable in a query) </li></ul><ul><ul><li>We impose some syntactic conditions (details in the paper) </li></ul></ul><ul><li>Service Boundedness </li></ul><ul><ul><li>Based on the parse tree of the query: if we find a SERVICE ?X P that has an ancestor in which ?X is bound, the service is bound </li></ul></ul><ul><ul><li>However, checking this is undecidable </li></ul></ul><ul><li>Service Safeness </li></ul><ul><ul><li>Hence we impose syntactic conditions, and we have to check whether variable ?X is strongly bound </li></ul></ul>
  17. 17. SERVICE Evaluation - Boundedness
  18. 18. SERVICE Evaluation - Strong Boundedness
  19. 19. SERVICE Evaluation - Safeness <ul><li>Given a SPARQL query P, define T(P) as the parse tree of P. In this tree, every node corresponds to a sub-pattern of P. </li></ul>
  20. 20. SERVICE Evaluation - Safeness <ul><li>Definition (service-boundedness) A SPARQL query P is service-bound if for every node u of T(P) with label (SERVICE ?X P1), it holds that: </li></ul><ul><ul><li>There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is bound in P2 </li></ul></ul><ul><ul><li>P1 is service-bound </li></ul></ul><ul><li>Theorem The problem of verifying, given a SPARQL query P, whether P is service-bound is undecidable. </li></ul><ul><li>Definition (service-safeness) A SPARQL query P is service-safe if for every node u of T(P) with label (SERVICE ?X P1), it holds that: </li></ul><ul><ul><li>There exists a node v of T(P) with label P2 such that v is an ancestor of u in T(P) and ?X is strongly bound in P2 </li></ul></ul><ul><ul><li>P1 is service-safe </li></ul></ul>
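The service-safeness definition above is purely syntactic, so it can be checked by a walk over the parse tree. The Python sketch below follows that definition, but the rules used for strong boundedness are an assumption (the slides defer them to the paper): a triple pattern strongly binds its variables, AND takes the union of both sides, UNION the intersection, OPT only its left side, and a SERVICE pattern contributes nothing. The tuple encoding of patterns and the function names are hypothetical.

```python
def strongly_bound(p):
    """Variables assumed to be strongly bound in pattern p (see lead-in)."""
    kind = p[0]
    if kind == "triple":
        return {term for term in p[1:] if term.startswith("?")}
    if kind == "and":
        return strongly_bound(p[1]) | strongly_bound(p[2])
    if kind == "union":
        return strongly_bound(p[1]) & strongly_bound(p[2])
    if kind == "opt":
        return strongly_bound(p[1])  # the optional side may not match
    if kind == "service":
        return set()  # assumption: a remote call guarantees no binding
    raise ValueError(f"unknown pattern kind: {kind}")


def service_safe(p, ancestors=()):
    """Check service-safeness by walking the parse tree of p."""
    kind = p[0]
    if kind == "triple":
        return True
    if kind == "service":
        target, inner = p[1], p[2]
        if target.startswith("?"):
            # Some ancestor must strongly bind the endpoint variable ...
            if not any(target in strongly_bound(a) for a in ancestors):
                return False
            # ... and the inner pattern must be service-safe on its own.
            return service_safe(inner)
        return service_safe(inner, ancestors + (p,))
    return all(service_safe(child, ancestors + (p,)) for child in p[1:])


email = ("triple", "?N", "email", "?E")

# P1 from the slides: ?Y is bound by the triple pattern joined with SERVICE.
p1 = ("and", ("triple", "?X", "service_address", "?Y"), ("service", "?Y", email))

# OPT variant from the slides: ?Y appears only on the optional side.
p_opt = (
    "and",
    ("opt",
     ("triple", "?X", "service_address", "?Z"),
     ("triple", "?X", "service_desc", "?Y")),
    ("service", "?Y", email),
)

print(service_safe(p1), service_safe(p_opt))
```

Under these assumptions P1 passes the check and the OPT variant does not, matching the informal discussion of boundedness on the earlier slide.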
  21. 21. Table of Contents <ul><li>Introduction </li></ul><ul><li>Syntax and Semantics </li></ul><ul><li>SERVICE evaluation </li></ul><ul><li>Optimising Federated Queries </li></ul><ul><li>Implementation </li></ul><ul><li>Conclusions </li></ul>
  22. 22. Optimising Federated Queries <ul><li>Well-designed patterns [PAG09] </li></ul>
  23. 23. Optimising federated queries with well-designed patterns <ul><li>We extended the notion of well-designed patterns for SPARQL1.1 Federated Query </li></ul><ul><li>The following rules (from [PAG09]) also hold for SERVICE: </li></ul><ul><li>Proposition </li></ul><ul><ul><li>If P is a well-designed pattern and Q is obtained from P by applying either (1) or (2) or (3), then Q is a well-designed pattern equivalent to P . </li></ul></ul>
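The three rules themselves are not preserved in this transcript. As recalled from [PAG09], and so to be read as an assumption rather than a quotation of the slide, they push AND and FILTER inside the mandatory side of an OPT: (P1 AND (P2 OPT P3)) becomes ((P1 AND P2) OPT P3), ((P1 OPT P2) AND P3) becomes ((P1 AND P3) OPT P2), and ((P1 OPT P2) FILTER R) becomes ((P1 FILTER R) OPT P2). The Python sketch below applies one such step at the root of a pattern; the tuple encoding and the name `push_down` are hypothetical, and the rewriting preserves equivalence only for well-designed patterns.

```python
def push_down(p):
    """Apply one rewriting step at the root of p, if any rule matches."""
    kind = p[0]
    if kind == "filter" and p[1][0] == "opt":
        # ((P1 OPT P2) FILTER R)  ->  ((P1 FILTER R) OPT P2)
        _, (_, p1, p2), r = p
        return ("opt", ("filter", p1, r), p2)
    if kind == "and" and p[2][0] == "opt":
        # (P1 AND (P2 OPT P3))  ->  ((P1 AND P2) OPT P3)
        _, p1, (_, p2, p3) = p
        return ("opt", ("and", p1, p2), p3)
    if kind == "and" and p[1][0] == "opt":
        # ((P1 OPT P2) AND P3)  ->  ((P1 AND P3) OPT P2)
        _, (_, p1, p2), p3 = p
        return ("opt", ("and", p1, p3), p2)
    return p  # no rule applies at the root


# Placeholder sub-patterns; in a federated query these could be SERVICE calls.
A, B, C = ("bgp", "A"), ("bgp", "B"), ("bgp", "C")

rewritten = push_down(("and", A, ("opt", B, C)))
print(rewritten)
```

Moving the mandatory parts together lets them be joined, and filtered, before the optional part is evaluated, which tends to shrink the intermediate results sent between endpoints.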
  24. 24. Table of Contents <ul><li>Introduction </li></ul><ul><li>Syntax and Semantics </li></ul><ul><li>SERVICE evaluation </li></ul><ul><li>Optimising Federated Queries </li></ul><ul><li>Implementation </li></ul><ul><li>Conclusions </li></ul>
  25. 25. Implementation: OGSA-DAI <ul><li>Implemented on top of OGSA-DAI/OGSA-DQP </li></ul><ul><ul><li>Extensible framework to access, integrate, transform and deliver distributed and heterogeneous sources of data </li></ul></ul><ul><ul><li>Implements part of the WS-DAI specification </li></ul></ul><ul><ul><ul><li>Service Oriented Data Access (direct and indirect access) </li></ul></ul></ul><ul><ul><li>Distributed Query Processing </li></ul></ul><ul><li>Features </li></ul><ul><ul><li>RDF extension (available in the official OGSA-DAI release) </li></ul></ul><ul><ul><li>We process </li></ul></ul><ul><ul><ul><li>SERVICE IRI P </li></ul></ul></ul><ul><ul><ul><li>AND, OPTIONAL, UNION </li></ul></ul></ul><ul><ul><ul><li>Most common FILTERS (<, =, >) </li></ul></ul></ul><ul><ul><ul><li>Solution modifiers </li></ul></ul></ul><ul><ul><li>Upcoming features (not yet supported) </li></ul></ul><ul><ul><ul><li>SERVICE ?X P </li></ul></ul></ul>
  26. 26. Implementation: Evaluation <ul><li>Benchmark test </li></ul><ul><ul><li>Existing benchmarks (Berlin SPARQL Benchmark and SP2Bench) were not suitable (no distributed queries), and other benchmarks were at an early stage </li></ul></ul><ul><ul><li>Focus on life sciences queries: bio2rdf.org project </li></ul></ul><ul><ul><li>Seven queries of increasing complexity </li></ul></ul><ul><ul><ul><li>http://www.oeg-upm.net/files/sparql-dqp/ </li></ul></ul></ul><ul><li>Bio2rdf datasets </li></ul><ul><ul><li>bio2rdf.org: 2.3 billion triples </li></ul></ul><ul><ul><li>Used Entrez Gene (13 million triples), pubmed (797 million triples), HHPID (244,091 triples) and MeSH (689,542 triples) </li></ul></ul><ul><ul><li>Downloaded some datasets (HHPID and pubmed) and divided them into several endpoints of 300,000 triples </li></ul></ul><ul><li>Hardware used: </li></ul><ul><ul><li>Intel Core 2 Duo, 2.50 GHz, 3 GB RAM </li></ul></ul>
  27. 27. Implementation: Evaluation Queries
  28. 28. Implementation: The query plans [Figure: query plans before optimisation and after optimisation]
  29. 29. Results
  30. 30. Table of Contents <ul><li>Introduction </li></ul><ul><li>Syntax and Semantics </li></ul><ul><li>SERVICE evaluation </li></ul><ul><li>Optimising Federated Queries </li></ul><ul><li>Implementation </li></ul><ul><li>Conclusions </li></ul>
  31. 31. Conclusions <ul><li>Formalisation of the SPARQL 1.1 Basic Federation Extension syntax and semantics </li></ul><ul><li>Safeness conditions in the evaluation of SERVICE in the presence of variables </li></ul><ul><li>Simple query optimisation based on an extension of well-designed patterns </li></ul><ul><li>Implementation based on a robust data-access system like OGSA-DAI </li></ul><ul><ul><li>Focused on large-scale data sources </li></ul></ul><ul><ul><li>More optimisations can be easily included </li></ul></ul><ul><ul><li>Indirect data access mode (you send the query, it sends you a handler to where the result will be placed, and you can use that resource). </li></ul></ul>
  32. 32. Semantics and optimization of the SPARQL 1.1 federation extension <ul><li>Acknowledgements </li></ul><ul><li>Implementation: </li></ul><ul><ul><li>OGSA-DAI team (especially Ally Hume) </li></ul></ul><ul><li>Query generation: </li></ul><ul><ul><li>Bio2RDF project team (especially Marc-Alexandre Nolin) </li></ul></ul><ul><li>Extensive discussions on syntax and semantics </li></ul><ul><ul><li>Jorge Pérez </li></ul></ul><ul><li>Funding </li></ul><ul><ul><li>ADMIRE Project (FP7-ICT-215024) </li></ul></ul><ul><ul><li>FONDECYT grant 1090565 </li></ul></ul>
