Bio2RDF Distributed Querying model


Published in: Technology

  1. 1. URI based distributed querying Peter Ansell
  2. 2. Aim <ul><li>Access normalised RDF information held in multiple endpoints, using Public Namespaces and Private Record Identifiers together with distributed SPARQL queries that are matched to the content of each endpoint </li></ul>
  3. 3. Overall concepts <ul><li>Query Types: wrappers around SPARQL queries, each selected by a regular expression that matches an input query string.
  4. 4. Normalisation Rules: rules that define the transformations from a standard normalised URI system to a system matching a particular endpoint, and the reverse if necessary
  5. 5. Providers: the entities which provide the information. They can be SPARQL endpoints or even simple URLs. If they are proxied they should return RDF information, but redirects are also available for other providers. </li></ul>
  6. 6. URI resolution example <ul><li>User enters HTTP URL into their user agent </li><ul><li>http://mybio2rdf.local/namespace:identifier </li></ul><li>Servlet receives request </li><ul><li>Hostname: mybio2rdf.local
  7. 7. Query string: /namespace:identifier </li></ul><li>Servlet performs URL rewriting to pass query string to the atlas2rdf.jsp page based on WEB-INF/urlrewrite.xml </li></ul>
  8. 8. URI resolution example <ul><li>The query string is matched against the regular expressions in the configured query types, and each unique query title with a successful match is selected
  9. 9. /namespace:identifier matches at least and </li></ul>
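
The matching step above can be sketched as follows. This is a minimal illustration, not the servlet's actual code: the `QUERY_TYPES` table, its regular expressions, and the `search` query type are hypothetical, and the leading slash of the query string is assumed to have been stripped already.

```python
import re

# Hypothetical query types: title -> regular expression.
# These patterns are illustrative, not the real Bio2RDF configuration.
QUERY_TYPES = {
    "construct": r"^([\w-]+):(.+)$",
    "taglabels": r"^([\w-]+):(.+)$",
    "search": r"^search/(.+)$",
}


def match_query_types(query_string, query_types=QUERY_TYPES):
    """Return {title: matching groups} for every query type whose regex matches."""
    matches = {}
    for title, pattern in query_types.items():
        found = re.match(pattern, query_string)
        if found:
            matches[title] = found.groups()
    return matches


if __name__ == "__main__":
    print(match_query_types("namespace:identifier"))
```

The matching groups are kept alongside each title because the later namespace test and the template variables both draw on them.
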
  10. 10. URI resolution step <ul><li>For each of the query types a namespace test is applied to determine which regular expression matching groups are relevant, and whether the query type matches the given namespace </li></ul>
  11. 11. URI resolution step <ul><li>Namespace test: </li></ul><ul><ul><li>Is the query type specific to namespaces? If not, include the query type. </li></ul></ul><ul><ul><ul><li>See CUSTOM_QUERY_NAMESPACE_PROVIDER_SPECIFIC </li></ul></ul></ul><ul><ul><li>If so, is the query type relevant to all namespaces? If it is, include the query type. </li></ul></ul><ul><ul><ul><li>See CUSTOM_QUERY_HANDLE_ALL_NAMESPACES </li></ul></ul></ul><ul><ul><li>If not, take the matching groups whose numbers are declared for the query type and check whether any or all of them (as configured) match the query type's namespaces. </li></ul></ul><ul><ul><ul><li>See CUSTOM_QUERY_NAMESPACES_TO_HANDLE, CUSTOM_QUERY_NAMESPACE_INPUT_INDEXES, and CUSTOM_QUERY_NAMESPACE_MATCH_METHOD </li></ul></ul></ul>
  12. 12. URI resolution example <ul><li>Both query:construct and query:taglabels are relevant to all namespaces and take the namespace from the first matching group. Because each has only one namespace matching group, the match method is not relevant </li></ul>
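
The namespace test can be written as a short decision function. The sketch below is an assumption-laden illustration: the `QueryType` class and its field names are hypothetical, and the comments only indicate which configuration constant each field is presumed to correspond to. Matching group indexes are assumed to be 1-based, in line with `${input_1}`.

```python
from dataclasses import dataclass


@dataclass
class QueryType:
    """Hypothetical stand-in for a configured query type."""
    title: str
    namespace_specific: bool = False           # cf. CUSTOM_QUERY_NAMESPACE_PROVIDER_SPECIFIC
    handles_all_namespaces: bool = True        # cf. CUSTOM_QUERY_HANDLE_ALL_NAMESPACES
    namespaces_to_handle: frozenset = frozenset()  # cf. CUSTOM_QUERY_NAMESPACES_TO_HANDLE
    namespace_input_indexes: tuple = (1,)      # cf. CUSTOM_QUERY_NAMESPACE_INPUT_INDEXES
    match_method: str = "any"                  # cf. CUSTOM_QUERY_NAMESPACE_MATCH_METHOD


def passes_namespace_test(query_type, groups):
    """Apply the three-stage namespace test to the regex matching groups."""
    if not query_type.namespace_specific:
        return True
    if query_type.handles_all_namespaces:
        return True
    hits = [
        groups[index - 1] in query_type.namespaces_to_handle
        for index in query_type.namespace_input_indexes
    ]
    if query_type.match_method == "all":
        return all(hits)
    return any(hits)
```

With a single namespace index, `any` and `all` give the same answer, which is why the match method does not matter for query types such as construct and taglabels.
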
  13. 13. URI resolution step <ul><li>For each of the chosen query types, get a list of providers which handle the query title
  14. 14. If a query type is namespace specific, filter its list of providers based on whether they match any or all of the namespaces according to the query title namespace matching configuration. This time the inclusion is based on the namespace test with the list of namespaces configured for the provider </li></ul>
  15. 15. URI resolution example <ul><li>The query titles “construct” and “taglabels” were chosen, so they are now matched against the total list of providers to gain an initial list
  16. 16. The construct query is namespace specific, so only construct providers which handle the given namespace will be included, whereas the taglabels query is not namespace specific, so any taglabels providers will be included in the final provider list </li></ul>
  17. 17. URI resolution step <ul><li>Any of the providers which were defined as “default” and which handle the given query type are also included at this stage, without regard to the namespaces.
  18. 18. Default providers are intended to make it simpler to configure intermediate servers without having to know about all of the known namespaces </li></ul>
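
Provider selection, including default providers, might look like the sketch below. The `Provider` class, its fields, and the provider names used with it are hypothetical, and the filter is simplified to a single namespace rather than the any/all matching described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical stand-in for a configured provider."""
    name: str
    query_titles: frozenset
    namespaces: frozenset = frozenset()
    is_default: bool = False


def select_providers(title, namespace, namespace_specific, providers):
    """Pick the providers that should answer one query title."""
    chosen = []
    for provider in providers:
        if title not in provider.query_titles:
            continue  # provider does not handle this query title at all
        if provider.is_default:
            chosen.append(provider)  # defaults ignore namespaces
        elif not namespace_specific:
            chosen.append(provider)  # query type does not filter by namespace
        elif namespace in provider.namespaces:
            chosen.append(provider)  # provider handles the requested namespace
    return chosen
```

A default provider is included whenever it handles the query title, which is what lets an intermediate server forward requests without listing every namespace.
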
  19. 19. URI resolution step <ul><li>For each of the query types, go through each of the providers which remain.
  20. 20. If a provider needs a redirect, as opposed to proxying communication, replace any template variables on the endpoint URL and send an HTTP 302 redirect response as the result </li></ul>
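
As a rough illustration of the redirect case, the function below fills in the endpoint URL template and returns the status code and headers for a 302 response. The function name and the example URL are hypothetical. Python's `string.Template` is used here only because it happens to share the `${name}` placeholder syntax.

```python
from string import Template


def redirect_response(endpoint_url_template, variables):
    """Return (status, headers) for a provider that needs a redirect."""
    location = Template(endpoint_url_template).substitute(variables)
    return 302, {"Location": location}


if __name__ == "__main__":
    status, headers = redirect_response(
        "http://example.org/lookup?id=${input_2}",  # hypothetical endpoint URL
        {"input_1": "namespace", "input_2": "identifier"},
    )
    print(status, headers["Location"])
```
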
  21. 21. URI resolution step <ul><li>If there are no redirects, generate the actual queries from the templates given in the query types and the normalisation rules for the provider
  22. 22. The normalisation rules are matched against the template variables and replaced as necessary in order to make them specific to the relevant endpoint </li></ul>
  23. 23. Query templates <ul><li>Some of the template variables include: </li><ul><li>${graphStart} and ${graphEnd} to allow for SPARQL graphs, or the lack of a graph
  24. 24. ${endpointSpecificUri} to allow for the SPARQL endpoint to contain a different URI to the one which is desired
  25. 25. ${input_1}, ${input_2}, etc., which correspond to the matching groups from the query type. ${input_1} is typically the namespace, although this is configurable. </li></ul></ul>
  26. 26. Query templates <ul><li>Some more template variables include: </li><ul><li>${graphUri} – if it doesn't exist it is empty
  27. 27. ${endpointUrl} – this can also have template variables inside it, which are replaced before the redirect check phase
  28. 28. ${defaultHostAddress} – the standard base URL for this configuration, i.e.,
  29. 29. ${realHostName} – the actual host being used, i.e. http://mymirror.local/bio2rdf/ </li></ul></ul>
  30. 30. Query templates <ul><li>Some template variables are available in their encoded forms. For example: </li><ul><li>${urlEncoded_endpointSpecificUri} – a fully percent encoded version of the URI
  31. 31. ${inputUrlEncoded_normalisedStandardUri} – a version of the standard URI as given by the query type with the ${input_NN} sections internally percent encoded
  32. 32. ${xmlEncoded_inputUrlEncoded_normalisedStandardUri} – for use in RDF/XML templates
  33. 33. ${inputUrlEncoded_privatelowercase_endpointSpecificUri} – for use with endpoints which contain percent encoded URIs that have the private ${input_NN} variables completely in lowercase without regard to the case given in the ${queryString}
  34. 34. ${queryString} – The original input string which matched against the query type regular expression </li></ul></ul>
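
A small sketch of how a few of these template variables could be filled in. Several things here are assumptions rather than the real implementation: the function names, the sample query template, the example.org URIs, and in particular the text that `${graphStart}` and `${graphEnd}` expand to. Only a handful of the variables listed above are covered.

```python
from string import Template
from urllib.parse import quote

# Hypothetical query template using variables named on the slides.
QUERY_TEMPLATE = (
    "CONSTRUCT { <${endpointSpecificUri}> ?p ?o } "
    "WHERE { ${graphStart} <${endpointSpecificUri}> ?p ?o ${graphEnd} }"
)


def build_variables(query_string, groups, endpoint_uri_template, graph_uri=""):
    """Build a variable table for one provider (a partial, illustrative set)."""
    variables = {"queryString": query_string, "graphUri": graph_uri}
    for index, group in enumerate(groups, start=1):
        variables[f"input_{index}"] = group
    # Assumed expansion: wrap the pattern in GRAPH <...> { } only if a graph is set.
    variables["graphStart"] = f"GRAPH <{graph_uri}> {{" if graph_uri else ""
    variables["graphEnd"] = "}" if graph_uri else ""
    endpoint_uri = Template(endpoint_uri_template).substitute(variables)
    variables["endpointSpecificUri"] = endpoint_uri
    variables["urlEncoded_endpointSpecificUri"] = quote(endpoint_uri, safe="")
    return variables


def render_query(template, variables):
    """Replace every ${name} placeholder in the query template."""
    return Template(template).substitute(variables)


if __name__ == "__main__":
    values = build_variables(
        "namespace1:identifier1",
        ("namespace1", "identifier1"),
        "http://example.org/${input_1}/${input_2}",
    )
    print(render_query(QUERY_TEMPLATE, values))
```

Because the variable table is built per provider, the same query template yields a different query for each endpoint it is sent to.
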
  35. 35. Query templates example <ul><li>For </li><ul><li>${queryString}=“namespace1:identifier1”
  36. 36. The other variables will be different depending on whether the construct provider for namespace1 is being contacted, or </li></ul></ul>
  37. 37. URI resolution step <ul><li>For each query, check its communication method
  38. 38. If it is declared as “nocommunication”, ignore it for now. It will be used with the static RDF/XML insertion stage
  39. 39. If it is declared as “httpgeturl” then perform HTTP resolution on the provider endpoint URL after replacing the relevant template variables </li></ul>
  40. 40. URI resolution step <ul><li>If the communication method is declared as “httppostsparql” then POST the replaced query template to the endpoint URL
  41. 41. The SPARQL query is matched to the endpoint at this stage by using a query type that contains the basic structure of the query, together with normalisation rules that make sure the URIs in the SPARQL match the endpoint and graph combination </li></ul>
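
The three communication methods can be pictured as a simple dispatch. This sketch only prepares a request object and sends nothing. The function name is hypothetical, and posting the query as a form-encoded `query` parameter follows the SPARQL protocol convention, which is assumed here rather than confirmed by the slides.

```python
from urllib.parse import urlencode
from urllib.request import Request


def prepare_request(communication_method, endpoint_url, query=""):
    """Build (but do not send) the HTTP request for one provider."""
    if communication_method == "nocommunication":
        return None  # handled later, in the static RDF/XML insertion stage
    if communication_method == "httpgeturl":
        return Request(endpoint_url, method="GET")
    if communication_method == "httppostsparql":
        body = urlencode({"query": query}).encode("utf-8")
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        return Request(endpoint_url, data=body, headers=headers, method="POST")
    raise ValueError(f"unknown communication method: {communication_method}")
```

The endpoint URL and query passed in are expected to have had their template variables replaced already.
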
  42. 42. URI resolution step <ul><li>The results of the httpgeturl and httppostsparql HTTP requests are passed through the list of RDF normalisation rules configured for the chosen provider, so that they are normalised to the desired output format
  43. 43. More than one provider may be attached to the same endpoint and graph combination, so a given URI may resolve using more than one query on the same endpoint and graph depending on the query needs </li></ul>
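
Result normalisation can be thought of as an ordered list of search-and-replace rules applied to the returned RDF. The rule below is hypothetical, as are the example.org endpoint URI and the `ns1` namespace. Treating each rule as a plain regular expression substitution is an assumption about how the rules work.

```python
import re

# Hypothetical rule: rewrite an endpoint's own URIs into the normalised form.
RULES = [
    (r"http://endpoint\.example\.org/resource/", "http://mybio2rdf.local/ns1:"),
]


def normalise_results(rdf_text, rules):
    """Apply each (pattern, replacement) rule in order to the RDF text."""
    for pattern, replacement in rules:
        rdf_text = re.sub(pattern, replacement, rdf_text)
    return rdf_text


if __name__ == "__main__":
    raw = "<http://endpoint.example.org/resource/abc> a <http://example.org/Thing> ."
    print(normalise_results(raw, RULES))
```
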
  44. 44. Accessible databases <ul><li>The following databases all have normalisation rules which normalise them back to URIs </li><ul><li>DBpedia, DrugBank, LinkedCT, HCLS KB/Neurocommons, Diseasome, DailyMed, Bioguid DOI </li></ul><li>These, together with the 40+ Bio2RDF SPARQL endpoints, form a very large accessible knowledge base! </li></ul>
  45. 45. RDF accessible configuration <ul><li>The configuration, including all query types, RDF normalisation rules, providers and known namespaces, is available in RDF
  46. 46. </li></ul>
  47. 47. Integrating user extensions <ul><li>A clear use case for a system where arbitrary queries can be performed as part of a single URI resolution is to integrate novel datasources such as user tags
  48. 48. The only requirement is that the query type relevant to the tags etc. matches the regular expression for the URI it is extending. For example and both have regular expressions that match the basic URI </li></ul>
  49. 49. Future work <ul><li>Content negotiation between RDF formats
  50. 50. HTML formatted results for easy browsing, possibly using Pubby as the rendering engine
  51. 51. Paged SPARQL calls using OFFSET and LIMIT
  52. 52. Alternative configurations for DBpedia, SharedNames etc. that don't require as the base URI and have different basic queries
  53. 53. Import configuration from RDF similar to the current configuration output </li></ul>
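
For the paging item above, one possible shape is a generator that appends LIMIT and OFFSET to a base query. This is a sketch of the general technique only, since the slides describe paging as future work. The function name is hypothetical, and the base query is assumed to end with an ORDER BY clause, because paging without a fixed order is not guaranteed to return stable pages.

```python
def paged_queries(base_query, page_size, pages):
    """Yield one SPARQL query per page, using LIMIT and OFFSET."""
    for page in range(pages):
        yield f"{base_query} LIMIT {page_size} OFFSET {page * page_size}"


if __name__ == "__main__":
    base = "SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s"
    for query in paged_queries(base, 100, 3):
        print(query)
```
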
  54. 54. Future work <ul><li>Provide more pipes to perform integrated actions without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query in a distributed and normalised manner more efficiently
  55. 55. Bring together the current distributed efforts to provide a complete HTML redirection registry so that a large percentage of Bio2RDF namespaces can be redirected with
  56. 56. Form ontologies describing the query type, provider, RDF normalisation rule, namespace paradigm </li></ul>
  57. 57. Future work <ul><li>Integrate and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly, and user enhancements such as tags and publications are cleanly integrated with the actual datasources they were derived from </li></ul>