Bio2RDF @ W3C HCLS2009



  1. Bio2RDF cloud of Virtuoso SPARQL endpoints: Life Science Raw Data Now. François Belleau, Marc-Alexandre Nolin, Peter Ansell, Michel Dumontier. 30th April 2009, W3C-HCLS F2F Meeting, Cambridge, MA
  2. Agenda ● Why did we do Bio2RDF? ● How did we do it? ● What is known about hexokinase? ● Where are we going?
  3. The problem: according to the NAR 2009 Database collection, 1170 public databases exist. How can they be integrated to behave like a coherent global resource?
  4. Public map of 1744 namespaces according to BioMoby, NAR, SRS, GO, NCBI, UniProt
  5. Bio2RDF vision in 2007 ● Johanne Luciano's vision for knowledge integration in 2005 ● W3C vision of the semantic web in 2006
  6. Bio2RDF Mouse and Human Atlas map in 2008: 65 million triples
  7. Bio2RDF's current contribution to the Linked Data cloud ● Linked Data cloud in 2007 ● Linked Data cloud in March 2009
  8. Bio2RDF cloud map of 2.3 billion triples in 2009
  9. Why do it? Not to replace HTML or XML with yet another new format (RDF and OWL), but to answer scientific questions by submitting SPARQL queries, over the Internet, to the global knowledge base exposed by the cloud of Life Science SPARQL endpoints.
  10. Solution: the Bio2RDF approach to the data integration problem in bioinformatics is to apply the semantic web approach based on RDF, OWL and SPARQL technologies.
  11. How did we do it? Bio2RDF architecture
  12. Our design principles
  13. YeastHub design in 2005 ● Conversion of datasets to RDF ● Use of the Sesame triplestore ● SeRQL query interface
  14. Bio2RDF at ISMB 2005, the beginning. Thanks to Kei Cheung, Johanne Luciano, Eric Neumann and Christopher Baker: they drew the lines.
  15. Bio2RDF real-time rdfiser in 2007
  16. Current architecture ● Offline rdfising process ● Virtuoso SPARQL endpoints network ● Namespace resolution through DNS subdomain
  17. Main REST services ● Describe a resource by a dereferenceable URI ● Global services over federated endpoints ● Targeted services to a specific endpoint ● Other services are available
  18. Describe service implementation ● Corresponding SPARQL query: CONSTRUCT { ?s ?p ?o . } WHERE { ?s ?p ?o . FILTER(?s = <>) . } ● Submitted at this URL ● Based on the DNS subdomain resolution service
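As a rough illustration of the describe service on slide 18, the Python sketch below fills a resource URI into the slide's CONSTRUCT query and encodes it as an HTTP GET request using the standard SPARQL protocol `query` parameter. The endpoint address, the example resource URI, and the function names are assumptions made for illustration; the slide's own URLs did not survive in this transcript.

```python
from urllib.parse import urlencode

# Characters that must not appear inside a SPARQL IRI reference <...>.
_FORBIDDEN = set('<>"{}|^`\\ ')


def describe_query(resource_uri: str) -> str:
    """Fill the slide's CONSTRUCT template with one resource URI."""
    if not resource_uri or _FORBIDDEN & set(resource_uri):
        raise ValueError(f"unsafe or empty URI: {resource_uri!r}")
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . FILTER(?s = <%s>) . }" % resource_uri
    )


def describe_request_url(endpoint: str, resource_uri: str) -> str:
    """Encode the query as a SPARQL-protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": describe_query(resource_uri)})


if __name__ == "__main__":
    # Hypothetical endpoint and resource, for illustration only.
    print(describe_request_url("http://example.org/sparql",
                               "http://example.org/geneid:3098"))
```

Rejecting characters that cannot occur in an IRI reference keeps a caller from smuggling extra SPARQL syntax into the template; a production service would likely validate more strictly.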
  19. Bio2RDF JSP server software
  20. Peter Ansell is writing the Bio2RDF JSP server ● The software transforms Bio2RDF URIs into SPARQL queries in real time. ● Its aim is to access normalised RDF information located in multiple endpoints, using the concept of Public Namespaces and Private Record Identifiers together with distributed SPARQL queries that are matched to the content of each endpoint. ● Each of the following databases has normalisation rules which normalise it back to URIs: DBpedia, DrugBank, LinkedCT, HCLS KB/Neurocommons, Diseasome, DailyMed, Bioguid DOI
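The idea of a public namespace plus a private record identifier can be sketched in a few lines of Python. This is not the JSP server's code: the `namespace:identifier` URI layout, the registry contents, and the `example.org` endpoint addresses are all assumptions chosen to illustrate how a URI might be split and routed to the endpoints that hold matching content.

```python
from urllib.parse import urlsplit

# Hypothetical registry: public namespace -> SPARQL endpoints holding it.
ENDPOINTS = {
    "geneid": ["http://geneid.example.org/sparql"],
    "kegg": ["http://kegg.example.org/sparql"],
}


def split_uri(uri: str) -> tuple[str, str]:
    """Split '.../namespace:identifier' into (namespace, identifier)."""
    path = urlsplit(uri).path.lstrip("/")
    namespace, sep, identifier = path.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"not a namespace:identifier URI: {uri!r}")
    # Namespaces are treated as case-insensitive (an assumption).
    return namespace.lower(), identifier


def endpoints_for(uri: str) -> list[str]:
    """Return the endpoints registered for the URI's namespace."""
    namespace, _ = split_uri(uri)
    return ENDPOINTS.get(namespace, [])


if __name__ == "__main__":
    print(split_uri("http://example.org/geneid:3098"))
    print(endpoints_for("http://example.org/geneid:3098"))
```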
  21. Bio2RDF.war package future ● Provide more pipes to perform integrated actions without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query in a distributed and normalised manner more efficiently ● Bring together the current distributed efforts to provide a complete HTML redirection registry so that a large percentage of Bio2RDF namespaces can be redirected with ● Form ontologies describing the query type, provider, RDF normalisation rule, namespace paradigm ● Integrate and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly
  22. Bio2RDF.owl
  23. Michel Dumontier will design the next version of the Bio2RDF.owl ontology
  24. What is known about hexokinase?
  25. Submit your query... ● To a web search engine ● To existing public web sites offering data integration services ● Using Bio2RDF SPARQL endpoints: submitting a SPARQL query; using the facet browser interface from Virtuoso 6.0 server; dereferencing a Bio2RDF search URI; using a Taverna workflow composed of SPARQL queries to obtain federated results from KEGG, Entrez Gene and GO
  26. The usual non-semantic way
  27. Existing integrated search services ● EBI/EB-eye ● NCBI/Entrez ● KEGG/DBGET ● GoPubMed
  28. By submitting a SPARQL query
  29. What is known about "hexokinase" with semantics? select ?t1 ?p2 count(*) where { ?s1 ?p1 ?o1 . FILTER( bif:contains(?o1, "hexokinase")) . ?s1 a ?t1 . ?s1 ?p2 ?o2 . } ORDER BY ?t1 ?p2
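To make the shape of the result on slide 29 concrete, the Python sketch below reproduces the same counting over a small in-memory list of triples: find subjects with a value mentioning the keyword, look up their types, then count rows per (type, predicate) pair. It assumes the query is read as a count grouped by `?t1` and `?p2`. The triples are made-up toy data, and a plain case-insensitive substring test over any object value stands in for Virtuoso's `bif:contains` full-text match on literals, which is a simplification.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"


def summarize(triples, keyword):
    """Count solution rows per (type, predicate) for matching subjects."""
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((p, o))

    needle = keyword.lower()
    counts = Counter()
    for po in by_subject.values():
        # Number of (?p1, ?o1) bindings whose object mentions the keyword.
        hits = sum(1 for _, o in po if needle in o.lower())
        if hits == 0:
            continue
        types = [o for p, o in po if p == RDF_TYPE]  # ?s1 a ?t1
        for t1 in types:
            for p2, _ in po:  # ?s1 ?p2 ?o2
                counts[(t1, p2)] += hits
    return sorted(counts.items())  # mirrors ORDER BY ?t1 ?p2


if __name__ == "__main__":
    toy = [  # invented data, for illustration only
        ("ex:g1", RDF_TYPE, "ex:Gene"),
        ("ex:g1", "rdfs:label", "hexokinase 1"),
        ("ex:g1", "ex:xref", "ex:p1"),
        ("ex:p1", RDF_TYPE, "ex:Pathway"),
        ("ex:p1", "rdfs:label", "glycolysis"),
    ]
    for (t1, p2), n in summarize(toy, "hexokinase"):
        print(t1, p2, n)
```

The output is a profile of which kinds of resources mention the keyword and which predicates describe them, which is a useful first look before writing a narrower query.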
  30. Use the Virtuoso 6.0 facet browser
  31. Dereferencing a search URL
  32. How can we submit a complex query over the network of SPARQL endpoints?
  33. By building a mashup with Taverna: 1) Write your complex SPARQL query as if a global graph were available 2) Identify the needed namespaces and split the query to fetch each data source separately 3) Build a mashup using a Taverna workflow that instantiates a local triplestore 4) Execute your complex query locally on the mashup
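The four steps on slide 33 can be sketched without Taverna or a real triplestore. In the Python sketch below, each fetcher function stands in for a SPARQL query sent to one namespace's endpoint, a plain `set` stands in for the local triplestore, and the final function plays the role of the complex query run locally. All names, predicates, and data are hypothetical and chosen only to show the split, load, and join pattern.

```python
from collections import defaultdict


def fetch_kegg():
    """Stand-in for a query against a KEGG endpoint (toy data)."""
    return [("geneid:1", "ex:inPathway", "path:a")]


def fetch_go():
    """Stand-in for a query against a GO annotation endpoint (toy data)."""
    return [("geneid:1", "ex:goTerm", "go:x"),
            ("geneid:2", "ex:goTerm", "go:y")]


def build_local_store(fetchers):
    """Step 3: load each source's triples into one local store."""
    store = set()
    for fetch in fetchers.values():
        store.update(fetch())
    return store


def genes_with_pathway_and_go(store):
    """Step 4: run the cross-source join locally on the mashup."""
    pathways, terms = defaultdict(set), defaultdict(set)
    for s, p, o in store:
        if p == "ex:inPathway":
            pathways[s].add(o)
        elif p == "ex:goTerm":
            terms[s].add(o)
    return sorted(
        (gene, pw, term)
        for gene in pathways.keys() & terms.keys()
        for pw in pathways[gene]
        for term in terms[gene]
    )


if __name__ == "__main__":
    # Step 2: one fetcher per namespace the query needs.
    store = build_local_store({"kegg": fetch_kegg, "go": fetch_go})
    print(genes_with_pathway_and_go(store))
```

Only genes present in both sources survive the join, which is the point of merging the per-namespace results before querying.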
  34. The SPARQL query needed (don't try this at home, do it on the web!)
  35. Get the list of genes from the KEGG pathways of a specified taxon ● Clear graph ● Get the KEGG pathways list for a specific taxon ● For each pathway, get the genes list and import instances ● Count the number of genes found
  36. Insert GeneID genes and KEGG pathways into the local triplestore ● Get the list of genes ● Get the list of pathways ● Insert each corresponding graph into the local triplestore
  37. Insert the needed GO annotations into the local triplestore ● Get the GO annotations for each gene
  38. Finally, the needed query merging KEGG, Entrez Gene and GO together
  39. Bio2RDF resources
  40. Bio2RDF's mirrors
  41. Bio2RDF SPARQL endpoints
  42. Life Science Raw Data Now
  43. Visit our wiki: rdfiser cookbook
  44. Bio2RDF news
  45. Our 2009 objectives ● Get approval from data providers to distribute RDF dumps and publish SPARQL endpoints (UniProt, BioCyc, Pathway Commons, BIND are in) ● Start using a Virtuoso 6 cluster ● Design more services accessible with the REST protocol via our JSP package ● Recruit mirror servers ● Develop new rdfiser programs in a community effort
  46. Thanks: Jean Morissette, Nicole Tourigny ● The Bio2RDF community ● Centre de recherche du CHUL ● Université Laval ● Dumontier Lab ● QUT eResearch Center ● OpenLink Virtuoso