Bio2RDF: A biological knowledge base for the Semantic Web

A presentation given at the University of Toronto on June 18, 2009 describing the current state of Bio2RDF with respect to biological knowledge representation on the semantic web as linked data with services to describe and answer questions.


  1. Bio2RDF: A biological knowledge base for the Semantic Web. Michel Dumontier, François Belleau, Marc-Alexandre Nolin, Peter Ansell
  2. Web search for biological information is hit or miss
  3. Introducing... something you can look up and search for with rich descriptions
  4. Surface web: 167 terabytes. Deep web: 91,000 terabytes. Ratio: 545-to-one
  5. Bio-portals provide database access and give better results
  6. We want to simultaneously query the 1000+ biological databases
  7. Data silos – not made for sharing
  8. How do we integrate these resources?
  9. Bio2RDF provides the methodology to create and glue these different networks.
  10. Bio2RDF is building the linked data web for biological data
  11. Contributing to a growing linked data web
  12. What is the semantic web? The Semantic Web is a web of knowledge. It is about standard formats for representing and querying knowledge drawn from diverse sources and making statements about real objects.
  13. Goals for the Semantic Web • Provide a common knowledge representation (syntax & semantics) • Facilitate publishing, data integration and information retrieval • Make possible semantically interoperable web applications and services • Enable the answering of questions across global repositories of knowledge
  14. Resource Description Framework (RDF) • Allows one to express propositions and reason about them • Uniform Resource Identifiers (URIs) are entity names, e.g. http://purl.uniprot.org/uniprot/Q16665 • An RDF statement consists of: – Subject: resource identified by a URI (u:Q16665) – Predicate: resource identified by a URI (rdf:type) – Object: resource or literal (Protein)
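The subject/predicate/object structure on slide 14 can be sketched in a few lines of Python. This is only an illustration: the `Triple` type and `to_ntriples` helper are hypothetical names, not part of Bio2RDF, and the serializer assumes all three terms are URIs (no literals).

```python
from typing import NamedTuple

UNIPROT = "http://purl.uniprot.org/uniprot/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type for illustration)."""
    subject: str    # resource identified by a URI
    predicate: str  # resource identified by a URI
    obj: str        # resource (a literal would need different quoting)


def to_ntriples(t: Triple) -> str:
    """Serialize a triple as one N-Triples line, assuming every term is a URI."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# The statement from the slide: u:Q16665 rdf:type Protein
statement = Triple(UNIPROT + "Q16665", RDF_TYPE, UNIPROT + "Protein")
line = to_ntriples(statement)
print(line)
```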
  15. Semantic Knowledge Base • Fact: Q16665 rdf:type Protein • Ontology: Protein rdfs:subClassOf Molecule • Knowledge base = facts + ontology
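One way to read slide 15 is that combining a fact with an ontology lets a reasoner derive new statements. The sketch below applies the standard RDFS subclass rule (if x has type C and C is a subclass of D, then x has type D). The function name and the single-parent dictionary are simplifying assumptions for illustration only.

```python
def infer_types(facts, subclass_of):
    """Return the facts plus every rdf:type triple entailed by rdfs:subClassOf.

    `subclass_of` maps a class to its single parent class (an assumption made
    to keep the sketch short; real ontologies allow several parents).
    """
    inferred = set(facts)
    changed = True
    while changed:  # repeat until no new triple appears (transitive closure)
        changed = False
        for s, p, o in list(inferred):
            if p == "rdf:type" and o in subclass_of:
                new = (s, "rdf:type", subclass_of[o])
                if new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


facts = {("u:Q16665", "rdf:type", "Protein")}   # the fact
ontology = {"Protein": "Molecule"}              # Protein rdfs:subClassOf Molecule
knowledge_base = infer_types(facts, ontology)
print(sorted(knowledge_base))
```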
  16. RDF/XML:
      <?xml version="1.0"?>
      <!DOCTYPE rdf:RDF [<!ENTITY u "http://purl.uniprot.org/uniprot/">]>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:u="http://purl.uniprot.org/uniprot/">
        <rdf:Description rdf:about="&u;Q16665">
          <rdf:type rdf:resource="&u;Protein"/>
        </rdf:Description>
      </rdf:RDF>
      N3:
      @prefix u: <http://purl.uniprot.org/uniprot/> .
      u:Q16665 a u:Protein .
  17. Syntactic Data Integration depends on consistent naming • UniProt: u:Q16665 has name HIF1-alpha • Gene Ontology: u:Q16665 located in go:nucleus • BIND: u:Q16665 interacts with u:vhl • Unified view: u:Q16665 has name HIF1-alpha, located in go:nucleus, interacts with u:vhl
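The unified view on slide 17 follows directly from the three sources using the same identifier for the same protein. A minimal sketch, with the three source lists transcribed from the slide and a hypothetical `unified_view` helper:

```python
from collections import defaultdict

# Statements as they appear on the slide, one list per source database.
uniprot = [("u:Q16665", "has name", "HIF1-alpha")]
gene_ontology = [("u:Q16665", "located in", "go:nucleus")]
bind = [("u:Q16665", "interacts with", "u:vhl")]


def unified_view(*sources):
    """Group (predicate, object) pairs by subject across all sources.

    This only works because every source names the protein u:Q16665;
    with inconsistent names the statements would not line up.
    """
    view = defaultdict(list)
    for source in sources:
        for subject, predicate, obj in source:
            view[subject].append((predicate, obj))
    return dict(view)


view = unified_view(uniprot, gene_ontology, bind)
print(view)
```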
  18. Semantic Data Integration depends on accurate typing • u:Q16665 rdf:type Protein • u:vhl rdf:type Protein
  19. Linked Data http://www.w3.org/DesignIssues/LinkedData
  20. Bio2RDF Design Principles http://bio2rdf.wiki.sourceforge.net/Banff+Manifesto
  21. Over 1800 namespaces. Compiled from: NAR, BioMoby, UniProt, NCBI, SRS
  22. Naming Convention • http://bio2rdf.org/namespace:identifier • e.g. http://bio2rdf.org/pdb:1AM0, http://bio2rdf.org/gi:99
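The naming convention on slide 22 is simple enough to express as a pair of helper functions. The function names below are hypothetical, and the parser assumes the namespace itself contains no colon.

```python
from typing import Tuple

BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a URI following the http://bio2rdf.org/namespace:identifier pattern."""
    return f"{BASE}{namespace}:{identifier}"


def parse_bio2rdf_uri(uri: str) -> Tuple[str, str]:
    """Split a Bio2RDF-style URI back into (namespace, identifier)."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF URI: {uri}")
    # Split on the first colon only, so identifiers may contain colons.
    namespace, sep, identifier = uri[len(BASE):].partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected namespace:identifier in {uri}")
    return namespace, identifier


print(bio2rdf_uri("pdb", "1AM0"))
print(parse_bio2rdf_uri("http://bio2rdf.org/gi:99"))
```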
  23. Bio2RDF network = 2.3 BT (billion triples)
  24. Table of Bio2RDF namespaces:
      Namespace           Domain               Updated     Triples            Topics   Namespaces  SPARQL
      Affymetrix          Probeset             loading     45560115           1708777  20          affymetrix
      BIND                Network information  09/04/1930  -                  -        -           bind
      BioCYC              Pathway/BioPAX       -           4418699622 + xref  -        -           biocyc
      ChEBI@EBI           Chemistry            09/03/2025  4764030            50377    25          chebi
      CPD@KEGG            Chemistry            09/04/2014  177199             14071    10          kegg
      cPath               Pathway/BioPAX       09/04/2007  28052098           -        51          cpath
      DBpedia             Encyclopedia         09/03/2023  190790             0        21          dbpedia
      DR@KEGG             Drug                 09/04/2014  116822             8117     8           dr
      EC@KEGG             Enzyme               09/04/2014  556888             4245     4           ec
      EC@UniProt          Enzyme               09/04/2014  36109              -        -           enzyme
      GeneID@NCBI         Gene                 loading     1.73E+08           -        86          geneid
      GL@KEGG             Chemistry            09/04/2014  94148              10965    2           kegg
      GO                  Ontology             09/03/2015  8188649            804979   144         go
      HGNC                Genome               09/03/2025  1085662            125256   14          hgnc
      HomoloGene@NCBI     Homolog              09/03/1931  6598206            -        7           homologene
      IProClass@PIR       Protein              loading     1.92E+08           -        19          iproclass
      MGI                 Genome               09/03/2025  3089976            -        12          mgi
      OBO                 Ontology             09/03/2027  4507016            4954332  165         obo
      OMIM@NCBI           Disease              09/03/2024  1048053            32102    7           omim
      Path@KEGG           Pathway              09/03/2028  50793314           -        -           kegg
      PDB                 Protein              09/03/2021  1215254            44569    2           pdb
      Pubmed@UniProt      Article              09/03/1931  -                  -        -           pubmed
      Pubmed@NCBI         Article              09/03/1931  -                  -        -           pubmed
      Reactome            Pathway/BioPAX       09/04/2015  57527092           -        22          reactome
      RN@KEGG             Pathway              09/04/2015  110971             7755     5           kegg
      SGD                 Genome               09/04/2015  1437648            -        13          sgd
      Taxonomy@UniProt    Taxon                09/04/2014  3230933            -        -           taxonomy
      UniParc@UniProt     Sequence             09/04/2009  5.59E+08           -        53          uniparc
      UniPathway@UniProt  Pathway              09/04/2014  8508               -        -           unipathway
      UniProtKB@UniProt   Protein              09/04/2016  4.56E+08           -        135         uniprot
      UniRef@UniProt      Homolog              09/04/2008  3.9E+08            -        5           uniref
      UniSTS@NCBI         Marker               09/03/1931  7542235            -        7           unists
  25. Mouse and Human Atlas (65 MT, million triples)
  26. Free, Open Source software
  27. Bio2RDF Software • http://sourceforge.net/projects/bio2rdf/ • Virtuoso Triple Store gives SPARQL endpoint • Bio2RDF software transforms URIs to SPARQL queries directed to one or more endpoints • RDFizers – transform legacy data into RDF – OMIM, KEGG • SW DBs – rules to create Bio2RDF URIs – DBpedia, BioPAX
  28. SPARQL Endpoints • http://ns.bio2rdf.org/sparql • http://atlas.bio2rdf.org/sparql
  29. Services • Describe a resource – http://bio2rdf.org/ns:id • Global services over federated endpoints – http://bio2rdf.org/links/ns:id – http://bio2rdf.org/search/term • Targeted services to a specific endpoint – http://bio2rdf.org/linksns/ns/ns2:id – http://bio2rdf.org/searchns/ns/term
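The five URL patterns on slide 29 can be written as small builder functions. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the first path segment of the targeted services is read here as the namespace of the endpoint being queried, and percent-encoding the search term is an added precaution that the slide does not mention.

```python
from urllib.parse import quote

BASE = "http://bio2rdf.org"


def describe_url(ns: str, identifier: str) -> str:
    """Describe a resource: /ns:id"""
    return f"{BASE}/{ns}:{identifier}"


def links_url(ns: str, identifier: str) -> str:
    """Global links service over federated endpoints: /links/ns:id"""
    return f"{BASE}/links/{ns}:{identifier}"


def search_url(term: str) -> str:
    """Global search service over federated endpoints: /search/term"""
    return f"{BASE}/search/{quote(term)}"


def linksns_url(endpoint_ns: str, ns2: str, identifier: str) -> str:
    """Links service targeted at one endpoint: /linksns/ns/ns2:id"""
    return f"{BASE}/linksns/{endpoint_ns}/{ns2}:{identifier}"


def searchns_url(endpoint_ns: str, term: str) -> str:
    """Search service targeted at one endpoint: /searchns/ns/term"""
    return f"{BASE}/searchns/{endpoint_ns}/{quote(term)}"


print(search_url("hexokinase"))
print(linksns_url("go", "pdb", "1AM0"))
```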
  30. Describe service: http://bio2rdf.org/ns:id • Corresponding SPARQL query: CONSTRUCT { ?s ?p ?o . } WHERE { ?s ?p ?o . FILTER(?s = <http://bio2rdf.org/ns:id>). } • Sent to http://ns.bio2rdf.org/sparql?query=... (DNS subdomain resolution service)
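The translation on slide 30, from a resource URI to a SPARQL request, might look roughly like the following. The function name is hypothetical, and the code assumes that `ns` in `http://ns.bio2rdf.org/sparql` stands for the namespace of the resource and that the query is passed in a standard URL-encoded `query` parameter. It builds the request URL only and sends nothing.

```python
from urllib.parse import urlencode


def describe_request(namespace: str, identifier: str):
    """Return (endpoint, query, url) for describing namespace:identifier."""
    resource = f"http://bio2rdf.org/{namespace}:{identifier}"
    # CONSTRUCT query from the slide, with the resource URI filled in.
    query = (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . FILTER(?s = <" + resource + ">) }"
    )
    # Assumption: the namespace selects the endpoint via its DNS subdomain.
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    url = endpoint + "?" + urlencode({"query": query})
    return endpoint, query, url


endpoint, query, url = describe_request("pdb", "1AM0")
print(url)
```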
  31. Search Service http://bio2rdf.org/search/hexokinase
  32. Virtuoso 6.0 Facet Browsing http://lod.openlinksw.com/
  33. Multiple Ways To Represent Knowledge. Fig. 2. Three ways to model the relationship between a protein and the volume it occupies.
  34. Fig. 1. From linked data to linked knowledge through syntactic and semantic normalization.
  35. Ontology as Strategy
  36. OWL Has Explicit Semantics • Can therefore be used to capture knowledge in a machine-understandable way
  37. A generalized Biological Data Model
  38. Semantic normalization will improve facet browsing and question answering
  39. You want to join the knowledge web
  40. Share your data
  41. Bridge your data with others in semantic communities (data networks).
  42. Time-sensitive or frequently updated data is one way to encourage more visits.
  43. Bioinformatics Discovery Registry • Part of the SharedName initiative to provide stable URI patterns for data records • We add the relationship between entities and records • Discovery Service: the registry links entities to data records, their formats (RDF/XML, HTML, etc.) and providers (Bio2RDF, UniProt): http://registry.semanticscience.org/ns:id • Redirection Service: automatic redirection to the data provider's document: http://registry.semanticscience.org/doc/provider/format/ns:id
  44. Build a knowledge base from a series of questions
  45. Web-based Knowledge Discovery: a very painful process (Carole Goble, ISWC 2005)
  46. The Knowledge Web • Merging data & services • Reasoning & question answering • Persistent (RESTful) • Trust & Security: data consumers must be able to rely upon your data to use it as a foundation for their own applications.
  47. 2009 Goals • Add more data! – Standardize RDFizers – Enrichment from small producer data! • Design more RESTful services (Workflow) • Start using Virtuoso 6 cluster • Add mirrors • Approval from data providers to distribute RDF dumps and publish SPARQL endpoints – Confirmed: UniProt, BioCyc, Pathway Commons, BIND
  48. Triplified Data and Virtuoso DB http://quebec.bio2rdf.org/download
  49. RDFizer Cookbook http://bio2rdf.wiki.sourceforge.net/
  50. BIO2RDF Materials
  51. Thanks To: • The Bio2RDF community • Dumontier Lab – Alex De Leon, Jose Cruz, Natalia Villanueva-Rosales • Quebec Researchers – Francois Belleau, Marc-Alexandre Nolin • Australian Researchers – Peter Ansell • OpenLink Virtuoso Team
  52. dumontierlab.com michel_dumontier@carleton.ca
