Your SlideShare is downloading. ×
Bio2RDF Release 2: Improved coverage, interoperability and provenance of Linked Data for the Life Sciences
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Bio2RDF Release 2: Improved coverage, interoperability and provenance of Linked Data for the Life Sciences


Published on

Published in: Technology

  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide
  • Bio2RDF facilitates data sharing and reuse in a number of ways: Access the scripts that generate our linked data. These scripts are openly licensed for modification, redistribution or use by anyone wishing to generate RDF data on their own We make use of syntactically consistent IRI patterns We do not impose a structural scheme for representing our data Bio2RDF allows for multiple mirrors host our data, a DNS “round robin” procedure balances incoming requests between participating mirrors, this prevents a single point of failure and allows for additional mirrors to be added to the network A centralized resource registry has been developed for consistent usage of namespaces and vocabularies- In the next slides I will go into a bit more detail about the first 3 points
  • In order to facilitate collaboration by the community, one of the most important developments to this second release of Bio2RDF was the creation of a publicly available code repository for all of the programs used for converting biological data to linked data.- Anyone can reuse or modify our openly (MIT) licensed code
  • In order to keep a clear link back to the original data, in our RDFized datasets we maintain the original data provider’s record identifiers by making use of the following URI pattern:- namespace: preferred short name for a biological dataset
  • In order to keep a clear link back to the original data, in our RDFized datasets we maintain the original data provider’s record identifiers by making use of the following URI pattern:- namespace: preferred short name for a biological dataset
  • All resources in every bio2RDF graph is linked to a graph that describes the provenance of the generated linked data. We make use of the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary and Dublin core vocabulary to describe items such as: URL to the source data Date of generation Licensing information (if available) URL to the script used to generate the data Download URL for linked data files URL of SPARQL endpoint
  • In order to eliminate the need for our users to run expensive queries on our endpoints we have made available several metrics that can aid in the description of the contents of our linked data sets.- These metrics have been included into each of our datasets and also serve as guides for developing queries over the data based on the data’s structure
  • As I had mentioned earlier we have included dataset metrics for every one of our endpoints. - These metrics are stored in a named graph and can be queried for as shown:
  • Here I am showing with more detail the top 10 subject type – predicate - object type frequencies that can be found in our Drugbank endpoint So for example to find drugs that participate in drug-drug interactions one would use the ddi-interactor-in predicate that associates the 1074 resources of type drug to the 10891 resources typed as drug-drug-interaction.
  • Here I present some numbers for the 19 updated datasets
  • HHPID is database of HIV-1 human protein interactions that was created to catalog all interactions between HIV-1 and human proteins published in the peer-reviewed literature. The database serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets.MGI – Mouse Genome Informatics-==-====Boutique endpoints maintained in Release 2 (will not be updated): Bio2RDF Atlas and HHPID
  • We are currently in the process of exposing Bio2RDF data using a variety of 3rd party tools- specifically, all of our endpoints can be navigated using Virtuoso’s faceted browser- We have also generated Sindice metrics that enable the use of SPARQLedautomed query builder. (This tool allows users to semiautomatically construct SPARQL queries based on namespaces used therein)-- finally it is possible to use browser to aggreate data from multiple endpoints and construct mashups
  • However, types and predicates are dataset specific and therefore not consistentWouldn’t it be nice to be able to query all of these datasets with a common nomenclature? (our objective)
  • Transcript

    • 1. Linked Data for the Life Sciences Release 2 Michel DumontierAssociate Professor, Carleton University on behalf of the Bio2RDF team Dumontier::EBI:Jan 23, 2013
    • 2. is an open source framework that makes biological data available on the emerging semantic web using a set of simple conventions Dumontier::EBI:Jan 23, 2013
    • 3. reduces the time and effort involved in data integration so that you can get to the business of doing science Dumontier::EBI:Jan 23, 2013
    • 4. the Semantic Webis the new global web of knowledge It provides standards for publishing, sharing and querying facts, expert knowledge and services It is a scalable approach to the discovery of independently formulated and highly distributed knowledge Dumontier::EBI:Jan 23, 2013
    • 5. a rapidly growing network of linked data“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.” Dumontier::EBI:Jan 23, 2013
    • 6. Link all the data!!! Dumontier::EBI:Jan 23, 2013
    • 7. Four Simple Rules for Linked Data1) Use Internationalized Resource Identifiers (IRIs;unicode) or Uniform Resource Identifiers (URIs; ascii) for names for things2) Use HTTP URIs so that people can look up those names.3) When someone looks up a URI, provide useful information about them using the standards (RDF)4) Include links to other URIs so that they can discover more things. Dumontier::EBI:Jan 23, 2013
    • 8. Bio2RDF is a framework tocreate and provide linkeddata for the life sciences Dumontier::EBI:Jan 23, 2013
    • 9. Main features of Bio2RDF Release 2• Bio2RDF conversion scripts, mapping files and web application are open source and freely available at• Bio2RDF enables (syntactic) data integration within and across datasets by using one language (RDF), having a common URI pattern and a common resource registry• 19 Release 2 datasets have provenance and endpoints feature pre-computed graph summaries for fast lookup• Bio2RDF web application enables entity resolution, query federation across an expandable distributed network of SPARQL endpoints Dumontier::EBI:Jan 23, 2013
    • 10. Bio2RDF RDFization guidelines are available at our wiki Dumontier::EBI:Jan 23, 2013
    • 11. Bio2RDF convertersare open-source and available at GitHub Dumontier::EBI:Jan 23, 2013
    • 12. Bio2RDF data are identified using simple http URI patternsWhen available, use the provider’s identifier inthe naming the resource DrugBank’s resource IRI for Leucovorin Dumontier::EBI:Jan 23, 2013
    • 13. Linked Data: You can look it up Dumontier::EBI:Jan 23, 2013
    • 14. Valid Bio2RDF namespaces are listed in a dataset registry• An initial registry of ~600 datasets is accessible through an API provided by my PHP-LIB library (available on github). It includes – Dataset title, Preferred namespace prefix, Alternative namespace prefixes• In the summer, we consolidated and curated nearly 2100 entries in a Google spreadsheet, which includes a mostly complete coverage of datasets/collections listed in Bio2RDF, MIRIAM, BioPortal, UniProt, NCBI, NAR database issue. New fields were added including: – Dataset description, organization, website, HTML template – Identifier syntax, license and rights• Working with team (Nick Juty, Camille Laibe, Nicolas Le Novere) to have a single dataset registry that we can use for both Bio2RDF and – enable automatic cross-links between Bio2RDF and Dumontier::EBI:Jan 23, 2013
    • 15. vocabulary and resource namespaces are used to describe auxiliary resources• types and predicates that are generated to support the semantic annotation are in the vocabulary namespace (type) (predicate)• n-ary relations are named in the resource namespace Dumontier::EBI:Jan 23, 2013
    • 16. The Resource Description Framework (RDF) is a formal knowledge representation language capable of expressing a statement A RDF statement consists of: – Subject: resource identified by a URI – Predicate: resource identified by a URI – Object: resource or literal drugbank:DB00650 rdf:type rdf:type drugbank_vocabulary:Drug Dumontier::EBI:Jan 23, 2013
    • 17. Every statement expands the network of linked data drugbank_vocabulary:Drug rdf:type rdfs:label drugbank:DB00650 Leucovorin [drugbank:DB00650] drugbank_vocabulary:ddi-interactor-in rdfs:label DDI between Trimethoprim and drugbank_resource:DB00440_DB00650 Leucovorin [drugbank_resource: DB00440_DB00650] rdf:typedrugbank_vocabulary:Drug-Drug-Interaction Dumontier::EBI:Jan 23, 2013
    • 18. Syntactic integration across datasets DrugBank drugbank_vocabulary:Drug rdf:type rdfs:label drugbank:DB00650 Leucovorin [drugbank:DB00650] pharmgkb_vocabulary:xref rdfs:label pharmgkb:PA450198 leucovorin [pharmgkb:PA450198] pharmgkb_vocabulary:Drug PharmGKB Dumontier::EBI:Jan 23, 2013
    • 19. You can get what links to itWhat links to DrugBank’s Leucovorin? Dumontier::EBI:Jan 23, 2013
    • 20. The Bio2RDF Network of Linked Data Dumontier::EBI:Jan 23, 2013
    • 21. Every Bio2RDF dataset now contains provenance metadata Features - Dates - Licensing - Source - Creators - Publishers - Availability Vocabularies VoID Dublin Core W3C Provenance Bio2RDF vocabulary Dumontier::EBI:Jan 23, 2013
    • 22. For every Bio2RDF dataset we pre- compute 9 descriptive metrics• total number of triples• number of unique subjects• number of unique predicates• number of unique objects• number of unique types• unique predicate-object links and their frequencies• unique predicate-literal links and their frequencies• unique subject type-predicate-object type links and their frequencies• unique subject type-predicate-literal links and their frequencies Dumontier::EBI:Jan 23, 2013
    • 23. Accessing Bio2RDF dataset metrics • Each Bio2RDF endpoint contains a named graph that holds the pre-computed metrics –[namespace]-statistics • Metrics can be queried using SPARQL, e.g.: SELECT * FROM <> WHERE { ?dataset a <> . ?dataset <> ?tc . ?dataset <> ?sc . ?dataset <> ?pc . ?dataset <> ?oc . … } Dumontier::EBI:Jan 23, 2013
    • 24. Graph summaries can also assist in query formulation Subject Type Subject Count Predicate Object Type Object CountPharmaceutical 11512 form Unit 56Drug-Transporter-Interaction 1440 drug Drug 534Drug-Transporter-Interaction 1440 transporter Target 88Drug 1266 dosage Dosage 230Patent 1255 country Country 2Drug 1127 product Pharmaceutical 11512Drug 1074 ddi-interactor-in Drug-Drug-Interaction 10891Drug 532 patent Patent 1255Drug 277 mixture Mixture 3317Dosage 230 route Route 42Drug-Target-Interaction 84 target Target 43 PREFIX drugbank_vocabulary: <> PREFIX rdfs: <> SELECT ?ddi ?d1name ?d2name WHERE { ?ddi a drugbank_vocabulary:Drug-Drug-Interaction . ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi . ?d1 rdfs:label ?d1name . ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi . ?d2 rdfs:label ?d2name. FILTER (?d1 != ?d2) } Dumontier::EBI:Jan 23, 2013
    • 25. You can use the SPARQLed queryassistant with updated endpoints graph: Dumontier::EBI:Jan 23, 2013
    • 26. Bio2RDF covers the major biological databases Dumontier::EBI:Jan 23, 2013
    • 27. Bio2RDF Release 2 – New and Updated Datasets Dataset Namespace # of triples Affymetrix affymetrix 44469611 Biomodels* biomodels 589753 Comparative Toxicogenomics Database ctd 141845167 DrugBank drugbank 1121468 NCBI Gene ncbigene 394026267 Gene Ontology Annotations goa 80028873 HUGO Gene Nomenclature Committee hgnc 836060 Homologene homologene 1281881 InterPro* interpro 999031 iProClass iproclass 211365460 iRefIndex irefindex 31042135 Medical Subject Headings mesh 4172230 NCBO BioPortal* bioportal 15384622 National Drug Code Directory* ndc 17814216 Online Mendelian Inheritance in Man omim 1848729 Pharmacogenomics Knowledge Base pharmgkb 37949275 SABIO-RK* sabiork 2618288 Saccharomyces Genome Database sgd 5551009 NCBI Taxonomy taxon 17814216 Total 19 1010758291 Dumontier::EBI:Jan 23, 2013
    • 28. Status of Bio2RDF Release 1 datasetsDataset StatusAtlas Maintained – will not be updatedBIND Deprecated – in iRefIndexBioCarta Deprecated – in Pathway CommonsBioCyc Deprecated – in Pathway CommonsEC Deprecated – in Gene Ontology/UniProtGenBank Maintained – will be updatedHHPID Maintained – will not be updatedINOH Deprecated – in Pathway CommonsKEGG Maintained – will not be updatedMGI Maintained – will be updatedPubmed Maintained – will be updatedPID Deprecated – in Pathway CommonsReactome Deprecated – in Pathway CommonsRefSeq Maintained – will be updated Dumontier::EBI:Jan 23, 2013
    • 29. A PHP-based library acts a point of integration• Provides a set of APIs – to produce RDF and OWL statements – to generate valid Bio2RDF URIs by checking against a dataset registry – to generate dataset provenance Dumontier::EBI:Jan 23, 2013
    • 30. Oh Bio2RDF, how can I access you?Let me count the ways:1. Downloads (data, stats, virtuoso db)2. Web interface (lookup + services)3. SPARQL endpoint4. SPARQLed editor5. Virtuoso Faceted Browser Dumontier::EBI:Jan 23, 2013
    • 31. Use virtuoso’s built in facetedbrowser to construct increasinglycomplex queries with little effort Dumontier::EBI:Jan 23, 2013
    • 32. Heterogeneous biological data on the semantic web is difficult to queryQuestion: Find all proteins that interact with beta amyloid (uniprot:P05067) UniProt Protein PDB Protein ?SELECT * WHERE { iRefIndex Protein ?protein a bio2rdf:Protein . ?protein bio2rdf:interacts_with uniprot:P05067 .} Physical interaction? Genetic interaction? Pathway interaction? Dumontier::EBI:Jan 23, 2013
    • 33. RDF-based Linked Data is a great first step, but it’s not enough. From linked data to linked knowledge through syntactic and semantic normalization. Dumontier::EBI:Jan 23, 2013
    • 34. ontology as a strategy to formally represent andintegrate knowledge Dumontier::EBI:Jan 23, 2013
    • 35. SIO provides an OWL ontology for the representation of diverse biomedical knowledge Dumontier::EBI:Jan 23, 2013
    • 36. Dumontier::EBI:Jan 23, 2013
    • 37. Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO) uniprot:P05067 uniprot:P05067 refseq:NP_009225.1 is a is a uniprot:Protein uniprot:Protein refseq:Protein refseq:Protein dataset is a is a is a sio:protein ontology Knowledge BaseQuerying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo andMichel Dumontier. Bio-ontologies 2012. Dumontier::EBI:Jan 23, 2013
    • 38. Bio2RDF types include processes, material entities and informational entitiesCTD: Chemical, Disease, Chemical-Disease Interaction, Chemical- Gene InteractionNCBIGene: Gene, Protein, Model Organism, PublicationHGNC: Accession Number, Gene, Gene SymboliRefIndex: Protein Complex, Protein InteractionMGI: Gene Marker, Gene SymbolPharmGKB: Gene-Disease Associations, Disease, Drug, GeneSGD: Enzyme, Pathway, Protein, RNA, Reaction,Location, Experiment Dumontier::EBI:Jan 23, 2013
    • 39. Bio2RDF and SIO powered SPARQL 1.1 federated query:Find chemicals in CTD and proteins in SGD that participate in the same GO process SELECT ?chem, ?prot, ?proc FROM <> WHERE { ?chemical a sio:chemical-entity. ?chemical rdfs:label ?chem. ?chemical sio:is-participant-in ?process. ?process rdfs:label ?proc. FILTER regex (?process, "") SERVICE <> { ?protein a sio:protein . ?protein sio:is-participant-in ?process. ?protein rdfs:label ?prot . } } Dumontier::EBI:Jan 23, 2013
    • 40. Bio2RDF Release 2 – A summary• Updated data conversion source code to use PHP API (all available through GitHub)• Simple Bio2RDF IRI design patterns that facilitate syntactic consistency and interoperability backed by simple registry• Dataset provenance and metrics• We welcome contributions from the community • Join our mailing list at Dumontier::EBI:Jan 23, 2013
    • 41. Future Directions• Aiming for twice yearly release schedule• Update large scale datasets – RefSeq, Genbank, PubMed, PDB• Incorporate EBI & W3C Linking Open Drug Data (LODD) effort – SIDER (beta), TCM (RDF available), CHEMBL (in dev), OMIM (released), DailyMed (RDF available), LinkedCT (beta)• Extended dataset coverage by tapping into existing endpoints (uniprot, bioportal, ebi-rdf?)• OpenBioCloud w/DERI + SindiceTech• Showcase with other third party tools Dumontier::EBI:Jan 23, 2013
    • 42. AcknowledgementsBio2RDF SADI: Christopher Baker, Melanie Courtot, JosePeter Ansell, Francois Belleau, Allison Cruz-Toledo, Steve Etlinger, NicheallaCallahan, Jacques Corbeil, Jose Cruz- Keath, Artjom Klein, Luke McCarthy, SilvaneToledo, Alex De Leon, Steve Etlinger, James Paixao, Ben Vandervalk, Natalia Villanueva-Hogan, Nichealla Keath, Jean Rosales, Mark WilkinsonMorissette, Marc-Alexandre Nolin, NicoleTourigny, Philippe Rigault and Paul Roe W3C HCLS: J Luciano, B Andersson, C Batchelor, O Bodenreider, T Clark, C Denney, C Domarew, TOpenBioCloud Gambet, L Harland, A Jentzsch, V Kashyap, PDana Klassen and Giovanni Tumarello Kos, J Kozlovsky, T Lebo, SM Marshall, JP McCusker, DL McGuinness, C Ogbuji, E Pichler, R Powers, E Prud hommeaux, M Samwald, L Schriml, PJ Tonellato, PL Whetzel, J Zhao, S Stephens, C Denney, J Luciano, J McGurk, Lynn Schriml, and Peter J. Tonellato. Dumontier::EBI:Jan 23, 2013
    • 43. Website: Presentations: Dumontier::EBI:Jan 23, 2013