  • Bio2RDF facilitates data sharing and reuse in several ways. (1) The scripts that generate our linked data are openly licensed, so anyone may modify, redistribute or use them to generate RDF data on their own. (2) We make use of syntactically consistent IRI patterns. (3) We do not impose a structural scheme for representing our data. (4) Bio2RDF allows multiple mirrors to host our data: a DNS “round robin” procedure balances incoming requests between participating mirrors, which prevents a single point of failure and allows additional mirrors to be added to the network. (5) A centralized resource registry has been developed for consistent usage of namespaces and vocabularies. In the next slides I will go into a bit more detail about the first three points.
  • The Bio2RDF project transforms silos of life science data into a globally distributed network of linked data for biological knowledge discovery.
  • In order to keep a clear link back to the original data, our RDFized datasets maintain the original data provider’s record identifiers by making use of the following URI pattern: http://bio2rdf.org/namespace:identifier, where namespace is the preferred short name for a biological dataset.
  • Every resource in every Bio2RDF graph is linked to a graph that describes the provenance of the generated linked data. We make use of the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary and the Dublin Core vocabulary to describe items such as: the URL of the source data, the date of generation, licensing information (if available), the URL of the script used to generate the data, the download URL for the linked data files, and the URL of the SPARQL endpoint.
  • Here I present some numbers for the 19 updated datasets
  • Here I am showing in more detail the top 10 subject type - predicate - object type frequencies that can be found in our DrugBank endpoint. For example, to find drugs that participate in drug-drug interactions, one would use the ddi-interactor-in predicate, which associates the 1074 resources of type Drug with the 10891 resources typed as Drug-Drug-Interaction.
  • We are currently in the process of exposing Bio2RDF data through a variety of third-party tools. Specifically, all of our endpoints can be navigated using Virtuoso’s faceted browser. We have also generated Sindice metrics that enable the use of the SPARQLed automated query builder (this tool allows users to semi-automatically construct SPARQL queries based on the namespaces used therein). Finally, it is possible to use the Sig.ma browser to aggregate data from multiple endpoints and construct mashups.
  • However, types and predicates are dataset-specific and therefore not consistent across datasets. Wouldn’t it be nice to be able to query all of these datasets with a common nomenclature? (This is our objective.)
    1. Bio2RDF Release 2: Improved coverage, interoperability and provenance of Life Science Linked Data
       Linked Data for the Life Sciences
       Alison Callahan (1), Jose Cruz-Toledo (1), Peter Ansell (2), Michel Dumontier (1)
       (1) Carleton University, (2) University of Queensland
       ESWC2013::Bio2RDF Release 2
    2. Bio2RDF is an open source framework that uses simple conventions to produce and provide biological linked data on the emerging semantic web
    3. Bio2RDF reduces the time and effort involved in data integration so that you can get to doing science
    4. Main features of Bio2RDF Release 2
       • Bio2RDF conversion scripts, mapping files and web application are open source and freely available at http://github.com/bio2rdf
       • Bio2RDF enables (syntactic) data integration within and across datasets by using one language (RDF), having a common URI pattern and a common resource registry
       • 19 Release 2 datasets have provenance, and endpoints feature pre-computed graph summaries for fast lookup
       • Bio2RDF web application enables entity resolution and query federation across an expandable distributed network of SPARQL endpoints
    5. At the heart of Linked Data for the Life Sciences
    6. Bio2RDF data are identified using simple http URI patterns
       Bio2RDF data are identified by Internationalized Resource Identifiers (IRIs) of the form:
       • http://bio2rdf.org/namespace:identifier for source data with an assigned identifier
       • http://bio2rdf.org/namespace_resource:identifier for source data without an assigned identifier
       • http://bio2rdf.org/namespace_vocabulary:identifier for dataset-specific types and relations
       where namespace comes from a curated registry of datasets, hence enabling simple syntax-based integration in and across datasets.
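The three IRI patterns on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bio2RDF's own code: the helper names `bio2rdf_iri` and `split_bio2rdf_iri` and the `kind` parameter are hypothetical, and the namespace is not checked against the curated registry.

```python
from typing import Tuple

BASE = "http://bio2rdf.org/"

# Suffix appended to the namespace for each of the three IRI patterns.
_SUFFIX = {"record": "", "resource": "_resource", "vocabulary": "_vocabulary"}


def bio2rdf_iri(namespace: str, identifier: str, kind: str = "record") -> str:
    """Build a Bio2RDF-style IRI (hypothetical helper).

    kind="record"     -> source data with an assigned identifier
    kind="resource"   -> source data without an assigned identifier
    kind="vocabulary" -> dataset-specific types and relations
    """
    if kind not in _SUFFIX:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{BASE}{namespace}{_SUFFIX[kind]}:{identifier}"


def split_bio2rdf_iri(iri: str) -> Tuple[str, str]:
    """Split a Bio2RDF-style IRI into (prefix, identifier)."""
    if not iri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF IRI: {iri!r}")
    # Split on the first colon after the base; identifiers may contain more colons.
    prefix, sep, identifier = iri[len(BASE):].partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"malformed Bio2RDF IRI: {iri!r}")
    return prefix, identifier
```

For instance, `bio2rdf_iri("drugbank", "DB00650")` yields the DrugBank record IRI used on the following slides, and `split_bio2rdf_iri` recovers the prefix and identifier from it.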
    7. Data are described through machine-understandable statements
       drugbank:DB00650 rdf:type drugbank_vocabulary:Drug
       drugbank:DB00650 rdfs:label "Leucovorin [drugbank:DB00650]"
       drugbank:DB00650 drugbank_vocabulary:ddi-interactor-in drugbank_resource:DB00440_DB00650
       drugbank_resource:DB00440_DB00650 rdf:type drugbank_vocabulary:Drug-Drug-Interaction
       drugbank_resource:DB00440_DB00650 rdfs:label "DDI between Trimethoprim and Leucovorin [drugbank_resource:DB00440_DB00650]"
    8. The linked data network expands with inter-dataset statements
       DrugBank:
       drugbank:DB00650 rdf:type drugbank_vocabulary:Drug
       drugbank:DB00650 rdfs:label "Leucovorin [drugbank:DB00650]"
       PharmGKB:
       pharmgkb:PA450198 rdf:type pharmgkb_vocabulary:Drug
       pharmgkb:PA450198 rdfs:label "leucovorin [pharmgkb:PA450198]"
       pharmgkb:PA450198 pharmgkb_vocabulary:xref drugbank:DB00650
    9. Linked Data: You can look it up
    10. You can get what links to it
        What links to DrugBank’s Leucovorin?
        http://bio2rdf.org/linksns/drugbank/drugbank:DB00650
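Judging only from the single example on this slide, the lookup address appears to be the `linksns` path followed by a dataset namespace and then the target `namespace:identifier`. The Python sketch below assumes that layout, which the slide does not spell out; `links_url` is a hypothetical helper, not part of the Bio2RDF web application.

```python
from urllib.parse import quote

LINKS_BASE = "http://bio2rdf.org/linksns/"


def links_url(search_namespace: str, target: str) -> str:
    """Build a "what links to it" address (assumed layout, see lead-in).

    search_namespace: dataset namespace placed right after "linksns/"
    target:           the resource being looked up, e.g. "drugbank:DB00650"
    """
    # Keep the colon in the target readable; percent-encode anything unsafe.
    return f"{LINKS_BASE}{quote(search_namespace, safe='')}/{quote(target, safe=':')}"
```

Called as `links_url("drugbank", "drugbank:DB00650")`, this reproduces the address shown on the slide.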
    11. Every Bio2RDF dataset now contains provenance metadata
        Features: entity-dataset link, creator, publisher, date created, license & rights, source, availability, SPARQL endpoint, data dump
        Vocabularies: VoID, Dublin Core, W3C Provenance, Bio2RDF vocabulary
        Links: data item void:inDataset Bio2RDF dataset; Bio2RDF dataset prov:wasDerivedFrom source dataset
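The entity-to-dataset links on this slide can be sketched as plain triples. The Python fragment below is an illustration only: `provenance_triples` and the example dataset IRIs are hypothetical placeholders, while `void:inDataset` and `prov:wasDerivedFrom` are the two predicates named on the slide, written out with their standard namespace IRIs.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def provenance_triples(
    items: Iterable[str], bio2rdf_dataset: str, source_dataset: str
) -> List[Triple]:
    """Link every data item to its Bio2RDF dataset, and that dataset to its source."""
    triples = [(item, VOID_IN_DATASET, bio2rdf_dataset) for item in items]
    triples.append((bio2rdf_dataset, PROV_WAS_DERIVED_FROM, source_dataset))
    return triples


# Hypothetical dataset IRIs, used only for illustration.
example = provenance_triples(
    ["http://bio2rdf.org/drugbank:DB00650"],
    "http://example.org/dataset/bio2rdf-drugbank",
    "http://example.org/dataset/drugbank-source",
)
```

Keeping provenance as ordinary triples means it can be queried through the same SPARQL endpoint as the data it describes.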
    12. Bio2RDF Release 2 – New and Updated Datasets
        Dataset                                Namespace    # of triples
        Affymetrix                             affymetrix   44469611
        Biomodels*                             biomodels    589753
        Comparative Toxicogenomics Database    ctd          141845167
        DrugBank                               drugbank     1121468
        NCBI Gene                              ncbigene     394026267
        Gene Ontology Annotations              goa          80028873
        HUGO Gene Nomenclature Committee       hgnc         836060
        Homologene                             homologene   1281881
        InterPro*                              interpro     999031
        iProClass                              iproclass    211365460
        iRefIndex                              irefindex    31042135
        Medical Subject Headings               mesh         4172230
        NCBO BioPortal*                        bioportal    15384622
        National Drug Code Directory*          ndc          17814216
        Online Mendelian Inheritance in Man    omim         1848729
        Pharmacogenomics Knowledge Base        pharmgkb     37949275
        SABIO-RK*                              sabiork      2618288
        Saccharomyces Genome Database          sgd          5551009
        NCBI Taxonomy                          taxon        17814216
        Total                                  19           1,010,758,291
    13. Inter-dataset connectivity
    15. The Wider Network of Bio2RDF Linked Data
    19. Graph summaries in query formulation
        PREFIX drugbank_vocabulary: <http://bio2rdf.org/drugbank_vocabulary:>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?ddi ?d1name ?d2name
        WHERE {
          ?ddi a drugbank_vocabulary:Drug-Drug-Interaction .
          ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi .
          ?d1 rdfs:label ?d1name .
          ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi .
          ?d2 rdfs:label ?d2name .
          FILTER (?d1 != ?d2)
        }
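A graph summary row of the form (subject type, predicate, object type) maps naturally onto a three-pattern query. As a rough sketch of that idea in Python, the hypothetical helper `query_from_summary` below turns one such row into a SELECT query; the function and its `limit` parameter are assumptions made for illustration and are not part of Bio2RDF.

```python
def query_from_summary(
    subject_type: str, predicate: str, object_type: str, limit: int = 10
) -> str:
    """Build a SPARQL SELECT query from one graph-summary row (full IRIs expected)."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT ?s ?o\n"
        "WHERE {\n"
        f"  ?s a <{subject_type}> .\n"
        f"  ?s <{predicate}> ?o .\n"
        f"  ?o a <{object_type}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


VOCAB = "http://bio2rdf.org/drugbank_vocabulary:"

# Summary row taken from the DrugBank example: Drug, ddi-interactor-in, Drug-Drug-Interaction.
ddi_query = query_from_summary(
    VOCAB + "Drug", VOCAB + "ddi-interactor-in", VOCAB + "Drug-Drug-Interaction"
)
```

The resulting string can be pasted into a SPARQL endpoint's query form as a starting point and then extended by hand, for example with the label patterns used in the query on this slide.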
    20. You can use the SPARQLed query assistant with updated endpoints
        http://sindicetech.com/sindice-suite/sparqled/
        graph: http://sindicetech.com/analytics
    21. Use Virtuoso’s built-in faceted browser to construct increasingly complex queries with little effort
    22. Federated Queries over independent SPARQL endpoints
        SPARQL Endpoint: http://bioportal.bio2rdf.org/sparql
        # get all biochemical reactions in biomodels that are kinds of "protein catabolic process",
        # as defined by the gene ontology (in bioportal endpoint)
        SELECT ?go ?label count(distinct ?x)
        WHERE {
          ?go rdfs:label ?label .
          ?go rdfs:subClassOf ?tgo OPTION (TRANSITIVE) .
          ?tgo rdfs:label ?tlabel .
          FILTER regex(?tlabel, "^protein catabolic process")
          service <http://biomodels.bio2rdf.org/sparql> {
            ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
            ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
          } # end service
        }
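A federated query hands part of its graph pattern to a remote endpoint through a SERVICE clause. As a minimal sketch in Python, a hypothetical helper `service_block` (an assumption for illustration, not part of Bio2RDF) that wraps triple patterns for a given endpoint might look like this:

```python
from typing import Iterable


def service_block(endpoint: str, patterns: Iterable[str], indent: str = "  ") -> str:
    """Wrap triple patterns in a SPARQL 1.1 SERVICE clause for one endpoint."""
    body = "\n".join(f"{indent}{pattern}" for pattern in patterns)
    return f"SERVICE <{endpoint}> {{\n{body}\n}}"


# The remote part of the query on this slide, rebuilt with the helper.
biomodels_part = service_block(
    "http://biomodels.bio2rdf.org/sparql",
    [
        "?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .",
        "?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .",
    ],
)
```

Building the remote block separately keeps the endpoint address in one place, so the same patterns can be pointed at a different mirror by changing a single argument.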
    23. Heterogeneous biological data on the semantic web is difficult to query
        Question: Find all proteins that interact with beta amyloid
        SELECT * WHERE {
          ?protein a bio2rdf:Protein .
          ?protein bio2rdf:interacts_with bio2rdf:beta-amyloid .
        }
        Protein: UniProt Protein? PDB Protein? iRefIndex Protein?
        Interaction: Physical interaction? Pathway interaction? Genetic interaction?
    24. Ontology as a strategy to formally represent and integrate knowledge
    25. Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO)
        [Figure: a knowledge base combining a dataset layer and an ontology layer through "is a" links. Nodes shown: uniprot:P05067, uniprot:Protein, refseq:Protein, pharmgkb:PA30917, pharmgkb:Gene, omim:189931, omim:Gene, sio:gene.]
        Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012.
    26. SRIQ(D)
        • 10700+ axioms
        • 1300+ classes
        • 201 object properties (inc. inverses)
        • 1 datatype property
    27. Bio2RDF and SIO powered SPARQL 1.1 federated query: Find chemicals in CTD and proteins in SGD that participate in the same GO process
        SELECT ?chem ?prot ?proc
        FROM <http://bio2rdf.org/ctd>
        WHERE {
          ?chemical a sio:chemical-entity .
          ?chemical rdfs:label ?chem .
          ?chemical sio:is-participant-in ?process .
          ?process rdfs:label ?proc .
          FILTER regex(str(?process), "http://bio2rdf.org/go:")
          SERVICE <http://sgd.bio2rdf.org/sparql> {
            ?protein a sio:protein .
            ?protein sio:is-participant-in ?process .
            ?protein rdfs:label ?prot .
          }
        }
    28. Bio2RDF RDFization guidelines are available at our wiki
        https://github.com/bio2rdf/bio2rdf-scripts/wiki/RDFization-Guide-v1.1
    29. What the future holds
        • Aiming for a twice-yearly release schedule. Next release will include large datasets (>15B RefSeq, Genbank, PubMed, PDB)
        • Working with identifiers.org to create a common registry with 2200 entries
        • Spiking in identifiers.org and original data URIs, where available
        • Consolidating provenance/metrics with OpenPHACTS
        • Incorporate W3C Linking Open Drug Data (LODD) effort
          – OMIM (released), SIDER (beta), CHEMBL (beta), LinkedCT (beta), DailyMed (RDF available), TCM (RDF available)
        • Extended dataset coverage by tapping into existing endpoints (uniprot, bioportal, ebi-rdf?)
        • Showcase with other third party tools
    30. Bio2RDF Release 2 – A summary
        • Updated data conversion source code to use PHP API (all available through GitHub)
          – http://github.com/bio2rdf/bio2rdf-scripts
          – https://github.com/bio2rdf/bio2rdf-scripts/wiki/RDFization-Guide-v1.1
        • Simple Bio2RDF IRI design patterns that facilitate syntactic consistency and interoperability, backed by a simple registry
        • Dataset provenance and metrics
        • We welcome comments, suggestions and contributions
        • Join our mailing list at bio2rdf@googlegroups.com
    31. Acknowledgements
        Bio2RDF Release 2: Alison Callahan, Jose Cruz-Toledo, Peter Ansell
        Bio2RDF: Francois Belleau, Marc-Alexandre Nolin, Alex De Leon, Steve Etlinger, Nichealla Keath, Jacques Corbeil, James Hogan, Jean Morissette, Nicole Tourigny, Philippe Rigault and Paul Roe
    32. dumontierlab.com
        michel_dumontier@carleton.ca
        Website: http://dumontierlab.com
        Presentations: http://slideshare.com/micheldumontier