Bio2RDF Release 2: Improved coverage, interoperability and provenance of Linked Data for the Life Sciences

  • Bio2RDF facilitates data sharing and reuse in several ways. The scripts that generate our linked data are openly licensed, so anyone can modify, redistribute or use them to generate RDF data on their own. We use syntactically consistent IRI patterns. We do not impose a structural scheme for representing our data. Bio2RDF allows multiple mirrors to host our data: a DNS “round robin” procedure balances incoming requests between participating mirrors, which prevents a single point of failure and allows additional mirrors to be added to the network. A centralized resource registry has been developed for consistent usage of namespaces and vocabularies. In the next slides I will go into a bit more detail about the first three points.
  • To facilitate collaboration by the community, one of the most important developments in this second release of Bio2RDF was the creation of a publicly available code repository for all of the programs used to convert biological data to linked data. Anyone can reuse or modify our openly (MIT) licensed code.
  • To keep a clear link back to the original data, our RDFized datasets preserve the original data provider’s record identifiers by using the URI pattern http://bio2rdf.org/namespace:identifier, where namespace is the preferred short name for a biological dataset.
  • Every resource in every Bio2RDF graph is linked to a graph that describes the provenance of the generated linked data. We use the W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary and the Dublin Core vocabulary to describe items such as: the URL of the source data, the date of generation, licensing information (if available), the URL of the script used to generate the data, the download URL for the linked data files, and the URL of the SPARQL endpoint.
  • So that our users do not need to run expensive queries on our endpoints, we have made available several metrics that describe the contents of our linked datasets. These metrics are included in each of our datasets and also serve as guides for developing queries based on the data’s structure.
  • As mentioned earlier, we have included dataset metrics for every one of our endpoints. These metrics are stored in a named graph and can be queried with SPARQL, as shown on the slide.
  • Here I show in more detail the top 10 subject type, predicate, object type frequencies found in our DrugBank endpoint. For example, to find drugs that participate in drug-drug interactions, one would use the ddi-interactor-in predicate, which associates the 1074 resources of type Drug with the 10891 resources typed as Drug-Drug-Interaction.
  • Here I present some numbers for the 19 updated datasets
  • HHPID is a database of HIV-1 human protein interactions that was created to catalog all interactions between HIV-1 and human proteins published in the peer-reviewed literature. The database serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets. MGI: Mouse Genome Informatics. Boutique endpoints maintained in Release 2 (will not be updated): Bio2RDF Atlas and HHPID.
  • We are currently exposing Bio2RDF data through a variety of third-party tools. Specifically, all of our endpoints can be navigated using Virtuoso’s faceted browser. We have also generated Sindice metrics that enable the use of the SPARQLed automated query builder, which lets users semi-automatically construct SPARQL queries based on the namespaces used therein. Finally, the Sig.ma browser can be used to aggregate data from multiple endpoints and construct mashups.
  • However, types and predicates are dataset-specific and therefore not consistent. Wouldn’t it be nice to be able to query all of these datasets with a common nomenclature? (our objective)

    1. Linked Data for the Life Sciences, Release 2. Michel Dumontier, Associate Professor, Carleton University, on behalf of the Bio2RDF team. Dumontier::EBI:Jan 23, 2013
    2. [Bio2RDF] is an open source framework that makes biological data available on the emerging semantic web using a set of simple conventions
    3. [Bio2RDF] reduces the time and effort involved in data integration so that you can get to the business of doing science
    4. The Semantic Web is the new global web of knowledge. It provides standards for publishing, sharing and querying facts, expert knowledge and services. It is a scalable approach to the discovery of independently formulated and highly distributed knowledge.
    5. A rapidly growing network of linked data. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
    6. Link all the data!!!
    7. Four Simple Rules for Linked Data
        1) Use Internationalized Resource Identifiers (IRIs; Unicode) or Uniform Resource Identifiers (URIs; ASCII) as names for things.
        2) Use HTTP URIs so that people can look up those names.
        3) When someone looks up a URI, provide useful information about it using the standards (RDF).
        4) Include links to other URIs so that they can discover more things.
        http://www.w3.org/DesignIssues/LinkedData.html
    8. Bio2RDF is a framework to create and provide linked data for the life sciences
    9. Main features of Bio2RDF Release 2
        • Bio2RDF conversion scripts, mapping files and web application are open source and freely available at http://github.com/bio2rdf
        • Bio2RDF enables (syntactic) data integration within and across datasets by using one language (RDF), a common URI pattern and a common resource registry
        • 19 Release 2 datasets have provenance, and endpoints feature pre-computed graph summaries for fast lookup
        • Bio2RDF web application enables entity resolution and query federation across an expandable distributed network of SPARQL endpoints
    10. Bio2RDF RDFization guidelines are available at our wiki: https://github.com/bio2rdf/bio2rdf-scripts/wiki/RDFization-Guide-v1.1
    11. Bio2RDF converters are open-source and available at GitHub: http://github.com/bio2rdf/bio2rdf-scripts
    12. Bio2RDF data are identified using simple HTTP URI patterns
        When available, use the provider’s identifier in naming the resource:
            http://bio2rdf.org/namespace:identifier
        e.g. DrugBank’s resource IRI for Leucovorin:
            http://bio2rdf.org/drugbank:DB00650
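The naming pattern on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the Bio2RDF PHP library: the function names `bio2rdf_iri` and `split_bio2rdf_iri` are hypothetical, and lowercasing the namespace is an assumption based on the lowercase namespaces listed elsewhere in the deck.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_iri(namespace: str, identifier: str) -> str:
    """Build an IRI of the form http://bio2rdf.org/namespace:identifier."""
    if not namespace or not identifier:
        raise ValueError("namespace and identifier must be non-empty")
    # Assumption: namespaces are written in lowercase; the provider's
    # identifier is kept exactly as given.
    return f"{BIO2RDF_BASE}{namespace.lower()}:{identifier}"


def split_bio2rdf_iri(iri: str) -> tuple:
    """Recover (namespace, identifier) from an IRI built by bio2rdf_iri."""
    if not iri.startswith(BIO2RDF_BASE):
        raise ValueError(f"not a Bio2RDF IRI: {iri}")
    namespace, sep, identifier = iri[len(BIO2RDF_BASE):].partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"missing namespace or identifier: {iri}")
    return namespace, identifier


print(bio2rdf_iri("drugbank", "DB00650"))
```

Because the provider's identifier is preserved verbatim, the original record can always be recovered from the IRI by splitting at the first colon after the base.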
    13. Linked Data: You can look it up
    14. Valid Bio2RDF namespaces are listed in a dataset registry
        • An initial registry of ~600 datasets is accessible through an API provided by my PHP-LIB library (available on GitHub). It includes:
          – Dataset title, preferred namespace prefix, alternative namespace prefixes
        • In the summer, we consolidated and curated nearly 2100 entries in a Google spreadsheet, which covers most of the datasets/collections listed in Bio2RDF, MIRIAM, BioPortal, UniProt, NCBI and the NAR database issue. New fields were added, including:
          – Dataset description, organization, website, HTML template
          – Identifier syntax, license and rights
        • Working with the identifiers.org team (Nick Juty, Camille Laibe, Nicolas Le Novere) to have a single dataset registry that we can use for both Bio2RDF and identifiers.org
          – enable automatic cross-links between Bio2RDF and identifiers.org
    15. Vocabulary and resource namespaces are used to describe auxiliary resources
        • types and predicates that are generated to support the semantic annotation are in the vocabulary namespace
            http://bio2rdf.org/drugbank_vocabulary:Drug (type)
            http://bio2rdf.org/drugbank_vocabulary:target (predicate)
        • n-ary relations are named in the resource namespace
            http://bio2rdf.org/drugbank_resource:DB00440_DB00650
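A small Python sketch of how the two auxiliary namespaces could be derived from a dataset namespace. The helper names are hypothetical. Joining the participant identifiers with underscores is inferred from the single example on the slide (DB00440_DB00650), so treat it as an assumption rather than a documented rule.

```python
BASE = "http://bio2rdf.org/"


def vocabulary_iri(namespace: str, term: str) -> str:
    """IRI for a generated type or predicate, e.g. drugbank_vocabulary:Drug."""
    return f"{BASE}{namespace}_vocabulary:{term}"


def relation_iri(namespace: str, *identifiers: str) -> str:
    """IRI for an n-ary relation in the resource namespace.

    Assumption: the relation is named by joining the identifiers of its
    participants with underscores, as in the slide's example.
    """
    if len(identifiers) < 2:
        raise ValueError("an n-ary relation needs at least two participants")
    return f"{BASE}{namespace}_resource:" + "_".join(identifiers)


print(vocabulary_iri("drugbank", "Drug"))
print(relation_iri("drugbank", "DB00440", "DB00650"))
```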
    16. The Resource Description Framework (RDF) is a formal knowledge representation language capable of expressing a statement
        An RDF statement consists of:
          – Subject: resource identified by a URI
          – Predicate: resource identified by a URI
          – Object: resource or literal
        Example, with full URIs and then in prefixed form:
            http://bio2rdf.org/drugbank:DB00650  rdf:type  http://bio2rdf.org/drugbank_vocabulary:Drug
            drugbank:DB00650  rdf:type  drugbank_vocabulary:Drug
    17. Every statement expands the network of linked data
            drugbank:DB00650  rdf:type  drugbank_vocabulary:Drug
            drugbank:DB00650  rdfs:label  "Leucovorin [drugbank:DB00650]"
            drugbank:DB00650  drugbank_vocabulary:ddi-interactor-in  drugbank_resource:DB00440_DB00650
            drugbank_resource:DB00440_DB00650  rdfs:label  "DDI between Trimethoprim and Leucovorin [drugbank_resource:DB00440_DB00650]"
            drugbank_resource:DB00440_DB00650  rdf:type  drugbank_vocabulary:Drug-Drug-Interaction
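To make the subject, predicate, object structure concrete, here is a minimal Python sketch that writes some of the statements from this slide as N-Triples lines. It uses only the standard library. The helper names are hypothetical, and the literal escaping covers only backslashes, double quotes and newlines, so it is not a complete serializer.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
B = "http://bio2rdf.org/"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal (minimal escaping only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


# Each tuple is one statement: (subject, predicate, object).
triples = [
    (iri(B + "drugbank:DB00650"), iri(RDF_TYPE), iri(B + "drugbank_vocabulary:Drug")),
    (iri(B + "drugbank:DB00650"), iri(RDFS_LABEL), literal("Leucovorin [drugbank:DB00650]")),
    (iri(B + "drugbank:DB00650"), iri(B + "drugbank_vocabulary:ddi-interactor-in"),
     iri(B + "drugbank_resource:DB00440_DB00650")),
    (iri(B + "drugbank_resource:DB00440_DB00650"), iri(RDF_TYPE),
     iri(B + "drugbank_vocabulary:Drug-Drug-Interaction")),
]


def to_ntriples(statements) -> str:
    """Serialize statements one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


print(to_ntriples(triples))
```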
    18. Syntactic integration across datasets
        DrugBank:
            drugbank:DB00650  rdf:type  drugbank_vocabulary:Drug
            drugbank:DB00650  rdfs:label  "Leucovorin [drugbank:DB00650]"
        PharmGKB:
            pharmgkb:PA450198  rdf:type  pharmgkb_vocabulary:Drug
            pharmgkb:PA450198  rdfs:label  "leucovorin [pharmgkb:PA450198]"
            pharmgkb:PA450198  pharmgkb_vocabulary:xref  drugbank:DB00650
    19. You can get what links to it
        What links to DrugBank’s Leucovorin?
            http://bio2rdf.org/linksns/drugbank/drugbank:DB00650
    20. The Bio2RDF Network of Linked Data
    21. Every Bio2RDF dataset now contains provenance metadata
        Features: Dates, Licensing, Source, Creators, Publishers, Availability
        Vocabularies: VoID, Dublin Core, W3C Provenance, Bio2RDF vocabulary
    22. For every Bio2RDF dataset we pre-compute 9 descriptive metrics
        • total number of triples
        • number of unique subjects
        • number of unique predicates
        • number of unique objects
        • number of unique types
        • unique predicate-object links and their frequencies
        • unique predicate-literal links and their frequencies
        • unique subject type-predicate-object type links and their frequencies
        • unique subject type-predicate-literal links and their frequencies
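As a rough illustration of the first five metrics, the following Python sketch counts them over a small in-memory list of triples. It is not the Bio2RDF implementation, which pre-computes these values on the endpoints. The function name `dataset_metrics` and the use of prefixed names as plain strings are assumptions made for brevity.

```python
def dataset_metrics(triples, type_predicate="rdf:type"):
    """Count the simple per-dataset metrics over (subject, predicate, object) tuples."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    objects = {o for _, _, o in triples}
    # A "type" is any object of an rdf:type statement.
    types = {o for _, p, o in triples if p == type_predicate}
    return {
        "triples": len(triples),
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
        "unique_types": len(types),
    }


sample = [
    ("drugbank:DB00650", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00650", "rdfs:label", "Leucovorin [drugbank:DB00650]"),
    ("drugbank:DB00650", "drugbank_vocabulary:ddi-interactor-in",
     "drugbank_resource:DB00440_DB00650"),
    ("drugbank_resource:DB00440_DB00650", "rdf:type",
     "drugbank_vocabulary:Drug-Drug-Interaction"),
]

print(dataset_metrics(sample))
```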
    23. Accessing Bio2RDF dataset metrics
        • Each Bio2RDF endpoint contains a named graph that holds the pre-computed metrics
          – http://bio2rdf.org/bio2rdf-[namespace]-statistics
        • Metrics can be queried using SPARQL, e.g.:
            SELECT *
            FROM <http://bio2rdf.org/bio2rdf-drugbank-statistics>
            WHERE {
              ?dataset a <http://bio2rdf.org/dataset_vocabulary:Endpoint> .
              ?dataset <http://bio2rdf.org/dataset_vocabulary:has_triple_count> ?tc .
              ?dataset <http://bio2rdf.org/dataset_vocabulary:has_unique_subject_count> ?sc .
              ?dataset <http://bio2rdf.org/dataset_vocabulary:has_unique_predicate_count> ?pc .
              ?dataset <http://bio2rdf.org/dataset_vocabulary:has_unique_object_count> ?oc .
              …
            }
        https://github.com/bio2rdf/bio2rdf-scripts/wiki/Bio2RDF-dataset-metrics
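Since the statistics graph name follows a fixed pattern, the same query can be generated for any namespace. The Python sketch below only assembles the query text and does not contact an endpoint. The function names are hypothetical, and only the four count predicates visible on the slide are included (the slide elides the rest).

```python
VOCAB = "http://bio2rdf.org/dataset_vocabulary:"

# Variable name -> predicate local name, as shown on the slide.
COUNT_PREDICATES = {
    "tc": "has_triple_count",
    "sc": "has_unique_subject_count",
    "pc": "has_unique_predicate_count",
    "oc": "has_unique_object_count",
}


def statistics_graph(namespace: str) -> str:
    """Name of the graph holding pre-computed metrics for a namespace."""
    return f"http://bio2rdf.org/bio2rdf-{namespace}-statistics"


def metrics_query(namespace: str) -> str:
    """Assemble the SPARQL text that reads the metrics for one dataset."""
    patterns = [f"?dataset a <{VOCAB}Endpoint> ."]
    patterns += [
        f"?dataset <{VOCAB}{pred}> ?{var} ."
        for var, pred in COUNT_PREDICATES.items()
    ]
    body = "\n  ".join(patterns)
    return f"SELECT *\nFROM <{statistics_graph(namespace)}>\nWHERE {{\n  {body}\n}}"


print(metrics_query("drugbank"))
```

The resulting string could then be sent to the dataset's SPARQL endpoint with any HTTP or SPARQL client.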
    24. Graph summaries can also assist in query formulation
        Subject Type                  Subject Count  Predicate          Object Type            Object Count
        Pharmaceutical                11512          form               Unit                   56
        Drug-Transporter-Interaction  1440           drug               Drug                   534
        Drug-Transporter-Interaction  1440           transporter        Target                 88
        Drug                          1266           dosage             Dosage                 230
        Patent                        1255           country            Country                2
        Drug                          1127           product            Pharmaceutical         11512
        Drug                          1074           ddi-interactor-in  Drug-Drug-Interaction  10891
        Drug                          532            patent             Patent                 1255
        Drug                          277            mixture            Mixture                3317
        Dosage                        230            route              Route                  42
        Drug-Target-Interaction       84             target             Target                 43

            PREFIX drugbank_vocabulary: <http://bio2rdf.org/drugbank_vocabulary:>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?ddi ?d1name ?d2name
            WHERE {
              ?ddi a drugbank_vocabulary:Drug-Drug-Interaction .
              ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi .
              ?d1 rdfs:label ?d1name .
              ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi .
              ?d2 rdfs:label ?d2name .
              FILTER (?d1 != ?d2)
            }
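A summary like the table on this slide can be approximated over in-memory triples. The Python sketch below groups statements by (subject type, predicate, object type) and reports the number of distinct subjects and objects for each link. It is an assumption-laden illustration: the function name is hypothetical and the real summaries are computed on the endpoints, possibly with different counting rules.

```python
from collections import defaultdict


def type_link_summary(triples, type_predicate="rdf:type"):
    """Map (subject type, predicate, object type) -> (distinct subjects, distinct objects)."""
    # First pass: collect the declared types of every resource.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == type_predicate:
            types[s].add(o)

    # Second pass: record which subjects and objects take part in each typed link.
    links = defaultdict(lambda: (set(), set()))
    for s, p, o in triples:
        if p == type_predicate:
            continue
        for s_type in types.get(s, ()):
            for o_type in types.get(o, ()):
                subject_set, object_set = links[(s_type, p, o_type)]
                subject_set.add(s)
                object_set.add(o)

    return {key: (len(subs), len(objs)) for key, (subs, objs) in links.items()}


sample = [
    ("drugbank:DB00440", "rdf:type", "Drug"),
    ("drugbank:DB00650", "rdf:type", "Drug"),
    ("drugbank_resource:DB00440_DB00650", "rdf:type", "Drug-Drug-Interaction"),
    ("drugbank:DB00440", "ddi-interactor-in", "drugbank_resource:DB00440_DB00650"),
    ("drugbank:DB00650", "ddi-interactor-in", "drugbank_resource:DB00440_DB00650"),
]

print(type_link_summary(sample))
```

Reading such a summary tells a query author which predicate connects two types before writing any SPARQL, which is the use the slide describes.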
    25. You can use the SPARQLed query assistant with updated endpoints: http://sindicetech.com/sindice-suite/sparqled/ (graph: http://sindicetech.com/analytics)
    26. Bio2RDF covers the major biological databases
    27. Bio2RDF Release 2 – New and Updated Datasets
        Dataset                               Namespace    # of triples
        Affymetrix                            affymetrix   44469611
        Biomodels*                            biomodels    589753
        Comparative Toxicogenomics Database   ctd          141845167
        DrugBank                              drugbank     1121468
        NCBI Gene                             ncbigene     394026267
        Gene Ontology Annotations             goa          80028873
        HUGO Gene Nomenclature Committee      hgnc         836060
        Homologene                            homologene   1281881
        InterPro*                             interpro     999031
        iProClass                             iproclass    211365460
        iRefIndex                             irefindex    31042135
        Medical Subject Headings              mesh         4172230
        NCBO BioPortal*                       bioportal    15384622
        National Drug Code Directory*         ndc          17814216
        Online Mendelian Inheritance in Man   omim         1848729
        Pharmacogenomics Knowledge Base       pharmgkb     37949275
        SABIO-RK*                             sabiork      2618288
        Saccharomyces Genome Database         sgd          5551009
        NCBI Taxonomy                         taxon        17814216
        Total                                 19           1010758291
    28. Status of Bio2RDF Release 1 datasets
        Dataset     Status
        Atlas       Maintained – will not be updated
        BIND        Deprecated – in iRefIndex
        BioCarta    Deprecated – in Pathway Commons
        BioCyc      Deprecated – in Pathway Commons
        EC          Deprecated – in Gene Ontology/UniProt
        GenBank     Maintained – will be updated
        HHPID       Maintained – will not be updated
        INOH        Deprecated – in Pathway Commons
        KEGG        Maintained – will not be updated
        MGI         Maintained – will be updated
        Pubmed      Maintained – will be updated
        PID         Deprecated – in Pathway Commons
        Reactome    Deprecated – in Pathway Commons
        RefSeq      Maintained – will be updated
    29. A PHP-based library acts as a point of integration
        • Provides a set of APIs
          – to produce RDF and OWL statements
          – to generate valid Bio2RDF URIs by checking against a dataset registry
          – to generate dataset provenance
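The idea of generating URIs only for registered namespaces can be sketched as follows. This is a Python illustration, not the PHP library's API: the function names are hypothetical, and the registry entries (in particular the alternative prefixes) are made up for the example. Only the field names (title, preferred prefix, alternative prefixes) come from the registry description in the deck.

```python
# Hypothetical registry excerpt; the alternative prefixes are illustrative only.
REGISTRY = [
    {"title": "DrugBank", "preferred": "drugbank", "alternatives": ["drug_bank"]},
    {"title": "NCBI Gene", "preferred": "ncbigene", "alternatives": ["geneid", "entrezgene"]},
]

# Index every known prefix (case-insensitively) to its preferred namespace.
_LOOKUP = {}
for entry in REGISTRY:
    for prefix in [entry["preferred"], *entry["alternatives"]]:
        _LOOKUP[prefix.lower()] = entry["preferred"]


def preferred_namespace(prefix: str) -> str:
    """Return the preferred namespace for a prefix, or raise KeyError if unregistered."""
    try:
        return _LOOKUP[prefix.lower()]
    except KeyError:
        raise KeyError(f"namespace not in registry: {prefix}") from None


def registered_iri(prefix: str, identifier: str) -> str:
    """Build a Bio2RDF-style IRI, normalizing the prefix through the registry."""
    return f"http://bio2rdf.org/{preferred_namespace(prefix)}:{identifier}"


print(registered_iri("GeneID", "7157"))
```

Normalizing through one registry is what keeps IRIs for the same record identical across datasets, which is the basis of the syntactic integration described earlier.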
    30. Oh Bio2RDF, how can I access you? Let me count the ways:
        1. Downloads (data, stats, Virtuoso db)
        2. Web interface (lookup + services)
        3. SPARQL endpoint
        4. SPARQLed editor
        5. Virtuoso Faceted Browser
    31. Use Virtuoso’s built-in faceted browser to construct increasingly complex queries with little effort
    32. Heterogeneous biological data on the semantic web is difficult to query
        Question: Find all proteins that interact with beta amyloid (uniprot:P05067)
            SELECT * WHERE {
              ?protein a bio2rdf:Protein .
              ?protein bio2rdf:interacts_with uniprot:P05067 .
            }
        Which protein type? UniProt Protein, PDB Protein, iRefIndex Protein?
        Which interaction? Physical interaction? Genetic interaction? Pathway interaction?
    33. RDF-based Linked Data is a great first step, but it’s not enough. From linked data to linked knowledge through syntactic and semantic normalization.
    34. Ontology as a strategy to formally represent and integrate knowledge
    35. SIO provides an OWL ontology for the representation of diverse biomedical knowledge
    36. (figure only)
    37. Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO)
        Dataset: uniprot:P05067 is a uniprot:Protein; refseq:NP_009225.1 is a refseq:Protein
        Ontology: uniprot:Protein is a sio:protein; refseq:Protein is a sio:protein
        Dataset and ontology together form the knowledge base.
        Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012.
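The effect of mapping dataset-specific types to a shared SIO class can be illustrated with a toy Python example. It follows "is a" links by hand rather than using an OWL reasoner, which is what a real deployment would need. The dictionaries hold only the statements shown on the slide, and the function names are hypothetical.

```python
# Dataset statements: instance -> dataset-specific type.
INSTANCE_TYPES = {
    "uniprot:P05067": "uniprot:Protein",
    "refseq:NP_009225.1": "refseq:Protein",
}

# Ontology statements: dataset-specific type -> SIO class.
SUBCLASS_OF = {
    "uniprot:Protein": "sio:protein",
    "refseq:Protein": "sio:protein",
}


def is_a(instance: str, target: str) -> bool:
    """True if the instance's type is the target or a (transitive) subclass of it."""
    current = INSTANCE_TYPES.get(instance)
    visited = set()
    while current is not None and current not in visited:
        if current == target:
            return True
        visited.add(current)  # guard against cycles in the mapping
        current = SUBCLASS_OF.get(current)
    return False


def instances_of(target: str) -> list:
    """All known instances of a class, in sorted order."""
    return sorted(i for i in INSTANCE_TYPES if is_a(i, target))


print(instances_of("sio:protein"))
```

A query for sio:protein now returns records from both datasets, while a query for uniprot:Protein still returns only the UniProt record.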
    38. Bio2RDF types include processes, material entities and informational entities
        CTD: Chemical, Disease, Chemical-Disease Interaction, Chemical-Gene Interaction
        NCBIGene: Gene, Protein, Model Organism, Publication
        HGNC: Accession Number, Gene, Gene Symbol
        iRefIndex: Protein Complex, Protein Interaction
        MGI: Gene Marker, Gene Symbol
        PharmGKB: Gene-Disease Associations, Disease, Drug, Gene
        SGD: Enzyme, Pathway, Protein, RNA, Reaction, Location, Experiment
    39. Bio2RDF and SIO powered SPARQL 1.1 federated query:
        Find chemicals in CTD and proteins in SGD that participate in the same GO process
            SELECT ?chem ?prot ?proc
            FROM <http://bio2rdf.org/ctd>
            WHERE {
              ?chemical a sio:chemical-entity .
              ?chemical rdfs:label ?chem .
              ?chemical sio:is-participant-in ?process .
              ?process rdfs:label ?proc .
              FILTER regex(str(?process), "http://bio2rdf.org/go:")
              SERVICE <http://sgd.bio2rdf.org/sparql> {
                ?protein a sio:protein .
                ?protein sio:is-participant-in ?process .
                ?protein rdfs:label ?prot .
              }
            }
    40. Bio2RDF Release 2 – A summary
        • Updated data conversion source code to use PHP API (all available through GitHub)
        • Simple Bio2RDF IRI design patterns that facilitate syntactic consistency and interoperability, backed by a simple registry
        • Dataset provenance and metrics
        • We welcome contributions from the community
        • Join our mailing list at bio2rdf@googlegroups.com
    41. Future Directions
        • Aiming for twice yearly release schedule
        • Update large scale datasets
          – RefSeq, Genbank, PubMed, PDB
        • Incorporate EBI & W3C Linking Open Drug Data (LODD) effort
          – SIDER (beta), TCM (RDF available), CHEMBL (in dev), OMIM (released), DailyMed (RDF available), LinkedCT (beta)
        • Extended dataset coverage by tapping into existing endpoints (uniprot, bioportal, ebi-rdf?)
        • OpenBioCloud w/DERI + SindiceTech
        • Showcase with other third party tools
    42. Acknowledgements
        Bio2RDF: Peter Ansell, Francois Belleau, Allison Callahan, Jacques Corbeil, Jose Cruz-Toledo, Alex De Leon, Steve Etlinger, James Hogan, Nichealla Keath, Jean Morissette, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault and Paul Roe
        OpenBioCloud: Dana Klassen and Giovanni Tumarello
        SADI: Christopher Baker, Melanie Courtot, Jose Cruz-Toledo, Steve Etlinger, Nichealla Keath, Artjom Klein, Luke McCarthy, Silvane Paixao, Ben Vandervalk, Natalia Villanueva-Rosales, Mark Wilkinson
        W3C HCLS: J Luciano, B Andersson, C Batchelor, O Bodenreider, T Clark, C Denney, C Domarew, T Gambet, L Harland, A Jentzsch, V Kashyap, P Kos, J Kozlovsky, T Lebo, SM Marshall, JP McCusker, DL McGuinness, C Ogbuji, E Pichler, R Powers, E Prud’hommeaux, M Samwald, L Schriml, PJ Tonellato, PL Whetzel, J Zhao, S Stephens, C Denney, J Luciano, J McGurk, Lynn Schriml, and Peter J. Tonellato.
    43. dumontierlab.com
        michel_dumontier@carleton.ca
        Website: http://dumontierlab.com
        Presentations: http://slideshare.com/micheldumontier
