Computing Identity Co-Reference Across Drug Discovery Datasets

  1. Computing Identity Co-reference Across Drug Discovery Datasets. Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer. @open_phacts @gray_alasdair
  2. Multiple Identities. Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”. GB:29384, P12047, X31045: are these the same thing? (10/12/2013, SWAT4LS 2013)
  3. Gleevec® = Imatinib Mesylate. Labels shown: Imatinib; Imatinib Mesylate; Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N. Sources: ChemSpider, DrugBank, PubChem.
  4. [Image-only slide]
  5. [Image-only slide]
  6. Multiple Links: Different Reasons. Link: skos:closeMatch, Reason: non-salt form. Link: skos:exactMatch, Reason: drug name.
  7. Open PHACTS Discovery Platform: a production-quality integration platform. Diagram labels: Apps; Interactive responses; Method Calls; Domain API; Drug Discovery Platform.
  8. OPS Discovery Platform [architecture diagram]. Components: Apps; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Identity Resolution Service; Identifier Management Service; Domain Specific Services; Semantic Workflow Engine; Chemistry Registration, Normalisation & Q/C; Data Cache (Virtuoso Triple Store); Indexing. Example inputs shown: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532. Data sources (VoID descriptions, nanopublications, databases): Public Ontologies; Public Content; Commercial; User Annotations.
  9. Platform Interaction
  10. Connectivity of Initial Linksets. Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7.
  11. Genes == Proteins? BRCA1; Breast cancer type 1 susceptibility protein. [Image: BRCA1 protein structure, PDB 1jm7]
  12. Proceed with Caution!
  13. Co-reference Computation. Rules ensure: • Unrestricted transitivity within conceptual type • Restrict crossing conceptual types. Based on justifications. Provenance captured. [Diagram with cardinalities 0..* and 0..1]
  14. Connectivity of Initial Linksets. Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7.
  15. Connectivity of Computed Linksets. Datasets: 37; Linksets: 883; Links: 17,383,846; Justifications: 7.
  16. BridgeDb
  17. Conclusions • Computing co-reference is advantageous – Requires fewer raw linksets – Larger coverage across datasets • Rules ensure control – Genes can equal proteins – Compounds never equal proteins • Provenance captured throughout
  18. Questions. @gray_alasdair; Open PHACTS Project: @open_phacts
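
The rules on slide 13 (unrestricted transitivity within a conceptual type, restricted crossing between types, provenance captured) can be illustrated with a small sketch. This is not the Open PHACTS implementation. It assumes that a link carries a source, target, predicate, and justification, that every identifier has a known conceptual type, and that a derived co-reference may cross between types at most once and only over an explicitly allowed pair. The identifiers, the `TYPE_OF` table, `ALLOWED_CROSSINGS`, and the `max_crossings` parameter are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    justification: str


# Hypothetical conceptual types for made-up identifiers.
TYPE_OF = {
    "geneA:1": "gene",
    "geneB:1": "gene",
    "protA:1": "protein",
    "protB:1": "protein",
    "cmpd:1": "compound",
}

# Assumption: the only permitted type crossing is gene <-> protein.
ALLOWED_CROSSINGS = {frozenset({"gene", "protein"})}

LINKS = [
    Link("geneA:1", "geneB:1", "skos:exactMatch", "same gene"),
    Link("geneB:1", "protA:1", "skos:closeMatch", "gene encodes protein"),
    Link("protA:1", "protB:1", "skos:exactMatch", "same protein"),
    Link("protB:1", "cmpd:1", "skos:relatedMatch", "protein binds compound"),
]


def co_references(start, links, type_of, allowed, max_crossings=1):
    """Breadth-first search over links, treated as undirected.

    Returns {identifier: [links used to reach it]}, so each derived
    co-reference keeps the chain of asserted links as its provenance.
    """
    neighbours = {}
    for link in links:
        neighbours.setdefault(link.source, []).append((link.target, link))
        neighbours.setdefault(link.target, []).append((link.source, link))

    found = {}
    seen = {(start, 0)}
    queue = deque([(start, 0, [])])
    while queue:
        node, crossings, path = queue.popleft()
        for other, link in neighbours.get(node, []):
            pair = frozenset({type_of[node], type_of[other]})
            used = crossings
            if len(pair) == 2:  # this link crosses conceptual types
                if pair not in allowed or crossings >= max_crossings:
                    continue
                used += 1
            if (other, used) in seen:
                continue
            seen.add((other, used))
            if other != start:
                found.setdefault(other, path + [link])
            queue.append((other, used, path + [link]))
    return found


result = co_references("geneA:1", LINKS, TYPE_OF, ALLOWED_CROSSINGS)
for identifier, chain in result.items():
    print(identifier, "via", [l.justification for l in chain])
```

Under these assumptions the gene reaches both proteins, while the compound is never equated with a protein. Limiting a chain to one crossing also prevents two proteins from being equated merely because they are linked to the same gene.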

Editor's Notes

  1. Each captures a subtly different view of the world. Are they the same? It depends on your point of view.
  2. Example drug: Gleevec, a cancer drug for leukemia. Looking it up in three popular public chemical databases gives different results. Data is messy!
  3. Enter with the ChemSpider URI for Imatinib. This is not Gleevec.
  4. sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into linksets, each with a VoID header providing provenance and the justification for the links.
  5. A platform for integrated pharmacology data. Relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications: examples developed in the project.
  6. Import data into cache. API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by the IMS to cover URIs of original datasets.
  7. User starts typing. Server sends back suggestions. User selects one. URI sent to platform. Integrated information returned.
  8. Can enter with IDs from any of the supported datasets
  9. The platform extracts data from certain datasets, and these need to be connected. Here there is no issue in computing transitive links, as they are all the same compound based on InChI key. We would compute the full set of links.
  10. Do genes == proteins? They are different conceptual types: gene and protein. Gene names are often used as a shortcut for retrieval: BRCA1 is easier to remember and type! We require the ability to equate them in the IMS. But if you are explaining why genes = proteins, also be prepared for questions about when genes != proteins. Splice variation is a common example. In the FAS receptor there is one gene, but it can be made into two distinct proteins, which have different biological effects, so you can mix bio data that should not be mixed by integrating these two functions on the same ID. [We currently don't handle this well in OPS.] And the most used example here: the ghrelin gene is transcribed into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene.
  11. Insulin receptor: an issue arises when linking through PDB, due to the way that proteins are crystallised.
  12. Can enter with IDs from any of the supported datasets
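
Note 6 says that queries are expanded by the IMS to cover the URIs of the original datasets. The slides do not show how this is done, so the following is only a sketch of one possible approach: replacing a placeholder in a query template with a SPARQL 1.1 `VALUES` clause that lists the input URI and its equivalents. The `MAPPINGS` table, the `#VALUES#` placeholder, the function names, and the example.org URIs are all hypothetical.

```python
# Hypothetical equivalence table standing in for the IMS lookup.
MAPPINGS = {
    "http://example.org/datasetA/123": [
        "http://example.org/datasetB/abc",
        "http://example.org/datasetC/xyz",
    ],
}


def expand_uri(uri, mappings):
    """Return the input URI followed by its equivalents, without duplicates."""
    expanded = [uri]
    for other in mappings.get(uri, []):
        if other not in expanded:
            expanded.append(other)
    return expanded


def values_clause(variable, uris):
    """Build a SPARQL VALUES clause binding one variable to several URIs."""
    terms = " ".join(f"<{u}>" for u in uris)
    return f"VALUES ?{variable} {{ {terms} }}"


def expand_query(template, variable, uri, mappings):
    """Substitute the placeholder with a VALUES clause over all equivalent URIs."""
    clause = values_clause(variable, expand_uri(uri, mappings))
    return template.replace("#VALUES#", clause)


TEMPLATE = """SELECT ?p ?o WHERE {
  #VALUES#
  ?item ?p ?o .
}"""

query = expand_query(TEMPLATE, "item", "http://example.org/datasetA/123", MAPPINGS)
print(query)
```

The query pattern itself stays in the vocabulary of the original data; only the set of URIs bound to `?item` grows. A URI with no known equivalents simply produces a `VALUES` clause containing that single URI.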