Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project
  • 1 of 83 business driver questions
  • Pharma companies are all accessing, processing, storing and re-processing external research data. OPS: 29 partners.
  • A platform for integrated pharmacology data, relied upon by pharma companies. Covers public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications (examples developed in the project).
  • Public launch April 2013
  • 17 apps; 5 external; 1 in partnership.
  • Linked data API: multiple response formats (JSON, RDF, XML, CSV …). 3scale deployment. Public dataset.
  • Import data into cache. API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data. Queries expanded by IMS to cover URIs of original datasets.
  • Example using the Explorer application; see Ian’s demo of the new version in the demo session. User starts typing; server sends back suggestions; user selects one; URI sent to platform; integrated information returned, including provenance.
  • Each captures a subtly different view of the world. Are they the same? … depends on your point of view.
  • Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results. Data is messy!
  • Enter with ChemSpider URI for Imatinib. This is not Gleevec.
  • sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into linksets, each with a VoID header providing provenance and justification for the links.
  • Interested in physicochemical properties of Gleevec
  • Interested in biomedical and pharmacological properties
  • Can enter with IDs from any of the supported datasets
  • Platform extracts data from certain datasets. These need to be connected. Here there is no issue in computing transitive links, as they are all the same compound based on InChI key. Would compute the full set of links.
  • Do genes == proteins? They are different conceptual types: gene and protein. Gene names are often used as a shortcut for retrieval: BRCA1 is easier to remember and type! This requires the ability to equate them in the IMS. But if you are explaining why genes = proteins, you may also want to be prepared for questions about when genes != proteins. Splice variation is a common example. In the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins which have different biological effects, so you can mix bio data that shouldn't be mixed by integrating these two functions on the same ID. [We currently don't handle this well in OPS.] The most used example here is the ghrelin gene: it is transcribed into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene (http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants).
  • Insulin Receptor. Issue when linking through PDB due to the way that proteins are crystallised.
  • Can enter with IDs from any of the supported datasets
  • These are 1.3 figures. In 1.4: 130 raw linksets with 6,985,278 links; 40,802 computed linksets with 25,584,293 links.
  • Implementation available. IMS takes query and expands URIs.
  • Retinoic Acid
  • Reminder: enter with method and URI, implemented as a query. Challenge: can we efficiently support lenses? Lenses require stand-off mappings, implemented as an extra service call.
  • Query with URIs; extract URIs; find equivalents; expand query; optimise based on context.
  • Result size in brackets
  • Orange are actual OPS queries
  • Subset of the OPS data
  • Linked data approach performs badly with query 6 due to the query construction: name being bound to the chemical structure returned.
  • Focus on other queries. In general, expansion is slower than the baselines. Worst-case delta: 0.01842 s (under 20 ms). Human perception is 0.050 to 0.2 s (50-200 ms).
  • Focus on query 6. No linked data, as it performed very poorly on this query. Size of result obliterates external call cost.
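
The notes on links, linksets and lenses can be made concrete with a small sketch. This is not the Open PHACTS IMS implementation: the `Link` record, the placeholder identifiers, and the idea of modelling a lens as a set of accepted justifications are all assumptions made for illustration. It shows how the same starting URI can yield a different set of equivalents depending on which justifications the lens admits.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. "skos:exactMatch"
    justification: str  # the reason the link holds, e.g. "InChI"


# Hypothetical links; the identifiers are placeholders, not real records.
LINKS = [
    Link("cs:imatinib", "chembl:imatinib", "skos:exactMatch", "InChI"),
    Link("cs:imatinib", "db:gleevec", "skos:closeMatch", "drug name"),
    Link("db:gleevec", "cs:imatinib-mesylate", "skos:exactMatch", "InChI"),
]

# Assumption: a lens is just the set of justifications the user accepts.
STRICT = frozenset({"InChI"})
RELAXED = frozenset({"InChI", "drug name"})


def equivalents(uri, links, lens):
    """Return every URI reachable from `uri` via links the lens accepts."""
    neighbours = {}
    for link in links:
        if link.justification in lens:
            # Links are treated as symmetric for lookup purposes.
            neighbours.setdefault(link.source, set()).add(link.target)
            neighbours.setdefault(link.target, set()).add(link.source)
    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling `equivalents("cs:imatinib", LINKS, STRICT)` follows only the InChI-justified links, while `RELAXED` also follows the drug-name link and so reaches the salt form. Rules that restrict crossing between conceptual types (gene, protein, compound) are left out of this sketch.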

Transcript

  • 1. Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project. Alasdair J G Gray, A.J.G.Gray@hw.ac.uk, www.alasdairjggray.co.uk, @gray_alasdair. Image: http://c745.r45.cf2.rackcdn.com/img/2009/lens_filter_coasters.jpg
  • 2. Open PHACTS Use Case: “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”. Chemical Properties (Chemspider); Launched drugs (Drugbank); Human => Mouse (Homologene); Protein Families (Enzyme); Bioactivity Data (ChEMBL); … other info (Uniprot/Entrez etc.). 21/05/2014 Brighton Seminar
  • 3. Pre-competitive Informatics. Literature, PubChem, Genbank, Patents, Databases, Downloads; Data Integration; Data Analysis; Firewalled Databases. Repeat @ each company. A single, shared solution. Funded under IMI (2011-14) and ENSO (2014-16). Reference: Lowering industry firewalls: pre-competitive informatics in drug discovery, Nature Reviews Drug Discovery (2009) 8, 701-708, doi:10.1038/nrd2944
  • 4. Open PHACTS Discovery Platform. Drug Discovery Platform; Apps; Domain API; Interactive responses; Production quality integration platform; Method Calls
  • 5. API Hits: 15.8 million total hits (April 2013 – March 2014)
  • 6. An “App Store”? http://www.openphactsfoundation.org/apps.html Explorer, Explorer2, ChemBioNavigator, Target Dossier, Pharmatrek, Helium, MOE, Collector, Cytophacts, Utopia, Garfield, SciBite, KNIME, Mol. Data Sheets, PipelinePilot, scinav.it, Taverna
  • 7. Linked Data API (https://dev.openphacts.org/): Drug, Disease, Pathway, Target
  • 8. OPS Discovery Platform. Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Core Platform: Data Cache (Virtuoso Triple Store), Identity Resolution Service, Identifier Management Service, Chemistry Registration Normalisation & Q/C, Indexing. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Data sources (with VoID, Db and Nanopub components): Public Content, Commercial, Public Ontologies, User Annotations
  • 9. Platform Interaction
  • 10. Provenance
  • 11. Multiple Identities. Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” (http://bioinformatics.roslin.ac.uk/lawslaws/). P12047, X31045, GB:29384: are these the same thing?
  • 12. Gleevec® = Imatinib Mesylate. Sources: Drugbank, ChemSpider, PubChem. Labels shown: Imatinib Mesylate; Imatinib Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N
  • 15. Multiple Links: Different Reasons. Link: skos:closeMatch, Reason: non-salt form. Link: skos:exactMatch, Reason: drug name.
  • 16. Dynamic Equality. Strict / Relaxed; Analysing / Browsing. skos:exactMatch (InChI)
  • 17. Dynamic Equality. Strict / Relaxed; Analysing / Browsing. skos:closeMatch (Drug Name); skos:closeMatch (Drug Name); skos:exactMatch (InChI)
  • 18. Initial Connectivity. Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7
  • 19. Compound Information
  • 20. Genes == Proteins? BRCA1: Breast cancer type 1 susceptibility protein. Images: http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png and http://en.wikipedia.org/wiki/File:BRCA1_en.png
  • 21. Proceed with Caution!
  • 22. Co-reference Computation. Rules ensure: unrestricted transitivity within conceptual type; restrict crossing conceptual types. Based on justifications. Provenance captured.
  • 23. Initial Connectivity. Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7
  • 24. Inferred Connectivity. Datasets: 37; Linksets: 883; Links: 17,383,846; Justifications: 7
  • 25. BridgeDb
  • 26. http://ops.rsc.org/OPS45975 http://ops.rsc.org/OPS45978 has_isotopically_unspecified_parent [CHEMINF:000459] has OPS normalized counterpart [CHEMINF:000458] http://ops.rsc.org/OPS45991 is_tautomer_of [chebi:is_tautomer_of] http://ops.rsc.org/OPS45987 has_stereoundefined_parent [CHEMINF:000456] http://ops.rsc.org/OPS45981 Lenses
  • 27. OPS Discovery Platform. Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Core Platform: Data Cache (Virtuoso Triple Store), Identity Resolution Service, Identifier Management Service, Chemistry Registration Normalisation & Q/C, Indexing. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Data sources (with VoID, Db and Nanopub components): Public Content, Commercial, Public Ontologies, User Annotations
  • 28. Query Expansion. Input query Q with lens L1: cw:979b545d-f9a9 cheminf:logd ?logd . The Query Expander Service sends (cw:979b545d-f9a9, L1) to the Identity Mapping Service (BridgeDB; Profiles, Mappings), which returns [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]. Expanded query Q’: GRAPH <http://rdf.chemspider.com> { ?iri cheminf:logd ?logd . FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945 ) } Can also be achieved through UNION.
  • 29. Experiment: Is it feasible to use a stand-off mapping service? Baselines (no external call): “Perfect” URIs; Linked data querying. Expansion approaches (external service call): FILTER by Graph; UNION by Graph. Reference: C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
  • 30. “Perfect” URI Baseline: WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { chembl_mol:m1280 cheminf:mw ?mw . } }
  • 31. Linked Data Baseline: WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { ?chemblid cheminf:mw ?mw . } cs:2157 skos:exactMatch ?chemblid . }
  • 32. Queries drawn from Open PHACTS API: 1. Simple compound information (1) 2. Compound information (1) 3. Compound pharmacology (M) 4. Simple target information (1) 5. Target information (1) 6. Target pharmacology (M)
  • 33. Queries drawn from Open PHACTS API: 1. Simple compound information (1) 2. Compound information (1) 3. Compound pharmacology (M) 4. Simple target information (1) 5. Target information (1) 6. Target pharmacology (M)
  • 34. Experiment Data. Data: 167,783,592 triples; Mappings: 2,114,584 triples; Lenses: 1
  • 35. Average execution times
  • 36. Average execution times: 0.018
  • 37. Q6: Target Pharmacology
  • 38. Conclusions. Computing co-reference advantageous: requires fewer raw linksets; larger coverage across datasets. Rules ensure control: genes can equal proteins; compounds never equal proteins. Provenance captured throughout.
  • 39. Conclusions. Query expansion slower in general: due to separate service call; difference below human perception; UNION faster than FILTER on Virtuoso. Stand-off mappings feasible. Infrastructure can support lenses. Strict / Relaxed; Analysing / Browsing
  • 40. Questions A.J.G.Gray@hw.ac.uk www.alasdairjggray.co.uk @gray_alasdair pmu@openphacts.org www.openphacts.org @open_phacts
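
The two expansion approaches compared in the experiment, FILTER and UNION, can be sketched as simple query rewriting. This is only an illustration under stated assumptions: the function names are hypothetical, the rewriting uses naive string substitution rather than a SPARQL parser, per-graph scoping with GRAPH is omitted, and the list of co-referent URIs (taken from slide 28) is assumed to have already been returned by the mapping service.

```python
def expand_with_filter(subject_var, pattern, uris):
    """Keep the variable in the pattern and constrain it with a FILTER."""
    condition = " || ".join(f"{subject_var} = {uri}" for uri in uris)
    return f"{pattern}\nFILTER ({condition})"


def expand_with_union(subject_var, pattern, uris):
    """Emit one copy of the pattern per co-referent URI, joined by UNION."""
    # Naive textual substitution; assumes subject_var is not a prefix of
    # any other variable name in the pattern.
    branches = ["{ " + pattern.replace(subject_var, uri) + " }" for uri in uris]
    return "\nUNION\n".join(branches)


# Co-referent URIs as listed on slide 28.
URIS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]
PATTERN = "?iri cheminf:logd ?logd ."

if __name__ == "__main__":
    print(expand_with_filter("?iri", PATTERN, URIS))
    print(expand_with_union("?iri", PATTERN, URIS))
```

Both functions return a graph-pattern fragment intended to sit inside a WHERE clause. Which form runs faster depends on the triple store; the talk reports UNION as faster than FILTER on Virtuoso.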