Scientific Lenses over Linked Data:
Identity Management in the
Open PHACTS project
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
w...
Open PHACTS Use Case
“Let me compare MW, logP
and PSA for launched
inhibitors of human &
mouse oxidoreductases”
• Chemical...
Literature
PubChem
Genbank
Patents
Databases
Downloads
Data Integration Data Analysis
Firewalled Databases
Repeat @ each
c...
Open PHACTS Discovery Platform
21/05/2014 Brighton Seminar 3
Drug Discovery Platform
Apps
Domain API
Interactive
responses...
(April 2013 – March 2014)
15.8 million total hits
API Hits
An “App Store”?
http://www.openphactsfoundation.org/apps.html
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatre...
Drug
Disease
Pathway
Target
https://dev.openphacts.org/
Linked Data API
OPS Discovery Platform
Nanopub
Db
VoID
Data Cache
(Virtuoso Triple Store)
Semantic Workflow Engine
Linked Data API (RDF/XM...
Platform Interaction
Provenance
Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is
never less than th...
Gleevec® = Imatinib Mesylate
Drugbank ChemSpider PubChem
Imatinib
Mesylate Imatinib Mesylate
...
Multiple Links: Different Reasons
Link: skos:closeMatch
Reason: non-salt form
Link: skos:ex...
Strict Relaxed
Analysing Browsing
Dynamic Equality
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
Dynamic Equality
skos:closeMatch
(Drug Name)
skos:closeMa...
Initial Connectivity
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Compound Information
Genes == Proteins?
BRCA1
Breast cancer type 1
susceptibility protein
http://en.wikipedia.or...
Proceed with Caution!
Co-reference Computation
Rules ensure
• Unrestricted transitivity
within conceptual type
• Restrict crossing
conceptual ty...
Initial Connectivity
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Inferred Connectivity
Datasets 37
Linksets 883
Links 17,383,846
Justifications 7
BridgeDb
http://ops.rsc.org/OPS45975 http://ops.rsc.org/OPS45978
has_isotopically_unspecified_parent
[CHEMINF:000459]
has OPS norma...
OPS Discovery Platform
Nanopub
Db
VoID
Data Cache
(Virtuoso Triple Store)
Semantic Workflow Engine
Linked Data API (RDF/XM...
?iri cheminf:logd ?logd .
FILTER (?iri = cw:979b545d-f9a9 ||
?iri = cs:2157 ||
?iri = chembl:1280 ||
?iri = db:db00945 )
c...
Experiment
Is it feasible to use a stand-off
mapping service?
• Base lines (no external call):
– “Perfect” URIs
– Linked d...
“Perfect” URI Baseline
WHERE {
GRAPH <chemspider> {
cs:2157 cheminf:logp ?logp .
}
GRAPH <chembl> {
chembl_mol:m1280 chemi...
Linked Data Baseline
WHERE {
GRAPH <chemspider> {
cs:2157 cheminf:logp ?logp .
}
GRAPH <chembl> {
?chemblid cheminf:mw ?mw...
Queries
Drawn from Open PHACTS API:
1. Simple compound information (1)
2. Compound information (1)
3. Compound pharmacolog...
Queries
Drawn from Open PHACTS API:
1. Simple compound information (1)
2. Compound information (1)
3. Compound pharmacolog...
Data:
167,783,592 triples
Mappings:
2,114,584 triples
Lenses:
1
Experiment Data
Average execution times
Average execution times
0.018
Q6: Target Pharmacology
Conclusions
• Computing co-reference advantageous
– Requires fewer raw linksets
– Larger coverage across datasets
• Rules e...
Conclusions
• Query expansion slower in general
– Due to separate service call
– Difference below human perception
– UNION...
Questions
A.J.G.Gray@hw.ac.uk
www.alasdairjggray.co.uk
@gray_alasdair
pmu@openphacts.org
www.openphacts.org
@open_phacts
Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project

  • 1 of 83 business driver questions
  • Pharma are all accessing, processing, storing & re-processing external research data
    OPS: 29 partners
  • A platform for integrated pharmacology data
    Relied upon by pharma companies
    Public domain, commercial, and private data sources

    Provides domain specific API

    Making it easy to build multiple drug discovery applications: examples developed in the project
  • Public launch April 2013
  • 17 apps
    5 external
    1 in partnership
  • Linked data API: multiple response formats (JSON, RDF, XML, CSV …)
    3scale deployment
    Public dataset
  • Import data into cache

    API calls populate SPARQL queries

    Integration approach
    Data kept in original model
    Data cached in central triple store
    API call translated to SPARQL query
    Query expressed in terms of original data

    Queries expanded by IMS to cover URIs of original datasets
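
The translation of an API call into a SPARQL query can be sketched roughly as below. This is a minimal illustration, not the platform's implementation: the template, the function name `compound_info_query`, and the omission of prefix declarations are assumptions, while the graph names and predicates follow the "Perfect" URI baseline slide.

```python
from string import Template

# Hypothetical query template for a "compound information" style API method.
# Each GRAPH clause keeps the data in the terms of its original dataset.
COMPOUND_INFO = Template("""\
SELECT ?logp ?mw WHERE {
  GRAPH <chemspider> { $cs_uri cheminf:logp ?logp . }
  GRAPH <chembl> { $chembl_uri cheminf:mw ?mw . }
}""")


def compound_info_query(cs_uri: str, chembl_uri: str) -> str:
    """Fill the template with the dataset-specific URIs for one compound."""
    return COMPOUND_INFO.substitute(cs_uri=cs_uri, chembl_uri=chembl_uri)


if __name__ == "__main__":
    print(compound_info_query("cs:2157", "chembl_mol:m1280"))
```

In this sketch the caller already knows the URI used by each dataset; in the platform that knowledge comes from the IMS expansion step.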
  • Example using the Explorer application; see Ian’s demo of the new version in the demo session
    User starts typing
    Server sends back suggestions – User selects one
    URI sent to platform
    Integrated Information returned including provenance
  • Each captures a subtly different view of the world

    Are they the same? … depends on your point of view
  • Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases
    Different results

    Data is messy!
  • Enter with ChemSpider URI for Imatinib

    This is not Gleevec
  • sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets, each of which has a VoID header providing provenance and justification for the links.
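
A lens can be thought of as a choice of which link justifications count as equality. The sketch below is only an illustration under that reading: the `Link` record mirrors the four fields named above, but the identifiers, the two lens names, and the justification labels are hypothetical stand-ins rather than values from the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str  # the justification recorded for the link


# Hypothetical links for one compound.
LINKS = [
    Link("cs:imatinib", "pubchem:imatinib", "skos:exactMatch", "InChI"),
    Link("cs:imatinib", "drugbank:gleevec", "skos:closeMatch", "drug name"),
]

# Assumed lens definitions: the set of justifications each lens accepts.
LENSES = {
    "strict": {"InChI"},
    "relaxed": {"InChI", "drug name"},
}


def equivalents(uri: str, lens: str) -> set:
    """Return the URIs linked to `uri` by a justification the lens accepts."""
    allowed = LENSES[lens]
    found = set()
    for link in LINKS:
        if link.reason not in allowed:
            continue
        if link.source == uri:
            found.add(link.target)
        elif link.target == uri:
            found.add(link.source)
    return found
```

With these sample links, the strict lens relates the ChemSpider record only to its structural match, while the relaxed lens also follows the drug-name link.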
  • Interested in physicochemical properties of Gleevec
  • Interested in biomedical and pharmacological properties
  • Can enter with IDs from any of the supported datasets
  • Platform extracts data from certain datasets

    These need to be connected

    Here there is no issue in computing the transitive links, as they are all the same compound based on the InChI key

    Would compute the full set of links
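
Computing the full set of links from the raw ones can be sketched as a transitive closure. The version below is a simplified illustration, not the platform's rule set: it chains links only when they share the same justification (standing in for "the same conceptual type"), and the function name and sample identifiers are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def infer_links(links):
    """Close (source, target, justification) links transitively.

    Links are only chained with other links that carry the same
    justification, so a chain never crosses from one kind of link to another.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Union-find over (justification, uri) pairs.
    for source, target, justification in links:
        root_s = find((justification, source))
        root_t = find((justification, target))
        if root_s != root_t:
            parent[root_t] = root_s

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)

    inferred = set()
    for members in groups.values():
        justification = next(iter(members))[0]
        uris = sorted(uri for _, uri in members)
        for a, b in combinations(uris, 2):
            inferred.add((a, b, justification))
    return inferred
```

Two raw InChI links connecting three datasets yield three links after closure, which is the effect behind the growth from raw to inferred connectivity.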
  • Do genes == proteins?

    Different conceptual types: gene and protein

    Often used as a shortcut for retrieval: BRCA1 easier to remember and type!

    Require the ability to equate them in the IMS


    ----

    But if you’re saying why genes = proteins, you may also want to be prepared for questions about when genes != proteins. Splice variation is a common example. In the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins, which have different biological effects, so you can obviously mix bio data that shouldn’t be mixed by integrating these two functions on the same ID. [We currently don’t handle this well in OPS]

    And the most used example here: the ghrelin gene is transcribed into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene (http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants).
  • Insulin Receptor

    Issue when linking through PDB due to the way that proteins are crystallised
  • Can enter with IDs from any of the supported datasets
  • These are the figures for 1.3

    In 1.4:
    130 raw linksets with 6,985,278 links
    40,802 computed linksets with 25,584,293 links
  • Implementation available

    IMS takes query and expands URIs
  • Retinoic Acid
  • Reminder: enter with method and URI, implemented as a query

    Challenge: can we efficiently support lenses?

    Lenses require stand-off mappings, implemented as extra service call
  • Query with URIs
    Extract URIs
    Find equivalents
    Expand query
    Optimise based on context
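
The expansion steps listed above can be sketched as follows. This is a rough illustration under stated assumptions: the in-memory `MAPPINGS` dictionary stands in for the call to the mapping service, the prefix list in the regular expression is hypothetical, and only the first URI found in the query is expanded. The sample URIs are the ones shown on the query expansion slide.

```python
import re

# Stand-in for the mapping service: equivalent URIs for one input URI.
MAPPINGS = {
    "cw:979b545d-f9a9": ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"],
}

# Assumed set of dataset prefixes whose URIs should be expanded.
URI_PATTERN = re.compile(r"\b(?:cw|cs|chembl|db):[\w-]+")


def extract_uris(query: str) -> list:
    """Step 1: pull the dataset URIs out of the query text."""
    return URI_PATTERN.findall(query)


def expand_with_filter(query: str, var: str = "?iri") -> str:
    """Replace the URI with a variable and constrain it with a FILTER."""
    uri = extract_uris(query)[0]
    equivalents = MAPPINGS.get(uri, [uri])
    conditions = " || ".join(f"{var} = {e}" for e in equivalents)
    return f"{query.replace(uri, var)}\nFILTER ({conditions})"


def expand_with_union(query: str) -> str:
    """Repeat the pattern once per equivalent URI, joined by UNION."""
    uri = extract_uris(query)[0]
    equivalents = MAPPINGS.get(uri, [uri])
    branches = ["{ " + query.replace(uri, e) + " }" for e in equivalents]
    return "\nUNION\n".join(branches)


if __name__ == "__main__":
    pattern = "cw:979b545d-f9a9 cheminf:logd ?logd ."
    print(expand_with_filter(pattern))
    print(expand_with_union(pattern))
```

The context-based optimisation step (restricting each equivalent URI to the graph it belongs to) is left out of this sketch.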
  • Result size in brackets
  • Orange are actual OPS queries
  • Subset of the OPS data
  • Linked data approach performs badly with query 6 due to the query construction
    Name being bound to the chemical structure returned
  • Focus on other queries
    In general, expansion is slower than the baselines
    Worst-case delta: 0.01842 s (under 20 ms)
    Human perception threshold is 0.050 to 0.2 s (50-200 ms)
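
The comparison in these notes is simple arithmetic, assuming the timings are in seconds (which is what "under 20ms" implies). A small check using only the figures quoted above, with hypothetical names:

```python
# Figures from the notes, assumed to be in seconds.
WORST_CASE_DELTA_S = 0.01842
PERCEPTION_RANGE_S = (0.050, 0.2)


def to_ms(seconds: float) -> float:
    """Convert seconds to milliseconds."""
    return seconds * 1000.0


def below_perception(delta_s: float, perception_range_s=PERCEPTION_RANGE_S) -> bool:
    """True when the delay is shorter than the lower perception bound."""
    return delta_s < perception_range_s[0]


if __name__ == "__main__":
    print(f"worst-case delta: {to_ms(WORST_CASE_DELTA_S):.2f} ms")
    print("below perception threshold:", below_perception(WORST_CASE_DELTA_S))
```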
  • Focus on query 6
    No linked data as it performed very poorly on this query
    Size of result obliterates external call cost
  • Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project

    1. Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project Alasdair J G Gray A.J.G.Gray@hw.ac.uk www.alasdairjggray.co.uk @gray_alasdair http://c745.r45.cf2.rackcdn.com/img/2009/lens_filter_coasters.jpg
    2. Open PHACTS Use Case “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases” • Chemical Properties (Chemspider) • Launched drugs (Drugbank) • Human => Mouse (Homologene) • Protein Families (Enzyme) • Bioactivity Data (ChEMBL) • … other info (Uniprot/Entrez etc.) 21/05/2014 Brighton Seminar 1
    3. Literature PubChem Genbank Patents Databases Downloads Data Integration Data Analysis Firewalled Databases Repeat @ each company x Lowering industry firewalls: pre-competitive informatics in drug discovery Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944 A single, shared solution. Funded under • IMI: 2011-14 • ENSO: 2014-16 Pre-competitive Informatics
    4. Open PHACTS Discovery Platform Drug Discovery Platform Apps Domain API Interactive responses Production quality integration platform Method Calls
    5. (April 2013 – March 2014) 15.8 million total hits API Hits
    6. An “App Store”? http://www.openphactsfoundation.org/apps.html Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium MOE Collector Cytophacts Utopia Garfield SciBite KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
    7. Drug Disease Pathway Target https://dev.openphacts.org/ Linked Data API
    8. OPS Discovery Platform Nanopub Db VoID Data Cache (Virtuoso Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Chemistry Registration Normalisation & Q/C Identifier Management Service Indexing Core Platform P12374 EC2.43.4 CS4532 “Adenosine receptor 2a” VoID Db Nanopub Db VoID Db VoID Nanopub VoID Public Content Commercial Public Ontologies User Annotations Apps
    9. Platform Interaction
    10. Provenance
    11. Multiple Identities Andy Law's Third Law “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/ P12047 X31045 GB:29384 Are these the same thing?
    12. Gleevec® = Imatinib Mesylate Drugbank ChemSpider PubChem Imatinib Mesylate Imatinib Mesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N
    13. [no text on slide]
    14. [no text on slide]
    15. Multiple Links: Different Reasons Link: skos:closeMatch Reason: non-salt form Link: skos:exactMatch Reason: drug name
    16. Strict Relaxed Analysing Browsing Dynamic Equality skos:exactMatch (InChI)
    17. Strict Relaxed Analysing Browsing Dynamic Equality skos:closeMatch (Drug Name) skos:closeMatch (Drug Name) skos:exactMatch (InChI)
    18. Initial Connectivity Datasets 37 Linksets 104 Links 7,096,712 Justifications 7
    19. Compound Information
    20. Genes == Proteins? BRCA1 Breast cancer type 1 susceptibility protein http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png http://en.wikipedia.org/wiki/File:BRCA1_en.png
    21. Proceed with Caution!
    22. Co-reference Computation Rules ensure • Unrestricted transitivity within conceptual type • Restrict crossing conceptual types Based on justifications Provenance captured
    23. Initial Connectivity Datasets 37 Linksets 104 Links 7,096,712 Justifications 7
    24. Inferred Connectivity Datasets 37 Linksets 883 Links 17,383,846 Justifications 7
    25. BridgeDb
    26. http://ops.rsc.org/OPS45975 http://ops.rsc.org/OPS45978 has_isotopically_unspecified_parent [CHEMINF:000459] has OPS normalized counterpart [CHEMINF:000458] http://ops.rsc.org/OPS45991 is_tautomer_of [chebi:is_tautomer_of] http://ops.rsc.org/OPS45987 has_stereoundefined_parent [CHEMINF:000456] http://ops.rsc.org/OPS45981 Lenses
    27. OPS Discovery Platform Nanopub Db VoID Data Cache (Virtuoso Triple Store) Semantic Workflow Engine Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services Identity Resolution Service Chemistry Registration Normalisation & Q/C Identifier Management Service Indexing Core Platform P12374 EC2.43.4 CS4532 “Adenosine receptor 2a” VoID Db Nanopub Db VoID Db VoID Nanopub VoID Public Content Commercial Public Ontologies User Annotations Apps
    28. ?iri cheminf:logd ?logd . FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945 ) cw:979b545d-f9a9 cheminf:logd ?logd . GRAPH <http://rdf.chemspider.com> { } cw:979b545d-f9a9 cheminf:logd ?logd . Query Expansion Identity Mapping Service (BridgeDB) Query Expander Service Profiles Mappings Q, L1 Q’ [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945] cw:979b545d-f9a9, L1 Can also be achieved through UNION
    29. Experiment Is it feasible to use a stand-off mapping service? • Baselines (no external call): – “Perfect” URIs – Linked data querying • Expansion approaches (external service call): – FILTER by Graph – UNION by Graph C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
    30. “Perfect” URI Baseline WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { chembl_mol:m1280 cheminf:mw ?mw . } }
    31. Linked Data Baseline WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { ?chemblid cheminf:mw ?mw . } cs:2157 skos:exactMatch ?chemblid . }
    32. Queries Drawn from Open PHACTS API: 1. Simple compound information (1) 2. Compound information (1) 3. Compound pharmacology (M) 4. Simple target information (1) 5. Target information (1) 6. Target pharmacology (M)
    33. Queries Drawn from Open PHACTS API: 1. Simple compound information (1) 2. Compound information (1) 3. Compound pharmacology (M) 4. Simple target information (1) 5. Target information (1) 6. Target pharmacology (M)
    34. Data: 167,783,592 triples Mappings: 2,114,584 triples Lenses: 1 Experiment Data
    35. Average execution times
    36. Average execution times 0.018
    37. Q6: Target Pharmacology
    38. Conclusions • Computing co-reference advantageous – Requires fewer raw linksets – Larger coverage across datasets • Rules ensure control – Genes can equal proteins – Compounds never equal proteins • Provenance captured throughout
    39. Conclusions • Query expansion slower in general – Due to separate service call – Difference below human perception – UNION faster than FILTER on Virtuoso • Stand-off mappings feasible • Infrastructure can support lenses Strict Relaxed Analysing Browsing
    40. Questions A.J.G.Gray@hw.ac.uk www.alasdairjggray.co.uk @gray_alasdair pmu@openphacts.org www.openphacts.org @open_phacts