Computing Identity Co-Reference Across Drug Discovery Datasets


This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.


Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer

www.openphacts.org
@open_phacts

A.J.G.Gray@hw.ac.uk
@gray_alasdair

Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/

GB:29384, P12047, X31045: are these the same thing?

10/12/2013

SWAT4LS 2013


Gleevec® = Imatinib Mesylate
[Figure: lookup results from ChemSpider, Drugbank, and PubChem. Labels shown: Imatinib; Imatinib Mesylate; Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N]

Multiple Links: Different Reasons

Link: skos:closeMatch
Reason: non-salt form

Link: skos:exactMatch
Reason: drug name
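As a rough illustration of why the reason matters, the sketch below models a link as a source, target, predicate, and reason, and keeps only the links whose reason is acceptable in a given context. The `Link` class, the `links_for_context` helper, and the example.org URIs are all hypothetical names chosen for this example; they are not taken from the Open PHACTS code base.

```python
from dataclasses import dataclass

SKOS = "http://www.w3.org/2004/02/skos/core#"


@dataclass(frozen=True)
class Link:
    source: str     # URI of the instance in the source dataset
    target: str     # URI of the instance in the target dataset
    predicate: str  # e.g. skos:exactMatch or skos:closeMatch
    reason: str     # why the link holds, e.g. "drug name"


def links_for_context(links, accepted_reasons):
    """Keep only the links whose reason is acceptable in this context."""
    return [link for link in links if link.reason in accepted_reasons]


# Hypothetical URIs, for illustration only.
links = [
    Link("http://example.org/a/gleevec",
         "http://example.org/b/imatinib-mesylate",
         SKOS + "exactMatch", "drug name"),
    Link("http://example.org/b/imatinib-mesylate",
         "http://example.org/c/imatinib",
         SKOS + "closeMatch", "non-salt form"),
]

# A strict context accepts only name-based matches; a relaxed one
# also accepts links between a salt and its non-salt form.
strict = links_for_context(links, {"drug name"})
relaxed = links_for_context(links, {"drug name", "non-salt form"})
```

Here the strict context would treat Gleevec and Imatinib Mesylate as the same, while only the relaxed context would also reach Imatinib.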

Open PHACTS Discovery Platform
[Diagram: Apps send method calls to the Domain API of the Drug Discovery Platform and receive interactive responses]
Production quality integration platform

OPS Discovery Platform
[Architecture diagram. Labelled components: Apps; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Identity Resolution Service; Identifier Management Service; Domain Specific Services; Semantic Workflow Engine; Chemistry Registration; Normalisation & Q/C; Data Cache (Virtuoso Triple Store); Indexing. Data sources, described with VoID and Nanopub: Public Ontologies, Public Content, Commercial, User Annotations. Example inputs shown: “Adenosine receptor 2a”; P12374, EC2.43.4, CS4532]

Platform Interaction


Connectivity of Initial Linksets
Datasets: 37
Linksets: 104
Links: 7,096,712
Justifications: 7

Genes == Proteins?
BRCA1

Breast cancer type 1 susceptibility protein

http://en.wikipedia.org/wiki/File:BRCA1_en.png

http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png

Proceed with Caution!


Co-reference Computation
Rules ensure
• Unrestricted transitivity within conceptual type
• Restrict crossing conceptual types

Based on justifications

Provenance captured

[Diagram: model with 0..* and 0..1 multiplicities]
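One possible reading of these two rules is sketched below. It is an assumption made for illustration, not the actual rule set of the Identity Management Service: links carrying the same within-type justification chain freely, a chain may contain at most one type-crossing link, and two type-crossing links never chain. The function names and identifiers are hypothetical.

```python
from itertools import product


def combine(j1, j2, crossing):
    """Justification of a chained link, or None if the assumed rules forbid it."""
    if j1 in crossing and j2 in crossing:
        return None                    # never cross conceptual types twice
    if j1 in crossing:
        return j1                      # one crossing: the chain keeps it
    if j2 in crossing:
        return j2
    return j1 if j1 == j2 else None    # within a type: same justification only


def compute_coreferences(links, crossing):
    """Close (source, target, justification) triples under the assumed rules."""
    known = set()
    for s, t, j in links:
        known.add((s, t, j))
        known.add((t, s, j))           # links are treated as symmetric
    changed = True
    while changed:                     # naive fixpoint; fine for a small sketch
        changed = False
        for (a, b, j1), (b2, c, j2) in product(list(known), repeat=2):
            if b != b2 or a == c:
                continue
            j = combine(j1, j2, crossing)
            if j is not None and (a, c, j) not in known:
                known.add((a, c, j))
                changed = True
    return known


# Hypothetical identifiers and justification names.
raw_links = [
    ("chemspider:c1", "drugbank:c1", "inchi_key"),
    ("drugbank:c1", "pubchem:c1", "inchi_key"),
    ("gene:g1", "protein:p1", "gene_to_protein"),
    ("gene:g2", "protein:p1", "gene_to_protein"),
]
computed = compute_coreferences(raw_links, crossing={"gene_to_protein"})
```

With these inputs the compound identifiers become fully connected through the shared justification, while the two genes are not equated with each other even though both link to the same protein.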


Connectivity of Computed Linksets
Datasets: 37
Linksets: 883
Links: 17,383,846
Justifications: 7
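To put the initial and computed figures side by side, the short calculation below divides each computed count by its initial counterpart. The numbers are taken directly from the two connectivity slides; the variable names are only for this example.

```python
# Counts reported on the "Initial Linksets" and "Computed Linksets" slides.
initial = {"datasets": 37, "linksets": 104, "links": 7_096_712, "justifications": 7}
after_computation = {"datasets": 37, "linksets": 883, "links": 17_383_846, "justifications": 7}

# Ratio of computed to initial for each measure.
growth = {name: after_computation[name] / initial[name] for name in initial}

for name, ratio in growth.items():
    print(f"{name}: x{ratio:.2f}")
```

Over the same 37 datasets and 7 justifications, the computation yields roughly 8.5 times as many linksets and roughly 2.45 times as many links.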

Conclusions
• Computing co-reference is advantageous
– Requires fewer raw linksets
– Larger coverage across datasets

• Rules ensure control
– Genes can equal proteins
– Compounds never equal proteins

• Provenance captured throughout

Questions
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33
@gray_alasdair

Open PHACTS Project

pmu@openphacts.org
www.openphacts.org
@open_phacts



Editor's Notes

Each captures a subtly different view of the world. Are they the same? … It depends on your point of view.
Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results. Data is messy!
Enter with the ChemSpider URI for Imatinib. This is not Gleevec.
sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, each of which has a VoID header providing provenance and justification for its links.
A platform for integrated pharmacology data. Relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications: examples developed in the project.
Import data into cache. API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by IMS to cover URIs of original datasets.
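The query expansion step in the note above can be pictured with a small sketch. It assumes the equivalent URIs are available as a plain lookup table and that the expansion is done with a SPARQL 1.1 `VALUES` clause; the platform may well do this differently. The `expand_uri` and `expand_query` helpers, the `{{URIS}}` placeholder, and the example.org URIs are hypothetical.

```python
def expand_uri(uri, mappings):
    """Return the URI plus every URI recorded as equivalent to it, sorted."""
    return sorted({uri, *mappings.get(uri, ())})


def expand_query(template, uri, mappings):
    """Fill the {{URIS}} placeholder with all equivalent URIs."""
    values = " ".join(f"<{u}>" for u in expand_uri(uri, mappings))
    return template.replace("{{URIS}}", values)


# Query written once, against whatever URIs each original dataset uses.
TEMPLATE = """SELECT ?compound ?p ?o WHERE {
  VALUES ?compound { {{URIS}} }
  ?compound ?p ?o .
}"""

# Hypothetical stand-in for the co-references an identity service would return.
mappings = {
    "http://example.org/chemspider/1": [
        "http://example.org/drugbank/1",
        "http://example.org/pubchem/1",
    ],
}

query = expand_query(TEMPLATE, "http://example.org/chemspider/1", mappings)
print(query)
```

Because the data stays in its original model, the same query pattern can match triples from every dataset once the input URI has been expanded to its co-referent URIs.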
User starts typing. Server sends back suggestions. User selects one. URI sent to platform. Integrated information returned.
Can enter with IDs from any of the supported datasets
Platform extracts data from certain datasets. These need to be connected. Here there is no issue in computing transitive links, as they are all the same compound based on InChI key. Would compute the full set of links.
Do genes == proteins? Different conceptual types: gene and protein. Often used as a shortcut for retrieval: BRCA1 is easier to remember and type! We require the ability to equate them in the IMS.
----
But if you're saying why genes = proteins, you may also want to be prepared for questions of when genes != proteins. Splice variation is a common example: in the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins which have different biological effects, so you can easily mix bio data that shouldn't be mixed by integrating these two functions on the same ID. [We currently don't handle this well in OPS.] And the most used example here: the ghrelin gene is transcribed and translated into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene (http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants).
Insulin receptor: issue when linking through PDB due to the way that proteins are crystallised.
