This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the context of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
Computing Identity Co-Reference Across Drug Discovery Datasets
1. Computing Identity Co-reference Across Drug Discovery Datasets
Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer
www.openphacts.org
@open_phacts
A.J.G.Gray@hw.ac.uk
@gray_alasdair
2. Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
GB:29384
P12047
X31045
Are these the same thing?
10/12/2013
SWAT4LS 2013
1
6. Multiple Links: Different Reasons
Link: skos:closeMatch
Reason: non-salt form
Link: skos:exactMatch
Reason: drug name
7. Open PHACTS Discovery Platform
Diagram labels: Apps; Method Calls; Interactive responses; Domain API; Drug Discovery Platform; Production quality integration platform.
8. OPS Discovery Platform
Architecture diagram. Labels: Apps; Core Platform; Identity Resolution Service; Identifier Management Service; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Chemistry Registration, Normalisation & Q/C; Data Cache (Virtuoso Triple Store); Indexing.
Examples shown: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532.
Data source labels: VoID, Nanopub, Db; Public Ontologies, Public Content, Commercial, User Annotations.
12. Genes == Proteins?
BRCA1
Breast cancer type 1
susceptibility protein
http://en.wikipedia.org/wiki/File:BRCA1_en.png
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
Each captures a subtly different view of the world. Are they the same? … It depends on your point of view.
Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases gives different results. Data is messy!
Enter with the ChemSpider URI for Imatinib. This is not Gleevec.
sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the links.
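The link and linkset structure described here can be sketched in a few lines of Python. This is only an illustration, not the Open PHACTS data model: the class names, the `provenance` field, and the `ex:` identifiers are assumptions made for the example, and the shared header is a plain stand-in for the VoID header. The two predicates and reasons are the ones shown on the "Multiple Links: Different Reasons" slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Link:
    """One link between two data instances."""
    source: str
    target: str
    predicate: str
    reason: str


@dataclass
class Linkset:
    """A group of links sharing one header (a stand-in for the VoID header)."""
    predicate: str
    reason: str        # justification shared by every link in the set
    provenance: str    # hypothetical field: who or what produced the links
    pairs: List[Tuple[str, str]] = field(default_factory=list)

    def links(self) -> List[Link]:
        """Expand the header and the pairs into individual links."""
        return [Link(s, t, self.predicate, self.reason) for s, t in self.pairs]


# Placeholder identifiers, not real dataset URIs.
non_salt = Linkset("skos:closeMatch", "non-salt form", "hypothetical linkset A",
                   [("ex:db1/gleevec", "ex:db2/imatinib")])
drug_name = Linkset("skos:exactMatch", "drug name", "hypothetical linkset B",
                    [("ex:db1/gleevec", "ex:db3/gleevec")])

all_links = non_salt.links() + drug_name.links()

# A consumer who only accepts exact matches keeps one of the two links.
exact_only = [link for link in all_links if link.predicate == "skos:exactMatch"]
```

Keeping the reason on the linkset rather than on each link means a consumer can accept or reject a whole set of links in one decision.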
A platform for integrated pharmacology data. Relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications: examples developed in the project.
Import data into cache. API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by IMS to cover URIs of original datasets.
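The query expansion step can be illustrated with a small Python sketch. It assumes, purely for illustration, that a query template carries a placeholder token that is replaced by a SPARQL 1.1 `VALUES` list of all equivalent URIs; the token name, the function names, and the `example.org` URIs are hypothetical, and the actual IMS expansion mechanism may well differ.

```python
from typing import Dict, List, Set

URI_SLOT = "%URIS%"  # hypothetical placeholder token inside a query template


def equivalent_uris(uri: str, mappings: Dict[str, Set[str]]) -> List[str]:
    """Return the input URI plus every URI mapped to it, in sorted order."""
    return sorted({uri} | mappings.get(uri, set()))


def expand_query(template: str, uri: str, mappings: Dict[str, Set[str]]) -> str:
    """Fill the placeholder with the equivalent URIs, each wrapped in <...>."""
    values = " ".join(f"<{u}>" for u in equivalent_uris(uri, mappings))
    return template.replace(URI_SLOT, values)


# Placeholder template and mappings, standing in for the cached datasets.
TEMPLATE = "SELECT ?p ?o WHERE { VALUES ?s { %URIS% } ?s ?p ?o }"
MAPPINGS = {
    "http://example.org/a/1": {"http://example.org/b/9", "http://example.org/c/7"},
}

query = expand_query(TEMPLATE, "http://example.org/a/1", MAPPINGS)
```

Because the query is written against each dataset's original model, only the set of subject URIs changes between the unexpanded and the expanded query.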
User starts typing. Server sends back suggestions. User selects one. URI sent to platform. Integrated information returned.
Can enter with IDs from any of the supported datasets
Platform extracts data from certain datasets. These need to be connected. Here there is no issue in computing transitive links, as they are all the same compound based on InChI key. Would compute the full set of links.
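A minimal sketch of computing such a transitive set of links, assuming links are treated as symmetric and that the user's context is expressed as a set of accepted reasons. Both assumptions, the function name, and all identifiers are illustrative only and are not taken from the IMS implementation.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def coreference_set(start: str,
                    links: Iterable[Tuple[str, str, str]],
                    allowed_reasons: Set[str]) -> Set[str]:
    """Collect every identifier reachable from start via accepted links.

    Each link is a (source, target, reason) triple. Links whose reason is
    not accepted are ignored, so the chain never crosses them.
    """
    graph = defaultdict(set)
    for source, target, reason in links:
        if reason in allowed_reasons:
            graph[source].add(target)
            graph[target].add(source)  # assumption: links are symmetric

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal over accepted links
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Placeholder identifiers and links.
LINKS = [
    ("ex:db1/A", "ex:db2/B", "InChI key"),
    ("ex:db2/B", "ex:db3/C", "InChI key"),
    ("ex:db3/C", "ex:db3/C-salt", "non-salt form"),
]

strict = coreference_set("ex:db1/A", LINKS, {"InChI key"})
relaxed = coreference_set("ex:db1/A", LINKS, {"InChI key", "non-salt form"})
```

With only InChI key links accepted, the chain stops at the three identifiers for the same compound; accepting the non-salt form reason as well pulls in the salt record.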
Do genes == proteins? Different conceptual types: gene and protein. Often used as a shortcut for retrieval: BRCA1 is easier to remember and type! Require the ability to equate them in the IMS.
----
But if you're saying why genes = proteins, you may also want to be prepared for questions of when genes != proteins. Splice variation is a common example (in the FAS receptor, http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor, there is one gene but it can be made into two distinct proteins, which have different biological effects), so you can obviously mix bio data that shouldn't be mixed by integrating these two functions on the same ID. [We currently don't handle this well in OPS.] And the most used example here: the ghrelin gene encodes a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene: http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants
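The trade-off in these notes can be made concrete with a short Python sketch: equating genes and proteins is a switch the consumer turns on, and a gene that maps to more than one protein is flagged as ambiguous. The function names, the switch, and the `gene:` and `protein:` identifiers are hypothetical and only illustrate the idea; as noted above, this case is not currently handled well in OPS.

```python
from typing import Dict, Set


def resolve(identifier: str,
            gene_to_proteins: Dict[str, Set[str]],
            equate_genes_and_proteins: bool) -> Set[str]:
    """Return the identifiers treated as equivalent under the chosen context."""
    if not equate_genes_and_proteins:
        return {identifier}
    return {identifier} | gene_to_proteins.get(identifier, set())


def is_ambiguous(identifier: str, gene_to_proteins: Dict[str, Set[str]]) -> bool:
    """True when one gene maps to several distinct proteins."""
    return len(gene_to_proteins.get(identifier, set())) > 1


# Placeholder identifiers; the ghrelin case follows the example in the notes.
GENE_TO_PROTEINS = {
    "gene:BRCA1": {"protein:BRCA1"},
    "gene:GHRL": {"protein:ghrelin", "protein:obestatin"},
}

brca1 = resolve("gene:BRCA1", GENE_TO_PROTEINS, equate_genes_and_proteins=True)
ghrelin_strict = resolve("gene:GHRL", GENE_TO_PROTEINS, equate_genes_and_proteins=False)
```

An application could use `is_ambiguous` to warn before merging data for hormones with different biological effects under one gene identifier.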
Insulin Receptor. Issue when linking through PDB due to the way that proteins are crystallised.
Can enter with IDs from any of the supported datasets