Computing Identity Co-Reference Across Drug Discovery Datasets

Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer

www.openphacts.org
@open_phacts

A.J.G.Gray@hw.ac.uk
@gray_alasdair

Multiple Identities
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/

GB:29384, P12047, X31045

Are these the same thing?

10/12/2013

SWAT4LS 2013

1

Gleevec® = Imatinib Mesylate
Imatinib

Imatinib Mesylate Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N

ChemSpider, DrugBank, PubChem

Multiple Links: Different Reasons

Link: skos:closeMatch
Reason: non-salt form

Link: skos:exactMatch
Reason: drug name
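
The idea of a link that carries both a predicate and a reason can be sketched in code. This is a minimal illustration, not the Open PHACTS implementation: the `Link` class, the `equivalents` function and the `ex:` identifiers are all hypothetical; only the two predicates and the two reasons come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str  # e.g. "skos:exactMatch"
    reason: str     # why the link was asserted (its justification)


# Hypothetical identifiers; predicates and reasons are those on the slide.
LINKS = [
    Link("ex:A1", "ex:B1", "skos:exactMatch", "drug name"),
    Link("ex:A1", "ex:C1", "skos:closeMatch", "non-salt form"),
]


def equivalents(identifier, links, accepted_reasons):
    """Identifiers linked to `identifier` for a reason the caller accepts."""
    found = set()
    for link in links:
        if link.reason not in accepted_reasons:
            continue  # the caller does not accept this justification
        if link.source == identifier:
            found.add(link.target)
        elif link.target == identifier:
            found.add(link.source)  # links are treated as symmetric here
    return found


strict = equivalents("ex:A1", LINKS, {"drug name"})
relaxed = equivalents("ex:A1", LINKS, {"drug name", "non-salt form"})
```

Under these assumptions a strict caller sees only the exact match, while a caller that also accepts the non-salt form sees both records.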

Open PHACTS Discovery Platform
Production quality integration platform

[Diagram: Apps send method calls to the Domain API of the Drug Discovery Platform and receive interactive responses.]

OPS Discovery Platform

[Architecture diagram showing these components:]
• Apps
• Linked Data API (RDF/XML, TTL, JSON)
• Core Platform
• Identity Resolution Service
• Identifier Management Service
• Domain Specific Services
• Semantic Workflow Engine
• Chemistry Registration, Normalisation & Q/C
• Data Cache (Virtuoso Triple Store)
• Indexing
• Data sources (Db) with VoID and Nanopub descriptions: Public Ontologies, Public Content, Commercial, User Annotations

Example inputs shown: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532

Platform Interaction


Connectivity of Initial Linksets
Datasets: 37
Linksets: 104
Links: 7,096,712
Justifications: 7

Genes == Proteins?
BRCA1

Breast cancer type 1 susceptibility protein

http://en.wikipedia.org/wiki/File:BRCA1_en.png

http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png

Proceed with Caution!


Co-reference Computation
Rules ensure
• Unrestricted transitivity within a conceptual type
• Restricted crossing between conceptual types

Based on justifications

Provenance captured
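
The two rules can be sketched as a graph traversal. This is an illustration under stated assumptions, not the platform's algorithm. The identifiers, the justification strings, the `TYPE_OF` table and the rule that a chain may cross between conceptual types at most once are all hypothetical. The allowed gene/protein crossing and the exclusion of compounds follow the deck's conclusions.

```python
from collections import deque

# Hypothetical conceptual type of each identifier (illustrative values only).
TYPE_OF = {
    "gene:A": "gene",
    "gene:B": "gene",
    "protein:A": "protein",
    "protein:B": "protein",
    "compound:A": "compound",
}

# Type pairs a chain may cross, and then at most once (an assumed rule).
ALLOWED_CROSSINGS = {frozenset({"gene", "protein"})}

# Raw links as (source, target, justification); treated as symmetric here.
LINKS = [
    ("gene:A", "gene:B", "same gene"),
    ("gene:B", "protein:A", "gene encodes protein"),
    ("protein:A", "protein:B", "same protein"),
    ("protein:B", "compound:A", "protein is target of compound"),
]


def co_references(start, links, type_of, allowed):
    """Return {identifier: chain of links used to reach it} for `start`."""
    neighbours = {}
    for link in links:
        source, target, _ = link
        neighbours.setdefault(source, []).append((target, link))
        neighbours.setdefault(target, []).append((source, link))

    provenance = {start: []}  # the chain of links is the recorded provenance
    seen = {(start, False)}
    queue = deque([(start, False, [])])
    while queue:
        node, crossed, chain = queue.popleft()
        for nxt, link in neighbours.get(node, []):
            pair = frozenset({type_of[node], type_of[nxt]})
            crossing = len(pair) == 2
            if crossing and (crossed or pair not in allowed):
                continue  # rule: restrict crossing conceptual types
            state = (nxt, crossed or crossing)
            if state in seen:
                continue
            seen.add(state)
            new_chain = chain + [link]
            provenance.setdefault(nxt, new_chain)
            queue.append((nxt, state[1], new_chain))
    return provenance


result = co_references("gene:A", LINKS, TYPE_OF, ALLOWED_CROSSINGS)
```

In this sketch, links within one conceptual type are followed without limit, the single gene-to-protein step is permitted, and the compound is never reached. Each result keeps the chain of raw links that produced it, which stands in for captured provenance.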


Connectivity of Computed Linksets
Datasets: 37
Linksets: 883
Links: 17,383,846
Justifications: 7
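
The gain from computing co-reference can be read off the reported figures. The short calculation below compares the initial counts (104 linksets, 7,096,712 links) with the computed ones. The variable names are illustrative, and the ratios are simple quotients of the slide figures rather than numbers stated in the talk.

```python
# Counts reported on the "initial" and "computed" connectivity slides.
initial = {"datasets": 37, "linksets": 104, "links": 7_096_712}
computed = {"datasets": 37, "linksets": 883, "links": 17_383_846}

# Growth factors after computing co-reference over the same 37 datasets.
linkset_growth = computed["linksets"] / initial["linksets"]
link_growth = computed["links"] / initial["links"]

print(f"linksets grew {linkset_growth:.2f}x")
print(f"links grew {link_growth:.2f}x")
```

With these figures the number of linksets grows by roughly 8.5 times and the number of links by roughly 2.4 times, with no additional raw linksets loaded.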

BridgeDb


Conclusions
• Computing co-reference is advantageous
– Requires fewer raw linksets
– Larger coverage across datasets

• Rules ensure control
– Genes can equal proteins
– Compounds never equal proteins

• Provenance captured throughout


Questions
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33
@gray_alasdair

Open PHACTS Project

pmu@openphacts.org
www.openphacts.org
@open_phacts
