Computing Identity Co-reference Across Drug Discovery Datasets

This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to apply in the context of their application. We highlight the challenges of automatically computing co-reference and the need to capture the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
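
The idea of letting consumers choose an operational notion of equivalence can be illustrated with a minimal Python sketch. It assumes that each link carries a justification string (here the two reasons shown in the slides, "drug name" and "non-salt form") and that a context is simply the set of justifications a consumer accepts. The identifiers and the `Link` and `links_for_context` names are hypothetical and are not taken from the Open PHACTS code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str      # e.g. skos:exactMatch or skos:closeMatch
    justification: str  # the reason the link was asserted


# Hypothetical identifiers, used only for illustration.
LINKS = [
    Link("chemspider:EX1", "drugbank:EX2", "skos:exactMatch", "drug name"),
    Link("chemspider:EX1", "pubchem:EX3", "skos:closeMatch", "non-salt form"),
]


def links_for_context(links, accepted_justifications):
    """Keep only the links whose justification the consumer accepts."""
    return [link for link in links if link.justification in accepted_justifications]


# A strict context accepts only name-based matches.
strict = links_for_context(LINKS, {"drug name"})
# A relaxed context also treats salt and non-salt forms as equivalent.
relaxed = links_for_context(LINKS, {"drug name", "non-salt form"})
```

Under these assumptions the same set of links yields different co-reference results depending on the context, which is the behaviour the abstract asks for.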

Published in: Technology, Business

  1. Computing Identity Co-reference Across Drug Discovery Datasets. Christian Y A Brenninkmeijer, Ian Dunlop, Carole Goble, Alasdair J G Gray, and Steve Pettifer. www.openphacts.org, @open_phacts, A.J.G.Gray@hw.ac.uk, @gray_alasdair
  2. Multiple Identities. Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” (http://bioinformatics.roslin.ac.uk/lawslaws/). GB:29384, P12047, X31045: are these the same thing? (SWAT4LS 2013, 10/12/2013)
  3. Gleevec® = Imatinib Mesylate. Figure labels: Imatinib; Imatinib Mesylate; Mesylate; YLMAHDNUQAMNNX-UHFFFAOYSA-N. Sources: ChemSpider, Drugbank, PubChem.
  4. (No slide text extracted.)
  5. (No slide text extracted.)
  6. Multiple Links: Different Reasons. Link: skos:closeMatch, reason: non-salt form. Link: skos:exactMatch, reason: drug name.
  7. Open PHACTS Discovery Platform. Diagram labels: Apps; Interactive responses; Method Calls; Domain API; Drug Discovery Platform; Production quality integration platform.
  8. OPS Discovery Platform. Architecture diagram labels: Core Platform; Apps; Identity Resolution Service; Identifier Management Service; “Adenosine receptor 2a”; Linked Data API (RDF/XML, TTL, JSON); P12374, EC2.43.4, CS4532; Domain Specific Services; Semantic Workflow Engine; Chemistry Registration Normalisation & Q/C; Data Cache (Virtuoso Triple Store); Indexing; VoID; Nanopub; Db; Public Ontologies; Public Content; Commercial; User Annotations.
  9. Platform Interaction.
  10. Connectivity of Initial Linksets. Datasets: 37. Linksets: 104. Links: 7,096,712. Justifications: 7.
  11. Genes == Proteins? BRCA1; Breast cancer type 1 susceptibility protein. Images: http://en.wikipedia.org/wiki/File:BRCA1_en.png, http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
  12. Proceed with Caution!
  13. Co-reference Computation. Rules ensure: unrestricted transitivity within conceptual type; restricted crossing of conceptual types. Based on justifications. Provenance captured. Diagram multiplicities: 0..*, 0..1, 0..*, 0..1, 0..*.
  14. Connectivity of Initial Linksets. Datasets: 37. Linksets: 104. Links: 7,096,712. Justifications: 7.
  15. Connectivity of Computed Linksets. Datasets: 37. Linksets: 883. Links: 17,383,846. Justifications: 7.
  16. BridgeDb.
  17. Conclusions. Computing co-reference is advantageous: it requires fewer raw linksets and gives larger coverage across datasets. Rules ensure control: genes can equal proteins; compounds never equal proteins. Provenance is captured throughout.
  18. Questions. A.J.G.Gray@hw.ac.uk, www.macs.hw.ac.uk/~ajg33, @gray_alasdair. Open PHACTS Project: pmu@openphacts.org, www.openphacts.org, @open_phacts
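
The rules summarised on the Co-reference Computation and Conclusions slides can be sketched in Python as follows. This is one possible reading and not the Open PHACTS implementation: it assumes the 0..*, 0..1, 0..* multiplicities mean a chain may follow any number of links within one conceptual type, cross to another type at most once, and then continue within that type. The identifiers, the `TYPES` table, the `ALLOWED_CROSSINGS` set and the `co_references` function are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical identifiers mapped to their conceptual type.
TYPES = {
    "gene:G1": "gene",
    "gene:G2": "gene",
    "protein:P1": "protein",
    "protein:P2": "protein",
    "compound:C1": "compound",
}

# Hypothetical asserted links (treated as undirected).
LINKS = [
    ("gene:G1", "gene:G2"),
    ("gene:G2", "protein:P1"),
    ("protein:P1", "protein:P2"),
    ("protein:P2", "compound:C1"),
]

# Assumption: only gene/protein crossings are permitted.
ALLOWED_CROSSINGS = {frozenset({"gene", "protein"})}


def co_references(start, links, types, allowed):
    """Return {identifier: chain of identifiers from start} for every
    identifier reachable under the type-crossing rules. The chain serves
    as a simple provenance record of how the co-reference was derived."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    chains = {}
    seen = {(start, False)}
    queue = deque([(start, False, [start])])
    while queue:
        node, crossed, path = queue.popleft()
        for nxt in neighbours.get(node, []):
            crosses = types[nxt] != types[node]
            if crosses:
                pair = frozenset({types[node], types[nxt]})
                # Reject a second crossing or a forbidden pair of types.
                if crossed or pair not in allowed:
                    continue
            state = (nxt, crossed or crosses)
            if state in seen:
                continue
            seen.add(state)
            new_path = path + [nxt]
            if nxt != start:
                chains.setdefault(nxt, new_path)
            queue.append((nxt, state[1], new_path))
    return chains


from_gene = co_references("gene:G1", LINKS, TYPES, ALLOWED_CROSSINGS)
from_compound = co_references("compound:C1", LINKS, TYPES, ALLOWED_CROSSINGS)
```

In this toy data the gene reaches the other gene and both proteins, while the compound reaches nothing, matching the slide's statement that genes can equal proteins but compounds never equal proteins. A breadth-first search is used here so that each recorded chain is a shortest derivation; the real service may compute and store chains differently.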
