I was recently asked what sort of things I have considered in my experience of performing data integration in pharma for drug discovery. So here are the ten things I thought most important!
Data Integration Score Card
1. The Data Integration Score Card: Context
• I recently gave a talk where I was asked what sort of things I look out for when building a data integration environment within industry
• It was a good question so I thought about it and captured the top 10 in the following slide
• This is specifically aimed at pre-clinical discovery
• And there are a lot more you could add of course, but these are my personal top 10! I’m sure a lot of people will disagree
• There are a lot of references on the last slide
• The views in this slide are entirely my own
• Visit us at http://www.scibite.com for more news on all things data & semantics!
2. Data Integration Score Card
1. Adds Value To Public Data
• Shouldn’t just regurgitate freely available sources of data. Integrating these isn’t that hard to do these days. It should add things that you just can’t get/do elsewhere
2. Unambiguously Identifies Concepts. Eliminates Redundancy
• There should only be one “asthma” in the system
• This includes synonyms (Breast cancer = breast tumor = breast tumour etc)
3. Maps Across Identifiers
• There should be strong connectivity across identifiers for the same concept
4. Supports Ontology-based Query
• “Give me all the inflammatory diseases…”
5. Handles Both Structured & Unstructured Data
• PubMed, clinical trials, OMIM etc. need to be correctly indexed on concept, not synonym (see #2).
6. Integrates Data!
• Sounds crazy but is the data *really* connected or does it just look like it is?
7. Can Connect To Live Data
• A system cannot always include everything. How can live data be connected up?
8. Enterprise Resource Connectors
• Is there a proven path to incorporating these at both a technical level AND a data level (see #9)?
9. Extensible Data Model
• Even RDF systems use a data model. How will this support future use cases, and how can the system cope with concept types it currently does not know about? Note: don’t believe the “it’s RDF so it will magically just cope with it” answer!
10. Supports Manual Curation
• Whatever the data is, on current estimates up to 50% of it is invalid. How does the system handle this and how can users change what they see?
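To make points 2 to 4 of the score card a little more concrete, here is a minimal Python sketch of one concept per disease, identifier mapping, and a hierarchy-based query. It is only an illustrative toy, not how any particular product works: the `Concept` and `ConceptStore` names, the internal IDs (`C1`, `C2`, ...), and the source names and identifiers (`source_a`, `A:002`, ...) are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str                                # hypothetical internal ID
    label: str
    synonyms: set = field(default_factory=set)
    xrefs: dict = field(default_factory=dict)      # source name -> external ID
    parents: set = field(default_factory=set)      # "is-a" links to other concepts


class ConceptStore:
    def __init__(self):
        self.concepts = {}
        self._by_name = {}    # lower-cased label or synonym -> concept ID
        self._by_xref = {}    # (source, external ID) -> concept ID

    def add(self, concept):
        self.concepts[concept.concept_id] = concept
        for name in {concept.label, *concept.synonyms}:
            self._by_name[name.lower()] = concept.concept_id
        for source, ext_id in concept.xrefs.items():
            self._by_xref[(source, ext_id)] = concept.concept_id

    def resolve(self, name):
        """Point 2: every synonym resolves to the same single concept."""
        return self._by_name.get(name.strip().lower())

    def map_identifier(self, source, ext_id, target):
        """Point 3: go from one source's identifier to another's."""
        concept_id = self._by_xref.get((source, ext_id))
        if concept_id is None:
            return None
        return self.concepts[concept_id].xrefs.get(target)

    def descendants(self, concept_id):
        """Point 4: everything below a concept in the hierarchy."""
        found = set()
        frontier = [concept_id]
        while frontier:
            current = frontier.pop()
            for concept in self.concepts.values():
                if current in concept.parents and concept.concept_id not in found:
                    found.add(concept.concept_id)
                    frontier.append(concept.concept_id)
        return found


# Made-up example data; none of these identifiers come from a real source.
store = ConceptStore()
store.add(Concept("C1", "inflammatory disease"))
store.add(Concept("C2", "asthma", parents={"C1"},
                  xrefs={"source_a": "A:002", "source_b": "B:17"}))
store.add(Concept("C3", "breast cancer",
                  synonyms={"breast tumor", "breast tumour"}))
store.add(Concept("C4", "allergic asthma", parents={"C2"}))

print(store.resolve("Breast Tumour"))                          # same concept as "breast cancer"
print(store.map_identifier("source_a", "A:002", "source_b"))   # cross-source mapping
print(store.descendants("C1"))                                 # "all the inflammatory diseases"
```

Indexing documents against the ID returned by `resolve` rather than the matched string is what point 5 means by indexing on concept, not synonym. A real system would of course load its concepts from curated ontologies rather than hand-written entries.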
3. References
TERMite: Turning unstructured text into data https://www.scibite.com/products/termite/
Drug discovery FAQs: workflows for answering multidomain drug discovery questions http://dx.doi.org/10.1016/j.drudis.2014.11.006
Scientific Lenses to Support Multiple Views over Linked Chemistry Data http://rd.springer.com/chapter/10.1007%2F978-3-319-11964-9_7
API-centric Linked Data integration: The Open PHACTS Discovery Platform case study http://dx.doi.org/10.1016/j.websem.2014.03.003
Open PHACTS: semantic interoperability for drug discovery http://dx.doi.org/10.1016/j.drudis.2012.05.016
Systems chemical biology and the Semantic Web: what they mean for the future of drug discovery research http://dx.doi.org/10.1016/j.drudis.2011.12.019
Empowering industrial research with shared biomedical vocabularies http://dx.doi.org/10.1016/j.drudis.2011.09.013
Visualizing the drug target landscape http://dx.doi.org/10.1016/j.drudis.2009.09.011