Data Integration Score Card

The Data Integration Score Card: Context
• I recently gave a talk where I was asked what sort of things I look out for when building a data integration environment within industry
• It was a good question, so I thought about it and captured my top 10 in the following slide
• This is specifically aimed at pre-clinical discovery
• There are a lot more you could add, of course, but these are my personal top 10! I’m sure a lot of people will disagree
• There are a lot of references on the last slide
• The views in this slide are entirely my own 
• Visit us at http://www.scibite.com for more news on all things data & semantics!

Data Integration Score Card
1. Adds Value To Public Data
• Shouldn’t just regurgitate freely available sources of data. Integrating these isn’t that hard to do these days. It should add things that you just can’t get/do elsewhere
2. Unambiguously Identifies Concepts. Eliminates Redundancy
• There should only be one “asthma” in the system
• This includes synonyms (breast cancer = breast tumor = breast tumour, etc.)
3. Maps Across Identifiers
• There should be strong connectivity across identifiers for the same concept
4. Supports Ontology-based Query
• “Give me all the inflammatory diseases…”
5. Handles Both Structured & Unstructured Data
• PubMed, clinical trials, OMIM, etc. need to be correctly indexed on concept, not synonym (see #2).
6. Integrates Data!
• Sounds crazy but is the data *really* connected or does it just look like it is?
7. Can Connect To Live Data
• A system cannot always include everything. How can live data be connected up?
8. Enterprise Resource Connectors
• Is there a proven path to incorporating these at both a technical level AND a data level (see #9)?
9. Extensible Data Model
• Even RDF systems use a data model. How will this support future use cases? How can the system cope with concept types it currently does not know about? Note: don’t believe the “it’s RDF so it will magically just cope with it” answer!
10. Supports Manual Curation
• Whatever the data is, on current estimates up to 50% of it is invalid. How does the system handle this and how can users change what they see?
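
To make items #2 to #4 a bit more concrete, here is a minimal sketch in Python. It is a toy, in-memory illustration under stated assumptions, not how any particular product works: the `ConceptIndex` class, the `C:` concept IDs, the `EXT_A:`/`EXT_B:` external identifiers and the tiny disease hierarchy are all made up for the example.

```python
from collections import defaultdict


class ConceptIndex:
    """Toy in-memory concept index (hypothetical, for illustration only)."""

    def __init__(self):
        self._by_label = {}                 # normalised label -> concept ID
        self._by_xref = {}                  # external identifier -> concept ID
        self._xrefs = defaultdict(set)      # concept ID -> external identifiers
        self._children = defaultdict(set)   # concept ID -> direct child concept IDs

    @staticmethod
    def _norm(label):
        # Match labels regardless of case and extra whitespace.
        return " ".join(label.lower().split())

    def add_concept(self, concept_id, labels, xrefs=(), parents=()):
        for label in labels:
            self._by_label[self._norm(label)] = concept_id
        for xref in xrefs:
            self._by_xref[xref] = concept_id
            self._xrefs[concept_id].add(xref)
        for parent in parents:
            self._children[parent].add(concept_id)

    def resolve(self, term):
        """Return the one concept ID for a label, synonym or external identifier (#2, #3)."""
        if term in self._by_xref:
            return self._by_xref[term]
        return self._by_label.get(self._norm(term))

    def xrefs(self, concept_id):
        """Return all external identifiers mapped to a concept (#3)."""
        return set(self._xrefs.get(concept_id, ()))

    def descendants(self, concept_id):
        """Return every concept below concept_id in the hierarchy (#4)."""
        found, stack = set(), [concept_id]
        while stack:
            for child in self._children.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Made-up identifiers and a made-up two-level hierarchy.
index = ConceptIndex()
index.add_concept("C:0001", ["inflammatory disease"])
index.add_concept("C:0002", ["asthma"],
                  xrefs=["EXT_A:111", "EXT_B:999"], parents=["C:0001"])
index.add_concept("C:0003", ["breast cancer", "breast tumor", "breast tumour"])

print(index.resolve("Breast Tumour"))    # same concept as "breast cancer"
print(index.resolve("EXT_A:111"))        # same concept as "asthma"
print(index.descendants(index.resolve("inflammatory disease")))
```

The point of the sketch is that every lookup, whether by synonym or by external identifier, lands on a single concept ID, and “give me all the inflammatory diseases” becomes a walk over the hierarchy rather than a string search. A real system would need far more than this (ambiguous synonyms, multiple parents, provenance), so treat it as a thinking aid only.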

References
TERMite: Turning unstructured text into data https://www.scibite.com/products/termite/
Drug discovery FAQs: workflows for answering multidomain drug discovery questions http://dx.doi.org/10.1016/j.drudis.2014.11.006
Scientific Lenses to Support Multiple Views over Linked Chemistry Data http://rd.springer.com/chapter/10.1007%2F978-3-319-11964-9_7
API-centric Linked Data integration: The Open PHACTS Discovery Platform case study http://dx.doi.org/10.1016/j.websem.2014.03.003
Open PHACTS: semantic interoperability for drug discovery
Systems chemical biology and the Semantic Web: what they mean for the future of drug discovery research
Empowering industrial research with shared biomedical vocabularies
Visualizing the drug target landscape http://dx.doi.org/10.1016/j.drudis.2009.09.011
