This document discusses data integration challenges in a big data context using the Open PHACTS case study. Open PHACTS aims to integrate multiple biomedical data resources into a single, openly accessible platform. It has developed a cloud-based, production-level system that provides semantic-web-based APIs to access integrated data on diseases, tissues, targets, compounds and pathways. The system addresses issues such as identity resolution, data quality, provenance and licensing to enable complex queries across diverse data sources.
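To make the access route concrete, the sketch below shows how a client might call the platform's API in Python. It assumes the 2.x API base URL at beta.openphacts.org and an app_id/app_key pair issued on registration; treat the endpoint and parameter names as an illustration of the documented pattern, not a definitive reference.

```python
# A sketch of calling the Open PHACTS Linked Data API. The base URL,
# endpoint and parameter names follow the public 2.x API pattern; the
# app_id/app_key values are placeholders issued on registration.
import requests

BASE = "https://beta.openphacts.org/2.1"         # assumed API root
APP_ID, APP_KEY = "your-app-id", "your-app-key"  # placeholders

def compound_info(concept_uri):
    """Fetch the integrated record for a compound identified by URI."""
    resp = requests.get(
        f"{BASE}/compound",
        params={
            "uri": concept_uri,   # e.g. a ConceptWiki or ChEMBL URI
            "app_id": APP_ID,
            "app_key": APP_KEY,
            "_format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```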
Presentation given at the Open PHACTS project symposium.
The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.
Data is being generated all around us – from our smartphones tracking our movement through a city to the city itself sensing various properties and reacting to various conditions. However, to maximise the potential of all this data, it needs to be combined and coerced into models that enable analysis and interpretation. In this talk I will give an overview of the techniques that I have developed for data integration: integrating streams of sensor data with background contextual data and supporting multiple interpretations when linking data together. At the end of the talk I will give an overview of the work I will be conducting in the Administrative Data Research Centre for Scotland.
The Synthetically Accessible Virtual Inventory (SAVI) project is an international collaboration between partners in government laboratories, small companies, not-for-profits, and large corporations to computationally generate a very large number of reliably and inexpensively synthesizable novel screening sample structures. SAVI does not handle reactions by simply applying SMIRKS transforms to a set of building blocks of unknown availability. Instead, it combines a set of transforms richly annotated with chemical context, drawn from, or newly developed in the mold of, the original LHASA project knowledge base, with a set of highly annotated, reliably available, purchasable starting materials. These components are tied together for SAVI product generation using the chemoinformatics toolkit CACTVS, extended with custom developments for this project. Each product is annotated with a number of computed properties seen as important in current drug design, including rules for identifying potentially reactive or promiscuous compounds. Having produced and made publicly available the first (beta) set of 283 million SAVI products annotated with proposed one-step syntheses, we will report on the second full production run, aimed at creating a database of one billion high-quality, easily synthesizable screening samples. We will present the current status and ongoing developments, as well as the scientific and technical challenges of the project.
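For contrast, here is a minimal sketch (Python with RDKit) of the plain "SMIRKS over building blocks" enumeration that the abstract says SAVI goes beyond; the amide-coupling transform and the building blocks are invented for illustration and are not SAVI transforms.

```python
# Naive library enumeration: apply one reaction SMARTS to all pairs of
# building blocks, with no check of availability or chemical context.
from rdkit import Chem
from rdkit.Chem import AllChem

# acid + amine -> amide; atom maps carry atoms into the product
amide = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[NX3;H2,H1:3]>>[C:1](=[O:2])[N:3]"
)

acids = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "OC(=O)c1ccccc1")]
amines = [Chem.MolFromSmiles(s) for s in ("NCC", "NC1CCCCC1")]

products = set()
for acid in acids:
    for amine in amines:
        for prods in amide.RunReactants((acid, amine)):
            products.add(Chem.MolToSmiles(prods[0]))  # canonical SMILES

print(sorted(products))
# SAVI additionally checks chemical context (interfering groups, likely
# yield) and restricts building blocks to reliably purchasable compounds.
```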
CINF 29: Visualization and manipulation of Matched Molecular Series for decis... - NextMove Software
ACS National Meeting Boston Fall 2015
A Matched Molecular Series (MMS) is a set of molecules that differ by substitution at the same scaffold location [1]. For two molecules, this is equivalent to a Matched Molecular Pair.
We present a graphical interface for querying a database of bioactivity or physicochemical property data using a matched series. Using the database, predictions are made with the Matsy method [2], which suggests which R groups will improve the particular property value of interest.
An interesting aspect of our approach is that the interface treats the distinct R groups attached to a particular scaffold as first-class entities that can be manipulated and rearranged to see the effect on the predictions. This makes it easy, for example, to compare predictions based simply on matched-pair information with those based on longer series.
References:
[1] Wawer, M.; Bajorath, J. J. Med. Chem. 2011, 54, 2944.
[2] O’Boyle, N.M.; Boström, J.; Sayle, R.A.; Gill, A. J. Med. Chem. 2014, 57, 2704.
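A simplified sketch of the Matsy-style lookup described above: given a query series of R groups ordered by increasing activity, find database series that show the same ordering and rank the remaining R groups by how often they beat the query's best. The data and scoring details are invented for illustration; see reference [2] for the actual method.

```python
from collections import Counter

# database: each series maps R-group SMILES -> measured activity (pIC50)
database = [
    {"[H]": 5.0, "C": 5.6, "OC": 6.1, "F": 6.8},
    {"C": 4.9, "OC": 5.5, "N": 5.2, "F": 6.0},
    {"[H]": 6.1, "C": 6.5, "OC": 7.0, "Cl": 6.2},
]

def suggest(query_order):
    """query_order: R groups sorted by increasing activity (worst..best)."""
    best = query_order[-1]
    hits, beats = Counter(), Counter()
    for series in database:
        if not all(r in series for r in query_order):
            continue
        acts = [series[r] for r in query_order]
        if acts != sorted(acts):          # ordering must match the query
            continue
        for r, act in series.items():
            if r in query_order:
                continue
            hits[r] += 1
            if act > series[best]:        # does R beat the current best?
                beats[r] += 1
    return sorted(((beats[r] / hits[r], r) for r in hits), reverse=True)

print(suggest(["[H]", "C", "OC"]))  # -> [(1.0, 'F'), (0.0, 'Cl')]
```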
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository, and making it available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data, etc., and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness is lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of well over 20 million chemical substances integrated with over 300 disparate data sources, many of them directly supporting the life sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry, and to provide access to a set of online tools and services supporting access to these data. I will also discuss how ChemSpider is being used to enhance semantic publishing in chemistry at the RSC.
This presentation highlights known challenges in the production of high-quality chemical databases and outlines recent efforts made to address these challenges. Specific examples will be provided illustrating these challenges within the U.S. Environmental Protection Agency (EPA) Computational Toxicology Program. This includes consolidating EPA’s ACToR and DSSTox databases, augmenting computed properties and list search features, and introducing quality metrics to assess confidence in chemical structure assignments across hundreds of thousands of chemical substance records. The past decade has seen enormous investments in the generation and release of data from studies of chemicals and their toxicological effects. There is, however, commonly little concern given to provenance and, more generally, to the quality of the data. The presentation will emphasize the importance of rigorous data review procedures, progress in web-based public access to accurate chemical data sets for use in predictive modeling, and the benefits that these efforts will deliver to toxicologists as they embrace the “Big Data” era.
This abstract does not necessarily represent the views of the U.S. Environmental Protection Agency.
CINF 170: Regioselectivity: An application of expert systems and ontologies t... - NextMove Software
Prediction is much harder than analysis. Consider hurricanes and tornadoes: it is much easier to follow the path of destruction by locating devastated neighborhoods than to forecast the paths of such weather systems in advance. Likewise for many chemical reactions, such as nitration (by refluxing with nitric acid and sulfuric acid): the appearance of one or more nitro groups indicates a nitration reaction, but predicting where on a non-trivial organic molecule this functional group will appear is a much harder challenge. In this sense, reaction analysis is much simpler than (either forward or retrosynthetic) synthesis planning.
NextMove Software's namerxn is an expert system for classifying reactions (from reaction SMILES, MDL connection tables or ChemDraw sketches), typically assigning each reaction instance to a leaf classification in the Royal Society of Chemistry's RXNO ontology. Such tools can be helpful in analysing the regioselectivity preferences of reactions.
This talk consists of two parts: a technical part describing recent algorithmic and methodological improvements to the namerxn software, including some of the more challenging of the 1000+ reaction types it currently identifies; and a scientific part investigating the regioselective preferences of some of these reactions.
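As a toy illustration of why analysis is the easier direction, the sketch below (Python with RDKit, emphatically not namerxn itself) flags a nitration simply by checking whether the product side has gained a nitro group; namerxn's real rule set is far richer.

```python
from rdkit import Chem

NITRO = Chem.MolFromSmarts("[N+](=O)[O-]")  # charged nitro group

def nitro_count(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(NITRO))

def is_nitration(rxn_smiles):
    """Classify by functional-group delta: reactants>agents>products."""
    reactants, _, products = rxn_smiles.split(">")
    gained = sum(map(nitro_count, products.split("."))) \
           - sum(map(nitro_count, reactants.split(".")))
    return gained > 0

# benzene -> nitrobenzene (reagents omitted for brevity)
print(is_nitration("c1ccccc1>>[O-][N+](=O)c1ccccc1"))  # True
```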
BDE SC1 Workshop 3 - Open PHACTS Pilot (Kiera McNeice) - BigData_Europe
Overview of Open PHACTS, the BDE Pilot project in SC1, presented at BDE SC1 Workshop 3, 13 December, 2017.
https://www.big-data-europe.eu/the-final-big-data-europe-workshop/
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
The information revolution has transformed many business sectors over the last decade and the pharmaceutical industry is no exception. Developments in scientific and information technologies have unleashed an avalanche of content on research scientists, who are struggling to access and filter it in an efficient manner. Furthermore, this domain has traditionally suffered from a lack of standards in how entities, processes and experimental results are described, leading to difficulties in determining whether results from two different sources can be reliably compared. The need to transform the way the life-science industry uses information has led to new thinking about how companies should work beyond their firewalls. In this talk we will provide an overview of the traditional approaches major pharmaceutical companies have taken to knowledge management and describe the business reasons why pre-competitive, cross-industry and public-private partnerships have gained much traction in recent years. We will consider the scientific challenges concerning the integration of biomedical knowledge, highlighting the complexities in representing everyday scientific objects in computerised form. This leads us to discuss how the semantic web might provide a long-overdue solution. The talk will be illustrated by focusing on the EU Open PHACTS initiative (openphacts.org), established to provide a unique public-private infrastructure for pharmaceutical discovery. We will describe the aims of this work and how technologies such as just-in-time identity resolution, nanopublication and interactive visualisations are helping to build a powerful software platform designed to appeal directly to scientific users across the public and private sectors.
Researchers at EPA’s National Center for Computational Toxicology integrate advances in biology, chemistry, and computer science to examine the toxicity of chemicals and help prioritize chemicals for further research based on potential human health risks. The goal of this research program is to quickly evaluate thousands of chemicals at much reduced cost and in a shorter time frame relative to traditional approaches. The data generated by the Center include characterization of thousands of chemicals across hundreds of high-throughput screening assays, consumer use and production information, pharmacokinetic properties, literature data, and physical-chemical properties, as well as predictive computational modeling of toxicity and exposure. We have developed a number of databases and applications to deliver the data to the public, academic community, industry stakeholders, and regulators. This presentation will provide an overview of our work to develop an architecture that integrates diverse large-scale data from the chemical and biological domains, our approaches to disseminate these data, and the delivery of models supporting predictive computational toxicology. In particular, this presentation will review our new publicly accessible CompTox Dashboard as the first application built on our newly developed architecture. This abstract does not reflect U.S. EPA policy.
With the unprecedented growth of chemical databases incorporating up to several hundred billion synthetically feasible chemicals, modelers face no shortage of chemicals to process. Importantly, such "Big Chemical Data" offers enormous opportunities for discovering novel bioactive molecules. However, the current generation of cheminformatics software tools is not capable of handling, characterizing, and processing such extremely large chemical libraries. In this presentation, we will discuss the rationale and the main challenges (theoretical and technical) of screening very large repositories of compounds in the current context of drug discovery. We will present several proof-of-concept studies on the screening of extremely large libraries (1+ billion compounds) using our novel GPU-accelerated cheminformatics platform to identify molecules with defined bioactivity. Overall, we will show that GPU computing represents an effective and inexpensive architecture for developing, employing, and validating a new generation of cheminformatics methods and tools ready to process billions of compounds.
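The kernel such platforms accelerate is mostly bulk similarity search. The NumPy sketch below shows the underlying computation, Tanimoto similarity over bit fingerprints, on the CPU; a GPU implementation parallelizes the same popcount arithmetic across thousands of threads. The fingerprints here are random stand-ins for real 1024-bit chemical fingerprints.

```python
import numpy as np

rng = np.random.default_rng(0)
library = rng.integers(0, 2, size=(100_000, 1024), dtype=np.uint8)
query = rng.integers(0, 2, size=1024, dtype=np.uint8)

def tanimoto(query, library):
    common = (library & query).sum(axis=1)     # |A AND B| per molecule
    total = library.sum(axis=1) + query.sum()  # |A| + |B|
    return common / (total - common)           # = |A AND B| / |A OR B|

scores = tanimoto(query, library)
print(scores.max(), int(scores.argmax()))      # best hit and its index
```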
How can you access PubChem programmatically? - Sunghwan Kim
Presented at the 255th American Chemical Society (ACS) National Meeting in New Orleans, LA (March 19, 2018).
Building automated workflows that exploit the vast amount of data contained in PubChem requires programmatic access to the data through application programming interfaces (APIs). PubChem provides several programmatic access routes to its data, including Entrez Utilities (E-Utilities or E-Utils), PubChem Power User Gateway (PUG), PUG-SOAP, PUG-REST, PUG-View, and a REST-ful interface to PubChemRDF. This presentation provides an overview of these programmatic access tools, including recent updates, limitations, usage policies, and best practices.
*References*
(1) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem, Nucleic Acids Research, 2015, 43(W1):W605–W611. https://doi.org/10.1093/nar/gkv396
(2) An update on PUG-REST: RESTful interface for programmatic access to PubChem, Nucleic Acids Research, 2018, 46(W1):gky294. https://doi.org/10.1093/nar/gky294
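As a concrete example of the PUG-REST URL pattern (input domain / identifier / operation / output format), the following Python snippet retrieves computed properties for aspirin; see the references above for the full interface description.

```python
import requests

# compound domain / name namespace / "aspirin" / property operation / JSON
url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin"
       "/property/MolecularFormula,MolecularWeight,CanonicalSMILES/JSON")
data = requests.get(url, timeout=30).json()
print(data["PropertyTable"]["Properties"][0])
# -> {'CID': 2244, 'MolecularFormula': 'C9H8O4', ...}
# Usage policy: no more than 5 requests per second; batch where possible.
```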
Domains such as drug discovery, data science, and policy studies increasingly rely on the combination of complex analysis pipelines with integrated data sources to come to conclusions. A key question then arises: what are these conclusions based upon? Thus, there is a tension between integrating data for analysis and understanding where that data comes from (its provenance). In this talk, I describe recent work that attempts to facilitate transparency by combining provenance tracked within databases with the data integration and analytics pipelines that feed them. I discuss this with respect to use cases from public policy as well as drug discovery.
Given at: http://ccct.uva.nl/content/ccct-seminar-21-february-2014
The EPA CompTox Chemistry Dashboard provides access to data associated with ~760,000 chemical substances. The available data includes experimental and predicted physicochemical properties, environmental fate and transport data, in vivo and in silico toxicity data, in vitro bioassay data, exposure data and a variety of other types of information. The data are under continuous expansion and curation and the experimental data have been used to develop QSAR and QSPR models. A number of these models are available via a web interface so that users can submit a chemical structure and predict properties in real time. The dashboard also provides access to pre-compiled chemical lists and categories, including pesticides, and chemicals detected in the environment via non-targeted mass spectrometry analysis. The data are searchable using chemical identifiers (systematic names, trade names, CAS Registry Numbers), by structure, mass and formula. Batch searches allow for data associated with thousands of chemicals to be obtained in a few seconds, with just a few button clicks, and downloaded to the desktop. This presentation will provide an overview of the Dashboard and its applications to accessing source data associated with agriculturally related chemicals. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
Building an Information Infrastructure to Support Microbial Metagenomic Sciences - Larry Smarr
06.01.14
Presentation for the Microbe Project Interagency Team
Title: Building an Information Infrastructure to Support Microbial Metagenomic Sciences
La Jolla, CA
AI for All: Biology is eating the world & AI is eating Biology - Intel® Software
Advances in cell biology, and the immense amounts of data they create, are converging with advances in machine learning to analyze these data. Biology is experiencing its AI moment, driving the massive computation involved in understanding biological mechanisms and developing interventions. Learn how cutting-edge technologies such as Software Guard Extensions (SGX) in the latest Intel Xeon processors and Open Federated Learning (OpenFL), an open framework for federated learning developed by Intel, are helping advance AI in gene therapy, drug design, disease identification and more.
Using a Jupyter Notebook to perform a reproducible scientific analysis over s... - Alasdair Gray
In recent years there has been a reproducibility crisis in science. Computational notebooks, such as Jupyter, have been touted as one solution to this problem. However, when executing analyses over live SPARQL endpoints, we get different answers depending upon when the analysis in the notebook was executed. In this paper, we identify some of the issues discovered in trying to develop a reproducible analysis over a collection of biomedical data sources and suggest some best practices to overcome these issues.
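A minimal sketch of the problem in Python with SPARQLWrapper: the same cell, run on different days against a live endpoint (UniProt's public endpoint is used here purely for illustration), can return different counts, so recording the execution date alongside the result is one of the practices suggested.

```python
from datetime import date
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setQuery("""
    PREFIX up: <http://purl.uniprot.org/core/>
    SELECT (COUNT(?p) AS ?n) WHERE { ?p a up:Protein }
""")
endpoint.setReturnFormat(JSON)

result = endpoint.query().convert()
count = result["results"]["bindings"][0]["n"]["value"]
# record when the live data was queried, as provenance for the result
print(f"{date.today().isoformat()}: {count} proteins")
```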
Bioschemas Community: Developing profiles over Schema.org to make life scienc... - Alasdair Gray
The Bioschemas community (http://bioschemas.org) is a loose collaboration formed by a wide range of life science resource providers and informaticians. The community is developing profiles over Schema.org to enable life science resources such as data about a specific protein, sample, or training event, to be more discoverable on the web. While the content of well-known resources such as Uniprot (for protein data) are easily discoverable, there is a long tail of specialist resources that would benefit from embedding Schema.org markup in a standardised approach.
The community have developed twelve profiles for specific types of life science resources (http://bioschemas.org/specifications/), with another six at an early draft stage. For each profile, a set of use cases has been identified. These typically focus on search, but several facilitate lightweight data exchange to support data aggregators such as Identifiers.org, FAIRsharing.org, and BioSamples. The next stage of the development of a profile consists of mapping the terms used in the use cases to existing properties in Schema.org and domain ontologies. The properties are then prioritised in order to support the use cases, with a minimal set of about six properties identified, along with a larger set of recommended and optional properties. For each property, an expected cardinality is defined and, where appropriate, object values are specified from controlled vocabularies. Before a profile is finalised, it must first be demonstrated that resources can deploy the markup.
In this talk, we will outline the progress that has been made by the Bioschemas Community in a single year through three hackathon events. We will discuss the processes followed by the Bioschemas Community to foster collaboration, and highlight the benefits and drawbacks of using open Google documents and spreadsheets to support the community develop the profiles. We will conclude by summarising future opportunities and directions for the community.
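For illustration, the snippet below builds the kind of JSON-LD markup a protein page might embed in a script tag; the property selection is a guess at a minimal set, and the Protein type comes from the Bioschemas profile rather than core Schema.org.

```python
import json

markup = {
    "@context": "https://schema.org",
    "@type": "Protein",  # type defined by the Bioschemas Protein profile
    "@id": "https://www.uniprot.org/uniprot/P05067",
    "name": "Amyloid-beta precursor protein",
    "identifier": "P05067",
    "url": "https://www.uniprot.org/uniprot/P05067",
}

# Embedded in a web page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(markup, indent=2))
```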
Similar to Data Integration in a Big Data Context: An Open PHACTS Case Study
An Identifier Scheme for the Digitising Scotland Project - Alasdair Gray
The Digitising Scotland project is having the vital records of Scotland transcribed from images of the original handwritten civil registers. Linking the resulting dataset of 24 million vital records covering the lives of 18 million people is a major challenge requiring improved record linkage techniques. Discussions within the multidisciplinary, widely distributed Digitising Scotland project team have been hampered by the teams in each of the institutions using their own identification schemes. To enable fruitful discussions within the Digitising Scotland team, we required a mechanism for uniquely identifying each individual represented on the certificates. From the identifier it should be possible to determine the type of certificate and the role each person played. We have devised a protocol that generates a unique identifier for any individual on a certificate, without using a computer, by exploiting the National Records of Scotland's registration districts. Importantly, the approach does not rely on the handwritten content of the certificates, which reduces the risk of content being misread and an incorrect identifier produced. The resulting identifier scheme has improved the internal discussions within the project. This paper discusses the rationale behind the chosen identifier scheme and presents the format of the different identifiers. The work reported in the paper was supported by the British ESRC under grants ES/K00574X/1 (Digitising Scotland) and ES/L007487/1 (Administrative Data Research Center - Scotland).
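As an illustration only (the paper defines the actual format), an identifier along the lines described might be composed as follows; the field layout below is an assumption, not the project's scheme.

```python
# Hypothetical identifier composed from registration district, year,
# certificate type, entry number, and the person's role, so it can be
# written down by hand without reading any handwritten certificate content.
def certificate_person_id(district, year, entry, cert_type, role):
    """e.g. cert_type in {'B', 'M', 'D'}; role in {'child', 'mother', ...}."""
    return f"{cert_type}/{district}/{year}/{entry}/{role}"

print(certificate_person_id("644-1", 1881, 123, "B", "mother"))
# -> B/644-1/1881/123/mother
```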
Supporting Dataset Descriptions in the Life Sciences - Alasdair Gray
Machine processable descriptions of datasets can help make data more FAIR; that is Findable, Accessible, Interoperable, and Reusable. However, there are a variety of metadata profiles for describing datasets, some specific to the life sciences and others more generic in their focus. Each profile has its own set of properties and requirements as to which must be provided and which are more optional. Developing a dataset description for a given dataset to conform to a specific metadata profile is a challenging process.
In this talk, I will give an overview of some of the dataset description specifications that are available. I will discuss the difficulties in writing a dataset description that conforms to a profile, and the tooling that I've developed to support dataset publishers in creating metadata descriptions and validating them against a chosen specification.
Seminar talk given at the EBI on 5 April 2017
Tutorial: Describing Datasets with the Health Care and Life Sciences Communit... - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. The goal of this tutorial is to explain elements of the HCLS community profile and to enable users to craft and validate descriptions for datasets of interest.
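A fragment of what such a description might look like, built with Python's rdflib using the Dublin Core, DCAT, and PAV vocabularies the profile draws on; the dataset URIs and literal values below are placeholders, not part of the specification.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

PAV = Namespace("http://purl.org/pav/")

g = Graph()
version = URIRef("http://example.org/dataset/chembl/20")  # placeholder URI
g.add((version, RDF.type, DCAT.Dataset))
g.add((version, DCTERMS.title, Literal("ChEMBL", lang="en")))
g.add((version, DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by-sa/3.0/")))
g.add((version, PAV.version, Literal("20")))            # version-level tier
g.add((version, PAV.previousVersion,
       URIRef("http://example.org/dataset/chembl/19")))

print(g.serialize(format="turtle"))
```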
Validata: A tool for testing profile conformance - Alasdair Gray
Validata (http://hw-swel.github.io/Validata/) is an online web application for validating a dataset description expressed in RDF against a community profile expressed as a Shape Expression (ShEx). Additionally it provides an API for programmatic access to the validator. Validata is capable of being used for multiple community agreed standards, e.g. DCAT, the HCLS community profile, or the Open PHACTS guidelines, and there are currently deployments to support each of these. Validata can be easily repurposed for different deployments by providing it with a new ShEx schema. The Validata code is available from GitHub (https://github.com/HW-SWeL/Validata).
Presentation given at SDSVoc https://www.w3.org/2016/11/sdsvoc
The HCLS Community Profile: Describing Datasets, Versions, and Distributions - Alasdair Gray
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting HCLS community profile covers elements of description, identification, attribution, versioning, provenance, and content summarization. The HCLS community profile reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets.
The goal of this presentation is to give an overview of the HCLS Community Profile and explain how it extends and builds upon other approaches.
Presentation given at SDSVoc (https://www.w3.org/2016/11/sdsvoc/)
This presentation was prepared for my faculty Christmas conference.
Abstract: For the last 11 months I have been working on a top secret project with a world renowned Scandinavian industry partner. We are now moving into the exciting operational phase of this project. I have been granted an early lifting of the embargo that has stopped me talking about this work up until now. I will talk about the data science behind this big data project and how semantic web technology has enabled the delivery of Project X.
Many areas of scientific discovery rely on combining data from multiple data sources. However, there are many challenges in linking data. This presentation highlights these challenges in the context of using Linked Data for environmental and social science databases.
Scientific lenses to support multiple views over linked chemistry data - Alasdair Gray
When are two entries about a small molecule in different datasets the same? If they have the same drug name, chemical structure, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this paper, we present an approach that enables applications to choose the equivalence criteria to apply between datasets, thus supporting multiple dynamic views over the Linked Data. For chemical data, we show that multiple sets of links can be automatically generated according to different equivalence criteria and published with semantic descriptions capturing their context and interpretation. This approach has been applied within a large-scale public-private data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
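A minimal sketch of the lens idea: the same pair of datasets carries several link sets, each generated under a different equivalence criterion, and the application activates one at query time. The link sets and criteria below are invented for illustration.

```python
linksets = {
    # strict: identical InChI (same structure, charge, stereochemistry)
    "structure": {("chembl:123", "drugbank:DB01")},
    # looser: same molecular skeleton, ignoring charge/stereo differences
    "parent-compound": {("chembl:123", "drugbank:DB01"),
                        ("chembl:124", "drugbank:DB01")},
    # loosest: same preferred drug name
    "drug-name": {("chembl:123", "drugbank:DB01"),
                  ("chembl:125", "drugbank:DB01")},
}

def equivalents(uri, lens):
    """All URIs co-referent with `uri` under the chosen lens."""
    pairs = linksets[lens]
    return {b for a, b in pairs if a == uri} | \
           {a for a, b in pairs if b == uri} | {uri}

print(equivalents("drugbank:DB01", "parent-compound"))
```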
Scientific Lenses over Linked Data: An approach to support multiple integrate... - Alasdair Gray
When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.
In this presentation, I will introduce Scientific lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. They have been deployed in the Open PHACTS Discovery Platform – a large scale data integration platform for drug discovery. To cater for different use cases, the platform allows the application of different lenses which vary the equivalence rules to be applied based on the context and interpretation of the links.
Describing Scientific Datasets: The HCLS Community Profile - Alasdair Gray
Big Data presents an exciting opportunity to pursue large-scale analyses over collections of data in order to uncover valuable insights across a myriad of fields and disciplines. Yet, as more and more data is made available, researchers are finding it increasingly difficult to discover and reuse these data. One problem is that data are insufficiently described to understand what they are or how they were produced. A second issue is that no single vocabulary provides all key metadata fields required to support basic scientific use cases. A third issue is that data catalogs and data repositories all use different metadata standards, if they use any standard at all, and this prevents easy search and aggregation of data. Therefore, we need a community profile to indicate what are the essential metadata, and the manner in which we can express it.
The W3C Health Care and Life Sciences Interest Group have developed such a community profile that defines the required properties to provide high-quality dataset descriptions that support finding, understanding, and reusing scientific data, i.e. making the data FAIR (Findable, Accessible, Interoperable and Re-usable – http://datafairport.org). The specification reuses many notions and vocabulary terms from Dublin Core, DCAT and VoID, with provenance and versioning information being provided by PROV-O and PAV. The community profile is based around a three tier model; the summary description captures catalogue style metadata about the dataset, each version of the dataset is described separately as are the various distribution formats of these versions. The resulting community profile is generic and applicable to a wide variety of scientific data.
Tools are being developed to help with the creation and validation of these descriptions. Several datasets including those from Bio2RDF, EBI and IntegBio are already moving to release descriptions conforming to the community profile.
SensorBench is a benchmark suite for wireless sensor networks. The design of wireless sensor network systems sits within a multi-dimensional design space, where it can be difficult to understand the implications of specific decisions and to identify optimal solutions. SensorBench enables the systematic analysis and comparison of different techniques and platforms, enabling both development and user communities to make well informed choices. The benchmark identifies key variables and performance metrics, and specifies experiments that explore how different types of task perform under different metrics for the controlled variables. The benchmark is demonstrated by its application on representative platforms.
Full details of the benchmark are available from http://dl.acm.org/citation.cfm?id=2618252 (DOI: 10.1145/2618243.2618252)
What are the research and technical challenges of linked data that are relevant to data science?
This presentation introduces the ideas of linked data using the BBC sport web site as an example. It then identifies several research challenges that remain to be addressed.
Dataset Descriptions in Open PHACTS and HCLS - Alasdair Gray
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Computing Identity Co-Reference Across Drug Discovery Datasets - Alasdair Gray
This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to compute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts captured in datasets, with each dataset using its own identifiers. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one representation truly the same as another? Finally, different users have their own task- and domain-specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied in the context of their application. We highlight the challenges of automatically computing co-references and the need for capturing the context of the equivalence. This context is then used to control the co-reference computation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications.
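Computing co-reference chains from pairwise links amounts to a transitive closure over the selected link sets, which a union-find pass captures compactly; the context (lens) then controls which link sets are merged. The sketch below is illustrative and not the Identity Management Service's actual implementation.

```python
from itertools import chain

def coreference_chains(linksets):
    """Merge pairwise links from the chosen link sets into chains."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:            # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in chain.from_iterable(linksets):
        parent[find(a)] = find(b)        # union the two chains

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

exact = [("chembl:123", "ops:1"), ("ops:1", "drugbank:DB01")]
by_name = [("drugbank:DB01", "cw:aspirin")]
print(coreference_chains([exact, by_name]))  # one chain of four identifiers
```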
Incorporating Commercial and Private Data into an Open Linked Data Platform f... - Alasdair Gray
The Open PHACTS Discovery Platform aims to provide an integrated information space to advance pharmacological research in the area of drug discovery. Effective drug discovery requires comprehensive data coverage, i.e. integrating all available sources of pharmacology data. While many relevant data sources are available on the linked open data cloud, their content needs to be combined with that of commercial datasets and the licensing of these commercial datasets respected when providing access to the data. Additionally, pharmaceutical companies have built up their own extensive private data collections that they require to be included in their pharmacological dataspace. In this paper we discuss the challenges of incorporating private and commercial data into a linked dataspace: focusing on the modelling of these datasets and their interlinking. We also present the graph-based access control mechanism that ensures commercial and private datasets are only available to authorized users.
http://link.springer.com/chapter/10.1007/978-3-642-41338-4_5
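A sketch of the graph-based access control idea: each dataset is loaded into its own named graph, and a user's SPARQL query is scoped to the graphs their licences cover. The graph URIs and licence grouping below are invented for illustration.

```python
PUBLIC = {"http://example.org/graph/chembl",
          "http://example.org/graph/drugbank"}
# commercial licence holders additionally see the licensed vendor graph
COMMERCIAL = PUBLIC | {"http://example.org/graph/vendor-catalog"}

def scoped_query(select_body, user_graphs):
    """Restrict a SELECT query to the named graphs a user may read."""
    from_clauses = "\n".join(f"FROM <{g}>" for g in sorted(user_graphs))
    return f"SELECT *\n{from_clauses}\nWHERE {{ {select_body} }}"

print(scoped_query("?s ?p ?o", PUBLIC))
```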
Including Co-Referent URIs in a SPARQL Query - Alasdair Gray
Linked data relies on instance level links between potentially differing representations of concepts in multiple datasets. However, in large complex domains, such as pharmacology, the inter-relationship of data instances needs to consider the context (e.g. task, role) of the user and the assumptions they want to apply to the data. Such context is not taken into account in most linked data integration procedures. In this paper we argue that dataset links should be stored in a stand-off fashion, thus enabling different assumptions to be applied to the data links during query execution. We present the infrastructure developed for the Open PHACTS Discovery Platform to enable this and show through evaluation that the incurred performance cost is below the threshold of user perception.
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
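One way to apply stand-off links at query time, sketched below, is to expand the query URI into its co-referent set under the active lens and inject the set with a SPARQL VALUES clause, leaving the stored links untouched; the vocabulary and URIs are placeholders.

```python
def expand_query(uri, equivalents):
    """equivalents: function returning the co-referent URI set for a lens."""
    uris = " ".join(f"<{u}>" for u in sorted(equivalents(uri)))
    return f"""
PREFIX ex: <http://example.org/vocab/>
SELECT ?compound ?assay ?activity WHERE {{
  VALUES ?compound {{ {uris} }}
  ?activity ex:forCompound ?compound ;
            ex:inAssay ?assay .
}}"""

print(expand_query("http://example.org/chembl/123",
                   lambda u: {u, "http://example.org/drugbank/DB01"}))
```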
Data Integration in a Big Data Context: An Open PHACTS Case Study
1. Data Integration in a Big Data Context: Open PHACTS Case Study
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
2. Big Data
@gray_alasdair Big Data Integration 2
Volume, Velocity, Variety, Veracity
http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
3. Open PHACTS Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
Chemical Properties (ChemSpider)
Launched drugs (DrugBank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (UniProt/Entrez etc.)
@gray_alasdair Big Data Integration 3
4. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
@gray_alasdair Big Data Integration 4
8. OPS Discovery Platform
@gray_alasdair Big Data Integration 8
[Diagram: Apps make method calls to a Domain API and receive interactive responses; the Drug Discovery Platform beneath is a production-quality integration platform built on standard web technologies.]
9. App Ecosystem
@gray_alasdair Big Data Integration 9
An “App Store”?
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
https://www.openphacts.org/2/sci/apps.html
13. API Hits
@gray_alasdair Big Data Integration 13
[Chart: No. of hits (millions) per month, Jan 2013 to June 2015. Annotations mark the public launch of the 1.2 API and the subsequent 1.3, 1.4, and 1.5 API releases.]
14. OPS Discovery Platform
@gray_alasdair Big Data Integration 14
[Architecture diagram: Apps call a Linked Data API (RDF/XML, TTL, JSON) exposing domain-specific services. The core platform comprises a semantic workflow engine, an Identity Resolution Service, an Identifier Management Service, chemistry registration with normalisation & Q/C, indexing, and a data cache (Virtuoso triple store). Public content, commercial data, public ontologies, and user annotations are each described by VoID and nanopublication databases. Example co-referent identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
16. Data Licensing (Or Lack Of!)
John Wilbanks consulted for us: a framework built around standard, well-understood Creative Commons licences and how they interoperate.
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
20. Identity Mapping
@gray_alasdair Big Data Integration 20
[Diagram: the same entity carries different identifiers in different sources, e.g. P12047, X31045, GB:29384.]
Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
22. Gleevec®: Imatinib Mesylate
@gray_alasdair Big Data Integration 22
[Diagram: the same drug looked up in ChemSpider, Drugbank, and PubChem; records labelled “Imatinib” / “Imatinib Mesylate”, with InChIKey YLMAHDNUQAMNNX-UHFFFAOYSA-N.]
Are these records the same?
It depends upon your task!
23. Structure Lens
@gray_alasdair Big Data Integration 23
[Diagram: lenses sit on a spectrum from strict (analysing) to relaxed (browsing); the Structure Lens is at the strict end, following only skos:exactMatch links justified by InChI.]
“I need to perform an analysis, give me details of the active compound in Gleevec.”
24. Name Lens
@gray_alasdair Big Data Integration 24
[Diagram: the Name Lens sits towards the relaxed end, additionally following skos:closeMatch links justified by drug name alongside skos:exactMatch (InChI).]
“Which targets are known to interact with Gleevec?”
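In query terms, a lens determines which link predicates are followed when expanding an identifier. A rough sketch with invented example.org URIs (two alternative queries, not the platform's actual API): the strict Structure Lens traverses only skos:exactMatch links, while the relaxed Name Lens also traverses skos:closeMatch.

# Structure Lens (strict): follow only InChI-justified exactMatch links.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?record WHERE {
  <http://example.org/drug/gleevec> skos:exactMatch+ ?record .
}

# Name Lens (relaxed): additionally follow drug-name closeMatch links.
SELECT ?record WHERE {
  <http://example.org/drug/gleevec> (skos:exactMatch|skos:closeMatch)+ ?record .
}

Because the links are stored stand-off, swapping lenses only changes the property path; neither the cached data nor the rest of the query is touched.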
29. Open PHACTS Approach
1. Know your audience: web developers
2. Understand your use cases: prioritised business questions
3. Identify access pathways: identify data, identify connections, implement API
@gray_alasdair Big Data Integration 31
30. Questions
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts
@gray_alasdair Big Data Integration 32
Editor's Notes
Deriving value from the data
Volume: More data than you can process – relative term; complexity of processing
Velocity: Data constantly being generated
Variety: Multiple sources, formats, models
Veracity: Accuracy of the data
Open PHACTS has not dealt with Velocity, although it is a challenge for us
1 of 83 business driver questions
Took a team of 5 experienced researchers 6 hours to manually gather the answer
At the start of the project the question could not be answered by a computer system
Six months in, the prototype answered it in 30 seconds
Now subsecond
Pharma companies are all accessing, processing, storing & re-processing external research data; a big waste of resources
No competitive advantage
OPS: 29 partners including many major pharma
83 questions ranked and top 20 taken as target
18 of top 20
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Not just in-house apps
Actively being used for different purposes
Public launch April 2013
Averaging 20 million hits a month from the start of 2015
38 million in the last 30 days
Heavy usage from pharma, academia, and biotech
500+ registered users
Import data into cache
Integration approach
Data kept in original model but cached centrally
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
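As a hedged illustration of that expansion step (all URIs invented): the API's SPARQL query is phrased against a single compound URI, and the IMS rewrites it to range over the co-referent URIs from the other cached datasets.

# After IMS expansion, the single compound URI becomes a VALUES block
# listing its co-referent URIs (hypothetical examples below).
SELECT ?property ?value WHERE {
  VALUES ?compound {
    <http://example.org/chemspider/CS4532>
    <http://example.org/drugbank/DB00619>
    <http://example.org/pubchem/CID123>
  }
  ?compound ?property ?value .
}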
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 3 billion triples – 12 datasets
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
Interactions needed to satisfy use cases
Gradually added additional types of data and interactions
No standard units
Even in curated sources!
Feedback issues to data providers
Validation & Standardization Platform
Developed by Royal Society of Chemistry
http://bit.ly/NZF5VB
Example drug: Gleevec, a cancer drug for leukemia
Looking it up in three popular public chemical databases gives different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Interested in the physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into linksets, each with a VoID header providing provenance and a justification for the links.
Open for anybody
API grouped into theme areas
Two-phase interaction:
Resolve thing to identifier
Retrieve data about the identifier
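In SPARQL terms the two phases might look like this (a sketch only; the real API wraps these behind separate method calls, and the URIs and labels are invented):

# Phase 1: resolve a thing (here, a name) to an identifier.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept WHERE {
  ?concept skos:prefLabel "Gleevec"@en .
}

# Phase 2: retrieve data about the resolved identifier
# (issued as a second, separate query).
DESCRIBE <http://example.org/drug/gleevec>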