The Data Today
Alasdair Gray
Heriot-Watt University, Edinburgh, UK
A.J.G.Gray@hw.ac.uk
@gray_alasdair
@gray_alasdair Big Data Integration 2
Dataset             Downloaded   Version     Licence         Triples
------------------  -----------  ----------  --------------  -----------
Bio Assay Ontology                           CC-By                10,360
CALOHA              8 Apr 2015   2014-01-22  CC-By-ND             14,552
ChEBI               4 Mar 2015   125         CC-By-SA          1,012,056
ChEMBL              18 Feb 2015  20.0        CC-By-SA        445,732,880
ConceptWiki         12 Dec 2013              CC-By-SA          4,331,760
DisGeNET            31 Mar 2015  2.1.0       ODbL             15,011,136
Disease Ontology                 2015-05-21  CC-By               188,062
DrugBank            19 Feb 2015  4.1         Non-commercial    4,028,767
ENZYME                           2015_11     CC-By-ND             61,467
FDA Adverse Events  9 Jul 2012               CC0              13,557,070
Total: ~3 Billion triples
Dataset                    Downloaded   Version   Licence   Triples
-------------------------  -----------  --------  --------  -------------
Gene Ontology              4 Mar 2015             CC-By         1,366,494
Gene Ontology Annotations  17 Feb 2015            CC-By       879,448,347
NCATS OPDDR                Nov 2015     Oct 2015                    2,643
neXTProt (NP)              1 Feb 2014   1.0       CC-By-ND    215,006,108
OPS Chemical Registry      4 Nov 2014             CC-By-SA    241,986,722
HMDB                                    3.6       HMDB
MeSH                                    2015      MeSH
PDB Ligands                             2         PDB
OPS Metadata                                      CC-By-SA          2,053
UniProt                                 2015_11   CC-By-ND  1,131,186,434
WikiPathways                            20151118  CC-By        11,781,627
Total: ~3 Billion triples
John Wilbanks consulted for us
A framework built around STANDARD, well-understood Creative Commons licences – and how they interoperate
Deal with the problems by:
• Interoperable licences
• Appropriate terms
• Declare expectations to users and data publishers
One size won't fit all requirements
Data Licensing (Or Lack Of!)
Data types: Disease, Tissue, Target, Compound, Pathway
STANDARD_TYPE UNIT_COUNT
---------------- -------
AC50 7
Activity 421
EC50 39
IC50 46
ID50 42
Ki 23
Log IC50 4
Log Ki 7
Potency 11
log IC50 0
STANDARD_TYPE  STANDARD_UNITS  COUNT(*)
-------------  --------------  --------
IC50           nM                829448
IC50           ug.mL-1            41000
IC50                              38521
IC50           ug/ml               2038
IC50           ug ml-1              509
IC50           mg kg-1              295
IC50           molar ratio          178
IC50           ug                   117
IC50           %                    113
IC50           uM well-1             52
~ 100 units
>5000 types
Implemented using the Quantities, Units, Dimensions and Types Ontology (QUDT, http://www.qudt.org/)
Quantitative Data Challenges
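The IC50 rows above show the same unit written several ways (ug.mL-1, ug/ml, ug ml-1). As a rough illustration of why normalisation matters, the sketch below merges those spellings using a small hand-written synonym table. The table and the function names are assumptions made for this example; it does not use QUDT and is not the platform's actual implementation.

```python
from collections import Counter

# Unit spellings and counts copied from four of the IC50 rows above.
IC50_COUNTS = {
    "nM": 829448,
    "ug.mL-1": 41000,
    "ug/ml": 2038,
    "ug ml-1": 509,
}

# Assumed synonym table (hypothetical): variant spelling -> canonical spelling.
SYNONYMS = {
    "ug/ml": "ug.mL-1",
    "ug ml-1": "ug.mL-1",
}


def normalise_unit(raw):
    """Return the canonical spelling for a unit string, if one is known."""
    unit = raw.strip()
    return SYNONYMS.get(unit, unit)


def merge_counts(counts):
    """Sum the counts of all spellings that normalise to the same unit."""
    merged = Counter()
    for unit, n in counts.items():
        merged[normalise_unit(unit)] += n
    return dict(merged)


MERGED = merge_counts(IC50_COUNTS)
```

With these inputs the three microgram-per-millilitre spellings collapse into a single entry, while nM is left untouched. A lookup table is used rather than lower-casing or regex rewriting because case is significant in unit symbols (nM is not nm).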
Quality Assurance
ops:OPS437281
✔
ops:OPS380297 ops:OPS380292
is_stereoisomer_of
[ci:CHEMINF_000461]
has_stereoundefined_parent
[ci:CHEMINF_000456]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
Chemical Registration Service Data
Mappings (raw): 25,087,328
Mappings (computed): 200,000,000+
P12047
X31045
GB:29384
Andy Law's Third Law
“The number of unique identifiers assigned to an individual is
never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
DrugBank ChemSpider PubChem
Mesylate Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
I need to perform an analysis, give me
details of the active compound in Gleevec.
skos:closeMatch
(Drug Name)
skos:closeMatch
(Drug Name)
skos:exactMatch
(InChI)
Strict Relaxed
Analysing Browsing
Which targets are known to interact
with Gleevec?
A lens defines a conceptual view over the data
Specifies operational equivalence conditions
Consists of:
Identifier (URI)
Title
(dct:title)
Description
(dct:description)
Documentation link
(dcat:landingPage)
Creator
(pav:createdBy)
Timestamp
(pav:createdOn)
Equivalence rules
(bdb:linksetJustification)
Scientific Lens
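A minimal sketch of how a lens might select which links count as equivalences, based on the strict (InChI only) and relaxed (InChI plus drug name) views on the slides. The class names, field names, and example identifiers are all hypothetical; only the metadata fields (title, description) and the idea of filtering by link justification come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    justification: str  # the reason the link was asserted


@dataclass(frozen=True)
class Lens:
    uri: str
    title: str
    description: str
    allowed_justifications: frozenset

    def apply(self, links):
        """Keep only the links whose justification this lens accepts."""
        return [l for l in links if l.justification in self.allowed_justifications]


# Placeholder identifiers, not real database URIs.
LINKS = [
    Link("ex:recordA", "ex:recordB", "skos:exactMatch", "InChI"),
    Link("ex:recordB", "ex:recordC", "skos:closeMatch", "Drug Name"),
]

STRICT = Lens(
    uri="ex:lens/structure",
    title="Structure lens",
    description="Equivalence by chemical structure only",
    allowed_justifications=frozenset({"InChI"}),
)

RELAXED = Lens(
    uri="ex:lens/name",
    title="Name lens",
    description="Equivalence by structure or by drug name",
    allowed_justifications=frozenset({"InChI", "Drug Name"}),
)

strict_links = STRICT.apply(LINKS)
relaxed_links = RELAXED.apply(LINKS)
```

Under the strict lens only the InChI-justified link survives, which suits analysis; under the relaxed lens the drug-name link is also followed, which suits browsing.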
Lenses
34 in total
7 Public
25 Chemistry
2 Gene
Data Governance
Contribution must not be underestimated!!!
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
www.macs.hw.ac.uk/~ajg33/
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts

Open PHACTS: The Data Today

Editor's Notes

  • #3 Data provided by many publishers: some cover other sets, e.g. ChemSpider. Originally in many formats: relational, SD files and RDF. Worked closely with publishers, getting them to publish: raw RDF; metadata descriptions of their data; links between their data and others.
  • #4 ~3 billion triples. 42GB gzip nquads, 400GB uncompressed.
  • #5 Getting this information is still hard and manual! ~3 billion triples. 42GB gzip nquads, 400GB uncompressed.
  • #6 ~3 billion triples. 42GB gzip nquads, 400GB uncompressed.
  • #8 API: Complex data interactions/relationships. Interactions needed to satisfy use cases. Gradually added additional types of data and interactions.
  • #9 Quantitative Data Challenges: No standard units, even in curated sources! Feed back issues to data providers.
  • #10 Quality Assurance: Validation & Standardization Platform, developed by the Royal Society of Chemistry. http://bit.ly/NZF5VB
  • #11 CRS dataset generation. Validate structure: source data is messy! Identify common problems: charge imbalance, stereochemistry. Compute physicochemical properties. Identify related properties based on structure. 17 relationship types.
  • #12 230MB gzipped nquads, 2 GB uncompressed. 238 mapping sets, 43 data sources, 11 predicates.
  • #14 Identity Mapping
  • #15 Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases: different results. Chemistry is complicated, often simplified for convenience. Data is messy! Are these records the same? It depends on what you are doing with the data! Each captures a subtly different view of the world.
  • #16 Structure Lens: interested in physicochemical properties of Gleevec.
  • #17 Name Lens: interested in biomedical and pharmacological properties. sameAs != sameAs; it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into Linksets, which have a VoID header providing provenance and justification for the link.
  • #18 A lens enables certain relationships and disables others. Alters links between the data.
  • #21 Builds on OPS document: checklist and guidance notes! Covers a wider range of use cases. Large community buy-in, including EBI.
  • #22 Builds on OPS document: checklist and guidance notes! Covers a wider range of use cases. Large community buy-in, including EBI.
  • #23 Verifying data. Verifying linkages. Investigating unexpected answers. Not to be