Open PHACTS: The Data Today

Presentation given at the Open PHACTS project symposium.

The slides give an overview of the data in the 2.0 Open PHACTS drug discovery platform and the challenges that have been faced in the Open PHACTS project to reach this stage.

  1. The Data Today. Alasdair Gray, Heriot-Watt University, Edinburgh, UK. A.J.G.Gray@hw.ac.uk, @gray_alasdair
  2. Big Data Integration
  3. Datasets ("-" marks a cell that is blank on the slide):

     Dataset             Downloaded   Version     Licence         Triples
     ------------------  -----------  ----------  --------------  -----------
     Bio Assay Ontology  -            -           CC-By           10,360
     CALOHA              8 Apr 2015   2014-01-22  CC-By-ND        14,552
     ChEBI               4 Mar 2015   125         CC-By-SA        1,012,056
     ChEMBL              18 Feb 2015  20.0        CC-By-SA        445,732,880
     ConceptWiki         12 Dec 2013  -           CC-By-SA        4,331,760
     DisGeNET            31 Mar 2015  2.1.0       ODbL            15,011,136
     Disease Ontology    -            2015-05-21  CC-By           188,062
     DrugBank            19 Feb 2015  4.1         Non-commercial  4,028,767
     ENZYME              -            2015_11     CC-By-ND        61,467
     FDA Adverse Events  9 Jul 2012   -           CC0             13,557,070

     Total: ~3 billion triples
  4. Datasets, continued ("-" marks a cell that is blank on the slide):

     Dataset                    Downloaded   Version     Licence         Triples
     -------------------------  -----------  ----------  --------------  -------------
     Gene Ontology              4 Mar 2015   -           CC-By           1,366,494
     Gene Ontology Annotations  17 Feb 2015  -           CC-By           879,448,347
     NCATS OPDDR                Nov 2015     Oct 2015    -               2,643
     neXTProt (NP)              1 Feb 2014   1.0         CC-By-ND        215,006,108
     OPS Chemical Registry      4 Nov 2014   -           CC-By-SA        241,986,722
     HMDB                       -            3.6         HMDB            -
     MeSH                       -            2015        MeSH            -
     PDB Ligands                -            2           PDB             -
     OPS Metadata               -            -           CC-By-SA        2,053
     UniProt                    -            2015_11     CC-By-ND        1,131,186,434
     WikiPathways               -            20151118    CC-By           11,781,627

     Total: ~3 billion triples
  5. Data Licensing (Or Lack Of!)
     John Wilbanks consulted for us
     A framework built around STANDARD well-understood Creative Commons licences – and how they interoperate
     Deal with the problems by:
       Interoperable licences
       Appropriate terms
       Declare expectations to users and data publishers
     One size won't fit all requirements
  6. Disease, Tissue, Target, Compound, Pathway
  7. Quantitative Data Challenges

     STANDARD_TYPE  UNIT_COUNT
     -------------  ----------
     AC50           7
     Activity       421
     EC50           39
     IC50           46
     ID50           42
     Ki             23
     Log IC50       4
     Log Ki         7
     Potency        11
     log IC50       0

     STANDARD_TYPE  STANDARD_UNITS  COUNT(*)
     -------------  --------------  --------
     IC50           nM              829448
     IC50           ug.mL-1         41000
     IC50                           38521
     IC50           ug/ml           2038
     IC50           ug ml-1         509
     IC50           mg kg-1         295
     IC50           molar ratio     178
     IC50           ug              117
     IC50           %               113
     IC50           uM well-1       52

     ~100 units, >5000 types
     Implemented using the Quantities, Units, Dimension, Types Ontology (http://www.qudt.org/)
  8. Quality Assurance
  9. Chemical Registration Service Data
     Example compounds: ops:OPS437281, ops:OPS380297, ops:OPS380292
     Relationships shown: is_stereoisomer_of [ci:CHEMINF_000461], has_stereoundefined_parent [ci:CHEMINF_000456]
     Other relationships: has part, is tautomer of, uncharged counterpart, isotope, …
  10. Mappings: Raw. Mappings (Raw): 25,087,328
  11. Mappings: Computed. Mappings (Comp): 200,000,000+
  12. P12047, X31045, GB:29384
      Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” (http://bioinformatics.roslin.ac.uk/lawslaws/)
  13. DrugBank, ChemSpider, PubChem
      Mesylate, Imatinib Mesylate, YLMAHDNUQAMNNX-UHFFFAOYSA-N
      Are these records the same? It depends upon your task!
  14. “I need to perform an analysis, give me details of the active compound in Gleevec.”
      Scale: Strict (Analysing) to Relaxed (Browsing)
      Links: skos:exactMatch (InChI)
  15. “Which targets are known to interact with Gleevec?”
      Scale: Strict (Analysing) to Relaxed (Browsing)
      Links: skos:closeMatch (Drug Name), skos:closeMatch (Drug Name), skos:exactMatch (InChI)
  16. Scientific Lens
      A lens defines a conceptual view over the data
      Specifies operational equivalence conditions
      Consists of:
        Identifier (URI)
        Title (dct:title)
        Description (dct:description)
        Documentation link (dcat:landingPage)
        Creator (pav:createdBy)
        Timestamp (pav:createdOn)
        Equivalence rules (bdb:linksetJustification)
      Lenses: 34 in total (7 Public, 25 Chemistry, 2 Gene)
  17. Data Governance: Contribution must not be underestimated!!!
  18. Alasdair J G Gray: A.J.G.Gray@hw.ac.uk, www.macs.hw.ac.uk/~ajg33/, @gray_alasdair
      Open PHACTS: contact@openphacts.org, openphacts.org, @open_phacts
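
The unit problem on slide 7 can be illustrated with a minimal Python sketch. It is not the Open PHACTS or QUDT implementation: the alias table, the function names, and the choice of nM as the target unit are assumptions made for illustration, and only the unit spellings are taken from the slide. Mass-per-volume values can only be converted when a molecular weight is supplied; units of a different dimension, such as mg kg-1, are left unconverted.

```python
from typing import Optional

# Spelling variants from slide 7 that denote the same unit, keyed by a
# lower-cased, whitespace-collapsed form of the raw string (illustrative table).
UNIT_ALIASES = {
    "nm": "nM",
    "um": "uM",
    "ug.ml-1": "ug/mL",
    "ug/ml": "ug/mL",
    "ug ml-1": "ug/mL",
}

# Multipliers from molar concentration units to nanomolar.
MOLAR_TO_NM = {"nM": 1.0, "uM": 1e3}


def canonical_unit(raw: str) -> Optional[str]:
    """Map a raw unit string to a canonical label, or None if unknown."""
    key = " ".join(raw.strip().lower().split())
    return UNIT_ALIASES.get(key)


def to_nanomolar(value: float, raw_unit: str,
                 mol_weight: Optional[float] = None) -> Optional[float]:
    """Convert a concentration to nM, or return None if not convertible.

    mol_weight is in g/mol and is only needed for mass-per-volume units.
    """
    unit = canonical_unit(raw_unit)
    if unit in MOLAR_TO_NM:
        return value * MOLAR_TO_NM[unit]
    if unit == "ug/mL" and mol_weight:
        # 1 ug/mL = 1e-3 g/L, so mol/L = 1e-3 * value / MW and nM = 1e9 * mol/L.
        return value * 1e6 / mol_weight
    return None
```

Under these assumptions, to_nanomolar(1.0, "ug/ml", mol_weight=500.0) gives 2000.0, while to_nanomolar(5.0, "mg kg-1") gives None because a dose per body mass is not a concentration.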

Editor's Notes

  • Data provided by many publishers: some cover other sets, e.g. ChemSpider
    Originally in many formats: relational, SD files and RDF
    Worked closely with publishers, getting them to publish:
      Raw RDF
      Metadata descriptions of their data
      Links between their data and others
  • ~3 billion triples
    42GB gzip nquads
    400GB uncompressed
  • Getting this information is still hard and manual!
    ~3 billion triples
    42GB gzip nquads
    400GB uncompressed
  • API: Complex data interactions/relationships
    Interactions needed to satisfy use cases
    Gradually added additional types of data and interactions
  • Quantitative Data Challenges
    No standard units
    Even in curated sources!

    Feedback issues to data providers
  • Quality Assurance
    Validation & Standardization Platform
    Developed by Royal Society of Chemistry
    http://bit.ly/NZF5VB
  • CRS Dataset Generation
    Validate structure: Source data is messy!
    Identify common problems:
      Charge imbalance
      Stereochemistry
    Compute physicochemical properties
    Identify related properties based on structure
    17 relationship types
  • 230MB gzipped nquads
    2 GB uncompressed

    238 Mapping sets
    43 data sources
    11 predicates
  • Identity Mapping
  • Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases gives different results

    Chemistry is complicated, often simplified for convenience
    Data is messy!

    Are these records the same? It depends on what you are doing with the data!
    Each captures a subtly different view of the world

  • Structure Lens
    Interested in physicochemical properties of Gleevec
  • Name Lens
    Interested in biomedical and pharmacological properties

    sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets, which have a VoID header providing provenance and justification for the links.
  • Lens enables certain relationships and disables others
    Alters links between the data
  • Builds on OPS document: Checklist and guidance notes!

    Covers a wider range of use cases
    Large community buy-in, including EBI

  • Verifying data
    Verifying linkages
    Investigating unexpected answers
    Not to be
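
The notes on links and lenses can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Open PHACTS implementation: the Link and Lens classes, the equivalents function, the justification labels "inchi" and "drug-name", the ex: identifiers, and the transitive following of links are all hypothetical. Only the link fields (source, target, predicate, reason) and the idea that a lens enables some relationships and disables others come from the notes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    justification: str  # the "reason" the link is asserted


@dataclass(frozen=True)
class Lens:
    title: str
    enabled_justifications: FrozenSet[str]


def equivalents(uri: str, links: Iterable[Link], lens: Lens) -> Set[str]:
    """Return every identifier reachable from uri over links the lens enables."""
    active = [l for l in links if l.justification in lens.enabled_justifications]
    seen = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for link in active:
            # Treat links as symmetric: follow them in either direction.
            if link.source == current:
                other = link.target
            elif link.target == current:
                other = link.source
            else:
                continue
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen


# Made-up records standing in for three database entries.
LINKS = [
    Link("ex:A", "ex:B", "skos:exactMatch", "inchi"),
    Link("ex:B", "ex:C", "skos:closeMatch", "drug-name"),
]
STRICT = Lens("Structure lens", frozenset({"inchi"}))
RELAXED = Lens("Name lens", frozenset({"inchi", "drug-name"}))
```

With this toy data, equivalents("ex:A", LINKS, STRICT) returns ex:A and ex:B only, while the relaxed lens also reaches ex:C. The same stored links answer both questions; only the lens changes.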
