Data Integration in a Big Data Context: An Open PHACTS Case Study

Keynote presentation at the EU Ambient Assisted Living Forum workshop The Crusade for Big Data in the AAL Domain.

The presentation explores the Open PHACTS project and how it overcame various Big Data challenges.

  1. 1. Data Integration in a Big Data Context Open PHACTS Case Study Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair
  2. 2. Big Data @gray_alasdair Big Data Integration 2 Volume Velocity Variety Veracity http://i.kinja-img.com/gawker-media/image/upload/lvzm0afp8kik5dctxiya.jpg
  3. 3. Open PHACTS Use Case: “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”. Data needed: Chemical Properties (Chemspider); Launched drugs (Drugbank); Human => Mouse (Homologene); Protein Families (Enzyme); Bioactivity Data (ChEMBL); … other info (Uniprot/Entrez etc.)
  4. 4. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
  5. 5. Pre-competitive Data. Diagram labels: Literature; PubChem; Genbank; Patents; Databases; Downloads; Data Integration; Data Analysis; Firewalled Databases; Repeat @ each company. A single, shared solution. Funded under IMI: 2011-14; ENSO: 2014-16
  6. 6. Discovery Platform: Cloud-Based “Production” Level System; Secure & Private; Guided By Business Questions; Uses Semantic Web Technology; Provides REST-ful API. http://dx.doi.org/10.1016/j.websem.2014.03.003 http://dx.doi.org/10.1016/j.drudis.2013.05.008
  7. 7. Scientific Results http://ceur-ws.org/Vol-1114/Demo_Dunlop.pdf http://dx.doi.org/10.1016/j.drudis.2014.11.006 http://dx.doi.org/10.1002/minf.v31.8 http://dx.doi.org/10.1371/journal.pone.0115460
  8. 8. OPS Discovery Platform: Drug Discovery Platform Apps; Domain API; Interactive responses; Production quality integration platform; Method Calls; Standard Web Technologies
  9. 9. App Ecosystem: An “App Store”? Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium MOE Collector Cytophacts Utopia Garfield SciBite KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna https://www.openphacts.org/2/sci/apps.html
  10. 10. ChemBio Navigator http://chembionavigator.com
  11. 11. (no slide text)
  12. 12. (no slide text)
  13. 13. API Hits. [Chart: No of Hits (Millions, axis 0 to 60) by Month, Jan 2013 to June 2015. Annotations: Public launch of 1.2 API; 1.3 API; 1.4 API; 1.5 API]
  14. 14. OPS Discovery Platform. [Architecture diagram labels: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing; Core Platform; Nanopub, Db and VoID inputs; Public Content; Commercial; Public Ontologies; User Annotations. Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”]
  15. 15. Open PHACTS Data
  16. 16. Data Licensing (Or Lack Of!) John Wilbanks consulted for us. A framework built around STANDARD well-understood Creative Commons licences, and how they interoperate. Deal with the problems by: Interoperable licences; Appropriate terms; Declare expectations to users and data publishers. One size won't fit all requirements
  17. 17. API: Complex Interactions: Disease; Tissue; Target; Compound; Pathway
  18. 18. Quantitative Data Challenges
      STANDARD_TYPE  UNIT_COUNT
      -------------  ----------
      AC50                    7
      Activity              421
      EC50                   39
      IC50                   46
      ID50                   42
      Ki                     23
      Log IC50                4
      Log Ki                  7
      Potency                11
      log IC50                0

      STANDARD_TYPE  STANDARD_UNITS  COUNT(*)
      -------------  --------------  --------
      IC50           nM                829448
      IC50           ug.mL-1            41000
      IC50                              38521
      IC50           ug/ml               2038
      IC50           ug ml-1              509
      IC50           mg kg-1              295
      IC50           molar ratio          178
      IC50           ug                   117
      IC50           %                    113
      IC50           uM well-1             52

      ~ 100 units, >5000 types
      Implemented using the Quantities, Units, Dimension, Types Ontology (http://www.qudt.org/)
  19. 19. Quality Assurance
  20. 20. Identity Mapping: P12047; X31045; GB:29384. Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/
  21. 21. Gleevec®: Imatinib Mesylate. Drugbank ChemSpider PubChem Imatinib Mesylate Imatinib Mesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N
  22. 22. Gleevec®: Imatinib Mesylate. Drugbank ChemSpider PubChem Imatinib Mesylate Imatinib Mesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N Are these records the same? It depends upon your task!
  23. 23. Structure Lens: “I need to perform an analysis, give me details of the active compound in Gleevec.” Links: skos:exactMatch (InChI). Scale labels: Strict / Relaxed; Analysing / Browsing
  24. 24. Name Lens: “Which targets are known to interact with Gleevec?” Links: skos:closeMatch (Drug Name); skos:closeMatch (Drug Name); skos:exactMatch (InChI). Scale labels: Strict / Relaxed; Analysing / Browsing
  25. 25. Data Provenance
  26. 26. Data Provenance
  27. 27. dev.openphacts.org
  28. 28. Open PHACTS Approach: (1) Know your audience: Web developers. (2) Understand your use cases: Prioritised business questions. (3) Identify access pathways. Identify data. Identify connections. Implement API
  29. 29. Questions. Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair. Open PHACTS contact@openphacts.org openphacts.org @open_phacts
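
The unit problem on the Quantitative Data Challenges slide can be illustrated with a minimal Python sketch. It only merges spelling variants of the same unit; the lookup table, the function names and the "UNRESOLVED" label are assumptions made for illustration. The slide says the platform itself relies on the QUDT ontology, which this sketch does not use. Converting between mass and molar concentrations would additionally need each compound's molecular weight and is not attempted.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical lookup from spelling variants to one canonical label.
# Case is kept as-is because it matters in unit symbols (mM vs MM).
CANONICAL_UNITS = {
    "nM": "nM",
    "ug/mL": "ug/mL",
    "ug/ml": "ug/mL",
    "mg/kg": "mg/kg",
}

def canonical_unit(raw: str) -> Optional[str]:
    """Return the canonical label for a unit string, or None if unknown."""
    text = raw.strip()
    # Rewrite a trailing "<sep>unit-1" (e.g. "ug.mL-1", "ug ml-1") as "/unit".
    text = re.sub(r"[ .]([A-Za-z]+)-1$", r"/\1", text)
    return CANONICAL_UNITS.get(text)

# (STANDARD_UNITS, COUNT) rows for IC50, copied from the slide.
IC50_ROWS = [
    ("nM", 829448),
    ("ug.mL-1", 41000),
    ("", 38521),
    ("ug/ml", 2038),
    ("ug ml-1", 509),
    ("mg kg-1", 295),
]

def merge_counts(rows):
    """Sum record counts per canonical unit; unknown units are kept visible."""
    merged = Counter()
    for unit, count in rows:
        merged[canonical_unit(unit) or "UNRESOLVED"] += count
    return merged
```

Calling merge_counts(IC50_ROWS) folds the three ug/mL spellings into a single entry of 43547 records, while the 38521 records with no unit stay under "UNRESOLVED" so they can be fed back to the data provider rather than silently dropped.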

Editor's Notes

  • Deriving value from the data

    Volume: More data than you can process (a relative term); complexity of processing
    Velocity: Data constantly being generated
    Variety: Multiple sources, formats, models
    Veracity: Accuracy of the data

    Open PHACTS has not dealt with Velocity, although it is a challenge for us
  • 1 of 83 business driver questions
    Took a team of 5 experienced researchers 6 hours to manually gather the answer
    At the start of the project it couldn’t be answered by a computer system
    6 months in: 30s with the prototype
    Now: subsecond
  • Pharma are all accessing, processing, storing & re-processing external research data
    Big waste of resources
    No competitive advantage

    OPS: 29 partners including many major pharma
  • 83 questions ranked and top 20 taken as target
  • 18 of top 20
  • A platform for integrated pharmacology data
    Relied upon by pharma companies
    Public domain, commercial, and private data sources
    Provides domain specific API
    Making it easy to build multiple drug discovery applications: examples developed in the project
  • Not just in-house apps
  • Actively being used for different purposes
    Public launch April 2013
    Averaging 20 million hits a month from the start of 2015
    38 million in the last 30 days

    Heavy usage from pharma, academia, and biotech
    500+ registered users
  • Import data into cache

    Integration approach
    Data kept in original model but cached centrally
    API call translated to SPARQL query
    Query expressed in terms of original data

    Queries expanded by IMS to cover URIs of original datasets
  • Data provided by many publishers
    Originally in many formats: relational, SD files and RDF
    Worked closely with publishers
    Data licensing was a major issue

    Over 3 billion triples – 12 datasets
    Hosted on beefy hardware; data in memory (aim)
    Extensive memcaching
    Pose complex queries to extract data
  • Interactions needed to satisfy use cases
    Gradually added additional types of data and interactions
  • No standard units
    Even in curated sources!

    Feedback issues to data providers
  • Validation & Standardization Platform
    Developed by Royal Society of Chemistry
    http://bit.ly/NZF5VB
  • Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases gives different results

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Are these records the same? It depends on what you are doing with the data!
    Each captures a subtly different view of the world

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Interested in physicochemical properties of Gleevec
  • Interested in biomedical and pharmacological properties

    sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets, each with a VoID header providing provenance and justification for the links.
  • Open for anybody

    API grouped into theme areas

    Two phase interaction:
    Resolve thing to identifier
    Retrieve data about the identifier
  • Sustainability
  • API -> queries
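
The lens idea from the identity-mapping slides and notes can be sketched in Python as follows. Each link carries the four fields named in the notes (source, target, predicate, reason), and a lens decides which predicates count as "the same" for a given task. The record identifiers, the LENSES table and the choice to follow links transitively are assumptions for illustration; they are not taken from the Open PHACTS implementation, and which real records are joined by which predicate is not stated in the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    source: str
    target: str
    predicate: str
    reason: str

# Hypothetical lens definitions: predicates accepted as equivalence per task.
LENSES = {
    "structure": {"skos:exactMatch"},                   # strict, for analysis
    "name": {"skos:exactMatch", "skos:closeMatch"},     # relaxed, for browsing
}

def equivalents(start, links, lens):
    """Return every record reachable from start via links the lens accepts."""
    allowed = LENSES[lens]
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for link in links:
            if link.predicate not in allowed:
                continue
            if link.source == node:
                other = link.target
            elif link.target == node:
                other = link.source
            else:
                continue
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen

# Illustrative links only; the identifiers are placeholders.
LINKS = [
    Link("drugbank:rec", "pubchem:rec", "skos:exactMatch", "InChI"),
    Link("chemspider:rec", "drugbank:rec", "skos:closeMatch", "Drug Name"),
    Link("chemspider:rec", "pubchem:rec", "skos:closeMatch", "Drug Name"),
]
```

With these placeholder links, equivalents("drugbank:rec", LINKS, "structure") returns only the two records joined by the InChI-based exact match, while the "name" lens returns all three. The same stored links therefore answer both the analysis question and the browsing question without being rewritten.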
