Scientific Lenses over Linked Data: An approach to support multiple integrated views

When are two entries about a concept in different datasets the same? If they have the same name, properties, or some other criteria? The choice depends upon the application to which the data will be put. However, existing Linked Data approaches provide a single global view over the data with no way of varying the notion of equivalence to be applied.


In this presentation, I will introduce Scientific Lenses, an approach that enables applications to vary the equivalence conditions between linked datasets. Lenses have been deployed in the Open PHACTS Discovery Platform, a large-scale data integration platform for drug discovery. To cater for different use cases, the platform allows different lenses to be applied; each lens varies the equivalence rules based on the context and interpretation of the links.



  1. 1. Scientific Lenses over Linked Data An approach to support multiple integrated views Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair
  2. 2. Open PHACTS Use Case
     “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
     • Chemical properties (ChemSpider)
     • Launched drugs (DrugBank)
     • Human => Mouse (HomoloGene)
     • Protein families (ENZYME)
     • Bioactivity data (ChEMBL)
     • … other info (UniProt/Entrez etc.)
     16 October 2014 Scientific Lenses – A. J. G. Gray
  3. 3. Discovery Platform Apps Method Calls Domain API Drug Discovery Platform Interactive responses Production quality integration platform
  4. 4. App Ecosystem An “App Store”? Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium MOE Collector Cytophacts Utopia Garfield SciBite KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna http://www.openphactsfoundation.org/apps.html
  5. 5. API Hits
     April 2013 – March 2014: 15.8m
     April 2014 – Sept 2014: 14m
     Total: 29.8 million
  6. 6. Linked Data API Drug Target Pathway Disease (1.4) https://dev.openphacts.org/
  7. 7. Open PHACTS Data
     Source         Initial Records   Triples         Properties
     ChEMBL         1,481,473         304,360,749     77
     DrugBank       19,628            517,584         74
     UniProt        564,246           405,473,138     82
     ENZYME         6,187             73,838          2
     ChEBI          40,575            1,673,863       2
     GeneOntology   38,137            2,447,682       26
     GOA            661,232           1,765,622,393   15
     ChemSpider     1,361,568         215,193,441     23
     ConceptWiki    2,828,966         4,291,131       1
     WikiPathways   946               1,949,074       34
  8. 8. Dataset Descriptions in the Open Pharmacological Space 14 January 2013 Being replaced by W3C HCLS community profile http://tiny.cc/hcls-datadesc-ed OPS Dataset Descriptions – A. J. G. Gray 7
  9. 9. OPS Discovery Platform (architecture diagram)
     Components: Apps; Linked Data API (RDF/XML, TTL, JSON); Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Domain Specific Services; Identity Resolution Service; Chemistry Registration Normalisation & Q/C; Identifier Management Service; Indexing; Core Platform
     Data sources: Public Content, Commercial, Public Ontologies, User Annotations (labelled Db, VoID, Nanopub)
     Example identifiers: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532
  10. 10. Multiple Identities
      Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/
      GB:29384, P12047, X31045: are these the same thing?
  11. 11. Gleevec®: Imatinib Mesylate
      Labels: Imatinib; Imatinib Mesylate; Mesylate
      YLMAHDNUQAMNNX-UHFFFAOYSA-N
      Sources: ChemSpider, DrugBank, PubChem
  12. 12. Gleevec®: Imatinib Mesylate
      Labels: Imatinib; Imatinib Mesylate; Mesylate
      YLMAHDNUQAMNNX-UHFFFAOYSA-N
      Sources: ChemSpider, DrugBank, PubChem
      Are these records the same? It depends upon your task!
  13. 13. Genes == Proteins?
      BRCA1: Chromosome 17
      Breast cancer type 1 susceptibility protein
      http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
      http://en.wikipedia.org/wiki/File:BRCA1_en.png
  14. 14. Genes == Proteins?
      BRCA1: Chromosome 17
      Breast cancer type 1 susceptibility protein
      http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
      http://en.wikipedia.org/wiki/File:BRCA1_en.png
      Are these records the same? It depends upon your task!
  15. 15. Example Use Cases
      “I need to perform an analysis, give me details of the active compound in Gleevec.”
      “Which targets are known to interact with Gleevec?”
  16. 16. Structure Lens
      “I need to perform an analysis, give me details of the active compound in Gleevec.”
      Scale: Strict (Analysing) to Relaxed (Browsing)
      Links: skos:exactMatch (InChI)
  17. 17. Name Lens
      “Which targets are known to interact with Gleevec?”
      Scale: Strict (Analysing) to Relaxed (Browsing)
      Links: skos:closeMatch (Drug Name); skos:exactMatch (InChI); skos:closeMatch (Drug Name)
  18. 18. What is a Scientific Lens?
      A lens defines a conceptual view over the data
      • Specifies operational equivalence conditions
      Consists of:
      • Identifier (URI)
      • Title (dct:title)
      • Description (dct:description)
      • Documentation link (dcat:landingPage)
      • Creator (pav:createdBy)
      • Timestamp (pav:createdOn)
      • Equivalence rules (bdb:linksetJustification)
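The lens description on slide 18 can be pictured as a small record plus a rule for which linksets it switches on. The following is only a sketch under stated assumptions: the `Lens` and `Linkset` classes, the `active_linksets` helper, the `example.org` URIs and the two sample linksets are all hypothetical and are not the Open PHACTS implementation. The justification labels `InChI` and `DrugName` follow the Structure Lens and Name Lens slides.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Lens:
    """Lens metadata, one field per item listed on the slide."""
    uri: str                    # Identifier (URI)
    title: str                  # dct:title
    description: str            # dct:description
    landing_page: str           # dcat:landingPage
    created_by: str             # pav:createdBy
    created_on: datetime        # pav:createdOn
    justifications: frozenset   # bdb:linksetJustification values the lens enables


@dataclass(frozen=True)
class Linkset:
    """Hypothetical shape: a group of links sharing one justification."""
    source: str
    target: str
    justification: str


def active_linksets(lens, linksets):
    """Return the linksets whose justification is enabled by the lens."""
    return [ls for ls in linksets if ls.justification in lens.justifications]


# Illustrative values only; none of these URIs are real.
structure_lens = Lens(
    uri="http://example.org/lens/structure",
    title="Structure Lens",
    description="Strict matching for analysis",
    landing_page="http://example.org/docs/structure-lens",
    created_by="http://example.org/people/editor",
    created_on=datetime(2014, 10, 16),
    justifications=frozenset({"InChI"}),
)
name_lens = replace(
    structure_lens,
    uri="http://example.org/lens/name",
    title="Name Lens",
    description="Relaxed matching for browsing",
    justifications=frozenset({"InChI", "DrugName"}),
)

linksets = [
    Linkset("ChemSpider", "ChEMBL", "InChI"),
    Linkset("DrugBank", "ChEMBL", "DrugName"),
]

print([ls.source for ls in active_linksets(structure_lens, linksets)])
print([ls.source for ls in active_linksets(name_lens, linksets)])
```

Under the stricter lens only the structure-based linkset is active; the relaxed lens also admits the name-based one, which is the "enable some relationships, disable others" behaviour the slides describe.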
  19. 19. Lens Effects: Ibuprofen
      Ibuprofen consists of two equally active stereoisomers.
      • Stereoisomers not always represented in data
      Users wish to retrieve information for any stereoisomer.
      CHEMBL427526, CHEMBL521, CHEMBL175
  20. 20. Default Lens
      Ibuprofen consists of two equally active stereoisomers.
      • Stereoisomers not always represented in data
      Users wish to retrieve information for any stereoisomer.
  21. 21. Stereoisomer Lens
      Ibuprofen consists of two equally active stereoisomers.
      • Stereoisomers not always represented in data
      Users wish to retrieve information for any stereoisomer.
  22. 22. Mapping Generation
      ✔ ops:OPS437281 has_stereoundefined_parent [ci:CHEMINF_000456] ops:OPS380297 is_stereoisomer_of [ci:CHEMINF_000461] ops:OPS380292
      Other relationships:
      • has part
      • is tautomer of
      • uncharged counterpart
      • isotope
      • …
  23. 23. Initial Connectivity
      Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7
  24. 24. Compound Information
  25. 25. Proceed with Caution!
  26. 26. Co-reference Computation
      Rules ensure:
      • Unrestricted transitivity within conceptual type
      • Restrict crossing conceptual types
      Based on justifications
      Provenance captured
      Diagram cardinalities: 0..*, 0..*, 0..*, 0..1, 0..1
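One way to read the two rules on slide 26 is as a transitive closure that only chains links whose endpoints share a conceptual type. The sketch below uses a union-find structure to do that; it is an assumption about how such rules could be realised, not the platform's actual algorithm. The function name `coreference_sets`, the `concept_type` lookup, and the `ex:gene1` / `ex:protein1` identifiers are hypothetical; the three compound identifiers are taken from the query example on slide 30.

```python
from collections import defaultdict


def coreference_sets(links, concept_type):
    """Group URIs into equivalence sets, chaining only same-type links."""
    parent = {}

    def find(uri):
        # Standard union-find lookup with path halving.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        # Assumption: a link that crosses conceptual types is never chained.
        if concept_type[a] == concept_type[b]:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return sorted(groups.values(), key=sorted)


links = [
    ("cs:2157", "chembl:1280"),
    ("chembl:1280", "db:db00945"),
    ("ex:gene1", "ex:protein1"),   # hypothetical gene-to-protein link
]
concept_type = {
    "cs:2157": "compound",
    "chembl:1280": "compound",
    "db:db00945": "compound",
    "ex:gene1": "gene",
    "ex:protein1": "protein",
}

for group in coreference_sets(links, concept_type):
    print(sorted(group))
```

In this simplification a cross-type link is simply left out of the closure. A fuller treatment would presumably keep it as a direct, non-transitive mapping together with its justification and provenance.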
  27. 27. Initial Connectivity
      Datasets: 37; Linksets: 104; Links: 7,096,712; Justifications: 7
  28. 28. Inferred Connectivity
      Datasets: 37; Linksets: 883; Links: 17,383,846; Justifications: 7
  29. 29. BridgeDb
  30. 30. Lenses: Under the hood
      GRAPH <http://rdf.chemspider.com> {
        cw:979b545d-f9a9 cheminf:logd ?logd .
        ?iri cheminf:logd ?logd .
        FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 ||
                ?iri = chembl:1280 || ?iri = db:db00945 )
      }
      GRAPH <http://…
      Diagram labels: Q, L1; Q’; Query Expander Service; Identity Mapping Service (BridgeDB); Mappings; Profiles
      IMS request: cw:979b545d-f9a9, L1
      IMS response: [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]
      • Can also be achieved through UNION
      • IMS call adds overhead
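The FILTER-based expansion on slide 30 can be sketched as a small string rewrite: look up the equivalents of a URI under a lens, then replace the URI with a variable constrained by a FILTER. This is a minimal illustration, assuming an in-memory dictionary in place of the Identity Mapping Service; `MAPPINGS`, `equivalents` and `expand_with_filter` are hypothetical names, and the identifiers are those shown on the slide.

```python
# Hypothetical stand-in for the Identity Mapping Service:
# (URI, lens) -> list of equivalent URIs under that lens.
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}


def equivalents(uri, lens):
    """Return the URIs equivalent to `uri` under `lens` (itself if unknown)."""
    return MAPPINGS.get((uri, lens), [uri])


def expand_with_filter(uri, predicate, variable, lens):
    """Rewrite `uri predicate variable` as a ?iri pattern plus a FILTER."""
    conditions = " || ".join(f"?iri = {e}" for e in equivalents(uri, lens))
    return f"?iri {predicate} {variable} .\nFILTER ({conditions})"


print(expand_with_filter("cw:979b545d-f9a9", "cheminf:logd", "?logd", "L1"))
```

Changing the lens argument changes the set of equivalents returned by the lookup, and therefore the FILTER, without touching the data or the original query template.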
  31. 31. Experiment
      Is it feasible to use a stand-off mapping service?
      • Baselines (no external call):
        • “Perfect” URIs
        • Linked data querying
      • Expansion approaches (external service call):
        • FILTER by Graph
        • UNION by Graph
      C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
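For comparison with FILTER, a UNION-style expansion emits one triple pattern per co-referent URI inside the graph clause. The sketch below is an assumption about what "UNION by Graph" could look like, not the construction evaluated in the cited paper; `expand_with_union` is a hypothetical helper and the graph name is the one shown on slide 30.

```python
def expand_with_union(graph, predicate, variable, uris):
    """Build a GRAPH clause with one UNION branch per co-referent URI."""
    branches = [f"{{ {u} {predicate} {variable} . }}" for u in uris]
    body = "\n  UNION\n  ".join(branches)
    return f"GRAPH <{graph}> {{\n  {body}\n}}"


print(expand_with_union(
    "http://rdf.chemspider.com",
    "cheminf:logd",
    "?logd",
    ["cs:2157", "chembl:1280"],
))
```

Both styles describe the same set of bindings; which one a triple store evaluates faster is an empirical question, which is what the experiment sets out to measure.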
  32. 32. “Perfect” URI Baseline
      WHERE {
        GRAPH <chemspider> {
          cs:2157 cheminf:logp ?logp .
        }
        GRAPH <chembl> {
          chembl_mol:m1280 cheminf:mw ?mw .
        }
      }
  33. 33. Linked Data Baseline
      WHERE {
        GRAPH <chemspider> {
          cs:2157 cheminf:logp ?logp .
        }
        GRAPH <chembl> {
          ?chemblid cheminf:mw ?mw .
        }
        cs:2157 skos:exactMatch ?chemblid .
      }
  34. 34. Queries
      Drawn from Open PHACTS API:
      1. Simple compound information (1)
      2. Compound information (1)
      3. Compound pharmacology (M)
      4. Simple target information (1)
      5. Target information (1)
      6. Target pharmacology (M)
  35. 35. Queries
      Drawn from Open PHACTS API:
      1. Simple compound information (1)
      2. Compound information (1)
      3. Compound pharmacology (M)
      4. Simple target information (1)
      5. Target information (1)
      6. Target pharmacology (M)
  36. 36. Experiment Data
      Data: 167,783,592 triples
      Mappings: 2,114,584 triples
      Lenses: 1
  37. 37. Average execution times
  38. 38. Average execution times 0.018
  39. 39. Q6: Target Pharmacology
  40. 40. Explorer Screenshot
  41. 41. Explorer Screenshot
  42. 42. Conclusions
      • Scientific data is complex and messy
      • Requires flexibility in linking
      • Equivalence depends upon context
      • Lenses provide support for operational equivalence
      • Chemical structures support automatic computing of links with justification
  43. 43. Acknowledgements
      Royal Society of Chemistry: Colin Batchelor, Karen Karapetyan, Jon Steele, Valery Tkachenko, Antony Williams
      University of Manchester: Christian Brenninkmeijer, Ian Dunlop, Carole Goble, Steve Pettifer, Robert Stevens
      Swiss Institute for Bioinformatics: Christine Chichester
      European Bioinformatics Institute: Mark Davies, Anna Gaulton, John Overington
      University of Vienna: Daniela Digles
      Maastricht University: Chris Evelo, Andra Waagmeester, Egon Willighagen
      VU University of Amsterdam: Paul Groth, Antonis Loizou
      Connected Discovery: Lee Harland
  44. 44. Questions Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair Open PHACTS pmu@openphacts.org openphacts.org @open_phacts

Editor's Notes

  • 1 of 83 business driver questions
    Took a team of 5 experienced researchers 6 hours to manually gather the answer
  • A platform for integrated pharmacology data
    Relied upon by pharma companies
    Public domain, commercial, and private data sources
    Provides domain specific API
    Making it easy to build multiple drug discovery applications: examples developed in the project
  • Actively being used
    Since launch (April 2013): 30 million hits
  • Linked data API: multiple response formats (JSON, RDF, XML, CSV …)
    3scale deployment, extensive memcaching
    Public dataset
    Provenance of data returned in response
  • Hosted on beefy hardware; data in memory (aim)
  • Specifies MIM checklist
    Reuses terms from VoID, PAV, DCTerms, PROV predicates
  • Import data into cache

    API calls populate SPARQL queries

    Integration approach
    Data kept in original model
    Data cached in central triple store
    API call translated to SPARQL query
    Query expressed in terms of original data

    Queries expanded by IMS to cover URIs of original datasets
  • Concept appears in multiple datasets, each with its own identifier
    This talk is about supporting the multiple identities that exist
    Rather than define a single approach, we want to support the use of multiple identifiers
  • Example drug: Gleevec, a cancer drug for leukemia

    Lookup in three popular public chemical databases → different results

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Are these records the same? It depends on what you are doing with the data!
    Each captures a subtly different view of the world

    Chemistry is complicated, often simplified for convenience
    Data is messy!
  • Do genes == proteins? Different conceptual types: gene and protein

    Biological data is complicated → simplified for convenience

    ----

    But if you’re saying why genes = proteins, you may also want to be prepared for questions of when genes != proteins. Splice variation is a common example. In the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene, but it can be made into two distinct proteins which have different biological effects, so you can obviously mix bio data that shouldn’t be mixed by integrating these two functions on the same ID. [We currently don’t handle this well in OPS]

    And the most used example here: the ghrelin gene is transcribed into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene: http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants
  • Often used as a shortcut for retrieval: BRCA1 easier to remember and type!

    Require the ability to equate them in the IMS


  • Analysis requires precise knowledge of the form of the compound across datasets

    Finding targets is a search activity, and some entries are likely to be mis-entered

    We use lenses to change the links between the data
  • Interested in physicochemical properties of Gleevec
  • Interested in biomedical and pharmacological properties

    sameAs != sameAs: it depends on your point of view

    Links relate individual data instances: source, target, predicate, reason.

    Links are grouped into Linksets, each of which has a VoID header providing provenance and justification for its links.
  • Lens enables certain relationships and disables others
    Alters links between the data
  • Default lens matches structures
    Only returns data associated with the structure that was entered

    Really want all information about Ibuprofen
    Need a different lens
  • Validate structure: Source data is messy!
    Identify common problems:
    Charge imbalance
    Stereochemistry
    Compute physicochemical properties
    Identify related properties based on structure
    17 relationship types
  • Can enter with IDs from any of the supported datasets
  • Platform extracts data from certain datasets

    These need to be connected

    Here there is no issue in computing transitive links, as they are all the same compound based on InChI key

    Would compute the full set of links
  • Insulin Receptor

    Issue when linking through PDB due to the way that proteins are crystallised
  • Can enter with IDs from any of the supported datasets
  • These are 1.3 figures

    In 1.4
    130 raw linksets with 6,985,278 links
    40,802 computed linksets with 25,584,293 links
  • Implementation available

    IMS takes query and expands URIs
  • Query with URIs
    Extract URIs
    Find equivalents under a certain lens (Isolates lens behaviour)
    Expand query
    Optimise based on context
  • Result size in brackets
  • Orange are actual OPS queries
  • Subset of the OPS data
  • Linked data approach performs badly with query 6 due to the query construction
    Name being bound to the chemical structure returned
  • Focus on other queries
    In general, expansion is slower than the baselines
    Worst case delta: 0.01842 s (under 20 ms)
    Human perception threshold is 0.050 to 0.2 s (50 to 200 ms)
  • Focus on query 6
    No linked data as it performed very poorly on this query
    Size of result obliterates external call cost
  • Pharmacology count 2370 → 3044
