
Data integration with identifiers and ontologies

Presentation given at OpenTox Euro 2016 in Rheinfelden, Germany.

Published in: Science


  1. Data integration with identifiers and ontologies: Why are names and graphs not enough?
     Egon Willighagen, http://chem-bla-ics.blogspot.com/, @egonwillighagen, ORCID: 0000-0001-7542-0286
     OpenTox Euro 2016, Rheinfelden/DE, 2016-10-28
  2. Acknowledgements
     ● WikiPathways and PathVisio projects
       – Prof. Alex Pico's team, UCSF
       – Current and past members of BiGCaT (Prof. Chris Evelo): Marloes Poort
       – Pathway Providers: Pieter Giesbertz (TUM), Kozo Nishida (RIKEN)
     ● Maastricht University
       – Toxicology: Rianne Fijten
       – MaCSBio team
       – Maastricht Science Programme (VOC project)
     ● Open PHACTS
       – Manchester University: Prof. Carole Goble, Christian Brenninkmeijer, Stian Soiland-Reyes
       – Heriot-Watt University: Alasdair Gray
       – Royal Society of Chemistry: Colin Batchelor
     ● Others
       – Bioclipse: Ola Spjuth (Uppsala University), Bioclipse-Opentox: Nina Jeliazkova
       – MetaboLights collaboration: Reza Salek, Chandu Venkata, Garima Thakur
       – ChEBI collaboration: Christoph Steinbeck, Gareth Owen
       – PubChem collaboration: Evan Bolton, Gang Fu
       – HMDB, Wikidata teams
  3. Asthma: Detecting and Understanding
     Smolinska et al. PLOS ONE. 2014 9:e105447 doi:10.1371/journal.pone.0105447
  4. Systems Biology: pathways
     Andón FT, Fadeel B. "Programmed Cell Death: Molecular Mechanisms and Implications for Safety Assessment of Nanomaterials." Acc Chem Res, 2012
  5. Dopamine metabolism (Marloes Poort)
  6. The effect of troglitazone on heme biosynthesis
  7. PathVisio: pathway enrichment (etc)
     Van Iersel, M.P., et al. "Presenting and exploring biological pathways with PathVisio." BMC Bioinformatics 9.1 (2008): 399. http://pathvisio.org/ → Martina Kutmon
  8. We see a lot? But what is it?
     ● Current techniques can see up to 1000 metabolites in one analysis
       – Only part of all 40k metabolites
     ● Only 10% we can identify
       – The other 90% is unknown
  9. Databases & identifiers
     ● HMDB: Human Metabolome Database
     ● ChEBI: Chemical Entities of Biological Interest
     ● ChemSpider, PubChem
     ● CAS: Chemical Abstracts Service
     ● InChI: International Chemical Identifier
  10. Acid/Base conjugates
      CHEBI:15361 (Pyruvate) → Ce:CHEBI:32816 (conjugate) → Ck:C00022 → [WP2456 HIF1A and PPARG regulation of glycolysis, WP2453 TCA Cycle and PDHc]
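
The chain on the acid/base conjugates slide can be read as a sequence of lookups. The following is a minimal sketch, assuming small in-memory tables that hold only the identifiers shown on that slide. The table and function names are hypothetical and are not part of BridgeDb or WikiPathways.

```python
# Toy lookup tables holding only the identifiers shown on the slide.
# All names here are illustrative assumptions, not a real API.
CONJUGATE = {"CHEBI:15361": "CHEBI:32816"}            # pyruvate -> its conjugate
CHEBI_TO_KEGG = {"CHEBI:32816": "C00022"}             # ChEBI (Ce) -> KEGG Compound (Ck)
KEGG_TO_PATHWAYS = {"C00022": ["WP2456", "WP2453"]}   # Ck -> WikiPathways IDs


def pathways_via_conjugate(chebi_id: str) -> list:
    """Follow ChEBI -> conjugate -> KEGG Compound -> pathways.

    Returns an empty list as soon as one step has no mapping.
    """
    conjugate = CONJUGATE.get(chebi_id)
    if conjugate is None:
        return []
    kegg = CHEBI_TO_KEGG.get(conjugate)
    if kegg is None:
        return []
    return KEGG_TO_PATHWAYS.get(kegg, [])


print(pathways_via_conjugate("CHEBI:15361"))
```

Without the conjugate step the lookup for CHEBI:15361 would stop at the first table, which is the point the slide makes about names and exact identifiers not being enough.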
  11. Switching identities: Glucose
  12. Switching identities: Warfarin
      Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7
  13. Bridging: identifiers
  14. So, what IDs are used in WikiPathways? Curated Collection subset
  15. BridgeDb
      Van Iersel, M.P., et al. "The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services." BMC Bioinformatics 11.1 (2010): 5.
      New tools
      ● Open PHACTS' Identifier Mapping Service
      ● R package
      ● Bioclipse
  16. Metabolite ID Mapping database
      ● HMDB, ChEBI, Wikidata
  17. BridgeDb: scientific lenses
      ● Gene
        – gene-protein
        – gene-probe
      ● Metabolite
        – Tautomers
        – Compound class
        – Charge (acid/ate)
      Brenninkmeijer, CYA, et al. "Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision." Proceedings of 2nd International Workshop on Linked Science. 2012.
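
One way to picture a scientific lens is as a switch that decides which kinds of non-exact links may be followed during mapping. The sketch below assumes a tiny list of typed edges built from the two examples in this talk (pyruvate and its conjugate, myristic acid and its compound class). The lens labels and the `expand` function are hypothetical and do not reflect the BridgeDb API.

```python
from collections import deque

# Typed mapping edges: (source, target, lens). The identifiers come from the
# slides; the lens labels are hypothetical names chosen for this sketch.
EDGES = [
    ("CAS:544-63-8", "CHEBI:28875", "exact"),          # myristic acid
    ("CHEBI:28875", "CHEBI:15904", "compound_class"),  # -> long-chain fatty acid
    ("CHEBI:15361", "CHEBI:32816", "charge"),          # pyruvate -> conjugate
]


def expand(start, lenses):
    """Return every identifier reachable from `start`.

    Exact mappings are always followed; other edges are followed only when
    their lens is listed in `lenses`. Edges are followed in the listed direction.
    """
    allowed = set(lenses) | {"exact"}
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for src, dst, lens in EDGES:
            if src == current and lens in allowed and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


print(sorted(expand("CAS:544-63-8", [])))
print(sorted(expand("CAS:544-63-8", ["compound_class"])))
```

Keeping the lens as an explicit argument means the same mapping data can serve a strict task (exact matches only) and a permissive one (classes, charge states) without maintaining two databases.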
  18. #1: The breath data set
      CAS numbers: 1843
      CAS numbers (unique): 1733
      CAS numbers with mappings: 718
      CAS numbers matches: 54
      Pathways found: 76
      Matches via CAS: 9
      Matches via mapping: 29
      Matches via ChEBI super class: 35
      Matches via ChEBI charged species: 3
      Matches via ChEBI tautomers: 0
      CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
  19. What if we add more CAS ID mappings? (e.g. from Wikidata)
      INFO: Number of ids in Ch (HMDB): 41514 (changed +0.0%)
      INFO: Number of ids in Ce (ChEBI): 64222 (changed +0.0%)
      INFO: Number of ids in Kd (KEGG Drug): 2406 (changed +23960.0%)
      INFO: Number of ids in Ca (CAS): 38621 (changed +30.5%)
      INFO: Number of ids in Wi (Wikipedia): 3991 (changed +0.0%)
      INFO: Number of ids in Ck (KEGG Compound): 15896 (changed +0.0%)
      INFO: Number of ids in Cpc (PubChem-compound): 29170 (changed +72.5%)
      INFO: Number of ids in Wd: 18237
      INFO: Number of ids in Cs (Chemspider): 23981 (changed +49.4%)
      - 30% more CAS numbers (294 unique IDs in WikiPathways)
      - 73% more PubChem compound identifiers (217 unique IDs in WP)
      - 50% more Chemspider identifiers (157 unique IDs in WP)
      - a lot more KEGG Drug identifiers
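
The percentages in this log are easier to interpret once converted back into counts. The sketch below assumes that "changed +p%" means (new - old) / old × 100 relative to the previous database build; that reading is an assumption, and the function name is hypothetical. Under that assumption, the KEGG Drug line implies that the previous build held only about 10 such identifiers, which is why the percentage is so large.

```python
def previous_count(current: int, pct_change: float) -> float:
    """Invert pct_change = (current - previous) / previous * 100."""
    return current / (1 + pct_change / 100)


# Values copied from the log output on the slide.
kegg_drug_before = previous_count(2406, 23960.0)
cas_before = previous_count(38621, 30.5)  # approximate: the log rounds to 0.1%

print(round(kegg_drug_before))
print(round(cas_before))
```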
  20. #1: The breath data set
      CAS numbers: 1843
      CAS numbers (unique): 1733
      CAS numbers with mappings: 978
      CAS numbers matches: 116
      Pathways found: 158 (unique: 62)
      Matches via CAS: 9
      Matches via mapping: 28
      Matches via ChEBI super class: 108
      Matches via ChEBI charged species: 9
      Matches via ChEBI tautomers: 0
      Matches via ChEBI roles: 4
      CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
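
The two breath data set runs (slides 18 and 20) can be compared directly. The sketch below only restates the counts from those slides and derives growth factors and coverage of the 1733 unique CAS numbers; the variable and function names are hypothetical.

```python
# Counts copied from slides 18 (before) and 20 (after adding more mappings).
UNIQUE_CAS = 1733
before = {"with_mappings": 718, "matches": 54, "pathways": 76}
after = {"with_mappings": 978, "matches": 116, "pathways": 158}


def coverage(count: int, total: int = UNIQUE_CAS) -> float:
    """Percentage of unique CAS numbers covered, rounded to one decimal."""
    return round(100 * count / total, 1)


for key in before:
    growth = after[key] / before[key]
    print(f"{key}: {before[key]} -> {after[key]} (x{growth:.2f})")

# Share of unique CAS numbers that have at least one mapping, before and after.
print(coverage(before["with_mappings"]), coverage(after["with_mappings"]))
```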
  21. Wikidata
      Mietchen, D. et al. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+ (2015)
  22. Wikidata: identifiers
  23. Which API approach? REST, SADI, XMPP†, SOAP†?
      Willighagen et al. "Computational toxicology using the OpenTox application programming interface and Bioclipse." BMC Research Notes 4.1 (2011): 1.
      Wagener et al. "XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services." BMC Bioinformatics 10.1 (2009): 279.
  24. Interactive API Docs with OpenAPI (Swagger)
  25. Application Programming Interfaces
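
Identifier mappings such as CAS number to ChEBI ID can be requested from Wikidata with SPARQL. The sketch below only builds a request URL for the public Wikidata Query Service and does not contact the network. It is an illustration, not necessarily the query used for the work presented here; `build_url` is a hypothetical helper. P231 (CAS Registry Number) and P683 (ChEBI ID) are Wikidata property identifiers.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

# P231 = CAS Registry Number, P683 = ChEBI ID (Wikidata property IDs).
QUERY = """
SELECT ?compound ?cas ?chebi WHERE {
  ?compound wdt:P231 ?cas ;
            wdt:P683 ?chebi .
}
LIMIT 10
"""


def build_url(query: str = QUERY) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(build_url())
```

The resulting URL can be fetched with any HTTP client; the `LIMIT` keeps the example small and should be removed or raised for a full mapping export.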
  26. Conclusions
      ● Updated metabolite ID database
        – HMDB: still a major workhorse
        – ChEBI: charged species, compound classes
        – Wikidata: CAS numbers, other missing
      ● Pathway Analysis
        – Mapping with Bioclipse and PathVisio
        – Scientific lenses improve mappings
        – Better annotation
