Data integration with identifiers and ontologies

Presentation given at OpenTox Euro 2016 in Rheinfelden, Germany.

Published in: Science

  1. Data integration with identifiers and ontologies
     Why are names and graphs not enough?
     Egon Willighagen, http://chem-bla-ics.blogspot.com/, @egonwillighagen, ORCID: 0000-0001-7542-0286
     OpenTox Euro 2016, Rheinfelden/DE, 2016-10-28
  2. Acknowledgements
     ● WikiPathways and PathVisio projects
       – Prof. Alex Pico's team, UCSF
       – Current and past members of BiGCaT (Prof. Chris Evelo): Marloes Poort
       – Pathway Providers: Pieter Giesbertz (TUM), Kozo Nishida (RIKEN)
     ● Maastricht University
       – Toxicology: Rianne Fijten
       – MaCSBio team
       – Maastricht Science Programme (VOC project)
     ● Open PHACTS
       – Manchester University: Prof. Carole Goble, Christian Brenninkmeijer, Stian Soiland-Reyes
       – Heriot-Watt University: Alasdair Gray
       – Royal Society of Chemistry: Colin Batchelor
     ● Others
       – Bioclipse: Ola Spjuth (Uppsala University), Bioclipse-OpenTox: Nina Jeliazkova
       – MetaboLights collaboration: Reza Salek, Chandu Venkata, Garima Thakur
       – ChEBI collaboration: Christoph Steinbeck, Gareth Owen
       – PubChem collaboration: Evan Bolton, Gang Fu
       – HMDB, Wikidata teams
  3. Asthma: Detecting and Understanding
     Smolinska et al. PLOS ONE. 2014 9:e105447 doi:10.1371/journal.pone.0105447
  4. Systems Biology: pathways
     Andón FT, Fadeel B. "Programmed Cell Death: Molecular Mechanisms and Implications for Safety Assessment of Nanomaterials." Acc Chem Res, 2012
  5. Dopamine metabolism
     Marloes Poort
  6. The effect of troglitazone on heme biosynthesis
  7. PathVisio: pathway enrichment (etc)
     Van Iersel, M.P., et al. "Presenting and exploring biological pathways with PathVisio." BMC Bioinformatics 9.1 (2008): 399.
     http://pathvisio.org/ → Martina Kutmon
  8. We see a lot? But what is it?
     ● Current techniques can see up to 1000 metabolites in one analysis
       – Only part of all 40k metabolites
     ● We can identify only 10%
       – The other 90% is unknown
  9. Databases & identifiers
     ● HMDB: Human Metabolome Database
     ● ChEBI: database of Chemical Entities of Biological Interest
     ● ChemSpider, PubChem
     ● CAS: Chemical Abstracts Service
     ● InChI: International Chemical Identifier
  10. Acid/Base conjugates
      CHEBI:15361 (Pyruvate) → Ce:CHEBI:32816 (conjugate) → Ck:C00022 → [WP2456 HIF1A and PPARG regulation of glycolysis, WP2453 TCA Cycle and PDHc]
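
The chain on slide 10 can be walked in a few lines of code. The following is a toy sketch, not the BridgeDb API: the tables `CONJUGATE_OF`, `CHEBI_TO_KEGG`, and `KEGG_TO_PATHWAYS` and the function `pathways_for` are hypothetical names, and they hold only the identifiers shown on the slide.

```python
# Toy lookup tables. The identifiers come from slide 10; the table and
# function names are hypothetical and do not belong to any real library.
CONJUGATE_OF = {"CHEBI:15361": "CHEBI:32816"}   # pyruvate -> its conjugate
CHEBI_TO_KEGG = {"CHEBI:32816": "C00022"}       # Ce -> Ck
KEGG_TO_PATHWAYS = {
    "C00022": [
        "WP2456 HIF1A and PPARG regulation of glycolysis",
        "WP2453 TCA Cycle and PDHc",
    ],
}


def pathways_for(chebi_id):
    """Return (route, pathways) for a ChEBI identifier.

    The identifier is tried as-is first; if it has no KEGG mapping,
    its acid/base conjugate is tried instead.
    """
    route = [chebi_id]
    kegg = CHEBI_TO_KEGG.get(chebi_id)
    if kegg is None and chebi_id in CONJUGATE_OF:
        conjugate = CONJUGATE_OF[chebi_id]
        route.append(conjugate)
        kegg = CHEBI_TO_KEGG.get(conjugate)
    if kegg is None:
        return route, []
    route.append(kegg)
    return route, KEGG_TO_PATHWAYS.get(kegg, [])


route, pathways = pathways_for("CHEBI:15361")
print(" -> ".join(route))
for name in pathways:
    print(name)
```

Without the conjugate step the lookup for CHEBI:15361 would stop with no pathways in this toy data, which is the gap the acid/base link closes.
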
  11. Switching identities: Glucose
  12. Switching identities: Warfarin
      Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7
  13. Bridging: identifiers
  14. So, what IDs are used in WikiPathways? Curated Collection subset
  15. BridgeDb
      Van Iersel, M.P., et al. "The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services." BMC Bioinformatics 11.1 (2010): 5.
      New tools
      ● Open PHACTS' Identifier Mapping Service
      ● R package
      ● Bioclipse
  16. Metabolite ID Mapping database
      ● HMDB, ChEBI, Wikidata
  17. BridgeDb: scientific lenses
      ● Gene
        – gene-protein
        – gene-probe
      ● Metabolite
        – Tautomers
        – Compound class
        – Charge (acid/ate)
      Brenninkmeijer, C.Y.A., et al. "Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision." Proceedings of 2nd International Workshop on Linked Science. 2012.
  18. #1: The breath data set
      CAS numbers: 1843
      CAS numbers (unique): 1733
      CAS numbers with mappings: 718
      CAS numbers matches: 54
      Pathways found: 76
      Matches via CAS: 9
      Matches via mapping: 29
      Matches via ChEBI super class: 35
      Matches via ChEBI charged species: 3
      Matches via ChEBI tautomers: 0
      CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
  19. What if we add more CAS ID mappings? (e.g. from Wikidata)
      INFO: Number of ids in Ch (HMDB): 41514 (changed +0.0%)
      INFO: Number of ids in Ce (ChEBI): 64222 (changed +0.0%)
      INFO: Number of ids in Kd (KEGG Drug): 2406 (changed +23960.0%)
      INFO: Number of ids in Ca (CAS): 38621 (changed +30.5%)
      INFO: Number of ids in Wi (Wikipedia): 3991 (changed +0.0%)
      INFO: Number of ids in Ck (KEGG Compound): 15896 (changed +0.0%)
      INFO: Number of ids in Cpc (PubChem-compound): 29170 (changed +72.5%)
      INFO: Number of ids in Wd: 18237
      INFO: Number of ids in Cs (Chemspider): 23981 (changed +49.4%)
      - 30% more CAS numbers (294 unique IDs in WikiPathways)
      - 73% more PubChem compound identifiers (217 unique IDs in WP)
      - 50% more Chemspider identifiers (157 unique IDs in WP)
      - a lot more KEGG Drug identifiers
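
The "changed" figures in the log on slide 19 are relative changes, so the previous counts can be estimated from them. The sketch below is an illustration only: the function names are hypothetical, and the previous counts it prints are inferred from the rounded percentages, not taken from the slides.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


def implied_previous(new, pct):
    """Previous count implied by a new count and a reported percent change.

    This is an estimate: the log reports percentages rounded to one decimal.
    """
    return round(new / (1.0 + pct / 100.0))


# (system code, new count, reported change in %) copied from the slide's log.
reported = [
    ("Kd", 2406, 23960.0),
    ("Ca", 38621, 30.5),
    ("Cpc", 29170, 72.5),
    ("Cs", 23981, 49.4),
]

for code, new, pct in reported:
    old = implied_previous(new, pct)
    print(f"{code}: ~{old} -> {new} ({percent_change(old, new):+.1f}%)")
```

Read this way, the very large KEGG Drug figure corresponds to a previous count of roughly ten identifiers, which fits the slide's summary of "a lot more KEGG Drug identifiers".
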
  20. #1: The breath data set
      CAS numbers: 1843
      CAS numbers (unique): 1733
      CAS numbers with mappings: 978
      CAS numbers matches: 116
      Pathways found: 158 (unique: 62)
      Matches via CAS: 9
      Matches via mapping: 28
      Matches via ChEBI super class: 108
      Matches via ChEBI charged species: 9
      Matches via ChEBI tautomers: 0
      Matches via ChEBI roles: 4
      CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
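
The per-route counts on slide 20 suggest a lookup that tries progressively looser routes and records which one produced the match. The sketch below assumes such an ordering (direct CAS, then identifier mapping, then ChEBI super class); that ordering, the table names, and `match_route` are assumptions for illustration, and only the myristic acid entries come from the slide.

```python
from collections import Counter

# Hypothetical toy tables; only the myristic acid entries come from the slide.
PATHWAYS_BY_CAS = {}                                  # pathways annotated directly with a CAS number
CAS_TO_CHEBI = {"544-63-8": "CHEBI:28875"}            # identifier mapping (myristic acid)
SUPER_CLASS = {"CHEBI:28875": "CHEBI:15904"}          # myristic acid is a long-chain fatty acid
PATHWAYS_BY_CHEBI = {"CHEBI:15904": ["WP368", "WP357"]}


def match_route(cas):
    """Return (route name, pathways) for the first route that finds a pathway."""
    if cas in PATHWAYS_BY_CAS:
        return "CAS", PATHWAYS_BY_CAS[cas]
    chebi = CAS_TO_CHEBI.get(cas)
    if chebi is None:
        return None, []
    if chebi in PATHWAYS_BY_CHEBI:
        return "mapping", PATHWAYS_BY_CHEBI[chebi]
    parent = SUPER_CLASS.get(chebi)
    if parent in PATHWAYS_BY_CHEBI:
        return "ChEBI super class", PATHWAYS_BY_CHEBI[parent]
    return None, []


# "0000-00-0" is a made-up placeholder standing in for an unmatched compound.
tally = Counter()
for cas in ["544-63-8", "0000-00-0"]:
    route, _ = match_route(cas)
    if route is not None:
        tally[route] += 1
print(dict(tally))
```

Further routes from the slide (charged species, tautomers, roles) would be added as extra steps of the same shape.
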
  21. Wikidata
      Mietchen, D. et al. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+ (2015)
  22. Wikidata: identifiers
  23. Which API approach? REST, SADI, XMPP†, SOAP†?
      Willighagen et al. "Computational toxicology using the OpenTox application programming interface and Bioclipse." BMC Research Notes 4.1 (2011): 1.
      Wagener et al. "XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services." BMC Bioinformatics 10.1 (2009): 279.
  24. Interactive API Docs with OpenAPI (Swagger)
  25. Application Programming Interfaces
  26. Conclusions
      ● Updated metabolite ID database
        – HMDB: still a major workhorse
        – ChEBI: charged species, compound classes
        – Wikidata: CAS numbers, other missing
      ● Pathway Analysis
        – Mapping with Bioclipse and PathVisio
        – Scientific lenses improve mappings
        – Better annotation
