
Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls

The European MassBank server (www.massbank.eu) was founded in 2012 by the NORMAN Network (www.norman-network.net) to provide open access to mass spectra of substances of environmental interest contributed by NORMAN members. The automated workflow RMassBank was developed as part of this effort (Stravs et al. 2013, DOI: 10.1002/jms.3131; https://github.com/MassBank/RMassBank/). This workflow included automated processing of the mass spectral data, as well as automated annotation using the SMILES, names and CAS numbers provided by the user. Cheminformatics toolkits (e.g. Open Babel, rcdk) and web services (e.g. the CACTUS Chemical Identifier Resolver, Chemical Translation Service (CTS), ChemSpider, PubChem) were then used to convert and/or retrieve the remaining information needed to complete the MassBank records (additional names, InChIs, InChIKeys, several database identifiers, mol files), to avoid excessive burden on the users and reduce the chance of errors. To date, approximately 16,000 MS/MS spectra (61 %*) corresponding to 1,269 (18 %*) unique chemicals (*of all open data as of Nov. 2016) have been uploaded to MassBank.EU via RMassBank. Curating the MassBank.EU records, as part of efforts to provide EPA CompTox Dashboard identifiers (DTXSIDs) for each record, revealed several issues in data quality. In addition, the representation of “ambiguous substances”, for example complex surfactant mixtures of various chain lengths and branching or incompletely defined structures of transformation products, is an ongoing challenge. While “ambiguous structures” cannot be represented in the majority of cheminformatics tools, we report on proof-of-concept solutions in this work. This presentation reflects on the effectiveness of the original RMassBank concept and on both the potential and the pitfalls of using automated structure annotation with open resources to streamline spectra contributions from external laboratories and users with widely ranging cheminformatics experience.
Note: this work does not necessarily reflect U.S. EPA policy.

Published in: Science

  1. Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls
     Emma Schymanski, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg. Email: emma.schymanski@uni.lu
     Michael A. Stravs (Eawag, Dübendorf, Switzerland)
     Tobias Schulze (Helmholtz Centre for Environmental Research, Germany)
     Antony J. Williams (NCCT, US EPA, Research Triangle Park, NC, USA)
     The views expressed in this presentation are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency.
  2. MassBank: Japan, Europe, America …
     Horai et al. 2010. JMS, 45(7) pp 703-714. DOI: 10.1002/jms.1777
     www.massbank.jp, www.massbank.eu, http://mona.fiehnlab.ucdavis.edu/
     o MassBank started as a public repository in Japan, 2006
     o No standard analytical method
     o Includes many different data types (GC, LC, MS, MS/MS, HR, LR, AM…)
     o Contributor is responsible for data quality
     o NORMAN: network of reference laboratories, research centres and related organisations for monitoring of emerging environmental substances
     o Many different laboratories with different instruments & reference standards
     o “Emerging substances” and TPs: not yet widely known; not yet in databases
     o NORMAN joined MassBank in 2012 and founded MassBank.EU
     o MassBank.JP and MassBank.EU are quite similar …
     o MoNA (MassBank of North America) is the latest in the collection
     o Completely different database concept
  3. MassBank: Crossing the World
     Image: www.massbank.jp
  4. MassBank Now
     o www.massbank.jp & www.massbank.eu
     MassBank now has 46,334 spectra* from 32 contributing institutes!
     Contributions from European NORMAN member institutes
     10,668 MS/MS
     *Spectra numbers from http://mona.fiehnlab.ucdavis.edu/downloads
  5. MassBank Now
     Image: www.massbank.eu http://massbank.eu/MassBank
  6. MassBank Now: Example Mass Spectrum
  7. European MassBank
     http://massbank.eu/MassBank
     o MassBank.EU was founded late 2012, hosted at UFZ, Leipzig, Germany
     o 16,017 MS/MS spectra; 1,232 substances from NORMAN members
     o Tentative/unknown/literature spectra on massbank.eu (not massbank.jp)
  8. European MassBank
     Image: www.massbank.eu http://massbank.eu/MassBank
     o MassBank.EU was founded late 2012, hosted at UFZ, Leipzig, Germany
     o 16,017 MS/MS spectra; 1,232 substances from NORMAN members
     o Tentative/unknown/literature spectra on massbank.eu (not massbank.jp)
  9. Creating High-Quality Mass Spectra
     o Automatic MS and MS/MS recalibration and clean-up
     o Remove interfering peaks
     o Spectral annotation with experimental details and compound information
     https://github.com/MassBank/RMassBank/
     http://bioconductor.org/packages/RMassBank/
     Stravs, Schymanski, Singer and Hollender, 2013, Journal of Mass Spectrometry, 48, 89–99. DOI: 10.1002/jms.3131
     16,004 (61 %*) MS/MS spectra; 1,269 (18 %*) substances
     *% of all open LC-MS/MS data
  10. Record Specifications – Well Defined
      https://github.com/MassBank/MassBank-web/tree/master/Documentation
  11. Concept of RMassBank Processing
      o Connect spectrum and compound information
      o Users need to contribute only a bare minimum of information
      o Identification:
        • Only the user knows what compound has been measured
        • At least one form of (unambiguous) compound identifier is required. Typically: internal ID, name, SMILES, retention time
        • Use web services to fill in the rest => reduces manual errors
      o Measurement:
        • Measurement parameters, methods, settings are consistent
        • Added in batch form via a settings file
      Stravs et al, 2013, JMS, 48, 89–99. DOI: 10.1002/jms.3131
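The identification step above can be pictured as a small data structure. This is a minimal Python sketch, not RMassBank code (RMassBank itself is an R package): the class `CompoundEntry` and its field names are hypothetical, chosen only to mirror the "internal ID, name, SMILES, retention time" items on the slide. It checks that the user supplied at least one identifier and lists which fields a later web-service lookup would still have to fill in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompoundEntry:
    """One row of a hypothetical compound list (field names are illustrative)."""
    internal_id: int
    name: Optional[str] = None
    smiles: Optional[str] = None
    cas: Optional[str] = None
    retention_time_min: Optional[float] = None

    def _identifiers(self) -> dict:
        # Identifier fields the user may supply; order is kept for reporting.
        return {"name": self.name, "smiles": self.smiles, "cas": self.cas}

    def has_identifier(self) -> bool:
        # The slide asks for at least one compound identifier from the user.
        return any(v and v.strip() for v in self._identifiers().values())

    def missing_fields(self) -> List[str]:
        # Fields that a web-service lookup step would still need to complete.
        return [k for k, v in self._identifiers().items() if not (v and v.strip())]


if __name__ == "__main__":
    entry = CompoundEntry(internal_id=1, smiles="c1ccccc1", retention_time_min=5.2)
    print(entry.has_identifier(), entry.missing_fields())
```

Whether a name alone counts as an unambiguous identifier is a judgment the sketch leaves to the contributor; it only checks that something was provided.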
  12. Concept of RMassBank Processing
      o Web services: let these do the work for you!
      o CACTUS Chemical Identifier Resolver
        o http://cactus.nci.nih.gov/chemical/structure
        o SMILES (c1ccccc1) to InChIKey (UHOVQNZJYSORNB-UHFFFAOYSA-N)
      o Chemical Translation Service (CTS)
        o http://cts.fiehnlab.ucdavis.edu/
        o Names, CAS #, InChI and Identifiers (IDs, if available): PubChem CID, ChemSpider, ChEBI, HMDB, KEGG, LipidMaps
      o PubChem
        o https://pubchem.ncbi.nlm.nih.gov/
        o PubChem CID
      o ChemSpider
        o http://www.chemspider.com/
        o ChemSpider ID
      Behind the scenes … OpenBabel & rcdk
      Stravs et al, 2013, JMS, 48, 89–99. DOI: 10.1002/jms.3131
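The SMILES-to-InChIKey conversion named above can be sketched in Python as a plain HTTP lookup. This is an illustration, not how RMassBank calls the service. The base URL is the one on the slide; the `stdinchikey` path segment and the `InChIKey=` prefix on the response are assumptions about the resolver's URL scheme that should be checked against its documentation. All function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
CACTUS_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cactus_url(identifier: str, representation: str = "stdinchikey") -> str:
    # Percent-encode the identifier so characters such as '#' or '/' in a
    # SMILES string do not break the URL path.
    return f"{CACTUS_BASE}/{quote(identifier, safe='')}/{representation}"


def parse_inchikey(response_text: str) -> str:
    # Assumption: the service may prefix the key with "InChIKey="; strip it if so.
    text = response_text.strip()
    prefix = "InChIKey="
    return text[len(prefix):] if text.startswith(prefix) else text


def smiles_to_inchikey(smiles: str, timeout: float = 10.0) -> str:
    # Network call; raises urllib.error.URLError/HTTPError on failure.
    with urlopen(cactus_url(smiles), timeout=timeout) as resp:
        return parse_inchikey(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(cactus_url("c1ccccc1"))
```

Keeping URL construction and response parsing separate from the network call makes the lookup easy to check offline; results from any such service still need the manual verification step described on the next slide.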
  13. Auto-Completed Compound List
      o Exported to user for manual verification
  14. MassBank Naming Issues – Contributor Defined
  15. Chemicals we purchase are not always “MS-ready”
      Schymanski & Williams, 2017, ES&T, 51 (10), pp 5357–5359. DOI: 10.1021/acs.est.7b01908
  16. MassBank/CompTox Curation of External Data
      o A “nice” example: 4-4'-Bis(2-sulfostyryl)biphenyl
      Purchased: CAS: 27344-41-8, DTXSID6036467
      Registered: CAS: 38775-22-3 (UFZ), DTXSID7047017
  17. Two MassBank Lists on CompTox Chem. Dashboard
      Image: www.massbank.eu
  18. MassBank Reference Collection
      https://comptox.epa.gov/dashboard/chemical_lists/massbankref
      o All mass spectra measured with reference standards (Level 1)
  19. MassBank EU Special Cases
      o Literature Spectra, Supporting Information, Transformation Products, Complex Mixtures …
  20. Creating High-Quality Mass Spectra II
      o Automatic MS and MS/MS recalibration and clean-up
      o Remove interfering peaks
      o Spectral annotation with experimental details and compound information
      o MS/MS for further processing: Knowns, Suspects, Unknowns…
      https://github.com/MassBank/RMassBank/
      http://bioconductor.org/packages/RMassBank/
  21. Confidence Levels for Tentative Structures
      Schymanski, Jeon, Gulde, Fenner, Ruff, Singer & Hollender (2014) ES&T, 48 (4), 2097-2098. DOI: 10.1021/es5002105
      o Annotation is the key to communicating information
      Identification confidence (with minimum data requirements):
      o Level 1: Confirmed structure by reference standard (MS, MS2, RT, Reference Std.)
      o Level 2: Probable structure
        a) by library spectrum match (MS, MS2, Library MS2)
        b) by diagnostic evidence (MS, MS2, Exp. data)
      o Level 3: Tentative candidate(s): structure, substituent, class (MS, MS2, Exp. data)
      o Level 4: Unequivocal molecular formula, example C6H5N3O4 (MS isotope/adduct)
      o Level 5: Exact mass of interest, example 192.0757 (MS)
      [Figure: example structures for each level]
  22. Automating Confidence Levels
      Schymanski et al, 2014, ES&T. DOI: 10.1021/es5002105 & Schymanski et al. 2015, ABC, DOI: 10.1007/s00216-015-8681-7
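As a rough illustration of what automating the level scheme from slide 21 could look like, the Python sketch below maps a set of evidence flags to a level label by checking the strongest evidence first. It is not the procedure from the cited papers: the `Evidence` class, its flag names and the simple first-match ordering are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical evidence flags for one feature (names are illustrative)."""
    reference_standard: bool = False    # MS, MS2 and RT confirmed with a standard
    library_match: bool = False         # MS2 matches a library spectrum
    diagnostic_evidence: bool = False   # diagnostic fragments or other exp. data
    candidate_structures: bool = False  # tentative structure, substituent or class
    unequivocal_formula: bool = False   # formula fixed via isotope/adduct pattern


def confidence_level(e: Evidence) -> str:
    # Check from strongest to weakest evidence; the first match wins.
    if e.reference_standard:
        return "1"
    if e.library_match:
        return "2a"
    if e.diagnostic_evidence:
        return "2b"
    if e.candidate_structures:
        return "3"
    if e.unequivocal_formula:
        return "4"
    return "5"  # only an exact mass of interest is available


if __name__ == "__main__":
    print(confidence_level(Evidence(library_match=True, unequivocal_formula=True)))
```

In practice, deciding when a flag such as `diagnostic_evidence` is justified is the hard part and requires expert judgment; the sketch only shows how the labels relate once that decision has been made.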
  23. Suspect Screening: Benzotriazole TPs
      Huntscha et al. 2014, ES&T, 48(8), 4435-4443.
      [Figure: 1H-BT spectra on massbank.eu]
  24. Surfactant Screening From Literature
      Schymanski et al. (2014), ES&T, 48: 1811-1818. DOI: 10.1021/es4044374
      Literature sources:
      o Formulas, masses (ions), retention times and intensities
      o Spectra of selected compounds (different instruments)
      Gonzalez et al. Rapid Comm. Mass Spec. 2008, 22: 1445-54
      Lara-Martin et al. EST. 2010, 44: 1670-1676
      massbank.eu: 39 literature spectra (so far)
  25. Supporting Information Mass Spectra in MassBank
      Schymanski et al, 2014, ES&T. DOI: 10.1021/es5002105
      o Implementation of “levels” enables creation of supporting information collections
      o Several Eawag tentative/unknown collections following the 2014 Eawag Level Scheme (DOI 10.1021/es5002105)
        • Gulde et al 2016 (DOI 10.1021/acs.est.6b01301)
        • TPs already found in GNPS! http://goo.gl/NmO4tx
        • Rösch et al 2016 (DOI 10.1021/acs.est.5b05186)
        • …and many more
  26. Several Years of Automated Curation Issues … but…
      o Curation discussion on GitHub …
      Discussions will move to https://github.com/MassBank/MassBank-Curation
  27. …we have enabled world-wide MS exchange!
      Schymanski et al. 2015, ABC, DOI: 10.1007/s00216-015-8681-7
      NORMAN Suspect List Exchange: http://www.norman-network.com/?q=node/236
      Tentatively Identified Spectra: http://goo.gl/0t7jGp
      Hits in GNPS MassIVE datasets:
      TPs in skin: http://goo.gl/NmO4tx
      Surfactants: http://goo.gl/7sY9Pf
  28. Acknowledgements
      Questions?
      NORMAN MassBank: www.massbank.eu
      CompTox Chemistry Dashboard: https://comptox.epa.gov/
      Contact: emma.schymanski@uni.lu
  29. Target, Suspect and Non-Target Screening
      [Workflow diagram] Sampling, extraction (SPE), HPLC separation and HR-MS/MS, followed by three routes of increasing time, effort and number of compounds:
      o Knowns: target analysis; targets found
      o Suspects: suspect screening; suspects found; spectrum search (spectral match)
      o No prior knowledge: non-target screening; masses of interest (molecular formula); database search and structure generation
      Candidate selection (retention time, MS/MS, calculated properties)
      Confirmation and quantification of compounds present
  30. Supporting Evidence for Homologues
      Stravs et al. (2013), J. Mass Spectrom, 48(1):89-99. DOI: 10.1002/jms.3131
      [Figure: structure of SPA-9C, chain lengths m+n=6]
      Formulas: http://sourceforge.net/projects/genform/ Meringer et al, 2011, MATCH 65, 259-290
      Data: Schymanski et al. 2014, ES&T, 48: 1811-1818. DOI: 10.1021/es4044374
      Chromatography and MS/MS Annotation
      Literature: LIT00034,35
      Sample: ETS00002
      Standard: ETS00016,17,19,20
      https://github.com/MassBank/RMassBank/
  31. Cross-Linking Homologues in the Dashboard
      CDK Depict
      https://www.slideshare.net/AntonyWilliams/markush-enumeration-to-manage-mesh-and-manipulate-substances-of-unknown-or-variable-composition
  32. Cross-Linking Homologues in the Dashboard
      Schymanski, Grulke, Williams et al, in prep. & Williams et al. 2017 J. Cheminformatics 9:61 DOI: 10.1186/s13321-017-0247-6
      https://comptox.epa.gov/dashboard/chemical_lists/eawagsurf
  33. Enhancing Access to Mass Spectral Information
      Viniaxa, Schymanski, Navarro, Neumann, Salek, Yanes, 2016, TrAC, DOI: 10.1016/j.trac.2015.09.005
      [Figure legend] = HMDB, GNPS, MassBank, ReSpect
      Compound lists provided by: S. Stein, R. Mistrik, Agilent
      Most libraries still have many unique entries – intercomparability?
  34. SPLASH – Communicate between libraries
      Wohlgemuth et al., 2016, Nature Biotechnology 34, 1099-1101, DOI: 10.1038/nbt.3689
      splash10-0002-0900000000-b112e4e059e1ecf98c5f
      [version]-[top10]-[histogram]-[hash of full spectrum]
      http://mona.fiehnlab.ucdavis.edu/#/spectra/splash/splash10-0002-0900000000-b112e4e059e1ecf98c5f
      https://www.google.ch/search?q=splash10-0002-0900000000-b112e4e059e1ecf98c5f
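The four-block layout of a SPLASH shown above is easy to take apart in code. The Python sketch below only splits an existing SPLASH string into the blocks named on the slide and builds the MoNA lookup URL shown there; it does not compute a SPLASH from a spectrum. The names `Splash`, `parse_splash` and `mona_lookup_url` are hypothetical.

```python
from typing import NamedTuple

# URL prefix as shown on the slide for looking up a SPLASH in MoNA.
MONA_SPLASH_URL = "http://mona.fiehnlab.ucdavis.edu/#/spectra/splash/"


class Splash(NamedTuple):
    version: str        # e.g. "splash10"
    top10: str          # block labelled [top10] on the slide
    histogram: str      # block labelled [histogram]
    spectrum_hash: str  # block labelled [hash of full spectrum]


def parse_splash(splash: str) -> Splash:
    # Split on hyphens and tolerate whitespace around them, as in the slide layout.
    parts = [p.strip() for p in splash.split("-")]
    if len(parts) != 4 or not parts[0].startswith("splash") or not all(parts):
        raise ValueError(f"not a four-block SPLASH string: {splash!r}")
    return Splash(*parts)


def mona_lookup_url(splash: str) -> str:
    # Re-join the cleaned blocks so the URL never contains stray spaces.
    return MONA_SPLASH_URL + "-".join(parse_splash(splash))


if __name__ == "__main__":
    s = parse_splash("splash10 - 0002 - 0900000000 - b112e4e059e1ecf98c5f")
    print(s.spectrum_hash)
    print(mona_lookup_url("splash10-0002-0900000000-b112e4e059e1ecf98c5f"))
```

Because the identifier is a plain string, the same value can be pasted into MoNA or a general web search, which is the cross-library use the slide points to.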
  35. Homologues … UVCBs … Complex Mixtures
      Schymanski, Grulke, Williams et al, in prep. & Williams et al. 2017 J. Cheminformatics 9:61 DOI: 10.1186/s13321-017-0247-6
      https://comptox.epa.gov/dashboard/chemical_lists/eawagsurf