The European MassBank server (www.massbank.eu) was founded in 2012 by the NORMAN Network (www.norman-network.net) to provide open access to mass spectra of substances of environmental interest contributed by NORMAN members. The automated workflow RMassBank was developed as part of this effort (Stravs et al. 2013, DOI: 10.1002/jms.3131; https://github.com/MassBank/RMassBank/). This workflow included automated processing of the mass spectral data, as well as automated annotation using the SMILES, names and CAS numbers provided by the user. Cheminformatics toolkits (e.g. Open Babel, rcdk) and web services (e.g. the CACTUS Chemical Identifier Resolver, the Chemical Translation Service (CTS), ChemSpider, PubChem) were then used to convert and/or retrieve the remaining information needed to complete the MassBank records (additional names, InChIs, InChIKeys, several database identifiers, mol files), to avoid placing an excessive burden on users and to reduce the chance of errors. To date, approximately 16,000 MS/MS spectra (61 %*) corresponding to 1,269 unique chemicals (18 %*) have been uploaded to MassBank.EU via RMassBank (*of all open data as of Nov. 2016). Curating the MassBank.EU records, as part of efforts to provide EPA CompTox Dashboard identifiers (DTXSIDs) for each record, revealed several data quality issues. In addition, the representation of “ambiguous substances”, for example complex surfactant mixtures of various chain lengths and branching or incompletely defined structures of transformation products, is an ongoing challenge. While “ambiguous structures” cannot be represented in the majority of cheminformatics tools, we report on proof-of-concept solutions in this work. This presentation reflects on how effectively the original RMassBank concept, automated structure annotation with open resources, has streamlined spectra contributions from external laboratories and users with widely ranging cheminformatics experience, and identifies the pitfalls of this approach.
Note: this work does not necessarily reflect U.S. EPA policy.
Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls
1. 1
Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls
Emma Schymanski
Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg.
Email: emma.schymanski@uni.lu
Michael A. Stravs (Eawag, Dübendorf, Switzerland)
Tobias Schulze (Helmholtz Centre for Environmental Research, Germany)
Antony J. Williams (NCCT, US EPA, Research Triangle Park, NC, USA)
The views expressed in this presentation are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency.
2. 2
MassBank: Japan, Europe, America ….
Horai et al. 2010. JMS, 45(7) pp 703-714. DOI: 10.1002/jms.1777
www.massbank.jp, www.massbank.eu, http://mona.fiehnlab.ucdavis.edu/
o MassBank started as a public repository in Japan, 2006
o No standard analytical method
o Includes many different data types (GC, LC, MS, MS/MS, HR, LR, AM…)
o Contributor is responsible for data quality
o NORMAN network of reference laboratories, research centres and related organisations for monitoring of emerging environmental substances
o Many different laboratories with different instruments & reference standards
o “Emerging substances” and TPs: not yet widely known; not yet in databases
o NORMAN joined MassBank in 2012 and founded MassBank.EU
o MassBank.JP and MassBank.EU are quite similar …
o MoNA (MassBank of North America) is the latest in the collection
o Completely different database concept
4. 4
MassBank Now
o www.massbank.jp & www.massbank.eu
MassBank now has 46,334 spectra* from 32 contributing institutes!
Contributions from European NORMAN member institutes
*Spectra numbers from http://mona.fiehnlab.ucdavis.edu/downloads
10,668 MS/MS
7. 7
European MassBank
http://massbank.eu/MassBank
o MassBank.EU was founded late 2012, hosted at UFZ, Leipzig, Germany
o 16,017 MS/MS spectra; 1,232 substances from NORMAN members
o Tentative/unknown/literature spectra on massbank.eu (not massbank.jp)
9. 9
Creating High-Quality Mass Spectra
Automatic MS and MS/MS Recalibration and Clean-up
Remove interfering peaks
Spectral Annotation with
- Experimental Details
- Compound Information
https://github.com/MassBank/RMassBank/
http://bioconductor.org/packages/RMassBank/
Stravs, Schymanski, Singer and Hollender, 2013, Journal of Mass Spectrometry, 48, 89–99. DOI: 10.1002/jms.3131
16,004 (61 %*) MS/MS spectra
1,269 (18 %*) substances
*% of all open LC-MS/MS data
10. 10
Record Specifications – Well Defined
https://github.com/MassBank/MassBank-web/tree/master/Documentation
11. 11
Concept of RMassBank Processing
o Connect spectrum and compound information
o User needs to contribute only a bare minimum of information
o Identification:
• Only the user knows what compound has been measured
• At least one form of (unambiguous) compound identifier required
Typically: internal ID, name, SMILES, retention time
• Use web services to fill in the rest => reduces manual errors
o Measurement:
• Measurement parameters, methods, settings are consistent
• Added in batch form via a settings file
Stravs et al, 2013, JMS, 48, 89–99. DOI: 10.1002/jms.3131
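The "bare minimum of information" idea above can be sketched as a small compound-list check. This is an illustrative Python sketch only, not RMassBank code (RMassBank itself is an R package). The column names ID, Name, SMILES and RT are assumptions chosen to mirror "internal ID, name, SMILES, retention time", and the example rows are made-up values.

```python
import csv
import io

# Hypothetical column names mirroring "internal ID, name, SMILES, retention time".
REQUIRED_COLUMNS = ("ID", "Name", "SMILES", "RT")


def read_compound_list(text):
    """Parse a CSV compound list and flag rows without a structure identifier."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    compounds, problems = [], []
    for row in reader:
        # At least one unambiguous identifier is required; in this sketch
        # that role is played by the SMILES column.
        if not row["SMILES"].strip():
            problems.append(row["ID"])
        compounds.append(row)
    return compounds, problems


# Made-up example rows, for illustration only.
EXAMPLE = """ID,Name,SMILES,RT
1,Benzene,c1ccccc1,5.2
2,Unknown TP,,7.9
"""

compounds, problems = read_compound_list(EXAMPLE)
print(len(compounds), "compounds; IDs lacking a SMILES:", problems)
```

Rows flagged here are the ones for which web services would have nothing unambiguous to start from, which is where manual curation becomes necessary.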
12. 12
Concept of RMassBank Processing
o Web services: let these do the work for you!
o CACTUS Chemical Identifier Resolver
o http://cactus.nci.nih.gov/chemical/structure
o SMILES (c1ccccc1) to InChIKey (UHOVQNZJYSORNB-UHFFFAOYSA-N)
o Chemical Translation Service (CTS)
o http://cts.fiehnlab.ucdavis.edu/
o Names, CAS #, InChI and Identifiers (IDs, if available): PubChem CID, ChemSpider, ChEBI, HMDB, KEGG, LipidMaps
o PubChem
o https://pubchem.ncbi.nlm.nih.gov/
o PubChem CID
o ChemSpider
o http://www.chemspider.com/
o ChemSpider ID
Stravs et al, 2013, JMS, 48, 89–99. DOI: 10.1002/jms.3131
Behind the scenes …
OpenBabel & rcdk
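A minimal sketch of the SMILES to InChIKey lookup listed on this slide, written in Python rather than R and not taken from RMassBank. It assumes the resolver's URL pattern of base address, identifier and requested representation (here `stdinchikey`), and uses https where the slide gives http. The function names are hypothetical; `resolve` needs network access and is not called here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address from the slide (https used here instead of http).
CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cactus_url(identifier, representation="stdinchikey"):
    """Build a resolver URL; the identifier is percent-encoded because
    SMILES may contain characters such as '#' or '/'."""
    return f"{CACTUS_BASE}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier, representation="stdinchikey", timeout=10):
    """Query the resolver and return the response text (requires network)."""
    with urlopen(cactus_url(identifier, representation), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


print(cactus_url("c1ccccc1"))
```

Any response from a service like this should still be checked against the structure the contributor actually measured, since the contributor is the only one who knows what that was.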
19. 19
MassBank EU Special Cases
o Literature Spectra, Supporting Information, Transformation Products, Complex Mixtures …
20. 20
Creating High-Quality Mass Spectra II
Automatic MS and MS/MS Recalibration and Clean-up
Remove interfering peaks
Spectral Annotation with
- Experimental Details
- Compound Information
https://github.com/MassBank/RMassBank/
http://bioconductor.org/packages/RMassBank/
MS/MS for further processing
Knowns, Suspects, Unknowns…
21. 21
Confidence Levels for Tentative Structures
Schymanski, Jeon, Gulde, Fenner, Ruff, Singer & Hollender (2014) ES&T, 48 (4), 2097-2098. DOI: 10.1021/es5002105
o Annotation is the key to communicating information
[Figure: identification confidence levels, with an example and the minimum data requirements for each level]
Level 1: Confirmed structure by reference standard. Minimum data: MS, MS2, RT, Reference Std.
Level 2: Probable structure
a) by library spectrum match. Minimum data: MS, MS2, Library MS2
b) by diagnostic evidence. Minimum data: MS, MS2, Exp. data
Level 3: Tentative candidate(s): structure, substituent, class. Minimum data: MS, MS2, Exp. data
Level 4: Unequivocal molecular formula (example: C6H5N3O4). Minimum data: MS isotope/adduct
Level 5: Exact mass of interest (example: 192.0757). Minimum data: MS
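The level scheme on this slide can be written down as a small lookup structure, for example to annotate records with a level. The Python sketch below is illustrative only: the class and function names are hypothetical, and the pairing of levels with minimum data follows a reading of the flattened figure. It lists which levels the available evidence could support; choosing among them remains an expert decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceLevel:
    level: str
    description: str
    minimum_data: tuple


# Labels copied from the slide; the pairing is an interpretation of the figure.
LEVELS = (
    ConfidenceLevel("1", "Confirmed structure by reference standard",
                    ("MS", "MS2", "RT", "Reference Std.")),
    ConfidenceLevel("2a", "Probable structure by library spectrum match",
                    ("MS", "MS2", "Library MS2")),
    ConfidenceLevel("2b", "Probable structure by diagnostic evidence",
                    ("MS", "MS2", "Exp. data")),
    ConfidenceLevel("3", "Tentative candidate(s)",
                    ("MS", "MS2", "Exp. data")),
    ConfidenceLevel("4", "Unequivocal molecular formula",
                    ("MS isotope/adduct",)),
    ConfidenceLevel("5", "Exact mass of interest",
                    ("MS",)),
)


def levels_supported_by(available):
    """Return the levels whose minimum data requirements are all available."""
    have = set(available)
    return [lvl.level for lvl in LEVELS if set(lvl.minimum_data) <= have]


print(levels_supported_by({"MS", "MS2", "Exp. data"}))
```

Levels 2b and 3 share the same minimum data, so the data types alone cannot separate them; the difference lies in how conclusive the experimental evidence is.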
24. 24
Surfactant Screening From Literature
Schymanski et al. (2014), ES&T, 48: 1811-1818. DOI: 10.1021/es4044374
Literature sources
o Formulas, masses (ions), retention times and intensities
o Spectra of selected compounds (different instruments)
Gonzalez et al. Rapid Comm. Mass Spec. 2008, 22: 1445-54
Lara-Martin et al. EST. 2010, 44: 1670-1676
massbank.eu
39 literature spectra (so far)
25. 25
Supporting Information Mass Spectra in MassBank
Schymanski et al, 2014, ES&T. DOI: 10.1021/es5002105
o Implementation of “levels” enables creation of supporting information collections
o Several Eawag tentative/unknown collections following the 2014 Eawag Level Scheme (DOI 10.1021/es5002105)
• Gulde et al 2016 (DOI 10.1021/acs.est.6b01301)
• TPs already found in GNPS! http://goo.gl/NmO4tx
• Rösch et al 2016 (DOI 10.1021/acs.est.5b05186)
• …and many more
26. 26
Several Years of Automated Curation Issues … but…
Discussions will move to https://github.com/MassBank/MassBank-Curation
o Curation discussion on Github …
27. 27
…we have enabled world-wide MS exchange!
Schymanski et al. 2015, ABC, DOI: 10.1007/s00216-015-8681-7
NORMAN Suspect List Exchange:
http://www.norman-network.com/?q=node/236
Tentatively Identified Spectra:
http://goo.gl/0t7jGp
Hits in GNPS MassIVE datasets:
TPs in skin: http://goo.gl/NmO4tx
Surfactants: http://goo.gl/7sY9Pf
29. 29
Target, Suspect and Non-Target Screening
[Figure: screening workflow. Sampling, extraction (SPE), HPLC separation and HR-MS/MS feed three branches:]
o Target analysis (knowns): targets found
o Suspect screening (suspects): suspects found; suspects spectrum search: spectral match
o Non-target screening (no prior knowledge): masses of interest (molecular formula), followed by database search or structure generation
o Candidate selection (retention time, MS/MS, calculated properties)
o Confirmation and quantification of compounds present
[Axis label: Time, Effort & Number of Compounds…]
30. 30
Supporting Evidence for Homologues
Stravs et al. (2013), J. Mass Spectrom, 48(1):89-99. DOI: 10.1002/jms.3131
[Structure figure: SPA-9C, with indices m and n, m+n=6]
Formulas: http://sourceforge.net/projects/genform/
Meringer et al, 2011, MATCH 65, 259-290
Data: Schymanski et al. 2014, ES&T, 48: 1811-1818. DOI: 10.1021/es4044374
Chromatography and MS/MS Annotation
Literature: LIT00034,35
Sample: ETS00002
Standard: ETS00016,17,19,20
https://github.com/MassBank/RMassBank/
31. 31
Cross-Linking Homologues in the Dashboard
CDK Depict
https://www.slideshare.net/AntonyWilliams/markush-enumeration-to-manage-mesh-and-manipulate-substances-of-unknown-or-variable-composition
32. 32
Cross-Linking Homologues in the Dashboard
Schymanski, Grulke, Williams et al, in prep. & Williams et al. 2017 J. Cheminformatics 9:61 DOI: 10.1186/s13321-017-0247-6
https://comptox.epa.gov/dashboard/chemical_lists/eawagsurf
33. 33
Enhancing Access to Mass Spectral Information
Viniaxa, Schymanski, Navarro, Neumann, Salek, Yanes, 2016, TrAC, DOI: 10.1016/j.trac.2015.09.005
= HMDB, GNPS, MassBank, ReSpect
Compound lists provided by: S. Stein, R. Mistrik, Agilent
Most libraries still have many unique entries – intercomparability?
34. 34
SPLASH – Communicate between libraries
Wohlgemuth et al., 2016, Nature Biotechnology 34, 1099-1101, DOI: 10.1038/nbt.3689
splash10 - 0002 - 0900000000 - b112e4e059e1ecf98c5f
[version] - [top10] - [histogram] - [hash of full spectrum]
http://mona.fiehnlab.ucdavis.edu/#/spectra/splash/splash10-0002-0900000000-b112e4e059e1ecf98c5f
https://www.google.ch/search?q=splash10-0002-0900000000-b112e4e059e1ecf98c5f
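The four-block layout shown above lends itself to a short parsing sketch. The Python below is illustrative only and is not the reference SPLASH implementation: the field names follow the labels on the slide, while the character classes in the pattern are an assumption based on the single example shown.

```python
import re
from typing import NamedTuple


class Splash(NamedTuple):
    version: str
    top10: str
    histogram: str
    spectrum_hash: str


# Assumed pattern: four hyphen-separated blocks, as in the slide's example.
SPLASH_PATTERN = re.compile(r"^(splash\d+)-([0-9a-z]+)-([0-9a-z]+)-([0-9a-f]+)$")


def parse_splash(text):
    """Split a SPLASH string into its blocks, tolerating spaces around
    the hyphens as printed on the slide."""
    compact = re.sub(r"\s*-\s*", "-", text.strip())
    match = SPLASH_PATTERN.match(compact)
    if match is None:
        raise ValueError(f"not a recognised SPLASH string: {text!r}")
    return Splash(*match.groups())


example = parse_splash("splash10 - 0002 - 0900000000 - b112e4e059e1ecf98c5f")
print(example)
```

Because the identifier is a plain string, the same spectrum can be looked up in any library or search engine that indexes it, which is the point of the two URLs above.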
35. 35
Homologues … UVCBs … Complex Mixtures
Schymanski, Grulke, Williams et al, in prep. & Williams et al. 2017 J. Cheminformatics 9:61 DOI: 10.1186/s13321-017-0247-6
https://comptox.epa.gov/dashboard/chemical_lists/eawagsurf