The iCSS CompTox Dashboard is a publicly accessible dashboard provided by the National Center for Computational Toxicology at the US-EPA. It serves a number of purposes, including providing a chemistry database underpinning many of our public-facing projects (e.g. ToxCast and ExpoCast). The available data and searches provide a valuable path to structure identification using mass spectrometry as the source data. With an underlying database of over 720,000 chemicals, the dashboard has already been used to assist in identifying chemicals present in house dust. However, it can also be applied to many other purposes, e.g., the identification of agrochemicals in waste streams. This presentation will provide a review of the EPA’s platform and underlying algorithms used for the purpose of compound identification using high-resolution mass spectrometry data. In order to examine its performance for structure identification, especially in terms of rank-ordering database hits, we have compared it with the ChemSpider database, a well-regarded public database that has become one of the community standards for structure identification. The study has shown that the CompTox Dashboard outperforms ChemSpider in terms of structure identification and ranking, providing improved outcomes for mass spectrometry analysis of “known unknowns”.
Structure Identification Using High Resolution Mass Spectrometry Data and the EPA CompTox Dashboard
1. Structure Identification Using High Resolution
Mass Spectrometry Data and the EPA
CompTox Dashboard
Antony J. Williams, Andrew McEachran, Chris Grulke,
Elin Ulrich, Jennifer Smith, Jeff Edwards and Jon Sobus
November 2-3, 2016
SWEMSA 2016
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
2. Who is NCCT?
• National Center for Computational Toxicology – part of EPA’s
Office of Research and Development
• Research driven by EPA’s Chemical Safety for Sustainability
Research Program
– Develop new approaches to evaluate the safety of chemicals
– Integrate advances in biology, biotechnology, chemistry, exposure
science and computer science
• Goal - To identify chemical exposures that may disrupt
biological processes and cause adverse outcomes.
24. Does the Dashboard Add Value?
• Remember:
– Focus on high quality data and curation
– Data sources include EPA data sources and a focus on
environmental chemistry
• No “dilution” by chemical vendors
29. Using Meta-Data to Sort Candidates
• Anti-cancer Drug
• Microbiological Indicator Dye
• Textile/Product Dye
30. Same top hits – different ranking
90 hits only versus 6926 hits
[Figure: the top hits shown are Tacedinaline, Methyl Red and C.I. Disperse Yellow 3; the slide annotates them with the values 18, 17 and 4.]
31. Chemical Identification
Dashboard vs ChemSpider
Sorted by number of references (ChemSpider) or data sources (Dashboard)
Monoisotopic Mass (+/- 0.005 amu) Search
Position of compound sorted: number of compounds found at rank #1, #2, #3, #4 or #5+

Source of List              # of Compounds  Search Tool  Mean Position  Median Position  #1  #2  #3  #4  #5+
McEachran et al Wastewater  34              ChemSpider   1.8            1                28  5   0   0   1
                                            Dashboard    1.3            1                31  2   0   0   1
Misc. NTA Compounds         13              ChemSpider   2              1                7   5   0   0   1
                                            Dashboard    1.7            1                10  2   0   0   1
Bade et al (2016)           19              ChemSpider   2.1            1                11  2   5   0   1
                                            Dashboard    1.6            1                12  3   3   1   0
Rager et al (2016)          24              ChemSpider   2.25           1                15  2   1   2   4
                                            Dashboard    1.08           1                22  2   0   0   0
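The search behind this comparison can be sketched in a few lines: keep every candidate whose monoisotopic mass lies within +/- 0.005 amu of the query, then sort by the number of data sources (or references). This is a minimal illustration only, not the Dashboard's or ChemSpider's implementation; the Candidate class, the function names and all records below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    monoisotopic_mass: float  # amu
    source_count: int         # number of data sources (or references)


def search_by_mass(candidates, query_mass, tolerance=0.005):
    """Return candidates within +/- tolerance of query_mass, most-sourced first."""
    hits = [c for c in candidates
            if abs(c.monoisotopic_mass - query_mass) <= tolerance]
    return sorted(hits, key=lambda c: c.source_count, reverse=True)


def rank_of(hits, name):
    """1-based position of a named compound in the sorted hit list, or None."""
    for position, hit in enumerate(hits, start=1):
        if hit.name == name:
            return position
    return None


# Placeholder records (assumed values, not Dashboard or ChemSpider data).
library = [
    Candidate("compound_A", 200.1000, 4),
    Candidate("compound_B", 200.1020, 18),
    Candidate("compound_C", 200.1040, 17),
    Candidate("compound_D", 200.1200, 50),  # outside the mass window
]

hits = search_by_mass(library, query_mass=200.1010)
hit_names = [h.name for h in hits]
```

The position returned by rank_of for the known compound is the quantity that the table's #1 to #5+ columns count per list.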
32. Dashboard vs ChemSpider
Ranking Summary
                             Mass-based Searching       Formula-based Searching
                             Dashboard    ChemSpider    Dashboard    ChemSpider
Cumulative Average Position  1.3          2.2           1.2          1.4
% in #1 Position             85%          70%           88%          80%
• Selected peer-reviewed publications
• 162 total individual chemicals in search
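As a worked example of how a "% in #1 Position" figure is tallied, the sketch below sums the rank counts from the four lists in the slide 31 mass-search table. Those four lists cover 90 compounds, fewer than the 162 chemicals in this summary, so the results are not expected to reproduce the 85% and 70% reported here. The mean position cannot be recomputed exactly from the counts because the #5+ column pools all lower ranks. The variable and function names are illustrative.

```python
# (source list, search tool) -> compounds found at rank #1, #2, #3, #4, #5+
rank_counts = {
    ("McEachran et al Wastewater", "ChemSpider"): [28, 5, 0, 0, 1],
    ("McEachran et al Wastewater", "Dashboard"): [31, 2, 0, 0, 1],
    ("Misc. NTA Compounds", "ChemSpider"): [7, 5, 0, 0, 1],
    ("Misc. NTA Compounds", "Dashboard"): [10, 2, 0, 0, 1],
    ("Bade et al (2016)", "ChemSpider"): [11, 2, 5, 0, 1],
    ("Bade et al (2016)", "Dashboard"): [12, 3, 3, 1, 0],
    ("Rager et al (2016)", "ChemSpider"): [15, 2, 1, 2, 4],
    ("Rager et al (2016)", "Dashboard"): [22, 2, 0, 0, 0],
}


def total_compounds(tool):
    """Number of compounds searched with the given tool across all lists."""
    return sum(sum(counts) for (_, t), counts in rank_counts.items() if t == tool)


def percent_in_first(tool):
    """Percentage of compounds that the given tool ranked in position #1."""
    first = sum(counts[0] for (_, t), counts in rank_counts.items() if t == tool)
    return 100.0 * first / total_compounds(tool)


for tool in ("Dashboard", "ChemSpider"):
    print(f"{tool}: {percent_in_first(tool):.1f}% of {total_compounds(tool)} in #1")
```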
33. The Confusion of Chemicals…
[Diagram: CAS-Structure “Sphere of Confusion”. CAS numbers of uncertain validity (CAS1 to CAS5, plus a NOCAS entry) and their associated data (Data1 to Data9) map many:1 onto a parent structure (no stereo, desalted) with its formula and monoisotopic mass.]
Valid CAS-substance?
• Deleted CAS
• Invalid CAS
• Salt forms
• Complex forms
• Hydrate forms
• Approx mappings to mixtures
• Approx mappings to ill-defined substances
• Stereoisomers
• Unresolved tautomers
DSSTox_v2 Database & Cheminformatics Layer
– Resolve CAS-structure mappings for accurate data mapping
– Collapse sphere to collect all data at parent structure-formula level
35. Helping to Curate Data
• We are helping to Curate Data (prior to
linking from our dashboard)
• Our discussions with Thomas and team –
“We all agree it is hard!”
36. Helping to Curate Data
• We are helping to Curate Data (prior to
linking from our dashboard)
• Our discussions with Thomas and team –
“We all agree it is hard!”
• Approx. 80% of STOFF-IDENT is done…
38. Curating on CASRNs is DIFFICULT
• 36861-47-9 : SI00004220
• 38102-62-4 : SI00008957
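One reason curation on CASRNs is hard is that a format check only goes so far. The sketch below applies the standard CAS check-digit rule (digits weighted 1, 2, 3, ... from the right, summed modulo 10). Both identifiers on this slide pass it, so whatever makes them difficult is not visible to a checksum; detecting deleted or mis-mapped CASRNs needs a lookup against curated registry data, which this sketch does not attempt. The function name is illustrative.

```python
import re

# CASRN layout: 2 to 7 digits, 2 digits, 1 check digit, separated by hyphens.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def has_valid_cas_checksum(casrn: str) -> bool:
    """True if the string has CASRN format and a correct check digit."""
    match = _CAS_PATTERN.match(casrn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Rightmost body digit has weight 1, the next weight 2, and so on.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


for casrn in ("36861-47-9", "38102-62-4"):
    print(casrn, has_valid_cas_checksum(casrn))
```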
43. How Bad Can It Get?
This one has 316 deleted CASRNs
44. Helping to Curate Data
• We are missing 159
CAS Numbers listed in
STOFF-IDENT
• Work on curating and
mapping MASSBANK is
underway. Much
bigger!! Way more work
45. Our OPEN Data is available…
• Various types of data at FTP download site:
ftp://newftp.epa.gov/COMPTOX/Sustainable_Chemistry_Data/Chemistry_Dashboard
46. Data Availability
• The data are now available in METLIN
• Available in MetFrag (alpha) and in testing.
53. “QSAR-Ready Structures”
• For the purpose of building QSAR Models
we already “standardize” structures
– Desalt/Neutralize
– Desolvate
– Remove stereochemistry
• Some minor tweaks get us “MS-ready
Structures”. ALREADY in our database.
54. “QSAR-Ready Structures”
• Mass and Formula-based searches will be
based on MS-ready structures but
connected to the original chemical (with
name, CAS, rank ordering)
• MS-ready structures and substance
mappings will be available as Open Data
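The link between MS-ready structures and the original chemicals can be pictured as a many:1 index: several substances (salt, hydrate, parent) share one MS-ready structure, and a search hit on that structure leads back to every linked substance in rank order. The sketch below is a hypothetical data layout, not the DSSTox schema; the Substance type, the key format and all records are placeholders.

```python
from collections import defaultdict
from typing import NamedTuple


class Substance(NamedTuple):
    name: str
    casrn: str         # placeholder strings, not real CASRNs
    ms_ready_key: str  # hypothetical identifier of the MS-ready structure
    source_count: int  # used here for rank ordering


def build_index(substances):
    """Group substances by MS-ready structure, most-sourced first."""
    index = defaultdict(list)
    for substance in substances:
        index[substance.ms_ready_key].append(substance)
    for linked in index.values():
        linked.sort(key=lambda s: s.source_count, reverse=True)
    return dict(index)


substances = [
    Substance("parent acid hydrate", "CAS-PLACEHOLDER-3", "MSREADY-0001", 1),
    Substance("parent acid", "CAS-PLACEHOLDER-1", "MSREADY-0001", 12),
    Substance("sodium salt of parent acid", "CAS-PLACEHOLDER-2", "MSREADY-0001", 3),
    Substance("unrelated compound", "CAS-PLACEHOLDER-4", "MSREADY-0002", 7),
]

index = build_index(substances)
linked_names = [s.name for s in index["MSREADY-0001"]]
```

A mass or formula search would then match against the MS-ready structures and report the linked substances with their names and CASRNs.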
56. Future Work
• Continue to research rank-ordering approaches
• Working on “retention time prediction”
• Search for adducts (+Na, +K, +NH4) and handle
decarboxylation, loss of water, etc.
• Additional links to methods – CDC NIOSH
• Expand link outs to Mass Spec databases –
Thermo’s mzCloud, Massbank, etc.
• Predicting metabolites and degradants
• Optimize web services for the community
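The adduct item above amounts to a mass correction before the database search: for each adduct hypothesis, subtract the adduct's mass shift from the observed m/z to get a candidate neutral monoisotopic mass. The sketch below assumes singly charged positive ions and uses commonly tabulated shift values, which should be checked against your own reference; it is an illustration, not the planned Dashboard feature.

```python
# Mass shift (amu) added to the neutral monoisotopic mass M to give the
# observed m/z of a singly charged positive ion. Commonly tabulated values,
# listed here as assumptions.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M+NH4]+": 18.033823,
    "[M+H-H2O]+": 1.007276 - 18.010565,  # protonation with loss of water
}


def candidate_neutral_masses(observed_mz):
    """Map each adduct hypothesis to the neutral mass it would imply."""
    return {adduct: observed_mz - shift
            for adduct, shift in ADDUCT_SHIFTS.items()}


masses = candidate_neutral_masses(203.052609)
for adduct, mass in masses.items():
    print(f"{adduct:12s} {mass:.4f}")
```

Each candidate neutral mass can then be submitted to the same +/- tolerance mass search used for the protonated form.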
57. Conclusions
• Only 1 aspect of the dashboard is focused on
MS – to support the EPA NTA Trial underway
• We should work on data curation TOGETHER!
• We are “part” of the solution. Our Open Data
and Open Services should be of value.