A presentation given at the 5th Metabolomics of North America webinar on September 8th, 2023. It provides an overview of the cheminformatics support provided by the DSSTox database, the CompTox Chemicals Dashboard and multiple other web-based applications in development.
Cheminformatics Support for MS Supporting Exposomics
1. Cheminformatics Support for Mass Spectrometry Supporting Exposomics at the US-EPA
September 2023: Metabolomics Association of North America
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Antony Williams
Center for Computational Toxicology and Exposure, US-EPA, RTP, NC
2. The role of cheminformatics at EPA
• Our branch is in the Center for Computational
Toxicology and Exposure (CCTE)
• We develop curated chemistry data streams to
support our applications and models
• We develop prediction models, web-based
applications and data streams to support others
• Today’s presentation: how do our efforts
support Exposomics and especially NTA efforts
– What’s public and what’s in development?
3. Why Does EPA Need Measurement Data?
• Measurement data needed to ensure chemical
safety
• Characterize risk
• Regulate use & disposal
• Manage human & ecological exposures
• Ensure compliance under federal statutes
[Figure: Chemical Monitoring Needs, linked to Hazard Identification, Dose-Response Assessment, Exposure Assessment and Risk Characterization]
4. Challenges
• High-quality monitoring data are unavailable for most chemicals
• Measurement data normally generated using “targeted” methods
• Targeted analytical methods:
- Require a priori knowledge of chemicals of interest
- Produce data for few selected analytes (10s-100s)
- Standards for method development & compound quantitation
- Are blind to emerging contaminants
- Can’t keep pace with needs of 21st century risk characterizations
• Data gaps being filled with exposure models and “NTA” methods
5. Relevant Questions of NTA Studies?
• Which chemicals are where?
• Do we see any “new” chemicals?
• Do observed co-occurrences highlight:
– Important exposure sources?
– Stressor-response relationships?
• What is the concentration of each chemical?
• Do estimated concentrations suggest unacceptable risk?
• How does cheminformatics support this effort?
6. Everything is underpinned by the DSSTox Database
• >1.2M substances
• Highly curated data
• Mapped relationships
• The data are made
available via the
Dashboard…
7. Accessing DSSTox chemistry: CompTox Chemicals Dashboard
• A publicly accessible website delivering:
– 1.2M chemicals with related property data
– Experimental/predicted physicochemical property data
– Experimental Human and Ecological hazard data
– Integration to “biological assay data” (ToxCast/Tox21)
– Information regarding chemicals in consumer products
– Links to other agency websites and public data resources
– Related substances: transformation products, metabolites
– “Batch searching” for tens to thousands of chemicals
11. Batch Searching is a big enabler
https://pubs.acs.org/doi/10.1021/acs.jcim.0c01273
12. Batch Searching
• Singleton searches are useful but we work
with thousands of masses and formulae!
• Typical questions
– What is the list of chemicals for the formula CxHyOz?
– What is the list of chemicals for a mass +/- error?
– Can I get chemical lists in Excel files? In SDF files?
– Can I include properties in the download file?
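The mass and formula questions above can be sketched in a few lines of Python. This is a minimal illustration against a tiny in-memory list, not the Dashboard's batch search itself; the `Chemical` class, the `LIBRARY` entries and the function names are assumptions made for the example, and the 5 ppm default is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chemical:
    name: str
    formula: str
    monoisotopic_mass: float


# Tiny stand-in for a reference database (illustrative entries only).
LIBRARY = [
    Chemical("Atenolol", "C14H22N2O3", 266.16304),
    Chemical("Practolol", "C14H22N2O3", 266.16304),
    Chemical("Caffeine", "C8H10N4O2", 194.08038),
]


def search_by_mass(mass, library, ppm=5.0):
    """Return chemicals whose monoisotopic mass lies within +/- ppm of mass."""
    delta = mass * ppm / 1e6
    return [c for c in library if abs(c.monoisotopic_mass - mass) <= delta]


def search_by_formula(formula, library):
    """Return chemicals whose molecular formula string matches exactly."""
    return [c for c in library if c.formula == formula]


def batch_search(masses, library, ppm=5.0):
    """Run the mass search for every query mass and collect candidate names."""
    return {m: [c.name for c in search_by_mass(m, library, ppm)] for m in masses}
```

A single mass can return several isomers (here atenolol and practolol share a formula), which is why candidate ranking is needed after the search.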
16. Chemical Lists
• Chemical lists are focused on regulations,
specific research efforts and categories
• 425 lists and growing
– TSCA Inventory
– Clean Water Act Hazardous Substances
– Consumer Products database
– Chemicals of Emerging Concern
– PFAS lists
– Extractables and Leachables
– …lists are versioned and updated and new lists added
24. Benefits of bringing it all together
• The true dashboard benefit is integration
• Rank potential candidates for toxicity using
available data – hazard, exposure, in vitro
25. Supporting Exposomics Research
• DSSTox database substances map to
– Their structures (mass/formulae/InChIs etc.)
– Hazard data: human, mammalian and ecotox
– Exposure data: products in commerce, categories and
functional use, measured concentrations, etc.
• There are many types of metadata that can
be used for candidate ranking (old approach)
26. Data Source Ranking of “known unknowns”
• A mass and/or formula search is for a chemical that is unknown to the analyst but is a known chemical contained within a reference database
• Most likely candidate chemicals
have the most associated data
sources, most associated
literature articles or both
[Figure: a formula (C14H22N2O3) or mass (266.16304) is searched against a chemical reference database, returning sorted candidate structures]
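The formula and mass on this slide are consistent with each other: 266.16304 is the monoisotopic mass of C14H22N2O3. A minimal sketch of that calculation follows; the parser handles only simple formulae without brackets or charges, and the element table is a small subset chosen for the example.

```python
import re

# Monoisotopic masses of the most abundant isotopes (subset for illustration).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "F": 18.998403,
    "P": 30.973762,
    "S": 31.972071,
    "Cl": 34.968853,
}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C14H22N2O3' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


print(round(monoisotopic_mass("C14H22N2O3"), 5))  # 266.16304
```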
27. Data Streams for Ranking
• Dashboard Data Sources
• PubChem Data Source Count
• PubMed Reference Count
• ToxCast in vitro bioactivity
• Presence in Consumer Products database
• Predicted physicochemical Properties
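As a rough illustration of data source ranking, candidates returned for one mass or formula can be sorted by how much associated data they carry. The sketch below is an assumption-laden toy: the `Candidate` fields, the tie-breaking order and all counts are hypothetical and do not reproduce the Dashboard's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    data_sources: int       # number of associated data sources
    pubmed_refs: int        # number of associated literature articles
    in_products: bool       # present in a consumer products database


def rank_candidates(candidates):
    """Sort candidates so those with the most associated data come first.

    Ties on data source count are broken by literature count, then by
    presence in consumer products (an assumed ordering for illustration).
    """
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_refs, c.in_products),
        reverse=True,
    )


# Hypothetical isomeric candidates with made-up counts.
candidates = [
    Candidate("candidate_A", data_sources=12, pubmed_refs=40, in_products=False),
    Candidate("candidate_B", data_sources=95, pubmed_refs=3100, in_products=True),
    Candidate("candidate_C", data_sources=12, pubmed_refs=75, in_products=False),
]

for rank, c in enumerate(rank_candidates(candidates), start=1):
    print(rank, c.name)
```

Sorting on a tuple keeps the logic transparent; a weighted score over several data streams would be a natural alternative once more metadata types are combined.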
28. BIG databases are GREAT!
[Figure: bar chart of the number of chemical substances (log scale, 10^4 to 10^9) in PubChem, CAS Registry, ChemSpider, EPA DSSTox and Blood Exposome]
• Thanks to all of the public database efforts
• So much benefit from what’s been done
• There are hundreds of them at this point…
29. Is a bigger database better?
• ChemSpider contained 26 million chemicals at the time of the original work
• Much BIGGER today
• Is bigger better??
• Are there other metadata to use for ranking?
30. Comparing Search Performance
• Comparison made when the Dashboard contained 720k chemicals
• Only about 3% of ChemSpider's size
• What was the comparison in performance?
37. PubChem – “virtual chemistry”
• Other databases grow quickly…a lot of “virtual
chemistry” and “make on demand” compounds.
• Efforts such as the BloodExposome and
PubChemLite are critical to focus efforts
38. Applications at the EPA
• We have ongoing efforts applying
NTA to multiple challenges including
– PFAS identification
– Pesticides in various matrices
– CECs in water
– Biosolids
• Examples include…
40. Example 1: Consumer Product Analysis
• Many chemicals observed in consumer product extracts
• Many observed chemicals known to be in consumer products
• More observed chemicals not known to be in consumer products
• Why might the ‘other’ chemicals be in the products?
42. Example 2: Recycled Product Analysis
• Significant differences between chemicals in recycled vs. virgin products for certain product & use categories
• Most differences observed in paper products and construction materials
• Some uses (e.g., fragrances) highly represented across all product/use categories
44. Supporting Exposomics Research
• DSSTox database substances map to
– Their structures (mass/formulae/InChIs etc.)
– Hazard data: human, mammalian and ecotox
– Exposure data: products in commerce, categories
and functional use, measured concentrations, etc.
• Structures have to be standardized…
45. “MS-Ready Chemicals”
• MS-Ready chemical standardization is ESSENTIAL to our
support of Non-Targeted Analysis
• It links chemicals across the Dashboard and facilitates
detection linking back to products in commerce
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
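To give a feel for what linking through an MS-Ready form means, here is a deliberately crude, string-level sketch. It is not the published workflow (see the linked paper for that): it only splits a SMILES string into components, drops a small assumed set of counterion/solvent fragments and strips stereo markers. It does not neutralize charges or canonicalize structures, and the `COUNTERIONS` set and function names are assumptions.

```python
# Illustrative, non-exhaustive set of counterion/solvent fragments (assumption).
COUNTERIONS = {"Cl", "[Cl-]", "Br", "[Br-]", "[Na+]", "[K+]", "O"}

# SMILES stereo markers: '@' (tetrahedral), '/' and '\' (double bond).
STEREO_MARKS = str.maketrans("", "", "@/\\")


def ms_ready_forms(smiles):
    """Return the desalted, stereo-stripped components of a SMILES string."""
    fragments = smiles.split(".")
    kept = [f for f in fragments if f not in COUNTERIONS]
    return sorted({f.translate(STEREO_MARKS) for f in kept})


def link_by_ms_ready(substances):
    """Group substance names by the simplified forms they share."""
    links = {}
    for name, smiles in substances.items():
        for form in ms_ready_forms(smiles):
            links.setdefault(form, []).append(name)
    return links


substances = {
    "Atenolol": "CC(C)NCC(O)COc1ccc(CC(N)=O)cc1",
    "Atenolol hydrochloride": "CC(C)NCC(O)COc1ccc(CC(N)=O)cc1.Cl",
}
print(link_by_ms_ready(substances))
```

Both entries collapse to the same form, so a mass spectrometry detection of the neutral structure can be linked back to either registered substance. Matching fragments as raw strings only works when the SMILES are written identically; a real implementation would use a cheminformatics toolkit.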
46. Predicted Mass Spectra
http://cfmid.wishartlab.com/
• MS/MS spectra prediction for ESI+, ESI-, and EI
• Predictions generated for MS-Ready structures
• Use experimental vs predicted spectral searches
for candidate identification
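One common way to compare an experimental spectrum with predicted spectra is a cosine similarity over binned peaks. The sketch below assumes centroided peak lists of `(m/z, intensity)` pairs and a fixed bin width; the peak values are hypothetical, and this is a generic scoring approach rather than the specific scoring used with CFM-ID predictions.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two peak lists after binning (0.0 to 1.0)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists for illustration only.
experimental = [(116.107, 40.0), (145.065, 100.0), (190.086, 65.0)]
predicted = {
    "candidate_A": [(116.107, 35.0), (145.065, 100.0), (190.086, 50.0)],
    "candidate_B": [(91.054, 100.0), (119.049, 30.0)],
}

scores = {name: cosine_similarity(experimental, spec) for name, spec in predicted.items()}
best = max(scores, key=scores.get)
print(best)
```

Fixed bins are simple but can split two nearly identical m/z values across a bin edge; tolerance-based peak matching avoids that at the cost of a little more code.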
47. Predicted Data Already Public
Publication and Data Files
https://epa.figshare.com/articles/CFM-ID_Paper_Data/7776212/1
51. Candidate Identification is only PART of the process
• Whatever the approach for candidate identification, chemical hazard is important
• Hazard Comparison Profiling is important
https://www.epa.gov/chemical-research/cheminformatics
53. AMOS: Analytical Methods and Spectra Database
• Three types of data in the database:
– Methods (regulatory, lab manuals and SOPs, publications,
tech notes)
– Spectra (from public domain and our own laboratories)
– Fact Sheets (harvested from SWGDRUG and other sites)
• Some methods have associated spectra
• Some data are just externally linked
• Currently contains around 200,000 spectra,
700,000 external links, 3000 “Fact Sheets”
and ~4000 methods
• ALL data are growing in number
66. Manual Curation and Annotation
Analytical QC data for Tox21
• ~9000 chemicals with tens of thousands of
spectra (LCMS, GCMS & NMR)
• These data will feed prediction algorithms…
70. UVCBs challenge in non-target analysis
Homologue screening plots from Swiss Wastewater (Schymanski et al. 2014, left) and Novi Sad (right)
• Complex mixtures (UVCBs) are a huge
and very challenging part of the
unknowns in many environmental
samples
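Homologous series in such mixtures differ by a repeating unit, and one widely used way to spot them is the Kendrick mass defect (KMD): after rescaling masses so that CH2 weighs exactly 14, members of a CH2 series share nearly the same defect. The sketch below is a generic illustration of that idea, not the method used to produce the plots on this slide; the function names and the 0.002 tolerance are assumptions.

```python
CH2_EXACT = 14.01565   # monoisotopic mass of the CH2 repeating unit
CH2_NOMINAL = 14


def kendrick_mass_defect(mass):
    """Kendrick mass defect on the CH2 scale (nominal minus Kendrick mass)."""
    kendrick_mass = mass * CH2_NOMINAL / CH2_EXACT
    return round(kendrick_mass) - kendrick_mass


def group_homologues(masses, tolerance=0.002):
    """Group masses whose CH2-based KMD agrees within the tolerance.

    Only groups with at least two members are returned.
    """
    groups = []
    for mass in sorted(masses, key=kendrick_mass_defect):
        kmd = kendrick_mass_defect(mass)
        if groups and abs(kmd - groups[-1]["kmd"]) <= tolerance:
            groups[-1]["masses"].append(mass)
        else:
            groups.append({"kmd": kmd, "masses": [mass]})
    return [sorted(g["masses"]) for g in groups if len(g["masses"]) >= 2]


# C10, C11 and C12 saturated fatty acids plus one unrelated mass (caffeine).
masses = [172.14633, 194.08038, 186.16198, 200.17763]
print(group_homologues(masses))
```

Matching KMD values alone can produce false groupings in dense data, so practical screening usually also checks that mass differences are multiples of the repeating unit and that retention times trend in a consistent direction.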
78. Conclusions
• Our data resources underpin our research
efforts – data quality and curation are key
• Our web-based applications deliver our data
to the community for multiple use cases
• Our support for Exposomics is multi-fold
– Curated chemistry data streams
– Experimental and predicted properties, toxicity, etc.
• The NTA WebApp in development will use
all of these data streams to support analysis
79. Acknowledgments
• DSSTox curation team
• CCTE IT team for software development, DevOps
• Mass spectrometry scientists across EPA,
especially the NTA team
• Open Databases: PubChem, ChEMBL, MoNA,
MassBank, GNPS, SWGDRUG, Cayman Chems.
• Instrument vendors – many have contributed
methods to the AMOS database
• …and thank you for your time
80. Contact Information
• Contact info: williams.antony@epa.gov
• Send methods for inclusion in AMOS
• We fully support Open Data so ask us for what
you need
• Slides at: https://www.slideshare.net/AntonyWilliams/