Structure Identification Using High Resolution
Mass Spectrometry Data and the EPA
CompTox Dashboard
Antony J. Williams, Andrew McEachran, Chris Grulke,
Elin Ulrich, Jennifer Smith, Jeff Edwards and Jon Sobus
November 2-3, 2016
SWEMSA 2016
http://www.orcid.org/0000-0002-2668-4821
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Who is NCCT?
• National Center for Computational Toxicology – part of EPA’s
Office of Research and Development
• Research driven by EPA’s Chemical Safety for Sustainability
Research Program
– Develop new approaches to evaluate the safety of chemicals
– Integrate advances in biology, biotechnology, chemistry, exposure
science and computer science
• Goal - To identify chemical exposures that may disrupt
biological processes and cause adverse outcomes.
Our Dashboard Applications
• Some of our Web-based Applications
Introducing Our Latest Dashboard
https://comptox.epa.gov
• >720,000 chemicals
• >14 years assembling data
Bisphenol A
Physicochemical Properties
ToxCast Bioassay Screening Data
Useful Meta Data
Functional Use and Composition
VERY Useful Meta Data
Dashboard: External Links to
Analytical Methods and Data
National Environmental Methods Index
RSC Analytical Abstracts
FOR-IDENT and MoNA
Previous Work with Suspect-Screening
ONE ASPECT of the dashboard is to
support Non-targeted Analysis
Rank-Ordering of “Known-Unknowns”
using ChemSpider
Some history…
• 2007 A Hobby Project
• 2009 ChemSpider Acquired
• May 2015 Joined EPA – what
we are showing is very new
Advanced MS Searches
Monoisotopic Mass Search
Found 344 results for '215.096 ± 0.005 amu'
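A search like this one amounts to keeping every candidate whose monoisotopic mass falls inside the tolerance window. The following is a minimal sketch, assuming a tiny hypothetical in-memory candidate list; it is not the Dashboard's actual implementation, and the function and variable names are illustrative only. The three masses are standard monoisotopic values rounded to four decimals.

```python
def mass_search(candidates, target, tol=0.005):
    """Return (name, mass) pairs with |mass - target| <= tol, closest first."""
    hits = [(name, mass) for name, mass in candidates if abs(mass - target) <= tol]
    return sorted(hits, key=lambda hit: abs(hit[1] - target))

# Hypothetical candidate list: (name, monoisotopic mass in amu).
CANDIDATES = [
    ("Atrazine", 215.0938),
    ("Caffeine", 194.0804),
    ("Bisphenol A", 228.1150),
]

hits = mass_search(CANDIDATES, target=215.096, tol=0.005)
print(hits)
```

An absolute window in amu is used here because that is how the query on the slide is expressed; a relative (ppm) tolerance would be a straightforward variation.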
Download to Excel
Download as SDF file
Formula Search
Found 8 results for 'C8H14ClN5'
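A formula search and a mass search are linked through the monoisotopic mass of the formula. As a rough sketch (assuming a simple formula parser that handles only element symbols followed by optional counts, with no parentheses, charges or isotopes), the mass of the formula queried above can be computed and compared with the ±0.005 amu window used in the mass search:

```python
import re

# Monoisotopic masses (amu) of the most abundant isotopes; only the
# elements needed for this example are listed.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "Cl": 34.96885268,
}

def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C8H14ClN5'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total

mass = monoisotopic_mass("C8H14ClN5")
print(round(mass, 4))
# True if the formula falls inside the 215.096 +/- 0.005 amu window.
print(abs(mass - 215.096) <= 0.005)
```

The parser and the element table are assumptions made for illustration; a production tool would rely on a cheminformatics library.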
Does the Dashboard Add Value?
721k structures
Does the Dashboard Add Value?
• Remember:
– Focus on high quality data and curation
– Data sources include EPA data sources and a focus on
environmental chemistry
• No “dilution” by chemical vendors
Dilution Example…
Morphine Skeleton
Bisphenol A as an example
ChemSpider: 1564 Structures
Bisphenol A as an example
Dashboard: 215 Structures
ChemSpider 6926 Results!!!
• Tacedinaline
• Methyl Red
• C.I. Disperse Yellow 3
Using Meta-Data to Sort Candidates
• Tacedinaline: Anti-cancer Drug
• Methyl Red: Microbiological Indicator Dye
• C.I. Disperse Yellow 3: Textile/Product Dye
Same top hits – different ranking
90 hits only versus 6926 hits
[Figure: Tacedinaline, Methyl Red and C.I. Disperse Yellow 3 in the results list; numeric labels 18, 17 and 4]
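Rank-ordering by metadata can be pictured as a sort on a per-candidate count (references in ChemSpider, data sources in the Dashboard). The sketch below is only an illustration under that assumption: the candidate names and counts are made up, and the tie-breaking rule (alphabetical by name) is an arbitrary choice, not something the slides specify.

```python
def rank_by_sources(candidates):
    """Sort candidates by descending data-source count; ties by name."""
    return sorted(candidates, key=lambda c: (-c["sources"], c["name"]))

def position_of(name, ranked):
    """1-based position of a named candidate in a ranked list."""
    return 1 + [c["name"] for c in ranked].index(name)

# Illustrative counts only, not values taken from the Dashboard.
candidates = [
    {"name": "candidate A", "sources": 3},
    {"name": "candidate B", "sources": 12},
    {"name": "candidate C", "sources": 7},
]

ranked = rank_by_sources(candidates)
print([c["name"] for c in ranked])
print(position_of("candidate C", ranked))
```

The position returned by `position_of` is the quantity tabulated in the Dashboard vs ChemSpider comparison: where the confirmed compound lands in the sorted candidate list.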
Chemical Identification
Dashboard vs ChemSpider
Sorted by number of references (ChemSpider) or data sources (Dashboard)
Monoisotopic Mass (+/- 0.005 amu) Search
Columns #1 to #5+ give the number of compounds found at that position in the sorted results.

Source of List              # of Compounds  Search Tool  Mean Position  Median Position  #1  #2  #3  #4  #5+
McEachran et al Wastewater  34              ChemSpider   1.8            1                28  5   0   0   1
                                            Dashboard    1.3            1                31  2   0   0   1
Misc. NTA Compounds         13              ChemSpider   2              1                7   5   0   0   1
                                            Dashboard    1.7            1                10  2   0   0   1
Bade et al (2016)           19              ChemSpider   2.1            1                11  2   5   0   1
                                            Dashboard    1.6            1                12  3   3   1   0
Rager et al (2016)          24              ChemSpider   2.25           1                15  2   1   2   4
                                            Dashboard    1.08           1                22  2   0   0   0
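The share of compounds ranked in the #1 position can be recomputed directly from the position counts in this table. The short sketch below does that arithmetic for the four lists shown (90 compounds in total); the ranking summary covers 162 chemicals, so its percentages need not match these. The mean position cannot be recovered the same way because the last column is an open-ended "#5+" bin.

```python
# Position counts (#1, #2, #3, #4, #5+) copied from the comparison table.
COUNTS = {
    "McEachran et al Wastewater": {
        "ChemSpider": [28, 5, 0, 0, 1], "Dashboard": [31, 2, 0, 0, 1]},
    "Misc. NTA Compounds": {
        "ChemSpider": [7, 5, 0, 0, 1], "Dashboard": [10, 2, 0, 0, 1]},
    "Bade et al (2016)": {
        "ChemSpider": [11, 2, 5, 0, 1], "Dashboard": [12, 3, 3, 1, 0]},
    "Rager et al (2016)": {
        "ChemSpider": [15, 2, 1, 2, 4], "Dashboard": [22, 2, 0, 0, 0]},
}

def top1_share(tool):
    """Return (#1 hits, total compounds, percentage in #1) for one tool."""
    first = sum(row[tool][0] for row in COUNTS.values())
    total = sum(sum(row[tool]) for row in COUNTS.values())
    return first, total, 100.0 * first / total

for tool in ("ChemSpider", "Dashboard"):
    first, total, pct = top1_share(tool)
    print(f"{tool}: {first}/{total} in #1 position ({pct:.1f}%)")
```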
Dashboard vs ChemSpider
Ranking Summary

                             Mass-based Searching     Formula-based Searching
                             Dashboard   ChemSpider   Dashboard   ChemSpider
Cumulative Average Position  1.3         2.2          1.2         1.4
% in #1 Position             85%         70%          88%         80%

• Selected peer-reviewed publications
• 162 total individual chemicals in search
The Confusion of Chemicals…
[Diagram: CAS-Structure “Sphere of Confusion”. Several CAS numbers of uncertain validity (CAS1 ? to CAS5 ?, NOCAS?; “Valid CAS-substance?”) and their data (Data1 to Data9) map many:1 onto one parent structure (no stereo, desalted) with a single formula and monoisotopic mass.]
Sources of confusion:
• Deleted CAS
• Invalid CAS
• Salt forms
• Complex forms
• Hydrate forms
• Approx mappings to mixtures
• Approx mappings to ill-defined substances
• Stereoisomers
• Unresolved tautomers
DSSTox_v2 Database & Cheminformatics Layer:
• Resolve CAS-structure mappings for accurate data mapping
• Collapse sphere to collect all data at parent structure-formula level
DSSTox List Curation Tool
Conflicts binned to facilitate curation
Helping to Curate Data
• We are helping to Curate Data (prior to
linking from our dashboard)
• Our discussions with Thomas and team –
“We all agree it is hard!”
• Approx. 80% of STOFF-IDENT is done…
Curating on CASRNs is DIFFICULT
• 36861-47-9 : SI00004220
• 38102-62-4 : SI00008957
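One inexpensive first check when curating on CASRNs is the built-in check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. The sketch below applies that rule; the function name is illustrative. Passing the check only shows that a number is well formed. It says nothing about whether the CASRN is active, deleted, or mapped to the right structure, which is the harder curation problem described here.

```python
import re

def is_valid_casrn(casrn):
    """Check CASRN format (2-7 digits, 2 digits, 1 check digit) and checksum."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    checksum = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))

for casrn in ("36861-47-9", "38102-62-4", "36861-47-8"):
    print(casrn, is_valid_casrn(casrn))
```

Both CASRNs listed above pass the checksum, while the deliberately altered third value does not.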
Collisions in CAS Numbers
Active and Deleted CASRN
But there are MANY CASRNs!
• http://web.stanford.edu/group/swain/cinf/casreg/snumber.html
How Bad Can It Get??
How Bad Can It Get?
This one has 316 deleted CASRNs
Helping to Curate Data
• We are missing 159 CAS Numbers listed in STOFF-IDENT
• Work on curating and mapping MASSBANK is underway. Much bigger!! Way more work
Our OPEN Data is available…
• Various types of data at FTP download site:
ftp://newftp.epa.gov/COMPTOX/Sustainable_Chemistry_Data/Chemistry_Dashboard
Data Availability
• The data are now available in METLIN
• Available in MetFrag (alpha) and in testing.
Coming December 2016
Batch Searching Names/CASRNs
• What are these chemicals?
Coming December 2016
Batch Searching…
Coming December 2016
Download to Excel
In-testing
Metadata included for Ranking
Need for “MS-Ready Structures”
“QSAR-Ready Structures”
• For the purpose of building QSAR Models
we already “standardize” structures
– Desalt/Neutralize
– Desolvate
– Remove stereochemistry
• Some minor tweaks get us “MS-ready
Structures”. ALREADY in our database.
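To make the standardization steps concrete, here is a deliberately crude, string-level sketch of two of them, operating on SMILES text: desalting/desolvating by keeping the longest dot-separated component, and removing stereochemistry by dropping stereo markers. This is an assumption-laden toy, not the workflow used for the database: string length is only a proxy for fragment size, neutralization is not attempted, and a real pipeline would use a cheminformatics toolkit.

```python
def largest_component(smiles):
    """Desalt/desolvate crudely: keep the longest dot-separated component."""
    return max(smiles.split("."), key=len)

def strip_stereo(smiles):
    """Drop tetrahedral (@) and double-bond (/ and \\) stereo markers."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")

def standardize(smiles):
    """Toy standardization: largest component, then no stereochemistry."""
    return strip_stereo(largest_component(smiles))

# An alanine hydrochloride salt: counter-ion and chiral marker removed.
print(standardize("C[C@H](N)C(=O)O.Cl"))
# A difluoroethene stereoisomer written with water: solvent and E/Z removed.
print(standardize("F/C=C/F.O"))
```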
“QSAR-Ready Structures”
• Mass and Formula-based searches will be
based on MS-ready structures but
connected to the original chemical (with
name, CAS, rank ordering)
• MS-ready structures and substance
mappings will be available as Open Data
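The many-to-one relationship described above (several original substances sharing one MS-ready structure) can be sketched as a simple index from structure key to substances. Everything in this example is hypothetical: the record layout, the placeholder keys and the substance names are invented for illustration and do not reflect the actual database schema.

```python
from collections import defaultdict

# Hypothetical records: (original substance name, MS-ready structure key).
RECORDS = [
    ("substance A (free base)", "KEY-1"),
    ("substance A hydrochloride", "KEY-1"),
    ("substance A hydrate", "KEY-1"),
    ("substance B", "KEY-2"),
]

def build_index(records):
    """Map each MS-ready key to every original substance that shares it."""
    index = defaultdict(list)
    for name, key in records:
        index[key].append(name)
    return dict(index)

index = build_index(RECORDS)
# A mass or formula hit on KEY-1 leads back to all three original substances.
print(index["KEY-1"])
```

Searching on the key and then expanding through such an index is one way a mass or formula hit could stay connected to names, CASRNs and ranking metadata of the original chemicals.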
Rank-Ordering – incl. PubChem
Future Work
• Continue to research rank-ordering approaches
• Working on “retention time prediction”
• Search for adducts (+Na, +K, +NH4) and handle
decarboxylation, loss of water, etc.
• Additional links to methods – CDC NIOSH
• Expand link outs to Mass Spec databases –
Thermo’s mzCloud, Massbank, etc.
• Predicting metabolites and degradants
• Optimize web services for the community
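As an illustration of the adduct item in the list above, an adduct search can be reduced to shifting masses by a fixed amount per adduct. The sketch below assumes singly charged positive-mode ions and uses standard monoisotopic shifts with the electron mass subtracted; [M+H]+ is included for reference although the list names only +Na, +K and +NH4. The function names are illustrative, and this is not a description of how the Dashboard will implement the feature.

```python
# Shift (amu) from neutral monoisotopic mass M to the m/z of a singly
# charged positive-mode adduct; the electron mass is already subtracted.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}

def adduct_mz(neutral_mass):
    """Expected m/z for each adduct of a given neutral mass."""
    return {adduct: neutral_mass + shift for adduct, shift in ADDUCT_SHIFTS.items()}

def neutral_candidates(observed_mz):
    """Neutral masses implied by an observed m/z under each adduct assumption."""
    return {adduct: observed_mz - shift for adduct, shift in ADDUCT_SHIFTS.items()}

for adduct, mz in adduct_mz(215.0938).items():
    print(f"{adduct}: {mz:.4f}")
```

Each value returned by `neutral_candidates` could then be fed to an ordinary monoisotopic mass search, one per adduct hypothesis.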
Conclusions
• Only 1 aspect of the dashboard is focused on
MS – to support the EPA NTA Trial underway
• We should work on data curation TOGETHER!
• We are “part” of the solution. Our Open Data
and Open Services should be of value.
Acknowledgements
EPA NCCT
Chris Grulke
Jeff Edwards
Ann Richard
Jennifer Smith
Andrew McEachran*
EPA NERL
Jon Sobus
Seth Newton
Elin Ulrich
* = ORISE Participant

Editor's Notes

  • #32 For example, Rager was actually 33 confirmed; Bade was 25