
Non-targeted analysis supported by data and cheminformatics delivered via the US-EPA CompTox Chemicals Dashboard



Non-targeted analysis (NTA) uses high-resolution mass spectrometry to better understand the identity of a wide variety of chemicals present in environmental samples (and other matrices). However, data processing remains challenging due to the vast number of chemicals detected in samples, the software and computational requirements of data processing, and the inherent uncertainty in confidently identifying chemicals from candidate lists. Analysis of the resultant mass spectrometry information relies on cheminformatics to identify and rank chemicals, and the US EPA has developed functionality within the CompTox Chemicals Dashboard to address challenges related to this analysis. These tools include the generation of “MS-Ready” structures to optimize database searching, retention time prediction for candidate reduction, consensus ranking using chemical metadata, and in silico MS/MS fragmentation prediction for spectral matching. Combining these tools into a comprehensive workflow improves certainty in candidate identification. This presentation will review how the CompTox Chemicals Dashboard, through its flexible search capabilities, rich data for ~900,000 chemical substances, and visualization approaches, provides a freely available, open chemistry software tool to support structure identification and NTA. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.

Published in: Science


  1. 1. Non-targeted analysis supported by data and cheminformatics delivered via the US EPA CompTox Chemicals Dashboard • Antony Williams, Alex Chao, Tom Transue, Tommy Cathey, Elin Ulrich and Jon Sobus • 1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC 2) Oak Ridge Institute for Science and Education (ORISE) Research Participant, RTP, NC 3) GDIT, Research Triangle Park, North Carolina, United States 4) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC • August 2019 ACS Fall Meeting, San Diego • The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
  2. 2. An intro to the Dashboard • Freely available web-based database from the National Center for Computational Toxicology • Provides data for 875,000 substances, including – Experimental and predicted physicochemical properties – In vivo toxicity data harvested from dozens of public resources – In vitro bioactivity data for thousands of chemicals and assays – Exposure data including chemicals in consumer products – Real-time predictions for >20 physicochemical and toxicological endpoints • Dashboard is used by mass spectrometrists for chemical identification • A quick view of general capabilities…
  3. 3. CompTox Chemicals Dashboard: 875k Chemical Substances
  4. 4. Detailed Chemical Pages
  5. 5. Access to Chemical Hazard Data
  6. 6. Sources of Exposure to Chemicals
  7. 7. Link Access: links based on chemical identifiers to dozens of online resources, including analytical data
  8. 8. MassBank of North America
  9. 9. “MS-ready” structures
  10. 10. Overview of MS-Ready Structures • All structure-based chemical substances are algorithmically processed to – Split multicomponent chemicals into individual structures – Desalt and neutralize individual structures – Remove stereochemical bonds from all chemicals • MS-Ready structures are then mapped to the original substances to provide a path from chemicals detected by mass spectrometry back to the original substances
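The processing steps listed on the slide above can be illustrated at the SMILES-string level. This is a minimal sketch and not the Dashboard's actual algorithm: it assumes SMILES input, uses a small hypothetical `COUNTERIONS` set for desalting, does no canonicalization, and leaves out neutralization, which in practice needs a cheminformatics toolkit.

```python
import re

# Hypothetical, deliberately tiny set of counterion/solvent SMILES to drop.
COUNTERIONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "Br", "O"}


def ms_ready_components(smiles):
    """Return de-duplicated, stereo-free components of a SMILES string.

    Assumption: plain string handling only; charged components are
    NOT neutralized and strings are NOT canonicalized in this sketch.
    """
    components = []
    for part in smiles.split("."):  # split multicomponent records
        if part in COUNTERIONS:  # crude desalting
            continue
        # Remove chirality tags (@, @@) and cis/trans bond markers (/, \).
        flat = re.sub(r"[@/\\]", "", part)
        if flat and flat not in components:
            components.append(flat)
    return components


def map_to_substances(records):
    """Map each stripped component back to the substance IDs containing it."""
    mapping = {}
    for substance_id, smiles in records.items():
        for component in ms_ready_components(smiles):
            mapping.setdefault(component, []).append(substance_id)
    return mapping
```

With two hypothetical records, `{"S1": "C[C@H](N)C(=O)O", "S2": "C[C@@H](N)C(=O)O.Cl"}`, both enantiomers (one stored as a hydrochloride) collapse onto the same stripped string, so a single detected structure links back to both original substances.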
  11. 11.
  12. 12. MS-Ready Mappings from Details Page
  13. 13. Two MS-Ready Mappings Set
  14. 14. MS-Ready Mappings Set: All substances containing component
  15. 15. Mass/Formula Searching and Metadata Ranking
  16. 16. Advanced Searches: Mass Search
  17. 17. Advanced Searches: Mass Search
  18. 18. MS-Ready Structures for Formula Search
  19. 19. MS-Ready Mappings • Exact Formula C10H16N2O8: 3 hits
  20. 20. MS-Ready Mappings • Same Input Formula: C10H16N2O8 • MS-Ready Formula Search: 125 chemicals
  21. 21. MS-Ready Mappings • Exact Formula – 3 hits • MS-Ready Formula – 125 hits!! – ONLY 8 of the 125 are single-component chemicals – 3 are neutral compounds and 2 are charged • How can we rank the candidate list?
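Formula and mass searches are two views of the same query, since a molecular formula fixes a monoisotopic mass. The sketch below shows that conversion for the formula used on these slides. The function names are illustrative, and the element table is an assumption limited to a few common elements.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; only a few elements included.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "P": 30.97376163,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Parse a simple formula such as 'C10H16N2O8' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum monoisotopic element masses; raises KeyError for unlisted elements."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


print(round(monoisotopic_mass("C10H16N2O8"), 4))  # 292.0907
```

The parser handles only flat formulae without brackets, charges, or isotope labels, which is enough to illustrate the idea.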
  22. 22. Candidate ranking using metadata
  23. 23. Data Source Ranking of “known unknowns” • A mass and/or formula search is run for a chemical that is unknown to the analyst but is a known chemical contained within a reference database • The most likely candidate chemicals have the most associated data sources, the most associated literature articles, or both • Figure: C14H22N2O3 (266.16304) → chemical reference database → sorted candidate structures
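The ranking idea on this slide amounts to a sort on reference counts. The following is a minimal sketch assuming each candidate carries a data-source count and a literature-article count. The `Candidate` class and the example values are hypothetical, not Dashboard data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str          # hypothetical label for a candidate structure
    data_sources: int  # number of associated data sources
    articles: int      # number of associated literature articles


def rank_candidates(candidates):
    """Sort by data-source count, using article count to break ties."""
    return sorted(candidates, key=lambda c: (c.data_sources, c.articles), reverse=True)


# Made-up candidates sharing one formula/mass hit.
hits = [
    Candidate("candidate A", data_sources=3, articles=12),
    Candidate("candidate B", data_sources=41, articles=950),
    Candidate("candidate C", data_sources=3, articles=40),
]
for rank, cand in enumerate(rank_candidates(hits), start=1):
    print(rank, cand.name, cand.data_sources, cand.articles)
```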
  24. 24. The original ChemSpider work
  25. 25. Is a bigger database better? • ChemSpider contained 26 million chemicals at the time of the original work • It is much BIGGER today • Is bigger better?? • Are there other metadata to use for ranking?
  26. 26. Using Metadata for Ranking • Chosen dashboard metadata to rank candidates – Associated data sources • Lists in the underlying database (more about lists later) • Associated data sources in PubChem • Specific source types (e.g. water, surfactants, pesticides) – Number of associated literature articles (PubMed) – Chemicals in the environment: the number of products/categories containing the chemical is an important source of data (from the CPDat database)
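One way to combine several metadata streams like those listed above is to normalize each stream to its maximum over the candidate set and take a weighted average. The slides do not specify how the Dashboard weights its streams, so the scheme below is an assumption made for illustration. The stream names and counts are also hypothetical.

```python
def consensus_scores(candidates, weights=None):
    """Rank candidates by a weighted mean of max-normalized metadata counts.

    candidates: {name: {stream_name: count}}; missing streams count as 0.
    weights:    {stream_name: weight}; defaults to equal weights (an assumption).
    """
    streams = sorted({s for counts in candidates.values() for s in counts})
    if weights is None:
        weights = {s: 1.0 for s in streams}
    # Largest count per stream, with 1 as a floor to avoid division by zero.
    maxima = {s: max(c.get(s, 0) for c in candidates.values()) or 1 for s in streams}
    total_weight = sum(weights.get(s, 0.0) for s in streams) or 1.0
    scores = {
        name: sum(weights.get(s, 0.0) * counts.get(s, 0) / maxima[s] for s in streams)
        / total_weight
        for name, counts in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up counts for three candidates.
example = {
    "A": {"data_sources": 40, "pubchem_sources": 100, "pubmed_articles": 500, "cpdat_products": 10},
    "B": {"data_sources": 20, "pubchem_sources": 50, "pubmed_articles": 0, "cpdat_products": 0},
    "C": {"data_sources": 10, "pubchem_sources": 100, "pubmed_articles": 50, "cpdat_products": 5},
}
for name, score in consensus_scores(example):
    print(name, round(score, 4))
```

Max-normalization keeps a stream with large raw counts, such as literature articles, from dominating the others. Passing explicit `weights` lets one stream be emphasized deliberately.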
  27. 27. Identification ranks for 1783 chemicals using multiple data streams • Legend: DS = Data Sources; PC = PubChem; PM = PubMed; STOFF = DB; KEMI = DB • Data Sources alone rank ~75% of the chemicals as Top Hit
  28. 28. Comparing Search Performance • When the Dashboard contained 720k chemicals • Only 3% of ChemSpider size • What was the comparison in performance?
  29. 29. SAME dataset for comparison
  30. 30. How did performance compare? For the same 162 chemicals, the Dashboard outperforms ChemSpider for both Mass and Formula Ranking
  31. 31. How did performance compare?
  32. 32. Data Quality is important • Data quality in free web-based databases!
  33. 33. Public Databases require curation • There is significant bloating in the public databases because of a lack of curation • The number of hits retrieved by mass or formula searching can explode because of poorly represented chemicals, especially stereochemistry issues • MS-Ready structures will map back to multiple versions of “the same chemical”.
  34. 34. Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search
  35. 35. Comparing ChemSpider Structures
  36. 36. Comparing ChemSpider Structures
  37. 37. Other Searches
  38. 38. Batch Searching mass and formula
  39. 39. Batch Searching • Singleton searches are useful but we work with thousands of masses and formulae! • Typical questions – What is the list of chemicals for the formula CxHyOz? – What is the list of chemicals for a mass +/- error? – Can I get chemical lists in Excel files? In SDF files? – Can I include properties in the download file?
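The "mass +/- error" question above is usually posed as a parts-per-million window around each query mass. The sketch below runs a batch of masses against a small in-memory table and exports the hits as CSV, standing in for the Excel and SDF downloads the slide mentions. The `REFERENCE` table, its identifiers, and the 5 ppm default are assumptions for illustration.

```python
import csv
import io

# Hypothetical reference table: (identifier, formula, monoisotopic mass).
REFERENCE = [
    ("hypothetical-1", "C10H16N2O8", 292.09067),
    ("hypothetical-2", "C14H22N2O3", 266.16304),
    ("hypothetical-3", "C14H22N2O3", 266.16304),
]
FIELDS = ["query_mass", "id", "formula", "monoisotopic_mass", "error_ppm"]


def ppm_window(mass, ppm):
    """Return the (low, high) mass bounds for a +/- ppm tolerance."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def batch_mass_search(masses, ppm=5.0, reference=REFERENCE):
    """Return one row per (query mass, matching reference entry)."""
    hits = []
    for query in masses:
        low, high = ppm_window(query, ppm)
        for ident, formula, mono in reference:
            if low <= mono <= high:
                hits.append({
                    "query_mass": query,
                    "id": ident,
                    "formula": formula,
                    "monoisotopic_mass": mono,
                    "error_ppm": round((mono - query) / query * 1e6, 2),
                })
    return hits


def to_csv(hits):
    """Serialize hit rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(hits)
    return buffer.getvalue()


print(to_csv(batch_mass_search([266.1630, 292.0907, 100.0])))
```

Extra property columns can be added by extending `FIELDS` and the row dictionary, which mirrors the "include properties in the download file" question.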
  40. 40. Batch Searching Formula/Mass
  41. 41. Searching batches using MS-Ready Formula (or mass) searching
  42. 42. Mass Spectrometry Related Searches
  43. 43. Find me “related structures”: Formula-Based Search
  44. 44. Select Chemicals of Interest
  45. 45. Find me “related structures”: Based on Structure Similarity
  46. 46. Find me “related structures”: Based on Structure Similarity
  47. 47. Find me “related structures”: Structure Similarity – sort on mass
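The slides do not say which similarity measure the Dashboard uses. A common choice for structure similarity is the Tanimoto coefficient on fingerprint bit sets, and the sketch below assumes it. The fingerprints are hypothetical sets of integer bit positions, and the 0.7 threshold is illustrative.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def related_structures(query_fp, library, threshold=0.7):
    """Return (name, mass, similarity) for library entries at or above the
    threshold, sorted by ascending mass.

    library: {name: (fingerprint_set, monoisotopic_mass)}
    """
    hits = []
    for name, (fingerprint, mass) in library.items():
        score = tanimoto(query_fp, fingerprint)
        if score >= threshold:
            hits.append((name, mass, score))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up fingerprints and masses.
library = {
    "a": ({1, 2, 3, 4}, 300.1),
    "b": ({1, 2, 3, 4, 5}, 250.2),
    "c": ({9}, 100.0),
}
for name, mass, score in related_structures({1, 2, 3, 4}, library):
    print(name, mass, round(score, 2))
```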
  48. 48. Chemical Lists
  49. 49. Chemical Lists
  50. 50. EPAHFR: Hydraulic Fracturing
  51. 51. PFAS lists of Chemicals
  52. 52. Research in Progress
  53. 53. Predicted Mass Spectra • MS/MS spectra prediction for ESI+, ESI-, and EI • Predictions generated and stored for >800,000 structures, to be accessible via the Dashboard
  54. 54. Search Expt. vs. Predicted Spectra
  55. 55. Search Expt. vs. Predicted Spectra
  56. 56. Spectral Viewer Comparison
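Matching an experimental spectrum against predicted spectra needs some similarity score. The slides do not name the one used in the prototype, so the sketch below assumes a simple binned cosine (dot-product) score. The peak lists and the 0.01 m/z bin width are illustrative.

```python
import math


def binned(peaks, bin_width=0.01):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    vector = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        vector[key] = vector.get(key, 0.0) + intensity
    return vector


def cosine_score(experimental, predicted, bin_width=0.01):
    """Cosine similarity between two binned spectra (0 = no overlap, 1 = identical shape)."""
    a = binned(experimental, bin_width)
    b = binned(predicted, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_spectrum(experimental, predicted_library):
    """Order candidate names by descending score against the experimental spectrum."""
    scored = [(name, cosine_score(experimental, peaks)) for name, peaks in predicted_library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Fixed-width binning can split a peak pair that straddles a bin edge. A tolerance-based peak matcher avoids that at the cost of more code.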
  57. 57. Prototype Development
  58. 58. Prototype Development
  59. 59. API services and Open Data • Present API and web services are available, but major redevelopment is underway • Downloadable data available via the downloads page
  60. 60. Web Services • Data in UI, JSON and XML format
  62. 62. Data and Services used by the Community
  63. 63. NORMAN Suspect List Exchange
  64. 64. Integration to MetFrag in place
  65. 65. MassBank mapping to Dashboard, based on Web Service lookup
  66. 66. Conclusion • Dashboard access to data for ~875,000 chemicals • MS-Ready data facilitates structure identification • Related metadata facilitates candidate ranking • Relationship mappings and chemical lists of great utility • Dashboard and contents are one part of the solution • New developments in progress, especially API development, will be very enabling…
  67. 67. Acknowledgements • IT Development team – especially Jeff Edwards and Jeremy Dunne • Chris Grulke for the ChemReg system • Andrew McEachran (now at Agilent) • Valery Tkachenko (working on new MS-Ready) • NERL colleagues – Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton, Alex Chao • Emma Schymanski, LCSB, Luxembourg • NORMAN Network and all contributors
  68. 68. Contact Antony Williams US EPA Office of Research and Development National Center for Computational Toxicology EMAIL: ORCID: