Sharing chemical structures with peer reviewed publications

In the domain of chemistry, one of the greatest benefits of publishing research is that data can be shared. Unfortunately, the vast majority of chemical structure data associated with scientific publications remain locked up in document form, primarily in PDF files or trapped on webpages. Despite the explosive growth of online chemical databases and the overall maturity of cheminformatics platforms, many barriers stifle the exchange of chemical structures via publications. These challenges include incomplete support by accepted standards (especially InChI) for advanced stereochemistry, organometallic compounds and generic “Markush” representations, the difference between human-readable and computer-readable forms of data, and challenges with the computer representation of chemical structures. To address these obstacles to chemical structure sharing, US EPA National Center for Computational Toxicology scientists are using a combination of cheminformatics applications and online repositories to distribute chemical structure data associated with their publications. This presentation will describe how EPA-NCCT chemical structure data that are amenable to indexing and distribution are shared and highlight the benefits of open data sharing for modeling, data integration, and increasing research impact. This abstract does not reflect U.S. EPA policy.

Published in: Science

  1. Sharing chemical structures with peer-reviewed publications. Are we there yet? Antony Williams, National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC. March 2018 ACS Spring Meeting, New Orleans. http://www.orcid.org/0000-0002-2668-4821 The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.
  2. Disclaimer of Endorsement
     • Mention of or referral to commercial products or services, and/or links to non-EPA sites does not imply official EPA endorsement of or responsibility for the opinions, ideas, data, or products presented at those locations, or guarantee the validity of the information provided.
  3. National Center for Computational Toxicology
     • National Center for Computational Toxicology established in 2005 to integrate:
       – High-throughput and high-content technologies
       – Modern molecular biology
       – Data mining and statistical modeling
       – Computational biology and chemistry
     • Outputs: a lot of data, models, algorithms and software applications
     • Open Data – we want scientists to interrogate it, learn from it, develop understanding
  4. We publish a lot… And lots of chemistry…
  5. The project I work on… https://comptox.epa.gov
  6. We Curated Public Data to Create the Models – STANDARDS! Public data should be curated prior to modeling. (Figure labels: Covalent Halogens, Identical Chemicals, Mismatches)
  7. Workflow Details and Data. OPERA Models: https://github.com/kmansouri/OPERA
  8. OPEN Models for Prediction. Model Performance with full QMRF. Nearest Neighbors from Training Set.
  9. OPERA Models
     • The best way to publish our data as Supplementary Information Files?
  10. Supplementary Information
      • Supplementary information files are generally Word documents and spreadsheets converted to PDF
      • PDF conversions can be problematic. A recent example of interest
  11. Chemical structures as text
  12. How do I extract structures?
      • Copy and paste into Excel as a start point
      • Assume no loss of formatting!
      • Convert SMILES to structures
  13. How do I extract structures?
      • Copy and paste into Excel as a start point
      • Assume no loss of formatting!
      • Convert SMILES to structures
      • But Copy-Paste doesn’t work
  14. How do I extract structures?
      • Copy and paste into Excel as a start point
      • Assume no loss of formatting!
      • Convert SMILES to structures
      • But Copy-Paste doesn’t work
  15. Extract Structure Drawings???
  16. Try hand-drawing Algal Toxins!
  17. Think of files in multiple formats!
      • SMILES are hyper-dependent on good layout algorithms. It’s not easy!
      [H][C@]12OC3(C[C@@H](C)[C@H]1O)C[C@@]1([H])CCCC4(CCC5(O4)O[C@@]([H])(CC[C@@]5(C)O)CC(=C)CCCC4=NC[C@H](C)[C@@H](C)CC44CCC(=C[C@]4([H])[C@]2([H])O3)C(O)=O)O1
  18. Names and CASRNs are NOT structures
      • In our domain most chemicals are text – chemical names and CAS Numbers
  19. And generally problematic…
  20. When publishing data consider…
      • Licenses should be an important part of your data distribution considerations
      • Digital Object Identifiers for versioning
      • Consider your greater contribution to science outside of the publication
  21. Consider Distribution Repositories: Examples
  22. Repositories cover it all… VERSIONING, DOIs, LICENSING
  23. When we publish now…
      • Generally store files on our FTP site PLUS copies in a repository (or two)
      • Multiple formats of data as appropriate
        – Can be as supplementary data or DOI’ed data files
      • DOI’ed data gives altmetrics also…
  24. OPERA Models
  25. OPERA Models
  26. Thirty Years of Experience
      • What has changed?
        – Cheminformatics has progressed
        – The internet proliferates data access
        – Standards have progressed (InChI) and are improving
        – Anybody can access/download millions of structures!
      • What hasn’t changed?
        – Minor progress presenting structures in publications
        – Delivery via PDFs still dominates
        – Mandates on scientists are very unlikely
  27. Contact: Antony Williams, US EPA Office of Research and Development, National Center for Computational Toxicology (NCCT). Williams.Antony@epa.gov. ORCID: https://orcid.org/0000-0002-2668-4821
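
Slide 17's point about SMILES can be made concrete. Long SMILES strings that pass through slides or PDFs often pick up stray spaces or line breaks where the text wraps, and any whitespace ends the string as far as a parser is concerned. The following is a minimal sketch, assuming whitespace is the only damage, of rejoining such a string and running a cheap bracket check before passing it to a cheminformatics toolkit. The function names `rejoin_smiles` and `brackets_balanced` are hypothetical, the check is not a SMILES validator, and the sample text is only the opening fragment of the slide's algal toxin string with spaces of the kind text wrapping introduces.

```python
import re


def rejoin_smiles(fragmented: str) -> str:
    """Remove all whitespace from a SMILES string broken by text wrapping.

    Assumption: whitespace is the only damage (no dropped or swapped characters).
    """
    return re.sub(r"\s+", "", fragmented)


def brackets_balanced(smiles: str) -> bool:
    """Cheap sanity check: every '(' and '[' is closed in the right order.

    This does NOT validate SMILES; it only catches obvious truncation.
    """
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


if __name__ == "__main__":
    # Opening fragment of the slide 17 string, with wrap-style whitespace.
    wrapped = "[H][C@]12OC3(C[C@ @H](C)\n[C@H]1O)"
    repaired = rejoin_smiles(wrapped)
    print(repaired)
    print(brackets_balanced(repaired))
```

A string that passes this check can still be chemically wrong, so the repaired text should be parsed by a real toolkit before it is trusted.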
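
Slide 18 notes that names and CAS Registry Numbers are not structures. A CASRN does carry a check digit, so a list of numbers extracted from a publication can at least be screened for transcription errors, but passing the check says nothing about which structure is meant. A minimal sketch of that screening follows; the function name `is_valid_casrn` is hypothetical.

```python
import re

# CASRN layout: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is a well-formed CASRN with a correct check digit.

    The check digit is the sum of the other digits, each multiplied by its
    position counted from the right (1, 2, 3, ...), taken modulo 10.
    """
    match = CASRN_PATTERN.match(casrn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ["7732-18-5", "58-08-2", "7732-18-4"]:
        print(candidate, is_valid_casrn(candidate))
```

A valid checksum only shows the identifier was transcribed consistently; mapping it to a structure still requires a lookup in a curated database.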
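
Slide 23 mentions distributing data in multiple formats. As a rough illustration, the sketch below writes the same small structure table both as CSV and as a plain `.smi`-style text file. The two records, the column names, and the "SMILES, tab, name" line layout are assumptions chosen for the example, not the formats EPA-NCCT actually distributes.

```python
import csv
import io

# Illustrative records only; a real table would come from the curated dataset.
RECORDS = [
    {"name": "ethanol", "smiles": "CCO"},
    {"name": "benzene", "smiles": "c1ccccc1"},
]


def to_csv(rows) -> str:
    """Serialize rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "smiles"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_smi(rows) -> str:
    """Serialize rows as one 'SMILES<TAB>name' line per structure."""
    return "".join(f"{row['smiles']}\t{row['name']}\n" for row in rows)


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(to_smi(RECORDS))
```

Keeping both outputs generated from one source table avoids the copy-and-paste drift between formats that the earlier slides describe.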
