Our dire need to mandate data standards and expectations for scientific publishing
Antony Williams
ACS Denver, March 2015
This is a presentation that I delivered at the ACS Division of Chemical Information meeting regarding "Reproducibility, Reporting, Sharing & Plagiarism".

I took the opportunity to set aside my role as VP of Strategic Development at RSC and as a member of the cheminformatics group that built ChemSpider and works on related RSC projects. Instead I presented on how a LACK OF MANDATES from publishers, regarding the submission of data accompanying the articles I help write, is actually weakening my scientific record: the data are not being shared in the forms most useful to the community. I think publishers would benefit from pushing me for MORE data, in fairly general standard formats, and from allowing me (and others) to download that data as molecules (and collections), spectral data, CSV files, etc.

Published in: Science

  1. Our dire need to mandate data standards and expectations for scientific publishing. Antony Williams, ACS Denver, March 2015
  2. Reproducibility, Reporting, Sharing & Plagiarism
      • I will present from the point of view of:
        • Losing way too much of my own data!
        • Someone who actively wants to share data
        • My involvement with a chemistry database
        • As a reviewer of publications
        • As an author of scientific publications
        • ..and as a replacement speaker…
  3. Consider a shift to Openness
  4. Times have really changed… Open Access funder mandates…
  5. Publishers are responding
  6. The world of Open Data is here
  7. What technical solutions tho’?
      • Despite the push for Open Data the funders are not really pushing solutions yet
      • Institutional repositories are commonplace
      • (Partial) solutions are becoming available
  8. Digital Science Figshare
  9. Elsevier Pure
  10. RSC ChemSpider
  11. So what do I do…
      • VP Strategic Development for RSC
      • Manage the cheminformatics team
      • Interested in Open Drug Discovery, Open Data management, Cheminformatics standards
      • But originally an NMR spectroscopist with a focus on structure elucidation - very interested in “CASE”, study of natural products
  12. Some NMR…in this CASE…
  13. Some NMR…
  14. Studying DOZENS of compounds
      • NO access to raw data files – in binary or even standard file formats for processing
      • Figures are close to USELESS for 2D NMR – representative not accurate shifts
      • Tabulated shifts are in PDF files and needed transcribing – where are CSV files???
      • TORTUROUS WORK!!!!
  15. …I (co-)author many articles…
  16. My favorite part of writing! What.. NO STANDARD???
  17. In researcher mode…
      • I want to access and use data
      • I want to:
        • Download molecules
        • Download tables
        • Download spectra
        • Download figures
      • Then reprocess, replot, repurpose
  18. Community Norms
      • Some wonderful community norms and mandates!
        • Deposit crystal structures in CSD
        • Deposit proteins in PDB
        • Deposit gene sequences in GenBank
        • Increasingly deposit bioassay data in PubChem
  19. What of general chemistry?
      • We publish into locked down files and then “abstract” the data!
      • Could publishers help drive a community norm for:
        • Chemical compound registration
        • Spectral data
        • Property data
        • What else?
  20. Nature Chemistry Compound Pages
  21. RSC Prospected Articles
  22. Could we at least improve quality of compounds?
      • Maybe forcing compound registration ahead of time won’t work (would need a business model etc.)
      • But what can be done to help correct the many issues we see with structures?
      • Examples?
  23. EXPERTS must get it right?!
  24. What about a validated dictionary?
  25. There are Standards!
  26. There are Standards!
  27. There are Standards!
  28. CVSP: Validate and Standardize
  29. CVSP Rule Sets
  30. CVSP Filtering of DrugBank
  31. CVSP Filtering of DrugBank
  32. CVSP is Open to Anyone!
  33. What if…
      • CVSP was used to check and process all ChemDraw, Molfiles, SDF files before submitting to publishers or databases?
      • Publishers used the CVSP API to check their data?
      • All the rules were openly available for adoption
  34. A Talk from Yesterday… http://www.slideshare.net/AntonyWilliams/
  35. Spectral Data
  36. ChemSpider ID 24528095 H1 NMR
  37. ChemSpider ID 24528095 C13 NMR
  38. ChemSpider ID 24528095 HHCOSY
  39. ESI – Text Spectra
  40. We want to find text spectra?
      • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
      • What would be better are spectral figures – and include assignments where possible!
  41. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
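A text spectrum like the 1H NMR string on slide 41 can, in principle, be turned into the kind of downloadable table the talk asks for. The following is only a minimal sketch: the regular expressions, the field names, and the function names `parse_text_spectrum` and `to_csv` are assumptions made for illustration, not the method used by ChemSpider or any publisher. It handles the 1H layout shown on the slide and is not a general parser for reported spectra.

```python
import csv
import io
import re

# The 1H NMR text spectrum quoted on slide 41.
TEXT = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)

# Assumed header layout: "<nucleus> NMR (<solvent>, <frequency> MHz)".
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR "
    r"\((?P<solvent>[^,]+), (?P<mhz>\d+(?:\.\d+)?) MHz\)"
)
# A shift (or shift range) followed by a parenthesised annotation that may
# itself contain one level of nested parentheses, e.g. "C(5a)H".
PEAK = re.compile(
    r"(?P<shift>\d+\.\d+(?:[–-]\d+\.\d+)?) "
    r"\((?P<body>(?:[^()]|\([^()]*\))*)\)"
)
FIELDS = ["shift_ppm", "multiplicity", "protons", "j_hz", "assignment"]


def parse_text_spectrum(text):
    """Return (metadata, peaks) parsed from a 1H text spectrum."""
    head = HEADER.search(text)
    meta = {}
    if head:
        meta = {
            "nucleus": head["nucleus"],
            "solvent": head["solvent"],
            "mhz": float(head["mhz"]),
        }
    peaks = []
    for m in PEAK.finditer(text):
        fields = [f.strip() for f in m["body"].split(",")]
        peak = dict.fromkeys(FIELDS)
        peak["shift_ppm"] = m["shift"]  # kept as text so ranges survive
        peak["multiplicity"] = fields[0]
        rest = []
        for f in fields[1:]:
            n = re.fullmatch(r"(\d+)H", f)
            j = re.fullmatch(r"J\w*\s*=\s*(\d+(?:\.\d+)?)\s*Hz", f)
            if n:
                peak["protons"] = int(n[1])
            elif j:
                peak["j_hz"] = float(j[1])
            else:
                rest.append(f)
        peak["assignment"] = ", ".join(rest)
        peaks.append(peak)
    return meta, peaks


def to_csv(peaks):
    """Serialise parsed peaks as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(peaks)
    return buf.getvalue()


meta, peaks = parse_text_spectrum(TEXT)
print(meta)
print(to_csv(peaks))
```

Even for this single string the parser needs special cases for nested parentheses, shift ranges, and the `Jb` label, which illustrates the talk's later point that mining data back out of text is the wrong way round.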
  42. Developing Proof-of-Concept
      • Extract from 1976-2014 USPTO applications

      Nucleus    Count
      H          975543
      C          56536
      unknown*   44306
      F          9429
      P          3241
      B          91
      Si         62
      Sn         22
      Se         11
      N          8

      *unknown – starts off with NMR: peak list (no nucleus)
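The slides do not describe how the extracted spectra were binned by nucleus. As a rough sketch of one way such a tally could be produced, the snippet below reads the nucleus label at the start of each text spectrum and counts the results. The regular expression, the function name `nucleus_of`, and the third sample string (an invented example of a peak list with no nucleus) are all assumptions for illustration.

```python
import re
from collections import Counter

# Optional mass number before or after the element symbol, then "NMR":
# matches "1H NMR", "13C NMR", "H1 NMR", "C13 NMR", "19F-NMR".
NUCLEUS = re.compile(r"\s*(\d{1,3})?([A-Z][a-z]?)(\d{1,3})?[ -]NMR\b")


def nucleus_of(text):
    """Return the element symbol named before 'NMR', or 'unknown'."""
    m = NUCLEUS.match(text)
    return m.group(2) if m else "unknown"


samples = [
    # Openings of the spectra quoted on slides 40 and 41.
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane)",
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H)",
    # Hypothetical example of the "unknown" case: no nucleus is given.
    "NMR: 7.26 (s, 1H)",
]

counts = Counter(nucleus_of(s) for s in samples)
print(counts)
```

A spectrum that simply starts with "NMR:" falls into the `unknown` bin, which matches the footnote on the slide.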
  43. ESI Data also contains figures
  44. “Where is the real data please?” FIGURE DATA
  45. Extraction is the WRONG WAY
      • We should NOT mine data out – digital form!
      • Structures should be submitted “correctly”
      • Spectra should be digital spectral formats, not images
      • ESI should be RICH and interactive
      • Data should be open, available, with metadata and provenance
  46. We can solve for Authors here. Will it be used though??? YES!
  47. Supplementary Info Data now..
  48. The challenges of analytical data
      • Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AnIML)
      • ChemSpider already hosts thousands of JCAMP spectra
      • Support of “assigned spectra” in place
      • Data validation approaches understood
      • There are a myriad of analytical data types…
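To make the appeal of a standard format concrete: a JCAMP-DX file is plain text whose metadata sits in labeled data records of the form `##LABEL=value`, so the header can be read without vendor software. The sketch below reads only those header records and stops at the data table. The sample file content is a hypothetical stub written for this example, and the function names are assumptions; decoding the (possibly compressed) data table is deliberately left out.

```python
import re

# Hypothetical stub of a JCAMP-DX file, for illustration only.
SAMPLE = """\
##TITLE=example spectrum (hypothetical)
##JCAMP-DX=5.00 $$ text after $$ is a comment
##DATA TYPE=NMR SPECTRUM
##XUNITS=HZ
##YUNITS=ARBITRARY UNITS
##NPOINTS=4
##XYDATA=(X++(Y..Y))
0 10 20 30 40
##END=
"""

# Labels that open a data table or close the block; header reading stops here.
STOP_LABELS = {"XYDATA", "XYPOINTS", "PEAKTABLE", "END"}


def normalize_label(label):
    """Compare labels case-insensitively, ignoring spaces, '-', '/' and '_'."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def read_header(text):
    """Collect header records as a dict keyed by normalized label."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].rstrip()  # drop trailing comments
        if not line.startswith("##"):
            continue
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue
        key = normalize_label(label)
        if key in STOP_LABELS:
            break
        header[key] = value.strip()
    return header


header = read_header(SAMPLE)
print(header)
```

With the header available as a dictionary, a repository or publisher could check for required fields (title, data type, units) at submission time, which is the kind of validation the slide says is already understood.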
  49. Analytical data
  50. Data Mining – it’s mine, mine!
  51. Related… Published this week
  52. It’s Dangerous to Mandate
      • Scientists prefer guidelines rather than rules
      • It can be more work to meet mandates
      • Mandates may discourage submissions to journals
      • But what’s good for science?
      • Will the Open Data movement shift things?
      • Will the latest generation share more?
  53. Reproducibility, Reporting, Sharing & Plagiarism
      • If publishers demanded it of me…
        • I would lose less of my own data!
        • I would actively be sharing data
        • As a reviewer of publications..enables me
        • As an author of scientific publications..makes the publications better I believe
        • ..and I did my best as a replacement speaker…
  54. It’s a long road ahead…
  55. Thank you
      Email: williamsa@rsc.org
      ORCID: 0000-0002-2668-4821
      Twitter: @ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams