Royal Society of Chemistry activities
to develop a data repository for
chemistry-specific data

The Royal Society of Chemistry publishes many thousands of articles per year, the majority of them containing rich chemistry data that is, in general, limited in value when confined to the HTML or PDF form of the articles commonly consumed by readers. RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data and analytical spectra. RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as the result of text-mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of electronic lab notebooks (ELNs) in terms of facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data with the intention of facilitating improved discovery.


  1. 1. Royal Society of Chemistry activities to develop a data repository for chemistry-specific data
      Aileen Day, Alexey Pshenichnov, Ken Karapetyan, Colin Batchelor, Peter Corbett, Jon Steele, Valery Tkachenko and Antony Williams
      ACS Dallas, March 2014
  2. 2. Data in a Scientific Publication
      • This is not new, you know the story…
      • So much data of value is contained within a publication and delivered in PDF form
      • PDF files, and especially unclear licensing, don’t let me get at the data so I can rework, reuse, repurpose, text mine etc.
      • I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with capabilities I need, and the publishers should just do it
  3. 3. Many useful discussions…
  4. 4. Many good visions…
  5. 5. And over the years, progress…
      • There is much progress with open access, data access, licensing, enhanced articles, open data, free online tools, open source code, publishers waking up, scientists contributing
      • We should be excited at what is available now, what the future holds, what opportunities exist in front of us
  6. 6. But it’s NOT easy… technology
  7. 7. But it’s not easy… US
      • Not everything we would like around data handling is there, for sure
      • Many systems, tools and platforms are already available, but we don’t know about them or, even if we did, contributing is “more work”
      • “What’s in it for me?”, “It’s my data”, “It’s too much work”, “What credit do I get?”
  8. 8. An Initial “Vague” Vision Set
      • Manage “all” of the chemistry data associated with chemical substances
      • Data to be downloadable, reusable, interactive
      • Build a platform that enables the scientist
      • Data storage, validation, standardization and curation
      • Collaborative data sharing
      • Provide a data platform that can enable and enhance publishing of scientific papers
  9. 9. Data Repository
      • Registration of chemical compounds
      • Deposition of chemical syntheses
      • Addition of analytical data
      • Integration with electronic notebooks
      • Rewards and recognition for data sharing
      • Document processing
      • Hosting of data as private, embargoed or public
  10. 10. Solving for Authors
  11. 11. I hate text mining data
      • DERA: Developing pipelining tools for text-mining so we will be able to process documents for mark-up
      • Compound extraction/markup
      • Reaction extraction/conversion
      • Convert “text spectra” to generate spectral libraries… AGGHHHHH!
  12. 12. “Where is the real data please?”
      [Image slide; labels: FIGURE, DATA]
  13. 13. Data Preferences - total bias
      • Views of a spectroscopist
      • Give me the data – an interactive, downloadable spectrum is way more valuable to me (processed spectrum and FID available)
      • The spectral header in the JCAMP standard is very incomplete (as in most spectral standards)
      • I want ASSIGNED/ANNOTATED spectra if possible – don’t “textify” a spectrum!
  14. 14. Solving the problem here…
      • Binary file formats are problematic – think of the variations in instrumentation and software
      • Standards can be defined – are they correctly implemented? CIF and its checking, spectral standards (JCAMP versions), structure formats, etc.
      • Metadata is crucial
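The point about metadata and incomplete spectral headers can be made concrete with a small check run at deposition time. The following is a minimal sketch, not the RSC's validation code: it reads the `##LABEL=value` records at the top of a JCAMP-DX file and reports which labels from a checklist are missing. The checklist `REQUIRED_LABELS`, the function names and the sample spectrum are all assumptions made for illustration; the label normalization (upper-casing and dropping spaces, hyphens, slashes and underscores) follows the way JCAMP-DX labels are generally compared.

```python
import re

# Hypothetical checklist: labels a repository might insist on before accepting
# a spectrum. This is an illustrative choice, not the JCAMP-DX required set.
REQUIRED_LABELS = ["TITLE", "JCAMPDX", "DATATYPE", "ORIGIN", "OWNER", "XUNITS", "YUNITS"]


def normalize_label(label: str) -> str:
    """Upper-case a label and drop spaces, hyphens, slashes and underscores."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_header(text: str) -> dict:
    """Collect ##LABEL=value records that appear before the data table."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("##") or "=" not in line:
            continue
        label, value = line[2:].split("=", 1)
        key = normalize_label(label)
        if key in ("XYDATA", "PEAKTABLE", "END"):
            break  # the header is over once the data block starts
        header[key] = value.strip()
    return header


def missing_metadata(text: str) -> list:
    """Return the checklist labels that are absent or empty in the header."""
    header = parse_header(text)
    return [label for label in REQUIRED_LABELS if not header.get(label)]


# Invented sample file: it has no ORIGIN or OWNER record.
sample = """##TITLE=Ethanol IR spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XYDATA=(X++(Y..Y))
4000 0.98 0.97
##END=
"""

print(missing_metadata(sample))
```

A depositor could be shown the returned list and asked to fill in the gaps before the spectrum is accepted, which is one way of checking that a standard is actually implemented correctly rather than merely defined.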
  15. 15. …and what does it solve?
      • “Fixing the data” – data can’t be faked as easily
      • Reprocessing of analytical data can be done… weighting functions, baseline correction, deconvolution etc.
      • I can convert and store it locally
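To illustrate why access to the underlying data matters for reprocessing, here is a minimal sketch of two of the operations named on the slide. It assumes the simplest textbook forms: an exponential weighting function `exp(-pi * LB * t)` applied to a free induction decay (FID), and a flat baseline estimated from the edges of a spectrum. The function names and all numbers are invented for illustration; real processing software uses more elaborate methods.

```python
import math


def exponential_weighting(fid, dwell_time, line_broadening_hz):
    """Multiply each FID point by exp(-pi * LB * t), with t = index * dwell time."""
    return [
        point * math.exp(-math.pi * line_broadening_hz * i * dwell_time)
        for i, point in enumerate(fid)
    ]


def subtract_constant_baseline(spectrum, edge_points):
    """Estimate a flat baseline from both ends of the spectrum and subtract it.

    edge_points must be at least 1 and the edges are assumed to be signal-free.
    """
    edges = spectrum[:edge_points] + spectrum[-edge_points:]
    offset = sum(edges) / len(edges)
    return [y - offset for y in spectrum]


# Invented numbers for illustration only.
fid = [1.0, 0.8, 0.6, 0.4]
weighted = exponential_weighting(fid, dwell_time=0.001, line_broadening_hz=1.0)

spectrum = [0.5, 0.5, 2.5, 0.5, 0.5]
corrected = subtract_constant_baseline(spectrum, edge_points=2)

print(weighted)
print(corrected)
```

Neither operation is possible when only a picture of the spectrum or a text peak list has been published, which is the argument the slide is making.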
  16. 16. But solve it for many things
      • I want molecules as structure formats not images
      • Please don’t make us hack tables of data
      • Tell us how you generated your files – software version, software libraries, etc.
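The request to "tell us how you generated your files" could be met with a small provenance record stored alongside each data file. The sketch below is one possible shape for such a record, not a format defined in the talk: the field names, the file name and the software and library names are all hypothetical.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(data_file, software, software_version, libraries):
    """Build a JSON-serializable description of how a data file was produced."""
    return {
        "data_file": data_file,
        "software": software,
        "software_version": software_version,
        "libraries": dict(libraries),
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


record = provenance_record(
    data_file="compound_42_1H.jdx",          # hypothetical file name
    software="ExampleNMRProcessor",          # hypothetical software
    software_version="1.2.3",
    libraries={"example-fft-lib": "0.9.0"},  # hypothetical library
)
print(json.dumps(record, indent=2))
```

Writing the record as JSON keeps it readable by both people and deposition tools.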
  17. 17. Input data pipeline
      [Diagram]
      • Inputs: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; LabTrove and other templated data; Documents; API, FTP, etc.
      • Deposition Gateway modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module
      • Data stages: raw data, staging databases, validated data
      • Databases: Compounds, Reactions, Spectra, Materials, Articles / CSSP
      • All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
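The security model described on the pipeline slide (each data slice is private, public or embargoed) can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: it assumes the owner can always read a slice, that an embargoed slice becomes readable on its embargo date, and that an embargo without a date stays closed. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    source: str                          # data source / collection the slice belongs to
    owner: str                           # depositor who controls the slice
    visibility: Visibility
    embargo_until: Optional[date] = None


def can_read(data_slice: DataSlice, user: str, today: date) -> bool:
    """Decide whether a user may read a slice (assumed rules, see lead-in)."""
    if user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        return data_slice.embargo_until is not None and today >= data_slice.embargo_until
    return False


# Invented example: a slice embargoed until 1 June 2014.
embargoed = DataSlice("lab-notebook-7", "alice", Visibility.EMBARGOED, date(2014, 6, 1))
print(can_read(embargoed, "bob", date(2014, 3, 1)))
print(can_read(embargoed, "bob", date(2014, 6, 1)))
```

Keeping the rule in one function means every database in the pipeline can apply the same check per slice.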
  18. 18. Depositions Gateway User Interface
  19. 19. Depositions Gateway User Interface
  20. 20. Depositions Gateway User Interface
  21. 21. Depositions Gateway User Interface
  22. 22. Depositions Gateway User Interface
  23. 23. Document processing
  24. 24. Input data pipeline
      [Diagram repeated from slide 17]
  25. 25. Depositions Gateway User Interface
  26. 26. User Interface Approach
      [Diagram of four tiers]
      • Data tier: Compounds, Reactions, Spectra, Materials, Documents
      • Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API
      • User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets
      • User interface tier (examples): Analytical Laboratory application; Electronic Laboratory Notebook; Chemical Inventory application; Paid 3rd party integrations (various platforms – SharePoint, Google, etc.)
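The tiered layout in this diagram can be sketched for a single data type. In the example below a dictionary stands in for the data tier, a small class plays the role of a Compounds API, and a function plays the role of a Compounds widget that only ever talks to the API. All names, records and method signatures are invented for illustration and do not describe the real RSC APIs.

```python
from typing import Dict, List, Optional

# Data tier: an in-memory stand-in for a compounds database (invented records).
COMPOUNDS_DB: Dict[int, Dict[str, str]] = {
    1: {"name": "ethanol", "smiles": "CCO"},
    2: {"name": "benzene", "smiles": "c1ccccc1"},
}


class CompoundsAPI:
    """Data access tier: the only code that touches the data tier directly."""

    def __init__(self, db: Dict[int, Dict[str, str]]):
        self._db = db

    def get(self, compound_id: int) -> Optional[Dict[str, str]]:
        return self._db.get(compound_id)

    def search_by_name(self, text: str) -> List[Dict[str, str]]:
        return [c for c in self._db.values() if text.lower() in c["name"].lower()]


def compound_widget(api: CompoundsAPI, compound_id: int) -> str:
    """User interface components tier: renders one compound using only the API."""
    compound = api.get(compound_id)
    if compound is None:
        return "Compound not found"
    return f"{compound['name']} ({compound['smiles']})"


# User interface tier: an application composes widgets.
api = CompoundsAPI(COMPOUNDS_DB)
print(compound_widget(api, 1))
```

Because the widget depends only on the API, the same component could be reused by a notebook, an inventory application or a third-party integration, which appears to be the point of separating the tiers.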
  27. 27. User Interface Approach
      [Diagram repeated from slide 26]
  28. 28. User Interface Approach
      [Diagram repeated from slide 26]
  29. 29. User Interface Approach
      [Diagram repeated from slide 26]
  30. 30. [Use-case diagram]
      • Analytical Chemist: Characterize, Measure, Search, Store (with <<include>> relationships)
      • Synthetic Chemist: Search (synthetic procedure), Document (publish synthetic procedure), Retrosynthetic analysis
  31. 31. [Use-case diagram]
      • Medicinal Chemist: Search (against database of properties), Source (find vendor), Analyse (cluster, dock, screen)
      • Computational Chemist: Search or Develop algorithm, Store results, Run calculations
      • Also shown: Synthesize, Measure activity
  32. 32. Addition of Analytical Data
      • Spectral Container is in development using componentized widgets for display
      • NIST spectra converted into standardized JCAMP format for deposition - 296,103 spectra deposited
      • 10% of remaining NIST spectra need to be curated as there are obvious structure issues
  33. 33. Electronic Notebook Data
      • Development work integrating chemistry into the Southampton LabTrove notebook
      • Stoichiometry table development
      • Analytical data integration
      • “ChemTrove” rolled out to a small test group in January
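As a rough idea of what a stoichiometry table in an electronic notebook computes, the sketch below converts masses to moles and expresses each reagent as equivalents of the one present in the smallest molar amount. It assumes 1:1 stoichiometric coefficients and uses invented reagents and molecular weights; it is not the ChemTrove implementation.

```python
def stoichiometry_table(reagents):
    """Add moles and equivalents to a list of reagent dictionaries.

    Each reagent needs "name", "mass_g" and "mw_g_per_mol". Equivalents are
    taken relative to the reagent with the fewest moles (1:1 coefficients assumed).
    """
    rows = []
    for reagent in reagents:
        moles = reagent["mass_g"] / reagent["mw_g_per_mol"]
        rows.append({"name": reagent["name"], "moles": moles})
    limiting = min(row["moles"] for row in rows)
    for row in rows:
        row["equivalents"] = row["moles"] / limiting
    return rows


# Invented reagents for illustration only.
reagents = [
    {"name": "reagent A", "mass_g": 1.00, "mw_g_per_mol": 100.0},
    {"name": "reagent B", "mass_g": 3.00, "mw_g_per_mol": 150.0},
]

table = stoichiometry_table(reagents)
for row in table:
    print(f"{row['name']}: {row['moles']:.4f} mol, {row['equivalents']:.2f} equiv")
```

A real table would also handle volumes, densities, purities and non-unit coefficients, but the underlying arithmetic is the same.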
  34. 34. Present activities – ACS Fall
      • Deposition process development of compounds, reactions and spectral data by Spring
      • FTP, DropBox, Web-upload, ELN integration
      • Compounds, Reactions, Spectral data search, display, download
      • Data sharing – private, public, collaborative
      • Metadata, metadata, metadata standards!
      • Open Sourcing CRD and CVSP
  35. 35. Acknowledgments
      • Jeremy Frey and Simon Coles, University of Southampton
      • Will Dichtel and Leah McEwan, Cornell University
      • Stuart Chalk, University of North Florida
      • Bob Hanson and Bob Lancashire, Jmol and JSpecView
  36. 36. Thank you
      Email: williamsa@rsc.org
      ORCID: 0000-0002-2668-4821
      Twitter: @ChemConnector
      Personal Blog: www.chemconnector.com
      SLIDES: www.slideshare.net/AntonyWilliams
