Data enhancing the Royal Society of Chemistry publication archive

The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry related data – compounds, reactions, property data, spectral data etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion and chemical validation and standardization approaches. The outcome of this project will result in new chemistry related data being added to our chemical and reaction databases and in the ability to more tightly couple web-based versions of the articles with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.

Transcript of "Data enhancing the Royal Society of Chemistry publication archive"

  1. Data enhancing the Royal Society of Chemistry publication archive. Antony Williams, Colin Batchelor, Peter Corbett, Ken Karapetyan and Valery Tkachenko. ACS Dallas, March 2014
  2. Data Enhancing the RSC Archive • Publications summarise data acquisition, analysis and conclusions. • Much detail in the data • Improved navigation includes data access • Reanalysis of data is limited in PDFs
  3. Text Mining: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.
  4. How is DERA going? TEXT • We have text-mined all 21st century articles… >100k articles from 2000-2013 • Mostly marked up with XML, more structured, easier to handle. Markup mostly published onto the HTML forms of the articles • Required multiple iterations based on dictionaries, markup, OSCAR extraction • New visualization approaches in development
  5. Chemical Validation and Standardization
  6. The RSC Data Repository [architecture diagram] • Deposition Gateway inputs: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; LabTrove and other templated data; Documents; API, FTP, etc. • Staging databases: raw data, validated data • Modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module, … Module • Databases: Compounds, Reactions, Spectra, Materials, Articles / CSSP • All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
  7. Text-Mining
  8. ChemSpider Reactions
  9. Reactions • We will put reactions from our databases into the Reactions Repository • We will use “Reaction Validation” procedures to clean up Daniel Lowe’s USPTO patent set of over a million extracted reactions • We will move ChemSpider SyntheticPages content to the Reactions Repository • We will use the RXNO Ontology to classify the reactions
  10. Reaction Deposition/Validation
  11. ESI – Text Spectra
  12. Lots of “Textual Spectra”
  13. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  14. 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  15. How is DERA going? Text Spectra • Overall progress is good • Improved algorithms for extraction of spectra • Extraction of associated compound name with spectrum – name to structure conversion now • MestreLabs have provided us with batch conversion tool • Work in progress – manual and automated validation. In theory auto-assignment also
  16. Visualization of Spectra • For spectra associated with compounds we would like to view “interactive spectra”
  17. Javascript viewer with JMol
  18. Figure Spectra into “Real Spectra”? • We are turning text into structures • We are turning text into spectra • And we are turning figures into spectra
  19. Turn “Figures” Into Data [image: figure alongside extracted data]
  20. [image: figure alongside extracted data]
  21. How is DERA going? Figures • Validation tests performed with William Brouwer. Good enough to proceed with larger test set • Ready to run process across larger collection • Focus on 21st century articles only for now
  22. Early Test Experiments • Input: 74 supplementary data documents / 3444 pages • Output: p2t extracted content in 1069 page instances – 578 molecules: ~10% false positives, e.g. classifies Bruker logo as chemical object; ~20% false negatives, e.g. missing some symbols from structure – 1151 spectra: >80% of peaks extracted to within 1-2 decimal places (ppm)
  23. Validating Spectra • How will we check data consistency? • How do we know the structure and the spectra match? Comparing image to spectrum is NOT enough!!! • Predict spectra, use spectral verification, use algorithmic checking. • Flag “dodgy data” and use crowdsourcing for data checking • MULTIPLE prediction technologies now available – VERIFICATION is tougher
  24. What are we extracting? • Compounds from compound names • Reactions from the text • Spectral extraction – from figures and text • Extraction of data from “tables” – not only CSV files but literal tables in the publication – specifically data from MedChemComm as proof of concept
  25. Building out the technology • We are presently Open-Sourcing a chemical registration system developed for OpenPHACTS • We will then Open Source the Chemical Validation and Standardization Platform • We are working with Bob Hanson and Bob Lancashire on Jmol/JSpecView Open Source • We will deliver a set of Open Source widgets for structure handling/visualization
  26. Javascript viewer NMR, MS, IR
  27. Grand Target • Fingers crossed to get 21st century spectra converted • Spectra associated with compounds will go into ChemSpider • Spectra converted from Figures but without compound association will be captured with Figures into the Data Repository • Focus on IR, Raman, UV-Vis & 1D NMR
  28. DERA is FINE for an archive. The WRONG WAY otherwise! • We should NOT be mining data out of future publications • Structures should be submitted “correctly” • Spectra should be digital spectral formats, not images • ESI should be RICH and interactive • Data should be open, available, with meta data and provenance
  29. We can solve for Authors here. Will it be used though???
  30. Advanced ESI
  31. Conclusions • Great progress in mining the archive, and 21st century articles are being enhanced on the publishing platform iteratively • Spectral Data is the next focus – directly connected to our work on the data repository • Reaction extraction, processing and validation from articles is progressing more slowly • Results are content, software components and Open Source Contributions
  32. Acknowledgments • Bill Brouwer – Plot2Txt Development • Carlos Cobas and Santi Dominguez • Bob Hanson and Bob Lancashire for Jmol/JSpecView Javascript version • Leah McEwan and Will Dichtel • ACD/Labs – Provider of spectroscopy tools
  33. Thank you. Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
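
To make the "textual spectra" idea from slides 12 to 15 concrete, here is a minimal Python sketch that turns the 1H NMR string from slide 13 into structured peaks. It is not the extraction code used in DERA, which the slides do not show. The `Peak` class, the `PEAK_RE` pattern and the `parse_h_nmr` function are hypothetical names introduced for illustration, and the pattern assumes the "shift (multiplicity, nH, ...)" layout seen on that slide.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Text of slide 13, copied verbatim.
SLIDE_13 = (
    "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
    "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
    "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
    "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)"
)


@dataclass
class Peak:
    """One reported signal (hypothetical structure, for illustration only)."""
    shift_low: float            # ppm; equals shift_high for a single value
    shift_high: float           # ppm; upper end when a range is reported
    multiplicity: str           # e.g. "m", "d", "dd"
    protons: int                # integration, 0 if not stated
    coupling_hz: Optional[float]


# A shift (or shift range) followed by a parenthesised annotation that may
# itself contain one level of nested parentheses, such as "C(5a)H".
PEAK_RE = re.compile(
    r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?\s*\(((?:[^()]|\([^()]*\))*)\)"
)


def parse_h_nmr(text: str) -> List[Peak]:
    """Parse a textual 1H NMR listing into Peak objects."""
    body = text.split("δ", 1)[-1]  # skip the solvent/frequency header
    peaks = []
    for m in PEAK_RE.finditer(body):
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        fields = [f.strip() for f in m.group(3).split(",")]
        # Assumption: the second field is the integration, written "nH".
        count = re.fullmatch(r"(\d+)H", fields[1]) if len(fields) > 1 else None
        coupling = re.search(r"J\w*\s*=\s*(\d+(?:\.\d+)?)\s*Hz", m.group(3))
        peaks.append(Peak(
            shift_low=low,
            shift_high=high,
            multiplicity=fields[0],
            protons=int(count.group(1)) if count else 0,
            coupling_hz=float(coupling.group(1)) if coupling else None,
        ))
    return peaks


peaks = parse_h_nmr(SLIDE_13)
total_protons = sum(p.protons for p in peaks)  # 7 signals, 21 H in total
```

A total such as `total_protons` is the kind of quantity that could feed the "algorithmic checking" mentioned on slide 23, for example by comparing it with the hydrogen count of the structure generated from the compound name.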
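
Slide 23 proposes predicting spectra and flagging "dodgy data" for crowdsourced checking. One simple way this could be approached is sketched below, under stated assumptions: a greedy nearest-neighbour comparison of extracted shifts against a reference list (for example, predicted shifts). The function names, the 0.05 ppm tolerance and the 0.8 threshold are illustrative choices, not values taken from the presentation, and the numbers in the usage lines are toy inputs rather than real predictions.

```python
from typing import List, Tuple


def match_peaks(
    observed: List[float],
    reference: List[float],
    tolerance_ppm: float = 0.05,
) -> Tuple[List[Tuple[float, float]], List[float]]:
    """Greedily pair each observed shift with the nearest unused reference shift.

    Returns (matched pairs, observed shifts left unmatched). Greedy matching is
    a simplification; it is not guaranteed to find the optimal assignment.
    """
    remaining = sorted(reference)
    matched = []
    unmatched = []
    for shift in sorted(observed):
        best = min(remaining, key=lambda r: abs(r - shift), default=None)
        if best is not None and abs(best - shift) <= tolerance_ppm:
            matched.append((shift, best))
            remaining.remove(best)  # each reference shift is used at most once
        else:
            unmatched.append(shift)
    return matched, unmatched


def is_dodgy(
    observed: List[float],
    reference: List[float],
    tolerance_ppm: float = 0.05,
    min_fraction: float = 0.8,
) -> bool:
    """Flag a spectrum when too few observed shifts agree with the reference."""
    if not observed:
        return True
    matched, _ = match_peaks(observed, reference, tolerance_ppm)
    return len(matched) / len(observed) < min_fraction


# Toy inputs: the reference values are made up for illustration only.
toy_observed = [14.12, 30.11, 66.12]
toy_reference = [14.10, 30.15, 70.00]
toy_matched, toy_unmatched = match_peaks(toy_observed, toy_reference)
toy_flag = is_dodgy(toy_observed, toy_reference)
```

In this toy case two of three shifts match within tolerance, so the spectrum falls below the assumed 0.8 threshold and would be flagged for manual or crowdsourced review.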