Digitizing documents to provide a public spectroscopy database
RSC hosts a number of platforms providing free access to chemistry related data. The content includes chemical compounds and associated experimental and predicted data, chemical reactions and, increasingly, spectral data. The ChemSpider database primarily contains electronic spectral data generated at the instrument, converted into standard formats such as JCAMP, then uploaded for the community to access. As a publisher RSC holds a rich source of spectral data within our scientific publications and associated electronic supplementary information. We have undertaken a project to Digitally Enable the RSC Archive (DERA) and as part of this project are converting figures of spectral data into standard spectral data formats for storage in our ChemSpider database. This presentation will report on our progress in the project and some of the challenges we have faced to date.
  1. Digitizing documents to provide a public spectroscopy database
     Antony Williams, Colin Batchelor, William Brouwer and Valery Tkachenko
     ACS Indianapolis
  2. How can we digitize documents?
     • As a publisher we would LOVE to bring data out of our historical archive
     • What could we do?
     • Find chemical names and generate structures
     • Find chemical images and generate structures
     • Find reactions – and make a database!
     • Find data (MP, BP, LogP) and deposit
     • Find figures and database them
     • Find spectra (and link to structures)
  3. DERA
     • Data enabling the RSC Archive
     • Data extraction from the RSC Archive
     • Difficult enhancements of the RSC Archive!!!
  4. Text Mining
     The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
  5. Text Mining (the same passage as slide 4)
  6. Text-Mining
  7. How is DERA going?
     • We are working on 21st century articles first
     • Mostly marked up with XML, more structured, easier to handle
     • 8.2 GB of data, >100k articles from 2000-2013
     • Markup will be published onto the HTML forms of the articles
     • We will iterate based on dictionaries, markup, OSCAR extraction
  8. ChemSpider Reactions
  9. ChemSpider Reactions
  10. Structure Extraction from Images
     • Structure extraction from images is old technology. It’s difficult!
     • Commercial and Open Source tools
     • CLiDE
     • OSRA
     • Imago
     • Lots of others
  11. Detailed analysis and test sets
     • Detailed analysis from GGA: http://ggasoftware.com/imago/report/report.html
  12. ESI – Text Spectra
  13. Lots of “Textual Spectra”
  14. Do we want to search text spectra?
     What do we get when we search:
     13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  15. 1 Hit. Yay!
  16. Reality
     • No one will ever perform a “spectral search” based on text searching!
     • From sample to sample, solvents, concentration and temperature will change peak positions. The chance of finding even the same peak list twice is tiny.
     • The real need is a “spectral database” where search algorithms deal with peak positions, intensities and, when appropriate, multiplicity
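The contrast on slide 16 between exact text search and a peak-aware search can be illustrated with a small sketch. This is not the search algorithm used by ChemSpider or any other product; the function name `match_fraction`, the 0.5 ppm tolerance and the "remeasured" shifts are all assumptions made up for illustration.

```python
from typing import List


def match_fraction(query: List[float], reference: List[float],
                   tol_ppm: float = 0.5) -> float:
    """Fraction of query peaks that pair with an unused reference peak
    lying within tol_ppm. A simple greedy pairing, for illustration only."""
    remaining = sorted(reference)
    hits = 0
    for shift in sorted(query):
        # Closest reference peak that has not been paired yet.
        best = min(remaining, key=lambda r: abs(r - shift), default=None)
        if best is not None and abs(best - shift) <= tol_ppm:
            hits += 1
            remaining.remove(best)
    return hits / len(query) if query else 0.0


# First five shifts from the 13C peak list on slide 14.
published = [14.12, 30.11, 30.77, 66.12, 68.49]
# Hypothetical re-measurement of the same compound (invented values).
remeasured = [14.20, 30.05, 30.90, 66.30, 68.40]

exact_text_hit = published == remeasured          # what a text search needs
tolerant_score = match_fraction(remeasured, published)
```

An exact comparison fails as soon as one shift moves by 0.01 ppm, while the tolerance-based score still pairs every peak. Intensities and multiplicities, which the slide also mentions, are left out of this sketch.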
  17. Text and Image Spectra into “Real Spectra”?
     • We can turn text into structures
     • We can turn images into structures
     • So is it possible to turn text into spectra?
  18. MestreLabs Mnova NMR Beta
  19. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  20. 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
  21. Text Conversion Approaches
     • Work in progress but early observations
     • Converted spectra are NOT what would be seen in the data
     • They are commonly GOOD approximations of 13C spectra (except intensity)
     • They are average BUT useful approximations of 1H spectra: couplings are tough, as are dispersion of spectra, overlaps etc.
     • We need to figure out workflows, structure associations, storage in ChemSpider
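One way to picture the text-to-spectrum conversion discussed on slides 17 to 21 is to place a line of fixed height and width at each listed 13C shift. The sketch below is not how Mnova performs the conversion; the Lorentzian line shape, the 0.2 ppm width and the unit intensities are assumptions, and the unit intensities are exactly why such a spectrum approximates positions but not intensities.

```python
from typing import List, Tuple


def lorentzian(x: float, center: float, width: float) -> float:
    """Unit-height Lorentzian line with full width at half maximum `width`."""
    half = width / 2.0
    return half ** 2 / ((x - center) ** 2 + half ** 2)


def synthesize(peaks_ppm: List[float], start: float, stop: float,
               n_points: int, width: float = 0.2
               ) -> Tuple[List[float], List[float]]:
    """Sum one unit-height line per listed shift over an evenly spaced axis."""
    step = (stop - start) / (n_points - 1)
    xs = [start + i * step for i in range(n_points)]
    ys = [sum(lorentzian(x, p, width) for p in peaks_ppm) for x in xs]
    return xs, ys


# Two shifts taken from the 13C peak list on slide 20.
xs, ys = synthesize([14.12, 66.12], start=0.0, stop=100.0, n_points=10001)
```

For 1H spectra the same approach would also need multiplet patterns built from the reported coupling constants, which this sketch does not attempt.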
  22. It’s exactly the WRONG WAY!
     • We should NOT be mining data out of future publications
     • Structures should be submitted “correctly”
     • Spectra should be digital spectral formats, not images
     • ESI should be RICH and interactive
  23. ESI – Text and Image Spectra
  24. ESI – Text and Image Spectra
  25. Extracted JCAMP Spectrum
  26. Turn “Figures” Into Data
  27. Plot2Txt (p2t)
     • Plot2txt.com (p2t): proprietary cloud-based service for fast, large-scale document content extraction
     • Figures in technical documents are recognized and converted into text, CSV and other formats, e.g., JCAMP, without human intervention
     • Extracted data suitable for storage/indexing, further reuse
  28. What’s the process?
     • Input: PDF document collection, split into pages, handed to p2t instances and processed
     • Output: spectra in JCAMP/CSV, molecules in BMP images
     [Diagram: a PDF is split into pages and each page is handed to a p2t instance]
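The page-level fan-out on slide 28 could be sketched as follows. Plot2Txt is proprietary, so `extract_page` is a purely hypothetical stand-in for one p2t instance, and the page dictionaries (with `doc`, `number` and an optional `figure` key) are an invented data shape; real PDF splitting is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def extract_page(page: Dict) -> Dict:
    """Hypothetical stand-in for a p2t instance working on a single page.
    Here a page 'has content' simply if it carries a 'figure' key."""
    return {
        "doc": page["doc"],
        "page": page["number"],
        "has_content": "figure" in page,
    }


def run_pipeline(pages: List[Dict], workers: int = 4) -> List[Dict]:
    """Hand every page to a worker and keep only pages that yielded content."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(extract_page, pages))  # input order is kept
    return [r for r in results if r["has_content"]]


# Invented example: one document, three pages, two of them with a figure.
pages = [
    {"doc": "esi_001.pdf", "number": 1},
    {"doc": "esi_001.pdf", "number": 2, "figure": "13C spectrum"},
    {"doc": "esi_001.pdf", "number": 3, "figure": "structure"},
]
extracted = run_pipeline(pages)
```

Because each page is processed independently, the number of workers can be raised without changing the rest of the pipeline.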
  29. Test Experiments
     • Input: 74 supplementary data documents / 3444 pages
     • Output: p2t extracted content in 1069 page instances
       − 578 molecules
         • ~10% false positives, e.g., classifies Bruker logo as chemical object
         • ~20% false negatives, e.g., missing some symbols from structure
       − 1151 spectra
         • >80% of peaks extracted to within 1-2 decimal places (ppm)
  30. Performance
     • Plot2Txt output:
       − processed on average 1.4 Mpixels / second / CPU core (Intel i7, O3 optimization in compilation)
       − 2 hours for 1069 pages, in serial
     [Plot: Mpixels/second versus page number]
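The two performance figures on slide 30 can be combined in a short back-of-the-envelope calculation. The numbers are taken from the slide; the derived per-page values and the ideal parallel scaling in `parallel_hours` are assumptions, since real scaling across cores or cloud instances is not reported.

```python
def seconds_per_page(total_hours: float, pages: int) -> float:
    """Average serial processing time per page."""
    return total_hours * 3600.0 / pages


def pixels_per_page(rate_pixels_per_s: float, secs_per_page: float) -> float:
    """Implied average page size in pixels at the reported throughput."""
    return rate_pixels_per_s * secs_per_page


def parallel_hours(total_hours: float, cores: int) -> float:
    """Wall-clock time under the assumption of perfect scaling across cores."""
    return total_hours / cores


spp = seconds_per_page(2.0, 1069)   # roughly 6.7 s per page
ppp = pixels_per_page(1.4e6, spp)   # roughly 9.4 million pixels per page
eight_core = parallel_hours(2.0, 8) # 0.25 h if scaling were ideal
```

The implied page size of roughly 9.4 Mpixels is in the range expected for a full page rendered at about 300 dpi, so the two reported figures are at least consistent with each other.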
  31. Analysis Process
     • Manual examination: viewing spectra, one at a time, and comparing extracted JCAMP versus image (TIME!)
     • Generally excellent results for high S/N; small/close peaks can be lost
     • Spectrum is “representative enough” and way more useful than just images for indexing and searching
     • Structure association MUST be checked but name-structure association can be used
  32. Prepare CONSISTENT JCAMP
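As an illustration of what "consistent" output might mean, the sketch below always writes the same header records in the same order, with shifts sorted and printed to two decimals. It is not the project's exporter. The record labels are modeled on commonly seen JCAMP-DX NMR files and should be checked against the JCAMP-DX specification before use; the function name, the placeholder ORIGIN/OWNER values and the unit intensities are assumptions.

```python
from typing import List


def to_jcamp_peak_table(title: str, nucleus: str, freq_mhz: float,
                        shifts_ppm: List[float]) -> str:
    """Write a peak list as a JCAMP-DX-style peak table with a fixed layout."""
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=5.01",
        "##DATA TYPE=NMR PEAK TABLE",
        "##ORIGIN=extracted from publication",  # placeholder value
        "##OWNER=unknown",                      # placeholder value
        f"##.OBSERVE NUCLEUS=^{nucleus}",
        f"##.OBSERVE FREQUENCY={freq_mhz}",
        "##XUNITS=PPM",
        "##YUNITS=ARBITRARY UNITS",
        f"##NPOINTS={len(shifts_ppm)}",
        "##PEAK TABLE=(XY..XY)",
    ]
    # Highest shift first, two decimals, unit intensity (assumed).
    lines += [f"{s:.2f}, 1.0" for s in sorted(shifts_ppm, reverse=True)]
    lines.append("##END=")
    return "\n".join(lines)


jcamp_text = to_jcamp_peak_table("example compound", "13C", 100.0,
                                 [14.12, 66.12, 30.11])
```

Fixing the record order and number formatting in one place makes files from different extraction runs directly comparable.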
  33. Data onto ChemSpider
  34. Summary
     • Plot2txt does recognize and extract content
     • Rapid and increasingly accurate process
     • Fails in low resolution cases, some fine structure in spectra is lost
     • Structure recognition is NEW and needs some work in order to lower false negatives
  35. Future data checking opportunity
     • How will we check data consistency?
     • How do we know the structure and the spectra match? Comparing image to spectrum is NOT enough!!!
     • Predict spectra, use spectral verification, use algorithmic checking.
     • Flag “dodgy data” and use crowdsourcing for data checking. If 10,000 spectra are online and 5% are in error, are they useful???
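A very simple instance of the "algorithmic checking" mentioned on slide 35 is to compare the number of reported 13C shifts with the number of carbon atoms in the associated structure's molecular formula: a peak list with more shifts than carbons deserves a second look. This is only one conceivable check, not the project's validation method; the function names and the example values are invented, and a flag is a prompt for review rather than a verdict, since solvent peaks or C-F couplings can add lines legitimately.

```python
import re
from typing import List


def count_carbons(formula: str) -> int:
    """Count carbon atoms in a plain molecular formula such as C20H25ClN2O.
    'C' followed by a lowercase letter (Cl, Ca, ...) is not carbon."""
    total = 0
    for digits in re.findall(r"C(?![a-z])(\d*)", formula):
        total += int(digits) if digits else 1
    return total


def flag_dodgy(formula: str, shifts_13c: List[float]) -> bool:
    """True when more 13C shifts are listed than the formula has carbons."""
    return len(shifts_13c) > count_carbons(formula)


# Invented example values for a two-carbon compound.
ok = flag_dodgy("C2H6O", [18.1, 57.8])                 # not flagged
suspicious = flag_dodgy("C2H6O", [18.1, 57.8, 77.0])   # flagged for review
```

Applied to the slide's own numbers, 5% of 10,000 spectra would be 500 records to flag, which gives a sense of the review workload.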
  36. Future Work
     • We can EASILY find text spectra in articles but have work to do regarding:
       • Pipelining of work and structure association
       • Non-truncation from wordwrapping
     • We can quite easily find spectra based on Figure Legends and have work to do regarding:
       • Pipelining of work and structure association
       • Validation of structure-spectrum association
       • Data curation
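The word-wrapping problem on slide 36 can be shown with a small sketch: a pattern applied line by line sees only the first part of a wrapped peak list, while joining the lines first recovers all of it. The regular expression is an assumption fitted to the 13C example on slide 14, not the pattern used in DERA, and it deliberately ignores 1H lists, where coupling constants would also match the number pattern.

```python
import re
from typing import Dict, Optional

# Assumed header shape: "13C NMR (<solvent>, <frequency> MHz): δ = <peaks>"
PATTERN = re.compile(
    r"13C NMR \((?P<solvent>[^,]+), (?P<mhz>\d+) MHz\): δ = (?P<body>.+)"
)


def unwrap(text: str) -> str:
    """Join wrapped lines into one line separated by single spaces."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())


def extract_13c(text: str) -> Optional[Dict]:
    """Find a textual 13C spectrum and return solvent, frequency and shifts."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Shifts carry a decimal point; labels such as CH3 or CH2 do not.
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", match.group("body"))]
    return {"solvent": match.group("solvent"),
            "mhz": int(match.group("mhz")),
            "shifts": shifts}


wrapped = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,\n"
    "benzylic methane), 30.77 (CH, benzylic methane), 66.12\n"
    "(CH2), 68.49 (CH2)"
)
truncated = extract_13c(wrapped.splitlines()[0])  # first line only
complete = extract_13c(unwrap(wrapped))           # lines joined first
```

Here the first-line-only result contains two shifts, while the unwrapped text yields all five.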
  37. Grand Target
     • I want ALL 21st century spectra converted and in ChemSpider in one year
     • I REALLY want scientists to get the value of real data over image data in terms of ESI
     • I want authors to have data validation via our web services
     • We will support IR, Raman, UV-Vis, 1D NMR and 2D…yet to come!
  38. Acknowledgments
     • Bill Brouwer – Plot2Txt.com live in 2 weeks
     • Carlos Cobas and Santi Dominguez
     • Colin Batchelor and Peter Corbett – OSCAR, text mining, dictionaries, markup
     • Valery Tkachenko, Alexey Pshenichnov and Richard Gay – ChemSpider Reactions
     • Daniel Lowe – ChemSpider Reactions data
     • ACD/Labs – Provider of spectroscopy tools