Providing access to a million NMR spectra online


Large collections of NMR spectral data can serve several purposes in teaching spectroscopy. The data can be used in lectures, as training sets for spectral interpretation and structure elucidation, and to underpin educational resources such as the Royal Society of Chemistry’s Learn Chemistry. Such resources have been available for a number of years but have been limited to rather small collections, specifically only about 3000 spectra. To expand the collection and provide richer resources for the community, we have been gathering data from various laboratories. As part of a research project, we have also used text mining to extract spectral data, in the form of text strings, from articles and patents, and used algorithms to convert those strings into spectral representations. Although these spectra are reconstructions from text representations of the original data, we are investigating their value for structure identification. This presentation will report on the process of extracting structure-spectrum pairs from text, approaches to automated spectral verification, and our intention to assemble a collection of a million NMR spectra and make it available online.

Published in: Science


  1. 1. Providing Access to a Million NMR Spectra Online… Antony Williams, Daniel Lowe, Carlos Coba, Stan Sykora, Peter Corbett, Alexey Pshenichnov, Valery Tkachenko ACS Denver, March 2015
  2. 2. Free and Easy • Everything I will show in terms of ChemSpider is available for free online today • To make it easy to “take notes” these slides will be available at: www.slideshare.net/AntonyWilliams/
  3. 3. www.ChemSpider.com
  4. 4. ChemSpider
  5. 5. JCAMP NMR Spectra
  6. 6. ChemSpider ID 24528095 H1 NMR
  7. 7. ChemSpider ID 24528095 C13 NMR
  8. 8. MS Spectra
  9. 9. ChemSpider ID 24528095 images
  10. 10. Images
  11. 11. Managing Assignments?
  12. 12. Visualization of Spectra • We would like to view “interactive spectra”
  13. 13. Jmol – Bob Hanson
  14. 14. ChemDoodle Components
  15. 15. Spectral Data • ChemSpider requires spectral data to be deposited in standard formats – JCAMP or images • All spectra available at: http://www.chemspider.com/spectra.aspx • Data are deposited on a regular basis • Students • Chemical vendors • Growing collection now
  16. 16. We want this…we need YOU!
  17. 17. Student Submissions
  18. 18. 9400 Spectra and growing http://www.chemspider.com/spectra.aspx
  19. 19. Publications & “Real Spectra” • We are turning “text into spectra”
  20. 20. ESI – Text Spectra
  21. 21. Developing Proof-of-Concept • Extract from 1976-2014 USPTO applications • Spectra by nucleus: H 975,543; C 56,536; unknown* 44,306; F 9,429; P 3,241; B 91; Si 62; Sn 22; Se 11; N 8 • *unknown: the description starts off with “NMR:” followed by a peak list (no nucleus given)
  22. 22. Extracted data • 3.2 million spectra extracted with 1.1 million associated with compounds
  23. 23. We want to find text spectra? • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) • What would be better are spectral figures – and include assignments where possible!
  24. 24. Mestrelab Mnova NMR
  25. 25. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  26. 26. 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
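
A text spectrum like the one on slide 26 can be indexed once it is split into a header and a list of shifts. The following is a minimal sketch, not the pipeline used in the project: the function name `parse_text_spectrum` is hypothetical, and it assumes the header layout "nucleus NMR (solvent, frequency MHz):" and that every decimal number after the header is a chemical shift. That second assumption would not hold for 1H strings that also carry coupling constants.

```python
import re

# Assumed header layout: "<nucleus> NMR (<solvent>, <frequency> MHz):"
HEADER = re.compile(
    r"\s*(?P<nucleus>\d+[A-Z][a-z]?)\s+NMR\s*"
    r"\((?P<solvent>[^,]+),\s*(?P<mhz>\d+(?:\.\d+)?)\s*MHz\)\s*:"
)

def parse_text_spectrum(text):
    """Split a text spectrum into header fields and a list of shifts in ppm."""
    header = HEADER.match(text)
    if header is None:
        raise ValueError("no recognisable NMR header")
    body = text[header.end():]
    # Labels such as CH3 or CH2 contain no decimal point, so they are skipped.
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent").strip(),
        "mhz": float(header.group("mhz")),
        "shifts": shifts,
    }

TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

spectrum = parse_text_spectrum(TEXT)
print(spectrum["nucleus"], spectrum["solvent"], len(spectrum["shifts"]))
```

The parsed shift list is the kind of input a spectrum-generation step would need; the assignment labels in brackets are discarded here and would need separate handling.
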
  27. 27. Sounds easy right? • Potential for errors with names • No name extracted for structure • Incomplete names extracted • Misassociation of names with structures • Incorrect conversion of names to structures
  28. 28. Name-to-structure
  29. 29. BIGGEST problem - BRACKETS • Brackets in names are a big problem: either an additional bracket or a missing bracket
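
An extra or missing bracket can be flagged before a name is passed to name-to-structure conversion. The sketch below is a generic stack-based balance check, offered as an illustration rather than as what OPSIN or the text-mining pipeline does; the function name `bracket_problems` is hypothetical, and the two damaged names are made up for the example by editing the corrected name from slide 31.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}

def bracket_problems(name):
    """Return (position, character, problem) tuples for unbalanced brackets."""
    stack = []      # opening brackets still waiting for their partner
    problems = []
    for pos, ch in enumerate(name):
        if ch in "([{":
            stack.append((pos, ch))
        elif ch in PAIRS:
            if stack and stack[-1][1] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append((pos, ch, "unmatched closing bracket"))
    # Anything left on the stack was opened but never closed.
    problems.extend((pos, ch, "unclosed opening bracket") for pos, ch in stack)
    return problems

GOOD = ("di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-"
        "{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate")
MISSING = GOOD.replace("}", "")          # hypothetical: closing brace lost
EXTRA = GOOD.replace("(4S)", "(4S))")    # hypothetical: bracket duplicated

for name in (GOOD, MISSING, EXTRA):
    print(bracket_problems(name))
```

A check like this only says that a name is damaged, not where the bracket belongs; repairing it still needs chemical knowledge of the name.
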
  30. 30. Cannot be converted • https://www.google.co.uk/patents/US20050187390A1 • 2-[2-(4′-carbamoyl-4-methoxy-biphen-2-yl)-quinolin-6-yl]-1-cyclohexyl-1H-benzoimidazole-5-carboxylic acid • OPSIN expects biphenyl-2-yl
  31. 31. OCR Error Correction • https://www.google.co.uk/patents/WO2012150220A1 • di-terf-butyl (4S)-/V-(fert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate • CaffeineFix corrected to: di-tert-butyl (4S)-N-(tert-butoxycarbonyl)-4-{4-[3-(tosyloxy)propyl]benzyl}-L-glutamate • Corrections made: f → t, /V → N, f → t
  32. 32. Sounds easy right? • Textual Spectrum descriptions have issues • Transcription errors (rare) • Subjective interpretation (very common) • Incomplete listing of shifts • No/incomplete couplings/multiplicities listed • Overlap of multiplets (very common) • Labile protons – included/excluded/partial
  33. 33. Sounds easy right? • Textual Spectrum descriptions have issues • No peak width indications – especially labiles • No peak shape indications – dynamic exchange • Presence of rotamers • Impurities included or misidentified • Solvent peak belonging to the compound • Wrong number of nuclei
  34. 34. Problems Generating Spectra • Multiplicities but no coupling constants • δ 1H NMR (300 MHz, CDCl3): 1.48 (t, 3H), 4.15 (q, 2H), 7.03 (td, 1H), 7.16 (td, 1H), 7.49 (m, 1H), 7.70 (dd, 1H), 7.88 (dd, 1H), 8.77 (d, 1H)
  35. 35. Problems Generating Spectra • PARTIAL couplings only for ca. 90% of spectra! • δ 1H NMR (300 MHz, CDCl3): 0.48-0.66 (m, 2H), 0.75-0.95 (m, 2H), 1.80 (s, 1H), 3.86 (s, 3H), 5.56 (s, 2H), 6.59 (d, J=8.50 Hz, 1H), 7.03 (dd, J=8.50, 2.15 Hz, 1H), 7.60 (s, 1H)
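
The two cases on slides 34 and 35 can be told apart by counting how many non-singlet peaks carry a J value. This is a rough sketch under stated assumptions: the name `coupling_coverage` is hypothetical, each peak is assumed to be written as "shift (multiplicity, ..., nH)" with the multiplicity first inside the brackets, and anything other than a plain or broad singlet is treated as a peak that would need couplings to be simulated.

```python
import re

# A shift or shift range followed by one bracketed description.
PEAK = re.compile(r"(\d+\.\d+(?:\s*[-–]\s*\d+\.\d+)?)\s*\(([^()]*)\)")
SINGLETS = {"s", "br s", "brs", "bs"}

def coupling_coverage(text):
    """Return (non-singlet peaks that list a J value, all non-singlet peaks)."""
    with_j = 0
    non_singlets = 0
    for _shift, detail in PEAK.findall(text):
        multiplicity = detail.split(",")[0].strip().lower()
        if multiplicity in SINGLETS:
            continue
        non_singlets += 1
        if re.search(r"\bJ\s*=", detail):
            with_j += 1
    return with_j, non_singlets

SLIDE_34 = ("1H NMR (300 MHz, CDCl3): 1.48 (t, 3H), 4.15 (q, 2H), 7.03 (td, 1H), "
            "7.16 (td, 1H), 7.49 (m, 1H), 7.70 (dd, 1H), 7.88 (dd, 1H), 8.77 (d, 1H)")
SLIDE_35 = ("1H NMR (300 MHz, CDCl3): 0.48-0.66 (m, 2H), 0.75-0.95 (m, 2H), "
            "1.80 (s, 1H), 3.86 (s, 3H), 5.56 (s, 2H), 6.59 (d, J=8.50 Hz, 1H), "
            "7.03 (dd, J=8.50, 2.15 Hz, 1H), 7.60 (s, 1H)")

print(coupling_coverage(SLIDE_34))  # no couplings listed at all
print(coupling_coverage(SLIDE_35))  # couplings listed for some peaks only
```

A ratio below one marks a spectrum whose line shapes cannot be fully reconstructed from the text alone.
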
  36. 36. Error Detection • 1H NMR (400 MHz, CDCl3) δ ppm 11.47-12.05 (1H), 7.97-8.24 (1H), 7.61-7.97 (2H), 7.28-7.61 (2H), 7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), 2.80-3.16 (2H), 2.46-2.80 (2H), 1.87-2.45 (2H), 1.35-1.77 (11H), 1.24 (18H), 0.87 (3H) • Associated with Glyceryl Monolaurate
  37. 37. Error Detection • 54 hydrogens counted in the reported spectrum. Glyceryl Monolaurate has only 30 hydrogens. • Title was: “Polymerization of Monomer 4 with Glyceryl Monolaurate” • Text-mining title missed compound: Monomer 4 is the compound below
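
The hydrogen-count check on slide 37 can be sketched in a few lines. This is an illustration, not the project's verification code: both function names are hypothetical, the integral pattern only covers the "(nH)" layout used on slide 36, and the molecular formula C15H30O4 for glyceryl monolaurate is supplied here as an input rather than taken from the slides.

```python
import re

def hydrogens_in_text_spectrum(text):
    """Sum the integrals written as '(nH)' in a text spectrum."""
    return sum(int(n) for n in re.findall(r"\((\d+)H\)", text))

def hydrogens_in_formula(formula):
    """Count H atoms in a molecular formula such as 'C15H30O4'."""
    # The lookahead skips two-letter symbols that start with H (He, Hg, Ho, ...).
    return sum(int(n or 1) for n in re.findall(r"H(?![a-z])(\d*)", formula))

REPORTED = (
    "11.47-12.05 (1H), 7.97-8.24 (1H), 7.61-7.97 (2H), 7.28-7.61 (2H), "
    "7.21 (1H), 5.27 (1H), 3.70-4.74 (8H), 2.80-3.16 (2H), 2.46-2.80 (2H), "
    "1.87-2.45 (2H), 1.35-1.77 (11H), 1.24 (18H), 0.87 (3H)"
)

found = hydrogens_in_text_spectrum(REPORTED)
expected = hydrogens_in_formula("C15H30O4")  # glyceryl monolaurate (assumed input)
if found != expected:
    print(f"mismatch: spectrum lists {found} H, structure has {expected} H")
```

A mismatch does not say which side is wrong. Here the spectrum was sound and the compound association was the error, so flagged pairs still need a second look.
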
  38. 38. Summary… • There is LOTS of work to do on error checking • It will be iterative for sure • There will be no perfect set of data as an outcome! • But there is no perfect spectral database!
  39. 39. ESI Data also contains figures
  40. 40. Publications & “Real Spectra” • We are turning text into spectra • We are turning figures into spectra
  41. 41. Early Test Experiments • Input: 74 supplementary data documents, 3444 pages • Output: Plot2Txt extracted content from 1069 pages • 1151 spectra total • >80% of peaks extracted to within 1-2 decimal places (ppm)
  42. 42. FIGURE DATA
  43. 43. “Where is the real data please?” FIGURE DATA
  44. 44. Data added to ChemSpider
  45. 45. Manual Curation Layer • ChemSpider has had a manual curation layer for >8 years • Users can annotate data on ChemSpider • We do receive useful feedback from the community on the data and are optimistic!
  46. 46. Extraction is the WRONG WAY • We should NOT have to mine data out; it should be supplied in digital form! • Structures should be submitted “correctly” • Spectra should be in digital spectral formats, not images • ESI should be RICH and interactive • Data should be open and available, with metadata and provenance
  47. 47. We can solve for Authors here Will it be used though??? YES!
  48. 48. Grand Target • Fingers crossed to get 21st century spectra converted • Spectra associated with compounds will go into ChemSpider • Spectra converted from Figures but without compound association will be captured with Figures into the Data Repository • Focus on IR, Raman, UV-Vis & 1D NMR
  49. 49. New Repository Architecture doi: 10.1007/s10822-014-9784-5
  50. 50. Future Developments • We have extracted 100s of 1000s of text strings from patents – next we go into our archive • We estimate many 1000s of figures with spectral data in our ESI and articles • We are aiming for a million spectra online… • YOU can submit your data today and share it
  51. 51. Extracting other properties • We want to get property data out also • Work in progress already…Google Patents • Extracting melting points: >300,000 already found and prediction models built • Extracting boiling points (challenging!) • Over 1 million reactions extracted • Then apply the algorithms to RSC articles
  52. 52. Acknowledgments • Daniel Lowe (NextMove Software) • Carlos Coba and Stan Sykora (Mestrelab) • William Brouwer (Plot2Txt) • Rudy Potenzone and Kevin Theisen (ChemDoodle) • Bob Hanson and Bob Lancashire (Jmol) • Valery Tkachenko and Alexey Pshenichnov (ChemSpider, Learn Chemistry, Data Repository)
  53. 53. Thank You Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
