Cheminformatics and the Structure Elucidation of Natural Products

The elucidation of natural product structures from analytical data, specifically NMR and MS, remains a major challenge. With an enormous palette of NMR experiments to choose from, supported by breakthrough hardware technologies, generating high-quality data for even the most complex natural product structures is no longer the major hurdle. The challenge is in the analysis of the data. We are in a new era of structure elucidation: one where computers, databases, and a synergy between scientists and algorithms can offer an accelerated path forward. Software tools are capable of digesting spectroscopic data to elucidate extremely complex natural products. Scientists can now elucidate chemical structures using multinuclear chemical shift data and correlation data from an array of 2D NMR experiments, and can use existing data sets for dereplication and computer-assisted structure elucidation. With the explosion of online data, especially in public databases such as PubChem and ChemSpider, many tens of millions of chemical structures are available to seed fragment databases for the elucidation process. This presentation will provide an overview of how cheminformatics and chemical databases have been brought together to assist in the identification of natural products. It will include an examination of state-of-the-art developments in Computer-Assisted Structure Elucidation.

Published in: Science

  1. 1. Cheminformatics and the Structure Elucidation of Natural Products (or can Big Data help elucidate structures!) Antony Williams 5th Brazilian Conference of Natural Products October 27th 2015 ORCID ID:0000-0002-2668-4821
  2. 2. A Bit About Me… • NMR spectroscopist by training • Chief Science Officer, ACD/Labs Software • One of the founders of the ChemSpider database • VP for Cheminformatics at RSC
  3. 3. Why is this important? • Structure verification and elucidation of 1000s of compounds • NMR predictors with >2,000,000 shifts & Computer-Assisted Structure Elucidation • Made >20,000,000 chemical compounds & data freely accessible to the community • Grew the dataset to >30,000,000 chemicals & used it for structure elucidation • Big data can assist structure identification
  4. 4. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  5. 5. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  6. 6. …for each natural product dereplicated, at an average cost of $300 … a savings of $50,000 is incurred in isolation and identification time.
  7. 7. Dereplication • There are ca. 200,000 known natural products • The chance for rediscovery is very high! • We need efficient “dereplication” processes • Most general approach – acquire analytical data and search existing databases…
  8. 8. Scale of Dereplication Exercise • 0.5 – 2 mg extract • 4 mL agar slope • Petri dish • Bioassay & HPLC/UV/MS/NMR evaluation • 100 mg sponge • With gratitude to John Blunt
  9. 9. Approaches to Dereplication • Desirable to know: Taxonomy of organism • For each compound isolated: Molecular Wt/formula, UV Spectrum, 1H NMR Spectrum [13C NMR Spectrum if possible] • Identify as known or new compound. If known STOP. • If new then acquire data: 1D and 2D NMR array, MS with fragmentation, IR, [α]D, ORD • Fully elucidate structure
  10. 10. What Databases are Available? • Public: ChemSpider, CSLS, PubChem, NMRShiftDB, Naproc-13, SuperNatural, SDBS • Private: All Pharma, GVK Biosciences NPD, UC UV DB, DTU UV DB, Marine NP DB, GVK NP DB, InterMed UV DB, InterMed NMR DB, Novartis IR DB, Natl. Centre Plant Metabol., CH-NMR-NP • Commercial: SciFinder, SpecInfo, (Crossfire) Beilstein, Crossfire Gmelin, Reaxys, ACD Spectral Libraries, NaprAlert, Dict. Natural Products, Dict. Marine Nat. Prods, AntiBase, MarinLit, AntiMarin • With gratitude to John Blunt
  11. 11. Nominal Mass Searching [Figure: TOF MS ES+ mass spectrum of sample PU10-F2, m/z 220–560, with the M+H peak labeled] • MW 480, MF = C28H36N2O5 • Search MW = 480 in Dict. Nat. Prod.: 562 hits out of 230,000 compounds!!!
  12. 12. Molecular Formula Searching Search MF=C28H36N2O5 in Dict. Nat. Prod. 2 hits out of 230,000 compounds!!! Compare UV spectrum and 1H NMR features
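The step from nominal-mass searching to molecular-formula searching in the two slides above can be sketched in a few lines of Python. This is only an illustration: the four-entry `DATABASE` and its compound names are hypothetical placeholders, not records from the Dictionary of Natural Products, and the isotope masses are standard rounded values.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes (standard values, rounded).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C28H36N2O5'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def nominal_mass(formula):
    """Integer mass computed from integer isotope masses."""
    return sum(NOMINAL[el] * n for el, n in parse_formula(formula).items())


def monoisotopic_mass(formula):
    """Exact mass computed from the most abundant isotopes."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


# Hypothetical mini-database: (placeholder name, molecular formula).
DATABASE = [
    ("compound_A", "C28H36N2O5"),
    ("compound_B", "C30H40O5"),
    ("compound_C", "C27H32N2O6"),
    ("compound_D", "C30H50O5"),
]


def search_nominal(mass):
    """All entries whose nominal mass equals the query."""
    return [name for name, formula in DATABASE if nominal_mass(formula) == mass]


def search_formula(formula):
    """All entries whose elemental composition equals the query."""
    target = parse_formula(formula)
    return [name for name, f in DATABASE if parse_formula(f) == target]


print(search_nominal(480))           # several isobaric candidates
print(search_formula("C28H36N2O5"))  # formula search is far more selective
```

Even in this toy set, three different formulas share the nominal mass 480, which mirrors why the formula search on the slide cuts 562 hits down to 2.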
  13. 13. How many isomers for a formula? • C10H17Br2ClO2: 50,502,293 • C15H22O2: 138,136,211,624 • C15H20O: 37,568,150,635 • C12H12O3: 68,930,547,646 • C13H20O3: 14,431,269,166 • C11H12N2O2: 3⋅10^11 < n < 10^12
  14. 14. How many isomers for a formula? • C10H17Br2ClO2: 50,502,293 • C15H22O2: 138,136,211,624 • C15H20O: 37,568,150,635 • C12H12O3: 68,930,547,646 • C13H20O3: 14,431,269,166 • C11H12N2O2: 3⋅10^11 < n < 10^12
  15. 15. 1H NMR spectrum, CD3OD [Figure: spectrum, ca. 1–7 ppm, annotated with integrals] • 1 x triplet methyl • 3 x methoxy • 3 x olefinic H • solvent
  16. 16. MarinLit Dereplication
  17. 17. NMR Features Dereplication • 1 of 5 hits from 230,000 compounds • The ONLY hit if MW = 480 is included
  18. 18. MarinLit Enhanced Features
  19. 19. 1H/13C Predicted Spectra
  20. 20. HSQC-DEPT Predicted Spectrum
  21. 21. Dereplication in MarinLit Online • Can be achieved using: • 1H NMR features, e.g. number of Me groups • 13C and 1H chemical shifts • Molecular formula (complete or partial) • UV maxima • Exact mass • OR a combination of any or all of the above.
  22. 22. 1H NMR Spectrum - new or known? • 9 Me groups are obvious (from integrals) • Search of MarinLit: 9 Me gave 628 answers
  23. 23. Characterizing the spectrum further • 4 Me singlets • 4 Me doublets • 1 OMe singlet • Aromatic protons • Search of MarinLit for 9 total methyls (4 singlets, 4 doublets, 1 OMe) gave 39 answers
  24. 24. COSY spectrum • A broad singlet coupled/on-coupled to 2 doublets • This implies a 1,2,4-trisubstituted aromatic system
  25. 25. • 4 Me singlets • 4 Me doublets • 1 OMe singlet • Search for 4 singlets, 4 doublets, 1 OMe, 1,2,4-trisubstituted aromatic: 2 answers only
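The stepwise narrowing in the preceding slides (628, then 39, then 2 answers) amounts to adding one feature constraint at a time. A minimal sketch, assuming a hypothetical list of records with invented feature keys (`me_singlets`, `me_doublets`, `ome`, `aromatic`); MarinLit's real schema and query interface are not shown in the source and are not modelled here.

```python
# Hypothetical feature records; keys and values are illustrative only.
RECORDS = [
    {"name": "rec_1", "me_singlets": 4, "me_doublets": 4, "ome": 1, "aromatic": "1,2,4-trisubstituted"},
    {"name": "rec_2", "me_singlets": 4, "me_doublets": 4, "ome": 1, "aromatic": "1,4-disubstituted"},
    {"name": "rec_3", "me_singlets": 6, "me_doublets": 2, "ome": 1, "aromatic": None},
    {"name": "rec_4", "me_singlets": 2, "me_doublets": 1, "ome": 0, "aromatic": None},
]


def total_methyls(record):
    """Total methyl count as read from the 1H integrals."""
    return record["me_singlets"] + record["me_doublets"] + record["ome"]


def search(records, total_me=None, **features):
    """Keep records that satisfy every supplied constraint."""
    hits = []
    for record in records:
        if total_me is not None and total_methyls(record) != total_me:
            continue
        if all(record.get(key) == value for key, value in features.items()):
            hits.append(record["name"])
    return hits


# Each added feature can only keep or shrink the hit list.
step_1 = search(RECORDS, total_me=9)
step_2 = search(RECORDS, total_me=9, me_singlets=4, me_doublets=4, ome=1)
step_3 = search(RECORDS, total_me=9, me_singlets=4, me_doublets=4, ome=1,
                aromatic="1,2,4-trisubstituted")
print(len(step_1), len(step_2), len(step_3))
```

Because the constraints are combined with a logical AND, the order in which features are added does not change the final hit list, only how quickly it shrinks.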
  26. 26. Comparison of NMR data confirmed that the unknown had this structure
  27. 27. Commercial Assigned Databases >320,000 assigned chemical structures >2,500,000 shifts
  28. 28. Searching Assigned Databases • mI = 306.1 – 306.2 • 591/322,319 hits
  29. 29. Searching Assigned Databases • 10 13C shifts to ±3.0 ppm • 5 1H shifts to ±0.3 ppm • 7 hits – very different
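A tolerance-based shift search like the one on this slide can be sketched as follows. The library entries, their names, and the query shifts are hypothetical; only the ±3.0 ppm (13C) and ±0.3 ppm (1H) tolerances come from the slide. The matching rule used here (each query shift must pair with a distinct library shift) is an assumption about how such a search might work, not a description of any commercial product.

```python
def shifts_match(query, record, tolerance):
    """True if every query shift pairs with a distinct record shift within tolerance.

    Both lists are sorted, so a greedy left-to-right pairing is sufficient
    when the tolerance is the same for every shift.
    """
    record_sorted = sorted(record)
    j = 0
    for shift in sorted(query):
        # Skip record shifts that are too low to match this or any later query shift.
        while j < len(record_sorted) and record_sorted[j] < shift - tolerance:
            j += 1
        if j == len(record_sorted) or record_sorted[j] > shift + tolerance:
            return False
        j += 1  # consume the matched record shift
    return True


# Hypothetical assigned-shift library (ppm).
LIBRARY = {
    "entry_1": {"13C": [14.1, 22.7, 31.9, 128.5, 170.2], "1H": [0.90, 1.30, 2.30, 7.20]},
    "entry_2": {"13C": [20.5, 55.3, 114.0, 159.8], "1H": [2.30, 3.80, 6.90]},
}


def search(library, c13_query, h1_query, c13_tol=3.0, h1_tol=0.3):
    """Names of entries matching all 13C and 1H query shifts."""
    return [
        name
        for name, entry in library.items()
        if shifts_match(c13_query, entry["13C"], c13_tol)
        and shifts_match(h1_query, entry["1H"], h1_tol)
    ]


hits = search(LIBRARY, c13_query=[14.0, 23.0, 129.9], h1_query=[0.95, 7.25])
print(hits)
```

A wide 13C window tolerates solvent and prediction differences at the cost of more hits; tightening `c13_tol` trades recall for selectivity.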
  30. 30. Including 15N, 19F and 31P data
  31. 31. Experimental vs. Experimental Differences between C13 shifts are generally small
  32. 32. Experimental vs. Experimental Differences between C13 shifts are generally small
  33. 33. Searching experimental data 30 seconds from peak-picking to suggested molecules
  34. 34. Experimental vs. Predicted Differences between exp. and pred. C13 shifts can be larger – useful to limit number of shifts searched
  35. 35. The Agenda… • Dereplication using prior knowledge • Increasing prevalence of free online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  36. 36. Online content also available! NMRShiftDB http://nmrshiftdb.nmr.uni-koeln.de/
  37. 37. Online content also available! NMRShiftDB http://nmrshiftdb.nmr.uni-koeln.de/
  38. 38. Online content also available! www.nmrdb.org
  39. 39. Mining Big Data for Natural Products??? • ~35 million chemicals and growing • Data sourced from ~500 different sources • Structure-centric hub for web-searching • Already used by many mass spectrometry software packages for structure ID
  40. 40. ChemSpider Interface – no NMR
  41. 41. 26 of 35,000,000 Hits, Ranked by # of References
  42. 42. Top Ranked Hit
  43. 43. What can I find on ChemSpider?
  44. 44. What can I find? All for free…
  45. 45. NMR Predictions on ChemSpider Data for Dereplication
  46. 46. NMR Predictions on ChemSpider Data for Dereplication • Compound 1, Compound 2 • fC = full composition (C0-100 H0-100 O0-20 N0-10) • lC = limited composition (C10-30 H25-40 O0-15 N0-5)
  47. 47. Large Fragments can be found • Top 2 hits searched by 1H chemical shifts. • Hits ranked by the 1H NMR deviation and filtered with C10-30 H25-40 O0-15 N0-5, Good List and Bad List. • Good List was determined from 1H shifts, integrals and 1H-1H COSY
  48. 48. Marine Natural Product Example • Search of nominal mass 490-491 gave the following results: ChemSpider: 46,234; SciFinder: 171,904; Dictionary of Natural Products: 537; Dictionary of Marine Natural Products: 90; MarinLit: 94; AntiMarin: 131 • Molecular formula obtained, C30H50O5 (490.3658): ChemSpider: 208; SciFinder: 2,366; Dictionary of Natural Products: 238; Dictionary of Marine Natural Products: 43; MarinLit: 43; AntiMarin: 48
  49. 49. Marine Natural Product Example • Search of nominal mass 490-491 gave the following results: ChemSpider: 46,234; SciFinder: 171,904; Dictionary of Natural Products: 537; Dictionary of Marine Natural Products: 90; MarinLit: 94; AntiMarin: 131 • Molecular formula obtained, C30H50O5 (490.3658): ChemSpider: 208; SciFinder: 2,366; Dictionary of Natural Products: 238; Dictionary of Marine Natural Products: 43; MarinLit: 43; AntiMarin: 48 • Focused Datasets Valuable
  50. 50. Approaches to Dereplication • Desirable to know: Taxonomy of organism • For each compound isolated: Molecular wt/formula, UV Spectrum, 1H NMR Spectrum [13C NMR Spectrum] • Identify as known or new compound. If known STOP. • If new then acquire data: 1D and 2D NMR array, MS with fragmentation, IR, [α]D, ORD • Fully elucidate structure
  51. 51. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  52. 52. Modern NMR Technologies • Even a basic array of 1D/2D experiments can provide the relevant data in the majority of cases • The past few years have seen improvements in: • Hardware: Magnets, Probes and RF • Software: Data acquisition and processing • Pulse sequences to probe direct and (very) long-range homo- and heteronuclear correlations
  53. 53. Magnetic Field Strength over time
  54. 54. NMR Developments – 30 years of improvements • 1984 – First report of cryogenic NMR probe • 1986 – HMBC experiment reported • 1991 – First commercial 3 mm gradient inverse probes • 1996 – ADEQUATE NMR experiments first reported • 1996 – 1H-15N HMBC applications reported • 1998 – Commercial 1.7 mm gradient inverse triple probes • 1999 – First commercial cryogenic NMR probes delivered • 2000 – First 3 mm prototype cryoprobe developed • 2006 – First 1.7 mm MicroCryoProbes™ delivered • 2009 – Pure shift HSQC experiments developed • 2014 – 1,1- and 1,n-HD-ADEQUATE experiments • With gratitude to Gary E. Martin
  55. 55. COSY Correlations • Vicinal H-H couplings • Geminal H-H couplings [Figure: atom-numbered structure annotated with COSY correlations]
  56. 56. HMBC Correlations (8 Hz Optimized) [Figure: atom-numbered structure annotated with HMBC correlations]
  57. 57. Always new sequences coming: 1,1- and 1,n-HD-ADEQUATE • Examples show all three scenarios for 1,1- and 1,n-HD-ADEQUATE correlations for cryptospirolepine.
  58. 58. Adoption can take a long time: HSQC vs. HMQC took > 20 years! • HMQC is an older technique and affords lower F1 resolution. • HSQC is a better technique but SLOWLY supplanted HMQC! • Reports per year range (#HMQC / #HSQC): 1990-94: 52 / 10; 1995-99: 177 / 39; 2000-04: 346 / 111; 2005-09: 358 / 266; 2010-14: 345 / 423; Totals: 1278 / 849 • From: A. Williams, G.E. Martin, & D.J. Rovnyak, “Increasing the Adoption of Advanced Techniques for the Structure Elucidation of Natural Products,” from Modern NMR Approaches to the Structure Elucidation of Natural Products, vol. 1, A.J. Williams, G.E. Martin, and D.J. Rovnyak, Eds., RSC, London, 2015.
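The adoption trend in the table above is easier to see as a share of reports per period. The short calculation below uses only the counts given on the slide; the variable names are illustrative.

```python
# (year range, #HMQC reports, #HSQC reports) as listed on the slide.
REPORTS = [
    ("1990-94", 52, 10),
    ("1995-99", 177, 39),
    ("2000-04", 346, 111),
    ("2005-09", 358, 266),
    ("2010-14", 345, 423),
]

total_hmqc = sum(hmqc for _, hmqc, _ in REPORTS)
total_hsqc = sum(hsqc for _, _, hsqc in REPORTS)

# Fraction of reports in each period that used HSQC rather than HMQC.
hsqc_share = {period: hsqc / (hmqc + hsqc) for period, hmqc, hsqc in REPORTS}

# First period in which HSQC reports outnumber HMQC reports.
crossover = next(period for period, hmqc, hsqc in REPORTS if hsqc > hmqc)

for period, share in hsqc_share.items():
    print(f"{period}: HSQC share {share:.1%}")
print("Totals:", total_hmqc, total_hsqc, "| crossover:", crossover)
```

The HSQC share rises in every period, but it only passes one half in the final period, which is the point the slide makes about slow adoption.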
  59. 59. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  60. 60. AI Research in 1965…
  61. 61. 50 years of iterative development DENDRAL NMR-SAMS SENECA SpecInfo ACD/Labs CMC-SE LSD Others…
  62. 62. Computer Assisted Structure Elucidation: Methodology • Interpret data to extract knowledge: Molecular Formula, Integrals, Chemical shifts, Multiplicity, Connectivity, Known fragments, Known exclusions • Search structure space to derive all structures • Rank-order based on set criteria: Predicted chemical shift, Mass Spec Fragmentation
  63. 63. Remember how many isomers • C10H17Br2ClO2: 50,502,293 • C15H22O2: 138,136,211,624 • C15H20O: 37,568,150,635 • C12H12O3: 68,930,547,646 • C13H20O3: 14,431,269,166 • C11H12N2O2: 3⋅10^11 < n < 10^12
  64. 64. Computer-Aided Structure Elucidation • Eliminate “superfluous” isomers by imposing different structural constraints • Structural constraints are from: • Spectral data of various types: NMR shifts/multiplicity constrain atom types and correlations constrain connectivities; MS constrains formula and fragments; IR constrains functional groups • Prior information: sample origin • Chemical rules: valence, ring size, charge, etc.
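One of the simplest constraints that follows from the molecular formula and standard valence rules is the number of rings plus double bonds (double bond equivalents, DBE). The sketch below uses the textbook formula DBE = C − (H + X)/2 + N/2 + 1 and assumes standard valences (C 4, N 3, O 2, H and halogens 1); it is a generic illustration, not a description of how any particular CASE program applies its constraints.

```python
import re

HALOGENS = ("F", "Cl", "Br", "I")


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C10H17Br2ClO2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def double_bond_equivalents(formula):
    """Rings plus double bonds, assuming standard valences.

    Divalent atoms such as O do not change the count.
    """
    counts = parse_formula(formula)
    carbons = counts.get("C", 0)
    hydrogens = counts.get("H", 0)
    nitrogens = counts.get("N", 0)
    halogens = sum(counts.get(el, 0) for el in HALOGENS)
    return carbons - (hydrogens + halogens) / 2 + nitrogens / 2 + 1


# Formulas taken from elsewhere in this presentation.
for formula in ("C20H30O4", "C28H36N2O5", "C10H17Br2ClO2"):
    print(formula, double_bond_equivalents(formula))
```

Any candidate structure whose rings and multiple bonds do not add up to this number can be discarded before spectral data are even considered.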
  65. 65. 2D NMR spectra: Extraction of Structural Information: COSY/HMBC • COSY: 1H-1H coupling through 3 bonds • HMBC: 1H-13C coupling through 2/3 bonds [Figure: structure annotated with 13C chemical shifts and COSY/HMBC correlations]
  66. 66. 1D & 2D NMR Synchronized Processing The Software displays correlations for assigned spectra and structures, and highlights correlations that are likely to be erroneous.
  67. 67. Molecular Connectivity Diagram (MCD) • Molecular Formula C20H30O4 • Atoms (13C shift, ppm): CH3 17.60(fb), CH2 18.09(fb), CH3 18.13(fb), CH2 19.10(fb), CH2 19.50(fb), CH2 19.50(fb), CH3 20.20(fb), CH2 28.20(fb), CH2 29.20(fb), CH3 31.40(fb), C 33.40(fb), CH 34.30(fb), CH2 41.20(fb), CH 42.20(fb), C 61.20, CH 63.30, C 67.80, C 68.10, C 80.40, C 174.10, OH, O, O, O • Use the spectroscopist's experience to add bonds: Create C=O, COOH, Ring systems, etc.
  68. 68. Not that easy though… “Nonstandard Correlations” • “Standard” and “Nonstandard” correlations are experimentally indistinguishable • If 2D NMR data contain both “Standard” and “Nonstandard” correlations we see contradictions in interpretation [Figure: standard COSY and HMBC correlation pathways]
  69. 69. Non-standard Correlation Example [Figure: atom-numbered structure (atoms 1–22) showing two 6-bond correlations]
  70. 70. Strychnine Non-standard Correlations [Figure: atom-numbered strychnine structures showing 2JCH, 3JCH, 4JCH and 5JCH correlations]
  71. 71. Structure Generation combined with Structural and Spectral Filtering • Internal Badlist • User Badlist • User Goodlist • Rings: Obligatory, Forbidden • Bredt’s Rule • Maximum Match Factor • Filter Tolerance: Tight, Medium, Loose
  72. 72. Selection of the Preferable Structure • Remove duplicates • 1H and 13C shift calculation for all output structures • Rank structures in ascending order of average chemical shift deviation (d) • Structure with minimum d is the most probable.
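The ranking step described on this slide can be sketched as below. Everything here is an assumption made for illustration: the candidate identifiers and predicted shifts are invented, the four experimental shifts are a small subset chosen for the example, and shifts are paired by sorted order rather than by atom assignment, which is a simplification of what a real shift-prediction workflow would do.

```python
def average_deviation(experimental, predicted):
    """Mean absolute deviation (ppm) between sorted experimental and predicted shifts."""
    if len(experimental) != len(predicted):
        raise ValueError("shift lists must have the same length")
    pairs = zip(sorted(experimental), sorted(predicted))
    return sum(abs(exp - pred) for exp, pred in pairs) / len(experimental)


def rank_candidates(experimental, candidates):
    """Remove duplicate structures, then rank by ascending average deviation d."""
    unique = {}
    for structure_id, predicted in candidates:
        unique.setdefault(structure_id, predicted)  # keep first copy of each structure
    scored = [(sid, average_deviation(experimental, pred)) for sid, pred in unique.items()]
    return sorted(scored, key=lambda item: item[1])


# Illustrative 13C shifts (ppm) and hypothetical candidates with predicted shifts.
EXPERIMENTAL = [17.6, 20.2, 63.3, 174.1]
CANDIDATES = [
    ("candidate_B", [25.0, 30.0, 70.0, 165.0]),
    ("candidate_A", [18.0, 21.0, 62.5, 173.0]),
    ("candidate_A", [18.0, 21.0, 62.5, 173.0]),  # duplicate output structure
]

ranking = rank_candidates(EXPERIMENTAL, CANDIDATES)
for structure_id, d in ranking:
    print(f"{structure_id}: d = {d:.3f} ppm")
```

The structure at the top of `ranking` has the smallest d and would be the preferred candidate; a small gap between the first and second entries would signal that more data are needed.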
  73. 73. Low Structural Information in 2D Spectral Data: Use Fragment DB • Number of observed 2D NMR correlations is smaller than expected • Deficit of hydrogen atoms results in a low number of correlations • Search in Fragment Library using the 13C NMR spectrum and embed in the MCD
  74. 74. Example of Fragment Usage • Symmetric molecule, C56H78O12S • Ashwagandhanolide • Small number of correlations [Figure: structure and molecular connectivity diagram annotated with 1H chemical shifts]
  75. 75. 13C NMR Fragment search - 5524 found • Fragment #1: C17H22O2 [Figure: experimental (Exp.) vs. fragment (Frag.) 13C comparison]
  76. 76. Solution • 960 MCDs were created using fragment #1 • Structure Generation from 960 MCDs gave 24 structures after filtering and 6 output structures. • Total time was tg = 29 min 30 s
  77. 77. Compare Hypotheses with Data
  78. 78. Wrong Molecular Formula • Only CHNO in formula assumed • FAB-MS: C31H54N4O8 • ESI-MS: C31H54N4SO6 • J. Am. Chem. Soc., 2001, 123, 10870-10876. • Tetrahedron Letters, 2002, 43, 5707-5710.
  79. 79. Wrong Molecular Formula • Only CHNO in formula assumed • FAB-MS: C31H54N4O8 • ESI-MS: C31H54N4SO6 • J. Am. Chem. Soc., 2001, 123, 10870-10876. • Tetrahedron Letters, 2002, 43, 5707-5710.
  80. 80. Wrong Initial Suggestion • 13C shift at 173.50 ppm is O-C=O group • 13C signal at 173 ppm led to COO bias • Data compared to a similar compound • J. Nat. Prod., 2000, 63, 1677-1678. • J. Nat. Prod., 2003, 66, 716-718.
  81. 81. Wrong Initial Suggestion • 13C shift at 173.50 ppm is O-C=O group • 13C signal at 173 ppm led to COO bias • Data compared to a similar compound • J. Nat. Prod., 2000, 63, 1677-1678. • J. Nat. Prod., 2003, 66, 716-718.
  82. 82. Misinterpretation of 2D NMR Data • Presence of a guanidine group substituted with 2 x CH3 groups was hypothesized. • Absence of an expected HMBC correlation from methyls to C(159.0) ignored. • Misinterpreted HMBC signal • Verified by X-ray crystallography • J. Org. Chem., 2004, 69, 9025-9029. • J. Org. Chem., 2008, 73, 8719-8722.
  83. 83. Misinterpretation of 2D NMR Data • Presence of a guanidine group substituted with 2 x CH3 groups was hypothesized. • Absence of an expected HMBC correlation from methyls to C(159.0) ignored. • Misinterpreted HMBC signal • Verified by X-ray crystallography • J. Org. Chem., 2004, 69, 9025-9029. • J. Org. Chem., 2008, 73, 8719-8722.
  84. 84. J. Cheminf. 2012, 4:5
  85. 85. Number of Skeletal Atoms J. Cheminf. 2012, 4:5
  86. 86. MW Distribution J. Cheminf. 2012, 4:5
  87. 87. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  88. 88. New Experiments Influence CASE! Cervinomycin [Figure: atom-numbered cervinomycin structure and its molecular connectivity diagram]
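  89. 89. The Influence of Data on Elucidation Time: Cervinomycin • Input data sets: COSY/HSQC; 1H-13C HMBC (8 Hz, 4 Hz); 1H-13C LR-HSQMBC (4 Hz, 2 Hz) • Structure generation time and # of structures generated, by number of data sets used: 3 data sets: 49 h, 314 structures; 4 data sets: 37 h, 4 structures; 4 data sets: 150 s, 7 structures; 5 data sets: 104 s, 1 structure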
  90. 90. New Experiments: Cryptospirolepine over 20 years! • Inexplicably, the vinyl proton has no evident 2JCH correlation to the carbonyl! DFT predicted ~0.3 Hz coupling! • Synergistic interpretation and CASE applied to an array of 2D data elucidated this compound. Included new 1,1-ADEQUATE and 1,n-ADEQUATE data. • The absence of a 2JCH correlation from the vinyl proton to the adjacent carbonyl is perplexing. • A new long-range heteronuclear correlation NMR experiment was acquired: LR-HSQMBC.
  91. 91. Key 1,1-HD-ADEQUATE Correlations • Experiment was optimized for 60 Hz • Typical range for 1JCC sp2 couplings is 60-75 Hz • The 2JCC coupling from C13 to C1/C11’ was calculated (DFT) to be 15.4 Hz, which would give a calculated intensity of 0.16 in this experiment.
  92. 92. Key 1,1-HD-ADEQUATE Correlations • Experiment optimized for 7 Hz • Typical range for nJCC couplings is approximately 2-7 Hz • 2JCC correlations across carbonyls are typically 10-16 Hz • Correlations were observed, including the 1JCC correlations from C13 to C2 and C13a that unavoidably “leak” into all 1,n-ADEQUATE spectra.
  93. 93. Revision of the [7.5.5] Core of Cryptospirolepine to a [6.6.5] System • Based on correlations from the 1,1- and 1,n-HD-ADEQUATE spectra, the [7.5.5] core shown in red was revised to a [6.6.5] system. • The γ-lactam was rearranged to a dehydropiperidinone. • Key correlations were the 1JCC correlations from the vinyl CH to the flanking carbonyl and quaternary carbons.
  94. 94. Could CASE methods sort out the structure? • Input experiments (parameters in the order listed on the slide: 60 Hz, 7 Hz, 8 Hz, 4 Hz, 15 ms, 2 Hz, 4 Hz, 2 Hz): 1,1-ADEQUATE, 1,n-ADEQUATE, 1H-13C HMBC, IDR HSQC-TOCSY, 1H-13C LR-HSQMBC, 1H-15N LR-HSQMBC • Structure generation time and # of structures, by number of data sets used: 1 data set: >420 h, >10,400 structures; 3 data sets: 140 s, 6816; 4 data sets: 142 s, 3360; 4 data sets: 40 s, 522; 5 data sets: 45 s, 258; 8 data sets: 7 s, 24 • The modern “1993” data set used as input failed to generate the structure in a 3-week calculation! • More complete input data reduced the calculation to seconds!
  95. 95. The Agenda… • Dereplication using prior knowledge • The increasing prevalence of online content • Data generation is not the issue. Analysis is. • Computer-assisted structure elucidation • New experiments to improve elucidation • Rethink data-sharing through publications!
  96. 96. Errors in published structures…
  97. 97. ESI – Text Spectra
  98. 98. ChemSpider ID 24528095 H1 NMR
  99. 99. ChemSpider ID 24528095 C13 NMR
  100. 100. ChemSpider ID 24528095 HHCOSY
  101. 101. ChemSpider ID 24528095 HSQC
  102. 102. ChemSpider ID 24528095 HMBC
  103. 103. What would it take??? • PDFs containing text descriptions of spectra are problematic for reinterpretation of data • Publishers should host at least high resolution images of all spectra • Really we need the data files!!!
  104. 104. Conclusions • Dereplication is increasingly feasible using online content • Analysis of data is generally a bigger issue than data generation itself • Computer-assisted structure elucidation works • Data-sharing associated with publications needs rethinking
  105. 105. Books of Interest?
  106. 106. Acknowledgements • RSC/ChemSpider/MarinLit: John Blunt, Serin Dabb, Valery Tkachenko • NMR (Book) Collaborators: Gary Martin, David Rovnyak • ACD/Labs (Structure Elucidator): Mikhail Elyashberg, Kirill Blinov, Arvin Moser, Patrick Wheeler
  107. 107. Thank you ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
