
Validating the ChemSpider Open Spectral Database NMR Collection using ACD/Labs Verification Algorithms



ChemSpider is a free-access online database of over 26 million chemical compounds drawn from over 400 sources, including government laboratories, chemical vendors, public resources and publications. ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. ChemSpider has aggregated over 3000 high quality NMR spectra and continues to expand as the community deposits additional data. The majority of spectral data is licensed as Open Data, allowing it to be downloaded and reused. Members of the community can validate the data, but an automated validation was also undertaken using the NMR prediction and verification routines in ACD/Labs software. The dataset is a “real world” dataset containing contributions from a number of laboratories around the world, with data of varying quality including S/N issues, misreferencing, impurities, etc. This work reports on the batch analysis of the ChemSpider spectral data, including the identification of multiple errors in the spectra.

Published in: Technology


Validating the Open Spectral Database NMR Collection using ACD/Labs Verification Algorithms

Ryan Sasaki (1), Sergey Golotvin (2) and Antony Williams (3)
(1) Advanced Chemistry Development, Inc. (ACD/Labs)
(2) ACD Moscow Inc., Moscow, Russian Federation
(3) ChemSpider, Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, North Carolina 27587, USA

Introduction

In parallel with the development of new 2D NMR techniques, new ...

ChemSpider is a free online database of over 26 million unique chemical compounds sourced from over 400 different sources including government laboratories, chemical vendors, and public resources. ChemSpider allows its users to deposit data including structures, properties, links to external resources, and various forms of spectral data. ChemSpider has aggregated over 2000 high quality NMR spectra and continues to expand as the community deposits additional data. The data are generally validated by the community, but a batch-wise verification of all 1D 1H and 13C NMR spectral data in the database was performed using ACD/Labs NMR verification software.

Sources of Spectral Data

Databases of structures with associated NMR assignments are available as commercial or open data. However, databases of NMR spectral curves are less common and generally limited to metabonomics data (for example, the BMRB [1] and DrugBank [2]). One component of the ChemSpider project is to gather, host, and make available a structure-searchable database of spectral data: 1D/2D NMR, IR, Raman, and MS. The majority of data are deposited by users of ChemSpider. Submission of spectra as JCAMP-DX files (for 1D spectra) or as images/PDF (for 1D or 2D spectra) is supported. To deposit a spectrum, a user simply searches ChemSpider for the associated structure and uploads the JCAMP-DX or image form of the spectrum. Community-based curators validate and annotate the data as appropriate to ensure that only the highest quality data are available in the database. As the data collection grew, a batchwise validation of the data quality was required, and ACD/Labs NMR verification software was used to perform the analysis.

ACD/Labs NMR Verification Routines

The ACD/Labs approach to 1H NMR verification consists of two steps:

1) The experimental spectrum with an attached chemical structure is automatically processed and analyzed. Analysis includes automated peak picking, integration, and multiplicity analysis (extraction of coupling constants and coupling patterns). In addition, all extraneous signals present in the spectrum are identified (i.e., solvent, reference, known admixtures, etc.).

2) Chemical shift, integration, and multiplicity information are predicted for the proposed chemical structure and compared with the related properties extracted from the experimental spectrum. A comparison is then made based on an auto-assignment procedure [3] that finds the best possible fit as the minimum of a special objective function.

A similar approach is taken for 13C NMR verification, but it compares the experimental and predicted chemical shift values and peak heights. In both cases the output of each verification procedure is a Match Factor metric (0-1) that indicates the level of consistency between the proposed structure and the experimental spectrum. For the purpose of the 1H NMR study, structure-spectrum pairs that generate a match factor >0.8 were considered consistent. For 13C NMR, a match factor of >0.75 was considered consistent.

Analysis of Data

The ACD/Labs automated 1H and 13C verification routines were run on the NMR spectra dataset from ChemSpider. The results of this procedure are shown in Figure 1 below:

[Figure 1: two pie charts. Legend: Consistent, Ambiguous, Inconsistent. Panel A segments: 77%, 16%, 7%. Panel B segments: 67%, 25%, 8%.]

Figure 1: (A) The ACD/Labs 1H verification methodology suggests that 77% of the 744 NMR spectra submitted to ChemSpider were consistent with the proposed chemical structure. (B) The ACD/Labs 13C verification methodology suggests that 67% of the 704 NMR spectra submitted to ChemSpider were consistent with the proposed chemical structure.

Identified Issues with the Data

Structures that were deemed inconsistent by the ACD/Labs system were manually reviewed. The most frequent cause of inconsistent 1H NMR verification results was the presence of multiple components in the spectrum, i.e., a mixture of isomers. Typically these were recognized by two signals in close proximity with partial integrals (for example 0.6H and 0.4H instead of 1H). Manual inspection of all inconsistent results revealed 22 such cases where mixtures were present.

Figure 2: Example of a 1H NMR spectrum with a mixture of components as evidenced by integral values.

Other encountered issues include spectra with low resolution, incorrect spectrometer frequency, unknown solvents, and of course a series of incorrectly proposed structures.

Inconsistent results for the 13C NMR data were also evaluated. Close inspection revealed that the biggest culprit was poor S/N, which led to the absence of 13C peaks for quaternary carbons. As a result, the software was unable to find peaks corresponding to quaternary carbons in many proposed structures, and thus a significant number of inconsistent results were observed.

Conclusions

ChemSpider is an online structure database allowing the community to participate in the deposition of additional data. A growing NMR spectral curve data collection is available to download. In this way a major reference source of Open NMR data can be provided. The validation of the existing set of spectral data has been performed using ACD/Labs NMR Verification routines. The data validation work highlighted a number of errors in the data, which have now been resolved, and provided a thorough test of the algorithms on real-world data.

References

1) Biological Magnetic Resonance Bank:
2) DrugBank:
3) S.S. Golotvin, E. Vodopianov, B.A. Lefebvre, A.J. Williams, and T.D. Spitzer (GSK), Automated Structure Verification Based on 1H NMR Prediction, Magn. Reson. Chem., 44 (5) 524–538, 2006.

Tel: (416) 368-3435
Fax: (416) 368-5596
Toll Free: 1-800-304-3988
Email:
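
The poster gives two concrete decision rules: the match-factor cut-offs (>0.8 for 1H, >0.75 for 13C) and the partial-integral symptom of a mixture (for example 0.6H and 0.4H in place of 1H). The Python sketch below illustrates both. It is not the ACD/Labs algorithm. The names `is_consistent`, `Signal`, and `partial_integral_pairs`, and the parameters `max_gap_ppm` and `tolerance`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the poster: a structure-spectrum pair counts as
# "consistent" when its match factor exceeds 0.8 (1H) or 0.75 (13C).
CONSISTENT_THRESHOLD = {"1H": 0.80, "13C": 0.75}


def is_consistent(nucleus: str, match_factor: float) -> bool:
    """Return True when the match factor (0-1) exceeds the quoted threshold."""
    if not 0.0 <= match_factor <= 1.0:
        raise ValueError("match factor must lie in the range 0-1")
    return match_factor > CONSISTENT_THRESHOLD[nucleus]


@dataclass
class Signal:
    shift_ppm: float   # chemical shift of the signal
    integral_h: float  # integral expressed in proton units


def partial_integral_pairs(signals, max_gap_ppm=0.1, tolerance=0.1):
    """Assumed heuristic (not from the poster): flag neighbouring signals whose
    fractional integrals add up to roughly a whole number of protons."""
    ordered = sorted(signals, key=lambda s: s.shift_ppm)
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        close = (b.shift_ppm - a.shift_ppm) <= max_gap_ppm
        total = a.integral_h + b.integral_h
        # Both integrals must be clearly non-integer on their own.
        fractional = all(
            abs(s.integral_h - round(s.integral_h)) > tolerance for s in (a, b)
        )
        sums_to_whole = round(total) >= 1 and abs(total - round(total)) <= tolerance
        if close and fractional and sums_to_whole:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(is_consistent("1H", 0.85), is_consistent("13C", 0.70))
    spectrum = [Signal(7.20, 2.0), Signal(5.31, 0.6),
                Signal(5.27, 0.4), Signal(2.10, 3.0)]
    for a, b in partial_integral_pairs(spectrum):
        print(f"possible mixture: {a.shift_ppm} ppm and {b.shift_ppm} ppm")
```

The proximity window and integral tolerance are arbitrary defaults. A real screen would need to tune them against resolution and S/N, both of which the poster lists among the data-quality problems.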