On the Accuracy of Chemical Structures Found on the Internet



The Internet has been widely lauded as a great equalizer of information access. However, the absence of any central authority on content places the burden on the end user to verify the quality of the information accessed. We have examined the accuracy of the chemical structures of ca. 200 major pharmaceutical products that can be found on the Internet. We have demonstrated that, while erroneous structures are commonplace, it is possible to determine the correct structures by using a carefully defined structure validation workflow. In addition, we and others have shown that the use of uncurated structures affects the accuracy of cheminformatics investigations such as QSAR modeling. Furthermore, models built for carefully curated datasets can be used to correct erroneously reported biological data. We posit that chemical datasets must be carefully curated prior to any cheminformatics investigation. We summarize best practices developed in our groups for data curation.

Published in: Technology, Education

On the Accuracy of Chemical Structures Found on the Internet

Andrew D. Fant¹, Eugene Muratov¹, Denis Fourches¹, David Sharpe², Antony J. Williams², and Alexander Tropsha¹
¹ Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill; ² Royal Society of Chemistry

Figure 1: Which structure of the top-selling anti-glaucoma drug dorzolamide is correct? (ChemSpider ID 4447604 vs. ChemSpider ID 23499154)

Motivation

  • It is axiomatic that data stored in chemical databases must be accurate; yet it has been reported that the error rate in freely accessible public databases may exceed 8%.¹ A recent example comes from the NCGC (NIH Chemical Genomics Center) pharmaceutical collection (Figure 2).²
  • When building computational models of chemical properties, one wrong structure in twenty is enough to reduce the reliability and prediction performance of the model.³
  • Chemical data curation is labor-intensive and perhaps unexciting, but critical; it should be recognized and supported as an inseparable component of cheminformatics research.

Figure 2: "Neomycin" – First six structures retrieved from the NCGC browser

Study Design

  • Select and curate a list of the top-200 selling drugs (as of 2006, from Wikipedia).
  • Distribute the list to four independent groups of cheminformaticians and ask each group to generate the structures of the drugs using their preferred methods:
      • Royal Society of Chemistry (RSC): Manual Web Search
      • University of North Carolina (UNC): Manual Web Search
      • AstraZeneca (AZ): Automated Search of Pre-curated Internal Source⁴
      • Institut de Recerca Hospital del Mar/Chemotargets S.L. (IMIM/CT): Automated Internet Search
  • Compare the results and discuss any discrepancies until agreement on the correct structure is reached.
  • Once a master list is established, compare those structures to individual public chemical structure sources.

Methods

  • An initial master set of 151 (out of 200) names was generated by one author.
  • Each team was required to return the following information about each compound: systematic name, MOL-formatted record, and JPEG/PNG/GIF image.
  • The following search workflows were employed:
      • The UNC workflow (Figure 3) was based entirely on open Internet data repositories and included some manual re-entry of structures from PDF sources.
      • The RSC workflow (Figure 4) was more iterative in the early stages, and included redistribution-restricted sources in some cases.
      • The other two workflows (AZ and IMIM/CT) were more highly automated and are not described further in the current work.
  • InChI keys were calculated from the returned molecular structures and compared. Discrepancies in structures were discussed by the participants, and a consensus was reached on which structure for each compound was supported by the available evidence, leading to the final master list.

Figure 3: UNC workflow – Name to structure resolution
Figure 4: RSC workflow – Name to structure resolution

Results

  • Out of 151 total compounds, all 4 groups reported a structure identical to that on the initial master list for 113 compounds (74.8%).
  • No compound was incorrectly reported by all 4 groups; no group achieved 100% accuracy (Figure 5).
  • Differing results between the curated and unsupervised structure determination methods are highly significant by Fisher's Exact Test.
  • Structures from the consensus master list were compared (as InChI keys) to hits on name searches against several well-known chemical structure repositories. The number of correct structures and total number of hits are summarized in Figure 7.
  • Incorrect structures in ChemSpider were corrected when found, but not counted as correct for this analysis.

Figure 5: Relative accuracy of groups against final master list
Figure 6: Examples of problematic structures and sources of disagreement. Panels: Tautomeric Forms (Vardenafil); Pro-drug Forms (Olmesartan); Chiral Sulphoxides (Esomeprazole); Wrong Chirality (Pravastatin); Just Plain Wrong (Paclitaxel). Panel annotations (✔/✗ by group): RSC, AZ & IMIM/CT; UNC; UNC & IMIM/CT.
Figure 7: Accuracy of results from public structure repositories

Conclusions

  • Identifying correct chemical structures from compound names using publicly available resources on the Internet is possible, but not trivial.
  • Success requires careful comparison of multiple resources. No single source is correct in all cases.
  • Automated Internet queries are still significantly less accurate than manually guided searches.
  • InChI strings and keys are an improvement in chemical data handling, but the current standard keys are not perfect for large-scale comparisons.
  • We believe that the adoption of the MIABE (Minimum Information About a Bioactive Entity) standard⁵ as part of the peer-reviewed literature publication process could improve the quality of public structural information by eliminating the manual re-entry of structures from the primary literature that is currently required in most cases.
  • It is insufficient for a database to return the correct structure from a name query. It should also minimize (better, eliminate) the number of incorrect and/or auxiliary answers that are returned along with the correct one.

References

  1. Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb Sci 2008, 27, 1337–1345.
  2. Williams, A. J.; Ekins, S.; Tkachenko, V. Drug Discov Today 2012, 1–17.
  3. Fourches, D.; Muratov, E.; Tropsha, A. J Chem Inf Model 2010, 50, 1189–1204.
  4. Muresan, S. et al. Drug Discov Today 2011, 16, 1019–1030.
  5. Orchard, S. et al. Nat Rev Drug Discov 2011, 10, 661–669.

Acknowledgements

The authors would like to thank Ricard Garcia (Chemotargets S.L.), Jordi Mestres (Barcelona IMIM), Sorel Muresan and Christopher Southan (AstraZeneca), and Andrey Erin (ACD/Labs) for their participation in the search for Internet drug structures. Phyllis Pugh provided workflow graphics and statistical consulting. We acknowledge software licenses donated by OpenEye Scientific Software, ChemAxon, and ACD/Labs that were used for portions of the data collection and analysis. AT acknowledges financial support from NIH (grant GM66940) and EPA (grant RD 83499901).
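
The InChI key comparison described under Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' tooling. It assumes only the published layout of a standard InChIKey: a 14-letter connectivity block, a 10-letter block covering stereochemistry and other layers, and a one-letter protonation flag. The function names, the group and drug labels, and all key strings are hypothetical placeholders, not real compounds or poster data. Separating "identical" from "same skeleton" is one way to surface the stereochemical disagreements listed in Figure 6.

```python
def split_inchikey(key):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().upper().split("-")
    lengths_ok = [len(p) for p in parts] == [14, 10, 1]
    if not lengths_ok or not all(p.isalpha() for p in parts):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def compare_keys(reported, reference):
    """Classify a reported key against a reference key.

    'identical'     -> all three blocks match
    'same_skeleton' -> only the connectivity block matches
                       (e.g. a stereochemistry difference)
    'different'     -> the connectivity blocks differ
    """
    rep, ref = split_inchikey(reported), split_inchikey(reference)
    if rep == ref:
        return "identical"
    if rep[0] == ref[0]:
        return "same_skeleton"
    return "different"


def full_agreement_rate(master, submissions):
    """Fraction of compounds for which every group matches the master key."""
    agreed = sum(
        all(compare_keys(keys[name], ref) == "identical"
            for keys in submissions.values())
        for name, ref in master.items()
    )
    return agreed / len(master)


# Hypothetical placeholder keys; they do not correspond to real structures.
MASTER = {
    "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "drug_b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
}
SUBMISSIONS = {
    "group_1": {
        "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    },
    "group_2": {
        "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        # Same connectivity block as the master key, different second block.
        "drug_b": "CCCCCCCCCCCCCC-EEEEEEEESA-N",
    },
}

if __name__ == "__main__":
    print(full_agreement_rate(MASTER, SUBMISSIONS))
    # The poster's headline figure follows the same arithmetic: 113 of 151.
    print(round(100 * 113 / 151, 1))
```

Comparing only the first block is a looser match that tolerates stereochemical differences. The exact comparison is the stricter of the two and corresponds to the "identical structure" criterion the poster reports.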