The Internet has been widely lauded as a great equalizer of information access. However, the absence of any central authority on content places the burden on the end-user to verify the quality of the information accessed. We have examined the accuracy of the chemical structures of ca. 200 major pharmaceutical products that can be found on the Internet. We have demonstrated that while erroneous structures are commonplace, it is possible to determine the correct structures by utilizing a carefully defined structure validation workflow. In addition, we and others have shown that the use of uncurated structures affects the accuracy of cheminformatics investigations such as QSAR modeling. Furthermore, models built for carefully curated datasets can be used to correct erroneously reported biological data. We posit that chemical datasets must be carefully curated prior to any cheminformatics investigations. We summarize best practices developed in our groups for data curation.
On the Accuracy of Chemical Structures Found on the Internet
Andrew D. Fant¹, Eugene Muratov¹, Denis Fourches¹, David Sharpe², Antony J. Williams², and Alexander Tropsha¹
¹ Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill
² Royal Society of Chemistry
Figure 1: Which structure of the top-selling anti-glaucoma drug dorzolamide is correct? (ChemSpider ID 4447604 vs. ChemSpider ID 23499154)

Motivation
• It is axiomatic that data stored in chemical databases must be accurate; yet it has been reported that the error rate in freely-accessible public databases may exceed 8%.¹ A recent example comes from the NCGC (National Chemical Genomics Center) pharmaceutical collection (Figure 2).²
• When building computational models of chemical properties, one wrong structure in twenty is enough to reduce the reliability and prediction performance of the model.³
• Chemical data curation is labor-intensive and perhaps unexciting, but it is critical and should be recognized and supported as an inseparable component of cheminformatics research.

Figure 2: “Neomycin” – First six structures retrieved from the NCGC browser

Study Design
• Select and curate a list of the top-200 selling drugs (as of 2006 from Wikipedia).
• Distribute the list to four independent groups of cheminformaticians and ask each group to generate the structures of the drugs using their preferred methods.
  • Royal Society of Chemistry (RSC): Manual Web Search
  • University of North Carolina (UNC): Manual Web Search
  • AstraZeneca (AZ): Automated Search of Pre-curated Internal Source⁴
  • Institut de Recerca Hospital del Mar/Chemotargets S.L. (IMIM/CT): Automated Internet Search
• Compare the results and discuss any discrepancies until agreement on the correct structure is reached.
• Once a master list is established, compare those structures to individual public chemical structure sources.

Methods
• An initial master set of 151 (out of 200) names was generated by one author.
• Each team was required to return the following information about each compound: systematic name, MOL-formatted record, and JPEG/PNG/GIF image.
• The following search workflows were employed:
  • The UNC workflow (Figure 3) was based entirely on open Internet data repositories and included some manual reentry of structures from PDF sources.
  • The RSC workflow (Figure 4) was more iterative in the early stages, and included redistribution-restricted sources in some cases.
  • The other two workflows (AZ and IMIM/CT) were more highly automated and are not described further in the current work.
• InChI keys were calculated from the returned molecular structures and compared. Discrepancies in structures were discussed by participants and a consensus was reached on which structure for the compound was supported by the evidence available, leading to the final master list.

Figure 3: UNC workflow – Name to structure resolution
Figure 4: RSC workflow – Name to structure resolution

Results
• Out of 151 total compounds, all 4 groups reported a structure identical to that on the initial master list for 113 compounds (74.8%).
• No compound was incorrectly reported by all 4 groups; no group achieved 100% accuracy (Figure 5).
• Differing results between the curated and unsupervised structure determination methods are highly significant by Fisher’s Exact Test.
• Structures from the consensus master list were compared (as InChI keys) to hits on name searches against several well-known chemical structure repositories. The number of correct structures and total number of hits are summarized in Figure 7.
• Incorrect structures in ChemSpider were corrected when found, but not counted as correct for this analysis.

Figure 5: Relative accuracy of groups against final master list
Figure 6: Examples of problematic structures and sources of disagreement
  • Tautomeric Forms: Vardenafil
  • Pro-drug Forms: Olmesartan
  • Chiral Sulphoxides: Esomeprazole
  • Wrong Chirality: Pravastatin
  • Just Plain Wrong: Paclitaxel
Figure 7: Accuracy of results from public structure repositories

Conclusions
Identifying correct chemical structures from compound names utilizing publicly available resources on the Internet is possible, but not trivial.
Success requires careful comparison of multiple resources. No single source is correct in all cases.
Automated Internet queries are still significantly less accurate than manually guided searches.
InChI strings and keys are an improvement in chemical data handling, but the current standard keys are not perfect for large-scale comparisons.
We believe that the adoption of the MIABE (Minimum Information About Bioactive Entity) standard⁵ as part of the peer-reviewed literature publication process could improve the quality of public structural information by eliminating manual re-entry of structures from the primary literature as is currently required in most cases.
It is insufficient for a database to return the correct structure from a name query. It also should minimize (better, eliminate) the number of incorrect and/or auxiliary answers that are returned along with the correct one.

References
1. Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb Sci 2008, 27, 1337–1345.
2. Williams, A. J.; Ekins, S.; Tkachenko, V. Drug Discov Today 2012, 1–17.
3. Fourches, D.; Muratov, E.; Tropsha, A. J Chem Inf Model 2010, 50, 1189–1204.
4. Muresan, S. et al. Drug Discov Today 2011, 16, 1019–1030.
5. Orchard, S. et al. Nat Rev Drug Discov 2011, 10, 661–669.

Acknowledgements
The authors would like to thank Ricard Garcia (Chemotargets S.L.), Jordi Mestres (Barcelona IMIM), Sorel Muresan and Christopher Southan (AstraZeneca), and Andrey Erin (ACD/Labs) for their participation in the search for Internet drug structures. Phyllis Pugh provided workflow graphics and statistical consulting. We acknowledge software licenses donated by OpenEye Scientific Software, ChemAxon, and ACD/Labs that were used for portions of the data collection and analysis. AT acknowledges financial support from NIH (grant GM66940) and EPA (grant RD 83499901).
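
Illustrative sketch: comparing returned structures by InChI key

The comparison step described under Methods (InChI keys calculated from each group's structures and checked against the master list) could be sketched roughly as below. This is a minimal illustration and not the tooling used in the study: the function names, the group and compound labels, and the placeholder keys are all assumptions made for the example. It relies on the layout of a standard InChIKey, three hyphen-separated blocks of which the first (14 characters) is a hash of the molecular skeleton. Two keys that share the first block but differ in the second typically describe the same connectivity with different stereochemistry or isotopic labelling.

```python
def classify_keys(reference_key, candidate_key):
    """Compare a candidate InChIKey with the master-list key, block by block."""
    if candidate_key == reference_key:
        return "identical"
    # The first hyphen-separated block hashes the connectivity only, so a
    # match here with a mismatch later suggests a stereo or isotope difference.
    if candidate_key.split("-")[0] == reference_key.split("-")[0]:
        return "same_skeleton"
    return "different"


def compare_to_master(master, submissions):
    """Return {compound: {group: classification}} for every master compound.

    master      -- {compound_name: inchikey}
    submissions -- {group_name: {compound_name: inchikey}}
    A compound that a group did not return is reported as "missing".
    """
    report = {}
    for name, reference_key in master.items():
        report[name] = {
            group: classify_keys(reference_key, keys[name]) if name in keys else "missing"
            for group, keys in submissions.items()
        }
    return report


def unanimous_fraction(report):
    """Fraction of compounds for which every group matched the master key."""
    if not report:
        return 0.0
    agreed = sum(
        all(result == "identical" for result in row.values())
        for row in report.values()
    )
    return agreed / len(report)


# Hypothetical placeholder keys, shaped like InChIKeys but not real compounds.
master = {
    "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "drug_b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
    "drug_c": "EEEEEEEEEEEEEE-FFFFFFFFSA-N",
}
submissions = {
    "group_1": {
        "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
        "drug_c": "GGGGGGGGGGGGGG-HHHHHHHHSA-N",  # unrelated structure
    },
    "group_2": {
        "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_b": "CCCCCCCCCCCCCC-IIIIIIIISA-N",  # same skeleton, other stereo
        "drug_c": "EEEEEEEEEEEEEE-FFFFFFFFSA-N",
    },
    "group_3": {
        "drug_a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "drug_b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
        # drug_c was not returned by this group
    },
}

report = compare_to_master(master, submissions)
for compound, row in report.items():
    print(compound, row)
print("unanimous fraction:", unanimous_fraction(report))
```

The unanimous fraction is the same kind of quantity as the 113 of 151 compounds (74.8%) reported under Results. The "same_skeleton" class is only a triage aid for the manual discussion step: cases such as tautomers or pro-drug forms can change the first block as well, so they would show up as "different" and still need expert review.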