Keynote ReviewTowards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases andApproaches to Improving t...
Short BiographiesAntony J. Williams graduated with a Ph.D. in chemistry as an NMR spectroscopist. Dr Williamsis currently ...
Valery Tkachenko has graduated from the Lomonosov Moscow State University, receiving hisM.Sc. in Chemistry and B.Sc. in Co...
Teaser Sentence:Chemistry databases are widely available on the internet which is potentially of highvalue to researchers,...
The big picture: Error detection in databases       “If I have seen further it is only by standing on the shoulders of gia...
have been developed for identifier error detection (these types of approaches may be applicableto large chemical databases...
not believe such studies identifying errors in databases have received the attention theydeserve.The quality of chemistry ...
highly curated online database of validated chemical name-structure relationships would likelyprovide for the underpinning...
by crowdsourcing are of high quality because there is debate between the editors. Data qualitycan be compared with other r...
to the interpretation of the surveyed population. The table represents a number of key points atthe time of reporting:    ...
the number of responders has doubled. The data available as of October 2011 are presented inTable 2.What is in a name: Str...
Silver graphite       Special schwarz       GRAPHITE, NATURAL       Activated charcoal, iodinated       Carbon nanotube, s...
dataset, some as a response to a series of public blog posts [41-43]. For example, someobvious mis-associations are displa...
retrieve information about that particular drug only. For example, a search on Neomycin wouldreturn a single record for th...
Figure 5. Each of these substructures were used as a separate search and the individualsubsets of molecules examined. Duri...
two undefined stereocenters and one is a totally unrelated structure. This is typicallyrepresentative of the confusions in...
appear to be a way to view a list of all curations made by a single contributor and this wouldmake for an appropriate enha...
them. For data aggregators of millions of chemicals, the curation of even the basic chemicalcompounds and associated ident...
While aromaticity is a well-known property of many organic chemicals, there is no standard inrepresenting such parts of th...
As discussed earlier in the analysis of steroids in the NPC Browser, it is rather common forstereochemistry to be excluded...
identification of certain types of covalently bonded metals as a prefilter is an appropriate step invalidating data prior ...
deprecation was also extended to larger multiples such as triples, quads etc. of the samecompounds in a single record.Data...
Wolfram Alpha declares PubChem as one of their sources of information [58] but do not deposit,as yet, back to the platform...
private vaults [63]. It should be noted that the linking between the external sites andChemSpider is based on retrieval of...
The majority of online databases do not provide details regarding the provenance of all of theircontent. We think is an ex...
on the database. For databases containing chemical compounds it is intended that some formof quantitative quality factor c...
Discussion       We have described some of the errors we are finding which are common to moleculedatabases. As chemistry c...
useful for ranking targets and looking at druggability of targets and scaffold distribution. Thesetypes of metrics will be...
dictionary of correct structure files for drugs with their associated synonyms is not yet availableonline. It has been ack...
NIH : National Institutes of HealthNPC browser : NIH Chemical Genomics Center Pharmaceutical Collection browserOpen PHACTS...
15   Zhang, C. et al. (2009) Methods for labeling error detection in microarrays based on the     effect of data perturbat...
50   Ekins, S. et al. (2011) Collaborative computational technologies for biomedical research,     Wiley51   Pentavalent c...
81     (2011) Open Data on Wikipedia. (Vol. 2011)82     Open Knowledge Foundation: Open Data Licensing. (Vol. 2011)83     ...
where data transitions to become a copyrightable collection. Is a file containing 100 chemicalstructures, associated chemi...
TablesTable 1. An online survey requesting community feedback regarding trust in online chemistrydatabases.Table 2
Table 3. Summary of substructure search results for NPC browser searching with 3 steroidsubstructures.   Substructure    #...
Table 4. Six identifiers linked with bufogenin in PubChem.
Table 5: A review of chemical structures retrieved based on name-based searches of the NPCBrowser (version 1.0.22) using a...
Cefdinir      X   1 Cyclosporin         3   7,8Clarithromycin       2   2,8  Tegaserod          1   6 Famotidine      X   ...
Donepezil                     X                  2   8      Cetirizine                    X                  2   8      Ir...
Figure LegendsFigure 1. Image from the original downloaded NPC browser showing the result of searching forchromium.Figure ...
structure shown is of bare carbon. All displayed synonyms associated with the compounds areincorrect: charcoal, graphite, ...
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Towards a gold standard and regarding quality in public domain chemistry databases and approaches to improving the situation.
Upcoming SlideShare
Loading in …5
×

Towards a gold standard and regarding quality in public domain chemistry databases and approaches to improving the situation.

784 views
767 views

Published on

In recent years there has been a dramatic increase in the number of freely accessible online databases serving the chemistry community. The internet provides chemistry data that can be used for data-mining, for computer models, and integration into systems to aid drug discovery. There is however a responsibility to ensure that the data are high quality to ensure that time is not wasted in erroneous searches, that models are underpinned by accurate data and that improved discoverability of online resources is not marred by incorrect data. In this article we provide an overview of some of the experiences of the authors using online chemical compound databases, critique the approaches taken to assemble data and we suggest approaches to deliver definitive reference data sources.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
784
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Towards a gold standard and regarding quality in public domain chemistry databases and approaches to improving the situation.

  1. 1. Keynote ReviewTowards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases andApproaches to Improving the SituationAntony J. Williams1, Sean Ekins2 and Valery Tkachenko11 Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC-27587, U.S.A.2 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, U.S.A.Corresponding Author: Antony Williams, 904 Tamaras Circle, Wake Forest, NC27587 Email:williamsa@rsc.org Tel 919-201-1516.
  2. 2. Short BiographiesAntony J. Williams graduated with a Ph.D. in chemistry as an NMR spectroscopist. Dr Williamsis currently VP, Strategic development for ChemSpider at the Royal Society of Chemistry. Dr.Williams has written chapters for many books and authored or >120 peer reviewed papers andbook chapters on NMR, predictive ADME methods, internet-based tools, crowdsourcing anddatabase curation. He is an active blogger and participant in the internet chemistry network.Sean Ekins graduated from the University of Aberdeen; receiving his M.Sc., Ph.D. andD.Sc. He is Principal Consultant for Collaborations in Chemistry and CollaborationsDirector at Collaborative Drug Discovery Inc. He has written over 170 papers and bookchapters on topics including drug-drug interaction screening, computational ADME/Tox,collaborative computational technologies and neglected disease research. He hasedited or co-edited 4 books.
  3. 3. Valery Tkachenko has graduated from the Lomonosov Moscow State University, receiving hisM.Sc. in Chemistry and B.Sc. in Computer Sciences. He is currently Chief Technology Officer ofChemSpider at the Royal Society of Chemistry. Over the course of the last 15 years he hasparticipated in development of a number of successful enterprise projects for large pharma andpublic domain. He is an author of 30 peer reviewed papers and book chapters. He’s one shiningpiece of a masculine brightness and attractiveness.
  4. 4. Teaser Sentence:Chemistry databases are widely available on the internet which is potentially of highvalue to researchers, however the quality of the content is variable and errorsproliferate, we suggest there should be efforts to improve the situation and provide achemistry database as a gold standard.AbstractIn recent years there has been a dramatic increase in the number of freely accessible onlinedatabases serving the chemistry community. These databases are diverse in nature andencompass chemical vendor catalogs, government-funded research results, definitively focusedand highly curated collections and large-scale information aggregators. As a result the internettheoretically now represents a rich array of chemistry data that can be utilized for the purpose ofdata-mining, for utilizing in computer models, and for integrating into other systems to expanddata accessibility and aid drug discovery. With this dramatic improvement in accessibility comesa responsibility to ensure that the data are as high a quality as possible. This will ensure thattime is not wasted in erroneous searches, that models are underpinned by accurate data andthat the improved discoverability of online resources is not marred by the delivery of incorrectdata. This paper represents an overview of some of the experiences of the authors in utilizingonline chemical compound databases, critiques some of the approaches being taken toassemble data and suggests alternative approaches to collaborate to deliver definitive referencedata sources for the community to utilize.Key WordsChemistry databases, ChemSpider, Crowdsourcing, Curation NPC browser, PubChem,Scifinder
  5. 5. The big picture: Error detection in databases “If I have seen further it is only by standing on the shoulders of giants” Isaac Newton Isaac Newton alluded to science progressing by building on the past experiments anddata of others. This can, however, be significantly inhibited or misdirected by errors indatabases that have long been suggested as having downstream effects when the data isreused. For example in the 1990’s it was proposed that errors in genotyping data could impacthigh resolution genetic maps and one human polymorphism database had 3% errors whichimpacted maps developed with it [1]. Some bioinformatics databases have been described thatwere designed to perform data curation and error identification [2] but it is unclear how widelythese have been embraced. The impact of the correctness of molecular structures oncomputational models has been discussed to a limited extent [3]. Oprea and colleagues haveshown how errors in chemical structures published in scientific journals can propagate in theliterature [4] and then into databases like SciFinder [[5]] and the Merck Index [6]. Even manualcuration of biological activity databases such as the Accelrys drug data report, MDDR, [7] hasbeen proposed to have errors which have been compensated for by calculating activity-activitysimilarities [8]. It has also been suggested that automatic classification of molecules based onSMILES strings may be useful for error detection and aiding biochemical pathway databaseconstruction [9]. Error detection is important in clinical practice to avoid mortality due to missed injury[10]. Errors in clinical research databases of thousands of patients have been shown in onestudy to vary from 2.3-26.9% [11] resulting in data analysis errors that have the potential toimpact the standards of care for many thousands of patients. Medical records often containmultiple identifiers and are error prone when it comes to linkage to samples. Therefore, methods
  6. 6. have been developed for identifier error detection (these types of approaches may be applicableto large chemical databases too) [12]. A recent multicentre analysis of a common biological sample by mass spectrometry–based proteomics identified generic problems in databases as the major hurdle forcharacterizing proteins in the test sample correctly. Primarily, search engines could notdistinguish among different identifiers and many of these algorithms calculated molecular weightincorrectly [13]. Computational genomic annotation errors such as those in the Brucella abortusgenome (which had seven annotation errors) have been corrected by proteomics-basedconfirmation of protein expression [14]. Methods have been developed for labeling errordetection to improve analysis of microarray data and to discover the correct relationshipbetween diseases and genes [15] (which suggests that many of the microarray databases mayhave mislabeled data). Simple rule-based methods for validating ontology-based instance datacan detect curation errors during the development of biological pathways [16]. It has beensuggested that functional annotation from well-studied proteins to new sequences hit a plateausince annotation transfer led to error propagation across databases, which resulted in follow upexperiments that failed. Many proteins have completely wrong function assignments and onedatabase had between 2.1 – 13.6% of annotated Pfam hits unjustified [17]. The authors alsoindicated difficulty in assessing incorrect annotations in public sequence databases as many ofthe sequences have not been studied experimentally. While most scientists think a ligand-protein X-ray structure is definitive others have highlighted how these can also have errors withfar reaching consequences [18]. A recent review of data governance in predictive toxicology analyzed several publicdatabases. The authors mentioned a lack of systematic and standard measures of data qualitybut did not address error rates or address molecule structure quality [19]. In our combinedliterature analysis independent groups have identified significant errors across all types ofdatabases which if not checked and corrected will be an impediment to future science. We do
  7. 7. not believe such studies identifying errors in databases have received the attention theydeserve.The quality of chemistry databases Scientific knowledge is fragile, it demands uncorruptable storage media and in todays’electronic age the sheer amount of data underpinning this knowledge requires careful curationand verification. Chemistry databases are no exception. There are now many such databasesthat are freely available on the internet (e.g. PubChem [20], ChemSpider [21], DrugBank [22],etc) and we rely on them to be correct, often granting them “trust” and declaring them as highquality without validation of these beliefs. We have also previously discussed the importance ofchemical data curation [23]. For many types of scientific data minimal data standards have been created whichprovide confidence in data deposited in databases etc. [24]. Unfortunately, for chemistrydatabases there are as yet no agreed upon standards and there is no freely available goldstandard structure database which we can yet rely on. Despite the decades of experience thatunderpin the assembly of commercial molecule databases (e.g. Scifinder, MDDR, etc.)primarily depending on skilled staff for curation and data checking, the delivery of onlinedatabases commonly appears to focus more on the development of the underlyingcheminformatics architecture and platform rather than the delivery of a high quality resource ofdata. Also there are as yet no definitive guides for how to assemble and integrate disparate datasources and each of the individual groups appear to follow custom unproven andundocumented approaches in assembling data. Some of these databases are simplyrepositories whereby the data deposited to the system remain unedited even when thedatabase hosts are well aware of errors in the data. In our chemistry domain there is a dire needfor the suppliers of chemistry-based community resources to work together and develop bestpractices to reduce the amount of repetition and lessen the impact of poor data assembly. A
  8. 8. highly curated online database of validated chemical name-structure relationships would likelyprovide for the underpinnings of a semantic web for chemistry, for chemistry text-mining, forintegration to online resources and to allow for efficient disambiguation of chemical names.Unfortunately we currently do not have such a reliable resource which we would term a “goldstandard”. A recently published paper describes an effort to assemble clinically approved drugsfrom the US, EU, UK, Canada and Japan and the creation of a new database of molecularstructures [25]. This group described it as a “comprehensive and curated resource”, publishedthe data in a cheminformatics browser and declared that it will be used along with the NCGCscreening resources as a component of the NIH therapeutics for rare and neglected diseases(TRND) program [26]. The paper described in detail the considerable effort that went intodealing with semantic errors and sourcing “correct structures”. The NPC Browser was releasedto the public on 27th April 2011 [27]. In keeping with our ongoing efforts to aggregate andcurate data for our own work in the fields of QSAR and drug repositioning we were interested inexamining the quality of the data in the NPC browser, and have used it as an example in thispaper for discussing the challenges of assembling high quality data and, specifically, as anexample of how more foresight, consideration and care is required when releasing morechemistry related data into the public domain.Trust and Chemistry databases While the quality of academic or commercial databases is rarely questioned the mediahave made much of the implicit trust granted to the online encyclopedia, Wikipedia, questioningwhether a crowdsourced database can be as high-quality and as trusted as a highly curated andexpert assembled resource such as the Encyclopedia Britannica. Wikipedia has a great diversityof coverage, and offers unsurpassed immediacy although the question as to which is more“trustworthy” is still an issue. Wikipedia chemical compound pages which have been developed
  9. 9. by crowdsourcing are of high quality because there is debate between the editors. Data qualitycan be compared with other resources such as the Merck Index [28] and the US Pharmacopeia[29] which themselves are less frequently updated than Wikipedia. Errors detected in any of thepublished (book-bound) resources are not updated for many months after detection of errorsand even then purchasers of a volume will not receive those corrections. Meanwhile, theobvious advantages of immediacy of editing and availability of an online data source, is clear.The devil’s advocate and critic of the wiki approach would of course note that such edits couldalso degrade the quality of what is on Wikipedia. It is our anecdotal observation that Wikipediainformation on chemical structures is considered trustworthy. We think the standard resourceslike the Merck Index themselves will ultimately be replaced by Wikipedia or similarcrowdsourced sites. The shift to mobile technologies and the expectation of improvedsearchability over thumbing physical paper pages is going to help to drive this shift. It seems paradoxical if the people question the validity of data and information capturedon Wikipedia, yet many are willing to grant trust to public databases and in particular, chemicalcompound databases at a surprisingly high level. One of the authors (AJW) prepared an onlinesurvey (Table 1) requesting community feedback regarding the trust granted to onlinedatabases containing chemistry related information. The trust granted to online resourcescaptured in the survey, considering the actual quality as discussed below, is consistent with ourown experience of interacting with many scientists. We suggest that no database should alwaysbe trusted but that there is a different level of data-driven trustworthiness that can be granted toonline chemistry databases, albeit at a specific point in time, as content continues to changeover time and quality can certainly improve or degrade. We have recently introduced a wikienvironment listing online scientific databases and contributions and commentary from thecommunity might aid us in providing a ranking of quality (vide infra). The initial data shown here were gathered over 4 days and from 46 survey responders[30]. It should be noted that the term “trust” was purposely not explicitly defined and was left up
  10. 10. to the interpretation of the surveyed population. The table represents a number of key points atthe time of reporting: 1) All responders were familiar with Wikipedia and the majority commonly trusted the resource. 2) Only one database had the majority of users always trusting the resource, PubChem [20]. It should be noted however that the distribution of trust appeared to be largest for this database. 3) ChemSpider is the database that is trusted by the largest population of users (see caveats below). Over 50-70% of the responders had no experience with the DrugBank [31], ChemIDPlus[32], PDSP [33] and DailyMed [34] resources. Some notable caveats are as follows. Since theauthor of the questionnaire is the host of the ChemSpider database a bias as to thetrustworthiness of the ChemSpider resource is to be expected since the link to the originalsurvey was posted on his blog and many blog posts regarding data quality in public domaindatabases and the efforts to curate ChemSpider had been posted there historically. The reverseis true in terms of the “Always Trust” bias for PubChem. A number of reports regarding qualityissues about PubChem have also been made on blogs [35,36]. It is assumed that the majority ofresponders were unaware of these commentaries as well as others reported in mainstreamjournals [37]. The results of the questionnaire are likely representative of both the respondersbeliefs and experience of the listed resources. In the case of this questionnaire trust is likelyrather an invested emotional response and for each individual may comprise of, for example,belief in a resource (e.g. based on marketing, word-of-mouth or peer pressure) and quantifiabledata-driven experience of a resource (the user may have downloaded the data and checkeddata quality and content thoroughly). These are really the two extremes and clearly there arevarious other biases inherent in the granting of trust. It should be note that the survey remainsonline 10 months after it was originally posted and the bias of the questionnaire has changed as
  11. 11. the number of responders has doubled. The data available as of October 2011 are presented inTable 2.What is in a name: Structure-identifier relationships in chemical databases The relationship between chemical compounds and all possible identifiers leads tosignificant confusion in chemical databases. Some simple examples commonly observed inpublic domain databases are listed below: 1) A chemical name can refer to a structure of a particular isomeric form but thestereochemistry may be confused. As an example, Taxol has a specific stereochemistry but inPubChem is confused and returns five structures with different stereochemistry. [38] 2) A CAS registry number associated with a particular salt form can be incorrectlyassociated with the neutral compound. 3) Systematic names containing all explicit stereochemical detail may be associated witha molecular skeleton with all stereochemistry absent. As an example, cholesterol has specificstereochemistry whereas one of the multiple forms of cholesterol on PubChem has no definedstereochemistry [39]. One example of the extreme nature and diversity of name-structure relationship errors inpublic domain databases is for the simplest organic molecule, methane. A review of all namesand identifiers associated with methane provides a long list of obvious errors as listed below[40]. Furnace black Graphitic acid Mineral carbon Royal spectra
  12. 12. Silver graphite Special schwarz GRAPHITE, NATURAL Activated charcoal, iodinated Carbon nanotube, single-walled Fullerene soot Carbon Activated Diamond- CHARCOAL (2R,3R)-Butanediol dimesylate 1,3-DICHLORO-PROPAN-2-ONE (2R,3R)-Butanediol bis(methanesulfonate) Ethyl-1-propenyl ether, mixture of cis and trans Carbon, activated [UN1362] [Spontaneously combustible] PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted These can be categorized below:  Representations of Carbon: graphite, diamond, soot, fullerene, coal.  Organic compounds  Trade names This example is rather extreme in its nature whereas others given below are moregeneral in nature and represent issues detected in many other public databases.Data Errors in the NPC Browser: Mis-associations A number of data errors were detected with the originally downloaded NPC browser [25].Since the original release some of these errors have been resolved in the presently available
  13. 13. dataset, some as a response to a series of public blog posts [41-43]. For example, someobvious mis-associations are displayed in Figures 1 to 3. Figure 1 shows one of the ninechemicals returned following a search for “chromium” in the NTP browser. The list of associatedsynonyms, do not coincide with the displayed chemical moiety of the bare Chromium (IV) ion.The list is diverse and refers to a number of different chemicals and the species of interest isalmost certainly meant to be a chromate ion based on the long list of identifiers, with thevariation in the chemical only being the associated counterion. It should be noted that there is along list of associated CAS numbers also, each likely associated with one of the many forms ofchromate. While this may also be deemed to be an extreme case, there are multiple otherexamples including search results for manganese, titanium and chromium. It should be notedthat the NPC browser is meant to represent the active drug moiety so there is an error in thisrespect. In the case of arsenic (see Figure 2) the species of interest is meant to be an arsenicoxide. The display of two equivalent trivalent arsenic cations is, in itself, clearly an error as thereis no value in retaining two degenerate ions. Unfortunately the registration of chemical recordscontaining degenerate compounds as doubles, triples, quads etc. in online databases is notlimited to the NPC Browser, and these have been detected in other databases as will bediscussed later. The misassociation of a chemical with one or more chemical names is not limited tosimply inorganic ions. An example is shown in Figure 3 for a well known antibiotic “neomycin”.Despite the fact that 15 synonyms identify the displayed structure of hexadecanoic acid as anantibiotic, neomycin is in fact an aminoglycoside [44] and this represents another clear exampleof miss-association. Clearly the effects of such egregious errors in databases, if the data is usedelsewhere, can result in erroneous structure-property relationships when used for modeling etc. In our judgment the behavior of the search in the NPC browser itself is confusing asusers would likely expect the chemical record associated with a search on a drug name to
  14. 14. retrieve information about that particular drug only. For example, a search on Neomycin wouldreturn a single record for that drug. However, the present text-based search is a search acrossall information in the entire database and therefore retrieves a total of 26 records, the majority ofthem retrieved because some mention of the text string neomycin has been made in anassociated document referring to that drug. That said, even that is not the complete explanationas there are three distinct drugs retrieved labeled as neomycin as shown in Figure 4. Such a “full text” based search is similar to that available in PubChem. A naïve usersearching PubChem using a drug name as the input might expect to retrieve just the recordassociated with that drug. However, by default the search is a complete text search throughoutthe database and the user must be aware that only a constrained search of the formName[CompleteSynonym] will retrieve the appropriate record(s). As an example, a search forthe drug name aspirin retrieves 69 records whereas a search on Aspirin[CompleteSynonym]retrieves a single record in PubChem. It should also be noted however that such a[CompleteSynonym] search does not always retrieve a single record as retrieval depends on thequality of the data in the database. A search for the drug name Taxol in PubChem retrieves 59records whereas a search on Taxol[CompleteSynonym] retrieves five records. This is confusingas it is not clear which is the actual structure. Closer examination shows that all are consistentin terms of connectivity but differ in stereochemistry and the majority of associated bioassaydata are associated with the incorrect structure of Taxol. It should be noted that the correctstructure is listed as one of the five.Data Errors in the NPC Browser: Analysis of SteroidsIn order to examine potential patterns in the quality of data contained within the originaldownloaded NPC browser “HTS screening” dataset a series of three specific steroidalsubstructures were searched against the compounds contained within the dataset. Thesesubstructures were the gonane, gon-4-ene and gona-1,4-diene substructures as shown in
  15. 15. Figure 5. Each of these substructures were used as a separate search and the individualsubsets of molecules examined. During the initial examination specific patterns emerged. Forexample, since the majority of steroids contain specific stereochemistry centers, theexamination involved validating the stereospecific details of the structure against the structureexpected based on the CAS number and chemical name combination as described earlier. Asan example we will consider the structure of bufogenin. In the NPC browser the structurerecovered is as shown in Figure 6. Figure 6 shows a single CAS number and a series of chemical names, all consistent withbufogenin. Using the series of databases outlined earlier (Table 1) as validation sources thestructure of Bufogenin was identified. Comparing the determined structure with that in the NPCbrowser shows the differences illustrated in Figure 7. Specifically, 2 stereocenters are missingand 3 stereocenters are inverted. A similar analysis was performed for each of the retrieved chemical in the three classesof steroids defined by the various substructures and the results were tabulated. A total of 149unique compounds were retrieved and each was annotated as shown in Table 2. In each caseonly a fraction of the structures matched the correctly validated structure identified using theprocesses listed earlier. For each of the classes the majority of compounds had incompletestereochemistry and for two of the classes, the gon-4-ene and gon-1,4-diene substructures,over 12% of the structures had complete but incorrect stereochemistry. This pattern wasrepeated in other cases throughout the database analysis. Clicking on the hyperlinked CAS number in the NTP browser interface for bufogeninperforms a search against PubChem for that CAS number. This search produces sixcompounds in [45]. Each has the associated name bufogenin as well as the associated CASnumber of 465-39-4. There are 4 unique chemical structures contained within the PubChemdatabase (listed as four different CIDs in Table 3). Of the four listed one is the correct structure,one has the correct connectivity but with no stereochemistry, one has the correct structure with
  16. 16. two undefined stereocenters and one is a totally unrelated structure. This is typicallyrepresentative of the confusions in name-structure relationships in the PubChem database.Solutions to prevent data errors in databases: Curation Systems and Validating Data Thus far we have provided some examples of quality in public domain chemistrydatabases. We clearly understand that no release of data, as highly curated or validated as itmay (or may not) be, will be perfect. In the current age of nascent crowdsourcing [46] that wenow find ourselves in there is the ability to gather feedback from users of the database so thattheir comments can be reviewed and appropriate actions can be taken. This should be implicitin the design of any database. While the majority of chemistry databases online provide anemail address to contact the database host this is rather an imperfect solution as any commentsabout a particular record are not associated with the record itself and are therefore unavailableto other users of the database to view. As a result obvious errors, even though reported to thedatabase administrators, remain hidden from the community until appropriate action is taken.This is the case with some of the most “trustworthy” databases including PubChem,ChemIDPlus, and others. Sometimes responses are simply canned replies with no actionstaken for months and, based on the experiences of the authors, in some cases years, and thefeedback is not actioned. Other database hosts are responsive and pay attention to thefeedback with a short cycle time. The authors have positive experiences with the hosts of theDrugbank [31], ChEBI [47] and ChEMBL [48] databases specifically. Other databases offer per record annotation and feedback as has been implemented onChemSpider [21], DrugBank [31] and on the NPC Browser [25]. The Wolfram Alpha database[49] provides the ability to submit feedback on individual records in the database. Based on ourexperiences all of our reported errors remain unresolved. The NPC browser includes a ratherelegant approach to curation including the ability to edit the structure and remove and addsynonyms, and secondary curators check the data and flag it further. As yet there does not
  17. 17. appear to be a way to view a list of all curations made by a single contributor and this wouldmake for an appropriate enhancement. ChemSpider has implemented both curation and annotation capability and already offersthe ability to provide direct feedback on a record-per-record basis regarding data quality. The“dictionary” aspect of a chemical compound database, i.e. the ability to retrieve high qualitychemical compounds based on a name search, provides value to scientists, the greater value ofcourse likely coming from the additional data and links associated with these records.Crowdsourced community participation has allowed for many tens of thousands of incorrectsynonyms to be removed or validated on the database, thousands of incorrect chemicalrepresentations to be deprecated and new chemical compounds and spectra to be added.These contributions follow a similar power law trend to those demonstrated in internal projectssuch as those at Pfizer [50]. The number of curators is rather limited with less than 150 everhaving contributed as of this writing. There are a very small number of dedicated contributorswho have contributed thousands of curation actions to the database and these have proven tobe of very high quality. In comparison to date only 11 people ever have contributed to thecuration of the data in the NPC browser, with 5 of them associated directly with the developmentof the software tool. While few people to date have actively contributed in these crowdsourcingefforts, this makes the task of cleaning up the databases immense for those that are activelyinvolved, thus we require more automated mechanisms to ensure structures and data arecorrect.Structure Validation FiltersThe construction of a database of chemical compounds should attempt to deliver the highestquality of data possible to its users. For small collections of data, of a few thousand compoundsfor example, this is possible by manual curation of the data. As we have shown earlier howevereven small databases are commonly not curated and can have many errors associated with
  18. 18. them. For data aggregators of millions of chemicals, the curation of even the basic chemicalcompounds and associated identifiers is an enormous challenge when taking into account thescale of the data involved. However, it is possible to introduce some very basic structurevalidation filters into every system that can be used as immediate checks on the quality of dataand provide automated flags for review as necessary. We are not aware of any previousreports of such filters being compiled for use in such a manner, but recommend the followingfilters after based on our experience in developing the ChemSpider database. These filters canbe implemented in a manner that immediately reject the data at deposition or can be post-filtered following human review.1)Incorrect valenceHypervalency is a rather common situation in many chemistry databases and the rejection ofchemical compounds containing pentavalent carbon atoms, for example, should be a generalrule. The implementation of such a rule would have removed such errors from the NPC Browser[51]. A related issue in some databases is the preference of display for nitro groups where somedatabases allow the functional group to be displayed with a pentavalent nitrogen while othersprefer charge separated groups. It should be noted that this is a structure standardization choiceand will be discussed below.2)Atom LabelsIt is quite common for chemical compounds to be represented using atom labels of the type –Bz(for benzyl) and –Tos (for tosylate) and FMOC (for 9-fluorenylmethoxycarbonyl). In this caseprocessing systems for the various databases hosting the data must correctly process andconvert the data into an expanded form that accurately captures the intention of the label.3) Aromatic Bonds.
  19. 19. While aromaticity is a well-known property of many organic chemicals, there is no standard inrepresenting such parts of the molecule either graphically or in chemistry format files. Oneapproach is to represent aromatic rings as a particular Kekule form. Another is to explicitlydesignate aromatic bonds as solid-dashed parallel lines representing 1.5 bond order. Yetanother one is often used to represent benzene – as a circle inside a hexagon. Neither of theseapproaches is chemically correct, but situation gets even worse when particularcheminformatics software is being used to read-in molecule represented in one of these formats– then it’s up to that package’s logic to figure out what kind of structure was meant to be hereand how to standardize its representation. As such interpretation highly depends on softwarepackage one often can see multiple instances of the same chemical represented in differentways being listed in databases as different structures.4) Non-zero total chargeIt is quite common for chemicals to be represented in databases as active moieties associatedwith their salt counterions. For example, in Figure 8 the compound is a disodium salt asevidenced by the associated chemical name and the “SaltData” field. Unfortunately, when datais delivered in this format the structure deposited into the database will have obvious chargeimbalance, the association with the chemical name will be incorrect and any experimentalparameters in the file associated with a particular salt will be mismatched. While there areappropriate cases where a chemical record should be associated with a charged speciesexperience has proven that checking for a net zero charge is definitely of value in catchingmany errors at deposition. Many of the errors observed in the NPC browser in terms ofcompound-identifier mismatches likely results from aggregation of the data around the activedrug component whilst ignoring the originally associated salt form.5) Absent stereochemistry
  20. 20. As discussed earlier in the analysis of steroids in the NPC Browser, it is rather common forstereochemistry to be excluded from chemical compound representations, whether intentionallyor accidental. Steroids commonly show this issues as stereochemistry is generally assumedbased on standard steroid skeletons. Certain databases have also excluded stereochemistryfrom their collections when aggregating their data (e.g. the PDSP database [33] sourced non-isomeric SMILES strings from PubChem for the structure representations). While missed stereobonds are of course acceptable for structure representations (e,g. for representation ofracemates or for unknown stereochemistry) it is the authors experience that in the majority ofcases, as evidenced by the associated chemical name/identifier, that incomplete or absentstereochemistry is in fact an error. Flagging such compounds for manual review is anappropriate choice.6) Salts with Covalent BondsWhile the majority of chemists would agree that a sodium carboxylate salt should berepresented using a positively charged sodium ion and a negatively charged oxygen in thecarboxylate anion, an alternative representation is a sodium atom covalently bonded to theoxygen as shown in Figure 9. Using InChI the species are actually equivalent, as shown in thefigure, whereas the SMILES string and standard molfile would clearly distinguish between them.For databases constructed using InChI as the basis of deduplication (for example, ChemSpider)this equivalence has proven to be an issue in structure representation [52]. In the case of theChemSpider database a number of molfiles containing carboxylate groups with covalentlybonded metals were deposited to the database and all future representations of the compound,whether ionic or not, were de-duplicated based on the InChI. Such misrepresentations onChemSpider will be addressed in the future as part of a structure standardization project (videinfra). While clearly it is appropriate to have covalently bonded metals in many cases the
  21. 21. identification of certain types of covalently bonded metals as a prefilter is an appropriate step invalidating data prior to deposition.7) 0D structure layoutThe vast majority of molfiles contain 2D coordinates representing flat depictions of chemicalsand, assuming appropriate layouts, are the most appropriate manner in which to depictchemicals. The submission of 3D molfiles for deposition can be quite common and, in general isnot an issue as the removal of the z-coordinate to flatten the molecule often produces anacceptable input. Unfortunately these “0D molfiles” where all XYZ coordinates are set to zeroare also possible. The connectivities and bond orders between atoms are still contained withinthe file but all layout information is removed. In these cases the data should be pretreated usingalgorithmic 2D layout so as to ensure that visually interpretive data are available to the user.Without such 2D layouts the displayed compound will be confusing. An example of such anissue is where a “hydroxyl group” shown in the NPC browser actually corresponds to thestructure of Silidianin, albeit without accurately encoded stereochemistry [53].8) Duplicated structuresBased on our experience of digesting data from various depositors into ChemSpider we haverecognized that the submission of data containing multiple copies of the same compound is arather common situation. The observation of a chemical record called “terminal dimethyl”, arecord containing two methane molecules in the Drugbank database [31], allowed us to thentrace the compound into the Wolfram alpha database [49], PubChem, ChemSpider and others[54]. As a result of this observation we sought out similar records in ChemSpider ultimatelyremoving almost 3000 of these twinned records [WRITE BLOG POST]. They likely arise as aresult of attempting to represent racemates but we are not sure of the reason for their existence.Nevertheless, we see no value in storing such records in our database. The search and
  22. 22. deprecation was also extended to larger multiples such as triples, quads etc. of the samecompounds in a single record.Data Proliferation Between DatabasesThe distribution of online chemistry content is, in the opinion of the authors, dramaticallyoverlapped in the majority of cases with only a small number of key resources adding data ofvalue to the internet. PubChem serving the national library screening initiative is a valuableresource in terms of the hosting of bioassay data associated with hundreds of thousands ofchemicals. The platform however offers many additional resources and, since the data areavailable for download and reuse (vide infra) many other resources have used slices of thecontent as the basis of their own offerings, and generally link back to PubChem using CID(Compound ID) links in order to drive traffic to their site. As a result data, both good and bad,proliferates amongst the databases and provenance is commonly confused or lost altogether.The situation is further confused when data providers who deposit to PubChem originallysourced the data from the same platform and in this manner they are simply looped depositions.Cheminformatics tools or standardization approaches used by the various hosts can differ and itis likely that data originally sourced from PubChem, when processed for hosting in a third partyresource will be mapped back to a different, though related chemical. The cyclic processing ofdata through molfiles to SMILES to molfiles can introduce changes in stereochemistry andtherefore add to the workload for those cleaning up the data. There are numerous examples of data from PubChem being sourced and redepositedback into the database. For example, ChemSpider sourced their original seed set of 10.5 millionchemicals from PubChem and redeposited the set back to PubChem later when theChemSpider database was expanded to almost 20 million chemicals [55]. NextBio sourced themajority of its content from PubChem and utilize it as part of their integrated content system onplatforms such as Elsevier [56]. NextBio also deposited their content back to PubChem [57].
  23. 23. Wolfram Alpha declares PubChem as one of their sources of information [58] but do not deposit,as yet, back to the platform. The Chemical Translation Service [59] hosted by the University ofCalifornia [60] has clearly sourced the majority of their data from PubChem as the majority oferroneous name-structure synonyms have been migrated to the system. For example, thechemical record associated with methane is labeled as activated charcoal, includes most of theincorrect names from PubChem for methane (including diamond, soot etc.) but the issue isfurther compounded with errors by displaying incorrect structure representations as shown inFigure 10 [Write blog post]. Since data continues to proliferate between various online resources there is acontinuous and growing problem as new databases appear. We are unaware of any reportsdetailing the processing and preparation of data other than the recent work describing the NPCBrowser but we can report on our experiences with developing the ChemSpider database. ChemSpider has used both algorithmic and human curation to remove many thousandsof inherited errors from the database. This has led it to being recognized as a valuable source ofdata which a number of online databases and publishers have shown an interest in linking to.ChemSpider data are now linked to other resources in two ways. A database host provides theirset of chemical structures in either SDF format, as InChIs or in SMILES format. This file is thenused to obtain associated ChemSpider IDs and the file is returned to the database host for themto insert the appropriate links to ChemSpider in their interface. This has already been done forthe ChEBI database [47]. An alternative manner in which to produce the links to ChemSpider isto use the ChemSpider web services [61] to search the database based on chemical structure(in one of a number of query formats). As a result of the web service query they will return theChemSpider ID and insert it into their own database for linking. Several examples of thisapproach exist including that used by Nature Publishing Group [62] to populate the chemicalrecords associated with their articles in both Nature Chemistry and Nature Chemical Biology.Collaborative Drug Discovery also links to compounds that are registered in either public or
  24. 24. private vaults [63]. It should be noted that the linking between the external sites andChemSpider is based on retrieval of the ChemSpider ID associated with their query structure.This approach does not guarantee the validity of the association of the compound in theirdatabase with any chemical name, property or assertion. It is simply a link based on a look-up ofthe compound.Structure Standardization An agreed upon set of standardization rules that can be agreed upon and implementedfor all chemistry databases would greatly enable the interlinking between online resources andperhaps also help decrease errors. If both common standards and a common toolset wereavailable then databases would be standardized in the same way and contain commonidentifiers for linking, for example InChIs and SMILES, each generated using the same codebase. The Open PHACTS project [64] has agreed on the need for a set of structurestandardization rules that will be used to process all incoming chemical compound content thatwill be processed and hosted in the ChemSpider database serving the chemical services to theproject. As the project is to serve the pharmaceutical industry it has been decided that thestructure standardization guidelines provided by the Food and Drug Adminstration [65] will formthe foundation of the rule base, modified as necessary with the agreement of the EFPIA [66]members of the Open PHACTS consortium. Some very basic rules extracted from thedocument include the handling of the nitro groups and salts as discussed earlier. It is to beexpected that under the standardization rules that will be applied that we may see a significantreduction in the number of records in the aggregators database if the standardization processcollapses tautomers.Provenance in Databases
  25. 25. The majority of online databases do not provide details regarding the provenance of all of theircontent. We think is an extremely challenging issue. As an example, for articles regardingchemical compounds on Wikipedia much of the content is aggregated by a number of co-authors, with only some of it sufficiently referenced, with the ChemBoxes [67] or DrugBoxes [68]containing various types of experimental data, identifiers and links to external resources. Untilrecently much of this data was not validated in any way but efforts are presently underway tovalidate the data and mark it as such [69]. A ChemBox shows a chemical structure diagram, asystematic name, a list of identifiers, some links to online databases and a series ofphysicochemical properties. Ideally there needs to be some attribution as to where data ormolecules came from. One of the reasons that provenance might not be provided is that thesource of the data, and the associated license, might preclude such data sharing. Data licensingof online data is both a confusing and contentious issue but ultimately underlies thedevelopment of new systems, both commercial and public.Crowdsourced Review of Public Domain Databases Based on the discussions in this publication and others referenced herein, public domaindatabases contain data of variable quality. The value and utility of the databases depends notonly on the quality and quantity of the content but also the mappings and associated meta data.While we have focused on the quality of data based primarily on the mappings betweenchemical names and the correctness of the associated chemical structures, the overall value ofthe database is best defined by the users of the database resource and its content. We believeit is therefore appropriate to engage the community in providing their feedback regardingdatabases they use regularly. To facilitate this we are gathering input from the community via aScientific Databases Wiki [70]. The intention is to have both the hosts of scientific databases, aswell as the users, contribute wiki pages. Since the wiki is an open environment anybody canregister and contribute content. At present there are 15 chemistry related databases described
  26. 26. on the database. For databases containing chemical compounds it is intended that some formof quantitative quality factor can be created that ranks the database. Williams [71] haspreviously reported on a drug disambiguation exercise to validate the accuracy of the structurerepresentations of over 200 of the world’s bestselling drugs in a series of databases. The workdemonstrated that structure validation is a time-consuming and painstaking process susceptibleto the performance of cheminformatics software tools and dependent on the cross-validation ofvarious data sources. The result in this case was a quantifiable ranking of data accuracy in aseries of well-known public databases. Williams has reported that efforts are afoot to share curation of the data on ChemSpiderwith other databases [WRITE BLOG POST]. A daily curation feed from ChemSpider has beenestablished that summarizes the validation and deletion of name-structure relationships onChemSpider. The feed includes an InChIKey [72] together with a list of validated names anddeleted names. The InChIKey is a hashed version of the International Chemical Identifier, atext-based representation of a chemical structure, and can be used to check for the presence ofthe associated structure on another database. If the compound is detected via an InChIkeymatch then validated identifiers can be added and deleted identifiers can be removed therebyexpanding the curation efforts of ChemSpider to other databases. An alternative use of the feedwould be to use the validated names to search the database for the associated chemical andthen compare the InChIKey from the feed with that associated with the chemical name in thedatabase. If they do not match then the record can be flagged for manual inspection. While itwould be possible to pass out InChI strings, SMILES strings or molfiles in the validation feed thedecision was made to not do this for the time being in order to engage other databases ingenerating such a feed in a standard format. To date only the Drugbank database is utilizing thefeed for validating their data but there is no reciprocal return as yet. The feed can be extendedto include compound deprecation flagging, property value validation etc. if the communitychooses to engage in mutual sharing.
  27. 27. Discussion We have described some of the errors we are finding which are common to moleculedatabases. As chemistry content is expanding on the internet these errors are proliferating.Many errors can be identified quickly. For example, our analysis of the NPC browser ‘HTSamenable compounds’ subset of data for >7600 compounds identified fundamental errors instereochemistry, valency issues and charge imbalances in a few minutes work using arudimentary software tool. Such analyses can be performed by database owners prior torelease. Even compounds that are suggested as having undergone “quality control” have errorswhich could range from structural integrity to mis-assignment of synonyms, incorrectlyassociated CAS numbers or target mappings etc. Correction of these errors manually indatabases will be a considerable task. This raises several important questions such as how dowe ensure that structures are as close to 100% correct as is possible based on assertion-basedapproaches and manual correction. Who corrects the errors in this database and who should beresponsible for ensuring the integrity of such databases? The multitude of government fundeddatabases such as EPA’s DSSTOX [73], ACTOR [74] and Toxcast [75], NIH’s PubChem [76],the FDA’s multitude of systems, ChemIDPlus [32], and many others are all generatingcompound databases of differing quality and, surely, at this point, one would expect that it wouldbe easy to gather a qualified set of well-known drug compounds with little effort. It is also likely atrue statement that the quality of data in Wikipedia for many of these drugs is of higher qualitythan most databases. Although previous research efforts have gone into high throughput analyticalcharacterization of compound libraries [77] to identify impure or incorrectly synthesizedcompounds and to prevent ambiguous HTS results, there has been little research intocompound quality. Uses of compound databases include being combined with target annotationinformation and used to infer correspondence of molecule name and structure. These efforts are
  28. 28. useful for ranking targets and looking at druggability of targets and scaffold distribution. Thesetypes of metrics will be meaningless if the underlying structures are incorrect. Southan [78]discusses commercial compound databases vs public databases. Private databases are of adifferent scale. They did not look at quality of curation in the databases. Southan also describedpairwise comparison of public versus private databases which will also be impacted by thefidelity of structures in each database. Schuffenhauer et al., [79] described the use of ontologies for pharmaceutical compoundsfor pharmaceutical ligands and virtual screening. However such efforts will certainly be nullifiedif the structures used in such databases are incorrect thereby leading to incorrect classificationsor retrieval of compounds from similarity searching that may be false positives. Chen et al. [ref]described automated biochemotype annotation methods using PASS. If the underlyingcompound structures have errors then the predictions will also be erroneous.The dangers of scientists taking the molecule structures in databases at face value are that theerrors will profoundly impact their work. Any computational models generated will be incorrect[3, 4 Sean, what references?]. If virtual hits are found by 3D screening this dataset they mayalso be misleading due to the stereochemistry errors. It is not just an issue with this databasebut all of these resources on the web [5 Sean, what reference?]. As chemistry databases haveproliferated in size these errors have accumulated. While some are checking for errors andcorrecting as they are suggested, this is the exception rather than the norm. We have called fora good-faith effort for checking the data content carefully before making the database public[23]. We suggest that there needs to be a considerable investment in structure integritychecking software and more manual curation efforts. There does not appear to be any change on the horizon in terms of the number andnature of these databases that, based on our evidence-based examination of data reuse andproliferation, will continue to distribute data of unknown quality across the internet. Thisdisturbing and continuing trend needs to be managed. A basic gold-standard drug look-up
  29. 29. dictionary of correct structure files for drugs with their associated synonyms is not yet availableonline. It has been acknowledged that even the FDA does not have a repository of approveddrugs [25] which is quite shocking in this day and age. We can envisage a change in thissituation on all fronts through collaboration and some of the large scale semantic web effortslike Open PHACTS [64]. Ultimately data validation is a human activity performed by experts anda clear path forward to engage community participation is likely going to require direct funding,some form of rewards and recognition to encourage engagement, or depend on the charitablenature of skilled scientists to contribute.Conflicts of interestAntony J Williams is employed by The Royal Society of Chemistry which produces ChemSpiderdiscussed in this article. Sean Ekins consults for Collaborative Drug Discovery, Inc.AbbreviationsACTOR : Aggregated Computational Toxicology ResourceCAS : Chemical Abstracts ServiceChEBI : Chemical Entities of Biological InterestChEMBL : ??????????????????DSSTox : Distributed Structure Searchable Toxicity databaseEPA : Environmental Protection AgencyHTS : High-throughput screeningInChI : International Chemical IdentifierInChIKey : International Chemical Identifier KeyMDDR: MDL Drug Data ReportNCGC: NIH Chemical Genomics Center
  30. 30. NIH : National Institutes of HealthNPC browser : NIH Chemical Genomics Center Pharmaceutical Collection browserOpen PHACTS : Open Pharmaceutical Concepts Triple StorePDSP : Psychoactive Drug Screening ProgramQSAR : Quantitative Structure Activity RelationshipSMILES : Simplified Molecular Input Line Entry SpecificationTRND : Therapeutics for Rare and Neglected DiseasesReferences1 Brzustowicz, L.M. et al. (1993) Molecular and statistical approaches to the detection and correction of errors in genotype databases. Am J Hum Genet 53 (5), 1137-11452 Migliavacca, E. et al. (2001) MDB: a database system utilizing automatic construction of modules and STAR-derived universal language. Bioinformatics 17 (11), 1047-10523 Fourches, D. et al. (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50 (7), 1189-12044 Oprea, T. et al. (2002) On the propogation of errors in the QSAR literature. In Euro QSAR 20025 CAS SciFinder database. (Vol. 2011), pp. CAS SciFinder database6 Olah, M. et al. (2005) WOMBAT: world of molecular bioactivity. In Chemoinformatics in Drug Discovery (Oprea, T.I., ed.), pp. 223-239, Wiley7 http://accelrys.com/products/databases/bioactivity/mddr.html.8 Sheridan, R.P. and Shpungin, J. (2004) Calculating similarities between biological activities in the MDL Drug Data Report database. J Chem Inf Comput Sci 44 (2), 727- 7409 Wittig, U. et al. (2004) Classification of chemical compounds to support complex queries in a pathway database. Comp Funct Genomics 5 (2), 156-16210 Clarke, D.L. et al. (2008) Applying modern error theory to the problem of missed injuries in trauma. World J Surg 32 (6), 1176-118211 Goldberg, S.I. et al. (2008) Analysis of data errors in clinical research databases. AMIA Annu Symp Proc, 242-24612 Finney, J.M. et al. (2011) An efficient record linkage scheme using graphical analysis for identifier error detection. BMC Med Inform Decis Mak 11, 713 Bell, A.W. et al. (2009) A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods 6 (6), 423-43014 Lamontagne, J. et al. (2010) Proteomics-based confirmation of protein expression and correction of annotation errors in the Brucella abortus genome. BMC Genomics 11, 300
  31. 31. 15 Zhang, C. et al. (2009) Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics 25 (20), 2708-271416 Jeong, E. et al. (2011) Ontology-based instance data validation for high-quality curated biological pathways. BMC Bioinformatics 12 Suppl 1, S817 Wong, W.C. et al. (2010) More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology. PLoS Comput Biol 6 (7), e100086718 Davis, A.M. et al. (2008) Limitations and lessons in the use of X-ray structural information in drug design. Drug Discov Today 13 (19-20), 831-84119 Fu, X. et al. (2011) Data governance in predictive toxicology: A review. J Cheminform 3 (1), 2420 The PubChem Database.21 ChemSpider.22 Wishart, D.S. et al. (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36 (Database issue), D901-90623 Williams, A.J. and Ekins, S. (2011) A quality alert and call for improved curation of public chemistry databases. Drug Disc Today 16, 747-75024 Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME)- toward standards for microarray data. Nat Genet 29 (4), 365-37125 Huang, R. et al. (2011) The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Sci Transl Med 3 (80), 80ps1626 TRND Program.27 NPC Browser Press Release.28 (2006) The Merck Index, Merck29 Anon. (2011) The US Pharmacopeia, The United States Pharmacopeial Convention30 Williams, A.J. (2010) Community Views and Trust in Public Domain Chemistry Resources.31 DrugBank.32 http://chem.sis.nlm.nih.gov/chemidplus/.33 PDSP Ki Database.34 http://dailymed.nlm.nih.gov/dailymed/about.cfm.35 Williams, A.J. (2011) All that glisters is not gold: Quality of Public Domain Chemistry Databases. (http://blogs.scientificamerican.com/guest-blog/2011/08/02/all-that-glisters- is-not-gold-quality-of-public-domain-chemistry-databases/)36 Williams, A.J. (2010) The Messy World of Even Curated Chemistry on the Internet. (Vol. 2011)37 Baker, M. (2006) Open-access chemistry databases evolving slowly but not surely. Nature Rev Drug Disc 5, 707-70838 Structures of Taxol on PubChem. (Vol. 2011)39 Cholesterol without stereochemistry. (Vol. 2011)40 List of chemical names for methane. (Vol. 2011)41 Confusing Search Results in the NPC Browser. (Vol. 2011)42 Rabbits, Potatoes and other Vegetables in the NCGC Database. (Vol. 2011)43 Duplicate compounds in the NPC Browser and NCGC Dataset. (Vol. 2011)44 The chemical structure of Neomycin. (Vol. 2011)45 Bufogenin structures in the PubChem database. (Vol. 2011)46 Crowdsourcing on Wikipedia. (http://en.wikipedia.org/wiki/Crowdsourcing)47 ChEBI, Chemical Entities of Biological Interest.48 ChEMBL.49 Wolfram Alpha.
  32. 32. 50 Ekins, S. et al. (2011) Collaborative computational technologies for biomedical research, Wiley51 Pentavalent carbons in the NCGC Collection in the NPC Browser. (Vol. 2011)52 Williams, A.J. (2011) ChemSpider – An Online Database and Registration System Linking the Web (Slide 21/73).53 Williams, A.J. (2011) Data Quality in the NCGC Pharmaceutical Collection Browser Part 4. (Vol. 2011)54 Williams, A.J. (2011) "Terminal dimethyl" means death by methane, twice. (Vol. 2011)55 Williams, A.J. (2009) PubChem Deposition of ChemSpider Data is Well Underway. My Favorite Color is Green. (Vol. 2011)56 Williams, A.J. (2011) Press Release: Elsevier and NextBio Sign Partnership to Enrich ScienceDirect Content. (Vol. 2011)57 (2011) Structures from NextBio in PubChem. (Vol. 2011)58 (2011) Wolfram chemical data source information. (Vol. 2011)59 (2011) CTS: Chemical Translation Service. (Vol. 2011)60 Wohlgemuth, G. et al. (2010) The Chemical Translation Service--a web-based tool to improve standardization of metabolomic reports. Bioinformatics 26 (20), 2647-264861 (2011) ChemSpider web services. pp. October 31st 201162 Nature Publishing Group.63 Ekins, S. et al. (2011) Pioneering use of the cloud for development of the collaborative drug discovery (cdd) database In Collaborative Computational Technologies for Biomedical Research (Vol. 335-361) (Ekins, S. et al., eds.), Wiley and Sons64 (2011) OpenPHACTS Project. (Vol. 2011)65 Substance Registration System - Unique Ingredient Identifier (UNII). (Vol. 2011)66 EFPIA, European Federation of Pharmaceutical Industries and Associations. (Vol. 2011)67 (2011) ChemBox Template in Wikipedia. (Vol. 2011)68 (2011) DrugBox template in Wikipedia. (Vol. 2011)69 Wikipedia talk:WikiProject Chemistry/CAS validation. (Vol. 2011)70 Scientific databases wiki. (Vol. 2011)71 Williams, A.J. (2011) Structure representations in public chemistry databases: The challenges of validating the chemical structures for 200 top-selling drugs. In American Chemical Society72 (2011) InChIKey on the InChI Wikipedia Page. (Vol. 2011)73 Richard, A.M. (2006) DSSTox web site launch: Improving public access to databases for building structure-toxicity prediction models. Preclinica 2, 103-10874 Judson, R. et al. (2008) ACToR--Aggregated Computational Toxicology Resource. Toxicol Appl Pharmacol 233 (1), 7-1375 Dix, D.J. et al. (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95 (1), 5-1276 Wang, Y. et al. (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37 (Web Server issue), W623-63377 Kenseth, J.R. and Coldiron, S.J. (2004) High-throughput characterization and quality control of small-molecule combinatorial libraries. Curr Opin Chem Biol 8 (4), 418-42378 Southan, C. et al. (2009) Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds. J Cheminformatics 1, 1079 Schuffenhauer, A. et al. (2002) An ontology for pharmaceutical ligands and its application for in silico screening and library design. J Chem Inf Comput Sci 42 (4), 947- 95580 (2011) Peter Murray-Rust webpage. (Vol. 2011)
  33. 33. 81 (2011) Open Data on Wikipedia. (Vol. 2011)82 Open Knowledge Foundation: Open Data Licensing. (Vol. 2011)83 Panton Principles: Principles for Open Data in Science. (Vol. 2011)84 Figshare. (Vol. 2011)85 Flickr photo sharing system. (Vol. 2011)86 PHYSPROP database. (Vol. 2011)87 ChEMBL Creative Commons Case Study. (Vol. 2011)88 PubChem data transfer agreement. (Vol. 2011)89 Wikipedia List of Bestselling Drugs. (Vol. 2011)SET AS SIDEBOX IN ARTICLELicensing of Data ContentOnline databases mix and aggregate content on a regular basis, linking to each other, losingprovenance in many other cases and, in the process of passing through cheminformatics tools,sometimes changing the nature of the chemical compounds. An even larger issue is thepotential fragility of the online databases based on poorly understood licensing for each of thedatabases. The chemical blogosphere has been host to many discussions regarding the needfor clear data licensing definitions on chemistry related data. In particular, Murray-Rust [80]espouses the value of “Open Data” [81] to the scientific discovery process and encourages clearlicensing of all chemistry data according to Open Knowledge Foundation licensing [82] and theso-called Panton Principles [83]. It is generally accepted that individual data points cannot becopyrighted but that data collections may be copyrighted. Therefore a single data point such asa melting point cannot be copyrighted. Neither can a connection table, InChI or SMILES.However, a depiction of the chemical compound as a figure in a publication can be copyrighted,even though the appearance of FigShare [84] now facilitates making figures open and availableto the community. In theory, if authors uploaded their figures to FigShare (or other onlinestorage such as Flickr [85]) prior to publication even following copyright transfer to publishersnear identical images will be available to be sourced from the internet. It is difficult to define
  34. 34. where data transitions to become a copyrightable collection. Is a file containing 100 chemicalstructures, associated chemical identifiers and experimental parameters such as melting pointscopyrightable? Based on the activities of commercial businesses in this domain the answer islikely yes. Numerous well known databases supporting the Life Sciences are freely available fordownload. These include PubChem [20], DrugBank [31], ChEBI [47], ChEMBL [48] and thePHYSPROP data collection [86]. A review of the licensing details for each provides a variety ofdetails regarding the terms and conditions of usage. In general, of the many databasesavailable online, the licensing of the majority of the data is undefined. The majority of SDF filesdownloadable from chemical vendor websites have no defined licenses at all. Despite theassumptions that PubChem data are “Open”, since the data are downloadable, they are notprovided with any specific licenses per se but rather depositors assign rights simply bydepositing data, thereby indicating acceptance of the depositors agreement. It is unlikely thatthe majority of scientists who download the data are aware of any license limitationsconstraining the data usage and have not concerned themselves with whether it is appropriateto monetize the data or repackage and redistribute under new licenses. It is just as unlikely thatall depositors have fully understood that their data can be downloaded, redistributed and,ultimately, licensed, commoditized and monetized. The ChEMBL database hosted by the European Bioinformatics Institute was recentlyreleased under a Creative Commons data license [87] and it is hoped that more databases willbe released with such clarity in the future. That said, even such well-defined and communityaccepted licenses can be abused. The continuation of an original license through otheraggregators is also difficult to police and the deposition of ChEMBL data to PubChem is madeunder the PubChem data transfer agreement [88] and will likely confuse the majority of thecommunity to believing, once again, that all data are public domain, therefore requiring noattribution.
  35. 35. TablesTable 1. An online survey requesting community feedback regarding trust in online chemistrydatabases.Table 2
  36. 36. Table 3. Summary of substructure search results for NPC browser searching with 3 steroidsubstructures. Substructure # of # of No Incomplete Complete but Hits Correct stereochemistry Stereochemistry incorrect Hits stereochemistry Gonane 34 5 8 21 0 Gon-4-ene 55 12 3 33 7 Gon-1,4-diene 60 17 10 23 10
  37. 37. Table 4. Six identifiers linked with bufogenin in PubChem.
  38. 38. Table 5: A review of chemical structures retrieved based on name-based searches of the NPCBrowser (version 1.0.22) using a random selection of 50 of the top-selling US drugs (Source:Wikipedia [89]. The numbers in the error column refer to the list of errors given below the table. Generic Name Correct Structure Number of hits Error Rosuvastatin 1 2 Zocor X 1 Thalidomide X 1 Taxol 1 2 Basen 1 1 Vytorin 1 4 Depakote 1 4 Symbicort* 1 3,4 Spiriva 1 6 Prograf 1 2 Ezetimibe 2 5,8 Budesonide 1 3 Formoterol 3 3,8 Pioglitazone X 2 2,8 Rabeprazole X 1 Anastrozole X 1 Nifedipine X 1 Goserelin 1 1 Sildenafil X 1
  39. 39. Cefdinir X 1 Cyclosporin 3 7,8Clarithromycin 2 2,8 Tegaserod 1 6 Famotidine X 1Drospirenone X 1 Tenofovir 4 3,8Emtricitabine 4 3,8 Atorvastatin X 1 Clopidogrel X 1Esomeprazole 3 1,8 Amlodipine X 5 8 Olanzapine X 3 8 Valsartan X 3 8 Risperidone X 4 8Montelukast X 1 Quetiapine X 1Lansoprazole X 3 8 Losartan X 1 Alendronate X 4 8Escitalopram 1 1 Imatinib X 3 8 Zolpidem X 1
  40. 40. Donepezil X 2 8 Cetirizine X 2 8 Irbesartan X 2 8 Docetaxel X 1 2 Sertraline X 2 8 Oseltamivir X 1 Celecoxib X 1* Multicomponent combination drug1 Correct skeleton, No stereochemistry2 Correct skeleton, Missing stereochemistry3 Correct skeleton, Incorrect stereochemistry4 Single component of multicomponent structure5 Multiple components for single component structure6 No structure retrieved7 Incorrect skeleton8 Multiple structures based on name search
  41. 41. Figure LegendsFigure 1. Image from the original downloaded NPC browser showing the result of searching forchromium.Figure 2. Image from the original downloaded NPC browser showing the result of searching forarsenic.Figure 3. Image from the original downloaded NPC browser showing the result of searching forneomycin, an incorrect structure.Figure 4. Image from the originally downloaded NPC browser showing the result of searchingfor neomycin, resulting in 26 structures.Figure 5. Steroidal substructures used for searching the NPC browser “HTS screening” dataset.Figure 6. The result of searching for bufogenin in the original downloaded NPC browser datacollection.Figure 7. The structure of bufogenin. The left hand side shows the structural representationgiven in the NPC Browser while the structure on the right is determined from a number ofvalidated sources. The S and R labels indicate specific stereocenters while the asterisksindicate undefined stereocenters.Figure 8. An example of a compound with charge imbalance and misassociation of name andstructure. This commonly also leads to the misassociation of CAS Numbers in chemical vendorfiles as vendors submit the chemical structure of a neutral compound but the CAS Number foran associated salt.Figure 9. Alternative representations of a sodium atom with a carboxylic acid and theequivalency of InChIs versus SMILES strings.Figure 10. The chemical record for methane from the “Chemical Translation Service” based onPubChem data. The molecular formula, InChI and mass agree with that of methane but the
  42. 42. structure shown is of bare carbon. All displayed synonyms associated with the compounds areincorrect: charcoal, graphite, CH2 etc.
  43. 43. Figure 1
  44. 44. Figure 2
  45. 45. Figure 3
  46. 46. Figure 4
  47. 47. Figure 5
  48. 48. Figure 6
  49. 49. Figure 7
  50. 50. Figure 8
  51. 51. Figure 9
  52. 52. Figure 10

×