Public Compound Databases


The internet has fast become the first port of call for all searches. The increasing array of chemistry-related resources now available provides chemists a direct path to the discovery of information, one previously accessed via library services and limited to commercial and costly resources. The diversity of information available online is expanding at a dramatic rate and a shift to publicly available resources offers significant opportunities in terms of the benefit to science and society. While the data available online do not generally meet the quality standards available from manually curated sources there are efforts afoot to gather scientists and “crowd source” an improvement in the quality of available data. This article will discuss the types of public compound databases available online, provide a series of example databases and focus on the benefits and disruptions associated with the increased availability of such data and integrating technologies to data-mine the available information.

Published in: Technology


Public Chemical Compound Databases

Antony J. Williams
Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587
Corresponding PHONE: 919 341-8375

Keywords
Public databases, chemical structure databases, Open Data, chemoinformatics, data mining, internet chemistry, Wikis, blogs,
Introduction

The internet is likely used on a daily basis by the majority of scientists. There is little doubt that the web is the primary portal for querying information and data and, when coupled with the intranet services of most companies, is the tool of choice for most general searches. For many years the search for scientific information would start at the library and commonly engage skilled search professionals. These people had a deep understanding of navigating the plethora of databases and resources, each with its own query language, and performed searches using for-fee resources. While such skills remain of value, most scientists now conduct the majority of their own searches and make use of a no-cost, intuitive and expansive internet of information.

There has been tremendous growth in scientific internet resources, and such facile access to chemistry information and data provides enormous opportunities. Bioinformatics established the trend of providing online access to data and Chemistry, in many ways, is far behind. Open-access databases such as GenBank [1] and the Protein Data Bank (PDB) [2] have been assisting biologists to translate gene and protein sequences into biological relevance for over two decades. It is possible that the difference results from publishers in Chemistry discouraging the open flow of data and information. This is true not only for scientific articles but also for chemistry databases. With the changing expectations of society in terms of freedom of access to information, and the efforts of many evangelists
and groups, a shift towards both free and open access (vide infra) chemistry-related information is well underway and is likely to accelerate. Murray-Rust envisages a world in which all scientific information is instantly available [3•]. This emerging world of e-science or cyberscholarship seeks “to develop the tools, content and social attitudes to support multidisciplinary, collaborative science. Its immediate aims are to find ways of sharing information in a form that is appropriate to all readers.” This article will discuss the work already underway to support this noble and valid effort to provide enhanced public access to Chemistry data, focusing specifically on public chemical compound databases. There are many tens of indexes of chemistry databases available online and the reader is encouraged to perform one or more generic searches on “chemistry databases” to retrieve a list of related information. The author’s preferred source of information is the Wiki hosted by Gary Wiggins [4•].

While the availability of freely accessible information is clearly of value to scientists, there are risks in terms of the quality of the information available. Quality is certainly an issue the mainstream publishers focus on during their peer review, editorial and curation processes, and their efforts add value by providing access to qualified scientific information. Of course no process is perfect and inaccuracies do creep into even reviewed publications and databases. That said, public compound databases have become a disruptive force and are likely to have a significant impact on the business models of publishers, especially given the increasing capabilities and diversity of data within the public compound databases.

Public Chemistry Databases

There are many freely available chemical compound databases on the web and they assume many different forms. They can simply be a collection of chemical
structures aggregated into a single file and made available, gratis, for people to download and utilize as they see fit. These files are generally available in SDF format [5] and can be downloaded and then imported into a database for searching and viewing. There are literally hundreds of such files available online and they are commonly provided by chemical vendors in order to advertise their catalog collections. These files generally contain chemical identifiers in the form of chemical names (systematic and trade) and registry numbers. The files can also contain experimental or physical properties, file-specific identifiers and pricing information.

There are aggregators who gather such files of chemical structures and related information, assemble them into a single database and serve it to the public (some examples will be discussed later). Since the files are assembled in a heterogeneous manner, the resulting data are plagued with inconsistencies and data quality issues. Such an approach to gathering and merging data is a far cry from that taken by commercial database vendors who manually gather and curate data. Some examples of these commercial organizations are CAS [6], InfoChem [7] and Symyx [8]. While the commercial databases offer curated data, there is certainly a price barrier to accessing the information.

A number of the free online resources are also manually curated and, as will be discussed later, can offer as high a quality as the commercial offerings. These resources are, however, constructed with a specific focus in mind and therefore commonly number in the low thousands of structures rather than the millions available in the larger online databases. Meanwhile, there are a number of large online database resources offering access to valuable data and knowledge. Some of these databases should be thought of as “linkbases”.
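As an aside on the SDF files described earlier in this section, the following is a minimal sketch of how one record of such a file might be read into a simple dictionary before being loaded into a database. It assumes the common V2000 layout (three header lines, a counts line, atom and bond blocks, "M  END", then "> <FIELD>" data items and a "$$$$" record terminator). The sample record and the field names CAS_RN and PRICE_USD are hand-written and hypothetical; real vendor files use their own field names, and a production workflow would normally rely on a cheminformatics toolkit rather than this hand-rolled parser.

```python
import re

# Matches a data-item header line such as "> <CAS_RN>" and captures the field name.
_FIELD = re.compile(r"^>.*<([^>]+)>")


def _parse_record(lines):
    """Turn the lines of one SDF record into a dict (structure block is not interpreted)."""
    counts = lines[3]  # fourth line of a V2000 molfile is the counts line
    record = {
        "name": lines[0].strip(),
        "n_atoms": int(counts[0:3]),
        "n_bonds": int(counts[3:6]),
        "fields": {},
    }
    # Data items start after the "M  END" line that closes the structure block.
    i = next(k for k, ln in enumerate(lines) if ln.startswith("M  END")) + 1
    while i < len(lines):
        match = _FIELD.match(lines[i])
        if match:
            value = []
            i += 1
            while i < len(lines) and lines[i].strip():  # value runs to the next blank line
                value.append(lines[i])
                i += 1
            record["fields"][match.group(1)] = "\n".join(value)
        else:
            i += 1
    return record


def parse_sdf(text):
    """Split SDF text on '$$$$' terminators and parse each record."""
    records, block = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if block:
                records.append(_parse_record(block))
            block = []
        else:
            block.append(line)
    return records


# Hand-written illustrative record (ethanol); coordinates are arbitrary.
SAMPLE = "\n".join([
    "ethanol",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "> <CAS_RN>",
    "64-17-5",
    "",
    "> <PRICE_USD>",
    "12.50",
    "",
    "$$$$",
])

records = parse_sdf(SAMPLE)
```

Because every vendor chooses its own field names, an aggregator merging many such files has to map fields like these onto a common schema, which is one place where the inconsistencies mentioned above can enter.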
For the purpose of this article a linkbase is a repository of molecular connection tables (chemical structures) linking out to various sources of data and associated information. While it is impossible to be exhaustive within the confines of an article
of this nature, an overview of a number of online public compound databases, focusing specifically on free access databases, will be provided.

Data access – Open and Free are Different

Confusion about the difference between Open Access (OA) and Free Access (FA) persists [9], but both offer an opportunity to help advance science by facilitating the sharing of data, information and knowledge with no barriers of price or access. The first major international statement on open access was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The definition of Open Access is as follows: “By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition has been suggested [12]: “Free access is access that removes price barriers but not necessarily any permission barriers.”

For the purpose of this article we are interested not only in FA and OA but also in Open Data. Quoting from an online resource [13]: “Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control”. As yet there are no commonly agreed upon definitions but, thanks to the efforts of Open Data evangelists and groups, progress is being made [14•,15••,16-18].
The majority of scientists, however, cannot differentiate between free access and open access since both provide free access to information of value to them in their work. In a similar way, the majority of scientists do not care about the distinctions between Open and Closed data. They utilize free access public chemical compound databases on an as-needed basis, derive value from the content and move on, not concerned whether the data posted online are Open or Closed. Chemical Abstracts Service (CAS) [6] and their CAS Registry Numbers (RNs) [19] have played a dominant role in managing a curated registry of chemical entities and related chemical and biological literature. Their proprietary registration system does not link to chemical structures in the public domain and their business model is at risk [20••,21].

Data Quality – The Necessity for Curation

Before reviewing examples of public compound databases we should review the issues of data quality. All databases containing chemical compounds contain errors. These errors can arise for a series of reasons including errors in transcription, historical errors (a compound was “correct” when entered but later re-characterized), issues with graphical representation and a plethora of other reasons. The quality of chemical information in the public domain is generally quite low. This does not mean that the data are not of value but that care needs to be taken before treating any provider as an authority. There is, of course, no central body responsible for the quality of data in the public domain. Databases of chemical structure information such as PubChem [22••], ChemIDplus [23] and ChemFinder [24] are commonly looked upon as authorities in terms of reliable information. However, these sources are also aggregators of information and are at risk of perpetuating errors from the original public data and depositions.
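Some transcription errors of the kind described above can be caught mechanically. As one illustration, a CAS Registry Number carries a final check digit: the remaining digits, read from right to left, are multiplied by 1, 2, 3, ... and the sum modulo 10 must equal the check digit. The sketch below applies that rule; the function name is hypothetical, and passing the check only shows that a number is well formed, not that it is attached to the correct structure or name.

```python
import re

# A CAS RN has the shape: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(rn):
    """Return True if the string is shaped like a CAS RN and its check digit is consistent."""
    match = _CAS_PATTERN.match(rn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)  # all digits except the check digit
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(pos * int(d) for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check


# 7732-18-5 (water) is well formed; swapping two digits breaks the checksum.
valid = cas_check_digit_ok("7732-18-5")
transposed = cas_check_digit_ok("7723-18-5")
```

A check of this sort is cheap enough to run over every record of an aggregated file, but it cannot detect a valid registry number that has been paired with the wrong compound, which is why manual curation remains necessary.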
Errors in structure-identifier pairs are common [25] and inaccurate structure representations,
specifically with regard to stereochemistry, proliferate across many databases. A definitive description of the challenges regarding quality in public domain databases, and of the rigorous processes required to aggregate quality data, was provided by Richards et al. [26••]. During their assembly of the EPA DSSTox databases they assembled the chemical structures, chemical names and CAS Registry Numbers for over 8000 chemicals from numerous toxicity databases. The data they extracted were carefully curated and validated using multiple public information sources [27].

With regard to the quality of the chemical information presented with bioassay data on PubChem, Richards cautioned 'user beware' [26]. Since the chemical structure content is deposited without additional review, the user is at risk. Errors in chemical names are common, and multiple structure errors have been identified. Richards encourages users to make informed judgments on the quality of data based on prior knowledge of the data submitter. The responsibility for the quality of the PubChem database therefore rests primarily with the depositors and, as many of these are commercial chemical vendors, their focus on quality falls far short of the stringent expectations of the community. The proliferation of errors from PubChem into other databases has been identified [28] and a definitive effort to cleanse the errors from the data, be it with regard to chemical structures, names or identifiers, is going to be required. PubChem is not resourced to perform these operations, but the efforts of groups such as the ChemSpider team with their online curation [29] offer an opportunity to dramatically improve the quality of the data through both a robotic cleansing approach and manual examination by many users. Efforts such as these should help reduce errors and result in the proliferation of more validated information.

Public Compound Databases
PubChem

The highest profile online database is certainly PubChem [22]. It was launched by NIH in 2004 to support the New Pathways to Discovery component of their roadmap initiative [30]. PubChem archives and organizes information about the biological activities of chemical compounds into a comprehensive biomedical database and is the informatics backbone for the initiative, intended to empower the scientific community to use small molecule chemical compounds in their research. PubChem consists of three interconnected databases: PubChem Compound, PubChem Substance and PubChem BioAssay. PubChem Substance contains the records of substances deposited into the system by publishers, chemical vendors, commercial databases and other sources. PubChem Compound contains records of individual compounds (see Figure 1), some 18 million unique structures, and provides biological property information for each compound. PubChem BioAssay contains information about bioassays using specific terms pertinent to the bioassay. PubChem can be searched by alphanumeric text variables such as names of chemicals, by property ranges, or by structure, substructure or structural similarity. As of December 2007 its content is approaching 38.7 million substances and 18.4 million unique structures.

Such a source of data opens up new possibilities [31] with regard to data mining and extraction. Zhou et al. [32•] concluded that the system has an important role as a central repository for chemical vendors and content providers, enabling evaluation of commercial compound libraries and saving biomedical researchers from the work associated with gathering and searching commercial databases. They identified that over 35% of the 5 million structures from chemical vendors or screening centers found in the PubChem database are currently not present in the CAS registry.
PubChem continues to grow in stature, content and capability. The bioassay data resulting from the NIH Roadmap initiative are likely to continue to grow and PubChem will assume a prominent role in distributing the data in a standard format. Despite the obvious value of PubChem, the platform has caused quite a furor in recent years, including debates regarding the position of CAS relative to the resource. The reader is referred elsewhere for commentaries [33,34]. Others have commented on the quality of the data content within PubChem. Shoichet [35••] believes that the screening data are less rigorous than those in peer-reviewed articles and contain many false positives, and worries that chemists who use PubChem may be sent on a wild goose chase. Numerous problems arise from the variable quality of submissions from different data sources: there are thousands of errors in the structure-identifier associations due to this contamination, which can lead to the retrieval of incorrect chemical structures. It is also common to have multiple representations of a single structure due to incomplete or missing stereochemistry for a molecule [36].

DSSTox

The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project [37-39] provides a series of documented, standardized and fully structure-annotated files of toxicity information [40]. The initial intention for the project was to deliver a public central repository of toxicity information to allow for flexible analogue searching, SAR model development and the building of chemical relational databases. In order to ensure maximum uptake by the public and allow users to integrate the data into their own systems, the DSSTox project adopted the common standard file format (SDF) to include chemical structure, text and property information. The DSSTox databases were also deployed online to provide free public access to the data files without the dependency on a desktop software package for
querying and managing the data files. The overall aims of the project, to deeply integrate chemical structure information with existing toxicity data and to facilitate interrogation of the data, have been achieved. The DSSTox datasets are among the most highly curated public datasets available and are likely the reference standard in publicly available structure-based toxicity data.

eMolecules

eMolecules [41] offers a free online database of almost 8 million unique chemical structures. The database is assembled from data supplied by over 150 suppliers and provides a path to identifying a vendor for a particular chemical compound. By providing access to compounds for purchase, eMolecules offers a free access online service similar to those of commercial databases such as Symyx Available Chemical Directory [42], CAS’ ChemCats [43] and CambridgeSoft’s ChemACX [44] as well as a number of other providers. The system offers access to more than 4 million commercially available screening compounds and many tens of thousands of building blocks and intermediates. The database was recently enhanced by providing access to NMR, MS and IR spectra for over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also provides links to many sources of data for spectra, physical properties and biological data including the NIST WebBook [46], the National Cancer Institute [47], DrugBank [48•] and PubChem. eMolecules is presently fairly limited in its scope and primarily offers a very useful path to the purchase of chemicals and links to the more popular government databases. Nevertheless, the site is popular with chemists who are searching for chemicals and the interface is intuitive and easy to use, a key element in attracting users.
DrugBank

DrugBank [48•] is a manually curated resource assembled from information collected from a series of other public domain databases and enhanced with additional data generated within the laboratories of the hosts. The database aggregates both bioinformatics and cheminformatics data and combines detailed drug data with comprehensive drug target (i.e. protein) information. The database is hosted by the University of Alberta, Canada. Version 1 of the database, released in 2006, contained >4100 drug entries including >800 FDA approved small molecule and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug target sequences were linked to these drug entries. Each record in the database, known as a DrugCard, has >80 data fields. The information is split into drug/chemical data and drug target or protein data, and many data fields are linked to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The database supports extensive text, sequence, chemical structure and relational query searches. DrugBank has been used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education.

The version 2.0 release of DrugBank [51••] was released in January of this year with over 800 new drug entries, and each DrugCard entry was extended to include over 100 data fields, with half of the information being devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. The hosts have started to add experimental spectral data (NMR and MS specifically), and have expanded the coverage to nutraceuticals and herbal medicines. The DrugBank team also host the Human Metabolome Database (HMDB) [52], a database containing information about small molecule metabolites found in the human body. The database is used by scientists working in the areas of
metabolomics, clinical chemistry and biomarker discovery. The database currently contains nearly 3000 metabolite entries and each MetaboCard entry contains more than 90 data fields devoted to chemical, clinical, enzymatic and biochemical data.

NMRShiftDB

The NMRShiftDB is an open source collection of chemical structures and their associated NMR shift assignments [53•,54]. The database is generated as a result of contributions by the public and currently contains over 20,000 structures with >220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent to registered reviewers for evaluation. A significant part of NMRShiftDB was initially assembled from in-house databases from collaborating institutions and was entered unchecked. This called for external checks of the data based on independent databases and resources, and these have now been carried out by two specific groups [55,56]. Williams et al. [56] performed a cursory examination of the structural diversity within the database, concluded that the data represented a statistically relevant set to use in an evaluation of predictive accuracy, and demonstrated that the quality of the data is rather impressive. This effort shows the advantages of providing a set of Open Data for reuse and examination and the benefits of having many scientists examine, validate and correct the data. The same benefit is available to any database that allows its users to qualify, annotate and correct its data.

ChemSpider

ChemSpider was released to the public in March 2007 with the intention of “building a structure centric community for chemists” [57]. ChemSpider has grown into a resource containing almost 18 million unique chemical structures and recently
shared its data with PubChem, providing about 7 million unique compounds. The data sources have been gathered from chemical vendors as well as commercial database vendors, publishers and members of the Open Notebook Science community [58]. ChemSpider has also integrated the SureChem patent database [59] collection of structures to facilitate links [60] between the systems. The database can be queried using structure/substructure searching and alphanumeric text searching of both intrinsic and predicted molecular properties. The team has recently added virtual screening results, using the LASSO similarity search tool [61] to screen the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset [62].

ChemSpider has enabled unique capabilities relative to the primary public chemistry databases. These include real-time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The ChemSpider developers have made available a series of web services to allow integration with the system for the purpose of searching, as well as generation of InChI identifiers and conversion routines. The system also integrates text-based searching of Open Access articles and presently searches over 50,000 OA Chemistry articles, soon to be extended to 150,000 articles. The index is expected to increase dramatically as chemical names are extracted from OA articles and converted to chemical structures using name-to-structure conversion algorithms. These chemical structures will be deposited back to the ChemSpider database, thereby facilitating structure and substructure searching in concert with text-based searching. ChemSpider has a focus on, and commitment to, community curation.
The social community aspects of the system demonstrate the potential of this approach. The team have committed to the release of a wiki-like environment for further
annotation of the chemical structures in the database, a project they term WiChempedia [63]. They will utilize both available Wikipedia content and deposited content from users to enable the ongoing development of community curated chemistry.

The WikiSphere, Blogosphere and Internet as a Public Compound Database

Wikis and blogs are now common terms for the majority of users of the worldwide web and both are fast becoming chosen platforms for the exchange of information between many scientists, not only as tools within their own research groups but, more generally, with the public. A blog, or weblog, is a website where entries are written in chronological order and generally provide commentary or news on a particular subject [64]. A typical blog combines text, images and links to other blogs, web pages, and other media related to its topic. The original blog posting remains untouched by the commenter and readers are free to add their comments, generally in a mediated manner where the blog host retains control over the postings. An example screenshot from a chemistry-based blog hosted with the intention of examining and discussing organic syntheses is shown in Figure 2. The number of chemistry-related blogs continues to grow dramatically and there have been efforts to provide a unified view into some of these [65,66].

A wiki is a type of computer software that allows users to easily create, edit and link web pages and enables documents to be written collaboratively, in a simple markup language using a web browser. It is essentially a database for creating, browsing and searching information [67]. Wikipedia is certainly the best known today, though there are many others already online or used within the confines of an organization to manage content. There are active groups supporting the development of chemistry on Wikipedia and there are now thousands of pages
describing small organic molecules, inorganics, organometallics, polymers and even large biomolecules. Focusing on small molecules in general, each one has a DrugBox [68] or a Chemical Infobox [69]. A drug box provides identifier information (chemical name, registry number, and so on) and commonly the identifiers link out to a related resource. Chemical data, pharmacokinetic data and therapeutic considerations can also be listed. At present there are approximately 8000 articles with a chembox or drugbox, with between 500 and 1000 articles added since May. The detailed information offered on Wikipedia regarding a particular chemical or drug can be excellent [70] (see Figure 3), or weak in the case of “stub articles” [71]. There are many dedicated supporters and contributors to the quality of the online resource. Drug and chemboxes have been shown to contain errors, but the advantage of a wiki is that changes can be made within a few keystrokes and the quality is immediately enhanced. The opposite is also true and vandalism can occur. This community curation process makes Wikipedia a very important online chemistry resource whose impact will only expand with time.

Wikis have recently been used as the basis of Open Notebook Science [72]. The UsefulChem Wiki [73] includes a series of experimental pages commonly linked to related blog pages as shown in Figure 4. The Open Notebook Science movement appears to be gaining momentum with the support of vocal advocates such as Neylon [74], Murray-Rust [75] and many others. While both wikis and blogs are very valuable for the exchange of text and images, their usefulness for searching is severely limited because many chemists also need to query by chemical structure, reaction and data.
Neither Wikis nor blogs, as yet, are enabled for the purpose of structure and substructure searching and, therefore, remain isolated, in general, from cheminformatics based search procedures. One of the key developments which
has already facilitated the Semantic Web for chemistry is the InChI [76], the International Chemical Identifier. The InChI string is a textual identifier for chemical substances designed to provide a standard and human-readable way to encode molecular information (see Figure 5) and to facilitate the search for such information in databases and on the web. The InChI string, unfortunately, has only partly delivered on the promise of facilitating web-based searches, due to unpredictable breaking of InChI character strings by search engines. In order to resolve this issue the InChIKey was introduced. The condensed, 25 character InChIKey is a hashed version of the full InChI and is not human-readable. The equivalent InChIKey for the InChI string of L-ascorbic acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is that it enables web searches, but a lookup table to identify the associated structure, or a reference to the original InChI string, is necessary [77]. While tens of millions of InChI strings and keys have been populated into databases, their use is still in its infancy.

Publishers have started to embed InChIs into their articles and the Royal Society of Chemistry [78] is presently pioneering a new publishing model, Project Prospect [79], which includes InChI to demonstrate movement toward the semantic web for chemistry. Bloggers have started to use InChI strings and keys in their postings, and wiki pages are being InChI-enabled to help the web become structure searchable. A central lookup facility for published InChI strings will be necessary in order to facilitate substructure searching of the web, but this capability is likely to be developed in the near future. Willighagen already aggregates InChI strings onto a blog [80].
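The two points made above, that search engines break InChI strings apart and that a key needs a lookup table to be useful, can be illustrated with a small sketch. The tokenizer below is only a crude stand-in for a search engine (it splits on anything that is not a letter or digit), which is an assumption rather than a description of any real engine. The InChI string is that of ethanol, chosen because it is short; the key is the one quoted in the text for L-ascorbic acid; and the in-memory dictionary is a hypothetical placeholder for the kind of central lookup facility the text anticipates.

```python
import re


def search_tokens(text):
    """Crude model of search-engine tokenization: split on every non-alphanumeric character."""
    return re.findall(r"[A-Za-z0-9]+", text)


# InChI string for ethanol, used here only because it is short.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# InChIKey for L-ascorbic acid exactly as quoted in the text.
ASCORBIC_KEY = "CIWBSHSKHKDKBQ-JLAZNSOCBT"

# The InChI string shatters into many short, non-distinctive fragments ...
inchi_tokens = search_tokens(ETHANOL_INCHI)
# ... whereas the key yields two long blocks that are unlikely to occur by chance.
key_tokens = search_tokens(ASCORBIC_KEY)

# Hypothetical lookup table: a hashed key cannot be turned back into a
# structure, so some registry has to map it to a record.
KEY_TO_RECORD = {
    ASCORBIC_KEY: {"name": "L-ascorbic acid"},
}


def resolve(key):
    """Return the record registered for a key, or None if the key is unknown."""
    return KEY_TO_RECORD.get(key)
```

In this model even a small molecule's InChI string becomes a handful of fragments such as "2" and "3" that match countless unrelated pages, which is the practical motivation for searching on the key and resolving it afterwards.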
BioSpider [81] allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and delivers a report about the biomolecule. BioSpider uses a web-crawler to scan through dozens
of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. A summary includes physicochemical parameters, images, models, data files, descriptions and predictions concerning the query molecule.

An increasing number of public databases will continue to become available but the challenge, even now, is how to integrate and access the data. The implementation of InChIs for web-based searching [82], and the delivery of userscripts to aggregate information and computational results from different web resources, are bringing internet resources together so that they appear as a single monolithic public chemistry database. Willighagen et al. [83] use userscripts to enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. They showed how information from web pages can be used to link to, search, and process information in other resources, thereby allowing scientists to select and incorporate the appropriate web resources to enhance their productivity. Such tools connecting open chemistry databases and user web pages are an ideal path to more highly integrated information sharing.

Other Databases

The list of databases and resources reviewed above is only representative of the type of information available online. Other highly regarded databases frequented by this author include the Chemical Structure Lookup Service (with over 36 million unique structures) [84], CrystalEye [85], KEGG [49] and ChEBI [50]. There are also many other resources available and the reader is referred to one of the many indexes of such databases available on the internet to identify potential resources of interest [4, 86].
The Future of Public Compound Databases
The semantic web [87] is already offering us the chance to connect, simultaneously interrogate and mash up the results of searching multiple public compound databases. An enormous diversity of data is already available for interrogation by the public and continues to expand daily. This author remains concerned with the very real quality issues associated with public data sets. While the utopian dream of no errors in freely available data cannot be met, the push towards more Open Data without consideration being given to both manual and robotic curation could be risky to those using the data. Real-time curation of data within public compound databases is feasible [29], and Wikipedia is certainly a model of crowd sourcing [88] to build, curate and maintain a quality database. Unfortunately, even these world-renowned platforms sit on the shoulders of a very few dedicated individuals, relative to the number of users, who care about quality. There is no simple solution to the issues of quality, and they will persist until processes, procedures and momentum to resolve them are established.
Even in its earliest form PubChem has been referred to, tongue-in-cheek, as “the granddaddy of all free chemistry databases”. Certainly it presently holds the premier position in reputation, capabilities and connectivity, built on a database of chemical structures linked out to biological assay data, the PubMed database and an array of services to facilitate both the distribution of the data and the wealth of tools developed to support the system. The majority of databases discussed in this article now use two primary identifiers in their systems: the CAS registry number and a PubChem ID number. This alone indicates a shift in the balance between commercial and public compound repositories.
For now, PubChem remains focused on its initial intent to support the National Molecular Libraries Initiative. The data within PubChem have never formally been declared as Open Data but are assumed to be
available in that manner and thereby offer to scientists a valuable aggregate of data for the purpose of data mining and discovery.
At the time of writing the newest addition to the proliferating domain of public chemical compound databases is the ChemSpider Database [57], working to “Build a Structure Centric Community for Chemists”. This system presently offers a series of unique capabilities which might become trend-setting for present and future databases. As discussed earlier these include the user deposition of structures, real-time annotation and curation of data, management of analytical data and online transaction services. It is this author’s belief that such capabilities will likely become standard for most public chemical compound databases in the near future. These types of capabilities could help establish the newfound shift to Open Notebook Science and shift the bias away from the chemical biology databases (PubChem, DrugBank, HMDB and DSSTox), providing an environment in which non-life-science chemists, polymer chemists and materials scientists can also manage and research information of interest to them.
Public Compound Databases versus Commercial Databases
The creation, hosting and support of a curated chemical compound database with integrated content is an expensive enterprise. Historically these databases have been built through hundreds if not thousands of man-years of rigorous and exacting human effort and then, for some of the original founders in this domain, migrated onto computer systems. In developing these systems the host organizations have created sizeable revenues, and the estimated annual fees for accessing this information via just a few organizations likely exceed half a billion dollars.
With the advances in technology accompanying the internet boom, the hosting of large databases, the text-based searching of immense amounts of data and the ability to disseminate complex forms of graphical information via standard
protocols created an opportunity for disruptive offerings in this domain. They soon arrived.
The primary advantage of commercial databases is that they have been manually examined by skilled curators, who address the tedious task of checking data quality. Certainly the aggregation of data from multiple sources, both historical and modern, from multiple countries and languages and from sources not available electronically, is a significant enhancement over what is available via an internet search. The question is how long this will remain an advantage. Scientists working in new areas of science and domains of expertise generally draw on the most recent literature. Can you imagine a search about the semantic web being conducted just a few years ago? What about metabonomics or even genomics? Certain areas of the scientific literature, while still of high value, can become antiquated fairly quickly. With the new capabilities of internet-based searching and direct access to abstracts from the majority of publishers, even a rudimentary text search can expose articles previously unavailable except through an abstracting service. Search engines will increasingly be used for first-level searches specifically because they are simple to use, fast and free. With chemically searchable patents also available online [59,89] at no charge, the landscape for scientists searching for information is more open than ever. If there are data of interest to be located, then internet search engines will help to locate them.
The premier curated database offerings of today have an interesting if not challenging future ahead of them. Their value-added enhancements of the distributed data must be significant enough to warrant an investment in their services [90]. As expressed earlier, the quality of the data resulting from curation is significant, but this author questions the longevity of that distinguishing factor.
Automated recognition and conversion of chemical names to chemical structures can dramatically shift this domain, and efforts have already been
demonstrated in applications to patents and publications. Should the quality reach a sufficient standard then today’s publishers’ business models will definitely be at risk.
Conclusion
There is little doubt that the newfound availability of public chemical compound databases with their associated chemistry and biological data is enabling scientists to access information at less cost in both time and money. The increasing quantity of freely accessible and integrated data can speed decision making and bring clarity or, alternatively, inundate and saturate the user with poor-quality information. Scientists now have free access to structure-searchable patents, open and free access peer-reviewed publications and software tools for the manipulation of chemistry-related data. Members of the Open Source movement are developing toolkits, including visualization and data-mining tools, which, when coupled with the public chemistry databases reviewed here, will likely benefit the process of discovery. There are likely to be challenging times ahead in reconciling the needs of commercial database publishers with the proliferation of free databases, but this journey will not be halted by the objections of the commercial entities provided that legal copyrights are respected and the shift towards a more open community for science persists.
Acknowledgements
The author wishes to thank the following people: Stephen Bryant and Evan Bolton from the PubChem team; the IUPAC/National Institute of Standards and Technology InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi); David Wishart and Nelson Young (DrugBank and HMDB), Nicko Goncharoff (SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean
Ekins, Joerg Wegner and Alex Tropsha specifically), Ann Richard and Marti Wolf (DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust (CrystalEye), Martin Walker, Andrew Yeung and Dirk Beestra (Wikipedia Chemistry). I would also like to acknowledge the many contributors to the blogging discussions about Open and Free Access.
References
1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. (2007) 35(Database issue):D21-5.
2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nature Structural Biology (2003) 12: 980
3. Murray-Rust P: Chemistry for everyone. Nature (2008) 451, 648-651
• Provides a vision for the future of data distribution, access and integration across the worldwide web and espouses the need for Open Data policies and adoption of the Semantic Web.
4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web: Alphabetical: atabases_on_the_Web_%28Alphabetical_List%29 Classified: abases_on_the_Web_%28Classified_List%29
• An aggregation of chemistry databases, curated and annotated, to provide significantly more information than would be returned in a generic search of the internet.
5. Symyx: CTFile formats no-fee. (2008)
6. CAS: Chemical Abstracts Service, Columbus, OH, USA (2006).
7. InfoChem: InfoChem Gesellschaft für Chemische Information, München, Germany (2008).
8. Symyx: Santa Clara, California, USA (2008).
9. The University’s Mandate To Mandate Open Access: Harnad S, (2008) Mandate-Open-Access.html
10. Open Access: Wikipedia Article on Open Access. (2008)
11. The BOAI FAQ page: Frequently Asked Questions about the Budapest Open Access Initiative (2008),
12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry Databases: Drug Discovery News (2008), accepted for publication
13. Open Data: Wikipedia Article on Open Data. (2008)
14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of Chemistry in the Global Electronic Age ChemInform, 36(15), (2005)
• An excellent outline regarding the potential of combining open access and the semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this domain and outline in this article how data may be interconnected to the benefit of all chemists.
15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk-Interoperability in Chemical Informatics, J Chem Inf Model, (2006) 46 (3), 991-998.
•• The Blue Obelisk Movement ( is the name used by a group of scientists and developers supporting open source software development, consistent and complementary chemoinformatics research, open data, and open standards in Chemistry.
16. CODATA, The Committee on Data for Science and Technology: CODATA, Paris, France (2008).
17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006). content/uploads/ScienceCommons_Concept_Paper.pdf
18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge in a Digital Age (2008).
19. CAS Registry Numbers: Chemical Abstracts Service, Columbus, OH, USA (2008).
20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of chemical information in bioscience. BMC Bioinform (2005) 6:180-196.
•• Provides an overview of chemical information on the Internet and, while slightly outdated, is an important read in regards to the challenges and the vision of a Semantic Web for Chemistry.
21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data and the IUPAC International Chemical Identifier - InChI. American Chemical Society National Meeting, Washington, DC, USA (2005):CINF-60.
22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD, USA (2008).
•• PubChem is a large data aggregator (nearing 20 million structures) and offers relational searching capabilities via text, structure and substructure searching and
access to the entire dataset via download of SDF files. A series of services for the handling of chemistry databases are also available via the website.
23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008).
24. CambridgeSoft Corp, Cambridge, MA, USA (2008).
25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007) difficult.html.
26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of toxicity data on the Internet: Moving toward a flat world. Current Opinion in Drug Discovery & Development (2006) 9(3): 314-325.
•• The review discusses efforts to gather, curate and make publicly available toxicology-related chemical information. The specific discussions regarding the quality issues with public chemistry databases and efforts to produce clean quality databases are noteworthy.
27. DSSTox Quality Chemical Information Review Procedures: US Environmental Protection Agency, Washington, DC, USA (2008).
28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007) ember_2007.pdf
29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008) der.pdf
30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008)
31. Hacking PubChem: Why The Open Access Fight is Just the Beginning, Apodaca R, (2006), why-the-open-access-fight-is-just-the-beginning
32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf. Model. (2007) 47:1386-1394
•• The 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) was used as a model to determine whether automated annotation of screening hits in batch is feasible.
33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly Communication Blog: (2008)
34. Background of the PubChem/CAS Issue: (2008)
35. Baker M: Open-access chemistry databases evolving slowly but not surely: Nature Reviews Drug Discovery, (2006) 5:707-708
• A critical review of how far publicly available initiatives have to go to catch up with commercial offerings.
36. How big is the challenge of curation and what is the structure of Ginkgolide-B: Antony Williams (2008), is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html
37. DSSTox: Distributed Structure-Searchable Toxicity (DSSTox) Database: US Environmental Protection Agency, Washington, DC, USA (2006).
38. Richard AM and Williams CR (2002) Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research: New Frontiers, 499:27-52.
39. Richard AM: DSSTox web site launch: Improving public access to databases for building structure-toxicity prediction models, Preclinica, (2006) 2(2):103-108.
40. DSSTox Data Files:
41. eMolecules Online Service: eMolecules, Del Mar, CA, USA (2008).
42. Available Chemical Directory: Santa Clara, California, USA (2008).
43. ChemCats: Chemical Abstracts Service, Columbus, OH, USA (2006).
44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008).
45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe, (2007) 19(1):27-28
46. The NIST Chemistry WebBook: (2008)
47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute, Frederick/National Institutes of Health, Bethesda, MD, USA. (2008).
48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. (2006) 34:D668-72
• A detailed description of the intent, development and capabilities of the DrugBank database, one of the most respected public chemistry databases utilized by drug discovery scientists today.
49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database issue):D277-80
50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest, Nucl. Acids Res. (2008) 36: D344-D350.
51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6.
•• An update regarding the DrugBank database as it is released in its Version 2 state.
52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35: D521-6
53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB – Constructing A Chemical Information System With Open Source Components. J. Chem. Inf. Comput. Sci. (2003) 43:1733-1739.
• The defining article regarding the development of the NMRShiftDB database, defining the intention of the work, the development of the software components and a vision of how such a platform can lead to widespread dissemination of analytical data, at no charge, to the chemistry community.
54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And Structure Elucidation Support Through a Free Community-Built Web Database. Phytochemistry, (2004), 65:2711–2717.
55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C, Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf Model, (2008), Accepted for publication, doi: 10.1021/ci700363r.
56. CSEARCH and NMRShiftDB: Robien W (2007)
57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure-Centric Community for Chemists, Chemistry International (2007) 30(1): 30.
58. Open Notebook Science: Bradley JC, (2006) Drexel CoAs E-Learning Blog,
59. SureChem: San Francisco, CA, USA (2008)
60. Free Access Structure Searching of Patents: Williams AJ (2007), er.pdf
61. LASSO: Ligand Activity in Surface Similarity Order, SimBioSys Inc., Toronto, Canada.
62. Database of Useful Decoys:
63. WiChempedia: ChemSpider Blog (2007)
64. The Definition of a Blog:
65. ScienceBlogs:
66. Chemical BlogSpace:
67. The Definition of a Wiki:
68. Wikipedia Chemical Drugbox:
69. Wikipedia Chemical Infobox:
70. Taxol on Wikipedia:
71. AP7 on Wikipedia:
72. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature Precedings (2007) doi:10.1038/npre.2007.39.1,
73. UsefulChem Open Notebook Science: Bradley JC, Drexel University, and http://usefulchem-
74. Open Notebook Science: Neylon C, Science in the open, An openwetware blog on the challenges of open and connected science (2008) open-notebook-science/
75. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog (2008)
76. The IUPAC International Chemical Identifier: (2008)
77. The IUPAC International Chemical Identifier Software: (2008)
78. Royal Society of Chemistry: (2008)
79. Project Prospect: (2008) RSC Publishing,
80. Chemical Blogspace, (2008)
81. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A Web Server for Automating Metabolome Annotations. Pacific Symposium on Biocomputing, (2007) 12:145-156.
82. Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the Chemical Semantic Web through InChIfication. Org Biomol Chem, (2005) 3:1832-1834
83. Willighagen EL, O'Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487.
•• Discusses the use of userscripts to change the appearance of web pages by modifying web content on the fly to enable aggregation of information and computational results from different web resources into a single webpage. Indicative of the future of integration and the possibilities which exist to gather information from a multitude of resources and reformat and deliver to the consumer.
84. Chemical Structure Lookup Service: National Institutes of Health,
85. CrystalEye Crystallographic Database:
86. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog,
87. Feigenbaum L, Herman I, Hongsermeier T, Neumann E, Stephens S: The Semantic Web in Action, Scientific American Magazine
88. The Benefits of Crowdsourcing:
89. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM, Almaden Services Research, San Jose, CA 95120, USA,
90. Kemper K, Chemical Abstracts still developing ways to help its core – scientists, Columbus Business First, 1
Figures
Figure 1: The Compound Summary Page for Taxol in PubChem. Only page 1 is shown. (
Figure 2: The blog. Paul Docherty discusses complex syntheses and offers readers an opportunity to comment, analyze and provide feedback. Many articles are labeled with InChIKeys to allow indexing by search engines. (
Figure 3: The DrugBox for Taxol from Wikipedia (
Figure 4: An Example UsefulChem wiki page ( This UsefulChem wiki page shows a number of important content items: 1) Links to the prior failed experiment; 2) Links to the docking results that justified making this compound; 3) Full characterization (spectroscopy and photographs) of an isolated product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4) In the discussion section a question is posed by Professor Bradley to his student, and then answered, with the entire discussion history captured; 5) A complete, detailed and dated log of the steps taken by the student; 6) In the tag section, InChIs of every compound used are provided for indexing by search engines.
[Structure diagram of L-ascorbic acid]
InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1
CIWBSHSKHKDKBQ-JLAZNSOCBT
Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.
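Because an InChI string such as the one in Figure 5 is plain text built from slash-separated layers, it is easy to take apart for indexing or comparison. The following is a minimal sketch in Python, not part of the original article: the function name split_inchi and the returned dictionary layout are illustrative assumptions, and the code only splits the string on "/" without validating the chemistry. In the InChI format the first segment after "InChI=" is the version, the second is the molecular formula, and each later segment starts with a one-letter prefix (for example c for connectivity, h for hydrogens, and t, m and s for stereochemistry).

```python
# Illustrative sketch only: split_inchi and its return layout are assumptions,
# not an official InChI API. No chemical validation is performed.

ASCORBIC_ACID_INCHI = (
    "InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5"
    "/h2,5,7-10H,1H2/t2-,5+/m0/s1"
)


def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("string does not start with 'InChI='")

    segments = inchi[len(prefix):].split("/")
    if len(segments) < 2:
        raise ValueError("expected at least a version and a formula")

    # Keep layers as an ordered list of (prefix letter, content) pairs,
    # because a prefix letter may occur more than once in some InChIs.
    layers = [(seg[0], seg[1:]) for seg in segments[2:] if seg]

    return {"version": segments[0], "formula": segments[1], "layers": layers}


if __name__ == "__main__":
    parsed = split_inchi(ASCORBIC_ACID_INCHI)
    print(parsed["version"], parsed["formula"])
    for letter, content in parsed["layers"]:
        print(letter, content)
```

Keeping the layers as an ordered list rather than a dictionary is a deliberate choice in this sketch, so that repeated prefixes are not silently overwritten. A search engine or userscript would more likely index the fixed-length InChI Key, which is derived from the string by hashing and cannot be split in this way.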