Current opinions in drug discovery public compound databases
Public Chemical Compound Databases
Antony J. Williams
Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587
Corresponding Author: antony.williams@chemspider.com
PHONE: 919 341-8375
The internet has fast become the first port of call for all searches. The
increasing array of chemistry-related resources now available provides chemists a
direct path to the discovery of information, one previously accessed via library
services and limited to commercial and costly resources. The diversity of information
available online is expanding at a dramatic rate and a shift to publicly available
resources offers significant opportunities in terms of the benefit to science and
society. While the data available online do not generally meet the quality standards
available from manually curated sources there are efforts afoot to gather scientists
and “crowd source” an improvement in the quality of available data. This article will
discuss the types of public compound databases available online, provide a series of
example databases and focus on the benefits and disruptions associated with the
increased availability of such data and integrating technologies to data-mine the
available information.
Keywords: Public databases, chemical structure databases, Open Data,
chemoinformatics, data mining, internet chemistry, Wikis, blogs
Introduction
The internet is likely used on a daily basis by the majority of scientists. There
is little doubt that the web is the primary portal to query for information and data
and, when coupled with the intranet services for most companies, is the tool of
choice for most general searches. For many years the search for scientific-related
information would start at the library and commonly engage skilled professionals in
the domain of searching. These people would have a deep understanding of
navigating the plethora of databases and resources, using their own query
languages, and would perform searches using for-fee resources. While such skills
remain of value most scientists conduct the majority of their own searches and
certainly utilize their access to a no-cost, intuitive and expansive internet of
information. There has been a tremendous growth in scientific internet resources and
there are enormous opportunities provided by such facile access to chemistry
information and data.
Bioinformatics certainly established the trend of providing online access to
data and Chemistry, in many ways, is far behind. Open-access databases such as
GenBank [1] and the Protein Data Bank (PDB) [2] have been assisting biologists to
translate gene and protein sequences into biological relevance for over two decades.
It is possible that the difference in effort results from publishers in Chemistry
discouraging the open flow of data and information. This is true not only for scientific
articles but also for chemistry databases. With the changing expectations of society
in terms of freedom of access to information, and the efforts of many evangelists
and groups, a shift towards both free and open access (vide infra) chemistry-related
information is well underway and is likely to accelerate.
Murray-Rust envisages a world in which all scientific information is instantly
available [3•]. This emerging world of e-science or cyberscholarship seeks “to
develop the tools, content and social attitudes to support multidisciplinary,
collaborative science. Its immediate aims are to find ways of sharing information in a
form that is appropriate to all readers.” This article will discuss the work already
underway to support this noble and valid effort to provide enhanced public access to
Chemistry data and specifically focus on public chemical compound databases.
There are many tens of indexes of chemistry databases available online and
the reader is encouraged to perform one or more generic searches on “chemistry
databases” to retrieve a list of related information. The author's preferred source of
information is the Wiki hosted by Gary Wiggins [4•]. While the availability of freely
accessible information is clearly of value to scientists there are risks in terms of the
quality of information available. It is this quality issue which provides the
mainstream publishers, for the time being, a foothold in the domain of providing
value-added access to scientific information. That said, public compound databases
especially have become a disruptive force for certain commercial bodies and the
threat has caused significant duress. The potential impact on the business models of
publishers and the increased capabilities and diversity of data within public
compound databases will also be highlighted.
Public Chemistry Databases
There are many freely available chemical compound databases on the web
and they assume many different forms. They can simply be a collection of chemical
structures aggregated into a single file and made available, gratis, for people to
download and utilize as they see fit. These files are generally available in the form of
an SDF file [5] and can be downloaded and then imported to a database for
searching and viewing. There are literally hundreds of such files available online and
they are commonly available from chemical vendors in order to advertise their
catalog collections. These files generally contain the chemical identifiers in the form
of chemical names (systematic and trade) and registry numbers. The files can also
contain experimental or physical properties, file specific identifiers and pricing
information. Aggregators gather such files of chemical structures and
related information, assemble them into a single database and serve it up to the
public (some examples will be discussed later). Since the files are assembled in a
heterogeneous manner the resulting data are plagued with inconsistencies and data
quality issues. Such an approach to gathering and merging data is a far cry from that
taken by commercial database vendors who manually gather and curate data. Some
examples of these commercial organizations are CAS [6], InfoChem [8] and Symyx
[9].
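As a rough illustration of what importing such a file involves, the sketch below reads the text of an SDF file using only the Python standard library. It relies on the general SDF layout alone (a molfile block ending in `M  END`, data items introduced by `> <FIELD>` lines, and records terminated by `$$$$`). The function names, the field names `CAS_RN` and `PRICE_USD`, and the two abbreviated sample records are invented for the example and are not taken from any particular vendor file.

```python
from typing import Dict, Iterator, List

# Two toy records. The molfile blocks are abbreviated and only meant to
# show the layout; field names and values are made up for illustration.
SAMPLE_SDF = """\
water
  toy-example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <CAS_RN>
7732-18-5

> <PRICE_USD>
12.50

$$$$
formaldehyde
  toy-example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <CAS_RN>
50-00-0

$$$$
"""


def _parse_record(lines: List[str]) -> Dict[str, object]:
    """Split one record into its molfile block and its data items."""
    molblock: List[str] = []
    data: Dict[str, str] = {}
    i = 0
    # The structure block runs up to and including the "M  END" line.
    while i < len(lines):
        molblock.append(lines[i])
        i += 1
        if molblock[-1].startswith("M  END"):
            break
    # Each data item is a "> <FIELD>" header, value lines, then a blank line.
    while i < len(lines):
        line = lines[i]
        i += 1
        if not line.startswith(">"):
            continue
        start = line.find("<")
        end = line.find(">", start + 1)
        if start == -1 or end == -1:
            continue  # header without a field name; skipped in this sketch
        values: List[str] = []
        while i < len(lines) and lines[i].strip():
            values.append(lines[i])
            i += 1
        data[line[start + 1:end]] = "\n".join(values)
    name = molblock[0].strip() if molblock else ""
    return {"name": name, "molblock": "\n".join(molblock), "data": data}


def iter_sdf_records(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per SDF record found in the text."""
    record_lines: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record_lines:
                yield _parse_record(record_lines)
            record_lines = []
        else:
            record_lines.append(line)
    # Tolerate a final record that lacks its terminating delimiter.
    if any(line.strip() for line in record_lines):
        yield _parse_record(record_lines)
```

Calling `list(iter_sdf_records(SAMPLE_SDF))` gives one dictionary per record, which could then be loaded into whatever database is used for searching. Interpreting the structure block itself (atoms, bonds, stereochemistry) is deliberately left to a cheminformatics toolkit.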
While the commercial databases offer curated data there is certainly a price
barrier to accessing the information. A number of the free online resources are also
manually curated and, as will be discussed later, can offer as high a quality as the
commercial offerings. These resources are, however, constructed with a specific
focus in mind and therefore commonly number in the low thousands of structures
rather than the millions available in the larger online databases. Meanwhile, there
are a number of large online database resources offering access to valuable data and
knowledge. Some of these databases should be thought of as “linkbases”. For the
purpose of this article a linkbase is a repository of molecular connection tables
(chemical structures) linking out to various sources of data and associated
information. While it is impossible to be exhaustive within the confines of an article
of this nature an overview of a number of online public compound databases focusing
specifically on free access databases will be provided.
Confusion about the difference between Open Access (OA) and Free
Access (FA) persists [9], but both offer an opportunity to help advance
science by facilitating the sharing of data, information and knowledge with no
barriers of price or access. The first major international statement on open access
was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The
definition of Open Access is as follows: “By 'open access' to this literature, we mean
its free availability on the public internet, permitting any users to read, download,
copy, distribute, print, search, or link to the full texts of these articles, crawl them
for indexing, pass them as data to software, or use them for any other lawful
purpose, without financial, legal, or technical barriers other than those inseparable
from gaining access to the internet itself. The only constraint on reproduction and
distribution, and the only role for copyright in this domain, should be to give authors
control over the integrity of their work and the right to be properly acknowledged
and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition
has been suggested [12]: “Free access is access that removes price barriers but not
necessarily any permission barriers.” For the purpose of this article we are not only
interested in FA and OA but also Open Data.
Quoting from an online resource [13] “Open Data is a philosophy and
practice requiring that certain data are freely available to everyone, without
restrictions from copyright, patents or other mechanisms of control”. As yet there
are no commonly agreed upon definitions but as a result of Open Data evangelists
and groups progress is being made [14•,15••,16-18].
The majority of scientists cannot, however, differentiate between free access
and open access since both provide free access to information of value to them in
their work. In a similar way, the majority of scientists do not care about the
distinctions between Open and Closed data. They utilize free access public chemical
compound databases on an as-needed basis, derive value from the content and
move on, not concerned whether the data posted online are Open or Closed.
Chemical Abstracts Service (CAS) [5] and their CAS Registry Numbers (RNs) [19]
have played a dominant role in managing a curated registry of chemical entities and
related chemical and biological literature. Their proprietary registration system does
not link to chemical structures in the public domain and their business model is at
risk [20••,21].
Before reviewing examples of public compound databases we should review
the issues of data quality. All content databases containing chemical compounds
contain errors. These errors can arise for a series of reasons including errors in
transcription, historical errors (a compound was “correct” when entered but later re-
characterized), issues with graphical representation and a plethora of other reasons.
The quality of chemical information in the public domain is generally quite low. This
does not mean that the data are not of value but that care needs to be taken in
deciding whether to treat a provider as an authority. There is, of course, no central body
responsible for the quality of data in the public domain. Databases of chemical
structure information such as PubChem [22••], ChemIDplus [23] and ChemFinder
[24] are commonly looked upon as authorities in terms of reliable information.
However, these sources are also aggregators of information and are at risk of
perpetuating errors from the original public data and depositions. Errors in
structure-identifier pairs are common [25] and inaccurate structure representations,
specifically in regards to stereochemistry, proliferate across many databases. A
definitive description of the challenges regarding quality in public domain databases,
and the rigorous processes required to aggregate quality data was provided by
Richards et al. [26••]. During their assembly of the EPA DSSTox databases they
assembled the chemical structures, chemical names and CAS Registry Numbers for
over 8000 chemicals from numerous toxicity databases. The data they extracted
were carefully curated and validated using multiple public information sources [27].
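One inexpensive automated check that can accompany this kind of curation is validating the check digit built into every CAS Registry Number: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is an illustration only and is not a description of the procedure used for DSSTox; the function name is hypothetical. A registry number that passes is merely well formed, so the check catches transcription slips but says nothing about whether the number is attached to the right structure.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N: two to seven digits,
# two digits, then a single check digit.
_CAS_RN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(value: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a correct check digit."""
    match = _CAS_RN_PATTERN.match(value.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit
```

For water, 7732-18-5, the weighted sum is 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, whose last digit matches the check digit 5.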
In regards to the quality of the chemical information presented with bioassay
data on PubChem, Richards cautioned 'user beware' [26]. Since the chemical
structure content is deposited without additional review the user is at risk. Errors in
chemical names are common, and multiple structure errors have been identified.
Richards encourages users to make informed judgments on the quality of data based
on prior knowledge of the data submitter. The responsibility for the quality of the
PubChem database therefore rests with the depositors primarily and, as many of
these are commercial chemical vendors, their focus on quality is far less than the
stringent expectations of the community. The proliferation of errors from PubChem
into other databases has been identified [28] and a definitive effort to cleanse the
errors from the data, be it in regards to chemical structures, names or identifiers, is
going to be required. The efforts of groups such as the ChemSpider team with their
online curation [29] offer an opportunity to dramatically improve the quality of the
data through both a roboticized cleansing approach and manual examination by
many users. Efforts such as these should help reduce errors and result in the
proliferation of more validated information.
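A minimal sketch of one such automated cleansing step is given below: it groups aggregated records by identifier and flags any identifier that different sources attach to more than one structure, so that a human curator can review the disagreement. It assumes each structure has already been reduced to a standardized key (the SMILES strings here stand in for that key), and the source names and the deliberately wrong vendor record are made up for illustration. It does not represent the actual ChemSpider or PubChem pipelines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (source, identifier, structure_key); illustrative records only.
Record = Tuple[str, str, str]


def find_identifier_conflicts(records: Iterable[Record]) -> Dict[str, Dict[str, List[str]]]:
    """Return identifiers linked to more than one structure, with the sources for each."""
    seen: Dict[str, Dict[str, set]] = defaultdict(lambda: defaultdict(set))
    for source, identifier, structure_key in records:
        seen[identifier][structure_key].add(source)
    return {
        identifier: {key: sorted(sources) for key, sources in by_structure.items()}
        for identifier, by_structure in seen.items()
        if len(by_structure) > 1
    }


EXAMPLE_RECORDS: List[Record] = [
    ("vendor_a", "7732-18-5", "O"),      # water
    ("vendor_b", "7732-18-5", "OO"),     # wrong: hydrogen peroxide filed under water's RN
    ("curated_set", "7732-18-5", "O"),
    ("vendor_a", "50-00-0", "C=O"),      # formaldehyde, no disagreement
]

conflicts = find_identifier_conflicts(EXAMPLE_RECORDS)
```

Here `conflicts` contains only the entry for 7732-18-5, listing which sources support each competing structure. Deciding which structure is correct still requires an authority or a curator; the sketch only locates the disagreements.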
Public Compound Databases
PubChem
The highest profile online database is certainly PubChem [22], launched by
NIH in 2004 to support the New Pathways to Discovery component of their roadmap
initiative [30]. PubChem archives and organizes information about the biological
activities of chemical compounds into a comprehensive biomedical database and is
the informatics backbone for the initiative, intended to empower the scientific
community to use small molecule chemical compounds in their research.
PubChem consists of three connected databases: PubChem Compound, PubChem
Substance and PubChem BioAssay. PubChem Substance contains the records
submitted by depositors, who include publishers, chemical vendors, commercial
databases and other sources. PubChem Compound contains the 18 million unique
structures derived from those depositions, one record per individual compound
(see Figure 1), and provides biological property information for each. PubChem
BioAssay contains information about bioassays using specific terms pertinent to
the bioassay.
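The relationship between deposited substances and unique compounds can be pictured with a small sketch: many depositor records that share the same standardized structure collapse into one compound entry. The class, the field names and the toy records below are assumptions made for illustration, and the structure key is taken as given; how PubChem actually standardizes structures is not covered here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SubstanceRecord:
    """One deposited record, as supplied by a single depositor (hypothetical fields)."""
    substance_id: int
    depositor: str
    structure_key: str  # assumed to be an already standardized structure string


def group_into_compounds(substances: Iterable[SubstanceRecord]) -> Dict[str, List[SubstanceRecord]]:
    """Map each unique structure key to every substance record that shares it."""
    compounds: Dict[str, List[SubstanceRecord]] = defaultdict(list)
    for substance in substances:
        compounds[substance.structure_key].append(substance)
    return dict(compounds)


deposits = [
    SubstanceRecord(1, "vendor_a", "CCO"),
    SubstanceRecord(2, "vendor_b", "CCO"),
    SubstanceRecord(3, "journal_x", "C=O"),
]
compounds = group_into_compounds(deposits)
```

Three deposited substances yield two compounds in this toy case, which is the same kind of reduction that separates the substance count from the unique structure count in a database of this design.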
PubChem can be searched by alphanumeric text variables such as names of
chemicals, property ranges or by structure, substructure or structural similarity. As
of December 2007 its content is approaching 38.7 million substances and 18.4
million unique structures. Such a source of data opens up new possibilities [31] in
regards to data mining and extraction. Zhou et al. [32•] concluded that the system
has an important role as a central repository for chemical vendors and content
providers enabling evaluation of commercial compound libraries and saving
biomedical researchers from the work associated with gathering and searching
commercial databases. They identified that over 35% of the 5 million structures from
chemical vendors or screening centers found in the PubChem database currently are
not present in the CAS registry.
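Structural similarity searching of the kind mentioned above is commonly implemented by comparing structural fingerprints with the Tanimoto coefficient, the number of shared features divided by the number of features present in either structure. The sketch below assumes fingerprints are already available as sets of integer feature identifiers; the fingerprint values, names and threshold are invented for the example, and no claim is made that any particular database uses exactly this scheme.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two feature sets: |A and B| / |A or B|."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(a | b)


def rank_by_similarity(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query, fingerprint)) for name, fingerprint in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up fingerprints: each integer stands for one structural feature.
query_fp = frozenset({1, 2, 3, 4})
toy_library = {
    "compound_a": frozenset({1, 2, 3, 4}),
    "compound_b": frozenset({1, 2, 5}),
    "compound_c": frozenset({7, 8}),
}
hits = rank_by_similarity(query_fp, toy_library, threshold=0.3)
```

With these toy values `compound_a` scores 1.0, `compound_b` scores 2/5 = 0.4 and `compound_c` falls below the threshold. The quality of such a search depends far more on how the fingerprints are generated than on the ranking step shown here.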
PubChem continues to grow in stature, content and capability. The bioassay
data resulting from the NIH Roadmap initiative are likely to continue to grow and
PubChem will assume a prominent role in distributing the data in a standard format.
Despite the obvious value of PubChem the platform has caused quite a furor in
recent years including debates regarding the position of CAS relative to the resource.
The reader is referred elsewhere for commentaries [33,34]. Others have commented
on the quality of the data content within PubChem. Shoichet [35••] believes that the
screening data are less rigorous than those in peer-reviewed articles, and contain
many false positives. Shoichet worries that chemists who use PubChem may be sent
on a wild goose chase. Numerous problems arise from the quality of submissions
from various data sources. This contamination has produced thousands of errors in
the structure-identifier associations, which can lead to the retrieval of
incorrect chemical structures. It is also common to have multiple representations of
a single structure due to incomplete or total lack of stereochemistry for a molecule
[36].
DSSTox
The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project
[38,39] provides a series of documented, standardized and fully structure-annotated
files of toxicity information [40]. The initial intention for the project was to deliver a
public central repository of toxicity information to allow for flexible analogue
searching, SAR model development and the building of chemical relational
databases. In order to ensure maximum uptake by the public and allow users to
integrate the data into their own systems the DSSTox project adopted the use of the
common standard file format (SDF) to include chemical structure, text and property
information. The DSSTox databases were also deployed online to provide free public
access to the data files without the dependency on a desktop software package for
querying and managing the data files. The overall aims of the project, to deeply
integrate chemical structure information with existing toxicity data and to facilitate
interrogation of the data have been achieved. The DSSTox datasets are among the
most highly curated public datasets available and likely the reference standard in
publicly available structure-based toxicity data.
eMolecules
eMolecules [41] offers a free online database of almost 8 million unique
chemical structures. The database is assembled from data supplied by over 150
suppliers and provides a path to identifying a vendor for a particular chemical
compound. By providing access to compounds for purchase they are providing a free
access online service similar to those of commercial databases such as Symyx's
Available Chemicals Directory [42], CAS’ ChemCats [43] and CambridgeSoft’s
ChemACX [44] as well as a number of other providers. The system offers access to
more than 4 million commercially available screening compounds and many tens of
thousands of building blocks and intermediates. Their database was recently
enhanced by providing access to NMR, MS and IR spectra from Wiley-VCH [45] for
over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also
provides links to many sources of data for spectra, physical properties and biological
data including the NIST WebBook [46], the National Cancer Institute [47],
DrugBank [48•] and PubChem.
eMolecules is presently fairly limited in its scope and primarily offers a very
useful path to the purchase of chemicals and links to the more popular government
databases. Nevertheless, the site is popular with chemists who are searching for
chemicals and the interface is intuitive and easy to use, a key element in attracting
users.
DrugBank
DrugBank [48•] is a manually curated resource assembled from information
collected from a series of other public domain databases and enhanced with
additional data generated within the laboratories of the hosts. The database
aggregates both bioinformatics and cheminformatics data and combines detailed
drug data with comprehensive drug target (i.e. protein) information. The database is
hosted by the University of Alberta, Canada. Version 1 of the database, released in
2006, contained >4100 drug entries including >800 FDA approved small molecule
and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug
target sequences were linked to these drug entries. Each record in the database,
known as a DrugCard, has >80 data fields. The information is split into
drug/chemical data and drug target or protein data and many data fields are linked
to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The
database supports extensive text, sequence, chemical structure and relational query
searches.
DrugBank has been used to facilitate in silico drug target discovery, drug
design, drug docking or screening, drug metabolism prediction, drug interaction
prediction and general pharmaceutical education. Version 2.0 of
DrugBank [51••] was released in January of this year with over 800 new drug entries, and
each DrugCard entry was extended to include over 100 data fields, with half of the
information being devoted to drug/chemical data and the other half devoted to
pharmacological, pharmacogenomic and molecular biological data. They have started
to add experimental spectral data (NMR and MS specifically), and have expanded the
coverage to nutraceuticals and herbal medicines.
The DrugBank team also hosts the Human Metabolome Database (HMDB)
[52], a database containing information about small molecule metabolites found in
the human body. The database is used by scientists working in the areas of
metabolomics, clinical chemistry and biomarker discovery. The database currently
contains nearly 3000 metabolite entries and each MetaboCard entry contains more
than 90 data fields devoted to chemical, clinical, enzymatic and biochemical
data.
NMRShiftDB
The NMRShiftDB is an open source collection of chemical structures and their
associated NMR shift assignments [53•,54]. The database is generated as a result of
contributions by the public and currently contains over 20,000 structures with
>220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent
to registered reviewers for evaluation. A significant part of NMRShiftDB was initially
assembled from in-house databases from collaborating institutions and was entered
unchecked. This called for external checks of the data based on independent
databases and resources and these have now been carried out by two specific groups
[56,57]. Williams et al. [56] performed a cursory examination of the structural
diversity within the database and concluded that the data represented a statistically
relevant set to use in an evaluation of predictive accuracy and demonstrated that the
quality of the data is rather impressive. This effort shows the advantages of
providing a set of Open Data for reuse and examination and the benefits of having
many scientists examine, validate and correct the data. The same benefit is possible for any
database that allows its users to qualify, annotate and correct its data.
ChemSpider
ChemSpider was released to the public in March 2007 with the intention of
“building a structure centric community for chemists”. ChemSpider has grown into a
resource containing almost 18 million unique chemical structures and recently shared
its data with PubChem providing about 7 million unique compounds. The data
sources have been gathered from chemical vendors as well as commercial database
vendors and publishers and members of the Open Notebook Science community.
ChemSpider has also integrated the SureChem patent database [59] collection of
structures to facilitate links [60] between the systems. The database can be queried
using structure/substructure searching and alphanumeric text searching of both
intrinsic as well as predicted molecular properties. They have recently added virtual
screening results using the LASSO similarity search tool [61] to screen the
ChemSpider database against all 40 target families from the Database of Useful
Decoys (DUD) dataset.
ChemSpider has enabled unique capabilities relative to the primary public
chemistry databases. These include real time curation of the data, association of
analytical data with chemical structures, real-time deposition of single or batch
chemical structures (including with activity data) and transaction-based predictions
of physicochemical data. The ChemSpider developers have made available a series of
web services to allow integration to the system for the purpose of searching the
system as well as generation of InChI identifiers and conversion routines.
The system also integrates text-based searching of Open Access articles and
presently searches over 50,000 OA Chemistry articles, soon to be extended to 150,000
articles. The index is expected to increase dramatically as they extract chemical
names from OA articles and convert the names to chemical structures using name to
structure conversion algorithms. These chemical structures will be deposited back to
the ChemSpider database thereby facilitating structure and substructure searching in
concert with text-based searching.
ChemSpider has a focus on, and commitment to, community curation. The
social community aspects of the system demonstrate the potential of this approach.
The team have committed to the release of a wiki-like environment for further
annotation of the chemical structures in the database, a project they term
WiChempedia. They will utilize both available Wikipedia content and deposited
content from users to enable the ongoing development of community curated
chemistry.
Other Databases
The list of databases and resources reviewed above is only representative of
the type of information available online. Other highly regarded databases frequented
by this author include the Chemical Structure Lookup Service (with over 36 million
unique structures) [64], CrystalEye [65], KEGG [49] and ChEBI [50]. There are also
many other resources available and the reader is referred to one of the many
indexes of such databases available on the internet to identify potential resources of
interest [4,66].
Public Compound Databases versus Commercial Databases
The creation, hosting and support of a curated chemical compound database
with integrated content is an expensive enterprise. Historically these databases have
been built as a result of hundreds if not thousands of man years of rigorous and
exacting human effort and then, for some of the original founders in this domain,
migrated onto computer systems. In the development of these systems host
organizations have created sizeable revenues, and estimated annual fees for
accessing this information via just a few organizations likely exceed half a billion
dollars. With the advances in technology accompanying the internet boom, the
hosting of large databases, the text-based searching of immense amounts of data
and the ability to disseminate complex forms of graphical information via standard
protocols created an opportunity for disruptive offerings in this domain.
They soon arrived.
The primary advantage of commercial databases is that they have been
manually examined by skilled curators, addressing the tedious task of data quality
checking. Certainly the aggregation of data from multiple sources, both historical and
modern, from multiple countries and languages and from sources not available
electronically is a significant enhancement over what is available via an internet
search. The question is how long this will remain an advantage. Scientists working
in new areas of science and domains of expertise reflect on the most recent
literature in general. Can you imagine a search about the semantic web being
conducted just a few years ago? What about metabonomics or even genomics?
Certain areas of the scientific literature, while still of high value, can become
antiquated fairly quickly. With the new capabilities of internet-based searching and
direct access to abstracts for the majority of publishers even a rudimentary text
search can expose articles previously unavailable except through an abstracting
service. Search engines will increasingly be utilized for first level searches specifically
because they are simple to use, they are fast and they are free. With chemically
searchable patents also available online [59,67], at no charge, the landscape for
scientists searching for information is more open than ever. If there are data of
interest to be located then internet search engines will help to locate them.
The premier curated database offerings of today have an interesting if not
challenging future ahead of them. Their value-added enhancements of the
distributed data must be significant enough to warrant an investment in their
services [68]. As expressed earlier the quality of the data resulting from curation is
significant but this author questions the longevity of that distinguishing factor
moving forward. Roboticized recognition and conversion of chemical names to
chemical structures can dramatically shift this domain and efforts have already been
demonstrated in applications to patents and publications. Should the quality reach a
sufficient standard then today’s publishers’ business models will definitely be at risk.
The Future of Public Compound Databases
The semantic web [69] is already offering us the chance to connect,
simultaneously interrogate and mash up the results of searching multiple public
compound databases. An enormous diversity of data is already
available for interrogation by the public and continues to expand daily. This author
remains concerned with the very real quality issues associated with public data sets.
While the utopian dream of no errors in freely available data cannot be met the push
towards more Open Data without consideration being given to both manual and
robotic curation could be risky to those using the data. Real-time curation of data
within public compound databases is feasible [29] and certainly Wikipedia is a model
of crowd sourcing [71] to build, curate and maintain a quality database.
Unfortunately, even these world-renowned platforms actually sit on the shoulders of
a very few dedicated individuals, relative to the users, who care about quality. There
is no simple solution to the issues of quality and they will persist for the foreseeable
future until processes, procedures and momentum to resolve the issues are
established.
Even in its earliest form PubChem has been referred to, tongue-in-cheek, as
“the granddaddy of all free chemistry databases”. Certainly it presently holds the
premier position in reputation, capabilities and connectivities built on a database of
chemical structures and linked out to biological assay data, the PubMed database
and an array of services to facilitate both the distribution of the data and the wealth
of tools developed to support the system. The majority of databases discussed in this
article now use two primary identifiers in their systems: the CAS Registry Number
and a PubChem ID number. This alone indicates a shift toward parity between commercial
and public compound repositories. For now, PubChem remains focused on its
initial intent to support the National Molecular Libraries Initiative. The data within
PubChem have never formally been declared as Open Data but are assumed to be
available in that manner and thereby offer to scientists a valuable aggregate of data
for the purpose of data mining and discovery.
At the time of writing the newest addition to the proliferating domain of public
chemical compound databases is the ChemSpider Database [57], working to “Build a
Structure Centric Community for Chemists”. This system presently offers a series of
unique capabilities which might become trend-setting for present and future
databases. As discussed earlier these include the user deposition of structures, real-
time annotation and curation of data, management of analytical data and online
transaction services. It is this author’s belief that such capabilities will likely become
standard for the majority of public chemical compound databases in the near
future. These types of capabilities could help establish the newfound shift to Open
Notebook Science and shift the bias away from the chemical biology databases (PubChem,
DrugBank, HMDB and DSSTox), providing an environment in which non-life science
chemists, polymer chemists and material scientists can manage and research
information of interest to them.
The WikiSphere, Blogosphere and Internet as a Public Compound Database.
Wikis and blogs are now common terms for the majority of users of the
worldwide web and both are fast becoming chosen platforms for the exchange of
information between many scientists, not only as tools within their own research
groups but also with the public in general. A blog, or weblog, is a website
where entries are written in chronological order and generally provide commentary
or news on a particular subject [71]. A typical blog combines text, images and links
to other blogs, web pages, and other media related to its topic. The original blog
posting remains untouched by the commenter and readers are free to add their
comments, generally in a mediated manner where the blog host retains control over
the postings. An example screenshot from a chemistry-based blog hosted with the
intention of examining and discussing organic syntheses is shown in Figure 3. The
number of chemistry-related blogs continues to grow dramatically and there have
been efforts to provide a unified view into some of these [72,73].
A wiki is a type of computer software that allows users to easily create, edit
and link web pages and enables documents to be written collaboratively, in a simple
markup language using a web browser; it is essentially a database for creating,
browsing and searching information [74]. Wikipedia is certainly the best known
today, though there are many others already online or used within the confines of
an organization to manage content. There are active groups supporting the
development of chemistry on Wikipedia and there are now thousands of pages
describing small organic molecules, inorganics, organometallics, polymers and even
large biomolecules. Focusing on small molecules in general, each one has a Drug Box
[75] or a Chemical infobox [76]. A drug box provides identifier information
(chemical name, registry number, and so on) and commonly the identifiers link out
to a related resource. Chemical data, pharmacokinetic data and therapeutic
considerations can also be listed. At present there are approximately 8000 articles
with a chembox or drugbox [3], with between 500 and 1000 articles added since May.
The detailed information offered on Wikipedia regarding a particular chemical or drug
can be excellent [77], see Figure 2, or weak [78]. There are many dedicated
supporters and contributors to the quality of the online resource. Drugboxes and
chemboxes have been shown to contain errors, but the advantage of a wiki is that
corrections can be made within a few keystrokes and the quality is immediately
enhanced. The opposite is also true and vandalism can occur. This community
curation process makes Wikipedia a very important online chemistry resource whose
impact will only expand with time.
Wikis have recently been used as the basis of Open Notebook Science [79].
The UsefulChem Wiki [80] includes a series of experimental pages commonly linked
to related blog pages as shown in Figure 4. The Open Notebook Science
movement appears to be gaining momentum with the support of vocal
advocates, such as Neylon [81], Murray-Rust [82] and many others.
While both wikis and blogs are very valuable for information exchange, the
text and image exchange they enable falls well short of many chemists’
additional need to search by chemical structures,
reactions and data. Neither wikis nor blogs are, as yet, enabled for
structure and substructure searching and they therefore remain isolated, in general,
from cheminformatics-based search procedures. One of the key developments which
has already facilitated the Semantic Web for chemistry is the InChI [83], the
International Chemical Identifier. The InChI string is a textual identifier for chemical
substances designed to provide a standard and human-readable way to encode
molecular information (see Figure 5) and to facilitate the search for such information
in databases and on the web. The InChI string, unfortunately, has only partly delivered on
the promise of facilitating web-based searches, because search engines break InChI
character strings unpredictably. To resolve this issue the InChIKey was
introduced. The condensed, 25-character InChIKey is a hashed version of the full InChI
and is not human-readable. The InChIKey corresponding to the InChI string of L-ascorbic
acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is that it
enables web searches, but a lookup table to identify the associated structure, or reference
to the original InChI string, is necessary [84]. While tens of millions of InChI strings
and keys have been populated into databases, their use is still in its infancy.
Publishers have started to embed InChIs into their articles and the Royal Society of
Chemistry [85] is presently pioneering a new publishing model, Project Prospect [86],
including InChI to demonstrate movement toward the semantic web for chemistry.
Bloggers have started to use InChI strings and keys in their postings, and wiki
pages are being InChI-enabled to help the web become structure searchable. A
central lookup facility for published InChI strings will be necessary in
order to facilitate substructure searching of the web but this capability is likely to be
developed in the near future. Willighagen already aggregates InChI strings onto a
blog [87].
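The difference between the two identifiers under web indexing can be illustrated with a minimal Python sketch. It assumes, purely for illustration, an indexer that splits text on every character that is not a letter or digit; real search engines differ in detail. The names `naive_tokens`, `resolve` and `KEY_TO_INCHI` are hypothetical and introduced only for this example, and the only data used are the InChI string and InChIKey for L-ascorbic acid given in Figure 5.

```python
import re

# Values taken from Figure 5 (L-ascorbic acid).
INCHI = "InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1"
INCHIKEY = "CIWBSHSKHKDKBQ-JLAZNSOCBT"


def naive_tokens(text):
    """Split text on anything that is not a letter or digit.

    This is an assumed stand-in for a simple indexer, not the behaviour
    of any particular search engine.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


# Hypothetical lookup table. The key is a hash, so the structure cannot be
# recovered from the key itself and has to be looked up.
KEY_TO_INCHI = {INCHIKEY: INCHI}


def resolve(key):
    """Return the full InChI string registered for a key, or None if unknown."""
    return KEY_TO_INCHI.get(key)


if __name__ == "__main__":
    print(len(naive_tokens(INCHI)), "fragments from the InChI string")
    print(naive_tokens(INCHIKEY))
    print(resolve(INCHIKEY))
```

Under this assumption the InChI string is scattered into many short fragments, most of them meaningless on their own, whereas the key survives as two long tokens. The dictionary stands in for the kind of lookup facility discussed above.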
BioSpider [88] allows users to type in almost any kind of biological or
chemical identifier (protein/gene name, sequence, accession number, chemical
name, brand name, SMILES string, InChI string, CAS number, etc.) and delivers a
report about the biomolecule. BioSpider uses a web-crawler to scan through dozens
of public databases and employs a variety of specially developed text mining tools
and locally developed prediction tools to find, extract and assemble data for its
reports. A summary includes physico-chemical parameters, images, models, data
files, descriptions and predictions concerning the query molecule.
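How a single query box might distinguish between some of these identifier types can be sketched as follows. This is not BioSpider's code: the function names and regular expressions are illustrative assumptions, the InChIKey pattern follows the 25-character format quoted in this article, and the CAS test relies on the check-digit rule for CAS Registry Numbers. Anything not recognized, including SMILES strings and sequences, simply falls through to "name".

```python
import re

# Illustrative patterns only; they are assumptions, not BioSpider's rules.
INCHI_RE = re.compile(r"^InChI=1S?/\S+$")
# 14 letters, a hyphen, 10 letters: the 25-character key format used in this article.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas):
    """Validate a CAS number: weight the digits 1, 2, 3, ... from the right
    (excluding the check digit) and compare the sum modulo 10 with the check digit."""
    match = CAS_RE.match(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify(query):
    """Guess the identifier type of a query string."""
    q = query.strip()
    if INCHI_RE.match(q):
        return "inchi"
    if INCHIKEY_RE.match(q):
        return "inchikey"
    if cas_checksum_ok(q):
        return "cas"
    return "name"  # fallback: treat everything else as a name


if __name__ == "__main__":
    for q in ["50-81-7", "CIWBSHSKHKDKBQ-JLAZNSOCBT", "L-ascorbic acid"]:
        print(q, "->", classify(q))
```

A real service would need further tests (for SMILES, sequences and accession numbers) and would have to cope with ambiguous input; the sketch only shows that some identifiers carry enough internal structure to be recognized before any database is consulted.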
An increasing number of public databases will continue to become available
but the challenge, even now, is how to integrate and access the data. The
implementation of InChIs for web-based searching [89], and the delivery of
userscripts to aggregate information and computational results from different web
resources [90] are bringing together internet resources to appear as a single
monolithic public chemistry database. Willighagen et al. [90] use userscripts to
enrich biology and chemistry related web resources by incorporating or linking to
other computational or data sources on the web. They showed how information from
web pages can be used to link to, search, and process information in other resources
thereby allowing scientists to select and incorporate the appropriate web resources
to enhance their productivity. Such tools connecting open chemistry databases and
user web pages are an ideal path to more highly integrated information sharing.
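The kind of enrichment that such userscripts perform can be sketched in a few lines. Browser userscripts are written in JavaScript; the Python below only illustrates the underlying idea of locating identifiers in page text and linking them to another resource. The resolver address `https://example.org/lookup?key=` is a placeholder and not a real service, `enrich` is a hypothetical helper, and the pattern assumes the 25-character InChIKey format quoted in this article.

```python
import re
from urllib.parse import quote

# Placeholder resolver endpoint (an assumption); substitute a real lookup service.
RESOLVER = "https://example.org/lookup?key="

# InChIKey-shaped tokens: 14 letters, a hyphen, 10 letters.
INCHIKEY_IN_TEXT = re.compile(r"\b[A-Z]{14}-[A-Z]{10}\b")


def enrich(text, resolver=RESOLVER):
    """Wrap every InChIKey-shaped token in an HTML link to the resolver."""
    def to_link(match):
        key = match.group(0)
        return '<a href="{}{}">{}</a>'.format(resolver, quote(key), key)

    return INCHIKEY_IN_TEXT.sub(to_link, text)


if __name__ == "__main__":
    page = "Vitamin C (CIWBSHSKHKDKBQ-JLAZNSOCBT) was used as a standard."
    print(enrich(page))
```

Text that contains no keys passes through unchanged, so a filter of this kind can be applied to any page without altering its other content.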
Conclusion
There is little doubt that the newfound availability of public chemical
compound databases with their associated chemistry and biological data is enabling
scientists to access information at lower cost in both time and money. The increasing
quantity of freely accessible and integrated data can speed decision making and
bring clarity or, alternatively, inundate the user with poor-quality
information. Scientists now have free access to structure-searchable patents, open
and free access peer-reviewed publications and software tools for the manipulation
of chemistry-related data. Members of the Open Source movement are developing
toolkits, including visualization and data-mining tools, which, when coupled with the
public chemistry databases reviewed here, will likely benefit the process of discovery.
There are likely to be challenging times ahead in reconciling the needs of
commercial database publishers with the proliferation of free databases, but this
journey will not be halted by the objections of the commercial entities provided that
legal copyrights are respected and the shift towards a more open community for
science persists.
Acknowledgements
The author wishes to thank the following people: Stephen Bryant and Evan Bolton
from the PubChem team, the IUPAC/National Institute of Standards and Technology
InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi),
David Wishart and Nelson Young (DrugBank and HMDB), Nicko Goncharoff
(SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup
Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean
Ekins, Joerg Wegner and Alex Tropsha specifically), Ann Richard and Marti Wolf
(DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust
(CrystalEye), Martin Walker, Andrew Yeung and Dirk Beetstra (Wikipedia Chemistry).
I would also like to acknowledge the many contributors to the blogging discussions
about Open and Free Access.
References
1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic
Acids Res. (2007) 35(Database issue):D21-5.
2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data
Bank. Nature Structural Biology (2003) 12: 980
3. Murray-Rust P: Chemistry for everyone. Nature (2008) 451:648-651
•Provides a vision for the future of data distribution, access and integration across
the worldwide web and espouses the need for Open Data policies and adoption of the
Semantic Web.
4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web:
Alphabetical:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_D
atabases_on_the_Web_%28Alphabetical_List%29
Classified:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_Dat
abases_on_the_Web_%28Classified_List%29
•An aggregation of chemistry databases, curated and annotated, to provide
significantly more information than would be returned in a generic search of the
internet.
5. Symyx: CTFile formats no-fee. (2008)
http://www.mdli.com/downloads/public/ctfile/ctfile.jsp
6. CAS: Chemical Abstracts Service, Columbus, OH, USA (2006).
http://www.cas.org/
7. InfoChem: InfoChem Gesellschaft für Chemische Information, München,
Germany (2008). http://infochem.de/
8. Symyx: Santa Clara, California, USA (2008). http://www.symyx.com/
9. The University’s Mandate To Mandate Open Access: Harnad S, (2008)
http://openaccess.eprints.org/index.php?/archives/358-The-Universitys-Mandate-To-
Mandate-Open-Access.html
10. Open Access: Wikipedia Article on Open Access. (2008)
http://en.wikipedia.org/wiki/Open_access
11. The BOAI FAQ page: Frequently Asked Questions about the Budapest Open
Access Initiative (2008), http://www.earlham.edu/~peters/fos/boaifaq.htm
12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry
Databases: Drug Discovery News (2008), accepted for publication
13. Open Data: Wikipedia Article on Open Data. (2008)
http://en.wikipedia.org/wiki/Open_data
14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of
Chemistry in the Global Electronic Age ChemInform, 36(15), (2005)
• An excellent outline regarding the potential of combining open access and the
semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this
domain and outline in this article how data may be interconnected to the benefit of
all chemists.
15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C,
Wegner J, Willighagen EL: The Blue Obelisk-Interoperability in Chemical
Informatics, J Chem Inf Model, (2006) 46 (3), 991-998.
••The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a
group of scientists and developers supporting open source software development,
consistent and complementary chemoinformatics research, open data, and open
standards in Chemistry.
16. CODATA, The Committee on Data for Science and Technology: CODATA,
Paris, France (2008). http://www.codata.org/
17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006).
http://sciencecommons.org/wp-
content/uploads/ScienceCommons_Concept_Paper.pdf
18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge
in a Digital Age (2008). http://www.okfn.org/
19. CAS Registry Numbers: Chemical Abstracts Service, Columbus, OH, USA
(2008). http://www.cas.org/expertise/cascontent/registry/regsys.html
20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of
chemical information in bioscience. BMC Bioinform (2005) 6:180-196.
•• Provides an overview of chemical information on the Internet and, while slightly
outdated, is an important read in regards to the challenges and the vision of a
Semantic Web for Chemistry.
21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data
and the IUPAC International Chemical Identifier - InChI. American Chemical
Society National Meeting, Washington, DC, USA (2005):CINF-60.
22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD,
USA (2008). http://pubchem.ncbi.nlm.nih.gov
•• PubChem is a large data aggregator (nearing 20 million structures) and offers
relational searching capabilities via text, structure and substructure searching and
access to the entire dataset via download of SDF files. A series of services for the
handling of chemistry databases are also available via the website.
23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008).
http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp
24. ChemFinder.com: CambridgeSoft Corp, Cambridge, MA, USA (2008).
http://chemfinder.cambridgesoft.com/
25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007)
http://www.chemspider.com/blog/hacking-pubchem-technology-easy-quality-
difficult.html.
26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of
toxicity data on the Internet: Moving toward a flat world. Current Opinion in
Drug Discovery & Development (2006) 9(3): 314-325.
•• The review discusses efforts to gather, curate and make publicly available
toxicology-related chemical information. The specific discussions regarding the
quality issues with public chemistry databases and efforts to produce clean quality
databases are noteworthy.
27. DSSTox Quality Chemical Information Review Procedures: US
Environmental Protection Agency, Washington, DC, USA (2008).
http://www.epa.gov/nheerl/dsstox/ChemicalInfQAProcedures.html
28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007)
http://www.chemspider.com/docs/PubChem_at_ChemSpider_Overview_SLides_Sept
ember_2007.pdf
29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008)
http://www.chemspider.com/docs/The_Process_of_Curating_Identifiers_on_ChemSpi
der.pdf
30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic
Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008)
http://nihroadmap.nih.gov/
31. Hacking PubChem: Why The Open Access Fight is Just the Beginning,
Apodaca R, (2006), http://depth-first.com/articles/2006/09/22/hacking-pubchem-
why-the-open-access-fight-is-just-the-beginning
32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale
Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf.
Model. (2007) 47:1386-1394
•• The 2.5 million compound collection at the Genomics Institute of the Novartis
Research Foundation (GNF) was used as a model to determine whether automated
annotation of screening hits in batch is feasible.
33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly
Communication Blog: (2008)
http://osc.universityofcalifornia.edu/news/acs_pubchem.html
34. Background of the PubChem/CAS Issue: (2008)
http://www.arl.org/bm~doc/backgroundfaqpb.pdf
35. Baker M: Open-access chemistry databases evolving slowly but not
surely: Nature Reviews Drug Discovery (2006) 5:707-708
• A critical review of how far publicly available initiatives have to go to catch up with
commercial offerings.
36. How big is the challenge of curation and what is the structure of
Ginkgolide-B: Antony Williams (2008), http://www.chemspider.com/blog/how-big-
is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html
37. DSSTox: Distributed Structure-Searchable Toxicity (DSSTox) Database:
US Environmental Protection Agency, Washington, DC, USA (2006).
http://www.epa.gov/nheerl/dsstox/
38. Richard AM and Williams CR (2002) Distributed Structure-Searchable
Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research:
New Frontiers, 499:27-52.
39. Richard AM: DSSTox web site launch: Improving public access to
databases for building structure-toxicity prediction models, Preclinica, (2006)
2(2):103-108.
40. DSSTox Data Files: http://www.epa.gov/ncct/dsstox/DataFiles.html
41. eMolecules Online Service: eMolecules, Del Mar, CA, USA (2008).
http://www.emolecules.com
42. Available Chemical Directory: Santa Clara, California, USA (2008).
http://www.mdli.com/products/experiment/available_chem_dir/index.jsp
43. ChemCats: Chemical Abstracts Service, Columbus, OH, USA (2006).
http://www.cas.org/expertise/cascontent/chemcats.html
44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008).
http://www.cambridgesoft.com/databases/details/?db=12
45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe,
(2007) 19(1):27-28
46. The NIST Chemistry WebBook: (2008) http://webbook.nist.gov/chemistry/
47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute,
Frederick/National Institutes of Health, Bethesda, MD, USA. (2008).
http://dtp.nci.nih.gov/index.html
48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z,
Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery
and exploration, Nucleic Acids Res. (2006) 34:D668-72
• A detailed description of the intent, development and capabilities of the DrugBank
database, one of the most respected public chemistry databases utilized by drug
discovery scientists today.
49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG
resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database
issue):D277-80
50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A,
Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology
for chemical entities of biological interest, Nucleic Acids Res. (2008) 36:D344-D350.
51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B,
Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug
targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6.
•• An update regarding the DrugBank database as it is released in its Version 2
state.
52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35:
D521-6
53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB– Constructing A Chemical
Information System With Open Source Components. J. Chem. Inf. Comput. Sci.
(2003) 43:1733-1739.
•The defining article on the development of the NMRShiftDB database, describing
the intention of the work, the development of the software components and a vision
of how such a platform can lead to widespread dissemination of analytical data, at
no charge, to the chemistry community.
54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And
Structure Elucidation Support Through a Free Community-Built Web
Database. Phytochemistry, (2004), 65:2711–2717.
55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C,
Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based
13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf
Model, (2008), Accepted for publication, doi: 10.1021/ci700363r.
56. CSEARCH and NMRShiftDB: Robien W (2007)
http://nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html
57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure-
Centric Community for Chemists, Chemistry International (2007) 30(1): 30.
58. Open Notebook Science: Bradley JC, (2006) Drexel CoAs E-Learning Blog,
http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html
59. SureChem: San Francisco, CA, USA (2008) http://www.surechem.org/
60. Free Access Structure Searching of Patents: Williams AJ (2007),
http://www.chemspider.com/docs/Structure_Searching_of_Patents_Using_ChemSpid
er.pdf
61. LASSO: Ligand Activity in Surface Similarity Order, SimBioSys Inc., Toronto,
Canada. http://www.simbiosys.ca/ehits_lasso/index.html
62. Database of Useful Decoys: http://dud.docking.org./
63. WiChempedia: ChemSpider Blog (2007)
http://www.chemspider.com/blog/wichempedia-is-now-on-its-way.html
64. Chemical Structure Lookup Service: National Institutes of Health,
http://cactus.nci.nih.gov/cgi-bin/lookup/search
65. CrystalEye Crystallographic Database:
http://wwmm.ch.cam.ac.uk/crystaleye/
66. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog,
http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases
67. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM,
Almaden Services Research, San Jose, CA 95120, USA,
https://chemsearch.almaden.ibm.com/chemsearch/SearchServlet
68. Kemper K, Chemical Abstracts still developing ways to help its core –
scientists, Columbus Business First,
http://columbus.bizjournals.com/columbus/stories/2007/06/18/story20.html?page=
1
69. Feigenbaum L, Herman I, Hongsermeier T, Neumann E, Stephens S: The
Semantic Web in Action, Scientific American Magazine
http://www.sciam.com/article.cfm?id=the-semantic-web-in-action
70. The Benefits of Crowdsourcing: http://en.wikipedia.org/wiki/Crowdsourcing
71. The Definition of a Blog: http://en.wikipedia.org/wiki/Blog
72. ScienceBlogs: http://scienceblogs.com/
73. Chemical BlogSpace: http://cb.openmolecules.net/
74. The Definition of a Wiki: http://en.wikipedia.org/wiki/Wiki
75. Wikipedia Chemical Drugbox: http://en.wikipedia.org/wiki/Template:Drugbox
76. Wikipedia Chemical Infobox:
http://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
77. Taxol on Wikipedia: http://en.wikipedia.org/wiki/Taxol
78. AP7 on Wikipedia: http://en.wikipedia.org/wiki/AP7
79. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature
Precedings (2007) doi:10.1038/npre.2007.39.1,
http://precedings.nature.com/documents/39/version/1
80. UsefulChem Open Notebook Science: Bradley JC, Drexel University,
http://usefulchem.wikispaces.com/All+Reactions and http://usefulchem-
experiments1.blogspot.com/2006/05/exp-009.html
81. Open Notebook Science: Neylon C, Science in the open, An openwetware blog
on the challenges of open and connected science (2008)
http://blog.openwetware.org/scienceintheopen/2007/12/12/a-big-few-weeks-for-
open-notebook-science/
82. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog
(2008) http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=671
83. The IUPAC International Chemical Identifier: (2008)
http://www.iupac.org/inchi/
84. The IUPAC International Chemical Identifier Software: (2008)
http://www.iupac.org/inchi/release102.html
85. Royal Society of Chemistry: (2008) http://www.rsc.org/
86. Project Prospect: (2008) RSC Publishing,
http://www.rsc.org/Publishing/Journals/ProjectProspect/
87. Chemical Blogspace, (2008) http://cb.openmolecules.net/inchis.php
88. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A web
Server for Automating Metabolome Annotations. Pacific Symposium on
Biocomputing, (2007) 12:145-156.
89. Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the
Chemical Semantic Web through INChIfication. Org Biomol Chem, (2005)
3:1832-1834
90. Willighagen EL, O'Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and
Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487.
•• Discusses the use of userscripts to change the appearance of web pages by
modifying web content on the fly to enable aggregation of information and
computational results from different web resources into a single webpage. Indicative
of the future of integration and the possibilities which exist to gather information
from a multitude of resources and reformat and deliver to the consumer.
Figures
Figure 1: The Compound Summary Page for Taxol in PubChem. Only page 1 is
shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=36314)
Figure 2: The DrugBox for Taxol from Wikipedia (http://en.wikipedia.org/wiki/Taxol)
Figure 3: The TotallySynthetic.com blog. Paul Docherty discusses complex
syntheses and offers readers an opportunity to comment, analyze and provide
feedback. Many articles are labeled with InChIKeys to allow indexing by search
engines. (http://totallysynthetic.com/blog/)
Figure 4: An Example UsefulChem wiki page
(http://usefulchem.wikispaces.com/Exp148)
This UsefulChem wiki page shows a number of important content items: 1) Links to
the prior failed experiment; 2) Links to the docking results that justified making this
compound; 3) Full characterization (spectroscopy and photographs) of an isolated
product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4)
In the discussion section a question is posed by Professor Bradley to his student, and
then answered. The entire discussion history is captured. 5) A complete, detailed and
dated log of the steps taken by the student; 6) In the tag section, InChIs of every
compound used are provided for indexing by search engines.
[Structure diagram of L-ascorbic acid]
InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1
CIWBSHSKHKDKBQ-JLAZNSOCBT
Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.
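Because the InChI string is layered, its parts can be separated with ordinary string handling. The following Python sketch splits the string shown in Figure 5 into its layers. The layer names in `LAYER_NAMES` are informal labels chosen here for the prefixes that occur in this example, and `split_layers` is a hypothetical helper, not part of the IUPAC software [84]; layers with other prefixes are simply stored under their prefix letter.

```python
# InChI string for L-ascorbic acid, as given in Figure 5.
INCHI = "InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1"

# Informal labels for the layer prefixes that occur in the example above.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "t": "tetrahedral stereochemistry",
    "m": "stereo inversion flag",
    "s": "stereo type",
}


def split_layers(inchi):
    """Split an InChI string into a dict of layers keyed by an informal name."""
    prefix, _, body = inchi.partition("=")
    parts = body.split("/")
    if prefix != "InChI" or len(parts) < 2:
        raise ValueError("not an InChI string")
    # The first two fields carry no prefix letter: version, then formula.
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        if not part:
            continue  # skip empty segments
        layers[LAYER_NAMES.get(part[0], part[0])] = part[1:]
    return layers


if __name__ == "__main__":
    for name, value in split_layers(INCHI).items():
        print("{:28s} {}".format(name, value))
```

The formula layer alone (C6H8O6) is shared by every isomer of that composition; it is the connectivity and stereochemistry layers that make the string specific to L-ascorbic acid.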