SlideShare a Scribd company logo
1 of 37
Download to read offline
Page 1 of 37

Public Chemical Compound Databases

Antony J. Williams



Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587

Corresponding Author:antony.williams@chemspider.com

PHONE: 919 341-8375



       The internet has fast become the first port of call for all searches. The

increasing array of chemistry-related resources now available provides chemists a

direct path to the discovery of information, one previously accessed via library

services and limited to commercial and costly resources. The diversity of information

available online is expanding at a dramatic rate and a shift to publicly available

resources offers significant opportunities in terms of the benefit to science and

society. While the data available online do not generally meet the quality standards

available from manually curated sources there are efforts afoot to gather scientists

and “crowd source” an improvement in the quality of available data. This article will

discuss the types of public compound databases available online, provide a series of

example databases and focus on the benefits and disruptions associated with the

increased availability of such data and integrating technologies to data-mine the

available information.



Keywords     Public      databases,   chemical   structure   databases,   Open   Data,

chemoinformatics, data mining, internet chemistry, Wikis, blogs,
Page 2 of 37


Introduction

       The internet is likely used on a daily basis by the majority of scientists. There

is little doubt that the web is the primary portal to query for information and data

and, when coupled with the intranet services for most companies, is the tool of

choice for most general searches. For many years the search for scientific-related

information would start at the library and commonly engage skilled professionals in

the domain of searching. These people would have a deep understanding of

navigating the plethora of databases and resources, using their own query

languages, and would perform searches using for-fee resources. While such skills

remain of value most scientists conduct the majority of their own searches and

certainly utilize their access to a no-cost, intuitive and expansive internet of

information. There has been a tremendous growth in scientific internet resources and

there are enormous opportunities provided by such facile access to chemistry

information and data.

       Bioinformatics certainly established the trend of providing online access to

data and Chemistry, in many ways, is far behind. Open-access databases such as

GenBank [1] and the Protein Data Bank (PDB) [2] have been assisting biologists to

translate gene and protein sequences into biological relevance for over two decades.

It is possible that the differences in efforts results from publishers in Chemistry

discouraging the open flow of data and information. This is true not only for scientific

articles but also for chemistry databases. With the changing expectations of society

in terms of freedom of access to information, and the efforts of many evangelists
Page 3 of 37

and groups, a shift towards both free and open access (vide infra) chemistry-related

information is well underway and is likely to accelerate.

       Murray-Rust envisages a world in which all scientific information is instantly

available [3•]. This emerging world of e-science or cyberscholarship seeks “to

develop the tools, content and social attitudes to support multidisciplinary,

collaborative science. Its immediate aims are to find ways of sharing information in a

form that is appropriate to all readers.” This article will discuss the work already

underway to support this noble and valid effort to provide enhanced public access to

Chemistry data and specifically focus on public chemical compound databases.

       There are many tens of indexes of chemistry databases available online and

the reader is encouraged to perform one or more generic searches on “chemistry

databases” to retrieve a list of related information. The authors preferred source of

information is the Wiki hosted by Gary Wiggins [4•]. While the availability of freely

accessible information is clearly of value to scientists there are risks in terms of the

quality of information available. It is this quality issue which provides the

mainstream publishers, for the time-being, a foothold in the domain of providing

value-added access to scientific information. That said, public compound databases

especially have become a disruptive force for certain commercial bodies and the

threat has caused significant duress. The potential impact on the business models of

publishers and the increased capabilities and diversity of data within public

compound databases will also be highlighted.



Public Chemistry Databases

       There are many freely available chemical compound databases on the web

and they assume many different forms. They can simply be a collection of chemical

structures aggregated into a single file and made available, gratis, for people to
Page 4 of 37

download and utilize as they see fit. These files are generally available in the form of

an SDF file [5] and can be downloaded and then imported to a database for

searching and viewing. There are literally hundreds of such files available online and

they are commonly available from chemical vendors in order to advertise their

catalog collections. These files generally contain the chemical identifiers in the form

of chemical names (systematic and trade) and registry numbers. The files can also

contain experimental or physical properties, file specific identifiers and pricing

information. There are aggregators who gather such files of chemical structures and

related information and assemble them into a single database and serve up to the

public (some examples will be discussed later). Since the files are assembled in a

heterogeneous manner the resulting data are plagued with inconsistencies and data

quality issues. Such an approach to gathering and merging data is a far cry from that

taken by commercial database vendors who manually gather and curate data. Some

examples of these commercial organizations are CAS [6], InfoChem [8] and Symyx

[9].

       While the commercial databases offer curated data there is certainly a price-

barrier to accessing the information. A number of the free online resources are also

manually curated and, as will be discussed later, can offer as high a quality as the

commercial offerings. These resources are, however, constructed with a specific

focus in mind and therefore commonly number in the low thousands of structures

rather than the millions available in the larger online databases. Meanwhile, there

are a number of large online database resources offering access to valuable data and

knowledge. Some of these databases should be thought of as “linkbases”. For the

purpose of this article a linkbase is a repository of molecular connection tables

(chemical structures) linking out to various sources of data and associated

information. While it is impossible to be exhaustive within the confines of an article
Page 5 of 37

    of this nature an overview of a number of online public compound databases focusing

    specifically on free access databases will be provided.

           The confusion around the differences between Open Access (OA) versus Free

    Access (FA) continues to persist [9] but both offer an opportunity to help advance

    science by facilitating the sharing of data, information and knowledge with no

    barriers of price or access. The first major international statement on open access

    was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The

    definition of Open Access is as follows: “By 'open access' to this literature, we mean

    its free availability on the public internet, permitting any users to read, download,

    copy, distribute, print, search, or link to the full texts of these articles, crawl them

    for indexing, pass them as data to software, or use them for any other lawful

    purpose, without financial, legal, or technical barriers other than those inseparable

    from gaining access to the internet itself. The only constraint on reproduction and

    distribution, and the only role for copyright in this domain, should be to give authors

    control over the integrity of their work and the right to be properly acknowledged

    and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition

    has been suggested [12]: “Free access is access that removes price barriers but not

    necessarily any permission barriers.” For the purpose of this article we are not only

    interested in FA and OA but also Open Data.

           Quoting from an online resource [13] “Open Data is a philosophy and

    practice requiring that certain data are freely available to everyone, without

    restrictions from copyright, patents or other mechanisms of control”. As yet there

    are no commonly agreed upon definitions but as a result of Open Data evangelists

    and groups progress is being made [14•,15••,16-18].



          The majority of scientists cannot however differentiate between free access

    and open access since both provide free access to information of value to them in
Page 6 of 37

their work.     In a similar way, the majority of scientists do not care about the

distinctions between Open and Closed data. They utilize free access public chemical

compound databases on an as-needed basis, derive value from the content and

move on, not concerned whether the data posted online are Open or Closed.

Chemical Abstracts Services (CAS) [5] and their CAS Registry Numbers (RNs) [19]

have played a dominant role in managing a curated registry of chemical entities and

related chemical and biological literature. Their proprietary registration system does

not link to chemical structures in the public domain and their business model is at

risk [20••,21].

        Before reviewing examples of public compound databases we should review

the issues of data quality. All content databases containing chemical compounds

contain errors. These errors can arise for a series of reasons including errors in

transcription, historical errors (a compound was “correct” when entered but later re-

characterized), issues with graphical representation and a plethora of other reasons.

The quality of chemical information in the public domain is generally quite low. This

does not mean that the data are not of value but that care needs to be taken in the

nature of the provider as an authority. There is, of course, no central body

responsible for the quality of data in the public domain. Databases of chemical

structure information such as PubChem [22••], ChemIDPLus [23] and ChemFinder

[24] etc., are commonly looked upon as authorities in terms of reliable information.

However, these sources are also aggregators of information and are at risk of

perpetuating errors form the original public data and depositions. Errors in structure-

identifier   pairs   are   common   [25]   and   inaccurate   structure   representations,

specifically in regards to stereochemistry, proliferate across many databases. A

definitive description of the challenges regarding quality in public domain databases,

and the rigorous processes required to aggregate quality data was provided by

Richards et al [26••]. During their assembly of the EPA DSSTox databases the
Page 7 of 37

assembled the chemical structures, chemical names and CAS Registry Numbers for

over 8000 chemicals from numerous toxicity databases. The data they extracted

were carefully curated and validated using multiple public information sources [27].

       In regards to the quality of the chemical information presented with bioassay

data on PubChem Richards cautioned 'user beware' [26]. Since the chemical

structure content is deposited without additional review the user is at risk. Errors in

chemical names are common, and multiple structure errors have been identified.

Richards encourages users to make informed judgments on the quality of data based

on prior knowledge of the data submitter. The responsibility for the quality of the

PubChem database therefore rests with the depositors primarily and, as many of

these are commercial chemical vendors, their focus on quality is far less than the

stringent expectations of the community. The proliferation of errors from PubChem

into other databases has been identified [28] and a definitive effort to cleanse the

errors from the data, be it in regards to chemical structures, names or identifiers, is

going to be required. The efforts of groups such as the ChemSpider team with their

online curation [29] offers an opportunity to dramatically improve the quality of the

data through both a roboticized cleansing approach and manual examination by

many users. Efforts such as these should help reduce errors and result in the

proliferation of more validated information.



Public Compound Databases



PubChem

       The highest profile online database is certainly PubChem [22]. Launched by

NIH in 2004 to support the New Pathways to Discovery component of their roadmap

initiative [30]. PubChem archives and organizes information about the biological

activities of chemical compounds into a comprehensive biomedical database and is
Page 8 of 37

the informatics backbone for the initiative, intended to empower the scientific

community to use small molecule chemical compounds in their research.

       PubChem consists of three databases (PubChem Compound, PubChem

Substance, and PubChem Bio-Assay) connected together. PubChem Compound

contains 18 million unique structures and provides biological property information for

each compound. PubChem Substance contains records of substances from depositors

into the system. These are publishers, chemical vendors, commercial databases and

other sources. The PubChem Compound database contains records of individual

compounds (see Figure 1). PubChem BioAssay contains information about bioassays

using specific terms pertinent to the bioassay.

       PubChem can be searched by alphanumeric text variables such as names of

chemicals, property ranges or by structure, substructure or structural similarity. As

of December 2007 its content is approaching 38.7 million substances and 18.4

million unique structures. Such a source of data opens up new possibilities [31] in

regards to data mining and extraction. Zhou et al [32•] concluded that the system

has an important role as a central repository for chemical vendors and content

providers   enabling   evaluation   of commercial   compound   libraries and   saving

biomedical researchers from the work associated with gathering and searching

commercial databases. They identified that over 35% of the 5 million structures from

chemical vendors or screening centers found in the PubChem database currently are

not present in the CAS registry.

       PubChem continues to grow in stature, content and capability. The bioassay

data resulting from the NIH Roadmap initiative is likely to continue to grow and

PubChem will assume a prominent role in distributing the data in a standard format.

Despite the obvious value of PubChem the platform has caused quite a furor in

recent years including debates regarding the position of CAS relative to the resource.

The reader is referred elsewhere for commentaries [33,34]. Others have commented
Page 9 of 37

on the quality of the data content within PubChem. Shoichet [35••] believes that the

screening data are less rigorous than those in peer-reviewed articles, and contain

many false positives. Shoichet worries that chemists who use PubChem may be sent

on a wild goose chase. Numerous problems arise from the quality of submissions

from various data sources and there are thousands of errors in the structure-

identifier associations due to this contamination and this can lead to the retrieval of

incorrect chemical structures. It is also common to have multiple representations of

a single structure due to incomplete or total lack of stereochemistry for a molecule

[36].



DSSTox

The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project

[38,39] provides a series of documented, standardized and fully structure-annotated

files of toxicity information [40]. The initial intention for the project was to deliver a

public central repository of toxicity information to allow for flexible analogue

searching, SAR model       development and the building of chemical             relational

databases. In order to ensure maximum uptake by the public and allow users to

integrate the data into their own systems the DSSTox project adopted the use of the

common standard file format (SDF) to include chemical structure, text and property

information. The DSSTox databases was also deployed online to provide free public

access to the data files without the dependency on a desktop software package for

querying and managing the data files. The overall aims of the project, to deeply

integrate chemical structure information with existing toxicity data and to facilitate

interrogation of the data have been achieved. The DSSTox datasets are among the

most highly curated public datasets available and likely the reference standard in

publicly available structure-based toxicity data.
Page 10 of 37




eMolecules

         eMolecules [41] offers a free online database of almost 8 million unique

chemical structures. The database is assembled from data supplied by over 150

suppliers and provides a path to identifying a vendor for a particular chemical

compound. By providing access to compounds for purchase they are providing a free

access online service similar to those of commercial databases such as Symyx

Available Chemical Directory [42], CAS’ ChemCats [43] and Cambridgesoft’s

ChemACX [44] as well as a number of other providers. The system offers access to

more than 4 million commercially available screening compounds and many tens of

thousands of building blocks and intermediates. Their database was recently

enhanced by providing access to NMR, MS and IR spectra from Wiley-VCH [45] for

over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also

provides links to many sources of data for spectra, physical properties and biological

data including include the NIST WebBook [46], the National Cancer Institute [47],

DrugBank [48•] and PubChem.

         eMolecules is presently fairly limited in its scope and primarily offers a very

useful path to the purchase of chemicals and links to the more popular government

databases. Nevertheless, the site is popular with chemists who are searching for

chemicals and the interface is intuitive and easy to use, a key element in attracting

users.



DrugBank

DrugBank [48•] is a manually curated resource assembled from the collection

information of a series of other public domain databases and enhanced with
Page 11 of 37

additional data generated within the laboratories of the hosts. The database

aggregates both bioinformatics and cheminformatics data and combines detailed

drug data with comprehensive drug target (i.e. protein) information. The database is

hosted by the University of Alberta, Canada. Version 1 of the database, released in

2006, contained >4100 drug entries including >800 FDA approved small molecule

and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug

target sequences were linked to these drug entries. Each record in the database,

known as a DrugCard, has >80 data fields. The information is split into

drug/chemical data and drug target or protein data and many data fields are linked

to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The

database supports extensive text, sequence, chemical structure and relational query

searches.

      DrugBank has been used to facilitate in silico drug target discovery, drug

design, drug docking or screening, drug metabolism prediction, drug interaction

prediction and general pharmaceutical education. The version 2.0 release of

DrugBank [51••] released in January of this year with over 800 new drug entries and

each DrugCard entry extended to include over 100 data fields with half of the

information being devoted to drug/chemical data and the other half devoted to

pharmacological, pharmacogenomic and molecular biological data. They have started

to add experimental spectral data (NMR and MS specifically), and have expanded the

coverage to nutraceuticals and herbal medicines.

      The Drugbank team also host the Human Metabolome Database (HMDB)

[52], a database containing nformation about small molecule metabolites found in

the human body. The database is used by scientists working in the areas of

metabolomics, clinical chemistry and biomarker discovery. The database currently

contains nearly 3000 metabolite entries and each MetaboCard entry contains more
Page 12 of 37

than 90 data fields devoted to chemical, clinical data, enzymatic and biochemical

data.



NMRShiftDB

        The NMRShiftDB is an open source collection of chemical structures and their

associated NMR shift assignments [53•,54]. The database is generated as a result of

contributions by the public and currently contains over 20,000 structures with

>220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent

to registered reviewers for evaluation. A significant part of NMRShiftDB was initially

assembled from in-house databases from collaborating institutions and were entered

unchecked. This called for external checks of the data based on independent

databases and resources and these have now been carried out by two specific groups

[56,57]. Williams et al. [56] performed a cursory examination of the structural

diversity within the database and concluded that the data represented a statistically

relevant set to use in an evaluation of predictive accuracy and demonstrated that the

quality of the data is rather impressive.      This effort shows the advantages of

providing a set of Open Data for reuse and examination and the benefits of having

many scientists examine, validate and correct. The benefit is possible for any

database allowing its users to qualify, annotate and correct its data.




ChemSpider

         ChemSpider was released to the public in March 2007 with the intention of

“building a structure centric community for chemists”. ChemSpider has grown into a

resource containing almost 18 million unique chemical structures and recently shared

its data with PubChem providing about 7 million unique compounds. The data

sources have been gathered from chemical vendors as well as commercial database
Page 13 of 37

vendors and publishers and members of the Open Notebook Science community.

ChemSpider has also integrated the SureChem patent database [59] collection of

structures to facilitate links [60] between the systems. The database can be queried

using structure/substructure searching and alphanumeric text searching of both

intrinsic as well as predicted molecular properties. They have recently added virtual

screening results using the LASSO similarity search tool [61] to screen the

ChemSpider database against all 40 target families from the Database of Useful

Decoys (DUD) dataset.

       ChemSpider has enabled unique capabilities relative to the primary public

chemistry databases. These include real time curation of the data, association of

analytical data with chemical structures, real-time deposition of single or batch

chemical structures (including with activity data) and transaction-based predictions

of physicochemical data. The ChemSpider developers have made available a series of

web services to allow integration to the system for the purpose of searching the

system as well as generation of InChI identifiers and conversion routines.

       The system also integrates text-based searching of Open Access articles and

presently search over 50,000 OA Chemistry articles, soon to be extended to 150,000

articles. The index is expected to increase dramatically as they extract chemical

names from OA articles and convert the names to chemical structures using name to

structure conversion algorithms. These chemical structures will be deposited back to

the ChemSpider database thereby facilitating structure and substructure searching in

concert with text-based searching.

       ChemSpider has a focus on, and commitment to, community curation. The

social community aspects of the system demonstrate the potential of this approach.

The team have committed to the release of a wiki-like environment for further

annotation of the chemical structures in the database, a project they term

WiChempedia. They will utilize both available Wikipedia content and deposited
Page 14 of 37

content from users to enable the ongoing development of community curated

chemistry.



Other Databases

       The list of databases and resources reviewed above is only representative of

the type of information available online. Other highly regarded databases frequented

by this author include the Chemical Structure Lookup Service (with over 36 million

unique structures) [64], CrystalEye [65], KEGG [49] and CheBI [50]. There are also

many other resources available and the reader is referred to one of the many

indexes of such databases available on the internet to identify potential resources of

interest [4,66].



Public Compound Databases versus Commercial Databases

       The creation, hosting and support of a curated chemical compound database

with integrated content is an expensive enterprise. Historically these databases have

been built as a result of hundreds if not thousands of man years of rigorous and

exacting human effort and then, for some of the original founders in this domain,

migrated onto computer systems. In the development of these systems host

organizations have created sizeable revenues and estimated annual fees for

accessing this information via just a few organizations likely exceeds half a billion

dollars. With the advances in technology accompanying the internet boom the

hosting of large databases, the text-based searching of immense amounts of data

and the ability to disseminate complex forms of graphical information via standard

protocols provided an opportunity created for disruptive offerings in this domain.

They soon arrived.

       The primary advantage of commercial databases is that they have been

manually examined by skilled curators, addressing the tedious task of quality data-
Page 15 of 37

checking. Certainly the aggregation of data from multiple sources, both historical and

modern, from multiple countries and languages and from sources not available

electronically are significant enhancements over what is available via an internet

search. The question remains how long will this remain an issue? Scientists working

in new areas of science and domains of expertise reflect on the most recent

literature in general. Can you imagine a search about the semantic web being

conducted just a few years ago? What about metabonomics or even genomics?

Certain areas of the scientific literature, while still of high value, can become

antiquated fairly quickly. With the new capabilities of internet-based searching and

direct access to abstracts for the majority of publishers even a rudimentary text

search can expose articles previously unavailable except through an abstracting

service. Search engines will increasingly be utilized for first level searches specifically

because they are simple to use, they are fast and they are free. With chemically

searchable patents also available online [59,67], at no charge, the landscape for

scientists searching for information is more open than ever. If there are data of

interest to be located then internet search engines will enable it.

       The premier curated database offerings of today have an interesting if not

challenging future ahead of them. Their value-added enhancements of the

distributed data must be significant enough to warrant an investment in their

services [68]. As expressed earlier the quality of the data resulting from curation is

significant but this author questions the longevity of that distinguishing factor

moving forward. Roboticized recognition and conversion of chemical names to

chemical structures can dramatically shift this domain and efforts have already been

demonstrated in applications to patents and publications. Should the quality reach a

sufficient standard then today’s publishers business models will definitely be at risk.



The Future of Public Compound Databases
Page 16 of 37

       The semantic web [69] is already offering us the chance to connect,

simultaneously interrogate and mash-up the results of searching multiple public

compound databases simultaneously. An enormous diversity of data is already

available for interrogation by the public and continues to expand daily. This   author

remains concerned with the very real quality issues associated with public data sets.

While the utopian dream of no errors in freely available data cannot be met the push

towards more Open Data without consideration being given to both manual and

robotic curation could be risky to those using the data. Real-time curation of data

within public compound databases is feasible [29] and certainly Wikipedia is a model

of crowd sourcing [71] to build, curate and maintain a quality database.

Unfortunately, even these world-renowned platforms actually sit on the shoulders of

a very few dedicated individuals, relative to the users, who care about quality. There

is no simple solution to the issues of quality and it will persist for the foreseeable

future until processes, procedures and momentum to resolve the issues are

established.

       Even in its earliest form PubChem has been referred to, tongue-in-cheek, as

“the granddaddy of all free chemistry databases”. Certainly it presently holds the

premier position in reputation, capabilities and connectivities built on a database of

chemical structures and linked out to biological assay data, the PubMed database

and an array of services to facilitate both the distribution of the data and the wealth

of tools developed to support the system. The majority of databases discussed in this

article now uses two primary identifiers in their systems – the CAS registry number

and a PubChem ID number. This alone indicates a shift in equality of commercial

versus public compound repositories. For now, PubChem remains focused on its

initial intent to support the National Molecular Libraries Initiative. The data within

PubChem have never formally been declared as Open Data but are assumed to be
Page 17 of 37

available in that manner and thereby offer to scientists a valuable aggregate of data

for the purpose of data mining and discovery.

       At the time of writing the newest addition to the proliferating domain of public

chemical compound databases is the ChemSpider Database [57], working to “Build a

Structure Centric Community for Chemists”. This system presently offers a series of

unique capabilities which might become trend-setting for present and future

databases. As discussed earlier these include the user deposition of structures, real-

time annotation and curation of data, management of analytical data and online

transaction services. It is this authors’ belief that such capabilities will likely become

standard for the majority of most public chemical compound databases in the near

future. These types of capabilities could help establish the newfound shift to Open

Notebook Science and shift the bias from the chemical biology databases (PubChem,

Drugbank, HMDB and DSSTox) to even provide an environment for non-life science

chemists, polymer chemists and material scientists to manage and research

information of interest to them.



The WikiSphere, Blogosphere and Internet as a Public Compound Database.


       Wikis and blogs are common terms now for the majority of users of the

worldwide web and both are fast becoming chosen platforms for the exchange of

information between many scientists, not only as tools within their own research

groups but, more generally, with the public in general. A blog, or weblog is a website

where entries are written in chronological order and generally provide commentary

or news on a particular subject [71]. A typical blog combines text, images and links

to other blogs, web pages, and other media related to its topic. The original blog

posting remains untouched by the commenter and readers are free to add their

comments, generally in a mediated manner where the blog host retains control over
Page 18 of 37

the postings. An example screenshot from a chemistry-based blog hosted with the

intention of examining and discussing organic syntheses is shown in Figure 3. The

number of chemistry-related blogs continues to grow dramatically and there have

been efforts to provide a unified view into some of these [72,73].


       A wiki is a type of computer software that allows users easily to create, edit

and link web pages and enables documents to be written collaboratively, in a simple

markup language using a web browser, and is essentially a database for creating,

browsing and searching information. Certainly Wikipedia is the most well-known

today though there are many others already online and used within the confines of

an organization to manage content. There are active groups supporting the

development of chemistry on Wikipedia and there are now thousands of pages

describing small organic molecules, inorganics, organometallics, polymers and even

large biomolecules. Focusing on small molecules in general, each one has a Drug Box

[75] or a Chemical infobox [76].        A drug box provides identifier information

(chemical name, registry number, and so on) and commonly the identifiers link out

to a related resource. Chemical data, pharmacokinetic data and therapeutic

considerations can also be listed. At present there are approximately 8000 articles

with a chembox or drugbox [3], with between 500-1000 articles added since May.

The detailed information offered on Wikipedia regarding a particular chemical or drug

can be excellent [77], see Figure 2, or weak [78]. There are many dedicated

supporters and contributors to the quality of the online resource. Drug and

chemboxes have been shown to contain errors but the advantage of a wiki is that

changes can be made within a few keystrokes and the quality is immediately

enhanced. The opposite is also true and vandalism can occur. This community

curation process makes Wikipedia a very important online chemistry resource whose

impact will only expand with time.
Page 19 of 37




       Wikis have recently been used as the basis of Open Notebook Science [79].

The UsefulChem Wiki [80] includes a series of experimental pages commonly linked

to related blog pages as shown in Figure 4. The Open Notebook Science efforts and

the movement appears to be gaining momentum with the support of vocal

advocates, such as Neylon [81], Murray-Rust [82] and many others.


       While both wikis and blogs are very valuable for information exchange, what

they enable in terms of text and image exchange is all but crippled in terms of

searching by many chemists’ additional query needs for chemical structures,

reactions and data. Neither Wikis nor blogs, as yet, are enabled for the purpose of

structure and substructure searching and, therefore, remain isolated, in general,

from cheminformatics based search procedures. One of the key developments which

has already facilitated the Semantic Web for chemistry is the InChI,[83] the

International Chemical Identifier. The InChI string is a textual identifier for chemical

substances designed to provide a standard and human-readable way to encode

molecular information (see Figure 5) and to facilitate the search     for such information

in databases and on the web. The InChI string, unfortunately, has only partly delivered on

the promise of facilitating web-based searches, due to unpredictable breaking of InChI

character strings by search engines. In order to resolve this issue the InChIKey was

introduced. The condensed, 25 character InChIKey is a hashed version of the full InChI

and is not human-readable. The equivalent InChIKey for the InChIString of L-ascorbic

acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is one of

enabling web searches, but a lookup table to identify the associated structure, or reference
Page 20 of 37


to the original InChI String, is necessary [85]. While tens of millions of InChI strings

and keys have been populated into databases, their value is still in its infancy.

Publishers have started to embed InChIs into their articles and the Royal Society of

Chemistry [85] is presently pioneering a new publishing model, Project Prospect,

including InChI to demonstrate movement toward the semantic web for chemistry.

Bloggers have started to use InChI Strings and Keys on their postings, and wiki-

pages are being InChI-enabled to help the web become structure searchable. The

necessity of a central lookup facility for published InChIStrings will be necessary in

order to facilitate substructure searching of the web but this capability is likely to be

developed in the near future. Willighagen already aggregates InChI Strings onto a

blog [87].


       BioSpider [88] users are able to type in almost any kind of biological or

chemical identifier (protein/gene name, sequence, accession number, chemical

name, brand name, SMILES string, InCHI string, CAS number, etc.) and delivers a

report about the biomolecule. BioSpider uses a web-crawler to scan through dozens

of public databases and employs a variety of specially developed text mining tools

and locally developed prediction tools to find, extract and assemble data for its

reports. A summary includes physico-chemical parameters, images, models, data

files, descriptions and predictions concerning the query molecule.


       An increasing number of public databases will continue to become available

but the challenge, even now, is how to integrate and access the data. The

implementation of InChIs for web-based searching [89], and the delivery of

userscripts to aggregate information and computational results from different web

resources [90] are bringing together internet resources to appear as a single

monolithic public chemistry database. Willighagen et al. [90] use userscripts to
Page 21 of 37

enrich biology and chemistry related web resources by incorporating or linking to

other computational or data sources on the web. They showed how information from

web pages can be used to link to, search, and process information in other resources

thereby allowing scientists to select and incorporate the appropriate web resources

to enhance their productivity. Such tools connecting open chemistry databases and

user web pages is an ideal path to more highly integrated information sharing.



Conclusion

       There is little doubt that the newfound availability of public chemical

compound databases with their associated chemistry and biological data is enabling

scientists to access information at less cost in both time and currency. The increasing

quantity of freely accessible and integrated data can speed decision making and

bring clarity or alternatively inundate and saturate the user with poor quality

information. Scientists now have free access to structure-searchable patents, open

and free access peer-reviewed publications and software tools for the manipulation

of chemistry related data. Members of the Open Source movement are developing

toolkits including visualization and data-mining tools and, when coupled with the

public chemistry databases reviewed here, will likely benefit the process of discovery.

There are likely to be challenging times ahead in terms of meshing the needs of

commercial database publishers versus proliferation of free databases but this

journey will not be halted by the objections of the commercial entities provided that

legal copyrights are respected and the shift towards a more open community for

science persists.



Acknowledgements

The author wishes to thank the following people: Stephen Bryant and Evan Bolton

from the PubChem team, the IUPAC/National Institute of Standards and Technology
Page 22 of 37

InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi);

David Wishart and Nelson Young (Drugbank and HMDB),                  Nicko Goncharoff

(SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup

Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean

Ekins, Joerg Wegner and Alex Tropsha        specifically), Ann Richard and Marti Wolf

(DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust

(CrystalEye), Martin Walker, Andrew Yeung and Dirk Beestra (Wikipedia Chemistry).

I would also like to acknowledge the many contributors to the blogging discussions

about Open and Free Access.




References



1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic

Acids Res. (2007) 35(Database issue):D21-5.

2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data

Bank. Nature Structural Biology (2003) 12: 980

3. Murray-Rust P: Chemistry for everyone. NATURE (2008) 451, 648-651

•Provides a vision for the future of data distribution, access and integration across

the worldwide web and espouses the need for Open Data policies and adoption of the

Semantic Web.

4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web:

Alphabetical:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_D

atabases_on_the_Web_%28Alphabetical_List%29

Classified:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_Dat

abases_on_the_Web_%28Classified_List%29
Page 23 of 37

•An aggregation of chemistry databases, curated and annoted, to provide

significantly more information than would be returned in a generic search of the

internet.

5. Symyx: CTFile formats no-fee. (2008)

http://www.mdli.com/downloads/public/ctfile/ctfile.jsp

6. CAS: Chemical Abstract Services, Columbus, OH, USA (2006).

http://www.cas.org/

7. InfoChem: InfoChem Gesellschaft für Chemische Information, München,

Germany (2008). http://infochem.de/

8. Symyx: Santa Clara, California, USA (2008). http://www.symyx.com/

9. The University’s Mandate To Mandate Open Access: Harnad S, (2008)

http://openaccess.eprints.org/index.php?/archives/358-The-Universitys-Mandate-To-

Mandate-Open-Access.html

10. Open Access: Wikipedia Article on Open Access. (2008)

http://en.wikipedia.org/wiki/Open_access

11. The BOAI FAQ page: Frequently Accessed Questions about the Budapest Open

Access Initiative (2008), http://www.earlham.edu/~peters/fos/boaifaq.htm

12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry

Databases: Drug Discovery News (2008), accepted for publication

13. Open Data: Wikipedia Article on Open Data. (2008)

http://en.wikipedia.org/wiki/Open_data

14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of

Chemistry in the Global Electronic Age ChemInform, 36(15), (2005)

• An excellent outline regarding the potential of combining open access and the

semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this

domain and outline in this article how data may be interconnected to the benefit of

all chemists.
Page 24 of 37

15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C,

Wegner J , Willighagen EL: The Blue Obelisk-Interoperability in Chemical

Informatics, J Chem Inf Model, (2006) 46 (3), 991-998.


••The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a

group of scientists and developers supporting open source software development,

consistent and complimentary chemoinformatics research, open data, and open

standards in Chemistry.


16. CODATA, The Committee on Data for Science and Technology: CODATA,

Paris, France (2008). http://www.codata.org/

17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006).

http://sciencecommons.org/wp-

content/uploads/ScienceCommons_Concept_Paper.pdf

18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge

in a Digital Age (2008). http://www.okfn.org/

19. CAS Registry Numbers: Chemical Abstract Services, Columbus, OH, USA

(2008). http://www.cas.org/expertise/cascontent/registry/regsys.html

20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of

chemical information in bioscience. BMC Bioinform (2005) 6:180-196.

•• Provides an overview of chemical information on the Internet and, while slightly

outdated, is an important read in regards to the challenges and the vision of a

Semantic Web for Chemistry.

21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data

and the IUPAC International Chemical Identifier - InChI. American Chemical

Society National Meeting, Washington, DC, USA (2005):CINF-60.

22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD,

USA (2008). http://pubchem.ncbi.nlm.nih.gov
Page 25 of 37

•• Pubchem is a large data aggregator (nearing 20 million structures) and offers

relational searching capabilities via text, structure and substructure searching and

access to the entire dataset via download of SDF files. A series of services for the

handling of chemistry databases are also available via the website.

23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008).

http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp

24.   ChemFinder.com:       CambridgeSoft   Corp,   Cambridge,   MA,   USA   (2008).

http://chemfinder.cambridgesoft.com/

25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007)

http://www.chemspider.com/blog/hacking-pubchem-technology-easy-quality-

difficult.html.

26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of

toxicity data on the Internet: Moving toward a flat world. Current Opinion in

Drug Discovery & Development (2006) 9(3): 314-325.

•• The review discusses efforts to gather, curate and make publicly available

toxicology-related chemical information. The specific discussions regarding the

quality issues with public chemistry databases and efforts to produce clean quality

databases are noteworthy.

27. DSSTox Quality Chemical Information Review Procedures: US

Environmental Protection Agency, Washington, DC, USA (2008).

http://www.epa.gov/nheerl/dsstox/ChemicalInfQAProcedures.html

28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007)

http://www.chemspider.com/docs/PubChem_at_ChemSpider_Overview_SLides_Sept

ember_2007.pdf

29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008)

http://www.chemspider.com/docs/The_Process_of_Curating_Identifiers_on_ChemSpi

der.pdf
Page 26 of 37

30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic

Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008)

http://nihroadmap.nih.gov/


31. Hacking PubChem: Why The Open Access Fight is Just the Beginning,

Apodaca R, (2006), http://depth-first.com/articles/2006/09/22/hacking-pubchem-

why-the-open-access-fight-is-just-the-beginning


32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale

Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf.

Model. (2007) 47:1386-1394


•• The 2.5 million compound collection at the Genomics Institute of the Novartis

Research Foundation (GNF) was used as a model to determine whether automated

annotation of screening hits in batch is feasible.

33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly

Communication Blog: (2008)

http://osc.universityofcalifornia.edu/news/acs_pubchem.html

34. Background of the PubChem/CAS Issue: (2008)

http://www.arl.org/bm~doc/backgroundfaqpb.pdf

35. Baker M: Open-access chemistry databases evolving slowly but not

surely:Nature Reviews, Drug Discovery, (2006) 5:707-708

• A critical review of how far publicly available initiatives have to go to catch up with

commercial offerings.

36. How big is the challenge of curation and what is the structure of

Ginkgolide-B: Antony Williams (2008), http://www.chemspider.com/blog/how-big-

is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html
Page 27 of 37

37 DSSTOX: Distributed Structure-Searchable Toxicity (DSSTox) Database:

US Environmental Protection Agency, Washington, DC, USA (2006).

http://www.epa.gov/nheerl/dsstox/

38. Richard AM and Williams CR (2002) Distributed Structure-Searchable

Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research:

New Frontiers, 499:27-52.

39. Richard AM: DSSTox web site launch: Improving public access to

databases for building structure-toxicity prediction models, Preclinica, (2006)

2(2):103-108.

40. DSSTox Data Files: http://www.epa.gov/ncct/dsstox/DataFiles.html

41.   eMolecules     Online   Service:    eMolecules,   Del   Mar,   CA,    USA   (2008).

http://www.emolecules.com

42. Available Chemical Directory: Santa Clara, California, USA (2008).

http://www.mdli.com/products/experiment/available_chem_dir/index.jsp

43. ChemCats: Chemical Abstract Services, Columbus, OH, USA (2006).

http://www.cas.org/expertise/cascontent/chemcats.html

44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008).

http://www.cambridgesoft.com/databases/details/?db=12

45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe,

(2007) 19(1):27-28

46. The NIST Chemistry WebBook: (2008) http://webbook.nist.gov/chemistry/

47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute,

Frederick/National    Institutes    of   Health,   Bethesda,    MD,        USA.   (2008).

http://dtp.nci.nih.gov/index.html

48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z,

Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery

and exploration, Nucleic Acids Res. (2006) 34:D668-72
Page 28 of 37

• A detailed description of the intent, development and capabilities of the Drugbank

database, one of the most respected public chemistry databases utilized by drug

discovery scientists today.


49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG

resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database

issue):D277-80


50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A,

Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology

for chemical entities of biological interest, Nucl. Acids Res. (2008) 36: D344-

D350;

51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B,

Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug

targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6.

•• An update regarding the DrugBank database as it is released in its Version 2

state.

52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35:

D521-6

53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB– Constructing A Chemical

Information System With Open Source Components. J. Chem. Inf. Comput. Sci.

(2003) 43:1733-1739.

•The defining article regarding the development of the NMRShiftDB database defining

the intention of the work, the development of the software components and a vision

of how such a platform can lead to widespread dissemination of analytical data, at

no-charge, to the chemistry community.
Page 29 of 37

54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And

Structure Elucidation Support Through a Free Community-Built Web

Database. Phytochemistry, (2004), 65:2711–2717.

55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C,

Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based

13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf

Model, (2008), Accepted for publication, doi: 10.1021/ci700363r.

56. CSEARCH and NMRShiftDB: Robien W (2007)

http://nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html

57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure-

Centric Community for Chemists, Chemistry International (2007) 30(1): 30.

58. Open Notebook Science: Bradley JC, (2006) Drexel CoAs E-Learning Blog,

http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html

59. SureChem: San Francisco, CA, USA (2008) http://www.surechem.org/

60. Free Access Structure Searching of Patents: Williams AJ (2007),

http://www.chemspider.com/docs/Structure_Searching_of_Patents_Using_ChemSpid

er.pdf


61. LASSO: Ligand Activity in Surface Similarity Order, SioBioSys Inc., Toronto,

Canada. http://www.simbiosys.ca/ehits_lasso/index.html

62. Database of Useful Decoys: http://dud.docking.org./

63. WiChempedia: ChemSpider Blog (2007)

http://www.chemspider.com/blog/wichempedia-is-now-on-its-way.html

64. Chemical Structure Lookup Service: National Institutes of Health,

http://cactus.nci.nih.gov/cgi-bin/lookup/search

65. CrystalEye Crystallogrpahic Database:

http://wwmm.ch.cam.ac.uk/crystaleye/
Page 30 of 37

66. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog,

http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases

67. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM,

Almaden Services Research, San Jose, CA 95120, USA,

https://chemsearch.almaden.ibm.com/chemsearch/SearchServlet


68. Kemper K, Chemical Abstracts still developing ways to help its core –

scientists, Columbus Business First,

http://columbus.bizjournals.com/columbus/stories/2007/06/18/story20.html?page=

1

69. Feigenbaum L,    Herman I,    Hongsermeier T, Neumann E, Stephens S: The

Semantic       Web       in      Action,     Scientific      American      Magazine

http://www.sciam.com/article.cfm?id=the-semantic-web-in-action

70. The Benefits of Crowdsourcing: http://en.wikipedia.org/wiki/Crowdsourcing

71. The Definition of a Blog: http://en.wikipedia.org/wiki/Blog


72. ScienceBlogs: http://scienceblogs.com/


73. Chemical BlogSpace: http://cb.openmolecules.net/


74. The Definition of a Wiki: http://en.wikipedia.org/wiki/Wiki


75. Wikipedia Chemical Drugbox: http://en.wikipedia.org/wiki/Template:Drugbox


76. Wikipedia Chemical Infobox:

http://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox


77. Taxol on Wikipedia: http://en.wikipedia.org/wiki/Taxol


78. AP7 on Wikipedia: http://en.wikipedia.org/wiki/AP7
Page 31 of 37

79. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature

Preceedings (2007) doi:10.1038/npre.2007.39.1,

http://precedings.nature.com/documents/39/version/1


80. UsefulChem Open Notebook Science:             Bradley JC, Drexel      University,

http://usefulchem.wikispaces.com/All+Reactions        and        http://usefulchem-

experiments1.blogspot.com/2006/05/exp-009.html


81. Open Notebook Science: Neylon C, Science in the open, An openwetware blog

on the challenges of open and connected science (2008)

http://blog.openwetware.org/scienceintheopen/2007/12/12/a-big-few-weeks-for-

open-notebook-science/


82. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog

(2008) http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=671


83. The IUPAC International Chemical Identifier: (2008)

http://www.iupac.org/inchi/

84. The IUPAC International Chemical Identifier Software: (2008)

http://www.iupac.org/inchi/release102.html

85. Royal Society of Chemistry: (2008) http://www.rsc.org/

86. Project Prospect: (2008) RSC Publishing,

http://www.rsc.org/Publishing/Journals/ProjectProspect/

87. Chemical Blogspace, (2008) http://cb.openmolecules.net/inchis.php


88. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A web

Server   for   Automating     Metabolome     Annotations.   Pacific   Symposium   on

Biocomputing, (2007) 12:145-156.
Page 32 of 37

89.   Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the

Chemical Semantic Web through INChIfication. Org Biomol Chem, (2005)

3:1832-1834


90. Willighagen EL, O'Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and

Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487.


•• Discusses the use of userscripts to change the appearance of web pages by

modifying web content on the fly to enable aggregation of information and

computational results from different web resources into a single webpage. Indicative

of the future of integration and the possibilities which exist to gather information

from a multitude of resources and reformat and deliver to the consumer.
Page 33 of 37

Figures




Figure 1 - The Compound Summary Page for Taxol in PubChem. Page 1 only is

shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=36314)
Page 34 of 37




Figure 2: The DrugBox for Taxol from Wikipedia (http://en.wikipedia.org/wiki/Taxol)
Page 35 of 37




Figure   3:   The   TotallySynthetic.com   blog.   Paul   Docherty   discusses   complex

syntheses and offers readers an opportunity to comment, analyze and provide

feedback. Many articles are labeled with InChIKeys to allow indexing by search

engines. (http://totallysynthetic.com/blog/)
Page 36 of 37




Figure       4:             An       Example         UsefulChem        wiki       page

(http://usefulchem.wikispaces.com/Exp148)

This UsefulChem wiki page shows a number of important content items: 1) Links to

the prior failed experiment; 2) Links to the docking results that justified making this

compound; 3) Full characterization (spectroscopy and photographs) of an isolated

product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4)

In the discussion section a question is posed by Professor Bradley to his student, and

then answered. The entire discussion history is captured. 5) A complete, detailed and

dated log of the steps taken by the student; 6) In the tag section, InChIs of every

compound used are provided for indexing by search engines.
Page 37 of 37



     HO
               O
                    O
HO                      InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1

                        CIWBSHSKHKDKBQ-JLAZNSOCBT
          HO       OH

Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.

More Related Content

What's hot

from local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspacefrom local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global DataspaceOpen Education Consortium
 
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...UKSG: connecting the knowledge community
 
Linked Open Data_mlanet13
Linked Open Data_mlanet13Linked Open Data_mlanet13
Linked Open Data_mlanet13Kristi Holmes
 
Supporting UC Research Data Management
Supporting UC Research Data ManagementSupporting UC Research Data Management
Supporting UC Research Data Managementslabrams
 
W3C Library Linked Data Incubator Group
W3C Library Linked Data Incubator GroupW3C Library Linked Data Incubator Group
W3C Library Linked Data Incubator GroupAntoine Isaac
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentationekansa
 
Countrywide similarity report
Countrywide similarity reportCountrywide similarity report
Countrywide similarity reportJeffrey Lorton
 
Metadata Ownership & Metadata Rights
Metadata Ownership & Metadata RightsMetadata Ownership & Metadata Rights
Metadata Ownership & Metadata RightsChelcie Rowell
 
Publishing Data on the Web
Publishing Data on the Web Publishing Data on the Web
Publishing Data on the Web Centro Web
 
Data Sharing & Data Citation
Data Sharing & Data CitationData Sharing & Data Citation
Data Sharing & Data CitationMicah Altman
 
Boundless Opportunity
Boundless OpportunityBoundless Opportunity
Boundless OpportunityRachel Frick
 
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsPromises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsEmily Nimsakont
 
Potential of Library 2.0 for research libraries in Kenya
Potential of Library 2.0 for research libraries in KenyaPotential of Library 2.0 for research libraries in Kenya
Potential of Library 2.0 for research libraries in KenyaTom Kwanya
 

What's hot (20)

from local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspacefrom local/regional OER Silos towards an OER Global Dataspace
from local/regional OER Silos towards an OER Global Dataspace
 
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...
UKSG 2018 Breakout - Trouble(shooting) with a capital T: how categorising and...
 
Linked Open Data_mlanet13
Linked Open Data_mlanet13Linked Open Data_mlanet13
Linked Open Data_mlanet13
 
Supporting UC Research Data Management
Supporting UC Research Data ManagementSupporting UC Research Data Management
Supporting UC Research Data Management
 
W3C Library Linked Data Incubator Group
W3C Library Linked Data Incubator GroupW3C Library Linked Data Incubator Group
W3C Library Linked Data Incubator Group
 
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-researchUc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research
 
Jarrar: Linked Data
Jarrar: Linked DataJarrar: Linked Data
Jarrar: Linked Data
 
Internet-based Tools for Communication and Collaboration in Chemistry
Internet-based Tools for Communication and Collaboration in ChemistryInternet-based Tools for Communication and Collaboration in Chemistry
Internet-based Tools for Communication and Collaboration in Chemistry
 
IASSIT Kansa Presentation
IASSIT Kansa PresentationIASSIT Kansa Presentation
IASSIT Kansa Presentation
 
Countrywide similarity report
Countrywide similarity reportCountrywide similarity report
Countrywide similarity report
 
Metadata Ownership & Metadata Rights
Metadata Ownership & Metadata RightsMetadata Ownership & Metadata Rights
Metadata Ownership & Metadata Rights
 
Semantic Web-Linked Data and Libraries
Semantic Web-Linked Data and LibrariesSemantic Web-Linked Data and Libraries
Semantic Web-Linked Data and Libraries
 
Publishing Data on the Web
Publishing Data on the Web Publishing Data on the Web
Publishing Data on the Web
 
Chapter 1,2,3,6
Chapter 1,2,3,6Chapter 1,2,3,6
Chapter 1,2,3,6
 
Data Sharing & Data Citation
Data Sharing & Data CitationData Sharing & Data Citation
Data Sharing & Data Citation
 
Boundless Opportunity
Boundless OpportunityBoundless Opportunity
Boundless Opportunity
 
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library CatalogsPromises and Pitfalls: Linked Data, Privacy, and Library Catalogs
Promises and Pitfalls: Linked Data, Privacy, and Library Catalogs
 
Potential of Library 2.0 for research libraries in Kenya
Potential of Library 2.0 for research libraries in KenyaPotential of Library 2.0 for research libraries in Kenya
Potential of Library 2.0 for research libraries in Kenya
 
Levine Clark NISO-ICSTI Joint Webinar June 30
Levine Clark NISO-ICSTI Joint Webinar June 30Levine Clark NISO-ICSTI Joint Webinar June 30
Levine Clark NISO-ICSTI Joint Webinar June 30
 
data.ac.uk briefing paper
data.ac.uk briefing paperdata.ac.uk briefing paper
data.ac.uk briefing paper
 

Viewers also liked

50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)Heinz Marketing Inc
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitudeWith Company
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer ExperienceYuan Wang
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionIn a Rocket
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanPost Planner
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting PersonalKirsty Hulse
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldabaux singapore
 

Viewers also liked (8)

50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitude
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming Convention
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media Plan
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting Personal
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
 
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job? Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
Succession “Losers”: What Happens to Executives Passed Over for the CEO Job?
 

Similar to Current opinions in drug discovery public compound databases

Sustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research DataSustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research Datagideon christian
 
Linked Data: Why Bother?
Linked Data:  Why Bother?Linked Data:  Why Bother?
Linked Data: Why Bother?Jennifer Bowen
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
Reshaping the world of scholarly communication by Dr. Usha Munshi
Reshaping the world of scholarly communication by Dr. Usha MunshiReshaping the world of scholarly communication by Dr. Usha Munshi
Reshaping the world of scholarly communication by Dr. Usha MunshiAta Rehman
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchGigaScience, BGI Hong Kong
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersPrattSILS
 
1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-mainGraham Steel
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Media, information and the promise of new technologies in Knowledge Transfer ...
Media, information and the promise of new technologies in Knowledge Transfer ...Media, information and the promise of new technologies in Knowledge Transfer ...
Media, information and the promise of new technologies in Knowledge Transfer ...maudelfin
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
Knowledge Sharing in the Sciences
Knowledge Sharing in the SciencesKnowledge Sharing in the Sciences
Knowledge Sharing in the SciencesKaitlin Thaney
 
dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET
 
Calhoun Data Sharing Panel IFLA Aug 2008
Calhoun Data Sharing Panel IFLA  Aug 2008Calhoun Data Sharing Panel IFLA  Aug 2008
Calhoun Data Sharing Panel IFLA Aug 2008Karen S Calhoun
 

Similar to Current opinions in drug discovery public compound databases (20)

Why open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and modelsWhy open drug discovery needs four simple rules for licensing data and models
Why open drug discovery needs four simple rules for licensing data and models
 
Precompetitive preclinical ADME/tox data and set it free on the web to facili...
Precompetitive preclinical ADME/tox data and set it free on the web to facili...Precompetitive preclinical ADME/tox data and set it free on the web to facili...
Precompetitive preclinical ADME/tox data and set it free on the web to facili...
 
Sustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research DataSustainable Legal Framework for Open Access to Research Data
Sustainable Legal Framework for Open Access to Research Data
 
Qualifying Online Information Resources for Chemists
Qualifying Online Information Resources for ChemistsQualifying Online Information Resources for Chemists
Qualifying Online Information Resources for Chemists
 
Linked Data: Why Bother?
Linked Data:  Why Bother?Linked Data:  Why Bother?
Linked Data: Why Bother?
 
Reaching out to collaborators and crowdsourcing for pharmaceutical research
Reaching out to collaborators and crowdsourcing for pharmaceutical research  Reaching out to collaborators and crowdsourcing for pharmaceutical research
Reaching out to collaborators and crowdsourcing for pharmaceutical research
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
Reshaping the world of scholarly communication by Dr. Usha Munshi
Reshaping the world of scholarly communication by Dr. Usha MunshiReshaping the world of scholarly communication by Dr. Usha Munshi
Reshaping the world of scholarly communication by Dr. Usha Munshi
 
Nicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do researchNicole Nogoy: GigaScience...how licensing can change the way we do research
Nicole Nogoy: GigaScience...how licensing can change the way we do research
 
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
Open Drug Discovery Teams: A Chemistry Mobile App for Collaboration
 
LIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project postersLIS 653 fall 2013 final project posters
LIS 653 fall 2013 final project posters
 
1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main1 s2.0-s0098791313000154-main
1 s2.0-s0098791313000154-main
 
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG 2018 Breakout - Setting your cites to open I4OC - MaccallumUKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
UKSG 2018 Breakout - Setting your cites to open I4OC - Maccallum
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Media, information and the promise of new technologies in Knowledge Transfer ...
Media, information and the promise of new technologies in Knowledge Transfer ...Media, information and the promise of new technologies in Knowledge Transfer ...
Media, information and the promise of new technologies in Knowledge Transfer ...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
Knowledge Sharing in the Sciences
Knowledge Sharing in the SciencesKnowledge Sharing in the Sciences
Knowledge Sharing in the Sciences
 
dkNET Poster ENDO 2016
dkNET Poster ENDO 2016 dkNET Poster ENDO 2016
dkNET Poster ENDO 2016
 
Linked data in pharma R&D
Linked data in pharma R&DLinked data in pharma R&D
Linked data in pharma R&D
 
Calhoun Data Sharing Panel IFLA Aug 2008
Calhoun Data Sharing Panel IFLA  Aug 2008Calhoun Data Sharing Panel IFLA  Aug 2008
Calhoun Data Sharing Panel IFLA Aug 2008
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Current opinions in drug discovery public compound databases

  • 1. Page 1 of 37 Public Chemical Compound Databases Antony J. Williams Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587 Corresponding Author:antony.williams@chemspider.com PHONE: 919 341-8375 The internet has fast become the first port of call for all searches. The increasing array of chemistry-related resources now available provides chemists a direct path to the discovery of information, one previously accessed via library services and limited to commercial and costly resources. The diversity of information available online is expanding at a dramatic rate and a shift to publicly available resources offers significant opportunities in terms of the benefit to science and society. While the data available online do not generally meet the quality standards available from manually curated sources there are efforts afoot to gather scientists and “crowd source” an improvement in the quality of available data. This article will discuss the types of public compound databases available online, provide a series of example databases and focus on the benefits and disruptions associated with the increased availability of such data and integrating technologies to data-mine the available information. Keywords Public databases, chemical structure databases, Open Data, chemoinformatics, data mining, internet chemistry, Wikis, blogs,
  • 2. Page 2 of 37 Introduction The internet is likely used on a daily basis by the majority of scientists. There is little doubt that the web is the primary portal to query for information and data and, when coupled with the intranet services for most companies, is the tool of choice for most general searches. For many years the search for scientific-related information would start at the library and commonly engage skilled professionals in the domain of searching. These people would have a deep understanding of navigating the plethora of databases and resources, using their own query languages, and would perform searches using for-fee resources. While such skills remain of value most scientists conduct the majority of their own searches and certainly utilize their access to a no-cost, intuitive and expansive internet of information. There has been a tremendous growth in scientific internet resources and there are enormous opportunities provided by such facile access to chemistry information and data. Bioinformatics certainly established the trend of providing online access to data and Chemistry, in many ways, is far behind. Open-access databases such as GenBank [1] and the Protein Data Bank (PDB) [2] have been assisting biologists to translate gene and protein sequences into biological relevance for over two decades. It is possible that the differences in efforts results from publishers in Chemistry discouraging the open flow of data and information. This is true not only for scientific articles but also for chemistry databases. With the changing expectations of society in terms of freedom of access to information, and the efforts of many evangelists
  • 3. Page 3 of 37 and groups, a shift towards both free and open access (vide infra) chemistry-related information is well underway and is likely to accelerate. Murray-Rust envisages a world in which all scientific information is instantly available [3•]. This emerging world of e-science or cyberscholarship seeks “to develop the tools, content and social attitudes to support multidisciplinary, collaborative science. Its immediate aims are to find ways of sharing information in a form that is appropriate to all readers.” This article will discuss the work already underway to support this noble and valid effort to provide enhanced public access to Chemistry data and specifically focus on public chemical compound databases. There are many tens of indexes of chemistry databases available online and the reader is encouraged to perform one or more generic searches on “chemistry databases” to retrieve a list of related information. The authors preferred source of information is the Wiki hosted by Gary Wiggins [4•]. While the availability of freely accessible information is clearly of value to scientists there are risks in terms of the quality of information available. It is this quality issue which provides the mainstream publishers, for the time-being, a foothold in the domain of providing value-added access to scientific information. That said, public compound databases especially have become a disruptive force for certain commercial bodies and the threat has caused significant duress. The potential impact on the business models of publishers and the increased capabilities and diversity of data within public compound databases will also be highlighted. Public Chemistry Databases There are many freely available chemical compound databases on the web and they assume many different forms. They can simply be a collection of chemical structures aggregated into a single file and made available, gratis, for people to
  • 4. Page 4 of 37 download and utilize as they see fit. These files are generally available in the form of an SDF file [5] and can be downloaded and then imported to a database for searching and viewing. There are literally hundreds of such files available online and they are commonly available from chemical vendors in order to advertise their catalog collections. These files generally contain the chemical identifiers in the form of chemical names (systematic and trade) and registry numbers. The files can also contain experimental or physical properties, file specific identifiers and pricing information. There are aggregators who gather such files of chemical structures and related information and assemble them into a single database and serve up to the public (some examples will be discussed later). Since the files are assembled in a heterogeneous manner the resulting data are plagued with inconsistencies and data quality issues. Such an approach to gathering and merging data is a far cry from that taken by commercial database vendors who manually gather and curate data. Some examples of these commercial organizations are CAS [6], InfoChem [8] and Symyx [9]. While the commercial databases offer curated data there is certainly a price- barrier to accessing the information. A number of the free online resources are also manually curated and, as will be discussed later, can offer as high a quality as the commercial offerings. These resources are, however, constructed with a specific focus in mind and therefore commonly number in the low thousands of structures rather than the millions available in the larger online databases. Meanwhile, there are a number of large online database resources offering access to valuable data and knowledge. Some of these databases should be thought of as “linkbases”. For the purpose of this article a linkbase is a repository of molecular connection tables (chemical structures) linking out to various sources of data and associated information. While it is impossible to be exhaustive within the confines of an article
  • 5. Page 5 of 37 of this nature an overview of a number of online public compound databases focusing specifically on free access databases will be provided. The confusion around the differences between Open Access (OA) versus Free Access (FA) continues to persist [9] but both offer an opportunity to help advance science by facilitating the sharing of data, information and knowledge with no barriers of price or access. The first major international statement on open access was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The definition of Open Access is as follows: “By 'open access' to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition has been suggested [12]: “Free access is access that removes price barriers but not necessarily any permission barriers.” For the purpose of this article we are not only interested in FA and OA but also Open Data. Quoting from an online resource [13] “Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control”. As yet there are no commonly agreed upon definitions but as a result of Open Data evangelists and groups progress is being made [14•,15••,16-18].  The majority of scientists cannot however differentiate between free access and open access since both provide free access to information of value to them in
  • 6. Page 6 of 37 their work. In a similar way, the majority of scientists do not care about the distinctions between Open and Closed data. They utilize free access public chemical compound databases on an as-needed basis, derive value from the content and move on, not concerned whether the data posted online are Open or Closed. Chemical Abstracts Services (CAS) [5] and their CAS Registry Numbers (RNs) [19] have played a dominant role in managing a curated registry of chemical entities and related chemical and biological literature. Their proprietary registration system does not link to chemical structures in the public domain and their business model is at risk [20••,21]. Before reviewing examples of public compound databases we should review the issues of data quality. All content databases containing chemical compounds contain errors. These errors can arise for a series of reasons including errors in transcription, historical errors (a compound was “correct” when entered but later re- characterized), issues with graphical representation and a plethora of other reasons. The quality of chemical information in the public domain is generally quite low. This does not mean that the data are not of value but that care needs to be taken in the nature of the provider as an authority. There is, of course, no central body responsible for the quality of data in the public domain. Databases of chemical structure information such as PubChem [22••], ChemIDPLus [23] and ChemFinder [24] etc., are commonly looked upon as authorities in terms of reliable information. However, these sources are also aggregators of information and are at risk of perpetuating errors form the original public data and depositions. Errors in structure- identifier pairs are common [25] and inaccurate structure representations, specifically in regards to stereochemistry, proliferate across many databases. A definitive description of the challenges regarding quality in public domain databases, and the rigorous processes required to aggregate quality data was provided by Richards et al [26••]. During their assembly of the EPA DSSTox databases the
  • 7. Page 7 of 37 assembled the chemical structures, chemical names and CAS Registry Numbers for over 8000 chemicals from numerous toxicity databases. The data they extracted were carefully curated and validated using multiple public information sources [27]. In regards to the quality of the chemical information presented with bioassay data on PubChem Richards cautioned 'user beware' [26]. Since the chemical structure content is deposited without additional review the user is at risk. Errors in chemical names are common, and multiple structure errors have been identified. Richards encourages users to make informed judgments on the quality of data based on prior knowledge of the data submitter. The responsibility for the quality of the PubChem database therefore rests with the depositors primarily and, as many of these are commercial chemical vendors, their focus on quality is far less than the stringent expectations of the community. The proliferation of errors from PubChem into other databases has been identified [28] and a definitive effort to cleanse the errors from the data, be it in regards to chemical structures, names or identifiers, is going to be required. The efforts of groups such as the ChemSpider team with their online curation [29] offers an opportunity to dramatically improve the quality of the data through both a roboticized cleansing approach and manual examination by many users. Efforts such as these should help reduce errors and result in the proliferation of more validated information. Public Compound Databases PubChem The highest profile online database is certainly PubChem [22]. Launched by NIH in 2004 to support the New Pathways to Discovery component of their roadmap initiative [30]. PubChem archives and organizes information about the biological activities of chemical compounds into a comprehensive biomedical database and is
  • 8. Page 8 of 37 the informatics backbone for the initiative, intended to empower the scientific community to use small molecule chemical compounds in their research. PubChem consists of three databases (PubChem Compound, PubChem Substance, and PubChem Bio-Assay) connected together. PubChem Compound contains 18 million unique structures and provides biological property information for each compound. PubChem Substance contains records of substances from depositors into the system. These are publishers, chemical vendors, commercial databases and other sources. The PubChem Compound database contains records of individual compounds (see Figure 1). PubChem BioAssay contains information about bioassays using specific terms pertinent to the bioassay. PubChem can be searched by alphanumeric text variables such as names of chemicals, property ranges or by structure, substructure or structural similarity. As of December 2007 its content is approaching 38.7 million substances and 18.4 million unique structures. Such a source of data opens up new possibilities [31] in regards to data mining and extraction. Zhou et al [32•] concluded that the system has an important role as a central repository for chemical vendors and content providers enabling evaluation of commercial compound libraries and saving biomedical researchers from the work associated with gathering and searching commercial databases. They identified that over 35% of the 5 million structures from chemical vendors or screening centers found in the PubChem database currently are not present in the CAS registry. PubChem continues to grow in stature, content and capability. The bioassay data resulting from the NIH Roadmap initiative is likely to continue to grow and PubChem will assume a prominent role in distributing the data in a standard format. Despite the obvious value of PubChem the platform has caused quite a furor in recent years including debates regarding the position of CAS relative to the resource. The reader is referred elsewhere for commentaries [33,34]. Others have commented
  • 9. Page 9 of 37 on the quality of the data content within PubChem. Shoichet [35••] believes that the screening data are less rigorous than those in peer-reviewed articles, and contain many false positives. Shoichet worries that chemists who use PubChem may be sent on a wild goose chase. Numerous problems arise from the quality of submissions from various data sources and there are thousands of errors in the structure- identifier associations due to this contamination and this can lead to the retrieval of incorrect chemical structures. It is also common to have multiple representations of a single structure due to incomplete or total lack of stereochemistry for a molecule [36]. DSSTox The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project [38,39] provides a series of documented, standardized and fully structure-annotated files of toxicity information [40]. The initial intention for the project was to deliver a public central repository of toxicity information to allow for flexible analogue searching, SAR model development and the building of chemical relational databases. In order to ensure maximum uptake by the public and allow users to integrate the data into their own systems the DSSTox project adopted the use of the common standard file format (SDF) to include chemical structure, text and property information. The DSSTox databases was also deployed online to provide free public access to the data files without the dependency on a desktop software package for querying and managing the data files. The overall aims of the project, to deeply integrate chemical structure information with existing toxicity data and to facilitate interrogation of the data have been achieved. The DSSTox datasets are among the most highly curated public datasets available and likely the reference standard in publicly available structure-based toxicity data.
  • 10. Page 10 of 37 eMolecules eMolecules [41] offers a free online database of almost 8 million unique chemical structures. The database is assembled from data supplied by over 150 suppliers and provides a path to identifying a vendor for a particular chemical compound. By providing access to compounds for purchase they are providing a free access online service similar to those of commercial databases such as Symyx Available Chemical Directory [42], CAS’ ChemCats [43] and Cambridgesoft’s ChemACX [44] as well as a number of other providers. The system offers access to more than 4 million commercially available screening compounds and many tens of thousands of building blocks and intermediates. Their database was recently enhanced by providing access to NMR, MS and IR spectra from Wiley-VCH [45] for over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also provides links to many sources of data for spectra, physical properties and biological data including include the NIST WebBook [46], the National Cancer Institute [47], DrugBank [48•] and PubChem. eMolecules is presently fairly limited in its scope and primarily offers a very useful path to the purchase of chemicals and links to the more popular government databases. Nevertheless, the site is popular with chemists who are searching for chemicals and the interface is intuitive and easy to use, a key element in attracting users. DrugBank DrugBank [48•] is a manually curated resource assembled from the collection information of a series of other public domain databases and enhanced with
  • 11. Page 11 of 37 additional data generated within the laboratories of the hosts. The database aggregates both bioinformatics and cheminformatics data and combines detailed drug data with comprehensive drug target (i.e. protein) information. The database is hosted by the University of Alberta, Canada. Version 1 of the database, released in 2006, contained >4100 drug entries including >800 FDA approved small molecule and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug target sequences were linked to these drug entries. Each record in the database, known as a DrugCard, has >80 data fields. The information is split into drug/chemical data and drug target or protein data and many data fields are linked to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The database supports extensive text, sequence, chemical structure and relational query searches. DrugBank has been used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The version 2.0 release of DrugBank [51••] released in January of this year with over 800 new drug entries and each DrugCard entry extended to include over 100 data fields with half of the information being devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. They have started to add experimental spectral data (NMR and MS specifically), and have expanded the coverage to nutraceuticals and herbal medicines. The Drugbank team also host the Human Metabolome Database (HMDB) [52], a database containing nformation about small molecule metabolites found in the human body. The database is used by scientists working in the areas of metabolomics, clinical chemistry and biomarker discovery. The database currently contains nearly 3000 metabolite entries and each MetaboCard entry contains more
  • 12. Page 12 of 37 than 90 data fields devoted to chemical, clinical data, enzymatic and biochemical data. NMRShiftDB The NMRShiftDB is an open source collection of chemical structures and their associated NMR shift assignments [53•,54]. The database is generated as a result of contributions by the public and currently contains over 20,000 structures with >220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent to registered reviewers for evaluation. A significant part of NMRShiftDB was initially assembled from in-house databases from collaborating institutions and were entered unchecked. This called for external checks of the data based on independent databases and resources and these have now been carried out by two specific groups [56,57]. Williams et al. [56] performed a cursory examination of the structural diversity within the database and concluded that the data represented a statistically relevant set to use in an evaluation of predictive accuracy and demonstrated that the quality of the data is rather impressive. This effort shows the advantages of providing a set of Open Data for reuse and examination and the benefits of having many scientists examine, validate and correct. The benefit is possible for any database allowing its users to qualify, annotate and correct its data. ChemSpider ChemSpider was released to the public in March 2007 with the intention of “building a structure centric community for chemists”. ChemSpider has grown into a resource containing almost 18 million unique chemical structures and recently shared its data with PubChem providing about 7 million unique compounds. The data sources have been gathered from chemical vendors as well as commercial database
  • 13. Page 13 of 37 vendors and publishers and members of the Open Notebook Science community. ChemSpider has also integrated the SureChem patent database [59] collection of structures to facilitate links [60] between the systems. The database can be queried using structure/substructure searching and alphanumeric text searching of both intrinsic as well as predicted molecular properties. They have recently added virtual screening results using the LASSO similarity search tool [61] to screen the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset. ChemSpider has enabled unique capabilities relative to the primary public chemistry databases. These include real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The ChemSpider developers have made available a series of web services to allow integration to the system for the purpose of searching the system as well as generation of InChI identifiers and conversion routines. The system also integrates text-based searching of Open Access articles and presently search over 50,000 OA Chemistry articles, soon to be extended to 150,000 articles. The index is expected to increase dramatically as they extract chemical names from OA articles and convert the names to chemical structures using name to structure conversion algorithms. These chemical structures will be deposited back to the ChemSpider database thereby facilitating structure and substructure searching in concert with text-based searching. ChemSpider has a focus on, and commitment to, community curation. The social community aspects of the system demonstrate the potential of this approach. The team have committed to the release of a wiki-like environment for further annotation of the chemical structures in the database, a project they term WiChempedia. They will utilize both available Wikipedia content and deposited
  • 14. Page 14 of 37 content from users to enable the ongoing development of community curated chemistry. Other Databases The list of databases and resources reviewed above is only representative of the type of information available online. Other highly regarded databases frequented by this author include the Chemical Structure Lookup Service (with over 36 million unique structures) [64], CrystalEye [65], KEGG [49] and CheBI [50]. There are also many other resources available and the reader is referred to one of the many indexes of such databases available on the internet to identify potential resources of interest [4,66]. Public Compound Databases versus Commercial Databases The creation, hosting and support of a curated chemical compound database with integrated content is an expensive enterprise. Historically these databases have been built as a result of hundreds if not thousands of man years of rigorous and exacting human effort and then, for some of the original founders in this domain, migrated onto computer systems. In the development of these systems host organizations have created sizeable revenues and estimated annual fees for accessing this information via just a few organizations likely exceeds half a billion dollars. With the advances in technology accompanying the internet boom the hosting of large databases, the text-based searching of immense amounts of data and the ability to disseminate complex forms of graphical information via standard protocols provided an opportunity created for disruptive offerings in this domain. They soon arrived. The primary advantage of commercial databases is that they have been manually examined by skilled curators, addressing the tedious task of quality data-
  • 15. Page 15 of 37 checking. Certainly the aggregation of data from multiple sources, both historical and modern, from multiple countries and languages and from sources not available electronically are significant enhancements over what is available via an internet search. The question remains how long will this remain an issue? Scientists working in new areas of science and domains of expertise reflect on the most recent literature in general. Can you imagine a search about the semantic web being conducted just a few years ago? What about metabonomics or even genomics? Certain areas of the scientific literature, while still of high value, can become antiquated fairly quickly. With the new capabilities of internet-based searching and direct access to abstracts for the majority of publishers even a rudimentary text search can expose articles previously unavailable except through an abstracting service. Search engines will increasingly be utilized for first level searches specifically because they are simple to use, they are fast and they are free. With chemically searchable patents also available online [59,67], at no charge, the landscape for scientists searching for information is more open than ever. If there are data of interest to be located then internet search engines will enable it. The premier curated database offerings of today have an interesting if not challenging future ahead of them. Their value-added enhancements of the distributed data must be significant enough to warrant an investment in their services [68]. As expressed earlier the quality of the data resulting from curation is significant but this author questions the longevity of that distinguishing factor moving forward. Roboticized recognition and conversion of chemical names to chemical structures can dramatically shift this domain and efforts have already been demonstrated in applications to patents and publications. Should the quality reach a sufficient standard then today’s publishers business models will definitely be at risk. The Future of Public Compound Databases
  • 16. Page 16 of 37 The semantic web [69] is already offering us the chance to connect, simultaneously interrogate and mash-up the results of searching multiple public compound databases simultaneously. An enormous diversity of data is already available for interrogation by the public and continues to expand daily. This author remains concerned with the very real quality issues associated with public data sets. While the utopian dream of no errors in freely available data cannot be met the push towards more Open Data without consideration being given to both manual and robotic curation could be risky to those using the data. Real-time curation of data within public compound databases is feasible [29] and certainly Wikipedia is a model of crowd sourcing [71] to build, curate and maintain a quality database. Unfortunately, even these world-renowned platforms actually sit on the shoulders of a very few dedicated individuals, relative to the users, who care about quality. There is no simple solution to the issues of quality and it will persist for the foreseeable future until processes, procedures and momentum to resolve the issues are established. Even in its earliest form PubChem has been referred to, tongue-in-cheek, as “the granddaddy of all free chemistry databases”. Certainly it presently holds the premier position in reputation, capabilities and connectivities built on a database of chemical structures and linked out to biological assay data, the PubMed database and an array of services to facilitate both the distribution of the data and the wealth of tools developed to support the system. The majority of databases discussed in this article now uses two primary identifiers in their systems – the CAS registry number and a PubChem ID number. This alone indicates a shift in equality of commercial versus public compound repositories. For now, PubChem remains focused on its initial intent to support the National Molecular Libraries Initiative. The data within PubChem have never formally been declared as Open Data but are assumed to be
  • 17. Page 17 of 37 available in that manner and thereby offer to scientists a valuable aggregate of data for the purpose of data mining and discovery. At the time of writing the newest addition to the proliferating domain of public chemical compound databases is the ChemSpider Database [57], working to “Build a Structure Centric Community for Chemists”. This system presently offers a series of unique capabilities which might become trend-setting for present and future databases. As discussed earlier these include the user deposition of structures, real- time annotation and curation of data, management of analytical data and online transaction services. It is this authors’ belief that such capabilities will likely become standard for the majority of most public chemical compound databases in the near future. These types of capabilities could help establish the newfound shift to Open Notebook Science and shift the bias from the chemical biology databases (PubChem, Drugbank, HMDB and DSSTox) to even provide an environment for non-life science chemists, polymer chemists and material scientists to manage and research information of interest to them. The WikiSphere, Blogosphere and Internet as a Public Compound Database. Wikis and blogs are common terms now for the majority of users of the worldwide web and both are fast becoming chosen platforms for the exchange of information between many scientists, not only as tools within their own research groups but, more generally, with the public in general. A blog, or weblog is a website where entries are written in chronological order and generally provide commentary or news on a particular subject [71]. A typical blog combines text, images and links to other blogs, web pages, and other media related to its topic. The original blog posting remains untouched by the commenter and readers are free to add their comments, generally in a mediated manner where the blog host retains control over
  • 18. Page 18 of 37 the postings. An example screenshot from a chemistry-based blog hosted with the intention of examining and discussing organic syntheses is shown in Figure 3. The number of chemistry-related blogs continues to grow dramatically and there have been efforts to provide a unified view into some of these [72,73]. A wiki is a type of computer software that allows users easily to create, edit and link web pages and enables documents to be written collaboratively, in a simple markup language using a web browser, and is essentially a database for creating, browsing and searching information. Certainly Wikipedia is the most well-known today though there are many others already online and used within the confines of an organization to manage content. There are active groups supporting the development of chemistry on Wikipedia and there are now thousands of pages describing small organic molecules, inorganics, organometallics, polymers and even large biomolecules. Focusing on small molecules in general, each one has a Drug Box [75] or a Chemical infobox [76]. A drug box provides identifier information (chemical name, registry number, and so on) and commonly the identifiers link out to a related resource. Chemical data, pharmacokinetic data and therapeutic considerations can also be listed. At present there are approximately 8000 articles with a chembox or drugbox [3], with between 500-1000 articles added since May. The detailed information offered on Wikipedia regarding a particular chemical or drug can be excellent [77], see Figure 2, or weak [78]. There are many dedicated supporters and contributors to the quality of the online resource. Drug and chemboxes have been shown to contain errors but the advantage of a wiki is that changes can be made within a few keystrokes and the quality is immediately enhanced. The opposite is also true and vandalism can occur. This community curation process makes Wikipedia a very important online chemistry resource whose impact will only expand with time.
  • 19. Page 19 of 37 Wikis have recently been used as the basis of Open Notebook Science [79]. The UsefulChem Wiki [80] includes a series of experimental pages commonly linked to related blog pages as shown in Figure 4. The Open Notebook Science efforts and the movement appears to be gaining momentum with the support of vocal advocates, such as Neylon [81], Murray-Rust [82] and many others. While both wikis and blogs are very valuable for information exchange, what they enable in terms of text and image exchange is all but crippled in terms of searching by many chemists’ additional query needs for chemical structures, reactions and data. Neither Wikis nor blogs, as yet, are enabled for the purpose of structure and substructure searching and, therefore, remain isolated, in general, from cheminformatics based search procedures. One of the key developments which has already facilitated the Semantic Web for chemistry is the InChI,[83] the International Chemical Identifier. The InChI string is a textual identifier for chemical substances designed to provide a standard and human-readable way to encode molecular information (see Figure 5) and to facilitate the search for such information in databases and on the web. The InChI string, unfortunately, has only partly delivered on the promise of facilitating web-based searches, due to unpredictable breaking of InChI character strings by search engines. In order to resolve this issue the InChIKey was introduced. The condensed, 25 character InChIKey is a hashed version of the full InChI and is not human-readable. The equivalent InChIKey for the InChIString of L-ascorbic acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is one of enabling web searches, but a lookup table to identify the associated structure, or reference
  • 20. Page 20 of 37 to the original InChI String, is necessary [85]. While tens of millions of InChI strings and keys have been populated into databases, their value is still in its infancy. Publishers have started to embed InChIs into their articles and the Royal Society of Chemistry [85] is presently pioneering a new publishing model, Project Prospect, including InChI to demonstrate movement toward the semantic web for chemistry. Bloggers have started to use InChI Strings and Keys on their postings, and wiki- pages are being InChI-enabled to help the web become structure searchable. The necessity of a central lookup facility for published InChIStrings will be necessary in order to facilitate substructure searching of the web but this capability is likely to be developed in the near future. Willighagen already aggregates InChI Strings onto a blog [87]. BioSpider [88] users are able to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InCHI string, CAS number, etc.) and delivers a report about the biomolecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. A summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. An increasing number of public databases will continue to become available but the challenge, even now, is how to integrate and access the data. The implementation of InChIs for web-based searching [89], and the delivery of userscripts to aggregate information and computational results from different web resources [90] are bringing together internet resources to appear as a single monolithic public chemistry database. Willighagen et al. [90] use userscripts to
  • 21. Page 21 of 37 enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. They showed how information from web pages can be used to link to, search, and process information in other resources thereby allowing scientists to select and incorporate the appropriate web resources to enhance their productivity. Such tools connecting open chemistry databases and user web pages is an ideal path to more highly integrated information sharing. Conclusion There is little doubt that the newfound availability of public chemical compound databases with their associated chemistry and biological data is enabling scientists to access information at less cost in both time and currency. The increasing quantity of freely accessible and integrated data can speed decision making and bring clarity or alternatively inundate and saturate the user with poor quality information. Scientists now have free access to structure-searchable patents, open and free access peer-reviewed publications and software tools for the manipulation of chemistry related data. Members of the Open Source movement are developing toolkits including visualization and data-mining tools and, when coupled with the public chemistry databases reviewed here, will likely benefit the process of discovery. There are likely to be challenging times ahead in terms of meshing the needs of commercial database publishers versus proliferation of free databases but this journey will not be halted by the objections of the commercial entities provided that legal copyrights are respected and the shift towards a more open community for science persists. Acknowledgements The author wishes to thank the following people: Stephen Bryant and Evan Bolton from the PubChem team, the IUPAC/National Institute of Standards and Technology
  • 22. Page 22 of 37 InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi); David Wishart and Nelson Young (Drugbank and HMDB), Nicko Goncharoff (SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean Ekins, Joerg Wegner and Alex Tropsha specifically), Ann Richard and Marti Wolf (DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust (CrystalEye), Martin Walker, Andrew Yeung and Dirk Beestra (Wikipedia Chemistry). I would also like to acknowledge the many contributors to the blogging discussions about Open and Free Access. References 1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. (2007) 35(Database issue):D21-5. 2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nature Structural Biology (2003) 12: 980 3. Murray-Rust P: Chemistry for everyone. NATURE (2008) 451, 648-651 •Provides a vision for the future of data distribution, access and integration across the worldwide web and espouses the need for Open Data policies and adoption of the Semantic Web. 4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web: Alphabetical:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_D atabases_on_the_Web_%28Alphabetical_List%29 Classified:http://cheminfo.informatics.indiana.edu/cicc/cis/index.php/Chemistry_Dat abases_on_the_Web_%28Classified_List%29
  • 23. Page 23 of 37 •An aggregation of chemistry databases, curated and annoted, to provide significantly more information than would be returned in a generic search of the internet. 5. Symyx: CTFile formats no-fee. (2008) http://www.mdli.com/downloads/public/ctfile/ctfile.jsp 6. CAS: Chemical Abstract Services, Columbus, OH, USA (2006). http://www.cas.org/ 7. InfoChem: InfoChem Gesellschaft für Chemische Information, München, Germany (2008). http://infochem.de/ 8. Symyx: Santa Clara, California, USA (2008). http://www.symyx.com/ 9. The University’s Mandate To Mandate Open Access: Harnad S, (2008) http://openaccess.eprints.org/index.php?/archives/358-The-Universitys-Mandate-To- Mandate-Open-Access.html 10. Open Access: Wikipedia Article on Open Access. (2008) http://en.wikipedia.org/wiki/Open_access 11. The BOAI FAQ page: Frequently Accessed Questions about the Budapest Open Access Initiative (2008), http://www.earlham.edu/~peters/fos/boaifaq.htm 12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry Databases: Drug Discovery News (2008), accepted for publication 13. Open Data: Wikipedia Article on Open Data. (2008) http://en.wikipedia.org/wiki/Open_data 14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of Chemistry in the Global Electronic Age ChemInform, 36(15), (2005) • An excellent outline regarding the potential of combining open access and the semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this domain and outline in this article how data may be interconnected to the benefit of all chemists.
  • 24. Page 24 of 37 15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J , Willighagen EL: The Blue Obelisk-Interoperability in Chemical Informatics, J Chem Inf Model, (2006) 46 (3), 991-998. ••The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a group of scientists and developers supporting open source software development, consistent and complimentary chemoinformatics research, open data, and open standards in Chemistry. 16. CODATA, The Committee on Data for Science and Technology: CODATA, Paris, France (2008). http://www.codata.org/ 17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006). http://sciencecommons.org/wp- content/uploads/ScienceCommons_Concept_Paper.pdf 18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge in a Digital Age (2008). http://www.okfn.org/ 19. CAS Registry Numbers: Chemical Abstract Services, Columbus, OH, USA (2008). http://www.cas.org/expertise/cascontent/registry/regsys.html 20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of chemical information in bioscience. BMC Bioinform (2005) 6:180-196. •• Provides an overview of chemical information on the Internet and, while slightly outdated, is an important read in regards to the challenges and the vision of a Semantic Web for Chemistry. 21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data and the IUPAC International Chemical Identifier - InChI. American Chemical Society National Meeting, Washington, DC, USA (2005):CINF-60. 22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD, USA (2008). http://pubchem.ncbi.nlm.nih.gov
  • 25. Page 25 of 37 •• Pubchem is a large data aggregator (nearing 20 million structures) and offers relational searching capabilities via text, structure and substructure searching and access to the entire dataset via download of SDF files. A series of services for the handling of chemistry databases are also available via the website. 23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008). http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp 24. ChemFinder.com: CambridgeSoft Corp, Cambridge, MA, USA (2008). http://chemfinder.cambridgesoft.com/ 25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007) http://www.chemspider.com/blog/hacking-pubchem-technology-easy-quality- difficult.html. 26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of toxicity data on the Internet: Moving toward a flat world. Current Opinion in Drug Discovery & Development (2006) 9(3): 314-325. •• The review discusses efforts to gather, curate and make publicly available toxicology-related chemical information. The specific discussions regarding the quality issues with public chemistry databases and efforts to produce clean quality databases are noteworthy. 27. DSSTox Quality Chemical Information Review Procedures: US Environmental Protection Agency, Washington, DC, USA (2008). http://www.epa.gov/nheerl/dsstox/ChemicalInfQAProcedures.html 28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007) http://www.chemspider.com/docs/PubChem_at_ChemSpider_Overview_SLides_Sept ember_2007.pdf 29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008) http://www.chemspider.com/docs/The_Process_of_Curating_Identifiers_on_ChemSpi der.pdf
  • 26. Page 26 of 37 30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008) http://nihroadmap.nih.gov/ 31. Hacking PubChem: Why The Open Access Fight is Just the Beginning, Apodaca R, (2006), http://depth-first.com/articles/2006/09/22/hacking-pubchem- why-the-open-access-fight-is-just-the-beginning 32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf. Model. (2007) 47:1386-1394 •• The 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) was used as a model to determine whether automated annotation of screening hits in batch is feasible. 33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly Communication Blog: (2008) http://osc.universityofcalifornia.edu/news/acs_pubchem.html 34. Background of the PubChem/CAS Issue: (2008) http://www.arl.org/bm~doc/backgroundfaqpb.pdf 35. Baker M: Open-access chemistry databases evolving slowly but not surely:Nature Reviews, Drug Discovery, (2006) 5:707-708 • A critical review of how far publicly available initiatives have to go to catch up with commercial offerings. 36. How big is the challenge of curation and what is the structure of Ginkgolide-B: Antony Williams (2008), http://www.chemspider.com/blog/how-big- is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html
  • 27. Page 27 of 37 37 DSSTOX: Distributed Structure-Searchable Toxicity (DSSTox) Database: US Environmental Protection Agency, Washington, DC, USA (2006). http://www.epa.gov/nheerl/dsstox/ 38. Richard AM and Williams CR (2002) Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research: New Frontiers, 499:27-52. 39. Richard AM: DSSTox web site launch: Improving public access to databases for building structure-toxicity prediction models, Preclinica, (2006) 2(2):103-108. 40. DSSTox Data Files: http://www.epa.gov/ncct/dsstox/DataFiles.html 41. eMolecules Online Service: eMolecules, Del Mar, CA, USA (2008). http://www.emolecules.com 42. Available Chemical Directory: Santa Clara, California, USA (2008). http://www.mdli.com/products/experiment/available_chem_dir/index.jsp 43. ChemCats: Chemical Abstract Services, Columbus, OH, USA (2006). http://www.cas.org/expertise/cascontent/chemcats.html 44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008). http://www.cambridgesoft.com/databases/details/?db=12 45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe, (2007) 19(1):27-28 46. The NIST Chemistry WebBook: (2008) http://webbook.nist.gov/chemistry/ 47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute, Frederick/National Institutes of Health, Bethesda, MD, USA. (2008). http://dtp.nci.nih.gov/index.html 48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. (2006) 34:D668-72
  • 28. Page 28 of 37 • A detailed description of the intent, development and capabilities of the Drugbank database, one of the most respected public chemistry databases utilized by drug discovery scientists today. 49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database issue):D277-80 50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest, Nucl. Acids Res. (2008) 36: D344- D350; 51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6. •• An update regarding the DrugBank database as it is released in its Version 2 state. 52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35: D521-6 53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB– Constructing A Chemical Information System With Open Source Components. J. Chem. Inf. Comput. Sci. (2003) 43:1733-1739. •The defining article regarding the development of the NMRShiftDB database defining the intention of the work, the development of the software components and a vision of how such a platform can lead to widespread dissemination of analytical data, at no-charge, to the chemistry community.
  • 29. Page 29 of 37 54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And Structure Elucidation Support Through a Free Community-Built Web Database. Phytochemistry, (2004), 65:2711–2717. 55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C, Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf Model, (2008), Accepted for publication, doi: 10.1021/ci700363r. 56. CSEARCH and NMRShiftDB: Robien W (2007) http://nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html 57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure- Centric Community for Chemists, Chemistry International (2007) 30(1): 30. 58. Open Notebook Science: Bradley JC, (2006) Drexel CoAs E-Learning Blog, http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html 59. SureChem: San Francisco, CA, USA (2008) http://www.surechem.org/ 60. Free Access Structure Searching of Patents: Williams AJ (2007), http://www.chemspider.com/docs/Structure_Searching_of_Patents_Using_ChemSpid er.pdf 61. LASSO: Ligand Activity in Surface Similarity Order, SioBioSys Inc., Toronto, Canada. http://www.simbiosys.ca/ehits_lasso/index.html 62. Database of Useful Decoys: http://dud.docking.org./ 63. WiChempedia: ChemSpider Blog (2007) http://www.chemspider.com/blog/wichempedia-is-now-on-its-way.html 64. Chemical Structure Lookup Service: National Institutes of Health, http://cactus.nci.nih.gov/cgi-bin/lookup/search 65. CrystalEye Crystallogrpahic Database: http://wwmm.ch.cam.ac.uk/crystaleye/
  • 30. Page 30 of 37 66. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog, http://depth-first.com/articles/2007/01/24/thirty-two-free-chemistry-databases 67. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM, Almaden Services Research, San Jose, CA 95120, USA, https://chemsearch.almaden.ibm.com/chemsearch/SearchServlet 68. Kemper K, Chemical Abstracts still developing ways to help its core – scientists, Columbus Business First, http://columbus.bizjournals.com/columbus/stories/2007/06/18/story20.html?page= 1 69. Feigenbaum L, Herman I, Hongsermeier T, Neumann E, Stephens S: The Semantic Web in Action, Scientific American Magazine http://www.sciam.com/article.cfm?id=the-semantic-web-in-action 70. The Benefits of Crowdsourcing: http://en.wikipedia.org/wiki/Crowdsourcing 71. The Definition of a Blog: http://en.wikipedia.org/wiki/Blog 72. ScienceBlogs: http://scienceblogs.com/ 73. Chemical BlogSpace: http://cb.openmolecules.net/ 74. The Definition of a Wiki: http://en.wikipedia.org/wiki/Wiki 75. Wikipedia Chemical Drugbox: http://en.wikipedia.org/wiki/Template:Drugbox 76. Wikipedia Chemical Infobox: http://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox 77. Taxol on Wikipedia: http://en.wikipedia.org/wiki/Taxol 78. AP7 on Wikipedia: http://en.wikipedia.org/wiki/AP7
  • 31. Page 31 of 37 79. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature Preceedings (2007) doi:10.1038/npre.2007.39.1, http://precedings.nature.com/documents/39/version/1 80. UsefulChem Open Notebook Science: Bradley JC, Drexel University, http://usefulchem.wikispaces.com/All+Reactions and http://usefulchem- experiments1.blogspot.com/2006/05/exp-009.html 81. Open Notebook Science: Neylon C, Science in the open, An openwetware blog on the challenges of open and connected science (2008) http://blog.openwetware.org/scienceintheopen/2007/12/12/a-big-few-weeks-for- open-notebook-science/ 82. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog (2008) http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=671 83. The IUPAC International Chemical Identifier: (2008) http://www.iupac.org/inchi/ 84. The IUPAC International Chemical Identifier Software: (2008) http://www.iupac.org/inchi/release102.html 85. Royal Society of Chemistry: (2008) http://www.rsc.org/ 86. Project Prospect: (2008) RSC Publishing, http://www.rsc.org/Publishing/Journals/ProjectProspect/ 87. Chemical Blogspace, (2008) http://cb.openmolecules.net/inchis.php 88. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A web Server for Automating Metabolome Annotations. Pacific Symposium on Biocomputing, (2007) 12:145-156.
  • 32. Page 32 of 37 89. Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the Chemical Semantic Web through INChIfication. Org Biomol Chem, (2005) 3:1832-1834 90. Willighagen EL, O'Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487. •• Discusses the use of userscripts to change the appearance of web pages by modifying web content on the fly to enable aggregation of information and computational results from different web resources into a single webpage. Indicative of the future of integration and the possibilities which exist to gather information from a multitude of resources and reformat and deliver to the consumer.
  • 33. Page 33 of 37 Figures Figure 1 - The Compound Summary Page for Taxol in PubChem. Page 1 only is shown. (http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=36314)
  • 34. Page 34 of 37 Figure 2: The DrugBox for Taxol from Wikipedia (http://en.wikipedia.org/wiki/Taxol)
  • 35. Page 35 of 37 Figure 3: The TotallySynthetic.com blog. Paul Docherty discusses complex syntheses and offers readers an opportunity to comment, analyze and provide feedback. Many articles are labeled with InChIKeys to allow indexing by search engines. (http://totallysynthetic.com/blog/)
  • 36. Page 36 of 37 Figure 4: An Example UsefulChem wiki page (http://usefulchem.wikispaces.com/Exp148) This UsefulChem wiki page shows a number of important content items: 1) Links to the prior failed experiment; 2) Links to the docking results that justified making this compound; 3) Full characterization (spectroscopy and photographs) of an isolated product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4) In the discussion section a question is posed by Professor Bradley to his student, and then answered. The entire discussion history is captured. 5) A complete, detailed and dated log of the steps taken by the student; 6) In the tag section, InChIs of every compound used are provided for indexing by search engines.
  • 37. Page 37 of 37 HO O O HO InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1 CIWBSHSKHKDKBQ-JLAZNSOCBT HO OH Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.