Current opinions in drug discovery public compound databases



The internet has fast become the first port of call for all searches. The increasing array of chemistry-related resources now available provides chemists a direct path to the discovery of information, one previously accessed via library services and limited to commercial and costly resources. The diversity of information available online is expanding at a dramatic rate and a shift to publicly available resources offers significant opportunities in terms of the benefit to science and society. While the data available online do not generally meet the quality standards available from manually curated sources there are efforts afoot to gather scientists and “crowd source” an improvement in the quality of available data. This article will discuss the types of public compound databases available online, provide a series of example databases and focus on the benefits and disruptions associated with the increased availability of such data and integrating technologies to data-mine the available information.

Published in: Technology

Public Chemical Compound Databases

Antony J. Williams
Address: ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587
Corresponding Author: antony.williams@chemspider.com
PHONE: 919 341-8375

Keywords: Public databases, chemical structure databases, Open Data, chemoinformatics, data mining, internet chemistry, Wikis, blogs,
Introduction

The internet is likely used on a daily basis by the majority of scientists. There is little doubt that the web is the primary portal to query for information and data and, when coupled with the intranet services of most companies, is the tool of choice for most general searches. For many years the search for scientific information would start at the library and commonly engage skilled professionals in the domain of searching. These people had a deep understanding of navigating the plethora of databases and resources, using their own query languages, and would perform searches using for-fee resources. While such skills remain of value, most scientists now conduct the majority of their own searches and utilize their access to a no-cost, intuitive and expansive internet of information. There has been tremendous growth in scientific internet resources and there are enormous opportunities provided by such facile access to chemistry information and data.

Bioinformatics certainly established the trend of providing online access to data and Chemistry, in many ways, is far behind. Open-access databases such as GenBank [1] and the Protein Data Bank (PDB) [2] have been assisting biologists to translate gene and protein sequences into biological relevance for over two decades. It is possible that the difference in efforts results from publishers in Chemistry discouraging the open flow of data and information. This is true not only for scientific articles but also for chemistry databases. With the changing expectations of society in terms of freedom of access to information, and the efforts of many evangelists
and groups, a shift towards both free and open access (vide infra) chemistry-related information is well underway and is likely to accelerate.

Murray-Rust envisages a world in which all scientific information is instantly available [3•]. This emerging world of e-science or cyberscholarship seeks “to develop the tools, content and social attitudes to support multidisciplinary, collaborative science. Its immediate aims are to find ways of sharing information in a form that is appropriate to all readers.” This article will discuss the work already underway to support this noble and valid effort to provide enhanced public access to Chemistry data and specifically focus on public chemical compound databases.

There are many tens of indexes of chemistry databases available online and the reader is encouraged to perform one or more generic searches on “chemistry databases” to retrieve a list of related information. The author’s preferred source of information is the Wiki hosted by Gary Wiggins [4•]. While the availability of freely accessible information is clearly of value to scientists, there are risks in terms of the quality of the information available. It is this quality issue which provides the mainstream publishers, for the time being, a foothold in the domain of providing value-added access to scientific information. That said, public compound databases especially have become a disruptive force for certain commercial bodies and the threat has caused significant duress. The potential impact on the business models of publishers and the increased capabilities and diversity of data within public compound databases will also be highlighted.

Public Chemistry Databases

There are many freely available chemical compound databases on the web and they assume many different forms. They can simply be a collection of chemical structures aggregated into a single file and made available, gratis, for people to
download and utilize as they see fit. These files are generally available in the form of an SDF file [5] and can be downloaded and then imported into a database for searching and viewing. There are literally hundreds of such files available online and they are commonly made available by chemical vendors in order to advertise their catalog collections. These files generally contain chemical identifiers in the form of chemical names (systematic and trade) and registry numbers. The files can also contain experimental or physical properties, file-specific identifiers and pricing information. There are aggregators who gather such files of chemical structures and related information, assemble them into a single database and serve it to the public (some examples will be discussed later). Since the files are assembled in a heterogeneous manner, the resulting data are plagued with inconsistencies and data quality issues. Such an approach to gathering and merging data is a far cry from that taken by commercial database vendors, who manually gather and curate data. Some examples of these commercial organizations are CAS [6], InfoChem [8] and Symyx [9]. While the commercial databases offer curated data, there is certainly a price barrier to accessing the information. A number of the free online resources are also manually curated and, as will be discussed later, can offer as high a quality as the commercial offerings. These resources are, however, constructed with a specific focus in mind and therefore commonly number in the low thousands of structures rather than the millions available in the larger online databases. Meanwhile, there are a number of large online database resources offering access to valuable data and knowledge. Some of these databases should be thought of as “linkbases”. For the purpose of this article a linkbase is a repository of molecular connection tables (chemical structures) linking out to various sources of data and associated information.
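To make the SDF files mentioned above more concrete, the following is a minimal sketch of how the identifier and property fields in such a file might be read. It assumes the conventional SDF layout (a molfile block ending in `M  END`, data items introduced by `> <TAG>` header lines, and records separated by `$$$$`). The tag names `CAS_RN` and `VENDOR_NAME`, the `_title` key and the sample record are hypothetical; real vendor files use their own tags, and a production workflow would normally rely on a cheminformatics toolkit rather than hand-written parsing.

```python
import re
from typing import Dict, List

# Data header lines look like "> <TAG>" or ">  <TAG> (12)".
_HEADER = re.compile(r"^>.*?<([^>]+)>")


def _parse_record(block: List[str]) -> Dict[str, str]:
    """Extract the title line and the data items from one SDF record."""
    record = {"_title": block[0].strip()}
    in_data = False  # data items follow the "M  END" line of the molfile
    tag = None
    for line in block:
        if line.startswith("M  END"):
            in_data = True
            continue
        if not in_data:
            continue
        header = _HEADER.match(line)
        if header:
            tag = header.group(1)
            record[tag] = ""
        elif tag is not None:
            if line.strip():
                record[tag] = (record[tag] + "\n" + line.strip()).lstrip("\n")
            else:
                tag = None  # a blank line closes the data item
    return record


def parse_sdf(text: str) -> List[Dict[str, str]]:
    """Split SDF text on "$$$$" delimiter lines and parse each record."""
    records, block = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if block:
                records.append(_parse_record(block))
            block = []
        else:
            block.append(line)
    if any(line.strip() for line in block):  # tolerate a missing final delimiter
        records.append(_parse_record(block))
    return records


# Hypothetical single-record file; the tag names are illustrative only.
SAMPLE = "\n".join([
    "Water",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <CAS_RN>",
    "7732-18-5",
    "",
    "> <VENDOR_NAME>",
    "water",
    "",
    "$$$$",
])
```

Calling `parse_sdf(SAMPLE)` returns one dictionary per record. Because each depositor chooses its own tags, an aggregator merging many such files has to map them onto a common schema, which is one source of the inconsistencies described above.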
While it is impossible to be exhaustive within the confines of an article
of this nature, an overview of a number of online public compound databases, focusing specifically on free access databases, will be provided.

Confusion about the difference between Open Access (OA) and Free Access (FA) persists [9], but both offer an opportunity to help advance science by facilitating the sharing of data, information and knowledge with no barriers of price or access. The first major international statement on open access was the Budapest Open Access Initiative (BOAI), in February 2002 [10]. The definition of Open Access is as follows: “By open access to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” [11]. Free Access is not equivalent to Open Access but a simple definition has been suggested [12]: “Free access is access that removes price barriers but not necessarily any permission barriers.” For the purpose of this article we are interested not only in FA and OA but also in Open Data. Quoting from an online resource [13]: “Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control”.
As yet there are no commonly agreed-upon definitions but, as a result of the work of Open Data evangelists and groups, progress is being made [14•,15••,16-18]. The majority of scientists, however, cannot differentiate between free access and open access since both provide free access to information of value to them in
their work. In a similar way, the majority of scientists do not care about the distinctions between Open and Closed data. They utilize free access public chemical compound databases on an as-needed basis, derive value from the content and move on, not concerned whether the data posted online are Open or Closed. Chemical Abstracts Service (CAS) [5] and its CAS Registry Numbers (RNs) [19] have played a dominant role in managing a curated registry of chemical entities and related chemical and biological literature. Their proprietary registration system does not link to chemical structures in the public domain and their business model is at risk [20••,21].

Before reviewing examples of public compound databases we should review the issues of data quality. All databases containing chemical compounds contain errors. These errors can arise for a series of reasons including errors in transcription, historical errors (a compound was “correct” when entered but later re-characterized), issues with graphical representation and a plethora of other reasons. The quality of chemical information in the public domain is generally quite low. This does not mean that the data are not of value but that care needs to be taken in judging the provider as an authority. There is, of course, no central body responsible for the quality of data in the public domain. Databases of chemical structure information such as PubChem [22••], ChemIDplus [23] and ChemFinder [24] are commonly looked upon as authorities in terms of reliable information. However, these sources are also aggregators of information and are at risk of perpetuating errors from the original public data and depositions. Errors in structure-identifier pairs are common [25] and inaccurate structure representations, specifically in regards to stereochemistry, proliferate across many databases.
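One class of transcription error in structure-identifier pairs can be caught mechanically, because a CAS Registry Number carries a check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is an illustration of that check only; the function name is hypothetical and it is not drawn from any of the databases discussed here.

```python
import re

# A CAS RN has the form NNNNNNN-NN-N: two to seven digits, two digits, one check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(value: str) -> bool:
    """Return True if the string is well formed and its check digit is consistent."""
    match = _CAS_PATTERN.match(value.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For example, `is_valid_cas_rn("7732-18-5")` (water) passes, while the mistyped `"7732-18-4"` fails. A passing check digit only shows that the number is internally consistent; it says nothing about whether the number is attached to the correct structure, which still requires curation.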
A definitive description of the challenges regarding quality in public domain databases, and the rigorous processes required to aggregate quality data, was provided by Richards et al. [26••]. During their assembly of the EPA DSSTox databases they
assembled the chemical structures, chemical names and CAS Registry Numbers for over 8000 chemicals from numerous toxicity databases. The data they extracted were carefully curated and validated using multiple public information sources [27].

In regards to the quality of the chemical information presented with bioassay data on PubChem, Richards cautioned users to beware [26]. Since the chemical structure content is deposited without additional review, the user is at risk. Errors in chemical names are common, and multiple structure errors have been identified. Richards encourages users to make informed judgments on the quality of data based on prior knowledge of the data submitter. The responsibility for the quality of the PubChem database therefore rests primarily with the depositors and, as many of these are commercial chemical vendors, their focus on quality falls far short of the stringent expectations of the community. The proliferation of errors from PubChem into other databases has been identified [28] and a definitive effort to cleanse the errors from the data, be it in regards to chemical structures, names or identifiers, is going to be required. The efforts of groups such as the ChemSpider team with their online curation [29] offer an opportunity to dramatically improve the quality of the data through both a roboticized cleansing approach and manual examination by many users. Efforts such as these should help reduce errors and result in the proliferation of more validated information.

Public Compound Databases

PubChem

The highest profile online database is certainly PubChem [22]. Launched by NIH in 2004 to support the New Pathways to Discovery component of their roadmap initiative [30], PubChem archives and organizes information about the biological activities of chemical compounds into a comprehensive biomedical database and is
the informatics backbone for the initiative, intended to empower the scientific community to use small molecule chemical compounds in their research.

PubChem consists of three databases (PubChem Compound, PubChem Substance, and PubChem BioAssay) connected together. PubChem Compound contains 18 million unique structures and provides biological property information for each compound. PubChem Substance contains records of substances from depositors into the system. These are publishers, chemical vendors, commercial databases and other sources. The PubChem Compound database contains records of individual compounds (see Figure 1). PubChem BioAssay contains information about bioassays using specific terms pertinent to the bioassay.

PubChem can be searched by alphanumeric text variables such as names of chemicals, by property ranges, or by structure, substructure or structural similarity. As of December 2007 its content is approaching 38.7 million substances and 18.4 million unique structures. Such a source of data opens up new possibilities [31] in regards to data mining and extraction. Zhou et al. [32•] concluded that the system has an important role as a central repository for chemical vendors and content providers, enabling evaluation of commercial compound libraries and saving biomedical researchers from the work associated with gathering and searching commercial databases. They identified that over 35% of the 5 million structures from chemical vendors or screening centers found in the PubChem database currently are not present in the CAS registry. PubChem continues to grow in stature, content and capability.
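The relationship between depositor substances and unique compounds described above, where many deposited records collapse onto fewer unique structures, can be illustrated with a small sketch. This is not PubChem's actual processing; the field names (`substance_id`, `depositor`, `structure_key`, `name`), the vendor names and the use of an InChI string as the structure key are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_into_compounds(substances: Iterable[dict]) -> Dict[str, dict]:
    """Collapse depositor substance records into unique compounds by structure key."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in substances:
        grouped[record["structure_key"]].append(record)

    compounds = {}
    for key, records in grouped.items():
        compounds[key] = {
            "substance_ids": sorted(r["substance_id"] for r in records),
            "depositors": sorted({r["depositor"] for r in records}),
            # Differing names for one structure are a cue for manual review.
            "names": sorted({r["name"].strip().lower() for r in records}),
        }
    return compounds


# Hypothetical deposits: two vendors supply the same structure under different names.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"
DEPOSITS = [
    {"substance_id": 1, "depositor": "VendorA", "structure_key": ETHANOL, "name": "Ethanol"},
    {"substance_id": 2, "depositor": "VendorB", "structure_key": ETHANOL, "name": "ethyl alcohol"},
    {"substance_id": 3, "depositor": "VendorA", "structure_key": WATER, "name": "water"},
]
```

Here `group_into_compounds(DEPOSITS)` yields two compounds from three substances. The grouping is only as good as the structure key: if a depositor omits stereochemistry, the same molecule receives a different key and appears as a separate compound.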
The bioassay data resulting from the NIH Roadmap initiative are likely to continue to grow and PubChem will assume a prominent role in distributing the data in a standard format. Despite the obvious value of PubChem, the platform has caused quite a furor in recent years, including debates regarding the position of CAS relative to the resource. The reader is referred elsewhere for commentaries [33,34]. Others have commented
on the quality of the data content within PubChem. Shoichet [35••] believes that the screening data are less rigorous than those in peer-reviewed articles, and contain many false positives. Shoichet worries that chemists who use PubChem may be sent on a wild goose chase. Numerous problems arise from the quality of submissions from various data sources; there are thousands of errors in the structure-identifier associations due to this contamination, which can lead to the retrieval of incorrect chemical structures. It is also common to have multiple representations of a single structure due to incomplete or missing stereochemistry for a molecule [36].

DSSTox

The EPA Distributed Structure-Searchable Toxicity (DSSTox) database project [38,39] provides a series of documented, standardized and fully structure-annotated files of toxicity information [40]. The initial intention for the project was to deliver a public central repository of toxicity information to allow for flexible analogue searching, SAR model development and the building of chemical relational databases. In order to ensure maximum uptake by the public and allow users to integrate the data into their own systems, the DSSTox project adopted the common standard file format (SDF) to include chemical structure, text and property information. The DSSTox databases were also deployed online to provide free public access to the data files without dependency on a desktop software package for querying and managing the data files. The overall aims of the project, to deeply integrate chemical structure information with existing toxicity data and to facilitate interrogation of the data, have been achieved. The DSSTox datasets are among the most highly curated public datasets available and likely the reference standard in publicly available structure-based toxicity data.
eMolecules

eMolecules [41] offers a free online database of almost 8 million unique chemical structures. The database is assembled from data supplied by over 150 suppliers and provides a path to identifying a vendor for a particular chemical compound. By providing access to compounds for purchase they are providing a free access online service similar to those of commercial databases such as the Symyx Available Chemical Directory [42], CAS’ ChemCats [43] and CambridgeSoft’s ChemACX [44] as well as a number of other providers. The system offers access to more than 4 million commercially available screening compounds and many tens of thousands of building blocks and intermediates. Their database was recently enhanced by providing access to NMR, MS and IR spectra from Wiley-VCH [45] for over 500,000 compounds via ChemGate [45], a fee-based service. eMolecules also provides links to many sources of data for spectra, physical properties and biological data, including the NIST WebBook [46], the National Cancer Institute [47], DrugBank [48•] and PubChem.

eMolecules is presently fairly limited in its scope and primarily offers a very useful path to the purchase of chemicals and links to the more popular government databases. Nevertheless, the site is popular with chemists who are searching for chemicals and the interface is intuitive and easy to use, a key element in attracting users.

DrugBank

DrugBank [48•] is a manually curated resource assembled from information collected from a series of other public domain databases and enhanced with
additional data generated within the laboratories of the hosts. The database aggregates both bioinformatics and cheminformatics data and combines detailed drug data with comprehensive drug target (i.e. protein) information. The database is hosted by the University of Alberta, Canada. Version 1 of the database, released in 2006, contained >4100 drug entries including >800 FDA approved small molecule and biotech drugs as well as >3200 experimental drugs. Over 14,000 protein or drug target sequences were linked to these drug entries. Each record in the database, known as a DrugCard, has >80 data fields. The information is split into drug/chemical data and drug target or protein data, and many data fields are linked to other databases (KEGG [49], PubChem, ChEBI [50], PDB [2] and others). The database supports extensive text, sequence, chemical structure and relational query searches. DrugBank has been used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The version 2.0 release of DrugBank [51••] was released in January of this year with over 800 new drug entries and each DrugCard entry extended to include over 100 data fields, with half of the information devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. They have started to add experimental spectral data (NMR and MS specifically), and have expanded the coverage to nutraceuticals and herbal medicines.

The DrugBank team also host the Human Metabolome Database (HMDB) [52], a database containing information about small molecule metabolites found in the human body. The database is used by scientists working in the areas of metabolomics, clinical chemistry and biomarker discovery. The database currently contains nearly 3000 metabolite entries and each MetaboCard entry contains more
than 90 data fields devoted to chemical, clinical, enzymatic and biochemical data.

NMRShiftDB

The NMRShiftDB is an open source collection of chemical structures and their associated NMR shift assignments [53•,54]. The database is generated as a result of contributions by the public and currently contains over 20,000 structures with >220,000 assigned carbon chemical shifts. Datasets entered by contributors are sent to registered reviewers for evaluation. A significant part of NMRShiftDB was initially assembled from in-house databases from collaborating institutions and was entered unchecked. This called for external checks of the data based on independent databases and resources and these have now been carried out by two specific groups [56,57]. Williams et al. [56] performed a cursory examination of the structural diversity within the database, concluded that the data represented a statistically relevant set to use in an evaluation of predictive accuracy, and demonstrated that the quality of the data is rather impressive. This effort shows the advantages of providing a set of Open Data for reuse and examination and the benefits of having many scientists examine, validate and correct the data. The same benefit is possible for any database that allows its users to qualify, annotate and correct its data.

ChemSpider

ChemSpider was released to the public in March 2007 with the intention of “building a structure centric community for chemists”. ChemSpider has grown into a resource containing almost 18 million unique chemical structures and recently shared its data with PubChem, providing about 7 million unique compounds. The data sources have been gathered from chemical vendors as well as commercial database
vendors and publishers and members of the Open Notebook Science community. ChemSpider has also integrated the SureChem patent database [59] collection of structures to facilitate links [60] between the systems. The database can be queried using structure/substructure searching and alphanumeric text searching of both intrinsic and predicted molecular properties. They have recently added virtual screening results using the LASSO similarity search tool [61] to screen the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

ChemSpider has enabled unique capabilities relative to the primary public chemistry databases. These include real-time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The ChemSpider developers have made available a series of web services to allow integration with the system for the purpose of searching the system as well as generation of InChI identifiers and conversion routines.

The system also integrates text-based searching of Open Access articles and presently searches over 50,000 OA Chemistry articles, soon to be extended to 150,000 articles. The index is expected to increase dramatically as they extract chemical names from OA articles and convert the names to chemical structures using name-to-structure conversion algorithms. These chemical structures will be deposited back to the ChemSpider database, thereby facilitating structure and substructure searching in concert with text-based searching.

ChemSpider has a focus on, and commitment to, community curation. The social community aspects of the system demonstrate the potential of this approach. The team have committed to the release of a wiki-like environment for further annotation of the chemical structures in the database, a project they term WiChempedia.
They will utilize both available Wikipedia content and deposited
content from users to enable the ongoing development of community curated chemistry.

Other Databases

The list of databases and resources reviewed above is only representative of the type of information available online. Other highly regarded databases frequented by this author include the Chemical Structure Lookup Service (with over 36 million unique structures) [64], CrystalEye [65], KEGG [49] and ChEBI [50]. There are also many other resources available and the reader is referred to one of the many indexes of such databases available on the internet to identify potential resources of interest [4,66].

Public Compound Databases versus Commercial Databases

The creation, hosting and support of a curated chemical compound database with integrated content is an expensive enterprise. Historically these databases have been built as a result of hundreds if not thousands of man years of rigorous and exacting human effort and then, for some of the original founders in this domain, migrated onto computer systems. In the development of these systems host organizations have created sizeable revenues, and estimated annual fees for accessing this information via just a few organizations likely exceed half a billion dollars. With the advances in technology accompanying the internet boom, the hosting of large databases, the text-based searching of immense amounts of data and the ability to disseminate complex forms of graphical information via standard protocols created an opportunity for disruptive offerings in this domain. They soon arrived.

The primary advantage of commercial databases is that they have been manually examined by skilled curators, addressing the tedious task of quality data-checking. Certainly the aggregation of data from multiple sources, both historical and modern, from multiple countries and languages and from sources not available electronically are significant enhancements over what is available via an internet search. The question remains: how long will this remain an issue? Scientists working in new areas of science and domains of expertise generally consult the most recent literature. Can you imagine a search about the semantic web being conducted just a few years ago? What about metabonomics or even genomics? Certain areas of the scientific literature, while still of high value, can become antiquated fairly quickly. With the new capabilities of internet-based searching and direct access to abstracts for the majority of publishers, even a rudimentary text search can expose articles previously unavailable except through an abstracting service. Search engines will increasingly be utilized for first level searches specifically because they are simple to use, they are fast and they are free. With chemically searchable patents also available online [59,67], at no charge, the landscape for scientists searching for information is more open than ever. If there are data of interest to be located then internet search engines will enable their discovery.

The premier curated database offerings of today have an interesting if not challenging future ahead of them. Their value-added enhancements of the distributed data must be significant enough to warrant an investment in their services [68]. As expressed earlier, the quality of the data resulting from curation is significant but this author questions the longevity of that distinguishing factor moving forward. Roboticized recognition and conversion of chemical names to chemical structures can dramatically shift this domain and efforts have already been demonstrated in applications to patents and publications.
Should the quality reach a sufficient standard then today’s publishers’ business models will definitely be at risk.

The Future of Public Compound Databases
The semantic web [69] is already offering us the chance to connect, simultaneously interrogate and mash up the results of searching multiple public compound databases. An enormous diversity of data is already available for interrogation by the public and continues to expand daily. This author remains concerned with the very real quality issues associated with public data sets. While the utopian dream of no errors in freely available data cannot be met, the push towards more Open Data without consideration being given to both manual and robotic curation could be risky to those using the data. Real-time curation of data within public compound databases is feasible [29] and certainly Wikipedia is a model of crowd sourcing [71] to build, curate and maintain a quality database. Unfortunately, even these world-renowned platforms actually sit on the shoulders of a very few dedicated individuals, relative to the number of users, who care about quality. There is no simple solution to the issue of quality and it will persist for the foreseeable future until processes, procedures and momentum to resolve it are established.

Even in its earliest form PubChem has been referred to, tongue-in-cheek, as “the granddaddy of all free chemistry databases”. Certainly it presently holds the premier position in reputation, capabilities and connectivities, built on a database of chemical structures linked out to biological assay data, the PubMed database and an array of services to facilitate both the distribution of the data and the wealth of tools developed to support the system. The majority of databases discussed in this article now use two primary identifiers in their systems: the CAS registry number and a PubChem ID number. This alone indicates a shift in the balance between commercial and public compound repositories. For now, PubChem remains focused on its initial intent to support the National Molecular Libraries Initiative.
The data within PubChem have never formally been declared as Open Data but are assumed to be
available in that manner and thereby offer to scientists a valuable aggregate of data for the purpose of data mining and discovery.

At the time of writing the newest addition to the proliferating domain of public chemical compound databases is the ChemSpider Database [57], working to “Build a Structure Centric Community for Chemists”. This system presently offers a series of unique capabilities which might become trend-setting for present and future databases. As discussed earlier these include the user deposition of structures, real-time annotation and curation of data, management of analytical data and online transaction services. It is this author’s belief that such capabilities will likely become standard for the majority of public chemical compound databases in the near future. These types of capabilities could help establish the newfound shift to Open Notebook Science and shift the bias away from the chemical biology databases (PubChem, DrugBank, HMDB and DSSTox) to provide an environment in which non-life science chemists, polymer chemists and material scientists can also manage and research information of interest to them.

The WikiSphere, Blogosphere and Internet as a Public Compound Database

Wikis and blogs are now common terms for the majority of users of the worldwide web and both are fast becoming chosen platforms for the exchange of information between many scientists, not only as tools within their own research groups but, more generally, with the public. A blog, or weblog, is a website where entries are written in chronological order and generally provide commentary or news on a particular subject [71]. A typical blog combines text, images and links to other blogs, web pages, and other media related to its topic. The original blog posting remains untouched by commenters and readers are free to add their comments, generally in a mediated manner where the blog host retains control over
the postings. An example screenshot from a chemistry-based blog hosted with the intention of examining and discussing organic syntheses is shown in Figure 3. The number of chemistry-related blogs continues to grow dramatically and there have been efforts to provide a unified view into some of these [72,73].

A wiki is a type of computer software that allows users to easily create, edit and link web pages and enables documents to be written collaboratively, in a simple markup language using a web browser; it is essentially a database for creating, browsing and searching information. Wikipedia is certainly the best known today, though there are many others already online or used within the confines of an organization to manage content. There are active groups supporting the development of chemistry on Wikipedia and there are now thousands of pages describing small organic molecules, inorganics, organometallics, polymers and even large biomolecules. Focusing on small molecules in general, each one has a Drug Box [75] or a Chemical infobox [76]. A drug box provides identifier information (chemical name, registry number, and so on) and commonly the identifiers link out to a related resource. Chemical data, pharmacokinetic data and therapeutic considerations can also be listed. At present there are approximately 8000 articles with a chembox or drugbox [3], with between 500 and 1000 articles added since May. The detailed information offered on Wikipedia regarding a particular chemical or drug can be excellent [77] (see Figure 2) or weak [78]. There are many dedicated supporters of, and contributors to, the quality of the online resource. Drug boxes and chemboxes have been shown to contain errors, but the advantage of a wiki is that changes can be made within a few keystrokes and the quality is immediately enhanced. The opposite is also true and vandalism can occur. This community curation process makes Wikipedia a very important online chemistry resource whose impact will only expand with time.
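To make the curation point concrete, the following is a minimal sketch of how identifier fields could be pulled out of a drug box and checked for gaps. It assumes the box is written as `| field = value` lines inside a template; the sample text, the field names (`IUPAC_name`, `CAS_number`, `PubChem`, `InChIKey`) and both helper functions are illustrative placeholders, not content copied from Wikipedia or part of any Wikipedia tool.

```python
import re

# Hypothetical drug box text with placeholder values (not a real article).
SAMPLE_INFOBOX = """{{Drugbox
| IUPAC_name = example-name
| CAS_number = 00-00-0
| PubChem = 12345
}}"""


def parse_infobox(wikitext):
    """Collect '| field = value' lines of a template into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def missing_identifiers(fields, required):
    """Return the required field names that are absent or empty."""
    return [name for name in required if not fields.get(name)]


fields = parse_infobox(SAMPLE_INFOBOX)
print(fields)
print(missing_identifiers(fields, ["CAS_number", "InChIKey"]))
```

A check of this kind only flags missing fields; deciding whether a value that is present is actually correct still relies on the community review described above.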
Wikis have recently been used as the basis of Open Notebook Science [79]. The UsefulChem Wiki [80] includes a series of experimental pages commonly linked to related blog pages as shown in Figure 4. The Open Notebook Science movement appears to be gaining momentum with the support of vocal advocates such as Neylon [81], Murray-Rust [82] and many others.

While both wikis and blogs are very valuable for the exchange of text and images, they fall well short of many chemists’ additional need to query by chemical structures, reactions and data. Neither wikis nor blogs are, as yet, enabled for structure and substructure searching and they therefore remain isolated, in general, from cheminformatics-based search procedures. One of the key developments which has already facilitated the Semantic Web for chemistry is the InChI [83], the International Chemical Identifier. The InChI string is a textual identifier for chemical substances designed to provide a standard and human-readable way to encode molecular information (see Figure 5) and to facilitate the search for such information in databases and on the web. The InChI string, unfortunately, has only partly delivered on the promise of facilitating web-based searches, due to unpredictable breaking of InChI character strings by search engines. In order to resolve this issue the InChIKey was introduced. The condensed, 25 character InChIKey is a hashed version of the full InChI and is not human-readable. The equivalent InChIKey for the InChI string of L-ascorbic acid is CIWBSHSKHKDKBQ-JLAZNSOCBT. The advantage of the key is that it enables web searches, but a lookup table to identify the associated structure, or reference
to the original InChI string, is necessary [85]. While tens of millions of InChI strings and keys have been populated into databases, their use is still in its infancy. Publishers have started to embed InChIs into their articles and the Royal Society of Chemistry [85] is presently pioneering a new publishing model, Project Prospect, including InChI to demonstrate movement toward the semantic web for chemistry. Bloggers have started to use InChI strings and keys in their postings, and wiki pages are being InChI-enabled to help the web become structure searchable. A central lookup facility for published InChI strings will be necessary in order to facilitate substructure searching of the web, but this capability is likely to be developed in the near future. Willighagen already aggregates InChI strings onto a blog [87].

BioSpider [88] allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and delivers a report about the biomolecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. A summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule.

An increasing number of public databases will continue to become available but the challenge, even now, is how to integrate and access the data. The implementation of InChIs for web-based searching [89], and the delivery of userscripts to aggregate information and computational results from different web resources [90], are bringing together internet resources so that they appear as a single monolithic public chemistry database. Willighagen et al. [90] use userscripts to
enrich biology and chemistry related web resources by incorporating or linking to other computational or data sources on the web. They showed how information from web pages can be used to link to, search, and process information in other resources, thereby allowing scientists to select and incorporate the appropriate web resources to enhance their productivity. Such tools, connecting open chemistry databases and user web pages, are an ideal path to more highly integrated information sharing.

Conclusion

There is little doubt that the newfound availability of public chemical compound databases with their associated chemistry and biological data is enabling scientists to access information at less cost in both time and money. The increasing quantity of freely accessible and integrated data can speed decision making and bring clarity or, alternatively, inundate and saturate the user with poor quality information. Scientists now have free access to structure-searchable patents, open and free access peer-reviewed publications and software tools for the manipulation of chemistry-related data. Members of the Open Source movement are developing toolkits including visualization and data-mining tools which, when coupled with the public chemistry databases reviewed here, will likely benefit the process of discovery. There are likely to be challenging times ahead in terms of meshing the needs of commercial database publishers with the proliferation of free databases, but this journey will not be halted by the objections of the commercial entities provided that legal copyrights are respected and the shift towards a more open community for science persists.

Acknowledgements

The author wishes to thank the following people: Stephen Bryant and Evan Bolton from the PubChem team, the IUPAC/National Institute of Standards and Technology
InChI team (Alan McNaught, Stephen Stein, Stephen Heller, Dmitrii Tchekhovskoi), David Wishart and Nelson Young (DrugBank and HMDB), Nicko Goncharoff (SureChem), Stephen Boyer (IBM), Marc Nicklaus (Chemical Structure Lookup Service), members of the ChemSpider Advisory Group (Egon Willighagen, Sean Ekins, Joerg Wegner and Alex Tropsha specifically), Ann Richard and Marti Wolf (DSSTox), Christoph Steinbeck (NMRShiftDB), Nick Day and Peter Murray-Rust (CrystalEye), Martin Walker, Andrew Yeung and Dirk Beestra (Wikipedia Chemistry). The author would also like to acknowledge the many contributors to the blogging discussions about Open and Free Access.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res. (2007) 35(Database issue):D21-5.
2. Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nature Structural Biology (2003) 12: 980
3. Murray-Rust P: Chemistry for everyone. Nature (2008) 451, 648-651
• Provides a vision for the future of data distribution, access and integration across the worldwide web and espouses the need for Open Data policies and adoption of the Semantic Web.
4. Gary Wiggins’ Wiki. CHEMBIOGRID, Chemistry Databases on the Web: Alphabetical:
• An aggregation of chemistry databases, curated and annotated, to provide significantly more information than would be returned in a generic search of the internet.
5. Symyx: CTFile formats no-fee. (2008)
6. CAS: Chemical Abstracts Service, Columbus, OH, USA (2006).
7. InfoChem: InfoChem Gesellschaft für Chemische Information, München, Germany (2008).
8. Symyx: Santa Clara, California, USA (2008).
9. The University’s Mandate To Mandate Open Access: Harnad S, (2008)
10. Open Access: Wikipedia Article on Open Access. (2008)
11. The BOAI FAQ page: Frequently Asked Questions about the Budapest Open Access Initiative (2008)
12. Williams AJ: A perspective of Publicly Accessible/Open Access Chemistry Databases: Drug Discovery News (2008), accepted for publication
13. Open Data: Wikipedia Article on Open Data. (2008)
14. Murray-Rust P, Rzepa HS, Tyrrell SM and Zhang Y: Representation and use of Chemistry in the Global Electronic Age. ChemInform, 36(15), (2005)
• An excellent outline regarding the potential of combining open access and the semantic web in chemistry. Rzepa and Murray-Rust are two of the evangelists of this domain and outline in this article how data may be interconnected to the benefit of all chemists.
15. Guha R, Howard MT, Hutchison GR, Murray-Rust P, Rzepa H, Steinbeck C, Wegner J, Willighagen EL: The Blue Obelisk-Interoperability in Chemical Informatics, J Chem Inf Model, (2006) 46 (3), 991-998.
•• The Blue Obelisk Movement is the name used by a group of scientists and developers supporting open source software development, consistent and complementary chemoinformatics research, open data, and open standards in Chemistry.
16. CODATA, The Committee on Data for Science and Technology: CODATA, Paris, France (2008).
17. An Introduction to Science Commons: Wilbanks J, Boyle J, (2006).
18. The Open Knowledge Foundation: Protecting and Promoting Open Knowledge in a Digital Age (2008).
19. CAS Registry Numbers: Chemical Abstracts Service, Columbus, OH, USA (2008).
20. Murray-Rust P, Mitchell JB, Rzepa HS: Communication and re-use of chemical information in bioscience. BMC Bioinform (2005) 6:180-196.
•• Provides an overview of chemical information on the Internet and, while slightly outdated, is an important read in regards to the challenges and the vision of a Semantic Web for Chemistry.
21. Heller SR, Stein SE, Tchekhovskoi DV: Open source/open access/open data and the IUPAC International Chemical Identifier - InChI. American Chemical Society National Meeting, Washington, DC, USA (2005):CINF-60.
22. NCBI: PubChem: National Center for Biotechnology Information, Bethesda, MD, USA (2008).
•• PubChem is a large data aggregator (nearing 20 million structures) and offers relational searching capabilities via text, structure and substructure searching and access to the entire dataset via download of SDF files. A series of services for the handling of chemistry databases are also available via the website.
23. ChemIDplus: National Library of Medicine, Bethesda, MD, USA (2008).
24. CambridgeSoft Corp, Cambridge, MA, USA (2008).
25. Hacking Pubchem - Technology easy, Quality difficult: Williams AJ, (2007)
26. Richard AM, Swirsky Gold L, Nicklaus MC: Chemical structure indexing of toxicity data on the Internet: Moving toward a flat world. Current Opinion in Drug Discovery & Development (2006) 9(3): 314-325.
•• The review discusses efforts to gather, curate and make publicly available toxicology-related chemical information. The specific discussions regarding the quality issues with public chemistry databases and efforts to produce clean quality databases are noteworthy.
27. DSSTox Quality Chemical Information Review Procedures: US Environmental Protection Agency, Washington, DC, USA (2008).
28. PubChem Errors: Williams AJ, PubChem Meeting, Washington DC: (2007)
29. The Process of Curating Identifiers on ChemSpider: Williams AJ, (2008)
30. The NIH Roadmap Initiative: Office of Portfolio Analysis and Strategic Initiatives, National Institutes of Health, Bethesda, Maryland 20892: (2008)
31. Hacking PubChem: Why The Open Access Fight is Just the Beginning, Apodaca R, (2006)
32. Zhou Y, Chen K, Yan SF, King FJ, Jiang S, Winzeler EA: Large-Scale Annotation of Small-Molecule Libraries Using Public Databases. J. Chem. Inf. Model. (2007) 47:1386-1394
•• The 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) was used as a model to determine whether automated annotation of screening hits in batch is feasible.
33. The American Chemical Society and NIH’s PubChem, Reshaping Scholarly Communication Blog: (2008)
34. Background of the PubChem/CAS Issue: (2008)
35. Baker M: Open-access chemistry databases evolving slowly but not surely: Nature Reviews, Drug Discovery, (2006) 5:707-708
• A critical review of how far publicly available initiatives have to go to catch up with commercial offerings.
36. How big is the challenge of curation and what is the structure of Ginkgolide-B: Antony Williams (2008)
37. DSSTox: Distributed Structure-Searchable Toxicity (DSSTox) Database: US Environmental Protection Agency, Washington, DC, USA (2006).
38. Richard AM and Williams CR (2002) Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network: A Proposal, Mutation Research: New Frontiers, 499:27-52.
39. Richard AM: DSSTox web site launch: Improving public access to databases for building structure-toxicity prediction models, Preclinica, (2006) 2(2):103-108.
40. DSSTox Data Files:
41. eMolecules Online Service: eMolecules, Del Mar, CA, USA (2008). http://www.emolecules.com
42. Available Chemical Directory: Santa Clara, California, USA (2008).
43. ChemCats: Chemical Abstracts Service, Columbus, OH, USA (2006).
44. ChemACX: CambridgeSoft Corp, Cambridge, MA, USA (2008).
45. ChemGate: Tony Davies, eMolecules and Spectroscopy: Spectroscopy Europe, (2007) 19(1):27-28
46. The NIST Chemistry WebBook: (2008)
47. NCI/NIH Developmental Therapeutics Program: National Cancer Institute, Frederick/National Institutes of Health, Bethesda, MD, USA. (2008).
48. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. (2006) 34:D668-72
• A detailed description of the intent, development and capabilities of the DrugBank database, one of the most respected public chemistry databases utilized by drug discovery scientists today.
49. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: KEGG: The KEGG resource for deciphering the genome, Nucleic Acids Res. (2004) 32 (Database issue):D277-80
50. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcántara R, Darsow M, Guedj M, Ashburner M: ChEBI: a database and ontology for chemical entities of biological interest, Nucl. Acids Res. (2008) 36: D344-D350
51. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M: DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. (2008) 36(Database issue):D901-6.
•• An update regarding the DrugBank database as it is released in its Version 2 state.
52. HMDB: The Human Metabolome Database. Nucleic Acids Res. (2007) 35:D521-6
53. Steinbeck C, Krause S, Kuhn S: NMRShiftDB – Constructing A Chemical Information System With Open Source Components. J. Chem. Inf. Comput. Sci. (2003) 43:1733-1739.
• The defining article regarding the development of the NMRShiftDB database, describing the intention of the work, the development of the software components and a vision of how such a platform can lead to widespread dissemination of analytical data, at no charge, to the chemistry community.
54. Steinbeck C, Kuhn S. NMRShiftDB – Compound Identification And Structure Elucidation Support Through a Free Community-Built Web Database. Phytochemistry, (2004), 65:2711–2717.
55. Blinov KA, Smurnyy YD, Elyashberg ME, Churanova TS, Kvasha M, Steinbeck C, Lefebvre BA, Williams AJ: Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source. J Chem Inf Model, (2008), Accepted for publication, doi: 10.1021/ci700363r.
56. CSEARCH and NMRShiftDB: Robien W (2007)
57. Williams AJ, ChemSpider and Its Expanding Web: Building a Structure-Centric Community for Chemists, Chemistry International (2007) 30(1): 30.
58. Open Notebook Science: Bradley JC, (2006) Drexel CoAS E-Learning Blog
59. SureChem: San Francisco, CA, USA (2008)
60. Free Access Structure Searching of Patents: Williams AJ (2007)
61. LASSO: Ligand Activity in Surface Similarity Order, SimBioSys Inc., Toronto, Canada.
62. Database of Useful Decoys:
63. WiChempedia: ChemSpider Blog (2007)
64. Chemical Structure Lookup Service: National Institutes of Health
65. CrystalEye Crystallographic Database:
66. Thirty Two Free Chemistry Databases: Apodaca R, Depth-First Blog
67. IBM’s Online Patent Search: (2008) IBM Chemical Search Alpha, IBM, Almaden Services Research, San Jose, CA 95120, USA
68. Kemper K, Chemical Abstracts still developing ways to help its core – scientists, Columbus Business First
69. Feigenbaum L, Herman I, Hongsermeier T, Neumann E, Stephens S: The Semantic Web in Action, Scientific American Magazine
70. The Benefits of Crowdsourcing:
71. The Definition of a Blog:
72. ScienceBlogs:
73. Chemical BlogSpace:
74. The Definition of a Wiki:
75. Wikipedia Chemical Drugbox:
76. Wikipedia Chemical Infobox:
77. Taxol on Wikipedia:
78. AP7 on Wikipedia:
79. Bradley JC, Open Notebook Science Using Blogs and Wikis, Nature Precedings (2007) doi:10.1038/npre.2007.39.1
80. UsefulChem Open Notebook Science: Bradley JC, Drexel University
81. Open Notebook Science: Neylon C, Science in the open, An openwetware blog on the challenges of open and connected science (2008)
82. Open Notebook Science NMR: Murray-Rust P, A Scientist and the Web Blog (2008)
83. The IUPAC International Chemical Identifier: (2008)
84. The IUPAC International Chemical Identifier Software: (2008)
85. Royal Society of Chemistry: (2008)
86. Project Prospect: (2008) RSC Publishing
87. Chemical Blogspace, (2008)
88. Knox C, Shrivastava S, Stothard P, Eisner R, Wishart DS: BioSpider, A web Server for Automating Metabolome Annotations. Pacific Symposium on Biocomputing, (2007) 12:145-156.
89. Cole SJ, Day NE, Murray-Rust P, Rzepa HS, Zhang Y: Enhancement of the Chemical Semantic Web through InChIfication. Org Biomol Chem, (2005) 3:1832-1834
90. Willighagen EL, O’Boyle NM, Gopalakrishnan H, Jiao D, Guha R, Steinbeck C and Wild DJ: Userscripts for the Life Sciences. BMC Bioinformatics, (2007) 8:487.
•• Discusses the use of userscripts to change the appearance of web pages by modifying web content on the fly to enable aggregation of information and computational results from different web resources into a single webpage. Indicative of the future of integration and the possibilities which exist to gather information from a multitude of resources and reformat and deliver it to the consumer.
Figures

Figure 1: The Compound Summary Page for Taxol in PubChem. Page 1 only is shown.

Figure 2: The DrugBox for Taxol from Wikipedia.

Figure 3: The blog. Paul Docherty discusses complex syntheses and offers readers an opportunity to comment, analyze and provide feedback. Many articles are labeled with InChIKeys to allow indexing by search engines.

Figure 4: An example UsefulChem wiki page. The UsefulChem wiki page shows a number of important content items: 1) Links to the prior failed experiment; 2) Links to the docking results that justified making this compound; 3) Full characterization (spectroscopy and photographs) of an isolated product, with interactive NMRs (JSpecView/JCAMP-dx) of the starting materials; 4) In the discussion section a question is posed by Professor Bradley to his student, and then answered. The entire discussion history is captured; 5) A complete, detailed and dated log of the steps taken by the student; 6) In the tag section, InChIs of every compound used are provided for indexing by search engines.
InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1
CIWBSHSKHKDKBQ-JLAZNSOCBT

Figure 5: The InChI String (top) and InChI Key (bottom) for L-ascorbic acid.
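The two identifiers in Figure 5 can be used to illustrate the points made in the text about InChI strings, InChIKeys and lookup tables. The following is a minimal sketch using only the Python standard library. The function names, the punctuation-based tokenizer (a stand-in for how a generic text indexer might behave, not a model of any particular search engine) and the one-entry lookup table are all assumptions made for illustration. The key pattern describes only the 25-character layout printed in Figure 5; later InChI software releases use a longer key, and the sketch does not compute keys from structures.

```python
import re

# Identifiers copied from Figure 5 (L-ascorbic acid).
INCHI = "InChI=1/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-10H,1H2/t2-,5+/m0/s1"
INCHIKEY = "CIWBSHSKHKDKBQ-JLAZNSOCBT"


def split_inchi_layers(inchi):
    """Split an InChI string into version, formula and one-letter-prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. 'c' connectivity, 'h' hydrogens
    return layers


def naive_tokens(text):
    """Break text on every non-alphanumeric character (assumed indexer behaviour)."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def matches_figure5_key_layout(key):
    """Check for 14 capital letters, a hyphen, then 10 capital letters."""
    return re.fullmatch(r"[A-Z]{14}-[A-Z]{10}", key) is not None


# Hypothetical lookup table: the key is a hash, so the structure cannot be
# recovered from it and must be stored alongside it.
KEY_TO_INCHI = {INCHIKEY: INCHI}


def resolve_key(key):
    """Return the InChI string registered for a key, or None if unknown."""
    return KEY_TO_INCHI.get(key)


print(split_inchi_layers(INCHI))
print(len(naive_tokens(INCHI)), "fragments from the InChI string")
print(len(naive_tokens(INCHIKEY)), "fragments from the InChIKey")
print(resolve_key(INCHIKEY) == INCHI)
```

Under this tokenizer the InChI string falls apart into many short fragments while the key splits into just two blocks, which is one way to picture why the key is the easier target for web searching and why a lookup step is then needed to get back to the structure.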