Utilizing Open Source Software to Facilitate Communication of Chemistry at RSC


This is a copy of the RSC-ChemInformatics team's chapter that was contributed to the recently published book: Open source software in life science research: Practical solutions to common challenges in the pharmaceutical industry and beyond (Woodhead Publishing Series in Biomedicine)



PUBLISHED IN THE BOOK: Open source software in life science research: Practical solutions to common challenges in the pharmaceutical industry and beyond (Woodhead Publishing Series in Biomedicine)
Title: Utilizing Open Source Software to Facilitate Communication of Chemistry at RSC
Authors: Aileen Day, Antony Williams*, Colin Batchelor, Richard Kidd and Valery Tkachenko
Organization: Royal Society of Chemistry

INTRODUCTION

The Royal Society of Chemistry (RSC) is the largest organization in Europe with the specific mission of advancing the chemical sciences. Supported by a worldwide network of 47,000 members and an international publishing business, our activities span education, conferences, science policy and the promotion of chemistry to the public. The information handling requirements of the publishing division have always consumed the largest proportion of the available software development resources, which have traditionally been dedicated to building robust and well-defined enterprise systems that deliver published content to customers. Internal adoption of open source solutions was initiated with the development of Project Prospect [1], and then extended with the acquisition of ChemSpider [2]. ChemSpider delivered a platform incorporating much open-source software, staff expertise in cheminformatics, and new and innovative functionality. The small but agile in-house development team have combined commercial and Free/Open Source software tools to develop the platforms necessary to deliver capabilities to the user community.
This book chapter will review the systems that have been developed in-house, what they will deliver to the community, the challenges encountered in utilizing these tools and how they have been extended to make them fit for purpose.

PROJECT PROSPECT AND OPEN ONTOLOGIES

RSC began exploring the semantic markup of chemistry articles, together with a number of other publishers, in 2002, providing support for a number of summer student projects at the Unilever Centre in Cambridge University. This work led to an open source Experimental Data Checker [3] which parsed the text of experimental data paragraphs and performed validation checks on the extracted and formatted results. This collaboration led to RSC involvement, as well as collaboration with Nature Publishing Group [4] and the International Union of
Crystallography [5], in the SciBorg project [6]. OSCAR [7] (Open Source Chemistry Analysis Routines), developed as a result, was then explored as a means of marking up chemical text and linking concepts and chemicals with other resources, and was ultimately used as the text mining service underpinning the award winning “Project Prospect” [1] (see Figure 1).

Figure 1: A “prospected” article from RSC. Chemical terms link out to ontology definitions and related articles as appropriate and chemical names link out to ChemSpider (vide infra).

It was essential to develop a solution that was both flexible and cost effective during this project. Software development was started from scratch, using standards where possible, but still facing numerous unknowns. Licensing a commercial product for semantic markup would have been difficult to justify and also risked both inflexibility and potential limitations in terms of rapid development. As a result it was decided to work alongside members of the OSCAR development team, contributing back to the open source end product and providing a real
business case to drive improvements in OSCAR. This enabled the creation of a parallel live production system, and RSC became the first publisher to semantically mark up journal articles; this resulted in the ALPSP Publishing Innovation award in 2007 [8]. More importantly, the chosen path provided a springboard to innovation within the fields of publishing chemistry and the chemical sciences. What follows is a summary of the technical approach that was taken to deliver Project Prospect.

If the architecture for a software project is designed in the right way then in many cases it is possible to set up the system using open-source software as scaffolding [9] and migrate parts of it, as necessary, to more advanced modules, either open or closed source software, as it is determined what the requirements are. For Project Prospect the following needs were identified:

1. A means of extracting chemical names from text and converting them to electronic structure formats
2. A means of displaying the resulting electronic structure diagrams in an interface for the users
3. A means of storing those chemicals separately from the article XML
4. A means of finding non-structural chemical and biomedical terms in the text

Chemical names commonly contain punctuation, for example [2-({4-[(4-fluorobenzyl)oxy]phenyl}sulfonyl)-1,2,3,4-tetrahydroisoquinolin-3-yl](oxo)acetic acid, or spaces, like diethyl methyl bismuth, or both, and hence cause significant problems for natural language processing code that has been written to handle newswire text or biomedical articles. For this reason the SciBorg project [6] required code that would identify chemical names so that they would not interfere with further downstream processing of text. Fortunately a means of extracting chemical structures from text was already available. The OSCAR software provided a collection of Open Source code components to meet the explicitly chemical requirements of the SciBorg project.
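To make the tokenization problem concrete, the following minimal Python sketch shows how a general-purpose tokenizer fragments the systematic name quoted above, and how replacing recognised names with placeholder tokens keeps downstream processing intact. It is an illustration only: the function names and the `CHEMn` placeholders are hypothetical, and a plain list of known names stands in for the chemical name recognition that OSCAR actually performs.

```python
import re

# The systematic name quoted in the text above.
NAME = ("[2-({4-[(4-fluorobenzyl)oxy]phenyl}sulfonyl)-1,2,3,4-"
        "tetrahydroisoquinolin-3-yl](oxo)acetic acid")


def naive_tokens(text):
    """Tokenize as a newswire-style tokenizer might: runs of letters and
    digits are words, every other non-space character is its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


def protect_names(text, names):
    """Replace each known chemical name with a single placeholder token.

    `names` is a hypothetical list of already recognised names; longer
    names are replaced first so that shorter ones cannot split them.
    """
    mapping = {}
    for i, name in enumerate(sorted(names, key=len, reverse=True)):
        placeholder = f"CHEM{i}"
        if name in text:
            text = text.replace(name, placeholder)
            mapping[placeholder] = name
    return text, mapping


sentence = f"Treatment with {NAME} gave diethyl methyl bismuth."
protected, mapping = protect_names(sentence, [NAME, "diethyl methyl bismuth"])
print(len(naive_tokens(NAME)), "tokens for one chemical name")
print(naive_tokens(protected))
```

Once the names are protected, the sentence tokenizes into ordinary words plus one token per chemical, and the mapping allows the original names to be restored afterwards.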
OSCAR delivered components which determined whether text was chemical or not, RESTful web services for the Chemistry Development Kit (CDK) [10], routines for training language models, and, importantly, the OPSIN parser [11], which lexes candidate strings of text and generates the corresponding chemical structures. The original version of OPSIN produced in
2006 had numerous gaps but was still powerful enough to identify many chemicals. We also used the ChEBI database [12] as the basis of a chemical dictionary.

In order to display extracted chemical structures the CDK was used via OSCAR. While the relevant routines were not entirely reliable and, specifically, did not handle stereochemistry, they were good enough to demonstrate the principle. Following the introduction of the International Chemical Identifier (InChI) [13], and clear interest by various members of the publishing industry and software vendors in supporting the new standard, it was decided to store the connection tables as InChIs. The InChI code is controlled open source: open, but presently only developed as a single trunk of code by one development team. The structures were stored as InChIs both in the article XML and in a SQL Server database. For the non-structural chemical and biomedical terms contained within the text, resources that were accessible to the casual reader were identified as being most appropriate. This application was launched with the IUPAC Gold Book [14], which had recently been converted to XML, and with the Gene Ontology [15]. A more detailed account of the integration is given elsewhere [16]. Unfortunately the Gold Book turned out to have too broad a scope and too narrow a coverage to be particularly useful for this work. Initially attempts were made to mark up text with all of the entries in the Gold Book, but too many of them, such as cis- and trans-, were about nomenclature and parts of names, so only hand-picked selections from the resource were used.

Named reactions and analytical methods were additional obvious areas to select for the further development of ontologies applicable to chemistry publishing. RXNO [17] represents named reactions, for example the Diels–Alder reaction, which are particularly easy to identify, and a method was established to determine which classification a reaction falls under.
The Chemical Methods Ontology (CMO) [18] was initially based on the 600 or so terms contained within the IUPAC Orange Book [19] and then extended based on our experience with text-mining. It now contains well over 2500 terms and covers physical chemistry as well as analytical chemistry. RXNO and CMO have been provided as open source and are available from Google Code where we have trackers and mailing lists [17, 18]. When ChemSpider was acquired in 2009 (vide infra) a number of processes were changed to make use of many of the tools and interfaces available within the system. The processes associated with one, two and three listed above have been changed but four, the method by which non-structural chemical and biomedical
terms are found in the text, remains the same. At present we utilize OSCAR3, a greatly-improved version of OPSIN [11] and ACD/Labs’ commercial Name-to-Structure software [20]. The large assortment of batch scripts and XSLT transforms has now been replaced by a single program, written in C#, for bulk markup of documents, and it memoizes the results of the name-to-structure transformations. The key difference for structure rendering is that ChemSpider stores the 2D layouts in addition to the InChIs in the database and as a result the rendering process is now a lot simpler. The ChemSpider image renderer is also used in place of the original CDK, providing significant improvements in structure handling and aesthetics. The original RSC Publishing SQL Server database serving the Prospect project has now been replaced by integration with ChemSpider, meaning that substructure searching is now available as well as cross-referencing to journal information from other publishers that has been deposited in ChemSpider.

New approaches are being investigated to enhance the semantic markup of RSC publications and to roll out new capabilities as appropriate. These now include the delivery of our semantically enriched articles to the Utopia platform [21]. Certainly the development of the RSC semantic markup platform owes much of its success to the availability of the open source software components, developed by a team of innovative scientists and software developers, and these are now used in parallel with both in-house and commercial closed source software to deliver the best capabilities.

CHEMSPIDER

ChemSpider [22] was initially developed on a shoestring budget as a hobby project, by a small team, simply to contribute a free resource to the chemistry community. Released at the American Chemical Society (ACS) Spring meeting in Chicago in March 2007, it was seeded with just over 10 million chemicals sourced from the PubChem database [23].
Following a two year period of expanding the database content to over 20 million chemicals, adding new functionality to the system to facilitate database curation and crowd-sourced depositions of data, as well as the development of a series of related projects, ChemSpider was acquired by the Royal Society of Chemistry [2]. The original strategic vision of providing a structure-centric community for chemistry was expanded to become the world’s foremost free access chemistry database and to make subsets of the data available as open data.
The database content in ChemSpider (see Figure 2), now over 26 million structures aggregated from over 400 data sources, has been developed as a result of contributions and depositions from chemical vendors, commercial database vendors, government databases, publishers, members of the Open Notebook Science community and individual scientists. The database can be queried using structure/substructure searching and alphanumeric text searching of chemical names and both intrinsic, as well as predicted, molecular properties. Various searches have been added to the system to cater to various user personae including, for example, mass spectrometrists and medicinal chemists. ChemSpider is very flexible in its applications and nature of available searches.

Figure 2: The header of the chemical record for Domoic Acid (http://www.chemspider.com/4445428) in ChemSpider. The entire record spans multiple pages including links to patents and publications, pre-calculated and experimental properties and links to many external data sources and informational websites.

The primary ChemSpider architecture is built on commercial software using a Microsoft technology platform of ASP.NET and SQL Server 2005/2008, since at inception it allowed for ease of implementation, offered projected longevity and made best use of available skill sets. Early attempts to use SQLite as the database were limited by performance issues. The structure databasing model was completely developed in-house. The InChI library is the
basis of both the ChemSpider registration system and exact searching in ChemSpider (which uses InChI layer separation and comparison). As a result ChemSpider is highly dependent on the availability of the Open Source InChI library for InChI generation. The choice between using InChI identifiers versus alternative chemical structure hashing algorithms (e.g. CACTVS hash codes [24]) was largely based on community adoption. Attempts were made to develop our own version of hash codes early on but were abandoned quickly as the standard InChI library was already out of beta and increasingly used in the chemical community. No modifications to the InChI source code have been made except for small changes to the libraries allowing multiple versions of the InChI code to coexist in one process address space.

The GGA Bingo toolkit (SQL Server version) [25] is used for substructure and similarity searches in ChemSpider. The open source GGA software library is developed by a small team of geographically co-located developers and, to the best of our knowledge, they do not allow the source code to be modified outside of their organization. This platform was chosen over other possible solutions for ChemSpider since the team was knowledgeable, professional and agile. The original version of Bingo was made available only on SQL Server 2008 while ChemSpider was running on SQL Server 2005 at that time. Since the software was available as Open Source we recompiled the source code to work on SQL Server 2005 and the GGA team fixed version discrepancies quickly and provided a working version of Bingo for SQL Server 2005 within a one day turnaround. This is a testament to the skills of the software team supporting this Open Source product.

In order to perform both structure searching and substructure searching, a means of drawing the query structure is required. We provide access to a number of structure drawing tools, both Java and JavaScript.
Two of these structure editors are Open Source (GGA’s Ketcher [26] and JChemPaint [27]) and we implemented them without modification.

There are various needs on the ChemSpider system for the conversion of chemical file formats and we utilize the open source OpenBabel package for this purpose [28]. While the software code was functional we have identified a number of general issues with the code including inversion of stereo centers and loss of other chemical information. We believe that OpenBabel is a significant contribution to the cheminformatics community that will continue to improve in quality.
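Returning to the exact searching mentioned earlier in this section, which relies on InChI layer separation and comparison, the idea can be illustrated with a short Python sketch. This is not ChemSpider's code (which is written for the .NET platform and is not shown in this chapter); the function names are hypothetical, and the sketch simply assumes that an InChI string is a sequence of "/"-separated layers in which every layer after the formula starts with a lowercase prefix letter (c for connectivity, h for hydrogens, t, m and s for tetrahedral stereochemistry, and so on). Sub-layers that are repeated in isotopic or fixed-hydrogen sections are ignored here.

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of layers keyed by prefix letter."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    version, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version}
    for part in rest:
        if not part:
            continue
        if part[0].islower():
            # Prefixed layer such as "c1-2-3" or "t2-"; keep the first
            # occurrence only (a simplification for this sketch).
            layers.setdefault(part[0], part[1:])
        else:
            # The formula layer has no prefix letter.
            layers.setdefault("formula", part)
    return layers


def same_connectivity(a, b):
    """True if two InChIs agree on formula, connections and hydrogens."""
    la, lb = split_inchi(a), split_inchi(b)
    return all(la.get(k) == lb.get(k) for k in ("formula", "c", "h"))


def exact_match(a, b):
    """True only if every layer, including stereo layers, is identical."""
    return split_inchi(a) == split_inchi(b)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(same_connectivity(L_ALANINE, D_ALANINE))
print(exact_match(L_ALANINE, D_ALANINE))
```

Comparing layer by layer lets a search be relaxed or tightened, for example matching a skeleton regardless of stereochemistry, without recomputing anything from the structure itself.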
We generate 3D conformers on the fly from the 2D layouts in the database. We chose the freely-available Balloon optimizer [29], primarily because it is free and fast, and because it is a command line tool that was relatively easy to integrate. Balloon is not, however, open source, and cannot be modified. We use Jmol [30] to visualize the resulting optimized 3D molecular structures as well as crystallographic files (CIFs) where available. We do not use the Java-based Jmol to visualize regular 2D images as it can add a significant load to browsers.

Literature linking from ChemSpider to open internet services has been established in an automated fashion, taking advantage of freely available application programming interfaces to such websites as PubMed [31], Google Scholar [32] and Google Patents [33]. Validated chemical names are used as the basis of a search against the PubMed database, searching only against the title and the abstract. In this way a search on cholesterol, for example, would only retrieve those articles with cholesterol in the title and abstract rather than the many tens of thousands of articles likely mentioning cholesterol in the body of the article. A similar approach has been taken to integrate with Google Scholar and Google Patents. It should be noted that the APIs are free to access but are not open source. PubMed (through Entrez [34]) has both SOAP and RESTful Application Programming Interfaces (APIs). The Entrez API is both extensive and robust, providing access to most of the NCBI/NLM [35] electronic databases. Google now provides RESTful APIs and has deprecated the SOAP services that it once supported. This likely reflects the trend to support only lightweight protocols for modern web applications.
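As an illustration of the title-and-abstract restriction described above, the following Python sketch builds a query URL for the RESTful Entrez ESearch service. It is a sketch under stated assumptions rather than ChemSpider's implementation: the function name and the `retmax` default are hypothetical, the `[Title/Abstract]` field tag is the one documented for PubMed searches, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_title_abstract_url(name, retmax=20):
    """Build an ESearch URL that looks for `name` in titles and abstracts only.

    The name is quoted so that multi-word synonyms are searched as a phrase.
    """
    term = f'"{name}"[Title/Abstract]'
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(pubmed_title_abstract_url("cholesterol"))
```

The response to such a request is XML listing matching PubMed identifiers, which fits the XML-based processing used for the record pages.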
All of these APIs are called in a similar way: a list of approved synonyms associated with a particular ChemSpider record is assembled, sorted by “relevance” (which is calculated based on the length of the synonym as well as its clarity), then used to call against the API. The result (whether SOAP, XML or HTML) is then processed by an “adapter”, transformed into an intermediate XML representation and passed through XSLT to produce the final HTML shown in the ChemSpider records.

The value of analytical data is as reference data for comparing against other lab-generated data. Acquisition of a spectrum and comparison against a validated reference spectrum speeds up the process of sample verification without the arduous process of full data analysis. As a result of this general utility ChemSpider has provided the ability to upload spectral data of various forms against a chemical record such that an individual chemical can have an aggregated set of analytical data to assist in structure verification. As a result of
contributions from scientists supporting the vision of ChemSpider as a valuable centralizing community-based resource for chemical data for chemists, over 2000 spectra have been added to ChemSpider in the past 2 years with additional data being added regularly. These data include infrared, Raman, mass spectrometric and NMR spectra, with the majority being 1H and 13C spectra. Spectral data can be submitted in JCAMP format [36] and displayed in an Open Source interactive applet, JSpecView [37], allowing zooming and expansion. JSpecView is Open Source but the code seems to lack a clearly defined architecture and boundaries. JSpecView has been modified to visualize range selection (inverting a region’s color while dragging a mouse cursor). One of the main problems faced with supporting JSpecView is that it understands only one of the many flavours of JCAMP produced by spectroscopy vendors. This is not the fault of JSpecView but rather the result of poor adherence to the official JCAMP standard by the spectroscopy vendors. An alternative spectral display interface is the ChemDoodle spectral web component [38], which is a “Spectrum Canvas” and renders a JCAMP spectrum in a webpage along with controls to interact with it, for example to zoom in on a particular area of interest. However, it relies on HTML5, which limits its usage to modern standards compliant browsers which support HTML5 (for example Google Chrome and Firefox) and, as described earlier, limits its use in most versions of Internet Explorer unless the Google Chrome Frame plug-in is installed. This form of spectral display has not been implemented in the ChemSpider web interface yet but has been installed to support the SpectralGame [39, 40] on mobile devices.
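To show why JCAMP "flavours" matter, here is a minimal Python sketch of a reader for the simplest form of a JCAMP-DX file: "##LABEL=value" header records followed by an uncompressed XYDATA table in which each line holds an X check value and then whitespace-separated Y values. It is an illustrative assumption-laden sketch, unrelated to the JSpecView or ChemDoodle code; the function names are hypothetical, and files that use the compressed data forms or pack negative numbers without separators would need considerably more work.

```python
import re


def parse_jcamp(text):
    """Return (header, y_values) for an uncompressed JCAMP-DX XYDATA block."""
    header, ys, in_data = {}, [], False
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].strip()      # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            # Labels are compared ignoring case, spaces, "-", "/" and "_".
            key = re.sub(r"[\s\-/_]", "", label).upper()
            if key == "END":
                break
            in_data = key == "XYDATA"
            if not in_data:
                header[key] = value.strip()
        elif in_data:
            fields = line.split()
            ys.extend(float(v) for v in fields[1:])  # fields[0] is the X check
    yfactor = float(header.get("YFACTOR", 1))
    return header, [y * yfactor for y in ys]


def x_axis(header, n):
    """Rebuild the evenly spaced X values from FIRSTX and LASTX."""
    first, last = float(header["FIRSTX"]), float(header["LASTX"])
    step = (last - first) / (n - 1) if n > 1 else 0.0
    return [first + i * step for i in range(n)]


SAMPLE = """##TITLE= example spectrum
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##FIRSTX= 400
##LASTX= 404
##YFACTOR= 0.5
##NPOINTS= 5
##XYDATA= (X++(Y..Y))
400 2 4 6
403 8 10
##END=
"""

header, ys = parse_jcamp(SAMPLE)
print(header["DATATYPE"], list(zip(x_axis(header, len(ys)), ys)))
```

A reader written this narrowly handles only one flavour of the format, which is exactly the situation described above for vendor-produced files.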
While ChemSpider is not an Open Source project per se, depending for its delivery on a Microsoft ASP.NET platform and SQL Server database, it should be clear that the project does take advantage of many Open Source components to deliver much of the functionality including file format conversion and visualization. In particular the InChI identifier, a fully Open Source project, has been a pivotal technology in the foundation of ChemSpider and has become essential in linking the platform out to other databases on the internet using InChIs.

CHEMDRAW DIGESTER

The ChemDraw Digester is an informatics project bridging the previous two topics discussed: it is a tool which uses the structure manipulation programs contained in ChemSpider’s code to help enhance RSC articles. In
the first section we saw that if the most important chemical compounds in a paper can be identified and deposited to ChemSpider then the article can be enhanced with links to provide readers with more compound information. These compounds were generated by using name to structure algorithms after extraction of the chemical identifier. However, chemistry authors more often refer to and define compounds in their paper not by name but by figures in the manuscript where the molecular structures are being discussed (see Figure 3).

Figure 3: Example of figure in article [41] (Reproduced by permission of The Royal Society of Chemistry) defining compounds

Where the images accompanying a manuscript have been generated by using the structure drawing package ChemDraw [42], the RSC requests that authors supply these images not only as image files, but also in their original ChemDraw format (with the file extension “.cdx”), since these files preserve the chemical information of the structures within them. Since the ChemDraw file format can also incorporate graphical objects and text, the files often contain labels (reference numbers or text) which correspond to the references of the compound in the corresponding manuscript, as in the example figure. Therefore, by “digesting” a ChemDraw file we could
potentially decorate these occurrences of the compounds’ identifying labels in the manuscript with links to their chemical structures in ChemSpider; this is the basic aim of the ChemDraw Digester.

The most crucial part of this digestion process is to find each compound in the original ChemDraw file, match it up with its corresponding label, and then convert its 2D molecular structure into the MDL MOLfile format [43] (with extension .mol). The conversion from ChemDraw to mol format is required so that the files can be concatenated to make an MDL SDF file [44] (with extension .sdf) suitable for deposition to ChemSpider. This SDF file is also supplemented with article publication details in its associated data fields, which are used during deposition to create links from the new and existing compound pages in ChemSpider back to the source RSC article. Once deposited to ChemSpider, the related IDs of each compound can be retrieved and used to mark up their names and references in the source article with reverse links to the ChemSpider compounds. The ChemDraw format is, unfortunately, not an open standard and it is not straightforward to digest in order to extract and convert the chemical structures and their associated labels. It is a binary file format and, although there is good documentation [45], deciphering it is a painstaking process that would require considerable effort.

Fortunately, as discussed previously, there is an existing routine to convert ChemDraw files to SDF using the “convert” function of OpenBabel [46]. The ChemDraw Digester was written using Visual Studio and the .NET Framework as a C# service with an ASPX/C# web front end so that ultimately it can be reintegrated with the main ChemSpider website. As a result, it could reference the native C++ OpenBabel library in the same way as the main ChemSpider code: via a managed C++ wrapper assembly (OBNET), which only exposed functionality required for the Digester and ChemSpider.
The real advantage of OpenBabel being open source is that the source code can be adjusted and the assembly recompiled, allowing the adjustments required to deal with the real ChemDraw files from authors. These adjustments primarily involved adding new functionality, such as a new “splitter” function to split ChemDraw files which contain multiple “fragment” objects (molecules) into separate ChemDraw files so that each could be processed separately. Another issue is that the ChemDraw format supports more features than MOL, so some information is lost in this conversion. As a result, the ChemDraw reader had to be adjusted to read in this information and store it in the associated data fields in the SDF file generated; for example, special bond types are represented by the PubChem notation. The other more important example of data lost from the original
ChemDraw file is that of text labels associated with molecules. The difficulty in this case is to define how to match up a structure with its label. As a first step, OpenBabel was adjusted to recognise text labels which had been specifically grouped with a particular structure by the author. However, it became clear that in practice authors rarely used this grouping feature for this purpose, so that the vast majority of labels in the figures would be lost.

The ChemDraw Digester incorporates a review step where the digested information can be reviewed in an editable webpage as shown in Figure 4. If a label is wrong or absent it can be amended but this is a time consuming process and the ultimate aim for the ChemDraw Digester is that it could be run as a fully automated process that does not require human intervention.
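The way MOL blocks, labels and article details come together in an SDF file for deposition, as described earlier in this section, can be sketched in a few lines of Python. This is only an illustration of the SDF layout (a MOL block, then "> <FIELD>" data items each followed by a blank line, then a "$$$$" record terminator); the field names LABEL and ARTICLE_DOI, the placeholder DOI value and the stub MOL block are all hypothetical and are not taken from the Digester.

```python
# A stub standing in for a real MOL block; only its final "M  END" line
# matters for this sketch, and it is not claimed to be a valid structure.
PLACEHOLDER_MOL = "\n  sketch\n\n(atom and bond lines would go here)\nM  END\n"


def sdf_record(molblock, fields):
    """Format one SDF record: MOL block, data items, then '$$$$'."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f"> <{name}>")   # data header line
        lines.append(str(value))      # data value
        lines.append("")              # blank line ends the data item
    lines.append("$$$$")              # record terminator
    return "\n".join(lines) + "\n"


def build_sdf(records):
    """Concatenate (molblock, fields) pairs into a single SDF string."""
    return "".join(sdf_record(mol, fields) for mol, fields in records)


sdf = build_sdf([
    (PLACEHOLDER_MOL, {"LABEL": "1a", "ARTICLE_DOI": "10.0000/example"}),
    (PLACEHOLDER_MOL, {"LABEL": "1b", "ARTICLE_DOI": "10.0000/example"}),
])
print(sdf)
```

Carrying the label and the article details as data fields keeps each structure self-describing, so the link back to the source article survives the journey into the database.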
Figure 4: A review page of digested information

As an alternative to manual correction, the OpenBabel source code was modified to return labels for structures based on proximity, as well as grouping. A function was added which was called when a “fragment” object (molecule or atom) was found. The function calculates the distance between the fragment object and all of the “text” objects in the file (based on their 2D coordinates), so that the closest label to it could be identified. If the distance between the fragment and its closest text was less than the distance between that same text and any
other fragment, then the value of the “text” property of the text object (the text in the label) was associated with the structure and returned in the SDF file produced. Certain checks were also built in to ignore labels which do not contain any alphanumeric characters (e.g. “+”).

The ChemDraw Digester is presently in its final stages of development and testing and all of the processed structures in the SDF file will be reviewed for mistakes and compared with those in the figures of the article so that all discrepancies are identified. We are already aware of some areas which will need attention. Some can be dealt with by post-processing the structures in the SDF file after digestion. For example, it is common for authors to draw boxes in the ChemDraw files for aesthetic reasons, and to draw these boxes by simply drawing four straight line bonds. As a result, we have added a filter which by default ignores these rogue cyclobutane molecules, and similarly ethane molecules which are commonly used to draw straight lines. When a structure is flagged to be ignored for any reason it will not be deposited into ChemSpider or marked up in the original article, but at the reviewing stage any automatically assigned “ignore” flags can be overridden (see the Ignore checkbox in Figure 4). Molecules are also flagged to be ignored based on some basic checks of their chemistry such as having a non-zero overall charge, possessing atoms with unusual atom valences, undefined stereochemistry, etc. Another issue encountered was one in which the molecule is not drawn out explicitly but, for example, is represented by a single “node” which is labeled, e.g. “FMoc”. These groups are automatically expanded but the placement of atoms in these expanded groups is sometimes peculiar and leads to ugly 2-dimensional depictions of the molecules.
This can be addressed by applying a cleaning algorithm to the relevant MOL structure in the SDF file to tidy up and standardise the bond lengths and angles to prevent atom overlap and very long bonds.

These are examples that can be dealt with by post-processing the structures in the SDF file. However, since OpenBabel source code availability allows customization, issues that cannot be fixed with post-processing can be dealt with during the initial ChemDraw to SDF conversion. One such issue is that authors may use artistic license to overlay another ChemDraw object onto a molecule, for example to only draw part of a larger structure. The objects can be lines to indicate dangling bonds (even more problems are caused when these are not drawn as graphical objects but instead as various variations of ethane molecules as in Figure 5a), graphical pictures (e.g. circles to indicate beads as in Figure 5b) or brackets (e.g. commonly used to indicate repeat units
in polymers as in Figure 5c). The objects are usually overlaid onto an unlabeled carbon to give the appearance of a bond from the drawn molecule to these objects. The current OpenBabel algorithm would interpret the ChemDraw file by simply identifying a carbon atom, and treating any objects overlaid on it as separate entities rather than bonded in any way, and it would not be possible to detect any error in the final molecule that was output. While it is difficult to envisage any way that we could fully interpret such molecules, we could modify the OpenBabel convert function to return a warning when a chemical structure overlaps any other ChemDraw object so that these can be ignored by default, rather than processed incorrectly. Another very common problem, for which it is difficult to find a solution, is dealing with Markush structures (see Figure 5d). Authors commonly save valuable space in the figures of their articles by representing multiple, similar compounds by defining part of the structure with a place holder, e.g. the label “R”, and supplying a label (usually elsewhere in the ChemDraw file) defining the different groups which could be substituted for R. This would require quite an extensive alteration to OpenBabel to deal with it correctly, but it is at least conceivable.
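The proximity rule for matching labels to structures, described earlier in this section, can be expressed compactly. The Python sketch below is an illustration only, not the modified OpenBabel C++ code: the function names are hypothetical, and each fragment and text object is assumed to be reduced to a single 2D point (for example its centre), which the chapter does not specify. A label is assigned to a fragment only when it is that fragment's closest label and no other fragment is as close to the label, and labels without alphanumeric characters are skipped.

```python
from math import dist


def has_alnum(label):
    """Labels such as '+' carry no identifier and are ignored."""
    return any(ch.isalnum() for ch in label)


def assign_labels(fragments, texts):
    """Match fragments to labels by mutual proximity.

    fragments: dict of fragment id -> (x, y)
    texts:     dict of text id -> ((x, y), label string)
    Returns a dict of fragment id -> label for the fragments that won one.
    """
    candidates = {tid: (pos, label) for tid, (pos, label) in texts.items()
                  if has_alnum(label)}
    assigned = {}
    for fid, fpos in fragments.items():
        if not candidates:
            break
        # The label closest to this fragment.
        tid = min(candidates, key=lambda t: dist(fpos, candidates[t][0]))
        tpos, label = candidates[tid]
        d = dist(fpos, tpos)
        # Keep it only if every other fragment is farther from that label.
        if all(d < dist(tpos, opos)
               for oid, opos in fragments.items() if oid != fid):
            assigned[fid] = label
    return assigned


fragments = {"a": (0, 0), "b": (10, 0), "c": (0, 5)}
texts = {"t1": ((0, -2), "1a"), "t2": ((10, -2), "1b"), "t3": ((5, 0), "+")}
print(assign_labels(fragments, texts))
```

In this example fragment "c" stays unlabeled because its nearest label already sits closer to fragment "a", which is the kind of case that would be left for the review page.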
Figure 5: Examples of ChemDraw molecules which are not converted correctly to MOL files by OpenBabel

The long term aim of the ChemDraw Digester is for it to process all ChemDraw files supplied with RSC articles automatically. In fact, with some extra effort, it may be possible to extract embedded ChemDraw file objects from Microsoft Word files and digest them, and this may allow even more structures to be identified, even when authors do not send ChemDraw files. However, rather than simply using the various checks on molecules to filter out and omit problem structures automatically, it would be more useful to be able to feed these warnings back to the authors to give them the opportunity to revise their ChemDraw files and images, so that they could be used as intended. For this reason, another long-term aim of the ChemDraw Digester is for it to be made available as part of the ChemSpider website as an author tool. If this were possible then authors could upload their own ChemDraw files when writing their articles, and view the Review webpage (as in Figure 4) and see a clear indicator of whether the structures drawn in their figures do indeed adequately define the molecules that they are referring to, and if not what issues need to be fixed. Poorly drawn structures are more common than one would expect in academic
papers so the ability to raise the standard of Chemistry in RSC articles would benefit both authors and readers. The Digester is presently in a testing phase with the RSC editors and we expect to roll it out to general usage within the organization shortly.

LEARN CHEMISTRY

The RSC’s objective is to advance the chemical sciences, not only at a research level but also to provide tools to train the next generation of chemists. The RSC’s LearnChemistry platform [47] is currently being developed to provide a central access point and search facility to make it easier to access the various different Chemistry resources that it provides. ChemSpider contains a lot of useful information for students learning Chemistry but there is also a lot of information which is not relevant to their studies which might be confusing and distracting. As a result, the RSC is developing a teaching resource, which will belong to LearnChemistry, for students in their last years of school and first years of university (ages 16-19), which restricts the compounds and the properties, spectra and links displayed for each to those relevant to their studies. However, students do not just need compound information in isolation; it is most useful when linked to and from study handouts and laboratory exercises. In addition, this resource is not just intended to be read and browsed, but to be interactive, allowing students to answer a variety of quiz questions and allowing chemical educators to contribute to the content, so this resource is to be called “LearnChemistry: Share” (see Figure 6) [48]. The platform on which this community website is built must therefore support collaborative contributions from multiple users and be easily customizable both in terms of appearance and functionality.
Figure 6: The LearnChemistry|Share website

As such, an obvious starting point was the MediaWiki open source software [49]. It is easy to administer and customize in terms of initial set-up, managing user logins, tracking who is making changes and reverting these when necessary. It is also easy for untrained users to add pages and edit existing ones since many people are familiar with the editing interface, which is the same as that of Wikipedia [50]. It is also easy to program enhancements to the basic functionality since not only is all of the PHP code itself open source, but it has also been designed to allow extensions (programs which are called when the wiki pages are loaded) to be built into it. Many extensions are already available [51] and since these are also open source, it is straightforward to learn how they work and to write new extensions. For all of these levels of MediaWiki use, plenty of documentation is readily available through internet searches.
The basic set-up of the wiki is straightforward – anyone can view the website, but a login is required to edit or add pages. Anyone can register for a login and make changes, but this functionality is primarily aimed at teachers. Changes to the site will be monitored by administrators, and can be crowd-curated by other teachers. The content of the website is separated into different namespaces. The “Lab”, “TeacherExpt” and “Expt” sections contain traditional HTML content – a mixture of formatted text, pictures and links which describe experiments (an overview, teachers’ notes and students’ notes respectively).

Each page in the “Substance” section contains compound information. Most of this information is dynamically retrieved from a corresponding ChemSpider compound page: images of its structure, a summary of its properties (molecular formula, mass, IUPAC name, appearance, melting and boiling points, solubility, etc.), and links to view safety sheets and spectra. Where there is a link to a Wikipedia page from the linked ChemSpider compound, the lead section of the Wikipedia article is shown in the page with a link to it. It is also possible to add extra information, e.g. references, to the pages using the regular MediaWiki editing interface. There are currently approximately 2,000 of these substance pages, which correspond to simple compounds that would commonly be encountered during the last years of school and first years of university.

Creating these substance pages posed a variety of technical challenges. The decision to retrieve as much information as possible from ChemSpider and Wikipedia rather than to store it in the “LearnChemistry: Share” wiki was taken in the interests of maintainability, since both of these sources will potentially undergo continuous curation.
The issue of retrieving compound images from ChemSpider’s website and incorporating them into the wiki pages was easily addressed by using the “EnableImageWhitelist” option [52] in the local settings configuration. However, retrieving text from the ChemSpider server required the installation of the “web service” extension [53]. This is very easy to install, in the same way as the majority of MediaWiki extensions: simply by copying the source PHP file into the extensions directory of the MediaWiki set-up files, and referencing this file in the main local settings file of the MediaWiki installation.

In the LearnChemistry wiki, the main identifier for each compound is its name, which appears in the title of the compound page. However, in ChemSpider the main identifier for each compound with a particular structure is
its ChemSpider ID and it is more future-proof to call information from its web services by querying this rather than its name, which is potentially subject to curation. This mapping between the wiki page name and ChemSpider ID is crucial, and was painstakingly curated before the LearnChemistry pages were created. It also needs to be easily maintainable in case a change is needed for whatever reason in the future. The “Data” extension [54] is used to manage this mapping. Each substance page contains hidden text which uses the extension to set a mapping of the compound name to the ChemSpider ID. The extension can then be used to retrieve this mapped information whenever the ChemSpider ID needs to be used (e.g. in a web service call), in the substance page but also in any other page in the wiki, rather than “hardcoding” the ChemSpider ID into many places.

The substance pages themselves were created by writing and running a MediaWiki “bot” – a PHP script which accesses the MediaWiki API to log in to the wiki, read information from it, or edit pages in it. There is a lot of information on the internet describing the MediaWiki API [55], and examples of bot scripts to get started, e.g. [56]. For each batch of substance pages to be created, an input file was made containing the basic inputs required to populate the page. The bot script first retrieves the login page of the wiki and supplies it with user credentials, then logs in and retrieves a token to be used when accessing other pages on the wiki. The script then iterates through the input file, constructs a URL for each new substance page in edit mode, posts the new content, and then saves the changes made. The Snoopy open-source PHP class [57] played the crucial role of effectively simulating a web browser in this process – it was very well documented and straightforward to implement.

Discoverability is also important for these substance pages.
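The bot workflow described above (log in, obtain a token, then loop over an input file and post each new page) can be sketched as follows. This is an illustration only, written in Python rather than the PHP and Snoopy used for the actual bot. It assumes the token flow of the current MediaWiki Action API, which differs from the one available in MediaWiki 1.16, a tab-separated input file holding a compound name and a ChemSpider ID, a "Substance:" title prefix, and a hypothetical `{{Substance}}` template for the page text. The function names are likewise invented for this sketch.

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_poster(api_url):
    """Return a function that POSTs parameters to api.php, keeping session cookies."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))

    def post(params):
        data = urllib.parse.urlencode(params).encode("utf-8")
        with opener.open(api_url, data) as response:
            return json.load(response)

    return post


def read_rows(lines):
    """Parse 'name<TAB>ChemSpider ID' lines (assumed input format)."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            name, csid = line.split("\t")
            rows.append((name, int(csid)))
    return rows


def substance_page_text(name, csid):
    # Hypothetical template call; the real page markup is not given in the chapter.
    return "{{Substance|name=%s|csid=%d}}" % (name, csid)


def create_pages(post, user, password, rows):
    """Log in, fetch an edit (CSRF) token, then create one page per input row."""
    reply = post({"action": "query", "meta": "tokens",
                  "type": "login", "format": "json"})
    login_token = reply["query"]["tokens"]["logintoken"]
    post({"action": "login", "lgname": user, "lgpassword": password,
          "lgtoken": login_token, "format": "json"})
    reply = post({"action": "query", "meta": "tokens", "format": "json"})
    csrf_token = reply["query"]["tokens"]["csrftoken"]
    for name, csid in rows:
        post({"action": "edit", "title": "Substance:" + name,
              "text": substance_page_text(name, csid),
              "createonly": "1",  # do not overwrite a page that already exists
              "token": csrf_token, "format": "json"})
```

Passing the `post` function in as an argument keeps the page-creation logic separate from the network layer, so the sequence of API calls can be checked against a stand-in before it is pointed at a live wiki via `make_poster`.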
An important objective was to make the substance pages searchable by structure (as ChemSpider is). An easy way to do this from outside the site is to use the “Add HTML Meta And Title” extension [58], which was used to set the metadata keywords and description on each substance page for search engines to use, making sure that the InChIKey was included in the metadata. Structure searching within the wiki (not just from internet search engines) is also necessary, so that students or teachers can draw a molecule using a chemical drawing package embedded within a wiki page. When they click on a Search button in the page the InChIKey of the drawn structure is compared with that of all the substance pages in the wiki, and any matches are returned. This is a rather specialised requirement and required the development of a new extension. Developing a new extension was made easier by investigating the range of extensions that are
already available for MediaWiki and reviewing the code behind them, which is enabled by the fact that the MediaWiki hooks and handlers are all open source, well documented, and transparent. The functionality of this structure search was split so as to create two new MediaWiki extensions rather than one: the first embedded a structure drawer into a wiki page and the second added a Search button which when clicked would display the search results. The reason for splitting the functionality into two separate extensions was that various other applications of the structure drawer had been suggested (which will be described shortly) and by this design the first extension could be used for these without duplicating code.

Although a new MediaWiki extension needed to be developed to add a structure editor to a wiki page, it was not necessary to start from scratch since various open-source structure editors already exist which can be embedded into webpages. The GGA Ketcher [26] structure editor available in ChemSpider was chosen and implemented in the structure drawing extension because it is easy to use and is based on JavaScript, so does not require any additional add-ins or Flash support to be installed (which could be a problem in a school environment). It was also very easy to integrate into a MediaWiki extension. To incorporate a Ketcher drawing frame into an HTML page it was simply necessary to download the JavaScript and CSS files which comprise the Ketcher code, reference these in the head section of the HTML of a wiki page, add the Ketcher frame, table and buttons to the body of the HTML, and add an onload attribute to the page to initialize the Ketcher frame.
The only part of these steps which was not immediately straightforward for a version 1.16.0 MediaWiki extension to add to the webpage in which it was called was the step of adding an onload attribute to the page, but a workaround was used which involved adding a JavaScript function to the HTML head which was called at the window’s onload event. The resulting extension was called the KetcherDrawer extension.

The accompanying extension adds a Search button and needs to perform several actions when clicked. The first action is to take the MOL depiction of the molecule that has been drawn (which is easily retrieved via a call to the Ketcher JavaScript functions) and convert it into an InChIKey so that this can be searched on. This conversion is done using the IUPAC InChI code [59], and any warnings that are returned are displayed in the wiki page, for example if stereochemistry is undefined or any atom has an unusual valence. The next action is to post this InChIKey to a search of the wiki – this was done by using the MediaWiki API to silently retrieve the results of
this search. If one matching substance page is found then the page redirects to view it. If no match for the full InChIKey is found then a second search is submitted to the MediaWiki API to find if there are any matches for just the first half of the InChIKey. This roughly equates to broadening the search to find matches for the molecule’s skeleton. Any results from this search are listed in the wiki page itself, with a warning that no exact match could be found for the molecule but that these are similar molecules. After these two extensions had been written, it was then possible to add the functionality to perform a structure search within the wiki just by calling the KetcherDrawer and KetcherQuizAnswer extensions in the page.

A DisplaySpectrum extension was also written to add an interactive spectrum to a wiki page. As explained earlier there are two possible display tools for spectra that we use: JSpecView and the ChemDoodle spectral display. Approximately two thirds of current viewers of the RSC educational websites use a web browser which does not support canvases, and in most school environments the installation of plug-ins is not an option. To make the best of both worlds, the DisplaySpectrum extension automatically tests the browser being used and if it supports canvases then it displays the spectrum using the ChemDoodle spectral viewer [38] and if not it uses the JSpecView applet [37].

This work has demonstrated how a simplified version of both the information in, and functionality of, ChemSpider has been integrated into the LearnChemistry educational website, using the collaborative aspects of MediaWiki to allow these and other related pages, such as quizzes and descriptions of experiments, to then be built up.
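The two-stage InChIKey comparison used by the structure search can be illustrated with a short Python sketch. It assumes that the InChIKeys of the substance pages are already available as a dictionary, whereas the extension itself retrieves matches through the MediaWiki API. The function name and the keys used in the usage note are placeholders, not real compounds. A standard InChIKey consists of a 14-character connectivity block followed by a hyphen and further blocks, so "the first half" is taken here to mean the text before the first hyphen.

```python
def find_substance_pages(query_key, page_keys):
    """Match a drawn structure's InChIKey against the wiki's substance pages.

    page_keys maps a page title to its InChIKey (assumed data structure).
    Returns a (kind, titles) pair where kind is "exact", "skeleton" or "none".
    """
    # Stage 1: look for pages whose full InChIKey is identical to the query.
    exact = sorted(title for title, key in page_keys.items()
                   if key == query_key)
    if exact:
        return "exact", exact

    # Stage 2: fall back to the first block only, which encodes connectivity,
    # so that stereoisomers and similar variants of the skeleton are found.
    skeleton = query_key.split("-")[0]
    similar = sorted(title for title, key in page_keys.items()
                     if key.split("-")[0] == skeleton)
    if similar:
        return "skeleton", similar
    return "none", []
```

For example, with pages keyed `AAAAAAAAAAAAAA-BBBBBBBBSA-N` and `AAAAAAAAAAAAAA-CCCCCCCCSA-N`, a query for the first key gives an exact match, while a query sharing only the `AAAAAAAAAAAAAA` block returns both pages as similar molecules.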
The system was pieced together from many different open source programs and libraries, which would not have been possible without the flexibility of the MediaWiki platform on which it is based.

CONCLUSION

RSC has embraced the use of Free/Open Source cheminformatics and Wiki tools in order to deliver multiple systems to the chemistry community which facilitate learning, data sharing and access to data and information of various types. By utilizing Open Source code where appropriate, and by integrating with other commercial platforms, we have been able to deliver a rich tapestry of functionality that could not otherwise have
been achieved without significantly higher investment. In choosing our commercial vendor for our substructure search engine we also opted for an Open Source platform with the GGA software.

Our experiences of using Free/Open Source software are generally very positive. In a number of cases we have been able to take the software components as they are, drop them into our applications and use them without any recoding, using the existing software interfaces as delivered. In most cases our involvement with the code developers has either been negligible or has required significant dialog to resolve issues. In the cheminformatics domain of Open Source software we have found the commercial open source software to be of excellent quality, rigorously tested and well supported. For open source software of a more academic nature we have found that small teams (where the software is supported by one group, for example) are highly responsive and effective in addressing identified issues while applications with a broad development base are less so. In certain cases we have had to invest significant resources in optimizing the software for our purposes and knitting it into our applications. We generally find that documentation suffices for our needs, or that our development staff can understand the code even without complete documentation.

The true collaborative benefits of platforms such as ChemSpider will be felt as the multitude of online resources are integrated into federated searches and semantic web linking, in such a manner that single queries can be distributed across the myriad of resources to provide answers through a single interface. There is a clear trend in life sciences towards more open access to chemistry data. In the near future this may provide additional pre-competitive data allowing the development of federated systems such as Open PHACTS [60], using ChemSpider as an integral part, as the chemistry database and search engine.
The Open PHACTS platform will allow pharmaceutical companies to link data across the abundance of life science databases that are already available and will increasingly become available. ChemSpider is likely to become one of the foundations of the semantic web for chemistry and, with an ongoing focus on enabling collaboration and integration for Life Sciences, will be an essential resource for future generations.

ACKNOWLEDGMENTS
ChemSpider is the result of the aggregate work of many contributors. All core ChemSpider development is led by Valery Tkachenko (Chief Technology Officer) and we are indebted to our colleagues involved in the development of much of the software discussed in this chapter. These include Sergey Shevelev, Jonathan Steele and Alexey Pshenichnov. Our RSC platforms are supported by a dedicated team of IT specialists that is second to none. The authors acknowledge the support of the Open Source community, the commercial software vendors (specifically Accelrys, ACD/Labs, GGA Software Inc., OpenEye Software Inc. and Dotmatics Limited), and the many data providers, curators and users for their contributions to the development of the data content in terms of breadth and quality.

REFERENCES

1. Project Prospect. [Accessed September 2011]; Available from: http://www.rsc.org/Publishing/Journals/ProjectProspect/FAQ.asp.
2. Royal Society of Chemistry acquires ChemSpider. [Accessed September 22nd 2011]; Available from: http://www.rsc.org/AboutUs/News/PressReleases/2009/ChemSpider.asp.
3. Adams, S.E., et al., Experimental data checker: better information for organic chemists. Org Biomol Chem, 2004. 2(21): p. 3067-70.
4. Nature Publishing Group. [Accessed September 2011]; Available from: http://www.nature.com/npg_/company_info/index.html.
5. International Union of Crystallography. Available from: http://www.iucr.org/.
6. Sciborg Project. [Accessed September 2011]; Available from: http://www.cl.cam.ac.uk/research/nl/sciborg/www/.
7. OSCAR on Sourceforge. [Accessed September 2011]; Available from: http://sourceforge.net/projects/oscar3-chem/.
8. Project Prospect wins ALPSP award. [Accessed September 2011]; Available from: http://www.rsc.org/Publishing/Journals/News/ALPSP_2007_award.asp.
9. Scaffolding. [Accessed September 2011]; Available from: http://depth-first.com/articles/2006/12/21/scaffolding/.
10. Steinbeck, C., et al., The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci, 2003. 43(2): p. 493-500.
11. Lowe, D.M., et al., Chemical name to structure: OPSIN, an open source solution. J Chem Inf Model, 2011. 51(3): p. 739-53.
12. de Matos, P., et al., Chemical Entities of Biological Interest: an update. Nucleic Acids Res, 2010. 38(Database issue): p. D249-54.
13. The IUPAC International Chemical Identifier (InChI). [Accessed September 2011]; Available from: http://www.iupac.org/inchi/.
14. IUPAC Gold Book. [Accessed September 2011]; Available from: http://goldbook.iupac.org/.
15. The Gene Ontology. [Accessed September 2011]; Available from: http://www.geneontology.org/.
16. Batchelor, C.R. and Corbett, P.T., Semantic enrichment of journal articles using chemical named entity recognition. ACL 07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions 2007: p. 45-48.
17. The RXNO Reaction Ontology. [Accessed September 2011]; Available from: http://code.google.com/p/rxno/.
18. The Chemical Methods Ontology. [Accessed September 2011]; Available from: http://code.google.com/p/rsc-cmo/.
19. Compiled by J. Inczedy, T. Lengyel and A. M. Ure, Compendium of Analytical Nomenclature (definitive rules 1997) – The Orange Book, 3rd edition 1998.
20. ACD/Labs Name to Structure batch. Available from: http://www.acdlabs.com/products/draw_nom/nom/name/.
21. Hull, D., S.R. Pettifer, and D.B. Kell, Defrosting the digital library: bibliographic tools for the next generation web. PLoS Comput Biol, 2008. 4(10): p. e1000204.
22. ChemSpider. [Accessed September 2011]; Available from: http://www.chemspider.com.
23. The PubChem Database. [Accessed September 2011]; Available from: http://pubchem.ncbi.nlm.nih.gov/.
24. Gregori-Puigjane, E., R. Garriga-Sust, and J. Mestres, Indexing molecules with chemical graph identifiers. J Comput Chem, 2011. 32(12): p. 2638-46.
25. The GGA Software Bingo Toolkit. [Accessed September 2011]; Available from: http://ggasoftware.com/opensource/bingo.
26. The GGA Ketcher Structure Drawer. [Accessed September 2011]; Available from: http://ggasoftware.com/opensource/ketcher.
27. JChemPaint Sourceforge Page. [Accessed September 2011]; Available from: http://sourceforge.net/apps/mediawiki/cdk/index.php?title=JChemPaint.
28. OpenBabel Wiki Page. [Accessed September 2011]; Available from: http://openbabel.org/wiki/Main_Page.
29. The Balloon 3D Optimizer. [Accessed September 2011]; Available from: http://users.abo.fi/mivainio/balloon/.
30. Jmol: An Open Source Java viewer for chemical structures in 3D. [Accessed September 2011]; Available from: http://jmol.sourceforge.net/.
31. PubMed. [Accessed September 2011]; Available from: http://www.ncbi.nlm.nih.gov/pubmed/.
32. Google Scholar. [Accessed September 2011]; Available from: http://scholar.google.com/.
33. Google Patents. [Accessed September 2011]; Available from: http://www.google.com/patents.
34. Entrez, the Life Sciences Search Engine. [Accessed September 2011]; Available from: http://www.ncbi.nlm.nih.gov/sites/gquery.
35. NCBI, the National Center for Biotechnology Information. [Accessed September 2011]; Available from: http://www.ncbi.nlm.nih.gov/.
36. Published JCAMP-DX Protocols. [Accessed September 2011]; Available from: http://www.jcamp-dx.org/protocols.html.
37. Lancashire, R.J., The JSpecView Project: an Open Source Java viewer and converter for JCAMP-DX, and XML spectral data files. Chem Cent J, 2007. 1: p. 31.
38. ChemDoodle web components. [Accessed September 2011]; Available from: http://web.chemdoodle.com/.
39. Bradley, J.C., et al., The Spectral Game: leveraging Open Data and crowdsourcing for education. J Cheminform, 2009. 1(1): p. 9.
40. The SpectralGame. [Accessed September 2011]; Available from: http://www.spectralgame.com/.
41. Younes, A.H., et al., Electronic structural dependence of the photophysical properties of fluorescent heteroditopic ligands - implications in designing molecular fluorescent indicators. Org Biomol Chem, 2010. 8(23): p. 5431-41.
42. Cambridgesoft ChemDraw. [Accessed September 2011]; Available from: http://www.cambridgesoft.com/software/chemdraw/.
43. The Molfile Format. [Accessed September 2011]; Available from: http://goldbook.iupac.org/MT06966.html.
44. The SDF file format. [Accessed September 2011]; Available from: http://www.epa.gov/ncct/dsstox/MoreonSDF.html#Details.
45. CDX File format specification. [Accessed September 2011]; Available from: http://www.cambridgesoft.com/services/documentation/sdk/chemdraw/cdx/index.htm.
46. Guha, R., et al., The Blue Obelisk-interoperability in chemical informatics. J Chem Inf Model, 2006. 46(3): p. 991-8.
47. LearnChemistry. [Accessed September 2011]; Available from: http://www.rsc.org/learnchemistry.
48. LearnChemistry:Share. [Accessed September 2011]; Available from: http://www.rsc.org/learnchemistry/share.
49. MediaWiki. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/MediaWiki.
50. Wikipedia. [Accessed September 2011]; Available from: http://www.wikipedia.org/.
51. MediaWiki Extensions. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/Category:All_extensions.
52. MediaWiki EnableImageWhitelist extension. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/Manual:$wgEnableImageWhitelist.
53. MediaWiki Webservice extension. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/Extension:Webservice.
54. MediaWiki Data extension. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/Extension:Data.
55. MediaWiki API. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/API:Main_page.
56. MediaWiki Bot to make pages. [Accessed September 2011]; Available from: http://meta.wikimedia.org/wiki/MediaWiki_Bulk_Page_Creator.
57. Snoopy PHP class. [Accessed September 2011]; Available from: http://sourceforge.net/projects/snoopy/.
58. MediaWiki Add HTML Meta and Title extension. [Accessed September 2011]; Available from: http://www.mediawiki.org/wiki/Extension:Add_HTML_Meta_and_Title.
59. IUPAC InChI v1.03. [Accessed September 2011]; Available from: http://www.iupac.org/inchi/release103.html.
60. Open PHACTS. [Accessed September 2011]; Available from: http://www.openphacts.org/.