My Open Access papers


Open Access publications of Noel O'Boyle

Published in: Education, Technology

  1. 1. Open Access Publications of Noel O’Boyle November 2, 2011
  2. 2. Contents
     I Cheminformatics toolkits
        1 Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit
        2 Cinfony - combining Open Source cheminformatics toolkits behind a common interface
        3 Open Babel: An open chemical toolbox
     II Enzyme reaction mechanisms
        4 MACiE: a database of enzyme reaction mechanisms
        5 MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms
     III QSAR
        6 PYCHEM: a multivariate analysis package for python
        7 Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction
     IV The Rest
        8 Userscripts for the life sciences
        9 Confab - Systematic generation of diverse low-energy conformers
        10 Review of “Data Analysis with Open Source Tools”
        11 Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on
  3. 3. Part I: Cheminformatics toolkits
  4. 4. Chemistry Central Journal
     Software (Open Access)

     Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit

     Noel M O'Boyle* (1,2), Chris Morley (3) and Geoffrey R Hutchison (4)

     Address: (1) Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK, (2) Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK, (3) OpenBabel Development Team and (4) Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA

     * Corresponding author

     Published: 9 March 2008. Received: 23 January 2008. Accepted: 9 March 2008.
     Chemistry Central Journal 2008, 2:5 doi:10.1186/1752-153X-2-5

     © 2008 O'Boyle et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

     Abstract

     Background: Scripting languages such as Python are ideally suited to common programming tasks in cheminformatics such as data analysis and parsing information from files. However, for reasons of efficiency, cheminformatics toolkits such as the OpenBabel toolkit are often implemented in compiled languages such as C++. We describe Pybel, a Python module that provides access to the OpenBabel toolkit.

     Results: Pybel wraps the direct toolkit bindings to simplify common tasks such as reading and writing molecular files and calculating fingerprints. Extensive use is made of Python iterators to simplify loops such as that over all the molecules in a file. A Pybel Molecule can be easily interconverted to an OpenBabel OBMol to access those methods or attributes not wrapped by Pybel.

     Conclusion: Pybel allows cheminformaticians to rapidly develop Python scripts that manipulate chemical information.
     It is open source, available cross-platform, and offers the power of the OpenBabel toolkit to Python programmers.

     Background

     Cheminformaticians often need to write once-off scripts to extract data from text files, prepare data for analysis or carry out simple statistics. Scripting languages such as Perl, Python and Ruby are ideally suited to these day-to-day tasks [1]. Such languages are, however, an order of magnitude or more slower than compiled languages such as C++. Since cheminformaticians regularly deal with molecular files containing thousands of molecules and many cheminformatics algorithms are computationally expensive, cheminformatics toolkits are typically written in compiled languages for performance.

     OpenBabel is a C++ toolkit with extensive capabilities for reading and writing molecular file formats (over 80 are supported) as well as for manipulating molecular data [2]. Many standard chemistry algorithms are included, for example, determination of the smallest set of smallest rings, bond order perception, addition of hydrogens, and assignment of Gasteiger charges. In relation to cheminformatics, OpenBabel supports SMARTS searching [3], molecular fingerprints [4] (both Daylight-type, and structural-key based), and includes group contribution descriptors for LogP [5], polar surface area (PSA) [6] and molar refractivity (MR) [5].
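     To make the "once-off script" idea from the Background concrete, here is a minimal pure-Python sketch that pulls data fields out of SDF text without any toolkit. It assumes the common SDF layout (each record ends with a line containing `$$$$`; a data field starts with a `> <NAME>` header and ends at a blank line). The function names and the sample text are hypothetical, the connection tables are left out of the sample, and real files contain edge cases that a toolkit such as OpenBabel is better placed to handle.

```python
def iter_sdf_records(text):
    """Yield each record of an SDF string as a list of lines.

    A record is taken to end at a line containing only '$$$$'.
    """
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def data_fields(record):
    """Return {field name: value} for the '> <NAME>' data fields of a record."""
    fields = {}
    name = None
    for line in record:
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # Header line such as '> <comment>': the name sits between < and >
            name = line[line.index("<") + 1:line.rindex(">")]
            fields[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line terminates the current field
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}


# Hypothetical two-record sample; connection tables are omitted for brevity.
SAMPLE = """mol1
> <comment>
first molecule

> <melting_point>
120.5

$$$$
mol2
> <comment>
second molecule

$$$$
"""

# The first line of each record is its title.
titles_and_comments = [(rec[0], data_fields(rec)["comment"])
                       for rec in iter_sdf_records(SAMPLE)]
```

     Writing the record splitter as a generator means a large file never has to be held as a list of records, which is the same iterator-based style the paper goes on to describe for Pybel.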
  5. 5. Of the current popular scripting languages, Python [7] is the de-facto standard language for scripting in cheminformatics. Several commercial cheminformatics toolkits have interfaces in Python: OpenEye's closed-source successor to OpenBabel, OEChem [8], is a C++ toolkit with interfaces in Python and Java; Rational Discovery's RDKit [9], which is now open source, is a C++ cheminformatics toolkit with a Python interface; the Daylight toolkit [10] from Daylight Chemical Information Systems, written in C, only has Java and C++ wrappers but PyDaylight [11], available separately from Dalke Scientific, provides a Python interface to the toolkit; the Cambios Molecular Toolkit [12] from Cambios Consulting is a commercial C++ toolkit with a Python interface. There are also toolkits entirely implemented in Python: Frowns [13], an open source cheminformatics toolkit by Brian Kelley, and PyBabel [14], an open source toolkit included in the MGLTools package from the Molecular Graphics Labs at the Scripps Research Institute. Note that the latter is not related to the OpenBabel project; rather its name derives from the fact that its aim was to implement in Python some of the functionality of Babel v1.6 [15], a command-line application for converting file formats which is a predecessor of OpenBabel.

     Here we describe the implementation and application of Pybel, a Python module that provides access to the OpenBabel C++ library from the Python programming language. Pybel builds on the basic Python bindings to make it easier to carry out frequent tasks in cheminformatics. It also aims to be as Pythonic as possible; that is, to adhere to Python language conventions and idioms, and where possible to make use of Python language features such as iterators. The result is a module that takes advantage of Python's expressive syntax to allow cheminformaticians to carry out tasks such as SMARTS matching, data field manipulation and calculation of molecular fingerprints in just a few lines of code.

     Implementation

     SWIG bindings

     Python bindings to the OpenBabel toolkit were created using SWIG [16]. SWIG (Simplified Wrapper and Interface Generator) is a tool that automates the generation of bindings to libraries written in C or C++. One of the advantages of SWIG compared to other automated wrapping methods such as Boost.Python [17] or SIP [18] is that SWIG also supports the generation of bindings to several other languages. For example, OpenBabel also uses SWIG to generate bindings for Perl, Ruby and Java. An additional advantage is that SWIG will directly parse C or C++ header files while Boost.Python and SIP require each C++ class to be exposed manually. The input to SWIG is an interface file containing a list of OpenBabel header files for which to generate bindings. Using the signatures in the header files, SWIG generates a C file which, when compiled and linked with the Python development libraries and OpenBabel, creates a Python extension module, openbabel. This can then be imported into a Python script like any other Python module using the "import openbabel" statement.

     For a small number of C++ objects and functions, it was necessary to add some convenience functions to facilitate access from Python. Certain types of molecule files have additional data present in addition to the connection table. OpenBabel stores these data in subclasses of OBGenericData such as OBPairData (for the data fields in molecule files such as MOL files and SDF files) and OBUnitCell (for the data fields in CIF files). To access the data it is necessary to downcast an instance of OBGenericData to the specific subclass. For this reason, two convenience functions were added to the interface file, one to cast OBGenericData to OBPairData, and one to cast to OBUnitCell. Another convenience function was added to convert a Python list to a C array of doubles, as this type of input is required for a small number of OpenBabel functions.

     Iterators are an important feature of the OpenBabel C++ library. For example, OBAtomAtomIter allows the user to easily iterate over the atoms attached to a particular atom, and OBResidueIter is an iterator over the residues in a molecule. The OpenBabel iterators use the dereference operator to access the data, the increment operator to iterate to the next element, and the boolean operator to test whether any elements remain. Iterators are also a core feature of the Python language. However, the iterators used by OpenBabel are not automatically converted into Python iterators. To deal with this, Python iterator classes that wrap the dereference, increment and boolean operators behind the scenes were added to the SWIG interface file, so that Python statements such as "for attached_obatom in OBAtomAtomIter(obatom)" work without problem.

     Pybel module

     The SWIG bindings provide direct access from Python to the C++ objects and functions in the OpenBabel API (application programming interface). The purpose of the Pybel module is to wrap these bindings to present a more Pythonic interface to OpenBabel (Figure 1). This extra level of abstraction is useful as Python programmers expect Python libraries to behave in certain ways that a C++ library does not. For example, in Python, attributes of an object are often directly accessed whereas in C++ it is typical to call Get/Set functions to access them. A C++ function returning a particular object might require a pointer to an empty object as a parameter, whereas the Python equivalent would not. Even something as simple
  6. 6. as differences in the conventions for the case of letters used in variable and method names is a problem, as it makes it more likely for Python programmers to introduce bugs in their code.

     [Figure 1: The relationship between Python modules described in the text and the OpenBabel C++ library. Python modules are shown in green; the C++ library is shown in blue.]

     One of the key aims of Pybel was to reduce the amount of code necessary to carry out common tasks. This is especially important for a scripting language where programming is often done interactively at a command prompt. In addition, as for any programming language, repeated entry of code for routine and common tasks (so-called boilerplate code) is a common cause of errors in code. Reading and writing molecule files is one of the most common tasks for users of OpenBabel but requires several lines of code if using the SWIG bindings. The following code shows how to store each molecule in a multimolecule SDF file in a list called allmols:

         import openbabel

         allmols = []
         obconversion = openbabel.OBConversion()
         obconversion.SetInFormat("sdf")
         obmol = openbabel.OBMol()
         notatend = obconversion.ReadFile(obmol, "inputfile.sdf")
         while notatend:
             allmols.append(obmol)
             obmol = openbabel.OBMol()
             notatend = obconversion.Read(obmol)

     To replace this somewhat verbose code, Pybel provides a readfile method that takes a file format and filename and returns molecules using the yield keyword. This changes the method into a generator, a Python language feature where a method behaves like an iterator. Iterators are a major feature of the Python language which are used for looping over collections of objects. In Pybel, we have used iterators where possible to simplify access to the toolkit. As a result, the equivalent to the preceding code is:

         import pybel

         allmols = [mol for mol in pybel.readfile("sdf", "inputfile.sdf")]

     The benefits of iterator syntax are clear when dealing with multimolecule files. For single molecule files, however, the user needs to remember to explicitly request the iterator to return the first and only molecule using the next method:

         mol = pybel.readfile("mol", "inputfile.mol").next()

     Pybel provides replacements for two of the main classes in the OpenBabel library, OBMol and OBAtom. The following discussion describes the Pybel Molecule class which wraps an instance of OBMol, but the same design principles apply to the Pybel Atom class. Table 1 summarises the attributes and methods of the Molecule object. By wrapping the base class, Pybel can enhance the Molecule
  7. 7. object by providing (1) direct access to attributes rather than through the use of Get methods, (2) additional attributes of the object, and (3) additional methods that act on the object.

     Table 1: Attributes and methods supported by the Pybel Molecule object

         Attribute    Description*
         ---------    ------------
         OBMol        The underlying OBMol object
         atoms        A list of Pybel Atoms
         charge       The total charge (GetTotalCharge)
         data         A MoleculeData object for access to data fields
         dim          The dimensionality of the coordinates (GetDimension)
         energy       The heat of formation (GetEnergy)
         exactmass    The mass calculated using isotopic abundance (GetExactMass)
         flags        The set of flags used internally by OpenBabel (GetFlags)
         formula      The stoichiometric formula (GetFormula)
         mod          The number of nested BeginModify() calls (Internal use) (GetMod)
         molwt        The standard molar mass (GetMolWt)
         spin         The total spin multiplicity (GetTotalSpinMultiplicity)
         sssr         The smallest set of smallest rings (GetSSSR)
         title        The title of the molecule (often the filename) (GetTitle)
         unitcell     Unit cell data (if present)

         Method       Description
         ------       -----------
         write        Write the molecule to a file or return it as a string
         calcfp       Return a molecular fingerprint as a Fingerprint object
         calcdesc     Return the values of the group contribution descriptors
         __iter__     Enable iteration over the Atoms in the Molecule

         * Where a Molecule attribute is a direct replacement for a Get method of the underlying OBMol, the name of the method is given in parentheses.

     (1) As mentioned earlier, it is typical in Python to access attribute values directly rather than using Get/Set methods. With this in mind, the Molecule class adds attributes such as energy, formula and molwt (among others) which give the values returned by calling GetEnergy(), GetFormula() and GetMolWt(), respectively, on the underlying OBMol (see Table 1 for the full list).

     (2) One of the aims of Pybel is to simplify access to some of the most common attributes. With this in mind, an atoms attribute has been added which returns a list of the atoms of the molecule as Pybel Atoms. Access to the data fields associated with a molecule has been simplified by creation of a MoleculeData object which is returned when the data attribute of a Molecule is accessed. MoleculeData presents a dictionary interface to the data fields of the molecule. Accessing and updating these fields is more convoluted if using the SWIG bindings. Compare the following statements for accessing the "comment" field of the variable mol, an OBMol:

         # Using the SWIG bindings
         value = openbabel.toPairData(mol.GetData("comment")).GetValue()

         # Using Pybel
         value = pybel.Molecule(mol).data["comment"]

     It should be noted that all of these attributes are calculated on-the-fly rather than stored for future access as the underlying OBMol may have been modified.

     (3) Four additional methods have been added to the Pybel Molecule (Table 1). The first is a write method which writes a representation of the Molecule to a file and takes care of error handling. As with reading molecules from files (see above), this method simplifies the procedure significantly compared to using the SWIG bindings directly. In addition, a calcfp method and a calcdesc method have been added which calculate a binary fingerprint for the molecule, and some descriptor values, respectively. In the OpenBabel library these are not methods of the OBMol, but rather are loaded as plugins (by OBFingerprint.FindFingerprint and OBDescriptor.FindType, respectively) to which an OBMol is passed as input. The __iter__ method is a special Python method that enables iteration over an object; in the case of a Molecule, the defined iterator loops over the Atoms of the Molecule. This feature enables constructions such as "for atom in mol" where mol is a Pybel Molecule.

     SMARTS is a query language developed by Daylight Chemical Information Systems for molecular substructure
  8. 8. searching [3]. As implemented in the OpenBabel toolkit, finding matches of a particular substructure in a particular molecule is a four step process that involves creating an instance of OBSmartsPattern, initialising it with a SMARTS pattern, searching for a match, and finally retrieving the result:

         obsmarts = openbabel.OBSmartsPattern()
         obsmarts.Init("[#6][#6]")
         obsmarts.Match(obmol)
         results = obsmarts.GetUMapList()

     Since a SMARTS query can be thought of as a regular expression for molecules, in Pybel we decided to wrap the SMARTS functionality in an analogous way to Python's regular expression module, re. With these changes, the same process takes only two steps, an initialisation step and a search step:

         smarts = pybel.Smarts("[#6][#6]")
         results = smarts.findall(pybelmol)

     Pybel was not written to replace the SWIG bindings but rather to make it simpler to perform common tasks. As a result, Pybel does not attempt to wrap every single method and class in the OpenBabel library. Because of this, a user may often want to interconvert between an OBMol and a Molecule, or an OBAtom and an Atom. This is quite a straightforward process. A Pybel Molecule can be created by passing an OBMol to the Molecule constructor. In the following example an OBMol is created using the SWIG bindings and then written to a file using Pybel:

         obmol = openbabel.OBMol()
         a = obmol.NewAtom()
         a.SetAtomicNum(6)
         a.SetVector(0.0, 1.0, 2.0)  # Set coordinates
         b = obmol.NewAtom()
         obmol.AddBond(1, 2, 1)  # Single bond from Atom 1 to Atom 2
         pybel.Molecule(obmol).write("mol", "outputfile.mol")

     The OBMol wrapped by a Pybel Molecule can be accessed through the OBMol attribute. This makes it easy to call a method not wrapped by Pybel, such as OBMol.NumRotors, which returns the number of rotatable bonds in a molecule:

         mol = pybel.readfile("mol", "inputfile.mol").next()
         numrotors = mol.OBMol.NumRotors()

     Documentation and Testing

     To minimise programming errors, programs written in dynamically-typed languages such as Python should be tested comprehensively. Pybel has 100% code coverage in terms of unit tests, as measured by Ned Batchelder's coverage.py [19]. It also has several doctests, short snippets of Python code included in documentation strings which serve as both examples of usage and as unit tests.

     The Pybel API is fully documented with docstrings. These can be accessed in the usual way with the help() command at the interactive Python prompt after importing Pybel: for example, "help(pybel.Molecule)". In addition, the OpenBabel Python web page [20] contains a complete description of how to use the SWIG bindings and the Pybel API. The webpage also contains links to HTML versions of the OpenBabel API documentation and Pybel API documentation. The latter is included in Additional File 1.

     Results and Discussion

     The principal aim of Pybel is to make it simpler to use the OpenBabel toolkit to carry out common tasks in cheminformatics. These common tasks include reading and writing molecule files, accessing data fields of a molecule, computing and comparing molecular fingerprints and SMARTS matching. Here we present some examples that illustrate how Pybel may be used to carry out common cheminformatics tasks.

     Removal of duplicate molecules

     When merging different datasets or as a final step in preprocessing, it may be necessary to identify and remove duplicate molecules. In the following example, only the unique molecules in the multimolecule SDF file "inputfile.sdf" will be written to "uniquemols.sdf". Here we will assume that a unique InChI string (IUPAC International Chemical Identifier) indicates a unique molecule. A similar procedure could be performed using the OpenBabel canonical SMILES format, by replacing "inchi" with "can" in the following:

         import pybel

         inchis = []
  9. 9.
         output = pybel.Outputfile("sdf", "uniquemols.sdf")
         for mol in pybel.readfile("sdf", "inputfile.sdf"):
             inchi = mol.write("inchi")
             if inchi not in inchis:
                 output.write(mol)
                 inchis.append(inchi)
         output.close()

     Selection of similar molecules

     Another common task in cheminformatics is the selection of a set of molecules of similar structure to a target molecule. Here we will assume that structural similarity is indicated by a Tanimoto coefficient [21] of at least 0.7 with respect to Daylight-type (that is, based on hashed paths through the molecular graph) fingerprints. Note that Pybel redefines the | operator (bitwise OR) for Fingerprint objects as the Tanimoto coefficient:

         import pybel

         targetmol = pybel.readfile("sdf", "targetmol.sdf").next()
         targetfp = targetmol.calcfp()
         output = pybel.Outputfile("sdf", "similarmols.sdf")
         for mol in pybel.readfile("sdf", "inputfile.sdf"):
             fp = mol.calcfp()
             if fp | targetfp >= 0.7:
                 output.write(mol)
         output.close()

     Applying a Rule of Fives filter

     In an influential paper, Lipinski et al. [22] performed an analysis of drug compounds that reached Phase II clinical trials and found that they tended to occupy a certain range of values for molecular weight, LogP, and number of hydrogen bond donors and acceptors. Based on this, they proposed a rule with four criteria to identify molecules that might have poor absorption or permeation properties. This is the Lipinski Rule of Fives, so-called as the numbers involved are all multiples of five. The following example shows how to filter a database to identify only those molecules that pass all four of the Lipinski criteria. The values of the Lipinski descriptors are also added to the output file as data fields. Note that whereas molecular weight is directly available as an attribute of a Molecule, and LogP is available as one of the three group contribution descriptors calculated by OpenBabel, we need to use SMARTS pattern matching to identify the number of hydrogen bond donors and acceptors. The SMARTS patterns used here correspond to the definitions of hydrogen bond donor and acceptor used by Lipinski:

         import pybel

         HBD = pybel.Smarts("[#7,#8;!H0]")
         HBA = pybel.Smarts("[#7,#8]")

         def lipinski(mol):
             """Return the values of the Lipinski descriptors."""
             desc = {"molwt": mol.molwt,
                     "HBD": len(HBD.findall(mol)),
                     "HBA": len(HBA.findall(mol)),
                     "LogP": mol.calcdesc(["LogP"])["LogP"]}
             return desc

         passes_all_rules = lambda desc: (desc["molwt"] <= 500 and
                                          desc["HBD"] <= 5 and desc["HBA"] <= 10 and
                                          desc["LogP"] <= 5)

         if __name__ == "__main__":
             output = pybel.Outputfile("sdf", "passLipinski.sdf")
             for mol in pybel.readfile("sdf", "inputfile.sdf"):
                 descriptors = lipinski(mol)
                 if passes_all_rules(descriptors):
  10. 10.
                     output.write(mol)
             output.close()

     Future work

     The future development of Pybel is closely linked to any changes and improvements to OpenBabel. With each new release of the OpenBabel API, the SWIG bindings will be updated to include any additional functionality. However, additions to the Pybel API will only occur if they simplify access to new features of the OpenBabel toolkit of general use to cheminformaticians. In general, the Pybel API can be considered stable, and an effort will be made to ensure that future changes will be backwards compatible.

     Conclusion

     Pybel provides a high-level Python interface to the widely-used OpenBabel C++ toolkit. This combination of a high performance cheminformatics toolkit and an expressive scripting language makes it easy for cheminformaticians to rapidly and efficiently write scripts to manipulate molecular data.

     Pybel is freely available from the OpenBabel web site [2] both as part of the OpenBabel source distribution and for Windows as an executable installer. Compiled versions are also available as packages in some Linux distributions (openbabel-python in Fedora, for example).

     Availability and Requirements

     Project name: Pybel
     Operating system(s): Platform independent
     Programming language: Python
     Other requirements: OpenBabel
     License: GNU GPL
     Any restrictions to use by non-academics: None

     Authors' contributions

     GRH is the lead developer of OpenBabel and created the SWIG bindings. NMOB developed Pybel, and extended the SWIG interface file. CM compiled the SWIG bindings on Windows and added convenience functions to the OpenBabel API to facilitate access from scripting languages. All authors read and approved the final manuscript.

     Additional material

     Additional file 1: Pybel API. The HTML documentation of the Pybel API (application programming interface).

     Acknowledgements

     The idea for the Pybel module was inspired by Andrew Dalke's work on PyDaylight [11]. We thank the anonymous reviewers for their helpful comments.

     References

     1. Ousterhout JK: Scripting: Higher Level Programming for the 21st Century.
     2. OpenBabel v.2.1.1
     3. SMARTS – A Language for Describing Molecular Patterns
     4. Flower DR: On the properties of bit string-based measures of chemical similarity. J Chem Inf Comput Sci 1998, 38:379-386.
     5. Wildman SA, Crippen GM: Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 1999, 39:868-873.
     6. Ertl P, Rohde B, Selzer P: Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J Med Chem 2000, 43:3714-3717.
     7. Python
     8. OEChem: OpenEye Scientific Software: Santa Fe, NM.
     9. RDKit
     10. Daylight Toolkit: Daylight Chemical Information Systems, Inc.: Aliso Viejo, CA.
     11. PyDaylight: Dalke Scientific Software, LLC: Santa Fe, NM.
     12. Cambios Molecular Toolkit: Cambios Computing, LLC: Palo Alto, CA.
     13. Frowns
     14. PyBabel in MGLTools
     15. Babel v.1.6
     16. SWIG v.1.3.31
     17. Boost.Python
     18. SIP – A Tool for Generating Python Bindings for C and C++ Libraries
     19. coverage.py
     20. OpenBabel Python
     21. Jaccard P: La distribution de la flore dans la zone alpine. Rev Gen Sci Pures Appl 1907, 18:961-967.
     22. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Del Rev 1997, 23:3-25.
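     As a closing illustration of the wrapping approach the Pybel paper describes (attributes in place of Get methods, values computed on-the-fly, and `__iter__` for looping over atoms), the sketch below reproduces the pattern in plain Python. It is an illustration under stated assumptions, not Pybel's source code: `FakeOBMol` is a hypothetical stand-in for a SWIG-wrapped C++ object, and its method names are only chosen to look like a C++ API rather than to match OpenBabel's.

```python
class FakeOBMol:
    """Hypothetical stand-in for a C++-style object that exposes Get/Set methods."""

    def __init__(self, title, atomic_numbers):
        self._title = title
        self._atomic_numbers = list(atomic_numbers)

    def GetTitle(self):
        return self._title

    def SetTitle(self, title):
        self._title = title

    def NumAtoms(self):
        return len(self._atomic_numbers)

    def GetAtomicNum(self, index):
        # 1-based index, an assumption made here to mimic a C++ convention
        return self._atomic_numbers[index - 1]


class Molecule:
    """Thin wrapper offering attribute access and iteration (illustrative only)."""

    def __init__(self, obmol):
        self.OBMol = obmol  # keep the wrapped object reachable for unwrapped calls

    @property
    def title(self):
        # Looked up on every access, so later changes to the wrapped object show up
        return self.OBMol.GetTitle()

    @property
    def atoms(self):
        return [self.OBMol.GetAtomicNum(i)
                for i in range(1, self.OBMol.NumAtoms() + 1)]

    def __iter__(self):
        # Enables constructions such as "for atom in mol"
        return iter(self.atoms)


mol = Molecule(FakeOBMol("ethanol", [6, 6, 8]))
carbons = sum(1 for atomic_number in mol if atomic_number == 6)
mol.OBMol.SetTitle("renamed")  # modify the underlying object directly
current_title = mol.title      # reflects the change because nothing is cached
```

     Computing each attribute on access costs a method call every time, but it keeps the wrapper consistent with an underlying object that may be modified behind its back, which is the trade-off the paper states for the Molecule attributes.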
  11. 11. Chemistry Central Journal
     Software (Open Access)

     Cinfony – combining Open Source cheminformatics toolkits behind a common interface

     Noel M O'Boyle* (1) and Geoffrey R Hutchison (2)

     Address: (1) Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, UK and (2) Department of Chemistry, University of Pittsburgh, Chevron Science Center, 219 Parkman Avenue, Pittsburgh, PA 15260, USA

     * Corresponding author

     Published: 3 December 2008. Received: 9 October 2008. Accepted: 3 December 2008.
     Chemistry Central Journal 2008, 2:24 doi:10.1186/1752-153X-2-24

     © 2008 O'Boyle et al.

     Abstract

     Background: Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit share the same core functionality but support different sets of file formats and forcefields, and calculate different fingerprints and descriptors. Despite their complementary features, using these toolkits in the same program is difficult as they are implemented in different languages (C++ versus Java), have different underlying chemical models and have different application programming interfaces (APIs).

     Results: We describe Cinfony, a Python module that presents a common interface to all three of these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in cheminformatics such as reading file formats and calculating descriptors.

     Conclusion: By providing a simplified interface and improving interoperability, Cinfony makes it easy to combine complementary features of OpenBabel, the CDK and the RDKit.

     Background

     Cheminformatics toolkits are essential to the day-to-day work of the practising cheminformatician. They enable the user to deal with such tasks as handling different chemistry file formats, substructure searching, calculation of molecular fingerprints, and structure diagram generation. The main Open Source cheminformatics libraries under active development are OpenBabel [1], the Chemistry Development Kit (CDK) [2], and the RDKit [3]. OpenBabel is a C++ toolkit with bindings in Perl, Python, Ruby and Java, the CDK is a Java toolkit, while the RDKit is another C++ toolkit with Python bindings. While the CDK has its origins in academia, both OpenBabel and the RDKit originated in companies (OpenEye and Rational Discovery, respectively) and have subsequently been developed by the community under Open Source licenses.

     In general, all of these toolkits share the same core functionality although the implementation details and underlying chemical model may differ. However, as a result of their independent development and history, each has functionality specific to itself and each toolkit supports different sets of file formats and forcefields, and can calculate different molecular fingerprints and molecular descriptors (Table 1). Despite the diversity of these toolkits and the potential benefits in being able to access all of them at the same time, there has been little work on interoperability between them. This has resulted in a balkanization of this field such that users of one toolkit rarely use another toolkit.

     One way to achieve interoperability of chemical toolkits is through the use of standard file formats for exchange of
  12. 12. Chemistry Central Journal 2008, 2:24

For example, the CML project has defined a standardised XML format for chemical data [4], with successive releases refining and extending the original standard. The OpenSMILES effort [5] has attempted to resolve ambiguities in the published SMILES definition [6] to create a standard. While these efforts deserve support, they face inevitable problems achieving consensus and they require changes to existing software to support the standard. The large number of chemical file formats supported by OpenBabel (currently over 80) illustrates both the potential of achieving a standard as well as the difficulties.

Table 1: Some features of toolkits which are not shared by all three toolkits.

CDK
  - A large number of descriptors (some overlap with RDKit)
  - Pharmacophore searching (like RDKit*)
  - Calculation of maximum common substructure
  - 2D structure layout (like RDKit) and depiction
  - MACCS keys (also RDKit) and E-State fingerprints
  - Integration with the R statistical programming environment
  - Support for mass-spectrometry analysis (representations for cleavage reactions, structure generation from formulae)
  - Fragmentation schemes (ring fragments, Murcko)
  - 3D structure generation using a template and heuristics (like OpenBabel)
  - 3D similarity using ultrafast shape descriptors
  - Gasteiger π charge calculation

OpenBabel
  - Not just focused on cheminformatics
  - Supports a very large number of chemical file formats including quantum mechanics file formats, molecular mechanics trajectories, 2D sketchers
  - 3D structure generation using a template method (like CDK)
  - Included in all major Linux distributions
  - Bindings available from several scripting languages apart from Python, as well as the Java and .NET platforms
  - Conformation generation and searching
  - InChI (also CDK) and InChIKey generation
  - Support for crystallographic space groups
  - Several forcefield implementations: UFF (also RDKit), MMFF94, MMFF94s, Ghemical
  - Ability to add custom data types to atoms, bonds, residues, molecules

RDKit
  - A large number of descriptors (some overlap with CDK)
  - Fragmentation using RECAP rules
  - 2D coordinate generation (like CDK) and depiction
  - 3D coordinate generation using geometry embedding
  - Calculation of Cahn-Ingold-Prelog stereochemistry codes (R/S)
  - Pharmacophore searching (like CDK)
  - Calculation of shape similarity (based on volume overlap)
  - Chemical reaction handling and transforms
  - Atom pairs and topological torsions fingerprints
  - Feature maps and feature-map vectors
  - Machine-learning algorithms

* Where the term "like" is used, it indicates that the implementation details [...]

An alternative is interoperability at the API (application programming interface) level. This has the advantage that it does not require any changes to existing software. However, there are at least three barriers to overcome: the need for a programming language that can access all the toolkits simultaneously, the difficulty of exchanging chemical models between different toolkits, and differences in the API for core cheminformatics tasks shared by the toolkits.

Here we describe Cinfony, a Python module that overcomes these barriers to provide interoperability at the API level. Cinfony allows access to OpenBabel, the CDK, and the RDKit through a common interface, and uses a simple yet robust method to pass chemical models between toolkits. Pybel, one of the components of Cinfony, has been described previously [7]. It provides access to OpenBabel from standard Python. In this work, we show that the API developed for Pybel may be considered a generic API for accessing any cheminformatics toolkit. We describe the design and implementation of the Cinfony API for OpenBabel, the RDKit and the CDK. Next, we show how Cinfony simplifies the process of accessing the toolkits and how it can be used in practice to combine the power of the three Open Source toolkits. Finally, we discuss
performance and some results from comparisons of the toolkits.

Implementation

Common Application Programming Interface

Cinfony presents the same interface to three cheminformatics toolkits, OpenBabel, the CDK and the RDKit. These are available through three separate modules: obabel, cdk and rdkit. The API is designed to make it easy to carry out many of the common tasks in cheminformatics, and covers the core functionality shared by all of the toolkits. Table 2 gives an overview of the API. The complete API is available here (see Additional file 1).

Table 2: An overview of the Cinfony API.

Class name      Purpose
  Molecule      Wraps a molecule instance of the underlying toolkit and provides access to methods that act on molecules
  Atom          Wraps an atom instance of the underlying toolkit
  MoleculeData  Provides dictionary-like access to the information contained in the tag fields in SDF and MOL2 files
  Outputfile    Handles multimolecule output file formats
  Smarts        Wraps the SMARTS functionality of the toolkit in an analogous way to the Python re module for regular expression matching
  Fingerprint   Simplifies Tanimoto calculation of binary fingerprints

Function name   Purpose
  readfile      Return an iterator over Molecules in a file
  readstring    Return a Molecule

Variable name   Purpose
  descs         A list of descriptor IDs
  forcefields   A list of forcefield IDs
  fps           A list of fingerprint IDs
  informats     A list of input format IDs
  outformats    A list of output format IDs

The main class containing chemical information is the Molecule class. Rather than create a new chemical model, the Molecule class is a light wrapper around the molecule object in the underlying library, for example, around OBMol in the case of OpenBabel. Attribute values such as the molecular weight are calculated dynamically by querying the underlying molecule. This ensures that if the underlying OBMol, for example, is altered, the attribute values returned will still be correct. The actual underlying object (an OpenBabel OBMol, a CDK Molecule, or an RDKit Mol) can be accessed directly at any point.

The Molecule class also contains several methods that act on molecules such as methods for calculating fingerprints, adding hydrogens, and calculating descriptor values. This makes it easy to access these methods, and also brings them to the attention of the user. In the underlying toolkit these methods may not be present as part of the molecule class, and in fact, they can be difficult to find in the toolkit's API. For example, the Cinfony method Molecule.addh() adds explicit hydrogens to the molecule. Although the OBMol of OpenBabel has a corresponding method, OBMol.AddHydrogens(), the RDKit uses a global method, AddHs(Mol), while the CDK requires the user to instantiate a HydrogenAdder object, which can then be used to add hydrogens.

The Molecule methods described in the original Pybel API [7] have been extended to handle hydrogen addition and removal, structure diagram generation, assignment of 3D geometry to 0D structures and geometry optimisation using forcefields. Both the CDK and the RDKit are capable of 2D coordinate generation and 2D depiction. However, since OpenBabel currently has neither of these capabilities, a fourth toolkit, OASA, is used by Pybel for this purpose. OASA is a lightweight cheminformatics toolkit implemented in Python [8].

A new development in the latest version of OpenBabel is 3D coordinate generation and geometry optimisation using one of a number of forcefields. Since these methods are also available in the RDKit, and are under development in the CDK, two additional methods have been added to the Cinfony Molecule: make3D(), for 3D coordinate generation, and localopt(), for geometry optimisation. Particularly in the case of OpenBabel, these new methods simplify the process of generating 3D coordinates. Compare a single call to make3D() in Cinfony with the following OpenBabel code:

    structuregenerator = openbabel.OBOp.FindType("Gen3D")
    structuregenerator.Do(mol)
    mol.AddHydrogens()
    ff = openbabel.OBForceField.FindType("MMFF94")
    ff.Setup(mol)
    ff.SteepestDescent(50)
    ff.GetCoordinates(mol)

The Cinfony API is identical for all of the toolkits. However, the values returned by particular API calls are not necessarily standardised across toolkits. This Cinfony design decision is in agreement with the Principle of Least Surprise [9]; when the user accesses the underlying toolkit directly, they will get the same result as found when using Cinfony. This design decision places the responsibility on the user to become familiar with differences in how the toolkits behave. For example, all of the toolkits allow the calculation of path-based fingerprints. These encode all paths in the molecular graph up to a path length of P into a binary vector of length V, but the default values for V and P are different for each toolkit: 1024 and 7 for OpenBabel, 1024 and 8 for the CDK, and 2048 and 7 for RDKit. Although it is possible to alter these parameters for the CDK and the RDKit and so standardise V and P to 1024 and 7 for all of the toolkits, it is reasonable to assume that the developers of each package have chosen sensible defaults. In addition, the implementation details of each of the fingerprinters would still be different; for example, the RDKit sets four bits when hashing each molecular path, the others set one; OpenBabel does not set any bits for the one-atom fragments, N, C and O.

Interoperability

The ability to transfer chemical models between toolkits is essential to the goal of interoperability. However, the internal representation of a molecule is specific to a particular toolkit. For example, as well as the connection table and coordinates (if present), it may include derived data relating to aromaticity, the number of implicit hydrogens on an atom, or stereochemical configuration. Fortunately, the problem of transfer and storage of chemical information has already been solved by the development of molecular file formats, of which over 80 are now supported by OpenBabel. Specifically, the MDL MOL file format [10] and the SMILES format [5,6] are shared by all three toolkits, and are used by Cinfony to exchange information on molecules with 2D or 3D coordinates (MOL file format), and no coordinates (SMILES format), respectively.

By using existing file formats rather than trying to interconvert the internal models themselves, Cinfony takes advantage of the existing input/output code of each toolkit which is well-tested and mature. In addition, the translation process is transparent to the user. However, the user should be aware of known limitations of particular readers or writers. For example, the SMILES parser in CDK 1.0.3 ignores atom-based stereochemistry and thus that information is lost if a 0D rdkit or obabel Molecule with atom-based stereochemistry is converted to a cdk Molecule.

Cinfony Molecules are interconverted using the Molecule() constructor. For example, if obabelmol is an obabel Molecule, then the corresponding rdkit Molecule can be constructed using rdkit.Molecule(obabelmol). This mechanism can also be used to interface Cinfony to other cheminformatics toolkits. The only requirements are that the object passed to the Molecule() constructor needs to have a _cinfony attribute set to True, and an _exchange attribute containing a tuple (0, SMILES string) or (1, MOL file) depending on whether the molecule is 0D or not.

Implementation

The Python scripting language has two main implementations. The most widely used implementation is the original reference implementation of Python in C, referred to as CPython when necessary to distinguish it from other implementations. The next most widely used implementation is Jython, an implementation of Python in Java. Although most Python users work with CPython, Jython scripts have the advantage of being able to access Java libraries natively. They can also be compiled into Java classes to be used from Java programs. Jython scripts are also useful in contexts where Java is required but it is more convenient to work in Python; for example, to implement a Java web servlet or a node in a Java workflow environment such as KNIME [11].

As discussed earlier, one of the barriers to interoperability is the requirement for a programming language that can simultaneously access more than one of the toolkits. From CPython it is possible to use Cinfony modules to connect to OpenBabel (pybel), the CDK (cdkjpype) and the RDKit (rdkit). From Jython, there are modules for OpenBabel (jybel) and the CDK (cdkjython). Convenience modules obabel and cdk are provided that automatically import the appropriate OpenBabel or CDK module depending on the Python implementation. The relationship between these Cinfony modules and the underlying cheminformatics libraries is summarised in Figure 1.

pybel and jybel

OpenBabel provides SWIG [12] bindings for both CPython and Java (among other languages). pybel is a wrapper around the CPython bindings, and has previously been described in detail [7]. jybel is an implementation of the Cinfony API that allows the user to access OpenBabel from Jython using the Java bindings.
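The exchange convention described in the Interoperability section (a _cinfony flag plus an _exchange tuple) can be illustrated with a minimal sketch. It uses plain Python and no cheminformatics toolkit. The names ForeignMolecule and describe_exchange are hypothetical, invented for this illustration, and are not part of Cinfony; the real Molecule() constructors may well perform additional checks.

```python
class ForeignMolecule(object):
    """Hypothetical molecule object from a toolkit unknown to Cinfony.

    It only carries the two attributes the text names as requirements.
    """

    def __init__(self, smiles=None, molfile=None):
        self._cinfony = True
        if molfile is not None:
            # Molecule has 2D or 3D coordinates: exchange as a MOL file.
            self._exchange = (1, molfile)
        else:
            # Molecule is 0D: exchange as a SMILES string.
            self._exchange = (0, smiles)


def describe_exchange(obj):
    """Return (format ID, payload) that a receiving constructor would parse.

    Hypothetical helper: it mimics only the decision between the
    SMILES and MOL routes, not any actual parsing.
    """
    if not getattr(obj, "_cinfony", False):
        raise TypeError("object does not follow the exchange convention")
    has_coords, payload = obj._exchange
    return ("mol" if has_coords else "smi"), payload


fmt, payload = describe_exchange(ForeignMolecule(smiles="CCC=O"))
print(fmt, payload)
```

Under these assumptions, any object that sets the two attributes can be handed over, which is why the convention extends to toolkits beyond the three wrapped here.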
Despite the fact that jybel is used from a Java implementation of Python, and accesses a C++ library through the Java Native Interface (JNI), the jybel code differs from pybel in very few respects. In Jython, it is not possible to iterate directly over the wrapped STL vectors used by OpenBabel as their Java SWIG bindings do not implement the Iterable interface. Also, the current Jython implementation is 2.2 and does not support generator expressions, which were introduced in Python 2.4. Although both C++ and Python have the concept of a global function or variable, this is not the case in Java. SWIG places such functions, and get/set methods for accessing the variables, in a special class named openbabel. Global constants are placed in another class called openbabelConstants. A convenience module, obabel, is provided which automatically imports the appropriate module depending on the Python implementation.

Figure 1: Relationship of Cinfony modules to Open Source toolkits. Python modules are accessible from CPython (green), Jython (pale blue), or both (striped green and pale blue). Java libraries are indicated by dark blue, while C++ libraries are yellow.

cdkjpype and cdkjython

Since Jython runs on top of the Java Virtual Machine (JVM), it can access Java libraries such as the CDK natively. To access Java libraries from CPython, the Python library JPype [13] is needed. This starts an instance of the JVM and uses the JNI to communicate back and forth. Overall, the differences between the two wrappers are minor. Jython and JPype differ in the syntax used to handle Java exceptions. Also, JPype returns unicode strings from the CDK and these need to be converted to regular strings (otherwise problems arise if they are passed to an OpenBabel method expecting a std::string). The appropriate CDK wrapper, cdkjpype or cdkjython, will be imported if the user imports the convenience module cdk.

rdkit

Support for Python scripting has been part of the design of the RDKit from the start. The Python bindings in RDKit were created using Boost.Python [14], a framework for interfacing Python and C++. The Cinfony module rdkit uses these bindings to implement its API. It is currently not possible to access RDKit from Jython. RDKit has only preliminary support for Java bindings; when these are complete, a corresponding module will be added to Cinfony.

Dependency handling

A fully-featured installation of Cinfony relies on a large number of open source libraries. In particular, the 2D depiction capabilities introduce dependencies on several graphics libraries which may be problematic to install on a particular platform (Cairo and its Python bindings, Python Imaging Library, AGG and the Python wrapper AggDraw). With this in mind, Cinfony treats all dependencies as optional and only raises an Exception if the user calls a method or imports a module that requires a missing dependency.

For example, the Python Imaging Library (PIL) is required for displaying a 2D depiction on the screen. If all of the components of Cinfony are installed except for PIL, Cinfony works perfectly except that an Exception is raised if the Molecule.draw() method is called with show = True (the default). The image can however be written to a file without problems (show = False, filename = "image.png"). Similarly, if a user is only interested in using the CDK and the RDKit, it is not necessary to install OpenBabel.

Full installation instructions for Windows, MacOSX and Linux are available from the Cinfony website. It should be noted that for Windows users, there is no need to compile or search for missing libraries as the dependencies are included as binaries in the Cinfony distribution.

Results

Cinfony API

The original Pybel API was designed to make it easy to use OpenBabel to perform the most common tasks in cheminformatics and to do so using idiomatic Python. Subsequently, we realised that the resulting API could be considered a generic API for wrapping the core functionality of any cheminformatics toolkit. Cinfony implements an extended version of the original Pybel API for the CDK and the RDKit, as well as OpenBabel. While the original Pybel was restricted to CPython, Cinfony can also be used from Jython to access the CDK and OpenBabel.

Cinfony helps cheminformaticians avoid the steep learning curve associated with starting to use a new toolkit.
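The optional-dependency behaviour described under "Dependency handling" can be sketched in plain Python. This is an illustration of the general pattern only (attempt the import once, fail only when the missing feature is actually used) and is not Cinfony's actual source. The module name hypothetical_imaging_library and the Depiction class are invented for the example, and the file-writing step is a stand-in that writes nothing.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Hypothetical module name, chosen so that the import fails and the
# "missing dependency" path is exercised.
_imaging = optional_import("hypothetical_imaging_library")


class Depiction(object):
    """Toy stand-in for an object with a draw() method."""

    def draw(self, show=True, filename=None):
        actions = []
        if filename is not None:
            # Writing to a file does not need the display library.
            actions.append("wrote %s" % filename)
        if show:
            # Only on-screen display needs the optional dependency.
            if _imaging is None:
                raise RuntimeError("on-screen display needs the imaging library")
            actions.append("displayed")
        return actions


print(Depiction().draw(show=False, filename="image.png"))
```

The design choice shown here is that importing the module never fails; the error surfaces only at the call that cannot be carried out, which matches the behaviour the text describes for Molecule.draw().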
With Cinfony, all of the core functionality of the toolkits can be accessed with the same interface. For example, in Cinfony, a molecule can be created from a SMILES string with:

    mol = toolkit.readstring("smi", SMILESstring)

RDKit

    mol = Chem.MolFromSmiles(SMILESstring)

OpenBabel

    mol = openbabel.OBMol()
    obconversion = openbabel.OBConversion()
    obconversion.SetInFormat("smi")
    obconversion.ReadString(mol, SMILESstring)

CDK

    builder = cdk.DefaultChemObjectBuilder.getInstance()
    sp = cdk.smiles.SmilesParser(builder)
    mol = sp.parseSmiles(SMILESstring)

The RDKit was designed with Python scripting in mind, and of the three toolkits is the most concise. On the other hand, OpenBabel uses a characteristically C++ approach. An empty molecule is created, and is passed to an OBConversion instance as a container for the molecule read from the SMILES string. The SmilesParser in the CDK requires an instance of an object implementing the IChemObjectBuilder interface.

Another advantage of a common API is that a script written for one toolkit can easily be modified to use another. As an example, here is a script that selects molecules that are similar to a particular target molecule. This script is taken from the original Pybel paper [7], but uses the CDK instead of OpenBabel and will run equally well from Jython and CPython. The only differences compared to the original script are that "pybel" has been replaced with "cdk", and the import statement has been changed from "import pybel":

    from cinfony import cdk

    targetmol = cdk.readfile("sdf", "targetmol.sdf").next()
    targetfp = targetmol.calcfp()
    output = cdk.Outputfile("sdf", "similarmols.sdf")
    for mol in cdk.readfile("sdf", "inputfile.sdf"):
        fp = mol.calcfp()
        if fp | targetfp >= 0.7:
            output.write(mol)
    output.close()

Alternatively, we could just have made a single change to the original script, by replacing the import statement from "import pybel" with "from cinfony import cdk as pybel".

Using Cinfony to combine toolkits

Another goal of Cinfony is to make it easy to combine toolkits in the same script. This allows the user to exploit the complementary capabilities of different toolkits (Table 1). For example, let's suppose the user wants to (1) convert a SMILES string to 3D coordinates with OpenBabel, then (2) create a 2D depiction of that molecule with the RDKit, next (3) calculate descriptors with the CDK, and finally (4) write out an SDF file containing the descriptor values and the 3D coordinates. The full Python script is only seven lines long:

    from cinfony import rdkit, cdk, obabel

    mol = obabel.readstring("smi", "CCC=O")
    mol.make3D()
    rdkit.Molecule(mol).draw(show = False, filename = "aldehyde.png")
    descs = cdk.Molecule(mol).calcdesc()
    mol.data.update(descs)
    mol.write("sdf", filename = "aldehyde.sdf")

For cheminformaticians interested in developing QSAR or QSPR models, Cinfony can be used to simultaneously calculate descriptors from the RDKit, the CDK and OpenBabel. For example, the following script reads a multiline input file, with each line consisting of a SMILES string followed by a property value. For each molecule, it calculates all of the OpenBabel, RDKit and CDK descriptors (except for CDK's CPSA) and writes out the results as a tab-separated
file suitable for reading with the statistical package R [15]. Note that in this example script, if descriptors share the same name only one is retained. This is the case for the TPSA descriptor in OpenBabel, which is replaced by the RDKit's TPSA descriptor.

    import string
    from cinfony import obabel, cdk, rdkit

    # Read in SMILES strings and observed property values
    smiles, propvals = [], []
    for line in open("data.txt"):
        broken = line.rstrip().split()
        smiles.append(broken[0])
        propvals.append(float(broken[1]))

    mols = [obabel.readstring("smi", smile) for smile in smiles]

    # Calculate descriptor values using OpenBabel,
    # the CDK (apart from CPSA) and the RDKit
    cdkdescs = [x for x in cdk.descs if x != "CPSA"]
    descs = []
    for mol in mols:
        d = mol.calcdesc()
        d.update(cdk.Molecule(mol).calcdesc(cdkdescs))
        d.update(rdkit.Molecule(mol).calcdesc())
        descs.append(d)

    # Write a file suitable for read.table in R
    outputfile = open("inputforR.txt", "w")
    descnames = sorted(descs[0].keys(), key = string.lower)
    print >> outputfile, "\t".join(["Property"] + descnames)
    for smile, propval, desc in zip(smiles, propvals, descs):
        descvals = [str(desc[descname]) for descname in descnames]
        print >> outputfile, "\t".join([smile, str(propval)] + descvals)
    outputfile.close()

Performance

Accessing cheminformatics libraries using Cinfony allows the user to rapidly develop scripts that manipulate chemical information. However, there is a small price to be paid. Firstly, there is the cost of moving objects across the interface between Python and the cheminformatics libraries. Secondly, the additional code required by Cinfony to implement a standard API may slow performance further.

To assess the performance penalty for accessing cheminformatics toolkits using Cinfony rather than directly in the native language, we looked at two simple test cases: (1) iterating over an SDF file containing 25419 molecules, (2) iterating and printing out the molecular weight of each of the molecules. The SDF file used was 3_p0.0.sdf, the first portion of the drug-like subset of the ZINC 7.00 dataset [16]. The Cinfony scripts, Java and C++ source code are available as Additional file 2. The results are shown in Table 3.

While accessing the CDK using Jython is almost as fast as a pure Java implementation, there is a considerable overhead associated with using JPype to access the CDK from CPython (89% slower for the second test case). This overhead is due to passing objects between the JVM and CPython. For OpenBabel, there is little performance cost associated with accessing OpenBabel from either implementation of Python, although the jybel scripts are somewhat slower than pybel scripts. A small portion of this speed difference can be attributed to a slower startup (about 1.6 seconds for jybel, compared to 0.8 seconds for pybel). Finally, from the RDKit results in Table 3, it is clear that using Boost.Python to wrap a C++ library is more efficient than using SWIG. The difference in run times between the C++ and Python implementations is negligible.

In practice, the performance of a particular Cinfony script will depend on the extent to which information is passed
back and forth between Python and the underlying Java or C++ library. Where most of the time is spent on computation in the underlying library, the speed difference between a native implementation and one using Cinfony is expected to be small.

Table 3: Performance of Cinfony modules compared to a native Java or C++ implementation.

                   Iterate over SDF         Iterate and calculate molecular weight
                   Time (s)   Normalised    Time (s)   Normalised
CDK
  Native Java        21.2       1.00          36.8       1.00
  cdkjython          23.1       1.09          41.6       1.13
  cdkjpype           33.0       1.57          69.5       1.89
OpenBabel
  Native C++         31.9       1.00          43.0       1.00
  pybel              34.1       1.07          45.1       1.05
  jybel              38.0       1.19          49.6       1.15
RDKit
  Native C++         99.7       1.00         100.7       1.00
  rdkit              99.9       1.00         101.0       1.00

The times reported are wallclock times from the best of three runs on a dual-core Intel Pentium 4 3.2 GHz machine with 1GB RAM.

Comparison of toolkits

Cinfony makes it easy to compare the results obtained by different toolkits for the same operations. This can be useful in identifying bugs, applying a test suite, or finding the strengths and weaknesses of particular implementations. For example, where different toolkits calculate the same descriptors, if the calculated values are not highly correlated it may indicate a bug in one or the other. Earlier, we mentioned that a difference in the treatment of implicit hydrogens causes different toolkits to give different values for molecular weight unless hydrogens are explicitly added. Ensuring that a particular result is in agreement with that obtained by another toolkit can act as a sanity check in such instances to avoid errors.

When carrying out the same operation with several toolkits, it is often convenient to iterate over the toolkits in an outer loop:

    from cinfony import obabel, rdkit, cdk

    for toolkit in [obabel, rdkit, cdk]:
        print toolkit.readstring("smi", "CCC").molwt

As an example of how such comparisons can be used to identify bugs in toolkits, let us consider depiction. As a dataset, we randomly chose 100 molecules from PubChem [17], with subsequent filtering to remove multicomponent molecules. For each molecule, PubChem provides an SDF file containing coordinates for a 2D depiction, as well as the depiction itself as a PNG file. PubChem uses the CACTVS toolkit [18] to generate the 2D coordinates as well as the corresponding depiction. Using a script similar to the following, we used Cinfony to generate 2D depictions using OASA (the depiction library used by pybel), the CDK and a development version of RDKit that all use the same 2D coordinates taken from the SDF file:

    from cinfony import pybel, rdkit

    for toolkit in [rdkit, pybel]:
        name = toolkit.__name__
        for mol in toolkit.readfile("sdf", "dataset.sdf"):
            mol.draw(filename = "%s_%s.png" % (mol.title, name),
                     show = False,
                     usecoords = True)

When the resulting images were compared for the PubChem entry CID7250053, an error was found in the depiction of the stereochemistry of an isopropyl group (Figure 2). Since the error only occurred in certain cases, it had not been previously noticed and would have been difficult to identify without such a comparative study. Once reported, the problem was quickly solved and the subsequent RDKit release depicted the stereochemistry correctly.
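The sanity check discussed in this section, comparing one toolkit's value against another's, can be sketched without any toolkit installed. The two "backends" below are hypothetical stand-ins written for this illustration and do not reproduce the behaviour of any real toolkit. They differ only in whether implicit hydrogens are counted, which is the source of disagreement the text mentions for molecular weight. The atomic masses are rounded standard values and the tolerance is an arbitrary choice.

```python
# Rounded standard atomic masses; sufficient for an illustration.
ATOMIC_MASS = {"C": 12.011, "H": 1.008}


def molwt_counts_implicit_h(heavy_atoms, implicit_h):
    """Hypothetical backend that includes implicit hydrogens."""
    heavy = sum(ATOMIC_MASS[el] * n for el, n in heavy_atoms.items())
    return heavy + ATOMIC_MASS["H"] * implicit_h


def molwt_heavy_atoms_only(heavy_atoms, implicit_h):
    """Hypothetical backend that ignores implicit hydrogens."""
    return sum(ATOMIC_MASS[el] * n for el, n in heavy_atoms.items())


def cross_check(backends, heavy_atoms, implicit_h, tolerance=0.01):
    """Run every backend and report whether all values agree."""
    values = dict((name, func(heavy_atoms, implicit_h))
                  for name, func in backends.items())
    spread = max(values.values()) - min(values.values())
    return values, spread <= tolerance


backends = {
    "counts_implicit_h": molwt_counts_implicit_h,
    "heavy_atoms_only": molwt_heavy_atoms_only,
}
# Propane (SMILES "CCC"): three carbons and eight implicit hydrogens.
values, agree = cross_check(backends, {"C": 3}, 8)
print(values, agree)
```

With real toolkits the same outer loop would call each toolkit's own calculation; a failed agreement check then prompts the user to look for a difference in conventions, or a bug, before trusting either number.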
A comparison of depictions by commercial toolkits and depictions generated by Cinfony is available here (see Additional file 3).

Figure 2: Comparison of depictions of PubChem CID7250053 using different toolkits. The depiction using the development version of RDKit showed incorrect stereochemistry for the isopropyl substituent of the thiazole ring.

Conclusion

Cinfony makes it easy to combine complementary features of the three main Open Source cheminformatics toolkits. By presenting a standard simplified API, the learning curve associated with starting to use a new toolkit is greatly reduced, thus encouraging users of one toolkit to investigate the potential of others.

Cinfony is freely available from the Cinfony website [19], both as Python source code and as a Windows distribution containing dependencies. Installation instructions are provided for MacOSX, Linux and Windows.

Availability and requirements

Project name: Cinfony
Project home page: Cinfony website [19]
Operating system(s): Platform independent
Programming language: Python, Jython
Other requirements: OpenBabel, CDK, RDKit, Java, OASA, JPype, Python Imaging Library
License: BSD
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NMOB conceived and developed Cinfony. GRH is the lead developer of OpenBabel and created the Python and Java SWIG bindings. All authors read and approved the final manuscript.

Additional material

Additional file 1: Miniwebsite API. A mini-website of the Cinfony API documentation.
Additional file 2: Timing Code. A zip file containing Python, Java and C++ code used for run time comparisons for two test cases.
Additional file 3: Miniwebsite Depictions. A mini-website showing a comparison of the depictions generated by several cheminformatics toolkits.

Acknowledgements

Cinfony would not be possible without the work of many Open Source projects. In particular, we thank several developers who responded quickly to bug reports or queries: Beda Kosata (OASA), Greg Landrum (RDKit), Tim Vandermeersch (OpenBabel), Steve Ménard (JPype). Thanks also to Gilbert Mueller and Chris Morley for feedback on installing Cinfony. NMOB thanks Google Code for providing free web hosting and development tools for Cinfony. We thank the anonymous reviewers for several useful suggestions.

References

1. OpenBabel v.2.2.0.
2. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics. Curr Pharm Des 2006, 12:2110-2120.
3. Landrum G: RDKit.
4. Murray-Rust P, Rzepa HS: Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J Chem Inf Comput Sci 1999, 39:928-942.
5. Apodaca R, O'Boyle N, Dalke A, Van Drie J, Ertl P, Hutchison G, James CA, Landrum G, Morley C, Willighagen E, De Winter H: OpenSMILES.
6. Daylight Chemical Information Systems Manual.
7. O'Boyle NM, Morley C, Hutchison GR: Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit. Chem Cent J 2008, 2:5.
8. Kosata B: OASA.
9. Raymond ES: The Art of UNIX Programming 2003 [~esr/writings/taoup/index.html]. Reading, MA: Addison-Wesley.
10. Symyx CTfile formats [ctfile/ctfile.jsp].
11. KNIME – Konstanz Information Miner.
12. SWIG v.1.3.36.
13. Ménard S: JPype.
14. Boost.Python.
15. R development core team: R: A language and environment for statistical computing.
16. Irwin JJ, Shoichet BK: ZINC – A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Model 2005, 45:177-182.
17. PubChem.
18. CACTVS Chemoinformatics Toolkit: Xemistry GmbH: Lahntal, Germany.
19. O'Boyle NM: Cinfony.