
Evaluating the quality and performance of automatic atom mapping algorithms



Presented by Daniel Lowe at UK QSAR Autumn 2012



Daniel Lowe and Roger Sayle, NextMove Software Ltd, Cambridge, UK
daniel@nextmovesoftware.co.uk

1. Introduction

Automatic atom mapping algorithms work on chemical reactions to produce mappings between the atoms in the reactants and atoms in the product(s).

[Figure: example reaction]

In some cases, such as the above example, a reactant may appear multiple times in the product due to the input lacking the exact stoichiometry.

Chemically plausible mappings are potentially useful for:
  • Assigning roles to reagents, hence allowing determination of whether they should be best placed above the reaction arrow
  • Reaction normalization for registration
  • Performing more precise database searches
  • Identifying suspect reactions, e.g. those where a reactant is missing

In this work we investigated the strengths and weaknesses of currently available solutions to this problem.

2. Methodology

We evaluated the following algorithms on four sets of reactions:

Vendor:Program                           Version
ChemAxon:Marvin[1]                       5.10.1
GGA:Indigo[2]                            1.1
InfoChem:ICMAP[3]                        5.10
PerkinElmer:ChemDraw Ultra[4]            12.0
Accelrys:Pipeline Pilot[5]               -
Accelrys:Cheshire Advanced Edition[6]    -

Test set                                                           Reactions
Pharmaceutical ELN subset                                             18,244
ChemReact68 database                                                  67,926
SPRESI database subset                                                 5,230
Reactions extracted from 2008-2011 USPTO patent applications[7]      562,872

Two configurations were used with Indigo: one with the default mapping settings and one with more lenient settings for matching valences, charges and bond orders. In both cases a 60 second timeout was explicitly specified. Marvin was configured to use its best quality mapping strategy.

Input and output were reaction SMILES, with the exception of ICMAP, which required conversion to and from RDF, and Cheshire and Pipeline Pilot, which required conversion of their RDF output to SMILES.

3. Results

Where reactions are valid, an ideal algorithm will be able to find a mapping for every product atom:

[Figure: bar chart, "Percent of reactions with all product atoms mapped", for each test set (PharmaELN, ChemReact68, SPRESI, USPTO). Series: Marvin 5.10, ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot, Cheshire.]

The mapping should be chemically reasonable. Heuristically this may be evaluated by comparing the number of C-C bonds broken. A complete mapping with fewer C-C bond breakages is more likely to be correct.

[Figure: bar chart, "Average number of C-C bonds broken per mapping", for each test set (PharmaELN, ChemReact68, SPRESI, USPTO). Series (mapping algorithm): Marvin 5.10, ChemDraw 12, Indigo 1.1, Indigo 1.1 (lenient), ICMAP 5.10, Pipeline Pilot, Cheshire.]

4. Discussion

ICMAP was found to produce the most chemically plausible atom mappings, whilst Pipeline Pilot was able to successfully map the most reactions. An alternative measure of plausibility based on the energy of bonds broken was also tested and found to correlate strongly with the number of C-C bonds broken.

Reuse of reactants was found not to be supported by Marvin, ChemDraw, Pipeline Pilot or Cheshire, and was often not performed correctly by Indigo, leading to clearly incorrect mappings:

[Figure] Example of bad mapping. All algorithms other than ICMAP placed atom maps on the pyridine.

A significant limitation found in ICMAP was the mapping of single atoms:

[Figure] Example of incomplete mapping from ICMAP. Note that in this case algorithms supporting single atom mapping incorrectly picked the methyl from the Et2Zn!

Although not quantitatively evaluated, the speed of the algorithms differed significantly. The USPTO set could be processed in hours through ChemDraw or ICMAP, a day through Marvin, but weeks through Indigo.

5. Conclusions

ICMAP and Pipeline Pilot produced the best results, with the trade-off between recall and precision determining which would be most appropriate to a given task. Of the solutions tested only Indigo is freely available; when configured appropriately it produced adequate results.

6. Acknowledgements

We would like to thank InfoChem for providing an evaluation of ICMAP, and Hans Kraut and Nick Tomkinson for assistance and comments on this work.

7. Bibliography

  1. http://www.chemaxon.com/products/marvin
  2. http://ggasoftware.com/opensource/indigo
  3. http://infochem.de/products/software/icmap.shtml
  4. http://www.cambridgesoft.com/software/chemdraw
  5. http://accelrys.com/products/pipeline-pilot/
  6. http://accelrys.com/products/informatics/cheminformatics/accelrys-cheshire.html
  7. Lowe, D. M. Automated Extraction of Reactions from the Patent Literature. 243rd ACS National Meeting & Exposition, San Diego, CA, March 27, 2012.

NextMove Software Limited
Innovation Centre (Unit 23), Cambridge Science Park, Milton Road, Cambridge, England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
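
The two measures reported in the Results (whether every product atom is mapped, and how many C-C bonds a mapping breaks) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the completeness check uses a simple regular-expression tokenizer rather than a full SMILES parser, and the bond count assumes the connectivity has already been extracted by a cheminformatics toolkit into pairs of atom-map numbers. It also assumes a reactant C-C bond counts as broken when both carbons are mapped but are no longer bonded in the product.

```python
import re

# One SMILES atom: a bracket atom, or an organic-subset atom.
# Two-letter symbols come first so "Cl" is not read as "C".
ATOM_RE = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# The atom-map label at the end of a bracket atom, e.g. the 3 in [CH3:3].
MAP_RE = re.compile(r":(\d+)\]")


def all_product_atoms_mapped(rxn_smiles):
    """Return True if every product atom has a non-zero atom-map number.

    Expects reaction SMILES of the form reactants>agents>products.
    Atoms written without brackets cannot carry a map and count as unmapped.
    """
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected reactants>agents>products")
    atoms = ATOM_RE.findall(parts[2])
    if not atoms:
        return False
    for atom in atoms:
        match = MAP_RE.search(atom)
        if match is None or int(match.group(1)) == 0:
            return False
    return True


def cc_bonds_broken(reactant_bonds, product_bonds, carbon_maps):
    """Count reactant C-C bonds that are absent from the product.

    Bonds are pairs of atom-map numbers (assumed input format);
    carbon_maps is the set of map numbers that label carbon atoms.
    """
    product = {frozenset(bond) for bond in product_bonds}
    broken = 0
    for a, b in reactant_bonds:
        is_cc = a in carbon_maps and b in carbon_maps
        if is_cc and frozenset((a, b)) not in product:
            broken += 1
    return broken


# Esterification of acetic acid with methanol, fully mapped.
mapped = "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]>>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
# Same reaction, but the methanol-derived atoms are left unmapped.
partial = "[CH3:1][C:2](=[O:3])[OH:4].CO>>[CH3:1][C:2](=[O:3])OC"

reactant_bonds = [(1, 2), (2, 3), (2, 4), (5, 6)]
carbons = {1, 2, 5}
# Plausible mapping: the acetyl C-C bond (1, 2) survives.
good_product = [(1, 2), (2, 3), (2, 6), (6, 5)]
# Implausible mapping: the two methyl carbons are swapped.
swapped_product = [(5, 2), (2, 3), (2, 6), (6, 1)]

print(all_product_atoms_mapped(mapped))    # True
print(all_product_atoms_mapped(partial))   # False
print(cc_bonds_broken(reactant_bonds, good_product, carbons))     # 0
print(cc_bonds_broken(reactant_bonds, swapped_product, carbons))  # 1
```

In the example, both mappings are complete, but the one that swaps the two methyl groups breaks a C-C bond, so under the heuristic it is the less plausible of the two.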