Reproducibility in cheminformatics and computational chemistry research: certainly we can do better than this



  1. 1. Reproducibility in cheminformatics and computational chemistry research: Certainly we can do better than this Gregory Landrum Ph.D. NIBR IT Novartis Institutes for BioMedical Research Basel GCC 2012 Goslar
  2. 2. Outline §  Reproducibility? §  Requirements for reproducibility of published research §  Practical aspects Landrum, G. A. & Stiefl, N. Is that a scientific publication or an advertisement? Reproducibility, source code and data in the computational chemistry literature. Future Medicinal Chemistry 4, 1885–1887 (2012).
  3. 3. But first!
  4. 4. A new fingerprint for similarity-based virtual screening
     §  Start with Morgan fingerprints (a.k.a. circular fingerprints [1])
     §  The usual FCFP algorithm uses fairly crude feature definitions
     §  Combine the RDKit Morgan fingerprint algorithm with pharmacophoric features calculated using “better” feature definitions [2]
     Feature definitions (SMARTS):
       "[$([N;!H0;v3,v4&+1]),$([O,S;H1;+0]),n&H1&+0]", // Donor
       "[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=[O,N,P,S])]),n&H0&+0,$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]", // Acceptor
       "[a]", // Aromatic
       "[F,Cl,Br,I]", // Halogen
       "[#7;+,$([N;H2&+0][$([C,a]);!$([C,a](=O))]),$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]", // Basic
       "[$([C,S](=[O,S,P])-[O;H1,-1])]" // Acidic
     1.  Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
     2.  Gobbi, A. & Poppinger, D. Genetic Optimization of Combinatorial Libraries. Biotechnology and Bioengineering (Combinatorial Chemistry) 61, 47–54 (1998).
  5. 5. Validation data §  Diverse ChEMBL actives for 50 target classes [1] §  Data taken from ChEMBL v14 §  Active: reported activity < 10 µM and confidence = 9 §  Diverse: 100 actives picked using RDKit’s implementation of the MaxMin algorithm [2] with radius 0 Morgan fingerprints (ECFP-like) §  Inactives: 10000 molecules selected from the ZINC druglike set. Selection criterion: two randomly selected neighbors (similarity via Morgan0 fingerprint >= 0.5) for each of the 5000 actives 1.  Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. JCIM 51, 1831–1839 (2011). 2.  Ashton, M. et al. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. QSAR & Combinatorial Science 21, 598–604 (2002).
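
The "Diverse" step on this slide relies on MaxMin picking. What follows is a minimal pure-Python sketch of the greedy MaxMin idea, not RDKit's implementation. It assumes each fingerprint is represented as a set of integer feature identifiers and uses 1 - Tanimoto as the distance. The names `tanimoto` and `maxmin_pick` are hypothetical helpers introduced only for illustration.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fingerprints, n_pick, seed=0):
    """Return indices of n_pick mutually diverse items (greedy MaxMin).

    Start from a randomly chosen item, then repeatedly add the candidate
    whose distance to its nearest already-picked item is largest.
    """
    n = len(fingerprints)
    if not 0 < n_pick <= n:
        raise ValueError("n_pick must be between 1 and the number of items")
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    first = rng.randrange(n)
    picked = [first]
    # min_dist[i] = distance from item i to the closest picked item so far
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < n_pick:
        candidates = [i for i in range(n) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i in range(n):
            d = 1.0 - tanimoto(fingerprints[i], fingerprints[nxt])
            min_dist[i] = min(min_dist[i], d)
    return picked
```

Because the first pick is random, the seed is part of what has to be reported if someone else is to recover the same 100 actives.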
  6. 6. Validation procedure §  Repeat 50 times for each data set: •  Randomly pick 5 actives •  Mix the remaining 95 actives with the 10K inactives •  Rank that pool of compounds based on maximum similarity to the 5 actives •  Calculate performance based on enrichment at 5% of the total dataset size (10095) §  Look at average enrichments within each assay §  Compare the new fingerprint to other standard fingerprints: MACCS, Morgan6 (bv + counts), Morgan4 (bv + counts), Morgan0 (bv + counts), Topological Torsions (bv + counts), Atom Pairs (bv + counts), Avalon, 2D Pharmacophore, RDKit, 2 internal fingerprints
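
One trial of the procedure above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the talk: fingerprints are modelled as sets of feature identifiers, the top 5% is taken as `int(fraction * pool size)`, and the enrichment factor is defined as the hit rate in the top slice divided by the hit rate in the whole pool. The function names are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def enrichment_factor(ranked_is_active, n_actives_total, fraction=0.05):
    """Hit rate in the top `fraction` of the list over the overall hit rate."""
    n_total = len(ranked_is_active)
    n_top = max(1, int(fraction * n_total))  # rounding down is an assumption
    hits = sum(ranked_is_active[:n_top])
    return (hits / n_top) / (n_actives_total / n_total)


def one_trial(actives, inactives, n_query=5, fraction=0.05, rng=random):
    """Pick query actives, rank the rest by max similarity, return the EF."""
    query_idx = set(rng.sample(range(len(actives)), n_query))
    queries = [actives[i] for i in query_idx]
    rest = [fp for i, fp in enumerate(actives) if i not in query_idx]
    pool = [(fp, True) for fp in rest] + [(fp, False) for fp in inactives]
    rng.shuffle(pool)  # so that ties are not broken in favour of actives
    pool.sort(key=lambda item: max(tanimoto(item[0], q) for q in queries),
              reverse=True)
    ranked = [is_active for _, is_active in pool]
    # The active count is derived from the data rather than typed in by hand.
    return enrichment_factor(ranked, len(rest), fraction)
```

Repeating `one_trial` 50 times per data set with a seeded `random.Random` instance and averaging the results would mirror the outline on the slide.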
  7. 7. Results §  The new fingerprint is the best for 29 of the 50 datasets [Plot; labels: FeatureMorgan2, Morgan0]
  8. 8. Back to the talk… §  Reproducibility? §  Requirements for reproducibility of published research §  Practical aspects Landrum, G. A. & Stiefl, N. Is that a scientific publication or an advertisement? Reproducibility, source code and data in the computational chemistry literature. Future Medicinal Chemistry 4, 1885–1887 (2012).
  9. 9. Reproducibility Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct. Mathematics papers are expected to contain a proof complete enough to allow knowledgeable readers to fill in any details. Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension. Mesirov, J. P. Accessible Reproducible Research. Science 327, 415–416 (2010).
  10. 10. Reproducibility An author’s central obligation is to present an accurate and complete account of the research performed, absolutely avoiding deception, including the data collected or used, as well as an objective discussion of the significance of the research. Data are defined as information collected or used in generating research conclusions. The research report and the data collected should contain sufficient detail and reference to public sources of information to permit a trained professional to reproduce the experimental observations. ACS “Ethical Guidelines to Publication of Chemical Research”
  11. 11. Reproducibility With these thoughts in mind, the editors of journals published by the American Chemical Society now present a set of ethical guidelines for persons engaged in the publication of chemical research, specifically, for editors, authors, and manuscript reviewers. These guidelines are offered not in the sense that there is any immediate crisis in ethical behavior, but rather from a conviction that the observance of high ethical standards is so vital to the whole scientific enterprise that a definition of those standards should be brought to the attention of all concerned. We believe that most of the guidelines now offered are already understood and subscribed to by the majority of experienced research chemists. They may, however, be of substantial help to those who are relatively new to research. Even well-established scientists may appreciate an opportunity to review matters so significant to the practice of science ACS “Ethical Guidelines to Publication of Chemical Research”
  12. 12. Reproducibility Experimental reproducibility is the coin of the scientific realm. The extent to which measurements or observations agree when performed by different individuals defines this important tenet of the scientific method. The formal essence of experimental reproducibility was born of the philosophy of logical positivism or logical empiricism, which purports to gain knowledge of the world through the use of formal logic linked to observation. A key principle of logical positivism is verificationism, which holds that every truth is verifiable by experience. In this rational context, truth is defined by reproducible experience, and unbiased scientific observation and determinism are its underpinnings. … The assumption that objectively true scientific observations must be reproducible is implicit, yet direct tests of reproducibility are rarely found in the published literature. This lack of published evidence of reproducibility stems from the limited appeal of studies reproducing earlier work to most funding bodies and to most editors. Furthermore, many readers of scientific journals— especially of higher-impact journals—assume that if a study is of sufficient quality to pass the scrutiny of rigorous reviewers, it must be true; this assumption is based on the inferred equivalence of reproducibility and truth described above. Loscalzo, J. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation 125, 1211–1214 (2012).
  13. 13. If it’s not reproducible science? “Let me show you some cool pictures from my lab…”
  14. 14. Requirements for Reproducibility §  Data used §  Code/algorithm description §  Results Peng, R. D. Reproducible Research in Computational Science. Science 334, 1226–1227 (2011).
  15. 15. Requirements for Reproducibility: Data §  This is a no brainer, right? §  Unless it’s completely unprocessed (or the processing is part of the detailed method description/code), it’s better to include the actual data §  For sources like ChEMBL, a version number and SQL to grab the data are probably adequate §  “Ligands from PDB structures X, Y, and Z” probably not good enough
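
The "version number and SQL" suggestion can be made concrete by saving a small provenance record next to the extracted data. The sketch below is hypothetical throughout: `toy_activities` is an invented in-memory table, not the ChEMBL schema, and `extract_with_provenance` is an illustrative helper. The filter mirrors the activity < 10 µM (10000 nM) and confidence = 9 criteria used earlier in the talk.

```python
import hashlib
import json
import sqlite3

# Hypothetical query against a toy table; a real ChEMBL query would differ.
QUERY = """
SELECT compound_id, target_id, activity_nm
FROM toy_activities
WHERE activity_nm < 10000 AND confidence = 9
ORDER BY compound_id, target_id
"""


def make_toy_db():
    """Build a tiny in-memory database standing in for a versioned source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toy_activities ("
        "compound_id TEXT, target_id TEXT, activity_nm REAL, confidence INTEGER)"
    )
    conn.executemany(
        "INSERT INTO toy_activities VALUES (?, ?, ?, ?)",
        [
            ("CPD-1", "TGT-A", 50.0, 9),
            ("CPD-2", "TGT-A", 20000.0, 9),  # too weak: excluded
            ("CPD-3", "TGT-B", 900.0, 8),    # low confidence: excluded
            ("CPD-4", "TGT-B", 9999.0, 9),
        ],
    )
    return conn


def extract_with_provenance(conn, query, db_version):
    """Run the query and return the rows plus a record describing them."""
    rows = conn.execute(query).fetchall()
    digest = hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()
    record = {
        "db_version": db_version,
        "query": query.strip(),
        "n_rows": len(rows),
        "sha256": digest,  # lets a reader confirm they got the same rows
    }
    return rows, record
```

The `ORDER BY` clause matters here: without a fixed row order the checksum could differ between runs even when the same rows come back.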
  16. 16. Requirements for Reproducibility: Data As a condition of publication, authors must agree to make available all data necessary to understand and assess the conclusions of the manuscript to any reader of Science. Data must be included in the body of the paper or in the supplementary materials, where they can be viewed free of charge by all visitors to the site. Certain types of data must be deposited in an approved online database, including DNA and protein sequences, microarray data, crystal structures, and climate records. index.xhtml#data_faq
  17. 17. Requirements for Reproducibility: Data §  What about chemical structures? •  a table with drawings of molecules? •  names instead of structures? §  Why not include the structures in a machine-readable format? This expanded use of electronic resources offers an excellent opportunity to make chemical information more accessible and user-friendly to readers of scientific papers. To take advantage of these opportunities, we have developed several online features that expand the usefulness of chemical compound information for Nature Chemical Biology readers … In all original research papers, compounds that are relevant to the background or results of the paper are assigned a bolded, Arabic numeral that serves as a unique identifier for the compound. Each numerical abbreviation in the HTML and PDF versions of the article is linked to a Compound Data page, which shows the structure and the IUPAC or common name of the chemical compound. From there, readers can download a ChemDraw file of the compound…To provide readers with rapid access to all of the chemical compounds discussed in an article, we feature a Compound Data Index page, which is accessible from the Compound Data page, the table of contents entry for the paper, and the navigation tools on the right side of the Nature Chemical Biology website.
  18. 18. Requirements for Reproducibility: Chemical Data From Nature Chemical Biology
  19. 19. Requirements for Reproducibility: Code Data and materials availability All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled. Any restrictions on the availability of data, codes, or materials, including fees and original data obtained from other sources (Materials Transfer Agreements), must be disclosed to the editors upon submission. gen_info.xhtml#dataavail
  20. 20. Requirements for Reproducibility: Code An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript, including details of how readers can obtain materials and information. If materials are to be distributed by a for-profit company, this must be stated in the paper. In the meantime, researchers must, when they are arranging the commercialization of their work, bear in mind the implications that these deals may have on their freedom to publish to the standards that the community is entitled to expect. n7098/full/442001a.html
  21. 21. Requirements for Reproducibility: Code Ince, D. C., Hatton, L. & Graham-Cumming, J. The case for open computer programs. Nature 482, 485–488 (2012). We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail.
  22. 22. Requirements for Reproducibility: Code §  “Black box” code sharing: installing the software on a publicly accessible server, or providing executables for people to test §  Does this help with reproducibility? §  Not cut and dried. Needs discussion
  23. 23. Requirements for Reproducibility: Results §  Including the actual results is even more of a no brainer, right? Homology Models of Human All-Trans Retinoic Acid Metabolizing Enzymes CYP26B1 and CYP26B1 Spliced Variant Homology models of CYP26B1 (cytochrome P450RAI2) and CYP26B1 spliced variant were derived using the crystal structure of cyanobacterial CYP120A1 as template for the model building. The quality of the homology models generated were carefully evaluated, and the natural substrate all-trans-retinoic acid (atRA), several tetralone-derived retinoic acid metabolizing blocking agents (RAMBAs), and a well-known potent inhibitor of CYP26B1 (R115866) were docked into the homology model of full-length cytochrome P450 26B1. The results show that in the model of the full-length CYP26B1, the protein is capable of distinguishing between the natural substrate (atRA), R115866, and the tetralone derivatives. The spliced variant of CYP26B1 model displays a reduced affinity for atRA compared to the full-length enzyme, in accordance with recently described experimental information. This paper, presenting two new homology models, does not include either model. Unfortunately I didn’t have to search long to find this example
  24. 24. How are we doing?
      §  Survey of recent publications:
         •  Everything in JCIM vol 52 #10
         •  Everything in JCAMD vol 26 #10
         •  Journal of Cheminformatics from July 2012-Nov 4 2012
      §  Big differences between journals
      §  Plenty of room for improvement

      Journal   Type of paper  Count  Full Data  Partial Data  Missing Data  Code?
      JCIM      Method         13     6          3             4             1
      JCIM      Non-method     16     10         3             3             0
      JCAMD     Method         3      3          0             0             0
      JCAMD     Non-method     4      0          3             1             0
      JChemInf  Method         12     7          3             3             8
      JChemInf  Non-method     3      0          0             0             0
  25. 25. Practical considerations §  Where to put the data and code? •  Supplementary material •  Code-sharing sites (, google code, github) •  Figshare §  Considerations: •  It needs to still be there 10+ years from now •  Having a solid connection to the original paper is good
  26. 26. Tools for reproducible research: KNIME §  Open-source workflow tool §  Strong data manipulation and mining capabilities §  Data and results can be stored with the workflow.
  27. 27. Tools for reproducible research IPython notebook §  Python session running in a browser •  Tab completion •  Access to docstrings §  Text formatting options available for including discussion or capturing mathematics §  Captures all data transformations and displays output §  Tight integration with matplotlib
  28. 28. Tools for reproducible research IPython notebook
  29. 29. Tools for reproducible research IPython notebook
  30. 30. Tools for reproducible research IPython notebook
  31. 31. Tools for reproducible research IPython notebook
  32. 32. Back to the earlier interruption §  Data? YES §  Solid description of method? YES §  Code? NO Still ok, though, right?
  33. 33. Ooops §  I had a typo in the script where I calculated EF_5 for the new fingerprint: ef_5 = calcEnrichment(rankedSims,nActivesTotal=80) (nActivesTotal should be 95) §  Fixing that yields: [Plot; labels: FeatureMorgan2, Morgan0] The new fingerprint is no better than the others.
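
The effect of a mistyped active count can be illustrated with a small sketch. The implementation of `calcEnrichment` is not shown in the talk, so `enrichment_factor` below is a hypothetical stand-in that assumes the enrichment factor is the hit rate in the top 5% divided by the overall hit rate. The ranking, with 40 of the 95 actives in the top slice, is invented purely for illustration.

```python
def enrichment_factor(ranked_is_active, n_actives_total, fraction=0.05):
    """Hit rate in the top `fraction` of the list over the overall hit rate."""
    n_total = len(ranked_is_active)
    n_top = max(1, int(fraction * n_total))
    hits = sum(ranked_is_active[:n_top])
    return (hits / n_top) / (n_actives_total / n_total)


def checked_enrichment(ranked_is_active, n_actives_total, fraction=0.05):
    """Refuse to compute an EF if the stated active count disagrees with the data."""
    n_found = sum(ranked_is_active)
    if n_found != n_actives_total:
        raise ValueError(
            f"expected {n_actives_total} actives, ranked list contains {n_found}"
        )
    return enrichment_factor(ranked_is_active, n_actives_total, fraction)


# Invented ranking: 10095 compounds, 95 actives, 40 of them in the top 504.
ranked = [True] * 40 + [False] * 464 + [True] * 55 + [False] * 9536

ef_typo = enrichment_factor(ranked, n_actives_total=80)            # wrong count
ef_fixed = enrichment_factor(ranked, n_actives_total=sum(ranked))  # 95 actives
```

Under this definition the understated count inflates the result by a factor of 95/80, and only for the fingerprint whose script contains the typo. A guard such as `checked_enrichment`, or simply deriving the count from the data, would have caught it; so would a reader with access to the script.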
  34. 34. Requirements for Reproducibility §  Data used §  Code/algorithm description §  Results
  35. 35. Perhaps the biggest barrier to reproducible research is the lack of a deeply ingrained culture that simply requires reproducibility for all scientific claims. Peng, R. D. Reproducible Research in Computational Science. Science 334, 1226–1227 (2011).
  36. 36. Acknowledgements §  NIBR: •  Nik Stiefl (GDC/CADD) •  Nikolas Fechner (NIBR IT/IS Sigma) •  Sereina Riniker (NIBR IT/IS Sigma) §  Matthias Rarey