Reproducible research in molecular biophysics and structural biology

1,385 views

Published on

Reproducible science in the context of molecular biophysics and structural biology: state of the art and outlook.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,385
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
36
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Reproducible research in molecular biophysics and structural biology

  1. 1. Reproducible research in molecular biophysics and structural bioinformatics Konrad HINSEN Centre de Biophysique Moléculaire, Orléans, France and Synchrotron SOLEIL, Saint Aubin, France 13 December 2012Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 1 / 36
  2. 2. ReproducibilityOne of the ideals of science Scientific results should be verifiable. Verification requires reproduction by other scientists. Few results actually are reproduced, but it’s still important to make this possible: important for the credibility of science in society (remember “Climategate”, cold fusion, ...) important for the credibility of a specific study The more detail you provide about what you did, the more your peers are willing to believe that you did what you claim to have done. Reproducibility also matters for efficient collaboration inside a team. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 2 / 36
  3. 3. Reproducibility in practiceNear-perfect in mathematicsNo journal publishes a theorem unless the author provides a proof.Often best-effort in experimental science Lab notebooks record all the details for in-house replication. Published protocols are less detailed, but often clear enough for an expert in the field. The main limitation is technical: lab equipment and concrete samples cannot be reproduced identically.Lousy in computational science Papers give short method summaries and concentrate on results. Few scientists can reproduce their own results after a few months. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 3 / 36
  4. 4. Reproducibility in computational scienceWe could do better than experimentalists Results are deterministic, fully determined by input data and algorithms. If we published all input data and programs, anyone could reproduce the results exactly.But we don’t Lots of technical difficulties. Important additional effort. Few incentives.Goals of the Reproducible Research movement Create more awareness of the problem. Provide better tools. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 4 / 36
  5. 5. Reproducible (computational) Research matters“Climategate”In 2009 a server at the Climatic Research Unit at the University ofEast Anglia was hacked and many of its files became public. One ofthem describes a scientist’s difficulties to reproduce his colleagues’results. This has been used by climate change skeptics to discreditclimate research.Protein structure retractionsIn 2006, three important protein structures published in Sciencewere retracted following the discovery of a bug in the software usedfor data processing.For more examples and details: Z. Merali, “...Error ... why scientific programming does not compute”, Nature 467, 775 (2010) Science Special Issue on Computational Biology, 13 April 2012 Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 5 / 36
  6. 6. Today’s tools for Reproducible Research Version control for source code and data Mercurial, Git, Subversion, ... Great for source code, usable for small data sets Workflow management systems VisTrails, Taverna, Kepler, ... Preservation of computational procedures and provenance tracking, great for scientists who like graphical workflows Electronic lab notebooks Mathematica, IPython notebook, ... Preservation of computational procedures and provenance tracking, great for scientists who like scripting Publication tools for computations Collage, IPOL, myExperiment, PyPedia, RunMyCode, SHARE, ... Experimental and/or domain-specific Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 6 / 36
  7. 7. Missing pieces (1/2)Integration of version control and provenance trackingRecord for reproduction that dataset X was obtained from version 5of dataset Y using version 2.2 of program Z compiled with gccversion 4.1.Version control for big datasetsVersion control tools are made for text-based formats.Provenance tracking and workflows across machinesIf you do part of your computations elsewhere (supercomputer, lab’scluster, ...), existing tools won’t work for you.Tool-independent standard file formatsIf Alice wants to reproduce Bob’s results, she needs to use Bob’stools for version control and workflows/notebooks. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 7 / 36
  8. 8. Missing pieces (2/2)Reproducible floating-point computationsConflict between reproducibility and performance: Reproducibility for IEEE float arithmetic requires an exact sequence of load, store, and arithmetic operations. Performance optimization requires shuffling around operations depending on processor type, cache size, memory access speed, number of processors, etc. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 8 / 36
  9. 9. Reproducibility in biomolecular simulations (1/3)Heavy resource requirements Most important technique: Molecular Dynamics Long runs (days to weeks) on big parallel machines Produces large datasets (a few GB per trajectory) Trajectory analysis can be as expensive as the simulation itselfDatasets are too big for version control.Today’s provenance trackers can’t keep track of computations acrossmachines. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural December 2012 13 bioinformatics 9 / 36
  10. 10. Reproducibility in biomolecular simulations (2/3)Complexity Computational models contain thousands of parameters No complete formal description of the models (except for program source code) Simulation algorithms contain many approximations made for performance reasons These approximations are usually undocumentedPublishing a complete description of a molecular simulation is sodifficult that very few scientists could do it and equally few couldunderstand the resulting description.There is no file format for storing a complete description of amolecular simulation in machine-readable form. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 10 / 36
  11. 11. Reproducibility in biomolecular simulations (3/3)A split community A small number of “developers” with backgrounds in physics and chemistry A large number of “users” with backgrounds in biology and biochemistry A negligibly small number of computing specialists Users don’t care about the technical details of the simulations. Users (have to) trust the developers blindly. Developers understand the models but have no training in software development or data management.The field is dominated by a few large monolithic simulation packageswhose internal workings are complex and mostly undocumented.Intermediate stages of processing in a simulation workflow areimpossible to access or available only in program-specific formats. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 11 / 36
  12. 12. Complexity: the Amber force field (1/2)The Amber force field (http://ambermd.org/#ff) is one of the mostpopular computational models for biomolecules. In a research paper,it is typically described by its formula for the potential energy: (0) 2 U = kij rij − rij bonds ij (0) 2 + kijk φijk − φijk angles ijk + kijkl cos (nijkl θijkl − δijkl ) dihedrals ijkl 12 6 σij σij + all pairs ij 4 ij r 12 − r6 qi qj + all pairs ij 4π 0 rij Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 12 / 36
  13. 13. Complexity: the Amber force field (2/2)In addition to that formula, we need rules for summing over bonds, angles, etc. rules for finding the values of all the parameters for a given moleculeMost of these rules can be found on the Amber Web site (with someeffort), but not in any research paper.One rule is, to the best of my knowledge, not written down anywhere.It concerns the attribution of “improper dihedral” terms. I obtained itfrom the Amber developers by asking a precise question by e-mail.That rule refers to the indices of the atoms in the input file definingthe molecular structure. These indices are chosen arbitrarily. Makingan energy dependent on them is profoundly unphysical.How many users of the Amber force field know about this unphysicaldependence on atom indices? The effect is very small. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 13 / 36
  14. 14. A real-life case of (non-)reproducibility (1/5)Goal: find the gas-phase structure of short peptide sequences bymolecular simulation orEarlier work on this topic1 finds that Ac A15 K + H+ forms a helix, butAc K A15 + H+ forms a globule.My simulations predict that both sequences form globules.What did I do differently? 1 M.F. Jarrold, Phys. Chem. Chem. Phys. 9, 1659 (2007) Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 14 / 36
  15. 15. A real-life case of (non-)reproducibility (2/5)From the paper: Molecular Dynamics (MD) simulations were performed to help interpret the experimental results. The simulations were done with the MACSIMUS suite of programs [31] using the CHARMM21.3 parameter set. A dielectric constant of 1.0 was employed.The URL in ref. 31 is broken, but Google helps me find MACSIMUSnevertheless. It’s free and comes with a manual! ¨I download the “latest release” dated 2012-11-09.But which one was used for that paper in 2007? Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 15 / 36
  16. 16. A real-life case of (non-)reproducibility (3/5)MACSIMUS comes with a file charmm21.par that starts with! Parameter File for CHARMM version 21.3 [June 24, 1991]! Includes parameters for both polar and all hydrogen topology files! Based on QUANTA Parameter Handbook [Polygen Corporation, 1990]! Modified by JK using various sourcesDid that paper use the “polar” or the “all hydrogen” topology files?I can find only one set of topology files named charmm21, and that’swith polar hydrogens only, so I guess “polar”.Now I have the parameters, but I don’t know the rules of theCHARMM force field, nor can I be sure that MACSIMUS uses the samerules as the CHARMM software. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 16 / 36
  17. 17. A real-life case of (non-)reproducibility (4/5)From the paper: A variety of starting structures were employed (such as helix, sheet, and extended linear chain) and a number of simulated annealing schedules were used in an effort to escape high energy local minima. Often, hundreds of simulations were performed to explore the energy landscape of a particular peptide. In some cases, MD with simulated annealing was unable to locate the lowest energy conformation and more sophisticated methods were used (see description of evolutionary based methods below).I might as well give up here...My point is not to criticize this particular paper.The level of description is typical for the field. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 17 / 36
  18. 18. A real-life case of (non-)reproducibility (5/5)What I would have liked to get: a machine-readable file containing a full specification of the simulated system: chemical structure all force field terms with their parameters the initial atom positions a script implementing the annealing protocol links to all the software used in the simulation, with version numbersAll that with persistent references (DOIs). Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 18 / 36
  19. 19. Enough whining... Let’s do something to improve the situ- tion!Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 19 / 36
  20. 20. 1 A data model for molecular simulations2 A data model for executable papers3 Teaching best practices to computational scientists Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 20 / 36
  21. 21. Data handling in molecular simulationThere is no standard file format for molecular simulations.Best approximation: the PDB format made for a different purpose (crystallography) lacks precision for coordinates lacks the possibility to store anything else but coordinates limited to 10.000 atoms few if any programs respect it completelyOK... so let’s define something better... ... which is a far from trivial task. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 21 / 36
  22. 22. What do we want to archive and exchange? the system we are simulating the chemical structure of the molecules the number of each kind of molecule the topology of the system (infinite, 3D periodic, 2D periodic, ...) symmetries (crystals, ...) configurations (positions of all atoms plus PBC parameters) velocities, masses, charges, and other per-atom properties force field parameters force fields (the rules) trajectories experimental data used in simulations ... lots of other stuff ...Don’t try to do everything at once: we want a modular set ofconventions. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 22 / 36
  23. 23. Which systems, which simulation techniques? anything at the atomic and molecular scale: gases, liquids, solids from atoms via small molecules to macromolecules all-atom and coarse-grained models from quantum models via classical mechanics to stochastic dynamicsApplication domains: Chemical/statistical physics Solid-state physics Chemistry Molecular/structural biology Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 23 / 36
  24. 24. Mosaic: a data model for molecular simulationMOlecular SimulAtion Interchange Conventionshttp://bitbucket.org/molsim/mosaic/Mosaic is a data model with (currently) two representations.Data model expresses molecular simulation data in terms of basic data items and structures: numbers, strings, lists, trees, ...Representation defines how a data model is stored as a sequence of bytes in a file. Conversion between representations is exact. This is work in progress. In order to succeed, it must become a community project. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 24 / 36
  25. 25. Mosaic: current stateData items System topology, chemical structure Configurations Per-atom/site numerical properties (velocities, charges, masses, . . .) Per-atom/site labels (atom names, atom types, . . .)Representations HDF5 for big data sets, fast access, and easy interfacing to C/C++/Fortran code XML for processing with standard XML tools Planned: JSON for interfacing with Web-based tools Three in-memory representations for PythonTools Conversion between representations biophysics and structural bioinformatics Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular 13 December 2012 25 / 36
  26. 26. HDF5 (Hierarchical Data Format, version 5) Platform-independent binary root group storage groups Efficient library for data handling Metadata (provenance, dependencies, . . . ) Well-established: used by big organizations (NASA, . . .) for very large data sets (meteorology . . .) Ideal for exchanging and archiving data datasets (≈ arrays) Suitable for very large data sets Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 26 / 36
  27. 27. So here’s my model for the gas phase peptide...1 <?xml version="1.0" encoding="utf-8"?>2 <mosaic version="0.1">3 <universe cell_shape="infinite" convention="MMTK" id="universe">4 <molecules>5 <molecule count="1">6 <fragment label="" species="protein">7 <fragments>8 <fragment label="A" polymer_type="polypeptide" species="ACE,Ala,Ala,Ala,Ala,Ala,Ala,Al9 <fragments>0 <fragment label="ACE0" species="ace_beginning">1 <atoms>2 <atom label="CBond" name="C" type="element" />3 <atom label="CH3" name="C" type="element" />4 <atom label="HH31" name="H" type="element" />5 <atom label="HH32" name="H" type="element" />6 <atom label="HH33" name="H" type="element" />7 <atom label="O" name="O" type="element" />8 </ atoms>9 <bonds>0 <bond atoms="CH3 HH31" order="" />1 <bond atoms="CH3 HH32" order="" />2 <bond atoms="CH3 HH33" order="" />3 <bond atoms="CBond CH3" order="" />4 <bond atoms="CBond O" order="" />5 </ bonds>6 </ fragment> . . . (686 lines in total) Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 27 / 36
  28. 28. Biomolecular simulation . . . a few years from now PDB import Mosaic HDF5 file Chain extraction PDB crystal universe + configuration Protein in vacuum universe + configuration Force field Force field for protein in vacuum Solvated protein universe + configuration Solvatation Force field for solvated protein Trajectory for solvated protein MD simulation Small independent All the data for the programs simulation study in one place Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 28 / 36
  29. 29. Managing, publishing, and archiving reproducibleresearchReproducible research results in large sets of interrelated software,data sets, and documentation. Managing files, dependencies, and metadata (provenance, . . .) becomes a nightmare when moving between multiple computers. Publishing everything in an easily usable form is difficult due to a lack of standards. Archiving is almost pointless: which software will still work 50 years from now?ActivePapers(http://dirac.cnrs-orleans.fr/plone/software/activepapers) provides a solution to these problems is a ready-to-run prototype (needs a Java installation) is rather useless in practice for now Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 29 / 36
  30. 30. Preserving data for a long timeFormalized descriptions data models ontologies well-defined file formatsUse standards, even bad onesStandards ensure the critical mass that will motivate future scientiststo maintain the knowledge and the tools required to interpret yourdata.Limited dependencies software: operating systems, compilers, libraries, . . . infrastructure: Internet, communication protocols, . . . organizations: research labs, publishers, . . . Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 30 / 36
  31. 31. Code is dataThere is nothing special about software, it’s just a kind ofdata!Everything is data: Scientific data is data. Code is data. Input parameters are data. Documentation is data.All we need is good data models for everything, plus a way todescribe dependencies.The tricky part is finding a good data model for code. Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 31 / 36
  32. 32. Representing code as data“Data types” for code: source code: defined by a language specification there are lots of languages programmers are attached to them binary/executable: defined by a processor/hardware specification very complex specification evolves rapidly with technological progress bytecode: defined by a virtual machine specification (JVM, CLI, . . .) clear and simple specification not directly visible to the programmer today’s virtual machines were not designed for scientific computing → sub-optimal performanceJVM bytecodes provide the best existing infrastructure.There HINSEN (CBM/SOLEIL) scientific software for the JVMand structural bioinformatics is32 / 36 Konrad is not much Reproducible research in molecular biophysics platform, which 13 December 2012
  33. 33. Dependency handling Dataflow graph of program execution dataset 1 dataset 2 input program Directed source code Acyclic output Graph compiler of dependencies dataset 3 dataset 2 dataset 1 program dataset 3 Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 33 / 36
  34. 34. ActivePapers overview (1/2)An ActivePaper . . . is an HDF5 file containing any number of data sets typically contains scientific data, code, and documentation is identified by a DOI (Digital Object Identifier) can refer to other ActivePapers through their DOIs stores the dependency graph between data sets in HDF5 metadataExecutable code in an ActivePaper . . . cannot access files outside of ActivePaper files cannot modify data except inside its own ActivePaper comes in three varieties: calclet (reads and writes data), viewlet (reads data for interactive visualization), and library (called by other code) Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 34 / 36
  35. 35. ActivePapers overview (2/2)Special cases a raw dataset (usually with some documentation) a software library (usually with some documentation)External dependencies (must be maintained for 50 years . . .) a Java Virtual Machine with the Java standard library the HDF5 library (could be rewritten as a JVM library) JHDF5, a Java interface to HDF5 (could be rewritten as a JVM library)Problems solved future-proof publishing and archiving of computational research secure repeatability of computational studies reusability of scientific data and software integration of scientific data and software into bibliometry Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 35 / 36
  36. 36. Teaching best practices to computational scientistsSoftware Carpentryis an organization dedicated to teaching computing skills toscientists, with support from the Alfred P. Sloan Foundation and theMozilla Foundation.Activities: short intensive workshops (“boot camps”) online coursesYou can help by hosting a boot camp being an instructor at boot-camps working on teaching materialhttp://software-carpentry.org/ Konrad HINSEN (CBM/SOLEIL) Reproducible research in molecular biophysics and structural bioinformatics 13 December 2012 36 / 36

×