2013-03-15 - Institut Jacques Monod - bioinfoclub


  1. Doing computational science better: some sources of inspiration, some tools, getting help, your turn
  2. Some sources of inspiration
  3. Education: "A Quick Guide to Organizing Computational Biology Projects", William Stafford Noble (Department of Genome Sciences, School of Medicine, and Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America). The slide shows the first page of the article; its three columns read as follows.

     Introduction. Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues. The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting [...]

     [...] understanding your work or who may be evaluating your research skills. Most commonly, however, that ‘‘someone’’ is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments. This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new [...]

     [...] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own. Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts. Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final [...]
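The project layout described on this slide (data, results, doc, src and bin under one project root) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `create_project_skeleton` and the constant `TOP_LEVEL_DIRS` are hypothetical and do not come from the article, which describes the layout but gives no script for creating it.

```python
from pathlib import Path

# Top-level directories named in the article excerpt.
TOP_LEVEL_DIRS = ("data", "results", "doc", "src", "bin")


def create_project_skeleton(root):
    """Create the top-level directories under `root` and return their paths.

    Hypothetical helper: existing directories are left untouched.
    """
    root = Path(root)
    created = []
    for name in TOP_LEVEL_DIRS:
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

Calling `create_project_skeleton("msms")` would create `msms/data`, `msms/results`, and so on in the current directory; running it a second time changes nothing, so it is safe to keep in a setup script.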
  4. "A Quick Guide to Organizing Computational Biology Projects" (William Stafford Noble): the same first page, overlaid with Figure 1 and the text that follows it.

     Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001

     [...] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new [...]

     The Lab Notebook. In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments [...]

     [...] These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collabo[...]
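Two ideas on this slide lend themselves to a short sketch: result directories named by date in year-month-day order, so that an alphabetical listing is also chronological, and a dated lab notebook kept in the root of the results directory. The Python below is a minimal illustration, not the author's tooling; the function names and the notebook file name `notebook.txt` are assumptions made for the example.

```python
from datetime import date
from pathlib import Path


def new_experiment_dir(results_root, day=None, topic=""):
    """Create results_root/<YYYY-MM-DD>[-topic] and return its path."""
    day = day or date.today()
    name = day.isoformat()  # ISO dates sort chronologically as plain strings
    if topic:
        name = f"{name}-{topic}"
    path = Path(results_root) / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def append_notebook_entry(results_root, text, day=None):
    """Append a dated entry to a plain-text notebook in the results root.

    The file name `notebook.txt` is a hypothetical choice for this sketch.
    """
    day = day or date.today()
    root = Path(results_root)
    root.mkdir(parents=True, exist_ok=True)
    notebook = root / "notebook.txt"
    with notebook.open("a", encoding="utf-8") as handle:
        handle.write(f"{day.isoformat()}\n{text}\n\n")
    return notebook
```

Because entries are only ever appended, the notebook stays in chronological order alongside the dated directories it describes.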
  5. "A Quick Guide to Organizing Computational Biology Projects" (William Stafford Noble): the same page and Figure 1 as on the previous slide, with the presenter's overlay.

     In each results folder:
       • script: getResults.rb or WHATIDID.txt
       • intermediates
       • output
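The three parts listed for each results folder (a script, its intermediates, its output) can be illustrated with a toy driver. The slide names a Ruby script, getResults.rb; the sketch below uses Python instead and invents everything else for illustration: the function `run_experiment`, the file names `cleaned.txt` and `summary.txt`, and the stand-in analysis (a mean of the non-missing values) are all assumptions.

```python
from pathlib import Path


def run_experiment(folder, raw_values):
    """Toy driver: write an intermediate file, then a final output file."""
    folder = Path(folder)
    intermediates = folder / "intermediates"
    output = folder / "output"
    intermediates.mkdir(parents=True, exist_ok=True)
    output.mkdir(parents=True, exist_ok=True)

    # Step 1: an intermediate result (here, the input with missing values dropped).
    cleaned = [value for value in raw_values if value is not None]
    if not cleaned:
        raise ValueError("no usable values to analyse")
    lines = "\n".join(str(value) for value in cleaned) + "\n"
    (intermediates / "cleaned.txt").write_text(lines, encoding="utf-8")

    # Step 2: the final output (here, the mean of the cleaned values).
    mean = sum(cleaned) / len(cleaned)
    (output / "summary.txt").write_text(f"mean\t{mean}\n", encoding="utf-8")
    return mean
```

Keeping the driver in the folder next to what it produced means the folder documents itself: rerunning the script regenerates both subdirectories from scratch.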
  6. "Best Practices for Scientific Computing", arXiv:1210.0530v3 [cs.MS], 29 Nov 2012. Greg Wilson ∗, D.A. Aruliah †, C. Titus Brown ‡, Neil P. Chue Hong §, Matt Davis ¶, Richard T. Guy, Steven H.D. Haddock ∗∗, Katy Huff ††, Ian M. Mitchell ‡‡, Mark D. Plumbley §§, Ben Waugh ¶¶, Ethan P. White ∗∗∗, Paul Wilson †††. The slide shows the title page of the paper, with the right-hand edge cropped.

     Affiliations as far as they are legible: ∗ Software Carpentry (gvwilson@software-carpentry.org), † University of Ontario Institute of Technology (Dhavide.Aru[...]), ‡ [...] State University (ctb@msu.edu), § Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), ¶ Space Telescope [...] (mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca), ∗∗ Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), †† University of Wisconsin (khuff@cae.wisc.edu), ‡‡ University of British Columbia (mi[...]), §§ [...] Mary University of London (mark.plumbley@eecs.qmul.ac.uk), ¶¶ University College London (b.waugh@ucl.ac.uk), ∗∗∗ [...] University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu).

     Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

     Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific [...]

     [The right-hand column is cut off in the slide. It contains the start of the section "1. Write programs for people, not computers".]
  7. "Best Practices for Scientific Computing" (the same title page as on the previous slide, enlarged)
  8. "Best Practices for Scientific Computing" (same title page), with the presenter's overlay:
       1. Write programs for people, not computers.
  9. "Best Practices for Scientific Computing" (same title page), with the presenter's overlay:
       1. Write programs for people, not computers.
       2. Automate repetitive tasks.
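As a small illustration of "Automate repetitive tasks", the sketch below applies one step to every matching file in a directory instead of running it by hand file by file. It is an assumption-laden example, not code from the paper: `count_records` is a stand-in for whatever analysis step is being repeated, and the `*.txt` pattern is arbitrary.

```python
from pathlib import Path


def count_records(path):
    """Count non-empty lines in one file (a stand-in for a real analysis step)."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_all(data_dir, pattern="*.txt"):
    """Run the same step on every matching file, in a stable (sorted) order."""
    files = sorted(Path(data_dir).glob(pattern))
    return {path.name: count_records(path) for path in files}
```

Sorting the file list keeps the order of results identical from one run to the next, which makes the output easier to compare between runs.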
  10. "Best Practices for Scientific Computing" (same title page), with the presenter's overlay:
       1. Write programs for people, not computers.
       2. Automate repetitive tasks.
       3. Use the computer to record history.
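One way to read "Use the computer to record history" is to have every run write down how it was produced. The sketch below saves a small provenance record next to the results; it is a hedged illustration only. The function name `write_provenance`, the file name `provenance.json` and the choice of recorded fields are assumptions, and a record like this complements a version control system rather than replacing it.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(out_dir, parameters):
    """Save when and how a run was made as JSON; return the file path."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "argv": sys.argv,  # command line that started the run
        "python_version": platform.python_version(),
        "parameters": parameters,  # caller-supplied settings for this run
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "provenance.json"  # hypothetical file name
    path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

The `parameters` argument must be JSON-serializable (for example a dictionary of numbers and strings); anything else makes `json.dumps` raise a `TypeError`.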
  11. 11. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems,scientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves around computers. This includes the development of cvolves around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing and analyzing the large amounts cannot be easily read and understood it ispnew disparate datasets to assess synthetic problems. 
that are generated in single research projects, and to know that it is actually doing what it icombining be productive, software developers must t cof Scientists that are generated in single research human cognition andaccount data typically develop their own software for these aspects of projects, intopurposes because doing so requires substantial domain-specific human working memory is limited, huma t
  12. 12. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves around computers. This includes the development of cvolves around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing and analyzing the large amounts cannot be easily read and understood it ispnew disparate datasets to assess synthetic problems. 
that are generated in single research projects, and to know that it is actually doing what it icombining be productive, software developers must t cof Scientists that are generated in single research human cognition andaccount data typically develop their own software for these aspects of projects, intopurposes because doing so requires substantial domain-specific human working memory is limited, huma t
  13. 13. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves aroundDon’t repeat yourself (or others). 6. computers. This includes the development of cvolves around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing and analyzing the large amounts cannot be easily read and understood it ispnew disparate datasets to assess synthetic problems. 
that are generated in single research projects, and to know that it is actually doing what it icombining be productive, software developers must t cof Scientists that are generated in single research human cognition andaccount data typically develop their own software for these aspects of projects, intopurposes because doing so requires substantial domain-specific human working memory is limited, huma t
  14. 14. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves aroundDon’t repeat yourself (or others). 6. computers. This includes the development of cvolves 7. Plan for mistakes. around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing and analyzing the large amounts cannot be easily read and understood it ispnew disparate datasets to assess synthetic problems. 
that are generated in single research projects, and to know that it is actually doing what it icombining be productive, software developers must t cof Scientists that are generated in single research human cognition andaccount data typically develop their own software for these aspects of projects, intopurposes because doing so requires substantial domain-specific human working memory is limited, huma t
  15. 15. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves aroundDon’t repeat yourself (or others). 6. computers. This includes the development of cvolves 7. Plan for mistakes. around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing andworksand p cannot be easily read and understood it isnew 8. Optimize software only after it analyzingknow that it is actually doing what it i that are generated in single research projects, correctly.the large amounts tocombining disparate datasets to assess synthetic problems. 
be productive, software developers must tcof Scientists that are generated in single research human cognition andaccount data typically develop their own software for these aspects of projects, intopurposes because doing so requires substantial domain-specific human working memory is limited, huma t
  16. 16. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves aroundDon’t repeat yourself (or others). 6. computers. This includes the development of cvolves 7. Plan for mistakes. around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing andworksand p cannot be easily read and understood it isnew 8. Optimize software only after it analyzingknow that it is actually doing what it i that are generated in single research projects, correctly.the large amounts tocombining disparate datasets to assess synthetic problems. 9. 
Document the designown software single research projects, and must t and purpose ofthese rather than itssoftware developers code be productive, mechanics. cof Scientists that are generated in for data typically develop theirpurposes because doing so requires substantial domain-specific aspects of human cognition into account human working memory is limited, huma t
  17. 17. (steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wBest Practices for Scientific ComputingMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ Unive ∗Greg Wilson , (ethan@weecology.org), and ††† Hong § , Matt of ¶ , Richard T. (wilsUniversity D.A. Aruliah † , C. Titus Brown ‡ , Neil P. ChueUniversityDavisWisconsin Guy ,Steven H.D. Haddock ∗∗ , Katy Huff †† , Ian M. Mitchell ‡‡ , Mark D. Plumbley §§ , Ben Waugh ¶¶ ,Ethan P. White ∗∗∗ , Paul Wilson †††∗ Software Carpentry (gvwilson@software-carpentry.org),† University of Ontario Institute of Technology (Dhavide.AruScientists spend an increasing amount of time building and using aState University (ctb@msu.edu),§ Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk),¶ Space Telescope(mrdavis@stsci.edu), University of Toronto (guy@cs.utoronto.ca),∗∗ Monterey Bay Aquarium Research Institutesoftware. However, most scientists are never taught how to do this i(steve@practicalcomputing.org),†† University of Wisconsin (khuff@cae.wisc.edu),‡‡ University of British Columbia (miefficiently. As a result, many are unaware of tools and practices that dMary University of London (mark.plumbley@eecs.qmul.ac.uk),¶¶ University College London (b.waugh@ucl.ac.uk),∗∗University (ethan@weecology.org), and ††† University of Wisconsin (wilsonp@engr.wisc.edu)would allow them to write more reliable and maintainable code with pless effort. We describe a set of best practices for scientific software arXiv:1210.0530v3 [cs.MS] 29 Nov 2012Scientists spend an increasing amount of time building and using research and software development [61 and open source experience, mdevelopment that have solid foundations in ical studies of scientific computing [4, 31,software. However, most scientists are never taught how to do thisefficiently. As a improve are unaware of tools and practices thatand the reliability of theirand that result, many scientists’ productivity e development in general (summarized insoftware. 
describe a set of best practices for scientific software practices will guarantee efficient, error-frtwould allow them to write more reliable and maintainable code withless effort. We ment, but used in concert they will red fdevelopment that have solid foundations in research and experience, and the not computers. errors in scientific software, make it easieand that improve scientists’ productivitypeople, reliability of their 1. Write programs for the authors of the software time and effo Software is as important to modern focusing on the underlying scientific quessoftware. 2. Automate repetitive tasks. scientific research astelescopesasand computer to record history. as that work exclusively 3. Use important to tubes. From groups Software is the test modern scientific research 1telescopes andMaketubes. From groups that work exclusively 4. test incremental changes.on computationalto traditional laboratory and field 1. laboratory andpeople, not c problems, to traditional Write programs for field Scientists writing software need to writeSon computational problems, control. 5. Use versionscientists, more and more of the daily operation of science re- operation of science re-scientists, more and more of the daily cutes correctly and can be easily read andvolves aroundDon’t repeat yourself (or others). 6. computers. This includes the development of cvolves 7. Plan for mistakes. around computers. This includes the development ofnew algorithms, managing and analyzing the large amounts programmers (especially the author’s futof data algorithms, managing andworksand p cannot be easily read and understood it isnew 8. Optimize software only after it analyzingknow that it is actually doing what it i that are generated in single research projects, correctly.the large amounts tocombining disparate datasets to assess synthetic problems. 9. 
Document the designown software single research projects, and must t and purpose ofthese rather than itssoftware developers code be productive, mechanics. cof Scientists that are generated in for data typically develop theirpurposes because doing so code reviews. 10. Conduct requires substantial domain-specific aspects of human cognition into account human working memory is limited, huma t
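As a rough illustration of practices 2, 6 and 7 from the list above (automate repetitive tasks, don’t repeat yourself, plan for mistakes), here is a minimal R sketch. The function name `allele_frequency`, its arguments and the small `counts` table are hypothetical and invented for this example; they do not come from the paper or the talk.

```r
# Hypothetical helper: the name, the arguments and the allele-frequency
# use case are assumptions made purely for illustration.
allele_frequency <- function(alt_count, total_count) {
  # Plan for mistakes: stop early on input that cannot be right.
  stopifnot(is.numeric(alt_count), is.numeric(total_count),
            length(alt_count) == length(total_count),
            all(total_count > 0),
            all(alt_count >= 0),
            all(alt_count <= total_count))
  alt_count / total_count
}

# Don't repeat yourself / automate: one vectorised call covers every sample
# instead of a copy-pasted line per sample.
counts <- data.frame(sample = c("a", "b", "c"),
                     alt    = c(2, 0, 5),
                     total  = c(10, 8, 5))
counts$freq <- allele_frequency(counts$alt, counts$total)
print(counts)
```

Putting the checks inside the function means that every caller gets them for free, which is one way of combining "plan for mistakes" with "don’t repeat yourself".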
  18. 18. Ruby. (or maybe Python) “Friends don’t let friends do Perl” - reddit user
  20. 20. Programming better
• “being able to use, understand and improve your code in 6 months & in 60 years” - Damian Conway (approximate quote)
• variable naming
• line width: 100 characters
• indenting
• Follow conventions, e.g. “Google R Style” or https://github.com/hadley/devtools/wiki/Style
• Versioning: Dropbox & http://github.com/
• Automated testing, e.g.:

    preprocess_snps <- function(snp_table, testing=FALSE) {
      if (testing) {
        # run a bunch of tests of extreme situations.
        # quit if a test gives a weird result.
      }
      # real part of function.
    }
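The `preprocess_snps` skeleton above leaves both the tests and the real work as comments. One way it could be filled in is sketched below. The column names (`snp_id`, `genotype`) and the cleaning rule (drop missing genotypes, then exact duplicate rows) are assumptions made for illustration only; the slides do not say what the real function does.

```r
# Sketch only: the columns (snp_id, genotype) and the cleaning rule are
# hypothetical, chosen to give the built-in tests something to check.
preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # Run the real code path on small, extreme inputs; stopifnot() quits
    # with an error as soon as one of them gives a surprising result.
    empty <- data.frame(snp_id = character(0), genotype = character(0),
                        stringsAsFactors = FALSE)
    stopifnot(nrow(preprocess_snps(empty)) == 0)

    all_missing <- data.frame(snp_id = c("rs1", "rs2"),
                              genotype = c(NA_character_, NA_character_),
                              stringsAsFactors = FALSE)
    stopifnot(nrow(preprocess_snps(all_missing)) == 0)

    duplicated_rows <- data.frame(snp_id = c("rs1", "rs1"),
                                  genotype = c("AA", "AA"),
                                  stringsAsFactors = FALSE)
    stopifnot(nrow(preprocess_snps(duplicated_rows)) == 1)
  }

  # Real part of the function: drop missing genotypes, then exact duplicates.
  kept <- snp_table[!is.na(snp_table$genotype), , drop = FALSE]
  unique(kept)
}

snps <- data.frame(snp_id = c("rs1", "rs2", "rs2", "rs3"),
                   genotype = c("AA", "AG", "AG", NA),
                   stringsAsFactors = FALSE)
clean <- preprocess_snps(snps, testing = TRUE)
print(clean)
```

Calling the function with `testing = TRUE` runs the edge-case checks first and then processes the real table, so the tests travel with the code they protect. A dedicated testing package would be a reasonable alternative once a project grows.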
  28. 28. A few tools
  29. 29. Take notes in Markdown, then convert to HTML, PDF, ...
  34. 34. knitr (sweave)
      Analyzing & Reporting in a single file.

      MyFile.Rnw:
          \documentclass{article}
          \usepackage[sc]{mathpazo}
          \usepackage[T1]{fontenc}
          \usepackage{url}
          \begin{document}
          <<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
          # this is equivalent to \SweaveOpts{...}
          opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
          options(replace.assign=TRUE,width=90)
          @
          \title{A Minimal Demo of knitr}
          \author{Yihui Xie}
          \maketitle
          You can test if \textbf{knitr} works with this minimal demo. OK, let's
          get started with some boring random numbers:
          <<boring-random,echo=TRUE,cache=TRUE>>=
          set.seed(1121)
          (x=rnorm(20))
          mean(x);var(x)
          @
          The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
          and histograms recorded by the PDF device:
          <<boring-plots,cache=TRUE,echo=TRUE>>=
          ## two plots side by side
          par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
          boxplot(x)
          hist(x,main='')
          @
          Do the above chunks work? You should be able to compile the \TeX{}

      ### in R:
          library(knitr)
          knit("MyFile.Rnw")
          # --> creates MyFile.tex
      ### in shell:
          pdflatex MyFile.tex
          # --> creates MyFile.pdf

      A Minimal Demo of knitr
      Yihui Xie
      February 26, 2012
      You can test if knitr works with this minimal demo. OK, let's get started with some boring random numbers:
          set.seed(1121)
          (x <- rnorm(20))
          ## [1] 0.14496 0.43832 0.15319 1.08494 1.99954 -0.81188 0.16027 0…
          ## [10] -0.02531 0.15088 0.11008 1.35968 -0.32699 -0.71638 1.80977 0…
          ## [19] 0.13272 -0.15594
          mean(x)
          ## [1] 0.3217
          var(x)
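The two steps on this slide (knit in R, then pdflatex in the shell) can also be driven from a single R script. The following is a minimal sketch, assuming the knitr package is installed; the file contents and the chunk name `numbers` are made up for illustration, and the pdflatex step is only attempted when the program is found on the PATH.

```r
library(knitr)

# Write a tiny, made-up .Rnw file into a temporary directory.
work_dir <- tempfile("knitr-demo-")
dir.create(work_dir)
rnw_file <- file.path(work_dir, "MyFile.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<numbers, echo=TRUE>>=",
  "set.seed(1121)",
  "x <- rnorm(20)",
  "mean(x)",
  "@",
  "The first element of x is \\Sexpr{round(x[1], 3)}.",
  "\\end{document}"
), rnw_file)

# Step 1 (in R): run the chunks and weave code and results into a .tex file.
# A fresh environment keeps the document's variables out of the workspace.
tex_file <- knit(rnw_file,
                 output = file.path(work_dir, "MyFile.tex"),
                 envir = new.env(),
                 quiet = TRUE)

# Step 2 (shell): compile the .tex file to PDF, if pdflatex is available.
if (nzchar(Sys.which("pdflatex"))) {
  old_dir <- setwd(work_dir)
  system2("pdflatex", c("-interaction=nonstopmode", "MyFile.tex"),
          stdout = FALSE)
  setwd(old_dir)
}
```

knitr also provides `knit2pdf()`, which wraps both steps in one call; the explicit version here simply mirrors the slide.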
