Reproducible Research &
Sustainable Software
Why care?
The recently reported structure of Sav1866 (4) indicated that our
MsbA structures (1, 2, 5) were incorrect in both the hand of the
structure and the topology. Thus, our biological interpretations based on
these inverted models for MsbA are invalid.
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).
• Avoid costly mistakes
• Be faster: “stand on the shoulders of giants”
• Increase impact / visibility
“Big data” biology is hard.
• Biology/life is complex
• Field is young.
• Biologists lack computational training.
• Generally, analysis tools suck.
  • badly written
  • badly tested
  • hard to install
  • output quality… often questionable.
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!
We need great tools.
Some sources of inspiration
Best Practices for Scientific Computing
Greg Wilson∗, D.A. Aruliah†, C. Titus Brown‡, Neil P. Chue Hong§, Matt Davis¶, Richard T. Guy∥, Steven H.D. Haddock∗∗, Katy Huff††, Ian M. Mitchell‡‡, Mark D. Plumbley§§, Ben Waugh¶¶, Ethan P. White∗∗∗, Paul Wilson†††
∗Software Carpentry, †University of Ontario Institute of Technology, ‡… State University, §Software Sustainability Institute, ¶Space Telescope …, ∥University of Toronto, ∗∗Monterey Bay Aquarium Research Institute, ††University of Wisconsin, ‡‡University of British Columbia, §§… Mary University of London, ¶¶University College London, ∗∗∗… University, and †††University of Wisconsin
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.
Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific […]
and open source software development [61
ical studies of scientific computing [4, 31,
development in general (summarized in
practices will guarantee efficient, error-fr
ment, but used in concert they will red
errors in scientific software, make it easie
the authors of the software time and effo
focusing on the underlying scientific ques
1. Write programs for people, not c
Scientists writing software need to write
cutes correctly and can be easily read and
programmers (especially the author’s fut
cannot be easily read and understood it is
to know that it is actually doing what it i
be productive, software developers must t
aspects of human cognition into account
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
Scientists spend an increasing amount of time building and using
software. However, most scientists are never taught how to do this
efficiently. We describe a set of best practices for scientific software
development that have solid foundations in research and
experience, and that improve scientists' productivity and the
reliability of their software.
Specific Approaches/Tools
• Planning for mistakes
• Automated testing
• Continuous integration
• Writing for people: use style guide
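A minimal sketch of what "planning for mistakes" and "automated testing" can look like in Ruby. The helper names gc_content and check are hypothetical and chosen only for illustration; a real project would more likely use a test framework than the hand-rolled check shown here.

```ruby
# Hypothetical helper: fraction of G and C bases in a DNA sequence.
def gc_content(sequence)
  # Plan for mistakes: fail loudly on bad input instead of returning nonsense.
  raise ArgumentError, 'sequence must be a String' unless sequence.is_a?(String)
  raise ArgumentError, 'sequence is empty' if sequence.empty?

  bases = sequence.upcase
  unless bases.match?(/\A[ACGT]+\z/)
    raise ArgumentError, "unexpected characters in #{sequence.inspect}"
  end

  bases.count('GC').to_f / bases.length
end

# Automated test: a hand-rolled check (hypothetical) that a test framework
# could replace. It stops the script as soon as one expectation fails.
def check(description)
  raise "FAILED: #{description}" unless yield

  puts "ok: #{description}"
end

check('all-GC sequence gives 1.0') { gc_content('GGCC') == 1.0 }
check('mixed case is accepted')    { gc_content('atgc') == 0.5 }
check('invalid input is rejected') do
  begin
    gc_content('ATGX')
    false
  rescue ArgumentError
    true
  end
end
```

Running such a script automatically after every change is the job a continuous integration service takes over.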
Code for people: Use a style guide
• For R:
R style guide extract
Coding for people: Indent your code!
and and improve your code in 6
pproximate Damian Conway
Line length
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a
reasonably sized font. If you find yourself running out of room, this is a good indication that you
should encapsulate some of the work in a separate function.

R style guide extract
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, se
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt',
                               header = TRUE,
                               sep = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass'))
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE,
sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))
Code for people: Use a style guide
• For R:
• For Ruby:
Automatically check your code:
install.packages("lint") # once
library(lint) # every time
Eliminate redundancy
DRY: Don’t Repeat Yourself
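A small Ruby sketch of the DRY idea: a formula that would otherwise be retyped for every variable lives in one function. The data and the helper name mean are hypothetical and only serve the illustration.

```ruby
# Hypothetical measurements (in mm); the names are illustrative only.
head_widths  = [1.2, 1.4, 1.1]
body_lengths = [4.8, 5.1, 4.5]

# Repetitive version: the same formula is typed out for every variable.
# mean_head = head_widths.sum / head_widths.length
# mean_body = body_lengths.sum / body_lengths.length

# DRY version: the formula lives in one place and is reused.
def mean(values)
  raise ArgumentError, 'no values given' if values.empty?

  values.sum.to_f / values.length
end

puts mean(head_widths)
puts mean(body_lengths)
```

If the formula ever needs to change, for example to skip missing values, there is only one place to edit.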
knitr (sweave): Analyzing & Reporting in a single file.
### in R:
# --> creates
### in shell:
pandoc -o analysis.pdf
# --> creates MyFile.pdf
A minimal R Markdown example
I know the value of pi is 3.1416, and 2 times pi is 6.2832. To c
library(knitr); knit("minimal.Rmd")
A paragraph here. A code chunk below:
## [1] 2
.4-.7+.3 # what? it is not zero!
## [1] 5.551e-17
Graphics work too
qplot(speed, dist, data = cars) + geom_smooth()
Figure 1: A scatterplot of cars (x axis: speed, y axis: dist)
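The ".4-.7+.3" surprise in the example above is not specific to R. A minimal Ruby sketch of the same effect, with one common workaround: compare floating-point numbers within a tolerance. The helper nearly_equal? and its default tolerance are assumptions made for illustration.

```ruby
difference = 0.4 - 0.7 + 0.3

# Exact comparison fails because of binary rounding error.
puts difference == 0.0                    # false

# Hypothetical helper: treat two floats as equal within a tolerance.
def nearly_equal?(a, b, tolerance = 1e-9)
  (a - b).abs <= tolerance
end

puts nearly_equal?(difference, 0.0)       # true
```

The right tolerance depends on the scale of the data, so it is worth choosing deliberately and not copying the default.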
Organize mindfully
A Quick Guide to Organizing Computational Biology
William Stafford Noble1,2
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and
Engineering, University of Washington, Seattle, Washington, United States of America
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.
The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw […]
[…] understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably […]
[…] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.
Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one […]
[…] with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment […]
The Lab Notebook
In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail.
[…] These types of entries provide a complete picture of the development of the project over time.
In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer […]
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of
the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The
source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README
files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall
automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse- script is called by both of the runall driver scripts.
In each results folder:
• script getResults.rb
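The chronological results layout described in the article extract above can be set up with a few lines of Ruby. This is only a sketch: the helper create_results_dir, the README file it creates and the topic label are assumptions for illustration, not part of the article or of getResults.rb.

```ruby
require 'date'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: create results/<year>-<month>-<day>[-topic] under a
# project root, with an empty README to record what was done and why.
def create_results_dir(project_root, date: Date.today, topic: nil)
  name = date.strftime('%Y-%m-%d')
  name += "-#{topic}" if topic
  dir = File.join(project_root, 'results', name)
  FileUtils.mkdir_p(dir)
  FileUtils.touch(File.join(dir, 'README'))
  dir
end

# Demonstration in a throwaway directory so nothing real is modified.
Dir.mktmpdir do |root|
  dir = create_results_dir(root, date: Date.new(2009, 1, 15), topic: 'split-test')
  puts File.basename(dir)                      # 2009-01-15-split-test
  puts File.exist?(File.join(dir, 'README'))   # true
end
```

Because the directory names start with the year, a plain alphabetical listing is also a chronological one.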
Track versions of everything
GitHub: Facebook for code
• Easy versioning
• Random people use your stuff
• And find problems and fix and improve it!
• Greater impact / better planet
• Easily update
• Easily collaborate
• Identify trends
• Build online reputation
Learn how:
Programming languages
Choosing a programming language
                                    Good:                                             Bad:
Excel                               quick & dirty                                     easy to make mistakes; doesn’t scale
                                    numbers, stats,
Unix command-line == shell == bash  Can’t escape it. Quick & Dirty. HPC.              complicated things
Java                                1990s user interfaces                             overcomplicated.
Perl                                1980s.                                            Everything.
Python                              scripting, text                                   ugly
Ruby                                scripting, text
                                    scripting, flexibility (web & client), community  only little bio-stuff
Ruby. “Friends don’t let friends do Perl” - reddit user
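### in PERL:
open(INFILE, "<", "my_file.txt") or die "Cannot open my_file.txt: $!";
while (defined ($line = <INFILE>)) {
    chomp($line);                                  # remove the end-of-line character
    @letters = split(//, $line);                   # split the line into characters
    @reverse_letters = reverse(@letters);          # reverse their order
    $reverse_string = join("", @reverse_letters);  # glue them back together
    print $reverse_string, "\n";
}
close(INFILE);
### in Ruby:
File.open("a").each { |line|
    puts line.chomp.reverse
}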
• example: “reverse each line in file”
  • read file; with each line
    • remove the invisible “end of line” character
    • reverse the contents
    • print the reversed line
More ruby examples.
5.times {
  puts "Hello world"
}
# Sorting people
people_sorted_by_age = people.sort_by { |person| person.age }
+many tools for bio-data - e.g. check
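The sorting example above assumes that a people collection already exists. A self-contained sketch, in which the Person struct and the three entries are hypothetical data made up for illustration:

```ruby
# Hypothetical data: Person and the three entries are for illustration only.
Person = Struct.new(:name, :age)

people = [
  Person.new('Ada', 36),
  Person.new('Grace', 85),
  Person.new('Linus', 21)
]

people_sorted_by_age = people.sort_by { |person| person.age }
puts people_sorted_by_age.map(&:name).join(', ')   # Linus, Ada, Grace
```

sort_by returns a new array and leaves people itself unchanged.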
Getting help.
• In real life: Make friends with people. Talk to them.
• Online:
• Specific discussion mailing lists (e.g.: R, bioruby, MAKER...)
• Programming:
• Bioinformatics:
• Sequencing-related:
• Stats:
• Codeschool!
“Can you BLAST this for me?”
Anurag Priyam, Mechanical engineering student, IIT Kharagpur
Sure, I can help you…
“Can you BLAST this for me?” SequenceServer
BLAST made easy
(well, we’re trying...)
Aim: An open source idiot-proof web interface for custom BLAST
Today: SequenceServer
Used in >200 labs

LETTERS, edited by Etta Kavanagh
Retraction
We wish to retract our Research Article "Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters" and both of our Reports "Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide" and "X-ray structure of the EmrE multidrug transporter in complex with a substrate" (1–3). The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.
