Scientific Software Development

Avoiding Big Mistakes
in Scientific Computing
Or: How to Write Code That Doesn’t Jeopardize
Your Professional Reputation or Patients’ Lives

Jeff Allen
Quantitative Biomedical Research Center
UT Southwestern Medical Center
BSCI5096 - 3.26.2013

Motivation

• Anil Potti scandal at Duke
– A genomic signature was reported to identify
the best chemotherapy based on a patient’s genes.
– Over 100 patients enrolled in clinical trials.
– Gross mishandling of data and invalidating
software bugs were discovered later.
– Alleged manipulation of data
– Watch: Lecture from Keith Baggerly

Outline

• Revision Control
• Reproducibility and Replicability
• Ensuring Code Quality
• Resources

Outline

• Revision Control
– Introduction & Concepts
– Git & GitHub
• Reproducibility and Replicability
• Ensuring Code Quality
• Resources

Revision Control

• Tracks changes to files over time
• Keeps a complete log of all changes ever
made to any file in a project
• Supports collaboration on projects
– Provides an authoritative repository for the code
– Gracefully catches and handles conflicting edits to files
• Several systems are in use today, including
Mercurial, Git, and Subversion

Git

• Modern distributed revision control system
– “Distributed” means you have the entire history of
the project on your local machine.
– You don’t have to be online to develop.
• Improves on the performance and
usability of earlier systems.
• Open source and free

GitHub

• A website that hosts Git repositories.
• You can “push” your own Git repositories to
their site to gain:
– A web interface – easier way to view your files and
track changes
– Control who has access to which projects
– Project organization – hosts documentation,
bug tracking, etc.
– Social platform – the “Facebook” of coding
– Client-Side graphical user interface

GitHub Client - GUI

• Only works with GitHub.
• Much easier to use and navigate.
• Mac and Windows versions.
• On campus: open Git Shell and run:
git config --global http.proxy http://proxy.swmed.edu:3128

Use Cases

• “This function used to work.”
– Look at the changes made to that file since it last
worked.
• “Please send me the code used in this
publication.”
– Revert the project back to any point in its history
• “I found a bug and fixed it.”
– (Optionally) Allow others to contribute to your
projects.

Outline

• Revision Control
• Reproducibility and Replicability
– Replicability
– Reproducibility
• Ensuring Code Quality
• Resources

“‘Replicable’ means ‘other people get exactly
the same results when doing exactly the same
thing’, while ‘reproducible’ means ‘something
similar happens in other people’s hands.’ The
latter is far stronger, in general, because it
indicates that your results are not merely some
quirk of your setup and may actually be right.”
C. TITUS BROWN
http://ivory.idyll.org/blog/replication-i.html

Replicability

• In order for analysis to be replicable, another
researcher must have access to:
– The exact same code you used
– The exact same data you used
• Any change (including bug fixes and other
corrections) to your code or data from what
you provide will make your results irreplicable.
– Track every change in a revision control system
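
One way to check that another researcher really has the exact same data is to record a checksum of each data file alongside the code. The sketch below is a minimal illustration in base R, not a prescribed workflow: the data file is hypothetical and is written to a temporary location only so the example runs anywhere. `tools::md5sum` and `R.version.string` are standard R.

```r
# Hypothetical data file, created here only so the sketch is runnable.
data_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.7)),
          data_file, row.names = FALSE)

# Record a fingerprint of the data and the R version used for the analysis.
fingerprint <- list(
  md5       = unname(tools::md5sum(data_file)),
  r_version = R.version.string
)

# Later (or on another machine): recompute the checksum and compare.
same_data <- identical(unname(tools::md5sum(data_file)), fingerprint$md5)
print(same_data)
```

Committing the recorded fingerprint to the repository ties a given version of the code to a given version of the data.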

Reproducibility

• Requires much more time and effort
• Independently arrive at the same conclusions
– Potentially using the same data
– Using different techniques and parameters
• May take as much time to reproduce results
as it did to produce them the first time
• Should be done in high-stakes (e.g., clinical)
applications

Recommended Practices

a. Use a revision control system such as Git
(hosted, for example, on GitHub)
b. To test replicability, clone your repository
on another computer and re-run all of your
analysis. Ensure you get the same results.
• Knowing you’ll have to do this will make you write
better-organized code.
c. If it’s really important, ask a colleague to
reproduce your results independently.
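
The re-run check in (b) can be partly automated by saving the results of one run and comparing them against a fresh run. This is a minimal sketch under stated assumptions: `run_analysis()` is a hypothetical stand-in for a real pipeline, and the file name is illustrative. Any step that uses random numbers needs a fixed seed, otherwise two runs will not agree.

```r
# Hypothetical stand-in for a real analysis pipeline.
run_analysis <- function(seed = 42) {
  set.seed(seed)                 # fix the random number stream
  x <- rnorm(100)
  list(mean = mean(x), sd = sd(x))
}

# First run: save the results next to the code.
results_file <- tempfile(fileext = ".rds")
saveRDS(run_analysis(), results_file)

# Second run (e.g. from a fresh clone on another computer): compare.
original   <- readRDS(results_file)
rerun      <- run_analysis()
replicated <- isTRUE(all.equal(original, rerun))
print(replicated)
```

`all.equal` compares with a small numeric tolerance, which is usually the right choice across machines; use `identical` if bit-for-bit agreement is required.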

Outline

• Revision Control
• Reproducibility and Replicability
• Ensuring Code Quality
– Automated Testing
– Code reviews
• Resources

Automated Testing

• Unit testing
– Very specific target
– May have multiple tests
per function
• Many unit testing frameworks
– In R: testthat and RUnit

install.packages("testthat")
library(testthat)

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}


Tests

expect_that(square(3), equals(9))   #Passes
expect_that(square(5), equals(25))  #Passes

Test-Driven Development (TDD)

• If you see a bug:
1. Write a test that fails
2. Fix the bug
3. Show that the test now passes
4. Commit to revision control


Tests (same loop-based square as above)

expect_that(square(2.5), equals(6.25))  #Fails
expect_that(square(-2), equals(4))      #Fails


Code

square <- function(x){
  sq <- x * x
  return(sq)
}


Tests

expect_that(square(3), equals(9))       #Passes
expect_that(square(5), equals(25))      #Passes
expect_that(square(2.5), equals(6.25))  #Passes
expect_that(square(-2), equals(4))      #Passes
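
The four checks above can be kept as a permanent regression suite by grouping them into named tests. This is a sketch, assuming the testthat package is installed. It uses `test_that()` and `expect_equal()`, the spelling favored in current testthat releases, rather than the `expect_that(..., equals())` form shown in the slides. The test descriptions are illustrative.

```r
library(testthat)

square <- function(x) {
  x * x
}

test_that("square handles whole numbers", {
  expect_equal(square(3), 9)
  expect_equal(square(5), 25)
})

# Regression tests for the two inputs that broke the loop-based version.
test_that("square handles fractions and negative numbers", {
  expect_equal(square(2.5), 6.25)
  expect_equal(square(-2), 4)
})
```

Saving these in a test file and running them after every change is what keeps the fractional and negative-input bugs from coming back.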

Test-Driven Development (TDD)

• Advantages
– Ensure that problematic areas are well-tested
– Regression testing – ensure old bugs don’t ever
come back
– Confidently approach old code
– More assured in handling someone else’s code
– Saves you time over manual testing

Code Reviews

• Get more than one set of eyes on your code
• Lightweight
– Email to get quick feedback
– GitHub is great for this
• Formal
– Have a meeting to audit
– Fewer than 500 lines of code (LOC) per meeting

Extreme – Pair Programming

• Two programmers share a single workstation
• Both participate, though only one can type
• Significant learning opportunities for both
• Can strategically pair:
– Senior with Junior, mentoring
– Statistician with Developer, mutual learning
• Improvements in code quality
compensate for short-term efficiency loss
– fewer bugs, easier code to maintain


Code

square <- function(x){
  x^2
}

Tests

expect_that(square(3), equals(9))       #Passes
expect_that(square(2.5), equals(6.25))  #Passes
expect_that(square(-2), equals(4))      #Passes

Resources

• Software Carpentry
– www.software-carpentry.org
– Volunteer organization focused on teaching these
topics to scientific audiences
– Contact us (Jeffrey.Allen@UTSouthwestern.edu) if
you’d be interested in attending a local Boot Camp
• GitHub Documentation
– https://help.github.com/
– Great documentation on how to use Git and/or
GitHub

Resources

• Unit Testing in R
– http://cran.r-project.org/web/packages/RUnit/index.html
– http://cran.r-project.org/web/packages/testthat/index.html
– http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf

Suggested Next Steps

• Watch the lecture from Keith Baggerly
• Register for a GitHub account (free) and explore
• Write an R function and cover it with unit tests
using the testthat framework
• Then push it to a public GitHub repository
