A short presentation with pointers on getting started with reproducible computational research in R. Some of the topics include git, R package development, document generation with R markdown, saving plots, saving tables and using packrat.
2. Topics
– Introduction
– Version control (Git)
– Reproducible analysis in R
• Writing packages
• R Markdown
• Saving plots
• Saving data
• Packrat
3. Reproducible (computational) research
1. For Every Result, Keep Track of How It Was Produced
– Steps, commands, clicks
2. Avoid Manual Data Manipulation Steps
3. Archive the Exact Versions of All External Programs Used
– Packrat (Reproducible package management for R)
4. Version Control All Custom Scripts
5. Record All Intermediate Results, When Possible in Standardized Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
– set.seed(42)
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results
Sandve GK, Nekrutenko A, Taylor J, Hovig E (2013) Ten Simple Rules for Reproducible Computational
Research. PLoS Comput Biol 9(10): e1003285. doi:10.1371/journal.pcbi.1003285
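A minimal sketch of rules 6 and 7 in R: fixing the seed makes a random draw repeatable, and the values behind a plot are written to a CSV file. The file name and variable names are illustrative assumptions, not part of the original slides.

```r
# Rule 6: the same seed gives the same random numbers
set.seed(42)
draw_a <- rnorm(5)
set.seed(42)
draw_b <- rnorm(5)
same_draw <- identical(draw_a, draw_b)

# Rule 7: store the raw data behind a plot
# (hypothetical file name, written to a temporary folder for this sketch)
plot_data <- data.frame(index = seq_along(draw_a), value = draw_a)
plot_data_path <- file.path(tempdir(), "figure1_data.csv")
write.csv(plot_data, plot_data_path, row.names = FALSE)
```

In a real project the CSV would live next to the figure it belongs to, under version control.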
6. Version control
• Like the review (track changes) feature in Word, but on steroids
• When working alone: it’s a database of all the versions of
your files
• When collaborating: it’s a database of all the versions from all
collaborators, with one master version into which all changes
can be merged.
• When there are no conflicts, merging can be done
automatically.
• Multiple programs/protocols: Git, Mercurial, SVN, …
• By default not meant for versioning large files (> 50 MB), but there is
a Git Large File Storage extension
• Works best with text files (code, markdown, csv, …)
7. Git
• Popularized by http://github.com but
supported by different providers
(http://github.ugent.be, http://bitbucket.org).
• Programs for Git on Windows:
– Standard Git Gui + command line (git-scm.com)
– GitHub Desktop for Windows
– Atlassian SourceTree
8. Git workflow (1 user)
Workflow:
1. create a repository on your preferred provider
If you want a private repository then use bitbucket.org or apply for
the student developer pack (https://education.github.com/)
2. Clone the repository to your computer
git clone https://github.com/samuelbosch/sdmpredictors.git
3. Make changes
4. View changes (optional)
git status
5. Submit changes
git add <new files>
git commit -am "<commit message>"
git push
9. Git extras to explore
• Excluding files from Git with .gitignore
• Contributing to open source
– Forking
– Pull requests
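A small sketch of what a .gitignore for an R project could contain, written from R so that it can be scripted. The project folder is a hypothetical temporary directory; the three patterns are files that R and RStudio create locally, and a real project will usually need more entries.

```r
# Hypothetical project folder; in practice this is the root of your repository
project_dir <- file.path(tempdir(), "demo_project")
dir.create(project_dir, showWarnings = FALSE)

# Local files created by R and RStudio that rarely belong in a repository
ignore_patterns <- c(".Rproj.user", ".Rhistory", ".RData")
gitignore_path <- file.path(project_dir, ".gitignore")
writeLines(ignore_patterns, gitignore_path)
```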
10. DEMO
• New project on https://github.ugent.be/
• Clone
• Add file
• Status
• Commit
• Edit file
• Commit
• Push
11. R general
• Use RStudio (https://www.rstudio.com/products/rstudio/download/)
and explore it
– Projects
– Keyboard shortcuts
– Git integration
– Package development
– R markdown
• R Short Reference Card:
https://cran.r-project.org/doc/contrib/Short-refcard.pdf
• Style guide: http://adv-r.had.co.nz/Style.html
12. R package development
• R packages by Hadley Wickham (http://r-pkgs.had.co.nz/)
• Advantages:
– Can be shared easily
– One package with your data and your code
– Documentation (if you write it)
– Ease of testing
13. R packages: Getting started
• install.packages("devtools")
• RStudio -> new project -> new directory -> R package
• # Build and Reload Package: 'Ctrl + Shift + B'
• # Check Package: 'Ctrl + Shift + E'
• # Test Package: 'Ctrl + Shift + T'
• # Build documentation: 'Ctrl + Shift + D'
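As an illustration of the documentation step, the sketch below shows a function documented with roxygen2 comments, which the "Build documentation" shortcut can turn into a help page. The function `celsius_to_fahrenheit` is a made-up example, not taken from any existing package.

```r
#' Convert temperatures from Celsius to Fahrenheit
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @return Numeric vector of temperatures in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(c(0, 100))
#' @export
celsius_to_fahrenheit <- function(celsius) {
  # Fail early on non-numeric input
  stopifnot(is.numeric(celsius))
  celsius * 9 / 5 + 32
}
```

In a package this would live in a file under the R/ folder; the lines starting with `#'` are ordinary comments to R, so the function can also be sourced on its own.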
14. R packages: testing
• Test if your functions return the expected results
• Gives confidence in the correctness of your code, especially when
changing things
• http://r-pkgs.had.co.nz/tests.html
# Run once to set up tests/testthat
# (use_testthat() moved from devtools to usethis in devtools 2.0)
usethis::use_testthat()

library(stringr)
context("String length")
test_that("str_length is number of characters", {
  expect_equal(str_length("a"), 1)
  expect_equal(str_length("ab"), 2)
  expect_equal(str_length("abc"), 3)
})
15. R Markdown
• Easy creation of dynamic documents
– Mix of R and markdown
– Output to word, html or pdf
– Integrates nicely with version control as
markdown is a text format (easy to diff)
• RStudio: New file -> R Markdown
• Powered by knitr (alternative to Sweave)
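Besides the Knit button in RStudio, a document can also be rendered from the console or a script. The sketch below writes a tiny R Markdown file to a temporary folder and renders it with `rmarkdown::render()`, but only when the rmarkdown package and pandoc are installed. The file name and its content are assumptions made up for this illustration.

```r
# Write a tiny R Markdown file to a temporary folder (hypothetical content)
rmd_path <- file.path(tempdir(), "demo_report.Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  "The sum of 1 and 2 is `r 1 + 2`."
), rmd_path)

# Render only when rmarkdown and pandoc are available on this machine
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  # render() returns the path of the generated output file
  output_path <- rmarkdown::render(rmd_path, quiet = TRUE)
}
```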
16. R Markdown: example
---
title: "Numbers and their values"
output:
  word_document:
    fig_caption: yes
---
```{r, echo=FALSE, warning=FALSE, message=FALSE}
# R code block that won’t appear in the output document
three <- 1+2
```
# Chapter 1: On the value of 1 and 2
It is a well-known fact that 1 + 2 = `r three`; you can also calculate this inline: `r 1+2`.
Or show the entire calculation:
```{r}
1+2
```
17. Markdown basics
Headers
# Heading level 1
## Heading level 2
###### Heading level 6
*italic* and _this is also italic_
**bold** and __this is also bold__
*, + or - for (unordered) list items (bullets)
1., 2., … for ordered list items
This is an [example link](http://example.com/).
Image here: ![alt text](/path/to/img.jpg)
BibTeX references: [@RCoreTeam2014; @Wand2014], which requires a link
to a BibTeX file in the header, e.g. bibliography: bibliography.bib
More at: http://daringfireball.net/projects/markdown/basics
Also used in other places (GitHub, Stack Overflow, …), though sometimes in a dialect
18. Caching intermediate results
Official way: http://yihui.name/knitr/demo/cache/
Hand-rolled (more explicit, but doesn’t clean up previous versions and uses a hard-coded
cache directory):
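library(digest)

# Recompute make_fn() only when the file at change_path or make_fn itself changed
make_or_load <- function(change_path, file_prefix, make_fn, force_make = FALSE) {
  # modification time of the input file, as an integer timestamp
  changeid <- as.integer(file.info(change_path)$mtime)
  # hash of the printed function source, collapsed to one string so all lines count
  fn_source <- paste(capture.output(make_fn), collapse = "\n")
  fn_md5 <- digest(fn_source, algo = "md5", serialize = FALSE)
  path <- paste0("D:/temp/", file_prefix, changeid, "_", fn_md5, ".RData")
  if (!file.exists(path) || force_make) {
    result <- make_fn()
    save(result, file = path)
  } else {
    result <- get(load(path))
  }
  return(result)
}

df <- make_or_load(wb, "invasives_df_area_", function() { set_area(df) })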
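One way to avoid the hard-coded cache directory is to pass it as a parameter. The following is a sketch of such a variant using only base R. It is not the helper from the slides: the name `cache_or_compute`, its parameters and the demo input file are all assumptions, and it keys the cache on the content hash of the input file rather than on its modification time.

```r
# Hypothetical variant of the caching helper: base R only, cache_dir is a parameter
cache_or_compute <- function(input_path, key, compute_fn,
                             cache_dir = tempdir(), force = FALSE) {
  # The cache file name changes whenever the content of the input file changes
  input_md5 <- unname(tools::md5sum(input_path))
  cache_path <- file.path(cache_dir, paste0(key, "_", input_md5, ".rds"))
  if (force || !file.exists(cache_path)) {
    result <- compute_fn()
    saveRDS(result, cache_path)
  } else {
    result <- readRDS(cache_path)
  }
  result
}

# Usage with a throwaway input file and a fresh cache folder
input_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), input_path, row.names = FALSE)
cache_dir <- tempfile("cache_")
dir.create(cache_dir)

calls <- 0
compute <- function() {
  calls <<- calls + 1  # count how often the expensive step really runs
  sum(read.csv(input_path)$x)
}
first <- cache_or_compute(input_path, "sum_x", compute, cache_dir = cache_dir)
second <- cache_or_compute(input_path, "sum_x", compute, cache_dir = cache_dir)
```

The second call reads the stored result instead of calling `compute()` again. Unlike the helper on the slide, this variant does not notice changes to the function itself.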
21. Saving tables
• As HTML
library(stargazer)
stargazer(data, type = "html", summary = FALSE, out = outputpath, out.header = TRUE)
• As csv
write.csv2(data, file = outputpath)
data <- read.csv2(outputpath)
• As RData
save(data, file = outputpath)
load(outputpath)  # restores the object `data`; load() itself only returns the object names
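A short round-trip sketch for the CSV route, plus `saveRDS()`/`readRDS()` as an alternative that returns the object directly. The data frame and the temporary file paths are made up for the illustration. Note `row.names = FALSE`: without it, `write.csv2()` also writes the row names, which `read.csv2()` then reads back as an extra column.

```r
# Hypothetical example table
data <- data.frame(species = c("a", "b"), area = c(1.5, 2.25))

# CSV round trip (semicolon separated, decimal comma)
csv_path <- tempfile(fileext = ".csv")
write.csv2(data, file = csv_path, row.names = FALSE)
data_csv <- read.csv2(csv_path)

# RDS round trip: readRDS() returns the stored object, so it can be assigned
rds_path <- tempfile(fileext = ".rds")
saveRDS(data, rds_path)
data_rds <- readRDS(rds_path)
```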
22. Packrat
Use packrat to make your R projects more:
• Isolated: Installing a new or updated package for one
project won’t break your other projects, and vice versa.
That’s because packrat gives each project its own private
package library.
• Portable: Easily transport your projects from one computer
to another, even across different platforms. Packrat makes
it easy to install the packages your project depends on.
• Reproducible: Packrat records the exact package versions
you depend on, and ensures those exact versions are the
ones that get installed wherever you go.
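Independent of Packrat, the versions in use can also be written to a plain text file with base R, which gives a lightweight record to store next to the results. This is a complementary sketch, not a Packrat feature, and the file name is an assumption.

```r
# Hypothetical log file; in a project, keep it next to the analysis output
session_path <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_path)

# Version of a single package (stats ships with R)
stats_version <- as.character(packageVersion("stats"))
```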
23. Packrat
RStudio:
Packrat can be enabled when creating a project, or later in the
project settings
Manually:
install.packages("packrat")
# initialize packrat in a project directory
packrat::init("D:/temp/demo_packrat")
# install a package
install.packages("raster")
# save the changes in Packrat (by default done by auto-snapshot)
packrat::snapshot()
# view the list of packages that might be missing or that can be removed
packrat::status()
24. DEMO
• Package development (new, existing)
• Rmarkdown (new, existing)
• Packrat (new and existing project)
– packrat::init()