This document discusses reproducible environments for data science projects. It describes challenges such as dependency problems when moving code between systems. It recommends tools such as Packrat, Checkpoint and Switchr for managing package dependencies in R code. Docker is presented as a way to create portable development environments using containerization. The document demonstrates how to build a Docker container for an R project and argues that containers will become more important for reproducible data science.
8. Doug Ashton - Consultant
dashton@mango-solutions.com
Two Warnings
• Retracted cancer research - http://bit.ly/londonr01 (video)
• Reinhart, Rogoff, and the Excel Error That Changed History - http://bit.ly/londonr02
the solution
code
Package Management Packages
Package      Pros                                Cons
Packrat      Built into RStudio                  Bulky in repo (e.g. git); needs foresight
Checkpoint   Easy; rescues old scripts           Lots of libraries; only one date
Switchr      Package manifests; repo friendly    Not easy
Checkpoint
Pros
• Get old scripts working
• Simple (just set a date): checkpoint("2014-11-05")
• Downloads packages from MRAN as of that date (more later)
Cons
• Could end up with lots of packages (a library for each date)
• Doesn’t help with multiple dates
• No GitHub
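The "just set a date" idea can be sketched in base R as pointing the session's package repository at a dated snapshot. This is only an illustration of the concept, not the checkpoint package's actual implementation: `snapshot_repo` and `use_snapshot` are hypothetical helpers, and the base URL is an assumed MRAN-style snapshot address that may no longer be served, so it is left as a parameter.

```r
# Hypothetical helper: build the URL of a dated package snapshot.
# The default base URL is an assumption; pass your own mirror if needed.
snapshot_repo <- function(date, base = "https://mran.microsoft.com/snapshot") {
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("date must be in YYYY-MM-DD form: ", date)
  }
  paste0(base, "/", format(parsed, "%Y-%m-%d"))
}

# Hypothetical helper: make install.packages() in this session use the
# snapshot. Nothing is downloaded here; only the `repos` option changes.
use_snapshot <- function(date, ...) {
  url <- snapshot_repo(date, ...)
  options(repos = c(CRAN = url))
  invisible(url)
}
```

Calling `use_snapshot("2014-11-05")` before `install.packages()` would request package versions as they stood on that date, provided the snapshot server still exists. Unlike checkpoint, this sketch does not scan scripts for dependencies or keep a separate library per date.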
Switchr
• No demo
Docker Demo
• https://github.com/dougmet/dockerR
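# Base image, tagged for R 3.1.2
FROM dougmet/r-base:3.1.2
# System libraries (GSL), pinned by version
RUN apt-get -y install libgsl0ldbl=1.16*1 libgsl0-dev=1.16*
# Install R package manifest
COPY loadPackages.R /tmp/
COPY packages.csv /tmp/
RUN Rscript /tmp/loadPackages.R
# Start an interactive R session by default
CMD ["R"]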
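The Dockerfile above installs R packages from a manifest, but the slides do not show the contents of loadPackages.R or packages.csv. A minimal sketch of what such a script might do, assuming the CSV has `package` and `version` columns (both column names and all function names here are hypothetical):

```r
# Read the manifest. Versions are read as text so "1.10" is not turned
# into the number 1.1. Column names `package` and `version` are assumed.
read_manifest <- function(path) {
  manifest <- read.csv(path, colClasses = "character")
  stopifnot(all(c("package", "version") %in% names(manifest)))
  manifest
}

# Versions currently installed, as a named character vector
# (names are package names, values are version strings).
installed_versions <- function() {
  ip <- installed.packages()
  setNames(ip[, "Version"], ip[, "Package"])
}

# Return the manifest rows whose pinned version is not already installed.
packages_to_install <- function(manifest, installed = installed_versions()) {
  have <- unname(installed[manifest$package])  # NA where a package is absent
  needs <- is.na(have) | have != manifest$version
  manifest[needs, , drop = FALSE]
}
```

A script along these lines could then loop over the rows returned by `packages_to_install(read_manifest("/tmp/packages.csv"))` and install each pinned version, for example with `remotes::install_version()`, assuming the remotes package is available in the image. Keeping the comparison separate from the install step makes the manifest logic easy to check without network access.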
The near future
• More centralised R installations
• Centralised images
• Windows containers (open container initiative)
• Data Science Workbench
• Managed image repo
• One click to open project
• RStudio integration