R and Reproducibility
A Proposal
David Smith
useR! 2014
Presented at useR! 2014, July 2, 2014.

The R ecosystem is in a state of near-constant change. While a new version of the R engine is now released just once a year, 2-3 patches are usually released in the interim. On top of that, new versions of R packages on CRAN are released at a rate of several per day (and that’s not counting packages that are part of the BioConductor project or hosted elsewhere on the Web).

While this rapid change is a boon for the advancement of R, it can cause problems for package authors[1] and also for scientists and their peers who may need to reliably reproduce the results of an R script (possibly dependent on a number of packages) months or even years down the line. In this talk we propose a downstream distribution of CRAN packages that provides for the reproducibility of R scripts and reduces the impact of dependencies for package authors.

  • http://xkcd.com/242/
  • https://stat.ethz.ch/pipermail/r-devel/2014-March/068552.html
  • R reproducibility

    1. R and Reproducibility: A Proposal
       David Smith, useR! 2014
    2. What is Reproducibility?
       “The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.” (CRAN Task View on Reproducible Research, Kuhn)
       • Method + Environment -> Results
       • A process for:
         – Sharing the method
         – Describing the environment
         – Recreating the results
       xkcd.com/242/
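The "describing the environment" step above can be sketched in base R. This is only an illustration, not something shown in the talk: the function name `describe_environment` is hypothetical, and it records just the R version, the platform, and the versions of the packages you name.

```r
# Sketch: record the R version, platform and package versions for a script.
# describe_environment() is a hypothetical helper, not part of any package.
describe_environment <- function(pkgs) {
  versions <- vapply(
    pkgs,
    function(p) as.character(utils::packageVersion(p)),
    character(1)
  )
  list(
    r_version = R.version.string,
    platform  = R.version$platform,
    packages  = data.frame(
      package = pkgs,
      version = unname(versions),
      stringsAsFactors = FALSE
    )
  )
}

# Example with two packages that ship with every R installation.
env <- describe_environment(c("stats", "utils"))
print(env$packages)
```

Saving such a record next to a script gives a reader the "Environment" half of Method + Environment -> Results, although it does not by itself make the recorded versions installable.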
    3. Why care about reproducibility?
       Academic / Research
       • Verify results
       • Advance research
       Business
       • Production code
       • Reliability
       • Reusability
       • Regulation
       www.nytimes.com/2011/07/08/health/research/08genes.html
       http://arxiv.org/pdf/1010.1092.pdf
    4. R and Reproducibility
       Results
       • Hand-assembled
       • Sweave/knitr/DeployR/Shiny
       Interfaces
       • R GUI / DevelopR / RStudio
       • Batch / Web Services
       Platform
       • OS / Virtualization
       • Hardware Architecture
       Packages
       • CRAN
       • BioConductor / GitHub / …
       R Engine
       • R Version
       • Base + Recommended pkgs
    5. Observations
       • R versions are pretty manageable
         – Major versions just once a year
         – Patches rarely introduce incompatible changes
       • Good solutions for literate programming
         – Interfaces help
       • OS/Hardware not the major cause of problems
       • The big problem is with packages
         – CRAN is in a state of continual flux
    6. Package Problem #1: The User (http://xkcd.com/234/)
       “I heard you need to create a TPS Report. Here, I’ve got an R script that does that already.”
       “Oh, you need to download these 5 packages first.”
       “I already did, and it still doesn’t work!”
       “Well, it worked when I wrote it 3 weeks ago.”
       “Grr. Package updates…”
    7. Package Problem #2: The Author (http://xkcd.com/970/)
       “Time to update my package on CRAN!”
       >> Dependent packages that now fail to build: 67
       >> Resubmit your package and try again
       “Crap.”
    8. Package Problem #3: The Update (http://xkcd.com/664/)
       3 days later…
       “Woot! A new version of R is out! I have 10 minutes now, time to download and install!”
       … package not found …
       … can’t install package …
       … error …
    9. The Proposal
       • Change the default way R handles packages
         – Install packages local to projects
       • “Snapshot” CRAN daily
         – Make it easy to get & use package versions used in script development
       Not a new idea!
         – Ooms, “Possible Directions for Improving Dependency Versioning in R”, R Journal 5/1
         – BioConductor Project
         – Revolution R Enterprise
         – Linux distros
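The "install packages local to projects" idea above can be approximated with base R alone. The following is a minimal sketch under assumptions, not the mechanism proposed in the talk: the helper name `use_project_library` and the `library` folder name are hypothetical, while `.libPaths()` and the `lib` argument of `install.packages()` are standard R.

```r
# Sketch: give a project its own package library using base R only.
# use_project_library() and the "library" folder name are hypothetical.
use_project_library <- function(project_dir) {
  lib <- file.path(project_dir, "library")
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  # Put the project library first so R searches it, and installs into it,
  # before any user-wide or system-wide library.
  .libPaths(c(lib, .libPaths()))
  invisible(normalizePath(lib))
}

project <- file.path(tempdir(), "my-project")
lib <- use_project_library(project)

# With the project library first on the search path, a call such as
#   install.packages("ggplot2", lib = lib)
# would install into this project only, leaving other projects untouched.
```

Because each project then resolves packages from its own folder first, updating a package for one script cannot silently change the behaviour of another.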
    10. Example
        • R script file using 6 most popular packages
    11. Sharing a script reproducibly … and simply
        # Run with R 3.1.0
        require(RRT)
        mran_set(snapshot="2014-06-27")
        # find packages used in this project
        # get package versions used by script author
        # install locally to this project
        require(ggplot2)
        require(data.table)
        require(knitr)
        …
    12. RRT: The R Reproducibility Toolkit
        • Open Source R Package (GPLv2)
        • From an R project folder:
          – Detect packages & dependencies used in project
          – Download and install from MRAN
          – Versions selected according to script date
          – Find and use packages from local install
        github.com/RevolutionAnalytics/RRT
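The slides do not say how RRT detects the packages a project uses. One simple way to approximate that first step, shown here purely as an assumption, is to scan script lines for `library()` and `require()` calls with a regular expression. The function name `detect_packages` is hypothetical and is not RRT's API.

```r
# Sketch: find package names in library()/require() calls.
# detect_packages() is a hypothetical helper, not a function from RRT.
detect_packages <- function(lines) {
  # Match library(pkg), require(pkg), library("pkg") or require('pkg').
  pattern <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
  calls <- unlist(hits)
  # Keep only the captured package name from each matched call.
  pkgs <- sub(pattern, "\\1", calls, perl = TRUE)
  sort(unique(pkgs))
}

script <- c(
  "# Run with R 3.1.0",
  "require(ggplot2)",
  "require(data.table)",
  "library(\"knitr\")",
  "x <- 1"
)
pkgs <- detect_packages(script)
print(pkgs)
```

A regular expression is a rough tool here: it also matches calls inside comments and misses packages used only through `pkg::fun`, so a fuller detector would parse the code rather than scan its text.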
    13. MRAN - Implementation
        A downstream CRAN mirror with daily snapshots
        • Use rsync to mirror CRAN daily
          – Only downloads changed packages
        • Use zfs to store incremental snapshots
          – Storage only required for new packages
        • Organize snapshots into a labelled hierarchy
          – Access package versions by date of use
        • CRAN snapshot server hosted by cloud provider
          – Provisioned for availability and latency
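From the user's side, "access package versions by date of use" amounts to pointing R at a repository URL that encodes a date. The sketch below illustrates that idea only: the host `snapshots.example.org`, the `/snapshot/<date>` layout, and the helper names are hypothetical placeholders, not the real MRAN addresses. Only `options(repos = ...)` is standard R.

```r
# Sketch: point R at a dated snapshot of a CRAN-like repository.
# The base URL and the "/snapshot/<date>" layout are hypothetical.
snapshot_repo <- function(date, base = "https://snapshots.example.org") {
  date <- as.Date(date)  # stops with an error on a malformed date
  paste0(base, "/snapshot/", format(date, "%Y-%m-%d"))
}

use_snapshot <- function(date, ...) {
  # options() returns the previous values, so the caller can restore them.
  old <- options(repos = c(CRAN = snapshot_repo(date, ...)))
  invisible(old)
}

old <- use_snapshot("2014-06-27")
print(getOption("repos"))
options(old)  # restore the previous repository setting
```

After `use_snapshot()`, an ordinary `install.packages()` call would resolve packages against the dated repository, which is why the default repository setting matters so much for reproducibility.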
    14. Future work
        • Just getting started!
        • Snapshot binaries and source packages
        • Other repos (BioConductor, GitHub, user)
        • Institution-level package duplication
          – CRAN “behind the firewall”
        • User-defined package versions
        • Checks on R versions
        • Suggestions welcome!
        github.com/RevolutionAnalytics/RRT
    15. Thank You!
        David Smith
        david@revolutionanalytics.com
        blog.revolutionanalytics.com
    16. Possible Solution
        • Bundle all packages with scripts
        • Packrat solves this very well
          – Project + package dependencies stored in Github
        • But:
          – Contributes to package fragmentation
          – Adds friction to the sharing process
          – Doesn’t address the problem for R generally
    17. CRAN vs Github
        CRAN
        • “Repository of Record”
          – Default for R users
        • Strict quality checking
        • Handles dependencies
        • Binaries built
          – But only current versions saved
        • Manual update process
        • Dependent on volunteer support
        Github
        • Frictionless publishing / updates
          – RStudio integration
        • Social development
          – Pull requests FTW
        • Ease of updates
        • Fragmented – no unified directory of packages
        • Permanence – accounts closed / repos deleted
    18. A downstream CRAN solution?
        “I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages.”
        -- R-core member, r-devel, March 2014
    19. Snapshot CRAN repository: requirements
        • Availability
        • Latency
        • Bandwidth
        • Storage
        • Binary package archives
        • Other enhancements?
    20. Proposal
        [Diagram; recoverable labels: CRAN, MRAN, “Development Branch”, “Stable Branch”, Downstream, Reproducible]
        Defaults are important!!