1. A Step Towards Reproducibility in R
H2O World
November 18 - 19, 2014
2. R’s popularity is growing rapidly
IEEE Spectrum Top Programming Languages
#15: R
• IEEE Spectrum, July 2014
• RedMonk Programming Language Rankings, 2013
3. R is used more than other data science tools
• O’Reilly Strata 2013 Data Science Salary Survey
• KDNuggets Poll: Top Languages for analytics, data mining, data science
4. R is among the highest-paid IT skills in the US
• Dice Tech Salary Survey, January 2014
• O’Reilly Strata 2013 Data Science Salary Survey
6. Google
“The great beauty of R is that you can modify it to do all sorts of things.” (Hal Varian, Chief Economist, Google)
“R is really important to the point that it's hard to overvalue it.” (Daryl Pregibon, Head of Statistics, Google)
• Advertising Effectiveness
• Economic forecasting
7. Facebook
• Exploratory Data Analysis
• Experimental Analysis
“Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” (Solomon Messing, data scientist at Facebook)
8. Twitter
“A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end” (Ed Chen, Data Scientist, Twitter)
• Data Visualization
• Semantic clustering
11. John Deere
Statistical Analysis:
• Short Term Demand Forecasting
• Crop Forecasting
• Long Term Demand Forecasting
• Maintenance and Reliability
• Production Scheduling
• Data Coordination
13. Public Affairs
• Casualty estimation in war zones
• Political Analysis
14. Pharmaceuticals
“R use at the FDA is completely acceptable and has not caused any problems.” (Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration)
Regulatory Drug Approvals
• Reproducible research
• Accurate, reliable and consistent statistical analysis
• Internal reporting (Section 508 compliance)
16. Revolution Analytics
Open Source development
– Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit
– Project funding
Community Support
– User Group Sponsorship
– Meetups
– Events sponsorship
– Revolutions Blog
17. Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method … (Wikipedia)
Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. (Roger Peng)
18. Reproducibility – why do we care?
Academic / Research
Verify results
Advance Research
Business
Production code
Reliability
Reusability
Collaboration
Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
19. An R Reproducibility Problem
Adapted from http://xkcd.com/234/ CC BY-NC 2.5
20. Revolution Analytics’ Reproducibility Environment
A Distribution of R (RRO) that points to a static CRAN mirror
The Checkpoint Server: the static CRAN mirror
– CRAN packages fixed with each Revolution R Open update (currently October 1, 2014)
Daily CRAN snapshots
– Storing every package version since September 2014
– Binaries and sources
– At mran.revolutionanalytics.com/snapshot
CRAN package: checkpoint
[Diagram: CRAN → CRAN mirror (http://cran.revolutionanalytics.com/) → checkpoint server, which takes daily snapshots at midnight UTC (http://mran.revolutionanalytics.com/snapshot/) → checkpoint package, invoked with library(checkpoint) and checkpoint("2014-09-17")]
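To make the idea of a static, dated mirror concrete, here is a minimal sketch in base R that points a session at one snapshot. The URL layout (snapshot address followed by a yyyy-mm-dd date) is an assumption based on the address on this slide, and the helper names snapshot_url and use_snapshot are hypothetical; the checkpoint package does considerably more than this.

```r
# Sketch only: build a dated snapshot URL and use it as the CRAN repository.
# Assumption: snapshots live at <base>/<yyyy-mm-dd>.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  d <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(d)) stop("date must be in yyyy-mm-dd form: ", date)
  paste(base, format(d, "%Y-%m-%d"), sep = "/")
}

use_snapshot <- function(date) {
  url <- snapshot_url(date)
  # Later install.packages() calls in this session would use this repository.
  options(repos = c(CRAN = url))
  invisible(url)
}

use_snapshot("2014-09-17")
getOption("repos")[["CRAN"]]
```

Nothing is downloaded here; the sketch only changes which repository later installs would consult.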
21. Using Revolution Analytics’ Reproducibility Tools
Scenario 1: Set up a consistent, company-wide R environment
– Have users download RRO
– All users will get the base and recommended packages as of October 1, 2014
– For each project, R users run checkpoint to download a consistent set of packages that are appropriate for that project
Scenario 2: With or without RRO, share scripts synced to a snapshot
– Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package
– Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project
– checkpoint will automatically download the correct versions of the packages used in the scripts
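Both scenarios depend on each project having its own set of packages. The following base R sketch shows one way a per-project, per-snapshot library could be put first on the search path. The directory layout and the helper name project_lib are hypothetical and are not the layout checkpoint itself uses.

```r
# Sketch only: create a library directory tied to a project and a snapshot
# date, then search it before the default libraries.
project_lib <- function(project_dir, snapshot_date) {
  lib <- file.path(project_dir, ".library", snapshot_date)  # hypothetical layout
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  normalizePath(lib, winslash = "/")
}

lib <- project_lib(tempdir(), "2014-09-17")

old_paths <- .libPaths()
.libPaths(c(lib, old_paths))   # the project library is now searched first
first <- .libPaths()[1]
first

.libPaths(old_paths)           # restore the previous library paths
```

Because each date gets its own directory, two projects pinned to different snapshots can hold different versions of the same package side by side.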
22. Using checkpoint
Easy to use: add 2 lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
For the script author:
– Uses package versions available on the chosen date
– Installs packages local to this project
• Allows different package versions to be used simultaneously
For a script collaborator:
– Automatically installs required packages
• Detects required packages (no need to manually install!)
– Uses same package versions as script author to ensure reproducibility
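As a rough illustration of what detecting required packages can mean, the sketch below scans script text for library() and require() calls using base R. It is a simplified stand-in, not checkpoint's actual scanning code, and the function name find_packages is hypothetical.

```r
# Sketch only: list packages named in library()/require() calls.
# Limitations: ignores pkg::fun usage and does not parse the code properly.
find_packages <- function(lines) {
  pattern <- "(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  calls <- unlist(regmatches(lines, gregexpr(pattern, lines, perl = TRUE)))
  pkgs <- sub(pattern, "\\2", calls, perl = TRUE)
  sort(unique(pkgs))
}

script <- c(
  "library(checkpoint)",
  "checkpoint(\"2014-09-17\")",
  "library(h2o)",
  "require('miniCRAN')",
  "x <- igraph::graph.ring(10)"  # not detected by this sketch
)

find_packages(script)
```

A collaborator's machine could then install whatever this list contains from the chosen snapshot, which is the step the slide describes as automatic.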
23. Creating a local checkpoint library
# Create a local checkpoint library
library(checkpoint)
checkpoint("2014-11-14")
> library(checkpoint)
checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
http://projects.revolutionanalytics.com/rrt/
Warning message:
package ‘checkpoint’ was built under R version 3.1.2
> checkpoint("2014-11-14")
Scanning for loaded pkgs
Scanning for packages used in this project
Installing packages used in this project
Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available
also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’,
‘survival’, ‘XML’, ‘httr’, ‘Matrix’
package ‘bitops’ successfully unpacked and MD5 sums checked
package ‘stringr’ successfully unpacked and MD5 sums checked
package ‘digest’ successfully unpacked and MD5 sums checked
package ‘jsonlite’ successfully unpacked and MD5 sums checked
package ‘lattice’ successfully unpacked and MD5 sums checked
package ‘RCurl’ successfully unpacked and MD5 sums checked
package ‘rjson’ successfully unpacked and MD5 sums checked
package ‘statmod’ successfully unpacked and MD5 sums checked
package ‘survival’ successfully unpacked and MD5 sums checked
package ‘XML’ successfully unpacked and MD5 sums checked
package ‘httr’ successfully unpacked and MD5 sums checked
package ‘Matrix’ successfully unpacked and MD5 sums checked
package ‘h2o’ successfully unpacked and MD5 sums checked
package ‘miniCRAN’ successfully unpacked and MD5 sums checked
package ‘igraph’ successfully unpacked and MD5 sums checked
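The eight packages named in the "not available" warning above all appear to be base packages, which ship with R itself rather than being downloaded from a CRAN mirror; that would explain why no download was found for them. A quick check using only base R (the variable names are illustrative):

```r
# Packages listed in the warning from the console output above.
warned <- c("stats", "tools", "utils", "methods",
            "graphics", "splines", "grid", "grDevices")

# Packages with priority "base" in the current R installation.
base_pkgs <- rownames(installed.packages(priority = "base"))

# Any warned package that is NOT a base package would be listed here.
not_base <- setdiff(warned, base_pkgs)
not_base
```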
24. MRAN: The Managed R Archive Network
Download RRO
Learn about R and RRO
Daily CRAN snapshots
Explore Packages
– and dependencies
Explore Task Views
25. Thank You
Joseph Rickert
Joseph.rickert@revolutionanalytics.com, @revojoe
blog.revolutionanalytics.com