A Step Towards Reproducibility in R

Joseph Rickert's presentation at H2O World, November 2014.

  1. A Step Towards Reproducibility in R. H2O World, November 18-19, 2014
  2. R’s popularity is growing rapidly. IEEE Spectrum Top Programming Languages #15: R • IEEE Spectrum, July 2014 • RedMonk Programming Language Rankings, 2013
  3. R is used more than other data science tools • O’Reilly Strata 2013 Data Science Salary Survey • KDnuggets Poll: Top Languages for analytics, data mining, data science
  4. R is among the highest-paid IT skills in the US • Dice Tech Salary Survey, January 2014 • O’Reilly Strata 2013 Data Science Salary Survey
  5. Companies Using R
  6. Google • Advertising Effectiveness • Economic forecasting. “The great beauty of R is that you can modify it to do all sorts of things.” (Hal Varian, Chief Economist, Google) “R is really important to the point that it's hard to overvalue it.” (Daryl Pregibon, Head of Statistics, Google)
  7. Facebook • Exploratory Data Analysis • Experimental Analysis. “Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” (Solomon Messing, data scientist at Facebook)
  8. Twitter • Data Visualization • Semantic clustering. “A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end” (Ed Chen, Data Scientist, Twitter)
  9. Insurance • Risk Analysis • Marketing Analytics • Catastrophe Modeling
  10. Finance and Banking • Credit Risk Analysis • Financial Networks
  11. John Deere. Statistical Analysis: • Short Term Demand Forecasting • Crop Forecasting • Long Term Demand Forecasting • Maintenance and Reliability • Production Scheduling • Data Coordination
  12. Monsanto. Statistical Analysis: • Plant Breeding • Fertility mapping • Precision Seeding • Disease Management • Yield forecasting
  13. Public Affairs • Casualty estimation in war zones • Political Analysis
  14. Pharmaceuticals. Regulatory Drug Approvals • Reproducible research • Accurate, reliable and consistent statistical analysis • Internal reporting (Section 508 compliance). “R use at the FDA is completely acceptable and has not caused any problems.” (Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration)
  15. Weather and Climate • Climate change forecasts • Flood Warnings
  16. Revolution Analytics • Open Source development – Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit – Project funding • Community Support – User Group Sponsorship – Meetups – Events sponsorship – Revolutions Blog
  17. “Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method …” (Wikipedia) “Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.” (Roger Peng)
  18. Reproducibility – why do we care? Academic / Research • Verify results • Advance Research. Business • Production code • Reliability • Reusability • Collaboration • Regulation. www.nytimes.com/2011/07/08/health/research/08genes.html http://arxiv.org/pdf/1010.1092.pdf
  19. An R Reproducibility Problem. Adapted from http://xkcd.com/234/ (CC BY-NC 2.5)
  20. Revolution Analytics’ Reproducibility Environment • A distribution of R (RRO) that points to a static CRAN mirror • The checkpoint server: the static CRAN mirror – CRAN packages fixed with each Revolution R Open update (currently 10/1/14) • Daily CRAN snapshots – Storing every package version since September 2014 – Binaries and sources – At mran.revolutionanalytics.com/snapshot • CRAN package checkpoint. [Diagram: CRAN; checkpoint server (Midnight UTC); daily snapshots, http://mran.revolutionanalytics.com/snapshot/; CRAN mirror, http://cran.revolutionanalytics.com/; checkpoint package, library(checkpoint) then checkpoint("2014-09-17")]
  21. Using Revolution Analytics’ Reproducibility Tools • Scenario 1: Set up a consistent, company-wide R environment – Have users download RRO – All users will get the base and recommended packages as of 10/1/14 – For each project, R users run checkpoint to download a consistent set of packages appropriate for that project • Scenario 2: With or without RRO, share scripts synced to a snapshot – Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package – Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project – checkpoint will automatically download the correct versions of the packages used in the scripts
  22. Using checkpoint • Easy to use: add 2 lines to the top of each script: library(checkpoint) and checkpoint("2014-09-17") • For the package author: – Uses package versions available on the chosen date – Installs packages local to this project (allows different package versions to be used simultaneously) • For a script collaborator: – Automatically installs required packages (detects required packages; no need to manually install!) – Uses same package versions as script author to ensure reproducibility
  23. # Create a local checkpoint library
      library(checkpoint)
      checkpoint("2014-11-14")

      > library(checkpoint)
      checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
      http://projects.revolutionanalytics.com/rrt/
      Warning message:
      package ‘checkpoint’ was built under R version 3.1.2
      > checkpoint("2014-11-14")
      Scanning for loaded pkgs
      Scanning for packages used in this project
      Installing packages used in this project
      Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available
      also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’, ‘survival’, ‘XML’, ‘httr’, ‘Matrix’
      package ‘bitops’ successfully unpacked and MD5 sums checked
      package ‘stringr’ successfully unpacked and MD5 sums checked
      package ‘digest’ successfully unpacked and MD5 sums checked
      package ‘jsonlite’ successfully unpacked and MD5 sums checked
      package ‘lattice’ successfully unpacked and MD5 sums checked
      package ‘RCurl’ successfully unpacked and MD5 sums checked
      package ‘rjson’ successfully unpacked and MD5 sums checked
      package ‘statmod’ successfully unpacked and MD5 sums checked
      package ‘survival’ successfully unpacked and MD5 sums checked
      package ‘XML’ successfully unpacked and MD5 sums checked
      package ‘httr’ successfully unpacked and MD5 sums checked
      package ‘Matrix’ successfully unpacked and MD5 sums checked
      package ‘h2o’ successfully unpacked and MD5 sums checked
      package ‘miniCRAN’ successfully unpacked and MD5 sums checked
      package ‘igraph’ successfully unpacked and MD5 sums checked
  24. MRAN: The Managed R Archive Network • Download RRO • Learn about R and RRO • Daily CRAN snapshots • Explore Packages – and dependencies • Explore Task Views
  25. Thank You. Joseph Rickert, Joseph.rickert@revolutionanalytics.com, @revojoe, blog.revolutionanalytics.com
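
The checkpoint slides describe two ideas: detecting the packages a project's scripts use, and installing them from a CRAN snapshot fixed to a date. The base-R sketch below illustrates both under stated assumptions. It is not the checkpoint package's actual implementation. The helper names snapshot_url and find_packages are hypothetical, and the URL layout (the snapshot base URL from the slides with the date appended) is an assumption.

```r
# Hypothetical helpers illustrating the ideas on the checkpoint slides.
# They use base R only and are NOT part of the checkpoint package.

# Build a dated snapshot repository URL.
# Assumption: the snapshot base URL from the slides, with the snapshot
# date appended as a path component.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("date must be in yyyy-mm-dd form: ", date)
  }
  paste0(base, "/", format(parsed, "%Y-%m-%d"))
}

# Find package names in library()/require() calls within lines of R code.
# A regex scan is a simplification: it misses pkg::fun usage and other
# ways of loading packages.
find_packages <- function(code_lines) {
  pattern <- "(library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(code_lines, regexpr(pattern, code_lines))
  sort(unique(sub(pattern, "\\2", hits)))
}

# Example: the lines of a small project script.
script <- c("library(checkpoint)",
            "library(h2o)",
            "require(\"miniCRAN\")",
            "x <- 1")

pkgs <- find_packages(script)
repo <- snapshot_url("2014-09-17")

# Installing from the dated snapshot would then look like this (left
# commented out because it needs network access and a live snapshot server):
# install.packages(pkgs, repos = repo)
```

Installing into a project-specific library, as the slides describe, would additionally mean passing a project-local lib path to install.packages. The sketch leaves that out to stay short.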
