A Step Towards Reproducibility in R
H2O World
November 18 - 19, 2014
2 
R’s popularity is growing rapidly 
IEEE Spectrum Top Programming Languages 
#15: R 
• IEEE Spectrum, July 2014
• RedMonk Programming Language Rankings, 2013
3 
R is used more than other data science tools 
• O’Reilly Strata 2013 Data Science Salary Survey
• KDNuggets Poll: Top Languages for analytics, data mining, data science
4 
R is among the highest-paid IT skills in the US 
• Dice Tech Salary Survey, January 2014
• O’Reilly Strata 2013 Data Science Salary Survey
Companies Using R 
5
6
Google
“The great beauty of R is that you can modify it to do all sorts of things.” (Hal Varian, Chief Economist, Google)
“R is really important to the point that it's hard to overvalue it.” (Daryl Pregibon, Head of Statistics, Google)
• Advertising Effectiveness
• Economic forecasting
Facebook
• Exploratory Data Analysis
• Experimental Analysis
“Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” (Solomon Messing, data scientist at Facebook)
8 
Twitter 
“A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end” (Ed Chen, Data Scientist, Twitter)
• Data Visualization
• Semantic clustering
9 
Insurance 
• Risk Analysis
• Marketing Analytics
• Catastrophe Modeling
10 
Finance and Banking 
• Credit Risk Analysis
• Financial Networks
11 
John Deere 
Statistical Analysis: 
• Short Term Demand Forecasting 
• Crop Forecasting 
• Long Term Demand Forecasting 
• Maintenance and Reliability 
• Production Scheduling 
• Data Coordination
12 
Monsanto 
Statistical Analysis: 
• Plant Breeding 
• Fertility mapping 
• Precision Seeding 
• Disease Management 
• Yield forecasting
13 
Public Affairs 
• Casualty estimation in Warzones
• Political Analysis
14 
Pharmaceuticals 
“R use at the FDA is completely acceptable and has not caused any problems.” (Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration)
Regulatory Drug Approvals
• Reproducible research
• Accurate, reliable and consistent statistical analysis
• Internal reporting (Section 508 compliance)
15 
Weather and Climate 
• Climate change forecasts
• Flood Warnings
16 
Revolution Analytics 
• Open Source development
  – Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit
  – Project funding
• Community Support
  – User Group Sponsorship
  – Meetups
  – Events sponsorship
  – Revolutions Blog
Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method. (Wikipedia)
Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. (Roger Peng)
Reproducibility – why do we care? 
Academic / Research 
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Collaboration
• Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html 
http://arxiv.org/pdf/1010.1092.pdf 
18
19 
An R Reproducibility Problem 
Adapted from http://xkcd.com/234/ CC BY-NC 2.5
20 
Revolution Analytics’ Reproducibility Environment 
• A distribution of R (RRO) that points to a static CRAN mirror
• The Checkpoint Server: the static CRAN mirror
  – CRAN packages fixed with each Revolution R Open update (currently October 1, 2014)
• Daily CRAN snapshots
  – Storing every package version since September 2014
  – Binaries and sources
  – At mran.revolutionanalytics.com/snapshot
• CRAN package checkpoint
[Diagram labels: CRAN; CRAN mirror (http://cran.revolutionanalytics.com/); checkpoint server; daily snapshots, taken at midnight UTC (http://mran.revolutionanalytics.com/snapshot/); RR; checkpoint package: library(checkpoint), checkpoint("2014-09-17")]
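The daily snapshots on this slide can be addressed by date. As a minimal sketch, assuming each snapshot lives at the snapshot address followed by a yyyy-mm-dd date (the slide gives only the base address, so that layout and the helper name snapshot_url are assumptions made for illustration), a repository URL could be built as follows. The code only builds a string; it does not contact the server.

```r
# Hypothetical helper: build the address of a dated CRAN snapshot.
# Assumption: snapshots are laid out as <base>/<yyyy-mm-dd>.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  d <- as.Date(date, format = "%Y-%m-%d")   # NA if the text is not a date
  if (is.na(d)) stop("date must be given as yyyy-mm-dd")
  paste(base, format(d, "%Y-%m-%d"), sep = "/")
}

repo <- snapshot_url("2014-09-17")
print(repo)

# To point a session at that snapshot, the repos option could be set.
# Left commented out so the sketch has no side effects:
# options(repos = c(CRAN = repo))
```

With the repos option set this way, later install.packages() calls in the session would resolve packages against the dated repository instead of the live mirror.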
21 
Using Revolution Analytics’ Reproducibility Tools 
• Scenario 1: Set up a consistent, company-wide R environment
  – Have users download RRO
  – All users will get the base and recommended packages as of October 1, 2014
  – For each project, R users run checkpoint to download a consistent set of packages that are appropriate for that project
• Scenario 2: With or without RRO, share scripts synced to a snapshot
  – Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package
  – Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project
  – Checkpoint will automatically download the correct version of the packages used in the scripts
22 
Using checkpoint 
• Easy to use: add two lines to the top of each script
library(checkpoint)
checkpoint("2014-09-17")
• For the script author:
  – Uses package versions available on the chosen date
  – Installs packages local to this project
    • Allows different package versions to be used simultaneously
• For a script collaborator:
  – Automatically installs required packages
    • Detects required packages (no need to manually install!)
  – Uses same package versions as script author to ensure reproducibility
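To make the "detects required packages" step concrete, here is a rough sketch of one way such detection could work: scanning script text for library() and require() calls. This is an illustration in base R, not the checkpoint package's actual implementation; the function name detect_packages and the sample script lines are hypothetical.

```r
# Hypothetical sketch: find package names in library()/require() calls.
# Limitations: only the first call on each line is found, and commented-out
# calls are not skipped.
detect_packages <- function(lines) {
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z][A-Za-z0-9.]*)"
  hits <- regmatches(lines, regexec(pattern, lines))
  pkgs <- vapply(
    hits,
    function(m) if (length(m) == 3) m[3] else NA_character_,
    character(1)
  )
  sort(unique(pkgs[!is.na(pkgs)]))
}

# Made-up script contents, used only to exercise the function.
script <- c(
  "library(checkpoint)",
  "checkpoint(\"2014-09-17\")",
  "library(h2o)",
  "require(\"igraph\")",
  "x <- 1"
)
print(detect_packages(script))
```

A real scanner would need to parse the code instead of matching text, for example to handle pkg::fun references.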
23 
# Create a local checkpoint library 
library(checkpoint) 
checkpoint("2014-11-14") 
> library(checkpoint) 
checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics 
http://projects.revolutionanalytics.com/rrt/ 
Warning message: 
package ‘checkpoint’ was built under R version 3.1.2 
> checkpoint("2014-11-14") 
Scanning for loaded pkgs 
Scanning for packages used in this project 
Installing packages used in this project 
Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available 
also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’, 
‘survival’, ‘XML’, ‘httr’, ‘Matrix’ 
package ‘bitops’ successfully unpacked and MD5 sums checked 
package ‘stringr’ successfully unpacked and MD5 sums checked 
package ‘digest’ successfully unpacked and MD5 sums checked 
package ‘jsonlite’ successfully unpacked and MD5 sums checked 
package ‘lattice’ successfully unpacked and MD5 sums checked 
package ‘RCurl’ successfully unpacked and MD5 sums checked 
package ‘rjson’ successfully unpacked and MD5 sums checked 
package ‘statmod’ successfully unpacked and MD5 sums checked 
package ‘survival’ successfully unpacked and MD5 sums checked 
package ‘XML’ successfully unpacked and MD5 sums checked 
package ‘httr’ successfully unpacked and MD5 sums checked 
package ‘Matrix’ successfully unpacked and MD5 sums checked 
package ‘h2o’ successfully unpacked and MD5 sums checked 
package ‘miniCRAN’ successfully unpacked and MD5 sums checked 
package ‘igraph’ successfully unpacked and MD5 sums checked
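After a run like the one above, a script author and a collaborator may want to confirm that they ended up with the same package versions. The following is a small sketch of such a comparison in base R; the function name version_mismatches is hypothetical and the version strings are made up for illustration, not taken from any snapshot.

```r
# Hypothetical helper: compare two named character vectors of package
# versions (names are package names) and return the rows that differ.
version_mismatches <- function(author, collaborator) {
  pkgs <- sort(union(names(author), names(collaborator)))
  a <- unname(author[pkgs])        # NA where the author lacks the package
  b <- unname(collaborator[pkgs])  # NA where the collaborator lacks it
  differs <- is.na(a) | is.na(b) | a != b
  data.frame(package = pkgs, author = a, collaborator = b,
             stringsAsFactors = FALSE)[differs, ]
}

# Made-up version numbers, for illustration only.
author       <- c(h2o = "2.8.1.1", igraph = "0.7.1", miniCRAN = "0.1-0")
collaborator <- c(h2o = "2.8.1.1", igraph = "0.7.0")

res <- version_mismatches(author, collaborator)
print(res)

# On a real machine the input could come from the installed library, e.g.
# installed.packages()[, "Version"], which is a character vector named by package.
```

An empty result would indicate that both libraries hold the same versions of every listed package.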
24 
MRAN: The Managed R Archive Network 
• Download RRO
• Learn about R and RRO
• Daily CRAN snapshots
• Explore Packages
  – and dependencies
• Explore Task Views
Thank You 
Joseph Rickert 
Joseph.rickert@revolutionanalytics.com, @revojoe 
blog.revolutionanalytics.com

More Related Content

What's hot

Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with RTechsparks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for HadoopWilly Marroquin (WillyDevNET)
 
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturatorINRIA-OAK
 
Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Revolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 

What's hot (20)

R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
R reproducibility
R reproducibilityR reproducibility
R reproducibility
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Big Data - Analytics with R
Big Data - Analytics with RBig Data - Analytics with R
Big Data - Analytics with R
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturator
 
Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14Moving From SAS to R Webinar Presentation - 07Aug14
Moving From SAS to R Webinar Presentation - 07Aug14
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 

Viewers also liked

I Should Have Used Social Selling | Gil Gunderson's Guide To Social Sales
I Should Have Used Social Selling | Gil Gunderson's Guide To Social SalesI Should Have Used Social Selling | Gil Gunderson's Guide To Social Sales
I Should Have Used Social Selling | Gil Gunderson's Guide To Social SalesGerry Moran
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
50 Best Motivational Quotes to Ignite Your Sales Drive
50 Best Motivational Quotes to Ignite Your Sales Drive50 Best Motivational Quotes to Ignite Your Sales Drive
50 Best Motivational Quotes to Ignite Your Sales DriveHubSpot
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 
The Four Attributes That Drive Sales Growth And Performance
The Four Attributes That Drive Sales Growth And PerformanceThe Four Attributes That Drive Sales Growth And Performance
The Four Attributes That Drive Sales Growth And PerformanceKhufere Qhamata
 
Silent Edge, The Sales Performance Authority, short credentials
Silent Edge, The Sales Performance Authority, short credentialsSilent Edge, The Sales Performance Authority, short credentials
Silent Edge, The Sales Performance Authority, short credentialsRussell Ward
 
How to Develop the Total Person (qualities and attributes of highly effective...
How to Develop the Total Person (qualities and attributes of highly effective...How to Develop the Total Person (qualities and attributes of highly effective...
How to Develop the Total Person (qualities and attributes of highly effective...PowerRound Corporation
 
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'LearyRingLead
 
Target employee incentive scheme
Target employee incentive schemeTarget employee incentive scheme
Target employee incentive schemeMohammad rasoolbaig
 
Sales Manager’s Guidebook Volume 3 - Managing Sales Performance
Sales Manager’s Guidebook Volume 3 - Managing Sales PerformanceSales Manager’s Guidebook Volume 3 - Managing Sales Performance
Sales Manager’s Guidebook Volume 3 - Managing Sales PerformanceSean McPheat
 
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...Daniel Nilsson
 
Good presentations matter
Good presentations matterGood presentations matter
Good presentations matterNed Potter
 

Viewers also liked (20)

I Should Have Used Social Selling | Gil Gunderson's Guide To Social Sales
I Should Have Used Social Selling | Gil Gunderson's Guide To Social SalesI Should Have Used Social Selling | Gil Gunderson's Guide To Social Sales
I Should Have Used Social Selling | Gil Gunderson's Guide To Social Sales
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
50 Best Motivational Quotes to Ignite Your Sales Drive
50 Best Motivational Quotes to Ignite Your Sales Drive50 Best Motivational Quotes to Ignite Your Sales Drive
50 Best Motivational Quotes to Ignite Your Sales Drive
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 
Good sales person
Good sales personGood sales person
Good sales person
 
The Four Attributes That Drive Sales Growth And Performance
The Four Attributes That Drive Sales Growth And PerformanceThe Four Attributes That Drive Sales Growth And Performance
The Four Attributes That Drive Sales Growth And Performance
 
Sales Training
Sales TrainingSales Training
Sales Training
 
Silent Edge, The Sales Performance Authority, short credentials
Silent Edge, The Sales Performance Authority, short credentialsSilent Edge, The Sales Performance Authority, short credentials
Silent Edge, The Sales Performance Authority, short credentials
 
How to Develop the Total Person (qualities and attributes of highly effective...
How to Develop the Total Person (qualities and attributes of highly effective...How to Develop the Total Person (qualities and attributes of highly effective...
How to Develop the Total Person (qualities and attributes of highly effective...
 
Differentiate or Die
Differentiate or DieDifferentiate or Die
Differentiate or Die
 
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary
6 Attributes of a Great Salesperson from Shark Tank's Kevin O'Leary
 
Target employee incentive scheme
Target employee incentive schemeTarget employee incentive scheme
Target employee incentive scheme
 
Sales Manager’s Guidebook Volume 3 - Managing Sales Performance
Sales Manager’s Guidebook Volume 3 - Managing Sales PerformanceSales Manager’s Guidebook Volume 3 - Managing Sales Performance
Sales Manager’s Guidebook Volume 3 - Managing Sales Performance
 
Sales Performance Motivation
Sales Performance MotivationSales Performance Motivation
Sales Performance Motivation
 
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...
4 Amazing Sales Tools I Use Every Day - Be Effective - Tools to Close Deals F...
 
Good presentations matter
Good presentations matterGood presentations matter
Good presentations matter
 
Incentive plan presentation
Incentive plan presentationIncentive plan presentation
Incentive plan presentation
 

Similar to A Step Towards Reproducibility in R

An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...Adrian Olszewski
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2
 
ownR platform technical introduction
ownR platform technical introductionownR platform technical introduction
ownR platform technical introductionFunctional Analytics
 
Reproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROReproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROWork-Bench
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
Venkata Sateesh_BigData_Latest-Resume
Venkata Sateesh_BigData_Latest-ResumeVenkata Sateesh_BigData_Latest-Resume
Venkata Sateesh_BigData_Latest-Resumevenkata sateeshs
 
Data analytics using R programming
Data analytics using R programmingData analytics using R programming
Data analytics using R programmingUmang Singh
 
Containers in Science: neuroimaging use cases
Containers in Science: neuroimaging use casesContainers in Science: neuroimaging use cases
Containers in Science: neuroimaging use casesKrzysztof Gorgolewski
 
LCI report-Demo
LCI report-DemoLCI report-Demo
LCI report-DemoMo Mamouei
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceCarole Goble
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R ProgrammingIRJET Journal
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Alleman coonce-agile-2017 may2
Alleman coonce-agile-2017 may2Alleman coonce-agile-2017 may2
Alleman coonce-agile-2017 may2Glen Alleman
 
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET Journal
 

Similar to A Step Towards Reproducibility in R (20)

An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...The use of R statistical package in controlled infrastructure. The case of Cl...
The use of R statistical package in controlled infrastructure. The case of Cl...
 
WSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product OverviewWSO2 Machine Learner - Product Overview
WSO2 Machine Learner - Product Overview
 
ownR platform technical introduction
ownR platform technical introductionownR platform technical introduction
ownR platform technical introduction
 
Reproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RROReproducibility with Checkpoint & RRO
Reproducibility with Checkpoint & RRO
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Venkata Sateesh_BigData_Latest-Resume
Venkata Sateesh_BigData_Latest-ResumeVenkata Sateesh_BigData_Latest-Resume
Venkata Sateesh_BigData_Latest-Resume
 
ownR presentation eRum 2016
ownR presentation eRum 2016ownR presentation eRum 2016
ownR presentation eRum 2016
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
ODSC and iRODS
ODSC and iRODSODSC and iRODS
ODSC and iRODS
 
Data analytics using R programming
Data analytics using R programmingData analytics using R programming
Data analytics using R programming
 
Containers in Science: neuroimaging use cases
Containers in Science: neuroimaging use casesContainers in Science: neuroimaging use cases
Containers in Science: neuroimaging use cases
 
LCI report-Demo
LCI report-DemoLCI report-Demo
LCI report-Demo
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better Science
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Alleman coonce-agile-2017 may2
Alleman coonce-agile-2017 may2Alleman coonce-agile-2017 may2
Alleman coonce-agile-2017 may2
 
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...IRJET-  	  Comparative Analysis of Various Tools for Data Mining and Big Data...
IRJET- Comparative Analysis of Various Tools for Data Mining and Big Data...
 

More from Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solutionRevolution Analytics
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageRevolution Analytics
 
Reproducibility with Revolution R Open
Reproducibility with Revolution R OpenReproducibility with Revolution R Open
Reproducibility with Revolution R OpenRevolution Analytics
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm

A Step Towards Reproducibility in R

  • 1. A Step Towards Reproducibility in R H2O World November 18 - 19, 2014
  • 2. R’s popularity is growing rapidly • IEEE Spectrum Top Programming Languages, #15: R (IEEE Spectrum, July 2014) • RedMonk Programming Language Rankings, 2013
  • 3. R is used more than other data science tools • O’Reilly Strata 2013 Data Science Salary Survey • KDNuggets Poll: Top Languages for analytics, data mining, data science
  • 4. R is among the highest-paid IT skills in the US • Dice Tech Salary Survey, January 2014 • O’Reilly Strata 2013 Data Science Salary Survey
  • 6. Google “The great beauty of R is that you can modify it to do all sorts of things.” (Hal Varian, Chief Economist, Google) “R is really important to the point that it's hard to overvalue it.” (Daryl Pregibon, Head of Statistics, Google) • Advertising Effectiveness • Economic forecasting
  • 7. Facebook • Exploratory Data Analysis • Experimental Analysis “Generally, we use R to move fast when we get a new data set. With R, we don’t need to develop custom tools or write a bunch of code. Instead, we can just go about cleaning and exploring the data.” — Solomon Messing, data scientist at Facebook
  • 8. Twitter “A common pattern for me is that I'll code a MapReduce job in Scala, do some simple command-line munging on the results, pass the data into Python or R for further analysis, pull from a database to grab some extra fields, and so on, often integrating what I find into some machine learning models in the end” (Ed Chen, Data Scientist, Twitter) • Data Visualization • Semantic clustering
  • 9. Insurance • Risk Analysis • Marketing Analytics • Catastrophe Modeling
  • 10. Finance and Banking • Credit Risk Analysis • Financial Networks
  • 11. John Deere Statistical Analysis: • Short Term Demand Forecasting • Crop Forecasting • Long Term Demand Forecasting • Maintenance and Reliability • Production Scheduling • Data Coordination
  • 12. Monsanto Statistical Analysis: • Plant Breeding • Fertility mapping • Precision Seeding • Disease Management • Yield forecasting
  • 13. Public Affairs • Casualty estimation in Warzones • Political Analysis
  • 14. Pharmaceuticals “R use at the FDA is completely acceptable and has not caused any problems.” (Dr Jae Brodsky, Office of Biostatistics, Food and Drug Administration) Regulatory Drug Approvals: • Reproducible research • Accurate, reliable and consistent statistical analysis • Internal reporting (Section 508 compliance)
  • 15. Weather and Climate • Climate change forecasts • Flood Warnings
  • 16. Revolution Analytics • Open Source development – Revolution R Open, RHadoop, ParallelR, DeployR Open, Reproducible R Toolkit – Project funding • Community Support – User Group Sponsorship – Meetups – Events sponsorship – Revolutions Blog
  • 17. “Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method …” (Wikipedia) “Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.” (Roger Peng)
  • 18. Reproducibility – why do we care? Academic / Research: • Verify results • Advance Research Business: • Production code • Reliability • Reusability • Collaboration • Regulation Sources: www.nytimes.com/2011/07/08/health/research/08genes.html, http://arxiv.org/pdf/1010.1092.pdf
  • 19. An R Reproducibility Problem (adapted from http://xkcd.com/234/, CC BY-NC 2.5)
  • 20. Revolution Analytics’ Reproducibility Environment • A distribution of R (RRO) that points to a static CRAN mirror • The checkpoint server: the static CRAN mirror – CRAN packages fixed with each Revolution R Open update (currently 10/1/14) • Daily CRAN snapshots – Storing every package version since September 2014 – Binaries and sources – At mran.revolutionanalytics.com/snapshot • CRAN package checkpoint [Diagram labels: CRAN; CRAN mirror http://cran.revolutionanalytics.com/; checkpoint server, Midnight UTC; daily snapshots http://mran.revolutionanalytics.com/snapshot/; checkpoint package: library(checkpoint), checkpoint("2014-09-17")]
  • 21. Using Revolution Analytics’ Reproducibility Tools • Scenario 1: Set up a consistent, company-wide R environment – Have users download RRO – All users will get the base and recommended packages as of 10/1/14 – For each project, R users run checkpoint to download a consistent set of packages appropriate for that project • Scenario 2: With or without RRO, share scripts synced to a snapshot – Have the user with whom you are sharing put your scripts in a separate project and download the checkpoint package – Have the user run checkpoint("yyyy-mm-dd") with a date appropriate for your project – checkpoint will automatically download the correct versions of the packages used in the scripts
  • 22. Using checkpoint • Easy to use: add 2 lines to the top of each script: library(checkpoint); checkpoint("2014-09-17") • For the script author: – Uses package versions available on the chosen date – Installs packages local to this project (allows different package versions to be used simultaneously) • For a script collaborator: – Automatically installs required packages (detects required packages, no need to install manually!) – Uses same package versions as script author to ensure reproducibility
  • 23. # Create a local checkpoint library
      library(checkpoint)
      checkpoint("2014-11-14")
      > library(checkpoint)
      checkpoint: Part of the Reproducible R Toolkit from Revolution Analytics
      http://projects.revolutionanalytics.com/rrt/
      Warning message:
      package ‘checkpoint’ was built under R version 3.1.2
      > checkpoint("2014-11-14")
      Scanning for loaded pkgs
      Scanning for packages used in this project
      Installing packages used in this project
      Warning: dependencies ‘stats’, ‘tools’, ‘utils’, ‘methods’, ‘graphics’, ‘splines’, ‘grid’, ‘grDevices’ are not available
      also installing the dependencies ‘bitops’, ‘stringr’, ‘digest’, ‘jsonlite’, ‘lattice’, ‘RCurl’, ‘rjson’, ‘statmod’, ‘survival’, ‘XML’, ‘httr’, ‘Matrix’
      package ‘bitops’ successfully unpacked and MD5 sums checked
      package ‘stringr’ successfully unpacked and MD5 sums checked
      package ‘digest’ successfully unpacked and MD5 sums checked
      package ‘jsonlite’ successfully unpacked and MD5 sums checked
      package ‘lattice’ successfully unpacked and MD5 sums checked
      package ‘RCurl’ successfully unpacked and MD5 sums checked
      package ‘rjson’ successfully unpacked and MD5 sums checked
      package ‘statmod’ successfully unpacked and MD5 sums checked
      package ‘survival’ successfully unpacked and MD5 sums checked
      package ‘XML’ successfully unpacked and MD5 sums checked
      package ‘httr’ successfully unpacked and MD5 sums checked
      package ‘Matrix’ successfully unpacked and MD5 sums checked
      package ‘h2o’ successfully unpacked and MD5 sums checked
      package ‘miniCRAN’ successfully unpacked and MD5 sums checked
      package ‘igraph’ successfully unpacked and MD5 sums checked
  • 24. MRAN: The Managed R Archive Network • Download RRO • Learn about R and RRO • Daily CRAN snapshots • Explore Packages – and dependencies • Explore Task Views
  • 25. Thank You Joseph Rickert Joseph.rickert@revolutionanalytics.com, @revojoe blog.revolutionanalytics.com
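
As a rough illustration of the idea behind slides 20 to 22, the sketch below pins an R session to a dated CRAN snapshot using only base R. The helper names `snapshot_url()` and `use_snapshot()` are hypothetical, and the `<base>/<yyyy-mm-dd>` URL layout is an assumption built on the snapshot address shown in the slides. This is not how the checkpoint package is implemented; per the slides, checkpoint also scans the project for required packages and installs them into a project-local library. The sketch only sets the repository option and does not contact any server.

```r
# Hypothetical helper: build the URL of a dated CRAN snapshot.
# The base address is taken from the slides; appending the date as
# "<base>/<yyyy-mm-dd>" is an assumption made for this sketch.
snapshot_url <- function(date,
                         base = "http://mran.revolutionanalytics.com/snapshot") {
  parsed <- as.Date(date, format = "%Y-%m-%d")
  if (is.na(parsed)) {
    stop("date must be given as yyyy-mm-dd")
  }
  paste0(base, "/", format(parsed, "%Y-%m-%d"))
}

# Hypothetical helper: make the dated snapshot the session's CRAN repository.
# Returns the previous option values invisibly so they can be restored.
use_snapshot <- function(date) {
  old <- options(repos = c(CRAN = snapshot_url(date)))
  invisible(old)
}

old <- use_snapshot("2014-09-17")  # later install.packages() calls would use this repo
print(getOption("repos"))
options(old)                       # restore the previous repository setting
```

Keeping the date in one place at the top of a script mirrors the two-line `library(checkpoint)` / `checkpoint("2014-09-17")` pattern shown in the slides: a collaborator who runs the script resolves packages against the same dated snapshot as the author.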

Editor's Notes

  1. http://blog.revolutionanalytics.com/2014/02/r-salary-surveys.html http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html http://blog.revolutionanalytics.com/2014/02/r-is-15th-of-top-programming-languages-in-latest-redmonk-ranking.html http://blog.revolutionanalytics.com/2013/09/top-languages-for-data-science.html
  2. Dice Tech Salary Survey, January 2014 O’Reilly Strata 2013 Data Science Salary Survey
  3. A
  4. http://blog.revolutionanalytics.com/2013/05/the-arteries-of-the-world-in-tweets.html http://blog.revolutionanalytics.com/2012/03/r-twitter-and-mcdonalds.html
  5. Deloitte: http://www.revolutionanalytics.com/free-webinars/actuarial-analytics-r
  6. Credit Suisse: http://blog.revolutionanalytics.com/2013/05/sheftel-on-r-on-the-trading-desk.html
  7. http://www.revolutionanalytics.com/free-webinars/order-fulfillment-forecasting-john-deere-how-r-facilitates-creativity-and-flexibility http://blog.revolutionanalytics.com/2012/11/video-how-john-deere-uses-r.html
  8. http://blog.revolutionanalytics.com/2013/11/strata-hadoop-world-2013-recap.html