What We Learned Building an
R-Python Hybrid Analytics Pipeline
Niels Bantilan, Pegged Software
NY R Conference April 8th 2016

Delivered by Niels Bantilan (Data Scientist, Pegged Software) at the 2016 New York R Conference on April 8th and 9th at Work-Bench.

Published in: Data & Analytics

  1. What We Learned Building an R-Python Hybrid Analytics Pipeline. Niels Bantilan, Pegged Software. NY R Conference, April 8th 2016.
  2. Pegged Software’s Mission: Help healthcare organizations recruit better.
  3. Core Activities
     ● Build, evaluate, refine, and deploy predictive models
     ● Work with Engineering to ingest, validate, and store data
     ● Work with Product Management to develop data-driven feature sets
  4. How might we build a predictive analytics pipeline that is reproducible, maintainable, and statistically rigorous?
  5. Problem-solving Heuristic: Anchor Yourself to Problem Statements / Use Cases
     1. Define the problem statement
     2. Scope out the solution space and trade-offs
     3. Make a decision, justify it, document it
     4. Implement the chosen solution
     5. Evaluate the working solution against the problem statement
     6. Rinse and repeat
  6. R-Python Pipeline: Read Data → Preprocess → Build Model → Evaluate → Deploy
  7. Data Science Stack

     Need                       R         Python
     Maintainable codebase      Git       Git
     Sync package dependencies  Packrat   Pip, Pyenv
     Call R from Python         -         subprocess
     Automated Testing          Testthat  Nose
     Reproducible pipeline      Makefile  Makefile
  8. Data Science Stack (same table as slide 7)
  9. Git: Why? Because Version Control
     ● Code quality
     ● Incremental Knowledge Transfer
     ● Sanity check
  10. Data Science Stack (same table as slide 7)
  11. Dependency Management: Why Pip + Pyenv?
      1. Easily sync Python package dependencies
      2. Easily manage multiple Python versions
      3. Create and manage virtual environments
  12. Dependency Management: Why Packrat? (From RStudio)
      1. Isolated: separate system environment and repo environment
      2. Portable: easily sync dependencies across the data science team
      3. Reproducible: easily add/remove/upgrade/downgrade packages as needed
  13. Packrat Internals: Understanding packrat

      datascience_repo
      ├─ project_folder_a
      ├─ project_folder_b
      ├─ datascience_repo.Rproj
      ...
      ├─ .Rprofile          # points R to packrat
      └─ packrat
         ├─ init.R          # initialize script
         ├─ packrat.lock    # package deps
         ├─ packrat.opts    # options config
         ├─ lib             # repo private library
         └─ src             # repo source files
  14. Packrat Internals: packrat.lock (package versions and deps), located at datascience_repo/packrat/packrat.lock

      PackratFormat: 1.4
      PackratVersion: 0.4.6.1
      RVersion: 3.2.3
      Repos: CRAN=https://cran.rstudio.com/
      ...
      Package: ggplot2
      Source: CRAN
      Version: 2.0.0
      Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
      Requires: MASS, digest, gtable, plyr, reshape2, scales
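The lockfile on slide 14 reads as "Field: value" records. As a minimal sketch, assuming records are separated by blank lines and that indented lines continue the previous field (the Debian-control-file convention; the slide elides the separator with "..."), the pinned versions could be listed from Python like this. The function names and the `SAMPLE` text are illustrative and not part of packrat.

```python
def parse_lockfile(text):
    """Split 'Field: value' text into a list of {field: value} records."""
    records = []
    current = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record (assumed separator).
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key is not None:
            # An indented line continues the previous field's value.
            current[last_key] += " " + line.strip()
        else:
            # Split on the first colon only, so URLs in values stay intact.
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


def package_versions(records):
    """Map package name to pinned version for records that describe a package."""
    return {
        r["Package"]: r["Version"]
        for r in records
        if "Package" in r and "Version" in r
    }


# Sample text copied from the slide, with a blank line where the slide shows "...".
SAMPLE = """\
PackratFormat: 1.4
PackratVersion: 0.4.6.1
RVersion: 3.2.3
Repos: CRAN=https://cran.rstudio.com/

Package: ggplot2
Source: CRAN
Version: 2.0.0
Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
Requires: MASS, digest, gtable, plyr, reshape2, scales
"""

if __name__ == "__main__":
    print(package_versions(parse_lockfile(SAMPLE)))
```

A check like this can help a Python-side pipeline step confirm which R package versions a repo pins before it shells out to R.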
  15. Packrat Internals: packrat.opts (project-specific configuration), located at datascience_repo/packrat/packrat.opts

      auto.snapshot: TRUE
      use.cache: FALSE
      print.banner.on.startup: auto
      vcs.ignore.lib: TRUE
      vcs.ignore.src: TRUE
      load.external.packages.on.startup: TRUE
      quiet.package.installation: TRUE
      snapshot.recommended.packages: FALSE
  16. Packrat Workflow
      ● Initialize packrat with packrat::init()
      ● Toggle packrat in an R session with packrat::on() / off()
      ● Save the current state of the project with packrat::snapshot()
      ● Reconstitute your project with packrat::restore()
      ● Remove unused libraries with packrat::clean()
  17. Packrat Issues
      Problem: Unable to find source packages when restoring. This happens when there is a new version of a package on an R package repository like CRAN.

          > packrat::restore()
          Installing knitr (1.11) ...
          FAILED
          Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePkgs, :
            Couldn't find source for version 1.11 of knitr (1.10.5 is current)
  18. Packrat Issues
      Solution 1: Use R’s Installation Procedure

          > install.packages(<package_name>)
          > packrat::snapshot()

      Solution 2: Manually Download Source File

          $ wget -P repo/packrat/src <package_source_url>
          > packrat::restore()
  19. Data Science Stack (same table as slide 7)
  20. Call R from Python: Data Pipeline. Read Data → Preprocess → Build Model → Evaluate → Deploy
  21. Call R from Python: Example

          # model_builder.R
          # Read positional arguments passed on the command line.
          cmdargs <- commandArgs(trailingOnly = TRUE)
          data_filepath <- cmdargs[1]
          model_type <- cmdargs[2]
          formula <- cmdargs[3]

          # read.data() and train.model() are project helpers not shown on the slide.
          build.model <- function(data_filepath, model_type, formula) {
            df <- read.data(data_filepath)
            model <- train.model(df, model_type, formula)
            model
          }

          # model_pipeline.py
          import subprocess

          # Each list element becomes one command-line argument for the R script.
          subprocess.call(['path/to/R/executable', 'path/to/model_builder.R',
                           data_filepath, model_type, formula])
  22. Call R from Python: Why subprocess?
      1. Python handles control flow, data manipulation, and IO
      2. R handles model building and evaluation computations
      3. A main R script (model_builder.R) serves as the entry point into the R layer
      4. No need for tight Python-R integration
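The division of labour on slides 21 and 22 can be sketched a little more defensively than a bare `subprocess.call`. The version below is one possible arrangement, not the speaker's code: it assumes the R executable is something like `Rscript`, and the helper names `build_r_command` and `run_r_script` are hypothetical. Using `check=True` makes a failing R script raise an exception in Python instead of passing silently.

```python
import subprocess


def build_r_command(r_executable, script_path, data_filepath, model_type, formula):
    """Assemble the argument list; the last three values are what the R
    script reads back with commandArgs(trailingOnly = TRUE)."""
    return [r_executable, script_path, data_filepath, model_type, formula]


def run_r_script(command):
    """Run the command, capture its output, and raise
    subprocess.CalledProcessError if the exit code is non-zero."""
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical paths and arguments, for illustration only.
    cmd = build_r_command("Rscript", "model_builder.R", "data.csv", "glm", "y ~ x1 + x2")
    print(cmd)
```

Passing the command as a list (rather than one shell string) means a formula such as `y ~ x1 + x2` reaches R as a single argument without any shell quoting.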
  23. Data Science Stack (same table as slide 7)
  24. Automated Testing: Tolerance to Change. Are we confident that a modification to the codebase will not silently introduce new bugs?
  25. Automated Testing: Working Effectively with Legacy Code (Michael Feathers)
      1. Identify change points
      2. Break dependencies
      3. Write tests
      4. Make changes
      5. Refactor
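To make the "write tests, then make changes" step concrete, here is a small sketch of a Python-side test. The function `standardize_column_names` is a hypothetical preprocessing helper invented for illustration; it does not come from the talk. The test uses the standard library's `unittest`, whose `TestCase` classes Nose can also collect.

```python
import unittest


def standardize_column_names(names):
    """Hypothetical preprocessing step: trim, lower-case, and snake_case names."""
    return [name.strip().lower().replace(" ", "_") for name in names]


class StandardizeColumnNamesTest(unittest.TestCase):
    def test_trims_and_lowercases(self):
        self.assertEqual(
            standardize_column_names(["  Age ", "Job Title"]),
            ["age", "job_title"],
        )

    def test_is_idempotent(self):
        # Running the step twice should not change the result again.
        once = standardize_column_names(["Job Title"])
        self.assertEqual(standardize_column_names(once), once)
```

Saved in a test module, this could be run with `python -m unittest`. Once such tests pin down current behaviour, a refactor that silently changes the output fails loudly.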
  26. Data Science Stack (same table as slide 7)
  27. Build Management: Make
      Make is a language-agnostic utility for *nix
      ● Enables reproducible workflow
      ● Serves as lightweight documentation for the repo

          # makefile
          build-model:
              python model_pipeline.py -i 'model_input' -m_type 'glm' -formula 'y ~ x1 + x2'

          # command-line
          $ make build-model

      VS

          $ python model_pipeline.py -i input_fp -m_type 'glm' -formula 'y ~ x1 + x2'
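The Makefile target above passes `-i`, `-m_type`, and `-formula` to `model_pipeline.py`. The slides do not show how the script reads them, so the following is only a sketch of one way to do it with the standard library's `argparse`. The destination names (`input_path`, `model_type`) and the choice to make every flag required are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the Makefile example (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Build a model via the R layer.")
    parser.add_argument("-i", dest="input_path", required=True,
                        help="path to the input data")
    parser.add_argument("-m_type", dest="model_type", required=True,
                        help="model type forwarded to the R script")
    parser.add_argument("-formula", required=True,
                        help="R model formula, e.g. 'y ~ x1 + x2'")
    return parser.parse_args(argv)


# Example with the argument values from the Makefile target.
args = parse_args(["-i", "model_input", "-m_type", "glm", "-formula", "y ~ x1 + x2"])
print(args.input_path, args.model_type, args.formula)
```

With the flags captured in a Makefile target, teammates only need to remember `make build-model`, and the full invocation stays documented in version control.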
  28. Big Wins: By adopting the above practices, we:
      1. Can maintain the codebase more easily
      2. Reduce cognitive load and context switching
      3. Improve code quality and correctness
      4. Facilitate knowledge transfer among team members
      5. Encourage reproducible workflows
  29. Costs: Necessary Time Investment
      1. The learning curve
      2. Breaking old habits
      3. Creating fixes for issues that come with the chosen solutions
  30. How might we build a predictive analytics pipeline that is reproducible, maintainable, and statistically rigorous?
  31. Questions? niels@peggedsoftware.com, @cosmicbboy
