Turbocharge your data science
with Python or R
Kelli-Jean Chun
North Bay Python
Nov 4, 2018
Turbocharge your data science
with Python AND R
What the heck is a data scientist?
It depends on the company. Here are a few example roles:
- Data science analysts: aka data analysts or business analysts
- Product data scientists: Partner with product managers &
engineers to focus on product initiatives
- Experimentation data scientists
- Growth/marketing data scientists
Leverage data to gain insight and solve problems
What is R?
| | Python | R |
| --- | --- | --- |
| Indexing starts at | 0 | 1 :) |
| Loops | for i in range(3): print(i) | for (i in 0:2) { print(i) } |
| List/Vector | [0, 1, 2, 3] | c(0, 1, 2, 3) |
| Data Frames | import pandas as pd; pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) | data.frame('A' = c(1,2), 'B' = c(3,4)) |
| When typical people say this, they usually refer to | Type of snake | Letter in the alphabet |
“R is a language and environment for statistical computing and graphics.”
Source: https://www.r-project.org/
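The indexing and loop rows of the comparison are easy to trip over when moving between the two languages, so here is a minimal Python sketch of the difference. The helpers `r_colon` and `r_index` are hypothetical names introduced only for illustration; they are not part of Python or R.

```python
# Python sequences are zero-based, and range() excludes its stop value.
values = ["a", "b", "c"]
first_python = values[0]  # R would write values[1] for the same element


def r_colon(start, stop):
    """Mimic R's start:stop for ascending integers (both endpoints included)."""
    return list(range(start, stop + 1))


def r_index(seq, i):
    """Fetch element i using R-style one-based indexing (hypothetical helper)."""
    if i < 1:
        raise IndexError("R-style indices start at 1")
    return seq[i - 1]


loop_python = list(range(3))  # values visited by Python's `for i in range(3)`
loop_r = r_colon(0, 2)        # values visited by R's `for (i in 0:2)`
```

Both loops visit 0, 1, 2, but for different reasons: `range(3)` stops before 3, while R's `0:2` includes 2.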
The Great Debate: Python or R
A brief comparison of some Python & R packages used in Data Science
| Use Case | Python | R |
| --- | --- | --- |
| Data frame + manipulation | Pandas + NumPy | Base R + dplyr |
| Plotting | matplotlib, seaborn, bokeh | Base R, ggplot2, highcharter |
| Statistics | statsmodels | Base R |
| ML | scikit-learn | caret + glm + xgboost + ... |
| Deep Learning | TensorFlow | TensorFlow |
| Connecting to the other language | rpy2, pyRserve, RPython | reticulate, PythonInR, rPython, rJython, SnakeCharmR |
So, Python or R?
As a data scientist, I’ll have both!
Predicting whether or not a NYC dog is spayed/neutered
There is a publicly available NYC dataset that has
information on licensed dogs, such as the:
- Dog name
- Gender
- Breed
- Birth month & year
- Coloring
- Borough (e.g. Manhattan, Bronx)
- Zip Code
- Whether or not the dog is a guard dog or trained
- Whether or not spayed/neutered
Using this dataset, let’s build a model to predict
whether or not a NYC dog is spayed/neutered.
https://project.wnyc.org/dogs-of-nyc/
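To make the shape of the problem concrete, here is a minimal sketch of a few records as a pandas DataFrame. The column names and every value are made up for illustration; the actual dataset at the URL above may use different names and encodings.

```python
import pandas as pd

# Made-up rows: column names and values are illustrative assumptions,
# not the real schema or contents of the NYC dataset.
dogs = pd.DataFrame({
    "dog_name": ["Bella", "Max", "Rocky", "Luna"],
    "gender": ["F", "M", "M", "F"],
    "breed": ["Beagle", "Poodle", "Boxer", "Mixed"],
    "birth_year": [2010, 2008, 2011, 2009],
    "borough": ["Manhattan", "Bronx", "Brooklyn", "Queens"],
    "guard_or_trained": [False, False, True, False],
    "spayed_or_neutered": [True, True, False, True],
})

# Separate the prediction target from the candidate features.
target = dogs["spayed_or_neutered"]
features = dogs.drop(columns="spayed_or_neutered")

# Share of dogs in this toy sample that are spayed/neutered.
share_fixed = target.mean()
```

The split into `features` and `target` mirrors the modeling goal: everything except the spayed/neutered flag is a potential input.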
What is my typical data scientific method when
building a model?
- ETLs
- Pre-learning: Explore the data, feature engineering, visualizations
- Learning: Model the data
- Post-learning:
  - Evaluate the model
  - Document and present the final model in a consumable format for product, engineering, and other data scientists
- Deployment: Data science as a service / microservice to call the model in production
Plan of action
Goal: Using the other features (dog name, gender, etc.), predict whether or not a dog is spayed/neutered
1. Pre-learning: Process the data and explore in R
2. Learning: Develop a predictive model in Python
3. Post-learning: Evaluate the model in Python
Pre-Learning
Exploratory data analysis can be quickly done in R and a summary of the
exploration can be easily shared with RMarkdown.
Similar to Jupyter notebooks, RMarkdown lets you:
- Produce reproducible analyses
- Quickly provide a report & visuals for others
- Organize code into chunks
- Embed code in the report
As a bonus, R provides fast and easy functions (once you understand some of the
strange syntax) to produce clean visuals.
RMarkdown → HTML (or PDF)
Learning & Post-Learning
Python + Sklearn + Pandas + Numpy = 100%
- Sklearn (aka scikit-learn): provides a wide variety of machine learning and statistical models, plus helpers for splitting data into training and testing sets and for evaluating models.
- Pandas: provides the DataFrame type that makes working with data easier.
- NumPy: provides broadcasting functions that make it easier to work with arrays (specifically columns in a pandas DataFrame).
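As a rough sketch of how the three libraries fit together, the example below builds a synthetic stand-in for the dog data, derives a feature with NumPy broadcasting, and trains and evaluates a model with scikit-learn. The column names, the random data, the label rule, and the choice of logistic regression are all assumptions made for illustration; the talk does not say which model was used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: names, values, and the label rule are assumptions.
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "birth_year": rng.randint(2000, 2018, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "borough": rng.choice(["Manhattan", "Bronx", "Brooklyn"], size=n),
})

# NumPy broadcasting: the scalar 2018 is applied to every element of the column.
df["age"] = 2018 - df["birth_year"].to_numpy()

# Made-up label: older dogs are more likely to be marked as spayed/neutered.
df["fixed"] = (df["age"] + rng.normal(0, 3, size=n) > 8).astype(int)

# pandas: one-hot encode the categorical columns as floats.
X = pd.get_dummies(
    df[["age", "gender", "borough"]], columns=["gender", "borough"], dtype=float
)
y = df["fixed"]

# scikit-learn: hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learning: fit the model on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Post-learning: accuracy on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Holding out a test set before fitting keeps the evaluation honest: the accuracy is measured on rows the model has never seen.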
How do we connect the two languages?
R in Python with rpy2
Loading the data frame of NYC dogs that was processed in R into Python
can be done with rpy2
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
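# Read in data from R
# Convert between R and pandas objects automatically
pandas2ri.activate()
# readRDS is the R function that reads R's RDS files
readRDS = robjects.r['readRDS']
df = readRDS('data/dogs_proc.RDS')
# ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py
df = pandas2ri.ri2py(df)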
Python in RMarkdown with reticulate
```{r}
library("reticulate")
```
```{python}
print('Python in R')
for i in range(3):
    print(i)
# execute Jupyter notebooks
import papermill as pm
pm.execute_notebook("example_notebook.ipynb",
"executed_notebook/example_notebook.ipynb")
```
Instead of specifying R code (e.g. with {r}), specify python with {python}
Thanks!
