Taking your machine learning workflow to the next level using Scikit-Learn Pipelines
Philip Goddard
github.com/philipmgoddard/pipelines
• Many ML problems can be solved by
assembling modular components:
‒ Raw data in
‒ Predictions out.
• Pipelines encapsulate data preparation
as well as prediction.
Pipelines?
• Maintaining a clean workflow can be challenging when building ML models:
‒ Different models require different considerations, leading to multiple versions of
training data.
‒ When experimenting, the state of a Jupyter Notebook isn’t always obvious.
‒ Tuning model hyperparameters is (relatively) easy, tuning data transformations can
be a little more tedious.
Why should I care?
• Modular and natural way to build supervised ML models.
• Flexibility to experiment with data transformations as well as model tuning.
• Clean, DRY, reusable code.
How can Pipelines help?
• Scikit-Learn works around the concept of transformers and estimators.
• Typically, data is run through some transformations before reaching an estimator.
• Pipelines are implemented using the Pipeline class, chaining the components together.
• Different transformation stages can be built and combined using FeatureUnion.
• When the final step of the Pipeline is an estimator, the GridSearchCV class allows
hyperparameter tuning.
‒ Tune (or fix) the parameters for the transformers as well as the estimator.
Pipelines in Scikit-Learn
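The bullets above can be sketched with built-in Scikit-Learn components only. This is a minimal illustration on synthetic data, not the case-study pipeline: the step names ("scale", "features", "pca", "kbest", "clf") and the parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# two transformation stages run side by side; their outputs are concatenated
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=3)),
])

# transformers first, estimator as the final step
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step>__<param>" addresses parameters of nested components,
# so transformer settings are tuned together with the estimator
param_grid = {
    "features__pca__n_components": [2, 3],
    "features__kbest__k": [2, 4],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```

The double-underscore naming is what lets one grid cover both data transformations and model hyperparameters.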
Case Study
• Dataset for predicting whether subscribers to a mobile telephone plan will churn.
‒ Provided with a train (3333 observations) and test (1667 observations) set.
• Features include:
‒ Continuous measurements (e.g. total charges).
‒ Count measurements (e.g. number of customer service calls).
‒ Categorical (e.g. customer area code).
• Outcome is binary: ~85% did not churn and ~15% did churn.
Case Study: Customer Churn
Understanding data informs Pipeline design
• The data set has a tractable number of features (19).
• We can easily visualise to get a feel for any considerations required for our Pipeline:
‒ Correlated features
‒ Low variance features
‒ Non-linearities
‒ Poorly behaved distributions
‒ Any other nuances
Numerical features (continuous)
Watch out for correlations!
• Pairwise plots are a good way to get a
feeling for highly correlated features.
• Some classes of model encounter
numerical instabilities when fitting if this
isn’t resolved.
• The pipeline should provide a way to
identify and remove such features.
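One way such a filter could look is sketched below. This is a hypothetical, simplified transformer (the class name `HighCorrelationFilter` and the 0.9 threshold are assumptions), not the FindCorrelation class used later in the deck. It greedily keeps a column only if its absolute Pearson correlation with every column already kept is at or below the threshold, and it assumes zero-variance columns were removed beforehand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class HighCorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: drop columns highly correlated with a kept column."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # absolute pairwise Pearson correlations between columns
        corr = np.abs(np.atleast_2d(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(corr.shape[0]):
            # keep column j only if it is not too correlated with any kept column
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ix_ = np.array(keep, dtype=int)
        return self

    def transform(self, X):
        # apply the column selection learned during fit
        return np.asarray(X)[:, self.keep_ix_]


# demo: the second column is almost a multiple of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_demo = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=200), b])
filt = HighCorrelationFilter(threshold=0.9).fit(X_demo)
print(filt.keep_ix_)
```

Because the columns to keep are learned in `fit`, the same selection is applied to validation folds and test data.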
Numerical features (counts)
Categorical features
Building the Pipeline
Pipeline Schematic
Numerical feature Pipeline + Categorical feature Pipeline → FeatureUnion → Filter high correlation → Estimator
Pipeline components
Numerical features: Select features → Filter low variance features → Filter high correlations → Center and scale → Calculate interactions
Categorical features: Select features → Encode features → Drop baseline category → Filter low variance features → Filter high correlations
Translate to code using Scikit-Learn API
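# Pipeline and PolynomialFeatures come from Scikit-Learn; the other
# transformers are custom classes, not part of Scikit-Learn.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(float_col_names + int_col_names)),
    ('zero_var', ZeroVariance()),
    ('correlation', FindCorrelation()),
    ('opt_scaler', OptionalStandardScaler()),
    ('poly_features', PolynomialFeatures())
])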
Translate to code using Scikit-Learn API
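# OneHotEncoder comes from Scikit-Learn; ManualDropper is a custom class.
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`.
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(fac_col_names)),
    ('onehot_encoder', OneHotEncoder(sparse=False)),
    ('manual_dropper', ManualDropper(drop_ix=drop_col_ix)),
    ('zero_var', ZeroVariance()),
    ('correlation', FindCorrelation())
])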
# bring it all together to produce the 'base' pipeline
from sklearn.pipeline import FeatureUnion

base_pipe = Pipeline([
    ('union', FeatureUnion(
        # parallel parts of pipeline
        transformer_list=[
            ('num_pipeline', num_pipeline),
            ('cat_pipeline', cat_pipeline)
        ])
    ),
    # final correlation check
    ('correlation', FindCorrelation()),
])
The ‘base’ Pipeline in code
• Sometimes custom transformers are needed to suit the specific problem.
• And sometimes more flexibility is required:
‒ OptionalStandardScaler is a wrapper class around the StandardScaler class in
the preprocessing module.
• Custom transformer classes must define a fit and a transform method.
• See github repository for examples.
Custom transformers
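As an illustration of the fit/transform contract, here is a minimal sketch of an optional scaler. It is a hypothetical stand-in (named `OptionalScalerSketch` to avoid confusion), not the OptionalStandardScaler from the repository; the only detail taken from the slides is a boolean `scale` parameter that a grid search can switch on or off.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class OptionalScalerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: standardise columns only when scale=True."""

    def __init__(self, scale=True):
        # store constructor arguments unchanged so get_params/set_params work
        self.scale = scale

    def fit(self, X, y=None):
        if self.scale:
            # learn column means and standard deviations from the training data
            self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        if self.scale:
            return self.scaler_.transform(X)
        # pass the data through untouched when scaling is switched off
        return np.asarray(X)


X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = OptionalScalerSketch(scale=True).fit_transform(X_demo)
untouched = OptionalScalerSketch(scale=False).fit_transform(X_demo)
```

Inheriting from BaseEstimator provides `get_params` and `set_params`, which is what allows GridSearchCV to toggle `scale`. TransformerMixin supplies `fit_transform`.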
Estimators
• Built a base Pipeline object of transformers.
• Use this as a template, and:
‒ Make a copy,
‒ Append an estimator to the end,
‒ Train using GridSearchCV.
• Possible to throw a handful of estimators into a single Pipeline
‒ However, makes it harder to dissect models.
Estimators
• Logistic regression.
• Trial L1 and L2 penalties, and a range of C (inverse of regularisation strength: smaller C means a stronger penalty).
• As the model is linear, trial non-linear interaction terms.
• Worried about multicollinearity, so explicitly drop baseline categories.
• Center and scale the data.
Our first estimator
import copy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# copy of the base pipeline, append estimator
# (solver='liblinear' supports both L1 and L2 penalties; the default
# solver in recent Scikit-Learn versions does not support L1)
lr_est = copy.deepcopy(base_pipe)
lr_est.steps.append(('logistic_regression',
                     LogisticRegression(solver='liblinear', random_state=1234)))
# parameters for grid search
lr_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[True],
    union__num_pipeline__poly_features__degree=[1, 2],
    union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
    logistic_regression__penalty=['l1', 'l2'],
    logistic_regression__C=[0.001, 0.01, 0.1, 1.0, 10.0])
# initialize GridSearchCV object with cross validation
grid_search_lr = GridSearchCV(estimator=lr_est,
                              param_grid=lr_param_grid,
                              scoring='roc_auc',
                              cv=5,
                              refit=True)
# fit to training data
grid_search_lr.fit(features_train, outcome_train)
CV results: Logistic Regression
• Selected:
‒ L1 penalty,
‒ C = 0.1,
‒ Quadratic interactions.
• Final underlying model is accessible as
an attribute of the GridSearchCV object.
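A short sketch of how the selected model and the cross-validation results can be read back from a fitted GridSearchCV object. The data is synthetic and the two-step pipeline (step names "scale" and "clf") is an assumption made to keep the example self-contained. `best_params_`, `best_estimator_` and `cv_results_` are standard GridSearchCV attributes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data standing in for the training set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", random_state=1234)),
])
grid = {"clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5, refit=True).fit(X, y)

# with refit=True the best pipeline is retrained on all the training data
best_pipe = search.best_estimator_
best_clf = best_pipe.named_steps["clf"]
print(search.best_params_)

# mean cross-validated score for every parameter combination
cv_table = list(zip(search.cv_results_["params"],
                    search.cv_results_["mean_test_score"]))
for params, score in cv_table:
    print(params, round(score, 3))
```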
• Easy to reuse the Pipeline: train a Random Forest.
• Model is nonlinear, so no interaction terms needed.
• Don’t need to center or scale.
• Don’t drop baseline categories (unless binary) for this class of model.
• Hyperparameters to consider:
‒ number of estimators (trees) in our ensemble,
‒ number of features to consider for each split,
‒ maximum depth of trees.
A second estimator
from sklearn.ensemble import RandomForestClassifier

# copy of the base pipeline, append estimator
rf_est = copy.deepcopy(base_pipe)
rf_est.steps.append(('random_forest',
                     RandomForestClassifier(random_state=1234)))
# parameter grid
rf_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[False],
    union__num_pipeline__poly_features__degree=[1],
    union__cat_pipeline__manual_dropper__optional_drop_ix=[None],
    random_forest__n_estimators=[50, 100, 200],
    random_forest__max_depth=[6, 9, 12],
    random_forest__max_features=[4, 5, 6])
# ... create the GridSearchCV object as before, and fit to training data
CV results: Random Forest
More advantages of a flexible pipeline framework
• Consider upsampling or downsampling to resolve class imbalance.
‒ We want to sample rows.
‒ This is a different problem: transformers act on columns.
• Ratio of majority to minority class is another parameter to investigate.
‒ A pipeline would be perfect to trial different ratios.
• Add extra behavior to Sklearn Classifiers with a mixin class.
Example: Imbalanced Classes
# use the factory to add extra behavior to the sklearn class
# (sample_clf_factory is a custom helper, not part of Scikit-Learn)
LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
# copy base pipeline, append estimator
lr_us_est = copy.deepcopy(base_pipe)
lr_us_est.steps.append(('lr_sample',
                        LogisticRegressionWithSampling(random_state=1234)))
# we can specify the sampling proportion on the target class
# as a hyperparameter now!
lr_us_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[True],
    union__num_pipeline__poly_features__degree=[1, 2],
    union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
    lr_sample__penalty=['l1'],
    lr_sample__C=[0.01, 0.1, 1.0],
    lr_sample__upsample=[True, False],
    lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
# ... and fit
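The idea of resampling rows inside the estimator can be sketched as follows. This is not the mixin or `sample_clf_factory` used in the deck: it swaps in a wrapper (meta-estimator) instead, and the names `upsample_minority` and `UpsamplingClassifier` are hypothetical. It assumes a binary outcome and only upsamples the minority class.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


def upsample_minority(X, y, target_ratio=1.0, random_state=None):
    """Add minority rows (sampled with replacement) until
    n_minority / n_majority reaches target_ratio."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_target = int(round(target_ratio * counts.max()))
    n_extra = max(n_target - counts.min(), 0)
    extra_ix = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    keep_ix = np.concatenate([np.arange(len(y)), extra_ix])
    return X[keep_ix], y[keep_ix]


class UpsamplingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: resample the training rows, then fit the wrapped model."""

    def __init__(self, estimator=None, target_ratio=1.0, random_state=None):
        self.estimator = estimator
        self.target_ratio = target_ratio
        self.random_state = random_state

    def fit(self, X, y):
        base = LogisticRegression() if self.estimator is None else self.estimator
        # rows are resampled only here, so prediction data is never altered
        X_res, y_res = upsample_minority(X, y, self.target_ratio, self.random_state)
        self.estimator_ = clone(base).fit(X_res, y_res)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


# demo: 80 majority rows and 20 minority rows
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)
X_res, y_res = upsample_minority(X_demo, y_demo, target_ratio=0.5, random_state=0)
clf = UpsamplingClassifier(target_ratio=0.5, random_state=0).fit(X_demo, y_demo)
```

Because the resampling happens inside `fit`, a grid search would resample only the training folds, and `target_ratio` can be tuned like any other hyperparameter.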
CV results: Logistic Regression with upsampling
Making predictions
• We can evaluate our models by making
predictions on the test set.
• As the fitted Pipelines are estimators, we can
make predictions like any other fitted model.
• For this example, we order our predictions of churn
from most to least confident.
‒ Visualise with a lift chart.
Evaluating our model on test set
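The ordering behind a lift chart can be sketched with NumPy alone. The function name `cumulative_gain_and_lift` and the toy arrays are assumptions for illustration. It expects a 0/1 outcome where 1 marks churn, and scores such as the churn-class column of `predict_proba`.

```python
import numpy as np


def cumulative_gain_and_lift(y_true, y_score):
    """Rank observations from most to least confident and accumulate hits."""
    y_true = np.asarray(y_true)
    # indices sorted by descending score
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = np.cumsum(y_true[order])
    # share of all positives captured so far
    gain = hits / y_true.sum()
    # share of the population examined so far
    frac = np.arange(1, len(y_true) + 1) / len(y_true)
    # lift: how much better than picking customers at random
    lift = gain / frac
    return frac, gain, lift


y_true_demo = np.array([0, 1, 0, 1, 0])
y_score_demo = np.array([0.1, 0.9, 0.2, 0.8, 0.3])
frac, gain, lift = cumulative_gain_and_lift(y_true_demo, y_score_demo)
```

Plotting `gain` against `frac` gives the cumulative gain curve. A lift of 1.0 corresponds to selecting customers at random.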
• Discussed advantages of having a framework for ML Pipelines.
• Demonstrated how the Pipeline implementation in Scikit-Learn provides a
framework for flexible, readable and reusable ML.
• Walked through a case study to demonstrate how to apply in practice.
• Hopefully convinced you this is a great way to work with Scikit-Learn!
Conclusion
Thank You
41
github.com/philipmgoddard/pipelines

More Related Content

What's hot

Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
Amir Razmjou
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Wagston Staehler
 
Neural Network Architectures
Neural Network ArchitecturesNeural Network Architectures
Neural Network Architectures
Martin Ockajak
 
A Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial NetworksA Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial Networks
ivaderivader
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
Mitul Tiwari
 
A Unified Approach to Interpreting Model Predictions (SHAP)
A Unified Approach to Interpreting Model Predictions (SHAP)A Unified Approach to Interpreting Model Predictions (SHAP)
A Unified Approach to Interpreting Model Predictions (SHAP)
Rama Irsheidat
 
eScience SHAP talk
eScience SHAP talkeScience SHAP talk
eScience SHAP talk
Scott Lundberg
 
Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework
Deep Learning Italia
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
Hayim Makabee
 
Super resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun YooSuper resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun Yoo
JaeJun Yoo
 
Style gan
Style ganStyle gan
Style gan
哲东 郑
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretability
inovex GmbH
 
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan HamedUplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Rising Media Ltd.
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
Shubhmay Potdar
 
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
Abdulrahman Kerim
 
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
Sri Ambati
 
Computer Vision sfm
Computer Vision sfmComputer Vision sfm
Computer Vision sfm
Wael Badawy
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
Owin Will
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
SungminYou
 

What's hot (20)

Wrapper feature selection method
Wrapper feature selection methodWrapper feature selection method
Wrapper feature selection method
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Neural Network Architectures
Neural Network ArchitecturesNeural Network Architectures
Neural Network Architectures
 
A Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial NetworksA Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial Networks
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
 
A Unified Approach to Interpreting Model Predictions (SHAP)
A Unified Approach to Interpreting Model Predictions (SHAP)A Unified Approach to Interpreting Model Predictions (SHAP)
A Unified Approach to Interpreting Model Predictions (SHAP)
 
eScience SHAP talk
eScience SHAP talkeScience SHAP talk
eScience SHAP talk
 
Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 
Super resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun YooSuper resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun Yoo
 
Style gan
Style ganStyle gan
Style gan
 
Machine Learning Interpretability
Machine Learning InterpretabilityMachine Learning Interpretability
Machine Learning Interpretability
 
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan HamedUplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
Uplift Modelling as a Tool for Making Causal Inferences at Shopify - Mojan Hamed
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
Towards Accurate Multi-person Pose Estimation in the Wild (My summery)
 
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
Scott Lundberg, Microsoft Research - Explainable Machine Learning with Shaple...
 
Computer Vision sfm
Computer Vision sfmComputer Vision sfm
Computer Vision sfm
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 

Similar to Taking your machine learning workflow to the next level using Scikit-Learn Pipelines

Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Philip Goddard
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
Practical data science
Practical data sciencePractical data science
Practical data science
Ding Li
 
AlphaPy
AlphaPyAlphaPy
AlphaPy
Robert Scott
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
Mark Conway
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
Ahmet Bulut
 
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
Lola Burgueño
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Benjamin Bengfort
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
Mark Peng
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
Institute of Contemporary Sciences
 
Learning to Optimize
Learning to OptimizeLearning to Optimize
Learning to Optimize
Pramit Choudhary
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Ronald Francisco Vargas Quesada
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptxIntro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
DeepakJangid87
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Feng Zhu
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
Vijayananda Mohire
 
Survey on Software Defect Prediction
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect Prediction
Sung Kim
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 

Similar to Taking your machine learning workflow to the next level using Scikit-Learn Pipelines (20)

Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn PipelinesRevolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
Revolutionise your Machine Learning Workflow using Scikit-Learn Pipelines
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Practical data science
Practical data sciencePractical data science
Practical data science
 
AlphaPy
AlphaPyAlphaPy
AlphaPy
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
 
Nose Dive into Apache Spark ML
Nose Dive into Apache Spark MLNose Dive into Apache Spark ML
Nose Dive into Apache Spark ML
 
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
A Generic Neural Network Architecture to Infer Heterogeneous Model Transforma...
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
 
Learning to Optimize
Learning to OptimizeLearning to Optimize
Learning to Optimize
 
Presentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12cPresentación Oracle Database Migración consideraciones 10g/11g/12c
Presentación Oracle Database Migración consideraciones 10g/11g/12c
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptxIntro to LV in 3 Hours for Control and Sim 8_5.pptx
Intro to LV in 3 Hours for Control and Sim 8_5.pptx
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
 
Key projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AIKey projects in AI, ML and Generative AI
Key projects in AI, ML and Generative AI
 
Survey on Software Defect Prediction
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect Prediction
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 

Recently uploaded

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Taking your machine learning workflow to the next level using Scikit-Learn Pipelines

  • 1. Taking your machine learning workflow to the next level using Scikit-Learn Pipelines Philip Goddard github.com/philipmgoddard/pipelines
  • 2. • Many ML problems can be solved by assembling modular components: ‒ Raw data in ‒ Predictions out. • Pipelines encapsulate data preparation as well as prediction. Pipelines?
  • 3. • Maintaining a clean workflow can be challenging when building ML models: ‒ Different models require different considerations, leading to multiple versions of training data. ‒ When experimenting, the state of a Jupyter Notebook isn’t always obvious. ‒ Tuning model hyperparameters is (relatively) easy; tuning data transformations can be a little more tedious. Why should I care?
  • 4. • Modular and natural way to build supervised ML models. • Flexibility to experiment with data transformations as well as model tuning. • Clean, DRY, reusable code. How can Pipelines help?
  • 5. • Scikit-Learn works around the concept of transformers and estimators. • Typically, data is run through some transformations before reaching an estimator. • Pipelines are implemented using the Pipeline class, chaining the components together. • Different transformation stages can be built, and combined using FeatureUnion. • When the final step of the Pipeline is an estimator, the GridSearchCV class allows hyperparameter tuning. ‒ Tune (or fix) the parameters for the transformers as well as the estimator. Pipelines in Scikit-Learn
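A minimal sketch of how these pieces fit together, assuming synthetic data and stock scikit-learn transformers (PCA, SelectKBest) in place of the custom ones used later in the talk. The step names ("scale", "features", "clf") are illustrative choices, not names from the talk; the point is the `<step>__<param>` naming that lets GridSearchCV reach parameters nested inside a FeatureUnion inside a Pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real training set (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Two transformation stages run side by side; their outputs are
# concatenated column-wise by FeatureUnion.
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=3)),
])

# Transformers first, estimator last.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step>__<param>, nested as deeply as needed.
param_grid = {
    "features__pca__n_components": [2, 3],
    "features__kbest__k": [3, 5],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the whole Pipeline is refitted inside every cross-validation fold, the transformers only ever see the training part of each fold.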
  • 7. • Dataset for predicting whether subscribers to a mobile telephone plan will churn. ‒ Provided with a train (3333 observations) and test (1667 observations) set. • Features include: ‒ Continuous measurements (e.g. total charges). ‒ Count measurements (e.g. number of customer service calls). ‒ Categorical (e.g. customer area code). • Outcome is binary: ~85% did not churn and ~15% did churn. Case Study: Customer Churn github.com/philipmgoddard/pipelines
  • 8. Understanding data informs Pipeline design • The data set has a tractable number of features (19). • We can easily visualise to get a feel for any considerations required for our Pipeline: ‒ Correlated features ‒ Low variance features ‒ Non-linearities ‒ Poorly behaved distributions ‒ Any other nuances
  • 10. Watch out for correlations! • Pairwise plots are a good way to get a feeling for highly correlated features. • Some classes of model encounter numerical instabilities when fitting if this isn’t resolved. • The pipeline should provide a way to identify and remove such features.
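One way such a filter could look is sketched below. This is not the FindCorrelation class from the talk's repository; `DropCorrelated` and its `threshold` parameter are hypothetical names, and the greedy rule (keep a column only if it is not too correlated with a column already kept) is an assumption made for illustration. Constant columns should be removed beforehand, since their correlation is undefined.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropCorrelated(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: drop one column from each highly correlated pair."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Absolute pairwise Pearson correlations between columns.
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(corr.shape[0]):
            # Keep column j only if it is not too correlated with a kept column.
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        # Select the columns chosen during fit.
        return np.asarray(X)[:, self.keep_]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 1 is almost a multiple of column 0; column 2 is independent.
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])

filt = DropCorrelated(threshold=0.9)
X_filtered = filt.fit_transform(X)
print(filt.keep_, X_filtered.shape)
```

The columns to keep are learned in `fit` and stored, so the same columns are dropped from the test set at prediction time.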
  • 14. Pipeline Schematic: the numerical feature Pipeline and the categorical feature Pipeline are combined by a FeatureUnion, followed by a high-correlation filter and then the estimator.
  • 15. Pipeline components • Numerical features: select features, filter low variance features, filter high correlations, center and scale, calculate interactions. • Categorical features: select features, encode features, drop baseline category, filter low variance features, filter high correlations.
  • 16. Translate to code using Scikit-Learn API (numerical features: select features, filter low variance features, filter high correlations, center and scale, calculate interactions)
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import PolynomialFeatures
      # DataFrameSelector, ZeroVariance, FindCorrelation and OptionalStandardScaler
      # are custom transformers (see the github repository)
      num_pipeline = Pipeline([
          ('selector', DataFrameSelector(float_col_names + int_col_names)),
          ('zero_var', ZeroVariance()),
          ('correlation', FindCorrelation()),
          ('opt_scaler', OptionalStandardScaler()),
          ('poly_features', PolynomialFeatures())
      ])
  • 17. Translate to code using Scikit-Learn API (categorical features: select features, encode features, drop baseline category, filter low variance features, filter high correlations)
      from sklearn.preprocessing import OneHotEncoder
      # ManualDropper is a custom transformer (see the github repository)
      cat_pipeline = Pipeline([
          ('selector', DataFrameSelector(fac_col_names)),
          # in scikit-learn >= 1.2 this argument is named sparse_output
          ('onehot_encoder', OneHotEncoder(sparse=False)),
          ('manual_dropper', ManualDropper(drop_ix=drop_col_ix)),
          ('zero_var', ZeroVariance()),
          ('correlation', FindCorrelation())
      ])
  • 18. The ‘base’ Pipeline in code (numerical feature Pipeline and categorical feature Pipeline, joined by FeatureUnion, then a correlation filter)
      from sklearn.pipeline import FeatureUnion
      # bring it all together to produce 'base' pipeline
      base_pipe = Pipeline([
          ('union', FeatureUnion(
              # parallel parts of pipeline
              transformer_list=[
                  ('num_pipeline', num_pipeline),
                  ('cat_pipeline', cat_pipeline)
              ])
          ),
          # final correlation check
          ('correlation', FindCorrelation()),
      ])
  • 19. • Sometimes custom transformers are needed to suit the specific problem. • And sometimes more flexibility is required: ‒ OptionalStandardScaler is a wrapper class around the StandardScaler class in the preprocessing module. • Custom transformer classes must define a fit and a transform method. • See github repository for examples. Custom transformers
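As an illustration of the fit/transform contract, here is a minimal sketch of an optional scaler. It is not the OptionalStandardScaler from the repository; the class name `OptionalScalerSketch` is hypothetical, and the `scale` flag is modelled on the `opt_scaler__scale` parameter that appears in the grid searches later in the talk.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class OptionalScalerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: standardise when scale=True, otherwise pass through."""

    def __init__(self, scale=True):
        # Constructor arguments are stored unchanged so that
        # get_params / set_params (and therefore GridSearchCV) work.
        self.scale = scale

    def fit(self, X, y=None):
        if self.scale:
            # Learn column means and standard deviations from the training data.
            self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        if self.scale:
            return self.scaler_.transform(X)
        return np.asarray(X)


X = np.array([[1.0, 10.0], [3.0, 30.0]])
scaled = OptionalScalerSketch(scale=True).fit_transform(X)
unscaled = OptionalScalerSketch(scale=False).fit_transform(X)
print(scaled)
print(unscaled)
```

Exposing the on/off switch as a constructor argument is what turns "scale or not" into a tunable hyperparameter of the Pipeline.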
  • 21. • Built a base Pipeline object of transformers. • Use this as a template, and: ‒ Make a copy, ‒ Append an estimator to the end, ‒ Train using GridSearchCV. • Possible to throw a handful of estimators into a single Pipeline ‒ However, makes it harder to dissect models. Estimators
  • 22. • Logistic regression. • Trial L1 and L2 penalties, and a range of C (penalty strength). • As the model is linear, trial non-linear interaction terms. • Worried about multicollinearity, so explicitly drop baseline categories. • Center and scale the data. Our first estimator
  • 23. Code: logistic regression estimator and grid search
      import copy
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import GridSearchCV
      # copy of the base pipeline, append estimator
      lr_est = copy.deepcopy(base_pipe)
      # solver='liblinear' supports both the L1 and the L2 penalty
      lr_est.steps.append(('logistic_regression',
                           LogisticRegression(random_state=1234, solver='liblinear')))
      # parameters for grid search
      lr_param_grid = dict(
          union__num_pipeline__opt_scaler__scale=[True],
          union__num_pipeline__poly_features__degree=[1, 2],
          union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
          logistic_regression__penalty=['l1', 'l2'],
          logistic_regression__C=[0.001, 0.01, 0.1, 1.0, 10.0])
      # initialize GridSearchCV object with cross validation
      grid_search_lr = GridSearchCV(estimator=lr_est,
                                    param_grid=lr_param_grid,
                                    scoring='roc_auc',
                                    cv=5,
                                    refit=True)
      # fit to training data
      grid_search_lr.fit(features_train, outcome_train)
  • 27. CV results: Logistic Regression • Selected: ‒ L1 penalty, ‒ C = 0.1, ‒ Quadratic interactions. • Final underlying model is accessible as an attribute of the GridSearchCV object.
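A short sketch of how the fitted model can be retrieved, assuming synthetic data and a two-step Pipeline rather than the talk's base pipeline. With `refit=True`, `best_estimator_` is the whole Pipeline refitted on all training data, and `named_steps` gives access to the final estimator inside it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logistic_regression", LogisticRegression(solver="liblinear")),
])
grid = GridSearchCV(
    pipe,
    {"logistic_regression__penalty": ["l1", "l2"],
     "logistic_regression__C": [0.1, 1.0]},
    scoring="roc_auc", cv=3, refit=True,
)
grid.fit(X, y)

best_pipe = grid.best_estimator_  # the refitted Pipeline
best_model = best_pipe.named_steps["logistic_regression"]  # final estimator
print(grid.best_params_)
print(best_model.coef_.shape)
```

The per-combination scores are also available in `grid.cv_results_`, which is convenient for plotting results like those on this slide.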
  • 28. • Easy to reuse the Pipeline: train a Random Forest. • Model is nonlinear, so no interaction terms needed. • Don’t need to center or scale. • Don’t drop baseline categories (unless binary) for this class of model. • Hyperparameters to consider: ‒ number of estimators (trees) in our ensemble, ‒ number of features to consider for each split, ‒ maximum depth of trees. A second estimator
  • 29. Code: random forest estimator and parameter grid
      from sklearn.ensemble import RandomForestClassifier
      # copy of the base pipeline, append estimator
      rf_est = copy.deepcopy(base_pipe)
      rf_est.steps.append(('random_forest',
                           RandomForestClassifier(random_state=1234)))
      # parameter grid: no scaling, no interactions, keep all categories
      rf_param_grid = dict(
          union__num_pipeline__opt_scaler__scale=[False],
          union__num_pipeline__poly_features__degree=[1],
          union__cat_pipeline__manual_dropper__optional_drop_ix=[None],
          random_forest__n_estimators=[50, 100, 200],
          random_forest__max_depth=[6, 9, 12],
          random_forest__max_features=[4, 5, 6])
      # ... create the GridSearchCV object as before, and fit to training data
  • 32. More advantages of a flexible pipeline framework
  • 33. • Consider upsampling or downsampling to resolve class imbalance. ‒ We want to sample rows. ‒ This is a different problem: transformers act on columns. • Ratio of majority to minority class is another parameter to investigate. ‒ A pipeline would be perfect to trial different ratios. • Add extra behavior to Sklearn Classifiers with a mixin class. Example: Imbalanced Classes
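The talk adds sampling through a mixin class and a factory (see the repository). The sketch below swaps in a simpler technique, a wrapper meta-estimator, to show the underlying idea: resample rows inside `fit`, so that sampling happens only on the training part of each cross-validation fold. The class name `ResamplingClassifier` is hypothetical, the parameter names `upsample` and `target_ratio` are borrowed from the grid on the following slides, and reading `target_ratio` as the minority-to-majority row ratio is an assumption. Binary outcomes only.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class ResamplingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: rebalance training rows, then fit a classifier.

    target_ratio is the desired minority/majority row ratio after resampling.
    """

    def __init__(self, base_estimator=None, upsample=True, target_ratio=1.0,
                 random_state=None):
        self.base_estimator = base_estimator
        self.upsample = upsample
        self.target_ratio = target_ratio
        self.random_state = random_state

    def _resample_indices(self, y):
        rng = np.random.default_rng(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y != minority)
        if self.upsample:
            # Add minority rows, drawn with replacement, up to the target ratio.
            extra = int(round(self.target_ratio * len(maj_idx))) - len(min_idx)
            if extra > 0:
                drawn = rng.choice(min_idx, size=extra, replace=True)
                min_idx = np.concatenate([min_idx, drawn])
        else:
            # Keep a random subset of majority rows, drawn without replacement.
            n_maj = min(len(maj_idx),
                        int(round(len(min_idx) / self.target_ratio)))
            maj_idx = rng.choice(maj_idx, size=n_maj, replace=False)
        return np.concatenate([min_idx, maj_idx])

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        idx = self._resample_indices(y)
        base = (self.base_estimator if self.base_estimator is not None
                else LogisticRegression())
        # Fit a fresh copy of the wrapped classifier on the resampled rows.
        self.estimator_ = clone(base).fit(X[idx], y[idx])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)


# Imbalanced toy data: 170 negatives, 30 positives (synthetic, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 170 + [1] * 30)

clf = ResamplingClassifier(LogisticRegression(), upsample=True,
                           target_ratio=1.0, random_state=0).fit(X, y)
print(clf.predict_proba(X[:2]).shape)
```

Used as the last step of a Pipeline, `upsample` and `target_ratio` can then be placed in a parameter grid like any other hyperparameter.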
  • 34. Code: logistic regression with sampling
      # use the factory to add extra behavior to the sklearn class
      LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
      # copy base pipeline, append estimator
      # (with recent scikit-learn the L1 penalty needs a compatible solver such as 'liblinear')
      lr_us_est = copy.deepcopy(base_pipe)
      lr_us_est.steps.append(('lr_sample',
                              LogisticRegressionWithSampling(random_state=1234)))
      # we can specify the sampling proportion on the target class
      # as a hyperparameter now!
      lr_us_param_grid = dict(
          union__num_pipeline__opt_scaler__scale=[True],
          union__num_pipeline__poly_features__degree=[1, 2],
          union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
          lr_sample__penalty=['l1'],
          lr_sample__C=[0.01, 0.1, 1.0],
          lr_sample__upsample=[True, False],
          lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
      # ... and fit
  • 35. Code: the same estimator, with the polynomial degree fixed at 2
      # use the factory to add extra behavior to the sklearn class
      LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
      # copy base pipeline, append estimator
      lr_us_est = copy.deepcopy(base_pipe)
      lr_us_est.steps.append(('lr_sample',
                              LogisticRegressionWithSampling(random_state=1234)))
      # we can specify the sampling proportion on the target class
      # as a hyperparameter now!
      lr_us_param_grid = dict(
          union__num_pipeline__opt_scaler__scale=[True],
          union__num_pipeline__poly_features__degree=[2],
          union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
          lr_sample__penalty=['l1'],
          lr_sample__C=[0.01, 0.1, 1.0],
          lr_sample__upsample=[True, False],
          lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
      # ... and fit
  • 36. CV results: Logistic Regression with up sampling
  • 38. • We can evaluate our models by making predictions on the test set. • As the fitted Pipelines are estimators, we can make predictions like any other fitted model. • For this example, order our predictions of churn from most to least confident. ‒ Visualise with a lift chart. Evaluating our model on test set
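The ordering step can be sketched with NumPy alone. The example below computes a cumulative gain curve, one common form of lift chart, from hand-made labels and scores; the function name `cumulative_gain` and the toy data are illustrative assumptions. In practice the scores would be the predicted churn probabilities from the fitted Pipeline's `predict_proba` on the test set.

```python
import numpy as np


def cumulative_gain(y_true, y_score):
    """Fraction of all positives found within the top-scored samples."""
    y_true = np.asarray(y_true)
    # Order samples from most to least confident prediction of the positive class.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = np.cumsum(y_true[order])
    frac_samples = np.arange(1, len(y_true) + 1) / len(y_true)
    frac_found = hits / y_true.sum()
    return frac_samples, frac_found


# Toy labels (1 = churned) and predicted churn probabilities.
y_true = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.6, 0.1, 0.05])

frac_samples, frac_found = cumulative_gain(y_true, y_score)
# Lift: how much better than contacting customers at random.
lift = frac_found / frac_samples
for s, f, l in zip(frac_samples, frac_found, lift):
    print(f"top {s:.0%}: found {f:.0%} of churners, lift {l:.2f}")
```

Plotting `frac_found` against `frac_samples` gives the chart; a model no better than random follows the diagonal.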
  • 39. • Discussed advantages of having a framework for ML Pipelines. • Demonstrated how the Pipeline implementation in Scikit-Learn provides a framework for flexible, readable and reusable ML. • Walked through a case study to demonstrate how to apply in practice. • Hopefully convinced you this is a great way to work with Scikit-Learn! Conclusion