Taking your machine learning workflow to the next level using Scikit-Learn Pipelines

Philip Goddard
github.com/philipmgoddard/pipelines

• Many ML problems can be solved by
assembling modular components:
‒ Raw data in
‒ Predictions out.
• Pipelines encapsulate data preparation
as well as prediction.
Pipelines?

• Maintaining a clean workflow can be challenging when building ML models:
‒ Different models require different considerations, leading to multiple versions of
training data.
‒ When experimenting, the state of a Jupyter Notebook isn’t always obvious.
‒ Tuning model hyperparameters is (relatively) easy; tuning data transformations can
be a little more tedious.
Why should I care?

• Modular and natural way to build supervised ML models.
• Flexibility to experiment with data transformations as well as model tuning.
• Clean, DRY, reusable code.
How can Pipelines help?

• Scikit-Learn works around the concept of transformers and estimators.
• Typically, data is run through some transformations before reaching an estimator.
• Pipelines are implemented using the Pipeline class, chaining the components together.
• Different transformation stages can be built, and combined using FeatureUnion.
• When the final step of the Pipeline is an estimator, the GridSearchCV class allows
hyperparameter tuning.
‒ Tune (or fix) the parameters for the transformers as well as the estimator.
Pipelines in Scikit-Learn
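
A minimal sketch of these pieces working together, using only built-in scikit-learn classes on synthetic data. The dataset, step names and parameter values are illustrative assumptions, not the churn pipeline from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two transformation branches run side by side; their outputs are
# concatenated column-wise by the FeatureUnion.
union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=3)),
])

# Transformers first, estimator last.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("union", union),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step>__<param>, nested through the union,
# so transformer settings are tuned alongside the estimator's.
param_grid = {
    "union__pca__n_components": [2, 3],
    "union__kbest__k": [3, 5],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
best_params = search.best_params_
```

Fixing a transformer parameter is done the same way as tuning it: give the grid a single-element list for that key.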

• Dataset for predicting whether subscribers to a mobile telephone plan will churn.
‒ Provided with a train (3333 observations) and test (1667 observations) set.
• Features include:
‒ Continuous measurements (e.g. total charges).
‒ Count measurements (e.g. number of customer service calls).
‒ Categorical (e.g. customer area code).
• Outcome is binary: ~85% did not churn and ~15% did churn.
Case Study: Customer Churn

Understanding data informs Pipeline design
• The data set has a tractable number of features (19).
• We can easily visualise to get a feel for any considerations required for our Pipeline:
‒ Correlated features
‒ Low variance features
‒ Non-linearities
‒ Poorly behaved distributions
‒ Any other nuances

Numerical features (continuous)

Watch out for
correlations!
• Pairwise plots are a good way to get a
feeling for highly correlated features.
• Some classes of model encounter
numerical instabilities when fitting if this
isn’t resolved.
• The pipeline should provide a way to
identify and remove such features.
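
One way such a filter could look is sketched below. CorrelationFilter and its threshold parameter are hypothetical names, and the greedy rule (keep a column only if it is not highly correlated with a column already kept) is a simplification rather than the FindCorrelation class used in the talk. It assumes constant columns have already been removed, since their correlation is undefined.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drop one column of each highly correlated pair."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Absolute pairwise correlations between columns.
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(corr.shape[0]):
            # Keep column j only if it is not too correlated with a kept column.
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ix_ = np.array(keep)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float)[:, self.keep_ix_]


rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
# The third column is almost a copy of the first, so it should be removed.
X_demo = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])
corr_filter = CorrelationFilter(threshold=0.9).fit(X_demo)
X_filtered = corr_filter.transform(X_demo)
```

Because the columns to keep are learned in fit, the same columns are dropped from the training and test data.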

Pipeline Schematic
[Diagram: the numerical feature Pipeline and the categorical feature Pipeline run in parallel and are joined by a FeatureUnion, followed by a high-correlation filter and then the estimator.]

Pipeline components
Numerical features: Select features → Filter low variance features → Filter high correlations → Center and scale → Calculate interactions
Categorical features: Select features → Encode features → Drop baseline category → Filter low variance features → Filter high correlations

Translate to code using Scikit-Learn API
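from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# DataFrameSelector, ZeroVariance, FindCorrelation and OptionalStandardScaler
# are custom transformers, not part of scikit-learn (see the repository)
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(float_col_names + int_col_names)),
    ('zero_var', ZeroVariance()),
    ('correlation', FindCorrelation()),
    ('opt_scaler', OptionalStandardScaler()),
    ('poly_features', PolynomialFeatures())
])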

Translate to code using Scikit-Learn API
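from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# DataFrameSelector, ManualDropper, ZeroVariance and FindCorrelation
# are custom transformers, not part of scikit-learn (see the repository)
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(fac_col_names)),
    # dense output; scikit-learn 1.2+ renames this argument to sparse_output
    ('onehot_encoder', OneHotEncoder(sparse=False)),
    ('manual_dropper', ManualDropper(drop_ix=drop_col_ix)),
    ('zero_var', ZeroVariance()),
    ('correlation', FindCorrelation())
])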

from sklearn.pipeline import FeatureUnion, Pipeline

# bring it all together to produce the 'base' pipeline
base_pipe = Pipeline([
    ('union', FeatureUnion(
        # parallel parts of pipeline
        transformer_list=[
            ('num_pipeline', num_pipeline),
            ('cat_pipeline', cat_pipeline)
        ])
    ),
    # final correlation check
    ('correlation', FindCorrelation()),
])
The ‘base’ Pipeline in code

• Sometimes custom transformers are needed to suit the specific problem.
• And sometimes more flexibility is required:
‒ OptionalStandardScaler is a wrapper class around the StandardScaler class in
the preprocessing module.
• Custom transformer classes must define a fit and a transform method.
• See the GitHub repository for examples.
Custom transformers
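
A minimal sketch of what an optional scaler could look like. OptionalScaler is a hypothetical stand-in written for illustration; the actual OptionalStandardScaler lives in the repository and may differ. The only detail taken from the talk is that a boolean scale parameter switches scaling on or off.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class OptionalScaler(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: standardise only when scale=True."""

    def __init__(self, scale=True):
        # Stored unchanged so get_params/set_params (and GridSearchCV) can see it.
        self.scale = scale

    def fit(self, X, y=None):
        if self.scale:
            self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        if self.scale:
            return self.scaler_.transform(X)
        # Pass the data through untouched when scaling is switched off.
        return np.asarray(X)


X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = OptionalScaler(scale=True).fit_transform(X_demo)
unscaled = OptionalScaler(scale=False).fit_transform(X_demo)
```

Inheriting from BaseEstimator and TransformerMixin supplies get_params, set_params and fit_transform, so the class only has to define fit and transform.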

• Built a base Pipeline object of transformers.
• Use this as a template, and:
‒ Make a copy,
‒ Append an estimator to the end,
‒ Train using GridSearchCV.
• Possible to throw a handful of estimators into a single Pipeline
‒ However, makes it harder to dissect models.
Estimators

• Logistic regression.
• Trial L1 and L2 penalties, and a range of C (inverse of regularization strength).
• As the model is linear, trial non-linear interaction terms.
• Worried about multicollinearity, so explicitly drop baseline categories.
• Center and scale the data.
Our first estimator

import copy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# copy of the base pipeline, append estimator
lr_est = copy.deepcopy(base_pipe)
# liblinear supports both the L1 and L2 penalties
lr_est.steps.append(('logistic_regression',
                     LogisticRegression(solver='liblinear', random_state=1234)))
# parameters for grid search
lr_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[True],
    union__num_pipeline__poly_features__degree=[1, 2],
    union__cat_pipeline__manual_dropper__optional_drop_ix=[opt_drop_ix],
    logistic_regression__penalty=['l1', 'l2'],
    logistic_regression__C=[0.001, 0.01, 0.1, 1.0, 10.0])
# initialize GridSearchCV object with cross validation
grid_search_lr = GridSearchCV(estimator=lr_est,
                              param_grid=lr_param_grid,
                              scoring='roc_auc',
                              cv=5,
                              refit=True)
# fit to training data
grid_search_lr.fit(features_train, outcome_train)

CV results: Logistic Regression
• Selected:
‒ L1 penalty,
‒ C = 0.1,
‒ Quadratic interactions.
• Final underlying model is accessible as
an attribute of the GridSearchCV object.
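
As a small illustration on synthetic data (the data and step names are assumptions, not the churn model), the refitted Pipeline is available as best_estimator_ and its individual steps can be reached through named_steps:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                      scoring="roc_auc", cv=3, refit=True)
search.fit(X, y)

# With refit=True the best parameter combination is refitted on all the data.
best_pipeline = search.best_estimator_
# Individual fitted steps are reachable by name.
best_clf = best_pipeline.named_steps["clf"]
coefficients = best_clf.coef_
```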

• Easy to reuse the Pipeline: train a Random Forest.
• Model is nonlinear, so no interaction terms needed.
• Don’t need to center or scale.
• Don’t drop baseline categories (unless binary) for this class of model.
• Hyperparameters to consider:
‒ number of estimators (trees) in our ensemble,
‒ number of features to consider for each split,
‒ maximum depth of trees.
A second estimator

import copy
from sklearn.ensemble import RandomForestClassifier

# copy of the base pipeline, append estimator
rf_est = copy.deepcopy(base_pipe)
rf_est.steps.append(('random_forest',
                     RandomForestClassifier(random_state=1234)))
# parameter grid
rf_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[False],
    union__num_pipeline__poly_features__degree=[1],
    union__cat_pipeline__manual_dropper__optional_drop_ix=[None],
    random_forest__n_estimators=[50, 100, 200],
    random_forest__max_depth=[6, 9, 12],
    random_forest__max_features=[4, 5, 6])
# ... create the GridSearchCV object as before, and fit to training data

More advantages of a
flexible pipeline
framework

• Consider upsampling or downsampling to resolve class imbalance.
‒ We want to sample rows.
‒ This is a different problem: transformers act on columns.
• Ratio of majority to minority class is another parameter to investigate.
‒ A pipeline would be perfect to trial different ratios.
• Add extra behavior to Scikit-Learn classifiers with a mixin class.
Example: Imbalanced Classes
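
The row-sampling step itself can be sketched with NumPy alone. The function upsample_minority below is hypothetical and is not the mixin from the repository; it assumes a binary outcome and reads target_ratio as the desired minority-to-majority ratio, which is one possible interpretation of the parameter.

```python
import numpy as np


def upsample_minority(X, y, target_ratio=1.0, random_state=None):
    """Resample minority rows with replacement until
    n_minority / n_majority reaches at least target_ratio."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)

    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_majority = counts.max()
    n_minority = counts.min()

    # Number of extra minority rows needed (none if already at the ratio).
    n_wanted = int(np.ceil(target_ratio * n_majority))
    n_extra = max(n_wanted - n_minority, 0)

    minority_ix = np.flatnonzero(y == minority)
    extra_ix = rng.choice(minority_ix, size=n_extra, replace=True)
    all_ix = np.concatenate([np.arange(len(y)), extra_ix])
    return X[all_ix], y[all_ix]


# 85 non-churners and 15 churners, roughly the class balance in the case study.
y_demo = np.array([0] * 85 + [1] * 15)
X_demo = np.arange(100).reshape(-1, 1)
X_up, y_up = upsample_minority(X_demo, y_demo, target_ratio=0.5, random_state=0)
```

Calling a function like this from inside an estimator's fit method keeps the resampling confined to the training folds during cross-validation, because GridSearchCV only calls fit on training data.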

import copy
from sklearn.linear_model import LogisticRegression

# use the factory to add extra behavior to the sklearn class
LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
# copy base pipeline, append estimator
lr_us_est = copy.deepcopy(base_pipe)
lr_us_est.steps.append(('lr_sample',
                        LogisticRegressionWithSampling(random_state=1234)))
# we can specify the sampling proportion on the target class
# as a hyperparameter now!
# note: in recent scikit-learn the L1 penalty needs a solver that supports it (e.g. liblinear)
lr_us_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[True],
    union__num_pipeline__poly_features__degree=[1, 2],
    lr_sample__penalty=['l1'],
    lr_sample__C=[0.01, 0.1, 1.0],
    lr_sample__upsample=[True, False],
    lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
# ... and fit

import copy
from sklearn.linear_model import LogisticRegression

# use the factory to add extra behavior to the sklearn class
LogisticRegressionWithSampling = sample_clf_factory(LogisticRegression)
# copy base pipeline, append estimator
lr_us_est = copy.deepcopy(base_pipe)
lr_us_est.steps.append(('lr_sample',
                        LogisticRegressionWithSampling(random_state=1234)))
# we can specify the sampling proportion on the target class
# as a hyperparameter now!
# note: in recent scikit-learn the L1 penalty needs a solver that supports it (e.g. liblinear)
lr_us_param_grid = dict(
    union__num_pipeline__opt_scaler__scale=[True],
    union__num_pipeline__poly_features__degree=[2],
    lr_sample__penalty=['l1'],
    lr_sample__C=[0.01, 0.1, 1.0],
    lr_sample__upsample=[True, False],
    lr_sample__target_ratio=[0.15, 0.25, 0.5, 0.75, 1.0])
# ... and fit

CV results: Logistic Regression with upsampling

• We can evaluate our models by making
predictions on the test set.
• As the fitted Pipelines are estimators, we can
make predictions like any other fitted model.
• For this example, order our predictions of churn
from most to least confident.
‒ Visualise with a lift chart.
Evaluating our model on test set
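
One common form of lift chart plots the cumulative share of positives found against the share of samples examined, with samples taken in order of decreasing predicted probability. The sketch below computes those two curves with NumPy; the function name cumulative_gain and the toy labels and scores are made up for illustration, and the chart shown in the talk may have been produced differently.

```python
import numpy as np


def cumulative_gain(y_true, y_score):
    """Share of all positives captured when samples are taken
    in order of decreasing score."""
    y_true = np.asarray(y_true)
    # Most confident predictions first.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = np.cumsum(y_true[order])
    frac_samples = np.arange(1, len(y_true) + 1) / len(y_true)
    frac_found = hits / y_true.sum()
    return frac_samples, frac_found


# Toy labels (1 = churn) and predicted churn probabilities.
y_true = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.4, 0.6, 0.05, 0.15])
frac_samples, frac_found = cumulative_gain(y_true, y_score)
```

For a fitted pipeline, the scores would typically come from predict_proba on the test features, taking the column for the churn class. Plotting frac_found against frac_samples and comparing with the diagonal shows how much better the ordering is than random selection.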

• Discussed advantages of having a framework for ML Pipelines.
• Demonstrated how the Pipeline implementation in Scikit-Learn provides a
framework for flexible, readable and reusable ML.
• Walked through a case study to demonstrate how to apply in practice.
• Hopefully convinced you this is a great way to work with Scikit-Learn!
Conclusion

Thank You
