Learning machine learning with Yellowbrick

Yellowbrick is an open source Python library that provides visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods.

  1. 1. Learning machine learning with Yellowbrick
  2. 2. The Model Selection Triple Arun Kumar http://bit.ly/2abVNrI Feature Analysis Algorithm Selection Hyperparameter Tuning
  3. 3. Feature Analysis
  4. 4. Use radviz or parallel coordinates to look for class separability Yellowbrick Feature Visualizers
  5. 5. ● Based on spring tension minimization algorithm. ● Features equally spaced on a unit circle, instances dropped into circle. ● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance. ● Classification coloring based on labels in data. Radial Visualization
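
In code, a RadViz plot runs through Yellowbrick's usual fit/transform/poof cycle. A minimal sketch, using scikit-learn's iris data as a stand-in for your own dataset:

    # Minimal RadViz sketch; iris is only a stand-in dataset
    from sklearn.datasets import load_iris
    from yellowbrick.features import RadViz

    data = load_iris()
    X, y = data.data, data.target

    visualizer = RadViz(classes=list(data.target_names), features=list(data.feature_names))
    visualizer.fit(X, y)       # compute the circular layout from the data
    visualizer.transform(X)    # drop the instances into the unit circle
    visualizer.poof()          # render the figure (show() in newer releases)
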
  6. 6. Before and after standardization Parallel Coordinates
  7. 7. Parallel Coordinates ● Visualize clusters in data. ● Points represented as connected line segments. ● Each vertical line represents one attribute (x-axis units not meaningful). ● One set of connected line segments represents one instance. ● Points that tend to cluster will appear closer together.
  8. 8. Use Rank2D for pairwise feature analysis, find strong correlations (potential collinearity?) Rank2D
  9. 9. Rank2D ● Feature engineering requires understanding of the relationships between features ● Visualize pairwise relationships as a heatmap ● Pearson shows us strong correlations, potential collinearity ● Covariance helps us understand the sequence of relationships
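
A minimal Rank2D sketch (iris again as a stand-in); switching algorithm='pearson' to 'covariance' changes the ranking metric:

    # Minimal Rank2D sketch: pairwise Pearson correlation heatmap
    from sklearn.datasets import load_iris
    from yellowbrick.features import Rank2D

    X, y = load_iris(return_X_y=True)

    visualizer = Rank2D(algorithm='pearson')
    visualizer.fit(X, y)
    visualizer.transform(X)    # ranks feature pairs and draws the heatmap
    visualizer.poof()
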
  10. 10. PCA Projection Plots ● Uses PCA to decompose high dimensional data into two or three dimensions ● Each instance plotted in a scatter plot. ● Projected dataset can be analyzed along axes of principal variation ● Can be interpreted to determine if spherical distance metrics can be utilized.
  11. 11. PCA Projection Plots Can also plot in 3D to visualize more components & get a better sense of distribution in high dimensions
  12. 12. Visualize top tokens, document distribution & part-of-speech tagging Feature Visualizers for Text
  13. 13. How do I select the right features?
  14. 14. Feature Importance Plot ● Need to select the minimum required features to produce a valid model. ● The more features a model contains, the more complex it is (sparse data, errors due to variance). ● This visualizer ranks and plots underlying impact of features relative to each other.
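
A hedged sketch of the feature importance visualizer; the import path has moved between Yellowbrick releases (yellowbrick.features in older versions, yellowbrick.model_selection in newer ones), and the random forest is only an example estimator:

    # Sketch only: import path varies by Yellowbrick version
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from yellowbrick.model_selection import FeatureImportances

    X, y = load_breast_cancer(return_X_y=True)

    viz = FeatureImportances(RandomForestClassifier(n_estimators=100, random_state=42))
    viz.fit(X, y)    # fits the wrapped model and plots its feature_importances_
    viz.poof()
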
  15. 15. Recursive Feature Elimination
  16. 16. Recursive Feature Elimination ● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached. ● Features are ranked by internal model’s coef_ or feature_importances_ ● Attempts to eliminate dependencies and collinearity that may exist in the model.
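
A sketch of the RFECV visualizer, with the same caveat about import paths (yellowbrick.model_selection in recent releases); the breast cancer data and logistic regression are stand-ins:

    # Sketch only: Yellowbrick's RFECV, import path varies by version
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from yellowbrick.model_selection import RFECV

    X, y = load_breast_cancer(return_X_y=True)

    viz = RFECV(LogisticRegression(max_iter=5000), cv=5, scoring='f1')
    viz.fit(X, y)    # eliminates features recursively, cross-validating each subset
    viz.poof()
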
  17. 17. Model Evaluation
  18. 18. Evaluating Classifiers ● How well did predicted values match actual labeled values? ● In a 2-class problem, there are two ways to be “right”: ○ Classifier correctly identifies cases (aka “True Positives”) ○ Classifier correctly identifies non-cases (aka “True Negatives”) ● ...and two ways to be “wrong”: ○ Classifier incorrectly identifies a non-case as a case (aka “False Positive” or “Type I Error”) ○ Classifier incorrectly identifies a case as a non-case (aka “False Negative” or “Type II Error”)
  19. 19. Metrics for Classification ● Precision: how many selected items are relevant? (from sklearn.metrics import precision_score) ● Recall: how many relevant items were selected? (from sklearn.metrics import recall_score) ● F1: weighted average of precision and recall (from sklearn.metrics import f1_score) ● Confusion Matrix: true positives, true negatives, false positives, false negatives (from sklearn.metrics import confusion_matrix) ● ROC: true positive rate vs. false positive rate as the classification threshold varies (from sklearn.metrics import roc_curve) ● AUC: aggregate accuracy as the classification threshold varies (from sklearn.metrics import auc)
  20. 20. accuracy = (true positives + true negatives) / total precision = true positives / (true positives + false positives) recall = true positives / (false negatives + true positives) F1 score = 2 * ((precision * recall) / (precision + recall))
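
As a worked example with made-up counts (40 true positives, 10 false positives, 5 false negatives, 45 true negatives):

    # Worked example with made-up counts
    tp, fp, fn, tn = 40, 10, 5, 45

    accuracy  = (tp + tn) / (tp + tn + fp + fn)                   # 0.85
    precision = tp / (tp + fp)                                    # 0.80
    recall    = tp / (tp + fn)                                    # ~0.89
    f1        = 2 * (precision * recall) / (precision + recall)   # ~0.84
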
  21. 21. Visualize accuracy and begin to diagnose problems Yellowbrick Score Visualizers
  22. 22. Classification Report from sklearn.metrics import classification_report as cr print(cr(y, yhat, target_names=target_names)) ● includes same basic info as confusion matrix ● 3 different evaluation metrics: precision, recall, F1 score ● includes class labels for interpretability
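
Yellowbrick wraps the same report into a heatmap visualizer. A minimal sketch, assuming iris data and a Gaussian Naive Bayes stand-in estimator:

    # Minimal ClassificationReport sketch
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from yellowbrick.classifier import ClassificationReport

    data = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=42)

    viz = ClassificationReport(GaussianNB(), classes=list(data.target_names))
    viz.fit(X_train, y_train)    # fit the wrapped estimator
    viz.score(X_test, y_test)    # compute per-class precision/recall/F1 and draw
    viz.poof()
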
  23. 23. Classification Heatmaps Precision: of those labelled edible, how many actually were? Is it better to have false positives here or here? Recall: how many of the poisonous ones did our model find?
  24. 24. ROC-AUC from sklearn.metrics import roc_curve, auc fpr, tpr, thresholds = roc_curve(y, yhat) roc_auc = auc(fpr, tpr) Visualize the tradeoff between the classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives) ● curve that rises straight up the left side, then runs flat along the top -> perfect classifier ● pulling a lot toward the upper left corner -> good accuracy ● exactly aligned with the diagonal -> coin toss
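
The corresponding Yellowbrick visualizer is ROCAUC. A minimal sketch on a binary problem, with the breast cancer data as a stand-in:

    # Minimal ROCAUC sketch
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from yellowbrick.classifier import ROCAUC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    viz = ROCAUC(LogisticRegression(max_iter=5000))
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)    # draws the ROC curves and reports AUC
    viz.poof()
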
  25. 25. Getting more right comes at the expense of getting more wrong ROC-AUC
  26. 26. ROC-AUC for Multiclass Classification ROC curves are typically used in binary classification, but Yellowbrick allows for multiclass classification evaluation by binarizing output (per-class) or using one-vs-rest (micro score) or one-vs-all (macro score) strategies of classification.
  27. 27. Confusion Matrix ● takes as an argument actual values and predicted values generated by the fitted model ● outputs a confusion matrix from sklearn.metrics import confusion_matrix
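
Yellowbrick's ConfusionMatrix visualizer follows the same fit/score/poof pattern. A sketch with the digits data as a stand-in:

    # Minimal ConfusionMatrix sketch
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from yellowbrick.classifier import ConfusionMatrix

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    cm = ConfusionMatrix(LogisticRegression(max_iter=5000))
    cm.fit(X_train, y_train)
    cm.score(X_test, y_test)    # predicts on the test split and fills the matrix
    cm.poof()
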
  28. 28. I have a lot of classes; how does my model perform on each? Do I care about certain classes more than others? Confusion Matrix
  29. 29. Class Prediction Error Plot Similar to confusion matrix, but sometimes more interpretable
  30. 30. Discrimination Threshold Visualizer * for binary classification only ● Probability or score at which positive class is chosen over negative. ● Generally set to 50% ● Can be adjusted to increase/decrease sensitivity to false positives or other application factors ● Cases that require special treatment?
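
A minimal DiscriminationThreshold sketch, assuming a synthetic, imbalanced binary dataset:

    # Minimal DiscriminationThreshold sketch (binary classification only)
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from yellowbrick.classifier import DiscriminationThreshold

    X, y = make_classification(n_samples=1000, n_classes=2,
                               weights=[0.8, 0.2], random_state=42)

    viz = DiscriminationThreshold(LogisticRegression())
    viz.fit(X, y)    # sweeps the threshold over repeated train/test splits
    viz.poof()
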
  31. 31. Evaluating Regressors ● How well does the model describe the training data? ● How well does the model predict out-of-sample data? ○ Goodness-of-fit ○ Randomness of residuals ○ Prediction error
  32. 32. Metrics for Regression ● Mean Square Error (MSE, RMSE): distance between predicted values and actual values; more sensitive to outliers (from sklearn.metrics import mean_squared_error) ● Absolute Error (MAE, RAE): distance between predicted values and actual values; less sensitive to outliers (from sklearn.metrics import mean_absolute_error, median_absolute_error) ● Coefficient of Determination (R-Squared): % of variance explained by the regression; how well future samples are likely to be predicted by the model (from sklearn.metrics import r2_score)
  33. 33. Visualize the distribution of error to diagnose heteroscedasticity Yellowbrick Score Visualizers
  34. 34. Prediction Error Plots from sklearn.model_selection import cross_val_predict ● Cross-validation is a way of measuring model performance. ● Divide data into training and test splits; fit model on training, predict on test. ● Use cross_val_predict to visualize prediction errors as a scatterplot of the predicted and actual values.
  35. 35. Prediction Error Plots
  36. 36. Plotting Residuals ● Standardized y-axis ● Model prediction on x-axis. ● Model accuracy on y-axis; distance from line at 0 indicates how good/bad the prediction was for that value. ● Check whether residuals are consistent with random error; data points should appear evenly dispersed around the plotted line. Should not be able to predict error. ● Visualize train and test data with different colors.
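
A minimal ResidualsPlot sketch, assuming synthetic regression data and a Ridge stand-in estimator:

    # Minimal ResidualsPlot sketch
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from yellowbrick.regressor import ResidualsPlot

    X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    viz = ResidualsPlot(Ridge())
    viz.fit(X_train, y_train)    # residuals on the training split
    viz.score(X_test, y_test)    # residuals on the test split, in a second color
    viz.poof()
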
  37. 37. Plotting Residuals
  38. 38. Metrics for Clustering ...
  39. 39. Maybe? ● Silhouette scores ● Elbow curves Metrics for Clustering ...
  40. 40. Why is my F1/R2 so low?
  41. 41. ● What to do with a low-accuracy classifier? ● Check for class imbalance. ● Visual cue that we might try stratified sampling, oversampling, or getting more data. Class Balance
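
A sketch of the ClassBalance visualizer; its module and interface have shifted across releases (shown here as the target visualizer in yellowbrick.target, which fits on the labels alone), so treat the details as assumptions:

    # Sketch only: ClassBalance location/signature varies by version
    from sklearn.datasets import make_classification
    from yellowbrick.target import ClassBalance

    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                               weights=[0.6, 0.3, 0.1], random_state=42)

    viz = ClassBalance(labels=["class 0", "class 1", "class 2"])
    viz.fit(y)      # counts instances per class and draws a bar chart
    viz.poof()
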
  42. 42. Cross Validation Scores ● Real world data are often distributed somewhat unevenly; the fitted model is likely to perform better on some sections of data than others. ● See cross-validated scores as a bar chart (one bar for each fold) with the average score across all folds plotted as a dotted line. ● Explore variations in performance using different cross validation strategies.
  43. 43. Learning Curve ● Relationship of the training score vs. the cross validated test score for an estimator. ● Do we need more data? If the scores converge together, then probably not. If the training score is much higher than the validation score, then yes. ● Is the estimator more sensitive to error due to variance or error due to bias?
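
A minimal LearningCurve sketch, assuming the digits data and a Naive Bayes stand-in:

    # Minimal LearningCurve sketch
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.naive_bayes import GaussianNB
    from yellowbrick.model_selection import LearningCurve

    X, y = load_digits(return_X_y=True)

    viz = LearningCurve(GaussianNB(), cv=5, scoring='accuracy',
                        train_sizes=np.linspace(0.1, 1.0, 8))
    viz.fit(X, y)    # cross-validates at each training-set size
    viz.poof()
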
  44. 44. Validation Curve ● Plot the influence of a single hyperparameter on the training and test data. ● Is the estimator under- or over-fitting for some hyperparameter values? For SVC, gamma is the coefficient of the RBF kernel. The larger gamma is, the tighter the support vector is around single points (e.g. overfitting). Here around gamma=0.1 the SVC memorizes the data.
  45. 45. Hyperparameter Tuning
  46. 46. Hyperparameters ● When we call fit() on an estimator, it learns the parameters of the algorithm that make it fit the data best. ● However, some parameters are not directly learned within an estimator. These are the ones we provide when we instantiate the estimator. ○ alpha for LASSO or Ridge ○ C, kernel, and gamma for SVC ● These parameters are often referred to as hyperparameters.
  47. 47. Examples: ● Alpha/penalty for regularization ● Kernel function in support vector machine ● Leaves or depth of a decision tree ● Neighbors used in a nearest neighbor classifier ● Clusters in a k-means clustering Hyperparameters
  48. 48. How to pick the best hyperparameters? ● Use the defaults ● Pick randomly ● Search parameter space for the best score (e.g. grid search) … Except that hyperparameter space is large and grid search is slow if you don’t already know what you’re looking for. Hyperparameters
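
For comparison, the brute-force search mentioned above looks roughly like this in plain scikit-learn (the parameter grid is illustrative only):

    # Plain scikit-learn grid search, for comparison
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    param_grid = {
        "C": [0.1, 1, 10],
        "gamma": [0.001, 0.01, 0.1],
        "kernel": ["rbf"],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
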
  49. 49. How do I tune this model?
  50. 50. Should I use Lasso, Ridge, or ElasticNet? Is regularization even working? More alpha => less complexity => more bias, less variance Alpha selection with Yellowbrick
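
A minimal AlphaSelection sketch; it expects one of scikit-learn's cross-validated regularized regressors (LassoCV here), and the alpha range is illustrative:

    # Minimal AlphaSelection sketch
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV
    from yellowbrick.regressor import AlphaSelection

    X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=42)

    alphas = np.logspace(-3, 1, 50)
    viz = AlphaSelection(LassoCV(alphas=alphas))
    viz.fit(X, y)    # LassoCV selects the alpha with the lowest CV error
    viz.poof()
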
  51. 51. ● How many clusters do you see? ● How do you pick an initial value for k in k-means clustering? ● How do you know whether to increase or decrease k? ● Is partitive clustering the right choice? What’s the right k?
  52. 52. higher silhouette scores mean denser, more separate clusters The elbow shows the best value of k… Or suggests a different algorithm K-selection with Yellowbrick
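
A sketch of the two cluster visualizers, assuming synthetic blob data:

    # Minimal elbow and silhouette sketches
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

    X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

    elbow = KElbowVisualizer(KMeans(random_state=42), k=(2, 10))
    elbow.fit(X)     # fits KMeans for each k and plots the distortion score
    elbow.poof()

    sil = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=42))
    sil.fit(X)       # silhouette coefficient per sample, grouped by cluster
    sil.poof()
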
  53. 53. Manifold Visualization ● Embed instances described by many dimensions into 2. ● Look for latent structures in the data, noise, separability. ● Is it possible to create a decision space in the data? ● Unlike PCA or SVD, manifolds use nearest neighbors, can capture non-linear structures.
  54. 54. Using Yellowbrick
  55. 55. Install: $ pip install yellowbrick Upgrade: $ pip install -U yellowbrick Anaconda: $ conda install -c districtdatalabs yellowbrick Quickstart
  56. 56. Scikit-Learn Estimator Interface
      # Import the estimator
      from sklearn.linear_model import Lasso

      # Instantiate the estimator
      model = Lasso()

      # Fit the data to the estimator
      model.fit(X_train, y_train)

      # Generate a prediction
      model.predict(X_test)
  57. 57. Yellowbrick Visualizer Interface
      # Import the model and visualizer
      from sklearn.linear_model import Lasso
      from yellowbrick.regressor import PredictionError

      # Instantiate the visualizer
      visualizer = PredictionError(Lasso())

      # Fit
      visualizer.fit(X_train, y_train)

      # Score and visualize
      visualizer.score(X_test, y_test)
      visualizer.poof()
  58. 58. Scikit-learn Estimators The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
      class Estimator(object):
          def fit(self, X, y=None):
              """ Fits estimator to data. """
              # set state of self
              return self

          def predict(self, X):
              """ Predict response of X """
              # compute predictions pred
              return pred
  59. 59. Scikit-learn Transformers Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X′.
      class Transformer(Estimator):
          def transform(self, X):
              """ Transforms the input data. """
              # transform X to X_prime
              return X_prime
  60. 60. Yellowbrick Visualizers A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.
      class Visualizer(Estimator):
          def draw(self):
              """ Draw the data """
              self.ax.plot()

          def finalize(self):
              """ Complete the figure """
              self.ax.set_title()

          def poof(self):
              """ Show the figure """
              plt.show()
  61. 61. Contributing
  62. 62. Yellowbrick is an open source project that is supported by a community who will gratefully and humbly accept any contributions you might make to the project. Large or small, any contribution makes a big difference; and if you’ve never contributed to an open source project before, we hope you will start with Yellowbrick!
  63. 63. Please star Yellowbrick on GitHub! github.com/DistrictDataLabs/yellowbrick

Editor's Notes

  • The model selection triple.

    Arun Kumar did a survey of the analytical process
    He’s going to crop up in a bit in a more interesting way
    This feels right to me; and hopefully you see something similar.

    Machine learning is about learning from example
    And works on instances (examples)

    Cite: http://pages.cs.wisc.edu/~arun/vision/SIGMODRecord15.pdf
    analysts typically use an iterative exploratory process
  • Visit the docs! http://www.scikit-yb.org/en/develop/index.html
  • For classification; potentially we want to see if there is good separability
    Are some features more predictive than others?
  • We can see that the co2 values for the two classes are intertwined. We get a sense that something like a decision tree will have a hard time with this. Perhaps Gaussian instead? It will be able to use probabilities to describe the spread of those co2 values.
  • Feature engineering requires understanding of the relationships between features
    Visualize pairwise relationships as a heatmap
    Pearson shows us strong correlations => potential collinearity
    Covariance helps us understand the sequence of relationships
  • Uses PCA to decompose high dimensional data into two or three dimensions
    Each instance plotted in a scatter plot.
    Projected dataset can be analyzed along axes of principal variation
    Can be interpreted to determine if spherical distance metrics can be utilized.
    Can also be plotted in three dimensions to attempt to visualize more components and get a better sense of the distribution in high dimensions
  • Frequency distribution - top 50 tokens
    Stochastic Neighbor Embedding, decomposition then projection into 2D scatterplot
    Visual part-of-speech tagging
  • The feature engineering process involves selecting the minimum required features to produce a valid model because the more features a model contains, the more complex it is (and the more sparse the data), therefore the more sensitive the model is to errors due to variance.
    A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see if the model fares better during cross-validation.
    Many model forms describe the underlying impact of features relative to each other. This visualizer uses this attribute to rank and plot relative importances.
  • Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
    Features are ranked by the model’s coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.
    RFE requires a specified number of features to keep, however it is often not known in advance how many features are valid.
    To find the optimal number of features cross-validation is used with RFE to score different feature subsets and select the best scoring collection of features.
    The RFECV visualizer plots the number of features in the model along with their cross-validated test score and variability and visualizes the selected number of features.
  • https://en.wikipedia.org/wiki/Precision_and_recall
  • Receiver operating characteristics/area under curve
    Classification report heatmap - Quickly identify strengths & weaknesses of model - F1 vs Type I & Type II error
    Visual confusion matrix - misclassification on a per-class basis
  • The class prediction error chart provides a way to quickly understand how good your classifier is at predicting the right classes.
  • A visualization of precision, recall, f1 score, and queue rate with respect to the discrimination threshold of a binary classifier.
    The discrimination threshold is the probability or score at which the positive class is chosen over the negative class.
    Generally, this is set to 50% but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.
    One common use is to determine cases that require special treatment.
    For example, a fraud prevention application might use a classification algorithm to determine if a transaction is likely fraudulent and needs to be investigated in detail.
    Spam/not spam
    Precision: An increase in precision is a reduction in the number of false positives; this metric should be optimized when the cost of special treatment is high (e.g. wasted time in fraud preventing or missing an important email).
    Recall: An increase in recall decreases the likelihood that the positive class is missed; this metric should be optimized when it is vital to catch the case even at the cost of more false positives. (e.g. SPAM v. VIRUS)
    F1 Score: The F1 score is the harmonic mean between precision and recall. The fbeta parameter determines the relative weight of precision and recall when computing this metric, by default set to 1 or F1. Optimizing this metric produces the best balance between precision and recall.
    Queue Rate: The “queue” is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesn’t (e.g. spam filter), this could be optimized to ensure the inbox stays clean.
  • Where/why/how is model performing good/bad
    Prediction error plot - 45 degree line is theoretical perfect
    Residuals plot - 0 line is no error
    See change in amount of variance between x and y, or along x axis => heteroscedasticity
  • Can we quickly detect class imbalance issues
    Stratified sampling, oversampling, getting more data -- tricks will help us balance
    But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference
    As this gets into multiclass problems, domination could be harder to see and really affect modeling
  • A learning curve shows the relationship of the training score vs the cross validated test score for an estimator with a varying number of training samples. This visualization is typically used to show two things:
    How much the estimator benefits from more data (e.g. do we have “enough data” or will the estimator get better if used in an online fashion).
    If the estimator is more sensitive to error due to variance vs. error due to bias.
    If the training and cross validation scores converge together as more data is added (shown in the left figure), then the model will probably not benefit from more data. If the training score is much greater than the validation score (as shown in the right figure) then the model probably requires more training examples in order to generalize more effectively.
  • Plot the influence of a single hyperparameter on the training and test data to determine if the estimator is underfitting or overfitting for some hyperparameter values.
    For a support vector classifier, gamma is the coefficient of the RBF kernel. It controls the influence of a single example. The larger gamma is, the tighter the support vector is around single points (overfitting the model).
    In this visualization we see a definite inflection point around gamma=0.1. At this point the training score climbs rapidly as the SVC memorizes the data, while the cross-validation score begins to decrease as the model cannot generalize to unseen data.
  • Which regularization technique to use? Lasso/L1, Ridge/L2, or ElasticNet L1+L2
    Regularization uses a Norm to penalize complexity at a rate, alpha
    The higher the alpha, the more the regularization.
    Complexity minimization increases bias in the model, but reduces variance
    Goal: select the smallest alpha such that error is minimized
    Visualize the tradeoff
    Surprising to see: higher alpha increasing error, alpha jumping around, etc.
    Embed R2, MSE, etc into the graph - quick reference
  • The Manifold visualizer provides high dimensional visualization using manifold learning to embed instances described by many dimensions into 2, thus allowing the creation of a scatter plot that shows latent structures in data.
    Unlike decomposition methods such as PCA and SVD, manifolds generally use nearest-neighbors approaches to embedding, allowing them to capture non-linear structures that would be otherwise lost.
    The projections that are produced can then be analyzed for noise or separability to determine if it is possible to create a decision space in the data.
  • Estimators learn from data
    Have a fit and predict method
  • Transformers transform data
    Have a transform method
  • Visualizers can be estimators or transformers
    Generally have a draw, finalize, and poof method
