Escaping the Black Box
Yellowbrick: A Visual API for Machine Learning
Once upon a time ...
And then things got ...
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

kfold = ms.KFold(n_splits=12)
max([
    ms.cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Try them all!
Data
Automated
Build
Insight
Machine Learning as a Service
● Search is difficult, particularly in high-dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the amount of time increases exponentially (see the grid search sketch below).
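To make the combinatorial growth concrete, here is a minimal grid search sketch (assuming the same X and y as the earlier snippet; the parameter values are illustrative): every hyperparameter added multiplies the number of fits.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 4 values of C x 4 values of gamma x 3 kernels = 48 candidates,
# each fit once per cross-validation fold
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "gamma": [0.001, 0.01, 0.1, 1.0],
    "kernel": ["linear", "rbf", "poly"],
}

search = GridSearchCV(SVC(), param_grid, cv=12)
search.fit(X, y)
print(search.best_params_, search.best_score_)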
Except ...
Hyperparameter Tuning
Algorithm Selection
Feature Analysis
The “Model Selection Triple”
Arun Kumar, et al.
Solution: Visual Steering
Enter Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
# Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the data to the estimator
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)
Scikit-Learn Estimator Interface
# Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()
Yellowbrick Visualizer Interface
How do I select the right features?
Is this room occupied?
● Given labelled data with amount of light, heat, humidity, etc.
● Which features are most predictive?
● How hard is it going to be to distinguish the empty rooms from the occupied ones?
Yellowbrick Feature Visualizers
Use RadViz or parallel coordinates to look for class separability
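A minimal RadViz sketch, assuming X holds the room sensor readings and y the occupancy labels (the feature and class names are illustrative):

from yellowbrick.features import RadViz

# Instantiate the visualizer with class and feature names
visualizer = RadViz(
    classes=["unoccupied", "occupied"],
    features=["temperature", "humidity", "light", "CO2"],
)

visualizer.fit(X, y)     # fit the data to the visualizer
visualizer.transform(X)  # draw the points on the unit circle
visualizer.poof()        # show the figure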
Yellowbrick Feature Visualizers
Use Rank2D for pairwise feature analysis
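A sketch of pairwise feature ranking with Rank2D, assuming the same X and y; "pearson" ranks feature pairs by Pearson correlation, and "covariance" is the other built-in option:

from yellowbrick.features import Rank2D

# Rank pairs of features and draw the result as a heatmap
visualizer = Rank2D(algorithm="pearson")
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.poof()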
…for text, too!
Visualize top tokens, document distribution & part-of-speech tagging
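For example, a token frequency sketch using Yellowbrick's text module, assuming corpus is a list of raw text documents:

from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

# Vectorize the (assumed) corpus into a document-term matrix
vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus)

# Plot the frequency distribution of the top tokens
visualizer = FreqDistVisualizer(features=vectorizer.get_feature_names())
visualizer.fit(docs)
visualizer.poof()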
What’s the best model?
Why isn’t my model predictive?
● What to do with a low-accuracy classifier?
● Check for class imbalance.
● Visual cue that we might try stratified sampling, oversampling, or getting more data (sketched below).
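A sketch of spotting imbalance with the ClassBalance visualizer, following the same fit/score/poof pattern as the PredictionError example earlier (the class names are illustrative):

from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassBalance

# Wrap the classifier in the visualizer
visualizer = ClassBalance(
    RandomForestClassifier(), classes=["unoccupied", "occupied"]
)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)  # draw the support for each class
visualizer.poof()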
Yellowbrick Score Visualizers
Visualize accuracy and begin to diagnose problems
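For instance, a classification report heatmap, a minimal sketch assuming the same train/test splits and illustrative class names:

from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

# Visualize precision, recall, and F1 per class as a heatmap
visualizer = ClassificationReport(
    GaussianNB(), classes=["unoccupied", "occupied"]
)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()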
Visualize the distribution of error to diagnose heteroscedasticity
Yellowbrick Score Visualizers
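A sketch with the ResidualsPlot visualizer (assuming the same train/test splits); a fan or curve in the residuals is the visual cue for heteroscedasticity:

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Plot residuals against predicted values for train and test data
visualizer = ResidualsPlot(Ridge())

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()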
How do I tune this model?
What’s the right k?
● How many clusters do you see?
● How do you pick an initial value for k in k-means clustering?
● How do you know whether to increase or decrease k?
● Is partitive clustering the right choice?
Hyperparameter Tuning
Higher silhouette scores mean denser, more separate clusters.
The elbow shows the best value of k… or suggests a different algorithm.
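Minimal sketches of both diagnostics, assuming an unlabeled feature matrix X:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# Fit k-means for k = 2..9 and plot score against k to find the elbow
visualizer = KElbowVisualizer(KMeans(), k=(2, 10))
visualizer.fit(X)
visualizer.poof()

# Inspect per-cluster silhouette scores for one candidate k
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6))
visualizer.fit(X)
visualizer.poof()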
Hyperparameter Tuning
Should I use Lasso, Ridge, or ElasticNet?
Is regularization even working?
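A sketch with the AlphaSelection visualizer, which wraps a cross-validated regularization model such as LassoCV (assuming the same X and y; the alpha range is illustrative):

import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Plot cross-validation error across a range of alpha values
alphas = np.logspace(-10, 1, 200)
visualizer = AlphaSelection(LassoCV(alphas=alphas))
visualizer.fit(X, y)
visualizer.poof()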
What’s next?
Some ideas...
How does token frequency change over time, in relation to other tokens?
Some ideas...
View the corpus hierarchically, after clustering
Some ideas...
Perform token network analysis
Some ideas...
Plot token co-occurrence
Some ideas...
Get an x-ray of the text
Do you have an idea?
The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression, or clustering algorithm, or a transformer that extracts/filters useful features from raw data.
class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
Estimators
Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X'.
class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
Transformers
A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.
class Visualizer(Estimator):

    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()

    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()

    def poof(self):
        """
        Show the figure
        """
        plt.show()
Visualizers
Thank you!
Twitter: twitter.com/rebeccabilbro
Github: github.com/rebeccabilbro
Email: rebecca.bilbro@bytecubed.com

Editor's Notes

  • #2 Good morning and thank you for coming! Stickers! Today I'd like to tell you about an open source Python project I've been working on for the last two years. It's called Yellowbrick, and it's a tool you can use to steer the machine learning process using visual transformers.

    TALK DESC/REFERENCE: In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search. This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.

    Ben and Rebecca are active contributors to the open source community, and this talk is based on Yellowbrick, a project they've been building together with the team at District Data Labs, an open source collaborative in Washington, DC. They are also co-authors of the forthcoming O'Reilly book, Applied Text Analysis with Python, and organizers for Data Community DC, a not-for-profit organization of 9 meetups that hosts free monthly events and lectures for the local data community in Washington, DC.

    Talk outline (Visualizing the Model Selection Process):
    1. The Model Selection Process: YB is part of the model selection workflow; feature selection; algorithm selection (model family/form ⟶ fitted model); hyperparameter tuning; selection with cross-validation; try all the models!; model selection is search.
    3. Main point: visual steering improves model selection: (a) leads to better models (better F1/R2 scores); (b) gets to good models more quickly; (c) produces more insight about models; (d) research people: prove this.
    4. The Scikit-Learn API: YB _extends_ the Scikit-Learn API; tricky because of functional/procedural matplotlib + OO; estimators; transformers; visualizers; pipelines; visual workflows and pipelines.
    5. Primary YB requirements: fits into the sklearn API and workflow; implements matplotlib calls efficiently; low overhead if `poof()` is not called; just flexible enough for users to adapt to their data; easy to add new visualizers; looks as good as Seaborn; minimal dependencies (sklearn, numpy, matplotlib -- c'est tout!); primary requirement: implements visual steering.
    6. The Visualizer: current class hierarchy; the Visualizer interface; axes management; `draw()`, `poof()`, and `finalize()`; a simple example of a visualizer; feature visualizers; score visualizers; multi-estimator visualizers.
    9. Visual Pipelines: multiple visualizations; interactivity.
    10. Optimizing visualization!
    11. Utilities: style management (sequences, palettes, color codes); best fit lines; type detection (is_classifier, etc.); exceptions.
    12. Documentation and Sphinx.
    13. Contributing: Git/branch management; issues, milestones, and labels; Waffle board.
    13. User testing and research.
    https://flic.kr/p/5EsRYP
  • #4 It all started in an ivory tower somewhere. Once upon a time, you had to go to school to learn machine learning. And you'd spend years and ultimately specialize in a particular model family (Bayesian methods, Gaussian processes, support vector machines) ... and get very, very good at tuning those models.
  • #5 Then everything got really easy, really fast. In 2010, Scikit-Learn was publicly released (by INRIA), and it started to grow very quickly. Suddenly there were dozens of models at the fingertips of any Python programmer.
  • #6 Scikit-Learn has so much going for it: models, models, models; also transformers, vectorization tools, sample datasets, and pipelines. But arguably the best part is the consistent API. You can plug the same data into nearly any of the models and it will work! So it becomes just an optimization problem: you can loop through all the models and just pick the one with the best score. It almost doesn't matter which model you use, or why or how it's working. http://derekerdman.com/ilovemilkshakes/january2009/DO_IT/haircuts_try_them_all.jpg
  • #7 Slide showing MLaaS
  • #8 Except that hyperparameter space is large and grid search is slow if you don't already know what you're looking for: the alpha/penalty for regularization, the kernel function in a support vector machine, the leaves or depth of a decision tree, the neighbors used in a nearest neighbor classifier, the clusters in k-means clustering. https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Blindfold_%28PSF%29.png/1200px-Blindfold_%28PSF%29.png
  • #9 http://pages.cs.wisc.edu/~arun/vision/SIGMODRecord15.pdf
  • #10 The answer is visual steering. Leverage human pattern recognition and engagement. Get to better models, faster, with more insight. Overview first; zoom and filter; details on demand.
  • #11 Enhance model selection and evaluation. The goal is not to replace other libraries. Model visualization, not data visualization.
  • #14 YB extends the Scikit-Learn API. This is a high-dimensional visualization problem.
  • #15 For classification, we potentially want to see if there is good separability. Are some features more predictive than others? The unit circle requires normalization. Points dropped in the middle are pulled out to the outer edges. The ordering of features matters; automatic ordering uses optimization to minimize the amount of overlap.
  • #17 Feature engineering requires understanding of the relationships between features. Visualize pairwise relationships as a heatmap. Pearson correlation shows us strong correlations => potential collinearity. Covariance helps us understand the sequence of relationships. Other correlation metrics: we just used the ones implemented in numpy, but we are looking to expand.
  • #18 Frequency distribution of the top 50 tokens. Stochastic Neighbor Embedding: decomposition, then projection into a 2D scatterplot. Visual part-of-speech tagging.
  • #19 YB extends the Scikit-Learn API
  • #20 Can we quickly detect class imbalance issues? Stratified sampling, oversampling, and getting more data are tricks that will help us balance classes. But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference. As this gets into multiclass problems, domination could be harder to see and really affect modeling.
  • #21 Receiver operating characteristic / area under curve; class imbalance; classification report heatmap (quickly identify strengths and weaknesses of a model, F1 vs. Type I and Type II error); visual confusion matrix (misclassification on a per-class basis).
  • #22 Where/why/how is the model performing well or badly? Prediction error plot: the 45-degree line is theoretical perfection. Residuals plot: the zero line is no error. A change in the amount of variance between x and y, or along the x axis, suggests heteroscedasticity.
  • #23 YB extends the Scikit-Learn API
  • #26 Which regularization technique to use: Lasso (L1), Ridge (L2), or ElasticNet (L1 + L2)? Regularization uses a norm to penalize complexity at a rate, alpha. The higher the alpha, the more the regularization. Complexity minimization reduces bias in the model, but increases variance. Goal: select the smallest alpha such that error is minimized. Visualize the tradeoff. Surprising things to see: higher alpha increasing error, alpha jumping around, etc. Embed R2, MSE, etc. into the graph for quick reference.
  • #33 Want to contribute? Here's some information about how the Yellowbrick API works. YB extends the Scikit-Learn API; where to hook in?
  • #34 Estimators learn from data. They have fit and predict methods.
  • #35 Transformers transform data. They have a transform method.
  • #36 Visualizers can be estimators or transformers. They generally have draw, finalize, and poof methods.
  • #37 Contribute! Needs: new features, testers, blog posts
  • #38 Stickers!