
Yellowbrick: Steering machine learning with visual transformers



In machine learning, model selection is more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on quality models more effectively than exhaustive search.

This talk presents a new Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Yellowbrick is an open source, pure Python project that extends Scikit-Learn with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grained control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.

In this talk, we'll explore not only what you can do with Yellowbrick, but how it works under the hood (since we're always looking for new contributors!). We'll illustrate how Yellowbrick extends the Scikit-Learn and Matplotlib APIs with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process - providing iterative visual diagnostics throughout the transformation of high dimensional data.

Published in: Data & Analytics


  1. Yellowbrick: Steering Machine Learning with Visual Transformers
  2. Rebecca Bilbro Twitter: Github: Email:
  3. Once upon a time ...
  4. And then things got ...
  5. Try them all!

     from sklearn.svm import SVC
     from sklearn.naive_bayes import GaussianNB
     from sklearn.ensemble import AdaBoostClassifier
     from sklearn.neighbors import KNeighborsClassifier
     from sklearn.ensemble import RandomForestClassifier
     from sklearn import cross_validation as cv

     classifiers = [
         KNeighborsClassifier(5),
         SVC(kernel="linear", C=0.025),
         RandomForestClassifier(max_depth=5),
         AdaBoostClassifier(),
         GaussianNB(),
     ]

     kfold = cv.KFold(len(X), n_folds=12)
     max([
         cv.cross_val_score(model, X, y, cv=kfold).mean()
         for model in classifiers
     ])
  6. Except ...
     ● Search is difficult, particularly in high-dimensional space.
     ● Even with clever optimization techniques, there is no guarantee of a solution.
     ● As the search space gets larger, the amount of time increases exponentially.
  7. Solution: Visual Steering
  8. Enter Yellowbrick
     ● Extend the Scikit-Learn API.
     ● Enhance the model selection process.
     ● Tools for feature visualization, visual diagnostics, and visual steering.
     ● Not a replacement for other visualization libraries.
  9. Scikit-Learn Estimator Interface

     # Import the estimator
     from sklearn.linear_model import Lasso

     # Instantiate the estimator
     model = Lasso()

     # Fit the data to the estimator
     model.fit(X_train, y_train)

     # Generate a prediction
     model.predict(X_test)
  10. Yellowbrick Visualizer Interface

      # Import the model and visualizer
      from sklearn.linear_model import Lasso
      from yellowbrick.regressor import PredictionError

      # Instantiate the visualizer
      visualizer = PredictionError(Lasso())

      # Fit the visualizer
      visualizer.fit(X_train, y_train)

      # Score and visualize
      visualizer.score(X_test, y_test)
      visualizer.poof()
  11. How do I select the right features?
  12. Is this room occupied?
      ● Given labelled data with amount of light, heat, humidity, etc.
      ● Which features are most predictive?
      ● How hard is it going to be to distinguish the empty rooms from the occupied ones?
  13. Yellowbrick Feature Visualizers
      Use RadViz or parallel coordinates to look for class separability.
  14. Yellowbrick Feature Visualizers
      Use Rank2D for pairwise feature analysis.
  15. Yellowbrick Feature Visualizers - for text, too!
      Visualize top tokens, document distribution & part-of-speech tagging.
  16. What’s the right model to use?
  17. Why isn’t my model predictive?
      ● What to do with a low-accuracy classifier?
      ● Check for class imbalance.
      ● Visual cue that we might try stratified sampling, oversampling, or getting more data.
  18. Yellowbrick Score Visualizers
      Visualize accuracy and begin to diagnose problems.
  19. Yellowbrick Score Visualizers
      Visualize the distribution of error to diagnose heteroscedasticity.
  20. How do I tune this thing?
  21. What’s the right k?
      ● How many clusters do you see?
      ● How do you pick an initial value for k in k-means clustering?
      ● How do you know whether to increase or decrease k?
      ● Is partitive clustering the right choice?
  22. Hyperparameter Tuning with Yellowbrick
      ● Higher silhouette scores mean denser, more separated clusters.
      ● The elbow shows the best value of k… or suggests a different algorithm.
  23. Hyperparameter Tuning with Yellowbrick
      ● Should I use Lasso, Ridge, or ElasticNet?
      ● Is regularization even working?
  24. Got an idea?
  25. Estimators

      The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression, or clustering algorithm, or a transformer that extracts/filters useful features from raw data.

      class Estimator(object):
          def fit(self, X, y=None):
              """
              Fits estimator to data.
              """
              # set state of self
              return self

          def predict(self, X):
              """
              Predict response of X
              """
              # compute predictions pred
              return pred
  26. Transformers

      Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X'.

      class Transformer(Estimator):
          def transform(self, X):
              """
              Transforms the input data.
              """
              # transform X to X_prime
              return X_prime
  27. Visualizers

      A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.

      class Visualizer(Estimator):
          def draw(self):
              """
              Draw the data
              """

          def finalize(self):
              """
              Complete the figure
              """

          def poof(self):
              """
              Show the figure
              """
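A minimal contributor-style sketch of this pattern. The class and attribute names below are mine, not part of Yellowbrick; only the fit/draw/finalize/poof contract comes from the slides, and I subclass Scikit-Learn's `BaseEstimator` to keep the example dependency-light:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running without a display

import matplotlib.pyplot as plt
import numpy as np
from sklearn.base import BaseEstimator

class TargetHistogram(BaseEstimator):
    """Hypothetical visualizer: draws the distribution of the target vector."""

    def __init__(self, bins=10, ax=None):
        self.bins = bins
        self.ax = ax or plt.gca()

    def fit(self, X, y=None):
        # "Learning" here is just computing what we need to draw
        self.counts_, self.edges_ = np.histogram(y, bins=self.bins)
        self.draw()
        return self  # estimators return self from fit

    def draw(self):
        widths = np.diff(self.edges_)
        self.ax.bar(self.edges_[:-1], self.counts_, width=widths, align="edge")

    def finalize(self):
        self.ax.set_title("Target distribution")
        self.ax.set_xlabel("y")
        self.ax.set_ylabel("count")

    def poof(self):
        self.finalize()
        plt.show()

# Usage: the same fit interface as any other estimator
y = np.random.RandomState(0).normal(size=500)
viz = TargetHistogram(bins=20).fit(X=None, y=y)
viz.poof()
```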
  28. Thank you! Twitter: Github: Email: