Arun Kumar did a survey of the analytical process

He’s going to crop up in a bit in a more interesting way

This feels right to me; and hopefully you see something similar.

Machine learning is about learning from example

And works on instances (examples)

Cite: http://pages.cs.wisc.edu/~arun/vision/SIGMODRecord15.pdf

analysts typically use an iterative exploratory process

Are some features more predictive than others?

Visualize pairwise relationships as a heatmap

Pearson shows us strong correlations => potential collinearity

Covariance helps us understand the direction and scale of relationships
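
A minimal sketch of how this pairwise ranking might be drawn with Yellowbrick's Rank2D visualizer. X and y are assumed to be a feature matrix and target already loaded; note that newer Yellowbrick releases spell poof() as show().

    from yellowbrick.features import Rank2D

    # Rank2D computes a pairwise metric across all features and draws it as a heatmap
    visualizer = Rank2D(algorithm='pearson')   # or algorithm='covariance'
    visualizer.fit(X, y)
    visualizer.transform(X)                    # computes the ranking and draws the grid
    visualizer.poof()                          # finalize and display the figure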

Each instance plotted in a scatter plot.

Projected dataset can be analyzed along axes of principal variation

Can be interpreted to determine if spherical distance metrics can be utilized.

Can also be plotted in three dimensions to attempt to visualize more components and get a better sense of the distribution in high dimensions
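
A sketch of the PCA projection plot under the same assumptions about X and y. The visualizer has been renamed across Yellowbrick releases (PCADecomposition in older versions, PCA in newer ones), and the parameters that control class coloring and 3D projection also vary, so treat the details below as illustrative.

    from yellowbrick.features.pca import PCADecomposition  # named PCA in newer releases

    # Project the instances onto the first two principal components and scatter-plot them
    visualizer = PCADecomposition(scale=True)
    visualizer.fit_transform(X, y)
    visualizer.poof()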

t-distributed Stochastic Neighbor Embedding (t-SNE): decomposition, then projection into a 2D scatterplot
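
A sketch of the text t-SNE plot, assuming docs is a vectorized corpus (e.g. TF-IDF vectors) and labels holds the document categories.

    from yellowbrick.text import TSNEVisualizer

    # Decompose the document vectors, then project them into 2D with t-SNE
    tsne = TSNEVisualizer()
    tsne.fit(docs, labels)
    tsne.poof()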

Visual part-of-speech tagging

A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see if the model fares better during cross-validation.

Many model forms describe the underlying impact of features relative to each other. This visualizer uses this attribute to rank and plot relative importances.
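
A sketch of the feature importance plot; the estimator is an arbitrary choice and the import path has moved between yellowbrick.features and yellowbrick.model_selection across releases.

    from sklearn.ensemble import RandomForestClassifier
    from yellowbrick.features import FeatureImportances  # yellowbrick.model_selection in newer releases

    # Rank features by the fitted model's feature_importances_ (or coef_) attribute
    viz = FeatureImportances(RandomForestClassifier(n_estimators=100))
    viz.fit(X, y)
    viz.poof()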

Features are ranked by the model’s coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.

RFE requires a specified number of features to keep; however, it is often not known in advance how many features are valid.

To find the optimal number of features, cross-validation is used with RFE to score different feature subsets and select the best-scoring collection of features.

The RFECV visualizer plots the number of features in the model along with their cross-validated test score and variability, and visualizes the selected number of features.
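
A sketch using Yellowbrick's RFECV visualizer; the estimator, cv, and scoring values are illustrative.

    from sklearn.linear_model import LogisticRegression
    from yellowbrick.model_selection import RFECV

    # Recursively eliminate features, cross-validating the score at each subset size
    viz = RFECV(LogisticRegression(), cv=5, scoring='f1_weighted')
    viz.fit(X, y)
    viz.poof()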

Classification report heatmap - Quickly identify strengths & weaknesses of model - F1 vs Type I & Type II error

Visual confusion matrix - misclassification on a per-class basis
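
Sketches of both classifier score visualizers; the estimator, the classes list, and the train/test split names are placeholders.

    from sklearn.naive_bayes import GaussianNB
    from yellowbrick.classifier import ClassificationReport, ConfusionMatrix

    # Heatmap of precision, recall, and F1 per class
    report = ClassificationReport(GaussianNB(), classes=classes)
    report.fit(X_train, y_train)
    report.score(X_test, y_test)
    report.poof()

    # Per-class misclassification counts
    cm = ConfusionMatrix(GaussianNB(), classes=classes)
    cm.fit(X_train, y_train)
    cm.score(X_test, y_test)
    cm.poof()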

The discrimination threshold is the probability or score at which the positive class is chosen over the negative class.

Generally, this is set to 50% but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.

One common use is to determine cases that require special treatment.

For example, a fraud prevention application might use a classification algorithm to determine if a transaction is likely fraudulent and needs to be investigated in detail.

Spam/not spam

Precision: An increase in precision is a reduction in the number of false positives; this metric should be optimized when the cost of special treatment is high (e.g. wasted time in fraud prevention or missing an important email).

Recall: An increase in recall decreases the likelihood that the positive class is missed; this metric should be optimized when it is vital to catch the case even at the cost of more false positives. (e.g. SPAM v. VIRUS)

F1 Score: The F1 score is the harmonic mean between precision and recall. The fbeta parameter determines the relative weight of precision and recall when computing this metric, by default set to 1 or F1. Optimizing this metric produces the best balance between precision and recall.

Queue Rate: The “queue” is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesn’t (e.g. spam filter), this could be optimized to ensure the inbox stays clean.
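
A sketch of the discrimination threshold visualizer (binary classification only); the estimator and data names are placeholders.

    from sklearn.linear_model import LogisticRegression
    from yellowbrick.classifier import DiscriminationThreshold

    # Plot precision, recall, F1, and queue rate as the decision threshold varies
    viz = DiscriminationThreshold(LogisticRegression())
    viz.fit(X, y)   # y must be binary
    viz.poof()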

Prediction error plot - the 45-degree line is the theoretical perfect fit

Residuals plot - 0 line is no error

Look for a change in the amount of variance between x and y, or along the x axis => heteroscedasticity
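
Sketches of both regressor score visualizers, again assuming a conventional train/test split.

    from sklearn.linear_model import Ridge
    from yellowbrick.regressor import PredictionError, ResidualsPlot

    # Predicted vs. actual values; the 45-degree line marks a perfect fit
    pe = PredictionError(Ridge())
    pe.fit(X_train, y_train)
    pe.score(X_test, y_test)
    pe.poof()

    # Residuals vs. predicted values; look for structure or changing variance
    rp = ResidualsPlot(Ridge())
    rp.fit(X_train, y_train)
    rp.score(X_test, y_test)
    rp.poof()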

Stratified sampling, oversampling, getting more data -- tricks will help us balance

But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference

As this gets into multiclass problems, domination could be harder to see and really affect modeling
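
A sketch of the class balance plot. Older Yellowbrick releases scored a fitted classifier; newer releases plot the target distribution directly, roughly as follows (the classes list is a placeholder).

    from yellowbrick.target import ClassBalance  # lived in yellowbrick.classifier in older releases

    # Bar chart of instances per class; large imbalances suggest stratified
    # sampling, oversampling, or collecting more data
    viz = ClassBalance(labels=classes)
    viz.fit(y)
    viz.poof()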

How much the estimator benefits from more data (e.g. do we have “enough data” or will the estimator get better if used in an online fashion).

If the estimator is more sensitive to error due to variance vs. error due to bias.

If the training and cross validation scores converge together as more data is added (shown in the left figure), then the model will probably not benefit from more data. If the training score is much greater than the validation score (as shown in the right figure) then the model probably requires more training examples in order to generalize more effectively.
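
A sketch of the learning curve visualizer; the estimator and scoring metric are illustrative.

    from sklearn.naive_bayes import GaussianNB
    from yellowbrick.model_selection import LearningCurve

    # Training vs. cross-validated score as the number of training instances grows
    viz = LearningCurve(GaussianNB(), cv=5, scoring='f1_weighted')
    viz.fit(X, y)
    viz.poof()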

For a support vector classifier, gamma is the coefficient of the RBF kernel. It controls the influence of a single example. The larger gamma is, the tighter the support vector is around single points (overfitting the model).

In this visualization we see a definite inflection point around gamma=0.1. At this point the training score climbs rapidly as the SVC memorizes the data, while the cross-validation score begins to decrease as the model cannot generalize to unseen data.
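
A sketch of the validation curve for the SVC gamma example; the parameter range is illustrative.

    import numpy as np
    from sklearn.svm import SVC
    from yellowbrick.model_selection import ValidationCurve

    # Sweep gamma over several orders of magnitude and plot train vs. CV score
    viz = ValidationCurve(
        SVC(), param_name='gamma',
        param_range=np.logspace(-6, -1, 10),
        cv=5, logx=True,
    )
    viz.fit(X, y)
    viz.poof()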

Regularization uses a Norm to penalize complexity at a rate, alpha

The higher the alpha, the more the regularization.

Complexity minimization reduces variance in the model, but increases bias

Goal: select the smallest alpha such that error is minimized

Visualize the tradeoff

Surprising to see: higher alpha increasing error, alpha jumping around, etc.

Embed R2, MSE, etc into the graph - quick reference
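
A sketch of alpha selection with Yellowbrick, wrapping one of scikit-learn's cross-validated regularized regressors.

    from sklearn.linear_model import LassoCV
    from yellowbrick.regressor import AlphaSelection

    # The wrapped *CV model tries a range of alphas; the visualizer plots error vs. alpha
    viz = AlphaSelection(LassoCV())
    viz.fit(X, y)
    viz.poof()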

Unlike decomposition methods such as PCA and SVD, manifolds generally use nearest-neighbors approaches to embedding, allowing them to capture non-linear structures that would be otherwise lost.

The projections that are produced can then be analyzed for noise or separability to determine if it is possible to create a decision space in the data.
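
A sketch of the manifold visualizer (available in more recent Yellowbrick releases); the choice of t-SNE as the embedding method is illustrative.

    from yellowbrick.features import Manifold

    # Embed the instances into two dimensions with a nearest-neighbor manifold method
    viz = Manifold(manifold='tsne')   # alternatives include 'lle', 'isomap', 'mds', ...
    viz.fit_transform(X, y)
    viz.poof()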

Estimators have fit and predict methods

Transformers have a transform method

Visualizers generally have draw, finalize, and poof methods

- 1. Learning machine learning with Yellowbrick
- 2. The Model Selection Triple Arun Kumar http://bit.ly/2abVNrI Feature Analysis Algorithm Selection Hyperparameter Tuning
- 3. Feature Analysis
- 4. Use radviz or parallel coordinates to look for class separability Yellowbrick Feature Visualizers
- 5. ● Based on spring tension minimization algorithm. ● Features equally spaced on a unit circle, instances dropped into circle. ● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance. ● Classification coloring based on labels in data. Radial Visualization
- 6. Before and after standardization Parallel Coordinates
- 7. Parallel Coordinates ● Visualize clusters in data. ● Points represented as connected line segments. ● Each vertical line represents one attribute (x-axis units not meaningful). ● One set of connected line segments represents one instance. ● Points that tend to cluster will appear closer together.
- 8. Use Rank2D for pairwise feature analysis, find strong correlations (potential collinearity?) Rank2D
- 9. Rank2D ● Feature engineering requires understanding of the relationships between features ● Visualize pairwise relationships as a heatmap ● Pearson shows us strong correlations, potential collinearity ● Covariance helps us understand the direction and scale of relationships
- 10. PCA Projection Plots ● Uses PCA to decompose high dimensional data into two or three dimensions ● Each instance plotted in a scatter plot. ● Projected dataset can be analyzed along axes of principal variation ● Can be interpreted to determine if spherical distance metrics can be utilized.
- 11. PCA Projection Plots Can also plot in 3D to visualize more components & get a better sense of distribution in high dimensions
- 12. Visualize top tokens, document distribution & part-of-speech tagging Feature Visualizers for Text
- 13. How do I select the right features?
- 14. Feature Importance Plot ● Need to select the minimum required features to produce a valid model. ● The more features a model contains, the more complex it is (sparse data, errors due to variance). ● This visualizer ranks and plots underlying impact of features relative to each other.
- 15. Recursive Feature Elimination
- 16. Recursive Feature Elimination ● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached. ● Features are ranked by internal model’s coef_ or feature_importances_ ● Attempts to eliminate dependencies and collinearity that may exist in the model.
- 17. Model Evaluation
- 18. Evaluating Classifiers ● How well did predicted values match actual labeled values? ● In a 2-class problem, there are two ways to be “right”: ○ Classifier correctly identifies cases (aka “True Positives”) ○ Classifier correctly identifies non-cases (aka “True Negatives”) ● ...and two ways to be “wrong”: ○ Classifier incorrectly identifies a non-case as a case (aka “False Positive” or “Type I Error”) ○ Classifier incorrectly identifies a case as a non-case (aka “False Negative” or “Type II Error”)
- 19. Metrics for Classification Metric Measures In Scikit-learn Precision How many selected are relevant? from sklearn.metrics import precision_score Recall How many relevant were selected? from sklearn.metrics import recall_score F1 Weighted average of precision & recall from sklearn.metrics import f1_score Confusion Matrix True positives, true negatives, false positives, false negatives from sklearn.metrics import confusion_matrix ROC True positive rate vs. false positive rate, as classification threshold varies from sklearn.metrics import roc_curve AUC Aggregate accuracy, as classification threshold varies from sklearn.metrics import auc
- 20. accuracy = (true positives + true negatives) / total precision = true positives / (true positives + false positives) recall = true positives / (false negatives + true positives) F1 score = 2 * ((precision * recall) / (precision + recall))
- 21. Visualize accuracy and begin to diagnose problems Yellowbrick Score Visualizers
- 22. Classification Report from sklearn.metrics import classification_report as cr print(cr(y, yhat, target_names=target_names)) ● includes same basic info as confusion matrix ● 3 different evaluation metrics: precision, recall, F1 score ● includes class labels for interpretability
- 23. Classification Heatmaps Precision: of those labelled edible, how many actually were? Is it better to have false positives here or here? Recall: how many of the poisonous ones did our model find?
- 24. ROC-AUC from sklearn.metrics import roc_curve, auc fpr, tpr, thresholds = roc_curve(y,yhat) roc_auc = auc(fpr, tpr) Visualize tradeoff between classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives) ● straight horizontal line -> perfect classifier ● pulling a lot toward the upper left corner -> good accuracy ● exactly aligned with the diagonal -> coin toss
- 25. Getting more right comes at the expense of getting more wrong ROC-AUC
- 26. ROC-AUC for Multiclass Classification ROC curves are typically used in binary classification, but Yellowbrick allows for multiclass classification evaluation by binarizing output (per-class) or using one-vs-rest (micro score) or one-vs-all (macro score) strategies of classification.
- 27. Confusion Matrix ● takes as an argument actual values and predicted values generated by the fitted model ● outputs a confusion matrix from sklearn.metrics import confusion_matrix
- 28. I have a lot of classes; how does my model perform on each? Do I care about certain classes more than others? Confusion Matrix
- 29. Class Prediction Error Plot Similar to confusion matrix, but sometimes more interpretable
- 30. Discrimination Threshold Visualizer * for binary classification only ● Probability or score at which positive class is chosen over negative. ● Generally set to 50% ● Can be adjusted to increase/decrease sensitivity to false positives or other application factors ● Cases that require special treatment?
- 31. Evaluating Regressors ● How well does the model describe the training data? ● How well does the model predict out-of-sample data? ○ Goodness-of-fit ○ Randomness of residuals ○ Prediction error
- 32. Metrics for Regression Metric Measures In Scikit-learn Mean Square Error (MSE, RMSE) distance between predicted values and actual values (more sensitive to outliers) from sklearn.metrics import mean_squared_error Absolute Error (MAE, RAE) distance between predicted values and actual values (less sensitive to outliers) from sklearn.metrics import mean_absolute_error, median_absolute_error Coefficient of Determination (R-Squared) % of variance explained by the regression; how well future samples are likely to be predicted by the model from sklearn.metrics import r2_score
- 33. Visualize the distribution of error to diagnose heteroscedasticity Yellowbrick Score Visualizers
- 34. Prediction Error Plots from sklearn.model_selection import cross_val_predict ● Cross-validation is a way of measuring model performance. ● Divide data into training and test splits; fit model on training, predict on test. ● Use cross_val_predict to visualize prediction errors as a scatterplot of the predicted and actual values.
- 35. Prediction Error Plots
- 36. Plotting Residuals ● Standardized y-axis ● Model prediction on x-axis. ● Model accuracy on y-axis; distance from line at 0 indicates how good/bad the prediction was for that value. ● Check whether residuals are consistent with random error; data points should appear evenly dispersed around the plotted line. Should not be able to predict error. ● Visualize train and test data with different colors.
- 37. Plotting Residuals
- 38. Metrics for Clustering ...
- 39. Maybe? ● Silhouette scores ● Elbow curves Metrics for Clustering ...
- 40. Why is my F1/R2 so low?
- 41. ● What to do with a low- accuracy classifier? ● Check for class imbalance. ● Visual cue that we might try stratified sampling, oversampling, or getting more data. Class Balance
- 42. Cross Validation Scores ● Real world data are often distributed somewhat unevenly; the fitted model likely to perform better on some sections of data than others. ● See cross-validated scores as a bar chart (one bar for each fold) with average score across all folds plotted as dotted line. ● Explore variations in performance using different cross validation strategies.
- 43. Learning Curve ● Relationship of the training score vs. the cross validated test score for an estimator. ● Do we need more data? If the scores converge together, then probably not. If the training score is much higher than the validation score, then yes. ● Is the estimator more sensitive to error due to variance or error due to bias?
- 44. Validation Curve ● Plot the influence of a single hyperparameter on the training and test data. ● Is the estimator under- or over- fitting for some hyperparameter values? For SVC, gamma is the coefficient of the RBF kernel. The larger gamma is, the tighter the support vector is around single points (e.g. overfitting). Here around gamma=0.1 the SVC memorizes the data.
- 45. Hyperparameter Tuning
- 46. Hyperparameters ● When we call fit() on an estimator, it learns the parameters of the algorithm that make it fit the data best. ● However, some parameters are not directly learned within an estimator. These are the ones we provide when we instantiate the estimator. ○ alpha for LASSO or Ridge ○ C, kernel, and gamma for SVC ● These parameters are often referred to as hyperparameters.
- 47. Examples: ● Alpha/penalty for regularization ● Kernel function in support vector machine ● Leaves or depth of a decision tree ● Neighbors used in a nearest neighbor classifier ● Clusters in a k-means clustering Hyperparameters
- 48. How to pick the best hyperparameters? ● Use the defaults ● Pick randomly ● Search parameter space for the best score (e.g. grid search) … Except that hyperparameter space is large and gridsearch is slow if you don’t know already what you’re looking for. Hyperparameters
- 49. How do I tune this model?
- 50. Should I use Lasso, Ridge, or ElasticNet? Is regularization even working? More alpha => less complexity Reduced variance, but increased bias Alpha selection with Yellowbrick
- 51. ● How many clusters do you see? ● How do you pick an initial value for k in k-means clustering? ● How do you know whether to increase or decrease k? ● Is partitive clustering the right choice? What’s the right k?
- 52. higher silhouette scores mean denser, more separate clusters The elbow shows the best value of k… Or suggests a different algorithm K-selection with Yellowbrick
- 53. Manifold Visualization ● Embed instances described by many dimensions into 2. ● Look for latent structures in the data, noise, separability. ● Is it possible to create a decision space in the data? ● Unlike PCA or SVD, manifolds use nearest neighbors, can capture non- linear structures.
- 54. Using Yellowbrick
- 55. Install: $ pip install yellowbrick Upgrade: $ pip install -U yellowbrick Anaconda: $ conda install -c districtdatalabs yellowbrick Quickstart
- 56. # Import the estimator from sklearn.linear_model import Lasso # Instantiate the estimator model = Lasso() # Fit the data to the estimator model.fit(X_train, y_train) # Generate a prediction model.predict(X_test) Scikit-Learn Estimator Interface
- 57. # Import the model and visualizer from sklearn.linear_model import Lasso from yellowbrick.regressor import PredictionError # Instantiate the visualizer visualizer = PredictionError(Lasso()) # Fit visualizer.fit(X_train, y_train) # Score and visualize visualizer.score(X_test, y_test) visualizer.poof() Yellowbrick Visualizer Interface
- 58. The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data. class Estimator(object): def fit(self, X, y=None): """ Fits estimator to data. """ # set state of self return self def predict(self, X): """ Predict response of X """ # compute predictions pred return pred Scikit-learn Estimators
- 59. Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X′. class Transformer(Estimator): def transform(self, X): """ Transforms the input data. """ # transform X to X_prime return X_prime Scikit-learn Transformers
- 60. A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process. class Visualizer(Estimator): def draw(self): """ Draw the data """ self.ax.plot() def finalize(self): """ Complete the figure """ self.ax.set_title() def poof(self): """ Show the figure """ plt.show() Yellowbrick Visualizers
- 61. Contributing
- 62. Yellowbrick is an open source project that is supported by a community who will gratefully and humbly accept any contributions you might make to the project. Large or small, any contribution makes a big difference; and if you’ve never contributed to an open source project before, we hope you will start with Yellowbrick!
- 63. Please star Yellowbrick on GitHub! github.com/DistrictDataLabs/yellowbrick
