- 1. SHapley Additive exPlanations (SHAP) Ted Discussion
- 3. What’s all the fuss about?
  shapley
  ● Game theory approach to assigning “credit” to the members of a cooperative group.
  ● Shapley values calculate the importance of a feature by comparing what a model predicts with and without that feature. Since the order in which a model sees features can affect its predictions, the comparison is done over every possible order, so that features are compared fairly. (source)
  shap
  ● Shapley quantifies the contribution each player brings to a game; SHAP quantifies the contribution each feature brings to the prediction made by the model.
  ● One game : one observation. SHAP is local.
  ● Lundberg, Scott M., and Su-In Lee. “A Unified Approach to Interpreting Model Predictions.” Advances in Neural Information Processing Systems (2017).
  ● Implementation of Shapley-value estimators (TreeSHAP, KernelSHAP).
  ● Connects LIME and Shapley values.
  ● One line of Python gives you feature explanations.
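The order-dependence mentioned above can be made concrete with a tiny two-player cooperative game (the payoff numbers below are invented purely for illustration):

```python
# Toy cooperative game: v(coalition) = total payoff of that coalition.
# All numbers are made up to show why order matters.
v = {
    frozenset(): 0,
    frozenset({"A"}): 10,
    frozenset({"B"}): 4,
    frozenset({"A", "B"}): 20,
}

# Marginal contribution of player B depends on who joined first:
b_alone = v[frozenset({"B"})] - v[frozenset()]              # B joins first
b_after_a = v[frozenset({"A", "B"})] - v[frozenset({"A"})]  # B joins after A
print(b_alone, b_after_a)  # 4 vs 10: order changes B's apparent credit

# Shapley value of B: average B's marginal contribution over both orders.
shap_b = (b_alone + b_after_a) / 2
print(shap_b)  # 7.0
```

Averaging over all join orders is exactly what makes the credit assignment fair.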
- 9. ● Imagine a machine learning model that predicts a person’s income from their age, gender, and job.
  ● Shapley values are based on the idea that the outcome of every possible combination (or coalition) of players should be considered to determine the importance of a single player. In our case, this corresponds to every possible combination of f features (f going from 0 to F, where F is the total number of available features; in our example F = 3).
  ● In math, this collection is called a “power set” and can be represented as a tree. h/t this article
- 10. ● The cardinality of a power set is 2^n, where n is the number of elements in the original set.
  ● SHAP (conceptually) requires training a distinct predictive model for each distinct coalition in the power set: 2^F models.
  ● The models are otherwise identical: same hyperparameters, same training data (the full dataset). The only thing that changes is the set of features included in the model.
  ● Imagine that we have already trained our 8 models on the same training data. Take a new observation (call it x₀) and see what the 8 different models predict for that same observation x₀.
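The power set for the three-feature income example can be enumerated with the standard library (feature names follow the slides; the code is purely illustrative):

```python
from itertools import combinations

features = ["age", "gender", "job"]  # F = 3, as in the income example

# Enumerate the power set: every coalition of features, sizes 0..F.
power_set = [
    subset
    for size in range(len(features) + 1)
    for subset in combinations(features, size)
]

print(len(power_set))  # 2 ** 3 = 8 coalitions, hence 8 models to train
for coalition in power_set:
    print(coalition)
```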
- 11. ● Two nodes connected by an edge differ by exactly one feature; the gap between the predictions of the two connected nodes is due to that additional feature. This is called the “marginal contribution” of a feature.
  ● Each edge represents the marginal contribution brought by a feature.
  ● The overall effect of Age on the final model (i.e. the SHAP value of Age for x₀) combines the marginal contributions of Age across all the models: the edges highlighted in red.
  ● How SHAP figures out the weights: next section!
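Those “red edges” can be sketched directly, using invented predictions for the 8 coalition models on a single observation x₀ (all numbers below are made up; only the structure matters):

```python
# Hypothetical predictions of the 8 coalition models for one observation x0.
# Keys are the feature sets each model was trained on; values are invented.
pred = {
    frozenset(): 50.0,                       # null model: just the mean income
    frozenset({"age"}): 60.0,
    frozenset({"gender"}): 52.0,
    frozenset({"job"}): 70.0,
    frozenset({"age", "gender"}): 63.0,
    frozenset({"age", "job"}): 85.0,
    frozenset({"gender", "job"}): 72.0,
    frozenset({"age", "gender", "job"}): 88.0,
}

# Each "red edge": marginal contribution of Age when added to a coalition S.
without_age = [s for s in pred if "age" not in s]
edges = {tuple(sorted(s)): pred[s | {"age"}] - pred[s] for s in without_age}
for coalition, delta in sorted(edges.items()):
    print(coalition, "->", delta)
```

Note the four marginal contributions of Age differ (here 10, 11, 15, 16), which is why a weighted average is needed.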
- 12. shap specifics
- 13–16. Shapley Axioms (efficiency, symmetry, dummy player, additivity)
- 17. Shapley Equation
- 18. Shapley Equation: φᵢ(v) = Σ over S ⊆ N ∖ {i} of [ |S|! · (|N| − |S| − 1)! / |N|! ] · (v(S ∪ {i}) − v(S)). For a subset S, the weight is the number of permutations of S times the number of permutations of the complement N ∖ (S ∪ {i}), divided by the total number of orderings |N|!.
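The equation can be verified with a short script. The coalition-model predictions below are invented for illustration; `shapley` implements the weight |S|!·(F − |S| − 1)!/F! directly, and the efficiency axiom (the values sum to the full-model prediction minus the null-model prediction) serves as a sanity check:

```python
from itertools import combinations
from math import factorial

# Invented predictions of the 2^F coalition models for one observation x0.
pred = {
    frozenset(): 50.0,
    frozenset({"age"}): 60.0,
    frozenset({"gender"}): 52.0,
    frozenset({"job"}): 70.0,
    frozenset({"age", "gender"}): 63.0,
    frozenset({"age", "job"}): 85.0,
    frozenset({"gender", "job"}): 72.0,
    frozenset({"age", "gender", "job"}): 88.0,
}
features = ["age", "gender", "job"]

def shapley(i):
    """Shapley value of feature i: weighted sum of marginal contributions."""
    F = len(features)
    others = [f for f in features if f != i]
    total = 0.0
    for size in range(F):  # all subsets S of features \ {i}
        for S in combinations(others, size):
            S = frozenset(S)
            # weight = |S|! * (F - |S| - 1)! / F!
            w = factorial(len(S)) * factorial(F - len(S) - 1) / factorial(F)
            total += w * (pred[S | {i}] - pred[S])
    return total

values = {f: shapley(f) for f in features}
print(values)
# Efficiency axiom: values sum to full-model minus null-model prediction.
print(sum(values.values()), pred[frozenset(features)] - pred[frozenset()])
```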
- 19. shap example
- 20. Shapley in ML
  ● The Shapley value is computed by perturbing input features and observing how those changes affect the final model prediction.
  ● Shapley value = the average marginal contribution of a feature to the overall model score.
  ● For ML models, it’s not possible to simply “exclude” a feature when making a prediction.
  ● The formulation of Shapley values in an ML context simulates “excluded” features by sampling from the empirical distribution of the feature’s values and averaging over multiple samples (Monte Carlo with other data samples’ feature values: FrankenFeatures!).
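A minimal stdlib sketch of this Monte Carlo scheme: a random feature ordering is drawn, features “after” the feature of interest are spliced in from a randomly drawn background row (the FrankenFeature trick). The toy model and dataset below are entirely made up:

```python
import random

def f(x):
    """Toy black-box model: income ~ 2 * x[0] + 10 * x[2] (invented)."""
    return 2 * x[0] + 10 * x[2]

background = [[1, 0, 2], [3, 1, 0], [2, 1, 1], [0, 0, 3]]  # fake dataset
x0 = [2, 1, 3]                                             # observation to explain
F = len(x0)

def mc_shapley(i, n_samples=2000, seed=0):
    """Monte Carlo estimate of the Shapley value of feature i for x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice(background)   # random background row
        order = list(range(F))
        rng.shuffle(order)           # random feature ordering
        pos = order.index(i)
        # Features at or before i in the ordering come from x0, the rest from z.
        with_i = [x0[j] if order.index(j) <= pos else z[j] for j in range(F)]
        without_i = list(with_i)
        without_i[i] = z[i]          # "exclude" i by swapping in z's value
        total += f(with_i) - f(without_i)
    return total / n_samples

print([round(mc_shapley(i), 2) for i in range(F)])
```

For this linear toy model the exact values are coefᵢ · (x0ᵢ − mean(backgroundᵢ)), so the estimates should land near 1.0, 0.0, and 15.0.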
- 21. Shap Package Explainers for
  ● Tree models (e.g. XGBoost)
  ● Deep explainer (neural nets)
  ● Linear explainer (regression)
- 22. Shapley Usage - Beeswarm Plot
- 23. Shapley Usage - Waterfall Plot
- 24. Shapley Usage - Force Plot
- 25. Advantages / Disadvantages
  Advantages:
  ● Everyone likes explainability.
  ● The SHAP Python package takes two lines and is fairly fast (especially for tree-based models).
  ● Model agnostic (treats the model as a black box).
  ● Computed per data point, so we get granularity down to a single observation and can aggregate over the whole model or over subsets of the data.
  Disadvantages:
  ● The brute-force calculation is combinatorial. SHAP uses clever Monte Carlo-style approximations, especially when the model structure is known (think trees), but it is still a compute beast.
  ● Stakeholders (who have not heard Junlin’s talk yet) will mistake SHAP attributions for causation when they only reflect correlation, NOT causation.
  ● SHAP may evaluate the model on unrealistic data (sampled feature combinations need not occur in reality).
  ● There is no native Spark version (so you have to convert PySpark dataframes to pandas).