The harder the question we are trying to solve, the more sophisticated the machine learning models tend to become, which makes them almost impossible to interpret. This can mean more features, more complex algorithms, or more complex patterns.
Explainable AI (XAI) has been a trending topic recently, aiming to explain the outcomes of these models, mostly from the point of view of the end user. In research, however, we tend to treat our machine learning models as black boxes, because we usually focus on performance and rarely need to explain the predictions to anyone else.
In this talk, I will cover why explainability of a model (specifically using SHAP values) is important for the research phase as well, and how it can help not just the end user but also us, the data scientists building the models. We will see how several different ways of looking at a model or its predictions can help us improve performance even before the production phase.
Melih is a data scientist at Riskified, which he joined almost 2.5 years ago. Today, he works mainly on the research and improvement of the ATO product.
Originally from Turkey and coming from an engineering background, he pivoted his way into the Data Science/Machine Learning world to follow his passion for data and AI.
He believes in constant learning and endless curiosity. When not doing DS/ML, you can find him doing any kind of sports or tasting new whisky.
2. Melih Bahar
➔ Data Scientist, Riskified
➔ Born in Turkey
➔ Loves whiskey!
melih.bahar@riskified.com
@melih_bhr
About Me
3. Riskified by the numbers
650+ Global team, nearly 50% in R&D
180+ Countries across the globe
$60B+ Online volume reviewed in 2020
50+ Publicly held companies among our clients
98% Client retention rate in 2020
As of August 2021
4. Account Takeover (ATO)
A quick glance...
An ATO is when a bad actor gains access to another party’s legitimate account.
9. Why
Data Scientist POV
“If you can’t explain it simply, you don’t understand it well enough.”
Albert Einstein
[Diagram: data (x1 … xn) → model → interpretability methods → humans; validation goes beyond performance to trust, informativeness, transferability, and fairness]
11. Explainability vs. Interpretability
Explainability
An explainable model is a function that is too complicated for a human to understand; an additional method is needed to understand how the model works.
Interpretability
A model is interpretable if it is capable of being understood by humans on its own.
12. Explainability
… is NOT causality!
Source: https://tuowang.rbind.io/project/causal-inference-notes/
14. Shapley Values
A Short Introduction
The average of the marginal contributions across all permutations.
Source: https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d
Order matters! e.g. the ordering (Ann, Beth, Cindy) vs. (Beth, Ann, Cindy)
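A toy sketch of the idea (the coalition payoffs below are made up purely for illustration): each player's Shapley value is their marginal contribution, averaged over every possible joining order.

```python
from itertools import permutations

# Toy coalition game with made-up payoffs (illustrative numbers only).
payoff = {
    frozenset(): 0,
    frozenset({"Ann"}): 10,
    frozenset({"Beth"}): 20,
    frozenset({"Cindy"}): 25,
    frozenset({"Ann", "Beth"}): 40,
    frozenset({"Ann", "Cindy"}): 45,
    frozenset({"Beth", "Cindy"}): 50,
    frozenset({"Ann", "Beth", "Cindy"}): 90,
}
players = ["Ann", "Beth", "Cindy"]

# Shapley value = marginal contribution averaged over all joining orders.
orderings = list(permutations(players))
shapley = {p: 0.0 for p in players}
for order in orderings:
    coalition = set()
    for player in order:
        before = payoff[frozenset(coalition)]
        coalition.add(player)
        shapley[player] += (payoff[frozenset(coalition)] - before) / len(orderings)

print(shapley)  # the three values always sum to the full-coalition payoff (90)
```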
15. SHAP
SHapley Additive exPlanations
• Proposed by Lundberg and Lee (2017), based on Shapley values
• A unified approach to explaining the output of any machine learning model: it is model agnostic (see the usage sketch below)
Properties
• Local Accuracy
• Consistency
• Missingness
Assumptions
• Independent Features
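A minimal usage sketch of the shap package, assuming a fitted model and a feature DataFrame X (the names `model` and `X` are placeholders):

```python
import shap  # https://github.com/shap/shap

# Assuming `model` is a fitted tree-based classifier (e.g. XGBoost / LightGBM)
# and `X` is the DataFrame of features we want to explain.
explainer = shap.TreeExplainer(model)      # fast, exact algorithm for tree models
shap_values = explainer.shap_values(X)     # one contribution per feature per row

# For arbitrary models, the model-agnostic KernelExplainer can be used instead,
# usually with a small background sample to keep the runtime reasonable:
# background = shap.sample(X, 100)
# explainer = shap.KernelExplainer(model.predict_proba, background)
```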
16. Global vs. Local
Global Interpretability
Provides explanations about the general behavior of the model over the entire population.
Local Interpretability
Provides explanations for a specific prediction of the model.
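A hedged sketch of how the two views look with the shap plotting helpers, reusing the `explainer`, `shap_values`, and `X` assumed above:

```python
import shap

# Global: distribution of SHAP values per feature over the whole dataset.
shap.summary_plot(shap_values, X)

# Local: explanation of a single prediction (row i). For binary classifiers,
# shap_values / expected_value may be per-class lists; use the positive class.
i = 0
shap.force_plot(explainer.expected_value, shap_values[i], X.iloc[i])
```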
18. Model Comparison
• Not all models provide out-of-the-box feature importances the way tree-based models (e.g. random forests) do.
- SHAP creates a common ground for comparison (see the sketch below).
• Adds “explanation” on top of “performance”
Tree-based Model (Boosting)
PRAUC: 0.829
Kernel-based Model (SVM)
PRAUC: 0.838
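One possible way to put the two models on the same scale is to compare mean absolute SHAP values per feature; `shap_values_gbm` and `shap_values_svm` below are assumed to have been computed for the same X (e.g. via TreeExplainer for the boosting model and KernelExplainer for the SVM).

```python
import numpy as np
import pandas as pd

def mean_abs_shap(shap_vals, feature_names):
    """Mean |SHAP| per feature: a model-agnostic 'importance' on a common scale."""
    return pd.Series(np.abs(shap_vals).mean(axis=0), index=feature_names)

comparison = pd.DataFrame({
    "boosting": mean_abs_shap(shap_values_gbm, X.columns),
    "svm": mean_abs_shap(shap_values_svm, X.columns),
}).sort_values("boosting", ascending=False)
print(comparison.head(10))
```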
19. Model Debugging
Error Analysis on Instances
Looking at individual instances makes it easier to detect when a selected feature has a different effect than the one it should have.
False Positive False Negative
20. Model Debugging
Error Analysis on Incorrect Predictions
We can take the subset of our model's errors and see which features contribute the most to these incorrect classifications.
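A minimal sketch of that error analysis, assuming `y_true` and `y_pred` are aligned with the `X` and `shap_values` from before:

```python
import numpy as np
import shap

errors = np.asarray(y_pred) != np.asarray(y_true)

# Summary plot restricted to the misclassified rows shows which features
# contribute the most to the incorrect predictions.
shap.summary_plot(shap_values[errors], X[errors])

# The same idea works for false positives or false negatives separately:
# fp = (np.asarray(y_pred) == 1) & (np.asarray(y_true) == 0)
# shap.summary_plot(shap_values[fp], X[fp])
```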
21. Model Debugging
Adding Domain Experts to the Loop
Feedback between the analysts and the model (in both directions) helps us get more accurate labels and find logic flaws in the model.
ATO ? Legit ?
22. Feature Selection
• Wrapper-based methods (Boruta, RFE, etc.) may result in suboptimal performance.
• The standard feature importance method of decision trees tends to overestimate the importance of continuous or high-cardinality categorical variables. SHAP reduces this bias (see the sketch below).
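A sketch of SHAP-based selection under these assumptions (`shap_values` and `X` as before; the cutoff k is arbitrary):

```python
import numpy as np

# Rank features by mean absolute SHAP value instead of impurity-based importance.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]

top_k = 20  # arbitrary cutoff for illustration
selected_features = X.columns[ranking[:top_k]]
print(list(selected_features))
```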
23. Feature Selection
Effect of Specific Features
• Helps see how the values of a specific feature affect the predictions.
• Helps decide if a specific feature is helpful or not.
24. Feature Selection
Effect of Specific Features
An unusually high effect of a single feature (relative to the other features) may indicate that it is too strongly correlated with the label.
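The plot used on these slides is SHAP's dependence plot (as the notes below mention); a minimal sketch, where the feature name "ca_14" mirrors the example from the notes:

```python
import shap

# SHAP value of one feature vs. its raw value across the dataset,
# optionally colored by the feature it interacts with most.
shap.dependence_plot("ca_14", shap_values, X)
```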
25. Feature Prioritization
Helps us decide which features (or groups of features) are “more” important to develop in production right now and which ones can wait.
For Production
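One way to turn this prioritization into numbers is to sum the mean |SHAP| per candidate feature group; the group names and their members below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical grouping of the new features into candidate dev tasks.
feature_groups = {
    "group2": ["new_feat_a", "new_feat_b"],
    "group4": ["new_feat_c", "new_feat_d", "new_feat_e"],
}

mean_abs = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)

# Total importance per group: a rough way to decide what to productionize first.
group_importance = {g: mean_abs[cols].sum() for g, cols in feature_groups.items()}
print(sorted(group_importance.items(), key=lambda kv: kv[1], reverse=True))
```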
30. The Catch
• Computationally expensive and time-consuming for large numbers of features.
• Explanations are generated too late in the machine learning pipeline.
• Provide no guarantee that your model will behave as expected in the future for new data.
We do fraud prevention in ecommerce
It can be at all stages of a transaction - during the login, at the checkout (order), after the order (abuse)
Riskified tries to solve fraud from different sides, ATO (account takeover) is one of them and for this meetup we will mostly focus on that.
Two decisions: challenge or allow. ATO is one of them; we are trying to detect ATOs using internal and external data.
We built the best model, carefully engineered features, checked several algorithms, hyperparameter search and everything… model goes live, everything’s looking great…
But suddenly we get messages from a specific merchant that we are challenging good logins (ones they use for testing). It could have been that the model was right to challenge them, but it turned out these were test logins.
The harder the question we are trying to solve, the more sophisticated the machine learning models tend to become, which makes them almost impossible to interpret. This can mean more features, more complex algorithms, or more complex patterns.
Should we build an accurate model, or sacrifice accuracy and build an interpretable model? The solution is a model simple enough that you can explain it, but accurate enough to meet your needs.
Story of John - he gets declined for a loan and asks why...
Explainability enables tech developers to troubleshoot, debug and upgrade their models, as well as innovate new functionalities.
Without explanations, if the model makes lots of bad predictions then it remains a mystery as to why.
We’ll talk about 2 different terms but mostly they are used interchangeably. Same same but different…
Where do ensembles fit into this graph? They are less interpretable!
Keep it short!
Explainability methods make the correlations picked up by ML models transparent… As a result, explaining the model will not reveal causal effects.
All predictive models implicitly assume that everyone will keep behaving the same way in the future, and therefore correlation patterns will stay constant. To understand what happens if someone starts behaving differently, we need to build causal models, which requires making assumptions and using the tools of causal analysis.
I won’t get into the theory; I can answer theoretical questions afterwards or point you to the relevant references.
When applied to machine learning, we treat each feature as a player in the game, with all of them working together to produce the prediction.
A contribution of 0 means the feature/player doesn’t contribute.
In machine learning, a coalition is any subset of the features (only one feature, only two, none, etc.).
Local accuracy - the feature contributions must add up to the difference between the prediction and the average of all predictions (the baseline); see the formula below.
Consistency - if a model changes so that the marginal contribution of a feature value increases or stays the same, the Shapley value also increases or stays the same.
Missingness - a missing feature gets an attribution of zero.
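In symbols, local accuracy (the additivity property) can be written roughly as:

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i ,
\qquad \phi_0 = \mathbb{E}[f(X)]
```

where phi_i is the SHAP value of feature i for the instance x, phi_0 is the baseline (average prediction), and M is the number of features.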
Independent features - without getting too much into the theory of how it’s all actually calculated: if features are dependent, this may lead to putting too much weight on unlikely feature combinations because of the way SHAP values are calculated. The packages, though, take care of this at least for tree-based models.
Shapley values for feature contributions do not directly come from a local regression model. In regression models, the coefficients represent the effect of a feature assuming all the other features are already in the model. It is well-known that the values of the regression coefficients highly depend on the collinearity of the feature of interest with the other features that are included in the model. To eliminate this bias, Shapley values calculate feature contributions by averaging across all permutations of the features joining the model.
I won’t be getting into feature importance too much as Liran will be talking about it in a more detailed way.
Notice that in the kernel-based model there are no categorical features (gray).
Different models with a similar performance can base their predictions on completely different relations extracted from the same data. Despite the differences, the explanations are very useful.
Ca_14 for example was built to detect ATOs (the higher the value - the riskier)
Just look at FP or FN to understand the main pattern.
We can also look at them separately - only FP or only FN to catch more specific mistakes
The model used for feature selection may differ (in parameter configuration or in the type) from the one used for final fitting and prediction. This may result in suboptimal performances.
The standard methods tend to overestimate the importance of continuous or high-cardinality categorical variables.
SHAP helps when we perform feature selection. Instead of using the default tree-based importance, we select the best features as the ones with the highest SHAP values. It reduces the bias!
We still do it recursively (just instead of impurity or the Gini index, we use SHAP values).
Good for manual feature selection
It’s called a dependence plot
For example, here we can see that the same value (about 0) affects the prediction both highly positively and highly negatively.
We worked hard and built 50 features, but it takes time for the dev team to implement them all in production right away.
You need to choose wisely since their capacity and time are limited, so how do you decide?
Here we can see the feature importance including the existing and new features..
You can also check the feature importance of only the new features...
Only specific features? Or you want the whole group2 or 4?
Some domain knowledge and prior information are needed to choose an accurate number of clusters.
The advantage of using SHAP values for clustering is that the SHAP values of all features are on the same scale and have the same unit, unlike the raw features, which have different scales and are harder to compare or compute distances on (see the sketch below).
This also shows that the model sees those “segments” differently!
It shouldn’t be done blindly; the scores/scales should be checked and handled if needed.
You just need to find the right approach for your use case and your values to get the most accurate results out of it.
We can see that different segments have completely different sets of features that affect the predictions!
The importance can be explained using different segment sets (instead of just using the existing training set).
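A minimal sketch of clustering in the explanation space, assuming the `shap_values` and `X` from before (the number of clusters is a placeholder):

```python
import shap
from sklearn.cluster import KMeans

# SHAP values of all features share the same unit (contribution to the output),
# so Euclidean distances between rows of shap_values are meaningful.
n_clusters = 4  # needs domain knowledge / prior information, as noted above
segments = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(shap_values)

# A per-segment summary plot shows which features drive each segment.
for seg in range(n_clusters):
    mask = segments == seg
    shap.summary_plot(shap_values[mask], X[mask])
```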
Other possibilities I can think of but haven’t tried yet…
Anomaly/Outlier Detection ?
Exponentially increasing number of permutations
Post-model analysis only
The data might change/drift
In theory, SHAP is the better approach as it provides mathematical guarantees for the accuracy and consistency of explanations.