Statistical and visual tools for model interpretation
3. Overall Description
The present work presents tools for model interpretation derived from Partial Dependency Plots (in many different guises, explained in the text), contrasted with posterior probabilities, hereby called scores.
The work comprises 4 PowerPoint documents, with a possible fifth (if I get to it), numbered 0 to 4. Document 0 describes overall issues and introduces the working data set and models.
At the risk of spoiling end results, the Multivariate section provides insights at (almost) the observation level, and requires univariate and bivariate support. This conclusion is quite surprising to me, since I expected Univariate and Bivariate to be rendered lacking. But context reality is far more complex than expected, and model interpretations are as varied as the different contexts available in the data, which should not be dismissed all too eagerly.
This work is based mostly on visualization; I have tried to avoid statistical inference and lengthy tables.
4. Abstract
Statistical and data science models: are they interpretive black boxes? Let’s try for NO.
Molnar’s (2018) “Interpretable Machine Learning” is a big effort in finding solutions. Our presentation is humbler: visual tools for model interpretation based on partial dependency plots and their variants, such as collapsed PDPs created by the presenter, some of which may be polemical and debatable. Almost no use of statistical inference.
The audience should be versed in model creation and have at least some insight into partial dependency plots. The presentation is based on a simple working example with 8 predictors and one binary target variable.
It is not possible to detail exhaustively every method described in this presentation; an extensive document is in preparation. The presentation requires 3 hours and a wide-awake audience. Double the time if not awake. Sleepers will be punished accordingly.
Slides marked **** can be skipped for an easier first reading.
5. Contents: Model Interpretation (MI)
1. Introduction and General Notes
2. Confounding
3. Model Interpretation (MI) and Categorization: UMI, BMI, MMI.
4. Binary Target Study
4.1: Report of coefficients, estimates, etc.
4.2: Model Structures
4.3: GOF and Model Interpretation
5. Univariate Model Interpretation (UMI): profiles and the Model Interpretation area.
6. Partial Dependency Plots (PDPs) and their variants: UMI.
7. PDPs and Bivariate Model Interpretation (BMI).
7.1: UMI vs. BMI.
8. Multivariate Model Interpretation: MMI.
9. Future Steps
10. Observation-level Interpretation
11. References
7. Overall comments and introduction.
Presentation by way of example, focusing on a Fraud/Default data set and continuing previous chapters available on the web (standard class for Principal Analytics Prep).
Aim: study interpretation/diagnosis, mostly via Partial Dependency Plots, of logistic regression, Classification Trees and Gradient Boosting.
Presentation(s) available at
https://www.slideshare.net/LeonardoAuslender/visual-tools-for-interpretation-of-machine-learning-models
At present there are many written opinions and distinctions about the topic; no room or desire to discuss them all. See Molnar’s (2018) book for an overall view, O’Rourke (2018), Doshi-Velez et al. (2017).
8. Overall comments and introduction (cont 1).
No discussion of imbalanced data set modeling or other modeling issues such as model selection.
This presentation introduces novel visual concepts
as well as tools derived from Partial Dependency
Plots (PDP):
-Overall PDP
-Collapsed PDP and residuals
-Marginal PDP
-PDP vs. actual scores, ….
and how they assist in model interpretation.
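As one concrete reading of the PDP idea above, here is a minimal hand-rolled sketch in Python. It uses synthetic data (NOT the deck's fraud data set), and the helper name `partial_dependence` is illustrative, not a claim about any library:

```python
# Minimal hand-rolled PDP sketch on synthetic data (NOT the fraud data set);
# the helper name `partial_dependence` is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def partial_dependence(model, X, col, grid):
    """Average predicted probability with column `col` forced to each grid value."""
    out = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, col] = v          # hold the variable of interest at v for ALL rows
        out.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(out)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)  # 10 grid points / bins
pdp = partial_dependence(model, X, 0, grid)
```

Plotting `grid` against `pdp` gives the overall PDP for one predictor; the variants listed above (collapsed, marginal, PDP vs. actual scores) build on this same object.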
9. Model Interpretation (MI) and model building issues.
1) Why/where does the model make mistakes (large residuals, outliers, etc.)?
2) Which/when do attributes (alone / in groups) end up being important?
3) Why are others unimportant?
4) Do observation-level predictions differ by model?
However, the immediate aim is NOT interpretation at the observation level (why predicted sick/churner/innocent…) but ➔
10. Objectives of MI (cont. 1)
Why not directly at the observation level?
Suppose a model to predict entertainment-type preference for a database of families in large cities. Since it is not possible to obtain updated family preferences consistently (i.e., data are ‘soft’), models necessarily are not interpretable at specific family levels.
Contrariwise, disease diagnostic prediction is closer to individual explanation and interpretability (data typically ‘hard’).
MOTTO: Posterior probability follows Data + Model algorithm/s. Interpretation follows primarily probability but must include data (i.e., context) ➔
13. Model Interpretation categorization.
Just as in EDA (but on model results, i.e., predictions, not on initial data), three types of MI:
Univariate Model Interpretation (UMI): one variable at a time vis-à-vis predictions/probabilities. EASIEST to understand and a huge source of “makes sense” discourse. E.g., classical linear model interpretations; reasons to decline a bank loan, etc.
Bivariate Model Interpretation (BMI): looking at pairs of variables to interpret model results. Correlation measures immediately spring to mind.
Multivariate Model Interpretation (MMI): overall model interpretation, most difficult and valuable.
Typically, most work results in UMI and perhaps BMI. We will aim for MMI as well.
Aside: Does Occam’s razor help?
“Pluralitas non est ponenda sine necessitate” (plurality should not be posited without necessity) ➔ can lead to interpret and then choose the model, or choose the model and then interpret ➔ does not help us.
14. Model Interpretation presentation
We will present results in UMI, BMI and MMI order, and at the end compare across the three methodologies.
The aim is to find insights and contradictions when generalizing UMI without validating the interpretation in BMI and MMI, and likewise to verify strong UMI results that remain prevalent in BMI and MMI.
16. Confounding rears its ugly head.
See earlier chapters for review and examples. Must read; not elaborated herein.
18. Golden Days of Linear Regression Interpretation ***
Based on the “ceteris paribus” assumption, which fails in the case of even relatively small VIFs. At present, the rule of thumb is VIF >= 10 (R-sq = .90 among predictors) ➔ unstable model (see earlier slides in shareware …).
“Ceteris paribus” exercise: keeping all other predictors constant, an increase in …. But if the R-sq among predictors is even 10%, it is not possible to keep all predictors constant while increasing the variable of interest by 1, as per the ceteris paribus frame of analysis.
Advantages however: EASY to conceptualize, because practice follows the notion of mostly bivariate correlation (keeping all else constant reduces the relationship to just one variable vs. predictions ➔ UMI). But this is wrong with even small bivariate correlations and mostly wrong in the multivariate case. Let us see …
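The VIF rule of thumb above can be checked numerically. A minimal Python sketch on synthetic data (the near-collinear pair is constructed for illustration, not drawn from the working data set): VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors.

```python
# VIF sketch on synthetic data: two nearly collinear predictors and one
# independent predictor. VIF_j = 1 / (1 - R^2_j).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
z = rng.normal(size=300)
X = np.column_stack([
    z + 0.1 * rng.normal(size=300),   # nearly collinear with next column
    z + 0.1 * rng.normal(size=300),
    rng.normal(size=300),             # independent predictor
])

def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
```

The collinear pair lands well past the VIF >= 10 threshold, while the independent predictor stays near 1.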
19. Confusion on signs of coefficients and interpretation: simple linear regression case.
Fitted line: ŷ_i = α̂ + β̂ x_i. In standardized form,
(ŷ_i − Ȳ) / s_y = r_xy · (x_i − X̄) / s_x,
so that β̂ = r_xy · (s_y / s_x) and sg(β̂) = sg(r_xy).
➔ Corr(X, Y) = β̂ if SD(Y) = SD(X), i.e., if both variables are standardized; otherwise slope and correlation at least share the same sign, and interpretation from the correlation holds in the simple regression case.
Notice that the regression of X on Y is NOT the inverse of the regression of Y on X, because of SD(X) and SD(Y): its slope is r_xy · (s_x / s_y).
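A quick numerical check of the simple-regression identity β̂ = r_xy · (s_y / s_x), on synthetic data (a Python sketch, not from the deck):

```python
# Check that the OLS slope equals r_xy * s_y / s_x, so slope and
# correlation always share sign in simple regression. Synthetic data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

r_xy = np.corrcoef(x, y)[0, 1]
slope = np.polyfit(x, y, 1)[0]          # OLS slope with intercept
identity = r_xy * y.std() / x.std()     # r_xy * (s_y / s_x)
```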
20. (5/4/2022)
In multiple linear regression, the previous relationship does not hold because predictors can be correlated (r_XZ), weighted by r_YZ, hinting at collinearity and/or relationships of suppression/enhancement (paper on suppression/enhancement in shareware.net) ➔
In the multivariate case, e.g., Y = a + β_YX.Z X + β_YZ.X Z (notation emphasizing “partial”), the estimated equation gives, for example:
β̂_YX.Z = [(r_YX − r_YZ · r_XZ) / (1 − r²_XZ)] · (s_Y / s_X),
so that sg(β̂_YX.Z) = sg(r_YX − r_YZ · r_XZ),
which differs from sg(r_YX) when abs(r_YX) < abs(r_YZ · r_XZ) and r_YX, r_YZ · r_XZ have the same sign.
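The sign relation above can be demonstrated numerically. In the sketch below (Python, synthetic data with illustrative coefficients), a confounder Z makes the simple correlation of Y with X positive while the partial (multiple-regression) slope on X is negative, and the sign matches sg(r_YX − r_YZ · r_XZ):

```python
# Suppression/confounding sketch: simple corr(Y, X) > 0 but the partial
# slope on X in Y ~ X + Z is negative. Synthetic, illustrative data.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)
x = 0.9 * z + 0.4 * rng.normal(size=n)           # X correlated with confounder Z
y = -0.5 * x + 2.0 * z + 0.4 * rng.normal(size=n)

r_yx = np.corrcoef(y, x)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]

# Multiple regression of Y on (1, X, Z): beta[1] is the partial slope on X.
XZ = np.column_stack([np.ones(n), x, z])
beta = np.linalg.lstsq(XZ, y, rcond=None)[0]
```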
21. Comment on Linear Model Interpretation
Even in traditional UMI land, multivariate relations given by partial and semi-partial correlations must be part of the interpretation.
Note that while correlation is a bivariate relationship, partial and semi-partial correlations can be extended to the multivariate setting. In the case of a binary target, these relationships are not fully analyzed.
However, even BMI, and certainly MMI, are not so often performed.
23. EDA and Model Interpretation
EDA analyzes data sets without reference to the dependent or target variable (DV), which is instead done by modeling. Thus, MI = EDA + Predictions Analysis.
Nevertheless, for given value(s) of the DV or of the predicted values, UMI, BMI and MMI can utilize EDA tools. For instance, a histogram of posterior model probabilities is part of Model UEDA and thus part of UMI.
Thus, MI is based on the relationship of predictions (and residuals) vis-à-vis single predictors, pairs, triads, tetrads, etc. This translates into different techniques, such as original PDPs, pair PDPs, triads, etc., to be reviewed below.
NB: We utilize binning and rescaling of variable ranges for easier visual interpretation. The number of bins is mostly 10 for UMI analysis, and 3 otherwise. We do not discuss issues of optimal binning, left to the reader.
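The 10-bin scheme mentioned in the NB can be sketched as follows (Python/pandas, with synthetic uniform scores standing in for posterior probabilities):

```python
# Cut scores into 10 equal-width bins over [0, 1] and compute the mean
# event rate per bin. Scores and events are synthetic stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
score = rng.uniform(size=1000)                      # stand-in posterior probabilities
event = (rng.uniform(size=1000) < score).astype(int)

bins = pd.cut(score, bins=np.linspace(0, 1, 11), include_lowest=True)
rate_per_bin = pd.Series(event).groupby(bins, observed=True).mean()
```

With well-calibrated scores, the per-bin event rate rises roughly linearly across bins, which is the kind of display used in the model comparisons below.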
25. Searching for Important variables en route to answering
modeling question.
QUESTION: what are the minimum components to make a car go along a highway?
1) Engine
2) Tires
3) Steering wheel
4) Transmission
5) Gas
6) … other MMI aspects and interrelations.
Take just one of them out, and the car won’t MOVE ➔ there is NO SINGLE most important variable. Instead, a minimum irreducible set of them is NECESSARY. In the Data Science case with n → ∞, there are possibly many subsets of ‘important’ variables for (n, p) subsets.
Typically, “suspect VARIABLES” are a good starting point of research. “STARTING” is the key word.
29. Basic DATA set(s) Information
Model Name: M2
Item            Information
TRN DATA set    train
TRN num obs     3595
VAL DATA set    (none)
VAL num obs     0
TST DATA set    (none)
TST num obs     0
Dep. Var        fraud
TRN % Events    20.389
VAL % Events    (n/a)
TST % Events    (n/a)
30. Data set: Definition by way of Example
• Health insurance company: Ophthalmologic Insurance Claims.
• Is a claim valid or fraudulent? Binary target.
• Full description and analysis of this data set at
https://www.slideshare.net/LeonardoAuslender (lectures at Principal Analytics Prep).
31. While presenting results for 3 models, we’ll concentrate on the ‘best’ model for interpretation, for brevity’s sake, except to mention specific examples of different model interpretations across models.
Requested Models: Names & Descriptions.
Model #   Full Model Name                 Model Description
2         002_M2_TRN_GRAD_BOOSTING        Gradient Boosting
4         004_M2_TRN_LOGISTIC_STEPWISE    Logistic STEPWISE TRN
5         005_M2_TRN_TREE                 TREE model
32. Original Vars + Labels (Model Name: M2)
Var #   Variable              Label
1       FRAUD                 Fraudulent Activity yes/no
2       TOTAL_SPEND           Total spent on opticals
3       DOCTOR_VISITS         Total visits to a doctor
4       NO_CLAIMS             No of claims made recently
5       MEMBER_DURATION       Membership duration
6       OPTOM_PRESC           Number of opticals claimed
7       SPEND_PER_CLAIM       Expenses per claim
8       CLAIMS_PER_DURATION   Claims per duration
35. Probability distributions are very different ➔ model interpretation must depend on model selection. It is possible to ‘mix’ all models into one, an Ensemble; not in this ppt (see slides in shareware).
37. Some conclusions and comments so far: (cont.)
Probability distributions differ in:
1) Extreme points: Logistic and TREES achieve [0, 1], not necessarily other methods, such as GradBoost in our case.
2) Very different % of obs in the models’ probability bins.
3) % events per bin is fairly linear, except for the Logistic ‘drop’ at 0.7. Gradient Boosting has a higher % of events at higher probability levels than the other 2 models.
4) Above a posterior probability of about 0.4, the 3 methods have similar distributions. Quite different in the segment 0 - < 0.4. Notice GB and TREE having a large proportion of observations at lower probability levels, compared to Logistic.
5) Relative but not absolute MI information can be inferred. % events differ across models ➔ different probability estimates, especially above the segment 0 - < 0.4. Since higher probability levels reflect higher % events, MI is necessarily different.
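The per-bin comparison in point 2 can be tabulated simply. A sketch with two hypothetical score distributions (Beta draws standing in for a tree-like model with mass at low probabilities vs. a flatter, logistic-like one; none of this is the deck's actual model output):

```python
# Share of observations per probability bin for two hypothetical
# score distributions (synthetic stand-ins for two models' scores).
import numpy as np

rng = np.random.default_rng(5)
scores_a = rng.beta(2, 5, size=2000)   # mass concentrated at low probabilities
scores_b = rng.beta(2, 2, size=2000)   # flatter distribution

edges = np.linspace(0, 1, 11)
share_a = np.histogram(scores_a, bins=edges)[0] / len(scores_a)
share_b = np.histogram(scores_b, bins=edges)[0] / len(scores_b)
```

Comparing `share_a` and `share_b` bin by bin is exactly the "% obs per bin" contrast described above.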
38. Let’s get into data details for the sake of completeness.
Quick EDA area.
U(nivariate) EDA = UEDA.
42. Coefficients, p-values and Importance (Vars * Models * Coeffs, Model M2).
Note: Importance and coefficients share one column, as do p-values and number of rules. Note that the models do not share all variables. Interestingly, CLAIMS_PER_DURATION is # 1 for the tree methods and was not selected by Logistic.

Model (M2_TRN_...):    GRAD_BOOSTING          LOGISTIC_STEPWISE     TREE
Variable               Importance  Nrules     Coeff      PVal       Importance  Nrules
CLAIMS_PER_DURATION    1.0000      26                               1.0000      5
DOCTOR_VISITS          0.4035      20         -0.0180    0.014      0.2895      2
MEMBER_DURATION        0.5643      26         -0.0065    0.000      0.3650      2
NO_CLAIMS              0.2483      6          0.7137     0.000
OPTOM_PRESC            0.5963      21         0.2185     0.000      0.5383      5
SPEND_PER_CLAIM        0.2202      8          0.0000     0.001
TOTAL_SPEND            0.6148      29         -0.0000    0.000      0.4404      3
INTERCEPT                                     -0.5160    0.000
47. Some conclusions and comments so far:
. Logistic stepwise did not select NUM_MEMBERS, which is shown with the lowest relative importance in GB and Trees. More importantly, “claims_per_duration” is deemed most important by the tree methods and disregarded by logistic. Notice that Logistic Regression does not have an agreed-upon scale of importance; by default, we use odds ratios.
. CLAIMS_PER_DURATION is deemed the most important single variable for GB and TREE, but logistic deems NO_CLAIMS # 1 and OPTOM_PRESC # 2 (via odds ratios), while GB differed.
. The remaining variables have odds ratios of 1, which seems to indicate a similar effect across them, while GB/TREE distinguish relative importance after the first two variables.
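The odds-ratio reading used above is just exp(coefficient) per one-unit increase of a predictor. A sketch using the logistic coefficients from the slide-42 table (remember that per-unit effects depend on each variable's scale, so odds ratios near 1 can reflect units rather than irrelevance):

```python
# Odds ratios per one-unit increase, from the slide-42 logistic
# STEPWISE coefficients. Odds ratio = exp(coefficient).
import numpy as np

coeffs = {
    "NO_CLAIMS": 0.7137,
    "OPTOM_PRESC": 0.2185,
    "DOCTOR_VISITS": -0.0180,
    "MEMBER_DURATION": -0.0065,
    "TOTAL_SPEND": -0.0000,
}
odds_ratios = {k: float(np.exp(v)) for k, v in coeffs.items()}
```

This reproduces the ranking in the text: NO_CLAIMS and OPTOM_PRESC stand out, while the remaining variables have odds ratios at or near 1.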
53. Curiously, while node numbers don’t mean anything across models, it is obvious that GB and LG share a similar structure despite being very different algorithms. However, tree representations are just approximations, except in the Tree case.
54. Discussion of the comparison of Tree representations between LG and GB.
The two methods split initially on Claims_per_duration, but at very different values (0.00791 (LG) vs. 0.00583 (GB)). Remember that the actual logistic regression results had dropped Claims_per_duration.
Later levels obviously differ, since the initial split is quite different. Therefore, these two models should ‘a priori’ differ in model interpretation.