Binary dependent variable classification models in the context of large databases: interpretation via visual tools such as partial dependency plots for 1, 2, 3, and 4 variables, and other plots. The presentation focuses on overall rather than individual observation interpretation, and is still a work in progress.
Visual Tools for explaining Machine Learning Models
2. Abstract
Statistical and data science models are considered to be, somewhat
pejoratively, black-boxes, interpretation of which has not been
systematically studied.
Molnar’s “Interpretable Machine Learning” is a big effort in finding
solutions. Our presentation is humbler. We aim at presenting visual tools
for model interpretation based on partial dependency plots and their
variants, such as collapsed PDPs created by the presenter, some of
which may be polemical and debatable.
The audience should be versed in model creation and have at least some
insight into partial dependency plots. The presentation will be based on a
simple working example with 6 predictors and one binary target variable
for ease of exposition.
It is not possible to detail exhaustively every method described in this
presentation; an extensive document is in preparation. The presentation requires
3 hours and a wide-awake audience. Double if not awake.
Slides Marked **** can be skipped for easier first reading.
4. Overall comments and introduction.
Presentation by way of example focusing on Fraud/Default
Data sets and continuing previous chapters.
Aim: study interpretation/diagnosis mostly via Partial
Dependency Plots of logistic regression, Classification
Trees and Regression Boosting.
Presentation available at
https://www.slideshare.net/LeonardoAuslender/visual-tools-for-interpretation-of-machine-learning-models
At present, lots of written opinions and distinctions about
topic. No time to discuss them all. See Molnar’s (2018)
recent book for an overall view.
5. Overall comments and introduction (cont 1).
No discussion about imbalanced data set modeling
or other modeling issues. No discussion on
literature, all due to time constraints.
This presentation introduces novel visual concepts
as well as tools derived from Partial Dependency
Plots (PDP):
-Overall PDP
-Collapsed PDP
-Marginal PDP
and how they assist in model interpretation.
6. Objectives of Interpretation.
Why does the model make mistakes (large residuals,
outliers, etc.)?
Which attributes (alone or in groups) end up being important?
Why is this attribute (or set of attributes) not important?
Why is this observation predicted with high probability
score?
However, the immediate aim is NOT explanations at the observation
level (why predicted sick/churner/innocent…) but overall model interpretation.
7. Objectives of Interpretation (cont. 1)
Why not directly at observation level?
Suppose a model to predict entertainment-type preference for a
database of families in large cities. Since it is not possible to
obtain updated family preferences consistently, and the data are
'soft', models are necessarily not created at the level of specific
families.
Contrariwise, disease diagnostic prediction is closer to
individual explanation and interpretability.
9. Model Interpretation categorization.
Just as in EDA (but on model results, not on initial data), three
types:
Univariate Model Interpretation (UMI): One variable at a
time. EASIEST to understand and a huge source of "makes
sense". E.g., classical linear model interpretations; e.g.,
reasons to decline a loan.
Bivariate Model Interpretation (BMI): Looking at pairs of
variables to interpret model results.
Multivariate Model Interpretation (MMI): Overall model
interpretation, the most difficult.
Typically, most work results in UMI and perhaps BMI.
11. Days of Linear Regression Interpretation ***
Based on the "ceteris paribus" assumption, which fails even in the case of
relatively small VIFs. At present, the rule of thumb is that VIF >=
10 (R-sq = .90 among predictors) indicates an unstable model.
"Ceteris paribus" exercise: keeping all other predictors
constant, an increase in .... But if the R-sq among predictors is
even 10%, it is not possible to keep all predictors constant while
increasing the variable of interest by 1.
Advantage: EASY to conceptualize because practice
follows the notion of bivariate correlation.
But the notion is generally wrong in the multivariate case.
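For reference, the VIF rule of thumb above follows from the standard definition, where $R_j^2$ is the R-squared from regressing predictor $j$ on the remaining predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}, \qquad R_j^{2} = 0.90 \;\Rightarrow\; \mathrm{VIF}_j = 10.$$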
12. Corr(X, Y) equals the regression slope if SD(Y) = SD(X), e.g., if both are
standardized; otherwise the two at least share the same sign, and the
interpretation from correlation holds in the simple regression case.
Notice that the regression of X on Y is NOT the inverse of the
regression of Y on X, because of SD(X) and SD(Y). Hence the confusion on
signs of coefficients and their interpretation.
In simple regression,
$$\hat{\beta}_{yx} = r_{xy}\,\frac{s_Y}{s_X}, \qquad \hat{\beta}_{xy} = r_{xy}\,\frac{s_X}{s_Y}, \qquad \hat{\beta}_{yx}\,\hat{\beta}_{xy} = r_{xy}^{2}, \qquad \mathrm{sg}(\hat{\beta}_{yx}) = \mathrm{sg}(\hat{\beta}_{xy}) = \mathrm{sg}(r_{xy}).$$
13. In multiple linear regression, the previous relationship does
not hold because predictors can be correlated ($r_{XZ}$),
weighted by $r_{YZ}$, hinting at collinearity and/or relationships
of suppression/enhancement.
In the multivariate case, e.g., with two predictors $X$ and $Z$, the
estimated equation (emphasizing "partial") is
$$\hat{Y} = \hat{a} + \hat{\beta}_{YX.Z}\,X + \hat{\beta}_{YZ.X}\,Z, \qquad \hat{\beta}_{YX.Z} = \frac{s_Y}{s_X}\,\frac{r_{YX} - r_{YZ}\,r_{XZ}}{1 - r_{XZ}^{2}},$$
and, for example,
$$\mathrm{sg}(\hat{\beta}_{YX.Z}) = \mathrm{sg}(r_{YX} - r_{YZ}\,r_{XZ}),$$
which can differ from $\mathrm{sg}(r_{YX})$ when $\mathrm{abs}(r_{YX}) < \mathrm{abs}(r_{YZ}\,r_{XZ})$.
14. Comment on Linear Model Interpretation
Even in traditional UMI land, we find that
multivariate relations given by partial and semi-partial
correlations must be part of the interpretation.
Note that while correlation is a bivariate
relationship, partial and semi-partial correlations can be
extended to the multivariate setting.
However, even BMI, and certainly MMI, are not so often
performed.
16. Searching for Important variables en route to answering
modeling question.
Case study: minimum components to make a car go
along highway.
1) Engine
2) Tires
3) Steering wheel
4) Transmission
5) Gas
6) ….. Other MMI aspects and interrelations.
Take just one of them out, and the car won't drive. There is no
SINGLE most important variable but a minimum irreducible set of
them. In the Data Science case with n → ∞, there are possibly many subsets of
'important' variables.
But "suspect VARS" are a good starting point for research.
19. Model Information (Model Name: M2)
TRN DATA set: train; TRN num obs: 3595
VAL DATA set: validata; VAL num obs: 2365
TST DATA set: (none); TST num obs: 0
Dep. Var: fraud
TRN % Events: 20.389
VAL % Events: 19.281
TST % Events: (none)
20. Original Vars + Labels (Model Name: M2)
DOCTOR_VISITS: Total visits to a doctor
MEMBER_DURATION: Membership duration
NO_CLAIMS: No of claims made recently
NUM_MEMBERS: Number of members covered
OPTOM_PRESC: Number of opticals claimed
TOTAL_SPEND: Total spent on opticals
21. Requested Models: Names & Descriptions.
Full Model Name: Model Description
Overall Models:
M2: 20 pct prior
M2_BY_DEPVAR: Inference
01_M2_GB_TRN_TREES: Tree Repr. for Gradient Boosting
02_M2_TRN_GRAD_BOOSTING: Gradient Boosting
03_M2_TRN_LOGISTIC_STEPWISE: Logistic TRN STEPWISE
04_M2_VAL_GRAD_BOOSTING: Gradient Boosting
05_M2_VAL_LOGISTIC_STEPWISE: Logistic VAL STEPWISE
22.
Data set: Definition by way of Example
• Health insurance company: Ophthalmologic Insurance Claims
• Is claim valid or fraudulent? Binary
target.
• No transformations created to have
simple data set.
• Full description and analysis of this data
set in
https://www.slideshare.net/LeonardoAuslender
(lectures at Principal Analytics Prep).
23. Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8                       Membership duration
4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
Note: No nominal predictors. No transformations to keep
presentation simple but not simpler than necessary
29. Some conclusions and comments so far:
. Logistic stepwise dropped Num_members, which is shown
with the lowest relative importance in GB. Notice that Logistic
Regression does not have an agreed-upon scale of importance;
we can use odds ratios, for example.
. NO_CLAIMS is deemed the most important single variable by
GB, but logistic deems OPTOM_PRESC the second most important
(via odds ratios), while GB selected MEMBER_DURATION.
. The remaining variables have odds ratios of 1, which seems to
indicate similar effects, while GB distinguishes relative
importance after the first two variables.
31. Definite monotonic relationship: higher values of NO_CLAIMS (e.g., bin 5)
are associated with higher probability. Note the differences between LG and GB.
36. LG does not have a measure of importance, as GB does, so
we use marginal effects plots, which indicate the change in
probability along a variable's range. Except for
MEMBER_DURATION (which declines initially), the other effects
are positive with different intensities, and the maximum value
declines as per the logistic shape. Member duration has a
pronounced decline at low duration levels, pointing to the possibility
of fraudulent members who join, commit their fraud and
leave.
Note the sharper increase in prob. for NO_CLAIMS at bins 1
and 8, and for OPTOM_PRESC at 6.
GB importance measures the impact of individual inputs on
predicting Y, but does not tell how that impact changes along the
range of the inputs, and individual variable effects are not
considered.
Use Partial Dependency Plots, also for LG as a free ride.
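As a reminder (standard logistic-regression algebra, not specific to this data set), the marginal effect of predictor $x_j$ on the predicted probability is

$$\frac{\partial \hat{p}}{\partial x_j} = \hat{\beta}_j\,\hat{p}\,(1-\hat{p}), \qquad \hat{p} = \frac{1}{1 + \exp(-x'\hat{\beta})},$$

so the effect is largest near $\hat{p} = 0.5$ and flattens in the tails, which is the "max value declines as per the logistic shape" behavior noted above.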
37. Marginal Effects and PDPs
Marginal effects refer to change in probability with one
unit change in X, ceteris paribus (if meaningful or at
least desirable).
PDPs do not indicate change in Y at all; instead, a PDP
measures probability levels at different values of X1 while
holding all other predictors at their means (or modes,
medians, etc.).
There is no marginality in PDPs unless we also measure the
'change' in probability. This is shown later on, under the name
Marginal PDPs.
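A minimal sketch of the PDP variant described above (the variable of interest varies over its range while all other predictors are held at their means, rather than averaged over the data as in Friedman's classical PDP). The model, data frame and column names are illustrative assumptions, not the presenter's actual code:

```python
import numpy as np
import pandas as pd

def pdp_at_means(model, X, var, n_points=20):
    """PDP as described above: vary `var` over its observed range
    while every other predictor stays fixed at its mean."""
    grid = np.linspace(X[var].min(), X[var].max(), n_points)
    base = X.mean().to_frame().T                     # one row: all predictors at their means
    rows = pd.concat([base] * n_points, ignore_index=True)
    rows[var] = grid
    # A scikit-learn-style classifier is assumed; column 1 = event probability.
    # Note: model.predict_proba(base)[:, 1] alone gives the "Overall PDP"
    # (probability with every predictor at its mean) used later in the deck.
    probs = model.predict_proba(rows[X.columns])[:, 1]
    return pd.DataFrame({var: grid, "pdp": probs})

# Illustrative usage (gb_model, X_train and "NO_CLAIMS" are assumed names):
# pdp_no_claims = pdp_at_means(gb_model, X_train, "NO_CLAIMS")
# pdp_no_claims.plot(x="NO_CLAIMS", y="pdp")
```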
41. Comment on Tree Representations
LG starts by splitting on NO_CLAIMS at 2, while GB splits at
3. Predictions at the first level across the 2 models are similar:
(.185; .633) for GB vs. (.164, .61) for LG, which indicates that the
structures identified so far are similar (UMI interpretation).
While at the 2nd level GB only splits on NO_CLAIMS, LG splits
on MEMBER_DURATION for the suspected non-fraudsters from the
first stage and on NO_CLAIMS for the fraudster suspects
(BMI already different).
Predictions are similar only for the 4th node in level 2 (.748
and .761) but different otherwise. The careful reader may
verify that these two predictions emerge by splitting on
NO_CLAIMS, albeit at different values, which supports the
notion of No_claims being the leading clue in our research.
44. Comment on Tree Representations (cont.1)
NO_CLAIMS is not so heavily used after the 2nd level, however,
and the structures of the models are clearly different. GB
does not use it at all, while LG splits at 4.5 to produce the
highest prediction level of .899. While GB did split initially on
No_claims at 2.5 and then at 4.5, it did not reach the same
prediction level as LG, which started splitting at 1.5.
Going to the marginal effects plot, we can see that
No_claims has the largest slope at low values, but
member_duration has the highest slope at the highest value of the
variable. No similar plots can be created for GB.
Thus, the structures found, and the consequent interpretations, differ,
and there is no isomorphism from one into the other.
Perhaps a fractal approximation?
49. Both probs increase as No_claims increases, except for GB between 14
and 17. GB > LG up to NO_CLAIMS = 3, reversed afterwards. Notice the increasing
difference in prob. after NO_CLAIMS = 5.
50. Different binned residual results by model and by variable. Notice
LG's better fit, except between bins 11 and 17.
51. Overall declining relation except for the area around bin 15. Notice no consistently
higher levels of prob. for either model, contrary to the two previous slides.
52. Both models fit relatively poorly for large values of Member_Duration.
53. Similar gulf in probabilities and severe divergence between 11 and 13. GB
makes more extreme jumps.
57. Strict interpretations: different.
Obvious similarities: e.g., high values of NO_CLAIMS
related to high Probability Values mostly.
INTERESTING: curves not fully monotonically increasing
or decreasing.
Residual plots show that there are patterns not well
fitted.
59. GOF ranks.
GOF measures ranked per model: AUROC, Avg Square Error, Class Rate, Cum Lift 3rd bin, Cum Resp Rate 3rd, Gini, Precision Rate, Rsquare Cramer Tjur; followed by the unweighted mean and median of the ranks.

02_M2_TRN_GRAD_BOOSTING:     1, 1, 2, 1, 1, 1, 1, 1; Unw. Mean 1.13; Unw. Median 1
03_M2_TRN_LOGISTIC_STEPWISE: 2, 2, 1, 2, 2, 2, 2, 2; Unw. Mean 1.88; Unw. Median 2
04_M2_VAL_GRAD_BOOSTING:     1, 1, 2, 1, 1, 1, 1, 1; Unw. Mean 1.13; Unw. Median 1
05_M2_VAL_LOGISTIC_STEPWISE: 2, 2, 1, 2, 2, 2, 2, 2; Unw. Mean 1.88; Unw. Median 2
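A sketch of how such a rank table can be assembled, assuming the per-model GOF measures have already been computed; the measure names, values and orientation below are hypothetical placeholders, not the numbers behind the table above:

```python
import pandas as pd

# Hypothetical per-model GOF measures (placeholders, not the actual values above).
gof = pd.DataFrame(
    {"AUROC": [0.93, 0.90], "AvgSquareError": [0.09, 0.10], "Gini": [0.86, 0.80]},
    index=["02_M2_TRN_GRAD_BOOSTING", "03_M2_TRN_LOGISTIC_STEPWISE"],
)
lower_is_better = {"AvgSquareError"}          # all other measures: higher is better

# Rank models on each measure (rank 1 = best), then aggregate the ranks.
ranks = pd.DataFrame({
    m: gof[m].rank(ascending=(m in lower_is_better)).astype(int) for m in gof.columns
})
ranks["Unw. Mean"] = ranks.mean(axis=1).round(2)
ranks["Unw. Median"] = ranks.drop(columns="Unw. Mean").median(axis=1)
print(ranks)
```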
61. Event Proportions and Posterior Probabilities for 5 Important Vars, by original Model Names.
Variables and probabilities are binned for ease of
visualization. The proportion of events is the same across models (it is
just the original data), but probabilities differ across models.
Not all cases shown.
65. Some observations
Binned No_claims: While similar in shape, GradBoost seriously
underestimates the proportion of events throughout, while logistic has that
problem for bins 2, 3, 5, 6, 7. Logistic has a positive slope, while GB
flattens due to the interactive nature of the GB model. Up to bin 7, GB and LG
behave similarly, and then LG jumps to a higher level of probability.
Binned Member_Duration: Probability distributions are similar but not
identical. For bins 1, 2, 3 and 16, both methods underestimate the proportion of
events. Slightly declining slope for both models.
Binned OPTOM_PRESC: Both methods fail to match the proportion of
events in the mid range of the bins. Sudden upshift in
positive slope for GB starting at bin 15, while the slope is overall flat but
positive for Logistic.
67. Interpretation: In bin 5, No_claims reaches the overall max (100), while for bin 1 the max
is around 35 and 15 (on a 0-100 scale) for the respective models. Same interpretation for Q3, etc.
74. Note the narrow range of the GB PDPs compared to those of LG, due to GB's
interactive nature, which is more difficult to interpret.
75.
76. Marginal (1) PDP comparative notes
(Marginal (1): one var at a time. Could also marginalize two vars at a time; not
done in this presentation.)
GB marginals are rather flat, except for
MEMBER_DURATION, about which a caveat comes later.
LG is juicier: the NO_CLAIMS increase in probability declines
along the range, but OPTOM_PRESC increases, which seems to
indicate that the leading reason would be prescriptions and not
overall claims.
The corresponding marginals for logistic end up with slowing
growth due to the logistic shape. GB is not constrained
in that way.
77. PDP comparative notes
The Overall PDP is the model probability when all predictors are at their means.
For LG it is about .17, while for GB it is .53. Individual PDPs are (by definition)
deviations from the Overall when the variable of interest is measured along its range,
while the others remain at their mean values. GB clumps most PDPs around the
Overall; LG shows clearly distinct values instead.
The highest probability level for GB is around .7 while LG reaches 1, and the
minima are around .6 and 0 respectively. Note LG's monotonicity, while
GB is mostly monotonic (except for Doctor_visits), possibly a product of
the data set having been created artificially.
In both cases, NO_CLAIMS appears as the leading variable, especially in LG;
but while Member_duration is rather flat in GB, it declines
steadily in LG, with a very different interpretation: longer member
duration implies a steadier customer and familiarity. No_members had
been excluded in LG's stepwise and should not be confused with
MEMBER_DURATION.
82. UMI: Univariate Model Interpretation.
From preceding pages, we can conclude that:
No_claims: positively associated with increased fraud
for both logistic and grad boosting, but with a far steeper slope in
Logistic. Grad_b stays in a narrow band of probability and is
more interactive with other predictors, so Grad_b requires
more BMI and MMI. GB's PDP overshoots the posterior
probability; other vars bring this effect down in GB.
Member_duration has a U-shaped relationship,
especially in the Logistic case, while GB has a more spiky
one. Note the high spike at minimal duration and the
immediate decline, which seems to indicate members that
committed fraud as soon as they joined and left
immediately.
83. UMI: Univariate Model Interpretation (cont. 1)
PDP view: logistic shows positive effects of NO_CLAIMS
and OPTOM_PRESC, balanced by negative effects of the
remaining variables.
Comparing the posterior probability with the No_claims PDP,
they are almost the same for Logistic. Similarly for
MEMBER_DURATION.
Grad_b instead shows more tepid effects of the same
variables, and almost unchanging effects of the remaining
predictors. Comparing PDP with probability, the other
predictors bring down the PDP of No_claims. Similar effects
for MEMBER_DURATION.
84. PDPs for "Pairs of Variables"
Note: 3-d plots tend to interpolate areas with no data, producing false
expectations about results. Thus, binned charts may sometimes be preferable to
3-d surface plots.
Not all pairs of variables are available, due to computer resources.
85. Same for LG.
Note that the correlations of No_claims with the other variables are relatively small when
compared to the pairs Member_duration - Doctor_visits and Member_duration -
optom_presc. How will this translate into PDPs for 2 variables at a time?
86. M_: 'M2_TRN_LOGISTIC_STEPWISE' 'ORIGINAL' PDP Corr '-0.02542'
LG: High levels of NO_CLAIMS have high probability at the lowest level of
total_spend, which probably denotes one-time fraud. Otherwise, even mid levels
of NO_CLAIMS are associated with high probability for any level of TOTAL_SPEND.
It seems that FRAUD is not necessarily linked to TOTAL_SPEND alone.
87. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.05073' NO_CLAIMS DOCTOR_VISITS
The combination of No_claims & Doctor_visits shows high probability at the NE corner,
and a stable high prob. level in the middle section. Too many charts to show, but
necessary for full interpretation.
88. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr ' 0.02549' NO_CLAIMS MEMBER_DURATION
Regarding NO_CLAIMS and MEMBER_DURATION, fraud happens at low
levels of duration, after which fraudsters leave.
89. M_: 'M2_TRN_GRAD_BOOSTING' 'ORIGINAL' PDP Corr ' 0.06580'
Prob. growth for low NO_CLAIMS as OPTOM_PRESC increases (not as dramatic
as in LG), and then slow and steady prob. growth with increases in both vars.
90. BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr ' 0.06580' NO_CLAIMS OPTOM_PRESC
Steep prob. growth for low NO_CLAIMS as OPTOM_PRESC increases; similar
but not so pronounced growth for small OPTOM_PRESC and increasing
NO_CLAIMS.
91. BINNED ORIGINAL PDP M2_TRN_GRAD_BOOSTING Corr '-0.10759' MEMBER_DURATION OPTOM_PRESC
Regarding the pair Optom_presc and Member_duration, for which we have contrasting
pair PDPs (corr = -0.10), the interpretation is very different. While GB shows flat
probabilities throughout, except in the empty NE corner, logistic shows a more
extreme NE corner, plus declining probabilities from the NW top.
92. BINNED ORIGINAL PDP M2_TRN_LOGISTIC_STEPWISE Corr '-0.10759' MEMBER_DURATION OPTOM_PRESC
95. Some BMI comments
The previous charts suggest that one type of fraud happens once, at low
levels of duration. Different types of fraud are linked with
prescriptions and claims.
For low levels of claims, increasing prescriptions leads to fraud, as
does a combination of increasing claims and prescriptions.
So it is possible that there are at least 3 types of fraud being
committed.
Other pair combinations were not interesting, with rather flat surfaces.
96. For 2-dimensional visualization, collapse the 3-d chart by
averaging the levels of variable 2 into those of variable 1, and
compare to the original PDP.
Original and collapsed PDPs (CPDPs) are derived from
posterior model probabilities.
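A minimal sketch of the collapsing step just described, reusing the `pdp_at_means` convention from the earlier sketch (other predictors at their means); all model and column names remain illustrative assumptions:

```python
import numpy as np
import pandas as pd

def pdp_2d_at_means(model, X, var1, var2, n_points=15):
    """Two-variable PDP: grid over var1 x var2, remaining predictors at their means."""
    g1 = np.linspace(X[var1].min(), X[var1].max(), n_points)
    g2 = np.linspace(X[var2].min(), X[var2].max(), n_points)
    grid = pd.DataFrame([(a, b) for a in g1 for b in g2], columns=[var1, var2])
    rows = pd.concat([X.mean().to_frame().T] * len(grid), ignore_index=True)
    rows[[var1, var2]] = grid.values
    grid["pdp"] = model.predict_proba(rows[X.columns])[:, 1]
    return grid

def collapse_pdp(pdp2d, var1):
    """Collapsed PDP (CPDP): average the 2-d surface over the levels of the second variable."""
    return pdp2d.groupby(var1, as_index=False)["pdp"].mean()

# Illustrative usage (names are assumptions):
# surf = pdp_2d_at_means(lg_model, X_train, "NO_CLAIMS", "OPTOM_PRESC")
# cpdp = collapse_pdp(surf, "NO_CLAIMS")   # overlay against pdp_at_means(lg_model, X_train, "NO_CLAIMS")
```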
100. Comments for BMI.
In the case of NO_CLAIMS, all cases show overlap of the
collapsed and the original PDPs, except for OPTOM_PRESC. In the GB
case, DURATION brings the probability down slightly because
duration is itself a strong predictor.
LG shows that the presence of OPTOM_PRESC raises the
posterior probability, not so accentuated in GB. The LG model
could benefit from a NO_CLAIMS by OPTOM_PRESC
interaction, or possibly an overall transformation obtained by
expressing the information per month and per number of
members. (LG chart with TOTAL_SPEND omitted for
brevity.)
MEMBER_DURATION shows overlap with all second
variables, plus a declining slope, more evident in the LG
models.
101. Comments for BMI (cont).
It is possible to obtain 3-way and higher PDPs, and also
to collapse them; not tried here.
Given the overlap between the Original PDPs and the CPDPs, the UMI effects are
correct so far, except possibly for the triplet NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC.
105. Collapsed Triad and Tetrad PDPs.
In like manner as 3-d PDPs, it is possible to obtain PDPs for 3 or more
variables, but not (yet) possible to graph them. Collapsed
Triads bypass the problem by obtaining the PDP of a specific three-variable
set (triad); Tetrads do the same for 4-variable sets.
While it is possible to collapse two of them and present the average PDP
compared to the univariate PDPs, the bivariate PDP and the original probability,
it is also possible to collapse the mean PDP along the 4 quartile ranges of the
third variable (in the TRIAD case) or the fourth variable (in the TETRAD case);
we bypass the quartile presentation for brevity.
In addition to mean PDPs, we also include max and min PDPs as an
overlay to provide a view of the probability variation (see the sketch after this slide).
Still working on improving these presentations.
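A sketch of the triad case under the same assumptions as the earlier PDP sketches: evaluate the PDP over a three-variable grid and collapse onto the first variable, keeping mean, min and max to show the probability variation mentioned above:

```python
import itertools
import numpy as np
import pandas as pd

def collapsed_triad_pdp(model, X, triad, n_points=8):
    """Mean/min/max PDP of the first variable in `triad`, collapsing over the other two."""
    grids = [np.linspace(X[v].min(), X[v].max(), n_points) for v in triad]
    combos = pd.DataFrame(list(itertools.product(*grids)), columns=triad)
    rows = pd.concat([X.mean().to_frame().T] * len(combos), ignore_index=True)
    rows[triad] = combos.values
    combos["pdp"] = model.predict_proba(rows[X.columns])[:, 1]
    return combos.groupby(triad[0])["pdp"].agg(["mean", "min", "max"]).reset_index()

# Illustrative usage (names are assumptions):
# collapsed_triad_pdp(gb_model, X_train, ["NO_CLAIMS", "MEMBER_DURATION", "OPTOM_PRESC"])
```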
106. The original probability is not truly well represented by any combination of PDPs in
the GB case. GB searches for a solution over residuals, which is not fully represented here.
112. The residual, iterative nature of GB shows that just 3 variables do not capture the full
interpretation, since the posterior probability remains below and above most of the
data points.
114. Comment on Triads and Tetrads
For the present data set and choice of represented variables,
there was not much difference between TRIADS and
TETRADS, which means that omission of the fourth
variable does not alter our results by much.
Still, we emphasize that it is quite illusory to merely
concentrate on average values. The graphs also contain
max and min values of probability, which show high
variability except for large values of NO_CLAIMS. This is
clear since the event rate was 20% and concentrated on
high values of that variable.
116. The lowest prob. level is represented by the lowest levels of the predictors, except for
member duration.
117. The highest prob. levels are determined by mid-levels of No_claims and a higher total
spend than in bin 1.
118. Conclusions on the multivariate distribution of predictors
along prob. bins.
Since the event rate is 20%, a good model should clump those observations
into the higher prob. bins.
In the lowest bin, the predictor bins are typically the lowest ones, except for
member_duration. Again, this seems to indicate the presence of fraudsters
with short duration who leave immediately, as evinced in bin 5 of the
probability slide.
In that bin, most predictors have observations from higher bins.
Other probability bins are not listed for brevity's sake.
120. Similar coeffs among logistic, GB and beta regressions. Via Beta Regression,
the standard log-odds interpretation is possible, with caveats.
Vars * Models * Coeffs (entries are Coeff / Importance).
Models: (1) M2_TRN_GRAD_BOOSTING, (2) M2_TRN_GRAD_BOOSTING_BETA_REG,
(3) M2_TRN_LOGISTIC_STEPWISE, (4) M2_TRN_LOGISTIC_STEPWISE_BETA_REG,
(5) M2_VAL_GRAD_BOOSTING, (6) M2_VAL_LOGISTIC_STEPWISE.

Variable           (1)      (2)       (3)       (4)              (5)      (6)
MEMBER_DURATION    0.7318   -0.0051   -0.0057   -0.0057          0.7318   -0.0084
DOCTOR_VISITS      0.3925   -0.0061   .         .                0.3925   .
TOTAL_SPEND        0.6610   -0.0000   -0.0000   -0.0000          0.6610   -0.0000
OPTOM_PRESC        0.5944    0.1713    0.2132    0.2132          0.5944    0.1634
NO_CLAIMS          1.0000    0.7027    0.7921    0.7921          1.0000    0.7351
SCALE              .        21.8895   .          2.590291516E16  .        .
INTERCEPT          .        -0.7979   -0.8352   -0.8352          .        -0.3111
121. Beta Regression results
Results of analyzing posterior probabilities (i.e., the
original GB and LG posteriors) via BETA
regression show very similar coefficients and
structures. Beta regression is reassuring but does not provide
additional information.
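A minimal sketch of that check, assuming (a) the posterior probabilities `p_hat` from the original model and the predictor frame `X_train` are available, and (b) a beta-regression implementation such as statsmodels' BetaModel (statsmodels >= 0.13, in statsmodels.othermod.betareg) is installed. Probabilities are clipped away from 0 and 1 because the beta likelihood requires the open interval:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.othermod.betareg import BetaModel   # assumes statsmodels >= 0.13

# p_hat: posterior probabilities from the original GB or LG model (assumed available)
# X_train: the six original predictors (assumed available)
eps = 1e-4
y = np.clip(p_hat, eps, 1 - eps)          # beta likelihood needs 0 < y < 1
exog = sm.add_constant(X_train)

beta_fit = BetaModel(y, exog).fit()
print(beta_fit.summary())                  # coefficients comparable to the table above
```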
123. Possible to say that …
1) Manipulate just NO_CLAIMS and problem
solved?
2) Maybe add MEMBER_DURATION and
OPTOM_PRESC for parts of NO_CLAIMS
range?
3) Maybe add if-then rules from simplified
TREE_REPRESENTATION because easier than
GB and more interactive than LG?
4) If using a Neural Network and NN derivatives, abandon all hope of interpretation?
5) Interpretation needs definition of INTENDED
AUDIENCE (see Tolstoy ut supra).
124. Possible to say that … (cont. 1)
1) The analyst needs to focus on NO_CLAIMS,
MEMBER_DURATION and OPTOM_PRESC as
an ‘IMPORTANT’ group.
2) Different model Interpretations should be
entertained.
3) Different marginal effects must be explained.
125. Final thoughts …
MI analysis must proceed further for large
numbers of predictors, obtaining insights from
collapsing 3+ way PDPs, for instance.
If an 'easier' linear-model explanation is preferred, beta
regression on the posterior probability would provide
regression-like information. Still, beta regression
is not straightforward, and model selection is a big
issue.
127.
Future steps
Focus on MMI
1) Beta regression for ‘linear’ interpretation. More
difficult because it requires model search as well.
Plus, additional error in modeling posterior
probability of original model.
2) Andrews’ curves.
129. LIME: Local Interpretable Model-Agnostic Explanations:
Uses a surrogate interpretable model on the black-box model, applied to observations of
interest. The tree representation in this presentation is similar to this.
(https://homes.cs.washington.edu/~marcotcr/blog/lime/)
ICE: Clusters or a classification variable applied to
PDP results. For a given predictor, ICE plots draw one line per observation,
representing how that instance's prediction changes when the predictor
changes.
Shap Values: Shapley Additive Explanations (Lundberg
et al., 2017): measure the +/- feature contribution to the probability. The technique is
used in game theory to determine each player's contribution to success in a
game. It is affected by correlations among predictors: focusing on just one
predictor to change behavior may change other predictors as well
(available in Python).
AND OTHERS ….
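For completeness, a minimal ICE sketch consistent with the description above (one predicted-probability curve per observation as one predictor varies over its range); the model and data names are assumptions:

```python
import numpy as np
import pandas as pd

def ice_curves(model, X, var, n_points=20, sample=200, random_state=0):
    """One predicted-probability curve per sampled observation as `var` varies."""
    grid = np.linspace(X[var].min(), X[var].max(), n_points)
    obs = X.sample(min(sample, len(X)), random_state=random_state)
    curves = {}
    for i, (_, row) in enumerate(obs.iterrows()):
        tmp = pd.concat([row.to_frame().T] * n_points, ignore_index=True)
        tmp[var] = grid
        curves[i] = model.predict_proba(tmp[X.columns])[:, 1]
    # Rows indexed by grid value, one column per observation;
    # averaging across columns recovers the classical (Friedman) PDP.
    return pd.DataFrame(curves, index=grid)

# Illustrative usage: ice_curves(gb_model, X_train, "NO_CLAIMS").plot(legend=False)
```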
130. References
Lundberg S.M., Lee S.I. (2017): "Consistent feature attribution for tree ensembles", presented at the 2017 ICML Workshop on Human Interpretability in Machine Learning (WHI 2017), Sydney, NSW, Australia. https://arxiv.org/abs/1706.06060
Molnar C. (2018): Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/
Tolstoy L. (1894): The Kingdom of God Is Within You.